public inbox for [email protected]
help / color / mirror / Atom feedRe: Batching in executor
9+ messages / 4 participants
[nested] [flat]
* Re: Batching in executor
@ 2026-01-26 09:34 Daniil Davydov <[email protected]>
0 siblings, 1 reply; 9+ messages in thread
From: Daniil Davydov @ 2026-01-26 09:34 UTC (permalink / raw)
To: cca5507 <[email protected]>; +Cc: Amit Langote <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
Hi,
On Mon, Dec 22, 2025 at 6:46 PM cca5507 <[email protected]> wrote:
>
> Some comments for v4:
>
Agree with your (1)-(4) comments.
> 5) heapgettup_pagemode_batch()
> If the scan key filters out all tuples on a page, we may return 0 before reaching the end of scan, right?
>
Yes. I think that we should advance to the next page if "nout == 0"
at the end of walking through the rs_vistuples.
> 6) heap_begin_batch()
> ```
> hb = palloc(sizeof(HeapBatch));
> hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
> ```
> Can we just use one palloc() for cache-friendly?
>
Actually, we are using memory context when calling the palloc function.
I.e. in the general case it will not cause memory allocation. But of course
there is no guarantee for it. I saw a lot of places in the code where we
are calling the palloc function several times in a row, so I guess that
this is OK.
If you will decide to leave these palloc calls, I suggest using the
palloc_object/palloc_array functions.
A few other comments on 0001 patch:
1)
+ void *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
Is it syntactically correct?
2)
/* Initialize static fields of HeapTupleData. Row bodies remain on page. */
relid = RelationGetRelid(sscan->rs_rd);
for (int i = 0; i < maxitems; i++)
hb->tupdata[i].t_tableOid = relid;
Is it really necessary? I see that we are setting this field inside the
heapgettup_pagemode_batch function.
A few comment on 0002 patch:
1)
I guess that you should rebase your patches on the current master, because
the second patch doesn't apply.
2)
Maybe we can use tuplestore for tuples stored in TupleBatch? It is just a
proposal - I didn't check this idea carefully.
--
Best regards,
Daniil Davydov
^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Batching in executor
@ 2026-01-27 03:00 Amit Langote <[email protected]>
parent: Daniil Davydov <[email protected]>
0 siblings, 1 reply; 9+ messages in thread
From: Amit Langote @ 2026-01-27 03:00 UTC (permalink / raw)
To: Daniil Davydov <[email protected]>; +Cc: cca5507 <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
Hi,
On Mon, Jan 26, 2026 at 6:34 PM Daniil Davydov <[email protected]> wrote:
>
> Hi,
>
> On Mon, Dec 22, 2025 at 6:46 PM cca5507 <[email protected]> wrote:
> >
> > Some comments for v4:
> >
>
> Agree with your (1)-(4) comments.
>
> > 5) heapgettup_pagemode_batch()
> > If the scan key filters out all tuples on a page, we may return 0 before reaching the end of scan, right?
> >
>
> Yes. I think that we should advance to the next page if "nout == 0"
> at the end of walking through the rs_vistuples.
Next version (v5) does it like that.
> > 6) heap_begin_batch()
> > ```
> > hb = palloc(sizeof(HeapBatch));
> > hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
> > ```
> > Can we just use one palloc() for cache-friendly?
> >
>
> Actually, we are using memory context when calling the palloc function.
> I.e. in the general case it will not cause memory allocation. But of course
> there is no guarantee for it. I saw a lot of places in the code where we
> are calling the palloc function several times in a row, so I guess that
> this is OK.
>
> If you will decide to leave these palloc calls, I suggest using the
> palloc_object/palloc_array functions.
I think combining those individual pallocs into one is a good idea, so
v5 does it like that.
> A few other comments on 0001 patch:
>
> 1)
> + void *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
> Is it syntactically correct?
Yes, it compiles fine. Though I'm considering changing the return type
to a struct with common fields (like nitems) so callers can access
them directly without callback indirection. Maybe call it TAMBatch or
something.
> 2)
> /* Initialize static fields of HeapTupleData. Row bodies remain on page. */
> relid = RelationGetRelid(sscan->rs_rd);
> for (int i = 0; i < maxitems; i++)
> hb->tupdata[i].t_tableOid = relid;
>
> Is it really necessary? I see that we are setting this field inside the
> heapgettup_pagemode_batch function.
It's intentional -- by initializing t_tableOid once in
heap_begin_batch, we can avoid setting it repeatedly for every tuple
in heapgettup_pagemode_batch(). Though you are correct to point out
the redundant assignment in heapgettup_pagemode_batch(); I'll change
it to an Assert instead. The relid doesn't change during the scan.
> A few comment on 0002 patch:
>
> 1)
> I guess that you should rebase your patches on the current master, because
> the second patch doesn't apply.
Yep, will do.
> 2)
> Maybe we can use tuplestore for tuples stored in TupleBatch? It is just a
> proposal - I didn't check this idea carefully.
TupleBatch is designed to be lightweight -- it holds an array of
TupleTableSlot pointers, not the tuple data itself. The slots
reference tuples that remain in the AM's buffer (no copy). Using
tuplestore would require materializing tuples, adding overhead we're
trying to avoid.
--
Thanks, Amit Langote
^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Batching in executor
@ 2026-01-29 07:35 Amit Langote <[email protected]>
parent: Amit Langote <[email protected]>
0 siblings, 2 replies; 9+ messages in thread
From: Amit Langote @ 2026-01-29 07:35 UTC (permalink / raw)
To: Daniil Davydov <[email protected]>; +Cc: cca5507 <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
Hi,
Here is v5 of the patch series.
Patches 0001-0003 add the core batching infrastructure. 0001 adds the
batch table AM API with heapam implementation, 0002 wires up SeqScan
to use it (still returning one slot at a time), and 0003 adds EXPLAIN
(BATCHES). I'd love to hear people's thoughts around TupleBatch
structure added in 0002. I thought about making it a separate patch so
that 0002 will still populate the single ScanState.ss_scanTupleSlot,
but that means we'd still have to call the TAM callback to populate
the tuple in the TAM's batch struct into the slot, defeating the whole
point. With TupleBatch, you have executor_batch_rows number of slots
which are filled in one TAM callback (materialize_all) call. So I
decided to keep the TupleBatch and related things in 0002.
For scans without quals, batching shows 20-30% improvement with no
visible regressions when batching is disabled (batch_rows=0):
SELECT * FROM t LIMIT n (no qual)
Rows Master batch=0 %diff batch=64 %diff
------ -------- ------- ----- -------- -----
1M 12.42 ms 11.96 ms 3.7% 8.56 ms 31.0%
3M 38.95 ms 38.92 ms 0.1% 28.59 ms 26.6%
10M 153.64 ms 150.28 ms 2.2% 112.95 ms 26.5%
(%diff: positive = faster than master, negative = slower)
Patches 0004-0005 add batched qual evaluation and are more
experimental (see below on why 0005 exists). For quals referencing
early columns, the improvement is significant:
SELECT * FROM t WHERE a = 0 ... OFFSET n (qual on 1st col)
Rows Master batch=64 %diff
------ -------- -------- -----
1M 30.19 ms 15.55 ms 48.5%
3M 92.47 ms 50.01 ms 45.9%
10M 325.58 ms 211.83 ms 34.9%
However, for quals on later columns (e.g., 15th), batching provides no
benefit - deformation dominates and batching doesn't help:
SELECT * FROM t WHERE o = 0 ... OFFSET n (qual on 15th col)
Rows Master batch=64 %diff
------ -------- -------- -----
1M 44.14 ms 44.56 ms -0.9%
3M 133.89 ms 137.77 ms -2.9%
10M 503.33 ms 528.88 ms -5.1%
I don't have a satisfactory explanation for why batching doesn't help
the deform-heavy case at all. One would expect at least some benefit
from reduced per-tuple overhead, but that's not materializing.
I've also been struggling to understand why 0004 affects the per-tuple
path even when batch_rows=0. For quals with 0% selectivity (all rows
fail the qual), perf shows ExecInterpExpr is noticeably hotter with
the patched code compared to master, even though batching is disabled:
SELECT * FROM t WHERE a = 0 ... OFFSET n (0% selectivity)
Rows Master batch=0 %diff batch=64 %diff
------ -------- ------- ----- -------- -----
1M 24.37 ms 28.67 ms -17.6% 12.46 ms 48.9%
3M 73.95 ms 85.07 ms -15.0% 41.64 ms 43.7%
10M 287.63 ms 316.81 ms -10.1% 188.01 ms 34.6%
Compare that to 100% selectivity (all rows pass), where there's no regression:
SELECT * FROM t WHERE a > 0 ... OFFSET n (100% selectivity)
Rows Master batch=0 %diff batch=64 %diff
------ -------- ------- ----- -------- -----
1M 29.44 ms 29.10 ms 1.2% 16.61 ms 43.6%
3M 91.22 ms 90.28 ms 1.0% 54.10 ms 40.7%
10M 360.77 ms 331.25 ms 8.2% 224.00 ms 37.9%
I tried moving batch opcodes to a separate interpreter (0005) thinking
it might be register pressure or jump table effects from adding cases
to ExecInterpExpr's switch. With 0005, the generated assembly for
ExecInterpExpr looks identical to master (same stack frame size, same
epilogue), yet the performance still differs. Specifically, the ldp
instruction in the function epilogue shows 53% hotness in patched vs
35% in master. We still need placeholder entries in the dispatch
table, so it's unclear if this fully isolates the per-tuple path. I'll
continue looking at perf, but I feel like at a bit of a loss here and
would appreciate any insights.
Other changes worth noting:
- I removed the BatchVector intermediate representation that copied
Datums into columnar arrays before qual evaluation (it used to be in
the batched qual patch 0004). Now quals access batch slots' tts_values
directly. This simplifies the code and the copy overhead wasn't paying
off. If we pursue serious vectorization later, this may need to be
revisited, but removing it doesn't degrade performance.
--
Thanks, Amit Langote
Attachments:
[application/octet-stream] v5-0001-Add-batch-table-AM-API-and-heapam-implementation.patch (13.0K, 2-v5-0001-Add-batch-table-AM-API-and-heapam-implementation.patch)
download | inline diff:
From f772043e2104bf67964418dc80c3abb56bdb069d Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 29 Jan 2026 00:57:04 +0900
Subject: [PATCH v5 1/5] Add batch table AM API and heapam implementation
Introduce new table AM callbacks to fetch multiple tuples per call.
This reduces per-tuple call overhead by letting executor nodes work
in batches.
Define a HeapBatch structure and supporting code in tableam.h.
Batches are limited to tuples from a single page and at most
EXEC_BATCH_ROWS (currently 64) entries.
Provide initial heapam support with heapgettup_pagemode_batch().
No executor node is switched over yet; a later commit will adapt
SeqScan to use this API. Other nodes may adopt it in the future.
Also add pgstat_count_heap_getnext_batch() to record batched fetches
in pgstat.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 221 +++++++++++++++++++++++
src/backend/access/heap/heapam_handler.c | 4 +
src/include/access/heapam.h | 18 ++
src/include/access/tableam.h | 58 ++++++
src/include/pgstat.h | 5 +
5 files changed, 306 insertions(+)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f30a56ecf55..d8d1bdf5191 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1151,6 +1151,134 @@ continue_page:
scan->rs_inited = false;
}
+/*
+ * heapgettup_pagemode_batch
+ * Collect up to 'maxitems' visible tuples from a single page in page mode.
+ *
+ * This function returns a *batch* of tuples from one heap page. If the
+ * current page (as tracked by the scan desc) has no more tuples left,
+ * it will advance to the next page and prepare it (via heap_prepare_pagescan).
+ * It will not cross a page boundary while filling the batch.
+ *
+ * Return value:
+ * number of tuples written into 'tdata' (0 at end-of-scan).
+ *
+ * Side effects:
+ * - Ensures rs_cbuf pins the page from which tuples were produced.
+ * - Sets rs_cblock, rs_cindex, rs_ntuples consistently (same as
+ * heapgettup_pagemode’s inner-loop effects).
+ * - Does *not* change buffer pin counts except through normal page
+ * transitions performed by heap_fetch_next_buffer().
+ */
+static int
+heapgettup_pagemode_batch(HeapScanDesc scan,
+ ScanDirection dir,
+ int nkeys, ScanKey key,
+ HeapTupleData *tdata,
+ int maxitems)
+{
+ Page page;
+ uint32 lineindex;
+ uint32 linesleft;
+ int nout = 0;
+ Relation rel = scan->rs_base.rs_rd;
+ TupleDesc tupdesc = RelationGetDescr(rel);
+
+ /*
+ * Current batching limitations (may be relaxed in future):
+ *
+ * - Forward scans only: backward scan support would require changes to
+ * batch iteration and page advancement logic.
+ *
+ * - Pagemode required: batching relies on the pre-built rs_vistuples[]
+ * array from heap_prepare_pagescan(). This is guaranteed by
+ * ScanCanUseBatching() which only enables batching when SO_ALLOW_PAGEMODE
+ * is set. Unlike heap_getnextslot, we don't support dynamic fallback to
+ * tuple-at-a-time mode since the batch execution path is selected at
+ * ExecInit time.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(maxitems > 0);
+
+ /*
+ * Loop until we find tuples that pass the scan key, or reach end of scan.
+ * We never cross page boundaries within a single batch.
+ */
+ for (;;)
+ {
+ /*
+ * Advance to a page with visible tuples if needed.
+ */
+ if (BufferIsValid(scan->rs_cbuf))
+ {
+ lineindex = scan->rs_cindex + 1;
+ linesleft = (lineindex <= scan->rs_ntuples) ?
+ (scan->rs_ntuples - lineindex) : 0;
+ }
+ else
+ linesleft = 0;
+
+ while (linesleft == 0)
+ {
+ heap_fetch_next_buffer(scan, dir);
+
+ if (!BufferIsValid(scan->rs_cbuf))
+ {
+ /* End of scan */
+ scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_prefetch_block = InvalidBlockNumber;
+ scan->rs_inited = false;
+ return 0;
+ }
+
+ Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+ heap_prepare_pagescan((TableScanDesc) scan);
+
+ lineindex = 0;
+ linesleft = scan->rs_ntuples;
+ }
+
+ /*
+ * Walk rs_vistuples[] copying headers into tdata[] until the page
+ * is exhausted or batch capacity is reached.
+ */
+ page = BufferGetPage(scan->rs_cbuf);
+
+ for (; linesleft > 0 && nout < maxitems; linesleft--, lineindex++)
+ {
+ OffsetNumber lineoff;
+ ItemId lpp;
+ HeapTupleData *dst = &tdata[nout];
+
+ Assert(lineindex < scan->rs_ntuples);
+ lineoff = scan->rs_vistuples[lineindex];
+ lpp = PageGetItemId(page, lineoff);
+ Assert(ItemIdIsNormal(lpp));
+
+ dst->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+ dst->t_len = ItemIdGetLength(lpp);
+ Assert(dst->t_tableOid == RelationGetRelid(rel));
+ ItemPointerSet(&(dst->t_self), scan->rs_cblock, lineoff);
+
+ if (key != NULL && !HeapKeyTest(dst, tupdesc, nkeys, key))
+ continue;
+
+ scan->rs_cindex = lineindex;
+ nout++;
+ }
+
+ /* Return if we found any tuples; otherwise try next page */
+ if (nout > 0)
+ return nout;
+
+ /* Mark page exhausted so we advance on next iteration */
+ scan->rs_cindex = scan->rs_ntuples;
+ }
+
+ pg_unreachable();
+ return 0;
+}
/* ----------------------------------------------------------------
* heap access method interface
@@ -1483,6 +1611,99 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
return true;
}
+/*---------- Batching support -----------*/
+
+/*
+ * heap_scan_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ HeapBatch *hb;
+ Oid relid;
+ Size alloc_size;
+
+ Assert(maxitems > 0);
+
+ /* Single allocation for HeapBatch header + tupdata array */
+ alloc_size = sizeof(HeapBatch) + sizeof(HeapTupleData) * maxitems;
+ hb = palloc(alloc_size);
+ hb->tupdata = (HeapTupleData *) ((char *) hb + sizeof(HeapBatch));
+ hb->maxitems = maxitems;
+ hb->nitems = 0;
+ hb->buf = InvalidBuffer;
+
+ /* Initialize static fields of HeapTupleData. Row bodies remain on page. */
+ relid = RelationGetRelid(sscan->rs_rd);
+ for (int i = 0; i < maxitems; i++)
+ hb->tupdata[i].t_tableOid = relid;
+
+ return hb;
+}
+
+/*
+ * heap_scan_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ if (BufferIsValid(hb->buf))
+ ReleaseBuffer(hb->buf);
+
+ pfree(hb);
+}
+
+int
+heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ HeapScanDesc scan = (HeapScanDesc) sscan;
+ HeapBatch *hb = (HeapBatch *) am_batch;
+ Buffer curbuf;
+ int n;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(hb->maxitems > 0);
+
+ /* Drop prior batch pin, if any. */
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+
+ hb->nitems = 0;
+
+ /* One call per batch, never crosses a page. */
+ n = heapgettup_pagemode_batch(scan, dir,
+ sscan->rs_nkeys, sscan->rs_key,
+ hb->tupdata, hb->maxitems);
+
+ if (n == 0)
+ return 0; /* end of scan */
+
+ /* Hold a shared pin for the batch lifetime so t_data stays valid. */
+ curbuf = scan->rs_cbuf;
+ IncrBufferRefCount(curbuf);
+ hb->buf = curbuf;
+
+ /* Per-tuple stats (can be collapsed into a future _multi() call). */
+ pgstat_count_heap_getnext_batch(sscan->rs_rd, n);
+
+ hb->nitems = n;
+ return n;
+}
+
+/*----- End of batching support -----*/
+
void
heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cbef73e5d4b..e4cf7fc296b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2637,6 +2637,10 @@ static const TableAmRoutine heapam_methods = {
.scan_rescan = heap_rescan,
.scan_getnextslot = heap_getnextslot,
+ .scan_begin_batch = heap_begin_batch,
+ .scan_getnextbatch = heap_getnextbatch,
+ .scan_end_batch = heap_end_batch,
+
.scan_set_tidrange = heap_set_tidrange,
.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3c0961ab36b..e2417650c5f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -101,6 +101,19 @@ typedef struct HeapScanDescData
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
+/*
+ * HeapBatch -- stateless per-batch buffer. A batch pins one page and
+ * exposes up to maxitems HeapTupleData headers whose t_data point into that
+ * page.
+ */
+typedef struct HeapBatch
+{
+ HeapTupleData *tupdata; /* len = maxitems; headers only */
+ int nitems; /* tuples produced in last getnextbatch() */
+ int maxitems; /* fixed capacity set at begin_batch() */
+ Buffer buf; /* single pinned buffer for this batch */
+} HeapBatch;
+
typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
@@ -337,6 +350,11 @@ extern void heap_endscan(TableScanDesc sscan);
extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
extern bool heap_getnextslot(TableScanDesc sscan,
ScanDirection direction, TupleTableSlot *slot);
+
+extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
+extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
+extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e2ec5289d4d..584b580f7a1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -351,6 +351,16 @@ typedef struct TableAmRoutine
ScanDirection direction,
TupleTableSlot *slot);
+ /* ------------------------------------------------------------------------
+ * Batched scan support
+ * ------------------------------------------------------------------------
+ */
+
+ void *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+ int (*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+ ScanDirection dir);
+ void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -1036,6 +1046,54 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_scan_begin_batch
+ * Allocate AM-owned batch payload with capacity 'maxitems'.
+ */
+static inline void *
+table_scan_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_begin_batch != NULL);
+
+ return tam->scan_begin_batch(sscan, maxitems);
+}
+
+/*
+ * table_scan_getnextbatch
+ * Fill next batch from the AM. Returns number of tuples, 0 => EOS.
+ * Batches are single-page in v1. Direction is forward only in v1.
+ */
+static inline int
+table_scan_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ /* Only forward scans are supported in the batched mode. */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(tam->scan_getnextbatch != NULL);
+
+ return tam->scan_getnextbatch(sscan, am_batch, dir);
+}
+
+/*
+ * table_scan_end_batch
+ * Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ if (am_batch == NULL)
+ return;
+
+ Assert(tam->scan_end_batch != NULL);
+
+ tam->scan_end_batch(sscan, am_batch);
+}
+
/* ----------------------------------------------------------------------------
* TID Range scanning related functions.
* ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff7ecc2533..48e4e034a33 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -697,6 +697,11 @@ extern void pgstat_report_analyze(Relation rel,
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.tuples_returned++; \
} while (0)
+#define pgstat_count_heap_getnext_batch(rel, n) \
+ do { \
+ if (pgstat_should_count_relation(rel)) \
+ (rel)->pgstat_info->counts.tuples_returned += n; \
+ } while (0)
#define pgstat_count_heap_fetch(rel) \
do { \
if (pgstat_should_count_relation(rel)) \
--
2.47.3
[application/octet-stream] v5-0002-SeqScan-add-batch-driven-variants-returning-slots.patch (27.6K, 3-v5-0002-SeqScan-add-batch-driven-variants-returning-slots.patch)
download | inline diff:
From 94d0f92c807895e6edadf583a06bb39c5dc52a4c Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 27 Jan 2026 14:07:55 +0900
Subject: [PATCH v5 2/5] SeqScan: add batch-driven variants returning slots
Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per tuple AM crossings without
changing the node interface seen by parents.
Add TupleBatch and supporting code in execBatch.c/h to hold executor
side batching state. PlanState gains ps_Batch to carry the active
TupleBatch when a node supports batching.
Add executor_batch_rows GUC to specify the maximum number of rows
that can be added into a batch.
Wire up runtime selection in ExecInitSeqScan using
ScanCanUseBatching(). When executor_batch_rows > 1, EPQ is
inactive, the scan is not backward, and the relation supports
batching, ps.ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used.
Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled and allowed.
Notes / current limits:
- With the current heapam, batches are composed from a single page, so
the batch may not always be full. Future work may let SeqScan and/or
AMs top up batches across pages when safe to do so.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 28 ++++
src/backend/access/heap/heapam_handler.c | 16 ++
src/backend/access/table/tableam.c | 11 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execBatch.c | 112 ++++++++++++++
src/backend/executor/execScan.c | 31 ++++
src/backend/executor/meson.build | 1 +
src/backend/executor/nodeSeqscan.c | 176 +++++++++++++++++++++-
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_parameters.dat | 9 ++
src/include/access/heapam.h | 1 +
src/include/access/tableam.h | 27 ++++
src/include/executor/execBatch.h | 99 ++++++++++++
src/include/executor/execScan.h | 69 +++++++++
src/include/executor/executor.h | 4 +
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 4 +
17 files changed, 592 insertions(+), 1 deletion(-)
create mode 100644 src/backend/executor/execBatch.c
create mode 100644 src/include/executor/execBatch.h
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d8d1bdf5191..db91085b07c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1644,6 +1644,34 @@ heap_begin_batch(TableScanDesc sscan, int maxitems)
return hb;
}
+/*
+ * heap_scan_materialize_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ Assert(n <= hb->nitems);
+
+ for (int i = 0; i < n; i++)
+ {
+ HeapTupleData *tuple = &hb->tupdata[i];
+ HeapTupleTableSlot *slot = (HeapTupleTableSlot *) slots[i];
+
+ /* Inline of ExecStoreHeapTuple(tuple, slot, false) */
+ slot->tuple = tuple;
+ slot->off = 0;
+ slot->base.tts_nvalid = 0;
+ slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
+ slot->base.tts_tid = tuple->t_self;
+ slot->base.tts_tableOid = tuple->t_tableOid;
+ }
+}
+
/*
* heap_scan_end_batch
*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e4cf7fc296b..0f6bda7b69f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -72,6 +72,21 @@ heapam_slot_callbacks(Relation relation)
return &TTSOpsBufferHeapTuple;
}
+/* ------------------------------------------------------------------------
+ * TupleBatch related callbacks for heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static const TupleBatchOps TupleBatchHeapOps =
+{
+ .materialize_all = heap_materialize_batch_all
+};
+
+static const TupleBatchOps *
+heapam_batch_callbacks(Relation relation)
+{
+ return &TupleBatchHeapOps;
+}
/* ------------------------------------------------------------------------
* Index Scan Callbacks for heap AM
@@ -2631,6 +2646,7 @@ static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
.slot_callbacks = heapam_slot_callbacks,
+ .batch_callbacks = heapam_batch_callbacks,
.scan_begin = heap_beginscan,
.scan_end = heap_endscan,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 87491796523..ffb3b738f6a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -103,6 +103,17 @@ table_slot_create(Relation relation, List **reglist)
return slot;
}
+/* ----------------------------------------------------------------------------
+ * TupleBatch support routines
+ * ----------------------------------------------------------------------------
+ */
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+ if (relation->rd_tableam)
+ return relation->rd_tableam->batch_callbacks(relation);
+ elog(ERROR, "relation does not support TupleBatch operations");
+}
/* ----------------------------------------------------------------------------
* Table scan functions.
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..3e72f3fe03c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
execAsync.o \
+ execBatch.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
new file mode 100644
index 00000000000..1ef4117b87c
--- /dev/null
+++ b/src/backend/executor/execBatch.c
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.c
+ * Helpers for TupleBatch
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "executor/execBatch.h"
+
+/*
+ * TupleBatchCreate
+ * Allocate and initialize a new TupleBatch envelope.
+ */
+TupleBatch *
+TupleBatchCreate(TupleDesc scandesc, int capacity)
+{
+ TupleBatch *b;
+ TupleTableSlot **inslots,
+ **outslots;
+ Size alloc_size;
+
+ /* Single allocation for TupleBatch + inslots + outslots arrays */
+ alloc_size = sizeof(TupleBatch) + 2 * sizeof(TupleTableSlot *) * capacity;
+ b = palloc(alloc_size);
+ inslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch));
+ outslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch) +
+ sizeof(TupleTableSlot *) * capacity);
+
+ for (int i = 0; i < capacity; i++)
+ inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);
+
+ /* Initial state: empty envelope */
+ b->am_payload = NULL;
+ b->ntuples = 0;
+ b->inslots = inslots;
+ b->outslots = outslots;
+ b->activeslots = NULL;
+ b->maxslots = capacity;
+
+ b->nvalid = 0;
+ b->next = 0;
+
+ return b;
+}
+
+/*
+ * TupleBatchReset
+ * Reset an existing TupleBatch envelope to empty.
+ */
+void
+TupleBatchReset(TupleBatch *b, bool drop_slots)
+{
+ Assert(b != NULL);
+
+ for (int i = 0; i < b->maxslots; i++)
+ {
+ ExecClearTuple(b->inslots[i]);
+ if (drop_slots)
+ ExecDropSingleTupleTableSlot(b->inslots[i]);
+ }
+
+ b->ntuples = 0;
+ b->nvalid = 0;
+ b->next = 0;
+ b->activeslots = NULL;
+}
+
+void
+TupleBatchUseInput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->inslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+void
+TupleBatchUseOutput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->outslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+bool
+TupleBatchIsValid(TupleBatch *b)
+{
+ return b != NULL &&
+ b->maxslots > 0 &&
+ b->inslots != NULL &&
+ b->outslots != NULL;
+}
+
+void
+TupleBatchRewind(TupleBatch *b)
+{
+ b->next = 0;
+}
+
+int
+TupleBatchGetNumValid(TupleBatch *b)
+{
+ return b->nvalid;
+}
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 9f68be17b99..5023eb6756a 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -18,6 +18,7 @@
*/
#include "postgres.h"
+#include "access/tableam.h"
#include "executor/executor.h"
#include "executor/execScan.h"
#include "miscadmin.h"
@@ -154,3 +155,33 @@ ExecScanReScan(ScanState *node)
}
}
}
+
+bool
+ScanCanUseBatching(ScanState *scanstate, int eflags)
+{
+ Relation relation = scanstate->ss_currentRelation;
+
+ return executor_batch_rows > 1 &&
+ (scanstate->ps.state->es_epq_active == NULL) &&
+ !(eflags & EXEC_FLAG_BACKWARD) &&
+ relation && table_supports_batching(relation);
+}
+
+void
+ScanResetBatching(ScanState *scanstate, bool drop)
+{
+ TupleBatch *b = scanstate->ps.ps_Batch;
+
+ if (b)
+ {
+ TupleBatchReset(b, drop);
+ if (b->am_payload)
+ {
+ table_scan_end_batch(scanstate->ss_currentScanDesc,
+ b->am_payload);
+ b->am_payload = NULL;
+ }
+ if (drop)
+ pfree(b);
+ }
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index dc45be0b2ce..e5af90e3a0f 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'execAmi.c',
'execAsync.c',
+ 'execBatch.c',
'execCurrent.c',
'execExpr.c',
'execExprInterp.c',
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index af3c788ce8b..08d93e6f0be 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -203,6 +203,171 @@ ExecSeqScanEPQ(PlanState *pstate)
(ExecScanRecheckMtd) SeqRecheck);
}
+/* ----------------------------------------------------------------
+ * Batch Support
+ * ----------------------------------------------------------------
+ */
+static bool
+SeqNextBatch(SeqScanState *node)
+{
+ TableScanDesc scandesc;
+ EState *estate;
+ ScanDirection direction;
+
+ Assert(node->ss.ps.ps_Batch != NULL);
+
+ /*
+ * get information from the estate and scan state
+ */
+ scandesc = node->ss.ss_currentScanDesc;
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ Assert(ScanDirectionIsForward(direction));
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the scan is not parallel, or if we're serially
+ * executing a scan that was planned to be parallel.
+ */
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ node->ss.ss_currentScanDesc = scandesc;
+ }
+
+ /* Lazily create the AM batch payload. */
+ if (node->ss.ps.ps_Batch->am_payload == NULL)
+ {
+ const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+ Assert(tam && tam->scan_begin_batch);
+ node->ss.ps.ps_Batch->am_payload =
+ table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+ node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+ }
+
+ node->ss.ps.ps_Batch->ntuples =
+ table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
+ node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
+ node->ss.ps.ps_Batch->materialized = false;
+
+ return node->ss.ps.ps_Batch->ntuples > 0;
+}
+
+static bool
+SeqNextBatchMaterialize(SeqScanState *node)
+{
+ if (SeqNextBatch(node))
+ {
+ TupleBatchMaterializeAll(node->ss.ps.ps_Batch);
+ return true;
+ }
+
+ return false;
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtended().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScan() but when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScan() but when qual evaluation and projection are
+ * required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, pstate->ps_ProjInfo);
+}
+
+/* Batch SeqScan enablement and dispatch */
+static void
+SeqScanInitBatching(SeqScanState *scanstate, int eflags)
+{
+ const int cap = executor_batch_rows;
+ TupleDesc scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+
+ scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+
+ /* Choose batch variant to preserve your specialization matrix */
+ if (scanstate->ss.ps.qual == NULL)
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+ }
+ }
+ else
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecInitSeqScan
* ----------------------------------------------------------------
@@ -211,6 +376,7 @@ SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
SeqScanState *scanstate;
+ bool use_batching;
/*
* Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -241,9 +407,12 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
node->scan.scanrelid,
eflags);
+ use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
+
/* and create slot with the appropriate rowtype */
ExecInitScanTupleSlot(estate, &scanstate->ss,
RelationGetDescr(scanstate->ss.ss_currentRelation),
+ use_batching ? &TTSOpsHeapTuple :
table_slot_callbacks(scanstate->ss.ss_currentRelation));
/*
@@ -280,6 +449,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
+ if (use_batching)
+ SeqScanInitBatching(scanstate, eflags);
+
return scanstate;
}
@@ -299,6 +471,8 @@ ExecEndSeqScan(SeqScanState *node)
*/
scanDesc = node->ss.ss_currentScanDesc;
+ ScanResetBatching(&node->ss, true);
+
/*
* close heap scan
*/
@@ -327,7 +501,7 @@ ExecReScanSeqScan(SeqScanState *node)
if (scan != NULL)
table_rescan(scan, /* scan desc */
NULL); /* new scan keys */
-
+ ScanResetBatching(&node->ss, false);
ExecScanReScan((ScanState *) node);
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 36ad708b360..535e29d7823 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int notify_buffers = 16;
int serializable_buffers = 32;
int subtransaction_buffers = 0;
int transaction_buffers = 0;
+
+/* executor batching */
+int executor_batch_rows = 64;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index f0260e6e412..4c422c854d0 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1004,6 +1004,15 @@
boot_val => 'true',
},
+{ name => 'executor_batch_rows', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+ short_desc => 'Number of rows to include in batches during execution.',
+ flags => 'GUC_NOT_IN_SAMPLE',
+ variable => 'executor_batch_rows',
+ boot_val => '64',
+ min => '0',
+ max => '1024',
+},
+
{ name => 'exit_on_error', type => 'bool', context => 'PGC_USERSET', group => 'ERROR_HANDLING_OPTIONS',
short_desc => 'Terminate session on any error.',
variable => 'ExitOnAnyError',
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e2417650c5f..d6154d5ab15 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -354,6 +354,7 @@ extern bool heap_getnextslot(TableScanDesc sscan,
extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+extern void heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n);
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 584b580f7a1..bdf733c8b22 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "commands/vacuum.h"
+#include "executor/execBatch.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -39,6 +40,7 @@ typedef struct BulkInsertStateData BulkInsertStateData;
typedef struct IndexInfo IndexInfo;
typedef struct SampleScanState SampleScanState;
typedef struct ValidateIndexState ValidateIndexState;
+typedef struct TupleBatchOps TupleBatchOps;
/*
* Bitmask values for the flags argument to the scan_begin callback.
@@ -301,6 +303,7 @@ typedef struct TableAmRoutine
* Return slot implementation suitable for storing a tuple of this AM.
*/
const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
+ const TupleBatchOps *(*batch_callbacks)(Relation rel);
/* ------------------------------------------------------------------------
@@ -361,6 +364,7 @@ typedef struct TableAmRoutine
ScanDirection dir);
void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -872,6 +876,16 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
*/
extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
+/* ----------------------------------------------------------------------------
+ * TupleBatch functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Returns callbacks for manipulating TupleBatch for tuples of the given
+ * relation.
+ */
+extern const TupleBatchOps *table_batch_callbacks(Relation relation);
/* ----------------------------------------------------------------------------
* Table scan functions.
@@ -1046,6 +1060,18 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_supports_batching
+ * Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation)
+{
+ const TableAmRoutine *tam = relation->rd_tableam;
+
+ return tam->scan_getnextbatch != NULL;
+}
+
/*
* table_scan_begin_batch
* Allocate AM-owned batch payload with capacity 'maxitems'.
@@ -2128,5 +2154,6 @@ extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
*/
extern const TableAmRoutine *GetHeapamTableAmRoutine(void);
+extern struct TupleBatchOps *GetHeapamTupleBatchOps(void);
#endif /* TABLEAM_H */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
new file mode 100644
index 00000000000..2d0066103ce
--- /dev/null
+++ b/src/include/executor/execBatch.h
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.h
+ * Executor batch envelope for passing tuple batch state upward
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECBATCH_H
+#define EXECBATCH_H
+
+#include "executor/tuptable.h"
+
+/*
+ * TupleBatchOps -- AM-specific helpers for lazy materialization.
+ */
+typedef struct TupleBatchOps
+{
+ void (*materialize_all)(void *am_payload,
+ TupleTableSlot **dst,
+ int maxslots);
+} TupleBatchOps;
+
+/*
+ * TupleBatch
+ *
+ * Envelope for a batch of tuples produced by a plan node (e.g., SeqScan) per
+ * call to a batch variant of ExecSeqScan().
+ */
+typedef struct TupleBatch
+{
+ void *am_payload;
+ const TupleBatchOps *ops;
+ int ntuples; /* number of tuples in am_payload */
+ bool materialized; /* tuples in slots valid? */
+ struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
+ struct TupleTableSlot **outslots; /* slots for tuples going "out of"
+ * batch */
+ struct TupleTableSlot **activeslots;
+ int maxslots;
+
+ int nvalid; /* number of returnable tuples in outslots */
+ int next; /* 0-based index of next tuple to be returned */
+} TupleBatch;
+
+
+/* Helpers */
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
+extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
+extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
+extern bool TupleBatchIsValid(TupleBatch *b);
+extern void TupleBatchRewind(TupleBatch *b);
+extern int TupleBatchGetNumValid(TupleBatch *b);
+
+static inline TupleTableSlot *
+TupleBatchGetNextSlot(TupleBatch *b)
+{
+ return b->next < b->nvalid ? b->activeslots[b->next++] : NULL;
+}
+
+static inline TupleTableSlot *
+TupleBatchGetSlot(TupleBatch *b, int index)
+{
+ Assert(index < b->nvalid);
+ return b->activeslots[index];
+}
+
+static inline void
+TupleBatchStoreInOut(TupleBatch *b, int index, TupleTableSlot *out)
+{
+ Assert(TupleBatchIsValid(b));
+ b->outslots[index] = out;
+}
+
+static inline bool
+TupleBatchHasMore(TupleBatch *b)
+{
+ return b->activeslots && b->next < b->nvalid;
+}
+
+static inline void
+TupleBatchMaterializeAll(TupleBatch *b)
+{
+ if (b->materialized)
+ return;
+
+ if (b->ops == NULL || b->ops->materialize_all == NULL)
+ elog(ERROR, "TupleBatch has no slots and no materialize_all op");
+
+ b->ops->materialize_all(b->am_payload, b->inslots, b->ntuples);
+ TupleBatchUseInput(b, b->ntuples);
+}
+
+#endif /* EXECBATCH_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index 028edb8d9fd..d9185331e22 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -251,4 +251,73 @@ ExecScanExtended(ScanState *node,
}
}
+/*
+ * ExecScanExtendedBatchSlot
+ * Batch-driven variant of ExecScanExtended.
+ *
+ * Returns one tuple at a time to callers, but internally fetches tuples
+ * in batches from the AM via accessBatchMtd. This reduces per-tuple AM
+ * call overhead while preserving the single-slot interface expected by
+ * parent nodes.
+ *
+ * The batch is refilled when exhausted by calling accessBatchMtd, which
+ * returns false at end-of-scan.
+ *
+ * Note: EPQ is not supported in the batch path; callers must ensure
+ * es_epq_active is NULL before using this function.
+ */
+static inline TupleTableSlot *
+ExecScanExtendedBatchSlot(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ TupleTableSlot *in;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next input slot from current batch, or refill */
+ if (!TupleBatchHasMore(b))
+ {
+ if (!accessBatchMtd(node))
+ return NULL;
+ }
+
+ in = TupleBatchGetNextSlot(b);
+ Assert(in);
+
+ /* No qual, no projection: direct return */
+ if (qual == NULL && projInfo == NULL)
+ return in;
+
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ /* Qual only */
+ if (projInfo == NULL)
+ {
+ if (qual == NULL || ExecQual(qual, econtext))
+ return in;
+ else
+ InstrCountFiltered1(node, 1);
+ continue;
+ }
+
+ /* Projection (with or without qual) */
+ if (qual == NULL || ExecQual(qual, econtext))
+ return ExecProject(projInfo);
+ else
+ InstrCountFiltered1(node, 1);
+ /* else try next tuple */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 5929aabc353..e82fd6c0c8a 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -578,12 +578,16 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
*/
typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanAccessBatchMtd)(ScanState *node);
extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
ExecScanRecheckMtd recheckMtd);
+
extern void ExecAssignScanProjectionInfo(ScanState *node);
extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
extern void ExecScanReScan(ScanState *node);
+extern bool ScanCanUseBatching(ScanState *scanstate, int eflags);
+extern void ScanResetBatching(ScanState *scanstate, bool drop);
/*
* prototypes from functions in execTuples.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index db559b39c4d..f6bd59f2af1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -288,6 +288,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
extern PGDLLIMPORT int VacuumCostBalance;
extern PGDLLIMPORT bool VacuumCostActive;
+extern PGDLLIMPORT int executor_batch_rows;
/* in utils/misc/stack_depth.c */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f8053d9e572..6a191202ced 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -31,6 +31,7 @@
#include "access/skey.h"
#include "access/tupconvert.h"
+#include "executor/execBatch.h"
#include "executor/instrument.h"
#include "executor/instrument_node.h"
#include "fmgr.h"
@@ -1206,6 +1207,9 @@ typedef struct PlanState
ExprContext *ps_ExprContext; /* node's expression-evaluation context */
ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */
+ /* Batching state if node supports it. */
+ TupleBatch *ps_Batch;
+
bool async_capable; /* true if node is async-capable */
/*
--
2.47.3
[application/octet-stream] v5-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch (14.0K, 4-v5-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch)
download | inline diff:
From f282f5dde3b4bc58b2cd7b66e55803df26e357aa Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 23:09:37 +0900
Subject: [PATCH v5 3/5] Add EXPLAIN (BATCHES) option for tuple batching
statistics
Add a BATCHES option to EXPLAIN that reports per-node batch statistics
when a node uses batch mode execution.
For nodes that support batching (currently SeqScan), this shows the
number of batches fetched along with average, minimum, and maximum
rows per batch. Output is supported in both text and non-text formats.
Add regression tests covering text output, JSON format, filtered scans,
LIMIT, and disabled batching.
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/commands/explain.c | 30 ++++++++++++++
src/backend/commands/explain_state.c | 2 +
src/backend/executor/execBatch.c | 31 +++++++++++++-
src/backend/executor/nodeSeqscan.c | 24 ++++++-----
src/include/commands/explain_state.h | 1 +
src/include/executor/execBatch.h | 16 +++++++-
src/include/executor/instrument.h | 1 +
src/test/regress/expected/explain.out | 58 +++++++++++++++++++++++++++
src/test/regress/sql/explain.sql | 27 +++++++++++++
9 files changed, 177 insertions(+), 13 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b7bb111688c..f3d521e1f93 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -22,6 +22,7 @@
#include "commands/explain_format.h"
#include "commands/explain_state.h"
#include "commands/prepare.h"
+#include "executor/execBatch.h"
#include "foreign/fdwapi.h"
#include "jit/jit.h"
#include "libpq/pqformat.h"
@@ -517,6 +518,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_BUFFERS;
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
+ if (es->batches)
+ instrument_option |= INSTRUMENT_BATCHES;
/*
* We always collect timing for the entire statement, even when node-level
@@ -2294,6 +2297,33 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_buffer_usage(es, &planstate->instrument->bufusage);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->walusage);
+ if (es->batches && planstate->ps_Batch)
+ {
+ TupleBatch *b = planstate->ps_Batch;
+
+ if (b->stat_batches > 0)
+ {
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfo(es->str,
+ "Batches: %lld Avg Rows: %.1f Max: %d Min: %d\n",
+ (long long) b->stat_batches,
+ TupleBatchAvgRows(b),
+ b->stat_max_rows,
+ b->stat_min_rows == INT_MAX ? 0 : b->stat_min_rows);
+ }
+ else
+ {
+ ExplainPropertyInteger("Batches", NULL, b->stat_batches, es);
+ ExplainPropertyFloat("Average Batch Rows", NULL,
+ TupleBatchAvgRows(b), 1, es);
+ ExplainPropertyInteger("Max Batch Rows", NULL, b->stat_max_rows, es);
+ ExplainPropertyInteger("Min Batch Rows", NULL,
+ b->stat_min_rows == INT_MAX ? 0 : b->stat_min_rows, es);
+ }
+ }
+ }
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index 803c74dd178..ad5b223ede7 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -159,6 +159,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
"EXPLAIN", opt->defname, p),
parser_errposition(pstate, opt->location)));
}
+ else if (strcmp(opt->defname, "batches") == 0)
+ es->batches = defGetBoolean(opt);
else if (!ApplyExtensionExplainOption(es, opt, pstate))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
index 1ef4117b87c..ed54e3165c8 100644
--- a/src/backend/executor/execBatch.c
+++ b/src/backend/executor/execBatch.c
@@ -19,7 +19,7 @@
* Allocate and initialize a new TupleBatch envelope.
*/
TupleBatch *
-TupleBatchCreate(TupleDesc scandesc, int capacity)
+TupleBatchCreate(TupleDesc scandesc, int capacity, bool track_stats)
{
TupleBatch *b;
TupleTableSlot **inslots,
@@ -47,6 +47,12 @@ TupleBatchCreate(TupleDesc scandesc, int capacity)
b->nvalid = 0;
b->next = 0;
+ b->track_stats = track_stats;
+ b->stat_batches = 0;
+ b->stat_rows = 0;
+ b->stat_max_rows = 0;
+ b->stat_min_rows = INT_MAX;
+
return b;
}
@@ -110,3 +116,26 @@ TupleBatchGetNumValid(TupleBatch *b)
{
return b->nvalid;
}
+
+void
+TupleBatchRecordStats(TupleBatch *b, int rows)
+{
+ if (!b->track_stats)
+ return;
+
+ b->stat_batches++;
+ b->stat_rows += rows;
+ if (rows > b->stat_max_rows)
+ b->stat_max_rows = rows;
+ if (rows < b->stat_min_rows && rows > 0)
+ b->stat_min_rows = rows;
+}
+
+double
+TupleBatchAvgRows(TupleBatch *b)
+{
+ if (b->stat_batches == 0)
+ return 0.0;
+
+ return (double) b->stat_rows / b->stat_batches;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 08d93e6f0be..f36b31d4fbb 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -213,8 +213,9 @@ SeqNextBatch(SeqScanState *node)
TableScanDesc scandesc;
EState *estate;
ScanDirection direction;
+ TupleBatch *b = node->ss.ps.ps_Batch;
- Assert(node->ss.ps.ps_Batch != NULL);
+ Assert(b != NULL);
/*
* get information from the estate and scan state
@@ -237,22 +238,21 @@ SeqNextBatch(SeqScanState *node)
}
/* Lazily create the AM batch payload. */
- if (node->ss.ps.ps_Batch->am_payload == NULL)
+ if (b->am_payload == NULL)
{
const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
Assert(tam && tam->scan_begin_batch);
- node->ss.ps.ps_Batch->am_payload =
- table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
- node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+ b->am_payload = table_scan_begin_batch(scandesc, b->maxslots);
+ b->ops = table_batch_callbacks(node->ss.ss_currentRelation);
}
- node->ss.ps.ps_Batch->ntuples =
- table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
- node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
- node->ss.ps.ps_Batch->materialized = false;
+ b->ntuples = table_scan_getnextbatch(scandesc, b->am_payload, direction);
+ b->nvalid = b->ntuples;
+ b->materialized = false;
+ TupleBatchRecordStats(b, b->ntuples);
- return node->ss.ps.ps_Batch->ntuples > 0;
+ return b->ntuples > 0;
}
static bool
@@ -340,8 +340,10 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
const int cap = executor_batch_rows;
TupleDesc scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+ EState *estate = scanstate->ss.ps.state;
+ bool track_stats = estate->es_instrument && (estate->es_instrument & INSTRUMENT_BATCHES);
- scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+ scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap, track_stats);
/* Choose batch variant to preserve your specialization matrix */
if (scanstate->ss.ps.qual == NULL)
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 0b695f7d812..0a99f0f2341 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -55,6 +55,7 @@ typedef struct ExplainState
bool memory; /* print planner's memory usage information */
bool settings; /* print modified settings */
bool generic; /* generate a generic plan */
+ bool batches; /* print batch statistics */
ExplainSerializeOption serialize; /* serialize the query's output? */
ExplainFormat format; /* output format */
/* state for output formatting --- not reset for each new plan tree */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
index 2d0066103ce..1efc194d8ff 100644
--- a/src/include/executor/execBatch.h
+++ b/src/include/executor/execBatch.h
@@ -13,6 +13,8 @@
#ifndef EXECBATCH_H
#define EXECBATCH_H
+#include <limits.h>
+
#include "executor/tuptable.h"
/*
@@ -45,11 +47,18 @@ typedef struct TupleBatch
int nvalid; /* number of returnable tuples in outslots */
int next; /* 0-based index of next tuple to be returned */
+
+ /* Statistics (populated when EXPLAIN ANALYZE BATCHES) */
+ bool track_stats; /* whether to collect stats */
+ int64 stat_batches; /* total number of batches fetched */
+ int64 stat_rows; /* total tuples across all batches */
+ int stat_max_rows; /* max rows in any single batch */
+ int stat_min_rows; /* min rows in any single batch (non-zero) */
} TupleBatch;
/* Helpers */
-extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity, bool track_stats);
extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
@@ -96,4 +105,9 @@ TupleBatchMaterializeAll(TupleBatch *b)
TupleBatchUseInput(b, b->ntuples);
}
+/* === Batching stats. ===*/
+
+extern void TupleBatchRecordStats(TupleBatch *b, int rows);
+extern double TupleBatchAvgRows(TupleBatch *b);
+
#endif /* EXECBATCH_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..bee69b4ac8f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -64,6 +64,7 @@ typedef enum InstrumentOption
INSTRUMENT_BUFFERS = 1 << 1, /* needs buffer usage */
INSTRUMENT_ROWS = 1 << 2, /* needs row count */
INSTRUMENT_WAL = 1 << 3, /* needs WAL usage */
+ INSTRUMENT_BATCHES = 1 << 4, /* needs batches */
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..1bec59eea9e 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,61 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- Test BATCHES option
+set executor_batch_rows = 64;
+create table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+-- With filter
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Filter: (a > N)
+ Rows Removed by Filter: N
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With LIMIT - partial scan shows fewer batches
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+ explain_filter
+----------------------------------------------------------------------
+ Limit (actual time=N.N..N.N rows=N.N loops=N)
+ -> Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(5 rows)
+
+-- Batching disabled - no batch line
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+reset executor_batch_rows;
+-- JSON format
+select explain_filter_to_json('explain (analyze, batches, buffers off, format json) select * from batch_test where a < 1000') #> '{0,Plan,Batches}';
+ ?column?
+----------
+ 0
+(1 row)
+
+drop table batch_test;
+reset executor_batch_rows;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..7881c674495 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,30 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- Test BATCHES option
+set executor_batch_rows = 64;
+
+create table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+
+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- With filter
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+
+-- With LIMIT - partial scan shows fewer batches
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+
+-- Batching disabled - no batch line
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+reset executor_batch_rows;
+
+-- JSON format
+select explain_filter_to_json('explain (analyze, batches, buffers off, format json) select * from batch_test where a < 1000') #> '{0,Plan,Batches}';
+
+drop table batch_test;
+reset executor_batch_rows;
--
2.47.3
[application/octet-stream] v5-0004-WIP-Add-ExecQualBatch-for-batched-qual-evaluation.patch (32.2K, 5-v5-0004-WIP-Add-ExecQualBatch-for-batched-qual-evaluation.patch)
download | inline diff:
From e155dc70e0370435061da70362175255d83a36ea Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 26 Jan 2026 11:01:44 +0900
Subject: [PATCH v5 4/5] WIP: Add ExecQualBatch() for batched qual evaluation
Introduce batched qual evaluation for SeqScan when quals are simple
AND-trees of Var op Const, Var op Var, or NullTest expressions.
The batch is evaluated using a bitmask, avoiding per-tuple ExecQual()
overhead.
Only leakproof operators are eligible for batching, since batching
changes evaluation order which could otherwise leak data through
side channels before security barrier quals filter rows.
Add supporting infrastructure: EEOP_SCAN_FETCHSOME_BATCH to deform
all tuples in a batch and ExprContext.scan_batch field.
The postgres_fdw regression test is updated to disable batching for
a query with LIMIT, since batching processes entire batches before
checking LIMIT, resulting in different "Rows Removed by Filter"
counts in EXPLAIN ANALYZE output.
---
.../postgres_fdw/expected/postgres_fdw.out | 1 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 1 +
src/backend/executor/execExpr.c | 335 ++++++++++++++++++
src/backend/executor/execExprInterp.c | 224 ++++++++++++
src/backend/executor/execTuples.c | 32 ++
src/backend/executor/nodeSeqscan.c | 28 +-
src/backend/jit/llvm/llvmjit_expr.c | 35 ++
src/backend/jit/llvm/llvmjit_types.c | 3 +
src/include/executor/execExpr.h | 84 ++++-
src/include/executor/execScan.h | 46 +++
src/include/executor/executor.h | 3 +
src/include/executor/tuptable.h | 2 +
src/include/nodes/execnodes.h | 11 +-
13 files changed, 802 insertions(+), 3 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 6066510c7c0..67df4233235 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -12208,6 +12208,7 @@ SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
Filter: (t1_3.b === 505)
(14 rows)
+SET executor_batch_rows = 1;
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF, BUFFERS OFF)
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
QUERY PLAN
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4f7ab2ed0ac..daffc545a5c 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -4126,6 +4126,7 @@ SELECT * FROM local_tbl t1 LEFT JOIN (SELECT *, (SELECT count(*) FROM async_pt W
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+SET executor_batch_rows = 1;
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF, BUFFERS OFF)
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 088eca24021..cc76b760ee7 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -104,6 +104,16 @@ static void ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
bool exists_coerce,
Datum *resv, bool *resnull);
+/* private context for the walker */
+typedef struct QualBatchContext
+{
+ List *leaves; /* List<Node*> of accepted leaves */
+ Bitmapset *attnos; /* Vars referenced by accepted leaves */
+ bool ok; /* stays true if batchable */
+ AttrNumber last_scan; /* last needed attribute in scan slot */
+} QualBatchContext;
+
+static bool qual_batchable_walker(Node *node, void *context);
/*
* ExecInitExpr: prepare an expression tree for execution
@@ -5064,3 +5074,328 @@ ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
DomainHasConstraints(returning->typid);
ExprEvalPushStep(state, &scratch);
}
+
+/*
+ * Extract Var attno from expression, unwrapping RelabelType/TargetEntry.
+ * Returns attno > 0 on success, 0 on failure (not a Var, or system column).
+ */
+static AttrNumber
+extract_var_attno(Expr *expr)
+{
+ if (expr == NULL)
+ return 0;
+ if (IsA(expr, TargetEntry))
+ return extract_var_attno(((TargetEntry *) expr)->expr);
+ if (IsA(expr, RelabelType))
+ return extract_var_attno((Expr *) ((RelabelType *) expr)->arg);
+ if (IsA(expr, Var) && ((Var *) expr)->varattno > 0)
+ return ((Var *) expr)->varattno;
+ return 0;
+}
+
+/*
+ * qual_batchable_walker
+ * Check if a qual tree is eligible for batched evaluation.
+ *
+ * Walks the qual tree and validates that it consists only of:
+ * - AND expressions (OR/NOT disqualify)
+ * - NullTest on simple Vars
+ * - Binary OpExpr with Var op Const or Var op Var arguments
+ *
+ * For OpExpr, the operator must be:
+ * - Strict: ensures NULL inputs produce NULL/false, matching WHERE semantics
+ * - Leakproof: required because batching evaluates all rows before filtering,
+ * which could leak data to non-leakproof operators before security barrier
+ * quals have a chance to filter rows
+ *
+ * On success, populates cxt->leaves with the leaf nodes and cxt->attnos with
+ * the referenced attribute numbers. Sets cxt->ok = false if any node fails
+ * validation.
+ */
+static bool
+qual_batchable_walker(Node *node, void *context)
+{
+ QualBatchContext *cxt = (QualBatchContext *) context;
+
+ if (node == NULL || !cxt->ok)
+ return false;
+
+ switch (nodeTag(node))
+ {
+ case T_List:
+ return expression_tree_walker(node, qual_batchable_walker, cxt);
+
+ case T_BoolExpr:
+ {
+ BoolExpr *b = (BoolExpr *) node;
+
+ /* Only AND trees are allowed */
+ if (b->boolop != AND_EXPR)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ /* Recurse normally over children */
+ return expression_tree_walker(node, qual_batchable_walker, cxt);
+ }
+
+ case T_NullTest:
+ {
+ NullTest *nt = (NullTest *) node;
+ AttrNumber attno = extract_var_attno(nt->arg);
+
+ if (attno == 0)
+ {
+ cxt->ok = false;
+ return true;
+ }
+
+ cxt->attnos = bms_add_member(cxt->attnos, attno);
+ if (attno > cxt->last_scan)
+ cxt->last_scan = attno;
+ cxt->leaves = lappend(cxt->leaves, node);
+
+ /* Do NOT recurse into leaf */
+ return false;
+ }
+
+ case T_OpExpr:
+ {
+ OpExpr *op = (OpExpr *) node;
+ List *args = op->args;
+ AttrNumber lattno,
+ rattno;
+
+ /* Only binary operators */
+ if (list_length(args) != 2)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ /* Must be strict (NULL input -> NULL/false result) */
+ if (!func_strict(op->opfuncid))
+ {
+ cxt->ok = false;
+ return true;
+ }
+ /*
+ * Must be leakproof. Batching changes evaluation order, which
+ * could leak data through side channels before security barrier
+ * quals filter rows.
+ */
+ if (!get_func_leakproof(op->opfuncid))
+ {
+ cxt->ok = false;
+ return true;
+ }
+
+ /* Left arg must be a Var */
+ lattno = extract_var_attno(linitial(op->args));
+ if (lattno == 0)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ cxt->attnos = bms_add_member(cxt->attnos, lattno);
+ if (lattno > cxt->last_scan)
+ cxt->last_scan = lattno;
+
+ /* Right arg must be Const or Var */
+ if (!IsA(lsecond(op->args), Const))
+ {
+ rattno = extract_var_attno(lsecond(op->args));
+ if (rattno == 0)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ cxt->attnos = bms_add_member(cxt->attnos, rattno);
+ if (rattno > cxt->last_scan)
+ cxt->last_scan = rattno;
+ }
+
+ cxt->leaves = lappend(cxt->leaves, node);
+
+ return false; /* leaf; don't recurse */
+ }
+
+ /* Unhandled node type; fall back to per-tuple evaluation */
+ default:
+ cxt->ok = false;
+ break;
+ }
+
+ return true;
+}
+
+/* build a BatchQualTerm from a validated leaf */
+static BatchQualTerm *
+build_term_from_leaf(Node *n)
+{
+ BatchQualTerm *term;
+ BatchQualTermKind kind;
+ bool strict;
+ AttrNumber l_attno;
+ AttrNumber r_attno;
+ Datum r_const = (Datum) 0;
+ bool r_isnull = false;
+ FmgrInfo *finfo = NULL;
+ Oid collation;
+
+ if (IsA(n, NullTest))
+ {
+ NullTest *nt = (NullTest *) n;
+
+ kind = nt->nulltesttype == IS_NULL ? BQTK_IS_NULL : BQTK_IS_NOT_NULL;
+ l_attno = extract_var_attno(nt->arg);
+ r_attno = 0;
+ strict = false;
+ collation = InvalidOid;
+
+ if (l_attno == 0)
+ return NULL;
+ }
+ else if (IsA(n, OpExpr))
+ {
+ OpExpr *op = (OpExpr *) n;
+ Expr *l = linitial(op->args);
+ Expr *r = lsecond(op->args);
+
+ l_attno = extract_var_attno(l);
+ if (l_attno == 0)
+ return NULL;
+
+ if (IsA(r, Const))
+ {
+ Const *c = (Const *) r;
+
+ kind = BQTK_VAR_CONST;
+ r_const = c->constvalue;
+ r_isnull = c->constisnull;
+ r_attno = 0;
+ }
+ else
+ {
+ r_attno = extract_var_attno(r);
+ if (r_attno == 0)
+ return NULL;
+ kind = BQTK_VAR_VAR;
+ }
+
+ strict = func_strict(op->opfuncid);
+ collation = exprInputCollation((Node *) op);
+ finfo = palloc(sizeof(FmgrInfo));
+ fmgr_info(op->opfuncid, finfo);
+ }
+ else
+ return NULL;
+
+ term = palloc(sizeof(BatchQualTerm));
+ term->kind = kind;
+ term->strict = strict;
+ term->l_attno = l_attno;
+ term->r_attno = r_attno;
+ term->r_const = r_const;
+ term->r_isnull = r_isnull;
+ term->finfo = finfo;
+ term->collation = collation;
+
+ return term;
+}
+
+/*
+ * ExecInitQualBatch
+ * Build a batched-qual ExprState for evaluating scan quals over a TupleBatch.
+ *
+ * Returns a dedicated ExprState that evaluates the plan's quals in batch mode,
+ * or NULL if the quals are not eligible for batching. The caller should retain
+ * the regular ps->qual for fallback when batching is not used.
+ *
+ * Batching is only possible when the qual tree consists of:
+ * - Top-level AND of simple clauses (no OR, NOT)
+ * - NullTest on a simple Var
+ * - Binary OpExpr with (Var op Const) or (Var op Var), where the operator
+ * is both strict (for proper NULL handling) and leakproof (to avoid
+ * leaking data when evaluation order changes vs. security barrier quals)
+ *
+ * The generated EEOP program:
+ * 1. EEOP_SCAN_FETCHSOME_BATCH - deforms all slots in the batch
+ * 2. EEOP_QUAL_BATCH_INITMASK - initializes bitmask to all-pass
+ * 3. EEOP_QUAL_BATCH_TERM (per leaf) - evaluates term, clears failing bits
+ *
+ * The result bitmask is stored in BatchQualRuntime (via ExprState.batch_private)
+ * for the caller to use when populating output slots.
+ */
+ExprState *
+ExecInitQualBatch(PlanState *ps)
+{
+ Node *qual = (Node *) ps->plan->qual;
+ QualBatchContext cxt = {NIL, NULL, true, 0};
+ BatchQualRuntime *rt;
+ ExprState *state;
+ int maxrows = executor_batch_rows;
+ uint64 *mask;
+ int mask_words;
+ ListCell *lc;
+ ExprEvalStep scratch = {0};
+
+ if (qual == NULL)
+ return NULL;
+
+ /*
+ * Check if qual tree is batchable; collect leaf nodes and referenced
+ * attnos.
+ */
+ (void) qual_batchable_walker(qual, &cxt);
+ if (!cxt.ok || cxt.leaves == NIL || bms_is_empty(cxt.attnos))
+ return NULL;
+
+ /* Allocate bitmask: one bit per row, rounded up to 64-bit words */
+ mask_words = (maxrows + 63) >> 6;
+ mask = (uint64 *) palloc0(sizeof(uint64) * mask_words);
+
+ /* Bundle runtime state; attached to ExprState for access during execution */
+ rt = palloc0(sizeof(BatchQualRuntime));
+ rt->mask = mask;
+ rt->mask_words = mask_words;
+
+ /* Create ExprState for the batched program */
+ state = makeNode(ExprState);
+ state->expr = (Expr *) qual;
+ state->parent = ps;
+ state->ext_params = NULL;
+ state->flags = EEO_FLAG_IS_QUAL;
+ state->batch_private = (void *) rt;
+
+ /* Step 1: deform all slots in batch up to highest referenced attribute */
+ scratch.opcode = EEOP_SCAN_FETCHSOME_BATCH;
+ scratch.d.fetch_batch.last_var = cxt.last_scan;
+ ExprEvalPushStep(state, &scratch);
+
+ /* Step 2 initialize mask to all-ones (all rows pass initially) */
+ scratch.opcode = EEOP_QUAL_BATCH_INITMASK;
+ scratch.d.qualbatch_init.mask = mask;
+ scratch.d.qualbatch_init.mask_words = mask_words;
+ ExprEvalPushStep(state, &scratch);
+
+ /* Step 3: one TERM per qual leaf; each clears mask bits for failing rows */
+ foreach(lc, cxt.leaves)
+ {
+ BatchQualTerm *term = build_term_from_leaf((Node *) lfirst(lc));
+
+ if (term == NULL)
+ return NULL;
+
+ scratch.opcode = EEOP_QUAL_BATCH_TERM;
+ scratch.d.qualbatch_term.term = term; /* by value */
+ ExprEvalPushStep(state, &scratch);
+ }
+
+ /* Done; mask now indicates which rows survived all quals */
+ scratch.opcode = EEOP_DONE_NO_RETURN;
+ ExprEvalPushStep(state, &scratch);
+
+ ExecReadyExpr(state);
+
+ return state;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index a7a5ac1e83b..304c7f4e0fb 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -59,6 +59,7 @@
#include "access/heaptoast.h"
#include "catalog/pg_type.h"
#include "commands/sequence.h"
+#include "executor/execBatch.h"
#include "executor/execExpr.h"
#include "executor/nodeSubplan.h"
#include "funcapi.h"
@@ -466,6 +467,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
TupleTableSlot *scanslot;
TupleTableSlot *oldslot;
TupleTableSlot *newslot;
+ TupleBatch *scanbatch;
/*
* This array has to be in the same order as enum ExprEvalOp.
@@ -592,6 +594,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
+ &&CASE_EEOP_SCAN_FETCHSOME_BATCH,
+ &&CASE_EEOP_QUAL_BATCH_INITMASK,
+ &&CASE_EEOP_QUAL_BATCH_TERM,
&&CASE_EEOP_LAST
};
@@ -612,6 +617,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
scanslot = econtext->ecxt_scantuple;
oldslot = econtext->ecxt_oldtuple;
newslot = econtext->ecxt_newtuple;
+ scanbatch = econtext->scan_batch;
#if defined(EEO_USE_COMPUTED_GOTO)
EEO_DISPATCH();
@@ -2265,6 +2271,28 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_SCAN_FETCHSOME_BATCH)
+ {
+ CheckOpSlotCompatibility(op, scanslot);
+
+ Assert(scanbatch);
+ slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_QUAL_BATCH_INITMASK)
+ {
+ ExecQualBatchInitMask(state, op, econtext);
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_QUAL_BATCH_TERM)
+ {
+ ExecQualBatchTerm(state, op, econtext);
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_LAST)
{
/* unreachable */
@@ -5914,3 +5942,199 @@ ExecAggPlainTransByRef(AggState *aggstate, AggStatePerTrans pertrans,
MemoryContextSwitchTo(oldContext);
}
+
+/* set mask bits [0..nvalid_bits) to 1; clear padding in the last word */
+static inline void
+mask_init_all_ones(uint64 *a, int nwords, int nvalid_bits)
+{
+ for (int i = 0; i < nwords; i++)
+ a[i] = ~UINT64CONST(0);
+
+ if ((nvalid_bits & 63) != 0)
+ {
+ int rem = nvalid_bits & 63;
+
+ a[nwords - 1] &= (~UINT64CONST(0)) >> (64 - rem);
+ }
+}
+
+static inline void
+mask_clear_bit(uint64 *a, int i)
+{
+ a[i >> 6] &= ~(UINT64CONST(1) << (i & 63));
+}
+
+static inline bool
+mask_is_empty(const uint64 *mask, int nwords)
+{
+ for (int i = 0; i < nwords; i++)
+ {
+ if (mask[i] != 0)
+ return false;
+ }
+ return true;
+}
+
+void
+ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ TupleBatch *b = econtext->scan_batch;
+ uint64 *mask = op->d.qualbatch_init.mask;
+ int nwords = op->d.qualbatch_init.mask_words;
+ int n = b->ntuples;
+
+ /* initialize to all-pass for current batch size */
+ mask_init_all_ones(mask, nwords, n);
+}
+
+void
+ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ BatchQualRuntime *rt = ExecGetBatchQualRuntime(state);
+ TupleBatch *b = econtext->scan_batch;
+ TupleTableSlot **slots = b->activeslots;
+ uint64 *mask = rt->mask;
+ int mask_words = rt->mask_words;
+ BatchQualTerm *t = op->d.qualbatch_term.term;
+ int n = b->ntuples;
+
+ /* Early exit if no rows remain */
+ if (mask_is_empty(mask, mask_words))
+ return;
+
+ switch (t->kind)
+ {
+ case BQTK_IS_NULL:
+ {
+ /* keep bit set only if value IS NULL; clear otherwise */
+ for (int i = 0; i < n; i++)
+ {
+ if (!slots[i]->tts_isnull[t->l_attno-1])
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_IS_NOT_NULL:
+ {
+ /* keep bit set only if value IS NOT NULL; clear if NULL */
+ for (int i = 0; i < n; i++)
+ {
+ if (slots[i]->tts_isnull[t->l_attno-1])
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_VAR_CONST:
+ {
+ const bool r_isnull = t->r_isnull;
+ const Datum r_const = t->r_const;
+ const bool strict = t->strict;
+ const Oid coll = t->collation;
+ FmgrInfo *finfo = t->finfo;
+
+ for (int i = 0; i < n; i++)
+ {
+ bool ln = slots[i]->tts_isnull[t->l_attno-1];
+ bool pass;
+
+ /* WHERE treats NULL as false; strict ops short-circuit */
+ if (strict && (ln || r_isnull))
+ pass = false;
+ else
+ {
+ Datum lv = slots[i]->tts_values[t->l_attno-1];
+
+ pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, r_const));
+ }
+
+ if (!pass)
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_VAR_VAR:
+ {
+ const bool strict = t->strict;
+ const Oid coll = t->collation;
+ FmgrInfo *finfo = t->finfo;
+
+ for (int i = 0; i < n; i++)
+ {
+ bool ln = slots[i]->tts_isnull[t->l_attno-1];
+ bool rn = slots[i]->tts_isnull[t->r_attno-1];
+ bool pass;
+
+ if (strict && (ln || rn))
+ pass = false;
+ else
+ {
+ Datum lv = slots[i]->tts_values[t->l_attno-1];
+ Datum rv = slots[i]->tts_values[t->r_attno-1];
+
+ pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, rv));
+ }
+
+ if (!pass)
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ default:
+ /* should not happen; leave mask unchanged */
+ break;
+ }
+}
+
+/*
+ * ExecQualBatch
+ * Evaluate a batched qual over all rows in a TupleBatch.
+ *
+ * Runs the EEOP program built by ExecInitQualBatch, which produces a bitmask
+ * indicating which rows pass the qual. Rows that pass are copied to the
+ * batch's output slots (b->outslots).
+ *
+ * Returns the number of qualifying rows. The caller should then call
+ * TupleBatchUseOutput(b, qualified) to switch the batch to return from
+ * outslots.
+ *
+ * The batch must be materialized (slots populated) before calling this.
+ */
+int
+ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b)
+{
+ int i;
+ uint64 *mask;
+ int kept = 0;
+ BatchQualRuntime *rt = ExecGetBatchQualRuntime(state);
+
+ /* verify that expression was compiled using ExecInitQualBatch */
+ Assert(state->flags & EEO_FLAG_IS_QUAL);
+ Assert(rt && rt->mask && rt->mask_words);
+
+ /* run the batched EEOP program once */
+ econtext->scan_batch = b;
+ ExecEvalExprNoReturn(state, econtext);
+
+ mask = rt->mask;
+ if (mask_is_empty(mask, rt->mask_words))
+ return 0;
+
+ /* Add survivors into outslots */
+ TupleBatchRewind(b);
+ i = 0;
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ /* mask bit set => row survives */
+ if (mask[i >> 6] & (UINT64CONST(1) << (i & 63)))
+ TupleBatchStoreInOut(b, kept++, slot);
+ i++;
+ }
+
+ return kept;
+}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index b768eae9e53..5082d8ecd3b 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -2111,6 +2111,38 @@ slot_getsomeattrs_int(TupleTableSlot *slot, int attnum)
}
}
+void
+slot_getsomeattrs_batch(struct TupleBatch *b, int attnum)
+{
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ /* Check for caller errors */
+ Assert(attnum > 0);
+
+ if (unlikely(attnum > slot->tts_tupleDescriptor->natts))
+ elog(ERROR, "invalid attribute number %d", attnum);
+
+ /* XXX - there should perhaps also be a batch-level att_nvalid */
+ if (attnum < slot->tts_nvalid)
+ continue;
+
+ /* Fetch as many attributes as possible from the underlying tuple. */
+ slot->tts_ops->getsomeattrs(slot, attnum);
+
+ /*
+ * If the underlying tuple doesn't have enough attributes, tuple
+ * descriptor must have the missing attributes.
+ */
+ if (unlikely(slot->tts_nvalid < attnum))
+ {
+ slot_getmissingattrs(slot, slot->tts_nvalid, attnum);
+ slot->tts_nvalid = attnum;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecTypeFromTL
*
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index f36b31d4fbb..16f15ed68aa 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -281,6 +281,28 @@ ExecSeqScanBatchSlot(PlanState *pstate)
NULL, NULL);
}
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithBatchQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+ TupleBatch *b = pstate->ps_Batch;
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtended().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual_batch != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ if (!TupleBatchHasMore(b))
+ b = ExecScanExtendedBatch(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual_batch, NULL);
+
+ return b ? TupleBatchGetNextSlot(b) : NULL;
+}
+
static TupleTableSlot *
ExecSeqScanBatchSlotWithQual(PlanState *pstate)
{
@@ -344,6 +366,7 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
bool track_stats = estate->es_instrument && (estate->es_instrument & INSTRUMENT_BATCHES);
scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap, track_stats);
+ scanstate->ss.ps.qual_batch = ExecInitQualBatch((PlanState *) scanstate);
/* Choose batch variant to preserve your specialization matrix */
if (scanstate->ss.ps.qual == NULL)
@@ -361,7 +384,10 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
if (scanstate->ss.ps.ps_ProjInfo == NULL)
{
- scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ if (scanstate->ss.ps.qual_batch == NULL)
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ else
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithBatchQual;
}
else
{
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 650f1d42a93..847f265df3b 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -109,6 +109,9 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_newslot;
LLVMValueRef v_resultslot;
+ /* batches */
+ LLVMValueRef v_scanbatch;
+
/* nulls/values of slots */
LLVMValueRef v_innervalues;
LLVMValueRef v_innernulls;
@@ -221,6 +224,11 @@ llvm_compile_expr(ExprState *state)
v_state,
FIELDNO_EXPRSTATE_RESULTSLOT,
"v_resultslot");
+ v_scanbatch = l_load_struct_gep(b,
+ StructExprContext,
+ v_econtext,
+ FIELDNO_EXPRCONTEXT_SCANBATCH,
+ "v_scanbatch");
/* build global values/isnull pointers */
v_scanvalues = l_load_struct_gep(b,
@@ -2940,6 +2948,33 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_SCAN_FETCHSOME_BATCH:
+ {
+ LLVMValueRef params[2];
+
+ params[0] = v_scanbatch;
+ params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+ l_call(b,
+ llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+ llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+ params, lengthof(params), "");
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
+ case EEOP_QUAL_BATCH_INITMASK:
+ build_EvalXFunc(b, mod, "ExecQualBatchInitMask",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ case EEOP_QUAL_BATCH_TERM:
+ build_EvalXFunc(b, mod, "ExecQualBatchTerm",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
case EEOP_LAST:
Assert(false);
break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 4636b90cd0f..5ba9920f3fd 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -179,7 +179,10 @@ void *referenced_functions[] =
MakeExpandedObjectReadOnlyInternal,
slot_getmissingattrs,
slot_getsomeattrs_int,
+ slot_getsomeattrs_batch,
strlen,
varsize_any,
ExecInterpExprStillValid,
+ ExecQualBatchInitMask,
+ ExecQualBatchTerm,
};
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index aa9b361fa31..2672d2674cc 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -292,11 +292,29 @@ typedef enum ExprEvalOp
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
+ /*
+ * Batched qual evaluation opcodes
+ *
+ * These opcodes implement batch-mode qual evaluation where an entire
+ * TupleBatch is processed at once rather than tuple-by-tuple.
+ *
+ * EEOP_SCAN_FETCHSOME_BATCH: Call slot_getsomeattrs() on all slots in
+ * the batch to ensure needed attributes are deformed.
+ *
+ * EEOP_QUAL_BATCH_INITMASK: Initialize the result bitmask to all-ones
+ * (all rows initially pass).
+ *
+ * EEOP_QUAL_BATCH_TERM: Evaluate one qual leaf (NullTest or OpExpr) over
+ * all rows, clearing mask bits for rows that fail.
+ */
+ EEOP_SCAN_FETCHSOME_BATCH,
+ EEOP_QUAL_BATCH_INITMASK,
+ EEOP_QUAL_BATCH_TERM,
+
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
} ExprEvalOp;
-
typedef struct ExprEvalStep
{
/*
@@ -331,6 +349,12 @@ typedef struct ExprEvalStep
const TupleTableSlotOps *kind;
} fetch;
+ struct
+ {
+ /* attribute number up to which to fetch (inclusive) */
+ int last_var;
+ } fetch_batch;
+
/* for EEOP_INNER/OUTER/SCAN/OLD/NEW_[SYS]VAR */
struct
{
@@ -769,6 +793,17 @@ typedef struct ExprEvalStep
void *json_coercion_cache;
ErrorSaveContext *escontext;
} jsonexpr_coercion;
+
+ struct
+ {
+ uint64 *mask; /* shared mask buffer for this program */
+ int mask_words; /* ceil(es_max_batch/64) */
+ } qualbatch_init; /* EEOP_QUAL_BATCH_INITMASK */
+
+ struct
+ {
+ struct BatchQualTerm *term; /* compiled leaf */
+ } qualbatch_term; /* EEOP_QUAL_BATCH_TERM */
} d;
} ExprEvalStep;
@@ -917,4 +952,51 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
ExprContext *econtext);
+/* See ExecQualBatchTerm(). */
+typedef enum BatchQualTermKind
+{
+ BQTK_VAR_CONST,
+ BQTK_VAR_VAR,
+ BQTK_IS_NULL,
+ BQTK_IS_NOT_NULL,
+} BatchQualTermKind;
+
+typedef struct BatchQualTerm
+{
+ BatchQualTermKind kind;
+ bool strict; /* follow strict NULL semantics if true */
+ AttrNumber l_attno; /* left VAR column */
+ AttrNumber r_attno; /* right VAR column, or -1 if Const */
+ Datum r_const; /* for VAR_CONST */
+ bool r_isnull; /* for VAR_CONST */
+ FmgrInfo *finfo; /* fmgr for generic binary ops */
+ Oid collation; /* op collation */
+} BatchQualTerm;
+
+/*
+ * BatchQualRuntime - execution state for batched qual evaluation
+ *
+ * Attached to ExprState.batch_private for the batched qual program.
+ * Contains the bitmask that tracks which rows pass the qual (bit set = pass),
+ * and references to the BatchVector for EEOP_QUAL_BATCH_TERM to use.
+ *
+ * The mask uses standard bit operations: word = i/64, bit = i%64.
+ * Initialized to all-ones by EEOP_QUAL_BATCH_INITMASK, then each
+ * EEOP_QUAL_BATCH_TERM clears bits for failing rows.
+ */
+typedef struct BatchQualRuntime
+{
+ uint64 *mask;
+ int mask_words;
+} BatchQualRuntime;
+
+static inline BatchQualRuntime *
+ExecGetBatchQualRuntime(ExprState *batch_qual)
+{
+ return (BatchQualRuntime *) batch_qual->batch_private;
+}
+
+extern void ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
#endif /* EXEC_EXPR_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index d9185331e22..008780ea230 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -320,4 +320,50 @@ ExecScanExtendedBatchSlot(ScanState *node,
}
}
+/*
+ * ExecScanExtendedBatch
+ * Batch-driven scan with batched qual evaluation.
+ *
+ * Unlike ExecScanExtendedBatchSlot which evaluates quals tuple-at-a-time,
+ * this function uses ExecQualBatch() to evaluate the entire batch at once
+ * using a bitmask. Qualifying tuples are collected into b->outslots.
+ *
+ * Returns the TupleBatch with nvalid set to the number of qualifying rows,
+ * or NULL at end-of-scan. Caller iterates b->outslots[0..nvalid-1].
+ *
+ * Note: EPQ is not supported; projection is not yet implemented.
+ */
+static inline TupleBatch *
+ExecScanExtendedBatch(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual_batch, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+ int qualified;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next batch from the AM */
+ if (!accessBatchMtd(node))
+ return NULL;
+
+ ResetExprContext(econtext);
+ qualified = ExecQualBatch(qual_batch, econtext, b);
+ InstrCountFiltered1(node, b->nvalid - qualified);
+ /* Update count and start using b->outslots. */
+ TupleBatchUseOutput(b, qualified);
+
+ if (qualified > 0)
+ return b;
+ /* else get the next batch from the AM */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index e82fd6c0c8a..8cded15dec6 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -326,6 +326,7 @@ ExecProcNode(PlanState *node)
extern ExprState *ExecInitExpr(Expr *node, PlanState *parent);
extern ExprState *ExecInitExprWithParams(Expr *node, ParamListInfo ext_params);
extern ExprState *ExecInitQual(List *qual, PlanState *parent);
+extern ExprState *ExecInitQualBatch(PlanState *ps);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
@@ -553,6 +554,8 @@ ExecQualAndReset(ExprState *state, ExprContext *econtext)
}
#endif
+extern int ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b);
+
extern bool ExecCheck(ExprState *state, ExprContext *econtext);
/*
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index a2dfd707e78..b06be83b141 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -346,6 +346,8 @@ extern Datum ExecFetchSlotHeapTupleDatum(TupleTableSlot *slot);
extern void slot_getmissingattrs(TupleTableSlot *slot, int startAttNum,
int lastAttNum);
extern void slot_getsomeattrs_int(TupleTableSlot *slot, int attnum);
+struct TupleBatch;
+extern void slot_getsomeattrs_batch(struct TupleBatch *b, int attnum);
#ifndef FRONTEND
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6a191202ced..c79ee965372 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -148,6 +148,9 @@ typedef struct ExprState
* ExecInitExprRec().
*/
ErrorSaveContext *escontext;
+
+ /* batched-program runtime (e.g., BatchQualRuntime) */
+ void *batch_private;
} ExprState;
@@ -314,6 +317,10 @@ typedef struct ExprContext
#define FIELDNO_EXPRCONTEXT_NEWTUPLE 15
TupleTableSlot *ecxt_newtuple;
+ /* For batched evaluation using batch-aware EEOPs */
+#define FIELDNO_EXPRCONTEXT_SCANBATCH 16
+ TupleBatch *scan_batch;
+
/* Link to containing EState (NULL if a standalone ExprContext) */
struct EState *ecxt_estate;
@@ -1186,7 +1193,9 @@ typedef struct PlanState
* state trees parallel links in the associated plan tree (except for the
* subPlan list, which does not exist in the plan tree).
*/
- ExprState *qual; /* boolean qual condition */
+ ExprState *qual; /* boolean qual condition (per tuple) */
+ ExprState *qual_batch; /* batched qual program, NULL if qual not
+ * batchable */
PlanState *lefttree; /* input plan tree(s) */
PlanState *righttree;
--
2.47.3
[application/octet-stream] v5-0005-WIP-Use-dedicated-interpreter-for-batched-qual-ev.patch (5.9K, 6-v5-0005-WIP-Use-dedicated-interpreter-for-batched-qual-ev.patch)
download | inline diff:
From 4916a0891b2e7176dee3c2a3a8018a4d174dd373 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 29 Jan 2026 05:03:55 +0900
Subject: [PATCH v5 5/5] WIP: Use dedicated interpreter for batched qual
evaluation
Move batch-related opcodes (EEOP_SCAN_FETCHSOME_BATCH,
EEOP_QUAL_BATCH_INITMASK, EEOP_QUAL_BATCH_TERM) out of the main
ExecInterpExpr switch and into a dedicated ExecInterpQualBatch
function.
Adding opcodes to ExecInterpExpr may affect performance even when
they are not executed, possibly due to changes in register allocation,
jump table layout, or code size. Use a separate interpreter to avoid
any risk of impacting the existing per-tuple evaluation path.
The batched qual program has a simple linear structure (fetch ->
initmask -> term* -> done) that doesn't need computed goto dispatch
anyway.
---
src/backend/executor/execExprInterp.c | 72 +++++++++++++++++----------
src/backend/executor/nodeSeqscan.c | 6 +--
2 files changed, 46 insertions(+), 32 deletions(-)
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 304c7f4e0fb..04a40ec932c 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -189,6 +189,8 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
int setno);
static char *ExecGetJsonValueItemString(JsonbValue *item, bool *resnull);
+static Datum ExecInterpQualBatch(ExprState *state, ExprContext *econtext);
+
/*
* ScalarArrayOpExprHashEntry
* Hash table entry type used during EEOP_HASHED_SCALARARRAYOP
@@ -266,6 +268,12 @@ ExecReadyInterpretedExpr(ExprState *state)
*/
state->evalfunc = ExecInterpExprStillValid;
+ if (state->batch_private)
+ {
+ state->evalfunc_private = (void *) ExecInterpQualBatch;
+ return;
+ }
+
/* DIRECT_THREADED should not already be set */
Assert((state->flags & EEO_FLAG_DIRECT_THREADED) == 0);
@@ -467,7 +475,6 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
TupleTableSlot *scanslot;
TupleTableSlot *oldslot;
TupleTableSlot *newslot;
- TupleBatch *scanbatch;
/*
* This array has to be in the same order as enum ExprEvalOp.
@@ -594,9 +601,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
- &&CASE_EEOP_SCAN_FETCHSOME_BATCH,
- &&CASE_EEOP_QUAL_BATCH_INITMASK,
- &&CASE_EEOP_QUAL_BATCH_TERM,
+ &&CASE_EEOP_BATCH_UNREACHABLE, /* EEOP_SCAN_FETCHSOME_BATCH */
+ &&CASE_EEOP_BATCH_UNREACHABLE, /* EEOP_QUAL_BATCH_INITMASK */
+ &&CASE_EEOP_BATCH_UNREACHABLE, /* EEOP_QUAL_BATCH_TERM */
&&CASE_EEOP_LAST
};
@@ -617,7 +624,6 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
scanslot = econtext->ecxt_scantuple;
oldslot = econtext->ecxt_oldtuple;
newslot = econtext->ecxt_newtuple;
- scanbatch = econtext->scan_batch;
#if defined(EEO_USE_COMPUTED_GOTO)
EEO_DISPATCH();
@@ -2271,34 +2277,18 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
- EEO_CASE(EEOP_SCAN_FETCHSOME_BATCH)
- {
- CheckOpSlotCompatibility(op, scanslot);
-
- Assert(scanbatch);
- slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
-
- EEO_NEXT();
- }
-
- EEO_CASE(EEOP_QUAL_BATCH_INITMASK)
- {
- ExecQualBatchInitMask(state, op, econtext);
- EEO_NEXT();
- }
-
- EEO_CASE(EEOP_QUAL_BATCH_TERM)
- {
- ExecQualBatchTerm(state, op, econtext);
- EEO_NEXT();
- }
-
EEO_CASE(EEOP_LAST)
{
/* unreachable */
Assert(false);
goto out_error;
}
+
+ EEO_CASE(EEOP_BATCH_UNREACHABLE)
+ {
+ Assert(false && "batch opcodes use dedicated interpreter");
+ pg_unreachable();
+ }
}
out_error:
@@ -6089,6 +6079,34 @@ ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
}
}
+static Datum
+ExecInterpQualBatch(ExprState *state, ExprContext *econtext)
+{
+ ExprEvalStep *op = state->steps;
+ TupleBatch *scanbatch = econtext->scan_batch;
+
+ /* Step 1: fetch/deform all slots */
+ Assert(ExecEvalStepOp(state, op) == EEOP_SCAN_FETCHSOME_BATCH);
+ slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
+ op++;
+
+ /* Step 2: initialize mask */
+ Assert(ExecEvalStepOp(state, op) == EEOP_QUAL_BATCH_INITMASK);
+ ExecQualBatchInitMask(state, op, econtext);
+ op++;
+
+ /* Step 3: process all TERM steps */
+ while (ExecEvalStepOp(state, op) == EEOP_QUAL_BATCH_TERM)
+ {
+ ExecQualBatchTerm(state, op, econtext);
+ op++;
+ }
+
+ Assert(ExecEvalStepOp(state, op) == EEOP_DONE_NO_RETURN);
+
+ return (Datum) 0;
+}
+
/*
* ExecQualBatch
* Evaluate a batched qual over all rows in a TupleBatch.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 16f15ed68aa..4a76108bd2f 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -404,7 +404,6 @@ SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
SeqScanState *scanstate;
- bool use_batching;
/*
* Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -435,12 +434,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
node->scan.scanrelid,
eflags);
- use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
-
/* and create slot with the appropriate rowtype */
ExecInitScanTupleSlot(estate, &scanstate->ss,
RelationGetDescr(scanstate->ss.ss_currentRelation),
- use_batching ? &TTSOpsHeapTuple :
table_slot_callbacks(scanstate->ss.ss_currentRelation));
/*
@@ -477,7 +473,7 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
- if (use_batching)
+ if (ScanCanUseBatching(&scanstate->ss, eflags))
SeqScanInitBatching(scanstate, eflags);
return scanstate;
--
2.47.3
^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Batching in executor
@ 2026-01-29 10:04 Amit Langote <[email protected]>
parent: Amit Langote <[email protected]>
1 sibling, 0 replies; 9+ messages in thread
From: Amit Langote @ 2026-01-29 10:04 UTC (permalink / raw)
To: Daniil Davydov <[email protected]>; +Cc: cca5507 <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
On Thu, Jan 29, 2026 at 8:35 AM Amit Langote <[email protected]> wrote:
>
> Hi,
>
> Here is v5 of the patch series.
>
> Patches 0001-0003 add the core batching infrastructure. 0001 adds the
> batch table AM API with heapam implementation, 0002 wires up SeqScan
> to use it (still returning one slot at a time), and 0003 adds EXPLAIN
> (BATCHES). I'd love to hear people's thoughts around TupleBatch
> structure added in 0002. I thought about making it a separate patch so
> that 0002 will still populate the single ScanState.ss_scanTupleSlot,
> but that means we'd still have to call the TAM callback to populate
> the tuple in the TAM's batch struct into the slot, defeating the whole
> point. With TupleBatch, you have executor_batch_rows number of slots
> which are filled in one TAM callback (materialize_all) call. So I
> decided to keep the TupleBatch and related things in 0002.
>
> For scans without quals, batching shows 20-30% improvement with no
> visible regressions when batching is disabled (batch_rows=0):
>
> SELECT * FROM t LIMIT n (no qual)
>
> Rows Master batch=0 %diff batch=64 %diff
> ------ -------- ------- ----- -------- -----
> 1M 12.42 ms 11.96 ms 3.7% 8.56 ms 31.0%
> 3M 38.95 ms 38.92 ms 0.1% 28.59 ms 26.6%
> 10M 153.64 ms 150.28 ms 2.2% 112.95 ms 26.5%
>
> (%diff: positive = faster than master, negative = slower)
Oops, I meant SELECT * FROM t LIMIT 1 OFFSET n (no qual).
--
Thanks, Amit Langote
^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Batching in executor
@ 2026-02-01 14:49 Junwang Zhao <[email protected]>
parent: Amit Langote <[email protected]>
1 sibling, 1 reply; 9+ messages in thread
From: Junwang Zhao @ 2026-02-01 14:49 UTC (permalink / raw)
To: Amit Langote <[email protected]>; +Cc: Daniil Davydov <[email protected]>; cca5507 <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
Hi Amit,
On Thu, Jan 29, 2026 at 3:35 PM Amit Langote <[email protected]> wrote:
>
> Hi,
>
> Here is v5 of the patch series.
>
> Patches 0001-0003 add the core batching infrastructure. 0001 adds the
> batch table AM API with heapam implementation, 0002 wires up SeqScan
> to use it (still returning one slot at a time), and 0003 adds EXPLAIN
> (BATCHES). I'd love to hear people's thoughts around TupleBatch
> structure added in 0002. I thought about making it a separate patch so
> that 0002 will still populate the single ScanState.ss_scanTupleSlot,
> but that means we'd still have to call the TAM callback to populate
> the tuple in the TAM's batch struct into the slot, defeating the whole
> point. With TupleBatch, you have executor_batch_rows number of slots
> which are filled in one TAM callback (materialize_all) call. So I
> decided to keep the TupleBatch and related things in 0002.
>
> For scans without quals, batching shows 20-30% improvement with no
> visible regressions when batching is disabled (batch_rows=0):
>
> SELECT * FROM t LIMIT n (no qual)
>
> Rows Master batch=0 %diff batch=64 %diff
> ------ -------- ------- ----- -------- -----
> 1M 12.42 ms 11.96 ms 3.7% 8.56 ms 31.0%
> 3M 38.95 ms 38.92 ms 0.1% 28.59 ms 26.6%
> 10M 153.64 ms 150.28 ms 2.2% 112.95 ms 26.5%
>
> (%diff: positive = faster than master, negative = slower)
>
> Patches 0004-0005 add batched qual evaluation and are more
> experimental (see below on why 0005 exists). For quals referencing
> early columns, the improvement is significant:
>
> SELECT * FROM t WHERE a = 0 ... OFFSET n (qual on 1st col)
>
> Rows Master batch=64 %diff
> ------ -------- -------- -----
> 1M 30.19 ms 15.55 ms 48.5%
> 3M 92.47 ms 50.01 ms 45.9%
> 10M 325.58 ms 211.83 ms 34.9%
>
> However, for quals on later columns (e.g., 15th), batching provides no
> benefit - deformation dominates and batching doesn't help:
>
> SELECT * FROM t WHERE o = 0 ... OFFSET n (qual on 15th col)
>
> Rows Master batch=64 %diff
> ------ -------- -------- -----
> 1M 44.14 ms 44.56 ms -0.9%
> 3M 133.89 ms 137.77 ms -2.9%
> 10M 503.33 ms 528.88 ms -5.1%
>
> I don't have a satisfactory explanation for why batching doesn't help
> the deform-heavy case at all. One would expect at least some benefit
> from reduced per-tuple overhead, but that's not materializing.
>
> I've also been struggling to understand why 0004 affects the per-tuple
> path even when batch_rows=0. For quals with 0% selectivity (all rows
> fail the qual), perf shows ExecInterpExpr is noticeably hotter with
> the patched code compared to master, even though batching is disabled:
>
> SELECT * FROM t WHERE a = 0 ... OFFSET n (0% selectivity)
>
> Rows Master batch=0 %diff batch=64 %diff
> ------ -------- ------- ----- -------- -----
> 1M 24.37 ms 28.67 ms -17.6% 12.46 ms 48.9%
> 3M 73.95 ms 85.07 ms -15.0% 41.64 ms 43.7%
> 10M 287.63 ms 316.81 ms -10.1% 188.01 ms 34.6%
>
> Compare that to 100% selectivity (all rows pass), where there's no regression:
>
> SELECT * FROM t WHERE a > 0 ... OFFSET n (100% selectivity)
>
> Rows Master batch=0 %diff batch=64 %diff
> ------ -------- ------- ----- -------- -----
> 1M 29.44 ms 29.10 ms 1.2% 16.61 ms 43.6%
> 3M 91.22 ms 90.28 ms 1.0% 54.10 ms 40.7%
> 10M 360.77 ms 331.25 ms 8.2% 224.00 ms 37.9%
>
> I tried moving batch opcodes to a separate interpreter (0005) thinking
> it might be register pressure or jump table effects from adding cases
> to ExecInterpExpr's switch. With 0005, the generated assembly for
> ExecInterpExpr looks identical to master (same stack frame size, same
> epilogue), yet the performance still differs. Specifically, the ldp
> instruction in the function epilogue shows 53% hotness in patched vs
> 35% in master. We still need placeholder entries in the dispatch
> table, so it's unclear if this fully isolates the per-tuple path. I'll
> continue looking at perf, but I feel like at a bit of a loss here and
> would appreciate any insights.
>
> Other changes worth noting:
>
> - I removed the BatchVector intermediate representation that copied
> Datums into columnar arrays before qual evaluation (it used to be in
> the batched qual patch 0004). Now quals access batch slots' tts_values
> directly. This simplifies the code and the copy overhead wasn't paying
> off. If we pursue serious vectorization later, this may need to be
> revisited, but removing it doesn't degrade performance.
>
> --
> Thanks, Amit Langote
Here are some comments for v5:
0001:
+/*
+ * heap_scan_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+/*
+ * heap_scan_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
These function names are not consistent with comments.
0002:
+/*
+ * heap_scan_materialize_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
ditto.
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+ if (relation->rd_tableam)
+ return relation->rd_tableam->batch_callbacks(relation);
+ elog(ERROR, "relation does not support TupleBatch operations");
+}
Is there any chance this batch_callbacks can be NULL? In that case it
can cause a segfault. I felt changing to
if (relation->rd_tableam && relation->rd_tableam->batch_callbacks)
should be more robust, but then I found table_slot_callbacks follow
the same pattern, so this shouldn't be a problem.
0003:
+++ b/src/include/executor/execBatch.h
@@ -13,6 +13,8 @@
#ifndef EXECBATCH_H
#define EXECBATCH_H
+#include <limits.h>
I guess the reason for including this header is because of the use
of INT_MAX, so maybe put that line into execBatch.c?
--
Regards
Junwang Zhao
^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Batching in executor
@ 2026-02-03 13:30 =?utf-8?B?Y2NhNTUwNw==?= <[email protected]>
parent: Junwang Zhao <[email protected]>
0 siblings, 1 reply; 9+ messages in thread
From: =?utf-8?B?Y2NhNTUwNw==?= @ 2026-02-03 13:30 UTC (permalink / raw)
To: =?utf-8?B?SnVud2FuZyBaaGFv?= <[email protected]>; =?utf-8?B?QW1pdCBMYW5nb3Rl?= <[email protected]>; +Cc: =?utf-8?B?RGFuaWlsIERhdnlkb3Y=?= <[email protected]>; pgsql-hackers; =?utf-8?B?VG9tYXMgVm9uZHJh?= <[email protected]>
Hi,
Some comments for v5:
0001
====
1) heap_begin_batch()
```
/* Single allocation for HeapBatch header + tupdata array */
alloc_size = sizeof(HeapBatch) + sizeof(HeapTupleData) * maxitems;
hb = palloc(alloc_size);
hb->tupdata = (HeapTupleData *) ((char *) hb + sizeof(HeapBatch));
```
Do we need a MAXALIGN() here to avoid unaligned access? Something like this:
```
/* Single allocation for HeapBatch header + tupdata array */
alloc_size = MAXALIGN(sizeof(HeapBatch)) + sizeof(HeapTupleData) * maxitems;
hb = palloc(alloc_size);
hb->tupdata = (HeapTupleData *) ((char *) hb + MAXALIGN(sizeof(HeapBatch)));
```
Or how about just using zero-length array:
```
typedef struct HeapBatch
{
Buffer buf;
int maxitems;
int nitems;
HeapTupleData tupdata[FLEXIBLE_ARRAY_MEMBER];
} HeapBatch;
// and
hb = palloc(offsetof(HeapBatch, tupdata) + sizeof(HeapTupleData) * maxitems);
```
2) pgstat_count_heap_getnext_batch()
```
#define pgstat_count_heap_getnext_batch(rel, n) \
do { \
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.tuples_returned += n; \
} while (0)
```
"+= n" -> "+= (n)", just like pgstat_count_index_tuples().
0002
====
1) TupleBatchCreate()
```
/* Single allocation for TupleBatch + inslots + outslots arrays */
alloc_size = sizeof(TupleBatch) + 2 * sizeof(TupleTableSlot *) * capacity;
b = palloc(alloc_size);
inslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch));
outslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch) +
sizeof(TupleTableSlot *) * capacity);
```
Do we need a MAXALIGN() here to avoid unaligned access?
2) TupleBatchReset()
```
for (int i = 0; i < b->maxslots; i++)
{
ExecClearTuple(b->inslots[i]);
if (drop_slots)
ExecDropSingleTupleTableSlot(b->inslots[i]);
}
```
ExecDropSingleTupleTableSlot() will call ExecClearTuple(), so ExecClearTuple() will be
called twice if drop_slots is true, I think we can avoid this.
3) ScanCanUseBatching()
In heap_beginscan(), we may disable page-at-a-time mode:
```
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
*/
if (!(snapshot && IsMVCCSnapshot(snapshot)))
scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;
```
It seems that ScanCanUseBatching() didn't consider this.
4) struct TupleBatch
```
struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
struct TupleTableSlot **outslots; /* slots for tuples going "out of"
* batch */
struct TupleTableSlot **activeslots;
```
I think we can remove the word "struct".
5) ExecScanExtendedBatchSlot()
```
/* Get next input slot from current batch, or refill */
if (!TupleBatchHasMore(b))
{
if (!accessBatchMtd(node))
return NULL;
}
```
I think we cannot just return NULL here, see comments in ExecScanExtended():
```
/*
* if the slot returned by the accessMtd contains NULL, then it means
* there is nothing more to scan so we just return an empty slot,
* being careful to use the projection result slot so it has correct
* tupleDesc.
*/
if (TupIsNull(slot))
{
if (projInfo)
return ExecClearTuple(projInfo->pi_state.resultslot);
else
return slot;
}
```
And why not just write this function like ExecScanExtended() and ExecScanFetch()?
--
Regards,
ChangAo Chen
^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Batching in executor
@ 2026-02-03 15:54 Junwang Zhao <[email protected]>
parent: =?utf-8?B?Y2NhNTUwNw==?= <[email protected]>
0 siblings, 1 reply; 9+ messages in thread
From: Junwang Zhao @ 2026-02-03 15:54 UTC (permalink / raw)
To: cca5507 <[email protected]>; +Cc: Amit Langote <[email protected]>; Daniil Davydov <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
On Tue, Feb 3, 2026 at 9:30 PM cca5507 <[email protected]> wrote:
>
> Hi,
>
> Some comments for v5:
>
> 0001
> ====
>
> 1) heap_begin_batch()
>
> ```
> /* Single allocation for HeapBatch header + tupdata array */
> alloc_size = sizeof(HeapBatch) + sizeof(HeapTupleData) * maxitems;
> hb = palloc(alloc_size);
> hb->tupdata = (HeapTupleData *) ((char *) hb + sizeof(HeapBatch));
> ```
>
> Do we need a MAXALIGN() here to avoid unaligned access? Something like this:
TBH I don't think this single allocation helps too much, it's not on
the hot path,
but makes the code harder to read ;(
>
> ```
> /* Single allocation for HeapBatch header + tupdata array */
> alloc_size = MAXALIGN(sizeof(HeapBatch)) + sizeof(HeapTupleData) * maxitems;
> hb = palloc(alloc_size);
> hb->tupdata = (HeapTupleData *) ((char *) hb + MAXALIGN(sizeof(HeapBatch)));
> ```
>
> Or how about just using zero-length array:
>
> ```
> typedef struct HeapBatch
> {
> Buffer buf;
> int maxitems;
> int nitems;
> HeapTupleData tupdata[FLEXIBLE_ARRAY_MEMBER];
> } HeapBatch;
>
> // and
> hb = palloc(offsetof(HeapBatch, tupdata) + sizeof(HeapTupleData) * maxitems);
> ```
>
> 2) pgstat_count_heap_getnext_batch()
>
> ```
> #define pgstat_count_heap_getnext_batch(rel, n) \
> do { \
> if (pgstat_should_count_relation(rel)) \
> (rel)->pgstat_info->counts.tuples_returned += n; \
> } while (0)
> ```
>
> "+= n" -> "+= (n)", just like pgstat_count_index_tuples().
>
> 0002
> ====
>
> 1) TupleBatchCreate()
>
> ```
> /* Single allocation for TupleBatch + inslots + outslots arrays */
> alloc_size = sizeof(TupleBatch) + 2 * sizeof(TupleTableSlot *) * capacity;
> b = palloc(alloc_size);
> inslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch));
> outslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch) +
> sizeof(TupleTableSlot *) * capacity);
> ```
>
> Do we need a MAXALIGN() here to avoid unaligned access?
>
> 2) TupleBatchReset()
>
> ```
> for (int i = 0; i < b->maxslots; i++)
> {
> ExecClearTuple(b->inslots[i]);
> if (drop_slots)
> ExecDropSingleTupleTableSlot(b->inslots[i]);
> }
> ```
>
> ExecDropSingleTupleTableSlot() will call ExecClearTuple(), so ExecClearTuple() will be
> called twice if drop_slots is true, I think we can avoid this.
>
> 3) ScanCanUseBatching()
>
> In heap_beginscan(), we may disable page-at-a-time mode:
>
> ```
> /*
> * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
> */
> if (!(snapshot && IsMVCCSnapshot(snapshot)))
> scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;
> ```
>
> It seems that ScanCanUseBatching() didn't consider this.
>
> 4) struct TupleBatch
>
> ```
> struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
> struct TupleTableSlot **outslots; /* slots for tuples going "out of"
> * batch */
> struct TupleTableSlot **activeslots;
> ```
>
> I think we can remove the word "struct".
>
> 5) ExecScanExtendedBatchSlot()
>
> ```
> /* Get next input slot from current batch, or refill */
> if (!TupleBatchHasMore(b))
> {
> if (!accessBatchMtd(node))
> return NULL;
> }
> ```
>
> I think we cannot just return NULL here, see comments in ExecScanExtended():
>
> ```
> /*
> * if the slot returned by the accessMtd contains NULL, then it means
> * there is nothing more to scan so we just return an empty slot,
> * being careful to use the projection result slot so it has correct
> * tupleDesc.
> */
> if (TupIsNull(slot))
> {
> if (projInfo)
> return ExecClearTuple(projInfo->pi_state.resultslot);
> else
> return slot;
> }
> ```
>
> And why not just write this function like ExecScanExtended() and ExecScanFetch()?
>
> --
> Regards,
> ChangAo Chen
--
Regards
Junwang Zhao
^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Batching in executor
@ 2026-03-24 00:59 Amit Langote <[email protected]>
parent: Junwang Zhao <[email protected]>
0 siblings, 1 reply; 9+ messages in thread
From: Amit Langote @ 2026-03-24 00:59 UTC (permalink / raw)
To: Junwang Zhao <[email protected]>; +Cc: cca5507 <[email protected]>; Daniil Davydov <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
Hi,
Here is a significantly revised version of the patch series. A lot has
changed since the January submission, so I want to summarize the
design changes before getting into the patches. I think it does
address the points in the two reviews that landed since v5 but maybe a
bunch of points became moot after my rewrite of the relevant portions
(thanks Junwang and ChangAo for the review in any case).
At this point it might be better to think of this as targeting v20,
except that if there is review bandwidth in the remaining two weeks
before the v19 feature freeze, the rs_vistuples[] change described
below as a standalone improvement to the existing pagemode scan path
could be considered for v19, though that too is an optimistic
scenario.
It is also worth noting that Andres identified a number of
inefficiencies in the existing scan path in:
Re: unnecessary executor overheads around seqscans
https://postgr.es/m/xzflwwjtwxin3dxziyblrnygy3gfygo5dsuw6ltcoha73ecmnf%40nh6nonzta7kw
that are worth fixing independently of batching. Some of those fixes
may be better pursued first, both because they benefit all scan paths
and because they would make batching's gains more honest.
Separately, after looking at the previous version, Andres pointed out
offlist two fundamental issues with the patch's design:
* The heapam implementation (in a version of the patch I didn't post
to the thread) duplicated heap_prepare_pagescan() logic in a separate
batch-specific code path, which is not acceptable as changes should
benefit the existing slot interface too. Code duplication is not good
either from a future maintainability aspect. The v5 version of that
code is not great in that respect either; it instead duplicated
heapggettup_pagemode() to slap batching on it.
* Allocating executor_batch_rows slots on the executor side to receive
rows from the AM adds significant overhead for slot initialization and
management, and for non-row-organized AMs that do not produce
individual rows at all, those slots would never be meaningfully
populated.
In any case, he just wasn't a fan of the slot-array approach the
moment I mentioned it. The previous version had two slot arrays,
inslots and outslots, of TTSOpsHeapTuple type (not
TTSOpsBufferHeapTuple because buffer pins were managed by the batch
code, which has its own modularity/correctness issues), populated via
a materialize_all callback. A batch qual evaluator would copy
qualifying tuples into outslots, with an activeslots pointer switching
between the two depending on whether batch qual evaluation was used.
The new design addresses both issues and differs from the previous
version in several other ways:
* Single slot instead of slot arrays: there is a single
TupleTableSlot, reusing the scan node's ss_ScanTupleSlot whose type
was already determined by the AM via table_slot_callbacks(). The slot
is re-pointed to each HeapTuple in the current buffer page via a new
repoint_slot AM callback, with no materialization or copying. Tuples
are returned one by one from the executor's perspective, but the AM
serves them in page-sized batches from pre-built HeapTupleData
descriptors in rs_vistuples[], avoiding repeated descent into heapam
per tuple. This is heapam's implementation of the batch interface;
there is no intention to force other AMs into the same row-oriented
model.
* Batch qual evaluator not included: with the single-slot model,
quals are evaluated per tuple via the existing ExecQual path after
each repoint_slot call. A natural next step would be a new opcode
(EEOP) that calls repoint_slot() internally within expression
evaluation, allowing ExecQual to advance through multiple tuples from
the same batch without returning to the scan node each time, with qual
results accumulated in a bitmask in ExprState. The details of that
will be worked out in a follow-on series.
* heapgettup_pagemode_batch() gone: patch 0001 (described below) makes
HeapScanDesc store full HeapTupleData entries in rs_vistuples[], which
allows heap_getnextbatch() to simply advance a slice pointer into that
array without any additional copying or re-entering heap code, making
a separate batch-specific scan function unnecessary.
* TupleBatch renamed to RowBatch: "row batch" is more natural
terminology for this concept and also consistent with how similar
abstractions are named in columnar and OLAP systems.
* AM callbacks now take RowBatch directly: previously
heap_getnextbatch() returned a void pointer that the executor would
store into RowBatch.am_payload, because only the executor knew the
internals of RowBatch. Now the AM receives RowBatch directly as a
parameter and can populate it without the executor acting as an
intermediary. This is also why RowBatch is introduced in its own
patch ahead of the AM API addition, so the struct definition is
available to both sides.
Patch 0001 changes rs_vistuples[] to store full HeapTupleData entries
instead of OffsetNumbers, as a standalone improvement to the existing
pagemode scan path. Measured on a pg_prewarm'd (also vaccum freeze'd
in the all-visible case) table with 1M/5M/10M rows:
query all-visible not-all-visible
count(*) -0.2% to +0.9% -0.4% to +0.5%
count(*) WHERE id % 10 = 0 -1.1% to +3.4% +0.2% to +1.5%
SELECT * LIMIT 1 OFFSET N -2.2% to -0.6% -0.9% to +6.6%
SELECT * WHERE id%10=0 LIMIT -0.8% to +3.9% +0.9% to +9.6%
No significant regression on either page type. The structural
improvement is most visible on not-all-visible pages where
HeapTupleSatisfiesMVCCBatch() already reads every tuple header during
visibility checks, so persisting the result into rs_vistuples[]
eliminates the downstream re-read (in heapgettupe_pagemode()) with no
measurable overhead. That said, these numbers are somewhat noisy on
my machine. Results on other machines would be welcome.
Patches 0002-0005 add the RowBatch infrastructure, the batch AM API
and heapam implementation including seqscan variants that use the new
scan_getnextbatch() API, and EXPLAIN (ANALYZE, BATCHES) support,
respectively. With batching enabled (executor_batch_rows=300,
~MaxHeapTuplesPerPage):
query all-visible not-all-visible
count(*) +11 to +15% +9 to +13%
count(*) WHERE id % 10 = 0 +6 to +11% +10 to +14%
SELECT * LIMIT 1 OFFSET N +16 to +19% +16 to +22%
SELECT * WHERE id%10=0 LIMIT +8 to +10% +8 to +13%
With executor_batch_rows=0, results are within noise of master across
all query types and sizes, confirming no regression from the
infrastructure changes themselves. The not-all-visible results tend
to show slightly higher gains than the all-visible case. This is
likely because the existing heapam code is more optimized for the
all-visible path, so the not-all-visible path, which goes through
HeapTupleSatisfiesMVCCBatch() for per-tuple visibility checks, has
more headroom that batching can exploit.
Setting aside the current series for a moment, there are some broader
design questions worth raising while we have attention on this area.
Some of these echo points Tomas raised in his first reply on this
thread, and I am reiterating them deliberately since I have not
managed to fully address them on my own or I simply didn't need to for
the TAM-to-scan-node batching and think they would benefit from wider
input rather than just my own iteration.
We should also start thinking about other ways the executor can
consume batch rows, not always assuming they are presented as
HeapTupleData. For instance, an AM could expose decoded column arrays
directly to operators that can consume them, bypassing slot-based
deform entirely, or a columnar AM could implement scan_getnextbatch by
decoding column strips directly into the batch without going through
per-tuple HeapTupleData at all. Feedback on whether the current
RowBatch design and the choices made in the scan_getnextbatch and
RowBatchOps API make that sort of thing harder than it needs to be
would be appreciated. For example, heapam's implementation of
scan_getnextbatch uses a single TTSOpsBufferHeapTuple slot re-pointed
to HeapTupleData entries one at a time via repoint_slot in
RowBatchHeapOps. That works for heapam but a columnar AM could
implement scan_getnextbatch to decode column strips directly into
arrays in the batch, with no per-row repoint step needed at all. Any
adjustments that would make RowBatch more AM-agnostic are worth
discussing now before the design hardens.
There are also broader open questions about how far the batch model
can extend beyond the scan node. Qual pushdown into the AM has been
discussed in nearby threads and would be one way to allow expression
evaluation to happen before data reaches the executor proper, though
that is a separate effort. For the purposes of this series, expression
evaluation still happens in the executor after scan_getnextbatch
returns. If the scan node does not project, the buffer heap slot is
passed directly to the parent node, which calls slot callbacks to
deform as needed. But once a node above projects, aggregates, or
joins, the notion of a page-sized batch from a single AM loses its
meaning and virtual slots take over. Whether RowBatch is usable or
meaningful beyond the scan/TAM boundary in any form, and whether the
core executor will ever have non-HeapTupleData batch consumption paths
or leave that entirely to extensions, are open questions worth
discussing.
For RowBatch to eventually play the role that TupleTableSlot plays for
row-at-a-time execution, something inside it would need to serve as
the common currency for batch data, analogous to TupleTableSlot's
datum/isnull arrays. Column arrays are the obvious direction, but even
that leaves open the question of representation. PostgreSQL's Datum is
a pointer-sized abstraction that boxes everything, whereas vectorized
systems use typed packed arrays of native types with validity
bitmasks, which is a significant part of why tight vectorized loops
are fast there. Whether column arrays of Datum would be good enough,
or whether going further toward typed packed arrays would be necessary
to get meaningful vectorization, is a deeper design question that this
series deliberately does not try to answer.
Even though the focus is on getting batching working at the scan/TAM
boundary first, thoughts on any of these points would be welcome.
--
Thanks, Amit Langote
Attachments:
[application/x-patch] v6-0003-Add-batch-table-AM-API-and-heapam-implementation.patch (19.0K, 2-v6-0003-Add-batch-table-AM-API-and-heapam-implementation.patch)
download | inline diff:
From a095d26e1b5a361a7d42300e5364da948496f2ba Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 23 Mar 2026 18:21:47 +0900
Subject: [PATCH v6 3/5] Add batch table AM API and heapam implementation
Introduce table AM callbacks for batched tuple fetching:
scan_begin_batch, scan_getnextbatch, scan_reset_batch, and
scan_end_batch. AMs implement all four or none; checked by
table_supports_batching().
scan_reset_batch releases held resources (e.g. buffer pins)
without freeing, allowing reuse across rescans.
Provide the heapam implementation. HeapPageBatch (stored in
RowBatch.am_payload) is a thin slice descriptor over the scan's
rs_vistuples[] array, which was introduced in the previous commit.
Rather than owning a copy of tuple headers, HeapPageBatch holds a
pointer into scan->rs_vistuples[] for the current slice and a buffer
pin for the current page.
heap_getnextbatch() calls heap_prepare_pagescan() to populate
rs_vistuples[] for each new page, then re-points hb->tuples to the
next slice of rs_vistuples[] on each call. If the page has more
tuples than the executor's max_rows, subsequent calls return the
next slice without re-entering page preparation. The buffer pin is
held until the page is fully consumed.
scan_begin_batch creates a single TupleTableSlot with
TTSOpsBufferHeapTuple ops. heap_repoint_slot() re-points this slot
to each tuple in turn via ExecStoreBufferHeapTuple(). Consumers
that need to retain the slot across calls rely on the normal slot
materialization contract.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 229 ++++++++++++++++++++++-
src/backend/access/heap/heapam_handler.c | 8 +-
src/include/access/heapam.h | 33 ++++
src/include/access/tableam.h | 136 ++++++++++++++
src/include/pgstat.h | 4 +-
5 files changed, 403 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c6d0aacc5c9..e70c0ccbe82 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -43,6 +43,7 @@
#include "catalog/pg_database.h"
#include "catalog/pg_database_d.h"
#include "commands/vacuum.h"
+#include "executor/execRowBatch.h"
#include "pgstat.h"
#include "port/pg_bitutils.h"
#include "storage/lmgr.h"
@@ -109,6 +110,7 @@ static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+static void heap_repoint_slot(RowBatch *b, int idx);
/*
@@ -1213,7 +1215,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_cbuf = InvalidBuffer;
/*
- * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
+ * Disable page-at-a-time mode if the snapshot does not allow it.
*/
if (!(snapshot && IsMVCCSnapshot(snapshot)))
scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;
@@ -1463,7 +1465,7 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
return &scan->rs_ctup;
}
@@ -1491,13 +1493,232 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
ExecStoreBufferHeapTuple(&scan->rs_ctup, slot,
scan->rs_cbuf);
return true;
}
+/*---------- Batching support -----------*/
+
+static const RowBatchOps RowBatchHeapOps =
+{
+ .repoint_slot = heap_repoint_slot
+};
+
+/*
+ * heap_batch_feasible
+ * Batching requires a MVCC snapshot since it relies on
+ * page-at-a-time mode, which heap_beginscan() disables for
+ * non-MVCC snapshots.
+ */
+bool
+heap_batch_feasible(Relation relation, Snapshot snapshot)
+{
+ return snapshot && IsMVCCSnapshot(snapshot);
+}
+
+/*
+ * heap_begin_batch
+ * Initialize AM-side batch state for a heap scan.
+ *
+ * Allocates a HeapPageBatch, which acts as a thin slice descriptor over
+ * the scan's rs_vistuples[] array. Unlike the previous version there is
+ * no separate tuple header storage in HeapPageBatch itself; rs_vistuples[]
+ * in HeapScanDescData (populated by page_collect_tuples() via
+ * heap_prepare_pagescan()) serves as the page-level buffer. HeapPageBatch
+ * holds a pointer into that array for the current slice and the buffer pin
+ * for the current page.
+ *
+ * b->slot must be a TTSOpsBufferHeapTuple slot.
+ */
+void
+heap_begin_batch(TableScanDesc sscan, RowBatch *b)
+{
+ HeapPageBatch *hb;
+
+ /* Batch path relies on executor-level qual eval, not AM scan keys */
+ Assert(sscan->rs_nkeys == 0);
+ Assert(TTS_IS_BUFFERTUPLE(b->slot));
+
+ hb = palloc(sizeof(HeapPageBatch));
+ hb->tuples = NULL;
+ hb->ntuples = 0;
+ hb->nextitem = 0;
+ hb->buf = InvalidBuffer;
+
+ b->am_payload = hb;
+ b->ops = &RowBatchHeapOps;
+}
+
+/*
+ * heap_reset_batch
+ * Release pin and reset for rescan, keeping allocations.
+ */
+void
+heap_reset_batch(TableScanDesc sscan, RowBatch *b)
+{
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+
+ Assert(hb != NULL);
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+ hb->ntuples = 0;
+ hb->nextitem = 0;
+}
+
+/*
+ * heap_end_batch
+ * Release all batch resources.
+ */
+void
+heap_end_batch(TableScanDesc sscan, RowBatch *b)
+{
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+
+ if (BufferIsValid(hb->buf))
+ ReleaseBuffer(hb->buf);
+
+ pfree(hb);
+ b->am_payload = NULL;
+}
+
+/*
+ * heap_getnextbatch
+ * Fetch the next slice of visible tuples from a heap scan.
+ *
+ * Serves slices from the current page's rs_vistuples[] array. If the
+ * current page has remaining tuples, sets hb->tuples to point at the next
+ * slice without re-entering the page scan. If the page is exhausted,
+ * advances to the next page via heap_fetch_next_buffer(), prepares it
+ * with heap_prepare_pagescan(), and serves the first slice from it.
+ *
+ * hb->tuples points directly into scan->rs_vistuples[]; the entries remain
+ * valid as long as hb->buf (the page's buffer pin) is held. The pin is
+ * released at the top of the next call once the page is fully consumed.
+ *
+ * Each call returns at most b->max_rows tuples.
+ *
+ * Returns true if tuples were fetched, false at end of scan.
+ */
+bool
+heap_getnextbatch(TableScanDesc sscan, RowBatch *b, ScanDirection dir)
+{
+ HeapScanDesc scan = (HeapScanDesc) sscan;
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+ int remaining;
+ int nserve;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+
+ /*
+ * Try to serve from the current page first. No page advance, no buffer
+ * management, no re-entry into heap code.
+ */
+ remaining = scan->rs_ntuples - hb->nextitem;
+ if (remaining > 0)
+ {
+ nserve = Min(remaining, b->max_rows);
+
+ hb->tuples = &scan->rs_vistuples[hb->nextitem];
+ hb->ntuples = nserve;
+ hb->nextitem += nserve;
+
+ b->nrows = nserve;
+ b->pos = 0;
+
+ pgstat_count_heap_getnext(sscan->rs_rd, nserve);
+ return true;
+ }
+
+ /*
+ * Current page exhausted. Advance to the next page with visible tuples.
+ */
+ for (;;)
+ {
+ /*
+ * Release the previous page's pin. The page is fully consumed at
+ * this point -- all slices have been served.
+ */
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+
+ heap_fetch_next_buffer(scan, dir);
+
+ if (!BufferIsValid(scan->rs_cbuf))
+ {
+ /* End of scan */
+ scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_prefetch_block = InvalidBlockNumber;
+ scan->rs_inited = false;
+ b->nrows = 0;
+ return false;
+ }
+
+ Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+
+ /*
+ * Prepare the page: prune, run visibility checks, and populate
+ * scan->rs_vistuples[0..rs_ntuples-1] via page_collect_tuples().
+ */
+ heap_prepare_pagescan(sscan);
+
+ if (scan->rs_ntuples > 0)
+ {
+ /*
+ * Pin the page so tuple data stays valid while the executor
+ * processes slices. Released at the top of the next call
+ * once the page is fully consumed.
+ */
+ IncrBufferRefCount(scan->rs_cbuf);
+ hb->buf = scan->rs_cbuf;
+
+ nserve = Min(scan->rs_ntuples, b->max_rows);
+
+ hb->tuples = &scan->rs_vistuples[0];
+ hb->ntuples = nserve;
+ hb->nextitem = nserve;
+
+ b->nrows = nserve;
+ b->pos = 0;
+
+ pgstat_count_heap_getnext(sscan->rs_rd, nserve);
+ return true;
+ }
+
+ /* Empty page (all dead/invisible tuples), try next */
+ }
+}
+
+/*
+ * heap_repoint_slot
+ * Re-point the batch's single slot to the tuple at index idx.
+ *
+ * Called by RowBatchGetNextSlot() for each tuple served to the parent
+ * node. hb->tuples[idx] was populated by page_collect_tuples() via
+ * heap_prepare_pagescan() and remains valid as long as hb->buf is pinned.
+ */
+static void
+heap_repoint_slot(RowBatch *b, int idx)
+{
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+
+ Assert(idx >= 0 && idx < hb->ntuples);
+ Assert(TTS_IS_BUFFERTUPLE(b->slot));
+
+ ExecStoreBufferHeapTuple(&hb->tuples[idx], b->slot, hb->buf);
+}
+
+/*----- End of batching support -----*/
+
void
heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid)
@@ -1639,7 +1860,7 @@ heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
* if we get here it means we have a new current scan tuple, so point to
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
ExecStoreBufferHeapTuple(&scan->rs_ctup, slot, scan->rs_cbuf);
return true;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2fd120028bb..8124d573ac3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2348,7 +2348,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
ExecStoreBufferHeapTuple(tuple, slot, hscan->rs_cbuf);
/* Count successfully-fetched tuples as heap fetches */
- pgstat_count_heap_getnext(scan->rs_rd);
+ pgstat_count_heap_getnext(scan->rs_rd, 1);
return true;
}
@@ -2637,6 +2637,12 @@ static const TableAmRoutine heapam_methods = {
.scan_rescan = heap_rescan,
.scan_getnextslot = heap_getnextslot,
+ .scan_batch_feasible = heap_batch_feasible,
+ .scan_begin_batch = heap_begin_batch,
+ .scan_getnextbatch = heap_getnextbatch,
+ .scan_end_batch = heap_end_batch,
+ .scan_reset_batch = heap_reset_batch,
+
.scan_set_tidrange = heap_set_tidrange,
.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 09b9566d0ac..0783fa13c4c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -107,6 +107,32 @@ typedef struct HeapScanDescData
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
+/*
+ * HeapPageBatch -- heapam-private page-level batch state.
+ *
+ * Thin slice descriptor over the scan's rs_vistuples[] array. Rather
+ * than owning a copy of tuple headers, HeapPageBatch holds a pointer
+ * into scan->rs_vistuples[] for the current slice, which was populated
+ * by page_collect_tuples() during heap_prepare_pagescan().
+ *
+ * The executor consumes tuples in slices. Each heap_getnextbatch call
+ * re-points tuples to the next slice and advances nextitem, serving up
+ * to RowBatch.max_rows tuples from the current page before advancing
+ * to the next.
+ *
+ * buf holds the pin for the current page. tuple data referenced via
+ * tuples remains valid as long as buf is pinned.
+ *
+ * Stored in RowBatch.am_payload.
+ */
+typedef struct HeapPageBatch
+{
+ HeapTupleData *tuples; /* points into scan->rs_vistuples[nextitem] */
+ int ntuples; /* tuples in current slice */
+ int nextitem; /* next unserved tuple index in rs_vistuples[] */
+ Buffer buf; /* pinned buffer for current page */
+} HeapPageBatch;
+
typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
@@ -362,6 +388,13 @@ extern void heap_endscan(TableScanDesc sscan);
extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
extern bool heap_getnextslot(TableScanDesc sscan,
ScanDirection direction, TupleTableSlot *slot);
+
+extern bool heap_batch_feasible(Relation relation, Snapshot snapshot);
+extern void heap_begin_batch(TableScanDesc sscan, RowBatch *batch);
+extern bool heap_getnextbatch(TableScanDesc sscan, RowBatch *batch, ScanDirection dir);
+extern void heap_end_batch(TableScanDesc sscan, RowBatch *batch);
+extern void heap_reset_batch(TableScanDesc sscan, RowBatch *batch);
+
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 06084752245..a72be111c26 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -275,6 +275,8 @@ typedef void (*IndexBuildCallback) (Relation index,
bool tupleIsAlive,
void *state);
+typedef struct RowBatch RowBatch;
+
/*
* API struct for a table AM. Note this must be allocated in a
* server-lifetime manner, typically as a static const struct, which then gets
@@ -351,6 +353,56 @@ typedef struct TableAmRoutine
ScanDirection direction,
TupleTableSlot *slot);
+ /* ------------------------------------------------------------------------
+ * Batched scan support
+ * ------------------------------------------------------------------------
+ */
+
+ /*
+ * Returns true if the AM can support batching for a scan with the
+ * given snapshot. Called at plan init time before the scan descriptor
+ * exists. AMs that have no snapshot-based restrictions can omit this
+ * callback, in which case batching is considered feasible.
+ */
+ bool (*scan_batch_feasible)(Relation relation, Snapshot snapshot);
+
+ /*
+ * Initialize AM-owned batch state for a scan. Called once before
+ * the first scan_getnextbatch call. The AM allocates whatever
+ * private state it needs and stores it in b->am_payload. b->slot
+ * is the scan node's ss_ScanTupleSlot, whose type was already
+ * determined by the AM via table_slot_callbacks(). The AM's
+ * repoint_slot callback re-points it to each tuple in the batch
+ * in turn. Future interfaces may allow the AM to expose batch
+ * data in other forms without going through a slot.
+ */
+ void (*scan_begin_batch)(TableScanDesc sscan, RowBatch *b);
+
+ /*
+ * Fetch the next batch of tuples from the scan into b. Sets b->nrows
+ * to the number of tuples available and resets b->pos to 0. Returns
+ * true if any tuples were fetched, false at end of scan. The caller
+ * advances through the batch via RowBatchGetNextSlot(), which calls
+ * ops->repoint_slot for each position up to b->nrows.
+ */
+ bool (*scan_getnextbatch)(TableScanDesc sscan, RowBatch *b,
+ ScanDirection dir);
+
+ /*
+ * Release all AM-owned batch resources, including any buffer pins
+ * held in am_payload. Called when the scan node is shut down.
+ * After this call b->am_payload must not be used.
+ */
+ void (*scan_end_batch)(TableScanDesc sscan, RowBatch *b);
+
+ /*
+ * Reset batch state for rescan. Release any held resources (e.g.
+ * buffer pins) and reset counts, but keep the allocation so the
+ * next getnextbatch call can reuse it without re-entering
+ * begin_batch.
+ */
+ void (*scan_reset_batch)(TableScanDesc sscan, RowBatch *b);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -1047,6 +1099,90 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_supports_batching
+ * Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation, Snapshot snapshot)
+{
+ const TableAmRoutine *tam = relation->rd_tableam;
+
+ if (tam->scan_getnextbatch == NULL)
+ return false;
+
+ Assert(tam->scan_begin_batch != NULL);
+ Assert(tam->scan_reset_batch != NULL);
+ Assert(tam->scan_end_batch != NULL);
+
+ /*
+ * Optional: AM may restrict batching based on snapshot or other conditions.
+ */
+ if (tam->scan_batch_feasible != NULL &&
+ !tam->scan_batch_feasible(relation, snapshot))
+ return false;
+
+ return true;
+}
+
+/*
+ * table_scan_begin_batch
+ * Allocate AM-owned batch payload in the RowBatch
+ */
+static inline void
+table_scan_begin_batch(TableScanDesc sscan, RowBatch *b)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_begin_batch != NULL);
+
+ return tam->scan_begin_batch(sscan, b);
+}
+
+/*
+ * table_scan_getnextbatch
+ * Fetch the next batch of tuples from the AM. Returns true if tuples
+ * were fetched, false at end of scan. Only forward scans are supported.
+ */
+static inline bool
+table_scan_getnextbatch(TableScanDesc sscan, RowBatch *b, ScanDirection dir)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(tam->scan_getnextbatch != NULL);
+
+ return tam->scan_getnextbatch(sscan, b, dir);
+}
+
+/*
+ * table_scan_end_batch
+ * Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, RowBatch *b)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_end_batch != NULL);
+
+ tam->scan_end_batch(sscan, b);
+}
+
+/*
+ * table_scan_reset_batch
+ * Reset AM-owned batch state for rescan without freeing.
+ */
+static inline void
+table_scan_reset_batch(TableScanDesc sscan, RowBatch *b)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_reset_batch != NULL);
+
+ tam->scan_reset_batch(sscan, b);
+}
+
/* ----------------------------------------------------------------------------
* TID Range scanning related functions.
* ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 216b93492ba..0344c4e88c3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -695,10 +695,10 @@ extern void pgstat_report_analyze(Relation rel,
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.numscans++; \
} while (0)
-#define pgstat_count_heap_getnext(rel) \
+#define pgstat_count_heap_getnext(rel, n) \
do { \
if (pgstat_should_count_relation(rel)) \
- (rel)->pgstat_info->counts.tuples_returned++; \
+ (rel)->pgstat_info->counts.tuples_returned += (n); \
} while (0)
#define pgstat_count_heap_fetch(rel) \
do { \
--
2.47.3
[application/x-patch] v6-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch (12.8K, 3-v6-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch)
download | inline diff:
From d7e8f76144cb27e761e2d4bc9c687dd0a2de203e Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 12 Mar 2026 09:18:04 +0900
Subject: [PATCH v6 1/5] heapam: store full HeapTupleData in rs_vistuples[] for
pagemode scans
page_collect_tuples() builds full HeapTupleData headers for every
visible tuple on a page -- t_data, t_len, t_self, t_tableOid -- but
previously discarded them immediately after writing just the OffsetNumber
of each survivor into rs_vistuples[]. heapgettup_pagemode() then
re-derived those same values on every call from the saved OffsetNumber
via PageGetItemId() and PageGetItem().
Change rs_vistuples[] element type from OffsetNumber to HeapTupleData
and populate it inside page_collect_tuples() while lpp, lineoff, page,
block, and relid are already in scope, so no additional page reads are
needed. For the all_visible path (the common case on a primary not
under active modification) the write piggy-backs on the existing
per-lineoff loop. For the !all_visible path, HeapTupleData entries are
written during the visibility loop and compacted to visible survivors
afterwards using batchmvcc.visible[], avoiding a return to pd_linp[] via
PageGetItemId().
With rs_vistuples[] populated, heapgettup_pagemode() replaces the
per-tuple PageGetItemId/PageGetItem calls with a single struct copy:
*tuple = scan->rs_vistuples[lineindex];
The stack-local HeapTupleData array in BatchMVCCState is eliminated by
passing rs_vistuples[] directly to HeapTupleSatisfiesMVCCBatch(),
saving MaxHeapTuplesPerPage * 24 bytes of stack per page_collect_tuples()
call. HeapTupleSatisfiesMVCCBatch() loses its vistuples_dense parameter
since compaction is now handled by the caller.
t_tableOid is pre-initialized for all rs_vistuples[] entries at scan
start in heap_beginscan(), eliminating a store per visible tuple from the
fill loop. The raw ItemId word is read once per tuple with lp_off and
lp_len extracted via mask and shift rather than calling ItemIdGetOffset()
and ItemIdGetLength() separately, avoiding a potential second load from
the same address in the inner loop.
Having pre-built HeapTupleData headers available at the scan descriptor
level also lays groundwork for a batched tuple interface, where an AM
can serve multiple tuples per call without repeating the line pointer
traversal.
Suggested-by: Andres Freund <[email protected]>
---
src/backend/access/heap/heapam.c | 73 ++++++++++++---------
src/backend/access/heap/heapam_handler.c | 19 ++----
src/backend/access/heap/heapam_visibility.c | 21 +++---
src/include/access/heapam.h | 5 +-
4 files changed, 58 insertions(+), 60 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e5bd062de77..c6d0aacc5c9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -524,7 +524,6 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
BlockNumber block, int lines,
bool all_visible, bool check_serializable)
{
- Oid relid = RelationGetRelid(scan->rs_base.rs_rd);
int ntup = 0;
int nvis = 0;
BatchMVCCState batchmvcc;
@@ -536,7 +535,7 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
for (OffsetNumber lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
{
ItemId lpp = PageGetItemId(page, lineoff);
- HeapTuple tup;
+ HeapTuple tup = &scan->rs_vistuples[ntup];
if (unlikely(!ItemIdIsNormal(lpp)))
continue;
@@ -549,25 +548,33 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
*/
if (!all_visible || check_serializable)
{
- tup = &batchmvcc.tuples[ntup];
+ uint32 lp_val = *(uint32 *) lpp;
- tup->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
- tup->t_len = ItemIdGetLength(lpp);
- tup->t_tableOid = relid;
+ tup->t_data = (HeapTupleHeader) ((char *) page + (lp_val & 0x7fff));
+ tup->t_len = lp_val >> 17;
+ Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
ItemPointerSet(&(tup->t_self), block, lineoff);
}
- /*
- * If the page is all visible, these fields otherwise won't be
- * populated in loop below.
- */
if (all_visible)
{
if (check_serializable)
- {
batchmvcc.visible[ntup] = true;
+
+ /*
+ * In the all_visible && !check_serializable path, the block
+ * above was skipped, so tup's fields have not been set yet.
+ * Fill them here while lpp is still in hand.
+ */
+ if (!check_serializable)
+ {
+ uint32 lp_val = *(uint32 *) lpp;
+
+ tup->t_data = (HeapTupleHeader) ((char *) page + (lp_val & 0x7fff));
+ tup->t_len = lp_val >> 17;
+ Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
+ ItemPointerSet(&tup->t_self, block, lineoff);
}
- scan->rs_vistuples[ntup] = lineoff;
}
ntup++;
@@ -598,11 +605,24 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
{
HeapCheckForSerializableConflictOut(batchmvcc.visible[i],
scan->rs_base.rs_rd,
- &batchmvcc.tuples[i],
+ &scan->rs_vistuples[i],
buffer, snapshot);
}
}
+
+ /* Now compact rs_vistuples[] to visible survivors only */
+ if (!all_visible)
+ {
+ int dst = 0;
+ for (int i = 0; i < ntup; i++)
+ {
+ if (batchmvcc.visible[i])
+ scan->rs_vistuples[dst++] = scan->rs_vistuples[i];
+ }
+ Assert(dst == nvis);
+ }
+
return nvis;
}
@@ -1073,14 +1093,13 @@ heapgettup_pagemode(HeapScanDesc scan,
ScanKey key)
{
HeapTuple tuple = &(scan->rs_ctup);
- Page page;
uint32 lineindex;
uint32 linesleft;
if (likely(scan->rs_inited))
{
/* continue from previously returned page/tuple */
- page = BufferGetPage(scan->rs_cbuf);
+ Assert(BufferIsValid(scan->rs_cbuf));
lineindex = scan->rs_cindex + dir;
if (ScanDirectionIsForward(dir))
@@ -1108,29 +1127,21 @@ heapgettup_pagemode(HeapScanDesc scan,
/* prune the page and determine visible tuple offsets */
heap_prepare_pagescan((TableScanDesc) scan);
- page = BufferGetPage(scan->rs_cbuf);
linesleft = scan->rs_ntuples;
lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
- /* block is the same for all tuples, set it once outside the loop */
- ItemPointerSetBlockNumber(&tuple->t_self, scan->rs_cblock);
-
/* lineindex now references the next or previous visible tid */
continue_page:
for (; linesleft > 0; linesleft--, lineindex += dir)
{
- ItemId lpp;
- OffsetNumber lineoff;
-
- Assert(lineindex < scan->rs_ntuples);
- lineoff = scan->rs_vistuples[lineindex];
- lpp = PageGetItemId(page, lineoff);
- Assert(ItemIdIsNormal(lpp));
-
- tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
- tuple->t_len = ItemIdGetLength(lpp);
- ItemPointerSetOffsetNumber(&tuple->t_self, lineoff);
+ /*
+ * Headers were pre-built by page_collect_tuples() into
+ * rs_vistuples[]. Copy the entry; t_data still points into the
+ * pinned page, which is safe for the lifetime of the current page
+ * scan.
+ */
+ *tuple = scan->rs_vistuples[lineindex];
/* skip any tuples that don't match the scan key */
if (key != NULL &&
@@ -1244,6 +1255,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
/* we only need to set this up once */
scan->rs_ctup.t_tableOid = RelationGetRelid(relation);
+ for (int i = 0; i < MaxHeapTuplesPerPage; i++)
+ scan->rs_vistuples[i].t_tableOid = RelationGetRelid(relation);
/*
* Allocate memory to keep track of page allocation for parallel workers
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 253a735b6c1..2fd120028bb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2153,9 +2153,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
{
BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) scan;
HeapScanDesc hscan = (HeapScanDesc) bscan;
- OffsetNumber targoffset;
- Page page;
- ItemId lp;
/*
* Out of range? If so, nothing more to look at on this page
@@ -2170,15 +2167,7 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
return false;
}
- targoffset = hscan->rs_vistuples[hscan->rs_cindex];
- page = BufferGetPage(hscan->rs_cbuf);
- lp = PageGetItemId(page, targoffset);
- Assert(ItemIdIsNormal(lp));
-
- hscan->rs_ctup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
- hscan->rs_ctup.t_len = ItemIdGetLength(lp);
- hscan->rs_ctup.t_tableOid = scan->rs_rd->rd_id;
- ItemPointerSet(&hscan->rs_ctup.t_self, hscan->rs_cblock, targoffset);
+ hscan->rs_ctup = hscan->rs_vistuples[hscan->rs_cindex];
pgstat_count_heap_fetch(scan->rs_rd);
@@ -2456,7 +2445,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
while (start < end)
{
uint32 mid = start + (end - start) / 2;
- OffsetNumber curoffset = hscan->rs_vistuples[mid];
+ OffsetNumber curoffset = hscan->rs_vistuples[mid].t_self.ip_posid;
if (tupoffset == curoffset)
return true;
@@ -2575,7 +2564,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
ItemPointerSet(&tid, block, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
&heapTuple, NULL, true))
- hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ hscan->rs_vistuples[ntup++] = heapTuple;
}
}
else
@@ -2604,7 +2593,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
if (valid)
{
- hscan->rs_vistuples[ntup++] = offnum;
+ hscan->rs_vistuples[ntup++] = loctup;
PredicateLockTID(scan->rs_rd, &loctup.t_self, snapshot,
HeapTupleHeaderGetXmin(loctup.t_data));
}
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index fc64f4343ce..cd6cd4d8d69 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1670,16 +1670,16 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
}
/*
- * Perform HeaptupleSatisfiesMVCC() on each passed in tuple. This is more
+ * Perform HeapTupleSatisfiesMVCC() on each passed in tuple. This is more
* efficient than doing HeapTupleSatisfiesMVCC() one-by-one.
*
- * To be checked tuples are passed via BatchMVCCState->tuples. Each tuple's
- * visibility is stored in batchmvcc->visible[]. In addition,
- * ->vistuples_dense is set to contain the offsets of visible tuples.
+ * Each tuple's visibility is stored in batchmvcc->visible[]. The caller
+ * is responsible for compacting the tuples array to contain only visible
+ * survivors after this function returns.
*
- * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that it
- * avoids a cross-translation-unit function call for each tuple, allows the
- * compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
+ * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that
+ * it avoids a cross-translation-unit function call for each tuple, allows
+ * the compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
* setting hint bits more efficiently (see the one BufferFinishSetHintBits()
* call below).
*
@@ -1689,7 +1689,7 @@ int
HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
int ntups,
BatchMVCCState *batchmvcc,
- OffsetNumber *vistuples_dense)
+ HeapTupleData *tuples)
{
int nvis = 0;
SetHintBitsState state = SHB_INITIAL;
@@ -1699,16 +1699,13 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
for (int i = 0; i < ntups; i++)
{
bool valid;
- HeapTuple tup = &batchmvcc->tuples[i];
+ HeapTuple tup = &tuples[i];
valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer, &state);
batchmvcc->visible[i] = valid;
if (likely(valid))
- {
- vistuples_dense[nvis] = tup->t_self.ip_posid;
nvis++;
- }
}
if (state == SHB_ENABLED)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 2fdc50b865b..09b9566d0ac 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -103,7 +103,7 @@ typedef struct HeapScanDescData
/* these fields only used in page-at-a-time mode and for bitmap scans */
uint32 rs_cindex; /* current tuple's index in vistuples */
uint32 rs_ntuples; /* number of visible tuples on page */
- OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
+ HeapTupleData rs_vistuples[MaxHeapTuplesPerPage]; /* tuples */
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
@@ -483,14 +483,13 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
*/
typedef struct BatchMVCCState
{
- HeapTupleData tuples[MaxHeapTuplesPerPage];
bool visible[MaxHeapTuplesPerPage];
} BatchMVCCState;
extern int HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
int ntups,
BatchMVCCState *batchmvcc,
- OffsetNumber *vistuples_dense);
+ HeapTupleData *tuples);
/*
* To avoid leaking too much knowledge about reorderbuffer implementation
--
2.47.3
[application/x-patch] v6-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch (6.5K, 4-v6-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch)
download | inline diff:
From 0d810ceed77e394883ab0e95eafe36051b546040 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 5 Mar 2026 17:42:19 +0900
Subject: [PATCH v6 2/5] Add RowBatch infrastructure for batched tuple
processing
Introduce RowBatch, a data carrier that allows table AMs to deliver
multiple rows per call and the executor to process them as a group.
RowBatch separates three concerns:
- am_payload: opaque, AM-owned storage (e.g. HeapBatch with pinned
page and tuple headers). The AM allocates this in its
scan_begin_batch callback.
- slots[]: TupleTableSlot array, created by RowBatchCreateSlots()
with AM-appropriate slot ops. Populated from am_payload by
ops->materialize_into_slots when the executor needs tuple data.
- max_rows: executor-set upper bound that the AM respects when
filling a batch.
RowBatch does not own selection/filtering state. Which rows survive
qual evaluation is the executor's concern, tracked separately in
scan node state. This keeps RowBatch focused on the AM-to-executor
data transfer boundary.
RowBatchOps provides a vtable for AM-specific operations; currently
only materialize_into_slots is defined.
---
src/backend/executor/Makefile | 1 +
src/backend/executor/execRowBatch.c | 54 ++++++++++++++++++
src/backend/executor/meson.build | 1 +
src/include/executor/execRowBatch.h | 88 +++++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 146 insertions(+)
create mode 100644 src/backend/executor/execRowBatch.c
create mode 100644 src/include/executor/execRowBatch.h
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..99a00e762f6 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
execAsync.o \
+ execRowBatch.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execRowBatch.c b/src/backend/executor/execRowBatch.c
new file mode 100644
index 00000000000..6a298813bd8
--- /dev/null
+++ b/src/backend/executor/execRowBatch.c
@@ -0,0 +1,54 @@
+/*-------------------------------------------------------------------------
+ *
+ * execRowBatch.c
+ * Helpers for RowBatch
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execRowBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execRowBatch.h"
+
+/*
+ * RowBatchCreate
+ * Allocate and initialize a new RowBatch envelope.
+ */
+RowBatch *
+RowBatchCreate(int max_rows)
+{
+ RowBatch *b;
+
+ Assert(max_rows > 0);
+
+ b = palloc(sizeof(RowBatch));
+ b->am_payload = NULL;
+ b->ops = NULL;
+ b->max_rows = max_rows;
+ b->nrows = 0;
+ b->pos = 0;
+ b->materialized = false;
+ b->slot = NULL;
+
+ return b;
+}
+
+/*
+ * RowBatchReset
+ * Reset an existing RowBatch envelope to empty.
+ */
+void
+RowBatchReset(RowBatch *b, bool drop_slots)
+{
+ Assert(b != NULL);
+
+ b->nrows = 0;
+ b->pos = 0;
+ b->materialized = false;
+ /* b->slot belongs to the owning PlanState node */
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index dc45be0b2ce..fd0bf80bacd 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'execAmi.c',
'execAsync.c',
+ 'execRowBatch.c',
'execCurrent.c',
'execExpr.c',
'execExprInterp.c',
diff --git a/src/include/executor/execRowBatch.h b/src/include/executor/execRowBatch.h
new file mode 100644
index 00000000000..021fdeecc73
--- /dev/null
+++ b/src/include/executor/execRowBatch.h
@@ -0,0 +1,88 @@
+/*-------------------------------------------------------------------------
+ *
+ * execRowBatch.h
+ * Executor batch envelope for passing row batch state upward
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execRowBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECROWBATCH_H
+#define EXECROWBATCH_H
+
+#include "executor/tuptable.h"
+
+typedef struct RowBatchOps RowBatchOps;
+
+/*
+ * RowBatch
+ *
+ * Data carrier from table AM to executor. The AM populates am_payload
+ * and nrows via scan_getnextbatch(). The executor calls ops->materialize_all
+ * to populate slots[] when it needs tuple data.
+ *
+ * Selection state (which rows survived qual eval) is owned by the executor,
+ * not the batch.
+ */
+typedef struct RowBatch
+{
+ void *am_payload;
+ const RowBatchOps *ops;
+
+ int max_rows; /* executor-set upper bound */
+ int nrows; /* rows TAM put in */
+ int pos; /* iteration position */
+ bool materialized; /* tuples in slots valid? */
+
+ TupleTableSlot *slot; /* row view */
+} RowBatch;
+
+/*
+ * RowBatchOps -- AM-specific operations on a RowBatch.
+ *
+ * Table AMs set b->ops during scan_begin_batch to provide
+ * callbacks that the executor uses to access batch contents.
+ *
+ * repoint_slot re-points the batch's single slot to the tuple at
+ * index idx within the current batch. The slot remains valid until
+ * the next call or until the batch is exhausted.
+ *
+ * Additional callbacks can be added here as new AMs or executor
+ * features require them.
+ */
+typedef struct RowBatchOps
+{
+ void (*repoint_slot) (RowBatch *b, int idx);
+} RowBatchOps;
+
+/* Create/teardown */
+extern RowBatch *RowBatchCreate(int max_rows);
+extern void RowBatchReset(RowBatch *b, bool drop_slots);
+
+/* Validation */
+static inline bool
+RowBatchIsValid(RowBatch *b)
+{
+ return b != NULL && b->max_rows > 0;
+}
+
+/* Iteration over materialized slots */
+static inline bool
+RowBatchHasMore(RowBatch *b)
+{
+ return b->pos < b->nrows;
+}
+
+static inline TupleTableSlot *
+RowBatchGetNextSlot(RowBatch *b)
+{
+ if (b->pos >= b->nrows)
+ return NULL;
+ b->ops->repoint_slot(b, b->pos++);
+ return b->slot;
+}
+
+#endif /* EXECROWBATCH_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 52f8603a7be..a2b0b1d99d4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2663,6 +2663,8 @@ RoleSpec
RoleSpecType
RoleStmtType
RollupData
+RowBatch
+RowBatchOps
RowCompareExpr
RowExpr
RowIdentityVarInfo
--
2.47.3
[application/x-patch] v6-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch (17.4K, 5-v6-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch)
download | inline diff:
From c5f58f57cda191408855ab243c05f15580ca5eef Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 23:09:37 +0900
Subject: [PATCH v6 5/5] Add EXPLAIN (BATCHES) option for tuple batching
statistics
Add a BATCHES option to EXPLAIN that reports per-node batch statistics
when a node uses batch mode execution.
For nodes that support batching (currently SeqScan), this shows the
number of batches fetched along with average, minimum, and maximum
rows per batch. Output is supported in both text and non-text formats.
Add regression tests covering text output, JSON format, filtered scans,
LIMIT, and disabled batching.
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/commands/explain.c | 44 +++++++++++
src/backend/commands/explain_state.c | 8 ++
src/backend/executor/execRowBatch.c | 44 ++++++++++-
src/backend/executor/nodeSeqscan.c | 8 +-
src/include/commands/explain_state.h | 1 +
src/include/executor/execRowBatch.h | 22 +++++-
src/include/executor/instrument.h | 1 +
src/test/regress/expected/explain.out | 107 ++++++++++++++++++++++++++
src/test/regress/sql/explain.sql | 59 ++++++++++++++
9 files changed, 291 insertions(+), 3 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 296ea8a1ed2..b507fec0dab 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -22,6 +22,7 @@
#include "commands/explain_format.h"
#include "commands/explain_state.h"
#include "commands/prepare.h"
+#include "executor/execRowBatch.h"
#include "foreign/fdwapi.h"
#include "jit/jit.h"
#include "libpq/pqformat.h"
@@ -519,6 +520,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_BUFFERS;
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
+ if (es->batches)
+ instrument_option |= INSTRUMENT_BATCHES;
/*
* We always collect timing for the entire statement, even when node-level
@@ -1372,6 +1375,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
int save_indent = es->indent;
bool haschildren;
bool isdisabled;
+ RowBatch *batch = NULL;
/*
* Prepare per-worker output buffers, if needed. We'll append the data in
@@ -2297,6 +2301,46 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->walusage);
+ /* BATCHES */
+ switch (nodeTag(plan))
+ {
+ case T_SeqScan:
+ batch = castNode(SeqScanState, planstate)->batch;
+ break;
+ default:
+ break;
+ }
+
+ if (es->batches && batch)
+ {
+ RowBatchStats *stats = batch->stats;
+
+ Assert(stats);
+ if (stats->batches > 0)
+ {
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfo(es->str,
+ "Batches: %lld Avg Rows: %.1f Max: %d Min: %d\n",
+ (long long) stats->batches,
+ RowBatchAvgRows(batch), stats->max_rows,
+ stats->min_rows == INT_MAX ? 0 :
+ stats->min_rows);
+ }
+ else
+ {
+ ExplainPropertyInteger("Batches", NULL, stats->batches, es);
+ ExplainPropertyFloat("Average Batch Rows", NULL,
+ RowBatchAvgRows(batch), 1, es);
+ ExplainPropertyInteger("Max Batch Rows", NULL, stats->max_rows, es);
+ ExplainPropertyInteger("Min Batch Rows", NULL,
+ stats->min_rows == INT_MAX ? 0 :
+ stats->min_rows, es);
+ }
+ }
+ }
+
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
{
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index 77f59b8e500..28022a171cd 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -159,6 +159,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
"EXPLAIN", opt->defname, p),
parser_errposition(pstate, opt->location)));
}
+ else if (strcmp(opt->defname, "batches") == 0)
+ es->batches = defGetBoolean(opt);
else if (!ApplyExtensionExplainOption(es, opt, pstate))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -198,6 +200,12 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
errmsg("%s options %s and %s cannot be used together",
"EXPLAIN", "ANALYZE", "GENERIC_PLAN")));
+ /* check that BATCHES is used with EXPLAIN ANALYZE */
+ if (es->batches && !es->analyze)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("EXPLAIN option %s requires ANALYZE", "BATCHES")));
+
/* if the summary was not set explicitly, set default value */
es->summary = (summary_set) ? es->summary : es->analyze;
diff --git a/src/backend/executor/execRowBatch.c b/src/backend/executor/execRowBatch.c
index 6a298813bd8..6ef54deca04 100644
--- a/src/backend/executor/execRowBatch.c
+++ b/src/backend/executor/execRowBatch.c
@@ -20,7 +20,7 @@
* Allocate and initialize a new RowBatch envelope.
*/
RowBatch *
-RowBatchCreate(int max_rows)
+RowBatchCreate(int max_rows, bool track_stats)
{
RowBatch *b;
@@ -35,6 +35,20 @@ RowBatchCreate(int max_rows)
b->materialized = false;
b->slot = NULL;
+ if (track_stats)
+ {
+ RowBatchStats *stats = palloc_object(RowBatchStats);
+
+ stats->batches = 0;
+ stats->rows = 0;
+ stats->max_rows = 0;
+ stats->min_rows = INT_MAX;
+
+ b->stats = stats;
+ }
+ else
+ b->stats = NULL;
+
return b;
}
@@ -52,3 +66,31 @@ RowBatchReset(RowBatch *b, bool drop_slots)
b->materialized = false;
/* b->slot belongs to the owning PlanState node */
}
+
+void
+RowBatchRecordStats(RowBatch *b, int rows)
+{
+ RowBatchStats *stats = b->stats;
+
+ if (stats == NULL)
+ return;
+
+ stats->batches++;
+ stats->rows += rows;
+ if (rows > stats->max_rows)
+ stats->max_rows = rows;
+ if (rows < stats->min_rows && rows > 0)
+ stats->min_rows = rows;
+}
+
+double
+RowBatchAvgRows(RowBatch *b)
+{
+ RowBatchStats *stats = b->stats;
+
+ Assert(stats != NULL);
+ if (stats->batches == 0)
+ return 0.0;
+
+ return (double) stats->rows / stats->batches;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index b41d18b67e3..c1527be946a 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -245,8 +245,12 @@ SeqScanCanUseBatching(SeqScanState *scanstate, int eflags)
static void
SeqScanInitBatching(SeqScanState *scanstate)
{
- RowBatch *batch = RowBatchCreate(MaxHeapTuplesPerPage);
+ RowBatch *batch;
+ EState *estate = scanstate->ss.ps.state;
+ bool track_stats = estate->es_instrument &&
+ (estate->es_instrument & INSTRUMENT_BATCHES);
+ batch = RowBatchCreate(MaxHeapTuplesPerPage, track_stats);
batch->slot = scanstate->ss.ss_ScanTupleSlot;
scanstate->batch = batch;
@@ -347,6 +351,8 @@ SeqNextBatch(SeqScanState *node)
if (!table_scan_getnextbatch(scandesc, b, direction))
return false;
+ RowBatchRecordStats(b, b->nrows);
+
return true;
}
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 5a48bc6fbb1..579ca4cfa20 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -56,6 +56,7 @@ typedef struct ExplainState
bool memory; /* print planner's memory usage information */
bool settings; /* print modified settings */
bool generic; /* generate a generic plan */
+ bool batches; /* print batch statistics */
ExplainSerializeOption serialize; /* serialize the query's output? */
ExplainFormat format; /* output format */
/* state for output formatting --- not reset for each new plan tree */
diff --git a/src/include/executor/execRowBatch.h b/src/include/executor/execRowBatch.h
index 021fdeecc73..ad0b4763b70 100644
--- a/src/include/executor/execRowBatch.h
+++ b/src/include/executor/execRowBatch.h
@@ -13,9 +13,12 @@
#ifndef EXECROWBATCH_H
#define EXECROWBATCH_H
+#include <limits.h>
+
#include "executor/tuptable.h"
typedef struct RowBatchOps RowBatchOps;
+typedef struct RowBatchStats RowBatchStats;
/*
* RowBatch
@@ -38,6 +41,9 @@ typedef struct RowBatch
bool materialized; /* tuples in slots valid? */
TupleTableSlot *slot; /* row view */
+
+ RowBatchStats *stats; /* NULL if instrumentation stats
+ * are not requested */
} RowBatch;
/*
@@ -58,8 +64,17 @@ typedef struct RowBatchOps
void (*repoint_slot) (RowBatch *b, int idx);
} RowBatchOps;
+/* Instrumentation stats populated for EXPLAIN ANALYZE BATCHES */
+typedef struct RowBatchStats
+{
+ int64 batches; /* total number of batches fetched */
+ int64 rows; /* total tuples across all batches */
+ int max_rows; /* max rows in any single batch */
+ int min_rows; /* min rows in any single batch (non-zero) */
+} RowBatchStats;
+
/* Create/teardown */
-extern RowBatch *RowBatchCreate(int max_rows);
+extern RowBatch *RowBatchCreate(int max_rows, bool track_stats);
extern void RowBatchReset(RowBatch *b, bool drop_slots);
/* Validation */
@@ -85,4 +100,9 @@ RowBatchGetNextSlot(RowBatch *b)
return b->slot;
}
+/* === Batching stats. ===*/
+
+extern void RowBatchRecordStats(RowBatch *b, int rows);
+extern double RowBatchAvgRows(RowBatch *b);
+
#endif /* EXECROWBATCH_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..bee69b4ac8f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -64,6 +64,7 @@ typedef enum InstrumentOption
INSTRUMENT_BUFFERS = 1 << 1, /* needs buffer usage */
INSTRUMENT_ROWS = 1 << 2, /* needs row count */
INSTRUMENT_WAL = 1 << 3, /* needs WAL usage */
+ INSTRUMENT_BATCHES = 1 << 4, /* needs batches */
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..950de5a9d78 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,110 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- Test BATCHES option
+set executor_batch_rows = 64;
+create temp table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+-- BATCHES without ANALYZE should error
+explain (batches, costs off) select * from batch_test;
+ERROR: EXPLAIN option BATCHES requires ANALYZE
+-- BATCHES without ANALYZE but with other options
+explain (batches, buffers off, costs off) select * from batch_test;
+ERROR: EXPLAIN option BATCHES requires ANALYZE
+-- Basic: verify batch stats line appears in text format
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+-- With filter: batch line still appears
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Filter: (a > N)
+ Rows Removed by Filter: N
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With non-batchable qual (OR): batching still active but
+-- batch qual falls back to per-tuple ExecQual
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000 or b is null');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Filter: ((a > N) OR (b IS NULL))
+ Rows Removed by Filter: N
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With LIMIT: batch stats appear on child Seq Scan node
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+ explain_filter
+----------------------------------------------------------------------
+ Limit (actual time=N.N..N.N rows=N.N loops=N)
+ -> Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(5 rows)
+
+-- Verify batch stats keys present in JSON output
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches,
+ j #> '{0,Plan}' ? 'Average Batch Rows' as has_avg,
+ j #> '{0,Plan}' ? 'Max Batch Rows' as has_max,
+ j #> '{0,Plan}' ? 'Min Batch Rows' as has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+ has_batches | has_avg | has_max | has_min
+-------------+---------+---------+---------
+ t | t | t | t
+(1 row)
+
+-- With LIMIT: batch stats keys on child node in JSON
+select
+ j #> '{0,Plan,Plans,0}' ? 'Batches' as child_has_batches,
+ j #> '{0,Plan,Plans,0}' ? 'Average Batch Rows' as child_has_avg,
+ j #> '{0,Plan,Plans,0}' ? 'Max Batch Rows' as child_has_max,
+ j #> '{0,Plan,Plans,0}' ? 'Min Batch Rows' as child_has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test limit 100'
+) as j;
+ child_has_batches | child_has_avg | child_has_max | child_has_min
+-------------------+---------------+---------------+---------------
+ t | t | t | t
+(1 row)
+
+-- Batching disabled: no batch stats in text output
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+-- Batching disabled: no batch keys in JSON
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+ has_batches
+-------------
+ f
+(1 row)
+
+reset executor_batch_rows;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..55acb9058ce 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,62 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- Test BATCHES option
+set executor_batch_rows = 64;
+
+create temp table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+
+-- BATCHES without ANALYZE should error
+explain (batches, costs off) select * from batch_test;
+
+-- BATCHES without ANALYZE but with other options
+explain (batches, buffers off, costs off) select * from batch_test;
+
+-- Basic: verify batch stats line appears in text format
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- With filter: batch line still appears
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+
+-- With non-batchable qual (OR): batching still active but
+-- batch qual falls back to per-tuple ExecQual
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000 or b is null');
+
+-- With LIMIT: batch stats appear on child Seq Scan node
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+
+-- Verify batch stats keys present in JSON output
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches,
+ j #> '{0,Plan}' ? 'Average Batch Rows' as has_avg,
+ j #> '{0,Plan}' ? 'Max Batch Rows' as has_max,
+ j #> '{0,Plan}' ? 'Min Batch Rows' as has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+
+-- With LIMIT: batch stats keys on child node in JSON
+select
+ j #> '{0,Plan,Plans,0}' ? 'Batches' as child_has_batches,
+ j #> '{0,Plan,Plans,0}' ? 'Average Batch Rows' as child_has_avg,
+ j #> '{0,Plan,Plans,0}' ? 'Max Batch Rows' as child_has_max,
+ j #> '{0,Plan,Plans,0}' ? 'Min Batch Rows' as child_has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test limit 100'
+) as j;
+
+-- Batching disabled: no batch stats in text output
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- Batching disabled: no batch keys in JSON
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+
+reset executor_batch_rows;
--
2.47.3
[application/x-patch] v6-0004-SeqScan-add-batch-driven-variants-returning-slots.patch (12.6K, 6-v6-0004-SeqScan-add-batch-driven-variants-returning-slots.patch)
download | inline diff:
From 074facc85aae66ebab49b08eadf9957a6dca778d Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 5 Mar 2026 11:28:16 +0900
Subject: [PATCH v6 4/5] SeqScan: add batch-driven variants returning slots
Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per-tuple AM crossings without
changing the node interface seen by parents.
SeqScanState gains a RowBatch pointer that holds the current batch
when batching is active. Batch state is localized to SeqScanState
-- no changes to PlanState or ScanState.
Add executor_batch_rows GUC (DEVELOPER_OPTIONS, default 64) to
control the maximum batch size. Setting it to 0 disables batching.
XXX currently ignored when reading from heapam tables.
Wire up runtime selection in ExecInitSeqScan via
SeqScanCanUseBatching(). When executor_batch_rows > 1, EPQ is
inactive, the scan is forward-only, and the relation's AM supports
batching, ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used with zero overhead.
Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/executor/nodeSeqscan.c | 276 ++++++++++++++++++++++
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_parameters.dat | 9 +
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 2 +
5 files changed, 291 insertions(+)
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 8f219f60a93..b41d18b67e3 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -29,12 +29,17 @@
#include "access/relscan.h"
#include "access/tableam.h"
+#include "executor/execRowBatch.h"
#include "executor/execScan.h"
#include "executor/executor.h"
#include "executor/nodeSeqscan.h"
#include "utils/rel.h"
static TupleTableSlot *SeqNext(SeqScanState *node);
+static TupleTableSlot *ExecSeqScanBatchSlot(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithQual(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithProject(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithQualProject(PlanState *pstate);
/* ----------------------------------------------------------------
* Scan Support
@@ -203,6 +208,271 @@ ExecSeqScanEPQ(PlanState *pstate)
(ExecScanRecheckMtd) SeqRecheck);
}
+/* ----------------------------------------------------------------
+ * Batch Support
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * SeqScanCanUseBatching
+ * Check whether this SeqScan can use batch mode execution.
+ *
+ * Batching requires: the GUC is enabled, no EPQ recheck is active, the scan
+ * is forward-only, and the table AM supports batching with the current
+ * snapshot (see table_supports_batching()).
+ */
+static bool
+SeqScanCanUseBatching(SeqScanState *scanstate, int eflags)
+{
+ Relation relation = scanstate->ss.ss_currentRelation;
+
+ return executor_batch_rows > 1 &&
+ relation &&
+ table_supports_batching(relation,
+ scanstate->ss.ps.state->es_snapshot) &&
+ !(eflags & EXEC_FLAG_BACKWARD) &&
+ scanstate->ss.ps.state->es_epq_active == NULL;
+}
+
+/*
+ * SeqScanInitBatching
+ * Set up batch execution state and select the appropriate
+ * ExecProcNode variant for batch mode.
+ *
+ * Called from ExecInitSeqScan when SeqScanCanUseBatching returns true.
+ * Overwrites the ExecProcNode pointer set by the non-batch path.
+ */
+static void
+SeqScanInitBatching(SeqScanState *scanstate)
+{
+ RowBatch *batch = RowBatchCreate(MaxHeapTuplesPerPage);
+
+ batch->slot = scanstate->ss.ss_ScanTupleSlot;
+ scanstate->batch = batch;
+
+ /* Choose batch variant */
+ if (scanstate->ss.ps.qual == NULL)
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+ else
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+ }
+ else
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ else
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+ }
+}
+
+/*
+ * SeqScanResetBatching
+ * Reset or tear down batch execution state.
+ *
+ * When drop is false (rescan), resets the RowBatch and releases any
+ * AM-held resources like buffer pins, but keeps allocations for reuse.
+ * When drop is true (end of node), frees everything.
+ */
+static void
+SeqScanResetBatching(SeqScanState *scanstate, bool drop)
+{
+ RowBatch *b = scanstate->batch;
+
+ if (b)
+ {
+ RowBatchReset(b, drop);
+ if (b->am_payload)
+ {
+ if (drop)
+ {
+ table_scan_end_batch(scanstate->ss.ss_currentScanDesc, b);
+ b->am_payload = NULL;
+ }
+ else
+ table_scan_reset_batch(scanstate->ss.ss_currentScanDesc, b);
+ }
+ if (drop)
+ pfree(b);
+ }
+}
+
+/*
+ * SeqNextBatch
+ * Fetch the next batch of tuples from the table AM.
+ *
+ * Lazily initializes the scan descriptor and AM batch state on first
+ * call. Returns false at end of scan.
+ */
+static bool
+SeqNextBatch(SeqScanState *node)
+{
+ TableScanDesc scandesc;
+ EState *estate;
+ ScanDirection direction;
+ RowBatch *b = node->batch;
+
+ Assert(b != NULL);
+
+ /*
+ * get information from the estate and scan state
+ */
+ scandesc = node->ss.ss_currentScanDesc;
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ Assert(ScanDirectionIsForward(direction));
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the scan is not parallel, or if we're serially
+ * executing a scan that was planned to be parallel.
+ */
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ node->ss.ss_currentScanDesc = scandesc;
+ }
+
+ /* Lazily create the AM batch payload. */
+ if (b->am_payload == NULL)
+ {
+ const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+ Assert(tam && tam->scan_begin_batch);
+ table_scan_begin_batch(scandesc, b);
+ }
+
+ if (!table_scan_getnextbatch(scandesc, b, direction))
+ return false;
+
+ return true;
+}
+
+/*
+ * SeqScanBatchSlot
+ * Core loop for batch-driven SeqScan variants.
+ *
+ * Internally fetches tuples in batches from the table AM, but returns
+ * one slot at a time to preserve the single-slot interface expected by
+ * parent nodes. When the current batch is exhausted, fetches and
+ * materializes the next one.
+ *
+ * qual and projInfo are passed explicitly so the compiler can eliminate
+ * dead branches when inlined into the typed wrapper functions (e.g.
+ * ExecSeqScanBatchSlot passes NULL for both).
+ *
+ * EPQ is not supported in the batch path; asserted at entry.
+ */
+static inline TupleTableSlot *
+SeqScanBatchSlot(SeqScanState *node,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ss.ps.ps_ExprContext;
+ RowBatch *b = node->batch;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ss.ps.state->es_epq_active == NULL);
+ Assert(RowBatchIsValid(b));
+
+ for (;;)
+ {
+ TupleTableSlot *in;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next input slot from current batch, or refill */
+ if (!RowBatchHasMore(b))
+ {
+ if (!SeqNextBatch(node))
+ return NULL;
+ }
+
+ in = RowBatchGetNextSlot(b);
+ Assert(in);
+
+ /* No qual, no projection: direct return */
+ if (qual == NULL && projInfo == NULL)
+ return in;
+
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ /* Check qual if present */
+ if (qual != NULL && !ExecQual(qual, econtext))
+ {
+ InstrCountFiltered1(node, 1);
+ continue;
+ }
+
+ /* Project if needed, otherwise return scan tuple directly */
+ if (projInfo != NULL)
+ return ExecProject(projInfo);
+
+ return in;
+ }
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return SeqScanBatchSlot(node, NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtended().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return SeqScanBatchSlot(node, pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScan() but when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return SeqScanBatchSlot(node, NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScan() but when qual evaluation and projection are
+ * required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return SeqScanBatchSlot(node, pstate->qual, pstate->ps_ProjInfo);
+}
+
/* ----------------------------------------------------------------
* ExecInitSeqScan
* ----------------------------------------------------------------
@@ -281,6 +551,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
+ if (SeqScanCanUseBatching(scanstate, eflags))
+ SeqScanInitBatching(scanstate);
+
return scanstate;
}
@@ -300,6 +573,8 @@ ExecEndSeqScan(SeqScanState *node)
*/
scanDesc = node->ss.ss_currentScanDesc;
+ SeqScanResetBatching(node, true);
+
/*
* close heap scan
*/
@@ -329,6 +604,7 @@ ExecReScanSeqScan(SeqScanState *node)
table_rescan(scan, /* scan desc */
NULL); /* new scan keys */
+ SeqScanResetBatching(node, false);
ExecScanReScan((ScanState *) node);
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 36ad708b360..535e29d7823 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int notify_buffers = 16;
int serializable_buffers = 32;
int subtransaction_buffers = 0;
int transaction_buffers = 0;
+
+/* executor batching */
+int executor_batch_rows = 64;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index a5a0edf2534..e1eadcf643d 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1004,6 +1004,15 @@
boot_val => 'true',
},
+{ name => 'executor_batch_rows', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+ short_desc => 'Number of rows to include in batches during execution.',
+ flags => 'GUC_NOT_IN_SAMPLE',
+ variable => 'executor_batch_rows',
+ boot_val => '64',
+ min => '0',
+ max => '1024',
+},
+
{ name => 'exit_on_error', type => 'bool', context => 'PGC_USERSET', group => 'ERROR_HANDLING_OPTIONS',
short_desc => 'Terminate session on any error.',
variable => 'ExitOnAnyError',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f16f35659b9..ad406bf53f3 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -288,6 +288,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
extern PGDLLIMPORT int VacuumCostBalance;
extern PGDLLIMPORT bool VacuumCostActive;
+extern PGDLLIMPORT int executor_batch_rows;
/* in utils/misc/stack_depth.c */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0716c5a9aed..6f038cfcc60 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -67,6 +67,7 @@ typedef struct TupleTableSlot TupleTableSlot;
typedef struct TupleTableSlotOps TupleTableSlotOps;
typedef struct WalUsage WalUsage;
typedef struct WorkerInstrumentation WorkerInstrumentation;
+typedef struct RowBatch RowBatch;
/* ----------------
@@ -1644,6 +1645,7 @@ typedef struct SeqScanState
{
ScanState ss; /* its first field is NodeTag */
Size pscan_len; /* size of parallel heap scan descriptor */
+ RowBatch *batch; /* NULL if batching disabled */
} SeqScanState;
/* ----------------
--
2.47.3
^ permalink raw reply [nested|flat] 9+ messages in thread
* Re: Batching in executor
@ 2026-04-06 12:02 Amit Langote <[email protected]>
parent: Amit Langote <[email protected]>
0 siblings, 0 replies; 9+ messages in thread
From: Amit Langote @ 2026-04-06 12:02 UTC (permalink / raw)
To: Junwang Zhao <[email protected]>; +Cc: cca5507 <[email protected]>; Daniil Davydov <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
On Tue, Mar 24, 2026 at 9:59 AM Amit Langote <[email protected]> wrote:
> Here is a significantly revised version of the patch series. A lot has
> changed since the January submission, so I want to summarize the
> design changes before getting into the patches. I think it does
> address the points in the two reviews that landed since v5 but maybe a
> bunch of points became moot after my rewrite of the relevant portions
> (thanks Junwang and ChangAo for the review in any case).
>
> At this point it might be better to think of this as targeting v20,
> except that if there is review bandwidth in the remaining two weeks
> before the v19 feature freeze, the rs_vistuples[] change described
> below as a standalone improvement to the existing pagemode scan path
> could be considered for v19, though that too is an optimistic
> scenario.
>
> It is also worth noting that Andres identified a number of
> inefficiencies in the existing scan path in:
>
> Re: unnecessary executor overheads around seqscans
> https://postgr.es/m/xzflwwjtwxin3dxziyblrnygy3gfygo5dsuw6ltcoha73ecmnf%40nh6nonzta7kw
>
> that are worth fixing independently of batching. Some of those fixes
> may be better pursued first, both because they benefit all scan paths
> and because they would make batching's gains more honest.
>
> Separately, after looking at the previous version, Andres pointed out
> offlist two fundamental issues with the patch's design:
>
> * The heapam implementation (in a version of the patch I didn't post
> to the thread) duplicated heap_prepare_pagescan() logic in a separate
> batch-specific code path, which is not acceptable as changes should
> benefit the existing slot interface too. Code duplication is not good
> either from a future maintainability aspect. The v5 version of that
> code is not great in that respect either; it instead duplicated
> heapggettup_pagemode() to slap batching on it.
>
> * Allocating executor_batch_rows slots on the executor side to receive
> rows from the AM adds significant overhead for slot initialization and
> management, and for non-row-organized AMs that do not produce
> individual rows at all, those slots would never be meaningfully
> populated.
>
> In any case, he just wasn't a fan of the slot-array approach the
> moment I mentioned it. The previous version had two slot arrays,
> inslots and outslots, of TTSOpsHeapTuple type (not
> TTSOpsBufferHeapTuple because buffer pins were managed by the batch
> code, which has its own modularity/correctness issues), populated via
> a materialize_all callback. A batch qual evaluator would copy
> qualifying tuples into outslots, with an activeslots pointer switching
> between the two depending on whether batch qual evaluation was used.
>
> The new design addresses both issues and differs from the previous
> version in several other ways:
>
> * Single slot instead of slot arrays: there is a single
> TupleTableSlot, reusing the scan node's ss_ScanTupleSlot whose type
> was already determined by the AM via table_slot_callbacks(). The slot
> is re-pointed to each HeapTuple in the current buffer page via a new
> repoint_slot AM callback, with no materialization or copying. Tuples
> are returned one by one from the executor's perspective, but the AM
> serves them in page-sized batches from pre-built HeapTupleData
> descriptors in rs_vistuples[], avoiding repeated descent into heapam
> per tuple. This is heapam's implementation of the batch interface;
> there is no intention to force other AMs into the same row-oriented
> model.
>
> * Batch qual evaluator not included: with the single-slot model,
> quals are evaluated per tuple via the existing ExecQual path after
> each repoint_slot call. A natural next step would be a new opcode
> (EEOP) that calls repoint_slot() internally within expression
> evaluation, allowing ExecQual to advance through multiple tuples from
> the same batch without returning to the scan node each time, with qual
> results accumulated in a bitmask in ExprState. The details of that
> will be worked out in a follow-on series.
>
> * heapgettup_pagemode_batch() gone: patch 0001 (described below) makes
> HeapScanDesc store full HeapTupleData entries in rs_vistuples[], which
> allows heap_getnextbatch() to simply advance a slice pointer into that
> array without any additional copying or re-entering heap code, making
> a separate batch-specific scan function unnecessary.
>
> * TupleBatch renamed to RowBatch: "row batch" is more natural
> terminology for this concept and also consistent with how similar
> abstractions are named in columnar and OLAP systems.
>
> * AM callbacks now take RowBatch directly: previously
> heap_getnextbatch() returned a void pointer that the executor would
> store into RowBatch.am_payload, because only the executor knew the
> internals of RowBatch. Now the AM receives RowBatch directly as a
> parameter and can populate it without the executor acting as an
> intermediary. This is also why RowBatch is introduced in its own
> patch ahead of the AM API addition, so the struct definition is
> available to both sides.
>
> Patch 0001 changes rs_vistuples[] to store full HeapTupleData entries
> instead of OffsetNumbers, as a standalone improvement to the existing
> pagemode scan path. Measured on a pg_prewarm'd (also vaccum freeze'd
> in the all-visible case) table with 1M/5M/10M rows:
>
> query all-visible not-all-visible
> count(*) -0.2% to +0.9% -0.4% to +0.5%
> count(*) WHERE id % 10 = 0 -1.1% to +3.4% +0.2% to +1.5%
> SELECT * LIMIT 1 OFFSET N -2.2% to -0.6% -0.9% to +6.6%
> SELECT * WHERE id%10=0 LIMIT -0.8% to +3.9% +0.9% to +9.6%
>
> No significant regression on either page type. The structural
> improvement is most visible on not-all-visible pages where
> HeapTupleSatisfiesMVCCBatch() already reads every tuple header during
> visibility checks, so persisting the result into rs_vistuples[]
> eliminates the downstream re-read (in heapgettupe_pagemode()) with no
> measurable overhead. That said, these numbers are somewhat noisy on
> my machine. Results on other machines would be welcome.
>
> Patches 0002-0005 add the RowBatch infrastructure, the batch AM API
> and heapam implementation including seqscan variants that use the new
> scan_getnextbatch() API, and EXPLAIN (ANALYZE, BATCHES) support,
> respectively. With batching enabled (executor_batch_rows=300,
> ~MaxHeapTuplesPerPage):
>
> query all-visible not-all-visible
> count(*) +11 to +15% +9 to +13%
> count(*) WHERE id % 10 = 0 +6 to +11% +10 to +14%
> SELECT * LIMIT 1 OFFSET N +16 to +19% +16 to +22%
> SELECT * WHERE id%10=0 LIMIT +8 to +10% +8 to +13%
>
> With executor_batch_rows=0, results are within noise of master across
> all query types and sizes, confirming no regression from the
> infrastructure changes themselves. The not-all-visible results tend
> to show slightly higher gains than the all-visible case. This is
> likely because the existing heapam code is more optimized for the
> all-visible path, so the not-all-visible path, which goes through
> HeapTupleSatisfiesMVCCBatch() for per-tuple visibility checks, has
> more headroom that batching can exploit.
>
> Setting aside the current series for a moment, there are some broader
> design questions worth raising while we have attention on this area.
> Some of these echo points Tomas raised in his first reply on this
> thread, and I am reiterating them deliberately since I have not
> managed to fully address them on my own or I simply didn't need to for
> the TAM-to-scan-node batching and think they would benefit from wider
> input rather than just my own iteration.
>
> We should also start thinking about other ways the executor can
> consume batch rows, not always assuming they are presented as
> HeapTupleData. For instance, an AM could expose decoded column arrays
> directly to operators that can consume them, bypassing slot-based
> deform entirely, or a columnar AM could implement scan_getnextbatch by
> decoding column strips directly into the batch without going through
> per-tuple HeapTupleData at all. Feedback on whether the current
> RowBatch design and the choices made in the scan_getnextbatch and
> RowBatchOps API make that sort of thing harder than it needs to be
> would be appreciated. For example, heapam's implementation of
> scan_getnextbatch uses a single TTSOpsBufferHeapTuple slot re-pointed
> to HeapTupleData entries one at a time via repoint_slot in
> RowBatchHeapOps. That works for heapam but a columnar AM could
> implement scan_getnextbatch to decode column strips directly into
> arrays in the batch, with no per-row repoint step needed at all. Any
> adjustments that would make RowBatch more AM-agnostic are worth
> discussing now before the design hardens.
>
> There are also broader open questions about how far the batch model
> can extend beyond the scan node. Qual pushdown into the AM has been
> discussed in nearby threads and would be one way to allow expression
> evaluation to happen before data reaches the executor proper, though
> that is a separate effort. For the purposes of this series, expression
> evaluation still happens in the executor after scan_getnextbatch
> returns. If the scan node does not project, the buffer heap slot is
> passed directly to the parent node, which calls slot callbacks to
> deform as needed. But once a node above projects, aggregates, or
> joins, the notion of a page-sized batch from a single AM loses its
> meaning and virtual slots take over. Whether RowBatch is usable or
> meaningful beyond the scan/TAM boundary in any form, and whether the
> core executor will ever have non-HeapTupleData batch consumption paths
> or leave that entirely to extensions, are open questions worth
> discussing.
>
> For RowBatch to eventually play the role that TupleTableSlot plays for
> row-at-a-time execution, something inside it would need to serve as
> the common currency for batch data, analogous to TupleTableSlot's
> datum/isnull arrays. Column arrays are the obvious direction, but even
> that leaves open the question of representation. PostgreSQL's Datum is
> a pointer-sized abstraction that boxes everything, whereas vectorized
> systems use typed packed arrays of native types with validity
> bitmasks, which is a significant part of why tight vectorized loops
> are fast there. Whether column arrays of Datum would be good enough,
> or whether going further toward typed packed arrays would be necessary
> to get meaningful vectorization, is a deeper design question that this
> series deliberately does not try to answer.
>
> Even though the focus is on getting batching working at the scan/TAM
> boundary first, thoughts on any of these points would be welcome.
Rebased.
--
Thanks, Amit Langote
Attachments:
[application/octet-stream] v7-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch (12.8K, 2-v7-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch)
download | inline diff:
From 1557236686140c29be98dc461e97f8df4a0f1a73 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 12 Mar 2026 09:18:04 +0900
Subject: [PATCH v7 1/5] heapam: store full HeapTupleData in rs_vistuples[] for
pagemode scans
page_collect_tuples() builds full HeapTupleData headers for every
visible tuple on a page -- t_data, t_len, t_self, t_tableOid -- but
previously discarded them immediately after writing just the OffsetNumber
of each survivor into rs_vistuples[]. heapgettup_pagemode() then
re-derived those same values on every call from the saved OffsetNumber
via PageGetItemId() and PageGetItem().
Change rs_vistuples[] element type from OffsetNumber to HeapTupleData
and populate it inside page_collect_tuples() while lpp, lineoff, page,
block, and relid are already in scope, so no additional page reads are
needed. For the all_visible path (the common case on a primary not
under active modification) the write piggy-backs on the existing
per-lineoff loop. For the !all_visible path, HeapTupleData entries are
written during the visibility loop and compacted to visible survivors
afterwards using batchmvcc.visible[], avoiding a return to pd_linp[] via
PageGetItemId().
With rs_vistuples[] populated, heapgettup_pagemode() replaces the
per-tuple PageGetItemId/PageGetItem calls with a single struct copy:
*tuple = scan->rs_vistuples[lineindex];
The stack-local HeapTupleData array in BatchMVCCState is eliminated by
passing rs_vistuples[] directly to HeapTupleSatisfiesMVCCBatch(),
saving MaxHeapTuplesPerPage * 24 bytes of stack per page_collect_tuples()
call. HeapTupleSatisfiesMVCCBatch() loses its vistuples_dense parameter
since compaction is now handled by the caller.
t_tableOid is pre-initialized for all rs_vistuples[] entries at scan
start in heap_beginscan(), eliminating a store per visible tuple from the
fill loop. The raw ItemId word is read once per tuple with lp_off and
lp_len extracted via mask and shift rather than calling ItemIdGetOffset()
and ItemIdGetLength() separately, avoiding a potential second load from
the same address in the inner loop.
Having pre-built HeapTupleData headers available at the scan descriptor
level also lays groundwork for a batched tuple interface, where an AM
can serve multiple tuples per call without repeating the line pointer
traversal.
Suggested-by: Andres Freund <[email protected]>
---
src/backend/access/heap/heapam.c | 73 ++++++++++++---------
src/backend/access/heap/heapam_handler.c | 19 ++----
src/backend/access/heap/heapam_visibility.c | 21 +++---
src/include/access/heapam.h | 5 +-
4 files changed, 58 insertions(+), 60 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e06ce2db2cf..b70c75c8288 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -524,7 +524,6 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
BlockNumber block, int lines,
bool all_visible, bool check_serializable)
{
- Oid relid = RelationGetRelid(scan->rs_base.rs_rd);
int ntup = 0;
int nvis = 0;
BatchMVCCState batchmvcc;
@@ -536,7 +535,7 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
for (OffsetNumber lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
{
ItemId lpp = PageGetItemId(page, lineoff);
- HeapTuple tup;
+ HeapTuple tup = &scan->rs_vistuples[ntup];
if (unlikely(!ItemIdIsNormal(lpp)))
continue;
@@ -549,25 +548,33 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
*/
if (!all_visible || check_serializable)
{
- tup = &batchmvcc.tuples[ntup];
+ uint32 lp_val = *(uint32 *) lpp;
- tup->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
- tup->t_len = ItemIdGetLength(lpp);
- tup->t_tableOid = relid;
+ tup->t_data = (HeapTupleHeader) ((char *) page + (lp_val & 0x7fff));
+ tup->t_len = lp_val >> 17;
+ Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
ItemPointerSet(&(tup->t_self), block, lineoff);
}
- /*
- * If the page is all visible, these fields otherwise won't be
- * populated in loop below.
- */
if (all_visible)
{
if (check_serializable)
- {
batchmvcc.visible[ntup] = true;
+
+ /*
+ * In the all_visible && !check_serializable path, the block
+ * above was skipped, so tup's fields have not been set yet.
+ * Fill them here while lpp is still in hand.
+ */
+ if (!check_serializable)
+ {
+ uint32 lp_val = *(uint32 *) lpp;
+
+ tup->t_data = (HeapTupleHeader) ((char *) page + (lp_val & 0x7fff));
+ tup->t_len = lp_val >> 17;
+ Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
+ ItemPointerSet(&tup->t_self, block, lineoff);
}
- scan->rs_vistuples[ntup] = lineoff;
}
ntup++;
@@ -598,11 +605,24 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
{
HeapCheckForSerializableConflictOut(batchmvcc.visible[i],
scan->rs_base.rs_rd,
- &batchmvcc.tuples[i],
+ &scan->rs_vistuples[i],
buffer, snapshot);
}
}
+
+ /* Now compact rs_vistuples[] to visible survivors only */
+ if (!all_visible)
+ {
+ int dst = 0;
+ for (int i = 0; i < ntup; i++)
+ {
+ if (batchmvcc.visible[i])
+ scan->rs_vistuples[dst++] = scan->rs_vistuples[i];
+ }
+ Assert(dst == nvis);
+ }
+
return nvis;
}
@@ -1074,14 +1094,13 @@ heapgettup_pagemode(HeapScanDesc scan,
ScanKey key)
{
HeapTuple tuple = &(scan->rs_ctup);
- Page page;
uint32 lineindex;
uint32 linesleft;
if (likely(scan->rs_inited))
{
/* continue from previously returned page/tuple */
- page = BufferGetPage(scan->rs_cbuf);
+ Assert(BufferIsValid(scan->rs_cbuf));
lineindex = scan->rs_cindex + dir;
if (ScanDirectionIsForward(dir))
@@ -1109,29 +1128,21 @@ heapgettup_pagemode(HeapScanDesc scan,
/* prune the page and determine visible tuple offsets */
heap_prepare_pagescan((TableScanDesc) scan);
- page = BufferGetPage(scan->rs_cbuf);
linesleft = scan->rs_ntuples;
lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
- /* block is the same for all tuples, set it once outside the loop */
- ItemPointerSetBlockNumber(&tuple->t_self, scan->rs_cblock);
-
/* lineindex now references the next or previous visible tid */
continue_page:
for (; linesleft > 0; linesleft--, lineindex += dir)
{
- ItemId lpp;
- OffsetNumber lineoff;
-
- Assert(lineindex < scan->rs_ntuples);
- lineoff = scan->rs_vistuples[lineindex];
- lpp = PageGetItemId(page, lineoff);
- Assert(ItemIdIsNormal(lpp));
-
- tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
- tuple->t_len = ItemIdGetLength(lpp);
- ItemPointerSetOffsetNumber(&tuple->t_self, lineoff);
+ /*
+ * Headers were pre-built by page_collect_tuples() into
+ * rs_vistuples[]. Copy the entry; t_data still points into the
+ * pinned page, which is safe for the lifetime of the current page
+ * scan.
+ */
+ *tuple = scan->rs_vistuples[lineindex];
/* skip any tuples that don't match the scan key */
if (key != NULL &&
@@ -1245,6 +1256,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
/* we only need to set this up once */
scan->rs_ctup.t_tableOid = RelationGetRelid(relation);
+ for (int i = 0; i < MaxHeapTuplesPerPage; i++)
+ scan->rs_vistuples[i].t_tableOid = RelationGetRelid(relation);
/*
* Allocate memory to keep track of page allocation for parallel workers
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 07f07188d46..88add129674 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2050,9 +2050,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
{
BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) scan;
HeapScanDesc hscan = (HeapScanDesc) bscan;
- OffsetNumber targoffset;
- Page page;
- ItemId lp;
/*
* Out of range? If so, nothing more to look at on this page
@@ -2067,15 +2064,7 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
return false;
}
- targoffset = hscan->rs_vistuples[hscan->rs_cindex];
- page = BufferGetPage(hscan->rs_cbuf);
- lp = PageGetItemId(page, targoffset);
- Assert(ItemIdIsNormal(lp));
-
- hscan->rs_ctup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
- hscan->rs_ctup.t_len = ItemIdGetLength(lp);
- hscan->rs_ctup.t_tableOid = scan->rs_rd->rd_id;
- ItemPointerSet(&hscan->rs_ctup.t_self, hscan->rs_cblock, targoffset);
+ hscan->rs_ctup = hscan->rs_vistuples[hscan->rs_cindex];
pgstat_count_heap_fetch(scan->rs_rd);
@@ -2353,7 +2342,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
while (start < end)
{
uint32 mid = start + (end - start) / 2;
- OffsetNumber curoffset = hscan->rs_vistuples[mid];
+ OffsetNumber curoffset = hscan->rs_vistuples[mid].t_self.ip_posid;
if (tupoffset == curoffset)
return true;
@@ -2473,7 +2462,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
ItemPointerSet(&tid, block, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
&heapTuple, NULL, true))
- hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ hscan->rs_vistuples[ntup++] = heapTuple;
}
}
else
@@ -2502,7 +2491,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
if (valid)
{
- hscan->rs_vistuples[ntup++] = offnum;
+ hscan->rs_vistuples[ntup++] = loctup;
PredicateLockTID(scan->rs_rd, &loctup.t_self, snapshot,
HeapTupleHeaderGetXmin(loctup.t_data));
}
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 3a6a1e5a084..7162c848097 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1671,16 +1671,16 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
}
/*
- * Perform HeaptupleSatisfiesMVCC() on each passed in tuple. This is more
+ * Perform HeapTupleSatisfiesMVCC() on each passed in tuple. This is more
* efficient than doing HeapTupleSatisfiesMVCC() one-by-one.
*
- * To be checked tuples are passed via BatchMVCCState->tuples. Each tuple's
- * visibility is stored in batchmvcc->visible[]. In addition,
- * ->vistuples_dense is set to contain the offsets of visible tuples.
+ * Each tuple's visibility is stored in batchmvcc->visible[]. The caller
+ * is responsible for compacting the tuples array to contain only visible
+ * survivors after this function returns.
*
- * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that it
- * avoids a cross-translation-unit function call for each tuple, allows the
- * compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
+ * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that
+ * it avoids a cross-translation-unit function call for each tuple, allows
+ * the compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
* setting hint bits more efficiently (see the one BufferFinishSetHintBits()
* call below).
*
@@ -1690,7 +1690,7 @@ int
HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
int ntups,
BatchMVCCState *batchmvcc,
- OffsetNumber *vistuples_dense)
+ HeapTupleData *tuples)
{
int nvis = 0;
SetHintBitsState state = SHB_INITIAL;
@@ -1700,16 +1700,13 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
for (int i = 0; i < ntups; i++)
{
bool valid;
- HeapTuple tup = &batchmvcc->tuples[i];
+ HeapTuple tup = &tuples[i];
valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer, &state);
batchmvcc->visible[i] = valid;
if (likely(valid))
- {
- vistuples_dense[nvis] = tup->t_self.ip_posid;
nvis++;
- }
}
if (state == SHB_ENABLED)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 5176478c295..56f2d1a5748 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -102,7 +102,7 @@ typedef struct HeapScanDescData
/* these fields only used in page-at-a-time mode and for bitmap scans */
uint32 rs_cindex; /* current tuple's index in vistuples */
uint32 rs_ntuples; /* number of visible tuples on page */
- OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
+ HeapTupleData rs_vistuples[MaxHeapTuplesPerPage]; /* tuples */
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
@@ -498,14 +498,13 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
*/
typedef struct BatchMVCCState
{
- HeapTupleData tuples[MaxHeapTuplesPerPage];
bool visible[MaxHeapTuplesPerPage];
} BatchMVCCState;
extern int HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
int ntups,
BatchMVCCState *batchmvcc,
- OffsetNumber *vistuples_dense);
+ HeapTupleData *tuples);
/*
* To avoid leaking too much knowledge about reorderbuffer implementation
--
2.47.3
[application/octet-stream] v7-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch (17.4K, 3-v7-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch)
download | inline diff:
From 8beefb53e7fa94a060456d1321f36abb221cbe47 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 23:09:37 +0900
Subject: [PATCH v7 5/5] Add EXPLAIN (BATCHES) option for tuple batching
statistics
Add a BATCHES option to EXPLAIN that reports per-node batch statistics
when a node uses batch mode execution.
For nodes that support batching (currently SeqScan), this shows the
number of batches fetched along with average, minimum, and maximum
rows per batch. Output is supported in both text and non-text formats.
Add regression tests covering text output, JSON format, filtered scans,
LIMIT, and disabled batching.
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/commands/explain.c | 44 +++++++++++
src/backend/commands/explain_state.c | 8 ++
src/backend/executor/execRowBatch.c | 44 ++++++++++-
src/backend/executor/nodeSeqscan.c | 8 +-
src/include/commands/explain_state.h | 1 +
src/include/executor/execRowBatch.h | 22 +++++-
src/include/executor/instrument.h | 1 +
src/test/regress/expected/explain.out | 107 ++++++++++++++++++++++++++
src/test/regress/sql/explain.sql | 59 ++++++++++++++
9 files changed, 291 insertions(+), 3 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 73eaaf176ac..8c98ca57c92 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -22,6 +22,7 @@
#include "commands/explain_format.h"
#include "commands/explain_state.h"
#include "commands/prepare.h"
+#include "executor/execRowBatch.h"
#include "foreign/fdwapi.h"
#include "jit/jit.h"
#include "libpq/pqformat.h"
@@ -519,6 +520,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_BUFFERS;
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
+ if (es->batches)
+ instrument_option |= INSTRUMENT_BATCHES;
/*
* We always collect timing for the entire statement, even when node-level
@@ -1370,6 +1373,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
int save_indent = es->indent;
bool haschildren;
bool isdisabled;
+ RowBatch *batch = NULL;
/*
* Prepare per-worker output buffers, if needed. We'll append the data in
@@ -2296,6 +2300,46 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
+ /* BATCHES */
+ switch (nodeTag(plan))
+ {
+ case T_SeqScan:
+ batch = castNode(SeqScanState, planstate)->batch;
+ break;
+ default:
+ break;
+ }
+
+ if (es->batches && batch)
+ {
+ RowBatchStats *stats = batch->stats;
+
+ Assert(stats);
+ if (stats->batches > 0)
+ {
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfo(es->str,
+ "Batches: %lld Avg Rows: %.1f Max: %d Min: %d\n",
+ (long long) stats->batches,
+ RowBatchAvgRows(batch), stats->max_rows,
+ stats->min_rows == INT_MAX ? 0 :
+ stats->min_rows);
+ }
+ else
+ {
+ ExplainPropertyInteger("Batches", NULL, stats->batches, es);
+ ExplainPropertyFloat("Average Batch Rows", NULL,
+ RowBatchAvgRows(batch), 1, es);
+ ExplainPropertyInteger("Max Batch Rows", NULL, stats->max_rows, es);
+ ExplainPropertyInteger("Min Batch Rows", NULL,
+ stats->min_rows == INT_MAX ? 0 :
+ stats->min_rows, es);
+ }
+ }
+ }
+
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
{
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index 77f59b8e500..28022a171cd 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -159,6 +159,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
"EXPLAIN", opt->defname, p),
parser_errposition(pstate, opt->location)));
}
+ else if (strcmp(opt->defname, "batches") == 0)
+ es->batches = defGetBoolean(opt);
else if (!ApplyExtensionExplainOption(es, opt, pstate))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -198,6 +200,12 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
errmsg("%s options %s and %s cannot be used together",
"EXPLAIN", "ANALYZE", "GENERIC_PLAN")));
+ /* check that BATCHES is used with EXPLAIN ANALYZE */
+ if (es->batches && !es->analyze)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("EXPLAIN option %s requires ANALYZE", "BATCHES")));
+
/* if the summary was not set explicitly, set default value */
es->summary = (summary_set) ? es->summary : es->analyze;
diff --git a/src/backend/executor/execRowBatch.c b/src/backend/executor/execRowBatch.c
index 6a298813bd8..6ef54deca04 100644
--- a/src/backend/executor/execRowBatch.c
+++ b/src/backend/executor/execRowBatch.c
@@ -20,7 +20,7 @@
* Allocate and initialize a new RowBatch envelope.
*/
RowBatch *
-RowBatchCreate(int max_rows)
+RowBatchCreate(int max_rows, bool track_stats)
{
RowBatch *b;
@@ -35,6 +35,20 @@ RowBatchCreate(int max_rows)
b->materialized = false;
b->slot = NULL;
+ if (track_stats)
+ {
+ RowBatchStats *stats = palloc_object(RowBatchStats);
+
+ stats->batches = 0;
+ stats->rows = 0;
+ stats->max_rows = 0;
+ stats->min_rows = INT_MAX;
+
+ b->stats = stats;
+ }
+ else
+ b->stats = NULL;
+
return b;
}
@@ -52,3 +66,31 @@ RowBatchReset(RowBatch *b, bool drop_slots)
b->materialized = false;
/* b->slot belongs to the owning PlanState node */
}
+
+void
+RowBatchRecordStats(RowBatch *b, int rows)
+{
+ RowBatchStats *stats = b->stats;
+
+ if (stats == NULL)
+ return;
+
+ stats->batches++;
+ stats->rows += rows;
+ if (rows > stats->max_rows)
+ stats->max_rows = rows;
+ if (rows < stats->min_rows && rows > 0)
+ stats->min_rows = rows;
+}
+
+double
+RowBatchAvgRows(RowBatch *b)
+{
+ RowBatchStats *stats = b->stats;
+
+ Assert(stats != NULL);
+ if (stats->batches == 0)
+ return 0.0;
+
+ return (double) stats->rows / stats->batches;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index d0ce8858c49..135b0a4f9a2 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -247,8 +247,12 @@ SeqScanCanUseBatching(SeqScanState *scanstate, int eflags)
static void
SeqScanInitBatching(SeqScanState *scanstate)
{
- RowBatch *batch = RowBatchCreate(MaxHeapTuplesPerPage);
+ RowBatch *batch;
+ EState *estate = scanstate->ss.ps.state;
+ bool track_stats = estate->es_instrument &&
+ (estate->es_instrument & INSTRUMENT_BATCHES);
+ batch = RowBatchCreate(MaxHeapTuplesPerPage, track_stats);
batch->slot = scanstate->ss.ss_ScanTupleSlot;
scanstate->batch = batch;
@@ -351,6 +355,8 @@ SeqNextBatch(SeqScanState *node)
if (!table_scan_getnextbatch(scandesc, b, direction))
return false;
+ RowBatchRecordStats(b, b->nrows);
+
return true;
}
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 5a48bc6fbb1..579ca4cfa20 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -56,6 +56,7 @@ typedef struct ExplainState
bool memory; /* print planner's memory usage information */
bool settings; /* print modified settings */
bool generic; /* generate a generic plan */
+ bool batches; /* print batch statistics */
ExplainSerializeOption serialize; /* serialize the query's output? */
ExplainFormat format; /* output format */
/* state for output formatting --- not reset for each new plan tree */
diff --git a/src/include/executor/execRowBatch.h b/src/include/executor/execRowBatch.h
index 021fdeecc73..ad0b4763b70 100644
--- a/src/include/executor/execRowBatch.h
+++ b/src/include/executor/execRowBatch.h
@@ -13,9 +13,12 @@
#ifndef EXECROWBATCH_H
#define EXECROWBATCH_H
+#include <limits.h>
+
#include "executor/tuptable.h"
typedef struct RowBatchOps RowBatchOps;
+typedef struct RowBatchStats RowBatchStats;
/*
* RowBatch
@@ -38,6 +41,9 @@ typedef struct RowBatch
bool materialized; /* tuples in slots valid? */
TupleTableSlot *slot; /* row view */
+
+ RowBatchStats *stats; /* NULL if instrumentation stats
+ * are not requested */
} RowBatch;
/*
@@ -58,8 +64,17 @@ typedef struct RowBatchOps
void (*repoint_slot) (RowBatch *b, int idx);
} RowBatchOps;
+/* Instrumentation stats populated for EXPLAIN ANALYZE BATCHES */
+typedef struct RowBatchStats
+{
+ int64 batches; /* total number of batches fetched */
+ int64 rows; /* total tuples across all batches */
+ int max_rows; /* max rows in any single batch */
+ int min_rows; /* min rows in any single batch (non-zero) */
+} RowBatchStats;
+
/* Create/teardown */
-extern RowBatch *RowBatchCreate(int max_rows);
+extern RowBatch *RowBatchCreate(int max_rows, bool track_stats);
extern void RowBatchReset(RowBatch *b, bool drop_slots);
/* Validation */
@@ -85,4 +100,9 @@ RowBatchGetNextSlot(RowBatch *b)
return b->slot;
}
+/* === Batching stats. ===*/
+
+extern void RowBatchRecordStats(RowBatch *b, int rows);
+extern double RowBatchAvgRows(RowBatch *b);
+
#endif /* EXECROWBATCH_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index cc9fbb0e2f0..89df74a86c1 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -64,6 +64,7 @@ typedef enum InstrumentOption
INSTRUMENT_BUFFERS = 1 << 1, /* needs buffer usage */
INSTRUMENT_ROWS = 1 << 2, /* needs row count */
INSTRUMENT_WAL = 1 << 3, /* needs WAL usage */
+ INSTRUMENT_BATCHES = 1 << 4, /* needs batches */
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..950de5a9d78 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,110 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- Test BATCHES option
+set executor_batch_rows = 64;
+create temp table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+-- BATCHES without ANALYZE should error
+explain (batches, costs off) select * from batch_test;
+ERROR: EXPLAIN option BATCHES requires ANALYZE
+-- BATCHES without ANALYZE but with other options
+explain (batches, buffers off, costs off) select * from batch_test;
+ERROR: EXPLAIN option BATCHES requires ANALYZE
+-- Basic: verify batch stats line appears in text format
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+-- With filter: batch line still appears
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Filter: (a > N)
+ Rows Removed by Filter: N
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With non-batchable qual (OR): batching still active but
+-- batch qual falls back to per-tuple ExecQual
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000 or b is null');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Filter: ((a > N) OR (b IS NULL))
+ Rows Removed by Filter: N
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With LIMIT: batch stats appear on child Seq Scan node
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+ explain_filter
+----------------------------------------------------------------------
+ Limit (actual time=N.N..N.N rows=N.N loops=N)
+ -> Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(5 rows)
+
+-- Verify batch stats keys present in JSON output
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches,
+ j #> '{0,Plan}' ? 'Average Batch Rows' as has_avg,
+ j #> '{0,Plan}' ? 'Max Batch Rows' as has_max,
+ j #> '{0,Plan}' ? 'Min Batch Rows' as has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+ has_batches | has_avg | has_max | has_min
+-------------+---------+---------+---------
+ t | t | t | t
+(1 row)
+
+-- With LIMIT: batch stats keys on child node in JSON
+select
+ j #> '{0,Plan,Plans,0}' ? 'Batches' as child_has_batches,
+ j #> '{0,Plan,Plans,0}' ? 'Average Batch Rows' as child_has_avg,
+ j #> '{0,Plan,Plans,0}' ? 'Max Batch Rows' as child_has_max,
+ j #> '{0,Plan,Plans,0}' ? 'Min Batch Rows' as child_has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test limit 100'
+) as j;
+ child_has_batches | child_has_avg | child_has_max | child_has_min
+-------------------+---------------+---------------+---------------
+ t | t | t | t
+(1 row)
+
+-- Batching disabled: no batch stats in text output
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+-- Batching disabled: no batch keys in JSON
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+ has_batches
+-------------
+ f
+(1 row)
+
+reset executor_batch_rows;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..55acb9058ce 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,62 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- Test BATCHES option
+set executor_batch_rows = 64;
+
+create temp table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+
+-- BATCHES without ANALYZE should error
+explain (batches, costs off) select * from batch_test;
+
+-- BATCHES without ANALYZE but with other options
+explain (batches, buffers off, costs off) select * from batch_test;
+
+-- Basic: verify batch stats line appears in text format
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- With filter: batch line still appears
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+
+-- With non-batchable qual (OR): batching still active but
+-- batch qual falls back to per-tuple ExecQual
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000 or b is null');
+
+-- With LIMIT: batch stats appear on child Seq Scan node
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+
+-- Verify batch stats keys present in JSON output
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches,
+ j #> '{0,Plan}' ? 'Average Batch Rows' as has_avg,
+ j #> '{0,Plan}' ? 'Max Batch Rows' as has_max,
+ j #> '{0,Plan}' ? 'Min Batch Rows' as has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+
+-- With LIMIT: batch stats keys on child node in JSON
+select
+ j #> '{0,Plan,Plans,0}' ? 'Batches' as child_has_batches,
+ j #> '{0,Plan,Plans,0}' ? 'Average Batch Rows' as child_has_avg,
+ j #> '{0,Plan,Plans,0}' ? 'Max Batch Rows' as child_has_max,
+ j #> '{0,Plan,Plans,0}' ? 'Min Batch Rows' as child_has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test limit 100'
+) as j;
+
+-- Batching disabled: no batch stats in text output
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- Batching disabled: no batch keys in JSON
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+
+reset executor_batch_rows;
--
2.47.3
[application/octet-stream] v7-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch (6.5K, 4-v7-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch)
download | inline diff:
From 815d001dcc7a2cda50e3d55522bfaf30ad7fceee Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 5 Mar 2026 17:42:19 +0900
Subject: [PATCH v7 2/5] Add RowBatch infrastructure for batched tuple
processing
Introduce RowBatch, a data carrier that allows table AMs to deliver
multiple rows per call and the executor to process them as a group.
RowBatch separates three concerns:
- am_payload: opaque, AM-owned storage (e.g. HeapBatch with pinned
page and tuple headers). The AM allocates this in its
scan_begin_batch callback.
- slots[]: TupleTableSlot array, created by RowBatchCreateSlots()
with AM-appropriate slot ops. Populated from am_payload by
ops->materialize_into_slots when the executor needs tuple data.
- max_rows: executor-set upper bound that the AM respects when
filling a batch.
RowBatch does not own selection/filtering state. Which rows survive
qual evaluation is the executor's concern, tracked separately in
scan node state. This keeps RowBatch focused on the AM-to-executor
data transfer boundary.
RowBatchOps provides a vtable for AM-specific operations; currently
only materialize_into_slots is defined.
---
src/backend/executor/Makefile | 1 +
src/backend/executor/execRowBatch.c | 54 ++++++++++++++++++
src/backend/executor/meson.build | 1 +
src/include/executor/execRowBatch.h | 88 +++++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 146 insertions(+)
create mode 100644 src/backend/executor/execRowBatch.c
create mode 100644 src/include/executor/execRowBatch.h
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..99a00e762f6 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
execAsync.o \
+ execRowBatch.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execRowBatch.c b/src/backend/executor/execRowBatch.c
new file mode 100644
index 00000000000..6a298813bd8
--- /dev/null
+++ b/src/backend/executor/execRowBatch.c
@@ -0,0 +1,54 @@
+/*-------------------------------------------------------------------------
+ *
+ * execRowBatch.c
+ * Helpers for RowBatch
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execRowBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execRowBatch.h"
+
+/*
+ * RowBatchCreate
+ * Allocate and initialize a new RowBatch envelope.
+ */
+RowBatch *
+RowBatchCreate(int max_rows)
+{
+ RowBatch *b;
+
+ Assert(max_rows > 0);
+
+ b = palloc(sizeof(RowBatch));
+ b->am_payload = NULL;
+ b->ops = NULL;
+ b->max_rows = max_rows;
+ b->nrows = 0;
+ b->pos = 0;
+ b->materialized = false;
+ b->slot = NULL;
+
+ return b;
+}
+
+/*
+ * RowBatchReset
+ * Reset an existing RowBatch envelope to empty.
+ */
+void
+RowBatchReset(RowBatch *b, bool drop_slots)
+{
+ Assert(b != NULL);
+
+ b->nrows = 0;
+ b->pos = 0;
+ b->materialized = false;
+ /* b->slot belongs to the owning PlanState node */
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index dc45be0b2ce..fd0bf80bacd 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'execAmi.c',
'execAsync.c',
+ 'execRowBatch.c',
'execCurrent.c',
'execExpr.c',
'execExprInterp.c',
diff --git a/src/include/executor/execRowBatch.h b/src/include/executor/execRowBatch.h
new file mode 100644
index 00000000000..021fdeecc73
--- /dev/null
+++ b/src/include/executor/execRowBatch.h
@@ -0,0 +1,88 @@
+/*-------------------------------------------------------------------------
+ *
+ * execRowBatch.h
+ * Executor batch envelope for passing row batch state upward
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execRowBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECROWBATCH_H
+#define EXECROWBATCH_H
+
+#include "executor/tuptable.h"
+
+typedef struct RowBatchOps RowBatchOps;
+
+/*
+ * RowBatch
+ *
+ * Data carrier from table AM to executor. The AM populates am_payload
+ * and nrows via scan_getnextbatch(). The executor calls ops->materialize_all
+ * to populate slots[] when it needs tuple data.
+ *
+ * Selection state (which rows survived qual eval) is owned by the executor,
+ * not the batch.
+ */
+typedef struct RowBatch
+{
+ void *am_payload;
+ const RowBatchOps *ops;
+
+ int max_rows; /* executor-set upper bound */
+ int nrows; /* rows TAM put in */
+ int pos; /* iteration position */
+ bool materialized; /* tuples in slots valid? */
+
+ TupleTableSlot *slot; /* row view */
+} RowBatch;
+
+/*
+ * RowBatchOps -- AM-specific operations on a RowBatch.
+ *
+ * Table AMs set b->ops during scan_begin_batch to provide
+ * callbacks that the executor uses to access batch contents.
+ *
+ * repoint_slot re-points the batch's single slot to the tuple at
+ * index idx within the current batch. The slot remains valid until
+ * the next call or until the batch is exhausted.
+ *
+ * Additional callbacks can be added here as new AMs or executor
+ * features require them.
+ */
+typedef struct RowBatchOps
+{
+ void (*repoint_slot) (RowBatch *b, int idx);
+} RowBatchOps;
+
+/* Create/teardown */
+extern RowBatch *RowBatchCreate(int max_rows);
+extern void RowBatchReset(RowBatch *b, bool drop_slots);
+
+/* Validation */
+static inline bool
+RowBatchIsValid(RowBatch *b)
+{
+ return b != NULL && b->max_rows > 0;
+}
+
+/* Iteration over materialized slots */
+static inline bool
+RowBatchHasMore(RowBatch *b)
+{
+ return b->pos < b->nrows;
+}
+
+static inline TupleTableSlot *
+RowBatchGetNextSlot(RowBatch *b)
+{
+ if (b->pos >= b->nrows)
+ return NULL;
+ b->ops->repoint_slot(b, b->pos++);
+ return b->slot;
+}
+
+#endif /* EXECROWBATCH_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 35acda59851..e5c172628b3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2694,6 +2694,8 @@ RoleSpec
RoleSpecType
RoleStmtType
RollupData
+RowBatch
+RowBatchOps
RowCompareExpr
RowExpr
RowIdentityVarInfo
--
2.47.3
[application/octet-stream] v7-0003-Add-batch-table-AM-API-and-heapam-implementation.patch (19.0K, 5-v7-0003-Add-batch-table-AM-API-and-heapam-implementation.patch)
download | inline diff:
From dd122f0913affbafe95ee4fc79eb656b482fe1e0 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 23 Mar 2026 18:21:47 +0900
Subject: [PATCH v7 3/5] Add batch table AM API and heapam implementation
Introduce table AM callbacks for batched tuple fetching:
scan_begin_batch, scan_getnextbatch, scan_reset_batch, and
scan_end_batch. AMs implement all four or none; checked by
table_supports_batching().
scan_reset_batch releases held resources (e.g. buffer pins)
without freeing, allowing reuse across rescans.
Provide the heapam implementation. HeapPageBatch (stored in
RowBatch.am_payload) is a thin slice descriptor over the scan's
rs_vistuples[] array, which was introduced in the previous commit.
Rather than owning a copy of tuple headers, HeapPageBatch holds a
pointer into scan->rs_vistuples[] for the current slice and a buffer
pin for the current page.
heap_getnextbatch() calls heap_prepare_pagescan() to populate
rs_vistuples[] for each new page, then re-points hb->tuples to the
next slice of rs_vistuples[] on each call. If the page has more
tuples than the executor's max_rows, subsequent calls return the
next slice without re-entering page preparation. The buffer pin is
held until the page is fully consumed.
scan_begin_batch creates a single TupleTableSlot with
TTSOpsBufferHeapTuple ops. heap_repoint_slot() re-points this slot
to each tuple in turn via ExecStoreBufferHeapTuple(). Consumers
that need to retain the slot across calls rely on the normal slot
materialization contract.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 229 ++++++++++++++++++++++-
src/backend/access/heap/heapam_handler.c | 8 +-
src/include/access/heapam.h | 33 ++++
src/include/access/tableam.h | 136 ++++++++++++++
src/include/pgstat.h | 4 +-
5 files changed, 403 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b70c75c8288..d45f509fa6b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -43,6 +43,7 @@
#include "catalog/pg_database.h"
#include "catalog/pg_database_d.h"
#include "commands/vacuum.h"
+#include "executor/execRowBatch.h"
#include "pgstat.h"
#include "port/pg_bitutils.h"
#include "storage/lmgr.h"
@@ -109,6 +110,7 @@ static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+static void heap_repoint_slot(RowBatch *b, int idx);
/*
@@ -1214,7 +1216,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_cbuf = InvalidBuffer;
/*
- * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
+ * Disable page-at-a-time mode if the snapshot does not allow it.
*/
if (!(snapshot && IsMVCCSnapshot(snapshot)))
scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;
@@ -1464,7 +1466,7 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
return &scan->rs_ctup;
}
@@ -1492,13 +1494,232 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
ExecStoreBufferHeapTuple(&scan->rs_ctup, slot,
scan->rs_cbuf);
return true;
}
+/*---------- Batching support -----------*/
+
+static const RowBatchOps RowBatchHeapOps =
+{
+ .repoint_slot = heap_repoint_slot
+};
+
+/*
+ * heap_batch_feasible
+ * Batching requires a MVCC snapshot since it relies on
+ * page-at-a-time mode, which heap_beginscan() disables for
+ * non-MVCC snapshots.
+ */
+bool
+heap_batch_feasible(Relation relation, Snapshot snapshot)
+{
+ return snapshot && IsMVCCSnapshot(snapshot);
+}
+
+/*
+ * heap_begin_batch
+ * Initialize AM-side batch state for a heap scan.
+ *
+ * Allocates a HeapPageBatch, which acts as a thin slice descriptor over
+ * the scan's rs_vistuples[] array. Unlike the previous version there is
+ * no separate tuple header storage in HeapPageBatch itself; rs_vistuples[]
+ * in HeapScanDescData (populated by page_collect_tuples() via
+ * heap_prepare_pagescan()) serves as the page-level buffer. HeapPageBatch
+ * holds a pointer into that array for the current slice and the buffer pin
+ * for the current page.
+ *
+ * b->slot must be a TTSOpsBufferHeapTuple slot.
+ */
+void
+heap_begin_batch(TableScanDesc sscan, RowBatch *b)
+{
+ HeapPageBatch *hb;
+
+ /* Batch path relies on executor-level qual eval, not AM scan keys */
+ Assert(sscan->rs_nkeys == 0);
+ Assert(TTS_IS_BUFFERTUPLE(b->slot));
+
+ hb = palloc(sizeof(HeapPageBatch));
+ hb->tuples = NULL;
+ hb->ntuples = 0;
+ hb->nextitem = 0;
+ hb->buf = InvalidBuffer;
+
+ b->am_payload = hb;
+ b->ops = &RowBatchHeapOps;
+}
+
+/*
+ * heap_reset_batch
+ * Release pin and reset for rescan, keeping allocations.
+ */
+void
+heap_reset_batch(TableScanDesc sscan, RowBatch *b)
+{
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+
+ Assert(hb != NULL);
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+ hb->ntuples = 0;
+ hb->nextitem = 0;
+}
+
+/*
+ * heap_end_batch
+ * Release all batch resources.
+ */
+void
+heap_end_batch(TableScanDesc sscan, RowBatch *b)
+{
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+
+ if (BufferIsValid(hb->buf))
+ ReleaseBuffer(hb->buf);
+
+ pfree(hb);
+ b->am_payload = NULL;
+}
+
+/*
+ * heap_getnextbatch
+ * Fetch the next slice of visible tuples from a heap scan.
+ *
+ * Serves slices from the current page's rs_vistuples[] array. If the
+ * current page has remaining tuples, sets hb->tuples to point at the next
+ * slice without re-entering the page scan. If the page is exhausted,
+ * advances to the next page via heap_fetch_next_buffer(), prepares it
+ * with heap_prepare_pagescan(), and serves the first slice from it.
+ *
+ * hb->tuples points directly into scan->rs_vistuples[]; the entries remain
+ * valid as long as hb->buf (the page's buffer pin) is held. The pin is
+ * released at the top of the next call once the page is fully consumed.
+ *
+ * Each call returns at most b->max_rows tuples.
+ *
+ * Returns true if tuples were fetched, false at end of scan.
+ */
+bool
+heap_getnextbatch(TableScanDesc sscan, RowBatch *b, ScanDirection dir)
+{
+ HeapScanDesc scan = (HeapScanDesc) sscan;
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+ int remaining;
+ int nserve;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+
+ /*
+ * Try to serve from the current page first. No page advance, no buffer
+ * management, no re-entry into heap code.
+ */
+ remaining = scan->rs_ntuples - hb->nextitem;
+ if (remaining > 0)
+ {
+ nserve = Min(remaining, b->max_rows);
+
+ hb->tuples = &scan->rs_vistuples[hb->nextitem];
+ hb->ntuples = nserve;
+ hb->nextitem += nserve;
+
+ b->nrows = nserve;
+ b->pos = 0;
+
+ pgstat_count_heap_getnext(sscan->rs_rd, nserve);
+ return true;
+ }
+
+ /*
+ * Current page exhausted. Advance to the next page with visible tuples.
+ */
+ for (;;)
+ {
+ /*
+ * Release the previous page's pin. The page is fully consumed at
+ * this point -- all slices have been served.
+ */
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+
+ heap_fetch_next_buffer(scan, dir);
+
+ if (!BufferIsValid(scan->rs_cbuf))
+ {
+ /* End of scan */
+ scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_prefetch_block = InvalidBlockNumber;
+ scan->rs_inited = false;
+ b->nrows = 0;
+ return false;
+ }
+
+ Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+
+ /*
+ * Prepare the page: prune, run visibility checks, and populate
+ * scan->rs_vistuples[0..rs_ntuples-1] via page_collect_tuples().
+ */
+ heap_prepare_pagescan(sscan);
+
+ if (scan->rs_ntuples > 0)
+ {
+ /*
+ * Pin the page so tuple data stays valid while the executor
+ * processes slices. Released at the top of the next call
+ * once the page is fully consumed.
+ */
+ IncrBufferRefCount(scan->rs_cbuf);
+ hb->buf = scan->rs_cbuf;
+
+ nserve = Min(scan->rs_ntuples, b->max_rows);
+
+ hb->tuples = &scan->rs_vistuples[0];
+ hb->ntuples = nserve;
+ hb->nextitem = nserve;
+
+ b->nrows = nserve;
+ b->pos = 0;
+
+ pgstat_count_heap_getnext(sscan->rs_rd, nserve);
+ return true;
+ }
+
+ /* Empty page (all dead/invisible tuples), try next */
+ }
+}
+
+/*
+ * heap_repoint_slot
+ * Re-point the batch's single slot to the tuple at index idx.
+ *
+ * Called by RowBatchGetNextSlot() for each tuple served to the parent
+ * node. hb->tuples[idx] was populated by page_collect_tuples() via
+ * heap_prepare_pagescan() and remains valid as long as hb->buf is pinned.
+ */
+static void
+heap_repoint_slot(RowBatch *b, int idx)
+{
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+
+ Assert(idx >= 0 && idx < hb->ntuples);
+ Assert(TTS_IS_BUFFERTUPLE(b->slot));
+
+ ExecStoreBufferHeapTuple(&hb->tuples[idx], b->slot, hb->buf);
+}
+
+/*----- End of batching support -----*/
+
void
heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid)
@@ -1640,7 +1861,7 @@ heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
* if we get here it means we have a new current scan tuple, so point to
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
ExecStoreBufferHeapTuple(&scan->rs_ctup, slot, scan->rs_cbuf);
return true;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 88add129674..828b1a71362 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2245,7 +2245,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
ExecStoreBufferHeapTuple(tuple, slot, hscan->rs_cbuf);
/* Count successfully-fetched tuples as heap fetches */
- pgstat_count_heap_getnext(scan->rs_rd);
+ pgstat_count_heap_getnext(scan->rs_rd, 1);
return true;
}
@@ -2535,6 +2535,12 @@ static const TableAmRoutine heapam_methods = {
.scan_rescan = heap_rescan,
.scan_getnextslot = heap_getnextslot,
+ .scan_batch_feasible = heap_batch_feasible,
+ .scan_begin_batch = heap_begin_batch,
+ .scan_getnextbatch = heap_getnextbatch,
+ .scan_end_batch = heap_end_batch,
+ .scan_reset_batch = heap_reset_batch,
+
.scan_set_tidrange = heap_set_tidrange,
.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 56f2d1a5748..d980dd29a44 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -106,6 +106,32 @@ typedef struct HeapScanDescData
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
+/*
+ * HeapPageBatch -- heapam-private page-level batch state.
+ *
+ * Thin slice descriptor over the scan's rs_vistuples[] array. Rather
+ * than owning a copy of tuple headers, HeapPageBatch holds a pointer
+ * into scan->rs_vistuples[] for the current slice, which was populated
+ * by page_collect_tuples() during heap_prepare_pagescan().
+ *
+ * The executor consumes tuples in slices. Each heap_getnextbatch call
+ * re-points tuples to the next slice and advances nextitem, serving up
+ * to RowBatch.max_rows tuples from the current page before advancing
+ * to the next.
+ *
+ * buf holds the pin for the current page. tuple data referenced via
+ * tuples remains valid as long as buf is pinned.
+ *
+ * Stored in RowBatch.am_payload.
+ */
+typedef struct HeapPageBatch
+{
+ HeapTupleData *tuples; /* points into scan->rs_vistuples[nextitem] */
+ int ntuples; /* tuples in current slice */
+ int nextitem; /* next unserved tuple index in rs_vistuples[] */
+ Buffer buf; /* pinned buffer for current page */
+} HeapPageBatch;
+
typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
@@ -360,6 +386,13 @@ extern void heap_endscan(TableScanDesc sscan);
extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
extern bool heap_getnextslot(TableScanDesc sscan,
ScanDirection direction, TupleTableSlot *slot);
+
+extern bool heap_batch_feasible(Relation relation, Snapshot snapshot);
+extern void heap_begin_batch(TableScanDesc sscan, RowBatch *batch);
+extern bool heap_getnextbatch(TableScanDesc sscan, RowBatch *batch, ScanDirection dir);
+extern void heap_end_batch(TableScanDesc sscan, RowBatch *batch);
+extern void heap_reset_batch(TableScanDesc sscan, RowBatch *batch);
+
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4647785fd35..28caa3dcf37 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -303,6 +303,8 @@ typedef void (*IndexBuildCallback) (Relation index,
bool tupleIsAlive,
void *state);
+typedef struct RowBatch RowBatch;
+
/*
* API struct for a table AM. Note this must be allocated in a
* server-lifetime manner, typically as a static const struct, which then gets
@@ -380,6 +382,56 @@ typedef struct TableAmRoutine
ScanDirection direction,
TupleTableSlot *slot);
+ /* ------------------------------------------------------------------------
+ * Batched scan support
+ * ------------------------------------------------------------------------
+ */
+
+ /*
+ * Returns true if the AM can support batching for a scan with the
+ * given snapshot. Called at plan init time before the scan descriptor
+ * exists. AMs that have no snapshot-based restrictions can omit this
+ * callback, in which case batching is considered feasible.
+ */
+ bool (*scan_batch_feasible)(Relation relation, Snapshot snapshot);
+
+ /*
+ * Initialize AM-owned batch state for a scan. Called once before
+ * the first scan_getnextbatch call. The AM allocates whatever
+ * private state it needs and stores it in b->am_payload. b->slot
+ * is the scan node's ss_ScanTupleSlot, whose type was already
+ * determined by the AM via table_slot_callbacks(). The AM's
+ * repoint_slot callback re-points it to each tuple in the batch
+ * in turn. Future interfaces may allow the AM to expose batch
+ * data in other forms without going through a slot.
+ */
+ void (*scan_begin_batch)(TableScanDesc sscan, RowBatch *b);
+
+ /*
+ * Fetch the next batch of tuples from the scan into b. Sets b->nrows
+ * to the number of tuples available and resets b->pos to 0. Returns
+ * true if any tuples were fetched, false at end of scan. The caller
+ * advances through the batch via RowBatchGetNextSlot(), which calls
+ * ops->repoint_slot for each position up to b->nrows.
+ */
+ bool (*scan_getnextbatch)(TableScanDesc sscan, RowBatch *b,
+ ScanDirection dir);
+
+ /*
+ * Release all AM-owned batch resources, including any buffer pins
+ * held in am_payload. Called when the scan node is shut down.
+ * After this call b->am_payload must not be used.
+ */
+ void (*scan_end_batch)(TableScanDesc sscan, RowBatch *b);
+
+ /*
+ * Reset batch state for rescan. Release any held resources (e.g.
+ * buffer pins) and reset counts, but keep the allocation so the
+ * next getnextbatch call can reuse it without re-entering
+ * begin_batch.
+ */
+ void (*scan_reset_batch)(TableScanDesc sscan, RowBatch *b);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -1099,6 +1151,90 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_supports_batching
+ * Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation, Snapshot snapshot)
+{
+ const TableAmRoutine *tam = relation->rd_tableam;
+
+ if (tam->scan_getnextbatch == NULL)
+ return false;
+
+ Assert(tam->scan_begin_batch != NULL);
+ Assert(tam->scan_reset_batch != NULL);
+ Assert(tam->scan_end_batch != NULL);
+
+ /*
+ * Optional: AM may restrict batching based on snapshot or other conditions.
+ */
+ if (tam->scan_batch_feasible != NULL &&
+ !tam->scan_batch_feasible(relation, snapshot))
+ return false;
+
+ return true;
+}
+
+/*
+ * table_scan_begin_batch
+ * Allocate AM-owned batch payload in the RowBatch
+ */
+static inline void
+table_scan_begin_batch(TableScanDesc sscan, RowBatch *b)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_begin_batch != NULL);
+
+ return tam->scan_begin_batch(sscan, b);
+}
+
+/*
+ * table_scan_getnextbatch
+ * Fetch the next batch of tuples from the AM. Returns true if tuples
+ * were fetched, false at end of scan. Only forward scans are supported.
+ */
+static inline bool
+table_scan_getnextbatch(TableScanDesc sscan, RowBatch *b, ScanDirection dir)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(tam->scan_getnextbatch != NULL);
+
+ return tam->scan_getnextbatch(sscan, b, dir);
+}
+
+/*
+ * table_scan_end_batch
+ * Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, RowBatch *b)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_end_batch != NULL);
+
+ tam->scan_end_batch(sscan, b);
+}
+
+/*
+ * table_scan_reset_batch
+ * Reset AM-owned batch state for rescan without freeing.
+ */
+static inline void
+table_scan_reset_batch(TableScanDesc sscan, RowBatch *b)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_reset_batch != NULL);
+
+ tam->scan_reset_batch(sscan, b);
+}
+
/* ----------------------------------------------------------------------------
* TID Range scanning related functions.
* ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2786a7c5ffb..df06e33fba2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -719,10 +719,10 @@ extern void pgstat_report_analyze(Relation rel,
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.numscans++; \
} while (0)
-#define pgstat_count_heap_getnext(rel) \
+#define pgstat_count_heap_getnext(rel, n) \
do { \
if (pgstat_should_count_relation(rel)) \
- (rel)->pgstat_info->counts.tuples_returned++; \
+ (rel)->pgstat_info->counts.tuples_returned += (n); \
} while (0)
#define pgstat_count_heap_fetch(rel) \
do { \
--
2.47.3
[application/octet-stream] v7-0004-SeqScan-add-batch-driven-variants-returning-slots.patch (12.6K, 6-v7-0004-SeqScan-add-batch-driven-variants-returning-slots.patch)
download | inline diff:
From e76a49df42dbf22a3169eb2e1d880d9282c1f02f Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 5 Mar 2026 11:28:16 +0900
Subject: [PATCH v7 4/5] SeqScan: add batch-driven variants returning slots
Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per-tuple AM crossings without
changing the node interface seen by parents.
SeqScanState gains a RowBatch pointer that holds the current batch
when batching is active. Batch state is localized to SeqScanState
-- no changes to PlanState or ScanState.
Add executor_batch_rows GUC (DEVELOPER_OPTIONS, default 64) to
control the maximum batch size. Setting it to 0 disables batching.
XXX currently ignored when reading from heapam tables.
Wire up runtime selection in ExecInitSeqScan via
SeqScanCanUseBatching(). When executor_batch_rows > 1, EPQ is
inactive, the scan is forward-only, and the relation's AM supports
batching, ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used with zero overhead.
Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/executor/nodeSeqscan.c | 278 ++++++++++++++++++++++
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_parameters.dat | 9 +
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 2 +
5 files changed, 293 insertions(+)
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 04803b0e37d..d0ce8858c49 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -29,12 +29,17 @@
#include "access/relscan.h"
#include "access/tableam.h"
+#include "executor/execRowBatch.h"
#include "executor/execScan.h"
#include "executor/executor.h"
#include "executor/nodeSeqscan.h"
#include "utils/rel.h"
static TupleTableSlot *SeqNext(SeqScanState *node);
+static TupleTableSlot *ExecSeqScanBatchSlot(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithQual(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithProject(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithQualProject(PlanState *pstate);
/* ----------------------------------------------------------------
* Scan Support
@@ -205,6 +210,273 @@ ExecSeqScanEPQ(PlanState *pstate)
(ExecScanRecheckMtd) SeqRecheck);
}
+/* ----------------------------------------------------------------
+ * Batch Support
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * SeqScanCanUseBatching
+ * Check whether this SeqScan can use batch mode execution.
+ *
+ * Batching requires: the GUC is enabled, no EPQ recheck is active, the scan
+ * is forward-only, and the table AM supports batching with the current
+ * snapshot (see table_supports_batching()).
+ */
+static bool
+SeqScanCanUseBatching(SeqScanState *scanstate, int eflags)
+{
+ Relation relation = scanstate->ss.ss_currentRelation;
+
+ return executor_batch_rows > 1 &&
+ relation &&
+ table_supports_batching(relation,
+ scanstate->ss.ps.state->es_snapshot) &&
+ !(eflags & EXEC_FLAG_BACKWARD) &&
+ scanstate->ss.ps.state->es_epq_active == NULL;
+}
+
+/*
+ * SeqScanInitBatching
+ * Set up batch execution state and select the appropriate
+ * ExecProcNode variant for batch mode.
+ *
+ * Called from ExecInitSeqScan when SeqScanCanUseBatching returns true.
+ * Overwrites the ExecProcNode pointer set by the non-batch path.
+ */
+static void
+SeqScanInitBatching(SeqScanState *scanstate)
+{
+ RowBatch *batch = RowBatchCreate(MaxHeapTuplesPerPage);
+
+ batch->slot = scanstate->ss.ss_ScanTupleSlot;
+ scanstate->batch = batch;
+
+ /* Choose batch variant */
+ if (scanstate->ss.ps.qual == NULL)
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+ else
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+ }
+ else
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ else
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+ }
+}
+
+/*
+ * SeqScanResetBatching
+ * Reset or tear down batch execution state.
+ *
+ * When drop is false (rescan), resets the RowBatch and releases any
+ * AM-held resources like buffer pins, but keeps allocations for reuse.
+ * When drop is true (end of node), frees everything.
+ */
+static void
+SeqScanResetBatching(SeqScanState *scanstate, bool drop)
+{
+ RowBatch *b = scanstate->batch;
+
+ if (b)
+ {
+ RowBatchReset(b, drop);
+ if (b->am_payload)
+ {
+ if (drop)
+ {
+ table_scan_end_batch(scanstate->ss.ss_currentScanDesc, b);
+ b->am_payload = NULL;
+ }
+ else
+ table_scan_reset_batch(scanstate->ss.ss_currentScanDesc, b);
+ }
+ if (drop)
+ pfree(b);
+ }
+}
+
+/*
+ * SeqNextBatch
+ * Fetch the next batch of tuples from the table AM.
+ *
+ * Lazily initializes the scan descriptor and AM batch state on first
+ * call. Returns false at end of scan.
+ */
+static bool
+SeqNextBatch(SeqScanState *node)
+{
+ TableScanDesc scandesc;
+ EState *estate;
+ ScanDirection direction;
+ RowBatch *b = node->batch;
+
+ Assert(b != NULL);
+
+ /*
+ * get information from the estate and scan state
+ */
+ scandesc = node->ss.ss_currentScanDesc;
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ Assert(ScanDirectionIsForward(direction));
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the scan is not parallel, or if we're serially
+ * executing a scan that was planned to be parallel.
+ */
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL,
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
+ node->ss.ss_currentScanDesc = scandesc;
+ }
+
+ /* Lazily create the AM batch payload. */
+ if (b->am_payload == NULL)
+ {
+ const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+ Assert(tam && tam->scan_begin_batch);
+ table_scan_begin_batch(scandesc, b);
+ }
+
+ if (!table_scan_getnextbatch(scandesc, b, direction))
+ return false;
+
+ return true;
+}
+
+/*
+ * SeqScanBatchSlot
+ * Core loop for batch-driven SeqScan variants.
+ *
+ * Internally fetches tuples in batches from the table AM, but returns
+ * one slot at a time to preserve the single-slot interface expected by
+ * parent nodes. When the current batch is exhausted, fetches and
+ * materializes the next one.
+ *
+ * qual and projInfo are passed explicitly so the compiler can eliminate
+ * dead branches when inlined into the typed wrapper functions (e.g.
+ * ExecSeqScanBatchSlot passes NULL for both).
+ *
+ * EPQ is not supported in the batch path; asserted at entry.
+ */
+static inline TupleTableSlot *
+SeqScanBatchSlot(SeqScanState *node,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ss.ps.ps_ExprContext;
+ RowBatch *b = node->batch;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ss.ps.state->es_epq_active == NULL);
+ Assert(RowBatchIsValid(b));
+
+ for (;;)
+ {
+ TupleTableSlot *in;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next input slot from current batch, or refill */
+ if (!RowBatchHasMore(b))
+ {
+ if (!SeqNextBatch(node))
+ return NULL;
+ }
+
+ in = RowBatchGetNextSlot(b);
+ Assert(in);
+
+ /* No qual, no projection: direct return */
+ if (qual == NULL && projInfo == NULL)
+ return in;
+
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ /* Check qual if present */
+ if (qual != NULL && !ExecQual(qual, econtext))
+ {
+ InstrCountFiltered1(node, 1);
+ continue;
+ }
+
+ /* Project if needed, otherwise return scan tuple directly */
+ if (projInfo != NULL)
+ return ExecProject(projInfo);
+
+ return in;
+ }
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return SeqScanBatchSlot(node, NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtended().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return SeqScanBatchSlot(node, pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScan() but when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return SeqScanBatchSlot(node, NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScan() but when qual evaluation and projection are
+ * required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return SeqScanBatchSlot(node, pstate->qual, pstate->ps_ProjInfo);
+}
+
/* ----------------------------------------------------------------
* ExecInitSeqScan
* ----------------------------------------------------------------
@@ -283,6 +555,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
+ if (SeqScanCanUseBatching(scanstate, eflags))
+ SeqScanInitBatching(scanstate);
+
return scanstate;
}
@@ -302,6 +577,8 @@ ExecEndSeqScan(SeqScanState *node)
*/
scanDesc = node->ss.ss_currentScanDesc;
+ SeqScanResetBatching(node, true);
+
/*
* close heap scan
*/
@@ -331,6 +608,7 @@ ExecReScanSeqScan(SeqScanState *node)
table_rescan(scan, /* scan desc */
NULL); /* new scan keys */
+ SeqScanResetBatching(node, false);
ExecScanReScan((ScanState *) node);
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 36ad708b360..535e29d7823 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int notify_buffers = 16;
int serializable_buffers = 32;
int subtransaction_buffers = 0;
int transaction_buffers = 0;
+
+/* executor batching */
+int executor_batch_rows = 64;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index a315c4ab8ab..a59b5d012a2 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1045,6 +1045,15 @@
boot_val => 'true',
},
+{ name => 'executor_batch_rows', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+ short_desc => 'Number of rows to include in batches during execution.',
+ flags => 'GUC_NOT_IN_SAMPLE',
+ variable => 'executor_batch_rows',
+ boot_val => '64',
+ min => '0',
+ max => '1024',
+},
+
{ name => 'exit_on_error', type => 'bool', context => 'PGC_USERSET', group => 'ERROR_HANDLING_OPTIONS',
short_desc => 'Terminate session on any error.',
variable => 'ExitOnAnyError',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7277c37e779..302c0e33165 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -288,6 +288,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
extern PGDLLIMPORT int VacuumCostBalance;
extern PGDLLIMPORT bool VacuumCostActive;
+extern PGDLLIMPORT int executor_batch_rows;
/* in utils/misc/stack_depth.c */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3ecae7552fc..0f8431ee854 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -70,6 +70,7 @@ typedef struct TupleTableSlot TupleTableSlot;
typedef struct TupleTableSlotOps TupleTableSlotOps;
typedef struct WalUsage WalUsage;
typedef struct WorkerNodeInstrumentation WorkerNodeInstrumentation;
+typedef struct RowBatch RowBatch;
/* ----------------
@@ -1670,6 +1671,7 @@ typedef struct SeqScanState
{
ScanState ss; /* its first field is NodeTag */
Size pscan_len; /* size of parallel heap scan descriptor */
+ RowBatch *batch; /* NULL if batching disabled */
} SeqScanState;
/* ----------------
--
2.47.3
^ permalink raw reply [nested|flat] 9+ messages in thread
end of thread, other threads:[~2026-04-06 12:02 UTC | newest]
Thread overview: 9+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-01-26 09:34 Re: Batching in executor Daniil Davydov <[email protected]>
2026-01-27 03:00 ` Amit Langote <[email protected]>
2026-01-29 07:35 ` Amit Langote <[email protected]>
2026-01-29 10:04 ` Amit Langote <[email protected]>
2026-02-01 14:49 ` Junwang Zhao <[email protected]>
2026-02-03 13:30 ` =?utf-8?B?Y2NhNTUwNw==?= <[email protected]>
2026-02-03 15:54 ` Junwang Zhao <[email protected]>
2026-03-24 00:59 ` Amit Langote <[email protected]>
2026-04-06 12:02 ` Amit Langote <[email protected]>
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox