* Batching in executor
@ 2025-09-26 13:28 Amit Langote <[email protected]>
From: Amit Langote @ 2025-09-26 13:28 UTC (permalink / raw)
To: pgsql-hackers
At PGConf.dev this year we had an unconference session [1] on whether
the community can support an additional batch executor. The discussion
there led me to start hacking on $subject. I have also had off-list
discussions on this topic in recent months with Andres and David, who
have offered useful thoughts.
This patch series is an early attempt to make executor nodes pass
around batches of tuples instead of tuple-at-a-time slots. The main
motivation is to enable expression evaluation in batch form, which can
substantially reduce per-tuple overhead (mainly from function calls)
and open the door to further optimizations such as SIMD usage in
aggregate transition functions. We could even change algorithms of
some plan nodes to operate on batches when, for example, a child node
can return batches.
The expression evaluation changes are still exploratory. Before they
can be made ready for serious review, we first need a way for scan
nodes to produce tuples in batches and an executor API that allows
upper nodes to consume them. The patch set is therefore structured in
two parts: the first three patches lay that groundwork in the executor
and table AM, and the later patches prototype batch-aware expression
evaluation.
Patches 0001-0003 introduce a new batch table AM API and an initial
heapam implementation that can return multiple tuples per call.
SeqScan is adapted to use this interface, with new ExecSeqScanBatch*
routines that fetch tuples in bulk but can still return one
TupleTableSlot at a time to preserve compatibility. On the executor
side, ExecProcNodeBatch() is added alongside ExecProcNode(), with
TupleBatch as the new container for passing groups of tuples. ExecScan
has batch-aware variants that use the AM API internally, but can fall
back to row-at-a-time behavior when required. Plan shapes and EXPLAIN
output remain unchanged; the differences here are executor-internal.
At present, heapam batches are restricted to tuples from a single
page, which means they may not always fill EXEC_BATCH_ROWS (currently
64). That limits how much upper executor nodes can leverage batching,
especially with selective quals where batches may end up sparsely
populated. A future improvement would be to allow batches to span
pages, or to let the scan node request more tuples when its buffer is
not yet full, so that it avoids passing a mostly empty TupleBatch to
upper nodes.
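To make the single-page limit concrete, here is a minimal standalone
sketch. It is not code from the patches: ToyScan and both functions are
hypothetical, and only EXEC_BATCH_ROWS = 64 and the ~43 rows per page
figure come from this mail. The first function stops at a page
boundary, the second keeps reading pages until the batch is full.

```c
#include <stddef.h>

#define EXEC_BATCH_ROWS 64		/* value quoted in the mail */

/* Hypothetical scan cursor; each page holds a known number of tuples. */
typedef struct ToyScan
{
	const int  *tuples_per_page;	/* visible tuples on each page */
	size_t		npages;
	size_t		cur_page;			/* page currently being read */
	int			cur_off;			/* tuples already returned from cur_page */
} ToyScan;

/* Current behavior: a batch never crosses a page boundary. */
static int
toy_next_batch_single_page(ToyScan *scan)
{
	while (scan->cur_page < scan->npages)
	{
		int			avail = scan->tuples_per_page[scan->cur_page] - scan->cur_off;
		int			n;

		if (avail <= 0)
		{
			/* page exhausted, move on to the next one */
			scan->cur_page++;
			scan->cur_off = 0;
			continue;
		}
		n = (avail < EXEC_BATCH_ROWS) ? avail : EXEC_BATCH_ROWS;
		scan->cur_off += n;
		return n;
	}
	return 0;					/* end of scan */
}

/* Possible improvement: keep reading pages until the batch is full. */
static int
toy_next_batch_spanning(ToyScan *scan)
{
	int			filled = 0;

	while (filled < EXEC_BATCH_ROWS && scan->cur_page < scan->npages)
	{
		int			avail = scan->tuples_per_page[scan->cur_page] - scan->cur_off;
		int			room = EXEC_BATCH_ROWS - filled;
		int			n;

		if (avail <= 0)
		{
			scan->cur_page++;
			scan->cur_off = 0;
			continue;
		}
		n = (avail < room) ? avail : room;
		filled += n;
		scan->cur_off += n;
	}
	return filled;
}
```

With three pages of 43 tuples each, the single-page variant returns
batches of 43, 43, 43, while the spanning variant returns 64, 64, 1.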
It might also be worth adding some lightweight instrumentation to make
it easier to reason about batch behavior. For example, counters for
average rows per batch, reasons why a batch ended (capacity reached,
page boundary, end of scan), or batches per million rows could help
confirm whether limitations like the single-page restriction or
EXEC_BATCH_ROWS size are showing up in benchmarks. Suggestions from
others on which forms of instrumentation would be most useful are
welcome.
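As a starting point for discussion, the counters mentioned above could
look roughly like the following. This is only a sketch: BatchStats,
BatchEndReason, and the helper functions are hypothetical names, not
part of the patch series.

```c
#include <stdint.h>

/* Hypothetical reasons for closing a batch, as listed in the mail. */
typedef enum BatchEndReason
{
	BATCH_END_CAPACITY,			/* EXEC_BATCH_ROWS reached */
	BATCH_END_PAGE_BOUNDARY,	/* ran out of tuples on the page */
	BATCH_END_SCAN_DONE,		/* end of scan */
	BATCH_END_NREASONS
} BatchEndReason;

/* Hypothetical per-node counters. */
typedef struct BatchStats
{
	uint64_t	nbatches;
	uint64_t	nrows;
	uint64_t	end_reason[BATCH_END_NREASONS];
} BatchStats;

/* Record one finished batch. */
static void
batch_stats_record(BatchStats *s, int nrows, BatchEndReason why)
{
	s->nbatches++;
	s->nrows += (uint64_t) nrows;
	s->end_reason[why]++;
}

/* Average number of rows per batch, 0 if nothing was recorded. */
static double
batch_stats_avg_rows(const BatchStats *s)
{
	return s->nbatches ? (double) s->nrows / (double) s->nbatches : 0.0;
}

/* Batches needed per million rows, 0 if nothing was recorded. */
static double
batch_stats_batches_per_million_rows(const BatchStats *s)
{
	return s->nrows ? (double) s->nbatches * 1e6 / (double) s->nrows : 0.0;
}
```

A low average together with a high BATCH_END_PAGE_BOUNDARY count would
point at the single-page restriction rather than at EXEC_BATCH_ROWS.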
Patches 0004 onwards start experimenting with making expression
evaluation batch-aware, first in the aggregate node. These patches add
new EEOPs (ExprEvalOps and ExprEvalSteps) to fetch attributes into
TupleBatch vectors, evaluate quals across a batch, and run aggregate
transitions over multiple rows at once. Agg is extended to pull
TupleBatch from its child via ExecProcNodeBatch(), with two prototype
paths: one that loops inside the interpreter and another that calls
the transition function once per batch using AggBulkArgs. These are
still PoCs, but with scan nodes and the executor capable of moving
batches around, they provide a base from which the work can be refined
into something potentially committable after the usual polish,
testing, and review.
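The difference between the two prototype paths can be illustrated
outside the backend. In the sketch below all names are hypothetical
and the function-pointer call merely stands in for an fmgr call: the
row-loop path calls the transition function once per non-NULL row,
while the direct path hands the whole vector to a bulk-aware function
in a single call.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a per-row transition function reached through fmgr. */
typedef int64_t (*toy_transfn) (int64_t state, int64_t value);

/* Counts "fmgr-like" calls so the two paths can be compared. */
static int	toy_call_count = 0;

static int64_t
toy_sum_row(int64_t state, int64_t value)
{
	toy_call_count++;
	return state + value;
}

/* Row-loop path: one indirect call per non-NULL row of the batch. */
static int64_t
toy_trans_rowloop(toy_transfn fn, int64_t state,
				  const int64_t *vals, const bool *nulls, int nrows)
{
	for (int i = 0; i < nrows; i++)
	{
		if (!nulls[i])
			state = fn(state, vals[i]);
	}
	return state;
}

/* Direct path: one call receives the whole vector and loops internally. */
static int64_t
toy_sum_bulk(int64_t state,
			 const int64_t *vals, const bool *nulls, int nrows)
{
	toy_call_count++;
	for (int i = 0; i < nrows; i++)
	{
		if (!nulls[i])
			state += vals[i];
	}
	return state;
}
```

For a four-row batch with one NULL, both paths produce the same sum,
but the row loop makes three calls where the direct path makes one.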
One area that needs more thought is how TupleBatch interacts with
ExprContext. At present the patches extend ExprContext with
scan_batch, inner_batch, and outer_batch fields, but per-batch
evaluation still spills into ecxt_per_tuple_memory, effectively
reusing the per-tuple context for per-batch work. That’s arguably an
abuse of the contract described in ExecEvalExprSwitchContext(), and it
will need a cleaner definition of how batch-scoped memory should be
managed. Feedback on how best to structure that would be particularly
helpful.
To evaluate the overheads and benefits, I ran microbenchmarks with
single and multi-aggregate queries on a single table, with and without
WHERE clauses. Tables were fully VACUUMed so visibility maps are set
and IO costs are minimal. shared_buffers was large enough to fit the
whole table (up to 10M rows, ~43 rows per page), and all pages were
prewarmed into cache before tests. Table schema/script is at [2].
Observations from benchmarking (Detailed benchmark tables are at [3];
below is just a high-level summary of the main patterns):
* Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
sum(a) FROM bar_N): batching scan output alone improved latency by
~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
especially once fmgr overhead was paid per batch instead of per row.
* Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
qual interpreter gave a big step up, with latencies dropping by
~30-40% compared to batching=off.
* Five aggregates, no WHERE: batching input from the child scan cut
~15% off runtime. Adding batched transition evaluation increased
improvements to ~30%.
* Five aggregates, with WHERE: modest gains from scan/input batching,
but per-batch transition evaluation and batched quals brought ~20-30%
improvement.
* Across all cases, executor overheads became visible only after IO
was minimized. Once executor cost dominated, batching consistently
reduced CPU time, with the largest benefits coming from avoiding
per-row fmgr calls and evaluating quals across batches.
I would appreciate it if others could try these patches with their own
microbenchmarks or workloads and see if they can reproduce numbers
similar to mine. Feedback on both the general direction and the
details of the patches would be very helpful. In particular, patches
0001-0003, which add the basic batch APIs and integrate them into
SeqScan, are intended to be the first candidates for review and
eventual commit. Comments on the later, more experimental patches
(aggregate input batching and expression evaluation (qual, aggregate
transition) batching) are also welcome.
--
Thanks, Amit Langote
[1] https://wiki.postgresql.org/wiki/PGConf.dev_2025_Developer_Unconference#Can_the_Community_Support_an...
[2] Tables:
cat create_tables.sh
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
  psql -c "drop table if exists bar_$i; create table bar_$i (a int, b int, c int, d int, e int, f int, g int, h int, i text, j int, k int, l int, m int, n int, o int);" 2>&1 > /dev/null
  psql -c "insert into bar_$i select i, i, i, i, i, i, i, i, repeat('x', 100), i, i, i, i, i, i from generate_series(1, $i) i;" 2>&1 > /dev/null
  echo "bar_$i created."
done
[3] Benchmark result tables
All timings are in milliseconds. off = executor_batching off, on =
executor_batching on. Negative %diff means on is better than off.
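For anyone reproducing the numbers, %diff appears to be the relative
change of "on" with respect to "off". That definition is an assumption
on my part, but it matches the tables. A minimal sketch (pct_diff is a
hypothetical helper, not part of the patches):

```c
/* Relative change of on_ms versus off_ms, in percent (negative = faster). */
static double
pct_diff(double off_ms, double on_ms)
{
	return (on_ms - off_ms) / off_ms * 100.0;
}
```

For example, off = 10.448 and on = 8.147 gives about -22.0.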
Single aggregate, no WHERE
(~8-22% faster with scan batching only; ~33-49% faster with batched transitions)
With only batched-seqscan (0001-0003):
Rows       off        on   %diff
1M      10.448     8.147   -22.0
2M      18.442    14.552   -21.1
3M      25.296    22.195   -12.3
4M      36.285    33.383    -8.0
5M      44.441    39.894   -10.2
10M     93.110    82.744   -11.1
With batched-agg on top (0001-0007):
Rows       off        on   %diff
1M       9.891     5.579   -43.6
2M      17.648     9.653   -45.3
3M      27.451    13.919   -49.3
4M      36.394    24.269   -33.3
5M      44.665    29.260   -34.5
10M     87.898    56.221   -36.0
Single aggregate, with WHERE
(~30–40% faster once quals + transitions are batched)
With only batched-seqscan (0001-0003):
Rows       off        on   %diff
1M      18.485    17.749    -4.0
2M      34.696    33.033    -4.8
3M      49.582    46.155    -6.9
4M      70.270    67.036    -4.6
5M      84.616    81.013    -4.3
10M    174.649   164.611    -5.7
With batched-agg and batched-qual on top (0001-0008):
Rows       off        on   %diff
1M      18.887    12.367   -34.5
2M      35.706    22.457   -37.1
3M      51.626    30.902   -40.1
4M      72.694    48.214   -33.7
5M      88.103    57.623   -34.6
10M    181.350   124.278   -31.5
Five aggregates, no WHERE
(~15% faster with scan/input batching; ~30% with batched transitions)
Agg input batching only (0001-0004):
Rows       off        on   %diff
1M      23.193    19.196   -17.2
2M      42.177    35.862   -15.0
3M      62.192    51.121   -17.8
4M      83.215    74.665   -10.3
5M      99.426    91.904    -7.6
10M    213.794   184.263   -13.8
Batched transition eval, per-row fmgr (0001-0006):
Rows       off        on   %diff
1M      23.501    19.672   -16.3
2M      44.128    36.603   -17.0
3M      64.466    53.079   -17.7
5M     103.442    97.623    -5.6
10M    219.120   190.354   -13.1
Batched transition eval, per-batch fmgr (0001-0007):
Rows       off        on   %diff
1M      24.238    16.806   -30.7
2M      43.056    30.939   -28.1
3M      62.938    43.295   -31.2
4M      83.346    63.357   -24.0
5M     100.772    78.351   -22.2
10M    213.755   162.203   -24.1
Five aggregates, with WHERE
(~10–15% faster with scan/input batching; ~30% with batched transitions + quals)
Agg input batching only (0001-0004):
Rows       off        on   %diff
1M      24.261    22.744    -6.3
2M      45.802    41.712    -8.9
3M      79.311    72.732    -8.3
4M     107.189    93.870   -12.4
5M     129.172   115.300   -10.7
10M    278.785   236.275   -15.2
Batched transition eval, per-batch fmgr (0001-0007):
Rows       off        on   %diff
1M      24.354    19.409   -20.3
2M      46.888    36.687   -21.8
3M      82.147    57.683   -29.8
4M     109.616    76.471   -30.2
5M     133.777    94.776   -29.2
10M    282.514   194.954   -31.0
Batched transition eval + batched qual (0001-0008):
Rows       off        on   %diff
1M      24.691    20.193   -18.2
2M      47.182    36.530   -22.6
3M      82.030    58.663   -28.5
4M     110.573    76.500   -30.8
5M     136.701    93.299   -31.7
10M    280.551   191.021   -31.9
Attachments:
[application/octet-stream] v1-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch (11.2K, 2-v1-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch)
From 0bdb18284cb034cf80ac56125b5682e84b856a26 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 9 Sep 2025 21:43:29 +0900
Subject: [PATCH v1 7/8] WIP: Add EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT
The new EEOP runs a plain aggregate transition over a TupleBatch with
a single fmgr call. Batch vectors are passed to the transfn via
AggBulkArgs stored in fcinfo->flinfo->fn_extra, avoiding per-row fmgr
overhead.
Gate selection with AggTransfnSupportsBulk(), an allowlist of
built-in transfns updated to accept AggBulkArgs. Some integer
transfns are taught to read AggBulkArgs when present, else fall
back. Rowloop batching remains available; unsupported aggregates keep
the row path.
---
src/backend/executor/execExpr.c | 28 ++++++++++++++++-
src/backend/executor/execExprInterp.c | 43 ++++++++++++++++++++++++++
src/backend/executor/nodeAgg.c | 1 -
src/backend/jit/llvm/llvmjit_expr.c | 1 +
src/backend/utils/adt/int.c | 32 +++++++++++++++++++
src/backend/utils/adt/int8.c | 44 +++++++++++++++++++++++++++
src/backend/utils/adt/numeric.c | 17 +++++++++++
src/include/executor/execExpr.h | 1 +
src/include/executor/executor.h | 20 ++++++++++++
9 files changed, 185 insertions(+), 2 deletions(-)
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index af5ed8b6368..27a5780f557 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -47,6 +47,7 @@
#include "utils/acl.h"
#include "utils/array.h"
#include "utils/builtins.h"
+#include "utils/fmgroids.h"
#include "utils/jsonfuncs.h"
#include "utils/jsonpath.h"
#include "utils/lsyscache.h"
@@ -3692,6 +3693,28 @@ AggTransCanUseBatch(AggState *as, AggStatePerTrans pt)
return true;
}
+/* Return true if this transfn OID is known to accept AggBulkArgs. */
+static bool
+AggTransfnSupportsBulk(Oid fn_oid)
+{
+ /* Phase 1: hard-coded allowlist of built-ins that accept AggBulkArgs. */
+ static const Oid ok[] =
+ {
+ F_INT8INC_ANY, /* COUNT(*) transfn */
+ F_INT8INC, /* COUNT(arg) transfn */
+ F_INT4_SUM, /* SUM(int) transfn */
+ F_INT4SMALLER, /* MIN(int) transfn */
+ F_INT4LARGER, /* MAX(int) transfn */
+ /* add others as they are made bulk-aware */
+ InvalidOid
+ };
+
+ for (int i = 0; OidIsValid(ok[i]); i++)
+ if (ok[i] == fn_oid)
+ return true;
+ return false;
+}
+
/*
* Build transition/combine function invocations for all aggregate transition
* / combination function invocations in a grouping sets phase. This has to
@@ -4150,7 +4173,10 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
{
if (bv)
bvs = BatchVectorSliceFromExprArgs(pertrans->aggref->args, bv);
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+ if (!AggTransfnSupportsBulk(pertrans->transfn_oid))
+ scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+ else
+ scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT;
}
else if (pertrans->transtypeByVal)
{
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3176679b346..41ad9b4838d 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -607,6 +607,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,
&&CASE_EEOP_LAST
};
@@ -2345,6 +2346,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT)
+ {
+ /* too complex for an inline implementation */
+ ExecAggPlainTransBatch(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_LAST)
{
/* unreachable */
@@ -6138,6 +6147,40 @@ ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext
pergroup->transValueIsNull = fcinfo->isnull;
}
break;
+
+ case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
+ {
+ void *save = fcinfo->flinfo->fn_extra;
+ AggBulkArgs ba = {batch_nrows, start_row};
+
+ if (bvs)
+ {
+ const BatchVector *bv = bvs->bv;
+
+ Assert(bv);
+ ba.nargs = bvs->nargs;
+ ba.argoffs = bvs->argoffs;
+ ba.args = bv->cols;
+ ba.isnull = bv->nulls;
+ ba.hasnull = bv->hasnull;
+ }
+ fcinfo->flinfo->fn_extra = &ba;
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+ newVal = FunctionCallInvoke(fcinfo); /* one call for the entire slice */
+ if (!pertrans->transtypeByVal &&
+ DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggCopyTransValue(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+ fcinfo->flinfo->fn_extra = save;
+ }
+ break;
+
default:
elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 662d8bef43b..a2286ef5e54 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2687,7 +2687,6 @@ agg_retrieve_direct_batch(AggState *aggstate)
initialize_aggregates(aggstate, aggstate->pergroups,
Max(aggstate->phase->numsets, 1));
-
if (aggstate->grp_firstTuple)
{
ExecForceStoreHeapTuple(aggstate->grp_firstTuple, firstSlot, true);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index efb3ee639fc..45346124bd7 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3026,6 +3026,7 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
v_state, op, v_econtext);
diff --git a/src/backend/utils/adt/int.c b/src/backend/utils/adt/int.c
index b5781989a64..eb1780b5590 100644
--- a/src/backend/utils/adt/int.c
+++ b/src/backend/utils/adt/int.c
@@ -1363,18 +1363,50 @@ int2smaller(PG_FUNCTION_ARGS)
Datum
int4larger(PG_FUNCTION_ARGS)
{
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
int32 arg1 = PG_GETARG_INT32(0);
int32 arg2 = PG_GETARG_INT32(1);
+ if (unlikely(ba))
+ {
+ int32 result = arg1;
+
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+ if (arg2 > result)
+ result = arg2;
+ }
+ }
+ PG_RETURN_INT32(result);
+ }
PG_RETURN_INT32((arg1 > arg2) ? arg1 : arg2);
}
Datum
int4smaller(PG_FUNCTION_ARGS)
{
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
int32 arg1 = PG_GETARG_INT32(0);
int32 arg2 = PG_GETARG_INT32(1);
+ if (unlikely(ba))
+ {
+ int32 result = arg1;
+
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+ if (arg2 < result)
+ result = arg2;
+ }
+ }
+ PG_RETURN_INT32(result);
+ }
PG_RETURN_INT32((arg1 < arg2) ? arg1 : arg2);
}
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index bdea490202a..bbabf4e0785 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -461,10 +461,28 @@ int8up(PG_FUNCTION_ARGS)
Datum
int8pl(PG_FUNCTION_ARGS)
{
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
int64 arg1 = PG_GETARG_INT64(0);
int64 arg2 = PG_GETARG_INT64(1);
int64 result;
+ if (unlikely(ba))
+ {
+ result = arg1;
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ arg2 = ba->args[ba->argoffs[0]][i];
+ if (unlikely(pg_add_s64_overflow(arg1, arg2, &result)))
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("bigint out of range")));
+ arg1 = result;
+ }
+ }
+ PG_RETURN_INT64(result);
+ }
if (unlikely(pg_add_s64_overflow(arg1, arg2, &result)))
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
@@ -718,9 +736,35 @@ int8lcm(PG_FUNCTION_ARGS)
Datum
int8inc(PG_FUNCTION_ARGS)
{
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
int64 arg = PG_GETARG_INT64(0);
int64 result;
+ if (unlikely(ba))
+ {
+ result = arg;
+ if (!ba->hasnull || ba->nargs == 0)
+ {
+ if (unlikely(pg_add_s64_overflow(arg, ba->nrows - ba->start_row, &result)))
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("bigint out of range")));
+ PG_RETURN_INT64(result);
+ }
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ if (unlikely(pg_add_s64_overflow(arg, 1, &result)))
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("bigint out of range")));
+ arg = result;
+ }
+ }
+ PG_RETURN_INT64(result);
+ }
+
if (unlikely(pg_add_s64_overflow(arg, 1, &result)))
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 76269918593..b02664c97f5 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -6310,6 +6310,23 @@ int4_sum(PG_FUNCTION_ARGS)
{
int64 oldsum;
int64 newval;
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
+
+ if (unlikely(ba))
+ {
+ int64 result = (!PG_ARGISNULL(0) ? PG_GETARG_INT64(0) : 0);
+
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ int32 arg2 = ba->args[ba->argoffs[0]][i];
+
+ result = result + arg2;
+ }
+ }
+ PG_RETURN_INT64(result);
+ }
if (PG_ARGISNULL(0))
{
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 1d33e084b69..f24782ecf58 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -304,6 +304,7 @@ typedef enum ExprEvalOp
/* Batched aggregate trans evaluation */
EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP, /* per-row fmgr calls */
+ EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT, /* call transfn once with AggBulkArgs */
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 5ba9a523970..c72bd755b79 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -561,6 +561,26 @@ ExecQualAndReset(ExprState *state, ExprContext *econtext)
}
#endif
+#ifndef FRONTEND
+/* Per-call bulk argument vectors for batched aggregate trans functions. */
+typedef struct AggBulkArgs
+{
+ int nrows; /* number of rows in this batch */
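+ int start_row; /* first row of the batch to process */
+ int16 *argoffs; /* argoffs[k] = index into args/isnull for k-th argument */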
+ int nargs; /* number of argument vectors */
+ Datum **args; /* args[j][i] = j-th arg at row i */
+ bool **isnull; /* isnull[j][i] */
+ bool hasnull; /* is any datum in args NULL? */
+} AggBulkArgs;
+
+static inline AggBulkArgs *
+AggGetBulkArgs(FunctionCallInfo fcinfo)
+{
+ return (AggBulkArgs *) (fcinfo->flinfo ? fcinfo->flinfo->fn_extra : NULL);
+}
+#endif
+
extern bool ExecCheck(ExprState *state, ExprContext *econtext);
/*
--
2.43.0
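The bulk branch that the patch above adds to int4larger() can be
studied outside the backend. The following is a minimal standalone
sketch, not backend code: ToyBulkArgs is a simplified stand-in for
AggBulkArgs with Datum replaced by int64_t, and toy_int4larger_bulk()
is a hypothetical name. It mirrors the patch in resolving the argument
column through argoffs[0], starting at start_row, and skipping NULLs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for AggBulkArgs (Datum replaced by int64_t). */
typedef struct ToyBulkArgs
{
	int			nrows;			/* number of rows in this batch */
	int			start_row;		/* first row to process */
	const int16_t *argoffs;		/* argument number -> column index */
	int			nargs;			/* number of argument vectors */
	int64_t   **args;			/* args[col][row] */
	bool	  **isnull;			/* isnull[col][row] */
	bool		hasnull;		/* is any value in args NULL? */
} ToyBulkArgs;

/* MAX(int4) transition over one batch: a single call, NULL rows skipped. */
static int32_t
toy_int4larger_bulk(int32_t state, const ToyBulkArgs *ba)
{
	const int64_t *col = ba->args[ba->argoffs[0]];
	const bool *nulls = ba->isnull[ba->argoffs[0]];
	int32_t		result = state;

	for (int i = ba->start_row; i < ba->nrows; i++)
	{
		if (!nulls[i])
		{
			int32_t		v = (int32_t) col[i];

			if (v > result)
				result = v;
		}
	}
	return result;
}
```

Note that a value in a NULL row is never inspected, so whatever happens
to be stored there cannot affect the result.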
[application/octet-stream] v1-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch (6.3K, 3-v1-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch)
From d5ff8e14add86233afd3c82935d4f72a31859a57 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 4 Sep 2025 22:55:25 +0900
Subject: [PATCH v1 4/8] WIP: Add agg_retrieve_direct_batch() for plain
aggregates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Teach Agg to consume child tuples in batches for AGG_PLAIN. A new
agg_retrieve_direct_batch() pulls TupleBatch from the child via
ExecProcNodeBatch(), materializes as needed, and advances per-agg
transition state over the batch. A first tuple is copied to match
the direct path’s behavior before batch processing.
Add AggCanUsePlainBatch() and select retrieve_plain at init:
batch path when no grouping sets, strategy is AGG_PLAIN, and the
child exposes ExecProcNodeBatch(); otherwise keep the row path.
Plan shape and EXPLAIN remain unchanged. Semantics are identical
to the non-batch direct path; this only reduces per-tuple overhead.
---
src/backend/executor/nodeAgg.c | 123 +++++++++++++++++++++++++++++++++
src/include/nodes/execnodes.h | 5 ++
2 files changed, 128 insertions(+)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a4f3d30f307..3ace6363509 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -820,6 +820,20 @@ advance_aggregates(AggState *aggstate)
aggstate->tmpcontext);
}
+static void
+advance_aggregates_batch(AggState *aggstate, TupleBatch *b)
+{
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ ExprState *evaltrans = aggstate->phase->evaltrans;
+
+ while (TupleBatchHasMore(b))
+ {
+ tmpcontext->ecxt_outertuple = TupleBatchGetNextSlot(b);
+ ExecEvalExprNoReturnSwitchContext(evaltrans, tmpcontext);
+ ResetExprContext(tmpcontext);
+ }
+}
+
/*
* Run the transition function for a DISTINCT or ORDER BY aggregate
* with only one input. This is called after we have completed
@@ -2260,6 +2274,9 @@ ExecAgg(PlanState *pstate)
result = agg_retrieve_hash_table(node);
break;
case AGG_PLAIN:
+ /* init-time choice */
+ result = node->retrieve_plain(node);
+ break;
case AGG_SORTED:
result = agg_retrieve_direct(node);
break;
@@ -2618,6 +2635,91 @@ agg_retrieve_direct(AggState *aggstate)
return NULL;
}
+static TupleTableSlot *
+agg_retrieve_direct_batch(AggState *aggstate)
+{
+ PlanState *child = outerPlanState(aggstate);
+ ExprContext *econtext = aggstate->ss.ps.ps_ExprContext;
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ const bool hasGroupingSets = aggstate->phase->numsets > 0;
+ TupleTableSlot *firstSlot = aggstate->ss.ss_ScanTupleSlot;
+ TupleBatch *b = NULL;
+
+ Assert(child->ExecProcNodeBatch);
+
+ /* mimic the first-tuple copy from agg_retrieve_direct() */
+ for (;;)
+ {
+ b = ExecProcNodeBatch(child);
+ if (b == NULL)
+ {
+ if (hasGroupingSets)
+ {
+ aggstate->input_done = true;
+ break;
+ }
+ aggstate->agg_done = true;
+ break;
+ }
+ if (b->nvalid == 0)
+ continue;
+
+ TupleBatchMaterializeAll(b);
+ aggstate->grp_firstTuple = ExecCopySlotHeapTuple(TupleBatchGetSlot(b, 0));
+ break;
+ }
+
+ /* initialize_aggregates etc. as in the direct path */
+ ReScanExprContext(econtext);
+ for (int i = 0; i < Max(aggstate->phase->numsets, 1); i++)
+ ReScanExprContext(aggstate->aggcontexts[i]);
+
+ initialize_aggregates(aggstate, aggstate->pergroups,
+ Max(aggstate->phase->numsets, 1));
+
+ if (aggstate->grp_firstTuple)
+ {
+ ExecForceStoreHeapTuple(aggstate->grp_firstTuple, firstSlot, true);
+ aggstate->grp_firstTuple = NULL;
+ tmpcontext->ecxt_outertuple = firstSlot;
+
+ advance_aggregates_batch(aggstate, b);
+ ResetExprContext(tmpcontext);
+ }
+
+ /* consume remaining rows in current and subsequent batches */
+ if (b)
+ {
+ if (TupleBatchHasMore(b))
+ advance_aggregates_batch(aggstate, b);
+ for (;;)
+ {
+ b = ExecProcNodeBatch(child);
+ if (b == NULL)
+ {
+ if (hasGroupingSets)
+ aggstate->input_done = true;
+ else
+ aggstate->agg_done = true;
+ break;
+ }
+ if (b->nvalid == 0)
+ continue;
+
+ TupleBatchMaterializeAll(b);
+ advance_aggregates_batch(aggstate, b);
+ }
+ }
+
+ /* finalize and project like the direct path */
+ econtext->ecxt_outertuple = firstSlot;
+ prepare_projection_slot(aggstate, econtext->ecxt_outertuple, 0);
+ select_current_set(aggstate, 0, false);
+ finalize_aggregates(aggstate, aggstate->peragg, aggstate->pergroups[0]);
+
+ return project_aggregates(aggstate);
+}
+
/*
* ExecAgg for hashed case: read input and build hash table
*/
@@ -3265,6 +3367,22 @@ hashagg_reset_spill_state(AggState *aggstate)
}
}
+static bool
+AggCanUsePlainBatch(AggState *aggstate)
+{
+ const Agg *aggnode = (const Agg *) aggstate->ss.ps.plan;
+
+ Assert(outerPlanState(aggstate));
+
+ /* grouping sets present -> bail */
+ if (aggnode->groupingSets != NIL)
+ return false;
+
+ if (aggstate->phase->aggstrategy != AGG_PLAIN)
+ return false;
+
+ return outerPlanState(aggstate)->ExecProcNodeBatch;
+}
/* -----------------
* ExecInitAgg
@@ -4060,6 +4178,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
(errcode(ERRCODE_GROUPING_ERROR),
errmsg("aggregate function calls cannot be nested")));
+ if (AggCanUsePlainBatch(aggstate))
+ aggstate->retrieve_plain = agg_retrieve_direct_batch;
+ else
+ aggstate->retrieve_plain = agg_retrieve_direct;
+
/*
* Build expressions doing all the transition work at once. We build a
* different one for each phase, as the number of transition function
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a104591ac20..9b81b842161 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2535,6 +2535,9 @@ typedef struct AggStatePerGroupData *AggStatePerGroup;
typedef struct AggStatePerPhaseData *AggStatePerPhase;
typedef struct AggStatePerHashData *AggStatePerHash;
+struct AggState;
+typedef TupleTableSlot *(*AggRetrievePlainFn)(struct AggState *);
+
typedef struct AggState
{
ScanState ss; /* its first field is NodeTag */
@@ -2610,6 +2613,8 @@ typedef struct AggState
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
SharedAggInfo *shared_info; /* one entry per worker */
+
+ AggRetrievePlainFn retrieve_plain; /* init-time choice */
} AggState;
/* ----------------
--
2.43.0
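The init-time choice of retrieve_plain in the patch above is an
ordinary function-pointer dispatch: the decision is made once at node
initialization and the per-call path only follows the pointer. A
minimal standalone sketch of that pattern follows; ToyAgg and every
toy_* function are hypothetical, and the three boolean fields merely
model the conditions checked by AggCanUsePlainBatch().

```c
#include <stdbool.h>

struct ToyAgg;
typedef int (*ToyRetrieveFn) (struct ToyAgg *);

/* Hypothetical, heavily reduced stand-in for AggState. */
typedef struct ToyAgg
{
	bool		has_grouping_sets;
	bool		is_plain;				/* strategy is AGG_PLAIN */
	bool		child_supports_batch;	/* child has a batch entry point */
	ToyRetrieveFn retrieve_plain;		/* init-time choice */
} ToyAgg;

/* Row-at-a-time path; the return value only identifies the path taken. */
static int
toy_retrieve_row(ToyAgg *agg)
{
	(void) agg;
	return 1;
}

/* Batch path; the return value only identifies the path taken. */
static int
toy_retrieve_batch(ToyAgg *agg)
{
	(void) agg;
	return 2;
}

/* Same three conditions as in the patch, in the same order. */
static bool
toy_can_use_plain_batch(const ToyAgg *agg)
{
	if (agg->has_grouping_sets)
		return false;
	if (!agg->is_plain)
		return false;
	return agg->child_supports_batch;
}

/* Decide once at init; callers then simply invoke agg->retrieve_plain. */
static void
toy_init_agg(ToyAgg *agg)
{
	if (toy_can_use_plain_batch(agg))
		agg->retrieve_plain = toy_retrieve_batch;
	else
		agg->retrieve_plain = toy_retrieve_row;
}
```

Deciding at init keeps the per-call code free of repeated eligibility
checks, at the cost of one indirect call per retrieval.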
[application/octet-stream] v1-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch (21.5K, 4-v1-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch)
From 992a5e21f7039825b12a6e800efb0265061bbe3a Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 2 Sep 2025 23:46:34 +0900
Subject: [PATCH v1 6/8] WIP: Add EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP
Introduce a batch EEOP that runs plain aggregate transitions by
looping over rows of a TupleBatch. This keeps transition logic in
the interpreter while amortizing per-row costs.
Gate with AggTransCanUseBatch(): plain, non-hashed, single-set
aggregates with no DISTINCT/ORDER/FILTER, and simple Var args.
Extend ExecBuildAggTrans() to prepare batch fetch/build steps and
to return whether a batch path is used.
---
src/backend/executor/execExpr.c | 228 ++++++++++++++++++++++++--
src/backend/executor/execExprInterp.c | 103 ++++++++++++
src/backend/executor/nodeAgg.c | 17 +-
src/backend/jit/llvm/llvmjit_expr.c | 6 +
src/backend/jit/llvm/llvmjit_types.c | 1 +
src/include/executor/execBatch.h | 6 +
src/include/executor/execExpr.h | 14 ++
src/include/executor/executor.h | 3 +-
src/include/executor/nodeAgg.h | 2 +
9 files changed, 363 insertions(+), 17 deletions(-)
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index f1569879b52..af5ed8b6368 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -95,7 +95,9 @@ static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
int transno, int setno, int setoff, bool ishash,
- bool nullcheck);
+ bool nullcheck, bool batch,
+ BatchVector *bv);
+
static void ExecInitJsonExpr(JsonExpr *jsexpr, ExprState *state,
Datum *resv, bool *resnull,
ExprEvalStep *scratch);
@@ -104,6 +106,10 @@ static void ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
bool exists_coerce,
Datum *resv, bool *resnull);
+static BatchVector *BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var);
+static bool ExprListAllSimpleVars(const List *args, Bitmapset **allattnos);
+static BatchVectorSlice *BatchVectorSliceFromExprArgs(const List *args,
+ const BatchVector *bv);
/*
* ExecInitExpr: prepare an expression tree for execution
@@ -3659,6 +3665,33 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
}
}
+/* plain agg, single set, not hashed, no DISTINCT/ORDER/FILTER */
+static inline bool
+AggTransCanUseBatch(AggState *as, AggStatePerTrans pt)
+{
+ Agg *aggnode = (Agg *) as->ss.ps.plan;
+
+ if (!AggCanUsePlainBatch(as))
+ return false;
+ if (as->aggstrategy == AGG_HASHED)
+ return false;
+ if (aggnode->groupingSets != NIL)
+ return false;
+ if (as->phase == NULL || as->phase->numsets > 0)
+ return false;
+
+ /* per-aggregate complications */
+ if (pt->aggsortrequired)
+ return false;
+ if (pt->aggref &&
+ (pt->aggref->aggdistinct != NIL ||
+ pt->aggref->aggorder != NIL ||
+ pt->aggref->aggfilter != NULL))
+ return false;
+
+ return true;
+}
+
/*
* Build transition/combine function invocations for all aggregate transition
* / combination function invocations in a grouping sets phase. This has to
@@ -3675,13 +3708,17 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash, bool nullcheck)
+ bool doSort, bool doHash, bool nullcheck,
+ bool *batch_trans)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
ExprEvalStep scratch = {0};
bool isCombine = DO_AGGSPLIT_COMBINE(aggstate->aggsplit);
ExprSetupInfo deform = {0, 0, 0, 0, 0, NIL};
+ bool batch = AggCanUsePlainBatch(aggstate);
+ Bitmapset *allattnos = NULL;
+ BatchVector *bv = NULL;
state->expr = (Expr *) aggstate;
state->parent = parent;
@@ -3707,8 +3744,36 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
&deform);
expr_setup_walker((Node *) pertrans->aggref->aggfilter,
&deform);
+
+ if (!AggTransCanUseBatch(aggstate, pertrans) ||
+ !ExprListAllSimpleVars(pertrans->aggref->args, &allattnos))
+ batch = false;
}
- ExecPushExprSetupSteps(state, &deform);
+
+ if (batch)
+ {
+ if (deform.last_outer > 0)
+ {
+ Assert(!bms_is_empty(allattnos));
+ bv = BatchVectorCreate(allattnos, deform.last_outer);
+
+ /*
+ * Deform all tuples up to last_outer in batch
+ */
+ scratch.opcode = EEOP_OUTER_FETCHSOME_BATCH;
+ scratch.d.fetch_batch.last_var = deform.last_outer;
+ ExprEvalPushStep(state, &scratch);
+
+ /*
+ * Put all arg Vars into vectors once per batch slice
+ */
+ scratch.opcode = EEOP_BUILD_OUTER_BATCH_VECTOR;
+ scratch.d.batch_vector.bv = bv;
+ ExprEvalPushStep(state, &scratch);
+ }
+ }
+ else
+ ExecPushExprSetupSteps(state, &deform);
/*
* Emit instructions for each transition value / grouping set combination.
@@ -3746,7 +3811,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
* Evaluate arguments to aggregate/combine function.
*/
argno = 0;
- if (isCombine)
+ if (isCombine && !batch)
{
/*
* Combining two aggregate transition values. Instead of directly
@@ -3816,7 +3881,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
Assert(pertrans->numInputs == argno);
}
- else if (!pertrans->aggsortrequired)
+ else if (!pertrans->aggsortrequired && !batch)
{
ListCell *arg;
@@ -3849,7 +3914,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
}
Assert(pertrans->numTransInputs == argno);
}
- else if (pertrans->numInputs == 1)
+ else if (pertrans->numInputs == 1 && !batch)
{
/*
* Non-presorted DISTINCT and/or ORDER BY case, with a single
@@ -3868,7 +3933,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
Assert(pertrans->numInputs == argno);
}
- else
+ else if (!batch)
{
/*
* Non-presorted DISTINCT and/or ORDER BY case, with multiple
@@ -3896,7 +3961,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
* just keep the prior transValue. This is true for both plain and
* sorted/distinct aggregates.
*/
- if (trans_fcinfo->flinfo->fn_strict && pertrans->numTransInputs > 0)
+ if (trans_fcinfo->flinfo->fn_strict && pertrans->numTransInputs > 0 && !batch)
{
if (strictnulls)
scratch.opcode = EEOP_AGG_STRICT_INPUT_CHECK_NULLS;
@@ -3914,7 +3979,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
}
/* Handle DISTINCT aggregates which have pre-sorted input */
- if (pertrans->numDistinctCols > 0 && !pertrans->aggsortrequired)
+ if (pertrans->numDistinctCols > 0 && !pertrans->aggsortrequired && !batch)
{
if (pertrans->numDistinctCols > 1)
scratch.opcode = EEOP_AGG_PRESORTED_DISTINCT_MULTI;
@@ -3942,7 +4007,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
pertrans, transno, setno, setoff, false,
- nullcheck);
+ nullcheck, batch, bv);
setoff++;
}
}
@@ -3962,7 +4027,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
pertrans, transno, setno, setoff, true,
- nullcheck);
+ nullcheck, false, NULL);
setoff++;
}
}
@@ -4007,6 +4072,9 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
ExecReadyExpr(state);
+ if (batch_trans)
+ *batch_trans = batch;
+
return state;
}
@@ -4020,10 +4088,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
int transno, int setno, int setoff, bool ishash,
- bool nullcheck)
+ bool nullcheck, bool batch, BatchVector *bv)
{
ExprContext *aggcontext;
int adjust_jumpnull = -1;
+ BatchVectorSlice *bvs = NULL;
if (ishash)
aggcontext = aggstate->hashcontext;
@@ -4077,7 +4146,13 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
*/
if (!pertrans->aggsortrequired)
{
- if (pertrans->transtypeByVal)
+ if (batch)
+ {
+ if (bv)
+ bvs = BatchVectorSliceFromExprArgs(pertrans->aggref->args, bv);
+ scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+ }
+ else if (pertrans->transtypeByVal)
{
if (fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
@@ -4108,6 +4183,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
scratch->d.agg_trans.setoff = setoff;
scratch->d.agg_trans.transno = transno;
scratch->d.agg_trans.aggcontext = aggcontext;
+ scratch->d.agg_trans.bvs = bvs;
ExprEvalPushStep(state, scratch);
/* fix up jumpnull */
@@ -5070,3 +5146,129 @@ ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
DomainHasConstraints(returning->typid);
ExprEvalPushStep(state, &scratch);
}
+
+/* Is expr a Var node for a non-system attribute? */
+static bool
+expr_is_simple_var(Expr *expr, AttrNumber *out_attno)
+{
+ if (expr == NULL)
+ return false;
+
+ if (IsA(expr, TargetEntry))
+ return expr_is_simple_var((Expr *) ((TargetEntry *) expr)->expr,
+ out_attno);
+ if (IsA(expr, RelabelType))
+ return expr_is_simple_var((Expr *) ((RelabelType *) expr)->arg,
+ out_attno);
+
+ if (IsA(expr, Var) && ((Var *) expr)->varattno > 0)
+ {
+ *out_attno = ((Var *) expr)->varattno;
+ return true;
+ }
+
+ return false;
+}
+
+/* Are all inputs plain Vars? (RelabelType->Var is rejected for now.) Collect attnos. */
+static bool
+ExprListAllSimpleVars(const List *args, Bitmapset **allattnos)
+{
+ ListCell *lc;
+
+ foreach(lc, args)
+ {
+ TargetEntry *tle = lfirst_node(TargetEntry, lc);
+ Expr *arg = tle->expr;
+ AttrNumber attno;
+
+ if (!expr_is_simple_var(arg, &attno))
+ return false;
+
+ if (!IsA(arg, Var))
+ return false;
+
+ Assert(attno > 0);
+ *allattnos = bms_add_member(*allattnos, attno);
+ }
+
+ return true;
+}
+
+/* ---------- BatchVector stuff ------------- */
+
+static BatchVector *
+BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var)
+{
+ int maxrows = EXEC_BATCH_ROWS;
+ BatchVector *bv;
+ AttrNumber attno;
+ int i;
+
+ bv = palloc(sizeof(BatchVector));
+ bv->ncols = bms_num_members(attnos);
+ bv->maxrows = maxrows;
+ bv->last_var = last_var;
+ bv->attnos = palloc(sizeof(AttrNumber) * bv->ncols);
+ attno = -1;
+ i = 0;
+ while ((attno = bms_next_member(attnos, attno)) > 0)
+ bv->attnos[i++] = attno;
+ bv->cols = palloc(sizeof(Datum *) * bv->ncols);
+ bv->nulls = palloc(sizeof(bool *) * bv->ncols);
+
+ for (i = 0; i < bv->ncols; i++)
+ {
+ bv->cols[i] = palloc(sizeof(Datum) * maxrows);
+ bv->nulls[i] = palloc(sizeof(bool) * maxrows);
+ }
+
+ bv->nrows = 0;
+ bv->hasnull = false;
+
+ return bv;
+}
+
+static int16
+BatchVectorFindAttColno(const BatchVector *bv, AttrNumber attno)
+{
+ for (int i = 0; i < bv->ncols; i++)
+ if (bv->attnos[i] == attno)
+ return i;
+
+ return -1;
+}
+
+/*
+ * BatchVectorSliceFromExprArgs
+ * Build a BatchVectorSlice for a List of args.
+ *
+ * For Var args (possibly under RelabelType), store the col index.
+ * For non-Var args, store -1. Caller can handle Consts, etc.
+ */
+static BatchVectorSlice *
+BatchVectorSliceFromExprArgs(const List *args, const BatchVector *bv)
+{
+ BatchVectorSlice *bvs = palloc(sizeof(BatchVectorSlice));
+ int nargs = list_length(args);
+ int i = 0;
+ ListCell *lc;
+
+ Assert(bv);
+ bvs->bv = bv;
+ bvs->nargs = nargs;
+ bvs->argoffs = (int16 *) palloc(sizeof(int16) * nargs);
+
+ foreach (lc, args)
+ {
+ Expr *arg = (Expr *) lfirst(lc);
+ AttrNumber attno;
+
+ if (expr_is_simple_var(arg, &attno))
+ bvs->argoffs[i++] = BatchVectorFindAttColno(bv, attno);
+ else
+ bvs->argoffs[i++] = -1; /* non-Var */
+ }
+
+ return bvs;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 68629ad7991..3176679b346 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -606,6 +606,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_BUILD_INNER_BATCH_VECTOR,
&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
&&CASE_EEOP_LAST
};
@@ -2336,6 +2337,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP)
+ {
+ /* too complex for an inline implementation */
+ ExecAggPlainTransBatch(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_LAST)
{
/* unreachable */
@@ -6039,3 +6048,97 @@ ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
}
bv->nrows = i;
}
+
+void
+ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
+ AggStatePerGroup pergroup =
+ &aggstate->all_pergroups[op->d.agg_trans.setoff][op->d.agg_trans.transno];
+ BatchVectorSlice *bvs = op->d.agg_trans.bvs;
+ FunctionCallInfo fcinfo = pertrans->transfn_fcinfo;
+ FmgrInfo *finfo = fcinfo->flinfo;
+ Datum newVal;
+ TupleBatch *batch = econtext->outer_batch;
+ int batch_nrows = bvs ? bvs->bv->nrows : batch->nvalid;
+ int start_row = 0;
+
+ if (finfo->fn_strict)
+ {
+ if (pergroup->noTransValue && bvs)
+ {
+ const BatchVector *bv = bvs->bv;
+ bool found = false;
+
+ Assert(bv);
+ for (int i = 0; i < batch_nrows; i++)
+ {
+ for (int j = 0; j < bvs->nargs; j++)
+ {
+ if (!bv->nulls[bvs->argoffs[j]][i])
+ {
+ fcinfo->args[1].value = bv->cols[bvs->argoffs[j]][i];
+ fcinfo->args[1].isnull = false;
+ if (j == bvs->nargs - 1)
+ {
+ found = true;
+ break;
+ }
+ }
+ }
+ if (found)
+ break;
+ }
+ /* If transValue has not yet been initialized, do so now. */
+ ExecAggInitGroup(aggstate, pertrans, pergroup,
+ op->d.agg_trans.aggcontext);
+ start_row = 1;
+ }
+ else if (pergroup->transValueIsNull)
+ return;
+ }
+
+ switch (ExecEvalStepOp(state, op))
+ {
+ case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
+ /* Loop rows, call the original transfn per element using vector cols. */
+ for (int i = start_row; i < batch_nrows; i++)
+ {
+ bool hasnull = false;
+
+ /* Set up fcinfo args 1..m from column vectors at row i. */
+ if (bvs)
+ {
+ const BatchVector *bv = bvs->bv;
+
+ for (int j = 0; j < bvs->nargs; j++)
+ {
+ int16 argoff = bvs->argoffs[j];
+
+ fcinfo->args[j+1].value = bv->cols[argoff][i];
+ fcinfo->args[j+1].isnull = bv->nulls[argoff][i];
+ if (!hasnull && bv->nulls[argoff][i])
+ hasnull = true;
+ }
+ }
+ /* fcinfo->args[0] is the existing transition state */
+ if (finfo->fn_strict && hasnull)
+ continue;
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ newVal = FunctionCallInvoke(fcinfo);
+ if (!pertrans->transtypeByVal &&
+ DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggCopyTransValue(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+ }
+ break;
+ default:
+ elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
+ }
+}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3ace6363509..662d8bef43b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -825,6 +825,16 @@ advance_aggregates_batch(AggState *aggstate, TupleBatch *b)
{
ExprContext *tmpcontext = aggstate->tmpcontext;
ExprState *evaltrans = aggstate->phase->evaltrans;
+ bool batch_trans = aggstate->phase->batch_trans;
+
+ if (batch_trans)
+ {
+ tmpcontext->ecxt_outertuple = TupleBatchGetSlot(b, 0);
+ tmpcontext->outer_batch = b;
+ ExecEvalExprNoReturnSwitchContext(evaltrans, tmpcontext);
+ TupleBatchConsumeAll(b);
+ return;
+ }
while (TupleBatchHasMore(b))
{
@@ -1800,7 +1810,8 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
phase->evaltrans_cache[i][j] = ExecBuildAggTrans(aggstate, phase,
dosort, dohash,
- nullcheck);
+ nullcheck,
+ NULL);
/* change back */
aggstate->ss.ps.outerops = outerops;
@@ -3367,7 +3378,7 @@ hashagg_reset_spill_state(AggState *aggstate)
}
}
-static bool
+bool
AggCanUsePlainBatch(AggState *aggstate)
{
const Agg *aggnode = (const Agg *) aggstate->ss.ps.plan;
@@ -4233,7 +4244,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
Assert(false);
phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
- false);
+ false, &phase->batch_trans);
/* cache compiled expression for outer slot without NULL check */
phase->evaltrans_cache[0][0] = phase->evaltrans;
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 848f0b52d6f..efb3ee639fc 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3026,6 +3026,12 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
+ build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
case EEOP_LAST:
Assert(false);
break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 6bb527c3f6f..1b5e06f60cc 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -186,4 +186,5 @@ void *referenced_functions[] =
ExecBuildInnerBatchVector,
ExecBuildOuterBatchVector,
ExecBuildScanBatchVector,
+ ExecAggPlainTransBatch,
};
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
index 6f1a38d14bd..b50961fc0c9 100644
--- a/src/include/executor/execBatch.h
+++ b/src/include/executor/execBatch.h
@@ -99,4 +99,10 @@ TupleBatchMaterializeAll(TupleBatch *b)
TupleBatchUseInput(b, b->ntuples);
}
+static inline void
+TupleBatchConsumeAll(TupleBatch *b)
+{
+ b->next = b->nvalid;
+}
+
#endif /* EXECBATCH_H */
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 99c86bac702..1d33e084b69 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -302,6 +302,9 @@ typedef enum ExprEvalOp
EEOP_BUILD_OUTER_BATCH_VECTOR,
EEOP_BUILD_SCAN_BATCH_VECTOR,
+ /* Batched aggregate trans evaluation */
+ EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP, /* per-row fmgr calls */
+
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
} ExprEvalOp;
@@ -750,6 +753,7 @@ typedef struct ExprEvalStep
/* for EEOP_AGG_PLAIN_TRANS_[INIT_][STRICT_]{BYVAL,BYREF} */
/* for EEOP_AGG_ORDERED_TRANS_{DATUM,TUPLE} */
+ /* for EEOP_AGG_PLAIN_TRANS_{BATCH,BATCH_ROWLOOP} */
struct
{
AggStatePerTrans pertrans;
@@ -757,6 +761,7 @@ typedef struct ExprEvalStep
int setno;
int transno;
int setoff;
+ struct BatchVectorSlice *bvs;
} agg_trans;
/* for EEOP_IS_JSON */
@@ -956,8 +961,17 @@ typedef struct BatchVector
int nrows; /* #rows loaded into cols/nulls */
} BatchVector;
+/* A slice of BatchVector that maps caller args to BatchVector columns. */
+typedef struct BatchVectorSlice
+{
+ const BatchVector *bv;
+ int nargs; /* number of args covered */
+ int16 *argoffs; /* length nargs, -1 for non-Var entries */
+} BatchVectorSlice;
+
extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
#endif /* EXEC_EXPR_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index cf5b0c7e05c..5ba9a523970 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -336,7 +336,8 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash, bool nullcheck);
+ bool doSort, bool doHash, bool nullcheck,
+ bool *batch_trans);
extern ExprState *ExecBuildHash32FromAttrs(TupleDesc desc,
const TupleTableSlotOps *ops,
FmgrInfo *hashfunctions,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 6c4891bbaeb..5c5ebfc73f2 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -289,6 +289,7 @@ typedef struct AggStatePerPhaseData
Sort *sortnode; /* Sort node for input ordering for phase */
ExprState *evaltrans; /* evaluation of transition functions */
+ bool batch_trans; /* true if evaltrans contains batch EEOPs */
/*----------
* Cached variants of the compiled expression.
@@ -338,4 +339,5 @@ extern void ExecAggInitializeDSM(AggState *node, ParallelContext *pcxt);
extern void ExecAggInitializeWorker(AggState *node, ParallelWorkerContext *pwcxt);
extern void ExecAggRetrieveInstrumentation(AggState *node);
+extern bool AggCanUsePlainBatch(AggState *aggstate);
#endif /* NODEAGG_H */
--
2.43.0
[application/octet-stream] v1-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch (16.9K, 5-v1-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch)
From b63b357ea48e55b43913559471fd10f5a65e1b8e Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 22 Sep 2025 17:01:29 +0900
Subject: [PATCH v1 5/8] WIP: Add EEOPs and helpers for TupleBatch processing
Introduce new EEOP cases to fetch attributes into TupleBatch
vectors:
- EEOP_{INNER,OUTER,SCAN}_FETCHSOME_BATCH
- EEOP_BUILD_{INNER,OUTER,SCAN}_BATCH_VECTOR
Add ExecBuild{Inner,Outer,Scan}BatchVector() helpers to populate
column vectors (values, nulls, nrows, hasnull) from a TupleBatch.
Extend ExprContext with inner_batch, outer_batch, and scan_batch
fields so expression programs can access active batches directly.
Add slot_getsomeattrs_batch() to prefetch attributes across all
slots in a TupleBatch, similar to slot_getsomeattrs() for one slot.
---
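Note (not part of the commit): a minimal standalone sketch of the row-to-column gather that the ExecBuild{Inner,Outer,Scan}BatchVector() helpers are described as doing. SketchSlot, SketchVector and sketch_build_vector are hypothetical names, Datums are reduced to long, and the fixed array sizes are arbitrary. Resetting hasnull at the start of each batch is an assumption about the intended per-batch semantics.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_MAX_ATTS 8       /* hypothetical limits for this sketch */
#define SKETCH_MAX_ROWS 64

/* Simplified stand-in for an already-deformed TupleTableSlot. */
typedef struct SketchSlot
{
    long values[SKETCH_MAX_ATTS];
    bool isnull[SKETCH_MAX_ATTS];
} SketchSlot;

/* Simplified stand-in for BatchVector: one array per requested column. */
typedef struct SketchVector
{
    int  ncols;
    int  attnos[SKETCH_MAX_ATTS];   /* 1-based attribute numbers */
    long cols[SKETCH_MAX_ATTS][SKETCH_MAX_ROWS];
    bool nulls[SKETCH_MAX_ATTS][SKETCH_MAX_ROWS];
    bool hasnull;                   /* any NULL among the gathered values? */
    int  nrows;                     /* rows loaded into cols/nulls */
} SketchVector;

/* Copy the requested attributes of every slot into per-column arrays. */
void
sketch_build_vector(SketchVector *bv, const SketchSlot *slots, int nslots)
{
    assert(nslots <= SKETCH_MAX_ROWS);
    assert(bv->ncols <= SKETCH_MAX_ATTS);

    bv->hasnull = false;            /* assumption: tracked per batch */
    for (int i = 0; i < nslots; i++)
    {
        for (int j = 0; j < bv->ncols; j++)
        {
            int att = bv->attnos[j] - 1;    /* attnos are 1-based */

            bv->cols[j][i] = slots[i].values[att];
            bv->nulls[j][i] = slots[i].isnull[att];
            if (slots[i].isnull[att])
                bv->hasnull = true;
        }
    }
    bv->nrows = nslots;
}
```

Only the attributes listed in attnos are copied, so a NULL in an unrequested column does not set hasnull.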
src/backend/executor/execExprInterp.c | 127 +++++++++++++++++++++++++-
src/backend/executor/execTuples.c | 32 +++++++
src/backend/jit/llvm/llvmjit_expr.c | 86 +++++++++++++++++
src/backend/jit/llvm/llvmjit_types.c | 4 +
src/include/executor/execExpr.h | 45 ++++++++-
src/include/executor/tuptable.h | 2 +
src/include/nodes/execnodes.h | 24 +++--
7 files changed, 310 insertions(+), 10 deletions(-)
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 0e1a74976f7..68629ad7991 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -59,6 +59,7 @@
#include "access/heaptoast.h"
#include "catalog/pg_type.h"
#include "commands/sequence.h"
+#include "executor/execBatch.h"
#include "executor/execExpr.h"
#include "executor/nodeSubplan.h"
#include "funcapi.h"
@@ -188,6 +189,11 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
int setno);
static char *ExecGetJsonValueItemString(JsonbValue *item, bool *resnull);
+static pg_attribute_always_inline void ExecBuildBatchVector(ExprState *state,
+ ExprEvalStep *op,
+ ExprContext *econtext,
+ TupleBatch *b);
+
/*
* ScalarArrayOpExprHashEntry
* Hash table entry type used during EEOP_HASHED_SCALARARRAYOP
@@ -446,7 +452,6 @@ ExecReadyInterpretedExpr(ExprState *state)
state->evalfunc_private = ExecInterpExpr;
}
-
/*
* Evaluate expression identified by "state" in the execution context
* given by "econtext". *isnull is set to the is-null flag for the result,
@@ -466,6 +471,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
TupleTableSlot *scanslot;
TupleTableSlot *oldslot;
TupleTableSlot *newslot;
+ TupleBatch *innerbatch;
+ TupleBatch *outerbatch;
+ TupleBatch *scanbatch;
/*
* This array has to be in the same order as enum ExprEvalOp.
@@ -479,6 +487,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_SCAN_FETCHSOME,
&&CASE_EEOP_OLD_FETCHSOME,
&&CASE_EEOP_NEW_FETCHSOME,
+ &&CASE_EEOP_INNER_FETCHSOME_BATCH,
+ &&CASE_EEOP_OUTER_FETCHSOME_BATCH,
+ &&CASE_EEOP_SCAN_FETCHSOME_BATCH,
&&CASE_EEOP_INNER_VAR,
&&CASE_EEOP_OUTER_VAR,
&&CASE_EEOP_SCAN_VAR,
@@ -592,6 +603,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
+ &&CASE_EEOP_BUILD_INNER_BATCH_VECTOR,
+ &&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
+ &&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
&&CASE_EEOP_LAST
};
@@ -612,6 +626,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
scanslot = econtext->ecxt_scantuple;
oldslot = econtext->ecxt_oldtuple;
newslot = econtext->ecxt_newtuple;
+ innerbatch = econtext->inner_batch;
+ outerbatch = econtext->outer_batch;
+ scanbatch = econtext->scan_batch;
#if defined(EEO_USE_COMPUTED_GOTO)
EEO_DISPATCH();
@@ -658,6 +675,36 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_INNER_FETCHSOME_BATCH)
+ {
+ CheckOpSlotCompatibility(op, innerslot);
+
+ Assert(innerbatch);
+ slot_getsomeattrs_batch(innerbatch, op->d.fetch_batch.last_var);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_OUTER_FETCHSOME_BATCH)
+ {
+ CheckOpSlotCompatibility(op, outerslot);
+
+ Assert(outerbatch);
+ slot_getsomeattrs_batch(outerbatch, op->d.fetch_batch.last_var);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_SCAN_FETCHSOME_BATCH)
+ {
+ CheckOpSlotCompatibility(op, scanslot);
+
+ Assert(scanbatch);
+ slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_OLD_FETCHSOME)
{
CheckOpSlotCompatibility(op, oldslot);
@@ -2265,6 +2312,30 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_BUILD_INNER_BATCH_VECTOR)
+ {
+ /* too complex for an inline implementation */
+ ExecBuildInnerBatchVector(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_BUILD_OUTER_BATCH_VECTOR)
+ {
+ /* too complex for an inline implementation */
+ ExecBuildOuterBatchVector(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_BUILD_SCAN_BATCH_VECTOR)
+ {
+ /* too complex for an inline implementation */
+ ExecBuildScanBatchVector(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_LAST)
{
/* unreachable */
@@ -5914,3 +5985,57 @@ ExecAggPlainTransByRef(AggState *aggstate, AggStatePerTrans pertrans,
MemoryContextSwitchTo(oldContext);
}
+
+void
+ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ Assert(econtext->inner_batch);
+ ExecBuildBatchVector(state, op, econtext, econtext->inner_batch);
+}
+
+void
+ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ Assert(econtext->outer_batch);
+ ExecBuildBatchVector(state, op, econtext, econtext->outer_batch);
+}
+
+void
+ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ Assert(econtext->scan_batch);
+ ExecBuildBatchVector(state, op, econtext, econtext->scan_batch);
+}
+
+static pg_attribute_always_inline void
+ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
+ TupleBatch *b)
+{
+ struct BatchVector *bv = op->d.batch_vector.bv;
+ int i = 0;
+
+ if (bv->ncols == 0)
+ return;
+
+ /* Fetch each requested attribute into column vectors. */
+ TupleBatchRewind(b);
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ for (int j = 0; j < bv->ncols; j++)
+ {
+ AttrNumber attno = bv->attnos[j];
+ Datum *cols = bv->cols[j];
+ bool *nulls = bv->nulls[j];
+
+ Assert(attno <= slot->tts_nvalid);
+ cols[i] = slot->tts_values[attno - 1];
+ nulls[i] = slot->tts_isnull[attno - 1];
+ if (!bv->hasnull && nulls[i])
+ bv->hasnull = true;
+ }
+ i++;
+ }
+ bv->nrows = i;
+}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 8e02d68824f..86d5dea8f8b 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -2111,6 +2111,38 @@ slot_getsomeattrs_int(TupleTableSlot *slot, int attnum)
}
}
+void
+slot_getsomeattrs_batch(struct TupleBatch *b, int attnum)
+{
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ /* Check for caller errors */
+ Assert(attnum > 0);
+
+ if (unlikely(attnum > slot->tts_tupleDescriptor->natts))
+ elog(ERROR, "invalid attribute number %d", attnum);
+
+ /* XXX - there should perhaps also be a batch-level att_nvalid */
+ if (attnum <= slot->tts_nvalid)
+ continue;
+
+ /* Fetch as many attributes as possible from the underlying tuple. */
+ slot->tts_ops->getsomeattrs(slot, attnum);
+
+ /*
+ * If the underlying tuple doesn't have enough attributes, tuple
+ * descriptor must have the missing attributes.
+ */
+ if (unlikely(slot->tts_nvalid < attnum))
+ {
+ slot_getmissingattrs(slot, slot->tts_nvalid, attnum);
+ slot->tts_nvalid = attnum;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecTypeFromTL
*
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 712b35df7e5..848f0b52d6f 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -109,6 +109,11 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_newslot;
LLVMValueRef v_resultslot;
+ /* batches */
+ LLVMValueRef v_innerbatch;
+ LLVMValueRef v_outerbatch;
+ LLVMValueRef v_scanbatch;
+
/* nulls/values of slots */
LLVMValueRef v_innervalues;
LLVMValueRef v_innernulls;
@@ -221,6 +226,21 @@ llvm_compile_expr(ExprState *state)
v_state,
FIELDNO_EXPRSTATE_RESULTSLOT,
"v_resultslot");
+ v_innerbatch = l_load_struct_gep(b,
+ StructExprContext,
+ v_econtext,
+ FIELDNO_EXPRCONTEXT_INNERBATCH,
+ "v_innerbatch");
+ v_outerbatch = l_load_struct_gep(b,
+ StructExprContext,
+ v_econtext,
+ FIELDNO_EXPRCONTEXT_OUTERBATCH,
+ "v_outerbatch");
+ v_scanbatch = l_load_struct_gep(b,
+ StructExprContext,
+ v_econtext,
+ FIELDNO_EXPRCONTEXT_SCANBATCH,
+ "v_scanbatch");
/* build global values/isnull pointers */
v_scanvalues = l_load_struct_gep(b,
@@ -439,6 +459,54 @@ llvm_compile_expr(ExprState *state)
break;
}
+ case EEOP_INNER_FETCHSOME_BATCH:
+ {
+ LLVMValueRef params[2];
+
+ params[0] = v_innerbatch;
+ params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+ l_call(b,
+ llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+ llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+ params, lengthof(params), "");
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
+ case EEOP_OUTER_FETCHSOME_BATCH:
+ {
+ LLVMValueRef params[2];
+
+ params[0] = v_outerbatch;
+ params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+ l_call(b,
+ llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+ llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+ params, lengthof(params), "");
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
+ case EEOP_SCAN_FETCHSOME_BATCH:
+ {
+ LLVMValueRef params[2];
+
+ params[0] = v_scanbatch;
+ params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+ l_call(b,
+ llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+ llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+ params, lengthof(params), "");
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
case EEOP_INNER_VAR:
case EEOP_OUTER_VAR:
case EEOP_SCAN_VAR:
@@ -2940,6 +3008,24 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_BUILD_INNER_BATCH_VECTOR:
+ build_EvalXFunc(b, mod, "ExecBuildInnerBatchVector",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
+ case EEOP_BUILD_OUTER_BATCH_VECTOR:
+ build_EvalXFunc(b, mod, "ExecBuildOuterBatchVector",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
+ case EEOP_BUILD_SCAN_BATCH_VECTOR:
+ build_EvalXFunc(b, mod, "ExecBuildScanBatchVector",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
case EEOP_LAST:
Assert(false);
break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 167cd554b9c..6bb527c3f6f 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -179,7 +179,11 @@ void *referenced_functions[] =
MakeExpandedObjectReadOnlyInternal,
slot_getmissingattrs,
slot_getsomeattrs_int,
+ slot_getsomeattrs_batch,
strlen,
varsize_any,
ExecInterpExprStillValid,
+ ExecBuildInnerBatchVector,
+ ExecBuildOuterBatchVector,
+ ExecBuildScanBatchVector,
};
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 75366203706..99c86bac702 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -78,6 +78,11 @@ typedef enum ExprEvalOp
EEOP_OLD_FETCHSOME,
EEOP_NEW_FETCHSOME,
+ /* apply slot_getsomeattrs_batch() to corresponding batch */
+ EEOP_INNER_FETCHSOME_BATCH,
+ EEOP_OUTER_FETCHSOME_BATCH,
+ EEOP_SCAN_FETCHSOME_BATCH,
+
/* compute non-system Var value */
EEOP_INNER_VAR,
EEOP_OUTER_VAR,
@@ -292,11 +297,15 @@ typedef enum ExprEvalOp
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
+ /* ExprContext.*_batch -> BatchVector */
+ EEOP_BUILD_INNER_BATCH_VECTOR,
+ EEOP_BUILD_OUTER_BATCH_VECTOR,
+ EEOP_BUILD_SCAN_BATCH_VECTOR,
+
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
} ExprEvalOp;
-
typedef struct ExprEvalStep
{
/*
@@ -331,6 +340,12 @@ typedef struct ExprEvalStep
const TupleTableSlotOps *kind;
} fetch;
+ struct
+ {
+ /* attribute number up to which to fetch (inclusive) */
+ int last_var;
+ } fetch_batch;
+
/* for EEOP_INNER/OUTER/SCAN/OLD/NEW_[SYS]VAR */
struct
{
@@ -769,6 +784,12 @@ typedef struct ExprEvalStep
void *json_coercion_cache;
ErrorSaveContext *escontext;
} jsonexpr_coercion;
+
+ /* for batch vector construction */
+ struct
+ {
+ struct BatchVector *bv;
+ } batch_vector;
} d;
} ExprEvalStep;
@@ -917,4 +938,26 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
ExprContext *econtext);
+/* ---------- BatchVector stuff ------------- */
+
+/* Vector fetch spec for a list of simple Vars. */
+typedef struct BatchVector
+{
+ /* immutable after BatchVectorCreate */
+ AttrNumber *attnos; /* [ncols] */
+ int ncols;
+ int maxrows;
+ int last_var;
+
+ /* per batch state */
+ Datum **cols; /* [ncols][maxrows] */
+ bool **nulls; /* [ncols][maxrows] */
+ bool hasnull; /* is any datum in cols NULL? */
+ int nrows; /* #rows loaded into cols/nulls */
+} BatchVector;
+
+extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
#endif /* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 095e4cc82e3..2e2192fb3cf 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -347,6 +347,8 @@ extern Datum ExecFetchSlotHeapTupleDatum(TupleTableSlot *slot);
extern void slot_getmissingattrs(TupleTableSlot *slot, int startAttNum,
int lastAttNum);
extern void slot_getsomeattrs_int(TupleTableSlot *slot, int attnum);
+struct TupleBatch;
+extern void slot_getsomeattrs_batch(struct TupleBatch *b, int attnum);
#ifndef FRONTEND
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9b81b842161..fdfe8b4ddaf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -277,6 +277,14 @@ typedef struct ExprContext
#define FIELDNO_EXPRCONTEXT_OUTERTUPLE 3
TupleTableSlot *ecxt_outertuple;
+ /* For batched evaluation using batch-aware EEOPs */
+#define FIELDNO_EXPRCONTEXT_INNERBATCH 4
+ TupleBatch *inner_batch;
+#define FIELDNO_EXPRCONTEXT_OUTERBATCH 5
+ TupleBatch *outer_batch;
+#define FIELDNO_EXPRCONTEXT_SCANBATCH 6
+ TupleBatch *scan_batch;
+
/* Memory contexts for expression evaluation --- see notes above */
MemoryContext ecxt_per_query_memory;
MemoryContext ecxt_per_tuple_memory;
@@ -289,27 +297,27 @@ typedef struct ExprContext
* Values to substitute for Aggref nodes in the expressions of an Agg
* node, or for WindowFunc nodes within a WindowAgg node.
*/
-#define FIELDNO_EXPRCONTEXT_AGGVALUES 8
+#define FIELDNO_EXPRCONTEXT_AGGVALUES 11
Datum *ecxt_aggvalues; /* precomputed values for aggs/windowfuncs */
-#define FIELDNO_EXPRCONTEXT_AGGNULLS 9
+#define FIELDNO_EXPRCONTEXT_AGGNULLS 12
bool *ecxt_aggnulls; /* null flags for aggs/windowfuncs */
/* Value to substitute for CaseTestExpr nodes in expression */
-#define FIELDNO_EXPRCONTEXT_CASEDATUM 10
+#define FIELDNO_EXPRCONTEXT_CASEDATUM 13
Datum caseValue_datum;
-#define FIELDNO_EXPRCONTEXT_CASENULL 11
+#define FIELDNO_EXPRCONTEXT_CASENULL 14
bool caseValue_isNull;
/* Value to substitute for CoerceToDomainValue nodes in expression */
-#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 12
+#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 15
Datum domainValue_datum;
-#define FIELDNO_EXPRCONTEXT_DOMAINNULL 13
+#define FIELDNO_EXPRCONTEXT_DOMAINNULL 16
bool domainValue_isNull;
/* Tuples that OLD/NEW Var nodes in RETURNING may refer to */
-#define FIELDNO_EXPRCONTEXT_OLDTUPLE 14
+#define FIELDNO_EXPRCONTEXT_OLDTUPLE 17
TupleTableSlot *ecxt_oldtuple;
-#define FIELDNO_EXPRCONTEXT_NEWTUPLE 15
+#define FIELDNO_EXPRCONTEXT_NEWTUPLE 18
TupleTableSlot *ecxt_newtuple;
/* Link to containing EState (NULL if a standalone ExprContext) */
--
2.43.0
[application/octet-stream] v1-0002-SeqScan-add-batch-driven-variants-returning-slots.patch (27.2K, 6-v1-0002-SeqScan-add-batch-driven-variants-returning-slots.patch)
From 6a43a40037e4b656739743b3c0abdfb73a8f9b92 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 21:59:56 +0900
Subject: [PATCH v1 2/8] SeqScan: add batch-driven variants returning slots
Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per-tuple AM crossings without
changing the node interface seen by parents.
Add TupleBatch and supporting code in execBatch.c/h to hold
executor-side batching state. PlanState gains ps_Batch to carry the
active TupleBatch when a node supports batching.
Wire up runtime selection in ExecInitSeqScan using
ScanCanUseBatching(). When executor_batching is enabled, EPQ is
inactive, the scan is not backward, and the relation supports
batching, ps.ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used.
Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled and allowed.
Notes / current limits:
- Batching uses EXEC_BATCH_ROWS (currently 64) as the target capacity.
- With the current heapam, batches are composed from a single page, so
the batch may not always be full. Future work may let SeqScan and/or
AMs top up batches across pages when safe to do so.
---
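The tuple flow described in the commit message above can be sketched in
isolation as follows. This is only a toy model, not code from the patch:
ToyScan, ToyBatch, toy_getnextbatch(), toy_scan_next() and TOY_BATCH_ROWS
are hypothetical names, and plain ints stand in for tuples. It assumes the
"AM" fills a batch from one page at a time, so a batch may come back short,
while the parent-facing call still hands out one row per call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BATCH_ROWS 4        /* hypothetical stand-in for EXEC_BATCH_ROWS */

/* Toy "table": a flat array of ints split into fixed-size pages. */
typedef struct ToyScan
{
    const int  *rows;
    int         nrows;
    int         rows_per_page;
    int         pos;            /* next row to hand out */
    int         am_calls;       /* how many times the toy AM was entered */
} ToyScan;

/* Toy envelope, loosely modelled on the nvalid/next cursor of TupleBatch. */
typedef struct ToyBatch
{
    int         items[TOY_BATCH_ROWS];
    int         nvalid;         /* number of returnable items */
    int         next;           /* index of next item to return */
} ToyBatch;

/* Fill the batch from the current page only; false means end of scan. */
static bool
toy_getnextbatch(ToyScan *scan, ToyBatch *b)
{
    int         page_end;

    scan->am_calls++;
    b->nvalid = 0;
    b->next = 0;
    if (scan->pos >= scan->nrows)
        return false;

    /* Never cross a page boundary while filling the batch. */
    page_end = (scan->pos / scan->rows_per_page + 1) * scan->rows_per_page;
    if (page_end > scan->nrows)
        page_end = scan->nrows;

    while (scan->pos < page_end && b->nvalid < TOY_BATCH_ROWS)
        b->items[b->nvalid++] = scan->rows[scan->pos++];
    return b->nvalid > 0;
}

/* Parent-facing interface: drain the batch, refill only when it is empty. */
static const int *
toy_scan_next(ToyScan *scan, ToyBatch *b)
{
    if (b->next >= b->nvalid && !toy_getnextbatch(scan, b))
        return NULL;
    assert(b->next < b->nvalid);
    return &b->items[b->next++];
}
```

With 10 rows, 6 rows per page and a capacity of 4, the toy AM is entered
4 times (batches of 4, 2 and 4 rows, plus the end-of-scan call) instead of
once per row, which is the kind of saving the batch-driven variants aim for.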
src/backend/access/heap/heapam.c | 29 ++++
src/backend/access/heap/heapam_handler.c | 15 ++
src/backend/access/table/tableam.c | 11 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execBatch.c | 117 ++++++++++++++
src/backend/executor/execScan.c | 31 ++++
src/backend/executor/meson.build | 1 +
src/backend/executor/nodeSeqscan.c | 176 +++++++++++++++++++++-
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_parameters.dat | 7 +
src/include/access/heapam.h | 1 +
src/include/access/tableam.h | 27 ++++
src/include/executor/execBatch.h | 102 +++++++++++++
src/include/executor/execScan.h | 54 +++++++
src/include/executor/executor.h | 4 +
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 8 +
17 files changed, 587 insertions(+), 1 deletion(-)
create mode 100644 src/backend/executor/execBatch.c
create mode 100644 src/include/executor/execBatch.h
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f62f7edbf5e..9fd7948482d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1570,6 +1570,35 @@ heap_begin_batch(TableScanDesc sscan, int maxitems)
return hb;
}
+/*
+ * heap_materialize_batch_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ Assert(n <= hb->nitems);
+
+ for (int i = 0; i < n; i++)
+ {
+ HeapTupleData *tuple = &hb->tupdata[i];
+ HeapTupleTableSlot *slot = (HeapTupleTableSlot *) slots[i];
+
+ /* Inline of ExecStoreHeapTuple(tuple, slot, false) */
+ slot->tuple = tuple;
+ slot->off = 0;
+ slot->base.tts_nvalid = 0;
+ slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
+ slot->base.tts_tid = tuple->t_self;
+ slot->base.tts_tableOid = tuple->t_tableOid;
+ slot->base.tts_flags &= ~(TTS_FLAG_SHOULDFREE | TTS_FLAG_EMPTY);
+ }
+}
+
/*
* heap_scan_end_batch
*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ec4eeccf19c..8e88cc9e8f1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -72,6 +72,20 @@ heapam_slot_callbacks(Relation relation)
return &TTSOpsBufferHeapTuple;
}
+/* ------------------------------------------------------------------------
+ * TupleBatch related callbacks for heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static const TupleBatchOps TupleBatchHeapOps = {
+ .materialize_all = heap_materialize_batch_all
+};
+
+static const TupleBatchOps *
+heapam_batch_callbacks(Relation relation)
+{
+ return &TupleBatchHeapOps;
+}
/* ------------------------------------------------------------------------
* Index Scan Callbacks for heap AM
@@ -2617,6 +2631,7 @@ static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
.slot_callbacks = heapam_slot_callbacks,
+ .batch_callbacks = heapam_batch_callbacks,
.scan_begin = heap_beginscan,
.scan_end = heap_endscan,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5e41404937e..5a8ebb8b97c 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -103,6 +103,17 @@ table_slot_create(Relation relation, List **reglist)
return slot;
}
+/* ----------------------------------------------------------------------------
+ * TupleBatch support routines
+ * ----------------------------------------------------------------------------
+ */
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+ if (relation->rd_tableam)
+ return relation->rd_tableam->batch_callbacks(relation);
+ elog(ERROR, "relation does not support TupleBatch operations");
+}
/* ----------------------------------------------------------------------------
* Table scan functions.
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..3e72f3fe03c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
execAsync.o \
+ execBatch.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
new file mode 100644
index 00000000000..007ae535687
--- /dev/null
+++ b/src/backend/executor/execBatch.c
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.c
+ * Helpers for TupleBatch
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "executor/execBatch.h"
+
+/*
+ * TupleBatchCreate
+ * Allocate and initialize a new TupleBatch envelope.
+ */
+TupleBatch *
+TupleBatchCreate(TupleDesc scandesc, int capacity)
+{
+ TupleBatch *b;
+ TupleTableSlot **inslots,
+ **outslots;
+
+ inslots = palloc(sizeof(TupleTableSlot *) * capacity);
+ outslots = palloc(sizeof(TupleTableSlot *) * capacity);
+ for (int i = 0; i < capacity; i++)
+ inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);
+
+ b = (TupleBatch *) palloc0(sizeof(TupleBatch));
+
+ /* Initial state: empty envelope */
+ b->am_payload = NULL;
+ b->ntuples = 0;
+ b->inslots = inslots;
+ b->outslots = outslots;
+ b->activeslots = NULL;
+ b->outslots = outslots;
+ b->maxslots = capacity;
+
+ b->nvalid = 0;
+ b->next = 0;
+
+ return b;
+}
+
+/*
+ * TupleBatchReset
+ * Reset an existing TupleBatch envelope to empty.
+ */
+void
+TupleBatchReset(TupleBatch *b, bool drop_slots)
+{
+ if (b == NULL)
+ return;
+
+ for (int i = 0; i < b->maxslots; i++)
+ {
+ ExecClearTuple(b->inslots[i]);
+ if (drop_slots)
+ ExecDropSingleTupleTableSlot(b->inslots[i]);
+ }
+
+ if (drop_slots)
+ {
+ pfree(b->inslots);
+ pfree(b->outslots);
+ b->inslots = b->outslots = NULL;
+ }
+
+ b->ntuples = 0;
+ b->nvalid = 0;
+ b->next = 0;
+ b->activeslots = NULL;
+}
+
+void
+TupleBatchUseInput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->inslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+void
+TupleBatchUseOutput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->outslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+bool
+TupleBatchIsValid(TupleBatch *b)
+{
+ return b != NULL &&
+ b->maxslots > 0 &&
+ b->inslots != NULL &&
+ b->outslots != NULL;
+}
+
+void
+TupleBatchRewind(TupleBatch *b)
+{
+ b->next = 0;
+}
+
+int
+TupleBatchGetNumValid(TupleBatch *b)
+{
+ return b->nvalid;
+}
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 90726949a87..f24c5d73ae1 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -18,6 +18,7 @@
*/
#include "postgres.h"
+#include "access/tableam.h"
#include "executor/executor.h"
#include "executor/execScan.h"
#include "miscadmin.h"
@@ -154,3 +155,33 @@ ExecScanReScan(ScanState *node)
}
}
}
+
+bool
+ScanCanUseBatching(ScanState *scanstate, int eflags)
+{
+ Relation relation = scanstate->ss_currentRelation;
+
+ return executor_batching &&
+ (scanstate->ps.state->es_epq_active == NULL) &&
+ !(eflags & EXEC_FLAG_BACKWARD) &&
+ relation && table_supports_batching(relation);
+}
+
+void
+ScanResetBatching(ScanState *scanstate, bool drop)
+{
+ TupleBatch *b = scanstate->ps.ps_Batch;
+
+ if (b)
+ {
+ TupleBatchReset(b, drop);
+ if (b->am_payload)
+ {
+ table_scan_end_batch(scanstate->ss_currentScanDesc,
+ b->am_payload);
+ b->am_payload = NULL;
+ }
+ if (drop)
+ pfree(b);
+ }
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..40ffc28f3cb 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'execAmi.c',
'execAsync.c',
+ 'execBatch.c',
'execCurrent.c',
'execExpr.c',
'execExprInterp.c',
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 94047d29430..2552d420f1c 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -203,6 +203,171 @@ ExecSeqScanEPQ(PlanState *pstate)
(ExecScanRecheckMtd) SeqRecheck);
}
+/* ----------------------------------------------------------------
+ * Batch Support
+ * ----------------------------------------------------------------
+ */
+static inline bool
+SeqNextBatch(SeqScanState *node)
+{
+ TableScanDesc scandesc;
+ EState *estate;
+ ScanDirection direction;
+
+ Assert(node->ss.ps.ps_Batch != NULL);
+
+ /*
+ * get information from the estate and scan state
+ */
+ scandesc = node->ss.ss_currentScanDesc;
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ Assert(direction == ForwardScanDirection);
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the scan is not parallel, or if we're serially
+ * executing a scan that was planned to be parallel.
+ */
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ node->ss.ss_currentScanDesc = scandesc;
+ }
+
+ /* Lazily create the AM batch payload. */
+ if (node->ss.ps.ps_Batch->am_payload == NULL)
+ {
+ const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+ Assert(tam && tam->scan_begin_batch);
+ node->ss.ps.ps_Batch->am_payload =
+ table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+ node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+ }
+
+ node->ss.ps.ps_Batch->ntuples =
+ table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
+ node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
+ node->ss.ps.ps_Batch->materialized = false;
+
+ return node->ss.ps.ps_Batch->ntuples > 0;
+}
+
+static inline bool
+SeqNextBatchMaterialize(SeqScanState *node)
+{
+ if (SeqNextBatch(node))
+ {
+ TupleBatchMaterializeAll(node->ss.ps.ps_Batch);
+ return true;
+ }
+
+ return false;
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtendedBatchSlot().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() used when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() used when qual evaluation and
+ * projection are required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, pstate->ps_ProjInfo);
+}
+
+/* Batch SeqScan enablement and dispatch */
+static void
+SeqScanInitBatching(SeqScanState *scanstate, int eflags)
+{
+ const int cap = EXEC_BATCH_ROWS;
+ TupleDesc scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+
+ scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+
+ /* Choose the batch variant matching the qual/projection specialization */
+ if (scanstate->ss.ps.qual == NULL)
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+ }
+ }
+ else
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecInitSeqScan
* ----------------------------------------------------------------
@@ -211,6 +376,7 @@ SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
SeqScanState *scanstate;
+ bool use_batching;
/*
* Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -241,9 +407,12 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
node->scan.scanrelid,
eflags);
+ use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
+
/* and create slot with the appropriate rowtype */
ExecInitScanTupleSlot(estate, &scanstate->ss,
RelationGetDescr(scanstate->ss.ss_currentRelation),
+ use_batching ? &TTSOpsHeapTuple :
table_slot_callbacks(scanstate->ss.ss_currentRelation));
/*
@@ -280,6 +449,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
+ if (use_batching)
+ SeqScanInitBatching(scanstate, eflags);
+
return scanstate;
}
@@ -299,6 +471,8 @@ ExecEndSeqScan(SeqScanState *node)
*/
scanDesc = node->ss.ss_currentScanDesc;
+ ScanResetBatching(&node->ss, true);
+
/*
* close heap scan
*/
@@ -327,7 +501,7 @@ ExecReScanSeqScan(SeqScanState *node)
if (scan != NULL)
table_rescan(scan, /* scan desc */
NULL); /* new scan keys */
-
+ ScanResetBatching(&node->ss, false);
ExecScanReScan((ScanState *) node);
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..b4a0996a717 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int notify_buffers = 16;
int serializable_buffers = 32;
int subtransaction_buffers = 0;
int transaction_buffers = 0;
+
+/* executor batching */
+bool executor_batching = false;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 6bc6be13d2a..c9fbb7ffef9 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -880,6 +880,13 @@
boot_val => 'true',
},
+{ name => 'executor_batching', type => 'bool', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+ short_desc => 'Use tuple batching during execution.',
+ flags => 'GUC_NOT_IN_SAMPLE',
+ variable => 'executor_batching',
+ boot_val => 'true',
+},
+
{ name => 'data_sync_retry', type => 'bool', context => 'PGC_POSTMASTER', group => 'ERROR_HANDLING_OPTIONS',
short_desc => 'Whether to continue running after a failure to sync data files.',
variable => 'data_sync_retry',
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 02f7793fba0..13ce6166ec3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -314,6 +314,7 @@ extern bool heap_getnextslot(TableScanDesc sscan,
extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+extern void heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n);
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 953207eac50..05f828b9762 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "commands/vacuum.h"
+#include "executor/execBatch.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -39,6 +40,7 @@ typedef struct BulkInsertStateData BulkInsertStateData;
typedef struct IndexInfo IndexInfo;
typedef struct SampleScanState SampleScanState;
typedef struct ValidateIndexState ValidateIndexState;
+typedef struct TupleBatchOps TupleBatchOps;
/*
* Bitmask values for the flags argument to the scan_begin callback.
@@ -301,6 +303,7 @@ typedef struct TableAmRoutine
* Return slot implementation suitable for storing a tuple of this AM.
*/
const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
+ const TupleBatchOps *(*batch_callbacks)(Relation rel);
/* ------------------------------------------------------------------------
@@ -361,6 +364,7 @@ typedef struct TableAmRoutine
ScanDirection dir);
void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -872,6 +876,16 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
*/
extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
+/* ----------------------------------------------------------------------------
+ * TupleBatch functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Returns callbacks for manipulating TupleBatch for tuples of the given
+ * relation.
+ */
+extern const TupleBatchOps *table_batch_callbacks(Relation relation);
/* ----------------------------------------------------------------------------
* Table scan functions.
@@ -1046,6 +1060,18 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_supports_batching
+ * Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation)
+{
+ const TableAmRoutine *tam = relation->rd_tableam;
+
+ return tam->scan_getnextbatch != NULL;
+}
+
/*
* table_scan_begin_batch
* Allocate AM-owned batch payload with capacity 'maxitems'.
@@ -2116,5 +2142,6 @@ extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
*/
extern const TableAmRoutine *GetHeapamTableAmRoutine(void);
+extern struct TupleBatchOps *GetHeapamTupleBatchOps(void);
#endif /* TABLEAM_H */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
new file mode 100644
index 00000000000..6f1a38d14bd
--- /dev/null
+++ b/src/include/executor/execBatch.h
@@ -0,0 +1,102 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.h
+ * Executor batch envelope for passing tuple batch state upward
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECBATCH_H
+#define EXECBATCH_H
+
+#include "executor/tuptable.h"
+
+/* XXX fixed 64 for PoC */
+#define EXEC_BATCH_ROWS 64
+
+/*
+ * TupleBatchOps -- AM-specific helpers for lazy materialization.
+ */
+typedef struct TupleBatchOps
+{
+ void (*materialize_all)(void *am_payload,
+ TupleTableSlot **dst,
+ int maxslots);
+} TupleBatchOps;
+
+/*
+ * TupleBatch
+ *
+ * Envelope for a batch of tuples produced by a plan node (e.g., SeqScan) per
+ * call to a batch variant of ExecSeqScan().
+ */
+typedef struct TupleBatch
+{
+ void *am_payload;
+ const TupleBatchOps *ops;
+ int ntuples; /* number of tuples in am_payload */
+ bool materialized; /* tuples in slots valid? */
+ struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
+ struct TupleTableSlot **outslots; /* slots for tuples going "out of"
+ * batch */
+ struct TupleTableSlot **activeslots;
+ int maxslots;
+
+ int nvalid; /* number of returnable tuples in activeslots */
+ int next; /* 0-based index of next tuple to be returned */
+} TupleBatch;
+
+
+/* Helpers */
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
+extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
+extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
+extern bool TupleBatchIsValid(TupleBatch *b);
+extern void TupleBatchRewind(TupleBatch *b);
+extern int TupleBatchGetNumValid(TupleBatch *b);
+
+static inline TupleTableSlot *
+TupleBatchGetNextSlot(TupleBatch *b)
+{
+ return b->next < b->nvalid ? b->activeslots[b->next++] : NULL;
+}
+
+static inline TupleTableSlot *
+TupleBatchGetSlot(TupleBatch *b, int index)
+{
+ Assert(index < b->nvalid);
+ return b->activeslots[index];
+}
+
+static inline void
+TupleBatchStoreInOut(TupleBatch *b, int index, TupleTableSlot *out)
+{
+ Assert(TupleBatchIsValid(b));
+ b->outslots[index] = out;
+}
+
+static inline bool
+TupleBatchHasMore(TupleBatch *b)
+{
+ return b->activeslots && b->next < b->nvalid;
+}
+
+static inline void
+TupleBatchMaterializeAll(TupleBatch *b)
+{
+ if (b->materialized)
+ return;
+
+ if (b->ops == NULL || b->ops->materialize_all == NULL)
+ elog(ERROR, "TupleBatch has no slots and no materialize_all op");
+
+ b->ops->materialize_all(b->am_payload, b->inslots, b->ntuples);
+ TupleBatchUseInput(b, b->ntuples);
+}
+
+#endif /* EXECBATCH_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index 837ea7785bb..fec606471c8 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -243,4 +243,58 @@ ExecScanExtended(ScanState *node,
}
}
+static inline TupleTableSlot *
+ExecScanExtendedBatchSlot(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ TupleTableSlot *in;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next input slot from current batch, or refill */
+ if (!TupleBatchHasMore(b))
+ {
+ if (!accessBatchMtd(node))
+ return NULL;
+ }
+
+ in = TupleBatchGetNextSlot(b);
+ Assert(in);
+
+ /* No qual, no projection: direct return */
+ if (qual == NULL && projInfo == NULL)
+ return in;
+
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ /* Qual only */
+ if (projInfo == NULL)
+ {
+ if (qual == NULL || ExecQual(qual, econtext))
+ return in;
+ else
+ InstrCountFiltered1(node, 1);
+ continue;
+ }
+
+ /* Projection (with or without qual) */
+ if (qual == NULL || ExecQual(qual, econtext))
+ return ExecProject(projInfo);
+ else
+ InstrCountFiltered1(node, 1);
+ /* else try next tuple */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 3248e78cd28..17258f7ae2d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -575,12 +575,16 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
*/
typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanAccessBatchMtd)(ScanState *node);
extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
ExecScanRecheckMtd recheckMtd);
+
extern void ExecAssignScanProjectionInfo(ScanState *node);
extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
extern void ExecScanReScan(ScanState *node);
+extern bool ScanCanUseBatching(ScanState *scanstate, int eflags);
+extern void ScanResetBatching(ScanState *scanstate, bool drop);
/*
* prototypes from functions in execTuples.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..b8e7afda57c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -287,6 +287,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
extern PGDLLIMPORT int VacuumCostBalance;
extern PGDLLIMPORT bool VacuumCostActive;
+extern PGDLLIMPORT bool executor_batching;
/* in utils/misc/stack_depth.c */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a36653c37f9..f4bb8f7dd7f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -30,6 +30,7 @@
#define EXECNODES_H
#include "access/tupconvert.h"
+#include "executor/execBatch.h"
#include "executor/instrument.h"
#include "fmgr.h"
#include "lib/ilist.h"
@@ -1143,6 +1144,10 @@ typedef struct JsonExprState
*/
typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+/* Forward declaration of TupleBatch; see executor/execBatch.h. */
+struct TupleBatch;
+typedef struct TupleBatch TupleBatch;
+
/* ----------------
* PlanState node
*
@@ -1198,6 +1203,9 @@ typedef struct PlanState
ExprContext *ps_ExprContext; /* node's expression-evaluation context */
ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */
+ /* Batching state if node supports it. */
+ TupleBatch *ps_Batch;
+
bool async_capable; /* true if node is async-capable */
/*
--
2.43.0
[application/octet-stream] v1-0001-Add-batch-table-AM-API-and-heapam-implementation.patch (13.7K, 7-v1-0001-Add-batch-table-AM-API-and-heapam-implementation.patch)
From 3318650e720a01cbd5948349b9fbcdbb8ddda7cf Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 21:56:17 +0900
Subject: [PATCH v1 1/8] Add batch table AM API and heapam implementation
Introduce new table AM callbacks to fetch multiple tuples per call.
This reduces per-tuple call overhead by letting executor nodes work
in batches.
Define a HeapBatch structure and supporting code in tableam.h.
Batches are limited to tuples from a single page and at most
EXEC_BATCH_ROWS (currently 64) entries.
Provide initial heapam support with heapgettup_pagemode_batch().
No executor node is switched over yet; a later commit will adapt
SeqScan to use this API. Other nodes may adopt it in the future.
Also add pgstat_count_heap_getnext_batch() to record batched fetches
in pgstat.
---
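As a rough illustration of the single-page rule stated above, the sketch
below collects up to 'maxitems' visible entries per call, resumes within the
current page on the next call, and skips pages that have nothing visible.
It is a simplified model and not the heapam code: ToyPage, ToyPageScan and
toy_page_batch() are hypothetical names, and cindex only loosely mirrors the
role rs_cindex plays in the real scan descriptor.

```c
#include <assert.h>
#include <stddef.h>

/* Toy page: an array of already-visible entries, in page order. */
typedef struct ToyPage
{
    int         nvisible;
    const int  *visible;
} ToyPage;

typedef struct ToyPageScan
{
    const ToyPage *pages;
    int         npages;
    int         cpage;          /* current page, -1 before the first call */
    int         cindex;         /* last entry returned from cpage, or -1 */
} ToyPageScan;

/*
 * Write up to 'maxitems' entries from a single page into 'out'.
 * Returns the number written, or 0 at end of scan.
 */
static int
toy_page_batch(ToyPageScan *scan, int *out, int maxitems)
{
    const ToyPage *page;
    int         nout = 0;

    assert(maxitems > 0);

    /* Advance until the current page still has entries to return. */
    while (scan->cpage < 0 ||
           scan->cindex + 1 >= scan->pages[scan->cpage].nvisible)
    {
        if (scan->cpage + 1 >= scan->npages)
            return 0;           /* end of scan */
        scan->cpage++;
        scan->cindex = -1;
    }

    /* From here on, only read from this one page. */
    page = &scan->pages[scan->cpage];
    while (scan->cindex + 1 < page->nvisible && nout < maxitems)
        out[nout++] = page->visible[++scan->cindex];

    return nout;
}
```

A page with 3 visible entries followed by an empty page and a page with 5
entries, read with maxitems = 4, yields batches of 3, 4 and 1 entries: a
batch is cut short at a page boundary rather than topped up from the next
page.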
src/backend/access/heap/heapam.c | 212 ++++++++++++++++++++++-
src/backend/access/heap/heapam_handler.c | 4 +
src/include/access/heapam.h | 21 +++
src/include/access/tableam.h | 58 +++++++
src/include/pgstat.h | 5 +
5 files changed, 299 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ed0c0c2dc9f..f62f7edbf5e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1008,7 +1008,7 @@ heapgettup_pagemode(HeapScanDesc scan,
int nkeys,
ScanKey key)
{
- HeapTuple tuple = &(scan->rs_ctup);
+ HeapTuple tuple = &scan->rs_ctup;
Page page;
uint32 lineindex;
uint32 linesleft;
@@ -1089,6 +1089,121 @@ continue_page:
scan->rs_inited = false;
}
+/*
+ * heapgettup_pagemode_batch
+ * Collect up to 'maxitems' visible tuples from a single page in page mode.
+ *
+ * This function returns a *batch* of tuples from one heap page. If the
+ * current page (as tracked by the scan desc) has no more tuples left,
+ * it will advance to the next page and prepare it (via heap_prepare_pagescan).
+ * It will not cross a page boundary while filling the batch.
+ *
+ * Return value:
+ * number of tuples written into 'tdata' (0 at end-of-scan).
+ *
+ * Side effects:
+ * - Ensures rs_cbuf pins the page from which tuples were produced.
+ * - Sets rs_cblock, rs_cindex, rs_ntuples consistently (same as
+ * heapgettup_pagemode's inner-loop effects).
+ * - Does *not* change buffer pin counts except through normal page
+ * transitions performed by heap_fetch_next_buffer().
+ */
+static int
+heapgettup_pagemode_batch(HeapScanDesc scan,
+ ScanDirection dir,
+ int nkeys, ScanKey key,
+ HeapTupleData *tdata,
+ int maxitems)
+{
+ Page page;
+ uint32 lineindex;
+ uint32 linesleft;
+ int nout = 0;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(maxitems > 0);
+
+ /*
+ * If we have no current page (or the current page is exhausted),
+ * advance to the next page that has any visible tuples and prepare it.
+ * This mirrors the outer loop of heapgettup_pagemode(), but we stop
+ * as soon as we have a prepared page; we never produce from two pages.
+ */
+ for (;;)
+ {
+ if (BufferIsValid(scan->rs_cbuf))
+ {
+ /* Are there more visible tuples left on this page? */
+ lineindex = scan->rs_cindex + dir;
+ if (ScanDirectionIsForward(dir))
+ linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
+ (scan->rs_ntuples - lineindex) : 0;
+ else
+ linesleft = scan->rs_cindex;
+ if (linesleft > 0)
+ break; /* continue on this page */
+ }
+
+ /* Move to next page and prepare its visible tuple list. */
+ heap_fetch_next_buffer(scan, dir);
+
+ if (!BufferIsValid(scan->rs_cbuf))
+ {
+ /* end of scan; keep rs_cbuf invalid like heapgettup_pagemode */
+ scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_prefetch_block = InvalidBlockNumber;
+ scan->rs_inited = false;
+ return 0;
+ }
+
+ Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+ heap_prepare_pagescan((TableScanDesc) scan);
+
+ /* After prepare, either rs_ntuples > 0 or we'll loop again. */
+ if (scan->rs_ntuples > 0)
+ {
+ lineindex = ScanDirectionIsForward(dir) ? 0 : scan->rs_ntuples - 1;
+ linesleft = scan->rs_ntuples;
+ break;
+ }
+ /* else: page had no visible tuples; continue to next page */
+ }
+
+ /* From here on, we must only read tuples from this single page. */
+ page = BufferGetPage(scan->rs_cbuf);
+
+ /*
+ * Walk rs_vistuples[] from 'lineindex', copying headers into tdata[]
+ * until either the page is exhausted or the batch capacity is reached.
+ */
+ for (; linesleft > 0 && nout < maxitems; linesleft--, lineindex += dir)
+ {
+ OffsetNumber lineoff;
+ ItemId lpp;
+ HeapTupleData *dst = &tdata[nout];
+
+ Assert(lineindex <= (uint32) scan->rs_ntuples);
+ lineoff = scan->rs_vistuples[lineindex];
+ lpp = PageGetItemId(page, lineoff);
+ Assert(ItemIdIsNormal(lpp));
+
+ dst->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+ dst->t_len = ItemIdGetLength(lpp);
+ dst->t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+ ItemPointerSet(&(dst->t_self), scan->rs_cblock, lineoff);
+
+ if (key != NULL &&
+ !HeapKeyTest(dst, RelationGetDescr(scan->rs_base.rs_rd),
+ nkeys, key))
+ continue;
+
+ scan->rs_cindex = lineindex;
+ nout++;
+ }
+
+ return nout;
+}
/* ----------------------------------------------------------------
* heap access method interface
@@ -1136,6 +1251,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
scan->rs_cbuf = InvalidBuffer;
+ scan->rs_batch_ctup = NULL;
+ scan->rs_batch_cbuf = InvalidBuffer;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1315,6 +1432,8 @@ heap_endscan(TableScanDesc sscan)
*/
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->rs_batch_cbuf))
+ ReleaseBuffer(scan->rs_batch_cbuf);
/*
* Must free the read stream before freeing the BufferAccessStrategy.
@@ -1421,6 +1540,97 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
return true;
}
+/*---------- Batching support -----------*/
+
+/*
+ * heap_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ HeapBatch *hb;
+ Oid relid;
+
+ Assert(maxitems > 0);
+
+ hb = palloc(sizeof(HeapBatch));
+ hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
+ hb->maxitems = maxitems;
+ hb->nitems = 0;
+ hb->buf = InvalidBuffer;
+
+ /* Initialize static fields of HeapTupleData. Row bodies remain on page. */
+ relid = RelationGetRelid(sscan->rs_rd);
+ for (int i = 0; i < maxitems; i++)
+ hb->tupdata[i].t_tableOid = relid;
+
+ return hb;
+}
+
+/*
+ * heap_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ if (BufferIsValid(hb->buf))
+ ReleaseBuffer(hb->buf);
+
+ pfree(hb->tupdata);
+ pfree(hb);
+}
+
+int
+heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ HeapScanDesc scan = (HeapScanDesc) sscan;
+ HeapBatch *hb = (HeapBatch *) am_batch;
+ Buffer curbuf;
+ int n;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(hb->maxitems > 0);
+
+ /* Drop prior batch pin, if any. */
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+
+ hb->nitems = 0;
+
+ /* One call per batch, never crosses a page. */
+ n = heapgettup_pagemode_batch(scan, dir,
+ sscan->rs_nkeys, sscan->rs_key,
+ hb->tupdata, hb->maxitems);
+
+ if (n == 0)
+ return 0; /* end of scan */
+
+ /* Hold a shared pin for the batch lifetime so t_data stays valid. */
+ curbuf = scan->rs_cbuf;
+ IncrBufferRefCount(curbuf);
+ hb->buf = curbuf;
+
+ /* Per-tuple stats (can be collapsed into a future _multi() call). */
+ pgstat_count_heap_getnext_batch(sscan->rs_rd, n);
+
+ hb->nitems = n;
+ return n;
+}
+
+/*----- End of batching support -----*/
+
void
heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..ec4eeccf19c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2623,6 +2623,10 @@ static const TableAmRoutine heapam_methods = {
.scan_rescan = heap_rescan,
.scan_getnextslot = heap_getnextslot,
+ .scan_begin_batch = heap_begin_batch,
+ .scan_getnextbatch = heap_getnextbatch,
+ .scan_end_batch = heap_end_batch,
+
.scan_set_tidrange = heap_set_tidrange,
.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e60d34dad25..02f7793fba0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -74,6 +74,9 @@ typedef struct HeapScanDescData
HeapTupleData rs_ctup; /* current tuple in scan, if any */
+ HeapTupleData *rs_batch_ctup; /* NULL when not using batched mode */
+ Buffer rs_batch_cbuf; /* buffer feeding the batch */
+
/* For scans that stream reads */
ReadStream *rs_read_stream;
@@ -101,6 +104,19 @@ typedef struct HeapScanDescData
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
+/*
+ * HeapBatch -- stateless per-batch buffer. A batch pins one page and
+ * exposes up to maxitems HeapTupleData headers whose t_data point into that
+ * page.
+ */
+typedef struct HeapBatch
+{
+ HeapTupleData *tupdata; /* len = maxitems; headers only */
+ int nitems; /* tuples produced in last getnextbatch() */
+ int maxitems; /* fixed capacity set at begin_batch() */
+ Buffer buf; /* single pinned buffer for this batch */
+} HeapBatch;
+
typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
@@ -294,6 +310,11 @@ extern void heap_endscan(TableScanDesc sscan);
extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
extern bool heap_getnextslot(TableScanDesc sscan,
ScanDirection direction, TupleTableSlot *slot);
+
+extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
+extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
+extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..953207eac50 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -351,6 +351,16 @@ typedef struct TableAmRoutine
ScanDirection direction,
TupleTableSlot *slot);
+ /* ------------------------------------------------------------------------
+ * Batched scan support
+ * ------------------------------------------------------------------------
+ */
+
+ void *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+ int (*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+ ScanDirection dir);
+ void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -1036,6 +1046,54 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_scan_begin_batch
+ * Allocate AM-owned batch payload with capacity 'maxitems'.
+ */
+static inline void *
+table_scan_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_begin_batch != NULL);
+
+ return tam->scan_begin_batch(sscan, maxitems);
+}
+
+/*
+ * table_scan_getnextbatch
+ * Fill next batch from the AM. Returns number of tuples, 0 => EOS.
+ * Batches are single-page in v1. Direction is forward only in v1.
+ */
+static inline int
+table_scan_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ /* Only forward scans are supported in the batched mode. */
+ Assert(dir == ForwardScanDirection);
+ Assert(tam->scan_getnextbatch != NULL);
+
+ return tam->scan_getnextbatch(sscan, am_batch, dir);
+}
+
+/*
+ * table_scan_end_batch
+ * Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ if (am_batch == NULL)
+ return;
+
+ Assert(tam->scan_end_batch != NULL);
+
+ tam->scan_end_batch(sscan, am_batch);
+}
+
/* ----------------------------------------------------------------------------
* TID Range scanning related functions.
* ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e4a59a30b8c..aaea9520b1d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -687,6 +687,11 @@ extern void pgstat_report_analyze(Relation rel,
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.tuples_returned++; \
} while (0)
+#define pgstat_count_heap_getnext_batch(rel, n) \
+ do { \
+ if (pgstat_should_count_relation(rel)) \
+ (rel)->pgstat_info->counts.tuples_returned += (n); \
+ } while (0)
#define pgstat_count_heap_fetch(rel) \
do { \
if (pgstat_should_count_relation(rel)) \
--
2.43.0
[application/octet-stream] v1-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch (9.0K, 8-v1-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch)
From 64971ee050c86326c2ca6023c302cff661383251 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 22:18:30 +0900
Subject: [PATCH v1 3/8] Executor: add ExecProcNodeBatch() and integrate
SeqScan with batch API
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Introduce a batch-capable executor interface alongside the existing
slot-at-a-time path:
* ExecProcNodeBatch() is added to return a TupleBatch instead of a
TupleTableSlot. PlanState gains ExecProcNodeBatch as a function
pointer.
Integrate SeqScan with this interface:
* Add ExecSeqScanBatch* routines that drive heap via the batch table
AM API and return a TupleBatch.
* At init, set ps.ExecProcNodeBatch to these routines when
ScanCanUseBatching() allows.
* Retain ExecSeqScanBatchSlot* variants for slot-at-a-time consumers.
This builds on 0002, which introduced TupleBatch and made SeqScan
consume the AM’s batch API internally but still surface slots. With this
patch, SeqScan can surface batches directly to batch-aware upper nodes.
Plan shape and EXPLAIN output remain unchanged; only internal tuple flow
differs when batching is enabled and allowed.
---
src/backend/executor/execProcnode.c | 52 +++++++++++++++++++++++++++++
src/backend/executor/nodeSeqscan.c | 35 +++++++++++++++++++
src/include/executor/execScan.h | 51 ++++++++++++++++++++++++++++
src/include/executor/executor.h | 10 ++++++
src/include/nodes/execnodes.h | 5 +++
5 files changed, 153 insertions(+)
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f5f9cfbeead..a8c0315e874 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,6 +121,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
+static TupleBatch *ExecProcNodeBatchFirst(PlanState *node);
+static TupleBatch *ExecProcNodeBatchInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
@@ -389,6 +391,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
}
ExecSetExecProcNode(result, result->ExecProcNode);
+ if (result->ExecProcNodeBatch)
+ ExecSetExecProcNodeBatch(result, result->ExecProcNodeBatch);
/*
* Initialize any initPlans present in this node. The planner put them in
@@ -489,6 +493,54 @@ ExecProcNodeInstr(PlanState *node)
return result;
}
+/*
+ * ExecSetExecProcNodeBatch
+ * Install ExecProcNodeBatch with first-call wrapper, mirroring row path.
+ */
+void
+ExecSetExecProcNodeBatch(PlanState *node, ExecProcNodeBatchMtd function)
+{
+ node->ExecProcNodeBatchReal = function;
+ node->ExecProcNodeBatch = ExecProcNodeBatchFirst;
+}
+
+/*
+ * ExecProcNodeBatchFirst
+ * One-time stack-depth check; then pick instrument/no-instrument wrapper.
+ */
+static TupleBatch *
+ExecProcNodeBatchFirst(PlanState *node)
+{
+ check_stack_depth();
+
+ if (node->instrument)
+ node->ExecProcNodeBatch = ExecProcNodeBatchInstr;
+ else
+ node->ExecProcNodeBatch = node->ExecProcNodeBatchReal;
+
+ return node->ExecProcNodeBatch(node);
+}
+
+/*
+ * ExecProcNodeBatchInstr
+ * Instrumentation wrapper for batch calls.
+ *
+ * Note: we record the batch's nvalid as the "tuple" count for this call. That
+ * keeps instrumentation meaningful without changing the Instr API.
+ */
+static TupleBatch *
+ExecProcNodeBatchInstr(PlanState *node)
+{
+ TupleBatch *b;
+
+ InstrStartNode(node->instrument);
+
+ b = node->ExecProcNodeBatchReal(node);
+
+ InstrStopNode(node->instrument, b ? (double) b->nvalid : 0.0);
+
+ return b;
+}
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 2552d420f1c..3f7e40c8908 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -334,6 +334,37 @@ ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
pstate->qual, pstate->ps_ProjInfo);
}
+static TupleBatch *
+ExecSeqScanBatch(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatch(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatch() for when qual evaluation is required.
+ */
+static TupleBatch *
+ExecSeqScanBatchWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatch(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, NULL);
+}
+
/* Batch SeqScan enablement and dispatch */
static void
SeqScanInitBatching(SeqScanState *scanstate, int eflags)
@@ -348,10 +379,12 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
if (scanstate->ss.ps.ps_ProjInfo == NULL)
{
+ scanstate->ss.ps.ExecProcNodeBatch = ExecSeqScanBatch;
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
}
else
{
+ scanstate->ss.ps.ExecProcNodeBatch = NULL;
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
}
}
@@ -359,10 +392,12 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
if (scanstate->ss.ps.ps_ProjInfo == NULL)
{
+ scanstate->ss.ps.ExecProcNodeBatch = ExecSeqScanBatchWithQual;
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
}
else
{
+ scanstate->ss.ps.ExecProcNodeBatch = NULL;
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
}
}
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index fec606471c8..fb4b57a831c 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -297,4 +297,55 @@ ExecScanExtendedBatchSlot(ScanState *node,
}
}
+static inline TupleBatch *
+ExecScanExtendedBatch(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+ int qualified;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next batch from the AM */
+ if (!accessBatchMtd(node))
+ return NULL;
+
+ if (qual != NULL)
+ {
+ qualified = 0;
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *in = TupleBatchGetNextSlot(b);
+
+ Assert(in);
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ if (ExecQual(qual, econtext))
+ {
+ TupleBatchStoreInOut(b, qualified, in);
+ qualified++;
+ }
+ else
+ InstrCountFiltered1(node, 1);
+ }
+ TupleBatchUseOutput(b, qualified);
+ }
+ else
+ qualified = b->nvalid;
+
+ if (qualified > 0)
+ return b;
+ /* else get the next batch from the AM */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 17258f7ae2d..cf5b0c7e05c 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -294,6 +294,7 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
*/
extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
+extern void ExecSetExecProcNodeBatch(PlanState *node, ExecProcNodeBatchMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
@@ -315,6 +316,15 @@ ExecProcNode(PlanState *node)
return node->ExecProcNode(node);
}
+
+static inline TupleBatch *
+ExecProcNodeBatch(PlanState *node)
+{
+ if (node->chgParam != NULL) /* something changed? */
+ ExecReScan(node); /* let ReScan handle this */
+
+ return node->ExecProcNodeBatch(node);
+}
#endif
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f4bb8f7dd7f..a104591ac20 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1147,6 +1147,7 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
/* Return a batch; may reuse caller-provided envelope. NULL => end of scan. */
struct TupleBatch;
typedef struct TupleBatch TupleBatch;
+typedef TupleBatch *(*ExecProcNodeBatchMtd)(struct PlanState *ps);
/* ----------------
* PlanState node
@@ -1171,6 +1172,10 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
+ /* Optional batch-producing entry point (NULL => no batching). */
+ ExecProcNodeBatchMtd ExecProcNodeBatch;
+ ExecProcNodeBatchMtd ExecProcNodeBatchReal;
+
Instrumentation *instrument; /* Optional runtime stats for this node */
WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
--
2.43.0
* Re: Batching in executor
@ 2025-09-26 13:49 Bruce Momjian <[email protected]>
parent: Amit Langote <[email protected]>
1 sibling, 1 reply; 29+ messages in thread
From: Bruce Momjian @ 2025-09-26 13:49 UTC (permalink / raw)
To: Amit Langote <[email protected]>; +Cc: pgsql-hackers
On Fri, Sep 26, 2025 at 10:28:33PM +0900, Amit Langote wrote:
> At PGConf.dev this year we had an unconference session [1] on whether
> the community can support an additional batch executor. The discussion
> there led me to start hacking on $subject. I have also had off-list
> discussions on this topic in recent months with Andres and David, who
> have offered useful thoughts.
>
> This patch series is an early attempt to make executor nodes pass
> around batches of tuples instead of tuple-at-a-time slots. The main
> motivation is to enable expression evaluation in batch form, which can
> substantially reduce per-tuple overhead (mainly from function calls)
> and open the door to further optimizations such as SIMD usage in
> aggregate transition functions. We could even change algorithms of
> some plan nodes to operate on batches when, for example, a child node
> can return batches.
For background, people might want to watch these two videos from POSETTE
2025. The first video explains how data warehouse query needs are
different from OLTP needs:
Building a PostgreSQL data warehouse
https://www.youtube.com/watch?v=tpq4nfEoioE
and the second one explains the executor optimizations done in PG 18:
Hacking Postgres Executor For Performance
https://www.youtube.com/watch?v=D3Ye9UlcR5Y
I learned from these two videos that to handle new workloads, I need to
think of the query demands differently, and of course the question is
whether this can be accomplished without hampering OLTP workloads.
--
Bruce Momjian <[email protected]> https://momjian.us
EDB https://enterprisedb.com
Do not let urgent matters crowd out time for investment in the future.
* Re: Batching in executor
@ 2025-09-29 11:01 Tomas Vondra <[email protected]>
parent: Amit Langote <[email protected]>
1 sibling, 4 replies; 29+ messages in thread
From: Tomas Vondra @ 2025-09-29 11:01 UTC (permalink / raw)
To: Amit Langote <[email protected]>; pgsql-hackers
Hi Amit,
Thanks for the patch. I took a look over the weekend, and did a couple of
experiments / benchmarks, so let me share some initial feedback (or
rather a bunch of questions I came up with).
I'll start with some general thoughts, before going into some nitpicky
comments about patches / code and perf results.
I think the general goal of the patch - reducing the per-tuple overhead
and making the executor more efficient for OLAP workloads - is very
desirable. I believe the limitations of the per-row executor are one of the
reasons why attempts to implement a columnar TAM mostly failed. The
compression is nice, but it's hard to be competitive without an executor
that leverages that too. So starting with an executor, in a way that
helps even heap, seems like a good plan. So +1 to this.
While looking at the patch, I couldn't help but think about the index
prefetching stuff that I work on. It also introduces the concept of a
"batch", for passing data between an index AM and the executor. It's
interesting how different the designs are in some respects. I'm not
saying one of those designs is wrong, it's more due to different goals.
For example, the index prefetching patch establishes a "shared" batch
struct, and the index AM is expected to fill it with data. After that,
the batch is managed entirely by indexam.c, with no AM calls. The only
AM-specific bit in the batch is "position", but that's used only when
advancing to the next page, etc.
This patch does things differently. IIUC, each TAM may produce its own
"batch", which is then wrapped in a generic one. For example, heap
produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
is fine. In the prefetching we chose to move all this code (walking the
batch items) from the AMs into the layer above, and make it AM agnostic.
But for the batching, we want to retain the custom format as long as
possible. Presumably, the various advantages of the TAMs are tied to the
custom/columnar storage format. Memory efficiency thanks to compression,
execution on compressed data, etc. Keeping the custom format as long as
possible is the whole point of "late materialization" (and materializing
as late as possible is one of the important details in column stores).
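To make the wrapping and late-materialization idea concrete, here is a
minimal, self-contained sketch. It is not the patch's TupleBatch or HeapBatch
code: DemoBatch, DemoBatchOps, DemoRlePayload and all demo_* functions are
hypothetical names invented for illustration. The payload is a toy
run-length-encoded column, and the sketch contrasts a format-agnostic sum
(which decodes every row) with a format-aware sum that works directly on the
compressed form.

```c
#include <assert.h>

/* Hypothetical callbacks an AM would supply for its own batch format. */
typedef struct DemoBatchOps
{
	/* Decode row 'i' of the AM payload into a plain value. */
	int			(*get_value) (const void *am_payload, int i);
} DemoBatchOps;

/* Hypothetical generic envelope; the payload stays in the AM's format. */
typedef struct DemoBatch
{
	const DemoBatchOps *ops;
	void	   *am_payload;
	int			nvalid;
} DemoBatch;

/* Toy "compressed" payload: nrows copies of one value (run-length encoded). */
typedef struct DemoRlePayload
{
	int			value;
	int			nrows;
} DemoRlePayload;

static int
demo_rle_get_value(const void *am_payload, int i)
{
	const DemoRlePayload *p = (const DemoRlePayload *) am_payload;

	assert(i >= 0 && i < p->nrows);
	return p->value;
}

static const DemoBatchOps demo_rle_ops = {demo_rle_get_value};

/* Wrap an AM payload in the generic envelope without copying any rows. */
static DemoBatch
demo_wrap_rle(DemoRlePayload *p)
{
	DemoBatch	b = {&demo_rle_ops, p, p->nrows};

	return b;
}

/* Format-agnostic path: materializes every row through the callback. */
static long
demo_sum_generic(const DemoBatch *b)
{
	long		sum = 0;

	for (int i = 0; i < b->nvalid; i++)
		sum += b->ops->get_value(b->am_payload, i);
	return sum;
}

/* Format-aware path: works on the compressed form, one multiply per run. */
static long
demo_sum_rle(const DemoRlePayload *p)
{
	return (long) p->value * p->nrows;
}
```

Both paths give the same answer, but only the second avoids touching each
row, which is the kind of benefit that is lost if the batch has to be turned
into regular slots too early.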
How far ahead have you thought about these capabilities? I was wondering
about two things in particular. First, at which point do we have to
"materialize" the TupleBatch into some generic format (e.g. TupleSlots).
I get it that you want to enable passing batches between nodes, but
would those use the same "format" as the underlying scan node, or some
generic one? Second, will it be possible to execute expressions on the
custom batches (i.e. on "compressed data")? Or is it necessary to
"materialize" the batch into regular tuple slots? I realize those may
not be there "now" but maybe it'd be nice to plan for the future.
It might be worth exploring some columnar formats, and see if this
design would be a good fit. Let's say we want to process data read from
a parquet file. Would we be able to leverage the format, or would we
need to "materialize" into slots too early? Or maybe it'd be good to
look at the VCI extension [1], discussed in a nearby thread. AFAICS
that's still based on an index AM, but there were suggestions to use TAM
instead (and maybe that'd be a better choice).
The other option would be to "create batches" during execution, say by
having a new node that accumulates tuples, builds a batch and sends it
to the node above. This would help both in cases when either the lower
node does not produce batches at all, or the batches are too small (due
to filtering, aggregation, ...). Of course, it'd only win if this
increases efficiency of the upper part of the plan enough to pay for
building the batches. That can be a hard decision.
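A rough sketch of such a batch-building step, under heavy simplification:
rows are plain ints, the child is a row-at-a-time source, and the "node"
fills an array up to a fixed capacity. DemoRowSource, demo_source_next and
demo_build_batch are made-up names for illustration, not executor APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical row-at-a-time child: yields the integers [next, end). */
typedef struct DemoRowSource
{
	int			next;
	int			end;
} DemoRowSource;

static bool
demo_source_next(DemoRowSource *src, int *row)
{
	if (src->next >= src->end)
		return false;
	*row = src->next++;
	return true;
}

/*
 * Pull rows from the child until the batch is full or the child is exhausted.
 * Returns the number of rows stored; 0 means end of input.
 */
static int
demo_build_batch(DemoRowSource *src, int *batch, int capacity)
{
	int			n = 0;

	assert(capacity > 0);
	while (n < capacity && demo_source_next(src, &batch[n]))
		n++;
	return n;
}
```

With a source of 10 rows and a capacity of 4, successive calls return 4, 4, 2
and then 0. The per-row call into the child is still paid here, so the gain
has to come entirely from the nodes above.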
You also mentioned we could make batches larger by letting them span
multiple pages, etc. I'm not sure that's worth it - wouldn't that
substantially complicate the TAM code, which would need to pin+track
multiple buffers for each batch, etc.? Possible, but is it worth it?
I'm not sure allowing multi-page batches would actually solve the issue.
It'd help with batches at the "scan level", but presumably the batch
size in the upper nodes matters just as much. Large scan batches may
help, but hard to predict.
In the index prefetching patch we chose to keep batches 1:1 with leaf
pages, at least for now. Instead we allowed having multiple batches at
once. I'm not sure that'd be necessary for TAMs, though.
This also reminds me of LIMIT queries. The way I imagine a "batchified"
executor to work is that batches are essentially "units of work". For
example, a nested loop would grab a batch of tuples from the outer
relation, lookup inner tuples for the whole batch, and only then pass
the result batch. (I'm ignoring the cases when the batch explodes due to
duplicates.)
But what if there's a LIMIT 1 on top? Maybe it'd be enough to process
just the first tuple, and the rest of the batch is wasted work? Plenty
of (very expensive) OLAP queries have that, and many would likely benefit from
batching, so just disabling batching if there's LIMIT seems way too
heavy handed.
Perhaps it'd be good to gradually ramp up the batch size? Start with
small batches, and then make them larger. The index prefetching does
that too, indirectly - it reads the whole leaf page as a batch, but then
gradually ramps up the prefetch distance (well, read_stream does that).
Maybe the batching should have a similar thing ...
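As a back-of-the-envelope illustration of the ramp-up idea, the sketch below
doubles the batch size after every batch up to a cap, and counts how many
tuples get produced before a LIMIT is satisfied. The doubling policy, the
assumption that every batch comes back full, and the names
demo_next_batch_size and demo_tuples_produced are all assumptions made for
this example; nothing like this exists in the patch.

```c
#include <assert.h>

/* Double the batch size after each batch, never exceeding max_size. */
static int
demo_next_batch_size(int cur_size, int max_size)
{
	assert(cur_size > 0 && max_size > 0);
	if (cur_size > max_size / 2)
		return max_size;
	return cur_size * 2;
}

/*
 * Count how many tuples a scan would produce before a LIMIT of 'limit' rows
 * is satisfied, assuming every batch comes back full.
 */
static long
demo_tuples_produced(long limit, int initial_size, int max_size)
{
	long		produced = 0;
	int			size = initial_size;

	while (produced < limit)
	{
		produced += size;
		size = demo_next_batch_size(size, max_size);
	}
	return produced;
}
```

For LIMIT 1 with a cap of 64, starting at size 1 produces a single tuple,
while starting at the full size produces 64. For LIMIT 100 the ramp-up
produces 1+2+4+8+16+32+64 = 127 tuples in seven batches, so the overshoot for
larger limits stays bounded by roughly one batch.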
In fact, how shall the optimizer decide whether to use batching? It's
one thing to decide whether a node can produce/consume batches, but
another thing is "should it"? With a node that "builds" a batch, this
decision would apply to even more plans, I guess.
I don't have a great answer to this, it seems like an incredibly tricky
costing issue. I'm a bit worried we might end up with something too
coarse, like "jit=on" which we know is causing problems (admittedly,
mostly due to a lot of the LLVM work being unpredictable/external). But
having some "adaptive" heuristics (like the gradual ramp up) might make
it less risky.
FWIW the current batch size limit (64 tuples) seems rather low, but it's
hard to say. It'd be good to be able to experiment with different
values, so I suggest we make this a GUC and not a hard-coded constant.
As for what to add to explain, I'd start by adding info about which
nodes are "batched" (consuming/producing batches), and some info about
the batch sizes. An average size, maybe a histogram if you want to be a
bit fancy.
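For the EXPLAIN part, the bookkeeping could be as small as the sketch below:
a per-node counter of batches and tuples, plus a power-of-two histogram of
batch sizes. DemoBatchStats and the demo_* functions are hypothetical and not
part of the Instrumentation struct; the bucket layout (1, 2-3, 4-7, ...,
128+) is just one possible choice.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_HIST_BUCKETS 8

/* Hypothetical per-node batch statistics. */
typedef struct DemoBatchStats
{
	uint64_t	nbatches;
	uint64_t	ntuples;
	uint64_t	hist[DEMO_HIST_BUCKETS];	/* bucket i covers [2^i, 2^(i+1)) */
} DemoBatchStats;

/* Map a batch size to floor(log2(size)), capped at the last bucket. */
static int
demo_hist_bucket(int nitems)
{
	int			bucket = 0;

	assert(nitems > 0);
	while (nitems > 1 && bucket < DEMO_HIST_BUCKETS - 1)
	{
		nitems >>= 1;
		bucket++;
	}
	return bucket;
}

/* Account for one non-empty batch returned by a node. */
static void
demo_record_batch(DemoBatchStats *stats, int nitems)
{
	stats->nbatches++;
	stats->ntuples += (uint64_t) nitems;
	stats->hist[demo_hist_bucket(nitems)]++;
}

/* Average batch size, or 0 if no batch was recorded. */
static double
demo_avg_batch_size(const DemoBatchStats *stats)
{
	if (stats->nbatches == 0)
		return 0.0;
	return (double) stats->ntuples / (double) stats->nbatches;
}
```

Recording batches of 1, 64 and 43 tuples gives an average of 36 and one entry
each in the 1, 32-63 and 64-127 buckets, which would already show how much
filtering shrinks the batches.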
I have no thoughts about the expression patches, at least not beyond
what I already wrote above. I don't know enough about that part.
[1]
https://www.postgresql.org/message-id/OS7PR01MB119648CA4E8502FE89056E56EEA7D2%40OS7PR01MB11964.jpnpr...
Now, numbers from some microbenchmarks:
On 9/26/25 15:28, Amit Langote wrote:
>
> To evaluate the overheads and benefits, I ran microbenchmarks with
> single and multi-aggregate queries on a single table, with and without
> WHERE clauses. Tables were fully VACUUMed so visibility maps are set
> and IO costs are minimal. shared_buffers was large enough to fit the
> whole table (up to 10M rows, ~43 on each page), and all pages were
> prewarmed into cache before tests. Table schema/script is at [2].
>
> Observations from benchmarking (Detailed benchmark tables are at [3];
> below is just a high-level summary of the main patterns):
>
> * Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
> sum(a) FROM bar_N): batching scan output alone improved latency by
> ~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
> especially once fmgr overhead was paid per batch instead of per row.
>
> * Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
> qual interpreter gave a big step up, with latencies dropping by
> ~30-40% compared to batching=off.
>
> * Five aggregates, no WHERE: batching input from the child scan cut
> ~15% off runtime. Adding batched transition evaluation increased
> improvements to ~30%.
>
> * Five aggregates, with WHERE: modest gains from scan/input batching,
> but per-batch transition evaluation and batched quals brought ~20-30%
> improvement.
>
> * Across all cases, executor overheads became visible only after IO
> was minimized. Once executor cost dominated, batching consistently
> reduced CPU time, with the largest benefits coming from avoiding
> per-row fmgr calls and evaluating quals across batches.
>
> I would appreciate if others could try these patches with their own
> microbenchmarks or workloads and see if they can reproduce numbers
> similar to mine. Feedback on both the general direction and the
> details of the patches would be very helpful. In particular, patches
> 0001-0003, which add the basic batch APIs and integrate them into
> SeqScan, are intended to be the first candidates for review and
> eventual commit. Comments on the later, more experimental patches
> (aggregate input batching and expression evaluation (qual, aggregate
> transition) batching) are also welcome.
>
I tried to replicate the results, but the numbers I see are not this
good. In fact, I see a fair number of regressions (and some are not
negligible).
I'm attaching the scripts I used to build the tables / run the test. I
used the same table structure, and tried to follow the same query
pattern with 1 or 5 aggregates (I used "avg"), and [0, 1, 5] WHERE
conditions (with 100% selectivity).
I measured master vs. 0001-0003 vs. 0001-0007 (with batching on/off).
And I did that on my (relatively) new ryzen machine, and old xeon. The
behavior is quite different for the two machines, but neither of them shows
such improvements. I used clang 19.0, and --with-llvm.
See the attached PDFs with a summary of the results, comparing the
results for master and the two batching branches.
The ryzen is much "smoother" - it shows almost no difference with
batching "off" (as expected). The "scan" branch (with 0001-0003) shows
an improvement of 5-10% - it's consistent, but much less than the 10-20%
you report. For the "agg" branch the benefits are much larger, but
there's also a significant regression for the largest table with 100M
rows (which is ~18GB on disk).
For xeon, the results are a bit more variable, but it affects runs both
with batching "on" and "off". The machine is just more noisy. There
seems to be a small benefit of "scan" batching (in most cases much less
than the 10-20%). The "agg" is a clear win, with up to 30-40% speedup,
and no regression similar to the ryzen.
Perhaps I did something wrong. It does not surprise me this is somewhat
CPU dependent. It's a bit sad the improvements are smaller for the newer
CPU, though.
I also tried running TPC-H. I don't have useful numbers yet, but I ran
into a segfault - see the attached backtrace. It only happens with the
batching, and only on Q22 for some reason. I initially thought it's a
bug in clang, because I saw it with clang-22 built from git, and not
with clang-14 or gcc. But since then I reproduced it with clang-19 (on
debian 13). Still could be a clang bug, of course. I've seen ~20 of
those segfaults so far, and the backtraces look exactly the same.
regards
--
Tomas Vondra
Program terminated with signal SIGSEGV, Segmentation fault.
warning: Section `.reg-xstate/1569550' in core file too small.
#0 VARATT_IS_EXTENDED (PTR=0x0) at ../../../../src/include/varatt.h:412
412 return !VARATT_IS_4B_U(PTR);
(gdb) bt
#0 VARATT_IS_EXTENDED (PTR=0x0) at ../../../../src/include/varatt.h:412
#1 pg_detoast_datum (datum=0x0) at fmgr.c:1798
#2 0x00005570aa359cf4 in DatumGetNumeric (X=0) at ../../../../src/include/utils/numeric.h:66
#3 numeric_avg_accum (fcinfo=0x5570b3cd0100) at numeric.c:5052
#4 0x00005570aa0cf318 in ExecAggPlainTransBatch (state=state@entry=0x5570b3cd0258, op=op@entry=0x5570b3cd0950, econtext=econtext@entry=0x5570b3cb4718) at execExprInterp.c:6171
#5 0x00005570aa0cb0aa in ExecInterpExpr (state=0x5570b3cd0258, econtext=0x5570b3cb4718, isnull=0x0) at execExprInterp.c:2338
#6 0x00005570aa0e73aa in ExecEvalExprNoReturn (state=0x5570b3cd0258, econtext=0x5570b3cb4718) at ../../../src/include/executor/executor.h:431
#7 ExecEvalExprNoReturnSwitchContext (state=0x5570b3cd0258, econtext=0x5570b3cb4718) at ../../../src/include/executor/executor.h:472
#8 advance_aggregates_batch (aggstate=0x5570b3cb4300, b=<optimized out>) at nodeAgg.c:834
#9 agg_retrieve_direct_batch (aggstate=0x5570b3cb4300) at nodeAgg.c:2696
#10 0x00005570aa0e6864 in ExecAgg (pstate=0x5570b3cb4300) at nodeAgg.c:2289
#11 0x00005570aa0ee7e2 in ExecProcNode (node=0x5570b3cb4300) at ../../../src/include/executor/executor.h:317
#12 gather_getnext (gatherstate=0x5570b3cb3ff0) at nodeGather.c:294
#13 ExecGather (pstate=0x5570b3cb3ff0) at nodeGather.c:229
#14 0x00005570aa0e9037 in ExecProcNode (node=0x5570b3cb3ff0) at ../../../src/include/executor/executor.h:317
#15 fetch_input_tuple (aggstate=aggstate@entry=0x5570b3cb3878) at nodeAgg.c:562
#16 0x00005570aa0e7c08 in agg_retrieve_direct (aggstate=0x5570b3cb3878) at nodeAgg.c:2477
#17 0x00005570aa0e6864 in ExecAgg (pstate=0x5570b3cb3878) at nodeAgg.c:2289
#18 0x00005570aa108565 in ExecProcNode (node=0x5570b3cb3878) at ../../../src/include/executor/executor.h:317
#19 ExecSetParamPlan (node=0x5570b3cf4778, econtext=econtext@entry=0x5570b3cf4cd0) at nodeSubplan.c:1116
#20 0x00005570aa108a3b in ExecSetParamPlanMulti (params=params@entry=0x7ff33d3e2a08, econtext=0x5570b3cf4cd0) at nodeSubplan.c:1263
#21 0x00005570aa0d523f in ExecInitParallelPlan (planstate=0x5570b3cd2f48, estate=estate@entry=0x5570b3cb3588, sendParams=0x7ff33d3e2a08, nworkers=4, tuples_needed=-1)
at execParallel.c:636
#22 0x00005570aa0eece2 in ExecGatherMerge (pstate=0x5570b3cd2c38) at nodeGatherMerge.c:210
#23 0x00005570aa104056 in ExecProcNode (node=0x5570b3cd2c38) at ../../../src/include/executor/executor.h:317
#24 ExecNestLoop (pstate=0x5570b3cd2a28) at nodeNestloop.c:108
#25 0x00005570aa0e9037 in ExecProcNode (node=0x5570b3cd2a28) at ../../../src/include/executor/executor.h:317
#26 fetch_input_tuple (aggstate=aggstate@entry=0x5570b3cd22f8) at nodeAgg.c:562
#27 0x00005570aa0e7c08 in agg_retrieve_direct (aggstate=aggstate@entry=0x5570b3cd22f8) at nodeAgg.c:2477
#28 0x00005570aa0e694d in ExecAgg (pstate=0x5570b3cd22f8) at nodeAgg.c:2292
#29 0x00005570aa0f95e0 in ExecProcNode (node=0x5570b3cd22f8) at ../../../src/include/executor/executor.h:317
#30 ExecLimit (pstate=0x5570b3cd1fe8) at nodeLimit.c:95
#31 0x00005570aa0d26ed in ExecProcNode (node=0x5570b3cd1fe8) at ../../../src/include/executor/executor.h:317
#32 ExecutePlan (queryDesc=0x5570b3cba668, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=<optimized out>, dest=0x7ff33d3e42a0) at execMain.c:1697
#33 standard_ExecutorRun (queryDesc=0x5570b3cba668, direction=<optimized out>, count=0) at execMain.c:366
#34 0x00005570aa2a3ccb in PortalRunSelect (portal=portal@entry=0x5570b3c1cda8, forward=<optimized out>, count=0, count@entry=9223372036854775807, dest=dest@entry=0x7ff33d3e42a0)
at pquery.c:921
#35 0x00005570aa2a392d in PortalRun (portal=portal@entry=0x5570b3c1cda8, count=count@entry=9223372036854775807, isTopLevel=true, dest=dest@entry=0x7ff33d3e42a0,
altdest=altdest@entry=0x7ff33d3e42a0, qc=qc@entry=0x7ffd38218570) at pquery.c:765
#36 0x00005570aa2a2b26 in exec_simple_query (
query_string=query_string@entry=0x5570b3b9b0d8 "select\r\n\tcntrycode,\r\n\tcount(*) as numcust,\r\n\tsum(c_acctbal) as totacctbal\r\nfrom\r\n\t(\r\n\t\tselect\r\n\t\t\tsubstring(c_phone from 1 for 2) as cntrycode,\r\n\t\t\tc_acctbal\r\n\t\tfrom\r\n\t\t\tcustomer\r\n\t\twhere\r\n\t\t\tsubstrin"...) at postgres.c:1278
#37 0x00005570aa2a04cd in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4770
--Type <RET> for more, q to quit, c to continue without paging--
#38 0x00005570aa29b81b in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:124
#39 0x00005570aa1f85a1 in postmaster_child_launch (child_type=<optimized out>, child_slot=1, startup_data=startup_data@entry=0x7ffd38218988,
startup_data_len=startup_data_len@entry=24, client_sock=client_sock@entry=0x7ffd382188f8) at launch_backend.c:268
#40 0x00005570aa1fcb0c in BackendStartup (client_sock=0x7ffd382188f8) at postmaster.c:3590
#41 ServerLoop () at postmaster.c:1705
#42 0x00005570aa1fa70d in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x5570b3b95920) at postmaster.c:1403
#43 0x00005570aa1286aa in main (argc=3, argv=0x5570b3b95920) at main.c:231
(gdb)
warning: Section `.reg-xstate/1569551' in core file too small.
#0 VARATT_IS_EXTENDED (PTR=0x0) at ../../../../src/include/varatt.h:412
412 return !VARATT_IS_4B_U(PTR);
(gdb) bt
#0 VARATT_IS_EXTENDED (PTR=0x0) at ../../../../src/include/varatt.h:412
#1 pg_detoast_datum (datum=0x0) at fmgr.c:1798
#2 0x00005570aa359cf4 in DatumGetNumeric (X=0) at ../../../../src/include/utils/numeric.h:66
#3 numeric_avg_accum (fcinfo=0x5570b3c9f718) at numeric.c:5052
#4 0x00005570aa0cf318 in ExecAggPlainTransBatch (state=state@entry=0x5570b3c9a440, op=op@entry=0x5570b3c9f978, econtext=econtext@entry=0x5570b3c6b0a0) at execExprInterp.c:6171
#5 0x00005570aa0cb0aa in ExecInterpExpr (state=0x5570b3c9a440, econtext=0x5570b3c6b0a0, isnull=0x0) at execExprInterp.c:2338
#6 0x00005570aa0e73aa in ExecEvalExprNoReturn (state=0x5570b3c9a440, econtext=0x5570b3c6b0a0) at ../../../src/include/executor/executor.h:431
#7 ExecEvalExprNoReturnSwitchContext (state=0x5570b3c9a440, econtext=0x5570b3c6b0a0) at ../../../src/include/executor/executor.h:472
#8 advance_aggregates_batch (aggstate=0x5570b3c6b330, b=<optimized out>) at nodeAgg.c:834
#9 agg_retrieve_direct_batch (aggstate=0x5570b3c6b330) at nodeAgg.c:2696
#10 0x00005570aa0e6864 in ExecAgg (pstate=0x5570b3c6b330) at nodeAgg.c:2289
#11 0x00005570aa0d26ed in ExecProcNode (node=0x5570b3c6b330) at ../../../src/include/executor/executor.h:317
#12 ExecutePlan (queryDesc=0x5570b3c679d8, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=<optimized out>, dest=0x5570b3c26ee8) at execMain.c:1697
#13 standard_ExecutorRun (queryDesc=0x5570b3c679d8, direction=<optimized out>, count=0) at execMain.c:366
#14 0x00005570aa0d6857 in ParallelQueryMain (seg=seg@entry=0x5570b3bcfc30, toc=toc@entry=0x7ff33de00000) at execParallel.c:1499
#15 0x00005570a9f7fce3 in ParallelWorkerMain (main_arg=<optimized out>) at parallel.c:1563
#16 0x00005570aa1f5d8e in BackgroundWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at bgworker.c:843
#17 0x00005570aa1f85a1 in postmaster_child_launch (child_type=child_type@entry=B_BG_WORKER, child_slot=239, startup_data=startup_data@entry=0x5570b3bd5d30,
startup_data_len=startup_data_len@entry=1472, client_sock=client_sock@entry=0x0) at launch_backend.c:268
#18 0x00005570aa1fb2e3 in StartBackgroundWorker (rw=0x5570b3bd5d30) at postmaster.c:4160
#19 maybe_start_bgworkers () at postmaster.c:4326
#20 0x00005570aa1fce85 in LaunchMissingBackgroundProcesses () at postmaster.c:3400
#21 ServerLoop () at postmaster.c:1720
#22 0x00005570aa1fa70d in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x5570b3b95920) at postmaster.c:1403
#23 0x00005570aa1286aa in main (argc=3, argv=0x5570b3b95920) at main.c:231
Attachments:
[application/x-shellscript] run-test.sh (1.8K, 2-run-test.sh)
download
[application/x-shellscript] create-tables.sh (709B, 3-create-tables.sh)
download
[application/pdf] batching-xeon.pdf (39.1K, 4-batching-xeon.pdf)
download
[application/pdf] batching-ryzen.pdf (52.0K, 5-batching-ryzen.pdf)
download
[text/plain] batching-backtrace.txt (7.9K, 6-batching-backtrace.txt)
download
^ permalink raw reply [nested|flat] 29+ messages in thread
* Re: Batching in executor
@ 2025-09-30 02:11 Amit Langote <[email protected]>
parent: Tomas Vondra <[email protected]>
3 siblings, 1 reply; 29+ messages in thread
From: Amit Langote @ 2025-09-30 02:11 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers
Hi Tomas,
Thanks a lot for your comments and benchmarking.
I plan to reply to your detailed comments and benchmark results, but I
just realized I had forgotten to attach patch 0008 (oops!) in my last
email. That patch adds batched qual evaluation.
I also noticed that the batched path was unnecessarily doing early
“batch-materialization” in cases like SELECT count(*) FROM bar. I’ve
fixed that as well. It was originally designed to avoid such
materialization, but I must have broken it while refactoring.
Attachments:
[application/octet-stream] v2-0008-WIP-Add-ExecQualBatch-and-EEOPs-for-batched-quals.patch (22.8K, 2-v2-0008-WIP-Add-ExecQualBatch-and-EEOPs-for-batched-quals.patch)
download | inline diff:
From 0ac98eedfef945403822d23e3efc9f7248602895 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 22 Sep 2025 16:19:26 +0900
Subject: [PATCH v2 8/8] WIP: Add ExecQualBatch() and EEOPs for batched quals
Introduce ExecInitQualBatch()/ExecQualBatch() to evaluate scan quals
over a TupleBatch. The batched qual interpreter produces a boolean
mask aligned with the batch, marking which rows satisfy the qual.
The scan node later uses this mask to copy only passing rows into
its output slots. If batching is not possible, fall back to the
existing per-tuple engine.
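
As a rough illustration of the mask idea described above, here is a
standalone sketch in plain C. It is not code from the patch: the names
mask_init, term_greater_than and collect_survivors are hypothetical, rows
are plain ints with a separate nulls array, and the sketch zeroes every
bit past the current batch size, which is a choice made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Set bits [0, nrows) and clear every other bit of the nwords-word mask. */
static void
mask_init(uint64_t *mask, int nwords, int nrows)
{
    int full = nrows / 64;  /* words that are entirely valid */
    int rem = nrows % 64;   /* valid bits in the partial word, if any */

    assert(nrows >= 0 && nrows <= nwords * 64);
    memset(mask, 0, sizeof(uint64_t) * (size_t) nwords);
    for (int i = 0; i < full; i++)
        mask[i] = ~UINT64_C(0);
    if (rem != 0)
        mask[full] = ~UINT64_C(0) >> (64 - rem);
}

static void
mask_clear_bit(uint64_t *mask, int i)
{
    mask[i >> 6] &= ~(UINT64_C(1) << (i & 63));
}

static bool
mask_test_bit(const uint64_t *mask, int i)
{
    return (mask[i >> 6] & (UINT64_C(1) << (i & 63))) != 0;
}

/* One AND term, "val > threshold"; a NULL input fails, as for a strict operator. */
static void
term_greater_than(uint64_t *mask, const int *vals, const bool *nulls,
                  int nrows, int threshold)
{
    for (int i = 0; i < nrows; i++)
    {
        if (nulls[i] || !(vals[i] > threshold))
            mask_clear_bit(mask, i);
    }
}

/* Copy the indexes of surviving rows into out[]; returns how many survived. */
static int
collect_survivors(const uint64_t *mask, int nrows, int *out)
{
    int kept = 0;

    for (int i = 0; i < nrows; i++)
    {
        if (mask_test_bit(mask, i))
            out[kept++] = i;
    }
    return kept;
}
```

Each term only ever clears bits, so applying the terms one after another
gives AND semantics without any per-row branching between terms; the
caller reads the mask once at the end to pick the surviving rows.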
Add EEOP_QUAL_BATCH_INITMASK and EEOP_QUAL_BATCH_TERM, and wire them
after EEOP_SCAN_FETCHSOME_BATCH and EEOP_BUILD_SCAN_BATCH_VECTOR.
Batching is limited to quals that are a top-level AND of simple
clauses: either NullTest(var) or strict binary OpExpr with var/const
or var/var arguments. A walker validates the tree, collects the
referenced attnos, and builds a BatchVector; terms are compiled from
the leaves and evaluated to update the mask.
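
The whitelist check described above can be sketched on a toy expression
tree. This is only an illustration under simplifying assumptions: the
Node, QualInfo and qual_is_batchable names are made up for the example,
the tree is binary, attnos are tracked in a 32-bit word instead of a
Bitmapset, and RelabelType handling is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { N_AND, N_OR, N_NULLTEST, N_OP, N_VAR, N_CONST } NodeKind;

typedef struct Node
{
    NodeKind     kind;
    int          attno;   /* N_VAR only: 1-based attribute number */
    bool         strict;  /* N_OP only: operator function is strict */
    struct Node *left;    /* first child (the argument for N_NULLTEST) */
    struct Node *right;   /* second child, if any */
} Node;

typedef struct QualInfo
{
    int      nleaves;     /* accepted leaf clauses */
    unsigned attnos;      /* bit i set => attribute i is referenced */
    int      last_attno;  /* highest referenced attribute */
} QualInfo;

static void
note_var(QualInfo *info, const Node *v)
{
    assert(v->attno > 0 && v->attno < 32);
    info->attnos |= 1u << v->attno;
    if (v->attno > info->last_attno)
        info->last_attno = v->attno;
}

/* Returns true only for an AND tree of NullTest(var) and strict var-op-var/const. */
static bool
qual_is_batchable(const Node *n, QualInfo *info)
{
    if (n == NULL)
        return false;

    switch (n->kind)
    {
        case N_AND:
            return qual_is_batchable(n->left, info) &&
                   qual_is_batchable(n->right, info);

        case N_NULLTEST:
            if (n->left == NULL || n->left->kind != N_VAR)
                return false;
            note_var(info, n->left);
            info->nleaves++;
            return true;

        case N_OP:
            if (!n->strict || n->left == NULL || n->right == NULL)
                return false;
            if (n->left->kind != N_VAR)
                return false;
            if (n->right->kind != N_VAR && n->right->kind != N_CONST)
                return false;
            note_var(info, n->left);
            if (n->right->kind == N_VAR)
                note_var(info, n->right);
            info->nleaves++;
            return true;

        default:
            /* anything outside the whitelist (OR, bare Var, ...) rejects */
            return false;
    }
}
```

Rejecting everything outside the whitelist keeps the batched path small;
a qual that fails the check simply stays on the per-tuple engine.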
ExprState gains batch_private to hold BatchQualRuntime (mask, words)
which are used by the parent node to populate output slots in
TupleBatch.
---
src/backend/executor/execExpr.c | 324 ++++++++++++++++++++++++++
src/backend/executor/execExprInterp.c | 202 ++++++++++++++++
src/backend/executor/nodeSeqscan.c | 2 +
src/backend/jit/llvm/llvmjit_expr.c | 11 +
src/backend/jit/llvm/llvmjit_types.c | 2 +
src/include/executor/execExpr.h | 60 +++++
src/include/executor/execScan.h | 35 +--
src/include/executor/executor.h | 3 +
src/include/nodes/execnodes.h | 4 +
9 files changed, 630 insertions(+), 13 deletions(-)
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 27a5780f557..63df560d5f1 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -111,6 +111,19 @@ static BatchVector *BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var);
static bool ExprListAllSimpleVars(const List *args, Bitmapset **allattnos);
static BatchVectorSlice *BatchVectorSliceFromExprArgs(const List *args,
const BatchVector *bv);
+static int16 BatchVectorFindAttColno(const BatchVector *bv, AttrNumber attno);
+static int16 BatchVectorOffsetForVarExpr(Expr *expr, const BatchVector *bv);
+
+/* private context for the walker */
+typedef struct QualBatchContext
+{
+ List *leaves; /* List<Node*> of accepted leaves */
+ Bitmapset *attnos; /* Vars referenced by accepted leaves */
+ bool ok; /* stays true if batchable */
+ AttrNumber last_scan; /* last needed attribute in scan slot */
+} QualBatchContext;
+
+static bool qual_batchable_walker(Node *node, void *context);
/*
* ExecInitExpr: prepare an expression tree for execution
@@ -5221,6 +5234,209 @@ ExprListAllSimpleVars(const List *args, Bitmapset **allattnos)
return true;
}
+/* helper: extract Var (allowing RelabelType->Var); returns NULL if not */
+static Var *
+strip_to_var(Node *n)
+{
+ if (n == NULL)
+ return NULL;
+ if (IsA(n, RelabelType))
+ n = (Node *) ((RelabelType *) n)->arg;
+ if (!IsA(n, Var))
+ return NULL;
+ if (((Var *) n)->varattno < 0)
+ return NULL;
+ return (Var *) n;
+}
+
+/* main walker; return true to abort traversal early, false to continue */
+static bool
+qual_batchable_walker(Node *node, void *context)
+{
+ QualBatchContext *cxt = (QualBatchContext *) context;
+
+ if (node == NULL || !cxt->ok)
+ return false;
+
+ switch (nodeTag(node))
+ {
+ case T_List:
+ return expression_tree_walker(node, qual_batchable_walker, cxt);
+
+ case T_BoolExpr:
+ {
+ BoolExpr *b = (BoolExpr *) node;
+
+ /* Only AND trees are allowed */
+ if (b->boolop != AND_EXPR)
+ {
+ cxt->ok = false;
+ return true; /* abort */
+ }
+ /* Recurse normally over children */
+ return expression_tree_walker(node, qual_batchable_walker, cxt);
+ }
+
+ case T_NullTest:
+ {
+ NullTest *nt = (NullTest *) node;
+ Var *v = strip_to_var((Node *) nt->arg);
+
+ if (v == NULL)
+ {
+ cxt->ok = false;
+ return true;
+ }
+
+ cxt->attnos = bms_add_member(cxt->attnos, v->varattno);
+ if (v->varattno > cxt->last_scan)
+ cxt->last_scan = v->varattno;
+ cxt->leaves = lappend(cxt->leaves, node);
+
+ /* Do NOT recurse into leaf */
+ return false;
+ }
+
+ case T_OpExpr:
+ {
+ OpExpr *op = (OpExpr *) node;
+ List *args = op->args;
+ Node *l, *r;
+ Var *lv,
+ *rv = NULL;
+
+ /* binary only */
+ if (list_length(args) != 2)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ /* strict operator only (NULL -> false semantics) */
+ if (!func_strict(op->opfuncid))
+ {
+ cxt->ok = false;
+ return true;
+ }
+
+ l = linitial(args);
+ r = lsecond(args);
+ lv = strip_to_var(l);
+ if (lv == NULL)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ cxt->attnos = bms_add_member(cxt->attnos, lv->varattno);
+ if (lv->varattno > cxt->last_scan)
+ cxt->last_scan = lv->varattno;
+
+ if (IsA(r, Const))
+ {
+ /* ok; no attno to add */
+ }
+ else
+ {
+ rv = strip_to_var(r);
+ if (rv == NULL)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ cxt->attnos = bms_add_member(cxt->attnos, rv->varattno);
+ if (rv->varattno > cxt->last_scan)
+ cxt->last_scan = rv->varattno;
+ }
+
+ cxt->leaves = lappend(cxt->leaves, node);
+
+ /* Leaf handled; do NOT recurse into args */
+ return false;
+ }
+
+ /* Whitelist ends here; anything else in the tree rejects */
+ default:
+ cxt->ok = false;
+ break;
+ }
+
+ return true;
+}
+
+/* build a BatchQualTerm from a validated leaf */
+static BatchQualTerm *
+build_term_from_leaf(Node *n, BatchVector *bv)
+{
+ BatchQualTerm *term;
+ BatchQualTermKind kind;
+ bool strict;
+ int16 l_off;
+ int16 r_off;
+ Datum r_const = (Datum) 0;
+ bool r_isnull = false;
+ FmgrInfo *finfo = NULL;
+ Oid collation;
+
+ if (IsA(n, NullTest))
+ {
+ NullTest *nt = (NullTest *) n;
+
+ kind = nt->nulltesttype == IS_NULL ? BQTK_IS_NULL : BQTK_IS_NOT_NULL;
+ l_off = BatchVectorOffsetForVarExpr(nt->arg, bv);
+ r_off = -1;
+ strict = false;
+ collation = InvalidOid;
+
+ if (l_off < 0)
+ return NULL;
+ }
+ else if (IsA(n, OpExpr))
+ {
+ OpExpr *op = (OpExpr *) n;
+ Expr *l = linitial(op->args);
+ Expr *r = lsecond(op->args);
+
+ l_off = BatchVectorOffsetForVarExpr(l, bv);
+ if (l_off < 0)
+ return NULL;
+
+ r_off = BatchVectorOffsetForVarExpr(r, bv);
+ if (IsA(r, Const))
+ {
+ Const *c = (Const *) r;
+
+ kind = BQTK_VAR_CONST;
+ r_const = c->constvalue;
+ r_isnull = c->constisnull;
+ r_off = -1;
+ }
+ else
+ {
+ if (r_off < 0)
+ return NULL;
+ kind = BQTK_VAR_VAR;
+ }
+
+ strict = func_strict(op->opfuncid);
+ collation = exprInputCollation((Node *) op);
+ finfo = palloc(sizeof(FmgrInfo));
+ fmgr_info(op->opfuncid, finfo);
+ }
+ else
+ return NULL;
+
+ term = palloc(sizeof(BatchQualTerm));
+ term->kind = kind;
+ term->strict = strict;
+ term->l_off = l_off;
+ term->r_off = r_off;
+ term->r_const = r_const;
+ term->r_isnull = r_isnull;
+ term->finfo = finfo;
+ term->collation = collation;
+
+ return term;
+}
+
/* ---------- BatchVector stuff ------------- */
static BatchVector *
@@ -5298,3 +5514,111 @@ BatchVectorSliceFromExprArgs(const List *args, const BatchVector *bv)
return bvs;
}
+
+/*
+ * BatchVectorOffsetForVarExpr
+ * Map a Var (or RelabelType->Var) to its BatchVector column index.
+ * Returns -1 if the Var’s attno is not present.
+ */
+static int16
+BatchVectorOffsetForVarExpr(Expr *expr, const BatchVector *bv)
+{
+ AttrNumber attno;
+
+ if (!expr_is_simple_var(expr, &attno))
+ return -1;
+
+ return (int16) BatchVectorFindAttColno(bv, attno);
+}
+
+/*
+ * ExecInitQualBatch
+ * Build a batched-qual EEOP program (AND-only).
+ * Caller should also keep scalar ps->qual for runtime fallback.
+ */
+ExprState *
+ExecInitQualBatch(PlanState *ps)
+{
+ Node *qual = (Node *) ps->plan->qual;
+ QualBatchContext cxt = {NIL, NULL, true, 0};
+ BatchQualRuntime *rt;
+ ExprState *state;
+ BatchVector *bv;
+ uint64 *mask;
+ int mask_words;
+ ListCell *lc;
+ ExprEvalStep scratch = {0};
+
+ if (qual == NULL)
+ return NULL;
+
+ /* validate + collect leaves/attnos with walker */
+ (void) qual_batchable_walker(qual, &cxt);
+ if (!cxt.ok || cxt.leaves == NIL || bms_is_empty(cxt.attnos))
+ return NULL;
+
+ bv = BatchVectorCreate(cxt.attnos, cxt.last_scan);
+
+ mask_words = (bv->maxrows + 63) >> 6;
+ mask = (uint64 *) palloc0(sizeof(uint64) * mask_words);
+
+ /* Runtime carrier (lifetime == exprstate) */
+ rt = palloc0(sizeof(BatchQualRuntime));
+ rt->mask = mask;
+ rt->mask_words = mask_words;
+
+ /* dedicated ExprState for batched program */
+
+ state = makeNode(ExprState);
+ state->expr = (Expr *) qual;
+ state->parent = ps;
+ state->ext_params = NULL;
+
+ /* mark expression as to be used with ExecQual() */
+ state->flags = EEO_FLAG_IS_QUAL;
+
+ /* Only valid as batch qual if this is set. */
+ state->batch_private = (void *) rt;
+
+ scratch.opcode = EEOP_SCAN_FETCHSOME_BATCH;
+ scratch.d.fetch_batch.last_var = cxt.last_scan;
+ ExprEvalPushStep(state, &scratch);
+
+ scratch.opcode = EEOP_BUILD_SCAN_BATCH_VECTOR;
+ scratch.d.batch_vector.bv = bv;
+ ExprEvalPushStep(state, &scratch);
+
+ scratch.opcode = EEOP_QUAL_BATCH_INITMASK;
+ scratch.d.qualbatch_init.bv = bv;
+ scratch.d.qualbatch_init.mask = mask;
+ scratch.d.qualbatch_init.mask_words = mask_words;
+ ExprEvalPushStep(state, &scratch);
+
+ /* TERM per leaf */
+ foreach(lc, cxt.leaves)
+ {
+ BatchQualTerm *term = build_term_from_leaf((Node *) lfirst(lc), bv);
+
+ if (term == NULL)
+ return NULL;
+
+ scratch.opcode = EEOP_QUAL_BATCH_TERM;
+ scratch.d.qualbatch_term.bv = bv;
+ scratch.d.qualbatch_term.mask = mask;
+ scratch.d.qualbatch_term.mask_words = mask_words;
+ scratch.d.qualbatch_term.term = term; /* pointer to palloc'd term */
+ ExprEvalPushStep(state, &scratch);
+ }
+
+ /*
+ * At the end, we don't need to do anything more. Each term has already
+ * cleared the mask bits of the rows it rejected, and ExecQualBatch() reads
+ * the surviving bits from the mask via batch_private.
+ */
+ scratch.opcode = EEOP_DONE_NO_RETURN;
+ ExprEvalPushStep(state, &scratch);
+
+ ExecReadyExpr(state);
+
+ return state;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 41ad9b4838d..5c2baa0e19d 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -608,6 +608,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,
+ &&CASE_EEOP_QUAL_BATCH_INITMASK,
+ &&CASE_EEOP_QUAL_BATCH_TERM,
&&CASE_EEOP_LAST
};
@@ -2350,7 +2352,19 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
/* too complex for an inline implementation */
ExecAggPlainTransBatch(state, op, econtext);
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_QUAL_BATCH_INITMASK)
+ {
+ ExecQualBatchInitMask(state, op, econtext);
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_QUAL_BATCH_TERM)
+ {
+ ExecQualBatchTerm(state, op, econtext);
EEO_NEXT();
}
@@ -6185,3 +6199,191 @@ ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext
elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
}
}
+
+/* set mask bits [0..nvalid_bits) to 1; clear padding in the last word */
+static inline void
+mask_init_all_ones(uint64 *a, int nwords, int nvalid_bits)
+{
+ for (int i = 0; i < nwords; i++)
+ a[i] = ~UINT64CONST(0);
+
+ if ((nvalid_bits & 63) != 0)
+ {
+ int rem = nvalid_bits & 63;
+
+ a[nwords - 1] &= (~UINT64CONST(0)) >> (64 - rem);
+ }
+}
+
+static inline void
+mask_clear_bit(uint64 *a, int i)
+{
+ a[i >> 6] &= ~(UINT64CONST(1) << (i & 63));
+}
+
+void
+ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ BatchVector *bv = op->d.qualbatch_init.bv;
+ uint64 *mask = op->d.qualbatch_init.mask;
+ int nwords = op->d.qualbatch_init.mask_words;
+ int n = bv->nrows;
+
+ /* initialize to all-pass for current batch size */
+ mask_init_all_ones(mask, nwords, n);
+}
+
+void
+ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ BatchVector *bv = op->d.qualbatch_term.bv;
+ uint64 *mask = op->d.qualbatch_term.mask;
+ BatchQualTerm *t = op->d.qualbatch_term.term;
+ int n = bv->nrows;
+
+ switch (t->kind)
+ {
+ case BQTK_IS_NULL:
+ {
+ /* keep bit set only if value IS NULL; clear otherwise */
+ for (int i = 0; i < n; i++)
+ {
+ if (!bv->nulls[t->l_off][i])
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_IS_NOT_NULL:
+ {
+ /* keep bit set only if value IS NOT NULL; clear if NULL */
+ for (int i = 0; i < n; i++)
+ {
+ if (bv->nulls[t->l_off][i])
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_VAR_CONST:
+ {
+ const bool r_isnull = t->r_isnull;
+ const Datum r_const = t->r_const;
+ const bool strict = t->strict;
+ const Oid coll = t->collation;
+ FmgrInfo *finfo = t->finfo;
+ int loff = t->l_off;
+
+ for (int i = 0; i < n; i++)
+ {
+ bool ln = bv->nulls[loff][i];
+ bool pass;
+
+ /* WHERE treats NULL as false; strict ops short-circuit */
+ if (strict && (ln || r_isnull))
+ pass = false;
+ else
+ {
+ Datum lv = bv->cols[loff][i];
+
+ /* fast-paths could go here based on t->fastclass */
+
+ pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, r_const));
+ }
+
+ if (!pass)
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_VAR_VAR:
+ {
+ const bool strict = t->strict;
+ const Oid coll = t->collation;
+ FmgrInfo *finfo = t->finfo;
+ int loff = t->l_off;
+ int roff = t->r_off;
+
+ for (int i = 0; i < n; i++)
+ {
+ bool ln = bv->nulls[loff][i];
+ bool rn = bv->nulls[roff][i];
+ bool pass;
+
+ if (strict && (ln || rn))
+ pass = false;
+ else
+ {
+ Datum lv = bv->cols[loff][i];
+ Datum rv = bv->cols[roff][i];
+
+ /* fast-paths could go here based on t->fastclass */
+
+ pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, rv));
+ }
+
+ if (!pass)
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ default:
+ /* should not happen; leave mask unchanged */
+ break;
+ }
+}
+
+static inline bool
+mask_is_empty(const uint64 *mask, int nwords)
+{
+ for (int i = 0; i < nwords; i++)
+ {
+ if (mask[i] != 0)
+ return false;
+ }
+ return true;
+}
+
+/*
+ * ExecQualBatch
+ * Evaluate a compiled qual (EEOP_QUAL) for a batch of rows.
+ *
+ * Returns the number of true rows (optional convenience for callers).
+ */
+int
+ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b)
+{
+ int i;
+ uint64 *mask;
+ int kept = 0;
+ BatchQualRuntime *rt = ExecGetBatchQualRuntime(state);
+
+ /* verify that expression was compiled using ExecInitQualBatch */
+ Assert(state->flags & EEO_FLAG_IS_QUAL);
+ Assert(rt && rt->mask && rt->mask_words);
+
+ /* run the batched EEOP program once */
+ econtext->scan_batch = b;
+ ExecEvalExprNoReturn(state, econtext);
+
+ mask = rt->mask;
+ if (mask_is_empty(mask, rt->mask_words))
+ return 0;
+
+ /* Add survivors into outslots */
+ TupleBatchRewind(b);
+ i = 0;
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ /* mask bit set => row survives */
+ if (mask[i >> 6] & (UINT64CONST(1) << (i & 63)))
+ TupleBatchStoreInOut(b, kept++, slot);
+ i++;
+ }
+
+ return kept;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index a4cf1e51af0..e5ca619731f 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -401,6 +401,8 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
}
}
+
+ scanstate->ss.ps.qual_batch = ExecInitQualBatch((PlanState *) scanstate);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 45346124bd7..b97d5faebde 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3033,6 +3033,17 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_QUAL_BATCH_INITMASK:
+ build_EvalXFunc(b, mod, "ExecQualBatchInitMask",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ case EEOP_QUAL_BATCH_TERM:
+ build_EvalXFunc(b, mod, "ExecQualBatchTerm",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
case EEOP_LAST:
Assert(false);
break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 1b5e06f60cc..f4f756e7cb5 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -187,4 +187,6 @@ void *referenced_functions[] =
ExecBuildOuterBatchVector,
ExecBuildScanBatchVector,
ExecAggPlainTransBatch,
+ ExecQualBatchInitMask,
+ ExecQualBatchTerm,
};
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index f24782ecf58..f50936acaaa 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -306,6 +306,10 @@ typedef enum ExprEvalOp
EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP, /* per-row fmgr calls */
EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT, /* call transfn once with AggBulkArgs */
+ /* Batched qual evaluation */
+ EEOP_QUAL_BATCH_INITMASK,
+ EEOP_QUAL_BATCH_TERM,
+
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
} ExprEvalOp;
@@ -796,6 +800,21 @@ typedef struct ExprEvalStep
{
struct BatchVector *bv;
} batch_vector;
+
+ struct
+ {
+ struct BatchVector *bv; /* filled earlier by BUILD_BATCH_VECTOR */
+ uint64 *mask; /* shared mask buffer for this program */
+ int mask_words; /* ceil(es_max_batch/64) */
+ } qualbatch_init; /* EEOP_QUAL_BATCH_INITMASK */
+
+ struct
+ {
+ struct BatchVector *bv; /* same bv as init */
+ uint64 *mask; /* same mask buffer */
+ int mask_words; /* same word count */
+ struct BatchQualTerm *term; /* compiled leaf */
+ } qualbatch_term; /* EEOP_QUAL_BATCH_TERM */
} d;
} ExprEvalStep;
@@ -975,4 +994,45 @@ extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprCo
extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
extern void ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
+/* See ExecQualBatchTerm(). */
+typedef enum BatchQualTermKind
+{
+ BQTK_VAR_CONST,
+ BQTK_VAR_VAR,
+ BQTK_IS_NULL,
+ BQTK_IS_NOT_NULL,
+} BatchQualTermKind;
+
+typedef struct BatchQualTerm
+{
+ BatchQualTermKind kind;
+ bool strict; /* follow strict NULL semantics if true */
+ int16 l_off; /* left VAR column (index into BatchVector) */
+ int16 r_off; /* right VAR column, or -1 if Const */
+ Datum r_const; /* for VAR_CONST */
+ bool r_isnull; /* for VAR_CONST */
+ FmgrInfo *finfo; /* fmgr for generic binary ops */
+ Oid collation; /* op collation */
+} BatchQualTerm;
+
+/*
+ * Runtime view for batched qual programs.
+ * Owned by the ExprState; lifetime == ExprState.
+ */
+typedef struct BatchQualRuntime
+{
+ uint64 *mask;
+ int mask_words;
+} BatchQualRuntime;
+
+static inline BatchQualRuntime *
+ExecGetBatchQualRuntime(ExprState *batch_qual)
+{
+ return (BatchQualRuntime *) batch_qual->batch_private;
+}
+
+extern void ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
#endif /* EXEC_EXPR_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index fb4b57a831c..568a7a33b7d 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -304,7 +304,8 @@ ExecScanExtendedBatch(ScanState *node,
{
ExprContext *econtext = node->ps.ps_ExprContext;
TupleBatch *b = node->ps.ps_Batch;
- int qualified;
+ ExprState *qual_batch = node->ps.qual_batch;
+ int qualified = 0;
/* Batch path does not support EPQ */
Assert(node->ps.state->es_epq_active == NULL);
@@ -320,23 +321,31 @@ ExecScanExtendedBatch(ScanState *node,
if (qual != NULL)
{
- qualified = 0;
- while (TupleBatchHasMore(b))
+ ResetExprContext(econtext);
+ if (qual_batch)
{
- TupleTableSlot *in = TupleBatchGetNextSlot(b);
-
- Assert(in);
- ResetExprContext(econtext);
- econtext->ecxt_scantuple = in;
+ qualified = ExecQualBatch(qual_batch, econtext, b);
+ }
+ else
+ {
+ int i = 0;
- if (ExecQual(qual, econtext))
+ while (TupleBatchHasMore(b))
{
- TupleBatchStoreInOut(b, qualified, in);
- qualified++;
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ Assert(slot);
+ econtext->ecxt_scantuple = slot;
+ if (ExecQual(qual, econtext))
+ {
+ TupleBatchStoreInOut(b, qualified, slot);
+ qualified++;
+ }
+ i++;
}
- else
- InstrCountFiltered1(node, 1);
}
+ InstrCountFiltered1(node, b->nvalid - qualified);
+ /* Update count and start using b->outslots. */
TupleBatchUseOutput(b, qualified);
}
else
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c72bd755b79..dd0f2c74ae5 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -333,6 +333,7 @@ ExecProcNodeBatch(PlanState *node)
extern ExprState *ExecInitExpr(Expr *node, PlanState *parent);
extern ExprState *ExecInitExprWithParams(Expr *node, ParamListInfo ext_params);
extern ExprState *ExecInitQual(List *qual, PlanState *parent);
+extern ExprState *ExecInitQualBatch(PlanState *ps);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
@@ -581,6 +582,8 @@ AggGetBulkArgs(FunctionCallInfo fcinfo)
}
#endif
+extern int ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b);
+
extern bool ExecCheck(ExprState *state, ExprContext *econtext);
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index fdfe8b4ddaf..78c5abbb23a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -146,6 +146,9 @@ typedef struct ExprState
* ExecInitExprRec().
*/
ErrorSaveContext *escontext;
+
+ /* batched-program runtime (e.g., BatchQualRuntime) */
+ void *batch_private;
} ExprState;
@@ -1196,6 +1199,7 @@ typedef struct PlanState
* subPlan list, which does not exist in the plan tree).
*/
ExprState *qual; /* boolean qual condition */
+ ExprState *qual_batch; /* boolean qual condition evaluated on batches */
PlanState *lefttree; /* input plan tree(s) */
PlanState *righttree;
--
2.43.0
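For readers following the mask handling in ExecQualBatch() and
ExecQualBatchTerm() above, here is a minimal standalone sketch of the
same idea: start with every row marked as a survivor, clear the bit of
each row that fails a term, then compact the survivors. It is not part
of the patch and uses no PostgreSQL headers. The names
sketch_mask_init(), sketch_filter_lt() and sketch_compact() are
hypothetical, plain ints stand in for Datums, and the "<" comparison
stands in for the fmgr call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_WORDS(n) (((n) + 63) / 64)

/* Set the low n bits: every row starts out as a survivor. */
static void
sketch_mask_init(uint64_t *mask, int n)
{
    int nwords = SKETCH_WORDS(n);

    assert(n >= 0);
    for (int w = 0; w < nwords; w++)
        mask[w] = UINT64_MAX;
    /* Trim the unused high bits of a partially filled last word. */
    if (n % 64 != 0)
        mask[nwords - 1] = (UINT64_C(1) << (n % 64)) - 1;
}

/* Clear bits of rows failing "col < bound"; a NULL input counts as false. */
static void
sketch_filter_lt(uint64_t *mask, const int *col, const bool *isnull,
                 int n, int bound)
{
    for (int i = 0; i < n; i++)
    {
        bool pass = !isnull[i] && col[i] < bound;

        if (!pass)
            mask[i >> 6] &= ~(UINT64_C(1) << (i & 63));
    }
}

/* Write the indexes of surviving rows to out[]; return how many survived. */
static int
sketch_compact(const uint64_t *mask, int n, int *out)
{
    int kept = 0;

    for (int i = 0; i < n; i++)
    {
        if (mask[i >> 6] & (UINT64_C(1) << (i & 63)))
            out[kept++] = i;
    }
    return kept;
}
```

Because each term only ever clears bits, applying several terms to the
same mask in sequence behaves like an AND of those terms, which is the
implicit-AND semantics a qual list needs.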
[application/octet-stream] v2-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch (21.5K, 3-v2-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch)
From c0797084b54d1e5d9ffe1af49c76c9396126ea1c Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 2 Sep 2025 23:46:34 +0900
Subject: [PATCH v2 6/8] WIP: Add EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP
Introduce a batch EEOP that runs plain aggregate transitions by
looping over rows of a TupleBatch. This keeps transition logic in
the interpreter while amortizing per-row costs.
Gate with AggTransCanUseBatch(): plain, non-hashed, single-set
aggregates with no DISTINCT/ORDER/FILTER, and simple Var args.
Extend ExecBuildAggTrans() to prepare batch fetch/build steps and
to return whether a batch path is used.
---
src/backend/executor/execExpr.c | 228 ++++++++++++++++++++++++--
src/backend/executor/execExprInterp.c | 103 ++++++++++++
src/backend/executor/nodeAgg.c | 17 +-
src/backend/jit/llvm/llvmjit_expr.c | 6 +
src/backend/jit/llvm/llvmjit_types.c | 1 +
src/include/executor/execBatch.h | 6 +
src/include/executor/execExpr.h | 14 ++
src/include/executor/executor.h | 3 +-
src/include/executor/nodeAgg.h | 2 +
9 files changed, 363 insertions(+), 17 deletions(-)
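As a rough illustration of the row-loop opcode described in the commit
message above (not part of the patch), the following standalone sketch
runs a strict transition function over one column vector. It assumes a
single argument and a strict transfn whose initial state is NULL, so
the first non-NULL input seeds the state and NULL inputs are skipped.
SketchColumn, SketchState, sketch_sum() and sketch_rowloop() are
hypothetical names; longs stand in for Datums and a plain function
pointer stands in for the fmgr call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One column of a batch: values plus a parallel NULL-flag array. */
typedef struct SketchColumn
{
    const long *vals;
    const bool *nulls;
    int         nrows;
} SketchColumn;

/* Per-group transition state. */
typedef struct SketchState
{
    long value;
    bool isnull;
} SketchState;

typedef long (*sketch_transfn) (long state, long input);

static long
sketch_sum(long state, long input)
{
    return state + input;
}

/* Advance the state over every row of the column, one call per row. */
static void
sketch_rowloop(SketchState *st, const SketchColumn *col, sketch_transfn fn)
{
    int start = 0;

    assert(st != NULL && col != NULL && fn != NULL);

    /* No state yet: seed it from the first non-NULL input. */
    if (st->isnull)
    {
        while (start < col->nrows && col->nulls[start])
            start++;
        if (start == col->nrows)
            return;             /* nothing usable in this batch */
        st->value = col->vals[start];
        st->isnull = false;
        start++;                /* the seeding row is consumed */
    }

    for (int i = start; i < col->nrows; i++)
    {
        if (col->nulls[i])
            continue;           /* strict: skip NULL input */
        st->value = fn(st->value, col->vals[i]);
    }
}
```

Note that the loop resumes at the row after the one used for seeding,
so that row is not fed to the transition function a second time.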
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index f1569879b52..af5ed8b6368 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -95,7 +95,9 @@ static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
int transno, int setno, int setoff, bool ishash,
- bool nullcheck);
+ bool nullcheck, bool batch,
+ BatchVector *bv);
+
static void ExecInitJsonExpr(JsonExpr *jsexpr, ExprState *state,
Datum *resv, bool *resnull,
ExprEvalStep *scratch);
@@ -104,6 +106,10 @@ static void ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
bool exists_coerce,
Datum *resv, bool *resnull);
+static BatchVector *BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var);
+static bool ExprListAllSimpleVars(const List *args, Bitmapset **allattnos);
+static BatchVectorSlice *BatchVectorSliceFromExprArgs(const List *args,
+ const BatchVector *bv);
/*
* ExecInitExpr: prepare an expression tree for execution
@@ -3659,6 +3665,33 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
}
}
+/* plain agg, single set, not hashed, no DISTINCT/ORDER/FILTER */
+static inline bool
+AggTransCanUseBatch(AggState *as, AggStatePerTrans pt)
+{
+ Agg *aggnode = (Agg *) as->ss.ps.plan;
+
+ if (!AggCanUsePlainBatch(as))
+ return false;
+ if (as->aggstrategy == AGG_HASHED)
+ return false;
+ if (aggnode->groupingSets != NIL)
+ return false;
+ if (as->phase == NULL || as->phase->numsets > 0)
+ return false;
+
+ /* per-aggregate complications */
+ if (pt->aggsortrequired)
+ return false;
+ if (pt->aggref &&
+ (pt->aggref->aggdistinct != NIL ||
+ pt->aggref->aggorder != NIL ||
+ pt->aggref->aggfilter != NULL))
+ return false;
+
+ return true;
+}
+
/*
* Build transition/combine function invocations for all aggregate transition
* / combination function invocations in a grouping sets phase. This has to
@@ -3675,13 +3708,17 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash, bool nullcheck)
+ bool doSort, bool doHash, bool nullcheck,
+ bool *batch_trans)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
ExprEvalStep scratch = {0};
bool isCombine = DO_AGGSPLIT_COMBINE(aggstate->aggsplit);
ExprSetupInfo deform = {0, 0, 0, 0, 0, NIL};
+ bool batch = AggCanUsePlainBatch(aggstate);
+ Bitmapset *allattnos = NULL;
+ BatchVector *bv = NULL;
state->expr = (Expr *) aggstate;
state->parent = parent;
@@ -3707,8 +3744,36 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
&deform);
expr_setup_walker((Node *) pertrans->aggref->aggfilter,
&deform);
+
+ if (!AggTransCanUseBatch(aggstate, pertrans) ||
+ !ExprListAllSimpleVars(pertrans->aggref->args, &allattnos))
+ batch = false;
}
- ExecPushExprSetupSteps(state, &deform);
+
+ if (batch)
+ {
+ if (deform.last_outer > 0)
+ {
+ Assert(!bms_is_empty(allattnos));
+ bv = BatchVectorCreate(allattnos, deform.last_outer);
+
+ /*
+ * Deform all tuples up to last_outer in batch
+ */
+ scratch.opcode = EEOP_OUTER_FETCHSOME_BATCH;
+ scratch.d.fetch_batch.last_var = deform.last_outer;
+ ExprEvalPushStep(state, &scratch);
+
+ /*
+ * Put all arg Vars into vectors once per batch slice
+ */
+ scratch.opcode = EEOP_BUILD_OUTER_BATCH_VECTOR;
+ scratch.d.batch_vector.bv = bv;
+ ExprEvalPushStep(state, &scratch);
+ }
+ }
+ else
+ ExecPushExprSetupSteps(state, &deform);
/*
* Emit instructions for each transition value / grouping set combination.
@@ -3746,7 +3811,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
* Evaluate arguments to aggregate/combine function.
*/
argno = 0;
- if (isCombine)
+ if (isCombine && !batch)
{
/*
* Combining two aggregate transition values. Instead of directly
@@ -3816,7 +3881,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
Assert(pertrans->numInputs == argno);
}
- else if (!pertrans->aggsortrequired)
+ else if (!pertrans->aggsortrequired && !batch)
{
ListCell *arg;
@@ -3849,7 +3914,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
}
Assert(pertrans->numTransInputs == argno);
}
- else if (pertrans->numInputs == 1)
+ else if (pertrans->numInputs == 1 && !batch)
{
/*
* Non-presorted DISTINCT and/or ORDER BY case, with a single
@@ -3868,7 +3933,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
Assert(pertrans->numInputs == argno);
}
- else
+ else if (!batch)
{
/*
* Non-presorted DISTINCT and/or ORDER BY case, with multiple
@@ -3896,7 +3961,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
* just keep the prior transValue. This is true for both plain and
* sorted/distinct aggregates.
*/
- if (trans_fcinfo->flinfo->fn_strict && pertrans->numTransInputs > 0)
+ if (trans_fcinfo->flinfo->fn_strict && pertrans->numTransInputs > 0 && !batch)
{
if (strictnulls)
scratch.opcode = EEOP_AGG_STRICT_INPUT_CHECK_NULLS;
@@ -3914,7 +3979,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
}
/* Handle DISTINCT aggregates which have pre-sorted input */
- if (pertrans->numDistinctCols > 0 && !pertrans->aggsortrequired)
+ if (pertrans->numDistinctCols > 0 && !pertrans->aggsortrequired && !batch)
{
if (pertrans->numDistinctCols > 1)
scratch.opcode = EEOP_AGG_PRESORTED_DISTINCT_MULTI;
@@ -3942,7 +4007,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
pertrans, transno, setno, setoff, false,
- nullcheck);
+ nullcheck, batch, bv);
setoff++;
}
}
@@ -3962,7 +4027,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
pertrans, transno, setno, setoff, true,
- nullcheck);
+ nullcheck, false, NULL);
setoff++;
}
}
@@ -4007,6 +4072,9 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
ExecReadyExpr(state);
+ if (batch_trans)
+ *batch_trans = batch;
+
return state;
}
@@ -4020,10 +4088,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
int transno, int setno, int setoff, bool ishash,
- bool nullcheck)
+ bool nullcheck, bool batch, BatchVector *bv)
{
ExprContext *aggcontext;
int adjust_jumpnull = -1;
+ BatchVectorSlice *bvs = NULL;
if (ishash)
aggcontext = aggstate->hashcontext;
@@ -4077,7 +4146,13 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
*/
if (!pertrans->aggsortrequired)
{
- if (pertrans->transtypeByVal)
+ if (batch)
+ {
+ if (bv)
+ bvs = BatchVectorSliceFromExprArgs(pertrans->aggref->args, bv);
+ scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+ }
+ else if (pertrans->transtypeByVal)
{
if (fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
@@ -4108,6 +4183,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
scratch->d.agg_trans.setoff = setoff;
scratch->d.agg_trans.transno = transno;
scratch->d.agg_trans.aggcontext = aggcontext;
+ scratch->d.agg_trans.bvs = bvs;
ExprEvalPushStep(state, scratch);
/* fix up jumpnull */
@@ -5070,3 +5146,129 @@ ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
DomainHasConstraints(returning->typid);
ExprEvalPushStep(state, &scratch);
}
+
+/* Is expr a Var node for a non-system attribute? */
+static bool
+expr_is_simple_var(Expr *expr, AttrNumber *out_attno)
+{
+ if (expr == NULL)
+ return false;
+
+ if (IsA(expr, TargetEntry))
+ return expr_is_simple_var((Expr *) ((TargetEntry *) expr)->expr,
+ out_attno);
+ if (IsA(expr, RelabelType))
+ return expr_is_simple_var((Expr *) ((RelabelType *) expr)->arg,
+ out_attno);
+
+ if (IsA(expr, Var) && ((Var *) expr)->varattno > 0)
+ {
+ *out_attno = ((Var *) expr)->varattno;
+ return true;
+ }
+
+ return false;
+}
+
+/* Are all inputs plain Vars? (RelabelType-wrapped Vars are rejected below.) Collect attnos. */
+static bool
+ExprListAllSimpleVars(const List *args, Bitmapset **allattnos)
+{
+ ListCell *lc;
+
+ foreach(lc, args)
+ {
+ TargetEntry *tle = lfirst_node(TargetEntry, lc);
+ Expr *arg = tle->expr;
+ AttrNumber attno;
+
+ if (!expr_is_simple_var(arg, &attno))
+ return false;
+
+ if (!IsA(arg, Var))
+ return false;
+
+ Assert(attno > 0);
+ *allattnos = bms_add_member(*allattnos, attno);
+ }
+
+ return true;
+}
+
+/* ---------- BatchVector stuff ------------- */
+
+static BatchVector *
+BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var)
+{
+ int maxrows = EXEC_BATCH_ROWS;
+ BatchVector *bv;
+ AttrNumber attno;
+ int i;
+
+ bv = palloc(sizeof(BatchVector));
+ bv->ncols = bms_num_members(attnos);
+ bv->maxrows = maxrows;
+ bv->last_var = last_var;
+ bv->attnos = palloc(sizeof(AttrNumber) * bv->ncols);
+ attno = -1;
+ i = 0;
+ while ((attno = bms_next_member(attnos, attno)) > 0)
+ bv->attnos[i++] = attno;
+ bv->cols = palloc(sizeof(Datum *) * bv->ncols);
+ bv->nulls = palloc(sizeof(bool *) * bv->ncols);
+
+ for (i = 0; i < bv->ncols; i++)
+ {
+ bv->cols[i] = palloc(sizeof(Datum) * maxrows);
+ bv->nulls[i] = palloc(sizeof(bool) * maxrows);
+ }
+
+ bv->nrows = 0;
+ bv->hasnull = false;
+
+ return bv;
+}
+
+static int16
+BatchVectorFindAttColno(const BatchVector *bv, AttrNumber attno)
+{
+ for (int i = 0; i < bv->ncols; i++)
+ if (bv->attnos[i] == attno)
+ return i;
+
+ return -1;
+}
+
+/*
+ * BatchVectorSliceFromExprArgs
+ * Build a BatchVectorSlice for a List of args.
+ *
+ * For Var args (possibly under RelabelType), store the col index.
+ * For non-Var args, store -1. Caller can handle Consts, etc.
+ */
+static BatchVectorSlice *
+BatchVectorSliceFromExprArgs(const List *args, const BatchVector *bv)
+{
+ BatchVectorSlice *bvs = palloc(sizeof(BatchVectorSlice));
+ int nargs = list_length(args);
+ int i = 0;
+ ListCell *lc;
+
+ Assert(bv);
+ bvs->bv = bv;
+ bvs->nargs = nargs;
+ bvs->argoffs = (int16 *) palloc(sizeof(int16) * nargs);
+
+ foreach (lc, args)
+ {
+ Expr *arg = (Expr *) lfirst(lc);
+ AttrNumber attno;
+
+ if (expr_is_simple_var(arg, &attno))
+ bvs->argoffs[i++] = BatchVectorFindAttColno(bv, attno);
+ else
+ bvs->argoffs[i++] = -1; /* non-Var */
+ }
+
+ return bvs;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 68629ad7991..3176679b346 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -606,6 +606,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_BUILD_INNER_BATCH_VECTOR,
&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
&&CASE_EEOP_LAST
};
@@ -2336,6 +2337,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP)
+ {
+ /* too complex for an inline implementation */
+ ExecAggPlainTransBatch(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_LAST)
{
/* unreachable */
@@ -6039,3 +6048,97 @@ ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
}
bv->nrows = i;
}
+
+void
+ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
+ AggStatePerGroup pergroup =
+ &aggstate->all_pergroups[op->d.agg_trans.setoff][op->d.agg_trans.transno];
+ BatchVectorSlice *bvs = op->d.agg_trans.bvs;
+ FunctionCallInfo fcinfo = pertrans->transfn_fcinfo;
+ FmgrInfo *finfo = fcinfo->flinfo;
+ Datum newVal;
+ TupleBatch *batch = econtext->outer_batch;
+ int batch_nrows = bvs ? bvs->bv->nrows : batch->nvalid;
+ int start_row = 0;
+
+ if (finfo->fn_strict)
+ {
+ if (pergroup->noTransValue && bvs)
+ {
+ const BatchVector *bv = bvs->bv;
+ int first = -1;
+
+ Assert(bv);
+ /* Find the first row whose inputs are all non-NULL. */
+ for (int i = 0; i < batch_nrows && first < 0; i++)
+ {
+ bool allnotnull = true;
+
+ for (int j = 0; j < bvs->nargs; j++)
+ {
+ if (bv->nulls[bvs->argoffs[j]][i])
+ allnotnull = false;
+ }
+ if (allnotnull)
+ first = i;
+ }
+ /* No usable row in this batch; transValue stays uninitialized. */
+ if (first < 0)
+ return;
+ /* If transValue has not yet been initialized, do so now. */
+ fcinfo->args[1].value = bv->cols[bvs->argoffs[0]][first];
+ fcinfo->args[1].isnull = false;
+ ExecAggInitGroup(aggstate, pertrans, pergroup,
+ op->d.agg_trans.aggcontext);
+ start_row = first + 1;
+ }
+ else if (pergroup->transValueIsNull)
+ return;
+ }
+
+ switch (ExecEvalStepOp(state, op))
+ {
+ case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
+ /* Loop rows, call the original transfn per element using vector cols. */
+ for (int i = start_row; i < batch_nrows; i++)
+ {
+ bool hasnull = false;
+
+ /* Set up fcinfo args 1..m from column vectors at row i. */
+ if (bvs)
+ {
+ const BatchVector *bv = bvs->bv;
+
+ for (int j = 0; j < bvs->nargs; j++)
+ {
+ int16 argoff = bvs->argoffs[j];
+
+ fcinfo->args[j+1].value = bv->cols[argoff][i];
+ fcinfo->args[j+1].isnull = bv->nulls[argoff][i];
+ if (!hasnull && bv->nulls[argoff][i])
+ hasnull = true;
+ }
+ }
+ /* fcinfo->args[0] is the existing transition state */
+ if (finfo->fn_strict && hasnull)
+ continue;
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ newVal = FunctionCallInvoke(fcinfo);
+ if (!pertrans->transtypeByVal &&
+ DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggCopyTransValue(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+ }
+ break;
+ default:
+ elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
+ }
+}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3ace6363509..662d8bef43b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -825,6 +825,16 @@ advance_aggregates_batch(AggState *aggstate, TupleBatch *b)
{
ExprContext *tmpcontext = aggstate->tmpcontext;
ExprState *evaltrans = aggstate->phase->evaltrans;
+ bool batch_trans = aggstate->phase->batch_trans;
+
+ if (batch_trans)
+ {
+ tmpcontext->ecxt_outertuple = TupleBatchGetSlot(b, 0);
+ tmpcontext->outer_batch = b;
+ ExecEvalExprNoReturnSwitchContext(evaltrans, tmpcontext);
+ TupleBatchConsumeAll(b);
+ return;
+ }
while (TupleBatchHasMore(b))
{
@@ -1800,7 +1810,8 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
phase->evaltrans_cache[i][j] = ExecBuildAggTrans(aggstate, phase,
dosort, dohash,
- nullcheck);
+ nullcheck,
+ NULL);
/* change back */
aggstate->ss.ps.outerops = outerops;
@@ -3367,7 +3378,7 @@ hashagg_reset_spill_state(AggState *aggstate)
}
}
-static bool
+bool
AggCanUsePlainBatch(AggState *aggstate)
{
const Agg *aggnode = (const Agg *) aggstate->ss.ps.plan;
@@ -4233,7 +4244,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
Assert(false);
phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
- false);
+ false, &phase->batch_trans);
/* cache compiled expression for outer slot without NULL check */
phase->evaltrans_cache[0][0] = phase->evaltrans;
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 848f0b52d6f..efb3ee639fc 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3026,6 +3026,12 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
+ build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
case EEOP_LAST:
Assert(false);
break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 6bb527c3f6f..1b5e06f60cc 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -186,4 +186,5 @@ void *referenced_functions[] =
ExecBuildInnerBatchVector,
ExecBuildOuterBatchVector,
ExecBuildScanBatchVector,
+ ExecAggPlainTransBatch,
};
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
index 6f1a38d14bd..b50961fc0c9 100644
--- a/src/include/executor/execBatch.h
+++ b/src/include/executor/execBatch.h
@@ -99,4 +99,10 @@ TupleBatchMaterializeAll(TupleBatch *b)
TupleBatchUseInput(b, b->ntuples);
}
+static inline void
+TupleBatchConsumeAll(TupleBatch *b)
+{
+ b->next = b->nvalid;
+}
+
#endif /* EXECBATCH_H */
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 99c86bac702..1d33e084b69 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -302,6 +302,9 @@ typedef enum ExprEvalOp
EEOP_BUILD_OUTER_BATCH_VECTOR,
EEOP_BUILD_SCAN_BATCH_VECTOR,
+ /* Batched aggregate trans evaluation */
+ EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP, /* per-row fmgr calls */
+
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
} ExprEvalOp;
@@ -750,6 +753,7 @@ typedef struct ExprEvalStep
/* for EEOP_AGG_PLAIN_TRANS_[INIT_][STRICT_]{BYVAL,BYREF} */
/* for EEOP_AGG_ORDERED_TRANS_{DATUM,TUPLE} */
+ /* for EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP */
struct
{
AggStatePerTrans pertrans;
@@ -757,6 +761,7 @@ typedef struct ExprEvalStep
int setno;
int transno;
int setoff;
+ struct BatchVectorSlice *bvs;
} agg_trans;
/* for EEOP_IS_JSON */
@@ -956,8 +961,17 @@ typedef struct BatchVector
int nrows; /* #rows loaded into cols/nulls */
} BatchVector;
+/* A slice of BatchVector that maps caller args to BatchVector columns. */
+typedef struct BatchVectorSlice
+{
+ const BatchVector *bv;
+ int nargs; /* number of args covered */
+ int16 *argoffs; /* length nargs, -1 for non-Var entries */
+} BatchVectorSlice;
+
extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
#endif /* EXEC_EXPR_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index cf5b0c7e05c..5ba9a523970 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -336,7 +336,8 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash, bool nullcheck);
+ bool doSort, bool doHash, bool nullcheck,
+ bool *batch_trans);
extern ExprState *ExecBuildHash32FromAttrs(TupleDesc desc,
const TupleTableSlotOps *ops,
FmgrInfo *hashfunctions,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 6c4891bbaeb..5c5ebfc73f2 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -289,6 +289,7 @@ typedef struct AggStatePerPhaseData
Sort *sortnode; /* Sort node for input ordering for phase */
ExprState *evaltrans; /* evaluation of transition functions */
+ bool batch_trans; /* true if evaltrans contains batch EEOPs */
/*----------
* Cached variants of the compiled expression.
@@ -338,4 +339,5 @@ extern void ExecAggInitializeDSM(AggState *node, ParallelContext *pcxt);
extern void ExecAggInitializeWorker(AggState *node, ParallelWorkerContext *pwcxt);
extern void ExecAggRetrieveInstrumentation(AggState *node);
+extern bool AggCanUsePlainBatch(AggState *aggstate);
#endif /* NODEAGG_H */
--
2.43.0
[application/octet-stream] v2-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch (11.2K, 4-v2-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch)
From c88299a33c376aa8a5a1a5359217e9c8e67b60e8 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 9 Sep 2025 21:43:29 +0900
Subject: [PATCH v2 7/8] WIP: Add EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT
The new EEOP runs a plain aggregate transition over a TupleBatch with
a single fmgr call. Batch vectors are passed to the transfn via
AggBulkArgs stored in fcinfo->flinfo->fn_extra, avoiding per-row fmgr
overhead.
Gate selection with AggTransfnSupportsBulk(), an allowlist of
built-in transfns updated to accept AggBulkArgs. Some integer
transfns are taught to read AggBulkArgs when present and otherwise
fall back to their per-row behavior. Rowloop batching remains
available: transfns not on the allowlist keep using
EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.
---
src/backend/executor/execExpr.c | 28 ++++++++++++++++-
src/backend/executor/execExprInterp.c | 43 ++++++++++++++++++++++++++
src/backend/executor/nodeAgg.c | 1 -
src/backend/jit/llvm/llvmjit_expr.c | 1 +
src/backend/utils/adt/int.c | 32 +++++++++++++++++++
src/backend/utils/adt/int8.c | 44 +++++++++++++++++++++++++++
src/backend/utils/adt/numeric.c | 17 +++++++++++
src/include/executor/execExpr.h | 1 +
src/include/executor/executor.h | 20 ++++++++++++
9 files changed, 185 insertions(+), 2 deletions(-)
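To make the "single call per batch" idea in the commit message above
more concrete, here is a small standalone sketch (not part of the
patch) of a bulk-aware SUM-style transition. SketchBulkArgs and
sketch_int4_sum() are hypothetical names and the field layout is an
assumption, not the patch's AggBulkArgs definition. The patch hands the
bulk arguments over through fcinfo->flinfo->fn_extra; the explicit
pointer parameter here merely stands in for that.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the bulk-argument descriptor. */
typedef struct SketchBulkArgs
{
    int            nrows;       /* rows loaded in this batch */
    int            start_row;   /* first row to consume */
    const int32_t *vals;        /* input column */
    const bool    *nulls;       /* parallel NULL flags */
    bool           hasnull;     /* false lets us skip the NULL checks */
} SketchBulkArgs;

/*
 * SUM(int4)-style transition.  With bulk == NULL it behaves like an
 * ordinary per-row call; otherwise it consumes the whole slice at once.
 */
static int64_t
sketch_int4_sum(int64_t state, int32_t input, const SketchBulkArgs *bulk)
{
    if (bulk == NULL)
        return state + input;   /* fallback: one row per call */

    assert(bulk->start_row >= 0 && bulk->start_row <= bulk->nrows);

    if (!bulk->hasnull)
    {
        /* Tight loop with no per-row branches on NULL flags. */
        for (int i = bulk->start_row; i < bulk->nrows; i++)
            state += bulk->vals[i];
    }
    else
    {
        for (int i = bulk->start_row; i < bulk->nrows; i++)
        {
            if (!bulk->nulls[i])
                state += bulk->vals[i];
        }
    }
    return state;
}
```

Keeping the per-row branch in the same function means a caller that
does not supply bulk arguments sees unchanged behavior, which matches
the "else fall back" wording in the commit message.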
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index af5ed8b6368..27a5780f557 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -47,6 +47,7 @@
#include "utils/acl.h"
#include "utils/array.h"
#include "utils/builtins.h"
+#include "utils/fmgroids.h"
#include "utils/jsonfuncs.h"
#include "utils/jsonpath.h"
#include "utils/lsyscache.h"
@@ -3692,6 +3693,28 @@ AggTransCanUseBatch(AggState *as, AggStatePerTrans pt)
return true;
}
+/* Return true if this transfn OID is known to accept AggBulkArgs. */
+static bool
+AggTransfnSupportsBulk(Oid fn_oid)
+{
+ /* Phase 1: hard-coded allowlist of built-in transfns that accept AggBulkArgs. */
+ static const Oid ok[] =
+ {
+ F_INT8INC_ANY, /* COUNT(*) transfn */
+ F_INT8INC, /* COUNT(arg) transfn */
+ F_INT4_SUM, /* SUM(int) transfn */
+ F_INT4SMALLER, /* MIN(int) transfn */
+ F_INT4LARGER, /* MAX(int) transfn */
+ /* add others as they are made bulk-aware */
+ InvalidOid
+ };
+
+ for (int i = 0; OidIsValid(ok[i]); i++)
+ if (ok[i] == fn_oid)
+ return true;
+ return false;
+}
+
/*
* Build transition/combine function invocations for all aggregate transition
* / combination function invocations in a grouping sets phase. This has to
@@ -4150,7 +4173,10 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
{
if (bv)
bvs = BatchVectorSliceFromExprArgs(pertrans->aggref->args, bv);
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+ if (!AggTransfnSupportsBulk(pertrans->transfn_oid))
+ scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+ else
+ scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT;
}
else if (pertrans->transtypeByVal)
{
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3176679b346..41ad9b4838d 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -607,6 +607,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,
&&CASE_EEOP_LAST
};
@@ -2345,6 +2346,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT)
+ {
+ /* too complex for an inline implementation */
+ ExecAggPlainTransBatch(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_LAST)
{
/* unreachable */
@@ -6138,6 +6147,40 @@ ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext
pergroup->transValueIsNull = fcinfo->isnull;
}
break;
+
+ case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
+ {
+ void *save = fcinfo->flinfo->fn_extra;
+ AggBulkArgs ba = {batch_nrows, start_row};
+
+ if (bvs)
+ {
+ const BatchVector *bv = bvs->bv;
+
+ Assert(bv);
+ ba.nargs = bvs->nargs;
+ ba.argoffs = bvs->argoffs;
+ ba.args = bv->cols;
+ ba.isnull = bv->nulls;
+ ba.hasnull = bv->hasnull;
+ }
+ fcinfo->flinfo->fn_extra = &ba;
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+ newVal = FunctionCallInvoke(fcinfo); /* one call for the entire slice */
+ if (!pertrans->transtypeByVal &&
+ DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggCopyTransValue(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+ fcinfo->flinfo->fn_extra = save;
+ }
+ break;
+
default:
elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 662d8bef43b..a2286ef5e54 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2687,7 +2687,6 @@ agg_retrieve_direct_batch(AggState *aggstate)
initialize_aggregates(aggstate, aggstate->pergroups,
Max(aggstate->phase->numsets, 1));
-
if (aggstate->grp_firstTuple)
{
ExecForceStoreHeapTuple(aggstate->grp_firstTuple, firstSlot, true);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index efb3ee639fc..45346124bd7 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3026,6 +3026,7 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
v_state, op, v_econtext);
diff --git a/src/backend/utils/adt/int.c b/src/backend/utils/adt/int.c
index b5781989a64..eb1780b5590 100644
--- a/src/backend/utils/adt/int.c
+++ b/src/backend/utils/adt/int.c
@@ -1363,18 +1363,50 @@ int2smaller(PG_FUNCTION_ARGS)
Datum
int4larger(PG_FUNCTION_ARGS)
{
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
int32 arg1 = PG_GETARG_INT32(0);
int32 arg2 = PG_GETARG_INT32(1);
+ if (unlikely(ba))
+ {
+ int32 result = arg1;
+
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+ if (arg2 > result)
+ result = arg2;
+ }
+ }
+ PG_RETURN_INT32(result);
+ }
PG_RETURN_INT32((arg1 > arg2) ? arg1 : arg2);
}
Datum
int4smaller(PG_FUNCTION_ARGS)
{
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
int32 arg1 = PG_GETARG_INT32(0);
int32 arg2 = PG_GETARG_INT32(1);
+ if (unlikely(ba))
+ {
+ int32 result = arg1;
+
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+ if (arg2 < result)
+ result = arg2;
+ }
+ }
+ PG_RETURN_INT32(result);
+ }
PG_RETURN_INT32((arg1 < arg2) ? arg1 : arg2);
}
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index bdea490202a..bbabf4e0785 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -461,10 +461,28 @@ int8up(PG_FUNCTION_ARGS)
Datum
int8pl(PG_FUNCTION_ARGS)
{
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
int64 arg1 = PG_GETARG_INT64(0);
int64 arg2 = PG_GETARG_INT64(1);
int64 result;
+ if (unlikely(ba))
+ {
+ result = arg1;
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ arg2 = ba->args[ba->argoffs[0]][i];
+ if (unlikely(pg_add_s64_overflow(arg1, arg2, &result)))
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("bigint out of range")));
+ arg1 = result;
+ }
+ }
+ PG_RETURN_INT64(result);
+ }
if (unlikely(pg_add_s64_overflow(arg1, arg2, &result)))
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
@@ -718,9 +736,35 @@ int8lcm(PG_FUNCTION_ARGS)
Datum
int8inc(PG_FUNCTION_ARGS)
{
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
int64 arg = PG_GETARG_INT64(0);
int64 result;
+ if (unlikely(ba))
+ {
+ result = arg;
+ if (!ba->hasnull || ba->nargs == 0)
+ {
+ /* count only the rows in [start_row, nrows), matching the loop below */
+ if (unlikely(pg_add_s64_overflow(arg, ba->nrows - ba->start_row, &result)))
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("bigint out of range")));
+ PG_RETURN_INT64(result);
+ }
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ if (unlikely(pg_add_s64_overflow(arg, 1, &result)))
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("bigint out of range")));
+ arg = result;
+ }
+ }
+ PG_RETURN_INT64(result);
+ }
+
if (unlikely(pg_add_s64_overflow(arg, 1, &result)))
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 76269918593..b02664c97f5 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -6310,6 +6310,23 @@ int4_sum(PG_FUNCTION_ARGS)
{
int64 oldsum;
int64 newval;
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
+
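+ if (unlikely(ba))
+ {
+ bool seen = !PG_ARGISNULL(0);
+ int64 result = (seen ? PG_GETARG_INT64(0) : 0);
+
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ int32 arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+
+ result = result + arg2;
+ seen = true;
+ }
+ }
+ /* no prior state and no non-NULL input: stay NULL, as in the per-row path */
+ if (!seen)
+ PG_RETURN_NULL();
+ PG_RETURN_INT64(result);
+ }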
if (PG_ARGISNULL(0))
{
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 1d33e084b69..f24782ecf58 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -304,6 +304,7 @@ typedef enum ExprEvalOp
/* Batched aggregate trans evaluation */
EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP, /* per-row fmgr calls */
+ EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT, /* call transfn once with AggBulkArgs */
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 5ba9a523970..c72bd755b79 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -561,6 +561,26 @@ ExecQualAndReset(ExprState *state, ExprContext *econtext)
}
#endif
+#ifndef FRONTEND
+/* Per-call bulk argument vectors for batched aggregate trans functions. */
+typedef struct AggBulkArgs
+{
+ int nrows; /* number of rows in this batch */
+ int start_row;
+ int16 *argoffs;
+ int nargs; /* number of argument vectors */
+ Datum **args; /* args[j][i] = j-th arg at row i */
+ bool **isnull; /* isnull[j][i] */
+ bool hasnull; /* is any datum in args NULL? */
+} AggBulkArgs;
+
+static inline AggBulkArgs *
+AggGetBulkArgs(FunctionCallInfo fcinfo)
+{
+ return (AggBulkArgs *) (fcinfo->flinfo ? fcinfo->flinfo->fn_extra : NULL);
+}
+#endif
+
extern bool ExecCheck(ExprState *state, ExprContext *econtext);
/*
--
2.43.0
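To make the calling convention of the patch above easier to follow outside the tree, here is a minimal standalone sketch. It is not the patch's code: DemoBulkArgs, demo_int4larger_bulk() and demo_count_bulk() are hypothetical names, Datum is modeled as int64_t, and the slice is assumed to be the half-open range [start_row, nrows), which is what the loops in the patch iterate over. Overflow checks are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for AggBulkArgs; Datum is modeled as int64_t. */
typedef struct DemoBulkArgs
{
	int			nrows;			/* end of the slice (exclusive), an assumption */
	int			start_row;		/* first row of the slice */
	int64_t   **args;			/* args[j][i] = j-th argument at row i */
	bool	  **isnull;			/* isnull[j][i] */
	bool		hasnull;		/* true if any value may be NULL */
} DemoBulkArgs;

/* MAX(int4)-style transition: fold the whole slice into the state in one call. */
static int32_t
demo_int4larger_bulk(int32_t state, const DemoBulkArgs *ba, int col)
{
	assert(ba != NULL && ba->start_row <= ba->nrows);

	for (int i = ba->start_row; i < ba->nrows; i++)
	{
		int32_t		v;

		if (ba->isnull[col][i])
			continue;			/* NULL inputs do not affect MAX */
		v = (int32_t) ba->args[col][i];
		if (v > state)
			state = v;
	}
	return state;
}

/* COUNT(arg)-style transition: count the non-NULL rows of the slice. */
static int64_t
demo_count_bulk(int64_t state, const DemoBulkArgs *ba, int col)
{
	assert(ba != NULL && ba->start_row <= ba->nrows);

	/* Fast path: without NULLs every row in the slice counts. */
	if (!ba->hasnull)
		return state + (ba->nrows - ba->start_row);

	for (int i = ba->start_row; i < ba->nrows; i++)
	{
		if (!ba->isnull[col][i])
			state++;
	}
	return state;
}
```

The point of the sketch is only that the per-row function-call overhead collapses into one call per slice, and that the no-NULL fast path has to use the slice length rather than the end index whenever start_row is nonzero.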
[application/octet-stream] v2-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch (16.9K, 5-v2-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch)
From 3cf02cab36bc9b2420f98ff08c17dea082a84f59 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 22 Sep 2025 17:01:29 +0900
Subject: [PATCH v2 5/8] WIP: Add EEOPs and helpers for TupleBatch processing
Introduce new EEOP cases to fetch attributes into TupleBatch
vectors:
- EEOP_{INNER,OUTER,SCAN}_FETCHSOME_BATCH
- EEOP_BUILD_{INNER,OUTER,SCAN}_BATCH_VECTOR
Add ExecBuild{Inner,Outer,Scan}BatchVector() helpers to populate
column vectors (values, nulls, nrows, hasnull) from a TupleBatch.
Extend ExprContext with inner_batch, outer_batch, and scan_batch
fields so expression programs can access active batches directly.
Add slot_getsomeattrs_batch() to prefetch attributes across all
slots in a TupleBatch, similar to slot_getsomeattrs() for one slot.
---
src/backend/executor/execExprInterp.c | 127 +++++++++++++++++++++++++-
src/backend/executor/execTuples.c | 32 +++++++
src/backend/jit/llvm/llvmjit_expr.c | 86 +++++++++++++++++
src/backend/jit/llvm/llvmjit_types.c | 4 +
src/include/executor/execExpr.h | 45 ++++++++-
src/include/executor/tuptable.h | 2 +
src/include/nodes/execnodes.h | 24 +++--
7 files changed, 310 insertions(+), 10 deletions(-)
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 0e1a74976f7..68629ad7991 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -59,6 +59,7 @@
#include "access/heaptoast.h"
#include "catalog/pg_type.h"
#include "commands/sequence.h"
+#include "executor/execBatch.h"
#include "executor/execExpr.h"
#include "executor/nodeSubplan.h"
#include "funcapi.h"
@@ -188,6 +189,11 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
int setno);
static char *ExecGetJsonValueItemString(JsonbValue *item, bool *resnull);
+static pg_attribute_always_inline void ExecBuildBatchVector(ExprState *state,
+ ExprEvalStep *op,
+ ExprContext *econtext,
+ TupleBatch *b);
+
/*
* ScalarArrayOpExprHashEntry
* Hash table entry type used during EEOP_HASHED_SCALARARRAYOP
@@ -446,7 +452,6 @@ ExecReadyInterpretedExpr(ExprState *state)
state->evalfunc_private = ExecInterpExpr;
}
-
/*
* Evaluate expression identified by "state" in the execution context
* given by "econtext". *isnull is set to the is-null flag for the result,
@@ -466,6 +471,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
TupleTableSlot *scanslot;
TupleTableSlot *oldslot;
TupleTableSlot *newslot;
+ TupleBatch *innerbatch;
+ TupleBatch *outerbatch;
+ TupleBatch *scanbatch;
/*
* This array has to be in the same order as enum ExprEvalOp.
@@ -479,6 +487,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_SCAN_FETCHSOME,
&&CASE_EEOP_OLD_FETCHSOME,
&&CASE_EEOP_NEW_FETCHSOME,
+ &&CASE_EEOP_INNER_FETCHSOME_BATCH,
+ &&CASE_EEOP_OUTER_FETCHSOME_BATCH,
+ &&CASE_EEOP_SCAN_FETCHSOME_BATCH,
&&CASE_EEOP_INNER_VAR,
&&CASE_EEOP_OUTER_VAR,
&&CASE_EEOP_SCAN_VAR,
@@ -592,6 +603,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
+ &&CASE_EEOP_BUILD_INNER_BATCH_VECTOR,
+ &&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
+ &&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
&&CASE_EEOP_LAST
};
@@ -612,6 +626,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
scanslot = econtext->ecxt_scantuple;
oldslot = econtext->ecxt_oldtuple;
newslot = econtext->ecxt_newtuple;
+ innerbatch = econtext->inner_batch;
+ outerbatch = econtext->outer_batch;
+ scanbatch = econtext->scan_batch;
#if defined(EEO_USE_COMPUTED_GOTO)
EEO_DISPATCH();
@@ -658,6 +675,36 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_INNER_FETCHSOME_BATCH)
+ {
+ CheckOpSlotCompatibility(op, innerslot);
+
+ Assert(innerbatch);
+ slot_getsomeattrs_batch(innerbatch, op->d.fetch_batch.last_var);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_OUTER_FETCHSOME_BATCH)
+ {
+ CheckOpSlotCompatibility(op, outerslot);
+
+ Assert(outerbatch);
+ slot_getsomeattrs_batch(outerbatch, op->d.fetch_batch.last_var);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_SCAN_FETCHSOME_BATCH)
+ {
+ CheckOpSlotCompatibility(op, scanslot);
+
+ Assert(scanbatch);
+ slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_OLD_FETCHSOME)
{
CheckOpSlotCompatibility(op, oldslot);
@@ -2265,6 +2312,30 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_BUILD_INNER_BATCH_VECTOR)
+ {
+ /* too complex for an inline implementation */
+ ExecBuildInnerBatchVector(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_BUILD_OUTER_BATCH_VECTOR)
+ {
+ /* too complex for an inline implementation */
+ ExecBuildOuterBatchVector(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_BUILD_SCAN_BATCH_VECTOR)
+ {
+ /* too complex for an inline implementation */
+ ExecBuildScanBatchVector(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_LAST)
{
/* unreachable */
@@ -5914,3 +5985,57 @@ ExecAggPlainTransByRef(AggState *aggstate, AggStatePerTrans pertrans,
MemoryContextSwitchTo(oldContext);
}
+
+void
+ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ Assert(econtext->inner_batch);
+ ExecBuildBatchVector(state, op, econtext, econtext->inner_batch);
+}
+
+void
+ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ Assert(econtext->outer_batch);
+ ExecBuildBatchVector(state, op, econtext, econtext->outer_batch);
+}
+
+void
+ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ Assert(econtext->scan_batch);
+ ExecBuildBatchVector(state, op, econtext, econtext->scan_batch);
+}
+
+static pg_attribute_always_inline void
+ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
+ TupleBatch *b)
+{
+ struct BatchVector *bv = op->d.batch_vector.bv;
+ int i = 0;
+
+ if (bv->ncols == 0)
+ return;
+
+ /* Fetch each requested attribute into column vectors. */
+ TupleBatchRewind(b);
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ for (int j = 0; j < bv->ncols; j++)
+ {
+ AttrNumber attno = bv->attnos[j];
+ Datum *cols = bv->cols[j];
+ bool *nulls = bv->nulls[j];
+
+ Assert(attno <= slot->tts_nvalid);
+ cols[i] = slot->tts_values[attno - 1];
+ nulls[i] = slot->tts_isnull[attno - 1];
+ if (!bv->hasnull && nulls[i])
+ bv->hasnull = true;
+ }
+ i++;
+ }
+ bv->nrows = i;
+}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 8e02d68824f..86d5dea8f8b 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -2111,6 +2111,38 @@ slot_getsomeattrs_int(TupleTableSlot *slot, int attnum)
}
}
+void
+slot_getsomeattrs_batch(struct TupleBatch *b, int attnum)
+{
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ /* Check for caller errors */
+ Assert(attnum > 0);
+
+ if (unlikely(attnum > slot->tts_tupleDescriptor->natts))
+ elog(ERROR, "invalid attribute number %d", attnum);
+
+ /* XXX - there should perhaps also be a batch-level att_nvalid */
+ if (attnum <= slot->tts_nvalid)
+ continue;
+
+ /* Fetch as many attributes as possible from the underlying tuple. */
+ slot->tts_ops->getsomeattrs(slot, attnum);
+
+ /*
+ * If the underlying tuple doesn't have enough attributes, tuple
+ * descriptor must have the missing attributes.
+ */
+ if (unlikely(slot->tts_nvalid < attnum))
+ {
+ slot_getmissingattrs(slot, slot->tts_nvalid, attnum);
+ slot->tts_nvalid = attnum;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecTypeFromTL
*
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 712b35df7e5..848f0b52d6f 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -109,6 +109,11 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_newslot;
LLVMValueRef v_resultslot;
+ /* batches */
+ LLVMValueRef v_innerbatch;
+ LLVMValueRef v_outerbatch;
+ LLVMValueRef v_scanbatch;
+
/* nulls/values of slots */
LLVMValueRef v_innervalues;
LLVMValueRef v_innernulls;
@@ -221,6 +226,21 @@ llvm_compile_expr(ExprState *state)
v_state,
FIELDNO_EXPRSTATE_RESULTSLOT,
"v_resultslot");
+ v_innerbatch = l_load_struct_gep(b,
+ StructExprContext,
+ v_econtext,
+ FIELDNO_EXPRCONTEXT_INNERBATCH,
+ "v_innerbatch");
+ v_outerbatch = l_load_struct_gep(b,
+ StructExprContext,
+ v_econtext,
+ FIELDNO_EXPRCONTEXT_OUTERBATCH,
+ "v_outerbatch");
+ v_scanbatch = l_load_struct_gep(b,
+ StructExprContext,
+ v_econtext,
+ FIELDNO_EXPRCONTEXT_SCANBATCH,
+ "v_scanbatch");
/* build global values/isnull pointers */
v_scanvalues = l_load_struct_gep(b,
@@ -439,6 +459,54 @@ llvm_compile_expr(ExprState *state)
break;
}
+ case EEOP_INNER_FETCHSOME_BATCH:
+ {
+ LLVMValueRef params[2];
+
+ params[0] = v_innerbatch;
+ params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+ l_call(b,
+ llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+ llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+ params, lengthof(params), "");
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
+ case EEOP_OUTER_FETCHSOME_BATCH:
+ {
+ LLVMValueRef params[2];
+
+ params[0] = v_outerbatch;
+ params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+ l_call(b,
+ llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+ llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+ params, lengthof(params), "");
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
+ case EEOP_SCAN_FETCHSOME_BATCH:
+ {
+ LLVMValueRef params[2];
+
+ params[0] = v_scanbatch;
+ params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+ l_call(b,
+ llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+ llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+ params, lengthof(params), "");
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
case EEOP_INNER_VAR:
case EEOP_OUTER_VAR:
case EEOP_SCAN_VAR:
@@ -2940,6 +3008,24 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_BUILD_INNER_BATCH_VECTOR:
+ build_EvalXFunc(b, mod, "ExecBuildInnerBatchVector",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
+ case EEOP_BUILD_OUTER_BATCH_VECTOR:
+ build_EvalXFunc(b, mod, "ExecBuildOuterBatchVector",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
+ case EEOP_BUILD_SCAN_BATCH_VECTOR:
+ build_EvalXFunc(b, mod, "ExecBuildScanBatchVector",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
case EEOP_LAST:
Assert(false);
break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 167cd554b9c..6bb527c3f6f 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -179,7 +179,11 @@ void *referenced_functions[] =
MakeExpandedObjectReadOnlyInternal,
slot_getmissingattrs,
slot_getsomeattrs_int,
+ slot_getsomeattrs_batch,
strlen,
varsize_any,
ExecInterpExprStillValid,
+ ExecBuildInnerBatchVector,
+ ExecBuildOuterBatchVector,
+ ExecBuildScanBatchVector,
};
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 75366203706..99c86bac702 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -78,6 +78,11 @@ typedef enum ExprEvalOp
EEOP_OLD_FETCHSOME,
EEOP_NEW_FETCHSOME,
+ /* apply slot_getsomeattrs_batch() to corresponding batch */
+ EEOP_INNER_FETCHSOME_BATCH,
+ EEOP_OUTER_FETCHSOME_BATCH,
+ EEOP_SCAN_FETCHSOME_BATCH,
+
/* compute non-system Var value */
EEOP_INNER_VAR,
EEOP_OUTER_VAR,
@@ -292,11 +297,15 @@ typedef enum ExprEvalOp
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
+ /* ExprContext.*_batch -> BatchVector */
+ EEOP_BUILD_INNER_BATCH_VECTOR,
+ EEOP_BUILD_OUTER_BATCH_VECTOR,
+ EEOP_BUILD_SCAN_BATCH_VECTOR,
+
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
} ExprEvalOp;
-
typedef struct ExprEvalStep
{
/*
@@ -331,6 +340,12 @@ typedef struct ExprEvalStep
const TupleTableSlotOps *kind;
} fetch;
+ struct
+ {
+ /* attribute number up to which to fetch (inclusive) */
+ int last_var;
+ } fetch_batch;
+
/* for EEOP_INNER/OUTER/SCAN/OLD/NEW_[SYS]VAR */
struct
{
@@ -769,6 +784,12 @@ typedef struct ExprEvalStep
void *json_coercion_cache;
ErrorSaveContext *escontext;
} jsonexpr_coercion;
+
+ /* for batch vector construction */
+ struct
+ {
+ struct BatchVector *bv;
+ } batch_vector;
} d;
} ExprEvalStep;
@@ -917,4 +938,26 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
ExprContext *econtext);
+/* ---------- BatchVector stuff ------------- */
+
+/* Vector fetch spec for a list of simple Vars. */
+typedef struct BatchVector
+{
+ /* immutable after BatchVectorCreate */
+ AttrNumber *attnos; /* [ncols] */
+ int ncols;
+ int maxrows;
+ int last_var;
+
+ /* per batch state */
+ Datum **cols; /* [ncols][maxrows] */
+ bool **nulls; /* [ncols][maxrows] */
+ bool hasnull; /* is any datum in cols NULL? */
+ int nrows; /* #rows loaded into cols/nulls */
+} BatchVector;
+
+extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
#endif /* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 095e4cc82e3..2e2192fb3cf 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -347,6 +347,8 @@ extern Datum ExecFetchSlotHeapTupleDatum(TupleTableSlot *slot);
extern void slot_getmissingattrs(TupleTableSlot *slot, int startAttNum,
int lastAttNum);
extern void slot_getsomeattrs_int(TupleTableSlot *slot, int attnum);
+struct TupleBatch;
+extern void slot_getsomeattrs_batch(struct TupleBatch *b, int attnum);
#ifndef FRONTEND
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9b81b842161..fdfe8b4ddaf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -277,6 +277,14 @@ typedef struct ExprContext
#define FIELDNO_EXPRCONTEXT_OUTERTUPLE 3
TupleTableSlot *ecxt_outertuple;
+ /* For batched evaluation using batch-aware EEOPs */
+#define FIELDNO_EXPRCONTEXT_INNERBATCH 4
+ TupleBatch *inner_batch;
+#define FIELDNO_EXPRCONTEXT_OUTERBATCH 5
+ TupleBatch *outer_batch;
+#define FIELDNO_EXPRCONTEXT_SCANBATCH 6
+ TupleBatch *scan_batch;
+
/* Memory contexts for expression evaluation --- see notes above */
MemoryContext ecxt_per_query_memory;
MemoryContext ecxt_per_tuple_memory;
@@ -289,27 +297,27 @@ typedef struct ExprContext
* Values to substitute for Aggref nodes in the expressions of an Agg
* node, or for WindowFunc nodes within a WindowAgg node.
*/
-#define FIELDNO_EXPRCONTEXT_AGGVALUES 8
+#define FIELDNO_EXPRCONTEXT_AGGVALUES 11
Datum *ecxt_aggvalues; /* precomputed values for aggs/windowfuncs */
-#define FIELDNO_EXPRCONTEXT_AGGNULLS 9
+#define FIELDNO_EXPRCONTEXT_AGGNULLS 12
bool *ecxt_aggnulls; /* null flags for aggs/windowfuncs */
/* Value to substitute for CaseTestExpr nodes in expression */
-#define FIELDNO_EXPRCONTEXT_CASEDATUM 10
+#define FIELDNO_EXPRCONTEXT_CASEDATUM 13
Datum caseValue_datum;
-#define FIELDNO_EXPRCONTEXT_CASENULL 11
+#define FIELDNO_EXPRCONTEXT_CASENULL 14
bool caseValue_isNull;
/* Value to substitute for CoerceToDomainValue nodes in expression */
-#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 12
+#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 15
Datum domainValue_datum;
-#define FIELDNO_EXPRCONTEXT_DOMAINNULL 13
+#define FIELDNO_EXPRCONTEXT_DOMAINNULL 16
bool domainValue_isNull;
/* Tuples that OLD/NEW Var nodes in RETURNING may refer to */
-#define FIELDNO_EXPRCONTEXT_OLDTUPLE 14
+#define FIELDNO_EXPRCONTEXT_OLDTUPLE 17
TupleTableSlot *ecxt_oldtuple;
-#define FIELDNO_EXPRCONTEXT_NEWTUPLE 15
+#define FIELDNO_EXPRCONTEXT_NEWTUPLE 18
TupleTableSlot *ecxt_newtuple;
/* Link to containing EState (NULL if a standalone ExprContext) */
--
2.43.0
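For readers who want to see the row-to-column step of ExecBuildBatchVector() in isolation, the following is a minimal standalone sketch. DemoRow, DemoVector and demo_build_vector() are hypothetical names, values are modeled as int64_t, and the fixed-size arrays are a simplification. The sketch resets hasnull at the start of each batch; that is an assumption about the intended semantics, since the function as posted only ever sets the flag and any reset would have to happen elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAXROWS 64
#define DEMO_MAXCOLS 4

/* Hypothetical row-oriented input: one deformed "slot" per row. */
typedef struct DemoRow
{
	int64_t		values[DEMO_MAXCOLS];
	bool		isnull[DEMO_MAXCOLS];
} DemoRow;

/* Hypothetical column-oriented output, loosely modeled on BatchVector. */
typedef struct DemoVector
{
	int			ncols;
	int			attnos[DEMO_MAXCOLS];	/* 1-based attribute numbers */
	int64_t		cols[DEMO_MAXCOLS][DEMO_MAXROWS];
	bool		nulls[DEMO_MAXCOLS][DEMO_MAXROWS];
	bool		hasnull;		/* does this batch contain any NULL? */
	int			nrows;			/* rows loaded into cols/nulls */
} DemoVector;

/* Copy the requested attributes of every row into per-column arrays. */
static void
demo_build_vector(DemoVector *bv, const DemoRow *rows, int nrows)
{
	assert(nrows >= 0 && nrows <= DEMO_MAXROWS);
	assert(bv->ncols >= 0 && bv->ncols <= DEMO_MAXCOLS);

	bv->hasnull = false;		/* assumption: the flag is per batch */
	for (int i = 0; i < nrows; i++)
	{
		for (int j = 0; j < bv->ncols; j++)
		{
			int			att = bv->attnos[j] - 1;

			assert(att >= 0 && att < DEMO_MAXCOLS);
			bv->cols[j][i] = rows[i].values[att];
			bv->nulls[j][i] = rows[i].isnull[att];
			if (rows[i].isnull[att])
				bv->hasnull = true;
		}
	}
	bv->nrows = nrows;
}
```

Once the values sit in per-column arrays, a bulk-aware function can scan one contiguous array per argument instead of chasing one slot per row, which is what makes the hasnull shortcut and any later SIMD work practical.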
[application/octet-stream] v2-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch (6.3K, 6-v2-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch)
From abb8b1ded7cf192d286662dd320ad93802ce05d2 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 4 Sep 2025 22:55:25 +0900
Subject: [PATCH v2 4/8] WIP: Add agg_retrieve_direct_batch() for plain
aggregates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Teach Agg to consume child tuples in batches for AGG_PLAIN. A new
agg_retrieve_direct_batch() pulls TupleBatch from the child via
ExecProcNodeBatch(), materializes as needed, and advances per-agg
transition state over the batch. A first tuple is copied to match
the direct path’s behavior before batch processing.
Add AggCanUsePlainBatch() and select retrieve_plain at init time.
The batch path is chosen when there are no grouping sets, the
strategy is AGG_PLAIN, and the child provides ExecProcNodeBatch();
otherwise the existing row path is kept.
Plan shape and EXPLAIN remain unchanged. Semantics are identical
to the non-batch direct path; this only reduces per-tuple overhead.
---
src/backend/executor/nodeAgg.c | 123 +++++++++++++++++++++++++++++++++
src/include/nodes/execnodes.h | 5 ++
2 files changed, 128 insertions(+)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a4f3d30f307..3ace6363509 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -820,6 +820,20 @@ advance_aggregates(AggState *aggstate)
aggstate->tmpcontext);
}
+static void
+advance_aggregates_batch(AggState *aggstate, TupleBatch *b)
+{
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ ExprState *evaltrans = aggstate->phase->evaltrans;
+
+ while (TupleBatchHasMore(b))
+ {
+ tmpcontext->ecxt_outertuple = TupleBatchGetNextSlot(b);
+ ExecEvalExprNoReturnSwitchContext(evaltrans, tmpcontext);
+ ResetExprContext(tmpcontext);
+ }
+}
+
/*
* Run the transition function for a DISTINCT or ORDER BY aggregate
* with only one input. This is called after we have completed
@@ -2260,6 +2274,9 @@ ExecAgg(PlanState *pstate)
result = agg_retrieve_hash_table(node);
break;
case AGG_PLAIN:
+ /* init-time choice */
+ result = node->retrieve_plain(node);
+ break;
case AGG_SORTED:
result = agg_retrieve_direct(node);
break;
@@ -2618,6 +2635,91 @@ agg_retrieve_direct(AggState *aggstate)
return NULL;
}
+static TupleTableSlot *
+agg_retrieve_direct_batch(AggState *aggstate)
+{
+ PlanState *child = outerPlanState(aggstate);
+ ExprContext *econtext = aggstate->ss.ps.ps_ExprContext;
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ const bool hasGroupingSets = aggstate->phase->numsets > 0;
+ TupleTableSlot *firstSlot = aggstate->ss.ss_ScanTupleSlot;
+ TupleBatch *b = NULL;
+
+ Assert(child->ExecProcNodeBatch);
+
+ /* mimic the first-tuple copy from agg_retrieve_direct() */
+ for (;;)
+ {
+ b = ExecProcNodeBatch(child);
+ if (b == NULL)
+ {
+ if (hasGroupingSets)
+ {
+ aggstate->input_done = true;
+ break;
+ }
+ aggstate->agg_done = true;
+ break;
+ }
+ if (b->nvalid == 0)
+ continue;
+
+ TupleBatchMaterializeAll(b);
+ aggstate->grp_firstTuple = ExecCopySlotHeapTuple(TupleBatchGetSlot(b, 0));
+ break;
+ }
+
+ /* initialize_aggregates etc. as in the direct path */
+ ReScanExprContext(econtext);
+ for (int i = 0; i < Max(aggstate->phase->numsets, 1); i++)
+ ReScanExprContext(aggstate->aggcontexts[i]);
+
+ initialize_aggregates(aggstate, aggstate->pergroups,
+ Max(aggstate->phase->numsets, 1));
+
+ if (aggstate->grp_firstTuple)
+ {
+ ExecForceStoreHeapTuple(aggstate->grp_firstTuple, firstSlot, true);
+ aggstate->grp_firstTuple = NULL;
+ tmpcontext->ecxt_outertuple = firstSlot;
+
+ advance_aggregates_batch(aggstate, b);
+ ResetExprContext(tmpcontext);
+ }
+
+ /* consume remaining rows in current and subsequent batches */
+ if (b)
+ {
+ if (TupleBatchHasMore(b))
+ advance_aggregates_batch(aggstate, b);
+ for (;;)
+ {
+ b = ExecProcNodeBatch(child);
+ if (b == NULL)
+ {
+ if (hasGroupingSets)
+ aggstate->input_done = true;
+ else
+ aggstate->agg_done = true;
+ break;
+ }
+ if (b->nvalid == 0)
+ continue;
+
+ TupleBatchMaterializeAll(b);
+ advance_aggregates_batch(aggstate, b);
+ }
+ }
+
+ /* finalize and project like the direct path */
+ econtext->ecxt_outertuple = firstSlot;
+ prepare_projection_slot(aggstate, econtext->ecxt_outertuple, 0);
+ select_current_set(aggstate, 0, false);
+ finalize_aggregates(aggstate, aggstate->peragg, aggstate->pergroups[0]);
+
+ return project_aggregates(aggstate);
+}
+
/*
* ExecAgg for hashed case: read input and build hash table
*/
@@ -3265,6 +3367,22 @@ hashagg_reset_spill_state(AggState *aggstate)
}
}
+static bool
+AggCanUsePlainBatch(AggState *aggstate)
+{
+ const Agg *aggnode = (const Agg *) aggstate->ss.ps.plan;
+
+ Assert(outerPlanState(aggstate));
+
+ /* grouping sets present -> bail */
+ if (aggnode->groupingSets != NIL)
+ return false;
+
+ if (aggstate->phase->aggstrategy != AGG_PLAIN)
+ return false;
+
+ return outerPlanState(aggstate)->ExecProcNodeBatch != NULL;
+}
/* -----------------
* ExecInitAgg
@@ -4060,6 +4178,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
(errcode(ERRCODE_GROUPING_ERROR),
errmsg("aggregate function calls cannot be nested")));
+ if (AggCanUsePlainBatch(aggstate))
+ aggstate->retrieve_plain = agg_retrieve_direct_batch;
+ else
+ aggstate->retrieve_plain = agg_retrieve_direct;
+
/*
* Build expressions doing all the transition work at once. We build a
* different one for each phase, as the number of transition function
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a104591ac20..9b81b842161 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2535,6 +2535,9 @@ typedef struct AggStatePerGroupData *AggStatePerGroup;
typedef struct AggStatePerPhaseData *AggStatePerPhase;
typedef struct AggStatePerHashData *AggStatePerHash;
+struct AggState;
+typedef TupleTableSlot *(*AggRetrievePlainFn)(struct AggState *);
+
typedef struct AggState
{
ScanState ss; /* its first field is NodeTag */
@@ -2610,6 +2613,8 @@ typedef struct AggState
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
SharedAggInfo *shared_info; /* one entry per worker */
+
+ AggRetrievePlainFn retrieve_plain; /* init-time choice */
} AggState;
/* ----------------
--
2.43.0
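The init-time choice between the row path and the batch path in the patch above can be modeled in a few lines. This is only a sketch under stated assumptions: DemoAgg, demo_init() and the two retrieve functions are hypothetical, the child node is replaced by a plain array, and DEMO_BATCH_ROWS mirrors the 64-row batch limit mentioned elsewhere in this series. It shows that both paths compute the same result while the batch path calls the "child" far less often.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_BATCH_ROWS 64		/* mirrors the 64-row limit, an assumption here */

typedef struct DemoAgg DemoAgg;
typedef int64_t (*DemoRetrieveFn) (DemoAgg *agg);

struct DemoAgg
{
	const int32_t *input;		/* stand-in for the child node's output */
	int			ninput;
	bool		child_has_batch;	/* stand-in for a non-NULL batch callback */
	int			ncalls;			/* how many times the "child" was called */
	DemoRetrieveFn retrieve;	/* init-time choice, like retrieve_plain */
};

/* Row path: one child call per input row. */
static int64_t
demo_retrieve_row(DemoAgg *agg)
{
	int64_t		sum = 0;

	for (int i = 0; i < agg->ninput; i++)
	{
		agg->ncalls++;
		sum += agg->input[i];
	}
	return sum;
}

/* Batch path: one child call per batch of up to DEMO_BATCH_ROWS rows. */
static int64_t
demo_retrieve_batch(DemoAgg *agg)
{
	int64_t		sum = 0;

	for (int start = 0; start < agg->ninput; start += DEMO_BATCH_ROWS)
	{
		int			end = start + DEMO_BATCH_ROWS;

		if (end > agg->ninput)
			end = agg->ninput;
		agg->ncalls++;
		for (int i = start; i < end; i++)
			sum += agg->input[i];
	}
	return sum;
}

/* Pick the retrieve function once, so the per-call path has no branch. */
static void
demo_init(DemoAgg *agg)
{
	assert(agg->input != NULL || agg->ninput == 0);
	agg->ncalls = 0;
	agg->retrieve = agg->child_has_batch ? demo_retrieve_batch
		: demo_retrieve_row;
}
```

Choosing the function pointer at init time keeps the decision out of the hot path, which is the same reason the patch stores retrieve_plain in AggState instead of testing for batch support on every ExecAgg() call.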
[application/octet-stream] v2-0001-Add-batch-table-AM-API-and-heapam-implementation.patch (13.7K, 7-v2-0001-Add-batch-table-AM-API-and-heapam-implementation.patch)
From 3318650e720a01cbd5948349b9fbcdbb8ddda7cf Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 21:56:17 +0900
Subject: [PATCH v2 1/8] Add batch table AM API and heapam implementation
Introduce new table AM callbacks to fetch multiple tuples per call.
This reduces per-tuple call overhead by letting executor nodes work
in batches.
Define a HeapBatch structure and supporting code in tableam.h.
Batches are limited to tuples from a single page and at most
EXEC_BATCH_ROWS (currently 64) entries.
Provide initial heapam support with heapgettup_pagemode_batch().
No executor node is switched over yet; a later commit will adapt
SeqScan to use this API. Other nodes may adopt it in the future.
Also add pgstat_count_heap_getnext_batch() to record batched fetches
in pgstat.
---
src/backend/access/heap/heapam.c | 212 ++++++++++++++++++++++-
src/backend/access/heap/heapam_handler.c | 4 +
src/include/access/heapam.h | 21 +++
src/include/access/tableam.h | 58 +++++++
src/include/pgstat.h | 5 +
5 files changed, 299 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ed0c0c2dc9f..f62f7edbf5e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1008,7 +1008,7 @@ heapgettup_pagemode(HeapScanDesc scan,
int nkeys,
ScanKey key)
{
- HeapTuple tuple = &(scan->rs_ctup);
+ HeapTuple tuple = &scan->rs_ctup;
Page page;
uint32 lineindex;
uint32 linesleft;
@@ -1089,6 +1089,121 @@ continue_page:
scan->rs_inited = false;
}
+/*
+ * heapgettup_pagemode_batch
+ * Collect up to 'maxitems' visible tuples from a single page in page mode.
+ *
+ * This function returns a *batch* of tuples from one heap page. If the
+ * current page (as tracked by the scan desc) has no more tuples left,
+ * it will advance to the next page and prepare it (via heap_prepare_pagescan).
+ * It will not cross a page boundary while filling the batch.
+ *
+ * Return value:
+ * number of tuples written into 'tdata' (0 at end-of-scan).
+ *
+ * Side effects:
+ * - Ensures rs_cbuf pins the page from which tuples were produced.
+ * - Sets rs_cblock, rs_cindex, rs_ntuples consistently (same as
+ * heapgettup_pagemode’s inner-loop effects).
+ * - Does *not* change buffer pin counts except through normal page
+ * transitions performed by heap_fetch_next_buffer().
+ */
+static int
+heapgettup_pagemode_batch(HeapScanDesc scan,
+ ScanDirection dir,
+ int nkeys, ScanKey key,
+ HeapTupleData *tdata,
+ int maxitems)
+{
+ Page page;
+ uint32 lineindex;
+ uint32 linesleft;
+ int nout = 0;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(maxitems > 0);
+
+ /*
+ * If we have no current page (or the current page is exhausted),
+ * advance to the next page that has any visible tuples and prepare it.
+ * This mirrors the outer loop of heapgettup_pagemode(), but we stop
+ * as soon as we have a prepared page; we never produce from two pages.
+ */
+ for (;;)
+ {
+ if (BufferIsValid(scan->rs_cbuf))
+ {
+ /* Are there more visible tuples left on this page? */
+ lineindex = scan->rs_cindex + dir;
+ if (ScanDirectionIsForward(dir))
+ linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
+ (scan->rs_ntuples - lineindex) : 0;
+ else
+ linesleft = scan->rs_cindex;
+ if (linesleft > 0)
+ break; /* continue on this page */
+ }
+
+ /* Move to next page and prepare its visible tuple list. */
+ heap_fetch_next_buffer(scan, dir);
+
+ if (!BufferIsValid(scan->rs_cbuf))
+ {
+ /* end of scan; keep rs_cbuf invalid like heapgettup_pagemode */
+ scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_prefetch_block = InvalidBlockNumber;
+ scan->rs_inited = false;
+ return 0;
+ }
+
+ Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+ heap_prepare_pagescan((TableScanDesc) scan);
+
+ /* After prepare, either rs_ntuples > 0 or we'll loop again. */
+ if (scan->rs_ntuples > 0)
+ {
+ lineindex = ScanDirectionIsForward(dir) ? 0 : scan->rs_ntuples - 1;
+ linesleft = scan->rs_ntuples;
+ break;
+ }
+ /* else: page had no visible tuples; continue to next page */
+ }
+
+ /* From here on, we must only read tuples from this single page. */
+ page = BufferGetPage(scan->rs_cbuf);
+
+ /*
+ * Walk rs_vistuples[] from 'lineindex', copying headers into tdata[]
+ * until either the page is exhausted or the batch capacity is reached.
+ */
+ for (; linesleft > 0 && nout < maxitems; linesleft--, lineindex += dir)
+ {
+ OffsetNumber lineoff;
+ ItemId lpp;
+ HeapTupleData *dst = &tdata[nout];
+
+ Assert(lineindex <= (uint32) scan->rs_ntuples);
+ lineoff = scan->rs_vistuples[lineindex];
+ lpp = PageGetItemId(page, lineoff);
+ Assert(ItemIdIsNormal(lpp));
+
+ dst->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+ dst->t_len = ItemIdGetLength(lpp);
+ dst->t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+ ItemPointerSet(&(dst->t_self), scan->rs_cblock, lineoff);
+
+ if (key != NULL &&
+ !HeapKeyTest(dst, RelationGetDescr(scan->rs_base.rs_rd),
+ nkeys, key))
+ continue;
+
+ scan->rs_cindex = lineindex;
+ nout++;
+ }
+
+ return nout;
+}
/* ----------------------------------------------------------------
* heap access method interface
@@ -1136,6 +1251,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
scan->rs_cbuf = InvalidBuffer;
+ scan->rs_batch_ctup = NULL;
+ scan->rs_batch_cbuf = InvalidBuffer;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1315,6 +1432,8 @@ heap_endscan(TableScanDesc sscan)
*/
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->rs_batch_cbuf))
+ ReleaseBuffer(scan->rs_batch_cbuf);
/*
* Must free the read stream before freeing the BufferAccessStrategy.
@@ -1421,6 +1540,97 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
return true;
}
+/*---------- Batching support -----------*/
+
+/*
+ * heap_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ HeapBatch *hb;
+ Oid relid;
+
+ Assert(maxitems > 0);
+
+ hb = palloc(sizeof(HeapBatch));
+ hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
+ hb->maxitems = maxitems;
+ hb->nitems = 0;
+ hb->buf = InvalidBuffer;
+
+ /* Initialize static fields of HeapTupleData. Row bodies remain on page. */
+ relid = RelationGetRelid(sscan->rs_rd);
+ for (int i = 0; i < maxitems; i++)
+ hb->tupdata[i].t_tableOid = relid;
+
+ return hb;
+}
+
+/*
+ * heap_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ if (BufferIsValid(hb->buf))
+ ReleaseBuffer(hb->buf);
+
+ pfree(hb->tupdata);
+ pfree(hb);
+}
+
+int
+heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ HeapScanDesc scan = (HeapScanDesc) sscan;
+ HeapBatch *hb = (HeapBatch *) am_batch;
+ Buffer curbuf;
+ int n;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(hb->maxitems > 0);
+
+ /* Drop prior batch pin, if any. */
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+
+ hb->nitems = 0;
+
+ /* One call per batch, never crosses a page. */
+ n = heapgettup_pagemode_batch(scan, dir,
+ sscan->rs_nkeys, sscan->rs_key,
+ hb->tupdata, hb->maxitems);
+
+ if (n == 0)
+ return 0; /* end of scan */
+
+ /* Hold an extra pin for the batch lifetime so t_data stays valid. */
+ curbuf = scan->rs_cbuf;
+ IncrBufferRefCount(curbuf);
+ hb->buf = curbuf;
+
+ /* Per-tuple stats (can be collapsed into a future _multi() call). */
+ pgstat_count_heap_getnext_batch(sscan->rs_rd, n);
+
+ hb->nitems = n;
+ return n;
+}
+
+/*----- End of batching support -----*/
+
void
heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..ec4eeccf19c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2623,6 +2623,10 @@ static const TableAmRoutine heapam_methods = {
.scan_rescan = heap_rescan,
.scan_getnextslot = heap_getnextslot,
+ .scan_begin_batch = heap_begin_batch,
+ .scan_getnextbatch = heap_getnextbatch,
+ .scan_end_batch = heap_end_batch,
+
.scan_set_tidrange = heap_set_tidrange,
.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e60d34dad25..02f7793fba0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -74,6 +74,9 @@ typedef struct HeapScanDescData
HeapTupleData rs_ctup; /* current tuple in scan, if any */
+ HeapTupleData *rs_batch_ctup; /* NULL when not using batched mode */
+ Buffer rs_batch_cbuf; /* buffer feeding the batch */
+
/* For scans that stream reads */
ReadStream *rs_read_stream;
@@ -101,6 +104,19 @@ typedef struct HeapScanDescData
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
+/*
+ * HeapBatch -- stateless per-batch buffer. A batch pins one page and
+ * exposes up to maxitems HeapTupleData headers whose t_data point into that
+ * page.
+ */
+typedef struct HeapBatch
+{
+ HeapTupleData *tupdata; /* len = maxitems; headers only */
+ int nitems; /* tuples produced in last getnextbatch() */
+ int maxitems; /* fixed capacity set at begin_batch() */
+ Buffer buf; /* single pinned buffer for this batch */
+} HeapBatch;
+
typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
@@ -294,6 +310,11 @@ extern void heap_endscan(TableScanDesc sscan);
extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
extern bool heap_getnextslot(TableScanDesc sscan,
ScanDirection direction, TupleTableSlot *slot);
+
+extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
+extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
+extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..953207eac50 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -351,6 +351,16 @@ typedef struct TableAmRoutine
ScanDirection direction,
TupleTableSlot *slot);
+ /* ------------------------------------------------------------------------
+ * Batched scan support
+ * ------------------------------------------------------------------------
+ */
+
+ void *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+ int (*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+ ScanDirection dir);
+ void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -1036,6 +1046,54 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_scan_begin_batch
+ * Allocate AM-owned batch payload with capacity 'maxitems'.
+ */
+static inline void *
+table_scan_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_begin_batch != NULL);
+
+ return tam->scan_begin_batch(sscan, maxitems);
+}
+
+/*
+ * table_scan_getnextbatch
+ * Fill next batch from the AM. Returns number of tuples, 0 => EOS.
+ * Batches are single-page in v1. Direction is forward only in v1.
+ */
+static inline int
+table_scan_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ /* Only forward scans are supported in the batched mode. */
+ Assert(dir == ForwardScanDirection);
+ Assert(tam->scan_getnextbatch != NULL);
+
+ return tam->scan_getnextbatch(sscan, am_batch, dir);
+}
+
+/*
+ * table_scan_end_batch
+ * Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ if (am_batch == NULL)
+ return;
+
+ Assert(tam->scan_end_batch != NULL);
+
+ tam->scan_end_batch(sscan, am_batch);
+}
+
/* ----------------------------------------------------------------------------
* TID Range scanning related functions.
* ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e4a59a30b8c..aaea9520b1d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -687,6 +687,11 @@ extern void pgstat_report_analyze(Relation rel,
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.tuples_returned++; \
} while (0)
+#define pgstat_count_heap_getnext_batch(rel, n) \
+ do { \
+ if (pgstat_should_count_relation(rel)) \
+ (rel)->pgstat_info->counts.tuples_returned += (n); \
+ } while (0)
#define pgstat_count_heap_fetch(rel) \
do { \
if (pgstat_should_count_relation(rel)) \
--
2.43.0
[application/octet-stream] v2-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch (9.0K, 8-v2-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch)
From 10d0df2676462f1931b2ef5072eed7129d936328 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 22:18:30 +0900
Subject: [PATCH v2 3/8] Executor: add ExecProcNodeBatch() and integrate
SeqScan with batch API
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Introduce a batch-capable executor interface alongside the existing
slot-at-a-time path:
* ExecProcNodeBatch() is added to return a TupleBatch instead of a
TupleTableSlot. PlanState gains ExecProcNodeBatch as a function
pointer.
Integrate SeqScan with this interface:
* Add ExecSeqScanBatch* routines that drive heap via the batch table
AM API and return a TupleBatch.
* At init, set ps.ExecProcNodeBatch to these routines when
ScanCanUseBatching() allows.
* Retain ExecSeqScanBatchSlot* variants for slot-at-a-time consumers.
This builds on 0002, which introduced TupleBatch and made SeqScan
consume the AM’s batch API internally but still surface slots. With this
patch, SeqScan can surface batches directly to batch-aware upper nodes.
Plan shape and EXPLAIN output remain unchanged; only internal tuple flow
differs when batching is enabled and allowed.
---
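The first-call wrapper arrangement that ExecSetExecProcNodeBatch() mirrors from the row path can be sketched in isolation as follows. This is an illustration under assumptions, not executor code: ToyNode, toy_set_proc(), toy_proc_first(), toy_proc_instr() and toy_return_42() are hypothetical names, and a counter stands in for the one-time stack-depth check.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyNode ToyNode;
typedef int (*ToyProcMtd) (ToyNode *node);

/* Hypothetical stand-in for the relevant PlanState fields. */
struct ToyNode
{
	ToyProcMtd	proc;			/* entry point that callers invoke */
	ToyProcMtd	proc_real;		/* the node's actual routine */
	int			instrument;		/* nonzero => route through instr wrapper */
	int			first_calls;	/* times the first-call wrapper ran */
	int			instr_calls;	/* calls seen by the instr wrapper */
};

static int	toy_proc_first(ToyNode *node);
static int	toy_proc_instr(ToyNode *node);

/* Install 'fn' behind the first-call wrapper. */
static void
toy_set_proc(ToyNode *node, ToyProcMtd fn)
{
	assert(fn != NULL);
	node->proc_real = fn;
	node->proc = toy_proc_first;
}

/* Runs once per node, then replaces itself with the steady-state target. */
static int
toy_proc_first(ToyNode *node)
{
	node->first_calls++;		/* stands in for the stack-depth check */
	node->proc = node->instrument ? toy_proc_instr : node->proc_real;
	return node->proc(node);
}

/* Wraps every call when instrumentation is requested. */
static int
toy_proc_instr(ToyNode *node)
{
	int			result = node->proc_real(node);

	node->instr_calls++;
	return result;
}

/* A trivial "real" routine used only to exercise the dispatch. */
static int
toy_return_42(ToyNode *node)
{
	(void) node;
	return 42;
}
```

The point of the indirection is that the one-time work and the choice between the instrumented and plain path are paid on the first call only; afterwards node->proc points straight at the routine that should run.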
src/backend/executor/execProcnode.c | 52 +++++++++++++++++++++++++++++
src/backend/executor/nodeSeqscan.c | 35 +++++++++++++++++++
src/include/executor/execScan.h | 51 ++++++++++++++++++++++++++++
src/include/executor/executor.h | 10 ++++++
src/include/nodes/execnodes.h | 5 +++
5 files changed, 153 insertions(+)
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f5f9cfbeead..a8c0315e874 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,6 +121,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
+static TupleBatch *ExecProcNodeBatchFirst(PlanState *node);
+static TupleBatch *ExecProcNodeBatchInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
@@ -389,6 +391,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
}
ExecSetExecProcNode(result, result->ExecProcNode);
+ if (result->ExecProcNodeBatch)
+ ExecSetExecProcNodeBatch(result, result->ExecProcNodeBatch);
/*
* Initialize any initPlans present in this node. The planner put them in
@@ -489,6 +493,54 @@ ExecProcNodeInstr(PlanState *node)
return result;
}
+/*
+ * ExecSetExecProcNodeBatch
+ * Install ExecProcNodeBatch with a first-call wrapper, mirroring the row path.
+ */
+void
+ExecSetExecProcNodeBatch(PlanState *node, ExecProcNodeBatchMtd function)
+{
+ node->ExecProcNodeBatchReal = function;
+ node->ExecProcNodeBatch = ExecProcNodeBatchFirst;
+}
+
+/*
+ * ExecProcNodeBatchFirst
+ * One-time stack-depth check; then pick instrument/no-instrument wrapper.
+ */
+static TupleBatch *
+ExecProcNodeBatchFirst(PlanState *node)
+{
+ check_stack_depth();
+
+ if (node->instrument)
+ node->ExecProcNodeBatch = ExecProcNodeBatchInstr;
+ else
+ node->ExecProcNodeBatch = node->ExecProcNodeBatchReal;
+
+ return node->ExecProcNodeBatch(node);
+}
+
+/*
+ * ExecProcNodeBatchInstr
+ * Instrumentation wrapper for batch calls.
+ *
+ * Note: we record the batch's nvalid as the "tuple" count for this call. That
+ * keeps instrumentation meaningful without changing the Instr API.
+ */
+static TupleBatch *
+ExecProcNodeBatchInstr(PlanState *node)
+{
+ TupleBatch *b;
+
+ InstrStartNode(node->instrument);
+
+ b = node->ExecProcNodeBatchReal(node);
+
+ InstrStopNode(node->instrument, b ? (double) b->nvalid : 0.0);
+
+ return b;
+}
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 2552d420f1c..a4cf1e51af0 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -334,6 +334,37 @@ ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
pstate->qual, pstate->ps_ProjInfo);
}
+static TupleBatch *
+ExecSeqScanBatch(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatch(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatch,
+ NULL, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatch() for when qual evaluation is required.
+ */
+static TupleBatch *
+ExecSeqScanBatchWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatch(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, NULL);
+}
+
/* Batch SeqScan enablement and dispatch */
static void
SeqScanInitBatching(SeqScanState *scanstate, int eflags)
@@ -348,10 +379,12 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
if (scanstate->ss.ps.ps_ProjInfo == NULL)
{
+ scanstate->ss.ps.ExecProcNodeBatch = ExecSeqScanBatch;
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
}
else
{
+ scanstate->ss.ps.ExecProcNodeBatch = NULL;
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
}
}
@@ -359,10 +392,12 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
if (scanstate->ss.ps.ps_ProjInfo == NULL)
{
+ scanstate->ss.ps.ExecProcNodeBatch = ExecSeqScanBatchWithQual;
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
}
else
{
+ scanstate->ss.ps.ExecProcNodeBatch = NULL;
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
}
}
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index fec606471c8..fb4b57a831c 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -297,4 +297,55 @@ ExecScanExtendedBatchSlot(ScanState *node,
}
}
+static inline TupleBatch *
+ExecScanExtendedBatch(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+ int qualified;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next batch from the AM */
+ if (!accessBatchMtd(node))
+ return NULL;
+
+ if (qual != NULL)
+ {
+ qualified = 0;
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *in = TupleBatchGetNextSlot(b);
+
+ Assert(in);
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ if (ExecQual(qual, econtext))
+ {
+ TupleBatchStoreInOut(b, qualified, in);
+ qualified++;
+ }
+ else
+ InstrCountFiltered1(node, 1);
+ }
+ TupleBatchUseOutput(b, qualified);
+ }
+ else
+ qualified = b->nvalid;
+
+ if (qualified > 0)
+ return b;
+ /* else get the next batch from the AM */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 17258f7ae2d..cf5b0c7e05c 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -294,6 +294,7 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
*/
extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
+extern void ExecSetExecProcNodeBatch(PlanState *node, ExecProcNodeBatchMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
@@ -315,6 +316,15 @@ ExecProcNode(PlanState *node)
return node->ExecProcNode(node);
}
+
+static inline TupleBatch *
+ExecProcNodeBatch(PlanState *node)
+{
+ if (node->chgParam != NULL) /* something changed? */
+ ExecReScan(node); /* let ReScan handle this */
+
+ return node->ExecProcNodeBatch(node);
+}
#endif
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f4bb8f7dd7f..a104591ac20 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1147,6 +1147,7 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
/* Return a batch; may reuse caller-provided envelope. NULL => end of scan. */
struct TupleBatch;
typedef struct TupleBatch TupleBatch;
+typedef TupleBatch *(*ExecProcNodeBatchMtd)(struct PlanState *ps);
/* ----------------
* PlanState node
@@ -1171,6 +1172,10 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
+ /* Optional batch-producing entry point (NULL => no batching). */
+ ExecProcNodeBatchMtd ExecProcNodeBatch;
+ ExecProcNodeBatchMtd ExecProcNodeBatchReal;
+
Instrumentation *instrument; /* Optional runtime stats for this node */
WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
--
2.43.0
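The qual handling in ExecScanExtendedBatch() above walks the input slots and records each qualifying one at the next free output position, so the returned batch is dense. A minimal sketch of that compaction step, assuming plain int pointers in place of TupleTableSlot pointers; ToyQual, toy_filter_batch() and toy_is_even() are hypothetical names, not part of the patch:

```c
#include <assert.h>
#include <stdbool.h>

typedef bool (*ToyQual) (int value);

/*
 * Copy pointers to qualifying rows from 'in' to 'out', preserving order.
 * Returns the number of qualifying rows. Row data itself is not copied.
 */
static int
toy_filter_batch(const int **in, int nin, const int **out, ToyQual qual)
{
	int			qualified = 0;

	assert(nin >= 0);

	for (int i = 0; i < nin; i++)
	{
		if (qual(*in[i]))
			out[qualified++] = in[i];
	}

	return qualified;
}

/* Example qual used only for illustration. */
static bool
toy_is_even(int value)
{
	return value % 2 == 0;
}
```

Because only pointers move, the output array can be allocated once at batch creation and reused for every batch, which matches how the sketch treats 'out' as caller-provided storage.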
[application/octet-stream] v2-0002-SeqScan-add-batch-driven-variants-returning-slots.patch (27.2K, 9-v2-0002-SeqScan-add-batch-driven-variants-returning-slots.patch)
From 6a43a40037e4b656739743b3c0abdfb73a8f9b92 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 21:59:56 +0900
Subject: [PATCH v2 2/8] SeqScan: add batch-driven variants returning slots
Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per-tuple AM crossings without
changing the node interface seen by parents.
Add TupleBatch and supporting code in execBatch.c/h to hold
executor-side batching state. PlanState gains ps_Batch to carry the
active TupleBatch when a node supports batching.
Wire up runtime selection in ExecInitSeqScan using
ScanCanUseBatching(). When executor_batching is enabled, EPQ is
inactive, the scan is not backward, and the relation supports
batching, ps.ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used.
Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled and allowed.
Notes / current limits:
- Batching uses EXEC_BATCH_ROWS (currently 64) as the target capacity.
- With the current heapam, batches are composed from a single page, so
the batch may not always be full. Future work may let SeqScan and/or
AMs top up batches across pages when safe to do so.
---
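The idea of fetching in bulk while still handing rows to the parent one at a time can be sketched on its own as below. This is a toy model under assumptions rather than the SeqScan code: ToySource, ToyBatch, toy_fill() and toy_next_row() are hypothetical names, rows are ints, and the nfills counter stands in for the number of table AM crossings.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH_ROWS 4

/* Hypothetical row source producing the values 0 .. limit-1. */
typedef struct ToySource
{
	int			next_value;
	int			limit;
	int			nfills;			/* refill calls, i.e. "AM crossings" */
} ToySource;

/* Hypothetical stand-in for the executor-side batch envelope. */
typedef struct ToyBatch
{
	int			rows[TOY_BATCH_ROWS];
	int			nvalid;			/* rows currently held */
	int			next;			/* next row to hand out */
} ToyBatch;

/* Fetch up to 'maxrows' rows in one call; returns the number fetched. */
static int
toy_fill(ToySource *src, int *rows, int maxrows)
{
	int			n = 0;

	assert(maxrows > 0);

	src->nfills++;
	while (n < maxrows && src->next_value < src->limit)
		rows[n++] = src->next_value++;

	return n;
}

/* Return the next row, refilling the batch when it runs dry; NULL at end. */
static const int *
toy_next_row(ToyBatch *b, ToySource *src)
{
	if (b->next >= b->nvalid)
	{
		b->nvalid = toy_fill(src, b->rows, TOY_BATCH_ROWS);
		b->next = 0;
		if (b->nvalid == 0)
			return NULL;
	}

	return &b->rows[b->next++];
}
```

In this model, reading 10 rows with a capacity of 4 costs four refill calls (three that return rows plus the one that detects the end) instead of eleven row-at-a-time calls, while the caller-facing interface stays one row per call.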
src/backend/access/heap/heapam.c | 29 ++++
src/backend/access/heap/heapam_handler.c | 15 ++
src/backend/access/table/tableam.c | 11 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execBatch.c | 117 ++++++++++++++
src/backend/executor/execScan.c | 31 ++++
src/backend/executor/meson.build | 1 +
src/backend/executor/nodeSeqscan.c | 176 +++++++++++++++++++++-
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_parameters.dat | 7 +
src/include/access/heapam.h | 1 +
src/include/access/tableam.h | 27 ++++
src/include/executor/execBatch.h | 102 +++++++++++++
src/include/executor/execScan.h | 54 +++++++
src/include/executor/executor.h | 4 +
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 8 +
17 files changed, 587 insertions(+), 1 deletion(-)
create mode 100644 src/backend/executor/execBatch.c
create mode 100644 src/include/executor/execBatch.h
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f62f7edbf5e..9fd7948482d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1570,6 +1570,35 @@ heap_begin_batch(TableScanDesc sscan, int maxitems)
return hb;
}
+/*
+ * heap_materialize_batch_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ Assert(n <= hb->nitems);
+
+ for (int i = 0; i < n; i++)
+ {
+ HeapTupleData *tuple = &hb->tupdata[i];
+ HeapTupleTableSlot *slot = (HeapTupleTableSlot *) slots[i];
+
+ /* Inline of ExecStoreHeapTuple(tuple, slot, false) */
+ slot->tuple = tuple;
+ slot->off = 0;
+ slot->base.tts_nvalid = 0;
+ slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
+ slot->base.tts_tid = tuple->t_self;
+ slot->base.tts_tableOid = tuple->t_tableOid;
+ }
+}
+
/*
 * heap_end_batch
*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ec4eeccf19c..8e88cc9e8f1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -72,6 +72,20 @@ heapam_slot_callbacks(Relation relation)
return &TTSOpsBufferHeapTuple;
}
+/* ------------------------------------------------------------------------
+ * TupleBatch related callbacks for heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static const TupleBatchOps TupleBatchHeapOps = {
+ .materialize_all = heap_materialize_batch_all
+};
+
+static const TupleBatchOps *
+heapam_batch_callbacks(Relation relation)
+{
+ return &TupleBatchHeapOps;
+}
/* ------------------------------------------------------------------------
* Index Scan Callbacks for heap AM
@@ -2617,6 +2631,7 @@ static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
.slot_callbacks = heapam_slot_callbacks,
+ .batch_callbacks = heapam_batch_callbacks,
.scan_begin = heap_beginscan,
.scan_end = heap_endscan,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5e41404937e..5a8ebb8b97c 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -103,6 +103,17 @@ table_slot_create(Relation relation, List **reglist)
return slot;
}
+/* ----------------------------------------------------------------------------
+ * TupleBatch support routines
+ * ----------------------------------------------------------------------------
+ */
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+ if (relation->rd_tableam)
+ return relation->rd_tableam->batch_callbacks(relation);
+ elog(ERROR, "relation does not support TupleBatch operations");
+}
/* ----------------------------------------------------------------------------
* Table scan functions.
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..3e72f3fe03c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
execAsync.o \
+ execBatch.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
new file mode 100644
index 00000000000..007ae535687
--- /dev/null
+++ b/src/backend/executor/execBatch.c
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.c
+ * Helpers for TupleBatch
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "executor/execBatch.h"
+
+/*
+ * TupleBatchCreate
+ * Allocate and initialize a new TupleBatch envelope.
+ */
+TupleBatch *
+TupleBatchCreate(TupleDesc scandesc, int capacity)
+{
+ TupleBatch *b;
+ TupleTableSlot **inslots,
+ **outslots;
+
+ inslots = palloc(sizeof(TupleTableSlot *) * capacity);
+ outslots = palloc(sizeof(TupleTableSlot *) * capacity);
+ for (int i = 0; i < capacity; i++)
+ inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);
+
+ b = (TupleBatch *) palloc(sizeof(TupleBatch));
+
+ /* Initial state: empty envelope */
+ b->am_payload = NULL;
+ b->ntuples = 0;
+ b->inslots = inslots;
+ b->outslots = outslots;
+ b->activeslots = NULL;
+ b->maxslots = capacity;
+
+ b->nvalid = 0;
+ b->next = 0;
+
+ return b;
+}
+
+/*
+ * TupleBatchReset
+ * Reset an existing TupleBatch envelope to empty.
+ */
+void
+TupleBatchReset(TupleBatch *b, bool drop_slots)
+{
+ if (b == NULL)
+ return;
+
+ for (int i = 0; i < b->maxslots; i++)
+ {
+ ExecClearTuple(b->inslots[i]);
+ if (drop_slots)
+ ExecDropSingleTupleTableSlot(b->inslots[i]);
+ }
+
+ if (drop_slots)
+ {
+ pfree(b->inslots);
+ pfree(b->outslots);
+ b->inslots = b->outslots = NULL;
+ }
+
+ b->ntuples = 0;
+ b->nvalid = 0;
+ b->next = 0;
+ b->activeslots = NULL;
+}
+
+void
+TupleBatchUseInput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->inslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+void
+TupleBatchUseOutput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->outslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+bool
+TupleBatchIsValid(TupleBatch *b)
+{
+ return b != NULL &&
+ b->maxslots > 0 &&
+ b->inslots != NULL &&
+ b->outslots != NULL;
+}
+
+void
+TupleBatchRewind(TupleBatch *b)
+{
+ b->next = 0;
+}
+
+int
+TupleBatchGetNumValid(TupleBatch *b)
+{
+ return b->nvalid;
+}
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 90726949a87..f24c5d73ae1 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -18,6 +18,7 @@
*/
#include "postgres.h"
+#include "access/tableam.h"
#include "executor/executor.h"
#include "executor/execScan.h"
#include "miscadmin.h"
@@ -154,3 +155,33 @@ ExecScanReScan(ScanState *node)
}
}
}
+
+bool
+ScanCanUseBatching(ScanState *scanstate, int eflags)
+{
+ Relation relation = scanstate->ss_currentRelation;
+
+ return executor_batching &&
+ (scanstate->ps.state->es_epq_active == NULL) &&
+ !(eflags & EXEC_FLAG_BACKWARD) &&
+ relation && table_supports_batching(relation);
+}
+
+void
+ScanResetBatching(ScanState *scanstate, bool drop)
+{
+ TupleBatch *b = scanstate->ps.ps_Batch;
+
+ if (b)
+ {
+ TupleBatchReset(b, drop);
+ if (b->am_payload)
+ {
+ table_scan_end_batch(scanstate->ss_currentScanDesc,
+ b->am_payload);
+ b->am_payload = NULL;
+ }
+ if (drop)
+ pfree(b);
+ }
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..40ffc28f3cb 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'execAmi.c',
'execAsync.c',
+ 'execBatch.c',
'execCurrent.c',
'execExpr.c',
'execExprInterp.c',
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 94047d29430..2552d420f1c 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -203,6 +203,171 @@ ExecSeqScanEPQ(PlanState *pstate)
(ExecScanRecheckMtd) SeqRecheck);
}
+/* ----------------------------------------------------------------
+ * Batch Support
+ * ----------------------------------------------------------------
+ */
+static inline bool
+SeqNextBatch(SeqScanState *node)
+{
+ TableScanDesc scandesc;
+ EState *estate;
+ ScanDirection direction;
+
+ Assert(node->ss.ps.ps_Batch != NULL);
+
+ /*
+ * get information from the estate and scan state
+ */
+ scandesc = node->ss.ss_currentScanDesc;
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ Assert(direction == ForwardScanDirection);
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the scan is not parallel, or if we're serially
+ * executing a scan that was planned to be parallel.
+ */
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ node->ss.ss_currentScanDesc = scandesc;
+ }
+
+ /* Lazily create the AM batch payload. */
+ if (node->ss.ps.ps_Batch->am_payload == NULL)
+ {
+ const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+ Assert(tam && tam->scan_begin_batch);
+ node->ss.ps.ps_Batch->am_payload =
+ table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+ node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+ }
+
+ node->ss.ps.ps_Batch->ntuples =
+ table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
+ node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
+ node->ss.ps.ps_Batch->materialized = false;
+
+ return node->ss.ps.ps_Batch->ntuples > 0;
+}
+
+static inline bool
+SeqNextBatchMaterialize(SeqScanState *node)
+{
+ if (SeqNextBatch(node))
+ {
+ TupleBatchMaterializeAll(node->ss.ps.ps_Batch);
+ return true;
+ }
+
+ return false;
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtendedBatchSlot().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() used when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() used when both qual evaluation and
+ * projection are required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, pstate->ps_ProjInfo);
+}
+
+/* Batch SeqScan enablement and dispatch */
+static void
+SeqScanInitBatching(SeqScanState *scanstate, int eflags)
+{
+ const int cap = EXEC_BATCH_ROWS;
+ TupleDesc scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+
+ scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+
+ /* Choose the batch variant that matches the existing qual/projection specialization matrix */
+ if (scanstate->ss.ps.qual == NULL)
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+ }
+ }
+ else
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecInitSeqScan
* ----------------------------------------------------------------
@@ -211,6 +376,7 @@ SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
SeqScanState *scanstate;
+ bool use_batching;
/*
* Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -241,9 +407,12 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
node->scan.scanrelid,
eflags);
+ use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
+
/* and create slot with the appropriate rowtype */
ExecInitScanTupleSlot(estate, &scanstate->ss,
RelationGetDescr(scanstate->ss.ss_currentRelation),
+ use_batching ? &TTSOpsHeapTuple :
table_slot_callbacks(scanstate->ss.ss_currentRelation));
/*
@@ -280,6 +449,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
+ if (use_batching)
+ SeqScanInitBatching(scanstate, eflags);
+
return scanstate;
}
@@ -299,6 +471,8 @@ ExecEndSeqScan(SeqScanState *node)
*/
scanDesc = node->ss.ss_currentScanDesc;
+ ScanResetBatching(&node->ss, true);
+
/*
* close heap scan
*/
@@ -327,7 +501,7 @@ ExecReScanSeqScan(SeqScanState *node)
if (scan != NULL)
table_rescan(scan, /* scan desc */
NULL); /* new scan keys */
-
+ ScanResetBatching(&node->ss, false);
ExecScanReScan((ScanState *) node);
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..b4a0996a717 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int notify_buffers = 16;
int serializable_buffers = 32;
int subtransaction_buffers = 0;
int transaction_buffers = 0;
+
+/* executor batching */
+bool executor_batching = false;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 6bc6be13d2a..c9fbb7ffef9 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -880,6 +880,13 @@
boot_val => 'true',
},
+{ name => 'executor_batching', type => 'bool', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+ short_desc => 'Use tuple batching during execution.',
+ flags => 'GUC_NOT_IN_SAMPLE',
+ variable => 'executor_batching',
+ boot_val => 'true',
+},
+
{ name => 'data_sync_retry', type => 'bool', context => 'PGC_POSTMASTER', group => 'ERROR_HANDLING_OPTIONS',
short_desc => 'Whether to continue running after a failure to sync data files.',
variable => 'data_sync_retry',
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 02f7793fba0..13ce6166ec3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -314,6 +314,7 @@ extern bool heap_getnextslot(TableScanDesc sscan,
extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+extern void heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n);
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 953207eac50..05f828b9762 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "commands/vacuum.h"
+#include "executor/execBatch.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -39,6 +40,7 @@ typedef struct BulkInsertStateData BulkInsertStateData;
typedef struct IndexInfo IndexInfo;
typedef struct SampleScanState SampleScanState;
typedef struct ValidateIndexState ValidateIndexState;
+typedef struct TupleBatchOps TupleBatchOps;
/*
* Bitmask values for the flags argument to the scan_begin callback.
@@ -301,6 +303,7 @@ typedef struct TableAmRoutine
* Return slot implementation suitable for storing a tuple of this AM.
*/
const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
+ const TupleBatchOps *(*batch_callbacks)(Relation rel);
/* ------------------------------------------------------------------------
@@ -361,6 +364,7 @@ typedef struct TableAmRoutine
ScanDirection dir);
void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -872,6 +876,16 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
*/
extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
+/* ----------------------------------------------------------------------------
+ * TupleBatch functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Returns callbacks for manipulating TupleBatch for tuples of the given
+ * relation.
+ */
+extern const TupleBatchOps *table_batch_callbacks(Relation relation);
/* ----------------------------------------------------------------------------
* Table scan functions.
@@ -1046,6 +1060,18 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_supports_batching
+ * Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation)
+{
+ const TableAmRoutine *tam = relation->rd_tableam;
+
+ return tam->scan_getnextbatch != NULL;
+}
+
/*
* table_scan_begin_batch
* Allocate AM-owned batch payload with capacity 'maxitems'.
@@ -2116,5 +2142,6 @@ extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
*/
extern const TableAmRoutine *GetHeapamTableAmRoutine(void);
+extern struct TupleBatchOps *GetHeapamTupleBatchOps(void);
#endif /* TABLEAM_H */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
new file mode 100644
index 00000000000..6f1a38d14bd
--- /dev/null
+++ b/src/include/executor/execBatch.h
@@ -0,0 +1,102 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.h
+ * Executor batch envelope for passing tuple batch state upward
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECBATCH_H
+#define EXECBATCH_H
+
+#include "executor/tuptable.h"
+
+/* XXX fixed 64 for PoC */
+#define EXEC_BATCH_ROWS 64
+
+/*
+ * TupleBatchOps -- AM-specific helpers for lazy materialization.
+ */
+typedef struct TupleBatchOps
+{
+ void (*materialize_all)(void *am_payload,
+ TupleTableSlot **dst,
+ int maxslots);
+} TupleBatchOps;
+
+/*
+ * TupleBatch
+ *
+ * Envelope for a batch of tuples produced by a plan node (e.g., SeqScan) per
+ * call to a batch variant of ExecSeqScan().
+ */
+typedef struct TupleBatch
+{
+ void *am_payload;
+ const TupleBatchOps *ops;
+ int ntuples; /* number of tuples in am_payload */
+ bool materialized; /* tuples in slots valid? */
+ struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
+ struct TupleTableSlot **outslots; /* slots for tuples going "out of"
+ * batch */
+ struct TupleTableSlot **activeslots;
+ int maxslots;
+
+ int nvalid; /* number of returnable tuples in outslots */
+ int next; /* 0-based index of next tuple to be returned */
+} TupleBatch;
+
+
+/* Helpers */
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
+extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
+extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
+extern bool TupleBatchIsValid(TupleBatch *b);
+extern void TupleBatchRewind(TupleBatch *b);
+extern int TupleBatchGetNumValid(TupleBatch *b);
+
+static inline TupleTableSlot *
+TupleBatchGetNextSlot(TupleBatch *b)
+{
+ return b->next < b->nvalid ? b->activeslots[b->next++] : NULL;
+}
+
+static inline TupleTableSlot *
+TupleBatchGetSlot(TupleBatch *b, int index)
+{
+ Assert(index < b->nvalid);
+ return b->activeslots[index];
+}
+
+static inline void
+TupleBatchStoreInOut(TupleBatch *b, int index, TupleTableSlot *out)
+{
+ Assert(TupleBatchIsValid(b));
+ b->outslots[index] = out;
+}
+
+static inline bool
+TupleBatchHasMore(TupleBatch *b)
+{
+ return b->activeslots && b->next < b->nvalid;
+}
+
+static inline void
+TupleBatchMaterializeAll(TupleBatch *b)
+{
+ if (b->materialized)
+ return;
+
+ if (b->ops == NULL || b->ops->materialize_all == NULL)
+ elog(ERROR, "TupleBatch has no slots and no materialize_all op");
+
+ b->ops->materialize_all(b->am_payload, b->inslots, b->ntuples);
+ TupleBatchUseInput(b, b->ntuples);
+}
+
+#endif /* EXECBATCH_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index 837ea7785bb..fec606471c8 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -243,4 +243,58 @@ ExecScanExtended(ScanState *node,
}
}
+static inline TupleTableSlot *
+ExecScanExtendedBatchSlot(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ TupleTableSlot *in;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next input slot from current batch, or refill */
+ if (!TupleBatchHasMore(b))
+ {
+ if (!accessBatchMtd(node))
+ return NULL;
+ }
+
+ in = TupleBatchGetNextSlot(b);
+ Assert(in);
+
+ /* No qual, no projection: direct return */
+ if (qual == NULL && projInfo == NULL)
+ return in;
+
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ /* Qual only */
+ if (projInfo == NULL)
+ {
+ if (qual == NULL || ExecQual(qual, econtext))
+ return in;
+ else
+ InstrCountFiltered1(node, 1);
+ continue;
+ }
+
+ /* Projection (with or without qual) */
+ if (qual == NULL || ExecQual(qual, econtext))
+ return ExecProject(projInfo);
+ else
+ InstrCountFiltered1(node, 1);
+ /* else try next tuple */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 3248e78cd28..17258f7ae2d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -575,12 +575,16 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
*/
typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanAccessBatchMtd)(ScanState *node);
extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
ExecScanRecheckMtd recheckMtd);
+
extern void ExecAssignScanProjectionInfo(ScanState *node);
extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
extern void ExecScanReScan(ScanState *node);
+extern bool ScanCanUseBatching(ScanState *scanstate, int eflags);
+extern void ScanResetBatching(ScanState *scanstate, bool drop);
/*
* prototypes from functions in execTuples.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..b8e7afda57c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -287,6 +287,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
extern PGDLLIMPORT int VacuumCostBalance;
extern PGDLLIMPORT bool VacuumCostActive;
+extern PGDLLIMPORT bool executor_batching;
/* in utils/misc/stack_depth.c */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a36653c37f9..f4bb8f7dd7f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -30,6 +30,7 @@
#define EXECNODES_H
#include "access/tupconvert.h"
+#include "executor/execBatch.h"
#include "executor/instrument.h"
#include "fmgr.h"
#include "lib/ilist.h"
@@ -1143,6 +1144,10 @@ typedef struct JsonExprState
*/
typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+/* Return a batch; may reuse caller-provided envelope. NULL => end of scan. */
+struct TupleBatch;
+typedef struct TupleBatch TupleBatch;
+
/* ----------------
* PlanState node
*
@@ -1198,6 +1203,9 @@ typedef struct PlanState
ExprContext *ps_ExprContext; /* node's expression-evaluation context */
ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */
+ /* Batching state if node supports it. */
+ TupleBatch *ps_Batch;
+
bool async_capable; /* true if node is async-capable */
/*
--
2.43.0
* Re: Batching in executor
@ 2025-09-30 02:15 Amit Langote <[email protected]>
parent: Bruce Momjian <[email protected]>
0 siblings, 0 replies; 29+ messages in thread
From: Amit Langote @ 2025-09-30 02:15 UTC (permalink / raw)
To: Bruce Momjian <[email protected]>; +Cc: pgsql-hackers
Hi Bruce,
On Fri, Sep 26, 2025 at 10:49 PM Bruce Momjian <[email protected]> wrote:
> On Fri, Sep 26, 2025 at 10:28:33PM +0900, Amit Langote wrote:
> > At PGConf.dev this year we had an unconference session [1] on whether
> > the community can support an additional batch executor. The discussion
> > there led me to start hacking on $subject. I have also had off-list
> > discussions on this topic in recent months with Andres and David, who
> > have offered useful thoughts.
> >
> > This patch series is an early attempt to make executor nodes pass
> > around batches of tuples instead of tuple-at-a-time slots. The main
> > motivation is to enable expression evaluation in batch form, which can
> > substantially reduce per-tuple overhead (mainly from function calls)
> > and open the door to further optimizations such as SIMD usage in
> > aggregate transition functions. We could even change algorithms of
> > some plan nodes to operate on batches when, for example, a child node
> > can return batches.
>
> For background, people might want to watch these two videos from POSETTE
> 2025. The first video explains how data warehouse query needs are
> different from OLTP needs:
>
> Building a PostgreSQL data warehouse
> https://www.youtube.com/watch?v=tpq4nfEoioE
>
> and the second one explains the executor optimizations done in PG 18:
>
> Hacking Postgres Executor For Performance
> https://www.youtube.com/watch?v=D3Ye9UlcR5Y
>
> I learned from these two videos that to handle new workloads, I need to
> think of the query demands differently, and of course can this be
> accomplished without hampering OLTP workloads?
Thanks for pointing to those talks -- I gave the second one. :-)
Yes, the idea here is to introduce batching without adding much
overhead or new code into the OLTP path.
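
One way to picture this: the scan fills a batch in bulk, but callers keep pulling one row at a time, so nodes above the scan need no changes. Below is a minimal, self-contained sketch of that pattern. It is not code from the patch: DemoBatch, DemoScan, demo_refill() and demo_next_row() are hypothetical stand-ins that only mirror the roles of TupleBatch, the access method, and TupleBatchGetNextSlot(), and the batch size of 4 is arbitrary (the patch uses EXEC_BATCH_ROWS = 64).

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BATCH_ROWS 4		/* hypothetical; the patch uses 64 */

/* Hypothetical stand-in for TupleBatch: rows plus a read cursor. */
typedef struct DemoBatch
{
	int			rows[DEMO_BATCH_ROWS];
	int			nvalid;			/* number of returnable rows */
	int			next;			/* index of next row to return */
} DemoBatch;

/* Hypothetical scan state over a "table" holding 0 .. total-1. */
typedef struct DemoScan
{
	int			next_value;		/* next row the table will produce */
	int			total;			/* number of rows in the table */
	int			refills;		/* how many non-empty batches were fetched */
	DemoBatch	batch;
} DemoScan;

/* Fetch up to DEMO_BATCH_ROWS rows in one call; returns 0 at end of scan. */
static int
demo_refill(DemoScan *scan)
{
	DemoBatch  *b = &scan->batch;

	b->nvalid = 0;
	b->next = 0;
	while (b->nvalid < DEMO_BATCH_ROWS && scan->next_value < scan->total)
		b->rows[b->nvalid++] = scan->next_value++;
	if (b->nvalid > 0)
		scan->refills++;
	return b->nvalid > 0;
}

/* Row-at-a-time interface: drain the current batch, refill when empty. */
static const int *
demo_next_row(DemoScan *scan)
{
	DemoBatch  *b = &scan->batch;

	if (b->next >= b->nvalid && !demo_refill(scan))
		return NULL;
	return &b->rows[b->next++];
}

/* Consumer that is unaware of batching: sums every row of the table. */
static int
demo_sum_all(int total, int *refills_out)
{
	DemoScan	scan = {0};
	const int  *row;
	int			sum = 0;

	assert(total >= 0);
	scan.total = total;
	while ((row = demo_next_row(&scan)) != NULL)
		sum += *row;
	if (refills_out != NULL)
		*refills_out = scan.refills;
	return sum;
}
```

With 10 rows and a batch size of 4, the consumer still sees 10 individual rows, while the underlying fetch routine returns rows only 3 times.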
--
Thanks, Amit Langote
* Re: Batching in executor
@ 2025-09-30 13:35 Amit Langote <[email protected]>
parent: Amit Langote <[email protected]>
0 siblings, 0 replies; 29+ messages in thread
From: Amit Langote @ 2025-09-30 13:35 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers
On Tue, Sep 30, 2025 at 11:11 AM Amit Langote <[email protected]> wrote:
> Hi Tomas,
>
> Thanks a lot for your comments and benchmarking.
>
> I plan to reply to your detailed comments and benchmark results
For now, I reran a few benchmarks with the master branch as an
explicit baseline, since Tomas reported possible regressions with
executor_batching=off. I can reproduce that on my side:
5 aggregates, no where:
select avg(a), avg(b), avg(c), avg(d), avg(e) from bar;
parallel_workers=0, jit=off
Rows  master   batching off  batching on  master vs off  master vs on
1M    47.118   48.545        39.531       +3.0%          -16.1%
2M    95.098   97.241        80.189       +2.3%          -15.7%
3M    141.821  148.540       122.005      +4.7%          -14.0%
4M    188.969  197.056       163.779      +4.3%          -13.3%
5M    240.113  245.902       213.645      +2.4%          -11.0%
10M   556.738  564.120       486.359      +1.3%          -12.6%
parallel_workers=2, jit=on
Rows  master   batching off  batching on  master vs off  master vs on
1M    21.147   22.278        20.737       +5.3%          -1.9%
2M    40.319   41.509        37.851       +3.0%          -6.1%
3M    61.582   63.026        55.927       +2.3%          -9.2%
4M    96.363   95.245        78.494       -1.2%          -18.5%
5M    117.226  117.649       97.968       +0.4%          -16.4%
10M   245.503  246.896       196.335      +0.6%          -20.0%
1 aggregate, no where:
select count(*) from bar;
parallel_workers=0, jit=off
Rows  master   batching off  batching on  master vs off  master vs on
1M    17.071   20.135        6.698        +17.9%         -60.8%
2M    36.905   41.522        15.188       +12.5%         -58.9%
3M    56.094   63.110        23.485       +12.5%         -58.1%
4M    74.299   83.912        32.950       +12.9%         -55.7%
5M    94.229   108.621       41.338       +15.2%         -56.1%
10M   234.425  261.490       117.833      +11.6%         -49.7%
parallel_workers=2, jit=on
Rows  master   batching off  batching on  master vs off  master vs on
1M    8.820    9.832         5.324        +11.5%         -39.6%
2M    16.368   18.001        9.526        +10.0%         -41.8%
3M    24.810   28.193        14.482       +13.6%         -41.6%
4M    34.369   35.741        23.212       +4.0%          -32.5%
5M    41.595   45.103        27.918       +8.4%          -32.9%
10M   99.494   112.226       94.081       +12.8%         -5.4%
The regression with batching off is more noticeable in the
single-aggregate case, where a larger share of the time is spent
scanning.
Looking into it.
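
For readers checking the numbers: the two delta columns are consistent with a relative change against master, i.e. (x - master) / master. The sketch below assumes that formula; it is inferred from the figures, not stated in the message. pct_delta() and print_row() are hypothetical helper names, and the timing unit is left unspecified because the message does not give one.

```c
#include <assert.h>
#include <stdio.h>

/* Relative change of 'other' against 'base', in percent. */
static double
pct_delta(double base, double other)
{
	assert(base != 0.0);
	return (other - base) / base * 100.0;
}

/* Print one result row: the three timings followed by the two deltas. */
static void
print_row(const char *rows, double master, double off, double on)
{
	printf("%-5s %8.3f %8.3f %8.3f %+6.1f%% %+6.1f%%\n",
		   rows, master, off, on,
		   pct_delta(master, off), pct_delta(master, on));
}
```

Calling print_row("1M", 17.071, 20.135, 6.698) gives deltas of +17.9% and -60.8%, matching the first count(*) row with parallel_workers=0.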
--
Thanks, Amit Langote
* Re: Batching in executor
@ 2025-10-10 06:40 Amit Langote <[email protected]>
parent: Tomas Vondra <[email protected]>
3 siblings, 0 replies; 29+ messages in thread
From: Amit Langote @ 2025-10-10 06:40 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers
Hi,
On Mon, Sep 29, 2025 at 8:01 PM Tomas Vondra <[email protected]> wrote:
> I also tried running TPC-H. I don't have useful numbers yet, but I ran
> into a segfault - see the attached backtrace. It only happens with the
> batching, and only on Q22 for some reason. I initially thought it's a
> bug in clang, because I saw it with clang-22 built from git, and not
> with clang-14 or gcc. But since then I reproduced it with clang-19 (on
> debian 13). Still could be a clang bug, of course. I've seen ~20 of
> those segfaults so far, and the backtraces look exactly the same.
I can reproduce the Q22 segfault with clang-17 on macOS and the
attached patch 0009 fixes it.
The issue I observed is that two EEOPs both called the same helper,
and that helper re-peeked ExecExprEvalOp(op) to choose its path. In
this particular build the two EEOP cases in ExecInterpExpr() compiled
to identical code, so their dispatch labels had the same address
(reverse_dispatch_table logging in ExecInitInterpreter() showed the
duplicate). Because ExecEvalStepOp() maps by label address, the
reverse lookup could yield the other EEOP: I saw ExprInit select the
ROWLOOP EEOP while the ExprExec-time helper observed the DIRECT EEOP
and ran code for it, which then crashed.
In 0009 (the fix), I split the helper into two functions, one per
EEOP, so the helper does not re-derive the opcode; with that change I
cannot reproduce the crash on macOS clang-17.
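
The failure mode can be modeled without computed goto. The sketch below is a simplified illustration, not the interpreter code: DemoOp, demo_reverse_lookup() and demo_recovered_op() are hypothetical names, and a linear search stands in for the lookup over reverse_dispatch_table. It only shows that once two opcodes share a handler address, mapping an address back to an opcode can no longer tell them apart.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes standing in for the two batch EEOPs. */
typedef enum DemoOp
{
	DEMO_OP_DIRECT = 0,
	DEMO_OP_ROWLOOP = 1,
	DEMO_OP_COUNT
} DemoOp;

/*
 * Map a handler address back to an opcode.  Returns the first opcode whose
 * handler has that address, or DEMO_OP_COUNT if none matches.
 */
static DemoOp
demo_reverse_lookup(const void *const *table, const void *addr)
{
	for (int i = 0; i < DEMO_OP_COUNT; i++)
	{
		if (table[i] == addr)
			return (DemoOp) i;
	}
	return DEMO_OP_COUNT;
}

/*
 * Build a dispatch table, pick 'chosen' at init time, then recover the
 * opcode from its handler address the way a helper re-deriving it would.
 * With merged != 0, both opcodes share one handler address, modeling a
 * compiler that folded two identical case bodies together.
 */
static DemoOp
demo_recovered_op(int merged, DemoOp chosen)
{
	static const char direct_body = 0;
	static const char rowloop_body = 0;
	const void *table[DEMO_OP_COUNT];

	assert(chosen < DEMO_OP_COUNT);
	table[DEMO_OP_DIRECT] = &direct_body;
	table[DEMO_OP_ROWLOOP] = merged ? (const void *) &direct_body
		: (const void *) &rowloop_body;

	return demo_reverse_lookup(table, table[chosen]);
}
```

With distinct addresses the round trip is exact. With merged addresses, an expression initialized with the ROWLOOP opcode is seen as DIRECT by anything that re-derives the opcode from the address, which is why giving each opcode its own helper avoids the problem.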
--
Thanks, Amit Langote
Attachments:
[application/octet-stream] v3-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch (21.5K, 2-v3-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch)
From 20a99f908e6dc9499ba927b1321918cff306aca7 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 2 Sep 2025 23:46:34 +0900
Subject: [PATCH v3 6/9] WIP: Add EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP
Introduce a batch EEOP that runs plain aggregate transitions by
looping over rows of a TupleBatch. This keeps transition logic in
the interpreter while amortizing per-row costs.
Gate with AggTransCanUseBatch(): plain, non-hashed, single-set
aggregates with no DISTINCT/ORDER/FILTER, and simple Var args.
Extend ExecBuildAggTrans() to prepare batch fetch/build steps and
to return whether a batch path is used.
---
src/backend/executor/execExpr.c | 228 ++++++++++++++++++++++++--
src/backend/executor/execExprInterp.c | 103 ++++++++++++
src/backend/executor/nodeAgg.c | 17 +-
src/backend/jit/llvm/llvmjit_expr.c | 6 +
src/backend/jit/llvm/llvmjit_types.c | 1 +
src/include/executor/execBatch.h | 6 +
src/include/executor/execExpr.h | 14 ++
src/include/executor/executor.h | 3 +-
src/include/executor/nodeAgg.h | 2 +
9 files changed, 363 insertions(+), 17 deletions(-)
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index f1569879b52..af5ed8b6368 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -95,7 +95,9 @@ static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
int transno, int setno, int setoff, bool ishash,
- bool nullcheck);
+ bool nullcheck, bool batch,
+ BatchVector *bv);
+
static void ExecInitJsonExpr(JsonExpr *jsexpr, ExprState *state,
Datum *resv, bool *resnull,
ExprEvalStep *scratch);
@@ -104,6 +106,10 @@ static void ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
bool exists_coerce,
Datum *resv, bool *resnull);
+static BatchVector *BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var);
+static bool ExprListAllSimpleVars(const List *args, Bitmapset **allattnos);
+static BatchVectorSlice *BatchVectorSliceFromExprArgs(const List *args,
+ const BatchVector *bv);
/*
* ExecInitExpr: prepare an expression tree for execution
@@ -3659,6 +3665,33 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
}
}
+/* plain agg, single set, not hashed, no DISTINCT/ORDER/FILTER */
+static inline bool
+AggTransCanUseBatch(AggState *as, AggStatePerTrans pt)
+{
+ Agg *aggnode = (Agg *) as->ss.ps.plan;
+
+ if (!AggCanUsePlainBatch(as))
+ return false;
+ if (as->aggstrategy == AGG_HASHED)
+ return false;
+ if (aggnode->groupingSets != NIL)
+ return false;
+ if (as->phase == NULL || as->phase->numsets > 0)
+ return false;
+
+ /* per-aggregate complications */
+ if (pt->aggsortrequired)
+ return false;
+ if (pt->aggref &&
+ (pt->aggref->aggdistinct != NIL ||
+ pt->aggref->aggorder != NIL ||
+ pt->aggref->aggfilter != NULL))
+ return false;
+
+ return true;
+}
+
/*
* Build transition/combine function invocations for all aggregate transition
* / combination function invocations in a grouping sets phase. This has to
@@ -3675,13 +3708,17 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
*/
ExprState *
ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
- bool doSort, bool doHash, bool nullcheck)
+ bool doSort, bool doHash, bool nullcheck,
+ bool *batch_trans)
{
ExprState *state = makeNode(ExprState);
PlanState *parent = &aggstate->ss.ps;
ExprEvalStep scratch = {0};
bool isCombine = DO_AGGSPLIT_COMBINE(aggstate->aggsplit);
ExprSetupInfo deform = {0, 0, 0, 0, 0, NIL};
+ bool batch = AggCanUsePlainBatch(aggstate);
+ Bitmapset *allattnos = NULL;
+ BatchVector *bv = NULL;
state->expr = (Expr *) aggstate;
state->parent = parent;
@@ -3707,8 +3744,36 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
&deform);
expr_setup_walker((Node *) pertrans->aggref->aggfilter,
&deform);
+
+ if (!AggTransCanUseBatch(aggstate, pertrans) ||
+ !ExprListAllSimpleVars(pertrans->aggref->args, &allattnos))
+ batch = false;
}
- ExecPushExprSetupSteps(state, &deform);
+
+ if (batch)
+ {
+ if (deform.last_outer > 0)
+ {
+ Assert(!bms_is_empty(allattnos));
+ bv = BatchVectorCreate(allattnos, deform.last_outer);
+
+ /*
+ * Deform all tuples up to last_outer in batch
+ */
+ scratch.opcode = EEOP_OUTER_FETCHSOME_BATCH;
+ scratch.d.fetch_batch.last_var = deform.last_outer;
+ ExprEvalPushStep(state, &scratch);
+
+ /*
+ * Put all arg Vars into vectors once per batch slice
+ */
+ scratch.opcode = EEOP_BUILD_OUTER_BATCH_VECTOR;
+ scratch.d.batch_vector.bv = bv;
+ ExprEvalPushStep(state, &scratch);
+ }
+ }
+ else
+ ExecPushExprSetupSteps(state, &deform);
/*
* Emit instructions for each transition value / grouping set combination.
@@ -3746,7 +3811,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
* Evaluate arguments to aggregate/combine function.
*/
argno = 0;
- if (isCombine)
+ if (isCombine && !batch)
{
/*
* Combining two aggregate transition values. Instead of directly
@@ -3816,7 +3881,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
Assert(pertrans->numInputs == argno);
}
- else if (!pertrans->aggsortrequired)
+ else if (!pertrans->aggsortrequired && !batch)
{
ListCell *arg;
@@ -3849,7 +3914,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
}
Assert(pertrans->numTransInputs == argno);
}
- else if (pertrans->numInputs == 1)
+ else if (pertrans->numInputs == 1 && !batch)
{
/*
* Non-presorted DISTINCT and/or ORDER BY case, with a single
@@ -3868,7 +3933,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
Assert(pertrans->numInputs == argno);
}
- else
+ else if (!batch)
{
/*
* Non-presorted DISTINCT and/or ORDER BY case, with multiple
@@ -3896,7 +3961,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
* just keep the prior transValue. This is true for both plain and
* sorted/distinct aggregates.
*/
- if (trans_fcinfo->flinfo->fn_strict && pertrans->numTransInputs > 0)
+ if (trans_fcinfo->flinfo->fn_strict && pertrans->numTransInputs > 0 && !batch)
{
if (strictnulls)
scratch.opcode = EEOP_AGG_STRICT_INPUT_CHECK_NULLS;
@@ -3914,7 +3979,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
}
/* Handle DISTINCT aggregates which have pre-sorted input */
- if (pertrans->numDistinctCols > 0 && !pertrans->aggsortrequired)
+ if (pertrans->numDistinctCols > 0 && !pertrans->aggsortrequired && !batch)
{
if (pertrans->numDistinctCols > 1)
scratch.opcode = EEOP_AGG_PRESORTED_DISTINCT_MULTI;
@@ -3942,7 +4007,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
pertrans, transno, setno, setoff, false,
- nullcheck);
+ nullcheck, batch, bv);
setoff++;
}
}
@@ -3962,7 +4027,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
{
ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
pertrans, transno, setno, setoff, true,
- nullcheck);
+ nullcheck, false, NULL);
setoff++;
}
}
@@ -4007,6 +4072,9 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
ExecReadyExpr(state);
+ if (batch_trans)
+ *batch_trans = batch;
+
return state;
}
@@ -4020,10 +4088,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
ExprEvalStep *scratch,
FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
int transno, int setno, int setoff, bool ishash,
- bool nullcheck)
+ bool nullcheck, bool batch, BatchVector *bv)
{
ExprContext *aggcontext;
int adjust_jumpnull = -1;
+ BatchVectorSlice *bvs = NULL;
if (ishash)
aggcontext = aggstate->hashcontext;
@@ -4077,7 +4146,13 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
*/
if (!pertrans->aggsortrequired)
{
- if (pertrans->transtypeByVal)
+ if (batch)
+ {
+ if (bv)
+ bvs = BatchVectorSliceFromExprArgs(pertrans->aggref->args, bv);
+ scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+ }
+ else if (pertrans->transtypeByVal)
{
if (fcinfo->flinfo->fn_strict &&
pertrans->initValueIsNull)
@@ -4108,6 +4183,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
scratch->d.agg_trans.setoff = setoff;
scratch->d.agg_trans.transno = transno;
scratch->d.agg_trans.aggcontext = aggcontext;
+ scratch->d.agg_trans.bvs = bvs;
ExprEvalPushStep(state, scratch);
/* fix up jumpnull */
@@ -5070,3 +5146,129 @@ ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
DomainHasConstraints(returning->typid);
ExprEvalPushStep(state, &scratch);
}
+
+/* Is expr a Var node for a non-system attribute? */
+static bool
+expr_is_simple_var(Expr *expr, AttrNumber *out_attno)
+{
+ if (expr == NULL)
+ return false;
+
+ if (IsA(expr, TargetEntry))
+ return expr_is_simple_var((Expr *) ((TargetEntry *) expr)->expr,
+ out_attno);
+ if (IsA(expr, RelabelType))
+ return expr_is_simple_var((Expr *) ((RelabelType *) expr)->arg,
+ out_attno);
+
+ if (IsA(expr, Var) && ((Var *) expr)->varattno > 0)
+ {
+ *out_attno = ((Var *) expr)->varattno;
+ return true;
+ }
+
+ return false;
+}
+
+/* Are all inputs plain Vars (optionally allow RelabelType->Var)? Collect attnos. */
+static bool
+ExprListAllSimpleVars(const List *args, Bitmapset **allattnos)
+{
+ ListCell *lc;
+
+ foreach(lc, args)
+ {
+ TargetEntry *tle = lfirst_node(TargetEntry, lc);
+ Expr *arg = tle->expr;
+ AttrNumber attno;
+
+ if (!expr_is_simple_var(arg, &attno))
+ return false;
+
+ if (!IsA(arg, Var))
+ return false;
+
+ Assert(attno > 0);
+ *allattnos = bms_add_member(*allattnos, attno);
+ }
+
+ return true;
+}
+
+/* ---------- BatchVector stuff ------------- */
+
+static BatchVector *
+BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var)
+{
+ int maxrows = EXEC_BATCH_ROWS;
+ BatchVector *bv;
+ AttrNumber attno;
+ int i;
+
+ bv = palloc(sizeof(BatchVector));
+ bv->ncols = bms_num_members(attnos);
+ bv->maxrows = maxrows;
+ bv->last_var = last_var;
+ bv->attnos = palloc(sizeof(AttrNumber) * bv->ncols);
+ attno = -1;
+ i = 0;
+ while ((attno = bms_next_member(attnos, attno)) > 0)
+ bv->attnos[i++] = attno;
+ bv->cols = palloc(sizeof(Datum *) * bv->ncols);
+ bv->nulls = palloc(sizeof(bool *) * bv->ncols);
+
+ for (i = 0; i < bv->ncols; i++)
+ {
+ bv->cols[i] = palloc(sizeof(Datum) * maxrows);
+ bv->nulls[i] = palloc(sizeof(bool) * maxrows);
+ }
+
+ bv->nrows = 0;
+ bv->hasnull = false;
+
+ return bv;
+}
+
+static int16
+BatchVectorFindAttColno(const BatchVector *bv, AttrNumber attno)
+{
+ for (int i = 0; i < bv->ncols; i++)
+ if (bv->attnos[i] == attno)
+ return i;
+
+ return -1;
+}
+
+/*
+ * BatchVectorSliceFromExprArgs
+ * Build a BatchVectorSlice for a List of args.
+ *
+ * For Var args (possibly under RelabelType), store the col index.
+ * For non-Var args, store -1. Caller can handle Consts, etc.
+ */
+static BatchVectorSlice *
+BatchVectorSliceFromExprArgs(const List *args, const BatchVector *bv)
+{
+ BatchVectorSlice *bvs = palloc(sizeof(BatchVectorSlice));
+ int nargs = list_length(args);
+ int i = 0;
+ ListCell *lc;
+
+ Assert(bv);
+ bvs->bv = bv;
+ bvs->nargs = nargs;
+ bvs->argoffs = (int16 *) palloc(sizeof(int16) * nargs);
+
+ foreach (lc, args)
+ {
+ Expr *arg = (Expr *) lfirst(lc);
+ AttrNumber attno;
+
+ if (expr_is_simple_var(arg, &attno))
+ bvs->argoffs[i++] = BatchVectorFindAttColno(bv, attno);
+ else
+ bvs->argoffs[i++] = -1; /* non-Var */
+ }
+
+ return bvs;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 68629ad7991..3176679b346 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -606,6 +606,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_BUILD_INNER_BATCH_VECTOR,
&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
&&CASE_EEOP_LAST
};
@@ -2336,6 +2337,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP)
+ {
+ /* too complex for an inline implementation */
+ ExecAggPlainTransBatch(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_LAST)
{
/* unreachable */
@@ -6039,3 +6048,97 @@ ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
}
bv->nrows = i;
}
+
+void
+ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
+ AggStatePerGroup pergroup =
+ &aggstate->all_pergroups[op->d.agg_trans.setoff][op->d.agg_trans.transno];
+ BatchVectorSlice *bvs = op->d.agg_trans.bvs;
+ FunctionCallInfo fcinfo = pertrans->transfn_fcinfo;
+ FmgrInfo *finfo = fcinfo->flinfo;
+ Datum newVal;
+ TupleBatch *batch = econtext->outer_batch;
+ int batch_nrows = bvs ? bvs->bv->nrows : batch->nvalid;
+ int start_row = 0;
+
+ if (finfo->fn_strict)
+ {
+ if (pergroup->noTransValue && bvs)
+ {
+ const BatchVector *bv = bvs->bv;
+ bool found = false;
+
+ Assert(bv);
+ for (int i = 0; i < batch_nrows; i++)
+ {
+ for (int j = 0; j < bvs->nargs; j++)
+ {
+ if (!bv->nulls[bvs->argoffs[j]][i])
+ {
+ fcinfo->args[1].value = bv->cols[bvs->argoffs[j]][i];
+ fcinfo->args[1].isnull = false;
+ if (j == bvs->nargs - 1)
+ {
+ found = true;
+ break;
+ }
+ }
+ }
+ if (found)
+ break;
+ }
+ /* If transValue has not yet been initialized, do so now. */
+ ExecAggInitGroup(aggstate, pertrans, pergroup,
+ op->d.agg_trans.aggcontext);
+ start_row = 1;
+ }
+ else if (pergroup->transValueIsNull)
+ return;
+ }
+
+ switch (ExecEvalStepOp(state, op))
+ {
+ case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
+ /* Loop rows, call the original transfn per element using vector cols. */
+ for (int i = start_row; i < batch_nrows; i++)
+ {
+ bool hasnull = false;
+
+ /* Set up fcinfo args 1..m from column vectors at row i. */
+ if (bvs)
+ {
+ const BatchVector *bv = bvs->bv;
+
+ for (int j = 0; j < bvs->nargs; j++)
+ {
+ int16 argoff = bvs->argoffs[j];
+
+ fcinfo->args[j+1].value = bv->cols[argoff][i];
+ fcinfo->args[j+1].isnull = bv->nulls[argoff][i];
+ if (!hasnull && bv->nulls[argoff][i])
+ hasnull = true;
+ }
+ }
+ /* fcinfo->args[0] is the existing transition state */
+ if (finfo->fn_strict && hasnull)
+ continue;
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ newVal = FunctionCallInvoke(fcinfo);
+ if (!pertrans->transtypeByVal &&
+ DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggCopyTransValue(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+ }
+ break;
+ default:
+ elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
+ }
+}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3ace6363509..662d8bef43b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -825,6 +825,16 @@ advance_aggregates_batch(AggState *aggstate, TupleBatch *b)
{
ExprContext *tmpcontext = aggstate->tmpcontext;
ExprState *evaltrans = aggstate->phase->evaltrans;
+ bool batch_trans = aggstate->phase->batch_trans;
+
+ if (batch_trans)
+ {
+ tmpcontext->ecxt_outertuple = TupleBatchGetSlot(b, 0);
+ tmpcontext->outer_batch = b;
+ ExecEvalExprNoReturnSwitchContext(evaltrans, tmpcontext);
+ TupleBatchConsumeAll(b);
+ return;
+ }
while (TupleBatchHasMore(b))
{
@@ -1800,7 +1810,8 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
phase->evaltrans_cache[i][j] = ExecBuildAggTrans(aggstate, phase,
dosort, dohash,
- nullcheck);
+ nullcheck,
+ NULL);
/* change back */
aggstate->ss.ps.outerops = outerops;
@@ -3367,7 +3378,7 @@ hashagg_reset_spill_state(AggState *aggstate)
}
}
-static bool
+bool
AggCanUsePlainBatch(AggState *aggstate)
{
const Agg *aggnode = (const Agg *) aggstate->ss.ps.plan;
@@ -4233,7 +4244,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
Assert(false);
phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
- false);
+ false, &phase->batch_trans);
/* cache compiled expression for outer slot without NULL check */
phase->evaltrans_cache[0][0] = phase->evaltrans;
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 848f0b52d6f..efb3ee639fc 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3026,6 +3026,12 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
+ build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
case EEOP_LAST:
Assert(false);
break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 6bb527c3f6f..1b5e06f60cc 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -186,4 +186,5 @@ void *referenced_functions[] =
ExecBuildInnerBatchVector,
ExecBuildOuterBatchVector,
ExecBuildScanBatchVector,
+ ExecAggPlainTransBatch,
};
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
index 6f1a38d14bd..b50961fc0c9 100644
--- a/src/include/executor/execBatch.h
+++ b/src/include/executor/execBatch.h
@@ -99,4 +99,10 @@ TupleBatchMaterializeAll(TupleBatch *b)
TupleBatchUseInput(b, b->ntuples);
}
+static inline void
+TupleBatchConsumeAll(TupleBatch *b)
+{
+ b->next = b->nvalid;
+}
+
#endif /* EXECBATCH_H */
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 99c86bac702..1d33e084b69 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -302,6 +302,9 @@ typedef enum ExprEvalOp
EEOP_BUILD_OUTER_BATCH_VECTOR,
EEOP_BUILD_SCAN_BATCH_VECTOR,
+ /* Batched aggregate trans evaluation */
+ EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP, /* per-row fmgr calls */
+
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
} ExprEvalOp;
@@ -750,6 +753,7 @@ typedef struct ExprEvalStep
/* for EEOP_AGG_PLAIN_TRANS_[INIT_][STRICT_]{BYVAL,BYREF} */
/* for EEOP_AGG_ORDERED_TRANS_{DATUM,TUPLE} */
+ /* for EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP */
struct
{
AggStatePerTrans pertrans;
@@ -757,6 +761,7 @@ typedef struct ExprEvalStep
int setno;
int transno;
int setoff;
+ struct BatchVectorSlice *bvs;
} agg_trans;
/* for EEOP_IS_JSON */
@@ -956,8 +961,17 @@ typedef struct BatchVector
int nrows; /* #rows loaded into cols/nulls */
} BatchVector;
+/* A slice of BatchVector that maps caller args to BatchVector columns. */
+typedef struct BatchVectorSlice
+{
+ const BatchVector *bv;
+ int nargs; /* number of args covered */
+ int16 *argoffs; /* length nargs, -1 for non-Var entries */
+} BatchVectorSlice;
+
extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
#endif /* EXEC_EXPR_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index cf5b0c7e05c..5ba9a523970 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -336,7 +336,8 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
- bool doSort, bool doHash, bool nullcheck);
+ bool doSort, bool doHash, bool nullcheck,
+ bool *batch_trans);
extern ExprState *ExecBuildHash32FromAttrs(TupleDesc desc,
const TupleTableSlotOps *ops,
FmgrInfo *hashfunctions,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 6c4891bbaeb..5c5ebfc73f2 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -289,6 +289,7 @@ typedef struct AggStatePerPhaseData
Sort *sortnode; /* Sort node for input ordering for phase */
ExprState *evaltrans; /* evaluation of transition functions */
+ bool batch_trans; /* true if evaltrans contains batch EEOPs */
/*----------
* Cached variants of the compiled expression.
@@ -338,4 +339,5 @@ extern void ExecAggInitializeDSM(AggState *node, ParallelContext *pcxt);
extern void ExecAggInitializeWorker(AggState *node, ParallelWorkerContext *pwcxt);
extern void ExecAggRetrieveInstrumentation(AggState *node);
+extern bool AggCanUsePlainBatch(AggState *aggstate);
#endif /* NODEAGG_H */
--
2.47.3
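The column-vector layout that the rowloop patch above relies on can be illustrated outside the executor. The following is only a minimal sketch under stated assumptions: ColumnBatchSketch, find_col and rowloop_sum_product are hypothetical names that loosely mirror BatchVector, BatchVectorFindAttColno and the per-row loop in ExecAggPlainTransBatch, and plain int64_t values stand in for Datum. It shows how argument positions are resolved to column offsets once, and how a strict transition skips rows in which any argument is NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NCOLS	2
#define SKETCH_MAXROWS	4

/* Hypothetical stand-in for BatchVector: values stored column by column. */
typedef struct ColumnBatchSketch
{
	int			ncols;			/* number of columns loaded */
	int			nrows;			/* number of rows loaded */
	int16_t		attnos[SKETCH_NCOLS];	/* attribute number of each column */
	int64_t		cols[SKETCH_NCOLS][SKETCH_MAXROWS];
	bool		nulls[SKETCH_NCOLS][SKETCH_MAXROWS];
} ColumnBatchSketch;

/* Map an attribute number to its column index, or -1 if it is not loaded. */
static int16_t
find_col(const ColumnBatchSketch *b, int16_t attno)
{
	for (int i = 0; i < b->ncols; i++)
		if (b->attnos[i] == attno)
			return (int16_t) i;
	return -1;
}

/*
 * Per-row loop for a strict two-argument transition (state += a * b).
 * argoffs[j] is the column index of the j-th argument, resolved up front.
 */
static int64_t
rowloop_sum_product(const ColumnBatchSketch *b, const int16_t *argoffs,
					int64_t state)
{
	for (int i = 0; i < b->nrows; i++)
	{
		int64_t		args[2];
		bool		hasnull = false;

		for (int j = 0; j < 2; j++)
		{
			int16_t		off = argoffs[j];

			assert(off >= 0 && off < b->ncols);
			args[j] = b->cols[off][i];
			if (b->nulls[off][i])
				hasnull = true;
		}
		if (hasnull)
			continue;			/* strict: skip rows with a NULL argument */
		state += args[0] * args[1];
	}
	return state;
}
```

Resolving the offsets once per expression, rather than once per row, is the point of keeping a slice alongside the vector; the per-row work is then reduced to array indexing.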
[application/octet-stream] v3-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch (11.2K, 3-v3-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch)
From 9eea71db3c7bb137e676ad0a27f6256d9c6971f0 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 9 Sep 2025 21:43:29 +0900
Subject: [PATCH v3 7/9] WIP: Add EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT
The new EEOP runs a plain aggregate transition over a TupleBatch with
a single fmgr call. Batch vectors are passed to the transfn via
AggBulkArgs stored in fcinfo->flinfo->fn_extra, avoiding per-row fmgr
overhead.
Gate selection with AggTransfnSupportsBulk(), an allowlist of
built-in transfns updated to accept AggBulkArgs. Some integer
transfns are taught to read AggBulkArgs when present, else fall
back. Rowloop batching remains available; unsupported aggregates keep
the row path.
---
src/backend/executor/execExpr.c | 28 ++++++++++++++++-
src/backend/executor/execExprInterp.c | 43 ++++++++++++++++++++++++++
src/backend/executor/nodeAgg.c | 1 -
src/backend/jit/llvm/llvmjit_expr.c | 1 +
src/backend/utils/adt/int.c | 32 +++++++++++++++++++
src/backend/utils/adt/int8.c | 44 +++++++++++++++++++++++++++
src/backend/utils/adt/numeric.c | 17 +++++++++++
src/include/executor/execExpr.h | 1 +
src/include/executor/executor.h | 20 ++++++++++++
9 files changed, 185 insertions(+), 2 deletions(-)
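The commit message above describes a transition function that consumes a whole batch in a single call when bulk arguments are present and otherwise falls back to the per-row path. The following is a minimal standalone sketch of that dispatch, not the patch's implementation: BulkArgsSketch, sum_step, sum_bulk and sum_trans are hypothetical names, a single int32 column stands in for the Datum vectors, and the bulk arguments are passed explicitly rather than through fn_extra.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for AggBulkArgs: one argument column and its nulls. */
typedef struct BulkArgsSketch
{
	int			nrows;			/* rows in this batch */
	int			start_row;		/* first row still to be consumed */
	const int32_t *values;		/* values[i] = argument at row i */
	const bool *isnull;			/* isnull[i] = true if row i is NULL */
} BulkArgsSketch;

/* Row path: one call per input value. */
static int64_t
sum_step(int64_t state, int32_t value)
{
	return state + value;
}

/* Bulk path: one call consumes the rest of the batch, skipping NULL rows. */
static int64_t
sum_bulk(int64_t state, const BulkArgsSketch *ba)
{
	assert(ba->start_row >= 0 && ba->start_row <= ba->nrows);

	for (int i = ba->start_row; i < ba->nrows; i++)
	{
		if (!ba->isnull[i])
			state = sum_step(state, ba->values[i]);
	}
	return state;
}

/* Dispatch: use the bulk arguments when present, else the scalar argument. */
static int64_t
sum_trans(int64_t state, int32_t value, const BulkArgsSketch *ba)
{
	if (ba != NULL)
		return sum_bulk(state, ba);
	return sum_step(state, value);
}
```

With four rows {1, NULL, 3, 4} and an initial state of 10, one bulk call returns 18, the same result four per-row calls would produce with the NULL row skipped.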
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index af5ed8b6368..27a5780f557 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -47,6 +47,7 @@
#include "utils/acl.h"
#include "utils/array.h"
#include "utils/builtins.h"
+#include "utils/fmgroids.h"
#include "utils/jsonfuncs.h"
#include "utils/jsonpath.h"
#include "utils/lsyscache.h"
@@ -3692,6 +3693,28 @@ AggTransCanUseBatch(AggState *as, AggStatePerTrans pt)
return true;
}
+/* Return true if this transfn OID is known to accept AggBulkArgs. */
+static bool
+AggTransfnSupportsBulk(Oid fn_oid)
+{
+ /* Phase 1: hard-coded allowlist of built-in transfns made bulk-aware. */
+ static const Oid ok[] =
+ {
+ F_INT8INC_ANY, /* COUNT(arg) transfn */
+ F_INT8INC, /* COUNT(*) transfn */
+ F_INT4_SUM, /* SUM(int) transfn */
+ F_INT4SMALLER, /* MIN(int) transfn */
+ F_INT4LARGER, /* MAX(int) transfn */
+ /* add others here as they are made bulk-aware */
+ InvalidOid
+ };
+
+ for (int i = 0; OidIsValid(ok[i]); i++)
+ if (ok[i] == fn_oid)
+ return true;
+ return false;
+}
+
/*
* Build transition/combine function invocations for all aggregate transition
* / combination function invocations in a grouping sets phase. This has to
@@ -4150,7 +4173,10 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
{
if (bv)
bvs = BatchVectorSliceFromExprArgs(pertrans->aggref->args, bv);
- scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+ if (!AggTransfnSupportsBulk(pertrans->transfn_oid))
+ scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+ else
+ scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT;
}
else if (pertrans->transtypeByVal)
{
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3176679b346..41ad9b4838d 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -607,6 +607,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
+ &&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,
&&CASE_EEOP_LAST
};
@@ -2345,6 +2346,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT)
+ {
+ /* too complex for an inline implementation */
+ ExecAggPlainTransBatch(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_LAST)
{
/* unreachable */
@@ -6138,6 +6147,40 @@ ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext
pergroup->transValueIsNull = fcinfo->isnull;
}
break;
+
+ case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
+ {
+ void *save = fcinfo->flinfo->fn_extra;
+ AggBulkArgs ba = {batch_nrows, start_row};
+
+ if (bvs)
+ {
+ const BatchVector *bv = bvs->bv;
+
+ Assert(bv);
+ ba.nargs = bvs->nargs;
+ ba.argoffs = bvs->argoffs;
+ ba.args = bv->cols;
+ ba.isnull = bv->nulls;
+ ba.hasnull = bv->hasnull;
+ }
+ fcinfo->flinfo->fn_extra = &ba;
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+ newVal = FunctionCallInvoke(fcinfo); /* one call for the entire slice */
+ if (!pertrans->transtypeByVal &&
+ DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggCopyTransValue(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+ fcinfo->flinfo->fn_extra = save;
+ }
+ break;
+
default:
elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 662d8bef43b..a2286ef5e54 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2687,7 +2687,6 @@ agg_retrieve_direct_batch(AggState *aggstate)
initialize_aggregates(aggstate, aggstate->pergroups,
Max(aggstate->phase->numsets, 1));
-
if (aggstate->grp_firstTuple)
{
ExecForceStoreHeapTuple(aggstate->grp_firstTuple, firstSlot, true);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index efb3ee639fc..45346124bd7 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3026,6 +3026,7 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
v_state, op, v_econtext);
diff --git a/src/backend/utils/adt/int.c b/src/backend/utils/adt/int.c
index b5781989a64..eb1780b5590 100644
--- a/src/backend/utils/adt/int.c
+++ b/src/backend/utils/adt/int.c
@@ -1363,18 +1363,50 @@ int2smaller(PG_FUNCTION_ARGS)
Datum
int4larger(PG_FUNCTION_ARGS)
{
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
int32 arg1 = PG_GETARG_INT32(0);
int32 arg2 = PG_GETARG_INT32(1);
+ if (unlikely(ba))
+ {
+ int32 result = arg1;
+
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+ if (arg2 > result)
+ result = arg2;
+ }
+ }
+ PG_RETURN_INT32(result);
+ }
PG_RETURN_INT32((arg1 > arg2) ? arg1 : arg2);
}
Datum
int4smaller(PG_FUNCTION_ARGS)
{
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
int32 arg1 = PG_GETARG_INT32(0);
int32 arg2 = PG_GETARG_INT32(1);
+ if (unlikely(ba))
+ {
+ int32 result = arg1;
+
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+ if (arg2 < result)
+ result = arg2;
+ }
+ }
+ PG_RETURN_INT32(result);
+ }
PG_RETURN_INT32((arg1 < arg2) ? arg1 : arg2);
}
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index bdea490202a..bbabf4e0785 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -461,10 +461,28 @@ int8up(PG_FUNCTION_ARGS)
Datum
int8pl(PG_FUNCTION_ARGS)
{
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
int64 arg1 = PG_GETARG_INT64(0);
int64 arg2 = PG_GETARG_INT64(1);
int64 result;
+ if (unlikely(ba))
+ {
+ result = arg1;
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ arg2 = DatumGetInt64(ba->args[ba->argoffs[0]][i]);
+ if (unlikely(pg_add_s64_overflow(arg1, arg2, &result)))
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("bigint out of range")));
+ arg1 = result;
+ }
+ }
+ PG_RETURN_INT64(result);
+ }
if (unlikely(pg_add_s64_overflow(arg1, arg2, &result)))
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
@@ -718,9 +736,35 @@ int8lcm(PG_FUNCTION_ARGS)
Datum
int8inc(PG_FUNCTION_ARGS)
{
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
int64 arg = PG_GETARG_INT64(0);
int64 result;
+ if (unlikely(ba))
+ {
+ result = arg;
+ if (!ba->hasnull || ba->nargs == 0)
+ {
+ if (unlikely(pg_add_s64_overflow(arg, ba->nrows, &result)))
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("bigint out of range")));
+ PG_RETURN_INT64(result);
+ }
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ if (unlikely(pg_add_s64_overflow(arg, 1, &result)))
+ ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("bigint out of range")));
+ arg = result;
+ }
+ }
+ PG_RETURN_INT64(result);
+ }
+
if (unlikely(pg_add_s64_overflow(arg, 1, &result)))
ereport(ERROR,
(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 2501007d981..907c4fddba0 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -6310,6 +6310,23 @@ int4_sum(PG_FUNCTION_ARGS)
{
int64 oldsum;
int64 newval;
+ AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
+
+ if (unlikely(ba))
+ {
+ int64 result = (!PG_ARGISNULL(0) ? PG_GETARG_INT64(0) : 0);
+
+ for (int i = ba->start_row; i < ba->nrows; i++)
+ {
+ if (!ba->isnull[ba->argoffs[0]][i])
+ {
+ int32 arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+
+ result = result + arg2;
+ }
+ }
+ PG_RETURN_INT64(result);
+ }
if (PG_ARGISNULL(0))
{
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 1d33e084b69..f24782ecf58 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -304,6 +304,7 @@ typedef enum ExprEvalOp
/* Batched aggregate trans evaluation */
EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP, /* per-row fmgr calls */
+ EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT, /* call transfn once with AggBulkArgs */
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 5ba9a523970..c72bd755b79 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -561,6 +561,26 @@ ExecQualAndReset(ExprState *state, ExprContext *econtext)
}
#endif
+#ifndef FRONTEND
+/* Per-call bulk argument vectors for batched aggregate trans functions. */
+typedef struct AggBulkArgs
+{
+ int nrows; /* number of rows in this batch */
+ int start_row;
+ int16 *argoffs;
+ int nargs; /* number of argument vectors */
+ Datum **args; /* args[j][i] = j-th arg at row i */
+ bool **isnull; /* isnull[j][i] */
+ bool hasnull; /* is any datum in args NULL? */
+} AggBulkArgs;
+
+static inline AggBulkArgs *
+AggGetBulkArgs(FunctionCallInfo fcinfo)
+{
+ return (AggBulkArgs *) (fcinfo->flinfo ? fcinfo->flinfo->fn_extra : NULL);
+}
+#endif
+
extern bool ExecCheck(ExprState *state, ExprContext *econtext);
/*
--
2.47.3
[application/octet-stream] v3-0008-WIP-Add-ExecQualBatch-and-EEOPs-for-batched-quals.patch (22.7K, 4-v3-0008-WIP-Add-ExecQualBatch-and-EEOPs-for-batched-quals.patch)
From eec61e901c54ec2149f60c0ff8a0b1b3e63f7a0b Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 22 Sep 2025 16:19:26 +0900
Subject: [PATCH v3 8/9] WIP: Add ExecQualBatch() and EEOPs for batched quals
Introduce ExecInitQualBatch()/ExecQualBatch() to evaluate scan quals
over a TupleBatch. The batched qual interpreter produces a boolean
mask aligned with the batch, marking which rows satisfy the qual.
The scan node later uses this mask to copy only passing rows into
its output slots. If batching is not possible, fall back to the
existing per-tuple engine.
Add EEOP_QUAL_BATCH_INITMASK and EEOP_QUAL_BATCH_TERM, and wire them
after EEOP_SCAN_FETCHSOME_BATCH and EEOP_BUILD_SCAN_BATCH_VECTOR.
Batching is limited to quals that are a top-level AND of simple
clauses: either NullTest(var) or strict binary OpExpr with var/const
or var/var arguments. A walker validates the tree, collects the
referenced attnos, and builds a BatchVector; terms are compiled from
the leaves and evaluated to update the mask.
ExprState gains batch_private to hold a BatchQualRuntime (mask,
mask_words), which the parent node uses to populate output slots in
the TupleBatch.
---
src/backend/executor/execExpr.c | 324 ++++++++++++++++++++++++++
src/backend/executor/execExprInterp.c | 198 ++++++++++++++++
src/backend/executor/nodeSeqscan.c | 2 +
src/backend/jit/llvm/llvmjit_expr.c | 11 +
src/backend/jit/llvm/llvmjit_types.c | 2 +
src/include/executor/execExpr.h | 60 +++++
src/include/executor/execScan.h | 35 +--
src/include/executor/executor.h | 3 +
src/include/nodes/execnodes.h | 4 +
9 files changed, 626 insertions(+), 13 deletions(-)
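The mask-based evaluation described in the commit message above can be sketched in isolation. This is a hedged illustration under stated assumptions, not the patch's code: all function names here are hypothetical, a single int32 column stands in for the BatchVector, and the two terms model "col IS NOT NULL AND col > bound" with a strict comparison. The mask starts with one bit set per row, and each AND term clears the bits of the rows it rejects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of 64-bit words needed to hold one bit per row. */
static int
mask_words_for(int nrows)
{
	return (nrows + 63) >> 6;
}

/* Set bits [0..nrows) to 1 and leave every bit beyond nrows cleared. */
static void
mask_set_all(uint64_t *mask, int nwords, int nrows)
{
	assert(nwords >= mask_words_for(nrows));

	for (int w = 0; w < nwords; w++)
	{
		int			nbits = nrows - w * 64; /* rows that fall in word w */

		if (nbits >= 64)
			mask[w] = ~UINT64_C(0);
		else if (nbits > 0)
			mask[w] = ~UINT64_C(0) >> (64 - nbits);
		else
			mask[w] = 0;
	}
}

static void
mask_clear_bit(uint64_t *mask, int row)
{
	mask[row >> 6] &= ~(UINT64_C(1) << (row & 63));
}

/* Term "col IS NOT NULL": clear the bit of every NULL row. */
static void
term_is_not_null(uint64_t *mask, const bool *isnull, int nrows)
{
	for (int i = 0; i < nrows; i++)
		if (isnull[i])
			mask_clear_bit(mask, i);
}

/* Strict term "col > bound": a NULL input or a false result clears the bit. */
static void
term_greater_than(uint64_t *mask, const int32_t *vals, const bool *isnull,
				  int nrows, int32_t bound)
{
	for (int i = 0; i < nrows; i++)
		if (isnull[i] || !(vals[i] > bound))
			mask_clear_bit(mask, i);
}

/* Count the rows whose bit survived every term. */
static int
mask_count_passing(const uint64_t *mask, int nrows)
{
	int			count = 0;

	for (int i = 0; i < nrows; i++)
		if ((mask[i >> 6] >> (i & 63)) & 1)
			count++;
	return count;
}
```

Because every term only clears bits, the terms of a top-level AND can run in any order, and a consumer needs only the final mask to decide which rows to copy out.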
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 27a5780f557..63df560d5f1 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -111,6 +111,19 @@ static BatchVector *BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var);
static bool ExprListAllSimpleVars(const List *args, Bitmapset **allattnos);
static BatchVectorSlice *BatchVectorSliceFromExprArgs(const List *args,
const BatchVector *bv);
+static int16 BatchVectorFindAttColno(const BatchVector *bv, AttrNumber attno);
+static int16 BatchVectorOffsetForVarExpr(Expr *expr, const BatchVector *bv);
+
+/* private context for the walker */
+typedef struct QualBatchContext
+{
+ List *leaves; /* List<Node*> of accepted leaves */
+ Bitmapset *attnos; /* Vars referenced by accepted leaves */
+ bool ok; /* stays true if batchable */
+ AttrNumber last_scan; /* last needed attribute in scan slot */
+} QualBatchContext;
+
+static bool qual_batchable_walker(Node *node, void *context);
/*
* ExecInitExpr: prepare an expression tree for execution
@@ -5221,6 +5234,209 @@ ExprListAllSimpleVars(const List *args, Bitmapset **allattnos)
return true;
}
+/* helper: extract Var (allowing RelabelType->Var); returns NULL if not */
+static Var *
+strip_to_var(Node *n)
+{
+ if (n == NULL)
+ return NULL;
+ if (IsA(n, RelabelType))
+ n = (Node *) ((RelabelType *) n)->arg;
+ if (!IsA(n, Var))
+ return NULL;
+ if (((Var *) n)->varattno <= 0)
+ return NULL;
+ return (Var *) n;
+}
+
+/* main walker; return true to abort traversal early, false to continue */
+static bool
+qual_batchable_walker(Node *node, void *context)
+{
+ QualBatchContext *cxt = (QualBatchContext *) context;
+
+ if (node == NULL || !cxt->ok)
+ return false;
+
+ switch (nodeTag(node))
+ {
+ case T_List:
+ return expression_tree_walker(node, qual_batchable_walker, cxt);
+
+ case T_BoolExpr:
+ {
+ BoolExpr *b = (BoolExpr *) node;
+
+ /* Only AND trees are allowed */
+ if (b->boolop != AND_EXPR)
+ {
+ cxt->ok = false;
+ return true; /* abort */
+ }
+ /* Recurse normally over children */
+ return expression_tree_walker(node, qual_batchable_walker, cxt);
+ }
+
+ case T_NullTest:
+ {
+ NullTest *nt = (NullTest *) node;
+ Var *v = strip_to_var((Node *) nt->arg);
+
+ if (v == NULL)
+ {
+ cxt->ok = false;
+ return true;
+ }
+
+ cxt->attnos = bms_add_member(cxt->attnos, v->varattno);
+ if (v->varattno > cxt->last_scan)
+ cxt->last_scan = v->varattno;
+ cxt->leaves = lappend(cxt->leaves, node);
+
+ /* Do NOT recurse into leaf */
+ return false;
+ }
+
+ case T_OpExpr:
+ {
+ OpExpr *op = (OpExpr *) node;
+ List *args = op->args;
+ Node *l, *r;
+ Var *lv,
+ *rv = NULL;
+
+ /* binary only */
+ if (list_length(args) != 2)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ /* strict operator only (NULL -> false semantics) */
+ if (!func_strict(op->opfuncid))
+ {
+ cxt->ok = false;
+ return true;
+ }
+
+ l = linitial(args);
+ r = lsecond(args);
+ lv = strip_to_var(l);
+ if (lv == NULL)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ cxt->attnos = bms_add_member(cxt->attnos, lv->varattno);
+ if (lv->varattno > cxt->last_scan)
+ cxt->last_scan = lv->varattno;
+
+ if (IsA(r, Const))
+ {
+ /* ok; no attno to add */
+ }
+ else
+ {
+ rv = strip_to_var(r);
+ if (rv == NULL)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ cxt->attnos = bms_add_member(cxt->attnos, rv->varattno);
+ if (rv->varattno > cxt->last_scan)
+ cxt->last_scan = rv->varattno;
+ }
+
+ cxt->leaves = lappend(cxt->leaves, node);
+
+ /* Leaf handled; do NOT recurse into args */
+ return false;
+ }
+
+ /* Whitelist ends here; anything else in the tree rejects */
+ default:
+ cxt->ok = false;
+ break;
+ }
+
+ return true;
+}
+
+/* build a BatchQualTerm from a validated leaf */
+static BatchQualTerm *
+build_term_from_leaf(Node *n, BatchVector *bv)
+{
+ BatchQualTerm *term;
+ BatchQualTermKind kind;
+ bool strict;
+ int16 l_off;
+ int16 r_off;
+ Datum r_const = (Datum) 0;
+ bool r_isnull = false;
+ FmgrInfo *finfo = NULL;
+ Oid collation;
+
+ if (IsA(n, NullTest))
+ {
+ NullTest *nt = (NullTest *) n;
+
+ kind = nt->nulltesttype == IS_NULL ? BQTK_IS_NULL : BQTK_IS_NOT_NULL;
+ l_off = BatchVectorOffsetForVarExpr(nt->arg, bv);
+ r_off = -1;
+ strict = false;
+ collation = InvalidOid;
+
+ if (l_off < 0)
+ return NULL;
+ }
+ else if (IsA(n, OpExpr))
+ {
+ OpExpr *op = (OpExpr *) n;
+ Expr *l = linitial(op->args);
+ Expr *r = lsecond(op->args);
+
+ l_off = BatchVectorOffsetForVarExpr(l, bv);
+ if (l_off < 0)
+ return NULL;
+
+ r_off = BatchVectorOffsetForVarExpr(r, bv);
+ if (IsA(r, Const))
+ {
+ Const *c = (Const *) r;
+
+ kind = BQTK_VAR_CONST;
+ r_const = c->constvalue;
+ r_isnull = c->constisnull;
+ r_off = -1;
+ }
+ else
+ {
+ if (r_off < 0)
+ return NULL;
+ kind = BQTK_VAR_VAR;
+ }
+
+ strict = func_strict(op->opfuncid);
+ collation = exprInputCollation((Node *) op);
+ finfo = palloc(sizeof(FmgrInfo));
+ fmgr_info(op->opfuncid, finfo);
+ }
+ else
+ return NULL;
+
+ term = palloc(sizeof(BatchQualTerm));
+ term->kind = kind;
+ term->strict = strict;
+ term->l_off = l_off;
+ term->r_off = r_off;
+ term->r_const = r_const;
+ term->r_isnull = r_isnull;
+ term->finfo = finfo;
+ term->collation = collation;
+
+ return term;
+}
+
/* ---------- BatchVector stuff ------------- */
static BatchVector *
@@ -5298,3 +5514,111 @@ BatchVectorSliceFromExprArgs(const List *args, const BatchVector *bv)
return bvs;
}
+
+/*
+ * BatchVectorOffsetForVarExpr
+ * Map a Var (or RelabelType->Var) to its BatchVector column index.
+ * Returns -1 if the Var's attno is not present.
+ */
+static int16
+BatchVectorOffsetForVarExpr(Expr *expr, const BatchVector *bv)
+{
+ AttrNumber attno;
+
+ if (!expr_is_simple_var(expr, &attno))
+ return -1;
+
+ return (int16) BatchVectorFindAttColno(bv, attno);
+}
+
+/*
+ * ExecInitQualBatch
+ * Build a batched-qual EEOP program (AND-only).
+ * Caller should also keep scalar ps->qual for runtime fallback.
+ */
+ExprState *
+ExecInitQualBatch(PlanState *ps)
+{
+ Node *qual = (Node *) ps->plan->qual;
+ QualBatchContext cxt = {NIL, NULL, true, 0};
+ BatchQualRuntime *rt;
+ ExprState *state;
+ BatchVector *bv;
+ uint64 *mask;
+ int mask_words;
+ ListCell *lc;
+ ExprEvalStep scratch = {0};
+
+ if (qual == NULL)
+ return NULL;
+
+ /* validate + collect leaves/attnos with walker */
+ (void) qual_batchable_walker(qual, &cxt);
+ if (!cxt.ok || cxt.leaves == NIL || bms_is_empty(cxt.attnos))
+ return NULL;
+
+ bv = BatchVectorCreate(cxt.attnos, cxt.last_scan);
+
+ mask_words = (bv->maxrows + 63) >> 6;
+ mask = (uint64 *) palloc0(sizeof(uint64) * mask_words);
+
+ /* Runtime carrier (lifetime == exprstate) */
+ rt = palloc0(sizeof(BatchQualRuntime));
+ rt->mask = mask;
+ rt->mask_words = mask_words;
+
+ /* dedicated ExprState for batched program */
+
+ state = makeNode(ExprState);
+ state->expr = (Expr *) qual;
+ state->parent = ps;
+ state->ext_params = NULL;
+
+ /* mark expression as to be used with ExecQual() */
+ state->flags = EEO_FLAG_IS_QUAL;
+
+ /* Only valid as batch qual if this is set. */
+ state->batch_private = (void *) rt;
+
+ scratch.opcode = EEOP_SCAN_FETCHSOME_BATCH;
+ scratch.d.fetch_batch.last_var = cxt.last_scan;
+ ExprEvalPushStep(state, &scratch);
+
+ scratch.opcode = EEOP_BUILD_SCAN_BATCH_VECTOR;
+ scratch.d.batch_vector.bv = bv;
+ ExprEvalPushStep(state, &scratch);
+
+ scratch.opcode = EEOP_QUAL_BATCH_INITMASK;
+ scratch.d.qualbatch_init.bv = bv;
+ scratch.d.qualbatch_init.mask = mask;
+ scratch.d.qualbatch_init.mask_words = mask_words;
+ ExprEvalPushStep(state, &scratch);
+
+ /* TERM per leaf */
+ foreach(lc, cxt.leaves)
+ {
+ BatchQualTerm *term = build_term_from_leaf((Node *) lfirst(lc), bv);
+
+ if (term == NULL)
+ return NULL;
+
+ scratch.opcode = EEOP_QUAL_BATCH_TERM;
+ scratch.d.qualbatch_term.bv = bv;
+ scratch.d.qualbatch_term.mask = mask;
+ scratch.d.qualbatch_term.mask_words = mask_words;
+ scratch.d.qualbatch_term.term = term; /* pointer, not a copy */
+ ExprEvalPushStep(state, &scratch);
+ }
+
+ /*
+ * Nothing more to do at the end: each TERM step has already cleared the
+ * mask bits of the rows it rejected, so the mask now marks exactly the
+ * rows that satisfy every term.
+ */
+ scratch.opcode = EEOP_DONE_NO_RETURN;
+ ExprEvalPushStep(state, &scratch);
+
+ ExecReadyExpr(state);
+
+ return state;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 41ad9b4838d..c2b76a5e5db 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -608,6 +608,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,
+ &&CASE_EEOP_QUAL_BATCH_INITMASK,
+ &&CASE_EEOP_QUAL_BATCH_TERM,
&&CASE_EEOP_LAST
};
@@ -2350,7 +2352,19 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
{
/* too complex for an inline implementation */
ExecAggPlainTransBatch(state, op, econtext);
+ EEO_NEXT();
+ }
+
+
+ EEO_CASE(EEOP_QUAL_BATCH_INITMASK)
+ {
+ ExecQualBatchInitMask(state, op, econtext);
+ EEO_NEXT();
+ }
+ EEO_CASE(EEOP_QUAL_BATCH_TERM)
+ {
+ ExecQualBatchTerm(state, op, econtext);
EEO_NEXT();
}
@@ -6185,3 +6199,187 @@ ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext
elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
}
}
+
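+/* set mask bits [0..nvalid_bits) to 1; clear every bit at or above nvalid_bits */
+static inline void
+mask_init_all_ones(uint64 *a, int nwords, int nvalid_bits)
+{
+ int nfull = nvalid_bits >> 6; /* words that are entirely valid */
+ int rem = nvalid_bits & 63; /* valid bits in the partial word */
+
+ for (int i = 0; i < nwords; i++)
+ a[i] = (i < nfull) ? ~UINT64CONST(0) : 0;
+
+ /* rem != 0 implies nfull < nwords, given nvalid_bits <= nwords * 64 */
+ if (rem != 0)
+ a[nfull] = (~UINT64CONST(0)) >> (64 - rem);
+}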
+
+static inline void
+mask_clear_bit(uint64 *a, int i)
+{
+ a[i >> 6] &= ~(UINT64CONST(1) << (i & 63));
+}
+
+void
+ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ BatchVector *bv = op->d.qualbatch_init.bv;
+ uint64 *mask = op->d.qualbatch_init.mask;
+ int nwords = op->d.qualbatch_init.mask_words;
+ int n = bv->nrows;
+
+ /* initialize to all-pass for current batch size */
+ mask_init_all_ones(mask, nwords, n);
+}
+
+void
+ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ BatchVector *bv = op->d.qualbatch_term.bv;
+ uint64 *mask = op->d.qualbatch_term.mask;
+ BatchQualTerm *t = op->d.qualbatch_term.term;
+ int n = bv->nrows;
+
+ switch (t->kind)
+ {
+ case BQTK_IS_NULL:
+ {
+ /* keep bit set only if value IS NULL; clear otherwise */
+ for (int i = 0; i < n; i++)
+ {
+ if (!bv->nulls[t->l_off][i])
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_IS_NOT_NULL:
+ {
+ /* keep bit set only if value IS NOT NULL; clear if NULL */
+ for (int i = 0; i < n; i++)
+ {
+ if (bv->nulls[t->l_off][i])
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_VAR_CONST:
+ {
+ const bool r_isnull = t->r_isnull;
+ const Datum r_const = t->r_const;
+ const bool strict = t->strict;
+ const Oid coll = t->collation;
+ FmgrInfo *finfo = t->finfo;
+ int loff = t->l_off;
+
+ for (int i = 0; i < n; i++)
+ {
+ bool ln = bv->nulls[loff][i];
+ bool pass;
+
+ /* WHERE treats NULL as false; strict ops short-circuit */
+ if (strict && (ln || r_isnull))
+ pass = false;
+ else
+ {
+ Datum lv = bv->cols[loff][i];
+
+ pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, r_const));
+ }
+
+ if (!pass)
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_VAR_VAR:
+ {
+ const bool strict = t->strict;
+ const Oid coll = t->collation;
+ FmgrInfo *finfo = t->finfo;
+ int loff = t->l_off;
+ int roff = t->r_off;
+
+ for (int i = 0; i < n; i++)
+ {
+ bool ln = bv->nulls[loff][i];
+ bool rn = bv->nulls[roff][i];
+ bool pass;
+
+ if (strict && (ln || rn))
+ pass = false;
+ else
+ {
+ Datum lv = bv->cols[loff][i];
+ Datum rv = bv->cols[roff][i];
+
+ pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, rv));
+ }
+
+ if (!pass)
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ default:
+ /* should not happen; leave mask unchanged */
+ break;
+ }
+}
+
+static inline bool
+mask_is_empty(const uint64 *mask, int nwords)
+{
+ for (int i = 0; i < nwords; i++)
+ {
+ if (mask[i] != 0)
+ return false;
+ }
+ return true;
+}
+
+/*
+ * ExecQualBatch
+ * Evaluate a batch qual built by ExecInitQualBatch() for all rows in 'b'.
+ *
+ * Returns the number of rows that passed; they are stored in b's outslots.
+ */
+int
+ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b)
+{
+ int i;
+ uint64 *mask;
+ int kept = 0;
+ BatchQualRuntime *rt = ExecGetBatchQualRuntime(state);
+
+ /* verify that expression was compiled using ExecInitQualBatch */
+ Assert(state->flags & EEO_FLAG_IS_QUAL);
+ Assert(rt && rt->mask && rt->mask_words);
+
+ /* run the batched EEOP program once */
+ econtext->scan_batch = b;
+ ExecEvalExprNoReturn(state, econtext);
+
+ mask = rt->mask;
+ if (mask_is_empty(mask, rt->mask_words))
+ return 0;
+
+ /* Add survivors into outslots */
+ TupleBatchRewind(b);
+ i = 0;
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ /* mask bit set => row survives */
+ if (mask[i >> 6] & (UINT64CONST(1) << (i & 63)))
+ TupleBatchStoreInOut(b, kept++, slot);
+ i++;
+ }
+
+ return kept;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index a4cf1e51af0..e5ca619731f 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -401,6 +401,8 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
}
}
+
+ scanstate->ss.ps.qual_batch = ExecInitQualBatch((PlanState *) scanstate);
}
/* ----------------------------------------------------------------
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 45346124bd7..b97d5faebde 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3033,6 +3033,17 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_QUAL_BATCH_INITMASK:
+ build_EvalXFunc(b, mod, "ExecQualBatchInitMask",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ case EEOP_QUAL_BATCH_TERM:
+ build_EvalXFunc(b, mod, "ExecQualBatchTerm",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
case EEOP_LAST:
Assert(false);
break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 1b5e06f60cc..f4f756e7cb5 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -187,4 +187,6 @@ void *referenced_functions[] =
ExecBuildOuterBatchVector,
ExecBuildScanBatchVector,
ExecAggPlainTransBatch,
+ ExecQualBatchInitMask,
+ ExecQualBatchTerm,
};
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index f24782ecf58..f50936acaaa 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -306,6 +306,10 @@ typedef enum ExprEvalOp
EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP, /* per-row fmgr calls */
EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT, /* call transfn once with AggBulkArgs */
+ /* Batched qual evaluation */
+ EEOP_QUAL_BATCH_INITMASK,
+ EEOP_QUAL_BATCH_TERM,
+
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
} ExprEvalOp;
@@ -796,6 +800,21 @@ typedef struct ExprEvalStep
{
struct BatchVector *bv;
} batch_vector;
+
+ struct
+ {
+ struct BatchVector *bv; /* filled earlier by BUILD_BATCH_VECTOR */
+ uint64 *mask; /* shared mask buffer for this program */
+ int mask_words; /* ceil(es_max_batch/64) */
+ } qualbatch_init; /* EEOP_QUAL_BATCH_INITMASK */
+
+ struct
+ {
+ struct BatchVector *bv; /* same bv as init */
+ uint64 *mask; /* same mask buffer */
+ int mask_words; /* same word count */
+ struct BatchQualTerm *term; /* compiled leaf */
+ } qualbatch_term; /* EEOP_QUAL_BATCH_TERM */
} d;
} ExprEvalStep;
@@ -975,4 +994,45 @@ extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprCo
extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
extern void ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
+/* See ExecQualBatchTerm(). */
+typedef enum BatchQualTermKind
+{
+ BQTK_VAR_CONST,
+ BQTK_VAR_VAR,
+ BQTK_IS_NULL,
+ BQTK_IS_NOT_NULL,
+} BatchQualTermKind;
+
+typedef struct BatchQualTerm
+{
+ BatchQualTermKind kind;
+ bool strict; /* follow strict NULL semantics if true */
+ int16 l_off; /* left VAR column (index into BatchVector) */
+ int16 r_off; /* right VAR column, or -1 if Const */
+ Datum r_const; /* for VAR_CONST */
+ bool r_isnull; /* for VAR_CONST */
+ FmgrInfo *finfo; /* fmgr for generic binary ops */
+ Oid collation; /* op collation */
+} BatchQualTerm;
+
+/*
+ * Runtime view for batched qual programs.
+ * Owned by the ExprState; lifetime == ExprState.
+ */
+typedef struct BatchQualRuntime
+{
+ uint64 *mask;
+ int mask_words;
+} BatchQualRuntime;
+
+static inline BatchQualRuntime *
+ExecGetBatchQualRuntime(ExprState *batch_qual)
+{
+ return (BatchQualRuntime *) batch_qual->batch_private;
+}
+
+extern void ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
#endif /* EXEC_EXPR_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index fb4b57a831c..568a7a33b7d 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -304,7 +304,8 @@ ExecScanExtendedBatch(ScanState *node,
{
ExprContext *econtext = node->ps.ps_ExprContext;
TupleBatch *b = node->ps.ps_Batch;
- int qualified;
+ ExprState *qual_batch = node->ps.qual_batch;
+ int qualified = 0;
/* Batch path does not support EPQ */
Assert(node->ps.state->es_epq_active == NULL);
@@ -320,23 +321,31 @@ ExecScanExtendedBatch(ScanState *node,
if (qual != NULL)
{
- qualified = 0;
- while (TupleBatchHasMore(b))
+ ResetExprContext(econtext);
+ if (qual_batch)
{
- TupleTableSlot *in = TupleBatchGetNextSlot(b);
-
- Assert(in);
- ResetExprContext(econtext);
- econtext->ecxt_scantuple = in;
+ qualified = ExecQualBatch(qual_batch, econtext, b);
+ }
+ else
+ {
+ int i = 0;
- if (ExecQual(qual, econtext))
+ while (TupleBatchHasMore(b))
{
- TupleBatchStoreInOut(b, qualified, in);
- qualified++;
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ Assert(slot);
+ econtext->ecxt_scantuple = slot;
+ if (ExecQual(qual, econtext))
+ {
+ TupleBatchStoreInOut(b, qualified, slot);
+ qualified++;
+ }
+ i++;
}
- else
- InstrCountFiltered1(node, 1);
}
+ InstrCountFiltered1(node, b->nvalid - qualified);
+ /* Update count and start using b->outslots. */
TupleBatchUseOutput(b, qualified);
}
else
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c72bd755b79..dd0f2c74ae5 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -333,6 +333,7 @@ ExecProcNodeBatch(PlanState *node)
extern ExprState *ExecInitExpr(Expr *node, PlanState *parent);
extern ExprState *ExecInitExprWithParams(Expr *node, ParamListInfo ext_params);
extern ExprState *ExecInitQual(List *qual, PlanState *parent);
+extern ExprState *ExecInitQualBatch(PlanState *ps);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
@@ -581,6 +582,8 @@ AggGetBulkArgs(FunctionCallInfo fcinfo)
}
#endif
+extern int ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b);
+
extern bool ExecCheck(ExprState *state, ExprContext *econtext);
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index fdfe8b4ddaf..78c5abbb23a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -146,6 +146,9 @@ typedef struct ExprState
* ExecInitExprRec().
*/
ErrorSaveContext *escontext;
+
+ /* batched-program runtime (e.g., BatchQualRuntime) */
+ void *batch_private;
} ExprState;
@@ -1196,6 +1199,7 @@ typedef struct PlanState
* subPlan list, which does not exist in the plan tree).
*/
ExprState *qual; /* boolean qual condition */
+ ExprState *qual_batch; /* boolean qual condition evaluated on batches */
PlanState *lefttree; /* input plan tree(s) */
PlanState *righttree;
--
2.47.3
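
The mask-based qual evaluation in the patch above can be illustrated outside
the executor. What follows is only a minimal sketch under stated assumptions:
the demo_* names are hypothetical, plain int64_t columns stand in for Datum
vectors, and a hard-wired "value > bound" comparison stands in for the fmgr
call. It shows the three phases the batched program goes through: start with
all rows passing, let each conjunct clear the bits of rows it rejects (a NULL
input counts as a rejection, as in WHERE), then compact the survivors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Set bits [0, nbits) and clear all others in an nwords-long mask. */
static void
demo_mask_init(uint64_t *mask, int nwords, int nbits)
{
    assert(nbits >= 0 && nbits <= nwords * 64);
    for (int w = 0; w < nwords; w++)
    {
        int lo = w * 64;        /* index of the first bit in this word */

        if (nbits >= lo + 64)
            mask[w] = ~UINT64_C(0);                         /* fully valid word */
        else if (nbits > lo)
            mask[w] = ~UINT64_C(0) >> (64 - (nbits - lo));  /* partial word */
        else
            mask[w] = 0;                                    /* beyond the batch */
    }
}

/* One conjunct: keep row i only if vals[i] is non-NULL and greater than bound. */
static void
demo_term_gt(uint64_t *mask, const int64_t *vals, const bool *nulls,
             int n, int64_t bound)
{
    for (int i = 0; i < n; i++)
    {
        if (nulls[i] || !(vals[i] > bound))
            mask[i >> 6] &= ~(UINT64_C(1) << (i & 63));
    }
}

/* Write the indexes of surviving rows into out[]; return how many survived. */
static int
demo_collect(const uint64_t *mask, int n, int *out)
{
    int kept = 0;

    for (int i = 0; i < n; i++)
    {
        if (mask[i >> 6] & (UINT64_C(1) << (i & 63)))
            out[kept++] = i;
    }
    return kept;
}
```

Because every conjunct only ever clears bits, the terms of an AND-ed qual can
run in any order over the same mask, and a batch whose mask becomes all zeroes
can be skipped without visiting its rows.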
[application/octet-stream] v3-0009-Blind-guess-at-fixing-segfault-on-running-tpch-q2.patch (11.6K, 5-v3-0009-Blind-guess-at-fixing-segfault-on-running-tpch-q2.patch)
From 92ef364a8f650022a139bc32a2e518804a41767a Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Wed, 8 Oct 2025 08:06:59 -0400
Subject: [PATCH v3 9/9] Blind guess at fixing segfault on running tpch q22
---
src/backend/executor/execExprInterp.c | 225 ++++++++++++++------------
src/backend/jit/llvm/llvmjit_expr.c | 7 +-
src/backend/jit/llvm/llvmjit_types.c | 3 +-
src/include/executor/execExpr.h | 3 +-
4 files changed, 136 insertions(+), 102 deletions(-)
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index c2b76a5e5db..aee37cf50d5 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -2343,7 +2343,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP)
{
/* too complex for an inline implementation */
- ExecAggPlainTransBatch(state, op, econtext);
+ ExecAggPlainTransBatchRowloop(state, op, econtext);
EEO_NEXT();
}
@@ -2351,7 +2351,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT)
{
/* too complex for an inline implementation */
- ExecAggPlainTransBatch(state, op, econtext);
+ ExecAggPlainTransBatchDirect(state, op, econtext);
+
EEO_NEXT();
}
@@ -6072,131 +6073,157 @@ ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
bv->nrows = i;
}
-void
-ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+static bool
+ExecAggPlainTransBatchInitTrans(ExprState *state, ExprEvalStep *op,
+ TupleBatch *b)
{
AggState *aggstate = castNode(AggState, state->parent);
AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
AggStatePerGroup pergroup =
&aggstate->all_pergroups[op->d.agg_trans.setoff][op->d.agg_trans.transno];
BatchVectorSlice *bvs = op->d.agg_trans.bvs;
+ const BatchVector *bv = bvs ? bvs->bv : NULL;
+ int batch_nrows = bvs ? bvs->bv->nrows : b->nvalid;
+ bool found = false;
FunctionCallInfo fcinfo = pertrans->transfn_fcinfo;
FmgrInfo *finfo = fcinfo->flinfo;
- Datum newVal;
- TupleBatch *batch = econtext->outer_batch;
- int batch_nrows = bvs ? bvs->bv->nrows : batch->nvalid;
- int start_row = 0;
- if (finfo->fn_strict)
+ if (!finfo->fn_strict || bvs == NULL)
+ return false;
+
+ for (int i = 0; i < batch_nrows; i++)
{
- if (pergroup->noTransValue && bvs)
+ for (int j = 0; j < bvs->nargs; j++)
{
- const BatchVector *bv = bvs->bv;
- bool found = false;
-
- Assert(bv);
- for (int i = 0; i < batch_nrows; i++)
+ if (!bv->nulls[bvs->argoffs[j]][i])
{
- for (int j = 0; j < bvs->nargs; j++)
+ fcinfo->args[1].value = bv->cols[bvs->argoffs[j]][i];
+ fcinfo->args[1].isnull = false;
+ if (j == bvs->nargs - 1)
{
- if (!bv->nulls[bvs->argoffs[j]][i])
- {
- fcinfo->args[1].value = bv->cols[bvs->argoffs[j]][i];
- fcinfo->args[1].isnull = false;
- if (j == bvs->nargs - 1)
- {
- found = true;
- break;
- }
- }
- }
- if (found)
+ found = true;
break;
+ }
}
- /* If transValue has not yet been initialized, do so now. */
- ExecAggInitGroup(aggstate, pertrans, pergroup,
- op->d.agg_trans.aggcontext);
- start_row = 1;
}
- else if (pergroup->transValueIsNull)
+ if (found)
+ break;
+ }
+ /* If transValue has not yet been initialized, do so now. */
+ ExecAggInitGroup(aggstate, pertrans, pergroup,
+ op->d.agg_trans.aggcontext);
+ return true;
+}
+
+void
+ExecAggPlainTransBatchDirect(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
+ AggStatePerGroup pergroup =
+ &aggstate->all_pergroups[op->d.agg_trans.setoff][op->d.agg_trans.transno];
+ BatchVectorSlice *bvs = op->d.agg_trans.bvs;
+ FunctionCallInfo fcinfo = pertrans->transfn_fcinfo;
+ Datum newVal;
+ TupleBatch *b = econtext->outer_batch;
+ int batch_nrows = bvs ? bvs->bv->nrows : b->nvalid;
+ int start_row = 0;
+ void *save = fcinfo->flinfo->fn_extra;
+ AggBulkArgs ba = {batch_nrows, start_row};
+
+ if (pergroup->noTransValue)
+ {
+ if (ExecAggPlainTransBatchInitTrans(state, op, b))
+ start_row = 1;
+ else if (pergroup->transValueIsNull && fcinfo->flinfo->fn_strict)
return;
}
- switch (ExecEvalStepOp(state, op))
+ if (bvs)
{
- case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
- /* Loop rows, call the original transfn per element using vector cols. */
- for (int i = start_row; i < batch_nrows; i++)
- {
- bool hasnull = false;
+ const BatchVector *bv = bvs->bv;
+
+ Assert(bv);
+ ba.nargs = bvs->nargs;
+ ba.argoffs = bvs->argoffs;
+ ba.args = bv->cols;
+ ba.isnull = bv->nulls;
+ ba.hasnull = bv->hasnull;
+ }
+ fcinfo->flinfo->fn_extra = &ba;
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ fcinfo->isnull = false; /* just in case transfn doesn't set it */
+ newVal = FunctionCallInvoke(fcinfo); /* one call for the entire slice */
+ if (!pertrans->transtypeByVal &&
+ DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggCopyTransValue(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
+ fcinfo->flinfo->fn_extra = save;
+}
- /* Set up fcinfo args 1..m from column vectors at row i. */
- if (bvs)
- {
- const BatchVector *bv = bvs->bv;
+void
+ExecAggPlainTransBatchRowloop(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ AggState *aggstate = castNode(AggState, state->parent);
+ AggStatePerTrans pertrans = op->d.agg_trans.pertrans;
+ AggStatePerGroup pergroup =
+ &aggstate->all_pergroups[op->d.agg_trans.setoff][op->d.agg_trans.transno];
+ BatchVectorSlice *bvs = op->d.agg_trans.bvs;
+ FunctionCallInfo fcinfo = pertrans->transfn_fcinfo;
+ FmgrInfo *finfo = fcinfo->flinfo;
+ Datum newVal;
+ TupleBatch *b = econtext->outer_batch;
+ int batch_nrows = bvs ? bvs->bv->nrows : b->nvalid;
+ int start_row = 0;
- for (int j = 0; j < bvs->nargs; j++)
- {
- int16 argoff = bvs->argoffs[j];
+ if (pergroup->noTransValue)
+ {
+ if (ExecAggPlainTransBatchInitTrans(state, op, b))
+ start_row = 1;
+ else if (pergroup->transValueIsNull && fcinfo->flinfo->fn_strict)
+ return;
+ }
- fcinfo->args[j+1].value = bv->cols[argoff][i];
- fcinfo->args[j+1].isnull = bv->nulls[argoff][i];
- if (!hasnull && bv->nulls[argoff][i])
- hasnull = true;
- }
- }
- /* fcinfo->args[0] is the existing transition state */
- if (finfo->fn_strict && hasnull)
- continue;
- fcinfo->args[0].value = pergroup->transValue;
- fcinfo->args[0].isnull = pergroup->transValueIsNull;
- newVal = FunctionCallInvoke(fcinfo);
- if (!pertrans->transtypeByVal &&
- DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
- newVal = ExecAggCopyTransValue(aggstate, pertrans,
- newVal, fcinfo->isnull,
- pergroup->transValue,
- pergroup->transValueIsNull);
- pergroup->transValue = newVal;
- pergroup->transValueIsNull = fcinfo->isnull;
- }
- break;
+ /* Loop rows, call the original transfn per element using vector cols. */
+ for (int i = start_row; i < batch_nrows; i++)
+ {
+ bool hasnull = false;
- case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
+ /* Set up fcinfo args 1..m from column vectors at row i. */
+ if (bvs)
+ {
+ const BatchVector *bv = bvs->bv;
+
+ for (int j = 0; j < bvs->nargs; j++)
{
- void *save = fcinfo->flinfo->fn_extra;
- AggBulkArgs ba = {batch_nrows, start_row};
+ int16 argoff = bvs->argoffs[j];
- if (bvs)
- {
- const BatchVector *bv = bvs->bv;
-
- Assert(bv);
- ba.nargs = bvs->nargs;
- ba.argoffs = bvs->argoffs;
- ba.args = bv->cols;
- ba.isnull = bv->nulls;
- ba.hasnull = bv->hasnull;
- }
- fcinfo->flinfo->fn_extra = &ba;
- fcinfo->args[0].value = pergroup->transValue;
- fcinfo->args[0].isnull = pergroup->transValueIsNull;
- fcinfo->isnull = false; /* just in case transfn doesn't set it */
- newVal = FunctionCallInvoke(fcinfo); /* one call for the entire slice */
- if (!pertrans->transtypeByVal &&
- DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
- newVal = ExecAggCopyTransValue(aggstate, pertrans,
- newVal, fcinfo->isnull,
- pergroup->transValue,
- pergroup->transValueIsNull);
- pergroup->transValue = newVal;
- pergroup->transValueIsNull = fcinfo->isnull;
- fcinfo->flinfo->fn_extra = save;
+ fcinfo->args[j+1].value = bv->cols[argoff][i];
+ fcinfo->args[j+1].isnull = bv->nulls[argoff][i];
+ if (!hasnull && bv->nulls[argoff][i])
+ hasnull = true;
}
- break;
+ }
- default:
- elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
+ if (finfo->fn_strict && hasnull)
+ continue;
+ /* fcinfo->args[0] is the existing transition state */
+ fcinfo->args[0].value = pergroup->transValue;
+ fcinfo->args[0].isnull = pergroup->transValueIsNull;
+ newVal = FunctionCallInvoke(fcinfo);
+ if (!pertrans->transtypeByVal &&
+ DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+ newVal = ExecAggCopyTransValue(aggstate, pertrans,
+ newVal, fcinfo->isnull,
+ pergroup->transValue,
+ pergroup->transValueIsNull);
+ pergroup->transValue = newVal;
+ pergroup->transValueIsNull = fcinfo->isnull;
}
}
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index b97d5faebde..2d1c8259d1a 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3027,8 +3027,13 @@ llvm_compile_expr(ExprState *state)
break;
case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
+ build_EvalXFunc(b, mod, "ExecAggPlainTransBatchDirect",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
- build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
+ build_EvalXFunc(b, mod, "ExecAggPlainTransBatchRowloop",
v_state, op, v_econtext);
LLVMBuildBr(b, opblocks[opno + 1]);
break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index f4f756e7cb5..2cf3a60be51 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -186,7 +186,8 @@ void *referenced_functions[] =
ExecBuildInnerBatchVector,
ExecBuildOuterBatchVector,
ExecBuildScanBatchVector,
- ExecAggPlainTransBatch,
+ ExecAggPlainTransBatchDirect,
+ ExecAggPlainTransBatchRowloop,
ExecQualBatchInitMask,
ExecQualBatchTerm,
};
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index f50936acaaa..a3314ffd0c9 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -993,7 +993,8 @@ extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprCo
extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
-extern void ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecAggPlainTransBatchDirect(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecAggPlainTransBatchRowloop(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
/* See ExecQualBatchTerm(). */
typedef enum BatchQualTermKind
--
2.47.3
[application/octet-stream] v3-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch (16.9K, 6-v3-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch)
From f3239ed6c0f196be5b495a586e6b390465d0326d Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 22 Sep 2025 17:01:29 +0900
Subject: [PATCH v3 5/9] WIP: Add EEOPs and helpers for TupleBatch processing
Introduce new EEOP cases to fetch attributes into TupleBatch
vectors:
- EEOP_{INNER,OUTER,SCAN}_FETCHSOME_BATCH
- EEOP_BUILD_{INNER,OUTER,SCAN}_BATCH_VECTOR
Add ExecBuild{Inner,Outer,Scan}BatchVector() helpers to populate
column vectors (values, nulls, nrows, hasnull) from a TupleBatch.
Extend ExprContext with inner_batch, outer_batch, and scan_batch
fields so expression programs can access active batches directly.
Add slot_getsomeattrs_batch() to prefetch attributes across all
slots in a TupleBatch, similar to slot_getsomeattrs() for one slot.
---
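
For readers who want to see the row-to-column step in isolation, here is a
rough standalone sketch of what populating the column vectors amounts to. It
is not the patch's code: DemoRow and DemoVector are hypothetical stand-ins for
TupleTableSlot and BatchVector, int64_t replaces Datum, and unlike the patch
the sketch resets hasnull at the start of every batch (an assumption made for
the example).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a deformed tuple slot. */
typedef struct DemoRow
{
    const int64_t *values;      /* one value per attribute */
    const bool *isnull;         /* one NULL flag per attribute */
} DemoRow;

/* Hypothetical stand-in for the patch's BatchVector. */
typedef struct DemoVector
{
    int ncols;                  /* number of requested attributes */
    const int *attnos;          /* 1-based attribute numbers */
    int64_t **cols;             /* cols[j][i]: column j, row i */
    bool **nulls;               /* nulls[j][i]: NULL flag for column j, row i */
    int nrows;                  /* rows filled by the last build */
    bool hasnull;               /* true if any fetched value was NULL */
} DemoVector;

/* Transpose the requested attributes of nrows rows into column vectors. */
static void
demo_build_vector(DemoVector *bv, const DemoRow *rows, int nrows)
{
    assert(nrows >= 0);
    bv->hasnull = false;        /* assumption: reset per batch */

    for (int i = 0; i < nrows; i++)
    {
        for (int j = 0; j < bv->ncols; j++)
        {
            int att = bv->attnos[j] - 1;    /* to 0-based array index */

            bv->cols[j][i] = rows[i].values[att];
            bv->nulls[j][i] = rows[i].isnull[att];
            if (rows[i].isnull[att])
                bv->hasnull = true;
        }
    }
    bv->nrows = nrows;
}
```

Once the values sit in per-column arrays, a later step can loop over one
column at a time, and the hasnull flag lets such a step skip NULL checks
when no fetched value was NULL.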
src/backend/executor/execExprInterp.c | 127 +++++++++++++++++++++++++-
src/backend/executor/execTuples.c | 32 +++++++
src/backend/jit/llvm/llvmjit_expr.c | 86 +++++++++++++++++
src/backend/jit/llvm/llvmjit_types.c | 4 +
src/include/executor/execExpr.h | 45 ++++++++-
src/include/executor/tuptable.h | 2 +
src/include/nodes/execnodes.h | 24 +++--
7 files changed, 310 insertions(+), 10 deletions(-)
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 0e1a74976f7..68629ad7991 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -59,6 +59,7 @@
#include "access/heaptoast.h"
#include "catalog/pg_type.h"
#include "commands/sequence.h"
+#include "executor/execBatch.h"
#include "executor/execExpr.h"
#include "executor/nodeSubplan.h"
#include "funcapi.h"
@@ -188,6 +189,11 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
int setno);
static char *ExecGetJsonValueItemString(JsonbValue *item, bool *resnull);
+static pg_attribute_always_inline void ExecBuildBatchVector(ExprState *state,
+ ExprEvalStep *op,
+ ExprContext *econtext,
+ TupleBatch *b);
+
/*
* ScalarArrayOpExprHashEntry
* Hash table entry type used during EEOP_HASHED_SCALARARRAYOP
@@ -446,7 +452,6 @@ ExecReadyInterpretedExpr(ExprState *state)
state->evalfunc_private = ExecInterpExpr;
}
-
/*
* Evaluate expression identified by "state" in the execution context
* given by "econtext". *isnull is set to the is-null flag for the result,
@@ -466,6 +471,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
TupleTableSlot *scanslot;
TupleTableSlot *oldslot;
TupleTableSlot *newslot;
+ TupleBatch *innerbatch;
+ TupleBatch *outerbatch;
+ TupleBatch *scanbatch;
/*
* This array has to be in the same order as enum ExprEvalOp.
@@ -479,6 +487,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_SCAN_FETCHSOME,
&&CASE_EEOP_OLD_FETCHSOME,
&&CASE_EEOP_NEW_FETCHSOME,
+ &&CASE_EEOP_INNER_FETCHSOME_BATCH,
+ &&CASE_EEOP_OUTER_FETCHSOME_BATCH,
+ &&CASE_EEOP_SCAN_FETCHSOME_BATCH,
&&CASE_EEOP_INNER_VAR,
&&CASE_EEOP_OUTER_VAR,
&&CASE_EEOP_SCAN_VAR,
@@ -592,6 +603,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
+ &&CASE_EEOP_BUILD_INNER_BATCH_VECTOR,
+ &&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
+ &&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
&&CASE_EEOP_LAST
};
@@ -612,6 +626,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
scanslot = econtext->ecxt_scantuple;
oldslot = econtext->ecxt_oldtuple;
newslot = econtext->ecxt_newtuple;
+ innerbatch = econtext->inner_batch;
+ outerbatch = econtext->outer_batch;
+ scanbatch = econtext->scan_batch;
#if defined(EEO_USE_COMPUTED_GOTO)
EEO_DISPATCH();
@@ -658,6 +675,36 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_INNER_FETCHSOME_BATCH)
+ {
+ CheckOpSlotCompatibility(op, innerslot);
+
+ Assert(innerbatch);
+ slot_getsomeattrs_batch(innerbatch, op->d.fetch_batch.last_var);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_OUTER_FETCHSOME_BATCH)
+ {
+ CheckOpSlotCompatibility(op, outerslot);
+
+ Assert(outerbatch);
+ slot_getsomeattrs_batch(outerbatch, op->d.fetch_batch.last_var);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_SCAN_FETCHSOME_BATCH)
+ {
+ CheckOpSlotCompatibility(op, scanslot);
+
+ Assert(scanbatch);
+ slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_OLD_FETCHSOME)
{
CheckOpSlotCompatibility(op, oldslot);
@@ -2265,6 +2312,30 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_BUILD_INNER_BATCH_VECTOR)
+ {
+ /* too complex for an inline implementation */
+ ExecBuildInnerBatchVector(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_BUILD_OUTER_BATCH_VECTOR)
+ {
+ /* too complex for an inline implementation */
+ ExecBuildOuterBatchVector(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_BUILD_SCAN_BATCH_VECTOR)
+ {
+ /* too complex for an inline implementation */
+ ExecBuildScanBatchVector(state, op, econtext);
+
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_LAST)
{
/* unreachable */
@@ -5914,3 +5985,57 @@ ExecAggPlainTransByRef(AggState *aggstate, AggStatePerTrans pertrans,
MemoryContextSwitchTo(oldContext);
}
+
+void
+ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ Assert(econtext->inner_batch);
+ ExecBuildBatchVector(state, op, econtext, econtext->inner_batch);
+}
+
+void
+ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ Assert(econtext->outer_batch);
+ ExecBuildBatchVector(state, op, econtext, econtext->outer_batch);
+}
+
+void
+ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ Assert(econtext->scan_batch);
+ ExecBuildBatchVector(state, op, econtext, econtext->scan_batch);
+}
+
+static pg_attribute_always_inline void
+ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
+ TupleBatch *b)
+{
+ struct BatchVector *bv = op->d.batch_vector.bv;
+ int i = 0;
+
+ if (bv->ncols == 0)
+ return;
+
+ /* Fetch each requested attribute into column vectors. */
+ TupleBatchRewind(b);
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ for (int j = 0; j < bv->ncols; j++)
+ {
+ AttrNumber attno = bv->attnos[j];
+ Datum *cols = bv->cols[j];
+ bool *nulls = bv->nulls[j];
+
+ Assert(attno <= slot->tts_nvalid);
+ cols[i] = slot->tts_values[attno - 1];
+ nulls[i] = slot->tts_isnull[attno - 1];
+ if (!bv->hasnull && nulls[i])
+ bv->hasnull = true;
+ }
+ i++;
+ }
+ bv->nrows = i;
+}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 8e02d68824f..86d5dea8f8b 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -2111,6 +2111,38 @@ slot_getsomeattrs_int(TupleTableSlot *slot, int attnum)
}
}
+void
+slot_getsomeattrs_batch(struct TupleBatch *b, int attnum)
+{
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ /* Check for caller errors */
+ Assert(attnum > 0);
+
+ if (unlikely(attnum > slot->tts_tupleDescriptor->natts))
+ elog(ERROR, "invalid attribute number %d", attnum);
+
+ /* XXX - there should perhaps also be a batch-level att_nvalid */
+ if (attnum <= slot->tts_nvalid)
+ continue;
+
+ /* Fetch as many attributes as possible from the underlying tuple. */
+ slot->tts_ops->getsomeattrs(slot, attnum);
+
+ /*
+ * If the underlying tuple doesn't have enough attributes, tuple
+ * descriptor must have the missing attributes.
+ */
+ if (unlikely(slot->tts_nvalid < attnum))
+ {
+ slot_getmissingattrs(slot, slot->tts_nvalid, attnum);
+ slot->tts_nvalid = attnum;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecTypeFromTL
*
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 712b35df7e5..848f0b52d6f 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -109,6 +109,11 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_newslot;
LLVMValueRef v_resultslot;
+ /* batches */
+ LLVMValueRef v_innerbatch;
+ LLVMValueRef v_outerbatch;
+ LLVMValueRef v_scanbatch;
+
/* nulls/values of slots */
LLVMValueRef v_innervalues;
LLVMValueRef v_innernulls;
@@ -221,6 +226,21 @@ llvm_compile_expr(ExprState *state)
v_state,
FIELDNO_EXPRSTATE_RESULTSLOT,
"v_resultslot");
+ v_innerbatch = l_load_struct_gep(b,
+ StructExprContext,
+ v_econtext,
+ FIELDNO_EXPRCONTEXT_INNERBATCH,
+ "v_innerbatch");
+ v_outerbatch = l_load_struct_gep(b,
+ StructExprContext,
+ v_econtext,
+ FIELDNO_EXPRCONTEXT_OUTERBATCH,
+ "v_outerbatch");
+ v_scanbatch = l_load_struct_gep(b,
+ StructExprContext,
+ v_econtext,
+ FIELDNO_EXPRCONTEXT_SCANBATCH,
+ "v_scanbatch");
/* build global values/isnull pointers */
v_scanvalues = l_load_struct_gep(b,
@@ -439,6 +459,54 @@ llvm_compile_expr(ExprState *state)
break;
}
+ case EEOP_INNER_FETCHSOME_BATCH:
+ {
+ LLVMValueRef params[2];
+
+ params[0] = v_innerbatch;
+ params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+ l_call(b,
+ llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+ llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+ params, lengthof(params), "");
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
+ case EEOP_OUTER_FETCHSOME_BATCH:
+ {
+ LLVMValueRef params[2];
+
+ params[0] = v_outerbatch;
+ params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+ l_call(b,
+ llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+ llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+ params, lengthof(params), "");
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
+ case EEOP_SCAN_FETCHSOME_BATCH:
+ {
+ LLVMValueRef params[2];
+
+ params[0] = v_scanbatch;
+ params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+ l_call(b,
+ llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+ llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+ params, lengthof(params), "");
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
case EEOP_INNER_VAR:
case EEOP_OUTER_VAR:
case EEOP_SCAN_VAR:
@@ -2940,6 +3008,24 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_BUILD_INNER_BATCH_VECTOR:
+ build_EvalXFunc(b, mod, "ExecBuildInnerBatchVector",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
+ case EEOP_BUILD_OUTER_BATCH_VECTOR:
+ build_EvalXFunc(b, mod, "ExecBuildOuterBatchVector",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
+ case EEOP_BUILD_SCAN_BATCH_VECTOR:
+ build_EvalXFunc(b, mod, "ExecBuildScanBatchVector",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
case EEOP_LAST:
Assert(false);
break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 167cd554b9c..6bb527c3f6f 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -179,7 +179,11 @@ void *referenced_functions[] =
MakeExpandedObjectReadOnlyInternal,
slot_getmissingattrs,
slot_getsomeattrs_int,
+ slot_getsomeattrs_batch,
strlen,
varsize_any,
ExecInterpExprStillValid,
+ ExecBuildInnerBatchVector,
+ ExecBuildOuterBatchVector,
+ ExecBuildScanBatchVector,
};
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 75366203706..99c86bac702 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -78,6 +78,11 @@ typedef enum ExprEvalOp
EEOP_OLD_FETCHSOME,
EEOP_NEW_FETCHSOME,
+ /* apply slot_getsomeattrs_batch() to corresponding batch */
+ EEOP_INNER_FETCHSOME_BATCH,
+ EEOP_OUTER_FETCHSOME_BATCH,
+ EEOP_SCAN_FETCHSOME_BATCH,
+
/* compute non-system Var value */
EEOP_INNER_VAR,
EEOP_OUTER_VAR,
@@ -292,11 +297,15 @@ typedef enum ExprEvalOp
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
+ /* ExprContext.*_batch -> BatchVector */
+ EEOP_BUILD_INNER_BATCH_VECTOR,
+ EEOP_BUILD_OUTER_BATCH_VECTOR,
+ EEOP_BUILD_SCAN_BATCH_VECTOR,
+
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
} ExprEvalOp;
-
typedef struct ExprEvalStep
{
/*
@@ -331,6 +340,12 @@ typedef struct ExprEvalStep
const TupleTableSlotOps *kind;
} fetch;
+ struct
+ {
+ /* attribute number up to which to fetch (inclusive) */
+ int last_var;
+ } fetch_batch;
+
/* for EEOP_INNER/OUTER/SCAN/OLD/NEW_[SYS]VAR */
struct
{
@@ -769,6 +784,12 @@ typedef struct ExprEvalStep
void *json_coercion_cache;
ErrorSaveContext *escontext;
} jsonexpr_coercion;
+
+ /* for batch vector construction */
+ struct
+ {
+ struct BatchVector *bv;
+ } batch_vector;
} d;
} ExprEvalStep;
@@ -917,4 +938,26 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
ExprContext *econtext);
+/* ---------- BatchVector stuff ------------- */
+
+/* Vector fetch spec for a list of simple Vars. */
+typedef struct BatchVector
+{
+ /* immutable after BatchVectorCreate */
+ AttrNumber *attnos; /* [ncols] */
+ int ncols;
+ int maxrows;
+ int last_var;
+
+ /* per batch state */
+ Datum **cols; /* [ncols][maxrows] */
+ bool **nulls; /* [ncols][maxrows] */
+ bool hasnull; /* is any datum in cols NULL? */
+ int nrows; /* #rows loaded into cols/nulls */
+} BatchVector;
+
+extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
#endif /* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 43f1d999b91..82369fa6e8e 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -346,6 +346,8 @@ extern Datum ExecFetchSlotHeapTupleDatum(TupleTableSlot *slot);
extern void slot_getmissingattrs(TupleTableSlot *slot, int startAttNum,
int lastAttNum);
extern void slot_getsomeattrs_int(TupleTableSlot *slot, int attnum);
+struct TupleBatch;
+extern void slot_getsomeattrs_batch(struct TupleBatch *b, int attnum);
#ifndef FRONTEND
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9b81b842161..fdfe8b4ddaf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -277,6 +277,14 @@ typedef struct ExprContext
#define FIELDNO_EXPRCONTEXT_OUTERTUPLE 3
TupleTableSlot *ecxt_outertuple;
+ /* For batched evaluation using batch-aware EEOPs */
+#define FIELDNO_EXPRCONTEXT_INNERBATCH 4
+ TupleBatch *inner_batch;
+#define FIELDNO_EXPRCONTEXT_OUTERBATCH 5
+ TupleBatch *outer_batch;
+#define FIELDNO_EXPRCONTEXT_SCANBATCH 6
+ TupleBatch *scan_batch;
+
/* Memory contexts for expression evaluation --- see notes above */
MemoryContext ecxt_per_query_memory;
MemoryContext ecxt_per_tuple_memory;
@@ -289,27 +297,27 @@ typedef struct ExprContext
* Values to substitute for Aggref nodes in the expressions of an Agg
* node, or for WindowFunc nodes within a WindowAgg node.
*/
-#define FIELDNO_EXPRCONTEXT_AGGVALUES 8
+#define FIELDNO_EXPRCONTEXT_AGGVALUES 11
Datum *ecxt_aggvalues; /* precomputed values for aggs/windowfuncs */
-#define FIELDNO_EXPRCONTEXT_AGGNULLS 9
+#define FIELDNO_EXPRCONTEXT_AGGNULLS 12
bool *ecxt_aggnulls; /* null flags for aggs/windowfuncs */
/* Value to substitute for CaseTestExpr nodes in expression */
-#define FIELDNO_EXPRCONTEXT_CASEDATUM 10
+#define FIELDNO_EXPRCONTEXT_CASEDATUM 13
Datum caseValue_datum;
-#define FIELDNO_EXPRCONTEXT_CASENULL 11
+#define FIELDNO_EXPRCONTEXT_CASENULL 14
bool caseValue_isNull;
/* Value to substitute for CoerceToDomainValue nodes in expression */
-#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 12
+#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 15
Datum domainValue_datum;
-#define FIELDNO_EXPRCONTEXT_DOMAINNULL 13
+#define FIELDNO_EXPRCONTEXT_DOMAINNULL 16
bool domainValue_isNull;
/* Tuples that OLD/NEW Var nodes in RETURNING may refer to */
-#define FIELDNO_EXPRCONTEXT_OLDTUPLE 14
+#define FIELDNO_EXPRCONTEXT_OLDTUPLE 17
TupleTableSlot *ecxt_oldtuple;
-#define FIELDNO_EXPRCONTEXT_NEWTUPLE 15
+#define FIELDNO_EXPRCONTEXT_NEWTUPLE 18
TupleTableSlot *ecxt_newtuple;
/* Link to containing EState (NULL if a standalone ExprContext) */
--
2.47.3
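The BatchVector declared in execExpr.h above keeps one array per referenced
column plus a batch-wide hasnull flag. The sketch below is only one plausible
reading of that layout, not the patch's ExecBuild*BatchVector() code, which is
not shown here. ToyRow, ToyVector, toy_build_vector() and toy_sum_column() are
hypothetical names, and plain long values stand in for Datum.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NCOLS   2           /* columns captured in the vector */
#define TOY_MAXROWS 4           /* capacity, like BatchVector.maxrows */
#define TOY_NATTS   3           /* attributes per source row */

/* Hypothetical row-oriented input, standing in for a batch of slots. */
typedef struct ToyRow
{
    long        values[TOY_NATTS];
    bool        isnull[TOY_NATTS];
} ToyRow;

/* Hypothetical column-oriented output, modelled on BatchVector. */
typedef struct ToyVector
{
    int         attnos[TOY_NCOLS];  /* 1-based attribute numbers */
    long        cols[TOY_NCOLS][TOY_MAXROWS];
    bool        nulls[TOY_NCOLS][TOY_MAXROWS];
    bool        hasnull;            /* is any captured value NULL? */
    int         nrows;              /* rows loaded into cols/nulls */
} ToyVector;

/* Transpose the referenced attributes of 'rows' into per-column arrays. */
static void
toy_build_vector(ToyVector *bv, const ToyRow *rows, int nrows)
{
    assert(nrows >= 0 && nrows <= TOY_MAXROWS);

    bv->hasnull = false;
    bv->nrows = nrows;

    for (int c = 0; c < TOY_NCOLS; c++)
    {
        int         att = bv->attnos[c] - 1;

        assert(att >= 0 && att < TOY_NATTS);
        for (int r = 0; r < nrows; r++)
        {
            bv->cols[c][r] = rows[r].values[att];
            bv->nulls[c][r] = rows[r].isnull[att];
            if (rows[r].isnull[att])
                bv->hasnull = true;
        }
    }
}

/* Sum one column; the null checks are skipped when hasnull is false. */
static long
toy_sum_column(const ToyVector *bv, int c)
{
    long        sum = 0;

    if (!bv->hasnull)
    {
        for (int r = 0; r < bv->nrows; r++)
            sum += bv->cols[c][r];
        return sum;
    }

    for (int r = 0; r < bv->nrows; r++)
    {
        if (!bv->nulls[c][r])
            sum += bv->cols[c][r];
    }
    return sum;
}
```

The point of the batch-wide flag in this sketch is that a consumer can pick a
branch-free inner loop once per batch instead of testing a null flag per value.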
[application/octet-stream] v3-0001-Add-batch-table-AM-API-and-heapam-implementation.patch (13.7K, 7-v3-0001-Add-batch-table-AM-API-and-heapam-implementation.patch)
From 51192c52275005649df88b5e3a75360942dc0fcd Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 21:56:17 +0900
Subject: [PATCH v3 1/9] Add batch table AM API and heapam implementation
Introduce new table AM callbacks to fetch multiple tuples per call.
This reduces per-tuple call overhead by letting executor nodes work
in batches.
Define a HeapBatch structure in heapam.h and the supporting
table_scan_*_batch() wrappers in tableam.h.
Batches are limited to tuples from a single page and at most
EXEC_BATCH_ROWS (currently 64) entries.
Provide initial heapam support with heapgettup_pagemode_batch().
No executor node is switched over yet; a later commit will adapt
SeqScan to use this API. Other nodes may adopt it in the future.
Also add pgstat_count_heap_getnext_batch() to record batched fetches
in pgstat.
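The begin/getnext/end lifecycle and the single-page rule described above can
be sketched with a toy model. This is a simplified illustration and not the
heapam code: ToyPage, ToyScan, ToyBatch and the toy_* functions are
hypothetical names, integers stand in for tuples, and no buffer pinning or
visibility checking is modelled.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a heap page holding visible tuples. */
typedef struct ToyPage
{
    const int  *tuples;
    int         ntuples;
} ToyPage;

/* Hypothetical scan position. */
typedef struct ToyScan
{
    const ToyPage *pages;
    int         npages;
    int         curpage;        /* page currently being read */
    int         curidx;         /* next tuple to return from curpage */
} ToyScan;

/* Hypothetical batch: pointers into one page, never copies. */
typedef struct ToyBatch
{
    const int **items;
    int         nitems;
    int         maxitems;
} ToyBatch;

static ToyBatch *
toy_begin_batch(int maxitems)
{
    ToyBatch   *b;

    assert(maxitems > 0);
    b = malloc(sizeof(ToyBatch));
    assert(b != NULL);
    b->items = malloc(sizeof(const int *) * maxitems);
    assert(b->items != NULL);
    b->nitems = 0;
    b->maxitems = maxitems;
    return b;
}

/* Returns the number of tuples in the batch; 0 means end of scan. */
static int
toy_getnextbatch(ToyScan *scan, ToyBatch *b)
{
    const ToyPage *page;

    b->nitems = 0;

    /* Skip exhausted or empty pages before producing anything. */
    while (scan->curpage < scan->npages &&
           scan->curidx >= scan->pages[scan->curpage].ntuples)
    {
        scan->curpage++;
        scan->curidx = 0;
    }
    if (scan->curpage >= scan->npages)
        return 0;

    /* Fill from this one page only, up to the batch capacity. */
    page = &scan->pages[scan->curpage];
    while (scan->curidx < page->ntuples && b->nitems < b->maxitems)
        b->items[b->nitems++] = &page->tuples[scan->curidx++];

    return b->nitems;
}

static void
toy_end_batch(ToyBatch *b)
{
    free(b->items);
    free(b);
}

/* Drive a whole scan; returns the tuple sum and counts non-empty batches. */
static long
toy_scan_sum(const ToyPage *pages, int npages, int maxitems, int *nbatches)
{
    ToyScan     scan = {pages, npages, 0, 0};
    ToyBatch   *b = toy_begin_batch(maxitems);
    long        sum = 0;

    *nbatches = 0;
    while (toy_getnextbatch(&scan, b) > 0)
    {
        (*nbatches)++;
        for (int i = 0; i < b->nitems; i++)
            sum += *b->items[i];
    }
    toy_end_batch(b);
    return sum;
}
```

In this model a batch can come back short either because the page ran out or
because the capacity was reached, so a consumer should rely only on the
returned count and never assume full batches.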
---
src/backend/access/heap/heapam.c | 212 ++++++++++++++++++++++-
src/backend/access/heap/heapam_handler.c | 4 +
src/include/access/heapam.h | 21 +++
src/include/access/tableam.h | 58 +++++++
src/include/pgstat.h | 5 +
5 files changed, 299 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 568696333c2..8b9a80449c1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1008,7 +1008,7 @@ heapgettup_pagemode(HeapScanDesc scan,
int nkeys,
ScanKey key)
{
- HeapTuple tuple = &(scan->rs_ctup);
+ HeapTuple tuple = &scan->rs_ctup;
Page page;
uint32 lineindex;
uint32 linesleft;
@@ -1089,6 +1089,121 @@ continue_page:
scan->rs_inited = false;
}
+/*
+ * heapgettup_pagemode_batch
+ * Collect up to 'maxitems' visible tuples from a single page in page mode.
+ *
+ * This function returns a *batch* of tuples from one heap page. If the
+ * current page (as tracked by the scan desc) has no more tuples left,
+ * it will advance to the next page and prepare it (via heap_prepare_pagescan).
+ * It will not cross a page boundary while filling the batch.
+ *
+ * Return value:
+ * number of tuples written into 'tdata' (0 at end-of-scan).
+ *
+ * Side effects:
+ * - Ensures rs_cbuf pins the page from which tuples were produced.
+ * - Sets rs_cblock, rs_cindex, rs_ntuples consistently (same as
+ * heapgettup_pagemode's inner-loop effects).
+ * - Does *not* change buffer pin counts except through normal page
+ * transitions performed by heap_fetch_next_buffer().
+ */
+static int
+heapgettup_pagemode_batch(HeapScanDesc scan,
+ ScanDirection dir,
+ int nkeys, ScanKey key,
+ HeapTupleData *tdata,
+ int maxitems)
+{
+ Page page;
+ uint32 lineindex;
+ uint32 linesleft;
+ int nout = 0;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(maxitems > 0);
+
+ /*
+ * If we have no current page (or the current page is exhausted),
+ * advance to the next page that has any visible tuples and prepare it.
+ * This mirrors the outer loop of heapgettup_pagemode(), but we stop
+ * as soon as we have a prepared page; we never produce from two pages.
+ */
+ for (;;)
+ {
+ if (BufferIsValid(scan->rs_cbuf))
+ {
+ /* Are there more visible tuples left on this page? */
+ lineindex = scan->rs_cindex + dir;
+ if (ScanDirectionIsForward(dir))
+ linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
+ (scan->rs_ntuples - lineindex) : 0;
+ else
+ linesleft = scan->rs_cindex;
+ if (linesleft > 0)
+ break; /* continue on this page */
+ }
+
+ /* Move to next page and prepare its visible tuple list. */
+ heap_fetch_next_buffer(scan, dir);
+
+ if (!BufferIsValid(scan->rs_cbuf))
+ {
+ /* end of scan; keep rs_cbuf invalid like heapgettup_pagemode */
+ scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_prefetch_block = InvalidBlockNumber;
+ scan->rs_inited = false;
+ return 0;
+ }
+
+ Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+ heap_prepare_pagescan((TableScanDesc) scan);
+
+ /* After prepare, either rs_ntuples > 0 or we'll loop again. */
+ if (scan->rs_ntuples > 0)
+ {
+ lineindex = ScanDirectionIsForward(dir) ? 0 : scan->rs_ntuples - 1;
+ linesleft = scan->rs_ntuples;
+ break;
+ }
+ /* else: page had no visible tuples; continue to next page */
+ }
+
+ /* From here on, we must only read tuples from this single page. */
+ page = BufferGetPage(scan->rs_cbuf);
+
+ /*
+ * Walk rs_vistuples[] from 'lineindex', copying headers into tdata[]
+ * until either the page is exhausted or the batch capacity is reached.
+ */
+ for (; linesleft > 0 && nout < maxitems; linesleft--, lineindex += dir)
+ {
+ OffsetNumber lineoff;
+ ItemId lpp;
+ HeapTupleData *dst = &tdata[nout];
+
+ Assert(lineindex <= (uint32) scan->rs_ntuples);
+ lineoff = scan->rs_vistuples[lineindex];
+ lpp = PageGetItemId(page, lineoff);
+ Assert(ItemIdIsNormal(lpp));
+
+ dst->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+ dst->t_len = ItemIdGetLength(lpp);
+ dst->t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+ ItemPointerSet(&(dst->t_self), scan->rs_cblock, lineoff);
+
+ if (key != NULL &&
+ !HeapKeyTest(dst, RelationGetDescr(scan->rs_base.rs_rd),
+ nkeys, key))
+ continue;
+
+ scan->rs_cindex = lineindex;
+ nout++;
+ }
+
+ return nout;
+}
/* ----------------------------------------------------------------
* heap access method interface
@@ -1136,6 +1251,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_base.rs_parallel = parallel_scan;
scan->rs_strategy = NULL; /* set in initscan */
scan->rs_cbuf = InvalidBuffer;
+ scan->rs_batch_ctup = NULL;
+ scan->rs_batch_cbuf = InvalidBuffer;
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1315,6 +1432,8 @@ heap_endscan(TableScanDesc sscan)
*/
if (BufferIsValid(scan->rs_cbuf))
ReleaseBuffer(scan->rs_cbuf);
+ if (BufferIsValid(scan->rs_batch_cbuf))
+ ReleaseBuffer(scan->rs_batch_cbuf);
/*
* Must free the read stream before freeing the BufferAccessStrategy.
@@ -1421,6 +1540,97 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
return true;
}
+/*---------- Batching support -----------*/
+
+/*
+ * heap_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated in the current memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ HeapBatch *hb;
+ Oid relid;
+
+ Assert(maxitems > 0);
+
+ hb = palloc(sizeof(HeapBatch));
+ hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
+ hb->maxitems = maxitems;
+ hb->nitems = 0;
+ hb->buf = InvalidBuffer;
+
+ /* Initialize static fields of HeapTupleData. Row bodies remain on page. */
+ relid = RelationGetRelid(sscan->rs_rd);
+ for (int i = 0; i < maxitems; i++)
+ hb->tupdata[i].t_tableOid = relid;
+
+ return hb;
+}
+
+/*
+ * heap_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ if (BufferIsValid(hb->buf))
+ ReleaseBuffer(hb->buf);
+
+ pfree(hb->tupdata);
+ pfree(hb);
+}
+
+int
+heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ HeapScanDesc scan = (HeapScanDesc) sscan;
+ HeapBatch *hb = (HeapBatch *) am_batch;
+ Buffer curbuf;
+ int n;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(hb->maxitems > 0);
+
+ /* Drop prior batch pin, if any. */
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+
+ hb->nitems = 0;
+
+ /* One call per batch, never crosses a page. */
+ n = heapgettup_pagemode_batch(scan, dir,
+ sscan->rs_nkeys, sscan->rs_key,
+ hb->tupdata, hb->maxitems);
+
+ if (n == 0)
+ return 0; /* end of scan */
+
+ /* Hold an extra pin for the batch lifetime so t_data stays valid. */
+ curbuf = scan->rs_cbuf;
+ IncrBufferRefCount(curbuf);
+ hb->buf = curbuf;
+
+ /* Per-tuple stats (can be collapsed into a future _multi() call). */
+ pgstat_count_heap_getnext_batch(sscan->rs_rd, n);
+
+ hb->nitems = n;
+ return n;
+}
+
+/*----- End of batching support -----*/
+
void
heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..ec4eeccf19c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2623,6 +2623,10 @@ static const TableAmRoutine heapam_methods = {
.scan_rescan = heap_rescan,
.scan_getnextslot = heap_getnextslot,
+ .scan_begin_batch = heap_begin_batch,
+ .scan_getnextbatch = heap_getnextbatch,
+ .scan_end_batch = heap_end_batch,
+
.scan_set_tidrange = heap_set_tidrange,
.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e60d34dad25..02f7793fba0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -74,6 +74,9 @@ typedef struct HeapScanDescData
HeapTupleData rs_ctup; /* current tuple in scan, if any */
+ HeapTupleData *rs_batch_ctup; /* NULL when not using batched mode */
+ Buffer rs_batch_cbuf; /* buffer feeding the batch */
+
/* For scans that stream reads */
ReadStream *rs_read_stream;
@@ -101,6 +104,19 @@ typedef struct HeapScanDescData
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
+/*
+ * HeapBatch -- stateless per-batch buffer. A batch pins one page and
+ * exposes up to maxitems HeapTupleData headers whose t_data point into that
+ * page.
+ */
+typedef struct HeapBatch
+{
+ HeapTupleData *tupdata; /* len = maxitems; headers only */
+ int nitems; /* tuples produced in last getnextbatch() */
+ int maxitems; /* fixed capacity set at begin_batch() */
+ Buffer buf; /* single pinned buffer for this batch */
+} HeapBatch;
+
typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
@@ -294,6 +310,11 @@ extern void heap_endscan(TableScanDesc sscan);
extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
extern bool heap_getnextslot(TableScanDesc sscan,
ScanDirection direction, TupleTableSlot *slot);
+
+extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
+extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
+extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..953207eac50 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -351,6 +351,16 @@ typedef struct TableAmRoutine
ScanDirection direction,
TupleTableSlot *slot);
+ /* ------------------------------------------------------------------------
+ * Batched scan support
+ * ------------------------------------------------------------------------
+ */
+
+ void *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+ int (*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+ ScanDirection dir);
+ void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -1036,6 +1046,54 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_scan_begin_batch
+ * Allocate AM-owned batch payload with capacity 'maxitems'.
+ */
+static inline void *
+table_scan_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_begin_batch != NULL);
+
+ return tam->scan_begin_batch(sscan, maxitems);
+}
+
+/*
+ * table_scan_getnextbatch
+ * Fill next batch from the AM. Returns number of tuples, 0 => EOS.
+ * Batches are single-page in v1. Direction is forward only in v1.
+ */
+static inline int
+table_scan_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ /* Only forward scans are supported in the batched mode. */
+ Assert(dir == ForwardScanDirection);
+ Assert(tam->scan_getnextbatch != NULL);
+
+ return tam->scan_getnextbatch(sscan, am_batch, dir);
+}
+
+/*
+ * table_scan_end_batch
+ * Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ if (am_batch == NULL)
+ return;
+
+ Assert(tam->scan_end_batch != NULL);
+
+ tam->scan_end_batch(sscan, am_batch);
+}
+
/* ----------------------------------------------------------------------------
* TID Range scanning related functions.
* ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bc8077cbae6..249f3583f92 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -691,6 +691,11 @@ extern void pgstat_report_analyze(Relation rel,
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.tuples_returned++; \
} while (0)
+#define pgstat_count_heap_getnext_batch(rel, n) \
+ do { \
+ if (pgstat_should_count_relation(rel)) \
+ (rel)->pgstat_info->counts.tuples_returned += (n); \
+ } while (0)
#define pgstat_count_heap_fetch(rel) \
do { \
if (pgstat_should_count_relation(rel)) \
--
2.47.3
[application/octet-stream] v3-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch (6.3K, 8-v3-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch)
From 87728dd22a56c35d3b7ee11e71e15a8d4193afd1 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 4 Sep 2025 22:55:25 +0900
Subject: [PATCH v3 4/9] WIP: Add agg_retrieve_direct_batch() for plain
aggregates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Teach Agg to consume child tuples in batches for AGG_PLAIN. A new
agg_retrieve_direct_batch() pulls TupleBatch from the child via
ExecProcNodeBatch(), materializes as needed, and advances per-agg
transition state over the batch. A first tuple is copied to match
the direct path’s behavior before batch processing.
Add AggCanUsePlainBatch() and select retrieve_plain at init:
batch path when no grouping sets, strategy is AGG_PLAIN, and the
child exposes ExecProcNodeBatch(); otherwise keep the row path.
Plan shape and EXPLAIN remain unchanged. Semantics are identical
to the non-batch direct path; this only reduces per-tuple overhead.
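The init-time choice described above can be illustrated with a small model.
This is a sketch under simplifying assumptions, not the nodeAgg.c code: ToyAgg
and the toy_* functions are hypothetical names, the aggregate is a plain sum
over integers, and ncalls counts only child calls that returned data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyAgg ToyAgg;
typedef long (*ToyRetrieveFn) (ToyAgg *agg);

/* Hypothetical aggregate node state. */
struct ToyAgg
{
    const int  *rows;               /* child's output */
    int         nrows;
    int         batchsize;          /* rows per batch call */
    bool        has_grouping_sets;
    bool        child_has_batch;    /* does the child offer a batch call? */
    ToyRetrieveFn retrieve_plain;   /* chosen once, at init */
    int         ncalls;             /* child calls made by last retrieve */
};

/* Row path: one child call per input row. */
static long
toy_retrieve_row(ToyAgg *agg)
{
    long        trans = 0;

    agg->ncalls = 0;
    for (int i = 0; i < agg->nrows; i++)
    {
        agg->ncalls++;
        trans += agg->rows[i];
    }
    return trans;
}

/* Batch path: one child call per batch, same transition per row. */
static long
toy_retrieve_batch(ToyAgg *agg)
{
    long        trans = 0;

    assert(agg->batchsize > 0);
    agg->ncalls = 0;
    for (int start = 0; start < agg->nrows; start += agg->batchsize)
    {
        int         end = start + agg->batchsize;

        if (end > agg->nrows)
            end = agg->nrows;
        agg->ncalls++;
        for (int i = start; i < end; i++)
            trans += agg->rows[i];
    }
    return trans;
}

/* Pick the batch path only when it is known to be applicable. */
static void
toy_init_agg(ToyAgg *agg)
{
    if (!agg->has_grouping_sets && agg->child_has_batch)
        agg->retrieve_plain = toy_retrieve_batch;
    else
        agg->retrieve_plain = toy_retrieve_row;
}
```

Both paths produce the same result in this model; only the number of calls
into the child differs, which is the overhead the batch path aims to reduce.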
---
src/backend/executor/nodeAgg.c | 123 +++++++++++++++++++++++++++++++++
src/include/nodes/execnodes.h | 5 ++
2 files changed, 128 insertions(+)
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a4f3d30f307..3ace6363509 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -820,6 +820,20 @@ advance_aggregates(AggState *aggstate)
aggstate->tmpcontext);
}
+static void
+advance_aggregates_batch(AggState *aggstate, TupleBatch *b)
+{
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ ExprState *evaltrans = aggstate->phase->evaltrans;
+
+ while (TupleBatchHasMore(b))
+ {
+ tmpcontext->ecxt_outertuple = TupleBatchGetNextSlot(b);
+ ExecEvalExprNoReturnSwitchContext(evaltrans, tmpcontext);
+ ResetExprContext(tmpcontext);
+ }
+}
+
/*
* Run the transition function for a DISTINCT or ORDER BY aggregate
* with only one input. This is called after we have completed
@@ -2260,6 +2274,9 @@ ExecAgg(PlanState *pstate)
result = agg_retrieve_hash_table(node);
break;
case AGG_PLAIN:
+ /* init-time choice */
+ result = node->retrieve_plain(node);
+ break;
case AGG_SORTED:
result = agg_retrieve_direct(node);
break;
@@ -2618,6 +2635,91 @@ agg_retrieve_direct(AggState *aggstate)
return NULL;
}
+static TupleTableSlot *
+agg_retrieve_direct_batch(AggState *aggstate)
+{
+ PlanState *child = outerPlanState(aggstate);
+ ExprContext *econtext = aggstate->ss.ps.ps_ExprContext;
+ ExprContext *tmpcontext = aggstate->tmpcontext;
+ const bool hasGroupingSets = aggstate->phase->numsets > 0;
+ TupleTableSlot *firstSlot = aggstate->ss.ss_ScanTupleSlot;
+ TupleBatch *b = NULL;
+
+ Assert(child->ExecProcNodeBatch);
+
+ /* mimic the first-tuple copy from agg_retrieve_direct() */
+ for (;;)
+ {
+ b = ExecProcNodeBatch(child);
+ if (b == NULL)
+ {
+ if (hasGroupingSets)
+ {
+ aggstate->input_done = true;
+ break;
+ }
+ aggstate->agg_done = true;
+ break;
+ }
+ if (b->nvalid == 0)
+ continue;
+
+ TupleBatchMaterializeAll(b);
+ aggstate->grp_firstTuple = ExecCopySlotHeapTuple(TupleBatchGetSlot(b, 0));
+ break;
+ }
+
+ /* initialize_aggregates etc. as in the direct path */
+ ReScanExprContext(econtext);
+ for (int i = 0; i < Max(aggstate->phase->numsets, 1); i++)
+ ReScanExprContext(aggstate->aggcontexts[i]);
+
+ initialize_aggregates(aggstate, aggstate->pergroups,
+ Max(aggstate->phase->numsets, 1));
+
+ if (aggstate->grp_firstTuple)
+ {
+ ExecForceStoreHeapTuple(aggstate->grp_firstTuple, firstSlot, true);
+ aggstate->grp_firstTuple = NULL;
+ tmpcontext->ecxt_outertuple = firstSlot;
+
+ advance_aggregates_batch(aggstate, b);
+ ResetExprContext(tmpcontext);
+ }
+
+ /* consume remaining rows in current and subsequent batches */
+ if (b)
+ {
+ if (TupleBatchHasMore(b))
+ advance_aggregates_batch(aggstate, b);
+ for (;;)
+ {
+ b = ExecProcNodeBatch(child);
+ if (b == NULL)
+ {
+ if (hasGroupingSets)
+ aggstate->input_done = true;
+ else
+ aggstate->agg_done = true;
+ break;
+ }
+ if (b->nvalid == 0)
+ continue;
+
+ TupleBatchMaterializeAll(b);
+ advance_aggregates_batch(aggstate, b);
+ }
+ }
+
+ /* finalize and project like the direct path */
+ econtext->ecxt_outertuple = firstSlot;
+ prepare_projection_slot(aggstate, econtext->ecxt_outertuple, 0);
+ select_current_set(aggstate, 0, false);
+ finalize_aggregates(aggstate, aggstate->peragg, aggstate->pergroups[0]);
+
+ return project_aggregates(aggstate);
+}
+
/*
* ExecAgg for hashed case: read input and build hash table
*/
@@ -3265,6 +3367,22 @@ hashagg_reset_spill_state(AggState *aggstate)
}
}
+static bool
+AggCanUsePlainBatch(AggState *aggstate)
+{
+ const Agg *aggnode = (const Agg *) aggstate->ss.ps.plan;
+
+ Assert(outerPlanState(aggstate));
+
+ /* grouping sets present -> bail */
+ if (aggnode->groupingSets != NIL)
+ return false;
+
+ if (aggstate->phase->aggstrategy != AGG_PLAIN)
+ return false;
+
+ return outerPlanState(aggstate)->ExecProcNodeBatch != NULL;
+}
/* -----------------
* ExecInitAgg
@@ -4060,6 +4178,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
(errcode(ERRCODE_GROUPING_ERROR),
errmsg("aggregate function calls cannot be nested")));
+ if (AggCanUsePlainBatch(aggstate))
+ aggstate->retrieve_plain = agg_retrieve_direct_batch;
+ else
+ aggstate->retrieve_plain = agg_retrieve_direct;
+
/*
* Build expressions doing all the transition work at once. We build a
* different one for each phase, as the number of transition function
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a104591ac20..9b81b842161 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2535,6 +2535,9 @@ typedef struct AggStatePerGroupData *AggStatePerGroup;
typedef struct AggStatePerPhaseData *AggStatePerPhase;
typedef struct AggStatePerHashData *AggStatePerHash;
+struct AggState;
+typedef TupleTableSlot *(*AggRetrievePlainFn)(struct AggState *);
+
typedef struct AggState
{
ScanState ss; /* its first field is NodeTag */
@@ -2610,6 +2613,8 @@ typedef struct AggState
AggStatePerGroup *all_pergroups; /* array of first ->pergroups, than
* ->hash_pergroup */
SharedAggInfo *shared_info; /* one entry per worker */
+
+ AggRetrievePlainFn retrieve_plain; /* init-time choice */
} AggState;
/* ----------------
--
2.47.3
[application/octet-stream] v3-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch (9.0K, 9-v3-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch)
From 1ee09ba42c595d108356f78a46ea4e00a03ce123 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 22:18:30 +0900
Subject: [PATCH v3 3/9] Executor: add ExecProcNodeBatch() and integrate
SeqScan with batch API
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Introduce a batch-capable executor interface alongside the existing
slot-at-a-time path:
* ExecProcNodeBatch() is added to return a TupleBatch instead of a
TupleTableSlot. PlanState gains ExecProcNodeBatch as a function
pointer.
Integrate SeqScan with this interface:
* Add ExecSeqScanBatch* routines that drive heap via the batch table
AM API and return a TupleBatch.
* At init, set ps.ExecProcNodeBatch to these routines when
ScanCanUseBatching() allows.
* Retain ExecSeqScanBatchSlot* variants for slot-at-a-time consumers.
This builds on 0002, which introduced TupleBatch and made SeqScan
consume the AM’s batch API internally but still surface slots. With this
patch, SeqScan can surface batches directly to batch-aware upper nodes.
Plan shape and EXPLAIN output remain unchanged; only internal tuple flow
differs when batching is enabled and allowed.
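The function-pointer arrangement described above, where a first-call wrapper
replaces itself with either an instrumented or a direct entry point, can be
sketched in isolation. This is an illustration only: ToyNode and the toy_*
functions are hypothetical names, the batch is reduced to a row count, and the
first_calls counter stands in for the one-time stack-depth check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyNode ToyNode;
typedef int (*ToyBatchFn) (ToyNode *node);

/* Hypothetical plan-state node. */
struct ToyNode
{
    ToyBatchFn  batch;          /* what callers invoke */
    ToyBatchFn  batch_real;     /* the node's actual implementation */
    bool        instrument;     /* is instrumentation requested? */
    int         first_calls;    /* times the one-time wrapper ran */
    int         instr_rows;     /* rows counted by instrumentation */
};

/* A node implementation that always returns a batch of 4 rows. */
static int
toy_real(ToyNode *node)
{
    (void) node;
    return 4;
}

/* Instrumentation wrapper: count the rows in each returned batch. */
static int
toy_batch_instr(ToyNode *node)
{
    int         n = node->batch_real(node);

    node->instr_rows += n;
    return n;
}

/* First-call wrapper: do one-time work, then remove itself. */
static int
toy_batch_first(ToyNode *node)
{
    node->first_calls++;

    if (node->instrument)
        node->batch = toy_batch_instr;
    else
        node->batch = node->batch_real;

    return node->batch(node);
}

/* Install an implementation behind the first-call wrapper. */
static void
toy_set_batch(ToyNode *node, ToyBatchFn fn)
{
    node->batch_real = fn;
    node->batch = toy_batch_first;
}
```

After the first call the wrapper is out of the call path entirely, so the
one-time check costs nothing on later calls.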
---
src/backend/executor/execProcnode.c | 52 +++++++++++++++++++++++++++++
src/backend/executor/nodeSeqscan.c | 35 +++++++++++++++++++
src/include/executor/execScan.h | 51 ++++++++++++++++++++++++++++
src/include/executor/executor.h | 10 ++++++
src/include/nodes/execnodes.h | 5 +++
5 files changed, 153 insertions(+)
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f5f9cfbeead..a8c0315e874 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,6 +121,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
+static TupleBatch *ExecProcNodeBatchFirst(PlanState *node);
+static TupleBatch *ExecProcNodeBatchInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
@@ -389,6 +391,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
}
ExecSetExecProcNode(result, result->ExecProcNode);
+ if (result->ExecProcNodeBatch)
+ ExecSetExecProcNodeBatch(result, result->ExecProcNodeBatch);
/*
* Initialize any initPlans present in this node. The planner put them in
@@ -489,6 +493,54 @@ ExecProcNodeInstr(PlanState *node)
return result;
}
+/*
+ * ExecSetExecProcNodeBatch
+ * Install ExecProcNodeBatch with first-call wrapper, mirroring row path.
+ */
+void
+ExecSetExecProcNodeBatch(PlanState *node, ExecProcNodeBatchMtd function)
+{
+ node->ExecProcNodeBatchReal = function;
+ node->ExecProcNodeBatch = ExecProcNodeBatchFirst;
+}
+
+/*
+ * ExecProcNodeBatchFirst
+ * One-time stack-depth check; then pick instrument/no-instrument wrapper.
+ */
+static TupleBatch *
+ExecProcNodeBatchFirst(PlanState *node)
+{
+ check_stack_depth();
+
+ if (node->instrument)
+ node->ExecProcNodeBatch = ExecProcNodeBatchInstr;
+ else
+ node->ExecProcNodeBatch = node->ExecProcNodeBatchReal;
+
+ return node->ExecProcNodeBatch(node);
+}
+
+/*
+ * ExecProcNodeBatchInstr
+ * Instrumentation wrapper for batch calls.
+ *
+ * Note: we record the batch's nvalid as the "tuple" count for this call.
+ * That keeps instrumentation meaningful without changing the Instr API.
+ */
+static TupleBatch *
+ExecProcNodeBatchInstr(PlanState *node)
+{
+ TupleBatch *b;
+
+ InstrStartNode(node->instrument);
+
+ b = node->ExecProcNodeBatchReal(node);
+
+ InstrStopNode(node->instrument, b ? (double) b->nvalid : 0.0);
+
+ return b;
+}
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 2552d420f1c..a4cf1e51af0 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -334,6 +334,37 @@ ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
pstate->qual, pstate->ps_ProjInfo);
}
+static TupleBatch *
+ExecSeqScanBatch(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatch(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatch,
+ NULL, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatch() for when qual evaluation is required.
+ */
+static TupleBatch *
+ExecSeqScanBatchWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatch(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, NULL);
+}
+
/* Batch SeqScan enablement and dispatch */
static void
SeqScanInitBatching(SeqScanState *scanstate, int eflags)
@@ -348,10 +379,12 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
if (scanstate->ss.ps.ps_ProjInfo == NULL)
{
+ scanstate->ss.ps.ExecProcNodeBatch = ExecSeqScanBatch;
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
}
else
{
+ scanstate->ss.ps.ExecProcNodeBatch = NULL;
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
}
}
@@ -359,10 +392,12 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
if (scanstate->ss.ps.ps_ProjInfo == NULL)
{
+ scanstate->ss.ps.ExecProcNodeBatch = ExecSeqScanBatchWithQual;
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
}
else
{
+ scanstate->ss.ps.ExecProcNodeBatch = NULL;
scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
}
}
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index fec606471c8..fb4b57a831c 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -297,4 +297,55 @@ ExecScanExtendedBatchSlot(ScanState *node,
}
}
+static inline TupleBatch *
+ExecScanExtendedBatch(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+ int qualified;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next batch from the AM */
+ if (!accessBatchMtd(node))
+ return NULL;
+
+ if (qual != NULL)
+ {
+ qualified = 0;
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *in = TupleBatchGetNextSlot(b);
+
+ Assert(in);
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ if (ExecQual(qual, econtext))
+ {
+ TupleBatchStoreInOut(b, qualified, in);
+ qualified++;
+ }
+ else
+ InstrCountFiltered1(node, 1);
+ }
+ TupleBatchUseOutput(b, qualified);
+ }
+ else
+ qualified = b->nvalid;
+
+ if (qualified > 0)
+ return b;
+ /* else get the next batch from the AM */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 17258f7ae2d..cf5b0c7e05c 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -294,6 +294,7 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
*/
extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
+extern void ExecSetExecProcNodeBatch(PlanState *node, ExecProcNodeBatchMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
@@ -315,6 +316,15 @@ ExecProcNode(PlanState *node)
return node->ExecProcNode(node);
}
+
+static inline TupleBatch *
+ExecProcNodeBatch(PlanState *node)
+{
+ if (node->chgParam != NULL) /* something changed? */
+ ExecReScan(node); /* let ReScan handle this */
+
+ return node->ExecProcNodeBatch(node);
+}
#endif
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f4bb8f7dd7f..a104591ac20 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1147,6 +1147,7 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
/* Return a batch; may reuse caller-provided envelope. NULL => end of scan. */
struct TupleBatch;
typedef struct TupleBatch TupleBatch;
+typedef TupleBatch *(*ExecProcNodeBatchMtd)(struct PlanState *ps);
/* ----------------
* PlanState node
@@ -1171,6 +1172,10 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
+ /* Optional batch-producing entry point (NULL => no batching). */
+ ExecProcNodeBatchMtd ExecProcNodeBatch;
+ ExecProcNodeBatchMtd ExecProcNodeBatchReal;
+
Instrumentation *instrument; /* Optional runtime stats for this node */
WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
--
2.47.3
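For readers skimming the diff, the first-call wrapper that ExecSetExecProcNodeBatch() installs can be illustrated outside the executor. What follows is a minimal, self-contained sketch and not code from the patch: ToyNode and every toy_* name are hypothetical stand-ins, an int row count stands in for a TupleBatch pointer, and a counter stands in for check_stack_depth().

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these names exist in PostgreSQL. */
typedef struct ToyNode ToyNode;
typedef int (*ToyBatchFn) (ToyNode *node);

struct ToyNode
{
	ToyBatchFn	batch_fn;		/* entry point that callers invoke */
	ToyBatchFn	batch_fn_real;	/* the node's actual implementation */
	int			instrumented;	/* nonzero => record rows per call */
	int			first_calls;	/* how often the first-call hook ran */
	long		rows_seen;		/* rows recorded by the instrumentation */
};

/* Pretend implementation: every call yields a batch of 3 rows. */
static int
toy_real(ToyNode *node)
{
	(void) node;
	return 3;
}

/* Instrumentation wrapper: record the batch's row count, then pass it on. */
static int
toy_instr(ToyNode *node)
{
	int			n = node->batch_fn_real(node);

	node->rows_seen += n;
	return n;
}

/* First-call hook: do the one-time work, then replace itself. */
static int
toy_first(ToyNode *node)
{
	node->first_calls++;		/* stands in for check_stack_depth() */
	node->batch_fn = node->instrumented ? toy_instr : node->batch_fn_real;
	return node->batch_fn(node);
}

/* Install 'fn' behind the first-call hook. */
static void
toy_set_batch_fn(ToyNode *node, ToyBatchFn fn)
{
	node->batch_fn_real = fn;
	node->batch_fn = toy_first;
}

/* Fetch 'calls' batches through the entry point; return total rows. */
static long
toy_run(ToyNode *node, int calls)
{
	long		total = 0;

	for (int i = 0; i < calls; i++)
		total += node->batch_fn(node);
	return total;
}
```

The point of the indirection is that the one-time check and the instrumented/non-instrumented decision cost nothing after the first call, because the entry pointer has by then been overwritten with the final target.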
[application/octet-stream] v3-0002-SeqScan-add-batch-driven-variants-returning-slots.patch (27.2K, 10-v3-0002-SeqScan-add-batch-driven-variants-returning-slots.patch)
From dac7cf1cd2a01347faf6b7fab3107c08da88ac90 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 21:59:56 +0900
Subject: [PATCH v3 2/9] SeqScan: add batch-driven variants returning slots
Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per-tuple AM crossings without
changing the node interface seen by parents.
Add TupleBatch and supporting code in execBatch.c/h to hold executor
side batching state. PlanState gains ps_Batch to carry the active
TupleBatch when a node supports batching.
Wire up runtime selection in ExecInitSeqScan using
ScanCanUseBatching(). When executor_batching is enabled, EPQ is
inactive, the scan is not backward, and the relation supports
batching, ps.ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used.
Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled and allowed.
Notes / current limits:
- Batching uses EXEC_BATCH_ROWS (currently 64) as the target capacity.
- With the current heapam, batches are composed from a single page, so
the batch may not always be full. Future work may let SeqScan and/or
AMs top up batches across pages when safe to do so.
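
To illustrate what "batch-driven but still returning one slot at a time" means, here is a toy sketch; it is not code from this patch. ToyScan and the toy_* functions are invented for illustration, integers stand in for tuples, and the batch and page sizes are shrunk so that the partially filled batches mentioned in the notes above are easy to see.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BATCH_ROWS 4		/* the patch uses EXEC_BATCH_ROWS = 64 */
#define TOY_PAGE_ROWS  3		/* rows per pretend page */

typedef struct ToyScan
{
	int			next_row;		/* next row the pretend AM will hand out */
	int			total_rows;		/* rows in the pretend table */
	int			am_calls;		/* how many times the AM was entered */
	int			rows[TOY_BATCH_ROWS];	/* current batch */
	int			nvalid;			/* rows in the current batch */
	int			next;			/* next row of the batch to return */
} ToyScan;

/* Refill the batch from the pretend AM; a batch never crosses a page. */
static bool
toy_next_batch(ToyScan *scan)
{
	int			page_left = TOY_PAGE_ROWS - (scan->next_row % TOY_PAGE_ROWS);
	int			n = 0;

	scan->am_calls++;
	while (n < TOY_BATCH_ROWS && n < page_left &&
		   scan->next_row < scan->total_rows)
		scan->rows[n++] = scan->next_row++;

	scan->nvalid = n;
	scan->next = 0;
	return n > 0;
}

/* Row-at-a-time interface layered on batches; returns false at end of scan. */
static bool
toy_next_row(ToyScan *scan, int *row)
{
	/* Only enter the AM when the current batch is exhausted. */
	if (scan->next >= scan->nvalid && !toy_next_batch(scan))
		return false;

	*row = scan->rows[scan->next++];
	return true;
}

/* Drain the scan through the row-at-a-time interface; return the row count. */
static int
toy_count_rows(ToyScan *scan)
{
	int			row;
	int			count = 0;

	while (toy_next_row(scan, &row))
		count++;
	return count;
}
```

In this toy, a 7-row table is read with 4 AM calls instead of 8 (one per row plus the end-of-scan call); the caller still sees one row per call, which is the property the commit message relies on to keep the parent-visible interface unchanged.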
---
src/backend/access/heap/heapam.c | 29 ++++
src/backend/access/heap/heapam_handler.c | 15 ++
src/backend/access/table/tableam.c | 11 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execBatch.c | 117 ++++++++++++++
src/backend/executor/execScan.c | 31 ++++
src/backend/executor/meson.build | 1 +
src/backend/executor/nodeSeqscan.c | 176 +++++++++++++++++++++-
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_parameters.dat | 7 +
src/include/access/heapam.h | 1 +
src/include/access/tableam.h | 27 ++++
src/include/executor/execBatch.h | 102 +++++++++++++
src/include/executor/execScan.h | 54 +++++++
src/include/executor/executor.h | 4 +
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 8 +
17 files changed, 587 insertions(+), 1 deletion(-)
create mode 100644 src/backend/executor/execBatch.c
create mode 100644 src/include/executor/execBatch.h
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8b9a80449c1..355ddd9838d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1570,6 +1570,35 @@ heap_begin_batch(TableScanDesc sscan, int maxitems)
return hb;
}
+/*
+ * heap_materialize_batch_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ Assert(n <= hb->nitems);
+
+ for (int i = 0; i < n; i++)
+ {
+ HeapTupleData *tuple = &hb->tupdata[i];
+ HeapTupleTableSlot *slot = (HeapTupleTableSlot *) slots[i];
+
+ /* Inline of ExecStoreHeapTuple(tuple, slot, false) */
+ slot->tuple = tuple;
+ slot->off = 0;
+ slot->base.tts_nvalid = 0;
+ slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
+ slot->base.tts_tid = tuple->t_self;
+ slot->base.tts_tableOid = tuple->t_tableOid;
+ slot->base.tts_flags &= ~(TTS_FLAG_SHOULDFREE | TTS_FLAG_EMPTY);
+ }
+}
+
/*
* heap_scan_end_batch
*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ec4eeccf19c..8e88cc9e8f1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -72,6 +72,20 @@ heapam_slot_callbacks(Relation relation)
return &TTSOpsBufferHeapTuple;
}
+/* ------------------------------------------------------------------------
+ * TupleBatch related callbacks for heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static const TupleBatchOps TupleBatchHeapOps = {
+ .materialize_all = heap_materialize_batch_all
+};
+
+static const TupleBatchOps *
+heapam_batch_callbacks(Relation relation)
+{
+ return &TupleBatchHeapOps;
+}
/* ------------------------------------------------------------------------
* Index Scan Callbacks for heap AM
@@ -2617,6 +2631,7 @@ static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
.slot_callbacks = heapam_slot_callbacks,
+ .batch_callbacks = heapam_batch_callbacks,
.scan_begin = heap_beginscan,
.scan_end = heap_endscan,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5e41404937e..5a8ebb8b97c 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -103,6 +103,17 @@ table_slot_create(Relation relation, List **reglist)
return slot;
}
+/* ----------------------------------------------------------------------------
+ * TupleBatch support routines
+ * ----------------------------------------------------------------------------
+ */
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+ if (relation->rd_tableam)
+ return relation->rd_tableam->batch_callbacks(relation);
+ elog(ERROR, "relation does not support TupleBatch operations");
+}
/* ----------------------------------------------------------------------------
* Table scan functions.
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..3e72f3fe03c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
execAsync.o \
+ execBatch.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
new file mode 100644
index 00000000000..007ae535687
--- /dev/null
+++ b/src/backend/executor/execBatch.c
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.c
+ * Helpers for TupleBatch
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "executor/execBatch.h"
+
+/*
+ * TupleBatchCreate
+ * Allocate and initialize a new TupleBatch envelope.
+ */
+TupleBatch *
+TupleBatchCreate(TupleDesc scandesc, int capacity)
+{
+ TupleBatch *b;
+ TupleTableSlot **inslots,
+ **outslots;
+
+ inslots = palloc(sizeof(TupleTableSlot *) * capacity);
+ outslots = palloc(sizeof(TupleTableSlot *) * capacity);
+ for (int i = 0; i < capacity; i++)
+ inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);
+
+ b = (TupleBatch *) palloc(sizeof(TupleBatch));
+
+ /* Initial state: empty envelope */
+ b->am_payload = NULL;
+ b->ntuples = 0;
+ b->inslots = inslots;
+ b->outslots = outslots;
+ b->activeslots = NULL;
+ b->ops = NULL;
+ b->maxslots = capacity;
+
+ b->nvalid = 0;
+ b->next = 0;
+
+ return b;
+}
+
+/*
+ * TupleBatchReset
+ * Reset an existing TupleBatch envelope to empty.
+ */
+void
+TupleBatchReset(TupleBatch *b, bool drop_slots)
+{
+ if (b == NULL)
+ return;
+
+ for (int i = 0; i < b->maxslots; i++)
+ {
+ ExecClearTuple(b->inslots[i]);
+ if (drop_slots)
+ ExecDropSingleTupleTableSlot(b->inslots[i]);
+ }
+
+ if (drop_slots)
+ {
+ pfree(b->inslots);
+ pfree(b->outslots);
+ b->inslots = b->outslots = NULL;
+ }
+
+ b->ntuples = 0;
+ b->nvalid = 0;
+ b->next = 0;
+ b->activeslots = NULL;
+}
+
+void
+TupleBatchUseInput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->inslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+void
+TupleBatchUseOutput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->outslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+bool
+TupleBatchIsValid(TupleBatch *b)
+{
+ return b != NULL &&
+ b->maxslots > 0 &&
+ b->inslots != NULL &&
+ b->outslots != NULL;
+}
+
+void
+TupleBatchRewind(TupleBatch *b)
+{
+ b->next = 0;
+}
+
+int
+TupleBatchGetNumValid(TupleBatch *b)
+{
+ return b->nvalid;
+}
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 90726949a87..f24c5d73ae1 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -18,6 +18,7 @@
*/
#include "postgres.h"
+#include "access/tableam.h"
#include "executor/executor.h"
#include "executor/execScan.h"
#include "miscadmin.h"
@@ -154,3 +155,33 @@ ExecScanReScan(ScanState *node)
}
}
}
+
+bool
+ScanCanUseBatching(ScanState *scanstate, int eflags)
+{
+ Relation relation = scanstate->ss_currentRelation;
+
+ return executor_batching &&
+ (scanstate->ps.state->es_epq_active == NULL) &&
+ !(eflags & EXEC_FLAG_BACKWARD) &&
+ relation && table_supports_batching(relation);
+}
+
+void
+ScanResetBatching(ScanState *scanstate, bool drop)
+{
+ TupleBatch *b = scanstate->ps.ps_Batch;
+
+ if (b)
+ {
+ TupleBatchReset(b, drop);
+ if (b->am_payload)
+ {
+ table_scan_end_batch(scanstate->ss_currentScanDesc,
+ b->am_payload);
+ b->am_payload = NULL;
+ }
+ if (drop)
+ pfree(b);
+ }
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..40ffc28f3cb 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'execAmi.c',
'execAsync.c',
+ 'execBatch.c',
'execCurrent.c',
'execExpr.c',
'execExprInterp.c',
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 94047d29430..2552d420f1c 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -203,6 +203,171 @@ ExecSeqScanEPQ(PlanState *pstate)
(ExecScanRecheckMtd) SeqRecheck);
}
+/* ----------------------------------------------------------------
+ * Batch Support
+ * ----------------------------------------------------------------
+ */
+static inline bool
+SeqNextBatch(SeqScanState *node)
+{
+ TableScanDesc scandesc;
+ EState *estate;
+ ScanDirection direction;
+
+ Assert(node->ss.ps.ps_Batch != NULL);
+
+ /*
+ * get information from the estate and scan state
+ */
+ scandesc = node->ss.ss_currentScanDesc;
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ Assert(direction == ForwardScanDirection);
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the scan is not parallel, or if we're serially
+ * executing a scan that was planned to be parallel.
+ */
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ node->ss.ss_currentScanDesc = scandesc;
+ }
+
+ /* Lazily create the AM batch payload. */
+ if (node->ss.ps.ps_Batch->am_payload == NULL)
+ {
+ const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+ Assert(tam && tam->scan_begin_batch);
+ node->ss.ps.ps_Batch->am_payload =
+ table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+ node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+ }
+
+ node->ss.ps.ps_Batch->ntuples =
+ table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
+ node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
+ node->ss.ps.ps_Batch->materialized = false;
+
+ return node->ss.ps.ps_Batch->ntuples > 0;
+}
+
+static inline bool
+SeqNextBatchMaterialize(SeqScanState *node)
+{
+ if (SeqNextBatch(node))
+ {
+ TupleBatchMaterializeAll(node->ss.ps.ps_Batch);
+ return true;
+ }
+
+ return false;
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtendedBatchSlot().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScan() but when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScan() but when qual evaluation and projection are
+ * required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, pstate->ps_ProjInfo);
+}
+
+/* Batch SeqScan enablement and dispatch */
+static void
+SeqScanInitBatching(SeqScanState *scanstate, int eflags)
+{
+ const int cap = EXEC_BATCH_ROWS;
+ TupleDesc scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+
+ scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+
+ /* Choose batch variant to preserve the existing specialization matrix */
+ if (scanstate->ss.ps.qual == NULL)
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+ }
+ }
+ else
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecInitSeqScan
* ----------------------------------------------------------------
@@ -211,6 +376,7 @@ SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
SeqScanState *scanstate;
+ bool use_batching;
/*
* Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -241,9 +407,12 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
node->scan.scanrelid,
eflags);
+ use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
+
/* and create slot with the appropriate rowtype */
ExecInitScanTupleSlot(estate, &scanstate->ss,
RelationGetDescr(scanstate->ss.ss_currentRelation),
+ use_batching ? &TTSOpsHeapTuple :
table_slot_callbacks(scanstate->ss.ss_currentRelation));
/*
@@ -280,6 +449,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
+ if (use_batching)
+ SeqScanInitBatching(scanstate, eflags);
+
return scanstate;
}
@@ -299,6 +471,8 @@ ExecEndSeqScan(SeqScanState *node)
*/
scanDesc = node->ss.ss_currentScanDesc;
+ ScanResetBatching(&node->ss, true);
+
/*
* close heap scan
*/
@@ -327,7 +501,7 @@ ExecReScanSeqScan(SeqScanState *node)
if (scan != NULL)
table_rescan(scan, /* scan desc */
NULL); /* new scan keys */
-
+ ScanResetBatching(&node->ss, false);
ExecScanReScan((ScanState *) node);
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..b4a0996a717 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int notify_buffers = 16;
int serializable_buffers = 32;
int subtransaction_buffers = 0;
int transaction_buffers = 0;
+
+/* executor batching */
+bool executor_batching = false;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index b176d5130e4..a4bc8c10cc2 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -887,6 +887,13 @@
boot_val => 'true',
},
+{ name => 'executor_batching', type => 'bool', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+ short_desc => 'Use tuple batching during execution.',
+ flags => 'GUC_NOT_IN_SAMPLE',
+ variable => 'executor_batching',
+ boot_val => 'true',
+},
+
{ name => 'data_sync_retry', type => 'bool', context => 'PGC_POSTMASTER', group => 'ERROR_HANDLING_OPTIONS',
short_desc => 'Whether to continue running after a failure to sync data files.',
variable => 'data_sync_retry',
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 02f7793fba0..13ce6166ec3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -314,6 +314,7 @@ extern bool heap_getnextslot(TableScanDesc sscan,
extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+extern void heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n);
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 953207eac50..05f828b9762 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "commands/vacuum.h"
+#include "executor/execBatch.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -39,6 +40,7 @@ typedef struct BulkInsertStateData BulkInsertStateData;
typedef struct IndexInfo IndexInfo;
typedef struct SampleScanState SampleScanState;
typedef struct ValidateIndexState ValidateIndexState;
+typedef struct TupleBatchOps TupleBatchOps;
/*
* Bitmask values for the flags argument to the scan_begin callback.
@@ -301,6 +303,7 @@ typedef struct TableAmRoutine
* Return slot implementation suitable for storing a tuple of this AM.
*/
const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
+ const TupleBatchOps *(*batch_callbacks)(Relation rel);
/* ------------------------------------------------------------------------
@@ -361,6 +364,7 @@ typedef struct TableAmRoutine
ScanDirection dir);
void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -872,6 +876,16 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
*/
extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
+/* ----------------------------------------------------------------------------
+ * TupleBatch functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Returns callbacks for manipulating TupleBatch for tuples of the given
+ * relation.
+ */
+extern const TupleBatchOps *table_batch_callbacks(Relation relation);
/* ----------------------------------------------------------------------------
* Table scan functions.
@@ -1046,6 +1060,18 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_supports_batching
+ * Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation)
+{
+ const TableAmRoutine *tam = relation->rd_tableam;
+
+ return tam->scan_getnextbatch != NULL;
+}
+
/*
* table_scan_begin_batch
* Allocate AM-owned batch payload with capacity 'maxitems'.
@@ -2116,5 +2142,6 @@ extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
*/
extern const TableAmRoutine *GetHeapamTableAmRoutine(void);
+extern struct TupleBatchOps *GetHeapamTupleBatchOps(void);
#endif /* TABLEAM_H */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
new file mode 100644
index 00000000000..6f1a38d14bd
--- /dev/null
+++ b/src/include/executor/execBatch.h
@@ -0,0 +1,102 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.h
+ * Executor batch envelope for passing tuple batch state upward
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECBATCH_H
+#define EXECBATCH_H
+
+#include "executor/tuptable.h"
+
+/* XXX fixed 64 for PoC */
+#define EXEC_BATCH_ROWS 64
+
+/*
+ * TupleBatchOps -- AM-specific helpers for lazy materialization.
+ */
+typedef struct TupleBatchOps
+{
+ void (*materialize_all)(void *am_payload,
+ TupleTableSlot **dst,
+ int maxslots);
+} TupleBatchOps;
+
+/*
+ * TupleBatch
+ *
+ * Envelope for a batch of tuples produced by a plan node (e.g., SeqScan) per
+ * call to a batch variant of ExecSeqScan().
+ */
+typedef struct TupleBatch
+{
+ void *am_payload;
+ const TupleBatchOps *ops;
+ int ntuples; /* number of tuples in am_payload */
+ bool materialized; /* tuples in slots valid? */
+ struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
+ struct TupleTableSlot **outslots; /* slots for tuples going "out of"
+ * batch */
+ struct TupleTableSlot **activeslots;
+ int maxslots;
+
+ int nvalid; /* number of returnable tuples in activeslots */
+ int next; /* 0-based index of next tuple to be returned */
+} TupleBatch;
+
+
+/* Helpers */
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
+extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
+extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
+extern bool TupleBatchIsValid(TupleBatch *b);
+extern void TupleBatchRewind(TupleBatch *b);
+extern int TupleBatchGetNumValid(TupleBatch *b);
+
+static inline TupleTableSlot *
+TupleBatchGetNextSlot(TupleBatch *b)
+{
+ return b->next < b->nvalid ? b->activeslots[b->next++] : NULL;
+}
+
+static inline TupleTableSlot *
+TupleBatchGetSlot(TupleBatch *b, int index)
+{
+ Assert(index < b->nvalid);
+ return b->activeslots[index];
+}
+
+static inline void
+TupleBatchStoreInOut(TupleBatch *b, int index, TupleTableSlot *out)
+{
+ Assert(TupleBatchIsValid(b));
+ b->outslots[index] = out;
+}
+
+static inline bool
+TupleBatchHasMore(TupleBatch *b)
+{
+ return b->activeslots && b->next < b->nvalid;
+}
+
+static inline void
+TupleBatchMaterializeAll(TupleBatch *b)
+{
+ if (b->materialized)
+ return;
+
+ if (b->ops == NULL || b->ops->materialize_all == NULL)
+ elog(ERROR, "TupleBatch has no slots and no materialize_all op");
+
+ b->ops->materialize_all(b->am_payload, b->inslots, b->ntuples);
+ TupleBatchUseInput(b, b->ntuples);
+}
+
+#endif /* EXECBATCH_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index 837ea7785bb..fec606471c8 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -243,4 +243,58 @@ ExecScanExtended(ScanState *node,
}
}
+static inline TupleTableSlot *
+ExecScanExtendedBatchSlot(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ TupleTableSlot *in;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next input slot from current batch, or refill */
+ if (!TupleBatchHasMore(b))
+ {
+ if (!accessBatchMtd(node))
+ return NULL;
+ }
+
+ in = TupleBatchGetNextSlot(b);
+ Assert(in);
+
+ /* No qual, no projection: direct return */
+ if (qual == NULL && projInfo == NULL)
+ return in;
+
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ /* Qual only */
+ if (projInfo == NULL)
+ {
+ if (qual == NULL || ExecQual(qual, econtext))
+ return in;
+ else
+ InstrCountFiltered1(node, 1);
+ continue;
+ }
+
+ /* Projection (with or without qual) */
+ if (qual == NULL || ExecQual(qual, econtext))
+ return ExecProject(projInfo);
+ else
+ InstrCountFiltered1(node, 1);
+ /* else try next tuple */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 3248e78cd28..17258f7ae2d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -575,12 +575,16 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
*/
typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanAccessBatchMtd)(ScanState *node);
extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
ExecScanRecheckMtd recheckMtd);
+
extern void ExecAssignScanProjectionInfo(ScanState *node);
extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
extern void ExecScanReScan(ScanState *node);
+extern bool ScanCanUseBatching(ScanState *scanstate, int eflags);
+extern void ScanResetBatching(ScanState *scanstate, bool drop);
/*
* prototypes from functions in execTuples.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..b8e7afda57c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -287,6 +287,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
extern PGDLLIMPORT int VacuumCostBalance;
extern PGDLLIMPORT bool VacuumCostActive;
+extern PGDLLIMPORT bool executor_batching;
/* in utils/misc/stack_depth.c */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a36653c37f9..f4bb8f7dd7f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -30,6 +30,7 @@
#define EXECNODES_H
#include "access/tupconvert.h"
+#include "executor/execBatch.h"
#include "executor/instrument.h"
#include "fmgr.h"
#include "lib/ilist.h"
@@ -1143,6 +1144,10 @@ typedef struct JsonExprState
*/
typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+/* Return a batch; may reuse caller-provided envelope. NULL => end of scan. */
+struct TupleBatch;
+typedef struct TupleBatch TupleBatch;
+
/* ----------------
* PlanState node
*
@@ -1198,6 +1203,9 @@ typedef struct PlanState
ExprContext *ps_ExprContext; /* node's expression-evaluation context */
ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */
+ /* Batching state if node supports it. */
+ TupleBatch *ps_Batch;
+
bool async_capable; /* true if node is async-capable */
/*
--
2.47.3
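The inslots/outslots/activeslots arrangement in execBatch.h, as used by the qual loop in ExecScanExtendedBatch() elsewhere in this series, can be illustrated roughly as follows. This is a hedged sketch under simplifying assumptions and not code from the patches: ToyBatch and the toy_* helpers are hypothetical, and pointers to int stand in for TupleTableSlot pointers.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_CAP 8

typedef struct ToyBatch
{
	int		   *inslots[TOY_CAP];	/* rows as delivered by the scan */
	int		   *outslots[TOY_CAP];	/* rows that passed the filter */
	int		  **active;			/* array that consumers iterate over */
	int			nvalid;			/* number of rows in 'active' */
	int			next;			/* iteration cursor into 'active' */
} ToyBatch;

typedef int (*ToyQual) (int value);

/* Make the input array the one consumers read from. */
static void
toy_use_input(ToyBatch *b, int n)
{
	b->active = b->inslots;
	b->nvalid = n;
	b->next = 0;
}

/* Make the output array the one consumers read from. */
static void
toy_use_output(ToyBatch *b, int n)
{
	b->active = b->outslots;
	b->nvalid = n;
	b->next = 0;
}

/* Return the next active row, or NULL when the batch is exhausted. */
static int *
toy_get_next(ToyBatch *b)
{
	return b->next < b->nvalid ? b->active[b->next++] : NULL;
}

/* Point the input slots at 'values' and activate them. */
static void
toy_load(ToyBatch *b, int *values, int n)
{
	assert(n <= TOY_CAP);
	for (int i = 0; i < n; i++)
		b->inslots[i] = &values[i];
	toy_use_input(b, n);
}

/*
 * Walk the input rows, copy pointers to qualifying rows densely into
 * outslots, then switch the batch over to the output array.
 */
static int
toy_filter(ToyBatch *b, ToyQual qual)
{
	int			qualified = 0;
	int		   *in;

	while ((in = toy_get_next(b)) != NULL)
	{
		if (qual(*in))
			b->outslots[qualified++] = in;
	}
	toy_use_output(b, qualified);
	return qualified;
}

static int
toy_is_even(int v)
{
	return v % 2 == 0;
}
```

Only pointers move during filtering; the rows themselves stay where the scan put them, and a consumer that iterates the batch afterwards sees a dense array of qualifying rows without needing to know that a filter ran.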
* Re: Batching in executor
@ 2025-10-27 07:24 Amit Langote <[email protected]>
parent: Tomas Vondra <[email protected]>
3 siblings, 1 reply; 29+ messages in thread
From: Amit Langote @ 2025-10-27 07:24 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers
Hi Tomas,
On Mon, Sep 29, 2025 at 8:01 PM Tomas Vondra <[email protected]> wrote:
>
> Hi Amit,
>
> Thanks for the patch. I took a look over the weekend, and did a couple of
> experiments / benchmarks, so let me share some initial feedback (or
> rather a bunch of questions I came up with).
Thank you for reviewing the patch and taking the time to run those
experiments. I appreciate the detailed feedback and questions. I also
apologize for my late reply; I spent perhaps way too much time going
over your index prefetching thread trying to understand the notion of
batching that it uses and getting sidelined by other things while
writing this reply.
> I'll start with some general thoughts, before going into some nitpicky
> comments about patches / code and perf results.
>
> I think the general goal of the patch - reducing the per-tuple overhead
> and making the executor more efficient for OLAP workloads - is very
> desirable. I believe the limitations of per-row executor are one of the
> reasons why attempts to implement a columnar TAM mostly failed. The
> compression is nice, but it's hard to be competitive without an executor
> that leverages that too. So starting with an executor, in a way that
> helps even heap, seems like a good plan. So +1 to this.
I'm happy to hear that you find the overall direction worthwhile.
> While looking at the patch, I couldn't help but think about the index
> prefetching stuff that I work on. It also introduces the concept of a
> "batch", for passing data between an index AM and the executor. It's
> interesting how different the designs are in some respects. I'm not
> saying one of those designs is wrong, it's more due to different goals.
>
> For example, the index prefetching patch establishes a "shared" batch
> struct, and the index AM is expected to fill it with data. After that,
> the batch is managed entirely by indexam.c, with no AM calls. The only
> AM-specific bit in the batch is "position", but that's used only when
> advancing to the next page, etc.
>
> This patch does things differently. IIUC, each TAM may produce its own
> "batch", which is then wrapped in a generic one. For example, heap
> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> is fine. In the prefetching we chose to move all this code (walking the
> batch items) from the AMs into the layer above, and make it AM agnostic.
Yes, the design of this patch does differ from the index prefetching
approach, and that’s largely due to the differing goals as you say.
AIUI, the index prefetching patch uses a shared batch structure
managed mostly by indexam.c and populated by the index AM. In my
patch, each table AM produces its own batch format that gets wrapped
in a generic TupleBatch which contains the AM-specified TupleBatchOps
for operations on the AM's opaque data. This was a conscious choice:
in prefetching, the aim seems to be to make indexam.c manage batches
and operations based on it in a mostly AM-agnostic manner. But for
executor batching, the aim is to retain TAM-specific optimizations as
much as possible and rely on the TAM for most operations on the batch
contents. Both designs have their merits given their respective use
cases, but I guess you understand that very well.
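As a rough illustration of the wrapping described above, here is a toy,
standalone sketch of a generic envelope that carries an opaque AM
payload plus AM-supplied callbacks. All names (DemoBatch, DemoBatchOps,
ArrayPayload, demo_batch_sum) are made up for the example and are not
the TupleBatch / TupleBatchOps definitions from the patch; the "AM" here
is just an array of longs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; not the definitions used in the patch. */
typedef struct DemoBatch DemoBatch;

/* Callbacks the "AM" supplies so generic code can read its batch. */
typedef struct DemoBatchOps
{
    int  (*ntuples)(const DemoBatch *b);            /* rows in the batch */
    long (*get_value)(const DemoBatch *b, int row); /* decode one row */
} DemoBatchOps;

/* Generic envelope: callbacks plus an opaque, AM-specific payload. */
struct DemoBatch
{
    const DemoBatchOps *ops;
    void               *am_payload;
};

/* A toy "AM" whose native batch format is a plain array of longs. */
typedef struct ArrayPayload
{
    int         n;
    const long *vals;
} ArrayPayload;

static int
array_ntuples(const DemoBatch *b)
{
    return ((const ArrayPayload *) b->am_payload)->n;
}

static long
array_get_value(const DemoBatch *b, int row)
{
    const ArrayPayload *p = (const ArrayPayload *) b->am_payload;

    assert(row >= 0 && row < p->n);
    return p->vals[row];
}

static const DemoBatchOps array_ops = {array_ntuples, array_get_value};

/* "Executor-side" code: sums a batch without knowing the payload layout. */
static long
demo_batch_sum(const DemoBatch *b)
{
    long sum = 0;
    int  n = b->ops->ntuples(b);

    for (int i = 0; i < n; i++)
        sum += b->ops->get_value(b, i);
    return sum;
}
```

The point of the indirection is that demo_batch_sum() never looks inside
am_payload, so a different "AM" could plug in a compressed or columnar
payload behind the same two callbacks.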
> But for the batching, we want to retain the custom format as long as
> possible. Presumably, the various advantages of the TAMs are tied to the
> custom/columnar storage format. Memory efficiency thanks to compression,
> execution on compressed data, etc. Keeping the custom format as long as
> possible is the whole point of "late materialization" (and materializing
> as late as possible is one of the important details in column stores).
Exactly -- keeping the TAM-specific batch format as long as possible
is a key goal here. As you noted, the benefits of a custom storage
format (compression, operating on compressed data, etc.) are best
realized when we delay materialization until absolutely necessary. I
want to design this patch so that each TAM can produce and use its own
batch representation internally, only wrapping it when interfacing
with the executor in a generic way. I admit that's not entirely true
with the patch as it stands, as I describe below.
> How far ahead have you thought about these capabilities? I was wondering
> about two things in particular. First, at which point do we have to
> "materialize" the TupleBatch into some generic format (e.g. TupleSlots).
> I get it that you want to enable passing batches between nodes, but
> would those use the same "format" as the underlying scan node, or some
> generic one? Second, will it be possible to execute expressions on the
> custom batches (i.e. on "compressed data")? Or is it necessary to
> "materialize" the batch into regular tuple slots? I realize those may
> not be there "now" but maybe it'd be nice to plan for the future.
I have been thinking about those future capabilities. Currently, the
patch keeps tuples in the TAM-specific batch format up until they need
to be consumed by a node that doesn’t understand that format or has
not been modified to invoke the TAM callbacks to decode it. In the
current patch, that means we materialize to regular TupleTableSlots at
nodes that require it (for example, the scan node reading from TAM
needing to evaluate quals, etc.). However, the intention is to allow
batches to be passed through as many nodes as possible without
materialization, ideally using the same format produced by the scan
node all the way up until reaching a node that can only work with
tuples in TupleTableSlots.
As for executing expressions directly on the custom batch data: that’s
something I would like to enable in the future. Right now, expressions
(quals, projections, etc.) are evaluated after materializing into
normal tuples in TupleTableSlots stored in TupleBatch, because the
expression evaluation code isn’t yet totally batch-aware and is very
far from doing things like operating on compressed data in its native form.
Patches 0004-0008 do try to add batch-aware expression evaluation but
that's just a prototype. In the long term, the goal is to allow
expression evaluation on batch data (for example, applying a WHERE
clause or aggregate transition directly on a columnar batch without
converting it to heap tuples first). This will require significant new
infrastructure (perhaps specialized batch-aware expression operators
and functions), so it's not in the current patch, but I agree it's
important to plan for it. The current design doesn’t preclude it -- it
lays some groundwork by introducing the batch abstraction -- but fully
supporting that will be future work.
That said, one area I’d like to mention while at it, especially to
enable native execution on compressed or columnar batches, is giving
the table AM more control over how expression evaluation is performed
on its batch data. In the current patch, the AM can provide a
materialize function via TupleBatchOps, but that always produces an
array of TupleTableSlots stored in the TupleBatch, not an opaque
representation that remains under AM control. Maybe that's not bad for
a v1 patch. When evaluating expressions over a batch, a BatchVector
is built by looping over these slots and invoking the standard
per-tuple getsomeattrs() to "deform" a tuple into needed columns.
While that enables batch-style EEOPs for qual evaluation and aggregate
transition (and is already a gain over per-row evaluation), it misses
the opportunity to leverage any batch-specific optimizations the AM
could offer, such as vectorized decoding or filtering over compressed
data, and other AM optimizations for getting only the necessary
columns out possibly in a vector format.
I’m considering extending TupleTableSlotOps with a batch-aware variant
of getsomeattrs(), something like slot_getsomeattrs_batch(), so that
AMs can populate column vectors (e.g., BatchVector) directly from
their native format. That would allow bypassing slot materialization
entirely and plug AM-provided decoding logic directly into the
executor’s batch expression paths. This isn’t implemented yet, but I
see it as a necessary step toward supporting fully native expression
evaluation over compressed or columnar formats. I’m not yet sure if
TupleTableSlotOps is the right place for such a hook; it might belong
elsewhere in the abstraction, but exposing a batch-aware interface for
this purpose seems like the right direction.
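To illustrate the difference between the two paths, here is a toy,
standalone sketch. It is not code from the patch: DemoVector, DemoRow,
DemoColumnarBatch and the demo_* functions are hypothetical, and the
real BatchVector and getsomeattrs() are more involved. The first
function fills a column vector with one call per row, the way the
current prototype loops over slots; the second shows what an
AM-provided batch-aware hook could do if the AM already stores the
column contiguously.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define VEC_MAX_ROWS 64     /* mirrors the current 64-tuple batch limit */
#define DEMO_NATTS   3

/* Hypothetical column vector; the patch's BatchVector may differ. */
typedef struct DemoVector
{
    int  nrows;
    long values[VEC_MAX_ROWS];
    bool isnull[VEC_MAX_ROWS];
} DemoVector;

/* Toy row-oriented tuple. */
typedef struct DemoRow
{
    long attrs[DEMO_NATTS];
    bool nulls[DEMO_NATTS];
} DemoRow;

/* Per-row accessor, standing in for a per-tuple deform call. */
static void
demo_row_getattr(const DemoRow *row, int attno, long *value, bool *isnull)
{
    *value = row->attrs[attno];
    *isnull = row->nulls[attno];
}

/* Current style: build the vector with one call per row. */
static void
demo_fill_vector_per_row(const DemoRow *rows, int nrows, int attno,
                         DemoVector *vec)
{
    assert(nrows <= VEC_MAX_ROWS && attno >= 0 && attno < DEMO_NATTS);
    vec->nrows = nrows;
    for (int i = 0; i < nrows; i++)
        demo_row_getattr(&rows[i], attno, &vec->values[i], &vec->isnull[i]);
}

/* Toy columnar batch: one contiguous array per column, no NULLs. */
typedef struct DemoColumnarBatch
{
    int         nrows;
    const long *cols[DEMO_NATTS];
} DemoColumnarBatch;

/* Batch-aware style: the "AM" copies the whole column in one step. */
static void
demo_fill_vector_columnar(const DemoColumnarBatch *b, int attno,
                          DemoVector *vec)
{
    assert(b->nrows <= VEC_MAX_ROWS && attno >= 0 && attno < DEMO_NATTS);
    vec->nrows = b->nrows;
    memcpy(vec->values, b->cols[attno], (size_t) b->nrows * sizeof(long));
    memset(vec->isnull, 0, sizeof(vec->isnull));
}
```

Both functions produce the same vector for the same data; the second
just avoids the per-row call and never touches the columns it does not
need.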
> It might be worth exploring some columnar formats, and see if this
> design would be a good fit. Let's say we want to process data read from
> a parquet file. Would we be able to leverage the format, or would we
> need to "materialize" into slots too early? Or maybe it'd be good to
> look at the VCI extension [1], discussed in a nearby thread. AFAICS
> that's still based on an index AM, but there were suggestions to use TAM
> instead (and maybe that'd be a better choice).
Yeah, looking at columnar TAMs or FDWs is on my list. I do think the
design should be able to accommodate true columnar formats like
Parquet. If we had a table AM (or FDW) that reads Parquet files into a
columnar batch structure, the executor batching framework should
ideally allow us to pass that batch along without immediately
materializing to tuples. As mentioned before, we might have to adjust
or extend the TupleBatch abstraction to handle a wider variety of
batch formats, but conceptually it fits -- the goal is to avoid
forcing early materialization. I will definitely keep the Parquet
use-case in mind and perhaps do some experiments with a columnar
source to ensure we aren’t baking in any unnecessary materialization.
Also, thanks for the reference to the VCI extension thread; I'll take
a look at that.
> The other option would be to "create batches" during execution, say by
> having a new node that accumulates tuples, builds a batch and sends it
> to the node above. This would help both in cases when either the lower
> node does not produce batches at all, or the batches are too small (due
> to filtering, aggregation, ...). Of course, it'd only win if this
> increases efficiency of the upper part of the plan enough to pay for
> building the batches. That can be a hard decision.
Yes, introducing a dedicated executor node to accumulate and form
batches on the fly is an interesting idea; I have thought about it and
even mentioned it in passing at the PGConf.dev unconference. This
could indeed cover scenarios where the data source (a node) doesn't
produce batches (e.g., a non-batching node feeding into a
batching-aware upper node) or where batches coming from below are too
small to be efficient. The current patch set doesn’t implement such a
node; I focused on enabling batching at the scan/TAM level first. The
cost/benefit decision for a batch-aggregator node is tricky, as you
said. We’d need a way to decide when the overhead of gathering tuples
into a batch is outweighed by the benefits to the upper node. This
likely ties into costing or adaptive execution decisions. It's
something I’m open to exploring in a future iteration, perhaps once we
have more feedback on how the existing batching performs in various
scenarios. It might also require some planner or executor smarts
(maybe the executor can decide to batch on the fly if it sees a
pattern of use, or the planner could insert such nodes when
beneficial).
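For what it's worth, the core loop of such an accumulating node is
simple; the hard part is deciding when to use it. A toy, standalone
sketch follows, in which every name (DemoSource, DemoAccBatch,
demo_accumulate, ACC_BATCH_CAPACITY) is hypothetical and the "child
node" is just a counter that yields one value per call:

```c
#include <assert.h>
#include <stdbool.h>

#define ACC_BATCH_CAPACITY 8    /* arbitrary size chosen for the example */

/* Toy tuple-at-a-time child: yields next, next+1, ... up to end-1. */
typedef struct DemoSource
{
    long next;
    long end;
} DemoSource;

/* Returns false once the source is exhausted. */
static bool
demo_source_next(DemoSource *src, long *out)
{
    if (src->next >= src->end)
        return false;
    *out = src->next++;
    return true;
}

typedef struct DemoAccBatch
{
    int  nrows;
    long rows[ACC_BATCH_CAPACITY];
} DemoAccBatch;

/*
 * Pull rows from the child until the batch is full or the child is
 * exhausted.  Returns false when no rows were produced at all, which
 * signals end of stream to the consumer.
 */
static bool
demo_accumulate(DemoSource *src, DemoAccBatch *batch)
{
    batch->nrows = 0;
    while (batch->nrows < ACC_BATCH_CAPACITY &&
           demo_source_next(src, &batch->rows[batch->nrows]))
        batch->nrows++;
    return batch->nrows > 0;
}
```

Note that the last batch may be short, so a consumer has to honor nrows
rather than assume a full batch.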
> You also mentioned we could make batches larger by letting them span
> multiple pages, etc. I'm not sure that's worth it - wouldn't that
> substantially complicate the TAM code, which would need to pin+track
> multiple buffers for each batch, etc.? Possible, but is it worth it?
>
> I'm not sure allowing multi-page batches would actually solve the issue.
> It'd help with batches at the "scan level", but presumably the batch
> size in the upper nodes matters just as much. Large scan batches may
> help, but hard to predict.
>
> In the index prefetching patch we chose to keep batches 1:1 with leaf
> pages, at least for now. Instead we allowed having multiple batches at
> once. I'm not sure that'd be necessary for TAMs, though.
I tend to agree with you here. Allowing a single batch to span
multiple pages would add quite a bit of complexity to the table AM
implementations (managing multiple buffer pins per batch, tracking
page boundaries, etc.), and it's unclear if the benefit would justify
that complexity. For now, I'm inclined not to pursue multi-page
batches at the scan level in this patch. We can keep the batch
page-local (e.g., for heap, one batch corresponds to max one page, as
it does now). If we need larger batch sizes overall, we might address
that by other means -- for example, by the above-mentioned idea of a
higher-level batching node or by simply producing multiple batches in
quick succession.
You’re right that even if we made scan batches larger, it doesn’t
necessarily solve everything, since the effective batch size at
higher-level nodes could still be constrained by other factors. So
rather than complicating the low-level TAM code with multi-page
batches, I'd prefer to first see if the current approach (with
one-page batches) yields good benefits and then consider alternatives.
We could also consider letting a scan node produce multiple batches
before yielding to the upper node (similar to how the index
prefetching patch can have multiple leaf page batches in flight) if
needed, but as you note, it might not be necessary for TAMs yet. So at
this stage, I'll keep it simple.
> This also reminds me of LIMIT queries. The way I imagine a "batchified"
> executor to work is that batches are essentially "units of work". For
> example, a nested loop would grab a batch of tuples from the outer
> relation, lookup inner tuples for the whole batch, and only then pass
> the result batch. (I'm ignoring the cases when the batch explodes due to
> duplicates.)
>
> But what if there's a LIMIT 1 on top? Maybe it'd be enough to process
> just the first tuple, and the rest of the batch is wasted work? Plenty
> of (very expensive) OLAP queries have that, and many would likely benefit from
> batching, so just disabling batching if there's LIMIT seems way too
> heavy handed.
Yeah, LIMIT does complicate downstream batching decisions. If we
always use a full-size batch (say 64 tuples) for every operation, a
query with LIMIT 1 could end up doing a lot of unnecessary work
fetching and processing 63 tuples that never get used. Disabling
batching entirely for queries with LIMIT would indeed be overkill and
lose benefits for cases where the limit is not extremely selective.
> Perhaps it'd be good to gradually ramp up the batch size? Start with
> small batches, and then make them larger. The index prefetching does
> that too, indirectly - it reads the whole leaf page as a batch, but then
> gradually ramps up the prefetch distance (well, read_stream does that).
> Maybe the batching should have similar thing ...
An adaptive batch size that ramps up makes a lot of sense as a
solution. We could start with a very small batch (say 4 tuples) and if
we detect that the query needs more (e.g., the LIMIT wasn’t satisfied
yet or more output is still being consumed), then increase the batch
size for subsequent operations. This way, a query that stops early
doesn’t incur the full batching overhead, whereas a query that does
process lots of tuples will gradually get to a larger batch size to
gain efficiency. This is analogous to how the index prefetching ramps
up prefetch distance, as you mentioned.
Implementing that will require some careful thought. It could be done
either in the planner (choose initial batch sizes based on context
like LIMIT) or more dynamically in the executor (adjust on the fly). I
lean towards a runtime heuristic because it’s hard for the planner to
predict exactly how a LIMIT will play out, especially in complex
plans. In any case, I agree that a gradual ramp-up or other adaptive
approach would make batching more robust in the presence of query
execution variability. I will definitely consider adding such logic,
perhaps as an improvement once the basic framework is in.
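A minimal sketch of what a runtime ramp-up could look like, assuming a
simple doubling policy from 4 up to the current 64-tuple limit. The
doubling rule and all names here (DemoBatchSizer, demo_sizer_*) are
assumptions made for illustration; nothing like this exists in the
patch yet.

```c
#include <assert.h>

#define DEMO_MIN_BATCH 4    /* small first batch, cheap if LIMIT stops early */
#define DEMO_MAX_BATCH 64   /* current hard-coded batch size limit */

/* Hypothetical per-node state tracking the size of the next batch. */
typedef struct DemoBatchSizer
{
    int next_size;
} DemoBatchSizer;

static void
demo_sizer_init(DemoBatchSizer *s)
{
    s->next_size = DEMO_MIN_BATCH;
}

/*
 * Return the size to use for the batch about to be fetched, then
 * double the size for the following request, capped at the maximum.
 */
static int
demo_sizer_next(DemoBatchSizer *s)
{
    int size = s->next_size;

    assert(size >= DEMO_MIN_BATCH && size <= DEMO_MAX_BATCH);
    if (s->next_size < DEMO_MAX_BATCH)
    {
        s->next_size *= 2;
        if (s->next_size > DEMO_MAX_BATCH)
            s->next_size = DEMO_MAX_BATCH;
    }
    return size;
}
```

With this policy a LIMIT 1 query pays for at most 4 tuples in the first
batch, while a long-running scan reaches the full batch size after four
requests.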
> In fact, how shall the optimizer decide whether to use batching? It's
> one thing to decide whether a node can produce/consume batches, but
> another thing is "should it"? With a node that "builds" a batch, this
> decision would apply to even more plans, I guess.
>
> I don't have a great answer to this, it seems like an incredibly tricky
> costing issue. I'm a bit worried we might end up with something too
> coarse, like "jit=on" which we know is causing problems (admittedly,
> mostly due to a lot of the LLVM work being unpredictable/external). But
> having some "adaptive" heuristics (like the gradual ramp up) might make
> it less risky.
I agree that deciding when to use batching is tricky. So far, the
patch takes a fairly simplistic approach: if a node (particularly a
scan node) supports batching, it just does it, and other parts of the
plan will consume batches if they are capable. There isn’t yet a
nuanced cost-based decision in the planner for enabling batching. This
is indeed something we’ll have to refine. We don’t want to end up with
a blunt on/off GUC that could cause regressions in some cases.
One idea is to introduce costing for batching: for example, estimate
the per-tuple savings from batching vs the overhead of materialization
or batch setup. However, developing a reliable cost model for that
will take time and experimentation, especially with the possibility of
variable batch sizes or adaptive behavior. Not to mention, that would
add one more dimension to the planner's costing model, making
planning more expensive and less predictable. In the near term, I’m fine
with relying on feedback and perhaps manual tuning (GUCs, etc.) to
decide on batching, but that’s perhaps not a long-term solution.
I share your inclination that adaptive heuristics might be the safer
path initially. Perhaps the executor can decide to batch or not batch
based on runtime conditions. The gradual ramp-up of batch size is one
such adaptive approach. We could also consider things like monitoring
how effective batching is (are we actually processing full batches or
frequently getting cut off?) and adjust behavior. These are somewhat
speculative ideas at the moment, but the bottom line is I’m aware we
need a smarter strategy than a simple switch. This will likely evolve
as we test the patch in more scenarios.
> FWIW the current batch size limit (64 tuples) seems rather low, but it's
> hard to say. It'd be good to be able to experiment with different
> values, so I suggest we make this a GUC and not a hard-coded constant.
Yeah, I was thinking the same while testing -- the optimal batch size
might vary by workload or hardware, and 64 was a somewhat arbitrary
starting point. I will make the batch size limit configurable
(probably as a GUC executor_batch_tuples, maybe only developer-focused
at first). That will let us and others experiment easily with
different batch sizes to see how it affects performance. It should
also help with your earlier point: for example, on a machine where 64
is too low or too high, we can adjust it without recompiling. So yes,
I'll add a GUC for the batch size in the next version of the patch.
> As for what to add to explain, I'd start by adding info about which
> nodes are "batched" (consuming/producing batches), and some info about
> the batch sizes. An average size, maybe a histogram if you want to be a
> bit fancy.
Adding more information to EXPLAIN is a good idea. In the current
patch, EXPLAIN does not show anything about batching, but it would be
very helpful for debugging and user transparency to indicate which
nodes are operating in batch mode. I will update EXPLAIN to mark
nodes that produce or consume batches. Likely I’ll start with
something simple like an extra line or tag for a node, e.g., "Batch:
true (avg batch size 64)" or something along those lines. An average
batch size could be computed if we have instrumentation, which would
be useful to see if, say, the batch sizes ended up smaller due to
LIMIT or other factors. A full histogram might be more detail than
most users need, but I agree even just knowing average or maximum
batch size per node could be useful for performance analysis. I'll
implement at least the basics for now, and we can refine it (maybe add
more stats) if needed.
(I had added a flag in the EXPLAIN output at one point, but removed it
due to finding the regression output churn too noisy, though I
understand I'll have to bite the bullet at some point.)
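The bookkeeping needed for an average or maximum batch size is small.
A toy sketch, with hypothetical names (DemoBatchInstr, demo_instr_*)
that are not part of the patch or of the existing Instrumentation
struct:

```c
#include <assert.h>

/* Hypothetical per-node counters for batch statistics. */
typedef struct DemoBatchInstr
{
    unsigned long nbatches;     /* batches produced so far */
    unsigned long ntuples;      /* tuples across all batches */
    int           max_size;     /* largest batch seen */
} DemoBatchInstr;

/* Record one produced batch of the given size. */
static void
demo_instr_record(DemoBatchInstr *in, int batch_size)
{
    assert(batch_size >= 0);
    in->nbatches++;
    in->ntuples += (unsigned long) batch_size;
    if (batch_size > in->max_size)
        in->max_size = batch_size;
}

/* Average batch size, or 0 if the node produced no batches. */
static double
demo_instr_avg(const DemoBatchInstr *in)
{
    if (in->nbatches == 0)
        return 0.0;
    return (double) in->ntuples / (double) in->nbatches;
}
```

A node that produced batches of 64, 64 and 10 tuples would report an
average of 46 and a maximum of 64, which is enough to spot batches
being cut short.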
> Now, numbers from some microbenchmarks:
>
> On 9/26/25 15:28, Amit Langote wrote:
> > To evaluate the overheads and benefits, I ran microbenchmarks with
> > single and multi-aggregate queries on a single table, with and without
> > WHERE clauses. Tables were fully VACUUMed so visibility maps are set
> > and IO costs are minimal. shared_buffers was large enough to fit the
> > whole table (up to 10M rows, ~43 on each page), and all pages were
> > prewarmed into cache before tests. Table schema/script is at [2].
> >
> > Observations from benchmarking (Detailed benchmark tables are at [3];
> > below is just a high-level summary of the main patterns):
> >
> > * Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
> > sum(a) FROM bar_N): batching scan output alone improved latency by
> > ~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
> > especially once fmgr overhead was paid per batch instead of per row.
> >
> > * Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
> > qual interpreter gave a big step up, with latencies dropping by
> > ~30-40% compared to batching=off.
> >
> > * Five aggregates, no WHERE: batching input from the child scan cut
> > ~15% off runtime. Adding batched transition evaluation increased
> > improvements to ~30%.
> >
> > * Five aggregates, with WHERE: modest gains from scan/input batching,
> > but per-batch transition evaluation and batched quals brought ~20-30%
> > improvement.
> >
> > * Across all cases, executor overheads became visible only after IO
> > was minimized. Once executor cost dominated, batching consistently
> > reduced CPU time, with the largest benefits coming from avoiding
> > per-row fmgr calls and evaluating quals across batches.
> >
> > I would appreciate if others could try these patches with their own
> > microbenchmarks or workloads and see if they can reproduce numbers
> > similar to mine. Feedback on both the general direction and the
> > details of the patches would be very helpful. In particular, patches
> > 0001-0003, which add the basic batch APIs and integrate them into
> > SeqScan, are intended to be the first candidates for review and
> > eventual commit. Comments on the later, more experimental patches
> > (aggregate input batching and expression evaluation (qual, aggregate
> > transition) batching) are also welcome.
> >
>
> I tried to replicate the results, but the numbers I see are not this
> good. In fact, I see a fair number of regressions (and some are not
> negligible).
>
> I'm attaching the scripts I used to build the tables / run the test. I
> used the same table structure, and tried to follow the same query
> pattern with 1 or 5 aggregates (I used "avg"), [0, 1, 5] where
> conditions (with 100% selectivity).
>
> I measured master vs. 0001-0003 vs. 0001-0007 (with batching on/off).
> And I did that on my (relatively) new ryzen machine, and old xeon. The
> behavior is quite different for the two machines, but none of them shows
> such improvements. I used clang 19.0, and --with-llvm.
>
> See the attached PDFs with a summary of the results, comparing the
> results for master and the two batching branches.
>
> The ryzen is much "smoother" - it shows almost no difference with
> batching "off" (as expected). The "scan" branch (with 0001-0003) shows
> an improvement of 5-10% - it's consistent, but much less than the 10-20%
> you report. For the "agg" branch the benefits are much larger, but
> there's also a significant regression for the largest table with 100M
> rows (which is ~18GB on disk).
>
> For xeon, the results are a bit more variable, but it affects runs both
> with batching "on" and "off". The machine is just more noisy. There
> seems to be a small benefit of "scan" batching (in most cases much less
> than the 10-20%). The "agg" is a clear win, with up to 30-40% speedup,
> and no regression similar to the ryzen.
>
> Perhaps I did something wrong. It does not surprise me this is somewhat
> CPU dependent. It's a bit sad the improvements are smaller for the newer
> CPU, though.
Thanks for sharing your benchmark results -- that’s very useful data.
I haven’t yet finished investigating why there's a regression relative
to master when executor_batching is turned off. I re-ran my benchmarks
to include comparisons with master and did observe some regressions in
a few cases too, but I didn't see anything obvious in profiles that
explained the slowdown. I initially assumed it might be noise, but now
I suspect it could be related to structural changes in the scan code
-- for example, I added a few new fields in the middle of
HeapScanDescData, and even though the batching logic is bypassed when
executor_batching is off, it’s possible that change alone affects
memory layout or cache behavior in a way that penalizes the unbatched
path. I haven’t confirmed that yet, but it’s on my list to look into
more closely.
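To show the kind of effect I have in mind (unconfirmed, as said), here
is a toy example of how inserting fields in the middle of a struct can
push a frequently used field onto a different cache line. The structs
are made up and much smaller than HeapScanDescData, and the 64-byte
cache line size is an assumption about the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64  /* assumed cache line size in bytes */

/* Before: two hot fields sit next to each other. */
typedef struct DemoScanOld
{
    uint64_t hot_a;
    uint64_t hot_b;
} DemoScanOld;

/* After: new state added in the middle separates them. */
typedef struct DemoScanNew
{
    uint64_t hot_a;
    uint64_t batch_state[8];    /* hypothetical new batching fields */
    uint64_t hot_b;
} DemoScanNew;

/* Index of the cache line a field starts in, relative to the struct. */
static size_t
demo_cache_line_of(size_t offset)
{
    return offset / DEMO_CACHE_LINE;
}
```

Here offsetof(DemoScanOld, hot_b) is 8, so both hot fields share line
0, while offsetof(DemoScanNew, hot_b) is 72, which lands on line 1 even
if batch_state is never touched. Whether something like this explains
the regression is exactly what I still need to check with pahole and
profiles.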
Your observation that newer CPUs like the Ryzen may see smaller
improvements makes sense -- perhaps they handle the per-tuple overhead
more efficiently to begin with. Still, I’d prefer not to see
regressions at all, even in the unbatched case, so I’ll focus on
understanding and fixing that part before drawing conclusions from the
performance data.
Thanks again for the scripts -- those will help a lot in narrowing things down.
> I also tried running TPC-H. I don't have useful numbers yet, but I ran
> into a segfault - see the attached backtrace. It only happens with the
> batching, and only on Q22 for some reason. I initially thought it's a
> bug in clang, because I saw it with clang-22 built from git, and not
> with clang-14 or gcc. But since then I reproduced it with clang-19 (on
> debian 13). Still could be a clang bug, of course. I've seen ~20 of
> those segfaults so far, and the backtraces look exactly the same.
The v3 I posted fixes a tricky bug in the new EEOPs for batched-agg
evaluation that I suspect is also causing the crash you saw.
I'll try to post a v4 in a couple of weeks with some of the things I
mentioned above.
--
Thanks, Amit Langote
* Re: Batching in executor
@ 2025-10-27 16:18 Tomas Vondra <[email protected]>
parent: Amit Langote <[email protected]>
0 siblings, 1 reply; 29+ messages in thread
From: Tomas Vondra @ 2025-10-27 16:18 UTC (permalink / raw)
To: Amit Langote <[email protected]>; +Cc: pgsql-hackers
On 10/27/25 08:24, Amit Langote wrote:
> Hi Tomas,
>
> On Mon, Sep 29, 2025 at 8:01 PM Tomas Vondra <[email protected]> wrote:
>>
>> Hi Amit,
>>
>> Thanks for the patch. I took a look over the weekend, and did a couple of
>> experiments / benchmarks, so let me share some initial feedback (or
>> rather a bunch of questions I came up with).
>
> Thank you for reviewing the patch and taking the time to run those
> experiments. I appreciate the detailed feedback and questions. I also
> apologize for my late reply, I spent perhaps way too much time going
> over your index prefetching thread trying to understand the notion of
> batching that it uses and getting sidelined by other things while
> writing this reply.
>
Cool! Now you can do a review of the index prefetch patch ;-)
>> I'll start with some general thoughts, before going into some nitpicky
>> comments about patches / code and perf results.
>>
>> I think the general goal of the patch - reducing the per-tuple overhead
>> and making the executor more efficient for OLAP workloads - is very
>> desirable. I believe the limitations of per-row executor are one of the
>> reasons why attempts to implement a columnar TAM mostly failed. The
>> compression is nice, but it's hard to be competitive without an executor
>> that leverages that too. So starting with an executor, in a way that
>> helps even heap, seems like a good plan. So +1 to this.
>
> I'm happy to hear that you find the overall direction worthwhile.
>
>> While looking at the patch, I couldn't help but think about the index
>> prefetching stuff that I work on. It also introduces the concept of a
>> "batch", for passing data between an index AM and the executor. It's
>> interesting how different the designs are in some respects. I'm not
>> saying one of those designs is wrong, it's more due to different goals.
>>
>> For example, the index prefetching patch establishes a "shared" batch
>> struct, and the index AM is expected to fill it with data. After that,
>> the batch is managed entirely by indexam.c, with no AM calls. The only
>> AM-specific bit in the batch is "position", but that's used only when
>> advancing to the next page, etc.
>>
>> This patch does things differently. IIUC, each TAM may produce its own
>> "batch", which is then wrapped in a generic one. For example, heap
>> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
>> is fine. In the prefetching we chose to move all this code (walking the
>> batch items) from the AMs into the layer above, and make it AM agnostic.
>
> ...
>
>> But for the batching, we want to retain the custom format as long as
>> possible. Presumably, the various advantages of the TAMs are tied to the
>> custom/columnar storage format. Memory efficiency thanks to compression,
>> execution on compressed data, etc. Keeping the custom format as long as
>> possible is the whole point of "late materialization" (and materializing
>> as late as possible is one of the important details in column stores).
>
> Exactly -- keeping the TAM-specific batch format as long as possible
> is a key goal here. As you noted, the benefits of a custom storage
> format (compression, operating on compressed data, etc.) are best
> realized when we delay materialization until absolutely necessary. I
> want to design this patch so that each TAM can produce and use its own
> batch representation internally, only wrapping it when interfacing
> with the executor in a generic way. I admit that's not entirely true
> with the patch as it stands, as I describe below.
>
Understood. Makes sense in general.
>> How far ahead have you thought about these capabilities? I was wondering
>> about two things in particular. First, at which point do we have to
>> "materialize" the TupleBatch into some generic format (e.g. TupleSlots).
>> I get it that you want to enable passing batches between nodes, but
>> would those use the same "format" as the underlying scan node, or some
>> generic one? Second, will it be possible to execute expressions on the
>> custom batches (i.e. on "compressed data")? Or is it necessary to
>> "materialize" the batch into regular tuple slots? I realize those may
>> not be there "now" but maybe it'd be nice to plan for the future.
>
> I have been thinking about those future capabilities. Currently, the
> patch keeps tuples in the TAM-specific batch format up until they need
> to be consumed by a node that doesn’t understand that format or has
> not been modified to invoke the TAM callbacks to decode it. In the
> current patch, that means we materialize to regular TupleTableSlots at
> nodes that require it (for example, the scan node reading from TAM
> needing to evaluate quals, etc.). However, the intention is to allow
> batches to be passed through as many nodes as possible without
> materialization, ideally using the same format produced by the scan
> node all the way up until reaching a node that can only work with
> tuples in TupleTableSlots.
>
> As for executing expressions directly on the custom batch data: that’s
> something I would like to enable in the future. Right now, expressions
> (quals, projections, etc.) are evaluated after materializing into
> normal tuples in TupleTableSlots stored in TupleBatch, because the
> expression evaluation code isn’t yet totally batch-aware and is very
> far from doing things like operating on compressed data in its native form.
> Patches 0004-0008 do try to add batch-aware expression evaluation but
> that's just a prototype. In the long term, the goal is to allow
> expression evaluation on batch data (for example, applying a WHERE
> clause or aggregate transition directly on a columnar batch without
> converting it to heap tuples first). This will require significant new
> infrastructure (perhaps specialized batch-aware expression operators
> and functions), so it's not in the current patch, but I agree it's
> important to plan for it. The current design doesn’t preclude it, it
> lays some groundwork by introducing the batch abstraction -- but fully
> supporting that will be future work.
>
> That said, one area I’d like to mention while at it, especially to
> enable native execution on compressed or columnar batches, is giving
> the table AM more control over how expression evaluation is performed
> on its batch data. In the current patch, the AM can provide a
> materialize function via TupleBatchOps, but that always produces an
> array of TupleTableSlots stored in the TupleBatch, not an opaque
> representation that remains under AM control. Maybe that's not bad for
> a v1 patch.
I think materializing into a batch of TupleTableSlots (and then doing
the regular expression evaluation) seems perfectly fine for v1. It's the
simplest fallback possible, and we'll need it anyway if overriding the
expression evaluation will be optional (which I assume it will be?).
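To make the "optional override with a materialize fallback" idea concrete, here is a minimal sketch in C. Everything in it (DemoBatch, DemoBatchOps, count_gt_native, and the int-array payload) is a hypothetical stand-in chosen for illustration, not the patch's TupleBatch or TupleBatchOps definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the patch's TupleBatch/TupleBatchOps. */
typedef struct DemoBatch DemoBatch;

typedef struct DemoBatchOps
{
    /* Required: expose the batch as an array of plain row values. */
    const int  *(*materialize) (DemoBatch *batch);
    /* Optional: evaluate "value > threshold" on the AM's native format. */
    int         (*count_gt_native) (DemoBatch *batch, int threshold);
} DemoBatchOps;

struct DemoBatch
{
    const DemoBatchOps *ops;
    const int  *am_data;        /* AM-private payload (here just an array) */
    int         nrows;
};

/* Count rows passing the qual, preferring the AM's native path if any. */
static int
demo_count_gt(DemoBatch *batch, int threshold)
{
    const int  *rows;
    int         n = 0;

    assert(batch != NULL && batch->ops != NULL);
    assert(batch->ops->materialize != NULL);

    if (batch->ops->count_gt_native != NULL)
        return batch->ops->count_gt_native(batch, threshold);

    /* Fallback: materialize, then do the regular row-by-row evaluation. */
    rows = batch->ops->materialize(batch);
    for (int i = 0; i < batch->nrows; i++)
        if (rows[i] > threshold)
            n++;
    return n;
}

/* A trivial AM that implements only the mandatory callback. */
static const int *
demo_plain_materialize(DemoBatch *batch)
{
    return batch->am_data;
}

static const DemoBatchOps demo_plain_ops = {
    .materialize = demo_plain_materialize,
    .count_gt_native = NULL,
};
```

An AM that never sets the optional callback still works through the fallback, which is why the materialize path has to exist regardless of how far native evaluation gets.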
> When evaluating expressions over a batch, a BatchVector
> is built by looping over these slots and invoking the standard
> per-tuple getsomeattrs() to "deform" a tuple into needed columns.
> While that enables batch-style EEOPs for qual evaluation and aggregate
> transition (and is already a gain over per-row evaluation), it misses
> the opportunity to leverage any batch-specific optimizations the AM
> could offer, such as vectorized decoding or filtering over compressed
> data, and other AM optimizations for getting only the necessary
> columns out possibly in a vector format.
>
I'm not sure about this BatchVector thing. I haven't looked into that
very much, I'd expect the construction to be more expensive than the
benefits (compared to just doing the materialize + regular evaluation),
but maybe I'm completely wrong. Or maybe we could keep the vector
representation for multiple operations? No idea.
But it seems like a great area for experimenting ...
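The trade-off can be illustrated with a small C sketch, assuming hypothetical types (DemoSlot and the demo_* helpers are not from the patch). Building the column vector costs one pass over the slots; that cost pays off only if the loops that follow are cheap enough, or if the same vector is reused for more than one operation.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NATTS 3

/* Hypothetical row-format "slot": one already-deformed tuple. */
typedef struct DemoSlot
{
    long        values[DEMO_NATTS];
} DemoSlot;

/*
 * Copy attribute 'attno' of every slot into a contiguous column vector.
 * This is the per-batch construction cost.
 */
static void
demo_build_column(const DemoSlot *slots, int nslots, int attno, long *out)
{
    assert(attno >= 0 && attno < DEMO_NATTS);
    for (int i = 0; i < nslots; i++)
        out[i] = slots[i].values[attno];
}

/* One operation over the vector: a tight loop with no per-row dispatch. */
static long
demo_sum_column(const long *col, int n)
{
    long        sum = 0;

    for (int i = 0; i < n; i++)
        sum += col[i];
    return sum;
}

/* A second operation that can reuse the same vector. */
static int
demo_count_positive(const long *col, int n)
{
    int         count = 0;

    for (int i = 0; i < n; i++)
        if (col[i] > 0)
            count++;
    return count;
}
```

Whether the construction pass is cheaper than the per-row overhead it removes is exactly the open question; this sketch only shows where each cost sits.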
> I’m considering extending TupleTableSlotOps with a batch-aware variant
> of getsomeattrs(), something like slot_getsomeattrs_batch(), so that
> AMs can populate column vectors (e.g., BatchVector) directly from
> their native format. That would allow bypassing slot materialization
> entirely and plug AM-provided decoding logic directly into the
> executor’s batch expression paths. This isn’t implemented yet, but I
> see it as a necessary step toward supporting fully native expression
> evaluation over compressed or columnar formats. I’m not yet sure if
> TupleTableSlotOps is the right place for such a hook, it might belong
> elsewhere in the abstraction, but exposing a batch-aware interface for
> this purpose seems like the right direction.
>
No opinion. I don't see it as a necessary prerequisite for the other
parts of the patch series, but maybe the BatchVector really helps, and
then this would make perfect sense. I'm not sure there's a single
"correct" sequence in which to do these improvements, it's always a
matter of opinion.
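For a sense of what an AM-provided batch decode could look like, here is a hedged C sketch. It assumes a made-up run-length-encoded column (DemoRun) and a made-up callback name (demo_rle_getattr_batch); neither exists in the patch, and the real hook might not live in TupleTableSlotOps at all.

```c
#include <assert.h>

/* Hypothetical AM-native format: a run-length-encoded column. */
typedef struct DemoRun
{
    long        value;
    int         count;
} DemoRun;

/*
 * Decode up to 'maxrows' values straight from the runs into a column
 * vector, without ever forming per-row slots.  Returns rows written.
 */
static int
demo_rle_getattr_batch(const DemoRun *runs, int nruns,
                       long *out, int maxrows)
{
    int         n = 0;

    assert(maxrows >= 0);
    for (int r = 0; r < nruns && n < maxrows; r++)
        for (int k = 0; k < runs[r].count && n < maxrows; k++)
            out[n++] = runs[r].value;
    return n;
}
```

The point of the sketch is only that the AM, not the executor, knows how to turn its storage into a vector cheaply.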
>> It might be worth exploring some columnar formats, and see if this
>> design would be a good fit. Let's say we want to process data read from
>> a parquet file. Would we be able to leverage the format, or would we
>> need to "materialize" into slots too early? Or maybe it'd be good to
>> look at the VCI extension [1], discussed in a nearby thread. AFAICS
>> that's still based on an index AM, but there were suggestions to use TAM
>> instead (and maybe that'd be a better choice).
>
> Yeah, looking at columnar TAMs or FDWs is on my list. I do think the
> design should be able to accommodate true columnar formats like
> Parquet. If we had a table AM (or FDW) that reads Parquet files into a
> columnar batch structure, the executor batching framework should
> ideally allow us to pass that batch along without immediately
> materializing to tuples. As mentioned before, we might have to adjust
> or extend the TupleBatch abstraction to handle a wider variety of
> batch formats, but conceptually it fits -- the goal is to avoid
> forcing early materialization. I will definitely keep the Parquet
> use-case in mind and perhaps do some experiments with a columnar
> source to ensure we aren’t baking in any unnecessary materialization.
> Also, thanks for the reference to the VCI extension thread; I'll take
> a look at that.
>
+1 I think having a TAM/FDW reading those established and common formats
is a good way to validate the overall design.
>> The other option would be to "create batches" during execution, say by
>> having a new node that accumulates tuples, builds a batch and sends it
>> to the node above. This would help both in cases when either the lower
>> node does not produce batches at all, or the batches are too small (due
>> to filtering, aggregation, ...). Of course, it'd only win if this
>> increases efficiency of the upper part of the plan enough to pay for
>> building the batches. That can be a hard decision.
>
> Yes, introducing a dedicated executor node to accumulate and form
> batches on the fly is an interesting idea, I have thought about it and
> even mentioned it in passing in the pgconf.dev unconference. This
> could indeed cover scenarios where the data source (a node) doesn't
> produce batches (e.g., a non-batching node feeding into a
> batching-aware upper node) or where batches coming from below are too
> small to be efficient. The current patch set doesn’t implement such a
> node; I focused on enabling batching at the scan/TAM level first. The
> cost/benefit decision for a batch-aggregator node is tricky, as you
> said. We’d need a way to decide when the overhead of gathering tuples
> into a batch is outweighed by the benefits to the upper node. This
> likely ties into costing or adaptive execution decisions. It's
> something I’m open to exploring in a future iteration, perhaps once we
> have more feedback on how the existing batching performs in various
> scenarios. It might also require some planner or executor smarts
> (maybe the executor can decide to batch on the fly if it sees a
> pattern of use, or the planner could insert such nodes when
> beneficial).
>
Yeah, those are good questions. I don't have a clear idea how we should
decide when to do this batching. Costing during planning is the
"traditional" option, with all the issues (e.g. it requires a reasonably
good cost model). Another option would be some sort of execution-time
heuristics - but then which node would be responsible for building the
batches (if we didn't create them during planning)?
I agree it makes sense to focus on batching at the TAM/scan level for
now. That's a pretty big project already.
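As a rough illustration of the batch-building node discussed above, the following C sketch pulls from a tuple-at-a-time child until a batch is full or the child is exhausted. All names (DemoAccumulator, DemoCounter, DEMO_BATCH_CAPACITY) are hypothetical, and the child is a plain callback rather than a real PlanState.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BATCH_CAPACITY 4

/* Hypothetical accumulator over a tuple-at-a-time child. */
typedef struct DemoAccumulator
{
    /* Child: stores the next value and returns 1, or returns 0 at end. */
    int         (*next) (void *state, long *value);
    void       *child_state;
    long        items[DEMO_BATCH_CAPACITY];
} DemoAccumulator;

/* Fill the batch from the child; returns the number of items gathered. */
static int
demo_accumulate(DemoAccumulator *acc)
{
    int         n = 0;

    while (n < DEMO_BATCH_CAPACITY &&
           acc->next(acc->child_state, &acc->items[n]))
        n++;
    return n;
}

/* Example child: yields cur, cur + 1, ..., last. */
typedef struct DemoCounter
{
    long        cur;
    long        last;
} DemoCounter;

static int
demo_counter_next(void *state, long *value)
{
    DemoCounter *c = (DemoCounter *) state;

    if (c->cur > c->last)
        return 0;
    *value = c->cur++;
    return 1;
}
```

The sketch says nothing about when inserting such a node is worthwhile; that is the costing or runtime-heuristic question raised above.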
>> You also mentioned we could make batches larger by letting them span
>> multiple pages, etc. I'm not sure that's worth it - wouldn't that
>> substantially complicate the TAM code, which would need to pin+track
>> multiple buffers for each batch, etc.? Possible, but is it worth it?
>>
>> I'm not sure allowing multi-page batches would actually solve the issue.
>> It'd help with batches at the "scan level", but presumably the batch
>> size in the upper nodes matters just as much. Large scan batches may
>> help, but hard to predict.
>>
>> In the index prefetching patch we chose to keep batches 1:1 with leaf
>> pages, at least for now. Instead we allowed having multiple batches at
>> once. I'm not sure that'd be necessary for TAMs, though.
>
> I tend to agree with you here. Allowing a single batch to span
> multiple pages would add quite a bit of complexity to the table AM
> implementations (managing multiple buffer pins per batch, tracking
> page boundaries, etc.), and it's unclear if the benefit would justify
> that complexity. For now, I'm inclined not to pursue multi-page
> batches at the scan level in this patch. We can keep the batch
> page-local (e.g., for heap, one batch corresponds to max one page, as
> it does now). If we need larger batch sizes overall, we might address
> that by other means -- for example, by the above-mentioned idea of a
> higher-level batching node or by simply producing multiple batches in
> quick succession.
>
+1
> You’re right that even if we made scan batches larger, it doesn’t
> necessarily solve everything, since the effective batch size at
> higher-level nodes could still be constrained by other factors. So
> rather than complicating the low-level TAM code with multi-page
> batches, I'd prefer to first see if the current approach (with
> one-page batches) yields good benefits and then consider alternatives.
> We could also consider letting a scan node produce multiple batches
> before yielding to the upper node (similar to how the index
> prefetching patch can have multiple leaf page batches in flight) if
> needed, but as you note, it might not be necessary for TAMs yet. So at
> this stage, I'll keep it simple.
>
+1
>> This also reminds me of LIMIT queries. The way I imagine a "batchified"
>> executor to work is that batches are essentially "units of work". For
>> example, a nested loop would grab a batch of tuples from the outer
>> relation, lookup inner tuples for the whole batch, and only then pass
>> the result batch. (I'm ignoring the cases when the batch explodes due to
>> duplicates.)
>>
>> But what if there's a LIMIT 1 on top? Maybe it'd be enough to process
>> just the first tuple, and the rest of the batch is wasted work? Plenty
>> of (very expensive) OLAP queries have that, and many would likely benefit from
>> batching, so just disabling batching if there's LIMIT seems way too
>> heavy handed.
>
> Yeah, LIMIT does complicate downstream batching decisions. If we
> always use a full-size batch (say 64 tuples) for every operation, a
> query with LIMIT 1 could end up doing a lot of unnecessary work
> fetching and processing 63 tuples that never get used. Disabling
> batching entirely for queries with LIMIT would indeed be overkill and
> lose benefits for cases where the limit is not extremely selective.
>
>> Perhaps it'd be good to gradually ramp up the batch size? Start with
>> small batches, and then make them larger. The index prefetching does
>> that too, indirectly - it reads the whole leaf page as a batch, but then
>> gradually ramps up the prefetch distance (well, read_stream does that).
>> Maybe the batching should have similar thing ...
>
> An adaptive batch size that ramps up makes a lot of sense as a
> solution. We could start with a very small batch (say 4 tuples) and if
> we detect that the query needs more (e.g., the LIMIT wasn’t satisfied
> yet or more output is still being consumed), then increase the batch
> size for subsequent operations. This way, a query that stops early
> doesn’t incur the full batching overhead, whereas a query that does
> process lots of tuples will gradually get to a larger batch size to
> gain efficiency. This is analogous to how the index prefetching ramps
> up prefetch distance, as you mentioned.
>
> Implementing that will require some careful thought. It could be done
> either in the planner (choose initial batch sizes based on context
> like LIMIT) or more dynamically in the executor (adjust on the fly). I
> lean towards a runtime heuristic because it’s hard for the planner to
> predict exactly how a LIMIT will play out, especially in complex
> plans. In any case, I agree that a gradual ramp-up or other adaptive
> approach would make batching more robust in the presence of query
> execution variability. I will definitely consider adding such logic,
> perhaps as an improvement once the basic framework is in.
>
I agree a runtime heuristic is probably the right approach. After all,
a lot of the issues with LIMIT queries are due to the planner not knowing
the real data distribution, etc.
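A minimal C sketch of the ramp-up idea, under stated assumptions: the batch size doubles after each batch up to a cap, and every fetched tuple counts toward the LIMIT. The function names and the doubling policy are illustrative choices, not something the patch implements.

```c
#include <assert.h>

/* Double the batch size after each batch, capped at max_size. */
static int
demo_next_batch_size(int current, int max_size)
{
    assert(current > 0 && max_size > 0);
    if (current > max_size / 2)
        return max_size;
    return current * 2;
}

/*
 * Total tuples fetched before a LIMIT of 'limit' rows is satisfied,
 * assuming every fetched tuple qualifies (a simplification).
 */
static long
demo_tuples_fetched(long limit, int initial, int max_size)
{
    long        fetched = 0;
    int         size = initial;

    while (fetched < limit)
    {
        fetched += size;
        size = demo_next_batch_size(size, max_size);
    }
    return fetched;
}
```

With an initial size of 4 and a cap of 64, a LIMIT 1 query fetches 4 tuples instead of 64, while a LIMIT 100 query fetches 4 + 8 + 16 + 32 + 64 = 124 tuples, close to the 128 that fixed 64-tuple batches would fetch.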
>> In fact, how shall the optimizer decide whether to use batching? It's
>> one thing to decide whether a node can produce/consume batches, but
>> another thing is "should it"? With a node that "builds" a batch, this
>> decision would apply to even more plans, I guess.
>>
>> I don't have a great answer to this, it seems like an incredibly tricky
>> costing issue. I'm a bit worried we might end up with something too
>> coarse, like "jit=on" which we know is causing problems (admittedly,
>> mostly due to a lot of the LLVM work being unpredictable/external). But
>> having some "adaptive" heuristics (like the gradual ramp up) might make
>> it less risky.
>
> I agree that deciding when to use batching is tricky. So far, the
> patch takes a fairly simplistic approach: if a node (particularly a
> scan node) supports batching, it just does it, and other parts of the
> plan will consume batches if they are capable. There isn’t yet a
> nuanced cost-based decision in the planner for enabling batching. This
> is indeed something we’ll have to refine. We don’t want to end up with
> a blunt on/off GUC that could cause regressions in some cases.
>
> One idea is to introduce costing for batching: for example, estimate
> the per-tuple savings from batching vs the overhead of materialization
> or batch setup. However, developing a reliable cost model for that
> will take time and experimentation, especially with the possibility of
> variable batch sizes or adaptive behavior. Not to mention, that will
> be adding one more dimension to planner's costing model making the
> planning more expensive and unpredictable. In the near term, I’m fine
> with relying on feedback and perhaps manual tuning (GUCs, etc.) to
> decide on batching, but that’s perhaps not a long-term solution.
>
Yeah, the cost model is going to be hard, because this depends on so
much low-level plan/hardware details. Like, the TAM may allow execution
on compressed data / leverage vectorization, .... But maybe the CPU does
not do that efficiently? There's so many unknown unknowns ...
Considering we still haven't fixed the JIT cost model, maybe it's better
to not rely on it too much for this batching patch? Also, all those
details contradict the idea that cost models are a simplified model of
the reality.
> I share your inclination that adaptive heuristics might be the safer
> path initially. Perhaps the executor can decide to batch or not batch
> based on runtime conditions. The gradual ramp-up of batch size is one
> such adaptive approach. We could also consider things like monitoring
> how effective batching is (are we actually processing full batches or
> frequently getting cut off?) and adjust behavior. These are somewhat
> speculative ideas at the moment, but the bottom line is I’m aware we
> need a smarter strategy than a simple switch. This will likely evolve
> as we test the patch in more scenarios.
>
I think the big question is how much can the batching change the
relative cost of two plans (I mean, actual cost, not just estimates).
Imagine plans P1 and P2, where
cost(P1) < cost(P2) = cost(P1) + delta
where "delta" is small (so P1 is faster, but not much). If we
"batchify" the plans into P1' and P2', can this happen?
cost(P1') >> cost(P2')
That is, can the "slower" plan P2 benefit much more from the batching,
making it significantly faster?
If this is unlikely, we could entirely ignore batching during planning,
and only do that as post-processing on the selected plan, or perhaps
even just during execution.
OTOH that's what JIT does, and we know it's not perfect - but that's
mostly because JIT has rather unpredictable costs when enabling. Maybe
batching doesn't have that.
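One way to reason about the P1/P2 question is a toy Amdahl-style model, sketched below in C. The model itself is an assumption of this sketch, not something from the patch: only a "batchable" fraction of each plan's cost speeds up, by a common factor.

```c
#include <assert.h>

/*
 * Amdahl-style estimate (an assumption of this sketch): only the
 * 'batchable' fraction of a plan's cost is divided by 'speedup'.
 */
static double
demo_batched_cost(double cost, double batchable, double speedup)
{
    assert(cost >= 0.0);
    assert(batchable >= 0.0 && batchable <= 1.0);
    assert(speedup >= 1.0);
    return cost * (1.0 - batchable) + cost * batchable / speedup;
}

/* Returns 1 if P1 is cheaper unbatched but P2 is cheaper once batched. */
static int
demo_ranking_flips(double c1, double f1, double c2, double f2,
                   double speedup)
{
    return c1 < c2 &&
        demo_batched_cost(c1, f1, speedup) >
        demo_batched_cost(c2, f2, speedup);
}
```

For example, with cost(P1) = 100 and cost(P2) = 105, a speedup of 2, and batchable fractions of 0.1 and 0.8, the batched costs come out at roughly 95 and 63, so the ranking flips. If both plans have the same batchable fraction, the ranking is preserved under this model, which is the case where ignoring batching during planning would be safe.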
>> FWIW the current batch size limit (64 tuples) seems rather low, but it's
>> hard to say. It'd be good to be able to experiment with different
>> values, so I suggest we make this a GUC and not a hard-coded constant.
>
> Yeah, I was thinking the same while testing -- the optimal batch size
> might vary by workload or hardware, and 64 was a somewhat arbitrary
> starting point. I will make the batch size limit configurable
> (probably as a GUC executor_batch_tuples, maybe only developer-focused
> at first). That will let us and others experiment easily with
> different batch sizes to see how it affects performance. It should
> also help with your earlier point: for example, on a machine where 64
> is too low or too high, we can adjust it without recompiling. So yes,
> I'll add a GUC for the batch size in the next version of the patch.
>
+1 to have a developer-only GUC for testing. But the goal should be to not
expect users to tune this.
>> As for what to add to explain, I'd start by adding info about which
>> nodes are "batched" (consuming/producing batches), and some info about
>> the batch sizes. An average size, maybe a histogram if you want to be a
>> bit fancy.
>
> Adding more information to EXPLAIN is a good idea. In the current
> patch, EXPLAIN does not show anything about batching, but it would be
> very helpful for debugging and user transparency to indicate which
> nodes are operating in batch mode. I will update EXPLAIN to mark
> nodes that produce or consume batches. Likely I’ll start with
> something simple like an extra line or tag for a node, e.g., "Batch:
> true (avg batch size 64)" or something along those lines. An average
> batch size could be computed if we have instrumentation, which would
> be useful to see if, say, the batch sizes ended up smaller due to
> LIMIT or other factors. A full histogram might be more detail than
> most users need, but I agree even just knowing average or maximum
> batch size per node could be useful for performance analysis. I'll
> implement at least the basics for now, and we can refine it (maybe add
> more stats) if needed.
+1 to start with something simple
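The per-node counters needed for an average and maximum batch size are small. A hedged C sketch follows; DemoBatchInstr and its fields are hypothetical and are not part of PostgreSQL's Instrumentation struct.

```c
#include <assert.h>

/* Hypothetical per-node batch counters. */
typedef struct DemoBatchInstr
{
    long        nbatches;       /* batches produced so far */
    long        ntuples;        /* tuples across all batches */
    int         max_size;       /* largest batch seen */
} DemoBatchInstr;

/* Record one produced batch. */
static void
demo_instr_record(DemoBatchInstr *instr, int batch_size)
{
    assert(batch_size >= 0);
    instr->nbatches++;
    instr->ntuples += batch_size;
    if (batch_size > instr->max_size)
        instr->max_size = batch_size;
}

/* Average batch size, or 0 if the node produced no batches. */
static double
demo_instr_avg(const DemoBatchInstr *instr)
{
    if (instr->nbatches == 0)
        return 0.0;
    return (double) instr->ntuples / (double) instr->nbatches;
}
```

Two full batches of 64 followed by a final batch of 10 give an average of 46 and a maximum of 64, which is the kind of gap that would hint at a LIMIT or a selective filter cutting batches short.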
>
> (I had added a flag in the EXPLAIN output at one point, but removed it
> due to finding the regression output churn too noisy, though I
> understand I'll have to bite the bullet at some point.)
>
Why would there be regression churn, if the option is disabled by default?
>> Now, numbers from some microbenchmarks:
>>
>> ...
>>
>> Perhaps I did something wrong. It does not surprise me this is somewhat
>> CPU dependent. It's a bit sad the improvements are smaller for the newer
>> CPU, though.
>
> Thanks for sharing your benchmark results -- that’s very useful data.
> I haven’t yet finished investigating why there's a regression relative
> to master when executor_batching is turned off. I re-ran my benchmarks
> to include comparisons with master and did observe some regressions in
> a few cases too, but I didn't see anything obvious in profiles that
> explained the slowdown. I initially assumed it might be noise, but now
> I suspect it could be related to structural changes in the scan code
> -- for example, I added a few new fields in the middle of
> HeapScanDescData, and even though the batching logic is bypassed when
> executor_batching is off, it’s possible that change alone affects
> memory layout or cache behavior in a way that penalizes the unbatched
> path. I haven’t confirmed that yet, but it’s on my list to look into
> more closely.
>
> Your observation that newer CPUs like the Ryzen may see smaller
> improvements makes sense -- perhaps they handle the per-tuple overhead
> more efficiently to begin with. Still, I’d prefer not to see
> regressions at all, even in the unbatched case, so I’ll focus on
> understanding and fixing that part before drawing conclusions from the
> performance data.
>
> Thanks again for the scripts -- those will help a lot in narrowing things down.
>
If needed, I can rerun the tests and collect additional information
(e.g. maybe perf-stat or perf-diff would be interesting).
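The layout hypothesis can be checked mechanically with offsetof. The C sketch below uses two made-up structs (DemoScanOld and DemoScanNew, not the real HeapScanDescData) and assumes 64-byte cache lines and a cache-line-aligned struct start; it shows how inserting fields in the middle can push a hot field onto a different cache line even when the new fields are never touched.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64      /* assumed cache line size in bytes */

/* Hypothetical "before" layout: the hot field shares line 0. */
typedef struct DemoScanOld
{
    uint64_t    rs_a;
    uint64_t    rs_hot;         /* touched on every tuple */
} DemoScanOld;

/* Hypothetical "after" layout: new fields inserted in the middle. */
typedef struct DemoScanNew
{
    uint64_t    rs_a;
    uint64_t    rs_batch[8];    /* unused when batching is off */
    uint64_t    rs_hot;
} DemoScanNew;

/* Index of the cache line holding a given struct offset. */
static size_t
demo_cache_line_of(size_t offset)
{
    return offset / DEMO_CACHE_LINE;
}
```

Here rs_hot moves from offset 8 to offset 72, so an access pattern that touched one cache line now touches two. Running pahole on the real struct before and after the patch, or appending the new fields at the end, would be one way to confirm or rule this out.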
>> I also tried running TPC-H. I don't have useful numbers yet, but I ran
>> into a segfault - see the attached backtrace. It only happens with the
>> batching, and only on Q22 for some reason. I initially thought it's a
>> bug in clang, because I saw it with clang-22 built from git, and not
>> with clang-14 or gcc. But since then I reproduced it with clang-19 (on
>> debian 13). Still could be a clang bug, of course. I've seen ~20 of
>> those segfaults so far, and the backtraces look exactly the same.
>
> The v3 I posted fixes a tricky bug in the new EEOPs for batched-agg
> evaluation that I suspect is also causing the crash you saw.
>
> I'll try to post a v4 in a couple of weeks with some of the things I
> mentioned above.
>
Sounds good. Thank you.
regards
--
Tomas Vondra
* Re: Batching in executor
@ 2025-10-27 17:37 Peter Geoghegan <[email protected]>
parent: Tomas Vondra <[email protected]>
3 siblings, 1 reply; 29+ messages in thread
From: Peter Geoghegan @ 2025-10-27 17:37 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: Amit Langote <[email protected]>; pgsql-hackers
On Mon, Sep 29, 2025 at 7:01 AM Tomas Vondra <[email protected]> wrote:
> While looking at the patch, I couldn't help but think about the index
> prefetching stuff that I work on. It also introduces the concept of a
> "batch", for passing data between an index AM and the executor. It's
> interesting how different the designs are in some respects. I'm not
> saying one of those designs is wrong, it's more due to different goals.
I've been working on a new prototype enhancement to the index
prefetching patch. The new spinoff patch has index scans batch up
calls to heap_hot_search_buffer for heap TIDs that the scan has yet to
return. This optimization is effective whenever an index scan returns
a contiguous group of TIDs that all point to the same heap page. We're
able to lock and unlock heap page buffers at the same point that
they're pinned and unpinned, which can dramatically decrease the
number of heap buffer locks acquired by index scans that return
contiguous TIDs (which is very common).
I find that pgbench SELECT variants with a predicate such
as "WHERE aid BETWEEN 1000 AND 1500" can have up to ~20% higher
throughput, at least in cases with low client counts (think 1 or 2
clients). These are cases where everything can fit in shared buffers,
so we're not getting any benefit from I/O prefetching (in spite of the
fact that this is built on top of the index prefetching patchset).
It makes sense to put this in scope for the index prefetching work
because that work will already give code outside of an index AM
visibility into which group of TIDs need to be read next. Right now
(on master) there is some trivial sense in which index AMs use their
own batches, but that's completely hidden from external callers.
> For example, the index prefetching patch establishes a "shared" batch
> struct, and the index AM is expected to fill it with data. After that,
> the batch is managed entirely by indexam.c, with no AM calls. The only
> AM-specific bit in the batch is "position", but that's used only when
> advancing to the next page, etc.
The major difficulty with my heap batching prototype is getting the
layering right (no surprises there). In some sense we're deliberately
sharing information across what we currently think of as
different layers of abstraction, in order to be able to "schedule" the
work more intelligently. There's a number of competing considerations.
I have invented a new concept of heap batch, that is orthogonal to the
existing concept of index batches. Right now these are just an array
of HeapTuple structs that relate to exactly one group of
contiguous heap TIDs (i.e. if the index scan returns TIDs even a
little out of order, which is fairly common, we cannot currently
reorder the work in the current prototype patch).
Once a batch is prepared, calls to heapam_index_fetch_tuple just
return the next TID from the batch (until the next time we have to
return a TID pointing to some distinct heap block). In the case of
pgbench queries like the one I mentioned, we only need to call
LockBuffer/heap_hot_search_buffer once for every 61 heap tuples
returned (not once per heap tuple returned).
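The saving can be sketched in C by counting how many times a heap page would need to be locked for a given TID stream, assuming one lock per run of consecutive TIDs on the same block. DemoTid and demo_count_page_locks are hypothetical names for illustration only; the prototype's real data structures are not shown in this thread.

```c
#include <assert.h>

/* Hypothetical TID: heap block number plus line pointer offset. */
typedef struct DemoTid
{
    unsigned    block;
    unsigned short offset;
} DemoTid;

/*
 * Number of page lock acquisitions if each run of consecutive TIDs on
 * the same block is handled under a single lock.  No reordering is
 * done, so returning to an earlier block costs another lock.
 */
static int
demo_count_page_locks(const DemoTid *tids, int ntids)
{
    int         locks = 0;

    for (int i = 0; i < ntids; i++)
        if (i == 0 || tids[i].block != tids[i - 1].block)
            locks++;
    return locks;
}
```

For the stream (10,1) (10,2) (10,3) (11,1) (10,4) this yields 3 lock acquisitions instead of 5. The final visit to block 10 is not merged with the first three, which matches the current prototype's limitation of not reordering out-of-order TIDs.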
Importantly, the new interface added by my new prototype spinoff patch
is higher level than the existing
table_index_fetch_tuple/heapam_index_fetch_tuple interface. The
executor asks the table AM "give me the next heap TID in the current
scan direction", rather than asking "give me this heap TID". The
general idea is that the table AM has a direct understanding of
ordered index scans.
The advantage of this higher-level interface is that it gives the
table AM maximum freedom to reorder work. As I said already, we won't
do things like merge together logically noncontiguous accesses to the
same heap page into one physical access right now. But I think that
that should at least be enabled by this interface.
The downside of this approach is that the table AM (not the executor
proper) is responsible for interfacing with the index AM layer. I
think that this can be generalized without very much code duplication
across table AMs. But it's hard.
> This patch does things differently. IIUC, each TAM may produce its own
> "batch", which is then wrapped in a generic one. For example, heap
> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> is fine. In the prefetching we chose to move all this code (walking the
> batch items) from the AMs into the layer above, and make it AM agnostic.
I think that the base index prefetching patch's current notion of
index-AM-wise batches can be kept quite separate from any table AM
batch concept that might be invented, either as part of what I'm
working on, or in Amit's patch.
It probably wouldn't be terribly difficult to get the new interface
I've described to return heap tuples in whatever batch format Amit
comes up with. That only has a benefit if it makes life easier for
expression evaluation in higher levels of the plan tree, but it might
just make sense to always do it that way. I doubt that adopting Amit's
batch format will make life much harder for the
heap_hot_search_buffer-batching mechanism (at least if it is generally
understood that its new index scan interface builds batches in
Amit's format on a best-effort basis).
--
Peter Geoghegan
* Re: Batching in executor
@ 2025-10-28 13:11 Amit Langote <[email protected]>
parent: Peter Geoghegan <[email protected]>
0 siblings, 0 replies; 29+ messages in thread
From: Amit Langote @ 2025-10-28 13:11 UTC (permalink / raw)
To: Peter Geoghegan <[email protected]>; +Cc: Tomas Vondra <[email protected]>; pgsql-hackers
Hi Peter,
Thanks for chiming in here.
On Tue, Oct 28, 2025 at 2:37 AM Peter Geoghegan <[email protected]> wrote:
>
> On Mon, Sep 29, 2025 at 7:01 AM Tomas Vondra <[email protected]> wrote:
> > While looking at the patch, I couldn't help but think about the index
> > prefetching stuff that I work on. It also introduces the concept of a
> > "batch", for passing data between an index AM and the executor. It's
> > interesting how different the designs are in some respects. I'm not
> > saying one of those designs is wrong, it's more due to different goals.
>
> I've been working on a new prototype enhancement to the index
> prefetching patch. The new spinoff patch has index scans batch up
> calls to heap_hot_search_buffer for heap TIDs that the scan has yet to
> return. This optimization is effective whenever an index scan returns
> a contiguous group of TIDs that all point to the same heap page. We're
> able to lock and unlock heap page buffers at the same point that
> they're pinned and unpinned, which can dramatically decrease the
> number of heap buffer locks acquired by index scans that return
> contiguous TIDs (which is very common).
>
> I find that pgbench SELECT variants with a predicate such
> as "WHERE aid BETWEEN 1000 AND 1500" can have up to ~20% higher
> throughput, at least in cases with low client counts (think 1 or 2
> clients). These are cases where everything can fit in shared buffers,
> so we're not getting any benefit from I/O prefetching (in spite of the
> fact that this is built on top of the index prefetching patchset).
I gathered from the index prefetching thread that it is mainly about
enabling I/O prefetching, so it's nice to see that kind of speedup
even for the in-memory case.
Is this spinoff patch separate from the one that adds amgetbatch() to
IndexAmRoutine which you posted on Oct 12? If so, where can I find it?
> It makes sense to put this in scope for the index prefetching work
> because that work will already give code outside of an index AM
> visibility into which group of TIDs need to be read next. Right now
> (on master) there is some trivial sense in which index AMs use their
> own batches, but that's completely hidden from external callers.
As you might know, heapam's TableAmRoutine.scan_* functions use a
"pagemode" in some cases, which fills a batch of tuples in
HeapScanDescData.rs_vistuples. However, that batch currently only stores
the tuples’ offset numbers. I started this work based on Andres’s
suggestion to propagate that batch up into the executor’s scan nodes.
The idea is to create a HeapTuple array sized according to the
executor’s batch size, and then populate it when the scan node calls
the new TableAmRoutine.scan_batch* variant. There might be some
overlap between our respective ideas.
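A simplified C sketch of that idea, with hypothetical names throughout (DemoPageScan and demo_getnextbatch are not the patch's API, and plain offsets stand in for HeapTuple entries): the scan hands out the current page's visible offsets in chunks of at most maxitems, and a batch never crosses a page boundary.

```c
#include <assert.h>

/*
 * Hypothetical page-local scan state, loosely modeled on the idea of
 * a page-mode array of visible line pointer offsets.
 */
typedef struct DemoPageScan
{
    const unsigned short *vis_offsets;  /* visible offsets on this page */
    int         nvis;           /* number of entries in vis_offsets */
    int         next;           /* next entry to hand out */
} DemoPageScan;

/*
 * Copy up to 'maxitems' visible offsets from the current page into
 * 'out'.  Returns the number copied; 0 means the page is exhausted.
 */
static int
demo_getnextbatch(DemoPageScan *scan, unsigned short *out, int maxitems)
{
    int         n = 0;

    assert(maxitems > 0);
    while (n < maxitems && scan->next < scan->nvis)
        out[n++] = scan->vis_offsets[scan->next++];
    return n;
}
```

A real implementation would advance to the next page when this returns 0 and would fill tuple headers rather than bare offsets; the sketch only shows the page-local, size-capped shape of each batch.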
> > For example, the index prefetching patch establishes a "shared" batch
> > struct, and the index AM is expected to fill it with data. After that,
> > the batch is managed entirely by indexam.c, with no AM calls. The only
> > AM-specific bit in the batch is "position", but that's used only when
> > advancing to the next page, etc.
>
> The major difficulty with my heap batching prototype is getting the
> layering right (no surprises there). In some sense we're deliberately
> sharing information across what we currently think of as
> different layers of abstraction, in order to be able to "schedule" the
> work more intelligently. There's a number of competing considerations.
>
> I have invented a new concept of heap batch, that is orthogonal to the
> existing concept of index batches. Right now these are just an array
> of HeapTuple structs that relate to exactly one group of
> contiguous heap TIDs (i.e. if the index scan returns TIDs even a
> little out of order, which is fairly common, we cannot currently
> reorder the work in the current prototype patch).
>
> Once a batch is prepared, calls to heapam_index_fetch_tuple just
> return the next TID from the batch (until the next time we have to
> return a TID pointing to some distinct heap block). In the case of
> pgbench queries like the one I mentioned, we only need to call
> LockBuffer/heap_hot_search_buffer once for every 61 heap tuples
> returned (not once per heap tuple returned).
>
> Importantly, the new interface added by my new prototype spinoff patch
> is higher level than the existing
> table_index_fetch_tuple/heapam_index_fetch_tuple interface. The
> executor asks the table AM "give me the next heap TID in the current
> scan direction", rather than asking "give me this heap TID". The
> general idea is that the table AM has a direct understanding of
> ordered index scans.
>
> The advantage of this higher-level interface is that it gives the
> table AM maximum freedom to reorder work. As I said already, we won't
> do things like merge together logically noncontiguous accesses to the
> same heap page into one physical access right now. But I think that
> that should at least be enabled by this interface.
Interesting. It sounds like you aim to replace the fetch_tuple
interface with a more generic one, is that right?
> The downside of this approach is that the table AM (not the executor
> proper) is responsible for interfacing with the index AM layer. I
> think that this can be generalized without very much code duplication
> across table AMs. But it's hard.
Seems so.
> > This patch does things differently. IIUC, each TAM may produce its own
> > "batch", which is then wrapped in a generic one. For example, heap
> > produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> > is fine. In the prefetching we chose to move all this code (walking the
> > batch items) from the AMs into the layer above, and make it AM agnostic.
>
> I think that the base index prefetching patch's current notion of
> index-AM-wise batches can be kept quite separate from any table AM
> batch concept that might be invented, either as part of what I'm
> working on, or in Amit's patch.
>
> It probably wouldn't be terribly difficult to get the new interface
> I've described to return heap tuples in whatever batch format Amit
> comes up with. That only has a benefit if it makes life easier for
> expression evaluation in higher levels of the plan tree, but it might
> just make sense to always do it that way. I doubt that adopting Amit's
> batch format will make life much harder for the
> heap_hot_search_buffer-batching mechanism (at least if it is generally
> understood that its new index scan interface builds batches in
> Amit's format on a best-effort basis).
In my implementation, the new TableAmRoutine.scan_getnextbatch()
returns a batch as an opaque table AM structure, which can then be
passed up to the upper levels of the plan. Patch 0001 in my series
adds the following to the TableAmRoutine API:
+    /* ------------------------------------------------------------------------
+     * Batched scan support
+     * ------------------------------------------------------------------------
+     */
+
+    void       *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+    int         (*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+                                     ScanDirection dir);
+    void        (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
I haven't seen what your version looks like, but if it is compatible
with the above, I'd be happy to adopt a batch format that accommodates
multiple use cases.
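For readers who want to see how a caller might drive these three callbacks, below is a self-contained toy sketch. All Toy* types and functions are stand-ins invented for illustration; only the callback shapes follow the excerpt. It also assumes that the int result of scan_getnextbatch is the number of tuples placed in the batch, with 0 meaning end of scan, which the excerpt does not state.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for TableScanDesc and ScanDirection; everything here is a toy. */
typedef struct ToyScanDesc { int next_row; int total_rows; } ToyScanDesc;
typedef enum { ToyForward = 1 } ToyScanDirection;
typedef struct ToyBatch { int maxitems; int nitems; int *rows; } ToyBatch;

/* Same shape as the three callbacks in the excerpt. */
typedef struct ToyAmRoutine
{
    void       *(*scan_begin_batch) (ToyScanDesc *scan, int maxitems);
    int         (*scan_getnextbatch) (ToyScanDesc *scan, void *am_batch,
                                      ToyScanDirection dir);
    void        (*scan_end_batch) (ToyScanDesc *scan, void *am_batch);
} ToyAmRoutine;

static void *
toy_begin_batch(ToyScanDesc *scan, int maxitems)
{
    ToyBatch   *b = malloc(sizeof(ToyBatch));

    (void) scan;
    assert(b != NULL && maxitems > 0);
    b->maxitems = maxitems;
    b->nitems = 0;
    b->rows = malloc(sizeof(int) * (size_t) maxitems);
    assert(b->rows != NULL);
    return b;
}

/* Assumption: returns the number of rows in the batch, 0 at end of scan. */
static int
toy_getnextbatch(ToyScanDesc *scan, void *am_batch, ToyScanDirection dir)
{
    ToyBatch   *b = am_batch;

    (void) dir;
    b->nitems = 0;
    while (b->nitems < b->maxitems && scan->next_row < scan->total_rows)
        b->rows[b->nitems++] = scan->next_row++;
    return b->nitems;
}

static void
toy_end_batch(ToyScanDesc *scan, void *am_batch)
{
    ToyBatch   *b = am_batch;

    (void) scan;
    free(b->rows);
    free(b);
}

static const ToyAmRoutine toy_am = {toy_begin_batch, toy_getnextbatch, toy_end_batch};

/* Caller side: the batch stays opaque (void *) outside the AM. */
static int
toy_count_rows(const ToyAmRoutine *am, int total_rows, int maxitems, int *ncalls)
{
    ToyScanDesc scan = {0, total_rows};
    void       *batch = am->scan_begin_batch(&scan, maxitems);
    int         n;
    int         seen = 0;

    *ncalls = 0;
    while ((n = am->scan_getnextbatch(&scan, batch, ToyForward)) > 0)
    {
        seen += n;
        (*ncalls)++;
    }
    am->scan_end_batch(&scan, batch);
    return seen;
}
```

Scanning 150 rows with maxitems = 64 takes three non-empty calls (64 + 64 + 22) instead of 150 per-row calls.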
--
Thanks, Amit Langote
* Re: Batching in executor
@ 2025-10-28 13:40 Amit Langote <[email protected]>
parent: Tomas Vondra <[email protected]>
0 siblings, 2 replies; 29+ messages in thread
From: Amit Langote @ 2025-10-28 13:40 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers
On Tue, Oct 28, 2025 at 1:18 AM Tomas Vondra <[email protected]> wrote:
> On 10/27/25 08:24, Amit Langote wrote:
> > Thank you for reviewing the patch and taking the time to run those
> > experiments. I appreciate the detailed feedback and questions. I also
> > apologize for my late reply, I spent perhaps way too much time going
> > over your index prefetching thread trying to understand the notion of
> > batching that it uses and getting sidelined by other things while
> > writing this reply.
>
> Cool! Now you can do a review of the index prefetch patch ;-)
Would love to and I'm adding that to my list. :)
> >> How far ahead have you thought about these capabilities? I was wondering
> >> about two things in particular. First, at which point do we have to
> >> "materialize" the TupleBatch into some generic format (e.g. TupleSlots).
> >> I get it that you want to enable passing batches between nodes, but
> >> would those use the same "format" as the underlying scan node, or some
> >> generic one? Second, will it be possible to execute expressions on the
> >> custom batches (i.e. on "compressed data")? Or is it necessary to
> >> "materialize" the batch into regular tuple slots? I realize those may
> >> not be there "now" but maybe it'd be nice to plan for the future.
> >
> > I have been thinking about those future capabilities. Currently, the
> > patch keeps tuples in the TAM-specific batch format up until they need
> > to be consumed by a node that doesn’t understand that format or has
> > not been modified to invoke the TAM callbacks to decode it. In the
> > current patch, that means we materialize to regular TupleTableSlots at
> > nodes that require it (for example, the scan node reading from TAM
> > needing to evaluate quals, etc.). However, the intention is to allow
> > batches to be passed through as many nodes as possible without
> > materialization, ideally using the same format produced by the scan
> > node all the way up until reaching a node that can only work with
> > tuples in TupleTableSlots.
> >
> > As for executing expressions directly on the custom batch data: that’s
> > something I would like to enable in the future. Right now, expressions
> > (quals, projections, etc.) are evaluated after materializing into
> > normal tuples in TupleTableSlots stored in TupleBatch, because the
> > > expression evaluation code isn’t yet fully batch-aware and is very far
> > > from doing things like operating on compressed data in its native form.
> > Patches 0004-0008 do try to add batch-aware expression evaluation but
> > that's just a prototype. In the long term, the goal is to allow
> > expression evaluation on batch data (for example, applying a WHERE
> > clause or aggregate transition directly on a columnar batch without
> > converting it to heap tuples first). This will require significant new
> > infrastructure (perhaps specialized batch-aware expression operators
> > and functions), so it's not in the current patch, but I agree it's
> > important to plan for it. The current design doesn’t preclude it, it
> > lays some groundwork by introducing the batch abstraction -- but fully
> > supporting that will be future work.
> >
> > That said, one area I’d like to mention while at it, especially to
> > enable native execution on compressed or columnar batches, is giving
> > the table AM more control over how expression evaluation is performed
> > on its batch data. In the current patch, the AM can provide a
> > materialize function via TupleBatchOps, but that always produces an
> > array of TupleTableSlots stored in the TupleBatch, not an opaque
> > representation that remains under AM control. Maybe that's not bad for
> > a v1 patch.
>
> I think materializing into a batch of TupleTableSlots (and then doing
> the regular expression evaluation) seems perfectly fine for v1. It's the
> simplest fallback possible, and we'll need it anyway if overriding the
> expression evaluation will be optional (which I assume it will be?).
Yes. The ability to materialize into TupleTableSlots won't be
optional for the table AM's BatchOps. Converting to other formats
would be.
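A rough sketch of that split, with a required materialize callback and an optional AM-specific path, could look like the following. The Toy* names and the sum_native callback are hypothetical and exist only for illustration; the patch's TupleBatchOps may be shaped differently.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ITEMS 8

typedef struct ToyBatch ToyBatch;

typedef struct ToyBatchOps
{
    /* Required: every AM must be able to hand back plain rows. */
    int         (*materialize_all) (const ToyBatch *batch, int *rows_out);
    /* Optional (hypothetical): AM-specific fast path, may be NULL. */
    long        (*sum_native) (const ToyBatch *batch);
} ToyBatchOps;

struct ToyBatch
{
    const ToyBatchOps *ops;
    int         nitems;
    int         payload[TOY_MAX_ITEMS]; /* stands in for the AM's private format */
};

static int
toy_materialize_all(const ToyBatch *batch, int *rows_out)
{
    for (int i = 0; i < batch->nitems; i++)
        rows_out[i] = batch->payload[i];
    return batch->nitems;
}

static long
toy_sum_native(const ToyBatch *batch)
{
    long        sum = 0;

    for (int i = 0; i < batch->nitems; i++)
        sum += batch->payload[i];
    return sum;
}

/* Use the optional path when the AM provides it, else fall back to rows. */
static long
toy_batch_sum(const ToyBatch *batch)
{
    int         rows[TOY_MAX_ITEMS];
    int         n;
    long        sum = 0;

    assert(batch->ops != NULL && batch->ops->materialize_all != NULL);
    assert(batch->nitems >= 0 && batch->nitems <= TOY_MAX_ITEMS);

    if (batch->ops->sum_native != NULL)
        return batch->ops->sum_native(batch);

    n = batch->ops->materialize_all(batch, rows);
    for (int i = 0; i < n; i++)
        sum += rows[i];
    return sum;
}

static const ToyBatchOps toy_ops_basic = {toy_materialize_all, NULL};
static const ToyBatchOps toy_ops_native = {toy_materialize_all, toy_sum_native};
```

Both ops tables give the same answer; the only difference is whether the generic fallback or the AM's own path does the work.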
> > When evaluating expressions over a batch, a BatchVector
> > is built by looping over these slots and invoking the standard
> > per-tuple getsomeattrs() to "deform" a tuple into needed columns.
> > While that enables batch-style EEOPs for qual evaluation and aggregate
> > transition (and is already a gain over per-row evaluation), it misses
> > the opportunity to leverage any batch-specific optimizations the AM
> > could offer, such as vectorized decoding or filtering over compressed
> > data, and other AM optimizations for getting only the necessary
> > columns out possibly in a vector format.
> >
>
> I'm not sure about this BatchVector thing. I haven't looked into that
> very much, I'd expect the construction to be more expensive than the
> benefits (compared to just doing the materialize + regular evaluation),
> but maybe I'm completely wrong. Or maybe we could keep the vector
> representation for multiple operations? No idea.
Constructing the BatchVector does require looping over the batch and
deforming each tuple, typically via getsomeattrs(). So yes, there’s an
up-front cost similar to materialization. But the goal is to amortize
that by enabling expression evaluation to run in a tight loop over
column vectors, avoiding repeated jumps into slot/AM code for each
tuple and each column. That can reduce branching and improve locality.
In its current form, the BatchVector is ephemeral -- it's built just
before expression evaluation and discarded after. But your idea of
reusing the same vector across multiple operations is interesting.
That would let us spread out the construction cost even further and
might be necessary to justify the overhead fully in some cases. I’ll
keep that in mind.
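As a toy illustration of that trade-off (hypothetical types, not the patch's BatchVector), the sketch below pays one pass over the rows to pull a column into a plain array and then evaluates a qual in a tight loop over that array.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH_MAX 64

typedef struct ToyRow { int a; int b; } ToyRow;    /* stands in for a slot */

typedef struct ToyColumnVector
{
    int         nvalues;
    int         values[TOY_BATCH_MAX];
} ToyColumnVector;

/* Up-front cost: one pass over the batch to extract column "a". */
static void
toy_build_vector(const ToyRow *rows, int nrows, ToyColumnVector *vec)
{
    assert(nrows >= 0 && nrows <= TOY_BATCH_MAX);
    for (int i = 0; i < nrows; i++)
        vec->values[i] = rows[i].a;
    vec->nvalues = nrows;
}

/* Payoff: "a > threshold" evaluated in a tight loop over plain ints. */
static int
toy_count_greater(const ToyColumnVector *vec, int threshold)
{
    int         matches = 0;

    for (int i = 0; i < vec->nvalues; i++)
        matches += (vec->values[i] > threshold);
    return matches;
}
```

Calling toy_count_greater() twice on the same vector mirrors the idea of reusing one vector across several operations, so the build cost is paid once.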
> But it seems like a great area for experimenting ...
Yep.
> > I’m considering extending TupleTableSlotOps with a batch-aware variant
> > of getsomeattrs(), something like slot_getsomeattrs_batch(), so that
> > AMs can populate column vectors (e.g., BatchVector) directly from
> > their native format. That would allow bypassing slot materialization
> > entirely and plug AM-provided decoding logic directly into the
> > executor’s batch expression paths. This isn’t implemented yet, but I
> > see it as a necessary step toward supporting fully native expression
> > evaluation over compressed or columnar formats. I’m not yet sure if
> > TupleTableSlotOps is the right place for such a hook, it might belong
> > elsewhere in the abstraction, but exposing a batch-aware interface for
> > this purpose seems like the right direction.
> >
>
> No opinion. I don't see it as a necessary prerequisite for the other
> parts of the patch series, but maybe the BatchVector really helps, and
> then this would make perfect sense. I'm not sure there's a single
> "correct" sequence in which to do these improvements, it's always a
> matter of opinion.
Yes, I think we can come back to this later.
> >> The other option would be to "create batches" during execution, say by
> >> having a new node that accumulates tuples, builds a batch and sends it
> >> to the node above. This would help both in cases when either the lower
> >> node does not produce batches at all, or the batches are too small (due
> >> to filtering, aggregation, ...). Of course, it'd only win if this
> >> increases efficiency of the upper part of the plan enough to pay for
> >> building the batches. That can be a hard decision.
> >
> > Yes, introducing a dedicated executor node to accumulate and form
> > batches on the fly is an interesting idea, I have thought about it and
> > even mentioned it in passing in the pgconf.dev unconference. This
> > could indeed cover scenarios where the data source (a node) doesn't
> > produce batches (e.g., a non-batching node feeding into a
> > batching-aware upper node) or where batches coming from below are too
> > small to be efficient. The current patch set doesn’t implement such a
> > node; I focused on enabling batching at the scan/TAM level first. The
> > cost/benefit decision for a batch-aggregator node is tricky, as you
> > said. We’d need a way to decide when the overhead of gathering tuples
> > into a batch is outweighed by the benefits to the upper node. This
> > likely ties into costing or adaptive execution decisions. It's
> > something I’m open to exploring in a future iteration, perhaps once we
> > have more feedback on how the existing batching performs in various
> > scenarios. It might also require some planner or executor smarts
> > (maybe the executor can decide to batch on the fly if it sees a
> > pattern of use, or the planner could insert such nodes when
> > beneficial).
> >
>
> Yeah, those are good questions. I don't have a clear idea how we should
> decide when to do this batching. Costing during planning is the
> "traditional" option, with all the issues (e.g. it requires a reasonably
> good cost model). Another option would be some sort of execution-time
> heuristics - but then which node would be responsible for building the
> batches (if we didn't create them during planning)?
>
> I agree it makes sense to focus on batching at the TAM/scan level for
> now. That's a pretty big project already.
Right -- batching at the TAM/scan level is already a sizable project,
especially given its interaction with prefetching work (maybe). I
think it's best to focus design effort there and on batched expression
evaluation first, and only revisit batch-creation nodes once that
groundwork is in place.
> >> In fact, how shall the optimizer decide whether to use batching? It's
> >> one thing to decide whether a node can produce/consume batches, but
> >> another thing is "should it"? With a node that "builds" a batch, this
> >> decision would apply to even more plans, I guess.
> >>
> >> I don't have a great answer to this, it seems like an incredibly tricky
> >> costing issue. I'm a bit worried we might end up with something too
> >> coarse, like "jit=on" which we know is causing problems (admittedly,
> >> mostly due to a lot of the LLVM work being unpredictable/external). But
> >> having some "adaptive" heuristics (like the gradual ramp up) might make
> >> it less risky.
> >
> > I agree that deciding when to use batching is tricky. So far, the
> > patch takes a fairly simplistic approach: if a node (particularly a
> > scan node) supports batching, it just does it, and other parts of the
> > plan will consume batches if they are capable. There isn’t yet a
> > nuanced cost-based decision in the planner for enabling batching. This
> > is indeed something we’ll have to refine. We don’t want to end up with
> > a blunt on/off GUC that could cause regressions in some cases.
> >
> > One idea is to introduce costing for batching: for example, estimate
> > the per-tuple savings from batching vs the overhead of materialization
> > or batch setup. However, developing a reliable cost model for that
> > will take time and experimentation, especially with the possibility of
> > variable batch sizes or adaptive behavior. Not to mention, that will
> > be adding one more dimension to planner's costing model making the
> > planning more expensive and unpredictable. In the near term, I’m fine
> > with relying on feedback and perhaps manual tuning (GUCs, etc.) to
> > decide on batching, but that’s perhaps not a long-term solution.
> >
>
> Yeah, the cost model is going to be hard, because this depends on so
> much low-level plan/hardware details. Like, the TAM may allow execution
> on compressed data / leverage vectorization, .... But maybe the CPU does
> not do that efficiently? There's so many unknown unknowns ...
>
> Considering we still haven't fixed the JIT cost model, maybe it's better
> to not rely on it too much for this batching patch? Also, all those
> details contradict the idea that cost models are a simplified model of
> the reality.
Yeah, totally agreed -- the complexity and unpredictability here are
real, and your point about JIT costing is a good reminder not to
over-index on planner models for now.
> > I share your inclination that adaptive heuristics might be the safer
> > path initially. Perhaps the executor can decide to batch or not batch
> > based on runtime conditions. The gradual ramp-up of batch size is one
> > such adaptive approach. We could also consider things like monitoring
> > how effective batching is (are we actually processing full batches or
> > frequently getting cut off?) and adjust behavior. These are somewhat
> > speculative ideas at the moment, but the bottom line is I’m aware we
> > need a smarter strategy than a simple switch. This will likely evolve
> > as we test the patch in more scenarios.
> >
>
> I think the big question is how much can the batching change the
> relative cost of two plans (I mean, actual cost, not just estimates).
>
> Imagine plans P1 and P2, where
>
> cost(P1) < cost(P2) = cost(P1) + delta
>
> where "delta" is small (so P1 is faster, but not much). If we
> "batchify" the plans into P1' and P2', can this happen?
>
> cost(P1') >> cost(P2')
>
> That is, can the "slower" plan P2 benefit much more from the batching,
> making it significantly faster?
>
> If this is unlikely, we could entirely ignore batching during planning,
> and only do that as post-processing on the selected plan, or perhaps
> even just during execution.
>
> OTOH that's what JIT does, and we know it's not perfect - but that's
> mostly because JIT has rather unpredictable costs when enabling. Maybe
> batching doesn't have that.
That’s an interesting scenario. I suspect batching (even with SIMD)
won’t usually flip plan orderings that dramatically -- i.e., turning
the clearly slower plan into the faster one -- though I could be
wrong. But I agree with the conclusion: this supports treating
batching as an executor concern, at least initially. Might be worth
seeing if there’s any relevant guidance in systems literature too.
> >> FWIW the current batch size limit (64 tuples) seems rather low, but it's
> >> hard to say. It'd be good to be able to experiment with different
> >> values, so I suggest we make this a GUC and not a hard-coded constant.
> >
> > Yeah, I was thinking the same while testing -- the optimal batch size
> > might vary by workload or hardware, and 64 was a somewhat arbitrary
> > starting point. I will make the batch size limit configurable
> > (probably as a GUC executor_batch_tuples, maybe only developer-focused
> > at first). That will let us and others experiment easily with
> > different batch sizes to see how it affects performance. It should
> > also help with your earlier point: for example, on a machine where 64
> > is too low or too high, we can adjust it without recompiling. So yes,
> > I'll add a GUC for the batch size in the next version of the patch.
> >
>
> +1 to have developer-only GUC for testing. But the goal should be to not
> expect users to tune this.
Yes.
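One possible shape for the gradual ramp-up mentioned earlier, combined with a configurable limit, is sketched below. The doubling policy and the function name are assumptions made for illustration; the thread does not specify how the ramp-up would work.

```c
#include <assert.h>

/*
 * Hypothetical ramp-up policy: double the batch size after each batch,
 * never exceeding the configured limit.
 */
static int
toy_next_batch_size(int current, int limit)
{
    assert(current > 0 && limit > 0);

    if (current >= limit)
        return limit;

    /* Doubling would overshoot the limit, so clamp to it. */
    if (current > limit / 2)
        return limit;

    return current * 2;
}
```

Starting from 1 with a limit of 64, the sizes go 1, 2, 4, ..., 64, so a query that stops after a few rows never pays for a full batch.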
> >> As for what to add to explain, I'd start by adding info about which
> >> nodes are "batched" (consuming/producing batches), and some info about
> >> the batch sizes. An average size, maybe a histogram if you want to be a
> >> bit fancy.
> >
> > Adding more information to EXPLAIN is a good idea. In the current
> > patch, EXPLAIN does not show anything about batching, but it would be
> > very helpful for debugging and user transparency to indicate which
> > nodes are operating in batch mode. I will update EXPLAIN to mark
> > nodes that produce or consume batches. Likely I’ll start with
> > something simple like an extra line or tag for a node, e.g., "Batch:
> > true (avg batch size 64)" or something along those lines. An average
> > batch size could be computed if we have instrumentation, which would
> > be useful to see if, say, the batch sizes ended up smaller due to
> > LIMIT or other factors. A full histogram might be more detail than
> > most users need, but I agree even just knowing average or maximum
> > batch size per node could be useful for performance analysis. I'll
> > implement at least the basics for now, and we can refine it (maybe add
> > more stats) if needed.
>
> +1 to start with something simple
>
> >
> > (I had added a flag in the EXPLAIN output at one point, but removed it
> > due to finding the regression output churn too noisy, though I
> > understand I'll have to bite the bullet at some point.)
> >
>
> Why would there be regression churn, if the option is disabled by default?
executor_batching is on by default in my patch, so a seq scan will
always use batching provided the query features preventing it are not
present, which is true for a huge number of plans appearing in
regression suite output.
> >> Now, numbers from some microbenchmarks:
> >>
> >> ...
> >>
> >> Perhaps I did something wrong. It does not surprise me this is somewhat
> >> CPU dependent. It's a bit sad the improvements are smaller for the newer
> >> CPU, though.
> >
> > Thanks for sharing your benchmark results -- that’s very useful data.
> > I haven’t yet finished investigating why there's a regression relative
> > to master when executor_batching is turned off. I re-ran my benchmarks
> > to include comparisons with master and did observe some regressions in
> > a few cases too, but I didn't see anything obvious in profiles that
> > explained the slowdown. I initially assumed it might be noise, but now
> > I suspect it could be related to structural changes in the scan code
> > -- for example, I added a few new fields in the middle of
> > HeapScanDescData, and even though the batching logic is bypassed when
> > executor_batching is off, it’s possible that change alone affects
> > memory layout or cache behavior in a way that penalizes the unbatched
> > path. I haven’t confirmed that yet, but it’s on my list to look into
> > more closely.
> >
> > Your observation that newer CPUs like the Ryzen may see smaller
> > improvements makes sense -- perhaps they handle the per-tuple overhead
> > more efficiently to begin with. Still, I’d prefer not to see
> > regressions at all, even in the unbatched case, so I’ll focus on
> > understanding and fixing that part before drawing conclusions from the
> > performance data.
> >
> > Thanks again for the scripts -- those will help a lot in narrowing things down.
>
> If needed, I can rerun the tests and collect additional information
> (e.g. maybe perf-stat or perf-diff would be interesting).
That would be nice to see if you have the time, but maybe after I post
a new version.
--
Thanks, Amit Langote
* Re: Batching in executor
@ 2025-10-28 14:32 Daniil Davydov <[email protected]>
parent: Amit Langote <[email protected]>
1 sibling, 1 reply; 29+ messages in thread
From: Daniil Davydov @ 2025-10-28 14:32 UTC (permalink / raw)
To: Amit Langote <[email protected]>; +Cc: Tomas Vondra <[email protected]>; pgsql-hackers
Hi,
As far as I understand, this work partially overlaps with what we did in the
thread [1] (in short - we introduce support for batching within the ModifyTable
node). Am I correct?
It's worth saying that the patch in that thread is already quite old - now I
have an improved implementation and tests for it (as well as performance
measurements). But the basic idea and design remained unchanged.
Maybe we can combine approaches? I haven't reviewed patches in this thread
yet, but I'll try to do it in the near future.
[1] https://www.postgresql.org/message-id/flat/CALj2ACVi9eTRYR%3Dgdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ%40m...
--
Best regards,
Daniil Davydov
* Re: Batching in executor
@ 2025-10-29 02:22 Amit Langote <[email protected]>
parent: Daniil Davydov <[email protected]>
0 siblings, 1 reply; 29+ messages in thread
From: Amit Langote @ 2025-10-29 02:22 UTC (permalink / raw)
To: Daniil Davydov <[email protected]>; +Cc: Tomas Vondra <[email protected]>; pgsql-hackers
Hi Daniil,
On Tue, Oct 28, 2025 at 11:32 PM Daniil Davydov <[email protected]> wrote:
>
> Hi,
>
> As far as I understand, this work partially overlaps with what we did in the
> thread [1] (in short - we introduce support for batching within the ModifyTable
> node). Am I correct?
There might be some relation, but not much overlap. The thread you
mention seems to focus on batching in the write path (for INSERT,
etc.), while this work targets batching in the read path via Table AM
scan callbacks. I think they can be developed independently, though
I'm happy to take a look.
--
Thanks, Amit Langote
* Re: Batching in executor
@ 2025-10-29 06:37 Amit Langote <[email protected]>
parent: Amit Langote <[email protected]>
1 sibling, 1 reply; 29+ messages in thread
From: Amit Langote @ 2025-10-29 06:37 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers
On Tue, Oct 28, 2025 at 10:40 PM Amit Langote <[email protected]> wrote:
> That would be nice to see if you have the time, but maybe after I post
> a new version.
I’ve created a CF entry marked WoA for this in the next CF under the
title “Batching in executor, part 1: add batch variant of table AM
scan API.” The idea is to track this piece separately so that later
parts can have their own entries and we don’t end up with a single
long-lived entry that never gets marked done. :-)
--
Thanks, Amit Langote
* Re: Batching in executor
@ 2025-10-30 12:12 Daniil Davydov <[email protected]>
parent: Amit Langote <[email protected]>
0 siblings, 1 reply; 29+ messages in thread
From: Daniil Davydov @ 2025-10-30 12:12 UTC (permalink / raw)
To: Amit Langote <[email protected]>; +Cc: Tomas Vondra <[email protected]>; pgsql-hackers
Hi,
On Wed, Oct 29, 2025 at 9:23 AM Amit Langote <[email protected]> wrote:
>
> Hi Daniil,
>
> On Tue, Oct 28, 2025 at 11:32 PM Daniil Davydov <[email protected]> wrote:
> >
> > Hi,
> >
> > As far as I understand, this work partially overlaps with what we did in the
> > thread [1] (in short - we introduce support for batching within the ModifyTable
> > node). Am I correct?
>
> There might be some relation, but not much overlap. The thread you
> mention seems to focus on batching in the write path (for INSERT,
> etc.), while this work targets batching in the read path via Table AM
> scan callbacks. I think they can be developed independently, though
> I'm happy to take a look.
Oh, I got it. Thanks!
I looked at patches 0001-0003 and have some comments:
1)
I noticed that some Nodes may set SO_ALLOW_PAGEMODE flag to 'false'
during ExecReScan. heap_getnextslot works carefully with it - checks whether
pagemode is allowed at every call. If not - it just uses tuple-at-a-time mode.
At the same time, heap_getnextbatch always expects that pagemode is enabled.
I didn't find any code path that can make the assertion [1] fail. If such a
code path is unreachable under any circumstances, maybe we should add a
comment explaining why?
2)
heapgettup_pagemode_batch : Do we really need to compute lineindex variable
in this way? :
***
lineindex = scan->rs_cindex + dir;
if (ScanDirectionIsForward(dir))
    linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
        (scan->rs_ntuples - lineindex) : 0;
***
As far as I understand, this is enough :
***
lineindex = scan->rs_cindex + dir;
if (ScanDirectionIsForward(dir))
    linesleft = scan->rs_ntuples - lineindex;
***
3)
Is this code inside heapgettup_pagemode_batch necessary? :
***
ScanDirectionIsForward(dir) ? 0 : 0
***
4)
heapgettup_pagemode has this change :
HeapTuple tuple = &(scan->rs_ctup) ---> HeapTuple tuple = &scan->rs_ctup
I guess it was changed accidentally.
5)
I apologize for the tediousness, but these braces are not in the
postgres style :
***
static const TupleBatchOps TupleBatchHeapOps = {
    .materialize_all = heap_materialize_batch_all
};
***
[1] heap_getnextbatch : Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE)
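For comment 2, the two forms differ only when lineindex is larger than the tuple count. The toy sketch below shows that case under the assumption that both values are unsigned 32-bit integers, as the (uint32) cast in the excerpt suggests. Whether lineindex can actually exceed rs_ntuples on this code path is the open question; the function names here are invented.

```c
#include <assert.h>
#include <stdint.h>

/* Clamped form, as in the patch excerpt quoted in comment 2. */
static uint32_t
toy_linesleft_clamped(uint32_t lineindex, uint32_t ntuples)
{
    return (lineindex <= ntuples) ? (ntuples - lineindex) : 0;
}

/* Plain subtraction, as proposed in comment 2. */
static uint32_t
toy_linesleft_plain(uint32_t lineindex, uint32_t ntuples)
{
    /* Wraps around to a huge value if lineindex > ntuples. */
    return ntuples - lineindex;
}
```

Both return 3 for lineindex = 2 and ntuples = 5. For lineindex = 6 and ntuples = 5, the clamped form returns 0 while the plain form wraps around to UINT32_MAX, so the simplification is safe only if that input cannot occur.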
--
Best regards,
Daniil Davydov
* Re: Batching in executor
@ 2025-12-04 15:54 Amit Langote <[email protected]>
parent: Amit Langote <[email protected]>
0 siblings, 1 reply; 29+ messages in thread
From: Amit Langote @ 2025-12-04 15:54 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers
On Wed, Oct 29, 2025 at 3:37 PM Amit Langote <[email protected]> wrote:
> On Tue, Oct 28, 2025 at 10:40 PM Amit Langote <[email protected]> wrote:
> > That would be nice to see if you have the time, but maybe after I post
> > a new version.
>
> I’ve created a CF entry marked WoA for this in the next CF under the
> title “Batching in executor, part 1: add batch variant of table AM
> scan API.” The idea is to track this piece separately so that later
> parts can have their own entries and we don’t end up with a single
> long-lived entry that never gets marked done. :-)
I intend to continue working on this, so have just moved it into the
next fest. I will post a new patch version next week that addresses
Daniil's comments and implements a few other things I mentioned
in my reply to Tomas on Oct 28; sorry for the delay.
--
Thanks, Amit Langote
* Re: Batching in executor
@ 2025-12-20 14:12 Amit Langote <[email protected]>
parent: Amit Langote <[email protected]>
0 siblings, 1 reply; 29+ messages in thread
From: Amit Langote @ 2025-12-20 14:12 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers
On Fri, Dec 5, 2025 at 12:54 AM Amit Langote <[email protected]> wrote:
> On Wed, Oct 29, 2025 at 3:37 PM Amit Langote <[email protected]> wrote:
> > On Tue, Oct 28, 2025 at 10:40 PM Amit Langote <[email protected]> wrote:
> > > That would be nice to see if you have the time, but maybe after I post
> > > a new version.
> >
> > I’ve created a CF entry marked WoA for this in the next CF under the
> > title “Batching in executor, part 1: add batch variant of table AM
> > scan API.” The idea is to track this piece separately so that later
> > parts can have their own entries and we don’t end up with a single
> > long-lived entry that never gets marked done. :-)
>
> I intend to continue working on this, so have just moved it into the
> next fest. I will post a new patch version next week that addresses
> Daniil's comments and implements a few other things I mentioned
> in my reply to Tomas on Oct 28; sorry for the delay.
Before I go on vacation for a couple of weeks, here's an updated patch
set. I am only including the patches that add TAM interface, add
TupleBatch executor wrapper for TAM batches, and use it in SeqScan as I had
posted before. There is a new patch to add a BATCHES option to EXPLAIN. I
renamed the testing GUC to executor_batch_rows (integer) from the boolean
executor_batching. EXPLAIN (BATCHES) example:
+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
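The four numbers on the "Batches:" line can be maintained with a small per-node counter struct. The sketch below is hypothetical and is not the instrumentation code from the patch; it only shows one way such counters could be accumulated.

```c
#include <assert.h>

typedef struct ToyBatchStats
{
    long        nbatches;
    long        total_rows;
    int         max_rows;
    int         min_rows;       /* meaningful only when nbatches > 0 */
} ToyBatchStats;

/* Record one non-empty batch of nrows rows. */
static void
toy_stats_record(ToyBatchStats *s, int nrows)
{
    assert(nrows > 0);

    if (s->nbatches == 0 || nrows > s->max_rows)
        s->max_rows = nrows;
    if (s->nbatches == 0 || nrows < s->min_rows)
        s->min_rows = nrows;
    s->nbatches++;
    s->total_rows += nrows;
}

/* Average rows per batch, or 0.0 if no batch was recorded. */
static double
toy_stats_avg(const ToyBatchStats *s)
{
    return (s->nbatches > 0) ? (double) s->total_rows / (double) s->nbatches : 0.0;
}
```

Recording batches of 64, 64 and 22 rows yields Batches: 3, Avg Rows: 50.0, Max: 64, Min: 22.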
What I have not included in this set are the patches that add
ExecProcNodeBatch() so that TupleBatch can be passed from one plan node to
another (parent), and ExprEvalOps (EEOPs) for batched expression evaluation
(qual and aggregate transition). I would like to focus on the patches that
allow reading batches from TAM into Scan nodes (only SeqScan for now).
After I'm back from vacation, I will post patches for batched qual
evaluation in SeqScan filter quals (once bugs are fixed and polished).
Batching in Agg node can wait for now.
In the meantime, what I would like to have someone's thoughts on:
* the shape of the TAM APIs -- should I add a TAMBatch or something that is
created, populated, and destroyed by the TAM instead of the current void
pointer and TupleBatchOps that are initialized in the executor like this
(excerpt from 0002):
+    /* Lazily create the AM batch payload. */
+    if (node->ss.ps.ps_Batch->am_payload == NULL)
+    {
+        const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+        Assert(tam && tam->scan_begin_batch);
+        node->ss.ps.ps_Batch->am_payload =
+            table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+        node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+    }
* the shape of TupleBatch itself -- its contents and operations defined in
execBatch.c/h.
* any other thoughts you might have on the project, patches.
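To make the first question easier to discuss, here is one hypothetical shape for an AM-owned batch object: a common header that carries the ops pointer, embedded as the first member of the AM's own struct, so the executor never has to wire up callbacks itself. All names are invented for illustration and nothing like this exists in the posted patches.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ToyTamBatch ToyTamBatch;

typedef struct ToyTamBatchOps
{
    int         (*nitems) (const ToyTamBatch *batch);
    void        (*destroy) (ToyTamBatch *batch);
} ToyTamBatchOps;

/* Common header; an AM embeds it as the first member of its own struct. */
struct ToyTamBatch
{
    const ToyTamBatchOps *ops;
};

typedef struct ToyHeapBatch
{
    ToyTamBatch base;           /* must be first so the pointer can be cast */
    int         ntuples;
} ToyHeapBatch;

static int
toy_heap_nitems(const ToyTamBatch *batch)
{
    return ((const ToyHeapBatch *) batch)->ntuples;
}

static void
toy_heap_destroy(ToyTamBatch *batch)
{
    free(batch);
}

static const ToyTamBatchOps toy_heap_ops = {toy_heap_nitems, toy_heap_destroy};

/* The AM creates the batch and attaches its own callbacks. */
static ToyTamBatch *
toy_heap_create_batch(int ntuples)
{
    ToyHeapBatch *hb = malloc(sizeof(ToyHeapBatch));

    assert(hb != NULL);
    hb->base.ops = &toy_heap_ops;
    hb->ntuples = ntuples;
    return &hb->base;
}
```

Compared with the void pointer plus separately initialized ops, creation, population and destruction all stay inside the AM, at the cost of fixing a header layout that every AM must follow.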
Benchmark:
Scripts attached if you want to try them.
(Negative % = faster than master)
SELECT * FROM table LIMIT 1 OFFSET N:
Rows   Master   batch=0   vs master   batch=64   vs master
-----------------------------------------------------------
1M     11ms     11ms      -0%         8ms        -23%
2M     23ms     22ms      -1%         18ms       -23%
3M     36ms     34ms      -5%         27ms       -25%
4M     51ms     50ms      -2%         38ms       -26%
5M     64ms     64ms      -1%         48ms       -26%
10M    147ms    145ms     -1%         114ms      -22%
SELECT * FROM table WHERE a > 0 LIMIT 1 OFFSET N:
Rows   Master   batch=0   vs master   batch=64   vs master
-----------------------------------------------------------
1M     31ms     31ms      +0%         16ms       -48%
2M     64ms     64ms      -0%         34ms       -47%
3M     67ms     66ms      -1%         50ms       -25%
4M     91ms     90ms      -1%         71ms       -22%
5M     119ms    113ms     -5%         88ms       -26%
10M    262ms    261ms     -0%         205ms      -21%
SELECT * FROM table WHERE o > 0 LIMIT 1 OFFSET N (last column - deform-heavy):
Rows   Master   batch=0   vs master   batch=64   vs master
-----------------------------------------------------------
1M     38ms     37ms      -2%         38ms       +0%
2M     79ms     75ms      -6%         77ms       -4%
3M     182ms    186ms     +2%         160ms      -12%
4M     250ms    252ms     +1%         219ms      -12%
5M     314ms    316ms     +1%         273ms      -13%
10M    647ms    651ms     +1%         604ms      -7%
The smaller improvement with WHERE o > 0 is expected since accessing the
last column requires deforming most of the tuple, which dominates the
execution time. Future work on batched tuple deformation could help here.
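The deforming cost can be illustrated with a toy sketch. The row layout below is an assumption made purely for illustration (one length byte followed by the payload for each attribute); it is not PostgreSQL's heap tuple format, and `toy_attr_offset()` is a hypothetical helper. It only shows why reaching a late column in a variable-width row means stepping over every column before it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy layout (illustrative assumption only): each attribute is stored as
 * one length byte followed by that many payload bytes.
 *
 * Returns the offset of the payload of attribute 'target' and reports in
 * *steps how many earlier attributes had to be walked to find it.
 */
size_t
toy_attr_offset(const unsigned char *row, int target, int *steps)
{
    size_t      off = 0;

    assert(row != NULL && steps != NULL && target >= 0);

    *steps = 0;
    for (int i = 0; i < target; i++)
    {
        off += 1 + (size_t) row[off];   /* skip length byte and payload */
        (*steps)++;
    }

    return off + 1;             /* payload starts after the length byte */
}
```

In this toy layout, the first attribute is found with zero steps, while attribute k always costs k steps, whatever the batch size is. That is roughly why batching the scan alone helps less when the qual references the last column.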
Note on regressions with executor_batch_rows = 0 vs master:
I am no longer seeing the regressions with batch_rows=0 vs master that I
saw before. I think some of that might be due to my removing some stray
fields from HeapScanData that were accidentally left there in the earlier
patches. Also, the regressions I was observing earlier seemed to have more
to do with using gcc to compile the master tree and clang to compile the
patched tree, which resulted in code layout changes that seemed to cause the
patched binary to regress. It would be nice if others could verify these
numbers.
--
Thanks, Amit Langote
Attachments:
[application/octet-stream] v4-0001-Add-batch-table-AM-API-and-heapam-implementation.patch (13.4K, 3-v4-0001-Add-batch-table-AM-API-and-heapam-implementation.patch)
From 24a3d208db93312788745882a01b526957919966 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 17:21:56 +0900
Subject: [PATCH v4 1/3] Add batch table AM API and heapam implementation
Introduce new table AM callbacks to fetch multiple tuples per call.
This reduces per-tuple call overhead by letting executor nodes work
in batches.
Define a HeapBatch structure in heapam.h and the supporting table AM
callbacks and wrappers in tableam.h.
Batches are limited to tuples from a single page and at most
EXEC_BATCH_ROWS (currently 64) entries.
Provide initial heapam support with heapgettup_pagemode_batch().
No executor node is switched over yet; a later commit will adapt
SeqScan to use this API. Other nodes may adopt it in the future.
Also add pgstat_count_heap_getnext_batch() to record batched fetches
in pgstat.
Reviewed-by: Daniil Davydov <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 219 ++++++++++++++++++++++-
src/backend/access/heap/heapam_handler.c | 4 +
src/include/access/heapam.h | 18 ++
src/include/access/tableam.h | 58 ++++++
src/include/pgstat.h | 5 +
5 files changed, 303 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6daf4a87dec..fcc0813f139 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1023,7 +1023,7 @@ heapgettup_pagemode(HeapScanDesc scan,
int nkeys,
ScanKey key)
{
- HeapTuple tuple = &(scan->rs_ctup);
+ HeapTuple tuple = &scan->rs_ctup;
Page page;
uint32 lineindex;
uint32 linesleft;
@@ -1104,6 +1104,132 @@ continue_page:
scan->rs_inited = false;
}
+/*
+ * heapgettup_pagemode_batch
+ * Collect up to 'maxitems' visible tuples from a single page in page mode.
+ *
+ * This function returns a *batch* of tuples from one heap page. If the
+ * current page (as tracked by the scan desc) has no more tuples left,
+ * it will advance to the next page and prepare it (via heap_prepare_pagescan).
+ * It will not cross a page boundary while filling the batch.
+ *
+ * Return value:
+ * number of tuples written into 'tdata' (0 at end-of-scan).
+ *
+ * Side effects:
+ * - Ensures rs_cbuf pins the page from which tuples were produced.
+ * - Sets rs_cblock, rs_cindex, rs_ntuples consistently (same as
+ * heapgettup_pagemode's inner-loop effects).
+ * - Does *not* change buffer pin counts except through normal page
+ * transitions performed by heap_fetch_next_buffer().
+ */
+static int
+heapgettup_pagemode_batch(HeapScanDesc scan,
+ ScanDirection dir,
+ int nkeys, ScanKey key,
+ HeapTupleData *tdata,
+ int maxitems)
+{
+ Page page;
+ uint32 lineindex;
+ uint32 linesleft;
+ int nout = 0;
+ Relation rel = scan->rs_base.rs_rd;
+ Oid tableOid = RelationGetRelid(rel);
+ TupleDesc tupdesc = key ? RelationGetDescr(rel) : NULL;
+
+ /*
+ * Current batching limitations (may be relaxed in future):
+ *
+ * - Forward scans only: backward scan support would require changes to
+ * batch iteration and page advancement logic.
+ *
+ * - Pagemode required: batching relies on the pre-built rs_vistuples[]
+ * array from heap_prepare_pagescan(). This is guaranteed by
+ * ScanCanUseBatching() which only enables batching when SO_ALLOW_PAGEMODE
+ * is set. Unlike heap_getnextslot, we don't support dynamic fallback to
+ * tuple-at-a-time mode since the batch execution path is selected at
+ * ExecInit time.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(maxitems > 0);
+
+ /*
+ * If we have no current page (or the current page is exhausted),
+ * advance to the next page that has any visible tuples and prepare it.
+ * This mirrors the outer loop of heapgettup_pagemode(), but we stop
+ * as soon as we have a prepared page; we never produce from two pages.
+ */
+ for (;;)
+ {
+ if (BufferIsValid(scan->rs_cbuf))
+ {
+ /* Are there more visible tuples left on this page? */
+ lineindex = scan->rs_cindex + dir;
+ linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
+ (scan->rs_ntuples - lineindex) : 0;
+ if (linesleft > 0)
+ break; /* continue on this page */
+ }
+
+ /* Move to next page and prepare its visible tuple list. */
+ heap_fetch_next_buffer(scan, dir);
+
+ if (!BufferIsValid(scan->rs_cbuf))
+ {
+ /* end of scan; keep rs_cbuf invalid like heapgettup_pagemode */
+ scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_prefetch_block = InvalidBlockNumber;
+ scan->rs_inited = false;
+ return 0;
+ }
+
+ Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+ heap_prepare_pagescan((TableScanDesc) scan);
+
+ /* After prepare, either rs_ntuples > 0 or we'll loop again. */
+ if (scan->rs_ntuples > 0)
+ {
+ lineindex = 0;
+ linesleft = scan->rs_ntuples;
+ break;
+ }
+ /* else: page had no visible tuples; continue to next page */
+ }
+
+ /* From here on, we must only read tuples from this single page. */
+ page = BufferGetPage(scan->rs_cbuf);
+
+ /*
+ * Walk rs_vistuples[] from 'lineindex', copying headers into tdata[]
+ * until either the page is exhausted or the batch capacity is reached.
+ */
+ for (; linesleft > 0 && nout < maxitems; linesleft--, lineindex += dir)
+ {
+ OffsetNumber lineoff;
+ ItemId lpp;
+ HeapTupleData *dst = &tdata[nout];
+
+ Assert(lineindex <= (uint32) scan->rs_ntuples);
+ lineoff = scan->rs_vistuples[lineindex];
+ lpp = PageGetItemId(page, lineoff);
+ Assert(ItemIdIsNormal(lpp));
+
+ dst->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+ dst->t_len = ItemIdGetLength(lpp);
+ dst->t_tableOid = tableOid;
+ ItemPointerSet(&(dst->t_self), scan->rs_cblock, lineoff);
+
+ if (key != NULL && !HeapKeyTest(dst, tupdesc, nkeys, key))
+ continue;
+
+ scan->rs_cindex = lineindex;
+ nout++;
+ }
+
+ return nout;
+}
/* ----------------------------------------------------------------
* heap access method interface
@@ -1436,6 +1562,97 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
return true;
}
+/*---------- Batching support -----------*/
+
+/*
+ * heap_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ HeapBatch *hb;
+ Oid relid;
+
+ Assert(maxitems > 0);
+
+ hb = palloc(sizeof(HeapBatch));
+ hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
+ hb->maxitems = maxitems;
+ hb->nitems = 0;
+ hb->buf = InvalidBuffer;
+
+ /* Initialize static fields of HeapTupleData. Row bodies remain on page. */
+ relid = RelationGetRelid(sscan->rs_rd);
+ for (int i = 0; i < maxitems; i++)
+ hb->tupdata[i].t_tableOid = relid;
+
+ return hb;
+}
+
+/*
+ * heap_scan_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ if (BufferIsValid(hb->buf))
+ ReleaseBuffer(hb->buf);
+
+ pfree(hb->tupdata);
+ pfree(hb);
+}
+
+int
+heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ HeapScanDesc scan = (HeapScanDesc) sscan;
+ HeapBatch *hb = (HeapBatch *) am_batch;
+ Buffer curbuf;
+ int n;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(hb->maxitems > 0);
+
+ /* Drop prior batch pin, if any. */
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+
+ hb->nitems = 0;
+
+ /* One call per batch, never crosses a page. */
+ n = heapgettup_pagemode_batch(scan, dir,
+ sscan->rs_nkeys, sscan->rs_key,
+ hb->tupdata, hb->maxitems);
+
+ if (n == 0)
+ return 0; /* end of scan */
+
+ /* Hold an extra pin for the batch lifetime so t_data stays valid. */
+ curbuf = scan->rs_cbuf;
+ IncrBufferRefCount(curbuf);
+ hb->buf = curbuf;
+
+ /* Per-tuple stats (can be collapsed into a future _multi() call). */
+ pgstat_count_heap_getnext_batch(sscan->rs_rd, n);
+
+ hb->nitems = n;
+ return n;
+}
+
+/*----- End of batching support -----*/
+
void
heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dd4fe6bf62f..550b788553c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2623,6 +2623,10 @@ static const TableAmRoutine heapam_methods = {
.scan_rescan = heap_rescan,
.scan_getnextslot = heap_getnextslot,
+ .scan_begin_batch = heap_begin_batch,
+ .scan_getnextbatch = heap_getnextbatch,
+ .scan_end_batch = heap_end_batch,
+
.scan_set_tidrange = heap_set_tidrange,
.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f7e4ae3843c..f6675043fb3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -101,6 +101,19 @@ typedef struct HeapScanDescData
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
+/*
+ * HeapBatch -- stateless per-batch buffer. A batch pins one page and
+ * exposes up to maxitems HeapTupleData headers whose t_data point into that
+ * page.
+ */
+typedef struct HeapBatch
+{
+ HeapTupleData *tupdata; /* len = maxitems; headers only */
+ int nitems; /* tuples produced in last getnextbatch() */
+ int maxitems; /* fixed capacity set at begin_batch() */
+ Buffer buf; /* single pinned buffer for this batch */
+} HeapBatch;
+
typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
@@ -337,6 +350,11 @@ extern void heap_endscan(TableScanDesc sscan);
extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
extern bool heap_getnextslot(TableScanDesc sscan,
ScanDirection direction, TupleTableSlot *slot);
+
+extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
+extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
+extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 2fa790b6bf5..3ec3c3dd008 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -351,6 +351,16 @@ typedef struct TableAmRoutine
ScanDirection direction,
TupleTableSlot *slot);
+ /* ------------------------------------------------------------------------
+ * Batched scan support
+ * ------------------------------------------------------------------------
+ */
+
+ void *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+ int (*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+ ScanDirection dir);
+ void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -1036,6 +1046,54 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_scan_begin_batch
+ * Allocate AM-owned batch payload with capacity 'maxitems'.
+ */
+static inline void *
+table_scan_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_begin_batch != NULL);
+
+ return tam->scan_begin_batch(sscan, maxitems);
+}
+
+/*
+ * table_scan_getnextbatch
+ * Fill next batch from the AM. Returns number of tuples, 0 => EOS.
+ * Batches are single-page in v1. Direction is forward only in v1.
+ */
+static inline int
+table_scan_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ /* Only forward scans are supported in the batched mode. */
+ Assert(dir == ForwardScanDirection);
+ Assert(tam->scan_getnextbatch != NULL);
+
+ return tam->scan_getnextbatch(sscan, am_batch, dir);
+}
+
+/*
+ * table_scan_end_batch
+ * Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ if (am_batch == NULL)
+ return;
+
+ Assert(tam->scan_end_batch != NULL);
+
+ tam->scan_end_batch(sscan, am_batch);
+}
+
/* ----------------------------------------------------------------------------
* TID Range scanning related functions.
* ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6714363144a..85f76dee468 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -697,6 +697,11 @@ extern void pgstat_report_analyze(Relation rel,
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.tuples_returned++; \
} while (0)
+#define pgstat_count_heap_getnext_batch(rel, n) \
+ do { \
+ if (pgstat_should_count_relation(rel)) \
+ (rel)->pgstat_info->counts.tuples_returned += (n); \
+ } while (0)
#define pgstat_count_heap_fetch(rel) \
do { \
if (pgstat_should_count_relation(rel)) \
--
2.47.3
[application/octet-stream] v4-0002-SeqScan-add-batch-driven-variants-returning-slots.patch (27.6K, 4-v4-0002-SeqScan-add-batch-driven-variants-returning-slots.patch)
From 5630836aefb87948bb745d7faad01e9e3534a64c Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 17:23:12 +0900
Subject: [PATCH v4 2/3] SeqScan: add batch-driven variants returning slots
Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per-tuple AM crossings without
changing the node interface seen by parents.
Add TupleBatch and supporting code in execBatch.c/h to hold executor
side batching state. PlanState gains ps_Batch to carry the active
TupleBatch when a node supports batching.
Wire up runtime selection in ExecInitSeqScan using
ScanCanUseBatching(). When executor_batch_rows is greater than zero, EPQ is
inactive, the scan is not backward, and the relation supports
batching, ps.ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used.
Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled and allowed.
Add executor_batch_rows GUC to specify the maximum number of rows
that can be added into a batch.
Notes / current limits:
- With the current heapam, batches are composed from a single page, so
the batch may not always be full. Future work may let SeqScan and/or
AMs top up batches across pages when safe to do so.
Reviewed-by: Daniil Davydov <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 29 ++++
src/backend/access/heap/heapam_handler.c | 16 ++
src/backend/access/table/tableam.c | 11 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execBatch.c | 117 ++++++++++++++
src/backend/executor/execScan.c | 31 ++++
src/backend/executor/meson.build | 1 +
src/backend/executor/nodeSeqscan.c | 176 +++++++++++++++++++++-
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_parameters.dat | 9 ++
src/include/access/heapam.h | 1 +
src/include/access/tableam.h | 27 ++++
src/include/executor/execBatch.h | 99 ++++++++++++
src/include/executor/execScan.h | 69 +++++++++
src/include/executor/executor.h | 4 +
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 4 +
17 files changed, 598 insertions(+), 1 deletion(-)
create mode 100644 src/backend/executor/execBatch.c
create mode 100644 src/include/executor/execBatch.h
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fcc0813f139..0c0b2384f0e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1592,6 +1592,35 @@ heap_begin_batch(TableScanDesc sscan, int maxitems)
return hb;
}
+/*
+ * heap_materialize_batch_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ Assert(n <= hb->nitems);
+
+ for (int i = 0; i < n; i++)
+ {
+ HeapTupleData *tuple = &hb->tupdata[i];
+ HeapTupleTableSlot *slot = (HeapTupleTableSlot *) slots[i];
+
+ /* Inline of ExecStoreHeapTuple(tuple, slot, false) */
+ slot->tuple = tuple;
+ slot->off = 0;
+ slot->base.tts_nvalid = 0;
+ slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
+ slot->base.tts_tid = tuple->t_self;
+ slot->base.tts_tableOid = tuple->t_tableOid;
+ slot->base.tts_flags &= ~(TTS_FLAG_SHOULDFREE | TTS_FLAG_EMPTY);
+ }
+}
+
/*
* heap_scan_end_batch
*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 550b788553c..a4de7e5b4f5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -72,6 +72,21 @@ heapam_slot_callbacks(Relation relation)
return &TTSOpsBufferHeapTuple;
}
+/* ------------------------------------------------------------------------
+ * TupleBatch related callbacks for heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static const TupleBatchOps TupleBatchHeapOps =
+{
+ .materialize_all = heap_materialize_batch_all
+};
+
+static const TupleBatchOps *
+heapam_batch_callbacks(Relation relation)
+{
+ return &TupleBatchHeapOps;
+}
/* ------------------------------------------------------------------------
* Index Scan Callbacks for heap AM
@@ -2617,6 +2632,7 @@ static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
.slot_callbacks = heapam_slot_callbacks,
+ .batch_callbacks = heapam_batch_callbacks,
.scan_begin = heap_beginscan,
.scan_end = heap_endscan,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 73ebc01a08f..d281aacaf94 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -103,6 +103,17 @@ table_slot_create(Relation relation, List **reglist)
return slot;
}
+/* ----------------------------------------------------------------------------
+ * TupleBatch support routines
+ * ----------------------------------------------------------------------------
+ */
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+ if (relation->rd_tableam)
+ return relation->rd_tableam->batch_callbacks(relation);
+ elog(ERROR, "relation does not support TupleBatch operations");
+}
/* ----------------------------------------------------------------------------
* Table scan functions.
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..3e72f3fe03c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
execAsync.o \
+ execBatch.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
new file mode 100644
index 00000000000..007ae535687
--- /dev/null
+++ b/src/backend/executor/execBatch.c
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.c
+ * Helpers for TupleBatch
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "executor/execBatch.h"
+
+/*
+ * TupleBatchCreate
+ * Allocate and initialize a new TupleBatch envelope.
+ */
+TupleBatch *
+TupleBatchCreate(TupleDesc scandesc, int capacity)
+{
+ TupleBatch *b;
+ TupleTableSlot **inslots,
+ **outslots;
+
+ inslots = palloc(sizeof(TupleTableSlot *) * capacity);
+ outslots = palloc(sizeof(TupleTableSlot *) * capacity);
+ for (int i = 0; i < capacity; i++)
+ inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);
+
+ b = (TupleBatch *) palloc(sizeof(TupleBatch));
+
+ /* Initial state: empty envelope */
+ b->am_payload = NULL;
+ b->ntuples = 0;
+ b->inslots = inslots;
+ b->outslots = outslots;
+ b->activeslots = NULL;
+ b->outslots = outslots;
+ b->maxslots = capacity;
+
+ b->nvalid = 0;
+ b->next = 0;
+
+ return b;
+}
+
+/*
+ * TupleBatchReset
+ * Reset an existing TupleBatch envelope to empty.
+ */
+void
+TupleBatchReset(TupleBatch *b, bool drop_slots)
+{
+ if (b == NULL)
+ return;
+
+ for (int i = 0; i < b->maxslots; i++)
+ {
+ ExecClearTuple(b->inslots[i]);
+ if (drop_slots)
+ ExecDropSingleTupleTableSlot(b->inslots[i]);
+ }
+
+ if (drop_slots)
+ {
+ pfree(b->inslots);
+ pfree(b->outslots);
+ b->inslots = b->outslots = NULL;
+ }
+
+ b->ntuples = 0;
+ b->nvalid = 0;
+ b->next = 0;
+ b->activeslots = NULL;
+}
+
+void
+TupleBatchUseInput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->inslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+void
+TupleBatchUseOutput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->outslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+bool
+TupleBatchIsValid(TupleBatch *b)
+{
+ return b != NULL &&
+ b->maxslots > 0 &&
+ b->inslots != NULL &&
+ b->outslots != NULL;
+}
+
+void
+TupleBatchRewind(TupleBatch *b)
+{
+ b->next = 0;
+}
+
+int
+TupleBatchGetNumValid(TupleBatch *b)
+{
+ return b->nvalid;
+}
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 31ed4783c1d..ba25daa5e46 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -18,6 +18,7 @@
*/
#include "postgres.h"
+#include "access/tableam.h"
#include "executor/executor.h"
#include "executor/execScan.h"
#include "miscadmin.h"
@@ -154,3 +155,33 @@ ExecScanReScan(ScanState *node)
}
}
}
+
+bool
+ScanCanUseBatching(ScanState *scanstate, int eflags)
+{
+ Relation relation = scanstate->ss_currentRelation;
+
+ return executor_batch_rows > 0 &&
+ (scanstate->ps.state->es_epq_active == NULL) &&
+ !(eflags & EXEC_FLAG_BACKWARD) &&
+ relation && table_supports_batching(relation);
+}
+
+void
+ScanResetBatching(ScanState *scanstate, bool drop)
+{
+ TupleBatch *b = scanstate->ps.ps_Batch;
+
+ if (b)
+ {
+ TupleBatchReset(b, drop);
+ if (b->am_payload)
+ {
+ table_scan_end_batch(scanstate->ss_currentScanDesc,
+ b->am_payload);
+ b->am_payload = NULL;
+ }
+ if (drop)
+ pfree(b);
+ }
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..40ffc28f3cb 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'execAmi.c',
'execAsync.c',
+ 'execBatch.c',
'execCurrent.c',
'execExpr.c',
'execExprInterp.c',
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 94047d29430..a9071e32560 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -203,6 +203,171 @@ ExecSeqScanEPQ(PlanState *pstate)
(ExecScanRecheckMtd) SeqRecheck);
}
+/* ----------------------------------------------------------------
+ * Batch Support
+ * ----------------------------------------------------------------
+ */
+static inline bool
+SeqNextBatch(SeqScanState *node)
+{
+ TableScanDesc scandesc;
+ EState *estate;
+ ScanDirection direction;
+
+ Assert(node->ss.ps.ps_Batch != NULL);
+
+ /*
+ * get information from the estate and scan state
+ */
+ scandesc = node->ss.ss_currentScanDesc;
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ Assert(direction == ForwardScanDirection);
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the scan is not parallel, or if we're serially
+ * executing a scan that was planned to be parallel.
+ */
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ node->ss.ss_currentScanDesc = scandesc;
+ }
+
+ /* Lazily create the AM batch payload. */
+ if (node->ss.ps.ps_Batch->am_payload == NULL)
+ {
+ const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+ Assert(tam && tam->scan_begin_batch);
+ node->ss.ps.ps_Batch->am_payload =
+ table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+ node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+ }
+
+ node->ss.ps.ps_Batch->ntuples =
+ table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
+ node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
+ node->ss.ps.ps_Batch->materialized = false;
+
+ return node->ss.ps.ps_Batch->ntuples > 0;
+}
+
+static inline bool
+SeqNextBatchMaterialize(SeqScanState *node)
+{
+ if (SeqNextBatch(node))
+ {
+ TupleBatchMaterializeAll(node->ss.ps.ps_Batch);
+ return true;
+ }
+
+ return false;
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtendedBatchSlot().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScan() but when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScan() but when qual evaluation and projection are
+ * required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, pstate->ps_ProjInfo);
+}
+
+/* Batch SeqScan enablement and dispatch */
+static void
+SeqScanInitBatching(SeqScanState *scanstate, int eflags)
+{
+ const int cap = executor_batch_rows;
+ TupleDesc scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+
+ scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+
+ /* Choose the batch variant matching the existing specialization matrix */
+ if (scanstate->ss.ps.qual == NULL)
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+ }
+ }
+ else
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecInitSeqScan
* ----------------------------------------------------------------
@@ -211,6 +376,7 @@ SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
SeqScanState *scanstate;
+ bool use_batching;
/*
* Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -241,9 +407,12 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
node->scan.scanrelid,
eflags);
+ use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
+
/* and create slot with the appropriate rowtype */
ExecInitScanTupleSlot(estate, &scanstate->ss,
RelationGetDescr(scanstate->ss.ss_currentRelation),
+ use_batching ? &TTSOpsHeapTuple :
table_slot_callbacks(scanstate->ss.ss_currentRelation));
/*
@@ -280,6 +449,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
+ if (use_batching)
+ SeqScanInitBatching(scanstate, eflags);
+
return scanstate;
}
@@ -299,6 +471,8 @@ ExecEndSeqScan(SeqScanState *node)
*/
scanDesc = node->ss.ss_currentScanDesc;
+ ScanResetBatching(&node->ss, true);
+
/*
* close heap scan
*/
@@ -327,7 +501,7 @@ ExecReScanSeqScan(SeqScanState *node)
if (scan != NULL)
table_rescan(scan, /* scan desc */
NULL); /* new scan keys */
-
+ ScanResetBatching(&node->ss, false);
ExecScanReScan((ScanState *) node);
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..266502e9778 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int notify_buffers = 16;
int serializable_buffers = 32;
int subtransaction_buffers = 0;
int transaction_buffers = 0;
+
+/* executor batching */
+int executor_batch_rows = 64;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 3b9d8349078..fd97d26c073 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1001,6 +1001,15 @@
boot_val => 'true',
},
+{ name => 'executor_batch_rows', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+ short_desc => 'Number of rows to include in batches during execution.',
+ flags => 'GUC_NOT_IN_SAMPLE',
+ variable => 'executor_batch_rows',
+ boot_val => '64',
+ min => '0',
+ max => '1024',
+},
+
{ name => 'exit_on_error', type => 'bool', context => 'PGC_USERSET', group => 'ERROR_HANDLING_OPTIONS',
short_desc => 'Terminate session on any error.',
variable => 'ExitOnAnyError',
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f6675043fb3..fe07b21eaa2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -354,6 +354,7 @@ extern bool heap_getnextslot(TableScanDesc sscan,
extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+extern void heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n);
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 3ec3c3dd008..13a95f7a589 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "commands/vacuum.h"
+#include "executor/execBatch.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -39,6 +40,7 @@ typedef struct BulkInsertStateData BulkInsertStateData;
typedef struct IndexInfo IndexInfo;
typedef struct SampleScanState SampleScanState;
typedef struct ValidateIndexState ValidateIndexState;
+typedef struct TupleBatchOps TupleBatchOps;
/*
* Bitmask values for the flags argument to the scan_begin callback.
@@ -301,6 +303,7 @@ typedef struct TableAmRoutine
* Return slot implementation suitable for storing a tuple of this AM.
*/
const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
+ const TupleBatchOps *(*batch_callbacks)(Relation rel);
/* ------------------------------------------------------------------------
@@ -361,6 +364,7 @@ typedef struct TableAmRoutine
ScanDirection dir);
void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -872,6 +876,16 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
*/
extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
+/* ----------------------------------------------------------------------------
+ * TupleBatch functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Returns callbacks for manipulating TupleBatch for tuples of the given
+ * relation.
+ */
+extern const TupleBatchOps *table_batch_callbacks(Relation relation);
/* ----------------------------------------------------------------------------
* Table scan functions.
@@ -1046,6 +1060,18 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_supports_batching
+ * Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation)
+{
+ const TableAmRoutine *tam = relation->rd_tableam;
+
+ return tam->scan_getnextbatch != NULL;
+}
+
/*
* table_scan_begin_batch
* Allocate AM-owned batch payload with capacity 'maxitems'.
@@ -2128,5 +2154,6 @@ extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
*/
extern const TableAmRoutine *GetHeapamTableAmRoutine(void);
+extern struct TupleBatchOps *GetHeapamTupleBatchOps(void);
#endif /* TABLEAM_H */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
new file mode 100644
index 00000000000..2d0066103ce
--- /dev/null
+++ b/src/include/executor/execBatch.h
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.h
+ * Executor batch envelope for passing tuple batch state upward
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECBATCH_H
+#define EXECBATCH_H
+
+#include "executor/tuptable.h"
+
+/*
+ * TupleBatchOps -- AM-specific helpers for lazy materialization.
+ */
+typedef struct TupleBatchOps
+{
+ void (*materialize_all)(void *am_payload,
+ TupleTableSlot **dst,
+ int maxslots);
+} TupleBatchOps;
+
+/*
+ * TupleBatch
+ *
+ * Envelope for a batch of tuples produced by a plan node (e.g., SeqScan) per
+ * call to a batch variant of ExecSeqScan().
+ */
+typedef struct TupleBatch
+{
+ void *am_payload;
+ const TupleBatchOps *ops;
+ int ntuples; /* number of tuples in am_payload */
+ bool materialized; /* tuples in slots valid? */
+ struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
+ struct TupleTableSlot **outslots; /* slots for tuples going "out of"
+ * batch */
+ struct TupleTableSlot **activeslots;
+ int maxslots;
+
+ int nvalid; /* number of returnable tuples in outslots */
+ int next; /* 0-based index of next tuple to be returned */
+} TupleBatch;
+
+
+/* Helpers */
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
+extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
+extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
+extern bool TupleBatchIsValid(TupleBatch *b);
+extern void TupleBatchRewind(TupleBatch *b);
+extern int TupleBatchGetNumValid(TupleBatch *b);
+
+static inline TupleTableSlot *
+TupleBatchGetNextSlot(TupleBatch *b)
+{
+ return b->next < b->nvalid ? b->activeslots[b->next++] : NULL;
+}
+
+static inline TupleTableSlot *
+TupleBatchGetSlot(TupleBatch *b, int index)
+{
+ Assert(index < b->nvalid);
+ return b->activeslots[index];
+}
+
+static inline void
+TupleBatchStoreInOut(TupleBatch *b, int index, TupleTableSlot *out)
+{
+ Assert(TupleBatchIsValid(b));
+ b->outslots[index] = out;
+}
+
+static inline bool
+TupleBatchHasMore(TupleBatch *b)
+{
+ return b->activeslots && b->next < b->nvalid;
+}
+
+static inline void
+TupleBatchMaterializeAll(TupleBatch *b)
+{
+ if (b->materialized)
+ return;
+
+ if (b->ops == NULL || b->ops->materialize_all == NULL)
+ elog(ERROR, "TupleBatch has no slots and no materialize_all op");
+
+ b->ops->materialize_all(b->am_payload, b->inslots, b->ntuples);
+ TupleBatchUseInput(b, b->ntuples);
+}
+
+#endif /* EXECBATCH_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index 2003cbc7ed5..c1add8ca331 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -251,4 +251,73 @@ ExecScanExtended(ScanState *node,
}
}
+/*
+ * ExecScanExtendedBatchSlot
+ * Batch-driven variant of ExecScanExtended.
+ *
+ * Returns one tuple at a time to callers, but internally fetches tuples
+ * in batches from the AM via accessBatchMtd. This reduces per-tuple AM
+ * call overhead while preserving the single-slot interface expected by
+ * parent nodes.
+ *
+ * The batch is refilled when exhausted by calling accessBatchMtd, which
+ * returns false at end-of-scan.
+ *
+ * Note: EPQ is not supported in the batch path; callers must ensure
+ * es_epq_active is NULL before using this function.
+ */
+static inline TupleTableSlot *
+ExecScanExtendedBatchSlot(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ TupleTableSlot *in;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next input slot from current batch, or refill */
+ if (!TupleBatchHasMore(b))
+ {
+ if (!accessBatchMtd(node))
+ return NULL;
+ }
+
+ in = TupleBatchGetNextSlot(b);
+ Assert(in);
+
+ /* No qual, no projection: direct return */
+ if (qual == NULL && projInfo == NULL)
+ return in;
+
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ /* Qual only */
+ if (projInfo == NULL)
+ {
+ if (qual == NULL || ExecQual(qual, econtext))
+ return in;
+ else
+ InstrCountFiltered1(node, 1);
+ continue;
+ }
+
+ /* Projection (with or without qual) */
+ if (qual == NULL || ExecQual(qual, econtext))
+ return ExecProject(projInfo);
+ else
+ InstrCountFiltered1(node, 1);
+ /* else try next tuple */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 7cd6a49309f..c1f05ce6273 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -578,12 +578,16 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
*/
typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanAccessBatchMtd)(ScanState *node);
extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
ExecScanRecheckMtd recheckMtd);
+
extern void ExecAssignScanProjectionInfo(ScanState *node);
extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
extern void ExecScanReScan(ScanState *node);
+extern bool ScanCanUseBatching(ScanState *scanstate, int eflags);
+extern void ScanResetBatching(ScanState *scanstate, bool drop);
/*
* prototypes from functions in execTuples.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 9a7d733ddef..13285210998 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -288,6 +288,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
extern PGDLLIMPORT int VacuumCostBalance;
extern PGDLLIMPORT bool VacuumCostActive;
+extern PGDLLIMPORT int executor_batch_rows;
/* in utils/misc/stack_depth.c */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3968429f991..219a722c49a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -30,6 +30,7 @@
#define EXECNODES_H
#include "access/tupconvert.h"
+#include "executor/execBatch.h"
#include "executor/instrument.h"
#include "fmgr.h"
#include "lib/ilist.h"
@@ -1204,6 +1205,9 @@ typedef struct PlanState
ExprContext *ps_ExprContext; /* node's expression-evaluation context */
ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */
+ /* Batching state if node supports it. */
+ TupleBatch *ps_Batch;
+
bool async_capable; /* true if node is async-capable */
/*
--
2.47.3
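For readers who want the control flow of ExecScanExtendedBatchSlot() without the executor types, here is a minimal standalone sketch, not the patch's code. DemoScan, demo_next_batch() and demo_scan_next() are hypothetical names. An array of ints stands in for the slots, a fixed capacity of 4 stands in for executor_batch_rows, and "keep even values" stands in for the qual. It only illustrates the pattern of returning one row per call while refilling from the source a batch at a time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_BATCH_CAP 4		/* hypothetical; stands in for executor_batch_rows */

/* Hypothetical scan state: the "table" is the integers 0 .. total-1. */
typedef struct DemoScan
{
	int			total;			/* rows the source can produce */
	int			produced;		/* rows handed out by the source so far */
	int			rows[DEMO_BATCH_CAP];	/* current batch */
	int			nvalid;			/* number of rows in the current batch */
	int			next;			/* index of the next row to return */
	int			refills;		/* non-empty batch fetches (demo-only counter) */
} DemoScan;

/* Refill the batch from the source; returns false at end of scan. */
static bool
demo_next_batch(DemoScan *scan)
{
	int			n = 0;

	while (n < DEMO_BATCH_CAP && scan->produced < scan->total)
		scan->rows[n++] = scan->produced++;

	scan->nvalid = n;
	scan->next = 0;
	if (n > 0)
		scan->refills++;
	return n > 0;
}

/* Return the next row that passes the filter, or NULL at end of scan. */
static const int *
demo_scan_next(DemoScan *scan)
{
	for (;;)
	{
		const int  *row;

		/* Current batch exhausted: refill, or stop if the source is done. */
		if (scan->next >= scan->nvalid && !demo_next_batch(scan))
			return NULL;

		row = &scan->rows[scan->next++];

		/* Stand-in for the qual: keep even values, skip the rest. */
		if (*row % 2 == 0)
			return row;
	}
}
```

A caller zero-initializes a DemoScan, sets total, and calls demo_scan_next() until it returns NULL; the caller never sees batch boundaries, which is the property the single-slot interface is meant to preserve.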
[application/octet-stream] v4-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch (13.8K, 5-v4-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch)
From 189edab507d407cce6446a944b3a48c327167ec3 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 23:09:37 +0900
Subject: [PATCH v4 3/3] Add EXPLAIN (BATCHES) option for tuple batching
statistics
Add a BATCHES option to EXPLAIN that reports per-node batch statistics
when a node uses batch mode execution.
For nodes that support batching (currently SeqScan), this shows the
number of batches fetched along with average, minimum, and maximum
rows per batch. Output is supported in both text and non-text formats.
Add regression tests covering text output, JSON format, filtered scans,
LIMIT, and disabled batching.
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/commands/explain.c | 30 ++++++++++++++
src/backend/commands/explain_state.c | 2 +
src/backend/executor/execBatch.c | 8 +++-
src/backend/executor/nodeSeqscan.c | 24 +++++------
src/include/commands/explain_state.h | 1 +
src/include/executor/execBatch.h | 35 +++++++++++++++-
src/include/executor/instrument.h | 1 +
src/test/regress/expected/explain.out | 57 +++++++++++++++++++++++++++
src/test/regress/sql/explain.sql | 26 ++++++++++++
9 files changed, 171 insertions(+), 13 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5a6390631eb..3a639a13807 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -22,6 +22,7 @@
#include "commands/explain_format.h"
#include "commands/explain_state.h"
#include "commands/prepare.h"
+#include "executor/execBatch.h"
#include "foreign/fdwapi.h"
#include "jit/jit.h"
#include "libpq/pqformat.h"
@@ -517,6 +518,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_BUFFERS;
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
+ if (es->batches)
+ instrument_option |= INSTRUMENT_BATCHES;
/*
* We always collect timing for the entire statement, even when node-level
@@ -2292,6 +2295,33 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_buffer_usage(es, &planstate->instrument->bufusage);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->walusage);
+ if (es->batches && planstate->ps_Batch)
+ {
+ TupleBatch *b = planstate->ps_Batch;
+
+ if (b->stat_batches > 0)
+ {
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfo(es->str,
+ "Batches: %lld Avg Rows: %.1f Max: %d Min: %d\n",
+ (long long) b->stat_batches,
+ TupleBatchAvgRows(b),
+ b->stat_max_rows,
+ b->stat_min_rows == INT_MAX ? 0 : b->stat_min_rows);
+ }
+ else
+ {
+ ExplainPropertyInteger("Batches", NULL, b->stat_batches, es);
+ ExplainPropertyFloat("Average Batch Rows", NULL,
+ TupleBatchAvgRows(b), 1, es);
+ ExplainPropertyInteger("Max Batch Rows", NULL, b->stat_max_rows, es);
+ ExplainPropertyInteger("Min Batch Rows", NULL,
+ b->stat_min_rows == INT_MAX ? 0 : b->stat_min_rows, es);
+ }
+ }
+ }
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index a6623f8fa52..6ef6055c479 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -159,6 +159,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
"EXPLAIN", opt->defname, p),
parser_errposition(pstate, opt->location)));
}
+ else if (strcmp(opt->defname, "batches") == 0)
+ es->batches = defGetBoolean(opt);
else if (!ApplyExtensionExplainOption(es, opt, pstate))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
index 007ae535687..93c90680d3d 100644
--- a/src/backend/executor/execBatch.c
+++ b/src/backend/executor/execBatch.c
@@ -19,7 +19,7 @@
* Allocate and initialize a new TupleBatch envelope.
*/
TupleBatch *
-TupleBatchCreate(TupleDesc scandesc, int capacity)
+TupleBatchCreate(TupleDesc scandesc, int capacity, bool track_stats)
{
TupleBatch *b;
TupleTableSlot **inslots,
@@ -44,6 +44,12 @@ TupleBatchCreate(TupleDesc scandesc, int capacity)
b->nvalid = 0;
b->next = 0;
+ b->track_stats = track_stats;
+ b->stat_batches = 0;
+ b->stat_rows = 0;
+ b->stat_max_rows = 0;
+ b->stat_min_rows = INT_MAX;
+
return b;
}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index a9071e32560..73eb9b6a51e 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -213,8 +213,9 @@ SeqNextBatch(SeqScanState *node)
TableScanDesc scandesc;
EState *estate;
ScanDirection direction;
+ TupleBatch *b = node->ss.ps.ps_Batch;
- Assert(node->ss.ps.ps_Batch != NULL);
+ Assert(b != NULL);
/*
* get information from the estate and scan state
@@ -237,22 +238,21 @@ SeqNextBatch(SeqScanState *node)
}
/* Lazily create the AM batch payload. */
- if (node->ss.ps.ps_Batch->am_payload == NULL)
+ if (b->am_payload == NULL)
{
const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
Assert(tam && tam->scan_begin_batch);
- node->ss.ps.ps_Batch->am_payload =
- table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
- node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+ b->am_payload = table_scan_begin_batch(scandesc, b->maxslots);
+ b->ops = table_batch_callbacks(node->ss.ss_currentRelation);
}
- node->ss.ps.ps_Batch->ntuples =
- table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
- node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
- node->ss.ps.ps_Batch->materialized = false;
+ b->ntuples = table_scan_getnextbatch(scandesc, b->am_payload, direction);
+ b->nvalid = b->ntuples;
+ b->materialized = false;
+ TupleBatchRecordStats(b, b->ntuples);
- return node->ss.ps.ps_Batch->ntuples > 0;
+ return b->ntuples > 0;
}
static inline bool
@@ -340,8 +340,10 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
const int cap = executor_batch_rows;
TupleDesc scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+ EState *estate = scanstate->ss.ps.state;
+ bool track_stats = estate->es_instrument && (estate->es_instrument & INSTRUMENT_BATCHES);
- scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+ scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap, track_stats);
/* Choose batch variant to preserve your specialization matrix */
if (scanstate->ss.ps.qual == NULL)
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index ba073b86918..b82f7ac0829 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -55,6 +55,7 @@ typedef struct ExplainState
bool memory; /* print planner's memory usage information */
bool settings; /* print modified settings */
bool generic; /* generate a generic plan */
+ bool batches; /* print batch statistics */
ExplainSerializeOption serialize; /* serialize the query's output? */
ExplainFormat format; /* output format */
/* state for output formatting --- not reset for each new plan tree */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
index 2d0066103ce..e3a4f762284 100644
--- a/src/include/executor/execBatch.h
+++ b/src/include/executor/execBatch.h
@@ -13,6 +13,7 @@
#ifndef EXECBATCH_H
#define EXECBATCH_H
+#include "limits.h"
#include "executor/tuptable.h"
/*
@@ -45,11 +46,18 @@ typedef struct TupleBatch
int nvalid; /* number of returnable tuples in outslots */
int next; /* 0-based index of next tuple to be returned */
+
+ /* Statistics (populated when EXPLAIN ANALYZE BATCHES) */
+ bool track_stats; /* whether to collect stats */
+ int64 stat_batches; /* total number of batches fetched */
+ int64 stat_rows; /* total tuples across all batches */
+ int stat_max_rows; /* max rows in any single batch */
+ int stat_min_rows; /* min rows in any single batch (non-zero) */
} TupleBatch;
/* Helpers */
-extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity, bool track_stats);
extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
@@ -96,4 +104,29 @@ TupleBatchMaterializeAll(TupleBatch *b)
TupleBatchUseInput(b, b->ntuples);
}
+/* === Batching stats. ===*/
+
+static inline void
+TupleBatchRecordStats(TupleBatch *b, int rows)
+{
+ if (!b->track_stats)
+ return;
+
+ b->stat_batches++;
+ b->stat_rows += rows;
+ if (rows > b->stat_max_rows)
+ b->stat_max_rows = rows;
+ if (rows < b->stat_min_rows && rows > 0)
+ b->stat_min_rows = rows;
+}
+
+static inline double
+TupleBatchAvgRows(TupleBatch *b)
+{
+ if (b->stat_batches == 0)
+ return 0.0;
+
+ return (double) b->stat_rows / b->stat_batches;
+}
+
#endif /* EXECBATCH_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index ffe470f2b84..0af02db3760 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -64,6 +64,7 @@ typedef enum InstrumentOption
INSTRUMENT_BUFFERS = 1 << 1, /* needs buffer usage */
INSTRUMENT_ROWS = 1 << 2, /* needs row count */
INSTRUMENT_WAL = 1 << 3, /* needs WAL usage */
+ INSTRUMENT_BATCHES = 1 << 4, /* needs batches */
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..fef3b4a5497 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,60 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- Test BATCHES option
+set executor_batch_rows = 64;
+create table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+-- With filter
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Filter: (a > N)
+ Rows Removed by Filter: N
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With LIMIT - partial scan shows fewer batches
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+ explain_filter
+----------------------------------------------------------------------
+ Limit (actual time=N.N..N.N rows=N.N loops=N)
+ -> Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(5 rows)
+
+-- Batching disabled - no batch line
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+reset executor_batch_rows;
+-- JSON format
+select explain_filter_to_json('explain (analyze, batches, buffers off, format json) select * from batch_test where a < 1000') #> '{0,Plan,Batches}';
+ ?column?
+----------
+ 0
+(1 row)
+
+drop table batch_test;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..87bb179ced9 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,29 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- Test BATCHES option
+set executor_batch_rows = 64;
+
+create table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+
+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- With filter
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+
+-- With LIMIT - partial scan shows fewer batches
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+
+-- Batching disabled - no batch line
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+reset executor_batch_rows;
+
+-- JSON format
+select explain_filter_to_json('explain (analyze, batches, buffers off, format json) select * from batch_test where a < 1000') #> '{0,Plan,Batches}';
+
+drop table batch_test;
--
2.47.3
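The accounting behind the "Batches: ... Avg Rows: ... Max: ... Min: ..." line can be sketched on its own as below. This is a minimal standalone illustration, assuming the behaviour visible in the diff above: every fetch is counted, empty fetches do not affect the minimum, and an untouched minimum is reported as 0. DemoBatchStats and the demo_stats_* functions are hypothetical names, not part of the patch.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for the stat_* fields added to TupleBatch. */
typedef struct DemoBatchStats
{
	int64_t		batches;		/* number of fetches recorded */
	int64_t		rows;			/* total rows across all fetches */
	int			max_rows;		/* largest batch seen */
	int			min_rows;		/* smallest non-empty batch; INT_MAX if none */
} DemoBatchStats;

static void
demo_stats_init(DemoBatchStats *s)
{
	s->batches = 0;
	s->rows = 0;
	s->max_rows = 0;
	s->min_rows = INT_MAX;
}

/* Record one fetch that returned 'rows' rows (possibly zero). */
static void
demo_stats_record(DemoBatchStats *s, int rows)
{
	s->batches++;
	s->rows += rows;
	if (rows > s->max_rows)
		s->max_rows = rows;
	if (rows > 0 && rows < s->min_rows)
		s->min_rows = rows;
}

/* Average rows per recorded fetch; 0.0 when nothing was recorded. */
static double
demo_stats_avg(const DemoBatchStats *s)
{
	if (s->batches == 0)
		return 0.0;
	return (double) s->rows / (double) s->batches;
}

/* Minimum for display: report 0 if no non-empty batch was ever seen. */
static int
demo_stats_min(const DemoBatchStats *s)
{
	return s->min_rows == INT_MAX ? 0 : s->min_rows;
}
```

With fetches of 64, 64 and 28 rows followed by one empty end-of-scan fetch, this yields 4 batches, an average of 39.0, a maximum of 64 and a minimum of 28. Counting the empty fetch lowers the average, which may or may not be the intended reading of "Avg Rows".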
[text/x-sh] bar_limit.sh (1.7K, 6-bar_limit.sh)
home=$HOME
master=$home/pg/install/master-opt/bin
patched=$home/pg/install/patched-opt/bin
master_data=$home/pg/data/master
patched_data=$home/pg/data/patched
logdir=$home/pg/log
# master
export PATH=$master:$PATH
which postgres
pg_ctl -D $master_data -l $logdir/pg_master_log start
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
psql -c "vacuum bar_$i" > /dev/null 2>&1
printf "%s\t" "$i"
echo "select * from bar_$i limit 1 offset $i" > /tmp/bar_limit.sql
pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done
pg_ctl -D $master_data -l $logdir/pg_master_log stop
export PATH=$patched:$PATH;
which postgres
echo "executor_batch_rows=0" >> $patched_data/postgresql.conf
echo "executor_batch_rows=0"
pg_ctl -D $patched_data -l $logdir/pg_master_log start
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
psql -c "vacuum bar_$i" > /dev/null 2>&1
printf "%s\t" "$i"
echo "select * from bar_$i limit 1 offset $i" > /tmp/bar_limit.sql
pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done
pg_ctl -D $patched_data -l $logdir/pg_master_log stop
which postgres
echo "executor_batch_rows=64" >> $patched_data/postgresql.conf
echo "executor_batch_rows=64"
pg_ctl -D $patched_data -l $logdir/pg_master_log start
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
psql -c "vacuum bar_$i" > /dev/null 2>&1
printf "%s\t" "$i"
echo "select * from bar_$i limit 1 offset $i" > /tmp/bar_limit.sql
pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done
pg_ctl -D $patched_data -l $logdir/pg_master_log stop
[text/x-sh] bar_limit_where_o.sh (1.7K, 7-bar_limit_where_o.sh)
home=$HOME
master=$home/pg/install/master-opt/bin
patched=$home/pg/install/patched-opt/bin
master_data=$home/pg/data/master
patched_data=$home/pg/data/patched
logdir=$home/pg/log
# master
export PATH=$master:$PATH
which postgres
pg_ctl -D $master_data -l $logdir/pg_master_log start
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
psql -c "vacuum bar_$i" > /dev/null 2>&1
printf "%s\t" "$i"
echo "select * from bar_$i where o > 0 limit 1 offset $i" > /tmp/bar_limit.sql
pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done
pg_ctl -D $master_data -l $logdir/pg_master_log stop
export PATH=$patched:$PATH;
which postgres
echo "executor_batch_rows=0" >> $patched_data/postgresql.conf
echo "executor_batch_rows=0"
pg_ctl -D $patched_data -l $logdir/pg_master_log start
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
psql -c "vacuum bar_$i" > /dev/null 2>&1
printf "%s\t" "$i"
echo "select * from bar_$i where o > 0 limit 1 offset $i" > /tmp/bar_limit.sql
pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done
pg_ctl -D $patched_data -l $logdir/pg_master_log stop
which postgres
echo "executor_batch_rows=64" >> $patched_data/postgresql.conf
echo "executor_batch_rows=64"
pg_ctl -D $patched_data -l $logdir/pg_master_log start
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
psql -c "vacuum bar_$i" > /dev/null 2>&1
printf "%s\t" "$i"
echo "select * from bar_$i where o > 0 limit 1 offset $i" > /tmp/bar_limit.sql
pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done
pg_ctl -D $patched_data -l $logdir/pg_master_log stop
[text/x-sh] bar_limit_where_a.sh (1.7K, 8-bar_limit_where_a.sh)
home=$HOME
master=$home/pg/install/master-opt/bin
patched=$home/pg/install/patched-opt/bin
master_data=$home/pg/data/master
patched_data=$home/pg/data/patched
logdir=$home/pg/log
# master
export PATH=$master:$PATH
which postgres
pg_ctl -D $master_data -l $logdir/pg_master_log start
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
psql -c "vacuum bar_$i" > /dev/null 2>&1
printf "%s\t" "$i"
echo "select * from bar_$i where a > 0 limit 1 offset $i" > /tmp/bar_limit.sql
pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done
pg_ctl -D $master_data -l $logdir/pg_master_log stop
export PATH=$patched:$PATH;
which postgres
echo "executor_batch_rows=0" >> $patched_data/postgresql.conf
echo "executor_batch_rows=0";
pg_ctl -D $patched_data -l $logdir/pg_master_log start
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
psql -c "vacuum bar_$i" > /dev/null 2>&1
printf "%s\t" "$i"
echo "select * from bar_$i where a > 0 limit 1 offset $i" > /tmp/bar_limit.sql
pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done
pg_ctl -D $patched_data -l $logdir/pg_master_log stop
which postgres
echo "executor_batch_rows=64" >> $patched_data/postgresql.conf
echo "executor_batch_rows=64"
pg_ctl -D $patched_data -l $logdir/pg_master_log start
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
psql -c "vacuum bar_$i" > /dev/null 2>&1
printf "%s\t" "$i"
echo "select * from bar_$i where a > 0 limit 1 offset $i" > /tmp/bar_limit.sql
pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done
pg_ctl -D $patched_data -l $logdir/pg_master_log stop
* Re: Batching in executor
@ 2025-12-20 14:36 Amit Langote <[email protected]>
parent: Daniil Davydov <[email protected]>
0 siblings, 0 replies; 29+ messages in thread
From: Amit Langote @ 2025-12-20 14:36 UTC (permalink / raw)
To: Daniil Davydov <[email protected]>; +Cc: Tomas Vondra <[email protected]>; pgsql-hackers
Hi Daniil,
On Thu, Oct 30, 2025 at 9:12 PM Daniil Davydov <[email protected]> wrote:
> On Wed, Oct 29, 2025 at 9:23 AM Amit Langote <[email protected]> wrote:
> >
> > Hi Daniil,
> >
> > On Tue, Oct 28, 2025 at 11:32 PM Daniil Davydov <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > As far as I understand, this work partially overlaps with what we did in the
> > > thread [1] (in short - we introduce support for batching within the ModifyTable
> > > node). Am I correct?
> >
> > There might be some relation, but not much overlap. The thread you
> > mention seems to focus on batching in the write path (for INSERT,
> > etc.), while this work targets batching in the read path via Table AM
> > scan callbacks. I think they can be developed independently, though
> > I'm happy to take a look.
>
> Oh, I got it. Thanks!
>
> I looked at 0001-0003 patches and got some comments :
> 1)
> I noticed that some Nodes may set SO_ALLOW_PAGEMODE flag to 'false'
> during ExecReScan. heap_getnextslot works carefully with it - checks whether
> pagemode is allowed at every call. If not - it just uses tuple-at-a-time mode.
> At the same time, heap_getnextbatch always expects that pagemode is enabled.
> I didn't find any code paths which can lead to an assertion [1] failure.
> If such a code path is unreachable under any circumstances, maybe we should
> add a comment explaining why?
>
> 2)
> heapgettup_pagemode_batch : Do we really need to compute lineindex variable
> in this way? :
> ***
> lineindex = scan->rs_cindex + dir;
> if (ScanDirectionIsForward(dir))
> linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
> (scan->rs_ntuples - lineindex) : 0;
> ***
>
> As far as I understand, this is enough :
> ***
> lineindex = scan->rs_cindex + dir;
> if (ScanDirectionIsForward(dir))
> linesleft = scan->rs_ntuples - lineindex;
> ***
>
> 3)
> Is this code inside heapgettup_pagemode_batch necessary? :
> ***
> ScanDirectionIsForward(dir) ? 0 : 0
> ***
>
> 4)
> heapgettup_pagemode has this change :
> HeapTuple tuple = &(scan->rs_ctup) ---> HeapTuple tuple = &scan->rs_ctup
> I guess it was changed accidentally.
>
> 5)
> I apologize for the tediousness, but these braces are not in the
> postgres style :
> ***
> static const TupleBatchOps TupleBatchHeapOps = {
> .materialize_all = heap_materialize_batch_all
> };
> ***
>
> [1] heap_getnextbatch : Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE)
Thanks for the review, and apologies for getting to it so late.
I think I've addressed your comments in the v4 that I just posted.
--
Thanks, Amit Langote
* Re: Batching in executor
@ 2025-12-22 11:45 cca5507 <[email protected]>
parent: Amit Langote <[email protected]>
0 siblings, 1 reply; 29+ messages in thread
From: cca5507 @ 2025-12-22 11:45 UTC (permalink / raw)
To: Amit Langote <[email protected]>; +Cc: pgsql-hackers; Tomas Vondra <[email protected]>
Hi,
Some comments for v4:
0001
====
1) table_scan_getnextbatch()
"Assert(dir == ForwardScanDirection);" -> "Assert(ScanDirectionIsForward(dir));"
2) heapgettup_pagemode_batch()
"TupleDesc tupdesc = key ? RelationGetDescr(rel) : NULL;" -> "TupleDesc tupdesc = RelationGetDescr(rel);"
I think the latter is enough.
3) heapgettup_pagemode_batch()
```
/* Are there more visible tuples left on this page? */
lineindex = scan->rs_cindex + dir;
linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
(scan->rs_ntuples - lineindex) : 0;
if (linesleft > 0)
break; /* continue on this page */
```
The "scan->rs_ntuples" is already a uint32, so the cast is unnecessary.
4) heapgettup_pagemode_batch()
```
Assert(lineindex <= (uint32) scan->rs_ntuples);
```
The "scan->rs_ntuples" is already a uint32, so the cast is unnecessary. I also think this should be
"Assert(lineindex < scan->rs_ntuples);"; the related assert in heapgettup_pagemode() is also wrong.
5) heapgettup_pagemode_batch()
If the scan key filters out all tuples on a page, we may return 0 before reaching the end of scan, right?
6) heap_begin_batch()
```
hb = palloc(sizeof(HeapBatch));
hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
```
Can we use a single palloc() here to be more cache-friendly?
0002
====
1) heap_materialize_batch_all()
```
slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
slot->base.tts_tid = tuple->t_self;
slot->base.tts_tableOid = tuple->t_tableOid;
slot->base.tts_flags &= ~(TTS_FLAG_SHOULDFREE | TTS_FLAG_EMPTY);
```
Isn't the second "slot->base.tts_flags &=" redundant?
2) TupleBatchCreate()
```
inslots = palloc(sizeof(TupleTableSlot *) * capacity);
outslots = palloc(sizeof(TupleTableSlot *) * capacity);
for (int i = 0; i < capacity; i++)
inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);
b = (TupleBatch *) palloc(sizeof(TupleBatch));
```
Can we use a single palloc() here to be more cache-friendly?
3) TupleBatchCreate()
```
b->outslots = outslots;
b->activeslots = NULL;
b->outslots = outslots;
```
Isn't the second "b->outslots = outslots;" redundant?
4) TupleBatchReset()
```
if (b == NULL)
return;
```
This can never happen. Convert it to an assert, or just delete it?
5) SeqNextBatch()
"Assert(direction == ForwardScanDirection);" -> "Assert(ScanDirectionIsForward(direction));"
--
Regards,
ChangAo Chen
* Re: Batching in executor
@ 2026-01-26 09:34 Daniil Davydov <[email protected]>
parent: cca5507 <[email protected]>
0 siblings, 1 reply; 29+ messages in thread
From: Daniil Davydov @ 2026-01-26 09:34 UTC (permalink / raw)
To: cca5507 <[email protected]>; +Cc: Amit Langote <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
Hi,
On Mon, Dec 22, 2025 at 6:46 PM cca5507 <[email protected]> wrote:
>
> Some comments for v4:
>
I agree with your comments (1)-(4).
> 5) heapgettup_pagemode_batch()
> If the scan key filters out all tuples on a page, we may return 0 before reaching the end of scan, right?
>
Yes. I think that we should advance to the next page if "nout == 0"
after walking through rs_vistuples.
> 6) heap_begin_batch()
> ```
> hb = palloc(sizeof(HeapBatch));
> hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
> ```
> Can we just use one palloc() for cache-friendly?
>
Actually, palloc allocates from a memory context, so in the general case
it will not cause an actual memory allocation. But of course
there is no guarantee of that. I have seen a lot of places in the code where we
call palloc several times in a row, so I guess that
this is OK.
If you decide to keep these palloc calls, I suggest using
palloc_object/palloc_array.
A few other comments on the 0001 patch:
1)
+ void *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
Is it syntactically correct?
2)
/* Initialize static fields of HeapTupleData. Row bodies remain on page. */
relid = RelationGetRelid(sscan->rs_rd);
for (int i = 0; i < maxitems; i++)
hb->tupdata[i].t_tableOid = relid;
Is it really necessary? I see that we are setting this field inside the
heapgettup_pagemode_batch function.
A few comments on the 0002 patch:
1)
I guess that you should rebase your patches on the current master, because
the second patch doesn't apply.
2)
Maybe we can use tuplestore for tuples stored in TupleBatch? It is just a
proposal - I didn't check this idea carefully.
--
Best regards,
Daniil Davydov
* Re: Batching in executor
@ 2026-01-27 03:00 Amit Langote <[email protected]>
parent: Daniil Davydov <[email protected]>
0 siblings, 1 reply; 29+ messages in thread
From: Amit Langote @ 2026-01-27 03:00 UTC (permalink / raw)
To: Daniil Davydov <[email protected]>; +Cc: cca5507 <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
Hi,
On Mon, Jan 26, 2026 at 6:34 PM Daniil Davydov <[email protected]> wrote:
>
> Hi,
>
> On Mon, Dec 22, 2025 at 6:46 PM cca5507 <[email protected]> wrote:
> >
> > Some comments for v4:
> >
>
> Agree with your (1)-(4) comments.
>
> > 5) heapgettup_pagemode_batch()
> > If the scan key filters out all tuples on a page, we may return 0 before reaching the end of scan, right?
> >
>
> Yes. I think that we should advance to the next page if "nout == 0"
> at the end of walking through the rs_vistuples.
Next version (v5) does it like that.
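
As a rough illustration of that control flow, here is a standalone sketch. It
is not the patch code: Scan, pages, key_test and next_batch are made-up names,
and a plain int array stands in for heap pages. The point it shows is that a
page whose tuples are all filtered out makes the loop move on to the next
page, so 0 is returned only at end of scan.

```c
#include <assert.h>

#define NPAGES 3
#define TUPLES_PER_PAGE 4

/* Hypothetical stand-in for a heap scan: a cursor over fixed-size pages. */
typedef struct Scan
{
    int page;   /* page currently being read */
    int idx;    /* next tuple to look at on that page */
} Scan;

static const int pages[NPAGES][TUPLES_PER_PAGE] = {
    {1, 3, 5, 7},       /* nothing passes the key */
    {9, 11, 13, 15},    /* nothing passes the key */
    {2, 17, 4, 19},     /* two tuples pass */
};

/* Hypothetical scan key: keep even values only. */
static int
key_test(int v)
{
    return v % 2 == 0;
}

/*
 * Fill 'out' with up to 'maxitems' matching tuples taken from one page.
 * If the page yields nothing, try the next page instead of returning 0.
 */
static int
next_batch(Scan *scan, int *out, int maxitems)
{
    assert(maxitems > 0);

    while (scan->page < NPAGES)
    {
        int nout = 0;

        while (scan->idx < TUPLES_PER_PAGE && nout < maxitems)
        {
            int v = pages[scan->page][scan->idx++];

            if (key_test(v))
                out[nout++] = v;
        }

        /* Page exhausted: the next call (or iteration) starts a new page. */
        if (scan->idx >= TUPLES_PER_PAGE)
        {
            scan->page++;
            scan->idx = 0;
        }

        if (nout > 0)
            return nout;    /* a batch never crosses a page boundary */
    }

    return 0;               /* end of scan */
}
```

With the sample data, the first call skips the two fully filtered pages and
returns the two matching tuples from the third; the call after that returns 0.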
> > 6) heap_begin_batch()
> > ```
> > hb = palloc(sizeof(HeapBatch));
> > hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
> > ```
> > Can we just use one palloc() for cache-friendly?
> >
>
> Actually, we are using memory context when calling the palloc function.
> I.e. in the general case it will not cause memory allocation. But of course
> there is no guarantee for it. I saw a lot of places in the code where we
> are calling the palloc function several times in a row, so I guess that
> this is OK.
>
> If you will decide to leave these palloc calls, I suggest using the
> palloc_object/palloc_array functions.
I think combining those individual pallocs into one is a good idea, so
v5 does it like that.
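
For readers following along, the single-allocation layout can be sketched
outside PostgreSQL as below. This is only an illustration under assumptions:
Item and Batch are hypothetical stand-ins for HeapTupleData and HeapBatch, and
malloc/free replace palloc/pfree.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real HeapTupleData and HeapBatch differ. */
typedef struct Item
{
    unsigned int len;
    void        *data;
} Item;

typedef struct Batch
{
    Item *items;    /* points just past this header */
    int   nitems;
    int   maxitems;
} Batch;

static Batch *
batch_create(int maxitems)
{
    Batch *b;

    assert(maxitems > 0);

    /*
     * One allocation: header followed immediately by the item array.  This
     * relies on sizeof(Batch) being a multiple of Item's alignment, which
     * holds here because Batch itself contains a pointer; otherwise the
     * offset would have to be rounded up.
     */
    b = malloc(sizeof(Batch) + sizeof(Item) * (size_t) maxitems);
    if (b == NULL)
        return NULL;

    b->items = (Item *) ((char *) b + sizeof(Batch));
    b->nitems = 0;
    b->maxitems = maxitems;
    return b;
}

static void
batch_destroy(Batch *b)
{
    free(b);    /* a single free releases header and array together */
}
```

The header and the first array elements end up adjacent in memory, and teardown
needs only one free call.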
> A few other comments on 0001 patch:
>
> 1)
> + void *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
> Is it syntactically correct?
Yes, it compiles fine. Though I'm considering changing the return type
to a struct with common fields (like nitems) so callers can access
them directly without callback indirection. Maybe call it TAMBatch or
something.
> 2)
> /* Initialize static fields of HeapTupleData. Row bodies remain on page. */
> relid = RelationGetRelid(sscan->rs_rd);
> for (int i = 0; i < maxitems; i++)
> hb->tupdata[i].t_tableOid = relid;
>
> Is it really necessary? I see that we are setting this field inside the
> heapgettup_pagemode_batch function.
It's intentional -- by initializing t_tableOid once in
heap_begin_batch, we can avoid setting it repeatedly for every tuple
in heapgettup_pagemode_batch(). Though you are correct to point out
the redundant assignment in heapgettup_pagemode_batch(); I'll change
it to an Assert instead. The relid doesn't change during the scan.
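
A minimal standalone sketch of that split, assuming hypothetical names
(TupleHdr, batch_init, batch_fill) rather than the real heapam structures: the
field that never changes is written once at setup, and the per-batch loop only
checks it.

```c
#include <assert.h>

typedef unsigned int Oid;   /* stand-in for PostgreSQL's Oid */

/* Hypothetical tuple header with one static and one per-tuple field. */
typedef struct TupleHdr
{
    Oid          table_oid; /* same for every tuple in the scan */
    unsigned int len;       /* differs per tuple */
} TupleHdr;

/* Done once, when the batch is created. */
static void
batch_init(TupleHdr *hdrs, int maxitems, Oid relid)
{
    for (int i = 0; i < maxitems; i++)
        hdrs[i].table_oid = relid;
}

/* Done for every batch: only per-tuple fields are written. */
static int
batch_fill(TupleHdr *hdrs, int maxitems, Oid relid,
           const unsigned int *lens, int navail)
{
    int n = 0;

    for (; n < navail && n < maxitems; n++)
    {
        hdrs[n].len = lens[n];
        /* The static field is verified, not reassigned. */
        assert(hdrs[n].table_oid == relid);
    }

    return n;
}
```

In a build with assertions disabled the check costs nothing, so the hot loop
touches one field fewer per tuple.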
> A few comment on 0002 patch:
>
> 1)
> I guess that you should rebase your patches on the current master, because
> the second patch doesn't apply.
Yep, will do.
> 2)
> Maybe we can use tuplestore for tuples stored in TupleBatch? It is just a
> proposal - I didn't check this idea carefully.
TupleBatch is designed to be lightweight -- it holds an array of
TupleTableSlot pointers, not the tuple data itself. The slots
reference tuples that remain in the AM's buffer (no copy). Using
tuplestore would require materializing tuples, adding overhead we're
trying to avoid.
--
Thanks, Amit Langote
* Re: Batching in executor
@ 2026-01-29 07:35 Amit Langote <[email protected]>
parent: Amit Langote <[email protected]>
0 siblings, 2 replies; 29+ messages in thread
From: Amit Langote @ 2026-01-29 07:35 UTC (permalink / raw)
To: Daniil Davydov <[email protected]>; +Cc: cca5507 <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
Hi,
Here is v5 of the patch series.
Patches 0001-0003 add the core batching infrastructure. 0001 adds the
batch table AM API with heapam implementation, 0002 wires up SeqScan
to use it (still returning one slot at a time), and 0003 adds EXPLAIN
(BATCHES). I'd love to hear people's thoughts on the TupleBatch
structure added in 0002. I thought about making it a separate patch so
that 0002 would still populate the single ScanState.ss_scanTupleSlot,
but that would mean calling a TAM callback for every tuple to move it
from the TAM's batch struct into the slot, defeating the whole
point. With TupleBatch, you have executor_batch_rows slots,
all of which are filled by one TAM callback (materialize_all) call. So I
decided to keep TupleBatch and related things in 0002.
For scans without quals, batching shows 20-30% improvement with no
visible regressions when batching is disabled (batch_rows=0):
SELECT * FROM t LIMIT n (no qual)
Rows       Master    batch=0   %diff   batch=64   %diff
------  ---------  ---------  ------  ---------  ------
1M       12.42 ms   11.96 ms    3.7%    8.56 ms   31.0%
3M       38.95 ms   38.92 ms    0.1%   28.59 ms   26.6%
10M     153.64 ms  150.28 ms    2.2%  112.95 ms   26.5%
(%diff: positive = faster than master, negative = slower)
Patches 0004-0005 add batched qual evaluation and are more
experimental (see below on why 0005 exists). For quals referencing
early columns, the improvement is significant:
SELECT * FROM t WHERE a = 0 ... OFFSET n (qual on 1st col)
Rows       Master   batch=64   %diff
------  ---------  ---------  ------
1M       30.19 ms   15.55 ms   48.5%
3M       92.47 ms   50.01 ms   45.9%
10M     325.58 ms  211.83 ms   34.9%
However, for quals on later columns (e.g., the 15th), batching provides no
benefit and is in fact slightly slower; tuple deformation dominates the runtime:
SELECT * FROM t WHERE o = 0 ... OFFSET n (qual on 15th col)
Rows       Master   batch=64   %diff
------  ---------  ---------  ------
1M       44.14 ms   44.56 ms   -0.9%
3M      133.89 ms  137.77 ms   -2.9%
10M     503.33 ms  528.88 ms   -5.1%
I don't have a satisfactory explanation for why batching doesn't help
the deform-heavy case at all. One would expect at least some benefit
from reduced per-tuple overhead, but that's not materializing.
I've also been struggling to understand why 0004 affects the per-tuple
path even when batch_rows=0. For quals with 0% selectivity (all rows
fail the qual), perf shows ExecInterpExpr is noticeably hotter with
the patched code compared to master, even though batching is disabled:
SELECT * FROM t WHERE a = 0 ... OFFSET n (0% selectivity)
Rows       Master    batch=0   %diff   batch=64   %diff
------  ---------  ---------  ------  ---------  ------
1M       24.37 ms   28.67 ms  -17.6%   12.46 ms   48.9%
3M       73.95 ms   85.07 ms  -15.0%   41.64 ms   43.7%
10M     287.63 ms  316.81 ms  -10.1%  188.01 ms   34.6%
Compare that to 100% selectivity (all rows pass), where there's no regression:
SELECT * FROM t WHERE a > 0 ... OFFSET n (100% selectivity)
Rows       Master    batch=0   %diff   batch=64   %diff
------  ---------  ---------  ------  ---------  ------
1M       29.44 ms   29.10 ms    1.2%   16.61 ms   43.6%
3M       91.22 ms   90.28 ms    1.0%   54.10 ms   40.7%
10M     360.77 ms  331.25 ms    8.2%  224.00 ms   37.9%
I tried moving batch opcodes to a separate interpreter (0005) thinking
it might be register pressure or jump table effects from adding cases
to ExecInterpExpr's switch. With 0005, the generated assembly for
ExecInterpExpr looks identical to master (same stack frame size, same
epilogue), yet the performance still differs. Specifically, the ldp
instruction in the function epilogue shows 53% hotness in patched vs
35% in master. We still need placeholder entries in the dispatch
table, so it's unclear if this fully isolates the per-tuple path. I'll
continue looking at perf, but I feel a bit at a loss here and
would appreciate any insights.
Other changes worth noting:
- I removed the BatchVector intermediate representation that copied
Datums into columnar arrays before qual evaluation (it used to be in
the batched qual patch 0004). Now quals access batch slots' tts_values
directly. This simplifies the code and the copy overhead wasn't paying
off. If we pursue serious vectorization later, this may need to be
revisited, but removing it doesn't degrade performance.
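
To make the "quals read tts_values directly" idea concrete, here is a
simplified standalone sketch. It is an assumption-laden illustration, not code
from 0004: Slot, Datum (as uintptr_t) and batch_qual_eq are hypothetical
stand-ins, NULL handling and tuple deformation are ignored, and the in/out
arrays only loosely mirror the inslots/outslots arrays of TupleBatch.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t Datum;    /* stand-in; not PostgreSQL's definition */

/* Hypothetical slot: just a row-wise array of already-deformed values. */
typedef struct Slot
{
    Datum *values;          /* plays the role of tts_values */
} Slot;

/*
 * Evaluate "column attno = constant" over in[0..nin) by reading each slot's
 * values array in place.  Pointers to passing slots are written to out[];
 * no Datum is copied into a separate columnar array first.
 */
static int
batch_qual_eq(Slot **in, int nin, int attno, Datum constant, Slot **out)
{
    int nout = 0;

    assert(nin >= 0 && attno >= 0);

    for (int i = 0; i < nin; i++)
    {
        if (in[i]->values[attno] == constant)
            out[nout++] = in[i];
    }

    return nout;
}
```

The loop stays row-oriented, which is simpler than the removed columnar copy
but gives up the contiguous per-column layout that SIMD-style code would want.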
--
Thanks, Amit Langote
Attachments:
[application/octet-stream] v5-0001-Add-batch-table-AM-API-and-heapam-implementation.patch (13.0K, 2-v5-0001-Add-batch-table-AM-API-and-heapam-implementation.patch)
From f772043e2104bf67964418dc80c3abb56bdb069d Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 29 Jan 2026 00:57:04 +0900
Subject: [PATCH v5 1/5] Add batch table AM API and heapam implementation
Introduce new table AM callbacks to fetch multiple tuples per call.
This reduces per-tuple call overhead by letting executor nodes work
in batches.
Define a HeapBatch structure and supporting code in tableam.h.
Batches are limited to tuples from a single page and at most
EXEC_BATCH_ROWS (currently 64) entries.
Provide initial heapam support with heapgettup_pagemode_batch().
No executor node is switched over yet; a later commit will adapt
SeqScan to use this API. Other nodes may adopt it in the future.
Also add pgstat_count_heap_getnext_batch() to record batched fetches
in pgstat.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 221 +++++++++++++++++++++++
src/backend/access/heap/heapam_handler.c | 4 +
src/include/access/heapam.h | 18 ++
src/include/access/tableam.h | 58 ++++++
src/include/pgstat.h | 5 +
5 files changed, 306 insertions(+)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f30a56ecf55..d8d1bdf5191 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1151,6 +1151,134 @@ continue_page:
scan->rs_inited = false;
}
+/*
+ * heapgettup_pagemode_batch
+ * Collect up to 'maxitems' visible tuples from a single page in page mode.
+ *
+ * This function returns a *batch* of tuples from one heap page. If the
+ * current page (as tracked by the scan desc) has no more tuples left,
+ * it will advance to the next page and prepare it (via heap_prepare_pagescan).
+ * It will not cross a page boundary while filling the batch.
+ *
+ * Return value:
+ * number of tuples written into 'tdata' (0 at end-of-scan).
+ *
+ * Side effects:
+ * - Ensures rs_cbuf pins the page from which tuples were produced.
+ * - Sets rs_cblock, rs_cindex, rs_ntuples consistently (same as
+ * heapgettup_pagemode’s inner-loop effects).
+ * - Does *not* change buffer pin counts except through normal page
+ * transitions performed by heap_fetch_next_buffer().
+ */
+static int
+heapgettup_pagemode_batch(HeapScanDesc scan,
+ ScanDirection dir,
+ int nkeys, ScanKey key,
+ HeapTupleData *tdata,
+ int maxitems)
+{
+ Page page;
+ uint32 lineindex;
+ uint32 linesleft;
+ int nout = 0;
+ Relation rel = scan->rs_base.rs_rd;
+ TupleDesc tupdesc = RelationGetDescr(rel);
+
+ /*
+ * Current batching limitations (may be relaxed in future):
+ *
+ * - Forward scans only: backward scan support would require changes to
+ * batch iteration and page advancement logic.
+ *
+ * - Pagemode required: batching relies on the pre-built rs_vistuples[]
+ * array from heap_prepare_pagescan(). This is guaranteed by
+ * ScanCanUseBatching() which only enables batching when SO_ALLOW_PAGEMODE
+ * is set. Unlike heap_getnextslot, we don't support dynamic fallback to
+ * tuple-at-a-time mode since the batch execution path is selected at
+ * ExecInit time.
+ */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(maxitems > 0);
+
+ /*
+ * Loop until we find tuples that pass the scan key, or reach end of scan.
+ * We never cross page boundaries within a single batch.
+ */
+ for (;;)
+ {
+ /*
+ * Advance to a page with visible tuples if needed.
+ */
+ if (BufferIsValid(scan->rs_cbuf))
+ {
+ lineindex = scan->rs_cindex + 1;
+ linesleft = (lineindex <= scan->rs_ntuples) ?
+ (scan->rs_ntuples - lineindex) : 0;
+ }
+ else
+ linesleft = 0;
+
+ while (linesleft == 0)
+ {
+ heap_fetch_next_buffer(scan, dir);
+
+ if (!BufferIsValid(scan->rs_cbuf))
+ {
+ /* End of scan */
+ scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_prefetch_block = InvalidBlockNumber;
+ scan->rs_inited = false;
+ return 0;
+ }
+
+ Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+ heap_prepare_pagescan((TableScanDesc) scan);
+
+ lineindex = 0;
+ linesleft = scan->rs_ntuples;
+ }
+
+ /*
+ * Walk rs_vistuples[] copying headers into tdata[] until the page
+ * is exhausted or batch capacity is reached.
+ */
+ page = BufferGetPage(scan->rs_cbuf);
+
+ for (; linesleft > 0 && nout < maxitems; linesleft--, lineindex++)
+ {
+ OffsetNumber lineoff;
+ ItemId lpp;
+ HeapTupleData *dst = &tdata[nout];
+
+ Assert(lineindex < scan->rs_ntuples);
+ lineoff = scan->rs_vistuples[lineindex];
+ lpp = PageGetItemId(page, lineoff);
+ Assert(ItemIdIsNormal(lpp));
+
+ dst->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+ dst->t_len = ItemIdGetLength(lpp);
+ Assert(dst->t_tableOid == RelationGetRelid(rel));
+ ItemPointerSet(&(dst->t_self), scan->rs_cblock, lineoff);
+
+ if (key != NULL && !HeapKeyTest(dst, tupdesc, nkeys, key))
+ continue;
+
+ scan->rs_cindex = lineindex;
+ nout++;
+ }
+
+ /* Return if we found any tuples; otherwise try next page */
+ if (nout > 0)
+ return nout;
+
+ /* Mark page exhausted so we advance on next iteration */
+ scan->rs_cindex = scan->rs_ntuples;
+ }
+
+ pg_unreachable();
+ return 0;
+}
/* ----------------------------------------------------------------
* heap access method interface
@@ -1483,6 +1611,99 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
return true;
}
+/*---------- Batching support -----------*/
+
+/*
+ * heap_scan_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ HeapBatch *hb;
+ Oid relid;
+ Size alloc_size;
+
+ Assert(maxitems > 0);
+
+ /* Single allocation for HeapBatch header + tupdata array */
+ alloc_size = sizeof(HeapBatch) + sizeof(HeapTupleData) * maxitems;
+ hb = palloc(alloc_size);
+ hb->tupdata = (HeapTupleData *) ((char *) hb + sizeof(HeapBatch));
+ hb->maxitems = maxitems;
+ hb->nitems = 0;
+ hb->buf = InvalidBuffer;
+
+ /* Initialize static fields of HeapTupleData. Row bodies remain on page. */
+ relid = RelationGetRelid(sscan->rs_rd);
+ for (int i = 0; i < maxitems; i++)
+ hb->tupdata[i].t_tableOid = relid;
+
+ return hb;
+}
+
+/*
+ * heap_scan_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ if (BufferIsValid(hb->buf))
+ ReleaseBuffer(hb->buf);
+
+ pfree(hb);
+}
+
+int
+heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ HeapScanDesc scan = (HeapScanDesc) sscan;
+ HeapBatch *hb = (HeapBatch *) am_batch;
+ Buffer curbuf;
+ int n;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+ Assert(hb->maxitems > 0);
+
+ /* Drop prior batch pin, if any. */
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+
+ hb->nitems = 0;
+
+ /* One call per batch, never crosses a page. */
+ n = heapgettup_pagemode_batch(scan, dir,
+ sscan->rs_nkeys, sscan->rs_key,
+ hb->tupdata, hb->maxitems);
+
+ if (n == 0)
+ return 0; /* end of scan */
+
+ /* Hold a shared pin for the batch lifetime so t_data stays valid. */
+ curbuf = scan->rs_cbuf;
+ IncrBufferRefCount(curbuf);
+ hb->buf = curbuf;
+
+ /* Per-tuple stats (can be collapsed into a future _multi() call). */
+ pgstat_count_heap_getnext_batch(sscan->rs_rd, n);
+
+ hb->nitems = n;
+ return n;
+}
+
+/*----- End of batching support -----*/
+
void
heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index cbef73e5d4b..e4cf7fc296b 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2637,6 +2637,10 @@ static const TableAmRoutine heapam_methods = {
.scan_rescan = heap_rescan,
.scan_getnextslot = heap_getnextslot,
+ .scan_begin_batch = heap_begin_batch,
+ .scan_getnextbatch = heap_getnextbatch,
+ .scan_end_batch = heap_end_batch,
+
.scan_set_tidrange = heap_set_tidrange,
.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 3c0961ab36b..e2417650c5f 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -101,6 +101,19 @@ typedef struct HeapScanDescData
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
+/*
+ * HeapBatch -- stateless per-batch buffer. A batch pins one page and
+ * exposes up to maxitems HeapTupleData headers whose t_data point into that
+ * page.
+ */
+typedef struct HeapBatch
+{
+ HeapTupleData *tupdata; /* len = maxitems; headers only */
+ int nitems; /* tuples produced in last getnextbatch() */
+ int maxitems; /* fixed capacity set at begin_batch() */
+ Buffer buf; /* single pinned buffer for this batch */
+} HeapBatch;
+
typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
@@ -337,6 +350,11 @@ extern void heap_endscan(TableScanDesc sscan);
extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
extern bool heap_getnextslot(TableScanDesc sscan,
ScanDirection direction, TupleTableSlot *slot);
+
+extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
+extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
+extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e2ec5289d4d..584b580f7a1 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -351,6 +351,16 @@ typedef struct TableAmRoutine
ScanDirection direction,
TupleTableSlot *slot);
+ /* ------------------------------------------------------------------------
+ * Batched scan support
+ * ------------------------------------------------------------------------
+ */
+
+ void *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+ int (*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+ ScanDirection dir);
+ void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -1036,6 +1046,54 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_scan_begin_batch
+ * Allocate AM-owned batch payload with capacity 'maxitems'.
+ */
+static inline void *
+table_scan_begin_batch(TableScanDesc sscan, int maxitems)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_begin_batch != NULL);
+
+ return tam->scan_begin_batch(sscan, maxitems);
+}
+
+/*
+ * table_scan_getnextbatch
+ * Fill next batch from the AM. Returns number of tuples, 0 => EOS.
+ * Batches are single-page in v1. Direction is forward only in v1.
+ */
+static inline int
+table_scan_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ /* Only forward scans are supported in the batched mode. */
+ Assert(ScanDirectionIsForward(dir));
+ Assert(tam->scan_getnextbatch != NULL);
+
+ return tam->scan_getnextbatch(sscan, am_batch, dir);
+}
+
+/*
+ * table_scan_end_batch
+ * Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, void *am_batch)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ if (am_batch == NULL)
+ return;
+
+ Assert(tam->scan_end_batch != NULL);
+
+ tam->scan_end_batch(sscan, am_batch);
+}
+
/* ----------------------------------------------------------------------------
* TID Range scanning related functions.
* ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff7ecc2533..48e4e034a33 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -697,6 +697,11 @@ extern void pgstat_report_analyze(Relation rel,
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.tuples_returned++; \
} while (0)
+#define pgstat_count_heap_getnext_batch(rel, n) \
+ do { \
+ if (pgstat_should_count_relation(rel)) \
+ (rel)->pgstat_info->counts.tuples_returned += n; \
+ } while (0)
#define pgstat_count_heap_fetch(rel) \
do { \
if (pgstat_should_count_relation(rel)) \
--
2.47.3
[application/octet-stream] v5-0002-SeqScan-add-batch-driven-variants-returning-slots.patch (27.6K, 3-v5-0002-SeqScan-add-batch-driven-variants-returning-slots.patch)
From 94d0f92c807895e6edadf583a06bb39c5dc52a4c Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 27 Jan 2026 14:07:55 +0900
Subject: [PATCH v5 2/5] SeqScan: add batch-driven variants returning slots
Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per tuple AM crossings without
changing the node interface seen by parents.
Add TupleBatch and supporting code in execBatch.c/h to hold executor
side batching state. PlanState gains ps_Batch to carry the active
TupleBatch when a node supports batching.
Add executor_batch_rows GUC to specify the maximum number of rows
that can be added into a batch.
Wire up runtime selection in ExecInitSeqScan using
ScanCanUseBatching(). When executor_batch_rows > 1, EPQ is
inactive, the scan is not backward, and the relation supports
batching, ps.ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used.
Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled and allowed.
Notes / current limits:
- With the current heapam, batches are composed from a single page, so
the batch may not always be full. Future work may let SeqScan and/or
AMs top up batches across pages when safe to do so.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 28 ++++
src/backend/access/heap/heapam_handler.c | 16 ++
src/backend/access/table/tableam.c | 11 ++
src/backend/executor/Makefile | 1 +
src/backend/executor/execBatch.c | 112 ++++++++++++++
src/backend/executor/execScan.c | 31 ++++
src/backend/executor/meson.build | 1 +
src/backend/executor/nodeSeqscan.c | 176 +++++++++++++++++++++-
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_parameters.dat | 9 ++
src/include/access/heapam.h | 1 +
src/include/access/tableam.h | 27 ++++
src/include/executor/execBatch.h | 99 ++++++++++++
src/include/executor/execScan.h | 69 +++++++++
src/include/executor/executor.h | 4 +
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 4 +
17 files changed, 592 insertions(+), 1 deletion(-)
create mode 100644 src/backend/executor/execBatch.c
create mode 100644 src/include/executor/execBatch.h
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index d8d1bdf5191..db91085b07c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1644,6 +1644,34 @@ heap_begin_batch(TableScanDesc sscan, int maxitems)
return hb;
}
+/*
+ * heap_scan_materialize_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
+{
+ HeapBatch *hb = (HeapBatch *) am_batch;
+
+ Assert(n <= hb->nitems);
+
+ for (int i = 0; i < n; i++)
+ {
+ HeapTupleData *tuple = &hb->tupdata[i];
+ HeapTupleTableSlot *slot = (HeapTupleTableSlot *) slots[i];
+
+ /* Inline of ExecStoreHeapTuple(tuple, slot, false) */
+ slot->tuple = tuple;
+ slot->off = 0;
+ slot->base.tts_nvalid = 0;
+ slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
+ slot->base.tts_tid = tuple->t_self;
+ slot->base.tts_tableOid = tuple->t_tableOid;
+ }
+}
+
/*
* heap_scan_end_batch
*
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index e4cf7fc296b..0f6bda7b69f 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -72,6 +72,21 @@ heapam_slot_callbacks(Relation relation)
return &TTSOpsBufferHeapTuple;
}
+/* ------------------------------------------------------------------------
+ * TupleBatch related callbacks for heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static const TupleBatchOps TupleBatchHeapOps =
+{
+ .materialize_all = heap_materialize_batch_all
+};
+
+static const TupleBatchOps *
+heapam_batch_callbacks(Relation relation)
+{
+ return &TupleBatchHeapOps;
+}
/* ------------------------------------------------------------------------
* Index Scan Callbacks for heap AM
@@ -2631,6 +2646,7 @@ static const TableAmRoutine heapam_methods = {
.type = T_TableAmRoutine,
.slot_callbacks = heapam_slot_callbacks,
+ .batch_callbacks = heapam_batch_callbacks,
.scan_begin = heap_beginscan,
.scan_end = heap_endscan,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 87491796523..ffb3b738f6a 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -103,6 +103,17 @@ table_slot_create(Relation relation, List **reglist)
return slot;
}
+/* ----------------------------------------------------------------------------
+ * TupleBatch support routines
+ * ----------------------------------------------------------------------------
+ */
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+ if (relation->rd_tableam)
+ return relation->rd_tableam->batch_callbacks(relation);
+ elog(ERROR, "relation does not support TupleBatch operations");
+}
/* ----------------------------------------------------------------------------
* Table scan functions.
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..3e72f3fe03c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
execAsync.o \
+ execBatch.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
new file mode 100644
index 00000000000..1ef4117b87c
--- /dev/null
+++ b/src/backend/executor/execBatch.c
@@ -0,0 +1,112 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.c
+ * Helpers for TupleBatch
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "executor/execBatch.h"
+
+/*
+ * TupleBatchCreate
+ * Allocate and initialize a new TupleBatch envelope.
+ */
+TupleBatch *
+TupleBatchCreate(TupleDesc scandesc, int capacity)
+{
+ TupleBatch *b;
+ TupleTableSlot **inslots,
+ **outslots;
+ Size alloc_size;
+
+ /* Single allocation for TupleBatch + inslots + outslots arrays */
+ alloc_size = sizeof(TupleBatch) + 2 * sizeof(TupleTableSlot *) * capacity;
+ b = palloc(alloc_size);
+ inslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch));
+ outslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch) +
+ sizeof(TupleTableSlot *) * capacity);
+
+ for (int i = 0; i < capacity; i++)
+ inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);
+
+ /* Initial state: empty envelope */
+ b->am_payload = NULL;
+ b->ntuples = 0;
+ b->inslots = inslots;
+ b->outslots = outslots;
+ b->activeslots = NULL;
+ b->maxslots = capacity;
+
+ b->nvalid = 0;
+ b->next = 0;
+
+ return b;
+}
+
+/*
+ * TupleBatchReset
+ * Reset an existing TupleBatch envelope to empty.
+ */
+void
+TupleBatchReset(TupleBatch *b, bool drop_slots)
+{
+ Assert(b != NULL);
+
+ for (int i = 0; i < b->maxslots; i++)
+ {
+ ExecClearTuple(b->inslots[i]);
+ if (drop_slots)
+ ExecDropSingleTupleTableSlot(b->inslots[i]);
+ }
+
+ b->ntuples = 0;
+ b->nvalid = 0;
+ b->next = 0;
+ b->activeslots = NULL;
+}
+
+void
+TupleBatchUseInput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->inslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+void
+TupleBatchUseOutput(TupleBatch *b, int nvalid)
+{
+ b->materialized = true;
+ b->activeslots = b->outslots;
+ b->nvalid = nvalid;
+ b->next = 0;
+}
+
+bool
+TupleBatchIsValid(TupleBatch *b)
+{
+ return b != NULL &&
+ b->maxslots > 0 &&
+ b->inslots != NULL &&
+ b->outslots != NULL;
+}
+
+void
+TupleBatchRewind(TupleBatch *b)
+{
+ b->next = 0;
+}
+
+int
+TupleBatchGetNumValid(TupleBatch *b)
+{
+ return b->nvalid;
+}
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 9f68be17b99..5023eb6756a 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -18,6 +18,7 @@
*/
#include "postgres.h"
+#include "access/tableam.h"
#include "executor/executor.h"
#include "executor/execScan.h"
#include "miscadmin.h"
@@ -154,3 +155,33 @@ ExecScanReScan(ScanState *node)
}
}
}
+
+bool
+ScanCanUseBatching(ScanState *scanstate, int eflags)
+{
+ Relation relation = scanstate->ss_currentRelation;
+
+ return executor_batch_rows > 1 &&
+ (scanstate->ps.state->es_epq_active == NULL) &&
+ !(eflags & EXEC_FLAG_BACKWARD) &&
+ relation && table_supports_batching(relation);
+}
+
+void
+ScanResetBatching(ScanState *scanstate, bool drop)
+{
+ TupleBatch *b = scanstate->ps.ps_Batch;
+
+ if (b)
+ {
+ TupleBatchReset(b, drop);
+ if (b->am_payload)
+ {
+ table_scan_end_batch(scanstate->ss_currentScanDesc,
+ b->am_payload);
+ b->am_payload = NULL;
+ }
+ if (drop)
+ pfree(b);
+ }
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index dc45be0b2ce..e5af90e3a0f 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'execAmi.c',
'execAsync.c',
+ 'execBatch.c',
'execCurrent.c',
'execExpr.c',
'execExprInterp.c',
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index af3c788ce8b..08d93e6f0be 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -203,6 +203,171 @@ ExecSeqScanEPQ(PlanState *pstate)
(ExecScanRecheckMtd) SeqRecheck);
}
+/* ----------------------------------------------------------------
+ * Batch Support
+ * ----------------------------------------------------------------
+ */
+static bool
+SeqNextBatch(SeqScanState *node)
+{
+ TableScanDesc scandesc;
+ EState *estate;
+ ScanDirection direction;
+
+ Assert(node->ss.ps.ps_Batch != NULL);
+
+ /*
+ * get information from the estate and scan state
+ */
+ scandesc = node->ss.ss_currentScanDesc;
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ Assert(ScanDirectionIsForward(direction));
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the scan is not parallel, or if we're serially
+ * executing a scan that was planned to be parallel.
+ */
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ node->ss.ss_currentScanDesc = scandesc;
+ }
+
+ /* Lazily create the AM batch payload. */
+ if (node->ss.ps.ps_Batch->am_payload == NULL)
+ {
+ const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+ Assert(tam && tam->scan_begin_batch);
+ node->ss.ps.ps_Batch->am_payload =
+ table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+ node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+ }
+
+ node->ss.ps.ps_Batch->ntuples =
+ table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
+ node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
+ node->ss.ps.ps_Batch->materialized = false;
+
+ return node->ss.ps.ps_Batch->ntuples > 0;
+}
+
+static bool
+SeqNextBatchMaterialize(SeqScanState *node)
+{
+ if (SeqNextBatch(node))
+ {
+ TupleBatchMaterializeAll(node->ss.ps.ps_Batch);
+ return true;
+ }
+
+ return false;
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtendedBatchSlot().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() used when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() used when qual evaluation and projection
+ * are required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return ExecScanExtendedBatchSlot(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual, pstate->ps_ProjInfo);
+}
+
+/* Batch SeqScan enablement and dispatch */
+static void
+SeqScanInitBatching(SeqScanState *scanstate, int eflags)
+{
+ const int cap = executor_batch_rows;
+ TupleDesc scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+
+ scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+
+ /* Choose the batch variant matching the qual/projection specialization */
+ if (scanstate->ss.ps.qual == NULL)
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+ }
+ }
+ else
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ }
+ else
+ {
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecInitSeqScan
* ----------------------------------------------------------------
@@ -211,6 +376,7 @@ SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
SeqScanState *scanstate;
+ bool use_batching;
/*
* Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -241,9 +407,12 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
node->scan.scanrelid,
eflags);
+ use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
+
/* and create slot with the appropriate rowtype */
ExecInitScanTupleSlot(estate, &scanstate->ss,
RelationGetDescr(scanstate->ss.ss_currentRelation),
+ use_batching ? &TTSOpsHeapTuple :
table_slot_callbacks(scanstate->ss.ss_currentRelation));
/*
@@ -280,6 +449,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
+ if (use_batching)
+ SeqScanInitBatching(scanstate, eflags);
+
return scanstate;
}
@@ -299,6 +471,8 @@ ExecEndSeqScan(SeqScanState *node)
*/
scanDesc = node->ss.ss_currentScanDesc;
+ ScanResetBatching(&node->ss, true);
+
/*
* close heap scan
*/
@@ -327,7 +501,7 @@ ExecReScanSeqScan(SeqScanState *node)
if (scan != NULL)
table_rescan(scan, /* scan desc */
NULL); /* new scan keys */
-
+ ScanResetBatching(&node->ss, false);
ExecScanReScan((ScanState *) node);
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 36ad708b360..535e29d7823 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int notify_buffers = 16;
int serializable_buffers = 32;
int subtransaction_buffers = 0;
int transaction_buffers = 0;
+
+/* executor batching */
+int executor_batch_rows = 64;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index f0260e6e412..4c422c854d0 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1004,6 +1004,15 @@
boot_val => 'true',
},
+{ name => 'executor_batch_rows', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+ short_desc => 'Number of rows to include in batches during execution.',
+ flags => 'GUC_NOT_IN_SAMPLE',
+ variable => 'executor_batch_rows',
+ boot_val => '64',
+ min => '0',
+ max => '1024',
+},
+
{ name => 'exit_on_error', type => 'bool', context => 'PGC_USERSET', group => 'ERROR_HANDLING_OPTIONS',
short_desc => 'Terminate session on any error.',
variable => 'ExitOnAnyError',
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e2417650c5f..d6154d5ab15 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -354,6 +354,7 @@ extern bool heap_getnextslot(TableScanDesc sscan,
extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+extern void heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n);
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 584b580f7a1..bdf733c8b22 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
#include "access/sdir.h"
#include "access/xact.h"
#include "commands/vacuum.h"
+#include "executor/execBatch.h"
#include "executor/tuptable.h"
#include "storage/read_stream.h"
#include "utils/rel.h"
@@ -39,6 +40,7 @@ typedef struct BulkInsertStateData BulkInsertStateData;
typedef struct IndexInfo IndexInfo;
typedef struct SampleScanState SampleScanState;
typedef struct ValidateIndexState ValidateIndexState;
+typedef struct TupleBatchOps TupleBatchOps;
/*
* Bitmask values for the flags argument to the scan_begin callback.
@@ -301,6 +303,7 @@ typedef struct TableAmRoutine
* Return slot implementation suitable for storing a tuple of this AM.
*/
const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
+ const TupleBatchOps *(*batch_callbacks)(Relation rel);
/* ------------------------------------------------------------------------
@@ -361,6 +364,7 @@ typedef struct TableAmRoutine
ScanDirection dir);
void (*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -872,6 +876,16 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
*/
extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
+/* ----------------------------------------------------------------------------
+ * TupleBatch functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Returns callbacks for manipulating TupleBatch for tuples of the given
+ * relation.
+ */
+extern const TupleBatchOps *table_batch_callbacks(Relation relation);
/* ----------------------------------------------------------------------------
* Table scan functions.
@@ -1046,6 +1060,18 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_supports_batching
+ * Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation)
+{
+ const TableAmRoutine *tam = relation->rd_tableam;
+
+ return tam->scan_getnextbatch != NULL;
+}
+
/*
* table_scan_begin_batch
* Allocate AM-owned batch payload with capacity 'maxitems'.
@@ -2128,5 +2154,6 @@ extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
*/
extern const TableAmRoutine *GetHeapamTableAmRoutine(void);
+extern struct TupleBatchOps *GetHeapamTupleBatchOps(void);
#endif /* TABLEAM_H */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
new file mode 100644
index 00000000000..2d0066103ce
--- /dev/null
+++ b/src/include/executor/execBatch.h
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.h
+ * Executor batch envelope for passing tuple batch state upward
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECBATCH_H
+#define EXECBATCH_H
+
+#include "executor/tuptable.h"
+
+/*
+ * TupleBatchOps -- AM-specific helpers for lazy materialization.
+ */
+typedef struct TupleBatchOps
+{
+ void (*materialize_all)(void *am_payload,
+ TupleTableSlot **dst,
+ int maxslots);
+} TupleBatchOps;
+
+/*
+ * TupleBatch
+ *
+ * Envelope for a batch of tuples produced by a plan node (e.g., SeqScan) per
+ * call to a batch variant of ExecSeqScan().
+ */
+typedef struct TupleBatch
+{
+ void *am_payload;
+ const TupleBatchOps *ops;
+ int ntuples; /* number of tuples in am_payload */
+ bool materialized; /* tuples in slots valid? */
+ struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
+ struct TupleTableSlot **outslots; /* slots for tuples going "out of"
+ * batch */
+ struct TupleTableSlot **activeslots;
+ int maxslots;
+
+ int nvalid; /* number of returnable tuples in activeslots */
+ int next; /* 0-based index of next tuple to be returned */
+} TupleBatch;
+
+
+/* Helpers */
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
+extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
+extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
+extern bool TupleBatchIsValid(TupleBatch *b);
+extern void TupleBatchRewind(TupleBatch *b);
+extern int TupleBatchGetNumValid(TupleBatch *b);
+
+static inline TupleTableSlot *
+TupleBatchGetNextSlot(TupleBatch *b)
+{
+ return b->next < b->nvalid ? b->activeslots[b->next++] : NULL;
+}
+
+static inline TupleTableSlot *
+TupleBatchGetSlot(TupleBatch *b, int index)
+{
+ Assert(index < b->nvalid);
+ return b->activeslots[index];
+}
+
+static inline void
+TupleBatchStoreInOut(TupleBatch *b, int index, TupleTableSlot *out)
+{
+ Assert(TupleBatchIsValid(b));
+ b->outslots[index] = out;
+}
+
+static inline bool
+TupleBatchHasMore(TupleBatch *b)
+{
+ return b->activeslots && b->next < b->nvalid;
+}
+
+static inline void
+TupleBatchMaterializeAll(TupleBatch *b)
+{
+ if (b->materialized)
+ return;
+
+ if (b->ops == NULL || b->ops->materialize_all == NULL)
+ elog(ERROR, "TupleBatch has no slots and no materialize_all op");
+
+ b->ops->materialize_all(b->am_payload, b->inslots, b->ntuples);
+ TupleBatchUseInput(b, b->ntuples);
+}
+
+#endif /* EXECBATCH_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index 028edb8d9fd..d9185331e22 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -251,4 +251,73 @@ ExecScanExtended(ScanState *node,
}
}
+/*
+ * ExecScanExtendedBatchSlot
+ * Batch-driven variant of ExecScanExtended.
+ *
+ * Returns one tuple at a time to callers, but internally fetches tuples
+ * in batches from the AM via accessBatchMtd. This reduces per-tuple AM
+ * call overhead while preserving the single-slot interface expected by
+ * parent nodes.
+ *
+ * The batch is refilled when exhausted by calling accessBatchMtd, which
+ * returns false at end-of-scan.
+ *
+ * Note: EPQ is not supported in the batch path; callers must ensure
+ * es_epq_active is NULL before using this function.
+ */
+static inline TupleTableSlot *
+ExecScanExtendedBatchSlot(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ TupleTableSlot *in;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next input slot from current batch, or refill */
+ if (!TupleBatchHasMore(b))
+ {
+ if (!accessBatchMtd(node))
+ return NULL;
+ }
+
+ in = TupleBatchGetNextSlot(b);
+ Assert(in);
+
+ /* No qual, no projection: direct return */
+ if (qual == NULL && projInfo == NULL)
+ return in;
+
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ /* Qual only */
+ if (projInfo == NULL)
+ {
+ if (qual == NULL || ExecQual(qual, econtext))
+ return in;
+ else
+ InstrCountFiltered1(node, 1);
+ continue;
+ }
+
+ /* Projection (with or without qual) */
+ if (qual == NULL || ExecQual(qual, econtext))
+ return ExecProject(projInfo);
+ else
+ InstrCountFiltered1(node, 1);
+ /* else try next tuple */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 5929aabc353..e82fd6c0c8a 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -578,12 +578,16 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
*/
typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanAccessBatchMtd)(ScanState *node);
extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
ExecScanRecheckMtd recheckMtd);
+
extern void ExecAssignScanProjectionInfo(ScanState *node);
extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
extern void ExecScanReScan(ScanState *node);
+extern bool ScanCanUseBatching(ScanState *scanstate, int eflags);
+extern void ScanResetBatching(ScanState *scanstate, bool drop);
/*
* prototypes from functions in execTuples.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index db559b39c4d..f6bd59f2af1 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -288,6 +288,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
extern PGDLLIMPORT int VacuumCostBalance;
extern PGDLLIMPORT bool VacuumCostActive;
+extern PGDLLIMPORT int executor_batch_rows;
/* in utils/misc/stack_depth.c */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f8053d9e572..6a191202ced 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -31,6 +31,7 @@
#include "access/skey.h"
#include "access/tupconvert.h"
+#include "executor/execBatch.h"
#include "executor/instrument.h"
#include "executor/instrument_node.h"
#include "fmgr.h"
@@ -1206,6 +1207,9 @@ typedef struct PlanState
ExprContext *ps_ExprContext; /* node's expression-evaluation context */
ProjectionInfo *ps_ProjInfo; /* info for doing tuple projection */
+ /* Batching state if node supports it. */
+ TupleBatch *ps_Batch;
+
bool async_capable; /* true if node is async-capable */
/*
--
2.47.3
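The refill loop in ExecScanExtendedBatchSlot() above can be illustrated outside the executor. The following is only a minimal standalone sketch, not code from the patch: MiniBatch, MiniScan, mini_next_batch(), mini_scan_next() and is_even() are hypothetical names, plain integers stand in for tuple slots, and BATCH_CAP stands in for executor_batch_rows. It shows the same shape: the caller still receives one item per call, while the source is asked for up to a full batch only when the current batch is exhausted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_CAP 4				/* stands in for executor_batch_rows */

/* Hypothetical stand-in for TupleBatch: a cursor over fetched items. */
typedef struct MiniBatch
{
	int			items[BATCH_CAP];	/* stands in for activeslots */
	int			nvalid;			/* number of returnable items */
	int			next;			/* index of next item to return */
} MiniBatch;

/* Hypothetical scan over the integers [pos, end). */
typedef struct MiniScan
{
	int			pos;
	int			end;
	MiniBatch	batch;
} MiniScan;

/* Refill the batch; returns false at end of scan, like accessBatchMtd. */
static bool
mini_next_batch(MiniScan *scan)
{
	MiniBatch  *b = &scan->batch;

	b->nvalid = 0;
	b->next = 0;
	while (b->nvalid < BATCH_CAP && scan->pos < scan->end)
		b->items[b->nvalid++] = scan->pos++;

	return b->nvalid > 0;
}

/* Example qual: keep even values only. */
static bool
is_even(int v)
{
	return (v % 2) == 0;
}

/* Return the next item passing the qual, or NULL when the scan is done. */
static const int *
mini_scan_next(MiniScan *scan, bool (*qual) (int))
{
	MiniBatch  *b;

	assert(scan != NULL);
	b = &scan->batch;

	for (;;)
	{
		const int  *in;

		/* Refill only when the current batch has been fully consumed. */
		if (b->next >= b->nvalid && !mini_next_batch(scan))
			return NULL;

		in = &b->items[b->next++];
		if (qual == NULL || qual(*in))
			return in;
		/* else filtered out: try the next item */
	}
}
```

With a zero-initialized MiniScan covering [0, 10) and is_even as the qual, repeated calls to mini_scan_next() yield the five even values and then NULL, while mini_next_batch() runs four times in total (the last call finds nothing left).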
[application/octet-stream] v5-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch (14.0K, 4-v5-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch)
From f282f5dde3b4bc58b2cd7b66e55803df26e357aa Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 23:09:37 +0900
Subject: [PATCH v5 3/5] Add EXPLAIN (BATCHES) option for tuple batching
statistics
Add a BATCHES option to EXPLAIN that reports per-node batch statistics
when a node uses batch mode execution.
For nodes that support batching (currently SeqScan), this shows the
number of batches fetched along with average, minimum, and maximum
rows per batch. Output is supported in both text and non-text formats.
Add regression tests covering text output, JSON format, filtered scans,
LIMIT, and disabled batching.
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/commands/explain.c | 30 ++++++++++++++
src/backend/commands/explain_state.c | 2 +
src/backend/executor/execBatch.c | 31 +++++++++++++-
src/backend/executor/nodeSeqscan.c | 24 ++++++-----
src/include/commands/explain_state.h | 1 +
src/include/executor/execBatch.h | 16 +++++++-
src/include/executor/instrument.h | 1 +
src/test/regress/expected/explain.out | 58 +++++++++++++++++++++++++++
src/test/regress/sql/explain.sql | 27 +++++++++++++
9 files changed, 177 insertions(+), 13 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index b7bb111688c..f3d521e1f93 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -22,6 +22,7 @@
#include "commands/explain_format.h"
#include "commands/explain_state.h"
#include "commands/prepare.h"
+#include "executor/execBatch.h"
#include "foreign/fdwapi.h"
#include "jit/jit.h"
#include "libpq/pqformat.h"
@@ -517,6 +518,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_BUFFERS;
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
+ if (es->batches)
+ instrument_option |= INSTRUMENT_BATCHES;
/*
* We always collect timing for the entire statement, even when node-level
@@ -2294,6 +2297,33 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_buffer_usage(es, &planstate->instrument->bufusage);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->walusage);
+ if (es->batches && planstate->ps_Batch)
+ {
+ TupleBatch *b = planstate->ps_Batch;
+
+ if (b->stat_batches > 0)
+ {
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfo(es->str,
+ "Batches: %lld Avg Rows: %.1f Max: %d Min: %d\n",
+ (long long) b->stat_batches,
+ TupleBatchAvgRows(b),
+ b->stat_max_rows,
+ b->stat_min_rows == INT_MAX ? 0 : b->stat_min_rows);
+ }
+ else
+ {
+ ExplainPropertyInteger("Batches", NULL, b->stat_batches, es);
+ ExplainPropertyFloat("Average Batch Rows", NULL,
+ TupleBatchAvgRows(b), 1, es);
+ ExplainPropertyInteger("Max Batch Rows", NULL, b->stat_max_rows, es);
+ ExplainPropertyInteger("Min Batch Rows", NULL,
+ b->stat_min_rows == INT_MAX ? 0 : b->stat_min_rows, es);
+ }
+ }
+ }
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index 803c74dd178..ad5b223ede7 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -159,6 +159,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
"EXPLAIN", opt->defname, p),
parser_errposition(pstate, opt->location)));
}
+ else if (strcmp(opt->defname, "batches") == 0)
+ es->batches = defGetBoolean(opt);
else if (!ApplyExtensionExplainOption(es, opt, pstate))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
index 1ef4117b87c..ed54e3165c8 100644
--- a/src/backend/executor/execBatch.c
+++ b/src/backend/executor/execBatch.c
@@ -19,7 +19,7 @@
* Allocate and initialize a new TupleBatch envelope.
*/
TupleBatch *
-TupleBatchCreate(TupleDesc scandesc, int capacity)
+TupleBatchCreate(TupleDesc scandesc, int capacity, bool track_stats)
{
TupleBatch *b;
TupleTableSlot **inslots,
@@ -47,6 +47,12 @@ TupleBatchCreate(TupleDesc scandesc, int capacity)
b->nvalid = 0;
b->next = 0;
+ b->track_stats = track_stats;
+ b->stat_batches = 0;
+ b->stat_rows = 0;
+ b->stat_max_rows = 0;
+ b->stat_min_rows = INT_MAX;
+
return b;
}
@@ -110,3 +116,26 @@ TupleBatchGetNumValid(TupleBatch *b)
{
return b->nvalid;
}
+
+void
+TupleBatchRecordStats(TupleBatch *b, int rows)
+{
+ if (!b->track_stats)
+ return;
+
+ b->stat_batches++;
+ b->stat_rows += rows;
+ if (rows > b->stat_max_rows)
+ b->stat_max_rows = rows;
+ if (rows < b->stat_min_rows && rows > 0)
+ b->stat_min_rows = rows;
+}
+
+double
+TupleBatchAvgRows(TupleBatch *b)
+{
+ if (b->stat_batches == 0)
+ return 0.0;
+
+ return (double) b->stat_rows / b->stat_batches;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 08d93e6f0be..f36b31d4fbb 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -213,8 +213,9 @@ SeqNextBatch(SeqScanState *node)
TableScanDesc scandesc;
EState *estate;
ScanDirection direction;
+ TupleBatch *b = node->ss.ps.ps_Batch;
- Assert(node->ss.ps.ps_Batch != NULL);
+ Assert(b != NULL);
/*
* get information from the estate and scan state
@@ -237,22 +238,21 @@ SeqNextBatch(SeqScanState *node)
}
/* Lazily create the AM batch payload. */
- if (node->ss.ps.ps_Batch->am_payload == NULL)
+ if (b->am_payload == NULL)
{
const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
Assert(tam && tam->scan_begin_batch);
- node->ss.ps.ps_Batch->am_payload =
- table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
- node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+ b->am_payload = table_scan_begin_batch(scandesc, b->maxslots);
+ b->ops = table_batch_callbacks(node->ss.ss_currentRelation);
}
- node->ss.ps.ps_Batch->ntuples =
- table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
- node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
- node->ss.ps.ps_Batch->materialized = false;
+ b->ntuples = table_scan_getnextbatch(scandesc, b->am_payload, direction);
+ b->nvalid = b->ntuples;
+ b->materialized = false;
+ TupleBatchRecordStats(b, b->ntuples);
- return node->ss.ps.ps_Batch->ntuples > 0;
+ return b->ntuples > 0;
}
static bool
@@ -340,8 +340,10 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
const int cap = executor_batch_rows;
TupleDesc scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+ EState *estate = scanstate->ss.ps.state;
+ bool track_stats = (estate->es_instrument & INSTRUMENT_BATCHES) != 0;
- scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+ scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap, track_stats);
/* Choose the batch variant matching the qual/projection specialization */
if (scanstate->ss.ps.qual == NULL)
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 0b695f7d812..0a99f0f2341 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -55,6 +55,7 @@ typedef struct ExplainState
bool memory; /* print planner's memory usage information */
bool settings; /* print modified settings */
bool generic; /* generate a generic plan */
+ bool batches; /* print batch statistics */
ExplainSerializeOption serialize; /* serialize the query's output? */
ExplainFormat format; /* output format */
/* state for output formatting --- not reset for each new plan tree */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
index 2d0066103ce..1efc194d8ff 100644
--- a/src/include/executor/execBatch.h
+++ b/src/include/executor/execBatch.h
@@ -13,6 +13,8 @@
#ifndef EXECBATCH_H
#define EXECBATCH_H
+#include <limits.h>
+
#include "executor/tuptable.h"
/*
@@ -45,11 +47,18 @@ typedef struct TupleBatch
int nvalid; /* number of returnable tuples in activeslots */
int next; /* 0-based index of next tuple to be returned */
+
+ /* Statistics (collected for EXPLAIN (ANALYZE, BATCHES)) */
+ bool track_stats; /* whether to collect stats */
+ int64 stat_batches; /* total number of batches fetched */
+ int64 stat_rows; /* total tuples across all batches */
+ int stat_max_rows; /* max rows in any single batch */
+ int stat_min_rows; /* min rows in any single batch (non-zero) */
} TupleBatch;
/* Helpers */
-extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity, bool track_stats);
extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
@@ -96,4 +105,9 @@ TupleBatchMaterializeAll(TupleBatch *b)
TupleBatchUseInput(b, b->ntuples);
}
+/* Batching statistics */
+
+extern void TupleBatchRecordStats(TupleBatch *b, int rows);
+extern double TupleBatchAvgRows(TupleBatch *b);
+
#endif /* EXECBATCH_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..bee69b4ac8f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -64,6 +64,7 @@ typedef enum InstrumentOption
INSTRUMENT_BUFFERS = 1 << 1, /* needs buffer usage */
INSTRUMENT_ROWS = 1 << 2, /* needs row count */
INSTRUMENT_WAL = 1 << 3, /* needs WAL usage */
+ INSTRUMENT_BATCHES = 1 << 4, /* needs batch statistics */
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..1bec59eea9e 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,61 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- Test BATCHES option
+set executor_batch_rows = 64;
+create table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+-- With filter
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Filter: (a > N)
+ Rows Removed by Filter: N
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With LIMIT - partial scan shows fewer batches
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+ explain_filter
+----------------------------------------------------------------------
+ Limit (actual time=N.N..N.N rows=N.N loops=N)
+ -> Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(5 rows)
+
+-- Batching disabled - no batch line
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+reset executor_batch_rows;
+-- JSON format
+select explain_filter_to_json('explain (analyze, batches, buffers off, format json) select * from batch_test where a < 1000') #> '{0,Plan,Batches}';
+ ?column?
+----------
+ 0
+(1 row)
+
+drop table batch_test;
+reset executor_batch_rows;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..7881c674495 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,30 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- Test BATCHES option
+set executor_batch_rows = 64;
+
+create table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+
+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- With filter
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+
+-- With LIMIT - partial scan shows fewer batches
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+
+-- Batching disabled - no batch line
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+reset executor_batch_rows;
+
+-- JSON format
+select explain_filter_to_json('explain (analyze, batches, buffers off, format json) select * from batch_test where a < 1000') #> '{0,Plan,Batches}';
+
+drop table batch_test;
+reset executor_batch_rows;
--
2.47.3
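The EXPLAIN tests above expect a per-node line of the form "Batches: N Avg Rows: N.N Max: N Min: N". As a rough illustration of the bookkeeping such a line needs, here is a minimal sketch in C. The DemoBatchStats struct and the demo_stats_* functions are hypothetical names invented for this example; they are not the patch's TupleBatch fields or its TupleBatchRecordStats()/TupleBatchAvgRows() implementations, which are not shown in this part of the thread.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-node counters; not the patch's actual data structure. */
typedef struct DemoBatchStats
{
    long nbatches;      /* number of batches recorded */
    long total_rows;    /* sum of rows over all batches */
    int  max_rows;      /* largest batch seen */
    int  min_rows;      /* smallest batch seen */
} DemoBatchStats;

static void
demo_stats_init(DemoBatchStats *s)
{
    s->nbatches = 0;
    s->total_rows = 0;
    s->max_rows = 0;
    s->min_rows = INT_MAX;
}

/* Record one batch that held "rows" tuples. */
static void
demo_stats_record(DemoBatchStats *s, int rows)
{
    assert(rows >= 0);
    s->nbatches++;
    s->total_rows += rows;
    if (rows > s->max_rows)
        s->max_rows = rows;
    if (rows < s->min_rows)
        s->min_rows = rows;
}

/* Average rows per batch; 0.0 when nothing was recorded. */
static double
demo_stats_avg(const DemoBatchStats *s)
{
    return s->nbatches > 0 ? (double) s->total_rows / s->nbatches : 0.0;
}

/* Format the counters in the shape the expected output uses. */
static int
demo_stats_format(const DemoBatchStats *s, char *buf, size_t len)
{
    int min_rows = s->nbatches > 0 ? s->min_rows : 0;

    return snprintf(buf, len, "Batches: %ld Avg Rows: %.1f Max: %d Min: %d",
                    s->nbatches, demo_stats_avg(s), s->max_rows, min_rows);
}
```

Recording batches of 64, 64 and 28 rows with this sketch gives three batches averaging 52.0 rows, which matches what one would expect a scan to report when its last batch is only partly filled.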
[application/octet-stream] v5-0004-WIP-Add-ExecQualBatch-for-batched-qual-evaluation.patch (32.2K, 5-v5-0004-WIP-Add-ExecQualBatch-for-batched-qual-evaluation.patch)
From e155dc70e0370435061da70362175255d83a36ea Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 26 Jan 2026 11:01:44 +0900
Subject: [PATCH v5 4/5] WIP: Add ExecQualBatch() for batched qual evaluation
Introduce batched qual evaluation for SeqScan when quals are simple
AND-trees of Var op Const, Var op Var, or NullTest expressions.
The batch is evaluated using a bitmask, avoiding per-tuple ExecQual()
overhead.
Only leakproof operators are eligible for batching, since batching
changes evaluation order, which could otherwise leak data through
side channels before security barrier quals filter rows.
Add supporting infrastructure: the EEOP_SCAN_FETCHSOME_BATCH step, which
deforms all tuples in a batch, and the ExprContext.scan_batch field.
The postgres_fdw regression test is updated to run a query with LIMIT
using single-row batches (executor_batch_rows = 1), since batching
evaluates quals over an entire batch before LIMIT is checked, which
changes the "Rows Removed by Filter" counts in EXPLAIN ANALYZE output.
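To make the bitmask idea described above concrete, here is a minimal, self-contained sketch in C. It is not code from the patch: DemoRow, DemoOp and the demo_* functions are hypothetical stand-ins for deformed slots, fmgr calls and the EEOP steps, and the sketch handles only a single int column compared against a constant. It shows the three phases: set one bit per row, let each AND-ed term clear the bits of failing rows (NULL fails, as for a strict operator in WHERE), then collect the survivors. The init routine in this sketch clears every bit at or above the batch's row count, which is a design choice of the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one deformed tuple: an int column and its null flag. */
typedef struct DemoRow
{
    int  value;
    bool isnull;
} DemoRow;

/* Hypothetical stand-in for a strict binary operator function. */
typedef bool (*DemoOp) (int lhs, int rhs);

static bool demo_gt(int lhs, int rhs) { return lhs > rhs; }
static bool demo_lt(int lhs, int rhs) { return lhs < rhs; }

/* Set bits [0, nrows) and clear every bit at or above nrows. */
static void
demo_mask_init(uint64_t *mask, int nwords, int nrows)
{
    assert(nrows >= 0 && nrows <= nwords * 64);
    memset(mask, 0, sizeof(uint64_t) * nwords);
    for (int i = 0; i < nrows / 64; i++)
        mask[i] = ~UINT64_C(0);
    if (nrows % 64 != 0)
        mask[nrows / 64] = ~UINT64_C(0) >> (64 - nrows % 64);
}

/* Evaluate "value op constval" for every row; clear the bit of each failing row. */
static void
demo_term(uint64_t *mask, const DemoRow *rows, int nrows, DemoOp op, int constval)
{
    for (int i = 0; i < nrows; i++)
    {
        /* strict semantics: a NULL input never passes */
        bool pass = !rows[i].isnull && op(rows[i].value, constval);

        if (!pass)
            mask[i >> 6] &= ~(UINT64_C(1) << (i & 63));
    }
}

/* Store the indexes of surviving rows in out[]; return how many survived. */
static int
demo_collect(const uint64_t *mask, int nrows, int *out)
{
    int kept = 0;

    for (int i = 0; i < nrows; i++)
    {
        if (mask[i >> 6] & (UINT64_C(1) << (i & 63)))
            out[kept++] = i;
    }
    return kept;
}
```

For example, running demo_term() twice over the same mask, once with demo_gt and 5 and once with demo_lt and 10, evaluates "value > 5 AND value < 10" in two passes over the batch, with each pass touching only one operator.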
---
.../postgres_fdw/expected/postgres_fdw.out | 1 +
contrib/postgres_fdw/sql/postgres_fdw.sql | 1 +
src/backend/executor/execExpr.c | 335 ++++++++++++++++++
src/backend/executor/execExprInterp.c | 224 ++++++++++++
src/backend/executor/execTuples.c | 32 ++
src/backend/executor/nodeSeqscan.c | 28 +-
src/backend/jit/llvm/llvmjit_expr.c | 35 ++
src/backend/jit/llvm/llvmjit_types.c | 3 +
src/include/executor/execExpr.h | 84 ++++-
src/include/executor/execScan.h | 46 +++
src/include/executor/executor.h | 3 +
src/include/executor/tuptable.h | 2 +
src/include/nodes/execnodes.h | 11 +-
13 files changed, 802 insertions(+), 3 deletions(-)
diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 6066510c7c0..67df4233235 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -12208,6 +12208,7 @@ SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
Filter: (t1_3.b === 505)
(14 rows)
+SET executor_batch_rows = 1;
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF, BUFFERS OFF)
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
QUERY PLAN
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 4f7ab2ed0ac..daffc545a5c 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -4126,6 +4126,7 @@ SELECT * FROM local_tbl t1 LEFT JOIN (SELECT *, (SELECT count(*) FROM async_pt W
EXPLAIN (VERBOSE, COSTS OFF)
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
+SET executor_batch_rows = 1;
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF, BUFFERS OFF)
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
SELECT * FROM async_pt t1 WHERE t1.b === 505 LIMIT 1;
diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 088eca24021..cc76b760ee7 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -104,6 +104,16 @@ static void ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
bool exists_coerce,
Datum *resv, bool *resnull);
+/* private context for the walker */
+typedef struct QualBatchContext
+{
+ List *leaves; /* List<Node*> of accepted leaves */
+ Bitmapset *attnos; /* Vars referenced by accepted leaves */
+ bool ok; /* stays true if batchable */
+ AttrNumber last_scan; /* last needed attribute in scan slot */
+} QualBatchContext;
+
+static bool qual_batchable_walker(Node *node, void *context);
/*
* ExecInitExpr: prepare an expression tree for execution
@@ -5064,3 +5074,328 @@ ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
DomainHasConstraints(returning->typid);
ExprEvalPushStep(state, &scratch);
}
+
+/*
+ * Extract Var attno from expression, unwrapping RelabelType/TargetEntry.
+ * Returns attno > 0 on success, 0 on failure (not a Var, or system column).
+ */
+static AttrNumber
+extract_var_attno(Expr *expr)
+{
+ if (expr == NULL)
+ return 0;
+ if (IsA(expr, TargetEntry))
+ return extract_var_attno(((TargetEntry *) expr)->expr);
+ if (IsA(expr, RelabelType))
+ return extract_var_attno((Expr *) ((RelabelType *) expr)->arg);
+ if (IsA(expr, Var) && ((Var *) expr)->varattno > 0)
+ return ((Var *) expr)->varattno;
+ return 0;
+}
+
+/*
+ * qual_batchable_walker
+ * Check if a qual tree is eligible for batched evaluation.
+ *
+ * Walks the qual tree and validates that it consists only of:
+ * - AND expressions (OR/NOT disqualify)
+ * - NullTest on simple Vars
+ * - Binary OpExpr with Var op Const or Var op Var arguments
+ *
+ * For OpExpr, the operator must be:
+ * - Strict: ensures NULL inputs produce NULL/false, matching WHERE semantics
+ * - Leakproof: required because batching evaluates all rows before filtering,
+ * which could leak data to non-leakproof operators before security barrier
+ * quals have a chance to filter rows
+ *
+ * On success, populates cxt->leaves with the leaf nodes and cxt->attnos with
+ * the referenced attribute numbers. Sets cxt->ok = false if any node fails
+ * validation.
+ */
+static bool
+qual_batchable_walker(Node *node, void *context)
+{
+ QualBatchContext *cxt = (QualBatchContext *) context;
+
+ if (node == NULL || !cxt->ok)
+ return false;
+
+ switch (nodeTag(node))
+ {
+ case T_List:
+ return expression_tree_walker(node, qual_batchable_walker, cxt);
+
+ case T_BoolExpr:
+ {
+ BoolExpr *b = (BoolExpr *) node;
+
+ /* Only AND trees are allowed */
+ if (b->boolop != AND_EXPR)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ /* Recurse normally over children */
+ return expression_tree_walker(node, qual_batchable_walker, cxt);
+ }
+
+ case T_NullTest:
+ {
+ NullTest *nt = (NullTest *) node;
+ AttrNumber attno = extract_var_attno(nt->arg);
+
+ if (attno == 0)
+ {
+ cxt->ok = false;
+ return true;
+ }
+
+ cxt->attnos = bms_add_member(cxt->attnos, attno);
+ if (attno > cxt->last_scan)
+ cxt->last_scan = attno;
+ cxt->leaves = lappend(cxt->leaves, node);
+
+ /* Do NOT recurse into leaf */
+ return false;
+ }
+
+ case T_OpExpr:
+ {
+ OpExpr *op = (OpExpr *) node;
+ List *args = op->args;
+ AttrNumber lattno,
+ rattno;
+
+ /* Only binary operators */
+ if (list_length(args) != 2)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ /* Must be strict (NULL input -> NULL/false result) */
+ if (!func_strict(op->opfuncid))
+ {
+ cxt->ok = false;
+ return true;
+ }
+ /*
+ * Must be leakproof. Batching changes evaluation order, which
+ * could leak data through side channels before security barrier
+ * quals filter rows.
+ */
+ if (!get_func_leakproof(op->opfuncid))
+ {
+ cxt->ok = false;
+ return true;
+ }
+
+ /* Left arg must be a Var */
+ lattno = extract_var_attno(linitial(op->args));
+ if (lattno == 0)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ cxt->attnos = bms_add_member(cxt->attnos, lattno);
+ if (lattno > cxt->last_scan)
+ cxt->last_scan = lattno;
+
+ /* Right arg must be Const or Var */
+ if (!IsA(lsecond(op->args), Const))
+ {
+ rattno = extract_var_attno(lsecond(op->args));
+ if (rattno == 0)
+ {
+ cxt->ok = false;
+ return true;
+ }
+ cxt->attnos = bms_add_member(cxt->attnos, rattno);
+ if (rattno > cxt->last_scan)
+ cxt->last_scan = rattno;
+ }
+
+ cxt->leaves = lappend(cxt->leaves, node);
+
+ return false; /* leaf; don't recurse */
+ }
+
+ /* Unhandled node type; fall back to per-tuple evaluation */
+ default:
+ cxt->ok = false;
+ break;
+ }
+
+ return true;
+}
+
+/* build a BatchQualTerm from a validated leaf */
+static BatchQualTerm *
+build_term_from_leaf(Node *n)
+{
+ BatchQualTerm *term;
+ BatchQualTermKind kind;
+ bool strict;
+ AttrNumber l_attno;
+ AttrNumber r_attno;
+ Datum r_const = (Datum) 0;
+ bool r_isnull = false;
+ FmgrInfo *finfo = NULL;
+ Oid collation;
+
+ if (IsA(n, NullTest))
+ {
+ NullTest *nt = (NullTest *) n;
+
+ kind = nt->nulltesttype == IS_NULL ? BQTK_IS_NULL : BQTK_IS_NOT_NULL;
+ l_attno = extract_var_attno(nt->arg);
+ r_attno = 0;
+ strict = false;
+ collation = InvalidOid;
+
+ if (l_attno == 0)
+ return NULL;
+ }
+ else if (IsA(n, OpExpr))
+ {
+ OpExpr *op = (OpExpr *) n;
+ Expr *l = linitial(op->args);
+ Expr *r = lsecond(op->args);
+
+ l_attno = extract_var_attno(l);
+ if (l_attno == 0)
+ return NULL;
+
+ if (IsA(r, Const))
+ {
+ Const *c = (Const *) r;
+
+ kind = BQTK_VAR_CONST;
+ r_const = c->constvalue;
+ r_isnull = c->constisnull;
+ r_attno = 0;
+ }
+ else
+ {
+ r_attno = extract_var_attno(r);
+ if (r_attno == 0)
+ return NULL;
+ kind = BQTK_VAR_VAR;
+ }
+
+ strict = func_strict(op->opfuncid);
+ collation = exprInputCollation((Node *) op);
+ finfo = palloc(sizeof(FmgrInfo));
+ fmgr_info(op->opfuncid, finfo);
+ }
+ else
+ return NULL;
+
+ term = palloc(sizeof(BatchQualTerm));
+ term->kind = kind;
+ term->strict = strict;
+ term->l_attno = l_attno;
+ term->r_attno = r_attno;
+ term->r_const = r_const;
+ term->r_isnull = r_isnull;
+ term->finfo = finfo;
+ term->collation = collation;
+
+ return term;
+}
+
+/*
+ * ExecInitQualBatch
+ * Build a batched-qual ExprState for evaluating scan quals over a TupleBatch.
+ *
+ * Returns a dedicated ExprState that evaluates the plan's quals in batch mode,
+ * or NULL if the quals are not eligible for batching. The caller should retain
+ * the regular ps->qual for fallback when batching is not used.
+ *
+ * Batching is only possible when the qual tree consists of:
+ * - Top-level AND of simple clauses (no OR, NOT)
+ * - NullTest on a simple Var
+ * - Binary OpExpr with (Var op Const) or (Var op Var), where the operator
+ * is both strict (for proper NULL handling) and leakproof (to avoid
+ * leaking data when evaluation order changes vs. security barrier quals)
+ *
+ * The generated EEOP program:
+ * 1. EEOP_SCAN_FETCHSOME_BATCH - deforms all slots in the batch
+ * 2. EEOP_QUAL_BATCH_INITMASK - initializes bitmask to all-pass
+ * 3. EEOP_QUAL_BATCH_TERM (per leaf) - evaluates term, clears failing bits
+ *
+ * The result bitmask is stored in BatchQualRuntime (via ExprState.batch_private)
+ * for the caller to use when populating output slots.
+ */
+ExprState *
+ExecInitQualBatch(PlanState *ps)
+{
+ Node *qual = (Node *) ps->plan->qual;
+ QualBatchContext cxt = {NIL, NULL, true, 0};
+ BatchQualRuntime *rt;
+ ExprState *state;
+ int maxrows = executor_batch_rows;
+ uint64 *mask;
+ int mask_words;
+ ListCell *lc;
+ ExprEvalStep scratch = {0};
+
+ if (qual == NULL)
+ return NULL;
+
+ /*
+ * Check if qual tree is batchable; collect leaf nodes and referenced
+ * attnos.
+ */
+ (void) qual_batchable_walker(qual, &cxt);
+ if (!cxt.ok || cxt.leaves == NIL || bms_is_empty(cxt.attnos))
+ return NULL;
+
+ /* Allocate bitmask: one bit per row, rounded up to 64-bit words */
+ mask_words = (maxrows + 63) >> 6;
+ mask = (uint64 *) palloc0(sizeof(uint64) * mask_words);
+
+ /* Bundle runtime state; attached to ExprState for access during execution */
+ rt = palloc0(sizeof(BatchQualRuntime));
+ rt->mask = mask;
+ rt->mask_words = mask_words;
+
+ /* Create ExprState for the batched program */
+ state = makeNode(ExprState);
+ state->expr = (Expr *) qual;
+ state->parent = ps;
+ state->ext_params = NULL;
+ state->flags = EEO_FLAG_IS_QUAL;
+ state->batch_private = (void *) rt;
+
+ /* Step 1: deform all slots in batch up to highest referenced attribute */
+ scratch.opcode = EEOP_SCAN_FETCHSOME_BATCH;
+ scratch.d.fetch_batch.last_var = cxt.last_scan;
+ ExprEvalPushStep(state, &scratch);
+
+ /* Step 2: initialize mask to all-ones (all rows pass initially) */
+ scratch.opcode = EEOP_QUAL_BATCH_INITMASK;
+ scratch.d.qualbatch_init.mask = mask;
+ scratch.d.qualbatch_init.mask_words = mask_words;
+ ExprEvalPushStep(state, &scratch);
+
+ /* Step 3: one TERM per qual leaf; each clears mask bits for failing rows */
+ foreach(lc, cxt.leaves)
+ {
+ BatchQualTerm *term = build_term_from_leaf((Node *) lfirst(lc));
+
+ if (term == NULL)
+ return NULL;
+
+ scratch.opcode = EEOP_QUAL_BATCH_TERM;
+ scratch.d.qualbatch_term.term = term; /* pointer to palloc'd term */
+ ExprEvalPushStep(state, &scratch);
+ }
+
+ /* Done; mask now indicates which rows survived all quals */
+ scratch.opcode = EEOP_DONE_NO_RETURN;
+ ExprEvalPushStep(state, &scratch);
+
+ ExecReadyExpr(state);
+
+ return state;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index a7a5ac1e83b..304c7f4e0fb 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -59,6 +59,7 @@
#include "access/heaptoast.h"
#include "catalog/pg_type.h"
#include "commands/sequence.h"
+#include "executor/execBatch.h"
#include "executor/execExpr.h"
#include "executor/nodeSubplan.h"
#include "funcapi.h"
@@ -466,6 +467,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
TupleTableSlot *scanslot;
TupleTableSlot *oldslot;
TupleTableSlot *newslot;
+ TupleBatch *scanbatch;
/*
* This array has to be in the same order as enum ExprEvalOp.
@@ -592,6 +594,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
+ &&CASE_EEOP_SCAN_FETCHSOME_BATCH,
+ &&CASE_EEOP_QUAL_BATCH_INITMASK,
+ &&CASE_EEOP_QUAL_BATCH_TERM,
&&CASE_EEOP_LAST
};
@@ -612,6 +617,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
scanslot = econtext->ecxt_scantuple;
oldslot = econtext->ecxt_oldtuple;
newslot = econtext->ecxt_newtuple;
+ scanbatch = econtext->scan_batch;
#if defined(EEO_USE_COMPUTED_GOTO)
EEO_DISPATCH();
@@ -2265,6 +2271,28 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
+ EEO_CASE(EEOP_SCAN_FETCHSOME_BATCH)
+ {
+ CheckOpSlotCompatibility(op, scanslot);
+
+ Assert(scanbatch);
+ slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
+
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_QUAL_BATCH_INITMASK)
+ {
+ ExecQualBatchInitMask(state, op, econtext);
+ EEO_NEXT();
+ }
+
+ EEO_CASE(EEOP_QUAL_BATCH_TERM)
+ {
+ ExecQualBatchTerm(state, op, econtext);
+ EEO_NEXT();
+ }
+
EEO_CASE(EEOP_LAST)
{
/* unreachable */
@@ -5914,3 +5942,199 @@ ExecAggPlainTransByRef(AggState *aggstate, AggStatePerTrans pertrans,
MemoryContextSwitchTo(oldContext);
}
+
+/* set mask bits [0..nvalid_bits) to 1; clear all bits at or above nvalid_bits */
+static inline void
+mask_init_all_ones(uint64 *a, int nwords, int nvalid_bits)
+{
+ int full = nvalid_bits >> 6;
+ int rem = nvalid_bits & 63;
+
+ for (int i = 0; i < nwords; i++)
+ a[i] = (i < full) ? ~UINT64CONST(0) : UINT64CONST(0);
+
+ /* partial word, if any */
+ if (rem != 0)
+ a[full] = (~UINT64CONST(0)) >> (64 - rem);
+}
+
+static inline void
+mask_clear_bit(uint64 *a, int i)
+{
+ a[i >> 6] &= ~(UINT64CONST(1) << (i & 63));
+}
+
+static inline bool
+mask_is_empty(const uint64 *mask, int nwords)
+{
+ for (int i = 0; i < nwords; i++)
+ {
+ if (mask[i] != 0)
+ return false;
+ }
+ return true;
+}
+
+void
+ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ TupleBatch *b = econtext->scan_batch;
+ uint64 *mask = op->d.qualbatch_init.mask;
+ int nwords = op->d.qualbatch_init.mask_words;
+ int n = b->ntuples;
+
+ /* initialize to all-pass for current batch size */
+ mask_init_all_ones(mask, nwords, n);
+}
+
+void
+ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+ BatchQualRuntime *rt = ExecGetBatchQualRuntime(state);
+ TupleBatch *b = econtext->scan_batch;
+ TupleTableSlot **slots = b->activeslots;
+ uint64 *mask = rt->mask;
+ int mask_words = rt->mask_words;
+ BatchQualTerm *t = op->d.qualbatch_term.term;
+ int n = b->ntuples;
+
+ /* Early exit if no rows remain */
+ if (mask_is_empty(mask, mask_words))
+ return;
+
+ switch (t->kind)
+ {
+ case BQTK_IS_NULL:
+ {
+ /* keep bit set only if value IS NULL; clear otherwise */
+ for (int i = 0; i < n; i++)
+ {
+ if (!slots[i]->tts_isnull[t->l_attno-1])
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_IS_NOT_NULL:
+ {
+ /* keep bit set only if value IS NOT NULL; clear if NULL */
+ for (int i = 0; i < n; i++)
+ {
+ if (slots[i]->tts_isnull[t->l_attno-1])
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_VAR_CONST:
+ {
+ const bool r_isnull = t->r_isnull;
+ const Datum r_const = t->r_const;
+ const bool strict = t->strict;
+ const Oid coll = t->collation;
+ FmgrInfo *finfo = t->finfo;
+
+ for (int i = 0; i < n; i++)
+ {
+ bool ln = slots[i]->tts_isnull[t->l_attno-1];
+ bool pass;
+
+ /* WHERE treats NULL as false; strict ops short-circuit */
+ if (strict && (ln || r_isnull))
+ pass = false;
+ else
+ {
+ Datum lv = slots[i]->tts_values[t->l_attno-1];
+
+ pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, r_const));
+ }
+
+ if (!pass)
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ case BQTK_VAR_VAR:
+ {
+ const bool strict = t->strict;
+ const Oid coll = t->collation;
+ FmgrInfo *finfo = t->finfo;
+
+ for (int i = 0; i < n; i++)
+ {
+ bool ln = slots[i]->tts_isnull[t->l_attno-1];
+ bool rn = slots[i]->tts_isnull[t->r_attno-1];
+ bool pass;
+
+ if (strict && (ln || rn))
+ pass = false;
+ else
+ {
+ Datum lv = slots[i]->tts_values[t->l_attno-1];
+ Datum rv = slots[i]->tts_values[t->r_attno-1];
+
+ pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, rv));
+ }
+
+ if (!pass)
+ mask_clear_bit(mask, i);
+ }
+ break;
+ }
+
+ default:
+ /* should not happen; leave mask unchanged */
+ break;
+ }
+}
+
+/*
+ * ExecQualBatch
+ * Evaluate a batched qual over all rows in a TupleBatch.
+ *
+ * Runs the EEOP program built by ExecInitQualBatch, which produces a bitmask
+ * indicating which rows pass the qual. Rows that pass are copied to the
+ * batch's output slots (b->outslots).
+ *
+ * Returns the number of qualifying rows. The caller should then call
+ * TupleBatchUseOutput(b, qualified) to switch the batch to return from
+ * outslots.
+ *
+ * The batch must be materialized (slots populated) before calling this.
+ */
+int
+ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b)
+{
+ int i;
+ uint64 *mask;
+ int kept = 0;
+ BatchQualRuntime *rt = ExecGetBatchQualRuntime(state);
+
+ /* verify that expression was compiled using ExecInitQualBatch */
+ Assert(state->flags & EEO_FLAG_IS_QUAL);
+ Assert(rt && rt->mask && rt->mask_words);
+
+ /* run the batched EEOP program once */
+ econtext->scan_batch = b;
+ ExecEvalExprNoReturn(state, econtext);
+
+ mask = rt->mask;
+ if (mask_is_empty(mask, rt->mask_words))
+ return 0;
+
+ /* Add survivors into outslots */
+ TupleBatchRewind(b);
+ i = 0;
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ /* mask bit set => row survives */
+ if (mask[i >> 6] & (UINT64CONST(1) << (i & 63)))
+ TupleBatchStoreInOut(b, kept++, slot);
+ i++;
+ }
+
+ return kept;
+}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index b768eae9e53..5082d8ecd3b 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -2111,6 +2111,38 @@ slot_getsomeattrs_int(TupleTableSlot *slot, int attnum)
}
}
+void
+slot_getsomeattrs_batch(struct TupleBatch *b, int attnum)
+{
+ while (TupleBatchHasMore(b))
+ {
+ TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+ /* Check for caller errors */
+ Assert(attnum > 0);
+
+ if (unlikely(attnum > slot->tts_tupleDescriptor->natts))
+ elog(ERROR, "invalid attribute number %d", attnum);
+
+ /* XXX - there should perhaps also be a batch-level att_nvalid */
+ if (attnum <= slot->tts_nvalid)
+ continue;
+
+ /* Fetch as many attributes as possible from the underlying tuple. */
+ slot->tts_ops->getsomeattrs(slot, attnum);
+
+ /*
+ * If the underlying tuple doesn't have enough attributes, tuple
+ * descriptor must have the missing attributes.
+ */
+ if (unlikely(slot->tts_nvalid < attnum))
+ {
+ slot_getmissingattrs(slot, slot->tts_nvalid, attnum);
+ slot->tts_nvalid = attnum;
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecTypeFromTL
*
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index f36b31d4fbb..16f15ed68aa 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -281,6 +281,28 @@ ExecSeqScanBatchSlot(PlanState *pstate)
NULL, NULL);
}
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithBatchQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+ TupleBatch *b = pstate->ps_Batch;
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in ExecScanExtendedBatch().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual_batch != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ if (!TupleBatchHasMore(b))
+ b = ExecScanExtendedBatch(&node->ss,
+ (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+ pstate->qual_batch, NULL);
+
+ return b ? TupleBatchGetNextSlot(b) : NULL;
+}
+
static TupleTableSlot *
ExecSeqScanBatchSlotWithQual(PlanState *pstate)
{
@@ -344,6 +366,7 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
bool track_stats = estate->es_instrument && (estate->es_instrument & INSTRUMENT_BATCHES);
scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap, track_stats);
+ scanstate->ss.ps.qual_batch = ExecInitQualBatch((PlanState *) scanstate);
/* Choose batch variant to preserve your specialization matrix */
if (scanstate->ss.ps.qual == NULL)
@@ -361,7 +384,10 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
{
if (scanstate->ss.ps.ps_ProjInfo == NULL)
{
- scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ if (scanstate->ss.ps.qual_batch == NULL)
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ else
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithBatchQual;
}
else
{
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 650f1d42a93..847f265df3b 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -109,6 +109,9 @@ llvm_compile_expr(ExprState *state)
LLVMValueRef v_newslot;
LLVMValueRef v_resultslot;
+ /* batches */
+ LLVMValueRef v_scanbatch;
+
/* nulls/values of slots */
LLVMValueRef v_innervalues;
LLVMValueRef v_innernulls;
@@ -221,6 +224,11 @@ llvm_compile_expr(ExprState *state)
v_state,
FIELDNO_EXPRSTATE_RESULTSLOT,
"v_resultslot");
+ v_scanbatch = l_load_struct_gep(b,
+ StructExprContext,
+ v_econtext,
+ FIELDNO_EXPRCONTEXT_SCANBATCH,
+ "v_scanbatch");
/* build global values/isnull pointers */
v_scanvalues = l_load_struct_gep(b,
@@ -2940,6 +2948,33 @@ llvm_compile_expr(ExprState *state)
LLVMBuildBr(b, opblocks[opno + 1]);
break;
+ case EEOP_SCAN_FETCHSOME_BATCH:
+ {
+ LLVMValueRef params[2];
+
+ params[0] = v_scanbatch;
+ params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+ l_call(b,
+ llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+ llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+ params, lengthof(params), "");
+
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ }
+
+ case EEOP_QUAL_BATCH_INITMASK:
+ build_EvalXFunc(b, mod, "ExecQualBatchInitMask",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+ case EEOP_QUAL_BATCH_TERM:
+ build_EvalXFunc(b, mod, "ExecQualBatchTerm",
+ v_state, op, v_econtext);
+ LLVMBuildBr(b, opblocks[opno + 1]);
+ break;
+
case EEOP_LAST:
Assert(false);
break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 4636b90cd0f..5ba9920f3fd 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -179,7 +179,10 @@ void *referenced_functions[] =
MakeExpandedObjectReadOnlyInternal,
slot_getmissingattrs,
slot_getsomeattrs_int,
+ slot_getsomeattrs_batch,
strlen,
varsize_any,
ExecInterpExprStillValid,
+ ExecQualBatchInitMask,
+ ExecQualBatchTerm,
};
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index aa9b361fa31..2672d2674cc 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -292,11 +292,29 @@ typedef enum ExprEvalOp
EEOP_AGG_ORDERED_TRANS_DATUM,
EEOP_AGG_ORDERED_TRANS_TUPLE,
+ /*
+ * Batched qual evaluation opcodes
+ *
+ * These opcodes implement batch-mode qual evaluation where an entire
+ * TupleBatch is processed at once rather than tuple-by-tuple.
+ *
+ * EEOP_SCAN_FETCHSOME_BATCH: Call slot_getsomeattrs() on all slots in
+ * the batch to ensure needed attributes are deformed.
+ *
+ * EEOP_QUAL_BATCH_INITMASK: Initialize the result bitmask to all-ones
+ * (all rows initially pass).
+ *
+ * EEOP_QUAL_BATCH_TERM: Evaluate one qual leaf (NullTest or OpExpr) over
+ * all rows, clearing mask bits for rows that fail.
+ */
+ EEOP_SCAN_FETCHSOME_BATCH,
+ EEOP_QUAL_BATCH_INITMASK,
+ EEOP_QUAL_BATCH_TERM,
+
/* non-existent operation, used e.g. to check array lengths */
EEOP_LAST
} ExprEvalOp;
-
typedef struct ExprEvalStep
{
/*
@@ -331,6 +349,12 @@ typedef struct ExprEvalStep
const TupleTableSlotOps *kind;
} fetch;
+ struct
+ {
+ /* attribute number up to which to fetch (inclusive) */
+ int last_var;
+ } fetch_batch;
+
/* for EEOP_INNER/OUTER/SCAN/OLD/NEW_[SYS]VAR */
struct
{
@@ -769,6 +793,17 @@ typedef struct ExprEvalStep
void *json_coercion_cache;
ErrorSaveContext *escontext;
} jsonexpr_coercion;
+
+ struct
+ {
+ uint64 *mask; /* shared mask buffer for this program */
+ int mask_words; /* ceil(executor_batch_rows / 64) */
+ } qualbatch_init; /* EEOP_QUAL_BATCH_INITMASK */
+
+ struct
+ {
+ struct BatchQualTerm *term; /* compiled leaf */
+ } qualbatch_term; /* EEOP_QUAL_BATCH_TERM */
} d;
} ExprEvalStep;
@@ -917,4 +952,51 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
ExprContext *econtext);
+/* See ExecQualBatchTerm(). */
+typedef enum BatchQualTermKind
+{
+ BQTK_VAR_CONST,
+ BQTK_VAR_VAR,
+ BQTK_IS_NULL,
+ BQTK_IS_NOT_NULL,
+} BatchQualTermKind;
+
+typedef struct BatchQualTerm
+{
+ BatchQualTermKind kind;
+ bool strict; /* follow strict NULL semantics if true */
+ AttrNumber l_attno; /* left VAR column */
+ AttrNumber r_attno; /* right VAR column, or 0 if none (Const or NullTest) */
+ Datum r_const; /* for VAR_CONST */
+ bool r_isnull; /* for VAR_CONST */
+ FmgrInfo *finfo; /* fmgr for generic binary ops */
+ Oid collation; /* op collation */
+} BatchQualTerm;
+
+/*
+ * BatchQualRuntime - execution state for batched qual evaluation
+ *
+ * Attached to ExprState.batch_private for the batched qual program.
+ * Contains the bitmask that tracks which rows pass the qual (bit set = pass);
+ * EEOP_QUAL_BATCH_TERM steps reach it through ExecGetBatchQualRuntime().
+ *
+ * The mask uses standard bit operations: word = i/64, bit = i%64.
+ * Initialized to all-ones by EEOP_QUAL_BATCH_INITMASK, then each
+ * EEOP_QUAL_BATCH_TERM clears bits for failing rows.
+ */
+typedef struct BatchQualRuntime
+{
+ uint64 *mask;
+ int mask_words;
+} BatchQualRuntime;
+
+static inline BatchQualRuntime *
+ExecGetBatchQualRuntime(ExprState *batch_qual)
+{
+ return (BatchQualRuntime *) batch_qual->batch_private;
+}
+
+extern void ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
#endif /* EXEC_EXPR_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index d9185331e22..008780ea230 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -320,4 +320,50 @@ ExecScanExtendedBatchSlot(ScanState *node,
}
}
+/*
+ * ExecScanExtendedBatch
+ * Batch-driven scan with batched qual evaluation.
+ *
+ * Unlike ExecScanExtendedBatchSlot which evaluates quals tuple-at-a-time,
+ * this function uses ExecQualBatch() to evaluate the entire batch at once
+ * using a bitmask. Qualifying tuples are collected into b->outslots.
+ *
+ * Returns the TupleBatch with nvalid set to the number of qualifying rows,
+ * or NULL at end-of-scan. Caller iterates b->outslots[0..nvalid-1].
+ *
+ * Note: EPQ is not supported; projection is not yet implemented.
+ */
+static inline TupleBatch *
+ExecScanExtendedBatch(ScanState *node,
+ ExecScanAccessBatchMtd accessBatchMtd,
+ ExprState *qual_batch, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ps.ps_ExprContext;
+ TupleBatch *b = node->ps.ps_Batch;
+ int qualified;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ps.state->es_epq_active == NULL);
+ Assert(TupleBatchIsValid(b));
+
+ for (;;)
+ {
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next batch from the AM */
+ if (!accessBatchMtd(node))
+ return NULL;
+
+ ResetExprContext(econtext);
+ qualified = ExecQualBatch(qual_batch, econtext, b);
+ InstrCountFiltered1(node, b->nvalid - qualified);
+ /* Update count and start using b->outslots. */
+ TupleBatchUseOutput(b, qualified);
+
+ if (qualified > 0)
+ return b;
+ /* else get the next batch from the AM */
+ }
+}
+
#endif /* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index e82fd6c0c8a..8cded15dec6 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -326,6 +326,7 @@ ExecProcNode(PlanState *node)
extern ExprState *ExecInitExpr(Expr *node, PlanState *parent);
extern ExprState *ExecInitExprWithParams(Expr *node, ParamListInfo ext_params);
extern ExprState *ExecInitQual(List *qual, PlanState *parent);
+extern ExprState *ExecInitQualBatch(PlanState *ps);
extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
extern List *ExecInitExprList(List *nodes, PlanState *parent);
extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
@@ -553,6 +554,8 @@ ExecQualAndReset(ExprState *state, ExprContext *econtext)
}
#endif
+extern int ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b);
+
extern bool ExecCheck(ExprState *state, ExprContext *econtext);
/*
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index a2dfd707e78..b06be83b141 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -346,6 +346,8 @@ extern Datum ExecFetchSlotHeapTupleDatum(TupleTableSlot *slot);
extern void slot_getmissingattrs(TupleTableSlot *slot, int startAttNum,
int lastAttNum);
extern void slot_getsomeattrs_int(TupleTableSlot *slot, int attnum);
+struct TupleBatch;
+extern void slot_getsomeattrs_batch(struct TupleBatch *b, int attnum);
#ifndef FRONTEND
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 6a191202ced..c79ee965372 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -148,6 +148,9 @@ typedef struct ExprState
* ExecInitExprRec().
*/
ErrorSaveContext *escontext;
+
+ /* batched-program runtime (e.g., BatchQualRuntime) */
+ void *batch_private;
} ExprState;
@@ -314,6 +317,10 @@ typedef struct ExprContext
#define FIELDNO_EXPRCONTEXT_NEWTUPLE 15
TupleTableSlot *ecxt_newtuple;
+ /* For batched evaluation using batch-aware EEOPs */
+#define FIELDNO_EXPRCONTEXT_SCANBATCH 16
+ TupleBatch *scan_batch;
+
/* Link to containing EState (NULL if a standalone ExprContext) */
struct EState *ecxt_estate;
@@ -1186,7 +1193,9 @@ typedef struct PlanState
* state trees parallel links in the associated plan tree (except for the
* subPlan list, which does not exist in the plan tree).
*/
- ExprState *qual; /* boolean qual condition */
+ ExprState *qual; /* boolean qual condition (per tuple) */
+ ExprState *qual_batch; /* batched qual program, NULL if qual not
+ * batchable */
PlanState *lefttree; /* input plan tree(s) */
PlanState *righttree;
--
2.47.3
[application/octet-stream] v5-0005-WIP-Use-dedicated-interpreter-for-batched-qual-ev.patch (5.9K, 6-v5-0005-WIP-Use-dedicated-interpreter-for-batched-qual-ev.patch)
From 4916a0891b2e7176dee3c2a3a8018a4d174dd373 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 29 Jan 2026 05:03:55 +0900
Subject: [PATCH v5 5/5] WIP: Use dedicated interpreter for batched qual
evaluation
Move batch-related opcodes (EEOP_SCAN_FETCHSOME_BATCH,
EEOP_QUAL_BATCH_INITMASK, EEOP_QUAL_BATCH_TERM) out of the main
ExecInterpExpr switch and into a dedicated ExecInterpQualBatch
function.
Adding opcodes to ExecInterpExpr may affect performance even when
they are not executed, possibly due to changes in register allocation,
jump table layout, or code size. Use a separate interpreter to avoid
any risk of impacting the existing per-tuple evaluation path.
The batched qual program has a simple linear structure (fetch ->
initmask -> term* -> done) that doesn't need computed goto dispatch
anyway.
---
src/backend/executor/execExprInterp.c | 72 +++++++++++++++++----------
src/backend/executor/nodeSeqscan.c | 6 +--
2 files changed, 46 insertions(+), 32 deletions(-)
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 304c7f4e0fb..04a40ec932c 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -189,6 +189,8 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
int setno);
static char *ExecGetJsonValueItemString(JsonbValue *item, bool *resnull);
+static Datum ExecInterpQualBatch(ExprState *state, ExprContext *econtext);
+
/*
* ScalarArrayOpExprHashEntry
* Hash table entry type used during EEOP_HASHED_SCALARARRAYOP
@@ -266,6 +268,12 @@ ExecReadyInterpretedExpr(ExprState *state)
*/
state->evalfunc = ExecInterpExprStillValid;
+ if (state->batch_private)
+ {
+ state->evalfunc_private = (void *) ExecInterpQualBatch;
+ return;
+ }
+
/* DIRECT_THREADED should not already be set */
Assert((state->flags & EEO_FLAG_DIRECT_THREADED) == 0);
@@ -467,7 +475,6 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
TupleTableSlot *scanslot;
TupleTableSlot *oldslot;
TupleTableSlot *newslot;
- TupleBatch *scanbatch;
/*
* This array has to be in the same order as enum ExprEvalOp.
@@ -594,9 +601,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
&&CASE_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
- &&CASE_EEOP_SCAN_FETCHSOME_BATCH,
- &&CASE_EEOP_QUAL_BATCH_INITMASK,
- &&CASE_EEOP_QUAL_BATCH_TERM,
+ &&CASE_EEOP_BATCH_UNREACHABLE, /* EEOP_SCAN_FETCHSOME_BATCH */
+ &&CASE_EEOP_BATCH_UNREACHABLE, /* EEOP_QUAL_BATCH_INITMASK */
+ &&CASE_EEOP_BATCH_UNREACHABLE, /* EEOP_QUAL_BATCH_TERM */
&&CASE_EEOP_LAST
};
@@ -617,7 +624,6 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
scanslot = econtext->ecxt_scantuple;
oldslot = econtext->ecxt_oldtuple;
newslot = econtext->ecxt_newtuple;
- scanbatch = econtext->scan_batch;
#if defined(EEO_USE_COMPUTED_GOTO)
EEO_DISPATCH();
@@ -2271,34 +2277,18 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
EEO_NEXT();
}
- EEO_CASE(EEOP_SCAN_FETCHSOME_BATCH)
- {
- CheckOpSlotCompatibility(op, scanslot);
-
- Assert(scanbatch);
- slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
-
- EEO_NEXT();
- }
-
- EEO_CASE(EEOP_QUAL_BATCH_INITMASK)
- {
- ExecQualBatchInitMask(state, op, econtext);
- EEO_NEXT();
- }
-
- EEO_CASE(EEOP_QUAL_BATCH_TERM)
- {
- ExecQualBatchTerm(state, op, econtext);
- EEO_NEXT();
- }
-
EEO_CASE(EEOP_LAST)
{
/* unreachable */
Assert(false);
goto out_error;
}
+
+ EEO_CASE(EEOP_BATCH_UNREACHABLE)
+ {
+ Assert(false && "batch opcodes use dedicated interpreter");
+ pg_unreachable();
+ }
}
out_error:
@@ -6089,6 +6079,34 @@ ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
}
}
+static Datum
+ExecInterpQualBatch(ExprState *state, ExprContext *econtext)
+{
+ ExprEvalStep *op = state->steps;
+ TupleBatch *scanbatch = econtext->scan_batch;
+
+ /* Step 1: fetch/deform all slots */
+ Assert(ExecEvalStepOp(state, op) == EEOP_SCAN_FETCHSOME_BATCH);
+ slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
+ op++;
+
+ /* Step 2: initialize mask */
+ Assert(ExecEvalStepOp(state, op) == EEOP_QUAL_BATCH_INITMASK);
+ ExecQualBatchInitMask(state, op, econtext);
+ op++;
+
+ /* Step 3: process all TERM steps */
+ while (ExecEvalStepOp(state, op) == EEOP_QUAL_BATCH_TERM)
+ {
+ ExecQualBatchTerm(state, op, econtext);
+ op++;
+ }
+
+ Assert(ExecEvalStepOp(state, op) == EEOP_DONE_NO_RETURN);
+
+ return (Datum) 0;
+}
+
/*
* ExecQualBatch
* Evaluate a batched qual over all rows in a TupleBatch.
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 16f15ed68aa..4a76108bd2f 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -404,7 +404,6 @@ SeqScanState *
ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
{
SeqScanState *scanstate;
- bool use_batching;
/*
* Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -435,12 +434,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
node->scan.scanrelid,
eflags);
- use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
-
/* and create slot with the appropriate rowtype */
ExecInitScanTupleSlot(estate, &scanstate->ss,
RelationGetDescr(scanstate->ss.ss_currentRelation),
- use_batching ? &TTSOpsHeapTuple :
table_slot_callbacks(scanstate->ss.ss_currentRelation));
/*
@@ -477,7 +473,7 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
- if (use_batching)
+ if (ScanCanUseBatching(&scanstate->ss, eflags))
SeqScanInitBatching(scanstate, eflags);
return scanstate;
--
2.47.3
* Re: Batching in executor
@ 2026-01-29 10:04 Amit Langote <[email protected]>
parent: Amit Langote <[email protected]>
1 sibling, 0 replies; 29+ messages in thread
From: Amit Langote @ 2026-01-29 10:04 UTC (permalink / raw)
To: Daniil Davydov <[email protected]>; +Cc: cca5507 <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
On Thu, Jan 29, 2026 at 8:35 AM Amit Langote <[email protected]> wrote:
>
> Hi,
>
> Here is v5 of the patch series.
>
> Patches 0001-0003 add the core batching infrastructure. 0001 adds the
> batch table AM API with heapam implementation, 0002 wires up SeqScan
> to use it (still returning one slot at a time), and 0003 adds EXPLAIN
> (BATCHES). I'd love to hear people's thoughts around TupleBatch
> structure added in 0002. I thought about making it a separate patch so
> that 0002 will still populate the single ScanState.ss_scanTupleSlot,
> but that means we'd still have to call the TAM callback to populate
> the tuple in the TAM's batch struct into the slot, defeating the whole
> point. With TupleBatch, you have executor_batch_rows number of slots
> which are filled in one TAM callback (materialize_all) call. So I
> decided to keep the TupleBatch and related things in 0002.
>
> For scans without quals, batching shows 20-30% improvement with no
> visible regressions when batching is disabled (batch_rows=0):
>
> SELECT * FROM t LIMIT n (no qual)
>
> Rows Master batch=0 %diff batch=64 %diff
> ------ -------- ------- ----- -------- -----
> 1M 12.42 ms 11.96 ms 3.7% 8.56 ms 31.0%
> 3M 38.95 ms 38.92 ms 0.1% 28.59 ms 26.6%
> 10M 153.64 ms 150.28 ms 2.2% 112.95 ms 26.5%
>
> (%diff: positive = faster than master, negative = slower)
Oops, I meant SELECT * FROM t LIMIT 1 OFFSET n (no qual).
--
Thanks, Amit Langote
* Re: Batching in executor
@ 2026-02-01 14:49 Junwang Zhao <[email protected]>
parent: Amit Langote <[email protected]>
1 sibling, 1 reply; 29+ messages in thread
From: Junwang Zhao @ 2026-02-01 14:49 UTC (permalink / raw)
To: Amit Langote <[email protected]>; +Cc: Daniil Davydov <[email protected]>; cca5507 <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
Hi Amit,
On Thu, Jan 29, 2026 at 3:35 PM Amit Langote <[email protected]> wrote:
>
> Hi,
>
> Here is v5 of the patch series.
>
> Patches 0001-0003 add the core batching infrastructure. 0001 adds the
> batch table AM API with heapam implementation, 0002 wires up SeqScan
> to use it (still returning one slot at a time), and 0003 adds EXPLAIN
> (BATCHES). I'd love to hear people's thoughts around TupleBatch
> structure added in 0002. I thought about making it a separate patch so
> that 0002 will still populate the single ScanState.ss_scanTupleSlot,
> but that means we'd still have to call the TAM callback to populate
> the tuple in the TAM's batch struct into the slot, defeating the whole
> point. With TupleBatch, you have executor_batch_rows number of slots
> which are filled in one TAM callback (materialize_all) call. So I
> decided to keep the TupleBatch and related things in 0002.
>
> For scans without quals, batching shows 20-30% improvement with no
> visible regressions when batching is disabled (batch_rows=0):
>
> SELECT * FROM t LIMIT n (no qual)
>
> Rows Master batch=0 %diff batch=64 %diff
> ------ -------- ------- ----- -------- -----
> 1M 12.42 ms 11.96 ms 3.7% 8.56 ms 31.0%
> 3M 38.95 ms 38.92 ms 0.1% 28.59 ms 26.6%
> 10M 153.64 ms 150.28 ms 2.2% 112.95 ms 26.5%
>
> (%diff: positive = faster than master, negative = slower)
>
> Patches 0004-0005 add batched qual evaluation and are more
> experimental (see below on why 0005 exists). For quals referencing
> early columns, the improvement is significant:
>
> SELECT * FROM t WHERE a = 0 ... OFFSET n (qual on 1st col)
>
> Rows Master batch=64 %diff
> ------ -------- -------- -----
> 1M 30.19 ms 15.55 ms 48.5%
> 3M 92.47 ms 50.01 ms 45.9%
> 10M 325.58 ms 211.83 ms 34.9%
>
> However, for quals on later columns (e.g., 15th), batching provides no
> benefit - deformation dominates and batching doesn't help:
>
> SELECT * FROM t WHERE o = 0 ... OFFSET n (qual on 15th col)
>
> Rows Master batch=64 %diff
> ------ -------- -------- -----
> 1M 44.14 ms 44.56 ms -0.9%
> 3M 133.89 ms 137.77 ms -2.9%
> 10M 503.33 ms 528.88 ms -5.1%
>
> I don't have a satisfactory explanation for why batching doesn't help
> the deform-heavy case at all. One would expect at least some benefit
> from reduced per-tuple overhead, but that's not materializing.
>
> I've also been struggling to understand why 0004 affects the per-tuple
> path even when batch_rows=0. For quals with 0% selectivity (all rows
> fail the qual), perf shows ExecInterpExpr is noticeably hotter with
> the patched code compared to master, even though batching is disabled:
>
> SELECT * FROM t WHERE a = 0 ... OFFSET n (0% selectivity)
>
> Rows Master batch=0 %diff batch=64 %diff
> ------ -------- ------- ----- -------- -----
> 1M 24.37 ms 28.67 ms -17.6% 12.46 ms 48.9%
> 3M 73.95 ms 85.07 ms -15.0% 41.64 ms 43.7%
> 10M 287.63 ms 316.81 ms -10.1% 188.01 ms 34.6%
>
> Compare that to 100% selectivity (all rows pass), where there's no regression:
>
> SELECT * FROM t WHERE a > 0 ... OFFSET n (100% selectivity)
>
> Rows Master batch=0 %diff batch=64 %diff
> ------ -------- ------- ----- -------- -----
> 1M 29.44 ms 29.10 ms 1.2% 16.61 ms 43.6%
> 3M 91.22 ms 90.28 ms 1.0% 54.10 ms 40.7%
> 10M 360.77 ms 331.25 ms 8.2% 224.00 ms 37.9%
>
> I tried moving batch opcodes to a separate interpreter (0005) thinking
> it might be register pressure or jump table effects from adding cases
> to ExecInterpExpr's switch. With 0005, the generated assembly for
> ExecInterpExpr looks identical to master (same stack frame size, same
> epilogue), yet the performance still differs. Specifically, the ldp
> instruction in the function epilogue shows 53% hotness in patched vs
> 35% in master. We still need placeholder entries in the dispatch
> table, so it's unclear if this fully isolates the per-tuple path. I'll
> continue looking at perf, but I feel like at a bit of a loss here and
> would appreciate any insights.
>
> Other changes worth noting:
>
> - I removed the BatchVector intermediate representation that copied
> Datums into columnar arrays before qual evaluation (it used to be in
> the batched qual patch 0004). Now quals access batch slots' tts_values
> directly. This simplifies the code and the copy overhead wasn't paying
> off. If we pursue serious vectorization later, this may need to be
> revisited, but removing it doesn't degrade performance.
>
> --
> Thanks, Amit Langote
Here are some comments for v5:
0001:
+/*
+ * heap_scan_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+/*
+ * heap_scan_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
These function names are not consistent with comments.
0002:
+/*
+ * heap_scan_materialize_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
ditto.
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+ if (relation->rd_tableam)
+ return relation->rd_tableam->batch_callbacks(relation);
+ elog(ERROR, "relation does not support TupleBatch operations");
+}
Is there any chance that batch_callbacks can be NULL? If so, this
would segfault. I thought changing the check to
if (relation->rd_tableam && relation->rd_tableam->batch_callbacks)
would be more robust, but then I found that table_slot_callbacks()
follows the same pattern, so this shouldn't be a problem.
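For reference, a minimal standalone sketch of the more defensive lookup
considered above. The Demo* types and functions are hypothetical stand-ins,
not the real Relation or TableAmRoutine, and the sketch returns NULL where
the patch would raise an error via elog(ERROR), so it only illustrates the
shape of the check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real PostgreSQL structs. */
typedef struct DemoBatchOps { int dummy; } DemoBatchOps;
typedef struct DemoRelation DemoRelation;

typedef struct DemoTableAm
{
    /* Optional callback: an AM without batch support leaves this NULL. */
    const DemoBatchOps *(*batch_callbacks) (DemoRelation *rel);
} DemoTableAm;

struct DemoRelation
{
    const DemoTableAm *rd_tableam;  /* NULL if the relation has no table AM */
};

const DemoBatchOps demo_heap_batch_ops = {0};

const DemoBatchOps *
demo_heap_batch_callbacks(DemoRelation *rel)
{
    (void) rel;
    return &demo_heap_batch_ops;
}

/*
 * Check both the AM pointer and the callback pointer before calling
 * through. Returns NULL when batching is unavailable (the real code
 * would report an error instead).
 */
const DemoBatchOps *
demo_table_batch_callbacks(DemoRelation *rel)
{
    assert(rel != NULL);

    if (rel->rd_tableam != NULL && rel->rd_tableam->batch_callbacks != NULL)
        return rel->rd_tableam->batch_callbacks(rel);

    return NULL;
}
```

Whether the extra check is worth having depends on whether the callback is
meant to be mandatory for every table AM, which is what the existing
table_slot_callbacks() pattern assumes for its own callback.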
0003:
+++ b/src/include/executor/execBatch.h
@@ -13,6 +13,8 @@
#ifndef EXECBATCH_H
#define EXECBATCH_H
+#include <limits.h>
I guess the reason for including this header is because of the use
of INT_MAX, so maybe put that line into execBatch.c?
--
Regards
Junwang Zhao
* Re: Batching in executor
@ 2026-02-03 13:30 cca5507 <[email protected]>
parent: Junwang Zhao <[email protected]>
0 siblings, 1 reply; 29+ messages in thread
From: cca5507 @ 2026-02-03 13:30 UTC (permalink / raw)
To: Junwang Zhao <[email protected]>; Amit Langote <[email protected]>; +Cc: Daniil Davydov <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
Hi,
Some comments for v5:
0001
====
1) heap_begin_batch()
```
/* Single allocation for HeapBatch header + tupdata array */
alloc_size = sizeof(HeapBatch) + sizeof(HeapTupleData) * maxitems;
hb = palloc(alloc_size);
hb->tupdata = (HeapTupleData *) ((char *) hb + sizeof(HeapBatch));
```
Do we need a MAXALIGN() here to avoid unaligned access? Something like this:
```
/* Single allocation for HeapBatch header + tupdata array */
alloc_size = MAXALIGN(sizeof(HeapBatch)) + sizeof(HeapTupleData) * maxitems;
hb = palloc(alloc_size);
hb->tupdata = (HeapTupleData *) ((char *) hb + MAXALIGN(sizeof(HeapBatch)));
```
Or how about just using a flexible array member:
```
typedef struct HeapBatch
{
Buffer buf;
int maxitems;
int nitems;
HeapTupleData tupdata[FLEXIBLE_ARRAY_MEMBER];
} HeapBatch;
// and
hb = palloc(offsetof(HeapBatch, tupdata) + sizeof(HeapTupleData) * maxitems);
```
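To illustrate the flexible-array variant, here is a small standalone sketch.
DemoTuple and DemoBatch are hypothetical stand-ins for HeapTupleData and
HeapBatch, and malloc() stands in for palloc(). Because the array is a
declared member of the struct, the compiler inserts whatever padding its
element type needs, so no explicit MAXALIGN() arithmetic is required.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for HeapTupleData. */
typedef struct DemoTuple
{
    uint32_t    len;
    void       *data;
} DemoTuple;

/* Hypothetical stand-in for HeapBatch, using a C99 flexible array member. */
typedef struct DemoBatch
{
    int         buf;
    int         maxitems;
    int         nitems;
    DemoTuple   tupdata[];      /* storage follows the header in one chunk */
} DemoBatch;

/* Allocate header plus 'maxitems' array entries in a single allocation. */
DemoBatch *
demo_batch_create(int maxitems)
{
    size_t      size;
    DemoBatch  *hb;

    assert(maxitems > 0);

    size = offsetof(DemoBatch, tupdata) + sizeof(DemoTuple) * (size_t) maxitems;
    hb = malloc(size);
    if (hb == NULL)
        return NULL;

    hb->buf = 0;
    hb->maxitems = maxitems;
    hb->nitems = 0;
    return hb;
}
```

Here offsetof(DemoBatch, tupdata) is by construction a multiple of the
element type's alignment, which is the property the MAXALIGN() version
establishes by hand.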
2) pgstat_count_heap_getnext_batch()
```
#define pgstat_count_heap_getnext_batch(rel, n) \
do { \
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.tuples_returned += n; \
} while (0)
```
"+= n" -> "+= (n)", just like pgstat_count_index_tuples().
0002
====
1) TupleBatchCreate()
```
/* Single allocation for TupleBatch + inslots + outslots arrays */
alloc_size = sizeof(TupleBatch) + 2 * sizeof(TupleTableSlot *) * capacity;
b = palloc(alloc_size);
inslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch));
outslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch) +
sizeof(TupleTableSlot *) * capacity);
```
Do we need a MAXALIGN() here to avoid unaligned access?
2) TupleBatchReset()
```
for (int i = 0; i < b->maxslots; i++)
{
ExecClearTuple(b->inslots[i]);
if (drop_slots)
ExecDropSingleTupleTableSlot(b->inslots[i]);
}
```
ExecDropSingleTupleTableSlot() will call ExecClearTuple(), so ExecClearTuple() will be
called twice if drop_slots is true, I think we can avoid this.
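A rough standalone sketch of the loop with the redundant clear removed. The
Demo* names are hypothetical; demo_drop_slot() only mimics the fact that
ExecDropSingleTupleTableSlot() clears the slot itself before releasing it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical slot that just records what was done to it. */
typedef struct DemoSlot
{
    int         clear_calls;
    bool        dropped;
} DemoSlot;

void
demo_clear_slot(DemoSlot *slot)
{
    slot->clear_calls++;
}

/* Like ExecDropSingleTupleTableSlot(): clears first, then releases. */
void
demo_drop_slot(DemoSlot *slot)
{
    demo_clear_slot(slot);
    slot->dropped = true;
}

/* Each slot is cleared exactly once per reset, whether or not it is dropped. */
void
demo_batch_reset(DemoSlot **slots, int nslots, bool drop_slots)
{
    assert(slots != NULL);

    for (int i = 0; i < nslots; i++)
    {
        if (drop_slots)
            demo_drop_slot(slots[i]);   /* clears internally */
        else
            demo_clear_slot(slots[i]);
    }
}
```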
3) ScanCanUseBatching()
In heap_beginscan(), we may disable page-at-a-time mode:
```
/*
* Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
*/
if (!(snapshot && IsMVCCSnapshot(snapshot)))
scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;
```
It seems that ScanCanUseBatching() didn't consider this.
4) struct TupleBatch
```
struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
struct TupleTableSlot **outslots; /* slots for tuples going "out of"
* batch */
struct TupleTableSlot **activeslots;
```
I think we can remove the word "struct".
5) ExecScanExtendedBatchSlot()
```
/* Get next input slot from current batch, or refill */
if (!TupleBatchHasMore(b))
{
if (!accessBatchMtd(node))
return NULL;
}
```
I think we cannot just return NULL here, see comments in ExecScanExtended():
```
/*
* if the slot returned by the accessMtd contains NULL, then it means
* there is nothing more to scan so we just return an empty slot,
* being careful to use the projection result slot so it has correct
* tupleDesc.
*/
if (TupIsNull(slot))
{
if (projInfo)
return ExecClearTuple(projInfo->pi_state.resultslot);
else
return slot;
}
```
And why not just write this function like ExecScanExtended() and ExecScanFetch()?
--
Regards,
ChangAo Chen
* Re: Batching in executor
@ 2026-02-03 15:54 Junwang Zhao <[email protected]>
parent: cca5507 <[email protected]>
0 siblings, 1 reply; 29+ messages in thread
From: Junwang Zhao @ 2026-02-03 15:54 UTC (permalink / raw)
To: cca5507 <[email protected]>; +Cc: Amit Langote <[email protected]>; Daniil Davydov <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
On Tue, Feb 3, 2026 at 9:30 PM cca5507 <[email protected]> wrote:
>
> Hi,
>
> Some comments for v5:
>
> 0001
> ====
>
> 1) heap_begin_batch()
>
> ```
> /* Single allocation for HeapBatch header + tupdata array */
> alloc_size = sizeof(HeapBatch) + sizeof(HeapTupleData) * maxitems;
> hb = palloc(alloc_size);
> hb->tupdata = (HeapTupleData *) ((char *) hb + sizeof(HeapBatch));
> ```
>
> Do we need a MAXALIGN() here to avoid unaligned access? Something like this:
TBH I don't think this single allocation helps much. It's not on the
hot path, but it makes the code harder to read ;(
>
> ```
> /* Single allocation for HeapBatch header + tupdata array */
> alloc_size = MAXALIGN(sizeof(HeapBatch)) + sizeof(HeapTupleData) * maxitems;
> hb = palloc(alloc_size);
> hb->tupdata = (HeapTupleData *) ((char *) hb + MAXALIGN(sizeof(HeapBatch)));
> ```
>
> Or how about just using zero-length array:
>
> ```
> typedef struct HeapBatch
> {
> Buffer buf;
> int maxitems;
> int nitems;
> HeapTupleData tupdata[FLEXIBLE_ARRAY_MEMBER];
> } HeapBatch;
>
> // and
> hb = palloc(offsetof(HeapBatch, tupdata) + sizeof(HeapTupleData) * maxitems);
> ```
>
> 2) pgstat_count_heap_getnext_batch()
>
> ```
> #define pgstat_count_heap_getnext_batch(rel, n) \
> do { \
> if (pgstat_should_count_relation(rel)) \
> (rel)->pgstat_info->counts.tuples_returned += n; \
> } while (0)
> ```
>
> "+= n" -> "+= (n)", just like pgstat_count_index_tuples().
>
> 0002
> ====
>
> 1) TupleBatchCreate()
>
> ```
> /* Single allocation for TupleBatch + inslots + outslots arrays */
> alloc_size = sizeof(TupleBatch) + 2 * sizeof(TupleTableSlot *) * capacity;
> b = palloc(alloc_size);
> inslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch));
> outslots = (TupleTableSlot **) ((char *) b + sizeof(TupleBatch) +
> sizeof(TupleTableSlot *) * capacity);
> ```
>
> Do we need a MAXALIGN() here to avoid unaligned access?
>
> 2) TupleBatchReset()
>
> ```
> for (int i = 0; i < b->maxslots; i++)
> {
> ExecClearTuple(b->inslots[i]);
> if (drop_slots)
> ExecDropSingleTupleTableSlot(b->inslots[i]);
> }
> ```
>
> ExecDropSingleTupleTableSlot() will call ExecClearTuple(), so ExecClearTuple() will be
> called twice if drop_slots is true, I think we can avoid this.
>
> 3) ScanCanUseBatching()
>
> In heap_beginscan(), we may disable page-at-a-time mode:
>
> ```
> /*
> * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
> */
> if (!(snapshot && IsMVCCSnapshot(snapshot)))
> scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;
> ```
>
> It seems that ScanCanUseBatching() didn't consider this.
>
> 4) struct TupleBatch
>
> ```
> struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
> struct TupleTableSlot **outslots; /* slots for tuples going "out of"
> * batch */
> struct TupleTableSlot **activeslots;
> ```
>
> I think we can remove the word "struct".
>
> 5) ExecScanExtendedBatchSlot()
>
> ```
> /* Get next input slot from current batch, or refill */
> if (!TupleBatchHasMore(b))
> {
> if (!accessBatchMtd(node))
> return NULL;
> }
> ```
>
> I think we cannot just return NULL here, see comments in ExecScanExtended():
>
> ```
> /*
> * if the slot returned by the accessMtd contains NULL, then it means
> * there is nothing more to scan so we just return an empty slot,
> * being careful to use the projection result slot so it has correct
> * tupleDesc.
> */
> if (TupIsNull(slot))
> {
> if (projInfo)
> return ExecClearTuple(projInfo->pi_state.resultslot);
> else
> return slot;
> }
> ```
>
> And why not just write this function like ExecScanExtended() and ExecScanFetch()?
>
> --
> Regards,
> ChangAo Chen
--
Regards
Junwang Zhao
* Re: Batching in executor
@ 2026-03-24 00:59 Amit Langote <[email protected]>
parent: Junwang Zhao <[email protected]>
0 siblings, 1 reply; 29+ messages in thread
From: Amit Langote @ 2026-03-24 00:59 UTC (permalink / raw)
To: Junwang Zhao <[email protected]>; +Cc: cca5507 <[email protected]>; Daniil Davydov <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
Hi,
Here is a significantly revised version of the patch series. A lot has
changed since the January submission, so I want to summarize the
design changes before getting into the patches. I believe it addresses
the points raised in the two reviews that landed since v5, though a
number of them may have become moot after my rewrite of the relevant
portions (thanks to Junwang and ChangAo for the reviews in any case).
At this point it might be better to think of this as targeting v20,
except that if there is review bandwidth in the remaining two weeks
before the v19 feature freeze, the rs_vistuples[] change described
below as a standalone improvement to the existing pagemode scan path
could be considered for v19, though that too is an optimistic
scenario.
It is also worth noting that Andres identified a number of
inefficiencies in the existing scan path in:
Re: unnecessary executor overheads around seqscans
https://postgr.es/m/xzflwwjtwxin3dxziyblrnygy3gfygo5dsuw6ltcoha73ecmnf%40nh6nonzta7kw
that are worth fixing independently of batching. Some of those fixes
may be better pursued first, both because they benefit all scan paths
and because they would make batching's gains more honest.
Separately, after looking at the previous version, Andres pointed out
offlist two fundamental issues with the patch's design:
* The heapam implementation (in a version of the patch I didn't post
to the thread) duplicated heap_prepare_pagescan() logic in a separate
batch-specific code path, which is not acceptable as changes should
benefit the existing slot interface too. Code duplication is not good
either from a future maintainability aspect. The v5 version of that
code is not great in that respect either; it instead duplicated
heapgettup_pagemode() to slap batching on it.
* Allocating executor_batch_rows slots on the executor side to receive
rows from the AM adds significant overhead for slot initialization and
management, and for non-row-organized AMs that do not produce
individual rows at all, those slots would never be meaningfully
populated.
In any case, he just wasn't a fan of the slot-array approach the
moment I mentioned it. The previous version had two slot arrays,
inslots and outslots, of TTSOpsHeapTuple type (not
TTSOpsBufferHeapTuple because buffer pins were managed by the batch
code, which has its own modularity/correctness issues), populated via
a materialize_all callback. A batch qual evaluator would copy
qualifying tuples into outslots, with an activeslots pointer switching
between the two depending on whether batch qual evaluation was used.
The new design addresses both issues and differs from the previous
version in several other ways:
* Single slot instead of slot arrays: there is a single
TupleTableSlot, reusing the scan node's ss_ScanTupleSlot whose type
was already determined by the AM via table_slot_callbacks(). The slot
is re-pointed to each HeapTuple in the current buffer page via a new
repoint_slot AM callback, with no materialization or copying. Tuples
are returned one by one from the executor's perspective, but the AM
serves them in page-sized batches from pre-built HeapTupleData
descriptors in rs_vistuples[], avoiding repeated descent into heapam
per tuple. This is heapam's implementation of the batch interface;
there is no intention to force other AMs into the same row-oriented
model.
* Batch qual evaluator not included: with the single-slot model,
quals are evaluated per tuple via the existing ExecQual path after
each repoint_slot call. A natural next step would be a new opcode
(EEOP) that calls repoint_slot() internally within expression
evaluation, allowing ExecQual to advance through multiple tuples from
the same batch without returning to the scan node each time, with qual
results accumulated in a bitmask in ExprState. The details of that
will be worked out in a follow-on series.
* heapgettup_pagemode_batch() gone: patch 0001 (described below) makes
HeapScanDesc store full HeapTupleData entries in rs_vistuples[], which
allows heap_getnextbatch() to simply advance a slice pointer into that
array without any additional copying or re-entering heap code, making
a separate batch-specific scan function unnecessary.
* TupleBatch renamed to RowBatch: "row batch" is more natural
terminology for this concept and also consistent with how similar
abstractions are named in columnar and OLAP systems.
* AM callbacks now take RowBatch directly: previously
heap_getnextbatch() returned a void pointer that the executor would
store into RowBatch.am_payload, because only the executor knew the
internals of RowBatch. Now the AM receives RowBatch directly as a
parameter and can populate it without the executor acting as an
intermediary. This is also why RowBatch is introduced in its own
patch ahead of the AM API addition, so the struct definition is
available to both sides.
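To make the single-slot model and the possible bitmask follow-on more
concrete, here is a rough standalone sketch. It is not the patch's code:
all Demo* names are hypothetical, DemoTuple stands in for HeapTupleData,
the vistuples array stands in for rs_vistuples[], and the hard-coded
"id % 10 = 0" test stands in for a real qual.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct DemoTuple { int id; } DemoTuple;     /* like HeapTupleData */
typedef struct DemoSlot { const DemoTuple *tuple; } DemoSlot;  /* scan slot */

/* A batch is only a slice into the AM's per-page array; nothing is copied. */
typedef struct DemoRowBatch
{
    const DemoTuple *rows;
    int         nrows;
} DemoRowBatch;

typedef struct DemoScan
{
    const DemoTuple *vistuples; /* visible tuples of the current page */
    int         ntuples;
    int         next;           /* first tuple not yet handed out */
} DemoScan;

/* Hand out the next slice of up to max_rows tuples; returns 0 when done. */
int
demo_getnextbatch(DemoScan *scan, DemoRowBatch *batch, int max_rows)
{
    int         remaining = scan->ntuples - scan->next;
    int         n = remaining < max_rows ? remaining : max_rows;

    if (n <= 0)
    {
        batch->rows = NULL;
        batch->nrows = 0;
        return 0;
    }
    batch->rows = scan->vistuples + scan->next;
    batch->nrows = n;
    scan->next += n;
    return n;
}

/* Re-point the single slot at row 'idx' of the batch, without copying. */
void
demo_repoint_slot(const DemoRowBatch *batch, int idx, DemoSlot *slot)
{
    slot->tuple = &batch->rows[idx];
}

/* Evaluate the stand-in qual over the batch; bit i is set if row i passes. */
uint64_t
demo_qual_batch(const DemoRowBatch *batch, DemoSlot *slot)
{
    uint64_t    mask = 0;

    assert(batch->nrows <= 64); /* one bit per row in this sketch */

    for (int i = 0; i < batch->nrows; i++)
    {
        demo_repoint_slot(batch, i, slot);
        if (slot->tuple->id % 10 == 0)
            mask |= UINT64_C(1) << i;
    }
    return mask;
}
```

The 64-row limit is only an artifact of using a single uint64_t here; a
real implementation with executor_batch_rows=300 would need a wider mask.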
Patch 0001 changes rs_vistuples[] to store full HeapTupleData entries
instead of OffsetNumbers, as a standalone improvement to the existing
pagemode scan path. Measured on a pg_prewarm'd (also vacuum freeze'd
in the all-visible case) table with 1M/5M/10M rows:
query                           all-visible       not-all-visible
count(*)                        -0.2% to +0.9%    -0.4% to +0.5%
count(*) WHERE id % 10 = 0      -1.1% to +3.4%    +0.2% to +1.5%
SELECT * LIMIT 1 OFFSET N       -2.2% to -0.6%    -0.9% to +6.6%
SELECT * WHERE id%10=0 LIMIT    -0.8% to +3.9%    +0.9% to +9.6%
No significant regression on either page type. The structural
improvement is most visible on not-all-visible pages where
HeapTupleSatisfiesMVCCBatch() already reads every tuple header during
visibility checks, so persisting the result into rs_vistuples[]
eliminates the downstream re-read (in heapgettup_pagemode()) with no
measurable overhead. That said, these numbers are somewhat noisy on
my machine. Results on other machines would be welcome.
Patches 0002-0005 add the RowBatch infrastructure, the batch AM API
and heapam implementation including seqscan variants that use the new
scan_getnextbatch() API, and EXPLAIN (ANALYZE, BATCHES) support,
respectively. With batching enabled (executor_batch_rows=300,
~MaxHeapTuplesPerPage):
query                           all-visible       not-all-visible
count(*)                        +11 to +15%       +9 to +13%
count(*) WHERE id % 10 = 0      +6 to +11%        +10 to +14%
SELECT * LIMIT 1 OFFSET N       +16 to +19%       +16 to +22%
SELECT * WHERE id%10=0 LIMIT    +8 to +10%        +8 to +13%
With executor_batch_rows=0, results are within noise of master across
all query types and sizes, confirming no regression from the
infrastructure changes themselves. The not-all-visible results tend
to show slightly higher gains than the all-visible case. This is
likely because the existing heapam code is more optimized for the
all-visible path, so the not-all-visible path, which goes through
HeapTupleSatisfiesMVCCBatch() for per-tuple visibility checks, has
more headroom that batching can exploit.
Setting aside the current series for a moment, there are some broader
design questions worth raising while we have attention on this area.
Some of these echo points Tomas raised in his first reply on this
thread. I am reiterating them deliberately: either I have not managed
to fully address them on my own, or I simply didn't need to for the
TAM-to-scan-node batching, and I think they would benefit from wider
input rather than just my own iteration.
We should also start thinking about other ways the executor can
consume batch rows, not always assuming they are presented as
HeapTupleData. For instance, an AM could expose decoded column arrays
directly to operators that can consume them, bypassing slot-based
deform entirely, or a columnar AM could implement scan_getnextbatch by
decoding column strips directly into the batch without going through
per-tuple HeapTupleData at all. Feedback on whether the current
RowBatch design and the choices made in the scan_getnextbatch and
RowBatchOps API make that sort of thing harder than it needs to be
would be appreciated. For example, heapam's implementation of
scan_getnextbatch uses a single TTSOpsBufferHeapTuple slot re-pointed
to HeapTupleData entries one at a time via repoint_slot in
RowBatchHeapOps. That works for heapam, but a columnar AM that decodes
column strips directly into arrays in the batch, as described above,
would need no per-row repoint step at all. Any
adjustments that would make RowBatch more AM-agnostic are worth
discussing now before the design hardens.
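As a rough illustration of the two consumption styles being contrasted
here, the sketch below models a batch that either goes through a per-row
repoint callback (the heapam style) or exposes a decoded column array
directly (what a columnar AM might do). Everything in it is hypothetical:
ToyRowBatch, ToyRowBatchOps, the col_a field and toy_sum_a are stand-ins
for discussion, not the RowBatch or RowBatchOps definitions in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical two-column row and a slot that points at one row. */
typedef struct ToyRow { int32_t a; int32_t b; } ToyRow;
typedef struct ToySlot { const ToyRow *row; } ToySlot;

typedef struct ToyRowBatch ToyRowBatch;

typedef struct ToyRowBatchOps
{
    /* Re-point b->slot at row idx; NULL if the AM exposes columns instead. */
    void (*repoint_slot) (ToyRowBatch *b, int idx);
} ToyRowBatchOps;

struct ToyRowBatch
{
    const ToyRowBatchOps *ops;
    const void    *am_payload;  /* AM-private: a ToyRow array for the row AM */
    ToySlot        slot;        /* the single slot re-pointed per row */
    const int32_t *col_a;       /* assumed extension: decoded column "a" */
    int            nrows;
};

/* Row-oriented AM: point the slot at the idx'th row of the payload. */
static void
toy_heap_repoint_slot(ToyRowBatch *b, int idx)
{
    const ToyRow *rows = (const ToyRow *) b->am_payload;

    assert(idx >= 0 && idx < b->nrows);
    b->slot.row = &rows[idx];
}

static const ToyRowBatchOps toy_heap_ops = {.repoint_slot = toy_heap_repoint_slot};
static const ToyRowBatchOps toy_columnar_ops = {.repoint_slot = NULL};

/* Consumer: sum column "a", preferring the column array when present. */
static int64_t
toy_sum_a(ToyRowBatch *b)
{
    int64_t sum = 0;

    if (b->col_a != NULL)
    {
        /* Columnar path: tight loop, no per-row callback. */
        for (int i = 0; i < b->nrows; i++)
            sum += b->col_a[i];
        return sum;
    }

    /* Row path: one indirect call per row to re-point the slot. */
    assert(b->ops->repoint_slot != NULL);
    for (int i = 0; i < b->nrows; i++)
    {
        b->ops->repoint_slot(b, i);
        sum += b->slot.row->a;
    }
    return sum;
}
```

The point of the sketch is only that a consumer needs some way to ask the
batch which representation it carries; whether that belongs in the ops
struct, in the batch itself, or somewhere else is exactly the open question.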
There are also broader open questions about how far the batch model
can extend beyond the scan node. Qual pushdown into the AM has been
discussed in nearby threads and would be one way to allow expression
evaluation to happen before data reaches the executor proper, though
that is a separate effort. For the purposes of this series, expression
evaluation still happens in the executor after scan_getnextbatch
returns. If the scan node does not project, the buffer heap slot is
passed directly to the parent node, which calls slot callbacks to
deform as needed. But once a node above projects, aggregates, or
joins, the notion of a page-sized batch from a single AM loses its
meaning and virtual slots take over. Whether RowBatch is usable or
meaningful beyond the scan/TAM boundary in any form, and whether the
core executor will ever have non-HeapTupleData batch consumption paths
or leave that entirely to extensions, are open questions worth
discussing.
For RowBatch to eventually play the role that TupleTableSlot plays for
row-at-a-time execution, something inside it would need to serve as
the common currency for batch data, analogous to TupleTableSlot's
datum/isnull arrays. Column arrays are the obvious direction, but even
that leaves open the question of representation. PostgreSQL's Datum is
a pointer-sized abstraction that boxes everything, whereas vectorized
systems use typed packed arrays of native types with validity
bitmasks, which is a significant part of why tight vectorized loops
are fast there. Whether column arrays of Datum would be good enough,
or whether going further toward typed packed arrays would be necessary
to get meaningful vectorization, is a deeper design question that this
series deliberately does not try to answer.
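To illustrate the representation question, here is a small sketch that
sums the same int4 column in both layouts: a pointer-sized boxed value
with a separate isnull array, and a packed int32 array with a validity
bitmap. ToyDatum is a stand-in for Datum, and the one-bit-per-row,
least-significant-bit-first bitmap layout is an assumption made for the
example, not something proposed in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for PostgreSQL's Datum: a pointer-sized box for any value. */
typedef uintptr_t ToyDatum;

static inline ToyDatum
toy_int32_get_datum(int32_t v)
{
    return (ToyDatum) v;
}

/* Datum-array layout: 8 bytes per value on 64-bit, plus a bool per row. */
static int64_t
sum_datum_column(const ToyDatum *values, const bool *isnull, int nrows)
{
    int64_t sum = 0;

    for (int i = 0; i < nrows; i++)
    {
        if (!isnull[i])
            sum += (int32_t) values[i];     /* unbox on every row */
    }
    return sum;
}

/*
 * Packed layout: 4 bytes per value plus one validity bit per row
 * (bit i of the bitmap set means row i is not NULL).
 */
static int64_t
sum_packed_int32(const int32_t *values, const uint8_t *validity, int nrows)
{
    int64_t sum = 0;

    for (int i = 0; i < nrows; i++)
    {
        int64_t valid = (validity[i >> 3] >> (i & 7)) & 1;

        sum += valid * values[i];           /* branch-free accumulate */
    }
    return sum;
}
```

The second loop has no per-row branch and touches half the bytes per value
on a 64-bit build, which is the kind of shape a compiler can more readily
vectorize; whether that difference matters enough in practice is the part
that would need measuring.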
Even though the focus is on getting batching working at the scan/TAM
boundary first, thoughts on any of these points would be welcome.
--
Thanks, Amit Langote
Attachments:
[application/x-patch] v6-0003-Add-batch-table-AM-API-and-heapam-implementation.patch (19.0K, 2-v6-0003-Add-batch-table-AM-API-and-heapam-implementation.patch)
From a095d26e1b5a361a7d42300e5364da948496f2ba Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 23 Mar 2026 18:21:47 +0900
Subject: [PATCH v6 3/5] Add batch table AM API and heapam implementation
Introduce table AM callbacks for batched tuple fetching:
scan_begin_batch, scan_getnextbatch, scan_reset_batch, and
scan_end_batch. AMs implement all four or none; checked by
table_supports_batching().
scan_reset_batch releases held resources (e.g. buffer pins)
without freeing, allowing reuse across rescans.
Provide the heapam implementation. HeapPageBatch (stored in
RowBatch.am_payload) is a thin slice descriptor over the scan's
rs_vistuples[] array, which was introduced in the previous commit.
Rather than owning a copy of tuple headers, HeapPageBatch holds a
pointer into scan->rs_vistuples[] for the current slice and a buffer
pin for the current page.
heap_getnextbatch() calls heap_prepare_pagescan() to populate
rs_vistuples[] for each new page, then re-points hb->tuples to the
next slice of rs_vistuples[] on each call. If the page has more
tuples than the executor's max_rows, subsequent calls return the
next slice without re-entering page preparation. The buffer pin is
held until the page is fully consumed.
scan_begin_batch creates a single TupleTableSlot with
TTSOpsBufferHeapTuple ops. heap_repoint_slot() re-points this slot
to each tuple in turn via ExecStoreBufferHeapTuple(). Consumers
that need to retain the slot across calls rely on the normal slot
materialization contract.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 229 ++++++++++++++++++++++-
src/backend/access/heap/heapam_handler.c | 8 +-
src/include/access/heapam.h | 33 ++++
src/include/access/tableam.h | 136 ++++++++++++++
src/include/pgstat.h | 4 +-
5 files changed, 403 insertions(+), 7 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c6d0aacc5c9..e70c0ccbe82 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -43,6 +43,7 @@
#include "catalog/pg_database.h"
#include "catalog/pg_database_d.h"
#include "commands/vacuum.h"
+#include "executor/execRowBatch.h"
#include "pgstat.h"
#include "port/pg_bitutils.h"
#include "storage/lmgr.h"
@@ -109,6 +110,7 @@ static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+static void heap_repoint_slot(RowBatch *b, int idx);
/*
@@ -1213,7 +1215,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_cbuf = InvalidBuffer;
/*
- * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
+ * Disable page-at-a-time mode if the snapshot does not allow it.
*/
if (!(snapshot && IsMVCCSnapshot(snapshot)))
scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;
@@ -1463,7 +1465,7 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
return &scan->rs_ctup;
}
@@ -1491,13 +1493,232 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
ExecStoreBufferHeapTuple(&scan->rs_ctup, slot,
scan->rs_cbuf);
return true;
}
+/*---------- Batching support -----------*/
+
+static const RowBatchOps RowBatchHeapOps =
+{
+ .repoint_slot = heap_repoint_slot
+};
+
+/*
+ * heap_batch_feasible
+ * Batching requires a MVCC snapshot since it relies on
+ * page-at-a-time mode, which heap_beginscan() disables for
+ * non-MVCC snapshots.
+ */
+bool
+heap_batch_feasible(Relation relation, Snapshot snapshot)
+{
+ return snapshot && IsMVCCSnapshot(snapshot);
+}
+
+/*
+ * heap_begin_batch
+ * Initialize AM-side batch state for a heap scan.
+ *
+ * Allocates a HeapPageBatch, which acts as a thin slice descriptor over
+ * the scan's rs_vistuples[] array. Unlike the previous version there is
+ * no separate tuple header storage in HeapPageBatch itself; rs_vistuples[]
+ * in HeapScanDescData (populated by page_collect_tuples() via
+ * heap_prepare_pagescan()) serves as the page-level buffer. HeapPageBatch
+ * holds a pointer into that array for the current slice and the buffer pin
+ * for the current page.
+ *
+ * b->slot must be a TTSOpsBufferHeapTuple slot.
+ */
+void
+heap_begin_batch(TableScanDesc sscan, RowBatch *b)
+{
+ HeapPageBatch *hb;
+
+ /* Batch path relies on executor-level qual eval, not AM scan keys */
+ Assert(sscan->rs_nkeys == 0);
+ Assert(TTS_IS_BUFFERTUPLE(b->slot));
+
+ hb = palloc(sizeof(HeapPageBatch));
+ hb->tuples = NULL;
+ hb->ntuples = 0;
+ hb->nextitem = 0;
+ hb->buf = InvalidBuffer;
+
+ b->am_payload = hb;
+ b->ops = &RowBatchHeapOps;
+}
+
+/*
+ * heap_reset_batch
+ * Release pin and reset for rescan, keeping allocations.
+ */
+void
+heap_reset_batch(TableScanDesc sscan, RowBatch *b)
+{
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+
+ Assert(hb != NULL);
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+ hb->ntuples = 0;
+ hb->nextitem = 0;
+}
+
+/*
+ * heap_end_batch
+ * Release all batch resources.
+ */
+void
+heap_end_batch(TableScanDesc sscan, RowBatch *b)
+{
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+
+ if (BufferIsValid(hb->buf))
+ ReleaseBuffer(hb->buf);
+
+ pfree(hb);
+ b->am_payload = NULL;
+}
+
+/*
+ * heap_getnextbatch
+ * Fetch the next slice of visible tuples from a heap scan.
+ *
+ * Serves slices from the current page's rs_vistuples[] array. If the
+ * current page has remaining tuples, sets hb->tuples to point at the next
+ * slice without re-entering the page scan. If the page is exhausted,
+ * advances to the next page via heap_fetch_next_buffer(), prepares it
+ * with heap_prepare_pagescan(), and serves the first slice from it.
+ *
+ * hb->tuples points directly into scan->rs_vistuples[]; the entries remain
+ * valid as long as hb->buf (the page's buffer pin) is held. The pin is
+ * released at the top of the next call once the page is fully consumed.
+ *
+ * Each call returns at most b->max_rows tuples.
+ *
+ * Returns true if tuples were fetched, false at end of scan.
+ */
+bool
+heap_getnextbatch(TableScanDesc sscan, RowBatch *b, ScanDirection dir)
+{
+ HeapScanDesc scan = (HeapScanDesc) sscan;
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+ int remaining;
+ int nserve;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+
+ /*
+ * Try to serve from the current page first. No page advance, no buffer
+ * management, no re-entry into heap code.
+ */
+ remaining = scan->rs_ntuples - hb->nextitem;
+ if (remaining > 0)
+ {
+ nserve = Min(remaining, b->max_rows);
+
+ hb->tuples = &scan->rs_vistuples[hb->nextitem];
+ hb->ntuples = nserve;
+ hb->nextitem += nserve;
+
+ b->nrows = nserve;
+ b->pos = 0;
+
+ pgstat_count_heap_getnext(sscan->rs_rd, nserve);
+ return true;
+ }
+
+ /*
+ * Current page exhausted. Advance to the next page with visible tuples.
+ */
+ for (;;)
+ {
+ /*
+ * Release the previous page's pin. The page is fully consumed at
+ * this point -- all slices have been served.
+ */
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+
+ heap_fetch_next_buffer(scan, dir);
+
+ if (!BufferIsValid(scan->rs_cbuf))
+ {
+ /* End of scan */
+ scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_prefetch_block = InvalidBlockNumber;
+ scan->rs_inited = false;
+ b->nrows = 0;
+ return false;
+ }
+
+ Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+
+ /*
+ * Prepare the page: prune, run visibility checks, and populate
+ * scan->rs_vistuples[0..rs_ntuples-1] via page_collect_tuples().
+ */
+ heap_prepare_pagescan(sscan);
+
+ if (scan->rs_ntuples > 0)
+ {
+ /*
+ * Pin the page so tuple data stays valid while the executor
+ * processes slices. Released at the top of the next call
+ * once the page is fully consumed.
+ */
+ IncrBufferRefCount(scan->rs_cbuf);
+ hb->buf = scan->rs_cbuf;
+
+ nserve = Min(scan->rs_ntuples, b->max_rows);
+
+ hb->tuples = &scan->rs_vistuples[0];
+ hb->ntuples = nserve;
+ hb->nextitem = nserve;
+
+ b->nrows = nserve;
+ b->pos = 0;
+
+ pgstat_count_heap_getnext(sscan->rs_rd, nserve);
+ return true;
+ }
+
+ /* Empty page (all dead/invisible tuples), try next */
+ }
+}
+
+/*
+ * heap_repoint_slot
+ * Re-point the batch's single slot to the tuple at index idx.
+ *
+ * Called by RowBatchGetNextSlot() for each tuple served to the parent
+ * node. hb->tuples[idx] was populated by page_collect_tuples() via
+ * heap_prepare_pagescan() and remains valid as long as hb->buf is pinned.
+ */
+static void
+heap_repoint_slot(RowBatch *b, int idx)
+{
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+
+ Assert(idx >= 0 && idx < hb->ntuples);
+ Assert(TTS_IS_BUFFERTUPLE(b->slot));
+
+ ExecStoreBufferHeapTuple(&hb->tuples[idx], b->slot, hb->buf);
+}
+
+/*----- End of batching support -----*/
+
void
heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid)
@@ -1639,7 +1860,7 @@ heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
* if we get here it means we have a new current scan tuple, so point to
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
ExecStoreBufferHeapTuple(&scan->rs_ctup, slot, scan->rs_cbuf);
return true;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2fd120028bb..8124d573ac3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2348,7 +2348,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
ExecStoreBufferHeapTuple(tuple, slot, hscan->rs_cbuf);
/* Count successfully-fetched tuples as heap fetches */
- pgstat_count_heap_getnext(scan->rs_rd);
+ pgstat_count_heap_getnext(scan->rs_rd, 1);
return true;
}
@@ -2637,6 +2637,12 @@ static const TableAmRoutine heapam_methods = {
.scan_rescan = heap_rescan,
.scan_getnextslot = heap_getnextslot,
+ .scan_batch_feasible = heap_batch_feasible,
+ .scan_begin_batch = heap_begin_batch,
+ .scan_getnextbatch = heap_getnextbatch,
+ .scan_end_batch = heap_end_batch,
+ .scan_reset_batch = heap_reset_batch,
+
.scan_set_tidrange = heap_set_tidrange,
.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 09b9566d0ac..0783fa13c4c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -107,6 +107,32 @@ typedef struct HeapScanDescData
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
+/*
+ * HeapPageBatch -- heapam-private page-level batch state.
+ *
+ * Thin slice descriptor over the scan's rs_vistuples[] array. Rather
+ * than owning a copy of tuple headers, HeapPageBatch holds a pointer
+ * into scan->rs_vistuples[] for the current slice, which was populated
+ * by page_collect_tuples() during heap_prepare_pagescan().
+ *
+ * The executor consumes tuples in slices. Each heap_getnextbatch call
+ * re-points tuples to the next slice and advances nextitem, serving up
+ * to RowBatch.max_rows tuples from the current page before advancing
+ * to the next.
+ *
+ * buf holds the pin for the current page. tuple data referenced via
+ * tuples remains valid as long as buf is pinned.
+ *
+ * Stored in RowBatch.am_payload.
+ */
+typedef struct HeapPageBatch
+{
+ HeapTupleData *tuples; /* points into scan->rs_vistuples[nextitem] */
+ int ntuples; /* tuples in current slice */
+ int nextitem; /* next unserved tuple index in rs_vistuples[] */
+ Buffer buf; /* pinned buffer for current page */
+} HeapPageBatch;
+
typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
@@ -362,6 +388,13 @@ extern void heap_endscan(TableScanDesc sscan);
extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
extern bool heap_getnextslot(TableScanDesc sscan,
ScanDirection direction, TupleTableSlot *slot);
+
+extern bool heap_batch_feasible(Relation relation, Snapshot snapshot);
+extern void heap_begin_batch(TableScanDesc sscan, RowBatch *batch);
+extern bool heap_getnextbatch(TableScanDesc sscan, RowBatch *batch, ScanDirection dir);
+extern void heap_end_batch(TableScanDesc sscan, RowBatch *batch);
+extern void heap_reset_batch(TableScanDesc sscan, RowBatch *batch);
+
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 06084752245..a72be111c26 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -275,6 +275,8 @@ typedef void (*IndexBuildCallback) (Relation index,
bool tupleIsAlive,
void *state);
+typedef struct RowBatch RowBatch;
+
/*
* API struct for a table AM. Note this must be allocated in a
* server-lifetime manner, typically as a static const struct, which then gets
@@ -351,6 +353,56 @@ typedef struct TableAmRoutine
ScanDirection direction,
TupleTableSlot *slot);
+ /* ------------------------------------------------------------------------
+ * Batched scan support
+ * ------------------------------------------------------------------------
+ */
+
+ /*
+ * Returns true if the AM can support batching for a scan with the
+ * given snapshot. Called at plan init time before the scan descriptor
+ * exists. AMs that have no snapshot-based restrictions can omit this
+ * callback, in which case batching is considered feasible.
+ */
+ bool (*scan_batch_feasible)(Relation relation, Snapshot snapshot);
+
+ /*
+ * Initialize AM-owned batch state for a scan. Called once before
+ * the first scan_getnextbatch call. The AM allocates whatever
+ * private state it needs and stores it in b->am_payload. b->slot
+ * is the scan node's ss_ScanTupleSlot, whose type was already
+ * determined by the AM via table_slot_callbacks(). The AM's
+ * repoint_slot callback re-points it to each tuple in the batch
+ * in turn. Future interfaces may allow the AM to expose batch
+ * data in other forms without going through a slot.
+ */
+ void (*scan_begin_batch)(TableScanDesc sscan, RowBatch *b);
+
+ /*
+ * Fetch the next batch of tuples from the scan into b. Sets b->nrows
+ * to the number of tuples available and resets b->pos to 0. Returns
+ * true if any tuples were fetched, false at end of scan. The caller
+ * advances through the batch via RowBatchGetNextSlot(), which calls
+ * ops->repoint_slot for each position up to b->nrows.
+ */
+ bool (*scan_getnextbatch)(TableScanDesc sscan, RowBatch *b,
+ ScanDirection dir);
+
+ /*
+ * Release all AM-owned batch resources, including any buffer pins
+ * held in am_payload. Called when the scan node is shut down.
+ * After this call b->am_payload must not be used.
+ */
+ void (*scan_end_batch)(TableScanDesc sscan, RowBatch *b);
+
+ /*
+ * Reset batch state for rescan. Release any held resources (e.g.
+ * buffer pins) and reset counts, but keep the allocation so the
+ * next getnextbatch call can reuse it without re-entering
+ * begin_batch.
+ */
+ void (*scan_reset_batch)(TableScanDesc sscan, RowBatch *b);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -1047,6 +1099,90 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_supports_batching
+ * Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation, Snapshot snapshot)
+{
+ const TableAmRoutine *tam = relation->rd_tableam;
+
+ if (tam->scan_getnextbatch == NULL)
+ return false;
+
+ Assert(tam->scan_begin_batch != NULL);
+ Assert(tam->scan_reset_batch != NULL);
+ Assert(tam->scan_end_batch != NULL);
+
+ /*
+ * Optional: AM may restrict batching based on snapshot or other conditions.
+ */
+ if (tam->scan_batch_feasible != NULL &&
+ !tam->scan_batch_feasible(relation, snapshot))
+ return false;
+
+ return true;
+}
+
+/*
+ * table_scan_begin_batch
+ * Allocate AM-owned batch payload in the RowBatch
+ */
+static inline void
+table_scan_begin_batch(TableScanDesc sscan, RowBatch *b)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_begin_batch != NULL);
+
+ return tam->scan_begin_batch(sscan, b);
+}
+
+/*
+ * table_scan_getnextbatch
+ * Fetch the next batch of tuples from the AM. Returns true if tuples
+ * were fetched, false at end of scan. Only forward scans are supported.
+ */
+static inline bool
+table_scan_getnextbatch(TableScanDesc sscan, RowBatch *b, ScanDirection dir)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(tam->scan_getnextbatch != NULL);
+
+ return tam->scan_getnextbatch(sscan, b, dir);
+}
+
+/*
+ * table_scan_end_batch
+ * Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, RowBatch *b)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_end_batch != NULL);
+
+ tam->scan_end_batch(sscan, b);
+}
+
+/*
+ * table_scan_reset_batch
+ * Reset AM-owned batch state for rescan without freeing.
+ */
+static inline void
+table_scan_reset_batch(TableScanDesc sscan, RowBatch *b)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_reset_batch != NULL);
+
+ tam->scan_reset_batch(sscan, b);
+}
+
/* ----------------------------------------------------------------------------
* TID Range scanning related functions.
* ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 216b93492ba..0344c4e88c3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -695,10 +695,10 @@ extern void pgstat_report_analyze(Relation rel,
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.numscans++; \
} while (0)
-#define pgstat_count_heap_getnext(rel) \
+#define pgstat_count_heap_getnext(rel, n) \
do { \
if (pgstat_should_count_relation(rel)) \
- (rel)->pgstat_info->counts.tuples_returned++; \
+ (rel)->pgstat_info->counts.tuples_returned += (n); \
} while (0)
#define pgstat_count_heap_fetch(rel) \
do { \
--
2.47.3
[application/x-patch] v6-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch (12.8K, 3-v6-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch)
From d7e8f76144cb27e761e2d4bc9c687dd0a2de203e Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 12 Mar 2026 09:18:04 +0900
Subject: [PATCH v6 1/5] heapam: store full HeapTupleData in rs_vistuples[] for
pagemode scans
page_collect_tuples() builds full HeapTupleData headers for every
visible tuple on a page -- t_data, t_len, t_self, t_tableOid -- but
previously discarded them immediately after writing just the OffsetNumber
of each survivor into rs_vistuples[]. heapgettup_pagemode() then
re-derived those same values on every call from the saved OffsetNumber
via PageGetItemId() and PageGetItem().
Change rs_vistuples[] element type from OffsetNumber to HeapTupleData
and populate it inside page_collect_tuples() while lpp, lineoff, page,
block, and relid are already in scope, so no additional page reads are
needed. For the all_visible path (the common case on a primary not
under active modification) the write piggy-backs on the existing
per-lineoff loop. For the !all_visible path, HeapTupleData entries are
written during the visibility loop and compacted to visible survivors
afterwards using batchmvcc.visible[], avoiding a return to pd_linp[] via
PageGetItemId().
With rs_vistuples[] populated, heapgettup_pagemode() replaces the
per-tuple PageGetItemId/PageGetItem calls with a single struct copy:
*tuple = scan->rs_vistuples[lineindex];
The stack-local HeapTupleData array in BatchMVCCState is eliminated by
passing rs_vistuples[] directly to HeapTupleSatisfiesMVCCBatch(),
saving MaxHeapTuplesPerPage * 24 bytes of stack per page_collect_tuples()
call. HeapTupleSatisfiesMVCCBatch() loses its vistuples_dense parameter
since compaction is now handled by the caller.
t_tableOid is pre-initialized for all rs_vistuples[] entries at scan
start in heap_beginscan(), eliminating a store per visible tuple from the
fill loop. The raw ItemId word is read once per tuple with lp_off and
lp_len extracted via mask and shift rather than calling ItemIdGetOffset()
and ItemIdGetLength() separately, avoiding a potential second load from
the same address in the inner loop.
Having pre-built HeapTupleData headers available at the scan descriptor
level also lays groundwork for a batched tuple interface, where an AM
can serve multiple tuples per call without repeating the line pointer
traversal.
Suggested-by: Andres Freund <[email protected]>
---
src/backend/access/heap/heapam.c | 73 ++++++++++++---------
src/backend/access/heap/heapam_handler.c | 19 ++----
src/backend/access/heap/heapam_visibility.c | 21 +++---
src/include/access/heapam.h | 5 +-
4 files changed, 58 insertions(+), 60 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e5bd062de77..c6d0aacc5c9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -524,7 +524,6 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
BlockNumber block, int lines,
bool all_visible, bool check_serializable)
{
- Oid relid = RelationGetRelid(scan->rs_base.rs_rd);
int ntup = 0;
int nvis = 0;
BatchMVCCState batchmvcc;
@@ -536,7 +535,7 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
for (OffsetNumber lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
{
ItemId lpp = PageGetItemId(page, lineoff);
- HeapTuple tup;
+ HeapTuple tup = &scan->rs_vistuples[ntup];
if (unlikely(!ItemIdIsNormal(lpp)))
continue;
@@ -549,25 +548,33 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
*/
if (!all_visible || check_serializable)
{
- tup = &batchmvcc.tuples[ntup];
+ uint32 lp_val = *(uint32 *) lpp;
- tup->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
- tup->t_len = ItemIdGetLength(lpp);
- tup->t_tableOid = relid;
+ tup->t_data = (HeapTupleHeader) ((char *) page + (lp_val & 0x7fff));
+ tup->t_len = lp_val >> 17;
+ Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
ItemPointerSet(&(tup->t_self), block, lineoff);
}
- /*
- * If the page is all visible, these fields otherwise won't be
- * populated in loop below.
- */
if (all_visible)
{
if (check_serializable)
- {
batchmvcc.visible[ntup] = true;
+
+ /*
+ * In the all_visible && !check_serializable path, the block
+ * above was skipped, so tup's fields have not been set yet.
+ * Fill them here while lpp is still in hand.
+ */
+ if (!check_serializable)
+ {
+ uint32 lp_val = *(uint32 *) lpp;
+
+ tup->t_data = (HeapTupleHeader) ((char *) page + (lp_val & 0x7fff));
+ tup->t_len = lp_val >> 17;
+ Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
+ ItemPointerSet(&tup->t_self, block, lineoff);
}
- scan->rs_vistuples[ntup] = lineoff;
}
ntup++;
@@ -598,11 +605,24 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
{
HeapCheckForSerializableConflictOut(batchmvcc.visible[i],
scan->rs_base.rs_rd,
- &batchmvcc.tuples[i],
+ &scan->rs_vistuples[i],
buffer, snapshot);
}
}
+
+ /* Now compact rs_vistuples[] to visible survivors only */
+ if (!all_visible)
+ {
+ int dst = 0;
+ for (int i = 0; i < ntup; i++)
+ {
+ if (batchmvcc.visible[i])
+ scan->rs_vistuples[dst++] = scan->rs_vistuples[i];
+ }
+ Assert(dst == nvis);
+ }
+
return nvis;
}
@@ -1073,14 +1093,13 @@ heapgettup_pagemode(HeapScanDesc scan,
ScanKey key)
{
HeapTuple tuple = &(scan->rs_ctup);
- Page page;
uint32 lineindex;
uint32 linesleft;
if (likely(scan->rs_inited))
{
/* continue from previously returned page/tuple */
- page = BufferGetPage(scan->rs_cbuf);
+ Assert(BufferIsValid(scan->rs_cbuf));
lineindex = scan->rs_cindex + dir;
if (ScanDirectionIsForward(dir))
@@ -1108,29 +1127,21 @@ heapgettup_pagemode(HeapScanDesc scan,
/* prune the page and determine visible tuple offsets */
heap_prepare_pagescan((TableScanDesc) scan);
- page = BufferGetPage(scan->rs_cbuf);
linesleft = scan->rs_ntuples;
lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
- /* block is the same for all tuples, set it once outside the loop */
- ItemPointerSetBlockNumber(&tuple->t_self, scan->rs_cblock);
-
/* lineindex now references the next or previous visible tid */
continue_page:
for (; linesleft > 0; linesleft--, lineindex += dir)
{
- ItemId lpp;
- OffsetNumber lineoff;
-
- Assert(lineindex < scan->rs_ntuples);
- lineoff = scan->rs_vistuples[lineindex];
- lpp = PageGetItemId(page, lineoff);
- Assert(ItemIdIsNormal(lpp));
-
- tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
- tuple->t_len = ItemIdGetLength(lpp);
- ItemPointerSetOffsetNumber(&tuple->t_self, lineoff);
+ /*
+ * Headers were pre-built by page_collect_tuples() into
+ * rs_vistuples[]. Copy the entry; t_data still points into the
+ * pinned page, which is safe for the lifetime of the current page
+ * scan.
+ */
+ *tuple = scan->rs_vistuples[lineindex];
/* skip any tuples that don't match the scan key */
if (key != NULL &&
@@ -1244,6 +1255,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
/* we only need to set this up once */
scan->rs_ctup.t_tableOid = RelationGetRelid(relation);
+ for (int i = 0; i < MaxHeapTuplesPerPage; i++)
+ scan->rs_vistuples[i].t_tableOid = RelationGetRelid(relation);
/*
* Allocate memory to keep track of page allocation for parallel workers
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 253a735b6c1..2fd120028bb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2153,9 +2153,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
{
BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) scan;
HeapScanDesc hscan = (HeapScanDesc) bscan;
- OffsetNumber targoffset;
- Page page;
- ItemId lp;
/*
* Out of range? If so, nothing more to look at on this page
@@ -2170,15 +2167,7 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
return false;
}
- targoffset = hscan->rs_vistuples[hscan->rs_cindex];
- page = BufferGetPage(hscan->rs_cbuf);
- lp = PageGetItemId(page, targoffset);
- Assert(ItemIdIsNormal(lp));
-
- hscan->rs_ctup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
- hscan->rs_ctup.t_len = ItemIdGetLength(lp);
- hscan->rs_ctup.t_tableOid = scan->rs_rd->rd_id;
- ItemPointerSet(&hscan->rs_ctup.t_self, hscan->rs_cblock, targoffset);
+ hscan->rs_ctup = hscan->rs_vistuples[hscan->rs_cindex];
pgstat_count_heap_fetch(scan->rs_rd);
@@ -2456,7 +2445,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
while (start < end)
{
uint32 mid = start + (end - start) / 2;
- OffsetNumber curoffset = hscan->rs_vistuples[mid];
+ OffsetNumber curoffset = hscan->rs_vistuples[mid].t_self.ip_posid;
if (tupoffset == curoffset)
return true;
@@ -2575,7 +2564,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
ItemPointerSet(&tid, block, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
&heapTuple, NULL, true))
- hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ hscan->rs_vistuples[ntup++] = heapTuple;
}
}
else
@@ -2604,7 +2593,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
if (valid)
{
- hscan->rs_vistuples[ntup++] = offnum;
+ hscan->rs_vistuples[ntup++] = loctup;
PredicateLockTID(scan->rs_rd, &loctup.t_self, snapshot,
HeapTupleHeaderGetXmin(loctup.t_data));
}
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index fc64f4343ce..cd6cd4d8d69 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1670,16 +1670,16 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
}
/*
- * Perform HeaptupleSatisfiesMVCC() on each passed in tuple. This is more
+ * Perform HeapTupleSatisfiesMVCC() on each passed in tuple. This is more
* efficient than doing HeapTupleSatisfiesMVCC() one-by-one.
*
- * To be checked tuples are passed via BatchMVCCState->tuples. Each tuple's
- * visibility is stored in batchmvcc->visible[]. In addition,
- * ->vistuples_dense is set to contain the offsets of visible tuples.
+ * Each tuple's visibility is stored in batchmvcc->visible[]. The caller
+ * is responsible for compacting the tuples array to contain only visible
+ * survivors after this function returns.
*
- * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that it
- * avoids a cross-translation-unit function call for each tuple, allows the
- * compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
+ * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that
+ * it avoids a cross-translation-unit function call for each tuple, allows
+ * the compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
* setting hint bits more efficiently (see the one BufferFinishSetHintBits()
* call below).
*
@@ -1689,7 +1689,7 @@ int
HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
int ntups,
BatchMVCCState *batchmvcc,
- OffsetNumber *vistuples_dense)
+ HeapTupleData *tuples)
{
int nvis = 0;
SetHintBitsState state = SHB_INITIAL;
@@ -1699,16 +1699,13 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
for (int i = 0; i < ntups; i++)
{
bool valid;
- HeapTuple tup = &batchmvcc->tuples[i];
+ HeapTuple tup = &tuples[i];
valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer, &state);
batchmvcc->visible[i] = valid;
if (likely(valid))
- {
- vistuples_dense[nvis] = tup->t_self.ip_posid;
nvis++;
- }
}
if (state == SHB_ENABLED)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 2fdc50b865b..09b9566d0ac 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -103,7 +103,7 @@ typedef struct HeapScanDescData
/* these fields only used in page-at-a-time mode and for bitmap scans */
uint32 rs_cindex; /* current tuple's index in vistuples */
uint32 rs_ntuples; /* number of visible tuples on page */
- OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
+ HeapTupleData rs_vistuples[MaxHeapTuplesPerPage]; /* tuples */
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
@@ -483,14 +483,13 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
*/
typedef struct BatchMVCCState
{
- HeapTupleData tuples[MaxHeapTuplesPerPage];
bool visible[MaxHeapTuplesPerPage];
} BatchMVCCState;
extern int HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
int ntups,
BatchMVCCState *batchmvcc,
- OffsetNumber *vistuples_dense);
+ HeapTupleData *tuples);
/*
* To avoid leaking too much knowledge about reorderbuffer implementation
--
2.47.3
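A note on the compaction step mentioned in the HeapTupleSatisfiesMVCCBatch() comment above: the patch leaves compaction to the caller, and the caller's code is not part of this hunk. What follows is only a minimal sketch of a stable in-place compaction driven by a visible[] array. DemoTuple and demo_compact_visible() are hypothetical names made up for illustration; they stand in for HeapTupleData and whatever heapam code actually does this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for HeapTupleData; only the offset is kept. */
typedef struct DemoTuple
{
    int         offset;
} DemoTuple;

/*
 * Move visible tuples to the front of the array, preserving their order.
 * Returns the number of visible tuples; entries past that index are stale.
 */
static int
demo_compact_visible(DemoTuple *tuples, const bool *visible, int ntups)
{
    int         nvis = 0;

    for (int i = 0; i < ntups; i++)
    {
        if (!visible[i])
            continue;
        if (nvis != i)
            tuples[nvis] = tuples[i];   /* struct copy, as in the patch */
        nvis++;
    }
    return nvis;
}
```

Because survivors are copied forward in order, the result stays sorted by offset, which the binary search in SampleHeapTupleVisible() relies on.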
[application/x-patch] v6-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch (6.5K, 4-v6-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch)
From 0d810ceed77e394883ab0e95eafe36051b546040 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 5 Mar 2026 17:42:19 +0900
Subject: [PATCH v6 2/5] Add RowBatch infrastructure for batched tuple
processing
Introduce RowBatch, a data carrier that allows table AMs to deliver
multiple rows per call and the executor to process them as a group.
RowBatch separates three concerns:
- am_payload: opaque, AM-owned storage (e.g. HeapBatch with pinned
page and tuple headers). The AM allocates this in its
scan_begin_batch callback.
- slot: a single TupleTableSlot owned by the scan node, used as a
view onto one row at a time. ops->repoint_slot re-points it at a
row in am_payload when the executor needs tuple data.
- max_rows: executor-set upper bound that the AM respects when
filling a batch.
RowBatch does not own selection/filtering state. Which rows survive
qual evaluation is the executor's concern, tracked separately in
scan node state. This keeps RowBatch focused on the AM-to-executor
data transfer boundary.
RowBatchOps provides a vtable for AM-specific operations; currently
only repoint_slot is defined.
---
src/backend/executor/Makefile | 1 +
src/backend/executor/execRowBatch.c | 54 ++++++++++++++++++
src/backend/executor/meson.build | 1 +
src/include/executor/execRowBatch.h | 88 +++++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 146 insertions(+)
create mode 100644 src/backend/executor/execRowBatch.c
create mode 100644 src/include/executor/execRowBatch.h
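To make the iteration contract of the header added by this patch easier to follow, here is a minimal standalone sketch of the pattern: an AM-provided callback re-points one slot at successive rows of an AM-owned payload. Every Demo* name is a hypothetical stand-in (the real types are RowBatch, RowBatchOps and TupleTableSlot), and the "AM" here is just an int array, so this illustrates the control flow only, not the heapam implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for TupleTableSlot: a view onto one row. */
typedef struct DemoSlot
{
    const int  *row;            /* points into AM-owned storage */
} DemoSlot;

typedef struct DemoBatch DemoBatch;

/* Same shape as RowBatchOps in the patch: one AM-provided callback. */
typedef struct DemoBatchOps
{
    void        (*repoint_slot) (DemoBatch *b, int idx);
} DemoBatchOps;

struct DemoBatch
{
    void       *am_payload;     /* opaque, AM-owned */
    const DemoBatchOps *ops;
    int         max_rows;       /* consumer-set upper bound */
    int         nrows;          /* rows the AM put in */
    int         pos;            /* iteration position */
    DemoSlot   *slot;           /* owned by the consumer, not the batch */
};

/* Toy AM callback: the payload is a plain int array. */
static void
demo_repoint_slot(DemoBatch *b, int idx)
{
    b->slot->row = &((const int *) b->am_payload)[idx];
}

static const DemoBatchOps demo_ops = {demo_repoint_slot};

/* Same logic as RowBatchGetNextSlot(): NULL once the batch is exhausted. */
static DemoSlot *
demo_batch_next(DemoBatch *b)
{
    if (b->pos >= b->nrows)
        return NULL;
    b->ops->repoint_slot(b, b->pos++);
    return b->slot;
}

/* Example consumer: walk the batch through the single slot. */
static int
demo_sum_batch(DemoBatch *b)
{
    int         sum = 0;
    DemoSlot   *s;

    while ((s = demo_batch_next(b)) != NULL)
        sum += *s->row;
    return sum;
}
```

The point of the single re-pointed slot is that the consumer never copies row data; it only ever sees a view that is valid until the next call.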
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..99a00e762f6 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
execAsync.o \
+ execRowBatch.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execRowBatch.c b/src/backend/executor/execRowBatch.c
new file mode 100644
index 00000000000..6a298813bd8
--- /dev/null
+++ b/src/backend/executor/execRowBatch.c
@@ -0,0 +1,54 @@
+/*-------------------------------------------------------------------------
+ *
+ * execRowBatch.c
+ * Helpers for RowBatch
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execRowBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execRowBatch.h"
+
+/*
+ * RowBatchCreate
+ * Allocate and initialize a new RowBatch envelope.
+ */
+RowBatch *
+RowBatchCreate(int max_rows)
+{
+ RowBatch *b;
+
+ Assert(max_rows > 0);
+
+ b = palloc(sizeof(RowBatch));
+ b->am_payload = NULL;
+ b->ops = NULL;
+ b->max_rows = max_rows;
+ b->nrows = 0;
+ b->pos = 0;
+ b->materialized = false;
+ b->slot = NULL;
+
+ return b;
+}
+
+/*
+ * RowBatchReset
+ * Reset an existing RowBatch envelope to empty.
+ */
+void
+RowBatchReset(RowBatch *b, bool drop_slots)
+{
+ Assert(b != NULL);
+
+ b->nrows = 0;
+ b->pos = 0;
+ b->materialized = false;
+ /* b->slot belongs to the owning PlanState node */
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index dc45be0b2ce..fd0bf80bacd 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'execAmi.c',
'execAsync.c',
+ 'execRowBatch.c',
'execCurrent.c',
'execExpr.c',
'execExprInterp.c',
diff --git a/src/include/executor/execRowBatch.h b/src/include/executor/execRowBatch.h
new file mode 100644
index 00000000000..021fdeecc73
--- /dev/null
+++ b/src/include/executor/execRowBatch.h
@@ -0,0 +1,88 @@
+/*-------------------------------------------------------------------------
+ *
+ * execRowBatch.h
+ * Executor batch envelope for passing row batch state upward
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execRowBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECROWBATCH_H
+#define EXECROWBATCH_H
+
+#include "executor/tuptable.h"
+
+typedef struct RowBatchOps RowBatchOps;
+
+/*
+ * RowBatch
+ *
+ * Data carrier from table AM to executor. The AM populates am_payload
+ * and nrows via scan_getnextbatch(). The executor calls ops->repoint_slot
+ * to point slot at a row in the batch when it needs tuple data.
+ *
+ * Selection state (which rows survived qual eval) is owned by the executor,
+ * not the batch.
+ */
+typedef struct RowBatch
+{
+ void *am_payload;
+ const RowBatchOps *ops;
+
+ int max_rows; /* executor-set upper bound */
+ int nrows; /* rows TAM put in */
+ int pos; /* iteration position */
+ bool materialized; /* tuples in slots valid? */
+
+ TupleTableSlot *slot; /* row view */
+} RowBatch;
+
+/*
+ * RowBatchOps -- AM-specific operations on a RowBatch.
+ *
+ * Table AMs set b->ops during scan_begin_batch to provide
+ * callbacks that the executor uses to access batch contents.
+ *
+ * repoint_slot re-points the batch's single slot to the tuple at
+ * index idx within the current batch. The slot remains valid until
+ * the next call or until the batch is exhausted.
+ *
+ * Additional callbacks can be added here as new AMs or executor
+ * features require them.
+ */
+typedef struct RowBatchOps
+{
+ void (*repoint_slot) (RowBatch *b, int idx);
+} RowBatchOps;
+
+/* Create/teardown */
+extern RowBatch *RowBatchCreate(int max_rows);
+extern void RowBatchReset(RowBatch *b, bool drop_slots);
+
+/* Validation */
+static inline bool
+RowBatchIsValid(RowBatch *b)
+{
+ return b != NULL && b->max_rows > 0;
+}
+
+/* Iteration over materialized slots */
+static inline bool
+RowBatchHasMore(RowBatch *b)
+{
+ return b->pos < b->nrows;
+}
+
+static inline TupleTableSlot *
+RowBatchGetNextSlot(RowBatch *b)
+{
+ if (b->pos >= b->nrows)
+ return NULL;
+ b->ops->repoint_slot(b, b->pos++);
+ return b->slot;
+}
+
+#endif /* EXECROWBATCH_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 52f8603a7be..a2b0b1d99d4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2663,6 +2663,8 @@ RoleSpec
RoleSpecType
RoleStmtType
RollupData
+RowBatch
+RowBatchOps
RowCompareExpr
RowExpr
RowIdentityVarInfo
--
2.47.3
[application/x-patch] v6-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch (17.4K, 5-v6-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch)
From c5f58f57cda191408855ab243c05f15580ca5eef Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 23:09:37 +0900
Subject: [PATCH v6 5/5] Add EXPLAIN (BATCHES) option for tuple batching
statistics
Add a BATCHES option to EXPLAIN that reports per-node batch statistics
when a node uses batch mode execution.
For nodes that support batching (currently SeqScan), this shows the
number of batches fetched along with average, minimum, and maximum
rows per batch. Output is supported in both text and non-text formats.
Add regression tests covering text output, JSON format, filtered scans,
LIMIT, and disabled batching.
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/commands/explain.c | 44 +++++++++++
src/backend/commands/explain_state.c | 8 ++
src/backend/executor/execRowBatch.c | 44 ++++++++++-
src/backend/executor/nodeSeqscan.c | 8 +-
src/include/commands/explain_state.h | 1 +
src/include/executor/execRowBatch.h | 22 +++++-
src/include/executor/instrument.h | 1 +
src/test/regress/expected/explain.out | 107 ++++++++++++++++++++++++++
src/test/regress/sql/explain.sql | 59 ++++++++++++++
9 files changed, 291 insertions(+), 3 deletions(-)
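For readers who want to check the arithmetic behind the "Batches / Avg Rows / Max / Min" line, the sketch below mirrors the accumulation that RowBatchRecordStats() and RowBatchAvgRows() perform in this patch, including the INT_MAX sentinel for "no non-empty batch seen yet". The Demo* names are hypothetical; the struct is a standalone copy of the RowBatchStats fields, not the executor code itself.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Standalone copy of the fields in RowBatchStats. */
typedef struct DemoBatchStats
{
    int64_t     batches;        /* total number of batches fetched */
    int64_t     rows;           /* total rows across all batches */
    int         max_rows;       /* largest batch seen */
    int         min_rows;       /* smallest non-empty batch seen */
} DemoBatchStats;

static void
demo_stats_init(DemoBatchStats *s)
{
    s->batches = 0;
    s->rows = 0;
    s->max_rows = 0;
    s->min_rows = INT_MAX;      /* sentinel: nothing recorded yet */
}

/* Called once per fetched batch with the number of rows it held. */
static void
demo_stats_record(DemoBatchStats *s, int rows)
{
    s->batches++;
    s->rows += rows;
    if (rows > s->max_rows)
        s->max_rows = rows;
    if (rows > 0 && rows < s->min_rows)
        s->min_rows = rows;
}

static double
demo_stats_avg(const DemoBatchStats *s)
{
    if (s->batches == 0)
        return 0.0;
    return (double) s->rows / s->batches;
}

/* What EXPLAIN prints for Min: the sentinel is shown as 0. */
static int
demo_stats_min(const DemoBatchStats *s)
{
    return s->min_rows == INT_MAX ? 0 : s->min_rows;
}
```

For example, three batches of 64, 64 and 28 rows give Batches: 3, Avg Rows: 52.0, Max: 64, Min: 28.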
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 296ea8a1ed2..b507fec0dab 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -22,6 +22,7 @@
#include "commands/explain_format.h"
#include "commands/explain_state.h"
#include "commands/prepare.h"
+#include "executor/execRowBatch.h"
#include "foreign/fdwapi.h"
#include "jit/jit.h"
#include "libpq/pqformat.h"
@@ -519,6 +520,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_BUFFERS;
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
+ if (es->batches)
+ instrument_option |= INSTRUMENT_BATCHES;
/*
* We always collect timing for the entire statement, even when node-level
@@ -1372,6 +1375,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
int save_indent = es->indent;
bool haschildren;
bool isdisabled;
+ RowBatch *batch = NULL;
/*
* Prepare per-worker output buffers, if needed. We'll append the data in
@@ -2297,6 +2301,46 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->walusage);
+ /* BATCHES */
+ switch (nodeTag(plan))
+ {
+ case T_SeqScan:
+ batch = castNode(SeqScanState, planstate)->batch;
+ break;
+ default:
+ break;
+ }
+
+ if (es->batches && batch)
+ {
+ RowBatchStats *stats = batch->stats;
+
+ Assert(stats);
+ if (stats->batches > 0)
+ {
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfo(es->str,
+ "Batches: %lld Avg Rows: %.1f Max: %d Min: %d\n",
+ (long long) stats->batches,
+ RowBatchAvgRows(batch), stats->max_rows,
+ stats->min_rows == INT_MAX ? 0 :
+ stats->min_rows);
+ }
+ else
+ {
+ ExplainPropertyInteger("Batches", NULL, stats->batches, es);
+ ExplainPropertyFloat("Average Batch Rows", NULL,
+ RowBatchAvgRows(batch), 1, es);
+ ExplainPropertyInteger("Max Batch Rows", NULL, stats->max_rows, es);
+ ExplainPropertyInteger("Min Batch Rows", NULL,
+ stats->min_rows == INT_MAX ? 0 :
+ stats->min_rows, es);
+ }
+ }
+ }
+
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
{
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index 77f59b8e500..28022a171cd 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -159,6 +159,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
"EXPLAIN", opt->defname, p),
parser_errposition(pstate, opt->location)));
}
+ else if (strcmp(opt->defname, "batches") == 0)
+ es->batches = defGetBoolean(opt);
else if (!ApplyExtensionExplainOption(es, opt, pstate))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -198,6 +200,12 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
errmsg("%s options %s and %s cannot be used together",
"EXPLAIN", "ANALYZE", "GENERIC_PLAN")));
+ /* check that BATCHES is used with EXPLAIN ANALYZE */
+ if (es->batches && !es->analyze)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("EXPLAIN option %s requires ANALYZE", "BATCHES")));
+
/* if the summary was not set explicitly, set default value */
es->summary = (summary_set) ? es->summary : es->analyze;
diff --git a/src/backend/executor/execRowBatch.c b/src/backend/executor/execRowBatch.c
index 6a298813bd8..6ef54deca04 100644
--- a/src/backend/executor/execRowBatch.c
+++ b/src/backend/executor/execRowBatch.c
@@ -20,7 +20,7 @@
* Allocate and initialize a new RowBatch envelope.
*/
RowBatch *
-RowBatchCreate(int max_rows)
+RowBatchCreate(int max_rows, bool track_stats)
{
RowBatch *b;
@@ -35,6 +35,20 @@ RowBatchCreate(int max_rows)
b->materialized = false;
b->slot = NULL;
+ if (track_stats)
+ {
+ RowBatchStats *stats = palloc_object(RowBatchStats);
+
+ stats->batches = 0;
+ stats->rows = 0;
+ stats->max_rows = 0;
+ stats->min_rows = INT_MAX;
+
+ b->stats = stats;
+ }
+ else
+ b->stats = NULL;
+
return b;
}
@@ -52,3 +66,31 @@ RowBatchReset(RowBatch *b, bool drop_slots)
b->materialized = false;
/* b->slot belongs to the owning PlanState node */
}
+
+void
+RowBatchRecordStats(RowBatch *b, int rows)
+{
+ RowBatchStats *stats = b->stats;
+
+ if (stats == NULL)
+ return;
+
+ stats->batches++;
+ stats->rows += rows;
+ if (rows > stats->max_rows)
+ stats->max_rows = rows;
+ if (rows < stats->min_rows && rows > 0)
+ stats->min_rows = rows;
+}
+
+double
+RowBatchAvgRows(RowBatch *b)
+{
+ RowBatchStats *stats = b->stats;
+
+ Assert(stats != NULL);
+ if (stats->batches == 0)
+ return 0.0;
+
+ return (double) stats->rows / stats->batches;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index b41d18b67e3..c1527be946a 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -245,8 +245,12 @@ SeqScanCanUseBatching(SeqScanState *scanstate, int eflags)
static void
SeqScanInitBatching(SeqScanState *scanstate)
{
- RowBatch *batch = RowBatchCreate(MaxHeapTuplesPerPage);
+ RowBatch *batch;
+ EState *estate = scanstate->ss.ps.state;
+ bool track_stats = estate->es_instrument &&
+ (estate->es_instrument & INSTRUMENT_BATCHES);
+ batch = RowBatchCreate(MaxHeapTuplesPerPage, track_stats);
batch->slot = scanstate->ss.ss_ScanTupleSlot;
scanstate->batch = batch;
@@ -347,6 +351,8 @@ SeqNextBatch(SeqScanState *node)
if (!table_scan_getnextbatch(scandesc, b, direction))
return false;
+ RowBatchRecordStats(b, b->nrows);
+
return true;
}
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 5a48bc6fbb1..579ca4cfa20 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -56,6 +56,7 @@ typedef struct ExplainState
bool memory; /* print planner's memory usage information */
bool settings; /* print modified settings */
bool generic; /* generate a generic plan */
+ bool batches; /* print batch statistics */
ExplainSerializeOption serialize; /* serialize the query's output? */
ExplainFormat format; /* output format */
/* state for output formatting --- not reset for each new plan tree */
diff --git a/src/include/executor/execRowBatch.h b/src/include/executor/execRowBatch.h
index 021fdeecc73..ad0b4763b70 100644
--- a/src/include/executor/execRowBatch.h
+++ b/src/include/executor/execRowBatch.h
@@ -13,9 +13,12 @@
#ifndef EXECROWBATCH_H
#define EXECROWBATCH_H
+#include <limits.h>
+
#include "executor/tuptable.h"
typedef struct RowBatchOps RowBatchOps;
+typedef struct RowBatchStats RowBatchStats;
/*
* RowBatch
@@ -38,6 +41,9 @@ typedef struct RowBatch
bool materialized; /* tuples in slots valid? */
TupleTableSlot *slot; /* row view */
+
+ RowBatchStats *stats; /* NULL if instrumentation stats
+ * are not requested */
} RowBatch;
/*
@@ -58,8 +64,17 @@ typedef struct RowBatchOps
void (*repoint_slot) (RowBatch *b, int idx);
} RowBatchOps;
+/* Instrumentation stats populated for EXPLAIN ANALYZE BATCHES */
+typedef struct RowBatchStats
+{
+ int64 batches; /* total number of batches fetched */
+ int64 rows; /* total tuples across all batches */
+ int max_rows; /* max rows in any single batch */
+ int min_rows; /* min rows in any single batch (non-zero) */
+} RowBatchStats;
+
/* Create/teardown */
-extern RowBatch *RowBatchCreate(int max_rows);
+extern RowBatch *RowBatchCreate(int max_rows, bool track_stats);
extern void RowBatchReset(RowBatch *b, bool drop_slots);
/* Validation */
@@ -85,4 +100,9 @@ RowBatchGetNextSlot(RowBatch *b)
return b->slot;
}
+/* Batching stats */
+
+extern void RowBatchRecordStats(RowBatch *b, int rows);
+extern double RowBatchAvgRows(RowBatch *b);
+
#endif /* EXECROWBATCH_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..bee69b4ac8f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -64,6 +64,7 @@ typedef enum InstrumentOption
INSTRUMENT_BUFFERS = 1 << 1, /* needs buffer usage */
INSTRUMENT_ROWS = 1 << 2, /* needs row count */
INSTRUMENT_WAL = 1 << 3, /* needs WAL usage */
+ INSTRUMENT_BATCHES = 1 << 4, /* needs batches */
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..950de5a9d78 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,110 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- Test BATCHES option
+set executor_batch_rows = 64;
+create temp table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+-- BATCHES without ANALYZE should error
+explain (batches, costs off) select * from batch_test;
+ERROR: EXPLAIN option BATCHES requires ANALYZE
+-- BATCHES without ANALYZE but with other options
+explain (batches, buffers off, costs off) select * from batch_test;
+ERROR: EXPLAIN option BATCHES requires ANALYZE
+-- Basic: verify batch stats line appears in text format
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+-- With filter: batch line still appears
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Filter: (a > N)
+ Rows Removed by Filter: N
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With non-batchable qual (OR): batching still active but
+-- batch qual falls back to per-tuple ExecQual
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000 or b is null');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Filter: ((a > N) OR (b IS NULL))
+ Rows Removed by Filter: N
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With LIMIT: batch stats appear on child Seq Scan node
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+ explain_filter
+----------------------------------------------------------------------
+ Limit (actual time=N.N..N.N rows=N.N loops=N)
+ -> Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(5 rows)
+
+-- Verify batch stats keys present in JSON output
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches,
+ j #> '{0,Plan}' ? 'Average Batch Rows' as has_avg,
+ j #> '{0,Plan}' ? 'Max Batch Rows' as has_max,
+ j #> '{0,Plan}' ? 'Min Batch Rows' as has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+ has_batches | has_avg | has_max | has_min
+-------------+---------+---------+---------
+ t | t | t | t
+(1 row)
+
+-- With LIMIT: batch stats keys on child node in JSON
+select
+ j #> '{0,Plan,Plans,0}' ? 'Batches' as child_has_batches,
+ j #> '{0,Plan,Plans,0}' ? 'Average Batch Rows' as child_has_avg,
+ j #> '{0,Plan,Plans,0}' ? 'Max Batch Rows' as child_has_max,
+ j #> '{0,Plan,Plans,0}' ? 'Min Batch Rows' as child_has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test limit 100'
+) as j;
+ child_has_batches | child_has_avg | child_has_max | child_has_min
+-------------------+---------------+---------------+---------------
+ t | t | t | t
+(1 row)
+
+-- Batching disabled: no batch stats in text output
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+-- Batching disabled: no batch keys in JSON
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+ has_batches
+-------------
+ f
+(1 row)
+
+reset executor_batch_rows;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..55acb9058ce 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,62 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- Test BATCHES option
+set executor_batch_rows = 64;
+
+create temp table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+
+-- BATCHES without ANALYZE should error
+explain (batches, costs off) select * from batch_test;
+
+-- BATCHES without ANALYZE but with other options
+explain (batches, buffers off, costs off) select * from batch_test;
+
+-- Basic: verify batch stats line appears in text format
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- With filter: batch line still appears
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+
+-- With non-batchable qual (OR): batching still active but
+-- batch qual falls back to per-tuple ExecQual
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000 or b is null');
+
+-- With LIMIT: batch stats appear on child Seq Scan node
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+
+-- Verify batch stats keys present in JSON output
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches,
+ j #> '{0,Plan}' ? 'Average Batch Rows' as has_avg,
+ j #> '{0,Plan}' ? 'Max Batch Rows' as has_max,
+ j #> '{0,Plan}' ? 'Min Batch Rows' as has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+
+-- With LIMIT: batch stats keys on child node in JSON
+select
+ j #> '{0,Plan,Plans,0}' ? 'Batches' as child_has_batches,
+ j #> '{0,Plan,Plans,0}' ? 'Average Batch Rows' as child_has_avg,
+ j #> '{0,Plan,Plans,0}' ? 'Max Batch Rows' as child_has_max,
+ j #> '{0,Plan,Plans,0}' ? 'Min Batch Rows' as child_has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test limit 100'
+) as j;
+
+-- Batching disabled: no batch stats in text output
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- Batching disabled: no batch keys in JSON
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+
+reset executor_batch_rows;
--
2.47.3
[application/x-patch] v6-0004-SeqScan-add-batch-driven-variants-returning-slots.patch (12.6K, 6-v6-0004-SeqScan-add-batch-driven-variants-returning-slots.patch)
From 074facc85aae66ebab49b08eadf9957a6dca778d Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 5 Mar 2026 11:28:16 +0900
Subject: [PATCH v6 4/5] SeqScan: add batch-driven variants returning slots
Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per-tuple AM crossings without
changing the node interface seen by parents.
SeqScanState gains a RowBatch pointer that holds the current batch
when batching is active. Batch state is localized to SeqScanState
-- no changes to PlanState or ScanState.
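Add executor_batch_rows GUC (DEVELOPER_OPTIONS, default 64) to
control the maximum batch size. Setting it to 0 (or 1) disables
batching, since the batch path requires executor_batch_rows > 1.
XXX the value is currently ignored as a size limit when reading from
heapam tables; SeqScanInitBatching() creates the batch with
MaxHeapTuplesPerPage.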
Wire up runtime selection in ExecInitSeqScan via
SeqScanCanUseBatching(). When executor_batch_rows > 1, EPQ is
inactive, the scan is forward-only, and the relation's AM supports
batching, ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used with zero overhead.
Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/executor/nodeSeqscan.c | 276 ++++++++++++++++++++++
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_parameters.dat | 9 +
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 2 +
5 files changed, 291 insertions(+)
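The "fetch in batches, return one row at a time" shape described in the commit message can be sketched in isolation as below. This is not the patch's SeqScanBatchSlot(); it is a toy model under stated assumptions: the table is an int array, DemoScan, demo_fetch_batch(), demo_next_row() and demo_is_even() are hypothetical names, and the qual is a plain function pointer rather than an ExprState. It only shows how the refill loop keeps the single-row interface while crossing into the "AM" once per batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy scan state: a table of ints read in batches of up to max_rows. */
typedef struct DemoScan
{
    const int  *table;          /* all rows of the toy table */
    int         ntable;         /* number of rows in the table */
    int         next;           /* next unread row in the table */
    int         max_rows;       /* batch size cap */
    const int  *batch;          /* start of the current batch */
    int         nrows;          /* rows in the current batch */
    int         pos;            /* iteration position in the batch */
    int         am_calls;       /* how many batch fetches were made */
} DemoScan;

typedef bool (*DemoQual) (int row);

/* Stand-in for the AM's "get next batch" call; false at end of scan. */
static bool
demo_fetch_batch(DemoScan *scan)
{
    int         remaining = scan->ntable - scan->next;

    if (remaining <= 0)
        return false;
    scan->batch = scan->table + scan->next;
    scan->nrows = remaining < scan->max_rows ? remaining : scan->max_rows;
    scan->pos = 0;
    scan->next += scan->nrows;
    scan->am_calls++;
    return true;
}

/*
 * Return the next row passing qual (or any row if qual is NULL),
 * refilling the batch whenever the current one is exhausted.
 */
static const int *
demo_next_row(DemoScan *scan, DemoQual qual)
{
    for (;;)
    {
        const int  *row;

        if (scan->pos >= scan->nrows && !demo_fetch_batch(scan))
            return NULL;        /* end of scan */
        row = &scan->batch[scan->pos++];
        if (qual == NULL || qual(*row))
            return row;
    }
}

/* Example qual used only for illustration. */
static bool
demo_is_even(int row)
{
    return row % 2 == 0;
}
```

With 10 rows and a cap of 4, the caller still sees 10 single-row returns, but only 3 fetches reach the source; that reduction in per-row crossings is the effect the commit message is after.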
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 8f219f60a93..b41d18b67e3 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -29,12 +29,17 @@
#include "access/relscan.h"
#include "access/tableam.h"
+#include "executor/execRowBatch.h"
#include "executor/execScan.h"
#include "executor/executor.h"
#include "executor/nodeSeqscan.h"
#include "utils/rel.h"
static TupleTableSlot *SeqNext(SeqScanState *node);
+static TupleTableSlot *ExecSeqScanBatchSlot(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithQual(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithProject(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithQualProject(PlanState *pstate);
/* ----------------------------------------------------------------
* Scan Support
@@ -203,6 +208,271 @@ ExecSeqScanEPQ(PlanState *pstate)
(ExecScanRecheckMtd) SeqRecheck);
}
+/* ----------------------------------------------------------------
+ * Batch Support
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * SeqScanCanUseBatching
+ * Check whether this SeqScan can use batch mode execution.
+ *
+ * Batching requires: the GUC is enabled, no EPQ recheck is active, the scan
+ * is forward-only, and the table AM supports batching with the current
+ * snapshot (see table_supports_batching()).
+ */
+static bool
+SeqScanCanUseBatching(SeqScanState *scanstate, int eflags)
+{
+ Relation relation = scanstate->ss.ss_currentRelation;
+
+ return executor_batch_rows > 1 &&
+ relation &&
+ table_supports_batching(relation,
+ scanstate->ss.ps.state->es_snapshot) &&
+ !(eflags & EXEC_FLAG_BACKWARD) &&
+ scanstate->ss.ps.state->es_epq_active == NULL;
+}
+
+/*
+ * SeqScanInitBatching
+ * Set up batch execution state and select the appropriate
+ * ExecProcNode variant for batch mode.
+ *
+ * Called from ExecInitSeqScan when SeqScanCanUseBatching returns true.
+ * Overwrites the ExecProcNode pointer set by the non-batch path.
+ */
+static void
+SeqScanInitBatching(SeqScanState *scanstate)
+{
+ RowBatch *batch = RowBatchCreate(MaxHeapTuplesPerPage);
+
+ batch->slot = scanstate->ss.ss_ScanTupleSlot;
+ scanstate->batch = batch;
+
+ /* Choose batch variant */
+ if (scanstate->ss.ps.qual == NULL)
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+ else
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+ }
+ else
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ else
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+ }
+}
+
+/*
+ * SeqScanResetBatching
+ * Reset or tear down batch execution state.
+ *
+ * When drop is false (rescan), resets the RowBatch and releases any
+ * AM-held resources like buffer pins, but keeps allocations for reuse.
+ * When drop is true (end of node), frees everything.
+ */
+static void
+SeqScanResetBatching(SeqScanState *scanstate, bool drop)
+{
+ RowBatch *b = scanstate->batch;
+
+ if (b)
+ {
+ RowBatchReset(b, drop);
+ if (b->am_payload)
+ {
+ if (drop)
+ {
+ table_scan_end_batch(scanstate->ss.ss_currentScanDesc, b);
+ b->am_payload = NULL;
+ }
+ else
+ table_scan_reset_batch(scanstate->ss.ss_currentScanDesc, b);
+ }
+ if (drop)
+ pfree(b);
+ }
+}
+
+/*
+ * SeqNextBatch
+ * Fetch the next batch of tuples from the table AM.
+ *
+ * Lazily initializes the scan descriptor and AM batch state on first
+ * call. Returns false at end of scan.
+ */
+static bool
+SeqNextBatch(SeqScanState *node)
+{
+ TableScanDesc scandesc;
+ EState *estate;
+ ScanDirection direction;
+ RowBatch *b = node->batch;
+
+ Assert(b != NULL);
+
+ /*
+ * get information from the estate and scan state
+ */
+ scandesc = node->ss.ss_currentScanDesc;
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ Assert(ScanDirectionIsForward(direction));
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the scan is not parallel, or if we're serially
+ * executing a scan that was planned to be parallel.
+ */
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL);
+ node->ss.ss_currentScanDesc = scandesc;
+ }
+
+ /* Lazily create the AM batch payload. */
+ if (b->am_payload == NULL)
+ {
+ const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+ Assert(tam && tam->scan_begin_batch);
+ table_scan_begin_batch(scandesc, b);
+ }
+
+ if (!table_scan_getnextbatch(scandesc, b, direction))
+ return false;
+
+ return true;
+}
+
+/*
+ * SeqScanBatchSlot
+ * Core loop for batch-driven SeqScan variants.
+ *
+ * Internally fetches tuples in batches from the table AM, but returns
+ * one slot at a time to preserve the single-slot interface expected by
+ * parent nodes. When the current batch is exhausted, fetches and
+ * materializes the next one.
+ *
+ * qual and projInfo are passed explicitly so the compiler can eliminate
+ * dead branches when inlined into the typed wrapper functions (e.g.
+ * ExecSeqScanBatchSlot passes NULL for both).
+ *
+ * EPQ is not supported in the batch path; asserted at entry.
+ */
+static inline TupleTableSlot *
+SeqScanBatchSlot(SeqScanState *node,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ss.ps.ps_ExprContext;
+ RowBatch *b = node->batch;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ss.ps.state->es_epq_active == NULL);
+ Assert(RowBatchIsValid(b));
+
+ for (;;)
+ {
+ TupleTableSlot *in;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next input slot from current batch, or refill */
+ if (!RowBatchHasMore(b))
+ {
+ if (!SeqNextBatch(node))
+ return NULL;
+ }
+
+ in = RowBatchGetNextSlot(b);
+ Assert(in);
+
+ /* No qual, no projection: direct return */
+ if (qual == NULL && projInfo == NULL)
+ return in;
+
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ /* Check qual if present */
+ if (qual != NULL && !ExecQual(qual, econtext))
+ {
+ InstrCountFiltered1(node, 1);
+ continue;
+ }
+
+ /* Project if needed, otherwise return scan tuple directly */
+ if (projInfo != NULL)
+ return ExecProject(projInfo);
+
+ return in;
+ }
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return SeqScanBatchSlot(node, NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in SeqScanBatchSlot().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return SeqScanBatchSlot(node, pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() but when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return SeqScanBatchSlot(node, NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() but when qual evaluation and
+ * projection are required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return SeqScanBatchSlot(node, pstate->qual, pstate->ps_ProjInfo);
+}
+
/* ----------------------------------------------------------------
* ExecInitSeqScan
* ----------------------------------------------------------------
@@ -281,6 +551,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
+ if (SeqScanCanUseBatching(scanstate, eflags))
+ SeqScanInitBatching(scanstate);
+
return scanstate;
}
@@ -300,6 +573,8 @@ ExecEndSeqScan(SeqScanState *node)
*/
scanDesc = node->ss.ss_currentScanDesc;
+ SeqScanResetBatching(node, true);
+
/*
* close heap scan
*/
@@ -329,6 +604,7 @@ ExecReScanSeqScan(SeqScanState *node)
table_rescan(scan, /* scan desc */
NULL); /* new scan keys */
+ SeqScanResetBatching(node, false);
ExecScanReScan((ScanState *) node);
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 36ad708b360..535e29d7823 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int notify_buffers = 16;
int serializable_buffers = 32;
int subtransaction_buffers = 0;
int transaction_buffers = 0;
+
+/* executor batching */
+int executor_batch_rows = 64;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index a5a0edf2534..e1eadcf643d 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1004,6 +1004,15 @@
boot_val => 'true',
},
+{ name => 'executor_batch_rows', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+ short_desc => 'Number of rows to include in batches during execution.',
+ flags => 'GUC_NOT_IN_SAMPLE',
+ variable => 'executor_batch_rows',
+ boot_val => '64',
+ min => '0',
+ max => '1024',
+},
+
{ name => 'exit_on_error', type => 'bool', context => 'PGC_USERSET', group => 'ERROR_HANDLING_OPTIONS',
short_desc => 'Terminate session on any error.',
variable => 'ExitOnAnyError',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f16f35659b9..ad406bf53f3 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -288,6 +288,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
extern PGDLLIMPORT int VacuumCostBalance;
extern PGDLLIMPORT bool VacuumCostActive;
+extern PGDLLIMPORT int executor_batch_rows;
/* in utils/misc/stack_depth.c */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0716c5a9aed..6f038cfcc60 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -67,6 +67,7 @@ typedef struct TupleTableSlot TupleTableSlot;
typedef struct TupleTableSlotOps TupleTableSlotOps;
typedef struct WalUsage WalUsage;
typedef struct WorkerInstrumentation WorkerInstrumentation;
+typedef struct RowBatch RowBatch;
/* ----------------
@@ -1644,6 +1645,7 @@ typedef struct SeqScanState
{
ScanState ss; /* its first field is NodeTag */
Size pscan_len; /* size of parallel heap scan descriptor */
+ RowBatch *batch; /* NULL if batching disabled */
} SeqScanState;
/* ----------------
--
2.47.3
* Re: Batching in executor
@ 2026-04-06 12:02 Amit Langote <[email protected]>
parent: Amit Langote <[email protected]>
0 siblings, 0 replies; 29+ messages in thread
From: Amit Langote @ 2026-04-06 12:02 UTC (permalink / raw)
To: Junwang Zhao <[email protected]>; +Cc: cca5507 <[email protected]>; Daniil Davydov <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>
On Tue, Mar 24, 2026 at 9:59 AM Amit Langote <[email protected]> wrote:
> Here is a significantly revised version of the patch series. A lot has
> changed since the January submission, so I want to summarize the
> design changes before getting into the patches. I think it does
> address the points in the two reviews that landed since v5 but maybe a
> bunch of points became moot after my rewrite of the relevant portions
> (thanks Junwang and ChangAo for the review in any case).
>
> At this point it might be better to think of this as targeting v20,
> except that if there is review bandwidth in the remaining two weeks
> before the v19 feature freeze, the rs_vistuples[] change described
> below as a standalone improvement to the existing pagemode scan path
> could be considered for v19, though that too is an optimistic
> scenario.
>
> It is also worth noting that Andres identified a number of
> inefficiencies in the existing scan path in:
>
> Re: unnecessary executor overheads around seqscans
> https://postgr.es/m/xzflwwjtwxin3dxziyblrnygy3gfygo5dsuw6ltcoha73ecmnf%40nh6nonzta7kw
>
> that are worth fixing independently of batching. Some of those fixes
> may be better pursued first, both because they benefit all scan paths
> and because they would make batching's gains more honest.
>
> Separately, after looking at the previous version, Andres pointed out
> off-list two fundamental issues with the patch's design:
>
> * The heapam implementation (in a version of the patch I didn't post
> to the thread) duplicated heap_prepare_pagescan() logic in a separate
> batch-specific code path, which is not acceptable as changes should
> benefit the existing slot interface too. Code duplication also hurts
> future maintainability. The v5 version of that code is not great in
> that respect either; it instead duplicated heapgettup_pagemode() to
> slap batching on it.
>
> * Allocating executor_batch_rows slots on the executor side to receive
> rows from the AM adds significant overhead for slot initialization and
> management, and for non-row-organized AMs that do not produce
> individual rows at all, those slots would never be meaningfully
> populated.
>
> In any case, he just wasn't a fan of the slot-array approach the
> moment I mentioned it. The previous version had two slot arrays,
> inslots and outslots, of TTSOpsHeapTuple type (not
> TTSOpsBufferHeapTuple because buffer pins were managed by the batch
> code, which has its own modularity/correctness issues), populated via
> a materialize_all callback. A batch qual evaluator would copy
> qualifying tuples into outslots, with an activeslots pointer switching
> between the two depending on whether batch qual evaluation was used.
>
> The new design addresses both issues and differs from the previous
> version in several other ways:
>
> * Single slot instead of slot arrays: there is a single
> TupleTableSlot, reusing the scan node's ss_ScanTupleSlot whose type
> was already determined by the AM via table_slot_callbacks(). The slot
> is re-pointed to each HeapTuple in the current buffer page via a new
> repoint_slot AM callback, with no materialization or copying. Tuples
> are returned one by one from the executor's perspective, but the AM
> serves them in page-sized batches from pre-built HeapTupleData
> descriptors in rs_vistuples[], avoiding repeated descent into heapam
> per tuple. This is heapam's implementation of the batch interface;
> there is no intention to force other AMs into the same row-oriented
> model.
>
> * Batch qual evaluator not included: with the single-slot model,
> quals are evaluated per tuple via the existing ExecQual path after
> each repoint_slot call. A natural next step would be a new opcode
> (EEOP) that calls repoint_slot() internally within expression
> evaluation, allowing ExecQual to advance through multiple tuples from
> the same batch without returning to the scan node each time, with qual
> results accumulated in a bitmask in ExprState. The details of that
> will be worked out in a follow-on series.
>
> * heapgettup_pagemode_batch() gone: patch 0001 (described below) makes
> HeapScanDesc store full HeapTupleData entries in rs_vistuples[], which
> allows heap_getnextbatch() to simply advance a slice pointer into that
> array without any additional copying or re-entering heap code, making
> a separate batch-specific scan function unnecessary.
>
> * TupleBatch renamed to RowBatch: "row batch" is more natural
> terminology for this concept and also consistent with how similar
> abstractions are named in columnar and OLAP systems.
>
> * AM callbacks now take RowBatch directly: previously
> heap_getnextbatch() returned a void pointer that the executor would
> store into RowBatch.am_payload, because only the executor knew the
> internals of RowBatch. Now the AM receives RowBatch directly as a
> parameter and can populate it without the executor acting as an
> intermediary. This is also why RowBatch is introduced in its own
> patch ahead of the AM API addition, so the struct definition is
> available to both sides.
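
To make the single-slot model above concrete, here is a minimal sketch. It is not code from the patch: DemoTuple, DemoSlot, DemoBatch, DemoScan, and both functions are hypothetical stand-ins for HeapTupleData, TupleTableSlot, RowBatch, HeapScanDesc, and the getnextbatch/repoint_slot callbacks. It only illustrates the idea that the batch is a slice into an AM-owned array and that the one slot is re-pointed per row without copying.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the patch's definitions. */
typedef struct DemoTuple { const char *data; unsigned len; } DemoTuple;
typedef struct DemoSlot { const DemoTuple *tuple; } DemoSlot;

typedef struct DemoScan
{
    const DemoTuple *tuples;    /* AM-owned array of pre-built headers */
    int         total;          /* number of entries in tuples[] */
    int         page_size;      /* max rows handed out per batch */
    int         next;           /* index of first row not yet handed out */
} DemoScan;

typedef struct DemoBatch
{
    const DemoTuple *rows;      /* slice into scan->tuples, never a copy */
    int         nrows;          /* rows in the current slice */
    int         pos;            /* next row to hand out */
    DemoSlot   *slot;           /* the single slot that gets re-pointed */
} DemoBatch;

/* Advance the slice to the next "page"; returns 0 at end of scan. */
int
demo_getnextbatch(DemoScan *scan, DemoBatch *b)
{
    int         remaining = scan->total - scan->next;

    if (remaining <= 0)
        return 0;
    b->rows = scan->tuples + scan->next;
    b->nrows = remaining < scan->page_size ? remaining : scan->page_size;
    b->pos = 0;
    scan->next += b->nrows;
    return 1;
}

/* Re-point the single slot at the next row, refilling when exhausted. */
DemoSlot *
demo_next_slot(DemoScan *scan, DemoBatch *b)
{
    if (b->pos >= b->nrows && !demo_getnextbatch(scan, b))
        return NULL;
    b->slot->tuple = &b->rows[b->pos++];
    return b->slot;
}
```

The caller still sees one slot at a time, which is why parent nodes need no changes in this model; only the refill step crosses into the AM.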
>
> Patch 0001 changes rs_vistuples[] to store full HeapTupleData entries
> instead of OffsetNumbers, as a standalone improvement to the existing
> pagemode scan path. Measured on a pg_prewarm'd (also vacuum freeze'd
> in the all-visible case) table with 1M/5M/10M rows:
>
> query                          all-visible      not-all-visible
> count(*)                       -0.2% to +0.9%   -0.4% to +0.5%
> count(*) WHERE id % 10 = 0     -1.1% to +3.4%   +0.2% to +1.5%
> SELECT * LIMIT 1 OFFSET N      -2.2% to -0.6%   -0.9% to +6.6%
> SELECT * WHERE id%10=0 LIMIT   -0.8% to +3.9%   +0.9% to +9.6%
>
> No significant regression on either page type. The structural
> improvement is most visible on not-all-visible pages where
> HeapTupleSatisfiesMVCCBatch() already reads every tuple header during
> visibility checks, so persisting the result into rs_vistuples[]
> eliminates the downstream re-read (in heapgettup_pagemode()) with no
> measurable overhead. That said, these numbers are somewhat noisy on
> my machine. Results on other machines would be welcome.
>
> Patches 0002-0005 add the RowBatch infrastructure, the batch AM API
> and heapam implementation including seqscan variants that use the new
> scan_getnextbatch() API, and EXPLAIN (ANALYZE, BATCHES) support,
> respectively. With batching enabled (executor_batch_rows=300,
> ~MaxHeapTuplesPerPage):
>
> query                          all-visible    not-all-visible
> count(*)                       +11 to +15%    +9 to +13%
> count(*) WHERE id % 10 = 0     +6 to +11%     +10 to +14%
> SELECT * LIMIT 1 OFFSET N      +16 to +19%    +16 to +22%
> SELECT * WHERE id%10=0 LIMIT   +8 to +10%     +8 to +13%
>
> With executor_batch_rows=0, results are within noise of master across
> all query types and sizes, confirming no regression from the
> infrastructure changes themselves. The not-all-visible results tend
> to show slightly higher gains than the all-visible case. This is
> likely because the existing heapam code is more optimized for the
> all-visible path, so the not-all-visible path, which goes through
> HeapTupleSatisfiesMVCCBatch() for per-tuple visibility checks, has
> more headroom that batching can exploit.
>
> Setting aside the current series for a moment, there are some broader
> design questions worth raising while we have attention on this area.
> Some of these echo points Tomas raised in his first reply on this
> thread. I am reiterating them deliberately: either I have not managed
> to fully address them on my own, or I simply didn't need to for the
> TAM-to-scan-node batching, and I think they would benefit from wider
> input rather than just my own iteration.
>
> We should also start thinking about other ways the executor can
> consume batch rows, not always assuming they are presented as
> HeapTupleData. For instance, an AM could expose decoded column arrays
> directly to operators that can consume them, bypassing slot-based
> deform entirely, or a columnar AM could implement scan_getnextbatch by
> decoding column strips directly into the batch without going through
> per-tuple HeapTupleData at all. Feedback on whether the current
> RowBatch design and the choices made in the scan_getnextbatch and
> RowBatchOps API make that sort of thing harder than it needs to be
> would be appreciated. For example, heapam's implementation of
> scan_getnextbatch uses a single TTSOpsBufferHeapTuple slot re-pointed
> to HeapTupleData entries one at a time via repoint_slot in
> RowBatchHeapOps. That works for heapam but a columnar AM could
> implement scan_getnextbatch to decode column strips directly into
> arrays in the batch, with no per-row repoint step needed at all. Any
> adjustments that would make RowBatch more AM-agnostic are worth
> discussing now before the design hardens.
>
> There are also broader open questions about how far the batch model
> can extend beyond the scan node. Qual pushdown into the AM has been
> discussed in nearby threads and would be one way to allow expression
> evaluation to happen before data reaches the executor proper, though
> that is a separate effort. For the purposes of this series, expression
> evaluation still happens in the executor after scan_getnextbatch
> returns. If the scan node does not project, the buffer heap slot is
> passed directly to the parent node, which calls slot callbacks to
> deform as needed. But once a node above projects, aggregates, or
> joins, the notion of a page-sized batch from a single AM loses its
> meaning and virtual slots take over. Whether RowBatch is usable or
> meaningful beyond the scan/TAM boundary in any form, and whether the
> core executor will ever have non-HeapTupleData batch consumption paths
> or leave that entirely to extensions, are open questions worth
> discussing.
>
> For RowBatch to eventually play the role that TupleTableSlot plays for
> row-at-a-time execution, something inside it would need to serve as
> the common currency for batch data, analogous to TupleTableSlot's
> datum/isnull arrays. Column arrays are the obvious direction, but even
> that leaves open the question of representation. PostgreSQL's Datum is
> a pointer-sized abstraction that boxes everything, whereas vectorized
> systems use typed packed arrays of native types with validity
> bitmasks, which is a significant part of why tight vectorized loops
> are fast there. Whether column arrays of Datum would be good enough,
> or whether going further toward typed packed arrays would be necessary
> to get meaningful vectorization, is a deeper design question that this
> series deliberately does not try to answer.
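
For readers less familiar with the "typed packed arrays with validity bitmasks" representation mentioned above, a minimal sketch follows. Nothing here is proposed API: DemoInt32Column and demo_sum_int32() are hypothetical names, and the bit convention (bit i set means row i is non-NULL) is an assumption borrowed from common columnar formats.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical typed column; not a structure from this patch series. */
typedef struct DemoInt32Column
{
    const int32_t *values;      /* packed native values, one per row */
    const uint8_t *valid;       /* bit i set => row i is non-NULL */
    size_t      nrows;
} DemoInt32Column;

/*
 * Sum all non-NULL values.  The loop body has no branch on the NULL
 * flag and no per-value unboxing, which is the property that makes
 * such loops friendly to compiler auto-vectorization.
 */
int64_t
demo_sum_int32(const DemoInt32Column *col)
{
    int64_t     sum = 0;

    for (size_t i = 0; i < col->nrows; i++)
    {
        int64_t     is_valid = (col->valid[i >> 3] >> (i & 7)) & 1;

        sum += is_valid * col->values[i];
    }
    return sum;
}
```

An array of Datum plus a separate bool isnull[] array could be walked in a similar loop for pass-by-value types, so the open question is mostly about element width and what happens with pass-by-reference types.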
>
> Even though the focus is on getting batching working at the scan/TAM
> boundary first, thoughts on any of these points would be welcome.
Rebased.
--
Thanks, Amit Langote
Attachments:
[application/octet-stream] v7-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch (12.8K, 2-v7-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch)
From 1557236686140c29be98dc461e97f8df4a0f1a73 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 12 Mar 2026 09:18:04 +0900
Subject: [PATCH v7 1/5] heapam: store full HeapTupleData in rs_vistuples[] for
pagemode scans
page_collect_tuples() builds full HeapTupleData headers for every
visible tuple on a page -- t_data, t_len, t_self, t_tableOid -- but
previously discarded them immediately after writing just the OffsetNumber
of each survivor into rs_vistuples[]. heapgettup_pagemode() then
re-derived those same values on every call from the saved OffsetNumber
via PageGetItemId() and PageGetItem().
Change rs_vistuples[] element type from OffsetNumber to HeapTupleData
and populate it inside page_collect_tuples() while lpp, lineoff, page,
block, and relid are already in scope, so no additional page reads are
needed. For the all_visible path (the common case on a primary not
under active modification) the write piggy-backs on the existing
per-lineoff loop. For the !all_visible path, HeapTupleData entries are
written during the visibility loop and compacted to visible survivors
afterwards using batchmvcc.visible[], avoiding a return to pd_linp[] via
PageGetItemId().
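
The compaction step described above can be sketched in isolation as follows. This is an illustration, not the patch's code: DemoTuple and demo_compact_visible() are hypothetical names, with DemoTuple standing in for HeapTupleData and the visible[] argument for batchmvcc.visible[].

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for HeapTupleData. */
typedef struct DemoTuple { int offnum; } DemoTuple;

/*
 * Compact tuples[] in place so that only entries with visible[i] set
 * remain, preserving order.  Returns the number of survivors.  Writing
 * in place is safe because dst never runs ahead of i.
 */
int
demo_compact_visible(DemoTuple *tuples, const bool *visible, int ntup)
{
    int         dst = 0;

    for (int i = 0; i < ntup; i++)
    {
        if (visible[i])
            tuples[dst++] = tuples[i];
    }
    return dst;
}
```

The return value corresponds to what the patch asserts against nvis after its own compaction loop.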
With rs_vistuples[] populated, heapgettup_pagemode() replaces the
per-tuple PageGetItemId/PageGetItem calls with a single struct copy:
*tuple = scan->rs_vistuples[lineindex];
The stack-local HeapTupleData array in BatchMVCCState is eliminated by
passing rs_vistuples[] directly to HeapTupleSatisfiesMVCCBatch(),
saving MaxHeapTuplesPerPage * 24 bytes of stack per page_collect_tuples()
call. HeapTupleSatisfiesMVCCBatch() loses its vistuples_dense parameter
since compaction is now handled by the caller.
t_tableOid is pre-initialized for all rs_vistuples[] entries at scan
start in heap_beginscan(), eliminating a store per visible tuple from the
fill loop. The raw ItemId word is read once per tuple with lp_off and
lp_len extracted via mask and shift rather than calling ItemIdGetOffset()
and ItemIdGetLength() separately, avoiding a potential second load from
the same address in the inner loop.
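
The mask-and-shift extraction can be illustrated with a small standalone sketch. The names below are hypothetical, and the bit positions (lp_off in the low 15 bits, lp_flags in the next 2, lp_len in the high 15) are an assumption: they are what the 0x7fff mask and the shift by 17 used in the diff imply. The real ItemIdData is declared with bitfields, whose placement depends on the compiler and byte order, so this sketch packs the word explicitly rather than relying on a bitfield struct.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, inferred from the mask and shift used in the diff. */
#define DEMO_LP_OFF_MASK    0x7fffu
#define DEMO_LP_FLAGS_SHIFT 15
#define DEMO_LP_LEN_SHIFT   17

/* Build a line-pointer word under the assumed layout. */
uint32_t
demo_pack_lp(uint32_t off, uint32_t flags, uint32_t len)
{
    return (off & DEMO_LP_OFF_MASK) |
        ((flags & 0x3u) << DEMO_LP_FLAGS_SHIFT) |
        ((len & 0x7fffu) << DEMO_LP_LEN_SHIFT);
}

/* Byte offset of the tuple within the page: low 15 bits. */
uint32_t
demo_lp_off(uint32_t lp_val)
{
    return lp_val & DEMO_LP_OFF_MASK;
}

/* Tuple length: high 15 bits, so a plain shift drops offset and flags. */
uint32_t
demo_lp_len(uint32_t lp_val)
{
    return lp_val >> DEMO_LP_LEN_SHIFT;
}
```

Both fields come out of a single 32-bit load, which is the point of the change; whether the assumed layout holds on every supported platform is worth confirming in review.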
Having pre-built HeapTupleData headers available at the scan descriptor
level also lays groundwork for a batched tuple interface, where an AM
can serve multiple tuples per call without repeating the line pointer
traversal.
Suggested-by: Andres Freund <[email protected]>
---
src/backend/access/heap/heapam.c | 73 ++++++++++++---------
src/backend/access/heap/heapam_handler.c | 19 ++----
src/backend/access/heap/heapam_visibility.c | 21 +++---
src/include/access/heapam.h | 5 +-
4 files changed, 58 insertions(+), 60 deletions(-)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e06ce2db2cf..b70c75c8288 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -524,7 +524,6 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
BlockNumber block, int lines,
bool all_visible, bool check_serializable)
{
- Oid relid = RelationGetRelid(scan->rs_base.rs_rd);
int ntup = 0;
int nvis = 0;
BatchMVCCState batchmvcc;
@@ -536,7 +535,7 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
for (OffsetNumber lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
{
ItemId lpp = PageGetItemId(page, lineoff);
- HeapTuple tup;
+ HeapTuple tup = &scan->rs_vistuples[ntup];
if (unlikely(!ItemIdIsNormal(lpp)))
continue;
@@ -549,25 +548,33 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
*/
if (!all_visible || check_serializable)
{
- tup = &batchmvcc.tuples[ntup];
+ uint32 lp_val = *(uint32 *) lpp;
- tup->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
- tup->t_len = ItemIdGetLength(lpp);
- tup->t_tableOid = relid;
+ tup->t_data = (HeapTupleHeader) ((char *) page + (lp_val & 0x7fff));
+ tup->t_len = lp_val >> 17;
+ Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
ItemPointerSet(&(tup->t_self), block, lineoff);
}
- /*
- * If the page is all visible, these fields otherwise won't be
- * populated in loop below.
- */
if (all_visible)
{
if (check_serializable)
- {
batchmvcc.visible[ntup] = true;
+
+ /*
+ * In the all_visible && !check_serializable path, the block
+ * above was skipped, so tup's fields have not been set yet.
+ * Fill them here while lpp is still in hand.
+ */
+ if (!check_serializable)
+ {
+ uint32 lp_val = *(uint32 *) lpp;
+
+ tup->t_data = (HeapTupleHeader) ((char *) page + (lp_val & 0x7fff));
+ tup->t_len = lp_val >> 17;
+ Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
+ ItemPointerSet(&tup->t_self, block, lineoff);
}
- scan->rs_vistuples[ntup] = lineoff;
}
ntup++;
@@ -598,11 +605,24 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
{
HeapCheckForSerializableConflictOut(batchmvcc.visible[i],
scan->rs_base.rs_rd,
- &batchmvcc.tuples[i],
+ &scan->rs_vistuples[i],
buffer, snapshot);
}
}
+
+ /* Now compact rs_vistuples[] to visible survivors only */
+ if (!all_visible)
+ {
+ int dst = 0;
+ for (int i = 0; i < ntup; i++)
+ {
+ if (batchmvcc.visible[i])
+ scan->rs_vistuples[dst++] = scan->rs_vistuples[i];
+ }
+ Assert(dst == nvis);
+ }
+
return nvis;
}
@@ -1074,14 +1094,13 @@ heapgettup_pagemode(HeapScanDesc scan,
ScanKey key)
{
HeapTuple tuple = &(scan->rs_ctup);
- Page page;
uint32 lineindex;
uint32 linesleft;
if (likely(scan->rs_inited))
{
/* continue from previously returned page/tuple */
- page = BufferGetPage(scan->rs_cbuf);
+ Assert(BufferIsValid(scan->rs_cbuf));
lineindex = scan->rs_cindex + dir;
if (ScanDirectionIsForward(dir))
@@ -1109,29 +1128,21 @@ heapgettup_pagemode(HeapScanDesc scan,
/* prune the page and determine visible tuple offsets */
heap_prepare_pagescan((TableScanDesc) scan);
- page = BufferGetPage(scan->rs_cbuf);
linesleft = scan->rs_ntuples;
lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
- /* block is the same for all tuples, set it once outside the loop */
- ItemPointerSetBlockNumber(&tuple->t_self, scan->rs_cblock);
-
/* lineindex now references the next or previous visible tid */
continue_page:
for (; linesleft > 0; linesleft--, lineindex += dir)
{
- ItemId lpp;
- OffsetNumber lineoff;
-
- Assert(lineindex < scan->rs_ntuples);
- lineoff = scan->rs_vistuples[lineindex];
- lpp = PageGetItemId(page, lineoff);
- Assert(ItemIdIsNormal(lpp));
-
- tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
- tuple->t_len = ItemIdGetLength(lpp);
- ItemPointerSetOffsetNumber(&tuple->t_self, lineoff);
+ /*
+ * Headers were pre-built by page_collect_tuples() into
+ * rs_vistuples[]. Copy the entry; t_data still points into the
+ * pinned page, which is safe for the lifetime of the current page
+ * scan.
+ */
+ *tuple = scan->rs_vistuples[lineindex];
/* skip any tuples that don't match the scan key */
if (key != NULL &&
@@ -1245,6 +1256,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
/* we only need to set this up once */
scan->rs_ctup.t_tableOid = RelationGetRelid(relation);
+ for (int i = 0; i < MaxHeapTuplesPerPage; i++)
+ scan->rs_vistuples[i].t_tableOid = RelationGetRelid(relation);
/*
* Allocate memory to keep track of page allocation for parallel workers
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 07f07188d46..88add129674 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2050,9 +2050,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
{
BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) scan;
HeapScanDesc hscan = (HeapScanDesc) bscan;
- OffsetNumber targoffset;
- Page page;
- ItemId lp;
/*
* Out of range? If so, nothing more to look at on this page
@@ -2067,15 +2064,7 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
return false;
}
- targoffset = hscan->rs_vistuples[hscan->rs_cindex];
- page = BufferGetPage(hscan->rs_cbuf);
- lp = PageGetItemId(page, targoffset);
- Assert(ItemIdIsNormal(lp));
-
- hscan->rs_ctup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
- hscan->rs_ctup.t_len = ItemIdGetLength(lp);
- hscan->rs_ctup.t_tableOid = scan->rs_rd->rd_id;
- ItemPointerSet(&hscan->rs_ctup.t_self, hscan->rs_cblock, targoffset);
+ hscan->rs_ctup = hscan->rs_vistuples[hscan->rs_cindex];
pgstat_count_heap_fetch(scan->rs_rd);
@@ -2353,7 +2342,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
while (start < end)
{
uint32 mid = start + (end - start) / 2;
- OffsetNumber curoffset = hscan->rs_vistuples[mid];
+ OffsetNumber curoffset = hscan->rs_vistuples[mid].t_self.ip_posid;
if (tupoffset == curoffset)
return true;
@@ -2473,7 +2462,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
ItemPointerSet(&tid, block, offnum);
if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
&heapTuple, NULL, true))
- hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+ hscan->rs_vistuples[ntup++] = heapTuple;
}
}
else
@@ -2502,7 +2491,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
if (valid)
{
- hscan->rs_vistuples[ntup++] = offnum;
+ hscan->rs_vistuples[ntup++] = loctup;
PredicateLockTID(scan->rs_rd, &loctup.t_self, snapshot,
HeapTupleHeaderGetXmin(loctup.t_data));
}
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 3a6a1e5a084..7162c848097 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1671,16 +1671,16 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
}
/*
- * Perform HeaptupleSatisfiesMVCC() on each passed in tuple. This is more
+ * Perform HeapTupleSatisfiesMVCC() on each passed in tuple. This is more
* efficient than doing HeapTupleSatisfiesMVCC() one-by-one.
*
- * To be checked tuples are passed via BatchMVCCState->tuples. Each tuple's
- * visibility is stored in batchmvcc->visible[]. In addition,
- * ->vistuples_dense is set to contain the offsets of visible tuples.
+ * Each tuple's visibility is stored in batchmvcc->visible[]. The caller
+ * is responsible for compacting the tuples array to contain only visible
+ * survivors after this function returns.
*
- * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that it
- * avoids a cross-translation-unit function call for each tuple, allows the
- * compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
+ * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that
+ * it avoids a cross-translation-unit function call for each tuple, allows
+ * the compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
* setting hint bits more efficiently (see the one BufferFinishSetHintBits()
* call below).
*
@@ -1690,7 +1690,7 @@ int
HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
int ntups,
BatchMVCCState *batchmvcc,
- OffsetNumber *vistuples_dense)
+ HeapTupleData *tuples)
{
int nvis = 0;
SetHintBitsState state = SHB_INITIAL;
@@ -1700,16 +1700,13 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
for (int i = 0; i < ntups; i++)
{
bool valid;
- HeapTuple tup = &batchmvcc->tuples[i];
+ HeapTuple tup = &tuples[i];
valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer, &state);
batchmvcc->visible[i] = valid;
if (likely(valid))
- {
- vistuples_dense[nvis] = tup->t_self.ip_posid;
nvis++;
- }
}
if (state == SHB_ENABLED)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 5176478c295..56f2d1a5748 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -102,7 +102,7 @@ typedef struct HeapScanDescData
/* these fields only used in page-at-a-time mode and for bitmap scans */
uint32 rs_cindex; /* current tuple's index in vistuples */
uint32 rs_ntuples; /* number of visible tuples on page */
- OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */
+ HeapTupleData rs_vistuples[MaxHeapTuplesPerPage]; /* tuples */
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
@@ -498,14 +498,13 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
*/
typedef struct BatchMVCCState
{
- HeapTupleData tuples[MaxHeapTuplesPerPage];
bool visible[MaxHeapTuplesPerPage];
} BatchMVCCState;
extern int HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
int ntups,
BatchMVCCState *batchmvcc,
- OffsetNumber *vistuples_dense);
+ HeapTupleData *tuples);
/*
* To avoid leaking too much knowledge about reorderbuffer implementation
--
2.47.3
[application/octet-stream] v7-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch (17.4K, 3-v7-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch)
From 8beefb53e7fa94a060456d1321f36abb221cbe47 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 23:09:37 +0900
Subject: [PATCH v7 5/5] Add EXPLAIN (BATCHES) option for tuple batching
statistics
Add a BATCHES option to EXPLAIN that reports per-node batch statistics
when a node uses batch mode execution.
For nodes that support batching (currently SeqScan), this shows the
number of batches fetched along with average, minimum, and maximum
rows per batch. Output is supported in both text and non-text formats.
Add regression tests covering text output, JSON format, filtered scans,
LIMIT, and disabled batching.
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/commands/explain.c | 44 +++++++++++
src/backend/commands/explain_state.c | 8 ++
src/backend/executor/execRowBatch.c | 44 ++++++++++-
src/backend/executor/nodeSeqscan.c | 8 +-
src/include/commands/explain_state.h | 1 +
src/include/executor/execRowBatch.h | 22 +++++-
src/include/executor/instrument.h | 1 +
src/test/regress/expected/explain.out | 107 ++++++++++++++++++++++++++
src/test/regress/sql/explain.sql | 59 ++++++++++++++
9 files changed, 291 insertions(+), 3 deletions(-)
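For readers skimming the diff, here is a minimal standalone sketch of the bookkeeping that RowBatchRecordStats() and RowBatchAvgRows() perform. It is not the patch code: DemoBatchStats and the demo_* functions are hypothetical names, and plain C types stand in for PostgreSQL's. What it tries to illustrate is the INT_MAX sentinel for min_rows, which is mapped to 0 at display time when no non-empty batch was recorded.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's RowBatchStats; not PostgreSQL code. */
typedef struct DemoBatchStats
{
    int64_t batches;    /* batches recorded so far */
    int64_t rows;       /* rows summed across all batches */
    int     max_rows;   /* largest batch seen */
    int     min_rows;   /* smallest non-empty batch; INT_MAX if none yet */
} DemoBatchStats;

void
demo_stats_init(DemoBatchStats *s)
{
    s->batches = 0;
    s->rows = 0;
    s->max_rows = 0;
    s->min_rows = INT_MAX;      /* sentinel: no non-empty batch seen */
}

/* Fold one batch of "rows" rows into the running statistics. */
void
demo_stats_record(DemoBatchStats *s, int rows)
{
    assert(rows >= 0);
    s->batches++;
    s->rows += rows;
    if (rows > s->max_rows)
        s->max_rows = rows;
    if (rows > 0 && rows < s->min_rows)
        s->min_rows = rows;
}

/* Average rows per batch; 0.0 when nothing was recorded. */
double
demo_stats_avg(const DemoBatchStats *s)
{
    return s->batches == 0 ? 0.0 : (double) s->rows / (double) s->batches;
}

/* Minimum as it would be displayed: the sentinel is reported as 0. */
int
demo_stats_min(const DemoBatchStats *s)
{
    return s->min_rows == INT_MAX ? 0 : s->min_rows;
}
```

Recording batches of 64, 64, and 28 rows gives 3 batches, an average of 52.0, a maximum of 64, and a minimum of 28.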
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 73eaaf176ac..8c98ca57c92 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -22,6 +22,7 @@
#include "commands/explain_format.h"
#include "commands/explain_state.h"
#include "commands/prepare.h"
+#include "executor/execRowBatch.h"
#include "foreign/fdwapi.h"
#include "jit/jit.h"
#include "libpq/pqformat.h"
@@ -519,6 +520,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_BUFFERS;
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
+ if (es->batches)
+ instrument_option |= INSTRUMENT_BATCHES;
/*
* We always collect timing for the entire statement, even when node-level
@@ -1370,6 +1373,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
int save_indent = es->indent;
bool haschildren;
bool isdisabled;
+ RowBatch *batch = NULL;
/*
* Prepare per-worker output buffers, if needed. We'll append the data in
@@ -2296,6 +2300,46 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
+ /* BATCHES */
+ switch (nodeTag(plan))
+ {
+ case T_SeqScan:
+ batch = castNode(SeqScanState, planstate)->batch;
+ break;
+ default:
+ break;
+ }
+
+ if (es->batches && batch)
+ {
+ RowBatchStats *stats = batch->stats;
+
+ Assert(stats);
+ if (stats->batches > 0)
+ {
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfo(es->str,
+ "Batches: %lld Avg Rows: %.1f Max: %d Min: %d\n",
+ (long long) stats->batches,
+ RowBatchAvgRows(batch), stats->max_rows,
+ stats->min_rows == INT_MAX ? 0 :
+ stats->min_rows);
+ }
+ else
+ {
+ ExplainPropertyInteger("Batches", NULL, stats->batches, es);
+ ExplainPropertyFloat("Average Batch Rows", NULL,
+ RowBatchAvgRows(batch), 1, es);
+ ExplainPropertyInteger("Max Batch Rows", NULL, stats->max_rows, es);
+ ExplainPropertyInteger("Min Batch Rows", NULL,
+ stats->min_rows == INT_MAX ? 0 :
+ stats->min_rows, es);
+ }
+ }
+ }
+
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
{
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index 77f59b8e500..28022a171cd 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -159,6 +159,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
"EXPLAIN", opt->defname, p),
parser_errposition(pstate, opt->location)));
}
+ else if (strcmp(opt->defname, "batches") == 0)
+ es->batches = defGetBoolean(opt);
else if (!ApplyExtensionExplainOption(es, opt, pstate))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -198,6 +200,12 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
errmsg("%s options %s and %s cannot be used together",
"EXPLAIN", "ANALYZE", "GENERIC_PLAN")));
+ /* check that BATCHES is used with EXPLAIN ANALYZE */
+ if (es->batches && !es->analyze)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("EXPLAIN option %s requires ANALYZE", "BATCHES")));
+
/* if the summary was not set explicitly, set default value */
es->summary = (summary_set) ? es->summary : es->analyze;
diff --git a/src/backend/executor/execRowBatch.c b/src/backend/executor/execRowBatch.c
index 6a298813bd8..6ef54deca04 100644
--- a/src/backend/executor/execRowBatch.c
+++ b/src/backend/executor/execRowBatch.c
@@ -20,7 +20,7 @@
* Allocate and initialize a new RowBatch envelope.
*/
RowBatch *
-RowBatchCreate(int max_rows)
+RowBatchCreate(int max_rows, bool track_stats)
{
RowBatch *b;
@@ -35,6 +35,20 @@ RowBatchCreate(int max_rows)
b->materialized = false;
b->slot = NULL;
+ if (track_stats)
+ {
+ RowBatchStats *stats = palloc_object(RowBatchStats);
+
+ stats->batches = 0;
+ stats->rows = 0;
+ stats->max_rows = 0;
+ stats->min_rows = INT_MAX;
+
+ b->stats = stats;
+ }
+ else
+ b->stats = NULL;
+
return b;
}
@@ -52,3 +66,31 @@ RowBatchReset(RowBatch *b, bool drop_slots)
b->materialized = false;
/* b->slot belongs to the owning PlanState node */
}
+
+void
+RowBatchRecordStats(RowBatch *b, int rows)
+{
+ RowBatchStats *stats = b->stats;
+
+ if (stats == NULL)
+ return;
+
+ stats->batches++;
+ stats->rows += rows;
+ if (rows > stats->max_rows)
+ stats->max_rows = rows;
+ if (rows < stats->min_rows && rows > 0)
+ stats->min_rows = rows;
+}
+
+double
+RowBatchAvgRows(RowBatch *b)
+{
+ RowBatchStats *stats = b->stats;
+
+ Assert(stats != NULL);
+ if (stats->batches == 0)
+ return 0.0;
+
+ return (double) stats->rows / stats->batches;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index d0ce8858c49..135b0a4f9a2 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -247,8 +247,12 @@ SeqScanCanUseBatching(SeqScanState *scanstate, int eflags)
static void
SeqScanInitBatching(SeqScanState *scanstate)
{
- RowBatch *batch = RowBatchCreate(MaxHeapTuplesPerPage);
+ RowBatch *batch;
+ EState *estate = scanstate->ss.ps.state;
+ bool track_stats =
+ (estate->es_instrument & INSTRUMENT_BATCHES) != 0;
+ batch = RowBatchCreate(MaxHeapTuplesPerPage, track_stats);
batch->slot = scanstate->ss.ss_ScanTupleSlot;
scanstate->batch = batch;
@@ -351,6 +355,8 @@ SeqNextBatch(SeqScanState *node)
if (!table_scan_getnextbatch(scandesc, b, direction))
return false;
+ RowBatchRecordStats(b, b->nrows);
+
return true;
}
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 5a48bc6fbb1..579ca4cfa20 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -56,6 +56,7 @@ typedef struct ExplainState
bool memory; /* print planner's memory usage information */
bool settings; /* print modified settings */
bool generic; /* generate a generic plan */
+ bool batches; /* print batch statistics */
ExplainSerializeOption serialize; /* serialize the query's output? */
ExplainFormat format; /* output format */
/* state for output formatting --- not reset for each new plan tree */
diff --git a/src/include/executor/execRowBatch.h b/src/include/executor/execRowBatch.h
index 021fdeecc73..ad0b4763b70 100644
--- a/src/include/executor/execRowBatch.h
+++ b/src/include/executor/execRowBatch.h
@@ -13,9 +13,12 @@
#ifndef EXECROWBATCH_H
#define EXECROWBATCH_H
+#include <limits.h>
+
#include "executor/tuptable.h"
typedef struct RowBatchOps RowBatchOps;
+typedef struct RowBatchStats RowBatchStats;
/*
* RowBatch
@@ -38,6 +41,9 @@ typedef struct RowBatch
bool materialized; /* tuples in slots valid? */
TupleTableSlot *slot; /* row view */
+
+ RowBatchStats *stats; /* NULL if instrumentation stats
+ * are not requested */
} RowBatch;
/*
@@ -58,8 +64,17 @@ typedef struct RowBatchOps
void (*repoint_slot) (RowBatch *b, int idx);
} RowBatchOps;
+/* Instrumentation stats populated for EXPLAIN (ANALYZE, BATCHES) */
+typedef struct RowBatchStats
+{
+ int64 batches; /* total number of batches fetched */
+ int64 rows; /* total tuples across all batches */
+ int max_rows; /* max rows in any single batch */
+ int min_rows; /* min rows in any single batch (non-zero) */
+} RowBatchStats;
+
/* Create/teardown */
-extern RowBatch *RowBatchCreate(int max_rows);
+extern RowBatch *RowBatchCreate(int max_rows, bool track_stats);
extern void RowBatchReset(RowBatch *b, bool drop_slots);
/* Validation */
@@ -85,4 +100,9 @@ RowBatchGetNextSlot(RowBatch *b)
return b->slot;
}
+/* === Batching stats === */
+
+extern void RowBatchRecordStats(RowBatch *b, int rows);
+extern double RowBatchAvgRows(RowBatch *b);
+
#endif /* EXECROWBATCH_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index cc9fbb0e2f0..89df74a86c1 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -64,6 +64,7 @@ typedef enum InstrumentOption
INSTRUMENT_BUFFERS = 1 << 1, /* needs buffer usage */
INSTRUMENT_ROWS = 1 << 2, /* needs row count */
INSTRUMENT_WAL = 1 << 3, /* needs WAL usage */
+ INSTRUMENT_BATCHES = 1 << 4, /* needs batches */
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..950de5a9d78 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,110 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- Test BATCHES option
+set executor_batch_rows = 64;
+create temp table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+-- BATCHES without ANALYZE should error
+explain (batches, costs off) select * from batch_test;
+ERROR: EXPLAIN option BATCHES requires ANALYZE
+-- BATCHES without ANALYZE but with other options
+explain (batches, buffers off, costs off) select * from batch_test;
+ERROR: EXPLAIN option BATCHES requires ANALYZE
+-- Basic: verify batch stats line appears in text format
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+-- With filter: batch line still appears
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Filter: (a > N)
+ Rows Removed by Filter: N
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With non-batchable qual (OR): batching still active but
+-- batch qual falls back to per-tuple ExecQual
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000 or b is null');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Filter: ((a > N) OR (b IS NULL))
+ Rows Removed by Filter: N
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With LIMIT: batch stats appear on child Seq Scan node
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+ explain_filter
+----------------------------------------------------------------------
+ Limit (actual time=N.N..N.N rows=N.N loops=N)
+ -> Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Batches: N Avg Rows: N.N Max: N Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(5 rows)
+
+-- Verify batch stats keys present in JSON output
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches,
+ j #> '{0,Plan}' ? 'Average Batch Rows' as has_avg,
+ j #> '{0,Plan}' ? 'Max Batch Rows' as has_max,
+ j #> '{0,Plan}' ? 'Min Batch Rows' as has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+ has_batches | has_avg | has_max | has_min
+-------------+---------+---------+---------
+ t | t | t | t
+(1 row)
+
+-- With LIMIT: batch stats keys on child node in JSON
+select
+ j #> '{0,Plan,Plans,0}' ? 'Batches' as child_has_batches,
+ j #> '{0,Plan,Plans,0}' ? 'Average Batch Rows' as child_has_avg,
+ j #> '{0,Plan,Plans,0}' ? 'Max Batch Rows' as child_has_max,
+ j #> '{0,Plan,Plans,0}' ? 'Min Batch Rows' as child_has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test limit 100'
+) as j;
+ child_has_batches | child_has_avg | child_has_max | child_has_min
+-------------------+---------------+---------------+---------------
+ t | t | t | t
+(1 row)
+
+-- Batching disabled: no batch stats in text output
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+ explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+-- Batching disabled: no batch keys in JSON
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+ has_batches
+-------------
+ f
+(1 row)
+
+reset executor_batch_rows;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..55acb9058ce 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,62 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- Test BATCHES option
+set executor_batch_rows = 64;
+
+create temp table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+
+-- BATCHES without ANALYZE should error
+explain (batches, costs off) select * from batch_test;
+
+-- BATCHES without ANALYZE but with other options
+explain (batches, buffers off, costs off) select * from batch_test;
+
+-- Basic: verify batch stats line appears in text format
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- With filter: batch line still appears
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+
+-- With non-batchable qual (OR): batching still active but
+-- batch qual falls back to per-tuple ExecQual
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000 or b is null');
+
+-- With LIMIT: batch stats appear on child Seq Scan node
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+
+-- Verify batch stats keys present in JSON output
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches,
+ j #> '{0,Plan}' ? 'Average Batch Rows' as has_avg,
+ j #> '{0,Plan}' ? 'Max Batch Rows' as has_max,
+ j #> '{0,Plan}' ? 'Min Batch Rows' as has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+
+-- With LIMIT: batch stats keys on child node in JSON
+select
+ j #> '{0,Plan,Plans,0}' ? 'Batches' as child_has_batches,
+ j #> '{0,Plan,Plans,0}' ? 'Average Batch Rows' as child_has_avg,
+ j #> '{0,Plan,Plans,0}' ? 'Max Batch Rows' as child_has_max,
+ j #> '{0,Plan,Plans,0}' ? 'Min Batch Rows' as child_has_min
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test limit 100'
+) as j;
+
+-- Batching disabled: no batch stats in text output
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- Batching disabled: no batch keys in JSON
+select
+ j #> '{0,Plan}' ? 'Batches' as has_batches
+from explain_filter_to_json(
+ 'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+
+reset executor_batch_rows;
--
2.47.3
[application/octet-stream] v7-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch (6.5K, 4-v7-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch)
From 815d001dcc7a2cda50e3d55522bfaf30ad7fceee Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 5 Mar 2026 17:42:19 +0900
Subject: [PATCH v7 2/5] Add RowBatch infrastructure for batched tuple
processing
Introduce RowBatch, a data carrier that allows table AMs to deliver
multiple rows per call and the executor to process them as a group.
RowBatch separates three concerns:
- am_payload: opaque, AM-owned storage (e.g. heapam's per-page batch
state, which holds the page's buffer pin and a pointer to the current
slice of tuples). The AM allocates this in its scan_begin_batch
callback.
- slot: a single TupleTableSlot, owned by the scan node, that serves as
a row view. ops->repoint_slot re-points it at one row of am_payload
at a time when the executor needs tuple data.
- max_rows: executor-set upper bound that the AM respects when
filling a batch.
RowBatch does not own selection/filtering state. Which rows survive
qual evaluation is the executor's concern, tracked separately in
scan node state. This keeps RowBatch focused on the AM-to-executor
data transfer boundary.
RowBatchOps provides a vtable for AM-specific operations; currently
only repoint_slot is defined.
---
src/backend/executor/Makefile | 1 +
src/backend/executor/execRowBatch.c | 54 ++++++++++++++++++
src/backend/executor/meson.build | 1 +
src/include/executor/execRowBatch.h | 88 +++++++++++++++++++++++++++++
src/tools/pgindent/typedefs.list | 2 +
5 files changed, 146 insertions(+)
create mode 100644 src/backend/executor/execRowBatch.c
create mode 100644 src/include/executor/execRowBatch.h
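The iteration contract in execRowBatch.h (RowBatchHasMore, RowBatchGetNextSlot, and the repoint_slot callback) can be sketched outside PostgreSQL roughly as follows. This is only an illustration under simplifying assumptions: DemoBatch, DemoBatchOps, and the demo_* functions are hypothetical, the payload is a plain int array rather than heap tuples, and the "slot" is just a pointer to the current row.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoBatch DemoBatch;

/* Hypothetical miniature of RowBatchOps: one callback that re-points the slot. */
typedef struct DemoBatchOps
{
    void (*repoint_slot) (DemoBatch *b, int idx);
} DemoBatchOps;

struct DemoBatch
{
    void               *am_payload; /* AM-owned storage; here an int array */
    const DemoBatchOps *ops;        /* AM-specific callbacks */
    int                 nrows;      /* rows the AM placed in the batch */
    int                 pos;        /* iteration position */
    const int          *slot;       /* row view: points at the current row */
};

/* AM-side callback: aim the single slot at row idx of the payload. */
void
demo_repoint_slot(DemoBatch *b, int idx)
{
    const int *rows = (const int *) b->am_payload;

    assert(idx >= 0 && idx < b->nrows);
    b->slot = &rows[idx];
}

const DemoBatchOps demo_ops = {.repoint_slot = demo_repoint_slot};

/* Executor-side iteration, modeled on RowBatchGetNextSlot(). */
const int *
demo_get_next(DemoBatch *b)
{
    if (b->pos >= b->nrows)
        return NULL;
    b->ops->repoint_slot(b, b->pos++);
    return b->slot;
}

/* Drain a batch one row at a time and return the sum of its rows. */
long
demo_sum_batch(DemoBatch *b)
{
    long       sum = 0;
    const int *row;

    while ((row = demo_get_next(b)) != NULL)
        sum += *row;
    return sum;
}
```

The design point this mirrors is that only one slot exists per batch; the AM callback decides what "pointing at row idx" means, so the executor-side loop stays AM-agnostic.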
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..99a00e762f6 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = \
execAmi.o \
execAsync.o \
+ execRowBatch.o \
execCurrent.o \
execExpr.o \
execExprInterp.o \
diff --git a/src/backend/executor/execRowBatch.c b/src/backend/executor/execRowBatch.c
new file mode 100644
index 00000000000..6a298813bd8
--- /dev/null
+++ b/src/backend/executor/execRowBatch.c
@@ -0,0 +1,54 @@
+/*-------------------------------------------------------------------------
+ *
+ * execRowBatch.c
+ * Helpers for RowBatch
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/executor/execRowBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execRowBatch.h"
+
+/*
+ * RowBatchCreate
+ * Allocate and initialize a new RowBatch envelope.
+ */
+RowBatch *
+RowBatchCreate(int max_rows)
+{
+ RowBatch *b;
+
+ Assert(max_rows > 0);
+
+ b = palloc(sizeof(RowBatch));
+ b->am_payload = NULL;
+ b->ops = NULL;
+ b->max_rows = max_rows;
+ b->nrows = 0;
+ b->pos = 0;
+ b->materialized = false;
+ b->slot = NULL;
+
+ return b;
+}
+
+/*
+ * RowBatchReset
+ * Reset an existing RowBatch envelope to empty.
+ */
+void
+RowBatchReset(RowBatch *b, bool drop_slots)
+{
+ Assert(b != NULL);
+
+ b->nrows = 0;
+ b->pos = 0;
+ b->materialized = false;
+ /* b->slot belongs to the owning PlanState node */
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index dc45be0b2ce..fd0bf80bacd 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
backend_sources += files(
'execAmi.c',
'execAsync.c',
+ 'execRowBatch.c',
'execCurrent.c',
'execExpr.c',
'execExprInterp.c',
diff --git a/src/include/executor/execRowBatch.h b/src/include/executor/execRowBatch.h
new file mode 100644
index 00000000000..021fdeecc73
--- /dev/null
+++ b/src/include/executor/execRowBatch.h
@@ -0,0 +1,88 @@
+/*-------------------------------------------------------------------------
+ *
+ * execRowBatch.h
+ * Executor batch envelope for passing row batch state upward
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/include/executor/execRowBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECROWBATCH_H
+#define EXECROWBATCH_H
+
+#include "executor/tuptable.h"
+
+typedef struct RowBatchOps RowBatchOps;
+
+/*
+ * RowBatch
+ *
+ * Data carrier from table AM to executor. The AM populates am_payload
+ * and nrows via scan_getnextbatch(). The executor calls ops->repoint_slot
+ * to point the batch's slot at a row when it needs tuple data.
+ *
+ * Selection state (which rows survived qual eval) is owned by the executor,
+ * not the batch.
+ */
+typedef struct RowBatch
+{
+ void *am_payload;
+ const RowBatchOps *ops;
+
+ int max_rows; /* executor-set upper bound */
+ int nrows; /* rows TAM put in */
+ int pos; /* iteration position */
+ bool materialized; /* tuples in slots valid? */
+
+ TupleTableSlot *slot; /* row view */
+} RowBatch;
+
+/*
+ * RowBatchOps -- AM-specific operations on a RowBatch.
+ *
+ * Table AMs set b->ops during scan_begin_batch to provide
+ * callbacks that the executor uses to access batch contents.
+ *
+ * repoint_slot re-points the batch's single slot to the tuple at
+ * index idx within the current batch. The slot remains valid until
+ * the next call or until the batch is exhausted.
+ *
+ * Additional callbacks can be added here as new AMs or executor
+ * features require them.
+ */
+typedef struct RowBatchOps
+{
+ void (*repoint_slot) (RowBatch *b, int idx);
+} RowBatchOps;
+
+/* Create/teardown */
+extern RowBatch *RowBatchCreate(int max_rows);
+extern void RowBatchReset(RowBatch *b, bool drop_slots);
+
+/* Validation */
+static inline bool
+RowBatchIsValid(RowBatch *b)
+{
+ return b != NULL && b->max_rows > 0;
+}
+
+/* Iteration over materialized slots */
+static inline bool
+RowBatchHasMore(RowBatch *b)
+{
+ return b->pos < b->nrows;
+}
+
+static inline TupleTableSlot *
+RowBatchGetNextSlot(RowBatch *b)
+{
+ if (b->pos >= b->nrows)
+ return NULL;
+ b->ops->repoint_slot(b, b->pos++);
+ return b->slot;
+}
+
+#endif /* EXECROWBATCH_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 35acda59851..e5c172628b3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2694,6 +2694,8 @@ RoleSpec
RoleSpecType
RoleStmtType
RollupData
+RowBatch
+RowBatchOps
RowCompareExpr
RowExpr
RowIdentityVarInfo
--
2.47.3
[application/octet-stream] v7-0003-Add-batch-table-AM-API-and-heapam-implementation.patch (19.0K, 5-v7-0003-Add-batch-table-AM-API-and-heapam-implementation.patch)
From dd122f0913affbafe95ee4fc79eb656b482fe1e0 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 23 Mar 2026 18:21:47 +0900
Subject: [PATCH v7 3/5] Add batch table AM API and heapam implementation
Introduce table AM callbacks for batched tuple fetching:
scan_begin_batch, scan_getnextbatch, scan_reset_batch, and
scan_end_batch. AMs implement all four or none; checked by
table_supports_batching().
scan_reset_batch releases held resources (e.g. buffer pins)
without freeing, allowing reuse across rescans.
Provide the heapam implementation. HeapPageBatch (stored in
RowBatch.am_payload) is a thin slice descriptor over the scan's
rs_vistuples[] array, which was introduced in the previous commit.
Rather than owning a copy of tuple headers, HeapPageBatch holds a
pointer into scan->rs_vistuples[] for the current slice and a buffer
pin for the current page.
heap_getnextbatch() calls heap_prepare_pagescan() to populate
rs_vistuples[] for each new page, then re-points hb->tuples to the
next slice of rs_vistuples[] on each call. If the page has more
tuples than the executor's max_rows, subsequent calls return the
next slice without re-entering page preparation. The buffer pin is
held until the page is fully consumed.
scan_begin_batch expects b->slot to be a single TupleTableSlot with
TTSOpsBufferHeapTuple ops, supplied by the executor. heap_repoint_slot()
re-points this slot to each tuple in turn via ExecStoreBufferHeapTuple().
Consumers that need to retain the slot across calls rely on the normal
slot materialization contract.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
src/backend/access/heap/heapam.c | 229 ++++++++++++++++++++++-
src/backend/access/heap/heapam_handler.c | 8 +-
src/include/access/heapam.h | 33 ++++
src/include/access/tableam.h | 136 ++++++++++++++
src/include/pgstat.h | 4 +-
5 files changed, 403 insertions(+), 7 deletions(-)
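The slicing behaviour described above (a page with more visible tuples than max_rows is served over several calls without re-preparing the page) can be sketched as below. This is a simplified model, not the heapam code: DemoPageCursor and demo_next_slice are hypothetical names, and buffer pins, page advance, and visibility checks are left out entirely.

```c
#include <assert.h>

#define DEMO_MIN(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical per-page cursor, loosely modeled on HeapPageBatch. */
typedef struct DemoPageCursor
{
    int page_ntuples;   /* visible tuples on the current page */
    int nextitem;       /* index of the first tuple not yet served */
} DemoPageCursor;

/*
 * Serve the next slice of the current page.  Stores the slice's starting
 * index in *start and returns its length.  Returns 0 once the page is
 * exhausted, which is the point where a real scan would release the pin
 * and advance to the next page.
 */
int
demo_next_slice(DemoPageCursor *c, int max_rows, int *start)
{
    int remaining = c->page_ntuples - c->nextitem;
    int nserve;

    assert(max_rows > 0);
    if (remaining <= 0)
        return 0;

    nserve = DEMO_MIN(remaining, max_rows);
    *start = c->nextitem;
    c->nextitem += nserve;
    return nserve;
}
```

For a page with 150 visible tuples and max_rows = 64, successive calls return slices of 64, 64, and 22 tuples starting at indexes 0, 64, and 128, and then 0.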
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b70c75c8288..d45f509fa6b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -43,6 +43,7 @@
#include "catalog/pg_database.h"
#include "catalog/pg_database_d.h"
#include "commands/vacuum.h"
+#include "executor/execRowBatch.h"
#include "pgstat.h"
#include "port/pg_bitutils.h"
#include "storage/lmgr.h"
@@ -109,6 +110,7 @@ static int bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
bool *copy);
+static void heap_repoint_slot(RowBatch *b, int idx);
/*
@@ -1214,7 +1216,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
scan->rs_cbuf = InvalidBuffer;
/*
- * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
+ * Disable page-at-a-time mode if the snapshot does not allow it.
*/
if (!(snapshot && IsMVCCSnapshot(snapshot)))
scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;
@@ -1464,7 +1466,7 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
return &scan->rs_ctup;
}
@@ -1492,13 +1494,232 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
ExecStoreBufferHeapTuple(&scan->rs_ctup, slot,
scan->rs_cbuf);
return true;
}
+/*---------- Batching support -----------*/
+
+static const RowBatchOps RowBatchHeapOps =
+{
+ .repoint_slot = heap_repoint_slot
+};
+
+/*
+ * heap_batch_feasible
+ * Batching requires an MVCC snapshot since it relies on
+ * page-at-a-time mode, which heap_beginscan() disables for
+ * non-MVCC snapshots.
+ */
+bool
+heap_batch_feasible(Relation relation, Snapshot snapshot)
+{
+ return snapshot && IsMVCCSnapshot(snapshot);
+}
+
+/*
+ * heap_begin_batch
+ * Initialize AM-side batch state for a heap scan.
+ *
+ * Allocates a HeapPageBatch, which acts as a thin slice descriptor over
+ * the scan's rs_vistuples[] array. Unlike the previous version there is
+ * no separate tuple header storage in HeapPageBatch itself; rs_vistuples[]
+ * in HeapScanDescData (populated by page_collect_tuples() via
+ * heap_prepare_pagescan()) serves as the page-level buffer. HeapPageBatch
+ * holds a pointer into that array for the current slice and the buffer pin
+ * for the current page.
+ *
+ * b->slot must be a TTSOpsBufferHeapTuple slot.
+ */
+void
+heap_begin_batch(TableScanDesc sscan, RowBatch *b)
+{
+ HeapPageBatch *hb;
+
+ /* Batch path relies on executor-level qual eval, not AM scan keys */
+ Assert(sscan->rs_nkeys == 0);
+ Assert(TTS_IS_BUFFERTUPLE(b->slot));
+
+ hb = palloc(sizeof(HeapPageBatch));
+ hb->tuples = NULL;
+ hb->ntuples = 0;
+ hb->nextitem = 0;
+ hb->buf = InvalidBuffer;
+
+ b->am_payload = hb;
+ b->ops = &RowBatchHeapOps;
+}
+
+/*
+ * heap_reset_batch
+ * Release pin and reset for rescan, keeping allocations.
+ */
+void
+heap_reset_batch(TableScanDesc sscan, RowBatch *b)
+{
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+
+ Assert(hb != NULL);
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+ hb->ntuples = 0;
+ hb->nextitem = 0;
+}
+
+/*
+ * heap_end_batch
+ * Release all batch resources.
+ */
+void
+heap_end_batch(TableScanDesc sscan, RowBatch *b)
+{
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+
+ if (BufferIsValid(hb->buf))
+ ReleaseBuffer(hb->buf);
+
+ pfree(hb);
+ b->am_payload = NULL;
+}
+
+/*
+ * heap_getnextbatch
+ * Fetch the next slice of visible tuples from a heap scan.
+ *
+ * Serves slices from the current page's rs_vistuples[] array. If the
+ * current page has remaining tuples, sets hb->tuples to point at the next
+ * slice without re-entering the page scan. If the page is exhausted,
+ * advances to the next page via heap_fetch_next_buffer(), prepares it
+ * with heap_prepare_pagescan(), and serves the first slice from it.
+ *
+ * hb->tuples points directly into scan->rs_vistuples[]; the entries remain
+ * valid as long as hb->buf (the page's buffer pin) is held. The pin is
+ * released at the top of the next call once the page is fully consumed.
+ *
+ * Each call returns at most b->max_rows tuples.
+ *
+ * Returns true if tuples were fetched, false at end of scan.
+ */
+bool
+heap_getnextbatch(TableScanDesc sscan, RowBatch *b, ScanDirection dir)
+{
+ HeapScanDesc scan = (HeapScanDesc) sscan;
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+ int remaining;
+ int nserve;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+
+ /*
+ * Try to serve from the current page first. No page advance, no buffer
+ * management, no re-entry into heap code.
+ */
+ remaining = scan->rs_ntuples - hb->nextitem;
+ if (remaining > 0)
+ {
+ nserve = Min(remaining, b->max_rows);
+
+ hb->tuples = &scan->rs_vistuples[hb->nextitem];
+ hb->ntuples = nserve;
+ hb->nextitem += nserve;
+
+ b->nrows = nserve;
+ b->pos = 0;
+
+ pgstat_count_heap_getnext(sscan->rs_rd, nserve);
+ return true;
+ }
+
+ /*
+ * Current page exhausted. Advance to the next page with visible tuples.
+ */
+ for (;;)
+ {
+ /*
+ * Release the previous page's pin. The page is fully consumed at
+ * this point -- all slices have been served.
+ */
+ if (BufferIsValid(hb->buf))
+ {
+ ReleaseBuffer(hb->buf);
+ hb->buf = InvalidBuffer;
+ }
+
+ heap_fetch_next_buffer(scan, dir);
+
+ if (!BufferIsValid(scan->rs_cbuf))
+ {
+ /* End of scan */
+ scan->rs_cblock = InvalidBlockNumber;
+ scan->rs_prefetch_block = InvalidBlockNumber;
+ scan->rs_inited = false;
+ b->nrows = 0;
+ return false;
+ }
+
+ Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+
+ /*
+ * Prepare the page: prune, run visibility checks, and populate
+ * scan->rs_vistuples[0..rs_ntuples-1] via page_collect_tuples().
+ */
+ heap_prepare_pagescan(sscan);
+
+ if (scan->rs_ntuples > 0)
+ {
+ /*
+ * Pin the page so tuple data stays valid while the executor
+ * processes slices. Released at the top of the next call
+ * once the page is fully consumed.
+ */
+ IncrBufferRefCount(scan->rs_cbuf);
+ hb->buf = scan->rs_cbuf;
+
+ nserve = Min(scan->rs_ntuples, b->max_rows);
+
+ hb->tuples = &scan->rs_vistuples[0];
+ hb->ntuples = nserve;
+ hb->nextitem = nserve;
+
+ b->nrows = nserve;
+ b->pos = 0;
+
+ pgstat_count_heap_getnext(sscan->rs_rd, nserve);
+ return true;
+ }
+
+ /* Empty page (all dead/invisible tuples), try next */
+ }
+}
+
+/*
+ * heap_repoint_slot
+ * Re-point the batch's single slot to the tuple at index idx.
+ *
+ * Called by RowBatchGetNextSlot() for each tuple served to the parent
+ * node. hb->tuples[idx] was populated by page_collect_tuples() via
+ * heap_prepare_pagescan() and remains valid as long as hb->buf is pinned.
+ */
+static void
+heap_repoint_slot(RowBatch *b, int idx)
+{
+ HeapPageBatch *hb = (HeapPageBatch *) b->am_payload;
+
+ Assert(idx >= 0 && idx < hb->ntuples);
+ Assert(TTS_IS_BUFFERTUPLE(b->slot));
+
+ ExecStoreBufferHeapTuple(&hb->tuples[idx], b->slot, hb->buf);
+}
+
+/*----- End of batching support -----*/
+
void
heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid)
@@ -1640,7 +1861,7 @@ heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
* if we get here it means we have a new current scan tuple, so point to
* the proper return buffer and return the tuple.
*/
- pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+ pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
ExecStoreBufferHeapTuple(&scan->rs_ctup, slot, scan->rs_cbuf);
return true;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 88add129674..828b1a71362 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2245,7 +2245,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
ExecStoreBufferHeapTuple(tuple, slot, hscan->rs_cbuf);
/* Count successfully-fetched tuples as heap fetches */
- pgstat_count_heap_getnext(scan->rs_rd);
+ pgstat_count_heap_getnext(scan->rs_rd, 1);
return true;
}
@@ -2535,6 +2535,12 @@ static const TableAmRoutine heapam_methods = {
.scan_rescan = heap_rescan,
.scan_getnextslot = heap_getnextslot,
+ .scan_batch_feasible = heap_batch_feasible,
+ .scan_begin_batch = heap_begin_batch,
+ .scan_getnextbatch = heap_getnextbatch,
+ .scan_end_batch = heap_end_batch,
+ .scan_reset_batch = heap_reset_batch,
+
.scan_set_tidrange = heap_set_tidrange,
.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 56f2d1a5748..d980dd29a44 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -106,6 +106,32 @@ typedef struct HeapScanDescData
} HeapScanDescData;
typedef struct HeapScanDescData *HeapScanDesc;
+/*
+ * HeapPageBatch -- heapam-private page-level batch state.
+ *
+ * Thin slice descriptor over the scan's rs_vistuples[] array. Rather
+ * than owning a copy of tuple headers, HeapPageBatch holds a pointer
+ * into scan->rs_vistuples[] for the current slice, which was populated
+ * by page_collect_tuples() during heap_prepare_pagescan().
+ *
+ * The executor consumes tuples in slices. Each heap_getnextbatch call
+ * re-points tuples to the next slice and advances nextitem, serving up
+ * to RowBatch.max_rows tuples from the current page before advancing
+ * to the next.
+ *
+ * buf holds the pin for the current page. Tuple data referenced via
+ * tuples remains valid as long as buf is pinned.
+ *
+ * Stored in RowBatch.am_payload.
+ */
+typedef struct HeapPageBatch
+{
+ HeapTupleData *tuples; /* start of current slice in scan->rs_vistuples[] */
+ int ntuples; /* tuples in current slice */
+ int nextitem; /* next unserved tuple index in rs_vistuples[] */
+ Buffer buf; /* pinned buffer for current page */
+} HeapPageBatch;
+
typedef struct BitmapHeapScanDescData
{
HeapScanDescData rs_heap_base;
@@ -360,6 +386,13 @@ extern void heap_endscan(TableScanDesc sscan);
extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
extern bool heap_getnextslot(TableScanDesc sscan,
ScanDirection direction, TupleTableSlot *slot);
+
+extern bool heap_batch_feasible(Relation relation, Snapshot snapshot);
+extern void heap_begin_batch(TableScanDesc sscan, RowBatch *batch);
+extern bool heap_getnextbatch(TableScanDesc sscan, RowBatch *batch, ScanDirection dir);
+extern void heap_end_batch(TableScanDesc sscan, RowBatch *batch);
+extern void heap_reset_batch(TableScanDesc sscan, RowBatch *batch);
+
extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
ItemPointer maxtid);
extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4647785fd35..28caa3dcf37 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -303,6 +303,8 @@ typedef void (*IndexBuildCallback) (Relation index,
bool tupleIsAlive,
void *state);
+typedef struct RowBatch RowBatch;
+
/*
* API struct for a table AM. Note this must be allocated in a
* server-lifetime manner, typically as a static const struct, which then gets
@@ -380,6 +382,56 @@ typedef struct TableAmRoutine
ScanDirection direction,
TupleTableSlot *slot);
+ /* ------------------------------------------------------------------------
+ * Batched scan support
+ * ------------------------------------------------------------------------
+ */
+
+ /*
+ * Returns true if the AM can support batching for a scan with the
+ * given snapshot. Called at plan init time before the scan descriptor
+ * exists. AMs that have no snapshot-based restrictions can omit this
+ * callback, in which case batching is considered feasible.
+ */
+ bool (*scan_batch_feasible)(Relation relation, Snapshot snapshot);
+
+ /*
+ * Initialize AM-owned batch state for a scan. Called once before
+ * the first scan_getnextbatch call. The AM allocates whatever
+ * private state it needs and stores it in b->am_payload. b->slot
+ * is the scan node's ss_ScanTupleSlot, whose type was already
+ * determined by the AM via table_slot_callbacks(). The AM's
+ * repoint_slot callback re-points it to each tuple in the batch
+ * in turn. Future interfaces may allow the AM to expose batch
+ * data in other forms without going through a slot.
+ */
+ void (*scan_begin_batch)(TableScanDesc sscan, RowBatch *b);
+
+ /*
+ * Fetch the next batch of tuples from the scan into b. Sets b->nrows
+ * to the number of tuples available and resets b->pos to 0. Returns
+ * true if any tuples were fetched, false at end of scan. The caller
+ * advances through the batch via RowBatchGetNextSlot(), which calls
+ * ops->repoint_slot for each position up to b->nrows.
+ */
+ bool (*scan_getnextbatch)(TableScanDesc sscan, RowBatch *b,
+ ScanDirection dir);
+
+ /*
+ * Release all AM-owned batch resources, including any buffer pins
+ * held in am_payload. Called when the scan node is shut down.
+ * After this call b->am_payload must not be used.
+ */
+ void (*scan_end_batch)(TableScanDesc sscan, RowBatch *b);
+
+ /*
+ * Reset batch state for rescan. Release any held resources (e.g.
+ * buffer pins) and reset counts, but keep the allocation so the
+ * next getnextbatch call can reuse it without re-entering
+ * begin_batch.
+ */
+ void (*scan_reset_batch)(TableScanDesc sscan, RowBatch *b);
+
/*-----------
* Optional functions to provide scanning for ranges of ItemPointers.
* Implementations must either provide both of these functions, or neither
@@ -1099,6 +1151,90 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
}
+/*
+ * table_supports_batching
+ * Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation, Snapshot snapshot)
+{
+ const TableAmRoutine *tam = relation->rd_tableam;
+
+ if (tam->scan_getnextbatch == NULL)
+ return false;
+
+ Assert(tam->scan_begin_batch != NULL);
+ Assert(tam->scan_reset_batch != NULL);
+ Assert(tam->scan_end_batch != NULL);
+
+ /*
+ * Optional: AM may restrict batching based on snapshot or other conditions.
+ */
+ if (tam->scan_batch_feasible != NULL &&
+ !tam->scan_batch_feasible(relation, snapshot))
+ return false;
+
+ return true;
+}
+
+/*
+ * table_scan_begin_batch
+ * Allocate AM-owned batch payload in the RowBatch
+ */
+static inline void
+table_scan_begin_batch(TableScanDesc sscan, RowBatch *b)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_begin_batch != NULL);
+
+ return tam->scan_begin_batch(sscan, b);
+}
+
+/*
+ * table_scan_getnextbatch
+ * Fetch the next batch of tuples from the AM. Returns true if tuples
+ * were fetched, false at end of scan. Only forward scans are supported.
+ */
+static inline bool
+table_scan_getnextbatch(TableScanDesc sscan, RowBatch *b, ScanDirection dir)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(ScanDirectionIsForward(dir));
+ Assert(tam->scan_getnextbatch != NULL);
+
+ return tam->scan_getnextbatch(sscan, b, dir);
+}
+
+/*
+ * table_scan_end_batch
+ * Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, RowBatch *b)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_end_batch != NULL);
+
+ tam->scan_end_batch(sscan, b);
+}
+
+/*
+ * table_scan_reset_batch
+ * Reset AM-owned batch state for rescan without freeing.
+ */
+static inline void
+table_scan_reset_batch(TableScanDesc sscan, RowBatch *b)
+{
+ const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+ Assert(tam->scan_reset_batch != NULL);
+
+ tam->scan_reset_batch(sscan, b);
+}
+
/* ----------------------------------------------------------------------------
* TID Range scanning related functions.
* ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2786a7c5ffb..df06e33fba2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -719,10 +719,10 @@ extern void pgstat_report_analyze(Relation rel,
if (pgstat_should_count_relation(rel)) \
(rel)->pgstat_info->counts.numscans++; \
} while (0)
-#define pgstat_count_heap_getnext(rel) \
+#define pgstat_count_heap_getnext(rel, n) \
do { \
if (pgstat_should_count_relation(rel)) \
- (rel)->pgstat_info->counts.tuples_returned++; \
+ (rel)->pgstat_info->counts.tuples_returned += (n); \
} while (0)
#define pgstat_count_heap_fetch(rel) \
do { \
--
2.47.3
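The slice-serving arithmetic in heap_getnextbatch() can be sketched in isolation. This is a simplified model, not the patch's code: PageSlice and serve_next_slice() are hypothetical names, plain ints stand in for HeapTupleData entries, and "remaining" is assumed to be the page's visible-tuple count minus nextitem. Buffer pins, visibility checks, and pgstat counting are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for HeapPageBatch; ints replace HeapTupleData. */
typedef struct PageSlice
{
	const int  *tuples;		/* start of the current slice */
	int			ntuples;	/* tuples in the current slice */
	int			nextitem;	/* next unserved index in the page array */
} PageSlice;

/*
 * Serve the next slice of at most max_rows entries from a page holding
 * page_ntuples visible tuples.  Returns false once the page is exhausted,
 * which is the point where the real code would advance to the next page.
 */
static bool
serve_next_slice(PageSlice *s, const int *page, int page_ntuples, int max_rows)
{
	int			remaining = page_ntuples - s->nextitem;
	int			nserve;

	assert(max_rows > 0);

	if (remaining <= 0)
		return false;

	nserve = remaining < max_rows ? remaining : max_rows;

	/* Point at the slice first, then advance past it. */
	s->tuples = &page[s->nextitem];
	s->ntuples = nserve;
	s->nextitem += nserve;

	return true;
}
```

With a five-tuple page and max_rows = 2, successive calls serve slices of 2, 2, and 1 tuples, and the fourth call reports the page as exhausted. Note that after each call tuples points at the start of the slice just served, while nextitem already points past it.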
[application/octet-stream] v7-0004-SeqScan-add-batch-driven-variants-returning-slots.patch (12.6K, 6-v7-0004-SeqScan-add-batch-driven-variants-returning-slots.patch)
From e76a49df42dbf22a3169eb2e1d880d9282c1f02f Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 5 Mar 2026 11:28:16 +0900
Subject: [PATCH v7 4/5] SeqScan: add batch-driven variants returning slots
Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per-tuple AM crossings without
changing the node interface seen by parents.
SeqScanState gains a RowBatch pointer that holds the current batch
when batching is active. Batch state is localized to SeqScanState
-- no changes to PlanState or ScanState.
Add executor_batch_rows GUC (DEVELOPER_OPTIONS, default 64) to
control the maximum batch size. Setting it to 0 or 1 disables batching.
XXX currently ignored when reading from heapam tables.
Wire up runtime selection in ExecInitSeqScan via
SeqScanCanUseBatching(). When executor_batch_rows > 1, EPQ is
inactive, the scan is forward-only, and the relation's AM supports
batching, ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used with zero overhead.
Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled.
Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
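The runtime selection described in the commit message can be modeled as a pair of small pure functions. This is only a sketch: ToyScanCtx, ToyVariant, can_use_batching(), and pick_variant() are hypothetical names that flatten the inputs SeqScanCanUseBatching() and SeqScanInitBatching() read from the scan state, and nothing here touches real executor structs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of the inputs; not the patch's structs. */
typedef struct ToyScanCtx
{
	int			batch_rows;				/* models executor_batch_rows */
	bool		am_supports_batching;	/* models table_supports_batching() */
	bool		backward_requested;		/* models eflags & EXEC_FLAG_BACKWARD */
	bool		epq_active;				/* models es_epq_active != NULL */
} ToyScanCtx;

typedef enum ToyVariant
{
	VARIANT_PLAIN,
	VARIANT_QUAL,
	VARIANT_PROJECT,
	VARIANT_QUAL_PROJECT
} ToyVariant;

/* All four conditions must hold; a batch size of 0 or 1 fails the first. */
static bool
can_use_batching(const ToyScanCtx *c)
{
	assert(c != NULL);

	return c->batch_rows > 1 &&
		c->am_supports_batching &&
		!c->backward_requested &&
		!c->epq_active;
}

/* Mirrors the qual/projection split used to pick an ExecProcNode variant. */
static ToyVariant
pick_variant(bool has_qual, bool has_proj)
{
	if (!has_qual)
		return has_proj ? VARIANT_PROJECT : VARIANT_PLAIN;

	return has_proj ? VARIANT_QUAL_PROJECT : VARIANT_QUAL;
}
```

Because the first test is batch_rows > 1, a setting of 1 falls back to the non-batch path just as 0 does.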
src/backend/executor/nodeSeqscan.c | 278 ++++++++++++++++++++++
src/backend/utils/init/globals.c | 3 +
src/backend/utils/misc/guc_parameters.dat | 9 +
src/include/miscadmin.h | 1 +
src/include/nodes/execnodes.h | 2 +
5 files changed, 293 insertions(+)
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 04803b0e37d..d0ce8858c49 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -29,12 +29,17 @@
#include "access/relscan.h"
#include "access/tableam.h"
+#include "executor/execRowBatch.h"
#include "executor/execScan.h"
#include "executor/executor.h"
#include "executor/nodeSeqscan.h"
#include "utils/rel.h"
static TupleTableSlot *SeqNext(SeqScanState *node);
+static TupleTableSlot *ExecSeqScanBatchSlot(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithQual(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithProject(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithQualProject(PlanState *pstate);
/* ----------------------------------------------------------------
* Scan Support
@@ -205,6 +210,273 @@ ExecSeqScanEPQ(PlanState *pstate)
(ExecScanRecheckMtd) SeqRecheck);
}
+/* ----------------------------------------------------------------
+ * Batch Support
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * SeqScanCanUseBatching
+ * Check whether this SeqScan can use batch mode execution.
+ *
+ * Batching requires: the GUC is enabled, no EPQ recheck is active, the scan
+ * is forward-only, and the table AM supports batching with the current
+ * snapshot (see table_supports_batching()).
+ */
+static bool
+SeqScanCanUseBatching(SeqScanState *scanstate, int eflags)
+{
+ Relation relation = scanstate->ss.ss_currentRelation;
+
+ return executor_batch_rows > 1 &&
+ relation &&
+ table_supports_batching(relation,
+ scanstate->ss.ps.state->es_snapshot) &&
+ !(eflags & EXEC_FLAG_BACKWARD) &&
+ scanstate->ss.ps.state->es_epq_active == NULL;
+}
+
+/*
+ * SeqScanInitBatching
+ * Set up batch execution state and select the appropriate
+ * ExecProcNode variant for batch mode.
+ *
+ * Called from ExecInitSeqScan when SeqScanCanUseBatching returns true.
+ * Overwrites the ExecProcNode pointer set by the non-batch path.
+ */
+static void
+SeqScanInitBatching(SeqScanState *scanstate)
+{
+ RowBatch *batch = RowBatchCreate(MaxHeapTuplesPerPage);
+
+ batch->slot = scanstate->ss.ss_ScanTupleSlot;
+ scanstate->batch = batch;
+
+ /* Choose batch variant */
+ if (scanstate->ss.ps.qual == NULL)
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+ else
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+ }
+ else
+ {
+ if (scanstate->ss.ps.ps_ProjInfo == NULL)
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+ else
+ scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+ }
+}
+
+/*
+ * SeqScanResetBatching
+ * Reset or tear down batch execution state.
+ *
+ * When drop is false (rescan), resets the RowBatch and releases any
+ * AM-held resources like buffer pins, but keeps allocations for reuse.
+ * When drop is true (end of node), frees everything.
+ */
+static void
+SeqScanResetBatching(SeqScanState *scanstate, bool drop)
+{
+ RowBatch *b = scanstate->batch;
+
+ if (b)
+ {
+ RowBatchReset(b, drop);
+ if (b->am_payload)
+ {
+ if (drop)
+ {
+ table_scan_end_batch(scanstate->ss.ss_currentScanDesc, b);
+ b->am_payload = NULL;
+ }
+ else
+ table_scan_reset_batch(scanstate->ss.ss_currentScanDesc, b);
+ }
+ if (drop)
+ pfree(b);
+ }
+}
+
+/*
+ * SeqNextBatch
+ * Fetch the next batch of tuples from the table AM.
+ *
+ * Lazily initializes the scan descriptor and AM batch state on first
+ * call. Returns false at end of scan.
+ */
+static bool
+SeqNextBatch(SeqScanState *node)
+{
+ TableScanDesc scandesc;
+ EState *estate;
+ ScanDirection direction;
+ RowBatch *b = node->batch;
+
+ Assert(b != NULL);
+
+ /*
+ * get information from the estate and scan state
+ */
+ scandesc = node->ss.ss_currentScanDesc;
+ estate = node->ss.ps.state;
+ direction = estate->es_direction;
+ Assert(ScanDirectionIsForward(direction));
+
+ if (scandesc == NULL)
+ {
+ /*
+ * We reach here if the scan is not parallel, or if we're serially
+ * executing a scan that was planned to be parallel.
+ */
+ scandesc = table_beginscan(node->ss.ss_currentRelation,
+ estate->es_snapshot,
+ 0, NULL,
+ ScanRelIsReadOnly(&node->ss) ?
+ SO_HINT_REL_READ_ONLY : SO_NONE);
+ node->ss.ss_currentScanDesc = scandesc;
+ }
+
+ /* Lazily create the AM batch payload. */
+ if (b->am_payload == NULL)
+ {
+ const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+ Assert(tam && tam->scan_begin_batch);
+ table_scan_begin_batch(scandesc, b);
+ }
+
+ if (!table_scan_getnextbatch(scandesc, b, direction))
+ return false;
+
+ return true;
+}
+
+/*
+ * SeqScanBatchSlot
+ * Core loop for batch-driven SeqScan variants.
+ *
+ * Internally fetches tuples in batches from the table AM, but returns
+ * one slot at a time to preserve the single-slot interface expected by
+ * parent nodes. When the current batch is exhausted, fetches and
+ * materializes the next one.
+ *
+ * qual and projInfo are passed explicitly so the compiler can eliminate
+ * dead branches when inlined into the typed wrapper functions (e.g.
+ * ExecSeqScanBatchSlot passes NULL for both).
+ *
+ * EPQ is not supported in the batch path; asserted at entry.
+ */
+static inline TupleTableSlot *
+SeqScanBatchSlot(SeqScanState *node,
+ ExprState *qual, ProjectionInfo *projInfo)
+{
+ ExprContext *econtext = node->ss.ps.ps_ExprContext;
+ RowBatch *b = node->batch;
+
+ /* Batch path does not support EPQ */
+ Assert(node->ss.ps.state->es_epq_active == NULL);
+ Assert(RowBatchIsValid(b));
+
+ for (;;)
+ {
+ TupleTableSlot *in;
+
+ CHECK_FOR_INTERRUPTS();
+
+ /* Get next input slot from current batch, or refill */
+ if (!RowBatchHasMore(b))
+ {
+ if (!SeqNextBatch(node))
+ return NULL;
+ }
+
+ in = RowBatchGetNextSlot(b);
+ Assert(in);
+
+ /* No qual, no projection: direct return */
+ if (qual == NULL && projInfo == NULL)
+ return in;
+
+ ResetExprContext(econtext);
+ econtext->ecxt_scantuple = in;
+
+ /* Check qual if present */
+ if (qual != NULL && !ExecQual(qual, econtext))
+ {
+ InstrCountFiltered1(node, 1);
+ continue;
+ }
+
+ /* Project if needed, otherwise return scan tuple directly */
+ if (projInfo != NULL)
+ return ExecProject(projInfo);
+
+ return in;
+ }
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return SeqScanBatchSlot(node, NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ /*
+ * Use pg_assume() for != NULL tests to make the compiler realize no
+ * runtime check for the field is needed in SeqScanBatchSlot().
+ */
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ Assert(pstate->ps_ProjInfo == NULL);
+
+ return SeqScanBatchSlot(node, pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() for when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ Assert(pstate->qual == NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return SeqScanBatchSlot(node, NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() for when both qual evaluation and
+ * projection are required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+ SeqScanState *node = castNode(SeqScanState, pstate);
+
+ Assert(pstate->state->es_epq_active == NULL);
+ pg_assume(pstate->qual != NULL);
+ pg_assume(pstate->ps_ProjInfo != NULL);
+
+ return SeqScanBatchSlot(node, pstate->qual, pstate->ps_ProjInfo);
+}
+
/* ----------------------------------------------------------------
* ExecInitSeqScan
* ----------------------------------------------------------------
@@ -283,6 +555,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
}
+ if (SeqScanCanUseBatching(scanstate, eflags))
+ SeqScanInitBatching(scanstate);
+
return scanstate;
}
@@ -302,6 +577,8 @@ ExecEndSeqScan(SeqScanState *node)
*/
scanDesc = node->ss.ss_currentScanDesc;
+ SeqScanResetBatching(node, true);
+
/*
* close heap scan
*/
@@ -331,6 +608,7 @@ ExecReScanSeqScan(SeqScanState *node)
table_rescan(scan, /* scan desc */
NULL); /* new scan keys */
+ SeqScanResetBatching(node, false);
ExecScanReScan((ScanState *) node);
}
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 36ad708b360..535e29d7823 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int notify_buffers = 16;
int serializable_buffers = 32;
int subtransaction_buffers = 0;
int transaction_buffers = 0;
+
+/* executor batching */
+int executor_batch_rows = 64;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index a315c4ab8ab..a59b5d012a2 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1045,6 +1045,15 @@
boot_val => 'true',
},
+{ name => 'executor_batch_rows', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+ short_desc => 'Number of rows to include in batches during execution.',
+ flags => 'GUC_NOT_IN_SAMPLE',
+ variable => 'executor_batch_rows',
+ boot_val => '64',
+ min => '0',
+ max => '1024',
+},
+
{ name => 'exit_on_error', type => 'bool', context => 'PGC_USERSET', group => 'ERROR_HANDLING_OPTIONS',
short_desc => 'Terminate session on any error.',
variable => 'ExitOnAnyError',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7277c37e779..302c0e33165 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -288,6 +288,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
extern PGDLLIMPORT int VacuumCostBalance;
extern PGDLLIMPORT bool VacuumCostActive;
+extern PGDLLIMPORT int executor_batch_rows;
/* in utils/misc/stack_depth.c */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3ecae7552fc..0f8431ee854 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -70,6 +70,7 @@ typedef struct TupleTableSlot TupleTableSlot;
typedef struct TupleTableSlotOps TupleTableSlotOps;
typedef struct WalUsage WalUsage;
typedef struct WorkerNodeInstrumentation WorkerNodeInstrumentation;
+typedef struct RowBatch RowBatch;
/* ----------------
@@ -1670,6 +1671,7 @@ typedef struct SeqScanState
{
ScanState ss; /* its first field is NodeTag */
Size pscan_len; /* size of parallel heap scan descriptor */
+ RowBatch *batch; /* NULL if batching disabled */
} SeqScanState;
/* ----------------
--
2.47.3
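The control flow of SeqScanBatchSlot() (refill when the batch is drained, hand out one row per call, skip rows that fail the qual) can be illustrated with a toy model. Everything here is hypothetical: ToyBatch, ToyScan, toy_next_batch(), and toy_scan_next() are stand-ins invented for the example, ints replace tuples, a function pointer replaces ExecQual(), and projection is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_MAX 4				/* hypothetical batch capacity */

typedef struct ToyBatch
{
	int			rows[BATCH_MAX];
	int			nrows;			/* rows currently in the batch */
	int			pos;			/* next row to hand out */
} ToyBatch;

typedef struct ToyScan
{
	const int  *src;			/* models the table */
	int			nsrc;
	int			next;			/* next unread source row */
	ToyBatch	batch;
	int			nrefills;		/* non-empty refills, for illustration */
} ToyScan;

/* Refill the batch from the source; returns false at end of scan. */
static bool
toy_next_batch(ToyScan *scan)
{
	ToyBatch   *b = &scan->batch;
	int			n = 0;

	while (n < BATCH_MAX && scan->next < scan->nsrc)
		b->rows[n++] = scan->src[scan->next++];

	b->nrows = n;
	b->pos = 0;
	if (n > 0)
		scan->nrefills++;

	return n > 0;
}

/* Return the next row passing qual (NULL qual accepts all), or NULL at end. */
static const int *
toy_scan_next(ToyScan *scan, bool (*qual) (int))
{
	ToyBatch   *b = &scan->batch;

	assert(scan != NULL);

	for (;;)
	{
		const int  *in;

		/* Batch drained: fetch the next one or report end of scan. */
		if (b->pos >= b->nrows && !toy_next_batch(scan))
			return NULL;

		in = &b->rows[b->pos++];

		if (qual != NULL && !qual(*in))
			continue;			/* filtered out, try the next row */

		return in;
	}
}

static bool
is_even(int v)
{
	return v % 2 == 0;
}
```

Scanning ten rows with BATCH_MAX = 4 crosses into the "AM" three times for non-empty batches (4, 4, and 2 rows) instead of once per row, while the caller still sees one row per call, which is the trade the batch-driven variants aim for.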
end of thread, other threads:[~2026-04-06 12:02 UTC | newest]
Thread overview: 29+ messages
2025-09-26 13:28 Batching in executor Amit Langote <[email protected]>
2025-09-26 13:49 ` Bruce Momjian <[email protected]>
2025-09-30 02:15 ` Amit Langote <[email protected]>
2025-09-29 11:01 ` Tomas Vondra <[email protected]>
2025-09-30 02:11 ` Amit Langote <[email protected]>
2025-09-30 13:35 ` Amit Langote <[email protected]>
2025-10-10 06:40 ` Amit Langote <[email protected]>
2025-10-27 07:24 ` Amit Langote <[email protected]>
2025-10-27 16:18 ` Tomas Vondra <[email protected]>
2025-10-28 13:40 ` Amit Langote <[email protected]>
2025-10-28 14:32 ` Daniil Davydov <[email protected]>
2025-10-29 02:22 ` Amit Langote <[email protected]>
2025-10-30 12:12 ` Daniil Davydov <[email protected]>
2025-12-20 14:36 ` Amit Langote <[email protected]>
2025-10-29 06:37 ` Amit Langote <[email protected]>
2025-12-04 15:54 ` Amit Langote <[email protected]>
2025-12-20 14:12 ` Amit Langote <[email protected]>
2025-12-22 11:45 ` cca5507 <[email protected]>
2026-01-26 09:34 ` Daniil Davydov <[email protected]>
2026-01-27 03:00 ` Amit Langote <[email protected]>
2026-01-29 07:35 ` Amit Langote <[email protected]>
2026-01-29 10:04 ` Amit Langote <[email protected]>
2026-02-01 14:49 ` Junwang Zhao <[email protected]>
2026-02-03 13:30 ` cca5507 <[email protected]>
2026-02-03 15:54 ` Junwang Zhao <[email protected]>
2026-03-24 00:59 ` Amit Langote <[email protected]>
2026-04-06 12:02 ` Amit Langote <[email protected]>
2025-10-27 17:37 ` Peter Geoghegan <[email protected]>
2025-10-28 13:11 ` Amit Langote <[email protected]>