public inbox for [email protected]  
Batching in executor

From: Amit Langote @ 2025-09-26 13:28 UTC (permalink / raw)
  To: pgsql-hackers

At PGConf.dev this year we had an unconference session [1] on whether
the community can support an additional batch executor. The discussion
there led me to start hacking on $subject. I have also had off-list
discussions on this topic in recent months with Andres and David, who
have offered useful thoughts.

This patch series is an early attempt to make executor nodes pass
around batches of tuples instead of tuple-at-a-time slots. The main
motivation is to enable expression evaluation in batch form, which can
substantially reduce per-tuple overhead (mainly from function calls)
and open the door to further optimizations such as SIMD usage in
aggregate transition functions. We could even change algorithms of
some plan nodes to operate on batches when, for example, a child node
can return batches.

The expression evaluation changes are still exploratory. Before they
can be made ready for serious review, scan nodes need a way to produce
tuples in batches, and the executor needs an API that lets upper nodes
consume them.

The patch set is therefore structured in two parts. The first three
patches lay that groundwork in the executor and table AM, and the
later patches prototype batch-aware expression evaluation.

Patches 0001-0003 introduce a new batch table AM API and an initial
heapam implementation that can return multiple tuples per call.
SeqScan is adapted to use this interface, with new ExecSeqScanBatch*
routines that fetch tuples in bulk but can still return one
TupleTableSlot at a time to preserve compatibility. On the executor
side, ExecProcNodeBatch() is added alongside ExecProcNode(), with
TupleBatch as the new container for passing groups of tuples. ExecScan
has batch-aware variants that use the AM API internally, but can fall
back to row-at-a-time behavior when required. Plan shapes and EXPLAIN
output remain unchanged; the differences here are executor-internal.
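
To make the shape of that interface concrete, here is a minimal,
self-contained sketch of a batch-returning scan with a row-at-a-time
wrapper layered on top. It is not code from the patches: ToyBatch,
ToyScan and the toy_* functions are hypothetical stand-ins for
TupleBatch, the scan state and the ExecSeqScanBatch* routines, and the
"table" is just an in-memory int array.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH_ROWS 64		/* mirrors EXEC_BATCH_ROWS from the mail */

/* Hypothetical stand-in for TupleBatch: a fixed-capacity array of rows. */
typedef struct ToyBatch
{
	int			rows[TOY_BATCH_ROWS];
	int			nvalid;			/* rows filled in by the producer */
	int			next;			/* cursor for row-at-a-time consumers */
} ToyBatch;

/* Hypothetical scan state over an in-memory array. */
typedef struct ToyScan
{
	const int  *data;
	size_t		ndata;
	size_t		pos;
	ToyBatch	batch;
} ToyScan;

/* Batch interface: fill up to TOY_BATCH_ROWS rows; NULL at end of scan. */
static ToyBatch *
toy_scan_next_batch(ToyScan *scan)
{
	ToyBatch   *b = &scan->batch;

	assert(scan->data != NULL || scan->ndata == 0);
	b->nvalid = 0;
	b->next = 0;
	while (scan->pos < scan->ndata && b->nvalid < TOY_BATCH_ROWS)
		b->rows[b->nvalid++] = scan->data[scan->pos++];
	return b->nvalid > 0 ? b : NULL;
}

/*
 * Row interface built on the batch interface: rows are still fetched in
 * bulk, but handed out one at a time.  Returns 0 at end of scan.
 */
static int
toy_scan_next_row(ToyScan *scan, int *row)
{
	ToyBatch   *b = &scan->batch;

	if (b->next >= b->nvalid && toy_scan_next_batch(scan) == NULL)
		return 0;
	*row = b->rows[b->next++];
	return 1;
}
```

A caller that understands batches uses toy_scan_next_batch() directly,
while an unmodified caller keeps using toy_scan_next_row() and sees the
same rows in the same order.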

At present, heapam batches are restricted to tuples from a single
page, which means they may not always fill EXEC_BATCH_ROWS (currently
64). That limits how much upper executor nodes can leverage batching,
especially with selective quals where batches may end up sparsely
populated. A future improvement would be to allow batches to span
pages, or to let the scan node request more tuples when its buffer is
not yet full, so that it avoids passing a mostly empty TupleBatch to
upper nodes.
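
The effect of the single-page restriction can be illustrated with a
small model. The sketch below is an assumption-laden toy, not the
heapam code: ToyPagedScan and both toy_next_batch_* functions are
hypothetical, and a "page" is reduced to a count of visible rows. It
contrasts a policy that ends each batch at the page boundary with one
that keeps filling from following pages.

```c
#include <assert.h>

#define TOY_BATCH_ROWS 64		/* mirrors EXEC_BATCH_ROWS from the mail */

/* Hypothetical scan position over pages described only by row counts. */
typedef struct ToyPagedScan
{
	const int  *rows_per_page;	/* visible rows on each page */
	int			npages;
	int			page;			/* current page */
	int			offset;			/* rows already consumed from that page */
} ToyPagedScan;

/* Single-page policy: a batch never crosses a page.  Returns 0 at end. */
static int
toy_next_batch_single_page(ToyPagedScan *s)
{
	while (s->page < s->npages)
	{
		int			avail = s->rows_per_page[s->page] - s->offset;
		int			take = avail < TOY_BATCH_ROWS ? avail : TOY_BATCH_ROWS;

		assert(avail >= 0);
		if (take > 0)
		{
			s->offset += take;
			return take;
		}
		/* current page exhausted, move on */
		s->page++;
		s->offset = 0;
	}
	return 0;
}

/* Spanning policy: keep pulling from later pages until the batch is full. */
static int
toy_next_batch_spanning(ToyPagedScan *s)
{
	int			n = 0;

	while (n < TOY_BATCH_ROWS && s->page < s->npages)
	{
		int			avail = s->rows_per_page[s->page] - s->offset;
		int			room = TOY_BATCH_ROWS - n;
		int			take = avail < room ? avail : room;

		assert(avail >= 0);
		n += take;
		s->offset += take;
		if (s->offset >= s->rows_per_page[s->page])
		{
			s->page++;
			s->offset = 0;
		}
	}
	return n;
}
```

With three pages of 43 rows each, the single-page policy yields batches
of 43, 43 and 43 rows, while the spanning policy yields 64, 64 and 1.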

It might also be worth adding some lightweight instrumentation to make
it easier to reason about batch behavior. For example, counters for
average rows per batch, reasons why a batch ended (capacity reached,
page boundary, end of scan), or batches per million rows could help
confirm whether limitations like the single-page restriction or
EXEC_BATCH_ROWS size are showing up in benchmarks. Suggestions from
others on which forms of instrumentation would be most useful are
welcome.
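
As a starting point for discussion, the counters mentioned above could
look roughly like the following. This is only a sketch: ToyBatchStats,
ToyBatchEndReason and the toy_stats_* helpers are hypothetical names
and nothing like them exists in the patches yet.

```c
#include <assert.h>
#include <stdint.h>

/* Why the producer stopped filling a batch (hypothetical enum). */
typedef enum ToyBatchEndReason
{
	TOY_END_CAPACITY,			/* batch reached its row limit */
	TOY_END_PAGE_BOUNDARY,		/* ran out of rows on the current page */
	TOY_END_SCAN,				/* no more input */
	TOY_END_NREASONS
} ToyBatchEndReason;

/* Hypothetical per-node counters. */
typedef struct ToyBatchStats
{
	uint64_t	nbatches;
	uint64_t	nrows;
	uint64_t	end_reason[TOY_END_NREASONS];
} ToyBatchStats;

/* Record one produced batch and the reason it ended. */
static void
toy_stats_record(ToyBatchStats *st, int nrows, ToyBatchEndReason why)
{
	assert(nrows >= 0);
	assert((int) why >= 0 && (int) why < TOY_END_NREASONS);
	st->nbatches++;
	st->nrows += (uint64_t) nrows;
	st->end_reason[why]++;
}

/* Average rows per batch, or 0 if no batch was recorded. */
static double
toy_stats_avg_rows(const ToyBatchStats *st)
{
	return st->nbatches ? (double) st->nrows / (double) st->nbatches : 0.0;
}

/* Batches needed per million rows, or 0 if no row was recorded. */
static double
toy_stats_batches_per_million(const ToyBatchStats *st)
{
	return st->nrows ? (double) st->nbatches * 1e6 / (double) st->nrows : 0.0;
}
```

A scan that mostly reports TOY_END_PAGE_BOUNDARY with an average well
below 64 rows would point at the single-page restriction rather than at
the EXEC_BATCH_ROWS setting.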

Patches 0004 onwards start experimenting with making expression
evaluation batch-aware, first in the aggregate node. These patches add
new EEOPs (ExprEvalOps and ExprEvalSteps) to fetch attributes into
TupleBatch vectors, evaluate quals across a batch, and run aggregate
transitions over multiple rows at once. Agg is extended to pull
TupleBatch from its child via ExecProcNodeBatch(), with two prototype
paths: one that loops inside the interpreter and another that calls
the transition function once per batch using AggBulkArgs. These are
still PoCs, but with scan nodes and the executor capable of moving
batches around, they provide a base from which the work can be refined
into something potentially committable after the usual polish,
testing, and review.
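
The difference between the two prototype paths can be shown with a toy
model outside the executor. The sketch below assumes a sum over int32
values; the typedefs and toy_* functions are hypothetical and do not
use fmgr or AggBulkArgs. The first path makes one indirect call per
non-NULL row, the second makes one indirect call per batch and loops
inside the callee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-row and per-batch transition function signatures. */
typedef int64_t (*toy_transfn_row) (int64_t state, int32_t value);
typedef int64_t (*toy_transfn_bulk) (int64_t state, const int32_t *values,
									 const bool *isnull, int nrows);

/* Per-row transition: add one value to the running sum. */
static int64_t
toy_sum_row(int64_t state, int32_t value)
{
	return state + value;
}

/* Per-batch transition: add every non-NULL value in the vector. */
static int64_t
toy_sum_bulk(int64_t state, const int32_t *values,
			 const bool *isnull, int nrows)
{
	for (int i = 0; i < nrows; i++)
	{
		if (!isnull[i])
			state += values[i];
	}
	return state;
}

/* Path 1: the caller loops and pays one indirect call per row. */
static int64_t
toy_advance_rowloop(toy_transfn_row fn, int64_t state,
					const int32_t *values, const bool *isnull, int nrows)
{
	assert(nrows >= 0);
	for (int i = 0; i < nrows; i++)
	{
		if (!isnull[i])
			state = fn(state, values[i]);
	}
	return state;
}

/* Path 2: the caller pays one indirect call for the whole batch. */
static int64_t
toy_advance_direct(toy_transfn_bulk fn, int64_t state,
				   const int32_t *values, const bool *isnull, int nrows)
{
	assert(nrows >= 0);
	return fn(state, values, isnull, nrows);
}
```

Both paths must produce the same transition state; only the number of
calls differs, which is where the per-batch variant saves overhead.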

One area that needs more thought is how TupleBatch interacts with
ExprContext. At present the patches extend ExprContext with
scan_batch, inner_batch, and outer_batch fields, but per-batch
evaluation still spills into ecxt_per_tuple_memory, effectively
reusing the per-tuple context for per-batch work. That’s arguably an
abuse of the contract described in ExecEvalExprSwitchContext(), and it
will need a cleaner definition of how batch-scoped memory should be
managed. Feedback on how best to structure that would be particularly
helpful.
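
To frame that question, one possible shape is a separate batch-scoped
allocation area that is reset once per batch, next to the existing
per-tuple one. The sketch below is purely illustrative and is not what
the patches do: ToyArena is a trivial bump allocator standing in for a
memory context, and ToyExprContext with its per_batch field is a
hypothetical structure.

```c
#include <assert.h>
#include <stddef.h>

/* Trivial bump allocator standing in for a memory context. */
typedef struct ToyArena
{
	size_t		used;
	unsigned char buf[4096];
} ToyArena;

/* Allocate size bytes, rounded up to 8; NULL if the arena is full. */
static void *
toy_arena_alloc(ToyArena *a, size_t size)
{
	size_t		aligned;
	void	   *p;

	if (size > sizeof(a->buf))
		return NULL;
	aligned = (size + 7) & ~(size_t) 7;
	if (aligned > sizeof(a->buf) - a->used)
		return NULL;
	p = a->buf + a->used;
	a->used += aligned;
	return p;
}

/* Free everything allocated from the arena at once. */
static void
toy_arena_reset(ToyArena *a)
{
	a->used = 0;
}

/* Hypothetical context with distinct per-tuple and per-batch lifetimes. */
typedef struct ToyExprContext
{
	ToyArena	per_tuple;		/* reset after every row */
	ToyArena	per_batch;		/* reset after every batch */
} ToyExprContext;

/* Double each row, keeping the result vector alive for the whole batch. */
static long
toy_eval_batch(ToyExprContext *ec, const int *rows, int nrows)
{
	long	   *vec = toy_arena_alloc(&ec->per_batch,
									  sizeof(long) * (size_t) nrows);
	long		total = 0;

	assert(vec != NULL);
	for (int i = 0; i < nrows; i++)
	{
		int		   *tmp = toy_arena_alloc(&ec->per_tuple, sizeof(int));

		assert(tmp != NULL);
		*tmp = rows[i] * 2;
		vec[i] = *tmp;
		toy_arena_reset(&ec->per_tuple);	/* row-scoped scratch is gone */
	}
	for (int i = 0; i < nrows; i++)
		total += vec[i];
	toy_arena_reset(&ec->per_batch);		/* batch-scoped data is gone */
	return total;
}
```

The point of the split is that per-row scratch can still be released
after each row without invalidating vectors that must survive until the
batch has been fully processed.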

To evaluate the overheads and benefits, I ran microbenchmarks with
single and multi-aggregate queries on a single table, with and without
WHERE clauses. Tables were fully VACUUMed so visibility maps are set
and IO costs are minimal. shared_buffers was large enough to fit the
whole table (up to 10M rows, ~43 rows per page), and all pages were
prewarmed into cache before tests. The table schema and script are at [2].
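
The ~43 rows per page figure can be cross-checked with some arithmetic
on the bar_N schema from [2]. The sketch below is a back-of-the-envelope
calculation, not something taken from the patches. It assumes the
default 8 kB block size, a 24-byte page header, 4-byte line pointers, a
24-byte tuple header with no null bitmap, a 1-byte short varlena header
for the 100-character text value, 4-byte alignment for int columns and
8-byte alignment of the whole tuple. The toy_* names are hypothetical.

```c
#include <assert.h>

/* Round off up to the next multiple of align. */
static int
toy_align_up(int off, int align)
{
	assert(align > 0);
	return (off + align - 1) / align * align;
}

/* Estimated on-disk size of one bar_N tuple under the stated assumptions. */
static int
toy_bar_tuple_size(void)
{
	int			off = 24;		/* assumed tuple header, no null bitmap */

	off = toy_align_up(off, 4) + 8 * 4;	/* columns a..h: eight int4 */
	off += 1 + 100;				/* column i: short header + 100 bytes */
	off = toy_align_up(off, 4) + 6 * 4;	/* columns j..o: six int4 */
	return toy_align_up(off, 8);	/* whole tuple is 8-byte aligned */
}

/* Tuples that fit on one page, each also needing a line pointer. */
static int
toy_rows_per_page(int tuple_size)
{
	const int	page_size = 8192;	/* assumed default block size */
	const int	page_header = 24;	/* assumed page header size */
	const int	line_pointer = 4;	/* assumed line pointer size */

	assert(tuple_size > 0);
	return (page_size - page_header) / (tuple_size + line_pointer);
}
```

Under these assumptions a tuple takes 184 bytes and 43 of them fit on a
page, which matches the figure above and means a single-page batch can
fill at most about two thirds of a 64-row TupleBatch.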

Observations from benchmarking (detailed benchmark tables are at [3];
below is only a high-level summary of the main patterns):

* Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
sum(a) FROM bar_N): batching scan output alone improved latency by
~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
especially once fmgr overhead was paid per batch instead of per row.

* Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
qual interpreter gave a big step up, with latencies dropping by
~30-40% compared to batching=off.

* Five aggregates, no WHERE: batching input from the child scan cut
~15% off runtime. Adding batched transition evaluation increased
improvements to ~30%.

* Five aggregates, with WHERE: modest gains from scan/input batching,
but per-batch transition evaluation and batched quals brought ~20-30%
improvement.

* Across all cases, executor overheads became visible only after IO
was minimized. Once executor cost dominated, batching consistently
reduced CPU time, with the largest benefits coming from avoiding
per-row fmgr calls and evaluating quals across batches.

I would appreciate it if others could try these patches with their own
microbenchmarks or workloads and see whether they can reproduce
numbers similar to mine. Feedback on both the general direction and
the details of the patches would be very helpful. In particular,
patches 0001-0003, which add the basic batch APIs and integrate them
into SeqScan, are intended to be the first candidates for review and
eventual commit. Comments on the later, more experimental patches
(aggregate input batching, and batching of qual and aggregate
transition evaluation) are also welcome.

--
Thanks, Amit Langote

[1] https://wiki.postgresql.org/wiki/PGConf.dev_2025_Developer_Unconference#Can_the_Community_Support_an...

[2] Tables:
cat create_tables.sh
for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
  psql -c "drop table if exists bar_$i;
           create table bar_$i (a int, b int, c int, d int, e int, f int,
             g int, h int, i text, j int, k int, l int, m int, n int, o int);" 2>&1 > /dev/null
  psql -c "insert into bar_$i
           select i, i, i, i, i, i, i, i, repeat('x', 100), i, i, i, i, i, i
           from generate_series(1, $i) i;" 2>&1 > /dev/null
  echo "bar_$i created."
done

[3] Benchmark result tables

All timings are in milliseconds. off = executor_batching off, on =
executor_batching on.  Negative %diff means on is better than off.
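
For anyone reproducing the numbers, the %diff column is consistent with
the relative change of "on" against "off". The helpers below are a
small sketch of that calculation; toy_pct_diff and toy_speedup are
hypothetical names, not part of any benchmark script in this thread.

```c
#include <assert.h>

/* Relative change of the batched timing against the baseline, in percent. */
static double
toy_pct_diff(double off_ms, double on_ms)
{
	assert(off_ms > 0.0);
	return (on_ms - off_ms) / off_ms * 100.0;
}

/* The same comparison expressed as a speedup factor (off / on). */
static double
toy_speedup(double off_ms, double on_ms)
{
	assert(on_ms > 0.0);
	return off_ms / on_ms;
}
```

For the first 1M row of the batched-seqscan table, off = 10.448 and
on = 8.147 give (8.147 - 10.448) / 10.448 * 100, which rounds to -22.0.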

Single aggregate, no WHERE
(~20% faster with scan batching only; ~40%+ faster with batched transitions)

With only batched-seqscan (0001-0003):
Rows    off       on       %diff
1M      10.448    8.147    -22.0
2M      18.442    14.552   -21.1
3M      25.296    22.195   -12.3
4M      36.285    33.383   -8.0
5M      44.441    39.894   -10.2
10M     93.110    82.744   -11.1

With batched-agg on top (0001-0007):
Rows    off       on       %diff
1M      9.891     5.579    -43.6
2M      17.648    9.653    -45.3
3M      27.451    13.919   -49.3
4M      36.394    24.269   -33.3
5M      44.665    29.260   -34.5
10M     87.898    56.221   -36.0

Single aggregate, with WHERE
(~30–40% faster once quals + transitions are batched)

With only batched-seqscan (0001-0003):
Rows    off       on       %diff
1M      18.485    17.749   -4.0
2M      34.696    33.033   -4.8
3M      49.582    46.155   -6.9
4M      70.270    67.036   -4.6
5M      84.616    81.013   -4.3
10M     174.649   164.611  -5.7

With batched-agg and batched-qual on top (0001-0008):
Rows    off       on       %diff
1M      18.887    12.367   -34.5
2M      35.706    22.457   -37.1
3M      51.626    30.902   -40.1
4M      72.694    48.214   -33.7
5M      88.103    57.623   -34.6
10M     181.350   124.278  -31.5

Five aggregates, no WHERE
(~15% faster with scan/input batching; ~30% with batched transitions)

Agg input batching only (0001-0004):
Rows    off       on       %diff
1M      23.193    19.196   -17.2
2M      42.177    35.862   -15.0
3M      62.192    51.121   -17.8
4M      83.215    74.665   -10.3
5M      99.426    91.904   -7.6
10M     213.794   184.263  -13.8

Batched transition eval, per-row fmgr (0001-0006):
Rows    off       on       %diff
1M      23.501    19.672   -16.3
2M      44.128    36.603   -17.0
3M      64.466    53.079   -17.7
5M      103.442   97.623   -5.6
10M     219.120   190.354  -13.1

Batched transition eval, per-batch fmgr (0001-0007):
Rows    off       on       %diff
1M      24.238    16.806   -30.7
2M      43.056    30.939   -28.1
3M      62.938    43.295   -31.2
4M      83.346    63.357   -24.0
5M      100.772   78.351   -22.2
10M     213.755   162.203  -24.1

Five aggregates, with WHERE
(~10–15% faster with scan/input batching; ~30% with batched transitions + quals)

Agg input batching only (0001-0004):
Rows    off       on       %diff
1M      24.261    22.744   -6.3
2M      45.802    41.712   -8.9
3M      79.311    72.732   -8.3
4M      107.189   93.870   -12.4
5M      129.172   115.300  -10.7
10M     278.785   236.275  -15.2

Batched transition eval, per-batch fmgr (0001-0007):
Rows    off       on       %diff
1M      24.354    19.409   -20.3
2M      46.888    36.687   -21.8
3M      82.147    57.683   -29.8
4M      109.616   76.471   -30.2
5M      133.777   94.776   -29.2
10M     282.514   194.954  -31.0

Batched transition eval + batched qual (0001-0008):
Rows    off       on       %diff
1M      24.691    20.193   -18.2
2M      47.182    36.530   -22.6
3M      82.030    58.663   -28.5
4M      110.573   76.500   -30.8
5M      136.701   93.299   -31.7
10M     280.551   191.021  -31.9


Attachments:

  [application/octet-stream] v1-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch (11.2K, 2-v1-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch)
From 0bdb18284cb034cf80ac56125b5682e84b856a26 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 9 Sep 2025 21:43:29 +0900
Subject: [PATCH v1 7/8] WIP: Add EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT

The new EEOP runs a plain aggregate transition over a TupleBatch with
a single fmgr call. Batch vectors are passed to the transfn via
AggBulkArgs stored in fcinfo->flinfo->fn_extra, avoiding per-row fmgr
overhead.

Gate selection with AggTransfnSupportsBulk(), an allowlist of
built-in transfns updated to accept AggBulkArgs.  Some integer
transfns are taught to read AggBulkArgs when present, else fall
back. Rowloop batching remains available; unsupported aggregates keep
the row path.
---
 src/backend/executor/execExpr.c       | 28 ++++++++++++++++-
 src/backend/executor/execExprInterp.c | 43 ++++++++++++++++++++++++++
 src/backend/executor/nodeAgg.c        |  1 -
 src/backend/jit/llvm/llvmjit_expr.c   |  1 +
 src/backend/utils/adt/int.c           | 32 +++++++++++++++++++
 src/backend/utils/adt/int8.c          | 44 +++++++++++++++++++++++++++
 src/backend/utils/adt/numeric.c       | 17 +++++++++++
 src/include/executor/execExpr.h       |  1 +
 src/include/executor/executor.h       | 20 ++++++++++++
 9 files changed, 185 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index af5ed8b6368..27a5780f557 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -47,6 +47,7 @@
 #include "utils/acl.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
+#include "utils/fmgroids.h"
 #include "utils/jsonfuncs.h"
 #include "utils/jsonpath.h"
 #include "utils/lsyscache.h"
@@ -3692,6 +3693,28 @@ AggTransCanUseBatch(AggState *as, AggStatePerTrans pt)
 	return true;
 }
 
+/* Return true if this transfn OID is known to accept AggBulkArgs. */
+static bool
+AggTransfnSupportsBulk(Oid fn_oid)
+{
+	/* Phase 1: hard-coded allowlist of built-in transfns that accept AggBulkArgs. */
+	static const Oid ok[] =
+	{
+		F_INT8INC_ANY,		/* COUNT(*) transfn */
+		F_INT8INC,			/* COUNT(arg) transfn */
+		F_INT4_SUM,			/* SUM(int) transfn */
+		F_INT4SMALLER,		/* MIN(int) transfn */
+		F_INT4LARGER,		/* MAX(int) transfn */
+		/* add other transfns here as they are made bulk-aware */
+		InvalidOid
+	};
+
+	for (int i = 0; OidIsValid(ok[i]); i++)
+		if (ok[i] == fn_oid)
+			return true;
+	return false;
+}
+
 /*
  * Build transition/combine function invocations for all aggregate transition
  * / combination function invocations in a grouping sets phase. This has to
@@ -4150,7 +4173,10 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 		{
 			if (bv)
 				bvs = BatchVectorSliceFromExprArgs(pertrans->aggref->args, bv);
-			scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+			if (!AggTransfnSupportsBulk(pertrans->transfn_oid))
+				scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+			else
+				scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT;
 		}
 		else if (pertrans->transtypeByVal)
 		{
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3176679b346..41ad9b4838d 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -607,6 +607,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
 		&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
 		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
+		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,
 		&&CASE_EEOP_LAST
 	};
 
@@ -2345,6 +2346,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			EEO_NEXT();
 		}
 
+		EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT)
+		{
+			/* too complex for an inline implementation */
+			ExecAggPlainTransBatch(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
 		EEO_CASE(EEOP_LAST)
 		{
 			/* unreachable */
@@ -6138,6 +6147,40 @@ ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext
 				pergroup->transValueIsNull = fcinfo->isnull;
 			}
 			break;
+
+		case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
+			{
+				void *save = fcinfo->flinfo->fn_extra;
+				AggBulkArgs ba = {batch_nrows, start_row};
+
+				if (bvs)
+				{
+					const BatchVector *bv = bvs->bv;
+
+					Assert(bv);
+					ba.nargs = bvs->nargs;
+					ba.argoffs = bvs->argoffs;
+					ba.args = bv->cols;
+					ba.isnull = bv->nulls;
+					ba.hasnull = bv->hasnull;
+				}
+				fcinfo->flinfo->fn_extra = &ba;
+				fcinfo->args[0].value = pergroup->transValue;
+				fcinfo->args[0].isnull = pergroup->transValueIsNull;
+				fcinfo->isnull = false;		/* just in case transfn doesn't set it */
+				newVal = FunctionCallInvoke(fcinfo);   /* one call for the entire slice */
+				if (!pertrans->transtypeByVal &&
+					DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+					newVal = ExecAggCopyTransValue(aggstate, pertrans,
+												   newVal, fcinfo->isnull,
+												   pergroup->transValue,
+												   pergroup->transValueIsNull);
+				pergroup->transValue = newVal;
+				pergroup->transValueIsNull = fcinfo->isnull;
+				fcinfo->flinfo->fn_extra = save;
+			}
+			break;
+
 		default:
 			elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
 	}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 662d8bef43b..a2286ef5e54 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2687,7 +2687,6 @@ agg_retrieve_direct_batch(AggState *aggstate)
 
 	initialize_aggregates(aggstate, aggstate->pergroups,
 						  Max(aggstate->phase->numsets, 1));
-
 	if (aggstate->grp_firstTuple)
 	{
 		ExecForceStoreHeapTuple(aggstate->grp_firstTuple, firstSlot, true);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index efb3ee639fc..45346124bd7 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3026,6 +3026,7 @@ llvm_compile_expr(ExprState *state)
 				LLVMBuildBr(b, opblocks[opno + 1]);
 				break;
 
+			case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
 			case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
 				build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
 								v_state, op, v_econtext);
diff --git a/src/backend/utils/adt/int.c b/src/backend/utils/adt/int.c
index b5781989a64..eb1780b5590 100644
--- a/src/backend/utils/adt/int.c
+++ b/src/backend/utils/adt/int.c
@@ -1363,18 +1363,50 @@ int2smaller(PG_FUNCTION_ARGS)
 Datum
 int4larger(PG_FUNCTION_ARGS)
 {
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
 	int32		arg1 = PG_GETARG_INT32(0);
 	int32		arg2 = PG_GETARG_INT32(1);
 
+	if (unlikely(ba))
+	{
+		int32 result = arg1;
+
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+				if (arg2 > result)
+					result = arg2;
+			}
+		}
+		PG_RETURN_INT32(result);
+	}
 	PG_RETURN_INT32((arg1 > arg2) ? arg1 : arg2);
 }
 
 Datum
 int4smaller(PG_FUNCTION_ARGS)
 {
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
 	int32		arg1 = PG_GETARG_INT32(0);
 	int32		arg2 = PG_GETARG_INT32(1);
 
+	if (unlikely(ba))
+	{
+		int32 result = arg1;
+
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+				if (arg2 < result)
+					result = arg2;
+			}
+		}
+		PG_RETURN_INT32(result);
+	}
 	PG_RETURN_INT32((arg1 < arg2) ? arg1 : arg2);
 }
 
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index bdea490202a..bbabf4e0785 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -461,10 +461,28 @@ int8up(PG_FUNCTION_ARGS)
 Datum
 int8pl(PG_FUNCTION_ARGS)
 {
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
 	int64		arg1 = PG_GETARG_INT64(0);
 	int64		arg2 = PG_GETARG_INT64(1);
 	int64		result;
 
+	if (unlikely(ba))
+	{
+		result = arg1;
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				arg2 = DatumGetInt64(ba->args[ba->argoffs[0]][i]);
+				if (unlikely(pg_add_s64_overflow(arg1, arg2, &result)))
+					ereport(ERROR,
+							(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+							 errmsg("bigint out of range")));
+				arg1 = result;
+			}
+		}
+		PG_RETURN_INT64(result);
+	}
 	if (unlikely(pg_add_s64_overflow(arg1, arg2, &result)))
 		ereport(ERROR,
 				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
@@ -718,9 +736,35 @@ int8lcm(PG_FUNCTION_ARGS)
 Datum
 int8inc(PG_FUNCTION_ARGS)
 {
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
 	int64		arg = PG_GETARG_INT64(0);
 	int64		result;
 
+	if (unlikely(ba))
+	{
+		result = arg;
+		if (!ba->hasnull || ba->nargs == 0)
+		{
+			if (unlikely(pg_add_s64_overflow(arg, ba->nrows, &result)))
+					ereport(ERROR,
+							(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+							 errmsg("bigint out of range")));
+			PG_RETURN_INT64(result);
+		}
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				if (unlikely(pg_add_s64_overflow(arg, 1, &result)))
+					ereport(ERROR,
+							(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+							 errmsg("bigint out of range")));
+				arg = result;
+			}
+		}
+		PG_RETURN_INT64(result);
+	}
+
 	if (unlikely(pg_add_s64_overflow(arg, 1, &result)))
 		ereport(ERROR,
 				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 76269918593..b02664c97f5 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -6310,6 +6310,23 @@ int4_sum(PG_FUNCTION_ARGS)
 {
 	int64		oldsum;
 	int64		newval;
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
+
+	if (unlikely(ba))
+	{
+		int64	result = (!PG_ARGISNULL(0) ? PG_GETARG_INT64(0) : 0);
+
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				int32	arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+
+				result = result + arg2;
+			}
+		}
+		PG_RETURN_INT64(result);
+	}
 
 	if (PG_ARGISNULL(0))
 	{
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 1d33e084b69..f24782ecf58 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -304,6 +304,7 @@ typedef enum ExprEvalOp
 
 	/* Batched aggregate trans evaluation */
 	EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,	/* per-row fmgr calls */
+	EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,	/* call transfn once with AggBulkArgs */
 
 	/* non-existent operation, used e.g. to check array lengths */
 	EEOP_LAST
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 5ba9a523970..c72bd755b79 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -561,6 +561,26 @@ ExecQualAndReset(ExprState *state, ExprContext *econtext)
 }
 #endif
 
+#ifndef FRONTEND
+/* Per-call bulk argument vectors for batched aggregate trans functions. */
+typedef struct AggBulkArgs
+{
+	int		nrows;		/* number of rows in this batch */
+	int		start_row;
+	int16  *argoffs;
+	int		nargs;		/* number of argument vectors */
+	Datum  **args;		/* args[j][i] = j-th arg at row i */
+	bool   **isnull;	/* isnull[j][i] */
+	bool	hasnull;	/* is any datum in args NULL? */
+} AggBulkArgs;
+
+static inline AggBulkArgs *
+AggGetBulkArgs(FunctionCallInfo fcinfo)
+{
+	return (AggBulkArgs *) (fcinfo->flinfo ? fcinfo->flinfo->fn_extra : NULL);
+}
+#endif
+
 extern bool ExecCheck(ExprState *state, ExprContext *econtext);
 
 /*
-- 
2.43.0



  [application/octet-stream] v1-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch (6.3K, 3-v1-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch)
From d5ff8e14add86233afd3c82935d4f72a31859a57 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 4 Sep 2025 22:55:25 +0900
Subject: [PATCH v1 4/8] WIP: Add agg_retrieve_direct_batch() for plain
 aggregates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Teach Agg to consume child tuples in batches for AGG_PLAIN. A new
agg_retrieve_direct_batch() pulls TupleBatch from the child via
ExecProcNodeBatch(), materializes as needed, and advances per-agg
transition state over the batch. A first tuple is copied to match
the direct path’s behavior before batch processing.

Add AggCanUsePlainBatch() and select retrieve_plain at init:
batch path when no grouping sets, strategy is AGG_PLAIN, and the
child exposes ExecProcNodeBatch(); otherwise keep the row path.

Plan shape and EXPLAIN remain unchanged. Semantics are identical
to the non-batch direct path; this only reduces per-tuple overhead.
---
 src/backend/executor/nodeAgg.c | 123 +++++++++++++++++++++++++++++++++
 src/include/nodes/execnodes.h  |   5 ++
 2 files changed, 128 insertions(+)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a4f3d30f307..3ace6363509 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -820,6 +820,20 @@ advance_aggregates(AggState *aggstate)
 									  aggstate->tmpcontext);
 }
 
+static void
+advance_aggregates_batch(AggState *aggstate, TupleBatch *b)
+{
+	ExprContext *tmpcontext = aggstate->tmpcontext;
+	ExprState *evaltrans = aggstate->phase->evaltrans;
+
+	while (TupleBatchHasMore(b))
+	{
+		tmpcontext->ecxt_outertuple = TupleBatchGetNextSlot(b);
+		ExecEvalExprNoReturnSwitchContext(evaltrans, tmpcontext);
+		ResetExprContext(tmpcontext);
+	}
+}
+
 /*
  * Run the transition function for a DISTINCT or ORDER BY aggregate
  * with only one input.  This is called after we have completed
@@ -2260,6 +2274,9 @@ ExecAgg(PlanState *pstate)
 				result = agg_retrieve_hash_table(node);
 				break;
 			case AGG_PLAIN:
+				/* init-time choice */
+				result = node->retrieve_plain(node);
+				break;
 			case AGG_SORTED:
 				result = agg_retrieve_direct(node);
 				break;
@@ -2618,6 +2635,91 @@ agg_retrieve_direct(AggState *aggstate)
 	return NULL;
 }
 
+static TupleTableSlot *
+agg_retrieve_direct_batch(AggState *aggstate)
+{
+	PlanState *child = outerPlanState(aggstate);
+	ExprContext *econtext = aggstate->ss.ps.ps_ExprContext;
+	ExprContext *tmpcontext = aggstate->tmpcontext;
+	const bool hasGroupingSets = aggstate->phase->numsets > 0;
+	TupleTableSlot *firstSlot = aggstate->ss.ss_ScanTupleSlot;
+	TupleBatch *b = NULL;
+
+	Assert(child->ExecProcNodeBatch);
+
+	/* mimic the first-tuple copy from agg_retrieve_direct() */
+	for (;;)
+	{
+		b = ExecProcNodeBatch(child);
+		if (b == NULL)
+		{
+			if (hasGroupingSets)
+			{
+				aggstate->input_done = true;
+				break;
+			}
+			aggstate->agg_done = true;
+			break;
+		}
+		if (b->nvalid == 0)
+			continue;
+
+		TupleBatchMaterializeAll(b);
+		aggstate->grp_firstTuple = ExecCopySlotHeapTuple(TupleBatchGetSlot(b, 0));
+		break;
+	}
+
+	/* initialize_aggregates etc. as in the direct path */
+	ReScanExprContext(econtext);
+	for (int i = 0; i < Max(aggstate->phase->numsets, 1); i++)
+		ReScanExprContext(aggstate->aggcontexts[i]);
+
+	initialize_aggregates(aggstate, aggstate->pergroups,
+						  Max(aggstate->phase->numsets, 1));
+
+	if (aggstate->grp_firstTuple)
+	{
+		ExecForceStoreHeapTuple(aggstate->grp_firstTuple, firstSlot, true);
+		aggstate->grp_firstTuple = NULL;
+		tmpcontext->ecxt_outertuple = firstSlot;
+
+		advance_aggregates_batch(aggstate, b);
+		ResetExprContext(tmpcontext);
+	}
+
+	/* consume remaining rows in current and subsequent batches */
+	if (b)
+	{
+		if (TupleBatchHasMore(b))
+			advance_aggregates_batch(aggstate, b);
+		for (;;)
+		{
+			b = ExecProcNodeBatch(child);
+			if (b == NULL)
+			{
+				if (hasGroupingSets)
+					aggstate->input_done = true;
+				else
+					aggstate->agg_done = true;
+				break;
+			}
+			if (b->nvalid == 0)
+				continue;
+
+			TupleBatchMaterializeAll(b);
+			advance_aggregates_batch(aggstate, b);
+		}
+	}
+
+	/* finalize and project like the direct path */
+	econtext->ecxt_outertuple = firstSlot;
+	prepare_projection_slot(aggstate, econtext->ecxt_outertuple, 0);
+	select_current_set(aggstate, 0, false);
+	finalize_aggregates(aggstate, aggstate->peragg, aggstate->pergroups[0]);
+
+	return project_aggregates(aggstate);
+}
+
 /*
  * ExecAgg for hashed case: read input and build hash table
  */
@@ -3265,6 +3367,22 @@ hashagg_reset_spill_state(AggState *aggstate)
 	}
 }
 
+static bool
+AggCanUsePlainBatch(AggState *aggstate)
+{
+	const Agg *aggnode = (const Agg *) aggstate->ss.ps.plan;
+
+	Assert(outerPlanState(aggstate));
+
+	/* grouping sets present -> bail */
+	if (aggnode->groupingSets != NIL)
+		return false;
+
+	if (aggstate->phase->aggstrategy != AGG_PLAIN)
+		return false;
+
+	return outerPlanState(aggstate)->ExecProcNodeBatch != NULL;
+}
 
 /* -----------------
  * ExecInitAgg
@@ -4060,6 +4178,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 				(errcode(ERRCODE_GROUPING_ERROR),
 				 errmsg("aggregate function calls cannot be nested")));
 
+	if (AggCanUsePlainBatch(aggstate))
+		aggstate->retrieve_plain = agg_retrieve_direct_batch;
+	else
+		aggstate->retrieve_plain = agg_retrieve_direct;
+
 	/*
 	 * Build expressions doing all the transition work at once. We build a
 	 * different one for each phase, as the number of transition function
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a104591ac20..9b81b842161 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2535,6 +2535,9 @@ typedef struct AggStatePerGroupData *AggStatePerGroup;
 typedef struct AggStatePerPhaseData *AggStatePerPhase;
 typedef struct AggStatePerHashData *AggStatePerHash;
 
+struct AggState;
+typedef TupleTableSlot *(*AggRetrievePlainFn)(struct AggState *);
+
 typedef struct AggState
 {
 	ScanState	ss;				/* its first field is NodeTag */
@@ -2610,6 +2613,8 @@ typedef struct AggState
 	AggStatePerGroup *all_pergroups;	/* array of first ->pergroups, than
 										 * ->hash_pergroup */
 	SharedAggInfo *shared_info; /* one entry per worker */
+
+	AggRetrievePlainFn retrieve_plain; /* init-time choice */
 } AggState;
 
 /* ----------------
-- 
2.43.0



  [application/octet-stream] v1-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch (21.5K, 4-v1-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch)
From 992a5e21f7039825b12a6e800efb0265061bbe3a Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 2 Sep 2025 23:46:34 +0900
Subject: [PATCH v1 6/8] WIP: Add EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP

Introduce a batch EEOP that runs plain aggregate transitions by
looping over rows of a TupleBatch. This keeps transition logic in
the interpreter while amortizing per-row costs.

Gate with AggTransCanUseBatch(): plain, non-hashed, single-set
aggregates with no DISTINCT/ORDER/FILTER, and simple Var args.

Extend ExecBuildAggTrans() to prepare batch fetch/build steps and
to return whether a batch path is used.
---
 src/backend/executor/execExpr.c       | 228 ++++++++++++++++++++++++--
 src/backend/executor/execExprInterp.c | 103 ++++++++++++
 src/backend/executor/nodeAgg.c        |  17 +-
 src/backend/jit/llvm/llvmjit_expr.c   |   6 +
 src/backend/jit/llvm/llvmjit_types.c  |   1 +
 src/include/executor/execBatch.h      |   6 +
 src/include/executor/execExpr.h       |  14 ++
 src/include/executor/executor.h       |   3 +-
 src/include/executor/nodeAgg.h        |   2 +
 9 files changed, 363 insertions(+), 17 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index f1569879b52..af5ed8b6368 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -95,7 +95,9 @@ static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 								  ExprEvalStep *scratch,
 								  FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
 								  int transno, int setno, int setoff, bool ishash,
-								  bool nullcheck);
+								  bool nullcheck, bool batch,
+								  BatchVector *bv);
+
 static void ExecInitJsonExpr(JsonExpr *jsexpr, ExprState *state,
 							 Datum *resv, bool *resnull,
 							 ExprEvalStep *scratch);
@@ -104,6 +106,10 @@ static void ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
 								 bool exists_coerce,
 								 Datum *resv, bool *resnull);
 
+static BatchVector *BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var);
+static bool ExprListAllSimpleVars(const List *args, Bitmapset **allattnos);
+static BatchVectorSlice *BatchVectorSliceFromExprArgs(const List *args,
+													  const BatchVector *bv);
 
 /*
  * ExecInitExpr: prepare an expression tree for execution
@@ -3659,6 +3665,33 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
 	}
 }
 
+/* plain agg, single set, not hashed, no DISTINCT/ORDER/FILTER */
+static inline bool
+AggTransCanUseBatch(AggState *as, AggStatePerTrans pt)
+{
+	Agg *aggnode = (Agg *) as->ss.ps.plan;
+
+	if (!AggCanUsePlainBatch(as))
+		return false;
+	if (as->aggstrategy == AGG_HASHED)
+		return false;
+	if (aggnode->groupingSets != NIL)
+		return false;
+	if (as->phase == NULL || as->phase->numsets > 0)
+		return false;
+
+	/* per-aggregate complications */
+	if (pt->aggsortrequired)
+		return false;
+	if (pt->aggref &&
+		(pt->aggref->aggdistinct != NIL ||
+		 pt->aggref->aggorder != NIL ||
+		 pt->aggref->aggfilter != NULL))
+		return false;
+
+	return true;
+}
+
 /*
  * Build transition/combine function invocations for all aggregate transition
  * / combination function invocations in a grouping sets phase. This has to
@@ -3675,13 +3708,17 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
  */
 ExprState *
 ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
-				  bool doSort, bool doHash, bool nullcheck)
+				  bool doSort, bool doHash, bool nullcheck,
+				  bool *batch_trans)
 {
 	ExprState  *state = makeNode(ExprState);
 	PlanState  *parent = &aggstate->ss.ps;
 	ExprEvalStep scratch = {0};
 	bool		isCombine = DO_AGGSPLIT_COMBINE(aggstate->aggsplit);
 	ExprSetupInfo deform = {0, 0, 0, 0, 0, NIL};
+	bool		batch = AggCanUsePlainBatch(aggstate);
+	Bitmapset  *allattnos = NULL;
+	BatchVector *bv = NULL;
 
 	state->expr = (Expr *) aggstate;
 	state->parent = parent;
@@ -3707,8 +3744,36 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 						  &deform);
 		expr_setup_walker((Node *) pertrans->aggref->aggfilter,
 						  &deform);
+
+		if (!AggTransCanUseBatch(aggstate, pertrans) ||
+			!ExprListAllSimpleVars(pertrans->aggref->args, &allattnos))
+			batch = false;
 	}
-	ExecPushExprSetupSteps(state, &deform);
+
+	if (batch)
+	{
+		if (deform.last_outer > 0)
+		{
+			Assert(!bms_is_empty(allattnos));
+			bv = BatchVectorCreate(allattnos, deform.last_outer);
+
+			/*
+			 * Deform all tuples in the batch up to last_outer
+			 */
+			scratch.opcode = EEOP_OUTER_FETCHSOME_BATCH;
+			scratch.d.fetch_batch.last_var = deform.last_outer;
+			ExprEvalPushStep(state, &scratch);
+
+			/*
+			 * Put all arg Vars into vectors once per batch slice
+			 */
+			scratch.opcode = EEOP_BUILD_OUTER_BATCH_VECTOR;
+			scratch.d.batch_vector.bv = bv;
+			ExprEvalPushStep(state, &scratch);
+		}
+	}
+	else
+		ExecPushExprSetupSteps(state, &deform);
 
 	/*
 	 * Emit instructions for each transition value / grouping set combination.
@@ -3746,7 +3811,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 		 * Evaluate arguments to aggregate/combine function.
 		 */
 		argno = 0;
-		if (isCombine)
+		if (isCombine && !batch)
 		{
 			/*
 			 * Combining two aggregate transition values. Instead of directly
@@ -3816,7 +3881,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 
 			Assert(pertrans->numInputs == argno);
 		}
-		else if (!pertrans->aggsortrequired)
+		else if (!pertrans->aggsortrequired && !batch)
 		{
 			ListCell   *arg;
 
@@ -3849,7 +3914,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 			}
 			Assert(pertrans->numTransInputs == argno);
 		}
-		else if (pertrans->numInputs == 1)
+		else if (pertrans->numInputs == 1 && !batch)
 		{
 			/*
 			 * Non-presorted DISTINCT and/or ORDER BY case, with a single
@@ -3868,7 +3933,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 
 			Assert(pertrans->numInputs == argno);
 		}
-		else
+		else if (!batch)
 		{
 			/*
 			 * Non-presorted DISTINCT and/or ORDER BY case, with multiple
@@ -3896,7 +3961,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 		 * just keep the prior transValue. This is true for both plain and
 		 * sorted/distinct aggregates.
 		 */
-		if (trans_fcinfo->flinfo->fn_strict && pertrans->numTransInputs > 0)
+		if (trans_fcinfo->flinfo->fn_strict && pertrans->numTransInputs > 0 && !batch)
 		{
 			if (strictnulls)
 				scratch.opcode = EEOP_AGG_STRICT_INPUT_CHECK_NULLS;
@@ -3914,7 +3979,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 		}
 
 		/* Handle DISTINCT aggregates which have pre-sorted input */
-		if (pertrans->numDistinctCols > 0 && !pertrans->aggsortrequired)
+		if (pertrans->numDistinctCols > 0 && !pertrans->aggsortrequired && !batch)
 		{
 			if (pertrans->numDistinctCols > 1)
 				scratch.opcode = EEOP_AGG_PRESORTED_DISTINCT_MULTI;
@@ -3942,7 +4007,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 			{
 				ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
 									  pertrans, transno, setno, setoff, false,
-									  nullcheck);
+									  nullcheck, batch, bv);
 				setoff++;
 			}
 		}
@@ -3962,7 +4027,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 			{
 				ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
 									  pertrans, transno, setno, setoff, true,
-									  nullcheck);
+									  nullcheck, false, NULL);
 				setoff++;
 			}
 		}
@@ -4007,6 +4072,9 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 
 	ExecReadyExpr(state);
 
+	if (batch_trans)
+		*batch_trans = batch;
+
 	return state;
 }
 
@@ -4020,10 +4088,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 					  ExprEvalStep *scratch,
 					  FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
 					  int transno, int setno, int setoff, bool ishash,
-					  bool nullcheck)
+					  bool nullcheck, bool batch, BatchVector *bv)
 {
 	ExprContext *aggcontext;
 	int			adjust_jumpnull = -1;
+	BatchVectorSlice *bvs = NULL;
 
 	if (ishash)
 		aggcontext = aggstate->hashcontext;
@@ -4077,7 +4146,13 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 	 */
 	if (!pertrans->aggsortrequired)
 	{
-		if (pertrans->transtypeByVal)
+		if (batch)
+		{
+			if (bv)
+				bvs = BatchVectorSliceFromExprArgs(pertrans->aggref->args, bv);
+			scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+		}
+		else if (pertrans->transtypeByVal)
 		{
 			if (fcinfo->flinfo->fn_strict &&
 				pertrans->initValueIsNull)
@@ -4108,6 +4183,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 	scratch->d.agg_trans.setoff = setoff;
 	scratch->d.agg_trans.transno = transno;
 	scratch->d.agg_trans.aggcontext = aggcontext;
+	scratch->d.agg_trans.bvs = bvs;
 	ExprEvalPushStep(state, scratch);
 
 	/* fix up jumpnull */
@@ -5070,3 +5146,129 @@ ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
 		DomainHasConstraints(returning->typid);
 	ExprEvalPushStep(state, &scratch);
 }
+
+/* Is expr a Var node for a non-system attribute? */
+static bool
+expr_is_simple_var(Expr *expr, AttrNumber *out_attno)
+{
+	if (expr == NULL)
+		return false;
+
+	if (IsA(expr, TargetEntry))
+		return expr_is_simple_var((Expr *) ((TargetEntry *) expr)->expr,
+								  out_attno);
+	if (IsA(expr, RelabelType))
+		return expr_is_simple_var((Expr *) ((RelabelType *) expr)->arg,
+								  out_attno);
+
+	if (IsA(expr, Var) && ((Var *) expr)->varattno > 0)
+	{
+		*out_attno = ((Var *) expr)->varattno;
+		return true;
+	}
+
+	return false;
+}
+
+/* Are all args plain Vars (RelabelType over a Var is rejected for now)? Collect attnos. */
+static bool
+ExprListAllSimpleVars(const List *args, Bitmapset **allattnos)
+{
+	ListCell *lc;
+
+	foreach(lc, args)
+	{
+		TargetEntry *tle = lfirst_node(TargetEntry, lc);
+		Expr *arg = tle->expr;
+		AttrNumber attno;
+
+		if (!expr_is_simple_var(arg, &attno))
+			return false;
+
+		if (!IsA(arg, Var))
+			return false;
+
+		Assert(attno > 0);
+		*allattnos = bms_add_member(*allattnos, attno);
+	}
+
+	return true;
+}
+
+/* ---------- BatchVector stuff ------------- */
+
+static BatchVector *
+BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var)
+{
+	int maxrows = EXEC_BATCH_ROWS;
+	BatchVector *bv;
+	AttrNumber	attno;
+	int			i;
+
+	bv = palloc(sizeof(BatchVector));
+	bv->ncols = bms_num_members(attnos);
+	bv->maxrows = maxrows;
+	bv->last_var = last_var;
+	bv->attnos = palloc(sizeof(AttrNumber) * bv->ncols);
+	attno = -1;
+	i = 0;
+	while ((attno = bms_next_member(attnos, attno)) > 0)
+		bv->attnos[i++] = attno;
+	bv->cols = palloc(sizeof(Datum *) * bv->ncols);
+	bv->nulls = palloc(sizeof(bool  *) * bv->ncols);
+
+	for (i = 0; i < bv->ncols; i++)
+	{
+		bv->cols[i]  = palloc(sizeof(Datum) * maxrows);
+		bv->nulls[i] = palloc(sizeof(bool)  * maxrows);
+	}
+
+	bv->nrows = 0;
+	bv->hasnull = false;
+
+	return bv;
+}
+
+static int16
+BatchVectorFindAttColno(const BatchVector *bv, AttrNumber attno)
+{
+	for (int i = 0; i < bv->ncols; i++)
+		if (bv->attnos[i] == attno)
+			return i;
+
+	return -1;
+}
+
+/*
+ * BatchVectorSliceFromExprArgs
+ *		Build a BatchVectorSlice for a List of args.
+ *
+ * For Var args (possibly under RelabelType), store the col index.
+ * For non-Var args, store -1. Caller can handle Consts, etc.
+ */
+static BatchVectorSlice *
+BatchVectorSliceFromExprArgs(const List *args, const BatchVector *bv)
+{
+	BatchVectorSlice *bvs = palloc(sizeof(BatchVectorSlice));
+	int nargs = list_length(args);
+	int i = 0;
+	ListCell *lc;
+
+	Assert(bv);
+	bvs->bv = bv;
+	bvs->nargs = nargs;
+	bvs->argoffs = (int16 *) palloc(sizeof(int16) * nargs);
+
+	foreach (lc, args)
+	{
+		Expr *arg = (Expr *) lfirst(lc);
+		AttrNumber attno;
+
+		if (expr_is_simple_var(arg, &attno))
+			bvs->argoffs[i++] = BatchVectorFindAttColno(bv, attno);
+		else
+			bvs->argoffs[i++] = -1; /* non-Var */
+	}
+
+	return bvs;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 68629ad7991..3176679b346 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -606,6 +606,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_BUILD_INNER_BATCH_VECTOR,
 		&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
 		&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
+		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
 		&&CASE_EEOP_LAST
 	};
 
@@ -2336,6 +2337,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			EEO_NEXT();
 		}
 
+		EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP)
+		{
+			/* too complex for an inline implementation */
+			ExecAggPlainTransBatch(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
 		EEO_CASE(EEOP_LAST)
 		{
 			/* unreachable */
@@ -6039,3 +6048,97 @@ ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
 	}
 	bv->nrows = i;
 }
+
+void
+ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	AggState   *aggstate = castNode(AggState, state->parent);
+	AggStatePerTrans	pertrans = op->d.agg_trans.pertrans;
+	AggStatePerGroup pergroup =
+		&aggstate->all_pergroups[op->d.agg_trans.setoff][op->d.agg_trans.transno];
+	BatchVectorSlice  *bvs = op->d.agg_trans.bvs;
+	FunctionCallInfo	fcinfo = pertrans->transfn_fcinfo;
+	FmgrInfo		   *finfo = fcinfo->flinfo;
+	Datum		newVal;
+	TupleBatch *batch = econtext->outer_batch;
+	int			batch_nrows = bvs ? bvs->bv->nrows : batch->nvalid;
+	int			start_row = 0;
+
+	if (finfo->fn_strict)
+	{
+		if (pergroup->noTransValue && bvs)
+		{
+			const BatchVector *bv = bvs->bv;
+			int		first = 0;
+
+			Assert(bv);
+			/* Find the first row whose arguments are all non-NULL. */
+			for (; first < batch_nrows; first++)
+			{
+				bool	anynull = false;
+
+				for (int j = 0; j < bvs->nargs; j++)
+				{
+					int16	argoff = bvs->argoffs[j];
+
+					fcinfo->args[j + 1].value = bv->cols[argoff][first];
+					fcinfo->args[j + 1].isnull = bv->nulls[argoff][first];
+					anynull |= bv->nulls[argoff][first];
+				}
+				if (!anynull)
+					break;
+			}
+			if (first >= batch_nrows)
+				return;		/* no usable row; transValue stays unset */
+			/* Initialize transValue from that row, then resume after it. */
+			ExecAggInitGroup(aggstate, pertrans, pergroup,
+							 op->d.agg_trans.aggcontext);
+			start_row = first + 1;
+		}
+		else if (pergroup->transValueIsNull)
+			return;
+	}
+
+	switch (ExecEvalStepOp(state, op))
+	{
+		case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
+			/* Loop rows, call the original transfn per element using vector cols. */
+			for (int i = start_row; i < batch_nrows; i++)
+			{
+				bool hasnull = false;
+
+				/* Set up fcinfo args 1..m from column vectors at row i. */
+				if (bvs)
+				{
+					const BatchVector *bv = bvs->bv;
+
+					for (int j = 0; j < bvs->nargs; j++)
+					{
+						int16	argoff = bvs->argoffs[j];
+
+						fcinfo->args[j+1].value = bv->cols[argoff][i];
+						fcinfo->args[j+1].isnull = bv->nulls[argoff][i];
+						if (!hasnull && bv->nulls[argoff][i])
+							hasnull = true;
+					}
+				}
+				/* fcinfo->args[0] is the existing transition state */
+				if (finfo->fn_strict && hasnull)
+					continue;
+				fcinfo->args[0].value = pergroup->transValue;
+				fcinfo->args[0].isnull = pergroup->transValueIsNull;
+				newVal = FunctionCallInvoke(fcinfo);
+				if (!pertrans->transtypeByVal &&
+					DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+					newVal = ExecAggCopyTransValue(aggstate, pertrans,
+												   newVal, fcinfo->isnull,
+												   pergroup->transValue,
+												   pergroup->transValueIsNull);
+				pergroup->transValue = newVal;
+				pergroup->transValueIsNull = fcinfo->isnull;
+			}
+			break;
+		default:
+			elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
+	}
+}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3ace6363509..662d8bef43b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -825,6 +825,16 @@ advance_aggregates_batch(AggState *aggstate, TupleBatch *b)
 {
 	ExprContext *tmpcontext = aggstate->tmpcontext;
 	ExprState *evaltrans = aggstate->phase->evaltrans;
+	bool		batch_trans = aggstate->phase->batch_trans;
+
+	if (batch_trans)
+	{
+		tmpcontext->ecxt_outertuple = TupleBatchGetSlot(b, 0);
+		tmpcontext->outer_batch = b;
+		ExecEvalExprNoReturnSwitchContext(evaltrans, tmpcontext);
+		TupleBatchConsumeAll(b);
+		return;
+	}
 
 	while (TupleBatchHasMore(b))
 	{
@@ -1800,7 +1810,8 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
 
 		phase->evaltrans_cache[i][j] = ExecBuildAggTrans(aggstate, phase,
 														 dosort, dohash,
-														 nullcheck);
+														 nullcheck,
+														 NULL);
 
 		/* change back */
 		aggstate->ss.ps.outerops = outerops;
@@ -3367,7 +3378,7 @@ hashagg_reset_spill_state(AggState *aggstate)
 	}
 }
 
-static bool
+bool
 AggCanUsePlainBatch(AggState *aggstate)
 {
 	const Agg *aggnode = (const Agg *) aggstate->ss.ps.plan;
@@ -4233,7 +4244,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 			Assert(false);
 
 		phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
-											 false);
+											 false, &phase->batch_trans);
 
 		/* cache compiled expression for outer slot without NULL check */
 		phase->evaltrans_cache[0][0] = phase->evaltrans;
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 848f0b52d6f..efb3ee639fc 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3026,6 +3026,12 @@ llvm_compile_expr(ExprState *state)
 				LLVMBuildBr(b, opblocks[opno + 1]);
 				break;
 
+			case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
+				build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
 			case EEOP_LAST:
 				Assert(false);
 				break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 6bb527c3f6f..1b5e06f60cc 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -186,4 +186,5 @@ void	   *referenced_functions[] =
 	ExecBuildInnerBatchVector,
 	ExecBuildOuterBatchVector,
 	ExecBuildScanBatchVector,
+	ExecAggPlainTransBatch,
 };
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
index 6f1a38d14bd..b50961fc0c9 100644
--- a/src/include/executor/execBatch.h
+++ b/src/include/executor/execBatch.h
@@ -99,4 +99,10 @@ TupleBatchMaterializeAll(TupleBatch *b)
 	TupleBatchUseInput(b, b->ntuples);
 }
 
+static inline void
+TupleBatchConsumeAll(TupleBatch *b)
+{
+	b->next = b->nvalid;
+}
+
 #endif	/* EXECBATCH_H */
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 99c86bac702..1d33e084b69 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -302,6 +302,9 @@ typedef enum ExprEvalOp
 	EEOP_BUILD_OUTER_BATCH_VECTOR,
 	EEOP_BUILD_SCAN_BATCH_VECTOR,
 
+	/* Batched aggregate trans evaluation */
+	EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,	/* per-row fmgr calls */
+
 	/* non-existent operation, used e.g. to check array lengths */
 	EEOP_LAST
 } ExprEvalOp;
@@ -750,6 +753,7 @@ typedef struct ExprEvalStep
 
 		/* for EEOP_AGG_PLAIN_TRANS_[INIT_][STRICT_]{BYVAL,BYREF} */
 		/* for EEOP_AGG_ORDERED_TRANS_{DATUM,TUPLE} */
+		/* for EEOP_AGG_PLAIN_TRANS_{BATCH,BATCH_ROWLOOP} */
 		struct
 		{
 			AggStatePerTrans pertrans;
@@ -757,6 +761,7 @@ typedef struct ExprEvalStep
 			int			setno;
 			int			transno;
 			int			setoff;
+			struct BatchVectorSlice *bvs;
 		}			agg_trans;
 
 		/* for EEOP_IS_JSON */
@@ -956,8 +961,17 @@ typedef struct BatchVector
 	int		nrows;			/* #rows loaded into cols/nulls */
 } BatchVector;
 
+/* A slice of BatchVector that maps caller args to BatchVector columns. */
+typedef struct BatchVectorSlice
+{
+	const BatchVector *bv;
+	int			nargs;		/* number of args covered */
+	int16	   *argoffs;	/* length nargs, -1 for non-Var entries */
+} BatchVectorSlice;
+
 extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 
+extern void ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index cf5b0c7e05c..5ba9a523970 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -336,7 +336,8 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
 extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
 extern List *ExecInitExprList(List *nodes, PlanState *parent);
 extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
-									bool doSort, bool doHash, bool nullcheck);
+									bool doSort, bool doHash, bool nullcheck,
+									bool *batch_trans);
 extern ExprState *ExecBuildHash32FromAttrs(TupleDesc desc,
 										   const TupleTableSlotOps *ops,
 										   FmgrInfo *hashfunctions,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 6c4891bbaeb..5c5ebfc73f2 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -289,6 +289,7 @@ typedef struct AggStatePerPhaseData
 	Sort	   *sortnode;		/* Sort node for input ordering for phase */
 
 	ExprState  *evaltrans;		/* evaluation of transition functions  */
+	bool		batch_trans;	/* true if evaltrans contains batch EEOPs */
 
 	/*----------
 	 * Cached variants of the compiled expression.
@@ -338,4 +339,5 @@ extern void ExecAggInitializeDSM(AggState *node, ParallelContext *pcxt);
 extern void ExecAggInitializeWorker(AggState *node, ParallelWorkerContext *pwcxt);
 extern void ExecAggRetrieveInstrumentation(AggState *node);
 
+extern bool AggCanUsePlainBatch(AggState *aggstate);
 #endif							/* NODEAGG_H */
-- 
2.43.0
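
As an aside on the strict-transition handling in ExecAggPlainTransBatch()
above, here is a minimal standalone sketch of the semantics the existing
per-tuple steps give a strict transition function: NULL inputs are skipped,
and the first non-NULL input seeds the state without calling the function.
GroupState, transfn, max_trans, and advance_strict_batch are hypothetical
names used only for illustration; the real code works on Datums through
fmgr and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-group state, loosely modeled on AggStatePerGroup. */
typedef struct GroupState
{
	long	value;
	bool	no_value;	/* true until the first non-NULL input is seen */
} GroupState;

/* Hypothetical transition function signature (the patch uses fmgr calls). */
typedef long (*transfn) (long state, long input);

static long
max_trans(long state, long input)
{
	return input > state ? input : state;
}

/*
 * Advance one group's state over a column of nrows inputs with strict
 * semantics: NULL inputs are skipped, and the first non-NULL input seeds
 * the state without calling fn.
 */
static void
advance_strict_batch(GroupState *gs, transfn fn,
					 const long *col, const bool *nulls, int nrows)
{
	int		i = 0;

	assert(nrows >= 0);

	if (gs->no_value)
	{
		/* Look for the first usable row in this batch. */
		while (i < nrows && nulls[i])
			i++;
		if (i == nrows)
			return;		/* all NULL: stay uninitialized for the next batch */

		gs->value = col[i++];	/* seed, then resume after the seed row */
		gs->no_value = false;
	}

	for (; i < nrows; i++)
	{
		if (!nulls[i])
			gs->value = fn(gs->value, col[i]);
	}
}
```

The point of the sketch is that the row loop has to resume right after
whichever row supplied the seed, and that a batch consisting only of NULLs
must leave the group uninitialized so that a later batch can seed it.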



  [application/octet-stream] v1-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch (16.9K, 5-v1-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch)
From b63b357ea48e55b43913559471fd10f5a65e1b8e Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 22 Sep 2025 17:01:29 +0900
Subject: [PATCH v1 5/8] WIP: Add EEOPs and helpers for TupleBatch processing

Introduce new EEOP steps that deform the tuples of a TupleBatch and
copy selected attributes into column vectors:
- EEOP_{INNER,OUTER,SCAN}_FETCHSOME_BATCH
- EEOP_BUILD_{INNER,OUTER,SCAN}_BATCH_VECTOR

Add ExecBuild{Inner,Outer,Scan}BatchVector() helpers to populate
column vectors (values, nulls, nrows, hasnull) from a TupleBatch.
Extend ExprContext with inner_batch, outer_batch, and scan_batch
fields so expression programs can access active batches directly.

Add slot_getsomeattrs_batch() to deform attributes up to a given
attribute number in every slot of a TupleBatch, analogous to
slot_getsomeattrs() for a single slot.
---
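
For readers skimming the series: the ExecBuild{Inner,Outer,Scan}BatchVector()
helpers described above turn a batch of deformed rows into per-column arrays.
Below is a minimal standalone sketch of that transposition. It assumes
simplified stand-in types: Row, ColVector, and build_vector are hypothetical
names, not the patch's structs or API, and long stands in for Datum. Unlike
the patch as posted, the sketch resets hasnull for every batch; that is a
choice made here for clarity.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ROWS 4
#define NATTS 3

/* Hypothetical stand-in for a deformed tuple slot: one row, row-major. */
typedef struct Row
{
	long	values[NATTS];
	bool	isnull[NATTS];
} Row;

/* Hypothetical stand-in for BatchVector: one array per requested column. */
typedef struct ColVector
{
	int		ncols;					/* number of requested columns */
	int		attnos[NATTS];			/* 1-based attribute numbers */
	long	cols[NATTS][MAX_ROWS];	/* [ncols][MAX_ROWS] */
	bool	nulls[NATTS][MAX_ROWS];	/* [ncols][MAX_ROWS] */
	bool	hasnull;				/* is any loaded value NULL? */
	int		nrows;					/* rows loaded into cols/nulls */
} ColVector;

/* Copy the requested attributes of each row into the column arrays. */
static void
build_vector(ColVector *cv, const Row *rows, int nrows)
{
	assert(nrows >= 0 && nrows <= MAX_ROWS);

	cv->hasnull = false;		/* reset per batch in this sketch */
	for (int i = 0; i < nrows; i++)
	{
		for (int j = 0; j < cv->ncols; j++)
		{
			int		att = cv->attnos[j] - 1;	/* to 0-based index */

			cv->cols[j][i] = rows[i].values[att];
			cv->nulls[j][i] = rows[i].isnull[att];
			if (rows[i].isnull[att])
				cv->hasnull = true;
		}
	}
	cv->nrows = nrows;
}
```

With the column arrays filled once per batch, a consumer such as an aggregate
transition loop can walk cols[j] and nulls[j] directly instead of going back
to each slot.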
 src/backend/executor/execExprInterp.c | 127 +++++++++++++++++++++++++-
 src/backend/executor/execTuples.c     |  32 +++++++
 src/backend/jit/llvm/llvmjit_expr.c   |  86 +++++++++++++++++
 src/backend/jit/llvm/llvmjit_types.c  |   4 +
 src/include/executor/execExpr.h       |  45 ++++++++-
 src/include/executor/tuptable.h       |   2 +
 src/include/nodes/execnodes.h         |  24 +++--
 7 files changed, 310 insertions(+), 10 deletions(-)

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 0e1a74976f7..68629ad7991 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -59,6 +59,7 @@
 #include "access/heaptoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
+#include "executor/execBatch.h"
 #include "executor/execExpr.h"
 #include "executor/nodeSubplan.h"
 #include "funcapi.h"
@@ -188,6 +189,11 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  int setno);
 static char *ExecGetJsonValueItemString(JsonbValue *item, bool *resnull);
 
+static pg_attribute_always_inline void ExecBuildBatchVector(ExprState *state,
+															ExprEvalStep *op,
+															ExprContext *econtext,
+															TupleBatch *b);
+
 /*
  * ScalarArrayOpExprHashEntry
  * 		Hash table entry type used during EEOP_HASHED_SCALARARRAYOP
@@ -446,7 +452,6 @@ ExecReadyInterpretedExpr(ExprState *state)
 	state->evalfunc_private = ExecInterpExpr;
 }
 
-
 /*
  * Evaluate expression identified by "state" in the execution context
  * given by "econtext".  *isnull is set to the is-null flag for the result,
@@ -466,6 +471,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 	TupleTableSlot *scanslot;
 	TupleTableSlot *oldslot;
 	TupleTableSlot *newslot;
+	TupleBatch *innerbatch;
+	TupleBatch *outerbatch;
+	TupleBatch *scanbatch;
 
 	/*
 	 * This array has to be in the same order as enum ExprEvalOp.
@@ -479,6 +487,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_SCAN_FETCHSOME,
 		&&CASE_EEOP_OLD_FETCHSOME,
 		&&CASE_EEOP_NEW_FETCHSOME,
+		&&CASE_EEOP_INNER_FETCHSOME_BATCH,
+		&&CASE_EEOP_OUTER_FETCHSOME_BATCH,
+		&&CASE_EEOP_SCAN_FETCHSOME_BATCH,
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
@@ -592,6 +603,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
 		&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
 		&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
+		&&CASE_EEOP_BUILD_INNER_BATCH_VECTOR,
+		&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
+		&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
 		&&CASE_EEOP_LAST
 	};
 
@@ -612,6 +626,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 	scanslot = econtext->ecxt_scantuple;
 	oldslot = econtext->ecxt_oldtuple;
 	newslot = econtext->ecxt_newtuple;
+	innerbatch = econtext->inner_batch;
+	outerbatch = econtext->outer_batch;
+	scanbatch = econtext->scan_batch;
 
 #if defined(EEO_USE_COMPUTED_GOTO)
 	EEO_DISPATCH();
@@ -658,6 +675,36 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			EEO_NEXT();
 		}
 
+		EEO_CASE(EEOP_INNER_FETCHSOME_BATCH)
+		{
+			CheckOpSlotCompatibility(op, innerslot);
+
+			Assert(innerbatch);
+			slot_getsomeattrs_batch(innerbatch, op->d.fetch_batch.last_var);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_FETCHSOME_BATCH)
+		{
+			CheckOpSlotCompatibility(op, outerslot);
+
+			Assert(outerbatch);
+			slot_getsomeattrs_batch(outerbatch, op->d.fetch_batch.last_var);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_FETCHSOME_BATCH)
+		{
+			CheckOpSlotCompatibility(op, scanslot);
+
+			Assert(scanbatch);
+			slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
+
+			EEO_NEXT();
+		}
+
 		EEO_CASE(EEOP_OLD_FETCHSOME)
 		{
 			CheckOpSlotCompatibility(op, oldslot);
@@ -2265,6 +2312,30 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			EEO_NEXT();
 		}
 
+		EEO_CASE(EEOP_BUILD_INNER_BATCH_VECTOR)
+		{
+			/* too complex for an inline implementation */
+			ExecBuildInnerBatchVector(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_BUILD_OUTER_BATCH_VECTOR)
+		{
+			/* too complex for an inline implementation */
+			ExecBuildOuterBatchVector(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_BUILD_SCAN_BATCH_VECTOR)
+		{
+			/* too complex for an inline implementation */
+			ExecBuildScanBatchVector(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
 		EEO_CASE(EEOP_LAST)
 		{
 			/* unreachable */
@@ -5914,3 +5985,57 @@ ExecAggPlainTransByRef(AggState *aggstate, AggStatePerTrans pertrans,
 
 	MemoryContextSwitchTo(oldContext);
 }
+
+void
+ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	Assert(econtext->inner_batch);
+	ExecBuildBatchVector(state, op, econtext, econtext->inner_batch);
+}
+
+void
+ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	Assert(econtext->outer_batch);
+	ExecBuildBatchVector(state, op, econtext, econtext->outer_batch);
+}
+
+void
+ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	Assert(econtext->scan_batch);
+	ExecBuildBatchVector(state, op, econtext, econtext->scan_batch);
+}
+
+static pg_attribute_always_inline void
+ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
+					 TupleBatch *b)
+{
+	struct BatchVector *bv = op->d.batch_vector.bv;
+	int		i = 0;
+
+	if (bv->ncols == 0)
+		return;
+
+	/* Fetch each requested attribute into column vectors. */
+	TupleBatchRewind(b);
+	while (TupleBatchHasMore(b))
+	{
+		TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+		for (int j = 0; j < bv->ncols; j++)
+		{
+			AttrNumber attno = bv->attnos[j];
+			Datum  *cols  = bv->cols[j];
+			bool   *nulls  = bv->nulls[j];
+
+			Assert(attno <= slot->tts_nvalid);
+			cols[i] = slot->tts_values[attno - 1];
+			nulls[i] = slot->tts_isnull[attno - 1];
+			if (!bv->hasnull && nulls[i])
+				bv->hasnull = true;
+		}
+		i++;
+	}
+	bv->nrows = i;
+}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 8e02d68824f..86d5dea8f8b 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -2111,6 +2111,38 @@ slot_getsomeattrs_int(TupleTableSlot *slot, int attnum)
 	}
 }
 
+void
+slot_getsomeattrs_batch(struct TupleBatch *b, int attnum)
+{
+	while (TupleBatchHasMore(b))
+	{
+		TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+		/* Check for caller errors */
+		Assert(attnum > 0);
+
+		if (unlikely(attnum > slot->tts_tupleDescriptor->natts))
+			elog(ERROR, "invalid attribute number %d", attnum);
+
+		/* XXX - there should perhaps also be a batch-level att_nvalid */
+		if (attnum <= slot->tts_nvalid)
+			continue;
+
+		/* Fetch as many attributes as possible from the underlying tuple. */
+		slot->tts_ops->getsomeattrs(slot, attnum);
+
+		/*
+		 * If the underlying tuple doesn't have enough attributes, tuple
+		 * descriptor must have the missing attributes.
+		 */
+		if (unlikely(slot->tts_nvalid < attnum))
+		{
+			slot_getmissingattrs(slot, slot->tts_nvalid, attnum);
+			slot->tts_nvalid = attnum;
+		}
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecTypeFromTL
  *
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 712b35df7e5..848f0b52d6f 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -109,6 +109,11 @@ llvm_compile_expr(ExprState *state)
 	LLVMValueRef v_newslot;
 	LLVMValueRef v_resultslot;
 
+	/* batches */
+	LLVMValueRef v_innerbatch;
+	LLVMValueRef v_outerbatch;
+	LLVMValueRef v_scanbatch;
+
 	/* nulls/values of slots */
 	LLVMValueRef v_innervalues;
 	LLVMValueRef v_innernulls;
@@ -221,6 +226,21 @@ llvm_compile_expr(ExprState *state)
 									 v_state,
 									 FIELDNO_EXPRSTATE_RESULTSLOT,
 									 "v_resultslot");
+	v_innerbatch = l_load_struct_gep(b,
+									 StructExprContext,
+									 v_econtext,
+									 FIELDNO_EXPRCONTEXT_INNERBATCH,
+									 "v_innerbatch");
+	v_outerbatch = l_load_struct_gep(b,
+									 StructExprContext,
+									 v_econtext,
+									 FIELDNO_EXPRCONTEXT_OUTERBATCH,
+									 "v_outerbatch");
+	v_scanbatch = l_load_struct_gep(b,
+									StructExprContext,
+									v_econtext,
+									FIELDNO_EXPRCONTEXT_SCANBATCH,
+									"v_scanbatch");
 
 	/* build global values/isnull pointers */
 	v_scanvalues = l_load_struct_gep(b,
@@ -439,6 +459,54 @@ llvm_compile_expr(ExprState *state)
 					break;
 				}
 
+			case EEOP_INNER_FETCHSOME_BATCH:
+				{
+					LLVMValueRef params[2];
+
+					params[0] = v_innerbatch;
+					params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+					l_call(b,
+						   llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+						   llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+						   params, lengthof(params), "");
+
+					LLVMBuildBr(b, opblocks[opno + 1]);
+					break;
+				}
+
+			case EEOP_OUTER_FETCHSOME_BATCH:
+				{
+					LLVMValueRef params[2];
+
+					params[0] = v_outerbatch;
+					params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+					l_call(b,
+						   llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+						   llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+						   params, lengthof(params), "");
+
+					LLVMBuildBr(b, opblocks[opno + 1]);
+					break;
+				}
+
+			case EEOP_SCAN_FETCHSOME_BATCH:
+				{
+					LLVMValueRef params[2];
+
+					params[0] = v_scanbatch;
+					params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+					l_call(b,
+						   llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+						   llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+						   params, lengthof(params), "");
+
+					LLVMBuildBr(b, opblocks[opno + 1]);
+					break;
+				}
+
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
@@ -2940,6 +3008,24 @@ llvm_compile_expr(ExprState *state)
 				LLVMBuildBr(b, opblocks[opno + 1]);
 				break;
 
+			case EEOP_BUILD_INNER_BATCH_VECTOR:
+				build_EvalXFunc(b, mod, "ExecBuildInnerBatchVector",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
+			case EEOP_BUILD_OUTER_BATCH_VECTOR:
+				build_EvalXFunc(b, mod, "ExecBuildOuterBatchVector",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
+			case EEOP_BUILD_SCAN_BATCH_VECTOR:
+				build_EvalXFunc(b, mod, "ExecBuildScanBatchVector",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
 			case EEOP_LAST:
 				Assert(false);
 				break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 167cd554b9c..6bb527c3f6f 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -179,7 +179,11 @@ void	   *referenced_functions[] =
 	MakeExpandedObjectReadOnlyInternal,
 	slot_getmissingattrs,
 	slot_getsomeattrs_int,
+	slot_getsomeattrs_batch,
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecBuildInnerBatchVector,
+	ExecBuildOuterBatchVector,
+	ExecBuildScanBatchVector,
 };
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 75366203706..99c86bac702 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -78,6 +78,11 @@ typedef enum ExprEvalOp
 	EEOP_OLD_FETCHSOME,
 	EEOP_NEW_FETCHSOME,
 
+	/* apply slot_getsomeattrs_batch() to corresponding batch */
+	EEOP_INNER_FETCHSOME_BATCH,
+	EEOP_OUTER_FETCHSOME_BATCH,
+	EEOP_SCAN_FETCHSOME_BATCH,
+
 	/* compute non-system Var value */
 	EEOP_INNER_VAR,
 	EEOP_OUTER_VAR,
@@ -292,11 +297,15 @@ typedef enum ExprEvalOp
 	EEOP_AGG_ORDERED_TRANS_DATUM,
 	EEOP_AGG_ORDERED_TRANS_TUPLE,
 
+	/* ExprContext.*_batch -> BatchVector */
+	EEOP_BUILD_INNER_BATCH_VECTOR,
+	EEOP_BUILD_OUTER_BATCH_VECTOR,
+	EEOP_BUILD_SCAN_BATCH_VECTOR,
+
 	/* non-existent operation, used e.g. to check array lengths */
 	EEOP_LAST
 } ExprEvalOp;
 
-
 typedef struct ExprEvalStep
 {
 	/*
@@ -331,6 +340,12 @@ typedef struct ExprEvalStep
 			const TupleTableSlotOps *kind;
 		}			fetch;
 
+		struct
+		{
+			/* attribute number up to which to fetch (inclusive) */
+			int			last_var;
+		}			fetch_batch;
+
 		/* for EEOP_INNER/OUTER/SCAN/OLD/NEW_[SYS]VAR */
 		struct
 		{
@@ -769,6 +784,12 @@ typedef struct ExprEvalStep
 			void	   *json_coercion_cache;
 			ErrorSaveContext *escontext;
 		}			jsonexpr_coercion;
+
+		/* for batch vector construction */
+		struct
+		{
+			struct BatchVector *bv;
+		}			batch_vector;
 	}			d;
 } ExprEvalStep;
 
@@ -917,4 +938,26 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 
+/* ---------- BatchVector stuff ------------- */
+
+/* Vector fetch spec for a list of simple Vars. */
+typedef struct BatchVector
+{
+	/* immutable after BatchVectorCreate */
+	AttrNumber *attnos;		/* [ncols] */
+	int			ncols;
+	int			maxrows;
+	int			last_var;
+
+	/* per batch state */
+	Datum **cols;			/* [ncols][maxrows] */
+	bool  **nulls;			/* [ncols][maxrows] */
+	bool	hasnull;		/* is any datum in cols NULL? */
+	int		nrows;			/* #rows loaded into cols/nulls */
+} BatchVector;
+
+extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 095e4cc82e3..2e2192fb3cf 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -347,6 +347,8 @@ extern Datum ExecFetchSlotHeapTupleDatum(TupleTableSlot *slot);
 extern void slot_getmissingattrs(TupleTableSlot *slot, int startAttNum,
 								 int lastAttNum);
 extern void slot_getsomeattrs_int(TupleTableSlot *slot, int attnum);
+struct TupleBatch;
+extern void slot_getsomeattrs_batch(struct TupleBatch *b, int attnum);
 
 
 #ifndef FRONTEND
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9b81b842161..fdfe8b4ddaf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -277,6 +277,14 @@ typedef struct ExprContext
 #define FIELDNO_EXPRCONTEXT_OUTERTUPLE 3
 	TupleTableSlot *ecxt_outertuple;
 
+	/* For batched evaluation using batch-aware EEOPs */
+#define FIELDNO_EXPRCONTEXT_INNERBATCH 4
+	TupleBatch	   *inner_batch;
+#define FIELDNO_EXPRCONTEXT_OUTERBATCH 5
+	TupleBatch	   *outer_batch;
+#define FIELDNO_EXPRCONTEXT_SCANBATCH 6
+	TupleBatch	   *scan_batch;
+
 	/* Memory contexts for expression evaluation --- see notes above */
 	MemoryContext ecxt_per_query_memory;
 	MemoryContext ecxt_per_tuple_memory;
@@ -289,27 +297,27 @@ typedef struct ExprContext
 	 * Values to substitute for Aggref nodes in the expressions of an Agg
 	 * node, or for WindowFunc nodes within a WindowAgg node.
 	 */
-#define FIELDNO_EXPRCONTEXT_AGGVALUES 8
+#define FIELDNO_EXPRCONTEXT_AGGVALUES 11
 	Datum	   *ecxt_aggvalues; /* precomputed values for aggs/windowfuncs */
-#define FIELDNO_EXPRCONTEXT_AGGNULLS 9
+#define FIELDNO_EXPRCONTEXT_AGGNULLS 12
 	bool	   *ecxt_aggnulls;	/* null flags for aggs/windowfuncs */
 
 	/* Value to substitute for CaseTestExpr nodes in expression */
-#define FIELDNO_EXPRCONTEXT_CASEDATUM 10
+#define FIELDNO_EXPRCONTEXT_CASEDATUM 13
 	Datum		caseValue_datum;
-#define FIELDNO_EXPRCONTEXT_CASENULL 11
+#define FIELDNO_EXPRCONTEXT_CASENULL 14
 	bool		caseValue_isNull;
 
 	/* Value to substitute for CoerceToDomainValue nodes in expression */
-#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 12
+#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 15
 	Datum		domainValue_datum;
-#define FIELDNO_EXPRCONTEXT_DOMAINNULL 13
+#define FIELDNO_EXPRCONTEXT_DOMAINNULL 16
 	bool		domainValue_isNull;
 
 	/* Tuples that OLD/NEW Var nodes in RETURNING may refer to */
-#define FIELDNO_EXPRCONTEXT_OLDTUPLE 14
+#define FIELDNO_EXPRCONTEXT_OLDTUPLE 17
 	TupleTableSlot *ecxt_oldtuple;
-#define FIELDNO_EXPRCONTEXT_NEWTUPLE 15
+#define FIELDNO_EXPRCONTEXT_NEWTUPLE 18
 	TupleTableSlot *ecxt_newtuple;
 
 	/* Link to containing EState (NULL if a standalone ExprContext) */
-- 
2.43.0
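
For readers skimming the BatchVector struct in the patch above, here is a
minimal standalone sketch of the column-major layout it describes
(cols[col][row] and nulls[col][row], plus a hasnull flag). Every Sketch*
name, the fixed array sizes, and the row representation are assumptions
made for illustration; none of this is code from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t SketchDatum;	/* stand-in for PostgreSQL's Datum */

#define SKETCH_NCOLS	2
#define SKETCH_MAXROWS	4

/* Hypothetical row-at-a-time input, one entry per tuple. */
typedef struct SketchRow
{
	SketchDatum values[SKETCH_NCOLS];
	bool		isnull[SKETCH_NCOLS];
} SketchRow;

/* Column-major vector, loosely mirroring BatchVector's per-batch state. */
typedef struct SketchVector
{
	SketchDatum cols[SKETCH_NCOLS][SKETCH_MAXROWS];
	bool		nulls[SKETCH_NCOLS][SKETCH_MAXROWS];
	bool		hasnull;	/* is any loaded value NULL? */
	int			nrows;		/* rows loaded into cols/nulls */
} SketchVector;

/* Transpose up to SKETCH_MAXROWS rows into per-column arrays. */
static void
sketch_build_vector(SketchVector *bv, const SketchRow *rows, int nrows)
{
	assert(nrows >= 0 && nrows <= SKETCH_MAXROWS);

	bv->hasnull = false;
	bv->nrows = nrows;
	for (int r = 0; r < nrows; r++)
	{
		for (int c = 0; c < SKETCH_NCOLS; c++)
		{
			bv->cols[c][r] = rows[r].values[c];
			bv->nulls[c][r] = rows[r].isnull[c];
			if (rows[r].isnull[c])
				bv->hasnull = true;
		}
	}
}

/* Sum one column in a tight loop, skipping NULL entries. */
static long
sketch_sum_column(const SketchVector *bv, int col)
{
	long		sum = 0;

	for (int r = 0; r < bv->nrows; r++)
	{
		if (!bv->nulls[col][r])
			sum += (long) bv->cols[col][r];
	}
	return sum;
}
```

The point of the layout is that a consumer such as sketch_sum_column()
walks one contiguous array per column, which is the access pattern a
batch-aware expression step would presumably want.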



  [application/octet-stream] v1-0002-SeqScan-add-batch-driven-variants-returning-slots.patch (27.2K, 6-v1-0002-SeqScan-add-batch-driven-variants-returning-slots.patch)
From 6a43a40037e4b656739743b3c0abdfb73a8f9b92 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 21:59:56 +0900
Subject: [PATCH v1 2/8] SeqScan: add batch-driven variants returning slots

Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per-tuple AM crossings without
changing the node interface seen by parents.

Add TupleBatch and supporting code in execBatch.c/h to hold executor
side batching state. PlanState gains ps_Batch to carry the active
TupleBatch when a node supports batching.

Wire up runtime selection in ExecInitSeqScan using
ScanCanUseBatching(). When executor_batching is enabled, EPQ is
inactive, the scan is not backward, and the relation supports
batching, ps.ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used.

Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled and allowed.

Notes / current limits:

- Batching uses EXEC_BATCH_ROWS (currently 64) as the target capacity.
- With the current heapam, batches are composed from a single page, so
  the batch may not always be full. Future work may let SeqScan and/or
  AMs top up batches across pages when safe to do so.
---
 src/backend/access/heap/heapam.c          |  29 ++++
 src/backend/access/heap/heapam_handler.c  |  15 ++
 src/backend/access/table/tableam.c        |  11 ++
 src/backend/executor/Makefile             |   1 +
 src/backend/executor/execBatch.c          | 117 ++++++++++++++
 src/backend/executor/execScan.c           |  31 ++++
 src/backend/executor/meson.build          |   1 +
 src/backend/executor/nodeSeqscan.c        | 176 +++++++++++++++++++++-
 src/backend/utils/init/globals.c          |   3 +
 src/backend/utils/misc/guc_parameters.dat |   7 +
 src/include/access/heapam.h               |   1 +
 src/include/access/tableam.h              |  27 ++++
 src/include/executor/execBatch.h          | 102 +++++++++++++
 src/include/executor/execScan.h           |  54 +++++++
 src/include/executor/executor.h           |   4 +
 src/include/miscadmin.h                   |   1 +
 src/include/nodes/execnodes.h             |   8 +
 17 files changed, 587 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/executor/execBatch.c
 create mode 100644 src/include/executor/execBatch.h
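
To make the tuple flow described in the commit message concrete, here is
a rough standalone sketch: the scan fills a batch from a single page, and
the caller still pulls one row at a time, crossing into the "AM" only when
the batch is drained. All Sketch* names, the integer rows, and the
page_size field are hypothetical stand-ins, not the patch's TupleBatch or
heapam code.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BATCH_ROWS 4		/* plays the role of EXEC_BATCH_ROWS */

/* Hypothetical scan over an int array split into fixed-size "pages". */
typedef struct SketchScan
{
	const int  *rows;
	int			nrows;
	int			pos;		/* next row to read */
	int			page_size;	/* rows per page */
	int			ncalls;		/* number of batch fetches issued */
} SketchScan;

typedef struct SketchBatch
{
	int			items[SKETCH_BATCH_ROWS];
	int			nvalid;		/* rows currently held */
	int			next;		/* index of next row to hand out */
} SketchBatch;

/* Refill the batch from the current page only; returns rows loaded. */
static int
sketch_getnextbatch(SketchScan *scan, SketchBatch *b)
{
	int			page_end;
	int			n = 0;

	assert(scan->page_size > 0);
	scan->ncalls++;

	page_end = (scan->pos / scan->page_size + 1) * scan->page_size;
	if (page_end > scan->nrows)
		page_end = scan->nrows;

	while (scan->pos < page_end && n < SKETCH_BATCH_ROWS)
		b->items[n++] = scan->rows[scan->pos++];

	b->nvalid = n;
	b->next = 0;
	return n;
}

/* Row-at-a-time interface on top of the batch; NULL means end of scan. */
static const int *
sketch_next(SketchScan *scan, SketchBatch *b)
{
	if (b->next >= b->nvalid && sketch_getnextbatch(scan, b) == 0)
		return NULL;
	return &b->items[b->next++];
}
```

With 10 rows, 6 rows per page and a capacity of 4, the fetches load 4, 2
and 4 rows before the final empty fetch, which also shows why a
single-page batch is not always full.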

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f62f7edbf5e..9fd7948482d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1570,6 +1570,35 @@ heap_begin_batch(TableScanDesc sscan, int maxitems)
 	return hb;
 }
 
+/*
+ * heap_materialize_batch_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
+{
+	HeapBatch *hb = (HeapBatch *) am_batch;
+
+	Assert(n <= hb->nitems);
+
+	for (int i = 0; i < n; i++)
+	{
+		HeapTupleData *tuple = &hb->tupdata[i];
+		HeapTupleTableSlot *slot = (HeapTupleTableSlot *) slots[i];
+
+		/* Inline of ExecStoreHeapTuple(tuple, slot, false) */
+		slot->tuple = tuple;
+		slot->off = 0;
+		slot->base.tts_nvalid = 0;
+		slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
+		slot->base.tts_tid = tuple->t_self;
+		slot->base.tts_tableOid = tuple->t_tableOid;
+		slot->base.tts_flags &= ~(TTS_FLAG_SHOULDFREE | TTS_FLAG_EMPTY);
+	}
+}
+
 /*
  * heap_scan_end_batch
  *
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ec4eeccf19c..8e88cc9e8f1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -72,6 +72,20 @@ heapam_slot_callbacks(Relation relation)
 	return &TTSOpsBufferHeapTuple;
 }
 
+/* ------------------------------------------------------------------------
+ * TupleBatch related callbacks for heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static const TupleBatchOps TupleBatchHeapOps = {
+	.materialize_all = heap_materialize_batch_all
+};
+
+static const TupleBatchOps *
+heapam_batch_callbacks(Relation relation)
+{
+	return &TupleBatchHeapOps;
+}
 
 /* ------------------------------------------------------------------------
  * Index Scan Callbacks for heap AM
@@ -2617,6 +2631,7 @@ static const TableAmRoutine heapam_methods = {
 	.type = T_TableAmRoutine,
 
 	.slot_callbacks = heapam_slot_callbacks,
+	.batch_callbacks = heapam_batch_callbacks,
 
 	.scan_begin = heap_beginscan,
 	.scan_end = heap_endscan,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5e41404937e..5a8ebb8b97c 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -103,6 +103,17 @@ table_slot_create(Relation relation, List **reglist)
 	return slot;
 }
 
+/* ----------------------------------------------------------------------------
+ * TupleBatch support routines
+ * ----------------------------------------------------------------------------
+ */
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+	if (relation->rd_tableam)
+		return relation->rd_tableam->batch_callbacks(relation);
+	elog(ERROR, "relation does not support TupleBatch operations");
+}
 
 /* ----------------------------------------------------------------------------
  * Table scan functions.
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..3e72f3fe03c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	execAmi.o \
 	execAsync.o \
+	execBatch.o \
 	execCurrent.o \
 	execExpr.o \
 	execExprInterp.o \
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
new file mode 100644
index 00000000000..007ae535687
--- /dev/null
+++ b/src/backend/executor/execBatch.c
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.c
+ *		Helpers for TupleBatch
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "executor/execBatch.h"
+
+/*
+ * TupleBatchCreate
+ *		Allocate and initialize a new TupleBatch envelope.
+ */
+TupleBatch *
+TupleBatchCreate(TupleDesc scandesc, int capacity)
+{
+	TupleBatch  *b;
+	TupleTableSlot **inslots,
+				   **outslots;
+
+	inslots = palloc(sizeof(TupleTableSlot *) * capacity);
+	outslots = palloc(sizeof(TupleTableSlot *) * capacity);
+	for (int i = 0; i < capacity; i++)
+		inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);
+
+	b = (TupleBatch *) palloc(sizeof(TupleBatch));
+
+	/* Initial state: empty envelope */
+	b->am_payload = NULL;
+	b->ntuples = 0;
+	b->inslots = inslots;
+	b->outslots = outslots;
+	b->activeslots = NULL;
+	b->materialized = false;
+	b->maxslots = capacity;
+
+	b->nvalid = 0;
+	b->next = 0;
+
+	return b;
+}
+
+/*
+ * TupleBatchReset
+ *		Reset an existing TupleBatch envelope to empty.
+ */
+void
+TupleBatchReset(TupleBatch *b, bool drop_slots)
+{
+	if (b == NULL)
+		return;
+
+	for (int i = 0; i < b->maxslots; i++)
+	{
+		ExecClearTuple(b->inslots[i]);
+		if (drop_slots)
+			ExecDropSingleTupleTableSlot(b->inslots[i]);
+	}
+
+	if (drop_slots)
+	{
+		pfree(b->inslots);
+		pfree(b->outslots);
+		b->inslots = b->outslots = NULL;
+	}
+
+	b->ntuples = 0;
+	b->nvalid = 0;
+	b->next = 0;
+	b->activeslots = NULL;
+}
+
+void
+TupleBatchUseInput(TupleBatch *b, int nvalid)
+{
+	b->materialized = true;
+	b->activeslots = b->inslots;
+	b->nvalid = nvalid;
+	b->next = 0;
+}
+
+void
+TupleBatchUseOutput(TupleBatch *b, int nvalid)
+{
+	b->materialized = true;
+	b->activeslots = b->outslots;
+	b->nvalid = nvalid;
+	b->next = 0;
+}
+
+bool
+TupleBatchIsValid(TupleBatch *b)
+{
+	return	b != NULL &&
+			b->maxslots > 0 &&
+			b->inslots != NULL &&
+			b->outslots != NULL;
+}
+
+void
+TupleBatchRewind(TupleBatch *b)
+{
+	b->next = 0;
+}
+
+int
+TupleBatchGetNumValid(TupleBatch *b)
+{
+	return b->nvalid;
+}
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 90726949a87..f24c5d73ae1 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -18,6 +18,7 @@
  */
 #include "postgres.h"
 
+#include "access/tableam.h"
 #include "executor/executor.h"
 #include "executor/execScan.h"
 #include "miscadmin.h"
@@ -154,3 +155,33 @@ ExecScanReScan(ScanState *node)
 		}
 	}
 }
+
+bool
+ScanCanUseBatching(ScanState *scanstate, int eflags)
+{
+	Relation	relation = scanstate->ss_currentRelation;
+
+	return	executor_batching &&
+			(scanstate->ps.state->es_epq_active == NULL) &&
+			!(eflags & EXEC_FLAG_BACKWARD) &&
+			relation && table_supports_batching(relation);
+}
+
+void
+ScanResetBatching(ScanState *scanstate, bool drop)
+{
+	TupleBatch *b = scanstate->ps.ps_Batch;
+
+	if (b)
+	{
+		TupleBatchReset(b, drop);
+		if (b->am_payload)
+		{
+			table_scan_end_batch(scanstate->ss_currentScanDesc,
+								 b->am_payload);
+			b->am_payload = NULL;
+		}
+		if (drop)
+			pfree(b);
+	}
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..40ffc28f3cb 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
 backend_sources += files(
   'execAmi.c',
   'execAsync.c',
+  'execBatch.c',
   'execCurrent.c',
   'execExpr.c',
   'execExprInterp.c',
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 94047d29430..2552d420f1c 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -203,6 +203,171 @@ ExecSeqScanEPQ(PlanState *pstate)
 					(ExecScanRecheckMtd) SeqRecheck);
 }
 
+/* ----------------------------------------------------------------
+ *						Batch Support
+ * ----------------------------------------------------------------
+ */
+static inline bool
+SeqNextBatch(SeqScanState *node)
+{
+	TableScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+
+	Assert(node->ss.ps.ps_Batch != NULL);
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	Assert(direction == ForwardScanDirection);
+
+	if (scandesc == NULL)
+	{
+		/*
+		 * We reach here if the scan is not parallel, or if we're serially
+		 * executing a scan that was planned to be parallel.
+		 */
+		scandesc = table_beginscan(node->ss.ss_currentRelation,
+								   estate->es_snapshot,
+								   0, NULL);
+		node->ss.ss_currentScanDesc = scandesc;
+	}
+
+	/* Lazily create the AM batch payload. */
+	if (node->ss.ps.ps_Batch->am_payload == NULL)
+	{
+		const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+		Assert(tam && tam->scan_begin_batch);
+		node->ss.ps.ps_Batch->am_payload =
+			table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+		node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+	}
+
+	node->ss.ps.ps_Batch->ntuples =
+		table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
+	node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
+	node->ss.ps.ps_Batch->materialized = false;
+
+	return node->ss.ps.ps_Batch->ntuples > 0;
+}
+
+static inline bool
+SeqNextBatchMaterialize(SeqScanState *node)
+{
+	if (SeqNextBatch(node))
+	{
+		TupleBatchMaterializeAll(node->ss.ps.ps_Batch);
+		return true;
+	}
+
+	return false;
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	/*
+	 * Use pg_assume() for != NULL tests to make the compiler realize no
+	 * runtime check for the field is needed in ExecScanExtendedBatchSlot().
+	 */
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScan() but when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	pg_assume(pstate->ps_ProjInfo != NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScan() but when qual evaluation and projection are
+ * required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	pg_assume(pstate->ps_ProjInfo != NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 pstate->qual, pstate->ps_ProjInfo);
+}
+
+/* Batch SeqScan enablement and dispatch */
+static void
+SeqScanInitBatching(SeqScanState *scanstate, int eflags)
+{
+	const int cap = EXEC_BATCH_ROWS;
+	TupleDesc	scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+
+	scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+
+	/* Choose batch variant to preserve the existing specialization matrix */
+	if (scanstate->ss.ps.qual == NULL)
+	{
+		if (scanstate->ss.ps.ps_ProjInfo == NULL)
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+		}
+		else
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+		}
+	}
+	else
+	{
+		if (scanstate->ss.ps.ps_ProjInfo == NULL)
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+		}
+		else
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+		}
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecInitSeqScan
  * ----------------------------------------------------------------
@@ -211,6 +376,7 @@ SeqScanState *
 ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 {
 	SeqScanState *scanstate;
+	bool	use_batching;
 
 	/*
 	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -241,9 +407,12 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 							 node->scan.scanrelid,
 							 eflags);
 
+	use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
+
 	/* and create slot with the appropriate rowtype */
 	ExecInitScanTupleSlot(estate, &scanstate->ss,
 						  RelationGetDescr(scanstate->ss.ss_currentRelation),
+						  use_batching ? &TTSOpsHeapTuple :
 						  table_slot_callbacks(scanstate->ss.ss_currentRelation));
 
 	/*
@@ -280,6 +449,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
 	}
 
+	if (use_batching)
+		SeqScanInitBatching(scanstate, eflags);
+
 	return scanstate;
 }
 
@@ -299,6 +471,8 @@ ExecEndSeqScan(SeqScanState *node)
 	 */
 	scanDesc = node->ss.ss_currentScanDesc;
 
+	ScanResetBatching(&node->ss, true);
+
 	/*
 	 * close heap scan
 	 */
@@ -327,7 +501,7 @@ ExecReScanSeqScan(SeqScanState *node)
 	if (scan != NULL)
 		table_rescan(scan,		/* scan desc */
 					 NULL);		/* new scan keys */
-
+	ScanResetBatching(&node->ss, false);
 	ExecScanReScan((ScanState *) node);
 }
 
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..b4a0996a717 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int			notify_buffers = 16;
 int			serializable_buffers = 32;
 int			subtransaction_buffers = 0;
 int			transaction_buffers = 0;
+
+/* executor batching */
+bool		executor_batching = false;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 6bc6be13d2a..c9fbb7ffef9 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -880,6 +880,13 @@
   boot_val => 'true',
 },
 
+{ name => 'executor_batching', type => 'bool', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Use tuple batching during execution.',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'executor_batching',
+  boot_val => 'true',
+},
+
 { name => 'data_sync_retry', type => 'bool', context => 'PGC_POSTMASTER', group => 'ERROR_HANDLING_OPTIONS',
   short_desc => 'Whether to continue running after a failure to sync data files.',
   variable => 'data_sync_retry',
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 02f7793fba0..13ce6166ec3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -314,6 +314,7 @@ extern bool heap_getnextslot(TableScanDesc sscan,
 extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
 extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
 extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+extern void heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n);
 
 extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 							  ItemPointer maxtid);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 953207eac50..05f828b9762 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "access/xact.h"
 #include "commands/vacuum.h"
+#include "executor/execBatch.h"
 #include "executor/tuptable.h"
 #include "storage/read_stream.h"
 #include "utils/rel.h"
@@ -39,6 +40,7 @@ typedef struct BulkInsertStateData BulkInsertStateData;
 typedef struct IndexInfo IndexInfo;
 typedef struct SampleScanState SampleScanState;
 typedef struct ValidateIndexState ValidateIndexState;
+typedef struct TupleBatchOps TupleBatchOps;
 
 /*
  * Bitmask values for the flags argument to the scan_begin callback.
@@ -301,6 +303,7 @@ typedef struct TableAmRoutine
 	 * Return slot implementation suitable for storing a tuple of this AM.
 	 */
 	const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
+	const TupleBatchOps *(*batch_callbacks)(Relation rel);
 
 
 	/* ------------------------------------------------------------------------
@@ -361,6 +364,7 @@ typedef struct TableAmRoutine
 									 ScanDirection dir);
 	void		(*scan_end_batch)(TableScanDesc sscan, void *am_batch);
 
+
 	/*-----------
 	 * Optional functions to provide scanning for ranges of ItemPointers.
 	 * Implementations must either provide both of these functions, or neither
@@ -872,6 +876,16 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
  */
 extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
 
+/* ----------------------------------------------------------------------------
+ * TupleBatch functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Returns callbacks for manipulating TupleBatch for tuples of the given
+ * relation.
+ */
+extern const TupleBatchOps *table_batch_callbacks(Relation relation);
 
 /* ----------------------------------------------------------------------------
  * Table scan functions.
@@ -1046,6 +1060,18 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
+/*
+ * table_supports_batching
+ *		Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation)
+{
+	const TableAmRoutine *tam = relation->rd_tableam;
+
+	return tam->scan_getnextbatch != NULL;
+}
+
 /*
  * table_scan_begin_batch
  *		Allocate AM-owned batch payload with capacity 'maxitems'.
@@ -2116,5 +2142,6 @@ extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
  */
 
 extern const TableAmRoutine *GetHeapamTableAmRoutine(void);
+extern struct TupleBatchOps *GetHeapamTupleBatchOps(void);
 
 #endif							/* TABLEAM_H */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
new file mode 100644
index 00000000000..6f1a38d14bd
--- /dev/null
+++ b/src/include/executor/execBatch.h
@@ -0,0 +1,102 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.h
+ *		Executor batch envelope for passing tuple batch state upward
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/include/executor/execBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECBATCH_H
+#define EXECBATCH_H
+
+#include "executor/tuptable.h"
+
+/* XXX fixed 64 for PoC */
+#define	EXEC_BATCH_ROWS		64
+
+/*
+ * TupleBatchOps -- AM-specific helpers for lazy materialization.
+ */
+typedef struct TupleBatchOps
+{
+	void (*materialize_all)(void *am_payload,
+							TupleTableSlot **dst,
+							int maxslots);
+} TupleBatchOps;
+
+/*
+ * TupleBatch
+ *
+ * Envelope for a batch of tuples produced by a plan node (e.g., SeqScan) per
+ * call to a batch variant of ExecSeqScan().
+ */
+typedef struct TupleBatch
+{
+	void	   *am_payload;
+	const TupleBatchOps *ops;
+	int			ntuples;				/* number of tuples in am_payload */
+	bool		materialized;		 /* tuples in slots valid? */
+	struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
+	struct TupleTableSlot **outslots; /* slots for tuples going "out of"
+									   * batch */
+	struct TupleTableSlot **activeslots;
+	int			maxslots;
+
+	int		nvalid;		/* number of returnable tuples in activeslots */
+	int		next;		/* 0-based index of next tuple to be returned */
+} TupleBatch;
+
+
+/* Helpers */
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
+extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
+extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
+extern bool TupleBatchIsValid(TupleBatch *b);
+extern void TupleBatchRewind(TupleBatch *b);
+extern int TupleBatchGetNumValid(TupleBatch *b);
+
+static inline TupleTableSlot *
+TupleBatchGetNextSlot(TupleBatch *b)
+{
+	return b->next < b->nvalid ? b->activeslots[b->next++] : NULL;
+}
+
+static inline TupleTableSlot *
+TupleBatchGetSlot(TupleBatch *b, int index)
+{
+	Assert(index < b->nvalid);
+	return b->activeslots[index];
+}
+
+static inline void
+TupleBatchStoreInOut(TupleBatch *b, int index, TupleTableSlot *out)
+{
+	Assert(TupleBatchIsValid(b));
+	b->outslots[index] = out;
+}
+
+static inline bool
+TupleBatchHasMore(TupleBatch *b)
+{
+	return b->activeslots && b->next < b->nvalid;
+}
+
+static inline void
+TupleBatchMaterializeAll(TupleBatch *b)
+{
+	if (b->materialized)
+		return;
+
+	if (b->ops == NULL || b->ops->materialize_all == NULL)
+		elog(ERROR, "TupleBatch has no slots and no materialize_all op");
+
+	b->ops->materialize_all(b->am_payload, b->inslots, b->ntuples);
+	TupleBatchUseInput(b, b->ntuples);
+}
+
+#endif	/* EXECBATCH_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index 837ea7785bb..fec606471c8 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -243,4 +243,58 @@ ExecScanExtended(ScanState *node,
 	}
 }
 
+static inline TupleTableSlot *
+ExecScanExtendedBatchSlot(ScanState *node,
+						  ExecScanAccessBatchMtd accessBatchMtd,
+						  ExprState *qual, ProjectionInfo *projInfo)
+{
+	ExprContext *econtext = node->ps.ps_ExprContext;
+	TupleBatch *b = node->ps.ps_Batch;
+
+	/* Batch path does not support EPQ */
+	Assert(node->ps.state->es_epq_active == NULL);
+	Assert(TupleBatchIsValid(b));
+
+	for (;;)
+	{
+		TupleTableSlot *in;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Get next input slot from current batch, or refill */
+		if (!TupleBatchHasMore(b))
+		{
+			if (!accessBatchMtd(node))
+				return NULL;
+		}
+
+		in = TupleBatchGetNextSlot(b);
+		Assert(in);
+
+		/* No qual, no projection: direct return */
+		if (qual == NULL && projInfo == NULL)
+			return in;
+
+		ResetExprContext(econtext);
+		econtext->ecxt_scantuple = in;
+
+		/* Qual only */
+		if (projInfo == NULL)
+		{
+			if (qual == NULL || ExecQual(qual, econtext))
+				return in;
+			else
+				InstrCountFiltered1(node, 1);
+			continue;
+		}
+
+		/* Projection (with or without qual) */
+		if (qual == NULL || ExecQual(qual, econtext))
+			return ExecProject(projInfo);
+		else
+			InstrCountFiltered1(node, 1);
+		/* else try next tuple */
+	}
+}
+
 #endif							/* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 3248e78cd28..17258f7ae2d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -575,12 +575,16 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
  */
 typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
 typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanAccessBatchMtd)(ScanState *node);
 
 extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
 								ExecScanRecheckMtd recheckMtd);
+
 extern void ExecAssignScanProjectionInfo(ScanState *node);
 extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
 extern void ExecScanReScan(ScanState *node);
+extern bool ScanCanUseBatching(ScanState *scanstate, int eflags);
+extern void ScanResetBatching(ScanState *scanstate, bool drop);
 
 /*
  * prototypes from functions in execTuples.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..b8e7afda57c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -287,6 +287,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
 extern PGDLLIMPORT int VacuumCostBalance;
 extern PGDLLIMPORT bool VacuumCostActive;
 
+extern PGDLLIMPORT bool executor_batching;
 
 /* in utils/misc/stack_depth.c */
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a36653c37f9..f4bb8f7dd7f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -30,6 +30,7 @@
 #define EXECNODES_H
 
 #include "access/tupconvert.h"
+#include "executor/execBatch.h"
 #include "executor/instrument.h"
 #include "fmgr.h"
 #include "lib/ilist.h"
@@ -1143,6 +1144,10 @@ typedef struct JsonExprState
  */
 typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
 
+/* Return a batch; may reuse caller-provided envelope. NULL => end of scan. */
+struct TupleBatch;
+typedef struct TupleBatch TupleBatch;
+
 /* ----------------
  *		PlanState node
  *
@@ -1198,6 +1203,9 @@ typedef struct PlanState
 	ExprContext *ps_ExprContext;	/* node's expression-evaluation context */
 	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */
 
+	/* Batching state if node supports it. */
+	TupleBatch *ps_Batch;
+
 	bool		async_capable;	/* true if node is async-capable */
 
 	/*
-- 
2.43.0



  [application/octet-stream] v1-0001-Add-batch-table-AM-API-and-heapam-implementation.patch (13.7K, 7-v1-0001-Add-batch-table-AM-API-and-heapam-implementation.patch)
From 3318650e720a01cbd5948349b9fbcdbb8ddda7cf Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 21:56:17 +0900
Subject: [PATCH v1 1/8] Add batch table AM API and heapam implementation

Introduce new table AM callbacks to fetch multiple tuples per call.
This reduces per-tuple call overhead by letting executor nodes work
in batches.

Define a HeapBatch structure and supporting code in tableam.h.
Batches are limited to tuples from a single page and at most
EXEC_BATCH_ROWS (currently 64) entries.

Provide initial heapam support with heapgettup_pagemode_batch().
No executor node is switched over yet; a later commit will adapt
SeqScan to use this API. Other nodes may adopt it in the future.

Also add pgstat_count_heap_getnext_batch() to record batched fetches
in pgstat.
---
 src/backend/access/heap/heapam.c         | 212 ++++++++++++++++++++++-
 src/backend/access/heap/heapam_handler.c |   4 +
 src/include/access/heapam.h              |  21 +++
 src/include/access/tableam.h             |  58 +++++++
 src/include/pgstat.h                     |   5 +
 5 files changed, 299 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ed0c0c2dc9f..f62f7edbf5e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1008,7 +1008,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 					int nkeys,
 					ScanKey key)
 {
-	HeapTuple	tuple = &(scan->rs_ctup);
+	HeapTuple tuple = &scan->rs_ctup;
 	Page		page;
 	uint32		lineindex;
 	uint32		linesleft;
@@ -1089,6 +1089,121 @@ continue_page:
 	scan->rs_inited = false;
 }
 
+/*
+ * heapgettup_pagemode_batch
+ *		Collect up to 'maxitems' visible tuples from a single page in page mode.
+ *
+ * This function returns a *batch* of tuples from one heap page. If the
+ * current page (as tracked by the scan desc) has no more tuples left,
+ * it will advance to the next page and prepare it (via heap_prepare_pagescan).
+ * It will not cross a page boundary while filling the batch.
+ *
+ * Return value:
+ *		number of tuples written into 'tdata' (0 at end-of-scan).
+ *
+ * Side effects:
+ *	- Ensures rs_cbuf pins the page from which tuples were produced.
+ *	- Sets rs_cblock, rs_cindex, rs_ntuples consistently (same as
+ *	  heapgettup_pagemode’s inner-loop effects).
+ *	- Does *not* change buffer pin counts except through normal page
+ *	  transitions performed by heap_fetch_next_buffer().
+ */
+static int
+heapgettup_pagemode_batch(HeapScanDesc scan,
+						  ScanDirection dir,
+						  int nkeys, ScanKey key,
+						  HeapTupleData *tdata,
+						  int maxitems)
+{
+	Page		page;
+	uint32		lineindex;
+	uint32		linesleft;
+	int			nout = 0;
+
+	Assert(ScanDirectionIsForward(dir));
+	Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
+	Assert(maxitems > 0);
+
+	/*
+	 * If we have no current page (or the current page is exhausted),
+	 * advance to the next page that has any visible tuples and prepare it.
+	 * This mirrors the outer loop of heapgettup_pagemode(), but we stop
+	 * as soon as we have a prepared page; we never produce from two pages.
+	 */
+	for (;;)
+	{
+		if (BufferIsValid(scan->rs_cbuf))
+		{
+			/* Are there more visible tuples left on this page? */
+			lineindex = scan->rs_cindex + dir;
+			if (ScanDirectionIsForward(dir))
+				linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
+					(scan->rs_ntuples - lineindex) : 0;
+			else
+				linesleft = scan->rs_cindex;
+			if (linesleft > 0)
+				break;	/* continue on this page */
+		}
+
+		/* Move to next page and prepare its visible tuple list. */
+		heap_fetch_next_buffer(scan, dir);
+
+		if (!BufferIsValid(scan->rs_cbuf))
+		{
+			/* end of scan; keep rs_cbuf invalid like heapgettup_pagemode */
+			scan->rs_cblock = InvalidBlockNumber;
+			scan->rs_prefetch_block = InvalidBlockNumber;
+			scan->rs_inited = false;
+			return 0;
+		}
+
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+		heap_prepare_pagescan((TableScanDesc) scan);
+
+		/* After prepare, either rs_ntuples > 0 or we'll loop again. */
+		if (scan->rs_ntuples > 0)
+		{
+			lineindex = ScanDirectionIsForward(dir) ? 0 : scan->rs_ntuples - 1;
+			linesleft = scan->rs_ntuples - (ScanDirectionIsForward(dir) ? 0 : 0);
+			break;
+		}
+		/* else: page had no visible tuples; continue to next page */
+	}
+
+	/* From here on, we must only read tuples from this single page. */
+	page = BufferGetPage(scan->rs_cbuf);
+
+	/*
+	 * Walk rs_vistuples[] from 'lineindex', copying headers into tdata[]
+	 * until either the page is exhausted or the batch capacity is reached.
+	 */
+	for (; linesleft > 0 && nout < maxitems; linesleft--, lineindex += dir)
+	{
+		OffsetNumber	lineoff;
+		ItemId			lpp;
+		HeapTupleData *dst = &tdata[nout];
+
+		Assert(lineindex <= (uint32) scan->rs_ntuples);
+		lineoff = scan->rs_vistuples[lineindex];
+		lpp = PageGetItemId(page, lineoff);
+		Assert(ItemIdIsNormal(lpp));
+
+		dst->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+		dst->t_len  = ItemIdGetLength(lpp);
+		dst->t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+		ItemPointerSet(&(dst->t_self), scan->rs_cblock, lineoff);
+
+		if (key != NULL &&
+			!HeapKeyTest(dst, RelationGetDescr(scan->rs_base.rs_rd),
+						 nkeys, key))
+			continue;
+
+		scan->rs_cindex = lineindex;
+		nout++;
+	}
+
+	return nout;
+}
 
 /* ----------------------------------------------------------------
  *					 heap access method interface
@@ -1136,6 +1251,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_base.rs_parallel = parallel_scan;
 	scan->rs_strategy = NULL;	/* set in initscan */
 	scan->rs_cbuf = InvalidBuffer;
+	scan->rs_batch_ctup = NULL;
+	scan->rs_batch_cbuf = InvalidBuffer;
 
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1315,6 +1432,8 @@ heap_endscan(TableScanDesc sscan)
 	 */
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
+	if (BufferIsValid(scan->rs_batch_cbuf))
+		ReleaseBuffer(scan->rs_batch_cbuf);
 
 	/*
 	 * Must free the read stream before freeing the BufferAccessStrategy.
@@ -1421,6 +1540,97 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
 	return true;
 }
 
+/*---------- Batching support -----------*/
+
+/*
+ * heap_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+{
+	HeapBatch  *hb;
+	Oid			relid;
+
+	Assert(maxitems > 0);
+
+	hb = palloc(sizeof(HeapBatch));
+	hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
+	hb->maxitems = maxitems;
+	hb->nitems = 0;
+	hb->buf = InvalidBuffer;
+
+	/* Initialize static fields of HeapTupleData. Row bodies remain on page. */
+	relid = RelationGetRelid(sscan->rs_rd);
+	for (int i = 0; i < maxitems; i++)
+		hb->tupdata[i].t_tableOid = relid;
+
+	return hb;
+}
+
+/*
+ * heap_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
+{
+	HeapBatch *hb = (HeapBatch *) am_batch;
+
+	if (BufferIsValid(hb->buf))
+		ReleaseBuffer(hb->buf);
+
+	pfree(hb->tupdata);
+	pfree(hb);
+}
+
+int
+heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+	HeapScanDesc scan = (HeapScanDesc) sscan;
+	HeapBatch  *hb = (HeapBatch *) am_batch;
+	Buffer		curbuf;
+	int			n;
+
+	Assert(ScanDirectionIsForward(dir));
+	Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+	Assert(hb->maxitems > 0);
+
+	/* Drop prior batch pin, if any. */
+	if (BufferIsValid(hb->buf))
+	{
+		ReleaseBuffer(hb->buf);
+		hb->buf = InvalidBuffer;
+	}
+
+	hb->nitems = 0;
+
+	/* One call per batch, never crosses a page. */
+	n = heapgettup_pagemode_batch(scan, dir,
+								  sscan->rs_nkeys, sscan->rs_key,
+								  hb->tupdata, hb->maxitems);
+
+	if (n == 0)
+		return 0;	/* end of scan */
+
+	/* Hold a shared pin for the batch lifetime so t_data stays valid. */
+	curbuf = scan->rs_cbuf;
+	IncrBufferRefCount(curbuf);
+	hb->buf = curbuf;
+
+	/* Per-tuple stats (can be collapsed into a future _multi() call). */
+	pgstat_count_heap_getnext_batch(sscan->rs_rd, n);
+
+	hb->nitems = n;
+	return n;
+}
+
+/*----- End of batching support -----*/
+
 void
 heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 				  ItemPointer maxtid)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..ec4eeccf19c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2623,6 +2623,10 @@ static const TableAmRoutine heapam_methods = {
 	.scan_rescan = heap_rescan,
 	.scan_getnextslot = heap_getnextslot,
 
+	.scan_begin_batch = heap_begin_batch,
+	.scan_getnextbatch = heap_getnextbatch,
+	.scan_end_batch = heap_end_batch,
+
 	.scan_set_tidrange = heap_set_tidrange,
 	.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e60d34dad25..02f7793fba0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -74,6 +74,9 @@ typedef struct HeapScanDescData
 
 	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
 
+	HeapTupleData *rs_batch_ctup;	/* NULL when not using batched mode */
+	Buffer	rs_batch_cbuf;		/* buffer feeding the batch */
+
 	/* For scans that stream reads */
 	ReadStream *rs_read_stream;
 
@@ -101,6 +104,19 @@ typedef struct HeapScanDescData
 } HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
+/*
+ * HeapBatch -- stateless per-batch buffer. A batch pins one page and
+ * exposes up to maxitems HeapTupleData headers whose t_data point into that
+ * page.
+ */
+typedef struct HeapBatch
+{
+	HeapTupleData  *tupdata;	/* len = maxitems; headers only */
+	int				nitems;		/* tuples produced in last getnextbatch() */
+	int				maxitems;	/* fixed capacity set at begin_batch() */
+	Buffer			buf;		/* single pinned buffer for this batch */
+} HeapBatch;
+
 typedef struct BitmapHeapScanDescData
 {
 	HeapScanDescData rs_heap_base;
@@ -294,6 +310,11 @@ extern void heap_endscan(TableScanDesc sscan);
 extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
 extern bool heap_getnextslot(TableScanDesc sscan,
 							 ScanDirection direction, TupleTableSlot *slot);
+
+extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
+extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
+extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+
 extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 							  ItemPointer maxtid);
 extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..953207eac50 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -351,6 +351,16 @@ typedef struct TableAmRoutine
 									 ScanDirection direction,
 									 TupleTableSlot *slot);
 
+	/* ------------------------------------------------------------------------
+	 * Batched scan support
+	 * ------------------------------------------------------------------------
+	 */
+
+	void	   *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+	int			(*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+									 ScanDirection dir);
+	void		(*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
 	/*-----------
 	 * Optional functions to provide scanning for ranges of ItemPointers.
 	 * Implementations must either provide both of these functions, or neither
@@ -1036,6 +1046,54 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
+/*
+ * table_scan_begin_batch
+ *		Allocate AM-owned batch payload with capacity 'maxitems'.
+ */
+static inline void *
+table_scan_begin_batch(TableScanDesc sscan, int maxitems)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	Assert(tam->scan_begin_batch != NULL);
+
+	return tam->scan_begin_batch(sscan, maxitems);
+}
+
+/*
+ * table_scan_getnextbatch
+ *		Fill next batch from the AM. Returns number of tuples, 0 => EOS.
+ *		Batches are single-page in v1. Direction is forward only in v1.
+ */
+static inline int
+table_scan_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	/* Only forward scans are supported in the batched mode. */
+	Assert(dir == ForwardScanDirection);
+	Assert(tam->scan_getnextbatch != NULL);
+
+	return tam->scan_getnextbatch(sscan, am_batch, dir);
+}
+
+/*
+ * table_scan_end_batch
+ *		Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, void *am_batch)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	if (am_batch == NULL)
+		return;
+
+	Assert(tam->scan_end_batch != NULL);
+
+	tam->scan_end_batch(sscan, am_batch);
+}
+
 /* ----------------------------------------------------------------------------
  * TID Range scanning related functions.
  * ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e4a59a30b8c..aaea9520b1d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -687,6 +687,11 @@ extern void pgstat_report_analyze(Relation rel,
 		if (pgstat_should_count_relation(rel))						\
 			(rel)->pgstat_info->counts.tuples_returned++;			\
 	} while (0)
+#define pgstat_count_heap_getnext_batch(rel, n)						\
+	do {															\
+		if (pgstat_should_count_relation(rel))						\
+			(rel)->pgstat_info->counts.tuples_returned += n;		\
+	} while (0)
 #define pgstat_count_heap_fetch(rel)								\
 	do {															\
 		if (pgstat_should_count_relation(rel))						\
-- 
2.43.0



  [application/octet-stream] v1-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch (9.0K, 8-v1-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch)
From 64971ee050c86326c2ca6023c302cff661383251 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 22:18:30 +0900
Subject: [PATCH v1 3/8] Executor: add ExecProcNodeBatch() and integrate
 SeqScan with batch API
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Introduce a batch-capable executor interface alongside the existing
slot-at-a-time path:

 * ExecProcNodeBatch() is added to return a TupleBatch instead of a
   TupleTableSlot. PlanState gains ExecProcNodeBatch as a function
   pointer.

Integrate SeqScan with this interface:

 * Add ExecSeqScanBatch* routines that drive heap via the batch table
   AM API and return a TupleBatch.
 * At init, set ps.ExecProcNodeBatch to these routines when
   ScanCanUseBatching() allows.
 * Retain ExecSeqScanBatchSlot* variants for slot-at-a-time consumers.

This builds on 0002, which introduced TupleBatch and made SeqScan
consume the AM’s batch API internally but still surface slots. With this
patch, SeqScan can surface batches directly to batch-aware upper nodes.

Plan shape and EXPLAIN output remain unchanged; only internal tuple flow
differs when batching is enabled and allowed.
---
 src/backend/executor/execProcnode.c | 52 +++++++++++++++++++++++++++++
 src/backend/executor/nodeSeqscan.c  | 35 +++++++++++++++++++
 src/include/executor/execScan.h     | 51 ++++++++++++++++++++++++++++
 src/include/executor/executor.h     | 10 ++++++
 src/include/nodes/execnodes.h       |  5 +++
 5 files changed, 153 insertions(+)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f5f9cfbeead..a8c0315e874 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,6 +121,8 @@
 
 static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
 static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
+static TupleBatch *ExecProcNodeBatchFirst(PlanState *node);
+static TupleBatch *ExecProcNodeBatchInstr(PlanState *node);
 static bool ExecShutdownNode_walker(PlanState *node, void *context);
 
 
@@ -389,6 +391,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 	}
 
 	ExecSetExecProcNode(result, result->ExecProcNode);
+	if (result->ExecProcNodeBatch)
+		ExecSetExecProcNodeBatch(result, result->ExecProcNodeBatch);
 
 	/*
 	 * Initialize any initPlans present in this node.  The planner put them in
@@ -489,6 +493,54 @@ ExecProcNodeInstr(PlanState *node)
 	return result;
 }
 
+/*
+ * ExecSetExecProcNodeBatch
+ *		Install ExecProcNodeBatch with first-call wrapper, mirroring row path.
+ */
+void
+ExecSetExecProcNodeBatch(PlanState *node, ExecProcNodeBatchMtd function)
+{
+	node->ExecProcNodeBatchReal = function;
+	node->ExecProcNodeBatch = ExecProcNodeBatchFirst;
+}
+
+/*
+ * ExecProcNodeBatchFirst
+ *		One-time stack-depth check; then pick instrument/no-instrument wrapper.
+ */
+static TupleBatch *
+ExecProcNodeBatchFirst(PlanState *node)
+{
+	check_stack_depth();
+
+	if (node->instrument)
+		node->ExecProcNodeBatch = ExecProcNodeBatchInstr;
+	else
+		node->ExecProcNodeBatch = node->ExecProcNodeBatchReal;
+
+	return node->ExecProcNodeBatch(node);
+}
+
+/*
+ * ExecProcNodeBatchInstr
+ *		Instrumentation wrapper for batch calls.
+ *
+ * Note: we can record nrows as the "tuple" count for this call. That keeps
+ * instrumentation meaningful without changing Instr API.
+ */
+static TupleBatch *
+ExecProcNodeBatchInstr(PlanState *node)
+{
+	TupleBatch *b;
+
+	InstrStartNode(node->instrument);
+
+	b = node->ExecProcNodeBatchReal(node);
+
+	InstrStopNode(node->instrument, b ? (double) b->nvalid : 0.0);
+
+	return b;
+}
 
 /* ----------------------------------------------------------------
  *		MultiExecProcNode
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 2552d420f1c..3f7e40c8908 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -334,6 +334,37 @@ ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
 									 pstate->qual, pstate->ps_ProjInfo);
 }
 
+static TupleBatch *
+ExecSeqScanBatch(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatch(&node->ss,
+								 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+								 NULL, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatch() but when qual evaluation is required.
+ */
+static TupleBatch *
+ExecSeqScanBatchWithQual(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatch(&node->ss,
+								 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+								 pstate->qual, NULL);
+}
+
 /* Batch SeqScan enablement and dispatch */
 static void
 SeqScanInitBatching(SeqScanState *scanstate, int eflags)
@@ -348,10 +379,12 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
 	{
 		if (scanstate->ss.ps.ps_ProjInfo == NULL)
 		{
+			scanstate->ss.ps.ExecProcNodeBatch = ExecSeqScanBatch;
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
 		}
 		else
 		{
+			scanstate->ss.ps.ExecProcNodeBatch = NULL;
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
 		}
 	}
@@ -359,10 +392,12 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
 	{
 		if (scanstate->ss.ps.ps_ProjInfo == NULL)
 		{
+			scanstate->ss.ps.ExecProcNodeBatch = ExecSeqScanBatchWithQual;
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
 		}
 		else
 		{
+			scanstate->ss.ps.ExecProcNodeBatch = NULL;
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
 		}
 	}
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index fec606471c8..fb4b57a831c 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -297,4 +297,55 @@ ExecScanExtendedBatchSlot(ScanState *node,
 	}
 }
 
+static inline TupleBatch *
+ExecScanExtendedBatch(ScanState *node,
+					  ExecScanAccessBatchMtd accessBatchMtd,
+					  ExprState *qual, ProjectionInfo *projInfo)
+{
+	ExprContext *econtext = node->ps.ps_ExprContext;
+	TupleBatch *b = node->ps.ps_Batch;
+	int			qualified;
+
+	/* Batch path does not support EPQ */
+	Assert(node->ps.state->es_epq_active == NULL);
+	Assert(TupleBatchIsValid(b));
+
+	for (;;)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* Get next batch from the AM */
+		if (!accessBatchMtd(node))
+			return NULL;
+
+		if (qual != NULL)
+		{
+			qualified = 0;
+			while (TupleBatchHasMore(b))
+			{
+				TupleTableSlot *in = TupleBatchGetNextSlot(b);
+
+				Assert(in);
+				ResetExprContext(econtext);
+				econtext->ecxt_scantuple = in;
+
+				if (ExecQual(qual, econtext))
+				{
+					TupleBatchStoreInOut(b, qualified, in);
+					qualified++;
+				}
+				else
+					InstrCountFiltered1(node, 1);
+			}
+			TupleBatchUseOutput(b, qualified);
+		}
+		else
+			qualified = b->nvalid;
+
+		if (qualified > 0)
+			return b;
+		/* else get the next batch from the AM */
+	}
+}
+
 #endif							/* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 17258f7ae2d..cf5b0c7e05c 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -294,6 +294,7 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
  */
 extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
 extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
+extern void ExecSetExecProcNodeBatch(PlanState *node, ExecProcNodeBatchMtd function);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern void ExecShutdownNode(PlanState *node);
@@ -315,6 +316,15 @@ ExecProcNode(PlanState *node)
 
 	return node->ExecProcNode(node);
 }
+
+static inline TupleBatch *
+ExecProcNodeBatch(PlanState *node)
+{
+	if (node->chgParam != NULL) /* something changed? */
+		ExecReScan(node);		/* let ReScan handle this */
+
+	return node->ExecProcNodeBatch(node);
+}
 #endif
 
 /*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f4bb8f7dd7f..a104591ac20 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1147,6 +1147,7 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
 /* Return a batch; may reuse caller-provided envelope. NULL => end of scan. */
 struct TupleBatch;
 typedef struct TupleBatch TupleBatch;
+typedef TupleBatch *(*ExecProcNodeBatchMtd)(struct PlanState *ps);
 
 /* ----------------
  *		PlanState node
@@ -1171,6 +1172,10 @@ typedef struct PlanState
 	ExecProcNodeMtd ExecProcNodeReal;	/* actual function, if above is a
 										 * wrapper */
 
+	/* Optional batch-producing entry point (NULL => no batching). */
+	ExecProcNodeBatchMtd ExecProcNodeBatch;
+	ExecProcNodeBatchMtd ExecProcNodeBatchReal;
+
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
 
-- 
2.43.0




* Re: Batching in executor
@ 2025-09-26 13:49  Bruce Momjian <[email protected]>
  parent: Amit Langote <[email protected]>
  1 sibling, 1 reply; 22+ messages in thread

From: Bruce Momjian @ 2025-09-26 13:49 UTC (permalink / raw)
  To: Amit Langote <[email protected]>; +Cc: pgsql-hackers

On Fri, Sep 26, 2025 at 10:28:33PM +0900, Amit Langote wrote:
> At PGConf.dev this year we had an unconference session [1] on whether
> the community can support an additional batch executor. The discussion
> there led me to start hacking on $subject. I have also had off-list
> discussions on this topic in recent months with Andres and David, who
> have offered useful thoughts.
> 
> This patch series is an early attempt to make executor nodes pass
> around batches of tuples instead of tuple-at-a-time slots. The main
> motivation is to enable expression evaluation in batch form, which can
> substantially reduce per-tuple overhead (mainly from function calls)
> and open the door to further optimizations such as SIMD usage in
> aggregate transition functions. We could even change algorithms of
> some plan nodes to operate on batches when, for example, a child node
> can return batches.

For background, people might want to watch these two videos from POSETTE
2025.  The first video explains how data warehouse query needs are
different from OLTP needs:

	Building a PostgreSQL data warehouse
	https://www.youtube.com/watch?v=tpq4nfEoioE

and the second one explains the executor optimizations done in PG 18:

	Hacking Postgres Executor For Performance
	https://www.youtube.com/watch?v=D3Ye9UlcR5Y

I learned from these two videos that handling new workloads requires
thinking about query demands differently. The open question, of course,
is whether this can be accomplished without hampering OLTP workloads.

-- 
  Bruce Momjian  <[email protected]>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Do not let urgent matters crowd out time for investment in the future.






* Re: Batching in executor
@ 2025-09-29 11:01  Tomas Vondra <[email protected]>
  parent: Amit Langote <[email protected]>
  1 sibling, 4 replies; 22+ messages in thread

From: Tomas Vondra @ 2025-09-29 11:01 UTC (permalink / raw)
  To: Amit Langote <[email protected]>; pgsql-hackers

Hi Amit,

Thanks for the patch. I took a look over the weekend, and did a couple
of experiments / benchmarks, so let me share some initial feedback (or
rather a bunch of questions I came up with).

I'll start with some general thoughts, before going into some nitpicky
comments about patches / code and perf results.

I think the general goal of the patch - reducing the per-tuple overhead
and making the executor more efficient for OLAP workloads - is very
desirable. I believe the limitations of the per-row executor are one of the
reasons why attempts to implement a columnar TAM mostly failed. The
compression is nice, but it's hard to be competitive without an executor
that leverages that too. So starting with an executor, in a way that
helps even heap, seems like a good plan. So +1 to this.

While looking at the patch, I couldn't help but think about the index
prefetching stuff that I work on. It also introduces the concept of a
"batch", for passing data between an index AM and the executor. It's
interesting how different the designs are in some respects. I'm not
saying one of those designs is wrong, it's more due to different goals.

For example, the index prefetching patch establishes a "shared" batch
struct, and the index AM is expected to fill it with data. After that,
the batch is managed entirely by indexam.c, with no AM calls. The only
AM-specific bit in the batch is "position", but that's used only when
advancing to the next page, etc.

This patch does things differently. IIUC, each TAM may produce its own
"batch", which is then wrapped in a generic one. For example, heap
produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
is fine. In the prefetching we chose to move all this code (walking the
batch items) from the AMs into the layer above, and make it AM agnostic.

But for the batching, we want to retain the custom format as long as
possible. Presumably, the various advantages of the TAMs are tied to the
custom/columnar storage format. Memory efficiency thanks to compression,
execution on compressed data, etc. Keeping the custom format as long as
possible is the whole point of "late materialization" (and materializing
as late as possible is one of the important details in column stores).
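
To make the "keep the AM format, materialize late" idea concrete, here
is a minimal standalone C sketch. It is not the patch's TupleBatch or
HeapBatch layout; every name (SketchEnvelope, SketchHeapPayload,
sketch_sum, ...) is hypothetical, and rows are plain ints to keep it
short. The generic envelope holds an opaque AM payload plus a callback,
so the upper layer decodes a row only when it actually needs one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical AM-specific payload; stands in for something like a heap batch. */
typedef struct SketchHeapPayload
{
	const int  *values;			/* AM-private row storage */
	int			nitems;			/* number of rows in the payload */
} SketchHeapPayload;

/* Hypothetical generic envelope passed between executor layers. */
typedef struct SketchEnvelope
{
	void	   *am_payload;		/* opaque to everyone but the AM */
	int			nvalid;			/* rows visible to the consumer */
	int			(*get_row) (const void *am_payload, int index);
} SketchEnvelope;

/* AM-side accessor: the only code that knows the payload layout. */
static int
sketch_heap_get_row(const void *am_payload, int index)
{
	const SketchHeapPayload *p = (const SketchHeapPayload *) am_payload;

	assert(index >= 0 && index < p->nitems);
	return p->values[index];
}

/* Wrap an AM payload in the generic envelope without copying any rows. */
static void
sketch_wrap_heap(SketchEnvelope *env, SketchHeapPayload *payload)
{
	assert(env != NULL && payload != NULL);
	env->am_payload = payload;
	env->nvalid = payload->nitems;
	env->get_row = sketch_heap_get_row;
}

/* Consumer: materializes each row only at the point of use. */
static long
sketch_sum(const SketchEnvelope *env)
{
	long		sum = 0;

	for (int i = 0; i < env->nvalid; i++)
		sum += env->get_row(env->am_payload, i);
	return sum;
}
```

A columnar AM could install a different get_row callback without the
consumer changing at all; whether a per-row callback is cheap enough is
a separate question that this sketch does not answer.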

How far ahead have you thought about these capabilities? I was wondering
about two things in particular. First, at which point do we have to
"materialize" the TupleBatch into some generic format (e.g. TupleSlots)?
I get it that you want to enable passing batches between nodes, but
would those use the same "format" as the underlying scan node, or some
generic one? Second, will it be possible to execute expressions on the
custom batches (i.e. on "compressed data")? Or is it necessary to
"materialize" the batch into regular tuple slots? I realize those may
not be there "now" but maybe it'd be nice to plan for the future.

It might be worth exploring some columnar formats, and see if this
design would be a good fit. Let's say we want to process data read from
a parquet file. Would we be able to leverage the format, or would we
need to "materialize" into slots too early? Or maybe it'd be good to
look at the VCI extension [1], discussed in a nearby thread. AFAICS
that's still based on an index AM, but there were suggestions to use TAM
instead (and maybe that'd be a better choice).

The other option would be to "create batches" during execution, say by
having a new node that accumulates tuples, builds a batch and sends it
to the node above. This would help both in cases when either the lower
node does not produce batches at all, or the batches are too small (due
to filtering, aggregation, ...). Of course, it'd only win if this
increases efficiency of the upper part of the plan enough to pay for
building the batches. That can be a hard decision.
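
As a rough illustration of such a batch-building node, here is a
standalone C sketch. It is not code from the patch set: the names
(SketchBatch, sketch_build_batch, SketchCounter) are hypothetical, rows
are plain ints, and the capacity of 64 only mirrors the EXEC_BATCH_ROWS
value mentioned in the 0001 commit message. The builder pulls rows one
at a time from a child and hands back a full (or final, partial) batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_BATCH_CAPACITY 64	/* hypothetical fixed batch capacity */

/* Hypothetical row-at-a-time child: returns false once exhausted. */
typedef bool (*sketch_next_row_fn) (void *state, int *row);

typedef struct SketchBatch
{
	int			rows[SKETCH_BATCH_CAPACITY];
	int			nrows;
} SketchBatch;

/*
 * Pull rows from the child until the batch is full or the child runs
 * dry.  Returns the number of rows collected; 0 means end of input.
 */
static int
sketch_build_batch(sketch_next_row_fn next, void *state, SketchBatch *batch)
{
	assert(next != NULL && batch != NULL);

	batch->nrows = 0;
	while (batch->nrows < SKETCH_BATCH_CAPACITY)
	{
		int			row;

		if (!next(state, &row))
			break;
		batch->rows[batch->nrows++] = row;
	}
	return batch->nrows;
}

/* Example child that yields the integers 0 .. limit-1. */
typedef struct SketchCounter
{
	int			next;
	int			limit;
} SketchCounter;

static bool
sketch_counter_next(void *state, int *row)
{
	SketchCounter *c = (SketchCounter *) state;

	if (c->next >= c->limit)
		return false;
	*row = c->next++;
	return true;
}
```

With a child that yields 150 rows, successive calls return batches of
64, 64 and 22 rows, then 0. The per-row copy into the batch is exactly
the cost the upper nodes would have to win back.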

You also mentioned we could make batches larger by letting them span
multiple pages, etc. I'm not sure that's worth it - wouldn't that
substantially complicate the TAM code, which would need to pin+track
multiple buffers for each batch, etc.? Possible, but is it worth it?

I'm not sure allowing multi-page batches would actually solve the issue.
It'd help with batches at the "scan level", but presumably the batch
size in the upper nodes matters just as much. Large scan batches may
help, but it's hard to predict.

In the index prefetching patch we chose to keep batches 1:1 with leaf
pages, at least for now. Instead we allowed having multiple batches at
once. I'm not sure that'd be necessary for TAMs, though.

This also reminds me of LIMIT queries. The way I imagine a "batchified"
executor to work is that batches are essentially "units of work". For
example, a nested loop would grab a batch of tuples from the outer
relation, lookup inner tuples for the whole batch, and only then pass
the result batch. (I'm ignoring the cases when the batch explodes due to
duplicates.)

But what if there's a LIMIT 1 on top? Maybe it'd be enough to process
just the first tuple, and the rest of the batch is wasted work? Plenty
of (very expensive) OLAP queries have that, and many would likely benefit from
batching, so just disabling batching if there's LIMIT seems way too
heavy handed.

Perhaps it'd be good to gradually ramp up the batch size? Start with
small batches, and then make them larger. The index prefetching does
that too, indirectly - it reads the whole leaf page as a batch, but then
gradually ramps up the prefetch distance (well, read_stream does that).
Maybe the batching should have a similar thing ...
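
As a rough illustration of such a ramp-up, assuming a simple doubling
policy (the policy, the bounds and all names below are hypothetical, not
something the patches implement): the first batch holds a single tuple,
so a LIMIT 1 on top pays for one tuple only, and the size reaches the
maximum after a handful of batches.

```c
#include <assert.h>

#define SKETCH_MIN_BATCH 1		/* hypothetical starting size */
#define SKETCH_MAX_BATCH 64		/* hypothetical cap, mirrors the 64-tuple limit */

/* Size to use for the next batch: double the current one, capped at the max. */
int
sketch_next_batch_size(int current)
{
	assert(current >= SKETCH_MIN_BATCH && current <= SKETCH_MAX_BATCH);

	if (current > SKETCH_MAX_BATCH / 2)
		return SKETCH_MAX_BATCH;
	return current * 2;
}

/* Total number of tuples produced after nbatches batches, starting at the min. */
int
sketch_tuples_after(int nbatches)
{
	int			size = SKETCH_MIN_BATCH;
	int			total = 0;

	for (int i = 0; i < nbatches; i++)
	{
		total += size;
		size = sketch_next_batch_size(size);
	}
	return total;
}
```

With these bounds the sizes go 1, 2, 4, ..., 64, so only the first 63
tuples are processed in undersized batches; everything after that runs
at the full batch size.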

In fact, how shall the optimizer decide whether to use batching? It's
one thing to decide whether a node can produce/consume batches, but
another thing is "should it"? With a node that "builds" a batch, this
decision would apply to even more plans, I guess.

I don't have a great answer to this, it seems like an incredibly tricky
costing issue. I'm a bit worried we might end up with something too
coarse, like "jit=on" which we know is causing problems (admittedly,
mostly due to a lot of the LLVM work being unpredictable/external). But
having some "adaptive" heuristics (like the gradual ramp up) might make
it less risky.

FWIW the current batch size limit (64 tuples) seems rather low, but it's
hard to say. It'd be good to be able to experiment with different
values, so I suggest we make this a GUC and not a hard-coded constant.

As for what to add to explain, I'd start by adding info about which
nodes are "batched" (consuming/producing batches), and some info about
the batch sizes. An average size, maybe a histogram if you want to be a
bit fancy.

I have no thoughts about the expression patches, at least not beyond
what I already wrote above. I don't know enough about that part.

[1]
https://www.postgresql.org/message-id/OS7PR01MB119648CA4E8502FE89056E56EEA7D2%40OS7PR01MB11964.jpnpr...


Now, numbers from some microbenchmarks:

On 9/26/25 15:28, Amit Langote wrote:
> 
> To evaluate the overheads and benefits, I ran microbenchmarks with
> single and multi-aggregate queries on a single table, with and without
> WHERE clauses. Tables were fully VACUUMed so visibility maps are set
> and IO costs are minimal. shared_buffers was large enough to fit the
> whole table (up to 10M rows, ~43 on each page), and all pages were
> prewarmed into cache before tests. Table schema/script is at [2].
> 
> Observations from benchmarking (Detailed benchmark tables are at [3];
> below is just a high-level summary of the main patterns):
> 
> * Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
> sum(a) FROM bar_N): batching scan output alone improved latency by
> ~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
> especially once fmgr overhead was paid per batch instead of per row.
> 
> * Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
> qual interpreter gave a big step up, with latencies dropping by
> ~30-40% compared to batching=off.
> 
> * Five aggregates, no WHERE: batching input from the child scan cut
> ~15% off runtime. Adding batched transition evaluation increased
> improvements to ~30%.
> 
> * Five aggregates, with WHERE: modest gains from scan/input batching,
> but per-batch transition evaluation and batched quals brought ~20-30%
> improvement.
> 
> * Across all cases, executor overheads became visible only after IO
> was minimized. Once executor cost dominated, batching consistently
> reduced CPU time, with the largest benefits coming from avoiding
> per-row fmgr calls and evaluating quals across batches.
> 
> I would appreciate if others could try these patches with their own
> microbenchmarks or workloads and see if they can reproduce numbers
> similar to mine. Feedback on both the general direction and the
> details of the patches would be very helpful. In particular, patches
> 0001-0003, which add the basic batch APIs and integrate them into
> SeqScan, are intended to be the first candidates for review and
> eventual commit. Comments on the later, more experimental patches
> (aggregate input batching and expression evaluation (qual, aggregate
> transition) batching) are also welcome.
> 

I tried to replicate the results, but the numbers I see are not this
good. In fact, I see a fair number of regressions (and some are not
negligible).

I'm attaching the scripts I used to build the tables / run the test. I
used the same table structure, and tried to follow the same query
pattern with 1 or 5 aggregates (I used "avg"), and [0, 1, 5] WHERE
conditions (with 100% selectivity).

I measured master vs. 0001-0003 vs. 0001-0007 (with batching on/off).
And I did that on my (relatively) new ryzen machine, and old xeon. The
behavior is quite different for the two machines, but neither of them shows
such improvements. I used clang 19.0, and --with-llvm.

See the attached PDFs with a summary of the results, comparing the
results for master and the two batching branches.

The ryzen is much "smoother" - it shows almost no difference with
batching "off" (as expected). The "scan" branch (with 0001-0003) shows
an improvement of 5-10% - it's consistent, but much less than the 10-20%
you report. For the "agg" branch the benefits are much larger, but
there's also a significant regression for the largest table with 100M
rows (which is ~18GB on disk).

For xeon, the results are a bit more variable, but it affects runs both
with batching "on" and "off". The machine is just more noisy. There
seems to be a small benefit of "scan" batching (in most cases much less
than the 10-20%). The "agg" is a clear win, with up to 30-40% speedup,
and no regression similar to the ryzen.

Perhaps I did something wrong. It does not surprise me this is somewhat
CPU dependent. It's a bit sad the improvements are smaller for the newer
CPU, though.

I also tried running TPC-H. I don't have useful numbers yet, but I ran
into a segfault - see the attached backtrace. It only happens with the
batching, and only on Q22 for some reason. I initially thought it's a
bug in clang, because I saw it with clang-22 built from git, and not
with clang-14 or gcc. But since then I reproduced it with clang-19 (on
debian 13). Still could be a clang bug, of course. I've seen ~20 of
those segfaults so far, and the backtraces look exactly the same.


regards

-- 
Tomas Vondra

Program terminated with signal SIGSEGV, Segmentation fault.

warning: Section `.reg-xstate/1569550' in core file too small.
#0  VARATT_IS_EXTENDED (PTR=0x0) at ../../../../src/include/varatt.h:412
412		return !VARATT_IS_4B_U(PTR);
(gdb) bt
#0  VARATT_IS_EXTENDED (PTR=0x0) at ../../../../src/include/varatt.h:412
#1  pg_detoast_datum (datum=0x0) at fmgr.c:1798
#2  0x00005570aa359cf4 in DatumGetNumeric (X=0) at ../../../../src/include/utils/numeric.h:66
#3  numeric_avg_accum (fcinfo=0x5570b3cd0100) at numeric.c:5052
#4  0x00005570aa0cf318 in ExecAggPlainTransBatch (state=state@entry=0x5570b3cd0258, op=op@entry=0x5570b3cd0950, econtext=econtext@entry=0x5570b3cb4718) at execExprInterp.c:6171
#5  0x00005570aa0cb0aa in ExecInterpExpr (state=0x5570b3cd0258, econtext=0x5570b3cb4718, isnull=0x0) at execExprInterp.c:2338
#6  0x00005570aa0e73aa in ExecEvalExprNoReturn (state=0x5570b3cd0258, econtext=0x5570b3cb4718) at ../../../src/include/executor/executor.h:431
#7  ExecEvalExprNoReturnSwitchContext (state=0x5570b3cd0258, econtext=0x5570b3cb4718) at ../../../src/include/executor/executor.h:472
#8  advance_aggregates_batch (aggstate=0x5570b3cb4300, b=<optimized out>) at nodeAgg.c:834
#9  agg_retrieve_direct_batch (aggstate=0x5570b3cb4300) at nodeAgg.c:2696
#10 0x00005570aa0e6864 in ExecAgg (pstate=0x5570b3cb4300) at nodeAgg.c:2289
#11 0x00005570aa0ee7e2 in ExecProcNode (node=0x5570b3cb4300) at ../../../src/include/executor/executor.h:317
#12 gather_getnext (gatherstate=0x5570b3cb3ff0) at nodeGather.c:294
#13 ExecGather (pstate=0x5570b3cb3ff0) at nodeGather.c:229
#14 0x00005570aa0e9037 in ExecProcNode (node=0x5570b3cb3ff0) at ../../../src/include/executor/executor.h:317
#15 fetch_input_tuple (aggstate=aggstate@entry=0x5570b3cb3878) at nodeAgg.c:562
#16 0x00005570aa0e7c08 in agg_retrieve_direct (aggstate=0x5570b3cb3878) at nodeAgg.c:2477
#17 0x00005570aa0e6864 in ExecAgg (pstate=0x5570b3cb3878) at nodeAgg.c:2289
#18 0x00005570aa108565 in ExecProcNode (node=0x5570b3cb3878) at ../../../src/include/executor/executor.h:317
#19 ExecSetParamPlan (node=0x5570b3cf4778, econtext=econtext@entry=0x5570b3cf4cd0) at nodeSubplan.c:1116
#20 0x00005570aa108a3b in ExecSetParamPlanMulti (params=params@entry=0x7ff33d3e2a08, econtext=0x5570b3cf4cd0) at nodeSubplan.c:1263
#21 0x00005570aa0d523f in ExecInitParallelPlan (planstate=0x5570b3cd2f48, estate=estate@entry=0x5570b3cb3588, sendParams=0x7ff33d3e2a08, nworkers=4, tuples_needed=-1)
    at execParallel.c:636
#22 0x00005570aa0eece2 in ExecGatherMerge (pstate=0x5570b3cd2c38) at nodeGatherMerge.c:210
#23 0x00005570aa104056 in ExecProcNode (node=0x5570b3cd2c38) at ../../../src/include/executor/executor.h:317
#24 ExecNestLoop (pstate=0x5570b3cd2a28) at nodeNestloop.c:108
#25 0x00005570aa0e9037 in ExecProcNode (node=0x5570b3cd2a28) at ../../../src/include/executor/executor.h:317
#26 fetch_input_tuple (aggstate=aggstate@entry=0x5570b3cd22f8) at nodeAgg.c:562
#27 0x00005570aa0e7c08 in agg_retrieve_direct (aggstate=aggstate@entry=0x5570b3cd22f8) at nodeAgg.c:2477
#28 0x00005570aa0e694d in ExecAgg (pstate=0x5570b3cd22f8) at nodeAgg.c:2292
#29 0x00005570aa0f95e0 in ExecProcNode (node=0x5570b3cd22f8) at ../../../src/include/executor/executor.h:317
#30 ExecLimit (pstate=0x5570b3cd1fe8) at nodeLimit.c:95
#31 0x00005570aa0d26ed in ExecProcNode (node=0x5570b3cd1fe8) at ../../../src/include/executor/executor.h:317
#32 ExecutePlan (queryDesc=0x5570b3cba668, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=<optimized out>, dest=0x7ff33d3e42a0) at execMain.c:1697
#33 standard_ExecutorRun (queryDesc=0x5570b3cba668, direction=<optimized out>, count=0) at execMain.c:366
#34 0x00005570aa2a3ccb in PortalRunSelect (portal=portal@entry=0x5570b3c1cda8, forward=<optimized out>, count=0, count@entry=9223372036854775807, dest=dest@entry=0x7ff33d3e42a0)
    at pquery.c:921
#35 0x00005570aa2a392d in PortalRun (portal=portal@entry=0x5570b3c1cda8, count=count@entry=9223372036854775807, isTopLevel=true, dest=dest@entry=0x7ff33d3e42a0, 
    altdest=altdest@entry=0x7ff33d3e42a0, qc=qc@entry=0x7ffd38218570) at pquery.c:765
#36 0x00005570aa2a2b26 in exec_simple_query (
    query_string=query_string@entry=0x5570b3b9b0d8 "select\r\n\tcntrycode,\r\n\tcount(*) as numcust,\r\n\tsum(c_acctbal) as totacctbal\r\nfrom\r\n\t(\r\n\t\tselect\r\n\t\t\tsubstring(c_phone from 1 for 2) as cntrycode,\r\n\t\t\tc_acctbal\r\n\t\tfrom\r\n\t\t\tcustomer\r\n\t\twhere\r\n\t\t\tsubstrin"...) at postgres.c:1278
#37 0x00005570aa2a04cd in PostgresMain (dbname=<optimized out>, username=<optimized out>) at postgres.c:4770
--Type <RET> for more, q to quit, c to continue without paging--
#38 0x00005570aa29b81b in BackendMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at backend_startup.c:124
#39 0x00005570aa1f85a1 in postmaster_child_launch (child_type=<optimized out>, child_slot=1, startup_data=startup_data@entry=0x7ffd38218988, 
    startup_data_len=startup_data_len@entry=24, client_sock=client_sock@entry=0x7ffd382188f8) at launch_backend.c:268
#40 0x00005570aa1fcb0c in BackendStartup (client_sock=0x7ffd382188f8) at postmaster.c:3590
#41 ServerLoop () at postmaster.c:1705
#42 0x00005570aa1fa70d in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x5570b3b95920) at postmaster.c:1403
#43 0x00005570aa1286aa in main (argc=3, argv=0x5570b3b95920) at main.c:231
(gdb) 


warning: Section `.reg-xstate/1569551' in core file too small.
#0  VARATT_IS_EXTENDED (PTR=0x0) at ../../../../src/include/varatt.h:412
412		return !VARATT_IS_4B_U(PTR);
(gdb) bt
#0  VARATT_IS_EXTENDED (PTR=0x0) at ../../../../src/include/varatt.h:412
#1  pg_detoast_datum (datum=0x0) at fmgr.c:1798
#2  0x00005570aa359cf4 in DatumGetNumeric (X=0) at ../../../../src/include/utils/numeric.h:66
#3  numeric_avg_accum (fcinfo=0x5570b3c9f718) at numeric.c:5052
#4  0x00005570aa0cf318 in ExecAggPlainTransBatch (state=state@entry=0x5570b3c9a440, op=op@entry=0x5570b3c9f978, econtext=econtext@entry=0x5570b3c6b0a0) at execExprInterp.c:6171
#5  0x00005570aa0cb0aa in ExecInterpExpr (state=0x5570b3c9a440, econtext=0x5570b3c6b0a0, isnull=0x0) at execExprInterp.c:2338
#6  0x00005570aa0e73aa in ExecEvalExprNoReturn (state=0x5570b3c9a440, econtext=0x5570b3c6b0a0) at ../../../src/include/executor/executor.h:431
#7  ExecEvalExprNoReturnSwitchContext (state=0x5570b3c9a440, econtext=0x5570b3c6b0a0) at ../../../src/include/executor/executor.h:472
#8  advance_aggregates_batch (aggstate=0x5570b3c6b330, b=<optimized out>) at nodeAgg.c:834
#9  agg_retrieve_direct_batch (aggstate=0x5570b3c6b330) at nodeAgg.c:2696
#10 0x00005570aa0e6864 in ExecAgg (pstate=0x5570b3c6b330) at nodeAgg.c:2289
#11 0x00005570aa0d26ed in ExecProcNode (node=0x5570b3c6b330) at ../../../src/include/executor/executor.h:317
#12 ExecutePlan (queryDesc=0x5570b3c679d8, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=<optimized out>, dest=0x5570b3c26ee8) at execMain.c:1697
#13 standard_ExecutorRun (queryDesc=0x5570b3c679d8, direction=<optimized out>, count=0) at execMain.c:366
#14 0x00005570aa0d6857 in ParallelQueryMain (seg=seg@entry=0x5570b3bcfc30, toc=toc@entry=0x7ff33de00000) at execParallel.c:1499
#15 0x00005570a9f7fce3 in ParallelWorkerMain (main_arg=<optimized out>) at parallel.c:1563
#16 0x00005570aa1f5d8e in BackgroundWorkerMain (startup_data=<optimized out>, startup_data_len=<optimized out>) at bgworker.c:843
#17 0x00005570aa1f85a1 in postmaster_child_launch (child_type=child_type@entry=B_BG_WORKER, child_slot=239, startup_data=startup_data@entry=0x5570b3bd5d30, 
    startup_data_len=startup_data_len@entry=1472, client_sock=client_sock@entry=0x0) at launch_backend.c:268
#18 0x00005570aa1fb2e3 in StartBackgroundWorker (rw=0x5570b3bd5d30) at postmaster.c:4160
#19 maybe_start_bgworkers () at postmaster.c:4326
#20 0x00005570aa1fce85 in LaunchMissingBackgroundProcesses () at postmaster.c:3400
#21 ServerLoop () at postmaster.c:1720
#22 0x00005570aa1fa70d in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x5570b3b95920) at postmaster.c:1403
#23 0x00005570aa1286aa in main (argc=3, argv=0x5570b3b95920) at main.c:231


Attachments:

  [application/x-shellscript] run-test.sh (1.8K, 2-run-test.sh)
  download

  [application/x-shellscript] create-tables.sh (709B, 3-create-tables.sh)
  download

  [application/pdf] batching-xeon.pdf (39.1K, 4-batching-xeon.pdf)
  download

  [application/pdf] batching-ryzen.pdf (52.0K, 5-batching-ryzen.pdf)
  download

  [text/plain] batching-backtrace.txt (7.9K, 6-batching-backtrace.txt)
  download

^ permalink  raw  reply  [nested|flat] 22+ messages in thread

* Re: Batching in executor
@ 2025-09-30 02:11  Amit Langote <[email protected]>
  parent: Tomas Vondra <[email protected]>
  3 siblings, 1 reply; 22+ messages in thread

From: Amit Langote @ 2025-09-30 02:11 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers

Hi Tomas,

Thanks a lot for your comments and benchmarking.

I plan to reply to your detailed comments and benchmark results, but I
just realized I had forgotten to attach patch 0008 (oops!) in my last
email. That patch adds batched qual evaluation.

I also noticed that the batched path was unnecessarily doing early
“batch-materialization” in cases like SELECT count(*) FROM bar. I’ve
fixed that as well. It was originally designed to avoid such
materialization, but I must have broken it while refactoring.


Attachments:

  [application/octet-stream] v2-0008-WIP-Add-ExecQualBatch-and-EEOPs-for-batched-quals.patch (22.8K, 2-v2-0008-WIP-Add-ExecQualBatch-and-EEOPs-for-batched-quals.patch)
  download | inline diff:
From 0ac98eedfef945403822d23e3efc9f7248602895 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 22 Sep 2025 16:19:26 +0900
Subject: [PATCH v2 8/8] WIP: Add ExecQualBatch() and EEOPs for batched quals

Introduce ExecInitQualBatch()/ExecQualBatch() to evaluate scan quals
over a TupleBatch. The batched qual interpreter produces a boolean
mask aligned with the batch, marking which rows satisfy the qual.
The scan node later uses this mask to copy only passing rows into
its output slots. If batching is not possible, fall back to the
existing per-tuple engine.

Add EEOP_QUAL_BATCH_INITMASK and EEOP_QUAL_BATCH_TERM, and wire them
after EEOP_SCAN_FETCHSOME_BATCH and EEOP_BUILD_SCAN_BATCH_VECTOR.
Batching is limited to quals that are a top-level AND of simple
clauses: either NullTest(var) or strict binary OpExpr with var/const
or var/var arguments. A walker validates the tree, collects the
referenced attnos, and builds a BatchVector; terms are compiled from
the leaves and evaluated to update the mask.

ExprState gains batch_private to hold BatchQualRuntime (mask, words)
which are used by the parent node to populate output slots in
TupleBatch.
---
 src/backend/executor/execExpr.c       | 324 ++++++++++++++++++++++++++
 src/backend/executor/execExprInterp.c | 202 ++++++++++++++++
 src/backend/executor/nodeSeqscan.c    |   2 +
 src/backend/jit/llvm/llvmjit_expr.c   |  11 +
 src/backend/jit/llvm/llvmjit_types.c  |   2 +
 src/include/executor/execExpr.h       |  60 +++++
 src/include/executor/execScan.h       |  35 +--
 src/include/executor/executor.h       |   3 +
 src/include/nodes/execnodes.h         |   4 +
 9 files changed, 630 insertions(+), 13 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 27a5780f557..63df560d5f1 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -111,6 +111,19 @@ static BatchVector *BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var);
 static bool ExprListAllSimpleVars(const List *args, Bitmapset **allattnos);
 static BatchVectorSlice *BatchVectorSliceFromExprArgs(const List *args,
 													  const BatchVector *bv);
+static int16 BatchVectorFindAttColno(const BatchVector *bv, AttrNumber attno);
+static int16 BatchVectorOffsetForVarExpr(Expr *expr, const BatchVector *bv);
+
+/* private context for the walker */
+typedef struct QualBatchContext
+{
+	List      *leaves;      /* List<Node*> of accepted leaves */
+	Bitmapset *attnos;      /* Vars referenced by accepted leaves */
+	bool		ok;			/* stays true if batchable */
+	AttrNumber	last_scan;	/* last needed attribute in scan slot */
+} QualBatchContext;
+
+static bool qual_batchable_walker(Node *node, void *context);
 
 /*
  * ExecInitExpr: prepare an expression tree for execution
@@ -5221,6 +5234,209 @@ ExprListAllSimpleVars(const List *args, Bitmapset **allattnos)
 	return true;
 }
 
+/* helper: extract Var (allowing RelabelType->Var); returns NULL if not */
+static Var *
+strip_to_var(Node *n)
+{
+	if (n == NULL)
+		return NULL;
+	if (IsA(n, RelabelType))
+		n = (Node *) ((RelabelType *) n)->arg;
+	if (!IsA(n, Var))
+		return NULL;
+	if (((Var *) n)->varattno < 0)
+		return NULL;
+	return (Var *) n;
+}
+
+/* main walker; return true to abort traversal early, false to continue */
+static bool
+qual_batchable_walker(Node *node, void *context)
+{
+	QualBatchContext *cxt = (QualBatchContext *) context;
+
+	if (node == NULL || !cxt->ok)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_List:
+			return expression_tree_walker(node, qual_batchable_walker, cxt);
+
+		case T_BoolExpr:
+		{
+			BoolExpr *b = (BoolExpr *) node;
+
+			/* Only AND trees are allowed */
+			if (b->boolop != AND_EXPR)
+			{
+				cxt->ok = false;
+				return true; /* abort */
+			}
+			/* Recurse normally over children */
+			return expression_tree_walker(node, qual_batchable_walker, cxt);
+		}
+
+		case T_NullTest:
+		{
+			NullTest *nt = (NullTest *) node;
+			Var		 *v  = strip_to_var((Node *) nt->arg);
+
+			if (v == NULL)
+			{
+				cxt->ok = false;
+				return true;
+			}
+
+			cxt->attnos = bms_add_member(cxt->attnos, v->varattno);
+			if (v->varattno > cxt->last_scan)
+				cxt->last_scan = v->varattno;
+			cxt->leaves = lappend(cxt->leaves, node);
+
+			/* Do NOT recurse into leaf */
+			return false;
+		}
+
+		case T_OpExpr:
+		{
+			OpExpr *op = (OpExpr *) node;
+			List   *args = op->args;
+			Node   *l, *r;
+			Var    *lv,
+				   *rv = NULL;
+
+			/* binary only */
+			if (list_length(args) != 2)
+			{
+				cxt->ok = false;
+				return true;
+			}
+			/* strict operator only (NULL -> false semantics) */
+			if (!func_strict(op->opfuncid))
+			{
+				cxt->ok = false;
+				return true;
+			}
+
+			l = linitial(args);
+			r = lsecond(args);
+			lv = strip_to_var(l);
+			if (lv == NULL)
+			{
+				cxt->ok = false;
+				return true;
+			}
+			cxt->attnos = bms_add_member(cxt->attnos, lv->varattno);
+			if (lv->varattno > cxt->last_scan)
+				cxt->last_scan = lv->varattno;
+
+			if (IsA(r, Const))
+			{
+				/* ok; no attno to add */
+			}
+			else
+			{
+				rv = strip_to_var(r);
+				if (rv == NULL)
+				{
+					cxt->ok = false;
+					return true;
+				}
+				cxt->attnos = bms_add_member(cxt->attnos, rv->varattno);
+				if (rv->varattno > cxt->last_scan)
+					cxt->last_scan = rv->varattno;
+			}
+
+			cxt->leaves = lappend(cxt->leaves, node);
+
+			/* Leaf handled; do NOT recurse into args */
+			return false;
+		}
+
+		/* Whitelist ends here; anything else in the tree rejects */
+		default:
+			cxt->ok = false;
+			break;
+	}
+
+	return true;
+}
+
+/* build a BatchQualTerm from a validated leaf */
+static BatchQualTerm *
+build_term_from_leaf(Node *n, BatchVector *bv)
+{
+	BatchQualTerm *term;
+	BatchQualTermKind kind;
+	bool		strict;
+	int16		l_off;
+	int16		r_off;
+	Datum		r_const = (Datum) 0;
+	bool		r_isnull = false;
+	FmgrInfo   *finfo = NULL;
+	Oid			collation;
+
+	if (IsA(n, NullTest))
+	{
+		NullTest *nt = (NullTest *) n;
+
+		kind = nt->nulltesttype == IS_NULL ? BQTK_IS_NULL : BQTK_IS_NOT_NULL;
+		l_off = BatchVectorOffsetForVarExpr(nt->arg, bv);
+		r_off = -1;
+		strict = false;
+		collation = InvalidOid;
+
+		if (l_off < 0)
+			return NULL;
+	}
+	else if (IsA(n, OpExpr))
+	{
+		OpExpr *op = (OpExpr *) n;
+		Expr   *l  = linitial(op->args);
+		Expr   *r  = lsecond(op->args);
+
+		l_off = BatchVectorOffsetForVarExpr(l, bv);
+		if (l_off < 0)
+			return NULL;
+
+		r_off = BatchVectorOffsetForVarExpr(r, bv);
+		if (IsA(r, Const))
+		{
+			Const *c = (Const *) r;
+
+			kind = BQTK_VAR_CONST;
+			r_const = c->constvalue;
+			r_isnull = c->constisnull;
+			r_off = -1;
+		}
+		else
+		{
+			if (r_off < 0)
+				return NULL;
+			kind = BQTK_VAR_VAR;
+		}
+
+		strict = func_strict(op->opfuncid);
+		collation = exprInputCollation((Node *) op);
+		finfo = palloc(sizeof(FmgrInfo));
+		fmgr_info(op->opfuncid, finfo);
+	}
+	else
+		return NULL;
+
+	term = palloc(sizeof(BatchQualTerm));
+	term->kind = kind;
+	term->strict = strict;
+	term->l_off = l_off;
+	term->r_off = r_off;
+	term->r_const = r_const;
+	term->r_isnull = r_isnull;
+	term->finfo = finfo;
+	term->collation = collation;
+
+	return term;
+}
+
 /* ---------- BatchVector stuff ------------- */
 
 static BatchVector *
@@ -5298,3 +5514,111 @@ BatchVectorSliceFromExprArgs(const List *args, const BatchVector *bv)
 
 	return bvs;
 }
+
+/*
+ * BatchVectorOffsetForVarExpr
+ *   Map a Var (or RelabelType->Var) to its BatchVector column index.
+ *   Returns -1 if the Var’s attno is not present.
+ */
+static int16
+BatchVectorOffsetForVarExpr(Expr *expr, const BatchVector *bv)
+{
+	AttrNumber attno;
+
+	if (!expr_is_simple_var(expr, &attno))
+		return -1;
+
+	return (int16) BatchVectorFindAttColno(bv, attno);
+}
+
+/*
+ * ExecInitQualBatch
+ *	Build a batched-qual EEOP program (AND-only).
+ *	Caller should also keep scalar ps->qual for runtime fallback.
+ */
+ExprState *
+ExecInitQualBatch(PlanState *ps)
+{
+	Node	   *qual = (Node *) ps->plan->qual;
+	QualBatchContext cxt = {NIL, NULL, true, 0};
+	BatchQualRuntime *rt;
+	ExprState  *state;
+	BatchVector *bv;
+	uint64	   *mask;
+	int			mask_words;
+	ListCell   *lc;
+	ExprEvalStep scratch = {0};
+
+	if (qual == NULL)
+		return NULL;
+
+	/* validate + collect leaves/attnos with walker */
+	(void) qual_batchable_walker(qual, &cxt);
+	if (!cxt.ok || cxt.leaves == NIL || bms_is_empty(cxt.attnos))
+		return NULL;
+
+	bv = BatchVectorCreate(cxt.attnos, cxt.last_scan);
+
+	mask_words = (bv->maxrows + 63) >> 6;
+	mask = (uint64 *) palloc0(sizeof(uint64) * mask_words);
+
+	/* Runtime carrier (lifetime == exprstate) */
+	rt = palloc0(sizeof(BatchQualRuntime));
+	rt->mask = mask;
+	rt->mask_words = mask_words;
+
+	/* dedicated ExprState for batched program */
+
+	state = makeNode(ExprState);
+	state->expr = (Expr *) qual;
+	state->parent = ps;
+	state->ext_params = NULL;
+
+	/* mark expression as to be used with ExecQual() */
+	state->flags = EEO_FLAG_IS_QUAL;
+
+	/* Only valid as batch qual if this is set. */
+	state->batch_private = (void *) rt;
+
+	scratch.opcode = EEOP_SCAN_FETCHSOME_BATCH;
+	scratch.d.fetch_batch.last_var = cxt.last_scan;
+	ExprEvalPushStep(state, &scratch);
+
+	scratch.opcode = EEOP_BUILD_SCAN_BATCH_VECTOR;
+	scratch.d.batch_vector.bv = bv;
+	ExprEvalPushStep(state, &scratch);
+
+	scratch.opcode = EEOP_QUAL_BATCH_INITMASK;
+	scratch.d.qualbatch_init.bv = bv;
+	scratch.d.qualbatch_init.mask = mask;
+	scratch.d.qualbatch_init.mask_words = mask_words;
+	ExprEvalPushStep(state, &scratch);
+
+	/* TERM per leaf */
+	foreach(lc, cxt.leaves)
+	{
+		BatchQualTerm *term = build_term_from_leaf((Node *) lfirst(lc), bv);
+
+		if (term == NULL)
+			return NULL;
+
+		scratch.opcode = EEOP_QUAL_BATCH_TERM;
+		scratch.d.qualbatch_term.bv = bv;
+		scratch.d.qualbatch_term.mask = mask;
+		scratch.d.qualbatch_term.mask_words = mask_words;
+		scratch.d.qualbatch_term.term = term;		/* pointer to compiled leaf */
+		ExprEvalPushStep(state, &scratch);
+	}
+
+	/*
+	 * Nothing more to do at the end.  Each TERM step has already cleared the
+	 * mask bits of the rows it rejected, and ExecQualBatch() reads the
+	 * surviving bits from the mask once the program has run.
+	 */
+	scratch.opcode = EEOP_DONE_NO_RETURN;
+	ExprEvalPushStep(state, &scratch);
+
+	ExecReadyExpr(state);
+
+	return state;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 41ad9b4838d..5c2baa0e19d 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -608,6 +608,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
 		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
 		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,
+		&&CASE_EEOP_QUAL_BATCH_INITMASK,
+		&&CASE_EEOP_QUAL_BATCH_TERM,
 		&&CASE_EEOP_LAST
 	};
 
@@ -2350,7 +2352,19 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		{
 			/* too complex for an inline implementation */
 			ExecAggPlainTransBatch(state, op, econtext);
+			EEO_NEXT();
+		}
 
+
+		EEO_CASE(EEOP_QUAL_BATCH_INITMASK)
+		{
+			ExecQualBatchInitMask(state, op, econtext);
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_QUAL_BATCH_TERM)
+		{
+			ExecQualBatchTerm(state, op, econtext);
 			EEO_NEXT();
 		}
 
@@ -6185,3 +6199,191 @@ ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext
 			elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
 	}
 }
+
+/* set mask bits [0..nvalid_bits) to 1; clear every bit from nvalid_bits up */
+static inline void
+mask_init_all_ones(uint64 *a, int nwords, int nvalid_bits)
+{
+	int			nfull = nvalid_bits >> 6;	/* fully valid words */
+	int			rem = nvalid_bits & 63;		/* valid bits in partial word */
+
+	Assert(nvalid_bits >= 0 && nvalid_bits <= nwords * 64);
+	for (int i = 0; i < nwords; i++)
+		a[i] = (i < nfull) ? ~UINT64CONST(0) : 0;
+
+	if (rem != 0)
+		a[nfull] = (~UINT64CONST(0)) >> (64 - rem);
+}
+
+static inline void
+mask_clear_bit(uint64 *a, int i)
+{
+	a[i >> 6] &= ~(UINT64CONST(1) << (i & 63));
+}
+
+void
+ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	BatchVector *bv = op->d.qualbatch_init.bv;
+	uint64      *mask = op->d.qualbatch_init.mask;
+	int          nwords = op->d.qualbatch_init.mask_words;
+	int          n = bv->nrows;
+
+	/* initialize to all-pass for current batch size */
+	mask_init_all_ones(mask, nwords, n);
+}
+
+void
+ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	BatchVector    *bv   = op->d.qualbatch_term.bv;
+	uint64         *mask = op->d.qualbatch_term.mask;
+	BatchQualTerm  *t    = op->d.qualbatch_term.term;
+	int             n    = bv->nrows;
+
+	switch (t->kind)
+	{
+		case BQTK_IS_NULL:
+		{
+			/* keep bit set only if value IS NULL; clear otherwise */
+			for (int i = 0; i < n; i++)
+			{
+				if (!bv->nulls[t->l_off][i])
+					mask_clear_bit(mask, i);
+			}
+			break;
+		}
+
+		case BQTK_IS_NOT_NULL:
+		{
+			/* keep bit set only if value IS NOT NULL; clear if NULL */
+			for (int i = 0; i < n; i++)
+			{
+				if (bv->nulls[t->l_off][i])
+					mask_clear_bit(mask, i);
+			}
+			break;
+		}
+
+		case BQTK_VAR_CONST:
+		{
+			const bool  r_isnull = t->r_isnull;
+			const Datum r_const  = t->r_const;
+			const bool  strict   = t->strict;
+			const Oid   coll     = t->collation;
+			FmgrInfo   *finfo    = t->finfo;
+			int         loff     = t->l_off;
+
+			for (int i = 0; i < n; i++)
+			{
+				bool ln = bv->nulls[loff][i];
+				bool pass;
+
+				/* WHERE treats NULL as false; strict ops short-circuit */
+				if (strict && (ln || r_isnull))
+					pass = false;
+				else
+				{
+					Datum lv = bv->cols[loff][i];
+
+					/* fast-paths could go here based on t->fastclass */
+
+					pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, r_const));
+				}
+
+				if (!pass)
+					mask_clear_bit(mask, i);
+			}
+			break;
+		}
+
+		case BQTK_VAR_VAR:
+		{
+			const bool  strict = t->strict;
+			const Oid   coll   = t->collation;
+			FmgrInfo   *finfo  = t->finfo;
+			int         loff   = t->l_off;
+			int         roff   = t->r_off;
+
+			for (int i = 0; i < n; i++)
+			{
+				bool  ln = bv->nulls[loff][i];
+				bool  rn = bv->nulls[roff][i];
+				bool  pass;
+
+				if (strict && (ln || rn))
+					pass = false;
+				else
+				{
+					Datum lv = bv->cols[loff][i];
+					Datum rv = bv->cols[roff][i];
+
+					/* fast-paths could go here based on t->fastclass */
+
+					pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, rv));
+				}
+
+				if (!pass)
+					mask_clear_bit(mask, i);
+			}
+			break;
+		}
+
+		default:
+			/* should not happen; leave mask unchanged */
+			break;
+	}
+}
+
+static inline bool
+mask_is_empty(const uint64 *mask, int nwords)
+{
+	for (int i = 0; i < nwords; i++)
+	{
+		if (mask[i] != 0)
+			return false;
+	}
+	return true;
+}
+
+/*
+ * ExecQualBatch
+ *		Evaluate a batched qual program built by ExecInitQualBatch().
+ *
+ * Stores surviving rows in the batch's out slots and returns their count.
+ */
+int
+ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b)
+{
+	int		i;
+	uint64 *mask;
+	int		kept = 0;
+	BatchQualRuntime *rt = ExecGetBatchQualRuntime(state);
+
+	/* verify that expression was compiled using ExecInitQualBatch */
+	Assert(state->flags & EEO_FLAG_IS_QUAL);
+	Assert(rt && rt->mask && rt->mask_words);
+
+	/* run the batched EEOP program once */
+	econtext->scan_batch = b;
+	ExecEvalExprNoReturn(state, econtext);
+
+	mask = rt->mask;
+	if (mask_is_empty(mask, rt->mask_words))
+		return 0;
+
+	/* Add survivors into outslots */
+	TupleBatchRewind(b);
+	i = 0;
+	while (TupleBatchHasMore(b))
+	{
+		TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+		/* mask bit set => row survives */
+		if (mask[i >> 6] & (UINT64CONST(1) << (i & 63)))
+			TupleBatchStoreInOut(b, kept++, slot);
+		i++;
+	}
+
+	return kept;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index a4cf1e51af0..e5ca619731f 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -401,6 +401,8 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
 		}
 	}
+
+	scanstate->ss.ps.qual_batch = ExecInitQualBatch((PlanState *) scanstate);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 45346124bd7..b97d5faebde 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3033,6 +3033,17 @@ llvm_compile_expr(ExprState *state)
 				LLVMBuildBr(b, opblocks[opno + 1]);
 				break;
 
+			case EEOP_QUAL_BATCH_INITMASK:
+				build_EvalXFunc(b, mod, "ExecQualBatchInitMask",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+			case EEOP_QUAL_BATCH_TERM:
+				build_EvalXFunc(b, mod, "ExecQualBatchTerm",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
 			case EEOP_LAST:
 				Assert(false);
 				break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 1b5e06f60cc..f4f756e7cb5 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -187,4 +187,6 @@ void	   *referenced_functions[] =
 	ExecBuildOuterBatchVector,
 	ExecBuildScanBatchVector,
 	ExecAggPlainTransBatch,
+	ExecQualBatchInitMask,
+	ExecQualBatchTerm,
 };
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index f24782ecf58..f50936acaaa 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -306,6 +306,10 @@ typedef enum ExprEvalOp
 	EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,	/* per-row fmgr calls */
 	EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,	/* call transfn once with AggBulkArgs */
 
+	/* Batched qual evaluation */
+	EEOP_QUAL_BATCH_INITMASK,
+	EEOP_QUAL_BATCH_TERM,
+
 	/* non-existent operation, used e.g. to check array lengths */
 	EEOP_LAST
 } ExprEvalOp;
@@ -796,6 +800,21 @@ typedef struct ExprEvalStep
 		{
 			struct BatchVector *bv;
 		}			batch_vector;
+
+		struct
+		{
+			struct BatchVector *bv; /* filled by EEOP_BUILD_SCAN_BATCH_VECTOR */
+			uint64			   *mask;        /* shared mask buffer for this program */
+			int					mask_words;  /* ceil(bv->maxrows / 64) */
+		}			qualbatch_init;                    /* EEOP_QUAL_BATCH_INITMASK */
+
+		struct
+		{
+			struct BatchVector *bv; /* same bv as init */
+			uint64			   *mask;        /* same mask buffer */
+			int					mask_words;  /* same word count */
+			struct BatchQualTerm *term;      /* compiled leaf */
+		}			qualbatch_term;                    /* EEOP_QUAL_BATCH_TERM */
 	}			d;
 } ExprEvalStep;
 
@@ -975,4 +994,45 @@ extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprCo
 extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 
 extern void ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
+/* See ExecQualBatchTerm(). */
+typedef enum BatchQualTermKind
+{
+	BQTK_VAR_CONST,
+	BQTK_VAR_VAR,
+	BQTK_IS_NULL,
+	BQTK_IS_NOT_NULL,
+} BatchQualTermKind;
+
+typedef struct BatchQualTerm
+{
+	BatchQualTermKind kind;
+	bool		strict;		/* follow strict NULL semantics if true */
+	int16		l_off;		/* left VAR column (index into BatchVector) */
+	int16		r_off;		/* right VAR column, or -1 if Const */
+	Datum		r_const;	/* for VAR_CONST */
+	bool		r_isnull;	/* for VAR_CONST */
+	FmgrInfo   *finfo;		/* fmgr for generic binary ops */
+	Oid			collation;	/* op collation */
+} BatchQualTerm;
+
+/*
+ * Runtime view for batched qual programs.
+ * Owned by the ExprState; lifetime == ExprState.
+ */
+typedef struct BatchQualRuntime
+{
+	uint64 *mask;
+	int		mask_words;
+} BatchQualRuntime;
+
+static inline BatchQualRuntime *
+ExecGetBatchQualRuntime(ExprState *batch_qual)
+{
+	return (BatchQualRuntime *) batch_qual->batch_private;
+}
+
+extern void ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index fb4b57a831c..568a7a33b7d 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -304,7 +304,8 @@ ExecScanExtendedBatch(ScanState *node,
 {
 	ExprContext *econtext = node->ps.ps_ExprContext;
 	TupleBatch *b = node->ps.ps_Batch;
-	int			qualified;
+	ExprState  *qual_batch = node->ps.qual_batch;
+	int			qualified = 0;
 
 	/* Batch path does not support EPQ */
 	Assert(node->ps.state->es_epq_active == NULL);
@@ -320,23 +321,31 @@ ExecScanExtendedBatch(ScanState *node,
 
 		if (qual != NULL)
 		{
-			qualified = 0;
-			while (TupleBatchHasMore(b))
+			ResetExprContext(econtext);
+			if (qual_batch)
 			{
-				TupleTableSlot *in = TupleBatchGetNextSlot(b);
-
-				Assert(in);
-				ResetExprContext(econtext);
-				econtext->ecxt_scantuple = in;
+				qualified = ExecQualBatch(qual_batch, econtext, b);
+			}
+			else
+			{
+				int		i = 0;
 
-				if (ExecQual(qual, econtext))
+				while (TupleBatchHasMore(b))
 				{
-					TupleBatchStoreInOut(b, qualified, in);
-					qualified++;
+					TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+					Assert(slot);
+					econtext->ecxt_scantuple = slot;
+					if (ExecQual(qual, econtext))
+					{
+						TupleBatchStoreInOut(b, qualified, slot);
+						qualified++;
+					}
+					i++;
 				}
-				else
-					InstrCountFiltered1(node, 1);
 			}
+			InstrCountFiltered1(node, b->nvalid - qualified);
+			/* Update count and start using b->outslots. */
 			TupleBatchUseOutput(b, qualified);
 		}
 		else
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c72bd755b79..dd0f2c74ae5 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -333,6 +333,7 @@ ExecProcNodeBatch(PlanState *node)
 extern ExprState *ExecInitExpr(Expr *node, PlanState *parent);
 extern ExprState *ExecInitExprWithParams(Expr *node, ParamListInfo ext_params);
 extern ExprState *ExecInitQual(List *qual, PlanState *parent);
+extern ExprState *ExecInitQualBatch(PlanState *ps);
 extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
 extern List *ExecInitExprList(List *nodes, PlanState *parent);
 extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
@@ -581,6 +582,8 @@ AggGetBulkArgs(FunctionCallInfo fcinfo)
 }
 #endif
 
+extern int ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b);
+
 extern bool ExecCheck(ExprState *state, ExprContext *econtext);
 
 /*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index fdfe8b4ddaf..78c5abbb23a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -146,6 +146,9 @@ typedef struct ExprState
 	 * ExecInitExprRec().
 	 */
 	ErrorSaveContext *escontext;
+
+	/* batched-program runtime (e.g., BatchQualRuntime) */
+	void	 *batch_private;
 } ExprState;
 
 
@@ -1196,6 +1199,7 @@ typedef struct PlanState
 	 * subPlan list, which does not exist in the plan tree).
 	 */
 	ExprState  *qual;			/* boolean qual condition */
+	ExprState  *qual_batch;		/* boolean qual condition evaluated on batches */
 	PlanState  *lefttree;		/* input plan tree(s) */
 	PlanState  *righttree;
 
-- 
2.43.0
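Editorial note on the patch above: the batched qual keeps one bit per row in a
uint64 mask, starts with every valid row marked as passing, lets each AND term
clear the bits of rows it rejects, and finally compacts the surviving rows.
The following is a minimal standalone sketch of that idea, not the patch's
code. It assumes plain int64 columns with a separate null array, and the
helper names (mask_init, mask_clear, mask_test, term_gt_const, mask_collect)
are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Set bits [0, nvalid) and clear every bit from nvalid upward. */
static void
mask_init(uint64_t *mask, int nwords, int nvalid)
{
	assert(nvalid >= 0 && nvalid <= nwords * 64);
	for (int w = 0; w < nwords; w++)
	{
		int			lo = w * 64;	/* first row covered by word w */

		if (nvalid >= lo + 64)
			mask[w] = ~UINT64_C(0);
		else if (nvalid > lo)
			mask[w] = ~UINT64_C(0) >> (64 - (nvalid - lo));
		else
			mask[w] = 0;
	}
}

/* Clear bit i: row i failed some term of the AND list. */
static void
mask_clear(uint64_t *mask, int i)
{
	mask[i >> 6] &= ~(UINT64_C(1) << (i & 63));
}

/* Is row i still marked as passing? */
static int
mask_test(const uint64_t *mask, int i)
{
	return (int) ((mask[i >> 6] >> (i & 63)) & 1);
}

/* One AND term, "col > bound"; a NULL input counts as not passing. */
static void
term_gt_const(uint64_t *mask, const int64_t *col, const bool *nulls,
			  int n, int64_t bound)
{
	for (int i = 0; i < n; i++)
	{
		if (nulls[i] || !(col[i] > bound))
			mask_clear(mask, i);
	}
}

/* Write the indexes of surviving rows into out[]; return how many. */
static int
mask_collect(const uint64_t *mask, int n, int *out)
{
	int			kept = 0;

	for (int i = 0; i < n; i++)
	{
		if (mask_test(mask, i))
			out[kept++] = i;
	}
	return kept;
}
```

A caller would run mask_init() once per batch, one term function per qual
leaf, and mask_collect() at the end. Because terms only ever clear bits, the
order in which the AND terms run does not change the result.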



  [application/octet-stream] v2-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch (21.5K, 3-v2-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch)
From c0797084b54d1e5d9ffe1af49c76c9396126ea1c Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 2 Sep 2025 23:46:34 +0900
Subject: [PATCH v2 6/8] WIP: Add EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP

Introduce a batch EEOP that runs plain aggregate transitions by
looping over rows of a TupleBatch. This keeps transition logic in
the interpreter while amortizing per-row costs.

Gate the batch path with AggTransCanUseBatch(): it is used only for
plain, non-hashed, single-set aggregates with no DISTINCT/ORDER
BY/FILTER whose arguments are simple Vars.

Extend ExecBuildAggTrans() to prepare batch fetch/build steps and
to return whether a batch path is used.
---
 src/backend/executor/execExpr.c       | 228 ++++++++++++++++++++++++--
 src/backend/executor/execExprInterp.c | 103 ++++++++++++
 src/backend/executor/nodeAgg.c        |  17 +-
 src/backend/jit/llvm/llvmjit_expr.c   |   6 +
 src/backend/jit/llvm/llvmjit_types.c  |   1 +
 src/include/executor/execBatch.h      |   6 +
 src/include/executor/execExpr.h       |  14 ++
 src/include/executor/executor.h       |   3 +-
 src/include/executor/nodeAgg.h        |   2 +
 9 files changed, 363 insertions(+), 17 deletions(-)
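Editorial note: the commit message describes looping over the rows of a batch
and calling the transition function once per row. Below is a minimal
standalone sketch of such a row loop for a strict transition function whose
state starts out NULL. It is an illustration under simplifying assumptions,
not the patch's code: it uses a single int64 argument column, a max-style
transition, and no fmgr or by-reference handling. TransState, max_trans and
agg_trans_rowloop are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-group transition state. */
typedef struct TransState
{
	int64_t		value;
	bool		is_null;	/* true until a non-NULL input has been seen */
} TransState;

/* A strict transition function: never called with a NULL argument. */
static int64_t
max_trans(int64_t state, int64_t input)
{
	return (input > state) ? input : state;
}

/*
 * Advance the state over one batch of rows stored as a column vector.
 * NULL inputs are skipped, and the first non-NULL input seeds the state.
 */
static void
agg_trans_rowloop(TransState *st, const int64_t *col, const bool *nulls,
				  int nrows)
{
	assert(nrows >= 0);
	for (int i = 0; i < nrows; i++)
	{
		if (nulls[i])
			continue;		/* strict: NULL input leaves the state as is */

		if (st->is_null)
		{
			/* no transition value yet: take the input itself */
			st->value = col[i];
			st->is_null = false;
			continue;
		}

		st->value = max_trans(st->value, col[i]);
	}
}
```

The state is carried across calls, so feeding several batches in sequence
gives the same result as feeding all rows at once. A batch that contains only
NULLs leaves the state untouched.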

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index f1569879b52..af5ed8b6368 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -95,7 +95,9 @@ static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 								  ExprEvalStep *scratch,
 								  FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
 								  int transno, int setno, int setoff, bool ishash,
-								  bool nullcheck);
+								  bool nullcheck, bool batch,
+								  BatchVector *bv);
+
 static void ExecInitJsonExpr(JsonExpr *jsexpr, ExprState *state,
 							 Datum *resv, bool *resnull,
 							 ExprEvalStep *scratch);
@@ -104,6 +106,10 @@ static void ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
 								 bool exists_coerce,
 								 Datum *resv, bool *resnull);
 
+static BatchVector *BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var);
+static bool ExprListAllSimpleVars(const List *args, Bitmapset **allattnos);
+static BatchVectorSlice *BatchVectorSliceFromExprArgs(const List *args,
+													  const BatchVector *bv);
 
 /*
  * ExecInitExpr: prepare an expression tree for execution
@@ -3659,6 +3665,33 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
 	}
 }
 
+/* plain agg, single set, not hashed, no DISTINCT/ORDER/FILTER */
+static inline bool
+AggTransCanUseBatch(AggState *as, AggStatePerTrans pt)
+{
+	Agg *aggnode = (Agg *) as->ss.ps.plan;
+
+	if (!AggCanUsePlainBatch(as))
+		return false;
+	if (as->aggstrategy == AGG_HASHED)
+		return false;
+	if (aggnode->groupingSets != NIL)
+		return false;
+	if (as->phase == NULL || as->phase->numsets > 0)
+		return false;
+
+	/* per-aggregate complications */
+	if (pt->aggsortrequired)
+		return false;
+	if (pt->aggref &&
+		(pt->aggref->aggdistinct != NIL ||
+		 pt->aggref->aggorder != NIL ||
+		 pt->aggref->aggfilter != NULL))
+		return false;
+
+	return true;
+}
+
 /*
  * Build transition/combine function invocations for all aggregate transition
  * / combination function invocations in a grouping sets phase. This has to
@@ -3675,13 +3708,17 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
  */
 ExprState *
 ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
-				  bool doSort, bool doHash, bool nullcheck)
+				  bool doSort, bool doHash, bool nullcheck,
+				  bool *batch_trans)
 {
 	ExprState  *state = makeNode(ExprState);
 	PlanState  *parent = &aggstate->ss.ps;
 	ExprEvalStep scratch = {0};
 	bool		isCombine = DO_AGGSPLIT_COMBINE(aggstate->aggsplit);
 	ExprSetupInfo deform = {0, 0, 0, 0, 0, NIL};
+	bool		batch = AggCanUsePlainBatch(aggstate);
+	Bitmapset  *allattnos = NULL;
+	BatchVector *bv = NULL;
 
 	state->expr = (Expr *) aggstate;
 	state->parent = parent;
@@ -3707,8 +3744,36 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 						  &deform);
 		expr_setup_walker((Node *) pertrans->aggref->aggfilter,
 						  &deform);
+
+		if (!AggTransCanUseBatch(aggstate, pertrans) ||
+			!ExprListAllSimpleVars(pertrans->aggref->args, &allattnos))
+			batch = false;
 	}
-	ExecPushExprSetupSteps(state, &deform);
+
+	if (batch)
+	{
+		if (deform.last_outer > 0)
+		{
+			Assert(!bms_is_empty(allattnos));
+			bv  = BatchVectorCreate(allattnos, deform.last_outer);
+
+			/*
+			 * Deform all tuples up to last_outer in batch
+			 */
+			scratch.opcode = EEOP_OUTER_FETCHSOME_BATCH;
+			scratch.d.fetch_batch.last_var = deform.last_outer;
+			ExprEvalPushStep(state, &scratch);
+
+			/*
+			 * Put all arg Vars into vectors once per batch slice
+			 */
+			scratch.opcode = EEOP_BUILD_OUTER_BATCH_VECTOR;
+			scratch.d.batch_vector.bv = bv;
+			ExprEvalPushStep(state, &scratch);
+		}
+	}
+	else
+		ExecPushExprSetupSteps(state, &deform);
 
 	/*
 	 * Emit instructions for each transition value / grouping set combination.
@@ -3746,7 +3811,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 		 * Evaluate arguments to aggregate/combine function.
 		 */
 		argno = 0;
-		if (isCombine)
+		if (isCombine && !batch)
 		{
 			/*
 			 * Combining two aggregate transition values. Instead of directly
@@ -3816,7 +3881,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 
 			Assert(pertrans->numInputs == argno);
 		}
-		else if (!pertrans->aggsortrequired)
+		else if (!pertrans->aggsortrequired && !batch)
 		{
 			ListCell   *arg;
 
@@ -3849,7 +3914,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 			}
 			Assert(pertrans->numTransInputs == argno);
 		}
-		else if (pertrans->numInputs == 1)
+		else if (pertrans->numInputs == 1 && !batch)
 		{
 			/*
 			 * Non-presorted DISTINCT and/or ORDER BY case, with a single
@@ -3868,7 +3933,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 
 			Assert(pertrans->numInputs == argno);
 		}
-		else
+		else if (!batch)
 		{
 			/*
 			 * Non-presorted DISTINCT and/or ORDER BY case, with multiple
@@ -3896,7 +3961,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 		 * just keep the prior transValue. This is true for both plain and
 		 * sorted/distinct aggregates.
 		 */
-		if (trans_fcinfo->flinfo->fn_strict && pertrans->numTransInputs > 0)
+		if (trans_fcinfo->flinfo->fn_strict && pertrans->numTransInputs > 0 && !batch)
 		{
 			if (strictnulls)
 				scratch.opcode = EEOP_AGG_STRICT_INPUT_CHECK_NULLS;
@@ -3914,7 +3979,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 		}
 
 		/* Handle DISTINCT aggregates which have pre-sorted input */
-		if (pertrans->numDistinctCols > 0 && !pertrans->aggsortrequired)
+		if (pertrans->numDistinctCols > 0 && !pertrans->aggsortrequired && !batch)
 		{
 			if (pertrans->numDistinctCols > 1)
 				scratch.opcode = EEOP_AGG_PRESORTED_DISTINCT_MULTI;
@@ -3942,7 +4007,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 			{
 				ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
 									  pertrans, transno, setno, setoff, false,
-									  nullcheck);
+									  nullcheck, batch, bv);
 				setoff++;
 			}
 		}
@@ -3962,7 +4027,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 			{
 				ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
 									  pertrans, transno, setno, setoff, true,
-									  nullcheck);
+									  nullcheck, false, NULL);
 				setoff++;
 			}
 		}
@@ -4007,6 +4072,9 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 
 	ExecReadyExpr(state);
 
+	if (batch_trans)
+		*batch_trans = batch;
+
 	return state;
 }
 
@@ -4020,10 +4088,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 					  ExprEvalStep *scratch,
 					  FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
 					  int transno, int setno, int setoff, bool ishash,
-					  bool nullcheck)
+					  bool nullcheck, bool batch, BatchVector *bv)
 {
 	ExprContext *aggcontext;
 	int			adjust_jumpnull = -1;
+	BatchVectorSlice *bvs = NULL;
 
 	if (ishash)
 		aggcontext = aggstate->hashcontext;
@@ -4077,7 +4146,13 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 	 */
 	if (!pertrans->aggsortrequired)
 	{
-		if (pertrans->transtypeByVal)
+		if (batch)
+		{
+			if (bv)
+				bvs = BatchVectorSliceFromExprArgs(pertrans->aggref->args, bv);
+			scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+		}
+		else if (pertrans->transtypeByVal)
 		{
 			if (fcinfo->flinfo->fn_strict &&
 				pertrans->initValueIsNull)
@@ -4108,6 +4183,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 	scratch->d.agg_trans.setoff = setoff;
 	scratch->d.agg_trans.transno = transno;
 	scratch->d.agg_trans.aggcontext = aggcontext;
+	scratch->d.agg_trans.bvs = bvs;
 	ExprEvalPushStep(state, scratch);
 
 	/* fix up jumpnull */
@@ -5070,3 +5146,129 @@ ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
 		DomainHasConstraints(returning->typid);
 	ExprEvalPushStep(state, &scratch);
 }
+
+/* Is expr a Var node for a non-system attribute? */
+static bool
+expr_is_simple_var(Expr *expr, AttrNumber *out_attno)
+{
+	if (expr == NULL)
+		return false;
+
+	if (IsA(expr, TargetEntry))
+		return expr_is_simple_var((Expr *) ((TargetEntry *) expr)->expr,
+								  out_attno);
+	if (IsA(expr, RelabelType))
+		return expr_is_simple_var((Expr *) ((RelabelType *) expr)->arg,
+								  out_attno);
+
+	if (IsA(expr, Var) && ((Var *) expr)->varattno > 0)
+	{
+		*out_attno = ((Var *) expr)->varattno;
+		return true;
+	}
+
+	return false;
+}
+
+/* Are all inputs plain Vars (RelabelType->Var is rejected here)? Collect attnos. */
+static bool
+ExprListAllSimpleVars(const List *args, Bitmapset **allattnos)
+{
+	ListCell *lc;
+
+	foreach(lc, args)
+	{
+		TargetEntry *tle = lfirst_node(TargetEntry, lc);
+		Expr *arg = tle->expr;
+		AttrNumber attno;
+
+		if (!expr_is_simple_var(arg, &attno))
+			return false;
+
+		if (!IsA(arg, Var))
+			return false;
+
+		Assert(attno > 0);
+		*allattnos = bms_add_member(*allattnos, attno);
+	}
+
+	return true;
+}
+
+/* ---------- BatchVector stuff ------------- */
+
+static BatchVector *
+BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var)
+{
+	int maxrows = EXEC_BATCH_ROWS;
+	BatchVector *bv;
+	AttrNumber	attno;
+	int			i;
+
+	bv = palloc(sizeof(BatchVector));
+	bv->ncols = bms_num_members(attnos);
+	bv->maxrows = maxrows;
+	bv->last_var = last_var;
+	bv->attnos = palloc(sizeof(AttrNumber) * bv->ncols);
+	attno = -1;
+	i = 0;
+	while ((attno = bms_next_member(attnos, attno)) > 0)
+		bv->attnos[i++] = attno;
+	bv->cols = palloc(sizeof(Datum *) * bv->ncols);
+	bv->nulls = palloc(sizeof(bool  *) * bv->ncols);
+
+	for (i = 0; i < bv->ncols; i++)
+	{
+		bv->cols[i]  = palloc(sizeof(Datum) * maxrows);
+		bv->nulls[i] = palloc(sizeof(bool)  * maxrows);
+	}
+
+	bv->nrows = 0;
+	bv->hasnull = false;
+
+	return bv;
+}
+
+static int16
+BatchVectorFindAttColno(const BatchVector *bv, AttrNumber attno)
+{
+	for (int i = 0; i < bv->ncols; i++)
+		if (bv->attnos[i] == attno)
+			return i;
+
+	return -1;
+}
+
+/*
+ * BatchVectorSliceFromExprArgs
+ *		Build a BatchVectorSlice for a List of args.
+ *
+ * For Var args (possibly under RelabelType), store the col index.
+ * For non-Var args, store -1. Caller can handle Consts, etc.
+ */
+static BatchVectorSlice *
+BatchVectorSliceFromExprArgs(const List *args, const BatchVector *bv)
+{
+	BatchVectorSlice *bvs = palloc(sizeof(BatchVectorSlice));
+	int nargs = list_length(args);
+	int i = 0;
+	ListCell *lc;
+
+	Assert(bv);
+	bvs->bv = bv;
+	bvs->nargs = nargs;
+	bvs->argoffs = (int16 *) palloc(sizeof(int16) * nargs);
+
+	foreach (lc, args)
+	{
+		Expr *arg = (Expr *) lfirst(lc);
+		AttrNumber attno;
+
+		if (expr_is_simple_var(arg, &attno))
+			bvs->argoffs[i++] = BatchVectorFindAttColno(bv, attno);
+		else
+			bvs->argoffs[i++] = -1; /* non-Var */
+	}
+
+	return bvs;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 68629ad7991..3176679b346 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -606,6 +606,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_BUILD_INNER_BATCH_VECTOR,
 		&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
 		&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
+		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
 		&&CASE_EEOP_LAST
 	};
 
@@ -2336,6 +2337,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			EEO_NEXT();
 		}
 
+		EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP)
+		{
+			/* too complex for an inline implementation */
+			ExecAggPlainTransBatch(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
 		EEO_CASE(EEOP_LAST)
 		{
 			/* unreachable */
@@ -6039,3 +6048,97 @@ ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
 	}
 	bv->nrows = i;
 }
+
+void
+ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	AggState   *aggstate = castNode(AggState, state->parent);
+	AggStatePerTrans	pertrans = op->d.agg_trans.pertrans;
+	AggStatePerGroup pergroup =
+		&aggstate->all_pergroups[op->d.agg_trans.setoff][op->d.agg_trans.transno];
+	BatchVectorSlice  *bvs = op->d.agg_trans.bvs;
+	FunctionCallInfo	fcinfo = pertrans->transfn_fcinfo;
+	FmgrInfo		   *finfo = fcinfo->flinfo;
+	Datum		newVal;
+	TupleBatch *batch = econtext->outer_batch;
+	int			batch_nrows = bvs ? bvs->bv->nrows : batch->nvalid;
+	int			start_row = 0;
+
+	if (finfo->fn_strict)
+	{
+		if (pergroup->noTransValue && bvs)
+		{
+			const BatchVector *bv = bvs->bv;
+			bool	found = false;
+
+			Assert(bv);
+			for (int i = 0; i < batch_nrows; i++)
+			{
+				for (int j = 0; j < bvs->nargs; j++)
+				{
+					if (!bv->nulls[bvs->argoffs[j]][i])
+					{
+						fcinfo->args[1].value = bv->cols[bvs->argoffs[j]][i];
+						fcinfo->args[1].isnull = false;
+						if (j == bvs->nargs - 1)
+						{
+							found = true;
+							break;
+						}
+					}
+				}
+				if (found)
+					break;
+			}
+			/* If transValue has not yet been initialized, do so now. */
+			ExecAggInitGroup(aggstate, pertrans, pergroup,
+							 op->d.agg_trans.aggcontext);
+			start_row = 1;
+		}
+		else if (pergroup->transValueIsNull)
+			return;
+	}
+
+	switch (ExecEvalStepOp(state, op))
+	{
+		case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
+			/* Loop rows, call the original transfn per element using vector cols. */
+			for (int i = start_row; i < batch_nrows; i++)
+			{
+				bool hasnull = false;
+
+				/* Set up fcinfo args 1..m from column vectors at row i. */
+				if (bvs)
+				{
+					const BatchVector *bv = bvs->bv;
+
+					for (int j = 0; j < bvs->nargs; j++)
+					{
+						int16	argoff = bvs->argoffs[j];
+
+						fcinfo->args[j+1].value = bv->cols[argoff][i];
+						fcinfo->args[j+1].isnull = bv->nulls[argoff][i];
+						if (!hasnull && bv->nulls[argoff][i])
+							hasnull = true;
+					}
+				}
+				/* fcinfo->args[0] is the existing transition state */
+				if (finfo->fn_strict && hasnull)
+					continue;
+				fcinfo->args[0].value = pergroup->transValue;
+				fcinfo->args[0].isnull = pergroup->transValueIsNull;
+				newVal = FunctionCallInvoke(fcinfo);
+				if (!pertrans->transtypeByVal &&
+					DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+					newVal = ExecAggCopyTransValue(aggstate, pertrans,
+												   newVal, fcinfo->isnull,
+												   pergroup->transValue,
+												   pergroup->transValueIsNull);
+				pergroup->transValue = newVal;
+				pergroup->transValueIsNull = fcinfo->isnull;
+			}
+			break;
+		default:
+			elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
+	}
+}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3ace6363509..662d8bef43b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -825,6 +825,16 @@ advance_aggregates_batch(AggState *aggstate, TupleBatch *b)
 {
 	ExprContext *tmpcontext = aggstate->tmpcontext;
 	ExprState *evaltrans = aggstate->phase->evaltrans;
+	bool		batch_trans = aggstate->phase->batch_trans;
+
+	if (batch_trans)
+	{
+		tmpcontext->ecxt_outertuple = TupleBatchGetSlot(b, 0);
+		tmpcontext->outer_batch = b;
+		ExecEvalExprNoReturnSwitchContext(evaltrans, tmpcontext);
+		TupleBatchConsumeAll(b);
+		return;
+	}
 
 	while (TupleBatchHasMore(b))
 	{
@@ -1800,7 +1810,8 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
 
 		phase->evaltrans_cache[i][j] = ExecBuildAggTrans(aggstate, phase,
 														 dosort, dohash,
-														 nullcheck);
+														 nullcheck,
+														 NULL);
 
 		/* change back */
 		aggstate->ss.ps.outerops = outerops;
@@ -3367,7 +3378,7 @@ hashagg_reset_spill_state(AggState *aggstate)
 	}
 }
 
-static bool
+bool
 AggCanUsePlainBatch(AggState *aggstate)
 {
 	const Agg *aggnode = (const Agg *) aggstate->ss.ps.plan;
@@ -4233,7 +4244,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 			Assert(false);
 
 		phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
-											 false);
+											 false, &phase->batch_trans);
 
 		/* cache compiled expression for outer slot without NULL check */
 		phase->evaltrans_cache[0][0] = phase->evaltrans;
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 848f0b52d6f..efb3ee639fc 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3026,6 +3026,12 @@ llvm_compile_expr(ExprState *state)
 				LLVMBuildBr(b, opblocks[opno + 1]);
 				break;
 
+			case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
+				build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
 			case EEOP_LAST:
 				Assert(false);
 				break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 6bb527c3f6f..1b5e06f60cc 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -186,4 +186,5 @@ void	   *referenced_functions[] =
 	ExecBuildInnerBatchVector,
 	ExecBuildOuterBatchVector,
 	ExecBuildScanBatchVector,
+	ExecAggPlainTransBatch,
 };
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
index 6f1a38d14bd..b50961fc0c9 100644
--- a/src/include/executor/execBatch.h
+++ b/src/include/executor/execBatch.h
@@ -99,4 +99,10 @@ TupleBatchMaterializeAll(TupleBatch *b)
 	TupleBatchUseInput(b, b->ntuples);
 }
 
+static inline void
+TupleBatchConsumeAll(TupleBatch *b)
+{
+	b->next = b->nvalid;
+}
+
 #endif	/* EXECBATCH_H */
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 99c86bac702..1d33e084b69 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -302,6 +302,9 @@ typedef enum ExprEvalOp
 	EEOP_BUILD_OUTER_BATCH_VECTOR,
 	EEOP_BUILD_SCAN_BATCH_VECTOR,
 
+	/* Batched aggregate trans evaluation */
+	EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,	/* per-row fmgr calls */
+
 	/* non-existent operation, used e.g. to check array lengths */
 	EEOP_LAST
 } ExprEvalOp;
@@ -750,6 +753,7 @@ typedef struct ExprEvalStep
 
 		/* for EEOP_AGG_PLAIN_TRANS_[INIT_][STRICT_]{BYVAL,BYREF} */
 		/* for EEOP_AGG_ORDERED_TRANS_{DATUM,TUPLE} */
+		/* for EEOP_AGG_PLAIN_TRANS_{BATCH,BATCH_ROWLOOP} */
 		struct
 		{
 			AggStatePerTrans pertrans;
@@ -757,6 +761,7 @@ typedef struct ExprEvalStep
 			int			setno;
 			int			transno;
 			int			setoff;
+			struct BatchVectorSlice *bvs;
 		}			agg_trans;
 
 		/* for EEOP_IS_JSON */
@@ -956,8 +961,17 @@ typedef struct BatchVector
 	int		nrows;			/* #rows loaded into cols/nulls */
 } BatchVector;
 
+/* A slice of BatchVector that maps caller args to BatchVector columns. */
+typedef struct BatchVectorSlice
+{
+	const BatchVector *bv;
+	int			nargs;		/* number of args covered */
+	int16	   *argoffs;	/* length nargs, -1 for non-Var entries */
+} BatchVectorSlice;
+
 extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 
+extern void ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index cf5b0c7e05c..5ba9a523970 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -336,7 +336,8 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
 extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
 extern List *ExecInitExprList(List *nodes, PlanState *parent);
 extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
-									bool doSort, bool doHash, bool nullcheck);
+									bool doSort, bool doHash, bool nullcheck,
+									bool *batch_trans);
 extern ExprState *ExecBuildHash32FromAttrs(TupleDesc desc,
 										   const TupleTableSlotOps *ops,
 										   FmgrInfo *hashfunctions,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 6c4891bbaeb..5c5ebfc73f2 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -289,6 +289,7 @@ typedef struct AggStatePerPhaseData
 	Sort	   *sortnode;		/* Sort node for input ordering for phase */
 
 	ExprState  *evaltrans;		/* evaluation of transition functions  */
+	bool		batch_trans;	/* true if evaltrans contains batch EEOPs */
 
 	/*----------
 	 * Cached variants of the compiled expression.
@@ -338,4 +339,5 @@ extern void ExecAggInitializeDSM(AggState *node, ParallelContext *pcxt);
 extern void ExecAggInitializeWorker(AggState *node, ParallelWorkerContext *pwcxt);
 extern void ExecAggRetrieveInstrumentation(AggState *node);
 
+extern bool AggCanUsePlainBatch(AggState *aggstate);
 #endif							/* NODEAGG_H */
-- 
2.43.0



  [application/octet-stream] v2-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch (11.2K, 4-v2-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch)
  download | inline diff:
From c88299a33c376aa8a5a1a5359217e9c8e67b60e8 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 9 Sep 2025 21:43:29 +0900
Subject: [PATCH v2 7/8] WIP: Add EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT

The new EEOP runs a plain aggregate transition over a TupleBatch with
a single fmgr call. Batch vectors are passed to the transfn via
AggBulkArgs stored in fcinfo->flinfo->fn_extra, avoiding per-row fmgr
overhead.

Gate selection with AggTransfnSupportsBulk(), an allowlist of
built-in transfns updated to accept AggBulkArgs.  Some integer
transfns are taught to read AggBulkArgs when present, else fall
back. Rowloop batching remains available; unsupported aggregates keep
the row path.
---
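
Note for readers: the calling convention described above (one call per
batch, with the transition function looping over a column vector and its
null flags) can be illustrated outside PostgreSQL with a small standalone
sketch. Everything below is hypothetical and simplified: BulkArgsSketch
and bulk_sum_sketch() are stand-ins invented for illustration, not the
patch's AggBulkArgs or any real transfn, and overflow is reported by
return value rather than ereport(). It assumes nrows is the end index of
the slice, matching the loops in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for AggBulkArgs (single argument). */
typedef struct BulkArgsSketch
{
	int			nrows;		/* one past the last row to process */
	int			start_row;	/* first row to process */
	const int64_t *values;	/* values[i] = argument at row i */
	const bool *isnull;		/* isnull[i] = true if row i is NULL */
} BulkArgsSketch;

/*
 * Fold every non-NULL row in [start_row, nrows) into the running sum
 * with a single call.  Returns false on int64 overflow, in which case
 * *result is left untouched.  __builtin_add_overflow is a GCC/Clang
 * builtin.
 */
static bool
bulk_sum_sketch(int64_t trans, const BulkArgsSketch *ba, int64_t *result)
{
	assert(ba->start_row >= 0 && ba->start_row <= ba->nrows);

	for (int i = ba->start_row; i < ba->nrows; i++)
	{
		if (ba->isnull[i])
			continue;		/* strict semantics: skip NULL inputs */
		if (__builtin_add_overflow(trans, ba->values[i], &trans))
			return false;
	}
	*result = trans;
	return true;
}
```

The caller passes the current transition value and gets the updated value
back once per batch, which is where the per-row fmgr overhead is saved.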
 src/backend/executor/execExpr.c       | 28 ++++++++++++++++-
 src/backend/executor/execExprInterp.c | 43 ++++++++++++++++++++++++++
 src/backend/executor/nodeAgg.c        |  1 -
 src/backend/jit/llvm/llvmjit_expr.c   |  1 +
 src/backend/utils/adt/int.c           | 32 +++++++++++++++++++
 src/backend/utils/adt/int8.c          | 44 +++++++++++++++++++++++++++
 src/backend/utils/adt/numeric.c       | 17 +++++++++++
 src/include/executor/execExpr.h       |  1 +
 src/include/executor/executor.h       | 20 ++++++++++++
 9 files changed, 185 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index af5ed8b6368..27a5780f557 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -47,6 +47,7 @@
 #include "utils/acl.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
+#include "utils/fmgroids.h"
 #include "utils/jsonfuncs.h"
 #include "utils/jsonpath.h"
 #include "utils/lsyscache.h"
@@ -3692,6 +3693,28 @@ AggTransCanUseBatch(AggState *as, AggStatePerTrans pt)
 	return true;
 }
 
+/* Return true if this transfn OID is known to accept AggBulkArgs. */
+static bool
+AggTransfnSupportsBulk(Oid fn_oid)
+{
+	/* Phase 1: hard-coded allowlist of bulk-aware built-in transfns. */
+	static const Oid ok[] =
+	{
+		F_INT8INC_ANY,		/* COUNT(*) transfn */
+		F_INT8INC,			/* COUNT(arg) transfn */
+		F_INT4_SUM,			/* SUM(int) transfn */
+		F_INT4SMALLER,		/* MIN(int) transfn */
+		F_INT4LARGER,		/* MAX(int) transfn */
+		/* add others as they are made bulk-aware */
+		InvalidOid
+	};
+
+	for (int i = 0; OidIsValid(ok[i]); i++)
+		if (ok[i] == fn_oid)
+			return true;
+	return false;
+}
+
 /*
  * Build transition/combine function invocations for all aggregate transition
  * / combination function invocations in a grouping sets phase. This has to
@@ -4150,7 +4173,10 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 		{
 			if (bv)
 				bvs = BatchVectorSliceFromExprArgs(pertrans->aggref->args, bv);
-			scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+			if (!AggTransfnSupportsBulk(pertrans->transfn_oid))
+				scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+			else
+				scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT;
 		}
 		else if (pertrans->transtypeByVal)
 		{
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3176679b346..41ad9b4838d 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -607,6 +607,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
 		&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
 		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
+		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,
 		&&CASE_EEOP_LAST
 	};
 
@@ -2345,6 +2346,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			EEO_NEXT();
 		}
 
+		EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT)
+		{
+			/* too complex for an inline implementation */
+			ExecAggPlainTransBatch(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
 		EEO_CASE(EEOP_LAST)
 		{
 			/* unreachable */
@@ -6138,6 +6147,40 @@ ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext
 				pergroup->transValueIsNull = fcinfo->isnull;
 			}
 			break;
+
+		case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
+			{
+				void *save = fcinfo->flinfo->fn_extra;
+				AggBulkArgs ba = {batch_nrows, start_row};
+
+				if (bvs)
+				{
+					const BatchVector *bv = bvs->bv;
+
+					Assert(bv);
+					ba.nargs = bvs->nargs;
+					ba.argoffs = bvs->argoffs;
+					ba.args = bv->cols;
+					ba.isnull = bv->nulls;
+					ba.hasnull = bv->hasnull;
+				}
+				fcinfo->flinfo->fn_extra = &ba;
+				fcinfo->args[0].value = pergroup->transValue;
+				fcinfo->args[0].isnull = pergroup->transValueIsNull;
+				fcinfo->isnull = false;		/* just in case transfn doesn't set it */
+				newVal = FunctionCallInvoke(fcinfo);   /* one call for the entire slice */
+				if (!pertrans->transtypeByVal &&
+					DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+					newVal = ExecAggCopyTransValue(aggstate, pertrans,
+												   newVal, fcinfo->isnull,
+												   pergroup->transValue,
+												   pergroup->transValueIsNull);
+				pergroup->transValue = newVal;
+				pergroup->transValueIsNull = fcinfo->isnull;
+				fcinfo->flinfo->fn_extra = save;
+			}
+			break;
+
 		default:
 			elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
 	}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 662d8bef43b..a2286ef5e54 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2687,7 +2687,6 @@ agg_retrieve_direct_batch(AggState *aggstate)
 
 	initialize_aggregates(aggstate, aggstate->pergroups,
 						  Max(aggstate->phase->numsets, 1));
-
 	if (aggstate->grp_firstTuple)
 	{
 		ExecForceStoreHeapTuple(aggstate->grp_firstTuple, firstSlot, true);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index efb3ee639fc..45346124bd7 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3026,6 +3026,7 @@ llvm_compile_expr(ExprState *state)
 				LLVMBuildBr(b, opblocks[opno + 1]);
 				break;
 
+			case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
 			case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
 				build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
 								v_state, op, v_econtext);
diff --git a/src/backend/utils/adt/int.c b/src/backend/utils/adt/int.c
index b5781989a64..eb1780b5590 100644
--- a/src/backend/utils/adt/int.c
+++ b/src/backend/utils/adt/int.c
@@ -1363,18 +1363,50 @@ int2smaller(PG_FUNCTION_ARGS)
 Datum
 int4larger(PG_FUNCTION_ARGS)
 {
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
 	int32		arg1 = PG_GETARG_INT32(0);
 	int32		arg2 = PG_GETARG_INT32(1);
 
+	if (unlikely(ba))
+	{
+		int32 result = arg1;
+
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+				if (arg2 > result)
+					result = arg2;
+			}
+		}
+		PG_RETURN_INT32(result);
+	}
 	PG_RETURN_INT32((arg1 > arg2) ? arg1 : arg2);
 }
 
 Datum
 int4smaller(PG_FUNCTION_ARGS)
 {
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
 	int32		arg1 = PG_GETARG_INT32(0);
 	int32		arg2 = PG_GETARG_INT32(1);
 
+	if (unlikely(ba))
+	{
+		int32 result = arg1;
+
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+				if (arg2 < result)
+					result = arg2;
+			}
+		}
+		PG_RETURN_INT32(result);
+	}
 	PG_RETURN_INT32((arg1 < arg2) ? arg1 : arg2);
 }
 
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index bdea490202a..bbabf4e0785 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -461,10 +461,28 @@ int8up(PG_FUNCTION_ARGS)
 Datum
 int8pl(PG_FUNCTION_ARGS)
 {
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
 	int64		arg1 = PG_GETARG_INT64(0);
 	int64		arg2 = PG_GETARG_INT64(1);
 	int64		result;
 
+	if (unlikely(ba))
+	{
+		result = arg1;
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				arg2 = DatumGetInt64(ba->args[ba->argoffs[0]][i]);
+				if (unlikely(pg_add_s64_overflow(arg1, arg2, &result)))
+					ereport(ERROR,
+							(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+							 errmsg("bigint out of range")));
+				arg1 = result;
+			}
+		}
+		PG_RETURN_INT64(result);
+	}
 	if (unlikely(pg_add_s64_overflow(arg1, arg2, &result)))
 		ereport(ERROR,
 				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
@@ -718,9 +736,35 @@ int8lcm(PG_FUNCTION_ARGS)
 Datum
 int8inc(PG_FUNCTION_ARGS)
 {
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
 	int64		arg = PG_GETARG_INT64(0);
 	int64		result;
 
+	if (unlikely(ba))
+	{
+		result = arg;
+		if (!ba->hasnull || ba->nargs == 0)
+		{
+			if (unlikely(pg_add_s64_overflow(arg, ba->nrows - ba->start_row, &result)))
+				ereport(ERROR,
+						(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						 errmsg("bigint out of range")));
+			PG_RETURN_INT64(result);
+		}
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				if (unlikely(pg_add_s64_overflow(arg, 1, &result)))
+					ereport(ERROR,
+							(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+							 errmsg("bigint out of range")));
+				arg = result;
+			}
+		}
+		PG_RETURN_INT64(result);
+	}
+
 	if (unlikely(pg_add_s64_overflow(arg, 1, &result)))
 		ereport(ERROR,
 				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 76269918593..b02664c97f5 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -6310,6 +6310,23 @@ int4_sum(PG_FUNCTION_ARGS)
 {
 	int64		oldsum;
 	int64		newval;
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
+
+	if (unlikely(ba))
+	{
+		int64	result = (!PG_ARGISNULL(0) ? PG_GETARG_INT64(0) : 0);
+
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				int32	arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+
+				result = result + arg2;
+			}
+		}
+		PG_RETURN_INT64(result);
+	}
 
 	if (PG_ARGISNULL(0))
 	{
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 1d33e084b69..f24782ecf58 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -304,6 +304,7 @@ typedef enum ExprEvalOp
 
 	/* Batched aggregate trans evaluation */
 	EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,	/* per-row fmgr calls */
+	EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,	/* call transfn once with AggBulkArgs */
 
 	/* non-existent operation, used e.g. to check array lengths */
 	EEOP_LAST
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 5ba9a523970..c72bd755b79 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -561,6 +561,26 @@ ExecQualAndReset(ExprState *state, ExprContext *econtext)
 }
 #endif
 
+#ifndef FRONTEND
+/* Per-call bulk argument vectors for batched aggregate trans functions. */
+typedef struct AggBulkArgs
+{
+	int		nrows;		/* number of rows in this batch */
+	int		start_row;
+	int16  *argoffs;
+	int		nargs;		/* number of argument vectors */
+	Datum  **args;		/* args[j][i] = j-th arg at row i */
+	bool   **isnull;	/* isnull[j][i] */
+	bool	hasnull;	/* is any datum in args NULL? */
+} AggBulkArgs;
+
+static inline AggBulkArgs *
+AggGetBulkArgs(FunctionCallInfo fcinfo)
+{
+	return (AggBulkArgs *) (fcinfo->flinfo ? fcinfo->flinfo->fn_extra : NULL);
+}
+#endif
+
 extern bool ExecCheck(ExprState *state, ExprContext *econtext);
 
 /*
-- 
2.43.0



  [application/octet-stream] v2-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch (16.9K, 5-v2-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch)
  download | inline diff:
From 3cf02cab36bc9b2420f98ff08c17dea082a84f59 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 22 Sep 2025 17:01:29 +0900
Subject: [PATCH v2 5/8] WIP: Add EEOPs and helpers for TupleBatch processing

Introduce new EEOP cases to fetch attributes into TupleBatch
vectors:
- EEOP_{INNER,OUTER,SCAN}_FETCHSOME_BATCH
- EEOP_BUILD_{INNER,OUTER,SCAN}_BATCH_VECTOR

Add ExecBuild{Inner,Outer,Scan}BatchVector() helpers to populate
column vectors (values, nulls, nrows, hasnull) from a TupleBatch.
Extend ExprContext with inner_batch, outer_batch, and scan_batch
fields so expression programs can access active batches directly.

Add slot_getsomeattrs_batch() to prefetch attributes across all
slots in a TupleBatch, similar to slot_getsomeattrs() for one slot.
---
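
Note for readers: the row-to-column step that the BatchVector helpers
perform can be illustrated with a standalone sketch. The types and the
function below (RowSketch, ColVectorSketch, build_col_vector_sketch) are
hypothetical, fixed-size stand-ins invented for illustration; they are
not the patch's TupleBatch or BatchVector. One deliberate difference:
the sketch resets hasnull at the start of every batch, which is a choice
made here and not a statement about what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NCOLS	2
#define SKETCH_MAXROWS	4

/* Hypothetical row type standing in for an already-deformed tuple slot. */
typedef struct RowSketch
{
	int64_t		values[SKETCH_NCOLS];
	bool		isnull[SKETCH_NCOLS];
} RowSketch;

/* Hypothetical column vectors, loosely modeled on BatchVector. */
typedef struct ColVectorSketch
{
	int64_t		cols[SKETCH_NCOLS][SKETCH_MAXROWS];	/* cols[j][i] */
	bool		nulls[SKETCH_NCOLS][SKETCH_MAXROWS];	/* nulls[j][i] */
	bool		hasnull;	/* is any loaded datum NULL? */
	int			nrows;		/* rows loaded into cols/nulls */
} ColVectorSketch;

/* Transpose nrows row-oriented tuples into per-column arrays. */
static void
build_col_vector_sketch(ColVectorSketch *cv, const RowSketch *rows, int nrows)
{
	assert(nrows >= 0 && nrows <= SKETCH_MAXROWS);

	cv->hasnull = false;	/* per-batch state, reset on each build */
	for (int i = 0; i < nrows; i++)
	{
		for (int j = 0; j < SKETCH_NCOLS; j++)
		{
			cv->cols[j][i] = rows[i].values[j];
			cv->nulls[j][i] = rows[i].isnull[j];
			if (rows[i].isnull[j])
				cv->hasnull = true;
		}
	}
	cv->nrows = nrows;
}
```

A consumer can then check hasnull once per batch and take a loop without
per-row NULL tests when it is false.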
 src/backend/executor/execExprInterp.c | 127 +++++++++++++++++++++++++-
 src/backend/executor/execTuples.c     |  32 +++++++
 src/backend/jit/llvm/llvmjit_expr.c   |  86 +++++++++++++++++
 src/backend/jit/llvm/llvmjit_types.c  |   4 +
 src/include/executor/execExpr.h       |  45 ++++++++-
 src/include/executor/tuptable.h       |   2 +
 src/include/nodes/execnodes.h         |  24 +++--
 7 files changed, 310 insertions(+), 10 deletions(-)

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 0e1a74976f7..68629ad7991 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -59,6 +59,7 @@
 #include "access/heaptoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
+#include "executor/execBatch.h"
 #include "executor/execExpr.h"
 #include "executor/nodeSubplan.h"
 #include "funcapi.h"
@@ -188,6 +189,11 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  int setno);
 static char *ExecGetJsonValueItemString(JsonbValue *item, bool *resnull);
 
+static pg_attribute_always_inline void ExecBuildBatchVector(ExprState *state,
+															ExprEvalStep *op,
+															ExprContext *econtext,
+															TupleBatch *b);
+
 /*
  * ScalarArrayOpExprHashEntry
  * 		Hash table entry type used during EEOP_HASHED_SCALARARRAYOP
@@ -446,7 +452,6 @@ ExecReadyInterpretedExpr(ExprState *state)
 	state->evalfunc_private = ExecInterpExpr;
 }
 
-
 /*
  * Evaluate expression identified by "state" in the execution context
  * given by "econtext".  *isnull is set to the is-null flag for the result,
@@ -466,6 +471,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 	TupleTableSlot *scanslot;
 	TupleTableSlot *oldslot;
 	TupleTableSlot *newslot;
+	TupleBatch *innerbatch;
+	TupleBatch *outerbatch;
+	TupleBatch *scanbatch;
 
 	/*
 	 * This array has to be in the same order as enum ExprEvalOp.
@@ -479,6 +487,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_SCAN_FETCHSOME,
 		&&CASE_EEOP_OLD_FETCHSOME,
 		&&CASE_EEOP_NEW_FETCHSOME,
+		&&CASE_EEOP_INNER_FETCHSOME_BATCH,
+		&&CASE_EEOP_OUTER_FETCHSOME_BATCH,
+		&&CASE_EEOP_SCAN_FETCHSOME_BATCH,
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
@@ -592,6 +603,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
 		&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
 		&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
+		&&CASE_EEOP_BUILD_INNER_BATCH_VECTOR,
+		&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
+		&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
 		&&CASE_EEOP_LAST
 	};
 
@@ -612,6 +626,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 	scanslot = econtext->ecxt_scantuple;
 	oldslot = econtext->ecxt_oldtuple;
 	newslot = econtext->ecxt_newtuple;
+	innerbatch = econtext->inner_batch;
+	outerbatch = econtext->outer_batch;
+	scanbatch = econtext->scan_batch;
 
 #if defined(EEO_USE_COMPUTED_GOTO)
 	EEO_DISPATCH();
@@ -658,6 +675,36 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			EEO_NEXT();
 		}
 
+		EEO_CASE(EEOP_INNER_FETCHSOME_BATCH)
+		{
+			CheckOpSlotCompatibility(op, innerslot);
+
+			Assert(innerbatch);
+			slot_getsomeattrs_batch(innerbatch, op->d.fetch_batch.last_var);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_FETCHSOME_BATCH)
+		{
+			CheckOpSlotCompatibility(op, outerslot);
+
+			Assert(outerbatch);
+			slot_getsomeattrs_batch(outerbatch, op->d.fetch_batch.last_var);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_FETCHSOME_BATCH)
+		{
+			CheckOpSlotCompatibility(op, scanslot);
+
+			Assert(scanbatch);
+			slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
+
+			EEO_NEXT();
+		}
+
 		EEO_CASE(EEOP_OLD_FETCHSOME)
 		{
 			CheckOpSlotCompatibility(op, oldslot);
@@ -2265,6 +2312,30 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			EEO_NEXT();
 		}
 
+		EEO_CASE(EEOP_BUILD_INNER_BATCH_VECTOR)
+		{
+			/* too complex for an inline implementation */
+			ExecBuildInnerBatchVector(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_BUILD_OUTER_BATCH_VECTOR)
+		{
+			/* too complex for an inline implementation */
+			ExecBuildOuterBatchVector(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_BUILD_SCAN_BATCH_VECTOR)
+		{
+			/* too complex for an inline implementation */
+			ExecBuildScanBatchVector(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
 		EEO_CASE(EEOP_LAST)
 		{
 			/* unreachable */
@@ -5914,3 +5985,57 @@ ExecAggPlainTransByRef(AggState *aggstate, AggStatePerTrans pertrans,
 
 	MemoryContextSwitchTo(oldContext);
 }
+
+void
+ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	Assert(econtext->inner_batch);
+	ExecBuildBatchVector(state, op, econtext, econtext->inner_batch);
+}
+
+void
+ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	Assert(econtext->outer_batch);
+	ExecBuildBatchVector(state, op, econtext, econtext->outer_batch);
+}
+
+void
+ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	Assert(econtext->scan_batch);
+	ExecBuildBatchVector(state, op, econtext, econtext->scan_batch);
+}
+
+static pg_attribute_always_inline void
+ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
+					 TupleBatch *b)
+{
+	struct BatchVector *bv = op->d.batch_vector.bv;
+	int		i = 0;
+
+	if (bv->ncols == 0)
+		return;
+
+	/* Fetch each requested attribute into column vectors. */
+	TupleBatchRewind(b);
+	while (TupleBatchHasMore(b))
+	{
+		TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+		for (int j = 0; j < bv->ncols; j++)
+		{
+			AttrNumber attno = bv->attnos[j];
+			Datum  *cols  = bv->cols[j];
+			bool   *nulls  = bv->nulls[j];
+
+			Assert(attno <= slot->tts_nvalid);
+			cols[i] = slot->tts_values[attno - 1];
+			nulls[i] = slot->tts_isnull[attno - 1];
+			if (!bv->hasnull && nulls[i])
+				bv->hasnull = true;
+		}
+		i++;
+	}
+	bv->nrows = i;
+}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 8e02d68824f..86d5dea8f8b 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -2111,6 +2111,38 @@ slot_getsomeattrs_int(TupleTableSlot *slot, int attnum)
 	}
 }
 
+void
+slot_getsomeattrs_batch(struct TupleBatch *b, int attnum)
+{
+	while (TupleBatchHasMore(b))
+	{
+		TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+		/* Check for caller errors */
+		Assert(attnum > 0);
+
+		if (unlikely(attnum > slot->tts_tupleDescriptor->natts))
+			elog(ERROR, "invalid attribute number %d", attnum);
+
+		/* XXX - there should perhaps also be a batch-level att_nvalid */
+		if (attnum <= slot->tts_nvalid)
+			continue;
+
+		/* Fetch as many attributes as possible from the underlying tuple. */
+		slot->tts_ops->getsomeattrs(slot, attnum);
+
+		/*
+		 * If the underlying tuple doesn't have enough attributes, tuple
+		 * descriptor must have the missing attributes.
+		 */
+		if (unlikely(slot->tts_nvalid < attnum))
+		{
+			slot_getmissingattrs(slot, slot->tts_nvalid, attnum);
+			slot->tts_nvalid = attnum;
+		}
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecTypeFromTL
  *
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 712b35df7e5..848f0b52d6f 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -109,6 +109,11 @@ llvm_compile_expr(ExprState *state)
 	LLVMValueRef v_newslot;
 	LLVMValueRef v_resultslot;
 
+	/* batches */
+	LLVMValueRef v_innerbatch;
+	LLVMValueRef v_outerbatch;
+	LLVMValueRef v_scanbatch;
+
 	/* nulls/values of slots */
 	LLVMValueRef v_innervalues;
 	LLVMValueRef v_innernulls;
@@ -221,6 +226,21 @@ llvm_compile_expr(ExprState *state)
 									 v_state,
 									 FIELDNO_EXPRSTATE_RESULTSLOT,
 									 "v_resultslot");
+	v_innerbatch = l_load_struct_gep(b,
+									 StructExprContext,
+									 v_econtext,
+									 FIELDNO_EXPRCONTEXT_INNERBATCH,
+									 "v_innerbatch");
+	v_outerbatch = l_load_struct_gep(b,
+									 StructExprContext,
+									 v_econtext,
+									 FIELDNO_EXPRCONTEXT_OUTERBATCH,
+									 "v_outerbatch");
+	v_scanbatch = l_load_struct_gep(b,
+									StructExprContext,
+									v_econtext,
+									FIELDNO_EXPRCONTEXT_SCANBATCH,
+									"v_scanbatch");
 
 	/* build global values/isnull pointers */
 	v_scanvalues = l_load_struct_gep(b,
@@ -439,6 +459,54 @@ llvm_compile_expr(ExprState *state)
 					break;
 				}
 
+			case EEOP_INNER_FETCHSOME_BATCH:
+				{
+					LLVMValueRef params[2];
+
+					params[0] = v_innerbatch;
+					params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+					l_call(b,
+						   llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+						   llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+						   params, lengthof(params), "");
+
+					LLVMBuildBr(b, opblocks[opno + 1]);
+					break;
+				}
+
+			case EEOP_OUTER_FETCHSOME_BATCH:
+				{
+					LLVMValueRef params[2];
+
+					params[0] = v_outerbatch;
+					params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+					l_call(b,
+						   llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+						   llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+						   params, lengthof(params), "");
+
+					LLVMBuildBr(b, opblocks[opno + 1]);
+					break;
+				}
+
+			case EEOP_SCAN_FETCHSOME_BATCH:
+				{
+					LLVMValueRef params[2];
+
+					params[0] = v_scanbatch;
+					params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+					l_call(b,
+						   llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+						   llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+						   params, lengthof(params), "");
+
+					LLVMBuildBr(b, opblocks[opno + 1]);
+					break;
+				}
+
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
@@ -2940,6 +3008,24 @@ llvm_compile_expr(ExprState *state)
 				LLVMBuildBr(b, opblocks[opno + 1]);
 				break;
 
+			case EEOP_BUILD_INNER_BATCH_VECTOR:
+				build_EvalXFunc(b, mod, "ExecBuildInnerBatchVector",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
+			case EEOP_BUILD_OUTER_BATCH_VECTOR:
+				build_EvalXFunc(b, mod, "ExecBuildOuterBatchVector",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
+			case EEOP_BUILD_SCAN_BATCH_VECTOR:
+				build_EvalXFunc(b, mod, "ExecBuildScanBatchVector",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
 			case EEOP_LAST:
 				Assert(false);
 				break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 167cd554b9c..6bb527c3f6f 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -179,7 +179,11 @@ void	   *referenced_functions[] =
 	MakeExpandedObjectReadOnlyInternal,
 	slot_getmissingattrs,
 	slot_getsomeattrs_int,
+	slot_getsomeattrs_batch,
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecBuildInnerBatchVector,
+	ExecBuildOuterBatchVector,
+	ExecBuildScanBatchVector,
 };
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 75366203706..99c86bac702 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -78,6 +78,11 @@ typedef enum ExprEvalOp
 	EEOP_OLD_FETCHSOME,
 	EEOP_NEW_FETCHSOME,
 
+	/* apply slot_getsomeattrs_batch() to corresponding batch */
+	EEOP_INNER_FETCHSOME_BATCH,
+	EEOP_OUTER_FETCHSOME_BATCH,
+	EEOP_SCAN_FETCHSOME_BATCH,
+
 	/* compute non-system Var value */
 	EEOP_INNER_VAR,
 	EEOP_OUTER_VAR,
@@ -292,11 +297,15 @@ typedef enum ExprEvalOp
 	EEOP_AGG_ORDERED_TRANS_DATUM,
 	EEOP_AGG_ORDERED_TRANS_TUPLE,
 
+	/* ExprContext.*_batch -> BatchVector */
+	EEOP_BUILD_INNER_BATCH_VECTOR,
+	EEOP_BUILD_OUTER_BATCH_VECTOR,
+	EEOP_BUILD_SCAN_BATCH_VECTOR,
+
 	/* non-existent operation, used e.g. to check array lengths */
 	EEOP_LAST
 } ExprEvalOp;
 
-
 typedef struct ExprEvalStep
 {
 	/*
@@ -331,6 +340,12 @@ typedef struct ExprEvalStep
 			const TupleTableSlotOps *kind;
 		}			fetch;
 
+		struct
+		{
+			/* attribute number up to which to fetch (inclusive) */
+			int			last_var;
+		}			fetch_batch;
+
 		/* for EEOP_INNER/OUTER/SCAN/OLD/NEW_[SYS]VAR */
 		struct
 		{
@@ -769,6 +784,12 @@ typedef struct ExprEvalStep
 			void	   *json_coercion_cache;
 			ErrorSaveContext *escontext;
 		}			jsonexpr_coercion;
+
+		/* for batch vector construction */
+		struct
+		{
+			struct BatchVector *bv;
+		}			batch_vector;
 	}			d;
 } ExprEvalStep;
 
@@ -917,4 +938,26 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 
+/* ---------- BatchVector stuff ------------- */
+
+/* Vector fetch spec for a list of simple Vars. */
+typedef struct BatchVector
+{
+	/* immutable after BatchVectorCreate */
+	AttrNumber *attnos;		/* [ncols] */
+	int			ncols;
+	int			maxrows;
+	int			last_var;
+
+	/* per batch state */
+	Datum **cols;			/* [ncols][maxrows] */
+	bool  **nulls;			/* [ncols][maxrows] */
+	bool	hasnull;		/* is any datum in cols NULL? */
+	int		nrows;			/* #rows loaded into cols/nulls */
+} BatchVector;
+
+extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 095e4cc82e3..2e2192fb3cf 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -347,6 +347,8 @@ extern Datum ExecFetchSlotHeapTupleDatum(TupleTableSlot *slot);
 extern void slot_getmissingattrs(TupleTableSlot *slot, int startAttNum,
 								 int lastAttNum);
 extern void slot_getsomeattrs_int(TupleTableSlot *slot, int attnum);
+struct TupleBatch;
+extern void slot_getsomeattrs_batch(struct TupleBatch *b, int attnum);
 
 
 #ifndef FRONTEND
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9b81b842161..fdfe8b4ddaf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -277,6 +277,14 @@ typedef struct ExprContext
 #define FIELDNO_EXPRCONTEXT_OUTERTUPLE 3
 	TupleTableSlot *ecxt_outertuple;
 
+	/* For batched evaluation using batch-aware EEOPs */
+#define FIELDNO_EXPRCONTEXT_INNERBATCH 4
+	TupleBatch	   *inner_batch;
+#define FIELDNO_EXPRCONTEXT_OUTERBATCH 5
+	TupleBatch	   *outer_batch;
+#define FIELDNO_EXPRCONTEXT_SCANBATCH 6
+	TupleBatch	   *scan_batch;
+
 	/* Memory contexts for expression evaluation --- see notes above */
 	MemoryContext ecxt_per_query_memory;
 	MemoryContext ecxt_per_tuple_memory;
@@ -289,27 +297,27 @@ typedef struct ExprContext
 	 * Values to substitute for Aggref nodes in the expressions of an Agg
 	 * node, or for WindowFunc nodes within a WindowAgg node.
 	 */
-#define FIELDNO_EXPRCONTEXT_AGGVALUES 8
+#define FIELDNO_EXPRCONTEXT_AGGVALUES 11
 	Datum	   *ecxt_aggvalues; /* precomputed values for aggs/windowfuncs */
-#define FIELDNO_EXPRCONTEXT_AGGNULLS 9
+#define FIELDNO_EXPRCONTEXT_AGGNULLS 12
 	bool	   *ecxt_aggnulls;	/* null flags for aggs/windowfuncs */
 
 	/* Value to substitute for CaseTestExpr nodes in expression */
-#define FIELDNO_EXPRCONTEXT_CASEDATUM 10
+#define FIELDNO_EXPRCONTEXT_CASEDATUM 13
 	Datum		caseValue_datum;
-#define FIELDNO_EXPRCONTEXT_CASENULL 11
+#define FIELDNO_EXPRCONTEXT_CASENULL 14
 	bool		caseValue_isNull;
 
 	/* Value to substitute for CoerceToDomainValue nodes in expression */
-#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 12
+#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 15
 	Datum		domainValue_datum;
-#define FIELDNO_EXPRCONTEXT_DOMAINNULL 13
+#define FIELDNO_EXPRCONTEXT_DOMAINNULL 16
 	bool		domainValue_isNull;
 
 	/* Tuples that OLD/NEW Var nodes in RETURNING may refer to */
-#define FIELDNO_EXPRCONTEXT_OLDTUPLE 14
+#define FIELDNO_EXPRCONTEXT_OLDTUPLE 17
 	TupleTableSlot *ecxt_oldtuple;
-#define FIELDNO_EXPRCONTEXT_NEWTUPLE 15
+#define FIELDNO_EXPRCONTEXT_NEWTUPLE 18
 	TupleTableSlot *ecxt_newtuple;
 
 	/* Link to containing EState (NULL if a standalone ExprContext) */
-- 
2.43.0
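
To make the BatchVector layout above easier to picture, here is a minimal
standalone sketch of a column-major vector with a batch-wide hasnull flag.
It is not the patch's code: DemoVector and the demo_vector_* functions are
hypothetical names, int64_t stands in for Datum, and malloc/calloc stand in
for palloc. The point is only that a consumer can take a tight loop over
cols[col] when hasnull is false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the patch's BatchVector. */
typedef struct DemoVector
{
	int			ncols;
	int			maxrows;
	int64_t   **cols;			/* [ncols][maxrows] */
	bool	  **nulls;			/* [ncols][maxrows] */
	bool		hasnull;		/* is any value loaded so far NULL? */
	int			nrows;			/* rows loaded into cols/nulls */
} DemoVector;

static DemoVector *
demo_vector_create(int ncols, int maxrows)
{
	DemoVector *bv = malloc(sizeof(DemoVector));

	assert(bv != NULL && ncols > 0 && maxrows > 0);
	bv->ncols = ncols;
	bv->maxrows = maxrows;
	bv->cols = malloc(sizeof(int64_t *) * ncols);
	bv->nulls = malloc(sizeof(bool *) * ncols);
	assert(bv->cols != NULL && bv->nulls != NULL);
	for (int c = 0; c < ncols; c++)
	{
		bv->cols[c] = calloc(maxrows, sizeof(int64_t));
		bv->nulls[c] = calloc(maxrows, sizeof(bool));
		assert(bv->cols[c] != NULL && bv->nulls[c] != NULL);
	}
	bv->hasnull = false;
	bv->nrows = 0;
	return bv;
}

/* Scatter one row-major row into the column arrays; false if full. */
static bool
demo_vector_append(DemoVector *bv, const int64_t *values, const bool *isnull)
{
	if (bv->nrows >= bv->maxrows)
		return false;
	for (int c = 0; c < bv->ncols; c++)
	{
		bv->cols[c][bv->nrows] = values[c];
		bv->nulls[c][bv->nrows] = isnull[c];
		if (isnull[c])
			bv->hasnull = true;
	}
	bv->nrows++;
	return true;
}

/* Sum one column, skipping NULLs; no per-row null test when hasnull is false. */
static int64_t
demo_vector_sum(const DemoVector *bv, int col)
{
	const int64_t *vals = bv->cols[col];
	int64_t		sum = 0;

	if (!bv->hasnull)
	{
		for (int r = 0; r < bv->nrows; r++)
			sum += vals[r];
	}
	else
	{
		for (int r = 0; r < bv->nrows; r++)
			if (!bv->nulls[col][r])
				sum += vals[r];
	}
	return sum;
}
```

A caller would create the vector once, append rows as a batch is loaded,
and then run per-column loops such as demo_vector_sum() over the result.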



  [application/octet-stream] v2-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch (6.3K, 6-v2-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch)
From abb8b1ded7cf192d286662dd320ad93802ce05d2 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 4 Sep 2025 22:55:25 +0900
Subject: [PATCH v2 4/8] WIP: Add agg_retrieve_direct_batch() for plain
 aggregates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Teach Agg to consume child tuples in batches for AGG_PLAIN. A new
agg_retrieve_direct_batch() pulls TupleBatch from the child via
ExecProcNodeBatch(), materializes as needed, and advances per-agg
transition state over the batch. A first tuple is copied to match
the direct path’s behavior before batch processing.

Add AggCanUsePlainBatch() and select retrieve_plain at init:
batch path when no grouping sets, strategy is AGG_PLAIN, and the
child exposes ExecProcNodeBatch(); otherwise keep the row path.

Plan shape and EXPLAIN remain unchanged. Semantics are identical
to the non-batch direct path; this only reduces per-tuple overhead.
---
 src/backend/executor/nodeAgg.c | 123 +++++++++++++++++++++++++++++++++
 src/include/nodes/execnodes.h  |   5 ++
 2 files changed, 128 insertions(+)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a4f3d30f307..3ace6363509 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -820,6 +820,20 @@ advance_aggregates(AggState *aggstate)
 									  aggstate->tmpcontext);
 }
 
+static void
+advance_aggregates_batch(AggState *aggstate, TupleBatch *b)
+{
+	ExprContext *tmpcontext = aggstate->tmpcontext;
+	ExprState *evaltrans = aggstate->phase->evaltrans;
+
+	while (TupleBatchHasMore(b))
+	{
+		tmpcontext->ecxt_outertuple = TupleBatchGetNextSlot(b);
+		ExecEvalExprNoReturnSwitchContext(evaltrans, tmpcontext);
+		ResetExprContext(tmpcontext);
+	}
+}
+
 /*
  * Run the transition function for a DISTINCT or ORDER BY aggregate
  * with only one input.  This is called after we have completed
@@ -2260,6 +2274,9 @@ ExecAgg(PlanState *pstate)
 				result = agg_retrieve_hash_table(node);
 				break;
 			case AGG_PLAIN:
+				/* init-time choice */
+				result = node->retrieve_plain(node);
+				break;
 			case AGG_SORTED:
 				result = agg_retrieve_direct(node);
 				break;
@@ -2618,6 +2635,91 @@ agg_retrieve_direct(AggState *aggstate)
 	return NULL;
 }
 
+static TupleTableSlot *
+agg_retrieve_direct_batch(AggState *aggstate)
+{
+	PlanState *child = outerPlanState(aggstate);
+	ExprContext *econtext = aggstate->ss.ps.ps_ExprContext;
+	ExprContext *tmpcontext = aggstate->tmpcontext;
+	const bool hasGroupingSets = aggstate->phase->numsets > 0;
+	TupleTableSlot *firstSlot = aggstate->ss.ss_ScanTupleSlot;
+	TupleBatch *b = NULL;
+
+	Assert(child->ExecProcNodeBatch);
+
+	/* mimic the first-tuple copy from agg_retrieve_direct() */
+	for (;;)
+	{
+		b = ExecProcNodeBatch(child);
+		if (b == NULL)
+		{
+			if (hasGroupingSets)
+			{
+				aggstate->input_done = true;
+				break;
+			}
+			aggstate->agg_done = true;
+			break;
+		}
+		if (b->nvalid == 0)
+			continue;
+
+		TupleBatchMaterializeAll(b);
+		aggstate->grp_firstTuple = ExecCopySlotHeapTuple(TupleBatchGetSlot(b, 0));
+		break;
+	}
+
+	/* initialize_aggregates etc. as in the direct path */
+	ReScanExprContext(econtext);
+	for (int i = 0; i < Max(aggstate->phase->numsets, 1); i++)
+		ReScanExprContext(aggstate->aggcontexts[i]);
+
+	initialize_aggregates(aggstate, aggstate->pergroups,
+						  Max(aggstate->phase->numsets, 1));
+
+	if (aggstate->grp_firstTuple)
+	{
+		ExecForceStoreHeapTuple(aggstate->grp_firstTuple, firstSlot, true);
+		aggstate->grp_firstTuple = NULL;
+		tmpcontext->ecxt_outertuple = firstSlot;
+
+		advance_aggregates_batch(aggstate, b);
+		ResetExprContext(tmpcontext);
+	}
+
+	/* consume remaining rows in current and subsequent batches */
+	if (b)
+	{
+		if (TupleBatchHasMore(b))
+			advance_aggregates_batch(aggstate, b);
+		for (;;)
+		{
+			b = ExecProcNodeBatch(child);
+			if (b == NULL)
+			{
+				if (hasGroupingSets)
+					aggstate->input_done = true;
+				else
+					aggstate->agg_done = true;
+				break;
+			}
+			if (b->nvalid == 0)
+				continue;
+
+			TupleBatchMaterializeAll(b);
+			advance_aggregates_batch(aggstate, b);
+		}
+	}
+
+	/* finalize and project like the direct path */
+	econtext->ecxt_outertuple = firstSlot;
+	prepare_projection_slot(aggstate, econtext->ecxt_outertuple, 0);
+	select_current_set(aggstate, 0, false);
+	finalize_aggregates(aggstate, aggstate->peragg, aggstate->pergroups[0]);
+
+	return project_aggregates(aggstate);
+}
+
 /*
  * ExecAgg for hashed case: read input and build hash table
  */
@@ -3265,6 +3367,22 @@ hashagg_reset_spill_state(AggState *aggstate)
 	}
 }
 
+static bool
+AggCanUsePlainBatch(AggState *aggstate)
+{
+	const Agg *aggnode = (const Agg *) aggstate->ss.ps.plan;
+
+	Assert(outerPlanState(aggstate));
+
+	/* grouping sets present -> bail */
+	if (aggnode->groupingSets != NIL)
+		return false;
+
+	if (aggstate->phase->aggstrategy != AGG_PLAIN)
+		return false;
+
+	return outerPlanState(aggstate)->ExecProcNodeBatch;
+}
 
 /* -----------------
  * ExecInitAgg
@@ -4060,6 +4178,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 				(errcode(ERRCODE_GROUPING_ERROR),
 				 errmsg("aggregate function calls cannot be nested")));
 
+	if (AggCanUsePlainBatch(aggstate))
+		aggstate->retrieve_plain = agg_retrieve_direct_batch;
+	else
+		aggstate->retrieve_plain = agg_retrieve_direct;
+
 	/*
 	 * Build expressions doing all the transition work at once. We build a
 	 * different one for each phase, as the number of transition function
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a104591ac20..9b81b842161 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2535,6 +2535,9 @@ typedef struct AggStatePerGroupData *AggStatePerGroup;
 typedef struct AggStatePerPhaseData *AggStatePerPhase;
 typedef struct AggStatePerHashData *AggStatePerHash;
 
+struct AggState;
+typedef TupleTableSlot *(*AggRetrievePlainFn)(struct AggState *);
+
 typedef struct AggState
 {
 	ScanState	ss;				/* its first field is NodeTag */
@@ -2610,6 +2613,8 @@ typedef struct AggState
 	AggStatePerGroup *all_pergroups;	/* array of first ->pergroups, than
 										 * ->hash_pergroup */
 	SharedAggInfo *shared_info; /* one entry per worker */
+
+	AggRetrievePlainFn retrieve_plain; /* init-time choice */
 } AggState;
 
 /* ----------------
-- 
2.43.0
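
The init-time choice between the row path and the batch path described in
the 0004 commit message can be sketched outside PostgreSQL roughly as
follows. Everything here is hypothetical (DemoChild, DemoAgg,
DEMO_BATCH_ROWS, a plain sum in place of transition functions); it only
illustrates selecting a retrieve function once, based on whether the child
exposes a batch callback, and then draining batches until the child
returns NULL.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BATCH_ROWS 4		/* hypothetical; the patch uses EXEC_BATCH_ROWS */

typedef struct DemoBatch
{
	int			nvalid;
	long		rows[DEMO_BATCH_ROWS];
} DemoBatch;

typedef struct DemoChild
{
	const long *data;
	int			ndata;
	int			pos;
	DemoBatch	batch;
	/* NULL means the child can only return one row at a time */
	DemoBatch  *(*next_batch) (struct DemoChild *);
} DemoChild;

static DemoBatch *
demo_child_next_batch(DemoChild *c)
{
	int			n = 0;

	if (c->pos >= c->ndata)
		return NULL;			/* end of input */
	while (n < DEMO_BATCH_ROWS && c->pos < c->ndata)
		c->batch.rows[n++] = c->data[c->pos++];
	c->batch.nvalid = n;
	return &c->batch;
}

static int
demo_child_next_row(DemoChild *c, long *out)
{
	if (c->pos >= c->ndata)
		return 0;
	*out = c->data[c->pos++];
	return 1;
}

typedef struct DemoAgg DemoAgg;
typedef long (*DemoRetrieveFn) (DemoAgg *);

struct DemoAgg
{
	DemoChild  *child;
	DemoRetrieveFn retrieve;	/* init-time choice */
};

/* Row path: one child call per input row. */
static long
demo_retrieve_rows(DemoAgg *agg)
{
	long		sum = 0;
	long		v;

	while (demo_child_next_row(agg->child, &v))
		sum += v;
	return sum;
}

/* Batch path: one child call per batch, then a local loop over its rows. */
static long
demo_retrieve_batch(DemoAgg *agg)
{
	long		sum = 0;
	DemoBatch  *b;

	assert(agg->child->next_batch != NULL);
	while ((b = agg->child->next_batch(agg->child)) != NULL)
	{
		if (b->nvalid == 0)
			continue;			/* skip empty batches */
		for (int i = 0; i < b->nvalid; i++)
			sum += b->rows[i];
	}
	return sum;
}

static void
demo_agg_init(DemoAgg *agg, DemoChild *child)
{
	agg->child = child;
	agg->retrieve = (child->next_batch != NULL) ?
		demo_retrieve_batch : demo_retrieve_rows;
}
```

Both paths produce the same result for the same input; the batch path only
changes how many times the child is called.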



  [application/octet-stream] v2-0001-Add-batch-table-AM-API-and-heapam-implementation.patch (13.7K, 7-v2-0001-Add-batch-table-AM-API-and-heapam-implementation.patch)
From 3318650e720a01cbd5948349b9fbcdbb8ddda7cf Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 21:56:17 +0900
Subject: [PATCH v2 1/8] Add batch table AM API and heapam implementation

Introduce new table AM callbacks to fetch multiple tuples per call.
This reduces per-tuple call overhead by letting executor nodes work
in batches.

Define a HeapBatch structure and supporting code in tableam.h.
Batches are limited to tuples from a single page and at most
EXEC_BATCH_ROWS (currently 64) entries.

Provide initial heapam support with heapgettup_pagemode_batch().
No executor node is switched over yet; a later commit will adapt
SeqScan to use this API. Other nodes may adopt it in the future.

Also add pgstat_count_heap_getnext_batch() to record batched fetches
in pgstat.
---
 src/backend/access/heap/heapam.c         | 212 ++++++++++++++++++++++-
 src/backend/access/heap/heapam_handler.c |   4 +
 src/include/access/heapam.h              |  21 +++
 src/include/access/tableam.h             |  58 +++++++
 src/include/pgstat.h                     |   5 +
 5 files changed, 299 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index ed0c0c2dc9f..f62f7edbf5e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1008,7 +1008,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 					int nkeys,
 					ScanKey key)
 {
-	HeapTuple	tuple = &(scan->rs_ctup);
+	HeapTuple tuple = &scan->rs_ctup;
 	Page		page;
 	uint32		lineindex;
 	uint32		linesleft;
@@ -1089,6 +1089,121 @@ continue_page:
 	scan->rs_inited = false;
 }
 
+/*
+ * heapgettup_pagemode_batch
+ *		Collect up to 'maxitems' visible tuples from a single page in page mode.
+ *
+ * This function returns a *batch* of tuples from one heap page. If the
+ * current page (as tracked by the scan desc) has no more tuples left,
+ * it will advance to the next page and prepare it (via heap_prepare_pagescan).
+ * It will not cross a page boundary while filling the batch.
+ *
+ * Return value:
+ *		number of tuples written into 'tdata' (0 at end-of-scan).
+ *
+ * Side effects:
+ *	- Ensures rs_cbuf pins the page from which tuples were produced.
+ *	- Sets rs_cblock, rs_cindex, rs_ntuples consistently (same as
+ *	  heapgettup_pagemode’s inner-loop effects).
+ *	- Does *not* change buffer pin counts except through normal page
+ *	  transitions performed by heap_fetch_next_buffer().
+ */
+static int
+heapgettup_pagemode_batch(HeapScanDesc scan,
+						  ScanDirection dir,
+						  int nkeys, ScanKey key,
+						  HeapTupleData *tdata,
+						  int maxitems)
+{
+	Page		page;
+	uint32		lineindex;
+	uint32		linesleft;
+	int			nout = 0;
+
+	Assert(ScanDirectionIsForward(dir));
+	Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
+	Assert(maxitems > 0);
+
+	/*
+	 * If we have no current page (or the current page is exhausted),
+	 * advance to the next page that has any visible tuples and prepare it.
+	 * This mirrors the outer loop of heapgettup_pagemode(), but we stop
+	 * as soon as we have a prepared page; we never produce from two pages.
+	 */
+	for (;;)
+	{
+		if (BufferIsValid(scan->rs_cbuf))
+		{
+			/* Are there more visible tuples left on this page? */
+			lineindex = scan->rs_cindex + dir;
+			if (ScanDirectionIsForward(dir))
+				linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
+					(scan->rs_ntuples - lineindex) : 0;
+			else
+				linesleft = scan->rs_cindex;
+			if (linesleft > 0)
+				break;	/* continue on this page */
+		}
+
+		/* Move to next page and prepare its visible tuple list. */
+		heap_fetch_next_buffer(scan, dir);
+
+		if (!BufferIsValid(scan->rs_cbuf))
+		{
+			/* end of scan; keep rs_cbuf invalid like heapgettup_pagemode */
+			scan->rs_cblock = InvalidBlockNumber;
+			scan->rs_prefetch_block = InvalidBlockNumber;
+			scan->rs_inited = false;
+			return 0;
+		}
+
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+		heap_prepare_pagescan((TableScanDesc) scan);
+
+		/* After prepare, either rs_ntuples > 0 or we'll loop again. */
+		if (scan->rs_ntuples > 0)
+		{
+			lineindex = ScanDirectionIsForward(dir) ? 0 : scan->rs_ntuples - 1;
+			linesleft = scan->rs_ntuples;
+			break;
+		}
+		/* else: page had no visible tuples; continue to next page */
+	}
+
+	/* From here on, we must only read tuples from this single page. */
+	page = BufferGetPage(scan->rs_cbuf);
+
+	/*
+	 * Walk rs_vistuples[] from 'lineindex', copying headers into tdata[]
+	 * until either the page is exhausted or the batch capacity is reached.
+	 */
+	for (; linesleft > 0 && nout < maxitems; linesleft--, lineindex += dir)
+	{
+		OffsetNumber	lineoff;
+		ItemId			lpp;
+		HeapTupleData *dst = &tdata[nout];
+
+		Assert(lineindex <= (uint32) scan->rs_ntuples);
+		lineoff = scan->rs_vistuples[lineindex];
+		lpp = PageGetItemId(page, lineoff);
+		Assert(ItemIdIsNormal(lpp));
+
+		dst->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+		dst->t_len  = ItemIdGetLength(lpp);
+		dst->t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+		ItemPointerSet(&(dst->t_self), scan->rs_cblock, lineoff);
+
+		if (key != NULL &&
+			!HeapKeyTest(dst, RelationGetDescr(scan->rs_base.rs_rd),
+						 nkeys, key))
+			continue;
+
+		scan->rs_cindex = lineindex;
+		nout++;
+	}
+
+	return nout;
+}
 
 /* ----------------------------------------------------------------
  *					 heap access method interface
@@ -1136,6 +1251,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_base.rs_parallel = parallel_scan;
 	scan->rs_strategy = NULL;	/* set in initscan */
 	scan->rs_cbuf = InvalidBuffer;
+	scan->rs_batch_ctup = NULL;
+	scan->rs_batch_cbuf = InvalidBuffer;
 
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1315,6 +1432,8 @@ heap_endscan(TableScanDesc sscan)
 	 */
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
+	if (BufferIsValid(scan->rs_batch_cbuf))
+		ReleaseBuffer(scan->rs_batch_cbuf);
 
 	/*
 	 * Must free the read stream before freeing the BufferAccessStrategy.
@@ -1421,6 +1540,97 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
 	return true;
 }
 
+/*---------- Batching support -----------*/
+
+/*
+ * heap_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+{
+	HeapBatch  *hb;
+	Oid			relid;
+
+	Assert(maxitems > 0);
+
+	hb = palloc(sizeof(HeapBatch));
+	hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
+	hb->maxitems = maxitems;
+	hb->nitems = 0;
+	hb->buf = InvalidBuffer;
+
+	/* Initialize static fields of HeapTupleData. Row bodies remain on page. */
+	relid = RelationGetRelid(sscan->rs_rd);
+	for (int i = 0; i < maxitems; i++)
+		hb->tupdata[i].t_tableOid = relid;
+
+	return hb;
+}
+
+/*
+ * heap_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
+{
+	HeapBatch *hb = (HeapBatch *) am_batch;
+
+	if (BufferIsValid(hb->buf))
+		ReleaseBuffer(hb->buf);
+
+	pfree(hb->tupdata);
+	pfree(hb);
+}
+
+int
+heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+	HeapScanDesc scan = (HeapScanDesc) sscan;
+	HeapBatch  *hb = (HeapBatch *) am_batch;
+	Buffer		curbuf;
+	int			n;
+
+	Assert(ScanDirectionIsForward(dir));
+	Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+	Assert(hb->maxitems > 0);
+
+	/* Drop prior batch pin, if any. */
+	if (BufferIsValid(hb->buf))
+	{
+		ReleaseBuffer(hb->buf);
+		hb->buf = InvalidBuffer;
+	}
+
+	hb->nitems = 0;
+
+	/* One call per batch, never crosses a page. */
+	n = heapgettup_pagemode_batch(scan, dir,
+								  sscan->rs_nkeys, sscan->rs_key,
+								  hb->tupdata, hb->maxitems);
+
+	if (n == 0)
+		return 0;	/* end of scan */
+
+	/* Hold a shared pin for the batch lifetime so t_data stays valid. */
+	curbuf = scan->rs_cbuf;
+	IncrBufferRefCount(curbuf);
+	hb->buf = curbuf;
+
+	/* Per-tuple stats (can be collapsed into a future _multi() call). */
+	pgstat_count_heap_getnext_batch(sscan->rs_rd, n);
+
+	hb->nitems = n;
+	return n;
+}
+
+/*----- End of batching support -----*/
+
 void
 heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 				  ItemPointer maxtid)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..ec4eeccf19c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2623,6 +2623,10 @@ static const TableAmRoutine heapam_methods = {
 	.scan_rescan = heap_rescan,
 	.scan_getnextslot = heap_getnextslot,
 
+	.scan_begin_batch = heap_begin_batch,
+	.scan_getnextbatch = heap_getnextbatch,
+	.scan_end_batch = heap_end_batch,
+
 	.scan_set_tidrange = heap_set_tidrange,
 	.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e60d34dad25..02f7793fba0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -74,6 +74,9 @@ typedef struct HeapScanDescData
 
 	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
 
+	HeapTupleData *rs_batch_ctup;	/* NULL when not using batched mode */
+	Buffer	rs_batch_cbuf;		/* buffer feeding the batch */
+
 	/* For scans that stream reads */
 	ReadStream *rs_read_stream;
 
@@ -101,6 +104,19 @@ typedef struct HeapScanDescData
 } HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
+/*
+ * HeapBatch -- stateless per-batch buffer. A batch pins one page and
+ * exposes up to maxitems HeapTupleData headers whose t_data point into that
+ * page.
+ */
+typedef struct HeapBatch
+{
+	HeapTupleData  *tupdata;	/* len = maxitems; headers only */
+	int				nitems;		/* tuples produced in last getnextbatch() */
+	int				maxitems;	/* fixed capacity set at begin_batch() */
+	Buffer			buf;		/* single pinned buffer for this batch */
+} HeapBatch;
+
 typedef struct BitmapHeapScanDescData
 {
 	HeapScanDescData rs_heap_base;
@@ -294,6 +310,11 @@ extern void heap_endscan(TableScanDesc sscan);
 extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
 extern bool heap_getnextslot(TableScanDesc sscan,
 							 ScanDirection direction, TupleTableSlot *slot);
+
+extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
+extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
+extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+
 extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 							  ItemPointer maxtid);
 extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..953207eac50 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -351,6 +351,16 @@ typedef struct TableAmRoutine
 									 ScanDirection direction,
 									 TupleTableSlot *slot);
 
+	/* ------------------------------------------------------------------------
+	 * Batched scan support
+	 * ------------------------------------------------------------------------
+	 */
+
+	void	   *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+	int			(*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+									 ScanDirection dir);
+	void		(*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
 	/*-----------
 	 * Optional functions to provide scanning for ranges of ItemPointers.
 	 * Implementations must either provide both of these functions, or neither
@@ -1036,6 +1046,54 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
+/*
+ * table_scan_begin_batch
+ *		Allocate AM-owned batch payload with capacity 'maxitems'.
+ */
+static inline void *
+table_scan_begin_batch(TableScanDesc sscan, int maxitems)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	Assert(tam->scan_begin_batch != NULL);
+
+	return tam->scan_begin_batch(sscan, maxitems);
+}
+
+/*
+ * table_scan_getnextbatch
+ *		Fill next batch from the AM. Returns number of tuples, 0 => EOS.
+ *		Batches are single-page in v1. Direction is forward only in v1.
+ */
+static inline int
+table_scan_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	/* Only forward scans are supported in the batched mode. */
+	Assert(dir == ForwardScanDirection);
+	Assert(tam->scan_getnextbatch != NULL);
+
+	return tam->scan_getnextbatch(sscan, am_batch, dir);
+}
+
+/*
+ * table_scan_end_batch
+ *		Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, void *am_batch)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	if (am_batch == NULL)
+		return;
+
+	Assert(tam->scan_end_batch != NULL);
+
+	tam->scan_end_batch(sscan, am_batch);
+}
+
 /* ----------------------------------------------------------------------------
  * TID Range scanning related functions.
  * ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index e4a59a30b8c..aaea9520b1d 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -687,6 +687,11 @@ extern void pgstat_report_analyze(Relation rel,
 		if (pgstat_should_count_relation(rel))						\
 			(rel)->pgstat_info->counts.tuples_returned++;			\
 	} while (0)
+#define pgstat_count_heap_getnext_batch(rel, n)						\
+	do {															\
+		if (pgstat_should_count_relation(rel))						\
+			(rel)->pgstat_info->counts.tuples_returned += (n);		\
+	} while (0)
 #define pgstat_count_heap_fetch(rel)								\
 	do {															\
 		if (pgstat_should_count_relation(rel))						\
-- 
2.43.0
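
The "single page, at most maxitems" rule from the 0001 commit message can
be illustrated with a small model that tracks only how many visible tuples
each page has. This is a sketch under simplifying assumptions, not
heapam code: DemoScan and demo_next_batch are hypothetical, there are no
buffers, pins, or scan keys, and only forward scans are modeled.

```c
#include <assert.h>

typedef struct DemoScan
{
	const int  *ntuples;		/* visible tuples on each page */
	int			npages;
	int			cur_page;		/* -1 before the first page */
	int			cur_index;		/* next unread tuple within cur_page */
} DemoScan;

/*
 * Return the size of the next batch (0 at end of scan) and report which
 * page it came from and the index of its first tuple.  A batch never
 * spans two pages, so it may be smaller than maxitems.
 */
static int
demo_next_batch(DemoScan *scan, int maxitems, int *page_out, int *first_out)
{
	int			n;

	assert(maxitems > 0);

	/* Advance until the current page has unread tuples; skip empty pages. */
	while (scan->cur_page < 0 ||
		   scan->cur_index >= scan->ntuples[scan->cur_page])
	{
		if (scan->cur_page + 1 >= scan->npages)
			return 0;			/* end of scan */
		scan->cur_page++;
		scan->cur_index = 0;
	}

	/* Take what is left on this page, capped at the batch capacity. */
	n = scan->ntuples[scan->cur_page] - scan->cur_index;
	if (n > maxitems)
		n = maxitems;

	*page_out = scan->cur_page;
	*first_out = scan->cur_index;
	scan->cur_index += n;
	return n;
}
```

With pages holding 3, 0, 5 and 2 visible tuples and a capacity of 4, this
model yields batches of 3, 4, 1 and 2 tuples: the empty page is skipped
and the 5-tuple page is split rather than topped up from the next page.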



  [application/octet-stream] v2-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch (9.0K, 8-v2-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch)
From 10d0df2676462f1931b2ef5072eed7129d936328 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 22:18:30 +0900
Subject: [PATCH v2 3/8] Executor: add ExecProcNodeBatch() and integrate
 SeqScan with batch API
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Introduce a batch-capable executor interface alongside the existing
slot-at-a-time path:

 * ExecProcNodeBatch() is added to return a TupleBatch instead of a
   TupleTableSlot. PlanState gains ExecProcNodeBatch as a function
   pointer.

Integrate SeqScan with this interface:

 * Add ExecSeqScanBatch* routines that drive heap via the batch table
   AM API and return a TupleBatch.
 * At init, set ps.ExecProcNodeBatch to these routines when
   ScanCanUseBatching() allows.
 * Retain ExecSeqScanBatchSlot* variants for slot-at-a-time consumers.

This builds on 0002, which introduced TupleBatch and made SeqScan
consume the AM’s batch API internally but still surface slots. With this
patch, SeqScan can surface batches directly to batch-aware upper nodes.

Plan shape and EXPLAIN output remain unchanged; only internal tuple flow
differs when batching is enabled and allowed.
---
 src/backend/executor/execProcnode.c | 52 +++++++++++++++++++++++++++++
 src/backend/executor/nodeSeqscan.c  | 35 +++++++++++++++++++
 src/include/executor/execScan.h     | 51 ++++++++++++++++++++++++++++
 src/include/executor/executor.h     | 10 ++++++
 src/include/nodes/execnodes.h       |  5 +++
 5 files changed, 153 insertions(+)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f5f9cfbeead..a8c0315e874 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,6 +121,8 @@
 
 static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
 static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
+static TupleBatch *ExecProcNodeBatchFirst(PlanState *node);
+static TupleBatch *ExecProcNodeBatchInstr(PlanState *node);
 static bool ExecShutdownNode_walker(PlanState *node, void *context);
 
 
@@ -389,6 +391,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 	}
 
 	ExecSetExecProcNode(result, result->ExecProcNode);
+	if (result->ExecProcNodeBatch)
+		ExecSetExecProcNodeBatch(result, result->ExecProcNodeBatch);
 
 	/*
 	 * Initialize any initPlans present in this node.  The planner put them in
@@ -489,6 +493,54 @@ ExecProcNodeInstr(PlanState *node)
 	return result;
 }
 
+/*
+ * ExecSetExecProcNodeBatch
+ *		Install ExecProcNodeBatch with first-call wrapper, mirroring row path.
+ */
+void
+ExecSetExecProcNodeBatch(PlanState *node, ExecProcNodeBatchMtd function)
+{
+	node->ExecProcNodeBatchReal = function;
+	node->ExecProcNodeBatch = ExecProcNodeBatchFirst;
+}
+
+/*
+ * ExecProcNodeBatchFirst
+ *		One-time stack-depth check; then pick instrument/no-instrument wrapper.
+ */
+static TupleBatch *
+ExecProcNodeBatchFirst(PlanState *node)
+{
+	check_stack_depth();
+
+	if (node->instrument)
+		node->ExecProcNodeBatch = ExecProcNodeBatchInstr;
+	else
+		node->ExecProcNodeBatch = node->ExecProcNodeBatchReal;
+
+	return node->ExecProcNodeBatch(node);
+}
+
+/*
+ * ExecProcNodeBatchInstr
+ *		Instrumentation wrapper for batch calls.
+ *
+ * Note: we can record nrows as the "tuple" count for this call. That keeps
+ * instrumentation meaningful without changing Instr API.
+ */
+static TupleBatch *
+ExecProcNodeBatchInstr(PlanState *node)
+{
+	TupleBatch *b;
+
+	InstrStartNode(node->instrument);
+
+	b = node->ExecProcNodeBatchReal(node);
+
+	InstrStopNode(node->instrument, b ? (double) b->nvalid : 0.0);
+
+	return b;
+}
 
 /* ----------------------------------------------------------------
  *		MultiExecProcNode
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 2552d420f1c..a4cf1e51af0 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -334,6 +334,37 @@ ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
 									 pstate->qual, pstate->ps_ProjInfo);
 }
 
+static TupleBatch *
+ExecSeqScanBatch(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatch(&node->ss,
+								 (ExecScanAccessBatchMtd) SeqNextBatch,
+								 NULL, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatch() for when qual evaluation is required.
+ */
+static TupleBatch *
+ExecSeqScanBatchWithQual(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatch(&node->ss,
+								 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+								 pstate->qual, NULL);
+}
+
 /* Batch SeqScan enablement and dispatch */
 static void
 SeqScanInitBatching(SeqScanState *scanstate, int eflags)
@@ -348,10 +379,12 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
 	{
 		if (scanstate->ss.ps.ps_ProjInfo == NULL)
 		{
+			scanstate->ss.ps.ExecProcNodeBatch = ExecSeqScanBatch;
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
 		}
 		else
 		{
+			scanstate->ss.ps.ExecProcNodeBatch = NULL;
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
 		}
 	}
@@ -359,10 +392,12 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
 	{
 		if (scanstate->ss.ps.ps_ProjInfo == NULL)
 		{
+			scanstate->ss.ps.ExecProcNodeBatch = ExecSeqScanBatchWithQual;
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
 		}
 		else
 		{
+			scanstate->ss.ps.ExecProcNodeBatch = NULL;
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
 		}
 	}
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index fec606471c8..fb4b57a831c 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -297,4 +297,55 @@ ExecScanExtendedBatchSlot(ScanState *node,
 	}
 }
 
+static inline TupleBatch *
+ExecScanExtendedBatch(ScanState *node,
+					  ExecScanAccessBatchMtd accessBatchMtd,
+					  ExprState *qual, ProjectionInfo *projInfo)
+{
+	ExprContext *econtext = node->ps.ps_ExprContext;
+	TupleBatch *b = node->ps.ps_Batch;
+	int			qualified;
+
+	/* Batch path does not support EPQ */
+	Assert(node->ps.state->es_epq_active == NULL);
+	Assert(TupleBatchIsValid(b));
+
+	for (;;)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* Get next batch from the AM */
+		if (!accessBatchMtd(node))
+			return NULL;
+
+		if (qual != NULL)
+		{
+			qualified = 0;
+			while (TupleBatchHasMore(b))
+			{
+				TupleTableSlot *in = TupleBatchGetNextSlot(b);
+
+				Assert(in);
+				ResetExprContext(econtext);
+				econtext->ecxt_scantuple = in;
+
+				if (ExecQual(qual, econtext))
+				{
+					TupleBatchStoreInOut(b, qualified, in);
+					qualified++;
+				}
+				else
+					InstrCountFiltered1(node, 1);
+			}
+			TupleBatchUseOutput(b, qualified);
+		}
+		else
+			qualified = b->nvalid;
+
+		if (qualified > 0)
+			return b;
+		/* else get the next batch from the AM */
+	}
+}
+
 #endif							/* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 17258f7ae2d..cf5b0c7e05c 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -294,6 +294,7 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
  */
 extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
 extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
+extern void ExecSetExecProcNodeBatch(PlanState *node, ExecProcNodeBatchMtd function);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern void ExecShutdownNode(PlanState *node);
@@ -315,6 +316,15 @@ ExecProcNode(PlanState *node)
 
 	return node->ExecProcNode(node);
 }
+
+static inline TupleBatch *
+ExecProcNodeBatch(PlanState *node)
+{
+	if (node->chgParam != NULL) /* something changed? */
+		ExecReScan(node);		/* let ReScan handle this */
+
+	return node->ExecProcNodeBatch(node);
+}
 #endif
 
 /*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f4bb8f7dd7f..a104591ac20 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1147,6 +1147,7 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
 /* Return a batch; may reuse caller-provided envelope. NULL => end of scan. */
 struct TupleBatch;
 typedef struct TupleBatch TupleBatch;
+typedef TupleBatch *(*ExecProcNodeBatchMtd)(struct PlanState *ps);
 
 /* ----------------
  *		PlanState node
@@ -1171,6 +1172,10 @@ typedef struct PlanState
 	ExecProcNodeMtd ExecProcNodeReal;	/* actual function, if above is a
 										 * wrapper */
 
+	/* Optional batch-producing entry point (NULL => no batching). */
+	ExecProcNodeBatchMtd ExecProcNodeBatch;
+	ExecProcNodeBatchMtd ExecProcNodeBatchReal;
+
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
 
-- 
2.43.0
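
Illustration only, not part of the patch: the "optional batch-producing entry point (NULL => no batching)" convention added to PlanState above can be modeled with a toy node that always has a row-at-a-time callback and may also have a batch callback. Every name below (ToyNode, ToyBatch, toy_sum, and so on) is hypothetical and none of it is PostgreSQL code; the sketch only shows a consumer choosing the batch path when the pointer is set and falling back otherwise.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH_ROWS 4

/* Toy stand-ins; these are NOT the PostgreSQL types with similar names. */
typedef struct ToyBatch
{
	int			nvalid;					/* number of usable rows */
	int			rows[TOY_BATCH_ROWS];
} ToyBatch;

typedef struct ToyNode ToyNode;
typedef int (*ToyRowFn) (ToyNode *node, int *row);	/* 0 => end of scan */
typedef ToyBatch *(*ToyBatchFn) (ToyNode *node);	/* NULL => end of scan */

struct ToyNode
{
	ToyRowFn	next_row;		/* always set */
	ToyBatchFn	next_batch;		/* optional; NULL => no batching */
	int			pos;			/* next value to produce */
	int			limit;			/* node produces 0 .. limit-1 */
	ToyBatch	batch;			/* envelope reused across calls */
};

/* Row-at-a-time producer. */
static int
toy_next_row(ToyNode *node, int *row)
{
	if (node->pos >= node->limit)
		return 0;
	*row = node->pos++;
	return 1;
}

/* Batch producer: fills the node's envelope with up to TOY_BATCH_ROWS rows. */
static ToyBatch *
toy_next_batch(ToyNode *node)
{
	ToyBatch   *b = &node->batch;

	b->nvalid = 0;
	while (b->nvalid < TOY_BATCH_ROWS && node->pos < node->limit)
		b->rows[b->nvalid++] = node->pos++;
	return b->nvalid > 0 ? b : NULL;
}

/* Consumer: use the batch entry point if present, else fall back to rows. */
static long
toy_sum(ToyNode *node)
{
	long		sum = 0;

	assert(node->next_row != NULL);

	if (node->next_batch != NULL)
	{
		ToyBatch   *b;

		while ((b = node->next_batch(node)) != NULL)
			for (int i = 0; i < b->nvalid; i++)
				sum += b->rows[i];
	}
	else
	{
		int			row;

		while (node->next_row(node, &row))
			sum += row;
	}
	return sum;
}
```

Both paths must produce the same result; the only difference is how many times the consumer crosses into the producer.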



  [application/octet-stream] v2-0002-SeqScan-add-batch-driven-variants-returning-slots.patch (27.2K, 9-v2-0002-SeqScan-add-batch-driven-variants-returning-slots.patch)
From 6a43a40037e4b656739743b3c0abdfb73a8f9b92 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 21:59:56 +0900
Subject: [PATCH v2 2/8] SeqScan: add batch-driven variants returning slots

Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per-tuple AM crossings without
changing the node interface seen by parents.

Add TupleBatch and supporting code in execBatch.c/h to hold executor
side batching state. PlanState gains ps_Batch to carry the active
TupleBatch when a node supports batching.

Wire up runtime selection in ExecInitSeqScan using
ScanCanUseBatching(). When executor_batching is enabled, EPQ is
inactive, the scan is not backward, and the relation supports
batching, ps.ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used.

Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled and allowed.

Notes / current limits:

- Batching uses EXEC_BATCH_ROWS (currently 64) as the target capacity.
- With the current heapam, batches are composed from a single page, so
  the batch may not always be full. Future work may let SeqScan and/or
  AMs top up batches across pages when safe to do so.
---
 src/backend/access/heap/heapam.c          |  29 ++++
 src/backend/access/heap/heapam_handler.c  |  15 ++
 src/backend/access/table/tableam.c        |  11 ++
 src/backend/executor/Makefile             |   1 +
 src/backend/executor/execBatch.c          | 117 ++++++++++++++
 src/backend/executor/execScan.c           |  31 ++++
 src/backend/executor/meson.build          |   1 +
 src/backend/executor/nodeSeqscan.c        | 176 +++++++++++++++++++++-
 src/backend/utils/init/globals.c          |   3 +
 src/backend/utils/misc/guc_parameters.dat |   7 +
 src/include/access/heapam.h               |   1 +
 src/include/access/tableam.h              |  27 ++++
 src/include/executor/execBatch.h          | 102 +++++++++++++
 src/include/executor/execScan.h           |  54 +++++++
 src/include/executor/executor.h           |   4 +
 src/include/miscadmin.h                   |   1 +
 src/include/nodes/execnodes.h             |   8 +
 17 files changed, 587 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/executor/execBatch.c
 create mode 100644 src/include/executor/execBatch.h

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index f62f7edbf5e..9fd7948482d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1570,6 +1570,35 @@ heap_begin_batch(TableScanDesc sscan, int maxitems)
 	return hb;
 }
 
+/*
+ * heap_materialize_batch_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
+{
+	HeapBatch *hb = (HeapBatch *) am_batch;
+
+	Assert(n <= hb->nitems);
+
+	for (int i = 0; i < n; i++)
+	{
+		HeapTupleData *tuple = &hb->tupdata[i];
+		HeapTupleTableSlot *slot = (HeapTupleTableSlot *) slots[i];
+
+		/* Inline of ExecStoreHeapTuple(tuple, slot, false) */
+		slot->tuple = tuple;
+		slot->off = 0;
+		slot->base.tts_nvalid = 0;
+		slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
+		slot->base.tts_tid = tuple->t_self;
+		slot->base.tts_tableOid = tuple->t_tableOid;
+		/* shouldFree is false, so TTS_FLAG_SHOULDFREE stays cleared */
+	}
+}
+
 /*
  * heap_scan_end_batch
  *
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ec4eeccf19c..8e88cc9e8f1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -72,6 +72,20 @@ heapam_slot_callbacks(Relation relation)
 	return &TTSOpsBufferHeapTuple;
 }
 
+/* ------------------------------------------------------------------------
+ * TupleBatch related callbacks for heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static const TupleBatchOps TupleBatchHeapOps = {
+	.materialize_all = heap_materialize_batch_all
+};
+
+static const TupleBatchOps *
+heapam_batch_callbacks(Relation relation)
+{
+	return &TupleBatchHeapOps;
+}
 
 /* ------------------------------------------------------------------------
  * Index Scan Callbacks for heap AM
@@ -2617,6 +2631,7 @@ static const TableAmRoutine heapam_methods = {
 	.type = T_TableAmRoutine,
 
 	.slot_callbacks = heapam_slot_callbacks,
+	.batch_callbacks = heapam_batch_callbacks,
 
 	.scan_begin = heap_beginscan,
 	.scan_end = heap_endscan,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5e41404937e..5a8ebb8b97c 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -103,6 +103,17 @@ table_slot_create(Relation relation, List **reglist)
 	return slot;
 }
 
+/* ----------------------------------------------------------------------------
+ * TupleBatch support routines
+ * ----------------------------------------------------------------------------
+ */
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+	if (relation->rd_tableam)
+		return relation->rd_tableam->batch_callbacks(relation);
+	elog(ERROR, "relation does not support TupleBatch operations");
+}
 
 /* ----------------------------------------------------------------------------
  * Table scan functions.
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..3e72f3fe03c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	execAmi.o \
 	execAsync.o \
+	execBatch.o \
 	execCurrent.o \
 	execExpr.o \
 	execExprInterp.o \
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
new file mode 100644
index 00000000000..007ae535687
--- /dev/null
+++ b/src/backend/executor/execBatch.c
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.c
+ *		Helpers for TupleBatch
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "executor/execBatch.h"
+
+/*
+ * TupleBatchCreate
+ *		Allocate and initialize a new TupleBatch envelope.
+ */
+TupleBatch *
+TupleBatchCreate(TupleDesc scandesc, int capacity)
+{
+	TupleBatch  *b;
+	TupleTableSlot **inslots,
+				   **outslots;
+
+	inslots = palloc(sizeof(TupleTableSlot *) * capacity);
+	outslots = palloc(sizeof(TupleTableSlot *) * capacity);
+	for (int i = 0; i < capacity; i++)
+		inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);
+
+	b = (TupleBatch *) palloc0(sizeof(TupleBatch));
+
+	/* Initial state: empty envelope (palloc0 zeroes any field not set here) */
+	b->am_payload = NULL;
+	b->ops = NULL;
+	b->ntuples = 0;
+	b->inslots = inslots;
+	b->outslots = outslots;
+	b->activeslots = NULL;
+	b->maxslots = capacity;
+
+	b->nvalid = 0;
+	b->next = 0;
+
+	return b;
+}
+
+/*
+ * TupleBatchReset
+ *		Reset an existing TupleBatch envelope to empty.
+ */
+void
+TupleBatchReset(TupleBatch *b, bool drop_slots)
+{
+	if (b == NULL)
+		return;
+
+	for (int i = 0; i < b->maxslots; i++)
+	{
+		ExecClearTuple(b->inslots[i]);
+		if (drop_slots)
+			ExecDropSingleTupleTableSlot(b->inslots[i]);
+	}
+
+	if (drop_slots)
+	{
+		pfree(b->inslots);
+		pfree(b->outslots);
+		b->inslots = b->outslots = NULL;
+	}
+
+	b->ntuples = 0;
+	b->nvalid = 0;
+	b->next = 0;
+	b->activeslots = NULL;
+}
+
+void
+TupleBatchUseInput(TupleBatch *b, int nvalid)
+{
+	b->materialized = true;
+	b->activeslots = b->inslots;
+	b->nvalid = nvalid;
+	b->next = 0;
+}
+
+void
+TupleBatchUseOutput(TupleBatch *b, int nvalid)
+{
+	b->materialized = true;
+	b->activeslots = b->outslots;
+	b->nvalid = nvalid;
+	b->next = 0;
+}
+
+bool
+TupleBatchIsValid(TupleBatch *b)
+{
+	return	b != NULL &&
+			b->maxslots > 0 &&
+			b->inslots != NULL &&
+			b->outslots != NULL;
+}
+
+void
+TupleBatchRewind(TupleBatch *b)
+{
+	b->next = 0;
+}
+
+int
+TupleBatchGetNumValid(TupleBatch *b)
+{
+	return b->nvalid;
+}
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 90726949a87..f24c5d73ae1 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -18,6 +18,7 @@
  */
 #include "postgres.h"
 
+#include "access/tableam.h"
 #include "executor/executor.h"
 #include "executor/execScan.h"
 #include "miscadmin.h"
@@ -154,3 +155,33 @@ ExecScanReScan(ScanState *node)
 		}
 	}
 }
+
+bool
+ScanCanUseBatching(ScanState *scanstate, int eflags)
+{
+	Relation	relation = scanstate->ss_currentRelation;
+
+	return	executor_batching &&
+			(scanstate->ps.state->es_epq_active == NULL) &&
+			!(eflags & EXEC_FLAG_BACKWARD) &&
+			relation && table_supports_batching(relation);
+}
+
+void
+ScanResetBatching(ScanState *scanstate, bool drop)
+{
+	TupleBatch *b = scanstate->ps.ps_Batch;
+
+	if (b)
+	{
+		TupleBatchReset(b, drop);
+		if (b->am_payload)
+		{
+			table_scan_end_batch(scanstate->ss_currentScanDesc,
+								 b->am_payload);
+			b->am_payload = NULL;
+		}
+		if (drop)
+			pfree(b);
+	}
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..40ffc28f3cb 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
 backend_sources += files(
   'execAmi.c',
   'execAsync.c',
+  'execBatch.c',
   'execCurrent.c',
   'execExpr.c',
   'execExprInterp.c',
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 94047d29430..2552d420f1c 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -203,6 +203,171 @@ ExecSeqScanEPQ(PlanState *pstate)
 					(ExecScanRecheckMtd) SeqRecheck);
 }
 
+/* ----------------------------------------------------------------
+ *						Batch Support
+ * ----------------------------------------------------------------
+ */
+static inline bool
+SeqNextBatch(SeqScanState *node)
+{
+	TableScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+
+	Assert(node->ss.ps.ps_Batch != NULL);
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	Assert(direction == ForwardScanDirection);
+
+	if (scandesc == NULL)
+	{
+		/*
+		 * We reach here if the scan is not parallel, or if we're serially
+		 * executing a scan that was planned to be parallel.
+		 */
+		scandesc = table_beginscan(node->ss.ss_currentRelation,
+								   estate->es_snapshot,
+								   0, NULL);
+		node->ss.ss_currentScanDesc = scandesc;
+	}
+
+	/* Lazily create the AM batch payload. */
+	if (node->ss.ps.ps_Batch->am_payload == NULL)
+	{
+		const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+		Assert(tam && tam->scan_begin_batch);
+		node->ss.ps.ps_Batch->am_payload =
+			table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+		node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+	}
+
+	node->ss.ps.ps_Batch->ntuples =
+		table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
+	node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
+	node->ss.ps.ps_Batch->materialized = false;
+
+	return node->ss.ps.ps_Batch->ntuples > 0;
+}
+
+static inline bool
+SeqNextBatchMaterialize(SeqScanState *node)
+{
+	if (SeqNextBatch(node))
+	{
+		TupleBatchMaterializeAll(node->ss.ps.ps_Batch);
+		return true;
+	}
+
+	return false;
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	/*
+	 * Use pg_assume() for != NULL tests to make the compiler realize no
+	 * runtime check for the field is needed in ExecScanExtendedBatchSlot().
+	 */
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScan() but when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	pg_assume(pstate->ps_ProjInfo != NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScan() but when qual evaluation and projection are
+ * required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	pg_assume(pstate->ps_ProjInfo != NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 pstate->qual, pstate->ps_ProjInfo);
+}
+
+/* Batch SeqScan enablement and dispatch */
+static void
+SeqScanInitBatching(SeqScanState *scanstate, int eflags)
+{
+	const int cap = EXEC_BATCH_ROWS;
+	TupleDesc	scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+
+	scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+
+	/* Choose batch variant to preserve the existing specialization matrix */
+	if (scanstate->ss.ps.qual == NULL)
+	{
+		if (scanstate->ss.ps.ps_ProjInfo == NULL)
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+		}
+		else
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+		}
+	}
+	else
+	{
+		if (scanstate->ss.ps.ps_ProjInfo == NULL)
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+		}
+		else
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+		}
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecInitSeqScan
  * ----------------------------------------------------------------
@@ -211,6 +376,7 @@ SeqScanState *
 ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 {
 	SeqScanState *scanstate;
+	bool	use_batching;
 
 	/*
 	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -241,9 +407,12 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 							 node->scan.scanrelid,
 							 eflags);
 
+	use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
+
 	/* and create slot with the appropriate rowtype */
 	ExecInitScanTupleSlot(estate, &scanstate->ss,
 						  RelationGetDescr(scanstate->ss.ss_currentRelation),
+						  use_batching ? &TTSOpsHeapTuple :
 						  table_slot_callbacks(scanstate->ss.ss_currentRelation));
 
 	/*
@@ -280,6 +449,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
 	}
 
+	if (use_batching)
+		SeqScanInitBatching(scanstate, eflags);
+
 	return scanstate;
 }
 
@@ -299,6 +471,8 @@ ExecEndSeqScan(SeqScanState *node)
 	 */
 	scanDesc = node->ss.ss_currentScanDesc;
 
+	ScanResetBatching(&node->ss, true);
+
 	/*
 	 * close heap scan
 	 */
@@ -327,7 +501,7 @@ ExecReScanSeqScan(SeqScanState *node)
 	if (scan != NULL)
 		table_rescan(scan,		/* scan desc */
 					 NULL);		/* new scan keys */
-
+	ScanResetBatching(&node->ss, false);
 	ExecScanReScan((ScanState *) node);
 }
 
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..b4a0996a717 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int			notify_buffers = 16;
 int			serializable_buffers = 32;
 int			subtransaction_buffers = 0;
 int			transaction_buffers = 0;
+
+/* executor batching */
+bool		executor_batching = true;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 6bc6be13d2a..c9fbb7ffef9 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -880,6 +880,13 @@
   boot_val => 'true',
 },
 
+{ name => 'executor_batching', type => 'bool', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Use tuple batching during execution.',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'executor_batching',
+  boot_val => 'true',
+},
+
 { name => 'data_sync_retry', type => 'bool', context => 'PGC_POSTMASTER', group => 'ERROR_HANDLING_OPTIONS',
   short_desc => 'Whether to continue running after a failure to sync data files.',
   variable => 'data_sync_retry',
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 02f7793fba0..13ce6166ec3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -314,6 +314,7 @@ extern bool heap_getnextslot(TableScanDesc sscan,
 extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
 extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
 extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+extern void heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n);
 
 extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 							  ItemPointer maxtid);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 953207eac50..05f828b9762 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "access/xact.h"
 #include "commands/vacuum.h"
+#include "executor/execBatch.h"
 #include "executor/tuptable.h"
 #include "storage/read_stream.h"
 #include "utils/rel.h"
@@ -39,6 +40,7 @@ typedef struct BulkInsertStateData BulkInsertStateData;
 typedef struct IndexInfo IndexInfo;
 typedef struct SampleScanState SampleScanState;
 typedef struct ValidateIndexState ValidateIndexState;
+typedef struct TupleBatchOps TupleBatchOps;
 
 /*
  * Bitmask values for the flags argument to the scan_begin callback.
@@ -301,6 +303,7 @@ typedef struct TableAmRoutine
 	 * Return slot implementation suitable for storing a tuple of this AM.
 	 */
 	const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
+	const TupleBatchOps *(*batch_callbacks)(Relation rel);
 
 
 	/* ------------------------------------------------------------------------
@@ -361,6 +364,7 @@ typedef struct TableAmRoutine
 									 ScanDirection dir);
 	void		(*scan_end_batch)(TableScanDesc sscan, void *am_batch);
 
+
 	/*-----------
 	 * Optional functions to provide scanning for ranges of ItemPointers.
 	 * Implementations must either provide both of these functions, or neither
@@ -872,6 +876,16 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
  */
 extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
 
+/* ----------------------------------------------------------------------------
+ * TupleBatch functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Returns callbacks for manipulating TupleBatch for tuples of the given
+ * relation.
+ */
+extern const TupleBatchOps *table_batch_callbacks(Relation relation);
 
 /* ----------------------------------------------------------------------------
  * Table scan functions.
@@ -1046,6 +1060,18 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
+/*
+ * table_supports_batching
+ *		Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation)
+{
+	const TableAmRoutine *tam = relation->rd_tableam;
+
+	return tam->scan_getnextbatch != NULL;
+}
+
 /*
  * table_scan_begin_batch
  *		Allocate AM-owned batch payload with capacity 'maxitems'.
@@ -2116,5 +2142,6 @@ extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
  */
 
 extern const TableAmRoutine *GetHeapamTableAmRoutine(void);
+extern struct TupleBatchOps *GetHeapamTupleBatchOps(void);
 
 #endif							/* TABLEAM_H */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
new file mode 100644
index 00000000000..6f1a38d14bd
--- /dev/null
+++ b/src/include/executor/execBatch.h
@@ -0,0 +1,102 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.h
+ *		Executor batch envelope for passing tuple batch state upward
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/include/executor/execBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECBATCH_H
+#define EXECBATCH_H
+
+#include "executor/tuptable.h"
+
+/* XXX fixed 64 for PoC */
+#define	EXEC_BATCH_ROWS		64
+
+/*
+ * TupleBatchOps -- AM-specific helpers for lazy materialization.
+ */
+typedef struct TupleBatchOps
+{
+	void (*materialize_all)(void *am_payload,
+							TupleTableSlot **dst,
+							int maxslots);
+} TupleBatchOps;
+
+/*
+ * TupleBatch
+ *
+ * Envelope for a batch of tuples produced by a plan node (e.g., SeqScan) per
+ * call to a batch variant of ExecSeqScan().
+ */
+typedef struct TupleBatch
+{
+	void	   *am_payload;
+	const TupleBatchOps *ops;
+	int			ntuples;				/* number of tuples in am_payload */
+	bool		materialized;		 /* tuples in slots valid? */
+	struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
+	struct TupleTableSlot **outslots; /* slots for tuples going "out of"
+									   * batch */
+	struct TupleTableSlot **activeslots;
+	int			maxslots;
+
+	int		nvalid;		/* number of returnable tuples in activeslots */
+	int		next;		/* 0-based index of next tuple to be returned */
+} TupleBatch;
+
+
+/* Helpers */
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
+extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
+extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
+extern bool TupleBatchIsValid(TupleBatch *b);
+extern void TupleBatchRewind(TupleBatch *b);
+extern int TupleBatchGetNumValid(TupleBatch *b);
+
+static inline TupleTableSlot *
+TupleBatchGetNextSlot(TupleBatch *b)
+{
+	return b->next < b->nvalid ? b->activeslots[b->next++] : NULL;
+}
+
+static inline TupleTableSlot *
+TupleBatchGetSlot(TupleBatch *b, int index)
+{
+	Assert(index < b->nvalid);
+	return b->activeslots[index];
+}
+
+static inline void
+TupleBatchStoreInOut(TupleBatch *b, int index, TupleTableSlot *out)
+{
+	Assert(TupleBatchIsValid(b));
+	b->outslots[index] = out;
+}
+
+static inline bool
+TupleBatchHasMore(TupleBatch *b)
+{
+	return b->activeslots && b->next < b->nvalid;
+}
+
+static inline void
+TupleBatchMaterializeAll(TupleBatch *b)
+{
+	if (b->materialized)
+		return;
+
+	if (b->ops == NULL || b->ops->materialize_all == NULL)
+		elog(ERROR, "TupleBatch has no slots and no materialize_all op");
+
+	b->ops->materialize_all(b->am_payload, b->inslots, b->ntuples);
+	TupleBatchUseInput(b, b->ntuples);
+}
+
+#endif	/* EXECBATCH_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index 837ea7785bb..fec606471c8 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -243,4 +243,58 @@ ExecScanExtended(ScanState *node,
 	}
 }
 
+static inline TupleTableSlot *
+ExecScanExtendedBatchSlot(ScanState *node,
+						  ExecScanAccessBatchMtd accessBatchMtd,
+						  ExprState *qual, ProjectionInfo *projInfo)
+{
+	ExprContext *econtext = node->ps.ps_ExprContext;
+	TupleBatch *b = node->ps.ps_Batch;
+
+	/* Batch path does not support EPQ */
+	Assert(node->ps.state->es_epq_active == NULL);
+	Assert(TupleBatchIsValid(b));
+
+	for (;;)
+	{
+		TupleTableSlot *in;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Get next input slot from current batch, or refill */
+		if (!TupleBatchHasMore(b))
+		{
+			if (!accessBatchMtd(node))
+				return NULL;
+		}
+
+		in = TupleBatchGetNextSlot(b);
+		Assert(in);
+
+		/* No qual, no projection: direct return */
+		if (qual == NULL && projInfo == NULL)
+			return in;
+
+		ResetExprContext(econtext);
+		econtext->ecxt_scantuple = in;
+
+		/* Qual only */
+		if (projInfo == NULL)
+		{
+			if (qual == NULL || ExecQual(qual, econtext))
+				return in;
+			else
+				InstrCountFiltered1(node, 1);
+			continue;
+		}
+
+		/* Projection (with or without qual) */
+		if (qual == NULL || ExecQual(qual, econtext))
+			return ExecProject(projInfo);
+		else
+			InstrCountFiltered1(node, 1);
+		/* else try next tuple */
+	}
+}
+
 #endif							/* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 3248e78cd28..17258f7ae2d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -575,12 +575,16 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
  */
 typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
 typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanAccessBatchMtd)(ScanState *node);
 
 extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
 								ExecScanRecheckMtd recheckMtd);
+
 extern void ExecAssignScanProjectionInfo(ScanState *node);
 extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
 extern void ExecScanReScan(ScanState *node);
+extern bool ScanCanUseBatching(ScanState *scanstate, int eflags);
+extern void ScanResetBatching(ScanState *scanstate, bool drop);
 
 /*
  * prototypes from functions in execTuples.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..b8e7afda57c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -287,6 +287,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
 extern PGDLLIMPORT int VacuumCostBalance;
 extern PGDLLIMPORT bool VacuumCostActive;
 
+extern PGDLLIMPORT bool executor_batching;
 
 /* in utils/misc/stack_depth.c */
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a36653c37f9..f4bb8f7dd7f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -30,6 +30,7 @@
 #define EXECNODES_H
 
 #include "access/tupconvert.h"
+#include "executor/execBatch.h"
 #include "executor/instrument.h"
 #include "fmgr.h"
 #include "lib/ilist.h"
@@ -1143,6 +1144,10 @@ typedef struct JsonExprState
  */
 typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
 
+/* Return a batch; may reuse caller-provided envelope. NULL => end of scan. */
+struct TupleBatch;
+typedef struct TupleBatch TupleBatch;
+
 /* ----------------
  *		PlanState node
  *
@@ -1198,6 +1203,9 @@ typedef struct PlanState
 	ExprContext *ps_ExprContext;	/* node's expression-evaluation context */
 	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */
 
+	/* Batching state if node supports it. */
+	TupleBatch *ps_Batch;
+
 	bool		async_capable;	/* true if node is async-capable */
 
 	/*
-- 
2.43.0




* Re: Batching in executor
@ 2025-09-30 02:15  Amit Langote <[email protected]>
  parent: Bruce Momjian <[email protected]>
  0 siblings, 0 replies; 22+ messages in thread

From: Amit Langote @ 2025-09-30 02:15 UTC (permalink / raw)
  To: Bruce Momjian <[email protected]>; +Cc: pgsql-hackers

Hi Bruce,

On Fri, Sep 26, 2025 at 10:49 PM Bruce Momjian <[email protected]> wrote:
> On Fri, Sep 26, 2025 at 10:28:33PM +0900, Amit Langote wrote:
> > At PGConf.dev this year we had an unconference session [1] on whether
> > the community can support an additional batch executor. The discussion
> > there led me to start hacking on $subject. I have also had off-list
> > discussions on this topic in recent months with Andres and David, who
> > have offered useful thoughts.
> >
> > This patch series is an early attempt to make executor nodes pass
> > around batches of tuples instead of tuple-at-a-time slots. The main
> > motivation is to enable expression evaluation in batch form, which can
> > substantially reduce per-tuple overhead (mainly from function calls)
> > and open the door to further optimizations such as SIMD usage in
> > aggregate transition functions. We could even change algorithms of
> > some plan nodes to operate on batches when, for example, a child node
> > can return batches.
>
> For background, people might want to watch these two videos from POSETTE
> 2025.  The first video explains how data warehouse query needs are
> different from OLTP needs:
>
>         Building a PostgreSQL data warehouse
>         https://www.youtube.com/watch?v=tpq4nfEoioE
>
> and the second one explains the executor optimizations done in PG 18:
>
>         Hacking Postgres Executor For Performance
>         https://www.youtube.com/watch?v=D3Ye9UlcR5Y
>
> I learned from these two videos that to handle new workloads, I need to
> think of the query demands differently, and of course can this be
> accomplished without hampering OLTP workloads?

Thanks for pointing to those talks -- I gave the second one. :-)

Yes, the idea here is to introduce batching without adding much
overhead or new code into the OLTP path.

-- 
Thanks, Amit Langote






* Re: Batching in executor
@ 2025-09-30 13:35  Amit Langote <[email protected]>
  parent: Amit Langote <[email protected]>
  0 siblings, 0 replies; 22+ messages in thread

From: Amit Langote @ 2025-09-30 13:35 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers

On Tue, Sep 30, 2025 at 11:11 AM Amit Langote <[email protected]> wrote:
> Hi Tomas,
>
> Thanks a lot for your comments and benchmarking.
>
> I plan to reply to your detailed comments and benchmark results

For now, I reran a few benchmarks with the master branch as an
explicit baseline, since Tomas reported possible regressions with
executor_batching=off. I can reproduce that on my side:

5 aggregates, no where:
select avg(a), avg(b), avg(c), avg(d), avg(e) from bar;

parallel_workers=0, jit=off
Rows    master    batching off    batching on    master vs off    master vs on
1M      47.118    48.545          39.531         +3.0%            -16.1%
2M      95.098    97.241          80.189         +2.3%            -15.7%
3M      141.821   148.540         122.005        +4.7%            -14.0%
4M      188.969   197.056         163.779        +4.3%            -13.3%
5M      240.113   245.902         213.645        +2.4%            -11.0%
10M     556.738   564.120         486.359        +1.3%            -12.6%

parallel_workers=2, jit=on
Rows    master    batching off    batching on    master vs off    master vs on
1M      21.147    22.278          20.737         +5.3%            -1.9%
2M      40.319    41.509          37.851         +3.0%            -6.1%
3M      61.582    63.026          55.927         +2.3%            -9.2%
4M      96.363    95.245          78.494         -1.2%            -18.5%
5M      117.226   117.649         97.968         +0.4%            -16.4%
10M     245.503   246.896         196.335        +0.6%            -20.0%

1 aggregate, no where:
select count(*) from bar;

parallel_workers=0, jit=off
Rows    master    batching off    batching on    master vs off    master vs on
1M      17.071    20.135          6.698          +17.9%           -60.8%
2M      36.905    41.522          15.188         +12.5%           -58.9%
3M      56.094    63.110          23.485         +12.5%           -58.1%
4M      74.299    83.912          32.950         +12.9%           -55.7%
5M      94.229    108.621         41.338         +15.2%           -56.1%
10M     234.425   261.490         117.833        +11.6%           -49.7%

parallel_workers=2, jit=on
Rows    master    batching off    batching on    master vs off    master vs on
1M      8.820     9.832           5.324          +11.5%           -39.6%
2M      16.368    18.001          9.526          +10.0%           -41.8%
3M      24.810    28.193          14.482         +13.6%           -41.6%
4M      34.369    35.741          23.212         +4.0%            -32.5%
5M      41.595    45.103          27.918         +8.4%            -32.9%
10M     99.494    112.226         94.081         +12.8%           -5.4%

The regression with batching off is more noticeable in the single
aggregate case, where scanning accounts for a larger share of the
runtime.

Looking into it.

-- 
Thanks, Amit Langote






* Re: Batching in executor
@ 2025-10-10 06:40  Amit Langote <[email protected]>
  parent: Tomas Vondra <[email protected]>
  3 siblings, 0 replies; 22+ messages in thread

From: Amit Langote @ 2025-10-10 06:40 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers

Hi,

On Mon, Sep 29, 2025 at 8:01 PM Tomas Vondra <[email protected]> wrote:
> I also tried running TPC-H. I don't have useful numbers yet, but I ran
> into a segfault - see the attached backtrace. It only happens with the
> batching, and only on Q22 for some reason. I initially thought it's a
> bug in clang, because I saw it with clang-22 built from git, and not
> with clang-14 or gcc. But since then I reproduced it with clang-19 (on
> debian 13). Still could be a clang bug, of course. I've seen ~20 of
> those segfaults so far, and the backtraces look exactly the same.

I can reproduce the Q22 segfault with clang-17 on macOS and the
attached patch 0009 fixes it.

The issue I observed is that two EEOPs both called the same helper,
and that helper re-peeked the opcode with ExecEvalStepOp() to choose
its path. In this particular build the two EEOP cases in
ExecInterpExpr() compiled to identical code, so their dispatch labels
had the same address (reverse_dispatch_table logging in
ExecInitInterpreter() showed the duplicate). Because ExecEvalStepOp()
maps by label address, the reverse lookup could yield the other EEOP:
I saw expression initialization select the ROWLOOP EEOP while the
helper, at execution time, observed the DIRECT EEOP and ran the code
for it, which then crashed.

In 0009 (the fix), I split the helper into two functions, one per
EEOP, so the helper does not re-derive the opcode; with that change I
cannot reproduce the crash on macOS clang-17.

-- 
Thanks, Amit Langote


Attachments:

  [application/octet-stream] v3-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch (21.5K, 2-v3-0006-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP.patch)
  download | inline diff:
From 20a99f908e6dc9499ba927b1321918cff306aca7 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 2 Sep 2025 23:46:34 +0900
Subject: [PATCH v3 6/9] WIP: Add EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP

Introduce a batch EEOP that runs plain aggregate transitions by
looping over rows of a TupleBatch. This keeps transition logic in
the interpreter while amortizing per-row costs.

Gate with AggTransCanUseBatch(): plain, non-hashed, single-set
aggregates with no DISTINCT/ORDER/FILTER, and simple Var args.

Extend ExecBuildAggTrans() to prepare batch fetch/build steps and
to return whether a batch path is used.
---
 src/backend/executor/execExpr.c       | 228 ++++++++++++++++++++++++--
 src/backend/executor/execExprInterp.c | 103 ++++++++++++
 src/backend/executor/nodeAgg.c        |  17 +-
 src/backend/jit/llvm/llvmjit_expr.c   |   6 +
 src/backend/jit/llvm/llvmjit_types.c  |   1 +
 src/include/executor/execBatch.h      |   6 +
 src/include/executor/execExpr.h       |  14 ++
 src/include/executor/executor.h       |   3 +-
 src/include/executor/nodeAgg.h        |   2 +
 9 files changed, 363 insertions(+), 17 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index f1569879b52..af5ed8b6368 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -95,7 +95,9 @@ static void ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 								  ExprEvalStep *scratch,
 								  FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
 								  int transno, int setno, int setoff, bool ishash,
-								  bool nullcheck);
+								  bool nullcheck, bool batch,
+								  BatchVector *bv);
+
 static void ExecInitJsonExpr(JsonExpr *jsexpr, ExprState *state,
 							 Datum *resv, bool *resnull,
 							 ExprEvalStep *scratch);
@@ -104,6 +106,10 @@ static void ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
 								 bool exists_coerce,
 								 Datum *resv, bool *resnull);
 
+static BatchVector *BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var);
+static bool ExprListAllSimpleVars(const List *args, Bitmapset **allattnos);
+static BatchVectorSlice *BatchVectorSliceFromExprArgs(const List *args,
+													  const BatchVector *bv);
 
 /*
  * ExecInitExpr: prepare an expression tree for execution
@@ -3659,6 +3665,33 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
 	}
 }
 
+/* plain agg, single set, not hashed, no DISTINCT/ORDER/FILTER */
+static inline bool
+AggTransCanUseBatch(AggState *as, AggStatePerTrans pt)
+{
+	Agg *aggnode = (Agg *) as->ss.ps.plan;
+
+	if (!AggCanUsePlainBatch(as))
+		return false;
+	if (as->aggstrategy == AGG_HASHED)
+		return false;
+	if (aggnode->groupingSets != NIL)
+		return false;
+	if (as->phase == NULL || as->phase->numsets > 0)
+		return false;
+
+	/* per-aggregate complications */
+	if (pt->aggsortrequired)
+		return false;
+	if (pt->aggref &&
+		(pt->aggref->aggdistinct != NIL ||
+		 pt->aggref->aggorder != NIL ||
+		 pt->aggref->aggfilter != NULL))
+		return false;
+
+	return true;
+}
+
 /*
  * Build transition/combine function invocations for all aggregate transition
  * / combination function invocations in a grouping sets phase. This has to
@@ -3675,13 +3708,17 @@ ExecInitCoerceToDomain(ExprEvalStep *scratch, CoerceToDomain *ctest,
  */
 ExprState *
 ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
-				  bool doSort, bool doHash, bool nullcheck)
+				  bool doSort, bool doHash, bool nullcheck,
+				  bool *batch_trans)
 {
 	ExprState  *state = makeNode(ExprState);
 	PlanState  *parent = &aggstate->ss.ps;
 	ExprEvalStep scratch = {0};
 	bool		isCombine = DO_AGGSPLIT_COMBINE(aggstate->aggsplit);
 	ExprSetupInfo deform = {0, 0, 0, 0, 0, NIL};
+	bool		batch = AggCanUsePlainBatch(aggstate);
+	Bitmapset  *allattnos = NULL;
+	BatchVector *bv = NULL;
 
 	state->expr = (Expr *) aggstate;
 	state->parent = parent;
@@ -3707,8 +3744,36 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 						  &deform);
 		expr_setup_walker((Node *) pertrans->aggref->aggfilter,
 						  &deform);
+
+		if (!AggTransCanUseBatch(aggstate, pertrans) ||
+			!ExprListAllSimpleVars(pertrans->aggref->args, &allattnos))
+			batch = false;
 	}
-	ExecPushExprSetupSteps(state, &deform);
+
+	if (batch)
+	{
+		if (deform.last_outer > 0)
+		{
+			Assert(!bms_is_empty(allattnos));
+			bv  = BatchVectorCreate(allattnos, deform.last_outer);
+
+			/*
+			 * Deform all tuples up to last_outer in batch
+			 */
+			scratch.opcode = EEOP_OUTER_FETCHSOME_BATCH;
+			scratch.d.fetch_batch.last_var = deform.last_outer;
+			ExprEvalPushStep(state, &scratch);
+
+			/*
+			 * Put all arg Vars into vectors once per batch slice
+			 */
+			scratch.opcode = EEOP_BUILD_OUTER_BATCH_VECTOR;
+			scratch.d.batch_vector.bv = bv;
+			ExprEvalPushStep(state, &scratch);
+		}
+	}
+	else
+		ExecPushExprSetupSteps(state, &deform);
 
 	/*
 	 * Emit instructions for each transition value / grouping set combination.
@@ -3746,7 +3811,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 		 * Evaluate arguments to aggregate/combine function.
 		 */
 		argno = 0;
-		if (isCombine)
+		if (isCombine && !batch)
 		{
 			/*
 			 * Combining two aggregate transition values. Instead of directly
@@ -3816,7 +3881,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 
 			Assert(pertrans->numInputs == argno);
 		}
-		else if (!pertrans->aggsortrequired)
+		else if (!pertrans->aggsortrequired && !batch)
 		{
 			ListCell   *arg;
 
@@ -3849,7 +3914,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 			}
 			Assert(pertrans->numTransInputs == argno);
 		}
-		else if (pertrans->numInputs == 1)
+		else if (pertrans->numInputs == 1 && !batch)
 		{
 			/*
 			 * Non-presorted DISTINCT and/or ORDER BY case, with a single
@@ -3868,7 +3933,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 
 			Assert(pertrans->numInputs == argno);
 		}
-		else
+		else if (!batch)
 		{
 			/*
 			 * Non-presorted DISTINCT and/or ORDER BY case, with multiple
@@ -3896,7 +3961,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 		 * just keep the prior transValue. This is true for both plain and
 		 * sorted/distinct aggregates.
 		 */
-		if (trans_fcinfo->flinfo->fn_strict && pertrans->numTransInputs > 0)
+		if (trans_fcinfo->flinfo->fn_strict && pertrans->numTransInputs > 0 && !batch)
 		{
 			if (strictnulls)
 				scratch.opcode = EEOP_AGG_STRICT_INPUT_CHECK_NULLS;
@@ -3914,7 +3979,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 		}
 
 		/* Handle DISTINCT aggregates which have pre-sorted input */
-		if (pertrans->numDistinctCols > 0 && !pertrans->aggsortrequired)
+		if (pertrans->numDistinctCols > 0 && !pertrans->aggsortrequired && !batch)
 		{
 			if (pertrans->numDistinctCols > 1)
 				scratch.opcode = EEOP_AGG_PRESORTED_DISTINCT_MULTI;
@@ -3942,7 +4007,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 			{
 				ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
 									  pertrans, transno, setno, setoff, false,
-									  nullcheck);
+									  nullcheck, batch, bv);
 				setoff++;
 			}
 		}
@@ -3962,7 +4027,7 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 			{
 				ExecBuildAggTransCall(state, aggstate, &scratch, trans_fcinfo,
 									  pertrans, transno, setno, setoff, true,
-									  nullcheck);
+									  nullcheck, false, NULL);
 				setoff++;
 			}
 		}
@@ -4007,6 +4072,9 @@ ExecBuildAggTrans(AggState *aggstate, AggStatePerPhase phase,
 
 	ExecReadyExpr(state);
 
+	if (batch_trans)
+		*batch_trans = batch;
+
 	return state;
 }
 
@@ -4020,10 +4088,11 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 					  ExprEvalStep *scratch,
 					  FunctionCallInfo fcinfo, AggStatePerTrans pertrans,
 					  int transno, int setno, int setoff, bool ishash,
-					  bool nullcheck)
+					  bool nullcheck, bool batch, BatchVector *bv)
 {
 	ExprContext *aggcontext;
 	int			adjust_jumpnull = -1;
+	BatchVectorSlice *bvs = NULL;
 
 	if (ishash)
 		aggcontext = aggstate->hashcontext;
@@ -4077,7 +4146,13 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 	 */
 	if (!pertrans->aggsortrequired)
 	{
-		if (pertrans->transtypeByVal)
+		if (batch)
+		{
+			if (bv)
+				bvs = BatchVectorSliceFromExprArgs(pertrans->aggref->args, bv);
+			scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+		}
+		else if (pertrans->transtypeByVal)
 		{
 			if (fcinfo->flinfo->fn_strict &&
 				pertrans->initValueIsNull)
@@ -4108,6 +4183,7 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 	scratch->d.agg_trans.setoff = setoff;
 	scratch->d.agg_trans.transno = transno;
 	scratch->d.agg_trans.aggcontext = aggcontext;
+	scratch->d.agg_trans.bvs = bvs;
 	ExprEvalPushStep(state, scratch);
 
 	/* fix up jumpnull */
@@ -5070,3 +5146,129 @@ ExecInitJsonCoercion(ExprState *state, JsonReturning *returning,
 		DomainHasConstraints(returning->typid);
 	ExprEvalPushStep(state, &scratch);
 }
+
+/* Is expr a Var node for a non-system attribute? */
+static bool
+expr_is_simple_var(Expr *expr, AttrNumber *out_attno)
+{
+	if (expr == NULL)
+		return false;
+
+	if (IsA(expr, TargetEntry))
+		return expr_is_simple_var((Expr *) ((TargetEntry *) expr)->expr,
+								  out_attno);
+	if (IsA(expr, RelabelType))
+		return expr_is_simple_var((Expr *) ((RelabelType *) expr)->arg,
+								  out_attno);
+
+	if (IsA(expr, Var) && ((Var *) expr)->varattno > 0)
+	{
+		*out_attno = ((Var *) expr)->varattno;
+		return true;
+	}
+
+	return false;
+}
+
+/* Are all inputs plain Vars (optionally allow RelabelType->Var)? Collect attnos. */
+static bool
+ExprListAllSimpleVars(const List *args, Bitmapset **allattnos)
+{
+	ListCell *lc;
+
+	foreach(lc, args)
+	{
+		TargetEntry *tle = lfirst_node(TargetEntry, lc);
+		Expr *arg = tle->expr;
+		AttrNumber attno;
+
+		if (!expr_is_simple_var(arg, &attno))
+			return false;
+
+		if (!IsA(arg, Var))
+			return false;
+
+		Assert(attno > 0);
+		*allattnos = bms_add_member(*allattnos, attno);
+	}
+
+	return true;
+}
+
+/* ---------- BatchVector stuff ------------- */
+
+static BatchVector *
+BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var)
+{
+	int maxrows = EXEC_BATCH_ROWS;
+	BatchVector *bv;
+	AttrNumber	attno;
+	int			i;
+
+	bv = palloc(sizeof(BatchVector));
+	bv->ncols = bms_num_members(attnos);
+	bv->maxrows = maxrows;
+	bv->last_var = last_var;
+	bv->attnos = palloc(sizeof(AttrNumber) * bv->ncols);
+	attno = -1;
+	i = 0;
+	while ((attno = bms_next_member(attnos, attno)) > 0)
+		bv->attnos[i++] = attno;
+	bv->cols = palloc(sizeof(Datum *) * bv->ncols);
+	bv->nulls = palloc(sizeof(bool  *) * bv->ncols);
+
+	for (i =0; i < bv->ncols; i++)
+	{
+		bv->cols[i]  = palloc(sizeof(Datum) * maxrows);
+		bv->nulls[i] = palloc(sizeof(bool)  * maxrows);
+	}
+
+	bv->nrows = 0;
+	bv->hasnull = false;
+
+	return bv;
+}
+
+static int16
+BatchVectorFindAttColno(const BatchVector *bv, AttrNumber attno)
+{
+	for (int i = 0; i < bv->ncols; i++)
+		if (bv->attnos[i] == attno)
+			return i;
+
+	return -1;
+}
+
+/*
+ * BatchVectorSliceFromExprArgs
+ *		Build a BatchVectorSlice for a List of args.
+ *
+ * For Var args (possibly under RelabelType), store the col index.
+ * For non-Var args, store -1. Caller can handle Consts, etc.
+ */
+static BatchVectorSlice *
+BatchVectorSliceFromExprArgs(const List *args, const BatchVector *bv)
+{
+	BatchVectorSlice *bvs = palloc(sizeof(BatchVectorSlice));
+	int nargs = list_length(args);
+	int i = 0;
+	ListCell *lc;
+
+	Assert(bv);
+	bvs->bv = bv;
+	bvs->nargs = nargs;
+	bvs->argoffs = (int16 *) palloc(sizeof(int16) * nargs);
+
+	foreach (lc, args)
+	{
+		Expr *arg = (Expr *) lfirst(lc);
+		AttrNumber attno;
+
+		if (expr_is_simple_var(arg, &attno))
+			bvs->argoffs[i++] = BatchVectorFindAttColno(bv, attno);
+		else
+			bvs->argoffs[i++] = -1; /* non-Var */
+	}
+
+	return bvs;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 68629ad7991..3176679b346 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -606,6 +606,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_BUILD_INNER_BATCH_VECTOR,
 		&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
 		&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
+		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
 		&&CASE_EEOP_LAST
 	};
 
@@ -2336,6 +2337,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			EEO_NEXT();
 		}
 
+		EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP)
+		{
+			/* too complex for an inline implementation */
+			ExecAggPlainTransBatch(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
 		EEO_CASE(EEOP_LAST)
 		{
 			/* unreachable */
@@ -6039,3 +6048,97 @@ ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
 	}
 	bv->nrows = i;
 }
+
+void
+ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	AggState   *aggstate = castNode(AggState, state->parent);
+	AggStatePerTrans	pertrans = op->d.agg_trans.pertrans;
+	AggStatePerGroup pergroup =
+		&aggstate->all_pergroups[op->d.agg_trans.setoff][op->d.agg_trans.transno];
+	BatchVectorSlice  *bvs = op->d.agg_trans.bvs;
+	FunctionCallInfo	fcinfo = pertrans->transfn_fcinfo;
+	FmgrInfo		   *finfo = fcinfo->flinfo;
+	Datum		newVal;
+	TupleBatch *batch = econtext->outer_batch;
+	int			batch_nrows = bvs ? bvs->bv->nrows : batch->nvalid;
+	int			start_row = 0;
+
+	if (finfo->fn_strict)
+	{
+		if (pergroup->noTransValue && bvs)
+		{
+			const BatchVector *bv = bvs->bv;
+			bool	found = false;
+
+			Assert(bv);
+			for (int i = 0; i < batch_nrows; i++)
+			{
+				for (int j = 0; j < bvs->nargs; j++)
+				{
+					if (!bv->nulls[bvs->argoffs[j]][i])
+					{
+						fcinfo->args[1].value = bv->cols[bvs->argoffs[j]][i];
+						fcinfo->args[1].isnull = false;
+						if (j == bvs->nargs - 1)
+						{
+							found = true;
+							break;
+						}
+					}
+				}
+				if (found)
+					break;
+			}
+			/* If transValue has not yet been initialized, do so now. */
+			ExecAggInitGroup(aggstate, pertrans, pergroup,
+							 op->d.agg_trans.aggcontext);
+			start_row = 1;
+		}
+		else if (pergroup->transValueIsNull)
+			return;
+	}
+
+	switch (ExecEvalStepOp(state, op))
+	{
+		case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
+			/* Loop rows, call the original transfn per element using vector cols. */
+			for (int i = start_row; i < batch_nrows; i++)
+			{
+				bool hasnull = false;
+
+				/* Set up fcinfo args 1..m from column vectors at row i. */
+				if (bvs)
+				{
+					const BatchVector *bv = bvs->bv;
+
+					for (int j = 0; j < bvs->nargs; j++)
+					{
+						int16	argoff = bvs->argoffs[j];
+
+						fcinfo->args[j+1].value = bv->cols[argoff][i];
+						fcinfo->args[j+1].isnull = bv->nulls[argoff][i];
+						if (!hasnull && bv->nulls[argoff][i])
+							hasnull = true;
+					}
+				}
+				/* fcinfo->args[0] is the existing transition state */
+				if (finfo->fn_strict && hasnull)
+					continue;
+				fcinfo->args[0].value = pergroup->transValue;
+				fcinfo->args[0].isnull = pergroup->transValueIsNull;
+				newVal = FunctionCallInvoke(fcinfo);
+				if (!pertrans->transtypeByVal &&
+					DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+					newVal = ExecAggCopyTransValue(aggstate, pertrans,
+												   newVal, fcinfo->isnull,
+												   pergroup->transValue,
+												   pergroup->transValueIsNull);
+				pergroup->transValue = newVal;
+				pergroup->transValueIsNull = fcinfo->isnull;
+			}
+			break;
+		default:
+			elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
+	}
+}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 3ace6363509..662d8bef43b 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -825,6 +825,16 @@ advance_aggregates_batch(AggState *aggstate, TupleBatch *b)
 {
 	ExprContext *tmpcontext = aggstate->tmpcontext;
 	ExprState *evaltrans = aggstate->phase->evaltrans;
+	bool		batch_trans = aggstate->phase->batch_trans;
+
+	if (batch_trans)
+	{
+		tmpcontext->ecxt_outertuple = TupleBatchGetSlot(b, 0);
+		tmpcontext->outer_batch = b;
+		ExecEvalExprNoReturnSwitchContext(evaltrans, tmpcontext);
+		TupleBatchConsumeAll(b);
+		return;
+	}
 
 	while (TupleBatchHasMore(b))
 	{
@@ -1800,7 +1810,8 @@ hashagg_recompile_expressions(AggState *aggstate, bool minslot, bool nullcheck)
 
 		phase->evaltrans_cache[i][j] = ExecBuildAggTrans(aggstate, phase,
 														 dosort, dohash,
-														 nullcheck);
+														 nullcheck,
+														 NULL);
 
 		/* change back */
 		aggstate->ss.ps.outerops = outerops;
@@ -3367,7 +3378,7 @@ hashagg_reset_spill_state(AggState *aggstate)
 	}
 }
 
-static bool
+bool
 AggCanUsePlainBatch(AggState *aggstate)
 {
 	const Agg *aggnode = (const Agg *) aggstate->ss.ps.plan;
@@ -4233,7 +4244,7 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 			Assert(false);
 
 		phase->evaltrans = ExecBuildAggTrans(aggstate, phase, dosort, dohash,
-											 false);
+											 false, &phase->batch_trans);
 
 		/* cache compiled expression for outer slot without NULL check */
 		phase->evaltrans_cache[0][0] = phase->evaltrans;
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 848f0b52d6f..efb3ee639fc 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3026,6 +3026,12 @@ llvm_compile_expr(ExprState *state)
 				LLVMBuildBr(b, opblocks[opno + 1]);
 				break;
 
+			case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
+				build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
 			case EEOP_LAST:
 				Assert(false);
 				break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 6bb527c3f6f..1b5e06f60cc 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -186,4 +186,5 @@ void	   *referenced_functions[] =
 	ExecBuildInnerBatchVector,
 	ExecBuildOuterBatchVector,
 	ExecBuildScanBatchVector,
+	ExecAggPlainTransBatch,
 };
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
index 6f1a38d14bd..b50961fc0c9 100644
--- a/src/include/executor/execBatch.h
+++ b/src/include/executor/execBatch.h
@@ -99,4 +99,10 @@ TupleBatchMaterializeAll(TupleBatch *b)
 	TupleBatchUseInput(b, b->ntuples);
 }
 
+static inline void
+TupleBatchConsumeAll(TupleBatch *b)
+{
+	b->next = b->nvalid;
+}
+
 #endif	/* EXECBATCH_H */
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 99c86bac702..1d33e084b69 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -302,6 +302,9 @@ typedef enum ExprEvalOp
 	EEOP_BUILD_OUTER_BATCH_VECTOR,
 	EEOP_BUILD_SCAN_BATCH_VECTOR,
 
+	/* Batched aggregate trans evaluation */
+	EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,	/* per-row fmgr calls */
+
 	/* non-existent operation, used e.g. to check array lengths */
 	EEOP_LAST
 } ExprEvalOp;
@@ -750,6 +753,7 @@ typedef struct ExprEvalStep
 
 		/* for EEOP_AGG_PLAIN_TRANS_[INIT_][STRICT_]{BYVAL,BYREF} */
 		/* for EEOP_AGG_ORDERED_TRANS_{DATUM,TUPLE} */
+		/* for EEOP_AGG_PLAIN_TRANS_{BATCH,BATCH_ROWLOOP}*/
 		struct
 		{
 			AggStatePerTrans pertrans;
@@ -757,6 +761,7 @@ typedef struct ExprEvalStep
 			int			setno;
 			int			transno;
 			int			setoff;
+			struct BatchVectorSlice *bvs;
 		}			agg_trans;
 
 		/* for EEOP_IS_JSON */
@@ -956,8 +961,17 @@ typedef struct BatchVector
 	int		nrows;			/* #rows loaded into cols/nulls */
 } BatchVector;
 
+/* A slice of BatchVector that maps caller args to BatchVector columns. */
+typedef struct BatchVectorSlice
+{
+	const BatchVector *bv;
+	int			nargs;		/* number of args covered */
+	int16	   *argoffs;	/* length nargs, -1 for non-Var entries */
+} BatchVectorSlice;
+
 extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 
+extern void ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index cf5b0c7e05c..5ba9a523970 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -336,7 +336,8 @@ extern ExprState *ExecInitQual(List *qual, PlanState *parent);
 extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
 extern List *ExecInitExprList(List *nodes, PlanState *parent);
 extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
-									bool doSort, bool doHash, bool nullcheck);
+									bool doSort, bool doHash, bool nullcheck,
+									bool *batch_trans);
 extern ExprState *ExecBuildHash32FromAttrs(TupleDesc desc,
 										   const TupleTableSlotOps *ops,
 										   FmgrInfo *hashfunctions,
diff --git a/src/include/executor/nodeAgg.h b/src/include/executor/nodeAgg.h
index 6c4891bbaeb..5c5ebfc73f2 100644
--- a/src/include/executor/nodeAgg.h
+++ b/src/include/executor/nodeAgg.h
@@ -289,6 +289,7 @@ typedef struct AggStatePerPhaseData
 	Sort	   *sortnode;		/* Sort node for input ordering for phase */
 
 	ExprState  *evaltrans;		/* evaluation of transition functions  */
+	bool		batch_trans;	/* true if evaltrans contains batch EEOPs */
 
 	/*----------
 	 * Cached variants of the compiled expression.
@@ -338,4 +339,5 @@ extern void ExecAggInitializeDSM(AggState *node, ParallelContext *pcxt);
 extern void ExecAggInitializeWorker(AggState *node, ParallelWorkerContext *pwcxt);
 extern void ExecAggRetrieveInstrumentation(AggState *node);
 
+extern bool AggCanUsePlainBatch(AggState *aggstate);
 #endif							/* NODEAGG_H */
-- 
2.47.3



  [application/octet-stream] v3-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch (11.2K, 3-v3-0007-WIP-Add-EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT.patch)
  download | inline diff:
From 9eea71db3c7bb137e676ad0a27f6256d9c6971f0 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Tue, 9 Sep 2025 21:43:29 +0900
Subject: [PATCH v3 7/9] WIP: Add EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT

The new EEOP runs a plain aggregate transition over a TupleBatch with
a single fmgr call. Batch vectors are passed to the transfn via
AggBulkArgs stored in fcinfo->flinfo->fn_extra, avoiding per-row fmgr
overhead.

Gate selection with AggTransfnSupportsBulk(), an allowlist of
built-in transfns updated to accept AggBulkArgs.  Some integer
transfns are taught to read AggBulkArgs when present, else fall
back. Rowloop batching remains available; unsupported aggregates keep
the row path.
---
 src/backend/executor/execExpr.c       | 28 ++++++++++++++++-
 src/backend/executor/execExprInterp.c | 43 ++++++++++++++++++++++++++
 src/backend/executor/nodeAgg.c        |  1 -
 src/backend/jit/llvm/llvmjit_expr.c   |  1 +
 src/backend/utils/adt/int.c           | 32 +++++++++++++++++++
 src/backend/utils/adt/int8.c          | 44 +++++++++++++++++++++++++++
 src/backend/utils/adt/numeric.c       | 17 +++++++++++
 src/include/executor/execExpr.h       |  1 +
 src/include/executor/executor.h       | 20 ++++++++++++
 9 files changed, 185 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index af5ed8b6368..27a5780f557 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -47,6 +47,7 @@
 #include "utils/acl.h"
 #include "utils/array.h"
 #include "utils/builtins.h"
+#include "utils/fmgroids.h"
 #include "utils/jsonfuncs.h"
 #include "utils/jsonpath.h"
 #include "utils/lsyscache.h"
@@ -3692,6 +3693,28 @@ AggTransCanUseBatch(AggState *as, AggStatePerTrans pt)
 	return true;
 }
 
+/* Return true if this transfn OID is known to accept AggBulkArgs. */
+static bool
+AggTransfnSupportsBulk(Oid fn_oid)
+{
+	/* Phase 1: hard-coded allowlist of built-ins you updated. */
+	static const Oid ok[] =
+	{
+		F_INT8INC_ANY,		/* COUNT(*) transfn */
+		F_INT8INC,			/* COUNT(arg) transfn */
+		F_INT4_SUM,			/* SUM(int) transfn */
+		F_INT4SMALLER,		/* MIN(int) transfn */
+		F_INT4LARGER,		/* MAX(int) transfn */
+		/* add others you make bulk-aware */
+		InvalidOid
+	};
+
+	for (int i = 0; OidIsValid(ok[i]); i++)
+		if (ok[i] == fn_oid)
+			return true;
+	return false;
+}
+
 /*
  * Build transition/combine function invocations for all aggregate transition
  * / combination function invocations in a grouping sets phase. This has to
@@ -4150,7 +4173,10 @@ ExecBuildAggTransCall(ExprState *state, AggState *aggstate,
 		{
 			if (bv)
 				bvs = BatchVectorSliceFromExprArgs(pertrans->aggref->args, bv);
-			scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+			if (!AggTransfnSupportsBulk(pertrans->transfn_oid))
+				scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP;
+			else
+				scratch->opcode = EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT;
 		}
 		else if (pertrans->transtypeByVal)
 		{
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 3176679b346..41ad9b4838d 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -607,6 +607,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
 		&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
 		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
+		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,
 		&&CASE_EEOP_LAST
 	};
 
@@ -2345,6 +2346,14 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			EEO_NEXT();
 		}
 
+		EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT)
+		{
+			/* too complex for an inline implementation */
+			ExecAggPlainTransBatch(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
 		EEO_CASE(EEOP_LAST)
 		{
 			/* unreachable */
@@ -6138,6 +6147,40 @@ ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext
 				pergroup->transValueIsNull = fcinfo->isnull;
 			}
 			break;
+
+		case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
+			{
+				void *save = fcinfo->flinfo->fn_extra;
+				AggBulkArgs ba = {batch_nrows, start_row};
+
+				if (bvs)
+				{
+					const BatchVector *bv = bvs->bv;
+
+					Assert(bv);
+					ba.nargs = bvs->nargs;
+					ba.argoffs = bvs->argoffs;
+					ba.args = bv->cols;
+					ba.isnull = bv->nulls;
+					ba.hasnull = bv->hasnull;
+				}
+				fcinfo->flinfo->fn_extra = &ba;
+				fcinfo->args[0].value = pergroup->transValue;
+				fcinfo->args[0].isnull = pergroup->transValueIsNull;
+				fcinfo->isnull = false;		/* just in case transfn doesn't set it */
+				newVal = FunctionCallInvoke(fcinfo);   /* one call for the entire slice */
+				if (!pertrans->transtypeByVal &&
+					DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+					newVal = ExecAggCopyTransValue(aggstate, pertrans,
+												   newVal, fcinfo->isnull,
+												   pergroup->transValue,
+												   pergroup->transValueIsNull);
+				pergroup->transValue = newVal;
+				pergroup->transValueIsNull = fcinfo->isnull;
+				fcinfo->flinfo->fn_extra = save;
+			}
+			break;
+
 		default:
 			elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
 	}
diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index 662d8bef43b..a2286ef5e54 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -2687,7 +2687,6 @@ agg_retrieve_direct_batch(AggState *aggstate)
 
 	initialize_aggregates(aggstate, aggstate->pergroups,
 						  Max(aggstate->phase->numsets, 1));
-
 	if (aggstate->grp_firstTuple)
 	{
 		ExecForceStoreHeapTuple(aggstate->grp_firstTuple, firstSlot, true);
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index efb3ee639fc..45346124bd7 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3026,6 +3026,7 @@ llvm_compile_expr(ExprState *state)
 				LLVMBuildBr(b, opblocks[opno + 1]);
 				break;
 
+			case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
 			case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
 				build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
 								v_state, op, v_econtext);
diff --git a/src/backend/utils/adt/int.c b/src/backend/utils/adt/int.c
index b5781989a64..eb1780b5590 100644
--- a/src/backend/utils/adt/int.c
+++ b/src/backend/utils/adt/int.c
@@ -1363,18 +1363,50 @@ int2smaller(PG_FUNCTION_ARGS)
 Datum
 int4larger(PG_FUNCTION_ARGS)
 {
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
 	int32		arg1 = PG_GETARG_INT32(0);
 	int32		arg2 = PG_GETARG_INT32(1);
 
+	if (unlikely(ba))
+	{
+		int32 result = arg1;
+
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				arg2 = (int32) ba->args[ba->argoffs[0]][i];
+				if (arg2 > result)
+					result = arg2;
+			}
+		}
+		PG_RETURN_INT32(result);
+	}
 	PG_RETURN_INT32((arg1 > arg2) ? arg1 : arg2);
 }
 
 Datum
 int4smaller(PG_FUNCTION_ARGS)
 {
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
 	int32		arg1 = PG_GETARG_INT32(0);
 	int32		arg2 = PG_GETARG_INT32(1);
 
+	if (unlikely(ba))
+	{
+		int32 result = arg1;
+
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+				if (arg2 < result)
+					result = arg2;
+			}
+		}
+		PG_RETURN_INT32(result);
+	}
 	PG_RETURN_INT32((arg1 < arg2) ? arg1 : arg2);
 }
 
diff --git a/src/backend/utils/adt/int8.c b/src/backend/utils/adt/int8.c
index bdea490202a..bbabf4e0785 100644
--- a/src/backend/utils/adt/int8.c
+++ b/src/backend/utils/adt/int8.c
@@ -461,10 +461,28 @@ int8up(PG_FUNCTION_ARGS)
 Datum
 int8pl(PG_FUNCTION_ARGS)
 {
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
 	int64		arg1 = PG_GETARG_INT64(0);
 	int64		arg2 = PG_GETARG_INT64(1);
 	int64		result;
 
+	if (unlikely(ba))
+	{
+		result = arg1;
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				arg2 = DatumGetInt64(ba->args[ba->argoffs[0]][i]);
+				if (unlikely(pg_add_s64_overflow(arg1, arg2, &result)))
+					ereport(ERROR,
+							(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+							 errmsg("bigint out of range")));
+				arg1 = result;
+			}
+		}
+		PG_RETURN_INT64(result);
+	}
 	if (unlikely(pg_add_s64_overflow(arg1, arg2, &result)))
 		ereport(ERROR,
 				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
@@ -718,9 +736,35 @@ int8lcm(PG_FUNCTION_ARGS)
 Datum
 int8inc(PG_FUNCTION_ARGS)
 {
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
 	int64		arg = PG_GETARG_INT64(0);
 	int64		result;
 
+	if (unlikely(ba))
+	{
+		result = arg;
+		if (!ba->hasnull || ba->nargs == 0)
+		{
+			if (unlikely(pg_add_s64_overflow(arg, ba->nrows - ba->start_row, &result)))
+				ereport(ERROR,
+						(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+						 errmsg("bigint out of range")));
+			PG_RETURN_INT64(result);
+		}
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				if (unlikely(pg_add_s64_overflow(arg, 1, &result)))
+					ereport(ERROR,
+							(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+							 errmsg("bigint out of range")));
+				arg = result;
+			}
+		}
+		PG_RETURN_INT64(result);
+	}
+
 	if (unlikely(pg_add_s64_overflow(arg, 1, &result)))
 		ereport(ERROR,
 				(errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index 2501007d981..907c4fddba0 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -6310,6 +6310,23 @@ int4_sum(PG_FUNCTION_ARGS)
 {
 	int64		oldsum;
 	int64		newval;
+	AggBulkArgs *ba = AggGetBulkArgs(fcinfo);
+
+	if (unlikely(ba))
+	{
+		int64	result = (!PG_ARGISNULL(0) ? PG_GETARG_INT64(0) : 0);
+
+		for (int i = ba->start_row; i < ba->nrows; i++)
+		{
+			if (!ba->isnull[ba->argoffs[0]][i])
+			{
+				int32	arg2 = DatumGetInt32(ba->args[ba->argoffs[0]][i]);
+
+				result = result + arg2;
+			}
+		}
+		PG_RETURN_INT64(result);
+	}
 
 	if (PG_ARGISNULL(0))
 	{
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 1d33e084b69..f24782ecf58 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -304,6 +304,7 @@ typedef enum ExprEvalOp
 
 	/* Batched aggregate trans evaluation */
 	EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,	/* per-row fmgr calls */
+	EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,	/* call transfn once with AggBulkArgs */
 
 	/* non-existent operation, used e.g. to check array lengths */
 	EEOP_LAST
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 5ba9a523970..c72bd755b79 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -561,6 +561,26 @@ ExecQualAndReset(ExprState *state, ExprContext *econtext)
 }
 #endif
 
+#ifndef FRONTEND
+/* Per-call bulk argument vectors for batched aggregate trans functions. */
+typedef struct AggBulkArgs
+{
+	int		nrows;		/* number of rows in this batch */
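+	int		start_row;	/* first row the transfn should process */
+	int16  *argoffs;	/* argoffs[j] = column of j-th arg in args/isnull */
+	int		nargs;		/* number of entries in argoffs */
+	Datum  **args;		/* args[argoffs[j]][i] = j-th arg at row i */
+	bool   **isnull;	/* isnull[argoffs[j]][i] */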
+	bool	hasnull;	/* is any datum in args NULL? */
+} AggBulkArgs;
+
+static inline AggBulkArgs *
+AggGetBulkArgs(FunctionCallInfo fcinfo)
+{
+	return (AggBulkArgs *) (fcinfo->flinfo ? fcinfo->flinfo->fn_extra : NULL);
+}
+#endif
+
 extern bool ExecCheck(ExprState *state, ExprContext *econtext);
 
 /*
-- 
2.47.3
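
For readers following along, the bulk-argument convention used by the
transition functions in the patch above can be sketched outside the
backend. This is only an illustration under simplifying assumptions:
BulkArgsSketch and sum_int4_bulk are hypothetical names, the struct
carries a single argument column instead of the argoffs indirection,
and overflow is reported through the return value rather than ereport().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for AggBulkArgs, reduced to one argument column. */
typedef struct BulkArgsSketch
{
	int			nrows;		/* number of rows in the batch */
	int			start_row;	/* first row the transition should consume */
	const int32_t *vals;	/* vals[i] = argument value at row i */
	const bool *isnull;		/* isnull[i] = true if row i is NULL */
} BulkArgsSketch;

/*
 * Sum transition over the whole batch in a single call.  NULL rows are
 * skipped.  Returns false on int64 overflow and leaves *state untouched.
 */
static bool
sum_int4_bulk(int64_t *state, const BulkArgsSketch *ba)
{
	int64_t		acc = *state;

	for (int i = ba->start_row; i < ba->nrows; i++)
	{
		int64_t		v;

		if (ba->isnull[i])
			continue;
		v = ba->vals[i];
		if (v > 0 && acc > INT64_MAX - v)
			return false;
		if (v < 0 && acc < INT64_MIN - v)
			return false;
		acc += v;
	}
	*state = acc;
	return true;
}
```

The point of the shape is that the per-row loop lives inside the
transition function, so the caller pays for one call per batch rather
than one per row; start_row lets the caller skip rows it has already
folded into the state.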



  [application/octet-stream] v3-0008-WIP-Add-ExecQualBatch-and-EEOPs-for-batched-quals.patch (22.7K, 4-v3-0008-WIP-Add-ExecQualBatch-and-EEOPs-for-batched-quals.patch)
  download | inline diff:
From eec61e901c54ec2149f60c0ff8a0b1b3e63f7a0b Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 22 Sep 2025 16:19:26 +0900
Subject: [PATCH v3 8/9] WIP: Add ExecQualBatch() and EEOPs for batched quals

Introduce ExecInitQualBatch()/ExecQualBatch() to evaluate scan quals
over a TupleBatch. The batched qual interpreter produces a boolean
mask aligned with the batch, marking which rows satisfy the qual.
The scan node later uses this mask to copy only passing rows into
its output slots. If batching is not possible, fall back to the
existing per-tuple engine.
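
A rough standalone sketch of the mask idea described above (not the
patch's code): start with one bit per row set, let each term clear the
bits of rows that fail it, then collect the rows whose bit survived.
The function names and the fixed "value < bound" term are assumptions
made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Set bits [0..nrows) and clear every other bit in the nwords-word mask. */
static void
mask_set_first_n(uint64_t *mask, int nwords, int nrows)
{
	for (int w = 0; w < nwords; w++)
	{
		int			lo = w * 64;

		if (nrows >= lo + 64)
			mask[w] = ~UINT64_C(0);
		else if (nrows > lo)
			mask[w] = ~UINT64_C(0) >> (64 - (nrows - lo));
		else
			mask[w] = 0;
	}
}

/* One AND term: clear bit i when row i is NULL or vals[i] >= bound. */
static void
mask_and_less_than(uint64_t *mask, const int32_t *vals, const bool *isnull,
				   int nrows, int32_t bound)
{
	for (int i = 0; i < nrows; i++)
	{
		if (isnull[i] || !(vals[i] < bound))
			mask[i >> 6] &= ~(UINT64_C(1) << (i & 63));
	}
}

/* Write the indexes of surviving rows to out[]; returns how many. */
static int
mask_collect(const uint64_t *mask, int nrows, int *out)
{
	int			kept = 0;

	for (int i = 0; i < nrows; i++)
	{
		if (mask[i >> 6] & (UINT64_C(1) << (i & 63)))
			out[kept++] = i;
	}
	return kept;
}
```

Because every term only clears bits, terms can be applied in any order
and the result is their conjunction, which is why this shape fits
AND-only quals.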

Add EEOP_QUAL_BATCH_INITMASK and EEOP_QUAL_BATCH_TERM, and wire them
after EEOP_SCAN_FETCHSOME_BATCH and EEOP_BUILD_SCAN_BATCH_VECTOR.
Batching is limited to quals that are a top-level AND of simple
clauses: either NullTest(var) or strict binary OpExpr with var/const
or var/var arguments. A walker validates the tree, collects the
referenced attnos, and builds a BatchVector; terms are compiled from
the leaves and evaluated to update the mask.
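
The validation step can be pictured with a toy tree, independent of the
PostgreSQL node types. In this sketch SketchNode, SketchCxt and
collect_leaves are hypothetical; the only behaviour borrowed from the
description above is "AND nodes recurse, whitelisted leaves are
collected with their attnos, anything else rejects the whole qual".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum
{
	SK_AND,					/* conjunction: recurse into children */
	SK_OR,					/* not batchable in this sketch */
	SK_NULLTEST,			/* leaf: var IS [NOT] NULL */
	SK_OP_VAR_CONST			/* leaf: var op const */
} SketchKind;

typedef struct SketchNode
{
	SketchKind	kind;
	int			attno;		/* referenced column, for leaves */
	int			nkids;		/* number of children, for SK_AND / SK_OR */
	const struct SketchNode *kids[2];
} SketchNode;

#define SKETCH_MAX_LEAVES 16

typedef struct SketchCxt
{
	const SketchNode *leaves[SKETCH_MAX_LEAVES];
	int			nleaves;
	int			last_attno;	/* highest attno seen in accepted leaves */
	bool		ok;			/* stays true while the tree is batchable */
} SketchCxt;

/* Flatten an AND-only tree into its leaves; clear cxt->ok on anything else. */
static void
collect_leaves(const SketchNode *n, SketchCxt *cxt)
{
	if (n == NULL || !cxt->ok)
		return;

	switch (n->kind)
	{
		case SK_AND:
			for (int i = 0; i < n->nkids; i++)
				collect_leaves(n->kids[i], cxt);
			break;

		case SK_NULLTEST:
		case SK_OP_VAR_CONST:
			if (cxt->nleaves == SKETCH_MAX_LEAVES)
			{
				cxt->ok = false;
				break;
			}
			cxt->leaves[cxt->nleaves++] = n;
			if (n->attno > cxt->last_attno)
				cxt->last_attno = n->attno;
			break;

		default:
			cxt->ok = false;	/* whitelist ends here */
			break;
	}
}
```

A caller would check ok after the walk and fall back to the per-tuple
path when it is false, mirroring the fallback described in the commit
message.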

ExprState gains batch_private to hold a BatchQualRuntime (mask,
mask_words), which is used to populate the output slots of the
TupleBatch.
---
 src/backend/executor/execExpr.c       | 324 ++++++++++++++++++++++++++
 src/backend/executor/execExprInterp.c | 198 ++++++++++++++++
 src/backend/executor/nodeSeqscan.c    |   2 +
 src/backend/jit/llvm/llvmjit_expr.c   |  11 +
 src/backend/jit/llvm/llvmjit_types.c  |   2 +
 src/include/executor/execExpr.h       |  60 +++++
 src/include/executor/execScan.h       |  35 +--
 src/include/executor/executor.h       |   3 +
 src/include/nodes/execnodes.h         |   4 +
 9 files changed, 626 insertions(+), 13 deletions(-)

diff --git a/src/backend/executor/execExpr.c b/src/backend/executor/execExpr.c
index 27a5780f557..63df560d5f1 100644
--- a/src/backend/executor/execExpr.c
+++ b/src/backend/executor/execExpr.c
@@ -111,6 +111,19 @@ static BatchVector *BatchVectorCreate(Bitmapset *attnos, AttrNumber last_var);
 static bool ExprListAllSimpleVars(const List *args, Bitmapset **allattnos);
 static BatchVectorSlice *BatchVectorSliceFromExprArgs(const List *args,
 													  const BatchVector *bv);
+static int16 BatchVectorFindAttColno(const BatchVector *bv, AttrNumber attno);
+static int16 BatchVectorOffsetForVarExpr(Expr *expr, const BatchVector *bv);
+
+/* private context for the walker */
+typedef struct QualBatchContext
+{
+	List      *leaves;      /* List<Node*> of accepted leaves */
+	Bitmapset *attnos;      /* Vars referenced by accepted leaves */
+	bool		ok;			/* stays true if batchable */
+	AttrNumber	last_scan;	/* last needed attribute in scan slot */
+} QualBatchContext;
+
+static bool qual_batchable_walker(Node *node, void *context);
 
 /*
  * ExecInitExpr: prepare an expression tree for execution
@@ -5221,6 +5234,209 @@ ExprListAllSimpleVars(const List *args, Bitmapset **allattnos)
 	return true;
 }
 
+/* helper: extract Var (allowing RelabelType->Var); returns NULL otherwise */
+static Var *
+strip_to_var(Node *n)
+{
+	if (n == NULL)
+		return NULL;
+	if (IsA(n, RelabelType))
+		n = (Node *) ((RelabelType *) n)->arg;
+	if (!IsA(n, Var))
+		return NULL;
+	if (((Var *) n)->varattno < 0)
+		return NULL;
+	return (Var *) n;
+}
+
+/* main walker; return true to abort traversal early, false to continue */
+static bool
+qual_batchable_walker(Node *node, void *context)
+{
+	QualBatchContext *cxt = (QualBatchContext *) context;
+
+	if (node == NULL || !cxt->ok)
+		return false;
+
+	switch (nodeTag(node))
+	{
+		case T_List:
+			return expression_tree_walker(node, qual_batchable_walker, cxt);
+
+		case T_BoolExpr:
+		{
+			BoolExpr *b = (BoolExpr *) node;
+
+			/* Only AND trees are allowed */
+			if (b->boolop != AND_EXPR)
+			{
+				cxt->ok = false;
+				return true; /* abort */
+			}
+			/* Recurse normally over children */
+			return expression_tree_walker(node, qual_batchable_walker, cxt);
+		}
+
+		case T_NullTest:
+		{
+			NullTest *nt = (NullTest *) node;
+			Var		 *v  = strip_to_var((Node *) nt->arg);
+
+			if (v == NULL)
+			{
+				cxt->ok = false;
+				return true;
+			}
+
+			cxt->attnos = bms_add_member(cxt->attnos, v->varattno);
+			if (v->varattno > cxt->last_scan)
+				cxt->last_scan = v->varattno;
+			cxt->leaves = lappend(cxt->leaves, node);
+
+			/* Do NOT recurse into leaf */
+			return false;
+		}
+
+		case T_OpExpr:
+		{
+			OpExpr *op = (OpExpr *) node;
+			List   *args = op->args;
+			Node   *l, *r;
+			Var    *lv,
+				   *rv = NULL;
+
+			/* binary only */
+			if (list_length(args) != 2)
+			{
+				cxt->ok = false;
+				return true;
+			}
+			/* strict operator only (NULL -> false semantics) */
+			if (!func_strict(op->opfuncid))
+			{
+				cxt->ok = false;
+				return true;
+			}
+
+			l = linitial(args);
+			r = lsecond(args);
+			lv = strip_to_var(l);
+			if (lv == NULL)
+			{
+				cxt->ok = false;
+				return true;
+			}
+			cxt->attnos = bms_add_member(cxt->attnos, lv->varattno);
+			if (lv->varattno > cxt->last_scan)
+				cxt->last_scan = lv->varattno;
+
+			if (IsA(r, Const))
+			{
+				/* ok; no attno to add */
+			}
+			else
+			{
+				rv = strip_to_var(r);
+				if (rv == NULL)
+				{
+					cxt->ok = false;
+					return true;
+				}
+				cxt->attnos = bms_add_member(cxt->attnos, rv->varattno);
+				if (rv->varattno > cxt->last_scan)
+					cxt->last_scan = rv->varattno;
+			}
+
+			cxt->leaves = lappend(cxt->leaves, node);
+
+			/* Leaf handled; do NOT recurse into args */
+			return false;
+		}
+
+		/* Whitelist ends here; anything else in the tree rejects */
+		default:
+			cxt->ok = false;
+			break;
+	}
+
+	return true;
+}
+
+/* build a BatchQualTerm from a validated leaf */
+static BatchQualTerm *
+build_term_from_leaf(Node *n, BatchVector *bv)
+{
+	BatchQualTerm *term;
+	BatchQualTermKind kind;
+	bool		strict;
+	int16		l_off;
+	int16		r_off;
+	Datum		r_const = (Datum) 0;
+	bool		r_isnull = false;
+	FmgrInfo   *finfo = NULL;
+	Oid			collation;
+
+	if (IsA(n, NullTest))
+	{
+		NullTest *nt = (NullTest *) n;
+
+		kind = nt->nulltesttype == IS_NULL ? BQTK_IS_NULL : BQTK_IS_NOT_NULL;
+		l_off = BatchVectorOffsetForVarExpr(nt->arg, bv);
+		r_off = -1;
+		strict = false;
+		collation = InvalidOid;
+
+		if (l_off < 0)
+			return NULL;
+	}
+	else if (IsA(n, OpExpr))
+	{
+		OpExpr *op = (OpExpr *) n;
+		Expr   *l  = linitial(op->args);
+		Expr   *r  = lsecond(op->args);
+
+		l_off = BatchVectorOffsetForVarExpr(l, bv);
+		if (l_off < 0)
+			return NULL;
+
+		r_off = BatchVectorOffsetForVarExpr(r, bv);
+		if (IsA(r, Const))
+		{
+			Const *c = (Const *) r;
+
+			kind = BQTK_VAR_CONST;
+			r_const = c->constvalue;
+			r_isnull = c->constisnull;
+			r_off = -1;
+		}
+		else
+		{
+			if (r_off < 0)
+				return NULL;
+			kind = BQTK_VAR_VAR;
+		}
+
+		strict = func_strict(op->opfuncid);
+		collation = exprInputCollation((Node *) op);
+		finfo = palloc(sizeof(FmgrInfo));
+		fmgr_info(op->opfuncid, finfo);
+	}
+	else
+		return NULL;
+
+	term = palloc(sizeof(BatchQualTerm));
+	term->kind = kind;
+	term->strict = strict;
+	term->l_off = l_off;
+	term->r_off = r_off;
+	term->r_const = r_const;
+	term->r_isnull = r_isnull;
+	term->finfo = finfo;
+	term->collation = collation;
+
+	return term;
+}
+
 /* ---------- BatchVector stuff ------------- */
 
 static BatchVector *
@@ -5298,3 +5514,111 @@ BatchVectorSliceFromExprArgs(const List *args, const BatchVector *bv)
 
 	return bvs;
 }
+
+/*
+ * BatchVectorOffsetForVarExpr
+ *   Map a Var (or RelabelType->Var) to its BatchVector column index.
+ *   Returns -1 if the Var's attno is not present.
+ */
+static int16
+BatchVectorOffsetForVarExpr(Expr *expr, const BatchVector *bv)
+{
+	AttrNumber attno;
+
+	if (!expr_is_simple_var(expr, &attno))
+		return -1;
+
+	return (int16) BatchVectorFindAttColno(bv, attno);
+}
+
+/*
+ * ExecInitQualBatch
+ *	Build a batched-qual EEOP program (AND-only).
+ *	Caller should also keep scalar ps->qual for runtime fallback.
+ */
+ExprState *
+ExecInitQualBatch(PlanState *ps)
+{
+	Node	   *qual = (Node *) ps->plan->qual;
+	QualBatchContext cxt = {NIL, NULL, true, 0};
+	BatchQualRuntime *rt;
+	ExprState  *state;
+	BatchVector *bv;
+	uint64	   *mask;
+	int			mask_words;
+	ListCell   *lc;
+	ExprEvalStep scratch = {0};
+
+	if (qual == NULL)
+		return NULL;
+
+	/* validate + collect leaves/attnos with walker */
+	(void) qual_batchable_walker(qual, &cxt);
+	if (!cxt.ok || cxt.leaves == NIL || bms_is_empty(cxt.attnos))
+		return NULL;
+
+	bv = BatchVectorCreate(cxt.attnos, cxt.last_scan);
+
+	mask_words = (bv->maxrows + 63) >> 6;
+	mask = (uint64 *) palloc0(sizeof(uint64) * mask_words);
+
+	/* Runtime carrier (lifetime == exprstate) */
+	rt = palloc0(sizeof(BatchQualRuntime));
+	rt->mask = mask;
+	rt->mask_words = mask_words;
+
+	/* dedicated ExprState for batched program */
+
+	state = makeNode(ExprState);
+	state->expr = (Expr *) qual;
+	state->parent = ps;
+	state->ext_params = NULL;
+
+	/* mark as a qual expression; it must be run with ExecQualBatch() */
+	state->flags = EEO_FLAG_IS_QUAL;
+
+	/* Only valid as batch qual if this is set. */
+	state->batch_private = (void *) rt;
+
+	scratch.opcode = EEOP_SCAN_FETCHSOME_BATCH;
+	scratch.d.fetch_batch.last_var = cxt.last_scan;
+	ExprEvalPushStep(state, &scratch);
+
+	scratch.opcode = EEOP_BUILD_SCAN_BATCH_VECTOR;
+	scratch.d.batch_vector.bv = bv;
+	ExprEvalPushStep(state, &scratch);
+
+	scratch.opcode = EEOP_QUAL_BATCH_INITMASK;
+	scratch.d.qualbatch_init.bv = bv;
+	scratch.d.qualbatch_init.mask = mask;
+	scratch.d.qualbatch_init.mask_words = mask_words;
+	ExprEvalPushStep(state, &scratch);
+
+	/* TERM per leaf */
+	foreach(lc, cxt.leaves)
+	{
+		BatchQualTerm *term = build_term_from_leaf((Node *) lfirst(lc), bv);
+
+		if (term == NULL)
+			return NULL;
+
+		scratch.opcode = EEOP_QUAL_BATCH_TERM;
+		scratch.d.qualbatch_term.bv = bv;
+		scratch.d.qualbatch_term.mask = mask;
+		scratch.d.qualbatch_term.mask_words = mask_words;
+		scratch.d.qualbatch_term.term = term;		/* palloc'd above */
+		ExprEvalPushStep(state, &scratch);
+	}
+
+	/*
+	 * Nothing more to do at the end.  Each TERM step has already cleared the
+	 * mask bits of the rows that failed it, so the mask is the result;
+	 * ExecQualBatch() reads it after the program has run.
+	 */
+	scratch.opcode = EEOP_DONE_NO_RETURN;
+	ExprEvalPushStep(state, &scratch);
+
+	ExecReadyExpr(state);
+
+	return state;
+}
diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 41ad9b4838d..c2b76a5e5db 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -608,6 +608,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
 		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,
 		&&CASE_EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,
+		&&CASE_EEOP_QUAL_BATCH_INITMASK,
+		&&CASE_EEOP_QUAL_BATCH_TERM,
 		&&CASE_EEOP_LAST
 	};
 
@@ -2350,7 +2352,19 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		{
 			/* too complex for an inline implementation */
 			ExecAggPlainTransBatch(state, op, econtext);
+			EEO_NEXT();
+		}
+
+
+		EEO_CASE(EEOP_QUAL_BATCH_INITMASK)
+		{
+			ExecQualBatchInitMask(state, op, econtext);
+			EEO_NEXT();
+		}
 
+		EEO_CASE(EEOP_QUAL_BATCH_TERM)
+		{
+			ExecQualBatchTerm(state, op, econtext);
 			EEO_NEXT();
 		}
 
@@ -6185,3 +6199,187 @@ ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext
 			elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
 	}
 }
+
+/* set mask bits [0..nvalid_bits) to 1; clear padding in the last word */
+static inline void
+mask_init_all_ones(uint64 *a, int nwords, int nvalid_bits)
+{
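+	int		full = nvalid_bits >> 6;	/* completely valid words */
+	int		rem = nvalid_bits & 63;		/* valid bits in the partial word */
+
+	/* words wholly inside [0..nvalid_bits) are all-ones, the rest zero */
+	for (int i = 0; i < nwords; i++)
+		a[i] = (i < full) ? ~UINT64CONST(0) : 0;
+
+	if (rem != 0)
+		a[full] = (~UINT64CONST(0)) >> (64 - rem);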
+}
+
+static inline void
+mask_clear_bit(uint64 *a, int i)
+{
+	a[i >> 6] &= ~(UINT64CONST(1) << (i & 63));
+}
+
+void
+ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	BatchVector *bv = op->d.qualbatch_init.bv;
+	uint64      *mask = op->d.qualbatch_init.mask;
+	int          nwords = op->d.qualbatch_init.mask_words;
+	int          n = bv->nrows;
+
+	/* initialize to all-pass for current batch size */
+	mask_init_all_ones(mask, nwords, n);
+}
+
+void
+ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	BatchVector    *bv   = op->d.qualbatch_term.bv;
+	uint64         *mask = op->d.qualbatch_term.mask;
+	BatchQualTerm  *t    = op->d.qualbatch_term.term;
+	int             n    = bv->nrows;
+
+	switch (t->kind)
+	{
+		case BQTK_IS_NULL:
+		{
+			/* keep bit set only if value IS NULL; clear otherwise */
+			for (int i = 0; i < n; i++)
+			{
+				if (!bv->nulls[t->l_off][i])
+					mask_clear_bit(mask, i);
+			}
+			break;
+		}
+
+		case BQTK_IS_NOT_NULL:
+		{
+			/* keep bit set only if value IS NOT NULL; clear if NULL */
+			for (int i = 0; i < n; i++)
+			{
+				if (bv->nulls[t->l_off][i])
+					mask_clear_bit(mask, i);
+			}
+			break;
+		}
+
+		case BQTK_VAR_CONST:
+		{
+			const bool  r_isnull = t->r_isnull;
+			const Datum r_const  = t->r_const;
+			const bool  strict   = t->strict;
+			const Oid   coll     = t->collation;
+			FmgrInfo   *finfo    = t->finfo;
+			int         loff     = t->l_off;
+
+			for (int i = 0; i < n; i++)
+			{
+				bool ln = bv->nulls[loff][i];
+				bool pass;
+
+				/* WHERE treats NULL as false; strict ops short-circuit */
+				if (strict && (ln || r_isnull))
+					pass = false;
+				else
+				{
+					Datum lv = bv->cols[loff][i];
+
+					pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, r_const));
+				}
+
+				if (!pass)
+					mask_clear_bit(mask, i);
+			}
+			break;
+		}
+
+		case BQTK_VAR_VAR:
+		{
+			const bool  strict = t->strict;
+			const Oid   coll   = t->collation;
+			FmgrInfo   *finfo  = t->finfo;
+			int         loff   = t->l_off;
+			int         roff   = t->r_off;
+
+			for (int i = 0; i < n; i++)
+			{
+				bool  ln = bv->nulls[loff][i];
+				bool  rn = bv->nulls[roff][i];
+				bool  pass;
+
+				if (strict && (ln || rn))
+					pass = false;
+				else
+				{
+					Datum lv = bv->cols[loff][i];
+					Datum rv = bv->cols[roff][i];
+
+					pass = DatumGetBool(FunctionCall2Coll(finfo, coll, lv, rv));
+				}
+
+				if (!pass)
+					mask_clear_bit(mask, i);
+			}
+			break;
+		}
+
+		default:
+			/* should not happen; leave mask unchanged */
+			break;
+	}
+}
+
+static inline bool
+mask_is_empty(const uint64 *mask, int nwords)
+{
+	for (int i = 0; i < nwords; i++)
+	{
+		if (mask[i] != 0)
+			return false;
+	}
+	return true;
+}
+
+/*
+ * ExecQualBatch
+ *		Evaluate a batched qual built by ExecInitQualBatch() over a TupleBatch.
+ *
+ * Returns the number of rows that passed; they are stored in b's outslots.
+ */
+int
+ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b)
+{
+	int		i;
+	uint64 *mask;
+	int		kept = 0;
+	BatchQualRuntime *rt = ExecGetBatchQualRuntime(state);
+
+	/* verify that expression was compiled using ExecInitQualBatch */
+	Assert(state->flags & EEO_FLAG_IS_QUAL);
+	Assert(rt && rt->mask && rt->mask_words);
+
+	/* run the batched EEOP program once */
+	econtext->scan_batch = b;
+	ExecEvalExprNoReturn(state, econtext);
+
+	mask = rt->mask;
+	if (mask_is_empty(mask, rt->mask_words))
+		return 0;
+
+	/* Add survivors into outslots */
+	TupleBatchRewind(b);
+	i = 0;
+	while (TupleBatchHasMore(b))
+	{
+		TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+		/* mask bit set => row survives */
+		if (mask[i >> 6] & (UINT64CONST(1) << (i & 63)))
+			TupleBatchStoreInOut(b, kept++, slot);
+		i++;
+	}
+
+	return kept;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index a4cf1e51af0..e5ca619731f 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -401,6 +401,8 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
 		}
 	}
+
+	scanstate->ss.ps.qual_batch = ExecInitQualBatch((PlanState *) scanstate);
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 45346124bd7..b97d5faebde 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3033,6 +3033,17 @@ llvm_compile_expr(ExprState *state)
 				LLVMBuildBr(b, opblocks[opno + 1]);
 				break;
 
+			case EEOP_QUAL_BATCH_INITMASK:
+				build_EvalXFunc(b, mod, "ExecQualBatchInitMask",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+			case EEOP_QUAL_BATCH_TERM:
+				build_EvalXFunc(b, mod, "ExecQualBatchTerm",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
 			case EEOP_LAST:
 				Assert(false);
 				break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 1b5e06f60cc..f4f756e7cb5 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -187,4 +187,6 @@ void	   *referenced_functions[] =
 	ExecBuildOuterBatchVector,
 	ExecBuildScanBatchVector,
 	ExecAggPlainTransBatch,
+	ExecQualBatchInitMask,
+	ExecQualBatchTerm,
 };
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index f24782ecf58..f50936acaaa 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -306,6 +306,10 @@ typedef enum ExprEvalOp
 	EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP,	/* per-row fmgr calls */
 	EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT,	/* call transfn once with AggBulkArgs */
 
+	/* Batched qual evaluation */
+	EEOP_QUAL_BATCH_INITMASK,
+	EEOP_QUAL_BATCH_TERM,
+
 	/* non-existent operation, used e.g. to check array lengths */
 	EEOP_LAST
 } ExprEvalOp;
@@ -796,6 +800,21 @@ typedef struct ExprEvalStep
 		{
 			struct BatchVector *bv;
 		}			batch_vector;
+
+		struct
+		{
+			struct BatchVector *bv; /* filled earlier by EEOP_BUILD_SCAN_BATCH_VECTOR */
+			uint64			   *mask;        /* shared mask buffer for this program */
+			int					mask_words;  /* ceil(es_max_batch/64) */
+		}			qualbatch_init;                    /* EEOP_QUAL_BATCH_INITMASK */
+
+		struct
+		{
+			struct BatchVector *bv; /* same bv as init */
+			uint64			   *mask;        /* same mask buffer */
+			int					mask_words;  /* same word count */
+			struct BatchQualTerm *term;      /* compiled leaf */
+		}			qualbatch_term;                    /* EEOP_QUAL_BATCH_TERM */
 	}			d;
 } ExprEvalStep;
 
@@ -975,4 +994,45 @@ extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprCo
 extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 
 extern void ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
+/* See ExecQualBatchTerm(). */
+typedef enum BatchQualTermKind
+{
+	BQTK_VAR_CONST,
+	BQTK_VAR_VAR,
+	BQTK_IS_NULL,
+	BQTK_IS_NOT_NULL,
+} BatchQualTermKind;
+
+typedef struct BatchQualTerm
+{
+	BatchQualTermKind kind;
+	bool		strict;		/* follow strict NULL semantics if true */
+	int16		l_off;		/* left VAR column (index into BatchVector) */
+	int16		r_off;		/* right VAR column, or -1 if Const */
+	Datum		r_const;	/* for VAR_CONST */
+	bool		r_isnull;	/* for VAR_CONST */
+	FmgrInfo   *finfo;		/* fmgr for generic binary ops */
+	Oid			collation;	/* op collation */
+} BatchQualTerm;
+
+/*
+ * Runtime view for batched qual programs.
+ * Owned by the ExprState; lifetime == ExprState.
+ */
+typedef struct BatchQualRuntime
+{
+	uint64 *mask;
+	int		mask_words;
+} BatchQualRuntime;
+
+static inline BatchQualRuntime *
+ExecGetBatchQualRuntime(ExprState *batch_qual)
+{
+	return (BatchQualRuntime *) batch_qual->batch_private;
+}
+
+extern void ExecQualBatchInitMask(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecQualBatchTerm(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index fb4b57a831c..568a7a33b7d 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -304,7 +304,8 @@ ExecScanExtendedBatch(ScanState *node,
 {
 	ExprContext *econtext = node->ps.ps_ExprContext;
 	TupleBatch *b = node->ps.ps_Batch;
-	int			qualified;
+	ExprState  *qual_batch = node->ps.qual_batch;
+	int			qualified = 0;
 
 	/* Batch path does not support EPQ */
 	Assert(node->ps.state->es_epq_active == NULL);
@@ -320,23 +321,31 @@ ExecScanExtendedBatch(ScanState *node,
 
 		if (qual != NULL)
 		{
-			qualified = 0;
-			while (TupleBatchHasMore(b))
+			ResetExprContext(econtext);
+			if (qual_batch)
 			{
-				TupleTableSlot *in = TupleBatchGetNextSlot(b);
-
-				Assert(in);
-				ResetExprContext(econtext);
-				econtext->ecxt_scantuple = in;
+				qualified = ExecQualBatch(qual_batch, econtext, b);
+			}
+			else
+			{
+				int		i = 0;
 
-				if (ExecQual(qual, econtext))
+				while (TupleBatchHasMore(b))
 				{
-					TupleBatchStoreInOut(b, qualified, in);
-					qualified++;
+					TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+					Assert(slot);
+					econtext->ecxt_scantuple = slot;
+					if (ExecQual(qual, econtext))
+					{
+						TupleBatchStoreInOut(b, qualified, slot);
+						qualified++;
+					}
+					i++;
 				}
-				else
-					InstrCountFiltered1(node, 1);
 			}
+			InstrCountFiltered1(node, b->nvalid - qualified);
+			/* Update count and start using b->outslots. */
 			TupleBatchUseOutput(b, qualified);
 		}
 		else
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index c72bd755b79..dd0f2c74ae5 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -333,6 +333,7 @@ ExecProcNodeBatch(PlanState *node)
 extern ExprState *ExecInitExpr(Expr *node, PlanState *parent);
 extern ExprState *ExecInitExprWithParams(Expr *node, ParamListInfo ext_params);
 extern ExprState *ExecInitQual(List *qual, PlanState *parent);
+extern ExprState *ExecInitQualBatch(PlanState *ps);
 extern ExprState *ExecInitCheck(List *qual, PlanState *parent);
 extern List *ExecInitExprList(List *nodes, PlanState *parent);
 extern ExprState *ExecBuildAggTrans(AggState *aggstate, struct AggStatePerPhaseData *phase,
@@ -581,6 +582,8 @@ AggGetBulkArgs(FunctionCallInfo fcinfo)
 }
 #endif
 
+extern int ExecQualBatch(ExprState *state, ExprContext *econtext, TupleBatch *b);
+
 extern bool ExecCheck(ExprState *state, ExprContext *econtext);
 
 /*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index fdfe8b4ddaf..78c5abbb23a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -146,6 +146,9 @@ typedef struct ExprState
 	 * ExecInitExprRec().
 	 */
 	ErrorSaveContext *escontext;
+
+	/* batched-program runtime (e.g., BatchQualRuntime) */
+	void	 *batch_private;
 } ExprState;
 
 
@@ -1196,6 +1199,7 @@ typedef struct PlanState
 	 * subPlan list, which does not exist in the plan tree).
 	 */
 	ExprState  *qual;			/* boolean qual condition */
+	ExprState  *qual_batch;		/* boolean qual condition evaluated on batches */
 	PlanState  *lefttree;		/* input plan tree(s) */
 	PlanState  *righttree;
 
-- 
2.47.3



  [application/octet-stream] v3-0009-Blind-guess-at-fixing-segfault-on-running-tpch-q2.patch (11.6K, 5-v3-0009-Blind-guess-at-fixing-segfault-on-running-tpch-q2.patch)
  download | inline diff:
From 92ef364a8f650022a139bc32a2e518804a41767a Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Wed, 8 Oct 2025 08:06:59 -0400
Subject: [PATCH v3 9/9] Blind guess at fixing segfault on running tpch q22

---
 src/backend/executor/execExprInterp.c | 225 ++++++++++++++------------
 src/backend/jit/llvm/llvmjit_expr.c   |   7 +-
 src/backend/jit/llvm/llvmjit_types.c  |   3 +-
 src/include/executor/execExpr.h       |   3 +-
 4 files changed, 136 insertions(+), 102 deletions(-)

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index c2b76a5e5db..aee37cf50d5 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -2343,7 +2343,7 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP)
 		{
 			/* too complex for an inline implementation */
-			ExecAggPlainTransBatch(state, op, econtext);
+			ExecAggPlainTransBatchRowloop(state, op, econtext);
 
 			EEO_NEXT();
 		}
@@ -2351,7 +2351,8 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		EEO_CASE(EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT)
 		{
 			/* too complex for an inline implementation */
-			ExecAggPlainTransBatch(state, op, econtext);
+			ExecAggPlainTransBatchDirect(state, op, econtext);
+
 			EEO_NEXT();
 		}
 
@@ -6072,131 +6073,157 @@ ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
 	bv->nrows = i;
 }
 
-void
-ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+static bool
+ExecAggPlainTransBatchInitTrans(ExprState *state, ExprEvalStep *op,
+								TupleBatch *b)
 {
 	AggState   *aggstate = castNode(AggState, state->parent);
 	AggStatePerTrans	pertrans = op->d.agg_trans.pertrans;
 	AggStatePerGroup pergroup =
 		&aggstate->all_pergroups[op->d.agg_trans.setoff][op->d.agg_trans.transno];
 	BatchVectorSlice  *bvs = op->d.agg_trans.bvs;
+	const BatchVector *bv = bvs ? bvs->bv : NULL;
+	int		batch_nrows = bvs ? bvs->bv->nrows : b->nvalid;
+	bool	found = false;
 	FunctionCallInfo	fcinfo = pertrans->transfn_fcinfo;
 	FmgrInfo		   *finfo = fcinfo->flinfo;
-	Datum		newVal;
-	TupleBatch *batch = econtext->outer_batch;
-	int			batch_nrows = bvs ? bvs->bv->nrows : batch->nvalid;
-	int			start_row = 0;
 
-	if (finfo->fn_strict)
+	if (!finfo->fn_strict || bvs == NULL)
+		return false;
+
+	for (int i = 0; i < batch_nrows; i++)
 	{
-		if (pergroup->noTransValue && bvs)
+		for (int j = 0; j < bvs->nargs; j++)
 		{
-			const BatchVector *bv = bvs->bv;
-			bool	found = false;
-
-			Assert(bv);
-			for (int i = 0; i < batch_nrows; i++)
+			if (!bv->nulls[bvs->argoffs[j]][i])
 			{
-				for (int j = 0; j < bvs->nargs; j++)
+				fcinfo->args[1].value = bv->cols[bvs->argoffs[j]][i];
+				fcinfo->args[1].isnull = false;
+				if (j == bvs->nargs - 1)
 				{
-					if (!bv->nulls[bvs->argoffs[j]][i])
-					{
-						fcinfo->args[1].value = bv->cols[bvs->argoffs[j]][i];
-						fcinfo->args[1].isnull = false;
-						if (j == bvs->nargs - 1)
-						{
-							found = true;
-							break;
-						}
-					}
-				}
-				if (found)
+					found = true;
 					break;
+				}
 			}
-			/* If transValue has not yet been initialized, do so now. */
-			ExecAggInitGroup(aggstate, pertrans, pergroup,
-							 op->d.agg_trans.aggcontext);
-			start_row = 1;
 		}
-		else if (pergroup->transValueIsNull)
+		if (found)
+			break;
+	}
+	/* If transValue has not yet been initialized, do so now. */
+	ExecAggInitGroup(aggstate, pertrans, pergroup,
+					 op->d.agg_trans.aggcontext);
+	return true;
+}
+
+void
+ExecAggPlainTransBatchDirect(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	AggState   *aggstate = castNode(AggState, state->parent);
+	AggStatePerTrans	pertrans = op->d.agg_trans.pertrans;
+	AggStatePerGroup pergroup =
+		&aggstate->all_pergroups[op->d.agg_trans.setoff][op->d.agg_trans.transno];
+	BatchVectorSlice  *bvs = op->d.agg_trans.bvs;
+	FunctionCallInfo	fcinfo = pertrans->transfn_fcinfo;
+	Datum		newVal;
+	TupleBatch *b = econtext->outer_batch;
+	int			batch_nrows = bvs ? bvs->bv->nrows : b->nvalid;
+	int			start_row = 0;
+	void	   *save = fcinfo->flinfo->fn_extra;
+	AggBulkArgs ba = {batch_nrows, start_row};
+
+	if (pergroup->noTransValue)
+	{
+		if (ExecAggPlainTransBatchInitTrans(state, op, b))
+			start_row = 1;
+		else if (pergroup->transValueIsNull && fcinfo->flinfo->fn_strict)
 			return;
 	}
 
-	switch (ExecEvalStepOp(state, op))
+	if (bvs)
 	{
-		case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
-			/* Loop rows, call the original transfn per element using vector cols. */
-			for (int i = start_row; i < batch_nrows; i++)
-			{
-				bool hasnull = false;
+		const BatchVector *bv = bvs->bv;
+
+		Assert(bv);
+		ba.nargs = bvs->nargs;
+		ba.argoffs = bvs->argoffs;
+		ba.args = bv->cols;
+		ba.isnull = bv->nulls;
+		ba.hasnull = bv->hasnull;
+	}
+	fcinfo->flinfo->fn_extra = &ba;
+	fcinfo->args[0].value = pergroup->transValue;
+	fcinfo->args[0].isnull = pergroup->transValueIsNull;
+	fcinfo->isnull = false;		/* just in case transfn doesn't set it */
+	newVal = FunctionCallInvoke(fcinfo);   /* one call for the entire slice */
+	if (!pertrans->transtypeByVal &&
+		DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+		newVal = ExecAggCopyTransValue(aggstate, pertrans,
+									   newVal, fcinfo->isnull,
+									   pergroup->transValue,
+									   pergroup->transValueIsNull);
+	pergroup->transValue = newVal;
+	pergroup->transValueIsNull = fcinfo->isnull;
+	fcinfo->flinfo->fn_extra = save;
+}
 
-				/* Set up fcinfo args 1..m from column vectors at row i. */
-				if (bvs)
-				{
-					const BatchVector *bv = bvs->bv;
+void
+ExecAggPlainTransBatchRowloop(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	AggState   *aggstate = castNode(AggState, state->parent);
+	AggStatePerTrans	pertrans = op->d.agg_trans.pertrans;
+	AggStatePerGroup pergroup =
+		&aggstate->all_pergroups[op->d.agg_trans.setoff][op->d.agg_trans.transno];
+	BatchVectorSlice  *bvs = op->d.agg_trans.bvs;
+	FunctionCallInfo	fcinfo = pertrans->transfn_fcinfo;
+	FmgrInfo		   *finfo = fcinfo->flinfo;
+	Datum		newVal;
+	TupleBatch *b = econtext->outer_batch;
+	int			batch_nrows = bvs ? bvs->bv->nrows : b->nvalid;
+	int			start_row = 0;
 
-					for (int j = 0; j < bvs->nargs; j++)
-					{
-						int16	argoff = bvs->argoffs[j];
+	if (pergroup->noTransValue)
+	{
+		if (ExecAggPlainTransBatchInitTrans(state, op, b))
+			start_row = 1;
+		else if (pergroup->transValueIsNull && fcinfo->flinfo->fn_strict)
+			return;
+	}
 
-						fcinfo->args[j+1].value = bv->cols[argoff][i];
-						fcinfo->args[j+1].isnull = bv->nulls[argoff][i];
-						if (!hasnull && bv->nulls[argoff][i])
-							hasnull = true;
-					}
-				}
-				/* fcinfo->args[0] is the existing transition state */
-				if (finfo->fn_strict && hasnull)
-					continue;
-				fcinfo->args[0].value = pergroup->transValue;
-				fcinfo->args[0].isnull = pergroup->transValueIsNull;
-				newVal = FunctionCallInvoke(fcinfo);
-				if (!pertrans->transtypeByVal &&
-					DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
-					newVal = ExecAggCopyTransValue(aggstate, pertrans,
-												   newVal, fcinfo->isnull,
-												   pergroup->transValue,
-												   pergroup->transValueIsNull);
-				pergroup->transValue = newVal;
-				pergroup->transValueIsNull = fcinfo->isnull;
-			}
-			break;
+	/* Loop rows, call the original transfn per element using vector cols. */
+	for (int i = start_row; i < batch_nrows; i++)
+	{
+		bool hasnull = false;
 
-		case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
+		/* Set up fcinfo args 1..m from column vectors at row i. */
+		if (bvs)
+		{
+			const BatchVector *bv = bvs->bv;
+
+			for (int j = 0; j < bvs->nargs; j++)
 			{
-				void *save = fcinfo->flinfo->fn_extra;
-				AggBulkArgs ba = {batch_nrows, start_row};
+				int16	argoff = bvs->argoffs[j];
 
-				if (bvs)
-				{
-					const BatchVector *bv = bvs->bv;
-
-					Assert(bv);
-					ba.nargs = bvs->nargs;
-					ba.argoffs = bvs->argoffs;
-					ba.args = bv->cols;
-					ba.isnull = bv->nulls;
-					ba.hasnull = bv->hasnull;
-				}
-				fcinfo->flinfo->fn_extra = &ba;
-				fcinfo->args[0].value = pergroup->transValue;
-				fcinfo->args[0].isnull = pergroup->transValueIsNull;
-				fcinfo->isnull = false;		/* just in case transfn doesn't set it */
-				newVal = FunctionCallInvoke(fcinfo);   /* one call for the entire slice */
-				if (!pertrans->transtypeByVal &&
-					DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
-					newVal = ExecAggCopyTransValue(aggstate, pertrans,
-												   newVal, fcinfo->isnull,
-												   pergroup->transValue,
-												   pergroup->transValueIsNull);
-				pergroup->transValue = newVal;
-				pergroup->transValueIsNull = fcinfo->isnull;
-				fcinfo->flinfo->fn_extra = save;
+				fcinfo->args[j+1].value = bv->cols[argoff][i];
+				fcinfo->args[j+1].isnull = bv->nulls[argoff][i];
+				if (!hasnull && bv->nulls[argoff][i])
+					hasnull = true;
 			}
-			break;
+		}
 
-		default:
-			elog(ERROR, "invalid ExprEvalOp in ExecAggPlainTransBatch()");
+		if (finfo->fn_strict && hasnull)
+			continue;
+		/* fcinfo->args[0] is the existing transition state */
+		fcinfo->args[0].value = pergroup->transValue;
+		fcinfo->args[0].isnull = pergroup->transValueIsNull;
+		newVal = FunctionCallInvoke(fcinfo);
+		if (!pertrans->transtypeByVal &&
+			DatumGetPointer(newVal) != DatumGetPointer(pergroup->transValue))
+			newVal = ExecAggCopyTransValue(aggstate, pertrans,
+										   newVal, fcinfo->isnull,
+										   pergroup->transValue,
+										   pergroup->transValueIsNull);
+		pergroup->transValue = newVal;
+		pergroup->transValueIsNull = fcinfo->isnull;
 	}
 }
 
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index b97d5faebde..2d1c8259d1a 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -3027,8 +3027,13 @@ llvm_compile_expr(ExprState *state)
 				break;
 
 			case EEOP_AGG_PLAIN_TRANS_BATCH_DIRECT:
+				build_EvalXFunc(b, mod, "ExecAggPlainTransBatchDirect",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
 			case EEOP_AGG_PLAIN_TRANS_BATCH_ROWLOOP:
-				build_EvalXFunc(b, mod, "ExecAggPlainTransBatch",
+				build_EvalXFunc(b, mod, "ExecAggPlainTransBatchRowloop",
 								v_state, op, v_econtext);
 				LLVMBuildBr(b, opblocks[opno + 1]);
 				break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index f4f756e7cb5..2cf3a60be51 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -186,7 +186,8 @@ void	   *referenced_functions[] =
 	ExecBuildInnerBatchVector,
 	ExecBuildOuterBatchVector,
 	ExecBuildScanBatchVector,
-	ExecAggPlainTransBatch,
+	ExecAggPlainTransBatchDirect,
+	ExecAggPlainTransBatchRowloop,
 	ExecQualBatchInitMask,
 	ExecQualBatchTerm,
 };
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index f50936acaaa..a3314ffd0c9 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -993,7 +993,8 @@ extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprCo
 extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 
-extern void ExecAggPlainTransBatch(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecAggPlainTransBatchDirect(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecAggPlainTransBatchRowloop(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
 
 /* See ExecQualBatchTerm(). */
 typedef enum BatchQualTermKind
-- 
2.47.3
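
For readers following the row-loop path above, here is a minimal standalone
sketch of the seeding rule for a strict transition function with no initial
value: the first non-NULL input becomes the state, NULL inputs are skipped,
and the loop resumes after the seeding row. DemoAggState, demo_sum_transfn()
and demo_strict_rowloop() are hypothetical names used only for illustration;
they are not part of the patch, and the resume-after-seed handling is this
sketch's own choice rather than a description of the patch's start_row logic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-group transition state. */
typedef struct DemoAggState
{
	long		value;
	bool		no_trans_value; /* true until a first non-NULL input is seen */
} DemoAggState;

/* Hypothetical strict transition function: plain addition. */
static long
demo_sum_transfn(long state, long input)
{
	return state + input;
}

/* Fold one column of a batch into the state, honoring strictness. */
static void
demo_strict_rowloop(DemoAggState *st, const long *col, const bool *nulls,
					int nrows)
{
	int			i = 0;

	if (st->no_trans_value)
	{
		/* Seed the state from the first non-NULL row, if there is one. */
		while (i < nrows && nulls[i])
			i++;
		if (i == nrows)
			return;				/* all NULL: state stays uninitialized */
		st->value = col[i];
		st->no_trans_value = false;
		i++;					/* resume after the seeding row */
	}

	for (; i < nrows; i++)
	{
		if (nulls[i])
			continue;			/* strict function is not called on NULL */
		st->value = demo_sum_transfn(st->value, col[i]);
	}
}
```

Resuming from the row after the seed, rather than from a fixed offset, keeps
the seeding row from being folded in twice when the batch starts with NULLs.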



  [application/octet-stream] v3-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch (16.9K, 6-v3-0005-WIP-Add-EEOPs-and-helpers-for-TupleBatch-processi.patch)
From f3239ed6c0f196be5b495a586e6b390465d0326d Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 22 Sep 2025 17:01:29 +0900
Subject: [PATCH v3 5/9] WIP: Add EEOPs and helpers for TupleBatch processing

Introduce new EEOP cases to fetch attributes into TupleBatch
vectors:
- EEOP_{INNER,OUTER,SCAN}_FETCHSOME_BATCH
- EEOP_BUILD_{INNER,OUTER,SCAN}_BATCH_VECTOR

Add ExecBuild{Inner,Outer,Scan}BatchVector() helpers to populate
column vectors (values, nulls, nrows, hasnull) from a TupleBatch.
Extend ExprContext with inner_batch, outer_batch, and scan_batch
fields so expression programs can access active batches directly.

Add slot_getsomeattrs_batch() to prefetch attributes across all
slots in a TupleBatch, similar to slot_getsomeattrs() for one slot.
---
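
As a rough illustration of what the ExecBuild*BatchVector() helpers do, the
standalone sketch below transposes row-oriented slots into per-column value
and null arrays. DemoSlot, DemoBatchVector and demo_build_batch_vector() are
hypothetical stand-ins, not the patch's types; the fixed array sizes and the
per-batch reset of hasnull are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAXROWS 64			/* assumed cap, mirroring a 64-row batch */
#define DEMO_MAXCOLS 4
#define DEMO_NATTS   8

typedef long DemoDatum;

/* Row-oriented stand-in for one already-deformed slot. */
typedef struct DemoSlot
{
	DemoDatum	values[DEMO_NATTS];
	bool		isnull[DEMO_NATTS];
} DemoSlot;

/* Column-oriented stand-in for a batch vector. */
typedef struct DemoBatchVector
{
	int			attnos[DEMO_MAXCOLS];	/* 1-based attribute numbers */
	int			ncols;
	DemoDatum	cols[DEMO_MAXCOLS][DEMO_MAXROWS];
	bool		nulls[DEMO_MAXCOLS][DEMO_MAXROWS];
	bool		hasnull;		/* is any loaded datum NULL? */
	int			nrows;			/* rows loaded into cols/nulls */
} DemoBatchVector;

/* Copy the requested attributes of each slot into column vectors. */
static void
demo_build_batch_vector(DemoBatchVector *bv, const DemoSlot *slots,
						int nslots)
{
	assert(nslots <= DEMO_MAXROWS);
	assert(bv->ncols <= DEMO_MAXCOLS);

	bv->hasnull = false;		/* sketch's choice: reset for every batch */
	for (int i = 0; i < nslots; i++)
	{
		for (int j = 0; j < bv->ncols; j++)
		{
			int			attno = bv->attnos[j];

			assert(attno >= 1 && attno <= DEMO_NATTS);
			bv->cols[j][i] = slots[i].values[attno - 1];
			bv->nulls[j][i] = slots[i].isnull[attno - 1];
			if (bv->nulls[j][i])
				bv->hasnull = true;
		}
	}
	bv->nrows = nslots;
}
```

A batch-level hasnull flag of this kind would let a consumer skip per-row
NULL checks when a batch contains no NULLs at all.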
 src/backend/executor/execExprInterp.c | 127 +++++++++++++++++++++++++-
 src/backend/executor/execTuples.c     |  32 +++++++
 src/backend/jit/llvm/llvmjit_expr.c   |  86 +++++++++++++++++
 src/backend/jit/llvm/llvmjit_types.c  |   4 +
 src/include/executor/execExpr.h       |  45 ++++++++-
 src/include/executor/tuptable.h       |   2 +
 src/include/nodes/execnodes.h         |  24 +++--
 7 files changed, 310 insertions(+), 10 deletions(-)

diff --git a/src/backend/executor/execExprInterp.c b/src/backend/executor/execExprInterp.c
index 0e1a74976f7..68629ad7991 100644
--- a/src/backend/executor/execExprInterp.c
+++ b/src/backend/executor/execExprInterp.c
@@ -59,6 +59,7 @@
 #include "access/heaptoast.h"
 #include "catalog/pg_type.h"
 #include "commands/sequence.h"
+#include "executor/execBatch.h"
 #include "executor/execExpr.h"
 #include "executor/nodeSubplan.h"
 #include "funcapi.h"
@@ -188,6 +189,11 @@ static pg_attribute_always_inline void ExecAggPlainTransByRef(AggState *aggstate
 															  int setno);
 static char *ExecGetJsonValueItemString(JsonbValue *item, bool *resnull);
 
+static pg_attribute_always_inline void ExecBuildBatchVector(ExprState *state,
+															ExprEvalStep *op,
+															ExprContext *econtext,
+															TupleBatch *b);
+
 /*
  * ScalarArrayOpExprHashEntry
  * 		Hash table entry type used during EEOP_HASHED_SCALARARRAYOP
@@ -446,7 +452,6 @@ ExecReadyInterpretedExpr(ExprState *state)
 	state->evalfunc_private = ExecInterpExpr;
 }
 
-
 /*
  * Evaluate expression identified by "state" in the execution context
  * given by "econtext".  *isnull is set to the is-null flag for the result,
@@ -466,6 +471,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 	TupleTableSlot *scanslot;
 	TupleTableSlot *oldslot;
 	TupleTableSlot *newslot;
+	TupleBatch *innerbatch;
+	TupleBatch *outerbatch;
+	TupleBatch *scanbatch;
 
 	/*
 	 * This array has to be in the same order as enum ExprEvalOp.
@@ -479,6 +487,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_SCAN_FETCHSOME,
 		&&CASE_EEOP_OLD_FETCHSOME,
 		&&CASE_EEOP_NEW_FETCHSOME,
+		&&CASE_EEOP_INNER_FETCHSOME_BATCH,
+		&&CASE_EEOP_OUTER_FETCHSOME_BATCH,
+		&&CASE_EEOP_SCAN_FETCHSOME_BATCH,
 		&&CASE_EEOP_INNER_VAR,
 		&&CASE_EEOP_OUTER_VAR,
 		&&CASE_EEOP_SCAN_VAR,
@@ -592,6 +603,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 		&&CASE_EEOP_AGG_PRESORTED_DISTINCT_MULTI,
 		&&CASE_EEOP_AGG_ORDERED_TRANS_DATUM,
 		&&CASE_EEOP_AGG_ORDERED_TRANS_TUPLE,
+		&&CASE_EEOP_BUILD_INNER_BATCH_VECTOR,
+		&&CASE_EEOP_BUILD_OUTER_BATCH_VECTOR,
+		&&CASE_EEOP_BUILD_SCAN_BATCH_VECTOR,
 		&&CASE_EEOP_LAST
 	};
 
@@ -612,6 +626,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 	scanslot = econtext->ecxt_scantuple;
 	oldslot = econtext->ecxt_oldtuple;
 	newslot = econtext->ecxt_newtuple;
+	innerbatch = econtext->inner_batch;
+	outerbatch = econtext->outer_batch;
+	scanbatch = econtext->scan_batch;
 
 #if defined(EEO_USE_COMPUTED_GOTO)
 	EEO_DISPATCH();
@@ -658,6 +675,36 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			EEO_NEXT();
 		}
 
+		EEO_CASE(EEOP_INNER_FETCHSOME_BATCH)
+		{
+			CheckOpSlotCompatibility(op, innerslot);
+
+			Assert(innerbatch);
+			slot_getsomeattrs_batch(innerbatch, op->d.fetch_batch.last_var);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_OUTER_FETCHSOME_BATCH)
+		{
+			CheckOpSlotCompatibility(op, outerslot);
+
+			Assert(outerbatch);
+			slot_getsomeattrs_batch(outerbatch, op->d.fetch_batch.last_var);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_SCAN_FETCHSOME_BATCH)
+		{
+			CheckOpSlotCompatibility(op, scanslot);
+
+			Assert(scanbatch);
+			slot_getsomeattrs_batch(scanbatch, op->d.fetch_batch.last_var);
+
+			EEO_NEXT();
+		}
+
 		EEO_CASE(EEOP_OLD_FETCHSOME)
 		{
 			CheckOpSlotCompatibility(op, oldslot);
@@ -2265,6 +2312,30 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
 			EEO_NEXT();
 		}
 
+		EEO_CASE(EEOP_BUILD_INNER_BATCH_VECTOR)
+		{
+			/* too complex for an inline implementation */
+			ExecBuildInnerBatchVector(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_BUILD_OUTER_BATCH_VECTOR)
+		{
+			/* too complex for an inline implementation */
+			ExecBuildOuterBatchVector(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
+		EEO_CASE(EEOP_BUILD_SCAN_BATCH_VECTOR)
+		{
+			/* too complex for an inline implementation */
+			ExecBuildScanBatchVector(state, op, econtext);
+
+			EEO_NEXT();
+		}
+
 		EEO_CASE(EEOP_LAST)
 		{
 			/* unreachable */
@@ -5914,3 +5985,57 @@ ExecAggPlainTransByRef(AggState *aggstate, AggStatePerTrans pertrans,
 
 	MemoryContextSwitchTo(oldContext);
 }
+
+void
+ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	Assert(econtext->inner_batch);
+	ExecBuildBatchVector(state, op, econtext, econtext->inner_batch);
+}
+
+void
+ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	Assert(econtext->outer_batch);
+	ExecBuildBatchVector(state, op, econtext, econtext->outer_batch);
+}
+
+void
+ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext)
+{
+	Assert(econtext->scan_batch);
+	ExecBuildBatchVector(state, op, econtext, econtext->scan_batch);
+}
+
+static pg_attribute_always_inline void
+ExecBuildBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext,
+					 TupleBatch *b)
+{
+	struct BatchVector *bv = op->d.batch_vector.bv;
+	int		i = 0;
+
+	if (bv->ncols == 0)
+		return;
+
+	/* Fetch each requested attribute into column vectors. */
+	TupleBatchRewind(b);
+	while (TupleBatchHasMore(b))
+	{
+		TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+		for (int j = 0; j < bv->ncols; j++)
+		{
+			AttrNumber attno = bv->attnos[j];
+			Datum  *cols  = bv->cols[j];
+			bool   *nulls  = bv->nulls[j];
+
+			Assert(attno <= slot->tts_nvalid);
+			cols[i] = slot->tts_values[attno - 1];
+			nulls[i] = slot->tts_isnull[attno - 1];
+			if (!bv->hasnull && nulls[i])
+				bv->hasnull = true;
+		}
+		i++;
+	}
+	bv->nrows = i;
+}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 8e02d68824f..86d5dea8f8b 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -2111,6 +2111,38 @@ slot_getsomeattrs_int(TupleTableSlot *slot, int attnum)
 	}
 }
 
+void
+slot_getsomeattrs_batch(struct TupleBatch *b, int attnum)
+{
+	while (TupleBatchHasMore(b))
+	{
+		TupleTableSlot *slot = TupleBatchGetNextSlot(b);
+
+		/* Check for caller errors */
+		Assert(attnum > 0);
+
+		if (unlikely(attnum > slot->tts_tupleDescriptor->natts))
+			elog(ERROR, "invalid attribute number %d", attnum);
+
+		/* XXX - there should perhaps also be a batch-level att_nvalid */
+		if (attnum <= slot->tts_nvalid)
+			continue;
+
+		/* Fetch as many attributes as possible from the underlying tuple. */
+		slot->tts_ops->getsomeattrs(slot, attnum);
+
+		/*
+		 * If the underlying tuple doesn't have enough attributes, tuple
+		 * descriptor must have the missing attributes.
+		 */
+		if (unlikely(slot->tts_nvalid < attnum))
+		{
+			slot_getmissingattrs(slot, slot->tts_nvalid, attnum);
+			slot->tts_nvalid = attnum;
+		}
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecTypeFromTL
  *
diff --git a/src/backend/jit/llvm/llvmjit_expr.c b/src/backend/jit/llvm/llvmjit_expr.c
index 712b35df7e5..848f0b52d6f 100644
--- a/src/backend/jit/llvm/llvmjit_expr.c
+++ b/src/backend/jit/llvm/llvmjit_expr.c
@@ -109,6 +109,11 @@ llvm_compile_expr(ExprState *state)
 	LLVMValueRef v_newslot;
 	LLVMValueRef v_resultslot;
 
+	/* batches */
+	LLVMValueRef v_innerbatch;
+	LLVMValueRef v_outerbatch;
+	LLVMValueRef v_scanbatch;
+
 	/* nulls/values of slots */
 	LLVMValueRef v_innervalues;
 	LLVMValueRef v_innernulls;
@@ -221,6 +226,21 @@ llvm_compile_expr(ExprState *state)
 									 v_state,
 									 FIELDNO_EXPRSTATE_RESULTSLOT,
 									 "v_resultslot");
+	v_innerbatch = l_load_struct_gep(b,
+									 StructExprContext,
+									 v_econtext,
+									 FIELDNO_EXPRCONTEXT_INNERBATCH,
+									 "v_innerbatch");
+	v_outerbatch = l_load_struct_gep(b,
+									 StructExprContext,
+									 v_econtext,
+									 FIELDNO_EXPRCONTEXT_OUTERBATCH,
+									 "v_outerbatch");
+	v_scanbatch = l_load_struct_gep(b,
+									StructExprContext,
+									v_econtext,
+									FIELDNO_EXPRCONTEXT_SCANBATCH,
+									"v_scanbatch");
 
 	/* build global values/isnull pointers */
 	v_scanvalues = l_load_struct_gep(b,
@@ -439,6 +459,54 @@ llvm_compile_expr(ExprState *state)
 					break;
 				}
 
+			case EEOP_INNER_FETCHSOME_BATCH:
+				{
+					LLVMValueRef params[2];
+
+					params[0] = v_innerbatch;
+					params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+					l_call(b,
+						   llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+						   llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+						   params, lengthof(params), "");
+
+					LLVMBuildBr(b, opblocks[opno + 1]);
+					break;
+				}
+
+			case EEOP_OUTER_FETCHSOME_BATCH:
+				{
+					LLVMValueRef params[2];
+
+					params[0] = v_outerbatch;
+					params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+					l_call(b,
+						   llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+						   llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+						   params, lengthof(params), "");
+
+					LLVMBuildBr(b, opblocks[opno + 1]);
+					break;
+				}
+
+			case EEOP_SCAN_FETCHSOME_BATCH:
+				{
+					LLVMValueRef params[2];
+
+					params[0] = v_scanbatch;
+					params[1] = l_int32_const(lc, op->d.fetch_batch.last_var);
+
+					l_call(b,
+						   llvm_pg_var_func_type("slot_getsomeattrs_batch"),
+						   llvm_pg_func(mod, "slot_getsomeattrs_batch"),
+						   params, lengthof(params), "");
+
+					LLVMBuildBr(b, opblocks[opno + 1]);
+					break;
+				}
+
 			case EEOP_INNER_VAR:
 			case EEOP_OUTER_VAR:
 			case EEOP_SCAN_VAR:
@@ -2940,6 +3008,24 @@ llvm_compile_expr(ExprState *state)
 				LLVMBuildBr(b, opblocks[opno + 1]);
 				break;
 
+			case EEOP_BUILD_INNER_BATCH_VECTOR:
+				build_EvalXFunc(b, mod, "ExecBuildInnerBatchVector",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
+			case EEOP_BUILD_OUTER_BATCH_VECTOR:
+				build_EvalXFunc(b, mod, "ExecBuildOuterBatchVector",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
+			case EEOP_BUILD_SCAN_BATCH_VECTOR:
+				build_EvalXFunc(b, mod, "ExecBuildScanBatchVector",
+								v_state, op, v_econtext);
+				LLVMBuildBr(b, opblocks[opno + 1]);
+				break;
+
 			case EEOP_LAST:
 				Assert(false);
 				break;
diff --git a/src/backend/jit/llvm/llvmjit_types.c b/src/backend/jit/llvm/llvmjit_types.c
index 167cd554b9c..6bb527c3f6f 100644
--- a/src/backend/jit/llvm/llvmjit_types.c
+++ b/src/backend/jit/llvm/llvmjit_types.c
@@ -179,7 +179,11 @@ void	   *referenced_functions[] =
 	MakeExpandedObjectReadOnlyInternal,
 	slot_getmissingattrs,
 	slot_getsomeattrs_int,
+	slot_getsomeattrs_batch,
 	strlen,
 	varsize_any,
 	ExecInterpExprStillValid,
+	ExecBuildInnerBatchVector,
+	ExecBuildOuterBatchVector,
+	ExecBuildScanBatchVector,
 };
diff --git a/src/include/executor/execExpr.h b/src/include/executor/execExpr.h
index 75366203706..99c86bac702 100644
--- a/src/include/executor/execExpr.h
+++ b/src/include/executor/execExpr.h
@@ -78,6 +78,11 @@ typedef enum ExprEvalOp
 	EEOP_OLD_FETCHSOME,
 	EEOP_NEW_FETCHSOME,
 
+	/* apply slot_getsomeattrs_batch() to corresponding batch */
+	EEOP_INNER_FETCHSOME_BATCH,
+	EEOP_OUTER_FETCHSOME_BATCH,
+	EEOP_SCAN_FETCHSOME_BATCH,
+
 	/* compute non-system Var value */
 	EEOP_INNER_VAR,
 	EEOP_OUTER_VAR,
@@ -292,11 +297,15 @@ typedef enum ExprEvalOp
 	EEOP_AGG_ORDERED_TRANS_DATUM,
 	EEOP_AGG_ORDERED_TRANS_TUPLE,
 
+	/* ExprContext.*_batch -> BatchVector */
+	EEOP_BUILD_INNER_BATCH_VECTOR,
+	EEOP_BUILD_OUTER_BATCH_VECTOR,
+	EEOP_BUILD_SCAN_BATCH_VECTOR,
+
 	/* non-existent operation, used e.g. to check array lengths */
 	EEOP_LAST
 } ExprEvalOp;
 
-
 typedef struct ExprEvalStep
 {
 	/*
@@ -331,6 +340,12 @@ typedef struct ExprEvalStep
 			const TupleTableSlotOps *kind;
 		}			fetch;
 
+		struct
+		{
+			/* attribute number up to which to fetch (inclusive) */
+			int			last_var;
+		}			fetch_batch;
+
 		/* for EEOP_INNER/OUTER/SCAN/OLD/NEW_[SYS]VAR */
 		struct
 		{
@@ -769,6 +784,12 @@ typedef struct ExprEvalStep
 			void	   *json_coercion_cache;
 			ErrorSaveContext *escontext;
 		}			jsonexpr_coercion;
+
+		/* for batch vector construction */
+		struct
+		{
+			struct BatchVector *bv;
+		}			batch_vector;
 	}			d;
 } ExprEvalStep;
 
@@ -917,4 +938,26 @@ extern void ExecEvalAggOrderedTransDatum(ExprState *state, ExprEvalStep *op,
 extern void ExecEvalAggOrderedTransTuple(ExprState *state, ExprEvalStep *op,
 										 ExprContext *econtext);
 
+/* ---------- BatchVector stuff ------------- */
+
+/* Vector fetch spec for a list of simple Vars. */
+typedef struct BatchVector
+{
+	/* immutable after BatchVectorCreate */
+	AttrNumber *attnos;		/* [ncols] */
+	int			ncols;
+	int			maxrows;
+	int			last_var;
+
+	/* per batch state */
+	Datum **cols;			/* [ncols][maxrows] */
+	bool  **nulls;			/* [ncols][maxrows] */
+	bool	hasnull;		/* is any datum in cols NULL? */
+	int		nrows;			/* #rows loaded into cols/nulls */
+} BatchVector;
+
+extern void ExecBuildInnerBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecBuildOuterBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+extern void ExecBuildScanBatchVector(ExprState *state, ExprEvalStep *op, ExprContext *econtext);
+
 #endif							/* EXEC_EXPR_H */
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 43f1d999b91..82369fa6e8e 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -346,6 +346,8 @@ extern Datum ExecFetchSlotHeapTupleDatum(TupleTableSlot *slot);
 extern void slot_getmissingattrs(TupleTableSlot *slot, int startAttNum,
 								 int lastAttNum);
 extern void slot_getsomeattrs_int(TupleTableSlot *slot, int attnum);
+struct TupleBatch;
+extern void slot_getsomeattrs_batch(struct TupleBatch *b, int attnum);
 
 
 #ifndef FRONTEND
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 9b81b842161..fdfe8b4ddaf 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -277,6 +277,14 @@ typedef struct ExprContext
 #define FIELDNO_EXPRCONTEXT_OUTERTUPLE 3
 	TupleTableSlot *ecxt_outertuple;
 
+	/* For batched evaluation using batch-aware EEOPs */
+#define FIELDNO_EXPRCONTEXT_INNERBATCH 4
+	TupleBatch	   *inner_batch;
+#define FIELDNO_EXPRCONTEXT_OUTERBATCH 5
+	TupleBatch	   *outer_batch;
+#define FIELDNO_EXPRCONTEXT_SCANBATCH 6
+	TupleBatch	   *scan_batch;
+
 	/* Memory contexts for expression evaluation --- see notes above */
 	MemoryContext ecxt_per_query_memory;
 	MemoryContext ecxt_per_tuple_memory;
@@ -289,27 +297,27 @@ typedef struct ExprContext
 	 * Values to substitute for Aggref nodes in the expressions of an Agg
 	 * node, or for WindowFunc nodes within a WindowAgg node.
 	 */
-#define FIELDNO_EXPRCONTEXT_AGGVALUES 8
+#define FIELDNO_EXPRCONTEXT_AGGVALUES 11
 	Datum	   *ecxt_aggvalues; /* precomputed values for aggs/windowfuncs */
-#define FIELDNO_EXPRCONTEXT_AGGNULLS 9
+#define FIELDNO_EXPRCONTEXT_AGGNULLS 12
 	bool	   *ecxt_aggnulls;	/* null flags for aggs/windowfuncs */
 
 	/* Value to substitute for CaseTestExpr nodes in expression */
-#define FIELDNO_EXPRCONTEXT_CASEDATUM 10
+#define FIELDNO_EXPRCONTEXT_CASEDATUM 13
 	Datum		caseValue_datum;
-#define FIELDNO_EXPRCONTEXT_CASENULL 11
+#define FIELDNO_EXPRCONTEXT_CASENULL 14
 	bool		caseValue_isNull;
 
 	/* Value to substitute for CoerceToDomainValue nodes in expression */
-#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 12
+#define FIELDNO_EXPRCONTEXT_DOMAINDATUM 15
 	Datum		domainValue_datum;
-#define FIELDNO_EXPRCONTEXT_DOMAINNULL 13
+#define FIELDNO_EXPRCONTEXT_DOMAINNULL 16
 	bool		domainValue_isNull;
 
 	/* Tuples that OLD/NEW Var nodes in RETURNING may refer to */
-#define FIELDNO_EXPRCONTEXT_OLDTUPLE 14
+#define FIELDNO_EXPRCONTEXT_OLDTUPLE 17
 	TupleTableSlot *ecxt_oldtuple;
-#define FIELDNO_EXPRCONTEXT_NEWTUPLE 15
+#define FIELDNO_EXPRCONTEXT_NEWTUPLE 18
 	TupleTableSlot *ecxt_newtuple;
 
 	/* Link to containing EState (NULL if a standalone ExprContext) */
-- 
2.47.3



  [application/octet-stream] v3-0001-Add-batch-table-AM-API-and-heapam-implementation.patch (13.7K, 7-v3-0001-Add-batch-table-AM-API-and-heapam-implementation.patch)
From 51192c52275005649df88b5e3a75360942dc0fcd Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 21:56:17 +0900
Subject: [PATCH v3 1/9] Add batch table AM API and heapam implementation

Introduce new table AM callbacks to fetch multiple tuples per call.
This reduces per-tuple call overhead by letting executor nodes work
in batches.

Define a HeapBatch structure and supporting code in tableam.h.
Batches are limited to tuples from a single page and at most
EXEC_BATCH_ROWS (currently 64) entries.

Provide initial heapam support with heapgettup_pagemode_batch().
No executor node is switched over yet; a later commit will adapt
SeqScan to use this API. Other nodes may adopt it in the future.

Also add pgstat_count_heap_getnext_batch() to record batched fetches
in pgstat.
---
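
The single-page, capped-size batching rule described above can be sketched
in isolation as follows. DemoPage, DemoScan and demo_next_batch() are
hypothetical names invented for this note and model only the control flow
(skip empty pages, never cross a page boundary, return 0 at end of scan);
buffer pins, visibility checks and scan keys are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_VISIBLE 8

/* Stand-in for a heap page's list of visible tuples. */
typedef struct DemoPage
{
	int			ntuples;
	int			tuples[DEMO_MAX_VISIBLE];
} DemoPage;

/* Stand-in for the scan position. */
typedef struct DemoScan
{
	const DemoPage *pages;
	int			npages;
	int			cur_page;		/* -1 before the scan starts */
	int			next_index;		/* next unread entry on cur_page */
} DemoScan;

/*
 * Fill 'out' with up to 'maxitems' tuples taken from exactly one page.
 * Returns the number of tuples written, or 0 at end of scan.
 */
static int
demo_next_batch(DemoScan *scan, int *out, int maxitems)
{
	const DemoPage *page;
	int			nout = 0;

	assert(maxitems > 0);

	/* Advance until the current page has unread tuples, or we run out. */
	while (scan->cur_page < 0 ||
		   scan->next_index >= scan->pages[scan->cur_page].ntuples)
	{
		if (scan->cur_page + 1 >= scan->npages)
			return 0;			/* end of scan */
		scan->cur_page++;
		scan->next_index = 0;
	}

	/* From here on, read only from this single page. */
	page = &scan->pages[scan->cur_page];
	while (scan->next_index < page->ntuples && nout < maxitems)
		out[nout++] = page->tuples[scan->next_index++];

	return nout;
}
```

With pages holding 3, 0 and 2 visible tuples and maxitems = 2, successive
calls would return batches of 2, 1 and 2 tuples and then 0, so a short batch
does not by itself signal end of scan.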
 src/backend/access/heap/heapam.c         | 212 ++++++++++++++++++++++-
 src/backend/access/heap/heapam_handler.c |   4 +
 src/include/access/heapam.h              |  21 +++
 src/include/access/tableam.h             |  58 +++++++
 src/include/pgstat.h                     |   5 +
 5 files changed, 299 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 568696333c2..8b9a80449c1 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1008,7 +1008,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 					int nkeys,
 					ScanKey key)
 {
-	HeapTuple	tuple = &(scan->rs_ctup);
+	HeapTuple tuple = &scan->rs_ctup;
 	Page		page;
 	uint32		lineindex;
 	uint32		linesleft;
@@ -1089,6 +1089,121 @@ continue_page:
 	scan->rs_inited = false;
 }
 
+/*
+ * heapgettup_pagemode_batch
+ *		Collect up to 'maxitems' visible tuples from a single page in page mode.
+ *
+ * This function returns a *batch* of tuples from one heap page. If the
+ * current page (as tracked by the scan desc) has no more tuples left,
+ * it will advance to the next page and prepare it (via heap_prepare_pagescan).
+ * It will not cross a page boundary while filling the batch.
+ *
+ * Return value:
+ *		number of tuples written into 'tdata' (0 at end-of-scan).
+ *
+ * Side effects:
+ *	- Ensures rs_cbuf pins the page from which tuples were produced.
+ *	- Sets rs_cblock, rs_cindex, rs_ntuples consistently (same as
+ *	  heapgettup_pagemode's inner-loop effects).
+ *	- Does *not* change buffer pin counts except through normal page
+ *	  transitions performed by heap_fetch_next_buffer().
+ */
+static int
+heapgettup_pagemode_batch(HeapScanDesc scan,
+						  ScanDirection dir,
+						  int nkeys, ScanKey key,
+						  HeapTupleData *tdata,
+						  int maxitems)
+{
+	Page		page;
+	uint32		lineindex;
+	uint32		linesleft;
+	int			nout = 0;
+
+	Assert(ScanDirectionIsForward(dir));
+	Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
+	Assert(maxitems > 0);
+
+	/*
+	 * If we have no current page (or the current page is exhausted),
+	 * advance to the next page that has any visible tuples and prepare it.
+	 * This mirrors the outer loop of heapgettup_pagemode(), but we stop
+	 * as soon as we have a prepared page; we never produce from two pages.
+	 */
+	for (;;)
+	{
+		if (BufferIsValid(scan->rs_cbuf))
+		{
+			/* Are there more visible tuples left on this page? */
+			lineindex = scan->rs_cindex + dir;
+			if (ScanDirectionIsForward(dir))
+				linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
+					(scan->rs_ntuples - lineindex) : 0;
+			else
+				linesleft = scan->rs_cindex;
+			if (linesleft > 0)
+				break;	/* continue on this page */
+		}
+
+		/* Move to next page and prepare its visible tuple list. */
+		heap_fetch_next_buffer(scan, dir);
+
+		if (!BufferIsValid(scan->rs_cbuf))
+		{
+			/* end of scan; keep rs_cbuf invalid like heapgettup_pagemode */
+			scan->rs_cblock = InvalidBlockNumber;
+			scan->rs_prefetch_block = InvalidBlockNumber;
+			scan->rs_inited = false;
+			return 0;
+		}
+
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+		heap_prepare_pagescan((TableScanDesc) scan);
+
+		/* After prepare, either rs_ntuples > 0 or we'll loop again. */
+		if (scan->rs_ntuples > 0)
+		{
+			lineindex = ScanDirectionIsForward(dir) ? 0 : scan->rs_ntuples - 1;
+			linesleft = scan->rs_ntuples;
+			break;
+		}
+		/* else: page had no visible tuples; continue to next page */
+	}
+
+	/* From here on, we must only read tuples from this single page. */
+	page = BufferGetPage(scan->rs_cbuf);
+
+	/*
+	 * Walk rs_vistuples[] from 'lineindex', copying headers into tdata[]
+	 * until either the page is exhausted or the batch capacity is reached.
+	 */
+	for (; linesleft > 0 && nout < maxitems; linesleft--, lineindex += dir)
+	{
+		OffsetNumber	lineoff;
+		ItemId			lpp;
+		HeapTupleData *dst = &tdata[nout];
+
+		Assert(lineindex <= (uint32) scan->rs_ntuples);
+		lineoff = scan->rs_vistuples[lineindex];
+		lpp = PageGetItemId(page, lineoff);
+		Assert(ItemIdIsNormal(lpp));
+
+		dst->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+		dst->t_len  = ItemIdGetLength(lpp);
+		dst->t_tableOid = RelationGetRelid(scan->rs_base.rs_rd);
+		ItemPointerSet(&(dst->t_self), scan->rs_cblock, lineoff);
+
+		if (key != NULL &&
+			!HeapKeyTest(dst, RelationGetDescr(scan->rs_base.rs_rd),
+						 nkeys, key))
+			continue;
+
+		scan->rs_cindex = lineindex;
+		nout++;
+	}
+
+	return nout;
+}
 
 /* ----------------------------------------------------------------
  *					 heap access method interface
@@ -1136,6 +1251,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_base.rs_parallel = parallel_scan;
 	scan->rs_strategy = NULL;	/* set in initscan */
 	scan->rs_cbuf = InvalidBuffer;
+	scan->rs_batch_ctup = NULL;
+	scan->rs_batch_cbuf = InvalidBuffer;
 
 	/*
 	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
@@ -1315,6 +1432,8 @@ heap_endscan(TableScanDesc sscan)
 	 */
 	if (BufferIsValid(scan->rs_cbuf))
 		ReleaseBuffer(scan->rs_cbuf);
+	if (BufferIsValid(scan->rs_batch_cbuf))
+		ReleaseBuffer(scan->rs_batch_cbuf);
 
 	/*
 	 * Must free the read stream before freeing the BufferAccessStrategy.
@@ -1421,6 +1540,97 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
 	return true;
 }
 
+/*---------- Batching support -----------*/
+
+/*
+ * heap_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is palloc'd in the caller's current memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+{
+	HeapBatch  *hb;
+	Oid			relid;
+
+	Assert(maxitems > 0);
+
+	hb = palloc(sizeof(HeapBatch));
+	hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
+	hb->maxitems = maxitems;
+	hb->nitems = 0;
+	hb->buf = InvalidBuffer;
+
+	/* Initialize static fields of HeapTupleData. Row bodies remain on page. */
+	relid = RelationGetRelid(sscan->rs_rd);
+	for (int i = 0; i < maxitems; i++)
+		hb->tupdata[i].t_tableOid = relid;
+
+	return hb;
+}
+
+/*
+ * heap_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
+{
+	HeapBatch *hb = (HeapBatch *) am_batch;
+
+	if (BufferIsValid(hb->buf))
+		ReleaseBuffer(hb->buf);
+
+	pfree(hb->tupdata);
+	pfree(hb);
+}
+
+int
+heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+	HeapScanDesc scan = (HeapScanDesc) sscan;
+	HeapBatch  *hb = (HeapBatch *) am_batch;
+	Buffer		curbuf;
+	int			n;
+
+	Assert(ScanDirectionIsForward(dir));
+	Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+	Assert(hb->maxitems > 0);
+
+	/* Drop prior batch pin, if any. */
+	if (BufferIsValid(hb->buf))
+	{
+		ReleaseBuffer(hb->buf);
+		hb->buf = InvalidBuffer;
+	}
+
+	hb->nitems = 0;
+
+	/* One call per batch, never crosses a page. */
+	n = heapgettup_pagemode_batch(scan, dir,
+								  sscan->rs_nkeys, sscan->rs_key,
+								  hb->tupdata, hb->maxitems);
+
+	if (n == 0)
+		return 0;	/* end of scan */
+
+	/* Hold an extra pin for the batch lifetime so t_data stays valid. */
+	curbuf = scan->rs_cbuf;
+	IncrBufferRefCount(curbuf);
+	hb->buf = curbuf;
+
+	/* Count all 'n' returned tuples with a single pgstat update. */
+	pgstat_count_heap_getnext_batch(sscan->rs_rd, n);
+
+	hb->nitems = n;
+	return n;
+}
+
+/*----- End of batching support -----*/
+
 void
 heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 				  ItemPointer maxtid)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index bcbac844bb6..ec4eeccf19c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2623,6 +2623,10 @@ static const TableAmRoutine heapam_methods = {
 	.scan_rescan = heap_rescan,
 	.scan_getnextslot = heap_getnextslot,
 
+	.scan_begin_batch = heap_begin_batch,
+	.scan_getnextbatch = heap_getnextbatch,
+	.scan_end_batch = heap_end_batch,
+
 	.scan_set_tidrange = heap_set_tidrange,
 	.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index e60d34dad25..02f7793fba0 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -74,6 +74,9 @@ typedef struct HeapScanDescData
 
 	HeapTupleData rs_ctup;		/* current tuple in scan, if any */
 
+	HeapTupleData *rs_batch_ctup;	/* NULL when not using batched mode */
+	Buffer	rs_batch_cbuf;		/* buffer feeding the batch */
+
 	/* For scans that stream reads */
 	ReadStream *rs_read_stream;
 
@@ -101,6 +104,19 @@ typedef struct HeapScanDescData
 } HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
+/*
+ * HeapBatch -- stateless per-batch buffer. A batch pins one page and
+ * exposes up to maxitems HeapTupleData headers whose t_data point into that
+ * page.
+ */
+typedef struct HeapBatch
+{
+	HeapTupleData  *tupdata;	/* len = maxitems; headers only */
+	int				nitems;		/* tuples produced in last getnextbatch() */
+	int				maxitems;	/* fixed capacity set at begin_batch() */
+	Buffer			buf;		/* single pinned buffer for this batch */
+} HeapBatch;
+
 typedef struct BitmapHeapScanDescData
 {
 	HeapScanDescData rs_heap_base;
@@ -294,6 +310,11 @@ extern void heap_endscan(TableScanDesc sscan);
 extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
 extern bool heap_getnextslot(TableScanDesc sscan,
 							 ScanDirection direction, TupleTableSlot *slot);
+
+extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
+extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
+extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+
 extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 							  ItemPointer maxtid);
 extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..953207eac50 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -351,6 +351,16 @@ typedef struct TableAmRoutine
 									 ScanDirection direction,
 									 TupleTableSlot *slot);
 
+	/* ------------------------------------------------------------------------
+	 * Batched scan support
+	 * ------------------------------------------------------------------------
+	 */
+
+	void	   *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+	int			(*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+									 ScanDirection dir);
+	void		(*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
 	/*-----------
 	 * Optional functions to provide scanning for ranges of ItemPointers.
 	 * Implementations must either provide both of these functions, or neither
@@ -1036,6 +1046,54 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
+/*
+ * table_scan_begin_batch
+ *		Allocate AM-owned batch payload with capacity 'maxitems'.
+ */
+static inline void *
+table_scan_begin_batch(TableScanDesc sscan, int maxitems)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	Assert(tam->scan_begin_batch != NULL);
+
+	return tam->scan_begin_batch(sscan, maxitems);
+}
+
+/*
+ * table_scan_getnextbatch
+ *		Fill next batch from the AM. Returns number of tuples, 0 => EOS.
+ *		Batches are single-page in v1. Direction is forward only in v1.
+ */
+static inline int
+table_scan_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	/* Only forward scans are supported in the batched mode. */
+	Assert(dir == ForwardScanDirection);
+	Assert(tam->scan_getnextbatch != NULL);
+
+	return tam->scan_getnextbatch(sscan, am_batch, dir);
+}
+
+/*
+ * table_scan_end_batch
+ *		Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, void *am_batch)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	if (am_batch == NULL)
+		return;
+
+	Assert(tam->scan_end_batch != NULL);
+
+	tam->scan_end_batch(sscan, am_batch);
+}
+
 /* ----------------------------------------------------------------------------
  * TID Range scanning related functions.
  * ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index bc8077cbae6..249f3583f92 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -691,6 +691,11 @@ extern void pgstat_report_analyze(Relation rel,
 		if (pgstat_should_count_relation(rel))						\
 			(rel)->pgstat_info->counts.tuples_returned++;			\
 	} while (0)
+#define pgstat_count_heap_getnext_batch(rel, n)						\
+	do {															\
+		if (pgstat_should_count_relation(rel))						\
+			(rel)->pgstat_info->counts.tuples_returned += (n);		\
+	} while (0)
 #define pgstat_count_heap_fetch(rel)								\
 	do {															\
 		if (pgstat_should_count_relation(rel))						\
-- 
2.47.3
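
The pin handling in heap_getnextbatch() above (drop the previous batch's pin,
fill from at most one page, then take an extra pin on that page) can be
sketched in isolation. This is a standalone toy model, not code from the patch:
SketchScan, SketchBatch and the sketch_* functions are hypothetical names, pin
counts are plain integers, and tuples are encoded as page * 100 + index.

```c
#include <assert.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL types. */
#define SKETCH_NPAGES   3
#define SKETCH_NO_PAGE  (-1)

typedef struct SketchScan
{
    int ntuples[SKETCH_NPAGES]; /* visible tuples per page */
    int pins[SKETCH_NPAGES];    /* pin count per page */
    int curpage;                /* page being read, SKETCH_NO_PAGE at start */
    int nextidx;                /* next unread tuple on curpage */
} SketchScan;

typedef struct SketchBatch
{
    int items[8];               /* tuple ids: page * 100 + index */
    int maxitems;               /* capacity in use, must be <= 8 */
    int nitems;
    int page;                   /* page pinned for this batch, if any */
} SketchBatch;

static void
sketch_scan_init(SketchScan *scan, const int *ntuples)
{
    for (int i = 0; i < SKETCH_NPAGES; i++)
    {
        scan->ntuples[i] = ntuples[i];
        scan->pins[i] = 0;
    }
    scan->curpage = SKETCH_NO_PAGE;
    scan->nextidx = 0;
}

/* Release the pin held on behalf of the batch, if any. */
static void
sketch_drop_batch_pin(SketchScan *scan, SketchBatch *b)
{
    if (b->page != SKETCH_NO_PAGE)
    {
        scan->pins[b->page]--;
        b->page = SKETCH_NO_PAGE;
    }
}

/* Fill 'b' from at most one page; returns 0 at end of scan. */
static int
sketch_getnextbatch(SketchScan *scan, SketchBatch *b)
{
    int n = 0;

    assert(b->maxitems > 0 && b->maxitems <= 8);
    sketch_drop_batch_pin(scan, b);
    b->nitems = 0;

    /* Step to the next page once the current one is exhausted. */
    while (scan->curpage < SKETCH_NPAGES &&
           (scan->curpage == SKETCH_NO_PAGE ||
            scan->nextidx >= scan->ntuples[scan->curpage]))
    {
        if (scan->curpage != SKETCH_NO_PAGE)
            scan->pins[scan->curpage]--;    /* scan's own pin */
        scan->curpage++;
        scan->nextidx = 0;
        if (scan->curpage < SKETCH_NPAGES)
            scan->pins[scan->curpage]++;
    }
    if (scan->curpage >= SKETCH_NPAGES)
        return 0;

    /* Copy until the page is exhausted or the batch is full. */
    while (n < b->maxitems && scan->nextidx < scan->ntuples[scan->curpage])
        b->items[n++] = scan->curpage * 100 + scan->nextidx++;

    /* Extra pin so the batch stays valid even after the scan moves on. */
    scan->pins[scan->curpage]++;
    b->page = scan->curpage;
    b->nitems = n;
    return n;
}
```

With pages holding 5, 0 and 3 tuples and maxitems = 4, successive calls return
4, 1 and 3 tuples and then 0: a batch is cut short at a page boundary rather
than topped up from the next page, and the empty page is skipped.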



  [application/octet-stream] v3-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch (6.3K, 8-v3-0004-WIP-Add-agg_retrieve_direct_batch-for-plain-aggre.patch)
From 87728dd22a56c35d3b7ee11e71e15a8d4193afd1 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 4 Sep 2025 22:55:25 +0900
Subject: [PATCH v3 4/9] WIP: Add agg_retrieve_direct_batch() for plain
 aggregates
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Teach Agg to consume child tuples in batches for AGG_PLAIN. A new
agg_retrieve_direct_batch() pulls TupleBatch from the child via
ExecProcNodeBatch(), materializes as needed, and advances per-agg
transition state over the batch. A first tuple is copied to match
the direct path’s behavior before batch processing.

Add AggCanUsePlainBatch() and choose retrieve_plain at init time: the
batch path is used when there are no grouping sets, the strategy is
AGG_PLAIN, and the child exposes ExecProcNodeBatch(); otherwise the row
path is kept.

Plan shape and EXPLAIN remain unchanged. Semantics are identical
to the non-batch direct path; this only reduces per-tuple overhead.
---
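
The consumption loop described in the commit message above (pull batches until
the child returns NULL, skip empty batches, advance the transition state once
per tuple) can be sketched outside the executor. This is a rough standalone
model, not the patch's code: SketchBatch, SketchChild, SketchAgg and the
sketch_* functions are hypothetical, and the "aggregate" is just a count and a
sum over integers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL types. */
typedef struct SketchBatch
{
    const int *vals;            /* tuple values in this batch */
    int        nvalid;          /* number of valid entries */
} SketchBatch;

typedef struct SketchChild
{
    const SketchBatch *batches; /* batches the child will return */
    int                nbatches;
    int                next;
} SketchChild;

typedef struct SketchAgg
{
    long count;
    long sum;
} SketchAgg;

/* Batch pull from the child; NULL means end of input. */
static const SketchBatch *
sketch_next_batch(SketchChild *child)
{
    if (child->next >= child->nbatches)
        return NULL;
    return &child->batches[child->next++];
}

/* Advance the transition state over every tuple in one batch. */
static void
sketch_advance_batch(SketchAgg *agg, const SketchBatch *b)
{
    for (int i = 0; i < b->nvalid; i++)
    {
        agg->count++;
        agg->sum += b->vals[i];
    }
}

/* Plain (ungrouped) aggregation: one result, even for empty input. */
static SketchAgg
sketch_agg_plain_batch(SketchChild *child)
{
    SketchAgg          agg = {0, 0};
    const SketchBatch *b;

    while ((b = sketch_next_batch(child)) != NULL)
    {
        if (b->nvalid == 0)
            continue;           /* empty batch: ask for the next one */
        sketch_advance_batch(&agg, b);
    }
    return agg;
}
```

Feeding it the batches {1, 2, 3}, {} and {10} yields count = 4 and sum = 16;
an input with no batches at all still yields a single result with both fields
zero, which is the plain-aggregate behavior the row path has as well.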
 src/backend/executor/nodeAgg.c | 123 +++++++++++++++++++++++++++++++++
 src/include/nodes/execnodes.h  |   5 ++
 2 files changed, 128 insertions(+)

diff --git a/src/backend/executor/nodeAgg.c b/src/backend/executor/nodeAgg.c
index a4f3d30f307..3ace6363509 100644
--- a/src/backend/executor/nodeAgg.c
+++ b/src/backend/executor/nodeAgg.c
@@ -820,6 +820,20 @@ advance_aggregates(AggState *aggstate)
 									  aggstate->tmpcontext);
 }
 
+static void
+advance_aggregates_batch(AggState *aggstate, TupleBatch *b)
+{
+	ExprContext *tmpcontext = aggstate->tmpcontext;
+	ExprState *evaltrans = aggstate->phase->evaltrans;
+
+	while (TupleBatchHasMore(b))
+	{
+		tmpcontext->ecxt_outertuple = TupleBatchGetNextSlot(b);
+		ExecEvalExprNoReturnSwitchContext(evaltrans, tmpcontext);
+		ResetExprContext(tmpcontext);
+	}
+}
+
 /*
  * Run the transition function for a DISTINCT or ORDER BY aggregate
  * with only one input.  This is called after we have completed
@@ -2260,6 +2274,9 @@ ExecAgg(PlanState *pstate)
 				result = agg_retrieve_hash_table(node);
 				break;
 			case AGG_PLAIN:
+				/* init-time choice */
+				result = node->retrieve_plain(node);
+				break;
 			case AGG_SORTED:
 				result = agg_retrieve_direct(node);
 				break;
@@ -2618,6 +2635,91 @@ agg_retrieve_direct(AggState *aggstate)
 	return NULL;
 }
 
+static TupleTableSlot *
+agg_retrieve_direct_batch(AggState *aggstate)
+{
+	PlanState *child = outerPlanState(aggstate);
+	ExprContext *econtext = aggstate->ss.ps.ps_ExprContext;
+	ExprContext *tmpcontext = aggstate->tmpcontext;
+	const bool hasGroupingSets = aggstate->phase->numsets > 0;
+	TupleTableSlot *firstSlot = aggstate->ss.ss_ScanTupleSlot;
+	TupleBatch *b = NULL;
+
+	Assert(child->ExecProcNodeBatch);
+
+	/* mimic the first-tuple copy from agg_retrieve_direct() */
+	for (;;)
+	{
+		b = ExecProcNodeBatch(child);
+		if (b == NULL)
+		{
+			if (hasGroupingSets)
+			{
+				aggstate->input_done = true;
+				break;
+			}
+			aggstate->agg_done = true;
+			break;
+		}
+		if (b->nvalid == 0)
+			continue;
+
+		TupleBatchMaterializeAll(b);
+		aggstate->grp_firstTuple = ExecCopySlotHeapTuple(TupleBatchGetSlot(b, 0));
+		break;
+	}
+
+	/* initialize_aggregates etc. as in the direct path */
+	ReScanExprContext(econtext);
+	for (int i = 0; i < Max(aggstate->phase->numsets, 1); i++)
+		ReScanExprContext(aggstate->aggcontexts[i]);
+
+	initialize_aggregates(aggstate, aggstate->pergroups,
+						  Max(aggstate->phase->numsets, 1));
+
+	if (aggstate->grp_firstTuple)
+	{
+		ExecForceStoreHeapTuple(aggstate->grp_firstTuple, firstSlot, true);
+		aggstate->grp_firstTuple = NULL;
+		tmpcontext->ecxt_outertuple = firstSlot;
+
+		advance_aggregates_batch(aggstate, b);
+		ResetExprContext(tmpcontext);
+	}
+
+	/* consume remaining rows in current and subsequent batches */
+	if (b)
+	{
+		if (TupleBatchHasMore(b))
+			advance_aggregates_batch(aggstate, b);
+		for (;;)
+		{
+			b = ExecProcNodeBatch(child);
+			if (b == NULL)
+			{
+				if (hasGroupingSets)
+					aggstate->input_done = true;
+				else
+					aggstate->agg_done = true;
+				break;
+			}
+			if (b->nvalid == 0)
+				continue;
+
+			TupleBatchMaterializeAll(b);
+			advance_aggregates_batch(aggstate, b);
+		}
+	}
+
+	/* finalize and project like the direct path */
+	econtext->ecxt_outertuple = firstSlot;
+	prepare_projection_slot(aggstate, econtext->ecxt_outertuple, 0);
+	select_current_set(aggstate, 0, false);
+	finalize_aggregates(aggstate, aggstate->peragg, aggstate->pergroups[0]);
+
+	return project_aggregates(aggstate);
+}
+
 /*
  * ExecAgg for hashed case: read input and build hash table
  */
@@ -3265,6 +3367,22 @@ hashagg_reset_spill_state(AggState *aggstate)
 	}
 }
 
+static bool
+AggCanUsePlainBatch(AggState *aggstate)
+{
+	const Agg *aggnode = (const Agg *) aggstate->ss.ps.plan;
+
+	Assert(outerPlanState(aggstate));
+
+	/* grouping sets present -> bail */
+	if (aggnode->groupingSets != NIL)
+		return false;
+
+	if (aggstate->phase->aggstrategy != AGG_PLAIN)
+		return false;
+
+	return outerPlanState(aggstate)->ExecProcNodeBatch != NULL;
+}
 
 /* -----------------
  * ExecInitAgg
@@ -4060,6 +4178,11 @@ ExecInitAgg(Agg *node, EState *estate, int eflags)
 				(errcode(ERRCODE_GROUPING_ERROR),
 				 errmsg("aggregate function calls cannot be nested")));
 
+	if (AggCanUsePlainBatch(aggstate))
+		aggstate->retrieve_plain = agg_retrieve_direct_batch;
+	else
+		aggstate->retrieve_plain = agg_retrieve_direct;
+
 	/*
 	 * Build expressions doing all the transition work at once. We build a
 	 * different one for each phase, as the number of transition function
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a104591ac20..9b81b842161 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -2535,6 +2535,9 @@ typedef struct AggStatePerGroupData *AggStatePerGroup;
 typedef struct AggStatePerPhaseData *AggStatePerPhase;
 typedef struct AggStatePerHashData *AggStatePerHash;
 
+struct AggState;
+typedef TupleTableSlot *(*AggRetrievePlainFn)(struct AggState *);
+
 typedef struct AggState
 {
 	ScanState	ss;				/* its first field is NodeTag */
@@ -2610,6 +2613,8 @@ typedef struct AggState
 	AggStatePerGroup *all_pergroups;	/* array of first ->pergroups, than
 										 * ->hash_pergroup */
 	SharedAggInfo *shared_info; /* one entry per worker */
+
+	AggRetrievePlainFn retrieve_plain; /* init-time choice */
 } AggState;
 
 /* ----------------
-- 
2.47.3



  [application/octet-stream] v3-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch (9.0K, 9-v3-0003-Executor-add-ExecProcNodeBatch-and-integrate-SeqS.patch)
From 1ee09ba42c595d108356f78a46ea4e00a03ce123 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 22:18:30 +0900
Subject: [PATCH v3 3/9] Executor: add ExecProcNodeBatch() and integrate
 SeqScan with batch API
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Introduce a batch-capable executor interface alongside the existing
slot-at-a-time path:

 * ExecProcNodeBatch() is added to return a TupleBatch instead of a
   TupleTableSlot. PlanState gains ExecProcNodeBatch as a function
   pointer.

Integrate SeqScan with this interface:

 * Add ExecSeqScanBatch* routines that drive heap via the batch table
   AM API and return a TupleBatch.
 * At init, set ps.ExecProcNodeBatch to these routines when
   ScanCanUseBatching() allows.
 * Retain ExecSeqScanBatchSlot* variants for slot-at-a-time consumers.

This builds on 0002, which introduced TupleBatch and made SeqScan
consume the AM’s batch API internally but still surface slots. With this
patch, SeqScan can surface batches directly to batch-aware upper nodes.

Plan shape and EXPLAIN output remain unchanged; only internal tuple flow
differs when batching is enabled and allowed.
---
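
The function-pointer arrangement this patch adds for ExecProcNodeBatch (a
first-call wrapper that does one-time work and then replaces itself with
either the instrumented wrapper or the real routine) can be sketched on its
own. This is a minimal standalone model under assumed names: SketchNode and
the sketch_* functions are hypothetical, and the call counters exist only to
make the dispatch observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL types. */
struct SketchNode;
typedef int (*SketchBatchFn) (struct SketchNode *node);

typedef struct SketchNode
{
    SketchBatchFn batch_fn;     /* what callers invoke */
    SketchBatchFn batch_real;   /* the node's actual routine */
    bool          instrument;   /* is instrumentation enabled? */
    int           first_calls;
    int           instr_calls;
    int           real_calls;
} SketchNode;

/* The "real" routine; returns a dummy value. */
static int
sketch_real(SketchNode *node)
{
    node->real_calls++;
    return 42;
}

/* Instrumentation wrapper: bookkeeping around the real call. */
static int
sketch_instr(SketchNode *node)
{
    node->instr_calls++;
    return node->batch_real(node);
}

/* First-call wrapper: one-time work, then swap itself out. */
static int
sketch_first(SketchNode *node)
{
    node->first_calls++;        /* one-time checks would go here */
    node->batch_fn = node->instrument ? sketch_instr : node->batch_real;
    return node->batch_fn(node);
}

/* Install 'fn' behind the first-call wrapper. */
static void
sketch_set_batch_fn(SketchNode *node, SketchBatchFn fn)
{
    node->batch_real = fn;
    node->batch_fn = sketch_first;
}
```

After three calls through batch_fn, first_calls is 1 in both the instrumented
and the uninstrumented case, so the one-time work is paid only once and later
calls go straight to the chosen routine.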
 src/backend/executor/execProcnode.c | 52 +++++++++++++++++++++++++++++
 src/backend/executor/nodeSeqscan.c  | 35 +++++++++++++++++++
 src/include/executor/execScan.h     | 51 ++++++++++++++++++++++++++++
 src/include/executor/executor.h     | 10 ++++++
 src/include/nodes/execnodes.h       |  5 +++
 5 files changed, 153 insertions(+)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f5f9cfbeead..a8c0315e874 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,6 +121,8 @@
 
 static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
 static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
+static TupleBatch *ExecProcNodeBatchFirst(PlanState *node);
+static TupleBatch *ExecProcNodeBatchInstr(PlanState *node);
 static bool ExecShutdownNode_walker(PlanState *node, void *context);
 
 
@@ -389,6 +391,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 	}
 
 	ExecSetExecProcNode(result, result->ExecProcNode);
+	if (result->ExecProcNodeBatch)
+		ExecSetExecProcNodeBatch(result, result->ExecProcNodeBatch);
 
 	/*
 	 * Initialize any initPlans present in this node.  The planner put them in
@@ -489,6 +493,54 @@ ExecProcNodeInstr(PlanState *node)
 	return result;
 }
 
+/*
+ * ExecSetExecProcNodeBatch
+ *		Install ExecProcNodeBatch with first-call wrapper, mirroring row path.
+ */
+void
+ExecSetExecProcNodeBatch(PlanState *node, ExecProcNodeBatchMtd function)
+{
+	node->ExecProcNodeBatchReal = function;
+	node->ExecProcNodeBatch = ExecProcNodeBatchFirst;
+}
+
+/*
+ * ExecProcNodeBatchFirst
+ *		One-time stack-depth check; then pick instrument/no-instrument wrapper.
+ */
+static TupleBatch *
+ExecProcNodeBatchFirst(PlanState *node)
+{
+	check_stack_depth();
+
+	if (node->instrument)
+		node->ExecProcNodeBatch = ExecProcNodeBatchInstr;
+	else
+		node->ExecProcNodeBatch = node->ExecProcNodeBatchReal;
+
+	return node->ExecProcNodeBatch(node);
+}
+
+/*
+ * ExecProcNodeBatchInstr
+ *		Instrumentation wrapper for batch calls.
+ *
+ * Note: we record the batch's nvalid as the "tuple" count for this call. That
+ * keeps instrumentation meaningful without changing the Instr API.
+ */
+static TupleBatch *
+ExecProcNodeBatchInstr(PlanState *node)
+{
+	TupleBatch *b;
+
+	InstrStartNode(node->instrument);
+
+	b = node->ExecProcNodeBatchReal(node);
+
+	InstrStopNode(node->instrument, b ? (double) b->nvalid : 0.0);
+
+	return b;
+}
 
 /* ----------------------------------------------------------------
  *		MultiExecProcNode
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 2552d420f1c..a4cf1e51af0 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -334,6 +334,37 @@ ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
 									 pstate->qual, pstate->ps_ProjInfo);
 }
 
+static TupleBatch *
+ExecSeqScanBatch(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatch(&node->ss,
+								 (ExecScanAccessBatchMtd) SeqNextBatch,
+								 NULL, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatch() for when qual evaluation is required.
+ */
+static TupleBatch *
+ExecSeqScanBatchWithQual(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatch(&node->ss,
+								 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+								 pstate->qual, NULL);
+}
+
 /* Batch SeqScan enablement and dispatch */
 static void
 SeqScanInitBatching(SeqScanState *scanstate, int eflags)
@@ -348,10 +379,12 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
 	{
 		if (scanstate->ss.ps.ps_ProjInfo == NULL)
 		{
+			scanstate->ss.ps.ExecProcNodeBatch = ExecSeqScanBatch;
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
 		}
 		else
 		{
+			scanstate->ss.ps.ExecProcNodeBatch = NULL;
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
 		}
 	}
@@ -359,10 +392,12 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
 	{
 		if (scanstate->ss.ps.ps_ProjInfo == NULL)
 		{
+			scanstate->ss.ps.ExecProcNodeBatch = ExecSeqScanBatchWithQual;
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
 		}
 		else
 		{
+			scanstate->ss.ps.ExecProcNodeBatch = NULL;
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
 		}
 	}
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index fec606471c8..fb4b57a831c 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -297,4 +297,55 @@ ExecScanExtendedBatchSlot(ScanState *node,
 	}
 }
 
+static inline TupleBatch *
+ExecScanExtendedBatch(ScanState *node,
+					  ExecScanAccessBatchMtd accessBatchMtd,
+					  ExprState *qual, ProjectionInfo *projInfo)
+{
+	ExprContext *econtext = node->ps.ps_ExprContext;
+	TupleBatch *b = node->ps.ps_Batch;
+	int			qualified;
+
+	/* Batch path does not support EPQ */
+	Assert(node->ps.state->es_epq_active == NULL);
+	Assert(TupleBatchIsValid(b));
+
+	for (;;)
+	{
+		CHECK_FOR_INTERRUPTS();
+
+		/* Get next batch from the AM */
+		if (!accessBatchMtd(node))
+			return NULL;
+
+		if (qual != NULL)
+		{
+			qualified = 0;
+			while (TupleBatchHasMore(b))
+			{
+				TupleTableSlot *in = TupleBatchGetNextSlot(b);
+
+				Assert(in);
+				ResetExprContext(econtext);
+				econtext->ecxt_scantuple = in;
+
+				if (ExecQual(qual, econtext))
+				{
+					TupleBatchStoreInOut(b, qualified, in);
+					qualified++;
+				}
+				else
+					InstrCountFiltered1(node, 1);
+			}
+			TupleBatchUseOutput(b, qualified);
+		}
+		else
+			qualified = b->nvalid;
+
+		if (qualified > 0)
+			return b;
+		/* else get the next batch from the AM */
+	}
+}
+
 #endif							/* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 17258f7ae2d..cf5b0c7e05c 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -294,6 +294,7 @@ extern void EvalPlanQualEnd(EPQState *epqstate);
  */
 extern PlanState *ExecInitNode(Plan *node, EState *estate, int eflags);
 extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
+extern void ExecSetExecProcNodeBatch(PlanState *node, ExecProcNodeBatchMtd function);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern void ExecShutdownNode(PlanState *node);
@@ -315,6 +316,15 @@ ExecProcNode(PlanState *node)
 
 	return node->ExecProcNode(node);
 }
+
+static inline TupleBatch *
+ExecProcNodeBatch(PlanState *node)
+{
+	if (node->chgParam != NULL) /* something changed? */
+		ExecReScan(node);		/* let ReScan handle this */
+
+	return node->ExecProcNodeBatch(node);
+}
 #endif
 
 /*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index f4bb8f7dd7f..a104591ac20 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1147,6 +1147,7 @@ typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
 /* Return a batch; may reuse caller-provided envelope. NULL => end of scan. */
 struct TupleBatch;
 typedef struct TupleBatch TupleBatch;
+typedef TupleBatch *(*ExecProcNodeBatchMtd)(struct PlanState *ps);
 
 /* ----------------
  *		PlanState node
@@ -1171,6 +1172,10 @@ typedef struct PlanState
 	ExecProcNodeMtd ExecProcNodeReal;	/* actual function, if above is a
 										 * wrapper */
 
+	/* Optional batch-producing entry point (NULL => no batching). */
+	ExecProcNodeBatchMtd ExecProcNodeBatch;
+	ExecProcNodeBatchMtd ExecProcNodeBatchReal;
+
 	Instrumentation *instrument;	/* Optional runtime stats for this node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
 
-- 
2.47.3
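
The qual loop in ExecScanExtendedBatch() above compacts qualifying tuples from
the input array into the output array and keeps fetching until a batch has at
least one survivor. A rough standalone sketch of that shape follows; it is not
the patch's code. SketchBatch, SketchSource, SketchQual and the sketch_*
functions are hypothetical, tuples are plain integers, and the fixed capacity
of 8 is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL types. */
#define SKETCH_CAP 8

typedef struct SketchBatch
{
    int        in[SKETCH_CAP];  /* tuples as fetched */
    int        out[SKETCH_CAP]; /* tuples that passed the qual */
    const int *active;          /* whichever array consumers should read */
    int        nvalid;
    int        next;
} SketchBatch;

typedef struct SketchSource
{
    const int *vals;
    int        nvals;
    int        pos;
    int        chunk;           /* tuples per fetch, <= SKETCH_CAP */
} SketchSource;

typedef bool (*SketchQual) (int v);

/* Example qual used in the usage note below. */
static bool
sketch_is_even(int v)
{
    return v % 2 == 0;
}

/* Load the next chunk into b->in; false at end of input. */
static bool
sketch_fetch(SketchSource *src, SketchBatch *b)
{
    int n = 0;

    assert(src->chunk > 0 && src->chunk <= SKETCH_CAP);
    while (n < src->chunk && src->pos < src->nvals)
        b->in[n++] = src->vals[src->pos++];
    b->active = b->in;
    b->nvalid = n;
    b->next = 0;
    return n > 0;
}

/* Compact qualifying tuples into b->out and switch to it. */
static int
sketch_filter(SketchBatch *b, SketchQual qual)
{
    int qualified = 0;

    for (int i = 0; i < b->nvalid; i++)
        if (qual(b->in[i]))
            b->out[qualified++] = b->in[i];
    b->active = b->out;
    b->nvalid = qualified;
    b->next = 0;
    return qualified;
}

/* Return the next non-empty batch, or NULL at end of input. */
static SketchBatch *
sketch_scan_batch(SketchSource *src, SketchBatch *b, SketchQual qual)
{
    for (;;)
    {
        if (!sketch_fetch(src, b))
            return NULL;
        if (qual == NULL || sketch_filter(b, qual) > 0)
            return b;
        /* every tuple was filtered out: fetch the next chunk */
    }
}
```

With input {1, 3, 5, 7, 2, 4, 9}, a chunk size of 4 and sketch_is_even as the
qual, the first chunk is filtered away entirely, so the first batch the caller
sees holds 2 and 4; the next call returns NULL.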



  [application/octet-stream] v3-0002-SeqScan-add-batch-driven-variants-returning-slots.patch (27.2K, 10-v3-0002-SeqScan-add-batch-driven-variants-returning-slots.patch)
From dac7cf1cd2a01347faf6b7fab3107c08da88ac90 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 1 Sep 2025 21:59:56 +0900
Subject: [PATCH v3 2/9] SeqScan: add batch-driven variants returning slots

Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per-tuple AM crossings without
changing the node interface seen by parents.

Add TupleBatch and supporting code in execBatch.c/h to hold executor
side batching state. PlanState gains ps_Batch to carry the active
TupleBatch when a node supports batching.

Wire up runtime selection in ExecInitSeqScan using
ScanCanUseBatching(). When executor_batching is enabled, EPQ is
inactive, the scan is not backward, and the relation supports
batching, ps.ExecProcNode is set to a batch-driven variant. Otherwise
the non-batch path is used.

Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled and allowed.

Notes / current limits:

- Batching uses EXEC_BATCH_ROWS (currently 64) as the target capacity.
- With the current heapam, batches are composed from a single page, so
  the batch may not always be full. Future work may let SeqScan and/or
  AMs top up batches across pages when safe to do so.
---
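
The idea in the commit message above (fetch from the AM in batches, but keep
handing one tuple at a time to the parent) can be sketched as a small
standalone model. None of this is the patch's code: SketchAM, SketchScanState
and the sketch_* functions are hypothetical, tuples are plain integers, and the
capacity of 64 only mirrors the EXEC_BATCH_ROWS value quoted in the notes. The
toy AM always fills a batch completely, whereas the heapam in this series
stops at page boundaries.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for illustration; not PostgreSQL types. */
#define SKETCH_BATCH_ROWS 64

typedef struct SketchAM
{
    int ntuples;                /* total tuples in the "table" */
    int pos;                    /* next tuple to hand out */
    int calls;                  /* number of AM crossings so far */
} SketchAM;

typedef struct SketchScanState
{
    SketchAM *am;
    int       items[SKETCH_BATCH_ROWS];
    int       nvalid;           /* tuples in the current batch */
    int       next;             /* next tuple to return from it */
} SketchScanState;

/* One AM call returns up to 'maxitems' tuples; 0 means end of scan. */
static int
sketch_am_getnextbatch(SketchAM *am, int *out, int maxitems)
{
    int n = 0;

    am->calls++;
    while (n < maxitems && am->pos < am->ntuples)
        out[n++] = am->pos++;
    return n;
}

/* Tuple-at-a-time interface; refills the batch only when it runs dry. */
static bool
sketch_scan_next(SketchScanState *ss, int *tuple)
{
    if (ss->next >= ss->nvalid)
    {
        ss->nvalid = sketch_am_getnextbatch(ss->am, ss->items,
                                            SKETCH_BATCH_ROWS);
        ss->next = 0;
        if (ss->nvalid == 0)
            return false;
    }
    *tuple = ss->items[ss->next++];
    return true;
}
```

Scanning 150 tuples this way still yields them one by one in order, but the
toy AM is entered only four times (64, 64, 22, then the empty call that
signals the end) instead of once per tuple.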
 src/backend/access/heap/heapam.c          |  29 ++++
 src/backend/access/heap/heapam_handler.c  |  15 ++
 src/backend/access/table/tableam.c        |  11 ++
 src/backend/executor/Makefile             |   1 +
 src/backend/executor/execBatch.c          | 117 ++++++++++++++
 src/backend/executor/execScan.c           |  31 ++++
 src/backend/executor/meson.build          |   1 +
 src/backend/executor/nodeSeqscan.c        | 176 +++++++++++++++++++++-
 src/backend/utils/init/globals.c          |   3 +
 src/backend/utils/misc/guc_parameters.dat |   7 +
 src/include/access/heapam.h               |   1 +
 src/include/access/tableam.h              |  27 ++++
 src/include/executor/execBatch.h          | 102 +++++++++++++
 src/include/executor/execScan.h           |  54 +++++++
 src/include/executor/executor.h           |   4 +
 src/include/miscadmin.h                   |   1 +
 src/include/nodes/execnodes.h             |   8 +
 17 files changed, 587 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/executor/execBatch.c
 create mode 100644 src/include/executor/execBatch.h

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8b9a80449c1..355ddd9838d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1570,6 +1570,35 @@ heap_begin_batch(TableScanDesc sscan, int maxitems)
 	return hb;
 }
 
+/*
+ * heap_materialize_batch_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
+{
+	HeapBatch *hb = (HeapBatch *) am_batch;
+
+	Assert(n <= hb->nitems);
+
+	for (int i = 0; i < n; i++)
+	{
+		HeapTupleData *tuple = &hb->tupdata[i];
+		HeapTupleTableSlot *slot = (HeapTupleTableSlot *) slots[i];
+
+		/* Inline of ExecStoreHeapTuple(tuple, slot, false) */
+		slot->tuple = tuple;
+		slot->off = 0;
+		slot->base.tts_nvalid = 0;
+		slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
+		slot->base.tts_tid = tuple->t_self;
+		slot->base.tts_tableOid = tuple->t_tableOid;
+		slot->base.tts_flags &= ~(TTS_FLAG_SHOULDFREE | TTS_FLAG_EMPTY);
+	}
+}
+
 /*
  * heap_end_batch
  *
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index ec4eeccf19c..8e88cc9e8f1 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -72,6 +72,20 @@ heapam_slot_callbacks(Relation relation)
 	return &TTSOpsBufferHeapTuple;
 }
 
+/* ------------------------------------------------------------------------
+ * TupleBatch related callbacks for heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static const TupleBatchOps TupleBatchHeapOps = {
+	.materialize_all = heap_materialize_batch_all
+};
+
+static const TupleBatchOps *
+heapam_batch_callbacks(Relation relation)
+{
+	return &TupleBatchHeapOps;
+}
 
 /* ------------------------------------------------------------------------
  * Index Scan Callbacks for heap AM
@@ -2617,6 +2631,7 @@ static const TableAmRoutine heapam_methods = {
 	.type = T_TableAmRoutine,
 
 	.slot_callbacks = heapam_slot_callbacks,
+	.batch_callbacks = heapam_batch_callbacks,
 
 	.scan_begin = heap_beginscan,
 	.scan_end = heap_endscan,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5e41404937e..5a8ebb8b97c 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -103,6 +103,17 @@ table_slot_create(Relation relation, List **reglist)
 	return slot;
 }
 
+/* ----------------------------------------------------------------------------
+ * TupleBatch support routines
+ * ----------------------------------------------------------------------------
+ */
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+	if (relation->rd_tableam)
+		return relation->rd_tableam->batch_callbacks(relation);
+	elog(ERROR, "relation does not support TupleBatch operations");
+}
 
 /* ----------------------------------------------------------------------------
  * Table scan functions.
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..3e72f3fe03c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	execAmi.o \
 	execAsync.o \
+	execBatch.o \
 	execCurrent.o \
 	execExpr.o \
 	execExprInterp.o \
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
new file mode 100644
index 00000000000..007ae535687
--- /dev/null
+++ b/src/backend/executor/execBatch.c
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.c
+ *		Helpers for TupleBatch
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "executor/execBatch.h"
+
+/*
+ * TupleBatchCreate
+ *		Allocate and initialize a new TupleBatch envelope.
+ */
+TupleBatch *
+TupleBatchCreate(TupleDesc scandesc, int capacity)
+{
+	TupleBatch  *b;
+	TupleTableSlot **inslots,
+				   **outslots;
+
+	inslots = palloc(sizeof(TupleTableSlot *) * capacity);
+	outslots = palloc(sizeof(TupleTableSlot *) * capacity);
+	for (int i = 0; i < capacity; i++)
+		inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);
+
+	b = (TupleBatch *) palloc(sizeof(TupleBatch));
+
+	/* Initial state: empty envelope */
+	b->am_payload = NULL;
+	b->ntuples = 0;
+	b->inslots = inslots;
+	b->outslots = outslots;
+	b->activeslots = NULL;
+	b->materialized = false;
+	b->maxslots = capacity;
+
+	b->nvalid = 0;
+	b->next = 0;
+
+	return b;
+}
+
+/*
+ * TupleBatchReset
+ *		Reset an existing TupleBatch envelope to empty.
+ */
+void
+TupleBatchReset(TupleBatch *b, bool drop_slots)
+{
+	if (b == NULL)
+		return;
+
+	for (int i = 0; i < b->maxslots; i++)
+	{
+		ExecClearTuple(b->inslots[i]);
+		if (drop_slots)
+			ExecDropSingleTupleTableSlot(b->inslots[i]);
+	}
+
+	if (drop_slots)
+	{
+		pfree(b->inslots);
+		pfree(b->outslots);
+		b->inslots = b->outslots = NULL;
+	}
+
+	b->ntuples = 0;
+	b->nvalid = 0;
+	b->next = 0;
+	b->activeslots = NULL;
+}
+
+void
+TupleBatchUseInput(TupleBatch *b, int nvalid)
+{
+	b->materialized = true;
+	b->activeslots = b->inslots;
+	b->nvalid = nvalid;
+	b->next = 0;
+}
+
+void
+TupleBatchUseOutput(TupleBatch *b, int nvalid)
+{
+	b->materialized = true;
+	b->activeslots = b->outslots;
+	b->nvalid = nvalid;
+	b->next = 0;
+}
+
+bool
+TupleBatchIsValid(TupleBatch *b)
+{
+	return	b != NULL &&
+			b->maxslots > 0 &&
+			b->inslots != NULL &&
+			b->outslots != NULL;
+}
+
+void
+TupleBatchRewind(TupleBatch *b)
+{
+	b->next = 0;
+}
+
+int
+TupleBatchGetNumValid(TupleBatch *b)
+{
+	return b->nvalid;
+}
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 90726949a87..f24c5d73ae1 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -18,6 +18,7 @@
  */
 #include "postgres.h"
 
+#include "access/tableam.h"
 #include "executor/executor.h"
 #include "executor/execScan.h"
 #include "miscadmin.h"
@@ -154,3 +155,33 @@ ExecScanReScan(ScanState *node)
 		}
 	}
 }
+
+bool
+ScanCanUseBatching(ScanState *scanstate, int eflags)
+{
+	Relation	relation = scanstate->ss_currentRelation;
+
+	return	executor_batching &&
+			(scanstate->ps.state->es_epq_active == NULL) &&
+			!(eflags & EXEC_FLAG_BACKWARD) &&
+			relation && table_supports_batching(relation);
+}
+
+void
+ScanResetBatching(ScanState *scanstate, bool drop)
+{
+	TupleBatch *b = scanstate->ps.ps_Batch;
+
+	if (b)
+	{
+		TupleBatchReset(b, drop);
+		if (b->am_payload)
+		{
+			table_scan_end_batch(scanstate->ss_currentScanDesc,
+								 b->am_payload);
+			b->am_payload = NULL;
+		}
+		if (drop)
+			pfree(b);
+	}
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..40ffc28f3cb 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
 backend_sources += files(
   'execAmi.c',
   'execAsync.c',
+  'execBatch.c',
   'execCurrent.c',
   'execExpr.c',
   'execExprInterp.c',
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 94047d29430..2552d420f1c 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -203,6 +203,171 @@ ExecSeqScanEPQ(PlanState *pstate)
 					(ExecScanRecheckMtd) SeqRecheck);
 }
 
+/* ----------------------------------------------------------------
+ *						Batch Support
+ * ----------------------------------------------------------------
+ */
+static inline bool
+SeqNextBatch(SeqScanState *node)
+{
+	TableScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+
+	Assert(node->ss.ps.ps_Batch != NULL);
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	Assert(direction == ForwardScanDirection);
+
+	if (scandesc == NULL)
+	{
+		/*
+		 * We reach here if the scan is not parallel, or if we're serially
+		 * executing a scan that was planned to be parallel.
+		 */
+		scandesc = table_beginscan(node->ss.ss_currentRelation,
+								   estate->es_snapshot,
+								   0, NULL);
+		node->ss.ss_currentScanDesc = scandesc;
+	}
+
+	/* Lazily create the AM batch payload. */
+	if (node->ss.ps.ps_Batch->am_payload == NULL)
+	{
+		const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+		Assert(tam && tam->scan_begin_batch);
+		node->ss.ps.ps_Batch->am_payload =
+			table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+		node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+	}
+
+	node->ss.ps.ps_Batch->ntuples =
+		table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
+	node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
+	node->ss.ps.ps_Batch->materialized = false;
+
+	return node->ss.ps.ps_Batch->ntuples > 0;
+}
+
+static inline bool
+SeqNextBatchMaterialize(SeqScanState *node)
+{
+	if (SeqNextBatch(node))
+	{
+		TupleBatchMaterializeAll(node->ss.ps.ps_Batch);
+		return true;
+	}
+
+	return false;
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	/*
+	 * Use pg_assume() for != NULL tests to make the compiler realize no
+	 * runtime check for the field is needed in ExecScanExtendedBatchSlot().
+	 */
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScan() but when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	pg_assume(pstate->ps_ProjInfo != NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScan() but when qual evaluation and projection are
+ * required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	pg_assume(pstate->ps_ProjInfo != NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 pstate->qual, pstate->ps_ProjInfo);
+}
+
+/* Batch SeqScan enablement and dispatch */
+static void
+SeqScanInitBatching(SeqScanState *scanstate, int eflags)
+{
+	const int cap = EXEC_BATCH_ROWS;
+	TupleDesc	scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+
+	scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+
+	/* Choose batch variant to preserve the existing specialization matrix */
+	if (scanstate->ss.ps.qual == NULL)
+	{
+		if (scanstate->ss.ps.ps_ProjInfo == NULL)
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+		}
+		else
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+		}
+	}
+	else
+	{
+		if (scanstate->ss.ps.ps_ProjInfo == NULL)
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+		}
+		else
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+		}
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecInitSeqScan
  * ----------------------------------------------------------------
@@ -211,6 +376,7 @@ SeqScanState *
 ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 {
 	SeqScanState *scanstate;
+	bool	use_batching;
 
 	/*
 	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -241,9 +407,12 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 							 node->scan.scanrelid,
 							 eflags);
 
+	use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
+
 	/* and create slot with the appropriate rowtype */
 	ExecInitScanTupleSlot(estate, &scanstate->ss,
 						  RelationGetDescr(scanstate->ss.ss_currentRelation),
+						  use_batching ? &TTSOpsHeapTuple :
 						  table_slot_callbacks(scanstate->ss.ss_currentRelation));
 
 	/*
@@ -280,6 +449,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
 	}
 
+	if (use_batching)
+		SeqScanInitBatching(scanstate, eflags);
+
 	return scanstate;
 }
 
@@ -299,6 +471,8 @@ ExecEndSeqScan(SeqScanState *node)
 	 */
 	scanDesc = node->ss.ss_currentScanDesc;
 
+	ScanResetBatching(&node->ss, true);
+
 	/*
 	 * close heap scan
 	 */
@@ -327,7 +501,7 @@ ExecReScanSeqScan(SeqScanState *node)
 	if (scan != NULL)
 		table_rescan(scan,		/* scan desc */
 					 NULL);		/* new scan keys */
-
+	ScanResetBatching(&node->ss, false);
 	ExecScanReScan((ScanState *) node);
 }
 
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..b4a0996a717 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int			notify_buffers = 16;
 int			serializable_buffers = 32;
 int			subtransaction_buffers = 0;
 int			transaction_buffers = 0;
+
+/* executor batching */
+bool		executor_batching = true;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index b176d5130e4..a4bc8c10cc2 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -887,6 +887,13 @@
   boot_val => 'true',
 },
 
+{ name => 'executor_batching', type => 'bool', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Use tuple batching during execution.',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'executor_batching',
+  boot_val => 'true',
+},
+
 { name => 'data_sync_retry', type => 'bool', context => 'PGC_POSTMASTER', group => 'ERROR_HANDLING_OPTIONS',
   short_desc => 'Whether to continue running after a failure to sync data files.',
   variable => 'data_sync_retry',
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 02f7793fba0..13ce6166ec3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -314,6 +314,7 @@ extern bool heap_getnextslot(TableScanDesc sscan,
 extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
 extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
 extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+extern void heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n);
 
 extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 							  ItemPointer maxtid);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 953207eac50..05f828b9762 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "access/xact.h"
 #include "commands/vacuum.h"
+#include "executor/execBatch.h"
 #include "executor/tuptable.h"
 #include "storage/read_stream.h"
 #include "utils/rel.h"
@@ -39,6 +40,7 @@ typedef struct BulkInsertStateData BulkInsertStateData;
 typedef struct IndexInfo IndexInfo;
 typedef struct SampleScanState SampleScanState;
 typedef struct ValidateIndexState ValidateIndexState;
+typedef struct TupleBatchOps TupleBatchOps;
 
 /*
  * Bitmask values for the flags argument to the scan_begin callback.
@@ -301,6 +303,7 @@ typedef struct TableAmRoutine
 	 * Return slot implementation suitable for storing a tuple of this AM.
 	 */
 	const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
+	const TupleBatchOps *(*batch_callbacks)(Relation rel);
 
 
 	/* ------------------------------------------------------------------------
@@ -361,6 +364,7 @@ typedef struct TableAmRoutine
 									 ScanDirection dir);
 	void		(*scan_end_batch)(TableScanDesc sscan, void *am_batch);
 
+
 	/*-----------
 	 * Optional functions to provide scanning for ranges of ItemPointers.
 	 * Implementations must either provide both of these functions, or neither
@@ -872,6 +876,16 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
  */
 extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
 
+/* ----------------------------------------------------------------------------
+ * TupleBatch functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Returns callbacks for manipulating TupleBatch for tuples of the given
+ * relation.
+ */
+extern const TupleBatchOps *table_batch_callbacks(Relation relation);
 
 /* ----------------------------------------------------------------------------
  * Table scan functions.
@@ -1046,6 +1060,18 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
+/*
+ * table_supports_batching
+ *		Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation)
+{
+	const TableAmRoutine *tam = relation->rd_tableam;
+
+	return tam->scan_getnextbatch != NULL;
+}
+
 /*
  * table_scan_begin_batch
  *		Allocate AM-owned batch payload with capacity 'maxitems'.
@@ -2116,5 +2142,6 @@ extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
  */
 
 extern const TableAmRoutine *GetHeapamTableAmRoutine(void);
+extern struct TupleBatchOps *GetHeapamTupleBatchOps(void);
 
 #endif							/* TABLEAM_H */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
new file mode 100644
index 00000000000..6f1a38d14bd
--- /dev/null
+++ b/src/include/executor/execBatch.h
@@ -0,0 +1,102 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.h
+ *		Executor batch envelope for passing tuple batch state upward
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/include/executor/execBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECBATCH_H
+#define EXECBATCH_H
+
+#include "executor/tuptable.h"
+
+/* XXX fixed 64 for PoC */
+#define	EXEC_BATCH_ROWS		64
+
+/*
+ * TupleBatchOps -- AM-specific helpers for lazy materialization.
+ */
+typedef struct TupleBatchOps
+{
+	void (*materialize_all)(void *am_payload,
+							TupleTableSlot **dst,
+							int maxslots);
+} TupleBatchOps;
+
+/*
+ * TupleBatch
+ *
+ * Envelope for a batch of tuples produced by a plan node (e.g., SeqScan) per
+ * call to a batch variant of ExecSeqScan().
+ */
+typedef struct TupleBatch
+{
+	void	   *am_payload;
+	const TupleBatchOps *ops;
+	int			ntuples;				/* number of tuples in am_payload */
+	bool		materialized;		 /* tuples in slots valid? */
+	struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
+	struct TupleTableSlot **outslots; /* slots for tuples going "out of"
+									   * batch */
+	struct TupleTableSlot **activeslots;
+	int			maxslots;
+
+	int		nvalid;		/* number of returnable tuples in activeslots */
+	int		next;		/* 0-based index of next tuple to be returned */
+} TupleBatch;
+
+
+/* Helpers */
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
+extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
+extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
+extern bool TupleBatchIsValid(TupleBatch *b);
+extern void TupleBatchRewind(TupleBatch *b);
+extern int TupleBatchGetNumValid(TupleBatch *b);
+
+static inline TupleTableSlot *
+TupleBatchGetNextSlot(TupleBatch *b)
+{
+	return b->next < b->nvalid ? b->activeslots[b->next++] : NULL;
+}
+
+static inline TupleTableSlot *
+TupleBatchGetSlot(TupleBatch *b, int index)
+{
+	Assert(index < b->nvalid);
+	return b->activeslots[index];
+}
+
+static inline void
+TupleBatchStoreInOut(TupleBatch *b, int index, TupleTableSlot *out)
+{
+	Assert(TupleBatchIsValid(b));
+	b->outslots[index] = out;
+}
+
+static inline bool
+TupleBatchHasMore(TupleBatch *b)
+{
+	return b->activeslots && b->next < b->nvalid;
+}
+
+static inline void
+TupleBatchMaterializeAll(TupleBatch *b)
+{
+	if (b->materialized)
+		return;
+
+	if (b->ops == NULL || b->ops->materialize_all == NULL)
+		elog(ERROR, "TupleBatch has no slots and no materialize_all op");
+
+	b->ops->materialize_all(b->am_payload, b->inslots, b->ntuples);
+	TupleBatchUseInput(b, b->ntuples);
+}
+
+#endif	/* EXECBATCH_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index 837ea7785bb..fec606471c8 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -243,4 +243,58 @@ ExecScanExtended(ScanState *node,
 	}
 }
 
+static inline TupleTableSlot *
+ExecScanExtendedBatchSlot(ScanState *node,
+						  ExecScanAccessBatchMtd accessBatchMtd,
+						  ExprState *qual, ProjectionInfo *projInfo)
+{
+	ExprContext *econtext = node->ps.ps_ExprContext;
+	TupleBatch *b = node->ps.ps_Batch;
+
+	/* Batch path does not support EPQ */
+	Assert(node->ps.state->es_epq_active == NULL);
+	Assert(TupleBatchIsValid(b));
+
+	for (;;)
+	{
+		TupleTableSlot *in;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Get next input slot from current batch, or refill */
+		if (!TupleBatchHasMore(b))
+		{
+			if (!accessBatchMtd(node))
+				return NULL;
+		}
+
+		in = TupleBatchGetNextSlot(b);
+		Assert(in);
+
+		/* No qual, no projection: direct return */
+		if (qual == NULL && projInfo == NULL)
+			return in;
+
+		ResetExprContext(econtext);
+		econtext->ecxt_scantuple = in;
+
+		/* Qual only */
+		if (projInfo == NULL)
+		{
+			if (qual == NULL || ExecQual(qual, econtext))
+				return in;
+			else
+				InstrCountFiltered1(node, 1);
+			continue;
+		}
+
+		/* Projection (with or without qual) */
+		if (qual == NULL || ExecQual(qual, econtext))
+			return ExecProject(projInfo);
+		else
+			InstrCountFiltered1(node, 1);
+		/* else try next tuple */
+	}
+}
+
 #endif							/* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 3248e78cd28..17258f7ae2d 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -575,12 +575,16 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
  */
 typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
 typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanAccessBatchMtd)(ScanState *node);
 
 extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
 								ExecScanRecheckMtd recheckMtd);
+
 extern void ExecAssignScanProjectionInfo(ScanState *node);
 extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
 extern void ExecScanReScan(ScanState *node);
+extern bool ScanCanUseBatching(ScanState *scanstate, int eflags);
+extern void ScanResetBatching(ScanState *scanstate, bool drop);
 
 /*
  * prototypes from functions in execTuples.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 1bef98471c3..b8e7afda57c 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -287,6 +287,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
 extern PGDLLIMPORT int VacuumCostBalance;
 extern PGDLLIMPORT bool VacuumCostActive;
 
+extern PGDLLIMPORT bool executor_batching;
 
 /* in utils/misc/stack_depth.c */
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a36653c37f9..f4bb8f7dd7f 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -30,6 +30,7 @@
 #define EXECNODES_H
 
 #include "access/tupconvert.h"
+#include "executor/execBatch.h"
 #include "executor/instrument.h"
 #include "fmgr.h"
 #include "lib/ilist.h"
@@ -1143,6 +1144,10 @@ typedef struct JsonExprState
  */
 typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
 
+/* Forward declaration; the full definition is in executor/execBatch.h. */
+struct TupleBatch;
+typedef struct TupleBatch TupleBatch;
+
 /* ----------------
  *		PlanState node
  *
@@ -1198,6 +1203,9 @@ typedef struct PlanState
 	ExprContext *ps_ExprContext;	/* node's expression-evaluation context */
 	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */
 
+	/* Batching state if node supports it. */
+	TupleBatch *ps_Batch;
+
 	bool		async_capable;	/* true if node is async-capable */
 
 	/*
-- 
2.47.3




* Re: Batching in executor
@ 2025-10-27 07:24  Amit Langote <[email protected]>
  parent: Tomas Vondra <[email protected]>
  3 siblings, 1 reply; 22+ messages in thread

From: Amit Langote @ 2025-10-27 07:24 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers

Hi Tomas,

On Mon, Sep 29, 2025 at 8:01 PM Tomas Vondra <[email protected]> wrote:
>
> Hi Amit,
>
> Thanks for the patch. I took a look over the weekend, and did a couple
> experiments / benchmarks, so let me share some initial feedback (or
> rather a bunch of questions I came up with).

Thank you for reviewing the patch and taking the time to run those
experiments. I appreciate the detailed feedback and questions.  I also
apologize for my late reply; I spent perhaps way too much time going
over your index prefetching thread trying to understand the notion of
batching that it uses, and I got sidelined by other things while
writing this reply.

> I'll start with some general thoughts, before going into some nitpicky
> comments about patches / code and perf results.
>
> I think the general goal of the patch - reducing the per-tuple overhead
> and making the executor more efficient for OLAP workloads - is very
> desirable. I believe the limitations of per-row executor are one of the
> reasons why attempts to implement a columnar TAM mostly failed. The
> compression is nice, but it's hard to be competitive without an executor
> that leverages that too. So starting with an executor, in a way that
> helps even heap, seems like a good plan. So +1 to this.

I'm happy to hear that you find the overall direction worthwhile.

> While looking at the patch, I couldn't help but think about the index
> prefetching stuff that I work on. It also introduces the concept of a
> "batch", for passing data between an index AM and the executor. It's
> interesting how different the designs are in some respects. I'm not
> saying one of those designs is wrong, it's more due to different goals.
>
> For example, the index prefetching patch establishes a "shared" batch
> struct, and the index AM is expected to fill it with data. After that,
> the batch is managed entirely by indexam.c, with no AM calls. The only
> AM-specific bit in the batch is "position", but that's used only when
> advancing to the next page, etc.
>
> This patch does things differently. IIUC, each TAM may produce its own
> "batch", which is then wrapped in a generic one. For example, heap
> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> is fine. In the prefetching we chose to move all this code (walking the
> batch items) from the AMs into the layer above, and make it AM agnostic.

Yes, the design of this patch does differ from the index prefetching
approach, and that’s largely due to the differing goals as you say.
AIUI, the index prefetching patch uses a shared batch structure
managed mostly by indexam.c and populated by the index AM.  In my
patch, each table AM produces its own batch format that gets wrapped
in a generic TupleBatch which contains the AM-specified TupleBatchOps
for operations on the AM's opaque data. This was a conscious choice:
in prefetching, the aim seems to be to make indexam.c manage batches
and the operations on them in a mostly AM-agnostic manner.  But for
executor batching, the aim is to retain TAM-specific optimizations as
much as possible and rely on the TAM for most operations on the batch
contents. Both designs have their merits given their respective use
cases, but I guess you understand that very well.
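
To make the wrapping concrete, here is a toy, standalone sketch (not code
from the patch). It models an executor-side envelope that holds an opaque
AM payload plus AM-supplied callbacks, in the spirit of TupleBatch and
TupleBatchOps. All names (ToyBatch, ToyBatchOps, ToyHeapPayload and the
toy_* functions) are hypothetical, and rows are plain ints rather than
tuples.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for TupleBatchOps: callbacks supplied by the AM. */
typedef struct ToyBatchOps
{
    /* Decode n rows from the AM's private payload into dst. */
    void (*materialize_all) (void *am_payload, int *dst, int n);
} ToyBatchOps;

/* Hypothetical stand-in for TupleBatch: the generic envelope. */
typedef struct ToyBatch
{
    void       *am_payload;     /* opaque to the executor */
    const ToyBatchOps *ops;     /* how to interpret am_payload */
    int         ntuples;        /* rows currently held in am_payload */
} ToyBatch;

/* One imaginary AM that stores rows as a base value plus small offsets. */
typedef struct ToyHeapPayload
{
    int         base;
    int         offsets[4];
} ToyHeapPayload;

static void
toy_heap_materialize_all(void *am_payload, int *dst, int n)
{
    ToyHeapPayload *p = (ToyHeapPayload *) am_payload;

    for (int i = 0; i < n; i++)
        dst[i] = p->base + p->offsets[i];
}

static const ToyBatchOps toy_heap_ops = {toy_heap_materialize_all};

/*
 * Executor side: knows only the envelope and its ops, never the payload
 * layout.  Returns the number of rows written to dst.
 */
static int
toy_envelope_materialize(ToyBatch *b, int *dst, int dstcap)
{
    assert(b->ops != NULL && b->ops->materialize_all != NULL);
    assert(b->ntuples <= dstcap);

    b->ops->materialize_all(b->am_payload, dst, b->ntuples);
    return b->ntuples;
}
```

The point of the indirection is that a different AM could plug in a
different payload struct and callback without any change to
toy_envelope_materialize().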

> But for the batching, we want to retain the custom format as long as
> possible. Presumably, the various advantages of the TAMs are tied to the
> custom/columnar storage format. Memory efficiency thanks to compression,
> execution on compressed data, etc. Keeping the custom format as long as
> possible is the whole point of "late materialization" (and materializing
> as late as possible is one of the important details in column stores).

Exactly -- keeping the TAM-specific batch format as long as possible
is a key goal here. As you noted, the benefits of a custom storage
format (compression, operating on compressed data, etc.) are best
realized when we delay materialization until absolutely necessary.  I
want to design this patch so that each TAM can produce and use its own
batch representation internally, only wrapping it when interfacing
with the executor in a generic way.  I admit that's not entirely true
of the patch as it stands, as I explain below.

> How far ahead have you thought about these capabilities? I was wondering
> about two things in particular. First, at which point do we have to
> "materialize" the TupleBatch into some generic format (e.g. TupleSlots).
> I get it that you want to enable passing batches between nodes, but
> would those use the same "format" as the underlying scan node, or some
> generic one? Second, will it be possible to execute expressions on the
> custom batches (i.e. on "compressed data")? Or is it necessary to
> "materialize" the batch into regular tuple slots? I realize those may
> not be there "now" but maybe it'd be nice to plan for the future.

I have been thinking about those future capabilities. Currently, the
patch keeps tuples in the TAM-specific batch format up until they need
to be consumed by a node that doesn’t understand that format or has
not been modified to invoke the TAM callbacks to decode it.  In the
current patch, that means we materialize to regular TupleTableSlots at
nodes that require it (for example, the scan node reading from TAM
needing to evaluate quals, etc.). However, the intention is to allow
batches to be passed through as many nodes as possible without
materialization, ideally using the same format produced by the scan
node all the way up until reaching a node that can only work with
tuples in TupleTableSlots.
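
As a rough illustration of "materialize only when a consumer needs it",
here is a small standalone sketch. It is not taken from the patch; the
names (ToyLazyBatch, toy_lazy_*, toy_count_*) are made up, rows are plain
ints, and the nmaterializations counter exists only so the effect can be
observed.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BATCH_CAP 8

/* Hypothetical batch that keeps rows in "native" form until asked. */
typedef struct ToyLazyBatch
{
    const int  *native;         /* AM-owned rows, in the AM's own layout */
    int         ntuples;
    bool        materialized;   /* are slots[] valid for this batch? */
    int         slots[TOY_BATCH_CAP];   /* generic per-row form */
    int         nmaterializations;  /* how often the cost was paid;
                                     * caller must zero-initialize */
} ToyLazyBatch;

/* Load a new batch; nothing is copied into slots[] yet. */
static void
toy_lazy_load(ToyLazyBatch *b, const int *native, int ntuples)
{
    assert(ntuples <= TOY_BATCH_CAP);
    b->native = native;
    b->ntuples = ntuples;
    b->materialized = false;
}

/* Convert to the generic form at most once per loaded batch. */
static void
toy_lazy_materialize(ToyLazyBatch *b)
{
    if (b->materialized)
        return;

    for (int i = 0; i < b->ntuples; i++)
        b->slots[i] = b->native[i];
    b->materialized = true;
    b->nmaterializations++;
}

/* A consumer that only needs the row count never materializes. */
static int
toy_count_rows(const ToyLazyBatch *b)
{
    return b->ntuples;
}

/* A consumer that evaluates a per-row qual must materialize first. */
static int
toy_count_matching(ToyLazyBatch *b, int threshold)
{
    int         nmatch = 0;

    toy_lazy_materialize(b);
    for (int i = 0; i < b->ntuples; i++)
    {
        if (b->slots[i] > threshold)
            nmatch++;
    }
    return nmatch;
}
```

In this model the counting consumer passes the batch through untouched,
while the qual-evaluating consumer pays the conversion cost once per batch
no matter how many times it is called.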

As for executing expressions directly on the custom batch data: that’s
something I would like to enable in the future. Right now, expressions
(quals, projections, etc.) are evaluated after materializing into
normal tuples in TupleTableSlots stored in TupleBatch, because the
expression evaluation code isn’t yet fully batch-aware and is very far
from doing things like operating on compressed data in its native form.
Patches 0004-0008 do try to add batch-aware expression evaluation but
that's just a prototype.  In the long term, the goal is to allow
expression evaluation on batch data (for example, applying a WHERE
clause or aggregate transition directly on a columnar batch without
converting it to heap tuples first). This will require significant new
infrastructure (perhaps specialized batch-aware expression operators
and functions), so it's not in the current patch, but I agree it's
important to plan for it. The current design doesn’t preclude it, and
it lays some groundwork by introducing the batch abstraction, but fully
supporting that will be future work.

That said, one area I’d like to mention while at it, especially to
enable native execution on compressed or columnar batches, is giving
the table AM more control over how expression evaluation is performed
on its batch data. In the current patch, the AM can provide a
materialize function via TupleBatchOps, but that always produces an
array of TupleTableSlots stored in the TupleBatch, not an opaque
representation that remains under AM control. Maybe that's not bad for
a v1 patch.  When evaluating expressions over a batch, a BatchVector
is built by looping over these slots and invoking the standard
per-tuple getsomeattrs() to "deform" a tuple into needed columns.
While that enables batch-style EEOPs for qual evaluation and aggregate
transition (and is already a gain over per-row evaluation), it misses
the opportunity to leverage any batch-specific optimizations the AM
could offer, such as vectorized decoding or filtering over compressed
data, and other AM optimizations for getting only the necessary
columns out, possibly in a vector format.

I’m considering extending TupleTableSlotOps with a batch-aware variant
of getsomeattrs(), something like slot_getsomeattrs_batch(), so that
AMs can populate column vectors (e.g., BatchVector) directly from
their native format. That would allow bypassing slot materialization
entirely and plugging AM-provided decoding logic directly into the
executor’s batch expression paths. This isn’t implemented yet, but I
see it as a necessary step toward supporting fully native expression
evaluation over compressed or columnar formats. I’m not yet sure if
TupleTableSlotOps is the right place for such a hook; it might belong
elsewhere in the abstraction, but exposing a batch-aware interface for
this purpose seems like the right direction.
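
A toy sketch of the difference, under loud assumptions: nothing below is
PostgreSQL API. ToyVector merely stands in for a column vector such as
BatchVector, and the two toy_fill_* functions contrast the current
"loop over rows and deform each one" path with what an AM-provided,
batch-aware hook could do for a source that is already column-oriented.

```c
#include <assert.h>
#include <string.h>

#define TOY_NCOLS   3
#define TOY_MAXROWS 4

/* Hypothetical column vector holding one attribute for a whole batch. */
typedef struct ToyVector
{
    int         values[TOY_MAXROWS];
    int         nrows;
} ToyVector;

/*
 * Row-oriented source: visit every row and pick out one attribute, the
 * way a per-slot getsomeattrs() loop would.
 */
static void
toy_fill_from_rows(const int rows[][TOY_NCOLS], int nrows, int col,
                   ToyVector *vec)
{
    assert(nrows <= TOY_MAXROWS && col < TOY_NCOLS);

    for (int i = 0; i < nrows; i++)
        vec->values[i] = rows[i][col];
    vec->nrows = nrows;
}

/*
 * Column-oriented source: the attribute is already contiguous, so a
 * hypothetical AM hook can fill the vector with one bulk copy and never
 * build per-row slots.
 */
static void
toy_fill_from_columns(const int cols[][TOY_MAXROWS], int nrows, int col,
                      ToyVector *vec)
{
    assert(nrows <= TOY_MAXROWS && col < TOY_NCOLS);

    memcpy(vec->values, cols[col], sizeof(int) * nrows);
    vec->nrows = nrows;
}
```

Both functions produce the same vector; only the second one lets the AM
decide how the column is extracted.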

> It might be worth exploring some columnar formats, and see if this
> design would be a good fit. Let's say we want to process data read from
> a parquet file. Would we be able to leverage the format, or would we
> need to "materialize" into slots too early? Or maybe it'd be good to
> look at the VCI extension [1], discussed in a nearby thread. AFAICS
> that's still based on an index AM, but there were suggestions to use TAM
> instead (and maybe that'd be a better choice).

Yeah, looking at columnar TAMs or FDWs is on my list. I do think the
design should be able to accommodate true columnar formats like
Parquet. If we had a table AM (or FDW) that reads Parquet files into a
columnar batch structure, the executor batching framework should
ideally allow us to pass that batch along without immediately
materializing to tuples.  As mentioned before, we might have to adjust
or extend the TupleBatch abstraction to handle a wider variety of
batch formats, but conceptually it fits -- the goal is to avoid
forcing early materialization. I will definitely keep the Parquet
use-case in mind and perhaps do some experiments with a columnar
source to ensure we aren’t baking in any unnecessary materialization.
Also, thanks for the reference to the VCI extension thread; I'll take
a look at that.

> The other option would be to "create batches" during execution, say by
> having a new node that accumulates tuples, builds a batch and sends it
> to the node above. This would help both in cases when either the lower
> node does not produce batches at all, or the batches are too small (due
> to filtering, aggregation, ...). Of course, it'd only win if this
> increases efficiency of the upper part of the plan enough to pay for
> building the batches. That can be a hard decision.

Yes, introducing a dedicated executor node to accumulate and form
batches on the fly is an interesting idea; I have thought about it and
even mentioned it in passing at the PGConf.dev unconference. This
could indeed cover scenarios where the data source (a node) doesn't
produce batches (e.g., a non-batching node feeding into a
batching-aware upper node) or where batches coming from below are too
small to be efficient. The current patch set doesn’t implement such a
node; I focused on enabling batching at the scan/TAM level first. The
cost/benefit decision for a batch-aggregator node is tricky, as you
said. We’d need a way to decide when the overhead of gathering tuples
into a batch is outweighed by the benefits to the upper node. This
likely ties into costing or adaptive execution decisions. It's
something I’m open to exploring in a future iteration, perhaps once we
have more feedback on how the existing batching performs in various
scenarios. It might also require some planner or executor smarts
(maybe the executor can decide to batch on the fly if it sees a
pattern of use, or the planner could insert such nodes when
beneficial).
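
For illustration only, such an accumulating node could behave roughly
like the standalone sketch below. Nothing here exists in the patch set:
ToyChild models a tuple-at-a-time child, ToyAccBatch the batch handed to
the upper node, and the capacity of 4 is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_ACC_CAP 4

/* Hypothetical tuple-at-a-time child that yields 0 .. limit-1. */
typedef struct ToyChild
{
    int         next;
    int         limit;
} ToyChild;

/* Returns false once the child is exhausted. */
static bool
toy_child_next(ToyChild *c, int *out)
{
    if (c->next >= c->limit)
        return false;
    *out = c->next++;
    return true;
}

/* Batch produced by the accumulating node. */
typedef struct ToyAccBatch
{
    int         rows[TOY_ACC_CAP];
    int         nvalid;
} ToyAccBatch;

/*
 * Pull rows from the child until the batch is full or the child runs
 * dry.  Returns false only when no rows at all could be gathered, which
 * signals end of scan to the upper node.
 */
static bool
toy_accumulate(ToyChild *c, ToyAccBatch *b)
{
    b->nvalid = 0;
    while (b->nvalid < TOY_ACC_CAP &&
           toy_child_next(c, &b->rows[b->nvalid]))
        b->nvalid++;

    return b->nvalid > 0;
}
```

With a child that produces 10 rows, this yields batches of 4, 4 and 2
rows. The sketch says nothing about the hard part, which is deciding
when paying for the accumulation is worthwhile.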

> You also mentioned we could make batches larger by letting them span
> multiple pages, etc. I'm not sure that's worth it - wouldn't that
> substantially complicate the TAM code, which would need to pin+track
> multiple buffers for each batch, etc.? Possible, but is it worth it?
>
> I'm not sure allowing multi-page batches would actually solve the issue.
> It'd help with batches at the "scan level", but presumably the batch
> size in the upper nodes matters just as much. Large scan batches may
> help, but hard to predict.
>
> In the index prefetching patch we chose to keep batches 1:1 with leaf
> pages, at least for now. Instead we allowed having multiple batches at
> once. I'm not sure that'd be necessary for TAMs, though.

I tend to agree with you here. Allowing a single batch to span
multiple pages would add quite a bit of complexity to the table AM
implementations (managing multiple buffer pins per batch, tracking
page boundaries, etc.), and it's unclear if the benefit would justify
that complexity. For now, I'm inclined not to pursue multi-page
batches at the scan level in this patch. We can keep the batch
page-local (e.g., for heap, one batch corresponds to max one page, as
it does now). If we need larger batch sizes overall, we might address
that by other means -- for example, by the above-mentioned idea of a
higher-level batching node or by simply producing multiple batches in
quick succession.

You’re right that even if we made scan batches larger, it doesn’t
necessarily solve everything, since the effective batch size at
higher-level nodes could still be constrained by other factors. So
rather than complicating the low-level TAM code with multi-page
batches, I'd prefer to first see if the current approach (with
one-page batches) yields good benefits and then consider alternatives.
We could also consider letting a scan node produce multiple batches
before yielding to the upper node (similar to how the index
prefetching patch can have multiple leaf page batches in flight) if
needed, but as you note, it might not be necessary for TAMs yet. So at
this stage, I'll keep it simple.

> This also reminds me of LIMIT queries. The way I imagine a "batchified"
> executor to work is that batches are essentially "units of work". For
> example, a nested loop would grab a batch of tuples from the outer
> relation, lookup inner tuples for the whole batch, and only then pass
> the result batch. (I'm ignoring the cases when the batch explodes due to
> duplicates.)
>
> But what if there's a LIMIT 1 on top? Maybe it'd be enough to process
> just the first tuple, and the rest of the batch is wasted work? Plenty
> of (very expensive) OLAP queries have that, and many would likely benefit from
> batching, so just disabling batching if there's LIMIT seems way too
> heavy handed.

Yeah, LIMIT does complicate downstream batching decisions. If we
always use a full-size batch (say 64 tuples) for every operation, a
query with LIMIT 1 could end up doing a lot of unnecessary work
fetching and processing 63 tuples that never get used. Disabling
batching entirely for queries with LIMIT would indeed be overkill and
lose benefits for cases where the limit is not extremely selective.

> Perhaps it'd be good to gradually ramp up the batch size? Start with
> small batches, and then make them larger. The index prefetching does
> that too, indirectly - it reads the whole leaf page as a batch, but then
> gradually ramps up the prefetch distance (well, read_stream does that).
> Maybe the batching should have similar thing ...

An adaptive batch size that ramps up makes a lot of sense as a
solution. We could start with a very small batch (say 4 tuples) and if
we detect that the query needs more (e.g., the LIMIT wasn’t satisfied
yet or more output is still being consumed), then increase the batch
size for subsequent operations. This way, a query that stops early
doesn’t incur the full batching overhead, whereas a query that does
process lots of tuples will gradually get to a larger batch size to
gain efficiency. This is analogous to how the index prefetching ramps
up prefetch distance, as you mentioned.

Implementing that will require some careful thought. It could be done
either in the planner (choose initial batch sizes based on context
like LIMIT) or more dynamically in the executor (adjust on the fly). I
lean towards a runtime heuristic because it’s hard for the planner to
predict exactly how a LIMIT will play out, especially in complex
plans. In any case, I agree that a gradual ramp-up or other adaptive
approach would make batching more robust in the presence of query
execution variability. I will definitely consider adding such logic,
perhaps as an improvement once the basic framework is in.
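
As a rough illustration of the runtime ramp-up described above, the
following plain C sketch doubles the requested batch size each time
another batch is needed, starting at 4 and stopping at a cap. The names
BatchRampState, batch_ramp_init() and batch_ramp_next() are hypothetical
and do not exist in the patch set; doubling is just one possible growth
policy, chosen here for simplicity.

```c
#include <stddef.h>

#define BATCH_SIZE_INITIAL 4	/* assumed starting size, as suggested above */

/* Hypothetical per-scan state for adaptive batch sizing. */
typedef struct BatchRampState
{
	int			next_size;		/* size to request for the next batch */
	int			max_size;		/* upper bound, e.g. the batch size limit */
} BatchRampState;

static void
batch_ramp_init(BatchRampState *st, int max_size)
{
	st->max_size = (max_size < 1) ? 1 : max_size;
	st->next_size = (BATCH_SIZE_INITIAL < st->max_size)
		? BATCH_SIZE_INITIAL : st->max_size;
}

/*
 * Return the size to use for the batch being requested now, and double
 * the size that will be used next time, clamped to max_size.  Being
 * called again at all is the signal that the consumer wanted more rows.
 */
static int
batch_ramp_next(BatchRampState *st)
{
	int			cur = st->next_size;

	if (st->next_size > st->max_size / 2)
		st->next_size = st->max_size;	/* also avoids integer overflow */
	else
		st->next_size *= 2;
	return cur;
}
```

With a cap of 64, a query stopped by LIMIT 1 would touch at most 4 tuples
in its first batch, while a long-running scan reaches the full size after
four batches (4, 8, 16, 32, then 64).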

> In fact, how shall the optimizer decide whether to use batching? It's
> one thing to decide whether a node can produce/consume batches, but
> another thing is "should it"? With a node that "builds" a batch, this
> decision would apply to even more plans, I guess.
>
> I don't have a great answer to this, it seems like an incredibly tricky
> costing issue. I'm a bit worried we might end up with something too
> coarse, like "jit=on" which we know is causing problems (admittedly,
> mostly due to a lot of the LLVM work being unpredictable/external). But
> having some "adaptive" heuristics (like the gradual ramp up) might make
> it less risky.

I agree that deciding when to use batching is tricky. So far, the
patch takes a fairly simplistic approach: if a node (particularly a
scan node) supports batching, it just does it, and other parts of the
plan will consume batches if they are capable. There isn’t yet a
nuanced cost-based decision in the planner for enabling batching. This
is indeed something we’ll have to refine. We don’t want to end up with
a blunt on/off GUC that could cause regressions in some cases.

One idea is to introduce costing for batching: for example, estimate
the per-tuple savings from batching vs the overhead of materialization
or batch setup. However, developing a reliable cost model for that
will take time and experimentation, especially with the possibility of
variable batch sizes or adaptive behavior. Not to mention that it would
add one more dimension to the planner's cost model, making planning
more expensive and less predictable.  In the near term, I’m fine
with relying on feedback and perhaps manual tuning (GUCs, etc.) to
decide on batching, but that’s perhaps not a long-term solution.

I share your inclination that adaptive heuristics might be the safer
path initially. Perhaps the executor can decide to batch or not batch
based on runtime conditions. The gradual ramp-up of batch size is one
such adaptive approach. We could also consider things like monitoring
how effective batching is (are we actually processing full batches or
frequently getting cut off?) and adjust behavior. These are somewhat
speculative ideas at the moment, but the bottom line is I’m aware we
need a smarter strategy than a simple switch. This will likely evolve
as we test the patch in more scenarios.

> FWIW the current batch size limit (64 tuples) seems rather low, but it's
> hard to say. It'd be good to be able to experiment with different
> values, so I suggest we make this a GUC and not a hard-coded constant.

Yeah, I was thinking the same while testing -- the optimal batch size
might vary by workload or hardware, and 64 was a somewhat arbitrary
starting point. I will make the batch size limit configurable
(probably as a GUC executor_batch_tuples, maybe only developer-focused
at first). That will let us and others experiment easily with
different batch sizes to see how it affects performance. It should
also help with your earlier point: for example, on a machine where 64
is too low or too high, we can adjust it without recompiling. So yes,
I'll add a GUC for the batch size in the next version of the patch.

> As for what to add to explain, I'd start by adding info about which
> nodes are "batched" (consuming/producing batches), and some info about
> the batch sizes. An average size, maybe a histogram if you want to be a
> bit fancy.

Adding more information to EXPLAIN is a good idea. In the current
patch, EXPLAIN does not show anything about batching, but it would be
very helpful for debugging and user transparency to indicate which
nodes are operating in batch mode.  I will update EXPLAIN to mark
nodes that produce or consume batches. Likely I’ll start with
something simple like an extra line or tag for a node, e.g., "Batch:
true (avg batch size 64)" or something along those lines. An average
batch size could be computed if we have instrumentation, which would
be useful to see if, say, the batch sizes ended up smaller due to
LIMIT or other factors. A full histogram might be more detail than
most users need, but I agree even just knowing average or maximum
batch size per node could be useful for performance analysis. I'll
implement at least the basics for now, and we can refine it (maybe add
more stats) if needed.

(I had added a flag in the EXPLAIN output at one point, but removed it
because the resulting churn in the regression test output was too noisy,
though I understand I'll have to bite the bullet at some point.)
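
For the average and maximum batch size per node mentioned above, the
bookkeeping could be as small as the following plain C sketch. BatchInstr,
batch_instr_record() and batch_instr_avg() are hypothetical names, not
fields or functions from the patch or from PostgreSQL's existing
Instrumentation struct.

```c
#include <stdint.h>

/* Hypothetical per-node batch counters for EXPLAIN ANALYZE-style output. */
typedef struct BatchInstr
{
	uint64_t	nbatches;		/* number of batches produced */
	uint64_t	ntuples;		/* total tuples across all batches */
	int			max_size;		/* largest batch seen */
} BatchInstr;

/* Record one produced batch of the given size. */
static void
batch_instr_record(BatchInstr *bi, int batch_size)
{
	bi->nbatches++;
	bi->ntuples += (uint64_t) batch_size;
	if (batch_size > bi->max_size)
		bi->max_size = batch_size;
}

/* Average batch size; 0.0 when the node never produced a batch. */
static double
batch_instr_avg(const BatchInstr *bi)
{
	if (bi->nbatches == 0)
		return 0.0;
	return (double) bi->ntuples / (double) bi->nbatches;
}
```

An average well below the configured limit would then hint that LIMIT,
filtering, or page boundaries are cutting batches short.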

> Now, numbers from some microbenchmarks:
>
> On 9/26/25 15:28, Amit Langote wrote:
> > To evaluate the overheads and benefits, I ran microbenchmarks with
> > single and multi-aggregate queries on a single table, with and without
> > WHERE clauses. Tables were fully VACUUMed so visibility maps are set
> > and IO costs are minimal. shared_buffers was large enough to fit the
> > whole table (up to 10M rows, ~43 on each page), and all pages were
> > prewarmed into cache before tests. Table schema/script is at [2].
> >
> > Observations from benchmarking (Detailed benchmark tables are at [3];
> > below is just a high-level summary of the main patterns):
> >
> > * Single aggregate, no WHERE (SELECT count(*) FROM bar_N, SELECT
> > sum(a) FROM bar_N): batching scan output alone improved latency by
> > ~10-20%. Adding batched transition evaluation pushed gains to ~30-40%,
> > especially once fmgr overhead was paid per batch instead of per row.
> >
> > * Single aggregate, with WHERE (WHERE a > 0 AND a < N): batching the
> > qual interpreter gave a big step up, with latencies dropping by
> > ~30-40% compared to batching=off.
> >
> > * Five aggregates, no WHERE: batching input from the child scan cut
> > ~15% off runtime. Adding batched transition evaluation increased
> > improvements to ~30%.
> >
> > * Five aggregates, with WHERE: modest gains from scan/input batching,
> > but per-batch transition evaluation and batched quals brought ~20-30%
> > improvement.
> >
> > * Across all cases, executor overheads became visible only after IO
> > was minimized. Once executor cost dominated, batching consistently
> > reduced CPU time, with the largest benefits coming from avoiding
> > per-row fmgr calls and evaluating quals across batches.
> >
> > I would appreciate if others could try these patches with their own
> > microbenchmarks or workloads and see if they can reproduce numbers
> > similar to mine. Feedback on both the general direction and the
> > details of the patches would be very helpful. In particular, patches
> > 0001-0003, which add the basic batch APIs and integrate them into
> > SeqScan, are intended to be the first candidates for review and
> > eventual commit. Comments on the later, more experimental patches
> > (aggregate input batching and expression evaluation (qual, aggregate
> > transition) batching) are also welcome.
> >
>
> I tried to replicate the results, but the numbers I see are not this
> good. In fact, I see a fair number of regressions (and some are not
> negligible).
>
> I'm attaching the scripts I used to build the tables / run the test. I
> used the same table structure, and tried to follow the same query
> pattern with 1 or 5 aggregates (I used "avg"), [0, 1, 5] where
> conditions (with 100% selectivity).
>
> I measured master vs. 0001-0003 vs. 0001-0007 (with batching on/off).
> And I did that on my (relatively) new ryzen machine, and old xeon. The
> behavior is quite different for the two machines, but neither of them shows
> such improvements. I used clang 19.0, and --with-llvm.
>
> See the attached PDFs with a summary of the results, comparing the
> results for master and the two batching branches.
>
> The ryzen is much "smoother" - it shows almost no difference with
> batching "off" (as expected). The "scan" branch (with 0001-0003) shows
> an improvement of 5-10% - it's consistent, but much less than the 10-20%
> you report. For the "agg" branch the benefits are much larger, but
> there's also a significant regression for the largest table with 100M
> rows (which is ~18GB on disk).
>
> For xeon, the results are a bit more variable, but it affects runs both
> with batching "on" and "off". The machine is just more noisy. There
> seems to be a small benefit of "scan" batching (in most cases much less
> than the 10-20%). The "agg" is a clear win, with up to 30-40% speedup,
> and no regression similar to the ryzen.
>
> Perhaps I did something wrong. It does not surprise me this is somewhat
> CPU dependent. It's a bit sad the improvements are smaller for the newer
> CPU, though.

Thanks for sharing your benchmark results -- that’s very useful data.
I haven’t yet finished investigating why there's a regression relative
to master when executor_batching is turned off. I re-ran my benchmarks
to include comparisons with master and did observe some regressions in
a few cases too, but I didn't see anything obvious in profiles that
explained the slowdown. I initially assumed it might be noise, but now
I suspect it could be related to structural changes in the scan code
-- for example, I added a few new fields in the middle of
HeapScanDescData, and even though the batching logic is bypassed when
executor_batching is off, it’s possible that change alone affects
memory layout or cache behavior in a way that penalizes the unbatched
path. I haven’t confirmed that yet, but it’s on my list to look into
more closely.

Your observation that newer CPUs like the Ryzen may see smaller
improvements makes sense -- perhaps they handle the per-tuple overhead
more efficiently to begin with. Still, I’d prefer not to see
regressions at all, even in the unbatched case, so I’ll focus on
understanding and fixing that part before drawing conclusions from the
performance data.

Thanks again for the scripts -- those will help a lot in narrowing things down.

> I also tried running TPC-H. I don't have useful numbers yet, but I ran
> into a segfault - see the attached backtrace. It only happens with the
> batching, and only on Q22 for some reason. I initially thought it's a
> bug in clang, because I saw it with clang-22 built from git, and not
> with clang-14 or gcc. But since then I reproduced it with clang-19 (on
> debian 13). Still could be a clang bug, of course. I've seen ~20 of
> those segfaults so far, and the backtraces look exactly the same.

The v3 I posted fixes a tricky bug in the new EEOPs for batched-agg
evaluation that I suspect is also causing the crash you saw.

I'll try to post a v4 in a couple of weeks with some of the things I
mentioned above.

--
Thanks, Amit Langote






* Re: Batching in executor
@ 2025-10-27 16:18  Tomas Vondra <[email protected]>
  parent: Amit Langote <[email protected]>
  0 siblings, 1 reply; 22+ messages in thread

From: Tomas Vondra @ 2025-10-27 16:18 UTC (permalink / raw)
  To: Amit Langote <[email protected]>; +Cc: pgsql-hackers

On 10/27/25 08:24, Amit Langote wrote:
> Hi Tomas,
> 
> On Mon, Sep 29, 2025 at 8:01 PM Tomas Vondra <[email protected]> wrote:
>>
>> Hi Amit,
>>
>> Thanks for the patch. I took a look over the weekend, and did a couple
>> of experiments / benchmarks, so let me share some initial feedback (or
>> rather a bunch of questions I came up with).
> 
> Thank you for reviewing the patch and taking the time to run those
> experiments. I appreciate the detailed feedback and questions.  I also
> apologize for my late reply; I spent perhaps way too much time going
> over your index prefetching thread trying to understand the notion of
> batching that it uses and getting sidelined by other things while
> writing this reply.
> 

Cool! Now you can do a review of the index prefetch patch ;-)

>> I'll start with some general thoughts, before going into some nitpicky
>> comments about patches / code and perf results.
>>
>> I think the general goal of the patch - reducing the per-tuple overhead
>> and making the executor more efficient for OLAP workloads - is very
>> desirable. I believe the limitations of per-row executor are one of the
>> reasons why attempts to implement a columnar TAM mostly failed. The
>> compression is nice, but it's hard to be competitive without an executor
>> that leverages that too. So starting with an executor, in a way that
>> helps even heap, seems like a good plan. So +1 to this.
> 
> I'm happy to hear that you find the overall direction worthwhile.
> 
>> While looking at the patch, I couldn't help but think about the index
>> prefetching stuff that I work on. It also introduces the concept of a
>> "batch", for passing data between an index AM and the executor. It's
>> interesting how different the designs are in some respects. I'm not
>> saying one of those designs is wrong, it's more due to different goals.
>>
>> For example, the index prefetching patch establishes a "shared" batch
>> struct, and the index AM is expected to fill it with data. After that,
>> the batch is managed entirely by indexam.c, with no AM calls. The only
>> AM-specific bit in the batch is "position", but that's used only when
>> advancing to the next page, etc.
>>
>> This patch does things differently. IIUC, each TAM may produce its own
>> "batch", which is then wrapped in a generic one. For example, heap
>> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
>> is fine. In the prefetching we chose to move all this code (walking the
>> batch items) from the AMs into the layer above, and make it AM agnostic.
> 
> ...
> 
>> But for the batching, we want to retain the custom format as long as
>> possible. Presumably, the various advantages of the TAMs are tied to the
>> custom/columnar storage format. Memory efficiency thanks to compression,
>> execution on compressed data, etc. Keeping the custom format as long as
>> possible is the whole point of "late materialization" (and materializing
>> as late as possible is one of the important details in column stores).
> 
> Exactly -- keeping the TAM-specific batch format as long as possible
> is a key goal here. As you noted, the benefits of a custom storage
> format (compression, operating on compressed data, etc.) are best
> realized when we delay materialization until absolutely necessary.  I
> want to design this patch so that each TAM can produce and use its own
> batch representation internally, only wrapping it when interfacing
> with the executor in a generic way.  I admit that's not entirely true
> of the patch as it stands, as I explain below.
> 

Understood. Makes sense in general.

>> How far ahead have you thought about these capabilities? I was wondering
>> about two things in particular. First, at which point do we have to
>> "materialize" the TupleBatch into some generic format (e.g. TupleSlots).
>> I get it that you want to enable passing batches between nodes, but
>> would those use the same "format" as the underlying scan node, or some
>> generic one? Second, will it be possible to execute expressions on the
>> custom batches (i.e. on "compressed data")? Or is it necessary to
>> "materialize" the batch into regular tuple slots? I realize those may
>> not be there "now" but maybe it'd be nice to plan for the future.
> 
> I have been thinking about those future capabilities. Currently, the
> patch keeps tuples in the TAM-specific batch format up until they need
> to be consumed by a node that doesn’t understand that format or has
> not been modified to invoke the TAM callbacks to decode it.  In the
> current patch, that means we materialize to regular TupleTableSlots at
> nodes that require it (for example, the scan node reading from TAM
> needing to evaluate quals, etc.). However, the intention is to allow
> batches to be passed through as many nodes as possible without
> materialization, ideally using the same format produced by the scan
> node all the way up until reaching a node that can only work with
> tuples in TupleTableSlots.
> 
> As for executing expressions directly on the custom batch data: that’s
> something I would like to enable in the future. Right now, expressions
> (quals, projections, etc.) are evaluated after materializing into
> normal tuples in TupleTableSlots stored in TupleBatch, because the
> expression evaluation code isn’t yet fully batch-aware and is very far
> from doing things like operating on compressed data in its native form.
> Patches 0004-0008 do try to add batch-aware expression evaluation but
> that's just a prototype.  In the long term, the goal is to allow
> expression evaluation on batch data (for example, applying a WHERE
> clause or aggregate transition directly on a columnar batch without
> converting it to heap tuples first). This will require significant new
> infrastructure (perhaps specialized batch-aware expression operators
> and functions), so it's not in the current patch, but I agree it's
> important to plan for it. The current design doesn’t preclude it -- it
> lays some groundwork by introducing the batch abstraction -- but fully
> supporting that will be future work.
> 
> That said, one area I’d like to mention while at it, especially to
> enable native execution on compressed or columnar batches, is giving
> the table AM more control over how expression evaluation is performed
> on its batch data. In the current patch, the AM can provide a
> materialize function via TupleBatchOps, but that always produces an
> array of TupleTableSlots stored in the TupleBatch, not an opaque
> representation that remains under AM control. Maybe that's not bad for
> a v1 patch.

I think materializing into a batch of TupleTableSlots (and then doing
the regular expression evaluation) seems perfectly fine for v1. It's the
simplest fallback possible, and we'll need it anyway if overriding the
expression evaluation will be optional (which I assume it will be?).

> When evaluating expressions over a batch, a BatchVector
> is built by looping over these slots and invoking the standard
> per-tuple getsomeattrs() to "deform" a tuple into needed columns.
> While that enables batch-style EEOPs for qual evaluation and aggregate
> transition (and is already a gain over per-row evaluation), it misses
> the opportunity to leverage any batch-specific optimizations the AM
> could offer, such as vectorized decoding or filtering over compressed
> data, and other AM optimizations for getting only the necessary
> columns out possibly in a vector format.
> 

I'm not sure about this BatchVector thing. I haven't looked into it
very much, but I'd expect building the vector to cost more than it
saves (compared to just doing the materialize + regular evaluation).
Maybe I'm completely wrong, though. Or maybe we could keep the vector
representation for multiple operations? No idea.

But it seems like a great area for experimenting ...
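
For anyone who wants to experiment with that trade-off, here is a minimal
plain C sketch of the two steps involved: building a column vector from
row-shaped data, then evaluating a range qual (like the "a > 0 AND a < N"
condition used in the benchmarks) over the vector. Row, IntVector,
vector_from_rows() and vector_qual_range() are hypothetical and far
simpler than the BatchVector in the patch; the point is only to separate
the construction cost from the evaluation loop so each can be timed.

```c
#include <stddef.h>

#define VEC_CAPACITY 64			/* assumed to match the batch size limit */

/* Hypothetical row layout standing in for a deformed tuple. */
typedef struct Row
{
	int			a;
	int			b;
} Row;

/* Hypothetical single-column vector of ints. */
typedef struct IntVector
{
	int			vals[VEC_CAPACITY];
	int			n;
} IntVector;

/* Construction step: one pass over the rows to copy out column "a". */
static void
vector_from_rows(const Row *rows, int nrows, IntVector *vec)
{
	int			n = (nrows < VEC_CAPACITY) ? nrows : VEC_CAPACITY;

	for (int i = 0; i < n; i++)
		vec->vals[i] = rows[i].a;
	vec->n = n;
}

/*
 * Evaluation step: apply "val > lo AND val < hi" to every element and
 * store the indexes of passing elements in sel[], which must have room
 * for vec->n entries.  Returns the number of passing elements.
 */
static int
vector_qual_range(const IntVector *vec, int lo, int hi, int *sel)
{
	int			nsel = 0;

	for (int i = 0; i < vec->n; i++)
	{
		if (vec->vals[i] > lo && vec->vals[i] < hi)
			sel[nsel++] = i;
	}
	return nsel;
}
```

If the same vector can feed several quals or aggregate transitions, the
construction pass is paid once and amortized, which is the "keep the
vector representation for multiple operations" possibility.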

> I’m considering extending TupleTableSlotOps with a batch-aware variant
> of getsomeattrs(), something like slot_getsomeattrs_batch(), so that
> AMs can populate column vectors (e.g., BatchVector) directly from
> their native format. That would allow bypassing slot materialization
> entirely and plug AM-provided decoding logic directly into the
> executor’s batch expression paths. This isn’t implemented yet, but I
> see it as a necessary step toward supporting fully native expression
> evaluation over compressed or columnar formats. I’m not yet sure if
> TupleTableSlotOps is the right place for such a hook, it might belong
> elsewhere in the abstraction, but exposing a batch-aware interface for
> this purpose seems like the right direction.
> 

No opinion. I don't see it as a necessary prerequisite for the other
parts of the patch series, but maybe the BatchVector really helps, and
then this would make perfect sense. I'm not sure there's a single
"correct" sequence in which to do these improvements, it's always a
matter of opinion.

>> It might be worth exploring some columnar formats, and see if this
>> design would be a good fit. Let's say we want to process data read from
>> a parquet file. Would we be able to leverage the format, or would we
>> need to "materialize" into slots too early? Or maybe it'd be good to
>> look at the VCI extension [1], discussed in a nearby thread. AFAICS
>> that's still based on an index AM, but there were suggestions to use TAM
>> instead (and maybe that'd be a better choice).
> 
> Yeah, looking at columnar TAMs or FDWs is on my list. I do think the
> design should be able to accommodate true columnar formats like
> Parquet. If we had a table AM (or FDW) that reads Parquet files into a
> columnar batch structure, the executor batching framework should
> ideally allow us to pass that batch along without immediately
> materializing to tuples.  As mentioned before, we might have to adjust
> or extend the TupleBatch abstraction to handle a wider variety of
> batch formats, but conceptually it fits -- the goal is to avoid
> forcing early materialization. I will definitely keep the Parquet
> use-case in mind and perhaps do some experiments with a columnar
> source to ensure we aren’t baking in any unnecessary materialization.
> Also, thanks for the reference to the VCI extension thread; I'll take
> a look at that.
> 

+1 I think having a TAM/FDW reading those established and common formats
is a good way to validate the overall design.

>> The other option would be to "create batches" during execution, say by
>> having a new node that accumulates tuples, builds a batch and sends it
>> to the node above. This would help both in cases when either the lower
>> node does not produce batches at all, or the batches are too small (due
>> to filtering, aggregation, ...). Of course, it'd only win if this
>> increases efficiency of the upper part of the plan enough to pay for
>> building the batches. That can be a hard decision.
> 
> Yes, introducing a dedicated executor node to accumulate and form
> batches on the fly is an interesting idea; I have thought about it and
> even mentioned it in passing at the PGConf.dev unconference. This
> could indeed cover scenarios where the data source (a node) doesn't
> produce batches (e.g., a non-batching node feeding into a
> batching-aware upper node) or where batches coming from below are too
> small to be efficient. The current patch set doesn’t implement such a
> node; I focused on enabling batching at the scan/TAM level first. The
> cost/benefit decision for a batch-aggregator node is tricky, as you
> said. We’d need a way to decide when the overhead of gathering tuples
> into a batch is outweighed by the benefits to the upper node. This
> likely ties into costing or adaptive execution decisions. It's
> something I’m open to exploring in a future iteration, perhaps once we
> have more feedback on how the existing batching performs in various
> scenarios. It might also require some planner or executor smarts
> (maybe the executor can decide to batch on the fly if it sees a
> pattern of use, or the planner could insert such nodes when
> beneficial).
> 

Yeah, those are good questions. I don't have a clear idea how we should
decide when to do this batching. Costing during planning is the
"traditional" option, with all the issues (e.g. it requires a reasonably
good cost model). Another option would be some sort of execution-time
heuristics - but then which node would be responsible for building the
batches (if we didn't create them during planning)?

I agree it makes sense to focus on batching at the TAM/scan level for
now. That's a pretty big project already.

>> You also mentioned we could make batches larger by letting them span
>> multiple pages, etc. I'm not sure that's worth it - wouldn't that
>> substantially complicate the TAM code, which would need to pin+track
>> multiple buffers for each batch, etc.? Possible, but is it worth it?
>>
>> I'm not sure allowing multi-page batches would actually solve the issue.
>> It'd help with batches at the "scan level", but presumably the batch
>> size in the upper nodes matters just as much. Large scan batches may
>> help, but hard to predict.
>>
>> In the index prefetching patch we chose to keep batches 1:1 with leaf
>> pages, at least for now. Instead we allowed having multiple batches at
>> once. I'm not sure that'd be necessary for TAMs, though.
> 
> I tend to agree with you here. Allowing a single batch to span
> multiple pages would add quite a bit of complexity to the table AM
> implementations (managing multiple buffer pins per batch, tracking
> page boundaries, etc.), and it's unclear if the benefit would justify
> that complexity. For now, I'm inclined not to pursue multi-page
> batches at the scan level in this patch. We can keep the batch
> page-local (e.g., for heap, one batch corresponds to max one page, as
> it does now). If we need larger batch sizes overall, we might address
> that by other means -- for example, by the above-mentioned idea of a
> higher-level batching node or by simply producing multiple batches in
> quick succession.
> 

+1

> You’re right that even if we made scan batches larger, it doesn’t
> necessarily solve everything, since the effective batch size at
> higher-level nodes could still be constrained by other factors. So
> rather than complicating the low-level TAM code with multi-page
> batches, I'd prefer to first see if the current approach (with
> one-page batches) yields good benefits and then consider alternatives.
> We could also consider letting a scan node produce multiple batches
> before yielding to the upper node (similar to how the index
> prefetching patch can have multiple leaf page batches in flight) if
> needed, but as you note, it might not be necessary for TAMs yet. So at
> this stage, I'll keep it simple.
> 

+1

>> This also reminds me of LIMIT queries. The way I imagine a "batchified"
>> executor to work is that batches are essentially "units of work". For
>> example, a nested loop would grab a batch of tuples from the outer
>> relation, lookup inner tuples for the whole batch, and only then pass
>> the result batch. (I'm ignoring the cases when the batch explodes due to
>> duplicates.)
>>
>> But what if there's a LIMIT 1 on top? Maybe it'd be enough to process
>> just the first tuple, and the rest of the batch is wasted work? Plenty
>> of (very expensive) OLAP queries have that, and many would likely benefit from
>> batching, so just disabling batching if there's LIMIT seems way too
>> heavy handed.
> 
> Yeah, LIMIT does complicate downstream batching decisions. If we
> always use a full-size batch (say 64 tuples) for every operation, a
> query with LIMIT 1 could end up doing a lot of unnecessary work
> fetching and processing 63 tuples that never get used. Disabling
> batching entirely for queries with LIMIT would indeed be overkill and
> lose benefits for cases where the limit is not extremely selective.
> 
>> Perhaps it'd be good to gradually ramp up the batch size? Start with
>> small batches, and then make them larger. The index prefetching does
>> that too, indirectly - it reads the whole leaf page as a batch, but then
>> gradually ramps up the prefetch distance (well, read_stream does that).
>> Maybe the batching should have similar thing ...
> 
> An adaptive batch size that ramps up makes a lot of sense as a
> solution. We could start with a very small batch (say 4 tuples) and if
> we detect that the query needs more (e.g., the LIMIT wasn’t satisfied
> yet or more output is still being consumed), then increase the batch
> size for subsequent operations. This way, a query that stops early
> doesn’t incur the full batching overhead, whereas a query that does
> process lots of tuples will gradually get to a larger batch size to
> gain efficiency. This is analogous to how the index prefetching ramps
> up prefetch distance, as you mentioned.
> 
> Implementing that will require some careful thought. It could be done
> either in the planner (choose initial batch sizes based on context
> like LIMIT) or more dynamically in the executor (adjust on the fly). I
> lean towards a runtime heuristic because it’s hard for the planner to
> predict exactly how a LIMIT will play out, especially in complex
> plans. In any case, I agree that a gradual ramp-up or other adaptive
> approach would make batching more robust in the presence of query
> execution variability. I will definitely consider adding such logic,
> perhaps as an improvement once the basic framework is in.
> 

I agree a runtime heuristic is probably the right approach. After all,
a lot of the issues with LIMIT queries are due to the planner not knowing
the real data distribution, etc.
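
To make the ramp-up idea concrete, here is a minimal sketch. It is not code
from the patch: the names BatchRamp, batch_ramp_init and batch_ramp_next are
made up for illustration, and so are the doubling policy and the starting
size. Only the 64-tuple cap and the "start small" example of 4 tuples come
from the discussion above.

```c
#include <assert.h>

/* Hypothetical state for ramping up the batch size; not part of the patch. */
typedef struct BatchRamp
{
    int         cur_size;   /* size to use for the next batch */
    int         max_size;   /* upper limit, e.g. the 64-tuple cap */
} BatchRamp;

void
batch_ramp_init(BatchRamp *ramp, int start_size, int max_size)
{
    assert(start_size > 0 && start_size <= max_size);
    ramp->cur_size = start_size;
    ramp->max_size = max_size;
}

/*
 * Return the size to use for the next batch, then double the stored size
 * (capped at max_size) for the call after that.
 */
int
batch_ramp_next(BatchRamp *ramp)
{
    int         size = ramp->cur_size;

    if (ramp->cur_size > ramp->max_size / 2)
        ramp->cur_size = ramp->max_size;
    else
        ramp->cur_size *= 2;
    return size;
}
```

With a start of 4 and a cap of 64, a query that stops after the first batch
only pays for 4 tuples, while a scan that keeps going reaches the full
64-tuple size on its fifth batch.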

>> In fact, how shall the optimizer decide whether to use batching? It's
>> one thing to decide whether a node can produce/consume batches, but
>> another thing is "should it"? With a node that "builds" a batch, this
>> decision would apply to even more plans, I guess.
>>
>> I don't have a great answer to this, it seems like an incredibly tricky
>> costing issue. I'm a bit worried we might end up with something too
>> coarse, like "jit=on" which we know is causing problems (admittedly,
>> mostly due to a lot of the LLVM work being unpredictable/external). But
>> having some "adaptive" heuristics (like the gradual ramp up) might make
>> it less risky.
> 
> I agree that deciding when to use batching is tricky. So far, the
> patch takes a fairly simplistic approach: if a node (particularly a
> scan node) supports batching, it just does it, and other parts of the
> plan will consume batches if they are capable. There isn’t yet a
> nuanced cost-based decision in the planner for enabling batching. This
> is indeed something we’ll have to refine. We don’t want to end up with
> a blunt on/off GUC that could cause regressions in some cases.
> 
> One idea is to introduce costing for batching: for example, estimate
> the per-tuple savings from batching vs the overhead of materialization
> or batch setup. However, developing a reliable cost model for that
> will take time and experimentation, especially with the possibility of
> variable batch sizes or adaptive behavior. Not to mention, that will
> be adding one more dimension to planner's costing model making the
> planning more expensive and unpredictable.  In the near term, I’m fine
> with relying on feedback and perhaps manual tuning (GUCs, etc.) to
> decide on batching, but that’s perhaps not a long-term solution.
> 

Yeah, the cost model is going to be hard, because this depends on so
many low-level plan/hardware details. Like, the TAM may allow execution
on compressed data / leverage vectorization, .... But maybe the CPU does
not do that efficiently? There are so many unknown unknowns ...

Considering we still haven't fixed the JIT cost model, maybe it's better
to not rely on it too much for this batching patch? Also, all those
details contradict the idea that cost models are a simplified model of
the reality.

> I share your inclination that adaptive heuristics might be the safer
> path initially. Perhaps the executor can decide to batch or not batch
> based on runtime conditions. The gradual ramp-up of batch size is one
> such adaptive approach. We could also consider things like monitoring
> how effective batching is (are we actually processing full batches or
> frequently getting cut off?) and adjust behavior. These are somewhat
> speculative ideas at the moment, but the bottom line is I’m aware we
> need a smarter strategy than a simple switch. This will likely evolve
> as we test the patch in more scenarios.
> 

I think the big question is how much can the batching change the
relative cost of two plans (I mean, actual cost, not just estimates).

Imagine plans P1 and P2, where

   cost(P1) < cost(P2) = cost(P1) + delta

where "delta" is small (so P1 is faster, but not much).  If we
"batchify" the plans into P1' and P2', can this happen?

  cost(P1') >> cost(P2')

That is, can the "slower" plan P2 benefit much more from the batching,
making it significantly faster?

If this is unlikely, we could entirely ignore batching during planning,
and only do that as post-processing on the selected plan, or perhaps
even just during execution.

OTOH that's what JIT does, and we know it's not perfect - but that's
mostly because JIT has rather unpredictable costs when enabling. Maybe
batching doesn't have that.

>> FWIW the current batch size limit (64 tuples) seems rather low, but it's
>> hard to say. It'd be good to be able to experiment with different
>> values, so I suggest we make this a GUC and not a hard-coded constant.
> 
> Yeah, I was thinking the same while testing -- the optimal batch size
> might vary by workload or hardware, and 64 was a somewhat arbitrary
> starting point. I will make the batch size limit configurable
> (probably as a GUC executor_batch_tuples, maybe only developer-focused
> at first). That will let us and others experiment easily with
> different batch sizes to see how it affects performance. It should
> also help with your earlier point: for example, on a machine where 64
> is too low or too high, we can adjust it without recompiling. So yes,
> I'll add a GUC for the batch size in the next version of the patch.
> 

+1 to having a developer-only GUC for testing. But the goal should be to not
expect users to tune this.

>> As for what to add to explain, I'd start by adding info about which
>> nodes are "batched" (consuming/producing batches), and some info about
>> the batch sizes. An average size, maybe a histogram if you want to be a
>> bit fancy.
> 
> Adding more information to EXPLAIN is a good idea. In the current
> patch, EXPLAIN does not show anything about batching, but it would be
> very helpful for debugging and user transparency to indicate which
> nodes are operating in batch mode.  I will update EXPLAIN to mark
> nodes that produce or consume batches. Likely I’ll start with
> something simple like an extra line or tag for a node, e.g., "Batch:
> true (avg batch size 64)" or something along those lines. An average
> batch size could be computed if we have instrumentation, which would
> be useful to see if, say, the batch sizes ended up smaller due to
> LIMIT or other factors. A full histogram might be more detail than
> most users need, but I agree even just knowing average or maximum
> batch size per node could be useful for performance analysis. I'll
> implement at least the basics for now, and we can refine it (maybe add
> more stats) if needed.

+1 to start with something simple
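
A simple version of that instrumentation could look like the sketch below.
This is only an illustration: BatchInstr and its two functions are
hypothetical names, not part of the patch or of the existing
Instrumentation struct.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-node batch counters; not part of the patch. */
typedef struct BatchInstr
{
    uint64_t    nbatches;   /* number of batches produced */
    uint64_t    ntuples;    /* total tuples across all batches */
    int         max_size;   /* largest batch seen */
} BatchInstr;

/* Record one produced batch of the given size. */
void
batch_instr_count(BatchInstr *instr, int batch_size)
{
    assert(batch_size >= 0);
    instr->nbatches++;
    instr->ntuples += (uint64_t) batch_size;
    if (batch_size > instr->max_size)
        instr->max_size = batch_size;
}

/* Average batch size, or 0 if the node produced no batches. */
double
batch_instr_avg(const BatchInstr *instr)
{
    if (instr->nbatches == 0)
        return 0.0;
    return (double) instr->ntuples / (double) instr->nbatches;
}
```

Three counters per node are enough to print an average and a maximum batch
size; a histogram would need extra per-bucket counters on top of this.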

> 
> (I had added a flag in the EXPLAIN output at one point, but removed it
> due to finding the regression output churn too noisy, though I
> understand I'll have to bite the bullet at some point.)
> 

Why would there be regression churn, if the option is disabled by default?

>> Now, numbers from some microbenchmarks:
>>
>> ...
>>
>> Perhaps I did something wrong. It does not surprise me this is somewhat
>> CPU dependent. It's a bit sad the improvements are smaller for the newer
>> CPU, though.
> 
> Thanks for sharing your benchmark results -- that’s very useful data.
> I haven’t yet finished investigating why there's a regression relative
> to master when executor_batching is turned off. I re-ran my benchmarks
> to include comparisons with master and did observe some regressions in
> a few cases too, but I didn't see anything obvious in profiles that
> explained the slowdown. I initially assumed it might be noise, but now
> I suspect it could be related to structural changes in the scan code
> -- for example, I added a few new fields in the middle of
> HeapScanDescData, and even though the batching logic is bypassed when
> executor_batching is off, it’s possible that change alone affects
> memory layout or cache behavior in a way that penalizes the unbatched
> path. I haven’t confirmed that yet, but it’s on my list to look into
> more closely.
> 
> Your observation that newer CPUs like the Ryzen may see smaller
> improvements makes sense -- perhaps they handle the per-tuple overhead
> more efficiently to begin with. Still, I’d prefer not to see
> regressions at all, even in the unbatched case, so I’ll focus on
> understanding and fixing that part before drawing conclusions from the
> performance data.
> 
> Thanks again for the scripts -- those will help a lot in narrowing things down.
> 

If needed, I can rerun the tests and collect additional information
(e.g. maybe perf-stat or perf-diff would be interesting).

>> I also tried running TPC-H. I don't have useful numbers yet, but I ran
>> into a segfault - see the attached backtrace. It only happens with the
>> batching, and only on Q22 for some reason. I initially thought it's a
>> bug in clang, because I saw it with clang-22 built from git, and not
>> with clang-14 or gcc. But since then I reproduced it with clang-19 (on
>> debian 13). Still could be a clang bug, of course. I've seen ~20 of
>> those segfaults so far, and the backtraces look exactly the same.
> 
> The v3 I posted fixes a tricky bug in the new EEOPs for batched-agg
> evaluation that I suspect is also causing the crash you saw.
> 
> I'll try to post a v4 in a couple of weeks with some of the things I
> mentioned above.
> 

Sounds good. Thank you.


regards

-- 
Tomas Vondra







* Re: Batching in executor
@ 2025-10-27 17:37  Peter Geoghegan <[email protected]>
  parent: Tomas Vondra <[email protected]>
  3 siblings, 1 reply; 22+ messages in thread

From: Peter Geoghegan @ 2025-10-27 17:37 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Amit Langote <[email protected]>; pgsql-hackers

On Mon, Sep 29, 2025 at 7:01 AM Tomas Vondra <[email protected]> wrote:
> While looking at the patch, I couldn't help but think about the index
> prefetching stuff that I work on. It also introduces the concept of a
> "batch", for passing data between an index AM and the executor. It's
> interesting how different the designs are in some respects. I'm not
> saying one of those designs is wrong, it's more due to different goals.

I've been working on a new prototype enhancement to the index
prefetching patch. The new spinoff patch has index scans batch up
calls to heap_hot_search_buffer for heap TIDs that the scan has yet to
return. This optimization is effective whenever an index scan returns
a contiguous group of TIDs that all point to the same heap page. We're
able to lock and unlock heap page buffers at the same point that
they're pinned and unpinned, which can dramatically decrease the
number of heap buffer locks acquired by index scans that return
contiguous TIDs (which is very common).

I find that pgbench SELECT variants with a predicate such
as "WHERE aid BETWEEN 1000 AND 1500" can have up to ~20% higher
throughput, at least in cases with low client counts (think 1 or 2
clients). These are cases where everything can fit in shared buffers,
so we're not getting any benefit from I/O prefetching (in spite of the
fact that this is built on top of the index prefetching patchset).

It makes sense to put this in scope for the index prefetching work
because that work will already give code outside of an index AM
visibility into which group of TIDs need to be read next. Right now
(on master) there is some trivial sense in which index AMs use their
own batches, but that's completely hidden from external callers.

> For example, the index prefetching patch establishes a "shared" batch
> struct, and the index AM is expected to fill it with data. After that,
> the batch is managed entirely by indexam.c, with no AM calls. The only
> AM-specific bit in the batch is "position", but that's used only when
> advancing to the next page, etc.

The major difficulty with my heap batching prototype is getting the
layering right (no surprises there). In some sense we're deliberately
sharing information across what we currently think of as
different layers of abstraction, in order to be able to "schedule" the
work more intelligently. There's a number of competing considerations.

I have invented a new concept of heap batch, that is orthogonal to the
existing concept of index batches. Right now these are just an array
of HeapTuple structs that relate to exactly one group of
contiguous heap TIDs (i.e. if the index scan returns TIDs even a
little out of order, which is fairly common, we cannot currently
reorder the work in the current prototype patch).

Once a batch is prepared, calls to heapam_index_fetch_tuple just
return the next TID from the batch (until the next time we have to
return a TID pointing to some distinct heap block). In the case of
pgbench queries like the one I mentioned, we only need to call
LockBuffer/heap_hot_search_buffer once for every 61 heap tuples
returned (not once per heap tuple returned).
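
As a rough illustration of where the savings come from, the toy function
below counts lock acquisitions when a heap page is locked once per run of
consecutive same-block TIDs. ToyTid and count_locks_batched are made-up
names for this sketch; the prototype's actual data structures are not shown
in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a heap TID: block number plus line pointer offset. */
typedef struct ToyTid
{
    uint32_t    block;
    uint16_t    offset;
} ToyTid;

/*
 * Count buffer lock acquisitions if the heap page is locked once per run of
 * consecutive TIDs that share a block, instead of once per TID.
 */
size_t
count_locks_batched(const ToyTid *tids, size_t ntids)
{
    size_t      nlocks = 0;

    assert(tids != NULL || ntids == 0);
    for (size_t i = 0; i < ntids; i++)
    {
        /* A new run starts at the first TID and whenever the block changes. */
        if (i == 0 || tids[i].block != tids[i - 1].block)
            nlocks++;
    }
    return nlocks;
}
```

A TID that points back to an earlier block after an intervening block still
starts a new run here, which mirrors the "no reordering" restriction
mentioned above. The unbatched cost is simply ntids, so a run of 61 TIDs on
one page goes from 61 lock acquisitions to 1.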

Importantly, the new interface added by my new prototype spinoff patch
is higher level than the existing
table_index_fetch_tuple/heapam_index_fetch_tuple interface. The
executor asks the table AM "give me the next heap TID in the current
scan direction", rather than asking "give me this heap TID". The
general idea is that the table AM has a direct understanding of
ordered index scans.

The advantage of this higher-level interface is that it gives the
table AM maximum freedom to reorder work. As I said already, we won't
do things like merge together logically noncontiguous accesses to the
same heap page into one physical access right now. But I think that
that should at least be enabled by this interface.

The downside of this approach is that the table AM (not the executor
proper) is responsible for interfacing with the index AM layer. I
think that this can be generalized without very much code duplication
across table AMs. But it's hard.

> This patch does things differently. IIUC, each TAM may produce its own
> "batch", which is then wrapped in a generic one. For example, heap
> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> is fine. In the prefetching we chose to move all this code (walking the
> batch items) from the AMs into the layer above, and make it AM agnostic.

I think that the base index prefetching patch's current notion of
index-AM-wise batches can be kept quite separate from any table AM
batch concept that might be invented, either as part of what I'm
working on, or in Amit's patch.

It probably wouldn't be terribly difficult to get the new interface
I've described to return heap tuples in whatever batch format Amit
comes up with. That only has a benefit if it makes life easier for
expression evaluation in higher levels of the plan tree, but it might
just make sense to always do it that way. I doubt that adopting Amit's
batch format will make life much harder for the
heap_hot_search_buffer-batching mechanism (at least if it is generally
understood that its new index scan interface builds batches in
Amit's format on a best-effort basis).

-- 
Peter Geoghegan






* Re: Batching in executor
@ 2025-10-28 13:11  Amit Langote <[email protected]>
  parent: Peter Geoghegan <[email protected]>
  0 siblings, 0 replies; 22+ messages in thread

From: Amit Langote @ 2025-10-28 13:11 UTC (permalink / raw)
  To: Peter Geoghegan <[email protected]>; +Cc: Tomas Vondra <[email protected]>; pgsql-hackers

Hi Peter,

Thanks for chiming in here.

On Tue, Oct 28, 2025 at 2:37 AM Peter Geoghegan <[email protected]> wrote:
>
> On Mon, Sep 29, 2025 at 7:01 AM Tomas Vondra <[email protected]> wrote:
> > While looking at the patch, I couldn't help but think about the index
> > prefetching stuff that I work on. It also introduces the concept of a
> > "batch", for passing data between an index AM and the executor. It's
> > interesting how different the designs are in some respects. I'm not
> > saying one of those designs is wrong, it's more due to different goals.
>
> I've been working on a new prototype enhancement to the index
> prefetching patch. The new spinoff patch has index scans batch up
> calls to heap_hot_search_buffer for heap TIDs that the scan has yet to
> return. This optimization is effective whenever an index scan returns
> a contiguous group of TIDs that all point to the same heap page. We're
> able to lock and unlock heap page buffers at the same point that
> they're pinned and unpinned, which can dramatically decrease the
> number of heap buffer locks acquired by index scans that return
> contiguous TIDs (which is very common).
>
> I find that pgbench SELECT variants with a predicate such
> as "WHERE aid BETWEEN 1000 AND 1500" can have up to ~20% higher
> throughput, at least in cases with low client counts (think 1 or 2
> clients). These are cases where everything can fit in shared buffers,
> so we're not getting any benefit from I/O prefetching (in spite of the
> fact that this is built on top of the index prefetching patchset).

I gathered from the index prefetching thread that it is mainly about
enabling I/O prefetching, so it's nice to see that kind of speedup
even for the in-memory case.

Is this spinoff patch separate from the one that adds amgetbatch() to
IndexAmRoutine which you posted on Oct 12? If so, where can I find it?

> It makes sense to put this in scope for the index prefetching work
> because that work will already give code outside of an index AM
> visibility into which group of TIDs need to be read next. Right now
> (on master) there is some trivial sense in which index AMs use their
> own batches, but that's completely hidden from external callers.

As you might know, heapam's TableAmRoutine.scan_* functions use a
"pagemode" in some cases, which fills a batch of tuples in
HeapScanDescData.rs_vistuples. However, that batch currently only stores
the tuples’ offset numbers. I started this work based on Andres’s
suggestion to propagate that batch up into the executor’s scan nodes.
The idea is to create a HeapTuple array sized according to the
executor’s batch size, and then populate it when the scan node calls
the new TableAmRoutine.scan_batch* variant. There might be some
overlap between our respective ideas.

> > For example, the index prefetching patch establishes a "shared" batch
> > struct, and the index AM is expected to fill it with data. After that,
> > the batch is managed entirely by indexam.c, with no AM calls. The only
> > AM-specific bit in the batch is "position", but that's used only when
> > advancing to the next page, etc.
>
> The major difficulty with my heap batching prototype is getting the
> layering right (no surprises there). In some sense we're deliberately
> sharing information across what we currently think of as
> different layers of abstraction, in order to be able to "schedule" the
> work more intelligently. There's a number of competing considerations.
>
> I have invented a new concept of heap batch, that is orthogonal to the
> existing concept of index batches. Right now these are just an array
> of HeapTuple structs that relate to exactly one group of
> contiguous heap TIDs (i.e. if the index scan returns TIDs even a
> little out of order, which is fairly common, we cannot currently
> reorder the work in the current prototype patch).
>
> Once a batch is prepared, calls to heapam_index_fetch_tuple just
> return the next TID from the batch (until the next time we have to
> return a TID pointing to some distinct heap block). In the case of
> pgbench queries like the one I mentioned, we only need to call
> LockBuffer/heap_hot_search_buffer once for every 61 heap tuples
> returned (not once per heap tuple returned).
>
> Importantly, the new interface added by my new prototype spinoff patch
> is higher level than the existing
> table_index_fetch_tuple/heapam_index_fetch_tuple interface. The
> executor asks the table AM "give me the next heap TID in the current
> scan direction", rather than asking "give me this heap TID". The
> general idea is that the table AM has a direct understanding of
> ordered index scans.
>
> The advantage of this higher-level interface is that it gives the
> table AM maximum freedom to reorder work. As I said already, we won't
> do things like merge together logically noncontiguous accesses to the
> same heap page into one physical access right now. But I think that
> that should at least be enabled by this interface.

Interesting. It sounds like you aim to replace the fetch_tuple
interface with a more generic one, is that right?

> The downside of this approach is that the table AM (not the executor
> proper) is responsible for interfacing with the index AM layer. I
> think that this can be generalized without very much code duplication
> across table AMs. But it's hard.

Seems so.

> > This patch does things differently. IIUC, each TAM may produce its own
> > "batch", which is then wrapped in a generic one. For example, heap
> > produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> > is fine. In the prefetching we chose to move all this code (walking the
> > batch items) from the AMs into the layer above, and make it AM agnostic.
>
> I think that the base index prefetching patch's current notion of
> index-AM-wise batches can be kept quite separate from any table AM
> batch concept that might be invented, either as part of what I'm
> working on, or in Amit's patch.
>
> It probably wouldn't be terribly difficult to get the new interface
> I've described to return heap tuples in whatever batch format Amit
> comes up with. That only has a benefit if it makes life easier for
> expression evaluation in higher levels of the plan tree, but it might
> just make sense to always do it that way. I doubt that adopting Amit's
> batch format will make life much harder for the
> heap_hot_search_buffer-batching mechanism (at least if it is generally
> understood that its new index scan interface builds batches in
> Amit's format on a best-effort basis).

In my implementation, the new TableAmRoutine.scan_getnextbatch()
returns a batch as an opaque table AM structure, which can then be
passed up to the upper levels of the plan. Patch 0001 in my series
adds the following to the TableAmRoutine API:

+    /* ------------------------------------------------------------------------
+     * Batched scan support
+     * ------------------------------------------------------------------------
+     */
+
+    void       *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+    int         (*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+                                     ScanDirection dir);
+    void        (*scan_end_batch)(TableScanDesc sscan, void *am_batch);

I haven't seen what your version looks like, but if it is compatible
with the above, I'd be happy to adopt a batch format that accommodates
multiple use cases.
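
For readers who want to see the calling pattern, here is a toy model of how
a caller might drive these three callbacks. Everything prefixed with Toy or
toy_ is made up for this sketch, the scan direction argument is left out,
and the real TableScanDesc is not modeled. Treating the int return value as
the number of tuples stored, with 0 meaning end of scan, is an assumption of
the sketch; the hunk above does not spell that out.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy stand-ins; the real TableScanDesc and ScanDirection are not modeled. */
typedef struct ToyScan
{
    const int  *rows;       /* pretend table contents */
    int         nrows;
    int         next;       /* next row to return */
} ToyScan;

typedef struct ToyBatch
{
    int         maxitems;
    int        *items;
} ToyBatch;

/* Same shape as the three callbacks above, minus the scan direction. */
typedef struct ToyBatchAm
{
    void       *(*scan_begin_batch) (ToyScan *scan, int maxitems);
    int         (*scan_getnextbatch) (ToyScan *scan, void *am_batch);
    void        (*scan_end_batch) (ToyScan *scan, void *am_batch);
} ToyBatchAm;

void *
toy_begin_batch(ToyScan *scan, int maxitems)
{
    ToyBatch   *batch = malloc(sizeof(ToyBatch));

    (void) scan;
    assert(batch != NULL && maxitems > 0);
    batch->maxitems = maxitems;
    batch->items = malloc(sizeof(int) * (size_t) maxitems);
    assert(batch->items != NULL);
    return batch;
}

/* Fill the batch with up to maxitems rows; return how many were stored. */
int
toy_getnextbatch(ToyScan *scan, void *am_batch)
{
    ToyBatch   *batch = am_batch;
    int         n = 0;

    while (n < batch->maxitems && scan->next < scan->nrows)
        batch->items[n++] = scan->rows[scan->next++];
    return n;
}

void
toy_end_batch(ToyScan *scan, void *am_batch)
{
    ToyBatch   *batch = am_batch;

    (void) scan;
    free(batch->items);
    free(batch);
}

const ToyBatchAm toy_am = {toy_begin_batch, toy_getnextbatch, toy_end_batch};

/*
 * Caller side: drain the scan batch by batch, summing the values, and return
 * the number of batches used.  A real caller would treat the batch as opaque;
 * the toy peeks inside only to have something to compute.
 */
int
toy_count_batches(const ToyBatchAm *am, ToyScan *scan, int maxitems, long *sum)
{
    void       *batch = am->scan_begin_batch(scan, maxitems);
    int         nbatches = 0;
    int         n;

    *sum = 0;
    while ((n = am->scan_getnextbatch(scan, batch)) > 0)
    {
        for (int i = 0; i < n; i++)
            *sum += ((ToyBatch *) batch)->items[i];
        nbatches++;
    }
    am->scan_end_batch(scan, batch);
    return nbatches;
}
```

The batch is allocated once in the begin callback and reused for every
getnextbatch call, so the per-batch cost is only the fill loop.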

-- 
Thanks, Amit Langote






* Re: Batching in executor
@ 2025-10-28 13:40  Amit Langote <[email protected]>
  parent: Tomas Vondra <[email protected]>
  0 siblings, 2 replies; 22+ messages in thread

From: Amit Langote @ 2025-10-28 13:40 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers

On Tue, Oct 28, 2025 at 1:18 AM Tomas Vondra <[email protected]> wrote:
> On 10/27/25 08:24, Amit Langote wrote:
> > Thank you for reviewing the patch and taking the time to run those
> > experiments. I appreciate the detailed feedback and questions.  I also
> > apologize for my late reply, I spent perhaps way too much time going
> > over your index prefetching thread trying to understand the notion of
> > batching that it uses and getting sidelined by other things while
> > writing this reply.
>
> Cool! Now you can do a review of the index prefetch patch ;-)

Would love to and I'm adding that to my list.  :)

> >> How far ahead have you thought about these capabilities? I was wondering
> >> about two things in particular. First, at which point do we have to
> >> "materialize" the TupleBatch into some generic format (e.g. TupleSlots).
> >> I get it that you want to enable passing batches between nodes, but
> >> would those use the same "format" as the underlying scan node, or some
> >> generic one? Second, will it be possible to execute expressions on the
> >> custom batches (i.e. on "compressed data")? Or is it necessary to
> >> "materialize" the batch into regular tuple slots? I realize those may
> >> not be there "now" but maybe it'd be nice to plan for the future.
> >
> > I have been thinking about those future capabilities. Currently, the
> > patch keeps tuples in the TAM-specific batch format up until they need
> > to be consumed by a node that doesn’t understand that format or has
> > not been modified to invoke the TAM callbacks to decode it.  In the
> > current patch, that means we materialize to regular TupleTableSlots at
> > nodes that require it (for example, the scan node reading from TAM
> > needing to evaluate quals, etc.). However, the intention is to allow
> > batches to be passed through as many nodes as possible without
> > materialization, ideally using the same format produced by the scan
> > node all the way up until reaching a node that can only work with
> > tuples in TupleTableSlots.
> >
> > As for executing expressions directly on the custom batch data: that’s
> > something I would like to enable in the future. Right now, expressions
> > (quals, projections, etc.) are evaluated after materializing into
> > normal tuples in TupleTableSlots stored in TupleBatch, because the
> > expression evaluation code isn’t yet totally batch-aware and is very far
> > from doing things like operating on compressed data in its native form.
> > Patches 0004-0008 do try to add batch-aware expression evaluation but
> > that's just a prototype.  In the long term, the goal is to allow
> > expression evaluation on batch data (for example, applying a WHERE
> > clause or aggregate transition directly on a columnar batch without
> > converting it to heap tuples first). This will require significant new
> > infrastructure (perhaps specialized batch-aware expression operators
> > and functions), so it's not in the current patch, but I agree it's
> > important to plan for it. The current design doesn’t preclude it, it
> > lays some groundwork by introducing the batch abstraction -- but fully
> > supporting that will be future work.
> >
> > That said, one area I’d like to mention while at it, especially to
> > enable native execution on compressed or columnar batches, is giving
> > the table AM more control over how expression evaluation is performed
> > on its batch data. In the current patch, the AM can provide a
> > materialize function via TupleBatchOps, but that always produces an
> > array of TupleTableSlots stored in the TupleBatch, not an opaque
> > representation that remains under AM control. Maybe that's not bad for
> > a v1 patch.
>
> I think materializing into a batch of TupleTableSlots (and then doing
> the regular expression evaluation) seems perfectly fine for v1. It's the
> simplest fallback possible, and we'll need it anyway if overriding the
> expression evaluation will be optional (which I assume it will be?).

Yes.  The ability to materialize into TupleTableSlots won't be
optional for the table AM's BatchOps.  Converting to other formats
would be.

> > When evaluating expressions over a batch, a BatchVector
> > is built by looping over these slots and invoking the standard
> > per-tuple getsomeattrs() to "deform" a tuple into needed columns.
> > While that enables batch-style EEOPs for qual evaluation and aggregate
> > transition (and is already a gain over per-row evaluation), it misses
> > the opportunity to leverage any batch-specific optimizations the AM
> > could offer, such as vectorized decoding or filtering over compressed
> > data, and other AM optimizations for getting only the necessary
> > columns out possibly in a vector format.
> >
>
> I'm not sure about this BatchVector thing. I haven't looked into that
> very much, I'd expect the construction to be more expensive than the
> benefits (compared to just doing the materialize + regular evaluation),
> but maybe I'm completely wrong. Or maybe we could keep the vector
> representation for multiple operations? No idea.

Constructing the BatchVector does require looping over the batch and
deforming each tuple, typically via getsomeattrs(). So yes, there’s an
up-front cost similar to materialization. But the goal is to amortize
that by enabling expression evaluation to run in a tight loop over
column vectors, avoiding repeated jumps into slot/AM code for each
tuple and each column. That can reduce branching and improve locality.
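
As a toy illustration of that trade-off (not the patch's BatchVector, whose
layout is not shown here), the sketch below first copies a batch of rows
into per-column arrays and then evaluates a qual in a single loop over those
arrays. ToyRow, ToyVector and both functions are hypothetical names, and
the two-integer-column layout is an assumption made to keep it short.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH_MAX 64

/* Row-oriented toy tuple with two integer columns. */
typedef struct ToyRow
{
    int         a;
    int         b;
} ToyRow;

/* Column-oriented toy vector: one array per column. */
typedef struct ToyVector
{
    size_t      nrows;
    int         a[TOY_BATCH_MAX];
    int         b[TOY_BATCH_MAX];
} ToyVector;

/* Up-front cost: "deform" every row of the batch into the column arrays. */
void
toy_vector_build(ToyVector *vec, const ToyRow *rows, size_t nrows)
{
    assert(nrows <= TOY_BATCH_MAX);
    vec->nrows = nrows;
    for (size_t i = 0; i < nrows; i++)
    {
        vec->a[i] = rows[i].a;
        vec->b[i] = rows[i].b;
    }
}

/*
 * Evaluate the qual "a < b" over the whole vector in one tight loop and
 * return the number of rows that pass.
 */
size_t
toy_qual_a_lt_b(const ToyVector *vec)
{
    size_t      npass = 0;

    for (size_t i = 0; i < vec->nrows; i++)
        npass += (size_t) (vec->a[i] < vec->b[i]);
    return npass;
}
```

The build step touches every row once, and the qual loop then runs without
any per-tuple function calls, which is the part that is supposed to pay for
the construction.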

In its current form, the BatchVector is ephemeral -- it's built just
before expression evaluation and discarded after. But your idea of
reusing the same vector across multiple operations is interesting.
That would let us spread out the construction cost even further and
might be necessary to justify the overhead fully in some cases. I’ll
keep that in mind.

> But it seems like a great area for experimenting ...

Yep.

> > I’m considering extending TupleTableSlotOps with a batch-aware variant
> > of getsomeattrs(), something like slot_getsomeattrs_batch(), so that
> > AMs can populate column vectors (e.g., BatchVector) directly from
> > their native format. That would allow bypassing slot materialization
> > entirely and plug AM-provided decoding logic directly into the
> > executor’s batch expression paths. This isn’t implemented yet, but I
> > see it as a necessary step toward supporting fully native expression
> > evaluation over compressed or columnar formats. I’m not yet sure if
> > TupleTableSlotOps is the right place for such a hook, it might belong
> > elsewhere in the abstraction, but exposing a batch-aware interface for
> > this purpose seems like the right direction.
> >
>
> No opinion. I don't see it as a necessary prerequisite for the other
> parts of the patch series, but maybe the BatchVector really helps, and
> then this would make perfect sense. I'm not sure there's a single
> "correct" sequence in which to do these improvements, it's always a
> matter of opinion.

Yes, I think we can come back to this later.

> >> The other option would be to "create batches" during execution, say by
> >> having a new node that accumulates tuples, builds a batch and sends it
> >> to the node above. This would help both in cases when either the lower
> >> node does not produce batches at all, or the batches are too small (due
> >> to filtering, aggregation, ...). Of course, it'd only win if this
> >> increases efficiency of the upper part of the plan enough to pay for
> >> building the batches. That can be a hard decision.
> >
> > Yes, introducing a dedicated executor node to accumulate and form
> > batches on the fly is an interesting idea, I have thought about it and
> > even mentioned it in passing in the pgconf.dev unconference. This
> > could indeed cover scenarios where the data source (a node) doesn't
> > produce batches (e.g., a non-batching node feeding into a
> > batching-aware upper node) or where batches coming from below are too
> > small to be efficient. The current patch set doesn’t implement such a
> > node; I focused on enabling batching at the scan/TAM level first. The
> > cost/benefit decision for a batch-aggregator node is tricky, as you
> > said. We’d need a way to decide when the overhead of gathering tuples
> > into a batch is outweighed by the benefits to the upper node. This
> > likely ties into costing or adaptive execution decisions. It's
> > something I’m open to exploring in a future iteration, perhaps once we
> > have more feedback on how the existing batching performs in various
> > scenarios. It might also require some planner or executor smarts
> > (maybe the executor can decide to batch on the fly if it sees a
> > pattern of use, or the planner could insert such nodes when
> > beneficial).
> >
>
> Yeah, those are good questions. I don't have a clear idea how we should
> decide when to do this batching. Costing during planning is the
> "traditional" option, with all the issues (e.g. it requires a reasonably
> good cost model). Another option would be some sort of execution-time
> heuristics - but then which node would be responsible for building the
> batches (if we didn't create them during planning)?
>
> I agree it makes sense to focus on batching at the TAM/scan level for
> now. That's a pretty big project already.

Right -- batching at the TAM/scan level is already a sizable project,
especially given its interaction with prefetching work (maybe). I
think it's best to focus design effort there and on batched expression
evaluation first, and only revisit batch-creation nodes once that
groundwork is in place.
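
For reference, the batch-creation node idea could look roughly like the sketch below. This is a standalone illustration only, not code from the patch set: Tuple, TupleSource, BatchAccum, and accum_next_batch are hypothetical names, and a real node would have to deal with slots, memory contexts, and rescans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ACCUM_MAX_ROWS 64		/* hypothetical cap, mirrors the current limit */

/* Hypothetical stand-in for a tuple; a real node would carry slots. */
typedef struct Tuple
{
	int			value;
} Tuple;

/* Hypothetical tuple-at-a-time child node. */
typedef struct TupleSource
{
	const Tuple *rows;
	size_t		nrows;
	size_t		next;
} TupleSource;

/* Fetch one tuple from the child; returns false when it is exhausted. */
static bool
source_next(TupleSource *src, Tuple *out)
{
	if (src->next >= src->nrows)
		return false;
	*out = src->rows[src->next++];
	return true;
}

/* Accumulator state: the batch handed to a batch-aware parent. */
typedef struct BatchAccum
{
	Tuple		rows[ACCUM_MAX_ROWS];
	int			nrows;
} BatchAccum;

/*
 * Pull tuples from the child until the batch is full or the child is
 * exhausted.  Returns the number of tuples in the batch; 0 means end of
 * input.
 */
static int
accum_next_batch(BatchAccum *acc, TupleSource *child, int maxrows)
{
	assert(maxrows > 0 && maxrows <= ACCUM_MAX_ROWS);

	acc->nrows = 0;
	while (acc->nrows < maxrows &&
		   source_next(child, &acc->rows[acc->nrows]))
		acc->nrows++;
	return acc->nrows;
}
```

The copy into acc->rows is the overhead that the upper node's savings would have to pay for, which is the costing question raised above.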

> >> In fact, how shall the optimizer decide whether to use batching? It's
> >> one thing to decide whether a node can produce/consume batches, but
> >> another thing is "should it"? With a node that "builds" a batch, this
> >> decision would apply to even more plans, I guess.
> >>
> >> I don't have a great answer to this, it seems like an incredibly tricky
> >> costing issue. I'm a bit worried we might end up with something too
> >> coarse, like "jit=on" which we know is causing problems (admittedly,
> >> mostly due to a lot of the LLVM work being unpredictable/external). But
> >> having some "adaptive" heuristics (like the gradual ramp up) might make
> >> it less risky.
> >
> > I agree that deciding when to use batching is tricky. So far, the
> > patch takes a fairly simplistic approach: if a node (particularly a
> > scan node) supports batching, it just does it, and other parts of the
> > plan will consume batches if they are capable. There isn’t yet a
> > nuanced cost-based decision in the planner for enabling batching. This
> > is indeed something we’ll have to refine. We don’t want to end up with
> > a blunt on/off GUC that could cause regressions in some cases.
> >
> > One idea is to introduce costing for batching: for example, estimate
> > the per-tuple savings from batching vs the overhead of materialization
> > or batch setup. However, developing a reliable cost model for that
> > will take time and experimentation, especially with the possibility of
> > variable batch sizes or adaptive behavior. Not to mention, that will
> > be adding one more dimension to planner's costing model making the
> > planning more expensive and unpredictable.  In the near term, I’m fine
> > with relying on feedback and perhaps manual tuning (GUCs, etc.) to
> > decide on batching, but that’s perhaps not a long-term solution.
> >
>
> Yeah, the cost model is going to be hard, because this depends on so
> much low-level plan/hardware details. Like, the TAM may allow execution
> on compressed data / leverage vectorization, .... But maybe the CPU does
> not do that efficiently? There's so many unknown unknowns ...
>
> Considering we still haven't fixed the JIT cost model, maybe it's better
> to not rely on it too much for this batching patch? Also, all those
> details contradict the idea that cost models are a simplified model of
> the reality.

Yeah, totally agreed -- the complexity and unpredictability here are
real, and your point about JIT costing is a good reminder not to
over-index on planner models for now.

> > I share your inclination that adaptive heuristics might be the safer
> > path initially. Perhaps the executor can decide to batch or not batch
> > based on runtime conditions. The gradual ramp-up of batch size is one
> > such adaptive approach. We could also consider things like monitoring
> > how effective batching is (are we actually processing full batches or
> > frequently getting cut off?) and adjust behavior. These are somewhat
> > speculative ideas at the moment, but the bottom line is I’m aware we
> > need a smarter strategy than a simple switch. This will likely evolve
> > as we test the patch in more scenarios.
> >
>
> I think the big question is how much can the batching change the
> relative cost of two plans (I mean, actual cost, not just estimates).
>
> Imagine plans P1 and P2, where
>
>    cost(P1) < cost(P2) = cost(P1) + delta
>
> where "delta" is small (so P1 is faster, but not much).  If we
> "batchify" the plans into P1' and P2', can this happen?
>
>   cost(P1') >> cost(P2')
>
> That is, can the "slower" plan P2 benefit much more from the batching,
> making it significantly faster?
>
> If this is unlikely, we could entirely ignore batching during planning,
> and only do that as post-processing on the selected plan, or perhaps
> even just during execution.
>
> OTOH that's what JIT does, and we know it's not perfect - but that's
> mostly because JIT has rather unpredictable costs when enabling. Maybe
> batching doesn't have that.

That’s an interesting scenario. I suspect batching (even with SIMD)
won’t usually flip plan orderings that dramatically -- i.e., turning
the clearly slower plan into the faster one -- though I could be
wrong. But I agree with the conclusion: this supports treating
batching as an executor concern, at least initially. Might be worth
seeing if there’s any relevant guidance in systems literature too.
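
To make the P1/P2 question concrete, here is a toy model in C. It is purely an assumption for illustration (the patch has no such cost model): each plan has some fraction of its cost that benefits from batching, and that part shrinks by a fixed saving. batched_cost and batching_flips_order are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model: a fraction 'batchable' of a plan's cost benefits from
 * batching, and that part shrinks by 'saving' (both in [0, 1]).
 */
static double
batched_cost(double cost, double batchable, double saving)
{
	assert(batchable >= 0.0 && batchable <= 1.0);
	assert(saving >= 0.0 && saving <= 1.0);
	return cost * (1.0 - batchable * saving);
}

/* Does batching make P2 cheaper than P1 even though P1 was cheaper before? */
static bool
batching_flips_order(double cost_p1, double batchable_p1,
					 double cost_p2, double batchable_p2,
					 double saving)
{
	assert(cost_p1 < cost_p2);
	return batched_cost(cost_p2, batchable_p2, saving) <
		batched_cost(cost_p1, batchable_p1, saving);
}
```

In this model, with cost(P1) = 100, cost(P2) = 105, and a 25% saving, the order is preserved when both plans are 80% batchable (80 vs 84), but it flips if P1 is only 20% batchable while P2 is 80% batchable (95 vs 84). So a flip needs the two plans to differ a lot in how much of their work batching can touch, at least under these simplifying assumptions.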

> >> FWIW the current batch size limit (64 tuples) seems rather low, but it's
> >> hard to say. It'd be good to be able to experiment with different
> >> values, so I suggest we make this a GUC and not a hard-coded constant.
> >
> > Yeah, I was thinking the same while testing -- the optimal batch size
> > might vary by workload or hardware, and 64 was a somewhat arbitrary
> > starting point. I will make the batch size limit configurable
> > (probably as a GUC executor_batch_tuples, maybe only developer-focused
> > at first). That will let us and others experiment easily with
> > different batch sizes to see how it affects performance. It should
> > also help with your earlier point: for example, on a machine where 64
> > is too low or too high, we can adjust it without recompiling. So yes,
> > I'll add a GUC for the batch size in the next version of the patch.
> >
>
> +1 to having a developer-only GUC for testing. But the goal should be to not
> expect users to tune this.

Yes.
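
One way to keep users from having to tune this is the gradual ramp-up mentioned earlier, with the GUC acting only as an upper limit. A minimal sketch, assuming a doubling policy; BatchRamp, ramp_init, and ramp_update are hypothetical names and nothing like this exists in the patch yet:

```c
#include <assert.h>

/*
 * Hypothetical ramp-up policy: start with a one-row batch and double the
 * target after every full batch, never exceeding 'cap'.
 */
typedef struct BatchRamp
{
	int			target;			/* rows to request for the next batch */
	int			cap;			/* upper limit, e.g. the GUC setting */
} BatchRamp;

static void
ramp_init(BatchRamp *r, int cap)
{
	assert(cap > 0);
	r->target = 1;
	r->cap = cap;
}

/* Report how many rows the last batch held; returns the next target. */
static int
ramp_update(BatchRamp *r, int rows_returned)
{
	assert(rows_returned >= 0 && rows_returned <= r->target);

	/* Grow only after a full batch; a short batch leaves the target alone. */
	if (rows_returned == r->target && r->target < r->cap)
	{
		/* Compare against cap / 2 first so the doubling cannot overflow. */
		if (r->target > r->cap / 2)
			r->target = r->cap;
		else
			r->target *= 2;
	}
	return r->target;
}
```

With a cap of 64, the target goes 1, 2, 4, ... 64 as long as batches come back full, so a query that stops early (LIMIT 1, say) never pays for a large batch.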

> >> As for what to add to explain, I'd start by adding info about which
> >> nodes are "batched" (consuming/producing batches), and some info about
> >> the batch sizes. An average size, maybe a histogram if you want to be a
> >> bit fancy.
> >
> > Adding more information to EXPLAIN is a good idea. In the current
> > patch, EXPLAIN does not show anything about batching, but it would be
> > very helpful for debugging and user transparency to indicate which
> > nodes are operating in batch mode.  I will update EXPLAIN to mark
> > nodes that produce or consume batches. Likely I’ll start with
> > something simple like an extra line or tag for a node, e.g., "Batch:
> > true (avg batch size 64)" or something along those lines. An average
> > batch size could be computed if we have instrumentation, which would
> > be useful to see if, say, the batch sizes ended up smaller due to
> > LIMIT or other factors. A full histogram might be more detail than
> > most users need, but I agree even just knowing average or maximum
> > batch size per node could be useful for performance analysis. I'll
> > implement at least the basics for now, and we can refine it (maybe add
> > more stats) if needed.
>
> +1 to start with something simple
>
> >
> > (I had added a flag in the EXPLAIN output at one point, but removed it
> > due to finding the regression output churn too noisy, though I
> > understand I'll have to bite the bullet at some point.)
> >
>
> Why would there be regression churn, if the option is disabled by default?

executor_batching is on by default in my patch, so a seq scan will
always use batching provided the query features preventing it are not
present, which is true for a huge number of plans appearing in
regression suite output.

> >> Now, numbers from some microbenchmarks:
> >>
> >> ...
> >> Perhaps I did something wrong. It does not surprise me this is somewhat
> >> CPU dependent. It's a bit sad the improvements are smaller for the newer
> >> CPU, though.
> >
> > Thanks for sharing your benchmark results -- that’s very useful data.
> > I haven’t yet finished investigating why there's a regression relative
> > to master when executor_batching is turned off. I re-ran my benchmarks
> > to include comparisons with master and did observe some regressions in
> > a few cases too, but I didn't see anything obvious in profiles that
> > explained the slowdown. I initially assumed it might be noise, but now
> > I suspect it could be related to structural changes in the scan code
> > -- for example, I added a few new fields in the middle of
> > HeapScanDescData, and even though the batching logic is bypassed when
> > executor_batching is off, it’s possible that change alone affects
> > memory layout or cache behavior in a way that penalizes the unbatched
> > path. I haven’t confirmed that yet, but it’s on my list to look into
> > more closely.
> >
> > Your observation that newer CPUs like the Ryzen may see smaller
> > improvements makes sense -- perhaps they handle the per-tuple overhead
> > more efficiently to begin with. Still, I’d prefer not to see
> > regressions at all, even in the unbatched case, so I’ll focus on
> > understanding and fixing that part before drawing conclusions from the
> > performance data.
> >
> > Thanks again for the scripts -- those will help a lot in narrowing things down.
>
> If needed, I can rerun the tests and collect additional information
> (e.g. maybe perf-stat or perf-diff would be interesting).

That would be nice to see if you have the time, but maybe after I post
a new version.

-- 
Thanks, Amit Langote






* Re: Batching in executor
@ 2025-10-28 14:32  Daniil Davydov <[email protected]>
  parent: Amit Langote <[email protected]>
  1 sibling, 1 reply; 22+ messages in thread

From: Daniil Davydov @ 2025-10-28 14:32 UTC (permalink / raw)
  To: Amit Langote <[email protected]>; +Cc: Tomas Vondra <[email protected]>; pgsql-hackers

Hi,

As far as I understand, this work partially overlaps with what we did in the
thread [1] (in short - we introduce support for batching within the ModifyTable
node). Am I correct?

It's worth saying that the patch in that thread is already quite old -
now I have an improved implementation and tests for it (as well as
performance measurements). But the basic idea and design remained unchanged.

Maybe we can combine approaches? I haven't reviewed patches in this thread
yet, but I'll try to do it in the near future.

[1] https://www.postgresql.org/message-id/flat/CALj2ACVi9eTRYR%3Dgdca5wxtj3Kk_9q9qVccxsS1hngTGOCjPwQ%40m...

--
Best regards,
Daniil Davydov






* Re: Batching in executor
@ 2025-10-29 02:22  Amit Langote <[email protected]>
  parent: Daniil Davydov <[email protected]>
  0 siblings, 1 reply; 22+ messages in thread

From: Amit Langote @ 2025-10-29 02:22 UTC (permalink / raw)
  To: Daniil Davydov <[email protected]>; +Cc: Tomas Vondra <[email protected]>; pgsql-hackers

Hi Daniil,

On Tue, Oct 28, 2025 at 11:32 PM Daniil Davydov <[email protected]> wrote:
>
> Hi,
>
> As far as I understand, this work partially overlaps with what we did in the
> thread [1] (in short - we introduce support for batching within the ModifyTable
> node). Am I correct?

There might be some relation, but not much overlap. The thread you
mention seems to focus on batching in the write path (for INSERT,
etc.), while this work targets batching in the read path via Table AM
scan callbacks. I think they can be developed independently, though
I'm happy to take a look.

-- 
Thanks, Amit Langote






* Re: Batching in executor
@ 2025-10-29 06:37  Amit Langote <[email protected]>
  parent: Amit Langote <[email protected]>
  1 sibling, 1 reply; 22+ messages in thread

From: Amit Langote @ 2025-10-29 06:37 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers

On Tue, Oct 28, 2025 at 10:40 PM Amit Langote <[email protected]> wrote:
> That would be nice to see if you have the time, but maybe after I post
> a new version.

I’ve created a CF entry marked WoA for this in the next CF under the
title “Batching in executor, part 1: add batch variant of table AM
scan API.” The idea is to track this piece separately so that later
parts can have their own entries and we don’t end up with a single
long-lived entry that never gets marked done. :-)

-- 
Thanks, Amit Langote






* Re: Batching in executor
@ 2025-10-30 12:12  Daniil Davydov <[email protected]>
  parent: Amit Langote <[email protected]>
  0 siblings, 1 reply; 22+ messages in thread

From: Daniil Davydov @ 2025-10-30 12:12 UTC (permalink / raw)
  To: Amit Langote <[email protected]>; +Cc: Tomas Vondra <[email protected]>; pgsql-hackers

Hi,

On Wed, Oct 29, 2025 at 9:23 AM Amit Langote <[email protected]> wrote:
>
> Hi Daniil,
>
> On Tue, Oct 28, 2025 at 11:32 PM Daniil Davydov <[email protected]> wrote:
> >
> > Hi,
> >
> > As far as I understand, this work partially overlaps with what we did in the
> > thread [1] (in short - we introduce support for batching within the ModifyTable
> > node). Am I correct?
>
> There might be some relation, but not much overlap. The thread you
> mention seems to focus on batching in the write path (for INSERT,
> etc.), while this work targets batching in the read path via Table AM
> scan callbacks. I think they can be developed independently, though
> I'm happy to take a look.

Oh, I got it. Thanks!

I looked at the 0001-0003 patches and have some comments:
1)
I noticed that some Nodes may set SO_ALLOW_PAGEMODE flag to 'false'
during ExecReScan. heap_getnextslot works carefully with it - checks whether
pagemode is allowed at every call. If not - it just uses tuple-at-a-time mode.
At the same time, heap_getnextbatch always expects that pagemode is enabled.
I didn't find any code paths which can lead to a failure of the assertion [1].
If such a code path is unreachable under any circumstances, maybe we should
add a comment explaining why?

2)
heapgettup_pagemode_batch : Do we really need to compute lineindex variable
in this way? :
***
            lineindex = scan->rs_cindex + dir;
            if (ScanDirectionIsForward(dir))
                linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
                    (scan->rs_ntuples - lineindex) : 0;
***

As far as I understand, this is enough :
***
        lineindex = scan->rs_cindex + dir;
        if (ScanDirectionIsForward(dir))
            linesleft = scan->rs_ntuples - lineindex;
***

3)
Is this code inside heapgettup_pagemode_batch necessary? :
***
ScanDirectionIsForward(dir) ? 0 : 0
***

4)
heapgettup_pagemode has this change :
HeapTuple    tuple = &(scan->rs_ctup) ---> HeapTuple tuple = &scan->rs_ctup
I guess it was changed accidentally.

5)
I apologize for the tediousness, but these braces are not in the postgres style:
***
static const TupleBatchOps TupleBatchHeapOps = {
    .materialize_all = heap_materialize_batch_all
};
***

[1] heap_getnextbatch : Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE)
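
To illustrate comment 2, here is a small standalone comparison of the two ways to compute linesleft. It is only a sketch: the helper names are hypothetical, and plain uint32_t values stand in for the scan descriptor fields. The two forms agree whenever lineindex <= ntuples and differ only if lineindex could ever be larger, in which case the unsigned subtraction wraps around.

```c
#include <assert.h>
#include <stdint.h>

/* Guarded form, as in the patch: clamp to 0 if lineindex is past the end. */
static uint32_t
linesleft_guarded(uint32_t lineindex, uint32_t ntuples)
{
	return (lineindex <= ntuples) ? (ntuples - lineindex) : 0;
}

/* Unguarded form, as suggested in comment 2. */
static uint32_t
linesleft_plain(uint32_t lineindex, uint32_t ntuples)
{
	/* Unsigned arithmetic: wraps around if lineindex > ntuples. */
	return ntuples - lineindex;
}
```

So the simplification is safe exactly when rs_cindex + dir can never exceed rs_ntuples in a forward scan; whether that always holds in heapgettup_pagemode_batch is the question.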

--
Best regards,
Daniil Davydov






* Re: Batching in executor
@ 2025-12-04 15:54  Amit Langote <[email protected]>
  parent: Amit Langote <[email protected]>
  0 siblings, 1 reply; 22+ messages in thread

From: Amit Langote @ 2025-12-04 15:54 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers

On Wed, Oct 29, 2025 at 3:37 PM Amit Langote <[email protected]> wrote:
> On Tue, Oct 28, 2025 at 10:40 PM Amit Langote <[email protected]> wrote:
> > That would be nice to see if you have the time, but maybe after I post
> > a new version.
>
> I’ve created a CF entry marked WoA for this in the next CF under the
> title “Batching in executor, part 1: add batch variant of table AM
> scan API.” The idea is to track this piece separately so that later
> parts can have their own entries and we don’t end up with a single
> long-lived entry that never gets marked done. :-)

I intend to continue working on this, so have just moved it into the
next fest.  I will post a new patch version next week that addresses
Daniil's comments and implements a few other things I said I would
in my reply to Tomas on Oct 28; sorry for the delay.

-- 
Thanks, Amit Langote






* Re: Batching in executor
@ 2025-12-20 14:12  Amit Langote <[email protected]>
  parent: Amit Langote <[email protected]>
  0 siblings, 1 reply; 22+ messages in thread

From: Amit Langote @ 2025-12-20 14:12 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: pgsql-hackers

On Fri, Dec 5, 2025 at 12:54 AM Amit Langote <[email protected]> wrote:
> On Wed, Oct 29, 2025 at 3:37 PM Amit Langote <[email protected]> wrote:
> > On Tue, Oct 28, 2025 at 10:40 PM Amit Langote <[email protected]> wrote:
> > > That would be nice to see if you have the time, but maybe after I post
> > > a new version.
> >
> > I’ve created a CF entry marked WoA for this in the next CF under the
> > title “Batching in executor, part 1: add batch variant of table AM
> > scan API.” The idea is to track this piece separately so that later
> > parts can have their own entries and we don’t end up with a single
> > long-lived entry that never gets marked done. :-)
>
> I intend to continue working on this, so have just moved it into the
> next fest.  I will post a new patch version next week that addresses
> Daniil's comments and implements a few other things I said I would
> in my reply to Tomas on Oct 28; sorry for the delay.

Before I go on vacation for a couple of weeks, here's an updated patch
set.  I am only including the patches that add the TAM interface, add the
TupleBatch executor wrapper for TAM batches, and use it in SeqScan, as I had
posted before.  There is a new patch to add a BATCHES option to EXPLAIN.  I
renamed the testing GUC to executor_batch_rows (integer) from the boolean
executor_batching.  EXPLAIN (BATCHES) example:

+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+                         explain_filter
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+   Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
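
For readers who want to see how numbers like these can be gathered, here is a minimal standalone sketch of per-node batch instrumentation. It is not the code from the EXPLAIN patch: BatchStats and its helpers are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical per-node counters behind a "Batches: ..." EXPLAIN line. */
typedef struct BatchStats
{
	long		nbatches;		/* number of batches produced */
	long		total_rows;		/* rows summed over all batches */
	int			max_rows;		/* largest batch seen */
	int			min_rows;		/* smallest batch seen; valid if nbatches > 0 */
} BatchStats;

static void
batch_stats_init(BatchStats *s)
{
	s->nbatches = 0;
	s->total_rows = 0;
	s->max_rows = 0;
	s->min_rows = INT_MAX;
}

/* Record one non-empty batch of 'nrows' rows. */
static void
batch_stats_record(BatchStats *s, int nrows)
{
	assert(nrows > 0);
	s->nbatches++;
	s->total_rows += nrows;
	if (nrows > s->max_rows)
		s->max_rows = nrows;
	if (nrows < s->min_rows)
		s->min_rows = nrows;
}

/* Average rows per batch, or 0 if no batch was recorded. */
static double
batch_stats_avg(const BatchStats *s)
{
	return (s->nbatches > 0) ? (double) s->total_rows / s->nbatches : 0.0;
}
```

Since heap batches never cross a page boundary, an average well below the limit would mostly reflect how many visible tuples fit on a page rather than the GUC setting.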

What I have not included in this set are the patches that add
ExecProcNodeBatch(), so that a TupleBatch can be passed from one plan node to
its parent, and the ExprEvalOps (EEOPs) for batched expression evaluation
(qual and aggregate transition).  I would like to focus on the patches that
allow reading batches from TAM into Scan nodes (only SeqScan for now).

After I'm back from vacation, I will post patches for batched qual
evaluation in SeqScan filter quals (once they are debugged and polished).
Batching in Agg node can wait for now.

In the meantime, what I would like to have someone's thoughts on:

* the shape of the TAM APIs -- should I add a TAMBatch or something that is
created, populated, and destroyed by the TAM instead of the current void
pointer and TupleBatchOps that are initialized in the executor like this
(excerpt from 0002):

+    /* Lazily create the AM batch payload. */
+    if (node->ss.ps.ps_Batch->am_payload == NULL)
+    {
+        const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+        Assert(tam && tam->scan_begin_batch);
+        node->ss.ps.ps_Batch->am_payload =
+            table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+        node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+    }

* the shape of TupleBatch itself -- its contents and operations defined in
execBatch.c/h.

* any other thoughts you might have on the project, patches.
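
To make the first question more concrete, here is a rough standalone sketch of the alternative shape, where the AM creates, fills, and destroys the batch object itself and the executor sees only a common header plus callbacks. Every name here (TAMBatch, TAMBatchOps, ToyBatch, toy_*) is hypothetical and none of it is in the posted patches; malloc stands in for palloc so that it compiles on its own.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_TOTAL_ROWS 100		/* rows the toy AM pretends to hold */

typedef struct TAMBatch TAMBatch;

/* Hypothetical per-AM callbacks; the AM owns the batch's whole lifecycle. */
typedef struct TAMBatchOps
{
	TAMBatch   *(*create) (int maxrows);
	int			(*fill) (TAMBatch *batch);	/* rows produced, 0 at end */
	void		(*destroy) (TAMBatch *batch);
} TAMBatchOps;

/* Common header visible to the executor; AM-private fields follow it. */
struct TAMBatch
{
	const TAMBatchOps *ops;
	int			maxrows;
	int			nrows;
};

/* Toy AM whose "table" is the integers 0 .. TOY_TOTAL_ROWS - 1. */
typedef struct ToyBatch
{
	TAMBatch	base;			/* must be first so the casts below work */
	int		   *values;
	int			next;
} ToyBatch;

static TAMBatch *toy_create(int maxrows);
static int	toy_fill(TAMBatch *batch);
static void toy_destroy(TAMBatch *batch);

static const TAMBatchOps toy_ops = {toy_create, toy_fill, toy_destroy};

static TAMBatch *
toy_create(int maxrows)
{
	ToyBatch   *tb;

	assert(maxrows > 0);
	tb = malloc(sizeof(ToyBatch));
	if (tb == NULL)
		abort();
	tb->values = malloc(sizeof(int) * (size_t) maxrows);
	if (tb->values == NULL)
		abort();
	tb->base.ops = &toy_ops;
	tb->base.maxrows = maxrows;
	tb->base.nrows = 0;
	tb->next = 0;
	return &tb->base;
}

static int
toy_fill(TAMBatch *batch)
{
	ToyBatch   *tb = (ToyBatch *) batch;

	batch->nrows = 0;
	while (batch->nrows < batch->maxrows && tb->next < TOY_TOTAL_ROWS)
		tb->values[batch->nrows++] = tb->next++;
	return batch->nrows;
}

static void
toy_destroy(TAMBatch *batch)
{
	ToyBatch   *tb = (ToyBatch *) batch;

	free(tb->values);
	free(tb);
}
```

Compared with the void pointer plus executor-initialized TupleBatchOps, this would keep the callbacks attached to the object the AM hands back, at the cost of one more struct in the TAM API.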

Benchmark:

Scripts attached if you want to try them.

(Negative % = faster than master)

SELECT * FROM table LIMIT 1 OFFSET N:
Rows      Master    batch=0   vs master   batch=64   vs master
--------------------------------------------------------------
1M          11ms       11ms        -0%        8ms       -23%
2M          23ms       22ms        -1%       18ms       -23%
3M          36ms       34ms        -5%       27ms       -25%
4M          51ms       50ms        -2%       38ms       -26%
5M          64ms       64ms        -1%       48ms       -26%
10M        147ms      145ms        -1%      114ms       -22%

SELECT * FROM table WHERE a > 0 LIMIT 1 OFFSET N:
Rows      Master    batch=0   vs master   batch=64   vs master
--------------------------------------------------------------
1M          31ms       31ms        +0%       16ms       -48%
2M          64ms       64ms        -0%       34ms       -47%
3M          67ms       66ms        -1%       50ms       -25%
4M          91ms       90ms        -1%       71ms       -22%
5M         119ms      113ms        -5%       88ms       -26%
10M        262ms      261ms        -0%      205ms       -21%

SELECT * FROM table WHERE o > 0 LIMIT 1 OFFSET N (last column - deform-heavy):
Rows      Master    batch=0   vs master   batch=64   vs master
--------------------------------------------------------------
1M          38ms       37ms        -2%       38ms        +0%
2M          79ms       75ms        -6%       77ms        -4%
3M         182ms      186ms        +2%      160ms       -12%
4M         250ms      252ms        +1%      219ms       -12%
5M         314ms      316ms        +1%      273ms       -13%
10M        647ms      651ms        +1%      604ms        -7%

The smaller improvement with WHERE o > 0 is expected since accessing the
last column requires deforming most of the tuple, which dominates the
execution time. Future work on batched tuple deformation could help here.

Note on regressions with executor_batch_rows = 0 vs master:

I am not seeing the regressions with batch_rows=0 vs master as I did
before.  I think some of it might have to do with my removing some stray
fields from HeapScanDescData that were accidentally left there in the earlier
patches.  Also, the regressions I was observing earlier seemed to have more
to do with using gcc to compile the master tree and clang to compile the
patched tree, which resulted in code layout changes that seemed to cause the
patched binary to regress.  It would be nice if these numbers could be
verified by others.

-- 
Thanks, Amit Langote


Attachments:

  [application/octet-stream] v4-0001-Add-batch-table-AM-API-and-heapam-implementation.patch (13.4K, 3-v4-0001-Add-batch-table-AM-API-and-heapam-implementation.patch)
  download | inline diff:
From 24a3d208db93312788745882a01b526957919966 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 17:21:56 +0900
Subject: [PATCH v4 1/3] Add batch table AM API and heapam implementation

Introduce new table AM callbacks to fetch multiple tuples per call.
This reduces per-tuple call overhead by letting executor nodes work
in batches.

Define a HeapBatch structure and supporting code in tableam.h.
Batches are limited to tuples from a single page and at most
EXEC_BATCH_ROWS (currently 64) entries.

Provide initial heapam support with heapgettup_pagemode_batch().
No executor node is switched over yet; a later commit will adapt
SeqScan to use this API. Other nodes may adopt it in the future.

Also add pgstat_count_heap_getnext_batch() to record batched fetches
in pgstat.

Reviewed-by: Daniil Davydov <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
 src/backend/access/heap/heapam.c         | 219 ++++++++++++++++++++++-
 src/backend/access/heap/heapam_handler.c |   4 +
 src/include/access/heapam.h              |  18 ++
 src/include/access/tableam.h             |  58 ++++++
 src/include/pgstat.h                     |   5 +
 5 files changed, 303 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 6daf4a87dec..fcc0813f139 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1023,7 +1023,7 @@ heapgettup_pagemode(HeapScanDesc scan,
 					int nkeys,
 					ScanKey key)
 {
-	HeapTuple	tuple = &(scan->rs_ctup);
+	HeapTuple tuple = &scan->rs_ctup;
 	Page		page;
 	uint32		lineindex;
 	uint32		linesleft;
@@ -1104,6 +1104,132 @@ continue_page:
 	scan->rs_inited = false;
 }
 
+/*
+ * heapgettup_pagemode_batch
+ *		Collect up to 'maxitems' visible tuples from a single page in page mode.
+ *
+ * This function returns a *batch* of tuples from one heap page. If the
+ * current page (as tracked by the scan desc) has no more tuples left,
+ * it will advance to the next page and prepare it (via heap_prepare_pagescan).
+ * It will not cross a page boundary while filling the batch.
+ *
+ * Return value:
+ *		number of tuples written into 'tdata' (0 at end-of-scan).
+ *
+ * Side effects:
+ *	- Ensures rs_cbuf pins the page from which tuples were produced.
+ *	- Sets rs_cblock, rs_cindex, rs_ntuples consistently (same as
+ *	  heapgettup_pagemode’s inner-loop effects).
+ *	- Does *not* change buffer pin counts except through normal page
+ *	  transitions performed by heap_fetch_next_buffer().
+ */
+static int
+heapgettup_pagemode_batch(HeapScanDesc scan,
+						  ScanDirection dir,
+						  int nkeys, ScanKey key,
+						  HeapTupleData *tdata,
+						  int maxitems)
+{
+	Page		page;
+	uint32		lineindex;
+	uint32		linesleft;
+	int			nout = 0;
+	Relation	rel = scan->rs_base.rs_rd;
+	Oid			tableOid = RelationGetRelid(rel);
+	TupleDesc	tupdesc = key ? RelationGetDescr(rel) : NULL;
+
+	/*
+	 * Current batching limitations (may be relaxed in future):
+	 *
+	 * - Forward scans only: backward scan support would require changes to
+	 *   batch iteration and page advancement logic.
+	 *
+	 * - Pagemode required: batching relies on the pre-built rs_vistuples[]
+	 *   array from heap_prepare_pagescan(). This is guaranteed by
+	 *   ScanCanUseBatching() which only enables batching when SO_ALLOW_PAGEMODE
+	 *   is set. Unlike heap_getnextslot, we don't support dynamic fallback to
+	 *   tuple-at-a-time mode since the batch execution path is selected at
+	 *   ExecInit time.
+	 */
+	Assert(ScanDirectionIsForward(dir));
+	Assert(scan->rs_base.rs_flags & SO_ALLOW_PAGEMODE);
+	Assert(maxitems > 0);
+
+	/*
+	 * If we have no current page (or the current page is exhausted),
+	 * advance to the next page that has any visible tuples and prepare it.
+	 * This mirrors the outer loop of heapgettup_pagemode(), but we stop
+	 * as soon as we have a prepared page; we never produce from two pages.
+	 */
+	for (;;)
+	{
+		if (BufferIsValid(scan->rs_cbuf))
+		{
+			/* Are there more visible tuples left on this page? */
+			lineindex = scan->rs_cindex + dir;
+			linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
+				(scan->rs_ntuples - lineindex) : 0;
+			if (linesleft > 0)
+				break;	/* continue on this page */
+		}
+
+		/* Move to next page and prepare its visible tuple list. */
+		heap_fetch_next_buffer(scan, dir);
+
+		if (!BufferIsValid(scan->rs_cbuf))
+		{
+			/* end of scan; keep rs_cbuf invalid like heapgettup_pagemode */
+			scan->rs_cblock = InvalidBlockNumber;
+			scan->rs_prefetch_block = InvalidBlockNumber;
+			scan->rs_inited = false;
+			return 0;
+		}
+
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+		heap_prepare_pagescan((TableScanDesc) scan);
+
+		/* After prepare, either rs_ntuples > 0 or we'll loop again. */
+		if (scan->rs_ntuples > 0)
+		{
+			lineindex = 0;
+			linesleft = scan->rs_ntuples;
+			break;
+		}
+		/* else: page had no visible tuples; continue to next page */
+	}
+
+	/* From here on, we must only read tuples from this single page. */
+	page = BufferGetPage(scan->rs_cbuf);
+
+	/*
+	 * Walk rs_vistuples[] from 'lineindex', copying headers into tdata[]
+	 * until either the page is exhausted or the batch capacity is reached.
+	 */
+	for (; linesleft > 0 && nout < maxitems; linesleft--, lineindex += dir)
+	{
+		OffsetNumber	lineoff;
+		ItemId			lpp;
+		HeapTupleData *dst = &tdata[nout];
+
+		Assert(lineindex <= (uint32) scan->rs_ntuples);
+		lineoff = scan->rs_vistuples[lineindex];
+		lpp = PageGetItemId(page, lineoff);
+		Assert(ItemIdIsNormal(lpp));
+
+		dst->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
+		dst->t_len = ItemIdGetLength(lpp);
+		dst->t_tableOid = tableOid;
+		ItemPointerSet(&(dst->t_self), scan->rs_cblock, lineoff);
+
+		if (key != NULL && !HeapKeyTest(dst, tupdesc, nkeys, key))
+			continue;
+
+		scan->rs_cindex = lineindex;
+		nout++;
+	}
+
+	return nout;
+}
 
 /* ----------------------------------------------------------------
  *					 heap access method interface
@@ -1436,6 +1562,97 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
 	return true;
 }
 
+/*---------- Batching support -----------*/
+
+/*
+ * heap_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)
+{
+	HeapBatch  *hb;
+	Oid			relid;
+
+	Assert(maxitems > 0);
+
+	hb = palloc(sizeof(HeapBatch));
+	hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
+	hb->maxitems = maxitems;
+	hb->nitems = 0;
+	hb->buf = InvalidBuffer;
+
+	/* Initialize static fields of HeapTupleData. Row bodies remain on page. */
+	relid = RelationGetRelid(sscan->rs_rd);
+	for (int i = 0; i < maxitems; i++)
+		hb->tupdata[i].t_tableOid = relid;
+
+	return hb;
+}
+
+/*
+ * heap_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)
+{
+	HeapBatch *hb = (HeapBatch *) am_batch;
+
+	if (BufferIsValid(hb->buf))
+		ReleaseBuffer(hb->buf);
+
+	pfree(hb->tupdata);
+	pfree(hb);
+}
+
+int
+heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+	HeapScanDesc scan = (HeapScanDesc) sscan;
+	HeapBatch  *hb = (HeapBatch *) am_batch;
+	Buffer		curbuf;
+	int			n;
+
+	Assert(ScanDirectionIsForward(dir));
+	Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+	Assert(hb->maxitems > 0);
+
+	/* Drop prior batch pin, if any. */
+	if (BufferIsValid(hb->buf))
+	{
+		ReleaseBuffer(hb->buf);
+		hb->buf = InvalidBuffer;
+	}
+
+	hb->nitems = 0;
+
+	/* One call per batch, never crosses a page. */
+	n = heapgettup_pagemode_batch(scan, dir,
+								  sscan->rs_nkeys, sscan->rs_key,
+								  hb->tupdata, hb->maxitems);
+
+	if (n == 0)
+		return 0;	/* end of scan */
+
+	/* Hold a shared pin for the batch lifetime so t_data stays valid. */
+	curbuf = scan->rs_cbuf;
+	IncrBufferRefCount(curbuf);
+	hb->buf = curbuf;
+
+	/* Per-tuple stats (can be collapsed into a future _multi() call). */
+	pgstat_count_heap_getnext_batch(sscan->rs_rd, n);
+
+	hb->nitems = n;
+	return n;
+}
+
+/*----- End of batching support -----*/
+
 void
 heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 				  ItemPointer maxtid)
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index dd4fe6bf62f..550b788553c 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2623,6 +2623,10 @@ static const TableAmRoutine heapam_methods = {
 	.scan_rescan = heap_rescan,
 	.scan_getnextslot = heap_getnextslot,
 
+	.scan_begin_batch = heap_begin_batch,
+	.scan_getnextbatch = heap_getnextbatch,
+	.scan_end_batch = heap_end_batch,
+
 	.scan_set_tidrange = heap_set_tidrange,
 	.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f7e4ae3843c..f6675043fb3 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -101,6 +101,19 @@ typedef struct HeapScanDescData
 } HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
+/*
+ * HeapBatch -- stateless per-batch buffer. A batch pins one page and
+ * exposes up to maxitems HeapTupleData headers whose t_data point into that
+ * page.
+ */
+typedef struct HeapBatch
+{
+	HeapTupleData  *tupdata;	/* len = maxitems; headers only */
+	int				nitems;		/* tuples produced in last getnextbatch() */
+	int				maxitems;	/* fixed capacity set at begin_batch() */
+	Buffer			buf;		/* single pinned buffer for this batch */
+} HeapBatch;
+
 typedef struct BitmapHeapScanDescData
 {
 	HeapScanDescData rs_heap_base;
@@ -337,6 +350,11 @@ extern void heap_endscan(TableScanDesc sscan);
 extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
 extern bool heap_getnextslot(TableScanDesc sscan,
 							 ScanDirection direction, TupleTableSlot *slot);
+
+extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
+extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
+extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+
 extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 							  ItemPointer maxtid);
 extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 2fa790b6bf5..3ec3c3dd008 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -351,6 +351,16 @@ typedef struct TableAmRoutine
 									 ScanDirection direction,
 									 TupleTableSlot *slot);
 
+	/* ------------------------------------------------------------------------
+	 * Batched scan support
+	 * ------------------------------------------------------------------------
+	 */
+
+	void	   *(*scan_begin_batch)(TableScanDesc sscan, int maxitems);
+	int			(*scan_getnextbatch)(TableScanDesc sscan, void *am_batch,
+									 ScanDirection dir);
+	void		(*scan_end_batch)(TableScanDesc sscan, void *am_batch);
+
 	/*-----------
 	 * Optional functions to provide scanning for ranges of ItemPointers.
 	 * Implementations must either provide both of these functions, or neither
@@ -1036,6 +1046,54 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
+/*
+ * table_scan_begin_batch
+ *		Allocate AM-owned batch payload with capacity 'maxitems'.
+ */
+static inline void *
+table_scan_begin_batch(TableScanDesc sscan, int maxitems)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	Assert(tam->scan_begin_batch != NULL);
+
+	return tam->scan_begin_batch(sscan, maxitems);
+}
+
+/*
+ * table_scan_getnextbatch
+ *		Fill next batch from the AM. Returns number of tuples, 0 => EOS.
+ *		Batches are single-page in v1. Direction is forward only in v1.
+ */
+static inline int
+table_scan_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	/* Only forward scans are supported in the batched mode. */
+	Assert(dir == ForwardScanDirection);
+	Assert(tam->scan_getnextbatch != NULL);
+
+	return tam->scan_getnextbatch(sscan, am_batch, dir);
+}
+
+/*
+ * table_scan_end_batch
+ *		Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, void *am_batch)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	if (am_batch == NULL)
+		return;
+
+	Assert(tam->scan_end_batch != NULL);
+
+	tam->scan_end_batch(sscan, am_batch);
+}
+
 /* ----------------------------------------------------------------------------
  * TID Range scanning related functions.
  * ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6714363144a..85f76dee468 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -697,6 +697,11 @@ extern void pgstat_report_analyze(Relation rel,
 		if (pgstat_should_count_relation(rel))						\
 			(rel)->pgstat_info->counts.tuples_returned++;			\
 	} while (0)
+#define pgstat_count_heap_getnext_batch(rel, n)						\
+	do {															\
+		if (pgstat_should_count_relation(rel))						\
+			(rel)->pgstat_info->counts.tuples_returned += (n);	\
+	} while (0)
 #define pgstat_count_heap_fetch(rel)								\
 	do {															\
 		if (pgstat_should_count_relation(rel))						\
-- 
2.47.3
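
The begin/getnext/end lifecycle that this patch adds to the table AM can be
illustrated outside PostgreSQL. What follows is only a toy sketch under
stated assumptions: ToyScan, ToyBatch, TOY_PAGE_ITEMS and the toy_*
functions are hypothetical names, a "page" is just a fixed-size run of ints,
and nothing here touches buffers or pins. It mirrors two properties the
patch describes: a batch never crosses a page boundary, and a return value
of 0 means end of scan.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in: every "page" holds this many rows. */
#define TOY_PAGE_ITEMS 4

typedef struct ToyScan
{
	const int  *rows;		/* all rows, laid out page after page */
	int			nrows;
	int			next;		/* index of the next unread row */
} ToyScan;

typedef struct ToyBatch
{
	const int **items;		/* pointers into the current page, no copies */
	int			nitems;		/* rows produced by the last getnextbatch */
	int			maxitems;	/* fixed capacity chosen at begin_batch */
} ToyBatch;

/* Allocate the batch payload once, with a fixed capacity. */
static ToyBatch *
toy_begin_batch(int maxitems)
{
	ToyBatch   *b;

	assert(maxitems > 0);
	b = malloc(sizeof(ToyBatch));
	assert(b != NULL);
	b->items = malloc(sizeof(const int *) * maxitems);
	assert(b->items != NULL);
	b->nitems = 0;
	b->maxitems = maxitems;
	return b;
}

/* Fill the batch from the current page only; 0 means end of scan. */
static int
toy_getnextbatch(ToyScan *scan, ToyBatch *b)
{
	int			page_end;
	int			n = 0;

	b->nitems = 0;
	if (scan->next >= scan->nrows)
		return 0;

	/* Stop at the page boundary or at capacity, whichever comes first. */
	page_end = (scan->next / TOY_PAGE_ITEMS + 1) * TOY_PAGE_ITEMS;
	if (page_end > scan->nrows)
		page_end = scan->nrows;

	while (scan->next < page_end && n < b->maxitems)
		b->items[n++] = &scan->rows[scan->next++];

	b->nitems = n;
	return n;
}

/* Free the payload; the caller must not use 'b' afterwards. */
static void
toy_end_batch(ToyBatch *b)
{
	free(b->items);
	free(b);
}

/* Drain a scan, returning the row count and reporting the batch count. */
static int
toy_scan_all(const int *rows, int nrows, int maxitems, int *nbatches)
{
	ToyScan		scan = {rows, nrows, 0};
	ToyBatch   *b = toy_begin_batch(maxitems);
	int			total = 0;

	*nbatches = 0;
	while (toy_getnextbatch(&scan, b) > 0)
	{
		total += b->nitems;
		(*nbatches)++;
	}
	toy_end_batch(b);
	return total;
}
```

In this toy, ten rows on four-row pages with a capacity of three come back
as batches of 3, 1, 3, 1 and 2 rows, which is the "batch may not always be
full" effect of composing batches from a single page.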



  [application/octet-stream] v4-0002-SeqScan-add-batch-driven-variants-returning-slots.patch (27.6K, 4-v4-0002-SeqScan-add-batch-driven-variants-returning-slots.patch)
From 5630836aefb87948bb745d7faad01e9e3534a64c Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 17:23:12 +0900
Subject: [PATCH v4 2/3] SeqScan: add batch-driven variants returning slots

Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers. This reduces per-tuple AM crossings without
changing the node interface seen by parents.

Add TupleBatch and supporting code in execBatch.c/h to hold executor
side batching state. PlanState gains ps_Batch to carry the active
TupleBatch when a node supports batching.

Wire up runtime selection in ExecInitSeqScan using
ScanCanUseBatching(). When executor_batch_rows is greater than zero,
EPQ is inactive, the scan is not backward, and the relation's AM
supports batching, ps.ExecProcNode is set to a batch-driven variant.
Otherwise the non-batch path is used.

Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled and allowed.

Add executor_batch_rows GUC to set the maximum number of rows per
batch (default 64, range 0 to 1024). A value of 0 disables batching.

Notes / current limits:

- With the current heapam, batches are composed from a single page, so
  the batch may not always be full. Future work may let SeqScan and/or
  AMs top up batches across pages when safe to do so.

Reviewed-by: Daniil Davydov <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
 src/backend/access/heap/heapam.c          |  29 ++++
 src/backend/access/heap/heapam_handler.c  |  16 ++
 src/backend/access/table/tableam.c        |  11 ++
 src/backend/executor/Makefile             |   1 +
 src/backend/executor/execBatch.c          | 117 ++++++++++++++
 src/backend/executor/execScan.c           |  31 ++++
 src/backend/executor/meson.build          |   1 +
 src/backend/executor/nodeSeqscan.c        | 176 +++++++++++++++++++++-
 src/backend/utils/init/globals.c          |   3 +
 src/backend/utils/misc/guc_parameters.dat |   9 ++
 src/include/access/heapam.h               |   1 +
 src/include/access/tableam.h              |  27 ++++
 src/include/executor/execBatch.h          |  99 ++++++++++++
 src/include/executor/execScan.h           |  69 +++++++++
 src/include/executor/executor.h           |   4 +
 src/include/miscadmin.h                   |   1 +
 src/include/nodes/execnodes.h             |   4 +
 17 files changed, 598 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/executor/execBatch.c
 create mode 100644 src/include/executor/execBatch.h

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index fcc0813f139..0c0b2384f0e 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -1592,6 +1592,35 @@ heap_begin_batch(TableScanDesc sscan, int maxitems)
 	return hb;
 }
 
+/*
+ * heap_materialize_batch_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)
+{
+	HeapBatch *hb = (HeapBatch *) am_batch;
+
+	Assert(n <= hb->nitems);
+
+	for (int i = 0; i < n; i++)
+	{
+		HeapTupleData *tuple = &hb->tupdata[i];
+		HeapTupleTableSlot *slot = (HeapTupleTableSlot *) slots[i];
+
+		/* Inline of ExecStoreHeapTuple(tuple, slot, false) */
+		slot->tuple = tuple;
+		slot->off = 0;
+		slot->base.tts_nvalid = 0;
+		slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
+		slot->base.tts_tid = tuple->t_self;
+		slot->base.tts_tableOid = tuple->t_tableOid;
+		slot->base.tts_flags &= ~(TTS_FLAG_SHOULDFREE | TTS_FLAG_EMPTY);
+	}
+}
+
 /*
  * heap_scan_end_batch
  *
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 550b788553c..a4de7e5b4f5 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -72,6 +72,21 @@ heapam_slot_callbacks(Relation relation)
 	return &TTSOpsBufferHeapTuple;
 }
 
+/* ------------------------------------------------------------------------
+ * TupleBatch related callbacks for heap AM
+ * ------------------------------------------------------------------------
+ */
+
+static const TupleBatchOps TupleBatchHeapOps =
+{
+	.materialize_all = heap_materialize_batch_all
+};
+
+static const TupleBatchOps *
+heapam_batch_callbacks(Relation relation)
+{
+	return &TupleBatchHeapOps;
+}
 
 /* ------------------------------------------------------------------------
  * Index Scan Callbacks for heap AM
@@ -2617,6 +2632,7 @@ static const TableAmRoutine heapam_methods = {
 	.type = T_TableAmRoutine,
 
 	.slot_callbacks = heapam_slot_callbacks,
+	.batch_callbacks = heapam_batch_callbacks,
 
 	.scan_begin = heap_beginscan,
 	.scan_end = heap_endscan,
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 73ebc01a08f..d281aacaf94 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -103,6 +103,17 @@ table_slot_create(Relation relation, List **reglist)
 	return slot;
 }
 
+/* ----------------------------------------------------------------------------
+ * TupleBatch support routines
+ * ----------------------------------------------------------------------------
+ */
+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+	if (relation->rd_tableam)
+		return relation->rd_tableam->batch_callbacks(relation);
+	elog(ERROR, "relation does not support TupleBatch operations");
+}
 
 /* ----------------------------------------------------------------------------
  * Table scan functions.
diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..3e72f3fe03c 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	execAmi.o \
 	execAsync.o \
+	execBatch.o \
 	execCurrent.o \
 	execExpr.o \
 	execExprInterp.o \
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
new file mode 100644
index 00000000000..007ae535687
--- /dev/null
+++ b/src/backend/executor/execBatch.c
@@ -0,0 +1,117 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.c
+ *		Helpers for TupleBatch
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+#include "executor/execBatch.h"
+
+/*
+ * TupleBatchCreate
+ *		Allocate and initialize a new TupleBatch envelope.
+ */
+TupleBatch *
+TupleBatchCreate(TupleDesc scandesc, int capacity)
+{
+	TupleBatch  *b;
+	TupleTableSlot **inslots,
+				   **outslots;
+
+	inslots = palloc(sizeof(TupleTableSlot *) * capacity);
+	outslots = palloc(sizeof(TupleTableSlot *) * capacity);
+	for (int i = 0; i < capacity; i++)
+		inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);
+
+	b = (TupleBatch *) palloc0(sizeof(TupleBatch));
+
+	/* Initial state: empty envelope */
+	b->am_payload = NULL;
+	b->ntuples = 0;
+	b->inslots = inslots;
+	b->outslots = outslots;
+	b->activeslots = NULL;
+	b->materialized = false;
+	b->maxslots = capacity;
+
+	b->nvalid = 0;
+	b->next = 0;
+
+	return b;
+}
+
+/*
+ * TupleBatchReset
+ *		Reset an existing TupleBatch envelope to empty.
+ */
+void
+TupleBatchReset(TupleBatch *b, bool drop_slots)
+{
+	if (b == NULL)
+		return;
+
+	for (int i = 0; i < b->maxslots; i++)
+	{
+		ExecClearTuple(b->inslots[i]);
+		if (drop_slots)
+			ExecDropSingleTupleTableSlot(b->inslots[i]);
+	}
+
+	if (drop_slots)
+	{
+		pfree(b->inslots);
+		pfree(b->outslots);
+		b->inslots = b->outslots = NULL;
+	}
+
+	b->ntuples = 0;
+	b->nvalid = 0;
+	b->next = 0;
+	b->activeslots = NULL;
+}
+
+void
+TupleBatchUseInput(TupleBatch *b, int nvalid)
+{
+	b->materialized = true;
+	b->activeslots = b->inslots;
+	b->nvalid = nvalid;
+	b->next = 0;
+}
+
+void
+TupleBatchUseOutput(TupleBatch *b, int nvalid)
+{
+	b->materialized = true;
+	b->activeslots = b->outslots;
+	b->nvalid = nvalid;
+	b->next = 0;
+}
+
+bool
+TupleBatchIsValid(TupleBatch *b)
+{
+	return	b != NULL &&
+			b->maxslots > 0 &&
+			b->inslots != NULL &&
+			b->outslots != NULL;
+}
+
+void
+TupleBatchRewind(TupleBatch *b)
+{
+	b->next = 0;
+}
+
+int
+TupleBatchGetNumValid(TupleBatch *b)
+{
+	return b->nvalid;
+}
diff --git a/src/backend/executor/execScan.c b/src/backend/executor/execScan.c
index 31ed4783c1d..ba25daa5e46 100644
--- a/src/backend/executor/execScan.c
+++ b/src/backend/executor/execScan.c
@@ -18,6 +18,7 @@
  */
 #include "postgres.h"
 
+#include "access/tableam.h"
 #include "executor/executor.h"
 #include "executor/execScan.h"
 #include "miscadmin.h"
@@ -154,3 +155,33 @@ ExecScanReScan(ScanState *node)
 		}
 	}
 }
+
+bool
+ScanCanUseBatching(ScanState *scanstate, int eflags)
+{
+	Relation	relation = scanstate->ss_currentRelation;
+
+	return	executor_batch_rows > 0 &&
+			(scanstate->ps.state->es_epq_active == NULL) &&
+			!(eflags & EXEC_FLAG_BACKWARD) &&
+			relation && table_supports_batching(relation);
+}
+
+void
+ScanResetBatching(ScanState *scanstate, bool drop)
+{
+	TupleBatch *b = scanstate->ps.ps_Batch;
+
+	if (b)
+	{
+		TupleBatchReset(b, drop);
+		if (b->am_payload)
+		{
+			table_scan_end_batch(scanstate->ss_currentScanDesc,
+								 b->am_payload);
+			b->am_payload = NULL;
+		}
+		if (drop)
+			pfree(b);
+	}
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index 2cea41f8771..40ffc28f3cb 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
 backend_sources += files(
   'execAmi.c',
   'execAsync.c',
+  'execBatch.c',
   'execCurrent.c',
   'execExpr.c',
   'execExprInterp.c',
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 94047d29430..a9071e32560 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -203,6 +203,171 @@ ExecSeqScanEPQ(PlanState *pstate)
 					(ExecScanRecheckMtd) SeqRecheck);
 }
 
+/* ----------------------------------------------------------------
+ *						Batch Support
+ * ----------------------------------------------------------------
+ */
+static inline bool
+SeqNextBatch(SeqScanState *node)
+{
+	TableScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+
+	Assert(node->ss.ps.ps_Batch != NULL);
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	Assert(direction == ForwardScanDirection);
+
+	if (scandesc == NULL)
+	{
+		/*
+		 * We reach here if the scan is not parallel, or if we're serially
+		 * executing a scan that was planned to be parallel.
+		 */
+		scandesc = table_beginscan(node->ss.ss_currentRelation,
+								   estate->es_snapshot,
+								   0, NULL);
+		node->ss.ss_currentScanDesc = scandesc;
+	}
+
+	/* Lazily create the AM batch payload. */
+	if (node->ss.ps.ps_Batch->am_payload == NULL)
+	{
+		const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+		Assert(tam && tam->scan_begin_batch);
+		node->ss.ps.ps_Batch->am_payload =
+			table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
+		node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+	}
+
+	node->ss.ps.ps_Batch->ntuples =
+		table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
+	node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
+	node->ss.ps.ps_Batch->materialized = false;
+
+	return node->ss.ps.ps_Batch->ntuples > 0;
+}
+
+static inline bool
+SeqNextBatchMaterialize(SeqScanState *node)
+{
+	if (SeqNextBatch(node))
+	{
+		TupleBatchMaterializeAll(node->ss.ps.ps_Batch);
+		return true;
+	}
+
+	return false;
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	/*
+	 * Use pg_assume() for != NULL tests to make the compiler realize no
+	 * runtime check for the field is needed in ExecScanExtendedBatchSlot().
+	 */
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScan() but when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	pg_assume(pstate->ps_ProjInfo != NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScan() but when qual evaluation and projection are
+ * required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	pg_assume(pstate->ps_ProjInfo != NULL);
+
+	return ExecScanExtendedBatchSlot(&node->ss,
+									 (ExecScanAccessBatchMtd) SeqNextBatchMaterialize,
+									 pstate->qual, pstate->ps_ProjInfo);
+}
+
+/* Batch SeqScan enablement and dispatch */
+static void
+SeqScanInitBatching(SeqScanState *scanstate, int eflags)
+{
+	const int cap = executor_batch_rows;
+	TupleDesc	scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+
+	scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+
+	/* Choose the batch variant matching the non-batch specialization matrix */
+	if (scanstate->ss.ps.qual == NULL)
+	{
+		if (scanstate->ss.ps.ps_ProjInfo == NULL)
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+		}
+		else
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+		}
+	}
+	else
+	{
+		if (scanstate->ss.ps.ps_ProjInfo == NULL)
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+		}
+		else
+		{
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+		}
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecInitSeqScan
  * ----------------------------------------------------------------
@@ -211,6 +376,7 @@ SeqScanState *
 ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 {
 	SeqScanState *scanstate;
+	bool	use_batching;
 
 	/*
 	 * Once upon a time it was possible to have an outerPlan of a SeqScan, but
@@ -241,9 +407,12 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 							 node->scan.scanrelid,
 							 eflags);
 
+	use_batching = ScanCanUseBatching(&scanstate->ss, eflags);
+
 	/* and create slot with the appropriate rowtype */
 	ExecInitScanTupleSlot(estate, &scanstate->ss,
 						  RelationGetDescr(scanstate->ss.ss_currentRelation),
+						  use_batching ? &TTSOpsHeapTuple :
 						  table_slot_callbacks(scanstate->ss.ss_currentRelation));
 
 	/*
@@ -280,6 +449,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
 	}
 
+	if (use_batching)
+		SeqScanInitBatching(scanstate, eflags);
+
 	return scanstate;
 }
 
@@ -299,6 +471,8 @@ ExecEndSeqScan(SeqScanState *node)
 	 */
 	scanDesc = node->ss.ss_currentScanDesc;
 
+	ScanResetBatching(&node->ss, true);
+
 	/*
 	 * close heap scan
 	 */
@@ -327,7 +501,7 @@ ExecReScanSeqScan(SeqScanState *node)
 	if (scan != NULL)
 		table_rescan(scan,		/* scan desc */
 					 NULL);		/* new scan keys */
-
+	ScanResetBatching(&node->ss, false);
 	ExecScanReScan((ScanState *) node);
 }
 
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index d31cb45a058..266502e9778 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int			notify_buffers = 16;
 int			serializable_buffers = 32;
 int			subtransaction_buffers = 0;
 int			transaction_buffers = 0;
+
+/* executor batching */
+int			executor_batch_rows = 64;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 3b9d8349078..fd97d26c073 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1001,6 +1001,15 @@
   boot_val => 'true',
 },
 
+{ name => 'executor_batch_rows', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Number of rows to include in batches during execution.',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'executor_batch_rows',
+  boot_val => '64',
+  min => '0',
+  max => '1024',
+},
+
 { name => 'exit_on_error', type => 'bool', context => 'PGC_USERSET', group => 'ERROR_HANDLING_OPTIONS',
   short_desc => 'Terminate session on any error.',
   variable => 'ExitOnAnyError',
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index f6675043fb3..fe07b21eaa2 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -354,6 +354,7 @@ extern bool heap_getnextslot(TableScanDesc sscan,
 extern void *heap_begin_batch(TableScanDesc sscan, int maxitems);
 extern void heap_end_batch(TableScanDesc sscan, void *am_batch);
 extern int heap_getnextbatch(TableScanDesc sscan, void *am_batch, ScanDirection dir);
+extern void heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n);
 
 extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 							  ItemPointer maxtid);
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 3ec3c3dd008..13a95f7a589 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -21,6 +21,7 @@
 #include "access/sdir.h"
 #include "access/xact.h"
 #include "commands/vacuum.h"
+#include "executor/execBatch.h"
 #include "executor/tuptable.h"
 #include "storage/read_stream.h"
 #include "utils/rel.h"
@@ -39,6 +40,7 @@ typedef struct BulkInsertStateData BulkInsertStateData;
 typedef struct IndexInfo IndexInfo;
 typedef struct SampleScanState SampleScanState;
 typedef struct ValidateIndexState ValidateIndexState;
+typedef struct TupleBatchOps TupleBatchOps;
 
 /*
  * Bitmask values for the flags argument to the scan_begin callback.
@@ -301,6 +303,7 @@ typedef struct TableAmRoutine
 	 * Return slot implementation suitable for storing a tuple of this AM.
 	 */
 	const TupleTableSlotOps *(*slot_callbacks) (Relation rel);
+	const TupleBatchOps *(*batch_callbacks)(Relation rel);
 
 
 	/* ------------------------------------------------------------------------
@@ -361,6 +364,7 @@ typedef struct TableAmRoutine
 									 ScanDirection dir);
 	void		(*scan_end_batch)(TableScanDesc sscan, void *am_batch);
 
+
 	/*-----------
 	 * Optional functions to provide scanning for ranges of ItemPointers.
 	 * Implementations must either provide both of these functions, or neither
@@ -872,6 +876,16 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
  */
 extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
 
+/* ----------------------------------------------------------------------------
+ * TupleBatch functions.
+ * ----------------------------------------------------------------------------
+ */
+
+/*
+ * Returns callbacks for manipulating TupleBatch for tuples of the given
+ * relation.
+ */
+extern const TupleBatchOps *table_batch_callbacks(Relation relation);
 
 /* ----------------------------------------------------------------------------
  * Table scan functions.
@@ -1046,6 +1060,18 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
+/*
+ * table_supports_batching
+ *		Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation)
+{
+	const TableAmRoutine *tam = relation->rd_tableam;
+
+	return tam->scan_getnextbatch != NULL;
+}
+
 /*
  * table_scan_begin_batch
  *		Allocate AM-owned batch payload with capacity 'maxitems'.
@@ -2128,5 +2154,6 @@ extern const TableAmRoutine *GetTableAmRoutine(Oid amhandler);
  */
 
 extern const TableAmRoutine *GetHeapamTableAmRoutine(void);
+extern struct TupleBatchOps *GetHeapamTupleBatchOps(void);
 
 #endif							/* TABLEAM_H */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
new file mode 100644
index 00000000000..2d0066103ce
--- /dev/null
+++ b/src/include/executor/execBatch.h
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * execBatch.h
+ *		Executor batch envelope for passing tuple batch state upward
+ *
+ * Portions Copyright (c) 1996-2025, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/include/executor/execBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECBATCH_H
+#define EXECBATCH_H
+
+#include "executor/tuptable.h"
+
+/*
+ * TupleBatchOps -- AM-specific helpers for lazy materialization.
+ */
+typedef struct TupleBatchOps
+{
+	void (*materialize_all)(void *am_payload,
+							TupleTableSlot **dst,
+							int maxslots);
+} TupleBatchOps;
+
+/*
+ * TupleBatch
+ *
+ * Envelope for a batch of tuples produced by a plan node (e.g., SeqScan) per
+ * call to a batch variant of ExecSeqScan().
+ */
+typedef struct TupleBatch
+{
+	void	   *am_payload;
+	const TupleBatchOps *ops;
+	int			ntuples;				/* number of tuples in am_payload */
+	bool		materialized;		 /* tuples in slots valid? */
+	struct TupleTableSlot **inslots; /* slots for tuples read "into" batch */
+	struct TupleTableSlot **outslots; /* slots for tuples going "out of"
+									   * batch */
+	struct TupleTableSlot **activeslots;
+	int			maxslots;
+
+	int		nvalid;		/* number of returnable tuples in activeslots */
+	int		next;		/* 0-based index of next tuple to be returned */
+} TupleBatch;
+
+
+/* Helpers */
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
+extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
+extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
+extern bool TupleBatchIsValid(TupleBatch *b);
+extern void TupleBatchRewind(TupleBatch *b);
+extern int TupleBatchGetNumValid(TupleBatch *b);
+
+static inline TupleTableSlot *
+TupleBatchGetNextSlot(TupleBatch *b)
+{
+	return b->next < b->nvalid ? b->activeslots[b->next++] : NULL;
+}
+
+static inline TupleTableSlot *
+TupleBatchGetSlot(TupleBatch *b, int index)
+{
+	Assert(index < b->nvalid);
+	return b->activeslots[index];
+}
+
+static inline void
+TupleBatchStoreInOut(TupleBatch *b, int index, TupleTableSlot *out)
+{
+	Assert(TupleBatchIsValid(b));
+	b->outslots[index] = out;
+}
+
+static inline bool
+TupleBatchHasMore(TupleBatch *b)
+{
+	return b->activeslots && b->next < b->nvalid;
+}
+
+static inline void
+TupleBatchMaterializeAll(TupleBatch *b)
+{
+	if (b->materialized)
+		return;
+
+	if (b->ops == NULL || b->ops->materialize_all == NULL)
+		elog(ERROR, "TupleBatch has no slots and no materialize_all op");
+
+	b->ops->materialize_all(b->am_payload, b->inslots, b->ntuples);
+	TupleBatchUseInput(b, b->ntuples);
+}
+
+#endif	/* EXECBATCH_H */
diff --git a/src/include/executor/execScan.h b/src/include/executor/execScan.h
index 2003cbc7ed5..c1add8ca331 100644
--- a/src/include/executor/execScan.h
+++ b/src/include/executor/execScan.h
@@ -251,4 +251,73 @@ ExecScanExtended(ScanState *node,
 	}
 }
 
+/*
+ * ExecScanExtendedBatchSlot
+ *		Batch-driven variant of ExecScanExtended.
+ *
+ * Returns one tuple at a time to callers, but internally fetches tuples
+ * in batches from the AM via accessBatchMtd. This reduces per-tuple AM
+ * call overhead while preserving the single-slot interface expected by
+ * parent nodes.
+ *
+ * The batch is refilled when exhausted by calling accessBatchMtd, which
+ * returns false at end-of-scan.
+ *
+ * Note: EPQ is not supported in the batch path; callers must ensure
+ * es_epq_active is NULL before using this function.
+ */
+static inline TupleTableSlot *
+ExecScanExtendedBatchSlot(ScanState *node,
+						  ExecScanAccessBatchMtd accessBatchMtd,
+						  ExprState *qual, ProjectionInfo *projInfo)
+{
+	ExprContext *econtext = node->ps.ps_ExprContext;
+	TupleBatch *b = node->ps.ps_Batch;
+
+	/* Batch path does not support EPQ */
+	Assert(node->ps.state->es_epq_active == NULL);
+	Assert(TupleBatchIsValid(b));
+
+	for (;;)
+	{
+		TupleTableSlot *in;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Get next input slot from current batch, or refill */
+		if (!TupleBatchHasMore(b))
+		{
+			if (!accessBatchMtd(node))
+				return NULL;
+		}
+
+		in = TupleBatchGetNextSlot(b);
+		Assert(in);
+
+		/* No qual, no projection: direct return */
+		if (qual == NULL && projInfo == NULL)
+			return in;
+
+		ResetExprContext(econtext);
+		econtext->ecxt_scantuple = in;
+
+		/* Qual only */
+		if (projInfo == NULL)
+		{
+			if (qual == NULL || ExecQual(qual, econtext))
+				return in;
+			else
+				InstrCountFiltered1(node, 1);
+			continue;
+		}
+
+		/* Projection (with or without qual) */
+		if (qual == NULL || ExecQual(qual, econtext))
+			return ExecProject(projInfo);
+		else
+			InstrCountFiltered1(node, 1);
+		/* else try next tuple */
+	}
+}
+
 #endif							/* EXECSCAN_H */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 7cd6a49309f..c1f05ce6273 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -578,12 +578,16 @@ extern Datum ExecMakeFunctionResultSet(SetExprState *fcache,
  */
 typedef TupleTableSlot *(*ExecScanAccessMtd) (ScanState *node);
 typedef bool (*ExecScanRecheckMtd) (ScanState *node, TupleTableSlot *slot);
+typedef bool (*ExecScanAccessBatchMtd)(ScanState *node);
 
 extern TupleTableSlot *ExecScan(ScanState *node, ExecScanAccessMtd accessMtd,
 								ExecScanRecheckMtd recheckMtd);
+
 extern void ExecAssignScanProjectionInfo(ScanState *node);
 extern void ExecAssignScanProjectionInfoWithVarno(ScanState *node, int varno);
 extern void ExecScanReScan(ScanState *node);
+extern bool ScanCanUseBatching(ScanState *scanstate, int eflags);
+extern void ScanResetBatching(ScanState *scanstate, bool drop);
 
 /*
  * prototypes from functions in execTuples.c
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 9a7d733ddef..13285210998 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -288,6 +288,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
 extern PGDLLIMPORT int VacuumCostBalance;
 extern PGDLLIMPORT bool VacuumCostActive;
 
+extern PGDLLIMPORT int executor_batch_rows;
 
 /* in utils/misc/stack_depth.c */
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3968429f991..219a722c49a 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -30,6 +30,7 @@
 #define EXECNODES_H
 
 #include "access/tupconvert.h"
+#include "executor/execBatch.h"
 #include "executor/instrument.h"
 #include "fmgr.h"
 #include "lib/ilist.h"
@@ -1204,6 +1205,9 @@ typedef struct PlanState
 	ExprContext *ps_ExprContext;	/* node's expression-evaluation context */
 	ProjectionInfo *ps_ProjInfo;	/* info for doing tuple projection */
 
+	/* Batching state if node supports it. */
+	TupleBatch *ps_Batch;
+
 	bool		async_capable;	/* true if node is async-capable */
 
 	/*
-- 
2.47.3



  [application/octet-stream] v4-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch (13.8K, 5-v4-0003-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch)
  download | inline diff:
From 189edab507d407cce6446a944b3a48c327167ec3 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 23:09:37 +0900
Subject: [PATCH v4 3/3] Add EXPLAIN (BATCHES) option for tuple batching
 statistics

Add a BATCHES option to EXPLAIN that reports per-node batch statistics
when a node uses batch mode execution.

For nodes that support batching (currently SeqScan), this shows the
number of batches fetched along with average, minimum, and maximum
rows per batch. Output is supported in both text and non-text formats.

Add regression tests covering text output, JSON format, filtered scans,
LIMIT, and disabled batching.

Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
 src/backend/commands/explain.c        | 30 ++++++++++++++
 src/backend/commands/explain_state.c  |  2 +
 src/backend/executor/execBatch.c      |  8 +++-
 src/backend/executor/nodeSeqscan.c    | 24 +++++------
 src/include/commands/explain_state.h  |  1 +
 src/include/executor/execBatch.h      | 35 +++++++++++++++-
 src/include/executor/instrument.h     |  1 +
 src/test/regress/expected/explain.out | 57 +++++++++++++++++++++++++++
 src/test/regress/sql/explain.sql      | 26 ++++++++++++
 9 files changed, 171 insertions(+), 13 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5a6390631eb..3a639a13807 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -22,6 +22,7 @@
 #include "commands/explain_format.h"
 #include "commands/explain_state.h"
 #include "commands/prepare.h"
+#include "executor/execBatch.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
 #include "libpq/pqformat.h"
@@ -517,6 +518,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		instrument_option |= INSTRUMENT_BUFFERS;
 	if (es->wal)
 		instrument_option |= INSTRUMENT_WAL;
+	if (es->batches)
+		instrument_option |= INSTRUMENT_BATCHES;
 
 	/*
 	 * We always collect timing for the entire statement, even when node-level
@@ -2292,6 +2295,33 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		show_buffer_usage(es, &planstate->instrument->bufusage);
 	if (es->wal && planstate->instrument)
 		show_wal_usage(es, &planstate->instrument->walusage);
+	if (es->batches && planstate->ps_Batch)
+	{
+		TupleBatch *b = planstate->ps_Batch;
+
+		if (b->stat_batches > 0)
+		{
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				ExplainIndentText(es);
+				appendStringInfo(es->str,
+								 "Batches: %lld  Avg Rows: %.1f  Max: %d  Min: %d\n",
+								 (long long) b->stat_batches,
+								 TupleBatchAvgRows(b),
+								 b->stat_max_rows,
+								 b->stat_min_rows == INT_MAX ? 0 : b->stat_min_rows);
+			}
+			else
+			{
+				ExplainPropertyInteger("Batches", NULL, b->stat_batches, es);
+				ExplainPropertyFloat("Average Batch Rows", NULL,
+									 TupleBatchAvgRows(b), 1, es);
+				ExplainPropertyInteger("Max Batch Rows", NULL, b->stat_max_rows, es);
+				ExplainPropertyInteger("Min Batch Rows", NULL,
+									   b->stat_min_rows == INT_MAX ? 0 : b->stat_min_rows, es);
+			}
+		}
+	}
 
 	/* Prepare per-worker buffer/WAL usage */
 	if (es->workers_state && (es->buffers || es->wal) && es->verbose)
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index a6623f8fa52..6ef6055c479 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -159,6 +159,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
 								"EXPLAIN", opt->defname, p),
 						 parser_errposition(pstate, opt->location)));
 		}
+		else if (strcmp(opt->defname, "batches") == 0)
+			es->batches = defGetBoolean(opt);
 		else if (!ApplyExtensionExplainOption(es, opt, pstate))
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
diff --git a/src/backend/executor/execBatch.c b/src/backend/executor/execBatch.c
index 007ae535687..93c90680d3d 100644
--- a/src/backend/executor/execBatch.c
+++ b/src/backend/executor/execBatch.c
@@ -19,7 +19,7 @@
  *		Allocate and initialize a new TupleBatch envelope.
  */
 TupleBatch *
-TupleBatchCreate(TupleDesc scandesc, int capacity)
+TupleBatchCreate(TupleDesc scandesc, int capacity, bool track_stats)
 {
 	TupleBatch  *b;
 	TupleTableSlot **inslots,
@@ -44,6 +44,12 @@ TupleBatchCreate(TupleDesc scandesc, int capacity)
 	b->nvalid = 0;
 	b->next = 0;
 
+	b->track_stats = track_stats;
+	b->stat_batches = 0;
+	b->stat_rows = 0;
+	b->stat_max_rows = 0;
+	b->stat_min_rows = INT_MAX;
+
 	return b;
 }
 
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index a9071e32560..73eb9b6a51e 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -213,8 +213,9 @@ SeqNextBatch(SeqScanState *node)
 	TableScanDesc scandesc;
 	EState	   *estate;
 	ScanDirection direction;
+	TupleBatch *b = node->ss.ps.ps_Batch;
 
-	Assert(node->ss.ps.ps_Batch != NULL);
+	Assert(b != NULL);
 
 	/*
 	 * get information from the estate and scan state
@@ -237,22 +238,21 @@ SeqNextBatch(SeqScanState *node)
 	}
 
 	/* Lazily create the AM batch payload. */
-	if (node->ss.ps.ps_Batch->am_payload == NULL)
+	if (b->am_payload == NULL)
 	{
 		const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
 
 		Assert(tam && tam->scan_begin_batch);
-		node->ss.ps.ps_Batch->am_payload =
-			table_scan_begin_batch(scandesc, node->ss.ps.ps_Batch->maxslots);
-		node->ss.ps.ps_Batch->ops = table_batch_callbacks(node->ss.ss_currentRelation);
+		b->am_payload = table_scan_begin_batch(scandesc, b->maxslots);
+		b->ops = table_batch_callbacks(node->ss.ss_currentRelation);
 	}
 
-	node->ss.ps.ps_Batch->ntuples =
-		table_scan_getnextbatch(scandesc, node->ss.ps.ps_Batch->am_payload, direction);
-	node->ss.ps.ps_Batch->nvalid = node->ss.ps.ps_Batch->ntuples;
-	node->ss.ps.ps_Batch->materialized = false;
+	b->ntuples = table_scan_getnextbatch(scandesc, b->am_payload, direction);
+	b->nvalid = b->ntuples;
+	b->materialized = false;
+	TupleBatchRecordStats(b, b->ntuples);
 
-	return node->ss.ps.ps_Batch->ntuples > 0;
+	return b->ntuples > 0;
 }
 
 static inline bool
@@ -340,8 +340,10 @@ SeqScanInitBatching(SeqScanState *scanstate, int eflags)
 {
 	const int cap = executor_batch_rows;
 	TupleDesc	scandesc = RelationGetDescr(scanstate->ss.ss_currentRelation);
+	EState *estate = scanstate->ss.ps.state;
+	bool track_stats = estate->es_instrument && (estate->es_instrument & INSTRUMENT_BATCHES);
 
-	scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap);
+	scanstate->ss.ps.ps_Batch = TupleBatchCreate(scandesc, cap, track_stats);
 
 	/* Choose batch variant to preserve your specialization matrix */
 	if (scanstate->ss.ps.qual == NULL)
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index ba073b86918..b82f7ac0829 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -55,6 +55,7 @@ typedef struct ExplainState
 	bool		memory;			/* print planner's memory usage information */
 	bool		settings;		/* print modified settings */
 	bool		generic;		/* generate a generic plan */
+	bool		batches;		/* print batch statistics */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
 	ExplainFormat format;		/* output format */
 	/* state for output formatting --- not reset for each new plan tree */
diff --git a/src/include/executor/execBatch.h b/src/include/executor/execBatch.h
index 2d0066103ce..e3a4f762284 100644
--- a/src/include/executor/execBatch.h
+++ b/src/include/executor/execBatch.h
@@ -13,6 +13,7 @@
 #ifndef EXECBATCH_H
 #define EXECBATCH_H
 
+#include <limits.h>
 #include "executor/tuptable.h"
 
 /*
@@ -45,11 +46,18 @@ typedef struct TupleBatch
 
 	int		nvalid;		/* number of returnable tuples in outslots */
 	int		next;		/* 0-based index of next tuple to be returned */
+
+	/* Statistics (populated for EXPLAIN (ANALYZE, BATCHES)) */
+	bool	track_stats;	/* whether to collect stats */
+	int64	stat_batches;	/* total number of batches fetched */
+	int64	stat_rows;		/* total tuples across all batches */
+	int		stat_max_rows;	/* max rows in any single batch */
+	int		stat_min_rows;	/* min rows in any single batch (non-zero) */
 } TupleBatch;
 
 
 /* Helpers */
-extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity);
+extern TupleBatch *TupleBatchCreate(TupleDesc scandesc, int capacity, bool track_stats);
 extern void TupleBatchReset(TupleBatch *b, bool drop_slots);
 extern void TupleBatchUseInput(TupleBatch *b, int nvalid);
 extern void TupleBatchUseOutput(TupleBatch *b, int nvalid);
@@ -96,4 +104,29 @@ TupleBatchMaterializeAll(TupleBatch *b)
 	TupleBatchUseInput(b, b->ntuples);
 }
 
+/* === Batching stats. ===*/
+
+static inline void
+TupleBatchRecordStats(TupleBatch *b, int rows)
+{
+	if (!b->track_stats)
+		return;
+
+	b->stat_batches++;
+	b->stat_rows += rows;
+	if (rows > b->stat_max_rows)
+		b->stat_max_rows = rows;
+	if (rows < b->stat_min_rows && rows > 0)
+		b->stat_min_rows = rows;
+}
+
+static inline double
+TupleBatchAvgRows(TupleBatch *b)
+{
+	if (b->stat_batches == 0)
+		return 0.0;
+
+	return (double) b->stat_rows / b->stat_batches;
+}
+
 #endif	/* EXECBATCH_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index ffe470f2b84..0af02db3760 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -64,6 +64,7 @@ typedef enum InstrumentOption
 	INSTRUMENT_BUFFERS = 1 << 1,	/* needs buffer usage */
 	INSTRUMENT_ROWS = 1 << 2,	/* needs row count */
 	INSTRUMENT_WAL = 1 << 3,	/* needs WAL usage */
+	INSTRUMENT_BATCHES = 1 << 4, /* needs batch statistics */
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..fef3b4a5497 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,60 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
 (9 rows)
 
 reset work_mem;
+-- Test BATCHES option
+set executor_batch_rows = 64;
+create table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+                         explain_filter                         
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+   Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+-- With filter
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+                         explain_filter                         
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+   Filter: (a > N)
+   Rows Removed by Filter: N
+   Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With LIMIT - partial scan shows fewer batches
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+                            explain_filter                            
+----------------------------------------------------------------------
+ Limit (actual time=N.N..N.N rows=N.N loops=N)
+   ->  Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+         Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(5 rows)
+
+-- Batching disabled - no batch line
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+                         explain_filter                         
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+reset executor_batch_rows;
+-- JSON format
+select explain_filter_to_json('explain (analyze, batches, buffers off, format json) select * from batch_test where a < 1000') #> '{0,Plan,Batches}';
+ ?column? 
+----------
+ 0
+(1 row)
+
+drop table batch_test;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..87bb179ced9 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,29 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
 -- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
 select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
 reset work_mem;
+
+-- Test BATCHES option
+set executor_batch_rows = 64;
+
+create table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+
+-- Basic batch stats output
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- With filter
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+
+-- With LIMIT - partial scan shows fewer batches
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+
+-- Batching disabled - no batch line
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+reset executor_batch_rows;
+
+-- JSON format
+select explain_filter_to_json('explain (analyze, batches, buffers off, format json) select * from batch_test where a < 1000') #> '{0,Plan,Batches}';
+
+drop table batch_test;
-- 
2.47.3



  [text/x-sh] bar_limit.sh (1.7K, 6-bar_limit.sh)
  download | inline:
home=$HOME
master=$home/pg/install/master-opt/bin
patched=$home/pg/install/patched-opt/bin
master_data=$home/pg/data/master
patched_data=$home/pg/data/patched
logdir=$home/pg/log

# master
export PATH=$master:$PATH
which postgres
pg_ctl -D  $master_data -l $logdir/pg_master_log start

for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
	psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
	psql -c "vacuum bar_$i" > /dev/null 2>&1
	printf "%s\t" "$i"
	echo "select * from bar_$i limit 1 offset $i" > /tmp/bar_limit.sql
	pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done

pg_ctl -D  $master_data -l $logdir/pg_master_log stop

export PATH=$patched:$PATH;
which postgres
echo "executor_batch_rows=0" >> $patched_data/postgresql.conf
echo "executor_batch_rows=0"
pg_ctl -D  $patched_data -l $logdir/pg_master_log start

for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
	psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
	psql -c "vacuum bar_$i" > /dev/null 2>&1
	printf "%s\t" "$i"
	echo "select * from bar_$i limit 1 offset $i" > /tmp/bar_limit.sql
	pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done

pg_ctl -D  $patched_data -l $logdir/pg_master_log stop

which postgres
echo "executor_batch_rows=64" >> $patched_data/postgresql.conf
echo "executor_batch_rows=64"
pg_ctl -D  $patched_data -l $logdir/pg_master_log start

for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
	psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
	psql -c "vacuum bar_$i" > /dev/null 2>&1
	printf "%s\t" "$i"
	echo "select * from bar_$i limit 1 offset $i" > /tmp/bar_limit.sql
	pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done

pg_ctl -D  $patched_data -l $logdir/pg_master_log stop

  [text/x-sh] bar_limit_where_o.sh (1.7K, 7-bar_limit_where_o.sh)
  download | inline:
home=$HOME
master=$home/pg/install/master-opt/bin
patched=$home/pg/install/patched-opt/bin
master_data=$home/pg/data/master
patched_data=$home/pg/data/patched
logdir=$home/pg/log

# master
export PATH=$master:$PATH
which postgres
pg_ctl -D  $master_data -l $logdir/pg_master_log start

for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
	psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
	psql -c "vacuum bar_$i" > /dev/null 2>&1
	printf "%s\t" "$i"
	echo "select * from bar_$i where o > 0 limit 1 offset $i" > /tmp/bar_limit.sql
	pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done

pg_ctl -D  $master_data -l $logdir/pg_master_log stop

export PATH=$patched:$PATH;
which postgres
echo "executor_batch_rows=0" >> $patched_data/postgresql.conf
echo "executor_batch_rows=0"
pg_ctl -D  $patched_data -l $logdir/pg_master_log start

for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
	psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
	psql -c "vacuum bar_$i" > /dev/null 2>&1
	printf "%s\t" "$i"
	echo "select * from bar_$i where o > 0 limit 1 offset $i" > /tmp/bar_limit.sql
	pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done

pg_ctl -D  $patched_data -l $logdir/pg_master_log stop

which postgres
echo "executor_batch_rows=64" >> $patched_data/postgresql.conf
echo "executor_batch_rows=64"
pg_ctl -D  $patched_data -l $logdir/pg_master_log start

for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
	psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
	psql -c "vacuum bar_$i" > /dev/null 2>&1
	printf "%s\t" "$i"
	echo "select * from bar_$i where o > 0 limit 1 offset $i" > /tmp/bar_limit.sql
	pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done

pg_ctl -D  $patched_data -l $logdir/pg_master_log stop

  [text/x-sh] bar_limit_where_a.sh (1.7K, 8-bar_limit_where_a.sh)
  download | inline:
home=$HOME
master=$home/pg/install/master-opt/bin
patched=$home/pg/install/patched-opt/bin
master_data=$home/pg/data/master
patched_data=$home/pg/data/patched
logdir=$home/pg/log

# master
export PATH=$master:$PATH
which postgres
pg_ctl -D  $master_data -l $logdir/pg_master_log start

for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
	psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
	psql -c "vacuum bar_$i" > /dev/null 2>&1
	printf "%s\t" "$i"
	echo "select * from bar_$i where a > 0 limit 1 offset $i" > /tmp/bar_limit.sql
	pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done

pg_ctl -D  $master_data -l $logdir/pg_master_log stop

export PATH=$patched:$PATH;
which postgres
echo "executor_batch_rows=0" >> $patched_data/postgresql.conf
echo "executor_batch_rows=0";
pg_ctl -D  $patched_data -l $logdir/pg_master_log start

for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
	psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
	psql -c "vacuum bar_$i" > /dev/null 2>&1
	printf "%s\t" "$i"
	echo "select * from bar_$i where a > 0 limit 1 offset $i" > /tmp/bar_limit.sql
	pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done

pg_ctl -D  $patched_data -l $logdir/pg_master_log stop

which postgres
echo "executor_batch_rows=64" >> $patched_data/postgresql.conf
echo "executor_batch_rows=64"
pg_ctl -D  $patched_data -l $logdir/pg_master_log start

for i in 1000000 2000000 3000000 4000000 5000000 10000000; do
	psql -c "select pg_prewarm('bar_$i')" > /dev/null 2>&1
	psql -c "vacuum bar_$i" > /dev/null 2>&1
	printf "%s\t" "$i"
	echo "select * from bar_$i where a > 0 limit 1 offset $i" > /tmp/bar_limit.sql
	pgbench -n -T5 -f /tmp/bar_limit.sql | grep latency
done

pg_ctl -D  $patched_data -l $logdir/pg_master_log stop

^ permalink  raw  reply  [nested|flat] 22+ messages in thread

* Re: Batching in executor
@ 2025-12-20 14:36  Amit Langote <[email protected]>
  parent: Daniil Davydov <[email protected]>
  0 siblings, 0 replies; 22+ messages in thread

From: Amit Langote @ 2025-12-20 14:36 UTC (permalink / raw)
  To: Daniil Davydov <[email protected]>; +Cc: Tomas Vondra <[email protected]>; pgsql-hackers

Hi Daniil,


On Thu, Oct 30, 2025 at 9:12 PM Daniil Davydov <[email protected]> wrote:
> On Wed, Oct 29, 2025 at 9:23 AM Amit Langote <[email protected]> wrote:
> >
> > Hi Daniil,
> >
> > On Tue, Oct 28, 2025 at 11:32 PM Daniil Davydov <[email protected]> wrote:
> > >
> > > Hi,
> > >
> > > As far as I understand, this work partially overlaps with what we did in the
> > > thread [1] (in short - we introduce support for batching within the ModifyTable
> > > node). Am I correct?
> >
> > There might be some relation, but not much overlap. The thread you
> > mention seems to focus on batching in the write path (for INSERT,
> > etc.), while this work targets batching in the read path via Table AM
> > scan callbacks. I think they can be developed independently, though
> > I'm happy to take a look.
>
> Oh, I got it. Thanks!
>
> I looked at 0001-0003 patches and got some comments :
> 1)
> I noticed that some Nodes may set SO_ALLOW_PAGEMODE flag to 'false'
> during ExecReScan. heap_getnextslot works carefully with it - checks whether
> pagemode is allowed at every call. If not - it just uses tuple-at-a-time mode.
> At the same time, heap_getnextbatch always expects that pagemode is enabled.
> I didn't find any code paths which can lead to an assertion [1] fail.
> If such a code
> path is unreachable under any circumstances, maybe we should add a comment
> why?
>
> 2)
> heapgettup_pagemode_batch : Do we really need to compute lineindex variable
> in this way? :
> ***
>             lineindex = scan->rs_cindex + dir;
>             if (ScanDirectionIsForward(dir))
>                 linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
>                     (scan->rs_ntuples - lineindex) : 0;
> ***
>
> As far as I understand, this is enough :
> ***
>         lineindex = scan->rs_cindex + dir;
>         if (ScanDirectionIsForward(dir))
>             linesleft = scan->rs_ntuples - lineindex;
> ***
>
> 3)
> Is this code inside heapgettup_pagemode_batch necessary? :
> ***
> ScanDirectionIsForward(dir) ? 0 : 0
> ***
>
> 4)
> heapgettup_pagemode has this change :
> HeapTuple    tuple = &(scan->rs_ctup) ---> HeapTuple tuple = &scan->rs_ctup
> I guess it was changed accidentally.
>
> 5)
> I apologize for the tediousness, but these braces are not in the
> postgres style :
> ***
> static const TupleBatchOps TupleBatchHeapOps = {
>     .materialize_all = heap_materialize_batch_all
> };
> ***
>
> [1] heap_getnextbatch : Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE)

Thanks for the review and apologies for getting to it so late.

I think I've addressed your comments in v4 that I just posted.

-- 
Thanks, Amit Langote





^ permalink  raw  reply  [nested|flat] 22+ messages in thread

* Re: Batching in executor
@ 2025-12-22 11:45  cca5507 <[email protected]>
  parent: Amit Langote <[email protected]>
  0 siblings, 1 reply; 22+ messages in thread

From: cca5507 @ 2025-12-22 11:45 UTC (permalink / raw)
  To: Amit Langote <[email protected]>; +Cc: pgsql-hackers; Tomas Vondra <[email protected]>

Hi,

Some comments for v4:

0001
====

1) table_scan_getnextbatch()
"Assert(dir == ForwardScanDirection);" -> "Assert(ScanDirectionIsForward(dir));"

2) heapgettup_pagemode_batch()
"TupleDesc	tupdesc = key ? RelationGetDescr(rel) : NULL;" -> "TupleDesc	tupdesc = RelationGetDescr(rel);"
I think the latter is enough.

3) heapgettup_pagemode_batch()
```
			/* Are there more visible tuples left on this page? */
			lineindex = scan->rs_cindex + dir;
			linesleft = (lineindex <= (uint32) scan->rs_ntuples) ?
				(scan->rs_ntuples - lineindex) : 0;
			if (linesleft > 0)
				break;	/* continue on this page */
```
"scan->rs_ntuples" is already a uint32, so the cast looks unnecessary.

4) heapgettup_pagemode_batch()
```
		Assert(lineindex <= (uint32) scan->rs_ntuples);
```
"scan->rs_ntuples" is already a uint32. I also think this should be "Assert(lineindex < scan->rs_ntuples);"; the related
assert in heapgettup_pagemode() looks wrong as well.

5) heapgettup_pagemode_batch()
If the scan key filters out all tuples on a page, we may return 0 before reaching the end of scan, right?

6) heap_begin_batch()
```
	hb = palloc(sizeof(HeapBatch));
	hb->tupdata = palloc(sizeof(HeapTupleData) * maxitems);
```
Can we use a single palloc() here for better cache locality?

0002
====

1) heap_materialize_batch_all()
```
		slot->base.tts_flags &= ~(TTS_FLAG_EMPTY | TTS_FLAG_SHOULDFREE);
		slot->base.tts_tid = tuple->t_self;
		slot->base.tts_tableOid = tuple->t_tableOid;
		slot->base.tts_flags &= ~(TTS_FLAG_SHOULDFREE | TTS_FLAG_EMPTY);
```
Isn't the second "slot->base.tts_flags &=" redundant?

2) TupleBatchCreate()
```
	inslots = palloc(sizeof(TupleTableSlot *) * capacity);
	outslots = palloc(sizeof(TupleTableSlot *) * capacity);
	for (int i = 0; i < capacity; i++)
		inslots[i] = MakeSingleTupleTableSlot(scandesc, &TTSOpsHeapTuple);

	b = (TupleBatch *) palloc(sizeof(TupleBatch));
```
Can we use a single palloc() here for better cache locality?

3) TupleBatchCreate()
```
	b->outslots = outslots;
	b->activeslots = NULL;
	b->outslots = outslots;
```
Isn't the second "b->outslots = outslots;" redundant?

4) TupleBatchReset()
```
	if (b == NULL)
		return;
```
This can never happen; convert it to an assert or just delete it?

5) SeqNextBatch()
"Assert(direction == ForwardScanDirection);" -> "Assert(ScanDirectionIsForward(direction));"
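
For the two "one palloc()" points above (0001 item 6 and 0002 item 2),
the layout could be sketched roughly as follows. This is only an
illustration under stated assumptions: DemoTuple, DemoBatch and
demo_batch_create() are hypothetical names, and plain malloc() stands in
for palloc() so that the snippet compiles outside the PostgreSQL tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for HeapTupleData; not the real struct. */
typedef struct DemoTuple
{
	unsigned	len;
	void	   *data;
} DemoTuple;

/*
 * Header and tuple array share one allocation: the array is a C99
 * flexible array member placed directly after the header fields.
 */
typedef struct DemoBatch
{
	int			maxitems;
	int			nitems;
	DemoTuple	tupdata[];		/* maxitems entries follow the header */
} DemoBatch;

/* Allocate header plus maxitems tuple entries in a single chunk. */
static DemoBatch *
demo_batch_create(int maxitems)
{
	DemoBatch  *hb;

	assert(maxitems > 0);
	hb = malloc(offsetof(DemoBatch, tupdata) +
				sizeof(DemoTuple) * (size_t) maxitems);
	if (hb == NULL)
		return NULL;			/* palloc() would ereport() instead */
	hb->maxitems = maxitems;
	hb->nitems = 0;
	return hb;
}
```

With this layout the header and the first array entries tend to land
on neighbouring cache lines, and a single free() (or pfree()) releases
everything.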

--
Regards,
ChangAo Chen


^ permalink  raw  reply  [nested|flat] 22+ messages in thread

* Re: Batching in executor
@ 2026-03-24 00:59  Amit Langote <[email protected]>
  parent: cca5507 <[email protected]>
  0 siblings, 1 reply; 22+ messages in thread

From: Amit Langote @ 2026-03-24 00:59 UTC (permalink / raw)
  To: Junwang Zhao <[email protected]>; +Cc: cca5507 <[email protected]>; Daniil Davydov <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>

Hi,

Here is a significantly revised version of the patch series. A lot has
changed since the January submission, so I want to summarize the
design changes before getting into the patches.  I think it addresses
the points raised in the two reviews that landed since v5, though a
number of them may have become moot after my rewrite of the relevant
portions (thanks Junwang and ChangAo for the reviews in any case).

At this point it might be better to think of this as targeting v20,
except that if there is review bandwidth in the remaining two weeks
before the v19 feature freeze, the rs_vistuples[] change described
below as a standalone improvement to the existing pagemode scan path
could be considered for v19, though that too is an optimistic
scenario.

It is also worth noting that Andres identified a number of
inefficiencies in the existing scan path in:

Re: unnecessary executor overheads around seqscans
https://postgr.es/m/xzflwwjtwxin3dxziyblrnygy3gfygo5dsuw6ltcoha73ecmnf%40nh6nonzta7kw

that are worth fixing independently of batching. Some of those fixes
may be better pursued first, both because they benefit all scan paths
and because they would make batching's gains more honest.

Separately, after looking at the previous version, Andres pointed out
offlist two fundamental issues with the patch's design:

* The heapam implementation (in a version of the patch I didn't post
to the thread) duplicated heap_prepare_pagescan() logic in a separate
batch-specific code path, which is not acceptable as changes should
benefit the existing slot interface too.  Code duplication is also bad
for future maintainability. The v5 version of that code is not great
in that respect either; it instead duplicated heapgettup_pagemode() to
slap batching on it.

* Allocating executor_batch_rows slots on the executor side to receive
rows from the AM adds significant overhead for slot initialization and
management, and for non-row-organized AMs that do not produce
individual rows at all, those slots would never be meaningfully
populated.

In any case, he just wasn't a fan of the slot-array approach the
moment I mentioned it. The previous version had two slot arrays,
inslots and outslots, of TTSOpsHeapTuple type (not
TTSOpsBufferHeapTuple because buffer pins were managed by the batch
code, which has its own modularity/correctness issues), populated via
a materialize_all callback. A batch qual evaluator would copy
qualifying tuples into outslots, with an activeslots pointer switching
between the two depending on whether batch qual evaluation was used.

The new design addresses both issues and differs from the previous
version in several other ways:

 * Single slot instead of slot arrays: there is a single
TupleTableSlot, reusing the scan node's ss_ScanTupleSlot whose type
was already determined by the AM via table_slot_callbacks().  The slot
is re-pointed to each HeapTuple in the current buffer page via a new
repoint_slot AM callback, with no materialization or copying.  Tuples
are returned one by one from the executor's perspective, but the AM
serves them in page-sized batches from pre-built HeapTupleData
descriptors in rs_vistuples[], avoiding repeated descent into heapam
per tuple.  This is heapam's implementation of the batch interface;
there is no intention to force other AMs into the same row-oriented
model.

 * Batch qual evaluator not included: with the single-slot model,
quals are evaluated per tuple via the existing ExecQual path after
each repoint_slot call.  A natural next step would be a new opcode
(EEOP) that calls repoint_slot() internally within expression
evaluation, allowing ExecQual to advance through multiple tuples from
the same batch without returning to the scan node each time, with qual
results accumulated in a bitmask in ExprState.  The details of that
will be worked out in a follow-on series.

 * heapgettup_pagemode_batch() gone: patch 0001 (described below) makes
HeapScanDesc store full HeapTupleData entries in rs_vistuples[], which
allows heap_getnextbatch() to simply advance a slice pointer into that
array without any additional copying or re-entering heap code, making
a separate batch-specific scan function unnecessary.

 * TupleBatch renamed to RowBatch: "row batch" is more natural
terminology for this concept and also consistent with how similar
abstractions are named in columnar and OLAP systems.

 * AM callbacks now take RowBatch directly: previously
heap_getnextbatch() returned a void pointer that the executor would
store into RowBatch.am_payload, because only the executor knew the
internals of RowBatch.  Now the AM receives RowBatch directly as a
parameter and can populate it without the executor acting as an
intermediary.  This is also why RowBatch is introduced in its own
patch ahead of the AM API addition, so the struct definition is
available to both sides.
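
To illustrate the single-slot, slice-based model in the list above,
here is a minimal standalone sketch. It is not code from the patch:
ToyScan, ToyBatch, toy_getnextbatch() and toy_repoint_slot() are
hypothetical stand-ins for HeapScanDesc, RowBatch/HeapPageBatch,
heap_getnextbatch() and heap_repoint_slot(), and it models only one
page with no buffer pins or visibility checks.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are the patch's real types. */
typedef struct ToyTuple { int id; } ToyTuple;

typedef struct ToyScan
{
    ToyTuple    vistuples[8];   /* models rs_vistuples[] for one page */
    int         ntuples;        /* models rs_ntuples */
} ToyScan;

typedef struct ToyBatch
{
    const ToyTuple *tuples;     /* start of the current slice */
    int         nrows;          /* tuples in the current slice */
    int         nextitem;       /* next unserved index in vistuples[] */
    int         max_rows;       /* cap on tuples served per call */
    const ToyTuple *slot;       /* the single "slot", re-pointed per tuple */
} ToyBatch;

/* Serve the next slice of the page; returns 0 once the page is consumed. */
static int
toy_getnextbatch(const ToyScan *scan, ToyBatch *b)
{
    int         remaining = scan->ntuples - b->nextitem;

    assert(b->max_rows > 0);
    if (remaining <= 0)
        return 0;
    b->nrows = remaining < b->max_rows ? remaining : b->max_rows;
    b->tuples = &scan->vistuples[b->nextitem];  /* pointer only, no copy */
    b->nextitem += b->nrows;
    return 1;
}

/* Re-point the single slot at tuple idx of the current slice. */
static void
toy_repoint_slot(ToyBatch *b, int idx)
{
    assert(idx >= 0 && idx < b->nrows);
    b->slot = &b->tuples[idx];
}

/* Drain one page through the single slot; returns the sum of ids seen. */
static int
toy_sum_page(const ToyScan *scan, int max_rows, int *ncalls)
{
    ToyBatch    b = {NULL, 0, 0, max_rows, NULL};
    int         sum = 0;

    *ncalls = 0;
    while (toy_getnextbatch(scan, &b))
    {
        (*ncalls)++;
        for (int i = 0; i < b.nrows; i++)
        {
            toy_repoint_slot(&b, i);
            sum += b.slot->id;
        }
    }
    return sum;
}
```

With five visible tuples and max_rows = 2, toy_sum_page() makes three
toy_getnextbatch() calls (slices of 2, 2 and 1). The consumer still
sees one tuple at a time through the single slot.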

Patch 0001 changes rs_vistuples[] to store full HeapTupleData entries
instead of OffsetNumbers, as a standalone improvement to the existing
pagemode scan path. Measured on a pg_prewarm'd (also vacuum freeze'd
in the all-visible case) table with 1M/5M/10M rows:

  query                           all-visible      not-all-visible
  count(*)                        -0.2% to +0.9%   -0.4% to +0.5%
  count(*) WHERE id % 10 = 0     -1.1% to +3.4%   +0.2% to +1.5%
  SELECT * LIMIT 1 OFFSET N      -2.2% to -0.6%   -0.9% to +6.6%
  SELECT * WHERE id%10=0 LIMIT   -0.8% to +3.9%   +0.9% to +9.6%

No significant regression on either page type. The structural
improvement is most visible on not-all-visible pages where
HeapTupleSatisfiesMVCCBatch() already reads every tuple header during
visibility checks, so persisting the result into rs_vistuples[]
eliminates the downstream re-read (in heapgettup_pagemode()) with no
measurable overhead.  That said, these numbers are somewhat noisy on
my machine.  Results on other machines would be welcome.
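
As a rough illustration of what patch 0001 does, the sketch below
models the two steps its commit message describes: compacting the
per-page descriptor array down to visible survivors, and then fetching
a tuple with a plain struct copy. ToyTupleDesc, toy_compact_visible()
and toy_fetch() are hypothetical names used only here; the real code
works on HeapTupleData, batchmvcc.visible[] and rs_vistuples[].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor, loosely modeled on HeapTupleData. */
typedef struct ToyTupleDesc
{
    uint32_t    len;        /* tuple length */
    uint16_t    offnum;     /* line pointer number on the page */
    const char *data;       /* pointer into the (pinned) page */
} ToyTupleDesc;

/*
 * Compact descs[0..ntup-1] in place so that only entries whose
 * visible[i] flag is set remain, preserving their order.  Returns the
 * number of survivors.
 */
static int
toy_compact_visible(ToyTupleDesc *descs, const bool *visible, int ntup)
{
    int         dst = 0;

    for (int i = 0; i < ntup; i++)
    {
        if (visible[i])
            descs[dst++] = descs[i];
    }
    return dst;
}

/* Fetch path: one struct copy, no return to the page's line pointers. */
static ToyTupleDesc
toy_fetch(const ToyTupleDesc *descs, int nvis, int lineindex)
{
    assert(lineindex >= 0 && lineindex < nvis);
    return descs[lineindex];
}
```

Since dst never exceeds i, the in-place compaction cannot overwrite an
entry that has not been examined yet.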

Patches 0002-0005 add the RowBatch infrastructure, the batch AM API
and heapam implementation including seqscan variants that use the new
scan_getnextbatch() API, and EXPLAIN (ANALYZE, BATCHES) support,
respectively. With batching enabled (executor_batch_rows=300,
~MaxHeapTuplesPerPage):

  query                           all-visible    not-all-visible
  count(*)                        +11 to +15%    +9 to +13%
  count(*) WHERE id % 10 = 0     +6 to +11%     +10 to +14%
  SELECT * LIMIT 1 OFFSET N      +16 to +19%    +16 to +22%
  SELECT * WHERE id%10=0 LIMIT   +8 to +10%     +8 to +13%

With executor_batch_rows=0, results are within noise of master across
all query types and sizes, confirming no regression from the
infrastructure changes themselves.  The not-all-visible results tend
to show slightly higher gains than the all-visible case. This is
likely because the existing heapam code is more optimized for the
all-visible path, so the not-all-visible path, which goes through
HeapTupleSatisfiesMVCCBatch() for per-tuple visibility checks, has
more headroom that batching can exploit.

Setting aside the current series for a moment, there are some broader
design questions worth raising while we have attention on this area.
Some of these echo points Tomas raised in his first reply on this
thread. I am reiterating them deliberately: I have either not managed
to fully address them on my own or simply did not need to for the
TAM-to-scan-node batching, and I think they would benefit from wider
input rather than just my own iteration.

We should also start thinking about other ways the executor can
consume batch rows, not always assuming they are presented as
HeapTupleData. For instance, an AM could expose decoded column arrays
directly to operators that can consume them, bypassing slot-based
deform entirely, or a columnar AM could implement scan_getnextbatch by
decoding column strips directly into the batch without going through
per-tuple HeapTupleData at all. Feedback on whether the current
RowBatch design and the choices made in the scan_getnextbatch and
RowBatchOps API make that sort of thing harder than it needs to be
would be appreciated. For example, heapam's implementation of
scan_getnextbatch uses a single TTSOpsBufferHeapTuple slot re-pointed
to HeapTupleData entries one at a time via repoint_slot in
RowBatchHeapOps. That works for heapam, but a columnar AM that fills
column arrays in the batch, as described above, would need no per-row
repoint step at all. Any
adjustments that would make RowBatch more AM-agnostic are worth
discussing now before the design hardens.
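
One way to picture the question is a batch that carries both a per-AM
ops table and an optional decoded column. The sketch below is purely
illustrative and assumes things the series does not define: the col
field, a NULL repoint_slot for AMs that need no per-row step, and all
Toy* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToyBatch ToyBatch;

/* Hypothetical per-AM callbacks, loosely modeled on RowBatchOps. */
typedef struct ToyBatchOps
{
    void        (*repoint_slot) (ToyBatch *b, int idx); /* NULL: not needed */
} ToyBatchOps;

struct ToyBatch
{
    const ToyBatchOps *ops;
    void       *am_payload;     /* AM-private state */
    int         nrows;
    int         slot_value;     /* stands in for the single slot */
    const int  *col;            /* assumed: decoded column, or NULL */
};

/* Row-oriented AM: payload is an array of rows; the slot is re-pointed. */
static void
rowam_repoint_slot(ToyBatch *b, int idx)
{
    const int  *rows = (const int *) b->am_payload;

    assert(idx >= 0 && idx < b->nrows);
    b->slot_value = rows[idx];
}

static const ToyBatchOps RowAmOps = {rowam_repoint_slot};
static const ToyBatchOps ColAmOps = {NULL};

/* Consumer: use the column array if the AM exposed one, else the slot. */
static long
toy_sum(ToyBatch *b)
{
    long        sum = 0;

    if (b->col != NULL)
    {
        for (int i = 0; i < b->nrows; i++)
            sum += b->col[i];
        return sum;
    }
    for (int i = 0; i < b->nrows; i++)
    {
        b->ops->repoint_slot(b, i);
        sum += b->slot_value;
    }
    return sum;
}
```

The consumer gets the same answer either way. What differs is whether
it makes an indirect call per row or runs a tight loop over an array.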

There are also broader open questions about how far the batch model
can extend beyond the scan node. Qual pushdown into the AM has been
discussed in nearby threads and would be one way to allow expression
evaluation to happen before data reaches the executor proper, though
that is a separate effort. For the purposes of this series, expression
evaluation still happens in the executor after scan_getnextbatch
returns. If the scan node does not project, the buffer heap slot is
passed directly to the parent node, which calls slot callbacks to
deform as needed. But once a node above projects, aggregates, or
joins, the notion of a page-sized batch from a single AM loses its
meaning and virtual slots take over. Whether RowBatch is usable or
meaningful beyond the scan/TAM boundary in any form, and whether the
core executor will ever have non-HeapTupleData batch consumption paths
or leave that entirely to extensions, are open questions worth
discussing.

For RowBatch to eventually play the role that TupleTableSlot plays for
row-at-a-time execution, something inside it would need to serve as
the common currency for batch data, analogous to TupleTableSlot's
datum/isnull arrays. Column arrays are the obvious direction, but even
that leaves open the question of representation. PostgreSQL's Datum is
a pointer-sized abstraction that boxes everything, whereas vectorized
systems use typed packed arrays of native types with validity
bitmasks, which is a significant part of why tight vectorized loops
are fast there. Whether column arrays of Datum would be good enough,
or whether going further toward typed packed arrays would be necessary
to get meaningful vectorization, is a deeper design question that this
series deliberately does not try to answer.
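
For concreteness, the two representations being contrasted might look
roughly like this for an int4 column. This is only a sketch: ToyDatum
is a hypothetical stand-in for Datum, the bitmask convention (bit set
means not null) is an assumption, and the values are kept non-negative
so the narrowing cast is well defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t ToyDatum;     /* pointer-sized, like Datum (hypothetical) */

/* Boxed form: one pointer-sized value plus one bool per row. */
static int64_t
sum_boxed(const ToyDatum *values, const bool *isnull, int nrows)
{
    int64_t     sum = 0;

    for (int i = 0; i < nrows; i++)
    {
        if (!isnull[i])
            sum += (int32_t) values[i];
    }
    return sum;
}

/* Packed form: native int32 values plus a validity bitmask. */
static int64_t
sum_packed(const int32_t *values, const uint8_t *valid, int nrows)
{
    int64_t     sum = 0;

    for (int i = 0; i < nrows; i++)
    {
        /* bit i of the mask is set when row i is not null */
        if (valid[i >> 3] & (1u << (i & 7)))
            sum += values[i];
    }
    return sum;
}
```

On a typical 64-bit build the packed form stores 4 bytes per value
plus one bit of validity, against 8 bytes plus a bool in the boxed
form.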

Even though the focus is on getting batching working at the scan/TAM
boundary first, thoughts on any of these points would be welcome.

--
Thanks, Amit Langote


Attachments:

  [application/x-patch] v6-0003-Add-batch-table-AM-API-and-heapam-implementation.patch (19.0K, 2-v6-0003-Add-batch-table-AM-API-and-heapam-implementation.patch)
  download | inline diff:
From a095d26e1b5a361a7d42300e5364da948496f2ba Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 23 Mar 2026 18:21:47 +0900
Subject: [PATCH v6 3/5] Add batch table AM API and heapam implementation

Introduce table AM callbacks for batched tuple fetching:
scan_begin_batch, scan_getnextbatch, scan_reset_batch, and
scan_end_batch.  AMs implement all four or none; checked by
table_supports_batching().

scan_reset_batch releases held resources (e.g. buffer pins)
without freeing, allowing reuse across rescans.

Provide the heapam implementation.  HeapPageBatch (stored in
RowBatch.am_payload) is a thin slice descriptor over the scan's
rs_vistuples[] array, which was introduced in the previous commit.
Rather than owning a copy of tuple headers, HeapPageBatch holds a
pointer into scan->rs_vistuples[] for the current slice and a buffer
pin for the current page.

heap_getnextbatch() calls heap_prepare_pagescan() to populate
rs_vistuples[] for each new page, then re-points hb->tuples to the
next slice of rs_vistuples[] on each call.  If the page has more
tuples than the executor's max_rows, subsequent calls return the
next slice without re-entering page preparation.  The buffer pin is
held until the page is fully consumed.

scan_begin_batch creates a single TupleTableSlot with
TTSOpsBufferHeapTuple ops.  heap_repoint_slot() re-points this slot
to each tuple in turn via ExecStoreBufferHeapTuple().  Consumers
that need to retain the slot across calls rely on the normal slot
materialization contract.

Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
 src/backend/access/heap/heapam.c         | 229 ++++++++++++++++++++++-
 src/backend/access/heap/heapam_handler.c |   8 +-
 src/include/access/heapam.h              |  33 ++++
 src/include/access/tableam.h             | 136 ++++++++++++++
 src/include/pgstat.h                     |   4 +-
 5 files changed, 403 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index c6d0aacc5c9..e70c0ccbe82 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -43,6 +43,7 @@
 #include "catalog/pg_database.h"
 #include "catalog/pg_database_d.h"
 #include "commands/vacuum.h"
+#include "executor/execRowBatch.h"
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -109,6 +110,7 @@ static int	bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
 static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
 static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
 										bool *copy);
+static void heap_repoint_slot(RowBatch *b, int idx);
 
 
 /*
@@ -1213,7 +1215,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_cbuf = InvalidBuffer;
 
 	/*
-	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
+	 * Disable page-at-a-time mode if the snapshot does not allow it.
 	 */
 	if (!(snapshot && IsMVCCSnapshot(snapshot)))
 		scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;
@@ -1463,7 +1465,7 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 	 * the proper return buffer and return the tuple.
 	 */
 
-	pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+	pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
 
 	return &scan->rs_ctup;
 }
@@ -1491,13 +1493,232 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
 	 * the proper return buffer and return the tuple.
 	 */
 
-	pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+	pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
 
 	ExecStoreBufferHeapTuple(&scan->rs_ctup, slot,
 							 scan->rs_cbuf);
 	return true;
 }
 
+/*---------- Batching support -----------*/
+
+static const RowBatchOps RowBatchHeapOps =
+{
+	.repoint_slot = heap_repoint_slot
+};
+
+/*
+ * heap_batch_feasible
+ *		Batching requires a MVCC snapshot since it relies on
+ *		page-at-a-time mode, which heap_beginscan() disables for
+ *		non-MVCC snapshots.
+ */
+bool
+heap_batch_feasible(Relation relation, Snapshot snapshot)
+{
+	return snapshot && IsMVCCSnapshot(snapshot);
+}
+
+/*
+ * heap_begin_batch
+ *		Initialize AM-side batch state for a heap scan.
+ *
+ * Allocates a HeapPageBatch, which acts as a thin slice descriptor over
+ * the scan's rs_vistuples[] array.  Unlike the previous version there is
+ * no separate tuple header storage in HeapPageBatch itself; rs_vistuples[]
+ * in HeapScanDescData (populated by page_collect_tuples() via
+ * heap_prepare_pagescan()) serves as the page-level buffer.  HeapPageBatch
+ * holds a pointer into that array for the current slice and the buffer pin
+ * for the current page.
+ *
+ * b->slot must be a TTSOpsBufferHeapTuple slot.
+ */
+void
+heap_begin_batch(TableScanDesc sscan, RowBatch *b)
+{
+	HeapPageBatch  *hb;
+
+	/* Batch path relies on executor-level qual eval, not AM scan keys */
+	Assert(sscan->rs_nkeys == 0);
+	Assert(TTS_IS_BUFFERTUPLE(b->slot));
+
+	hb = palloc(sizeof(HeapPageBatch));
+	hb->tuples = NULL;
+	hb->ntuples = 0;
+	hb->nextitem = 0;
+	hb->buf = InvalidBuffer;
+
+	b->am_payload = hb;
+	b->ops = &RowBatchHeapOps;
+}
+
+/*
+ * heap_reset_batch
+ *		Release pin and reset for rescan, keeping allocations.
+ */
+void
+heap_reset_batch(TableScanDesc sscan, RowBatch *b)
+{
+	HeapPageBatch  *hb = (HeapPageBatch *) b->am_payload;
+
+	Assert(hb != NULL);
+	if (BufferIsValid(hb->buf))
+	{
+		ReleaseBuffer(hb->buf);
+		hb->buf = InvalidBuffer;
+	}
+	hb->ntuples = 0;
+	hb->nextitem = 0;
+}
+
+/*
+ * heap_end_batch
+ *		Release all batch resources.
+ */
+void
+heap_end_batch(TableScanDesc sscan, RowBatch *b)
+{
+	HeapPageBatch  *hb = (HeapPageBatch *) b->am_payload;
+
+	if (BufferIsValid(hb->buf))
+		ReleaseBuffer(hb->buf);
+
+	pfree(hb);
+	b->am_payload = NULL;
+}
+
+/*
+ * heap_getnextbatch
+ *		Fetch the next slice of visible tuples from a heap scan.
+ *
+ * Serves slices from the current page's rs_vistuples[] array.  If the
+ * current page has remaining tuples, sets hb->tuples to point at the next
+ * slice without re-entering the page scan.  If the page is exhausted,
+ * advances to the next page via heap_fetch_next_buffer(), prepares it
+ * with heap_prepare_pagescan(), and serves the first slice from it.
+ *
+ * hb->tuples points directly into scan->rs_vistuples[]; the entries remain
+ * valid as long as hb->buf (the page's buffer pin) is held.  The pin is
+ * released at the top of the next call once the page is fully consumed.
+ *
+ * Each call returns at most b->max_rows tuples.
+ *
+ * Returns true if tuples were fetched, false at end of scan.
+ */
+bool
+heap_getnextbatch(TableScanDesc sscan, RowBatch *b, ScanDirection dir)
+{
+	HeapScanDesc	scan = (HeapScanDesc) sscan;
+	HeapPageBatch  *hb = (HeapPageBatch *) b->am_payload;
+	int				remaining;
+	int				nserve;
+
+	Assert(ScanDirectionIsForward(dir));
+	Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+
+	/*
+	 * Try to serve from the current page first.  No page advance, no buffer
+	 * management, no re-entry into heap code.
+	 */
+	remaining = scan->rs_ntuples - hb->nextitem;
+	if (remaining > 0)
+	{
+		nserve = Min(remaining, b->max_rows);
+
+		hb->tuples = &scan->rs_vistuples[hb->nextitem];
+		hb->ntuples = nserve;
+		hb->nextitem += nserve;
+
+		b->nrows = nserve;
+		b->pos = 0;
+
+		pgstat_count_heap_getnext(sscan->rs_rd, nserve);
+		return true;
+	}
+
+	/*
+	 * Current page exhausted.  Advance to the next page with visible tuples.
+	 */
+	for (;;)
+	{
+		/*
+		 * Release the previous page's pin.  The page is fully consumed at
+		 * this point -- all slices have been served.
+		 */
+		if (BufferIsValid(hb->buf))
+		{
+			ReleaseBuffer(hb->buf);
+			hb->buf = InvalidBuffer;
+		}
+
+		heap_fetch_next_buffer(scan, dir);
+
+		if (!BufferIsValid(scan->rs_cbuf))
+		{
+			/* End of scan */
+			scan->rs_cblock = InvalidBlockNumber;
+			scan->rs_prefetch_block = InvalidBlockNumber;
+			scan->rs_inited = false;
+			b->nrows = 0;
+			return false;
+		}
+
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+
+		/*
+		 * Prepare the page: prune, run visibility checks, and populate
+		 * scan->rs_vistuples[0..rs_ntuples-1] via page_collect_tuples().
+		 */
+		heap_prepare_pagescan(sscan);
+
+		if (scan->rs_ntuples > 0)
+		{
+			/*
+			 * Pin the page so tuple data stays valid while the executor
+			 * processes slices.  Released at the top of the next call
+			 * once the page is fully consumed.
+			 */
+			IncrBufferRefCount(scan->rs_cbuf);
+			hb->buf = scan->rs_cbuf;
+
+			nserve = Min(scan->rs_ntuples, b->max_rows);
+
+			hb->tuples = &scan->rs_vistuples[0];
+			hb->ntuples = nserve;
+			hb->nextitem = nserve;
+
+			b->nrows = nserve;
+			b->pos = 0;
+
+			pgstat_count_heap_getnext(sscan->rs_rd, nserve);
+			return true;
+		}
+
+		/* Empty page (all dead/invisible tuples), try next */
+	}
+}
+
+/*
+ * heap_repoint_slot
+ *		Re-point the batch's single slot to the tuple at index idx.
+ *
+ * Called by RowBatchGetNextSlot() for each tuple served to the parent
+ * node.  hb->tuples[idx] was populated by page_collect_tuples() via
+ * heap_prepare_pagescan() and remains valid as long as hb->buf is pinned.
+ */
+static void
+heap_repoint_slot(RowBatch *b, int idx)
+{
+	HeapPageBatch		*hb = (HeapPageBatch *) b->am_payload;
+
+	Assert(idx >= 0 && idx < hb->ntuples);
+	Assert(TTS_IS_BUFFERTUPLE(b->slot));
+
+	ExecStoreBufferHeapTuple(&hb->tuples[idx], b->slot, hb->buf);
+}
+
+/*----- End of batching support -----*/
+
 void
 heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 				  ItemPointer maxtid)
@@ -1639,7 +1860,7 @@ heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
 	 * if we get here it means we have a new current scan tuple, so point to
 	 * the proper return buffer and return the tuple.
 	 */
-	pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+	pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
 
 	ExecStoreBufferHeapTuple(&scan->rs_ctup, slot, scan->rs_cbuf);
 	return true;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 2fd120028bb..8124d573ac3 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2348,7 +2348,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
 			ExecStoreBufferHeapTuple(tuple, slot, hscan->rs_cbuf);
 
 			/* Count successfully-fetched tuples as heap fetches */
-			pgstat_count_heap_getnext(scan->rs_rd);
+			pgstat_count_heap_getnext(scan->rs_rd, 1);
 
 			return true;
 		}
@@ -2637,6 +2637,12 @@ static const TableAmRoutine heapam_methods = {
 	.scan_rescan = heap_rescan,
 	.scan_getnextslot = heap_getnextslot,
 
+	.scan_batch_feasible = heap_batch_feasible,
+	.scan_begin_batch = heap_begin_batch,
+	.scan_getnextbatch = heap_getnextbatch,
+	.scan_end_batch = heap_end_batch,
+	.scan_reset_batch = heap_reset_batch,
+
 	.scan_set_tidrange = heap_set_tidrange,
 	.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 09b9566d0ac..0783fa13c4c 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -107,6 +107,32 @@ typedef struct HeapScanDescData
 } HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
+/*
+ * HeapPageBatch -- heapam-private page-level batch state.
+ *
+ * Thin slice descriptor over the scan's rs_vistuples[] array.  Rather
+ * than owning a copy of tuple headers, HeapPageBatch holds a pointer
+ * into scan->rs_vistuples[] for the current slice, which was populated
+ * by page_collect_tuples() during heap_prepare_pagescan().
+ *
+ * The executor consumes tuples in slices.  Each heap_getnextbatch call
+ * re-points tuples to the next slice and advances nextitem, serving up
+ * to RowBatch.max_rows tuples from the current page before advancing
+ * to the next.
+ *
+ * buf holds the pin for the current page.  tuple data referenced via
+ * tuples remains valid as long as buf is pinned.
+ *
+ * Stored in RowBatch.am_payload.
+ */
+typedef struct HeapPageBatch
+{
+	HeapTupleData  *tuples;		/* start of current slice in scan->rs_vistuples[] */
+	int				ntuples;	/* tuples in current slice */
+	int				nextitem;	/* next unserved tuple index in rs_vistuples[] */
+	Buffer			buf;		/* pinned buffer for current page */
+} HeapPageBatch;
+
 typedef struct BitmapHeapScanDescData
 {
 	HeapScanDescData rs_heap_base;
@@ -362,6 +388,13 @@ extern void heap_endscan(TableScanDesc sscan);
 extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
 extern bool heap_getnextslot(TableScanDesc sscan,
 							 ScanDirection direction, TupleTableSlot *slot);
+
+extern bool heap_batch_feasible(Relation relation, Snapshot snapshot);
+extern void heap_begin_batch(TableScanDesc sscan, RowBatch *batch);
+extern bool heap_getnextbatch(TableScanDesc sscan, RowBatch *batch, ScanDirection dir);
+extern void heap_end_batch(TableScanDesc sscan, RowBatch *batch);
+extern void heap_reset_batch(TableScanDesc sscan, RowBatch *batch);
+
 extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 							  ItemPointer maxtid);
 extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 06084752245..a72be111c26 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -275,6 +275,8 @@ typedef void (*IndexBuildCallback) (Relation index,
 									bool tupleIsAlive,
 									void *state);
 
+typedef struct RowBatch RowBatch;
+
 /*
  * API struct for a table AM.  Note this must be allocated in a
  * server-lifetime manner, typically as a static const struct, which then gets
@@ -351,6 +353,56 @@ typedef struct TableAmRoutine
 									 ScanDirection direction,
 									 TupleTableSlot *slot);
 
+	/* ------------------------------------------------------------------------
+	 * Batched scan support
+	 * ------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Returns true if the AM can support batching for a scan with the
+	 * given snapshot.  Called at plan init time before the scan descriptor
+	 * exists.  AMs that have no snapshot-based restrictions can omit this
+	 * callback, in which case batching is considered feasible.
+	 */
+	bool		(*scan_batch_feasible)(Relation relation, Snapshot snapshot);
+
+	/*
+	 * Initialize AM-owned batch state for a scan.  Called once before
+	 * the first scan_getnextbatch call.  The AM allocates whatever
+	 * private state it needs and stores it in b->am_payload.  b->slot
+	 * is the scan node's ss_ScanTupleSlot, whose type was already
+	 * determined by the AM via table_slot_callbacks().  The AM's
+	 * repoint_slot callback re-points it to each tuple in the batch
+	 * in turn.  Future interfaces may allow the AM to expose batch
+	 * data in other forms without going through a slot.
+	 */
+	void		(*scan_begin_batch)(TableScanDesc sscan, RowBatch *b);
+
+	/*
+	 * Fetch the next batch of tuples from the scan into b.  Sets b->nrows
+	 * to the number of tuples available and resets b->pos to 0.  Returns
+	 * true if any tuples were fetched, false at end of scan.  The caller
+	 * advances through the batch via RowBatchGetNextSlot(), which calls
+	 * ops->repoint_slot for each position up to b->nrows.
+	 */
+	bool		(*scan_getnextbatch)(TableScanDesc sscan, RowBatch *b,
+									 ScanDirection dir);
+
+	/*
+	 * Release all AM-owned batch resources, including any buffer pins
+	 * held in am_payload.  Called when the scan node is shut down.
+	 * After this call b->am_payload must not be used.
+	 */
+	void		(*scan_end_batch)(TableScanDesc sscan, RowBatch *b);
+
+	/*
+	 * Reset batch state for rescan.  Release any held resources (e.g.
+	 * buffer pins) and reset counts, but keep the allocation so the
+	 * next getnextbatch call can reuse it without re-entering
+	 * begin_batch.
+	 */
+	void		(*scan_reset_batch)(TableScanDesc sscan, RowBatch *b);
+
 	/*-----------
 	 * Optional functions to provide scanning for ranges of ItemPointers.
 	 * Implementations must either provide both of these functions, or neither
@@ -1047,6 +1099,90 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
+/*
+ * table_supports_batching
+ *		Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation, Snapshot snapshot)
+{
+	const TableAmRoutine *tam = relation->rd_tableam;
+
+	if (tam->scan_getnextbatch == NULL)
+		return false;
+
+	Assert(tam->scan_begin_batch != NULL);
+	Assert(tam->scan_reset_batch != NULL);
+	Assert(tam->scan_end_batch != NULL);
+
+	/*
+	 * Optional: AM may restrict batching based on snapshot or other conditions.
+	 */
+	if (tam->scan_batch_feasible != NULL &&
+		!tam->scan_batch_feasible(relation, snapshot))
+		return false;
+
+	return true;
+}
+
+/*
+ * table_scan_begin_batch
+ *		Allocate AM-owned batch payload in the RowBatch
+ */
+static inline void
+table_scan_begin_batch(TableScanDesc sscan, RowBatch *b)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	Assert(tam->scan_begin_batch != NULL);
+
+	tam->scan_begin_batch(sscan, b);
+}
+
+/*
+ * table_scan_getnextbatch
+ *		Fetch the next batch of tuples from the AM.  Returns true if tuples
+ *		were fetched, false at end of scan.  Only forward scans are supported.
+ */
+static inline bool
+table_scan_getnextbatch(TableScanDesc sscan, RowBatch *b, ScanDirection dir)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	Assert(ScanDirectionIsForward(dir));
+	Assert(tam->scan_getnextbatch != NULL);
+
+	return tam->scan_getnextbatch(sscan, b, dir);
+}
+
+/*
+ * table_scan_end_batch
+ *		Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, RowBatch *b)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	Assert(tam->scan_end_batch != NULL);
+
+	tam->scan_end_batch(sscan, b);
+}
+
+/*
+ * table_scan_reset_batch
+ *		Reset AM-owned batch state for rescan without freeing.
+ */
+static inline void
+table_scan_reset_batch(TableScanDesc sscan, RowBatch *b)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	Assert(tam->scan_reset_batch != NULL);
+
+	tam->scan_reset_batch(sscan, b);
+}
+
 /* ----------------------------------------------------------------------------
  * TID Range scanning related functions.
  * ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 216b93492ba..0344c4e88c3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -695,10 +695,10 @@ extern void pgstat_report_analyze(Relation rel,
 		if (pgstat_should_count_relation(rel))						\
 			(rel)->pgstat_info->counts.numscans++;					\
 	} while (0)
-#define pgstat_count_heap_getnext(rel)								\
+#define pgstat_count_heap_getnext(rel, n)							\
 	do {															\
 		if (pgstat_should_count_relation(rel))						\
-			(rel)->pgstat_info->counts.tuples_returned++;			\
+			(rel)->pgstat_info->counts.tuples_returned += (n);		\
 	} while (0)
 #define pgstat_count_heap_fetch(rel)								\
 	do {															\
-- 
2.47.3



  [application/x-patch] v6-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch (12.8K, 3-v6-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch)
  download | inline diff:
From d7e8f76144cb27e761e2d4bc9c687dd0a2de203e Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 12 Mar 2026 09:18:04 +0900
Subject: [PATCH v6 1/5] heapam: store full HeapTupleData in rs_vistuples[] for
 pagemode scans

page_collect_tuples() builds full HeapTupleData headers for every
visible tuple on a page -- t_data, t_len, t_self, t_tableOid -- but
previously discarded them immediately after writing just the OffsetNumber
of each survivor into rs_vistuples[].  heapgettup_pagemode() then
re-derived those same values on every call from the saved OffsetNumber
via PageGetItemId() and PageGetItem().

Change rs_vistuples[] element type from OffsetNumber to HeapTupleData
and populate it inside page_collect_tuples() while lpp, lineoff, page,
block, and relid are already in scope, so no additional page reads are
needed.  For the all_visible path (the common case on a primary not
under active modification) the write piggy-backs on the existing
per-lineoff loop.  For the !all_visible path, HeapTupleData entries are
written during the visibility loop and compacted to visible survivors
afterwards using batchmvcc.visible[], avoiding a return to pd_linp[] via
PageGetItemId().

With rs_vistuples[] populated, heapgettup_pagemode() replaces the
per-tuple PageGetItemId/PageGetItem calls with a single struct copy:

    *tuple = scan->rs_vistuples[lineindex];

The stack-local HeapTupleData array in BatchMVCCState is eliminated by
passing rs_vistuples[] directly to HeapTupleSatisfiesMVCCBatch(),
saving MaxHeapTuplesPerPage * 24 bytes of stack per page_collect_tuples()
call.  HeapTupleSatisfiesMVCCBatch() loses its vistuples_dense parameter
since compaction is now handled by the caller.

t_tableOid is pre-initialized for all rs_vistuples[] entries at scan
start in heap_beginscan(), eliminating a store per visible tuple from the
fill loop.  The raw ItemId word is read once per tuple with lp_off and
lp_len extracted via mask and shift rather than calling ItemIdGetOffset()
and ItemIdGetLength() separately, avoiding a potential second load from
the same address in the inner loop.

Having pre-built HeapTupleData headers available at the scan descriptor
level also lays groundwork for a batched tuple interface, where an AM
can serve multiple tuples per call without repeating the line pointer
traversal.

Suggested-by: Andres Freund <[email protected]>
---
 src/backend/access/heap/heapam.c            | 73 ++++++++++++---------
 src/backend/access/heap/heapam_handler.c    | 19 ++----
 src/backend/access/heap/heapam_visibility.c | 21 +++---
 src/include/access/heapam.h                 |  5 +-
 4 files changed, 58 insertions(+), 60 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e5bd062de77..c6d0aacc5c9 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -524,7 +524,6 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
 					BlockNumber block, int lines,
 					bool all_visible, bool check_serializable)
 {
-	Oid			relid = RelationGetRelid(scan->rs_base.rs_rd);
 	int			ntup = 0;
 	int			nvis = 0;
 	BatchMVCCState batchmvcc;
@@ -536,7 +535,7 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
 	for (OffsetNumber lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
 	{
 		ItemId		lpp = PageGetItemId(page, lineoff);
-		HeapTuple	tup;
+		HeapTuple   tup = &scan->rs_vistuples[ntup];
 
 		if (unlikely(!ItemIdIsNormal(lpp)))
 			continue;
@@ -549,25 +548,33 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
 		 */
 		if (!all_visible || check_serializable)
 		{
-			tup = &batchmvcc.tuples[ntup];
+			uint32  lp_val = *(uint32 *) lpp;
 
-			tup->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
-			tup->t_len = ItemIdGetLength(lpp);
-			tup->t_tableOid = relid;
+			tup->t_data = (HeapTupleHeader) ((char *) page + (lp_val & 0x7fff));
+			tup->t_len  = lp_val >> 17;
+			Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
 			ItemPointerSet(&(tup->t_self), block, lineoff);
 		}
 
-		/*
-		 * If the page is all visible, these fields otherwise won't be
-		 * populated in loop below.
-		 */
 		if (all_visible)
 		{
 			if (check_serializable)
-			{
 				batchmvcc.visible[ntup] = true;
+
+			/*
+			 * In the all_visible && !check_serializable path, the block
+			 * above was skipped, so tup's fields have not been set yet.
+			 * Fill them here while lpp is still in hand.
+			 */
+			if (!check_serializable)
+			{
+				uint32  lp_val = *(uint32 *) lpp;
+
+				tup->t_data = (HeapTupleHeader) ((char *) page + (lp_val & 0x7fff));
+				tup->t_len  = lp_val >> 17;
+				Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
+				ItemPointerSet(&tup->t_self, block, lineoff);
 			}
-			scan->rs_vistuples[ntup] = lineoff;
 		}
 
 		ntup++;
@@ -598,11 +605,24 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
 		{
 			HeapCheckForSerializableConflictOut(batchmvcc.visible[i],
 												scan->rs_base.rs_rd,
-												&batchmvcc.tuples[i],
+												&scan->rs_vistuples[i],
 												buffer, snapshot);
 		}
 	}
 
+
+	/* Now compact rs_vistuples[] to visible survivors only */
+	if (!all_visible)
+	{
+		int dst = 0;
+		for (int i = 0; i < ntup; i++)
+		{
+			if (batchmvcc.visible[i])
+				scan->rs_vistuples[dst++] = scan->rs_vistuples[i];
+		}
+		Assert(dst == nvis);
+	}
+
 	return nvis;
 }
 
@@ -1073,14 +1093,13 @@ heapgettup_pagemode(HeapScanDesc scan,
 					ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	Page		page;
 	uint32		lineindex;
 	uint32		linesleft;
 
 	if (likely(scan->rs_inited))
 	{
 		/* continue from previously returned page/tuple */
-		page = BufferGetPage(scan->rs_cbuf);
+		Assert(BufferIsValid(scan->rs_cbuf));
 
 		lineindex = scan->rs_cindex + dir;
 		if (ScanDirectionIsForward(dir))
@@ -1108,29 +1127,21 @@ heapgettup_pagemode(HeapScanDesc scan,
 
 		/* prune the page and determine visible tuple offsets */
 		heap_prepare_pagescan((TableScanDesc) scan);
-		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
 
-		/* block is the same for all tuples, set it once outside the loop */
-		ItemPointerSetBlockNumber(&tuple->t_self, scan->rs_cblock);
-
 		/* lineindex now references the next or previous visible tid */
 continue_page:
 
 		for (; linesleft > 0; linesleft--, lineindex += dir)
 		{
-			ItemId		lpp;
-			OffsetNumber lineoff;
-
-			Assert(lineindex < scan->rs_ntuples);
-			lineoff = scan->rs_vistuples[lineindex];
-			lpp = PageGetItemId(page, lineoff);
-			Assert(ItemIdIsNormal(lpp));
-
-			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
-			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSetOffsetNumber(&tuple->t_self, lineoff);
+			/*
+			 * Headers were pre-built by page_collect_tuples() into
+			 * rs_vistuples[].  Copy the entry; t_data still points into the
+			 * pinned page, which is safe for the lifetime of the current page
+			 * scan.
+			 */
+			*tuple = scan->rs_vistuples[lineindex];
 
 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
@@ -1244,6 +1255,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	/* we only need to set this up once */
 	scan->rs_ctup.t_tableOid = RelationGetRelid(relation);
+	for (int i = 0; i < MaxHeapTuplesPerPage; i++)
+		scan->rs_vistuples[i].t_tableOid = RelationGetRelid(relation);
 
 	/*
 	 * Allocate memory to keep track of page allocation for parallel workers
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 253a735b6c1..2fd120028bb 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2153,9 +2153,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
 {
 	BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) scan;
 	HeapScanDesc hscan = (HeapScanDesc) bscan;
-	OffsetNumber targoffset;
-	Page		page;
-	ItemId		lp;
 
 	/*
 	 * Out of range?  If so, nothing more to look at on this page
@@ -2170,15 +2167,7 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
 			return false;
 	}
 
-	targoffset = hscan->rs_vistuples[hscan->rs_cindex];
-	page = BufferGetPage(hscan->rs_cbuf);
-	lp = PageGetItemId(page, targoffset);
-	Assert(ItemIdIsNormal(lp));
-
-	hscan->rs_ctup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
-	hscan->rs_ctup.t_len = ItemIdGetLength(lp);
-	hscan->rs_ctup.t_tableOid = scan->rs_rd->rd_id;
-	ItemPointerSet(&hscan->rs_ctup.t_self, hscan->rs_cblock, targoffset);
+	hscan->rs_ctup = hscan->rs_vistuples[hscan->rs_cindex];
 
 	pgstat_count_heap_fetch(scan->rs_rd);
 
@@ -2456,7 +2445,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 		while (start < end)
 		{
 			uint32		mid = start + (end - start) / 2;
-			OffsetNumber curoffset = hscan->rs_vistuples[mid];
+			OffsetNumber curoffset = hscan->rs_vistuples[mid].t_self.ip_posid;
 
 			if (tupoffset == curoffset)
 				return true;
@@ -2575,7 +2564,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
 			ItemPointerSet(&tid, block, offnum);
 			if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
 									   &heapTuple, NULL, true))
-				hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+				hscan->rs_vistuples[ntup++] = heapTuple;
 		}
 	}
 	else
@@ -2604,7 +2593,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
 			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
 			if (valid)
 			{
-				hscan->rs_vistuples[ntup++] = offnum;
+				hscan->rs_vistuples[ntup++] = loctup;
 				PredicateLockTID(scan->rs_rd, &loctup.t_self, snapshot,
 								 HeapTupleHeaderGetXmin(loctup.t_data));
 			}
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index fc64f4343ce..cd6cd4d8d69 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1670,16 +1670,16 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 }
 
 /*
- * Perform HeaptupleSatisfiesMVCC() on each passed in tuple. This is more
+ * Perform HeapTupleSatisfiesMVCC() on each passed in tuple. This is more
  * efficient than doing HeapTupleSatisfiesMVCC() one-by-one.
  *
- * To be checked tuples are passed via BatchMVCCState->tuples. Each tuple's
- * visibility is stored in batchmvcc->visible[]. In addition,
- * ->vistuples_dense is set to contain the offsets of visible tuples.
+ * Each tuple's visibility is stored in batchmvcc->visible[].  The caller
+ * is responsible for compacting the tuples array to contain only visible
+ * survivors after this function returns.
  *
- * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that it
- * avoids a cross-translation-unit function call for each tuple, allows the
- * compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
+ * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that
+ * it avoids a cross-translation-unit function call for each tuple, allows
+ * the compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
  * setting hint bits more efficiently (see the one BufferFinishSetHintBits()
  * call below).
  *
@@ -1689,7 +1689,7 @@ int
 HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 							int ntups,
 							BatchMVCCState *batchmvcc,
-							OffsetNumber *vistuples_dense)
+							HeapTupleData *tuples)
 {
 	int			nvis = 0;
 	SetHintBitsState state = SHB_INITIAL;
@@ -1699,16 +1699,13 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 	for (int i = 0; i < ntups; i++)
 	{
 		bool		valid;
-		HeapTuple	tup = &batchmvcc->tuples[i];
+		HeapTuple	tup = &tuples[i];
 
 		valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer, &state);
 		batchmvcc->visible[i] = valid;
 
 		if (likely(valid))
-		{
-			vistuples_dense[nvis] = tup->t_self.ip_posid;
 			nvis++;
-		}
 	}
 
 	if (state == SHB_ENABLED)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 2fdc50b865b..09b9566d0ac 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -103,7 +103,7 @@ typedef struct HeapScanDescData
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	uint32		rs_cindex;		/* current tuple's index in vistuples */
 	uint32		rs_ntuples;		/* number of visible tuples on page */
-	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
+	HeapTupleData rs_vistuples[MaxHeapTuplesPerPage];	/* tuples */
 } HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
@@ -483,14 +483,13 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
  */
 typedef struct BatchMVCCState
 {
-	HeapTupleData tuples[MaxHeapTuplesPerPage];
 	bool		visible[MaxHeapTuplesPerPage];
 } BatchMVCCState;
 
 extern int	HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 										int ntups,
 										BatchMVCCState *batchmvcc,
-										OffsetNumber *vistuples_dense);
+										HeapTupleData *tuples);
 
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
-- 
2.47.3
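
The compaction loop that this patch adds at the end of
page_collect_tuples() can be studied in isolation.  What follows is a
minimal standalone sketch, not code from the patch: plain ints stand in
for the HeapTupleData entries of rs_vistuples[], and
toy_compact_visible() is a hypothetical name.  Copying in place is safe
because the write index never runs ahead of the read index.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Keep only the entries whose visible[] flag is set, preserving order.
 * Returns the number of survivors.  Toy stand-in for the rs_vistuples[]
 * compaction; items[] holds ints here, not HeapTupleData.
 */
int
toy_compact_visible(int *items, const bool *visible, int n)
{
	int			dst = 0;

	assert(n >= 0);

	for (int i = 0; i < n; i++)
	{
		/* dst <= i always holds, so this never clobbers an unread entry */
		if (visible[i])
			items[dst++] = items[i];
	}

	return dst;
}
```

In the patch the survivor count is already known from the visibility
check, which is why the loop there ends with Assert(dst == nvis) rather
than returning the count.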



  [application/x-patch] v6-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch (6.5K, 4-v6-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch)
From 0d810ceed77e394883ab0e95eafe36051b546040 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 5 Mar 2026 17:42:19 +0900
Subject: [PATCH v6 2/5] Add RowBatch infrastructure for batched tuple
 processing

Introduce RowBatch, a data carrier that allows table AMs to deliver
multiple rows per call and the executor to process them as a group.

RowBatch separates three concerns:

  - am_payload: opaque, AM-owned storage (e.g. HeapBatch with pinned
    page and tuple headers).  The AM allocates this in its
    scan_begin_batch callback.

  - slot: a single TupleTableSlot, owned by the scan node, that serves
    as a row view.  ops->repoint_slot re-points it at one row of
    am_payload when the executor needs tuple data.

  - max_rows: executor-set upper bound that the AM respects when
    filling a batch.

RowBatch does not own selection/filtering state.  Which rows survive
qual evaluation is the executor's concern, tracked separately in
scan node state.  This keeps RowBatch focused on the AM-to-executor
data transfer boundary.

RowBatchOps provides a vtable for AM-specific operations; currently
only repoint_slot is defined.
---
 src/backend/executor/Makefile       |  1 +
 src/backend/executor/execRowBatch.c | 54 ++++++++++++++++++
 src/backend/executor/meson.build    |  1 +
 src/include/executor/execRowBatch.h | 88 +++++++++++++++++++++++++++++
 src/tools/pgindent/typedefs.list    |  2 +
 5 files changed, 146 insertions(+)
 create mode 100644 src/backend/executor/execRowBatch.c
 create mode 100644 src/include/executor/execRowBatch.h
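
The iteration helpers in execRowBatch.h hand out one slot that the AM
callback re-points at successive rows of its payload.  The standalone
sketch below mimics that shape outside PostgreSQL.  Every Toy* type and
toy_* function is a hypothetical stand-in (the payload is just an int
array); only the control flow of toy_batch_next() follows
RowBatchGetNextSlot() from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; these are NOT PostgreSQL's TupleTableSlot or RowBatch. */
typedef struct ToySlot
{
	const int  *row;			/* points at the current row, not a copy */
} ToySlot;

typedef struct ToyBatch ToyBatch;

typedef struct ToyBatchOps
{
	void		(*repoint_slot) (ToyBatch *b, int idx);
} ToyBatchOps;

struct ToyBatch
{
	void	   *am_payload;		/* AM-owned storage, opaque to the consumer */
	const ToyBatchOps *ops;
	int			nrows;			/* rows the AM put in */
	int			pos;			/* iteration position */
	ToySlot    *slot;			/* single row view, owned by the consumer */
};

/* Hypothetical AM callback: the payload is a plain int array. */
static void
toy_repoint_slot(ToyBatch *b, int idx)
{
	b->slot->row = &((const int *) b->am_payload)[idx];
}

static const ToyBatchOps toy_ops = {toy_repoint_slot};

/* Same control flow as RowBatchGetNextSlot() in the patch. */
static ToySlot *
toy_batch_next(ToyBatch *b)
{
	if (b->pos >= b->nrows)
		return NULL;
	b->ops->repoint_slot(b, b->pos++);
	return b->slot;
}

/* Drain one batch through the single slot and sum its rows. */
long
toy_batch_sum(int *rows, int nrows)
{
	ToySlot		slot = {NULL};
	ToyBatch	b = {rows, &toy_ops, nrows, 0, &slot};
	ToySlot    *s;
	long		sum = 0;

	while ((s = toy_batch_next(&b)) != NULL)
		sum += *s->row;

	return sum;
}
```

The point of the single-slot design, as far as this sketch can show it,
is that no per-row copy is made: the slot only ever points into storage
the AM already owns.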

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..99a00e762f6 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	execAmi.o \
 	execAsync.o \
+	execRowBatch.o \
 	execCurrent.o \
 	execExpr.o \
 	execExprInterp.o \
diff --git a/src/backend/executor/execRowBatch.c b/src/backend/executor/execRowBatch.c
new file mode 100644
index 00000000000..6a298813bd8
--- /dev/null
+++ b/src/backend/executor/execRowBatch.c
@@ -0,0 +1,54 @@
+/*-------------------------------------------------------------------------
+ *
+ * execRowBatch.c
+ *		Helpers for RowBatch
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execRowBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execRowBatch.h"
+
+/*
+ * RowBatchCreate
+ *		Allocate and initialize a new RowBatch envelope.
+ */
+RowBatch *
+RowBatchCreate(int max_rows)
+{
+	RowBatch   *b;
+
+	Assert(max_rows > 0);
+
+	b = palloc(sizeof(RowBatch));
+	b->am_payload = NULL;
+	b->ops = NULL;
+	b->max_rows = max_rows;
+	b->nrows = 0;
+	b->pos = 0;
+	b->materialized = false;
+	b->slot = NULL;
+
+	return b;
+}
+
+/*
+ * RowBatchReset
+ *		Reset an existing RowBatch envelope to empty.
+ */
+void
+RowBatchReset(RowBatch *b, bool drop_slots)
+{
+	Assert(b != NULL);
+
+	b->nrows = 0;
+	b->pos = 0;
+	b->materialized = false;
+	/* b->slot belongs to the owning PlanState node */
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index dc45be0b2ce..fd0bf80bacd 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
 backend_sources += files(
   'execAmi.c',
   'execAsync.c',
+  'execRowBatch.c',
   'execCurrent.c',
   'execExpr.c',
   'execExprInterp.c',
diff --git a/src/include/executor/execRowBatch.h b/src/include/executor/execRowBatch.h
new file mode 100644
index 00000000000..021fdeecc73
--- /dev/null
+++ b/src/include/executor/execRowBatch.h
@@ -0,0 +1,88 @@
+/*-------------------------------------------------------------------------
+ *
+ * execRowBatch.h
+ *		Executor batch envelope for passing row batch state upward
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/include/executor/execRowBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECROWBATCH_H
+#define EXECROWBATCH_H
+
+#include "executor/tuptable.h"
+
+typedef struct RowBatchOps RowBatchOps;
+
+/*
+ * RowBatch
+ *
+ * Data carrier from table AM to executor. The AM populates am_payload
+ * and nrows via scan_getnextbatch(). The executor calls ops->repoint_slot
+ * to point the batch's slot at a row when it needs tuple data.
+ *
+ * Selection state (which rows survived qual eval) is owned by the executor,
+ * not the batch.
+ */
+typedef struct RowBatch
+{
+	void	   *am_payload;
+	const RowBatchOps *ops;
+
+	int			max_rows;			/* executor-set upper bound */
+	int			nrows;				/* rows TAM put in */
+	int			pos;				/* iteration position */
+	bool		materialized;		/* tuples in slots valid? */
+
+	TupleTableSlot *slot;			/* row view */
+} RowBatch;
+
+/*
+ * RowBatchOps -- AM-specific operations on a RowBatch.
+ *
+ * Table AMs set b->ops during scan_begin_batch to provide
+ * callbacks that the executor uses to access batch contents.
+ *
+ * repoint_slot re-points the batch's single slot to the tuple at
+ * index idx within the current batch.  The slot remains valid until
+ * the next call or until the batch is exhausted.
+ *
+ * Additional callbacks can be added here as new AMs or executor
+ * features require them.
+ */
+typedef struct RowBatchOps
+{
+	void		(*repoint_slot) (RowBatch *b, int idx);
+} RowBatchOps;
+
+/* Create/teardown */
+extern RowBatch *RowBatchCreate(int max_rows);
+extern void RowBatchReset(RowBatch *b, bool drop_slots);
+
+/* Validation */
+static inline bool
+RowBatchIsValid(RowBatch *b)
+{
+	return b != NULL && b->max_rows > 0;
+}
+
+/* Iteration over the rows of the current batch */
+static inline bool
+RowBatchHasMore(RowBatch *b)
+{
+	return b->pos < b->nrows;
+}
+
+static inline TupleTableSlot *
+RowBatchGetNextSlot(RowBatch *b)
+{
+	if (b->pos >= b->nrows)
+		return NULL;
+	b->ops->repoint_slot(b, b->pos++);
+	return b->slot;
+}
+
+#endif	/* EXECROWBATCH_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 52f8603a7be..a2b0b1d99d4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2663,6 +2663,8 @@ RoleSpec
 RoleSpecType
 RoleStmtType
 RollupData
+RowBatch
+RowBatchOps
 RowCompareExpr
 RowExpr
 RowIdentityVarInfo
-- 
2.47.3



  [application/x-patch] v6-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch (17.4K, 5-v6-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch)
From c5f58f57cda191408855ab243c05f15580ca5eef Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 23:09:37 +0900
Subject: [PATCH v6 5/5] Add EXPLAIN (BATCHES) option for tuple batching
 statistics

Add a BATCHES option to EXPLAIN that reports per-node batch statistics
when a node uses batch mode execution.

For nodes that support batching (currently SeqScan), this shows the
number of batches fetched along with average, minimum, and maximum
rows per batch. Output is supported in both text and non-text formats.

Add regression tests covering text output, JSON format, filtered scans,
LIMIT, and disabled batching.

Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
 src/backend/commands/explain.c        |  44 +++++++++++
 src/backend/commands/explain_state.c  |   8 ++
 src/backend/executor/execRowBatch.c   |  44 ++++++++++-
 src/backend/executor/nodeSeqscan.c    |   8 +-
 src/include/commands/explain_state.h  |   1 +
 src/include/executor/execRowBatch.h   |  22 +++++-
 src/include/executor/instrument.h     |   1 +
 src/test/regress/expected/explain.out | 107 ++++++++++++++++++++++++++
 src/test/regress/sql/explain.sql      |  59 ++++++++++++++
 9 files changed, 291 insertions(+), 3 deletions(-)
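
The statistics kept for EXPLAIN (ANALYZE, BATCHES) amount to a running
count, a running sum, and a min/max pair in which INT_MAX marks "no
non-empty batch seen yet".  The following standalone sketch restates
that bookkeeping with hypothetical Toy* names so it can be compiled on
its own.  It mirrors RowBatchRecordStats() and RowBatchAvgRows() as they
appear in this patch, but it is an illustration, not the patch code.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Toy counterpart of RowBatchStats. */
typedef struct ToyBatchStats
{
	int64_t		batches;		/* total number of batches recorded */
	int64_t		rows;			/* total rows across all batches */
	int			max_rows;		/* largest batch seen */
	int			min_rows;		/* smallest non-empty batch, or INT_MAX */
} ToyBatchStats;

void
toy_stats_init(ToyBatchStats *s)
{
	s->batches = 0;
	s->rows = 0;
	s->max_rows = 0;
	s->min_rows = INT_MAX;		/* sentinel: nothing recorded yet */
}

void
toy_stats_record(ToyBatchStats *s, int rows)
{
	s->batches++;
	s->rows += rows;
	if (rows > s->max_rows)
		s->max_rows = rows;
	/* empty batches are counted but never become the minimum */
	if (rows > 0 && rows < s->min_rows)
		s->min_rows = rows;
}

double
toy_stats_avg(const ToyBatchStats *s)
{
	if (s->batches == 0)
		return 0.0;
	return (double) s->rows / (double) s->batches;
}

/* Map the sentinel to 0 for display, as the EXPLAIN code does. */
int
toy_stats_min_for_display(const ToyBatchStats *s)
{
	return s->min_rows == INT_MAX ? 0 : s->min_rows;
}
```

Keeping the sentinel out of the displayed value is what lets the text
and JSON outputs print "Min: 0" instead of INT_MAX when no non-empty
batch was recorded.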

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 296ea8a1ed2..b507fec0dab 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -22,6 +22,7 @@
 #include "commands/explain_format.h"
 #include "commands/explain_state.h"
 #include "commands/prepare.h"
+#include "executor/execRowBatch.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
 #include "libpq/pqformat.h"
@@ -519,6 +520,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		instrument_option |= INSTRUMENT_BUFFERS;
 	if (es->wal)
 		instrument_option |= INSTRUMENT_WAL;
+	if (es->batches)
+		instrument_option |= INSTRUMENT_BATCHES;
 
 	/*
 	 * We always collect timing for the entire statement, even when node-level
@@ -1372,6 +1375,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	int			save_indent = es->indent;
 	bool		haschildren;
 	bool		isdisabled;
+	RowBatch   *batch = NULL;
 
 	/*
 	 * Prepare per-worker output buffers, if needed.  We'll append the data in
@@ -2297,6 +2301,46 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (es->wal && planstate->instrument)
 		show_wal_usage(es, &planstate->instrument->walusage);
 
+	/* BATCHES */
+	switch (nodeTag(plan))
+	{
+		case T_SeqScan:
+			batch = castNode(SeqScanState, planstate)->batch;
+			break;
+		default:
+			break;
+	}
+
+	if (es->batches && batch)
+	{
+		RowBatchStats *stats = batch->stats;
+
+		Assert(stats);
+		if (stats->batches > 0)
+		{
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				ExplainIndentText(es);
+				appendStringInfo(es->str,
+								 "Batches: %lld  Avg Rows: %.1f  Max: %d  Min: %d\n",
+								 (long long) stats->batches,
+								 RowBatchAvgRows(batch), stats->max_rows,
+								 stats->min_rows == INT_MAX ? 0 :
+								 stats->min_rows);
+			}
+			else
+			{
+				ExplainPropertyInteger("Batches", NULL, stats->batches, es);
+				ExplainPropertyFloat("Average Batch Rows", NULL,
+									 RowBatchAvgRows(batch), 1, es);
+				ExplainPropertyInteger("Max Batch Rows", NULL, stats->max_rows, es);
+				ExplainPropertyInteger("Min Batch Rows", NULL,
+									   stats->min_rows == INT_MAX ? 0 :
+									   stats->min_rows, es);
+			}
+		}
+	}
+
 	/* Prepare per-worker buffer/WAL usage */
 	if (es->workers_state && (es->buffers || es->wal) && es->verbose)
 	{
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index 77f59b8e500..28022a171cd 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -159,6 +159,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
 								"EXPLAIN", opt->defname, p),
 						 parser_errposition(pstate, opt->location)));
 		}
+		else if (strcmp(opt->defname, "batches") == 0)
+			es->batches = defGetBoolean(opt);
 		else if (!ApplyExtensionExplainOption(es, opt, pstate))
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -198,6 +200,12 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
 				 errmsg("%s options %s and %s cannot be used together",
 						"EXPLAIN", "ANALYZE", "GENERIC_PLAN")));
 
+	/* check that BATCHES is used with EXPLAIN ANALYZE */
+	if (es->batches && !es->analyze)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("EXPLAIN option %s requires ANALYZE", "BATCHES")));
+
 	/* if the summary was not set explicitly, set default value */
 	es->summary = (summary_set) ? es->summary : es->analyze;
 
diff --git a/src/backend/executor/execRowBatch.c b/src/backend/executor/execRowBatch.c
index 6a298813bd8..6ef54deca04 100644
--- a/src/backend/executor/execRowBatch.c
+++ b/src/backend/executor/execRowBatch.c
@@ -20,7 +20,7 @@
  *		Allocate and initialize a new RowBatch envelope.
  */
 RowBatch *
-RowBatchCreate(int max_rows)
+RowBatchCreate(int max_rows, bool track_stats)
 {
 	RowBatch   *b;
 
@@ -35,6 +35,20 @@ RowBatchCreate(int max_rows)
 	b->materialized = false;
 	b->slot = NULL;
 
+	if (track_stats)
+	{
+		RowBatchStats *stats = palloc_object(RowBatchStats);
+
+		stats->batches = 0;
+		stats->rows = 0;
+		stats->max_rows = 0;
+		stats->min_rows = INT_MAX;
+
+		b->stats = stats;
+	}
+	else
+		b->stats = NULL;
+
 	return b;
 }
 
@@ -52,3 +66,31 @@ RowBatchReset(RowBatch *b, bool drop_slots)
 	b->materialized = false;
 	/* b->slot belongs to the owning PlanState node */
 }
+
+void
+RowBatchRecordStats(RowBatch *b, int rows)
+{
+	RowBatchStats *stats = b->stats;
+
+	if (stats == NULL)
+		return;
+
+	stats->batches++;
+	stats->rows += rows;
+	if (rows > stats->max_rows)
+		stats->max_rows = rows;
+	if (rows < stats->min_rows && rows > 0)
+		stats->min_rows = rows;
+}
+
+double
+RowBatchAvgRows(RowBatch *b)
+{
+	RowBatchStats *stats = b->stats;
+
+	Assert(stats != NULL);
+	if (stats->batches == 0)
+		return 0.0;
+
+	return (double) stats->rows / stats->batches;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index b41d18b67e3..c1527be946a 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -245,8 +245,12 @@ SeqScanCanUseBatching(SeqScanState *scanstate, int eflags)
 static void
 SeqScanInitBatching(SeqScanState *scanstate)
 {
-	RowBatch   *batch = RowBatchCreate(MaxHeapTuplesPerPage);
+	RowBatch   *batch;
+	EState	   *estate = scanstate->ss.ps.state;
+	bool		track_stats =
+		(estate->es_instrument & INSTRUMENT_BATCHES) != 0;
 
+	batch = RowBatchCreate(MaxHeapTuplesPerPage, track_stats);
 	batch->slot = scanstate->ss.ss_ScanTupleSlot;
 	scanstate->batch = batch;
 
@@ -347,6 +351,8 @@ SeqNextBatch(SeqScanState *node)
 	if (!table_scan_getnextbatch(scandesc, b, direction))
 		return false;
 
+	RowBatchRecordStats(b, b->nrows);
+
 	return true;
 }
 
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 5a48bc6fbb1..579ca4cfa20 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -56,6 +56,7 @@ typedef struct ExplainState
 	bool		memory;			/* print planner's memory usage information */
 	bool		settings;		/* print modified settings */
 	bool		generic;		/* generate a generic plan */
+	bool		batches;		/* print batch statistics */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
 	ExplainFormat format;		/* output format */
 	/* state for output formatting --- not reset for each new plan tree */
diff --git a/src/include/executor/execRowBatch.h b/src/include/executor/execRowBatch.h
index 021fdeecc73..ad0b4763b70 100644
--- a/src/include/executor/execRowBatch.h
+++ b/src/include/executor/execRowBatch.h
@@ -13,9 +13,12 @@
 #ifndef EXECROWBATCH_H
 #define EXECROWBATCH_H
 
+#include <limits.h>
+
 #include "executor/tuptable.h"
 
 typedef struct RowBatchOps RowBatchOps;
+typedef struct RowBatchStats RowBatchStats;
 
 /*
  * RowBatch
@@ -38,6 +41,9 @@ typedef struct RowBatch
 	bool		materialized;		/* tuples in slots valid? */
 
 	TupleTableSlot *slot;			/* row view */
+
+	RowBatchStats *stats;			/* NULL if instrumentation stats
+									 * are not requested */
 } RowBatch;
 
 /*
@@ -58,8 +64,17 @@ typedef struct RowBatchOps
 	void		(*repoint_slot) (RowBatch *b, int idx);
 } RowBatchOps;
 
+/* Instrumentation stats populated for EXPLAIN (ANALYZE, BATCHES) */
+typedef struct RowBatchStats
+{
+	int64	batches;	/* total number of batches fetched */
+	int64	rows;		/* total tuples across all batches */
+	int		max_rows;	/* max rows in any single batch */
+	int		min_rows;	/* min rows in any single batch (non-zero) */
+} RowBatchStats;
+
 /* Create/teardown */
-extern RowBatch *RowBatchCreate(int max_rows);
+extern RowBatch *RowBatchCreate(int max_rows, bool track_stats);
 extern void RowBatchReset(RowBatch *b, bool drop_slots);
 
 /* Validation */
@@ -85,4 +100,9 @@ RowBatchGetNextSlot(RowBatch *b)
 	return b->slot;
 }
 
+/* Batching stats */
+
+extern void RowBatchRecordStats(RowBatch *b, int rows);
+extern double RowBatchAvgRows(RowBatch *b);
+
 #endif	/* EXECROWBATCH_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..bee69b4ac8f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -64,6 +64,7 @@ typedef enum InstrumentOption
 	INSTRUMENT_BUFFERS = 1 << 1,	/* needs buffer usage */
 	INSTRUMENT_ROWS = 1 << 2,	/* needs row count */
 	INSTRUMENT_WAL = 1 << 3,	/* needs WAL usage */
+	INSTRUMENT_BATCHES = 1 << 4, /* needs batches */
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..950de5a9d78 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,110 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
 (9 rows)
 
 reset work_mem;
+-- Test BATCHES option
+set executor_batch_rows = 64;
+create temp table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+-- BATCHES without ANALYZE should error
+explain (batches, costs off) select * from batch_test;
+ERROR:  EXPLAIN option BATCHES requires ANALYZE
+-- BATCHES without ANALYZE but with other options
+explain (batches, buffers off, costs off) select * from batch_test;
+ERROR:  EXPLAIN option BATCHES requires ANALYZE
+-- Basic: verify batch stats line appears in text format
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+                         explain_filter                         
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+   Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+-- With filter: batch line still appears
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+                         explain_filter                         
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+   Filter: (a > N)
+   Rows Removed by Filter: N
+   Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With non-batchable qual (OR): batching still active but
+-- batch qual falls back to per-tuple ExecQual
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000 or b is null');
+                         explain_filter                         
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+   Filter: ((a > N) OR (b IS NULL))
+   Rows Removed by Filter: N
+   Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With LIMIT: batch stats appear on child Seq Scan node
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+                            explain_filter                            
+----------------------------------------------------------------------
+ Limit (actual time=N.N..N.N rows=N.N loops=N)
+   ->  Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+         Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(5 rows)
+
+-- Verify batch stats keys present in JSON output
+select
+  j #> '{0,Plan}' ? 'Batches' as has_batches,
+  j #> '{0,Plan}' ? 'Average Batch Rows' as has_avg,
+  j #> '{0,Plan}' ? 'Max Batch Rows' as has_max,
+  j #> '{0,Plan}' ? 'Min Batch Rows' as has_min
+from explain_filter_to_json(
+  'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+ has_batches | has_avg | has_max | has_min 
+-------------+---------+---------+---------
+ t           | t       | t       | t
+(1 row)
+
+-- With LIMIT: batch stats keys on child node in JSON
+select
+  j #> '{0,Plan,Plans,0}' ? 'Batches' as child_has_batches,
+  j #> '{0,Plan,Plans,0}' ? 'Average Batch Rows' as child_has_avg,
+  j #> '{0,Plan,Plans,0}' ? 'Max Batch Rows' as child_has_max,
+  j #> '{0,Plan,Plans,0}' ? 'Min Batch Rows' as child_has_min
+from explain_filter_to_json(
+  'explain (analyze, batches, buffers off, format json) select * from batch_test limit 100'
+) as j;
+ child_has_batches | child_has_avg | child_has_max | child_has_min 
+-------------------+---------------+---------------+---------------
+ t                 | t             | t             | t
+(1 row)
+
+-- Batching disabled: no batch stats in text output
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+                         explain_filter                         
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+-- Batching disabled: no batch keys in JSON
+select
+  j #> '{0,Plan}' ? 'Batches' as has_batches
+from explain_filter_to_json(
+  'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+ has_batches 
+-------------
+ f
+(1 row)
+
+reset executor_batch_rows;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..55acb9058ce 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,62 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
 -- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
 select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
 reset work_mem;
+
+-- Test BATCHES option
+set executor_batch_rows = 64;
+
+create temp table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+
+-- BATCHES without ANALYZE should error
+explain (batches, costs off) select * from batch_test;
+
+-- BATCHES without ANALYZE but with other options
+explain (batches, buffers off, costs off) select * from batch_test;
+
+-- Basic: verify batch stats line appears in text format
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- With filter: batch line still appears
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+
+-- With non-batchable qual (OR): batching still active but
+-- batch qual falls back to per-tuple ExecQual
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000 or b is null');
+
+-- With LIMIT: batch stats appear on child Seq Scan node
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+
+-- Verify batch stats keys present in JSON output
+select
+  j #> '{0,Plan}' ? 'Batches' as has_batches,
+  j #> '{0,Plan}' ? 'Average Batch Rows' as has_avg,
+  j #> '{0,Plan}' ? 'Max Batch Rows' as has_max,
+  j #> '{0,Plan}' ? 'Min Batch Rows' as has_min
+from explain_filter_to_json(
+  'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+
+-- With LIMIT: batch stats keys on child node in JSON
+select
+  j #> '{0,Plan,Plans,0}' ? 'Batches' as child_has_batches,
+  j #> '{0,Plan,Plans,0}' ? 'Average Batch Rows' as child_has_avg,
+  j #> '{0,Plan,Plans,0}' ? 'Max Batch Rows' as child_has_max,
+  j #> '{0,Plan,Plans,0}' ? 'Min Batch Rows' as child_has_min
+from explain_filter_to_json(
+  'explain (analyze, batches, buffers off, format json) select * from batch_test limit 100'
+) as j;
+
+-- Batching disabled: no batch stats in text output
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- Batching disabled: no batch keys in JSON
+select
+  j #> '{0,Plan}' ? 'Batches' as has_batches
+from explain_filter_to_json(
+  'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+
+reset executor_batch_rows;
-- 
2.47.3



  [application/x-patch] v6-0004-SeqScan-add-batch-driven-variants-returning-slots.patch (12.6K, 6-v6-0004-SeqScan-add-batch-driven-variants-returning-slots.patch)
From 074facc85aae66ebab49b08eadf9957a6dca778d Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 5 Mar 2026 11:28:16 +0900
Subject: [PATCH v6 4/5] SeqScan: add batch-driven variants returning slots

Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers.  This reduces per-tuple AM crossings without
changing the node interface seen by parents.

SeqScanState gains a RowBatch pointer that holds the current batch
when batching is active.  Batch state is localized to SeqScanState
-- no changes to PlanState or ScanState.

Add executor_batch_rows GUC (DEVELOPER_OPTIONS, default 64) to
control the maximum batch size.  Setting it to 0 disables batching
(SeqScan only batches when the value is greater than 1).
XXX the configured size is currently ignored when reading from
heapam tables.

Wire up runtime selection in ExecInitSeqScan via
SeqScanCanUseBatching().  When executor_batch_rows > 1, EPQ is
inactive, the scan is forward-only, and the relation's AM supports
batching, ExecProcNode is set to a batch-driven variant.  Otherwise
the non-batch path is used with zero overhead.

Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled.

Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
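Reader's note, not part of the commit: a minimal standalone sketch of
the control flow described above, where the scan refills from the AM
one batch at a time but still hands its caller one row per call.  All
names here (ToySource, ToyBatch, toy_next_batch, toy_scan_next) are
hypothetical stand-ins, not the patch's RowBatch or table AM API, and
the "table" is just an int array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BATCH_MAX 4			/* stand-in for a page-sized batch */

/* Hypothetical row source: an int array that is read in batches. */
typedef struct ToySource
{
	const int  *rows;
	size_t		nrows;
	size_t		next;			/* index of the next unread row */
	int			am_calls;		/* how often the "AM" was entered */
} ToySource;

/* Hypothetical batch: a slice into the source, nothing is copied. */
typedef struct ToyBatch
{
	const int  *rows;
	size_t		nrows;
	size_t		pos;			/* next row to hand out */
} ToyBatch;

/* Fetch the next batch; returns false at end of scan. */
static bool
toy_next_batch(ToySource *src, ToyBatch *b)
{
	size_t		remaining = src->nrows - src->next;

	src->am_calls++;
	if (remaining == 0)
		return false;

	b->rows = src->rows + src->next;
	b->nrows = remaining < TOY_BATCH_MAX ? remaining : TOY_BATCH_MAX;
	b->pos = 0;
	src->next += b->nrows;
	return true;
}

/*
 * Return the next qualifying row, or NULL at end of scan.  The caller
 * sees one row per call; the source is entered once per batch.
 * qual may be NULL, meaning every row qualifies.
 */
static const int *
toy_scan_next(ToySource *src, ToyBatch *b, bool (*qual) (int))
{
	for (;;)
	{
		const int  *row;

		/* Current batch exhausted (or never filled): refill. */
		if (b->pos >= b->nrows)
		{
			if (!toy_next_batch(src, b))
				return NULL;
		}

		row = &b->rows[b->pos++];
		if (qual == NULL || qual(*row))
			return row;
	}
}

/* Example qual: keep even values only. */
static bool
toy_is_even(int v)
{
	return v % 2 == 0;
}
```

To drive it, zero-initialize a ToyBatch, point a ToySource at an
array, and call toy_scan_next() until it returns NULL.  The am_calls
counter shows the intended effect: the number of source crossings
scales with the number of batches rather than the number of rows.
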
 src/backend/executor/nodeSeqscan.c        | 276 ++++++++++++++++++++++
 src/backend/utils/init/globals.c          |   3 +
 src/backend/utils/misc/guc_parameters.dat |   9 +
 src/include/miscadmin.h                   |   1 +
 src/include/nodes/execnodes.h             |   2 +
 5 files changed, 291 insertions(+)

diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 8f219f60a93..b41d18b67e3 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -29,12 +29,17 @@
 
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "executor/execRowBatch.h"
 #include "executor/execScan.h"
 #include "executor/executor.h"
 #include "executor/nodeSeqscan.h"
 #include "utils/rel.h"
 
 static TupleTableSlot *SeqNext(SeqScanState *node);
+static TupleTableSlot *ExecSeqScanBatchSlot(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithQual(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithProject(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithQualProject(PlanState *pstate);
 
 /* ----------------------------------------------------------------
  *						Scan Support
@@ -203,6 +208,271 @@ ExecSeqScanEPQ(PlanState *pstate)
 					(ExecScanRecheckMtd) SeqRecheck);
 }
 
+/* ----------------------------------------------------------------
+ *						Batch Support
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * SeqScanCanUseBatching
+ *		Check whether this SeqScan can use batch mode execution.
+ *
+ * Batching requires: the GUC is enabled, no EPQ recheck is active, the scan
+ * is forward-only, and the table AM supports batching with the current
+ * snapshot (see table_supports_batching()).
+ */
+static bool
+SeqScanCanUseBatching(SeqScanState *scanstate, int eflags)
+{
+	Relation	relation = scanstate->ss.ss_currentRelation;
+
+	return	executor_batch_rows > 1 &&
+			relation &&
+			table_supports_batching(relation,
+									scanstate->ss.ps.state->es_snapshot) &&
+			!(eflags & EXEC_FLAG_BACKWARD) &&
+			scanstate->ss.ps.state->es_epq_active == NULL;
+}
+
+/*
+ * SeqScanInitBatching
+ *		Set up batch execution state and select the appropriate
+ *		ExecProcNode variant for batch mode.
+ *
+ * Called from ExecInitSeqScan when SeqScanCanUseBatching returns true.
+ * Overwrites the ExecProcNode pointer set by the non-batch path.
+ */
+static void
+SeqScanInitBatching(SeqScanState *scanstate)
+{
+	RowBatch   *batch = RowBatchCreate(MaxHeapTuplesPerPage);
+
+	batch->slot = scanstate->ss.ss_ScanTupleSlot;
+	scanstate->batch = batch;
+
+	/* Choose batch variant */
+	if (scanstate->ss.ps.qual == NULL)
+	{
+		if (scanstate->ss.ps.ps_ProjInfo == NULL)
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+		else
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+	}
+	else
+	{
+		if (scanstate->ss.ps.ps_ProjInfo == NULL)
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+		else
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+	}
+}
+
+/*
+ * SeqScanResetBatching
+ *		Reset or tear down batch execution state.
+ *
+ * When drop is false (rescan), resets the RowBatch and releases any
+ * AM-held resources like buffer pins, but keeps allocations for reuse.
+ * When drop is true (end of node), frees everything.
+ */
+static void
+SeqScanResetBatching(SeqScanState *scanstate, bool drop)
+{
+	RowBatch *b = scanstate->batch;
+
+	if (b)
+	{
+		RowBatchReset(b, drop);
+		if (b->am_payload)
+		{
+			if (drop)
+			{
+				table_scan_end_batch(scanstate->ss.ss_currentScanDesc, b);
+				b->am_payload = NULL;
+			}
+			else
+				table_scan_reset_batch(scanstate->ss.ss_currentScanDesc, b);
+		}
+		if (drop)
+			pfree(b);
+	}
+}
+
+/*
+ * SeqNextBatch
+ *		Fetch the next batch of tuples from the table AM.
+ *
+ * Lazily initializes the scan descriptor and AM batch state on first
+ * call.  Returns false at end of scan.
+ */
+static bool
+SeqNextBatch(SeqScanState *node)
+{
+	TableScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	RowBatch *b = node->batch;
+
+	Assert(b != NULL);
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	Assert(ScanDirectionIsForward(direction));
+
+	if (scandesc == NULL)
+	{
+		/*
+		 * We reach here if the scan is not parallel, or if we're serially
+		 * executing a scan that was planned to be parallel.
+		 */
+		scandesc = table_beginscan(node->ss.ss_currentRelation,
+								   estate->es_snapshot,
+								   0, NULL);
+		node->ss.ss_currentScanDesc = scandesc;
+	}
+
+	/* Lazily create the AM batch payload. */
+	if (b->am_payload == NULL)
+	{
+		const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+		Assert(tam && tam->scan_begin_batch);
+		table_scan_begin_batch(scandesc, b);
+	}
+
+	if (!table_scan_getnextbatch(scandesc, b, direction))
+		return false;
+
+	return true;
+}
+
+/*
+ * SeqScanBatchSlot
+ *		Core loop for batch-driven SeqScan variants.
+ *
+ * Internally fetches tuples in batches from the table AM, but returns
+ * one slot at a time to preserve the single-slot interface expected by
+ * parent nodes.  When the current batch is exhausted, fetches and
+ * materializes the next one.
+ *
+ * qual and projInfo are passed explicitly so the compiler can eliminate
+ * dead branches when inlined into the typed wrapper functions (e.g.
+ * ExecSeqScanBatchSlot passes NULL for both).
+ *
+ * EPQ is not supported in the batch path; asserted at entry.
+ */
+static inline TupleTableSlot *
+SeqScanBatchSlot(SeqScanState *node,
+				 ExprState *qual, ProjectionInfo *projInfo)
+{
+	ExprContext *econtext = node->ss.ps.ps_ExprContext;
+	RowBatch *b = node->batch;
+
+	/* Batch path does not support EPQ */
+	Assert(node->ss.ps.state->es_epq_active == NULL);
+	Assert(RowBatchIsValid(b));
+
+	for (;;)
+	{
+		TupleTableSlot *in;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Get next input slot from current batch, or refill */
+		if (!RowBatchHasMore(b))
+		{
+			if (!SeqNextBatch(node))
+				return NULL;
+		}
+
+		in = RowBatchGetNextSlot(b);
+		Assert(in);
+
+		/* No qual, no projection: direct return */
+		if (qual == NULL && projInfo == NULL)
+			return in;
+
+		ResetExprContext(econtext);
+		econtext->ecxt_scantuple = in;
+
+		/* Check qual if present */
+		if (qual != NULL && !ExecQual(qual, econtext))
+		{
+			InstrCountFiltered1(node, 1);
+			continue;
+		}
+
+		/* Project if needed, otherwise return scan tuple directly */
+		if (projInfo != NULL)
+			return ExecProject(projInfo);
+
+		return in;
+	}
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return SeqScanBatchSlot(node, NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	/*
+	 * Use pg_assume() for != NULL tests to make the compiler realize no
+	 * runtime check for the field is needed in SeqScanBatchSlot().
+	 */
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return SeqScanBatchSlot(node, pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() used when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	pg_assume(pstate->ps_ProjInfo != NULL);
+
+	return SeqScanBatchSlot(node, NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() used when qual evaluation and
+ * projection are required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	pg_assume(pstate->ps_ProjInfo != NULL);
+
+	return SeqScanBatchSlot(node, pstate->qual, pstate->ps_ProjInfo);
+}
+
 /* ----------------------------------------------------------------
  *		ExecInitSeqScan
  * ----------------------------------------------------------------
@@ -281,6 +551,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
 	}
 
+	if (SeqScanCanUseBatching(scanstate, eflags))
+		SeqScanInitBatching(scanstate);
+
 	return scanstate;
 }
 
@@ -300,6 +573,8 @@ ExecEndSeqScan(SeqScanState *node)
 	 */
 	scanDesc = node->ss.ss_currentScanDesc;
 
+	SeqScanResetBatching(node, true);
+
 	/*
 	 * close heap scan
 	 */
@@ -329,6 +604,7 @@ ExecReScanSeqScan(SeqScanState *node)
 		table_rescan(scan,		/* scan desc */
 					 NULL);		/* new scan keys */
 
+	SeqScanResetBatching(node, false);
 	ExecScanReScan((ScanState *) node);
 }
 
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 36ad708b360..535e29d7823 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int			notify_buffers = 16;
 int			serializable_buffers = 32;
 int			subtransaction_buffers = 0;
 int			transaction_buffers = 0;
+
+/* executor batching */
+int			executor_batch_rows = 64;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index a5a0edf2534..e1eadcf643d 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1004,6 +1004,15 @@
   boot_val => 'true',
 },
 
+{ name => 'executor_batch_rows', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Number of rows to include in batches during execution.',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'executor_batch_rows',
+  boot_val => '64',
+  min => '0',
+  max => '1024',
+},
+
 { name => 'exit_on_error', type => 'bool', context => 'PGC_USERSET', group => 'ERROR_HANDLING_OPTIONS',
   short_desc => 'Terminate session on any error.',
   variable => 'ExitOnAnyError',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index f16f35659b9..ad406bf53f3 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -288,6 +288,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
 extern PGDLLIMPORT int VacuumCostBalance;
 extern PGDLLIMPORT bool VacuumCostActive;
 
+extern PGDLLIMPORT int executor_batch_rows;
 
 /* in utils/misc/stack_depth.c */
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0716c5a9aed..6f038cfcc60 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -67,6 +67,7 @@ typedef struct TupleTableSlot TupleTableSlot;
 typedef struct TupleTableSlotOps TupleTableSlotOps;
 typedef struct WalUsage WalUsage;
 typedef struct WorkerInstrumentation WorkerInstrumentation;
+typedef struct RowBatch RowBatch;
 
 
 /* ----------------
@@ -1644,6 +1645,7 @@ typedef struct SeqScanState
 {
 	ScanState	ss;				/* its first field is NodeTag */
 	Size		pscan_len;		/* size of parallel heap scan descriptor */
+	RowBatch   *batch;			/* NULL if batching disabled */
 } SeqScanState;
 
 /* ----------------
-- 
2.47.3



^ permalink  raw  reply  [nested|flat] 22+ messages in thread

* Re: Batching in executor
@ 2026-04-06 12:02  Amit Langote <[email protected]>
  parent: Amit Langote <[email protected]>
  0 siblings, 0 replies; 22+ messages in thread

From: Amit Langote @ 2026-04-06 12:02 UTC (permalink / raw)
  To: Junwang Zhao <[email protected]>; +Cc: cca5507 <[email protected]>; Daniil Davydov <[email protected]>; pgsql-hackers; Tomas Vondra <[email protected]>

On Tue, Mar 24, 2026 at 9:59 AM Amit Langote <[email protected]> wrote:
> Here is a significantly revised version of the patch series. A lot has
> changed since the January submission, so I want to summarize the
> design changes before getting into the patches.  I think it does
> address the points in the two reviews that landed since v5 but maybe a
> bunch of points became moot after my rewrite of the relevant portions
> (thanks Junwang and ChangAo for the review in any case).
>
> At this point it might be better to think of this as targeting v20,
> except that if there is review bandwidth in the remaining two weeks
> before the v19 feature freeze, the rs_vistuples[] change described
> below as a standalone improvement to the existing pagemode scan path
> could be considered for v19, though that too is an optimistic
> scenario.
>
> It is also worth noting that Andres identified a number of
> inefficiencies in the existing scan path in:
>
> Re: unnecessary executor overheads around seqscans
> https://postgr.es/m/xzflwwjtwxin3dxziyblrnygy3gfygo5dsuw6ltcoha73ecmnf%40nh6nonzta7kw
>
> that are worth fixing independently of batching. Some of those fixes
> may be better pursued first, both because they benefit all scan paths
> and because they would make batching's gains more honest.
>
> Separately, after looking at the previous version, Andres pointed out
> off-list two fundamental issues with the patch's design:
>
> * The heapam implementation (in a version of the patch I didn't post
> to the thread) duplicated heap_prepare_pagescan() logic in a separate
> batch-specific code path, which is not acceptable as changes should
> benefit the existing slot interface too.  Duplicated code is also a
> future maintenance burden.  The v5 version of that code is not great
> in that respect either; it instead duplicated heapgettup_pagemode()
> to slap batching on it.
>
> * Allocating executor_batch_rows slots on the executor side to receive
> rows from the AM adds significant overhead for slot initialization and
> management, and for non-row-organized AMs that do not produce
> individual rows at all, those slots would never be meaningfully
> populated.
>
> In any case, he just wasn't a fan of the slot-array approach the
> moment I mentioned it. The previous version had two slot arrays,
> inslots and outslots, of TTSOpsHeapTuple type (not
> TTSOpsBufferHeapTuple because buffer pins were managed by the batch
> code, which has its own modularity/correctness issues), populated via
> a materialize_all callback. A batch qual evaluator would copy
> qualifying tuples into outslots, with an activeslots pointer switching
> between the two depending on whether batch qual evaluation was used.
>
> The new design addresses both issues and differs from the previous
> version in several other ways:
>
>  * Single slot instead of slot arrays: there is a single
> TupleTableSlot, reusing the scan node's ss_ScanTupleSlot whose type
> was already determined by the AM via table_slot_callbacks().  The slot
> is re-pointed to each HeapTuple in the current buffer page via a new
> repoint_slot AM callback, with no materialization or copying.  Tuples
> are returned one by one from the executor's perspective, but the AM
> serves them in page-sized batches from pre-built HeapTupleData
> descriptors in rs_vistuples[], avoiding repeated descent into heapam
> per tuple.  This is heapam's implementation of the batch interface;
> there is no intention to force other AMs into the same row-oriented
> model.
>
>  * Batch qual evaluator not included: with the single-slot model,
> quals are evaluated per tuple via the existing ExecQual path after
> each repoint_slot call.  A natural next step would be a new opcode
> (EEOP) that calls repoint_slot() internally within expression
> evaluation, allowing ExecQual to advance through multiple tuples from
> the same batch without returning to the scan node each time, with qual
> results accumulated in a bitmask in ExprState.  The details of that
> will be worked out in a follow-on series.
>
>  * heapgettup_pagemode_batch() gone: patch 0001 (described below) makes
> HeapScanDesc store full HeapTupleData entries in rs_vistuples[], which
> allows heap_getnextbatch() to simply advance a slice pointer into that
> array without any additional copying or re-entering heap code, making
> a separate batch-specific scan function unnecessary.
>
>  * TupleBatch renamed to RowBatch: "row batch" is more natural
> terminology for this concept and also consistent with how similar
> abstractions are named in columnar and OLAP systems.
>
>  * AM callbacks now take RowBatch directly: previously
> heap_getnextbatch() returned a void pointer that the executor would
> store into RowBatch.am_payload, because only the executor knew the
> internals of RowBatch.  Now the AM receives RowBatch directly as a
> parameter and can populate it without the executor acting as an
> intermediary.  This is also why RowBatch is introduced in its own
> patch ahead of the AM API addition, so the struct definition is
> available to both sides.
>
> Patch 0001 changes rs_vistuples[] to store full HeapTupleData entries
> instead of OffsetNumbers, as a standalone improvement to the existing
> pagemode scan path. Measured on a pg_prewarm'd (also vacuum freeze'd
> in the all-visible case) table with 1M/5M/10M rows:
>
>   query                           all-visible      not-all-visible
>   count(*)                        -0.2% to +0.9%   -0.4% to +0.5%
>   count(*) WHERE id % 10 = 0      -1.1% to +3.4%   +0.2% to +1.5%
>   SELECT * LIMIT 1 OFFSET N       -2.2% to -0.6%   -0.9% to +6.6%
>   SELECT * WHERE id%10=0 LIMIT    -0.8% to +3.9%   +0.9% to +9.6%
>
> No significant regression on either page type. The structural
> improvement is most visible on not-all-visible pages where
> HeapTupleSatisfiesMVCCBatch() already reads every tuple header during
> visibility checks, so persisting the result into rs_vistuples[]
> eliminates the downstream re-read (in heapgettup_pagemode()) with no
> measurable overhead.  That said, these numbers are somewhat noisy on
> my machine.  Results on other machines would be welcome.
>
> Patches 0002-0005 add the RowBatch infrastructure, the batch AM API
> and heapam implementation including seqscan variants that use the new
> scan_getnextbatch() API, and EXPLAIN (ANALYZE, BATCHES) support,
> respectively. With batching enabled (executor_batch_rows=300,
> ~MaxHeapTuplesPerPage):
>
>   query                           all-visible    not-all-visible
>   count(*)                        +11 to +15%    +9 to +13%
>   count(*) WHERE id % 10 = 0      +6 to +11%     +10 to +14%
>   SELECT * LIMIT 1 OFFSET N       +16 to +19%    +16 to +22%
>   SELECT * WHERE id%10=0 LIMIT    +8 to +10%     +8 to +13%
>
> With executor_batch_rows=0, results are within noise of master across
> all query types and sizes, confirming no regression from the
> infrastructure changes themselves.  The not-all-visible results tend
> to show slightly higher gains than the all-visible case. This is
> likely because the existing heapam code is more optimized for the
> all-visible path, so the not-all-visible path, which goes through
> HeapTupleSatisfiesMVCCBatch() for per-tuple visibility checks, has
> more headroom that batching can exploit.
>
> Setting aside the current series for a moment, there are some broader
> design questions worth raising while we have attention on this area.
> Some of these echo points Tomas raised in his first reply on this
> thread.  I am reiterating them deliberately: I have either not managed
> to fully address them on my own or did not need to for the
> TAM-to-scan-node batching, and I think they would benefit from wider
> input rather than just my own iteration.
>
> We should also start thinking about other ways the executor can
> consume batch rows, not always assuming they are presented as
> HeapTupleData. For instance, an AM could expose decoded column arrays
> directly to operators that can consume them, bypassing slot-based
> deform entirely, or a columnar AM could implement scan_getnextbatch by
> decoding column strips directly into the batch without going through
> per-tuple HeapTupleData at all. Feedback on whether the current
> RowBatch design and the choices made in the scan_getnextbatch and
> RowBatchOps API make that sort of thing harder than it needs to be
> would be appreciated. For example, heapam's implementation of
> scan_getnextbatch uses a single TTSOpsBufferHeapTuple slot re-pointed
> to HeapTupleData entries one at a time via repoint_slot in
> RowBatchHeapOps. That works for heapam but a columnar AM could
> implement scan_getnextbatch to decode column strips directly into
> arrays in the batch, with no per-row repoint step needed at all. Any
> adjustments that would make RowBatch more AM-agnostic are worth
> discussing now before the design hardens.
>
> There are also broader open questions about how far the batch model
> can extend beyond the scan node. Qual pushdown into the AM has been
> discussed in nearby threads and would be one way to allow expression
> evaluation to happen before data reaches the executor proper, though
> that is a separate effort. For the purposes of this series, expression
> evaluation still happens in the executor after scan_getnextbatch
> returns. If the scan node does not project, the buffer heap slot is
> passed directly to the parent node, which calls slot callbacks to
> deform as needed. But once a node above projects, aggregates, or
> joins, the notion of a page-sized batch from a single AM loses its
> meaning and virtual slots take over. Whether RowBatch is usable or
> meaningful beyond the scan/TAM boundary in any form, and whether the
> core executor will ever have non-HeapTupleData batch consumption paths
> or leave that entirely to extensions, are open questions worth
> discussing.
>
> For RowBatch to eventually play the role that TupleTableSlot plays for
> row-at-a-time execution, something inside it would need to serve as
> the common currency for batch data, analogous to TupleTableSlot's
> datum/isnull arrays. Column arrays are the obvious direction, but even
> that leaves open the question of representation. PostgreSQL's Datum is
> a pointer-sized abstraction that boxes everything, whereas vectorized
> systems use typed packed arrays of native types with validity
> bitmasks, which is a significant part of why tight vectorized loops
> are fast there. Whether column arrays of Datum would be good enough,
> or whether going further toward typed packed arrays would be necessary
> to get meaningful vectorization, is a deeper design question that this
> series deliberately does not try to answer.
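
To make the contrast in the paragraph above concrete, here is a
minimal sketch of what "typed packed arrays of native types with
validity bitmasks" can look like for a single int32 column.  It is an
illustration only: ToyInt32Column and toy_sum_int32 are hypothetical
names, and nothing like this exists in the series or in RowBatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packed column: native values plus a validity bitmask. */
typedef struct ToyInt32Column
{
	const int32_t *values;		/* one native value per row, no boxing */
	const uint8_t *valid;		/* bit i set => row i is not NULL */
	size_t		nrows;
} ToyInt32Column;

/*
 * Sum the non-NULL values.  The loop body has no function call and no
 * branch on NULL-ness, which is the shape compilers can vectorize.
 */
static int64_t
toy_sum_int32(const ToyInt32Column *col)
{
	int64_t		sum = 0;

	for (size_t i = 0; i < col->nrows; i++)
	{
		/* 0 or 1, so NULL rows contribute nothing */
		int64_t		is_valid = (col->valid[i / 8] >> (i % 8)) & 1;

		sum += is_valid * col->values[i];
	}
	return sum;
}
```

A column array of Datum would keep the same outer loop but would need
a per-type unboxing step for each element, which is the representation
question the paragraph leaves open.
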
>
> Even though the focus is on getting batching working at the scan/TAM
> boundary first, thoughts on any of these points would be welcome.

Rebased.

-- 
Thanks, Amit Langote


Attachments:

  [application/octet-stream] v7-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch (12.8K, 2-v7-0001-heapam-store-full-HeapTupleData-in-rs_vistuples-f.patch)
  download | inline diff:
From 1557236686140c29be98dc461e97f8df4a0f1a73 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 12 Mar 2026 09:18:04 +0900
Subject: [PATCH v7 1/5] heapam: store full HeapTupleData in rs_vistuples[] for
 pagemode scans

page_collect_tuples() builds full HeapTupleData headers for every
visible tuple on a page -- t_data, t_len, t_self, t_tableOid -- but
previously discarded them immediately after writing just the OffsetNumber
of each survivor into rs_vistuples[].  heapgettup_pagemode() then
re-derived those same values on every call from the saved OffsetNumber
via PageGetItemId() and PageGetItem().

Change rs_vistuples[] element type from OffsetNumber to HeapTupleData
and populate it inside page_collect_tuples() while lpp, lineoff, page,
block, and relid are already in scope, so no additional page reads are
needed.  For the all_visible path (the common case on a primary not
under active modification) the write piggy-backs on the existing
per-lineoff loop.  For the !all_visible path, HeapTupleData entries are
written during the visibility loop and compacted to visible survivors
afterwards using batchmvcc.visible[], avoiding a return to pd_linp[] via
PageGetItemId().

With rs_vistuples[] populated, heapgettup_pagemode() replaces the
per-tuple PageGetItemId/PageGetItem calls with a single struct copy:

    *tuple = scan->rs_vistuples[lineindex];

The stack-local HeapTupleData array in BatchMVCCState is eliminated by
passing rs_vistuples[] directly to HeapTupleSatisfiesMVCCBatch(),
saving MaxHeapTuplesPerPage * 24 bytes of stack per page_collect_tuples()
call.  HeapTupleSatisfiesMVCCBatch() loses its vistuples_dense parameter
since compaction is now handled by the caller.

t_tableOid is pre-initialized for all rs_vistuples[] entries at scan
start in heap_beginscan(), eliminating a store per visible tuple from the
fill loop.  The raw ItemId word is read once per tuple with lp_off and
lp_len extracted via mask and shift rather than calling ItemIdGetOffset()
and ItemIdGetLength() separately, avoiding a potential second load from
the same address in the inner loop.

Having pre-built HeapTupleData headers available at the scan descriptor
level also lays groundwork for a batched tuple interface, where an AM
can serve multiple tuples per call without repeating the line pointer
traversal.

Suggested-by: Andres Freund <[email protected]>
---
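Reader's note, not part of the commit: a minimal sketch of the "read
the raw ItemId word once, then mask and shift" step described above.
It assumes the bit layout implied by the constants in the patch
(offset in bits 0-14, two flag bits in bits 15-16, length in bits
17-31) and packs the word explicitly, so the sketch does not depend on
how a compiler lays out bitfields.  The toy_* names are hypothetical
and do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed layout of the 32-bit line pointer word, matching the
 * 0x7fff mask and the shift by 17 used in the patch.
 */
#define TOY_LP_OFF_MASK		0x7fffu
#define TOY_LP_FLAGS_SHIFT	15
#define TOY_LP_FLAGS_MASK	0x3u
#define TOY_LP_LEN_SHIFT	17
#define TOY_LP_LEN_MASK		0x7fffu

/* Build a line pointer word from its three fields. */
static uint32_t
toy_lp_pack(uint32_t off, uint32_t flags, uint32_t len)
{
	return (off & TOY_LP_OFF_MASK) |
		((flags & TOY_LP_FLAGS_MASK) << TOY_LP_FLAGS_SHIFT) |
		((len & TOY_LP_LEN_MASK) << TOY_LP_LEN_SHIFT);
}

/*
 * Extract offset and length from one already-loaded word.  The length
 * occupies the top 15 bits, so a plain right shift is enough and no
 * mask is needed for it.
 */
static void
toy_lp_decode(uint32_t lp_val, uint32_t *off, uint32_t *len)
{
	*off = lp_val & TOY_LP_OFF_MASK;
	*len = lp_val >> TOY_LP_LEN_SHIFT;
}
```

The flag bits sit between the two fields, which is why the shift is by
17 rather than 15.
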
 src/backend/access/heap/heapam.c            | 73 ++++++++++++---------
 src/backend/access/heap/heapam_handler.c    | 19 ++----
 src/backend/access/heap/heapam_visibility.c | 21 +++---
 src/include/access/heapam.h                 |  5 +-
 4 files changed, 58 insertions(+), 60 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e06ce2db2cf..b70c75c8288 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -524,7 +524,6 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
 					BlockNumber block, int lines,
 					bool all_visible, bool check_serializable)
 {
-	Oid			relid = RelationGetRelid(scan->rs_base.rs_rd);
 	int			ntup = 0;
 	int			nvis = 0;
 	BatchMVCCState batchmvcc;
@@ -536,7 +535,7 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
 	for (OffsetNumber lineoff = FirstOffsetNumber; lineoff <= lines; lineoff++)
 	{
 		ItemId		lpp = PageGetItemId(page, lineoff);
-		HeapTuple	tup;
+		HeapTuple   tup = &scan->rs_vistuples[ntup];
 
 		if (unlikely(!ItemIdIsNormal(lpp)))
 			continue;
@@ -549,25 +548,33 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
 		 */
 		if (!all_visible || check_serializable)
 		{
-			tup = &batchmvcc.tuples[ntup];
+			uint32  lp_val = *(uint32 *) lpp;
 
-			tup->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
-			tup->t_len = ItemIdGetLength(lpp);
-			tup->t_tableOid = relid;
+			tup->t_data = (HeapTupleHeader) ((char *) page + (lp_val & 0x7fff));
+			tup->t_len  = lp_val >> 17;
+			Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
 			ItemPointerSet(&(tup->t_self), block, lineoff);
 		}
 
-		/*
-		 * If the page is all visible, these fields otherwise won't be
-		 * populated in loop below.
-		 */
 		if (all_visible)
 		{
 			if (check_serializable)
-			{
 				batchmvcc.visible[ntup] = true;
+
+			/*
+			 * In the all_visible && !check_serializable path, the block
+			 * above was skipped, so tup's fields have not been set yet.
+			 * Fill them here while lpp is still in hand.
+			 */
+			if (!check_serializable)
+			{
+				uint32  lp_val = *(uint32 *) lpp;
+
+				tup->t_data = (HeapTupleHeader) ((char *) page + (lp_val & 0x7fff));
+				tup->t_len  = lp_val >> 17;
+				Assert(tup->t_tableOid == RelationGetRelid(scan->rs_base.rs_rd));
+				ItemPointerSet(&tup->t_self, block, lineoff);
 			}
-			scan->rs_vistuples[ntup] = lineoff;
 		}
 
 		ntup++;
@@ -598,11 +605,24 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snapshot,
 		{
 			HeapCheckForSerializableConflictOut(batchmvcc.visible[i],
 												scan->rs_base.rs_rd,
-												&batchmvcc.tuples[i],
+												&scan->rs_vistuples[i],
 												buffer, snapshot);
 		}
 	}
 
+
+	/* Now compact rs_vistuples[] to visible survivors only */
+	if (!all_visible)
+	{
+		int dst = 0;
+		for (int i = 0; i < ntup; i++)
+		{
+			if (batchmvcc.visible[i])
+				scan->rs_vistuples[dst++] = scan->rs_vistuples[i];
+		}
+		Assert(dst == nvis);
+	}
+
 	return nvis;
 }
 
@@ -1074,14 +1094,13 @@ heapgettup_pagemode(HeapScanDesc scan,
 					ScanKey key)
 {
 	HeapTuple	tuple = &(scan->rs_ctup);
-	Page		page;
 	uint32		lineindex;
 	uint32		linesleft;
 
 	if (likely(scan->rs_inited))
 	{
 		/* continue from previously returned page/tuple */
-		page = BufferGetPage(scan->rs_cbuf);
+		Assert(BufferIsValid(scan->rs_cbuf));
 
 		lineindex = scan->rs_cindex + dir;
 		if (ScanDirectionIsForward(dir))
@@ -1109,29 +1128,21 @@ heapgettup_pagemode(HeapScanDesc scan,
 
 		/* prune the page and determine visible tuple offsets */
 		heap_prepare_pagescan((TableScanDesc) scan);
-		page = BufferGetPage(scan->rs_cbuf);
 		linesleft = scan->rs_ntuples;
 		lineindex = ScanDirectionIsForward(dir) ? 0 : linesleft - 1;
 
-		/* block is the same for all tuples, set it once outside the loop */
-		ItemPointerSetBlockNumber(&tuple->t_self, scan->rs_cblock);
-
 		/* lineindex now references the next or previous visible tid */
 continue_page:
 
 		for (; linesleft > 0; linesleft--, lineindex += dir)
 		{
-			ItemId		lpp;
-			OffsetNumber lineoff;
-
-			Assert(lineindex < scan->rs_ntuples);
-			lineoff = scan->rs_vistuples[lineindex];
-			lpp = PageGetItemId(page, lineoff);
-			Assert(ItemIdIsNormal(lpp));
-
-			tuple->t_data = (HeapTupleHeader) PageGetItem(page, lpp);
-			tuple->t_len = ItemIdGetLength(lpp);
-			ItemPointerSetOffsetNumber(&tuple->t_self, lineoff);
+			/*
+			 * Headers were pre-built by page_collect_tuples() into
+			 * rs_vistuples[].  Copy the entry; t_data still points into the
+			 * pinned page, which is safe for the lifetime of the current page
+			 * scan.
+			 */
+			*tuple = scan->rs_vistuples[lineindex];
 
 			/* skip any tuples that don't match the scan key */
 			if (key != NULL &&
@@ -1245,6 +1256,8 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 
 	/* we only need to set this up once */
 	scan->rs_ctup.t_tableOid = RelationGetRelid(relation);
+	for (int i = 0; i < MaxHeapTuplesPerPage; i++)
+		scan->rs_vistuples[i].t_tableOid = RelationGetRelid(relation);
 
 	/*
 	 * Allocate memory to keep track of page allocation for parallel workers
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 07f07188d46..88add129674 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2050,9 +2050,6 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
 {
 	BitmapHeapScanDesc bscan = (BitmapHeapScanDesc) scan;
 	HeapScanDesc hscan = (HeapScanDesc) bscan;
-	OffsetNumber targoffset;
-	Page		page;
-	ItemId		lp;
 
 	/*
 	 * Out of range?  If so, nothing more to look at on this page
@@ -2067,15 +2064,7 @@ heapam_scan_bitmap_next_tuple(TableScanDesc scan,
 			return false;
 	}
 
-	targoffset = hscan->rs_vistuples[hscan->rs_cindex];
-	page = BufferGetPage(hscan->rs_cbuf);
-	lp = PageGetItemId(page, targoffset);
-	Assert(ItemIdIsNormal(lp));
-
-	hscan->rs_ctup.t_data = (HeapTupleHeader) PageGetItem(page, lp);
-	hscan->rs_ctup.t_len = ItemIdGetLength(lp);
-	hscan->rs_ctup.t_tableOid = scan->rs_rd->rd_id;
-	ItemPointerSet(&hscan->rs_ctup.t_self, hscan->rs_cblock, targoffset);
+	hscan->rs_ctup = hscan->rs_vistuples[hscan->rs_cindex];
 
 	pgstat_count_heap_fetch(scan->rs_rd);
 
@@ -2353,7 +2342,7 @@ SampleHeapTupleVisible(TableScanDesc scan, Buffer buffer,
 		while (start < end)
 		{
 			uint32		mid = start + (end - start) / 2;
-			OffsetNumber curoffset = hscan->rs_vistuples[mid];
+			OffsetNumber curoffset = hscan->rs_vistuples[mid].t_self.ip_posid;
 
 			if (tupoffset == curoffset)
 				return true;
@@ -2473,7 +2462,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
 			ItemPointerSet(&tid, block, offnum);
 			if (heap_hot_search_buffer(&tid, scan->rs_rd, buffer, snapshot,
 									   &heapTuple, NULL, true))
-				hscan->rs_vistuples[ntup++] = ItemPointerGetOffsetNumber(&tid);
+				hscan->rs_vistuples[ntup++] = heapTuple;
 		}
 	}
 	else
@@ -2502,7 +2491,7 @@ BitmapHeapScanNextBlock(TableScanDesc scan,
 			valid = HeapTupleSatisfiesVisibility(&loctup, snapshot, buffer);
 			if (valid)
 			{
-				hscan->rs_vistuples[ntup++] = offnum;
+				hscan->rs_vistuples[ntup++] = loctup;
 				PredicateLockTID(scan->rs_rd, &loctup.t_self, snapshot,
 								 HeapTupleHeaderGetXmin(loctup.t_data));
 			}
diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
index 3a6a1e5a084..7162c848097 100644
--- a/src/backend/access/heap/heapam_visibility.c
+++ b/src/backend/access/heap/heapam_visibility.c
@@ -1671,16 +1671,16 @@ HeapTupleSatisfiesHistoricMVCC(HeapTuple htup, Snapshot snapshot,
 }
 
 /*
- * Perform HeaptupleSatisfiesMVCC() on each passed in tuple. This is more
+ * Perform HeapTupleSatisfiesMVCC() on each passed in tuple. This is more
  * efficient than doing HeapTupleSatisfiesMVCC() one-by-one.
  *
- * To be checked tuples are passed via BatchMVCCState->tuples. Each tuple's
- * visibility is stored in batchmvcc->visible[]. In addition,
- * ->vistuples_dense is set to contain the offsets of visible tuples.
+ * Each tuple's visibility is stored in batchmvcc->visible[].  The caller
+ * is responsible for compacting the tuples array to contain only visible
+ * survivors after this function returns.
  *
- * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that it
- * avoids a cross-translation-unit function call for each tuple, allows the
- * compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
+ * The reason this is more efficient than HeapTupleSatisfiesMVCC() is that
+ * it avoids a cross-translation-unit function call for each tuple, allows
+ * the compiler to optimize across calls to HeapTupleSatisfiesMVCC and allows
  * setting hint bits more efficiently (see the one BufferFinishSetHintBits()
  * call below).
  *
@@ -1690,7 +1690,7 @@ int
 HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 							int ntups,
 							BatchMVCCState *batchmvcc,
-							OffsetNumber *vistuples_dense)
+							HeapTupleData *tuples)
 {
 	int			nvis = 0;
 	SetHintBitsState state = SHB_INITIAL;
@@ -1700,16 +1700,13 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 	for (int i = 0; i < ntups; i++)
 	{
 		bool		valid;
-		HeapTuple	tup = &batchmvcc->tuples[i];
+		HeapTuple	tup = &tuples[i];
 
 		valid = HeapTupleSatisfiesMVCC(tup, snapshot, buffer, &state);
 		batchmvcc->visible[i] = valid;
 
 		if (likely(valid))
-		{
-			vistuples_dense[nvis] = tup->t_self.ip_posid;
 			nvis++;
-		}
 	}
 
 	if (state == SHB_ENABLED)
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 5176478c295..56f2d1a5748 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -102,7 +102,7 @@ typedef struct HeapScanDescData
 	/* these fields only used in page-at-a-time mode and for bitmap scans */
 	uint32		rs_cindex;		/* current tuple's index in vistuples */
 	uint32		rs_ntuples;		/* number of visible tuples on page */
-	OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];	/* their offsets */
+	HeapTupleData rs_vistuples[MaxHeapTuplesPerPage];	/* visible tuples */
 } HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
@@ -498,14 +498,13 @@ extern bool HeapTupleIsSurelyDead(HeapTuple htup,
  */
 typedef struct BatchMVCCState
 {
-	HeapTupleData tuples[MaxHeapTuplesPerPage];
 	bool		visible[MaxHeapTuplesPerPage];
 } BatchMVCCState;
 
 extern int	HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
 										int ntups,
 										BatchMVCCState *batchmvcc,
-										OffsetNumber *vistuples_dense);
+										HeapTupleData *tuples);
 
 /*
  * To avoid leaking too much knowledge about reorderbuffer implementation
-- 
2.47.3
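
The HeapTupleSatisfiesMVCCBatch() comment in the patch above leaves
compaction of the tuples array to the caller.  A minimal sketch of what
such a caller-side step might look like follows.  It is an illustration
only: DemoTuple and demo_compact_visible() are hypothetical names, not
code from the patch, and DemoTuple models just an offset rather than a
full HeapTupleData.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for HeapTupleData; only the offset is modeled. */
typedef struct DemoTuple
{
	unsigned short offset;		/* stand-in for t_self.ip_posid */
} DemoTuple;

/*
 * Move the entries of tuples[] whose visible[] flag is set to the front
 * of the array, preserving their relative order, and return how many
 * entries survived.  Entries past the returned count are left as-is.
 */
int
demo_compact_visible(DemoTuple *tuples, const bool *visible, int ntups)
{
	int			nvis = 0;

	assert(tuples != NULL && visible != NULL);

	for (int i = 0; i < ntups; i++)
	{
		if (visible[i])
		{
			/* skip the copy while no invisible tuple has been seen yet */
			if (nvis != i)
				tuples[nvis] = tuples[i];
			nvis++;
		}
	}

	return nvis;
}
```

For offsets 1..5 with visibility {t, f, t, f, t}, the first three
entries afterwards hold offsets 1, 3 and 5 and the function returns 3.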



  [application/octet-stream] v7-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch (17.4K, 3-v7-0005-Add-EXPLAIN-BATCHES-option-for-tuple-batching-sta.patch)
  download | inline diff:
From 8beefb53e7fa94a060456d1321f36abb221cbe47 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Sat, 20 Dec 2025 23:09:37 +0900
Subject: [PATCH v7 5/5] Add EXPLAIN (BATCHES) option for tuple batching
 statistics

Add a BATCHES option to EXPLAIN that reports per-node batch statistics
when a node uses batch mode execution.

For nodes that support batching (currently SeqScan), this shows the
number of batches fetched along with average, minimum, and maximum
rows per batch. Output is supported in both text and non-text formats.

Add regression tests covering text output, JSON format, filtered scans,
LIMIT, and disabled batching.

Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
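The statistics described in the commit message above can be modeled
outside the backend as follows.  This is a standalone sketch that
mirrors the bookkeeping in RowBatchRecordStats() and RowBatchAvgRows()
from this patch; the Demo* names are hypothetical and plain C types
replace the PostgreSQL ones.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical standalone counterpart of RowBatchStats. */
typedef struct DemoBatchStats
{
	int64_t		batches;	/* total number of batches recorded */
	int64_t		rows;		/* total rows across all batches */
	int			max_rows;	/* largest batch seen */
	int			min_rows;	/* smallest non-empty batch; INT_MAX if none */
} DemoBatchStats;

void
demo_stats_init(DemoBatchStats *s)
{
	assert(s != NULL);
	s->batches = 0;
	s->rows = 0;
	s->max_rows = 0;
	s->min_rows = INT_MAX;	/* sentinel: no non-empty batch seen yet */
}

/* Record one fetched batch of 'rows' rows. */
void
demo_stats_record(DemoBatchStats *s, int rows)
{
	s->batches++;
	s->rows += rows;
	if (rows > s->max_rows)
		s->max_rows = rows;
	if (rows > 0 && rows < s->min_rows)
		s->min_rows = rows;
}

/* Average rows per batch, or 0.0 when nothing was recorded. */
double
demo_stats_avg(const DemoBatchStats *s)
{
	if (s->batches == 0)
		return 0.0;
	return (double) s->rows / (double) s->batches;
}

/* Map the INT_MAX sentinel to 0 for display, as the EXPLAIN code does. */
int
demo_stats_min_for_display(const DemoBatchStats *s)
{
	return s->min_rows == INT_MAX ? 0 : s->min_rows;
}
```

Recording batches of 64, 64 and 30 rows gives 3 batches and 158 rows,
so the average is about 52.7, with a maximum of 64 and a minimum of 30.
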
 src/backend/commands/explain.c        |  44 +++++++++++
 src/backend/commands/explain_state.c  |   8 ++
 src/backend/executor/execRowBatch.c   |  44 ++++++++++-
 src/backend/executor/nodeSeqscan.c    |   8 +-
 src/include/commands/explain_state.h  |   1 +
 src/include/executor/execRowBatch.h   |  22 +++++-
 src/include/executor/instrument.h     |   1 +
 src/test/regress/expected/explain.out | 107 ++++++++++++++++++++++++++
 src/test/regress/sql/explain.sql      |  59 ++++++++++++++
 9 files changed, 291 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 73eaaf176ac..8c98ca57c92 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -22,6 +22,7 @@
 #include "commands/explain_format.h"
 #include "commands/explain_state.h"
 #include "commands/prepare.h"
+#include "executor/execRowBatch.h"
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
 #include "libpq/pqformat.h"
@@ -519,6 +520,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		instrument_option |= INSTRUMENT_BUFFERS;
 	if (es->wal)
 		instrument_option |= INSTRUMENT_WAL;
+	if (es->batches)
+		instrument_option |= INSTRUMENT_BATCHES;
 
 	/*
 	 * We always collect timing for the entire statement, even when node-level
@@ -1370,6 +1373,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	int			save_indent = es->indent;
 	bool		haschildren;
 	bool		isdisabled;
+	RowBatch   *batch = NULL;
 
 	/*
 	 * Prepare per-worker output buffers, if needed.  We'll append the data in
@@ -2296,6 +2300,46 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (es->wal && planstate->instrument)
 		show_wal_usage(es, &planstate->instrument->instr.walusage);
 
+	/* BATCHES */
+	switch (nodeTag(plan))
+	{
+		case T_SeqScan:
+			batch = castNode(SeqScanState, planstate)->batch;
+			break;
+		default:
+			break;
+	}
+
+	if (es->batches && batch)
+	{
+		RowBatchStats *stats = batch->stats;
+
+		Assert(stats);
+		if (stats->batches > 0)
+		{
+			if (es->format == EXPLAIN_FORMAT_TEXT)
+			{
+				ExplainIndentText(es);
+				appendStringInfo(es->str,
+								 "Batches: %lld  Avg Rows: %.1f  Max: %d  Min: %d\n",
+								 (long long) stats->batches,
+								 RowBatchAvgRows(batch), stats->max_rows,
+								 stats->min_rows == INT_MAX ? 0 :
+								 stats->min_rows);
+			}
+			else
+			{
+				ExplainPropertyInteger("Batches", NULL, stats->batches, es);
+				ExplainPropertyFloat("Average Batch Rows", NULL,
+									 RowBatchAvgRows(batch), 1, es);
+				ExplainPropertyInteger("Max Batch Rows", NULL, stats->max_rows, es);
+				ExplainPropertyInteger("Min Batch Rows", NULL,
+									   stats->min_rows == INT_MAX ? 0 :
+									   stats->min_rows, es);
+			}
+		}
+	}
+
 	/* Prepare per-worker buffer/WAL usage */
 	if (es->workers_state && (es->buffers || es->wal) && es->verbose)
 	{
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index 77f59b8e500..28022a171cd 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -159,6 +159,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
 								"EXPLAIN", opt->defname, p),
 						 parser_errposition(pstate, opt->location)));
 		}
+		else if (strcmp(opt->defname, "batches") == 0)
+			es->batches = defGetBoolean(opt);
 		else if (!ApplyExtensionExplainOption(es, opt, pstate))
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -198,6 +200,12 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
 				 errmsg("%s options %s and %s cannot be used together",
 						"EXPLAIN", "ANALYZE", "GENERIC_PLAN")));
 
+	/* check that BATCHES is used with EXPLAIN ANALYZE */
+	if (es->batches && !es->analyze)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("EXPLAIN option %s requires ANALYZE", "BATCHES")));
+
 	/* if the summary was not set explicitly, set default value */
 	es->summary = (summary_set) ? es->summary : es->analyze;
 
diff --git a/src/backend/executor/execRowBatch.c b/src/backend/executor/execRowBatch.c
index 6a298813bd8..6ef54deca04 100644
--- a/src/backend/executor/execRowBatch.c
+++ b/src/backend/executor/execRowBatch.c
@@ -20,7 +20,7 @@
  *		Allocate and initialize a new RowBatch envelope.
  */
 RowBatch *
-RowBatchCreate(int max_rows)
+RowBatchCreate(int max_rows, bool track_stats)
 {
 	RowBatch   *b;
 
@@ -35,6 +35,20 @@ RowBatchCreate(int max_rows)
 	b->materialized = false;
 	b->slot = NULL;
 
+	if (track_stats)
+	{
+		RowBatchStats *stats = palloc_object(RowBatchStats);
+
+		stats->batches = 0;
+		stats->rows = 0;
+		stats->max_rows = 0;
+		stats->min_rows = INT_MAX;
+
+		b->stats = stats;
+	}
+	else
+		b->stats = NULL;
+
 	return b;
 }
 
@@ -52,3 +66,31 @@ RowBatchReset(RowBatch *b, bool drop_slots)
 	b->materialized = false;
 	/* b->slot belongs to the owning PlanState node */
 }
+
+void
+RowBatchRecordStats(RowBatch *b, int rows)
+{
+	RowBatchStats *stats = b->stats;
+
+	if (stats == NULL)
+		return;
+
+	stats->batches++;
+	stats->rows += rows;
+	if (rows > stats->max_rows)
+		stats->max_rows = rows;
+	if (rows < stats->min_rows && rows > 0)
+		stats->min_rows = rows;
+}
+
+double
+RowBatchAvgRows(RowBatch *b)
+{
+	RowBatchStats *stats = b->stats;
+
+	Assert(stats != NULL);
+	if (stats->batches == 0)
+		return 0.0;
+
+	return (double) stats->rows / stats->batches;
+}
diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index d0ce8858c49..135b0a4f9a2 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -247,8 +247,12 @@ SeqScanCanUseBatching(SeqScanState *scanstate, int eflags)
 static void
 SeqScanInitBatching(SeqScanState *scanstate)
 {
-	RowBatch   *batch = RowBatchCreate(MaxHeapTuplesPerPage);
+	RowBatch   *batch;
+	EState	   *estate = scanstate->ss.ps.state;
+	bool		track_stats =
+		(estate->es_instrument & INSTRUMENT_BATCHES) != 0;
 
+	batch = RowBatchCreate(MaxHeapTuplesPerPage, track_stats);
 	batch->slot = scanstate->ss.ss_ScanTupleSlot;
 	scanstate->batch = batch;
 
@@ -351,6 +355,8 @@ SeqNextBatch(SeqScanState *node)
 	if (!table_scan_getnextbatch(scandesc, b, direction))
 		return false;
 
+	RowBatchRecordStats(b, b->nrows);
+
 	return true;
 }
 
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 5a48bc6fbb1..579ca4cfa20 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -56,6 +56,7 @@ typedef struct ExplainState
 	bool		memory;			/* print planner's memory usage information */
 	bool		settings;		/* print modified settings */
 	bool		generic;		/* generate a generic plan */
+	bool		batches;		/* print batch statistics */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
 	ExplainFormat format;		/* output format */
 	/* state for output formatting --- not reset for each new plan tree */
diff --git a/src/include/executor/execRowBatch.h b/src/include/executor/execRowBatch.h
index 021fdeecc73..ad0b4763b70 100644
--- a/src/include/executor/execRowBatch.h
+++ b/src/include/executor/execRowBatch.h
@@ -13,9 +13,12 @@
 #ifndef EXECROWBATCH_H
 #define EXECROWBATCH_H
 
+#include <limits.h>
+
 #include "executor/tuptable.h"
 
 typedef struct RowBatchOps RowBatchOps;
+typedef struct RowBatchStats RowBatchStats;
 
 /*
  * RowBatch
@@ -38,6 +41,9 @@ typedef struct RowBatch
 	bool		materialized;		/* tuples in slots valid? */
 
 	TupleTableSlot *slot;			/* row view */
+
+	RowBatchStats *stats;			/* NULL if instrumentation stats
+									 * are not requested */
 } RowBatch;
 
 /*
@@ -58,8 +64,17 @@ typedef struct RowBatchOps
 	void		(*repoint_slot) (RowBatch *b, int idx);
 } RowBatchOps;
 
+/* Instrumentation stats populated for EXPLAIN (ANALYZE, BATCHES) */
+typedef struct RowBatchStats
+{
+	int64	batches;	/* total number of batches fetched */
+	int64	rows;		/* total tuples across all batches */
+	int		max_rows;	/* max rows in any single batch */
+	int		min_rows;	/* min rows in any non-empty batch */
+} RowBatchStats;
+
 /* Create/teardown */
-extern RowBatch *RowBatchCreate(int max_rows);
+extern RowBatch *RowBatchCreate(int max_rows, bool track_stats);
 extern void RowBatchReset(RowBatch *b, bool drop_slots);
 
 /* Validation */
@@ -85,4 +100,9 @@ RowBatchGetNextSlot(RowBatch *b)
 	return b->slot;
 }
 
+/* Batching stats */
+
+extern void RowBatchRecordStats(RowBatch *b, int rows);
+extern double RowBatchAvgRows(RowBatch *b);
+
 #endif	/* EXECROWBATCH_H */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index cc9fbb0e2f0..89df74a86c1 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -64,6 +64,7 @@ typedef enum InstrumentOption
 	INSTRUMENT_BUFFERS = 1 << 1,	/* needs buffer usage */
 	INSTRUMENT_ROWS = 1 << 2,	/* needs row count */
 	INSTRUMENT_WAL = 1 << 3,	/* needs WAL usage */
+	INSTRUMENT_BATCHES = 1 << 4,	/* needs batch statistics */
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..950de5a9d78 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,110 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
 (9 rows)
 
 reset work_mem;
+-- Test BATCHES option
+set executor_batch_rows = 64;
+create temp table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+-- BATCHES without ANALYZE should error
+explain (batches, costs off) select * from batch_test;
+ERROR:  EXPLAIN option BATCHES requires ANALYZE
+-- BATCHES without ANALYZE but with other options
+explain (batches, buffers off, costs off) select * from batch_test;
+ERROR:  EXPLAIN option BATCHES requires ANALYZE
+-- Basic: verify batch stats line appears in text format
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+                         explain_filter                         
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+   Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(4 rows)
+
+-- With filter: batch line still appears
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+                         explain_filter                         
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+   Filter: (a > N)
+   Rows Removed by Filter: N
+   Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With non-batchable qual (OR): batching still active but
+-- batch qual falls back to per-tuple ExecQual
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000 or b is null');
+                         explain_filter                         
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+   Filter: ((a > N) OR (b IS NULL))
+   Rows Removed by Filter: N
+   Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(6 rows)
+
+-- With LIMIT: batch stats appear on child Seq Scan node
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+                            explain_filter                            
+----------------------------------------------------------------------
+ Limit (actual time=N.N..N.N rows=N.N loops=N)
+   ->  Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+         Batches: N  Avg Rows: N.N  Max: N  Min: N
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(5 rows)
+
+-- Verify batch stats keys present in JSON output
+select
+  j #> '{0,Plan}' ? 'Batches' as has_batches,
+  j #> '{0,Plan}' ? 'Average Batch Rows' as has_avg,
+  j #> '{0,Plan}' ? 'Max Batch Rows' as has_max,
+  j #> '{0,Plan}' ? 'Min Batch Rows' as has_min
+from explain_filter_to_json(
+  'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+ has_batches | has_avg | has_max | has_min 
+-------------+---------+---------+---------
+ t           | t       | t       | t
+(1 row)
+
+-- With LIMIT: batch stats keys on child node in JSON
+select
+  j #> '{0,Plan,Plans,0}' ? 'Batches' as child_has_batches,
+  j #> '{0,Plan,Plans,0}' ? 'Average Batch Rows' as child_has_avg,
+  j #> '{0,Plan,Plans,0}' ? 'Max Batch Rows' as child_has_max,
+  j #> '{0,Plan,Plans,0}' ? 'Min Batch Rows' as child_has_min
+from explain_filter_to_json(
+  'explain (analyze, batches, buffers off, format json) select * from batch_test limit 100'
+) as j;
+ child_has_batches | child_has_avg | child_has_max | child_has_min 
+-------------------+---------------+---------------+---------------
+ t                 | t             | t             | t
+(1 row)
+
+-- Batching disabled: no batch stats in text output
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+                         explain_filter                         
+----------------------------------------------------------------
+ Seq Scan on batch_test (actual time=N.N..N.N rows=N.N loops=N)
+ Planning Time: N.N ms
+ Execution Time: N.N ms
+(3 rows)
+
+-- Batching disabled: no batch keys in JSON
+select
+  j #> '{0,Plan}' ? 'Batches' as has_batches
+from explain_filter_to_json(
+  'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+ has_batches 
+-------------
+ f
+(1 row)
+
+reset executor_batch_rows;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..55acb9058ce 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,62 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
 -- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
 select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
 reset work_mem;
+
+-- Test BATCHES option
+set executor_batch_rows = 64;
+
+create temp table batch_test (a int, b text);
+insert into batch_test select i, repeat('x', 100) from generate_series(1, 10000) i;
+analyze batch_test;
+
+-- BATCHES without ANALYZE should error
+explain (batches, costs off) select * from batch_test;
+
+-- BATCHES without ANALYZE but with other options
+explain (batches, buffers off, costs off) select * from batch_test;
+
+-- Basic: verify batch stats line appears in text format
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- With filter: batch line still appears
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000');
+
+-- With non-batchable qual (OR): batching still active but
+-- batch qual falls back to per-tuple ExecQual
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test where a > 5000 or b is null');
+
+-- With LIMIT: batch stats appear on child Seq Scan node
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test limit 100');
+
+-- Verify batch stats keys present in JSON output
+select
+  j #> '{0,Plan}' ? 'Batches' as has_batches,
+  j #> '{0,Plan}' ? 'Average Batch Rows' as has_avg,
+  j #> '{0,Plan}' ? 'Max Batch Rows' as has_max,
+  j #> '{0,Plan}' ? 'Min Batch Rows' as has_min
+from explain_filter_to_json(
+  'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+
+-- With LIMIT: batch stats keys on child node in JSON
+select
+  j #> '{0,Plan,Plans,0}' ? 'Batches' as child_has_batches,
+  j #> '{0,Plan,Plans,0}' ? 'Average Batch Rows' as child_has_avg,
+  j #> '{0,Plan,Plans,0}' ? 'Max Batch Rows' as child_has_max,
+  j #> '{0,Plan,Plans,0}' ? 'Min Batch Rows' as child_has_min
+from explain_filter_to_json(
+  'explain (analyze, batches, buffers off, format json) select * from batch_test limit 100'
+) as j;
+
+-- Batching disabled: no batch stats in text output
+set executor_batch_rows = 0;
+select explain_filter('explain (analyze, batches, buffers off, costs off) select * from batch_test');
+
+-- Batching disabled: no batch keys in JSON
+select
+  j #> '{0,Plan}' ? 'Batches' as has_batches
+from explain_filter_to_json(
+  'explain (analyze, batches, buffers off, format json) select * from batch_test'
+) as j;
+
+reset executor_batch_rows;
-- 
2.47.3



  [application/octet-stream] v7-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch (6.5K, 4-v7-0002-Add-RowBatch-infrastructure-for-batched-tuple-pro.patch)
  download | inline diff:
From 815d001dcc7a2cda50e3d55522bfaf30ad7fceee Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 5 Mar 2026 17:42:19 +0900
Subject: [PATCH v7 2/5] Add RowBatch infrastructure for batched tuple
 processing

Introduce RowBatch, a data carrier that allows table AMs to deliver
multiple rows per call and the executor to process them as a group.

RowBatch separates three concerns:

  - am_payload: opaque, AM-owned storage (e.g. HeapBatch with pinned
    page and tuple headers).  The AM allocates this in its
    scan_begin_batch callback.

  - slot: a single TupleTableSlot row view.  ops->repoint_slot
    re-points it at the tuple at a given index in am_payload when
    the executor needs tuple data.

  - max_rows: executor-set upper bound that the AM respects when
    filling a batch.

RowBatch does not own selection/filtering state.  Which rows survive
qual evaluation is the executor's concern, tracked separately in
scan node state.  This keeps RowBatch focused on the AM-to-executor
data transfer boundary.

RowBatchOps provides a vtable for AM-specific operations; currently
only repoint_slot is defined.
---
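The consumer side of this design can be sketched in isolation.  The
code below follows the repoint_slot callback and the
RowBatchGetNextSlot() helper declared in execRowBatch.h in this patch,
but it is only a model: every Demo* name is hypothetical, the payload
is a plain int array, and the slot is a pointer to one int rather than
a TupleTableSlot.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoBatch DemoBatch;

/* AM-specific callbacks, modeled on RowBatchOps. */
typedef struct DemoBatchOps
{
	void		(*repoint_slot) (DemoBatch *b, int idx);
} DemoBatchOps;

/* Stand-in for a TupleTableSlot: a view onto one row of the payload. */
typedef struct DemoSlot
{
	const int  *row;
} DemoSlot;

struct DemoBatch
{
	void	   *am_payload;		/* opaque to the consumer */
	const DemoBatchOps *ops;
	int			max_rows;		/* consumer-set upper bound */
	int			nrows;			/* rows the producer put in */
	int			pos;			/* iteration position */
	DemoSlot   *slot;			/* single row view, re-pointed per row */
};

/* Toy "AM": the payload is an int array; point the slot at element idx. */
void
demo_array_repoint(DemoBatch *b, int idx)
{
	const int  *rows = (const int *) b->am_payload;

	assert(idx >= 0 && idx < b->nrows);
	b->slot->row = &rows[idx];
}

const DemoBatchOps demo_array_ops = {demo_array_repoint};

/* Same shape as RowBatchGetNextSlot(): NULL once the batch is exhausted. */
DemoSlot *
demo_get_next_slot(DemoBatch *b)
{
	if (b->pos >= b->nrows)
		return NULL;
	b->ops->repoint_slot(b, b->pos++);
	return b->slot;
}

/* Example consumer: sum every row still left in the batch. */
long
demo_sum_batch(DemoBatch *b)
{
	long		sum = 0;
	DemoSlot   *s;

	while ((s = demo_get_next_slot(b)) != NULL)
		sum += *s->row;
	return sum;
}
```

The consumer never looks inside am_payload; it only asks the callback
to move the one slot.  Filtering state stays with the consumer, which
matches the split of responsibilities described in the commit message.
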
 src/backend/executor/Makefile       |  1 +
 src/backend/executor/execRowBatch.c | 54 ++++++++++++++++++
 src/backend/executor/meson.build    |  1 +
 src/include/executor/execRowBatch.h | 88 +++++++++++++++++++++++++++++
 src/tools/pgindent/typedefs.list    |  2 +
 5 files changed, 146 insertions(+)
 create mode 100644 src/backend/executor/execRowBatch.c
 create mode 100644 src/include/executor/execRowBatch.h

diff --git a/src/backend/executor/Makefile b/src/backend/executor/Makefile
index 11118d0ce02..99a00e762f6 100644
--- a/src/backend/executor/Makefile
+++ b/src/backend/executor/Makefile
@@ -15,6 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = \
 	execAmi.o \
 	execAsync.o \
+	execRowBatch.o \
 	execCurrent.o \
 	execExpr.o \
 	execExprInterp.o \
diff --git a/src/backend/executor/execRowBatch.c b/src/backend/executor/execRowBatch.c
new file mode 100644
index 00000000000..6a298813bd8
--- /dev/null
+++ b/src/backend/executor/execRowBatch.c
@@ -0,0 +1,54 @@
+/*-------------------------------------------------------------------------
+ *
+ * execRowBatch.c
+ *		Helpers for RowBatch
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/executor/execRowBatch.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "executor/execRowBatch.h"
+
+/*
+ * RowBatchCreate
+ *		Allocate and initialize a new RowBatch envelope.
+ */
+RowBatch *
+RowBatchCreate(int max_rows)
+{
+	RowBatch   *b;
+
+	Assert(max_rows > 0);
+
+	b = palloc(sizeof(RowBatch));
+	b->am_payload = NULL;
+	b->ops = NULL;
+	b->max_rows = max_rows;
+	b->nrows = 0;
+	b->pos = 0;
+	b->materialized = false;
+	b->slot = NULL;
+
+	return b;
+}
+
+/*
+ * RowBatchReset
+ *		Reset an existing RowBatch envelope to empty.
+ */
+void
+RowBatchReset(RowBatch *b, bool drop_slots)
+{
+	Assert(b != NULL);
+
+	b->nrows = 0;
+	b->pos = 0;
+	b->materialized = false;
+	/* b->slot belongs to the owning PlanState node */
+}
diff --git a/src/backend/executor/meson.build b/src/backend/executor/meson.build
index dc45be0b2ce..fd0bf80bacd 100644
--- a/src/backend/executor/meson.build
+++ b/src/backend/executor/meson.build
@@ -3,6 +3,7 @@
 backend_sources += files(
   'execAmi.c',
   'execAsync.c',
+  'execRowBatch.c',
   'execCurrent.c',
   'execExpr.c',
   'execExprInterp.c',
diff --git a/src/include/executor/execRowBatch.h b/src/include/executor/execRowBatch.h
new file mode 100644
index 00000000000..021fdeecc73
--- /dev/null
+++ b/src/include/executor/execRowBatch.h
@@ -0,0 +1,88 @@
+/*-------------------------------------------------------------------------
+ *
+ * execRowBatch.h
+ *		Executor batch envelope for passing row batch state upward
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/include/executor/execRowBatch.h
+ *-------------------------------------------------------------------------
+ */
+#ifndef EXECROWBATCH_H
+#define EXECROWBATCH_H
+
+#include "executor/tuptable.h"
+
+typedef struct RowBatchOps RowBatchOps;
+
+/*
+ * RowBatch
+ *
+ * Data carrier from table AM to executor. The AM populates am_payload
+ * and nrows via scan_getnextbatch(). The executor calls ops->repoint_slot
+ * to point the batch's slot at a tuple when it needs tuple data.
+ *
+ * Selection state (which rows survived qual eval) is owned by the executor,
+ * not the batch.
+ */
+typedef struct RowBatch
+{
+	void	   *am_payload;
+	const RowBatchOps *ops;
+
+	int			max_rows;			/* executor-set upper bound */
+	int			nrows;				/* rows TAM put in */
+	int			pos;				/* iteration position */
+	bool		materialized;		/* tuples in slots valid? */
+
+	TupleTableSlot *slot;			/* row view */
+} RowBatch;
+
+/*
+ * RowBatchOps -- AM-specific operations on a RowBatch.
+ *
+ * Table AMs set b->ops during scan_begin_batch to provide
+ * callbacks that the executor uses to access batch contents.
+ *
+ * repoint_slot re-points the batch's single slot to the tuple at
+ * index idx within the current batch.  The slot remains valid until
+ * the next call or until the batch is exhausted.
+ *
+ * Additional callbacks can be added here as new AMs or executor
+ * features require them.
+ */
+typedef struct RowBatchOps
+{
+	void		(*repoint_slot) (RowBatch *b, int idx);
+} RowBatchOps;
+
+/* Create/teardown */
+extern RowBatch *RowBatchCreate(int max_rows);
+extern void RowBatchReset(RowBatch *b, bool drop_slots);
+
+/* Validation */
+static inline bool
+RowBatchIsValid(RowBatch *b)
+{
+	return b != NULL && b->max_rows > 0;
+}
+
+/* Iteration over materialized slots */
+static inline bool
+RowBatchHasMore(RowBatch *b)
+{
+	return b->pos < b->nrows;
+}
+
+static inline TupleTableSlot *
+RowBatchGetNextSlot(RowBatch *b)
+{
+	if (b->pos >= b->nrows)
+		return NULL;
+	b->ops->repoint_slot(b, b->pos++);
+	return b->slot;
+}
+
+#endif	/* EXECROWBATCH_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 35acda59851..e5c172628b3 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2694,6 +2694,8 @@ RoleSpec
 RoleSpecType
 RoleStmtType
 RollupData
+RowBatch
+RowBatchOps
 RowCompareExpr
 RowExpr
 RowIdentityVarInfo
-- 
2.47.3



  [application/octet-stream] v7-0003-Add-batch-table-AM-API-and-heapam-implementation.patch (19.0K, 5-v7-0003-Add-batch-table-AM-API-and-heapam-implementation.patch)
  download | inline diff:
From dd122f0913affbafe95ee4fc79eb656b482fe1e0 Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Mon, 23 Mar 2026 18:21:47 +0900
Subject: [PATCH v7 3/5] Add batch table AM API and heapam implementation

Introduce table AM callbacks for batched tuple fetching:
scan_begin_batch, scan_getnextbatch, scan_reset_batch, and
scan_end_batch.  AMs implement all four or none; checked by
table_supports_batching().

scan_reset_batch releases held resources (e.g. buffer pins)
without freeing, allowing reuse across rescans.

Provide the heapam implementation.  HeapPageBatch (stored in
RowBatch.am_payload) is a thin slice descriptor over the scan's
rs_vistuples[] array, which was introduced in the previous commit.
Rather than owning a copy of tuple headers, HeapPageBatch holds a
pointer into scan->rs_vistuples[] for the current slice and a buffer
pin for the current page.

heap_getnextbatch() calls heap_prepare_pagescan() to populate
rs_vistuples[] for each new page, then re-points hb->tuples to the
next slice of rs_vistuples[] on each call.  If the page has more
tuples than the executor's max_rows, subsequent calls return the
next slice without re-entering page preparation.  The buffer pin is
held until the page is fully consumed.

scan_begin_batch creates a single TupleTableSlot with
TTSOpsBufferHeapTuple ops.  heap_repoint_slot() re-points this slot
to each tuple in turn via ExecStoreBufferHeapTuple().  Consumers
that need to retain the slot across calls rely on the normal slot
materialization contract.

Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
 src/backend/access/heap/heapam.c         | 229 ++++++++++++++++++++++-
 src/backend/access/heap/heapam_handler.c |   8 +-
 src/include/access/heapam.h              |  33 ++++
 src/include/access/tableam.h             | 136 ++++++++++++++
 src/include/pgstat.h                     |   4 +-
 5 files changed, 403 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index b70c75c8288..d45f509fa6b 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -43,6 +43,7 @@
 #include "catalog/pg_database.h"
 #include "catalog/pg_database_d.h"
 #include "commands/vacuum.h"
+#include "executor/execRowBatch.h"
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
 #include "storage/lmgr.h"
@@ -109,6 +110,7 @@ static int	bottomup_sort_and_shrink(TM_IndexDeleteOp *delstate);
 static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
 static HeapTuple ExtractReplicaIdentity(Relation relation, HeapTuple tp, bool key_required,
 										bool *copy);
+static void heap_repoint_slot(RowBatch *b, int idx);
 
 
 /*
@@ -1214,7 +1216,7 @@ heap_beginscan(Relation relation, Snapshot snapshot,
 	scan->rs_cbuf = InvalidBuffer;
 
 	/*
-	 * Disable page-at-a-time mode if it's not a MVCC-safe snapshot.
+	 * Disable page-at-a-time mode if the snapshot does not allow it.
 	 */
 	if (!(snapshot && IsMVCCSnapshot(snapshot)))
 		scan->rs_base.rs_flags &= ~SO_ALLOW_PAGEMODE;
@@ -1464,7 +1466,7 @@ heap_getnext(TableScanDesc sscan, ScanDirection direction)
 	 * the proper return buffer and return the tuple.
 	 */
 
-	pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+	pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
 
 	return &scan->rs_ctup;
 }
@@ -1492,13 +1494,232 @@ heap_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableSlot *s
 	 * the proper return buffer and return the tuple.
 	 */
 
-	pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+	pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
 
 	ExecStoreBufferHeapTuple(&scan->rs_ctup, slot,
 							 scan->rs_cbuf);
 	return true;
 }
 
+/*---------- Batching support -----------*/
+
+static const RowBatchOps RowBatchHeapOps =
+{
+	.repoint_slot = heap_repoint_slot
+};
+
+/*
+ * heap_batch_feasible
+ *		Batching requires an MVCC snapshot since it relies on
+ *		page-at-a-time mode, which heap_beginscan() disables for
+ *		non-MVCC snapshots.
+ */
+bool
+heap_batch_feasible(Relation relation, Snapshot snapshot)
+{
+	return snapshot && IsMVCCSnapshot(snapshot);
+}
+
+/*
+ * heap_begin_batch
+ *		Initialize AM-side batch state for a heap scan.
+ *
+ * Allocates a HeapPageBatch, which acts as a thin slice descriptor over
+ * the scan's rs_vistuples[] array.  HeapPageBatch keeps no separate
+ * tuple header storage of its own; rs_vistuples[]
+ * in HeapScanDescData (populated by page_collect_tuples() via
+ * heap_prepare_pagescan()) serves as the page-level buffer.  HeapPageBatch
+ * holds a pointer into that array for the current slice and the buffer pin
+ * for the current page.
+ *
+ * b->slot must be a TTSOpsBufferHeapTuple slot.
+ */
+void
+heap_begin_batch(TableScanDesc sscan, RowBatch *b)
+{
+	HeapPageBatch  *hb;
+
+	/* Batch path relies on executor-level qual eval, not AM scan keys */
+	Assert(sscan->rs_nkeys == 0);
+	Assert(TTS_IS_BUFFERTUPLE(b->slot));
+
+	hb = palloc(sizeof(HeapPageBatch));
+	hb->tuples = NULL;
+	hb->ntuples = 0;
+	hb->nextitem = 0;
+	hb->buf = InvalidBuffer;
+
+	b->am_payload = hb;
+	b->ops = &RowBatchHeapOps;
+}
+
+/*
+ * heap_reset_batch
+ *		Release pin and reset for rescan, keeping allocations.
+ */
+void
+heap_reset_batch(TableScanDesc sscan, RowBatch *b)
+{
+	HeapPageBatch  *hb = (HeapPageBatch *) b->am_payload;
+
+	Assert(hb != NULL);
+	if (BufferIsValid(hb->buf))
+	{
+		ReleaseBuffer(hb->buf);
+		hb->buf = InvalidBuffer;
+	}
+	hb->ntuples = 0;
+	hb->nextitem = 0;
+}
+
+/*
+ * heap_end_batch
+ *		Release all batch resources.
+ */
+void
+heap_end_batch(TableScanDesc sscan, RowBatch *b)
+{
+	HeapPageBatch  *hb = (HeapPageBatch *) b->am_payload;
+
+	if (BufferIsValid(hb->buf))
+		ReleaseBuffer(hb->buf);
+
+	pfree(hb);
+	b->am_payload = NULL;
+}
+
+/*
+ * heap_getnextbatch
+ *		Fetch the next slice of visible tuples from a heap scan.
+ *
+ * Serves slices from the current page's rs_vistuples[] array.  If the
+ * current page has remaining tuples, sets hb->tuples to point at the next
+ * slice without re-entering the page scan.  If the page is exhausted,
+ * advances to the next page via heap_fetch_next_buffer(), prepares it
+ * with heap_prepare_pagescan(), and serves the first slice from it.
+ *
+ * hb->tuples points directly into scan->rs_vistuples[]; the entries remain
+ * valid as long as hb->buf (the page's buffer pin) is held.  The pin is
+ * released at the top of the next call once the page is fully consumed.
+ *
+ * Each call returns at most b->max_rows tuples.
+ *
+ * Returns true if tuples were fetched, false at end of scan.
+ */
+bool
+heap_getnextbatch(TableScanDesc sscan, RowBatch *b, ScanDirection dir)
+{
+	HeapScanDesc	scan = (HeapScanDesc) sscan;
+	HeapPageBatch  *hb = (HeapPageBatch *) b->am_payload;
+	int				remaining;
+	int				nserve;
+
+	Assert(ScanDirectionIsForward(dir));
+	Assert(sscan->rs_flags & SO_ALLOW_PAGEMODE);
+
+	/*
+	 * Try to serve from the current page first.  No page advance, no buffer
+	 * management, no re-entry into heap code.
+	 */
+	remaining = scan->rs_ntuples - hb->nextitem;
+	if (remaining > 0)
+	{
+		nserve = Min(remaining, b->max_rows);
+
+		hb->tuples = &scan->rs_vistuples[hb->nextitem];
+		hb->ntuples = nserve;
+		hb->nextitem += nserve;
+
+		b->nrows = nserve;
+		b->pos = 0;
+
+		pgstat_count_heap_getnext(sscan->rs_rd, nserve);
+		return true;
+	}
+
+	/*
+	 * Current page exhausted.  Advance to the next page with visible tuples.
+	 */
+	for (;;)
+	{
+		/*
+		 * Release the previous page's pin.  The page is fully consumed at
+		 * this point -- all slices have been served.
+		 */
+		if (BufferIsValid(hb->buf))
+		{
+			ReleaseBuffer(hb->buf);
+			hb->buf = InvalidBuffer;
+		}
+
+		heap_fetch_next_buffer(scan, dir);
+
+		if (!BufferIsValid(scan->rs_cbuf))
+		{
+			/* End of scan */
+			scan->rs_cblock = InvalidBlockNumber;
+			scan->rs_prefetch_block = InvalidBlockNumber;
+			scan->rs_inited = false;
+			b->nrows = 0;
+			return false;
+		}
+
+		Assert(BufferGetBlockNumber(scan->rs_cbuf) == scan->rs_cblock);
+
+		/*
+		 * Prepare the page: prune, run visibility checks, and populate
+		 * scan->rs_vistuples[0..rs_ntuples-1] via page_collect_tuples().
+		 */
+		heap_prepare_pagescan(sscan);
+
+		if (scan->rs_ntuples > 0)
+		{
+			/*
+			 * Pin the page so tuple data stays valid while the executor
+			 * processes slices.  Released at the top of the next call
+			 * once the page is fully consumed.
+			 */
+			IncrBufferRefCount(scan->rs_cbuf);
+			hb->buf = scan->rs_cbuf;
+
+			nserve = Min(scan->rs_ntuples, b->max_rows);
+
+			hb->tuples = &scan->rs_vistuples[0];
+			hb->ntuples = nserve;
+			hb->nextitem = nserve;
+
+			b->nrows = nserve;
+			b->pos = 0;
+
+			pgstat_count_heap_getnext(sscan->rs_rd, nserve);
+			return true;
+		}
+
+		/* Empty page (all dead/invisible tuples), try next */
+	}
+}
+
+/*
+ * heap_repoint_slot
+ *		Re-point the batch's single slot to the tuple at index idx.
+ *
+ * Called by RowBatchGetNextSlot() for each tuple served to the parent
+ * node.  hb->tuples[idx] was populated by page_collect_tuples() via
+ * heap_prepare_pagescan() and remains valid as long as hb->buf is pinned.
+ */
+static void
+heap_repoint_slot(RowBatch *b, int idx)
+{
+	HeapPageBatch		*hb = (HeapPageBatch *) b->am_payload;
+
+	Assert(idx >= 0 && idx < hb->ntuples);
+	Assert(TTS_IS_BUFFERTUPLE(b->slot));
+
+	ExecStoreBufferHeapTuple(&hb->tuples[idx], b->slot, hb->buf);
+}
+
+/*----- End of batching support -----*/
+
 void
 heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 				  ItemPointer maxtid)
@@ -1640,7 +1861,7 @@ heap_getnextslot_tidrange(TableScanDesc sscan, ScanDirection direction,
 	 * if we get here it means we have a new current scan tuple, so point to
 	 * the proper return buffer and return the tuple.
 	 */
-	pgstat_count_heap_getnext(scan->rs_base.rs_rd);
+	pgstat_count_heap_getnext(scan->rs_base.rs_rd, 1);
 
 	ExecStoreBufferHeapTuple(&scan->rs_ctup, slot, scan->rs_cbuf);
 	return true;
diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
index 88add129674..828b1a71362 100644
--- a/src/backend/access/heap/heapam_handler.c
+++ b/src/backend/access/heap/heapam_handler.c
@@ -2245,7 +2245,7 @@ heapam_scan_sample_next_tuple(TableScanDesc scan, SampleScanState *scanstate,
 			ExecStoreBufferHeapTuple(tuple, slot, hscan->rs_cbuf);
 
 			/* Count successfully-fetched tuples as heap fetches */
-			pgstat_count_heap_getnext(scan->rs_rd);
+			pgstat_count_heap_getnext(scan->rs_rd, 1);
 
 			return true;
 		}
@@ -2535,6 +2535,12 @@ static const TableAmRoutine heapam_methods = {
 	.scan_rescan = heap_rescan,
 	.scan_getnextslot = heap_getnextslot,
 
+	.scan_batch_feasible = heap_batch_feasible,
+	.scan_begin_batch = heap_begin_batch,
+	.scan_getnextbatch = heap_getnextbatch,
+	.scan_end_batch = heap_end_batch,
+	.scan_reset_batch = heap_reset_batch,
+
 	.scan_set_tidrange = heap_set_tidrange,
 	.scan_getnextslot_tidrange = heap_getnextslot_tidrange,
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 56f2d1a5748..d980dd29a44 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -106,6 +106,32 @@ typedef struct HeapScanDescData
 } HeapScanDescData;
 typedef struct HeapScanDescData *HeapScanDesc;
 
+/*
+ * HeapPageBatch -- heapam-private page-level batch state.
+ *
+ * Thin slice descriptor over the scan's rs_vistuples[] array.  Rather
+ * than owning a copy of tuple headers, HeapPageBatch holds a pointer
+ * into scan->rs_vistuples[] for the current slice, which was populated
+ * by page_collect_tuples() during heap_prepare_pagescan().
+ *
+ * The executor consumes tuples in slices.  Each heap_getnextbatch call
+ * re-points tuples to the next slice and advances nextitem, serving up
+ * to RowBatch.max_rows tuples from the current page before advancing
+ * to the next.
+ *
+ * buf holds the pin for the current page.  Tuple data referenced via
+ * tuples remains valid as long as buf is pinned.
+ *
+ * Stored in RowBatch.am_payload.
+ */
+typedef struct HeapPageBatch
+{
+	HeapTupleData  *tuples;		/* start of current slice in rs_vistuples[] */
+	int				ntuples;	/* tuples in current slice */
+	int				nextitem;	/* next unserved tuple index in rs_vistuples[] */
+	Buffer			buf;		/* pinned buffer for current page */
+} HeapPageBatch;
+
 typedef struct BitmapHeapScanDescData
 {
 	HeapScanDescData rs_heap_base;
@@ -360,6 +386,13 @@ extern void heap_endscan(TableScanDesc sscan);
 extern HeapTuple heap_getnext(TableScanDesc sscan, ScanDirection direction);
 extern bool heap_getnextslot(TableScanDesc sscan,
 							 ScanDirection direction, TupleTableSlot *slot);
+
+extern bool heap_batch_feasible(Relation relation, Snapshot snapshot);
+extern void heap_begin_batch(TableScanDesc sscan, RowBatch *batch);
+extern bool heap_getnextbatch(TableScanDesc sscan, RowBatch *batch, ScanDirection dir);
+extern void heap_end_batch(TableScanDesc sscan, RowBatch *batch);
+extern void heap_reset_batch(TableScanDesc sscan, RowBatch *batch);
+
 extern void heap_set_tidrange(TableScanDesc sscan, ItemPointer mintid,
 							  ItemPointer maxtid);
 extern bool heap_getnextslot_tidrange(TableScanDesc sscan,
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index 4647785fd35..28caa3dcf37 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -303,6 +303,8 @@ typedef void (*IndexBuildCallback) (Relation index,
 									bool tupleIsAlive,
 									void *state);
 
+typedef struct RowBatch RowBatch;
+
 /*
  * API struct for a table AM.  Note this must be allocated in a
  * server-lifetime manner, typically as a static const struct, which then gets
@@ -380,6 +382,56 @@ typedef struct TableAmRoutine
 									 ScanDirection direction,
 									 TupleTableSlot *slot);
 
+	/* ------------------------------------------------------------------------
+	 * Batched scan support
+	 * ------------------------------------------------------------------------
+	 */
+
+	/*
+	 * Returns true if the AM can support batching for a scan with the
+	 * given snapshot.  Called at plan init time before the scan descriptor
+	 * exists.  AMs that have no snapshot-based restrictions can omit this
+	 * callback, in which case batching is considered feasible.
+	 */
+	bool		(*scan_batch_feasible)(Relation relation, Snapshot snapshot);
+
+	/*
+	 * Initialize AM-owned batch state for a scan.  Called once before
+	 * the first scan_getnextbatch call.  The AM allocates whatever
+	 * private state it needs and stores it in b->am_payload.  b->slot
+	 * is the scan node's ss_ScanTupleSlot, whose type was already
+	 * determined by the AM via table_slot_callbacks().  The AM's
+	 * repoint_slot callback re-points it to each tuple in the batch
+	 * in turn.  Future interfaces may allow the AM to expose batch
+	 * data in other forms without going through a slot.
+	 */
+	void		(*scan_begin_batch)(TableScanDesc sscan, RowBatch *b);
+
+	/*
+	 * Fetch the next batch of tuples from the scan into b.  Sets b->nrows
+	 * to the number of tuples available and resets b->pos to 0.  Returns
+	 * true if any tuples were fetched, false at end of scan.  The caller
+	 * advances through the batch via RowBatchGetNextSlot(), which calls
+	 * ops->repoint_slot for each position up to b->nrows.
+	 */
+	bool		(*scan_getnextbatch)(TableScanDesc sscan, RowBatch *b,
+									 ScanDirection dir);
+
+	/*
+	 * Release all AM-owned batch resources, including any buffer pins
+	 * held in am_payload.  Called when the scan node is shut down.
+	 * After this call b->am_payload must not be used.
+	 */
+	void		(*scan_end_batch)(TableScanDesc sscan, RowBatch *b);
+
+	/*
+	 * Reset batch state for rescan.  Release any held resources (e.g.
+	 * buffer pins) and reset counts, but keep the allocation so the
+	 * next getnextbatch call can reuse it without re-entering
+	 * begin_batch.
+	 */
+	void		(*scan_reset_batch)(TableScanDesc sscan, RowBatch *b);
+
 	/*-----------
 	 * Optional functions to provide scanning for ranges of ItemPointers.
 	 * Implementations must either provide both of these functions, or neither
@@ -1099,6 +1151,90 @@ table_scan_getnextslot(TableScanDesc sscan, ScanDirection direction, TupleTableS
 	return sscan->rs_rd->rd_tableam->scan_getnextslot(sscan, direction, slot);
 }
 
+/*
+ * table_supports_batching
+ *		Does the relation's AM support batching?
+ */
+static inline bool
+table_supports_batching(Relation relation, Snapshot snapshot)
+{
+	const TableAmRoutine *tam = relation->rd_tableam;
+
+	if (tam->scan_getnextbatch == NULL)
+		return false;
+
+	Assert(tam->scan_begin_batch != NULL);
+	Assert(tam->scan_reset_batch != NULL);
+	Assert(tam->scan_end_batch != NULL);
+
+	/*
+	 * Optional: AM may restrict batching based on snapshot or other conditions.
+	 */
+	if (tam->scan_batch_feasible != NULL &&
+		!tam->scan_batch_feasible(relation, snapshot))
+		return false;
+
+	return true;
+}
+
+/*
+ * table_scan_begin_batch
+ *		Allocate AM-owned batch payload in the RowBatch
+ */
+static inline void
+table_scan_begin_batch(TableScanDesc sscan, RowBatch *b)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	Assert(tam->scan_begin_batch != NULL);
+
+	tam->scan_begin_batch(sscan, b);
+}
+
+/*
+ * table_scan_getnextbatch
+ *		Fetch the next batch of tuples from the AM.  Returns true if tuples
+ *		were fetched, false at end of scan.  Only forward scans are supported.
+ */
+static inline bool
+table_scan_getnextbatch(TableScanDesc sscan, RowBatch *b, ScanDirection dir)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	Assert(ScanDirectionIsForward(dir));
+	Assert(tam->scan_getnextbatch != NULL);
+
+	return tam->scan_getnextbatch(sscan, b, dir);
+}
+
+/*
+ * table_scan_end_batch
+ *		Release AM-owned resources for the batch payload.
+ */
+static inline void
+table_scan_end_batch(TableScanDesc sscan, RowBatch *b)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	Assert(tam->scan_end_batch != NULL);
+
+	tam->scan_end_batch(sscan, b);
+}
+
+/*
+ * table_scan_reset_batch
+ *		Reset AM-owned batch state for rescan without freeing.
+ */
+static inline void
+table_scan_reset_batch(TableScanDesc sscan, RowBatch *b)
+{
+	const TableAmRoutine *tam = sscan->rs_rd->rd_tableam;
+
+	Assert(tam->scan_reset_batch != NULL);
+
+	tam->scan_reset_batch(sscan, b);
+}
+
 /* ----------------------------------------------------------------------------
  * TID Range scanning related functions.
  * ----------------------------------------------------------------------------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2786a7c5ffb..df06e33fba2 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -719,10 +719,10 @@ extern void pgstat_report_analyze(Relation rel,
 		if (pgstat_should_count_relation(rel))						\
 			(rel)->pgstat_info->counts.numscans++;					\
 	} while (0)
-#define pgstat_count_heap_getnext(rel)								\
+#define pgstat_count_heap_getnext(rel, n)							\
 	do {															\
 		if (pgstat_should_count_relation(rel))						\
-			(rel)->pgstat_info->counts.tuples_returned++;			\
+			(rel)->pgstat_info->counts.tuples_returned += (n);		\
 	} while (0)
 #define pgstat_count_heap_fetch(rel)								\
 	do {															\
-- 
2.47.3
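
To illustrate the slicing described in the 0003 commit message, here is
a minimal standalone sketch in plain C.  It is not the patch's code:
PageSketch, SliceSketch, and next_slice() are hypothetical stand-ins
for rs_vistuples[], HeapPageBatch, and the serve-from-current-page path
of heap_getnextbatch(), and buffer pins are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-page visible-tuple array. */
typedef struct PageSketch
{
	int			items[8];	/* plays the role of rs_vistuples[] */
	int			nitems;		/* plays the role of rs_ntuples */
} PageSketch;

/* Hypothetical stand-in for HeapPageBatch, minus the buffer pin. */
typedef struct SliceSketch
{
	const int  *tuples;		/* start of the current slice */
	int			ntuples;	/* length of the current slice */
	int			nextitem;	/* next unserved index in items[] */
} SliceSketch;

/*
 * Serve the next slice of at most max_rows items from the page.
 * Returns the slice length, or 0 once the page is exhausted, at which
 * point a real scan would move on to the next page.
 */
static int
next_slice(const PageSketch *page, SliceSketch *s, int max_rows)
{
	int			remaining = page->nitems - s->nextitem;
	int			nserve;

	if (remaining <= 0)
	{
		s->tuples = NULL;
		s->ntuples = 0;
		return 0;
	}

	nserve = remaining < max_rows ? remaining : max_rows;
	s->tuples = &page->items[s->nextitem];	/* no copy, just re-point */
	s->ntuples = nserve;
	s->nextitem += nserve;
	return nserve;
}
```

With a 7-item page and max_rows = 3, successive calls serve slices of
3, 3, and 1 items, then return 0.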

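The callback contract that table_supports_batching() checks in 0003
might be easier to see in miniature.  The sketch below assumes a
stripped-down AmSketch struct and a boolean in place of a Snapshot; all
names in it are hypothetical and only the decision logic follows the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of the batch-related TableAmRoutine fields. */
typedef struct AmSketch
{
	bool		(*batch_feasible) (bool snapshot_is_mvcc);	/* optional */
	void		(*begin_batch) (void);
	bool		(*getnextbatch) (void);
	void		(*reset_batch) (void);
	void		(*end_batch) (void);
} AmSketch;

static void noop(void) {}
static bool no_rows(void) { return false; }
static bool mvcc_only(bool snapshot_is_mvcc) { return snapshot_is_mvcc; }

/* Batches only under MVCC snapshots, like heapam in the patch. */
static const AmSketch heap_like = {mvcc_only, noop, no_rows, noop, noop};

/* Provides the four required callbacks and no feasibility hook. */
static const AmSketch unrestricted = {NULL, noop, no_rows, noop, noop};

/* Provides none of the batch callbacks. */
static const AmSketch no_batching = {NULL, NULL, NULL, NULL, NULL};

static bool
supports_batching(const AmSketch *am, bool snapshot_is_mvcc)
{
	/* getnextbatch is the sentinel: absent means no batching at all */
	if (am->getnextbatch == NULL)
		return false;

	/* all four or none */
	assert(am->begin_batch != NULL);
	assert(am->reset_batch != NULL);
	assert(am->end_batch != NULL);

	/* a missing feasibility hook means "always feasible" */
	if (am->batch_feasible != NULL && !am->batch_feasible(snapshot_is_mvcc))
		return false;

	return true;
}
```

Using one callback as the sentinel keeps the check cheap, while the
assertions catch an AM that fills in only part of the set.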


  [application/octet-stream] v7-0004-SeqScan-add-batch-driven-variants-returning-slots.patch (12.6K, 6-v7-0004-SeqScan-add-batch-driven-variants-returning-slots.patch)
From e76a49df42dbf22a3169eb2e1d880d9282c1f02f Mon Sep 17 00:00:00 2001
From: Amit Langote <[email protected]>
Date: Thu, 5 Mar 2026 11:28:16 +0900
Subject: [PATCH v7 4/5] SeqScan: add batch-driven variants returning slots

Teach SeqScan to drive the table AM via the new batch API added in
the previous commit, while still returning one TupleTableSlot at a
time to callers.  This reduces per-tuple AM crossings without
changing the node interface seen by parents.

SeqScanState gains a RowBatch pointer that holds the current batch
when batching is active.  Batch state is localized to SeqScanState
-- no changes to PlanState or ScanState.

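Add executor_batch_rows GUC (DEVELOPER_OPTIONS, default 64) to
control the maximum batch size.  Setting it to 0 or 1 disables
batching, since SeqScanCanUseBatching() requires a value above 1.
XXX the value is otherwise ignored when reading from heapam tables:
SeqScanInitBatching() sizes the batch at MaxHeapTuplesPerPage.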

Wire up runtime selection in ExecInitSeqScan via
SeqScanCanUseBatching().  When executor_batch_rows > 1, EPQ is
inactive, the scan is forward-only, and the relation's AM supports
batching, ExecProcNode is set to a batch-driven variant.  Otherwise
the non-batch path is used with zero overhead.

Plan shape and EXPLAIN output remain unchanged; only the internal
tuple flow differs when batching is enabled.

Reviewed-by: Daniil Davydov <[email protected]>
Reviewed-by: ChangAo Chen <[email protected]>
Discussion: https://postgr.es/m/CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com
---
 src/backend/executor/nodeSeqscan.c        | 278 ++++++++++++++++++++++
 src/backend/utils/init/globals.c          |   3 +
 src/backend/utils/misc/guc_parameters.dat |   9 +
 src/include/miscadmin.h                   |   1 +
 src/include/nodes/execnodes.h             |   2 +
 5 files changed, 293 insertions(+)

diff --git a/src/backend/executor/nodeSeqscan.c b/src/backend/executor/nodeSeqscan.c
index 04803b0e37d..d0ce8858c49 100644
--- a/src/backend/executor/nodeSeqscan.c
+++ b/src/backend/executor/nodeSeqscan.c
@@ -29,12 +29,17 @@
 
 #include "access/relscan.h"
 #include "access/tableam.h"
+#include "executor/execRowBatch.h"
 #include "executor/execScan.h"
 #include "executor/executor.h"
 #include "executor/nodeSeqscan.h"
 #include "utils/rel.h"
 
 static TupleTableSlot *SeqNext(SeqScanState *node);
+static TupleTableSlot *ExecSeqScanBatchSlot(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithQual(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithProject(PlanState *pstate);
+static TupleTableSlot *ExecSeqScanBatchSlotWithQualProject(PlanState *pstate);
 
 /* ----------------------------------------------------------------
  *						Scan Support
@@ -205,6 +210,273 @@ ExecSeqScanEPQ(PlanState *pstate)
 					(ExecScanRecheckMtd) SeqRecheck);
 }
 
+/* ----------------------------------------------------------------
+ *						Batch Support
+ * ----------------------------------------------------------------
+ */
+
+/*
+ * SeqScanCanUseBatching
+ *		Check whether this SeqScan can use batch mode execution.
+ *
+ * Batching requires: the GUC is enabled, no EPQ recheck is active, the scan
+ * is forward-only, and the table AM supports batching with the current
+ * snapshot (see table_supports_batching()).
+ */
+static bool
+SeqScanCanUseBatching(SeqScanState *scanstate, int eflags)
+{
+	Relation	relation = scanstate->ss.ss_currentRelation;
+
+	return	executor_batch_rows > 1 &&
+			relation &&
+			table_supports_batching(relation,
+									scanstate->ss.ps.state->es_snapshot) &&
+			!(eflags & EXEC_FLAG_BACKWARD) &&
+			scanstate->ss.ps.state->es_epq_active == NULL;
+}
+
+/*
+ * SeqScanInitBatching
+ *		Set up batch execution state and select the appropriate
+ *		ExecProcNode variant for batch mode.
+ *
+ * Called from ExecInitSeqScan when SeqScanCanUseBatching returns true.
+ * Overwrites the ExecProcNode pointer set by the non-batch path.
+ */
+static void
+SeqScanInitBatching(SeqScanState *scanstate)
+{
+	RowBatch   *batch = RowBatchCreate(MaxHeapTuplesPerPage);
+
+	batch->slot = scanstate->ss.ss_ScanTupleSlot;
+	scanstate->batch = batch;
+
+	/* Choose batch variant */
+	if (scanstate->ss.ps.qual == NULL)
+	{
+		if (scanstate->ss.ps.ps_ProjInfo == NULL)
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlot;
+		else
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithProject;
+	}
+	else
+	{
+		if (scanstate->ss.ps.ps_ProjInfo == NULL)
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQual;
+		else
+			scanstate->ss.ps.ExecProcNode = ExecSeqScanBatchSlotWithQualProject;
+	}
+}
+
+/*
+ * SeqScanResetBatching
+ *		Reset or tear down batch execution state.
+ *
+ * When drop is false (rescan), resets the RowBatch and releases any
+ * AM-held resources like buffer pins, but keeps allocations for reuse.
+ * When drop is true (end of node), frees everything.
+ */
+static void
+SeqScanResetBatching(SeqScanState *scanstate, bool drop)
+{
+	RowBatch *b = scanstate->batch;
+
+	if (b)
+	{
+		RowBatchReset(b, drop);
+		if (b->am_payload)
+		{
+			if (drop)
+			{
+				table_scan_end_batch(scanstate->ss.ss_currentScanDesc, b);
+				b->am_payload = NULL;
+			}
+			else
+				table_scan_reset_batch(scanstate->ss.ss_currentScanDesc, b);
+		}
+		if (drop)
+			pfree(b);
+	}
+}
+
+/*
+ * SeqNextBatch
+ *		Fetch the next batch of tuples from the table AM.
+ *
+ * Lazily initializes the scan descriptor and AM batch state on first
+ * call.  Returns false at end of scan.
+ */
+static bool
+SeqNextBatch(SeqScanState *node)
+{
+	TableScanDesc scandesc;
+	EState	   *estate;
+	ScanDirection direction;
+	RowBatch *b = node->batch;
+
+	Assert(b != NULL);
+
+	/*
+	 * get information from the estate and scan state
+	 */
+	scandesc = node->ss.ss_currentScanDesc;
+	estate = node->ss.ps.state;
+	direction = estate->es_direction;
+	Assert(ScanDirectionIsForward(direction));
+
+	if (scandesc == NULL)
+	{
+		/*
+		 * We reach here if the scan is not parallel, or if we're serially
+		 * executing a scan that was planned to be parallel.
+		 */
+		scandesc = table_beginscan(node->ss.ss_currentRelation,
+								   estate->es_snapshot,
+								   0, NULL,
+								   ScanRelIsReadOnly(&node->ss) ?
+								   SO_HINT_REL_READ_ONLY : SO_NONE);
+		node->ss.ss_currentScanDesc = scandesc;
+	}
+
+	/* Lazily create the AM batch payload. */
+	if (b->am_payload == NULL)
+	{
+		const TableAmRoutine *tam PG_USED_FOR_ASSERTS_ONLY = scandesc->rs_rd->rd_tableam;
+
+		Assert(tam && tam->scan_begin_batch);
+		table_scan_begin_batch(scandesc, b);
+	}
+
+	if (!table_scan_getnextbatch(scandesc, b, direction))
+		return false;
+
+	return true;
+}
+
+/*
+ * SeqScanBatchSlot
+ *		Core loop for batch-driven SeqScan variants.
+ *
+ * Internally fetches tuples in batches from the table AM, but returns
+ * one slot at a time to preserve the single-slot interface expected by
+ * parent nodes.  When the current batch is exhausted, fetches the
+ * next one from the table AM.
+ *
+ * qual and projInfo are passed explicitly so the compiler can eliminate
+ * dead branches when inlined into the typed wrapper functions (e.g.
+ * ExecSeqScanBatchSlot passes NULL for both).
+ *
+ * EPQ is not supported in the batch path; asserted at entry.
+ */
+static inline TupleTableSlot *
+SeqScanBatchSlot(SeqScanState *node,
+				 ExprState *qual, ProjectionInfo *projInfo)
+{
+	ExprContext *econtext = node->ss.ps.ps_ExprContext;
+	RowBatch *b = node->batch;
+
+	/* Batch path does not support EPQ */
+	Assert(node->ss.ps.state->es_epq_active == NULL);
+	Assert(RowBatchIsValid(b));
+
+	for (;;)
+	{
+		TupleTableSlot *in;
+
+		CHECK_FOR_INTERRUPTS();
+
+		/* Get next input slot from current batch, or refill */
+		if (!RowBatchHasMore(b))
+		{
+			if (!SeqNextBatch(node))
+				return NULL;
+		}
+
+		in = RowBatchGetNextSlot(b);
+		Assert(in);
+
+		/* No qual, no projection: direct return */
+		if (qual == NULL && projInfo == NULL)
+			return in;
+
+		ResetExprContext(econtext);
+		econtext->ecxt_scantuple = in;
+
+		/* Check qual if present */
+		if (qual != NULL && !ExecQual(qual, econtext))
+		{
+			InstrCountFiltered1(node, 1);
+			continue;
+		}
+
+		/* Project if needed, otherwise return scan tuple directly */
+		if (projInfo != NULL)
+			return ExecProject(projInfo);
+
+		return in;
+	}
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlot(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return SeqScanBatchSlot(node, NULL, NULL);
+}
+
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQual(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	/*
+	 * Use pg_assume() for != NULL tests to make the compiler realize no
+	 * runtime check for the field is needed in SeqScanBatchSlot().
+	 */
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	Assert(pstate->ps_ProjInfo == NULL);
+
+	return SeqScanBatchSlot(node, pstate->qual, NULL);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() for when projection is required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithProject(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	Assert(pstate->qual == NULL);
+	pg_assume(pstate->ps_ProjInfo != NULL);
+
+	return SeqScanBatchSlot(node, NULL, pstate->ps_ProjInfo);
+}
+
+/*
+ * Variant of ExecSeqScanBatchSlot() for when qual evaluation and
+ * projection are required.
+ */
+static TupleTableSlot *
+ExecSeqScanBatchSlotWithQualProject(PlanState *pstate)
+{
+	SeqScanState *node = castNode(SeqScanState, pstate);
+
+	Assert(pstate->state->es_epq_active == NULL);
+	pg_assume(pstate->qual != NULL);
+	pg_assume(pstate->ps_ProjInfo != NULL);
+
+	return SeqScanBatchSlot(node, pstate->qual, pstate->ps_ProjInfo);
+}
+
 /* ----------------------------------------------------------------
  *		ExecInitSeqScan
  * ----------------------------------------------------------------
@@ -283,6 +555,9 @@ ExecInitSeqScan(SeqScan *node, EState *estate, int eflags)
 			scanstate->ss.ps.ExecProcNode = ExecSeqScanWithQualProject;
 	}
 
+	if (SeqScanCanUseBatching(scanstate, eflags))
+		SeqScanInitBatching(scanstate);
+
 	return scanstate;
 }
 
@@ -302,6 +577,8 @@ ExecEndSeqScan(SeqScanState *node)
 	 */
 	scanDesc = node->ss.ss_currentScanDesc;
 
+	SeqScanResetBatching(node, true);
+
 	/*
 	 * close heap scan
 	 */
@@ -331,6 +608,7 @@ ExecReScanSeqScan(SeqScanState *node)
 		table_rescan(scan,		/* scan desc */
 					 NULL);		/* new scan keys */
 
+	SeqScanResetBatching(node, false);
 	ExecScanReScan((ScanState *) node);
 }
 
diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c
index 36ad708b360..535e29d7823 100644
--- a/src/backend/utils/init/globals.c
+++ b/src/backend/utils/init/globals.c
@@ -165,3 +165,6 @@ int			notify_buffers = 16;
 int			serializable_buffers = 32;
 int			subtransaction_buffers = 0;
 int			transaction_buffers = 0;
+
+/* executor batching */
+int			executor_batch_rows = 64;
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index a315c4ab8ab..a59b5d012a2 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1045,6 +1045,15 @@
   boot_val => 'true',
 },
 
+{ name => 'executor_batch_rows', type => 'int', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Number of rows to include in batches during execution.',
+  flags => 'GUC_NOT_IN_SAMPLE',
+  variable => 'executor_batch_rows',
+  boot_val => '64',
+  min => '0',
+  max => '1024',
+},
+
 { name => 'exit_on_error', type => 'bool', context => 'PGC_USERSET', group => 'ERROR_HANDLING_OPTIONS',
   short_desc => 'Terminate session on any error.',
   variable => 'ExitOnAnyError',
diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h
index 7277c37e779..302c0e33165 100644
--- a/src/include/miscadmin.h
+++ b/src/include/miscadmin.h
@@ -288,6 +288,7 @@ extern PGDLLIMPORT double VacuumCostDelay;
 extern PGDLLIMPORT int VacuumCostBalance;
 extern PGDLLIMPORT bool VacuumCostActive;
 
+extern PGDLLIMPORT int executor_batch_rows;
 
 /* in utils/misc/stack_depth.c */
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3ecae7552fc..0f8431ee854 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -70,6 +70,7 @@ typedef struct TupleTableSlot TupleTableSlot;
 typedef struct TupleTableSlotOps TupleTableSlotOps;
 typedef struct WalUsage WalUsage;
 typedef struct WorkerNodeInstrumentation WorkerNodeInstrumentation;
+typedef struct RowBatch RowBatch;
 
 
 /* ----------------
@@ -1670,6 +1671,7 @@ typedef struct SeqScanState
 {
 	ScanState	ss;				/* its first field is NodeTag */
 	Size		pscan_len;		/* size of parallel heap scan descriptor */
+	RowBatch   *batch;			/* NULL if batching disabled */
 } SeqScanState;
 
 /* ----------------
-- 
2.47.3
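
The control flow of SeqScanBatchSlot() in 0004 (refill when the batch
is exhausted, hand out one row at a time, skip rows that fail the
qual) can be sketched without any executor machinery.  Everything
below is a hypothetical stand-in: BatchSketch for RowBatch, refill()
for SeqNextBatch(), and a plain function pointer for the qual.  The
assumed meaning of pos and nrows follows the tableam.h comments in
0003.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BATCH_MAX 4

/* Hypothetical stand-in for RowBatch: a small buffer plus a cursor. */
typedef struct BatchSketch
{
	int			rows[BATCH_MAX];
	int			nrows;		/* rows available in this batch */
	int			pos;		/* next row to hand out */
} BatchSketch;

typedef struct ScanSketch
{
	const int  *table;		/* the "relation" */
	int			ntable;
	int			next;		/* next unread table index */
	BatchSketch batch;
} ScanSketch;

static bool is_even(int v) { return v % 2 == 0; }

/* Fetch up to BATCH_MAX rows; returns false at end of scan. */
static bool
refill(ScanSketch *scan)
{
	BatchSketch *b = &scan->batch;
	int			n = 0;

	while (n < BATCH_MAX && scan->next < scan->ntable)
		b->rows[n++] = scan->table[scan->next++];
	b->nrows = n;
	b->pos = 0;
	return n > 0;
}

/*
 * Return the next row passing qual (NULL qual accepts everything), or
 * NULL at end of scan.  The pointer is valid only until the next refill.
 */
static const int *
next_row(ScanSketch *scan, bool (*qual) (int))
{
	BatchSketch *b = &scan->batch;

	for (;;)
	{
		const int  *row;

		if (b->pos >= b->nrows && !refill(scan))
			return NULL;

		row = &b->rows[b->pos++];
		if (qual == NULL || qual(*row))
			return row;
	}
}
```

The caller still sees a row-at-a-time interface; only refill() crosses
into the "AM", once per BATCH_MAX rows instead of once per row.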





end of thread, other threads:[~2026-04-06 12:02 UTC | newest]

Thread overview: 22+ messages
2025-09-26 13:28 Batching in executor Amit Langote <[email protected]>
2025-09-26 13:49 ` Bruce Momjian <[email protected]>
2025-09-30 02:15   ` Amit Langote <[email protected]>
2025-09-29 11:01 ` Tomas Vondra <[email protected]>
2025-09-30 02:11   ` Amit Langote <[email protected]>
2025-09-30 13:35     ` Amit Langote <[email protected]>
2025-10-10 06:40   ` Amit Langote <[email protected]>
2025-10-27 07:24   ` Amit Langote <[email protected]>
2025-10-27 16:18     ` Tomas Vondra <[email protected]>
2025-10-28 13:40       ` Amit Langote <[email protected]>
2025-10-28 14:32         ` Daniil Davydov <[email protected]>
2025-10-29 02:22           ` Amit Langote <[email protected]>
2025-10-30 12:12             ` Daniil Davydov <[email protected]>
2025-12-20 14:36               ` Amit Langote <[email protected]>
2025-10-29 06:37         ` Amit Langote <[email protected]>
2025-12-04 15:54           ` Amit Langote <[email protected]>
2025-12-20 14:12             ` Amit Langote <[email protected]>
2025-12-22 11:45               ` cca5507 <[email protected]>
2026-03-24 00:59                 ` Amit Langote <[email protected]>
2026-04-06 12:02                   ` Amit Langote <[email protected]>
2025-10-27 17:37   ` Peter Geoghegan <[email protected]>
2025-10-28 13:11     ` Amit Langote <[email protected]>
