MIME-Version: 1.0
References: <CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com>
 <68f3771f-91f5-4cb7-b1de-74d9abbf0b96@vondra.me> <CA+HiwqGqeS_94wxiYa8VpymiR_OtFPKDSpX+Me=MYWO45f5yig@mail.gmail.com>
 <dbce8def-e817-4e70-a9cb-e62574724d95@vondra.me> <CA+HiwqGM0ZTeVHicSkGnCp-2U-jvU-KBQCkPJ0N7nAj_c2LjZg@mail.gmail.com>
 <CA+HiwqHyE7-oOvtZ+OC-4N7DvKSr8Jbu75erMLQ7O4d6gfkBhg@mail.gmail.com>
 <CA+HiwqEPMwhg6pUE4XML2rG4fRqKMpeGTtMwRPGes90f9iOqtg@mail.gmail.com>
 <CA+HiwqEZja5rJ78p3FBDZNvynWsHwanxyt6h0YaK_r84NemXng@mail.gmail.com>
 <tencent_B45268659AF0400F68EC1643950ABB6A1B09@qq.com> <CAJDiXggP41+-HrRzT+BmtgmS8wkoZM7b4skkQA5NAe+NFEMPSQ@mail.gmail.com>
 <CA+HiwqGufj46_4kT6z5nQrCOxp3rcMDR8fn_t33n2pDKokNRQg@mail.gmail.com> <CA+HiwqH-2GmTKLW9kHdnqV4KdFiPfuAdVK2TgqOM2JaaeUYXnw@mail.gmail.com>
In-Reply-To: <CA+HiwqH-2GmTKLW9kHdnqV4KdFiPfuAdVK2TgqOM2JaaeUYXnw@mail.gmail.com>
From: Junwang Zhao <zhjwpku@gmail.com>
Date: Sun, 1 Feb 2026 22:49:31 +0800
Message-ID: <CAEG8a3JKNRGwBb-C_YV6Px4aFHwd_G=AsB5AbfyU4-8YZT54AQ@mail.gmail.com>
Subject: Re: Batching in executor
To: Amit Langote <amitlangote09@gmail.com>
Cc: Daniil Davydov <3danissimo@gmail.com>, cca5507 <2624345507@qq.com>, 
	PostgreSQL-development <pgsql-hackers@postgresql.org>, Tomas Vondra <tomas@vondra.me>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/CAEG8a3JKNRGwBb-C_YV6Px4aFHwd_G%3DAsB5AbfyU4-8YZT54AQ%40mail.gmail.com>
Precedence: bulk

Hi Amit,

On Thu, Jan 29, 2026 at 3:35=E2=80=AFPM Amit Langote <amitlangote09@gmail.c=
om> wrote:
>
> Hi,
>
> Here is v5 of the patch series.
>
> Patches 0001-0003 add the core batching infrastructure. 0001 adds the
> batch table AM API with heapam implementation, 0002 wires up SeqScan
> to use it (still returning one slot at a time), and 0003 adds EXPLAIN
> (BATCHES). I'd love to hear people's thoughts around TupleBatch
> structure added in 0002. I thought about making it a separate patch so
> that 0002 will still populate the single ScanState.ss_scanTupleSlot,
> but that means we'd still have to call the TAM callback to populate
> the tuple in the TAM's batch struct into the slot, defeating the whole
> point. With TupleBatch, you have executor_batch_rows number of slots
> which are filled in one TAM callback (materialize_all) call. So I
> decided to keep the TupleBatch and related things in 0002.
>
> For scans without quals, batching shows 20-30% improvement with no
> visible regressions when batching is disabled (batch_rows=3D0):
>
> SELECT * FROM t LIMIT n (no qual)
>
>   Rows      Master    batch=3D0   %diff    batch=3D64  %diff
>   ------   --------   -------   -----    --------  -----
>   1M        12.42 ms  11.96 ms   3.7%     8.56 ms  31.0%
>   3M        38.95 ms  38.92 ms   0.1%    28.59 ms  26.6%
>   10M      153.64 ms 150.28 ms   2.2%   112.95 ms  26.5%
>
> (%diff: positive =3D faster than master, negative =3D slower)
>
> Patches 0004-0005 add batched qual evaluation and are more
> experimental (see below on why 0005 exists). For quals referencing
> early columns, the improvement is significant:
>
>   SELECT * FROM t WHERE a =3D 0 ... OFFSET n (qual on 1st col)
>
>   Rows      Master    batch=3D64  %diff
>   ------   --------   --------  -----
>   1M        30.19 ms  15.55 ms  48.5%
>   3M        92.47 ms  50.01 ms  45.9%
>   10M      325.58 ms 211.83 ms  34.9%
>
> However, for quals on later columns (e.g., 15th), batching provides no
> benefit - deformation dominates and batching doesn't help:
>
>   SELECT * FROM t WHERE o =3D 0 ... OFFSET n (qual on 15th col)
>
>   Rows      Master    batch=3D64  %diff
>   ------   --------   --------  -----
>   1M        44.14 ms  44.56 ms  -0.9%
>   3M       133.89 ms 137.77 ms  -2.9%
>   10M      503.33 ms 528.88 ms  -5.1%
>
> I don't have a satisfactory explanation for why batching doesn't help
> the deform-heavy case at all. One would expect at least some benefit
> from reduced per-tuple overhead, but that's not materializing.
>
> I've also been struggling to understand why 0004 affects the per-tuple
> path even when batch_rows=3D0. For quals with 0% selectivity (all rows
> fail the qual), perf shows ExecInterpExpr is noticeably hotter with
> the patched code compared to master, even though batching is disabled:
>
>   SELECT * FROM t WHERE a =3D 0 ... OFFSET n (0% selectivity)
>
>   Rows      Master    batch=3D0   %diff    batch=3D64  %diff
>   ------   --------   -------   -----    --------  -----
>   1M        24.37 ms  28.67 ms -17.6%    12.46 ms  48.9%
>   3M        73.95 ms  85.07 ms -15.0%    41.64 ms  43.7%
>   10M      287.63 ms 316.81 ms -10.1%   188.01 ms  34.6%
>
> Compare that to 100% selectivity (all rows pass), where there's no regres=
sion:
>
>   SELECT * FROM t WHERE a > 0 ... OFFSET n (100% selectivity)
>
>   Rows      Master    batch=3D0   %diff    batch=3D64  %diff
>   ------   --------   -------   -----    --------  -----
>   1M        29.44 ms  29.10 ms   1.2%    16.61 ms  43.6%
>   3M        91.22 ms  90.28 ms   1.0%    54.10 ms  40.7%
>   10M      360.77 ms 331.25 ms   8.2%   224.00 ms  37.9%
>
> I tried moving batch opcodes to a separate interpreter (0005) thinking
> it might be register pressure or jump table effects from adding cases
> to ExecInterpExpr's switch. With 0005, the generated assembly for
> ExecInterpExpr looks identical to master (same stack frame size, same
> epilogue), yet the performance still differs. Specifically, the ldp
> instruction in the function epilogue shows 53% hotness in patched vs
> 35% in master. We still need placeholder entries in the dispatch
> table, so it's unclear if this fully isolates the per-tuple path. I'll
> continue looking at perf, but I feel like at a bit of a loss here and
> would appreciate any insights.
>
> Other changes worth noting:
>
> - I removed the BatchVector intermediate representation that copied
> Datums into columnar arrays before qual evaluation (it used to be in
> the batched qual patch 0004). Now quals access batch slots' tts_values
> directly. This simplifies the code and the copy overhead wasn't paying
> off. If we pursue serious vectorization later, this may need to be
> revisited, but removing it doesn't degrade performance.
>
> --
> Thanks, Amit Langote

Here are some comments for v5:

0001:

+/*
+ * heap_scan_begin_batch
+ *
+ * Allocate a HeapBatch with space for 'maxitems' tuple headers. No pin is
+ * taken here. Memory is allocated under the scan's memory context.
+ */
+void *
+heap_begin_batch(TableScanDesc sscan, int maxitems)


+/*
+ * heap_scan_end_batch
+ *
+ * Release any outstanding pin and free the batch allocations. Caller will
+ * not use 'am_batch' after this point.
+ */
+void
+heap_end_batch(TableScanDesc sscan, void *am_batch)

These function names are not consistent with comments.

0002:

+/*
+ * heap_scan_materialize_all
+ *
+ * Bind all tuples of the current batch into 'slots'. We bind the
+ * HeapTupleData header that points into the pinned page. No per-row copy.
+ */
+void
+heap_materialize_batch_all(void *am_batch, TupleTableSlot **slots, int n)

ditto.

+const TupleBatchOps *
+table_batch_callbacks(Relation relation)
+{
+ if (relation->rd_tableam)
+ return relation->rd_tableam->batch_callbacks(relation);
+ elog(ERROR, "relation does not support TupleBatch operations");
+}

Is there any chance this batch_callbacks can be NULL? In that case it
can cause a segfault. I felt changing to
if (relation->rd_tableam && relation->rd_tableam->batch_callbacks)
should be more robust, but then I found table_slot_callbacks follow
the same pattern, so this shouldn't be a problem.

0003:

+++ b/src/include/executor/execBatch.h
@@ -13,6 +13,8 @@
#ifndef EXECBATCH_H
#define EXECBATCH_H
+#include <limits.h>

I guess the reason for including this header is because of the use
of INT_MAX, so maybe put that line into execBatch.c?

--=20
Regards
Junwang Zhao