MIME-Version: 1.0
References: 
 <CA+HiwqFfAY_ZFqN8wcAEMw71T9hM_kA8UtyHaZZEZtuT3UyogA@mail.gmail.com>
 <68f3771f-91f5-4cb7-b1de-74d9abbf0b96@vondra.me>
In-Reply-To: <68f3771f-91f5-4cb7-b1de-74d9abbf0b96@vondra.me>
From: Peter Geoghegan <pg@bowt.ie>
Date: Mon, 27 Oct 2025 13:37:25 -0400
Message-ID: 
 <CAH2-WznijhPtw2vtwCtfFSwamwkT2O1KXMx6tE+eoHi3CKwRFg@mail.gmail.com>
Subject: Re: Batching in executor
To: Tomas Vondra <tomas@vondra.me>
Cc: Amit Langote <amitlangote09@gmail.com>,
	PostgreSQL-development <pgsql-hackers@postgresql.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAH2-WznijhPtw2vtwCtfFSwamwkT2O1KXMx6tE%2BeoHi3CKwRFg%40mail.gmail.com>
Precedence: bulk

On Mon, Sep 29, 2025 at 7:01 AM Tomas Vondra <tomas@vondra.me> wrote:
> While looking at the patch, I couldn't help but think about the index
> prefetching stuff that I work on. It also introduces the concept of a
> "batch", for passing data between an index AM and the executor. It's
> interesting how different the designs are in some respects. I'm not
> saying one of those designs is wrong, it's more due different goals.

I've been working on a new prototype enhancement to the index
prefetching patch. The new spinoff patch has index scans batch up
calls to heap_hot_search_buffer for heap TIDs that the scan has yet to
return. This optimization is effective whenever an index scan returns
a contiguous group of TIDs that all point to the same heap page. We're
able to lock and unlock heap page buffers at the same point that
they're pinned and unpinned, which can dramatically decrease the
number of heap buffer locks acquired by index scans that return
contiguous TIDs (which is very common).

I find that pgbench SELECT variants with a predicate such
as "WHERE aid BETWEEN 1000 AND 1500" can have up to ~20% higher
throughput, at least in cases with low client counts (think 1 or 2
clients). These are cases where everything can fit in shared buffers,
so we're not getting any benefit from I/O prefetching (in spite of the
fact that this is built on top of the index prefetching patchset).

It makes sense to put this in scope for the index prefetching work
because that work will already give code outside of an index AM
visibility into which group of TIDs need to be read next. Right now
(on master) there is some trivial sense in which index AMs use their
own batches, but that's completely hidden from external callers.

> For example, the index prefetching patch establishes a "shared" batch
> struct, and the index AM is expected to fill it with data. After that,
> the batch is managed entirely by indexam.c, with no AM calls. The only
> AM-specific bit in the batch is "position", but that's used only when
> advancing to the next page, etc.

The major difficulty with my heap batching prototype is getting the
layering right (no surprises there). In some sense we're deliberately
sharing information across what we currently think of as different
layers of abstraction, in order to be able to "schedule" the work more
intelligently. There are a number of competing considerations.

I have invented a new concept of heap batches that is orthogonal to
the existing concept of index batches. Right now each heap batch is
just an array of HeapTuple structs that relates to exactly one group
of contiguous heap TIDs (i.e. if the index scan returns TIDs even a
little out of order, which is fairly common, the current prototype
patch cannot reorder the work).

Once a batch is prepared, calls to heapam_index_fetch_tuple just
return the next TID from the batch (until the next time we have to
return a TID pointing to some distinct heap block). In the case of
pgbench queries like the one I mentioned, we only need to call
LockBuffer/heap_hot_search_buffer once for every 61 heap tuples
returned (not once per heap tuple returned).
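
To make the batching idea concrete, here is a minimal standalone sketch
(not code from the patch). DemoTid, DemoHeapBatch, DEMO_MAX_BATCH,
demo_prepare_batch and demo_count_batches are all hypothetical names
invented for illustration, and the sketch only counts how many times a
heap page would need to be locked. It does not touch real buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_BATCH 512      /* hypothetical cap, not a PostgreSQL constant */

/* Simplified stand-in for a heap TID: block number plus offset number. */
typedef struct DemoTid
{
    uint32_t    block;
    uint16_t    offset;
} DemoTid;

/* Hypothetical heap batch: one contiguous run of TIDs on a single block. */
typedef struct DemoHeapBatch
{
    uint32_t    block;
    size_t      nitems;
    DemoTid     items[DEMO_MAX_BATCH];
} DemoHeapBatch;

/*
 * Fill "batch" with the run of same-block TIDs that starts at tids[start].
 * In a real scan this is the point where the heap page would be locked
 * once, every TID in the run checked, and the page unlocked again.
 * Returns the number of TIDs consumed (always at least 1).
 */
size_t
demo_prepare_batch(const DemoTid *tids, size_t ntids, size_t start,
                   DemoHeapBatch *batch)
{
    assert(start < ntids);

    batch->block = tids[start].block;
    batch->nitems = 0;

    for (size_t i = start; i < ntids; i++)
    {
        /* Stop at the first TID on another block, or when the batch is full. */
        if (tids[i].block != batch->block || batch->nitems == DEMO_MAX_BATCH)
            break;
        batch->items[batch->nitems++] = tids[i];
    }

    return batch->nitems;
}

/* Count the batches (i.e. page lock/unlock cycles) needed for a whole scan. */
size_t
demo_count_batches(const DemoTid *tids, size_t ntids)
{
    DemoHeapBatch batch;
    size_t      nbatches = 0;
    size_t      pos = 0;

    while (pos < ntids)
    {
        pos += demo_prepare_batch(tids, ntids, pos, &batch);
        nbatches++;
    }

    return nbatches;
}
```

Assuming a dense layout where aid N lives on block (N - 1) / 61 (an
assumption made only for this example), the 501 TIDs for "aid BETWEEN
1000 AND 1500" fall on 9 blocks, so the sketch reports 9 lock cycles
rather than 501. A TID sequence that leaves a block and later comes
back to it starts a new batch each time, which mirrors the "no
reordering" limitation described above.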

Importantly, the new interface added by my new prototype spinoff patch
is higher level than the existing
table_index_fetch_tuple/heapam_index_fetch_tuple interface. The
executor asks the table AM "give me the next heap TID in the current
scan direction", rather than asking "give me this heap TID". The
general idea is that the table AM has a direct understanding of
ordered index scans.
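
The shape of such a direction-driven call can be sketched roughly as
follows. This is an illustration only, not the patch's actual API:
DemoIndexScan, DemoDirection, demo_scan_begin and demo_fetch_next are
hypothetical names, and a plain array stands in for the index AM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a heap TID: block number plus offset number. */
typedef struct DemoTid
{
    uint32_t    block;
    uint16_t    offset;
} DemoTid;

typedef enum DemoDirection
{
    DEMO_BACKWARD = -1,
    DEMO_FORWARD = 1
} DemoDirection;

/* Stand-in for the index side of the scan: TIDs in index order plus a cursor. */
typedef struct DemoIndexScan
{
    const DemoTid *tids;
    size_t      ntids;
    ptrdiff_t   pos;            /* position of the last TID returned */
} DemoIndexScan;

/* Position the cursor just before the first TID in the given direction. */
void
demo_scan_begin(DemoIndexScan *scan, const DemoTid *tids, size_t ntids,
                DemoDirection dir)
{
    scan->tids = tids;
    scan->ntids = ntids;
    scan->pos = (dir == DEMO_FORWARD) ? -1 : (ptrdiff_t) ntids;
}

/*
 * "Give me the next heap TID in this direction."  The caller never names
 * a TID.  The table AM side owns the cursor, so it sees which TIDs come
 * next and is free to decide how to schedule the underlying heap accesses.
 * Returns false (leaving the cursor alone) when the scan is exhausted.
 */
bool
demo_fetch_next(DemoIndexScan *scan, DemoDirection dir, DemoTid *out)
{
    ptrdiff_t   next = scan->pos + (ptrdiff_t) dir;

    if (next < 0 || next >= (ptrdiff_t) scan->ntids)
        return false;

    scan->pos = next;
    *out = scan->tids[next];
    return true;
}
```

The contrast with a "give me this heap TID" interface is that the
caller here passes only a direction, so everything about which TID is
fetched, and when the heap page is visited, stays behind the call.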

The advantage of this higher-level interface is that it gives the
table AM maximum freedom to reorder work. As I said already, we won't
do things like merge together logically noncontiguous accesses to the
same heap page into one physical access right now. But I think that
that should at least be enabled by this interface.

The downside of this approach is that the table AM (not the executor
proper) is responsible for interfacing with the index AM layer. I
think that this can be generalized without very much code duplication
across table AMs. But it's hard.

> This patch does things differently. IIUC, each TAM may produce it's own
> "batch", which is then wrapped in a generic one. For example, heap
> produces HeapBatch, and it gets wrapped in TupleBatch. But I think this
> is fine. In the prefetching we chose to move all this code (walking the
> batch items) from the AMs into the layer above, and make it AM agnostic.

I think that the base index prefetching patch's current notion of
index-AM-wise batches can be kept quite separate from any table AM
batch concept that might be invented, either as part of what I'm
working on, or in Amit's patch.

It probably wouldn't be terribly difficult to get the new interface
I've described to return heap tuples in whatever batch format Amit
comes up with. That only has a benefit if it makes life easier for
expression evaluation in higher levels of the plan tree, but it might
just make sense to always do it that way. I doubt that adopting Amit's
batch format will make life much harder for the
heap_hot_search_buffer-batching mechanism (at least if it is generally
understood that its new index scan interface builds batches in
Amit's format on a best-effort basis).

-- 
Peter Geoghegan