MIME-Version: 1.0
References: <AS1PR02MB784695AFEC37179FFAF7EAE19A9DA@AS1PR02MB7846.eurprd02.prod.outlook.com>
 <a652515e-278a-4838-815b-ecd2ad4495f6@vondra.me>
In-Reply-To: <a652515e-278a-4838-815b-ecd2ad4495f6@vondra.me>
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Tue, 3 Feb 2026 17:01:39 +0100
Message-ID: <CAEze2WiM2w-RCffBqjXWudm-=TS8r6BOxCKsNesM3KJ1-AnChw@mail.gmail.com>
Subject: Re: New access method for b-tree.
To: Tomas Vondra <tomas@vondra.me>
Cc: Alexandre Felipe <alexandre.felipe@tpro.io>, 
	"pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
Content-Type: text/plain; charset="UTF-8"
Archived-At: <https://www.postgresql.org/message-id/CAEze2WiM2w-RCffBqjXWudm-%3DTS8r6BOxCKsNesM3KJ1-AnChw%40mail.gmail.com>
Precedence: bulk

On Mon, 2 Feb 2026 at 00:54, Tomas Vondra <tomas@vondra.me> wrote:
>
> Hello Felipe,
>
> On 2/1/26 11:02, Alexandre Felipe wrote:
> > Hello Hackers,
> >
> > Please check this out,
> >
> > It is an access method to scan a table sorting by the second column of
> > an index, filtered on the first.
> > Queries like
> > SELECT x, y FROM grid
> > WHERE x in (array of Nx elements)
> > ORDER BY y, x
> > LIMIT M
> >
> > Can execute streaming the rows directly from disk instead of loading
> > everything.

+1 for the idea, it does sound interesting. I haven't looked in depth
at the patch, so no comments on the execution yet.

> So how does this compare to skip scan in practice? It's hard to compare,
> as the patch does not implement an actual access path, but I tried this:
[...]
> 1) master
>
>   SELECT * FROM merge_test_int
>   WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)
[...]
> 2) merge scan
>
>   SELECT * FROM test_btree_merge_scan_int(
[...]
>       ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
[...]
> And with explain analyze we get this:
>
> 1) Buffers: shared hit=26 read=25
> 2) Buffers: shared hit=143 read=17

(FYI; your first query was missing "2" from it's IN list while it was
present in the merge scan input; this makes the difference worse by a
few pages)

> So it seems to access many more buffers, even if the number of reads is
> lower. Presumably the merge scan is not always better than skip scan,
> probably depending on number of prefixes in the query etc. What is the
> cost model to decide between those two?

Skip scan always returns data in index order, while this merge scan
would return tuples a suffix order. The cost model would thus weigh
the cost of sorting the result of an index skipscan against the cost
of doing a merge join on n_in_list_items distinct (partial) index
scans.

As for when you would benefit in buffers accessed: The merge scan
would mainly benefit in number of buffers accessed when the selected
prefix values are non-sequential, and the prefixes cover multiple
pages at a time, and when there is a LIMIT clause on the scan. Normal
btree index skip scan infrastructure efficiently prevents new index
descents into the index when the selected SAOP key ranges are directly
adjecent, while merge scan would generally do at least one index
descent for each of its N scan heads (*) - which in the proposed
prototype patch guarantees O(index depth * num scan heads) buffer
accesses.

(*) It is theoretically possible to reuse an earlier index descent if
the SAOP entry's key range of the last descent starts and ends on the
leaf page that the next SAOP entry's key range also starts on
(applying the ideas of 5bf748b86b to this new multi-headed index scan
mode), but that infrastructure doesn't seem to be in place in the
current patch. That commit is also why your buffer access count for
master is so low compared to the merge scan's; if your chosen list of
numbers was multiples of 5 (so that matching tuples are not all
sequential) you'd probably see much more comparable buffer access
counts.

> If you had to construct the best case and worst cases (vs. skip scan),
> what would that look like?

Presumably the best case would be:

-- mytable.a has very few distinct values (e.g. bool or enum);
mytable.b many distinct values (e.g. uuid)
SELECT * FROM mytable WHERE a IN (1, 2) ORDER BY b;

which the index's merge scan would turn into an index scan that
behaves similar to the following, possibly with the merge join pushed
down into the index:

SELECT * FROM (
    SELECT ... FROM mytable WHERE a = 1
  UNION
    SELECT ... FROM mytable WHERE a = 2
) ORDER BY b.


The worst case would be the opposite:

-- mytable.a has many distinct values (random uuid); mytable.b few
(e.g. boolean; enum)
SELECT * FROM mytable WHERE a IN (... huge in list) ORDER BY b

As the merge scan maintains one internal indexscan head per SAOP array
element, it'd have significant in-memory and scan startup overhead,
while few values are produced for each of those scan heads.

> I'm also wondering how common is the targeted query pattern? How common
> it is to have an IN condition on the leading column in an index, and
> ORDER BY on the second one?

I'm not sure, but it seems like it might be common/useful in
queue-like access patterns:

With an index on (state, updated_timestamp) you're probably interested
in all messages in just a subset of states, ordered by recent state
transitions. An index on (updated_timestamp, state) might be
considered more optimal, but won't be able to efficiently serve
queries that only want data on uncommon states: The leaf pages would
mainly contain data on common states, reducing the value of those leaf
pages.

Right now, you can rewrite the "prefix IN (...) ORDER BY SUFFIX" query
using UNION, or add an index for each percievable IN list, but it'd be
great if the user didn't have to rewrite their query or create
n_combinations indexes with their respective space usage to get this
more efficient query execution.


Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)