Message-ID: <b0c06e87-1c0a-43d6-8432-5704fbf131ab@vondra.me>
Date: Tue, 3 Feb 2026 23:25:36 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: New access method for b-tree.
To: Matthias van de Meent <boekewurm+postgres@gmail.com>
Cc: Alexandre Felipe <alexandre.felipe@tpro.io>,
 "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>
References: <AS1PR02MB784695AFEC37179FFAF7EAE19A9DA@AS1PR02MB7846.eurprd02.prod.outlook.com>
 <a652515e-278a-4838-815b-ecd2ad4495f6@vondra.me>
 <CAEze2WiM2w-RCffBqjXWudm-=TS8r6BOxCKsNesM3KJ1-AnChw@mail.gmail.com>
Content-Language: en-US
From: Tomas Vondra <tomas@vondra.me>
In-Reply-To: <CAEze2WiM2w-RCffBqjXWudm-=TS8r6BOxCKsNesM3KJ1-AnChw@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Archived-At: <https://www.postgresql.org/message-id/b0c06e87-1c0a-43d6-8432-5704fbf131ab%40vondra.me>
Precedence: bulk


On 2/3/26 17:01, Matthias van de Meent wrote:
> On Mon, 2 Feb 2026 at 00:54, Tomas Vondra <tomas@vondra.me> wrote:
>>
>> Hello Felipe,
>>
>> On 2/1/26 11:02, Alexandre Felipe wrote:
>>> Hello Hackers,
>>>
>>> Please check this out,
>>>
>>> It is an access method to scan a table sorting by the second column of
>>> an index, filtered on the first.
>>> Queries like
>>> SELECT x, y FROM grid
>>> WHERE x in (array of Nx elements)
>>> ORDER BY y, x
>>> LIMIT M
>>>
>>> Can execute streaming the rows directly from disk instead of loading
>>> everything.
> 
> +1 for the idea, it does sound interesting. I haven't looked in depth
> at the patch, so no comments on the execution yet.
> 
>> So how does this compare to skip scan in practice? It's hard to compare,
>> as the patch does not implement an actual access path, but I tried this:
> [...]
>> 1) master
>>
>>   SELECT * FROM merge_test_int
>>   WHERE prefix_col IN (1,3,4,5,6,7,8,9,10,11,12,13,14,15)
> [...]
>> 2) merge scan
>>
>>   SELECT * FROM test_btree_merge_scan_int(
> [...]
>>       ARRAY[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
> [...]
>> And with explain analyze we get this:
>>
>> 1) Buffers: shared hit=26 read=25
>> 2) Buffers: shared hit=143 read=17
> 
> (FYI; your first query was missing "2" from it's IN list while it was
> present in the merge scan input; this makes the difference worse by a
> few pages)
> 
>> So it seems to access many more buffers, even if the number of reads is
>> lower. Presumably the merge scan is not always better than skip scan,
>> probably depending on number of prefixes in the query etc. What is the
>> cost model to decide between those two?
> 
> Skip scan always returns data in index order, while this merge scan
> would return tuples a suffix order. The cost model would thus weigh
> the cost of sorting the result of an index skipscan against the cost
> of doing a merge join on n_in_list_items distinct (partial) index
> scans.
> 

Makes sense.

> As for when you would benefit in buffers accessed: The merge scan
> would mainly benefit in number of buffers accessed when the selected
> prefix values are non-sequential, and the prefixes cover multiple
> pages at a time, and when there is a LIMIT clause on the scan. Normal
> btree index skip scan infrastructure efficiently prevents new index
> descents into the index when the selected SAOP key ranges are directly
> adjecent, while merge scan would generally do at least one index
> descent for each of its N scan heads (*) - which in the proposed
> prototype patch guarantees O(index depth * num scan heads) buffer
> accesses.
> 

Do we have sufficient information to reliably make the right decision?
Can we actually cost the two cases well enough?

> (*) It is theoretically possible to reuse an earlier index descent if
> the SAOP entry's key range of the last descent starts and ends on the
> leaf page that the next SAOP entry's key range also starts on
> (applying the ideas of 5bf748b86b to this new multi-headed index scan
> mode), but that infrastructure doesn't seem to be in place in the
> current patch. That commit is also why your buffer access count for
> master is so low compared to the merge scan's; if your chosen list of
> numbers was multiples of 5 (so that matching tuples are not all
> sequential) you'd probably see much more comparable buffer access
> counts.
> 
>> If you had to construct the best case and worst cases (vs. skip scan),
>> what would that look like?
> 
> Presumably the best case would be:
> 
> -- mytable.a has very few distinct values (e.g. bool or enum);
> mytable.b many distinct values (e.g. uuid)
> SELECT * FROM mytable WHERE a IN (1, 2) ORDER BY b;
> 
> which the index's merge scan would turn into an index scan that
> behaves similar to the following, possibly with the merge join pushed
> down into the index:
> 
> SELECT * FROM (
>     SELECT ... FROM mytable WHERE a = 1
>   UNION
>     SELECT ... FROM mytable WHERE a = 2
> ) ORDER BY b.
> 
> 
> The worst case would be the opposite:
> 
> -- mytable.a has many distinct values (random uuid); mytable.b few
> (e.g. boolean; enum)
> SELECT * FROM mytable WHERE a IN (... huge in list) ORDER BY b
> 
> As the merge scan maintains one internal indexscan head per SAOP array
> element, it'd have significant in-memory and scan startup overhead,
> while few values are produced for each of those scan heads.
> 

OK. It'll be interesting to see how this performs in practice for the
whole gamut between the best and worst case.

>> I'm also wondering how common is the targeted query pattern? How common
>> it is to have an IN condition on the leading column in an index, and
>> ORDER BY on the second one?
> 
> I'm not sure, but it seems like it might be common/useful in
> queue-like access patterns:
> 
> With an index on (state, updated_timestamp) you're probably interested
> in all messages in just a subset of states, ordered by recent state
> transitions. An index on (updated_timestamp, state) might be
> considered more optimal, but won't be able to efficiently serve
> queries that only want data on uncommon states: The leaf pages would
> mainly contain data on common states, reducing the value of those leaf
> pages.
> 
> Right now, you can rewrite the "prefix IN (...) ORDER BY SUFFIX" query
> using UNION, or add an index for each percievable IN list, but it'd be
> great if the user didn't have to rewrite their query or create
> n_combinations indexes with their respective space usage to get this
> more efficient query execution.
> 

I think the examples presented by Ants (with timeline view) are quite
plausible in practice.


regards

-- 
Tomas Vondra