MIME-Version: 1.0
References: 
 <AS1PR02MB784695AFEC37179FFAF7EAE19A9DA@AS1PR02MB7846.eurprd02.prod.outlook.com>
 <a652515e-278a-4838-815b-ecd2ad4495f6@vondra.me>
 <CANwKhkNd85u+4joaKR3YHoDOQSMg5SmJmsYJGo-tMyW=XVXTew@mail.gmail.com>
 <E313FDE4-8138-44CC-99CE-60F38251D878@kleczek.org>
 <CAE8JnxNM9Bh=LCGOzayewDgX3-kUNXdTDwNSDwsf+t=wKhPiCQ@mail.gmail.com>
 <CAE8JnxM5GDEWdvEckjgG60OwPK04pZ9dSyxYm2+-PuyKCpmo-w@mail.gmail.com>
 <CAE8JnxOJoWF-ABi5EtsrmBg3FRtmyk+D0Na8=e1vCwMaG1B2Lg@mail.gmail.com>
In-Reply-To: 
 <CAE8JnxOJoWF-ABi5EtsrmBg3FRtmyk+D0Na8=e1vCwMaG1B2Lg@mail.gmail.com>
From: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Date: Tue, 17 Mar 2026 12:37:47 +0000
Message-ID: 
 <CAE8JnxM06S_B+2N-VU7LqwMhbf4pp94nnUTB=whZ2ou56++zVg@mail.gmail.com>
Subject: Re: New access method for b-tree.
To: "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>,
 michael@paquier.xyz, tgl@sss.pgh.pa.us,
	"peter@eisentraut.org" <peter@eisentraut.org>
Cc: Ants Aasma <ants.aasma@cybertec.at>, Tomas Vondra <tomas@vondra.me>,
	Alexandre Felipe <alexandre.felipe@tpro.io>,
 =?UTF-8?B?TWljaGHFgiBLxYJlY3plaw==?= <michal@kleczek.org>,
	Andres Freund <andres@anarazel.de>
Content-Type: multipart/alternative; boundary="000000000000f826dc064d379ae2"
Archived-At: 
 <https://www.postgresql.org/message-id/CAE8JnxM06S_B%2B2N-VU7LqwMhbf4pp94nnUTB%3DwhZ2ou56%2B%2BzVg%40mail.gmail.com>
Precedence: bulk

--000000000000f826dc064d379ae2
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Happy St. Patrick's day!

Based on what I said said in previous emails I see alternative
proposals

#1 Make it simpler by not changing the index access methods.

#2 Make it optimal in some sense by not using generic index searches
and not keeping multiple open index scans.

and
#3 Follow the pragmatic approach
Objective is, minimize the number of heap fetches.
As high level as possible, reusing existing functions
instead of writing custom code when possible.


Ants Aasma & Tomas Vondra

> > My workarounds I have proposed users have been either to rewrite the
> > query as a UNION ALL of a set of single value prefix queries wrapped
> > in an order by limit. This gives the exact needed merge append plan
> > shape. But repeating the query N times can get unwieldy when the
> > number of values grows, so the fallback is:
> >
> > SELECT * FROM unnest(:friends) id, LATERAL (
> >     SELECT * FROM posts
> >     WHERE user_id =3D id
> >     ORDER BY tstamp DESC LIMIT 100)
> > ORDER BY tstamp DESC LIMIT 100;
> >
> > The downside of this formulation is that we still have to fetch a
> > batch worth of items from scans where we otherwise would have only had
> > to look at one index tuple.
> >


> True. It's useful to think about the query this way, and it may be
> better than full select + sort, but it has issues too.
>

An issue with this query is generality, if this is joined with other
queries we can't determine in advance the limit.


> The main problem I can see is that at planning time the cardinality of
> > the prefix array might not be known, and in theory could be in the
> > millions. Having millions of index scans open at the same time is not
> > viable, so the method needs to somehow degrade gracefully. The idea I
> > had is to pick some limit, based on work_mem and/or benchmarking, and
> > one the limit is hit, populate the first batch and then run the next
> > batch of index scans, merging with the first result. Or something like
> > that, I can imagine a few different ways to handle it with different
> > tradeoffs.
> >
>
> Doesn't the proposed merge scan have a similar issue? Because that will
> also have to keep all the index scans open (even if only internally).
> Indeed, it needs to degrade gracefully, in some way.


It is true, but I think we can trust the planner.
This problem scales similarly in a memoize node.
Is ~24kB for each open index scan a good guess?

ALTERNATIVE #1 - More efficient

Or to avoid having N open index scans we could  (??)
(1) find the index page for the head of each prefix.
(2) for each prefix
(2.a) load tuples from each head page
(2.b) if we consume the last tuple in a page save a pointer
to the next page.
(2.c) check if tuples for the next prefix are in the same page
(2.d) Release the page.
(3) producing tuples in the suffix order
  (3.b) when tuples for prefix are exhausted load load
          page from (2.b)

Step 2.a. would possibly waste time extracting tuples that are
not needed, and memory by storing them. Not sure how efficient
this can be compared to having an open index scan.

Matthias van de Meent, Feb 3
> btree index skip scan infrastructure efficiently prevents new index
> descents into the index when the selected SAOP key ranges are directly
> adjecent, while merge scan would generally do at least one index
> descent for each of its N scan heads (*) - which in the proposed
> prototype patch guarantees O(index depth * num scan heads) buffer
> accesses.

This could also be addressed if we do this custom descent,
I didn't bother about that depth factor because with a few random prefixes
doing so we are probably going to save accesses only for the top level.

I would prefer to start with a very conceptual implementation
that can already provide 1000x speedup, but if you think this
way is better, I am open to try it. I think this can be done
without affecting the planner logic and the PrefixJoin node.


I'm afraid the
> proposed batches execution will be rather complex, so I'd say v1 should
> simply have a threshold, and do the full scan + sort for more items.


Do you mean by an executor node that performs the query as if it was writte=
n

ALTERNATIVE #2 - Simpler(??)
for each _prefix of prefixes:
  result +=3D (SELECT FROM table
        WHERE prefix =3D _prefix AND qual(*)
        ORDER BY suffix
        LIMIT N)
return SELECT * FROM result
    ORDER BY suffix
    LIMIT N

This query may have to produce N * len(prefixes) rows, while the
original proposal would produce only N + len(prefixes) - 1.

Comparing to the previous results,
> | Method     | Shared Hit | Shared Read | Exec Time |
> |------------|-----------:|------------:|----------:|
> | Merge      |         13 |         119 |     13 ms |
> | IndexScan  |     15,308 |     525,310 |  3,409 ms |

This Prefix Batch Scan approach
hit=3D62 read=3D773, Execution Time: 80.815 ms
With 8 prefixes, the execution time increased to 80.8/13 ~ 6.21
And the number of buffers by 835/132 ~ 6.32

> I can imagine that this would really nicely benefit from
> ReadStream'ification.
> >
>
> Not sure, maybe.
>

Actually as I was watching the index prefetch development I was
quite uncertain about how this would play with that, but we can
probably simply give a budget for each stream.


> One other connection I see is with block nested loops. In a perfect
> > future PostgreSQL could run the following as a set of merged index
> > scans that terminate early:
> >
> > SELECT posts.*
> > FROM follows f
> >     JOIN posts p ON f.followed_id =3D p.user_id
> > WHERE f.follower_id =3D :userid
> > ORDER BY p.tstamp DESC LIMIT 100;
> >
> > In practice this is not a huge issue - it's not that hard to transform
> > this to array_agg and =3D ANY subqueries.
> >

Automating that transformation seems quite non-trivial (to me).
>

Well, not trivial. To give a rough idea.

wc -l *.patch
     113 v2-0001-Test-the-baseline.patch
     614 v2-0002-Access-method.patch
     850 v2-0003-Planner-integration.patch
    1958 v2-0004-Multi-column.patch
    2439 v2-0005-Joins.patch

it is missing some important details like prefix deduplication
but for the scenario where the values on the other table
are known to be unique it is good.

The multi column accepts things like A in (...) B in (...)
and computes the cartesian product or (A, B) IN (...)


Regards,
Alexandre

On Mon, Feb 23, 2026 at 10:08=E2=80=AFPM Alexandre Felipe <
o.alexandre.felipe@gmail.com> wrote:

> Hi Hackers,
>
> Do you think that MERGE-SCAN was a terrible name? I wanted a name that
> wouldn't
>
> require much explanation. I named it like this because it relies on a k-w=
ay
>
>
> merge to combine several segments of an index in one result. But we alrea=
dy
>
>
> have a MERGE statement. Even in the example plan above we can see
> an external
>
> merge that has nothing to do with the new feature, and now as I am doing
> joins,
>
> I started doing it on the NestedLoop trying to follow the same conditions
> that
>
> lead to a memoize. But I added so many fields to the NestedLoop state tha=
t
> I
>
> think it is good to have a separate structure, and maybe a separate node,
> and
>
> MergeScan of course is taken hehe. I was thinking of IndexPrefixMerge. We
> could
>
> use the Ants nickname TimeLineScan, but of course it is not limited to ti=
me
>
>
> lines (even though realistically, that will probably be the most common
> use of
>
> this). Another one I considered was TransposedIndexScan, because it order=
s
>
>
> output on (suffix, prefix) instead of (prefix, suffix).
>
>
>
>
> On Fri, Feb 6, 2026 at 10:52=E2=80=AFAM Alexandre Felipe <
> o.alexandre.felipe@gmail.com> wrote:
>
>> Hello again hackers!
>>
>> +pt@bowt.ie <pt@bowt.ie>: That seems to be the one that is probably the
>> most familiar with the index scan (based on the commits).
>> +michael@paquier.xyz <michael@paquier.xyz> , +tgl@sss.pgh.pa.us
>> <tgl@sss.pgh.pa.us> , +peter@eisentraut.org <peter@eisentraut.org> as
>> the top 3 committers to nbtree over the last ~6 months.
>>
>> I  have made substantial progress on adding a few features. I have
>> questions, but I will let you go first :)
>>
>> Motivation:
>> *In technical terms:* this proposal is to take advantage of a btree
>> index when the query is filtered by a few distinct prefixes and ordered =
by
>> a suffix and has a limit.
>> *In non technical:* This could help to efficiently render a social
>> network feed, where each user can select a list of users whose posts the=
y
>> want to see, and the posts must be ordered from newest to oldest.
>>
>>
>> *Performance Comparison*
>> I did a test with a toy table, please find more details below.
>>
>> With limit 100
>>
>> | Method     | Shared Hit | Shared Read | Exec Time |
>> |------------|-----------:|------------:|----------:|
>> | Merge      |         13 |         119 |     13 ms |
>> | IndexScan  |     15,308 |     525,310 |  3,409 ms |
>>
>> With limit 1,000,000
>>
>> | Method     |  SharedHit | SharRead | Temp I | Temp O | Exec Time |
>> |------------|-----------:|---------:|-------:|-------:|----------:|
>> | Merge      |    980,318 |   19,721 |      0 |      0 |  2,128 ms |
>> | Sequential |     15,208 |  525,410 | 20,207 | 35,384 |  3,762 ms |
>> | Bitmap     |        629 |  113,759 | 20,207 | 35,385 |  5,487 ms |
>> | IndexScan  |  7,880,619 |  126,706 | 20,945 | 35,386 |  5,874 ms |
>>
>> Sequential scans and bitmap scans in this case reduces significantly the
>> number of
>> accessed buff because the table has only four integer columns, and these
>> methods
>> can read all the lines on a given page at a time.
>>
>> However that comes at the cost of resorting to an in-disk sort method.
>> For the query with limit 100 we get no temp files as we are using a
>> top-100 sort.
>>
>> make check passes
>>
>>
>> *Experiment details*
>>
>> Consider a 100M row table formed (a,b,c,d) \in 100 x 100 x 100 x 100
>>
>>
>> ```sql
>> CREATE TABLE grid AS (
>>     SELECT a, b, c, d, FROM
>>         generate_series(1, 100) AS a,
>>         generate_series(1, 100) AS b,
>>         generate_series(1, 100) AS c,
>>         generate_series(1, 100) AS d
>> );
>>
>> CREATE INDEX grid_index ON grid (a, b, c);
>> ANALYSE grid;
>> ```
>>
>> Now let's say that we need to find certain number of rows filtered by a
>> and ordered by b;
>> ```sql
>> PREPARE grid_query(int) AS
>> SELECT sum(d) FROM (
>>     SELECT * FROM grid
>>     WHERE a IN (2,3,5,8,13,21,34,55) AND b >=3D 0
>>     ORDER BY b
>>     LIMIT $1) t;
>> ```
>>
>> ---
>>
>>
>> Now with limit 100, with index merge scan (notice Index Prefixes in the
>> plan).
>>
>> ```sql
>> SET enable_indexmergescan =3D on;
>> EXPLAIN (ANALYSE) EXECUTE grid_query(100);
>> ```
>>
>> ```text
>>    Buffers: shared hit=3D13 read=3D119
>>    ->  Limit  (cost=3D0.57..87.29 rows=3D100 width=3D16) (actual
>> time=3D5.528..12.999 rows=3D100.00 loops=3D1)
>>          Buffers: shared hit=3D13 read=3D119
>>          ->  Index Scan using grid_a_b_c_idx on grid  (cost=3D0.57..93.3=
6
>> rows=3D107 width=3D16) (actual time=3D5.528..12.994 rows=3D100.00 loops=
=3D1)
>>                Index Cond: (b >=3D 0)
>>                *Index Prefixes: *(a =3D ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[]))
>>                Index Searches: 8
>>                Buffers: shared hit=3D13 read=3D119
>>  Planning:
>>    Buffers: shared hit=3D59 read=3D23
>>  Planning Time: 4.619 ms
>>  Execution Time: 13.055 ms
>>  ```
>>
>>
>> ```sql
>> SET enable_indexmergescan =3D off;
>> EXPLAIN (ANALYSE) EXECUTE grid_query(100);
>> ```
>>
>> ```text
>>  Aggregate  (cost=3D1603588.06..1603588.07 rows=3D1 width=3D8) (actual
>> time=3D3406.624..3408.710 rows=3D1.00 loops=3D1)
>>    Buffers: shared hit=3D15308 read=3D525310
>>    ->  Limit  (cost=3D1603575.17..1603586.81 rows=3D100 width=3D16) (act=
ual
>> time=3D3406.601..3408.702 rows=3D100.00 loops=3D1)
>>          Buffers: shared hit=3D15308 read=3D525310
>>          ->  Gather Merge  (cost=3D1603575.17..2514342.92 rows=3D7819999
>> width=3D16) (actual time=3D3406.598..3408.695 rows=3D100.00 loops=3D1)
>>                Workers Planned: 2
>>                Workers Launched: 2
>>                Buffers: shared hit=3D15308 read=3D525310
>>                ->  Sort  (cost=3D1602575.14..1610720.98 rows=3D3258333
>> width=3D16) (actual time=3D3393.782..3393.784 rows=3D100.00 loops=3D3)
>>                      Sort Key: grid.b
>>                      Sort Method: top-N heapsort  Memory: 32kB
>>                      Buffers: shared hit=3D15308 read=3D525310
>>                      Worker 0:  Sort Method: top-N heapsort  Memory: 32k=
B
>>                      Worker 1:  Sort Method: top-N heapsort  Memory: 32k=
B
>>                      ->  *Parallel Seq Scan* on grid
>>  (cost=3D0.00..1478044.00 rows=3D3258333 width=3D16) (actual time=3D0.94=
4..3129.896
>> rows=3D2666666.67 loops=3D3)
>>                            Filter: ((b >=3D 0) AND (a =3D ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])))
>>                            Rows Removed by Filter: 30666667
>>                            Buffers: shared hit=3D15234 read=3D525310
>>  Planning Time: 0.370 ms
>>  Execution Time: 3409.134 ms
>>  ```
>>
>> Now queries with limit 1,000,000
>>
>> ```sql
>> SET enable_indexmergescan =3D on;
>> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
>> ```
>>
>> Query executed with the proposed access method. Notice in the plan Index
>> Prefixes and Index Cond.
>> ```text
>>    Buffers: shared hit=3D980318 read=3D19721
>>    ->  Limit  (cost=3D0.57..867259.84 rows=3D1000000 width=3D16) (actual
>> time=3D2.854..2103.438 rows=3D1000000.00 loops=3D1)
>>          Buffers: shared hit=3D980318 read=3D19721
>>          ->  Index Scan using grid_a_b_c_idx on grid
>>  (cost=3D0.57..867265.91 rows=3D1000007 width=3D16) (actual time=3D2.852=
..2066.205
>> rows=3D1000000.00 loops=3D1)
>>                Index Cond: (b >=3D 0)
>>                *Index Prefixes:* (a =3D ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[]))
>>                Index Searches: 8
>>                Buffers: shared hit=3D980318 read=3D19721
>>  Planning Time: 0.328 ms
>>  Execution Time: 2127.811 ms
>>  ```
>>
>> If we disable index_mergescan we naturally we fall into a sequential sca=
n.
>>
>> ```sql
>> SET enable_indexmergescan =3D off;
>> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
>> ```
>> ```text
>>    Buffers: shared hit=3D15208 read=3D525410, temp read=3D20207 written=
=3D35384
>>    ->  Limit  (cost=3D1942895.64..2059362.12 rows=3D1000000 width=3D16) =
(actual
>> time=3D3467.012..3712.044 rows=3D1000000.00 loops=3D1)
>>          Buffers: shared hit=3D15208 read=3D525410, temp read=3D20207
>> written=3D35384
>>          ->  Gather Merge  (cost=3D1942895.64..2853663.39 rows=3D7819999
>> width=3D16) (actual time=3D3467.010..3671.220 rows=3D1000000.00 loops=3D=
1)
>>                Workers Planned: 2
>>                Workers Launched: 2
>>                Buffers: shared hit=3D15208 read=3D525410, temp read=3D20=
207
>> written=3D35384
>>                ->  Sort  (cost=3D1941895.62..1950041.45 rows=3D3258333
>> width=3D16) (actual time=3D3455.852..3476.358 rows=3D334576.33 loops=3D3=
)
>>                      Sort Key: grid.b
>>                      Sort Method: *external merge  Disk: 47016kB*
>>                      Buffers: shared hit=3D15208 read=3D525410, temp
>> read=3D20207 written=3D35384
>>                      Worker 0:  Sort Method: external merge  Disk: 46976=
kB
>>                      Worker 1:  Sort Method: external merge  Disk: 47000=
kB
>>                      ->  *Parallel Seq Scan* on grid
>>  (cost=3D0.00..1478044.00 rows=3D3258333 width=3D16) (actual time=3D2.78=
9..2779.483
>> rows=3D2666666.67 loops=3D3)
>>                            Filter: ((b >=3D 0) AND (a =3D ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])))
>>                            Rows Removed by Filter: 30666667
>>                            Buffers: shared hit=3D15134 read=3D525410
>>  Planning Time: 0.332 ms
>>  Execution Time: 3761.866 ms
>> ```
>>
>> If we disable sequential scans, then we get a bitmap scan
>>
>> ```sql
>> SET enable_seqscan =3D off;
>> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
>> ```
>> ```text
>>    Buffers: shared hit=3D629 read=3D113759 written=3D2, temp read=3D2020=
7
>> written=3D35385
>>    ->  Limit  (cost=3D1998199.78..2114666.26 rows=3D1000000 width=3D16) =
(actual
>> time=3D5170.456..5453.433 rows=3D1000000.00 loops=3D1)
>>          Buffers: shared hit=3D629 read=3D113759 written=3D2, temp read=
=3D20207
>> written=3D35385
>>          ->  Gather Merge  (cost=3D1998199.78..2908967.53 rows=3D7819999
>> width=3D16) (actual time=3D5170.455..5413.254 rows=3D1000000.00 loops=3D=
1)
>>                Workers Planned: 2
>>                Workers Launched: 2
>>                Buffers: shared hit=3D629 read=3D113759 written=3D2, temp
>> read=3D20207 written=3D35385
>>                ->  Sort  (cost=3D1997199.75..2005345.59 rows=3D3258333
>> width=3D16) (actual time=3D5156.929..5177.507 rows=3D334500.67 loops=3D3=
)
>>                      Sort Key: grid.b
>>                      Sort Method: external merge  Disk: 47032kB
>>                      Buffers: shared hit=3D629 read=3D113759 written=3D2=
, temp
>> read=3D20207 written=3D35385
>>                      Worker 0:  Sort Method: external merge  Disk: 47280=
kB
>>                      Worker 1:  Sort Method: external merge  Disk: 46680=
kB
>>                      ->  Parallel Bitmap Heap Scan on grid
>>  (cost=3D107691.54..1533348.13 rows=3D3258333 width=3D16) (actual
>> time=3D299.891..4489.787 rows=3D2666666.67 loops=3D3)
>>                            Recheck Cond: ((a =3D ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0))
>>                            Rows *Removed by Index Recheck*: 2410242
>>                            Heap Blocks: exact=3D13100 lossy=3D22639
>>                            Buffers: shared hit=3D615 read=3D113759 writt=
en=3D2
>>                            Worker 0:  Heap Blocks: exact=3D13077 lossy=
=3D22755
>>                            Worker 1:  Heap Blocks: exact=3D13036 lossy=
=3D22421
>>                            ->  *Bitmap Index Scan* on grid_a_b_c_idx
>>  (cost=3D0.00..105736.54 rows=3D7820000 width=3D0) (actual time=3D297.65=
1..297.651
>> rows=3D8000000.00 loops=3D1)
>>                                  Index Cond: ((a =3D ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0))
>>                                  Index Searches: 7
>>                                  Buffers: shared hit=3D13 read=3D7293
>> written=3D2
>>  Planning Time: 0.165 ms
>>  Execution Time: 5487.213 ms
>> ```
>>
>> If we disable bitmap scans we finally get an index scan
>>
>> ```sql
>> SET enable_bitmapscan =3D off;
>> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
>> ```
>> ```
>>    Buffers: shared hit=3D7883221 read=3D124111, temp read=3D20699 writte=
n=3D35385
>>    ->  Limit  (cost=3D7201203.08..7317669.55 rows=3D1000000 width=3D16) =
(actual
>> time=3D4414.478..4674.400 rows=3D1000000.00 loops=3D1)
>>          Buffers: shared hit=3D7883221 read=3D124111, temp read=3D20699
>> written=3D35385
>>          ->  Gather Merge  (cost=3D7201203.08..8111970.83 rows=3D7819999
>> width=3D16) (actual time=3D4414.476..4633.982 rows=3D1000000.00 loops=3D=
1)
>>                Workers Planned: 2
>>                Workers Launched: 2
>>                Buffers: shared hit=3D7883221 read=3D124111, temp read=3D=
20699
>> written=3D35385
>>                ->  Sort  (cost=3D7200203.05..7208348.88 rows=3D3258333
>> width=3D16) (actual time=3D4390.625..4411.896 rows=3D334567.00 loops=3D3=
)
>>                      Sort Key: grid.b
>>                      Sort Method: *external merge  Disk: 47304kB*
>>                      Buffers: shared hit=3D7883221 read=3D124111, temp
>> read=3D20699 written=3D35385
>>                      Worker 0:  Sort Method: external merge  Disk: 47304=
kB
>>                      Worker 1:  Sort Method: external merge  Disk: 46384=
kB
>>                      ->  *Parallel Index Scan* using grid_a_b_c_idx on
>> grid  (cost=3D0.57..6736351.43 rows=3D3258333 width=3D16) (actual
>> time=3D46.925..3796.915 rows=3D2666666.67 loops=3D3)
>>                            Index Cond: ((a =3D ANY
>> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0))
>>                            Index Searches: 7
>>                            Buffers: shared hit=3D7883208 read=3D124110
>>  Planning Time: 0.385 ms
>>  Execution Time: 4713.325 ms
>>  ```
>>
>>
>>
>>
>>
>>
>> On Thu, Feb 5, 2026 at 6:59=E2=80=AFAM Alexandre Felipe <
>> o.alexandre.felipe@gmail.com> wrote:
>>
>>> Thank you for looking into this.
>>>
>>> Now we can execute a, still narrow, family queries!
>>>
>>> Maybe it helps to see this as a *social network feeds*. Imagine a
>>> social network, you have a few friends, or follow a few people, and you
>>> want to see their updates ordered by date. For each user we have a
>>> different combination of users that we have to display. But maybe, even
>>> having hundreds of users we will only show the first 10.
>>>
>>> There is a low hanging fruit on the skip scan, if we need N rows, and
>>> one group already has M rows we could stop there.
>>> If Nx is the number of friends, and M is the number of posts to show.
>>> This runs with complexity (Nx * M) rows, followed by an (Nx * M) sort,
>>> instead of (Nx * N) followed by an (Nx * N) sort.
>>> Where M =3D 10 and N is 1000 this is a significant improvement.
>>> But if M ~ N, the merge scan that runs with M + Nx row accesses, (M +
>>> Nx) heap operations.
>>> If everything is on the same page the skip scan would win.
>>>
>>> The cost estimation is probably far off.
>>> I am also not considering the filters applied after this operator, and =
I
>>> don't know if the planner infrastructure is able to adjust it by itself=
.
>>> This is where I would like reviewer's feedback. I think that the planne=
r
>>> costs are something to be determined experimentally.
>>>
>>> Next I will make it slightly more general handling
>>> * More index columns: Index (a, b, s...) could support WHERE a IN (...)
>>> ORDER BY b LIMIT N (ignoring s...)
>>> * Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c
>>> * Non-leading prefix: WHERE b IN (...) AND a =3D const ORDER BY c on in=
dex
>>> (a, b, c)
>>>
>>> ---
>>> Kind Regards,
>>> Alexandre
>>>
>>> On Wed, Feb 4, 2026 at 7:13=E2=80=AFAM Micha=C5=82 K=C5=82eczek <michal=
@kleczek.org>
>>> wrote:
>>>
>>>>
>>>>
>>>> On 3 Feb 2026, at 22:42, Ants Aasma <ants.aasma@cybertec.at> wrote:
>>>>
>>>> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas@vondra.me> wrote:
>>>>
>>>> I'm also wondering how common is the targeted query pattern? How commo=
n
>>>> it is to have an IN condition on the leading column in an index, and
>>>> ORDER BY on the second one?
>>>>
>>>>
>>>> I have seen this pattern multiple times. My nickname for it is the
>>>> timeline view. Think of the social media timeline, showing posts from
>>>> all followed accounts in timestamp order, returned in reasonably sized
>>>> batches. The naive SQL query will have to scan all posts from all
>>>> followed accounts and pass them through a top-N sort. When the total
>>>> number of posts is much larger than the batch size this is much slower
>>>> than what is proposed here (assuming I understand it correctly) -
>>>> effectively equivalent to running N index scans through Merge Append.
>>>>
>>>>
>>>> My workarounds I have proposed users have been either to rewrite the
>>>> query as a UNION ALL of a set of single value prefix queries wrapped
>>>> in an order by limit. This gives the exact needed merge append plan
>>>> shape. But repeating the query N times can get unwieldy when the
>>>> number of values grows, so the fallback is:
>>>>
>>>> SELECT * FROM unnest(:friends) id, LATERAL (
>>>>    SELECT * FROM posts
>>>>    WHERE user_id =3D id
>>>>    ORDER BY tstamp DESC LIMIT 100)
>>>> ORDER BY tstamp DESC LIMIT 100;
>>>>
>>>> The downside of this formulation is that we still have to fetch a
>>>> batch worth of items from scans where we otherwise would have only had
>>>> to look at one index tuple.
>>>>
>>>>
>>>> GIST can be used to handle this kind of queries as it supports multipl=
e
>>>> sort orders.
>>>> The only problem is that GIST does not support ORDER BY column.
>>>> One possible workaround is [1] but as described there it does not play
>>>> well with partitioning.
>>>> I=E2=80=99ve started drafting support for ORDER BY column in GIST - se=
e [2].
>>>> I think it would be easier to implement and maintain than a new IAM
>>>> (but I don=E2=80=99t have enough knowledge and experience to implement=
 it myself)
>>>>
>>>> [1]
>>>> https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1E=
C21F%40kleczek.org
>>>> [2]
>>>> https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5=
DC91%40kleczek.org
>>>>
>>>> =E2=80=94
>>>> Kind regards,
>>>> Michal
>>>>
>>>

--000000000000f826dc064d379ae2
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div id=3D"gmail-:9t6t" class=3D"gmail-Am gmail-aiL gmail-=
aO9 gmail-Al editable gmail-LW-avf gmail-tS-tW gmail-tS-tY" aria-label=3D"M=
essage Body" role=3D"textbox" aria-multiline=3D"true" tabindex=3D"1" style=
=3D"direction:ltr;min-height:85px" aria-controls=3D":afry" aria-expanded=3D=
"false"><div dir=3D"ltr"><br></div><div>Happy St. Patrick&#39;s day!</div><=
div></div><div><br></div><div>Based on what I said said in previous emails =
I see alternative</div><div>proposals</div><div><br></div><div>#1 Make it s=
impler by not changing the index access methods.</div><div><br></div><div>#=
2=C2=A0Make it optimal in some sense by not using generic index searches</d=
iv><div>and not keeping multiple open index scans.</div><div><br></div><div=
><div>and=C2=A0</div><div>#3 Follow the pragmatic approach</div><div></div>=
</div><div>Objective is, minimize the number of heap fetches.</div><div>As =
high level as possible, reusing existing functions</div><div>instead of wri=
ting custom code when possible.</div><div><br></div><div><br></div><div><br=
></div><div>Ants Aasma &amp; Tomas Vondra</div><div class=3D"gmail_quote gm=
ail_quote_container"><blockquote class=3D"gmail_quote" style=3D"margin:0px =
0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">&gt;=
 My workarounds I have proposed users have been either to rewrite the<br>
&gt; query as a UNION ALL of a set of single value prefix queries wrapped<b=
r>
&gt; in an order by limit. This gives the exact needed merge append plan<br=
>
&gt; shape. But repeating the query N times can get unwieldy when the<br>
&gt; number of values grows, so the fallback is:<br>
&gt; <br>
&gt; SELECT * FROM unnest(:friends) id, LATERAL (<br>
&gt;=C2=A0 =C2=A0 =C2=A0SELECT * FROM posts<br>
&gt;=C2=A0 =C2=A0 =C2=A0WHERE user_id =3D id<br>
&gt;=C2=A0 =C2=A0 =C2=A0ORDER BY tstamp DESC LIMIT 100)<br>
&gt; ORDER BY tstamp DESC LIMIT 100;<br>
&gt; <br>
&gt; The downside of this formulation is that we still have to fetch a<br>
&gt; batch worth of items from scans where we otherwise would have only had=
<br>
&gt; to look at one index tuple.<br>
&gt;=C2=A0</blockquote><blockquote class=3D"gmail_quote" style=3D"margin:0p=
x 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><b=
r></blockquote><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0p=
x 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
True. It&#39;s useful to think about the query this way, and it may be<br>
better than full select + sort, but it has issues too.<br></blockquote><div=
><br></div><div>An issue with this query is generality, if this is joined w=
ith other</div><div>queries we can&#39;t determine in advance the limit.</d=
iv><div>=C2=A0</div><div><br></div><blockquote class=3D"gmail_quote" style=
=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding=
-left:1ex">
&gt; The main problem I can see is that at planning time the cardinality of=
<br>
&gt; the prefix array might not be known, and in theory could be in the<br>
&gt; millions. Having millions of index scans open at the same time is not<=
br>
&gt; viable, so the method needs to somehow degrade gracefully. The idea I<=
br>
&gt; had is to pick some limit, based on work_mem and/or benchmarking, and<=
br>
&gt; one the limit is hit, populate the first batch and then run the next<b=
r>
&gt; batch of index scans, merging with the first result. Or something like=
<br>
&gt; that, I can imagine a few different ways to handle it with different<b=
r>
&gt; tradeoffs.<br>
&gt; <br>
<br>
Doesn&#39;t the proposed merge scan have a similar issue? Because that will=
<br>
also have to keep all the index scans open (even if only internally).<br>
Indeed, it needs to degrade gracefully, in some way.</blockquote><div>=C2=
=A0</div><div>It is true, but I think we can trust the planner.</div><div>T=
his problem scales=C2=A0similarly in a=C2=A0memoize node.</div><div>Is ~24k=
B for each open index scan a good guess?</div><div><br></div><div>ALTERNATI=
VE #1 - More efficient</div><div><br></div><div>Or to avoid having N open i=
ndex scans we could=C2=A0 (??)</div><div>(1) find the index page for the he=
ad of each prefix.</div><div>(2) for each prefix<br>(2.a) load tuples from =
each head page</div><div>(2.b) if we consume the last tuple in a page save =
a pointer</div><div>to the next page.</div><div>(2.c) check if tuples for t=
he next prefix are in the same page</div><div>(2.d) Release the page.</div>=
<div>(3) producing tuples in the suffix order</div><div>=C2=A0 (3.b) when t=
uples for prefix are exhausted load load</div><div>=C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 page from (2.b)</div><div><br></div><div>Step 2.a. would possibl=
y waste time extracting tuples that are=C2=A0</div><div>not needed, and mem=
ory by storing them. Not sure how efficient</div><div>this can be compared =
to having an open index scan.</div><div><br></div><div>Matthias van de Meen=
t, Feb 3</div><div>&gt; btree=C2=A0<span class=3D"gmail-il">index</span>=C2=
=A0skip scan infrastructure efficiently prevents new=C2=A0<span class=3D"gm=
ail-il">index</span></div><div>&gt; descents into the=C2=A0<span class=3D"g=
mail-il">index</span>=C2=A0when the selected SAOP key ranges are directly<b=
r>&gt; adjecent, while merge scan would generally do at least one=C2=A0<spa=
n class=3D"gmail-il">index</span><br>&gt; descent for each of its N scan he=
ads (*) - which in the proposed<br>&gt; prototype patch guarantees O(<span =
class=3D"gmail-il">index</span>=C2=A0depth * num scan heads) buffer<br>&gt;=
 accesses.</div><div><br></div><div>This could also be addressed if we do t=
his custom descent,</div><div>I didn&#39;t bother about that depth factor b=
ecause with a few random prefixes</div><div>doing so we are probably going =
to save accesses only for the top level.</div><div><br></div><div><div>I wo=
uld prefer to start with a very conceptual implementation</div><div>that ca=
n already provide 1000x speedup, but if you think this</div><div>way is bet=
ter, I am open to try it. I think this can be done</div><div>without affect=
ing the planner logic and the PrefixJoin node.</div></div><div><br></div><d=
iv><br></div><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px =
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">I&#39;m afra=
id the<br>
proposed batches execution will be rather complex, so I&#39;d say v1 should=
<br>
simply have a threshold, and do the full scan + sort for more items.</block=
quote><div><br>Do you mean by an executor node that performs the query as i=
f it was written</div><div><br></div><div>ALTERNATIVE #2 - Simpler(??)</div=
><div><font face=3D"monospace">for each _prefix of prefixes:</font></div><d=
iv><font face=3D"monospace">=C2=A0 result=C2=A0+=3D (SELECT FROM table=C2=
=A0</font></div><div><font face=3D"monospace">=C2=A0 =C2=A0 =C2=A0 =C2=A0 W=
HERE prefix =3D _prefix AND qual(*)=C2=A0</font></div><div><font face=3D"mo=
nospace">=C2=A0 =C2=A0 =C2=A0 =C2=A0 ORDER BY suffix=C2=A0</font></div><div=
><font face=3D"monospace">=C2=A0 =C2=A0 =C2=A0 =C2=A0 LIMIT N)</font></div>=
<div><font face=3D"monospace">return SELECT * FROM result=C2=A0</font></div=
><div><font face=3D"monospace">=C2=A0 =C2=A0 ORDER BY suffix=C2=A0</font></=
div><div><font face=3D"monospace">=C2=A0 =C2=A0 LIMIT N</font></div><div><b=
r></div><div>This query may have to produce N * len(prefixes) rows, while t=
he=C2=A0</div><div>original proposal would produce only N=C2=A0+ len(prefix=
es) - 1.</div><div><br></div><div>Comparing to the previous results,</div><=
div><span style=3D"font-family:monospace">&gt; | Method =C2=A0 =C2=A0 | Sha=
red Hit | Shared Read | Exec Time |</span><br style=3D"font-family:monospac=
e"><span style=3D"font-family:monospace">&gt; |------------|-----------:|--=
-</span><span style=3D"font-family:monospace">---------:|----------:|</span=
><br style=3D"font-family:monospace"><span style=3D"font-family:monospace">=
&gt; | Merge =C2=A0 =C2=A0 =C2=A0| =C2=A0 =C2=A0 =C2=A0 =C2=A0 13 | =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 119 | =C2=A0 =C2=A0 13 ms |</span><br style=3D"font-fa=
mily:monospace"><span style=3D"font-family:monospace">&gt; | IndexScan =C2=
=A0| =C2=A0 =C2=A0 15,308 | =C2=A0 =C2=A0 525,310 | =C2=A03,409 ms |</span>=
</div><div><span style=3D"font-family:monospace"><br></span></div><div><spa=
n style=3D"font-family:monospace">This Prefix Batch Scan approach</span></d=
iv><div>hit=3D62 read=3D773,=C2=A0<span style=3D"font-family:monospace"></s=
pan>Execution Time: 80.815 ms</div><div>With 8 prefixes, the execution time=
 increased to 80.8/13 ~ 6.21</div><div>And the number of buffers by 835/132=
 ~ 6.32</div><div><br></div><blockquote class=3D"gmail_quote" style=3D"marg=
in:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1e=
x">
&gt; I can imagine that this would really nicely benefit from ReadStream=
9;ification.<br>
&gt; <br>
<br>
Not sure, maybe.<br></blockquote><div><br></div><div>Actually as I was watc=
hing the index prefetch development I was</div><div>quite uncertain about h=
ow this would play with that, but we can</div><div>probably simply give a b=
udget for each stream.</div><div><br></div><div><br></div><blockquote class=
=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rg=
b(204,204,204);padding-left:1ex">
&gt; One other connection I see is with block nested loops. In a perfect<br=
>
&gt; future PostgreSQL could run the following as a set of merged index<br>
&gt; scans that terminate early:<br>
&gt; <br>
&gt; SELECT posts.*<br>
&gt; FROM follows f<br>
&gt;=C2=A0 =C2=A0 =C2=A0JOIN posts p ON f.followed_id =3D p.user_id<br>
&gt; WHERE f.follower_id =3D :userid<br>
&gt; ORDER BY p.tstamp DESC LIMIT 100;<br>
&gt; <br>
&gt; In practice this is not a huge issue - it&#39;s not that hard to trans=
form<br>
&gt; this to array_agg and =3D ANY subqueries.<br>
&gt;=C2=A0</blockquote><blockquote class=3D"gmail_quote" style=3D"margin:0p=
x 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
Automating that transformation seems quite non-trivial (to me).<br></blockq=
uote><div><br></div><div>Well, not trivial. To give a rough idea.</div><div=
><br><font face=3D"monospace">wc -l *.patch</font></div><div><font face=3D"=
monospace">=C2=A0 =C2=A0 =C2=A0113 v2-0001-Test-the-baseline.patch<br>=C2=
=A0 =C2=A0 =C2=A0614 v2-0002-Access-method.patch<br>=C2=A0 =C2=A0 =C2=A0850=
 v2-0003-Planner-integration.patch<br>=C2=A0 =C2=A0 1958 v2-0004-Multi-colu=
mn.patch<br>=C2=A0 =C2=A0 2439 v2-0005-Joins.patch</font></div><div><br></d=
iv><div>it is missing some important details like prefix deduplication</div=
><div>but for the scenario where the values on the other table</div><div>ar=
e known to be unique it is good.</div><div><br></div><div>The multi column =
accepts things like A in (...) B in (...)</div><div>and computes the cartes=
ian product or (A, B) IN (...)<br><br><br>Regards,</div><div>Alexandre</div=
></div></div></div><br><div class=3D"gmail_quote gmail_quote_container"><di=
v dir=3D"ltr" class=3D"gmail_attr">On Mon, Feb 23, 2026 at 10:08=E2=80=AFPM=
 Alexandre Felipe &lt;<a href=3D"mailto:o.alexandre.felipe@gmail.com">o.ale=
xandre.felipe@gmail.com</a>&gt; wrote:<br></div><blockquote class=3D"gmail_=
quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,=
204);padding-left:1ex"><div dir=3D"ltr">Hi Hackers,<br><br>


<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">Do you think that MERGE-SCAN was a terrible =
name? I wanted a name that wouldn&#39;t<span>=C2=A0</span></span></p>
<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">require much explanation. I named it like th=
is because it relies on a k-way<span>=C2=A0</span></span></p>
<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">merge to combine several segments of an inde=
x in one result.=C2=A0But we already<span>=C2=A0</span></span></p>
<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">have a MERGE statement. Even in the example =
plan above we can see an=C2=A0external<span>=C2=A0</span></span></p>
<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">merge that has nothing to do with the new fe=
ature, and now as I am doing joins,<span>=C2=A0</span></span></p>
<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">I started doing it on the NestedLoop trying =
to follow the same conditions that<span>=C2=A0</span></span></p>
<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">lead to a memoize. But I added so many field=
s to the NestedLoop state that I<span>=C2=A0</span></span></p>
<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">think it is good to have a separate=C2=A0str=
ucture, and maybe a separate node, and<span>=C2=A0</span></span></p>
<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">MergeScan of course is taken hehe. I was thi=
nking of IndexPrefixMerge. We could<span>=C2=A0</span></span></p>
<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">use the Ants nickname TimeLineScan, but of c=
ourse it is not limited to time<span>=C2=A0</span></span></p>
<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">lines (even though realistically, that will =
probably be the most common use of<span>=C2=A0</span></span></p>
<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">this). Another one I considered was Transpos=
edIndexScan, because it orders<span>=C2=A0</span></span></p>
<p style=3D"margin:0px;font-variant-numeric:normal;font-variant-east-asian:=
normal;font-variant-alternates:normal;font-size-adjust:none;font-kerning:au=
to;font-feature-settings:normal;font-stretch:normal;font-size:11px;line-hei=
ght:normal;font-family:Menlo;color:rgb(0,0,0)"><span style=3D"font-variant-=
ligatures:no-common-ligatures">output on (suffix, prefix) instead of (prefi=
x, suffix).</span></p><div><br></div><div><br><br></div></div><br><div clas=
s=3D"gmail_quote"><div dir=3D"ltr" class=3D"gmail_attr">On Fri, Feb 6, 2026=
 at 10:52=E2=80=AFAM Alexandre Felipe &lt;<a href=3D"mailto:o.alexandre.fel=
ipe@gmail.com" target=3D"_blank">o.alexandre.felipe@gmail.com</a>&gt; wrote=
:<br></div><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.=
8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir=3D"lt=
r"><div dir=3D"ltr"><div dir=3D"ltr"><div>Hello again hackers!</div><br><di=
v><a class=3D"gmail_plusreply" id=3D"m_6080345663077266924m_-38605142459751=
09604m_1178963958729505245gmail-plusReplyChip-0" href=3D"mailto:pt@bowt.ie"=
 target=3D"_blank">+pt@bowt.ie</a>: That seems to be the one that is probab=
ly the most familiar with the index scan (based on the commits).</div><div>=
<a class=3D"gmail_plusreply" id=3D"m_6080345663077266924m_-3860514245975109=
604m_1178963958729505245gmail-plusReplyChip-2" href=3D"mailto:michael@paqui=
er.xyz" target=3D"_blank">+michael@paquier.xyz</a>=C2=A0,=C2=A0<a class=3D"=
gmail_plusreply" id=3D"m_6080345663077266924m_-3860514245975109604m_1178963=
958729505245gmail-plusReplyChip-4" href=3D"mailto:tgl@sss.pgh.pa.us" target=
=3D"_blank">+tgl@sss.pgh.pa.us</a>=C2=A0,=C2=A0<a class=3D"gmail_plusreply"=
 id=3D"m_6080345663077266924m_-3860514245975109604m_1178963958729505245gmai=
l-plusReplyChip-5" href=3D"mailto:peter@eisentraut.org" target=3D"_blank">+=
peter@eisentraut.org</a>=C2=A0as the top 3 committers to nbtree over the la=
st ~6 months.<br></div><div><br></div><div>I=C2=A0 have made substantial=C2=
=A0progress on adding a few features. I have questions, but I will let you =
go first :)</div><div><br></div><div><font size=3D"4">Motivation</font>:</d=
iv><div><b>In technical terms:</b> this proposal is to take advantage of a =
btree index when the query is filtered by a few distinct prefixes and order=
ed by a suffix and has a limit.<br></div><div><b>In non technical:</b> This=
 could help to efficiently render=C2=A0a social network feed, where each us=
er can select a list of users whose posts they want to see, and the posts m=
ust be ordered from newest to oldest.</div><div><br></div><div><b><font siz=
e=3D"4">Performance Comparison</font><br></b><br>I did a test with a toy ta=
ble, please find more details below.<br><br>With limit 100<br><br><font fac=
e=3D"monospace">| Method =C2=A0 =C2=A0 | Shared Hit | Shared Read | Exec Ti=
me |<br>|------------|-----------:|------------:|----------:|<br>| Merge =
=C2=A0 =C2=A0 =C2=A0| =C2=A0 =C2=A0 =C2=A0 =C2=A0 13 | =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 119 | =C2=A0 =C2=A0 13 ms |<br>| IndexScan =C2=A0| =C2=A0 =C2=A0 15=
,308 | =C2=A0 =C2=A0 525,310 | =C2=A03,409 ms |</font><br><br>With limit 1,=
000,000<br><br><font face=3D"monospace">| Method =C2=A0 =C2=A0 | =C2=A0Shar=
edHit | SharRead | Temp I | Temp O | Exec Time |<br>|------------|---------=
--:|---------:|-------:|-------:|----------:|<br>| Merge =C2=A0 =C2=A0 =C2=
=A0| =C2=A0 =C2=A0980,318 | =C2=A0 19,721 | =C2=A0 =C2=A0 =C2=A00 | =C2=A0 =
=C2=A0 =C2=A00 | =C2=A02,128 ms |<br>| Sequential | =C2=A0 =C2=A0 15,208 | =
=C2=A0525,410 | 20,207 | 35,384 | =C2=A03,762 ms |<br>| Bitmap =C2=A0 =C2=
=A0 | =C2=A0 =C2=A0 =C2=A0 =C2=A0629 | =C2=A0113,759 | 20,207 | 35,385 | =
=C2=A05,487 ms |<br>| IndexScan =C2=A0| =C2=A07,880,619 | =C2=A0126,706 | 2=
0,945 | 35,386 | =C2=A05,874 ms |</font><br><br>Sequential scans and bitmap=
 scans in this case reduces significantly the number of <br>accessed buff b=
ecause the table has only four integer columns, and these methods <br>can r=
ead all the lines on a given page at a time.<br><br>However that comes at t=
he cost of resorting to an in-disk sort method. <br>For the query with limi=
t 100 we get no temp files as we are using a<br>top-100 sort.</div><div><br=
></div><div>make check passes</div><div><br><br><b><font size=3D"4">Experim=
ent details</font></b><br><br>Consider a 100M row table formed (a,b,c,d) \i=
n 100 x 100 x 100 x 100 <br><br><br>```sql<br><font face=3D"monospace">CREA=
TE TABLE grid AS (<br>=C2=A0 =C2=A0 SELECT a, b, c, d, FROM <br>=C2=A0 =C2=
=A0 =C2=A0 =C2=A0 generate_series(1, 100) AS a, <br>=C2=A0 =C2=A0 =C2=A0 =
=C2=A0 generate_series(1, 100) AS b, <br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 genera=
te_series(1, 100) AS c, <br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 generate_series(1, =
100) AS d<br>);<br><br>CREATE INDEX grid_index ON grid (a, b, c);<br>ANALYS=
E grid;</font><br>```<br><br>Now let&#39;s say that we need to find certain=
 number of rows filtered by a and ordered by b;<br>```sql<br><font face=3D"=
monospace">PREPARE grid_query(int) AS<br>SELECT sum(d) FROM (<br>=C2=A0 =C2=
=A0 SELECT * FROM grid <br>=C2=A0 =C2=A0 WHERE a IN (2,3,5,8,13,21,34,55) A=
ND b &gt;=3D 0 <br>=C2=A0 =C2=A0 ORDER BY b <br>=C2=A0 =C2=A0 LIMIT $1) t;<=
/font><br>```<br><br>---<br><br><br>Now with limit 100, with index merge sc=
an (notice Index Prefixes in the plan).<br><br>```sql<br>SET enable_indexme=
rgescan =3D on;<br>EXPLAIN (ANALYSE) EXECUTE grid_query(100);<br>```<br><br=
>```text<br>=C2=A0 =C2=A0Buffers: shared hit=3D13 read=3D119<br>=C2=A0 =C2=
=A0-&gt; =C2=A0Limit =C2=A0(cost=3D0.57..87.29 rows=3D100 width=3D16) (actu=
al time=3D5.528..12.999 rows=3D100.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0Buffers: shared hit=3D13 read=3D119<br>=C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0-&gt; =C2=A0Index Scan using grid_a_b_c_idx on grid =C2=A0(cos=
t=3D0.57..93.36 rows=3D107 width=3D16) (actual time=3D5.528..12.994 rows=3D=
100.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0Index Cond: (b &gt;=3D 0)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0<b>Index Prefixes: </b>(a =3D ANY (&#39;{2,3,5,8,13,21,34,55}&=
#39;::integer[]))<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0Index Searches: 8<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0Buffers: shared hit=3D13 read=3D119<br>=C2=A0Planning:<br>=C2=A0 =C2=
=A0Buffers: shared hit=3D59 read=3D23<br>=C2=A0Planning Time: 4.619 ms<br>=
=C2=A0Execution Time: 13.055 ms<br>=C2=A0```<br><br><br>```sql<br>SET enabl=
e_indexmergescan =3D off;<br>EXPLAIN (ANALYSE) EXECUTE grid_query(100);<br>=
```<br><br>```text<br>=C2=A0Aggregate =C2=A0(cost=3D1603588.06..1603588.07 =
rows=3D1 width=3D8) (actual time=3D3406.624..3408.710 rows=3D1.00 loops=3D1=
)<br>=C2=A0 =C2=A0Buffers: shared hit=3D15308 read=3D525310<br>=C2=A0 =C2=
=A0-&gt; =C2=A0Limit =C2=A0(cost=3D1603575.17..1603586.81 rows=3D100 width=
=3D16) (actual time=3D3406.601..3408.702 rows=3D100.00 loops=3D1)<br>=C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15308 read=3D525310<br>=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Gather Merge =C2=A0(cost=3D16=
03575.17..2514342.92 rows=3D7819999 width=3D16) (actual time=3D3406.598..34=
08.695 rows=3D100.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0Workers Planned: 2<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0Workers Launched: 2<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15308 read=3D525310<br>=C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Sort =C2=A0(co=
st=3D1602575.14..1610720.98 rows=3D3258333 width=3D16) (actual time=3D3393.=
782..3393.784 rows=3D100.00 loops=3D3)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: grid.b<br>=C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Meth=
od: top-N heapsort =C2=A0Memory: 32kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15308 read=
=3D525310<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0Worker 0: =C2=A0Sort Method: top-N heapsort =C2=A0Memory: 32k=
B<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0Worker 1: =C2=A0Sort Method: top-N heapsort =C2=A0Memory: 32kB<br>=C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&=
gt; =C2=A0<b>Parallel Seq Scan</b> on grid =C2=A0(cost=3D0.00..1478044.00 r=
ows=3D3258333 width=3D16) (actual time=3D0.944..3129.896 rows=3D2666666.67 =
loops=3D3)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Filter: ((b &gt;=3D 0) AND (a =3D ANY=
 (&#39;{2,3,5,8,13,21,34,55}&#39;::integer[])))<br>=C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Ro=
ws Removed by Filter: 30666667<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=
=3D15234 read=3D525310<br>=C2=A0Planning Time: 0.370 ms<br>=C2=A0Execution =
Time: 3409.134 ms<br>=C2=A0```<br><br>Now queries with limit 1,000,000<br><=
br>```sql<br>SET enable_indexmergescan =3D on;<br>EXPLAIN ANALYSE EXECUTE g=
rid_query(1000000);<br>```<br><br>Query executed with the proposed access m=
ethod. Notice in the plan Index Prefixes and Index Cond.<br>```text<br>=C2=
=A0 =C2=A0Buffers: shared hit=3D980318 read=3D19721<br>=C2=A0 =C2=A0-&gt; =
=C2=A0Limit =C2=A0(cost=3D0.57..867259.84 rows=3D1000000 width=3D16) (actua=
l time=3D2.854..2103.438 rows=3D1000000.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0Buffers: shared hit=3D980318 read=3D19721<br>=C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Index Scan using grid_a_b_c_idx on grid =
=C2=A0(cost=3D0.57..867265.91 rows=3D1000007 width=3D16) (actual time=3D2.8=
52..2066.205 rows=3D1000000.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: (b &gt;=3D 0)<br>=C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0<b>Index Prefixes:</b> (a =3D ANY (&#=
39;{2,3,5,8,13,21,34,55}&#39;::integer[]))<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0Index Searches: 8<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D980318 read=3D19721<br>=
=C2=A0Planning Time: 0.328 ms<br>=C2=A0Execution Time: 2127.811 ms<br>=C2=
=A0```<br><br>If we disable index_mergescan we naturally we fall into a seq=
uential scan.<br><br>```sql<br>SET enable_indexmergescan =3D off;<br>EXPLAI=
N ANALYSE EXECUTE grid_query(1000000);<br>```<br>```text<br>=C2=A0 =C2=A0Bu=
ffers: shared hit=3D15208 read=3D525410, temp read=3D20207 written=3D35384<=
br>=C2=A0 =C2=A0-&gt; =C2=A0Limit =C2=A0(cost=3D1942895.64..2059362.12 rows=
=3D1000000 width=3D16) (actual time=3D3467.012..3712.044 rows=3D1000000.00 =
loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15208=
 read=3D525410, temp read=3D20207 written=3D35384<br>=C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0-&gt; =C2=A0Gather Merge =C2=A0(cost=3D1942895.64..2853663.39 =
rows=3D7819999 width=3D16) (actual time=3D3467.010..3671.220 rows=3D1000000=
.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Wo=
rkers Planned: 2<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
Workers Launched: 2<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0Buffers: shared hit=3D15208 read=3D525410, temp read=3D20207 written=3D3=
5384<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0=
Sort =C2=A0(cost=3D1941895.62..1950041.45 rows=3D3258333 width=3D16) (actua=
l time=3D3455.852..3476.358 rows=3D334576.33 loops=3D3)<br>=C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: gri=
d.b<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0Sort Method: <b>external merge =C2=A0Disk: 47016kB</b><br>=C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffer=
s: shared hit=3D15208 read=3D525410, temp read=3D20207 written=3D35384<br>=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0Worker 0: =C2=A0Sort Method: external merge =C2=A0Disk: 46976kB<br>=C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Wo=
rker 1: =C2=A0Sort Method: external merge =C2=A0Disk: 47000kB<br>=C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=
=A0<b>Parallel Seq Scan</b> on grid =C2=A0(cost=3D0.00..1478044.00 rows=3D3=
258333 width=3D16) (actual time=3D2.789..2779.483 rows=3D2666666.67 loops=
=3D3)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Filter: ((b &gt;=3D 0) AND (a =3D ANY (&#39;=
{2,3,5,8,13,21,34,55}&#39;::integer[])))<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Rows Remo=
ved by Filter: 30666667<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D1513=
4 read=3D525410<br>=C2=A0Planning Time: 0.332 ms<br>=C2=A0Execution Time: 3=
761.866 ms<br>```<br>=C2=A0<br>If we disable sequential scans, then we get =
a bitmap scan<br><br>```sql<br>SET enable_seqscan =3D off;<br>EXPLAIN ANALY=
SE EXECUTE grid_query(1000000);<br>```<br>```text<br>=C2=A0 =C2=A0Buffers: =
shared hit=3D629 read=3D113759 written=3D2, temp read=3D20207 written=3D353=
85<br>=C2=A0 =C2=A0-&gt; =C2=A0Limit =C2=A0(cost=3D1998199.78..2114666.26 r=
ows=3D1000000 width=3D16) (actual time=3D5170.456..5453.433 rows=3D1000000.=
00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D62=
9 read=3D113759 written=3D2, temp read=3D20207 written=3D35385<br>=C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Gather Merge =C2=A0(cost=3D1998199.7=
8..2908967.53 rows=3D7819999 width=3D16) (actual time=3D5170.455..5413.254 =
rows=3D1000000.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0Workers Planned: 2<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0Workers Launched: 2<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D629 read=3D113759 written=3D2, te=
mp read=3D20207 written=3D35385<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0-&gt; =C2=A0Sort =C2=A0(cost=3D1997199.75..2005345.59 rows=
=3D3258333 width=3D16) (actual time=3D5156.929..5177.507 rows=3D334500.67 l=
oops=3D3)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0Sort Key: grid.b<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Method: external merge =C2=A0Disk: =
47032kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0Buffers: shared hit=3D629 read=3D113759 written=3D2, temp read=
=3D20207 written=3D35385<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Sort Method: external merge =
=C2=A0Disk: 47280kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Sort Method: external merge =C2=A0D=
isk: 46680kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0-&gt; =C2=A0Parallel Bitmap Heap Scan on grid =C2=A0(cost=
=3D107691.54..1533348.13 rows=3D3258333 width=3D16) (actual time=3D299.891.=
.4489.787 rows=3D2666666.67 loops=3D3)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Recheck C=
ond: ((a =3D ANY (&#39;{2,3,5,8,13,21,34,55}&#39;::integer[])) AND (b &gt;=
=3D 0))<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Rows <b>Removed by Index Recheck</b>: 241=
0242<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Heap Blocks: exact=3D13100 lossy=3D22639<br>=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D615 read=3D113759 written=3D2=
<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Heap Blocks: exact=3D13077 lossy=
=3D22755<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Heap Blocks: exact=3D1303=
6 lossy=3D22421<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0<b>Bitmap Index Scan</=
b> on grid_a_b_c_idx =C2=A0(cost=3D0.00..105736.54 rows=3D7820000 width=3D0=
) (actual time=3D297.651..297.651 rows=3D8000000.00 loops=3D1)<br>=C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: ((a =3D ANY (&#39;{2,3,5,=
8,13,21,34,55}&#39;::integer[])) AND (b &gt;=3D 0))<br>=C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0Index Searches: 7<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0Buffers: shared hit=3D13 read=3D7293 written=3D2<br>=C2=A0=
Planning Time: 0.165 ms<br>=C2=A0Execution Time: 5487.213 ms<br>```<br><br>=
If we disable bitmap scans we finally get an index scan<br><br>```sql<br>SE=
T enable_bitmapscan =3D off;<br>EXPLAIN ANALYSE EXECUTE grid_query(1000000)=
;<br>```<br>```<br>=C2=A0 =C2=A0Buffers: shared hit=3D7883221 read=3D124111=
, temp read=3D20699 written=3D35385<br>=C2=A0 =C2=A0-&gt; =C2=A0Limit =C2=
=A0(cost=3D7201203.08..7317669.55 rows=3D1000000 width=3D16) (actual time=
=3D4414.478..4674.400 rows=3D1000000.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0Buffers: shared hit=3D7883221 read=3D124111, temp read=3D20699=
 written=3D35385<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Gather Me=
rge =C2=A0(cost=3D7201203.08..8111970.83 rows=3D7819999 width=3D16) (actual=
 time=3D4414.476..4633.982 rows=3D1000000.00 loops=3D1)<br>=C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Planned: 2<br>=C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Launched: 2<br>=C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D78832=
21 read=3D124111, temp read=3D20699 written=3D35385<br>=C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Sort =C2=A0(cost=3D7200203.0=
5..7208348.88 rows=3D3258333 width=3D16) (actual time=3D4390.625..4411.896 =
rows=3D334567.00 loops=3D3)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: grid.b<br>=C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Method: <b>exte=
rnal merge =C2=A0Disk: 47304kB</b><br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D7883221 read=
=3D124111, temp read=3D20699 written=3D35385<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Sort Metho=
d: external merge =C2=A0Disk: 47304kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Sort Method: exte=
rnal merge =C2=A0Disk: 46384kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0<b>Parallel Index Scan</b> u=
sing grid_a_b_c_idx on grid =C2=A0(cost=3D0.57..6736351.43 rows=3D3258333 w=
idth=3D16) (actual time=3D46.925..3796.915 rows=3D2666666.67 loops=3D3)<br>=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: ((a =3D ANY (&#39;{2,3,5,8,13,21,34,55}=
&#39;::integer[])) AND (b &gt;=3D 0))<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Search=
es: 7<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D7883208 read=3D124110<=
br>=C2=A0Planning Time: 0.385 ms<br>=C2=A0Execution Time: 4713.325 ms<br>=
=C2=A0```<br><br>=C2=A0</div><div><br></div><div><br></div></div><div dir=
=3D"ltr"><br></div><br><div class=3D"gmail_quote"><div dir=3D"ltr" class=3D=
"gmail_attr">On Thu, Feb 5, 2026 at 6:59=E2=80=AFAM Alexandre Felipe &lt;<a=
 href=3D"mailto:o.alexandre.felipe@gmail.com" target=3D"_blank">o.alexandre=
.felipe@gmail.com</a>&gt; wrote:<br></div><blockquote class=3D"gmail_quote"=
 style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);p=
adding-left:1ex"><div dir=3D"ltr">Thank you for looking into this.<div><br>=
</div><div>Now we can execute a, still narrow, family queries!</div><div><b=
r></div><div>Maybe it helps to see this as a=C2=A0<u>social network=C2=A0fe=
eds</u>. Imagine a social network, you have a few friends, or follow a few =
people, and you want to see their updates ordered by date. For each user we=
 have a different combination of users that we have to display. But maybe, =
even having hundreds of users we will only show the first 10.</div><div><br=
></div><div>There is a low hanging fruit on the skip scan, if we need N row=
s, and one group already has M rows we could stop there.</div><div>If Nx is=
 the number of friends, and M is the number of posts to show.</div><div>Thi=
s runs with complexity (Nx * M) rows, followed by an (Nx * M) sort, instead=
 of (Nx * N) followed by an (Nx * N) sort.</div><div>Where M =3D 10 and N i=
s 1000 this is a significant improvement.</div><div>But if M ~ N, the merge=
 scan=C2=A0that runs with M=C2=A0+ Nx row accesses, (M=C2=A0+ Nx) heap oper=
ations.</div><div>If everything is on the same page the skip scan would win=
.</div><div><br></div><div>The cost estimation is probably far off.</div><d=
iv>I am also not considering the filters=C2=A0applied after this=C2=A0opera=
tor,=C2=A0and I don&#39;t know if the planner infrastructure is able to adj=
ust it by itself.</div><div>This is where I would like reviewer&#39;s feedb=
ack. I think that the planner costs are something to be determined experime=
ntally.</div><div><br></div><div>Next I will make it slightly more general =
handling<br>* More index columns: Index (a, b, s...) could support WHERE a =
IN (...) ORDER BY b LIMIT N (ignoring s...)<br>* Multi-column prefix: WHERE=
 (a, b) IN (...) ORDER BY c<br>* Non-leading prefix: WHERE b IN (...) AND a=
 =3D const ORDER BY c on index (a, b, c)</div><div><br></div><div>---</div>=
<div>Kind Regards,</div><div>Alexandre</div></div><br><div class=3D"gmail_q=
uote"><div dir=3D"ltr" class=3D"gmail_attr">On Wed, Feb 4, 2026 at 7:13=E2=
=80=AFAM Micha=C5=82 K=C5=82eczek &lt;<a href=3D"mailto:michal@kleczek.org"=
 target=3D"_blank">michal@kleczek.org</a>&gt; wrote:<br></div><blockquote c=
lass=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px soli=
d rgb(204,204,204);padding-left:1ex"><div><br id=3D"m_6080345663077266924m_=
-3860514245975109604m_1178963958729505245m_7370567820092274636m_45794186525=
61517672m_3586735336800096661lineBreakAtBeginningOfMessage"><div><br><block=
quote type=3D"cite"><div>On 3 Feb 2026, at 22:42, Ants Aasma &lt;<a href=3D=
"mailto:ants.aasma@cybertec.at" target=3D"_blank">ants.aasma@cybertec.at</a=
>&gt; wrote:</div><br><div><div>On Mon, 2 Feb 2026 at 01:54, Tomas Vondra &=
lt;<a href=3D"mailto:tomas@vondra.me" target=3D"_blank">tomas@vondra.me</a>=
&gt; wrote:<br><blockquote type=3D"cite">I&#39;m also wondering how common =
is the targeted query pattern? How common<br>it is to have an IN condition =
on the leading column in an index, and<br>ORDER BY on the second one?<br></=
blockquote><br>I have seen this pattern multiple times. My nickname for it =
is the<br>timeline view. Think of the social media timeline, showing posts =
from<br>all followed accounts in timestamp order, returned in reasonably si=
zed<br>batches. The naive SQL query will have to scan all posts from all<br=
>followed accounts and pass them through a top-N sort. When the total<br>nu=
mber of posts is much larger than the batch size this is much slower<br>tha=
n what is proposed here (assuming I understand it correctly) -<br>effective=
ly equivalent to running N index scans through Merge Append.</div></div></b=
lockquote><blockquote type=3D"cite"><div><div><br>My workarounds I have pro=
posed users have been either to rewrite the<br>query as a UNION ALL of a se=
t of single value prefix queries wrapped<br>in an order by limit. This give=
s the exact needed merge append plan<br>shape. But repeating the query N ti=
mes can get unwieldy when the<br>number of values grows, so the fallback is=
:<br><br>SELECT * FROM unnest(:friends) id, LATERAL (<br> =C2=A0=C2=A0=C2=
=A0SELECT * FROM posts<br> =C2=A0=C2=A0=C2=A0WHERE user_id =3D id<br> =C2=
=A0=C2=A0=C2=A0ORDER BY tstamp DESC LIMIT 100)<br>ORDER BY tstamp DESC LIMI=
T 100;<br><br>The downside of this formulation is that we still have to fet=
ch a<br>batch worth of items from scans where we otherwise would have only =
had<br>to look at one index tuple.<br></div></div></blockquote><div><br></d=
iv><div>GIST can be used to handle this kind of queries as it supports mult=
iple sort orders.</div><div>The only problem is that GIST does not support =
ORDER BY column.</div><div>One possible workaround is [1] but as described =
there it does not play well with partitioning.</div><div>I=E2=80=99ve start=
ed drafting support for ORDER BY column in GIST - see [2].</div><div>I thin=
k it would be easier to implement and maintain than a new IAM (but I don=E2=
=80=99t have enough knowledge and experience to implement it myself)</div><=
div><br></div><div>[1]=C2=A0<a href=3D"https://www.postgresql.org/message-i=
d/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org" target=3D"_blank">htt=
ps://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40k=
leczek.org</a></div><div>[2]=C2=A0<a href=3D"https://www.postgresql.org/mes=
sage-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91%40kleczek.org" target=3D"_blan=
k">https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5DC=
91%40kleczek.org</a></div><br></div>=E2=80=94<div>Kind regards,<br><div>Mic=
hal</div></div></div></blockquote></div>
</blockquote></div>
</div>
</div>
</blockquote></div>
</blockquote></div>

--000000000000f826dc064d379ae2--