MIME-Version: 1.0
References: <AS1PR02MB784695AFEC37179FFAF7EAE19A9DA@AS1PR02MB7846.eurprd02.prod.outlook.com>
 <a652515e-278a-4838-815b-ecd2ad4495f6@vondra.me> <CANwKhkNd85u+4joaKR3YHoDOQSMg5SmJmsYJGo-tMyW=XVXTew@mail.gmail.com>
 <E313FDE4-8138-44CC-99CE-60F38251D878@kleczek.org> <CAE8JnxNM9Bh=LCGOzayewDgX3-kUNXdTDwNSDwsf+t=wKhPiCQ@mail.gmail.com>
 <CAE8JnxM5GDEWdvEckjgG60OwPK04pZ9dSyxYm2+-PuyKCpmo-w@mail.gmail.com>
In-Reply-To: <CAE8JnxM5GDEWdvEckjgG60OwPK04pZ9dSyxYm2+-PuyKCpmo-w@mail.gmail.com>
From: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Date: Mon, 23 Feb 2026 22:08:29 +0000
Message-ID: <CAE8JnxOJoWF-ABi5EtsrmBg3FRtmyk+D0Na8=e1vCwMaG1B2Lg@mail.gmail.com>
Subject: Re: New access method for b-tree.
To: "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>, michael@paquier.xyz, tgl@sss.pgh.pa.us, 
	"peter@eisentraut.org" <peter@eisentraut.org>
Cc: Ants Aasma <ants.aasma@cybertec.at>, Tomas Vondra <tomas@vondra.me>, 
	Alexandre Felipe <alexandre.felipe@tpro.io>, =?UTF-8?B?TWljaGHFgiBLxYJlY3plaw==?= <michal@kleczek.org>, 
	pg@bowt.ie
Content-Type: multipart/alternative; boundary="0000000000006fb55f064b8503b1"
Archived-At: <https://www.postgresql.org/message-id/CAE8JnxOJoWF-ABi5EtsrmBg3FRtmyk%2BD0Na8%3De1vCwMaG1B2Lg%40mail.gmail.com>
Precedence: bulk

--0000000000006fb55f064b8503b1
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Hackers,

Do you think that MERGE-SCAN was a terrible name? I wanted a name that
wouldn't require much explanation. I named it like this because it relies
on a k-way merge to combine several segments of an index into one result.
But we already have a MERGE statement, and even in the example plans from
my previous message we can see an external merge that has nothing to do
with the new feature.

Now that I am working on joins, I started implementing it in the
NestedLoop, trying to follow the same conditions that lead to a Memoize.
But I added so many fields to the NestedLoop state that I think it is
better to have a separate structure, and maybe a separate node, and
MergeScan of course is taken hehe. I was thinking of IndexPrefixMerge.
We could use Ants' nickname, TimeLineScan, but of course it is not
limited to timelines (even though realistically, that will probably be
the most common use of this). Another one I considered was
TransposedIndexScan, because it orders the output on (suffix, prefix)
instead of (prefix, suffix).
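
To make the naming question more concrete, here is a rough Python sketch
of the idea behind the name. It is only an illustration, not the C code
of the patch: the index is modelled as a sorted list of (prefix, suffix)
tuples, and the names segment and prefix_merge are made up for this
example. Each prefix selects one contiguous run of the index that is
already ordered by suffix, and a k-way merge (heapq.merge here) combines
the runs lazily, so the limit can stop the scan early.

```python
import heapq
from bisect import bisect_left
from itertools import islice


def segment(index, prefix):
    """Yield (suffix, prefix) for one prefix, in suffix order.

    `index` stands in for a btree on (prefix, suffix): a sorted list of
    tuples. One binary search finds the start of the run for `prefix`.
    """
    for pos in range(bisect_left(index, (prefix,)), len(index)):
        if index[pos][0] > prefix:
            return  # walked past the run belonging to this prefix
        yield (index[pos][1], prefix)


def prefix_merge(index, prefixes, limit):
    """Return the first `limit` entries ordered by (suffix, prefix)."""
    # heapq.merge pulls one entry per run up front, then one more entry
    # for every row it emits, so islice stops the scan at `limit` rows.
    return list(islice(
        heapq.merge(*(segment(index, p) for p in prefixes)), limit))
```

With an index holding every pair (a, b) for a in 1..5 and b in 1..3,
asking for prefixes 2 and 5 with a limit of 4 returns (1, 2), (1, 5),
(2, 2), (2, 5): the output is ordered by suffix first, which is where
the TransposedIndexScan idea comes from.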


On Fri, Feb 6, 2026 at 10:52=E2=80=AFAM Alexandre Felipe <
o.alexandre.felipe@gmail.com> wrote:

> Hello again hackers!
>
> +pt@bowt.ie <pt@bowt.ie>: who seems to be the person most familiar
> with the index scan code (based on the commits).
> +michael@paquier.xyz <michael@paquier.xyz> , +tgl@sss.pgh.pa.us
> <tgl@sss.pgh.pa.us> , +peter@eisentraut.org <peter@eisentraut.org> as the
> top 3 committers to nbtree over the last ~6 months.
>
> I have made substantial progress on adding a few features. I have
> questions, but I will let you go first :)
>
> Motivation:
> *In technical terms:* this proposal is to take advantage of a btree index
> when the query is filtered by a few distinct prefixes and ordered by a
> suffix and has a limit.
> *In non-technical terms:* this could help to efficiently render a social
> network feed, where each user can select a list of users whose posts they
> want to see, and the posts must be ordered from newest to oldest.
>
>
> *Performance Comparison*
> I did a test with a toy table, please find more details below.
>
> With limit 100
>
> | Method     | Shared Hit | Shared Read | Exec Time |
> |------------|-----------:|------------:|----------:|
> | Merge      |         13 |         119 |     13 ms |
> | Sequential |     15,308 |     525,310 |  3,409 ms |
>
> With limit 1,000,000
>
> | Method     | Shared Hit | Shared Read | Temp Read | Temp Written | Exec Time |
> |------------|-----------:|------------:|----------:|-------------:|----------:|
> | Merge      |    980,318 |      19,721 |         0 |            0 |  2,128 ms |
> | Sequential |     15,208 |     525,410 |    20,207 |       35,384 |  3,762 ms |
> | Bitmap     |        629 |     113,759 |    20,207 |       35,385 |  5,487 ms |
> | IndexScan  |  7,880,619 |     126,706 |    20,945 |       35,386 |  5,874 ms |
>
> In this case, sequential and bitmap scans access significantly fewer
> buffers than the index scans, because the table has only four integer
> columns and these methods can read all the rows on a given page at once.
>
> However, that comes at the cost of resorting to an on-disk sort.
> For the query with limit 100 we get no temp files, as it uses a
> top-N heapsort.
>
> make check passes
>
>
> *Experiment details*
>
> Consider a 100M-row table holding every (a, b, c, d) in
> {1..100} x {1..100} x {1..100} x {1..100}:
>
>
> ```sql
> CREATE TABLE grid AS (
>     SELECT a, b, c, d FROM
>         generate_series(1, 100) AS a,
>         generate_series(1, 100) AS b,
>         generate_series(1, 100) AS c,
>         generate_series(1, 100) AS d
> );
>
> CREATE INDEX grid_a_b_c_idx ON grid (a, b, c);
> ANALYSE grid;
> ```
>
> Now let's say that we need to find a certain number of rows filtered by
> a and ordered by b:
> ```sql
> PREPARE grid_query(int) AS
> SELECT sum(d) FROM (
>     SELECT * FROM grid
>     WHERE a IN (2,3,5,8,13,21,34,55) AND b >=3D 0
>     ORDER BY b
>     LIMIT $1) t;
> ```
>
> ---
>
>
> Now with limit 100, with index merge scan (notice Index Prefixes in the
> plan).
>
> ```sql
> SET enable_indexmergescan =3D on;
> EXPLAIN (ANALYSE) EXECUTE grid_query(100);
> ```
>
> ```text
>    Buffers: shared hit=3D13 read=3D119
>    ->  Limit  (cost=3D0.57..87.29 rows=3D100 width=3D16) (actual
> time=3D5.528..12.999 rows=3D100.00 loops=3D1)
>          Buffers: shared hit=3D13 read=3D119
>          ->  Index Scan using grid_a_b_c_idx on grid  (cost=3D0.57..93.36
> rows=3D107 width=3D16) (actual time=3D5.528..12.994 rows=3D100.00 loops=
=3D1)
>                Index Cond: (b >=3D 0)
>                *Index Prefixes: *(a =3D ANY
> ('{2,3,5,8,13,21,34,55}'::integer[]))
>                Index Searches: 8
>                Buffers: shared hit=3D13 read=3D119
>  Planning:
>    Buffers: shared hit=3D59 read=3D23
>  Planning Time: 4.619 ms
>  Execution Time: 13.055 ms
>  ```
>
>
> ```sql
> SET enable_indexmergescan =3D off;
> EXPLAIN (ANALYSE) EXECUTE grid_query(100);
> ```
>
> ```text
>  Aggregate  (cost=3D1603588.06..1603588.07 rows=3D1 width=3D8) (actual
> time=3D3406.624..3408.710 rows=3D1.00 loops=3D1)
>    Buffers: shared hit=3D15308 read=3D525310
>    ->  Limit  (cost=3D1603575.17..1603586.81 rows=3D100 width=3D16) (actu=
al
> time=3D3406.601..3408.702 rows=3D100.00 loops=3D1)
>          Buffers: shared hit=3D15308 read=3D525310
>          ->  Gather Merge  (cost=3D1603575.17..2514342.92 rows=3D7819999
> width=3D16) (actual time=3D3406.598..3408.695 rows=3D100.00 loops=3D1)
>                Workers Planned: 2
>                Workers Launched: 2
>                Buffers: shared hit=3D15308 read=3D525310
>                ->  Sort  (cost=3D1602575.14..1610720.98 rows=3D3258333
> width=3D16) (actual time=3D3393.782..3393.784 rows=3D100.00 loops=3D3)
>                      Sort Key: grid.b
>                      Sort Method: top-N heapsort  Memory: 32kB
>                      Buffers: shared hit=3D15308 read=3D525310
>                      Worker 0:  Sort Method: top-N heapsort  Memory: 32kB
>                      Worker 1:  Sort Method: top-N heapsort  Memory: 32kB
>                      ->  *Parallel Seq Scan* on grid
>  (cost=3D0.00..1478044.00 rows=3D3258333 width=3D16) (actual time=3D0.944=
..3129.896
> rows=3D2666666.67 loops=3D3)
>                            Filter: ((b >=3D 0) AND (a =3D ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])))
>                            Rows Removed by Filter: 30666667
>                            Buffers: shared hit=3D15234 read=3D525310
>  Planning Time: 0.370 ms
>  Execution Time: 3409.134 ms
>  ```
>
> Now queries with limit 1,000,000
>
> ```sql
> SET enable_indexmergescan =3D on;
> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
> ```
>
> Query executed with the proposed access method. Notice in the plan Index
> Prefixes and Index Cond.
> ```text
>    Buffers: shared hit=3D980318 read=3D19721
>    ->  Limit  (cost=3D0.57..867259.84 rows=3D1000000 width=3D16) (actual
> time=3D2.854..2103.438 rows=3D1000000.00 loops=3D1)
>          Buffers: shared hit=3D980318 read=3D19721
>          ->  Index Scan using grid_a_b_c_idx on grid
>  (cost=3D0.57..867265.91 rows=3D1000007 width=3D16) (actual time=3D2.852.=
.2066.205
> rows=3D1000000.00 loops=3D1)
>                Index Cond: (b >=3D 0)
>                *Index Prefixes:* (a =3D ANY
> ('{2,3,5,8,13,21,34,55}'::integer[]))
>                Index Searches: 8
>                Buffers: shared hit=3D980318 read=3D19721
>  Planning Time: 0.328 ms
>  Execution Time: 2127.811 ms
>  ```
>
> If we disable enable_indexmergescan, we naturally fall back to a
> sequential scan.
>
> ```sql
> SET enable_indexmergescan =3D off;
> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
> ```
> ```text
>    Buffers: shared hit=3D15208 read=3D525410, temp read=3D20207 written=
=3D35384
>    ->  Limit  (cost=3D1942895.64..2059362.12 rows=3D1000000 width=3D16) (=
actual
> time=3D3467.012..3712.044 rows=3D1000000.00 loops=3D1)
>          Buffers: shared hit=3D15208 read=3D525410, temp read=3D20207
> written=3D35384
>          ->  Gather Merge  (cost=3D1942895.64..2853663.39 rows=3D7819999
> width=3D16) (actual time=3D3467.010..3671.220 rows=3D1000000.00 loops=3D1=
)
>                Workers Planned: 2
>                Workers Launched: 2
>                Buffers: shared hit=3D15208 read=3D525410, temp read=3D202=
07
> written=3D35384
>                ->  Sort  (cost=3D1941895.62..1950041.45 rows=3D3258333
> width=3D16) (actual time=3D3455.852..3476.358 rows=3D334576.33 loops=3D3)
>                      Sort Key: grid.b
>                      Sort Method: *external merge  Disk: 47016kB*
>                      Buffers: shared hit=3D15208 read=3D525410, temp
> read=3D20207 written=3D35384
>                      Worker 0:  Sort Method: external merge  Disk: 46976k=
B
>                      Worker 1:  Sort Method: external merge  Disk: 47000k=
B
>                      ->  *Parallel Seq Scan* on grid
>  (cost=3D0.00..1478044.00 rows=3D3258333 width=3D16) (actual time=3D2.789=
..2779.483
> rows=3D2666666.67 loops=3D3)
>                            Filter: ((b >=3D 0) AND (a =3D ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])))
>                            Rows Removed by Filter: 30666667
>                            Buffers: shared hit=3D15134 read=3D525410
>  Planning Time: 0.332 ms
>  Execution Time: 3761.866 ms
> ```
>
> If we also disable sequential scans, we get a bitmap scan:
>
> ```sql
> SET enable_seqscan =3D off;
> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
> ```
> ```text
>    Buffers: shared hit=3D629 read=3D113759 written=3D2, temp read=3D20207
> written=3D35385
>    ->  Limit  (cost=3D1998199.78..2114666.26 rows=3D1000000 width=3D16) (=
actual
> time=3D5170.456..5453.433 rows=3D1000000.00 loops=3D1)
>          Buffers: shared hit=3D629 read=3D113759 written=3D2, temp read=
=3D20207
> written=3D35385
>          ->  Gather Merge  (cost=3D1998199.78..2908967.53 rows=3D7819999
> width=3D16) (actual time=3D5170.455..5413.254 rows=3D1000000.00 loops=3D1=
)
>                Workers Planned: 2
>                Workers Launched: 2
>                Buffers: shared hit=3D629 read=3D113759 written=3D2, temp
> read=3D20207 written=3D35385
>                ->  Sort  (cost=3D1997199.75..2005345.59 rows=3D3258333
> width=3D16) (actual time=3D5156.929..5177.507 rows=3D334500.67 loops=3D3)
>                      Sort Key: grid.b
>                      Sort Method: external merge  Disk: 47032kB
>                      Buffers: shared hit=3D629 read=3D113759 written=3D2,=
 temp
> read=3D20207 written=3D35385
>                      Worker 0:  Sort Method: external merge  Disk: 47280k=
B
>                      Worker 1:  Sort Method: external merge  Disk: 46680k=
B
>                      ->  Parallel Bitmap Heap Scan on grid
>  (cost=3D107691.54..1533348.13 rows=3D3258333 width=3D16) (actual
> time=3D299.891..4489.787 rows=3D2666666.67 loops=3D3)
>                            Recheck Cond: ((a =3D ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0))
>                            Rows *Removed by Index Recheck*: 2410242
>                            Heap Blocks: exact=3D13100 lossy=3D22639
>                            Buffers: shared hit=3D615 read=3D113759 writte=
n=3D2
>                            Worker 0:  Heap Blocks: exact=3D13077 lossy=3D=
22755
>                            Worker 1:  Heap Blocks: exact=3D13036 lossy=3D=
22421
>                            ->  *Bitmap Index Scan* on grid_a_b_c_idx
>  (cost=3D0.00..105736.54 rows=3D7820000 width=3D0) (actual time=3D297.651=
..297.651
> rows=3D8000000.00 loops=3D1)
>                                  Index Cond: ((a =3D ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0))
>                                  Index Searches: 7
>                                  Buffers: shared hit=3D13 read=3D7293 wri=
tten=3D2
>  Planning Time: 0.165 ms
>  Execution Time: 5487.213 ms
> ```
>
> If we also disable bitmap scans, we finally get an index scan:
>
> ```sql
> SET enable_bitmapscan =3D off;
> EXPLAIN ANALYSE EXECUTE grid_query(1000000);
> ```
> ```text
>    Buffers: shared hit=3D7883221 read=3D124111, temp read=3D20699 written=
=3D35385
>    ->  Limit  (cost=3D7201203.08..7317669.55 rows=3D1000000 width=3D16) (=
actual
> time=3D4414.478..4674.400 rows=3D1000000.00 loops=3D1)
>          Buffers: shared hit=3D7883221 read=3D124111, temp read=3D20699
> written=3D35385
>          ->  Gather Merge  (cost=3D7201203.08..8111970.83 rows=3D7819999
> width=3D16) (actual time=3D4414.476..4633.982 rows=3D1000000.00 loops=3D1=
)
>                Workers Planned: 2
>                Workers Launched: 2
>                Buffers: shared hit=3D7883221 read=3D124111, temp read=3D2=
0699
> written=3D35385
>                ->  Sort  (cost=3D7200203.05..7208348.88 rows=3D3258333
> width=3D16) (actual time=3D4390.625..4411.896 rows=3D334567.00 loops=3D3)
>                      Sort Key: grid.b
>                      Sort Method: *external merge  Disk: 47304kB*
>                      Buffers: shared hit=3D7883221 read=3D124111, temp
> read=3D20699 written=3D35385
>                      Worker 0:  Sort Method: external merge  Disk: 47304k=
B
>                      Worker 1:  Sort Method: external merge  Disk: 46384k=
B
>                      ->  *Parallel Index Scan* using grid_a_b_c_idx on
> grid  (cost=3D0.57..6736351.43 rows=3D3258333 width=3D16) (actual
> time=3D46.925..3796.915 rows=3D2666666.67 loops=3D3)
>                            Index Cond: ((a =3D ANY
> ('{2,3,5,8,13,21,34,55}'::integer[])) AND (b >=3D 0))
>                            Index Searches: 7
>                            Buffers: shared hit=3D7883208 read=3D124110
>  Planning Time: 0.385 ms
>  Execution Time: 4713.325 ms
>  ```
>
>
>
>
>
>
> On Thu, Feb 5, 2026 at 6:59=E2=80=AFAM Alexandre Felipe <
> o.alexandre.felipe@gmail.com> wrote:
>
>> Thank you for looking into this.
>>
>> Now we can execute a (still narrow) family of queries!
>>
>> Maybe it helps to see this as a *social network feed*. Imagine a social
>> network where you have a few friends, or follow a few people, and you
>> want to see their updates ordered by date. For each user we have a
>> different combination of users whose posts we have to display. But even
>> if each of them has hundreds of posts, we may only show the first 10.
>>
>> There is a low-hanging fruit in the skip scan: if we need M rows and one
>> group has already produced M rows, we can stop scanning that group.
>> Let Nx be the number of friends, M the number of posts to show, and N
>> the number of posts per friend. The scan then reads (Nx * M) rows
>> followed by an (Nx * M) sort, instead of (Nx * N) rows followed by an
>> (Nx * N) sort.
>> With M =3D 10 and N =3D 1000 this is a significant improvement.
>> But if M ~ N the gain vanishes, whereas the merge scan runs with
>> (M + Nx) row accesses and (M + Nx) heap operations.
>> If everything is on the same page the skip scan would win.
>>
>> The cost estimation is probably far off.
>> I am also not considering the filters applied after this operator, and I
>> don't know if the planner infrastructure is able to adjust it by itself.
>> This is where I would like reviewers' feedback. I think that the planner
>> costs are something to be determined experimentally.
>>
>> Next I will make it slightly more general, handling:
>> * More index columns: Index (a, b, s...) could support WHERE a IN (...)
>> ORDER BY b LIMIT N (ignoring s...)
>> * Multi-column prefix: WHERE (a, b) IN (...) ORDER BY c
>> * Non-leading prefix: WHERE b IN (...) AND a =3D const ORDER BY c on ind=
ex
>> (a, b, c)
>>
>> ---
>> Kind Regards,
>> Alexandre
>>
>> On Wed, Feb 4, 2026 at 7:13=E2=80=AFAM Micha=C5=82 K=C5=82eczek <michal@=
kleczek.org> wrote:
>>
>>>
>>>
>>> On 3 Feb 2026, at 22:42, Ants Aasma <ants.aasma@cybertec.at> wrote:
>>>
>>> On Mon, 2 Feb 2026 at 01:54, Tomas Vondra <tomas@vondra.me> wrote:
>>>
>>> I'm also wondering how common is the targeted query pattern? How common
>>> it is to have an IN condition on the leading column in an index, and
>>> ORDER BY on the second one?
>>>
>>>
>>> I have seen this pattern multiple times. My nickname for it is the
>>> timeline view. Think of the social media timeline, showing posts from
>>> all followed accounts in timestamp order, returned in reasonably sized
>>> batches. The naive SQL query will have to scan all posts from all
>>> followed accounts and pass them through a top-N sort. When the total
>>> number of posts is much larger than the batch size this is much slower
>>> than what is proposed here (assuming I understand it correctly) -
>>> effectively equivalent to running N index scans through Merge Append.
>>>
>>>
>>> The workarounds I have proposed to users start with rewriting the
>>> query as a UNION ALL of a set of single value prefix queries wrapped
>>> in an order by limit. This gives the exact needed merge append plan
>>> shape. But repeating the query N times can get unwieldy when the
>>> number of values grows, so the fallback is:
>>>
>>> SELECT * FROM unnest(:friends) id, LATERAL (
>>>    SELECT * FROM posts
>>>    WHERE user_id =3D id
>>>    ORDER BY tstamp DESC LIMIT 100)
>>> ORDER BY tstamp DESC LIMIT 100;
>>>
>>> The downside of this formulation is that we still have to fetch a
>>> batch worth of items from scans where we otherwise would have only had
>>> to look at one index tuple.
>>>
>>>
>>> GIST can be used to handle this kind of queries as it supports multiple
>>> sort orders.
>>> The only problem is that GIST does not support ORDER BY column.
>>> One possible workaround is [1] but as described there it does not play
>>> well with partitioning.
>>> I=E2=80=99ve started drafting support for ORDER BY column in GIST - see=
 [2].
>>> I think it would be easier to implement and maintain than a new IAM (bu=
t
>>> I don=E2=80=99t have enough knowledge and experience to implement it my=
self)
>>>
>>> [1]
>>> https://www.postgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC=
21F%40kleczek.org
>>> [2]
>>> https://www.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5D=
C91%40kleczek.org
>>>
>>> =E2=80=94
>>> Kind regards,
>>> Michal
>>>
>>

--0000000000006fb55f064b8503b1
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr">Hi Hackers,<br><br>


<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">Do you=
 think that MERGE-SCAN was a terrible name? I wanted a name that wouldn&#39=
;t<span class=3D"gmail-Apple-converted-space">=C2=A0</span></span></p>
<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">requir=
e much explanation. I named it like this because it relies on a k-way<span =
class=3D"gmail-Apple-converted-space">=C2=A0</span></span></p>
<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">merge =
to combine several segments of an index in one result.=C2=A0But we already<=
span class=3D"gmail-Apple-converted-space">=C2=A0</span></span></p>
<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">have a=
 MERGE statement. Even in the example plan above we can see an=C2=A0externa=
l<span class=3D"gmail-Apple-converted-space">=C2=A0</span></span></p>
<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">merge =
that has nothing to do with the new feature, and now as I am doing joins,<s=
pan class=3D"gmail-Apple-converted-space">=C2=A0</span></span></p>
<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">I star=
ted doing it on the NestedLoop trying to follow the same conditions that<sp=
an class=3D"gmail-Apple-converted-space">=C2=A0</span></span></p>
<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">lead t=
o a memoize. But I added so many fields to the NestedLoop state that I<span=
 class=3D"gmail-Apple-converted-space">=C2=A0</span></span></p>
<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">think =
it is good to have a separate=C2=A0structure, and maybe a separate node, an=
d<span class=3D"gmail-Apple-converted-space">=C2=A0</span></span></p>
<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">MergeS=
can of course is taken hehe. I was thinking of IndexPrefixMerge. We could<s=
pan class=3D"gmail-Apple-converted-space">=C2=A0</span></span></p>
<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">use th=
e Ants nickname TimeLineScan, but of course it is not limited to time<span =
class=3D"gmail-Apple-converted-space">=C2=A0</span></span></p>
<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">lines =
(even though realistically, that will probably be the most common use of<sp=
an class=3D"gmail-Apple-converted-space">=C2=A0</span></span></p>
<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">this).=
 Another one I considered was TransposedIndexScan, because it orders<span c=
lass=3D"gmail-Apple-converted-space">=C2=A0</span></span></p>
<p class=3D"gmail-p1" style=3D"margin:0px;font-variant-numeric:normal;font-=
variant-east-asian:normal;font-variant-alternates:normal;font-size-adjust:n=
one;font-kerning:auto;font-feature-settings:normal;font-stretch:normal;font=
-size:11px;line-height:normal;font-family:Menlo;color:rgb(0,0,0)"><span cla=
ss=3D"gmail-s1" style=3D"font-variant-ligatures:no-common-ligatures">output=
 on (suffix, prefix) instead of (prefix, suffix).</span></p><div><br></div>=
<div><br><br></div></div><br><div class=3D"gmail_quote gmail_quote_containe=
r"><div dir=3D"ltr" class=3D"gmail_attr">On Fri, Feb 6, 2026 at 10:52=E2=80=
=AFAM Alexandre Felipe &lt;<a href=3D"mailto:o.alexandre.felipe@gmail.com">=
o.alexandre.felipe@gmail.com</a>&gt; wrote:<br></div><blockquote class=3D"g=
mail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204=
,204,204);padding-left:1ex"><div dir=3D"ltr"><div dir=3D"ltr"><div dir=3D"l=
tr"><div>Hello again hackers!</div><br><div><a class=3D"gmail_plusreply" id=
=3D"m_-3860514245975109604m_1178963958729505245gmail-plusReplyChip-0" href=
=3D"mailto:pt@bowt.ie" target=3D"_blank">+pt@bowt.ie</a>: That seems to be =
the one that is probably the most familiar with the index scan (based on th=
e commits).</div><div><a class=3D"gmail_plusreply" id=3D"m_-386051424597510=
9604m_1178963958729505245gmail-plusReplyChip-2" href=3D"mailto:michael@paqu=
ier.xyz" target=3D"_blank">+michael@paquier.xyz</a>=C2=A0,=C2=A0<a class=3D=
"gmail_plusreply" id=3D"m_-3860514245975109604m_1178963958729505245gmail-pl=
usReplyChip-4" href=3D"mailto:tgl@sss.pgh.pa.us" target=3D"_blank">+tgl@sss=
.pgh.pa.us</a>=C2=A0,=C2=A0<a class=3D"gmail_plusreply" id=3D"m_-3860514245=
975109604m_1178963958729505245gmail-plusReplyChip-5" href=3D"mailto:peter@e=
isentraut.org" target=3D"_blank">+peter@eisentraut.org</a>=C2=A0as the top =
3 committers to nbtree over the last ~6 months.<br></div><div><br></div><di=
v>I=C2=A0 have made substantial=C2=A0progress on adding a few features. I h=
ave questions, but I will let you go first :)</div><div><br></div><div><fon=
t size=3D"4">Motivation</font>:</div><div><b>In technical terms:</b> this p=
roposal is to take advantage of a btree index when the query is filtered by=
 a few distinct prefixes and ordered by a suffix and has a limit.<br></div>=
<div><b>In non technical:</b> This could help to efficiently render=C2=A0a =
social network feed, where each user can select a list of users whose posts=
 they want to see, and the posts must be ordered from newest to oldest.</di=
v><div><br></div><div><b><font size=3D"4">Performance Comparison</font><br>=
</b><br>I did a test with a toy table, please find more details below.<br><=
br>With limit 100<br><br><font face=3D"monospace">| Method =C2=A0 =C2=A0 | =
Shared Hit | Shared Read | Exec Time |<br>|------------|-----------:|------=
------:|----------:|<br>| Merge =C2=A0 =C2=A0 =C2=A0| =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 13 | =C2=A0 =C2=A0 =C2=A0 =C2=A0 119 | =C2=A0 =C2=A0 13 ms |<br>| In=
dexScan =C2=A0| =C2=A0 =C2=A0 15,308 | =C2=A0 =C2=A0 525,310 | =C2=A03,409 =
ms |</font><br><br>With limit 1,000,000<br><br><font face=3D"monospace">| M=
ethod =C2=A0 =C2=A0 | =C2=A0SharedHit | SharRead | Temp I | Temp O | Exec T=
ime |<br>|------------|-----------:|---------:|-------:|-------:|----------=
:|<br>| Merge =C2=A0 =C2=A0 =C2=A0| =C2=A0 =C2=A0980,318 | =C2=A0 19,721 | =
=C2=A0 =C2=A0 =C2=A00 | =C2=A0 =C2=A0 =C2=A00 | =C2=A02,128 ms |<br>| Seque=
ntial | =C2=A0 =C2=A0 15,208 | =C2=A0525,410 | 20,207 | 35,384 | =C2=A03,76=
2 ms |<br>| Bitmap =C2=A0 =C2=A0 | =C2=A0 =C2=A0 =C2=A0 =C2=A0629 | =C2=A01=
13,759 | 20,207 | 35,385 | =C2=A05,487 ms |<br>| IndexScan =C2=A0| =C2=A07,=
880,619 | =C2=A0126,706 | 20,945 | 35,386 | =C2=A05,874 ms |</font><br><br>=
Sequential scans and bitmap scans in this case reduces significantly the nu=
mber of <br>accessed buff because the table has only four integer columns, =
and these methods <br>can read all the lines on a given page at a time.<br>=
<br>However that comes at the cost of resorting to an in-disk sort method. =
<br>For the query with limit 100 we get no temp files as we are using a<br>=
top-100 sort.</div><div><br></div><div>make check passes</div><div><br><br>=
<b><font size=3D"4">Experiment details</font></b><br><br>Consider a 100M ro=
w table formed (a,b,c,d) \in 100 x 100 x 100 x 100 <br><br><br>```sql<br><f=
ont face=3D"monospace">CREATE TABLE grid AS (<br>=C2=A0 =C2=A0 SELECT a, b,=
 c, d, FROM <br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 generate_series(1, 100) AS a, <=
br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 generate_series(1, 100) AS b, <br>=C2=A0 =C2=
=A0 =C2=A0 =C2=A0 generate_series(1, 100) AS c, <br>=C2=A0 =C2=A0 =C2=A0 =
=C2=A0 generate_series(1, 100) AS d<br>);<br><br>CREATE INDEX grid_index ON=
 grid (a, b, c);<br>ANALYSE grid;</font><br>```<br><br>Now let&#39;s say th=
at we need to find certain number of rows filtered by a and ordered by b;<b=
r>```sql<br><font face=3D"monospace">PREPARE grid_query(int) AS<br>SELECT s=
um(d) FROM (<br>=C2=A0 =C2=A0 SELECT * FROM grid <br>=C2=A0 =C2=A0 WHERE a =
IN (2,3,5,8,13,21,34,55) AND b &gt;=3D 0 <br>=C2=A0 =C2=A0 ORDER BY b <br>=
=C2=A0 =C2=A0 LIMIT $1) t;</font><br>```<br><br>---<br><br><br>Now with lim=
it 100, with index merge scan (notice Index Prefixes in the plan).<br><br>`=
``sql<br>SET enable_indexmergescan =3D on;<br>EXPLAIN (ANALYSE) EXECUTE gri=
d_query(100);<br>```<br><br>```text<br>=C2=A0 =C2=A0Buffers: shared hit=3D1=
3 read=3D119<br>=C2=A0 =C2=A0-&gt; =C2=A0Limit =C2=A0(cost=3D0.57..87.29 ro=
ws=3D100 width=3D16) (actual time=3D5.528..12.999 rows=3D100.00 loops=3D1)<=
br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D13 read=3D119<br=
>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Index Scan using grid_a_b_c_=
idx on grid =C2=A0(cost=3D0.57..93.36 rows=3D107 width=3D16) (actual time=
=3D5.528..12.994 rows=3D100.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: (b &gt;=3D 0)<br>=C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0<b>Index Prefixes: </b>(a =3D ANY (&#=
39;{2,3,5,8,13,21,34,55}&#39;::integer[]))<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0Index Searches: 8<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D13 read=3D119<br>=C2=A0Pl=
anning:<br>=C2=A0 =C2=A0Buffers: shared hit=3D59 read=3D23<br>=C2=A0Plannin=
g Time: 4.619 ms<br>=C2=A0Execution Time: 13.055 ms<br>=C2=A0```<br><br><br=
>```sql<br>SET enable_indexmergescan =3D off;<br>EXPLAIN (ANALYSE) EXECUTE =
grid_query(100);<br>```<br><br>```text<br>=C2=A0Aggregate =C2=A0(cost=3D160=
3588.06..1603588.07 rows=3D1 width=3D8) (actual time=3D3406.624..3408.710 r=
ows=3D1.00 loops=3D1)<br>=C2=A0 =C2=A0Buffers: shared hit=3D15308 read=3D52=
5310<br>=C2=A0 =C2=A0-&gt; =C2=A0Limit =C2=A0(cost=3D1603575.17..1603586.81=
 rows=3D100 width=3D16) (actual time=3D3406.601..3408.702 rows=3D100.00 loo=
ps=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15308 re=
ad=3D525310<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Gather Merge =
=C2=A0(cost=3D1603575.17..2514342.92 rows=3D7819999 width=3D16) (actual tim=
e=3D3406.598..3408.695 rows=3D100.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Planned: 2<br>=C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Launched: 2<br>=C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15308 read=3D52=
5310<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0=
Sort =C2=A0(cost=3D1602575.14..1610720.98 rows=3D3258333 width=3D16) (actua=
l time=3D3393.782..3393.784 rows=3D100.00 loops=3D3)<br>=C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: grid.b=
<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0Sort Method: top-N heapsort =C2=A0Memory: 32kB<br>=C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared =
hit=3D15308 read=3D525310<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Sort Method: top-N heapsort =
=C2=A0Memory: 32kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Sort Method: top-N heapsort =C2=A0M=
emory: 32kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0-&gt; =C2=A0<b>Parallel Seq Scan</b> on grid =C2=A0(cost=
=3D0.00..1478044.00 rows=3D3258333 width=3D16) (actual time=3D0.944..3129.8=
96 rows=3D2666666.67 loops=3D3)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Filter: ((b &gt;=
=3D 0) AND (a =3D ANY (&#39;{2,3,5,8,13,21,34,55}&#39;::integer[])))<br>=C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0Rows Removed by Filter: 30666667<br>=C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0Buffers: shared hit=3D15234 read=3D525310<br>=C2=A0Planning Time: 0.3=
70 ms<br>=C2=A0Execution Time: 3409.134 ms<br>=C2=A0```<br><br>Now queries =
with limit 1,000,000<br><br>```sql<br>SET enable_indexmergescan =3D on;<br>=
EXPLAIN ANALYSE EXECUTE grid_query(1000000);<br>```<br><br>Query executed w=
ith the proposed access method. Notice in the plan Index Prefixes and Index=
 Cond.<br>```text<br>=C2=A0 =C2=A0Buffers: shared hit=3D980318 read=3D19721=
<br>=C2=A0 =C2=A0-&gt; =C2=A0Limit =C2=A0(cost=3D0.57..867259.84 rows=3D100=
0000 width=3D16) (actual time=3D2.854..2103.438 rows=3D1000000.00 loops=3D1=
)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D980318 read=3D=
19721<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Index Scan using gri=
d_a_b_c_idx on grid =C2=A0(cost=3D0.57..867265.91 rows=3D1000007 width=3D16=
) (actual time=3D2.852..2066.205 rows=3D1000000.00 loops=3D1)<br>=C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: (b &gt;=3D 0)<br>=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0<b>Index Prefixes:</=
b> (a =3D ANY (&#39;{2,3,5,8,13,21,34,55}&#39;::integer[]))<br>=C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Searches: 8<br>=C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D98031=
8 read=3D19721<br>=C2=A0Planning Time: 0.328 ms<br>=C2=A0Execution Time: 21=
27.811 ms<br>=C2=A0```<br><br>If we disable the index merge scan, we natura=
lly fall back to a sequential scan.<br><br>```sql<br>SET =
enable_indexmergescan =
=3D off;<br>EXPLAIN ANALYSE EXECUTE grid_query(1000000);<br>```<br>```text<=
br>=C2=A0 =C2=A0Buffers: shared hit=3D15208 read=3D525410, temp read=3D2020=
7 written=3D35384<br>=C2=A0 =C2=A0-&gt; =C2=A0Limit =C2=A0(cost=3D1942895.6=
4..2059362.12 rows=3D1000000 width=3D16) (actual time=3D3467.012..3712.044 =
rows=3D1000000.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: =
shared hit=3D15208 read=3D525410, temp read=3D20207 written=3D35384<br>=C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Gather Merge =C2=A0(cost=3D19428=
95.64..2853663.39 rows=3D7819999 width=3D16) (actual time=3D3467.010..3671.=
220 rows=3D1000000.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0Workers Planned: 2<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0Workers Launched: 2<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D15208 read=3D525410, temp rea=
d=3D20207 written=3D35384<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0-&gt; =C2=A0Sort =C2=A0(cost=3D1941895.62..1950041.45 rows=3D3258=
333 width=3D16) (actual time=3D3455.852..3476.358 rows=3D334576.33 loops=3D=
3)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0Sort Key: grid.b<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Method: <b>external merge =C2=A0Disk: 4701=
6kB</b><br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0Buffers: shared hit=3D15208 read=3D525410, temp read=3D20207 w=
ritten=3D35384<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Sort Method: external merge =C2=A0Disk:=
 46976kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0Worker 1: =C2=A0Sort Method: external merge =C2=A0Disk: 47000k=
B<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0-&gt; =C2=A0<b>Parallel Seq Scan</b> on grid =C2=A0(cost=3D0.00..1478=
044.00 rows=3D3258333 width=3D16) (actual time=3D2.789..2779.483 rows=3D266=
6666.67 loops=3D3)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Filter: ((b &gt;=3D 0) AND (a =
=3D ANY (&#39;{2,3,5,8,13,21,34,55}&#39;::integer[])))<br>=C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0Rows Removed by Filter: 30666667<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: =
shared hit=3D15134 read=3D525410<br>=C2=A0Planning Time: 0.332 ms<br>=C2=A0=
Execution Time: 3761.866 ms<br>```<br>=C2=A0<br>If we disable sequential sc=
ans, then we get a bitmap scan<br><br>```sql<br>SET enable_seqscan =3D off;=
<br>EXPLAIN ANALYSE EXECUTE grid_query(1000000);<br>```<br>```text<br>=C2=
=A0 =C2=A0Buffers: shared hit=3D629 read=3D113759 written=3D2, temp read=3D=
20207 written=3D35385<br>=C2=A0 =C2=A0-&gt; =C2=A0Limit =C2=A0(cost=3D19981=
99.78..2114666.26 rows=3D1000000 width=3D16) (actual time=3D5170.456..5453.=
433 rows=3D1000000.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffe=
rs: shared hit=3D629 read=3D113759 written=3D2, temp read=3D20207 written=
=3D35385<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Gather Merge =C2=
=A0(cost=3D1998199.78..2908967.53 rows=3D7819999 width=3D16) (actual time=
=3D5170.455..5413.254 rows=3D1000000.00 loops=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Planned: 2<br>=C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Launched: 2<br>=C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D629 read=3D1=
13759 written=3D2, temp read=3D20207 written=3D35385<br>=C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Sort =C2=A0(cost=3D199719=
9.75..2005345.59 rows=3D3258333 width=3D16) (actual time=3D5156.929..5177.5=
07 rows=3D334500.67 loops=3D3)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: grid.b<br>=C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Method: extern=
al merge =C2=A0Disk: 47032kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D629 read=3D113759 w=
ritten=3D2, temp read=3D20207 written=3D35385<br>=C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Sort Me=
thod: external merge =C2=A0Disk: 47280kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Sort Method: e=
xternal merge =C2=A0Disk: 46680kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Parallel Bitmap Heap Scan=
 on grid =C2=A0(cost=3D107691.54..1533348.13 rows=3D3258333 width=3D16) (ac=
tual time=3D299.891..4489.787 rows=3D2666666.67 loops=3D3)<br>=C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0Recheck Cond: ((a =3D ANY (&#39;{2,3,5,8,13,21,34,55}&#39;::integ=
er[])) AND (b &gt;=3D 0))<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Rows <b>Removed by Inde=
x Recheck</b>: 2410242<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Heap Blocks: exact=3D13100 =
lossy=3D22639<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D615 read=3D1=
13759 written=3D2<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=A0Heap Blocks: e=
xact=3D13077 lossy=3D22755<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Heap Bl=
ocks: exact=3D13036 lossy=3D22421<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0<b>B=
itmap Index Scan</b> on grid_a_b_c_idx =C2=A0(cost=3D0.00..105736.54 rows=
=3D7820000 width=3D0) (actual time=3D297.651..297.651 rows=3D8000000.00 loo=
ps=3D1)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: ((a =3D =
ANY (&#39;{2,3,5,8,13,21,34,55}&#39;::integer[])) AND (b &gt;=3D 0))<br>=C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Searches: 7<br>=C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D13 read=3D7293 written=
=3D2<br>=C2=A0Planning Time: 0.165 ms<br>=C2=A0Execution Time: 5487.213 ms<=
br>```<br><br>If we disable bitmap scans we finally get an index scan<br><b=
r>```sql<br>SET enable_bitmapscan =3D off;<br>EXPLAIN ANALYSE EXECUTE grid_=
query(1000000);<br>```<br>```text<br>=C2=A0 =C2=A0Buffers: shared hit=3D788=
3221=
 read=3D124111, temp read=3D20699 written=3D35385<br>=C2=A0 =C2=A0-&gt; =C2=
=A0Limit =C2=A0(cost=3D7201203.08..7317669.55 rows=3D1000000 width=3D16) (a=
ctual time=3D4414.478..4674.400 rows=3D1000000.00 loops=3D1)<br>=C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D7883221 read=3D124111, temp r=
ead=3D20699 written=3D35385<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=
=A0Gather Merge =C2=A0(cost=3D7201203.08..8111970.83 rows=3D7819999 width=
=3D16) (actual time=3D4414.476..4633.982 rows=3D1000000.00 loops=3D1)<br>=
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Planned: 2<b=
r>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Workers Launched: =
2<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared=
 hit=3D7883221 read=3D124111, temp read=3D20699 written=3D35385<br>=C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0Sort =C2=A0(cos=
t=3D7200203.05..7208348.88 rows=3D3258333 width=3D16) (actual time=3D4390.6=
25..4411.896 rows=3D334567.00 loops=3D3)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Key: grid.b<br>=C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Sort Meth=
od: <b>external merge =C2=A0Disk: 47304kB</b><br>=C2=A0 =C2=A0 =C2=A0 =C2=
=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D7=
883221 read=3D124111, temp read=3D20699 written=3D35385<br>=C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 0: =C2=
=A0Sort Method: external merge =C2=A0Disk: 47304kB<br>=C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Worker 1: =C2=A0Sort=
 Method: external merge =C2=A0Disk: 46384kB<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0-&gt; =C2=A0<b>Parallel Ind=
ex Scan</b> using grid_a_b_c_idx on grid =C2=A0(cost=3D0.57..6736351.43 row=
s=3D3258333 width=3D16) (actual time=3D46.925..3796.915 rows=3D2666666.67 l=
oops=3D3)<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0=
 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Index Cond: ((a =3D ANY (&#39;{2,3,5,8,1=
3,21,34,55}&#39;::integer[])) AND (b &gt;=3D 0))<br>=C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=
=A0Index Searches: 7<br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0Buffers: shared hit=3D78832=
08 read=3D124110<br>=C2=A0Planning Time: 0.385 ms<br>=C2=A0Execution Time: =
4713.325 ms<br>=C2=A0```<br><br>=C2=A0</div><div><br></div><div><br></div><=
/div><div dir=3D"ltr"><br></div><br><div class=3D"gmail_quote"><div dir=3D"=
ltr" class=3D"gmail_attr">On Thu, Feb 5, 2026 at 6:59=E2=80=AFAM Alexandre =
Felipe &lt;<a href=3D"mailto:o.alexandre.felipe@gmail.com" target=3D"_blank=
">o.alexandre.felipe@gmail.com</a>&gt; wrote:<br></div><blockquote class=3D=
"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(2=
04,204,204);padding-left:1ex"><div dir=3D"ltr">Thank you for looking into t=
his.<div><br></div><div>Now we can execute a, still narrow, family queries!=
</div><div><br></div><div>Maybe it helps to see this as a=C2=A0<u>social ne=
twork=C2=A0feeds</u>. Imagine a social network, you have a few friends, or =
follow a few people, and you want to see their updates ordered by date. For=
 each user we have a different combination of users that we have to display=
. But maybe, even having hundreds of users we will only show the first 10.<=
/div><div><br></div><div>There is a low hanging fruit on the skip scan, if =
we need N rows, and one group already has M rows we could stop there.</div>=
<div>If Nx is the number of friends, and M is the number of posts to show.<=
/div><div>This runs with complexity (Nx * M) rows, followed by an (Nx * M) =
sort, instead of (Nx * N) followed by an (Nx * N) sort.</div><div>Where M =
=3D 10 and N is 1000 this is a significant improvement.</div><div>But if M =
~ N, the merge scan=C2=A0that runs with M=C2=A0+ Nx row accesses, (M=C2=A0+=
 Nx) heap operations.</div><div>If everything is on the same page the skip =
scan would win.</div><div><br></div><div>The cost estimation is probably fa=
r off.</div><div>I am also not considering the filters=C2=A0applied after t=
his=C2=A0operator,=C2=A0and I don&#39;t know if the planner infrastructure =
is able to adjust it by itself.</div><div>This is where I would like review=
er&#39;s feedback. I think that the planner costs are something to be deter=
mined experimentally.</div><div><br></div><div>Next I will make it slightly=
 more general handling<br>* More index columns: Index (a, b, s...) could su=
pport WHERE a IN (...) ORDER BY b LIMIT N (ignoring s...)<br>* Multi-column=
 prefix: WHERE (a, b) IN (...) ORDER BY c<br>* Non-leading prefix: WHERE b =
IN (...) AND a =3D const ORDER BY c on index (a, b, c)</div><div><br></div>=
<div>---</div><div>Kind Regards,</div><div>Alexandre</div></div><br><div cl=
ass=3D"gmail_quote"><div dir=3D"ltr" class=3D"gmail_attr">On Wed, Feb 4, 20=
26 at 7:13=E2=80=AFAM Micha=C5=82 K=C5=82eczek &lt;<a href=3D"mailto:michal=
@kleczek.org" target=3D"_blank">michal@kleczek.org</a>&gt; wrote:<br></div>=
<blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;border-=
left:1px solid rgb(204,204,204);padding-left:1ex"><div><br id=3D"m_-3860514=
245975109604m_1178963958729505245m_7370567820092274636m_4579418652561517672=
m_3586735336800096661lineBreakAtBeginningOfMessage"><div><br><blockquote ty=
pe=3D"cite"><div>On 3 Feb 2026, at 22:42, Ants Aasma &lt;<a href=3D"mailto:=
ants.aasma@cybertec.at" target=3D"_blank">ants.aasma@cybertec.at</a>&gt; wr=
ote:</div><br><div><div>On Mon, 2 Feb 2026 at 01:54, Tomas Vondra &lt;<a hr=
ef=3D"mailto:tomas@vondra.me" target=3D"_blank">tomas@vondra.me</a>&gt; wro=
te:<br><blockquote type=3D"cite">I&#39;m also wondering how common is the t=
argeted query pattern? How common<br>it is to have an IN condition on the l=
eading column in an index, and<br>ORDER BY on the second one?<br></blockquo=
te><br>I have seen this pattern multiple times. My nickname for it is the<b=
r>timeline view. Think of the social media timeline, showing posts from<br>=
all followed accounts in timestamp order, returned in reasonably sized<br>b=
atches. The naive SQL query will have to scan all posts from all<br>followe=
d accounts and pass them through a top-N sort. When the total<br>number of =
posts is much larger than the batch size this is much slower<br>than what i=
s proposed here (assuming I understand it correctly) -<br>effectively equiv=
alent to running N index scans through Merge Append.</div></div></blockquot=
e><blockquote type=3D"cite"><div><div><br>My workarounds I have proposed us=
ers have been either to rewrite the<br>query as a UNION ALL of a set of sin=
gle value prefix queries wrapped<br>in an order by limit. This gives the ex=
act needed merge append plan<br>shape. But repeating the query N times can =
get unwieldy when the<br>number of values grows, so the fallback is:<br><br=
>SELECT * FROM unnest(:friends) id, LATERAL (<br> =C2=A0=C2=A0=C2=A0SELECT =
* FROM posts<br> =C2=A0=C2=A0=C2=A0WHERE user_id =3D id<br> =C2=A0=C2=A0=C2=
=A0ORDER BY tstamp DESC LIMIT 100)<br>ORDER BY tstamp DESC LIMIT 100;<br><b=
r>The downside of this formulation is that we still have to fetch a<br>batc=
h worth of items from scans where we otherwise would have only had<br>to lo=
ok at one index tuple.<br></div></div></blockquote><div><br></div><div>GIST=
 can be used to handle this kind of queries as it supports multiple sort or=
ders.</div><div>The only problem is that GIST does not support ORDER BY col=
umn.</div><div>One possible workaround is [1] but as described there it doe=
s not play well with partitioning.</div><div>I=E2=80=99ve started drafting =
support for ORDER BY column in GIST - see [2].</div><div>I think it would b=
e easier to implement and maintain than a new IAM (but I don=E2=80=99t have=
 enough knowledge and experience to implement it myself)</div><div><br></di=
v><div>[1]=C2=A0<a href=3D"https://www.postgresql.org/message-id/3FA1E0A9-8=
393-41F6-88BD-62EEEA1EC21F%40kleczek.org" target=3D"_blank">https://www.pos=
tgresql.org/message-id/3FA1E0A9-8393-41F6-88BD-62EEEA1EC21F%40kleczek.org</=
a></div><div>[2]=C2=A0<a href=3D"https://www.postgresql.org/message-id/B2AC=
13F9-6655-4E27-BFD3-068844E5DC91%40kleczek.org" target=3D"_blank">https://w=
ww.postgresql.org/message-id/B2AC13F9-6655-4E27-BFD3-068844E5DC91%40kleczek=
.org</a></div><br></div>=E2=80=94<div>Kind regards,<br><div>Michal</div></d=
iv></div></blockquote></div>
</blockquote></div>
</div>
</div>
</blockquote></div>

--0000000000006fb55f064b8503b1--