MIME-Version: 1.0
References: 
 <AS1PR02MB784695AFEC37179FFAF7EAE19A9DA@AS1PR02MB7846.eurprd02.prod.outlook.com>
 <a652515e-278a-4838-815b-ecd2ad4495f6@vondra.me>
 <CANwKhkNd85u+4joaKR3YHoDOQSMg5SmJmsYJGo-tMyW=XVXTew@mail.gmail.com>
 <4be96b19-1e19-4000-b65a-eada001e5a9a@vondra.me>
In-Reply-To: <4be96b19-1e19-4000-b65a-eada001e5a9a@vondra.me>
From: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Date: Fri, 20 Mar 2026 13:44:39 +0000
Message-ID: 
 <CAE8JnxNXO+EoBvbj9szj3QgS=z+_NbVgCUs4UaGkvXswzT+OYQ@mail.gmail.com>
Subject: Re: New access method for b-tree.
To: Tomas Vondra <tomas@vondra.me>
Cc: Ants Aasma <ants.aasma@cybertec.at>,
 Alexandre Felipe <alexandre.felipe@tpro.io>,
	"pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>,
 boekewurm+postgres@gmail.com
Content-Type: multipart/alternative; boundary="000000000000ad013d064d74e300"
Archived-At: 
 <https://www.postgresql.org/message-id/CAE8JnxNXO%2BEoBvbj9szj3QgS%3Dz%2B_NbVgCUs4UaGkvXswzT%2BOYQ%40mail.gmail.com>
Precedence: bulk

--000000000000ad013d064d74e300
Content-Type: text/plain; charset="UTF-8"

Happy St. Patrick's day!

(This was sitting in my drafts.)

Based on what I said in previous emails, I see three alternative
proposals:

#1 Make it more efficient by not using generic index searches
and not keeping multiple open index scans.

#2 Make it simpler by not changing the index access methods.

and
#3 Follow the pragmatic approach.
The objective is to minimize the number of heap fetches while
staying as high level as possible, reusing existing functions
instead of writing custom code.


Ants Aasma & Tomas Vondra

> > My workarounds I have proposed users have been either to rewrite the
> > query as a UNION ALL of a set of single value prefix queries wrapped
> > in an order by limit. This gives the exact needed merge append plan
> > shape. But repeating the query N times can get unwieldy when the
> > number of values grows, so the fallback is:
> >
> > SELECT * FROM unnest(:friends) id, LATERAL (
> >     SELECT * FROM posts
> >     WHERE user_id = id
> >     ORDER BY tstamp DESC LIMIT 100)
> > ORDER BY tstamp DESC LIMIT 100;
> >
> > The downside of this formulation is that we still have to fetch a
> > batch worth of items from scans where we otherwise would have only had
> > to look at one index tuple.
> >


> True. It's useful to think about the query this way, and it may be
> better than full select + sort, but it has issues too.
>

One issue with this query is generality: if it is joined with other
queries, we can't determine the limit in advance.


> > The main problem I can see is that at planning time the cardinality of
> > the prefix array might not be known, and in theory could be in the
> > millions. Having millions of index scans open at the same time is not
> > viable, so the method needs to somehow degrade gracefully. The idea I
> > had is to pick some limit, based on work_mem and/or benchmarking, and
> > one the limit is hit, populate the first batch and then run the next
> > batch of index scans, merging with the first result. Or something like
> > that, I can imagine a few different ways to handle it with different
> > tradeoffs.
> >
>
> Doesn't the proposed merge scan have a similar issue? Because that will
> also have to keep all the index scans open (even if only internally).
> Indeed, it needs to degrade gracefully, in some way.


That is true, but I think we can trust the planner.
This problem scales similarly in a Memoize node.
Is ~24kB for each open index scan a good guess?
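
A back-of-the-envelope check of what that guess would imply. The 24 kB
per scan is only the guess above, and using a work_mem-sized budget is an
assumption for illustration, not something the patch does; the function
names are hypothetical.

```python
# Rough estimate of how many open index scans fit in a memory budget.
# ASSUMPTIONS: 24 kB per open scan is only the guess from the text;
# budgeting against work_mem is hypothetical, not what the patch does.

KB = 1024
MB = 1024 * KB


def max_open_scans(budget_bytes, bytes_per_scan=24 * KB):
    """Number of scans that fit entirely inside the budget."""
    return budget_bytes // bytes_per_scan


def memory_for_scans(n_scans, bytes_per_scan=24 * KB):
    """Total bytes needed to keep n_scans open at the same time."""
    return n_scans * bytes_per_scan


if __name__ == "__main__":
    for budget_mb in (4, 64, 1024):
        print(f"{budget_mb} MB -> {max_open_scans(budget_mb * MB)} scans")
    gib = memory_for_scans(1_000_000) / (1024 * MB)
    print(f"1,000,000 scans -> {gib:.1f} GiB")
```

Under that guess, the default work_mem of 4MB would cover about 170 open
scans, while a million prefixes would need roughly 23 GiB, so some
threshold or batching would be needed either way.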

ALTERNATIVE #1 - More efficient

Or, to avoid having N open index scans, we could (??)
(1) find the index page for the head of each prefix.
(2) for each prefix
  (2.a) load tuples from each head page.
  (2.b) if we consume the last tuple in a page, save a pointer
        to the next page.
  (2.c) check if tuples for the next prefix are in the same page.
  (2.d) release the page.
(3) produce tuples in the suffix order
  (3.a) when the tuples for a prefix are exhausted, load the
        page from (2.b)
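
A toy model of the steps above, only to make the idea concrete. Plain
Python lists stand in for btree leaf pages, and PAGE_SIZE, build_pages,
find_head and merge_by_suffix are hypothetical names, not PostgreSQL
APIs. Pins, locks and step (2.c) are not modeled, and the order is
ascending for simplicity.

```python
# Toy model: keep one (page_no, offset) cursor per prefix instead of one
# open index scan per prefix. ASSUMPTION: lists stand in for leaf pages.
import bisect
import heapq

PAGE_SIZE = 4  # hypothetical number of index tuples per leaf page


def build_pages(tuples, page_size=PAGE_SIZE):
    """Split sorted (prefix, suffix) tuples into fixed-size leaf pages."""
    ordered = sorted(tuples)
    return [ordered[i:i + page_size] for i in range(0, len(ordered), page_size)]


def find_head(pages, prefix):
    """Step (1): return (page_no, offset) of the first tuple of a prefix."""
    if not pages:
        return None
    first_keys = [page[0] for page in pages]
    # (prefix,) sorts before every (prefix, suffix), so this finds the first
    # page starting at or after the prefix; the head may be one page earlier.
    page_no = max(bisect.bisect_left(first_keys, (prefix,)) - 1, 0)
    offset = bisect.bisect_left(pages[page_no], (prefix,))
    if offset == len(pages[page_no]):
        page_no, offset = page_no + 1, 0
    if page_no < len(pages) and pages[page_no][offset][0] == prefix:
        return page_no, offset
    return None


def merge_by_suffix(pages, prefixes, limit):
    """Steps (2) and (3): emit up to `limit` tuples in suffix order."""
    heap = []
    for prefix in sorted(set(prefixes)):
        cursor = find_head(pages, prefix)
        if cursor is not None:
            page_no, offset = cursor
            heapq.heappush(heap, (pages[page_no][offset][1], prefix, page_no, offset))
    out = []
    while heap and len(out) < limit:
        suffix, prefix, page_no, offset = heapq.heappop(heap)
        out.append((prefix, suffix))
        offset += 1
        if offset == len(pages[page_no]):
            # Last tuple of the page consumed: follow the next-page pointer.
            page_no, offset = page_no + 1, 0
        if page_no < len(pages) and pages[page_no][offset][0] == prefix:
            heapq.heappush(heap, (pages[page_no][offset][1], prefix, page_no, offset))
    return out
```

Only one small cursor per prefix stays in the heap between calls, which
is the point of the alternative: the per-prefix state is a page pointer
rather than a full scan descriptor.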


Matthias van de Meent, Feb 3
> btree index skip scan infrastructure efficiently prevents new index
> descents into the index when the selected SAOP key ranges are directly
> adjecent, while merge scan would generally do at least one index
> descent for each of its N scan heads (*) - which in the proposed
> prototype patch guarantees O(index depth * num scan heads) buffer
> accesses.

This could also be addressed if we do this custom descent.
I didn't worry about the depth factor because, with a few random prefixes,
sharing the descent would probably save accesses only at the top level.


I would prefer to start with a very conceptual implementation
that can already provide a 1000x speedup, but if you think this
way is better, I am open to trying it. I think this can be done
without affecting the planner logic and the PrefixJoin node.


> I'm afraid the
> proposed batches execution will be rather complex, so I'd say v1 should
> simply have a threshold, and do the full scan + sort for more items.


Do you mean an executor node that performs the query as if it were written as:

ALTERNATIVE #2 - Simpler(??)
for each _prefix of prefixes:
  result += (SELECT * FROM table
        WHERE prefix = _prefix AND qual(*)
        ORDER BY suffix
        LIMIT N)
return SELECT * FROM result
    ORDER BY suffix
    LIMIT N

This query may have to produce N * len(prefixes) rows, while the
original proposal would produce only N + len(prefixes) - 1.
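
A small simulation of those two row counts, as a sketch. Sorted Python
lists stand in for per-prefix index scans, and batch_scan and merge_scan
are hypothetical names for illustration, not the executor code.

```python
# Count how many rows each strategy pulls from the per-prefix streams.
# ASSUMPTION: streams maps prefix -> ascending list of suffix values.
import heapq


def batch_scan(streams, n):
    """ALTERNATIVE #2: LIMIT n per prefix, then a final sort + LIMIT n."""
    fetched = []
    for prefix, suffixes in streams.items():
        fetched.extend((s, prefix) for s in suffixes[:n])
    return sorted(fetched)[:n], len(fetched)


def merge_scan(streams, n):
    """Merge: read one head row per prefix, then refill only the winner."""
    iters = {p: iter(s) for p, s in streams.items()}
    heap, fetched = [], 0
    for p, it in iters.items():
        s = next(it, None)
        if s is not None:
            heap.append((s, p))
            fetched += 1
    heapq.heapify(heap)
    out = []
    while heap and len(out) < n:
        s, p = heapq.heappop(heap)
        out.append((s, p))
        if len(out) < n:  # nothing more to fetch once the limit is reached
            nxt = next(iters[p], None)
            if nxt is not None:
                fetched += 1
                heapq.heappush(heap, (nxt, p))
    return out, fetched


if __name__ == "__main__":
    streams = {p: [p + 50 * k for k in range(200)] for p in range(50)}
    print(batch_scan(streams, 100)[1], merge_scan(streams, 100)[1])
```

With 50 prefixes of 200 rows each and N = 100, the batch form pulls
100 * 50 = 5000 rows while the merge pulls 100 + 50 - 1 = 149, and both
return the same 100 rows.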

Alexandre Felipe, Feb 6
> | Method     | Shared Hit | Shared Read | Exec Time |
> |------------|-----------:|------------:|----------:|
> | Merge      |         13 |         119 |     13 ms |
> | IndexScan  |     15,308 |     525,310 |  3,409 ms |

This Prefix Batch Scan approach gives:
  hit=62 read=773, Execution Time: 80.815 ms

> > I can imagine that this would really nicely benefit from
> > ReadStream'ification.
> >
>
> Not sure, maybe.
>

Actually, as I was watching the index prefetch development, I was
quite uncertain about how this would play with it, but we can
probably simply give each stream a budget.


> > One other connection I see is with block nested loops. In a perfect
> > future PostgreSQL could run the following as a set of merged index
> > scans that terminate early:
> >
> > SELECT posts.*
> > FROM follows f
> >     JOIN posts p ON f.followed_id = p.user_id
> > WHERE f.follower_id = :userid
> > ORDER BY p.tstamp DESC LIMIT 100;
> >
> > In practice this is not a huge issue - it's not that hard to transform
> > this to array_agg and = ANY subqueries.
> >

> Automating that transformation seems quite non-trivial (to me).
>

Well, not trivial. To give a rough idea:

wc -l *.patch
     113 v2-0001-Test-the-baseline.patch
     614 v2-0002-Access-method.patch
     850 v2-0003-Planner-integration.patch
    1958 v2-0004-Multi-column.patch
    2439 v2-0005-Joins.patch

It is missing some important details, like prefix deduplication,
but for the scenario where the values in the other table
are known to be unique it is good.

The multi-column patch accepts conditions like A IN (...) AND B IN (...),
computing the Cartesian product, or (A, B) IN (...).
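
As an illustration of the Cartesian product case only: the sketch below
expands per-column value lists into prefix tuples. expand_prefixes is a
hypothetical helper, and the sorting and deduplication are assumptions
of this sketch, not a description of what the patch does.

```python
# Expand "A IN (...) AND B IN (...)" into the list of (a, b) prefixes.
# ASSUMPTION: sorting and deduplicating here is a choice of this sketch.
from itertools import product


def expand_prefixes(*column_values):
    """Return every combination of the per-column values, sorted, no repeats."""
    return sorted(set(product(*column_values)))


if __name__ == "__main__":
    # A IN (1, 2) AND B IN ('x', 'y') yields four (A, B) prefixes.
    print(expand_prefixes([1, 2], ["x", "y"]))
```

The row-value form (A, B) IN (...) needs no expansion, since it already
lists the prefix tuples directly.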


Regards,
Alexandre

--000000000000ad013d064d74e300--