Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w3a9o-001MiS-16 for pgsql-hackers@arkaria.postgresql.org; Fri, 20 Mar 2026 13:44:56 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w3a9m-006Mle-2n for pgsql-hackers@arkaria.postgresql.org; Fri, 20 Mar 2026 13:44:55 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w3a9m-006MlW-1h for pgsql-hackers@lists.postgresql.org; Fri, 20 Mar 2026 13:44:55 +0000 Received: from mail-ej1-x62a.google.com ([2a00:1450:4864:20::62a]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w3a9k-00000000BVg-1WJH for pgsql-hackers@postgresql.org; Fri, 20 Mar 2026 13:44:54 +0000 Received: by mail-ej1-x62a.google.com with SMTP id a640c23a62f3a-b9825ba7e8dso215838666b.3 for ; Fri, 20 Mar 2026 06:44:52 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1774014292; cv=none; d=google.com; s=arc-20240605; b=WWcr5sSn/iVakBobyDJxj5JeLbE9jUgchXnxq0l/y3CsO/nmKoWXeszQPhjcIqjSof OzttvAxn6KEVI0uiqzYSx1kCmuJDIBlTi/7lwtrP68U6TT1yB4WOyaf3s8bzHnWaZzXO jKKGlw0W5Ur/w1O2glx+XdzXCm7WQcN6HuxCL9TkmkpCVWXSGqXlpM78np1Jlm5V4F4L VY4Bg/TiYuHHmbI3r3HXf1+0cPenKhspDnQkXrY98LL+HpUh6mUvUhmdz83xN7I7CP/n vExudnRbSi9PL7BIUclG8qhQg6FfsYM0V0xcLj74XzA5a0IEbHIAs50ModJVf2soLktQ CP5g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=SMOVwoDpLnxaJHYttATWYCSSyR7SuUPfh+SIG5t40P4=; fh=P0biSqSAttAV5mxR8I1Z1UY3ciMpUrbhlxVB6IOvx/w=; b=SZ3n1a5HpbC0LbFXnxgW8002fdshzAmqzRgKMnWGo5qkMraYNiPM+Hu6tpfbhQwN4H zxMBYrDI33sIqnds/QHcZQ9oVrbCM9DpX4DmwHLWmzBD5ae7avwnvnFTzWCIXV4r/yWw MZanzWMwjUWuBbW0j4URu/IcdgQvQkOjUqd0NibufhOhW7dSm+rOy7LIudhJ0jJdOWYL QqDSgGBd29eHg/+pz+zlKLsNgu0kAK9e/3NwEDDMlvHqXcn/+SQOZTrkgvbvhjIncR9r DVHWCSwJThgFmZTpL4ITg6SZf1ekJhIbs13TckXQUrhI5t3A1dYFNBWVmzfzyes9eLIW k8nA==; darn=postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1774014292; x=1774619092; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=SMOVwoDpLnxaJHYttATWYCSSyR7SuUPfh+SIG5t40P4=; b=Y0QXOMMI2YhQ7zmAfW9BQLXbO2vjNxulo5bH/R4JsdEgsSUjIU8ESBJ5O135tGtvhG +EOIOhuEa4xLYUyw0HD7d2rJw+exdJn37FgZaF3+Rdxg/QV1AqLk8/x4mRlv+GfQMMUO pQ0JOmSYDF2SLu5Crxg3fCKoZuRNxreyInKKynmk5r83dumP8xHVQD6mGDfVlZrCxI1G wrQvT9z4NAnNLojIRoJST0l8+IE3H6lLD2/Vk58UA4WdjIcbsHs1uH1bHOk74PmeA88l EmawYUvPx//6Ks5Fk1z0tToyPm+TU+ciCK3WwM6aDB5KIRKswRTfDqzl4kQelfk1JAlU ZMrw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1774014292; x=1774619092; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=SMOVwoDpLnxaJHYttATWYCSSyR7SuUPfh+SIG5t40P4=; b=XJ+VkC0S4EMn4aH/NK7uq4r/1BfMwDTOsT3EhBxzTFp2w4SdLgcSAWelp5JlYYbcsU 1vkEAtNBTayaegTQ/PtJtGcXIuNbQwfCZmbgyuCnrjdoNuiuaMQHFb6yInX5y+wVcqGx XDyvPC3JF1YIXrLeO/w0kgTrxzNtIPeX/jJ9zBt0tQnRSJVoENNoN3snixYeV16ECMMc r0hHyVjrHmKftV+5YvS6cG5Hn0o5TwBi4fXq1fOt6K1KCJOS9+GIzwJHa1bg4PBBZe/W dG5wJQKRTLIjt+bWZwx3qGSUoRVaJqrMec/hGHKelSHtfzuZv+IOmCbpneBKV7jN8dNm OyyQ== X-Forwarded-Encrypted: i=1; AJvYcCWWO2JqquU0MDzEJENL1eh3+VHbtcZ2a8Oq5ZzcJdS/77+RM7kuMI23Z7WsVltRgqswKCFVBGV0DA6DOoTK@postgresql.org X-Gm-Message-State: AOJu0YyC2/ItBqDypTea8lHNewwrQuP2+/oH8AwJcaFJa6G+yWlT7LKj l7QW4KviXiCVZwaK4fAKwxDD7GHPvFE13Go3ofmIrBHVSIZrHfwiCpn29ed5IT7V7AEqK/wZK3K C7wIObebTqHUwPk4GH15R8YaoUJPqeNM= X-Gm-Gg: ATEYQzxw2LEu1KUzpfVkyChngXa8aFgCbnW93XoAoypGUjn9zxty0A5toIL28pSQmCW Fz+osFXQttYzQC7favw029TD711lUO8KXiQJ+D+gZaR++OhvBHzzhzJqecPW/HKOvBW4lrEhoui n9vkebRnANBHlcE8zIxOXy4yHVn+maFa6abLC8tWsBEQuNM3fcnPxcyPOj7rydMRBjnJUEK2Gr+ aWMcBdgZretNHq411/Y5h4jsl8655ZuQsO8TCo8EQf52ZcsnJgGIMEddRwUhzrSGslRj9UMsHY6 J1K95GB+GaMuXPEcwRGH+YMI1jTZy/ngw9P1xU/VWwuMNMLPbxtVGJKxTADm X-Received: by 2002:a17:906:4693:b0:b97:e32f:7ee8 with SMTP id a640c23a62f3a-b982f3a4176mr161928666b.37.1774014291574; Fri, 20 Mar 2026 06:44:51 -0700 (PDT) MIME-Version: 1.0 References: <4be96b19-1e19-4000-b65a-eada001e5a9a@vondra.me> In-Reply-To: <4be96b19-1e19-4000-b65a-eada001e5a9a@vondra.me> From: Alexandre Felipe Date: Fri, 20 Mar 2026 13:44:39 +0000 X-Gm-Features: AaiRm52g5pBg7Olr8H07GCLxOEA9QONb85Oj_I97IJvK7aTOCNR4bbs0hlOKjIw Message-ID: Subject: Re: New access method for b-tree. To: Tomas Vondra Cc: Ants Aasma , Alexandre Felipe , "pgsql-hackers@postgresql.org" , boekewurm+postgres@gmail.com Content-Type: multipart/alternative; boundary="000000000000ad013d064d74e300" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000ad013d064d74e300 Content-Type: text/plain; charset="UTF-8" Happy St. Patrick's day! (this was sitting on my drafts) Based on what I said said in previous emails I see alternative proposals #1 Make it simpler by not changing the index access methods. #2 Make it optimal by not using generic index searches and not keeping multiple open index scans. and #3 Follow the pragmatic approach Objective is, minimize the number of heap fetches. As high level as possible, reusing existing functions instead of writing custom code when possible. Ants Aasma & Tomas Vondra > > My workarounds I have proposed users have been either to rewrite the > > query as a UNION ALL of a set of single value prefix queries wrapped > > in an order by limit. This gives the exact needed merge append plan > > shape. But repeating the query N times can get unwieldy when the > > number of values grows, so the fallback is: > > > > SELECT * FROM unnest(:friends) id, LATERAL ( > > SELECT * FROM posts > > WHERE user_id = id > > ORDER BY tstamp DESC LIMIT 100) > > ORDER BY tstamp DESC LIMIT 100; > > > > The downside of this formulation is that we still have to fetch a > > batch worth of items from scans where we otherwise would have only had > > to look at one index tuple. > > > True. It's useful to think about the query this way, and it may be > better than full select + sort, but it has issues too. > An issue with this query is generality, if this is joined with other queries we can't determine in advance the limit. > The main problem I can see is that at planning time the cardinality of > > the prefix array might not be known, and in theory could be in the > > millions. Having millions of index scans open at the same time is not > > viable, so the method needs to somehow degrade gracefully. The idea I > > had is to pick some limit, based on work_mem and/or benchmarking, and > > one the limit is hit, populate the first batch and then run the next > > batch of index scans, merging with the first result. Or something like > > that, I can imagine a few different ways to handle it with different > > tradeoffs. > > > > Doesn't the proposed merge scan have a similar issue? Because that will > also have to keep all the index scans open (even if only internally). > Indeed, it needs to degrade gracefully, in some way. It is true, but I think we can trust the planner. This problem scales similarly in a memoize node. Is ~24kB for each open index scan a good guess? ALTERNATIVE #1 - More efficient Or to avoid having N open index scans we could (??) (1) find the index page for the head of each prefix. (2) for each prefix (2.a) load tuples from each head page, if we reach (2.b) if we consume the last tuple in a page save a pointer to the next page. (2.c) check if tuples for the next prefix are in the same page (2.d) Release the page. (3) producing tuples in the suffix order (3.b) when tuples for prefix are exhausted load load page from (2.b) Matthias van de Meent, Feb 3 > btree index skip scan infrastructure efficiently prevents new index > descents into the index when the selected SAOP key ranges are directly > adjecent, while merge scan would generally do at least one index > descent for each of its N scan heads (*) - which in the proposed > prototype patch guarantees O(index depth * num scan heads) buffer > accesses. This could also be addressed if we do this custom descent, I didn't bother about that depth factor because with a few random prefixes doing so we are probably going to save accesses only for the top level. I would prefer to start with a very conceptual implementation that can already provide 1000x speedup, but if you think this way is better, I am open to try it. I think this can be done without affecting the planner logic and the PrefixJoin node. I'm afraid the > proposed batches execution will be rather complex, so I'd say v1 should > simply have a threshold, and do the full scan + sort for more items. Do you mean by an executor node that performs the query as if it was written ALTERNATIVE #2 - Simpler(??) for each _prefix of prefixes: result += (SELECT FROM table WHERE prefix = _prefix AND qual(*) ORDER BY suffix LIMIT N) return SELECT * FROM result ORDER BY suffix LIMIT N This query may have to produce N * len(prefixes) rows, while the original proposal would produce only N + len(prefixes) - 1. Alexandre Felipe, Feb 6 > | Method | Shared Hit | Shared Read | Exec Time | > |------------|-----------:|------------:|----------:| > | Merge | 13 | 119 | 13 ms | > | IndexScan | 15,308 | 525,310 | 3,409 ms | This Prefix Batch Scan approach hit=62 read=773, Execution Time: 80.815 ms > I can imagine that this would really nicely benefit from > ReadStream'ification. > > > > Not sure, maybe. > Actually as I was watching the index prefetch development I was quite uncertain about how this would play with that, but we can probably simply give a budget for each stream. > One other connection I see is with block nested loops. In a perfect > > future PostgreSQL could run the following as a set of merged index > > scans that terminate early: > > > > SELECT posts.* > > FROM follows f > > JOIN posts p ON f.followed_id = p.user_id > > WHERE f.follower_id = :userid > > ORDER BY p.tstamp DESC LIMIT 100; > > > > In practice this is not a huge issue - it's not that hard to transform > > this to array_agg and = ANY subqueries. > > Automating that transformation seems quite non-trivial (to me). > Well, not trivial. To give a rough idea. wc -l *.patch 113 v2-0001-Test-the-baseline.patch 614 v2-0002-Access-method.patch 850 v2-0003-Planner-integration.patch 1958 v2-0004-Multi-column.patch 2439 v2-0005-Joins.patch it is missing some important details like prefix deduplication but for the scenario where the values on the other table are known to be unique it is good. The multi column accepts things like A in (...) B in (...) and computes the cartesian product or (A, B) IN (...) Regards, Alexandre --000000000000ad013d064d74e300 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Happy St.= Patrick's day!

(this was sitting on my drafts= )

Based on what I said said in previous emails I s= ee alternative
proposals

#1 Make it simp= ler by not changing the index access methods.

#2= =C2=A0Make it optimal by not using generic index searches
and not= keeping multiple open index scans.

and=C2=A0=
#3 Follow the pragmatic approach
Obje= ctive is, minimize the number of heap fetches.
As high level as p= ossible, reusing existing functions
instead of writing custom cod= e when possible.



Ant= s Aasma & Tomas Vondra
> My workarounds I have proposed users= have been either to rewrite the
> query as a UNION ALL of a set of single value prefix queries wrapped > in an order by limit. This gives the exact needed merge append plan > shape. But repeating the query N times can get unwieldy when the
> number of values grows, so the fallback is:
>
> SELECT * FROM unnest(:friends) id, LATERAL (
>=C2=A0 =C2=A0 =C2=A0SELECT * FROM posts
>=C2=A0 =C2=A0 =C2=A0WHERE user_id =3D id
>=C2=A0 =C2=A0 =C2=A0ORDER BY tstamp DESC LIMIT 100)
> ORDER BY tstamp DESC LIMIT 100;
>
> The downside of this formulation is that we still have to fetch a
> batch worth of items from scans where we otherwise would have only had=
> to look at one index tuple.
>=C2=A0
True. It's useful to think about the query this way, and it may be
better than full select + sort, but it has issues too.

An issue with this query is generality, if this is joined w= ith other
queries we can't determine in advance the limit.
=C2=A0

> The main problem I can see is that at planning time the cardinality of=
> the prefix array might not be known, and in theory could be in the
> millions. Having millions of index scans open at the same time is not<= br> > viable, so the method needs to somehow degrade gracefully. The idea I<= br> > had is to pick some limit, based on work_mem and/or benchmarking, and<= br> > one the limit is hit, populate the first batch and then run the next > batch of index scans, merging with the first result. Or something like=
> that, I can imagine a few different ways to handle it with different > tradeoffs.
>

Doesn't the proposed merge scan have a similar issue? Because that will=
also have to keep all the index scans open (even if only internally).
Indeed, it needs to degrade gracefully, in some way.
=C2= =A0
It is true, but I think we can trust the planner.
T= his problem scales=C2=A0similarly in a=C2=A0memoize node.
Is ~24k= B for each open index scan a good guess?

ALTERNATI= VE #1 - More efficient

Or to avoid having N open i= ndex scans we could=C2=A0 (??)
(1) find the index page for the he= ad of each prefix.
(2) for each prefix
(2.a) load tuples from = each head page, if we reach
(2.b) if we consume the last tuple in= a page save a pointer
to the next page.
(2.c) check if= tuples for the next prefix are in the same page
(2.d) Release th= e page.
(3) producing tuples in the suffix order
=C2=A0= (3.b) when tuples for prefix are exhausted load load
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 page from (2.b)


Matthias van de Meent, Feb 3
> btree=C2=A0index=C2=A0skip scan infrastructure efficiently prevents new=C2=A0inde= x
> descents into the=C2=A0index=C2=A0when= the selected SAOP key ranges are directly
> adjecent, while merge sc= an would generally do at least one=C2=A0index
> descent = for each of its N scan heads (*) - which in the proposed
> prototype = patch guarantees O(index=C2=A0depth * num scan heads) buffer> accesses.

This could also be addressed if w= e do this custom descent,
I didn't bother about that depth fa= ctor because with a few random prefixes
doing so we are probably = going to save accesses only for the top level.

I would prefer to start with a very conceptual implementat= ion
that can already provide 1000x speedup, but if you think this=
way is better, I am open to try it. I think this can be done
without affecting the planner logic and the PrefixJoin node.


I'm afraid the
proposed batches execution will be rather complex, so I'd say v1 should=
simply have a threshold, and do the full scan + sort for more items.

Do you mean by an executor node that performs the query as i= f it was written

ALTERNATIVE #2 - Simpler(??)
for each _prefix of prefixes:
=C2=A0 result=C2=A0+=3D (SELECT FROM table=C2= =A0
=C2=A0 =C2=A0 =C2=A0 =C2=A0 W= HERE prefix =3D _prefix AND qual(*)=C2=A0
=C2=A0 =C2=A0 =C2=A0 =C2=A0 ORDER BY suffix=C2=A0
=C2=A0 =C2=A0 =C2=A0 =C2=A0 LIMIT N)
=
return SELECT * FROM result=C2=A0
=C2=A0 =C2=A0 ORDER BY suffix=C2=A0
=C2=A0 =C2=A0 LIMIT N
This query may have to produce N * len(prefixes) rows, while t= he=C2=A0
original proposal would produce only N=C2=A0+ len(prefix= es) - 1.

Alexandre Felipe, Feb 6
> | Method =C2=A0 =C2=A0 | Shared Hit | S= hared Read | Exec Time |
> |------------|-----------:|------------:|----------:|
> | Mer= ge =C2=A0 =C2=A0 =C2=A0| =C2=A0 =C2=A0 =C2=A0 =C2=A0 13 | =C2=A0 =C2=A0 =C2= =A0 =C2=A0 119 | =C2=A0 =C2=A0 13 ms |
> | IndexScan =C2=A0| =C2=A0= =C2=A0 15,308 | =C2=A0 =C2=A0 525,310 | =C2=A03,409 ms |
=
This Prefix Batch Scan approach
=C2= =A0hit=3D62 read=3D773,=C2=A0E= xecution Time: 80.815 ms

> I can imagine that this would really nicely benefit from ReadStream= 9;ification.
>

Not sure, maybe.

Actually as I was watc= hing the index prefetch development I was
quite uncertain about h= ow this would play with that, but we can
probably simply give a b= udget for each stream.


> One other connection I see is with block nested loops. In a perfect > future PostgreSQL could run the following as a set of merged index
> scans that terminate early:
>
> SELECT posts.*
> FROM follows f
>=C2=A0 =C2=A0 =C2=A0JOIN posts p ON f.followed_id =3D p.user_id
> WHERE f.follower_id =3D :userid
> ORDER BY p.tstamp DESC LIMIT 100;
>
> In practice this is not a huge issue - it's not that hard to trans= form
> this to array_agg and =3D ANY subqueries.
>=C2=A0
Automating that transformation seems quite non-trivial (to me).

Well, not trivial. To give a rough idea.

wc -l *.patch
=C2=A0 =C2=A0 =C2=A0113 v2-0001-Test-the-baseline.patch
=C2= =A0 =C2=A0 =C2=A0614 v2-0002-Access-method.patch
=C2=A0 =C2=A0 =C2=A0850= v2-0003-Planner-integration.patch
=C2=A0 =C2=A0 1958 v2-0004-Multi-colu= mn.patch
=C2=A0 =C2=A0 2439 v2-0005-Joins.patch

it is missing some important details like prefix deduplication
but for the scenario where the values on the other table
ar= e known to be unique it is good.

The multi column = accepts things like A in (...) B in (...)
and computes the cartes= ian product or (A, B) IN (...)


Regards,
Alexandre
--000000000000ad013d064d74e300--