MIME-Version: 1.0
References: <11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru>
 <CAEze2Wg2a8LQDRocVPa7Df2qXQLrTz-xSu1k3xP3x_6ABVo1Jw@mail.gmail.com>
In-Reply-To: 
 <CAEze2Wg2a8LQDRocVPa7Df2qXQLrTz-xSu1k3xP3x_6ABVo1Jw@mail.gmail.com>
From: Kirk Wolak <wolakk@gmail.com>
Date: Tue, 26 Aug 2025 10:11:43 -0400
Message-ID: 
 <CACLU5mRude0L5psEj5WS0DVDv=AHN0McfZBKV5eBoW0JqwwZDA@mail.gmail.com>
Subject: Re: [WiP] B-tree page merge during vacuum to reduce index bloat
To: Matthias van de Meent <boekewurm+postgres@gmail.com>
Cc: Andrey Borodin <x4mmm@yandex-team.ru>,
 pgsql-hackers <pgsql-hackers@postgresql.org>,
	Nikolay Samokhvalov <nik@postgres.ai>
Content-Type: multipart/alternative; boundary="0000000000000b7350063d4542ea"
Archived-At: 
 <https://www.postgresql.org/message-id/CACLU5mRude0L5psEj5WS0DVDv%3DAHN0McfZBKV5eBoW0JqwwZDA%40mail.gmail.com>
Precedence: bulk

--0000000000000b7350063d4542ea
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Tue, Aug 26, 2025 at 6:33 AM Matthias van de Meent <
boekewurm+postgres@gmail.com> wrote:

> On Tue, 26 Aug 2025 at 11:40, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> >
> > Hi hackers,
> >
> > Together with Kirk and Nik we spent several online hacking sessions to
> > tackle index bloat problems [0,1,2]. As a result we concluded that B-tree
> > index page merge should help to keep an index from pressuring
> > shared_buffers.
> >
> > We are proposing a feature to automatically merge nearly-empty B-tree
> > leaf pages during VACUUM operations to reduce index bloat. This addresses a
> > common issue where deleted tuples leave behind sparsely populated pages
> > that traditional page deletion cannot handle because they're not completely
> > empty.
> >
> ...
> I'm fairly sure there is a correctness issue here; I don't think you
> correctly detect the two following cases:
>
> 1.) a page (P0) is scanned by a scan, finishes processing the results,
> and releases its pin. It prepares to scan the next page of the scan
> (P1).
> 2.) a page (A) is split, with new right sibling page B,
> 3.) and the newly created page B is merged into its right sibling C,
> 4.) the scan continues on to the next page
>
> For backward scans, if page A is the same page as the one identified
> with P1, the scan won't notice that tuples from P1 (aka A) have been
> moved through B to P0 (C), causing the scan to skip processing for
> those tuples.
> For forward scans, if page A is the same page as the one identified
> with P0, the scan won't notice that tuples from P0 (A) have been moved
> through B to P1 (C), causing the scan to process those tuples twice in
> the same scan, potentially duplicating results.
>
> NB: Currently, the only way for "merge" to happen is when the index
> page is completely empty. This guarantees that there is no movement of
> scan-visible tuples to pages we've already visited/are about to visit.
> This invariant is used extensively to limit lock and pin coupling (and
> thus: improve performance) in index scans; see e.g. in 1bd4bc85. This
> patch will invalidate that invariant, and therefore it will require
> (significantly) more work in the scan code (incl. nbtsearch.c) to
> guarantee exactly-once results + no false negatives.
>
> Kind regards,
>
> Matthias van de Meent
> Databricks
>

This was one of our concerns. We will review the commit you mentioned
(1bd4bc85).
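
To check that I follow the forward-scan case, here is a toy model of it. It
is only a sketch: Page, split(), merge_into_right() and forward_scan() are
made-up names, not nbtree functions, and it ignores left links, locking and
WAL. The scan copies the keys of P0 (A), remembers A's right link, and only
then does the split of A into A and B and the merge of B into C happen.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    """Toy leaf page: sorted keys plus a right-sibling link."""
    name: str
    keys: List[int]
    right: Optional["Page"] = None


def split(page: Page, new_name: str) -> Page:
    """Move the upper half of page's keys to a new right sibling."""
    mid = len(page.keys) // 2
    sibling = Page(new_name, page.keys[mid:], page.right)
    page.keys = page.keys[:mid]
    page.right = sibling
    return sibling


def merge_into_right(left: Page, page: Page) -> None:
    """Move every key of page into its right sibling and unlink page."""
    target = page.right
    assert target is not None
    target.keys = page.keys + target.keys
    page.keys = []
    left.right = target


def forward_scan(first: Page, interfere) -> List[int]:
    """Read the first page, drop the 'pin', let interfere() run, then go on."""
    results = list(first.keys)      # tuples copied out of P0
    next_page = first.right         # right link remembered before unpinning
    interfere()                     # concurrent split + merge happens here
    while next_page is not None:
        results.extend(next_page.keys)
        next_page = next_page.right
    return results


def demo() -> List[int]:
    c = Page("C", [10, 11])
    a = Page("A", [1, 2, 3, 4], right=c)

    def split_then_merge() -> None:
        b = split(a, "B")           # A=[1,2], B=[3,4], A -> B -> C
        merge_into_right(a, b)      # C=[3,4,10,11], A -> C

    return forward_scan(a, split_then_merge)


print(demo())  # [1, 2, 3, 4, 3, 4, 10, 11]
```

Keys 3 and 4 come back twice in this model, which matches the duplicate
results you describe for forward scans.
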
I do have a question. One of the ideas we discussed was to add a new page
that combines the two pages, making the deletion "feel" like a page split.

This has the advantage of leaving the original two pages untouched for any
scan that is currently traversing them. As with a page split, we would update
the surrounding links and mark the old pages for the new path.

The downside of this approach is that we add one page in order to mark two
pages as unused.
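
A toy sketch of what I mean, under heavy assumptions: merge_by_copy() and the
"superseded" marker are hypothetical names, and the model ignores left links,
backward scans, concurrent inserts, locking, WAL, and the question of when the
old pages can safely be recycled. It only shows that a scan which already
remembered an old page can keep walking the old chain while new scans take
the new page.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    """Toy leaf page; 'superseded' is a hypothetical marker, not a real flag."""
    name: str
    keys: List[int]
    right: Optional["Page"] = None
    superseded: bool = False


def merge_by_copy(left: Page, a: Page, b: Page, new_name: str) -> Page:
    """Build a new page holding a's and b's keys; leave a and b untouched."""
    merged = Page(new_name, a.keys + b.keys, b.right)
    left.right = merged             # new scans take the new path
    a.superseded = True             # old pages keep their keys and links
    b.superseded = True
    return merged


def scan_from(page: Optional[Page]) -> List[int]:
    """Follow right links from page, collecting keys."""
    out: List[int] = []
    while page is not None:
        out.extend(page.keys)
        page = page.right
    return out


def demo() -> Tuple[List[int], List[int]]:
    c = Page("C", [10, 11])
    b = Page("B", [3], right=c)
    a = Page("A", [1, 2], right=b)
    left = Page("L", [0], right=a)

    # An in-flight scan has read A and remembered B as its next page.
    in_flight = list(a.keys)
    remembered_next = a.right

    merge_by_copy(left, a, b, "N")  # L -> N -> C; A -> B -> C still intact

    in_flight += scan_from(remembered_next)
    fresh = scan_from(left)
    return in_flight, fresh


print(demo())  # ([1, 2, 3, 10, 11], [0, 1, 2, 3, 10, 11])
```

In this model both the in-flight scan and a fresh scan see each key exactly
once. Whether that still holds with a split of A before the merge, or for
backward scans, is exactly what I am unsure about.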

Could you comment on this alternative approach?

Thanks in advance!

Kirk

--0000000000000b7350063d4542ea--