MIME-Version: 1.0
References: <11A59C0C-A8C8-4642-8493-292D5DF8311D@yandex-team.ru>
 <CAEze2Wg2a8LQDRocVPa7Df2qXQLrTz-xSu1k3xP3x_6ABVo1Jw@mail.gmail.com>
 <CACLU5mRude0L5psEj5WS0DVDv=AHN0McfZBKV5eBoW0JqwwZDA@mail.gmail.com>
 <CAKw2Pb2wV=L4yeAqj27wJ1wptHt1mnBaQ5t4YuLLe+M5aU6ESg@mail.gmail.com>
 <CAEze2WhPwEAszAOV+3_Oc63V=8JcnsY_4FD3XRLzRi0q0nNdUA@mail.gmail.com>
 <CAKw2Pb19rE+sJPpCXhfk8T4K+snfa240_eDXs0dqpCsXoS6cgA@mail.gmail.com>
In-Reply-To: 
 <CAKw2Pb19rE+sJPpCXhfk8T4K+snfa240_eDXs0dqpCsXoS6cgA@mail.gmail.com>
From: Madhav Madhusoodanan <madhavmadhusoodanan@gmail.com>
Date: Fri, 6 Mar 2026 02:15:31 +0530
Message-ID: 
 <CAKw2Pb3=PrcznZuweLLyHt4gmRLyO9cffqRfTm27BFgruGWgWw@mail.gmail.com>
Subject: Re: [WiP] B-tree page merge during vacuum to reduce index bloat
To: Matthias van de Meent <boekewurm+postgres@gmail.com>
Cc: Kirk Wolak <wolakk@gmail.com>, Andrey Borodin <x4mmm@yandex-team.ru>,
	pgsql-hackers <pgsql-hackers@postgresql.org>,
 Nikolay Samokhvalov <nik@postgres.ai>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAKw2Pb3%3DPrcznZuweLLyHt4gmRLyO9cffqRfTm27BFgruGWgWw%40mail.gmail.com>
Precedence: bulk

On Sun, Mar 1, 2026 at 4:15=E2=80=AFPM Madhav Madhusoodanan
<madhavmadhusoodanan@gmail.com> wrote:
>
> On Fri, Feb 27, 2026 at 5:37=E2=80=AFPM Matthias van de Meent
> <boekewurm+postgres@gmail.com> wrote:
> >
> > That is one part of the picture (the merging part), but it's missing a
> > lot of details:
> > - How do concurrent workloads (index scans, backwards index scans,
> > index insertions) detect and recover from concurrent merges when they
> > step through pages?
> > - Do you have a theory or proof of correctness for the above?
> > - Can this scheme be implemented without adding significant overhead
> > to current workloads that don't benefit significantly from the new
> > feature?
> >
> > One example for a problem with the given flow: where (and how) do you
> > update the sibling pointers in pages A (to N from B) and D (to N from
> > C)?
> > You haven't explicitly locked pages A and D by the time you get to
> > step 4, even though locking pages is critical to guarantee correct and
> > safe operations. Simply locking them in step 4 isn't sufficient, as
> > another process may have locked page A in preparation for a split,
> > which would then be waiting to get a lock on page B. If you've already
> > locked page B before locking page A, you'll have a deadlock.
> >
> > Additionally, how does this work when you find out in step 3 that your
> > pages B and C each are linked to from a different parent page? You'd
> > have to update both parent pages, but that would also require locking
> > more than just the one parent page per level that currently gets
> > locked during page deletion.
>
> Ohh I see, I haven't included the dynamics of such flows too. I might
> have to think this through.
> Thank you for pointing it out!
>
> Just to understand what we can control (and to help internalize the
> constraints so that I can think through a strategy): to handle
> transient merge states, are we intending to update core? Or is the
> bloat reduction functionality intended to be abstracted into an
> extension?
>
> I'm coming up with ways that might potentially update index amcheck
> and the like.
>
> Thanks in advance!
>
> Madhav

Some parts of the merge flow that I am coming up with are as follows
(assuming tuples from index page B are migrated rightwards into C):

1. Leaving B's tuples as it is even after merge, to remove the
possible risk of scans "skipping over" tuples. Essentially, the tuples
then would be "copied" into C.
2. Marking pages B and C with flags similar to INCOMPLETE_SPLIT (say,
MERGE_SRC and MERGE_DEST respectively) before the actual merge
process, then marking the pages with another flag upon completion
(MERGE_COMPLETE) so that other processes can handle transient merge
states.
3. For example, scans that reach page B post-merge (MERGE_SRC +
MERGE_COMPLETE) would be made to skip to the page to its right.
4. Updating VACUUM to handle post-merge cleanup (to remove pages such as B)=
.

Note: The underlying assumption is that B and C both are children of a
single parent page P. For a start I'm picturing this flow only for
sibling nodes, not for cousin nodes (B and C having different parent
nodes).

Would the above approach be alright? These are not exhaustive steps,
I'm only raising these lines of thought so that I can fast-fail them
(if there are any issues).
I'd be happy to provide a doc of the exact flow I'm picturing (how to
ensure TIDs are read exactly-once, how the merge process would hold up
with other concurrent processes, etc), incase my thoughts seem
promising.

Thanks in advance!

Madhav