Re: [WiP] B-tree page merge during vacuum to reduce index bloat

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: [WiP] B-tree page merge during vacuum to reduce index bloat
6+ messages / 3 participants
[nested] [flat]

* Re: [WiP] B-tree page merge during vacuum to reduce index bloat
@ 2026-03-01 10:45  Madhav Madhusoodanan <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Madhav Madhusoodanan @ 2026-03-01 10:45 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Kirk Wolak <[email protected]>; Andrey Borodin <[email protected]>; pgsql-hackers; Nikolay Samokhvalov <[email protected]>

On Fri, Feb 27, 2026 at 5:37 PM Matthias van de Meent
<[email protected]> wrote:
>
> That is one part of the picture (the merging part), but it's missing a
> lot of details:
> - How do concurrent workloads (index scans, backwards index scans,
> index insertions) detect and recover from concurrent merges when they
> step through pages?
> - Do you have a theory or proof of correctness for the above?
> - Can this scheme be implemented without adding significant overhead
> to current workloads that don't benefit significantly from the new
> feature?
>
> One example for a problem with the given flow: where (and how) do you
> update the sibling pointers in pages A (to N from B) and D (to N from
> C)?
> You haven't explicitly locked pages A and D by the time you get to
> step 4, even though locking pages is critical to guarantee correct and
> safe operations. Simply locking them in step 4 isn't sufficient, as
> another process may have locked page A in preparation for a split,
> which would then be waiting to get a lock on page B. If you've already
> locked page B before locking page A, you'll have a deadlock.
>
> Additionally, how does this work when you find out in step 3 that your
> pages B and C each are linked to from a different parent page? You'd
> have to update both parent pages, but that would also require locking
> more than just the one parent page per level that currently gets
> locked during page deletion.

Ohh I see, I haven't included the dynamics of such flows too. I might
have to think this through.
Thank you for pointing it out!

Just to understand what we can control (and to help internalize the
constraints so that I can think through a strategy): to handle
transient merge states, are we intending to update core? Or is the
bloat reduction functionality intended to be abstracted into an
extension?

I'm coming up with ways that might potentially update index amcheck
and the like.

Thanks in advance!

Madhav






^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: [WiP] B-tree page merge during vacuum to reduce index bloat
@ 2026-03-05 20:45  Madhav Madhusoodanan <[email protected]>
  parent: Madhav Madhusoodanan <[email protected]>
  0 siblings, 2 replies; 6+ messages in thread

From: Madhav Madhusoodanan @ 2026-03-05 20:45 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Kirk Wolak <[email protected]>; Andrey Borodin <[email protected]>; pgsql-hackers; Nikolay Samokhvalov <[email protected]>

On Sun, Mar 1, 2026 at 4:15 PM Madhav Madhusoodanan
<[email protected]> wrote:
>
> On Fri, Feb 27, 2026 at 5:37 PM Matthias van de Meent
> <[email protected]> wrote:
> >
> > That is one part of the picture (the merging part), but it's missing a
> > lot of details:
> > - How do concurrent workloads (index scans, backwards index scans,
> > index insertions) detect and recover from concurrent merges when they
> > step through pages?
> > - Do you have a theory or proof of correctness for the above?
> > - Can this scheme be implemented without adding significant overhead
> > to current workloads that don't benefit significantly from the new
> > feature?
> >
> > One example for a problem with the given flow: where (and how) do you
> > update the sibling pointers in pages A (to N from B) and D (to N from
> > C)?
> > You haven't explicitly locked pages A and D by the time you get to
> > step 4, even though locking pages is critical to guarantee correct and
> > safe operations. Simply locking them in step 4 isn't sufficient, as
> > another process may have locked page A in preparation for a split,
> > which would then be waiting to get a lock on page B. If you've already
> > locked page B before locking page A, you'll have a deadlock.
> >
> > Additionally, how does this work when you find out in step 3 that your
> > pages B and C each are linked to from a different parent page? You'd
> > have to update both parent pages, but that would also require locking
> > more than just the one parent page per level that currently gets
> > locked during page deletion.
>
> Ohh I see, I haven't included the dynamics of such flows too. I might
> have to think this through.
> Thank you for pointing it out!
>
> Just to understand what we can control (and to help internalize the
> constraints so that I can think through a strategy): to handle
> transient merge states, are we intending to update core? Or is the
> bloat reduction functionality intended to be abstracted into an
> extension?
>
> I'm coming up with ways that might potentially update index amcheck
> and the like.
>
> Thanks in advance!
>
> Madhav

Some parts of the merge flow that I am coming up with are as follows
(assuming tuples from index page B are migrated rightwards into C):

1. Leaving B's tuples as it is even after merge, to remove the
possible risk of scans "skipping over" tuples. Essentially, the tuples
then would be "copied" into C.
2. Marking pages B and C with flags similar to INCOMPLETE_SPLIT (say,
MERGE_SRC and MERGE_DEST respectively) before the actual merge
process, then marking the pages with another flag upon completion
(MERGE_COMPLETE) so that other processes can handle transient merge
states.
3. For example, scans that reach page B post-merge (MERGE_SRC +
MERGE_COMPLETE) would be made to skip to the page to its right.
4. Updating VACUUM to handle post-merge cleanup (to remove pages such as B).

Note: The underlying assumption is that B and C both are children of a
single parent page P. For a start I'm picturing this flow only for
sibling nodes, not for cousin nodes (B and C having different parent
nodes).

Would the above approach be alright? These are not exhaustive steps,
I'm only raising these lines of thought so that I can fast-fail them
(if there are any issues).
I'd be happy to provide a doc of the exact flow I'm picturing (how to
ensure TIDs are read exactly-once, how the merge process would hold up
with other concurrent processes, etc), incase my thoughts seem
promising.

Thanks in advance!

Madhav

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: [WiP] B-tree page merge during vacuum to reduce index bloat
@ 2026-03-10 05:09  Madhav Madhusoodanan <[email protected]>
  parent: Madhav Madhusoodanan <[email protected]>
  1 sibling, 1 reply; 6+ messages in thread

From: Madhav Madhusoodanan @ 2026-03-10 05:09 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Kirk Wolak <[email protected]>; Andrey Borodin <[email protected]>; pgsql-hackers; Nikolay Samokhvalov <[email protected]>

On Fri, Mar 6, 2026 at 2:15 AM Madhav Madhusoodanan
<[email protected]> wrote:
>
> Some parts of the merge flow that I am coming up with are as follows
> (assuming tuples from index page B are migrated rightwards into C):
>
> 1. Leaving B's tuples as it is even after merge, to remove the
> possible risk of scans "skipping over" tuples. Essentially, the tuples
> then would be "copied" into C.
> 2. Marking pages B and C with flags similar to INCOMPLETE_SPLIT (say,
> MERGE_SRC and MERGE_DEST respectively) before the actual merge
> process, then marking the pages with another flag upon completion
> (MERGE_COMPLETE) so that other processes can handle transient merge
> states.
> 3. For example, scans that reach page B post-merge (MERGE_SRC +
> MERGE_COMPLETE) would be made to skip to the page to its right.
> 4. Updating VACUUM to handle post-merge cleanup (to remove pages such as B).
>

I was going through the source code to understand whether the
aforementioned direction of changes would be reasonable.

I was observing `BTPageOpaqueData.btpo_flags` [0] which is a uint16,
but only 9 bits are used.

Would using a couple bits of the same for this purpose be reasonable?
Or are they being reserved for future functionality?

Thanks!

Madhav

[0] https://github.com/postgres/postgres/blob/03facc1211b0ff1550f41bcd4da09329080c30f9/src/include/acces...





^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: [WiP] B-tree page merge during vacuum to reduce index bloat
@ 2026-03-10 11:53  Matthias van de Meent <[email protected]>
  parent: Madhav Madhusoodanan <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Matthias van de Meent @ 2026-03-10 11:53 UTC (permalink / raw)
  To: Madhav Madhusoodanan <[email protected]>; +Cc: Kirk Wolak <[email protected]>; Andrey Borodin <[email protected]>; pgsql-hackers; Nikolay Samokhvalov <[email protected]>

On Tue, 10 Mar 2026 at 06:09, Madhav Madhusoodanan
<[email protected]> wrote:
>
> On Fri, Mar 6, 2026 at 2:15 AM Madhav Madhusoodanan
> <[email protected]> wrote:
> >
> > Some parts of the merge flow that I am coming up with are as follows
> > (assuming tuples from index page B are migrated rightwards into C):
> >
> > 1. Leaving B's tuples as it is even after merge, to remove the
> > possible risk of scans "skipping over" tuples. Essentially, the tuples
> > then would be "copied" into C.
> > 2. Marking pages B and C with flags similar to INCOMPLETE_SPLIT (say,
> > MERGE_SRC and MERGE_DEST respectively) before the actual merge
> > process, then marking the pages with another flag upon completion
> > (MERGE_COMPLETE) so that other processes can handle transient merge
> > states.
> > 3. For example, scans that reach page B post-merge (MERGE_SRC +
> > MERGE_COMPLETE) would be made to skip to the page to its right.
> > 4. Updating VACUUM to handle post-merge cleanup (to remove pages such as B).
> >
>
> I was going through the source code to understand whether the
> aforementioned direction of changes would be reasonable.
>
> I was observing `BTPageOpaqueData.btpo_flags` [0] which is a uint16,
> but only 9 bits are used.
>
> Would using a couple bits of the same for this purpose be reasonable?
> Or are they being reserved for future functionality?

They're exclusively for btree code's use; extensions (*) must not add
to (or change the meaning of) those bits, lest they create a forward
incompatibility with core PostgreSQL btree code in newer major
versions; it would cause corrupted binary upgraded databases.
But patches against core btree code can use those bits, because
forward compatibility is less of an issue there - we don't really
support binary upgrades manually patched systems, especially if they
have incompatible on-disk data.

(*): I'm skeptical about whether you could make btree scans handle
concurrently merged pages, when that merging is implemented as
extension and the btree code doesn't know about merges.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)





^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: [WiP] B-tree page merge during vacuum to reduce index bloat
@ 2026-03-11 09:40  Madhav Madhusoodanan <[email protected]>
  parent: Matthias van de Meent <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: Madhav Madhusoodanan @ 2026-03-11 09:40 UTC (permalink / raw)
  To: Matthias van de Meent <[email protected]>; +Cc: Kirk Wolak <[email protected]>; Andrey Borodin <[email protected]>; pgsql-hackers; Nikolay Samokhvalov <[email protected]>

On Tue, Mar 10, 2026 at 5:23 PM Matthias van de Meent
<[email protected]> wrote:
>
> They're exclusively for btree code's use; extensions (*) must not add
> to (or change the meaning of) those bits, lest they create a forward
> incompatibility with core PostgreSQL btree code in newer major
> versions; it would cause corrupted binary upgraded databases.
> But patches against core btree code can use those bits, because
> forward compatibility is less of an issue there - we don't really
> support binary upgrades manually patched systems, especially if they
> have incompatible on-disk data.

Noted, I'm looking at it from the core btree code side of things.
I'm unable to see a way where scans or VACUUM calls can recognise
transient merge states if we implement merging as an extension.

Madhav





^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: [WiP] B-tree page merge during vacuum to reduce index bloat
@ 2026-03-24 08:56  Wenbo Lin <[email protected]>
  parent: Madhav Madhusoodanan <[email protected]>
  1 sibling, 0 replies; 6+ messages in thread

From: Wenbo Lin @ 2026-03-24 08:56 UTC (permalink / raw)
  To: [email protected]

On 3/6/26 4:45 AM, Madhav Madhusoodanan wrote:
> 
> Some parts of the merge flow that I am coming up with are as follows
> (assuming tuples from index page B are migrated rightwards into C):
> 
> 1. Leaving B's tuples as it is even after merge, to remove the
> possible risk of scans "skipping over" tuples. Essentially, the tuples
> then would be "copied" into C.
> 2. Marking pages B and C with flags similar to INCOMPLETE_SPLIT (say,
> MERGE_SRC and MERGE_DEST respectively) before the actual merge
> process, then marking the pages with another flag upon completion
> (MERGE_COMPLETE) so that other processes can handle transient merge
> states.
> 3. For example, scans that reach page B post-merge (MERGE_SRC +
> MERGE_COMPLETE) would be made to skip to the page to its right.
> 4. Updating VACUUM to handle post-merge cleanup (to remove pages such as B).
> 
> Note: The underlying assumption is that B and C both are children of a
> single parent page P. For a start I'm picturing this flow only for
> sibling nodes, not for cousin nodes (B and C having different parent
> nodes).
> 
> Would the above approach be alright? These are not exhaustive steps,
> I'm only raising these lines of thought so that I can fast-fail them
> (if there are any issues).
> I'd be happy to provide a doc of the exact flow I'm picturing (how to
> ensure TIDs are read exactly-once, how the merge process would hold up
> with other concurrent processes, etc), incase my thoughts seem
> promising.
> 
> Thanks in advance!
> 
> Madhav
> 
> 
Hi Madhav
I've been following this thread with interest and have been thinking
about some of the same problem. A few things came to mind reading your
proposal:

1. Backward scans. Your point 3 mentions "skip to right" when scans
hitting a MERGE_SRC page. I think this would work for forward scan, but
backward scans travel right-to-left, so they'd read C first, then step
to B. By the time they reach B, they've already passed C. How would the
scan know whether it already picked up B's tuples from C or not?
2. Lock gap. Building on the above, between releasing one page's lock
and acquiring the next, the merge could actually happen. So a backward
scan might read C before the merge, then find B marked as MERGE_SRC
after the merge. In that case C didn't have B's tuples when the scan
read it.so the scan need to read B. But if the merge happened earlier, C
already has B's tuples and reading B would produce duplicates. I think
it will need some detection mechanism (I wonder if some kind of per-scan
tracking of what was already returned could help distinguish these two
cases, though I haven't thought through the details yet.)
3. MERGE_COMPLETE flag. I was wondering whether two flags (MERGE_SRC and
MERGE_DEST) might be enough. if the tuple copy and flag settings are
done atomically within a single critical section, then other backends
either see the pre-merge or post-merge state, never a partial one. Or am
i missing a case where you'd need to distinguish "merge in progres" from
"merge done"

Looking forward to hearing your thoughts and anyone's else. Happy to be
corrected if I'm off base on any of these.

Regards

^ permalink  raw  reply  [nested|flat] 6+ messages in thread

end of thread, other threads:[~2026-03-24 08:56 UTC | newest]

Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-03-01 10:45 Re: [WiP] B-tree page merge during vacuum to reduce index bloat Madhav Madhusoodanan <[email protected]>
2026-03-05 20:45 ` Madhav Madhusoodanan <[email protected]>
2026-03-10 05:09   ` Madhav Madhusoodanan <[email protected]>
2026-03-10 11:53     ` Matthias van de Meent <[email protected]>
2026-03-11 09:40       ` Madhav Madhusoodanan <[email protected]>
2026-03-24 08:56   ` Wenbo Lin <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox