public inbox for [email protected]help / color / mirror / Atom feed
Re: [WiP] B-tree page merge during vacuum to reduce index bloat 6+ messages / 3 participants [nested] [flat]
* Re: [WiP] B-tree page merge during vacuum to reduce index bloat @ 2026-03-01 10:45 Madhav Madhusoodanan <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Madhav Madhusoodanan @ 2026-03-01 10:45 UTC (permalink / raw) To: Matthias van de Meent <[email protected]>; +Cc: Kirk Wolak <[email protected]>; Andrey Borodin <[email protected]>; pgsql-hackers; Nikolay Samokhvalov <[email protected]> On Fri, Feb 27, 2026 at 5:37 PM Matthias van de Meent <[email protected]> wrote: > > That is one part of the picture (the merging part), but it's missing a > lot of details: > - How do concurrent workloads (index scans, backwards index scans, > index insertions) detect and recover from concurrent merges when they > step through pages? > - Do you have a theory or proof of correctness for the above? > - Can this scheme be implemented without adding significant overhead > to current workloads that don't benefit significantly from the new > feature? > > One example for a problem with the given flow: where (and how) do you > update the sibling pointers in pages A (to N from B) and D (to N from > C)? > You haven't explicitly locked pages A and D by the time you get to > step 4, even though locking pages is critical to guarantee correct and > safe operations. Simply locking them in step 4 isn't sufficient, as > another process may have locked page A in preparation for a split, > which would then be waiting to get a lock on page B. If you've already > locked page B before locking page A, you'll have a deadlock. > > Additionally, how does this work when you find out in step 3 that your > pages B and C each are linked to from a different parent page? You'd > have to update both parent pages, but that would also require locking > more than just the one parent page per level that currently gets > locked during page deletion. Ohh I see, I haven't included the dynamics of such flows too. I might have to think this through. Thank you for pointing it out! Just to understand what we can control (and to help internalize the constraints so that I can think through a strategy): to handle transient merge states, are we intending to update core? Or is the bloat reduction functionality intended to be abstracted into an extension? I'm coming up with ways that might potentially update index amcheck and the like. Thanks in advance! Madhav ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: [WiP] B-tree page merge during vacuum to reduce index bloat @ 2026-03-05 20:45 Madhav Madhusoodanan <[email protected]> parent: Madhav Madhusoodanan <[email protected]> 0 siblings, 2 replies; 6+ messages in thread From: Madhav Madhusoodanan @ 2026-03-05 20:45 UTC (permalink / raw) To: Matthias van de Meent <[email protected]>; +Cc: Kirk Wolak <[email protected]>; Andrey Borodin <[email protected]>; pgsql-hackers; Nikolay Samokhvalov <[email protected]> On Sun, Mar 1, 2026 at 4:15 PM Madhav Madhusoodanan <[email protected]> wrote: > > On Fri, Feb 27, 2026 at 5:37 PM Matthias van de Meent > <[email protected]> wrote: > > > > That is one part of the picture (the merging part), but it's missing a > > lot of details: > > - How do concurrent workloads (index scans, backwards index scans, > > index insertions) detect and recover from concurrent merges when they > > step through pages? > > - Do you have a theory or proof of correctness for the above? > > - Can this scheme be implemented without adding significant overhead > > to current workloads that don't benefit significantly from the new > > feature? > > > > One example for a problem with the given flow: where (and how) do you > > update the sibling pointers in pages A (to N from B) and D (to N from > > C)? > > You haven't explicitly locked pages A and D by the time you get to > > step 4, even though locking pages is critical to guarantee correct and > > safe operations. Simply locking them in step 4 isn't sufficient, as > > another process may have locked page A in preparation for a split, > > which would then be waiting to get a lock on page B. If you've already > > locked page B before locking page A, you'll have a deadlock. > > > > Additionally, how does this work when you find out in step 3 that your > > pages B and C each are linked to from a different parent page? You'd > > have to update both parent pages, but that would also require locking > > more than just the one parent page per level that currently gets > > locked during page deletion. > > Ohh I see, I haven't included the dynamics of such flows too. I might > have to think this through. > Thank you for pointing it out! > > Just to understand what we can control (and to help internalize the > constraints so that I can think through a strategy): to handle > transient merge states, are we intending to update core? Or is the > bloat reduction functionality intended to be abstracted into an > extension? > > I'm coming up with ways that might potentially update index amcheck > and the like. > > Thanks in advance! > > Madhav Some parts of the merge flow that I am coming up with are as follows (assuming tuples from index page B are migrated rightwards into C): 1. Leaving B's tuples as it is even after merge, to remove the possible risk of scans "skipping over" tuples. Essentially, the tuples then would be "copied" into C. 2. Marking pages B and C with flags similar to INCOMPLETE_SPLIT (say, MERGE_SRC and MERGE_DEST respectively) before the actual merge process, then marking the pages with another flag upon completion (MERGE_COMPLETE) so that other processes can handle transient merge states. 3. For example, scans that reach page B post-merge (MERGE_SRC + MERGE_COMPLETE) would be made to skip to the page to its right. 4. Updating VACUUM to handle post-merge cleanup (to remove pages such as B). Note: The underlying assumption is that B and C both are children of a single parent page P. For a start I'm picturing this flow only for sibling nodes, not for cousin nodes (B and C having different parent nodes). Would the above approach be alright? These are not exhaustive steps, I'm only raising these lines of thought so that I can fast-fail them (if there are any issues). I'd be happy to provide a doc of the exact flow I'm picturing (how to ensure TIDs are read exactly-once, how the merge process would hold up with other concurrent processes, etc), incase my thoughts seem promising. Thanks in advance! Madhav ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: [WiP] B-tree page merge during vacuum to reduce index bloat @ 2026-03-10 05:09 Madhav Madhusoodanan <[email protected]> parent: Madhav Madhusoodanan <[email protected]> 1 sibling, 1 reply; 6+ messages in thread From: Madhav Madhusoodanan @ 2026-03-10 05:09 UTC (permalink / raw) To: Matthias van de Meent <[email protected]>; +Cc: Kirk Wolak <[email protected]>; Andrey Borodin <[email protected]>; pgsql-hackers; Nikolay Samokhvalov <[email protected]> On Fri, Mar 6, 2026 at 2:15 AM Madhav Madhusoodanan <[email protected]> wrote: > > Some parts of the merge flow that I am coming up with are as follows > (assuming tuples from index page B are migrated rightwards into C): > > 1. Leaving B's tuples as it is even after merge, to remove the > possible risk of scans "skipping over" tuples. Essentially, the tuples > then would be "copied" into C. > 2. Marking pages B and C with flags similar to INCOMPLETE_SPLIT (say, > MERGE_SRC and MERGE_DEST respectively) before the actual merge > process, then marking the pages with another flag upon completion > (MERGE_COMPLETE) so that other processes can handle transient merge > states. > 3. For example, scans that reach page B post-merge (MERGE_SRC + > MERGE_COMPLETE) would be made to skip to the page to its right. > 4. Updating VACUUM to handle post-merge cleanup (to remove pages such as B). > I was going through the source code to understand whether the aforementioned direction of changes would be reasonable. I was observing `BTPageOpaqueData.btpo_flags` [0] which is a uint16, but only 9 bits are used. Would using a couple bits of the same for this purpose be reasonable? Or are they being reserved for future functionality? Thanks! Madhav [0] https://github.com/postgres/postgres/blob/03facc1211b0ff1550f41bcd4da09329080c30f9/src/include/acces... ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: [WiP] B-tree page merge during vacuum to reduce index bloat @ 2026-03-10 11:53 Matthias van de Meent <[email protected]> parent: Madhav Madhusoodanan <[email protected]> 0 siblings, 1 reply; 6+ messages in thread From: Matthias van de Meent @ 2026-03-10 11:53 UTC (permalink / raw) To: Madhav Madhusoodanan <[email protected]>; +Cc: Kirk Wolak <[email protected]>; Andrey Borodin <[email protected]>; pgsql-hackers; Nikolay Samokhvalov <[email protected]> On Tue, 10 Mar 2026 at 06:09, Madhav Madhusoodanan <[email protected]> wrote: > > On Fri, Mar 6, 2026 at 2:15 AM Madhav Madhusoodanan > <[email protected]> wrote: > > > > Some parts of the merge flow that I am coming up with are as follows > > (assuming tuples from index page B are migrated rightwards into C): > > > > 1. Leaving B's tuples as it is even after merge, to remove the > > possible risk of scans "skipping over" tuples. Essentially, the tuples > > then would be "copied" into C. > > 2. Marking pages B and C with flags similar to INCOMPLETE_SPLIT (say, > > MERGE_SRC and MERGE_DEST respectively) before the actual merge > > process, then marking the pages with another flag upon completion > > (MERGE_COMPLETE) so that other processes can handle transient merge > > states. > > 3. For example, scans that reach page B post-merge (MERGE_SRC + > > MERGE_COMPLETE) would be made to skip to the page to its right. > > 4. Updating VACUUM to handle post-merge cleanup (to remove pages such as B). > > > > I was going through the source code to understand whether the > aforementioned direction of changes would be reasonable. > > I was observing `BTPageOpaqueData.btpo_flags` [0] which is a uint16, > but only 9 bits are used. > > Would using a couple bits of the same for this purpose be reasonable? > Or are they being reserved for future functionality? They're exclusively for btree code's use; extensions (*) must not add to (or change the meaning of) those bits, lest they create a forward incompatibility with core PostgreSQL btree code in newer major versions; it would cause corrupted binary upgraded databases. But patches against core btree code can use those bits, because forward compatibility is less of an issue there - we don't really support binary upgrades manually patched systems, especially if they have incompatible on-disk data. (*): I'm skeptical about whether you could make btree scans handle concurrently merged pages, when that merging is implemented as extension and the btree code doesn't know about merges. Kind regards, Matthias van de Meent Databricks (https://www.databricks.com) ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: [WiP] B-tree page merge during vacuum to reduce index bloat @ 2026-03-11 09:40 Madhav Madhusoodanan <[email protected]> parent: Matthias van de Meent <[email protected]> 0 siblings, 0 replies; 6+ messages in thread From: Madhav Madhusoodanan @ 2026-03-11 09:40 UTC (permalink / raw) To: Matthias van de Meent <[email protected]>; +Cc: Kirk Wolak <[email protected]>; Andrey Borodin <[email protected]>; pgsql-hackers; Nikolay Samokhvalov <[email protected]> On Tue, Mar 10, 2026 at 5:23 PM Matthias van de Meent <[email protected]> wrote: > > They're exclusively for btree code's use; extensions (*) must not add > to (or change the meaning of) those bits, lest they create a forward > incompatibility with core PostgreSQL btree code in newer major > versions; it would cause corrupted binary upgraded databases. > But patches against core btree code can use those bits, because > forward compatibility is less of an issue there - we don't really > support binary upgrades manually patched systems, especially if they > have incompatible on-disk data. Noted, I'm looking at it from the core btree code side of things. I'm unable to see a way where scans or VACUUM calls can recognise transient merge states if we implement merging as an extension. Madhav ^ permalink raw reply [nested|flat] 6+ messages in thread
* Re: [WiP] B-tree page merge during vacuum to reduce index bloat @ 2026-03-24 08:56 Wenbo Lin <[email protected]> parent: Madhav Madhusoodanan <[email protected]> 1 sibling, 0 replies; 6+ messages in thread From: Wenbo Lin @ 2026-03-24 08:56 UTC (permalink / raw) To: [email protected] On 3/6/26 4:45 AM, Madhav Madhusoodanan wrote: > > Some parts of the merge flow that I am coming up with are as follows > (assuming tuples from index page B are migrated rightwards into C): > > 1. Leaving B's tuples as it is even after merge, to remove the > possible risk of scans "skipping over" tuples. Essentially, the tuples > then would be "copied" into C. > 2. Marking pages B and C with flags similar to INCOMPLETE_SPLIT (say, > MERGE_SRC and MERGE_DEST respectively) before the actual merge > process, then marking the pages with another flag upon completion > (MERGE_COMPLETE) so that other processes can handle transient merge > states. > 3. For example, scans that reach page B post-merge (MERGE_SRC + > MERGE_COMPLETE) would be made to skip to the page to its right. > 4. Updating VACUUM to handle post-merge cleanup (to remove pages such as B). > > Note: The underlying assumption is that B and C both are children of a > single parent page P. For a start I'm picturing this flow only for > sibling nodes, not for cousin nodes (B and C having different parent > nodes). > > Would the above approach be alright? These are not exhaustive steps, > I'm only raising these lines of thought so that I can fast-fail them > (if there are any issues). > I'd be happy to provide a doc of the exact flow I'm picturing (how to > ensure TIDs are read exactly-once, how the merge process would hold up > with other concurrent processes, etc), incase my thoughts seem > promising. > > Thanks in advance! > > Madhav > > Hi Madhav I've been following this thread with interest and have been thinking about some of the same problem. A few things came to mind reading your proposal: 1. Backward scans. Your point 3 mentions "skip to right" when scans hitting a MERGE_SRC page. I think this would work for forward scan, but backward scans travel right-to-left, so they'd read C first, then step to B. By the time they reach B, they've already passed C. How would the scan know whether it already picked up B's tuples from C or not? 2. Lock gap. Building on the above, between releasing one page's lock and acquiring the next, the merge could actually happen. So a backward scan might read C before the merge, then find B marked as MERGE_SRC after the merge. In that case C didn't have B's tuples when the scan read it.so the scan need to read B. But if the merge happened earlier, C already has B's tuples and reading B would produce duplicates. I think it will need some detection mechanism (I wonder if some kind of per-scan tracking of what was already returned could help distinguish these two cases, though I haven't thought through the details yet.) 3. MERGE_COMPLETE flag. I was wondering whether two flags (MERGE_SRC and MERGE_DEST) might be enough. if the tuple copy and flag settings are done atomically within a single critical section, then other backends either see the pre-merge or post-merge state, never a partial one. Or am i missing a case where you'd need to distinguish "merge in progres" from "merge done" Looking forward to hearing your thoughts and anyone's else. Happy to be corrected if I'm off base on any of these. Regards ^ permalink raw reply [nested|flat] 6+ messages in thread
end of thread, other threads:[~2026-03-24 08:56 UTC | newest] Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2026-03-01 10:45 Re: [WiP] B-tree page merge during vacuum to reduce index bloat Madhav Madhusoodanan <[email protected]> 2026-03-05 20:45 ` Madhav Madhusoodanan <[email protected]> 2026-03-10 05:09 ` Madhav Madhusoodanan <[email protected]> 2026-03-10 11:53 ` Matthias van de Meent <[email protected]> 2026-03-11 09:40 ` Madhav Madhusoodanan <[email protected]> 2026-03-24 08:56 ` Wenbo Lin <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox