public inbox for [email protected]help / color / mirror / Atom feed
Proposal: Prevent Primary/Standby SLRU divergence during MultiXact truncation 4+ messages / 2 participants [nested] [flat]
* Proposal: Prevent Primary/Standby SLRU divergence during MultiXact truncation @ 2026-03-16 16:09 Ayush Tiwari <[email protected]> 0 siblings, 1 reply; 4+ messages in thread From: Ayush Tiwari @ 2026-03-16 16:09 UTC (permalink / raw) To: pgsql-hackers Hi Hackers, Looking at the MultiXact truncation behavior and reading through the recent thread regarding the 17.8 standby crashing during WAL replay (commit 8ba61bc), we noticed an architectural edge case that seems to cause a silent primary/standby SLRU divergence. I'd like to ask if this is a known accepted risk or if a patch to reorder this logic is worth exploring. The Issue: In TruncateMultiXact(), we write the truncation WAL record (WriteMTruncateXlogRec) before we actually perform the truncation via PerformOffsetsTruncation() -> SimpleLruTruncate(). The problem arises from the "apparent wraparound" safety check inside SimpleLruTruncate(). If SlruScanDirectory() detects an apparent wraparound, SimpleLruTruncate() safely bails out and skips unlinking the SLRU segments on the primary, logging: could not truncate directory "%s": apparent wraparound. However, the WAL record for the truncation has already been flushed. Standbys replay this TRUNCATE_ID WAL record and blindly delete their SLRU segments. At this point, the primary and standby have diverged. The Impact: If the standby is subsequently promoted to primary, any attempt to access rows holding those older MultiXact IDs (which the original primary decided to keep) will throw a FATAL: could not access status of transaction error, effectively resulting in data loss / inaccessible rows for the user. While the recent commits address the immediate standby crash involving latest_page_number during multixact_redo(), they don't seem to prevent the primary from emitting a "false" WAL truncation record when it abandons its own truncation. Proposed Approach: It seems safer to only emit the WAL record if we are guaranteed to follow through with the truncation. We could modify SimpleLruTruncate() to perform its safety checks first and return a boolean indicating whether the truncation is safe to proceed. TruncateMultiXact() would then only call WriteMTruncateXlogRec() and proceed with physical deletion if the check passes. I have attached a rough draft patch illustrating this sequence change. Is this a scenario the community has already considered, or is this reordering something that should be explored further to harden standby reliability? PS. Also, noticed this to be the case in clog.c file Thanks for your time. Regards, Ayush Attachments: [application/octet-stream] v1-prevent-multixact-slru-divergence 2.patch (1.9K, 3-v1-prevent-multixact-slru-divergence%202.patch) download | inline diff: From 9b8830fe35ac9a2c2efabc1e91244e8abcd12345 Mon Sep 17 00:00:00 2001 From: Ayush Tiwari <[email protected]> Date: Mon, 16 Mar 2026 12:00:00 +0000 Subject: [Draft Patch v1] Prevent SLRU divergence by checking wraparound before WAL flush In TruncateMultiXact(), if SimpleLruTruncate() encounters an apparent wraparound, it aborts the truncation safely. However, the TRUNCATE_ID WAL record has already been emitted. Standbys blindly replay this record and delete their SLRU segments. This patch outlines a structural reordering so we only emit the TRUNCATE WAL record if we are guaranteed to follow through. --- src/backend/access/transam/multixact.c | 16 +++++++++++++--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c index a1b2c3d4e..f5g6h7i8j 100644 --- a/src/backend/access/transam/multixact.c +++ b/src/backend/access/transam/multixact.c @@ -2980,10 +2980,20 @@ TruncateMultiXact(MultiXactId newOldestMulti, Oid newOldestMultiDB) Assert((MyProc->delayChkptFlags & DELAY_CHKPT_START) == 0); MyProc->delayChkptFlags |= DELAY_CHKPT_START; - /* WAL log truncation */ - WriteMTruncateXlogRec(newOldestMultiDB, newOldestMulti, newOldestOffset); + /* + * TODO/DRAFT: + * We should check if SimpleLruTruncate() will abort due to + * apparent wraparound *before* flushing the WAL. + * + * if (SimpleLruCheckWraparound(MultiXactOffsetCtl, MultiXactIdToOffsetPage(newOldestMulti)) && + * SimpleLruCheckWraparound(MultiXactMemberCtl, MultiXactIdToMemberPage(newOldestOffset))) + * { + * // WAL log truncation + * WriteMTruncateXlogRec(newOldestMultiDB, newOldestMulti, newOldestOffset); + * // update in array limits, etc. + * // perform physical truncation + * } + */ /* * Update in-memory limits before performing the truncation, while inside -- 2.34.1 ^ permalink raw reply [nested|flat] 4+ messages in thread
* Re: Proposal: Prevent Primary/Standby SLRU divergence during MultiXact truncation @ 2026-03-16 22:10 Heikki Linnakangas <[email protected]> parent: Ayush Tiwari <[email protected]> 0 siblings, 1 reply; 4+ messages in thread From: Heikki Linnakangas @ 2026-03-16 22:10 UTC (permalink / raw) To: Ayush Tiwari <[email protected]>; pgsql-hackers On 16/03/2026 18:09, Ayush Tiwari wrote: > The Issue: > In TruncateMultiXact(), we write the truncation WAL record > (WriteMTruncateXlogRec) before we actually perform the truncation via > PerformOffsetsTruncation() -> SimpleLruTruncate(). > > The problem arises from the "apparent wraparound" safety check inside > SimpleLruTruncate(). If SlruScanDirectory() detects an apparent > wraparound, SimpleLruTruncate() safely bails out and skips unlinking the > SLRU segments on the primary, logging: could not truncate directory > "%s": apparent wraparound. > > However, the WAL record for the truncation has already been flushed. > Standbys replay this TRUNCATE_ID WAL record and blindly delete their > SLRU segments. At this point, the primary and standby have diverged. Replaying the record will perform the same sanity checks against wraparound as the primary does. Hmm, although why did I not apply commit 817f74600d to 'master', only backbranches? The bug that it fixed was related to minor version upgrade, and thus it was not needed on 'master', but the code change would nevertheless make a lot of sense on 'master' too. > The Impact: > If the standby is subsequently promoted to primary, any attempt to > access rows holding those older MultiXact IDs (which the original > primary decided to keep) will throw a FATAL: could not access status of > transaction error, effectively resulting in data loss / inaccessible > rows for the user. Have you been able to reproduce that? > While the recent commits address the immediate standby crash involving > latest_page_number during multixact_redo(), they don't seem to prevent > the primary from emitting a "false" WAL truncation record when it > abandons its own truncation. > > Proposed Approach: > It seems safer to only emit the WAL record if we are guaranteed to > follow through with the truncation. We could modify SimpleLruTruncate() > to perform its safety checks first and return a boolean indicating > whether the truncation is safe to proceed. TruncateMultiXact() would > then only call WriteMTruncateXlogRec() and proceed with physical > deletion if the check passes. > > I have attached a rough draft patch illustrating this sequence change. I agree that would probably be better. I'm not sure how straightforward it will be to implement though, I wouldn't want to add much extra code just for this. P.S. Thanks for looking into this! This is hairy stuff, more review is much appreciated. - Heikki ^ permalink raw reply [nested|flat] 4+ messages in thread
* Re: Proposal: Prevent Primary/Standby SLRU divergence during MultiXact truncation @ 2026-03-17 15:07 Ayush Tiwari <[email protected]> parent: Heikki Linnakangas <[email protected]> 0 siblings, 1 reply; 4+ messages in thread From: Ayush Tiwari @ 2026-03-17 15:07 UTC (permalink / raw) To: Heikki Linnakangas <[email protected]>; +Cc: pgsql-hackers Hi, Thank you for the response. On Tue, 17 Mar 2026 at 03:40, Heikki Linnakangas <[email protected]> wrote: > > Replaying the record will perform the same sanity checks against > wraparound as the primary does. > > Hmm, although why did I not apply commit 817f74600d to 'master', only > backbranches? The bug that it fixed was related to minor version > upgrade, and thus it was not needed on 'master', but the code change > would nevertheless make a lot of sense on 'master' too. > Agreed, once 817f74600d is on master the standby would honestly evaluate the SimpleLruTruncate wraparound backstop instead of bypassing it. However, the backstop is documented as catching "wraparound bugs elsewhere in SLRU handling." If such a bug corrupts latest_page_number on the primary, the standby — which derives its latest_page_number independently from ZERO_OFF_PAGE replay and StartupMultiXact() — would not share the same corruption. The primary would skip the truncation, but the standby would see a healthy latest_page_number and proceed. > Have you been able to reproduce that? > I have reproduced the primary-side condition on an unmodified tree using gdb in batch mode: attach to the VACUUM backend after WriteMTruncateXlogRec() returns, corrupt latest_page_number, and resume. The primary logs "apparent wraparound" and skips the physical deletion, while pg_waldump confirms the TRUNCATE_ID record is present in the WAL. I have not yet set up a streaming replica to demonstrate end-to-end divergence and promotion failure. > > I agree that would probably be better. I'm not sure how straightforward > it will be to implement though, I wouldn't want to add much extra code > just for this. > One approach that might keep the footprint small: we could inline the same PagePrecedes check that SimpleLruTruncate uses directly in TruncateMultiXact(), before START_CRIT_SECTION(). Something like: if (MultiXactOffsetCtl->PagePrecedes( pg_atomic_read_u64(&MultiXactOffsetCtl->shared->latest_page_number), MultiXactIdToOffsetPage(PreviousMultiXactId(newOldestMulti))) || MultiXactMemberCtl->PagePrecedes( pg_atomic_read_u64(&MultiXactMemberCtl->shared->latest_page_number), MXOffsetToMemberPage(newOldestOffset))) { ereport(LOG, (errmsg("skipping multixact truncation due to apparent wraparound"))); LWLockRelease(MultiXactTruncationLock); return; } No new functions, no changes to slru.c or the replay path — just the same condition evaluated earlier so we never enter the critical section or write WAL for a truncation that won't be carried out. Does this seem like a reasonable direction? Regards, Ayush ^ permalink raw reply [nested|flat] 4+ messages in thread
* Re: Proposal: Prevent Primary/Standby SLRU divergence during MultiXact truncation @ 2026-03-22 12:29 Heikki Linnakangas <[email protected]> parent: Ayush Tiwari <[email protected]> 0 siblings, 0 replies; 4+ messages in thread From: Heikki Linnakangas @ 2026-03-22 12:29 UTC (permalink / raw) To: Ayush Tiwari <[email protected]>; +Cc: pgsql-hackers On 17/03/2026 17:07, Ayush Tiwari wrote: > On Tue, 17 Mar 2026 at 03:40, Heikki Linnakangas <[email protected] > <mailto:[email protected]>> wrote: > >> Hmm, although why did I not apply commit 817f74600d to 'master', only >> backbranches? The bug that it fixed was related to minor version >> upgrade, and thus it was not needed on 'master', but the code change >> would nevertheless make a lot of sense on 'master' too. > > Agreed, once 817f74600d is on master the standby would honestly evaluate > the SimpleLruTruncate wraparound backstop instead of bypassing it. I now committed that change to master too. - Heikki ^ permalink raw reply [nested|flat] 4+ messages in thread
end of thread, other threads:[~2026-03-22 12:29 UTC | newest] Thread overview: 4+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2026-03-16 16:09 Proposal: Prevent Primary/Standby SLRU divergence during MultiXact truncation Ayush Tiwari <[email protected]> 2026-03-16 22:10 ` Heikki Linnakangas <[email protected]> 2026-03-17 15:07 ` Ayush Tiwari <[email protected]> 2026-03-22 12:29 ` Heikki Linnakangas <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox