Bug in MultiXact replay compat logic for older minor version after crash-recovery

public inbox for [email protected]  
help / color / mirror / Atom feed

 Bug in MultiXact replay compat logic for older minor version after crash-recovery
7+ messages / 3 participants
[nested] [flat]

*  Bug in MultiXact replay compat logic for older minor version after crash-recovery
@ 2026-03-19 18:02 =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= <[email protected]>
  2026-03-20 11:55 ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= @ 2026-03-19 18:02 UTC (permalink / raw)
  To: pgsql-hackers; +Cc: hlinnaka <[email protected]>

Hi hackers,
    
This is related to two recent threads:
    
[0] https://www.postgresql.org/message-id/flat/172e5723-d65f-4eec-b512-14beacb326ce%40yandex.ru
[1] https://www.postgresql.org/message-id/flat/CACV2tSw3VYS7d27ftO_cs%2BaF3M54%2BJwWBbqSGLcKoG9cvyb6EA%4...
    
I think there may be a bug in the compat logic introduced in
RecordNewMultiXact() to fix [0]. The compat logic handles WAL
generated by older  minor versions where ZERO_OFF_PAGE N+1
may appear after CREATE_ID:N in the WAL stream. However, the condition
used to detect whether the next offset page needs initialization
seems to fail after a crash-restart, causing the standby to enter a
FATAL crash loop.
    
We hit this in production: a PG 16.12 standby replaying WAL from a
PG 16.8 primary entered a crash-restart loop with:
    
> FATAL:  could not access status of transaction 1277952
> DETAIL:  Could not read from file "pg_multixact/offsets/0013" at offset
> 131072: read too few bytes.
> CONTEXT:  WAL redo at 5194/5F5F7118 for MultiXact/CREATE_ID: 1277951 
>offset 2605275 nmembers 2: 814605500 (keysh) 814605501 (keysh)
    
== Analysis ==
    
The compat logic condition is:
    
> if (InRecovery &&
>     next_pageno != pageno &&
>     MultiXactOffsetCtl->shared->latest_page_number == pageno)
    
The third condition (latest_page_number == pageno) assumes that if
latest_page_number equals the current multixact's page, then the next
page hasn't been initialized yet. This works during normal (non-crash)
replay because latest_page_number is incrementally maintained by
SimpleLruZeroPage() calls.
    
But after a crash-restart, StartupMultiXact() resets
latest_page_number to page(checkPoint.nextMulti):
    
> void StartupMultiXact(void)
> {
>     MultiXactId multi = MultiXactState->nextMXact;
>     pageno = MultiXactIdToOffsetPage(multi);
>     MultiXactOffsetCtl->shared->latest_page_number = pageno;
> }
    
When the checkpoint captured an already-advanced nextMXact (which
happens when GetNewMultiXactId() incremented nextMXact before the
backend wrote CREATE_ID to WAL), checkPoint.nextMulti >= N+1, so
latest_page_number = page(N+1) = P+1. The compat check becomes:
    
> P+1 == P  ->  FALSE  ->  compat logic skipped
    
The subsequent SimpleLruReadPage(next_pageno=P+1) then fails because
page P+1 doesn't exist on disk (ZERO_OFF_PAGE:P+1 hasn't been
replayed yet -- it's after CREATE_ID:N in the WAL due to the
reordering that the compat logic is supposed to handle).
    
The fix for [1] addressed a related issue where
TRUNCATE replay reset latest_page_number to a very old value
(latest_page_number << pageno). That fix addresses the "too small"
direction but not the "too large" direction described here.
    
== Timeline: the lock-free window ==
    
The root cause is a lock-free window in MultiXactIdCreateFromMembers().GetNewMultiXactId() advances nextMXact under MultiXactGenLock, but
the lock is released BEFORE XLogInsert(CREATE_ID). Any checkpoint
that runs in this window captures the advanced nextMXact without the
corresponding CREATE_ID in WAL.
    
Assume multi N is the last entry on offset page P (entry 2047), so
multi N+1 falls on page P+1.
    
1. [Backend] GetNewMultiXactId():
   - LWLockAcquire(MultiXactGenLock)
   - result = nextMXact (= N)
   - nextMXact++ (now N+1)
   - LWLockRelease(MultiXactGenLock)
   - return N
   <<< lock-free window opens here >>>
   nextMXact=N+1 is globally visible, but CREATE_ID:N not in WAL yet.
2. [Checkpointer] CHECKPOINT runs during the window:
   - reads nextMXact = N+1
   - writes checkpoint record with nextMulti=N+1
3. [Standby] replays the checkpoint, does restartpoint
   (persists nextMulti=N+1 to disk)
4. [Standby] *** CRASH ***
   <<< lock-free window closes >>>
5. [Backend] XLogInsert(CREATE_ID: N)
   -- this WAL record lands AFTER the checkpoint record
6. [Backend] RecordNewMultiXact(N):
   - next_pageno = page(N+1) = P+1
   - SimpleLruReadPage(P+1) -- works fine on primary
    
The resulting WAL stream on disk:
> ... CHECKPOINT(nextMulti=N+1) ... CREATE_ID:N ... ZERO_OFF_PAGE:P+1 ...
>     ^                             ^                ^
>     redo starts here              needs page P+1   page P+1 init
>     after crash-restart
    
After standby crash-restart:
7. [Standby] StartupMultiXact():
   - multi = nextMXact = N+1 (from checkpoint)
   - latest_page_number = page(N+1) = P+1
8. [Standby] Redo CREATE_ID:N -> RecordNewMultiXact(N):
   - pageno = page(N) = P, next_pageno = page(N+1) = P+1
   - Compat logic check:
     InRecovery = true
     next_pageno(P+1) != pageno(P)
     latest_page_number(P+1) == pageno(P) -- FAIL
     --> compat logic SKIPPED
   - SimpleLruReadPage(P+1)
     page P+1 not on disk yet (ZERO_OFF_PAGE:P+1 not replayed)
     --> FATAL: read too few bytes
    
== Reproduction ==
    
I wrote a deterministic reproducer using sleep injection. A sleep
injection patch (sleep-injection-for-multixact-compat-logic-bug-repro.patch)
adds a 60-second sleep in MultiXactIdCreateFromMembers(), after
GetNewMultiXactId() returns but before XLogInsert(CREATE_ID).
The sleep only triggers when the allocated multi is the last entry on
an offset page (entry 2047) AND the file /tmp/mxact_sleep_enabled exists
(so the batch phase that generates multixacts is not affected).
    
The setup:
    
- Primary: PG 16.11 with the sleep injection patch applied.
- Standby: PG 16.13 (contains the compat logic from [0] and[1]).
    
The test script (test_multixact_compat.sh) does the following:
    
1. Initialize primary, create test table
2. Generate multixacts until nextMulti's entry is exactly 2047
   (last entry on page P)
3. pg_basebackup to create standby, start with pg_new (16.13),
   checkpoint_timeout=30s
4. Session A: BEGIN; SELECT * FROM t WHERE id=1 FOR SHARE;
   (dirties heap buffer)
5. CHECKPOINT to flush the dirty buffer (*)
6. Enable sleep injection (touch /tmp/mxact_sleep_enabled)
7. Session B: SELECT * FROM t WHERE id=1 FOR SHARE;
   -> creates multixact N, GetNewMultiXactId advances nextMXact
     to N+1, then sleeps 60s
8. During sleep: CHECKPOINT on primary
   -> completes instantly (buffer is clean)
   -> captures nextMulti=N+1 in checkpoint record
9. Wait for standby to replay checkpoint and do restartpoint
10. Crash standby (pg_ctl -m immediate stop)
11. Wait for sleep to end -> CREATE_ID:N written to WAL
12. Restart standby -> FATAL
    
The CHECKPOINT in step 5 is essential. Without it, the checkpoint
in step 8 blocks because heap_lock_tuple holds the buffer content
lock in EXCLUSIVE mode across the entire MultiXactIdCreateFromMembers
call (including the sleep). Session A's FOR SHARE dirtied the heap
buffer, so BufferSync needs LW_SHARED on the content lock and blocks.
The intermediate CHECKPOINT flushes the dirty buffer before Session B
acquires the content lock.
    
Result:
    
> DEBUG:  next MultiXactId: 114688; next MultiXactOffset: 688233
> FATAL:  could not access status of transaction 114688
> DETAIL:  Could not read from file "pg_multixact/offsets/0001"
>          at offset 196608: read too few bytes.
> CONTEXT:  WAL redo at 0/40426E0 for MultiXact/CREATE_ID: 114687
>           offset 688231 nmembers 2: 116735 (sh) 116736 (sh)
    
== Fix ideas ==
    
I have two ideas for fixing this. Two independent patches are
attached.
    
Idea 1 (0001-Fix-multixact-compat-logic-via-unconditional-zero.patch):
    
Remove the "latest_page_number == pageno" condition entirely. During
recovery, whenever we cross a page boundary in RecordNewMultiXact(),
unconditionally ensure the next page is initialized via
SimpleLruZeroPage().
    
Pro: Minimal change, easy to review, impossible to miss any edge
case since we always initialize. Safe because SimpleLruZeroPage is
idempotent -- at this point the next page only contains zeros
(CREATE_ID records for that page haven't been replayed yet), so
re-zeroing loses nothing.
    
Con: When replaying WAL from a new-version primary (where
ZERO_OFF_PAGE is emitted BEFORE CREATE_ID), the page has already
been initialized by the ZERO_OFF_PAGE record. This patch will
redundantly zero+write it again. The cost is one extra
SimpleLruZeroPage + SimpleLruWritePage per 2048 multixacts during
recovery, which is negligible in practice, but not zero.
    
Idea 2 (0002-Fix-multixact-compat-logic-via-SLRU-buffer-check.patch):
    
Also remove the "latest_page_number == pageno" condition, but add a
two-tier check: first, if latest_page_number == pageno (fast path),
we know the next page needs initialization. Otherwise, scan the SLRU
buffer slots to check if the next page already exists (e.g. from a
prior ZERO_OFF_PAGE replay). Only initialize if the page is not
found in the buffer.
    
Pro: Avoids the redundant zero+write when replaying new-version WAL
where ZERO_OFF_PAGE has already been replayed before CREATE_ID. The
SLRU buffer scan is O(n_buffers) which is cheap (default 8 buffers).
    
Con: More complex than Idea 1. The SLRU buffer check is a heuristic
-- if the page was written out and evicted from the buffer pool
before CREATE_ID replay, we'd zero it again (still safe, just
redundant). Also introduces a dependency on SLRU buffer pool
internals.
    
I'd appreciate it if someone could review whether my analysis of the
bug is correct. If it is, I'd be happy to hear any thoughts on the
patches or better approaches to fix this.
    
Regards,
Duan



Attachments:

  [application/octet-stream] sleep-injection-for-multixact-compat-logic-bug-repro.patch (1.9K, 2-sleep-injection-for-multixact-compat-logic-bug-repro.patch)
  download | inline diff:
From 9caa8fc8a10ad65fe0f1fb5ec9a6c3e42668d9f6 Mon Sep 17 00:00:00 2001
From: "duankunren.dkr" <[email protected]>
Date: Thu, 19 Mar 2026 23:19:19 +0800
Subject: [PATCH] Sleep injection for multixact compat logic bug reproduction

Inject a 60s sleep in MultiXactIdCreateFromMembers() between
GetNewMultiXactId() and XLogInsert(CREATE_ID), triggered when
the allocated multi is the last entry on an offset page (entry 2047).

This creates a deterministic window where CHECKPOINT can capture
nextMulti=N+1 while CREATE_ID:N has not yet been written to WAL,
reproducing the WAL disorder that triggers the compat logic bug
in commit 8ba61bc0638.

Apply to REL_16_11 (or any unpatched minor version) to build the
primary for the reproduction test.
---
 src/backend/access/transam/multixact.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3a2d7055c42..cb3f8f207ae 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -822,6 +822,21 @@ MultiXactIdCreateFromMembers(int nmembers, MultiXactMember *members)
 	 */
 	multi = GetNewMultiXactId(nmembers, &offset);
 
+
+	/*
+	 * BUG REPRODUCTION: Sleep 60s before writing CREATE_ID WAL when
+	 * this multi is the last entry on its offset page (entry 2047).
+	 * Only activates when /tmp/mxact_sleep_enabled exists, so the
+	 * batch phase can advance multixacts without hitting the sleep.
+	 */
+	if (MultiXactIdToOffsetEntry(multi) == MULTIXACT_OFFSETS_PER_PAGE - 1 &&
+		access("/tmp/mxact_sleep_enabled", F_OK) == 0)
+	{
+		elog(LOG, "MXACT_DELAY: multi=%u is last entry on page %d, sleeping 60s",
+			 multi, MultiXactIdToOffsetPage(multi));
+		pg_usleep(60000000L);
+		elog(LOG, "MXACT_DELAY: multi=%u woke up, writing CREATE_ID now", multi);
+	}
 	/* Make an XLOG entry describing the new MXID. */
 	xlrec.mid = multi;
 	xlrec.moff = offset;
-- 
2.32.0.3.g01195cf9f



  [application/octet-stream] test_multixact_compat.sh (9.7K, 3-test_multixact_compat.sh)
  download

  [application/octet-stream] 0001-Fix-multixact-compat-logic-via-unconditional-zero.patch (1.8K, 4-0001-Fix-multixact-compat-logic-via-unconditional-zero.patch)
  download | inline diff:
From 2ea10ae333022b777904e3e64da7f62485a44c90 Mon Sep 17 00:00:00 2001
From: "duankunren.dkr" <[email protected]>
Date: Thu, 19 Mar 2026 22:07:51 +0800
Subject: [PATCH] Fix multixact compat logic via unconditional zero

---
 src/backend/access/transam/multixact.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 26b8d4e1230..a9a64d6c9c4 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -897,10 +897,21 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	 * multixid was assigned.  If we're replaying WAL that was generated by
 	 * such a version, the next page might not be initialized yet.  Initialize
 	 * it now.
+	 *
+	 * The previous condition checked latest_page_number == pageno, but that
+	 * fails after a crash-restart: StartupMultiXact() sets
+	 * latest_page_number to page(checkPoint.nextMulti), which can be
+	 * next_pageno or even higher when the checkpoint captured an advanced
+	 * nextMXact.  In that case, the == check doesn't match and we skip
+	 * initialization, causing SimpleLruReadPage(next_pageno) to fail with
+	 * "read too few bytes" because the page doesn't exist on disk.
+	 *
+	 * Use an unconditional check instead: during recovery, whenever we cross
+	 * a page boundary, always ensure the next page is initialized.
+	 * SimpleLruZeroPage is idempotent, and we use pre_initialized_offsets_page
+	 * to skip the subsequent ZERO_OFF_PAGE replay, so this is safe.
 	 */
-	if (InRecovery &&
-		next_pageno != pageno &&
-		MultiXactOffsetCtl->shared->latest_page_number == pageno)
+	if (InRecovery && next_pageno != pageno)
 	{
 		elog(DEBUG1, "next offsets page is not initialized, initializing it now");
 
-- 
2.32.0.3.g01195cf9f



  [application/octet-stream] 0002-Fix-multixact-compat-logic-via-SLRU-buffer-check.patch (3.6K, 5-0002-Fix-multixact-compat-logic-via-SLRU-buffer-check.patch)
  download | inline diff:
From 8025d208a5e7bc8a174445849a730e2a2ff7b172 Mon Sep 17 00:00:00 2001
From: "duankunren.dkr" <[email protected]>
Date: Thu, 19 Mar 2026 22:08:49 +0800
Subject: [PATCH] Fix multixact compat logic via SLRU buffer check

---
 src/backend/access/transam/multixact.c | 73 ++++++++++++++++++++------
 1 file changed, 58 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 26b8d4e1230..5e18f988e6d 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -897,25 +897,68 @@ RecordNewMultiXact(MultiXactId multi, MultiXactOffset offset,
 	 * multixid was assigned.  If we're replaying WAL that was generated by
 	 * such a version, the next page might not be initialized yet.  Initialize
 	 * it now.
-	 */
-	if (InRecovery &&
-		next_pageno != pageno &&
-		MultiXactOffsetCtl->shared->latest_page_number == pageno)
+	 *
+	 * The previous condition checked latest_page_number == pageno, but that
+	 * fails after a crash-restart: StartupMultiXact() sets
+	 * latest_page_number to page(checkPoint.nextMulti), which can be
+	 * next_pageno or even higher when the checkpoint captured an advanced
+	 * nextMXact.  In that case, the == check doesn't match and we skip
+	 * initialization, causing SimpleLruReadPage(next_pageno) to fail with
+	 * "read too few bytes" because the page doesn't exist on disk.
+	 *
+	 * When latest_page_number == pageno, we know for sure the next page has
+	 * not been initialized yet.  Otherwise (e.g. after crash-restart),
+	 * latest_page_number is unreliable, so fall back to checking whether the
+	 * next page exists in the SLRU buffer pool.  SimpleLruZeroPage is
+	 * idempotent, and we use pre_initialized_offsets_page to skip the
+	 * subsequent ZERO_OFF_PAGE replay, so this is safe.
+	 */
+	if (InRecovery && next_pageno != pageno)
 	{
-		elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+		bool		need_init;
 
-		/* Create and zero the page */
-		slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+		if (MultiXactOffsetCtl->shared->latest_page_number == pageno)
+		{
+			/* Fast path: latest_page_number confirms next page not initialized */
+			need_init = true;
+		}
+		else
+		{
+			/*
+			 * latest_page_number != pageno, but we may still need to
+			 * initialize.  Check SLRU buffer pool to decide.
+			 */
+			SlruShared	shared = MultiXactOffsetCtl->shared;
 
-		/* Make sure it's written out */
-		SimpleLruWritePage(MultiXactOffsetCtl, slotno);
-		Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+			need_init = true;
+			for (slotno = 0; slotno < shared->num_slots; slotno++)
+			{
+				if (shared->page_number[slotno] == next_pageno &&
+					shared->page_status[slotno] != SLRU_PAGE_EMPTY)
+				{
+					need_init = false;
+					break;
+				}
+			}
+		}
 
-		/*
-		 * Remember that we initialized the page, so that we don't zero it
-		 * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
-		 */
-		pre_initialized_offsets_page = next_pageno;
+		if (need_init)
+		{
+			elog(DEBUG1, "next offsets page is not initialized, initializing it now");
+
+			/* Create and zero the page */
+			slotno = SimpleLruZeroPage(MultiXactOffsetCtl, next_pageno);
+
+			/* Make sure it's written out */
+			SimpleLruWritePage(MultiXactOffsetCtl, slotno);
+			Assert(!MultiXactOffsetCtl->shared->page_dirty[slotno]);
+
+			/*
+			 * Remember that we initialized the page, so that we don't zero it
+			 * again at the XLOG_MULTIXACT_ZERO_OFF_PAGE record.
+			 */
+			pre_initialized_offsets_page = next_pageno;
+		}
 	}
 
 	/*
-- 
2.32.0.3.g01195cf9f



^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
  2026-03-19 18:02  Bug in MultiXact replay compat logic for older minor version after crash-recovery =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= <[email protected]>
@ 2026-03-20 11:55 ` Andrey Borodin <[email protected]>
  2026-03-20 13:39   ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Andrey Borodin @ 2026-03-20 11:55 UTC (permalink / raw)
  To: Heikki Linnakangas <[email protected]>; +Cc: "段坤仁(刻韧)" <[email protected]>; pgsql-hackers



> On 19 Mar 2026, at 23:11, Heikki Linnakangas <[email protected]> wrote:
> 
> :-(. This is a gift that keeps giving.

Well, maybe, we could leaving that deadlock in place for some time...

> Idea 3:
> 
> I think a better fix is to accept that our tracking is a little imprecise and use SimpleLruDoesPhysicalPageExist() to check if the page exists. I suspect that's too expensive to do on every RecordNewMultiXact() call that crosses a page, but perhaps we could do it once at StartupMultiXact().
> 
> Or perhaps track last-zeroed page separately from latest_page_number, and if we haven't seen any XLOG_MULTIXACT_ZERO_OFF_PAGE records yet after startup, call SimpleLruDoesPhysicalPageExist() to determine if initialization is needed. Attached patch does that.

SimpleLruDoesPhysicalPageExist() does not detect recently zeroed pages via buffers, because it goes directly to FS.
I tried this approach when implementing deadlock fix, it did not work for me.


Best regards, Andrey Borodin.




^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
  2026-03-19 18:02  Bug in MultiXact replay compat logic for older minor version after crash-recovery =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= <[email protected]>
  2026-03-20 11:55 ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
@ 2026-03-20 13:39   ` Andrey Borodin <[email protected]>
  2026-03-20 17:05     ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Andrey Borodin @ 2026-03-20 13:39 UTC (permalink / raw)
  To: Heikki Linnakangas <[email protected]>; +Cc: "段坤仁(刻韧)" <[email protected]>; pgsql-hackers



> On 20 Mar 2026, at 16:19, Heikki Linnakangas <[email protected]> wrote:
> 
> Hmm, after startup, before we have zeroed any pages, it still works though. So I think my patch works, but it means that tracking the latest page we have zeroed is not merely an optimization to avoid excessive SimpleLruDoesPhysicalPageExist() calls, it's needed for correctness. Need to adjust the comments for that.

If we are sure buffers have no this page we can detect it via FS.
Otherwise... nothing bad can happen, actually. We might get false positive and zero the page once more.

If we got init_needed==false, maybe cache it for this page and set last_initialized_offsets_page = pageno?
Or, perhaps, XLOG_MULTIXACT_ZERO_OFF_PAGE will do it for us anyway, but a bit later.

Best regards, Andrey Borodin.




^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
  2026-03-19 18:02  Bug in MultiXact replay compat logic for older minor version after crash-recovery =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= <[email protected]>
  2026-03-20 11:55 ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
  2026-03-20 13:39   ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
@ 2026-03-20 17:05     ` Andrey Borodin <[email protected]>
  2026-03-22 13:09       `  回复：Bug in MultiXact replay compat logic for older minor version after crash-recovery =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= <[email protected]>
  2026-03-22 14:15       ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Kirill Reshke <[email protected]>
  0 siblings, 2 replies; 7+ messages in thread

From: Andrey Borodin @ 2026-03-20 17:05 UTC (permalink / raw)
  To: Heikki Linnakangas <[email protected]>; +Cc: "段坤仁(刻韧)" <[email protected]>; pgsql-hackers



> On 20 Mar 2026, at 18:14, Heikki Linnakangas <[email protected]> wrote:
> 
> Zeroing the page again is dangerous because the CREATE_ID records can be out of order. The page might already contain some later multixids, and zeroing will overwrite them.

I see only cases when it's not a problem: we zeroed page, did not flush it, thus did not extend the file, crashed, tested FS, zeroed page once more, overwrote again by replaying WAL, no big deal.
We should never zero a page with offsets, that will not be replayed by WAL.

If the page was persisted, even partially, we will read it from disk without zeroing out.


Best regards, Andrey Borodin.




^ permalink  raw  reply  [nested|flat] 7+ messages in thread

*  回复：Bug in MultiXact replay compat logic for older minor version after crash-recovery
  2026-03-19 18:02  Bug in MultiXact replay compat logic for older minor version after crash-recovery =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= <[email protected]>
  2026-03-20 11:55 ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
  2026-03-20 13:39   ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
  2026-03-20 17:05     ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
@ 2026-03-22 13:09       ` =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= <[email protected]>
  1 sibling, 0 replies; 7+ messages in thread

From: =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= @ 2026-03-22 13:09 UTC (permalink / raw)
  To: Heikki Linnakangas <[email protected]>; +Cc: pgsql-hackers; x4mmm <[email protected]>

Thanks for the v2 patch.
    
On 20/03/2026 16:19, Heikki Linnakangas wrote:
> it means that tracking the latest page we have zeroed is not merely
> an optimization to avoid excessive SimpleLruDoesPhysicalPageExist()
> calls, it's needed for correctness.
    
Agreed.
    
On 20/03/2026 18:14, Heikki Linnakangas wrote:
> I also added another safety measure: before calling
> SimpleLruDoesPhysicalPageExist(), flush all the SLRU buffers.
    
This is more robust than scanning the SLRU buffers first and only
calling SimpleLruDoesPhysicalPageExist() on a miss, which would
rely on the SLRU eviction invariant.
    
I walked through the scenarios I could think of. Let N be the last
multixid on offset page P, so N+1 falls on page P+1.
    
(a) Old-version WAL (CREATE_ID:N before ZERO_OFF_PAGE:P+1):
    last_initialized_offsets_page = P from earlier ZERO_OFF_PAGE.
    init_needed = (P == P) = true -> init P+1. Correct.
    Later ZERO_OFF_PAGE:P+1 is skipped via pre_initialized_offsets_page.
    
(b) Crash-restart, page P+1 not on disk (the original bug):
    last_initialized_offsets_page = -1, fallback path fires.
    SimpleLruDoesPhysicalPageExist(P+1) = false -> init. Correct.
    
(c) Crash-restart, page P+1 already on disk:
    Same fallback, SimpleLruDoesPhysicalPageExist(P+1) = true -> skip.
    last_initialized_offsets_page stays -1 until the next
    ZERO_OFF_PAGE switches back to the fast path.
    
(d) Out-of-order CREATE_IDs (ZERO_PAGE:P+1 -> CREATE_ID:N+1 ->
    CREATE_ID:N+2 -> CREATE_ID:N):
    N+1 and N+2 don't cross a page boundary, compat logic not entered.
    CREATE_ID:N: init_needed = (P+1 == P) = false -> skip.
    Page P+1 is not re-zeroed, data from N+1/N+2 preserved.
    
(e) Consecutive page crossings (N on page P, later M on page P+1):
    After init of P+1: last_initialized_offsets_page = P+1.
    CREATE_ID:M: init_needed = (P+1 == P+1) = true -> init P+2.
    Tracking advances monotonically across page boundaries.
    
The logic looks correct to me in all the cases above.
    
Regards,
Duan


^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
  2026-03-19 18:02  Bug in MultiXact replay compat logic for older minor version after crash-recovery =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= <[email protected]>
  2026-03-20 11:55 ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
  2026-03-20 13:39   ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
  2026-03-20 17:05     ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
@ 2026-03-22 14:15       ` Kirill Reshke <[email protected]>
  2026-03-22 14:22         ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Kirill Reshke <[email protected]>
  1 sibling, 1 reply; 7+ messages in thread

From: Kirill Reshke @ 2026-03-22 14:15 UTC (permalink / raw)
  To: Heikki Linnakangas <[email protected]>; +Cc: Andrey Borodin <[email protected]>; 段坤仁(刻韧) <[email protected]>; pgsql-hackers

Hi!
I can see that the back-branches commit was included into master [0].
I think this is good.

On Sun, 22 Mar 2026 at 16:10, Heikki Linnakangas <[email protected]> wrote:
>
> On 20/03/2026 19:05, Andrey Borodin wrote:
> >> On 20 Mar 2026, at 18:14, Heikki Linnakangas <[email protected]> wrote:
> >>
> >> Zeroing the page again is dangerous because the CREATE_ID records can be out of order. The page might already contain some later multixids, and zeroing will overwrite them.
> >
> > I see only cases when it's not a problem: we zeroed page, did not flush it, thus did not extend the file, crashed, tested FS, zeroed page once more, overwrote again by replaying WAL, no big deal.
> > We should never zero a page with offsets, that will not be replayed by WAL.
>
> I think we're in agreement, but I want to verify because this is
> important to get right. I was replying to this:
>
> > If we are sure buffers have no this page we can detect it via FS.
> > Otherwise... nothing bad can happen, actually. We might get false positive and zero the page once more.
>
> My point is that if we rely on SimpleLruDoesPhysicalPageExist(), and it
> ever returns false even though we had already initialized the page, you
> can lose data. It's *not* ok to zero a page again that was zeroed
> earlier already, because we might have already written some real data on it.

+1. Even if we manage to compose a "fix" that zeroes a page more than
once, this "fix" will be non-future-profing and we will corrupt the
database if anything goes even slightly wrong.

> Let's consider this wal stream, generated with old minor version:
>
> ZERO_PAGE:2048 -> CREATE_ID:2048 -> CREATE_ID:2049 -> CREATE_ID:2047
>
> 2048 is the first multixid on the page. When WAL replay gets to the
> CREATE_ID:2047 record, it will enter the backwards-compatibility
> codepath and needs to determine if the page containing the next mxid
> (2048) already exists.
>
> In this WAL sequence, the page already exist because the ZERO_PAGE
> record was replayed earlier. But if we just call
> SimpleLruDoesPhysicalPageExist(), it will return 'false' because the
> page was not flushed to disk yet. If we believe that and zero the page
> again, we will lose data (the offset for mxid 2049).
>
> The opposite cannot happen: if SimpleLruDoesPhysicalPageExist() returns
> true, then it does really exist.
>
> So indeed we can only trust SimpleLruDoesPhysicalPageExist() if we are
> sure that the page is not sitting in the buffers.

+1

> Attached is a new version. I updated the comment to explain that.
>
> I also added another safety measure: before calling
> SimpleLruDoesPhysicalPageExist(), flush all the SLRU buffers. That way,
> SimpleLruDoesPhysicalPageExist() should definitely return the correct
> answer. That shouldn't be necessary because the check with
> last_initialized_offsets_page should cover all the cases where a page
> that extended the file is sitting in the buffers, but better safe than
> sorry.
>
> - Heikki

I played with v2 and was unable to fool it into corrupting db. So v2
looks good to me.


[0] https://git.postgresql.org/cgit/postgresql.git/commit/?id=516310ed4dba89bd300242df0d56b4782f33ed4d

-- 
Best regards,
Kirill Reshke





^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
  2026-03-19 18:02  Bug in MultiXact replay compat logic for older minor version after crash-recovery =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= <[email protected]>
  2026-03-20 11:55 ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
  2026-03-20 13:39   ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
  2026-03-20 17:05     ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Andrey Borodin <[email protected]>
  2026-03-22 14:15       ` Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery Kirill Reshke <[email protected]>
@ 2026-03-22 14:22         ` Kirill Reshke <[email protected]>
  0 siblings, 0 replies; 7+ messages in thread

From: Kirill Reshke @ 2026-03-22 14:22 UTC (permalink / raw)
  To: Heikki Linnakangas <[email protected]>; +Cc: Andrey Borodin <[email protected]>; 段坤仁(刻韧) <[email protected]>; pgsql-hackers

On Sun, 22 Mar 2026 at 19:15, Kirill Reshke <[email protected]> wrote:
>
> Hi!
> I can see that the back-branches commit was included into master [0].
> I think this is good.
>
> On Sun, 22 Mar 2026 at 16:10, Heikki Linnakangas <[email protected]> wrote:
> >
> > On 20/03/2026 19:05, Andrey Borodin wrote:
> > >> On 20 Mar 2026, at 18:14, Heikki Linnakangas <[email protected]> wrote:
> > >>
> > >> Zeroing the page again is dangerous because the CREATE_ID records can be out of order. The page might already contain some later multixids, and zeroing will overwrite them.
> > >
> > > I see only cases when it's not a problem: we zeroed page, did not flush it, thus did not extend the file, crashed, tested FS, zeroed page once more, overwrote again by replaying WAL, no big deal.
> > > We should never zero a page with offsets, that will not be replayed by WAL.
> >
> > I think we're in agreement, but I want to verify because this is
> > important to get right. I was replying to this:
> >
> > > If we are sure buffers have no this page we can detect it via FS.
> > > Otherwise... nothing bad can happen, actually. We might get false positive and zero the page once more.
> >
> > My point is that if we rely on SimpleLruDoesPhysicalPageExist(), and it
> > ever returns false even though we had already initialized the page, you
> > can lose data. It's *not* ok to zero a page again that was zeroed
> > earlier already, because we might have already written some real data on it.
>
> +1. Even if we manage to compose a "fix" that zeroes a page more than
> once, this "fix" will be non-future-profing and we will corrupt the
> database if anything goes even slightly wrong.
>
> > Let's consider this wal stream, generated with old minor version:
> >
> > ZERO_PAGE:2048 -> CREATE_ID:2048 -> CREATE_ID:2049 -> CREATE_ID:2047
> >
> > 2048 is the first multixid on the page. When WAL replay gets to the
> > CREATE_ID:2047 record, it will enter the backwards-compatibility
> > codepath and needs to determine if the page containing the next mxid
> > (2048) already exists.
> >
> > In this WAL sequence, the page already exist because the ZERO_PAGE
> > record was replayed earlier. But if we just call
> > SimpleLruDoesPhysicalPageExist(), it will return 'false' because the
> > page was not flushed to disk yet. If we believe that and zero the page
> > again, we will lose data (the offset for mxid 2049).
> >
> > The opposite cannot happen: if SimpleLruDoesPhysicalPageExist() returns
> > true, then it does really exist.
> >
> > So indeed we can only trust SimpleLruDoesPhysicalPageExist() if we are
> > sure that the page is not sitting in the buffers.
>
> +1
>
> > Attached is a new version. I updated the comment to explain that.
> >
> > I also added another safety measure: before calling
> > SimpleLruDoesPhysicalPageExist(), flush all the SLRU buffers. That way,
> > SimpleLruDoesPhysicalPageExist() should definitely return the correct
> > answer. That shouldn't be necessary because the check with
> > last_initialized_offsets_page should cover all the cases where a page
> > that extended the file is sitting in the buffers, but better safe than
> > sorry.
> >
> > - Heikki
>
> I played with v2 and was unable to fool it into corrupting db. So v2
> looks good to me.
>
>
> [0] https://git.postgresql.org/cgit/postgresql.git/commit/?id=516310ed4dba89bd300242df0d56b4782f33ed4d
>
> --
> Best regards,
> Kirill Reshke

Also, in commit message:

> the backwards compatibility logic to tolerate WAL generated by older minor versions

Let's define older as pre-789d65364c to be exact?

-- 
Best regards,
Kirill Reshke





^ permalink  raw  reply  [nested|flat] 7+ messages in thread

end of thread, other threads:[~2026-03-22 14:22 UTC | newest]

Thread overview: 7+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-03-19 18:02  Bug in MultiXact replay compat logic for older minor version after crash-recovery =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= <[email protected]>
2026-03-20 11:55 ` Andrey Borodin <[email protected]>
2026-03-20 13:39   ` Andrey Borodin <[email protected]>
2026-03-20 17:05     ` Andrey Borodin <[email protected]>
2026-03-22 13:09       `  回复：Bug in MultiXact replay compat logic for older minor version after crash-recovery =?UTF-8?B?5q615Z2k5LuBKOWIu+mfpyk=?= <[email protected]>
2026-03-22 14:15       ` Kirill Reshke <[email protected]>
2026-03-22 14:22         ` Kirill Reshke <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox