Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Kirill Reshke <[email protected]>
To: Heikki Linnakangas <[email protected]>
Cc: Andrey Borodin <[email protected]>
Cc: 段坤仁(刻韧) <[email protected]>
Cc: pgsql-hackers <[email protected]>
Subject: Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
Date: Sun, 22 Mar 2026 19:22:30 +0500
Message-ID: <CALdSSPh2tfxcZOgdeGdjBwdnUBEY+8iG7unuJMEoruWToUMYwA@mail.gmail.com> (raw)
In-Reply-To: <CALdSSPghO2YGyy4BN_rRxdG7dBSf1-pmi2D4NARk14+HyZoHhg@mail.gmail.com>
References: <c4ef1737-8cba-458e-b6fd-4e2d6011e985.duankunren.dkr@alibaba-inc.com>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<CALdSSPghO2YGyy4BN_rRxdG7dBSf1-pmi2D4NARk14+HyZoHhg@mail.gmail.com>

On Sun, 22 Mar 2026 at 19:15, Kirill Reshke <[email protected]> wrote:
>
> Hi!
> I can see that the back-branches commit was included into master [0].
> I think this is good.
>
> On Sun, 22 Mar 2026 at 16:10, Heikki Linnakangas <[email protected]> wrote:
> >
> > On 20/03/2026 19:05, Andrey Borodin wrote:
> > >> On 20 Mar 2026, at 18:14, Heikki Linnakangas <[email protected]> wrote:
> > >>
> > >> Zeroing the page again is dangerous because the CREATE_ID records can be out of order. The page might already contain some later multixids, and zeroing will overwrite them.
> > >
> > > I see only cases when it's not a problem: we zeroed page, did not flush it, thus did not extend the file, crashed, tested FS, zeroed page once more, overwrote again by replaying WAL, no big deal.
> > > We should never zero a page with offsets, that will not be replayed by WAL.
> >
> > I think we're in agreement, but I want to verify because this is
> > important to get right. I was replying to this:
> >
> > > If we are sure buffers have no this page we can detect it via FS.
> > > Otherwise... nothing bad can happen, actually. We might get false positive and zero the page once more.
> >
> > My point is that if we rely on SimpleLruDoesPhysicalPageExist(), and it
> > ever returns false even though we had already initialized the page, you
> > can lose data. It's *not* ok to zero a page again that was zeroed
> > earlier already, because we might have already written some real data on it.
>
> +1. Even if we manage to compose a "fix" that zeroes a page more than
> once, this "fix" will be non-future-profing and we will corrupt the
> database if anything goes even slightly wrong.
>
> > Let's consider this wal stream, generated with old minor version:
> >
> > ZERO_PAGE:2048 -> CREATE_ID:2048 -> CREATE_ID:2049 -> CREATE_ID:2047
> >
> > 2048 is the first multixid on the page. When WAL replay gets to the
> > CREATE_ID:2047 record, it will enter the backwards-compatibility
> > codepath and needs to determine if the page containing the next mxid
> > (2048) already exists.
> >
> > In this WAL sequence, the page already exist because the ZERO_PAGE
> > record was replayed earlier. But if we just call
> > SimpleLruDoesPhysicalPageExist(), it will return 'false' because the
> > page was not flushed to disk yet. If we believe that and zero the page
> > again, we will lose data (the offset for mxid 2049).
> >
> > The opposite cannot happen: if SimpleLruDoesPhysicalPageExist() returns
> > true, then it does really exist.
> >
> > So indeed we can only trust SimpleLruDoesPhysicalPageExist() if we are
> > sure that the page is not sitting in the buffers.
>
> +1
>
> > Attached is a new version. I updated the comment to explain that.
> >
> > I also added another safety measure: before calling
> > SimpleLruDoesPhysicalPageExist(), flush all the SLRU buffers. That way,
> > SimpleLruDoesPhysicalPageExist() should definitely return the correct
> > answer. That shouldn't be necessary because the check with
> > last_initialized_offsets_page should cover all the cases where a page
> > that extended the file is sitting in the buffers, but better safe than
> > sorry.
> >
> > - Heikki
>
> I played with v2 and was unable to fool it into corrupting db. So v2
> looks good to me.
>
>
> [0] https://git.postgresql.org/cgit/postgresql.git/commit/?id=516310ed4dba89bd300242df0d56b4782f33ed4d
>
> --
> Best regards,
> Kirill Reshke

Also, in commit message:

> the backwards compatibility logic to tolerate WAL generated by older minor versions

Let's define older as pre-789d65364c to be exact?

-- 
Best regards,
Kirill Reshke

view thread (7+ messages)

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Bug in MultiXact replay compat logic for older minor version after crash-recovery
  In-Reply-To: <CALdSSPh2tfxcZOgdeGdjBwdnUBEY+8iG7unuJMEoruWToUMYwA@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox