Tested Andrey's demo.diff on a fresh environment:
- Primary: REL_16_8, Standby: REL_16_14 (--enable-cassert)
- ~2300 MultiXacts crossing the offsets page boundary
- Without patch: startup deadlocks at RecordNewMultiXact(multi=2047)
- With patch: standby replays all WAL and catches up
> On 26 May 2026, at 17:28, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> looks correct
I tested that change as follows.
I set up REL_16_0 as the primary and REL_16_STABLE as the standby, then
generated multixacts in a single session using savepoints:
BEGIN;
SELECT * FROM t WHERE i = 1 FOR NO KEY UPDATE;
-- repeat 2500 times:
SAVEPOINT a; SELECT * FROM t WHERE i = 1 FOR UPDATE; ROLLBACK TO a;
COMMIT;
Each iteration creates a new MultiXactId. 2500 iterations cross the SLRU page
boundary at multixact 2048, with some multis to spare (we can pickle the excess
ones in jars once everything is fixed).
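
For reference, the page arithmetic behind the 2048 boundary can be sketched as
below. This is only an illustration, not code from the patch: it assumes the
default 8 kB block size and a 4-byte MultiXactOffset (as on 16 and earlier),
and that a fresh cluster hands out multis starting at 1, one per iteration.
The helper names are made up for the example.

```python
# Sketch of the multixact offsets SLRU page arithmetic.
# Assumptions (not taken from the patch): BLCKSZ = 8192 and a 4-byte
# MultiXactOffset, which gives 2048 offsets per page.
BLCKSZ = 8192
SIZEOF_MULTIXACT_OFFSET = 4
OFFSETS_PER_PAGE = BLCKSZ // SIZEOF_MULTIXACT_OFFSET


def offsets_page(multi: int) -> int:
    """Offsets page that holds the entry for this MultiXactId."""
    return multi // OFFSETS_PER_PAGE


def offsets_entry(multi: int) -> int:
    """Index of this MultiXactId's entry within its offsets page."""
    return multi % OFFSETS_PER_PAGE


def is_last_on_page(multi: int) -> bool:
    """True when the next MultiXactId lives on the following page."""
    return offsets_entry(multi) == OFFSETS_PER_PAGE - 1


if __name__ == "__main__":
    for multi in (2047, 2048, 2500, 4095, 4096, 5000):
        print(multi, offsets_page(multi), offsets_entry(multi),
              is_last_on_page(multi))
```

Under these assumptions multi 2047 is the last entry on page 0, so the offset
of the next multi falls on page 1; the first 2500 multis end on page 1 and the
second batch ends on page 2.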
Test:
0. Run the workload on REL_16_0 primary (2500 multixacts, crossing page 0->1)
1. Take pg_basebackup
2. Run the workload again (2500 more, crossing page 1->2)
3. Start the standby
I observe:
Without the change, the startup process deadlocks.
With the change, the standby catches up; the DEBUG1 message "next offsets page is not
initialized, initializing it now" confirms that the compat block fires correctly.
I packaged this test into a buildfarm module (TestReplayXversion) [0] that
builds REL_x_0 and runs this check against a REL_x_STABLE build. It reproduces the
deadlock on 14, 15, and 16; 17 and 18 pass. I'm currently struggling to inject the
regress WAL trace into it; that is not working so far. On the bright side, I managed
to get PR number 42 in the buildfarm client repo.
Best regards, Andrey Borodin.
[0] https://github.com/PGBuildFarm/client-code/pull/42