From: Olegs Germanovs <[email protected]>
To: [email protected]
Subject: 16.14 regression: startup process self-deadlocks during multixact WAL replay in RecordNewMultiXact
Date: Wed, 27 May 2026 15:33:58 +0300
Message-ID: <CA+yEoBxD18Z3VxOwLfk+959giwrt=6Jo5HujnvyZZN2Y63TWBg@mail.gmail.com>
Hi!
*Bug summary:* After upgrading from 16.13 to 16.14, archive recovery of a
base backup hangs indefinitely during multixact WAL replay. The startup process
blocks acquiring MultiXactOffsetSLRULock in EXCLUSIVE mode while
already holding one LWLock. The lock has shared_count=1 with no
exclusive holder, no other live process appears to hold it, and the
same recovery completes successfully on 16.13.
*Environment*:
PostgreSQL: 16.14 (pgdg)
OS: Ubuntu 22.04, kernel 6.8.0-1016-aws
Arch: aarch64 (AWS Graviton)
Backup tool: pgBackRest 2.53.1 (backup) → 2.58.0 (restore)
Source: x86_64 cluster, PostgreSQL 16.6 (Ubuntu 16.6-1.pgdg22.04+1)
*Scenario*: archive recovery from a pgBackRest backup. The end-of-backup
record has not been replayed yet. Replay stalls inside WAL segment
0000000100006BEB00000031. I verified that the availability of the next
segment is irrelevant: the startup process is not waiting for WAL. It
already has a record in hand and stays on the same stack frame across
many gdb captures taken minutes apart.
*Stack of startup process* (PID 395003):
#7 LWLockAcquire (lock=0xfdbf33f2f000, mode=LW_EXCLUSIVE)
    at storage/lmgr/lwlock.c:1314
#8 SimpleLruWriteAll (ctl=MultiXactOffsetCtlData, ...)
    at access/transam/slru.c:1174
#9 RecordNewMultiXact (multi=981215231, offset=2282786137,
    nmembers=2, members=...)
    at access/transam/multixact.c:944
#10 multixact_redo (record=...)
    at access/transam/multixact.c:3464
#11 ApplyWalRecord -> PerformWalRecovery -> StartupXLOG
LWLock state at 0xfdbf33f2f000 (stable across 5+ snapshots):
tranche = 14 (MultiXactOffsetSLRU)
state.value = 0x61000000
= LW_FLAG_RELEASE_OK | LW_FLAG_HAS_WAITERS | shared_count=1
waiters = {head=524, tail=524} (one waiter)
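The state word can be cross-checked with a small decoder. This is only a sketch: it assumes the flag layout found in PostgreSQL 16's src/backend/storage/lmgr/lwlock.c (LW_FLAG_HAS_WAITERS = 1 << 30, LW_FLAG_RELEASE_OK = 1 << 29, LW_FLAG_LOCKED = 1 << 28, LW_VAL_EXCLUSIVE = 1 << 24, shared-holder count in the low 24 bits). The function name decode_lwlock_state is hypothetical, and a patched or differently versioned build may use other values.

```python
# Bit layout assumed from PostgreSQL 16's lwlock.c; verify against the
# exact source tree of the binary being debugged.
LW_FLAG_HAS_WAITERS = 1 << 30
LW_FLAG_RELEASE_OK = 1 << 29
LW_FLAG_LOCKED = 1 << 28        # protects the wait list, not a lock holder
LW_VAL_EXCLUSIVE = 1 << 24
LW_SHARED_MASK = (1 << 24) - 1  # low 24 bits count shared holders


def decode_lwlock_state(value: int) -> dict:
    """Split an LWLock state word into its flags and holder counts."""
    return {
        "has_waiters": bool(value & LW_FLAG_HAS_WAITERS),
        "release_ok": bool(value & LW_FLAG_RELEASE_OK),
        "waitlist_locked": bool(value & LW_FLAG_LOCKED),
        "exclusive": bool(value & LW_VAL_EXCLUSIVE),
        "shared_count": value & LW_SHARED_MASK,
    }


if __name__ == "__main__":
    # Value captured from the lock at 0xfdbf33f2f000.
    for name, val in decode_lwlock_state(0x61000000).items():
        print(f"{name:16} {val}")
```

Under these assumed definitions, 0x61000000 has bit 24 set and the low 24 bits clear, which would read as one exclusive holder and a shared count of zero rather than shared_count=1. The decoding is therefore worth re-checking against the running build.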
Critical evidence — startup process holds exactly one LWLock:
num_held_lwlocks = 1
*Combined with*:
- No exclusive holder of the lock
- shared_count = 1
- Checkpointer (PID 395001) and bgwriter (PID 395002) sitting idle
in CheckpointerMain/BackgroundWriterMain WaitLatch loops, with no
visible work pending
- Same gdb stack frame frozen across captures separated by minutes
- Zero CPU, zero I/O, ctx_switches not advancing
→ The startup process is holding MultiXactOffsetSLRULock in SHARED mode
(acquired earlier in the RecordNewMultiXact path) and now requesting
it in EXCLUSIVE mode via SimpleLruWriteAll. Since LWLocks cannot be
upgraded shared→exclusive, this is a self-deadlock.
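The re-acquisition problem can be modelled in a few lines. This is a simplified sketch of the check made on each lock attempt, assuming the PostgreSQL 16 constants LW_VAL_EXCLUSIVE = 1 << 24, LW_VAL_SHARED = 1 and LW_LOCK_MASK = (1 << 25) - 1. try_acquire is a hypothetical helper, not a PostgreSQL function, and it ignores wait queues and atomics. It illustrates that the state word records no owner, so a request from the process that already holds the lock is refused like any other.

```python
LW_VAL_EXCLUSIVE = 1 << 24
LW_VAL_SHARED = 1
LW_LOCK_MASK = (1 << 25) - 1    # exclusive bit plus shared-holder count


def try_acquire(state: int, mode: str) -> tuple[bool, int]:
    """Model one lock attempt; return (acquired, new_state)."""
    if mode == "exclusive":
        # Exclusive needs the lock completely free: no holders of any kind.
        if state & LW_LOCK_MASK:
            return False, state
        return True, state + LW_VAL_EXCLUSIVE
    if mode == "shared":
        # Shared only conflicts with an exclusive holder.
        if state & LW_VAL_EXCLUSIVE:
            return False, state
        return True, state + LW_VAL_SHARED
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    # Same process takes the lock shared, then asks for exclusive.
    ok, state = try_acquire(0, "shared")
    upgraded, state = try_acquire(state, "exclusive")
    print(ok, upgraded)          # the second attempt is refused

    # Holding it exclusive and asking again is refused the same way.
    ok, state = try_acquire(0, "exclusive")
    again, state = try_acquire(state, "exclusive")
    print(ok, again)
```

In the real implementation a refused attempt puts the caller on the wait list. If the only holder is the waiting process itself, nothing is left to release the lock and the wait never ends.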
Auxiliary process stacks (for completeness):
Checkpointer (395001):
epoll_pwait → WaitLatch (timeout=15000)
→ CheckpointerMain (checkpointer.c:535)
Bgwriter (395002):
epoll_pwait → WaitLatch (timeout=10000)
→ BackgroundWriterMain (bgwriter.c:336)
Both are idle in their main loops; held_lwlocks was <optimized out> in
gdb but neither process has any plausible reason to hold the SLRU lock.
pg_controldata excerpt:
Database cluster state: in archive recovery
Backup start location: 6BEB/27000378
Minimum recovery ending location: 6BEB/31DCEBE0
Backup end location: 0/0
End-of-backup record required: yes
NextMultiXactId: 981215122 (replay reached 981215231)
NextMultiOffset: 2282785918 (replay reached 2282786137)
oldestMultiXid: 964544775
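For anyone mapping these counters onto the offsets SLRU, the page arithmetic can be sketched as follows. It assumes the default BLCKSZ of 8192 and 4-byte MultiXactOffset entries, which gives 2048 entries per page. offsets_slru_position is a hypothetical helper name, and the input values are the ones from the stack trace and the pg_controldata excerpt.

```python
BLCKSZ = 8192                      # assumed default block size
OFFSET_ENTRY_SIZE = 4              # sizeof(MultiXactOffset), a uint32
OFFSETS_PER_PAGE = BLCKSZ // OFFSET_ENTRY_SIZE   # 2048 under these assumptions


def offsets_slru_position(multi: int) -> tuple[int, int]:
    """Return (page number, entry index within the page) for a multixact ID."""
    return multi // OFFSETS_PER_PAGE, multi % OFFSETS_PER_PAGE


if __name__ == "__main__":
    next_multi = 981215122         # NextMultiXactId from pg_controldata
    stalled_multi = 981215231      # multi argument in the RecordNewMultiXact frame

    print(stalled_multi - next_multi)              # distance between the two IDs
    print(offsets_slru_position(next_multi))
    print(offsets_slru_position(stalled_multi))
```

With these assumptions the stalled multixact 981215231 maps to entry 2047 of page 479108, the last slot on its page, so the following multixact would start a new offsets page. Whether that boundary is related to the hang is not established here.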
Reproduction:
- Restore basebackup + WAL via pgBackRest archive-get on aarch64
- Start cluster on 16.14: hangs as described, every time, same WAL
position
- Stop cluster, downgrade to 16.13 (same pgdg apt source), start:
recovery completes successfully on identical PGDATA
- No data or environment change between the two attempts
I'm happy to apply test patches or capture additional diagnostics.
Best wishes
Olegs Germanovs