public inbox for [email protected]
* 16.14 regression: startup process self-deadlocks during multixact WAL replay in RecordNewMultiXact
@ 2026-05-27 12:33 Olegs Germanovs <[email protected]>
From: Olegs Germanovs @ 2026-05-27 12:33 UTC
To: [email protected]
Hi!
*Bug summary:* After upgrading from 16.13 to 16.14, archive recovery of a
basebackup hangs indefinitely during multixact WAL replay. The startup process
blocks acquiring MultiXactOffsetSLRULock in EXCLUSIVE mode while
already holding one LWLock. The lock has shared_count=1 with no
exclusive holder, no other live process appears to hold it, and the
same recovery completes successfully on 16.13.
*Environment*:
PostgreSQL: 16.14 (pgdg)
OS: Ubuntu 22.04, kernel 6.8.0-1016-aws
Arch: aarch64 (AWS Graviton)
Backup tool: pgBackRest 2.53.1 (backup) → 2.58.0 (restore)
Source: x86_64 cluster, PostgreSQL 16.6 (Ubuntu 16.6-1.pgdg22.04+1)
*Scenario*: archive recovery from pgBackRest. End-of-backup record not yet
seen. Stalls during replay of WAL segment 0000000100006BEB00000031.
I verified that recovery is not stalled waiting for the next segment: startup
already has a record in hand (it is frozen on the same frame across many gdb
captures separated by minutes).
*Stack of startup process* (PID 395003):
#7  LWLockAcquire (lock=0xfdbf33f2f000, mode=LW_EXCLUSIVE)
      at storage/lmgr/lwlock.c:1314
#8  SimpleLruWriteAll (ctl=MultiXactOffsetCtlData, ...)
      at access/transam/slru.c:1174
#9  RecordNewMultiXact (multi=981215231, offset=2282786137,
      nmembers=2, members=...)
      at access/transam/multixact.c:944
#10 multixact_redo (record=...)
      at access/transam/multixact.c:3464
#11 ApplyWalRecord -> PerformWalRecovery -> StartupXLOG
LWLock state at 0xfdbf33f2f000 (stable across 5+ snapshots):
tranche = 14 (MultiXactOffsetSLRU)
state.value = 0x61000000
= LW_FLAG_RELEASE_OK | LW_FLAG_HAS_WAITERS | shared_count=1
waiters = {head=524, tail=524} (one waiter)
Critical evidence — startup process holds exactly one LWLock:
num_held_lwlocks = 1
*Combined with*:
- No exclusive holder of the lock
- shared_count = 1
- Checkpointer (PID 395001) and bgwriter (PID 395002) sitting idle
in CheckpointerMain/BackgroundWriterMain WaitLatch loops, with no
visible work pending
- Same gdb stack frame frozen across captures separated by minutes
- Zero CPU, zero I/O, ctx_switches not advancing
→ The startup process is holding MultiXactOffsetSLRULock in SHARED mode
(acquired earlier in the RecordNewMultiXact path) and now requesting
it in EXCLUSIVE mode via SimpleLruWriteAll. Since LWLocks cannot be
upgraded shared→exclusive, this is a self-deadlock.
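To make the reasoning above concrete, here is a minimal toy model of a shared/exclusive lock that has no upgrade path. It is only a sketch of the scenario as described in this report, not PostgreSQL's lwlock.c; the class ToyLWLock and its methods are hypothetical names introduced for illustration.

```python
class ToyLWLock:
    """Toy shared/exclusive lock without upgrade support (illustration only)."""

    def __init__(self):
        self.shared_holders = []      # PIDs holding the lock in shared mode
        self.exclusive_holder = None  # PID holding it exclusively, if any

    def try_acquire(self, pid, exclusive):
        """Return True if the lock is granted, False if the caller would have to wait."""
        if exclusive:
            # Exclusive mode needs the lock to be completely free. The requester's
            # own shared hold counts against it: there is no shared-to-exclusive upgrade.
            if self.exclusive_holder is None and not self.shared_holders:
                self.exclusive_holder = pid
                return True
            return False
        if self.exclusive_holder is None:
            self.shared_holders.append(pid)
            return True
        return False

    def blocks_only_on_itself(self, pid):
        """True if every current holder is pid, so waiting can never end."""
        holders = set(self.shared_holders)
        if self.exclusive_holder is not None:
            holders.add(self.exclusive_holder)
        return holders == {pid}


lock = ToyLWLock()
startup = 395003  # PID of the startup process from the report
granted_shared = lock.try_acquire(startup, exclusive=False)
granted_exclusive = lock.try_acquire(startup, exclusive=True)
print(granted_shared, granted_exclusive, lock.blocks_only_on_itself(startup))
```

In this model the exclusive request is refused while the only holder is the requester itself, which is the situation no other process can resolve.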
Auxiliary process stacks (for completeness):
Checkpointer (395001):
epoll_pwait → WaitLatch (timeout=15000)
→ CheckpointerMain (checkpointer.c:535)
Bgwriter (395002):
epoll_pwait → WaitLatch (timeout=10000)
→ BackgroundWriterMain (bgwriter.c:336)
Both are idle in their main loops; held_lwlocks was <optimized out> in
gdb but neither process has any plausible reason to hold the SLRU lock.
pg_controldata excerpt:
Database cluster state: in archive recovery
Backup start location: 6BEB/27000378
Minimum recovery ending location: 6BEB/31DCEBE0
Backup end location: 0/0
End-of-backup record required: yes
NextMultiXactId: 981215122 (replay reached 981215231)
NextMultiOffset: 2282785918 (replay reached 2282786137)
oldestMultiXid: 964544775
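As a cross-check on the excerpt above, the WAL segment that contains a given LSN can be derived from the LSN itself. The following is a minimal sketch, assuming the default 16 MB wal_segment_size and timeline 1. Neither is stated in the report, although the timeline matches the 00000001 prefix of the segment name quoted earlier. The helper name wal_segment_name is hypothetical.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default wal_segment_size


def wal_segment_name(lsn, timeline=1, segment_size=WAL_SEGMENT_SIZE):
    """Map an LSN such as '6BEB/31DCEBE0' to the WAL segment file name holding it."""
    hi, lo = lsn.split("/")
    position = (int(hi, 16) << 32) | int(lo, 16)      # 64-bit byte position in WAL
    segno = position // segment_size                  # global segment number
    segments_per_xlogid = 0x100000000 // segment_size # segments per 4 GB "xlogid"
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )


print(wal_segment_name("6BEB/27000378"))  # backup start location
print(wal_segment_name("6BEB/31DCEBE0"))  # minimum recovery ending location
```

Under those assumptions, the backup start location maps to 0000000100006BEB00000027 and the minimum recovery ending location maps to 0000000100006BEB00000031, the segment in which replay stalls.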
Reproduction:
- Restore basebackup + WAL via pgBackRest archive-get on aarch64
- Start cluster on 16.14: hangs as described, every time, same WAL
position
- Stop cluster, downgrade to 16.13 (same pgdg apt source), start:
recovery completes successfully on identical PGDATA
- No data or environment change between the two attempts
I'm happy to apply test patches or capture additional diagnostics.
Best wishes
Olegs Germanovs
* Re: 16.14 regression: startup process self-deadlocks during multixact WAL replay in RecordNewMultiXact
@ 2026-05-27 12:48 Andrey Borodin <[email protected]>
parent: Olegs Germanovs <[email protected]>
From: Andrey Borodin @ 2026-05-27 12:48 UTC
To: Olegs Germanovs <[email protected]>; +Cc: PostgreSQL mailing lists <[email protected]>
> On 27 May 2026, at 17:33, Olegs Germanovs <[email protected]> wrote:
>
> After upgrading from 16.13 to 16.14, archive recovery of a basebackup
> hangs indefinitely during multixact WAL replay.
Hi Olegs!
Thanks for the detailed report! Your analysis of the self-deadlock is spot on.
The fix for this problem has already been committed to REL_16_STABLE as 42a3194e5483 [0].
It was discussed on the pgsql-bugs thread "BUG #19490: Streaming standby on 16.14 stops
applying WAL on MultiXactOffsetSLRU when primary is 16.8" [1].
Please let us know if you still observe the problem or any other unusual behavior.
Best regards, Andrey Borodin.
[0] https://git.postgresql.org/cgit/postgresql.git/commit/?h=REL_16_STABLE&id=42a3194e548349b658a808...
[1] https://www.postgresql.org/message-id/flat/46FE61C9-F273-45FD-BED7-0F8CDA6EB992%40yandex-team.ru#69d...