MIME-Version: 1.0
From: Olegs Germanovs <olegs.germanovs@gmail.com>
Date: Wed, 27 May 2026 15:33:58 +0300
Message-ID: 
 <CA+yEoBxD18Z3VxOwLfk+959giwrt=6Jo5HujnvyZZN2Y63TWBg@mail.gmail.com>
Subject: 16.14 regression: startup process self-deadlocks during multixact WAL
 replay in RecordNewMultiXact
To: pgsql-bugs@lists.postgresql.org
Content-Type: multipart/alternative; boundary="0000000000000de55b0652cbd425"
Archived-At: 
 <https://www.postgresql.org/message-id/CA%2ByEoBxD18Z3VxOwLfk%2B959giwrt%3D6Jo5HujnvyZZN2Y63TWBg%40mail.gmail.com>
Precedence: bulk

--0000000000000de55b0652cbd425
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi!


*Bug summary:* After upgrading from 16.13 to 16.14, archive recovery of a
  base backup hangs indefinitely during multixact WAL replay. The startup
  process blocks acquiring MultiXactOffsetSLRULock in EXCLUSIVE mode while
  already holding one LWLock. The lock has shared_count=1 with no
  exclusive holder, no other live process appears to hold it, and the
  same recovery completes successfully on 16.13.

*Environment*:
  PostgreSQL:  16.14 (pgdg)
  OS:          Ubuntu 22.04, kernel 6.8.0-1016-aws
  Arch:        aarch64 (AWS Graviton)
  Backup tool: pgBackRest 2.53.1 (backup) → 2.58.0 (restore)
  Source:      x86_64 cluster, PostgreSQL 16.6 (Ubuntu 16.6-1.pgdg22.04+1)

*Scenario*: archive recovery from pgBackRest. End-of-backup record not yet
seen. Replay stalls in WAL segment 0000000100006BEB00000031.
The next segment is not the issue: startup is not waiting for WAL; it
already has a record in hand (frozen on the same frame across many gdb
captures taken minutes apart).

*Stack of startup process* (PID 395003):

  #7  LWLockAcquire (lock=0xfdbf33f2f000, mode=LW_EXCLUSIVE)
        at storage/lmgr/lwlock.c:1314
  #8  SimpleLruWriteAll (ctl=MultiXactOffsetCtlData, ...)
        at access/transam/slru.c:1174
  #9  RecordNewMultiXact (multi=981215231, offset=2282786137,
                          nmembers=2, members=...)
        at access/transam/multixact.c:944
  #10 multixact_redo (record=...)
        at access/transam/multixact.c:3464
  #11 ApplyWalRecord -> PerformWalRecovery -> StartupXLOG

LWLock state at 0xfdbf33f2f000 (stable across 5+ snapshots):
  tranche = 14 (MultiXactOffsetSLRU)
  state.value = 0x61000000
    = LW_FLAG_RELEASE_OK | LW_FLAG_HAS_WAITERS | shared_count=1
  waiters = {head=524, tail=524}   (one waiter)
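
The state word above can be split into its fields mechanically. The following is a minimal sketch, assuming the bit layout used by lwlock.c on the 16 branch (the constants are reproduced from memory and should be checked against the source tree). Under that assumed layout, bit 24 (0x01000000) is the exclusive-holder marker and the shared count lives in the low 24 bits, so the reading of 0x61000000 as "shared_count=1" is worth double-checking: the value would instead decode as "held exclusively, zero shared holders". That would still be consistent with a self-deadlock, since re-requesting a lock the process already holds exclusively blocks just the same.

```python
# Bit layout assumed from src/backend/storage/lmgr/lwlock.c (PostgreSQL 16).
# Reproduced from memory: verify against the source before relying on it.
LW_FLAG_HAS_WAITERS = 1 << 30
LW_FLAG_RELEASE_OK = 1 << 29
LW_FLAG_LOCKED = 1 << 28          # guards the wait list, not the lock itself
LW_VAL_EXCLUSIVE = 1 << 24        # set while one exclusive holder exists
LW_SHARED_MASK = (1 << 24) - 1    # low 24 bits count shared holders


def decode_lwlock_state(value: int) -> dict:
    """Split a raw LWLock state word into flags and holder information."""
    return {
        "has_waiters": bool(value & LW_FLAG_HAS_WAITERS),
        "release_ok": bool(value & LW_FLAG_RELEASE_OK),
        "waitlist_locked": bool(value & LW_FLAG_LOCKED),
        "exclusive": bool(value & LW_VAL_EXCLUSIVE),
        "shared_count": value & LW_SHARED_MASK,
    }


if __name__ == "__main__":
    # Value captured from gdb in the report above.
    for field, val in decode_lwlock_state(0x61000000).items():
        print(f"{field:16} {val}")
```

Printing `held_lwlocks[0]` (lock pointer and mode) in the startup process would settle which mode it actually holds.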

Critical evidence (startup process holds exactly one LWLock):
  num_held_lwlocks = 1

*Combined with*:
  - No exclusive holder of the lock
  - shared_count = 1
  - Checkpointer (PID 395001) and bgwriter (PID 395002) sitting idle
    in CheckpointerMain/BackgroundWriterMain WaitLatch loops, with no
    visible work pending
  - Same gdb stack frame frozen across captures separated by minutes
  - Zero CPU, zero I/O, ctx_switches not advancing

→ The startup process is holding MultiXactOffsetSLRULock in SHARED mode
  (acquired earlier in the RecordNewMultiXact path) and now requesting
  it in EXCLUSIVE mode via SimpleLruWriteAll. Since LWLocks cannot be
  upgraded shared→exclusive, this is a self-deadlock.
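
The reasoning above can be illustrated with a toy model. This is a sketch of a generic non-reentrant shared/exclusive lock, not PostgreSQL code; the class and function names (`ToyLWLock`, `would_self_deadlock`) are hypothetical. It shows why a request that conflicts with a hold owned by the requester itself can never be granted: nobody else is in a position to release the lock.

```python
class ToyLWLock:
    """Toy non-reentrant shared/exclusive lock (illustration only)."""

    def __init__(self):
        self.exclusive_holder = None
        self.shared_holders = []  # the same owner may appear more than once

    def try_acquire(self, owner: str, exclusive: bool) -> bool:
        """Return True if granted; False means the caller would go to sleep."""
        if self.exclusive_holder is not None:
            return False
        if exclusive:
            if self.shared_holders:
                # Any shared holder blocks the request, even the caller itself:
                # there is no shared -> exclusive upgrade.
                return False
            self.exclusive_holder = owner
        else:
            self.shared_holders.append(owner)
        return True

    def release(self, owner: str) -> None:
        if self.exclusive_holder == owner:
            self.exclusive_holder = None
        else:
            self.shared_holders.remove(owner)


def would_self_deadlock(lock: ToyLWLock, owner: str, exclusive: bool) -> bool:
    """True if the request blocks and the only blocker is the requester."""
    if lock.try_acquire(owner, exclusive):
        lock.release(owner)  # undo the probe
        return False
    blockers = set(lock.shared_holders)
    if lock.exclusive_holder is not None:
        blockers.add(lock.exclusive_holder)
    return blockers == {owner}


if __name__ == "__main__":
    lock = ToyLWLock()
    lock.try_acquire("startup", exclusive=False)
    print(would_self_deadlock(lock, "startup", exclusive=True))
```

The same check returns True when the existing hold is exclusive rather than shared, so the conclusion does not depend on which mode the first acquisition used.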

Auxiliary process stacks (for completeness):

  Checkpointer (395001):
    epoll_pwait → WaitLatch (timeout=15000)
                → CheckpointerMain (checkpointer.c:535)
  Bgwriter (395002):
    epoll_pwait → WaitLatch (timeout=10000)
                → BackgroundWriterMain (bgwriter.c:336)

Both are idle in their main loops; held_lwlocks was <optimized out> in
gdb but neither process has any plausible reason to hold the SLRU lock.
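
For anyone trying to reproduce the "ctx_switches not advancing" observation, a process that is parked rather than spinning or being woken can be detected by sampling its context-switch counters twice. A minimal sketch, assuming Linux and the `voluntary_ctxt_switches` / `nonvoluntary_ctxt_switches` fields of `/proc/<pid>/status`; the helper names and the 5-second interval are arbitrary choices for illustration:

```python
import time
from pathlib import Path


def parse_ctxt_switches(status_text: str) -> int:
    """Sum voluntary and nonvoluntary switches from /proc/<pid>/status text."""
    total = 0
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches"):
            total += int(value.strip())
    return total


def is_parked(pid: int, interval: float = 5.0) -> bool:
    """True if the process was never scheduled during the sampling interval."""
    path = Path(f"/proc/{pid}/status")
    before = parse_ctxt_switches(path.read_text())
    time.sleep(interval)
    after = parse_ctxt_switches(path.read_text())
    return after == before


if __name__ == "__main__":
    import sys

    # Usage: python3 parked.py <pid>
    print(is_parked(int(sys.argv[1])))
```

A checkpointer or bgwriter sitting in WaitLatch with a timeout will still show the counters ticking slowly as the timeout fires, whereas a process blocked on a semaphore that nobody posts will show no movement at all.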

pg_controldata excerpt:
  Database cluster state:           in archive recovery
  Backup start location:            6BEB/27000378
  Minimum recovery ending location: 6BEB/31DCEBE0
  Backup end location:              0/0
  End-of-backup record required:    yes
  NextMultiXactId:                  981215122 (replay reached 981215231)
  NextMultiOffset:                  2282785918 (replay reached 2282786137)
  oldestMultiXid:                   964544775
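
The figures in this excerpt can be cross-checked with a little arithmetic. The sketch below assumes the default 16 MB wal_segment_size, an 8192-byte block size and 4-byte multixact offsets (2048 offsets per SLRU page); the function names are hypothetical. It maps the LSNs to WAL segment file names and locates the replayed multixact ID within the offsets SLRU.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024      # assumed default wal_segment_size
MULTIXACT_OFFSETS_PER_PAGE = 8192 // 4   # assumed BLCKSZ=8192, 4-byte offsets


def wal_file_name(timeline: int, lsn: str) -> str:
    """Map an 'X/Y' LSN to the name of the WAL segment file containing it."""
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    segno = ((hi << 32) | lo) // WAL_SEGMENT_SIZE
    segs_per_id = 0x100000000 // WAL_SEGMENT_SIZE
    return f"{timeline:08X}{segno // segs_per_id:08X}{segno % segs_per_id:08X}"


def offsets_page_and_entry(multi: int) -> tuple:
    """Return (page number, entry within page) in the offsets SLRU."""
    return divmod(multi, MULTIXACT_OFFSETS_PER_PAGE)


if __name__ == "__main__":
    print(wal_file_name(1, "6BEB/27000378"))   # backup start
    print(wal_file_name(1, "6BEB/31DCEBE0"))   # minimum recovery end
    print(981215231 - 981215122)               # multixacts replayed past control file
    print(2282786137 - 2282785918)             # member offsets replayed past control file
    print(offsets_page_and_entry(981215231))   # position of the stalled multixact
```

Under these assumptions the minimum recovery ending location falls in segment 0000000100006BEB00000031, matching the segment where replay stalls, and multixact 981215231 occupies the last entry (2047) of offsets page 479108, so the next multixact ID would start a new page. Whether that page boundary plays a role in the hang is not established here.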

Reproduction:
  - Restore basebackup + WAL via pgBackRest archive-get on aarch64
  - Start cluster on 16.14: hangs as described, every time, same WAL
    position
  - Stop cluster, downgrade to 16.13 (same pgdg apt source), start:
    recovery completes successfully on identical PGDATA
  - No data or environment change between the two attempts

I'm happy to apply test patches or capture additional diagnostics.

Best wishes
Olegs Germanovs


--0000000000000de55b0652cbd425--