From: Nazneen Jafri
Date: Tue, 26 May 2026 19:55:14 -0700
Subject: Re: BUG #19490: Streaming standby on 16.14 stops applying WAL on MultiXactOffsetSLRU when primary is 16.8
To: Andrey Borodin
Cc: Heikki Linnakangas, Michael Paquier, Ayush Tiwari, Radim Marek, Marko Tiikkaja, PostgreSQL mailing lists

Tested Andrey's demo.diff on a fresh environment:

- Primary: REL_16_8, Standby: REL_16_14 (--enable-cassert)
- ~2300 MultiXacts crossing the offsets page boundary
- Without the patch: the startup process deadlocks in RecordNewMultiXact(multi=2047)
- With the patch: the standby replays all WAL and catches up



Thanks,
Nazneen
On Tue, May 26, 2026 at 2:55 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote:


> On 26 May 2026, at 17:28, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> looks correct

I tested that change as follows.

Set up REL_16_0 as the primary and REL_16_STABLE as the standby.

Generate multixacts in a single session using savepoints:

BEGIN;
SELECT * FROM t WHERE i = 1 FOR NO KEY UPDATE;
-- repeat 2500 times:
SAVEPOINT a; SELECT * FROM t WHERE i = 1 FOR UPDATE; ROLLBACK TO a;
COMMIT;

Each iteration creates a new MultiXactId. 2500 iterations cross the SLRU page
boundary at multixact 2048 with some spare multis (we'll pickle the excess ones in
jars when all is fixed, toying with 2048 wasted dev cycles for no reason).

Test:
0. Run the workload on the REL_16_0 primary (2500 multixacts, crossing page 0->1).
1. Take a pg_basebackup.
2. Run the workload again (2500 more, crossing page 1->2).
3. Start the standby.

I observe:
Without the change, the startup process deadlocks.
With the change, the standby catches up; the DEBUG1 message
"next offsets page is not initialized, initializing it now" confirms that
the compat block fires correctly.

I packaged this test into a buildfarm module (TestReplayXversion) [0] that
builds REL_x_0 and runs this check against the REL_x_STABLE build. It reproduces
the deadlock on 14, 15, and 16; 17 and 18 pass. Currently I'm struggling to inject
a regress WAL trace into it; that isn't working so far. On the bright side, I
managed to get PR number 42 in the buildfarm client repo.


Best regards, Andrey Borodin.

[0] https://github.com/PGBuildFarm/client-code/pull/42