From: surya poondla
Date: Tue, 19 May 2026 14:20:56 -0700
Subject: Re: [BUG] Take a long time to reach consistent after pg_rewind
To: cca5507
Cc: pgsql-hackers

Hi ChangAo,

Thanks for the careful diagnosis. I reproduced the hang on macOS with the
latest PostgreSQL code (it took many iterations to hit it).
The LSN trace matches your description; this is what I saw:
    minRecoveryPoint = 0/08000028
    consistent recovery state reached at = 0/08000060

In my run the standby was stuck for ~9 s; consistency was eventually
declared at 0/08000060 because a small upstream record (most likely a
RUNNING_XACTS snapshot from bgwriter) landed at 0/08000028 and let
lastReplayedEndRecPtr leap past the bad finish line.
With the new primary stopped after pg_rewind, the wait was unbounded, as
expected.

Regarding the fix: the underlying issue is that minRecoveryPoint is
implicitly expected to be the end-LSN of a real WAL record, because
lastReplayedEndRecPtr (the value it gets compared against) can only ever
take such values.
All current writers respect this expectation except pg_rewind:
pg_basebackup uses the backup-end record's EndRecPtr, and the runtime
UpdateMinRecoveryPoint() path uses buffer LSNs; both are record-end LSNs
by construction. pg_rewind alone uses pg_current_wal_insert_lsn(), which
can return a position just past a page header when the source is idle.
That's why I'd lean toward fixing the producer (pg_rewind).

Concretely, your original suggestion of having pg_rewind use
GetXLogInsertEndRecPtr() instead of GetXLogInsertRecPtr() restores the
invariant globally, and doesn't require future call sites that compare
against minRecoveryPoint to know about page-header adjustments.
If we still want a defense-in-depth guard in CheckRecoveryConsistency() to
handle older pg_rewind binaries running against a newer server, the v1
patch is on the right track, but I'd suggest:
  - documenting in the helper comment why exactly SizeOfXLogShortPHD /
    SizeOfXLogLongPHD past a page boundary are the only legal
    "non-record-end" minRecoveryPoint values (i.e. who can produce
    them and under what conditions);

  - auditing the other call sites that compare against
    minRecoveryPoint to confirm none of them needs the same
    adjustment, with a comment recording the conclusion.

I can put together a TAP test under src/bin/pg_rewind/t/ that forces a WAL
switch on the source, runs pg_rewind against an otherwise-idle primary, and
asserts that the rewound node reaches consistency without further upstream
activity.
Happy to send a v2 with that test if useful.

This is a liveness bug with a potentially unbounded wait on idle promoted
primaries, so it does seem worth back-patching.

Regards,
Surya Poondla
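
A footnote on the LSN values in the trace: the arithmetic behind "just past a page header" can be sketched as below. This is a standalone illustration, not PostgreSQL source and not the v1 patch. It assumes the default 8 kB WAL block size and 16 MB segment size, and header sizes of 24 and 40 bytes for SizeOfXLogShortPHD and SizeOfXLogLongPHD (what a typical 64-bit build should give); every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed build parameters (PostgreSQL defaults), not read from a cluster. */
#define ASSUMED_XLOG_BLCKSZ       8192u
#define ASSUMED_WAL_SEGMENT_SIZE  (16u * 1024u * 1024u)
#define ASSUMED_SHORT_PHD         24u   /* stands in for SizeOfXLogShortPHD */
#define ASSUMED_LONG_PHD          40u   /* stands in for SizeOfXLogLongPHD */

typedef uint64_t Lsn;

/*
 * Hypothetical helper: true if lsn is the first usable byte after a WAL
 * page header. The first page of each segment carries the long header,
 * every other page the short one.
 */
bool
lsn_is_just_past_page_header(Lsn lsn)
{
    uint64_t page_off = lsn % ASSUMED_XLOG_BLCKSZ;
    uint64_t seg_off = lsn % ASSUMED_WAL_SEGMENT_SIZE;

    if (seg_off < ASSUMED_XLOG_BLCKSZ)
        return page_off == ASSUMED_LONG_PHD;
    return page_off == ASSUMED_SHORT_PHD;
}

/* Hypothetical helper: start of the WAL page that contains lsn. */
Lsn
lsn_page_start(Lsn lsn)
{
    return lsn - (lsn % ASSUMED_XLOG_BLCKSZ);
}

/* Checks the two values from the trace (0/08000028 and 0/08000060). */
void
check_trace_example(void)
{
    Lsn min_recovery_point = 0x08000028;
    Lsn consistent_at = 0x08000060;

    /* 0x28 == 40: exactly the long header at the start of a segment. */
    assert(lsn_is_just_past_page_header(min_recovery_point));
    assert(lsn_page_start(min_recovery_point) == 0x08000000);

    /* The LSN where consistency was finally declared is not such a position. */
    assert(!lsn_is_just_past_page_header(consistent_at));
}
```

Under those assumptions, 0/08000028 sits exactly 40 bytes into the first page of its segment, which is consistent with an insert position on an idle source, while 0/08000060 does not match either header offset.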