public inbox for [email protected]  
[BUG] Take a long time to reach consistent after pg_rewind
3+ messages / 2 participants

* [BUG] Take a long time to reach consistent after pg_rewind
@ 2026-04-10 09:57  cca5507 <[email protected]>
  0 siblings, 2 replies; 3+ messages in thread

From: cca5507 @ 2026-04-10 09:57 UTC
  To: pgsql-hackers <[email protected]>

Hi,

Steps to reproduce (PG19):

1) start two nodes, node1 (primary), node2 (standby), both with the following configuration:

```
archive_mode = on
archive_command = '/bin/true'
archive_timeout = 10
checkpoint_timeout = '60min'
wal_keep_size = 1024
logging_collector = on
```

2) promote node2

3) stop node1

4) make sure pg_current_wal_insert_lsn() on node2 is at the beginning of a WAL
segment (the LSN ends with 000028); if not, run a checkpoint and recheck
(archive_timeout will switch the WAL segment)

5) run pg_rewind on node1, using node2 as the source

6) start node1

7) node1 now cannot reach a consistent state until node2 writes some WAL

Logs of node1:

```
2026-04-10 16:16:07.802 CST [45623] LOG:  starting backup recovery with redo LSN 0/02000028, checkpoint LSN 0/02000088, on timeline ID 1
2026-04-10 16:16:07.802 CST [45623] LOG:  entering standby mode
2026-04-10 16:16:07.803 CST [45623] LOG:  redo starts at 0/02000028
2026-04-10 16:16:07.803 CST [45623] LOG:  completed backup recovery with redo LSN 0/02000028 and end LSN 0/02000130
2026-04-10 16:16:07.806 CST [45624] LOG:  started streaming WAL from primary at 0/04000000 on timeline 2
2026-04-10 16:19:13.083 CST [47039] FATAL:  the database system is not yet accepting connections
2026-04-10 16:19:13.083 CST [47039] DETAIL:  Consistent recovery state has not been yet reached.
2026-04-10 16:20:16.413 CST [45623] LOG:  consistent recovery state reached at 0/04000048
2026-04-10 16:20:16.413 CST [45616] LOG:  database system is ready to accept read-only connections
```

Root cause:

node1's minimum recovery point is 0/04000028, but node2 has no WAL after that point and may
stay idle for a long time.
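
To make the arithmetic concrete, here is a minimal sketch that checks whether an LSN sits exactly at the first usable byte after a WAL page header. It assumes a default build (16 MB segments, 8 kB WAL pages, a 40-byte long page header on the first page of a segment and a 24-byte short header elsewhere). The helper names are hypothetical and not part of PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed defaults; a real server takes these from its build and pg_control. */
#define WAL_SEGMENT_SIZE  (16u * 1024 * 1024)
#define WAL_PAGE_SIZE     8192u
#define LONG_PAGE_HEADER  40u   /* first page of each segment */
#define SHORT_PAGE_HEADER 24u   /* every other page */

/* Build a 64-bit LSN from the "hi/lo" halves printed in log messages. */
static uint64_t make_lsn(uint32_t hi, uint32_t lo)
{
    return ((uint64_t) hi << 32) | lo;
}

/* True if lsn is exactly the first usable byte after a page header. */
static bool lsn_is_just_past_page_header(uint64_t lsn)
{
    uint64_t seg_off  = lsn % WAL_SEGMENT_SIZE;
    uint64_t page_no  = seg_off / WAL_PAGE_SIZE;
    uint64_t page_off = lsn % WAL_PAGE_SIZE;

    return page_off == (page_no == 0 ? LONG_PAGE_HEADER : SHORT_PAGE_HEADER);
}
```

Under these assumptions, 0/04000028 is 40 bytes into a segment, i.e. the position where the next record would start, while 0/04000048 from the log above is not such a position.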

Possible fix:

pg_rewind uses pg_current_wal_insert_lsn() to set the minimum recovery point; that function
calls GetXLogInsertRecPtr() and returns the current WAL insert pointer. Maybe we should use
GetXLogInsertEndRecPtr() instead, which returns the end pointer of the latest WAL record.
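
The difference between the two pointers can be illustrated with a simplified model. The sketch below maps a count of usable WAL bytes (page headers excluded) to an LSN in two ways: as the start of the next record, or as the end of the previous one. It assumes default sizes and is not the server's code; `bytepos_to_lsn` and the constants are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults for segment size, page size and page header sizes. */
#define WAL_SEGMENT_SIZE  (16u * 1024 * 1024)
#define WAL_PAGE_SIZE     8192u
#define LONG_PAGE_HEADER  40u
#define SHORT_PAGE_HEADER 24u

#define USABLE_IN_PAGE    (WAL_PAGE_SIZE - SHORT_PAGE_HEADER)
#define USABLE_IN_SEGMENT ((WAL_SEGMENT_SIZE / WAL_PAGE_SIZE) * USABLE_IN_PAGE \
                           - (LONG_PAGE_HEADER - SHORT_PAGE_HEADER))

/*
 * Map a usable-byte position to an LSN.
 * as_end == 0: where the next record would start (page header is skipped).
 * as_end != 0: where the previous record ended (a page boundary stays put).
 */
static uint64_t bytepos_to_lsn(uint64_t bytepos, int as_end)
{
    uint64_t segs = bytepos / USABLE_IN_SEGMENT;
    uint64_t left = bytepos % USABLE_IN_SEGMENT;
    uint64_t first_page = WAL_PAGE_SIZE - LONG_PAGE_HEADER;
    uint64_t seg_off;

    if (left < first_page)
    {
        /* Position falls on the first page of the segment. */
        seg_off = (as_end && left == 0) ? 0 : left + LONG_PAGE_HEADER;
    }
    else
    {
        uint64_t pages;

        left -= first_page;
        pages = left / USABLE_IN_PAGE;
        left %= USABLE_IN_PAGE;

        seg_off = WAL_PAGE_SIZE + pages * WAL_PAGE_SIZE;
        if (!(as_end && left == 0))
            seg_off += left + SHORT_PAGE_HEADER;
    }

    return segs * WAL_SEGMENT_SIZE + seg_off;
}
```

In this model an idle insert position at the start of a segment maps to 0/04000028 as a start pointer and to 0/04000000 as an end pointer. Away from page boundaries the two conversions agree.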

Thoughts?

--
Regards,
ChangAo Chen



* Re: [BUG] Take a long time to reach consistent after pg_rewind
@ 2026-04-13 13:14  cca5507 <[email protected]>
  parent: cca5507 <[email protected]>
  1 sibling, 0 replies; 3+ messages in thread

From: cca5507 @ 2026-04-13 13:14 UTC
  To: pgsql-hackers <[email protected]>

> Possible fix:
> 
> The pg_rewind use pg_current_wal_insert_lsn() to set the min recovery point, which calls
> GetXLogInsertRecPtr() and returns the latest wal insert pointer. Maybe we should use
> GetXLogInsertEndRecPtr() which returns the latest wal record end pointer.
> 
> Thoughts?

Another solution:

If minRecoveryPoint is just after an xlog page header, we can move it back to the
beginning of the page. That is safe because we only skip the xlog page header. Am I
missing something?

A patch that does this is attached.

--
Regards,
ChangAo Chen


Attachments:

  [application/octet-stream] v1-0001-Introduce-GetEffectiveMinRecoveryPoint.patch (2.1K, 2-v1-0001-Introduce-GetEffectiveMinRecoveryPoint.patch)
  inline diff:
From 76e75797819d01e36b61ed3a74c51e821a5f2370 Mon Sep 17 00:00:00 2001
From: ChangAo Chen <[email protected]>
Date: Mon, 13 Apr 2026 20:51:55 +0800
Subject: [PATCH v1] Introduce GetEffectiveMinRecoveryPoint()

If minRecoveryPoint is just after an xlog page header, we can move
it back to the beginning of the page. This is safe because we only
skip the xlog page header. Without this, it may take a long time to
reach a consistent state (e.g. when the primary has no xlog record
after minRecoveryPoint).
---
 src/backend/access/transam/xlogrecovery.c | 25 ++++++++++++++++++++++-
 1 file changed, 24 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c236e2b7969..63e8409eab9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2139,6 +2139,29 @@ CheckTablespaceDirectory(void)
 	}
 }
 
+/*
+ * If minRecoveryPoint is just after an xlog page header, return a pointer
+ * to the beginning of that page; otherwise return minRecoveryPoint.
+ *
+ * The returned pointer is used for checking whether we can reach a consistent
+ * state. This is safe because we only skip the xlog page header.
+ */
+static XLogRecPtr
+GetEffectiveMinRecoveryPoint(void)
+{
+	XLogRecPtr	ptr = minRecoveryPoint;
+	uint64		pageno = XLogSegmentOffset(ptr, wal_segment_size) / XLOG_BLCKSZ;
+	uint64		pageoff = ptr % XLOG_BLCKSZ;
+
+	if (pageno == 0 && pageoff == SizeOfXLogLongPHD)
+		return ptr - SizeOfXLogLongPHD;
+
+	if (pageno > 0 && pageoff == SizeOfXLogShortPHD)
+		return ptr - SizeOfXLogShortPHD;
+
+	return ptr;
+}
+
 /*
  * Checks if recovery has reached a consistent state. When consistency is
  * reached and we have a valid starting standby snapshot, tell postmaster
@@ -2199,7 +2222,7 @@ CheckRecoveryConsistency(void)
 	 * All we know prior to that is that we're not consistent yet.
 	 */
 	if (!reachedConsistency && !backupEndRequired &&
-		minRecoveryPoint <= lastReplayedEndRecPtr)
+		GetEffectiveMinRecoveryPoint() <= lastReplayedEndRecPtr)
 	{
 		/*
 		 * Check to see if the XLOG sequence contained any unresolved
-- 
2.34.1
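
For readers who want to try the logic outside the server, here is a standalone sketch of the same adjustment together with the comparison it feeds. It assumes default sizes (16 MB segments, 8 kB pages, 40-byte and 24-byte page headers) and passes minRecoveryPoint and lastReplayedEndRecPtr as plain arguments. The function names are hypothetical, and the replay position 0/04000000 used in the note below is an assumption, since the log does not print it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed defaults for segment size, page size and page header sizes. */
#define WAL_SEGMENT_SIZE  (16u * 1024 * 1024)
#define WAL_PAGE_SIZE     8192u
#define LONG_PAGE_HEADER  40u
#define SHORT_PAGE_HEADER 24u

/* Move a pointer that is just past a page header back to the page start. */
static uint64_t effective_min_recovery_point(uint64_t min_rp)
{
    uint64_t page_no  = (min_rp % WAL_SEGMENT_SIZE) / WAL_PAGE_SIZE;
    uint64_t page_off = min_rp % WAL_PAGE_SIZE;

    if (page_no == 0 && page_off == LONG_PAGE_HEADER)
        return min_rp - LONG_PAGE_HEADER;
    if (page_no > 0 && page_off == SHORT_PAGE_HEADER)
        return min_rp - SHORT_PAGE_HEADER;
    return min_rp;
}

/* Comparison as it stands before the patch. */
static bool consistent_unpatched(uint64_t min_rp, uint64_t last_replayed_end)
{
    return min_rp <= last_replayed_end;
}

/* Comparison with the page-header adjustment applied. */
static bool consistent_patched(uint64_t min_rp, uint64_t last_replayed_end)
{
    return effective_min_recovery_point(min_rp) <= last_replayed_end;
}
```

With minRecoveryPoint at 0/04000028 and replay assumed to have stopped at the segment boundary 0/04000000, the unpatched comparison fails while the adjusted one succeeds. Once a record ending at 0/04000048 is replayed, both succeed.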




* Re: [BUG] Take a long time to reach consistent after pg_rewind
@ 2026-05-19 21:20  surya poondla <[email protected]>
  parent: =?utf-8?B?Y2NhNTUwNw==?= <[email protected]>
  1 sibling, 0 replies; 3+ messages in thread

From: surya poondla @ 2026-05-19 21:20 UTC
  To: cca5507 <[email protected]>; +Cc: pgsql-hackers <[email protected]>

Hi ChangAo,

Thanks for the careful diagnosis. I reproduced the hang on macOS with the
latest PostgreSQL code (it took many iterations to reproduce).
The LSN trace matches your description; I saw the following:
    minRecoveryPoint = 0/08000028
    consistent recovery state reached at 0/08000060

In my run the standby was stuck for ~9 s. Consistency was eventually
declared at 0/08000060 because a small upstream record (most likely
a RUNNING_XACTS snapshot from bgwriter) landed at 0/08000028 and let
lastReplayedEndRecPtr advance past minRecoveryPoint.
With the new primary stopped after pg_rewind, the wait was unbounded, as
expected.

Regarding the fix: the underlying issue is that minRecoveryPoint is
implicitly expected to be the end LSN of a real WAL record, because
lastReplayedEndRecPtr (the value it is compared against) can only ever
take such values. All current writers respect this expectation except
pg_rewind: pg_basebackup uses the backup-end record's EndRecPtr, and the
UpdateMinRecoveryPoint path on a running server uses buffer LSNs, both of
which are record-end LSNs by construction.
pg_rewind alone uses pg_current_wal_insert_lsn(), which can return a
position just past a page header when the source is idle.
That's why I'd lean toward fixing the producer (pg_rewind).

Concretely, your original suggestion, having pg_rewind use
GetXLogInsertEndRecPtr() instead of GetXLogInsertRecPtr(), restores
the invariant globally and doesn't require future call sites that compare
against minRecoveryPoint to know about page-header adjustments.

If we still want a defense-in-depth guard in CheckRecoveryConsistency() to
handle older pg_rewind binaries running against a newer server,
the v1 patch is on the right track, but I'd suggest:
  - documenting in the helper comment why exactly SizeOfXLogShortPHD /
    SizeOfXLogLongPHD past a page boundary are the only legal
    "non-record-end" minRecoveryPoint values (i.e. who can produce
    them and under what conditions);

  - auditing the other call sites that compare against
    minRecoveryPoint to confirm none of them needs the same
    adjustment, with a comment recording the conclusion.

I can put together a TAP test under src/bin/pg_rewind/t/ that forces a WAL
switch on the source, runs pg_rewind against an
otherwise-idle primary, and asserts that the rewound node reaches
consistency without further upstream activity.
Happy to send a v2 with that test if useful.

This is a liveness bug with potentially unbounded wait on idle promoted
primaries, so it does seem worth back-patching.

Regards,
Surya Poondla

