RE: Fix slotsync worker busy loop causing repeated log messages

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Zhijie Hou (Fujitsu) <[email protected]>
To: Zhijie Hou (Fujitsu) <[email protected]>
To: Amit Kapila <[email protected]>
Cc: Fujii Masao <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: RE: Fix slotsync worker busy loop causing repeated log messages
Date: Fri, 6 Mar 2026 11:04:28 +0000
Message-ID: <TY4PR01MB169079921F88936BDF18F55EB947AA@TY4PR01MB16907.jpnprd01.prod.outlook.com> (raw)
In-Reply-To: <OS7PR01MB169097E45A7A84AF631534306947FA@OS7PR01MB16909.jpnprd01.prod.outlook.com>
References: <CAHGQGwF6zG9Z8ws1yb3hY1VqV-WT7hR0qyXCn2HdbjvZQKufDw@mail.gmail.com>
	<CAA4eK1KLk+TWyNPJ=z6SzQQXySc-N9Gs3eR-QKfV+MX7vfJWiw@mail.gmail.com>
	<OS7PR01MB16909C13530D84781E7C2E2EF947FA@OS7PR01MB16909.jpnprd01.prod.outlook.com>
	<CAA4eK1LGDd2y7Bnj9rHEJLzJx4vThF23+jH9j8bZjuMard9RRA@mail.gmail.com>
	<OS7PR01MB169097E45A7A84AF631534306947FA@OS7PR01MB16909.jpnprd01.prod.outlook.com>

On Tuesday, March 3, 2026 8:25 PM Zhijie Hou (Fujitsu) <[email protected]> wrote:
> > >
> > > I reproduced and debugged this issue where a replication slot's
> > > restart_lsn fails to advance. In my environment, I found it only
> > > occurs when a synced slot first builds a consistent snapshot. The
> > > problematic code path is in
> > > SnapBuildProcessRunningXacts():
> > >
> > >     if (builder->state < SNAPBUILD_CONSISTENT)
> > >     {
> > >         /* returns false if there's no point in performing cleanup just yet */
> > >         if (!SnapBuildFindSnapshot(builder, lsn, running))
> > >             return;
> > >     }
> > >
> > > When a synced slot reaches consistency for the first time with no
> > > running transactions, SnapBuildFindSnapshot() returns false, causing
> > > the function to return without updating the candidate restart_lsn.
> 
> 
> > However, I have a question that even if we haven't incremented it in
> > the first cycle, why is it not incrementing restart_lsn in consecutive sync
> cycles.
> 
> Because no walsender was active during the reproduction step, so the slot
> remained inactive on the publisher and its restart_lsn didn't advanced. As a
> result, the slotsync process stalled while continuously retrying the first cycle
> of snapshot building.

I've analyzed the issues further and identified two distinct cases that can
prevent the slotsync worker from advancing restart_lsn:

1) One is that I mentioned earlier, when the slotsync worker builds a consistent
snapshot for the first time, it does not advance restart_lsn to that LSN nor
serialize the snapshot. If the remote slot on the primary remains unchanged, the
slotsync worker will repeatedly report a WARNING, since it always perceives
itself as building a consistent point for the first time at that same LSN.

The solution I thought is to try to advance the restart_lsn on reaching
consistency, the detailed analysis is as follows:

First is how a consistent snapshot is built first time,

a) a consistent snapshot is built incrementatlly by waiting for running
   transaction to finish.
b) a consistent snapshot is built by restoring a serialized snapshot.
c) a consistent snapshot is built because xl_running_xacts record showed no
   running transactions.

Currently, we do not advance the restart_lsn in all above case, which cause the
restart_lsn fall behind, causing the slotsync worker to repeatedly reporting
LOG. In my reproducation, it reached case c) but case a) and b) can cause the
same issue.

To improve it, I am thinking to advance the restart_lsn in all above cases when
approproate:

For case a), we cannot unconditionaly advance restart_lsn to the LSN of last
xl_running_xact because without collecting all previous transactions info from
the old restart_lsn we could not build a snapshot immediately at this LSN again
after restarting. The solution I thought is to serialize the snapshot in this
case and then advance the restart_lsn, in which case we can avoid collecting all
previous.

For case b) we can direcltly advance the restart_lsn since with a serialized
snapshot we can built reach consistency after restarting as well.

For case c), similar to case b) it's OK to advance the restart_lsn.

In both a) b) c) case, there is one general case where we cannot advance, that
is, when there are transactions decoded but not yet committed before reaching
the consistent point. These transactions data may still be replicated so we
cannot simply advance beyond them.

To implement above, we still need the return value of SnapBuildFindSnapshot() to
distinguish between case a) and b) c). The function returns true for case a
return false for case b) and c). The comment stop SnapBuildFindSnapshot can be
updated because I think the return value is not directly related to whether to
maintain restart_lsn and clean the txn data or not.

See the attachment that implements above.

2) Another reason is when using pg_logical_slot_peek_changes() (including the
binary version) on the primary. This function allows reading WAL beyond the
current confirmed_flush_lsn without actually advancing confirm pos. If the starting
position of an xl_running_xacts record happens to be exactly at the
confirmed_flush_lsn, the function can advance restart_lsn to that point.

However, during slot synchronization, the standby cannot read WAL beyond
confirmed_flush_lsn. This means it cannot access the final xl_running_xacts
record needed to advance restart_lsn to the latest position, causing the
advancement to fail.

In Fuji-San's example:

[PRIMARY]
=# SELECT slot_name, restart_lsn, confirmed_flush_lsn from
pg_replication_slots where slot_name = 'logical_slot';
  slot_name   | restart_lsn | confirmed_flush_lsn
--------------+-------------+---------------------
 logical_slot | 0/03000140  | 0/03000140

[STANDBY]
=# SELECT slot_name, restart_lsn, confirmed_flush_lsn from
pg_replication_slots where slot_name = 'logical_slot';
  slot_name   | restart_lsn | confirmed_flush_lsn
--------------+-------------+---------------------
 logical_slot | 0/03000098  | 0/03000140

0/03000140 is the start position of the last xl_running_xacts.

One way to improve is to change the advancement function to read the last record
as well:

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 603a2b94d05..309feaf2219 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -2134,7 +2134,7 @@ LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
 		InvalidateSystemCaches();
 
 		/* Decode records until we reach the requested target */
-		while (ctx->reader->EndRecPtr < moveto)
+		while (ctx->reader->EndRecPtr <= moveto)


Note that I am not insisting on the approach of changing the advancement, I am
trying to analyze the root reason and explore some alternatives for reference.

Best Regards,
Hou zj




Attachments:

  [application/octet-stream] v3-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch (6.2K, 2-v3-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch)
  download | inline diff:
From af6a25eab3b9e829c312b776c1720c9b81cc159e Mon Sep 17 00:00:00 2001
From: Zhijie Hou <[email protected]>
Date: Tue, 3 Mar 2026 11:44:38 +0800
Subject: [PATCH v3] Advance restart_lsn when reaching consistency without
 waiting

Currently, the replication slot's restart_lsn is not advanced when first time
building a consistent snapshot, even when it's safe to do so. This can lead
to unnecessary retention of WAL segments, though the impact is rare.

This commit advances restart_lsn at the consistency point if either: a
serialized snapshot from a previous decoding session is available and no
tranasctions have been decoded yet, or there were no running transactions when
reaching consistency

In both cases, it's safe and efficient to restart decoding from this LSN,
reducing WAL retention without affecting decoding capabilities.
---
 src/backend/replication/logical/snapbuild.c | 85 +++++++++++++--------
 1 file changed, 54 insertions(+), 31 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7f79621b57e..30ab873b37b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -869,31 +869,34 @@ SnapBuildPurgeOlderTxn(SnapBuild *builder)
 	if (!TransactionIdIsNormal(builder->xmin))
 		return;
 
-	/* TODO: Neater algorithm than just copying and iterating? */
-	workspace =
-		MemoryContextAlloc(builder->context,
-						   builder->committed.xcnt * sizeof(TransactionId));
-
-	/* copy xids that still are interesting to workspace */
-	for (off = 0; off < builder->committed.xcnt; off++)
+	if (builder->committed.xcnt > 0)
 	{
-		if (NormalTransactionIdPrecedes(builder->committed.xip[off],
-										builder->xmin))
-			;					/* remove */
-		else
-			workspace[surviving_xids++] = builder->committed.xip[off];
-	}
+		/* TODO: Neater algorithm than just copying and iterating? */
+		workspace =
+			MemoryContextAlloc(builder->context,
+							   builder->committed.xcnt * sizeof(TransactionId));
+
+		/* copy xids that still are interesting to workspace */
+		for (off = 0; off < builder->committed.xcnt; off++)
+		{
+			if (NormalTransactionIdPrecedes(builder->committed.xip[off],
+											builder->xmin))
+				;					/* remove */
+			else
+				workspace[surviving_xids++] = builder->committed.xip[off];
+		}
 
-	/* copy workspace back to persistent state */
-	memcpy(builder->committed.xip, workspace,
-		   surviving_xids * sizeof(TransactionId));
+		/* copy workspace back to persistent state */
+		memcpy(builder->committed.xip, workspace,
+			   surviving_xids * sizeof(TransactionId));
 
-	elog(DEBUG3, "purged committed transactions from %u to %u, xmin: %u, xmax: %u",
-		 (uint32) builder->committed.xcnt, (uint32) surviving_xids,
-		 builder->xmin, builder->xmax);
-	builder->committed.xcnt = surviving_xids;
+		elog(DEBUG3, "purged committed transactions from %u to %u, xmin: %u, xmax: %u",
+			 (uint32) builder->committed.xcnt, (uint32) surviving_xids,
+			 builder->xmin, builder->xmax);
+		builder->committed.xcnt = surviving_xids;
 
-	pfree(workspace);
+		pfree(workspace);
+	}
 
 	/*
 	 * Purge xids in ->catchange as well. The purged array must also be sorted
@@ -1136,6 +1139,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 {
 	ReorderBufferTXN *txn;
 	TransactionId xmin;
+	bool	incremental_build = false;
 
 	/*
 	 * If we're not consistent yet, inspect the record to see whether it
@@ -1143,12 +1147,18 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * our snapshot so others or we, after a restart, can use it.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
-	{
-		/* returns false if there's no point in performing cleanup just yet */
-		if (!SnapBuildFindSnapshot(builder, lsn, running))
-			return;
-	}
-	else
+		incremental_build = SnapBuildFindSnapshot(builder, lsn, running);
+
+	/*
+	 * Serialize the snapshot only when it was built incrementally.
+	 *
+	 * If we built a consistent snapshot immediately at this LSN, either a
+	 * serialized snapshot from a previous decoding session already exists, or
+	 * there were no running transactions. In both cases, any future decoding
+	 * session can also build a consistent snapshot at this point, so
+	 * serialization is unnecessary.
+	 */
+	if (incremental_build && builder->state == SNAPBUILD_CONSISTENT)
 		SnapBuildSerialize(builder, lsn);
 
 	/*
@@ -1197,8 +1207,10 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 
 	/*
-	 * Can't know about a serialized snapshot's location if we're not
-	 * consistent.
+	 * Cannot advance restart_lsn to a point where a consistent snapshot cannot
+	 * be built immediately in the next decoding round (either by restoring a
+	 * serialized snapshot or by confirming there are no running transactions).
+	 * Doing so could cause data prior to reaching consistency to be lost.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
 		return;
@@ -1221,6 +1233,15 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 			 XLogRecPtrIsValid(builder->last_serialized_snapshot))
 		LogicalIncreaseRestartDecodingForSlot(lsn,
 											  builder->last_serialized_snapshot);
+
+	/*
+	 * With no active transactions, we reached consistency either because the
+	 * xl_running_xacts record showed no running transactions or because we
+	 * restored a serialized snapshot from another decoding session. In either
+	 * case, it's safe to restart from this LSN.
+	 */
+	else if (txn == NULL)
+		LogicalIncreaseRestartDecodingForSlot(lsn, lsn);
 }
 
 
@@ -1230,8 +1251,10 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
  * Helper function for SnapBuildProcessRunningXacts() while we're not yet
  * consistent.
  *
- * Returns true if there is a point in performing internal maintenance/cleanup
- * using the xl_running_xacts record.
+ * Returns true when the snapshot is built incrementally (whether still in
+ * progress or just completed). Returns false when the snapshot is built
+ * immediately either by restoring a serialized snapshot from disk or because
+ * there were no running transactions.
  */
 static bool
 SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running)
-- 
2.51.1.windows.1

view thread (13+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: RE: Fix slotsync worker busy loop causing repeated log messages
  In-Reply-To: <TY4PR01MB169079921F88936BDF18F55EB947AA@TY4PR01MB16907.jpnprd01.prod.outlook.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox