public inbox for [email protected]  
help / color / mirror / Atom feed
Re: Fix slotsync worker busy loop causing repeated log messages
13+ messages / 5 participants
[nested] [flat]

* Re: Fix slotsync worker busy loop causing repeated log messages
@ 2026-02-28 05:03  Amit Kapila <[email protected]>
  0 siblings, 2 replies; 13+ messages in thread

From: Amit Kapila @ 2026-02-28 05:03 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]> wrote:
>
> Normally, the slotsync worker updates the standby slot using the primary's slot
> state. However, when confirmed_flush_lsn matches but restart_lsn does not,
> the worker does not actually update the standby slot. Despite that, the current
> code of update_local_synced_slot() appears to treat this situation as if
> an update occurred. As a result, the worker sleeps only for the minimum
> interval (200 ms) before retrying. In the next cycle, it again assumes
> an update happened, and continues looping with the short sleep interval,
> causing the repeated logical decoding log messages. Based on a quick analysis,
> this seems to be the root cause.
>
> I think update_local_synced_slot() should return false (i.e., no update
> happened) when confirmed_flush_lsn is equal but restart_lsn differs between
> primary and standby.
>

We expect that in such a case update_local_synced_slot() should
advance local_slot's 'restart_lsn' via
LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
cheap code path next time. Normally, restart_lsn advancement should
happen when we process XLOG_RUNNING_XACTS and call
SnapBuildProcessRunningXacts(). In this particular case as both
restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
machinery may not be processing XLOG_RUNNING_XACTS record. I have not
debugged the exact case yet but you can try by emitting some more
records on publisher, it should let the standby advance the slot. It
is possible that we can do something like you are proposing to silence
the LOG messages but we should know what is going on here.

-- 
With Regards,
Amit Kapila.






^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* RE: Fix slotsync worker busy loop causing repeated log messages
@ 2026-03-03 07:42  Zhijie Hou (Fujitsu) <[email protected]>
  parent: Amit Kapila <[email protected]>
  1 sibling, 1 reply; 13+ messages in thread

From: Zhijie Hou (Fujitsu) @ 2026-03-03 07:42 UTC (permalink / raw)
  To: Amit Kapila <[email protected]>; Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Saturday, February 28, 2026 1:03 PM Amit Kapila <[email protected]> wrote:
> On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]> wrote:
> >
> > Normally, the slotsync worker updates the standby slot using the
> > primary's slot state. However, when confirmed_flush_lsn matches but
> > restart_lsn does not, the worker does not actually update the standby
> > slot. Despite that, the current code of update_local_synced_slot()
> > appears to treat this situation as if an update occurred. As a result,
> > the worker sleeps only for the minimum interval (200 ms) before
> > retrying. In the next cycle, it again assumes an update happened, and
> > continues looping with the short sleep interval, causing the repeated
> > logical decoding log messages. Based on a quick analysis, this seems to be
> the root cause.
> >
> > I think update_local_synced_slot() should return false (i.e., no
> > update
> > happened) when confirmed_flush_lsn is equal but restart_lsn differs
> > between primary and standby.
> >
> 
> We expect that in such a case update_local_synced_slot() should advance
> local_slot's 'restart_lsn' via LogicalSlotAdvanceAndCheckSnapState(),
> otherwise, it won't go in the cheap code path next time. Normally, restart_lsn
> advancement should happen when we process XLOG_RUNNING_XACTS and
> call SnapBuildProcessRunningXacts(). In this particular case as both
> restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> machinery may not be processing XLOG_RUNNING_XACTS record. I have not
> debugged the exact case yet but you can try by emitting some more records
> on publisher, it should let the standby advance the slot. It is possible that we
> can do something like you are proposing to silence the LOG messages but we
> should know what is going on here.

I reproduced and debugged this issue where a replication slot's restart_lsn
fails to advance. In my environment, I found it only occurs when a synced
slot first builds a consistent snapshot. The problematic code path is in
SnapBuildProcessRunningXacts():

    if (builder->state < SNAPBUILD_CONSISTENT)
    {
        /* returns false if there's no point in performing cleanup just yet */
        if (!SnapBuildFindSnapshot(builder, lsn, running))
            return;
    }

When a synced slot reaches consistency for the first time with no running
transactions, SnapBuildFindSnapshot() returns false, causing the function to
return without updating the candidate restart_lsn.

So, an alternative approach is to improve this logic by updating the candidate
restart_lsn in this case instead of returning early. See the attached patch for
details. This can fix the issue on my machine.

Best Regards,
Hou zj


Attachments:

  [application/octet-stream] v1-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch (3.5K, 2-v1-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch)
  download | inline diff:
From 05f4ea29319637df578a92e90df5d24919cc2f79 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <[email protected]>
Date: Tue, 3 Mar 2026 11:44:38 +0800
Subject: [PATCH v1] Advance restart_lsn when reaching consistency without
 waiting

Currently, the replication slot's restart_lsn is not advanced when first time
building a consistent snapshot, even when it's safe to do so. This can lead
to unnecessary retention of WAL segments, though the impact is rare.

This commit advances restart_lsn at the consistency point if either:
a serialized snapshot from a previous decoding session is available, or
there were no running transactions when reaching consistency

In both cases, it's safe and efficient to restart decoding from this LSN,
reducing WAL retention without affecting decoding capabilities.
---
 src/backend/replication/logical/snapbuild.c | 27 +++++++++++++++------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7f79621b57e..490b948267b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1136,6 +1136,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 {
 	ReorderBufferTXN *txn;
 	TransactionId xmin;
+	bool	snapshot_built_immediately = false;
 
 	/*
 	 * If we're not consistent yet, inspect the record to see whether it
@@ -1143,11 +1144,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * our snapshot so others or we, after a restart, can use it.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
-	{
-		/* returns false if there's no point in performing cleanup just yet */
-		if (!SnapBuildFindSnapshot(builder, lsn, running))
-			return;
-	}
+		snapshot_built_immediately = !SnapBuildFindSnapshot(builder, lsn, running);
 	else
 		SnapBuildSerialize(builder, lsn);
 
@@ -1165,8 +1162,15 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	builder->xmin = running->oldestRunningXid;
 
-	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeOlderTxn(builder);
+	/*
+	 * Remove transactions we don't need to keep track off anymore.
+	 *
+	 * Cleanup is skipped if this is the first time we built a consistent
+	 * snapshot and we didn't wait for any transactions. In that case, no
+	 * transaction data has accumulated.
+	 */
+	if (!snapshot_built_immediately)
+		SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1211,7 +1215,6 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	if (txn != NULL && XLogRecPtrIsValid(txn->restart_decoding_lsn))
 		LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);
-
 	/*
 	 * No in-progress transaction, can reuse the last serialized snapshot if
 	 * we have one.
@@ -1221,6 +1224,14 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 			 XLogRecPtrIsValid(builder->last_serialized_snapshot))
 		LogicalIncreaseRestartDecodingForSlot(lsn,
 											  builder->last_serialized_snapshot);
+	/*
+	 * If we built a snapshot immediately at this LSN, either a serialized
+	 * snapshot from a different decoding session is available or there were no
+	 * running transactions. In either case, it's safe and efficient to restart
+	 * from this LSN next time.
+	 */
+	else if (snapshot_built_immediately)
+		LogicalIncreaseRestartDecodingForSlot(lsn, lsn);
 }
 
 
-- 
2.51.1.windows.1



^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Fix slotsync worker busy loop causing repeated log messages
@ 2026-03-03 10:25  Amit Kapila <[email protected]>
  parent: Zhijie Hou (Fujitsu) <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Amit Kapila @ 2026-03-03 10:25 UTC (permalink / raw)
  To: Zhijie Hou (Fujitsu) <[email protected]>; +Cc: Fujii Masao <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, Mar 3, 2026 at 1:12 PM Zhijie Hou (Fujitsu)
<[email protected]> wrote:
>
> On Saturday, February 28, 2026 1:03 PM Amit Kapila <[email protected]> wrote:
> > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]> wrote:
> > >
> > > Normally, the slotsync worker updates the standby slot using the
> > > primary's slot state. However, when confirmed_flush_lsn matches but
> > > restart_lsn does not, the worker does not actually update the standby
> > > slot. Despite that, the current code of update_local_synced_slot()
> > > appears to treat this situation as if an update occurred. As a result,
> > > the worker sleeps only for the minimum interval (200 ms) before
> > > retrying. In the next cycle, it again assumes an update happened, and
> > > continues looping with the short sleep interval, causing the repeated
> > > logical decoding log messages. Based on a quick analysis, this seems to be
> > the root cause.
> > >
> > > I think update_local_synced_slot() should return false (i.e., no
> > > update
> > > happened) when confirmed_flush_lsn is equal but restart_lsn differs
> > > between primary and standby.
> > >
> >
> > We expect that in such a case update_local_synced_slot() should advance
> > local_slot's 'restart_lsn' via LogicalSlotAdvanceAndCheckSnapState(),
> > otherwise, it won't go in the cheap code path next time. Normally, restart_lsn
> > advancement should happen when we process XLOG_RUNNING_XACTS and
> > call SnapBuildProcessRunningXacts(). In this particular case as both
> > restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> > machinery may not be processing XLOG_RUNNING_XACTS record. I have not
> > debugged the exact case yet but you can try by emitting some more records
> > on publisher, it should let the standby advance the slot. It is possible that we
> > can do something like you are proposing to silence the LOG messages but we
> > should know what is going on here.
>
> I reproduced and debugged this issue where a replication slot's restart_lsn
> fails to advance. In my environment, I found it only occurs when a synced
> slot first builds a consistent snapshot. The problematic code path is in
> SnapBuildProcessRunningXacts():
>
>     if (builder->state < SNAPBUILD_CONSISTENT)
>     {
>         /* returns false if there's no point in performing cleanup just yet */
>         if (!SnapBuildFindSnapshot(builder, lsn, running))
>             return;
>     }
>
> When a synced slot reaches consistency for the first time with no running
> transactions, SnapBuildFindSnapshot() returns false, causing the function to
> return without updating the candidate restart_lsn.
>
> So, an alternative approach is to improve this logic by updating the candidate
> restart_lsn in this case instead of returning early.
>

But why not return 'true' from SnapBuildFindSnapshot() in that case?
The comment atop SnapBuildFindSnapshot() says: "Returns true if there
is a point in performing internal maintenance/cleanup using the
xl_running_xacts record.". Doesn't updating restart_lsn fall under
that category?

However, I have a question that even if we haven't incremented it in
the first cycle, why is it not incrementing restart_lsn in consecutive
sync cycles.

-- 
With Regards,
Amit Kapila.





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* RE: Fix slotsync worker busy loop causing repeated log messages
@ 2026-03-03 12:24  Zhijie Hou (Fujitsu) <[email protected]>
  parent: Amit Kapila <[email protected]>
  0 siblings, 2 replies; 13+ messages in thread

From: Zhijie Hou (Fujitsu) @ 2026-03-03 12:24 UTC (permalink / raw)
  To: Amit Kapila <[email protected]>; +Cc: Fujii Masao <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tuesday, March 3, 2026 6:25 PM Amit Kapila <[email protected]> wrote:
> On Tue, Mar 3, 2026 at 1:12 PM Zhijie Hou (Fujitsu) <[email protected]>
> wrote:
> >
> > On Saturday, February 28, 2026 1:03 PM Amit Kapila
> <[email protected]> wrote:
> > > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]>
> wrote:
> > > >
> > > > Normally, the slotsync worker updates the standby slot using the
> > > > primary's slot state. However, when confirmed_flush_lsn matches
> > > > but restart_lsn does not, the worker does not actually update the
> > > > standby slot. Despite that, the current code of
> > > > update_local_synced_slot() appears to treat this situation as if
> > > > an update occurred. As a result, the worker sleeps only for the
> > > > minimum interval (200 ms) before retrying. In the next cycle, it
> > > > again assumes an update happened, and continues looping with the
> > > > short sleep interval, causing the repeated logical decoding log
> > > > messages. Based on a quick analysis, this seems to be
> > > the root cause.
> > > >
> > > > I think update_local_synced_slot() should return false (i.e., no
> > > > update
> > > > happened) when confirmed_flush_lsn is equal but restart_lsn
> > > > differs between primary and standby.
> > > >
> > >
> > > We expect that in such a case update_local_synced_slot() should
> > > advance local_slot's 'restart_lsn' via
> > > LogicalSlotAdvanceAndCheckSnapState(),
> > > otherwise, it won't go in the cheap code path next time. Normally,
> > > restart_lsn advancement should happen when we process
> > > XLOG_RUNNING_XACTS and call SnapBuildProcessRunningXacts(). In this
> > > particular case as both restart_lsn and confirmed_flush_lsn are the
> > > same (0/03000140), the machinery may not be processing
> > > XLOG_RUNNING_XACTS record. I have not debugged the exact case yet
> > > but you can try by emitting some more records on publisher, it
> > > should let the standby advance the slot. It is possible that we can
> > > do something like you are proposing to silence the LOG messages but we
> should know what is going on here.
> >
> > I reproduced and debugged this issue where a replication slot's
> > restart_lsn fails to advance. In my environment, I found it only
> > occurs when a synced slot first builds a consistent snapshot. The
> > problematic code path is in
> > SnapBuildProcessRunningXacts():
> >
> >     if (builder->state < SNAPBUILD_CONSISTENT)
> >     {
> >         /* returns false if there's no point in performing cleanup just yet */
> >         if (!SnapBuildFindSnapshot(builder, lsn, running))
> >             return;
> >     }
> >
> > When a synced slot reaches consistency for the first time with no
> > running transactions, SnapBuildFindSnapshot() returns false, causing
> > the function to return without updating the candidate restart_lsn.
> >
> > So, an alternative approach is to improve this logic by updating the
> > candidate restart_lsn in this case instead of returning early.
> >
> 
> But why not return 'true' from SnapBuildFindSnapshot() in that case?
> The comment atop SnapBuildFindSnapshot() says: "Returns true if there is a
> point in performing internal maintenance/cleanup using the xl_running_xacts
> record.". Doesn't updating restart_lsn fall under that category?

After re-thinking the return value semantics, I realized that
SnapBuildFindSnapshot() may no longer need a return value. In the revised
logic, the function would always return true, making the return value
redundant.

To elaborate: the original SnapBuildFindSnapshot() returned:
- true: when snapshot is built incrementally (after waiting for running
  transactions to complete)
- false: when snapshot is built immediately (restoring from disk or no
  running transactions)

My initial thought was to use this return value to decide whether to advance
restart_lsn to the latest LSN (advance when restoring a snapshot or with no running
transactions). However, this decision can be made without relying on the
return value (please see the v2 patch).

Therefore, we can change the function return type to void. The only other
consumer of this return value is SnapBuildPurgeOlderTxn(), which can instead
skip cleanup when no transaction data has accumulated (i.e., when the number
of committed transactions is zero), eliminating the need for the return value
altogether.

> However, I have a question that even if we haven't incremented it in the first
> cycle, why is it not incrementing restart_lsn in consecutive sync cycles.

Because no walsender was active during the reproduction step, so the slot remained
inactive on the publisher and its restart_lsn didn't advanced. As a result, the
slotsync process stalled while continuously retrying the first cycle of
snapshot building.

Best Regards,
Hou zj


Attachments:

  [application/octet-stream] v2-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch (6.8K, 2-v2-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch)
  download | inline diff:
From 19612c969d68089d6ddca6f78910ccefc556f35a Mon Sep 17 00:00:00 2001
From: Zhijie Hou <[email protected]>
Date: Tue, 3 Mar 2026 11:44:38 +0800
Subject: [PATCH v3] Advance restart_lsn when reaching consistency without
 waiting

Currently, the replication slot's restart_lsn is not advanced when first time
building a consistent snapshot, even when it's safe to do so. This can lead
to unnecessary retention of WAL segments, though the impact is rare.

This commit advances restart_lsn at the consistency point if either:
a serialized snapshot from a previous decoding session is available, or
there were no running transactions when reaching consistency

In both cases, it's safe and efficient to restart decoding from this LSN,
reducing WAL retention without affecting decoding capabilities.
---
 src/backend/replication/logical/snapbuild.c | 86 ++++++++++-----------
 1 file changed, 41 insertions(+), 45 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7f79621b57e..252894dc090 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -167,7 +167,7 @@ static inline bool SnapBuildXidHasCatalogChanges(SnapBuild *builder, Transaction
 												 uint32 xinfo);
 
 /* xlog reading helper functions for SnapBuildProcessRunningXacts */
-static bool SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
+static void SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running);
 static void SnapBuildWaitSnapshot(xl_running_xacts *running, TransactionId cutoff);
 
 /* serialization functions */
@@ -869,31 +869,34 @@ SnapBuildPurgeOlderTxn(SnapBuild *builder)
 	if (!TransactionIdIsNormal(builder->xmin))
 		return;
 
-	/* TODO: Neater algorithm than just copying and iterating? */
-	workspace =
-		MemoryContextAlloc(builder->context,
-						   builder->committed.xcnt * sizeof(TransactionId));
-
-	/* copy xids that still are interesting to workspace */
-	for (off = 0; off < builder->committed.xcnt; off++)
+	if (builder->committed.xcnt > 0)
 	{
-		if (NormalTransactionIdPrecedes(builder->committed.xip[off],
-										builder->xmin))
-			;					/* remove */
-		else
-			workspace[surviving_xids++] = builder->committed.xip[off];
-	}
+		/* TODO: Neater algorithm than just copying and iterating? */
+		workspace =
+			MemoryContextAlloc(builder->context,
+							   builder->committed.xcnt * sizeof(TransactionId));
+
+		/* copy xids that still are interesting to workspace */
+		for (off = 0; off < builder->committed.xcnt; off++)
+		{
+			if (NormalTransactionIdPrecedes(builder->committed.xip[off],
+											builder->xmin))
+				;					/* remove */
+			else
+				workspace[surviving_xids++] = builder->committed.xip[off];
+		}
 
-	/* copy workspace back to persistent state */
-	memcpy(builder->committed.xip, workspace,
-		   surviving_xids * sizeof(TransactionId));
+		/* copy workspace back to persistent state */
+		memcpy(builder->committed.xip, workspace,
+			   surviving_xids * sizeof(TransactionId));
 
-	elog(DEBUG3, "purged committed transactions from %u to %u, xmin: %u, xmax: %u",
-		 (uint32) builder->committed.xcnt, (uint32) surviving_xids,
-		 builder->xmin, builder->xmax);
-	builder->committed.xcnt = surviving_xids;
+		elog(DEBUG3, "purged committed transactions from %u to %u, xmin: %u, xmax: %u",
+			 (uint32) builder->committed.xcnt, (uint32) surviving_xids,
+			 builder->xmin, builder->xmax);
+		builder->committed.xcnt = surviving_xids;
 
-	pfree(workspace);
+		pfree(workspace);
+	}
 
 	/*
 	 * Purge xids in ->catchange as well. The purged array must also be sorted
@@ -1143,11 +1146,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * our snapshot so others or we, after a restart, can use it.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
-	{
-		/* returns false if there's no point in performing cleanup just yet */
-		if (!SnapBuildFindSnapshot(builder, lsn, running))
-			return;
-	}
+		SnapBuildFindSnapshot(builder, lsn, running);
 	else
 		SnapBuildSerialize(builder, lsn);
 
@@ -1197,8 +1196,10 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 
 	/*
-	 * Can't know about a serialized snapshot's location if we're not
-	 * consistent.
+	 * Cannot advance restart_lsn to a point where a consistent snapshot cannot
+	 * be built immediately in the next decoding round (either by restoring a
+	 * serialized snapshot or by confirming there are no running transactions).
+	 * Doing so could cause data prior to reaching consistency to be lost.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
 		return;
@@ -1221,6 +1222,15 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 			 XLogRecPtrIsValid(builder->last_serialized_snapshot))
 		LogicalIncreaseRestartDecodingForSlot(lsn,
 											  builder->last_serialized_snapshot);
+
+	/*
+	 * Reaching here indicates we built the snapshot either by restoring a
+	 * serialized snapshot from a previous decoding session or because there
+	 * were no running transactions. In either case, it's safe and efficient to
+	 * restart from this LSN next time.
+	 */
+	else
+		LogicalIncreaseRestartDecodingForSlot(lsn, lsn);
 }
 
 
@@ -1229,11 +1239,8 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
  *
  * Helper function for SnapBuildProcessRunningXacts() while we're not yet
  * consistent.
- *
- * Returns true if there is a point in performing internal maintenance/cleanup
- * using the xl_running_xacts record.
  */
-static bool
+static void
 SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running)
 {
 	/* ---
@@ -1278,7 +1285,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 
 		SnapBuildWaitSnapshot(running, builder->initial_xmin_horizon);
 
-		return true;
+		return;
 	}
 
 	/*
@@ -1312,8 +1319,6 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 				errmsg("logical decoding found consistent point at %X/%08X",
 					   LSN_FORMAT_ARGS(lsn)),
 				errdetail("There are no running transactions."));
-
-		return false;
 	}
 
 	/*
@@ -1324,8 +1329,6 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 			 !builder->in_slot_creation &&
 			 SnapBuildRestore(builder, lsn))
 	{
-		/* there won't be any state to cleanup */
-		return false;
 	}
 
 	/*
@@ -1410,13 +1413,6 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 					   LSN_FORMAT_ARGS(lsn)),
 				errdetail("There are no old transactions anymore."));
 	}
-
-	/*
-	 * We already started to track running xacts and need to wait for all
-	 * in-progress ones to finish. We fall through to the normal processing of
-	 * records so incremental cleanup can be performed.
-	 */
-	return true;
 }
 
 /* ---
-- 
2.51.1.windows.1



^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* RE: Fix slotsync worker busy loop causing repeated log messages
@ 2026-03-04 02:59  =?utf-8?B?Y2NhNTUwNw==?= <[email protected]>
  parent: Zhijie Hou (Fujitsu) <[email protected]>
  1 sibling, 0 replies; 13+ messages in thread

From: =?utf-8?B?Y2NhNTUwNw==?= @ 2026-03-04 02:59 UTC (permalink / raw)
  To: =?utf-8?B?WmhpamllIEhvdSAoRnVqaXRzdSk=?= <[email protected]>; =?utf-8?B?QW1pdCBLYXBpbGE=?= <[email protected]>; +Cc: =?utf-8?B?RnVqaWkgTWFzYW8=?= <[email protected]>; =?utf-8?B?UG9zdGdyZVNRTCBIYWNrZXJz?= <[email protected]>

Hi,

I have a small question about this:

```
@@ -1143,11 +1146,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * our snapshot so others or we, after a restart, can use it.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
-	{
-		/* returns false if there's no point in performing cleanup just yet */
-		if (!SnapBuildFindSnapshot(builder, lsn, running))
-			return;
-	}
+		SnapBuildFindSnapshot(builder, lsn, running);
 	else
 		SnapBuildSerialize(builder, lsn);
```

Why we don't call SnapBuildSerialize() when reaching consistent
in SnapBuildFindSnapshot()? I think we can serialize the snapshot
if we are not building a full snapshot, at least.

--
Regards,
ChangAo Chen


^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* RE: Fix slotsync worker busy loop causing repeated log messages
@ 2026-03-06 11:04  Zhijie Hou (Fujitsu) <[email protected]>
  parent: Zhijie Hou (Fujitsu) <[email protected]>
  1 sibling, 0 replies; 13+ messages in thread

From: Zhijie Hou (Fujitsu) @ 2026-03-06 11:04 UTC (permalink / raw)
  To: Zhijie Hou (Fujitsu) <[email protected]>; Amit Kapila <[email protected]>; +Cc: Fujii Masao <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tuesday, March 3, 2026 8:25 PM Zhijie Hou (Fujitsu) <[email protected]> wrote:
> > >
> > > I reproduced and debugged this issue where a replication slot's
> > > restart_lsn fails to advance. In my environment, I found it only
> > > occurs when a synced slot first builds a consistent snapshot. The
> > > problematic code path is in
> > > SnapBuildProcessRunningXacts():
> > >
> > >     if (builder->state < SNAPBUILD_CONSISTENT)
> > >     {
> > >         /* returns false if there's no point in performing cleanup just yet */
> > >         if (!SnapBuildFindSnapshot(builder, lsn, running))
> > >             return;
> > >     }
> > >
> > > When a synced slot reaches consistency for the first time with no
> > > running transactions, SnapBuildFindSnapshot() returns false, causing
> > > the function to return without updating the candidate restart_lsn.
> 
> 
> > However, I have a question that even if we haven't incremented it in
> > the first cycle, why is it not incrementing restart_lsn in consecutive sync
> cycles.
> 
> Because no walsender was active during the reproduction step, so the slot
> remained inactive on the publisher and its restart_lsn didn't advanced. As a
> result, the slotsync process stalled while continuously retrying the first cycle
> of snapshot building.

I've analyzed the issues further and identified two distinct cases that can
prevent the slotsync worker from advancing restart_lsn:

1) One is that I mentioned earlier, when the slotsync worker builds a consistent
snapshot for the first time, it does not advance restart_lsn to that LSN nor
serialize the snapshot. If the remote slot on the primary remains unchanged, the
slotsync worker will repeatedly report a WARNING, since it always perceives
itself as building a consistent point for the first time at that same LSN.

The solution I thought is to try to advance the restart_lsn on reaching
consistency, the detailed analysis is as follows:

First is how a consistent snapshot is built first time,

a) a consistent snapshot is built incrementatlly by waiting for running
   transaction to finish.
b) a consistent snapshot is built by restoring a serialized snapshot.
c) a consistent snapshot is built because xl_running_xacts record showed no
   running transactions.

Currently, we do not advance the restart_lsn in all above case, which cause the
restart_lsn fall behind, causing the slotsync worker to repeatedly reporting
LOG. In my reproducation, it reached case c) but case a) and b) can cause the
same issue.

To improve it, I am thinking to advance the restart_lsn in all above cases when
approproate:

For case a), we cannot unconditionaly advance restart_lsn to the LSN of last
xl_running_xact because without collecting all previous transactions info from
the old restart_lsn we could not build a snapshot immediately at this LSN again
after restarting. The solution I thought is to serialize the snapshot in this
case and then advance the restart_lsn, in which case we can avoid collecting all
previous.

For case b) we can direcltly advance the restart_lsn since with a serialized
snapshot we can built reach consistency after restarting as well.

For case c), similar to case b) it's OK to advance the restart_lsn.

In both a) b) c) case, there is one general case where we cannot advance, that
is, when there are transactions decoded but not yet committed before reaching
the consistent point. These transactions data may still be replicated so we
cannot simply advance beyond them.

To implement above, we still need the return value of SnapBuildFindSnapshot() to
distinguish between case a) and b) c). The function returns true for case a
return false for case b) and c). The comment stop SnapBuildFindSnapshot can be
updated because I think the return value is not directly related to whether to
maintain restart_lsn and clean the txn data or not.

See the attachment that implements above.

2) Another reason is when using pg_logical_slot_peek_changes() (including the
binary version) on the primary. This function allows reading WAL beyond the
current confirmed_flush_lsn without actually advancing confirm pos. If the starting
position of an xl_running_xacts record happens to be exactly at the
confirmed_flush_lsn, the function can advance restart_lsn to that point.

However, during slot synchronization, the standby cannot read WAL beyond
confirmed_flush_lsn. This means it cannot access the final xl_running_xacts
record needed to advance restart_lsn to the latest position, causing the
advancement to fail.

In Fuji-San's example:

[PRIMARY]
=# SELECT slot_name, restart_lsn, confirmed_flush_lsn from
pg_replication_slots where slot_name = 'logical_slot';
  slot_name   | restart_lsn | confirmed_flush_lsn
--------------+-------------+---------------------
 logical_slot | 0/03000140  | 0/03000140

[STANDBY]
=# SELECT slot_name, restart_lsn, confirmed_flush_lsn from
pg_replication_slots where slot_name = 'logical_slot';
  slot_name   | restart_lsn | confirmed_flush_lsn
--------------+-------------+---------------------
 logical_slot | 0/03000098  | 0/03000140

0/03000140 is the start position of the last xl_running_xacts.

One way to improve is to change the advancement function to read the last record
as well:

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 603a2b94d05..309feaf2219 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -2134,7 +2134,7 @@ LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
 		InvalidateSystemCaches();
 
 		/* Decode records until we reach the requested target */
-		while (ctx->reader->EndRecPtr < moveto)
+		while (ctx->reader->EndRecPtr <= moveto)


Note that I am not insisting on the approach of changing the advancement, I am
trying to analyze the root reason and explore some alternatives for reference.

Best Regards,
Hou zj




Attachments:

  [application/octet-stream] v3-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch (6.2K, 2-v3-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch)
  download | inline diff:
From af6a25eab3b9e829c312b776c1720c9b81cc159e Mon Sep 17 00:00:00 2001
From: Zhijie Hou <[email protected]>
Date: Tue, 3 Mar 2026 11:44:38 +0800
Subject: [PATCH v3] Advance restart_lsn when reaching consistency without
 waiting

Currently, the replication slot's restart_lsn is not advanced when first time
building a consistent snapshot, even when it's safe to do so. This can lead
to unnecessary retention of WAL segments, though the impact is rare.

This commit advances restart_lsn at the consistency point if either: a
serialized snapshot from a previous decoding session is available and no
tranasctions have been decoded yet, or there were no running transactions when
reaching consistency

In both cases, it's safe and efficient to restart decoding from this LSN,
reducing WAL retention without affecting decoding capabilities.
---
 src/backend/replication/logical/snapbuild.c | 85 +++++++++++++--------
 1 file changed, 54 insertions(+), 31 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7f79621b57e..30ab873b37b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -869,31 +869,34 @@ SnapBuildPurgeOlderTxn(SnapBuild *builder)
 	if (!TransactionIdIsNormal(builder->xmin))
 		return;
 
-	/* TODO: Neater algorithm than just copying and iterating? */
-	workspace =
-		MemoryContextAlloc(builder->context,
-						   builder->committed.xcnt * sizeof(TransactionId));
-
-	/* copy xids that still are interesting to workspace */
-	for (off = 0; off < builder->committed.xcnt; off++)
+	if (builder->committed.xcnt > 0)
 	{
-		if (NormalTransactionIdPrecedes(builder->committed.xip[off],
-										builder->xmin))
-			;					/* remove */
-		else
-			workspace[surviving_xids++] = builder->committed.xip[off];
-	}
+		/* TODO: Neater algorithm than just copying and iterating? */
+		workspace =
+			MemoryContextAlloc(builder->context,
+							   builder->committed.xcnt * sizeof(TransactionId));
+
+		/* copy xids that still are interesting to workspace */
+		for (off = 0; off < builder->committed.xcnt; off++)
+		{
+			if (NormalTransactionIdPrecedes(builder->committed.xip[off],
+											builder->xmin))
+				;					/* remove */
+			else
+				workspace[surviving_xids++] = builder->committed.xip[off];
+		}
 
-	/* copy workspace back to persistent state */
-	memcpy(builder->committed.xip, workspace,
-		   surviving_xids * sizeof(TransactionId));
+		/* copy workspace back to persistent state */
+		memcpy(builder->committed.xip, workspace,
+			   surviving_xids * sizeof(TransactionId));
 
-	elog(DEBUG3, "purged committed transactions from %u to %u, xmin: %u, xmax: %u",
-		 (uint32) builder->committed.xcnt, (uint32) surviving_xids,
-		 builder->xmin, builder->xmax);
-	builder->committed.xcnt = surviving_xids;
+		elog(DEBUG3, "purged committed transactions from %u to %u, xmin: %u, xmax: %u",
+			 (uint32) builder->committed.xcnt, (uint32) surviving_xids,
+			 builder->xmin, builder->xmax);
+		builder->committed.xcnt = surviving_xids;
 
-	pfree(workspace);
+		pfree(workspace);
+	}
 
 	/*
 	 * Purge xids in ->catchange as well. The purged array must also be sorted
@@ -1136,6 +1139,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 {
 	ReorderBufferTXN *txn;
 	TransactionId xmin;
+	bool	incremental_build = false;
 
 	/*
 	 * If we're not consistent yet, inspect the record to see whether it
@@ -1143,12 +1147,18 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * our snapshot so others or we, after a restart, can use it.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
-	{
-		/* returns false if there's no point in performing cleanup just yet */
-		if (!SnapBuildFindSnapshot(builder, lsn, running))
-			return;
-	}
-	else
+		incremental_build = SnapBuildFindSnapshot(builder, lsn, running);
+
+	/*
+	 * Serialize the snapshot only when it was built incrementally.
+	 *
+	 * If we built a consistent snapshot immediately at this LSN, either a
+	 * serialized snapshot from a previous decoding session already exists, or
+	 * there were no running transactions. In both cases, any future decoding
+	 * session can also build a consistent snapshot at this point, so
+	 * serialization is unnecessary.
+	 */
+	if (incremental_build && builder->state == SNAPBUILD_CONSISTENT)
 		SnapBuildSerialize(builder, lsn);
 
 	/*
@@ -1197,8 +1207,10 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 
 	/*
-	 * Can't know about a serialized snapshot's location if we're not
-	 * consistent.
+	 * Cannot advance restart_lsn to a point where a consistent snapshot cannot
+	 * be built immediately in the next decoding round (either by restoring a
+	 * serialized snapshot or by confirming there are no running transactions).
+	 * Doing so could cause data prior to reaching consistency to be lost.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
 		return;
@@ -1221,6 +1233,15 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 			 XLogRecPtrIsValid(builder->last_serialized_snapshot))
 		LogicalIncreaseRestartDecodingForSlot(lsn,
 											  builder->last_serialized_snapshot);
+
+	/*
+	 * With no active transactions, we reached consistency either because the
+	 * xl_running_xacts record showed no running transactions or because we
+	 * restored a serialized snapshot from another decoding session. In either
+	 * case, it's safe to restart from this LSN.
+	 */
+	else if (txn == NULL)
+		LogicalIncreaseRestartDecodingForSlot(lsn, lsn);
 }
 
 
@@ -1230,8 +1251,10 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
  * Helper function for SnapBuildProcessRunningXacts() while we're not yet
  * consistent.
  *
- * Returns true if there is a point in performing internal maintenance/cleanup
- * using the xl_running_xacts record.
+ * Returns true when the snapshot is built incrementally (whether still in
+ * progress or just completed). Returns false when the snapshot is built
+ * immediately either by restoring a serialized snapshot from disk or because
+ * there were no running transactions.
  */
 static bool
 SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *running)
-- 
2.51.1.windows.1



^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Fix slotsync worker busy loop causing repeated log messages
@ 2026-03-19 13:38  Fujii Masao <[email protected]>
  parent: Amit Kapila <[email protected]>
  1 sibling, 1 reply; 13+ messages in thread

From: Fujii Masao @ 2026-03-19 13:38 UTC (permalink / raw)
  To: Amit Kapila <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <[email protected]> wrote:
>
> On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]> wrote:
> >
> > Normally, the slotsync worker updates the standby slot using the primary's slot
> > state. However, when confirmed_flush_lsn matches but restart_lsn does not,
> > the worker does not actually update the standby slot. Despite that, the current
> > code of update_local_synced_slot() appears to treat this situation as if
> > an update occurred. As a result, the worker sleeps only for the minimum
> > interval (200 ms) before retrying. In the next cycle, it again assumes
> > an update happened, and continues looping with the short sleep interval,
> > causing the repeated logical decoding log messages. Based on a quick analysis,
> > this seems to be the root cause.
> >
> > I think update_local_synced_slot() should return false (i.e., no update
> > happened) when confirmed_flush_lsn is equal but restart_lsn differs between
> > primary and standby.
> >
>
> We expect that in such a case update_local_synced_slot() should
> advance local_slot's 'restart_lsn' via
> LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
> cheap code path next time. Normally, restart_lsn advancement should
> happen when we process XLOG_RUNNING_XACTS and call
> SnapBuildProcessRunningXacts(). In this particular case as both
> restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> machinery may not be processing XLOG_RUNNING_XACTS record. I have not
> debugged the exact case yet but you can try by emitting some more
> records on publisher, it should let the standby advance the slot. It
> is possible that we can do something like you are proposing to silence
> the LOG messages but we should know what is going on here.

I agree it's important to investigate why restart_lsn doesn't advance
in that case and fix that issue.

Separately, I think mistreating the slot as updated is another problem:
the local slot is treated as updated even when it isn't, which causes
the slotsync worker to use the minimum sleep interval. We should
address this independently.

LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local slot,
but update_local_synced_slot() currently assumes it does. That seems to be
the root cause. The attached 0001 patch makes
LogicalSlotAdvanceAndCheckSnapState() return whether the slot was actually
updated, and updates update_local_synced_slot() to act based on that result.

I've also attached 0002, which reduces the log level of some logical decoding
messages (e.g., "logical decoding found consistent point") from LOG to
DEBUG1. These are low-level, developer-oriented messages and can be noisy
when they occur regularly. This is the same issue I reported earlier at [1].

How about applying these patches while we continue working on
the restart_lsn issue?

Regards,

[1] https://postgr.es/m/CAHGQGwED5kJxZ_MdCh=WLa5M7ekXdnzCO1ZCQhQNCEdfaEPwFQ@mail.gmail.com

-- 
Fujii Masao


Attachments:

  [application/octet-stream] v2-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch (6.6K, 2-v2-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch)
  download | inline diff:
From 4e8fbce7accd6872c09ff0da02bfd0d7ae51b1a4 Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Thu, 19 Mar 2026 17:34:55 +0900
Subject: [PATCH v2 1/2] Fix slotsync worker busy loop causing repeated logical
 decoding logs.

Previously, the slotsync worker could enter a busy loop and emit four logical
log messages every 200 ms, even when both the primary and standby were idle.

This happened because the worker incorrectly treated certain cases as
successful slot updates, causing it to use the minimum sleep interval and
repeatedly restart slot syncing.

This commit fixes this by ensuring the worker does not treat such cases as
updates, allowing it to sleep normally and avoid excessive log output.
---
 src/backend/replication/logical/logical.c  | 32 ++++++++++++++++++----
 src/backend/replication/logical/slotsync.c |  7 +++--
 src/backend/replication/slotfuncs.c        |  2 +-
 src/include/replication/logical.h          |  5 ++--
 4 files changed, 35 insertions(+), 11 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 603a2b94d05..9a3b0020040 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1806,18 +1806,22 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
 
 /*
  * Handle a consumer's confirmation having received all changes up to lsn.
+ *
+ * Return TRUE if the local slot is updated.
  */
-void
+bool
 LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 {
+	bool		updated_confirmed = false;
+	bool		updated_xmin = false;
+	bool		updated_restart = false;
+
 	Assert(XLogRecPtrIsValid(lsn));
 
 	/* Do an unlocked check for candidate_lsn first. */
 	if (XLogRecPtrIsValid(MyReplicationSlot->candidate_xmin_lsn) ||
 		XLogRecPtrIsValid(MyReplicationSlot->candidate_restart_valid))
 	{
-		bool		updated_xmin = false;
-		bool		updated_restart = false;
 		XLogRecPtr	restart_lsn pg_attribute_unused();
 
 		SpinLockAcquire(&MyReplicationSlot->mutex);
@@ -1837,7 +1841,10 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		 * LSN. See similar case handling in CreateDecodingContext.
 		 */
 		if (lsn > MyReplicationSlot->data.confirmed_flush)
+		{
 			MyReplicationSlot->data.confirmed_flush = lsn;
+			updated_confirmed = true;
+		}
 
 		/* if we're past the location required for bumping xmin, do so */
 		if (XLogRecPtrIsValid(MyReplicationSlot->candidate_xmin_lsn) &&
@@ -1920,10 +1927,15 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 		 * for the details.
 		 */
 		if (lsn > MyReplicationSlot->data.confirmed_flush)
+		{
 			MyReplicationSlot->data.confirmed_flush = lsn;
+			updated_confirmed = true;
+		}
 
 		SpinLockRelease(&MyReplicationSlot->mutex);
 	}
+
+	return updated_confirmed || updated_xmin || updated_restart;
 }
 
 /*
@@ -2089,10 +2101,14 @@ LogicalReplicationSlotCheckPendingWal(XLogRecPtr end_of_wal,
  *
  * *found_consistent_snapshot will be true if the initial decoding snapshot has
  * been built; Otherwise, it will be false.
+ *
+ * *updated_xmin_or_lsn will be true if the local slot is updated; Otherwise,
+ * it will be false.
  */
 XLogRecPtr
 LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
-									bool *found_consistent_snapshot)
+									bool *found_consistent_snapshot,
+									bool *updated_xmin_or_lsn)
 {
 	LogicalDecodingContext *ctx;
 	ResourceOwner old_resowner PG_USED_FOR_ASSERTS_ONLY = CurrentResourceOwner;
@@ -2103,6 +2119,9 @@ LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
 	if (found_consistent_snapshot)
 		*found_consistent_snapshot = false;
 
+	if (updated_xmin_or_lsn)
+		*updated_xmin_or_lsn = false;
+
 	PG_TRY();
 	{
 		/*
@@ -2174,7 +2193,10 @@ LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
 
 		if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
 		{
-			LogicalConfirmReceivedLocation(moveto);
+			bool		slot_updated = LogicalConfirmReceivedLocation(moveto);
+
+			if (updated_xmin_or_lsn)
+				*updated_xmin_or_lsn = slot_updated;
 
 			/*
 			 * If only the confirmed_flush LSN has changed the slot won't get
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..3e70cfb0a3e 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -315,13 +315,16 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 			slot->data.confirmed_flush = remote_slot->confirmed_lsn;
 			slot->data.catalog_xmin = remote_slot->catalog_xmin;
 			SpinLockRelease(&slot->mutex);
+
+			updated_xmin_or_lsn = true;
 		}
 		else
 		{
 			bool		found_consistent_snapshot;
 
 			LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
-												&found_consistent_snapshot);
+												&found_consistent_snapshot,
+												&updated_xmin_or_lsn);
 
 			/* Sanity check */
 			if (slot->data.confirmed_flush != remote_slot->confirmed_lsn)
@@ -349,8 +352,6 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 				skip_reason = SS_SKIP_NO_CONSISTENT_SNAPSHOT;
 			}
 		}
-
-		updated_xmin_or_lsn = true;
 	}
 
 	/* Update slot sync skip stats */
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9f5e4f998fe..b6846bb313e 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -524,7 +524,7 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 static XLogRecPtr
 pg_logical_replication_slot_advance(XLogRecPtr moveto)
 {
-	return LogicalSlotAdvanceAndCheckSnapState(moveto, NULL);
+	return LogicalSlotAdvanceAndCheckSnapState(moveto, NULL, NULL);
 }
 
 /*
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index bc9d4ece672..0208f9ca6a2 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -140,7 +140,7 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr current_lsn,
 									   TransactionId xmin);
 extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
 												  XLogRecPtr restart_lsn);
-extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalConfirmReceivedLocation(XLogRecPtr lsn);
 
 extern bool filter_prepare_cb_wrapper(LogicalDecodingContext *ctx,
 									  TransactionId xid, const char *gid);
@@ -151,6 +151,7 @@ extern void UpdateDecodingStats(LogicalDecodingContext *ctx);
 extern XLogRecPtr LogicalReplicationSlotCheckPendingWal(XLogRecPtr end_of_wal,
 														XLogRecPtr scan_cutoff_lsn);
 extern XLogRecPtr LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
-													  bool *found_consistent_snapshot);
+													  bool *found_consistent_snapshot,
+													  bool *updated_xmin_or_lsn);
 
 #endif
-- 
2.51.2



  [application/octet-stream] v2-0002-Reduce-log-level-of-some-logical-decoding-message.patch (3.3K, 3-v2-0002-Reduce-log-level-of-some-logical-decoding-message.patch)
  download | inline diff:
From 87dfed10facb2d3ea692db72c2e0963c19d7c70f Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Thu, 19 Mar 2026 20:32:43 +0900
Subject: [PATCH v2 2/2] Reduce log level of some logical decoding messages
 from LOG to DEBUG1

Previously some logical decoding messages (e.g., "logical decoding found
consistent point") were logged at level LOG, even though they provided
low-level, developer-oriented information that DBAs were typically not
interested in.

Since these messages can occur routinely (for example, when keeping calling
pg_logical_slot_get_changes() to obtain the changes from logical decoding),
logging them at LOG can be overly verbose.

This commit reduces their log level to DEBUG1 to avoid unnecessary log noise.
---
 src/backend/replication/logical/logical.c              | 2 +-
 src/backend/replication/logical/snapbuild.c            | 6 +++---
 src/test/recovery/t/038_save_logical_slots_shutdown.pl | 1 +
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 9a3b0020040..4b5fefbc426 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -598,7 +598,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 
 	ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
 
-	ereport(LOG,
+	ereport(DEBUG1,
 			(errmsg("starting logical decoding for slot \"%s\"",
 					NameStr(slot->data.name)),
 			 errdetail("Streaming transactions committing after %X/%08X, reading WAL from %X/%08X.",
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 37f0c6028bd..b4269a3b102 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1312,7 +1312,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		builder->state = SNAPBUILD_CONSISTENT;
 		builder->next_phase_at = InvalidTransactionId;
 
-		ereport(LOG,
+		ereport(DEBUG1,
 				errmsg("logical decoding found consistent point at %X/%08X",
 					   LSN_FORMAT_ARGS(lsn)),
 				errdetail("There are no running transactions."));
@@ -1409,7 +1409,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
 		builder->state = SNAPBUILD_CONSISTENT;
 		builder->next_phase_at = InvalidTransactionId;
 
-		ereport(LOG,
+		ereport(DEBUG1,
 				errmsg("logical decoding found consistent point at %X/%08X",
 					   LSN_FORMAT_ARGS(lsn)),
 				errdetail("There are no old transactions anymore."));
@@ -1915,7 +1915,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
 
 	Assert(builder->state == SNAPBUILD_CONSISTENT);
 
-	ereport(LOG,
+	ereport(DEBUG1,
 			errmsg("logical decoding found consistent point at %X/%08X",
 				   LSN_FORMAT_ARGS(lsn)),
 			errdetail("Logical decoding will begin using saved snapshot."));
diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
index c0392d50460..05aa78b4bc7 100644
--- a/src/test/recovery/t/038_save_logical_slots_shutdown.pl
+++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
@@ -48,6 +48,7 @@ $node_publisher->append_conf(
 	'postgresql.conf', q{
 checkpoint_timeout = 1h
 autovacuum = off
+log_min_messages = 'debug1'
 });
 $node_publisher->start;
 
-- 
2.51.2



^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Fix slotsync worker busy loop causing repeated log messages
@ 2026-03-21 17:46  Amit Kapila <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 13+ messages in thread

From: Amit Kapila @ 2026-03-21 17:46 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Thu, Mar 19, 2026 at 7:08 PM Fujii Masao <[email protected]> wrote:
>
> On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <[email protected]> wrote:
> >
> > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]> wrote:
> > >
> > > Normally, the slotsync worker updates the standby slot using the primary's slot
> > > state. However, when confirmed_flush_lsn matches but restart_lsn does not,
> > > the worker does not actually update the standby slot. Despite that, the current
> > > code of update_local_synced_slot() appears to treat this situation as if
> > > an update occurred. As a result, the worker sleeps only for the minimum
> > > interval (200 ms) before retrying. In the next cycle, it again assumes
> > > an update happened, and continues looping with the short sleep interval,
> > > causing the repeated logical decoding log messages. Based on a quick analysis,
> > > this seems to be the root cause.
> > >
> > > I think update_local_synced_slot() should return false (i.e., no update
> > > happened) when confirmed_flush_lsn is equal but restart_lsn differs between
> > > primary and standby.
> > >
> >
> > We expect that in such a case update_local_synced_slot() should
> > advance local_slot's 'restart_lsn' via
> > LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
> > cheap code path next time. Normally, restart_lsn advancement should
> > happen when we process XLOG_RUNNING_XACTS and call
> > SnapBuildProcessRunningXacts(). In this particular case as both
> > restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> > machinery may not be processing XLOG_RUNNING_XACTS record. I have not
> > debugged the exact case yet but you can try by emitting some more
> > records on publisher, it should let the standby advance the slot. It
> > is possible that we can do something like you are proposing to silence
> > the LOG messages but we should know what is going on here.
>
> I agree it's important to investigate why restart_lsn doesn't advance
> in that case and fix that issue.
>
> Separately, I think mistreating the slot as updated is another problem:
> the local slot is treated as updated even when it isn't, which causes
> the slotsync worker to use the minimum sleep interval. We should
> address this independently.
>
> LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local slot,
> but update_local_synced_slot() currently assumes it does. That seems to be
> the root cause. The attached 0001 patch makes
> LogicalSlotAdvanceAndCheckSnapState() return whether the slot was actually
> updated, and updates update_local_synced_slot() to act based on that result.
>
> I've also attached 0002, which reduces the log level of some logical decoding
> messages (e.g., "logical decoding found consistent point") from LOG to
> DEBUG1. These are low-level, developer-oriented messages and can be noisy
> when they occur regularly. This is the same issue I reported earlier at [1].
>
> How about applying these patches while we continue working on
> the restart_lsn issue?
>

*
@@ -2174,7 +2193,10 @@ LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,

  if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
  {
- LogicalConfirmReceivedLocation(moveto);
+ bool slot_updated = LogicalConfirmReceivedLocation(moveto);
+
+ if (updated_xmin_or_lsn)
+ *updated_xmin_or_lsn = slot_updated;

BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance slot
while processing running_xact record, so not sure we can rely only on
the explicit call LogicalConfirmReceivedLocation() above to ascertain
the same.

Sorry, I couldn't get the chance to look at the patches proposed by
Hou-san to fix this issue but I'll look at it after the feature
freeze.

* For 0002, I think the current log_level is used since logical
decoding is introduced, so if we want to change those, we should start
a separate thread. BTW, as a developer, I find those LOG level
messages quite useful while debugging issues, so I am in favor of
keeping those as it is. However, if the consensus is otherwise, we can
change them as well.

-- 
With Regards,
Amit Kapila.





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* RE: Fix slotsync worker busy loop causing repeated log messages
@ 2026-04-10 02:58  Zhijie Hou (Fujitsu) <[email protected]>
  parent: Amit Kapila <[email protected]>
  0 siblings, 3 replies; 13+ messages in thread

From: Zhijie Hou (Fujitsu) @ 2026-04-10 02:58 UTC (permalink / raw)
  To: Amit Kapila <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Fujii Masao <[email protected]>

On Sunday, March 22, 2026 1:47 AM Amit Kapila <[email protected]> wrote:
> On Thu, Mar 19, 2026 at 7:08 PM Fujii Masao <[email protected]>
> wrote:
> >
> > On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <[email protected]>
> wrote:
> > >
> > > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]>
> wrote:
> > > >
> > > > Normally, the slotsync worker updates the standby slot using the
> > > > primary's slot state. However, when confirmed_flush_lsn matches
> > > > but restart_lsn does not, the worker does not actually update the
> > > > standby slot. Despite that, the current code of
> > > > update_local_synced_slot() appears to treat this situation as if
> > > > an update occurred. As a result, the worker sleeps only for the
> > > > minimum interval (200 ms) before retrying. In the next cycle, it
> > > > again assumes an update happened, and continues looping with the
> > > > short sleep interval, causing the repeated logical decoding log messages.
> Based on a quick analysis, this seems to be the root cause.
> > > >
> > > > I think update_local_synced_slot() should return false (i.e., no
> > > > update
> > > > happened) when confirmed_flush_lsn is equal but restart_lsn
> > > > differs between primary and standby.
> > > >
> > >
> > > We expect that in such a case update_local_synced_slot() should
> > > advance local_slot's 'restart_lsn' via
> > > LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
> > > cheap code path next time. Normally, restart_lsn advancement should
> > > happen when we process XLOG_RUNNING_XACTS and call
> > > SnapBuildProcessRunningXacts(). In this particular case as both
> > > restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> > > machinery may not be processing XLOG_RUNNING_XACTS record. I have
> > > not debugged the exact case yet but you can try by emitting some
> > > more records on publisher, it should let the standby advance the
> > > slot. It is possible that we can do something like you are proposing
> > > to silence the LOG messages but we should know what is going on here.
> >
> > I agree it's important to investigate why restart_lsn doesn't advance
> > in that case and fix that issue.
> >
> > Separately, I think mistreating the slot as updated is another problem:
> > the local slot is treated as updated even when it isn't, which causes
> > the slotsync worker to use the minimum sleep interval. We should
> > address this independently.
> >
> > LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local
> > slot, but update_local_synced_slot() currently assumes it does. That
> > seems to be the root cause. The attached 0001 patch makes
> > LogicalSlotAdvanceAndCheckSnapState() return whether the slot was
> > actually updated, and updates update_local_synced_slot() to act based on
> that result.
> >
> > I've also attached 0002, which reduces the log level of some logical
> > decoding messages (e.g., "logical decoding found consistent point")
> > from LOG to DEBUG1. These are low-level, developer-oriented messages
> > and can be noisy when they occur regularly. This is the same issue I
> reported earlier at [1].
> >
> > How about applying these patches while we continue working on the
> > restart_lsn issue?
> >
> 
> *
> @@ -2174,7 +2193,10 @@
> LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
> 
>   if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
>   {
> - LogicalConfirmReceivedLocation(moveto);
> + bool slot_updated = LogicalConfirmReceivedLocation(moveto);
> +
> + if (updated_xmin_or_lsn)
> + *updated_xmin_or_lsn = slot_updated;
> 
> BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance slot while
> processing running_xact record, so not sure we can rely only on the explicit
> call LogicalConfirmReceivedLocation() above to ascertain the same.

Perhaps we could simply compare the slot's old and new LSN/xmin to determine
whether updated_xmin_or_lsn needs to be set. This would avoid the need to hook
into each individual update path at the lower level. Attaching a patch for reference.

> 
> Sorry, I couldn't get the chance to look at the patches proposed by Hou-san to
> fix this issue but I'll look at it after the feature freeze.

The other patches attempt to advance restart_lsn in more cases by modifying the
logical decoding logic. However, those scenarios are relatively rare and may not
produce significant improvement outside the slot synchronization context. So, I
think modifying the slot sync worker to correctly increase the polling interval instead
is also a good starting point.

Best Regards,
Hou zj



Attachments:

  [application/octet-stream] v3-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch (2.5K, 2-v3-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch)
  download | inline diff:
From 2cb9f3a2b70a9182d9b4d45d25e67779b2712ca8 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <[email protected]>
Date: Thu, 9 Apr 2026 20:11:40 +0800
Subject: [PATCH v3] Fix slotsync worker busy loop causing repeated logical
 decoding logs.

Previously, the slotsync worker could enter a busy loop and emit four logical
log messages every 200 ms, even when both the primary and standby were idle.

This happened because the worker incorrectly treated certain cases as
successful slot updates, causing it to use the minimum sleep interval and
repeatedly restart slot syncing.

This commit fixes this by ensuring the worker does not treat such cases as
updates, allowing it to sleep normally and avoid excessive log output.
---
 src/backend/replication/logical/slotsync.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index f90653e5232..527b26d1a2b 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -332,10 +332,15 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 			slot->data.confirmed_flush = remote_slot->confirmed_lsn;
 			slot->data.catalog_xmin = remote_slot->catalog_xmin;
 			SpinLockRelease(&slot->mutex);
+
+			updated_xmin_or_lsn = true;
 		}
 		else
 		{
 			bool		found_consistent_snapshot;
+			XLogRecPtr	old_confirmed_lsn = slot->data.confirmed_flush;
+			XLogRecPtr	old_restart_lsn = slot->data.restart_lsn;
+			XLogRecPtr	old_catalog_xmin = slot->data.catalog_xmin;
 
 			LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
 												&found_consistent_snapshot);
@@ -365,9 +370,20 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 
 				skip_reason = SS_SKIP_NO_CONSISTENT_SNAPSHOT;
 			}
-		}
 
-		updated_xmin_or_lsn = true;
+			/*
+			 * The restart_lsn and catalog_xmin may not change when the synced
+			 * slot is building its first consistent snapshot (or cannot build
+			 * one at all), or when the slot on the primary advances
+			 * restart_lsn to an xl_running_xacts record that begins exactly
+			 * at confirmed_flush_lsn (e.g., via pg_logical_slot_peek_xxx), a
+			 * position not reachable by
+			 * LogicalSlotAdvanceAndCheckSnapState().
+			 */
+			updated_xmin_or_lsn = (old_confirmed_lsn != slot->data.confirmed_flush ||
+								   old_restart_lsn != slot->data.restart_lsn ||
+								   old_catalog_xmin != slot->data.catalog_xmin);
+		}
 	}
 
 	/* Update slot sync skip stats */
-- 
2.31.1



^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Fix slotsync worker busy loop causing repeated log messages
@ 2026-04-10 04:17  Amit Kapila <[email protected]>
  parent: Zhijie Hou (Fujitsu) <[email protected]>
  2 siblings, 0 replies; 13+ messages in thread

From: Amit Kapila @ 2026-04-10 04:17 UTC (permalink / raw)
  To: Zhijie Hou (Fujitsu) <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Fujii Masao <[email protected]>

On Fri, Apr 10, 2026 at 8:28 AM Zhijie Hou (Fujitsu)
<[email protected]> wrote:
>
> On Sunday, March 22, 2026 1:47 AM Amit Kapila <[email protected]> wrote:
> >
> > *
> > @@ -2174,7 +2193,10 @@
> > LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
> >
> >   if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
> >   {
> > - LogicalConfirmReceivedLocation(moveto);
> > + bool slot_updated = LogicalConfirmReceivedLocation(moveto);
> > +
> > + if (updated_xmin_or_lsn)
> > + *updated_xmin_or_lsn = slot_updated;
> >
> > BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance slot while
> > processing running_xact record, so not sure we can rely only on the explicit
> > call LogicalConfirmReceivedLocation() above to ascertain the same.
>
> Perhaps we could simply compare the slot's old and new LSN/xmin to determine
> whether updated_xmin_or_lsn needs to be set. This would avoid the need to hook
> into each individual update path at the lower level. Attaching a patch for reference.
>

Yes, this sounds like a good idea, I will review it further.

> >
> > Sorry, I couldn't get the chance to look at the patches proposed by Hou-san to
> > fix this issue but I'll look at it after the feature freeze.
>
> The other patches attempt to advance restart_lsn in more cases by modifying the
> logical decoding logic. However, those scenarios are relatively rare and may not
> produce significant improvement outside the slot synchronization context. So, I
> think modifying the slot sync worker to correctly increase the polling interval instead
> is also a good starting point.
>

Even in slot synchronization, this is a very rare case where the slot
on primary is not continuously being advanced as normally we do via
walsender, otherwise, we wouldn't have seen LOGs at this frequency.

-- 
With Regards,
Amit Kapila.





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Fix slotsync worker busy loop causing repeated log messages
@ 2026-04-10 06:01  shveta malik <[email protected]>
  parent: Zhijie Hou (Fujitsu) <[email protected]>
  2 siblings, 0 replies; 13+ messages in thread

From: shveta malik @ 2026-04-10 06:01 UTC (permalink / raw)
  To: Zhijie Hou (Fujitsu) <[email protected]>; +Cc: Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>; Fujii Masao <[email protected]>; shveta malik <[email protected]>

On Fri, Apr 10, 2026 at 8:28 AM Zhijie Hou (Fujitsu)
<[email protected]> wrote:
>
> On Sunday, March 22, 2026 1:47 AM Amit Kapila <[email protected]> wrote:
> > On Thu, Mar 19, 2026 at 7:08 PM Fujii Masao <[email protected]>
> > wrote:
> > >
> > > On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <[email protected]>
> > wrote:
> > > >
> > > > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]>
> > wrote:
> > > > >
> > > > > Normally, the slotsync worker updates the standby slot using the
> > > > > primary's slot state. However, when confirmed_flush_lsn matches
> > > > > but restart_lsn does not, the worker does not actually update the
> > > > > standby slot. Despite that, the current code of
> > > > > update_local_synced_slot() appears to treat this situation as if
> > > > > an update occurred. As a result, the worker sleeps only for the
> > > > > minimum interval (200 ms) before retrying. In the next cycle, it
> > > > > again assumes an update happened, and continues looping with the
> > > > > short sleep interval, causing the repeated logical decoding log messages.
> > Based on a quick analysis, this seems to be the root cause.
> > > > >
> > > > > I think update_local_synced_slot() should return false (i.e., no
> > > > > update
> > > > > happened) when confirmed_flush_lsn is equal but restart_lsn
> > > > > differs between primary and standby.
> > > > >
> > > >
> > > > We expect that in such a case update_local_synced_slot() should
> > > > advance local_slot's 'restart_lsn' via
> > > > LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
> > > > cheap code path next time. Normally, restart_lsn advancement should
> > > > happen when we process XLOG_RUNNING_XACTS and call
> > > > SnapBuildProcessRunningXacts(). In this particular case as both
> > > > restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> > > > machinery may not be processing XLOG_RUNNING_XACTS record. I have
> > > > not debugged the exact case yet but you can try by emitting some
> > > > more records on publisher, it should let the standby advance the
> > > > slot. It is possible that we can do something like you are proposing
> > > > to silence the LOG messages but we should know what is going on here.
> > >
> > > I agree it's important to investigate why restart_lsn doesn't advance
> > > in that case and fix that issue.
> > >
> > > Separately, I think mistreating the slot as updated is another problem:
> > > the local slot is treated as updated even when it isn't, which causes
> > > the slotsync worker to use the minimum sleep interval. We should
> > > address this independently.
> > >
> > > LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local
> > > slot, but update_local_synced_slot() currently assumes it does. That
> > > seems to be the root cause. The attached 0001 patch makes
> > > LogicalSlotAdvanceAndCheckSnapState() return whether the slot was
> > > actually updated, and updates update_local_synced_slot() to act based on
> > that result.
> > >
> > > I've also attached 0002, which reduces the log level of some logical
> > > decoding messages (e.g., "logical decoding found consistent point")
> > > from LOG to DEBUG1. These are low-level, developer-oriented messages
> > > and can be noisy when they occur regularly. This is the same issue I
> > reported earlier at [1].
> > >
> > > How about applying these patches while we continue working on the
> > > restart_lsn issue?
> > >
> >
> > *
> > @@ -2174,7 +2193,10 @@
> > LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
> >
> >   if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
> >   {
> > - LogicalConfirmReceivedLocation(moveto);
> > + bool slot_updated = LogicalConfirmReceivedLocation(moveto);
> > +
> > + if (updated_xmin_or_lsn)
> > + *updated_xmin_or_lsn = slot_updated;
> >
> > BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance slot while
> > processing running_xact record, so not sure we can rely only on the explicit
> > call LogicalConfirmReceivedLocation() above to ascertain the same.
>
> Perhaps we could simply compare the slot's old and new LSN/xmin to determine
> whether updated_xmin_or_lsn needs to be set. This would avoid the need to hook
> into each individual update path at the lower level. Attaching a patch for reference.
>
> >
> > Sorry, I couldn't get the chance to look at the patches proposed by Hou-san to
> > fix this issue but I'll look at it after the feature freeze.
>
> The other patches attempt to advance restart_lsn in more cases by modifying the
> logical decoding logic. However, those scenarios are relatively rare and may not
> produce significant improvement outside the slot synchronization context. So, I
> think modifying the slot sync worker to correctly increase the polling interval instead
> is also a good starting point.
>

After reviewing the entire thread and patches, the above solution
appears to be the simplest and least intrusive, especially since the
issue is limited to the sltosync worker.  So +1 for this idea.

thanks
Shveta





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* Re: Fix slotsync worker busy loop causing repeated log messages
@ 2026-04-10 09:52  Amit Kapila <[email protected]>
  parent: Zhijie Hou (Fujitsu) <[email protected]>
  2 siblings, 1 reply; 13+ messages in thread

From: Amit Kapila @ 2026-04-10 09:52 UTC (permalink / raw)
  To: Zhijie Hou (Fujitsu) <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Fujii Masao <[email protected]>

On Fri, Apr 10, 2026 at 8:28 AM Zhijie Hou (Fujitsu)
<[email protected]> wrote:
>
> On Sunday, March 22, 2026 1:47 AM Amit Kapila <[email protected]> wrote:
> > On Thu, Mar 19, 2026 at 7:08 PM Fujii Masao <[email protected]>
> > wrote:
> > >
> > > On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <[email protected]>
> > wrote:
> > > >
> > > > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]>
> > wrote:
> > > > >
> > > > > Normally, the slotsync worker updates the standby slot using the
> > > > > primary's slot state. However, when confirmed_flush_lsn matches
> > > > > but restart_lsn does not, the worker does not actually update the
> > > > > standby slot. Despite that, the current code of
> > > > > update_local_synced_slot() appears to treat this situation as if
> > > > > an update occurred. As a result, the worker sleeps only for the
> > > > > minimum interval (200 ms) before retrying. In the next cycle, it
> > > > > again assumes an update happened, and continues looping with the
> > > > > short sleep interval, causing the repeated logical decoding log messages.
> > Based on a quick analysis, this seems to be the root cause.
> > > > >
> > > > > I think update_local_synced_slot() should return false (i.e., no
> > > > > update
> > > > > happened) when confirmed_flush_lsn is equal but restart_lsn
> > > > > differs between primary and standby.
> > > > >
> > > >
> > > > We expect that in such a case update_local_synced_slot() should
> > > > advance local_slot's 'restart_lsn' via
> > > > LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
> > > > cheap code path next time. Normally, restart_lsn advancement should
> > > > happen when we process XLOG_RUNNING_XACTS and call
> > > > SnapBuildProcessRunningXacts(). In this particular case as both
> > > > restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> > > > machinery may not be processing XLOG_RUNNING_XACTS record. I have
> > > > not debugged the exact case yet but you can try by emitting some
> > > > more records on publisher, it should let the standby advance the
> > > > slot. It is possible that we can do something like you are proposing
> > > > to silence the LOG messages but we should know what is going on here.
> > >
> > > I agree it's important to investigate why restart_lsn doesn't advance
> > > in that case and fix that issue.
> > >
> > > Separately, I think mistreating the slot as updated is another problem:
> > > the local slot is treated as updated even when it isn't, which causes
> > > the slotsync worker to use the minimum sleep interval. We should
> > > address this independently.
> > >
> > > LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local
> > > slot, but update_local_synced_slot() currently assumes it does. That
> > > seems to be the root cause. The attached 0001 patch makes
> > > LogicalSlotAdvanceAndCheckSnapState() return whether the slot was
> > > actually updated, and updates update_local_synced_slot() to act based on
> > that result.
> > >
> > > I've also attached 0002, which reduces the log level of some logical
> > > decoding messages (e.g., "logical decoding found consistent point")
> > > from LOG to DEBUG1. These are low-level, developer-oriented messages
> > > and can be noisy when they occur regularly. This is the same issue I
> > reported earlier at [1].
> > >
> > > How about applying these patches while we continue working on the
> > > restart_lsn issue?
> > >
> >
> > *
> > @@ -2174,7 +2193,10 @@
> > LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
> >
> >   if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
> >   {
> > - LogicalConfirmReceivedLocation(moveto);
> > + bool slot_updated = LogicalConfirmReceivedLocation(moveto);
> > +
> > + if (updated_xmin_or_lsn)
> > + *updated_xmin_or_lsn = slot_updated;
> >
> > BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance slot while
> > processing running_xact record, so not sure we can rely only on the explicit
> > call LogicalConfirmReceivedLocation() above to ascertain the same.
>
> Perhaps we could simply compare the slot's old and new LSN/xmin to determine
> whether updated_xmin_or_lsn needs to be set.
>

The patch works for me. Can we change the comment to as follows: "It
is possible that the slot's xmin or LSNs are not updated, when the
synced slot has reached consistent snapshot state or cannot build the
one at all."?

-- 
With Regards,
Amit Kapila.





^ permalink  raw  reply  [nested|flat] 13+ messages in thread

* RE: Fix slotsync worker busy loop causing repeated log messages
@ 2026-04-10 10:28  Zhijie Hou (Fujitsu) <[email protected]>
  parent: Amit Kapila <[email protected]>
  0 siblings, 0 replies; 13+ messages in thread

From: Zhijie Hou (Fujitsu) @ 2026-04-10 10:28 UTC (permalink / raw)
  To: Amit Kapila <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Fujii Masao <[email protected]>

On Friday, April 10, 2026 5:53 PM Amit Kapila <[email protected]> wrote:
> On Fri, Apr 10, 2026 at 8:28 AM Zhijie Hou (Fujitsu) <[email protected]>
> wrote:
> >
> >
> > Perhaps we could simply compare the slot's old and new LSN/xmin to
> > determine whether updated_xmin_or_lsn needs to be set.
> >
> 
> The patch works for me. Can we change the comment to as follows: "It is
> possible that the slot's xmin or LSNs are not updated, when the synced slot
> has reached consistent snapshot state or cannot build the one at all."?

Changed as suggested.

Here are the patches for both HEAD and back branches.

Best Regards,
Hou zj


Attachments:

  [application/octet-stream] v4-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch (2.3K, 2-v4-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch)
  download | inline diff:
From 52c53fa9c2e7648565103aea82650964bff2385c Mon Sep 17 00:00:00 2001
From: Zhijie Hou <[email protected]>
Date: Thu, 9 Apr 2026 20:11:40 +0800
Subject: [PATCH v4] Fix slotsync worker busy loop causing repeated logical
 decoding logs.

Previously, the slotsync worker could enter a busy loop and emit four logical
log messages every 200 ms, even when both the primary and standby were idle.

This happened because the worker incorrectly treated certain cases as
successful slot updates, causing it to use the minimum sleep interval and
repeatedly restart slot syncing.

This commit fixes this by ensuring the worker does not treat such cases as
updates, allowing it to sleep normally and avoid excessive log output.
---
 src/backend/replication/logical/slotsync.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index f90653e5232..64db2e9fd8c 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -332,10 +332,15 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 			slot->data.confirmed_flush = remote_slot->confirmed_lsn;
 			slot->data.catalog_xmin = remote_slot->catalog_xmin;
 			SpinLockRelease(&slot->mutex);
+
+			updated_xmin_or_lsn = true;
 		}
 		else
 		{
 			bool		found_consistent_snapshot;
+			XLogRecPtr	old_confirmed_lsn = slot->data.confirmed_flush;
+			XLogRecPtr	old_restart_lsn = slot->data.restart_lsn;
+			XLogRecPtr	old_catalog_xmin = slot->data.catalog_xmin;
 
 			LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
 												&found_consistent_snapshot);
@@ -365,9 +370,16 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 
 				skip_reason = SS_SKIP_NO_CONSISTENT_SNAPSHOT;
 			}
-		}
 
-		updated_xmin_or_lsn = true;
+			/*
+			 * It is possible that the slot's xmin or LSNs are not updated,
+			 * when the synced slot has reached consistent snapshot state or
+			 * cannot build one at all.
+			 */
+			updated_xmin_or_lsn = (old_confirmed_lsn != slot->data.confirmed_flush ||
+								   old_restart_lsn != slot->data.restart_lsn ||
+								   old_catalog_xmin != slot->data.catalog_xmin);
+		}
 	}
 
 	/* Update slot sync skip stats */
-- 
2.31.1



  [application/octet-stream] v4_PG17_18-0001-Fix-slotsync-worker-busy-loop-causing-rep.patch (2.3K, 3-v4_PG17_18-0001-Fix-slotsync-worker-busy-loop-causing-rep.patch)
  download | inline diff:
From c3744118809796e34faa37640a1e3b428705a8b9 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <[email protected]>
Date: Fri, 10 Apr 2026 16:28:44 +0800
Subject: [PATCH v4_PG17_18] Fix slotsync worker busy loop causing repeated
 logical decoding logs.

Previously, the slotsync worker could enter a busy loop and emit four logical
log messages every 200 ms, even when both the primary and standby were idle.

This happened because the worker incorrectly treated certain cases as
successful slot updates, causing it to use the minimum sleep interval and
repeatedly restart slot syncing.

This commit fixes this by ensuring the worker does not treat such cases as
updates, allowing it to sleep normally and avoid excessive log output.
---
 src/backend/replication/logical/slotsync.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 4cbee197ddb..3cbcd77d12f 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -275,9 +275,15 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid,
 
 			if (found_consistent_snapshot)
 				*found_consistent_snapshot = true;
+
+			updated_xmin_or_lsn = true;
 		}
 		else
 		{
+			XLogRecPtr	old_confirmed_lsn = slot->data.confirmed_flush;
+			XLogRecPtr	old_restart_lsn = slot->data.restart_lsn;
+			XLogRecPtr	old_catalog_xmin = slot->data.catalog_xmin;
+
 			LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
 												found_consistent_snapshot);
 
@@ -289,9 +295,16 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid,
 						errdetail_internal("Remote slot has LSN %X/%X but local slot has LSN %X/%X.",
 										   LSN_FORMAT_ARGS(remote_slot->confirmed_lsn),
 										   LSN_FORMAT_ARGS(slot->data.confirmed_flush)));
-		}
 
-		updated_xmin_or_lsn = true;
+			/*
+			 * It is possible that the slot's xmin or LSNs are not updated,
+			 * when the synced slot has reached consistent snapshot state or
+			 * cannot build one at all.
+			 */
+			updated_xmin_or_lsn = (old_confirmed_lsn != slot->data.confirmed_flush ||
+								   old_restart_lsn != slot->data.restart_lsn ||
+								   old_catalog_xmin != slot->data.catalog_xmin);
+		}
 	}
 
 	if (remote_dbid != slot->data.database ||
-- 
2.53.0.windows.2



^ permalink  raw  reply  [nested|flat] 13+ messages in thread


end of thread, other threads:[~2026-04-10 10:28 UTC | newest]

Thread overview: 13+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-02-28 05:03 Re: Fix slotsync worker busy loop causing repeated log messages Amit Kapila <[email protected]>
2026-03-03 07:42 ` Zhijie Hou (Fujitsu) <[email protected]>
2026-03-03 10:25   ` Amit Kapila <[email protected]>
2026-03-03 12:24     ` Zhijie Hou (Fujitsu) <[email protected]>
2026-03-04 02:59       ` =?utf-8?B?Y2NhNTUwNw==?= <[email protected]>
2026-03-06 11:04       ` Zhijie Hou (Fujitsu) <[email protected]>
2026-03-19 13:38 ` Fujii Masao <[email protected]>
2026-03-21 17:46   ` Amit Kapila <[email protected]>
2026-04-10 02:58     ` Zhijie Hou (Fujitsu) <[email protected]>
2026-04-10 04:17       ` Amit Kapila <[email protected]>
2026-04-10 06:01       ` shveta malik <[email protected]>
2026-04-10 09:52       ` Amit Kapila <[email protected]>
2026-04-10 10:28         ` Zhijie Hou (Fujitsu) <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox