RE: Fix slotsync worker busy loop causing repeated log messages

From: Zhijie Hou (Fujitsu) <[email protected]>
To: Amit Kapila <[email protected]>
To: Fujii Masao <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: RE: Fix slotsync worker busy loop causing repeated log messages
Date: Tue, 3 Mar 2026 07:42:31 +0000
Message-ID: <OS7PR01MB16909C13530D84781E7C2E2EF947FA@OS7PR01MB16909.jpnprd01.prod.outlook.com>
In-Reply-To: <CAA4eK1KLk+TWyNPJ=z6SzQQXySc-N9Gs3eR-QKfV+MX7vfJWiw@mail.gmail.com>
References: <CAHGQGwF6zG9Z8ws1yb3hY1VqV-WT7hR0qyXCn2HdbjvZQKufDw@mail.gmail.com>
	<CAA4eK1KLk+TWyNPJ=z6SzQQXySc-N9Gs3eR-QKfV+MX7vfJWiw@mail.gmail.com>

On Saturday, February 28, 2026 1:03 PM Amit Kapila <[email protected]> wrote:
> On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]> wrote:
> >
> > Normally, the slotsync worker updates the standby slot using the
> > primary's slot state. However, when confirmed_flush_lsn matches but
> > restart_lsn does not, the worker does not actually update the standby
> > slot. Despite that, the current code of update_local_synced_slot()
> > appears to treat this situation as if an update occurred. As a result,
> > the worker sleeps only for the minimum interval (200 ms) before
> > retrying. In the next cycle, it again assumes an update happened, and
> > continues looping with the short sleep interval, causing the repeated
> > logical decoding log messages. Based on a quick analysis, this seems
> > to be the root cause.
> >
> > I think update_local_synced_slot() should return false (i.e., no
> > update happened) when confirmed_flush_lsn is equal but restart_lsn
> > differs between primary and standby.
> >
> 
> We expect that in such a case update_local_synced_slot() should advance
> local_slot's 'restart_lsn' via LogicalSlotAdvanceAndCheckSnapState(),
> otherwise, it won't go in the cheap code path next time. Normally, restart_lsn
> advancement should happen when we process XLOG_RUNNING_XACTS and
> call SnapBuildProcessRunningXacts(). In this particular case as both
> restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> machinery may not be processing XLOG_RUNNING_XACTS record. I have not
> debugged the exact case yet but you can try by emitting some more records
> on publisher, it should let the standby advance the slot. It is possible that we
> can do something like you are proposing to silence the LOG messages but we
> should know what is going on here.
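
For context, the busy loop described in the quoted text can be modeled
roughly as below. This is only a sketch, not the slotsync worker's code:
toy_next_naptime() and both macros are hypothetical names, the 200 ms minimum
is the value quoted above, and the doubling back-off with a 30 s cap is an
assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MIN_NAPTIME_MS 200     /* minimum interval quoted above */
#define TOY_MAX_NAPTIME_MS 30000   /* assumed upper bound, illustration only */

/*
 * Pick the next sleep interval for a polling worker.  If the last cycle
 * reported an update, go back to the minimum interval; otherwise back off
 * by doubling, up to the cap.
 */
static long
toy_next_naptime(long current_ms, bool slot_updated)
{
    assert(current_ms >= TOY_MIN_NAPTIME_MS);

    if (slot_updated)
        return TOY_MIN_NAPTIME_MS;

    long doubled = current_ms * 2;

    return doubled > TOY_MAX_NAPTIME_MS ? TOY_MAX_NAPTIME_MS : doubled;
}
```

In this model, a cycle that always reports slot_updated = true never leaves
the 200 ms interval, which matches the repeated log messages in the report.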

I reproduced and debugged this issue where a replication slot's restart_lsn
fails to advance. In my environment, I found it only occurs when a synced
slot first builds a consistent snapshot. The problematic code path is in
SnapBuildProcessRunningXacts():

    if (builder->state < SNAPBUILD_CONSISTENT)
    {
        /* returns false if there's no point in performing cleanup just yet */
        if (!SnapBuildFindSnapshot(builder, lsn, running))
            return;
    }

When a synced slot reaches consistency for the first time with no running
transactions, SnapBuildFindSnapshot() returns false, causing the function to
return without updating the candidate restart_lsn.

So, an alternative approach is to update the candidate restart_lsn in this
case instead of returning early. See the attached patch for details. It fixes
the issue on my machine.

Best Regards,
Hou zj


Attachments:

  [application/octet-stream] v1-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch (3.5K, 2-v1-0001-Advance-restart_lsn-when-reaching-consistency-wit.patch)
From 05f4ea29319637df578a92e90df5d24919cc2f79 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <[email protected]>
Date: Tue, 3 Mar 2026 11:44:38 +0800
Subject: [PATCH v1] Advance restart_lsn when reaching consistency without
 waiting

Currently, the replication slot's restart_lsn is not advanced when a
consistent snapshot is built for the first time, even when it is safe to do
so. This can lead to unnecessary retention of WAL segments, though the impact
is rare.

This commit advances restart_lsn at the consistency point if either:
- a serialized snapshot from a previous decoding session is available, or
- there were no running transactions when reaching consistency.

In both cases, it's safe and efficient to restart decoding from this LSN,
reducing WAL retention without affecting decoding capabilities.
---
 src/backend/replication/logical/snapbuild.c | 27 +++++++++++++++------
 1 file changed, 19 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 7f79621b57e..490b948267b 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1136,6 +1136,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 {
 	ReorderBufferTXN *txn;
 	TransactionId xmin;
+	bool	snapshot_built_immediately = false;
 
 	/*
 	 * If we're not consistent yet, inspect the record to see whether it
@@ -1143,11 +1144,7 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 * our snapshot so others or we, after a restart, can use it.
 	 */
 	if (builder->state < SNAPBUILD_CONSISTENT)
-	{
-		/* returns false if there's no point in performing cleanup just yet */
-		if (!SnapBuildFindSnapshot(builder, lsn, running))
-			return;
-	}
+		snapshot_built_immediately = !SnapBuildFindSnapshot(builder, lsn, running);
 	else
 		SnapBuildSerialize(builder, lsn);
 
@@ -1165,8 +1162,15 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	builder->xmin = running->oldestRunningXid;
 
-	/* Remove transactions we don't need to keep track off anymore */
-	SnapBuildPurgeOlderTxn(builder);
+	/*
+	 * Remove transactions we don't need to keep track of anymore.
+	 *
+	 * Cleanup is skipped if this is the first time we built a consistent
+	 * snapshot and we didn't wait for any transactions. In that case, no
+	 * transaction data has accumulated.
+	 */
+	if (!snapshot_built_immediately)
+		SnapBuildPurgeOlderTxn(builder);
 
 	/*
 	 * Advance the xmin limit for the current replication slot, to allow
@@ -1211,7 +1215,6 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 	 */
 	if (txn != NULL && XLogRecPtrIsValid(txn->restart_decoding_lsn))
 		LogicalIncreaseRestartDecodingForSlot(lsn, txn->restart_decoding_lsn);
-
 	/*
 	 * No in-progress transaction, can reuse the last serialized snapshot if
 	 * we have one.
@@ -1221,6 +1224,14 @@ SnapBuildProcessRunningXacts(SnapBuild *builder, XLogRecPtr lsn, xl_running_xact
 			 XLogRecPtrIsValid(builder->last_serialized_snapshot))
 		LogicalIncreaseRestartDecodingForSlot(lsn,
 											  builder->last_serialized_snapshot);
+	/*
+	 * If we built a snapshot immediately at this LSN, either a serialized
+	 * snapshot from a different decoding session is available or there were no
+	 * running transactions. In either case, it's safe and efficient to restart
+	 * from this LSN next time.
+	 */
+	else if (snapshot_built_immediately)
+		LogicalIncreaseRestartDecodingForSlot(lsn, lsn);
 }
 
 
-- 
2.51.1.windows.1
