RE: Fix slotsync worker busy loop causing repeated log messages

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Zhijie Hou (Fujitsu) <[email protected]>
To: Amit Kapila <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Fujii Masao <[email protected]>
Subject: RE: Fix slotsync worker busy loop causing repeated log messages
Date: Fri, 10 Apr 2026 02:58:18 +0000
Message-ID: <TYRPR01MB14195ED854BE834AFDC815E9494592@TYRPR01MB14195.jpnprd01.prod.outlook.com> (raw)
In-Reply-To: <CAA4eK1KB6geysORVyOh-dJ+9U57CBC30K2XD3fQ6SKZc0Hsi3Q@mail.gmail.com>
References: <CAHGQGwF6zG9Z8ws1yb3hY1VqV-WT7hR0qyXCn2HdbjvZQKufDw@mail.gmail.com>
	<CAA4eK1KLk+TWyNPJ=z6SzQQXySc-N9Gs3eR-QKfV+MX7vfJWiw@mail.gmail.com>
	<CAHGQGwG1br8SQQnBieyJeRAmnMejEtJrVmurfTPxNouLHSW=jA@mail.gmail.com>
	<CAA4eK1KB6geysORVyOh-dJ+9U57CBC30K2XD3fQ6SKZc0Hsi3Q@mail.gmail.com>

On Sunday, March 22, 2026 1:47 AM Amit Kapila <[email protected]> wrote:
> On Thu, Mar 19, 2026 at 7:08 PM Fujii Masao <[email protected]>
> wrote:
> >
> > On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <[email protected]>
> wrote:
> > >
> > > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]>
> wrote:
> > > >
> > > > Normally, the slotsync worker updates the standby slot using the
> > > > primary's slot state. However, when confirmed_flush_lsn matches
> > > > but restart_lsn does not, the worker does not actually update the
> > > > standby slot. Despite that, the current code of
> > > > update_local_synced_slot() appears to treat this situation as if
> > > > an update occurred. As a result, the worker sleeps only for the
> > > > minimum interval (200 ms) before retrying. In the next cycle, it
> > > > again assumes an update happened, and continues looping with the
> > > > short sleep interval, causing the repeated logical decoding log messages.
> Based on a quick analysis, this seems to be the root cause.
> > > >
> > > > I think update_local_synced_slot() should return false (i.e., no
> > > > update
> > > > happened) when confirmed_flush_lsn is equal but restart_lsn
> > > > differs between primary and standby.
> > > >
> > >
> > > We expect that in such a case update_local_synced_slot() should
> > > advance local_slot's 'restart_lsn' via
> > > LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
> > > cheap code path next time. Normally, restart_lsn advancement should
> > > happen when we process XLOG_RUNNING_XACTS and call
> > > SnapBuildProcessRunningXacts(). In this particular case as both
> > > restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> > > machinery may not be processing XLOG_RUNNING_XACTS record. I have
> > > not debugged the exact case yet but you can try by emitting some
> > > more records on publisher, it should let the standby advance the
> > > slot. It is possible that we can do something like you are proposing
> > > to silence the LOG messages but we should know what is going on here.
> >
> > I agree it's important to investigate why restart_lsn doesn't advance
> > in that case and fix that issue.
> >
> > Separately, I think mistreating the slot as updated is another problem:
> > the local slot is treated as updated even when it isn't, which causes
> > the slotsync worker to use the minimum sleep interval. We should
> > address this independently.
> >
> > LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local
> > slot, but update_local_synced_slot() currently assumes it does. That
> > seems to be the root cause. The attached 0001 patch makes
> > LogicalSlotAdvanceAndCheckSnapState() return whether the slot was
> > actually updated, and updates update_local_synced_slot() to act based on
> that result.
> >
> > I've also attached 0002, which reduces the log level of some logical
> > decoding messages (e.g., "logical decoding found consistent point")
> > from LOG to DEBUG1. These are low-level, developer-oriented messages
> > and can be noisy when they occur regularly. This is the same issue I
> reported earlier at [1].
> >
> > How about applying these patches while we continue working on the
> > restart_lsn issue?
> >
> 
> *
> @@ -2174,7 +2193,10 @@
> LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
> 
>   if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
>   {
> - LogicalConfirmReceivedLocation(moveto);
> + bool slot_updated = LogicalConfirmReceivedLocation(moveto);
> +
> + if (updated_xmin_or_lsn)
> + *updated_xmin_or_lsn = slot_updated;
> 
> BTW, LogicalSlotAdvanceAndCheckSnapState() could also advance slot while
> processing running_xact record, so not sure we can rely only on the explicit
> call LogicalConfirmReceivedLocation() above to ascertain the same.

Perhaps we could simply compare the slot's old and new LSN/xmin to determine
whether updated_xmin_or_lsn needs to be set. This would avoid the need to hook
into each individual update path at the lower level. Attaching a patch for reference.

> 
> Sorry, I couldn't get the chance to look at the patches proposed by Hou-san to
> fix this issue but I'll look at it after the feature freeze.

The other patches attempt to advance restart_lsn in more cases by modifying the
logical decoding logic. However, those scenarios are relatively rare and may not
produce significant improvement outside the slot synchronization context. So, I
think modifying the slot sync worker to correctly increase the polling interval instead
is also a good starting point.

Best Regards,
Hou zj



Attachments:

  [application/octet-stream] v3-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch (2.5K, 2-v3-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch)
  download | inline diff:
From 2cb9f3a2b70a9182d9b4d45d25e67779b2712ca8 Mon Sep 17 00:00:00 2001
From: Zhijie Hou <[email protected]>
Date: Thu, 9 Apr 2026 20:11:40 +0800
Subject: [PATCH v3] Fix slotsync worker busy loop causing repeated logical
 decoding logs.

Previously, the slotsync worker could enter a busy loop and emit four logical
log messages every 200 ms, even when both the primary and standby were idle.

This happened because the worker incorrectly treated certain cases as
successful slot updates, causing it to use the minimum sleep interval and
repeatedly restart slot syncing.

This commit fixes this by ensuring the worker does not treat such cases as
updates, allowing it to sleep normally and avoid excessive log output.
---
 src/backend/replication/logical/slotsync.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index f90653e5232..527b26d1a2b 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -332,10 +332,15 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 			slot->data.confirmed_flush = remote_slot->confirmed_lsn;
 			slot->data.catalog_xmin = remote_slot->catalog_xmin;
 			SpinLockRelease(&slot->mutex);
+
+			updated_xmin_or_lsn = true;
 		}
 		else
 		{
 			bool		found_consistent_snapshot;
+			XLogRecPtr	old_confirmed_lsn = slot->data.confirmed_flush;
+			XLogRecPtr	old_restart_lsn = slot->data.restart_lsn;
+			XLogRecPtr	old_catalog_xmin = slot->data.catalog_xmin;
 
 			LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
 												&found_consistent_snapshot);
@@ -365,9 +370,20 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
 
 				skip_reason = SS_SKIP_NO_CONSISTENT_SNAPSHOT;
 			}
-		}
 
-		updated_xmin_or_lsn = true;
+			/*
+			 * The restart_lsn and catalog_xmin may not change when the synced
+			 * slot is building its first consistent snapshot (or cannot build
+			 * one at all), or when the slot on the primary advances
+			 * restart_lsn to an xl_running_xacts record that begins exactly
+			 * at confirmed_flush_lsn (e.g., via pg_logical_slot_peek_xxx), a
+			 * position not reachable by
+			 * LogicalSlotAdvanceAndCheckSnapState().
+			 */
+			updated_xmin_or_lsn = (old_confirmed_lsn != slot->data.confirmed_flush ||
+								   old_restart_lsn != slot->data.restart_lsn ||
+								   old_catalog_xmin != slot->data.catalog_xmin);
+		}
 	}
 
 	/* Update slot sync skip stats */
-- 
2.31.1

view thread (13+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: RE: Fix slotsync worker busy loop causing repeated log messages
  In-Reply-To: <TYRPR01MB14195ED854BE834AFDC815E9494592@TYRPR01MB14195.jpnprd01.prod.outlook.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox