public inbox for [email protected]
help / color / mirror / Atom feedFrom: Fujii Masao <[email protected]>
To: Amit Kapila <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: Fix slotsync worker busy loop causing repeated log messages
Date: Thu, 19 Mar 2026 22:38:33 +0900
Message-ID: <CAHGQGwG1br8SQQnBieyJeRAmnMejEtJrVmurfTPxNouLHSW=jA@mail.gmail.com> (raw)
In-Reply-To: <CAA4eK1KLk+TWyNPJ=z6SzQQXySc-N9Gs3eR-QKfV+MX7vfJWiw@mail.gmail.com>
References: <CAHGQGwF6zG9Z8ws1yb3hY1VqV-WT7hR0qyXCn2HdbjvZQKufDw@mail.gmail.com>
<CAA4eK1KLk+TWyNPJ=z6SzQQXySc-N9Gs3eR-QKfV+MX7vfJWiw@mail.gmail.com>
On Sat, Feb 28, 2026 at 2:03 PM Amit Kapila <[email protected]> wrote:
>
> On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]> wrote:
> >
> > Normally, the slotsync worker updates the standby slot using the primary's slot
> > state. However, when confirmed_flush_lsn matches but restart_lsn does not,
> > the worker does not actually update the standby slot. Despite that, the current
> > code of update_local_synced_slot() appears to treat this situation as if
> > an update occurred. As a result, the worker sleeps only for the minimum
> > interval (200 ms) before retrying. In the next cycle, it again assumes
> > an update happened, and continues looping with the short sleep interval,
> > causing the repeated logical decoding log messages. Based on a quick analysis,
> > this seems to be the root cause.
> >
> > I think update_local_synced_slot() should return false (i.e., no update
> > happened) when confirmed_flush_lsn is equal but restart_lsn differs between
> > primary and standby.
> >
>
> We expect that in such a case update_local_synced_slot() should
> advance local_slot's 'restart_lsn' via
> LogicalSlotAdvanceAndCheckSnapState(), otherwise, it won't go in the
> cheap code path next time. Normally, restart_lsn advancement should
> happen when we process XLOG_RUNNING_XACTS and call
> SnapBuildProcessRunningXacts(). In this particular case as both
> restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> machinery may not be processing XLOG_RUNNING_XACTS record. I have not
> debugged the exact case yet but you can try by emitting some more
> records on publisher, it should let the standby advance the slot. It
> is possible that we can do something like you are proposing to silence
> the LOG messages but we should know what is going on here.
I agree it's important to investigate why restart_lsn doesn't advance
in that case and fix that issue.
Separately, I think mistreating the slot as updated is another problem:
the local slot is treated as updated even when it isn't, which causes
the slotsync worker to use the minimum sleep interval. We should
address this independently.
LogicalSlotAdvanceAndCheckSnapState() doesn't always update the local slot,
but update_local_synced_slot() currently assumes it does. That seems to be
the root cause. The attached 0001 patch makes
LogicalSlotAdvanceAndCheckSnapState() return whether the slot was actually
updated, and updates update_local_synced_slot() to act based on that result.
I've also attached 0002, which reduces the log level of some logical decoding
messages (e.g., "logical decoding found consistent point") from LOG to
DEBUG1. These are low-level, developer-oriented messages and can be noisy
when they occur regularly. This is the same issue I reported earlier at [1].
How about applying these patches while we continue working on
the restart_lsn issue?
Regards,
[1] https://postgr.es/m/CAHGQGwED5kJxZ_MdCh=WLa5M7ekXdnzCO1ZCQhQNCEdfaEPwFQ@mail.gmail.com
--
Fujii Masao
Attachments:
[application/octet-stream] v2-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch (6.6K, 2-v2-0001-Fix-slotsync-worker-busy-loop-causing-repeated-lo.patch)
download | inline diff:
From 4e8fbce7accd6872c09ff0da02bfd0d7ae51b1a4 Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Thu, 19 Mar 2026 17:34:55 +0900
Subject: [PATCH v2 1/2] Fix slotsync worker busy loop causing repeated logical
decoding logs.
Previously, the slotsync worker could enter a busy loop and emit four logical
log messages every 200 ms, even when both the primary and standby were idle.
This happened because the worker incorrectly treated certain cases as
successful slot updates, causing it to use the minimum sleep interval and
repeatedly restart slot syncing.
This commit fixes this by ensuring the worker does not treat such cases as
updates, allowing it to sleep normally and avoid excessive log output.
---
src/backend/replication/logical/logical.c | 32 ++++++++++++++++++----
src/backend/replication/logical/slotsync.c | 7 +++--
src/backend/replication/slotfuncs.c | 2 +-
src/include/replication/logical.h | 5 ++--
4 files changed, 35 insertions(+), 11 deletions(-)
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 603a2b94d05..9a3b0020040 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1806,18 +1806,22 @@ LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn, XLogRecPtr restart
/*
* Handle a consumer's confirmation having received all changes up to lsn.
+ *
+ * Return TRUE if the local slot is updated.
*/
-void
+bool
LogicalConfirmReceivedLocation(XLogRecPtr lsn)
{
+ bool updated_confirmed = false;
+ bool updated_xmin = false;
+ bool updated_restart = false;
+
Assert(XLogRecPtrIsValid(lsn));
/* Do an unlocked check for candidate_lsn first. */
if (XLogRecPtrIsValid(MyReplicationSlot->candidate_xmin_lsn) ||
XLogRecPtrIsValid(MyReplicationSlot->candidate_restart_valid))
{
- bool updated_xmin = false;
- bool updated_restart = false;
XLogRecPtr restart_lsn pg_attribute_unused();
SpinLockAcquire(&MyReplicationSlot->mutex);
@@ -1837,7 +1841,10 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
* LSN. See similar case handling in CreateDecodingContext.
*/
if (lsn > MyReplicationSlot->data.confirmed_flush)
+ {
MyReplicationSlot->data.confirmed_flush = lsn;
+ updated_confirmed = true;
+ }
/* if we're past the location required for bumping xmin, do so */
if (XLogRecPtrIsValid(MyReplicationSlot->candidate_xmin_lsn) &&
@@ -1920,10 +1927,15 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
* for the details.
*/
if (lsn > MyReplicationSlot->data.confirmed_flush)
+ {
MyReplicationSlot->data.confirmed_flush = lsn;
+ updated_confirmed = true;
+ }
SpinLockRelease(&MyReplicationSlot->mutex);
}
+
+ return updated_confirmed || updated_xmin || updated_restart;
}
/*
@@ -2089,10 +2101,14 @@ LogicalReplicationSlotCheckPendingWal(XLogRecPtr end_of_wal,
*
* *found_consistent_snapshot will be true if the initial decoding snapshot has
* been built; Otherwise, it will be false.
+ *
+ * *updated_xmin_or_lsn will be true if the local slot is updated; Otherwise,
+ * it will be false.
*/
XLogRecPtr
LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
- bool *found_consistent_snapshot)
+ bool *found_consistent_snapshot,
+ bool *updated_xmin_or_lsn)
{
LogicalDecodingContext *ctx;
ResourceOwner old_resowner PG_USED_FOR_ASSERTS_ONLY = CurrentResourceOwner;
@@ -2103,6 +2119,9 @@ LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
if (found_consistent_snapshot)
*found_consistent_snapshot = false;
+ if (updated_xmin_or_lsn)
+ *updated_xmin_or_lsn = false;
+
PG_TRY();
{
/*
@@ -2174,7 +2193,10 @@ LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
if (XLogRecPtrIsValid(ctx->reader->EndRecPtr))
{
- LogicalConfirmReceivedLocation(moveto);
+ bool slot_updated = LogicalConfirmReceivedLocation(moveto);
+
+ if (updated_xmin_or_lsn)
+ *updated_xmin_or_lsn = slot_updated;
/*
* If only the confirmed_flush LSN has changed the slot won't get
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..3e70cfb0a3e 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -315,13 +315,16 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
slot->data.confirmed_flush = remote_slot->confirmed_lsn;
slot->data.catalog_xmin = remote_slot->catalog_xmin;
SpinLockRelease(&slot->mutex);
+
+ updated_xmin_or_lsn = true;
}
else
{
bool found_consistent_snapshot;
LogicalSlotAdvanceAndCheckSnapState(remote_slot->confirmed_lsn,
- &found_consistent_snapshot);
+ &found_consistent_snapshot,
+ &updated_xmin_or_lsn);
/* Sanity check */
if (slot->data.confirmed_flush != remote_slot->confirmed_lsn)
@@ -349,8 +352,6 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid)
skip_reason = SS_SKIP_NO_CONSISTENT_SNAPSHOT;
}
}
-
- updated_xmin_or_lsn = true;
}
/* Update slot sync skip stats */
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 9f5e4f998fe..b6846bb313e 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -524,7 +524,7 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
static XLogRecPtr
pg_logical_replication_slot_advance(XLogRecPtr moveto)
{
- return LogicalSlotAdvanceAndCheckSnapState(moveto, NULL);
+ return LogicalSlotAdvanceAndCheckSnapState(moveto, NULL, NULL);
}
/*
diff --git a/src/include/replication/logical.h b/src/include/replication/logical.h
index bc9d4ece672..0208f9ca6a2 100644
--- a/src/include/replication/logical.h
+++ b/src/include/replication/logical.h
@@ -140,7 +140,7 @@ extern void LogicalIncreaseXminForSlot(XLogRecPtr current_lsn,
TransactionId xmin);
extern void LogicalIncreaseRestartDecodingForSlot(XLogRecPtr current_lsn,
XLogRecPtr restart_lsn);
-extern void LogicalConfirmReceivedLocation(XLogRecPtr lsn);
+extern bool LogicalConfirmReceivedLocation(XLogRecPtr lsn);
extern bool filter_prepare_cb_wrapper(LogicalDecodingContext *ctx,
TransactionId xid, const char *gid);
@@ -151,6 +151,7 @@ extern void UpdateDecodingStats(LogicalDecodingContext *ctx);
extern XLogRecPtr LogicalReplicationSlotCheckPendingWal(XLogRecPtr end_of_wal,
XLogRecPtr scan_cutoff_lsn);
extern XLogRecPtr LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr moveto,
- bool *found_consistent_snapshot);
+ bool *found_consistent_snapshot,
+ bool *updated_xmin_or_lsn);
#endif
--
2.51.2
[application/octet-stream] v2-0002-Reduce-log-level-of-some-logical-decoding-message.patch (3.3K, 3-v2-0002-Reduce-log-level-of-some-logical-decoding-message.patch)
download | inline diff:
From 87dfed10facb2d3ea692db72c2e0963c19d7c70f Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Thu, 19 Mar 2026 20:32:43 +0900
Subject: [PATCH v2 2/2] Reduce log level of some logical decoding messages
from LOG to DEBUG1
Previously some logical decoding messages (e.g., "logical decoding found
consistent point") were logged at level LOG, even though they provided
low-level, developer-oriented information that DBAs were typically not
interested in.
Since these messages can occur routinely (for example, when keeping calling
pg_logical_slot_get_changes() to obtain the changes from logical decoding),
logging them at LOG can be overly verbose.
This commit reduces their log level to DEBUG1 to avoid unnecessary log noise.
---
src/backend/replication/logical/logical.c | 2 +-
src/backend/replication/logical/snapbuild.c | 6 +++---
src/test/recovery/t/038_save_logical_slots_shutdown.pl | 1 +
3 files changed, 5 insertions(+), 4 deletions(-)
diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 9a3b0020040..4b5fefbc426 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -598,7 +598,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
ctx->reorder->output_rewrites = ctx->options.receive_rewrites;
- ereport(LOG,
+ ereport(DEBUG1,
(errmsg("starting logical decoding for slot \"%s\"",
NameStr(slot->data.name)),
errdetail("Streaming transactions committing after %X/%08X, reading WAL from %X/%08X.",
diff --git a/src/backend/replication/logical/snapbuild.c b/src/backend/replication/logical/snapbuild.c
index 37f0c6028bd..b4269a3b102 100644
--- a/src/backend/replication/logical/snapbuild.c
+++ b/src/backend/replication/logical/snapbuild.c
@@ -1312,7 +1312,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
builder->state = SNAPBUILD_CONSISTENT;
builder->next_phase_at = InvalidTransactionId;
- ereport(LOG,
+ ereport(DEBUG1,
errmsg("logical decoding found consistent point at %X/%08X",
LSN_FORMAT_ARGS(lsn)),
errdetail("There are no running transactions."));
@@ -1409,7 +1409,7 @@ SnapBuildFindSnapshot(SnapBuild *builder, XLogRecPtr lsn, xl_running_xacts *runn
builder->state = SNAPBUILD_CONSISTENT;
builder->next_phase_at = InvalidTransactionId;
- ereport(LOG,
+ ereport(DEBUG1,
errmsg("logical decoding found consistent point at %X/%08X",
LSN_FORMAT_ARGS(lsn)),
errdetail("There are no old transactions anymore."));
@@ -1915,7 +1915,7 @@ SnapBuildRestore(SnapBuild *builder, XLogRecPtr lsn)
Assert(builder->state == SNAPBUILD_CONSISTENT);
- ereport(LOG,
+ ereport(DEBUG1,
errmsg("logical decoding found consistent point at %X/%08X",
LSN_FORMAT_ARGS(lsn)),
errdetail("Logical decoding will begin using saved snapshot."));
diff --git a/src/test/recovery/t/038_save_logical_slots_shutdown.pl b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
index c0392d50460..05aa78b4bc7 100644
--- a/src/test/recovery/t/038_save_logical_slots_shutdown.pl
+++ b/src/test/recovery/t/038_save_logical_slots_shutdown.pl
@@ -48,6 +48,7 @@ $node_publisher->append_conf(
'postgresql.conf', q{
checkpoint_timeout = 1h
autovacuum = off
+log_min_messages = 'debug1'
});
$node_publisher->start;
--
2.51.2
view thread (13+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected]
Subject: Re: Fix slotsync worker busy loop causing repeated log messages
In-Reply-To: <CAHGQGwG1br8SQQnBieyJeRAmnMejEtJrVmurfTPxNouLHSW=jA@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox