Re: Fix slotsync worker busy loop causing repeated log messages

From: Amit Kapila <[email protected]>
To: Zhijie Hou (Fujitsu) <[email protected]>
Cc: Fujii Masao <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: Fix slotsync worker busy loop causing repeated log messages
Date: Tue, 3 Mar 2026 15:55:27 +0530
Message-ID: <CAA4eK1LGDd2y7Bnj9rHEJLzJx4vThF23+jH9j8bZjuMard9RRA@mail.gmail.com>
In-Reply-To: <OS7PR01MB16909C13530D84781E7C2E2EF947FA@OS7PR01MB16909.jpnprd01.prod.outlook.com>
References: <CAHGQGwF6zG9Z8ws1yb3hY1VqV-WT7hR0qyXCn2HdbjvZQKufDw@mail.gmail.com>
	<CAA4eK1KLk+TWyNPJ=z6SzQQXySc-N9Gs3eR-QKfV+MX7vfJWiw@mail.gmail.com>
	<OS7PR01MB16909C13530D84781E7C2E2EF947FA@OS7PR01MB16909.jpnprd01.prod.outlook.com>

On Tue, Mar 3, 2026 at 1:12 PM Zhijie Hou (Fujitsu)
<[email protected]> wrote:
>
> On Saturday, February 28, 2026 1:03 PM Amit Kapila <[email protected]> wrote:
> > On Fri, Feb 27, 2026 at 8:34 PM Fujii Masao <[email protected]> wrote:
> > >
> > > Normally, the slotsync worker updates the standby slot using the
> > > primary's slot state. However, when confirmed_flush_lsn matches but
> > > restart_lsn does not, the worker does not actually update the standby
> > > slot. Despite that, the current code of update_local_synced_slot()
> > > appears to treat this situation as if an update occurred. As a result,
> > > the worker sleeps only for the minimum interval (200 ms) before
> > > retrying. In the next cycle, it again assumes an update happened, and
> > > continues looping with the short sleep interval, causing the repeated
> > > logical decoding log messages. Based on a quick analysis, this seems to be
> > > the root cause.
> > >
> > > I think update_local_synced_slot() should return false (i.e., no update
> > > happened) when confirmed_flush_lsn is equal but restart_lsn differs
> > > between primary and standby.
> > >
> >
> > We expect that in such a case update_local_synced_slot() should advance
> > local_slot's 'restart_lsn' via LogicalSlotAdvanceAndCheckSnapState(),
> > otherwise, it won't go in the cheap code path next time. Normally, restart_lsn
> > advancement should happen when we process XLOG_RUNNING_XACTS and
> > call SnapBuildProcessRunningXacts(). In this particular case as both
> > restart_lsn and confirmed_flush_lsn are the same (0/03000140), the
> > machinery may not be processing XLOG_RUNNING_XACTS record. I have not
> > debugged the exact case yet but you can try by emitting some more records
> > on publisher, it should let the standby advance the slot. It is possible that we
> > can do something like you are proposing to silence the LOG messages but we
> > should know what is going on here.
>
> I reproduced and debugged this issue where a replication slot's restart_lsn
> fails to advance. In my environment, I found it only occurs when a synced
> slot first builds a consistent snapshot. The problematic code path is in
> SnapBuildProcessRunningXacts():
>
>     if (builder->state < SNAPBUILD_CONSISTENT)
>     {
>         /* returns false if there's no point in performing cleanup just yet */
>         if (!SnapBuildFindSnapshot(builder, lsn, running))
>             return;
>     }
>
> When a synced slot reaches consistency for the first time with no running
> transactions, SnapBuildFindSnapshot() returns false, causing the function to
> return without updating the candidate restart_lsn.
>
> So, an alternative approach is to improve this logic by updating the candidate
> restart_lsn in this case instead of returning early.
>

But why not return 'true' from SnapBuildFindSnapshot() in that case?
The comment atop SnapBuildFindSnapshot() says: "Returns true if there
is a point in performing internal maintenance/cleanup using the
xl_running_xacts record.". Doesn't updating restart_lsn fall under
that category?
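
To make the early-return path concrete, here is a minimal toy model in C.
It is a sketch under stated assumptions, not the PostgreSQL code: ToyBuilder,
toy_find_snapshot() and toy_process_running_xacts() are hypothetical
stand-ins, and the real candidate restart_lsn bookkeeping is collapsed into
a single field. The flag maintenance_on_consistency selects between the
behavior described in the quoted analysis (false) and the "return true"
idea raised above (true).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the snapshot builder state; not the real SnapBuild. */
typedef enum { TOY_BUILDING, TOY_CONSISTENT } ToyState;

typedef struct
{
    ToyState state;
    uint64_t candidate_restart_lsn;   /* 0 means "never set" in this toy */
} ToyBuilder;

/*
 * Hypothetical stand-in for SnapBuildFindSnapshot() in the case discussed
 * here: no running transactions, so consistency is reached immediately.
 * The return value says whether the caller should go on to maintenance.
 */
static bool
toy_find_snapshot(ToyBuilder *b, bool maintenance_on_consistency)
{
    assert(b != NULL);
    b->state = TOY_CONSISTENT;
    return maintenance_on_consistency;
}

/* Mirrors the shape of the quoted snippet: early return skips the update. */
static void
toy_process_running_xacts(ToyBuilder *b, uint64_t lsn,
                          bool maintenance_on_consistency)
{
    assert(b != NULL);
    if (b->state < TOY_CONSISTENT)
    {
        if (!toy_find_snapshot(b, maintenance_on_consistency))
            return;             /* candidate restart_lsn left untouched */
    }

    /* Assumption: maintenance is reduced to recording the candidate LSN. */
    b->candidate_restart_lsn = lsn;
}
```

With the flag set to false, the record that first brings the toy builder to
consistency leaves candidate_restart_lsn unset; with true, the same record
sets it. The toy only illustrates that single record and does not model how
each sync cycle sets up decoding.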

However, I have a question: even if restart_lsn was not advanced in the
first cycle, why does it not advance in subsequent sync cycles?
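
For reference, the cycle-to-cycle effect described in the quoted report can
be sketched as below. This is a toy model, not the worker's code: the
function names are hypothetical, the 200 ms minimum is the value mentioned
in the thread, and both the 30000 ms cap and the doubling backoff are
assumptions made for illustration. It shows why a cycle that wrongly
reports "updated" keeps the nap at the minimum indefinitely.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MIN_NAP_MS 200      /* minimum interval named in the thread */
#define TOY_MAX_NAP_MS 30000    /* assumed cap, for illustration only */

/*
 * Pick the next nap time: reset to the minimum when a cycle reports that
 * some slot was updated, otherwise back off by doubling up to the cap.
 */
static long
toy_next_nap_ms(long current_ms, bool some_slot_updated)
{
    assert(current_ms >= TOY_MIN_NAP_MS);

    if (some_slot_updated)
        return TOY_MIN_NAP_MS;

    long doubled = current_ms * 2;

    return doubled > TOY_MAX_NAP_MS ? TOY_MAX_NAP_MS : doubled;
}

/* Nap time after `cycles` sync cycles that all report the same result. */
static long
toy_nap_after_cycles(int cycles, bool reported_updated)
{
    long nap = TOY_MIN_NAP_MS;

    for (int i = 0; i < cycles; i++)
        nap = toy_next_nap_ms(nap, reported_updated);

    return nap;
}
```

In this model, cycles that keep reporting an update never leave the 200 ms
nap, while cycles that report no update back off until they reach the
assumed cap.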

-- 
With Regards,
Amit Kapila.