From: Nisha Moond <[email protected]>
To: Fujii Masao <[email protected]>
Cc: Amit Kapila <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
Date: Tue, 24 Mar 2026 22:21:35 +0530
Message-ID: <CABdArM4tTwiLjaeg_Fipn2r9omnbJFYMHfRtgRSJNVnpYQyQJg@mail.gmail.com>
In-Reply-To: <CAHGQGwGETy+7Gv5=6kfYucQUy81SwGQbYr=nftHg7ZeqP07sBA@mail.gmail.com>
References: <CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com>
 <CAA4eK1+CrQNqiPDKv1wYfdkbX0FARJoi1=0ioaAqkLzbq2vG1w@mail.gmail.com>
 <CAHGQGwHABvuCoyM24HUiFZ5oJq_CoFomjt_cqD-0cJLMjFXJjQ@mail.gmail.com>
 <CABdArM4a8am4_PYhpse1UwoP2pbh5BzLbTmaePoDMsbFOeJZ-A@mail.gmail.com>
 <CAHGQGwFKULfab1NH1+_-+GdpJ8itUaKGU0_4Uwcr-y0MLZchyQ@mail.gmail.com>
 <CAHGQGwGETy+7Gv5=6kfYucQUy81SwGQbYr=nftHg7ZeqP07sBA@mail.gmail.com>
On Tue, Mar 24, 2026 at 2:45 PM Fujii Masao <[email protected]> wrote:
>
> On Tue, Mar 24, 2026 at 3:00 PM Fujii Masao <[email protected]> wrote:
> >
> > On Tue, Mar 24, 2026 at 1:01 PM Nisha Moond <[email protected]> wrote:
> > > Hi Fujii-san,
> > >
> > > I tried reproducing the wait scenario as you mentioned, but could not
> > > reproduce it.
> > > Steps I followed:
> > > 1) Place a debugger in the slotsync worker and hold it at
> > > fetch_remote_slots() ... -> libpqsrv_get_result()
> > > 2) Kill the primary.
> > > 3) Triggered promotion of the standby and release debugger from slotsync worker.
> > >
> > > The slot sync worker stops when the promotion is triggered and then
> > > restarts, but fails to connect to the primary. The promotion happens
> > > immediately.
> > > ```
> > > LOG: received promote request
> > > LOG: redo done at 0/0301AD40 system usage: CPU: user: 0.00 s, system:
> > > 0.02 s, elapsed: 4574.89 s
> > > LOG: last completed transaction was at log time 2026-03-23
> > > 17:13:15.782313+05:30
> > > LOG: replication slot synchronization worker will stop because
> > > promotion is triggered
> > > LOG: slot sync worker started
> > > ERROR: synchronization worker "slotsync worker" could not connect to
> > > the primary server: connection to server at "127.0.0.1", port 9933
> > > failed: Connection refused
> > > Is the server running on that host and accepting TCP/IP connections?
> > > ```
> > >
> > > I’ll debug this further to understand it better.
> > > In the meantime, please let me know if I’m missing any step, or if you
> > > followed a specific setup/script to reproduce this scenario.
> >
> > Thanks for testing!
> >
> > If you killed the primary with a signal like SIGTERM, an RST packet might have
> > been sent to the slotsync worker at that moment. That allowed the worker to
> > detect the connection loss and exited the wait state, so promotion could
> > complete as expected.
> >
> > To reproduce the issue, you'll need a scenario where the worker cannot detect
> > the connection loss. For example, you could block network traffic (e.g., with
> > iptables) between the primary and the slotsync worker. The key is to create
> > a situation where the worker remains stuck waiting for input for a long time.
>
> Here's one way to reproduce the issue using iptables:
>
Thank you, Fujii-san, for sharing the steps. I am now able to
reproduce the behavior where promotion gets stuck because the slot
sync worker remains in a wait loop.

As an experiment, I tried setting tcp_user_timeout to 7000 and 15000
milliseconds (slightly higher values, for debugging). With this
setting, the TCP stack terminates the connection if data sent to the
primary remains unacknowledged beyond the configured timeout (e.g.,
due to a network drop). In that case the slot sync worker exits
instead of waiting indefinitely. An appropriately tuned timeout could
therefore help avoid the promotion issue, since the worker would not
remain stuck when the connection to the primary is lost.
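
A minimal sketch of the setting described above, for illustration only.
It assumes the timeout is passed as the libpq connection parameter
tcp_user_timeout in the standby's primary_conninfo (the message does not
say where it was set), and that the platform exposes the
TCP_USER_TIMEOUT socket option (e.g. Linux). Both helper names are
hypothetical; the host and port are taken from the log quoted earlier.

```python
import socket


def conninfo_with_user_timeout(conninfo: str, timeout_ms: int) -> str:
    """Append tcp_user_timeout (milliseconds) to a keyword/value conninfo
    string. Assumption: the value is supplied via primary_conninfo."""
    return f"{conninfo} tcp_user_timeout={timeout_ms}"


def socket_with_user_timeout(timeout_ms: int) -> socket.socket:
    """Create a TCP socket with TCP_USER_TIMEOUT set, which is the socket
    option that bounds how long sent data may stay unacknowledged before
    the kernel closes the connection."""
    opt = getattr(socket, "TCP_USER_TIMEOUT", None)
    if opt is None:
        raise OSError("TCP_USER_TIMEOUT is not available on this platform")
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, opt, timeout_ms)
    return sock


if __name__ == "__main__":
    # Values from the experiment: 7000 ms and 15000 ms.
    print(conninfo_with_user_timeout("host=127.0.0.1 port=9933", 7000))
    if hasattr(socket, "TCP_USER_TIMEOUT"):
        s = socket_with_user_timeout(15000)
        print(s.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT))
        s.close()
```

The timeout only starts counting once there is unacknowledged outgoing
data, so it bounds the wait in the "sent but never acknowledged" case
rather than acting as a general idle timeout.
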
Thanks,
Nisha