From: Fujii Masao <[email protected]>
To: Nisha Moond <[email protected]>
Cc: Amit Kapila <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
Date: Wed, 25 Mar 2026 09:51:17 +0900
Message-ID: <CAHGQGwG3L=ppus6D6+RXxfZEdFgoAstJnbau=UU9WJZWdAoRoA@mail.gmail.com>
In-Reply-To: <CABdArM4tTwiLjaeg_Fipn2r9omnbJFYMHfRtgRSJNVnpYQyQJg@mail.gmail.com>
References: <CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com>
	<CAA4eK1+CrQNqiPDKv1wYfdkbX0FARJoi1=0ioaAqkLzbq2vG1w@mail.gmail.com>
	<CAHGQGwHABvuCoyM24HUiFZ5oJq_CoFomjt_cqD-0cJLMjFXJjQ@mail.gmail.com>
	<CABdArM4a8am4_PYhpse1UwoP2pbh5BzLbTmaePoDMsbFOeJZ-A@mail.gmail.com>
	<CAHGQGwFKULfab1NH1+_-+GdpJ8itUaKGU0_4Uwcr-y0MLZchyQ@mail.gmail.com>
	<CAHGQGwGETy+7Gv5=6kfYucQUy81SwGQbYr=nftHg7ZeqP07sBA@mail.gmail.com>
	<CABdArM4tTwiLjaeg_Fipn2r9omnbJFYMHfRtgRSJNVnpYQyQJg@mail.gmail.com>

On Wed, Mar 25, 2026 at 1:51 AM Nisha Moond <[email protected]> wrote:
> Thank you, Fujii-san, for sharing the steps. I am now able to
> reproduce the behavior where promotion gets stuck because the slot
> sync worker remains in a wait loop.

Thanks for the test!


> As an experiment, I tried setting tcp_user_timeout to 7000 / 15000
> (using slightly higher values for debugging). With this setting, the
> TCP stack terminates the connection if data sent to the primary
> remains unacknowledged beyond the configured timeout (e.g., due to a
> network drop). In such cases the slot sync worker exits instead of
> waiting indefinitely. With an appropriately tuned timeout, this could
> help avoid the promotion issue by ensuring the worker does not remain
> stuck when the connection to the primary is lost.

Yes, TCP timeout settings like tcp_user_timeout, keepalives,
and net.ipv4.tcp_retries2 can help in this situation. However,
they involve a trade-off: very small timeouts can reduce
failover time but increase the risk of false network failure detection,
while larger timeouts (e.g., 10s) avoid false positives but can
delay failover by that amount.
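
To make the trade-off concrete, here is a rough sketch in Python. It
assumes the parameters are passed as libpq connection options (for
example in the standby's primary_conninfo), where tcp_user_timeout is
in milliseconds and the keepalives_* settings are in seconds or probe
counts. The helper names, the host name, and the example values are
hypothetical, and the detection-time formula is only the usual
approximation for an idle connection.

```python
def keepalive_detection_seconds(idle: int, interval: int, count: int) -> int:
    """Approximate time before keepalive probes declare an idle connection dead.

    The kernel waits `idle` seconds before the first probe, then sends
    up to `count` probes spaced `interval` seconds apart.
    """
    return idle + interval * count


def build_conninfo(base: str, **tcp_opts: int) -> str:
    """Append TCP timeout options to a libpq-style conninfo string."""
    extra = [f"{key}={value}" for key, value in sorted(tcp_opts.items())]
    return " ".join([base] + extra)


# Hypothetical settings: short timeouts detect a dead primary quickly but
# risk false positives; long ones are safer but delay failover.
aggressive = dict(keepalives_idle=1, keepalives_interval=1,
                  keepalives_count=2, tcp_user_timeout=2000)
conservative = dict(keepalives_idle=5, keepalives_interval=2,
                    keepalives_count=3, tcp_user_timeout=10000)

for name, opts in (("aggressive", aggressive), ("conservative", conservative)):
    idle_detect = keepalive_detection_seconds(
        opts["keepalives_idle"],
        opts["keepalives_interval"],
        opts["keepalives_count"],
    )
    print(name, build_conninfo("host=primary.example", **opts))
    print("  idle connection detected dead after about", idle_detect, "s")
    print("  unacknowledged data tolerated for",
          opts["tcp_user_timeout"] / 1000, "s")
```

Whichever values are chosen, the worst-case wait during promotion is
bounded below by these numbers, which is the delay described above.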

Because of this, I think it's better to address the issue without
relying on such TCP timeout parameters.

Also, tcp_user_timeout is not available on platforms that don't
support TCP_USER_TIMEOUT (e.g., Windows).
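
As an illustration of the platform dependence (this is not
PostgreSQL's code, just a sketch of the underlying socket option), a
program can only set the timeout where the OS exposes
TCP_USER_TIMEOUT. The function name below is hypothetical.

```python
import socket


def try_set_tcp_user_timeout(sock: socket.socket, timeout_ms: int) -> bool:
    """Set TCP_USER_TIMEOUT on sock if the platform defines it.

    Returns True when the option was applied and False when the
    platform does not expose TCP_USER_TIMEOUT, in which case the
    caller has to rely on some other mechanism.
    """
    option = getattr(socket, "TCP_USER_TIMEOUT", None)
    if option is None:
        return False
    sock.setsockopt(socket.IPPROTO_TCP, option, timeout_ms)
    return True


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    applied = try_set_tcp_user_timeout(sock, 7000)
    print("TCP_USER_TIMEOUT applied:", applied)
finally:
    sock.close()
```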

Regards,

-- 
Fujii Masao
