public inbox for [email protected]  
help / color / mirror / Atom feed
Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
42+ messages / 6 participants
[nested] [flat]

* Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-18 16:05  Fujii Masao <[email protected]>
  0 siblings, 2 replies; 42+ messages in thread

From: Fujii Masao @ 2026-03-18 16:05 UTC (permalink / raw)
  To: PostgreSQL Hackers <[email protected]>

Hi,

I noticed that during standby promotion the startup process sends SIGUSR1 to
the slotsync worker to make it exit. Is there a reason for using SIGUSR1?

If the slotsync worker is blocked waiting for input from the primary (e.g.,
due to a network outage between the primary and standby), SIGUSR1 won't
interrupt the wait. As a result, the worker can remain stuck and delay
promotion for a long time.

Would it make sense to send SIGTERM instead, so the worker can exit promptly
even while waiting? I've attached a WIP patch that does this. I haven't updated
the source comments yet, but I can do so if we agree on the approach.

SIGTERM alone is not sufficient, though. A new slotsync worker could start
immediately after the old one exits and block promotion again. To address this,
the patch makes a newly started worker exit immediately if promotion is
in progress.

Thoughts?

Regards,

-- 
Fujii Masao


Attachments:

  [application/octet-stream] v1-0001-Use-SIGTERM-to-stop-slotsync-worker-during-standb.patch (2.7K, 2-v1-0001-Use-SIGTERM-to-stop-slotsync-worker-during-standb.patch)
  download | inline diff:
From cdcce240fcabf3c4f91ad6931b451bc77db76caf Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Thu, 19 Mar 2026 00:50:07 +0900
Subject: [PATCH v1] Use SIGTERM to stop slotsync worker during standby
 promotion

Previously, when standby promotion was requested, the startup process sent
SIGUSR1 to the slotsync worker (or a backend performing slot synchronization)
to make it exit. This generally worked, but if the slotsync worker was blocked
waiting for input from the primary, SIGUSR1 would not interrupt the wait.
As a result, the worker could remain stuck, preventing promotion from
completing for a long time.

This commit fixes the issue by having the startup process send SIGTERM
instead of SIGUSR1, allowing the slotsync worker (or backend) to exit promptly
even while waiting for input.

Additionally, new slotsync worker could launch immediately after the old one
was terminated, which could again block promotion. To prevent this, a slotsync
worker that starts up during promotion now detects this condition and exits
immediately.

This ensures that standby promotion is not delayed by stuck or newly started
slotsync workers.
---
 src/backend/replication/logical/slotsync.c | 29 ++++++----------------
 1 file changed, 7 insertions(+), 22 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..3d18ff4a66e 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -1298,27 +1298,6 @@ ProcessSlotSyncInterrupts(void)
 {
 	CHECK_FOR_INTERRUPTS();
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
-
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
-	}
-
 	if (ConfigReloadPending)
 		slotsync_reread_config();
 }
@@ -1427,6 +1406,12 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+		proc_exit(0);
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1752,7 +1737,7 @@ ShutDownSlotSync(void)
 	 * detecting that the stopSignaled flag is set to true.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		kill(sync_process_pid, SIGTERM);
 
 	/* Wait for slot sync to end */
 	for (;;)
-- 
2.51.2



^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-18 16:23  Tom Lane <[email protected]>
  parent: Fujii Masao <[email protected]>
  1 sibling, 1 reply; 42+ messages in thread

From: Tom Lane @ 2026-03-18 16:23 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

Fujii Masao <[email protected]> writes:
> I noticed that during standby promotion the startup process sends SIGUSR1 to
> the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> Would it make sense to send SIGTERM instead, so the worker can exit promptly
> even while waiting?

One consideration here is that we expect all processes to receive
SIGTERM from init at the beginning of an operating system shutdown
sequence.  Background workers should exit at that point only if their
services will not be needed during database shutdown.  While it
sounds plausible that a slotsync worker should exit immediately,
I'm not quite sure if that's what we want.

			regards, tom lane





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-18 17:10  Fujii Masao <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 0 replies; 42+ messages in thread

From: Fujii Masao @ 2026-03-18 17:10 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Thu, Mar 19, 2026 at 1:23 AM Tom Lane <[email protected]> wrote:
>
> Fujii Masao <[email protected]> writes:
> > I noticed that during standby promotion the startup process sends SIGUSR1 to
> > the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> > Would it make sense to send SIGTERM instead, so the worker can exit promptly
> > even while waiting?
>
> One consideration here is that we expect all processes to receive
> SIGTERM from init at the beginning of an operating system shutdown
> sequence.  Background workers should exit at that point only if their
> services will not be needed during database shutdown.  While it
> sounds plausible that a slotsync worker should exit immediately,
> I'm not quite sure if that's what we want.

Currently, when the slotsync worker receives SIGUSR1 during promotion,
it exits at the next interrupt check (i.e., in ProcessSlotSyncInterrupts()).
There's no additional termination handling, so it seems the worker is expected
to exit promptly once the startup process requests it.

Given that, using SIGTERM to make the worker exit immediately seems OK to me...

With the patch, on SIGTERM, the worker basically still exits at the next
interrupt check. The difference is that if it's waiting for input from
the primary (i.e., DoingCommandRead = true), it calls ProcessInterrupts()
in the SIGTERM signal handler (die()) and exits immediately.

Regards.

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-21 16:52  Amit Kapila <[email protected]>
  parent: Fujii Masao <[email protected]>
  1 sibling, 1 reply; 42+ messages in thread

From: Amit Kapila @ 2026-03-21 16:52 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <[email protected]> wrote:
>
> I noticed that during standby promotion the startup process sends SIGUSR1 to
> the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
>

IIRC, this same signal is used for both the backend executing
pg_sync_replication_slots() and slotsync worker. We want the worker to
exit and error_out backend. Using SIGTERM for backend could result in
its exit. Also, we want the last slotsync cycle to complete before
promotion so that chances of subscribers that do failover/switchover
to new primary has better chances of finding failover slots
sync-ready.

-- 
With Regards,
Amit Kapila.





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-23 05:50  Fujii Masao <[email protected]>
  parent: Amit Kapila <[email protected]>
  0 siblings, 3 replies; 42+ messages in thread

From: Fujii Masao @ 2026-03-23 05:50 UTC (permalink / raw)
  To: Amit Kapila <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <[email protected]> wrote:
>
> On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <[email protected]> wrote:
> >
> > I noticed that during standby promotion the startup process sends SIGUSR1 to
> > the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> >
>
> IIRC, this same signal is used for both the backend executing
> pg_sync_replication_slots() and slotsync worker. We want the worker to
> exit and error_out backend. Using SIGTERM for backend could result in
> its exit.

Why do we want the backend running pg_sync_replication_slots() to throw
an error here, rather than just exit? If emitting an error is really required,
another option would be to store the process type in SlotSyncCtx and send
different signals accordingly, for example, SIGTERM for the slotsync worker
and another signal for a backend. But it seems simpler and sufficient to have
the backend exit in this case as well.


> Also, we want the last slotsync cycle to complete before
> promotion so that chances of subscribers that do failover/switchover
> to new primary has better chances of finding failover slots
> sync-ready.

I'm not sure how much this behavior helps in failover/switchover scenarios.
But the main issue is that if a primary crash triggers standby promotion,
that last slotsync cycle can get stuck waiting for input from the primary,
which delays promotion. IOW, failover time can become unnecessarily long
due to the slotsync worker. I'd like to address that problem.

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-24 04:01  Nisha Moond <[email protected]>
  parent: Fujii Masao <[email protected]>
  2 siblings, 1 reply; 42+ messages in thread

From: Nisha Moond @ 2026-03-24 04:01 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <[email protected]> wrote:
>
> On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <[email protected]> wrote:
> >
> > On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <[email protected]> wrote:
> > >
> > > I noticed that during standby promotion the startup process sends SIGUSR1 to
> > > the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> > >
> >
> > IIRC, this same signal is used for both the backend executing
> > pg_sync_replication_slots() and slotsync worker. We want the worker to
> > exit and error_out backend. Using SIGTERM for backend could result in
> > its exit.
>
> Why do we want the backend running pg_sync_replication_slots() to throw
> an error here, rather than just exit? If emitting an error is really required,
> another option would be to store the process type in SlotSyncCtx and send
> different signals accordingly, for example, SIGTERM for the slotsync worker
> and another signal for a backend. But it seems simpler and sufficient to have
> the backend exit in this case as well.
>
>
> > Also, we want the last slotsync cycle to complete before
> > promotion so that chances of subscribers that do failover/switchover
> > to new primary has better chances of finding failover slots
> > sync-ready.
>
> I'm not sure how much this behavior helps in failover/switchover scenarios.
> But the main issue is that if a primary crash triggers standby promotion,
> that last slotsync cycle can get stuck waiting for input from the primary,
> which delays promotion. IOW, failover time can become unnecessarily long
> due to the slotsync worker. I'd like to address that problem.
>

Hi Fujii-san,

I tried reproducing the wait scenario as you mentioned, but could not
reproduce it.
Steps I followed:
1) Place a debugger in the slotsync worker and hold it at
fetch_remote_slots() ... -> libpqsrv_get_result()
2) Kill the primary.
3) Triggered promotion of the standby and release debugger from slotsync worker.

The slot sync worker stops when the promotion is triggered and then
restarts, but fails to connect to the primary. The promotion happens
immediately.
```
LOG:  received promote request
LOG:  redo done at 0/0301AD40 system usage: CPU: user: 0.00 s, system:
0.02 s, elapsed: 4574.89 s
LOG:  last completed transaction was at log time 2026-03-23
17:13:15.782313+05:30
LOG:  replication slot synchronization worker will stop because
promotion is triggered
LOG:  slot sync worker started
ERROR:  synchronization worker "slotsync worker" could not connect to
the primary server: connection to server at "127.0.0.1", port 9933
failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?
```

I’ll debug this further to understand it better.
In the meantime, please let me know if I’m missing any step, or if you
followed a specific setup/script to reproduce this scenario.

--
Thanks,
Nisha





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-24 06:00  Fujii Masao <[email protected]>
  parent: Nisha Moond <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Fujii Masao @ 2026-03-24 06:00 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, Mar 24, 2026 at 1:01 PM Nisha Moond <[email protected]> wrote:
> Hi Fujii-san,
>
> I tried reproducing the wait scenario as you mentioned, but could not
> reproduce it.
> Steps I followed:
> 1) Place a debugger in the slotsync worker and hold it at
> fetch_remote_slots() ... -> libpqsrv_get_result()
> 2) Kill the primary.
> 3) Triggered promotion of the standby and release debugger from slotsync worker.
>
> The slot sync worker stops when the promotion is triggered and then
> restarts, but fails to connect to the primary. The promotion happens
> immediately.
> ```
> LOG:  received promote request
> LOG:  redo done at 0/0301AD40 system usage: CPU: user: 0.00 s, system:
> 0.02 s, elapsed: 4574.89 s
> LOG:  last completed transaction was at log time 2026-03-23
> 17:13:15.782313+05:30
> LOG:  replication slot synchronization worker will stop because
> promotion is triggered
> LOG:  slot sync worker started
> ERROR:  synchronization worker "slotsync worker" could not connect to
> the primary server: connection to server at "127.0.0.1", port 9933
> failed: Connection refused
> Is the server running on that host and accepting TCP/IP connections?
> ```
>
> I’ll debug this further to understand it better.
> In the meantime, please let me know if I’m missing any step, or if you
> followed a specific setup/script to reproduce this scenario.

Thanks for testing!

If you killed the primary with a signal like SIGTERM, an RST packet might have
been sent to the slotsync worker at that moment. That allowed the worker to
detect the connection loss and exited the wait state, so promotion could
complete as expected.

To reproduce the issue, you'll need a scenario where the worker cannot detect
the connection loss. For example, you could block network traffic (e.g., with
iptables) between the primary and the slotsync worker. The key is to create
a situation where the worker remains stuck waiting for input for a long time.

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-24 09:15  Fujii Masao <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Fujii Masao @ 2026-03-24 09:15 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, Mar 24, 2026 at 3:00 PM Fujii Masao <[email protected]> wrote:
>
> On Tue, Mar 24, 2026 at 1:01 PM Nisha Moond <[email protected]> wrote:
> > Hi Fujii-san,
> >
> > I tried reproducing the wait scenario as you mentioned, but could not
> > reproduce it.
> > Steps I followed:
> > 1) Place a debugger in the slotsync worker and hold it at
> > fetch_remote_slots() ... -> libpqsrv_get_result()
> > 2) Kill the primary.
> > 3) Triggered promotion of the standby and release debugger from slotsync worker.
> >
> > The slot sync worker stops when the promotion is triggered and then
> > restarts, but fails to connect to the primary. The promotion happens
> > immediately.
> > ```
> > LOG:  received promote request
> > LOG:  redo done at 0/0301AD40 system usage: CPU: user: 0.00 s, system:
> > 0.02 s, elapsed: 4574.89 s
> > LOG:  last completed transaction was at log time 2026-03-23
> > 17:13:15.782313+05:30
> > LOG:  replication slot synchronization worker will stop because
> > promotion is triggered
> > LOG:  slot sync worker started
> > ERROR:  synchronization worker "slotsync worker" could not connect to
> > the primary server: connection to server at "127.0.0.1", port 9933
> > failed: Connection refused
> > Is the server running on that host and accepting TCP/IP connections?
> > ```
> >
> > I’ll debug this further to understand it better.
> > In the meantime, please let me know if I’m missing any step, or if you
> > followed a specific setup/script to reproduce this scenario.
>
> Thanks for testing!
>
> If you killed the primary with a signal like SIGTERM, an RST packet might have
> been sent to the slotsync worker at that moment. That allowed the worker to
> detect the connection loss and exited the wait state, so promotion could
> complete as expected.
>
> To reproduce the issue, you'll need a scenario where the worker cannot detect
> the connection loss. For example, you could block network traffic (e.g., with
> iptables) between the primary and the slotsync worker. The key is to create
> a situation where the worker remains stuck waiting for input for a long time.

Here's one way to reproduce the issue using iptables:

----------------------------------------------------
[Set up slot synchronization environment]

initdb -D data --encoding=UTF8 --locale=C
cat <<EOF >> data/postgresql.conf
wal_level = logical
synchronized_standby_slots = 'physical_slot'
EOF
pg_ctl -D data start
pg_receivewal --create-slot -S physical_slot
pg_recvlogical --create-slot -S logical_slot -P pgoutput
--enable-failover -d postgres
psql -c "CREATE PUBLICATION mypub"

pg_basebackup -D sby1 -c fast -R -S physical_slot -d "dbname=postgres"
-h 127.0.0.1
cat <<EOF >> sby1/postgresql.conf
port = 5433
sync_replication_slots = on
hot_standby_feedback = on
EOF
pg_ctl -D sby1 start

psql -c "SELECT pg_logical_emit_message(true, 'abc', 'xyz')"


[Block network traffic used by slot synchronization]
su -
iptables -A INPUT -p tcp --sport 5432 -j DROP
iptables -A OUTPUT -p tcp --dport 5432 -j DROP


[Promote the standby]
# wait a few seconds
pg_ctl -D sby1 promote
----------------------------------------------------

In my tests on master, promotion got stuck in this scenario.
With the patch, promotion completed promptly.

After testing, you can remove the network block with:

iptables -D INPUT -p tcp --sport 5432 -j DROP
iptables -D OUTPUT -p tcp --dport 5432 -j DROP

Regards,


-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-24 09:37  Amit Kapila <[email protected]>
  parent: Fujii Masao <[email protected]>
  2 siblings, 0 replies; 42+ messages in thread

From: Amit Kapila @ 2026-03-24 09:37 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <[email protected]> wrote:
>
> On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <[email protected]> wrote:
> >
> > On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <[email protected]> wrote:
> > >
> > > I noticed that during standby promotion the startup process sends SIGUSR1 to
> > > the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> > >
> >
> > IIRC, this same signal is used for both the backend executing
> > pg_sync_replication_slots() and slotsync worker. We want the worker to
> > exit and error_out backend. Using SIGTERM for backend could result in
> > its exit.
>
> Why do we want the backend running pg_sync_replication_slots() to throw
> an error here, rather than just exit?
>

I think it was because the backends remain connected after promotion
and if we make them exit that will change the existing behavior.

-- 
With Regards,
Amit Kapila.





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-24 16:51  Nisha Moond <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Nisha Moond @ 2026-03-24 16:51 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, Mar 24, 2026 at 2:45 PM Fujii Masao <[email protected]> wrote:
>
> On Tue, Mar 24, 2026 at 3:00 PM Fujii Masao <[email protected]> wrote:
> >
> > On Tue, Mar 24, 2026 at 1:01 PM Nisha Moond <[email protected]> wrote:
> > > Hi Fujii-san,
> > >
> > > I tried reproducing the wait scenario as you mentioned, but could not
> > > reproduce it.
> > > Steps I followed:
> > > 1) Place a debugger in the slotsync worker and hold it at
> > > fetch_remote_slots() ... -> libpqsrv_get_result()
> > > 2) Kill the primary.
> > > 3) Triggered promotion of the standby and release debugger from slotsync worker.
> > >
> > > The slot sync worker stops when the promotion is triggered and then
> > > restarts, but fails to connect to the primary. The promotion happens
> > > immediately.
> > > ```
> > > LOG:  received promote request
> > > LOG:  redo done at 0/0301AD40 system usage: CPU: user: 0.00 s, system:
> > > 0.02 s, elapsed: 4574.89 s
> > > LOG:  last completed transaction was at log time 2026-03-23
> > > 17:13:15.782313+05:30
> > > LOG:  replication slot synchronization worker will stop because
> > > promotion is triggered
> > > LOG:  slot sync worker started
> > > ERROR:  synchronization worker "slotsync worker" could not connect to
> > > the primary server: connection to server at "127.0.0.1", port 9933
> > > failed: Connection refused
> > > Is the server running on that host and accepting TCP/IP connections?
> > > ```
> > >
> > > I’ll debug this further to understand it better.
> > > In the meantime, please let me know if I’m missing any step, or if you
> > > followed a specific setup/script to reproduce this scenario.
> >
> > Thanks for testing!
> >
> > If you killed the primary with a signal like SIGTERM, an RST packet might have
> > been sent to the slotsync worker at that moment. That allowed the worker to
> > detect the connection loss and exited the wait state, so promotion could
> > complete as expected.
> >
> > To reproduce the issue, you'll need a scenario where the worker cannot detect
> > the connection loss. For example, you could block network traffic (e.g., with
> > iptables) between the primary and the slotsync worker. The key is to create
> > a situation where the worker remains stuck waiting for input for a long time.
>
> Here's one way to reproduce the issue using iptables:
>

Thank you, Fujii-san, for sharing the steps. I am now able to
reproduce the behavior where promotion gets stuck because the slot
sync worker remains in a wait loop.

As an experiment, I tried setting tcp_user_timeout to 7000 / 15000
(using slightly higher values for debugging). With this setting, the
TCP stack terminates the connection if data sent to the primary
remains unacknowledged beyond the configured timeout (e.g., due to a
network drop). In such cases the slot sync worker exits instead of
waiting indefinitely. With an appropriately tuned timeout, this could
help avoid the promotion issue by ensuring the worker does not remain
stuck when the connection to the primary is lost.

Thanks,
Nisha





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-25 00:51  Fujii Masao <[email protected]>
  parent: Nisha Moond <[email protected]>
  0 siblings, 0 replies; 42+ messages in thread

From: Fujii Masao @ 2026-03-25 00:51 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Wed, Mar 25, 2026 at 1:51 AM Nisha Moond <[email protected]> wrote:
> Thank you, Fujii-san, for sharing the steps. I am now able to
> reproduce the behavior where promotion gets stuck because the slot
> sync worker remains in a wait loop.

Thanks for the test!


> As an experiment, I tried setting tcp_user_timeout to 7000 / 15000
> (using slightly higher values for debugging). With this setting, the
> TCP stack terminates the connection if data sent to the primary
> remains unacknowledged beyond the configured timeout (e.g., due to a
> network drop). In such cases the slot sync worker exits instead of
> waiting indefinitely. With an appropriately tuned timeout, this could
> help avoid the promotion issue by ensuring the worker does not remain
> stuck when the connection to the primary is lost.

Yes, TCP timeout settings like tcp_user_timeout, keepalives,
and net.ipv4.tcp_retries2 can help in this situation. However,
they involve a trade-off: using very small timeouts can reduce
failover time but increases the risk of false network failure detection,
while larger timeouts (e.g., 10s) avoid false positives but can
delay failover by that amount.

Because of this, I think it's better to address the issue without
relying on such TCP timeout parameters.

Also, tcp_user_timeout is not available on platforms that don't
support TCP_USER_TIMEOUT (e.g., Windows).

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-26 10:10  Amit Kapila <[email protected]>
  parent: Fujii Masao <[email protected]>
  2 siblings, 1 reply; 42+ messages in thread

From: Amit Kapila @ 2026-03-26 10:10 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <[email protected]> wrote:
>
> On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <[email protected]> wrote:
> >
> > On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <[email protected]> wrote:
> > >
> > > I noticed that during standby promotion the startup process sends SIGUSR1 to
> > > the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> > >
> >
> > IIRC, this same signal is used for both the backend executing
> > pg_sync_replication_slots() and slotsync worker. We want the worker to
> > exit and error_out backend. Using SIGTERM for backend could result in
> > its exit.
>
> Why do we want the backend running pg_sync_replication_slots() to throw
> an error here, rather than just exit? If emitting an error is really required,
> another option would be to store the process type in SlotSyncCtx and send
> different signals accordingly, for example, SIGTERM for the slotsync worker
> and another signal for a backend. But it seems simpler and sufficient to have
> the backend exit in this case as well.
>

As we want to retain the existing behavior for API, so instead of
using two signals, we can achieve what you intend to achieve by one
signal (SIGUSR1) only. We can use SendProcSignal mechanism as is used
ParallelWorkerShutdown. On promotion, we send a SIGUSR1 signal to
slotsync worker/backend via SendProcSignal. Then in
procsignal_sigusr1_handler(), it will call HandleSlotSyncInterrupt.
HandleSlotSyncInterrupt() will set the InterruptPending and
SlotSyncPending flag. Then ProcessInterrupt() will call a slotsync
specific function based on the flag and do what we currently do in
ProcessSlotSyncInterrupts. I think this should address the issue you
are worried about.

-- 
With Regards,
Amit Kapila.





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-26 10:37  Nisha Moond <[email protected]>
  parent: Amit Kapila <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Nisha Moond @ 2026-03-26 10:37 UTC (permalink / raw)
  To: Amit Kapila <[email protected]>; +Cc: Fujii Masao <[email protected]>; PostgreSQL Hackers <[email protected]>

On Thu, Mar 26, 2026 at 3:40 PM Amit Kapila <[email protected]> wrote:
>
> On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <[email protected]> wrote:
> >
> > On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <[email protected]> wrote:
> > >
> > > On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <[email protected]> wrote:
> > > >
> > > > I noticed that during standby promotion the startup process sends SIGUSR1 to
> > > > the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> > > >
> > >
> > > IIRC, this same signal is used for both the backend executing
> > > pg_sync_replication_slots() and slotsync worker. We want the worker to
> > > exit and error_out backend. Using SIGTERM for backend could result in
> > > its exit.
> >
> > Why do we want the backend running pg_sync_replication_slots() to throw
> > an error here, rather than just exit? If emitting an error is really required,
> > another option would be to store the process type in SlotSyncCtx and send
> > different signals accordingly, for example, SIGTERM for the slotsync worker
> > and another signal for a backend. But it seems simpler and sufficient to have
> > the backend exit in this case as well.
> >
>
> As we want to retain the existing behavior for API, so instead of
> using two signals, we can achieve what you intend to achieve by one
> signal (SIGUSR1) only. We can use SendProcSignal mechanism as is used
> ParallelWorkerShutdown. On promotion, we send a SIGUSR1 signal to
> slotsync worker/backend via SendProcSignal. Then in
> procsignal_sigusr1_handler(), it will call HandleSlotSyncInterrupt.
> HandleSlotSyncInterrupt() will set the InterruptPending and
> SlotSyncPending flag. Then ProcessInterrupt() will call a slotsync
> specific function based on the flag and do what we currently do in
> ProcessSlotSyncInterrupts. I think this should address the issue you
> are worried about.
>

+1
Retaining the current behavior for the API backend keeps it consistent
with other backends that continue after promotion.

In the reproduced case, the worker (or API backend) is waiting in:
libpqsrv_get_result -> WaitLatchOrSocket -> WaitEventSetWait.
When SIGUSR1 is received, it only sets the latch but does not mark any
interrupt as pending. As a result, CHECK_FOR_INTERRUPTS() is
effectively a no-op, and the process goes back to waiting. So, control
never returns to the slotsync code path, and we cannot rely on
stopSignaled to handle exit/error separately.
Only SIGTERM works here because its handler sets
INTERRUPTS_PENDING_CONDITION, allowing ProcessInterrupts() to run and
break the loop. The other signals like SIGUSR1 or SIGINT do not do
this, so simply using another signal might not solve the API error
handling case.

I’ve implemented the above approach suggested by Amit in the attached
patch and verified it for both worker and API scenarios. With this,
the API can now error-out without exiting the backend.

--
Thanks,
Nisha


Attachments:

  [application/octet-stream] v2-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch (7.2K, 2-v2-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch)
  download | inline diff:
From 2c3a644e303a2c9529efe60ab1d478194cc2f6de Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v2] Prevent slotsync worker/API hang during standby promotion

During standby promotion, ShutDownSlotSync() signals the slotsync worker
to stop and waits for it to finish. If the worker is blocked in WaitLatchOrSocket()
waiting for a response from the primary (e.g., due to a network failure),
the previous SIGUSR1 signal only sets the latch. The worker wakes up, finds no
pending interrupt, and goes back to waiting, causing ShutDownSlotSync() to wait
indefinitely and blocking promotion.

Fix this by introducing a new procsignal reason PROCSIG_SLOTSYNC_MESSAGE.
The signal handler sets the appropriate interrupt flags so that
WaitLatchOrSocket() returns and the worker exits cleanly, allowing promotion
to proceed.
---
 src/backend/replication/logical/slotsync.c | 68 +++++++++++++++++++++-
 src/backend/storage/ipc/procsignal.c       |  4 ++
 src/backend/tcop/postgres.c                |  4 ++
 src/include/replication/slotsync.h         |  7 +++
 src/include/storage/procsignal.h           |  1 +
 5 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..17c29ce3367 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -141,6 +141,13 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdown = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1323,6 +1330,44 @@ ProcessSlotSyncInterrupts(void)
 		slotsync_reread_config();
 }
 
+/*
+ * Signal handler called (in signal context) when PROCSIG_SLOTSYNC_MESSAGE
+ * is received.  Sets the SlotSyncShutdown flag so that ProcessInterrupts()
+ * will dispatch to HandleSlotSyncShutdown() at the next safe point.
+ */
+void
+HandleSlotSyncShutdownInterrupt(void)
+{
+	InterruptPending = true;
+	SlotSyncShutdown = true;
+}
+
+/*
+ * Handle a slot-sync shutdown request, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error.
+ */
+void
+HandleSlotSyncShutdown(void)
+{
+	SlotSyncShutdown = false;
+
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
+	}
+	else
+	{
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
+}
+
 /*
  * Connection cleanup function for slotsync worker.
  *
@@ -1427,6 +1472,23 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a new API call) that starts after the old worker was
+	 * stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+			proc_exit(0);
+		else
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1748,11 +1810,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7e017c8d53b..770ecf2ab19 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -24,6 +24,7 @@
 #include "port/pg_bitutils.h"
 #include "replication/logicalctl.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -700,6 +701,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncShutdownInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT))
 		HandleRecoveryConflictInterrupt();
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b3563113219..b8973ec0646 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3576,6 +3577,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdown)
+		HandleSlotSyncShutdown();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e546d0d050d..23471b0161e 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncShutdownInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdown;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncShutdownInterrupt(void);
+extern void HandleSlotSyncShutdown(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 348fba53a93..3a75d500e7c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slotsync worker/API to stop */
 	PROCSIG_RECOVERY_CONFLICT,	/* backend is blocking recovery, check
 								 * PGPROC->pendingRecoveryConflicts for the
 								 * reason */
-- 
2.50.1 (Apple Git-155)



^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-27 03:58  shveta malik <[email protected]>
  parent: Nisha Moond <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: shveta malik @ 2026-03-27 03:58 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: Amit Kapila <[email protected]>; Fujii Masao <[email protected]>; PostgreSQL Hackers <[email protected]>; shveta malik <[email protected]>

On Thu, Mar 26, 2026 at 4:08 PM Nisha Moond <[email protected]> wrote:
>
> On Thu, Mar 26, 2026 at 3:40 PM Amit Kapila <[email protected]> wrote:
> >
> > On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <[email protected]> wrote:
> > >
> > > On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <[email protected]> wrote:
> > > >
> > > > On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <[email protected]> wrote:
> > > > >
> > > > > I noticed that during standby promotion the startup process sends SIGUSR1 to
> > > > > the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> > > > >
> > > >
> > > > IIRC, this same signal is used for both the backend executing
> > > > pg_sync_replication_slots() and slotsync worker. We want the worker to
> > > > exit and error_out backend. Using SIGTERM for backend could result in
> > > > its exit.
> > >
> > > Why do we want the backend running pg_sync_replication_slots() to throw
> > > an error here, rather than just exit? If emitting an error is really required,
> > > another option would be to store the process type in SlotSyncCtx and send
> > > different signals accordingly, for example, SIGTERM for the slotsync worker
> > > and another signal for a backend. But it seems simpler and sufficient to have
> > > the backend exit in this case as well.
> > >
> >
> > As we want to retain the existing behavior for API, so instead of
> > using two signals, we can achieve what you intend to achieve by one
> > signal (SIGUSR1) only. We can use SendProcSignal mechanism as is used
> > ParallelWorkerShutdown. On promotion, we send a SIGUSR1 signal to
> > slotsync worker/backend via SendProcSignal. Then in
> > procsignal_sigusr1_handler(), it will call HandleSlotSyncInterrupt.
> > HandleSlotSyncInterrupt() will set the InterruptPending and
> > SlotSyncPending flag. Then ProcessInterrupt() will call a slotsync
> > specific function based on the flag and do what we currently do in
> > ProcessSlotSyncInterrupts. I think this should address the issue you
> > are worried about.
> >
>
> +1
> Retaining the current behavior for the API backend keeps it consistent
> with other backends that continue after promotion.
>
> In the reproduced case, the worker (or API backend) is waiting in:
> libpqsrv_get_result -> WaitLatchOrSocket -> WaitEventSetWait.
> When SIGUSR1 is received, it only sets the latch but does not mark any
> interrupt as pending. As a result, CHECK_FOR_INTERRUPTS() is
> effectively a no-op, and the process goes back to waiting. So, control
> never returns to the slotsync code path, and we cannot rely on
> stopSignaled to handle exit/error separately.
> Only SIGTERM works here because its handler sets
> INTERRUPTS_PENDING_CONDITION, allowing ProcessInterrupts() to run and
> break the loop. The other signals like SIGUSR1 or SIGINT do not do
> this, so simply using another signal might not solve the API error
> handling case.
>
> I’ve implemented the above approach suggested by Amit in the attached
> patch and verified it for both worker and API scenarios. With this,
> the API can now error-out without exiting the backend.
>

+1 on the idea. Few comments:

1)
It was not clear initially as to why SetLatch is not done in
HandleSlotSyncShutdownInterrupt(), digging it further revealed that
procsignal_sigusr1_handler() will do SetLatch outside. Perhaps you can
add below comment at the end of HandleSlotSyncShutdownInterrupt()
similar to how other functions (HandleProcSignalBarrierInterrupt,
HandleRecoveryConflictInterrupt etc) do.

/* latch will be set by procsignal_sigusr1_handler */

2)
In ProcessSlotSyncInterrupts(), now we don't need the below logic right?

if (SlotSyncCtx->stopSignaled)
    {
        if (AmLogicalSlotSyncWorkerProcess())
        {
            ...
            proc_exit(0);
        }
        else
        {
            /*
             * For the backend executing SQL function
             * pg_sync_replication_slots().
             */
            ereport(ERROR,
                    errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                    errmsg("replication slot synchronization will stop
because promotion is triggered"));
        }
    }

thanks
Shveta





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-27 04:57  Nisha Moond <[email protected]>
  parent: shveta malik <[email protected]>
  0 siblings, 2 replies; 42+ messages in thread

From: Nisha Moond @ 2026-03-27 04:57 UTC (permalink / raw)
  To: shveta malik <[email protected]>; +Cc: Amit Kapila <[email protected]>; Fujii Masao <[email protected]>; PostgreSQL Hackers <[email protected]>

On Fri, Mar 27, 2026 at 9:28 AM shveta malik <[email protected]> wrote:
>
> On Thu, Mar 26, 2026 at 4:08 PM Nisha Moond <[email protected]> wrote:
> >
> > On Thu, Mar 26, 2026 at 3:40 PM Amit Kapila <[email protected]> wrote:
> > >
> > > On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <[email protected]> wrote:
> > > >
> > > > On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <[email protected]> wrote:
> > > > >
> > > > > On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <[email protected]> wrote:
> > > > > >
> > > > > > I noticed that during standby promotion the startup process sends SIGUSR1 to
> > > > > > the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> > > > > >
> > > > >
> > > > > IIRC, this same signal is used for both the backend executing
> > > > > pg_sync_replication_slots() and slotsync worker. We want the worker to
> > > > > exit and error_out backend. Using SIGTERM for backend could result in
> > > > > its exit.
> > > >
> > > > Why do we want the backend running pg_sync_replication_slots() to throw
> > > > an error here, rather than just exit? If emitting an error is really required,
> > > > another option would be to store the process type in SlotSyncCtx and send
> > > > different signals accordingly, for example, SIGTERM for the slotsync worker
> > > > and another signal for a backend. But it seems simpler and sufficient to have
> > > > the backend exit in this case as well.
> > > >
> > >
> > > As we want to retain the existing behavior for API, so instead of
> > > using two signals, we can achieve what you intend to achieve by one
> > > signal (SIGUSR1) only. We can use SendProcSignal mechanism as is used
> > > ParallelWorkerShutdown. On promotion, we send a SIGUSR1 signal to
> > > slotsync worker/backend via SendProcSignal. Then in
> > > procsignal_sigusr1_handler(), it will call HandleSlotSyncInterrupt.
> > > HandleSlotSyncInterrupt() will set the InterruptPending and
> > > SlotSyncPending flag. Then ProcessInterrupt() will call a slotsync
> > > specific function based on the flag and do what we currently do in
> > > ProcessSlotSyncInterrupts. I think this should address the issue you
> > > are worried about.
> > >
> >
> > +1
> > Retaining the current behavior for the API backend keeps it consistent
> > with other backends that continue after promotion.
> >
> > In the reproduced case, the worker (or API backend) is waiting in:
> > libpqsrv_get_result -> WaitLatchOrSocket -> WaitEventSetWait.
> > When SIGUSR1 is received, it only sets the latch but does not mark any
> > interrupt as pending. As a result, CHECK_FOR_INTERRUPTS() is
> > effectively a no-op, and the process goes back to waiting. So, control
> > never returns to the slotsync code path, and we cannot rely on
> > stopSignaled to handle exit/error separately.
> > Only SIGTERM works here because its handler sets
> > INTERRUPTS_PENDING_CONDITION, allowing ProcessInterrupts() to run and
> > break the loop. The other signals like SIGUSR1 or SIGINT do not do
> > this, so simply using another signal might not solve the API error
> > handling case.
> >
> > I’ve implemented the above approach suggested by Amit in the attached
> > patch and verified it for both worker and API scenarios. With this,
> > the API can now error-out without exiting the backend.
> >
>
> +1 on the idea. Few comments:
>

Thanks for the review.

> 1)
> It was not clear initially as to why SetLatch is not done in
> HandleSlotSyncShutdownInterrupt(), digging it further revealed that
> procsignal_sigusr1_handler() will do SetLatch outside. Perhaps you can
> add below comment at the end of HandleSlotSyncShutdownInterrupt()
> similar to how other functions (HandleProcSignalBarrierInterrupt,
> HandleRecoveryConflictInterrupt etc) do.
>
> /* latch will be set by procsignal_sigusr1_handler */
>

Fixed.

> 2)
> In ProcessSlotSyncInterrupts(), now we don't need the below logic right?
>
> if (SlotSyncCtx->stopSignaled)
>     {
>         if (AmLogicalSlotSyncWorkerProcess())
>         {
>             ...
>             proc_exit(0);
>         }
>         else
>         {
>             /*
>              * For the backend executing SQL function
>              * pg_sync_replication_slots().
>              */
>             ereport(ERROR,
>                     errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                     errmsg("replication slot synchronization will stop
> because promotion is triggered"));
>         }
>     }
>

Right. Attached patch with the suggested changes.

--
Thanks,
Nisha


Attachments:

  [application/octet-stream] v3-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch (8.1K, 2-v3-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch)
  download | inline diff:
From de7f4d06a1db405a7aaf5e0d975a303cd7a81baa Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v3] Prevent slotsync worker/API hang during standby promotion

During standby promotion, ShutDownSlotSync() signals the slotsync worker
to stop and waits for it to finish. If the worker is blocked in WaitLatchOrSocket()
waiting for a response from the primary (e.g., due to a network failure),
the previous SIGUSR1 signal only sets the latch. The worker wakes up, finds no
pending interrupt, and goes back to waiting, causing ShutDownSlotSync() to wait
indefinitely and blocking promotion.

Fix this by introducing a new procsignal reason PROCSIG_SLOTSYNC_MESSAGE.
The signal handler sets the appropriate interrupt flags so that
WaitLatchOrSocket() returns and the worker exits cleanly, allowing promotion
to proceed.
---
 src/backend/replication/logical/slotsync.c | 101 ++++++++++++++++-----
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 ++
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 93 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..ed988eb6dc8 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -141,6 +141,13 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdown = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1298,31 +1305,49 @@ ProcessSlotSyncInterrupts(void)
 {
 	CHECK_FOR_INTERRUPTS();
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
-
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
-	}
-
 	if (ConfigReloadPending)
 		slotsync_reread_config();
 }
 
+/*
+ * Signal handler called (in signal context) when PROCSIG_SLOTSYNC_MESSAGE
+ * is received.  Sets the SlotSyncShutdown flag so that ProcessInterrupts()
+ * will dispatch to HandleSlotSyncShutdown() at the next safe point.
+ */
+void
+HandleSlotSyncShutdownInterrupt(void)
+{
+	InterruptPending = true;
+	SlotSyncShutdown = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
+
+/*
+ * Handle a slot-sync shutdown request, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error.
+ */
+void
+HandleSlotSyncShutdown(void)
+{
+	SlotSyncShutdown = false;
+
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
+	}
+	else
+	{
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
+}
+
 /*
  * Connection cleanup function for slotsync worker.
  *
@@ -1427,6 +1452,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a new API call) that starts after the old worker was
+	 * stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(LOG,
+					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will stop because promotion is triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1748,11 +1801,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7e017c8d53b..770ecf2ab19 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -24,6 +24,7 @@
 #include "port/pg_bitutils.h"
 #include "replication/logicalctl.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -700,6 +701,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncShutdownInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT))
 		HandleRecoveryConflictInterrupt();
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b3563113219..b8973ec0646 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3576,6 +3577,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdown)
+		HandleSlotSyncShutdown();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e546d0d050d..23471b0161e 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncShutdownInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdown;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncShutdownInterrupt(void);
+extern void HandleSlotSyncShutdown(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 348fba53a93..3a75d500e7c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slotsync worker/API to stop */
 	PROCSIG_RECOVERY_CONFLICT,	/* backend is blocking recovery, check
 								 * PGPROC->pendingRecoveryConflicts for the
 								 * reason */
-- 
2.50.1 (Apple Git-155)



^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-27 06:06  Fujii Masao <[email protected]>
  parent: Nisha Moond <[email protected]>
  1 sibling, 0 replies; 42+ messages in thread

From: Fujii Masao @ 2026-03-27 06:06 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: shveta malik <[email protected]>; Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Fri, Mar 27, 2026 at 1:57 PM Nisha Moond <[email protected]> wrote:
>
> Right. Attached patch with the suggested changes.
>

Thanks for making the patch!



^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-27 07:49  Amit Kapila <[email protected]>
  parent: Nisha Moond <[email protected]>
  1 sibling, 1 reply; 42+ messages in thread

From: Amit Kapila @ 2026-03-27 07:49 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: shveta malik <[email protected]>; Fujii Masao <[email protected]>; PostgreSQL Hackers <[email protected]>

On Fri, Mar 27, 2026 at 10:27 AM Nisha Moond <[email protected]> wrote:
>
> On Fri, Mar 27, 2026 at 9:28 AM shveta malik <[email protected]> wrote:
> >
> > In ProcessSlotSyncInterrupts(), now we don't need the below logic right?
> >
> > if (SlotSyncCtx->stopSignaled)
> >     {
> >         if (AmLogicalSlotSyncWorkerProcess())
> >         {
> >             ...
> >             proc_exit(0);
> >         }
> >         else
> >         {
> >             /*
> >              * For the backend executing SQL function
> >              * pg_sync_replication_slots().
> >              */
> >             ereport(ERROR,
> >                     errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> >                     errmsg("replication slot synchronization will stop
> > because promotion is triggered"));
> >         }
> >     }
> >
>
> Right. Attached patch with the suggested changes.
>

After this change, why do we need to invoke
ProcessSlotSyncInterrupts() twice in SyncReplicationSlots?

-- 
With Regards,
Amit Kapila.





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-27 09:49  Amit Kapila <[email protected]>
  parent: Amit Kapila <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Amit Kapila @ 2026-03-27 09:49 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: shveta malik <[email protected]>; Fujii Masao <[email protected]>; PostgreSQL Hackers <[email protected]>

On Fri, Mar 27, 2026 at 1:19 PM Amit Kapila <[email protected]> wrote:
>
> On Fri, Mar 27, 2026 at 10:27 AM Nisha Moond <[email protected]> wrote:
> >
> > On Fri, Mar 27, 2026 at 9:28 AM shveta malik <[email protected]> wrote:
> > >
> > > In ProcessSlotSyncInterrupts(), now we don't need the below logic right?
> > >
> > > if (SlotSyncCtx->stopSignaled)
> > >     {
> > >         if (AmLogicalSlotSyncWorkerProcess())
> > >         {
> > >             ...
> > >             proc_exit(0);
> > >         }
> > >         else
> > >         {
> > >             /*
> > >              * For the backend executing SQL function
> > >              * pg_sync_replication_slots().
> > >              */
> > >             ereport(ERROR,
> > >                     errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> > >                     errmsg("replication slot synchronization will stop
> > > because promotion is triggered"));
> > >         }
> > >     }
> > >
> >
> > Right. Attached patch with the suggested changes.
> >
>
> After this change, why do we need to invoke
> ProcessSlotSyncInterrupts() twice in SyncReplicationSlots?
>

Also, not sure if it is a good idea to name current function as
ProcessSlotSyncInterrupts() because we remove most of its interrupt
handling. Shall we copy paste its code at two places as we do similar
handling at other places as well.

Another comment:
*
+
+ if (SlotSyncShutdown)
+ HandleSlotSyncShutdown();
...
...
+ if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+ HandleSlotSyncShutdownInterrupt();

Would it better if we name these functions as HandleSlotSyncMessage()
and HandleSlotSyncMessageInterrupt() because for API, these simply
lead to an ERROR and that would match with the ProcSignalReason name
PROCSIG_SLOTSYNC_MESSAGE?

-- 
With Regards,
Amit Kapila.





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-27 12:37  Nisha Moond <[email protected]>
  parent: Amit Kapila <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Nisha Moond @ 2026-03-27 12:37 UTC (permalink / raw)
  To: Amit Kapila <[email protected]>; +Cc: shveta malik <[email protected]>; Fujii Masao <[email protected]>; PostgreSQL Hackers <[email protected]>

On Fri, Mar 27, 2026 at 3:19 PM Amit Kapila <[email protected]> wrote:
>
> On Fri, Mar 27, 2026 at 1:19 PM Amit Kapila <[email protected]> wrote:
> >
> > After this change, why do we need to invoke
> > ProcessSlotSyncInterrupts() twice in SyncReplicationSlots?
> >

Fixed.

>
> Also, not sure if it is a good idea to name current function as
> ProcessSlotSyncInterrupts() because we remove most of its interrupt
> handling. Shall we copy paste its code at two places as we do similar
> handling at other places as well.
>

Done.

> Another comment:
> *
> +
> + if (SlotSyncShutdown)
> + HandleSlotSyncShutdown();
> ...
> ...
> + if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
> + HandleSlotSyncShutdownInterrupt();
>
> Would it better if we name these functions as HandleSlotSyncMessage()
> and HandleSlotSyncMessageInterrupt() because for API, these simply
> lead to an ERROR and that would match with the ProcSignalReason name
> PROCSIG_SLOTSYNC_MESSAGE?
>
Done.

Attached the updated patch.

--
Thanks,
Nisha


Attachments:

  [application/octet-stream] v4-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch (9.1K, 2-v4-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch)
  download | inline diff:
From d9580bcfcb26ba1e220e7251d04d8b215f3fb405 Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v4] Prevent slotsync worker/API hang during standby promotion

During standby promotion, ShutDownSlotSync() signals the slotsync worker
to stop and waits for it to finish. If the worker is blocked in
WaitLatchOrSocket() waiting for a response from the primary (e.g., due
to a network failure), the previous SIGUSR1 signal only sets the latch.
The worker wakes up, finds no pending interrupt, and goes back to
waiting, causing ShutDownSlotSync() to wait indefinitely and blocking
promotion.

Fix this by introducing a new procsignal reason PROCSIG_SLOTSYNC_MESSAGE.
The signal handler sets the appropriate interrupt flags so that
WaitLatchOrSocket() returns and the worker exits cleanly, allowing
promotion to proceed.
---
 src/backend/replication/logical/slotsync.c | 110 ++++++++++++++-------
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 ++
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 93 insertions(+), 33 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..7fab9c75a8c 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -141,6 +141,13 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdown = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1291,36 +1298,42 @@ slotsync_reread_config(void)
 }
 
 /*
- * Interrupt handler for process performing slot synchronization.
+ * Signal handler called (in signal context) when PROCSIG_SLOTSYNC_MESSAGE
+ * is received.  Sets the SlotSyncShutdown flag so that ProcessInterrupts()
+ * will dispatch to HandleSlotSyncMessage() at the next safe point.
  */
-static void
-ProcessSlotSyncInterrupts(void)
+void
+HandleSlotSyncMessageInterrupt(void)
 {
-	CHECK_FOR_INTERRUPTS();
+	InterruptPending = true;
+	SlotSyncShutdown = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+/*
+ * Handle a PROCSIG_SLOTSYNC_MESSAGE signal, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error.
+ */
+void
+HandleSlotSyncMessage(void)
+{
+	SlotSyncShutdown = false;
 
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
+	}
+	else
+	{
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
 	}
-
-	if (ConfigReloadPending)
-		slotsync_reread_config();
 }
 
 /*
@@ -1427,6 +1440,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a new API call) that starts after the old worker was
+	 * stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(LOG,
+					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will stop because promotion is triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1635,7 +1676,10 @@ ReplSlotSyncWorkerMain(const void *startup_data, size_t startup_data_len)
 		bool		started_tx = false;
 		List	   *remote_slots;
 
-		ProcessSlotSyncInterrupts();
+		CHECK_FOR_INTERRUPTS();
+
+		if (ConfigReloadPending)
+			slotsync_reread_config();
 
 		/*
 		 * The syscache access in fetch_remote_slots() needs a transaction
@@ -1748,11 +1792,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
@@ -1931,9 +1975,6 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 
 		check_and_set_sync_info(MyProcPid);
 
-		/* Check for interrupts and config changes */
-		ProcessSlotSyncInterrupts();
-
 		validate_remote_info(wrconn);
 
 		/* Retry until all the slots are sync-ready */
@@ -1943,7 +1984,10 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 			bool		some_slot_updated = false;
 
 			/* Check for interrupts and config changes */
-			ProcessSlotSyncInterrupts();
+			CHECK_FOR_INTERRUPTS();
+
+			if (ConfigReloadPending)
+				slotsync_reread_config();
 
 			/* We must be in a valid transaction state */
 			Assert(IsTransactionState());
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7e017c8d53b..99792b13760 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -24,6 +24,7 @@
 #include "port/pg_bitutils.h"
 #include "replication/logicalctl.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -700,6 +701,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncMessageInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT))
 		HandleRecoveryConflictInterrupt();
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b3563113219..7580ba8b5f5 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3576,6 +3577,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdown)
+		HandleSlotSyncMessage();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e546d0d050d..debac787ab2 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncMessageInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdown;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncMessageInterrupt(void);
+extern void HandleSlotSyncMessage(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 348fba53a93..3a75d500e7c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slotsync worker/API to stop */
 	PROCSIG_RECOVERY_CONFLICT,	/* backend is blocking recovery, check
 								 * PGPROC->pendingRecoveryConflicts for the
 								 * reason */
-- 
2.50.1 (Apple Git-155)



^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-27 17:19  Fujii Masao <[email protected]>
  parent: Nisha Moond <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Fujii Masao @ 2026-03-27 17:19 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: Amit Kapila <[email protected]>; shveta malik <[email protected]>; PostgreSQL Hackers <[email protected]>

On Fri, Mar 27, 2026 at 9:38 PM Nisha Moond <[email protected]> wrote:
> Attached the updated patch.

Thanks for updating the patch! It looks good overall.

Regarding the comments in SlotSyncCtxStruct, since the role of
stopSignaled field has changed, those comments should be updated
accordingly? For example,

-------------------------
- * the SQL function pg_sync_replication_slots(). When the startup process sets
- * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
- * synchronizing process so that the process can immediately stop its
- * synchronizing work on seeing 'stopSignaled' set.
- * Setting 'stopSignaled' is also used to handle the race condition when the
+ * the SQL function pg_sync_replication_slots(). On promotion,
+ * the startup process sets 'stopSignaled' and uses this 'pid' to wake up
+ * the currently synchronizing process so that the process can
+ * immediately stop its synchronizing work.
+ * Setting 'stopSignaled' is used to handle the race condition when the
-------------------------


+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdown = false;

For the interrupt flag set in procsignal_sigusr1_handler(), other flags
use a *Pending suffix (e.g., ProcSignalBarrierPending,
ParallelApplyMessagePending), so SlotSyncShutdownPending would
be more consistent.


+void
+HandleSlotSyncMessage(void)

Functions called from ProcessInterrupts() typically use the Process* prefix
(e.g., ProcessProcSignalBarrier(), ProcessParallelApplyMessages()),
so ProcessSlotSyncMessage would be more consistent than HandleSlotSyncMessage.


+ ereport(LOG,
+ errmsg("replication slot synchronization worker will stop because
promotion is triggered"));
+
+ proc_exit(0);
+ }
+ else
+ {
+ /*
+ * For the backend executing SQL function
+ * pg_sync_replication_slots().
+ */
+ ereport(ERROR,
+ errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("replication slot synchronization will stop because promotion
is triggered"));

The log messages say "will stop", but since sync hasn't started yet,
"will not start" seems clearer here. For example, "replication slot
synchronization worker will not start because promotion was triggered"
and "replication slot synchronization will not start because promotion was
triggered". Thought?

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-30 04:18  Nisha Moond <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 2 replies; 42+ messages in thread

From: Nisha Moond @ 2026-03-30 04:18 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Amit Kapila <[email protected]>; shveta malik <[email protected]>; PostgreSQL Hackers <[email protected]>

On Fri, Mar 27, 2026 at 10:49 PM Fujii Masao <[email protected]> wrote:
>
> On Fri, Mar 27, 2026 at 9:38 PM Nisha Moond <[email protected]> wrote:
> > Attached the updated patch.
>
> Thanks for updating the patch! It looks good overall.
>

Thank you Fujii-san for the review.

> Regarding the comments in SlotSyncCtxStruct, since the role of
> stopSignaled field has changed, those comments should be updated
> accordingly? For example,
>
> -------------------------
> - * the SQL function pg_sync_replication_slots(). When the startup process sets
> - * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
> - * synchronizing process so that the process can immediately stop its
> - * synchronizing work on seeing 'stopSignaled' set.
> - * Setting 'stopSignaled' is also used to handle the race condition when the
> + * the SQL function pg_sync_replication_slots(). On promotion,
> + * the startup process sets 'stopSignaled' and uses this 'pid' to wake up
> + * the currently synchronizing process so that the process can
> + * immediately stop its synchronizing work.
> + * Setting 'stopSignaled' is used to handle the race condition when the
> -------------------------
>

Updated as suggested.

>
> +/*
> + * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
> + * slotsync worker or pg_sync_replication_slots() to stop because
> + * standby promotion has been triggered.
> + */
> +volatile sig_atomic_t SlotSyncShutdown = false;
>
> For the interrupt flag set in procsignal_sigusr1_handler(), other flags
> use a *Pending suffix (e.g., ProcSignalBarrierPending,
> ParallelApplyMessagePending), so SlotSyncShutdownPending would
> be more consistent.
>
>
> +void
> +HandleSlotSyncMessage(void)
>
> Functions called from ProcessInterrupts() typically use the Process* prefix
> (e.g., ProcessProcSignalBarrier(), ProcessParallelApplyMessages()),
> so ProcessSlotSyncMessage would be more consistent than HandleSlotSyncMessage.
>

Agree, fixed.

>
> + ereport(LOG,
> + errmsg("replication slot synchronization worker will stop because
> promotion is triggered"));
> +
> + proc_exit(0);
> + }
> + else
> + {
> + /*
> + * For the backend executing SQL function
> + * pg_sync_replication_slots().
> + */
> + ereport(ERROR,
> + errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("replication slot synchronization will stop because promotion
> is triggered"));
>
> The log messages say "will stop", but since sync hasn't started yet,
> "will not start" seems clearer here. For example, "replication slot
> synchronization worker will not start because promotion was triggered"
> and "replication slot synchronization will not start because promotion was
> triggered". Thought?
>

We were using the same log message in two places:
check_and_set_sync_info() and HandleSlotSyncMessage().
I think “will not start” fits better in the first case, while “will
stop” makes sense to keep in the second.

--
Thanks,
Nisha


Attachments:

  [application/octet-stream] v5-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch (10.2K, 2-v5-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch)
  download | inline diff:
From 3eb6f6c0600ec15e1d773656e689bca51219744a Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v5] Prevent slotsync worker/API hang during standby promotion

During standby promotion, ShutDownSlotSync() signals the slotsync worker
to stop and waits for it to finish. If the worker is blocked in
WaitLatchOrSocket() waiting for a response from the primary (e.g., due
to a network failure), the previous SIGUSR1 signal only sets the latch.
The worker wakes up, finds no pending interrupt, and goes back to
waiting, causing ShutDownSlotSync() to wait indefinitely and blocking
promotion.

Fix this by introducing a new procsignal reason PROCSIG_SLOTSYNC_MESSAGE.
The signal handler sets the appropriate interrupt flags so that
WaitLatchOrSocket() returns and the worker exits cleanly, allowing
promotion to proceed.
---
 src/backend/replication/logical/slotsync.c | 120 ++++++++++++++-------
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 ++
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 98 insertions(+), 38 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..a038ff31148 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -85,11 +85,11 @@
  * Struct for sharing information to control slot synchronization.
  *
  * The 'pid' is either the slot sync worker's pid or the backend's pid running
- * the SQL function pg_sync_replication_slots(). When the startup process sets
- * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
- * synchronizing process so that the process can immediately stop its
- * synchronizing work on seeing 'stopSignaled' set.
- * Setting 'stopSignaled' is also used to handle the race condition when the
+ * the SQL function pg_sync_replication_slots(). On promotion,
+ * the startup process sets 'stopSignaled' and uses this 'pid' to wake up
+ * the currently synchronizing process so that the process can
+ * immediately stop its synchronizing work.
+ * Setting 'stopSignaled' is used to handle the race condition when the
  * postmaster has not noticed the promotion yet and thus may end up restarting
  * the slot sync worker. If 'stopSignaled' is set, the worker will exit in such a
  * case. The SQL function pg_sync_replication_slots() will also error out if
@@ -141,6 +141,13 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdownPending = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1291,36 +1298,42 @@ slotsync_reread_config(void)
 }
 
 /*
- * Interrupt handler for process performing slot synchronization.
+ * Signal handler called (in signal context) when PROCSIG_SLOTSYNC_MESSAGE
+ * is received.  Sets the SlotSyncShutdownPending flag so that ProcessInterrupts()
+ * will dispatch to ProcessSlotSyncMessage() at the next safe point.
  */
-static void
-ProcessSlotSyncInterrupts(void)
+void
+HandleSlotSyncMessageInterrupt(void)
 {
-	CHECK_FOR_INTERRUPTS();
+	InterruptPending = true;
+	SlotSyncShutdownPending = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+/*
+ * Handle a PROCSIG_SLOTSYNC_MESSAGE signal, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error.
+ */
+void
+ProcessSlotSyncMessage(void)
+{
+	SlotSyncShutdownPending = false;
 
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
+	}
+	else
+	{
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
 	}
-
-	if (ConfigReloadPending)
-		slotsync_reread_config();
 }
 
 /*
@@ -1427,6 +1440,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a new API call) that starts after the old worker was
+	 * stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(LOG,
+					errmsg("replication slot synchronization worker will not start because promotion was triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will not start because promotion was triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1635,7 +1676,10 @@ ReplSlotSyncWorkerMain(const void *startup_data, size_t startup_data_len)
 		bool		started_tx = false;
 		List	   *remote_slots;
 
-		ProcessSlotSyncInterrupts();
+		CHECK_FOR_INTERRUPTS();
+
+		if (ConfigReloadPending)
+			slotsync_reread_config();
 
 		/*
 		 * The syscache access in fetch_remote_slots() needs a transaction
@@ -1748,11 +1792,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
@@ -1931,9 +1975,6 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 
 		check_and_set_sync_info(MyProcPid);
 
-		/* Check for interrupts and config changes */
-		ProcessSlotSyncInterrupts();
-
 		validate_remote_info(wrconn);
 
 		/* Retry until all the slots are sync-ready */
@@ -1943,7 +1984,10 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 			bool		some_slot_updated = false;
 
 			/* Check for interrupts and config changes */
-			ProcessSlotSyncInterrupts();
+			CHECK_FOR_INTERRUPTS();
+
+			if (ConfigReloadPending)
+				slotsync_reread_config();
 
 			/* We must be in a valid transaction state */
 			Assert(IsTransactionState());
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7e017c8d53b..99792b13760 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -24,6 +24,7 @@
 #include "port/pg_bitutils.h"
 #include "replication/logicalctl.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -700,6 +701,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncMessageInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT))
 		HandleRecoveryConflictInterrupt();
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b3563113219..b0943abb85e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3576,6 +3577,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdownPending)
+		ProcessSlotSyncMessage();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e546d0d050d..35835087128 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncMessageInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdownPending;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncMessageInterrupt(void);
+extern void ProcessSlotSyncMessage(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 348fba53a93..3a75d500e7c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slotsync worker/API to stop */
 	PROCSIG_RECOVERY_CONFLICT,	/* backend is blocking recovery, check
 								 * PGPROC->pendingRecoveryConflicts for the
 								 * reason */
-- 
2.50.1 (Apple Git-155)



^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-30 10:22  shveta malik <[email protected]>
  parent: Nisha Moond <[email protected]>
  1 sibling, 1 reply; 42+ messages in thread

From: shveta malik @ 2026-03-30 10:22 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: Fujii Masao <[email protected]>; Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>; shveta malik <[email protected]>

On Mon, Mar 30, 2026 at 9:48 AM Nisha Moond <[email protected]> wrote:
>

Thanks for the patch Nisha. Few trivial things:

1)
+ * Signal handler called (in signal context) when PROCSIG_SLOTSYNC_MESSAGE
+ * is received.  Sets the SlotSyncShutdownPending flag so that
ProcessInterrupts()
+ * will dispatch to ProcessSlotSyncMessage() at the next safe point.
  */
+void
+HandleSlotSyncMessageInterrupt(void)

Can we please change the comment to below.
Below suggestion is based on how we have written comments atop other
such Handle*() functions

/*
 * Handle receipt of an interrupt indicating a slotsync shutdown message
 *
 * This is called within SIGUSR1 handler. All we do here is set a flag
 * that will cause the next CHECK_FOR_INTERRUPTS() to invoke
 * ProcessSlotSyncMessage().
 */

2)
I tried to consider whether 'stopSignaled' alone would be sufficient
for our purposes, and whether we really need
'SlotSyncShutdownPending'. It seems that relying solely on
stopSignaled could be problematic:

a) stopSignaled resides in shared memory, so once it is set, it
becomes visible to all other processes. If another process executes
ProcessInterrupts(), it might incorrectly start handling slot sync
shutdown. While this could be made to work, it would require extra
checks to ensure that only the actual synchronizing process reacts.
b) Unlike SlotSyncShutdownPending, we cannot reset stopSignaled, since
it is also used to handle race conditions between the postmaster and
promotion by the startup process.
c) Accessing stopSignaled requires acquiring a mutex. It is unclear if
that is a good idea in ProcessInterrupts(), since every time a process
calls CHECK_FOR_INTERRUPTS(), it would need to acquire the mutex to
decide whether to execute ProcessSlotSyncMessage().

Given these considerations, I think it makes sense to retain both
stopSignaled and SlotSyncShutdownPending, but we should add a comment
above SlotSyncShutdownPending explaining why stopSignaled alone is not
sufficient.

Let me know if others have different opinions here.

thanks
Shveta





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-30 11:09  Fujii Masao <[email protected]>
  parent: Nisha Moond <[email protected]>
  1 sibling, 1 reply; 42+ messages in thread

From: Fujii Masao @ 2026-03-30 11:09 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: Amit Kapila <[email protected]>; shveta malik <[email protected]>; PostgreSQL Hackers <[email protected]>

On Mon, Mar 30, 2026 at 1:18 PM Nisha Moond <[email protected]> wrote:
> We were using the same log message in two places:
> check_and_set_sync_info() and HandleSlotSyncMessage().
> I think “will not start” fits better in the first case, while “will
> stop” makes sense to keep in the second.

Thanks for updating the patch!

With the patch, in my testing, standby promotion always produces
the following logs:

    LOG:  replication slot synchronization worker will stop because
promotion is triggered
    LOG:  replication slot synchronization worker will not start
because promotion was triggered

It looks like the postmaster immediately restarts the slotsync worker after
promotion terminates it, and that new worker then exits on seeing
SlotSyncCtx->stopSignaled.

IMO, always emitting both messages is a bit confusing. It would be nice to
suppress the second one if possible.

One idea would be to prevent the restart altogether. For example,
ProcessSlotSyncMessage() could set SlotSyncCtx->last_start_time to
a special value (like -1), and SlotSyncWorkerCanRestart() could return
false (i.e., prevent postmater from starting up slotsync worker) when
it sees that. Alternatively, SlotSyncWorkerCanRestart() could simply
check SlotSyncCtx->stopSignaled.

That said, as far as I remember correctly, postmaster is generally not
supposed to touch shared memory (per the comments in postmaster.c),
so I'm not sure this approach is acceptable. On the other hand,
postmaster and the slotsync worker already rely on SlotSyncCtx->last_start_time,
so perhaps there's some precedent here.

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-31 06:01  Nisha Moond <[email protected]>
  parent: shveta malik <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Nisha Moond @ 2026-03-31 06:01 UTC (permalink / raw)
  To: shveta malik <[email protected]>; +Cc: Fujii Masao <[email protected]>; Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Mon, Mar 30, 2026 at 3:52 PM shveta malik <[email protected]> wrote:
>
> On Mon, Mar 30, 2026 at 9:48 AM Nisha Moond <[email protected]> wrote:
> >
>
> Thanks for the patch Nisha. Few trivial things:
>
> 1)
> + * Signal handler called (in signal context) when PROCSIG_SLOTSYNC_MESSAGE
> + * is received.  Sets the SlotSyncShutdownPending flag so that
> ProcessInterrupts()
> + * will dispatch to ProcessSlotSyncMessage() at the next safe point.
>   */
> +void
> +HandleSlotSyncMessageInterrupt(void)
>
> Can we please change the comment to below.
> Below suggestion is based on how we have written comments atop other
> such Handle*() functions
>
> /*
>  * Handle receipt of an interrupt indicating a slotsync shutdown message
>  *
>  * This is called within SIGUSR1 handler. All we do here is set a flag
>  * that will cause the next CHECK_FOR_INTERRUPTS() to invoke
>  * ProcessSlotSyncMessage().
>  */
>

Fixed.

> 2)
> I tried to consider whether 'stopSignaled' alone would be sufficient
> for our purposes, and whether we really need
> 'SlotSyncShutdownPending'. It seems that relying solely on
> stopSignaled could be problematic:
>
> a) stopSignaled resides in shared memory, so once it is set, it
> becomes visible to all other processes. If another process executes
> ProcessInterrupts(), it might incorrectly start handling slot sync
> shutdown. While this could be made to work, it would require extra
> checks to ensure that only the actual synchronizing process reacts.
> b) Unlike SlotSyncShutdownPending, we cannot reset stopSignaled, since
> it is also used to handle race conditions between the postmaster and
> promotion by the startup process.
> c) Accessing stopSignaled requires acquiring a mutex. It is unclear if
> that is a good idea in ProcessInterrupts(), since every time a process
> calls CHECK_FOR_INTERRUPTS(), it would need to acquire the mutex to
> decide whether to execute ProcessSlotSyncMessage().
>
> Given these considerations, I think it makes sense to retain both
> stopSignaled and SlotSyncShutdownPending, but we should add a comment
> above SlotSyncShutdownPending explaining why stopSignaled alone is not
> sufficient.
>

Done.

Please find the updated patch (v6) attached.

--
Thanks,
Nisha


Attachments:

  [application/octet-stream] v6-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch (10.8K, 2-v6-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch)
  download | inline diff:
From d1de8a6797fd33f95d591964172c95a84beeeee9 Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v6] Prevent slotsync worker/API hang during standby promotion

During standby promotion, ShutDownSlotSync() signals the slotsync worker
to stop and waits for it to finish. If the worker is blocked in
WaitLatchOrSocket() waiting for a response from the primary (e.g., due
to a network failure), the previous SIGUSR1 signal only sets the latch.
The worker wakes up, finds no pending interrupt, and goes back to
waiting, causing ShutDownSlotSync() to wait indefinitely and blocking
promotion.

Fix this by introducing a new procsignal reason PROCSIG_SLOTSYNC_MESSAGE.
The signal handler sets the appropriate interrupt flags so that
WaitLatchOrSocket() returns and the worker exits cleanly, allowing
promotion to proceed.
---
 src/backend/replication/logical/slotsync.c | 131 +++++++++++++++------
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 ++
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 109 insertions(+), 38 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..7f6e5ef14ec 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -85,11 +85,11 @@
  * Struct for sharing information to control slot synchronization.
  *
  * The 'pid' is either the slot sync worker's pid or the backend's pid running
- * the SQL function pg_sync_replication_slots(). When the startup process sets
- * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
- * synchronizing process so that the process can immediately stop its
- * synchronizing work on seeing 'stopSignaled' set.
- * Setting 'stopSignaled' is also used to handle the race condition when the
+ * the SQL function pg_sync_replication_slots(). On promotion,
+ * the startup process sets 'stopSignaled' and uses this 'pid' to wake up
+ * the currently synchronizing process so that the process can
+ * immediately stop its synchronizing work.
+ * Setting 'stopSignaled' is used to handle the race condition when the
  * postmaster has not noticed the promotion yet and thus may end up restarting
  * the slot sync worker. If 'stopSignaled' is set, the worker will exit in such a
  * case. The SQL function pg_sync_replication_slots() will also error out if
@@ -141,6 +141,22 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ *
+ * We cannot rely solely on 'stopSignaled' here because:
+ * 1) It resides in shared memory and is visible to all processes, so checking
+ *    it directly in ProcessInterrupts() would require additional checks to
+ *    ensure only the synchronizing process acts on it.
+ * 2) It has different lifetime semantics and cannot be reset after handling,
+ *    as it also guards against postmaster and promotion race conditions.
+ * 3) Accessing it requires acquiring a spinlock, which can be too expensive
+ *    or undesirable for every ProcessInterrupts() call.
+ */
+volatile sig_atomic_t SlotSyncShutdownPending = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1291,36 +1307,44 @@ slotsync_reread_config(void)
 }
 
 /*
- * Interrupt handler for process performing slot synchronization.
+ * Handle receipt of an interrupt indicating a slotsync shutdown message.
+ *
+ * This is called within the SIGUSR1 handler.  All we do here is set a flag
+ * that will cause the next CHECK_FOR_INTERRUPTS() to invoke
+ * ProcessSlotSyncMessage().
  */
-static void
-ProcessSlotSyncInterrupts(void)
+void
+HandleSlotSyncMessageInterrupt(void)
 {
-	CHECK_FOR_INTERRUPTS();
+	InterruptPending = true;
+	SlotSyncShutdownPending = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+/*
+ * Handle a PROCSIG_SLOTSYNC_MESSAGE signal, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error.
+ */
+void
+ProcessSlotSyncMessage(void)
+{
+	SlotSyncShutdownPending = false;
 
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
+	}
+	else
+	{
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
 	}
-
-	if (ConfigReloadPending)
-		slotsync_reread_config();
 }
 
 /*
@@ -1427,6 +1451,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a new API call) that starts after the old worker was
+	 * stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(LOG,
+					errmsg("replication slot synchronization worker will not start because promotion was triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will not start because promotion was triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1635,7 +1687,10 @@ ReplSlotSyncWorkerMain(const void *startup_data, size_t startup_data_len)
 		bool		started_tx = false;
 		List	   *remote_slots;
 
-		ProcessSlotSyncInterrupts();
+		CHECK_FOR_INTERRUPTS();
+
+		if (ConfigReloadPending)
+			slotsync_reread_config();
 
 		/*
 		 * The syscache access in fetch_remote_slots() needs a transaction
@@ -1748,11 +1803,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
@@ -1931,9 +1986,6 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 
 		check_and_set_sync_info(MyProcPid);
 
-		/* Check for interrupts and config changes */
-		ProcessSlotSyncInterrupts();
-
 		validate_remote_info(wrconn);
 
 		/* Retry until all the slots are sync-ready */
@@ -1943,7 +1995,10 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 			bool		some_slot_updated = false;
 
 			/* Check for interrupts and config changes */
-			ProcessSlotSyncInterrupts();
+			CHECK_FOR_INTERRUPTS();
+
+			if (ConfigReloadPending)
+				slotsync_reread_config();
 
 			/* We must be in a valid transaction state */
 			Assert(IsTransactionState());
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7e017c8d53b..99792b13760 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -24,6 +24,7 @@
 #include "port/pg_bitutils.h"
 #include "replication/logicalctl.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -700,6 +701,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncMessageInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT))
 		HandleRecoveryConflictInterrupt();
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b3563113219..b0943abb85e 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3576,6 +3577,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdownPending)
+		ProcessSlotSyncMessage();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e546d0d050d..35835087128 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncMessageInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdownPending;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncMessageInterrupt(void);
+extern void ProcessSlotSyncMessage(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 348fba53a93..3a75d500e7c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slotsync worker/API to stop */
 	PROCSIG_RECOVERY_CONFLICT,	/* backend is blocking recovery, check
 								 * PGPROC->pendingRecoveryConflicts for the
 								 * reason */
-- 
2.50.1 (Apple Git-155)



^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-31 06:04  Nisha Moond <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Nisha Moond @ 2026-03-31 06:04 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Amit Kapila <[email protected]>; shveta malik <[email protected]>; PostgreSQL Hackers <[email protected]>

On Mon, Mar 30, 2026 at 4:39 PM Fujii Masao <[email protected]> wrote:
>
> On Mon, Mar 30, 2026 at 1:18 PM Nisha Moond <[email protected]> wrote:
> > We were using the same log message in two places:
> > check_and_set_sync_info() and HandleSlotSyncMessage().
> > I think “will not start” fits better in the first case, while “will
> > stop” makes sense to keep in the second.
>
> Thanks for updating the patch!
>
> With the patch, in my testing, standby promotion always produces
> the following logs:
>
>     LOG:  replication slot synchronization worker will stop because
> promotion is triggered
>     LOG:  replication slot synchronization worker will not start
> because promotion was triggered
>
> It looks like the postmaster immediately restarts the slotsync worker after
> promotion terminates it, and that new worker then exits on seeing
> SlotSyncCtx->stopSignaled.
>
> IMO, always emitting both messages is a bit confusing. It would be nice to
> suppress the second one if possible.
>
> One idea would be to prevent the restart altogether. For example,
> ProcessSlotSyncMessage() could set SlotSyncCtx->last_start_time to
> a special value (like -1), and SlotSyncWorkerCanRestart() could return
> false (i.e., prevent postmater from starting up slotsync worker) when
> it sees that. Alternatively, SlotSyncWorkerCanRestart() could simply
> check SlotSyncCtx->stopSignaled.
>
> That said, as far as I remember correctly, postmaster is generally not
> supposed to touch shared memory (per the comments in postmaster.c),
> so I'm not sure this approach is acceptable. On the other hand,
> postmaster and the slotsync worker already rely on SlotSyncCtx->last_start_time,
> so perhaps there's some precedent here.
>
IIUC, checking SlotSyncCtx->stopSignaled in SlotSyncWorkerCanRestart()
may not be ideal, as it requires a spinlock to avoid races with the
startup process and it is disallowed to take lock in postmaster main
loop. Whereas, SlotSyncCtx->last_start_time doesn’t need a lock since
the postmaster accesses it only when the worker is not alive.

Another option could be to log in check_and_set_sync_info() at DEBUG1
instead of LOG level. This message appears only after stopSignaled is
set, when promotion is already in progress and the first worker has
logged “will stop…”. The second worker doesn’t do any real work. Since
there’s nothing actionable for users, using DEBUG1 would keep it
useful for debugging (e.g., noticing immediate restarts) while
avoiding extra log noise. Thoughts?

--
Thanks,
Nisha





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* RE: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-31 10:26  Zhijie Hou (Fujitsu) <[email protected]>
  parent: Nisha Moond <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Zhijie Hou (Fujitsu) @ 2026-03-31 10:26 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; shveta malik <[email protected]>; +Cc: Fujii Masao <[email protected]>; Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tuesday, March 31, 2026 2:02 PM Nisha Moond <[email protected]> wrote:
> 
> 
> Please find the updated patch (v6) attached.

Thanks for updating the patch. One minor comment:

I think we could avoid interrupting and reporting an ERROR when
IsSyncingReplicationSlots() returns false to avoid reporting ERROR unnecessarily
when the slotsync has already finished.

Best Regards,
Hou zj


^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-31 10:42  shveta malik <[email protected]>
  parent: Nisha Moond <[email protected]>
  0 siblings, 2 replies; 42+ messages in thread

From: shveta malik @ 2026-03-31 10:42 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: Fujii Masao <[email protected]>; Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>; shveta malik <[email protected]>

On Tue, Mar 31, 2026 at 11:35 AM Nisha Moond <[email protected]> wrote:
>
> On Mon, Mar 30, 2026 at 4:39 PM Fujii Masao <[email protected]> wrote:
> >
> > On Mon, Mar 30, 2026 at 1:18 PM Nisha Moond <[email protected]> wrote:
> > > We were using the same log message in two places:
> > > check_and_set_sync_info() and HandleSlotSyncMessage().
> > > I think “will not start” fits better in the first case, while “will
> > > stop” makes sense to keep in the second.
> >
> > Thanks for updating the patch!
> >
> > With the patch, in my testing, standby promotion always produces
> > the following logs:
> >
> >     LOG:  replication slot synchronization worker will stop because
> > promotion is triggered
> >     LOG:  replication slot synchronization worker will not start
> > because promotion was triggered
> >
> > It looks like the postmaster immediately restarts the slotsync worker after
> > promotion terminates it, and that new worker then exits on seeing
> > SlotSyncCtx->stopSignaled.
> >
> > IMO, always emitting both messages is a bit confusing. It would be nice to
> > suppress the second one if possible.
> >
> > One idea would be to prevent the restart altogether. For example,
> > ProcessSlotSyncMessage() could set SlotSyncCtx->last_start_time to
> > a special value (like -1), and SlotSyncWorkerCanRestart() could return
> > false (i.e., prevent postmater from starting up slotsync worker) when
> > it sees that. Alternatively, SlotSyncWorkerCanRestart() could simply
> > check SlotSyncCtx->stopSignaled.
> >
> > That said, as far as I remember correctly, postmaster is generally not
> > supposed to touch shared memory (per the comments in postmaster.c),
> > so I'm not sure this approach is acceptable. On the other hand,
> > postmaster and the slotsync worker already rely on SlotSyncCtx->last_start_time,
> > so perhaps there's some precedent here.
> >
> IIUC, checking SlotSyncCtx->stopSignaled in SlotSyncWorkerCanRestart()
> may not be ideal, as it requires a spinlock to avoid races with the
> startup process and it is disallowed to take lock in postmaster main
> loop. Whereas, SlotSyncCtx->last_start_time doesn’t need a lock since
> the postmaster accesses it only when the worker is not alive.
>

I agree.

> Another option could be to log in check_and_set_sync_info() at DEBUG1
> instead of LOG level. This message appears only after stopSignaled is
> set, when promotion is already in progress and the first worker has
> logged “will stop…”. The second worker doesn’t do any real work. Since
> there’s nothing actionable for users, using DEBUG1 would keep it
> useful for debugging (e.g., noticing immediate restarts) while
> avoiding extra log noise. Thoughts?
>

+1.

Do you think we can slightly tweak the comment in atop file to:

On promotion the startup process sets 'stopSignaled' and uses this
'pid' to signal synchronizing process with PROCSIG_SLOTSYNC_MESSAGE
and also to wake it up so that the process can immediately stop its
synchronizing work. Setting 'stopSignaled' on the other hand is  used
to handle the race condition....

Also shall we quick exit ProcessSlotSyncMessage() if syncing is
already finished by API?

thanks
Shveta





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-03-31 15:33  Fujii Masao <[email protected]>
  parent: shveta malik <[email protected]>
  1 sibling, 1 reply; 42+ messages in thread

From: Fujii Masao @ 2026-03-31 15:33 UTC (permalink / raw)
  To: shveta malik <[email protected]>; +Cc: Nisha Moond <[email protected]>; Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, Mar 31, 2026 at 7:42 PM shveta malik <[email protected]> wrote:
> > > One idea would be to prevent the restart altogether. For example,
> > > ProcessSlotSyncMessage() could set SlotSyncCtx->last_start_time to
> > > a special value (like -1), and SlotSyncWorkerCanRestart() could return
> > > false (i.e., prevent postmater from starting up slotsync worker) when
> > > it sees that. Alternatively, SlotSyncWorkerCanRestart() could simply
> > > check SlotSyncCtx->stopSignaled.
> > >
> > > That said, as far as I remember correctly, postmaster is generally not
> > > supposed to touch shared memory (per the comments in postmaster.c),
> > > so I'm not sure this approach is acceptable. On the other hand,
> > > postmaster and the slotsync worker already rely on SlotSyncCtx->last_start_time,
> > > so perhaps there's some precedent here.
> > >
> > IIUC, checking SlotSyncCtx->stopSignaled in SlotSyncWorkerCanRestart()
> > may not be ideal, as it requires a spinlock to avoid races with the
> > startup process and it is disallowed to take lock in postmaster main
> > loop. Whereas, SlotSyncCtx->last_start_time doesn’t need a lock since
> > the postmaster accesses it only when the worker is not alive.
> >
>
> I agree.

Could you clarify what issue might arise from checking
SlotSyncCtx->stopSignaled without holding a spinlock in
SlotSyncWorkerCanRestart()? Is it actually problematic?

That said, since the postmaster should generally avoid
touching shared memory, it doesn't seem like a good idea
for it to check SlotSyncCtx->stopSignaled. So I'm fine with
instead lowering the log level for the "worker will not start"
message to DEBUG1.

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-01 05:03  Nisha Moond <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 0 replies; 42+ messages in thread

From: Nisha Moond @ 2026-04-01 05:03 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: shveta malik <[email protected]>; Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, Mar 31, 2026 at 9:03 PM Fujii Masao <[email protected]> wrote:
>
> On Tue, Mar 31, 2026 at 7:42 PM shveta malik <[email protected]> wrote:
> > > > One idea would be to prevent the restart altogether. For example,
> > > > ProcessSlotSyncMessage() could set SlotSyncCtx->last_start_time to
> > > > a special value (like -1), and SlotSyncWorkerCanRestart() could return
> > > > false (i.e., prevent postmater from starting up slotsync worker) when
> > > > it sees that. Alternatively, SlotSyncWorkerCanRestart() could simply
> > > > check SlotSyncCtx->stopSignaled.
> > > >
> > > > That said, as far as I remember correctly, postmaster is generally not
> > > > supposed to touch shared memory (per the comments in postmaster.c),
> > > > so I'm not sure this approach is acceptable. On the other hand,
> > > > postmaster and the slotsync worker already rely on SlotSyncCtx->last_start_time,
> > > > so perhaps there's some precedent here.
> > > >
> > > IIUC, checking SlotSyncCtx->stopSignaled in SlotSyncWorkerCanRestart()
> > > may not be ideal, as it requires a spinlock to avoid races with the
> > > startup process and it is disallowed to take lock in postmaster main
> > > loop. Whereas, SlotSyncCtx->last_start_time doesn’t need a lock since
> > > the postmaster accesses it only when the worker is not alive.
> > >
> >
> > I agree.
>
> Could you clarify what issue might arise from checking
> SlotSyncCtx->stopSignaled without holding a spinlock in
> SlotSyncWorkerCanRestart()? Is it actually problematic?
>

We might not see issues in practice since stopSignaled changes only
once (false -> true), so value corruption is unlikely.
But, without a lock or memory barrier, correct value-read is not
guaranteed, e.g., on weakly ordered systems (like ARM64) the
postmaster may still see a stale value. This means the worker could be
restarted again, and the same unwanted log may still appear.

> That said, since the postmaster should generally avoid
> touching shared memory, it doesn't seem like a good idea
> for it to check SlotSyncCtx->stopSignaled. So I'm fine with
> instead lowering the log level for the "worker will not start"
> message to DEBUG1.
>

Okay, thanks. I'll share the updated patch soon.

--
Thanks,
Nisha





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-01 05:33  Nisha Moond <[email protected]>
  parent: shveta malik <[email protected]>
  1 sibling, 0 replies; 42+ messages in thread

From: Nisha Moond @ 2026-04-01 05:33 UTC (permalink / raw)
  To: shveta malik <[email protected]>; +Cc: Fujii Masao <[email protected]>; Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, Mar 31, 2026 at 4:12 PM shveta malik <[email protected]> wrote:
>
> On Tue, Mar 31, 2026 at 11:35 AM Nisha Moond <[email protected]> wrote:
> >
> > On Mon, Mar 30, 2026 at 4:39 PM Fujii Masao <[email protected]> wrote:
> > >
> > > On Mon, Mar 30, 2026 at 1:18 PM Nisha Moond <[email protected]> wrote:
> > > > We were using the same log message in two places:
> > > > check_and_set_sync_info() and HandleSlotSyncMessage().
> > > > I think “will not start” fits better in the first case, while “will
> > > > stop” makes sense to keep in the second.
> > >
> > > Thanks for updating the patch!
> > >
> > > With the patch, in my testing, standby promotion always produces
> > > the following logs:
> > >
> > >     LOG:  replication slot synchronization worker will stop because
> > > promotion is triggered
> > >     LOG:  replication slot synchronization worker will not start
> > > because promotion was triggered
> > >
> > > It looks like the postmaster immediately restarts the slotsync worker after
> > > promotion terminates it, and that new worker then exits on seeing
> > > SlotSyncCtx->stopSignaled.
> > >
> > > IMO, always emitting both messages is a bit confusing. It would be nice to
> > > suppress the second one if possible.
> > >
> > > One idea would be to prevent the restart altogether. For example,
> > > ProcessSlotSyncMessage() could set SlotSyncCtx->last_start_time to
> > > a special value (like -1), and SlotSyncWorkerCanRestart() could return
> > > false (i.e., prevent postmater from starting up slotsync worker) when
> > > it sees that. Alternatively, SlotSyncWorkerCanRestart() could simply
> > > check SlotSyncCtx->stopSignaled.
> > >
> > > That said, as far as I remember correctly, postmaster is generally not
> > > supposed to touch shared memory (per the comments in postmaster.c),
> > > so I'm not sure this approach is acceptable. On the other hand,
> > > postmaster and the slotsync worker already rely on SlotSyncCtx->last_start_time,
> > > so perhaps there's some precedent here.
> > >
> > IIUC, checking SlotSyncCtx->stopSignaled in SlotSyncWorkerCanRestart()
> > may not be ideal, as it requires a spinlock to avoid races with the
> > startup process and it is disallowed to take lock in postmaster main
> > loop. Whereas, SlotSyncCtx->last_start_time doesn’t need a lock since
> > the postmaster accesses it only when the worker is not alive.
> >
>
> I agree.
>
> > Another option could be to log in check_and_set_sync_info() at DEBUG1
> > instead of LOG level. This message appears only after stopSignaled is
> > set, when promotion is already in progress and the first worker has
> > logged “will stop…”. The second worker doesn’t do any real work. Since
> > there’s nothing actionable for users, using DEBUG1 would keep it
> > useful for debugging (e.g., noticing immediate restarts) while
> > avoiding extra log noise. Thoughts?
> >
>
> +1.
>
> Do you think we can slightly tweak the comment in atop file to:
>
> On promotion the startup process sets 'stopSignaled' and uses this
> 'pid' to signal synchronizing process with PROCSIG_SLOTSYNC_MESSAGE
> and also to wake it up so that the process can immediately stop its
> synchronizing work. Setting 'stopSignaled' on the other hand is  used
> to handle the race condition....
>

Done.

> Also shall we quick exit ProcessSlotSyncMessage() if syncing is
> already finished by API?
>

Make sense. Fixed.

Please find the updated patch (v7) attached.

--
Thanks,
Nisha


Attachments:

  [application/octet-stream] v7-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch (12.0K, 2-v7-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch)
  download | inline diff:
From 49892f6a306934f2148f54c3bcf1ecfd722303ae Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v7] Prevent slotsync worker/API hang during standby promotion

During standby promotion, ShutDownSlotSync() signals the slotsync worker
to stop and waits for it to finish. If the worker is blocked in
WaitLatchOrSocket() waiting for a response from the primary (e.g., due
to a network failure), the previous SIGUSR1 signal only sets the latch.
The worker wakes up, finds no pending interrupt, and goes back to
waiting, causing ShutDownSlotSync() to wait indefinitely and blocking
promotion.

Fix this by introducing a new procsignal reason PROCSIG_SLOTSYNC_MESSAGE.
The signal handler sets the appropriate interrupt flags so that
WaitLatchOrSocket() returns and the worker exits cleanly, allowing
promotion to proceed.
---
 src/backend/replication/logical/slotsync.c | 152 +++++++++++++++------
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 +
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 124 insertions(+), 44 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..c070c869e5d 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -85,18 +85,19 @@
  * Struct for sharing information to control slot synchronization.
  *
  * The 'pid' is either the slot sync worker's pid or the backend's pid running
- * the SQL function pg_sync_replication_slots(). When the startup process sets
- * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
- * synchronizing process so that the process can immediately stop its
- * synchronizing work on seeing 'stopSignaled' set.
- * Setting 'stopSignaled' is also used to handle the race condition when the
- * postmaster has not noticed the promotion yet and thus may end up restarting
- * the slot sync worker. If 'stopSignaled' is set, the worker will exit in such a
- * case. The SQL function pg_sync_replication_slots() will also error out if
- * this flag is set. Note that we don't need to reset this variable as after
- * promotion the slot sync worker won't be restarted because the pmState
- * changes to PM_RUN from PM_HOT_STANDBY and we don't support demoting
- * primary without restarting the server. See LaunchMissingBackgroundProcesses.
+ * the SQL function pg_sync_replication_slots(). On promotion, the startup
+ * process sets 'stopSignaled' and uses this 'pid' to signal the synchronizing
+ * process with PROCSIG_SLOTSYNC_MESSAGE and also to wake it up so that the
+ * process can immediately stop its synchronizing work.
+ * Setting 'stopSignaled' on the other hand is used to handle the race
+ * condition when the postmaster has not noticed the promotion yet and thus may
+ * end up restarting the slot sync worker. If 'stopSignaled' is set, the worker
+ * will exit in such a case. The SQL function pg_sync_replication_slots() will
+ * also error out if this flag is set. Note that we don't need to reset this
+ * variable as after promotion the slot sync worker won't be restarted because
+ * the pmState changes to PM_RUN from PM_HOT_STANDBY and we don't support
+ * demoting primary without restarting the server.
+ * See LaunchMissingBackgroundProcesses.
  *
  * The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot
  * overwrites.
@@ -141,6 +142,22 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ *
+ * We cannot rely solely on 'stopSignaled' here because:
+ * 1) It resides in shared memory and is visible to all processes, so checking
+ *    it directly in ProcessInterrupts() would require additional checks to
+ *    ensure only the synchronizing process acts on it.
+ * 2) It has different lifetime semantics and cannot be reset after handling,
+ *    as it also guards against postmaster and promotion race conditions.
+ * 3) Accessing it requires acquiring a spinlock, which can be too expensive
+ *    or undesirable for every ProcessInterrupts() call.
+ */
+volatile sig_atomic_t SlotSyncShutdownPending = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1291,36 +1308,52 @@ slotsync_reread_config(void)
 }
 
 /*
- * Interrupt handler for process performing slot synchronization.
+ * Handle receipt of an interrupt indicating a slotsync shutdown message.
+ *
+ * This is called within the SIGUSR1 handler.  All we do here is set a flag
+ * that will cause the next CHECK_FOR_INTERRUPTS() to invoke
+ * ProcessSlotSyncMessage().
  */
-static void
-ProcessSlotSyncInterrupts(void)
+void
+HandleSlotSyncMessageInterrupt(void)
 {
-	CHECK_FOR_INTERRUPTS();
+	InterruptPending = true;
+	SlotSyncShutdownPending = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+/*
+ * Handle a PROCSIG_SLOTSYNC_MESSAGE signal, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error, unless the sync has already finished, in which case there
+ * is no need to interrupt the caller.
+ */
+void
+ProcessSlotSyncMessage(void)
+{
+	SlotSyncShutdownPending = false;
 
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
 	}
+	else
+	{
+		/*
+		 * If sync has already completed, there is no need to interrupt the
+		 * caller with an error.
+		 */
+		if (!IsSyncingReplicationSlots())
+			return;
 
-	if (ConfigReloadPending)
-		slotsync_reread_config();
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
 }
 
 /*
@@ -1427,6 +1460,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a new API call) that starts after the old worker was
+	 * stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(DEBUG1,
+					errmsg("replication slot synchronization worker will not start because promotion was triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will not start because promotion was triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1635,7 +1696,10 @@ ReplSlotSyncWorkerMain(const void *startup_data, size_t startup_data_len)
 		bool		started_tx = false;
 		List	   *remote_slots;
 
-		ProcessSlotSyncInterrupts();
+		CHECK_FOR_INTERRUPTS();
+
+		if (ConfigReloadPending)
+			slotsync_reread_config();
 
 		/*
 		 * The syscache access in fetch_remote_slots() needs a transaction
@@ -1748,11 +1812,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
@@ -1931,9 +1995,6 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 
 		check_and_set_sync_info(MyProcPid);
 
-		/* Check for interrupts and config changes */
-		ProcessSlotSyncInterrupts();
-
 		validate_remote_info(wrconn);
 
 		/* Retry until all the slots are sync-ready */
@@ -1943,7 +2004,10 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 			bool		some_slot_updated = false;
 
 			/* Check for interrupts and config changes */
-			ProcessSlotSyncInterrupts();
+			CHECK_FOR_INTERRUPTS();
+
+			if (ConfigReloadPending)
+				slotsync_reread_config();
 
 			/* We must be in a valid transaction state */
 			Assert(IsTransactionState());
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7e017c8d53b..99792b13760 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -24,6 +24,7 @@
 #include "port/pg_bitutils.h"
 #include "replication/logicalctl.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -700,6 +701,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncMessageInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT))
 		HandleRecoveryConflictInterrupt();
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 10be60011ad..b3cf53b509f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3576,6 +3577,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdownPending)
+		ProcessSlotSyncMessage();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e546d0d050d..35835087128 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncMessageInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdownPending;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncMessageInterrupt(void);
+extern void ProcessSlotSyncMessage(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 348fba53a93..3a75d500e7c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slotsync worker/API to stop */
 	PROCSIG_RECOVERY_CONFLICT,	/* backend is blocking recovery, check
 								 * PGPROC->pendingRecoveryConflicts for the
 								 * reason */
-- 
2.50.1 (Apple Git-155)



^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-01 05:33  Nisha Moond <[email protected]>
  parent: Zhijie Hou (Fujitsu) <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Nisha Moond @ 2026-04-01 05:33 UTC (permalink / raw)
  To: Zhijie Hou (Fujitsu) <[email protected]>; +Cc: shveta malik <[email protected]>; Fujii Masao <[email protected]>; Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, Mar 31, 2026 at 3:56 PM Zhijie Hou (Fujitsu)
<[email protected]> wrote:
>
> On Tuesday, March 31, 2026 2:02 PM Nisha Moond <[email protected]> wrote:
> >
> >
> > Please find the updated patch (v6) attached.
>
> Thanks for updating the patch. One minor comment:
>
> I think we could avoid interrupting and reporting an ERROR when
> IsSyncingReplicationSlots() returns false to avoid reporting ERROR unnecessarily
> when the slotsync has already finished.
>

Thanks for the review. Fixed above in v7.

--
Thanks,
Nisha





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-01 09:48  Fujii Masao <[email protected]>
  parent: Nisha Moond <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Fujii Masao @ 2026-04-01 09:48 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: Zhijie Hou (Fujitsu) <[email protected]>; shveta malik <[email protected]>; Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Wed, Apr 1, 2026 at 2:34 PM Nisha Moond <[email protected]> wrote:
>
> On Tue, Mar 31, 2026 at 3:56 PM Zhijie Hou (Fujitsu)
> <[email protected]> wrote:
> >
> > On Tuesday, March 31, 2026 2:02 PM Nisha Moond <[email protected]> wrote:
> > >
> > >
> > > Please find the updated patch (v6) attached.
> >
> > Thanks for updating the patch. One minor comment:
> >
> > I think we could avoid interrupting and reporting an ERROR when
> > IsSyncingReplicationSlots() returns false to avoid reporting ERROR unnecessarily
> > when the slotsync has already finished.
> >
>
> Thanks for the review. Fixed above in v7.

Thanks for updating the patch! It looks good to me, with just a few minor points
. If those are addressed, I'd like to push it.

+ * a new worker (or a new API call) that starts after the old worker was

"API" feels a bit vague. It might be clearer to explicitly say
"pg_sync_replication_slots()".

+ PROCSIG_SLOTSYNC_MESSAGE, /* ask slotsync worker/API to stop */

"API" here also feels a bit vague. So I'd like to use "ask slot synchronization
to stop" as the comment, instead.

+ * We cannot rely solely on 'stopSignaled' here because:
+ * 1) It resides in shared memory and is visible to all processes, so checking
+ *    it directly in ProcessInterrupts() would require additional checks to
+ *    ensure only the synchronizing process acts on it.
+ * 2) It has different lifetime semantics and cannot be reset after handling,
+ *    as it also guards against postmaster and promotion race conditions.
+ * 3) Accessing it requires acquiring a spinlock, which can be too expensive
+ *    or undesirable for every ProcessInterrupts() call.

Now that PROCSIG_SLOTSYNC_MESSAGE is in place, using SlotSyncShutdownPending
is intuitive. So it seems more useful to explain why stopSignaled is still
needed rather than why SlotSyncShutdownPending is used (i.e., why stopSignaled
is not sufficient). Since that rationale is already covered in the SlotSyncCtx
comments, I'd suggest removing this comment block.


As for backpatching, this looks like it should go back to v17, where slotsync
was introduced. Thought?

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-01 11:11  Nisha Moond <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Nisha Moond @ 2026-04-01 11:11 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Zhijie Hou (Fujitsu) <[email protected]>; shveta malik <[email protected]>; Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Wed, Apr 1, 2026 at 3:19 PM Fujii Masao <[email protected]> wrote:
>
> On Wed, Apr 1, 2026 at 2:34 PM Nisha Moond <[email protected]> wrote:
> >
> > On Tue, Mar 31, 2026 at 3:56 PM Zhijie Hou (Fujitsu)
> > <[email protected]> wrote:
> > >
> > > On Tuesday, March 31, 2026 2:02 PM Nisha Moond <[email protected]> wrote:
> > > >
> > > >
> > > > Please find the updated patch (v6) attached.
> > >
> > > Thanks for updating the patch. One minor comment:
> > >
> > > I think we could avoid interrupting and reporting an ERROR when
> > > IsSyncingReplicationSlots() returns false to avoid reporting ERROR unnecessarily
> > > when the slotsync has already finished.
> > >
> >
> > Thanks for the review. Fixed above in v7.
>
> Thanks for updating the patch! It looks good to me, with just a few minor points
> . If those are addressed, I'd like to push it.
>
> + * a new worker (or a new API call) that starts after the old worker was
>
> "API" feels a bit vague. It might be clearer to explicitly say
> "pg_sync_replication_slots()".
>

Done.

> + PROCSIG_SLOTSYNC_MESSAGE, /* ask slotsync worker/API to stop */
>
> "API" here also feels a bit vague. So I'd like to use "ask slot synchronization
> to stop" as the comment, instead.

Done.

> + * We cannot rely solely on 'stopSignaled' here because:
> + * 1) It resides in shared memory and is visible to all processes, so checking
> + *    it directly in ProcessInterrupts() would require additional checks to
> + *    ensure only the synchronizing process acts on it.
> + * 2) It has different lifetime semantics and cannot be reset after handling,
> + *    as it also guards against postmaster and promotion race conditions.
> + * 3) Accessing it requires acquiring a spinlock, which can be too expensive
> + *    or undesirable for every ProcessInterrupts() call.
>
> Now that PROCSIG_SLOTSYNC_MESSAGE is in place, using SlotSyncShutdownPending
> is intuitive. So it seems more useful to explain why stopSignaled is still
> needed rather than why SlotSyncShutdownPending is used (i.e., why stopSignaled
> is not sufficient). Since that rationale is already covered in the SlotSyncCtx
> comments, I'd suggest removing this comment block.

Okay, done.

>
> As for backpatching, this looks like it should go back to v17, where slotsync
> was introduced. Thought?

Right, the issue exists in v17 as well.

Attached the updated patch.

--
Thanks,
Nisha


Attachments:

  [application/octet-stream] v8-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch (11.5K, 2-v8-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch)
  download | inline diff:
From da37336b60bb3c9243c286d0f14ef2f76101a48c Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v8] Prevent slotsync worker/API hang during standby promotion

During standby promotion, ShutDownSlotSync() signals the slotsync worker
to stop and waits for it to finish. If the worker is blocked in
WaitLatchOrSocket() waiting for a response from the primary (e.g., due
to a network failure), the previous SIGUSR1 signal only sets the latch.
The worker wakes up, finds no pending interrupt, and goes back to
waiting, causing ShutDownSlotSync() to wait indefinitely and blocking
promotion.

Fix this by introducing a new procsignal reason PROCSIG_SLOTSYNC_MESSAGE.
The signal handler sets the appropriate interrupt flags so that
WaitLatchOrSocket() returns and the worker exits cleanly, allowing
promotion to proceed.
---
 src/backend/replication/logical/slotsync.c | 143 ++++++++++++++-------
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 +
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 115 insertions(+), 44 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..c52c5135360 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -85,18 +85,19 @@
  * Struct for sharing information to control slot synchronization.
  *
  * The 'pid' is either the slot sync worker's pid or the backend's pid running
- * the SQL function pg_sync_replication_slots(). When the startup process sets
- * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
- * synchronizing process so that the process can immediately stop its
- * synchronizing work on seeing 'stopSignaled' set.
- * Setting 'stopSignaled' is also used to handle the race condition when the
- * postmaster has not noticed the promotion yet and thus may end up restarting
- * the slot sync worker. If 'stopSignaled' is set, the worker will exit in such a
- * case. The SQL function pg_sync_replication_slots() will also error out if
- * this flag is set. Note that we don't need to reset this variable as after
- * promotion the slot sync worker won't be restarted because the pmState
- * changes to PM_RUN from PM_HOT_STANDBY and we don't support demoting
- * primary without restarting the server. See LaunchMissingBackgroundProcesses.
+ * the SQL function pg_sync_replication_slots(). On promotion, the startup
+ * process sets 'stopSignaled' and uses this 'pid' to signal the synchronizing
+ * process with PROCSIG_SLOTSYNC_MESSAGE and also to wake it up so that the
+ * process can immediately stop its synchronizing work.
+ * Setting 'stopSignaled' on the other hand is used to handle the race
+ * condition when the postmaster has not noticed the promotion yet and thus may
+ * end up restarting the slot sync worker. If 'stopSignaled' is set, the worker
+ * will exit in such a case. The SQL function pg_sync_replication_slots() will
+ * also error out if this flag is set. Note that we don't need to reset this
+ * variable as after promotion the slot sync worker won't be restarted because
+ * the pmState changes to PM_RUN from PM_HOT_STANDBY and we don't support
+ * demoting primary without restarting the server.
+ * See LaunchMissingBackgroundProcesses.
  *
  * The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot
  * overwrites.
@@ -141,6 +142,13 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdownPending = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1291,36 +1299,52 @@ slotsync_reread_config(void)
 }
 
 /*
- * Interrupt handler for process performing slot synchronization.
+ * Handle receipt of an interrupt indicating a slotsync shutdown message.
+ *
+ * This is called within the SIGUSR1 handler.  All we do here is set a flag
+ * that will cause the next CHECK_FOR_INTERRUPTS() to invoke
+ * ProcessSlotSyncMessage().
  */
-static void
-ProcessSlotSyncInterrupts(void)
+void
+HandleSlotSyncMessageInterrupt(void)
 {
-	CHECK_FOR_INTERRUPTS();
+	InterruptPending = true;
+	SlotSyncShutdownPending = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+/*
+ * Handle a PROCSIG_SLOTSYNC_MESSAGE signal, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error, unless the sync has already finished, in which case there
+ * is no need to interrupt the caller.
+ */
+void
+ProcessSlotSyncMessage(void)
+{
+	SlotSyncShutdownPending = false;
 
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
 	}
+	else
+	{
+		/*
+		 * If sync has already completed, there is no need to interrupt the
+		 * caller with an error.
+		 */
+		if (!IsSyncingReplicationSlots())
+			return;
 
-	if (ConfigReloadPending)
-		slotsync_reread_config();
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
 }
 
 /*
@@ -1427,6 +1451,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a call to pg_sync_replication_slots()) that starts
+	 * after the old worker was stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(DEBUG1,
+					errmsg("replication slot synchronization worker will not start because promotion was triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will not start because promotion was triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1635,7 +1687,10 @@ ReplSlotSyncWorkerMain(const void *startup_data, size_t startup_data_len)
 		bool		started_tx = false;
 		List	   *remote_slots;
 
-		ProcessSlotSyncInterrupts();
+		CHECK_FOR_INTERRUPTS();
+
+		if (ConfigReloadPending)
+			slotsync_reread_config();
 
 		/*
 		 * The syscache access in fetch_remote_slots() needs a transaction
@@ -1748,11 +1803,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
@@ -1931,9 +1986,6 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 
 		check_and_set_sync_info(MyProcPid);
 
-		/* Check for interrupts and config changes */
-		ProcessSlotSyncInterrupts();
-
 		validate_remote_info(wrconn);
 
 		/* Retry until all the slots are sync-ready */
@@ -1943,7 +1995,10 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 			bool		some_slot_updated = false;
 
 			/* Check for interrupts and config changes */
-			ProcessSlotSyncInterrupts();
+			CHECK_FOR_INTERRUPTS();
+
+			if (ConfigReloadPending)
+				slotsync_reread_config();
 
 			/* We must be in a valid transaction state */
 			Assert(IsTransactionState());
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7e017c8d53b..99792b13760 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -24,6 +24,7 @@
 #include "port/pg_bitutils.h"
 #include "replication/logicalctl.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -700,6 +701,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncMessageInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT))
 		HandleRecoveryConflictInterrupt();
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 10be60011ad..b3cf53b509f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3576,6 +3577,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdownPending)
+		ProcessSlotSyncMessage();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e546d0d050d..35835087128 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncMessageInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdownPending;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncMessageInterrupt(void);
+extern void ProcessSlotSyncMessage(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 348fba53a93..9bbf5db67fa 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slot synchronization to stop */
 	PROCSIG_RECOVERY_CONFLICT,	/* backend is blocking recovery, check
 								 * PGPROC->pendingRecoveryConflicts for the
 								 * reason */
-- 
2.50.1 (Apple Git-155)



^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-02 05:01  Fujii Masao <[email protected]>
  parent: Nisha Moond <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Fujii Masao @ 2026-04-02 05:01 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: Zhijie Hou (Fujitsu) <[email protected]>; shveta malik <[email protected]>; Amit Kapila <[email protected]>; PostgreSQL Hackers <[email protected]>

On Wed, Apr 1, 2026 at 8:11 PM Nisha Moond <[email protected]> wrote:
> > As for backpatching, this looks like it should go back to v17, where slotsync
> > was introduced. Thought?
>
> Right, the issue exists in v17 as well.
>
> Attached the updated patch.

Thanks for updating the patch! LGTM.

I noticed that commit 1362bc33e02 updated the slotsync code so that a backend
performing slot synchronization is signaled on promotion, but it was applied
only to master. I’m not sure why it wasn’t backpatched to v17 and v18,
but it seems we need to backpatch it first before backpatching this patch.
Thought?

I've prepared patches for v18 and v17 and attached them. For the patch for
master, I only updated the commit message. For v18 and v17,
commit 1362bc33e02 needs to be applied first.

I haven't tested the v18 and v17 patches yet. I'll do that next.

Regards,

-- 
Fujii Masao

From 998b743efad532054ac9cae5df7b9ffc0dde8781 Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v9] Fix slotsync worker blocking promotion when stuck in wait

Previously, on standby promotion, the startup process sent SIGUSR1 to
the slotsync worker (or a backend performing slot synchronization) and
waited for it to exit. This worked in most cases, but if the process was
blocked waiting for a response from the primary (e.g., due to a network
failure), SIGUSR1 would not interrupt the wait. As a result, the process
could remain stuck, causing the startup process to wait for a long time
and delaying promotion.

This commit fixes the issue by introducing a new procsignal reason,
PROCSIG_SLOTSYNC_MESSAGE. On promotion, the startup process
sends this signal, and the handler sets interrupt flags so the process
exits (or errors out) promptly at CHECK_FOR_INTERRUPTS(), allowing
promotion to complete without delay.

Backpatch to v17, where slotsync was introduced.

Author: Nisha Moond <[email protected]>
Reviewed-by: shveta malik <[email protected]>
Reviewed-by: Amit Kapila <[email protected]>
Reviewed-by: Zhijie Hou <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Discussion: https://postgr.es/m/CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com
Backpatch-through: 17
---
 src/backend/replication/logical/slotsync.c | 138 ++++++++++++++-------
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 ++
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 111 insertions(+), 43 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 9ce6160dcfa..0b48f55d057 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -75,18 +75,19 @@
  * Struct for sharing information to control slot synchronization.
  *
  * The 'pid' is either the slot sync worker's pid or the backend's pid running
- * the SQL function pg_sync_replication_slots(). When the startup process sets
- * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
- * synchronizing process so that the process can immediately stop its
- * synchronizing work on seeing 'stopSignaled' set.
- * Setting 'stopSignaled' is also used to handle the race condition when the
- * postmaster has not noticed the promotion yet and thus may end up restarting
- * the slot sync worker. If 'stopSignaled' is set, the worker will exit in such a
- * case. The SQL function pg_sync_replication_slots() will also error out if
- * this flag is set. Note that we don't need to reset this variable as after
- * promotion the slot sync worker won't be restarted because the pmState
- * changes to PM_RUN from PM_HOT_STANDBY and we don't support demoting
- * primary without restarting the server. See MaybeStartSlotSyncWorker.
+ * the SQL function pg_sync_replication_slots(). On promotion, the startup
+ * process sets 'stopSignaled' and uses this 'pid' to signal the synchronizing
+ * process with PROCSIG_SLOTSYNC_MESSAGE and also to wake it up so that the
+ * process can immediately stop its synchronizing work.
+ * Setting 'stopSignaled' on the other hand is used to handle the race
+ * condition when the postmaster has not noticed the promotion yet and thus may
+ * end up restarting the slot sync worker. If 'stopSignaled' is set, the worker
+ * will exit in such a case. The SQL function pg_sync_replication_slots() will
+ * also error out if this flag is set. Note that we don't need to reset this
+ * variable as after promotion the slot sync worker won't be restarted because
+ * the pmState changes to PM_RUN from PM_HOT_STANDBY and we don't support
+ * demoting primary without restarting the server.
+ * See MaybeStartSlotSyncWorker.
  *
  * The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot
  * overwrites.
@@ -131,6 +132,13 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdownPending = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1186,36 +1194,52 @@ slotsync_reread_config(void)
 }
 
 /*
- * Interrupt handler for process performing slot synchronization.
+ * Handle receipt of an interrupt indicating a slotsync shutdown message.
+ *
+ * This is called within the SIGUSR1 handler.  All we do here is set a flag
+ * that will cause the next CHECK_FOR_INTERRUPTS() to invoke
+ * ProcessSlotSyncMessage().
  */
-static void
-ProcessSlotSyncInterrupts(WalReceiverConn *wrconn)
+void
+HandleSlotSyncMessageInterrupt(void)
 {
-	CHECK_FOR_INTERRUPTS();
+	InterruptPending = true;
+	SlotSyncShutdownPending = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+/*
+ * Handle a PROCSIG_SLOTSYNC_MESSAGE signal, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error, unless the sync has already finished, in which case there
+ * is no need to interrupt the caller.
+ */
+void
+ProcessSlotSyncMessage(void)
+{
+	SlotSyncShutdownPending = false;
 
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
 	}
+	else
+	{
+		/*
+		 * If sync has already completed, there is no need to interrupt the
+		 * caller with an error.
+		 */
+		if (!IsSyncingReplicationSlots())
+			return;
 
-	if (ConfigReloadPending)
-		slotsync_reread_config();
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
 }
 
 /*
@@ -1322,6 +1346,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a call to pg_sync_replication_slots()) that starts
+	 * after the old worker was stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(DEBUG1,
+					errmsg("replication slot synchronization worker will not start because promotion was triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will not start because promotion was triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1525,7 +1577,10 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len)
 	{
 		bool		some_slot_updated = false;
 
-		ProcessSlotSyncInterrupts(wrconn);
+		CHECK_FOR_INTERRUPTS();
+
+		if (ConfigReloadPending)
+			slotsync_reread_config();
 
 		some_slot_updated = synchronize_slots(wrconn);
 
@@ -1625,11 +1680,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
@@ -1774,9 +1829,6 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 	{
 		check_and_set_sync_info(MyProcPid);
 
-		/* Check for interrupts and config changes */
-		ProcessSlotSyncInterrupts();
-
 		validate_remote_info(wrconn);
 
 		synchronize_slots(wrconn);
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4ed9cedcdd4..d6857f5a8bb 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -23,6 +23,7 @@
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -655,6 +656,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncMessageInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 9cd1d0abe35..00d4073c089 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3497,6 +3498,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		HandleParallelApplyMessages();
+
+	if (SlotSyncShutdownPending)
+		ProcessSlotSyncMessage();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e03c2a005a4..c7a93791104 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncMessageInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdownPending;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncMessageInterrupt(void);
+extern void ProcessSlotSyncMessage(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 7d290ea7d05..126c44bcf1d 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slot synchronization to stop */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_FIRST,
-- 
2.51.2


From af90759562e86807fad00a38b09df6395136b859 Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v9] Fix slotsync worker blocking promotion when stuck in wait

Previously, on standby promotion, the startup process sent SIGUSR1 to
the slotsync worker (or a backend performing slot synchronization) and
waited for it to exit. This worked in most cases, but if the process was
blocked waiting for a response from the primary (e.g., due to a network
failure), SIGUSR1 would not interrupt the wait. As a result, the process
could remain stuck, causing the startup process to wait for a long time
and delaying promotion.

This commit fixes the issue by introducing a new procsignal reason,
PROCSIG_SLOTSYNC_MESSAGE. On promotion, the startup process
sends this signal, and the handler sets interrupt flags so the process
exits (or errors out) promptly at CHECK_FOR_INTERRUPTS(), allowing
promotion to complete without delay.

Backpatch to v17, where slotsync was introduced.

Author: Nisha Moond <[email protected]>
Reviewed-by: shveta malik <[email protected]>
Reviewed-by: Amit Kapila <[email protected]>
Reviewed-by: Zhijie Hou <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Discussion: https://postgr.es/m/CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com
Backpatch-through: 17
---
 src/backend/replication/logical/slotsync.c | 138 ++++++++++++++-------
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 ++
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 111 insertions(+), 43 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index b00439b641f..4cbee197ddb 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -73,18 +73,19 @@
  * Struct for sharing information to control slot synchronization.
  *
  * The 'pid' is either the slot sync worker's pid or the backend's pid running
- * the SQL function pg_sync_replication_slots(). When the startup process sets
- * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
- * synchronizing process so that the process can immediately stop its
- * synchronizing work on seeing 'stopSignaled' set.
- * Setting 'stopSignaled' is also used to handle the race condition when the
- * postmaster has not noticed the promotion yet and thus may end up restarting
- * the slot sync worker. If 'stopSignaled' is set, the worker will exit in such a
- * case. The SQL function pg_sync_replication_slots() will also error out if
- * this flag is set. Note that we don't need to reset this variable as after
- * promotion the slot sync worker won't be restarted because the pmState
- * changes to PM_RUN from PM_HOT_STANDBY and we don't support demoting
- * primary without restarting the server. See LaunchMissingBackgroundProcesses.
+ * the SQL function pg_sync_replication_slots(). On promotion, the startup
+ * process sets 'stopSignaled' and uses this 'pid' to signal the synchronizing
+ * process with PROCSIG_SLOTSYNC_MESSAGE and also to wake it up so that the
+ * process can immediately stop its synchronizing work.
+ * Setting 'stopSignaled' on the other hand is used to handle the race
+ * condition when the postmaster has not noticed the promotion yet and thus may
+ * end up restarting the slot sync worker. If 'stopSignaled' is set, the worker
+ * will exit in such a case. The SQL function pg_sync_replication_slots() will
+ * also error out if this flag is set. Note that we don't need to reset this
+ * variable as after promotion the slot sync worker won't be restarted because
+ * the pmState changes to PM_RUN from PM_HOT_STANDBY and we don't support
+ * demoting primary without restarting the server.
+ * See LaunchMissingBackgroundProcesses.
  *
  * The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot
  * overwrites.
@@ -129,6 +130,13 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdownPending = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1205,36 +1213,52 @@ slotsync_reread_config(void)
 }
 
 /*
- * Interrupt handler for process performing slot synchronization.
+ * Handle receipt of an interrupt indicating a slotsync shutdown message.
+ *
+ * This is called within the SIGUSR1 handler.  All we do here is set a flag
+ * that will cause the next CHECK_FOR_INTERRUPTS() to invoke
+ * ProcessSlotSyncMessage().
  */
-static void
-ProcessSlotSyncInterrupts(WalReceiverConn *wrconn)
+void
+HandleSlotSyncMessageInterrupt(void)
 {
-	CHECK_FOR_INTERRUPTS();
+	InterruptPending = true;
+	SlotSyncShutdownPending = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+/*
+ * Handle a PROCSIG_SLOTSYNC_MESSAGE signal, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error, unless the sync has already finished, in which case there
+ * is no need to interrupt the caller.
+ */
+void
+ProcessSlotSyncMessage(void)
+{
+	SlotSyncShutdownPending = false;
 
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
 	}
+	else
+	{
+		/*
+		 * If sync has already completed, there is no need to interrupt the
+		 * caller with an error.
+		 */
+		if (!IsSyncingReplicationSlots())
+			return;
 
-	if (ConfigReloadPending)
-		slotsync_reread_config();
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
 }
 
 /*
@@ -1341,6 +1365,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a call to pg_sync_replication_slots()) that starts
+	 * after the old worker was stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(DEBUG1,
+					errmsg("replication slot synchronization worker will not start because promotion was triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will not start because promotion was triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1546,7 +1598,10 @@ ReplSlotSyncWorkerMain(const void *startup_data, size_t startup_data_len)
 	{
 		bool		some_slot_updated = false;
 
-		ProcessSlotSyncInterrupts(wrconn);
+		CHECK_FOR_INTERRUPTS();
+
+		if (ConfigReloadPending)
+			slotsync_reread_config();
 
 		some_slot_updated = synchronize_slots(wrconn);
 
@@ -1644,11 +1699,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
@@ -1793,9 +1848,6 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 	{
 		check_and_set_sync_info(MyProcPid);
 
-		/* Check for interrupts and config changes */
-		ProcessSlotSyncInterrupts();
-
 		validate_remote_info(wrconn);
 
 		synchronize_slots(wrconn);
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 087821311cc..05d99b452c3 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -23,6 +23,7 @@
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -694,6 +695,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncMessageInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2f8c3d5f918..115a634c641 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -57,6 +57,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3535,6 +3536,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdownPending)
+		ProcessSlotSyncMessage();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index 16b721463dd..882c46581b3 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncMessageInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdownPending;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncMessageInterrupt(void);
+extern void ProcessSlotSyncMessage(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index afeeb1ca019..c4a59325604 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slot synchronization to stop */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_FIRST,
-- 
2.51.2



Attachments:

  [text/plain] v9-0001-pg17-Fix-slotsync-worker-blocking-promotion-when-stuck.txt (11.5K, 2-v9-0001-pg17-Fix-slotsync-worker-blocking-promotion-when-stuck.txt)
  download | inline diff:
From 998b743efad532054ac9cae5df7b9ffc0dde8781 Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v9] Fix slotsync worker blocking promotion when stuck in wait

Previously, on standby promotion, the startup process sent SIGUSR1 to
the slotsync worker (or a backend performing slot synchronization) and
waited for it to exit. This worked in most cases, but if the process was
blocked waiting for a response from the primary (e.g., due to a network
failure), SIGUSR1 would not interrupt the wait. As a result, the process
could remain stuck, causing the startup process to wait for a long time
and delaying promotion.

This commit fixes the issue by introducing a new procsignal reason,
PROCSIG_SLOTSYNC_MESSAGE. On promotion, the startup process
sends this signal, and the handler sets interrupt flags so the process
exits (or errors out) promptly at CHECK_FOR_INTERRUPTS(), allowing
promotion to complete without delay.

Backpatch to v17, where slotsync was introduced.

Author: Nisha Moond <[email protected]>
Reviewed-by: shveta malik <[email protected]>
Reviewed-by: Amit Kapila <[email protected]>
Reviewed-by: Zhijie Hou <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Discussion: https://postgr.es/m/CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com
Backpatch-through: 17
---
 src/backend/replication/logical/slotsync.c | 138 ++++++++++++++-------
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 ++
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 111 insertions(+), 43 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 9ce6160dcfa..0b48f55d057 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -75,18 +75,19 @@
  * Struct for sharing information to control slot synchronization.
  *
  * The 'pid' is either the slot sync worker's pid or the backend's pid running
- * the SQL function pg_sync_replication_slots(). When the startup process sets
- * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
- * synchronizing process so that the process can immediately stop its
- * synchronizing work on seeing 'stopSignaled' set.
- * Setting 'stopSignaled' is also used to handle the race condition when the
- * postmaster has not noticed the promotion yet and thus may end up restarting
- * the slot sync worker. If 'stopSignaled' is set, the worker will exit in such a
- * case. The SQL function pg_sync_replication_slots() will also error out if
- * this flag is set. Note that we don't need to reset this variable as after
- * promotion the slot sync worker won't be restarted because the pmState
- * changes to PM_RUN from PM_HOT_STANDBY and we don't support demoting
- * primary without restarting the server. See MaybeStartSlotSyncWorker.
+ * the SQL function pg_sync_replication_slots(). On promotion, the startup
+ * process sets 'stopSignaled' and uses this 'pid' to signal the synchronizing
+ * process with PROCSIG_SLOTSYNC_MESSAGE and also to wake it up so that the
+ * process can immediately stop its synchronizing work.
+ * Setting 'stopSignaled' on the other hand is used to handle the race
+ * condition when the postmaster has not noticed the promotion yet and thus may
+ * end up restarting the slot sync worker. If 'stopSignaled' is set, the worker
+ * will exit in such a case. The SQL function pg_sync_replication_slots() will
+ * also error out if this flag is set. Note that we don't need to reset this
+ * variable as after promotion the slot sync worker won't be restarted because
+ * the pmState changes to PM_RUN from PM_HOT_STANDBY and we don't support
+ * demoting primary without restarting the server.
+ * See MaybeStartSlotSyncWorker.
  *
  * The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot
  * overwrites.
@@ -131,6 +132,13 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdownPending = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1186,36 +1194,52 @@ slotsync_reread_config(void)
 }
 
 /*
- * Interrupt handler for process performing slot synchronization.
+ * Handle receipt of an interrupt indicating a slotsync shutdown message.
+ *
+ * This is called within the SIGUSR1 handler.  All we do here is set a flag
+ * that will cause the next CHECK_FOR_INTERRUPTS() to invoke
+ * ProcessSlotSyncMessage().
  */
-static void
-ProcessSlotSyncInterrupts(WalReceiverConn *wrconn)
+void
+HandleSlotSyncMessageInterrupt(void)
 {
-	CHECK_FOR_INTERRUPTS();
+	InterruptPending = true;
+	SlotSyncShutdownPending = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+/*
+ * Handle a PROCSIG_SLOTSYNC_MESSAGE signal, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error, unless the sync has already finished, in which case there
+ * is no need to interrupt the caller.
+ */
+void
+ProcessSlotSyncMessage(void)
+{
+	SlotSyncShutdownPending = false;
 
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
 	}
+	else
+	{
+		/*
+		 * If sync has already completed, there is no need to interrupt the
+		 * caller with an error.
+		 */
+		if (!IsSyncingReplicationSlots())
+			return;
 
-	if (ConfigReloadPending)
-		slotsync_reread_config();
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
 }
 
 /*
@@ -1322,6 +1346,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a call to pg_sync_replication_slots()) that starts
+	 * after the old worker was stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(DEBUG1,
+					errmsg("replication slot synchronization worker will not start because promotion was triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will not start because promotion was triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1525,7 +1577,10 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t startup_data_len)
 	{
 		bool		some_slot_updated = false;
 
-		ProcessSlotSyncInterrupts(wrconn);
+		CHECK_FOR_INTERRUPTS();
+
+		if (ConfigReloadPending)
+			slotsync_reread_config();
 
 		some_slot_updated = synchronize_slots(wrconn);
 
@@ -1625,11 +1680,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
@@ -1774,9 +1829,6 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 	{
 		check_and_set_sync_info(MyProcPid);
 
-		/* Check for interrupts and config changes */
-		ProcessSlotSyncInterrupts();
-
 		validate_remote_info(wrconn);
 
 		synchronize_slots(wrconn);
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 4ed9cedcdd4..d6857f5a8bb 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -23,6 +23,7 @@
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -655,6 +656,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncMessageInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 9cd1d0abe35..00d4073c089 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3497,6 +3498,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		HandleParallelApplyMessages();
+
+	if (SlotSyncShutdownPending)
+		ProcessSlotSyncMessage();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e03c2a005a4..c7a93791104 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncMessageInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdownPending;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncMessageInterrupt(void);
+extern void ProcessSlotSyncMessage(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 7d290ea7d05..126c44bcf1d 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slot synchronization to stop */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_FIRST,
-- 
2.51.2



  [text/plain] v9-0001-pg18-Fix-slotsync-worker-blocking-promotion-when-stuck.txt (11.6K, 3-v9-0001-pg18-Fix-slotsync-worker-blocking-promotion-when-stuck.txt)
  download | inline diff:
From af90759562e86807fad00a38b09df6395136b859 Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v9] Fix slotsync worker blocking promotion when stuck in wait

Previously, on standby promotion, the startup process sent SIGUSR1 to
the slotsync worker (or a backend performing slot synchronization) and
waited for it to exit. This worked in most cases, but if the process was
blocked waiting for a response from the primary (e.g., due to a network
failure), SIGUSR1 would not interrupt the wait. As a result, the process
could remain stuck, causing the startup process to wait for a long time
and delaying promotion.

This commit fixes the issue by introducing a new procsignal reason,
PROCSIG_SLOTSYNC_MESSAGE. On promotion, the startup process
sends this signal, and the handler sets interrupt flags so the process
exits (or errors out) promptly at CHECK_FOR_INTERRUPTS(), allowing
promotion to complete without delay.

Backpatch to v17, where slotsync was introduced.

Author: Nisha Moond <[email protected]>
Reviewed-by: shveta malik <[email protected]>
Reviewed-by: Amit Kapila <[email protected]>
Reviewed-by: Zhijie Hou <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Discussion: https://postgr.es/m/CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com
Backpatch-through: 17
---
 src/backend/replication/logical/slotsync.c | 138 ++++++++++++++-------
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 ++
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 111 insertions(+), 43 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index b00439b641f..4cbee197ddb 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -73,18 +73,19 @@
  * Struct for sharing information to control slot synchronization.
  *
  * The 'pid' is either the slot sync worker's pid or the backend's pid running
- * the SQL function pg_sync_replication_slots(). When the startup process sets
- * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
- * synchronizing process so that the process can immediately stop its
- * synchronizing work on seeing 'stopSignaled' set.
- * Setting 'stopSignaled' is also used to handle the race condition when the
- * postmaster has not noticed the promotion yet and thus may end up restarting
- * the slot sync worker. If 'stopSignaled' is set, the worker will exit in such a
- * case. The SQL function pg_sync_replication_slots() will also error out if
- * this flag is set. Note that we don't need to reset this variable as after
- * promotion the slot sync worker won't be restarted because the pmState
- * changes to PM_RUN from PM_HOT_STANDBY and we don't support demoting
- * primary without restarting the server. See LaunchMissingBackgroundProcesses.
+ * the SQL function pg_sync_replication_slots(). On promotion, the startup
+ * process sets 'stopSignaled' and uses this 'pid' to signal the synchronizing
+ * process with PROCSIG_SLOTSYNC_MESSAGE and also to wake it up so that the
+ * process can immediately stop its synchronizing work.
+ * Setting 'stopSignaled' on the other hand is used to handle the race
+ * condition when the postmaster has not noticed the promotion yet and thus may
+ * end up restarting the slot sync worker. If 'stopSignaled' is set, the worker
+ * will exit in such a case. The SQL function pg_sync_replication_slots() will
+ * also error out if this flag is set. Note that we don't need to reset this
+ * variable as after promotion the slot sync worker won't be restarted because
+ * the pmState changes to PM_RUN from PM_HOT_STANDBY and we don't support
+ * demoting primary without restarting the server.
+ * See LaunchMissingBackgroundProcesses.
  *
  * The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot
  * overwrites.
@@ -129,6 +130,13 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdownPending = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1205,36 +1213,52 @@ slotsync_reread_config(void)
 }
 
 /*
- * Interrupt handler for process performing slot synchronization.
+ * Handle receipt of an interrupt indicating a slotsync shutdown message.
+ *
+ * This is called within the SIGUSR1 handler.  All we do here is set a flag
+ * that will cause the next CHECK_FOR_INTERRUPTS() to invoke
+ * ProcessSlotSyncMessage().
  */
-static void
-ProcessSlotSyncInterrupts(WalReceiverConn *wrconn)
+void
+HandleSlotSyncMessageInterrupt(void)
 {
-	CHECK_FOR_INTERRUPTS();
+	InterruptPending = true;
+	SlotSyncShutdownPending = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+/*
+ * Handle a PROCSIG_SLOTSYNC_MESSAGE signal, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error, unless the sync has already finished, in which case there
+ * is no need to interrupt the caller.
+ */
+void
+ProcessSlotSyncMessage(void)
+{
+	SlotSyncShutdownPending = false;
 
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
 	}
+	else
+	{
+		/*
+		 * If sync has already completed, there is no need to interrupt the
+		 * caller with an error.
+		 */
+		if (!IsSyncingReplicationSlots())
+			return;
 
-	if (ConfigReloadPending)
-		slotsync_reread_config();
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
 }
 
 /*
@@ -1341,6 +1365,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a call to pg_sync_replication_slots()) that starts
+	 * after the old worker was stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(DEBUG1,
+					errmsg("replication slot synchronization worker will not start because promotion was triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will not start because promotion was triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1546,7 +1598,10 @@ ReplSlotSyncWorkerMain(const void *startup_data, size_t startup_data_len)
 	{
 		bool		some_slot_updated = false;
 
-		ProcessSlotSyncInterrupts(wrconn);
+		CHECK_FOR_INTERRUPTS();
+
+		if (ConfigReloadPending)
+			slotsync_reread_config();
 
 		some_slot_updated = synchronize_slots(wrconn);
 
@@ -1644,11 +1699,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
@@ -1793,9 +1848,6 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 	{
 		check_and_set_sync_info(MyProcPid);
 
-		/* Check for interrupts and config changes */
-		ProcessSlotSyncInterrupts();
-
 		validate_remote_info(wrconn);
 
 		synchronize_slots(wrconn);
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 087821311cc..05d99b452c3 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -23,6 +23,7 @@
 #include "pgstat.h"
 #include "port/pg_bitutils.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -694,6 +695,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncMessageInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_DATABASE))
 		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_DATABASE);
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 2f8c3d5f918..115a634c641 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -57,6 +57,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3535,6 +3536,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdownPending)
+		ProcessSlotSyncMessage();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index 16b721463dd..882c46581b3 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncMessageInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdownPending;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncMessageInterrupt(void);
+extern void ProcessSlotSyncMessage(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index afeeb1ca019..c4a59325604 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slot synchronization to stop */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_FIRST,
-- 
2.51.2



  [application/octet-stream] v9-0001-Fix-slotsync-worker-blocking-promotion-when-stuck.patch (12.0K, 4-v9-0001-Fix-slotsync-worker-blocking-promotion-when-stuck.patch)
  download | inline diff:
From cf779959c44f3dfc95f4ae79370bab668e2600d0 Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v9] Fix slotsync worker blocking promotion when stuck in wait

Previously, on standby promotion, the startup process sent SIGUSR1 to
the slotsync worker (or a backend performing slot synchronization) and
waited for it to exit. This worked in most cases, but if the process was
blocked waiting for a response from the primary (e.g., due to a network
failure), SIGUSR1 would not interrupt the wait. As a result, the process
could remain stuck, causing the startup process to wait for a long time
and delaying promotion.

This commit fixes the issue by introducing a new procsignal reason,
PROCSIG_SLOTSYNC_MESSAGE. On promotion, the startup process
sends this signal, and the handler sets interrupt flags so the process
exits (or errors out) promptly at CHECK_FOR_INTERRUPTS(), allowing
promotion to complete without delay.

Backpatch to v17, where slotsync was introduced.

Author: Nisha Moond <[email protected]>
Reviewed-by: shveta malik <[email protected]>
Reviewed-by: Amit Kapila <[email protected]>
Reviewed-by: Zhijie Hou <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Discussion: https://postgr.es/m/CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com
Backpatch-through: 17
---
 src/backend/replication/logical/slotsync.c | 143 ++++++++++++++-------
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 +
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 115 insertions(+), 44 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..c52c5135360 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -85,18 +85,19 @@
  * Struct for sharing information to control slot synchronization.
  *
  * The 'pid' is either the slot sync worker's pid or the backend's pid running
- * the SQL function pg_sync_replication_slots(). When the startup process sets
- * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
- * synchronizing process so that the process can immediately stop its
- * synchronizing work on seeing 'stopSignaled' set.
- * Setting 'stopSignaled' is also used to handle the race condition when the
- * postmaster has not noticed the promotion yet and thus may end up restarting
- * the slot sync worker. If 'stopSignaled' is set, the worker will exit in such a
- * case. The SQL function pg_sync_replication_slots() will also error out if
- * this flag is set. Note that we don't need to reset this variable as after
- * promotion the slot sync worker won't be restarted because the pmState
- * changes to PM_RUN from PM_HOT_STANDBY and we don't support demoting
- * primary without restarting the server. See LaunchMissingBackgroundProcesses.
+ * the SQL function pg_sync_replication_slots(). On promotion, the startup
+ * process sets 'stopSignaled' and uses this 'pid' to signal the synchronizing
+ * process with PROCSIG_SLOTSYNC_MESSAGE and also to wake it up so that the
+ * process can immediately stop its synchronizing work.
+ * Setting 'stopSignaled' on the other hand is used to handle the race
+ * condition when the postmaster has not noticed the promotion yet and thus may
+ * end up restarting the slot sync worker. If 'stopSignaled' is set, the worker
+ * will exit in such a case. The SQL function pg_sync_replication_slots() will
+ * also error out if this flag is set. Note that we don't need to reset this
+ * variable as after promotion the slot sync worker won't be restarted because
+ * the pmState changes to PM_RUN from PM_HOT_STANDBY and we don't support
+ * demoting primary without restarting the server.
+ * See LaunchMissingBackgroundProcesses.
  *
  * The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot
  * overwrites.
@@ -141,6 +142,13 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdownPending = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1291,36 +1299,52 @@ slotsync_reread_config(void)
 }
 
 /*
- * Interrupt handler for process performing slot synchronization.
+ * Handle receipt of an interrupt indicating a slotsync shutdown message.
+ *
+ * This is called within the SIGUSR1 handler.  All we do here is set a flag
+ * that will cause the next CHECK_FOR_INTERRUPTS() to invoke
+ * ProcessSlotSyncMessage().
  */
-static void
-ProcessSlotSyncInterrupts(void)
+void
+HandleSlotSyncMessageInterrupt(void)
 {
-	CHECK_FOR_INTERRUPTS();
+	InterruptPending = true;
+	SlotSyncShutdownPending = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+/*
+ * Handle a PROCSIG_SLOTSYNC_MESSAGE signal, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error, unless the sync has already finished, in which case there
+ * is no need to interrupt the caller.
+ */
+void
+ProcessSlotSyncMessage(void)
+{
+	SlotSyncShutdownPending = false;
 
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
 	}
+	else
+	{
+		/*
+		 * If sync has already completed, there is no need to interrupt the
+		 * caller with an error.
+		 */
+		if (!IsSyncingReplicationSlots())
+			return;
 
-	if (ConfigReloadPending)
-		slotsync_reread_config();
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
 }
 
 /*
@@ -1427,6 +1451,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a call to pg_sync_replication_slots()) that starts
+	 * after the old worker was stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(DEBUG1,
+					errmsg("replication slot synchronization worker will not start because promotion was triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will not start because promotion was triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1635,7 +1687,10 @@ ReplSlotSyncWorkerMain(const void *startup_data, size_t startup_data_len)
 		bool		started_tx = false;
 		List	   *remote_slots;
 
-		ProcessSlotSyncInterrupts();
+		CHECK_FOR_INTERRUPTS();
+
+		if (ConfigReloadPending)
+			slotsync_reread_config();
 
 		/*
 		 * The syscache access in fetch_remote_slots() needs a transaction
@@ -1748,11 +1803,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
@@ -1931,9 +1986,6 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 
 		check_and_set_sync_info(MyProcPid);
 
-		/* Check for interrupts and config changes */
-		ProcessSlotSyncInterrupts();
-
 		validate_remote_info(wrconn);
 
 		/* Retry until all the slots are sync-ready */
@@ -1943,7 +1995,10 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 			bool		some_slot_updated = false;
 
 			/* Check for interrupts and config changes */
-			ProcessSlotSyncInterrupts();
+			CHECK_FOR_INTERRUPTS();
+
+			if (ConfigReloadPending)
+				slotsync_reread_config();
 
 			/* We must be in a valid transaction state */
 			Assert(IsTransactionState());
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7e017c8d53b..99792b13760 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -24,6 +24,7 @@
 #include "port/pg_bitutils.h"
 #include "replication/logicalctl.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -700,6 +701,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncMessageInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT))
 		HandleRecoveryConflictInterrupt();
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 10be60011ad..b3cf53b509f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3576,6 +3577,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdownPending)
+		ProcessSlotSyncMessage();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e546d0d050d..35835087128 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncMessageInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdownPending;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncMessageInterrupt(void);
+extern void ProcessSlotSyncMessage(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 348fba53a93..9bbf5db67fa 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slot synchronization to stop */
 	PROCSIG_RECOVERY_CONFLICT,	/* backend is blocking recovery, check
 								 * PGPROC->pendingRecoveryConflicts for the
 								 * reason */
-- 
2.51.2



^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-02 06:34  Amit Kapila <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Amit Kapila @ 2026-04-02 06:34 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Nisha Moond <[email protected]>; Zhijie Hou (Fujitsu) <[email protected]>; shveta malik <[email protected]>; PostgreSQL Hackers <[email protected]>

On Thu, Apr 2, 2026 at 10:31 AM Fujii Masao <[email protected]> wrote:
>
> On Wed, Apr 1, 2026 at 8:11 PM Nisha Moond <[email protected]> wrote:
> > > As for backpatching, this looks like it should go back to v17, where slotsync
> > > was introduced. Thought?
> >
> > Right, the issue exists in v17 as well.
> >
> > Attached the updated patch.
>
> Thanks for updating the patch! LGTM.
>
> I noticed that commit 1362bc33e02 updated the slotsync code so that a backend
> performing slot synchronization is signaled on promotion, but it was applied
> only to master. I’m not sure why it wasn’t backpatched to v17 and v18,
>

It is because we added retry slot-sync logic for API in master in
commit 0d2d4a0ec3eca64e7f5ce7f7630b56a561b2663c. So, there is no
chance of API waiting except for the race condition being discussed
here.

> but it seems we need to backpatch it first before backpatching this patch.
> Thought?
>

I feel the use of API before this version was mainly for test-cases as
it was not production ready. So, it is less helpful to backpatch
1362bc33e02, if we want, we can backpatch only the worker part of the
fix. OTOH, as the issue is not frequent and we have some workaround
(at least for more common platforms) as well, we can consider not
backpatching it.

-- 
With Regards,
Amit Kapila.





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-02 13:16  Fujii Masao <[email protected]>
  parent: Amit Kapila <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Fujii Masao @ 2026-04-02 13:16 UTC (permalink / raw)
  To: Amit Kapila <[email protected]>; +Cc: Nisha Moond <[email protected]>; Zhijie Hou (Fujitsu) <[email protected]>; shveta malik <[email protected]>; PostgreSQL Hackers <[email protected]>

On Thu, Apr 2, 2026 at 3:34 PM Amit Kapila <[email protected]> wrote:
> It is because we added retry slot-sync logic for API in master in
> commit 0d2d4a0ec3eca64e7f5ce7f7630b56a561b2663c. So, there is no
> chance of API waiting except for the race condition being discussed
> here.

Thanks for the info!


> > but it seems we need to backpatch it first before backpatching this patch.
> > Thought?
> >
>
> I feel the use of API before this version was mainly for test-cases as
> it was not production ready. So, it is less helpful to backpatch
> 1362bc33e02, if we want, we can backpatch only the worker part of the
> fix. OTOH, as the issue is not frequent and we have some workaround
> (at least for more common platforms) as well, we can consider not
> backpatching it.

I see your point. OTOH, on second thought, if backpatching commit 1362bc33e02
along with this patch to v17 and v18 *is harmless*, I'd prefer to do so. Keeping
the slotsync shutdown code more consistent across versions would make future
backpatching easier, and selectively backpatching only parts of the shutdown
logic would be more complicated and error-prone. Thought?

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-07 03:48  Amit Kapila <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Amit Kapila @ 2026-04-07 03:48 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Nisha Moond <[email protected]>; Zhijie Hou (Fujitsu) <[email protected]>; shveta malik <[email protected]>; PostgreSQL Hackers <[email protected]>

On Thu, Apr 2, 2026 at 6:46 PM Fujii Masao <[email protected]> wrote:
>
> On Thu, Apr 2, 2026 at 3:34 PM Amit Kapila <[email protected]> wrote:
> >
> > I feel the use of API before this version was mainly for test-cases as
> > it was not production ready. So, it is less helpful to backpatch
> > 1362bc33e02, if we want, we can backpatch only the worker part of the
> > fix. OTOH, as the issue is not frequent and we have some workaround
> > (at least for more common platforms) as well, we can consider not
> > backpatching it.
>
> I see your point. OTOH, on second thought, if backpatching commit 1362bc33e02
> along with this patch to v17 and v18 *is harmless*, I'd prefer to do so.
>

As such I don't see a problem backpatching commit 1362bc33e02 as it
appears to be a localised change.

> Keeping
> the slotsync shutdown code more consistent across versions would make future
> backpatching easier, and selectively backpatching only parts of the shutdown
> logic would be more complicated and error-prone.
>

I agree with this line of reasoning here or in general as well but
personally I am a bit hesitant to back patch changes which are not
mandatory. In this particular case, I don't see any problem with
backpatching the part of code you want to backpatch, so I leave it to
your judgement.

-- 
With Regards,
Amit Kapila.





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-08 02:39  Fujii Masao <[email protected]>
  parent: Amit Kapila <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Fujii Masao @ 2026-04-08 02:39 UTC (permalink / raw)
  To: Amit Kapila <[email protected]>; +Cc: Nisha Moond <[email protected]>; Zhijie Hou (Fujitsu) <[email protected]>; shveta malik <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, Apr 7, 2026 at 12:48 PM Amit Kapila <[email protected]> wrote:
> I agree with this line of reasoning here or in general as well but
> personally I am a bit hesitant to back patch changes which are not
> mandatory. In this particular case, I don't see any problem with
> backpatching the part of code you want to backpatch, so I leave it to
> your judgement.

Thanks for the comment!

I decided to backpatch commit 1362bc33e02. Although pg_sync_replication_slots()
lacks retry logic in v17 and v18 and is therefore less likely to block
promotion, the issue still exists in those versions.

Given that, it seemed worthwhile to backpatch the change and fix cases where
both the slotsync worker and pg_sync_replication_slots() can block promotion
when stuck in a wait.

I've pushed and backpatched the patch. Thanks!

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-08 03:36  Fujii Masao <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Fujii Masao @ 2026-04-08 03:36 UTC (permalink / raw)
  To: Amit Kapila <[email protected]>; +Cc: Nisha Moond <[email protected]>; Zhijie Hou (Fujitsu) <[email protected]>; shveta malik <[email protected]>; PostgreSQL Hackers <[email protected]>

On Wed, Apr 8, 2026 at 11:39 AM Fujii Masao <[email protected]> wrote:
>
> On Tue, Apr 7, 2026 at 12:48 PM Amit Kapila <[email protected]> wrote:
> > I agree with this line of reasoning here or in general as well but
> > personally I am a bit hesitant to back patch changes which are not
> > mandatory. In this particular case, I don't see any problem with
> > backpatching the part of code you want to backpatch, so I leave it to
> > your judgement.
>
> Thanks for the comment!
>
> I decided to backpatch commit 1362bc33e02. Although pg_sync_replication_slots()
> lacks retry logic in v17 and v18 and is therefore less likely to block
> promotion, the issue still exists in those versions.
>
> Given that, it seemed worthwhile to backpatch the change and fix cases where
> both the slotsync worker and pg_sync_replication_slots() can block promotion
> when stuck in a wait.
>
> I've pushed and backpatched the patch. Thanks!

The backpatch added PROCSIG_SLOTSYNC_MESSAGE in the middle of enum
ProcSignalReason, which could break the ABI. I’m planning to move it to
the end of the enum in v17 and v18.

That seems to work for v18. However, in v17, NUM_PROCSIGNALS is defined
as the last enum value:

            NUM_PROCSIGNALS /* Must be last! */
        } ProcSignalReason;

So simply moving PROCSIG_SLOTSYNC_MESSAGE to the end would change the meaning
of NUM_PROCSIGNALS.

One option might be to remove NUM_PROCSIGNALS from the enum, move
PROCSIG_SLOTSYNC_MESSAGE to the end, and define it separately, e.g.
#define NUM_PROCSIGNALS (PROCSIG_SLOTSYNC_MESSAGE + 1). Would that
be acceptable without breaking the ABI? Thoughts?

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-08 06:54  Fujii Masao <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Fujii Masao @ 2026-04-08 06:54 UTC (permalink / raw)
  To: Amit Kapila <[email protected]>; +Cc: Nisha Moond <[email protected]>; Zhijie Hou (Fujitsu) <[email protected]>; shveta malik <[email protected]>; PostgreSQL Hackers <[email protected]>

On Wed, Apr 8, 2026 at 12:36 PM Fujii Masao <[email protected]> wrote:
> The backpatch added PROCSIG_SLOTSYNC_MESSAGE in the middle of enum
> ProcSignalReason, which could break the ABI. I’m planning to move it to
> the end of the enum in v17 and v18.
>
> That seems to work for v18. However, in v17, NUM_PROCSIGNALS is defined
> as the last enum value:
>
>             NUM_PROCSIGNALS /* Must be last! */
>         } ProcSignalReason;
>
> So simply moving PROCSIG_SLOTSYNC_MESSAGE to the end would change the meaning
> of NUM_PROCSIGNALS.
>
> One option might be to remove NUM_PROCSIGNALS from the enum, move
> PROCSIG_SLOTSYNC_MESSAGE to the end, and define it separately, e.g.
> #define NUM_PROCSIGNALS (PROCSIG_SLOTSYNC_MESSAGE + 1). Would that
> be acceptable without breaking the ABI? Thoughts?

The patches I'm planning to apply for v17 and v18 are attached.

For v17, I'm still not entirely sure this change is safe from an ABI
perspective. Even if it is, abi-compliance-check may still report
a break since the patch removes NUM_PROCSIGNALS from the enum
(defines it as separate macro). If so, we may need to update
.abi-compliance-history to avoid false positives.

Regards,

-- 
Fujii Masao


Attachments:

  [application/octet-stream] v1-0001-pg17-Fix-ABI-break-by-moving-PROCSIG_SLOTSYNC_MESSAGE-.patch (1.6K, 2-v1-0001-pg17-Fix-ABI-break-by-moving-PROCSIG_SLOTSYNC_MESSAGE-.patch)
  download | inline diff:
From d6ad4847ddbedeafad682653d30eaaf0ad8dedb7 Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Wed, 8 Apr 2026 12:53:23 +0900
Subject: [PATCH v1] Fix ABI break by moving PROCSIG_SLOTSYNC_MESSAGE in
 ProcSignalReason

Commit 58c1188a3ea added PROCSIG_SLOTSYNC_MESSAGE in the middle of
enum ProcSignalReason, breaking the ABI.

Fix this by moving PROCSIG_SLOTSYNC_MESSAGE to the end of the enum.

Per buildfarm member crake.

Discussion: https://postgr.es/m/CAHGQGwH_AAbtsiYDJt65N7_4PJ0CgOJmBEaCq68e5_tcuG_vXw@mail.gmail.com
---
 src/include/storage/procsignal.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 126c44bcf1d..db59d266a2c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,7 +36,6 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
-	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slot synchronization to stop */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_FIRST,
@@ -49,9 +48,11 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 	PROCSIG_RECOVERY_CONFLICT_LAST = PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 
-	NUM_PROCSIGNALS				/* Must be last! */
+	PROCSIG_SLOTSYNC_MESSAGE	/* ask slot synchronization to stop */
 } ProcSignalReason;
 
+#define NUM_PROCSIGNALS (PROCSIG_SLOTSYNC_MESSAGE + 1)
+
 typedef enum
 {
 	PROCSIGNAL_BARRIER_SMGRRELEASE, /* ask smgr to close files */
-- 
2.51.2



  [application/octet-stream] v1-0001-pg18-Fix-ABI-break-by-moving-PROCSIG_SLOTSYNC_MESSAGE-.patch (1.6K, 3-v1-0001-pg18-Fix-ABI-break-by-moving-PROCSIG_SLOTSYNC_MESSAGE-.patch)
  download | inline diff:
From 8c7f964d5c56fa40b83fdda21e610594fd634897 Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Wed, 8 Apr 2026 11:58:43 +0900
Subject: [PATCH v1] Fix ABI break by moving PROCSIG_SLOTSYNC_MESSAGE in
 ProcSignalReason

Commit 58c1188a3ea added PROCSIG_SLOTSYNC_MESSAGE in the middle of
enum ProcSignalReason, breaking the ABI.

Fix this by moving PROCSIG_SLOTSYNC_MESSAGE to the end of the enum.

Per buildfarm member crake.

Discussion: https://postgr.es/m/CAHGQGwH_AAbtsiYDJt65N7_4PJ0CgOJmBEaCq68e5_tcuG_vXw@mail.gmail.com
---
 src/include/storage/procsignal.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index c4a59325604..234cfcb364a 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,7 +36,6 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
-	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slot synchronization to stop */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_FIRST,
@@ -48,9 +47,11 @@ typedef enum
 	PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
 	PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
 	PROCSIG_RECOVERY_CONFLICT_LAST = PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slot synchronization to stop */
 } ProcSignalReason;
 
-#define NUM_PROCSIGNALS (PROCSIG_RECOVERY_CONFLICT_LAST + 1)
+#define NUM_PROCSIGNALS (PROCSIG_SLOTSYNC_MESSAGE + 1)
 
 typedef enum
 {
-- 
2.51.2



^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-08 09:45  Nisha Moond <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 42+ messages in thread

From: Nisha Moond @ 2026-04-08 09:45 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Amit Kapila <[email protected]>; Zhijie Hou (Fujitsu) <[email protected]>; shveta malik <[email protected]>; PostgreSQL Hackers <[email protected]>

On Wed, Apr 8, 2026 at 12:24 PM Fujii Masao <[email protected]> wrote:
>
> On Wed, Apr 8, 2026 at 12:36 PM Fujii Masao <[email protected]> wrote:
> > The backpatch added PROCSIG_SLOTSYNC_MESSAGE in the middle of enum
> > ProcSignalReason, which could break the ABI. I’m planning to move it to
> > the end of the enum in v17 and v18.
> >
> > That seems to work for v18. However, in v17, NUM_PROCSIGNALS is defined
> > as the last enum value:
> >
> >             NUM_PROCSIGNALS /* Must be last! */
> >         } ProcSignalReason;
> >
> > So simply moving PROCSIG_SLOTSYNC_MESSAGE to the end would change the meaning
> > of NUM_PROCSIGNALS.
> >
> > One option might be to remove NUM_PROCSIGNALS from the enum, move
> > PROCSIG_SLOTSYNC_MESSAGE to the end, and define it separately, e.g.
> > #define NUM_PROCSIGNALS (PROCSIG_SLOTSYNC_MESSAGE + 1). Would that
> > be acceptable without breaking the ABI? Thoughts?
>
> The patches I'm planning to apply for v17 and v18 are attached.
>
> For v17, I'm still not entirely sure this change is safe from an ABI
> perspective. Even if it is, abi-compliance-check may still report
> a break since the patch removes NUM_PROCSIGNALS from the enum
> (defines it as separate macro). If so, we may need to update
> .abi-compliance-history to avoid false positives.
>

Regarding the pg17 change, NUM_PROCSIGNALS is not a process signal
reason but simply represents the array size, and its value will also
increase in pg18 (+1) after this backpatch.
AFAIU, the concern is that extensions might rely on the old compiled
values of PROCSIG_*, so we should avoid changing their order. However,
extensions should also not depend on NUM_PROCSIGNALS directly,
otherwise the pg18 backpatch would pose the same ABI concern. So, it
seems safe for pg17 as well.
I also checked core extensions and did not find NUM_PROCSIGNALS being used.

That said, I think both approaches - adding the new entry at the end
and defining NUM_PROCSIGNALS outside as done in the patch or adding it
just before NUM_PROCSIGNALS (like below)  are semantically the same.
….
  PROCSIG_RECOVERY_CONFLICT_LAST = PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
+ PROCSIG_SLOTSYNC_MESSAGE /* ask slot synchronization to stop */
+
NUM_PROCSIGNALS /* Must be last! */
 } ProcSignalReason;

As NUM_PROCSIGNALS increments in both cases, I don’t see any
additional benefit in defining it outside. Thoughts? Happy to be
corrected if I’m missing something.

--
Thanks,
Nisha





^ permalink  raw  reply  [nested|flat] 42+ messages in thread

* Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
@ 2026-04-08 18:09  Fujii Masao <[email protected]>
  parent: Nisha Moond <[email protected]>
  0 siblings, 0 replies; 42+ messages in thread

From: Fujii Masao @ 2026-04-08 18:09 UTC (permalink / raw)
  To: Nisha Moond <[email protected]>; +Cc: Amit Kapila <[email protected]>; Zhijie Hou (Fujitsu) <[email protected]>; shveta malik <[email protected]>; PostgreSQL Hackers <[email protected]>

On Wed, Apr 8, 2026 at 6:45 PM Nisha Moond <[email protected]> wrote:
> Regarding the pg17 change, NUM_PROCSIGNALS is not a process signal
> reason but simply represents the array size, and its value will also
> increase in pg18 (+1) after this backpatch.
> AFAIU, the concern is that extensions might rely on the old compiled
> values of PROCSIG_*, so we should avoid changing their order. However,
> extensions should also not depend on NUM_PROCSIGNALS directly,
> otherwise the pg18 backpatch would pose the same ABI concern. So, it
> seems safe for pg17 as well.
> I also checked core extensions and did not find NUM_PROCSIGNALS being used.

So the question is whether any extensions or third-party code depend on
NUM_PROCSIGNALS. I also couldn't find any such usage, so it seems safe from
an ABI perspective to change its value.


> That said, I think both approaches - adding the new entry at the end
> and defining NUM_PROCSIGNALS outside as done in the patch or adding it
> just before NUM_PROCSIGNALS (like below)  are semantically the same.
> ….
>   PROCSIG_RECOVERY_CONFLICT_LAST = PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,
> + PROCSIG_SLOTSYNC_MESSAGE /* ask slot synchronization to stop */
> +
> NUM_PROCSIGNALS /* Must be last! */
>  } ProcSignalReason;
>
> As NUM_PROCSIGNALS increments in both cases, I don’t see any
> additional benefit in defining it outside. Thoughts?

Yes, you're right. So, in v17, I'll just move PROCSIG_SLOTSYNC_MESSAGE to
just before NUM_PROCSIGNALS.

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 42+ messages in thread


end of thread, other threads:[~2026-04-08 18:09 UTC | newest]

Thread overview: 42+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-03-18 16:05 Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion? Fujii Masao <[email protected]>
2026-03-18 16:23 ` Tom Lane <[email protected]>
2026-03-18 17:10   ` Fujii Masao <[email protected]>
2026-03-21 16:52 ` Amit Kapila <[email protected]>
2026-03-23 05:50   ` Fujii Masao <[email protected]>
2026-03-24 04:01     ` Nisha Moond <[email protected]>
2026-03-24 06:00       ` Fujii Masao <[email protected]>
2026-03-24 09:15         ` Fujii Masao <[email protected]>
2026-03-24 16:51           ` Nisha Moond <[email protected]>
2026-03-25 00:51             ` Fujii Masao <[email protected]>
2026-03-24 09:37     ` Amit Kapila <[email protected]>
2026-03-26 10:10     ` Amit Kapila <[email protected]>
2026-03-26 10:37       ` Nisha Moond <[email protected]>
2026-03-27 03:58         ` shveta malik <[email protected]>
2026-03-27 04:57           ` Nisha Moond <[email protected]>
2026-03-27 06:06             ` Fujii Masao <[email protected]>
2026-03-27 07:49             ` Amit Kapila <[email protected]>
2026-03-27 09:49               ` Amit Kapila <[email protected]>
2026-03-27 12:37                 ` Nisha Moond <[email protected]>
2026-03-27 17:19                   ` Fujii Masao <[email protected]>
2026-03-30 04:18                     ` Nisha Moond <[email protected]>
2026-03-30 10:22                       ` shveta malik <[email protected]>
2026-03-31 06:01                         ` Nisha Moond <[email protected]>
2026-03-31 10:26                           ` Zhijie Hou (Fujitsu) <[email protected]>
2026-04-01 05:33                             ` Nisha Moond <[email protected]>
2026-04-01 09:48                               ` Fujii Masao <[email protected]>
2026-04-01 11:11                                 ` Nisha Moond <[email protected]>
2026-04-02 05:01                                   ` Fujii Masao <[email protected]>
2026-04-02 06:34                                     ` Amit Kapila <[email protected]>
2026-04-02 13:16                                       ` Fujii Masao <[email protected]>
2026-04-07 03:48                                         ` Amit Kapila <[email protected]>
2026-04-08 02:39                                           ` Fujii Masao <[email protected]>
2026-04-08 03:36                                             ` Fujii Masao <[email protected]>
2026-04-08 06:54                                               ` Fujii Masao <[email protected]>
2026-04-08 09:45                                                 ` Nisha Moond <[email protected]>
2026-04-08 18:09                                                   ` Fujii Masao <[email protected]>
2026-03-30 11:09                       ` Fujii Masao <[email protected]>
2026-03-31 06:04                         ` Nisha Moond <[email protected]>
2026-03-31 10:42                           ` shveta malik <[email protected]>
2026-03-31 15:33                             ` Fujii Masao <[email protected]>
2026-04-01 05:03                               ` Nisha Moond <[email protected]>
2026-04-01 05:33                             ` Nisha Moond <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox