public inbox for [email protected]  
help / color / mirror / Atom feed
From: Nisha Moond <[email protected]>
To: shveta malik <[email protected]>
Cc: Amit Kapila <[email protected]>
Cc: Fujii Masao <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
Date: Fri, 27 Mar 2026 10:27:43 +0530
Message-ID: <CABdArM6Mfhk2+9TVR_D3cgfPWPfHDuZEg7MOc5KqULwt0OcQUg@mail.gmail.com> (raw)
In-Reply-To: <CAJpy0uBhJB4HqouLXegD=miSkZfZp86tTC2K0K6nf=bHCcsPLQ@mail.gmail.com>
References: <CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com>
	<CAA4eK1+CrQNqiPDKv1wYfdkbX0FARJoi1=0ioaAqkLzbq2vG1w@mail.gmail.com>
	<CAHGQGwHABvuCoyM24HUiFZ5oJq_CoFomjt_cqD-0cJLMjFXJjQ@mail.gmail.com>
	<CAA4eK1LzZGfRANPAnv6NpKCH2ENuZO6HswgY14A=xsOXmucPhw@mail.gmail.com>
	<CABdArM6nepct0uxizCnZqy-kAjjTOndvu7bWtaNcmknuxx82Hg@mail.gmail.com>
	<CAJpy0uBhJB4HqouLXegD=miSkZfZp86tTC2K0K6nf=bHCcsPLQ@mail.gmail.com>

On Fri, Mar 27, 2026 at 9:28 AM shveta malik <[email protected]> wrote:
>
> On Thu, Mar 26, 2026 at 4:08 PM Nisha Moond <[email protected]> wrote:
> >
> > On Thu, Mar 26, 2026 at 3:40 PM Amit Kapila <[email protected]> wrote:
> > >
> > > On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <[email protected]> wrote:
> > > >
> > > > On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <[email protected]> wrote:
> > > > >
> > > > > On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <[email protected]> wrote:
> > > > > >
> > > > > > I noticed that during standby promotion the startup process sends SIGUSR1 to
> > > > > > the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> > > > > >
> > > > >
> > > > > IIRC, this same signal is used for both the backend executing
> > > > > pg_sync_replication_slots() and slotsync worker. We want the worker to
> > > > > exit and error_out backend. Using SIGTERM for backend could result in
> > > > > its exit.
> > > >
> > > > Why do we want the backend running pg_sync_replication_slots() to throw
> > > > an error here, rather than just exit? If emitting an error is really required,
> > > > another option would be to store the process type in SlotSyncCtx and send
> > > > different signals accordingly, for example, SIGTERM for the slotsync worker
> > > > and another signal for a backend. But it seems simpler and sufficient to have
> > > > the backend exit in this case as well.
> > > >
> > >
> > > As we want to retain the existing behavior for API, so instead of
> > > using two signals, we can achieve what you intend to achieve by one
> > > signal (SIGUSR1) only. We can use SendProcSignal mechanism as is used
> > > ParallelWorkerShutdown. On promotion, we send a SIGUSR1 signal to
> > > slotsync worker/backend via SendProcSignal. Then in
> > > procsignal_sigusr1_handler(), it will call HandleSlotSyncInterrupt.
> > > HandleSlotSyncInterrupt() will set the InterruptPending and
> > > SlotSyncPending flag. Then ProcessInterrupt() will call a slotsync
> > > specific function based on the flag and do what we currently do in
> > > ProcessSlotSyncInterrupts. I think this should address the issue you
> > > are worried about.
> > >
> >
> > +1
> > Retaining the current behavior for the API backend keeps it consistent
> > with other backends that continue after promotion.
> >
> > In the reproduced case, the worker (or API backend) is waiting in:
> > libpqsrv_get_result -> WaitLatchOrSocket -> WaitEventSetWait.
> > When SIGUSR1 is received, it only sets the latch but does not mark any
> > interrupt as pending. As a result, CHECK_FOR_INTERRUPTS() is
> > effectively a no-op, and the process goes back to waiting. So, control
> > never returns to the slotsync code path, and we cannot rely on
> > stopSignaled to handle exit/error separately.
> > Only SIGTERM works here because its handler sets
> > INTERRUPTS_PENDING_CONDITION, allowing ProcessInterrupts() to run and
> > break the loop. The other signals like SIGUSR1 or SIGINT do not do
> > this, so simply using another signal might not solve the API error
> > handling case.
> >
> > I’ve implemented the above approach suggested by Amit in the attached
> > patch and verified it for both worker and API scenarios. With this,
> > the API can now error-out without exiting the backend.
> >
>
> +1 on the idea. Few comments:
>

Thanks for the review.

> 1)
> It was not clear initially as to why SetLatch is not done in
> HandleSlotSyncShutdownInterrupt(), digging it further revealed that
> procsignal_sigusr1_handler() will do SetLatch outside. Perhaps you can
> add below comment at the end of HandleSlotSyncShutdownInterrupt()
> similar to how other functions (HandleProcSignalBarrierInterrupt,
> HandleRecoveryConflictInterrupt etc) do.
>
> /* latch will be set by procsignal_sigusr1_handler */
>

Fixed.

> 2)
> In ProcessSlotSyncInterrupts(), now we don't need the below logic right?
>
> if (SlotSyncCtx->stopSignaled)
>     {
>         if (AmLogicalSlotSyncWorkerProcess())
>         {
>             ...
>             proc_exit(0);
>         }
>         else
>         {
>             /*
>              * For the backend executing SQL function
>              * pg_sync_replication_slots().
>              */
>             ereport(ERROR,
>                     errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
>                     errmsg("replication slot synchronization will stop
> because promotion is triggered"));
>         }
>     }
>

Right. Attached patch with the suggested changes.

--
Thanks,
Nisha


Attachments:

  [application/octet-stream] v3-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch (8.1K, 2-v3-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch)
  download | inline diff:
From de7f4d06a1db405a7aaf5e0d975a303cd7a81baa Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v3] Prevent slotsync worker/API hang during standby promotion

During standby promotion, ShutDownSlotSync() signals the slotsync worker
to stop and waits for it to finish. If the worker is blocked in WaitLatchOrSocket()
waiting for a response from the primary (e.g., due to a network failure),
the previous SIGUSR1 signal only sets the latch. The worker wakes up, finds no
pending interrupt, and goes back to waiting, causing ShutDownSlotSync() to wait
indefinitely and blocking promotion.

Fix this by introducing a new procsignal reason PROCSIG_SLOTSYNC_MESSAGE.
The signal handler sets the appropriate interrupt flags so that
WaitLatchOrSocket() returns and the worker exits cleanly, allowing promotion
to proceed.
---
 src/backend/replication/logical/slotsync.c | 101 ++++++++++++++++-----
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 ++
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 93 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..ed988eb6dc8 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -141,6 +141,13 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdown = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1298,31 +1305,49 @@ ProcessSlotSyncInterrupts(void)
 {
 	CHECK_FOR_INTERRUPTS();
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
-
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
-	}
-
 	if (ConfigReloadPending)
 		slotsync_reread_config();
 }
 
+/*
+ * Signal handler called (in signal context) when PROCSIG_SLOTSYNC_MESSAGE
+ * is received.  Sets the SlotSyncShutdown flag so that ProcessInterrupts()
+ * will dispatch to HandleSlotSyncShutdown() at the next safe point.
+ */
+void
+HandleSlotSyncShutdownInterrupt(void)
+{
+	InterruptPending = true;
+	SlotSyncShutdown = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
+
+/*
+ * Handle a slot-sync shutdown request, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error.
+ */
+void
+HandleSlotSyncShutdown(void)
+{
+	SlotSyncShutdown = false;
+
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
+	}
+	else
+	{
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
+}
+
 /*
  * Connection cleanup function for slotsync worker.
  *
@@ -1427,6 +1452,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a new API call) that starts after the old worker was
+	 * stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(LOG,
+					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will stop because promotion is triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1748,11 +1801,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7e017c8d53b..770ecf2ab19 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -24,6 +24,7 @@
 #include "port/pg_bitutils.h"
 #include "replication/logicalctl.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -700,6 +701,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncShutdownInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT))
 		HandleRecoveryConflictInterrupt();
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b3563113219..b8973ec0646 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3576,6 +3577,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdown)
+		HandleSlotSyncShutdown();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e546d0d050d..23471b0161e 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncShutdownInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdown;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncShutdownInterrupt(void);
+extern void HandleSlotSyncShutdown(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 348fba53a93..3a75d500e7c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slotsync worker/API to stop */
 	PROCSIG_RECOVERY_CONFLICT,	/* backend is blocking recovery, check
 								 * PGPROC->pendingRecoveryConflicts for the
 								 * reason */
-- 
2.50.1 (Apple Git-155)



view thread (42+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
  In-Reply-To: <CABdArM6Mfhk2+9TVR_D3cgfPWPfHDuZEg7MOc5KqULwt0OcQUg@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox