public inbox for [email protected]  
help / color / mirror / Atom feed
From: Nisha Moond <[email protected]>
To: Amit Kapila <[email protected]>
Cc: Fujii Masao <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
Date: Thu, 26 Mar 2026 16:07:44 +0530
Message-ID: <CABdArM6nepct0uxizCnZqy-kAjjTOndvu7bWtaNcmknuxx82Hg@mail.gmail.com> (raw)
In-Reply-To: <CAA4eK1LzZGfRANPAnv6NpKCH2ENuZO6HswgY14A=xsOXmucPhw@mail.gmail.com>
References: <CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com>
	<CAA4eK1+CrQNqiPDKv1wYfdkbX0FARJoi1=0ioaAqkLzbq2vG1w@mail.gmail.com>
	<CAHGQGwHABvuCoyM24HUiFZ5oJq_CoFomjt_cqD-0cJLMjFXJjQ@mail.gmail.com>
	<CAA4eK1LzZGfRANPAnv6NpKCH2ENuZO6HswgY14A=xsOXmucPhw@mail.gmail.com>

On Thu, Mar 26, 2026 at 3:40 PM Amit Kapila <[email protected]> wrote:
>
> On Mon, Mar 23, 2026 at 11:21 AM Fujii Masao <[email protected]> wrote:
> >
> > On Sun, Mar 22, 2026 at 1:52 AM Amit Kapila <[email protected]> wrote:
> > >
> > > On Wed, Mar 18, 2026 at 9:35 PM Fujii Masao <[email protected]> wrote:
> > > >
> > > > I noticed that during standby promotion the startup process sends SIGUSR1 to
> > > > the slotsync worker to make it exit. Is there a reason for using SIGUSR1?
> > > >
> > >
> > > IIRC, this same signal is used for both the backend executing
> > > pg_sync_replication_slots() and slotsync worker. We want the worker to
> > > exit and error_out backend. Using SIGTERM for backend could result in
> > > its exit.
> >
> > Why do we want the backend running pg_sync_replication_slots() to throw
> > an error here, rather than just exit? If emitting an error is really required,
> > another option would be to store the process type in SlotSyncCtx and send
> > different signals accordingly, for example, SIGTERM for the slotsync worker
> > and another signal for a backend. But it seems simpler and sufficient to have
> > the backend exit in this case as well.
> >
>
> As we want to retain the existing behavior for API, so instead of
> using two signals, we can achieve what you intend to achieve by one
> signal (SIGUSR1) only. We can use SendProcSignal mechanism as is used
> ParallelWorkerShutdown. On promotion, we send a SIGUSR1 signal to
> slotsync worker/backend via SendProcSignal. Then in
> procsignal_sigusr1_handler(), it will call HandleSlotSyncInterrupt.
> HandleSlotSyncInterrupt() will set the InterruptPending and
> SlotSyncPending flag. Then ProcessInterrupt() will call a slotsync
> specific function based on the flag and do what we currently do in
> ProcessSlotSyncInterrupts. I think this should address the issue you
> are worried about.
>

+1
Retaining the current behavior for the API backend keeps it consistent
with other backends that continue after promotion.

In the reproduced case, the worker (or API backend) is waiting in:
libpqsrv_get_result -> WaitLatchOrSocket -> WaitEventSetWait.
When SIGUSR1 is received, it only sets the latch but does not mark any
interrupt as pending. As a result, CHECK_FOR_INTERRUPTS() is
effectively a no-op, and the process goes back to waiting. So, control
never returns to the slotsync code path, and we cannot rely on
stopSignaled to handle exit/error separately.
Only SIGTERM works here because its handler sets
INTERRUPTS_PENDING_CONDITION, allowing ProcessInterrupts() to run and
break the loop. The other signals like SIGUSR1 or SIGINT do not do
this, so simply using another signal might not solve the API error
handling case.

I’ve implemented the above approach suggested by Amit in the attached
patch and verified it for both worker and API scenarios. With this,
the API can now error-out without exiting the backend.

--
Thanks,
Nisha


Attachments:

  [application/octet-stream] v2-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch (7.2K, 2-v2-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch)
  download | inline diff:
From 2c3a644e303a2c9529efe60ab1d478194cc2f6de Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v2] Prevent slotsync worker/API hang during standby promotion

During standby promotion, ShutDownSlotSync() signals the slotsync worker
to stop and waits for it to finish. If the worker is blocked in WaitLatchOrSocket()
waiting for a response from the primary (e.g., due to a network failure),
the previous SIGUSR1 signal only sets the latch. The worker wakes up, finds no
pending interrupt, and goes back to waiting, causing ShutDownSlotSync() to wait
indefinitely and blocking promotion.

Fix this by introducing a new procsignal reason PROCSIG_SLOTSYNC_MESSAGE.
The signal handler sets the appropriate interrupt flags so that
WaitLatchOrSocket() returns and the worker exits cleanly, allowing promotion
to proceed.
---
 src/backend/replication/logical/slotsync.c | 68 +++++++++++++++++++++-
 src/backend/storage/ipc/procsignal.c       |  4 ++
 src/backend/tcop/postgres.c                |  4 ++
 src/include/replication/slotsync.h         |  7 +++
 src/include/storage/procsignal.h           |  1 +
 5 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..17c29ce3367 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -141,6 +141,13 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ */
+volatile sig_atomic_t SlotSyncShutdown = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1323,6 +1330,44 @@ ProcessSlotSyncInterrupts(void)
 		slotsync_reread_config();
 }
 
+/*
+ * Signal handler called (in signal context) when PROCSIG_SLOTSYNC_MESSAGE
+ * is received.  Sets the SlotSyncShutdown flag so that ProcessInterrupts()
+ * will dispatch to HandleSlotSyncShutdown() at the next safe point.
+ */
+void
+HandleSlotSyncShutdownInterrupt(void)
+{
+	InterruptPending = true;
+	SlotSyncShutdown = true;
+}
+
+/*
+ * Handle a slot-sync shutdown request, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error.
+ */
+void
+HandleSlotSyncShutdown(void)
+{
+	SlotSyncShutdown = false;
+
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
+	}
+	else
+	{
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
+}
+
 /*
  * Connection cleanup function for slotsync worker.
  *
@@ -1427,6 +1472,23 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a new API call) that starts after the old worker was
+	 * stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+			proc_exit(0);
+		else
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1748,11 +1810,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7e017c8d53b..770ecf2ab19 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -24,6 +24,7 @@
 #include "port/pg_bitutils.h"
 #include "replication/logicalctl.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -700,6 +701,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncShutdownInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT))
 		HandleRecoveryConflictInterrupt();
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b3563113219..b8973ec0646 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3576,6 +3577,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdown)
+		HandleSlotSyncShutdown();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e546d0d050d..23471b0161e 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncShutdownInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdown;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncShutdownInterrupt(void);
+extern void HandleSlotSyncShutdown(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 348fba53a93..3a75d500e7c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slotsync worker/API to stop */
 	PROCSIG_RECOVERY_CONFLICT,	/* backend is blocking recovery, check
 								 * PGPROC->pendingRecoveryConflicts for the
 								 * reason */
-- 
2.50.1 (Apple Git-155)



view thread (42+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
  In-Reply-To: <CABdArM6nepct0uxizCnZqy-kAjjTOndvu7bWtaNcmknuxx82Hg@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox