public inbox for [email protected]  
help / color / mirror / Atom feed
From: Nisha Moond <[email protected]>
To: shveta malik <[email protected]>
Cc: Fujii Masao <[email protected]>
Cc: Amit Kapila <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
Date: Wed, 1 Apr 2026 11:03:27 +0530
Message-ID: <CABdArM5a=PHOjK8esNdKaax1fq36F20Rx68PNvtP=B8uj_b5gg@mail.gmail.com> (raw)
In-Reply-To: <CAJpy0uDaR4LKy2T=vLWuCnY8nQ=m7Zde_sr44aoYr7T0jodV2Q@mail.gmail.com>
References: <CAHGQGwFzNYroAxSoyJhqTU-pH=t4Ej6RyvhVmBZ91Exj_TPMMQ@mail.gmail.com>
	<CAA4eK1+CrQNqiPDKv1wYfdkbX0FARJoi1=0ioaAqkLzbq2vG1w@mail.gmail.com>
	<CAHGQGwHABvuCoyM24HUiFZ5oJq_CoFomjt_cqD-0cJLMjFXJjQ@mail.gmail.com>
	<CAA4eK1LzZGfRANPAnv6NpKCH2ENuZO6HswgY14A=xsOXmucPhw@mail.gmail.com>
	<CABdArM6nepct0uxizCnZqy-kAjjTOndvu7bWtaNcmknuxx82Hg@mail.gmail.com>
	<CAJpy0uBhJB4HqouLXegD=miSkZfZp86tTC2K0K6nf=bHCcsPLQ@mail.gmail.com>
	<CABdArM6Mfhk2+9TVR_D3cgfPWPfHDuZEg7MOc5KqULwt0OcQUg@mail.gmail.com>
	<CAA4eK1+d2vN80-Yvy_Hr=ATF3XL5db+_W-sXF=2Vxm+OFBO82w@mail.gmail.com>
	<CAA4eK1KxU4b53GUor41A55x+Bx-DdOaQ9g1DqURyRY2Cg-hhPQ@mail.gmail.com>
	<CABdArM6pmn5yFqiU33KTYBXYM=Vny2ULnJY_gqFbsMEdt+1dPA@mail.gmail.com>
	<CAHGQGwEsEjXbCthXb02=HScBW7C=BHmygv9SK1VGCcc-9bmTsw@mail.gmail.com>
	<CABdArM5rrhSmFvVL4C5LL0iea-R0HRtB=ZvD=ereoTDa1Tm=NA@mail.gmail.com>
	<CAHGQGwFaLPMWTHaPtxTpRb3_=d4o+SLu6+89BNEzALdEnvSWmQ@mail.gmail.com>
	<CABdArM7fPqd9GSXLLyDZfX_bZkAaoJAVDGKSQULfvEvVZZHgsg@mail.gmail.com>
	<CAJpy0uDaR4LKy2T=vLWuCnY8nQ=m7Zde_sr44aoYr7T0jodV2Q@mail.gmail.com>

On Tue, Mar 31, 2026 at 4:12 PM shveta malik <[email protected]> wrote:
>
> On Tue, Mar 31, 2026 at 11:35 AM Nisha Moond <[email protected]> wrote:
> >
> > On Mon, Mar 30, 2026 at 4:39 PM Fujii Masao <[email protected]> wrote:
> > >
> > > On Mon, Mar 30, 2026 at 1:18 PM Nisha Moond <[email protected]> wrote:
> > > > We were using the same log message in two places:
> > > > check_and_set_sync_info() and HandleSlotSyncMessage().
> > > > I think “will not start” fits better in the first case, while “will
> > > > stop” makes sense to keep in the second.
> > >
> > > Thanks for updating the patch!
> > >
> > > With the patch, in my testing, standby promotion always produces
> > > the following logs:
> > >
> > >     LOG:  replication slot synchronization worker will stop because
> > > promotion is triggered
> > >     LOG:  replication slot synchronization worker will not start
> > > because promotion was triggered
> > >
> > > It looks like the postmaster immediately restarts the slotsync worker after
> > > promotion terminates it, and that new worker then exits on seeing
> > > SlotSyncCtx->stopSignaled.
> > >
> > > IMO, always emitting both messages is a bit confusing. It would be nice to
> > > suppress the second one if possible.
> > >
> > > One idea would be to prevent the restart altogether. For example,
> > > ProcessSlotSyncMessage() could set SlotSyncCtx->last_start_time to
> > > a special value (like -1), and SlotSyncWorkerCanRestart() could return
> > > false (i.e., prevent postmater from starting up slotsync worker) when
> > > it sees that. Alternatively, SlotSyncWorkerCanRestart() could simply
> > > check SlotSyncCtx->stopSignaled.
> > >
> > > That said, as far as I remember correctly, postmaster is generally not
> > > supposed to touch shared memory (per the comments in postmaster.c),
> > > so I'm not sure this approach is acceptable. On the other hand,
> > > postmaster and the slotsync worker already rely on SlotSyncCtx->last_start_time,
> > > so perhaps there's some precedent here.
> > >
> > IIUC, checking SlotSyncCtx->stopSignaled in SlotSyncWorkerCanRestart()
> > may not be ideal, as it requires a spinlock to avoid races with the
> > startup process and it is disallowed to take lock in postmaster main
> > loop. Whereas, SlotSyncCtx->last_start_time doesn’t need a lock since
> > the postmaster accesses it only when the worker is not alive.
> >
>
> I agree.
>
> > Another option could be to log in check_and_set_sync_info() at DEBUG1
> > instead of LOG level. This message appears only after stopSignaled is
> > set, when promotion is already in progress and the first worker has
> > logged “will stop…”. The second worker doesn’t do any real work. Since
> > there’s nothing actionable for users, using DEBUG1 would keep it
> > useful for debugging (e.g., noticing immediate restarts) while
> > avoiding extra log noise. Thoughts?
> >
>
> +1.
>
> Do you think we can slightly tweak the comment in atop file to:
>
> On promotion the startup process sets 'stopSignaled' and uses this
> 'pid' to signal synchronizing process with PROCSIG_SLOTSYNC_MESSAGE
> and also to wake it up so that the process can immediately stop its
> synchronizing work. Setting 'stopSignaled' on the other hand is  used
> to handle the race condition....
>

Done.

> Also shall we quick exit ProcessSlotSyncMessage() if syncing is
> already finished by API?
>

Make sense. Fixed.

Please find the updated patch (v7) attached.

--
Thanks,
Nisha


Attachments:

  [application/octet-stream] v7-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch (12.0K, 2-v7-0001-Prevent-slotsync-worker-API-hang-during-standby-p.patch)
  download | inline diff:
From 49892f6a306934f2148f54c3bcf1ecfd722303ae Mon Sep 17 00:00:00 2001
From: Nisha Moond <[email protected]>
Date: Wed, 25 Mar 2026 18:04:12 +0530
Subject: [PATCH v7] Prevent slotsync worker/API hang during standby promotion

During standby promotion, ShutDownSlotSync() signals the slotsync worker
to stop and waits for it to finish. If the worker is blocked in
WaitLatchOrSocket() waiting for a response from the primary (e.g., due
to a network failure), the previous SIGUSR1 signal only sets the latch.
The worker wakes up, finds no pending interrupt, and goes back to
waiting, causing ShutDownSlotSync() to wait indefinitely and blocking
promotion.

Fix this by introducing a new procsignal reason PROCSIG_SLOTSYNC_MESSAGE.
The signal handler sets the appropriate interrupt flags so that
WaitLatchOrSocket() returns and the worker exits cleanly, allowing
promotion to proceed.
---
 src/backend/replication/logical/slotsync.c | 152 +++++++++++++++------
 src/backend/storage/ipc/procsignal.c       |   4 +
 src/backend/tcop/postgres.c                |   4 +
 src/include/replication/slotsync.h         |   7 +
 src/include/storage/procsignal.h           |   1 +
 5 files changed, 124 insertions(+), 44 deletions(-)

diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index e75db69e3f6..c070c869e5d 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -85,18 +85,19 @@
  * Struct for sharing information to control slot synchronization.
  *
  * The 'pid' is either the slot sync worker's pid or the backend's pid running
- * the SQL function pg_sync_replication_slots(). When the startup process sets
- * 'stopSignaled' during promotion, it uses this 'pid' to wake up the currently
- * synchronizing process so that the process can immediately stop its
- * synchronizing work on seeing 'stopSignaled' set.
- * Setting 'stopSignaled' is also used to handle the race condition when the
- * postmaster has not noticed the promotion yet and thus may end up restarting
- * the slot sync worker. If 'stopSignaled' is set, the worker will exit in such a
- * case. The SQL function pg_sync_replication_slots() will also error out if
- * this flag is set. Note that we don't need to reset this variable as after
- * promotion the slot sync worker won't be restarted because the pmState
- * changes to PM_RUN from PM_HOT_STANDBY and we don't support demoting
- * primary without restarting the server. See LaunchMissingBackgroundProcesses.
+ * the SQL function pg_sync_replication_slots(). On promotion, the startup
+ * process sets 'stopSignaled' and uses this 'pid' to signal the synchronizing
+ * process with PROCSIG_SLOTSYNC_MESSAGE and also to wake it up so that the
+ * process can immediately stop its synchronizing work.
+ * Setting 'stopSignaled' on the other hand is used to handle the race
+ * condition when the postmaster has not noticed the promotion yet and thus may
+ * end up restarting the slot sync worker. If 'stopSignaled' is set, the worker
+ * will exit in such a case. The SQL function pg_sync_replication_slots() will
+ * also error out if this flag is set. Note that we don't need to reset this
+ * variable as after promotion the slot sync worker won't be restarted because
+ * the pmState changes to PM_RUN from PM_HOT_STANDBY and we don't support
+ * demoting primary without restarting the server.
+ * See LaunchMissingBackgroundProcesses.
  *
  * The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot
  * overwrites.
@@ -141,6 +142,22 @@ static long sleep_ms = MIN_SLOTSYNC_WORKER_NAPTIME_MS;
  */
 static bool syncing_slots = false;
 
+/*
+ * Interrupt flag set when PROCSIG_SLOTSYNC_MESSAGE is received, asking the
+ * slotsync worker or pg_sync_replication_slots() to stop because
+ * standby promotion has been triggered.
+ *
+ * We cannot rely solely on 'stopSignaled' here because:
+ * 1) It resides in shared memory and is visible to all processes, so checking
+ *    it directly in ProcessInterrupts() would require additional checks to
+ *    ensure only the synchronizing process acts on it.
+ * 2) It has different lifetime semantics and cannot be reset after handling,
+ *    as it also guards against postmaster and promotion race conditions.
+ * 3) Accessing it requires acquiring a spinlock, which can be too expensive
+ *    or undesirable for every ProcessInterrupts() call.
+ */
+volatile sig_atomic_t SlotSyncShutdownPending = false;
+
 /*
  * Structure to hold information fetched from the primary server about a logical
  * replication slot.
@@ -1291,36 +1308,52 @@ slotsync_reread_config(void)
 }
 
 /*
- * Interrupt handler for process performing slot synchronization.
+ * Handle receipt of an interrupt indicating a slotsync shutdown message.
+ *
+ * This is called within the SIGUSR1 handler.  All we do here is set a flag
+ * that will cause the next CHECK_FOR_INTERRUPTS() to invoke
+ * ProcessSlotSyncMessage().
  */
-static void
-ProcessSlotSyncInterrupts(void)
+void
+HandleSlotSyncMessageInterrupt(void)
 {
-	CHECK_FOR_INTERRUPTS();
+	InterruptPending = true;
+	SlotSyncShutdownPending = true;
+	/* latch will be set by procsignal_sigusr1_handler */
+}
 
-	if (SlotSyncCtx->stopSignaled)
-	{
-		if (AmLogicalSlotSyncWorkerProcess())
-		{
-			ereport(LOG,
-					errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+/*
+ * Handle a PROCSIG_SLOTSYNC_MESSAGE signal, called from ProcessInterrupts().
+ *
+ * If the current process is the slotsync background worker, log a message
+ * and exit cleanly.  If it is a backend executing pg_sync_replication_slots(),
+ * raise an error, unless the sync has already finished, in which case there
+ * is no need to interrupt the caller.
+ */
+void
+ProcessSlotSyncMessage(void)
+{
+	SlotSyncShutdownPending = false;
 
-			proc_exit(0);
-		}
-		else
-		{
-			/*
-			 * For the backend executing SQL function
-			 * pg_sync_replication_slots().
-			 */
-			ereport(ERROR,
-					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					errmsg("replication slot synchronization will stop because promotion is triggered"));
-		}
+	if (AmLogicalSlotSyncWorkerProcess())
+	{
+		ereport(LOG,
+				errmsg("replication slot synchronization worker will stop because promotion is triggered"));
+		proc_exit(0);
 	}
+	else
+	{
+		/*
+		 * If sync has already completed, there is no need to interrupt the
+		 * caller with an error.
+		 */
+		if (!IsSyncingReplicationSlots())
+			return;
 
-	if (ConfigReloadPending)
-		slotsync_reread_config();
+		ereport(ERROR,
+				errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				errmsg("replication slot synchronization will stop because promotion is triggered"));
+	}
 }
 
 /*
@@ -1427,6 +1460,34 @@ check_and_set_sync_info(pid_t sync_process_pid)
 {
 	SpinLockAcquire(&SlotSyncCtx->mutex);
 
+	/*
+	 * Exit immediately if promotion has been triggered.  This guards against
+	 * a new worker (or a new API call) that starts after the old worker was
+	 * stopped by ShutDownSlotSync().
+	 */
+	if (SlotSyncCtx->stopSignaled)
+	{
+		SpinLockRelease(&SlotSyncCtx->mutex);
+
+		if (AmLogicalSlotSyncWorkerProcess())
+		{
+			ereport(DEBUG1,
+					errmsg("replication slot synchronization worker will not start because promotion was triggered"));
+
+			proc_exit(0);
+		}
+		else
+		{
+			/*
+			 * For the backend executing SQL function
+			 * pg_sync_replication_slots().
+			 */
+			ereport(ERROR,
+					errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					errmsg("replication slot synchronization will not start because promotion was triggered"));
+		}
+	}
+
 	if (SlotSyncCtx->syncing)
 	{
 		SpinLockRelease(&SlotSyncCtx->mutex);
@@ -1635,7 +1696,10 @@ ReplSlotSyncWorkerMain(const void *startup_data, size_t startup_data_len)
 		bool		started_tx = false;
 		List	   *remote_slots;
 
-		ProcessSlotSyncInterrupts();
+		CHECK_FOR_INTERRUPTS();
+
+		if (ConfigReloadPending)
+			slotsync_reread_config();
 
 		/*
 		 * The syscache access in fetch_remote_slots() needs a transaction
@@ -1748,11 +1812,11 @@ ShutDownSlotSync(void)
 	SpinLockRelease(&SlotSyncCtx->mutex);
 
 	/*
-	 * Signal process doing slotsync, if any. The process will stop upon
-	 * detecting that the stopSignaled flag is set to true.
+	 * Signal process doing slotsync, if any, asking it to stop.
 	 */
 	if (sync_process_pid != InvalidPid)
-		kill(sync_process_pid, SIGUSR1);
+		SendProcSignal(sync_process_pid, PROCSIG_SLOTSYNC_MESSAGE,
+					   INVALID_PROC_NUMBER);
 
 	/* Wait for slot sync to end */
 	for (;;)
@@ -1931,9 +1995,6 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 
 		check_and_set_sync_info(MyProcPid);
 
-		/* Check for interrupts and config changes */
-		ProcessSlotSyncInterrupts();
-
 		validate_remote_info(wrconn);
 
 		/* Retry until all the slots are sync-ready */
@@ -1943,7 +2004,10 @@ SyncReplicationSlots(WalReceiverConn *wrconn)
 			bool		some_slot_updated = false;
 
 			/* Check for interrupts and config changes */
-			ProcessSlotSyncInterrupts();
+			CHECK_FOR_INTERRUPTS();
+
+			if (ConfigReloadPending)
+				slotsync_reread_config();
 
 			/* We must be in a valid transaction state */
 			Assert(IsTransactionState());
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 7e017c8d53b..99792b13760 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -24,6 +24,7 @@
 #include "port/pg_bitutils.h"
 #include "replication/logicalctl.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/walsender.h"
 #include "storage/condition_variable.h"
 #include "storage/ipc.h"
@@ -700,6 +701,9 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_PARALLEL_APPLY_MESSAGE))
 		HandleParallelApplyMessageInterrupt();
 
+	if (CheckProcSignal(PROCSIG_SLOTSYNC_MESSAGE))
+		HandleSlotSyncMessageInterrupt();
+
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT))
 		HandleRecoveryConflictInterrupt();
 
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index 10be60011ad..b3cf53b509f 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -58,6 +58,7 @@
 #include "postmaster/postmaster.h"
 #include "replication/logicallauncher.h"
 #include "replication/logicalworker.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
 #include "replication/walsender.h"
 #include "rewrite/rewriteHandler.h"
@@ -3576,6 +3577,9 @@ ProcessInterrupts(void)
 
 	if (ParallelApplyMessagePending)
 		ProcessParallelApplyMessages();
+
+	if (SlotSyncShutdownPending)
+		ProcessSlotSyncMessage();
 }
 
 /*
diff --git a/src/include/replication/slotsync.h b/src/include/replication/slotsync.h
index e546d0d050d..35835087128 100644
--- a/src/include/replication/slotsync.h
+++ b/src/include/replication/slotsync.h
@@ -12,10 +12,15 @@
 #ifndef SLOTSYNC_H
 #define SLOTSYNC_H
 
+#include <signal.h>
+
 #include "replication/walreceiver.h"
 
 extern PGDLLIMPORT bool sync_replication_slots;
 
+/* Interrupt flag set by HandleSlotSyncMessageInterrupt() */
+extern PGDLLIMPORT volatile sig_atomic_t SlotSyncShutdownPending;
+
 /*
  * GUCs needed by slot sync worker to connect to the primary
  * server and carry on with slots synchronization.
@@ -34,5 +39,7 @@ extern bool IsSyncingReplicationSlots(void);
 extern Size SlotSyncShmemSize(void);
 extern void SlotSyncShmemInit(void);
 extern void SyncReplicationSlots(WalReceiverConn *wrconn);
+extern void HandleSlotSyncMessageInterrupt(void);
+extern void ProcessSlotSyncMessage(void);
 
 #endif							/* SLOTSYNC_H */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 348fba53a93..3a75d500e7c 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_SLOTSYNC_MESSAGE,	/* ask slotsync worker/API to stop */
 	PROCSIG_RECOVERY_CONFLICT,	/* backend is blocking recovery, check
 								 * PGPROC->pendingRecoveryConflicts for the
 								 * reason */
-- 
2.50.1 (Apple Git-155)



view thread (42+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Use SIGTERM instead of SIGUSR1 for slotsync worker to exit during promotion?
  In-Reply-To: <CABdArM5a=PHOjK8esNdKaax1fq36F20Rx68PNvtP=B8uj_b5gg@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox