public inbox for [email protected]
Startup process deadlock: WaitForProcSignalBarriers vs aux process
13+ messages / 4 participants
* Startup process deadlock: WaitForProcSignalBarriers vs aux process
@ 2026-04-22 11:21 Matthias van de Meent <[email protected]>
2026-04-22 19:05 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Andres Freund <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Matthias van de Meent @ 2026-04-22 11:21 UTC (permalink / raw)
To: PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>; Masahiko Sawada <[email protected]>
Hi,
Over in the Hackers Discord, Melany pointed out [0] a random failure
of tests on the master branch, which seemed to have nothing to do with
the commit they failed on.
The logs [1] indicate that the startup process was waiting for another
process to process a signal barrier. While there isn't enough
information available to conclusively pin the blame on any specific
component, I think I have a good understanding of what happened:
>> 2026-04-21 15:10:50.065 UTC startup[19246] LOG: still waiting for backend with PID 19244 to accept ProcSignalBarrier
Here, the startup process is waiting for process with PID 19244 to
handle a signal barrier. It is not entirely clear which process it's
waiting on, but we can deduce this:
In the startup sequence, the postmaster creates these child processes,
in short order:
1. checkpointer
2. bgwriter
3. startup
It is therefore likely that the startup process' PID is just two
larger than that of the checkpointer; and therefore, it's likely the
startup process is waiting for the checkpointer process.
# Which code in the Startup process is waiting?
I think it's this: The startup process logged that it started from a
clean shutdown, so no recovery code should be executed. This excludes
most possible call sites of WaitForProcSignalBarrier, except this
one: The startup process calls StartupXLOG ->
UpdateLogicalDecodingStatusEndOfRecovery(), which then calls
    if (IsUnderPostmaster)
        WaitForProcSignalBarrier(
            EmitProcSignalBarrier(
                PROCSIGNAL_BARRIER_UPDATE_XLOG_LOGICAL_INFO
            ));
# Why doesn't the Checkpointer process acknowledge the ProcSignalBarrier?
If the PSB is emitted (and signaled to checkpointer) before the
checkpointer has registered its SIGUSR1 handler, then the checkpointer
won't receive the notice to check its procsignal slots, it won't
notice the updated procsignal flags, and it won't process the PSB; not
until it receives a new SIGUSR1.
Signals are sent to all processes that have their procsignal pss_pid
set, which is true for every process which has called ProcSignalInit,
which for the checkpointer (like other aux processes) happens in
AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
processes) calls AuxiliaryProcessMainCommon before registering its
signal handlers, creating a small window in time where signals are
sent, but not handled.
# Is this new?
The issue of registering signal handlers only after opening the
process up to receiving signals has existed for a long time (unchanged
since at least 2022); only the ProcSignalBarrier in the startup
process is new: UpdateLogicalDecodingStatusEndOfRecovery was added
with Sawada-san's 67c20979.
# A solution?
I don't have one right now.
I was thinking in the direction of having a compile-time aux process
signal handlers array per process type, which is read by
AuxiliaryProcessMainCommon() to register the signal handlers ahead of
ProcSignalInit(), but I've not yet looked at the exact implications,
nor analyzed whether that's actually safe. It would move some
duplicative code patterns into compile-time structs, but that's not
necessarily a universal good.
Kind regards,
Matthias van de Meent
[0] https://discord.com/channels/1258108670710124574/1346208113132568646/1496179622591598592
[1] https://api.cirrus-ci.com/v1/artifact/task/6239099197063168/log/contrib/auto_explain/log/postmaster....
* Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
2026-04-22 11:21 Startup process deadlock: WaitForProcSignalBarriers vs aux process Matthias van de Meent <[email protected]>
@ 2026-04-22 19:05 ` Andres Freund <[email protected]>
2026-04-24 17:52 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Andres Freund @ 2026-04-22 19:05 UTC (permalink / raw)
To: Matthias van de Meent <[email protected]>; Thomas Munro <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>; Masahiko Sawada <[email protected]>
Hi,
On 2026-04-22 13:21:02 +0200, Matthias van de Meent wrote:
> If the PSB is emitted (and signaled to checkpointer) before the
> checkpointer has registered its SIGUSR1 handler, then the checkpointer
> won't receive the notice to check its procsignal slots, it won't
> notice the updated procsignal flags, and it won't process the PSB; not
> until it receives a new SIGUSR1.
>
> Signals are sent to all processes that have their procsignal pss_pid
> set, which is true for every process which has called ProcSignalInit,
> which for the checkpointer (like other aux processes) happens in
> AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
> processes) calls AuxiliaryProcessMainCommon before registering its
> signal handlers, creating a small window in time where signals are
> sent, but not handled.
Hm. Have we confirmed this happens?
CheckpointerMain() is called with all signals masked, so it should be ok for
the signal handler to only be set up after AuxiliaryProcessMainCommon(), as
long as it happens before
    /*
     * Unblock signals (they were blocked when the postmaster forked us)
     */
    sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
as the signal delivery should be held until after unblocking signals.
> # A solution?
>
> I don't have one right now.
> I was thinking in the direction of having a compile-time aux process
> signal handlers array per process type, which is read by
> AuxiliaryProcessMainCommon() to register the signal handlers ahead of
> ProcSignalInit(), but I've not yet looked at the exact implications,
> nor analyzed whether that's actually safe. It would move some
> duplicative code patterns into compile-time structs, but that's not
> necessarily a universal good.
We really should move setup of most signal handlers into
AuxiliaryProcessMainCommon(). While there are some special cases (like
checkpointer not wanting to handle SIGTERM), that can be configured after
AuxiliaryProcessMainCommon(), as signals will still be blocked.
Greetings,
Andres Freund
* Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
2026-04-22 11:21 Startup process deadlock: WaitForProcSignalBarriers vs aux process Matthias van de Meent <[email protected]>
2026-04-22 19:05 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Andres Freund <[email protected]>
@ 2026-04-24 17:52 ` Masahiko Sawada <[email protected]>
2026-04-27 18:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Masahiko Sawada @ 2026-04-24 17:52 UTC (permalink / raw)
To: Andres Freund <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>
On Wed, Apr 22, 2026 at 12:05 PM Andres Freund <[email protected]> wrote:
>
> Hi,
>
> On 2026-04-22 13:21:02 +0200, Matthias van de Meent wrote:
> > If the PSB is emitted (and signaled to checkpointer) before the
> > checkpointer has registered its SIGUSR1 handler, then the checkpointer
> > won't receive the notice to check its procsignal slots, it won't
> > notice the updated procsignal flags, and it won't process the PSB; not
> > until it receives a new SIGUSR1.
> >
> > Signals are sent to all processes that have their procsignal pss_pid
> > set, which is true for every process which has called ProcSignalInit,
> > which for the checkpointer (like other aux processes) happens in
> > AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
> > processes) calls AuxiliaryProcessMainCommon before registering its
> > signal handlers, creating a small window in time where signals are
> > sent, but not handled.
>
> Hm. Have we confirmed this happens?
>
> CheckpointerMain() is called with all signals masked, so it should be ok for
> the signal handler to only be set up after AuxiliaryProcessMainCommon(), as
> long as it happens before
>
> /*
> * Unblock signals (they were blocked when the postmaster forked us)
> */
> sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
>
> as the signal delivery should be held until after unblocking signals.
Right. The postmaster blocks all signals before starting a child
process, as the following comment explains:
    /*
     * We start postmaster children with signals blocked. This allows them to
     * install their own handlers before unblocking, to avoid races where they
     * might run the postmaster's handler and miss an important control
     * signal. With more analysis this could potentially be relaxed.
     */
    sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
Investigating the issue, I found there is a race condition between the
procsignal initialization and emitting signal barrier that could be
the cause of this issue. Imagine the following scenario:
1. In ProcSignalInit(), the checkpointer initializes its
slot->pss_barrierGeneration with the global generation.
2. In EmitProcSignalBarrier(), the startup process checks the
checkpointer's procsignal slot but skips emitting the signal because
slot->pss_pid is still 0. This can happen even though the checkpointer
holds a spinlock on its slot during initialization, because the first
pid check is done without acquiring the spinlock.
3. The checkpointer stores its pid in slot->pss_pid and releases the spinlock.
4. In WaitForProcSignalBarrier(), the startup process checks the
checkpointer's procsignal slot, which already has its
pss_barrierGeneration initialized, and waits for it to be updated.
However, the checkpointer never updates its barrier generation because
it never gets the signal.
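The four steps above can be replayed deterministically in a toy model. This is
only a sketch under stated assumptions: ToySlot, toy_emit_barrier() and the
other names are hypothetical stand-ins rather than the real procsignal.c
structures, the spinlock and the real signal delivery are left out, and an
unused slot is modelled as holding UINT64_MAX so that a waiter ignores it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a procsignal slot; not the real struct. */
typedef struct ToySlot
{
    uint32_t    pid;            /* 0 until the process publishes itself */
    uint64_t    generation;     /* last barrier generation absorbed */
    bool        signalled;      /* set when the emitter signals this slot */
} ToySlot;

static uint64_t toy_global_generation = 7;  /* arbitrary starting value */

/* Emitter: bump the global generation, signal only published slots. */
static uint64_t
toy_emit_barrier(ToySlot *slot)
{
    uint64_t    generation = ++toy_global_generation;

    if (slot->pid != 0)
        slot->signalled = true;
    return generation;
}

/* Target: the barrier is absorbed only in response to a signal. */
static void
toy_process_signal(ToySlot *slot)
{
    if (slot->signalled)
    {
        slot->generation = toy_global_generation;
        slot->signalled = false;
    }
}

/* Replays steps 1-4; returns true if the waiter would wait forever. */
static bool
waiter_would_hang_after_race(void)
{
    ToySlot     slot = {0, UINT64_MAX, false};
    uint64_t    awaited;

    /* 1. The joining process copies the global generation into its slot. */
    slot.generation = toy_global_generation;

    /* 2. The emitter runs: pid is still 0, so no signal is sent. */
    awaited = toy_emit_barrier(&slot);
    assert(!slot.signalled);

    /* 3. The joining process publishes its pid (value from the log above). */
    slot.pid = 19244;

    /* The joining process handles whatever signal it got: none. */
    toy_process_signal(&slot);

    /* 4. The waiter sees a slot that is one generation behind. */
    return slot.generation < awaited;
}
```

In this model waiter_would_hang_after_race() returns true: the slot is stuck
one generation behind and nothing is left that would ever advance it.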
Another, similar issue I found is that child processes could miss
the PROCSIGNAL_BARRIER_UPDATE_XLOG_LOGICAL_INFO signal during
initialization and end up in an inconsistent state, because
InitializeProcessXLogLogicalInfo() is called (in BaseInit()) before
ProcSignalInit(). If the startup process emits the signal to a process
that is between these two steps, that process would not reflect the
latest XLogLogicalInfo state. I think we should move
InitializeProcessXLogLogicalInfo() after ProcSignalInit(), as we
already do for InitLocalDataChecksumState().
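Why the order of the two initialization steps matters can be shown with a
deliberately simplified model. The names below (toy_shared_state,
toy_change_state(), cache_consistent_after_init()) are hypothetical and only
illustrate the ordering argument; the barrier is reduced to a direct cache
refresh for registered processes.

```c
#include <assert.h>
#include <stdbool.h>

static bool toy_shared_state;   /* value kept in shared memory */
static bool toy_registered;     /* has the process joined the barrier mechanism? */
static bool toy_local_cache;    /* process-local copy of the shared value */

/* Emitter: change the shared value, then notify registered processes only. */
static void
toy_change_state(bool new_value)
{
    toy_shared_state = new_value;
    if (toy_registered)
        toy_local_cache = toy_shared_state; /* barrier handler refreshes the cache */
}

/*
 * Lets a state change land between the two initialization steps and
 * returns true if the local cache matches shared state afterwards.
 */
static bool
cache_consistent_after_init(bool register_first)
{
    toy_shared_state = false;
    toy_registered = false;
    toy_local_cache = false;

    if (register_first)
    {
        toy_registered = true;              /* ProcSignalInit() analogue */
        toy_change_state(true);             /* change lands between the steps */
        toy_local_cache = toy_shared_state; /* cache init reads the new value */
    }
    else
    {
        toy_local_cache = toy_shared_state; /* cache init reads the old value */
        toy_change_state(true);             /* not registered: change is missed */
        toy_registered = true;
    }

    assert(toy_registered);
    return toy_local_cache == toy_shared_state;
}
```

With register_first set the cache ends up current. With the cache initialized
first it stays stale, which is the inconsistent state described above.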
I've attached the patch for fixing the latter problem as the fix is
straightforward.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
[text/x-patch] 0001-Fix-race-condition-in-XLogLogicalInfo-and-ProcSignal.patch (4.3K, 2-0001-Fix-race-condition-in-XLogLogicalInfo-and-ProcSignal.patch)
From 01370879bb0fe4065d06d92efb01582d1b1df996 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Fri, 24 Apr 2026 10:36:55 -0700
Subject: [PATCH] Fix race condition in XLogLogicalInfo and ProcSignal
initialization
Previously, InitializeProcessXLogLogicalInfo() was called before
ProcSignalInit(). This created a window where a process could miss a
signal barrier if it was issued between these two calls. As a result,
the process could fail to update its local XLogLogicalInfo cache,
leading to an inconsistent logical decoding state.
This commit fixes this by moving InitializeProcessXLogLogicalInfo()
after ProcSignalInit(). This ensures that the process is registered to
participate in signal barriers before its state is initialized,
preventing it from missing any state change propagated during the
startup sequence.
Discussion: https://postgr.es/m/
---
src/backend/postmaster/auxprocess.c | 17 ++++++++++++-----
src/backend/utils/init/postinit.c | 20 ++++++++++++--------
2 files changed, 24 insertions(+), 13 deletions(-)
diff --git a/src/backend/postmaster/auxprocess.c b/src/backend/postmaster/auxprocess.c
index 8fdc518b3a1..01cced61492 100644
--- a/src/backend/postmaster/auxprocess.c
+++ b/src/backend/postmaster/auxprocess.c
@@ -71,12 +71,16 @@ AuxiliaryProcessMainCommon(void)
ProcSignalInit(NULL, 0);
/*
- * Initialize a local cache of the data_checksum_version, to be updated by
- * the procsignal-based barriers.
+ * Initialize local states, to be updated by the procsignal-based
+ * barriers.
*
- * This intentionally happens after initializing the procsignal, otherwise
- * we might miss a state change. This means we can get a barrier for the
- * state we've just initialized - but it can happen only once.
+ * These initializations intentionally happen after initializing the
+ * procsignal, otherwise we might miss a state change. This means we can
+ * get a barrier for the state we've just initialized.
+ */
+
+ /*
+ * Initialize a local cache of the data_checksum_version.
*
* The postmaster (which is what gets forked into the new child process)
* does not handle barriers, therefore it may not have the current value
@@ -88,6 +92,9 @@ AuxiliaryProcessMainCommon(void)
*/
InitLocalDataChecksumState();
+ /* Initialize logical info WAL logging state */
+ InitializeProcessXLogLogicalInfo();
+
/*
* Auxiliary processes don't run transactions, but they may need a
* resource owner anyway to manage buffer pins acquired outside
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 6f074013aa9..96b06e444ec 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -662,9 +662,6 @@ BaseInit(void)
/* Initialize lock manager's local structs */
InitLockManagerAccess();
- /* Initialize logical info WAL logging state */
- InitializeProcessXLogLogicalInfo();
-
/*
* Initialize replication slots after pgstat. The exit hook might need to
* drop ephemeral slots, which in turn triggers stats reporting.
@@ -759,12 +756,16 @@ InitPostgres(const char *in_dbname, Oid dboid,
ProcSignalInit(MyCancelKey, MyCancelKeyLength);
/*
- * Initialize a local cache of the data_checksum_version, to be updated by
- * the procsignal-based barriers.
+ * Initialize local states, to be updated by the procsignal-based
+ * barriers.
*
- * This intentionally happens after initializing the procsignal, otherwise
- * we might miss a state change. This means we can get a barrier for the
- * state we've just initialized.
+ * These initializations intentionally happen after initializing the
+ * procsignal, otherwise we might miss a state change. This means we can
+ * get a barrier for the state we've just initialized.
+ */
+
+ /*
+ * Initialize a local cache of the data_checksum_version.
*
* The postmaster (which is what gets forked into the new child process)
* does not handle barriers, therefore it may not have the current value
@@ -776,6 +777,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
*/
InitLocalDataChecksumState();
+ /* Initialize logical info WAL logging state */
+ InitializeProcessXLogLogicalInfo();
+
/*
* Also set up timeout handlers needed for backend operation. We need
* these in every case except bootstrap.
--
2.53.0
* Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
2026-04-22 11:21 Startup process deadlock: WaitForProcSignalBarriers vs aux process Matthias van de Meent <[email protected]>
2026-04-22 19:05 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Andres Freund <[email protected]>
2026-04-24 17:52 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
@ 2026-04-27 18:00 ` Alexander Lakhin <[email protected]>
2026-04-28 19:27 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Alexander Lakhin @ 2026-04-27 18:00 UTC (permalink / raw)
To: Masahiko Sawada <[email protected]>; Andres Freund <[email protected]>; +Cc: Matthias van de Meent <[email protected]>; Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>
Hello Sawada-san,
24.04.2026 20:52, Masahiko Sawada wrote:
> Right. The postmaster blocks all signals before starting child process
> as the following comment explains:
>
> /*
> * We start postmaster children with signals blocked. This allows them to
> * install their own handlers before unblocking, to avoid races where they
> * might run the postmaster's handler and miss an important control
> * signal. With more analysis this could potentially be relaxed.
> */
> sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
>
> Investigating the issue, I found there is a race condition between the
> procsignal initialization and emitting signal barrier that could be
> the cause of this issue. Imagine the following scenario:
>
> 1. In ProcSignalInit(), the checkpointer initializes its
> slot->pss_barrierGeneration with the global generation.
> 2. In EmitProcSignalBarrier(), the startup checks the checkpointer's
> procsignal slot but it skips emitting the signal as slot->pss_pid is
> still 0. It can happen even though the checkpointer holds a spinlock
> on its slot during the initialization because the first pid check is
> done without a spinlock acquisition.
> 3. The checkpointer sets its pid to slot->pss_pid and releases the spin lock.
> 4. In WaitForProcSignalBarrier(), the startup checks the
> checkpointer's procsignal slot that has already initialized the
> pss_barrierGeneration, and waits for it to be updated. However, the
> checkpointer never updates its barrier generation as it doesn't get
> the signal.
Thank you for the investigation and explanation of the issue!
I've been puzzled by a buildfarm failure [1] with such symptoms for a while
and even reproduced it locally once, but couldn't gather more information
at that time. But now that you have described the scenario, I can easily
reproduce the same test failure with:
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
if (cancel_key_len > 0)
memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
slot->pss_cancel_key_len = cancel_key_len;
+pg_usleep(10000);
pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
by just running `meson test test_oat_hooks_*/regress` with the test multiplied x30:
26/30 test_oat_hooks_28 - postgresql:test_oat_hooks_28/regress OK 1.28s 2 subtests passed
27/30 test_oat_hooks_30 - postgresql:test_oat_hooks_30/regress OK 1.25s 2 subtests passed
28/30 test_oat_hooks_2 - postgresql:test_oat_hooks_2/regress ERROR 62.49s exit status 2
2026-04-27 17:34:44.290 UTC postmaster[1578102] LOG: starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-16.0.1, 64-bit
2026-04-27 17:34:44.290 UTC postmaster[1578102] LOG: listening on Unix socket "/tmp/pg_regress-QdhMPt/.s.PGSQL.40086"
2026-04-27 17:34:44.302 UTC startup[1578114] LOG: database system was shut down at 2026-04-27 17:34:44 UTC
2026-04-27 17:34:44.325 UTC dead-end client backend[1578133] [unknown] FATAL: the database system is starting up
...
2026-04-27 17:34:49.274 UTC dead-end client backend[1578643] [unknown] FATAL: the database system is starting up
2026-04-27 17:34:49.308 UTC startup[1578114] LOG: still waiting for backend with PID 1578110 to accept ProcSignalBarrier
2026-04-27 17:34:49.325 UTC dead-end client backend[1578645] [unknown] FATAL: the database system is starting up
...
2026-04-27 17:35:44.332 UTC dead-end client backend[1582376] [unknown] FATAL: the database system is starting up
2026-04-27 17:35:44.351 UTC startup[1578114] LOG: still waiting for backend with PID 1578110 to accept ProcSignalBarrier
2026-04-27 17:35:44.383 UTC dead-end client backend[1582379] [unknown] FATAL: the database system is starting up
[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2026-03-10%2013%3A58%3A5...
Best regards,
Alexander
* Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
2026-04-22 11:21 Startup process deadlock: WaitForProcSignalBarriers vs aux process Matthias van de Meent <[email protected]>
2026-04-22 19:05 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Andres Freund <[email protected]>
2026-04-24 17:52 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-04-27 18:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
@ 2026-04-28 19:27 ` Masahiko Sawada <[email protected]>
2026-04-29 10:49 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Matthias van de Meent <[email protected]>
2026-04-29 18:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
0 siblings, 2 replies; 13+ messages in thread
From: Masahiko Sawada @ 2026-04-28 19:27 UTC (permalink / raw)
To: Alexander Lakhin <[email protected]>; +Cc: Andres Freund <[email protected]>; Matthias van de Meent <[email protected]>; Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>
On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <[email protected]> wrote:
>
> Hello Sawada-san,
>
> 24.04.2026 20:52, Masahiko Sawada wrote:
>
> Right. The postmaster blocks all signals before starting child process
> as the following comment explains:
>
> /*
> * We start postmaster children with signals blocked. This allows them to
> * install their own handlers before unblocking, to avoid races where they
> * might run the postmaster's handler and miss an important control
> * signal. With more analysis this could potentially be relaxed.
> */
> sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
>
> Investigating the issue, I found there is a race condition between the
> procsignal initialization and emitting signal barrier that could be
> the cause of this issue. Imagine the following scenario:
>
> 1. In ProcSignalInit(), the checkpointer initializes its
> slot->pss_barrierGeneration with the global generation.
> 2. In EmitProcSignalBarrier(), the startup checks the checkpointer's
> procsignal slot but it skips emitting the signal as slot->pss_pid is
> still 0. It can happen even though the checkpointer holds a spinlock
> on its slot during the initialization because the first pid check is
> done without a spinlock acquisition.
> 3. The checkpointer sets its pid to slot->pss_pid and releases the spin lock.
> 4. In WaitForProcSignalBarrier(), the startup checks the
> checkpointer's procsignal slot that has already initialized the
> pss_barrierGeneration, and waits for it to be updated. However, the
> checkpointer never updates its barrier generation as it doesn't get
> the signal.
>
>
> Thank you for the investigation and explanation of the issue!
>
> I've been puzzled by a buildfarm failure [1] with such symptoms for a while
> and even reproduced it locally once, but couldn't gather more information
> that time. But now that you have described the scenario, I can easily
> reproduce the same test failure with:
> --- a/src/backend/storage/ipc/procsignal.c
> +++ b/src/backend/storage/ipc/procsignal.c
> @@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
> if (cancel_key_len > 0)
> memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
> slot->pss_cancel_key_len = cancel_key_len;
> +pg_usleep(10000);
> pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
Thank you for testing this.
I've attached a patch to address the issue. I haven't verified it
across all versions yet, but I suspect it exists in the stable
branches as well. Previously, the issue rarely occurred because
EmitProcSignalBarrier() was only used for smgr invalidation. However,
now that we use signal barriers for online wal_level changes and
checksum status updates, this race condition is likely to be
encountered more frequently.
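The ordering that the attached patch relies on can be sketched with C11
atomics. This is an illustration under assumptions, not the patch itself:
SketchSlot, sketch_join() and sketch_emit() are hypothetical names, C11
<stdatomic.h> operations stand in for the pg_atomic_* calls and
pg_memory_barrier(), and the spinlock and the actual signalling are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced slot: only the two fields that take part in the race. */
typedef struct SketchSlot
{
    _Atomic uint32_t pid;
    _Atomic uint64_t generation;
} SketchSlot;

static _Atomic uint64_t sketch_global_generation;

/* Joining process: publish the PID first, then read the global generation. */
static void
sketch_join(SketchSlot *slot, uint32_t my_pid)
{
    uint64_t    gen;

    assert(my_pid != 0);
    atomic_store_explicit(&slot->pid, my_pid, memory_order_relaxed);

    /* Full fence: the PID store is ordered before the load below. */
    atomic_thread_fence(memory_order_seq_cst);

    gen = atomic_load_explicit(&sketch_global_generation, memory_order_relaxed);
    atomic_store_explicit(&slot->generation, gen, memory_order_relaxed);
}

/*
 * Emitter: bump the generation with a sequentially consistent
 * read-modify-write, then check the PID with a sequentially consistent
 * load. Returns true if it would signal this slot.
 */
static bool
sketch_emit(SketchSlot *slot, uint64_t *generation)
{
    *generation = atomic_fetch_add(&sketch_global_generation, 1) + 1;
    return atomic_load(&slot->pid) != 0;
}
```

With this ordering, a concurrent sketch_emit() either sees the published PID
and would signal the slot, or has already incremented the generation that
sketch_join() then loads. The invariant to check is therefore
"signalled, or slot generation >= emitted generation".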
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
[text/x-patch] v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch (2.6K, 2-v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch)
From 8ed72e1bc748f99fbf8b103ae5bd4cf395cb54ef Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v1] Fix race between ProcSignalInit() and
EmitProcSignalBarrier().
ProcSignalInit() read the global barrier generation before publishing
its PID into pss_pid. A concurrent EmitProcSignalBarrier() iterates
the ProcSignal slots and skips any whose pss_pid is still zero, on the
assumption that such a slot will pick up the new generation when it
later reads psh_barrierGeneration. But because the joining backend had
already read the (older) global generation under its slot's spinlock,
it would store a stale value into pss_barrierGeneration and never
absorb the just-emitted barrier, so that WaitForProcSignalBarrier()
never completed.
Publish pss_pid before reading psh_barrierGeneration, with a memory
barrier in between so that the store is globally visible first. A
concurrent EmitProcSignalBarrier() then either observes the published
PID and signals this slot, or completes its generation increment
before we load it.
Discussion: https://postgr.es/m/CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com
Backpatch-through:
---
src/backend/storage/ipc/procsignal.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 264e4c22ca6..b0681ca0ae2 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -188,6 +188,16 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
/* Clear out any leftover signal reasons */
MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
+ /*
+ * Publish the PID before reading the global barrier generation to ensure
+ * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+ * older generation. We need a memory barrier here to make sure that the
+ * update of pss_pid is globally visible before the load of the global
+ * barrier generation executes.
+ */
+ pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
+ pg_memory_barrier();
+
/*
* Initialize barrier state. Since we're a brand-new process, there
* shouldn't be any leftover backend-private state that needs to be
@@ -207,7 +217,6 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
if (cancel_key_len > 0)
memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
slot->pss_cancel_key_len = cancel_key_len;
- pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
SpinLockRelease(&slot->pss_mutex);
--
2.54.0
* Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
2026-04-22 11:21 Startup process deadlock: WaitForProcSignalBarriers vs aux process Matthias van de Meent <[email protected]>
2026-04-22 19:05 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Andres Freund <[email protected]>
2026-04-24 17:52 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-04-27 18:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
2026-04-28 19:27 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
@ 2026-04-29 10:49 ` Matthias van de Meent <[email protected]>
1 sibling, 0 replies; 13+ messages in thread
From: Matthias van de Meent @ 2026-04-29 10:49 UTC (permalink / raw)
To: Masahiko Sawada <[email protected]>; +Cc: Alexander Lakhin <[email protected]>; Andres Freund <[email protected]>; Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>
On Wed, 22 Apr 2026 at 21:05, Andres Freund <[email protected]> wrote:
>
> Hi,
>
> On 2026-04-22 13:21:02 +0200, Matthias van de Meent wrote:
> > If the PSB is emitted (and signaled to checkpointer) before the
> > checkpointer has registered its SIGUSR1 handler, then the checkpointer
> > won't receive the notice to check its procsignal slots, it won't
> > notice the updated procsignal flags, and it won't process the PSB; not
> > until it receives a new SIGUSR1.
> >
> > Signals are sent to all processes that have their procsignal pss_pid
> > set, which is true for every process which has called ProcSignalInit,
> > which for the checkpointer (like other aux processes) happens in
> > AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
> > processes) calls AuxiliaryProcessMainCommon before registering its
> > signal handlers, creating a small window in time where signals are
> > sent, but not handled.
>
> Hm. Have we confirmed this happens?
>
> CheckpointerMain() is called with all signals masked, so it should be ok for
> the signal handler to only be set up after AuxiliaryProcessMainCommon(), as
> long as it happens before [...]
Yeah, that was a misidentification of the exact race that caused the issue.
On Tue, 28 Apr 2026 at 21:28, Masahiko Sawada <[email protected]> wrote:
>
> On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <[email protected]> wrote:
> >
> > Hello Sawada-san,
> >
> > 24.04.2026 20:52, Masahiko Sawada wrote:
> >
> > Right. The postmaster blocks all signals before starting child process
> > as the following comment explains:
> >
> > /*
> > * We start postmaster children with signals blocked. This allows them to
> > * install their own handlers before unblocking, to avoid races where they
> > * might run the postmaster's handler and miss an important control
> > * signal. With more analysis this could potentially be relaxed.
> > */
> > sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
> >
> > Investigating the issue, I found there is a race condition between the
> > procsignal initialization and emitting signal barrier that could be
> > the cause of this issue. Imagine the following scenario:
Ah, that'd be it indeed. Thanks!
> I've attached a patch to address the issue. I haven't verified it
> across all versions yet, but I suspect it exists in the stable
> branches as well. Previously, the issue rarely occurred because
> EmitProcSignalBarrier() was only used for smgr invalidation. However,
> now that we use signal barriers for online wal_level changes and
> checksum status updates, this race condition is likely to be
> encountered more frequently.
Yes, I think the boot process with the xlog_logical_info barrier is
more likely to hit this issue; as indicated by two known detected
cases in various CI jobs; though it could also be that the lockup of
the new barrier is just exceptionally bad for system stability.
As for the patches:
v1-0001 -- LGTM.
0001 (upthread): LGTM, but I'd also suggest adding some code to make
sure that we're actually receiving procsignals by the time we
initialize the Logical/Checksum subsystems that need to process shared
state changes by responding to procsignals; as attached. smgr's
procsignal doesn't really depend on shared memory state, so I've kept
that out of my patch.
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)
Attachments:
[application/octet-stream] v1-0001-Assert-ProcSignal-is-initialized-before-its-depen.patch (2.4K, 2-v1-0001-Assert-ProcSignal-is-initialized-before-its-depen.patch)
From 8a1dc18bbcf11a2eb36cfd3dbb290976d87284d1 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <[email protected]>
Date: Wed, 29 Apr 2026 12:10:44 +0200
Subject: [PATCH v1] Assert ProcSignal is initialized before its dependents
---
src/backend/access/transam/xlog.c | 1 +
src/backend/replication/logical/logicalctl.c | 1 +
src/backend/storage/ipc/procsignal.c | 8 ++++++++
src/include/storage/procsignal.h | 5 +++++
4 files changed, 15 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e39af79c03b..63e84b00cec 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4960,6 +4960,7 @@ SetDataChecksumsOff(void)
void
InitLocalDataChecksumState(void)
{
+ Assert(ProcSignalIsInitialized());
SpinLockAcquire(&XLogCtl->info_lck);
SetLocalDataChecksumState(XLogCtl->data_checksum_version);
SpinLockRelease(&XLogCtl->info_lck);
diff --git a/src/backend/replication/logical/logicalctl.c b/src/backend/replication/logical/logicalctl.c
index 72f68ec58ef..80308b619a4 100644
--- a/src/backend/replication/logical/logicalctl.c
+++ b/src/backend/replication/logical/logicalctl.c
@@ -173,6 +173,7 @@ update_xlog_logical_info(void)
void
InitializeProcessXLogLogicalInfo(void)
{
+ Assert(ProcSignalIsInitialized());
update_xlog_logical_info();
}
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index b0681ca0ae2..71a0b25e49e 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -232,6 +232,14 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
on_shmem_exit(CleanupProcSignalState, (Datum) 0);
}
+#ifdef USE_ASSERT_CHECKING
+bool
+ProcSignalIsInitialized(void)
+{
+ return MyProcSignalSlot != NULL;
+}
+#endif
+
/*
* CleanupProcSignalState
* Remove current process from ProcSignal mechanism
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index aaa158bfd66..1d2290c6975 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -87,4 +87,9 @@ typedef struct ProcSignalHeader ProcSignalHeader;
extern PGDLLIMPORT ProcSignalHeader *ProcSignal;
#endif
+#ifdef USE_ASSERT_CHECKING
+extern bool ProcSignalIsInitialized(void);
+#endif
+
+
#endif /* PROCSIGNAL_H */
--
2.50.1 (Apple Git-155)
* Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
From: Alexander Lakhin @ 2026-04-29 18:00 UTC (permalink / raw)
To: Masahiko Sawada <[email protected]>; +Cc: Andres Freund <[email protected]>; Matthias van de Meent <[email protected]>; Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>
Dear Sawada-san,
28.04.2026 22:27, Masahiko Sawada wrote:
> On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <[email protected]> wrote:
>> I've been puzzled by a buildfarm failure [1] with such symptoms for a while
>> and even reproduced it locally once, but couldn't gather more information
>> that time. But now that you have described the scenario, I can easily
>> reproduce the same test failure with:
>> --- a/src/backend/storage/ipc/procsignal.c
>> +++ b/src/backend/storage/ipc/procsignal.c
>> @@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
>> if (cancel_key_len > 0)
>> memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
>> slot->pss_cancel_key_len = cancel_key_len;
>> +pg_usleep(10000);
>> pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
> Thank you for testing this.
>
> I've attached a patch to address the issue. I haven't verified it
> across all versions yet, but I suspect it exists in the stable
> branches as well...
Thank you for the fix! It works for me too.
I was wondering why that failure is the only one of this kind on the
buildfarm (in the last two years, at least), so I tried to reproduce it
on REL_18_STABLE... and failed.
Then I bisected it on the master branch and found the (your) commit that
introduced this behavior: 67c20979c from 2025-12-23.
Best regards,
Alexander
* Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
From: Masahiko Sawada @ 2026-04-30 22:08 UTC (permalink / raw)
To: Alexander Lakhin <[email protected]>; +Cc: Andres Freund <[email protected]>; Matthias van de Meent <[email protected]>; Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>
On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <[email protected]> wrote:
>
> Dear Sawada-san,
>
> 28.04.2026 22:27, Masahiko Sawada wrote:
> > On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <[email protected]> wrote:
> >> I've been puzzled by a buildfarm failure [1] with such symptoms for a while
> >> and even reproduced it locally once, but couldn't gather more information
> >> that time. But now that you have described the scenario, I can easily
> >> reproduce the same test failure with:
> >> --- a/src/backend/storage/ipc/procsignal.c
> >> +++ b/src/backend/storage/ipc/procsignal.c
> >> @@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
> >> if (cancel_key_len > 0)
> >> memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
> >> slot->pss_cancel_key_len = cancel_key_len;
> >> +pg_usleep(10000);
> >> pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
> > Thank you for testing this.
> >
> > I've attached a patch to address the issue. I haven't verified it
> > across all versions yet, but I suspect it exists in the stable
> > branches as well...
>
> Thank you for the fix! It works for me too.
>
> I was wondering why is that failure the only one of this kind on buildfarm
> (in last two years, at least), so I've tried to reproduce it on
> REL_18_STABLE... and failed.
>
> Then I've bisected it on the master branch and found (your) commit that
> introduced this behavior: 67c20979c from 2025-12-23.
>
I've confirmed that this race condition is present from v15 through
master. In v14, we have the procsignal barrier code but don't use it
anywhere. In v18 and older, it could happen when executing DROP
DATABASE, DROP TABLESPACE, etc., whereas in master it could happen in
more cases, as we're using procsignal barriers in more places. In any
case, if a process emits a signal barrier while another process is
between initializing slot->pss_barrierGeneration and initializing
slot->pss_pid, the emitter skips that slot, and the subsequent
WaitForProcSignalBarrier() ends up waiting for that process forever.
So I think the patch should be backpatched to v15. Please review these
patches.
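To make the window concrete, here is a minimal single-threaded sketch
that replays the problematic interleaving by hand. It is not PostgreSQL
code: the Slot struct, emit_barrier(), absorb_barrier(),
waiter_would_hang() and simulate() are hypothetical stand-ins, and the
model assumes that an unused slot carries the maximum generation and
that the emitter signals only slots with a nonzero pss_pid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for one ProcSignal slot. */
typedef struct
{
	uint32_t	pss_pid;				/* 0 = not published yet */
	uint64_t	pss_barrierGeneration;	/* UINT64_MAX = slot unused */
	bool		barrier_pending;		/* set when the emitter signals us */
} Slot;

/* Hypothetical stand-in for the shared global generation counter. */
static uint64_t psh_barrierGeneration = 0;

/* Emitter: bump the global generation, then signal only published slots. */
static uint64_t
emit_barrier(Slot *slot)
{
	uint64_t	generation = ++psh_barrierGeneration;

	if (slot->pss_pid != 0)
		slot->barrier_pending = true;
	return generation;
}

/* Signalled process: absorb a pending barrier, if there is one. */
static void
absorb_barrier(Slot *slot)
{
	if (slot->barrier_pending)
	{
		slot->pss_barrierGeneration = psh_barrierGeneration;
		slot->barrier_pending = false;
	}
}

/* Waiter: it blocks for as long as the slot's generation lags behind. */
static bool
waiter_would_hang(const Slot *slot, uint64_t generation)
{
	return slot->pss_barrierGeneration < generation;
}

/*
 * Replay slot initialization with the emitter running between its two
 * steps. Returns true if the waiter would be stuck afterwards.
 */
bool
simulate(bool publish_pid_first)
{
	Slot		slot = {0, UINT64_MAX, false};
	uint64_t	generation;

	psh_barrierGeneration = 0;
	if (publish_pid_first)
	{
		/* Patched order: publish the PID, then read the generation. */
		slot.pss_pid = 1234;
		generation = emit_barrier(&slot);	/* sees the PID, signals */
		slot.pss_barrierGeneration = psh_barrierGeneration;
	}
	else
	{
		/* Old order: read the generation, then publish the PID. */
		slot.pss_barrierGeneration = psh_barrierGeneration;
		generation = emit_barrier(&slot);	/* PID still 0, slot skipped */
		slot.pss_pid = 1234;
	}
	absorb_barrier(&slot);
	assert(slot.pss_pid != 0);	/* the slot is published in both orders */
	return waiter_would_hang(&slot, generation);
}
```

Calling simulate(false) replays the old order and reports a stuck
waiter; simulate(true) replays the patched order and does not. The
replay is sequential, so it cannot show why the real fix also needs
pg_memory_barrier() between the PID store and the generation load.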
FYI, I found that we had a similar report[1] last year, though I'm not
sure it hit the exact same issue.
Regards,
[1] https://www.postgresql.org/message-id/[email protected]...
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
[text/x-patch] v2_15-0001-Fix-race-between-ProcSignalInit-and-EmitProcSi.patch (2.9K, 2-v2_15-0001-Fix-race-between-ProcSignalInit-and-EmitProcSi.patch)
From 63ce4e5578f1703254952cd3aee3a0a22c6da990 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v2_15] Fix race between ProcSignalInit() and
EmitProcSignalBarrier().
Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into the pss_pid slot. This created a race
condition: a process could initialize its local generation with an
older global value, while a concurrent EmitProcSignalBarrier() might
skip that process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.
This commit fixes the issue by publishing pss_pid before reading
psh_barrierGeneration, with a memory barrier in between so that the
store is globally visible first. A concurrent EmitProcSignalBarrier()
then either observes the published PID and signals this slot, or
completes its generation increment before we load it.
While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue is theoretically present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). So backpatch to 15.
This issue was also reported by buildfarm animal flaviventris.
Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com
Backpatch-through: 15
---
src/backend/storage/ipc/procsignal.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 21a9fc0fdd2..cd4fe11b1a6 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -175,6 +175,16 @@ ProcSignalInit(int pss_idx)
/* Clear out any leftover signal reasons */
MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
+ /*
+ * Publish the PID before reading the global barrier generation to ensure
+ * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+ * older generation. We need a memory barrier here to make sure that the
+ * update of pss_pid is globally visible before the load of the global
+ * barrier generation executes.
+ */
+ slot->pss_pid = MyProcPid;
+ pg_memory_barrier();
+
/*
* Initialize barrier state. Since we're a brand-new process, there
* shouldn't be any leftover backend-private state that needs to be
@@ -192,9 +202,6 @@ ProcSignalInit(int pss_idx)
pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
pg_memory_barrier();
- /* Mark slot with my PID */
- slot->pss_pid = MyProcPid;
-
/* Remember slot location for CheckProcSignal */
MyProcSignalSlot = slot;
--
2.54.0
[text/x-patch] v2_17-0001-Fix-race-between-ProcSignalInit-and-EmitProcSi.patch (2.9K, 3-v2_17-0001-Fix-race-between-ProcSignalInit-and-EmitProcSi.patch)
From 099e5a2b1bf0c8631f6b5f2a4bfba4ee039b5d5b Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v2_17] Fix race between ProcSignalInit() and
EmitProcSignalBarrier().
Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into the pss_pid slot. This created a race
condition: a process could initialize its local generation with an
older global value, while a concurrent EmitProcSignalBarrier() might
skip that process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.
This commit fixes the issue by publishing pss_pid before reading
psh_barrierGeneration, with a memory barrier in between so that the
store is globally visible first. A concurrent EmitProcSignalBarrier()
then either observes the published PID and signals this slot, or
completes its generation increment before we load it.
While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue is theoretically present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). So backpatch to 15.
This issue was also reported by buildfarm animal flaviventris.
Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com
Backpatch-through: 15
---
src/backend/storage/ipc/procsignal.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index d6857f5a8bb..86f39e42ad6 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -175,6 +175,16 @@ ProcSignalInit(void)
/* Clear out any leftover signal reasons */
MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
+ /*
+ * Publish the PID before reading the global barrier generation to ensure
+ * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+ * older generation. We need a memory barrier here to make sure that the
+ * update of pss_pid is globally visible before the load of the global
+ * barrier generation executes.
+ */
+ slot->pss_pid = MyProcPid;
+ pg_memory_barrier();
+
/*
* Initialize barrier state. Since we're a brand-new process, there
* shouldn't be any leftover backend-private state that needs to be
@@ -192,9 +202,6 @@ ProcSignalInit(void)
pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
pg_memory_barrier();
- /* Mark slot with my PID */
- slot->pss_pid = MyProcPid;
-
/* Remember slot location for CheckProcSignal */
MyProcSignalSlot = slot;
--
2.54.0
[text/x-patch] v2_18-0001-Fix-race-between-ProcSignalInit-and-EmitProcSi.patch (2.9K, 4-v2_18-0001-Fix-race-between-ProcSignalInit-and-EmitProcSi.patch)
From cbac8a3b949a893f530150a1da212bc67a46af00 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v2_18] Fix race between ProcSignalInit() and
EmitProcSignalBarrier().
Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into the pss_pid slot. This created a race
condition: a process could initialize its local generation with an
older global value, while a concurrent EmitProcSignalBarrier() might
skip that process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.
This commit fixes the issue by publishing pss_pid before reading
psh_barrierGeneration, with a memory barrier in between so that the
store is globally visible first. A concurrent EmitProcSignalBarrier()
then either observes the published PID and signals this slot, or
completes its generation increment before we load it.
While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue is theoretically present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). So backpatch to 15.
This issue was also reported by buildfarm animal flaviventris.
Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com
Backpatch-through: 15
---
src/backend/storage/ipc/procsignal.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 05d99b452c3..a0117ef969b 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -185,6 +185,16 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
/* Clear out any leftover signal reasons */
MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
+ /*
+ * Publish the PID before reading the global barrier generation to ensure
+ * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+ * older generation. We need a memory barrier here to make sure that the
+ * update of pss_pid is globally visible before the load of the global
+ * barrier generation executes.
+ */
+ pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
+ pg_memory_barrier();
+
/*
* Initialize barrier state. Since we're a brand-new process, there
* shouldn't be any leftover backend-private state that needs to be
@@ -204,7 +214,6 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
if (cancel_key_len > 0)
memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
slot->pss_cancel_key_len = cancel_key_len;
- pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
SpinLockRelease(&slot->pss_mutex);
--
2.54.0
[text/x-patch] v2_16-0001-Fix-race-between-ProcSignalInit-and-EmitProcSi.patch (2.9K, 5-v2_16-0001-Fix-race-between-ProcSignalInit-and-EmitProcSi.patch)
From a4c69b7ef9eacb79581dd2622ad8e107089b0dd2 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v2_16] Fix race between ProcSignalInit() and
EmitProcSignalBarrier().
Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into the pss_pid slot. This created a race
condition: a process could initialize its local generation with an
older global value, while a concurrent EmitProcSignalBarrier() might
skip that process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.
This commit fixes the issue by publishing pss_pid before reading
psh_barrierGeneration, with a memory barrier in between so that the
store is globally visible first. A concurrent EmitProcSignalBarrier()
then either observes the published PID and signals this slot, or
completes its generation increment before we load it.
While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue is theoretically present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). So backpatch to 15.
This issue was also reported by buildfarm animal flaviventris.
Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com
Backpatch-through: 15
---
src/backend/storage/ipc/procsignal.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c85cb5cc18d..01186ab91fb 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -176,6 +176,16 @@ ProcSignalInit(int pss_idx)
/* Clear out any leftover signal reasons */
MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
+ /*
+ * Publish the PID before reading the global barrier generation to ensure
+ * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+ * older generation. We need a memory barrier here to make sure that the
+ * update of pss_pid is globally visible before the load of the global
+ * barrier generation executes.
+ */
+ slot->pss_pid = MyProcPid;
+ pg_memory_barrier();
+
/*
* Initialize barrier state. Since we're a brand-new process, there
* shouldn't be any leftover backend-private state that needs to be
@@ -193,9 +203,6 @@ ProcSignalInit(int pss_idx)
pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
pg_memory_barrier();
- /* Mark slot with my PID */
- slot->pss_pid = MyProcPid;
-
/* Remember slot location for CheckProcSignal */
MyProcSignalSlot = slot;
--
2.54.0
[text/x-patch] v2_master-0001-Fix-race-between-ProcSignalInit-and-EmitPr.patch (2.9K, 6-v2_master-0001-Fix-race-between-ProcSignalInit-and-EmitPr.patch)
From c8de0ff6283b620d3f81957c3a1947f3c024bd68 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v2_master] Fix race between ProcSignalInit() and
EmitProcSignalBarrier().
Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into the pss_pid slot. This created a race
condition: a process could initialize its local generation with an
older global value, while a concurrent EmitProcSignalBarrier() might
skip that process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.
This commit fixes the issue by publishing pss_pid before reading
psh_barrierGeneration, with a memory barrier in between so that the
store is globally visible first. A concurrent EmitProcSignalBarrier()
then either observes the published PID and signals this slot, or
completes its generation increment before we load it.
While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue is theoretically present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). So backpatch to 15.
This issue was also reported by buildfarm animal flaviventris.
Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com
Backpatch-through: 15
---
src/backend/storage/ipc/procsignal.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 264e4c22ca6..b0681ca0ae2 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -188,6 +188,16 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
/* Clear out any leftover signal reasons */
MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
+ /*
+ * Publish the PID before reading the global barrier generation to ensure
+ * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+ * older generation. We need a memory barrier here to make sure that the
+ * update of pss_pid is globally visible before the load of the global
+ * barrier generation executes.
+ */
+ pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
+ pg_memory_barrier();
+
/*
* Initialize barrier state. Since we're a brand-new process, there
* shouldn't be any leftover backend-private state that needs to be
@@ -207,7 +217,6 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
if (cancel_key_len > 0)
memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
slot->pss_cancel_key_len = cancel_key_len;
- pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
SpinLockRelease(&slot->pss_mutex);
--
2.54.0
* Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
From: Alexander Lakhin @ 2026-05-01 08:00 UTC (permalink / raw)
To: Masahiko Sawada <[email protected]>; +Cc: Andres Freund <[email protected]>; Matthias van de Meent <[email protected]>; Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>; Andrey Borodin <[email protected]>
Dear Sawada-san,
01.05.2026 01:08, Masahiko Sawada wrote:
> On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <[email protected]> wrote:
>> I was wondering why is that failure the only one of this kind on buildfarm
>> (in last two years, at least), so I've tried to reproduce it on
>> REL_18_STABLE... and failed.
>>
>> Then I've bisected it on the master branch and found (your) commit that
>> introduced this behavior: 67c20979c from 2025-12-23.
>>
> I've confirmed that this race condition issue is present from v15 to
> the master. In v14, we have the procsignal barrier code but don't use
> it anywhere. In v18 or older, it could happen when executing DROP
> DATABASE, DROP TABLESPACE etc, whereas in the master, it could happen
> in more cases as we're using procsignal barrier more places. In any
> case, if a process emits a signal barrier when another process is
> between the initialization of slot->pss_barrierGeneration and
> slot->pss_pid initialization, the subsequent
> WaitForProcSignalBarrier() ends up waiting for that process forever.
> So I think the patch should be backpatched to v15. Please review these
> patches.
Yes, you're right -- it's not reproduced on REL_18_STABLE with
test_oat_hooks, which simply starts a postgres node (as many other tests
do), but when I ran the full test suite with the sleep inserted before
setting pss_pid, I discovered the following vulnerable tests:
030_stats_cleanup_replica_standby.log
2026-05-01 06:00:58.789 UTC [2086579] LOG: still waiting for backend with PID 2086578 to accept ProcSignalBarrier
2026-05-01 06:00:58.789 UTC [2086579] CONTEXT: WAL redo at 0/3410B00 for Database/DROP: dir 1663/16393
033_replay_tsp_drops_standby2_FILE_COPY.log
2026-05-01 05:45:12.969 UTC [2030902] LOG: still waiting for backend with PID 2030901 to accept ProcSignalBarrier
2026-05-01 05:45:12.969 UTC [2030902] CONTEXT: WAL redo at 0/30006A8 for Database/CREATE_FILE_COPY: copy dir 1663/1 to 16384/16389
040_standby_failover_slots_sync_publisher.log
2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl LOG: still waiting for backend with PID 1538477 to accept ProcSignalBarrier
2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl STATEMENT: DROP DATABASE slotsync_test_db;
002_compare_backups_pitr1.log
2026-05-01 04:50:46.638 UTC [1829328] LOG: still waiting for backend with PID 1829396 to accept ProcSignalBarrier
2026-05-01 04:50:46.638 UTC [1829328] CONTEXT: WAL redo at 0/30A1DE0 for Database/DROP: dir 1663/16414
I've tried my repro with 033_replay_tsp_drops and it really fails on
REL_15_STABLE..master and doesn't fail on REL_14_STABLE.
> FYI I found that we had a similar report[1] last year, I'm not sure
> it hit the exact same issue, though.
>
> Regards,
>
> [1]https://www.postgresql.org/message-id/[email protected]...
Yeah, and probably this one:
https://www.postgresql.org/message-id/EF98BB5B-CA83-443E-B8A6-AA58EE4A06BB%40yandex-team.ru
By the way, mamba produced the same failure just yesterday:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-30%2005%3A10%3A39
# Running: pg_ctl --wait --pgdata
/home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_004_restart_primary_data/pgdata --log
/home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/log/004_restart_primary.log --options
--cluster-name=primary start
waiting for server to
start...........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
stopped waiting
pg_ctl: server did not start in time
004_restart_primary.log
2026-04-30 04:09:04.025 EDT [17814:2] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
...
2026-04-30 04:19:55.336 EDT [17814:132] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
The proposed patches make the test pass reliably for me in all affected
branches. Thank you for working on this!
Best regards,
Alexander
* Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
From: Masahiko Sawada @ 2026-05-07 17:17 UTC (permalink / raw)
To: Alexander Lakhin <[email protected]>; +Cc: Andres Freund <[email protected]>; Matthias van de Meent <[email protected]>; Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>; Andrey Borodin <[email protected]>
On Fri, May 1, 2026 at 1:00 AM Alexander Lakhin <[email protected]> wrote:
>
> Dear Sawada-san,
>
> 01.05.2026 01:08, Masahiko Sawada wrote:
>
> On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <[email protected]> wrote:
>
> I was wondering why is that failure the only one of this kind on buildfarm
> (in last two years, at least), so I've tried to reproduce it on
> REL_18_STABLE... and failed.
>
> Then I've bisected it on the master branch and found (your) commit that
> introduced this behavior: 67c20979c from 2025-12-23.
>
> I've confirmed that this race condition issue is present from v15 to
> the master. In v14, we have the procsignal barrier code but don't use
> it anywhere. In v18 and older, it could happen when executing DROP
> DATABASE, DROP TABLESPACE, etc., whereas on master it could happen
> in more cases as we're using procsignal barriers in more places. In any
> case, if a process emits a signal barrier when another process is
> between the initialization of slot->pss_barrierGeneration and
> slot->pss_pid initialization, the subsequent
> WaitForProcSignalBarrier() ends up waiting for that process forever.
> So I think the patch should be backpatched to v15. Please review these
> patches.
>
>
> Yes, you're right -- it's not reproduced on REL_18_STABLE with
> test_oat_hooks, which simply starts postgres node (as many other tests),
> but when I tried the full test suite with the sleep inserted before
> setting pss_pid, I discovered the following vulnerable tests:
>
> 030_stats_cleanup_replica_standby.log
> 2026-05-01 06:00:58.789 UTC [2086579] LOG: still waiting for backend with PID 2086578 to accept ProcSignalBarrier
> 2026-05-01 06:00:58.789 UTC [2086579] CONTEXT: WAL redo at 0/3410B00 for Database/DROP: dir 1663/16393
>
> 033_replay_tsp_drops_standby2_FILE_COPY.log
> 2026-05-01 05:45:12.969 UTC [2030902] LOG: still waiting for backend with PID 2030901 to accept ProcSignalBarrier
> 2026-05-01 05:45:12.969 UTC [2030902] CONTEXT: WAL redo at 0/30006A8 for Database/CREATE_FILE_COPY: copy dir 1663/1 to 16384/16389
>
> 040_standby_failover_slots_sync_publisher.log
> 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl LOG: still waiting for backend with PID 1538477 to accept ProcSignalBarrier
> 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl STATEMENT: DROP DATABASE slotsync_test_db;
>
> 002_compare_backups_pitr1.log
> 2026-05-01 04:50:46.638 UTC [1829328] LOG: still waiting for backend with PID 1829396 to accept ProcSignalBarrier
> 2026-05-01 04:50:46.638 UTC [1829328] CONTEXT: WAL redo at 0/30A1DE0 for Database/DROP: dir 1663/16414
>
> I've tried my repro with 033_replay_tsp_drops and it really fails on
> REL_15_STABLE..master and doesn't fail on REL_14_STABLE.
>
> FYI I found that we had a similar report[1] last year, I'm not sure
> it hit the exact same issue, though.
>
> Regards,
>
> [1] https://www.postgresql.org/message-id/[email protected]...
>
>
> Yeah, and probably this one:
> https://www.postgresql.org/message-id/EF98BB5B-CA83-443E-B8A6-AA58EE4A06BB%40yandex-team.ru
>
> By the way, mamba produced the same failure just yesterday:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-30%2005%3A10%3A39
>
> # Running: pg_ctl --wait --pgdata /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_004_restart_primary_data/pgdata --log /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/log/004_restart_primary.log --options --cluster-name=primary start
> waiting for server to start........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... stopped waiting
> pg_ctl: server did not start in time
> 004_restart_primary.log
> 2026-04-30 04:09:04.025 EDT [17814:2] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
> ...
> 2026-04-30 04:19:55.336 EDT [17814:132] LOG: still waiting for backend with PID 11506 to accept ProcSignalBarrier
>
> The proposed patches make the test pass reliably for me in all affected
> branches. Thank you for working on this!
>
Thank you for checking this issue on stable branches too!
Considering that this issue is not very visible in practice and we're
going to release new minor versions next week, I'm planning to push
these fixes to master and the back branches after the minor releases. That
way, we can fix the issue on master relatively soon and have
enough time to verify that the fix works well on the back branches.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
* Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
2026-04-22 11:21 Startup process deadlock: WaitForProcSignalBarriers vs aux process Matthias van de Meent <[email protected]>
2026-04-22 19:05 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Andres Freund <[email protected]>
2026-04-24 17:52 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-04-27 18:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
2026-04-28 19:27 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-04-29 18:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
2026-04-30 22:08 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-05-01 08:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
2026-05-07 17:17 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
@ 2026-05-14 21:47 ` Masahiko Sawada <[email protected]>
2026-05-22 23:26 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Matthias van de Meent <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Masahiko Sawada @ 2026-05-14 21:47 UTC (permalink / raw)
To: Alexander Lakhin <[email protected]>; +Cc: Andres Freund <[email protected]>; Matthias van de Meent <[email protected]>; Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>; Andrey Borodin <[email protected]>
On Thu, May 7, 2026 at 10:17 AM Masahiko Sawada <[email protected]> wrote:
>
> Thank you for checking this issue on stable branches too!
>
> Considering that this issue is not very visible in practice and we're
> going to release new minor versions next week, I'm planning to push
> these fixes to master and backbranches after the minor releases. That
> way, we can fix the issue on the master relatively soon and have
> enough time to verify that fix works well on backbranches.
>
While reviewing the patches, I realized that it would be better to use
pg_atomic_write_membarrier_u32() instead of pg_atomic_write_u32() +
pg_memory_barrier() where available. I've updated the patches for master
and 18, and slightly adjusted the commit messages.
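For readers following along, the ordering argument can be illustrated with a small single-threaded model in C11. This is only a sketch under stated assumptions: DemoState, demo_publish_pid(), demo_adopt_generation(), demo_emit_barrier() and demo_run() are hypothetical names made up for this illustration, not PostgreSQL's structures or functions, and the model replays one fixed interleaving (the emitter runs between the two initialization steps) rather than exercising real concurrency. The fence therefore has no observable effect here; it only marks where the store-before-load ordering is required.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared state; not PostgreSQL's real layout. */
typedef struct DemoState
{
	atomic_ullong global_generation;	/* bumped by each emitted barrier */
	atomic_uint   slot_pid;				/* 0 means "slot not published yet" */
	atomic_ullong slot_generation;		/* generation absorbed by the slot */
	atomic_bool   slot_signalled;		/* emitter asked the slot to catch up */
} DemoState;

/* Publish the PID, then issue a full fence so the store is ordered
 * before the later load of the global generation. */
static void
demo_publish_pid(DemoState *st, unsigned pid)
{
	atomic_store_explicit(&st->slot_pid, pid, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
}

/* Copy the current global generation into the slot. */
static void
demo_adopt_generation(DemoState *st)
{
	atomic_store(&st->slot_generation, atomic_load(&st->global_generation));
}

/* Bump the global generation, then signal the slot only if its PID is
 * already visible. Returns the generation a waiter would wait for. */
static unsigned long long
demo_emit_barrier(DemoState *st)
{
	unsigned long long gen = atomic_fetch_add(&st->global_generation, 1) + 1;

	if (atomic_load(&st->slot_pid) != 0)
		atomic_store(&st->slot_signalled, true);
	return gen;
}

/* Replay "emitter runs in the middle of slot initialization" and report
 * whether a waiter would be stuck on this slot afterwards. */
static bool
demo_run(bool fixed_order)
{
	DemoState	st;
	unsigned long long gen;

	atomic_init(&st.global_generation, 0);
	atomic_init(&st.slot_pid, 0);
	atomic_init(&st.slot_generation, ULLONG_MAX);	/* unused slot */
	atomic_init(&st.slot_signalled, false);

	if (fixed_order)
	{
		demo_publish_pid(&st, 1234);	/* PID first ... */
		gen = demo_emit_barrier(&st);
		demo_adopt_generation(&st);		/* ... then read the generation */
	}
	else
	{
		demo_adopt_generation(&st);		/* stale generation is recorded */
		gen = demo_emit_barrier(&st);	/* PID still 0: slot is skipped */
		demo_publish_pid(&st, 1234);
	}

	/* Stuck means: slot is behind and nobody signalled it. */
	return atomic_load(&st.slot_generation) < gen &&
		!atomic_load(&st.slot_signalled);
}
```

In this model, demo_run(false) reproduces the pre-patch ordering and reports a stuck waiter, while demo_run(true) follows the patched ordering and does not. In the real code the two sides run in different processes, which is why the full barrier (or the membarrier variant of the atomic write) matters.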
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Attachments:
[text/x-patch] REL17_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch (2.9K, 2-REL17_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch)
download | inline diff:
From b7606bea5ad7564b73ea4a2575f547113e532018 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v1] Fix race between ProcSignalInit() and
EmitProcSignalBarrier().
Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into pss_pid. This created a race condition: a
process could initialize its local generation with an older global
value, while a concurrent EmitProcSignalBarrier() might skip that
process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.
Fix this by publishing pss_pid before reading psh_barrierGeneration
with a memory barrier so that the store to pss_pid is ordered before
the load. A concurrent EmitProcSignalBarrier() then either observes
the published PID and signals this slot, or completes its generation
increment before we load it.
While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue is theoretically present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). v14 has the
procsignal barrier infrastructure but no in-tree caller that actually
emits a barrier, so the case is unreachable there.
This issue was also reported by buildfarm member flaviventris.
Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com
Backpatch-through: 15
---
src/backend/storage/ipc/procsignal.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index d6857f5a8bb..50b3cb2fd7b 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -175,6 +175,16 @@ ProcSignalInit(void)
/* Clear out any leftover signal reasons */
MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
+ /*
+ * Publish the PID before reading the global barrier generation to ensure
+ * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+ * older generation. We need a memory barrier here to make sure that the
+ * update of pss_pid is ordered before the subsequent load of
+ * psh_barrierGeneration.
+ */
+ slot->pss_pid = MyProcPid;
+ pg_memory_barrier();
+
/*
* Initialize barrier state. Since we're a brand-new process, there
* shouldn't be any leftover backend-private state that needs to be
@@ -192,9 +202,6 @@ ProcSignalInit(void)
pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
pg_memory_barrier();
- /* Mark slot with my PID */
- slot->pss_pid = MyProcPid;
-
/* Remember slot location for CheckProcSignal */
MyProcSignalSlot = slot;
--
2.54.0
[text/x-patch] REL15_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch (2.9K, 3-REL15_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch)
download | inline diff:
From 4979dfae9f8638627e5fb79cb0079e00883fd761 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v1] Fix race between ProcSignalInit() and
EmitProcSignalBarrier().
Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into pss_pid. This created a race condition: a
process could initialize its local generation with an older global
value, while a concurrent EmitProcSignalBarrier() might skip that
process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.
Fix this by publishing pss_pid before reading psh_barrierGeneration
with a memory barrier so that the store to pss_pid is ordered before
the load. A concurrent EmitProcSignalBarrier() then either observes
the published PID and signals this slot, or completes its generation
increment before we load it.
While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue is theoretically present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). v14 has the
procsignal barrier infrastructure but no in-tree caller that actually
emits a barrier, so the case is unreachable there.
This issue was also reported by buildfarm member flaviventris.
Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com
Backpatch-through: 15
---
src/backend/storage/ipc/procsignal.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 21a9fc0fdd2..f710815d9ec 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -175,6 +175,16 @@ ProcSignalInit(int pss_idx)
/* Clear out any leftover signal reasons */
MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
+ /*
+ * Publish the PID before reading the global barrier generation to ensure
+ * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+ * older generation. We need a memory barrier here to make sure that the
+ * update of pss_pid is ordered before the subsequent load of
+ * psh_barrierGeneration.
+ */
+ slot->pss_pid = MyProcPid;
+ pg_memory_barrier();
+
/*
* Initialize barrier state. Since we're a brand-new process, there
* shouldn't be any leftover backend-private state that needs to be
@@ -192,9 +202,6 @@ ProcSignalInit(int pss_idx)
pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
pg_memory_barrier();
- /* Mark slot with my PID */
- slot->pss_pid = MyProcPid;
-
/* Remember slot location for CheckProcSignal */
MyProcSignalSlot = slot;
--
2.54.0
[text/x-patch] master_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch (3.0K, 4-master_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch)
download | inline diff:
From 144ace5abf197b4435d9aa1e7525614c0a8ae70f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v1] Fix race between ProcSignalInit() and
EmitProcSignalBarrier().
Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into pss_pid. This created a race condition: a
process could initialize its local generation with an older global
value, while a concurrent EmitProcSignalBarrier() might skip that
process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.
Fix this by publishing pss_pid before reading psh_barrierGeneration
with a memory barrier so that the store to pss_pid is ordered before
the load. A concurrent EmitProcSignalBarrier() then either observes
the published PID and signals this slot, or completes its generation
increment before we load it.
While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue is theoretically present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). v14 has the
procsignal barrier infrastructure but no in-tree caller that actually
emits a barrier, so the case is unreachable there.
This issue was also reported by buildfarm member flaviventris.
Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com
Backpatch-through: 15
---
src/backend/storage/ipc/procsignal.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 264e4c22ca6..1397f65f67b 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -188,6 +188,15 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
/* Clear out any leftover signal reasons */
MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
+ /*
+ * Publish the PID before reading the global barrier generation to ensure
+ * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+ * older generation. We need a memory barrier here to make sure that the
+ * update of pss_pid is ordered before the subsequent load of
+ * psh_barrierGeneration.
+ */
+ pg_atomic_write_membarrier_u32(&slot->pss_pid, MyProcPid);
+
/*
* Initialize barrier state. Since we're a brand-new process, there
* shouldn't be any leftover backend-private state that needs to be
@@ -207,7 +216,6 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
if (cancel_key_len > 0)
memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
slot->pss_cancel_key_len = cancel_key_len;
- pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
SpinLockRelease(&slot->pss_mutex);
--
2.54.0
[text/x-patch] REL16_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch (2.9K, 5-REL16_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch)
download | inline diff:
From 8b303ee35ad640299c5706bceb401a2706a5be2f Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v1] Fix race between ProcSignalInit() and
EmitProcSignalBarrier().
Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into pss_pid. This created a race condition: a
process could initialize its local generation with an older global
value, while a concurrent EmitProcSignalBarrier() might skip that
process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.
Fix this by publishing pss_pid before reading psh_barrierGeneration
with a memory barrier so that the store to pss_pid is ordered before
the load. A concurrent EmitProcSignalBarrier() then either observes
the published PID and signals this slot, or completes its generation
increment before we load it.
While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue is theoretically present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). v14 has the
procsignal barrier infrastructure but no in-tree caller that actually
emits a barrier, so the case is unreachable there.
This issue was also reported by buildfarm member flaviventris.
Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com
Backpatch-through: 15
---
src/backend/storage/ipc/procsignal.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index c85cb5cc18d..9dfe000353d 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -176,6 +176,16 @@ ProcSignalInit(int pss_idx)
/* Clear out any leftover signal reasons */
MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
+ /*
+ * Publish the PID before reading the global barrier generation to ensure
+ * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+ * older generation. We need a memory barrier here to make sure that the
+ * update of pss_pid is ordered before the subsequent load of
+ * psh_barrierGeneration.
+ */
+ slot->pss_pid = MyProcPid;
+ pg_memory_barrier();
+
/*
* Initialize barrier state. Since we're a brand-new process, there
* shouldn't be any leftover backend-private state that needs to be
@@ -193,9 +203,6 @@ ProcSignalInit(int pss_idx)
pg_atomic_write_u64(&slot->pss_barrierGeneration, barrier_generation);
pg_memory_barrier();
- /* Mark slot with my PID */
- slot->pss_pid = MyProcPid;
-
/* Remember slot location for CheckProcSignal */
MyProcSignalSlot = slot;
--
2.54.0
[text/x-patch] REL18_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch (3.0K, 6-REL18_v1-0001-Fix-race-between-ProcSignalInit-and-EmitProcSigna.patch)
download | inline diff:
From 921c2e145f081c6acc05e6da2f0d14ac747d2cf0 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Tue, 28 Apr 2026 12:21:21 -0700
Subject: [PATCH v1] Fix race between ProcSignalInit() and
EmitProcSignalBarrier().
Previously, ProcSignalInit() read the global barrier generation before
publishing its PID into pss_pid. This created a race condition: a
process could initialize its local generation with an older global
value, while a concurrent EmitProcSignalBarrier() might skip that
process because its pss_pid was still zero. This resulted in
WaitForProcSignalBarrier() hanging indefinitely.
Fix this by publishing pss_pid before reading psh_barrierGeneration
with a memory barrier so that the store to pss_pid is ordered before
the load. A concurrent EmitProcSignalBarrier() then either observes
the published PID and signals this slot, or completes its generation
increment before we load it.
While this race has become more visible due to recent features using
signal barriers in more places (such as online wal_level changes), the
issue is theoretically present since signal barriers were introduced
to release smgr caches (e.g., in DROP DATABASE). v14 has the
procsignal barrier infrastructure but no in-tree caller that actually
emits a barrier, so the case is unreachable there.
This issue was also reported by buildfarm member flaviventris.
Reported-by: Melanie Plageman <[email protected]>
Reviewed-by: Alexander Lakhin <[email protected]>
Reviewed-by: Matthias van de Meent <[email protected]>
Discussion: https://postgr.es/m/CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com
Backpatch-through: 15
---
src/backend/storage/ipc/procsignal.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 05d99b452c3..e7c9da2b940 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -185,6 +185,15 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
/* Clear out any leftover signal reasons */
MemSet(slot->pss_signalFlags, 0, NUM_PROCSIGNALS * sizeof(sig_atomic_t));
+ /*
+ * Publish the PID before reading the global barrier generation to ensure
+ * that EmitProcSignalBarrier() doesn't skip us while we are grabbing an
+ * older generation. We need a memory barrier here to make sure that the
+ * update of pss_pid is ordered before the subsequent load of
+ * psh_barrierGeneration.
+ */
+ pg_atomic_write_membarrier_u32(&slot->pss_pid, MyProcPid);
+
/*
* Initialize barrier state. Since we're a brand-new process, there
* shouldn't be any leftover backend-private state that needs to be
@@ -204,7 +213,6 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
if (cancel_key_len > 0)
memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
slot->pss_cancel_key_len = cancel_key_len;
- pg_atomic_write_u32(&slot->pss_pid, MyProcPid);
SpinLockRelease(&slot->pss_mutex);
--
2.54.0
* Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
2026-04-22 11:21 Startup process deadlock: WaitForProcSignalBarriers vs aux process Matthias van de Meent <[email protected]>
2026-04-22 19:05 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Andres Freund <[email protected]>
2026-04-24 17:52 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-04-27 18:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
2026-04-28 19:27 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-04-29 18:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
2026-04-30 22:08 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-05-01 08:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
2026-05-07 17:17 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-05-14 21:47 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
@ 2026-05-22 23:26 ` Matthias van de Meent <[email protected]>
2026-05-27 23:28 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
0 siblings, 1 reply; 13+ messages in thread
From: Matthias van de Meent @ 2026-05-22 23:26 UTC (permalink / raw)
To: Masahiko Sawada <[email protected]>; +Cc: Alexander Lakhin <[email protected]>; Andres Freund <[email protected]>; Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>; Andrey Borodin <[email protected]>
On Thu, 14 May 2026 at 14:48, Masahiko Sawada <[email protected]> wrote:
>
> While reviewing the patches, I realized that it would be better to use
> pg_atomic_write_membarrier_u32() instead of pg_atomic_write_u32() +
> pg_memory_barrier() where available. I've updated the patch for master
> and 18, and slightly commit messages.
LGTM, thanks for getting this fixed!
Kind regards,
Matthias van de Meent
Databricks
* Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
2026-04-22 11:21 Startup process deadlock: WaitForProcSignalBarriers vs aux process Matthias van de Meent <[email protected]>
2026-04-22 19:05 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Andres Freund <[email protected]>
2026-04-24 17:52 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-04-27 18:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
2026-04-28 19:27 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-04-29 18:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
2026-04-30 22:08 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-05-01 08:00 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Alexander Lakhin <[email protected]>
2026-05-07 17:17 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-05-14 21:47 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Masahiko Sawada <[email protected]>
2026-05-22 23:26 ` Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process Matthias van de Meent <[email protected]>
@ 2026-05-27 23:28 ` Masahiko Sawada <[email protected]>
0 siblings, 0 replies; 13+ messages in thread
From: Masahiko Sawada @ 2026-05-27 23:28 UTC (permalink / raw)
To: Matthias van de Meent <[email protected]>; +Cc: Alexander Lakhin <[email protected]>; Andres Freund <[email protected]>; Thomas Munro <[email protected]>; PostgreSQL Hackers <[email protected]>; Heikki Linnakangas <[email protected]>; Andrey Borodin <[email protected]>
On Fri, May 22, 2026 at 4:26 PM Matthias van de Meent
<[email protected]> wrote:
>
> On Thu, 14 May 2026 at 14:48, Masahiko Sawada <[email protected]> wrote:
> >
> > While reviewing the patches, I realized that it would be better to use
> > pg_atomic_write_membarrier_u32() instead of pg_atomic_write_u32() +
> > pg_memory_barrier() where available. I've updated the patch for master
> > and 18, and slightly revised the commit messages.
>
> LGTM, thanks for getting this fixed!
>
Pushed the fix down to v15. Thank you for reviewing the patches!
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
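
For readers unfamiliar with the change quoted above (one combined call in place of pg_atomic_write_u32() followed by pg_memory_barrier()), here is a minimal sketch of the two shapes using standard C11 atomics. It is an illustration only: it does not use PostgreSQL's headers, the names shared_flag, write_then_barrier, write_membarrier and read_flag are hypothetical, and the use of an atomic exchange for the combined form is an assumption of this sketch, not a statement about how PostgreSQL implements pg_atomic_write_membarrier_u32().

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared flag for illustration; not a PostgreSQL structure. */
static _Atomic uint32_t shared_flag;

/*
 * Two-step form: a relaxed atomic store followed by a separate full fence.
 * Loosely analogous to a plain atomic write plus an explicit memory barrier.
 */
static void
write_then_barrier(_Atomic uint32_t *ptr, uint32_t val)
{
    atomic_store_explicit(ptr, val, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}

/*
 * Combined form: one sequentially consistent read-modify-write whose old
 * value is discarded, so the store and its ordering are a single operation.
 */
static void
write_membarrier(_Atomic uint32_t *ptr, uint32_t val)
{
    (void) atomic_exchange_explicit(ptr, val, memory_order_seq_cst);
}

/* Sequentially consistent read of the flag. */
static uint32_t
read_flag(_Atomic uint32_t *ptr)
{
    return atomic_load_explicit(ptr, memory_order_seq_cst);
}
```

Both helpers leave the same value in shared_flag. The difference is only whether the ordering guarantee is attached to the store itself or supplied by a separate fence after it, which makes the combined form harder to get wrong at call sites.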
Thread overview: 13+ messages
2026-04-22 11:21 Startup process deadlock: WaitForProcSignalBarriers vs aux process Matthias van de Meent <[email protected]>
2026-04-22 19:05 ` Andres Freund <[email protected]>
2026-04-24 17:52 ` Masahiko Sawada <[email protected]>
2026-04-27 18:00 ` Alexander Lakhin <[email protected]>
2026-04-28 19:27 ` Masahiko Sawada <[email protected]>
2026-04-29 10:49 ` Matthias van de Meent <[email protected]>
2026-04-29 18:00 ` Alexander Lakhin <[email protected]>
2026-04-30 22:08 ` Masahiko Sawada <[email protected]>
2026-05-01 08:00 ` Alexander Lakhin <[email protected]>
2026-05-07 17:17 ` Masahiko Sawada <[email protected]>
2026-05-14 21:47 ` Masahiko Sawada <[email protected]>
2026-05-22 23:26 ` Matthias van de Meent <[email protected]>
2026-05-27 23:28 ` Masahiko Sawada <[email protected]>