public inbox for [email protected]
help / color / mirror / Atom feedFrom: Matthias van de Meent <[email protected]>
To: Masahiko Sawada <[email protected]>
Cc: Alexander Lakhin <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Heikki Linnakangas <[email protected]>
Subject: Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
Date: Wed, 29 Apr 2026 12:49:24 +0200
Message-ID: <CAEze2WhhTnSLpjGJWGupbxkTp_JdNP6v0mNgpqhi_YkXJa=m6A@mail.gmail.com> (raw)
In-Reply-To: <CAD21AoBDjU4fi+QgaC7RimqrDoUi9Ma+oBP10i0Je13NPSYcuw@mail.gmail.com>
References: <CAEze2WgAJmWReDN7Chtba8Er2YBvKCoa0KVN25-1evnTrHsLyA@mail.gmail.com>
<ojnnvj553eber32njg5nphybcf6zvb3bzexfkh3obylsll67pc@jk7jmtzeb66k>
<CAD21AoBj+zKvgw_Q8gjr4YbKccW_uMe3OFQ5+KT246FHUuNXSQ@mail.gmail.com>
<[email protected]>
<CAD21AoBDjU4fi+QgaC7RimqrDoUi9Ma+oBP10i0Je13NPSYcuw@mail.gmail.com>
On Wed, 22 Apr 2026 at 21:05, Andres Freund <[email protected]> wrote:
>
> Hi,
>
> On 2026-04-22 13:21:02 +0200, Matthias van de Meent wrote:
> > If the PSB is emitted (and signaled to checkpointer) before the
> > checkpointer has registered its SIGUSR1 handler, then the checkpointer
> > won't receive the notice to check its procsignal slots, it won't
> > notice the updated procsignal flags, and it won't process the PSB; not
> > until it receives a new SIGUSR1.
> >
> > Signals are sent to all processes that have their procsignal pss_pid
> > set, which is true for every process which has called ProcSignalInit,
> > which for the checkpointer (like other aux processes) happens in
> > AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
> > processes) calls AuxiliaryProcessMainCommon before registering its
> > signal handlers, creating a small window in time where signals are
> > sent, but not handled.
>
> Hm. Have we confirmed this happens?
>
> CheckpointerMain() is called with all signals masked, so it should be ok for
> the signal handler to only be set up after AuxiliaryProcessMainCommon(), as
> long as it happens before [...]
Yeah, that was a misidentification of the exact race that caused the issue.
On Tue, 28 Apr 2026 at 21:28, Masahiko Sawada <[email protected]> wrote:
>
> On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin <[email protected]> wrote:
> >
> > Hello Sawada-san,
> >
> > 24.04.2026 20:52, Masahiko Sawada wrote:
> >
> > Right. The postmaster blocks all signals before starting child process
> > as the following comment explains:
> >
> > /*
> > * We start postmaster children with signals blocked. This allows them to
> > * install their own handlers before unblocking, to avoid races where they
> > * might run the postmaster's handler and miss an important control
> > * signal. With more analysis this could potentially be relaxed.
> > */
> > sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
> >
> > Investigating the issue, I found there is a race condition between the
> > procsignal initialization and emitting signal barrier that could be
> > the cause of this issue. Imagine the following scenario:
Ah, that'd be it indeed. Thanks!
> I've attached a patch to address the issue. I haven't verified it
> across all versions yet, but I suspect it exists in the stable
> branches as well. Previously, the issue rarely occurred because
> EmitProcSignalBarrier() was only used for smgr invalidation. However,
> now that we use signal barriers for online wal_level changes and
> checksum status updates, this race condition is likely to be
> encountered more frequently.
Yes, I think the boot process with the xlog_logical_info barrier is
more likely to hit this issue; as indicated by two known detected
cases in various CI jobs; though it could also be that the lockup of
the new barrier is just exceptionally bad for system stability.
As for the patches:
v1-0001 -- LGTM.
0001 (upthread): LGTM, but I'd also suggest to add some code to make
sure that we're actually receiving procsignals by the time we
initialize the Logical/Checksum subsystems that need to process shared
state changes by responding to procsignals; as attached. smgr's
procsignal doesn't really depend on shared memory state, so I've kept
that out of my patch.
Kind regards,
Matthias van de Meent
Databricks (https://www.databricks.com)
Attachments:
[application/octet-stream] v1-0001-Assert-ProcSignal-is-initialized-before-its-depen.patch (2.4K, 2-v1-0001-Assert-ProcSignal-is-initialized-before-its-depen.patch)
download | inline diff:
From 8a1dc18bbcf11a2eb36cfd3dbb290976d87284d1 Mon Sep 17 00:00:00 2001
From: Matthias van de Meent <[email protected]>
Date: Wed, 29 Apr 2026 12:10:44 +0200
Subject: [PATCH v1] Assert ProcSignal is initialized before its dependents
---
src/backend/access/transam/xlog.c | 1 +
src/backend/replication/logical/logicalctl.c | 1 +
src/backend/storage/ipc/procsignal.c | 8 ++++++++
src/include/storage/procsignal.h | 5 +++++
4 files changed, 15 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e39af79c03b..63e84b00cec 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4960,6 +4960,7 @@ SetDataChecksumsOff(void)
void
InitLocalDataChecksumState(void)
{
+ Assert(ProcSignalIsInitialized());
SpinLockAcquire(&XLogCtl->info_lck);
SetLocalDataChecksumState(XLogCtl->data_checksum_version);
SpinLockRelease(&XLogCtl->info_lck);
diff --git a/src/backend/replication/logical/logicalctl.c b/src/backend/replication/logical/logicalctl.c
index 72f68ec58ef..80308b619a4 100644
--- a/src/backend/replication/logical/logicalctl.c
+++ b/src/backend/replication/logical/logicalctl.c
@@ -173,6 +173,7 @@ update_xlog_logical_info(void)
void
InitializeProcessXLogLogicalInfo(void)
{
+ Assert(ProcSignalIsInitialized());
update_xlog_logical_info();
}
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index b0681ca0ae2..71a0b25e49e 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -232,6 +232,14 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
on_shmem_exit(CleanupProcSignalState, (Datum) 0);
}
+#ifdef USE_ASSERT_CHECKING
+bool
+ProcSignalIsInitialized(void)
+{
+ return MyProcSignalSlot != NULL;
+}
+#endif
+
/*
* CleanupProcSignalState
* Remove current process from ProcSignal mechanism
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index aaa158bfd66..1d2290c6975 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -87,4 +87,9 @@ typedef struct ProcSignalHeader ProcSignalHeader;
extern PGDLLIMPORT ProcSignalHeader *ProcSignal;
#endif
+#ifdef USE_ASSERT_CHECKING
+extern bool ProcSignalIsInitialized(void);
+#endif
+
+
#endif /* PROCSIGNAL_H */
--
2.50.1 (Apple Git-155)
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Subject: Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
In-Reply-To: <CAEze2WhhTnSLpjGJWGupbxkTp_JdNP6v0mNgpqhi_YkXJa=m6A@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox