Hello Sawada-san,

24.04.2026 20:52, Masahiko Sawada wrote:
Right. The postmaster blocks all signals before starting a child process,
as the following comment explains:

     /*
      * We start postmaster children with signals blocked.  This allows them to
      * install their own handlers before unblocking, to avoid races where they
      * might run the postmaster's handler and miss an important control
      * signal. With more analysis this could potentially be relaxed.
      */
     sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);

Investigating the issue, I found there is a race condition between the
procsignal initialization and emitting signal barrier that could be
the cause of this issue. Imagine the following scenario:

1. In ProcSignalInit(), the checkpointer initializes its
slot->pss_barrierGeneration with the global generation.
2. In EmitProcSignalBarrier(), the startup checks the checkpointer's
procsignal slot but it skips emitting the signal as slot->pss_pid is
still 0. It can happen even though the checkpointer holds a spinlock
on its slot during the initialization because the first pid check is
done without a spinlock acquisition.
3. The checkpointer sets its pid to slot->pss_pid and releases the spin lock.
4. In WaitForProcSignalBarrier(), the startup checks the
checkpointer's procsignal slot that has already initialized the
pss_barrierGeneration, and waits for it to be updated. However, the
checkpointer never updates its barrier generation as it doesn't get
the signal.
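
The interleaving in steps 1-4 can be replayed in a small single-threaded C
model. This is only a sketch under stated assumptions, not the real
procsignal.c code: ModelSlot, the model_* helpers, the replay_* functions and
the "signaled" flag are hypothetical names made up for illustration. The model
also assumes that the emitter advances the global generation before it scans
the slots, and that a free slot carries a maximal generation so that waiters
skip it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for one procsignal slot (hypothetical, not the real struct). */
typedef struct ModelSlot
{
    uint32_t pss_pid;               /* 0 while the slot is not yet published */
    uint64_t pss_barrierGeneration; /* last barrier generation absorbed */
    bool     signaled;              /* model-only flag: the emitter sent a signal */
} ModelSlot;

/* Model-only global barrier generation. */
static uint64_t model_global_generation;

static ModelSlot model_free_slot(void)
{
    /* Assumption: a free slot holds the maximum generation, so waiters skip it. */
    ModelSlot slot = { 0, UINT64_MAX, false };
    return slot;
}

/* Step 1: the initializing process copies the global generation. */
static void model_init_copy_generation(ModelSlot *slot)
{
    slot->pss_barrierGeneration = model_global_generation;
}

/* Step 3: the initializing process publishes its pid. */
static void model_init_publish_pid(ModelSlot *slot, uint32_t pid)
{
    assert(pid != 0);
    slot->pss_pid = pid;
}

/* Step 2: the emitter advances the generation and signals only published slots. */
static uint64_t model_emit_barrier(ModelSlot *slot)
{
    uint64_t generation = ++model_global_generation;

    if (slot->pss_pid != 0)     /* models the unlocked pid check */
        slot->signaled = true;
    return generation;
}

/* Step 4: true if the waiter would wait on a slot that never catches up. */
static bool model_waiter_is_stuck(const ModelSlot *slot, uint64_t generation)
{
    return slot->pss_barrierGeneration < generation && !slot->signaled;
}

/* Interleaving from the scenario: copy generation, emit, publish pid, wait. */
static bool replay_race(void)
{
    ModelSlot slot = model_free_slot();
    uint64_t  generation;

    model_global_generation = 1;
    model_init_copy_generation(&slot);                  /* step 1 */
    generation = model_emit_barrier(&slot);             /* step 2: pid is 0, skipped */
    model_init_publish_pid(&slot, 4242);                /* step 3: arbitrary pid */
    return model_waiter_is_stuck(&slot, generation);    /* step 4 */
}

/* Control case: initialization completes before the barrier is emitted. */
static bool replay_init_first(void)
{
    ModelSlot slot = model_free_slot();
    uint64_t  generation;

    model_global_generation = 1;
    model_init_copy_generation(&slot);
    model_init_publish_pid(&slot, 4242);
    generation = model_emit_barrier(&slot);             /* pid is set, signaled */
    return model_waiter_is_stuck(&slot, generation);
}
```

In this model replay_race() returns true (the waiter is stuck on a slot that
holds an old generation and was never signaled), while replay_init_first()
returns false. The model treats a signaled process as one that eventually
absorbs the barrier; it says nothing about how the real code should close the
window.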

Thank you for the investigation and explanation of the issue!

I've been puzzled by a buildfarm failure [1] with such symptoms for a while
and even reproduced it locally once, but couldn't gather more information
at that time. But now that you have described the scenario, I can easily
reproduce the same test failure with:
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
        if (cancel_key_len > 0)
                memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
        slot->pss_cancel_key_len = cancel_key_len;
+pg_usleep(10000);
        pg_atomic_write_u32(&slot->pss_pid, MyProcPid);

and then just running `meson test test_oat_hooks_*/regress` with the test multiplied 30 times:
26/30 test_oat_hooks_28 - postgresql:test_oat_hooks_28/regress         OK                1.28s   2 subtests passed
27/30 test_oat_hooks_30 - postgresql:test_oat_hooks_30/regress         OK                1.25s   2 subtests passed
28/30 test_oat_hooks_2 - postgresql:test_oat_hooks_2/regress           ERROR            62.49s   exit status 2

2026-04-27 17:34:44.290 UTC postmaster[1578102] LOG:  starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-16.0.1, 64-bit
2026-04-27 17:34:44.290 UTC postmaster[1578102] LOG:  listening on Unix socket "/tmp/pg_regress-QdhMPt/.s.PGSQL.40086"
2026-04-27 17:34:44.302 UTC startup[1578114] LOG:  database system was shut down at 2026-04-27 17:34:44 UTC
2026-04-27 17:34:44.325 UTC dead-end client backend[1578133] [unknown] FATAL:  the database system is starting up
...
2026-04-27 17:34:49.274 UTC dead-end client backend[1578643] [unknown] FATAL:  the database system is starting up
2026-04-27 17:34:49.308 UTC startup[1578114] LOG:  still waiting for backend with PID 1578110 to accept ProcSignalBarrier
2026-04-27 17:34:49.325 UTC dead-end client backend[1578645] [unknown] FATAL:  the database system is starting up
...
2026-04-27 17:35:44.332 UTC dead-end client backend[1582376] [unknown] FATAL:  the database system is starting up
2026-04-27 17:35:44.351 UTC startup[1578114] LOG:  still waiting for backend with PID 1578110 to accept ProcSignalBarrier
2026-04-27 17:35:44.383 UTC dead-end client backend[1582379] [unknown] FATAL:  the database system is starting up


[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2026-03-10%2013%3A58%3A55

Best regards,
Alexander