From: Alexander Lakhin
To: Masahiko Sawada, Andres Freund
Cc: Matthias van de Meent, Thomas Munro, PostgreSQL Hackers, Heikki Linnakangas
Date: Mon, 27 Apr 2026 21:00:00 +0300
Subject: Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
Message-ID: <4358bd85-f6b4-4da6-9909-74428fe3c8f7@gmail.com>

Hello Sawada-san,

24.04.2026 20:52, Masahiko Sawada wrote:
> Right. The postmaster blocks all signals before starting child process
> as the following comment explains:
>
>     /*
>      * We start postmaster children with signals blocked.  This allows them to
>      * install their own handlers before unblocking, to avoid races where they
>      * might run the postmaster's handler and miss an important control
>      * signal. With more analysis this could potentially be relaxed.
>      */
>     sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
>
> Investigating the issue, I found there is a race condition between the
> procsignal initialization and emitting signal barrier that could be
> the cause of this issue. Imagine the following scenario:
>
> 1. In ProcSignalInit(), the checkpointer initializes its
> slot->pss_barrierGeneration with the global generation.
> 2. In EmitProcSignalBarrier(), the startup checks the checkpointer's
> procsignal slot but it skips emitting the signal as slot->pss_pid is
> still 0. It can happen even though the checkpointer holds a spinlock
> on its slot during the initialization because the first pid check is
> done without a spinlock acquisition.
> 3. The checkpointer sets its pid to slot->pss_pid and releases the spin lock.
> 4. In WaitForProcSignalBarrier(), the startup checks the
> checkpointer's procsignal slot that has already initialized the
> pss_barrierGeneration, and waits for it to be updated. However, the
> checkpointer never updates its barrier generation as it doesn't get
> the signal.

Thank you for the investigation and explanation of the issue!
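
To check that I read the four quoted steps correctly, here is a rough,
single-threaded model of the interleaving. It is only a sketch, not the
procsignal.c code: ModelSlot, barrier_wait_would_hang(), the pid 1234 and
the starting generation 5 are all made up for illustration, and the
"signal" is just a flag. It assumes an unused slot carries the maximum
generation value and that the waiter waits while the slot's generation is
below the emitted one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for one procsignal slot. */
typedef struct ModelSlot
{
    uint32_t pid;                /* 0 = owner has not published its pid */
    uint64_t barrier_generation; /* UINT64_MAX = slot unused */
    bool     signal_pending;     /* models a delivered barrier signal */
} ModelSlot;

/*
 * Replays the steps in a fixed order and reports whether the waiter
 * would still be waiting at the end.
 */
bool
barrier_wait_would_hang(bool emit_before_pid_published)
{
    uint64_t  global_generation = 5;    /* arbitrary starting value */
    uint64_t  target;
    ModelSlot slot = {0, UINT64_MAX, false};

    /* Step 1: the new process copies the global generation into its slot. */
    slot.barrier_generation = global_generation;

    /* Benign ordering: the pid is visible before the emitter looks. */
    if (!emit_before_pid_published)
        slot.pid = 1234;

    /* Step 2: the emitter bumps the generation, signals only if pid != 0. */
    target = ++global_generation;
    if (slot.pid != 0)
        slot.signal_pending = true;

    /* Step 3: racy ordering publishes the pid after the emitter's check. */
    if (emit_before_pid_published)
    {
        assert(!slot.signal_pending);   /* the signal was skipped */
        slot.pid = 1234;
    }

    /* The slot owner absorbs the barrier only if it was signalled. */
    if (slot.signal_pending)
    {
        slot.barrier_generation = global_generation;
        slot.signal_pending = false;
    }

    /* Step 4: the waiter keeps waiting while the slot lags the target. */
    return slot.barrier_generation < target;
}
```

In this model barrier_wait_would_hang(true) reports a hang, because the
slot keeps the old generation and no signal is pending, while
barrier_wait_would_hang(false) does not.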

I've been puzzled by a buildfarm failure [1] with such symptoms for a while
and even reproduced it locally once, but couldn't gather more information
at that time. But now that you have described the scenario, I can easily
reproduce the same test failure with:
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -206,6 +206,7 @@ ProcSignalInit(const uint8 *cancel_key, int cancel_key_len)
        if (cancel_key_len > 0)
                memcpy(slot->pss_cancel_key, cancel_key, cancel_key_len);
        slot->pss_cancel_key_len = cancel_key_len;
+pg_usleep(10000);
        pg_atomic_write_u32(&slot->pss_pid, MyProcPid);

and just running `meson test test_oat_hooks_*/regress` with the test multiplied 30 times:
26/30 test_oat_hooks_28 - postgresql:test_oat_hooks_28/regress         OK                1.28s   2 subtests passed
27/30 test_oat_hooks_30 - postgresql:test_oat_hooks_30/regress         OK                1.25s   2 subtests passed
28/30 test_oat_hooks_2 - postgresql:test_oat_hooks_2/regress           ERROR            62.49s   exit status 2

2026-04-27 17:34:44.290 UTC postmaster[1578102] LOG:  starting PostgreSQL 19devel on x86_64-linux, compiled by gcc-16.0.1, 64-bit
2026-04-27 17:34:44.290 UTC postmaster[1578102] LOG:  listening on Unix socket "/tmp/pg_regress-QdhMPt/.s.PGSQL.40086"
2026-04-27 17:34:44.302 UTC startup[1578114] LOG:  database system was shut down at 2026-04-27 17:34:44 UTC
2026-04-27 17:34:44.325 UTC dead-end client backend[1578133] [unknown] FATAL:  the database system is starting up
...
2026-04-27 17:34:49.274 UTC dead-end client backend[1578643] [unknown] FATAL:  the database system is starting up
2026-04-27 17:34:49.308 UTC startup[1578114] LOG:  still waiting for backend with PID 1578110 to accept ProcSignalBarrier
2026-04-27 17:34:49.325 UTC dead-end client backend[1578645] [unknown] FATAL:  the database system is starting up
...
2026-04-27 17:35:44.332 UTC dead-end client backend[1582376] [unknown] FATAL:  the database system is starting up
2026-04-27 17:35:44.351 UTC startup[1578114] LOG:  still waiting for backend with PID 1578110 to accept ProcSignalBarrier
2026-04-27 17:35:44.383 UTC dead-end client backend[1582379] [unknown] FATAL:  the database system is starting up


[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2026-03-10%2013%3A58%3A55

Best regards,
Alexander