From: Matthias van de Meent
Date: Wed, 29 Apr 2026 12:49:24 +0200
Subject: Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
To: Masahiko Sawada
Cc: Alexander Lakhin, Andres Freund, Thomas Munro, PostgreSQL Hackers, Heikki Linnakangas
References: <4358bd85-f6b4-4da6-9909-74428fe3c8f7@gmail.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="000000000000aab6890650971ae7"

--000000000000aab6890650971ae7
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Wed, 22 Apr 2026 at 21:05, Andres Freund wrote:
>
> Hi,
>
> On 2026-04-22 13:21:02 +0200, Matthias van de Meent wrote:
> > If the PSB is emitted (and signaled to checkpointer) before the
> > checkpointer has registered its SIGUSR1 handler, then the checkpointer
> > won't receive the notice to check its procsignal slots, it won't
> > notice the updated procsignal flags, and it won't process the PSB; not
> > until it receives a new SIGUSR1.
> >
> > Signals are sent to all processes that have their procsignal pss_pid
> > set, which is true for every process which has called ProcSignalInit,
> > which for the checkpointer (like other aux processes) happens in
> > AuxiliaryProcessMainCommon. However, checkpointer (also like other aux
> > processes) calls AuxiliaryProcessMainCommon before registering its
> > signal handlers, creating a small window in time where signals are
> > sent, but not handled.
>
> Hm. Have we confirmed this happens?
>
> CheckpointerMain() is called with all signals masked, so it should be ok for
> the signal handler to only be set up after AuxiliaryProcessMainCommon(), as
> long as it happens before [...]

Yeah, that was a misidentification of the exact race that caused the issue.

On Tue, 28 Apr 2026 at 21:28, Masahiko Sawada wrote:
>
> On Mon, Apr 27, 2026 at 11:00 AM Alexander Lakhin wrote:
> >
> > Hello Sawada-san,
> >
> > 24.04.2026 20:52, Masahiko Sawada wrote:
> >
> > Right. The postmaster blocks all signals before starting child process
> > as the following comment explains:
> >
> >     /*
> >      * We start postmaster children with signals blocked.  This allows them to
> >      * install their own handlers before unblocking, to avoid races where they
> >      * might run the postmaster's handler and miss an important control
> >      * signal.  With more analysis this could potentially be relaxed.
> >      */
> >     sigprocmask(SIG_SETMASK, &BlockSig, &save_mask);
> >
> > Investigating the issue, I found there is a race condition between the
> > procsignal initialization and emitting signal barrier that could be
> > the cause of this issue. Imagine the following scenario:

Ah, that'd be it indeed. Thanks!

> I've attached a patch to address the issue. I haven't verified it
> across all versions yet, but I suspect it exists in the stable
> branches as well. Previously, the issue rarely occurred because
> EmitProcSignalBarrier() was only used for smgr invalidation.
> However, now that we use signal barriers for online wal_level changes
> and checksum status updates, this race condition is likely to be
> encountered more frequently.

Yes, I think the boot process with the xlog_logical_info barrier is
more likely to hit this issue, as indicated by the two known cases
detected in various CI jobs, though it could also be that the lockup of
the new barrier is just exceptionally bad for system stability.

As for the patches:

v1-0001 -- LGTM.

0001 (upthread): LGTM, but I'd also suggest adding some code to make
sure that we're actually receiving procsignals by the time we
initialize the Logical/Checksum subsystems that need to process shared
state changes by responding to procsignals; as attached. smgr's
procsignal doesn't really depend on shared memory state, so I've kept
that out of my patch.

Kind regards,

Matthias van de Meent
Databricks (https://www.databricks.com)

--000000000000aab6890650971ae7
Content-Type: application/octet-stream; name="v1-0001-Assert-ProcSignal-is-initialized-before-its-depen.patch"
Content-Disposition: attachment; filename="v1-0001-Assert-ProcSignal-is-initialized-before-its-depen.patch"
Content-Transfer-Encoding: base64
X-Attachment-Id: f_mojxm31t0

RnJvbSA4YTFkYzE4YmJjZjExYTJlYjM2Y2ZkM2RiYjI5MDk3NmQ4NzI4NGQxIE1vbiBTZXAgMTcg MDA6MDA6MDAgMjAwMQpGcm9tOiBNYXR0aGlhcyB2YW4gZGUgTWVlbnQgPGJvZWtld3VybStwb3N0 Z3Jlc0BnbWFpbC5jb20+CkRhdGU6IFdlZCwgMjkgQXByIDIwMjYgMTI6MTA6NDQgKzAyMDAKU3Vi amVjdDogW1BBVENIIHYxXSBBc3NlcnQgUHJvY1NpZ25hbCBpcyBpbml0aWFsaXplZCBiZWZvcmUg aXRzIGRlcGVuZGVudHMKCi0tLQogc3JjL2JhY2tlbmQvYWNjZXNzL3RyYW5zYW0veGxvZy5jICAg ICAgICAgICAgfCAxICsKIHNyYy9iYWNrZW5kL3JlcGxpY2F0aW9uL2xvZ2ljYWwvbG9naWNhbGN0 bC5jIHwgMSArCiBzcmMvYmFja2VuZC9zdG9yYWdlL2lwYy9wcm9jc2lnbmFsLmMgICAgICAgICB8 IDggKysrKysrKysKIHNyYy9pbmNsdWRlL3N0b3JhZ2UvcHJvY3NpZ25hbC5oICAgICAgICAgICAg IHwgNSArKysrKwogNCBmaWxlcyBjaGFuZ2VkLCAxNSBpbnNlcnRpb25zKCspCgpkaWZmIC0tZ2l0
IGEvc3JjL2JhY2tlbmQvYWNjZXNzL3RyYW5zYW0veGxvZy5jIGIvc3JjL2JhY2tlbmQvYWNjZXNz L3RyYW5zYW0veGxvZy5jCmluZGV4IGUzOWFmNzljMDNiLi42M2U4NGIwMGNlYyAxMDA2NDQKLS0t IGEvc3JjL2JhY2tlbmQvYWNjZXNzL3RyYW5zYW0veGxvZy5jCisrKyBiL3NyYy9iYWNrZW5kL2Fj Y2Vzcy90cmFuc2FtL3hsb2cuYwpAQCAtNDk2MCw2ICs0OTYwLDcgQEAgU2V0RGF0YUNoZWNrc3Vt c09mZih2b2lkKQogdm9pZAogSW5pdExvY2FsRGF0YUNoZWNrc3VtU3RhdGUodm9pZCkKIHsKKwlB c3NlcnQoUHJvY1NpZ25hbElzSW5pdGlhbGl6ZWQoKSk7CiAJU3BpbkxvY2tBY3F1aXJlKCZYTG9n Q3RsLT5pbmZvX2xjayk7CiAJU2V0TG9jYWxEYXRhQ2hlY2tzdW1TdGF0ZShYTG9nQ3RsLT5kYXRh X2NoZWNrc3VtX3ZlcnNpb24pOwogCVNwaW5Mb2NrUmVsZWFzZSgmWExvZ0N0bC0+aW5mb19sY2sp OwpkaWZmIC0tZ2l0IGEvc3JjL2JhY2tlbmQvcmVwbGljYXRpb24vbG9naWNhbC9sb2dpY2FsY3Rs LmMgYi9zcmMvYmFja2VuZC9yZXBsaWNhdGlvbi9sb2dpY2FsL2xvZ2ljYWxjdGwuYwppbmRleCA3 MmY2OGVjNThlZi4uODAzMDhiNjE5YTQgMTAwNjQ0Ci0tLSBhL3NyYy9iYWNrZW5kL3JlcGxpY2F0 aW9uL2xvZ2ljYWwvbG9naWNhbGN0bC5jCisrKyBiL3NyYy9iYWNrZW5kL3JlcGxpY2F0aW9uL2xv Z2ljYWwvbG9naWNhbGN0bC5jCkBAIC0xNzMsNiArMTczLDcgQEAgdXBkYXRlX3hsb2dfbG9naWNh bF9pbmZvKHZvaWQpCiB2b2lkCiBJbml0aWFsaXplUHJvY2Vzc1hMb2dMb2dpY2FsSW5mbyh2b2lk KQogeworCUFzc2VydChQcm9jU2lnbmFsSXNJbml0aWFsaXplZCgpKTsKIAl1cGRhdGVfeGxvZ19s b2dpY2FsX2luZm8oKTsKIH0KIApkaWZmIC0tZ2l0IGEvc3JjL2JhY2tlbmQvc3RvcmFnZS9pcGMv cHJvY3NpZ25hbC5jIGIvc3JjL2JhY2tlbmQvc3RvcmFnZS9pcGMvcHJvY3NpZ25hbC5jCmluZGV4 IGIwNjgxY2EwYWUyLi43MWEwYjI1ZTQ5ZSAxMDA2NDQKLS0tIGEvc3JjL2JhY2tlbmQvc3RvcmFn ZS9pcGMvcHJvY3NpZ25hbC5jCisrKyBiL3NyYy9iYWNrZW5kL3N0b3JhZ2UvaXBjL3Byb2NzaWdu YWwuYwpAQCAtMjMyLDYgKzIzMiwxNCBAQCBQcm9jU2lnbmFsSW5pdChjb25zdCB1aW50OCAqY2Fu Y2VsX2tleSwgaW50IGNhbmNlbF9rZXlfbGVuKQogCW9uX3NobWVtX2V4aXQoQ2xlYW51cFByb2NT aWduYWxTdGF0ZSwgKERhdHVtKSAwKTsKIH0KIAorI2lmZGVmIFVTRV9BU1NFUlRfQ0hFQ0tJTkcK K2Jvb2wKK1Byb2NTaWduYWxJc0luaXRpYWxpemVkKHZvaWQpCit7CisJcmV0dXJuIE15UHJvY1Np Z25hbFNsb3QgIT0gTlVMTDsKK30KKyNlbmRpZgorCiAvKgogICogQ2xlYW51cFByb2NTaWduYWxT dGF0ZQogICoJCVJlbW92ZSBjdXJyZW50IHByb2Nlc3MgZnJvbSBQcm9jU2lnbmFsIG1lY2hhbmlz 
bQpkaWZmIC0tZ2l0IGEvc3JjL2luY2x1ZGUvc3RvcmFnZS9wcm9jc2lnbmFsLmggYi9zcmMvaW5j bHVkZS9zdG9yYWdlL3Byb2NzaWduYWwuaAppbmRleCBhYWExNThiZmQ2Ni4uMWQyMjkwYzY5NzUg MTAwNjQ0Ci0tLSBhL3NyYy9pbmNsdWRlL3N0b3JhZ2UvcHJvY3NpZ25hbC5oCisrKyBiL3NyYy9p bmNsdWRlL3N0b3JhZ2UvcHJvY3NpZ25hbC5oCkBAIC04Nyw0ICs4Nyw5IEBAIHR5cGVkZWYgc3Ry dWN0IFByb2NTaWduYWxIZWFkZXIgUHJvY1NpZ25hbEhlYWRlcjsKIGV4dGVybiBQR0RMTElNUE9S VCBQcm9jU2lnbmFsSGVhZGVyICpQcm9jU2lnbmFsOwogI2VuZGlmCiAKKyNpZmRlZiBVU0VfQVNT RVJUX0NIRUNLSU5HCitleHRlcm4gYm9vbCBQcm9jU2lnbmFsSXNJbml0aWFsaXplZCh2b2lkKTsK KyNlbmRpZgorCisKICNlbmRpZgkJCQkJCQkvKiBQUk9DU0lHTkFMX0ggKi8KLS0gCjIuNTAuMSAo QXBwbGUgR2l0LTE1NSkKCg== --000000000000aab6890650971ae7--
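
For readers following the point about children starting with signals masked, here is a minimal standalone sketch of the POSIX behaviour involved. It is not PostgreSQL code: the function name blocked_signal_survives_late_handler and the flag got_sigusr1 are hypothetical, chosen only for illustration. It shows that a SIGUSR1 sent while the signal is blocked stays pending, so a handler installed afterwards still receives it once the signal is unblocked. It does not model the procsignal-initialization race discussed in the thread.

```c
#define _POSIX_C_SOURCE 200809L

#include <signal.h>
#include <stddef.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t got_sigusr1 = 0;

static void
handle_sigusr1(int signo)
{
	(void) signo;
	got_sigusr1 = 1;
}

/*
 * Hypothetical demo, not PostgreSQL code.  Returns 1 if a SIGUSR1 sent
 * while blocked was held pending and delivered only after unblocking,
 * 0 if it was delivered early or lost, -1 if a system call failed.
 */
int
blocked_signal_survives_late_handler(void)
{
	sigset_t	block;
	sigset_t	pending;
	struct sigaction sa;

	got_sigusr1 = 0;

	/* 1. Block SIGUSR1, similar in spirit to the postmaster's BlockSig. */
	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	if (sigprocmask(SIG_BLOCK, &block, NULL) != 0)
		return -1;

	/* 2. The signal arrives before our handler has been installed. */
	if (raise(SIGUSR1) != 0)
		return -1;

	/* 3. Install the handler late, while the signal is still blocked. */
	sa.sa_handler = handle_sigusr1;
	sigemptyset(&sa.sa_mask);
	sa.sa_flags = 0;
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return -1;

	/* The signal must be pending and the handler must not have run yet. */
	if (sigpending(&pending) != 0)
		return -1;
	if (got_sigusr1 != 0 || sigismember(&pending, SIGUSR1) != 1)
		return 0;

	/* 4. Unblock: the pending signal is delivered to the new handler. */
	if (sigprocmask(SIG_UNBLOCK, &block, NULL) != 0)
		return -1;

	return got_sigusr1 == 1;
}
```

Calling blocked_signal_survives_late_handler() from a small main() and checking for a return value of 1 is enough to try it out. Note that the sketch leaves the handler installed and SIGUSR1 unblocked when it returns.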