From: Masahiko Sawada
Date: Thu, 7 May 2026 10:17:13 -0700
Subject: Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
To: Alexander Lakhin
Cc: Andres Freund, Matthias van de Meent, Thomas Munro, PostgreSQL Hackers, Heikki Linnakangas, Andrey Borodin
References: <4358bd85-f6b4-4da6-9909-74428fe3c8f7@gmail.com> <2a199ba7-1d18-438a-847e-5241b7dac514@gmail.com> <18c0f20b-c79a-4358-8d95-cba8819de9f5@gmail.com>
In-Reply-To: <18c0f20b-c79a-4358-8d95-cba8819de9f5@gmail.com>

On Fri, May 1, 2026 at 1:00 AM Alexander Lakhin wrote:
>
> Dear Sawada-san,
>
> 01.05.2026 01:08, Masahiko Sawada wrote:
>
> On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin wrote:
>
> I was wondering why is that failure the only one of this kind on buildfarm
> (in last two years, at least), so I've tried to reproduce it on
> REL_18_STABLE... and failed.
>
> Then I've bisected it on the master branch and found (your) commit that
> introduced this behavior: 67c20979c from 2025-12-23.
> > I've confirmed that this race condition issue is present from v15 to > the master. In v14, we have the procsignal barrier code but don't use > it anywhere. In v18 or older, it could happen when executing DROP > DATABASE, DROP TABLESPACE etc, whereas in the master, it could happen > in more cases as we're using procsignal barrier more places. In any > case, if a process emits a signal barrier when another process is > between the initialization of slot->pss_barrierGeneration and > slot->pss_pid initialization, the subsequent > WaitForProcSignalBarrier() ends up waiting for that process forever. > So I think the patch should be backpatched to v15. Please review these > patches. > > > Yes, you're right -- it's not reproduced on REL_18_STABLE with > test_oat_hooks, which simply starts postgres node (as many other tests), > but when I tried the full test suite with the sleep inserted before > setting pss_pid, I discovered the following vulnerable tests: > > 030_stats_cleanup_replica_standby.log > 2026-05-01 06:00:58.789 UTC [2086579] LOG: still waiting for backend wit= h PID 2086578 to accept ProcSignalBarrier > 2026-05-01 06:00:58.789 UTC [2086579] CONTEXT: WAL redo at 0/3410B00 for= Database/DROP: dir 1663/16393 > > 033_replay_tsp_drops_standby2_FILE_COPY.log > 2026-05-01 05:45:12.969 UTC [2030902] LOG: still waiting for backend wit= h PID 2030901 to accept ProcSignalBarrier > 2026-05-01 05:45:12.969 UTC [2030902] CONTEXT: WAL redo at 0/30006A8 for= Database/CREATE_FILE_COPY: copy dir 1663/1 to 16384/16389 > > 040_standby_failover_slots_sync_publisher.log > 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl = LOG: still waiting for backend with PID 1538477 to accept ProcSignalBarrie= r > 2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl = STATEMENT: DROP DATABASE slotsync_test_db; > > 002_compare_backups_pitr1.log > 2026-05-01 04:50:46.638 UTC [1829328] LOG: still waiting for backend wit= h PID 1829396 to accept 
ProcSignalBarrier > 2026-05-01 04:50:46.638 UTC [1829328] CONTEXT: WAL redo at 0/30A1DE0 for= Database/DROP: dir 1663/16414 > > I've tried my repro with 033_replay_tsp_drops and it really fails on > REL_15_STABLE..master and doesn't fail on REL_14_STABLE. > > FYI I found that we had a similar report[1] last year, I'm not sure > it hit the exact same issue, though. > > Regards, > > [1] https://www.postgresql.org/message-id/CAGQGyDTaVkG3DbTEbtyxZLM48jMZR2= BcvTeYBsWLV5HvwSb+2Q@mail.gmail.com > > > Yeah, and probably this one: > https://www.postgresql.org/message-id/EF98BB5B-CA83-443E-B8A6-AA58EE4A06B= B%40yandex-team.ru > > By the way, mamba produced the same failure just yesterday: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=3Dmamba&dt=3D2026= -04-30%2005%3A10%3A39 > > # Running: pg_ctl --wait --pgdata /home/buildfarm/bf-data/HEAD/pgsql.buil= d/src/test/modules/commit_ts/tmp_check/t_004_restart_primary_data/pgdata --= log /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp= _check/log/004_restart_primary.log --options --cluster-name=3Dprimary start > waiting for server to start..............................................= ...........................................................................= ...........................................................................= ...........................................................................= ...........................................................................= ...........................................................................= ...........................................................................= ...........................................................................= ................................ stopped waiting > pg_ctl: server did not start in time > 004_restart_primary.log > 2026-04-30 04:09:04.025 EDT [17814:2] LOG: still waiting for backend wit= h PID 11506 to accept ProcSignalBarrier > ... 
> 2026-04-30 04:19:55.336 EDT [17814:132] LOG: still waiting for backend w= ith PID 11506 to accept ProcSignalBarrier > > The proposed patches make the test pass reliably for me in all affected > branches. Thank you for working on this! > Thank you for checking this issue on stable branches too! Considering that this issue is not very visible in practice and we're going to release new minor versions next week, I'm planning to push these fixes to master and backbranches after the minor releases. That way, we can fix the issue on the master relatively soon and have enough time to verify that fix works well on backbranches. Regards, --=20 Masahiko Sawada Amazon Web Services: https://aws.amazon.com
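
For readers following the thread, the window described in the quoted analysis (a barrier emitted between the initialization of slot->pss_barrierGeneration and slot->pss_pid) can be illustrated with a small toy model. This is only a sketch and not PostgreSQL code: the field names mirror the two slot members mentioned above, while the Slot and Barrier classes, the UNUSED sentinel, and the emit/absorb/stuck_slots helpers are hypothetical names invented for illustration. It assumes that an emitter only signals slots whose pid is already set and that a waiter blocks on any slot whose generation is behind.

```python
from dataclasses import dataclass

# Assumption: free slots advertise a maximal generation so waiters skip them.
UNUSED = 2**64 - 1


@dataclass
class Slot:
    pid: int = 0               # 0 means "not visible to barrier emitters yet"
    generation: int = UNUSED   # last barrier generation this slot has absorbed
    signaled: bool = False


class Barrier:
    def __init__(self, nslots):
        self.generation = 0
        self.slots = [Slot() for _ in range(nslots)]

    def emit(self):
        """Bump the global generation and signal every slot that has a pid."""
        self.generation += 1
        for slot in self.slots:
            if slot.pid != 0:
                slot.signaled = True
        return self.generation

    def absorb(self, slot):
        """What a signaled process does: catch up to the global generation."""
        if slot.signaled:
            slot.generation = self.generation
            slot.signaled = False

    def stuck_slots(self, generation):
        """Slots a waiter would block on for the given generation."""
        return [s for s in self.slots if s.generation < generation]


def start_process(barrier, slot, pid, emit_between_steps):
    """Initialize a slot in two steps, optionally racing with an emitter."""
    slot.generation = barrier.generation   # step 1: publish the generation
    emitted = barrier.emit() if emit_between_steps else None
    slot.pid = pid                         # step 2: publish the pid
    if emitted is None:
        emitted = barrier.emit()           # emitter runs after init completes
    barrier.absorb(slot)
    return barrier.stuck_slots(emitted)


racy = Barrier(nslots=2)
print(len(start_process(racy, racy.slots[0], pid=1234, emit_between_steps=True)))   # prints 1

safe = Barrier(nslots=2)
print(len(start_process(safe, safe.slots[0], pid=1234, emit_between_steps=False)))  # prints 0
```

In the racy ordering the new slot already advertises an old generation but is never signaled, so in this model a waiter would block on it indefinitely. When the emitter runs only after both fields are set, the slot is signaled and catches up.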