From: Alexander Lakhin
To: Masahiko Sawada
Cc: Andres Freund, Matthias van de Meent, Thomas Munro, PostgreSQL Hackers, Heikki Linnakangas, Andrey Borodin
Date: Fri, 1 May 2026 11:00:00 +0300
Subject: Re: Startup process deadlock: WaitForProcSignalBarriers vs aux process
Message-ID: <18c0f20b-c79a-4358-8d95-cba8819de9f5@gmail.com>

Dear Sawada-san,

01.05.2026 01:08, Masahiko Sawada wrote:
> On Wed, Apr 29, 2026 at 11:00 AM Alexander Lakhin <exclusion@gmail.com> wrote:
>> I was wondering why is that failure the only one of this kind on buildfarm
>> (in last two years, at least), so I've tried to reproduce it on
>> REL_18_STABLE... and failed.
>>
>> Then I've bisected it on the master branch and found (your) commit that
>> introduced this behavior: 67c20979c from 2025-12-23.
>>
> I've confirmed that this race condition issue is present from v15 to
> the master. In v14, we have the procsignal barrier code but don't use
> it anywhere. In v18 or older, it could happen when executing DROP
> DATABASE, DROP TABLESPACE etc, whereas in the master, it could happen
> in more cases as we're using procsignal barrier more places. In any
> case, if a process emits a signal barrier when another process is
> between the initialization of slot->pss_barrierGeneration and
> slot->pss_pid initialization, the subsequent
> WaitForProcSignalBarrier() ends up waiting for that process forever.
> So I think the patch should be backpatched to v15. Please review these
> patches.

Yes, you're right -- it's not reproduced on REL_18_STABLE with
test_oat_hooks, which simply starts a postgres node (as many other tests
do), but when I ran the full test suite with a sleep inserted before
setting pss_pid, I discovered the following vulnerable tests:

030_stats_cleanup_replica_standby.log
2026-05-01 06:00:58.789 UTC [2086579] LOG:  still waiting for backend with PID 2086578 to accept ProcSignalBarrier
2026-05-01 06:00:58.789 UTC [2086579] CONTEXT:  WAL redo at 0/3410B00 for Database/DROP: dir 1663/16393

033_replay_tsp_drops_standby2_FILE_COPY.log
2026-05-01 05:45:12.969 UTC [2030902] LOG:  still waiting for backend with PID 2030901 to accept ProcSignalBarrier
2026-05-01 05:45:12.969 UTC [2030902] CONTEXT:  WAL redo at 0/30006A8 for Database/CREATE_FILE_COPY: copy dir 1663/1 to 16384/16389

040_standby_failover_slots_sync_publisher.log
2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl LOG:  still waiting for backend with PID 1538477 to accept ProcSignalBarrier
2026-05-01 02:16:00.107 UTC [1538468] 040_standby_failover_slots_sync.pl STATEMENT:  DROP DATABASE slotsync_test_db;

002_compare_backups_pitr1.log
2026-05-01 04:50:46.638 UTC [1829328] LOG:  still waiting for backend with PID 1829396 to accept ProcSignalBarrier
2026-05-01 04:50:46.638 UTC [1829328] CONTEXT:  WAL redo at 0/30A1DE0 for Database/DROP: dir 1663/16414

I've tried my repro with 033_replay_tsp_drops, and it does fail on
REL_15_STABLE..master and doesn't fail on REL_14_STABLE.
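
For anyone following along, the window described in the quoted paragraph can
be modelled with a toy, single-threaded sketch. This is not the procsignal.c
code: ToySlot and all toy_* names are made up for illustration, and the model
assumes that the emitter skips slots whose pss_pid is still 0 while the
waiter looks only at pss_barrierGeneration. Under those assumptions, a barrier
emitted between the two initialization steps is never delivered, yet the
waiter still expects it to be absorbed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a per-process signal slot (hypothetical, simplified). */
typedef struct ToySlot
{
    int      pss_pid;               /* 0 means "no process attached yet" */
    uint64_t pss_barrierGeneration; /* last generation this process absorbed */
    bool     signalled;             /* toy flag: emitter notified this slot */
} ToySlot;

static uint64_t toy_global_generation = 0;

/* First initialization step: copy the current global generation. */
static void toy_init_generation(ToySlot *slot)
{
    slot->pss_barrierGeneration = toy_global_generation;
}

/* Second initialization step: publish the pid so emitters can see us. */
static void toy_init_pid(ToySlot *slot, int pid)
{
    slot->pss_pid = pid;
}

/* Emitter: bump the generation, notify the slot only if a pid is published. */
static uint64_t toy_emit_barrier(ToySlot *slot)
{
    uint64_t gen = ++toy_global_generation;

    if (slot->pss_pid != 0)
        slot->signalled = true;
    return gen;
}

/* Target process: absorb the barrier only if it was actually notified. */
static void toy_process_interrupts(ToySlot *slot)
{
    if (slot->signalled)
    {
        slot->pss_barrierGeneration = toy_global_generation;
        slot->signalled = false;
    }
}

/* Waiter: it would block for as long as the slot's generation is behind. */
static bool toy_wait_would_hang(const ToySlot *slot, uint64_t gen)
{
    return slot->pss_barrierGeneration < gen;
}

/*
 * Run one interleaving. emit_step selects when the barrier is emitted:
 * 0 = before both init steps, 1 = between them, 2 = after both.
 * Returns true if the waiter would be stuck.
 */
bool toy_run(int emit_step)
{
    ToySlot  slot = {0, UINT64_MAX, false}; /* unused slot: never "behind" */
    uint64_t gen = 0;

    assert(emit_step >= 0 && emit_step <= 2);
    toy_global_generation = 0;

    if (emit_step == 0)
        gen = toy_emit_barrier(&slot);
    toy_init_generation(&slot);
    if (emit_step == 1)
        gen = toy_emit_barrier(&slot);  /* the problematic window */
    toy_init_pid(&slot, 1234);
    if (emit_step == 2)
        gen = toy_emit_barrier(&slot);

    toy_process_interrupts(&slot);
    return toy_wait_would_hang(&slot, gen);
}
```

In this model only toy_run(1) reports a stuck waiter; emitting before or
after both initialization steps is harmless. Inserting a sleep before setting
pss_pid simply widens the step-1 window.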

> FYI I found that we had a similar report[1] last year, I'm not sure
> it hit the exact same issue, though.
>
> Regards,
>
> [1] https://www.postgresql.org/message-id/CAGQGyDTaVkG3DbTEbtyxZLM48jMZR2BcvTeYBsWLV5HvwSb+2Q@mail.gmail.com

Yeah, and probably this one:
https://www.postgresql.org/message-id/EF98BB5B-CA83-443E-B8A6-AA58EE4A06BB%40yandex-team.ru

By the way, mamba produced the same failure just yesterday:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2026-04-30%2005%3A10%3A39

# Running: pg_ctl --wait --pgdata /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/t_004_restart_primary_data/pgdata --log /home/buildfarm/bf-data/HEAD/pgsql.build/src/test/modules/commit_ts/tmp_check/log/004_restart_primary.log --options --cluster-name=primary start
waiting for server to start........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................... stopped waiting
pg_ctl: server did not start in time
004_restart_primary.log
2026-04-30 04:09:04.025 EDT [17814:2] LOG:  still waiting for backend with PID 11506 to accept ProcSignalBarrier
...
2026-04-30 04:19:55.336 EDT [17814:132] LOG:  still waiting for backend with PID 11506 to accept ProcSignalBarrier

The proposed patches make the test pass reliably for me in all affected
branches. Thank you for working on this!

Best regards,
Alexander