From: Emond Papegaaij
Date: Fri, 5 Jun 2026 13:49:32 +0200
Subject: Re: Race condition in pcp_node_info can cause it to hang
To: Tatsuo Ishii
Cc: pgpool-hackers@lists.postgresql.org
In-Reply-To: <20260605.080932.2120655380457259936.ishii@postgresql.org>

Hi,

Thanks for the quick followup!

Best regards,
Emond

On Fri, 5 Jun 2026 at 01:09, Tatsuo Ishii wrote:
>
> Hi Emond,
>
> > Hi,
> >
> > We've hit another very rare flake in our tests, which can cause
> > pcp_node_info to hang indefinitely. I've analyzed the problem with
> > Claude Code, and it came to the conclusion and (quite small) fix
> > below. Attached is a patch against 4.7.
> >
> > The problem:
> > In inform_node_info() (src/pcp_con/pcp_worker.c), the PCP reply packet
> > reads bi->replication_state and bi->replication_sync_state directly
> > from shared memory twice: once via strlen() to compute the packet
> > length, and once via pcp_write() to write the payload.
> >
> > The streaming-replication check worker rewrites those same
> > shared-memory strings without a lock (it clears them to "" then
> > repopulates them every check cycle and on state transitions,
> > src/streaming_replication/pool_worker_child.c). If the string's length
> > changes between the two reads, the declared wsize no longer matches
> > the bytes actually written, so the PCP byte stream desynchronises. The
> > client then blocks forever in pcp_read() waiting for bytes the server
> > never sends.
> >
> > The fix:
> > Snapshot the two strings into local buffers once, right after
> > bi = pool_get_node_info(i), and use the locals for both the length
> > and the payload, so a single packet is always internally consistent.
> > This matches how every other field in the packet is already handled.
>
> Thank you for the report and fix. Yes, I agree there's a race
> condition between the sr checker process and pcp_node_info. I think
> introducing a lock to protect bi->replication_state and
> bi->replication_sync_state is overkill. The suggested fix seems to be
> the right direction. Will push after the current release freeze is over
> (supposed to be finished by the end of today).
>
> Regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
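
For readers of the archive, the snapshot approach described in the quoted
report can be sketched roughly as below. This is not the attached patch.
NodeInfoSketch, snapshot_string(), build_payload() and the 64-byte STATE_LEN
are hypothetical names and sizes chosen for illustration, and the sketch fills
a flat buffer where the real inform_node_info() calls pcp_write(). The only
point it tries to show is that the declared length and the payload bytes are
both derived from the same local copies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STATE_LEN 64            /* assumed field size, for illustration only */

/* Hypothetical stand-in for the shared-memory node entry. */
typedef struct
{
    char replication_state[STATE_LEN];
    char replication_sync_state[STATE_LEN];
} NodeInfoSketch;

/*
 * Copy a string that another process may be rewriting into a local buffer.
 * The copy is bounded and always NUL-terminated, so the local value cannot
 * change after this call returns.
 */
static void
snapshot_string(char *dst, size_t dstlen, const volatile char *src)
{
    size_t i;

    assert(dstlen > 0);
    for (i = 0; i + 1 < dstlen && src[i] != '\0'; i++)
        dst[i] = src[i];
    dst[i] = '\0';
}

/*
 * Build a payload of two NUL-terminated strings and return its declared
 * size. Both the size and the bytes come from the same local snapshots.
 * Returns 0 if the output buffer is too small.
 */
static size_t
build_payload(const NodeInfoSketch *bi, char *out, size_t outlen)
{
    char   state[STATE_LEN];
    char   sync_state[STATE_LEN];
    size_t state_len;
    size_t sync_len;
    size_t wsize;

    /* Read each shared field exactly once. */
    snapshot_string(state, sizeof(state), bi->replication_state);
    snapshot_string(sync_state, sizeof(sync_state), bi->replication_sync_state);

    /* Declared length, computed from the locals only. */
    state_len = strlen(state) + 1;
    sync_len = strlen(sync_state) + 1;
    wsize = state_len + sync_len;
    if (wsize > outlen)
        return 0;

    /* Payload, written from the same locals. */
    memcpy(out, state, state_len);
    memcpy(out + state_len, sync_state, sync_len);
    return wsize;
}
```

A snapshot taken while the checker is mid-update may still hold a stale or
partially written value, but the packet built from it stays self-consistent,
which is what keeps the client from blocking in pcp_read().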