Re: Race condition in pcp_node_info can cause it to hang


From: Tatsuo Ishii <[email protected]>
To: [email protected]
Cc: [email protected]
Subject: Re: Race condition in pcp_node_info can cause it to hang
Date: Sun, 07 Jun 2026 12:14:51 +0900 (JST)
Message-ID: <[email protected]>
In-Reply-To: <CAGXsc+akuig0oA7dJX5BNFVRn+5miTALRZMnPrrt3kY7ypB+Ew@mail.gmail.com>
References: <CAGXsc+ZhGjwm+F42Xmt8Qn1qP_h7woipiV0WsY-e-P7W3ZG2OA@mail.gmail.com>
	<[email protected]>
	<CAGXsc+akuig0oA7dJX5BNFVRn+5miTALRZMnPrrt3kY7ypB+Ew@mail.gmail.com>

Hi,

Fix pushed to all supported branches.
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=7c918dc247613d16d590a9f30ecc747da6871796

Thank you!

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> Hi,
> 
> Thanks for the quick followup!
> 
> Best regards,
> Emond
> 
> On Fri, 5 Jun 2026 at 01:09, Tatsuo Ishii <[email protected]> wrote:
>>
>> Hi Emond,
>>
>> > Hi,
>> >
>> > We've hit another very rare flake in our tests, which can cause
>> > pcp_node_info to hang indefinitely. I've analyzed the problem with
>> > Claude Code, and it came to the conclusion and (quite small) fix
>> > below. Attached is a patch against 4.7.
>> >
>> > The problem:
>> > In inform_node_info() (src/pcp_con/pcp_worker.c), the PCP reply packet
>> > reads bi->replication_state and bi->replication_sync_state directly
>> > from shared memory twice: once via strlen() to compute the packet
>> > length, and once via pcp_write() to write the payload.
>> >
>> > The streaming-replication check worker rewrites those same
>> > shared-memory strings without a lock (it clears them to "" then
>> > repopulates them every check cycle and on state transitions,
>> > src/streaming_replication/pool_worker_child.c). If the string's length
>> > changes between the two reads, the declared wsize no longer matches
>> > the bytes actually written, so the PCP byte stream desynchronises. The
>> > client then blocks forever in pcp_read() waiting for bytes the server
>> > never sends.
>> >
>> > The fix:
>> > Snapshot the two strings into local buffers once, right after
>> > bi = pool_get_node_info(i), and use the locals for both the length
>> > and the payload, so a single packet is always internally consistent.
>> > This matches how every other field in the packet is already handled.
>>
>> Thank you for the report and fix. Yes, I agree there's a race
>> condition between the sr checker process and pcp_node_info. I think
>> introducing a lock to protect bi->replication_state and
>> bi->replication_sync_state is overkill. The suggested fix seems to be
>> the right direction. I will push it after the current release freeze
>> is over (it is supposed to end by the end of today).
>>
>> Regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
> 
>
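
For readers following the thread, the race and the snapshot fix
described in the quoted report can be sketched in C roughly as below.
This is a minimal illustration, not the code from the pushed commit:
NodeInfo, STATE_LEN, put_str(), build_racy(), build_snapshot() and
clear_states() are hypothetical names, the packet layout is simplified,
and the concurrent checker is simulated by a callback that runs between
the length computation and the payload write so that the outcome is
deterministic.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STATE_LEN 32            /* hypothetical field size */

/* Hypothetical stand-in for the node info kept in shared memory. */
typedef struct {
    char replication_state[STATE_LEN];
    char replication_sync_state[STATE_LEN];
} NodeInfo;

/* Callback that plays the role of the checker rewriting the strings. */
typedef void (*Interleave)(NodeInfo *);

/* Append a NUL-terminated string (terminator included) to out at *pos. */
static void put_str(char *out, size_t cap, size_t *pos, const char *s)
{
    size_t n = strlen(s) + 1;

    assert(*pos + n <= cap);
    memcpy(out + *pos, s, n);
    *pos += n;
}

/*
 * Racy pattern: the length is computed from shared memory, then the
 * payload is read from shared memory a second time. Returns the
 * declared size; *written receives the bytes actually emitted.
 */
static size_t build_racy(NodeInfo *bi, Interleave hook,
                         char *out, size_t cap, size_t *written)
{
    size_t declared = strlen(bi->replication_state) + 1
                    + strlen(bi->replication_sync_state) + 1;

    if (hook)
        hook(bi);               /* checker runs between the two reads */
    *written = 0;
    put_str(out, cap, written, bi->replication_state);
    put_str(out, cap, written, bi->replication_sync_state);
    return declared;
}

/* Snapshot pattern: copy once into locals, use them for both steps. */
static size_t build_snapshot(NodeInfo *bi, Interleave hook,
                             char *out, size_t cap, size_t *written)
{
    char state[STATE_LEN];
    char sync_state[STATE_LEN];
    size_t declared;

    /* Bounded copies that are always NUL-terminated locally. */
    strncpy(state, bi->replication_state, sizeof(state) - 1);
    state[sizeof(state) - 1] = '\0';
    strncpy(sync_state, bi->replication_sync_state, sizeof(sync_state) - 1);
    sync_state[sizeof(sync_state) - 1] = '\0';

    declared = strlen(state) + 1 + strlen(sync_state) + 1;
    if (hook)
        hook(bi);               /* later rewrites no longer matter */
    *written = 0;
    put_str(out, cap, written, state);
    put_str(out, cap, written, sync_state);
    return declared;
}

/* Simulated checker step: clear both strings, as the report describes. */
static void clear_states(NodeInfo *bi)
{
    bi->replication_state[0] = '\0';
    bi->replication_sync_state[0] = '\0';
}
```

With "streaming" and "async" in the two fields and clear_states() as
the hook, build_racy() declares 16 bytes but emits only 2, which is the
kind of mismatch that leaves a client waiting in pcp_read().
build_snapshot() declares and emits 16 bytes under the same
interleaving. A snapshot taken while the writer is mid-update may still
hold a partly rewritten value, but the packet stays internally
consistent, which is all the fix described above aims for.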
