From: Tatsuo Ishii <[email protected]>
To: [email protected]
Cc: [email protected]
Subject: Re: Race condition in pcp_node_info can cause it to hang
Date: Fri, 05 Jun 2026 08:09:32 +0900 (JST)
Message-ID: <[email protected]>
In-Reply-To: <CAGXsc+ZhGjwm+F42Xmt8Qn1qP_h7woipiV0WsY-e-P7W3ZG2OA@mail.gmail.com>
References: <CAGXsc+ZhGjwm+F42Xmt8Qn1qP_h7woipiV0WsY-e-P7W3ZG2OA@mail.gmail.com>
Hi Emond,
> Hi,
>
> We've hit another very rare flake in our tests, which can cause
> pcp_node_info to hang indefinitely. I've analyzed the problem with
> Claude Code, and it came to the conclusion and (quite small) fix
> below. Attached is a patch against 4.7.
>
> The problem:
> In inform_node_info() (src/pcp_con/pcp_worker.c), the PCP reply packet
> reads bi->replication_state and bi->replication_sync_state directly
> from shared memory twice: once via strlen() to compute the packet
> length, and once via pcp_write() to write the payload.
>
> The streaming-replication check worker rewrites those same
> shared-memory strings without a lock (it clears them to "" then
> repopulates them every check cycle and on state transitions,
> src/streaming_replication/pool_worker_child.c). If the string's length
> changes between the two reads, the declared wsize no longer matches
> the bytes actually written, so the PCP byte stream desynchronises. The
> client then blocks forever in pcp_read() waiting for bytes the server
> never sends.
>
> The fix:
> Snapshot the two strings into local buffers once, right after
> bi = pool_get_node_info(i), and use the locals for both the length
> and the payload, so a single packet is always internally consistent.
> This matches how every other field in the packet is already handled.
Thank you for the report and the fix. Yes, I agree there is a race
condition between the sr checker process and pcp_node_info. I think
introducing a lock to protect bi->replication_state and
bi->replication_sync_state would be overkill. The suggested fix seems
to be the right direction. I will push it after the current release
freeze is over (it is supposed to end by the end of today).
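
For readers following the thread, the snapshot approach can be sketched
roughly as below. This is only an illustration, not the attached patch:
node_info_sketch, STATE_LEN, snapshot_string() and build_payload_sketch()
are hypothetical names, the buffer size is an assumption, and the framing
(each string followed by its NUL byte) is a simplification of the real
PCP packet. The point is that shared memory is read once per field, and
both the declared length and the payload come from the local copies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STATE_LEN 64	/* assumed buffer size, not taken from pgpool */

/* Hypothetical stand-in for the shared-memory node info. Only the two
 * fields discussed in this thread are modelled. */
typedef struct
{
	char		replication_state[STATE_LEN];
	char		replication_sync_state[STATE_LEN];
} node_info_sketch;

/* Copy a shared string into a local buffer of dst_size bytes. The copy
 * is bounded and always NUL-terminated, even if the source changes
 * while it is being read. */
static void
snapshot_string(char *dst, size_t dst_size, const char *src)
{
	size_t		i;

	for (i = 0; i + 1 < dst_size && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';
}

/* Build "state\0sync_state\0" into out. Returns the declared payload
 * size, or 0 if out is too small. */
static size_t
build_payload_sketch(const node_info_sketch *bi, char *out, size_t out_size)
{
	char		state[STATE_LEN];
	char		sync_state[STATE_LEN];
	size_t		wsize;
	size_t		off = 0;

	/* Read shared memory exactly once per field. */
	snapshot_string(state, sizeof(state), bi->replication_state);
	snapshot_string(sync_state, sizeof(sync_state),
					bi->replication_sync_state);

	/* Declared length, computed from the locals only. */
	wsize = strlen(state) + 1 + strlen(sync_state) + 1;
	if (wsize > out_size)
		return 0;

	/* Payload, written from the same locals. */
	memcpy(out + off, state, strlen(state) + 1);
	off += strlen(state) + 1;
	memcpy(out + off, sync_state, strlen(sync_state) + 1);
	off += strlen(sync_state) + 1;

	/* The bytes written always match the declared length. */
	assert(off == wsize);
	return wsize;
}
```

A caller would send the returned size first and then that many bytes
from out. Even if the sr checker rewrites the shared strings in the
middle, the worst case is a stale or partially updated value in one
reply, not a length that disagrees with the payload.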
Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp