Primary node detection race at clean startup

public inbox for [email protected]  
help / color / mirror / Atom feed

Primary node detection race at clean startup
5+ messages / 2 participants
[nested] [flat]

* Primary node detection race at clean startup
@ 2026-05-12 08:38  Emond Papegaaij <[email protected]>
  0 siblings, 2 replies; 5+ messages in thread

From: Emond Papegaaij @ 2026-05-12 08:38 UTC (permalink / raw)
  To: [email protected]

Hi,

In our tests, we've found an issue that can cause all Pgpool nodes to
report an incorrect 'Role: standby':
Role                   : standby    ← stale, never updated on this node
Backend Role           : primary    ← actual SR-check result

This can happen if all nodes in a watchdog cluster start with a clean
state at the same time. If the first node is still trying to determine
the primary database, it's primary_node_id is -2. This value is then
synced to other nodes in the cluster, causing all nodes to report the
stale state indefinitely. Attached is a patch against 4.7 that should
fix this.

Note that this analysis was done by Claude Code and it also created
the patch. The failure on our CI was real though and I think the
explanation makes sense.

Best regards,
Emond Papegaaij

Attachments:

  [text/x-patch] pgpool-keep-local-primary-when-leader-initial.patch (3.1K, 2-pgpool-keep-local-primary-when-leader-initial.patch)
  download | inline diff:
Keep local primary_node_id when leader watchdog reports the initial -2 sentinel.

When all pgpool nodes in a cluster are stopped and started simultaneously
(e.g. via an admin "restart all pgpool nodes" action), every node initialises
Req_info->primary_node_id to -2 (the sentinel) and then runs
find_primary_node_repeatedly() locally to discover the real primary. The
watchdog elects a LEADER in parallel; the losing nodes transition to STANDBY,
receive SIG_WATCHDOG_STATE_CHANGED, and call sync_backend_from_watchdog() to
pull the leader's view.

If the leader's own find_primary_node_repeatedly() has not finished yet, the
leader serialises its still-uninitialised primary_node_id (-2) and the
standby's existing protective branch only covers -1 (quarantine). The -2
falls through to the else-clause and overwrites the standby's freshly-
determined valid primary_node_id with -2.

sync_backend_from_watchdog() is only re-invoked on SIG_BACKEND_SYNC_REQUIRED,
which is raised only on WD_FAILOVER_END. No subsequent event fires after a
simultaneous restart, so the standby is stuck at -2 indefinitely.

Add a guard that keeps the local primary_node_id when the leader's value is
the -2 initial sentinel.

diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -3719,18 +3719,39 @@ sync_backend_from_watchdog(void)
 		 * Note that Req_info->primary_node_id could be -2, which is the
 		 * initial value. So we need to avoid crash by checking the value is
 		 * not lower than 0. Otherwise we will get crash while looking up
 		 * BACKEND_INFO array. See Mantis bug id 614 for more details.
 		 */
 		if (Req_info->primary_node_id >= 0 &&
 			backendStatus->primary_node_id == -1 && BACKEND_INFO(Req_info->primary_node_id).backend_status != CON_DOWN)
 		{
 			ereport(LOG,
 					(errmsg("primary node:%d on leader watchdog node \"%s\" seems to be quarantined",
 							Req_info->primary_node_id, backendStatus->nodeName),
 					 errdetail("keeping the current primary")));
 		}
+		else if (Req_info->primary_node_id >= 0 &&
+				 backendStatus->primary_node_id == -2)
+		{
+			/*
+			 * Leader watchdog is still initialising and has not yet run
+			 * find_primary_node_repeatedly(); its primary_node_id is still
+			 * the initial sentinel -2. Do not overwrite our locally-determined
+			 * primary with the leader's stale initial state.
+			 *
+			 * Without this guard, a simultaneous restart of all pgpool nodes
+			 * leaves every STANDBY watchdog with primary_node_id = -2 forever:
+			 * sync_backend_from_watchdog() is only re-invoked on
+			 * SIG_BACKEND_SYNC_REQUIRED (raised from WD_FAILOVER_END), so
+			 * there is no normal path that resyncs once the leader's own
+			 * find_primary_node_repeatedly() completes.
+			 */
+			ereport(LOG,
+					(errmsg("primary node on leader watchdog node \"%s\" is still in the initial state",
+							backendStatus->nodeName),
+					 errdetail("keeping the locally-detected primary node:%d", Req_info->primary_node_id)));
+		}
 		else
 		{
 			Req_info->primary_node_id = backendStatus->primary_node_id;
 			primary_changed = true;
 		}

^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Primary node detection race at clean startup
@ 2026-05-12 10:49  Emond Papegaaij <[email protected]>
  parent: Emond Papegaaij <[email protected]>
  1 sibling, 0 replies; 5+ messages in thread

From: Emond Papegaaij @ 2026-05-12 10:49 UTC (permalink / raw)
  To: [email protected]

Hi,

Something was wrong with the attached patch. It is rejected by patch,
probably because of the large context. Attached is a new version that
also works with patch.

Best regards,
Emond Papegaaij

Op di 12 mei 2026 om 10:38 schreef Emond Papegaaij <[email protected]>:

>
> Hi,
>
> In our tests, we've found an issue that can cause all Pgpool nodes to
> report an incorrect 'Role: standby':
> Role                   : standby    ← stale, never updated on this node
> Backend Role           : primary    ← actual SR-check result
>
> This can happen if all nodes in a watchdog cluster start with a clean
> state at the same time. If the first node is still trying to determine
> the primary database, it's primary_node_id is -2. This value is then
> synced to other nodes in the cluster, causing all nodes to report the
> stale state indefinitely. Attached is a patch against 4.7 that should
> fix this.
>
> Note that this analysis was done by Claude Code and it also created
> the patch. The failure on our CI was real though and I think the
> explanation makes sense.
>
> Best regards,
> Emond Papegaaij

Attachments:

  [application/x-patch] pgpool-keep-local-primary-when-leader-initial.patch (2.5K, 2-pgpool-keep-local-primary-when-leader-initial.patch)
  download | inline diff:
Keep local primary_node_id when leader watchdog reports the initial -2 sentinel.

When all pgpool nodes in a cluster are stopped and started simultaneously
(e.g. via an admin "restart all pgpool nodes" action), every node initialises
Req_info->primary_node_id to -2 (the sentinel) and then runs
find_primary_node_repeatedly() locally to discover the real primary. The
watchdog elects a LEADER in parallel; the losing nodes transition to STANDBY,
receive SIG_WATCHDOG_STATE_CHANGED, and call sync_backend_from_watchdog() to
pull the leader's view.

If the leader's own find_primary_node_repeatedly() has not finished yet, the
leader serialises its still-uninitialised primary_node_id (-2) and the
standby's existing protective branch only covers -1 (quarantine). The -2
falls through to the else-clause and overwrites the standby's freshly-
determined valid primary_node_id with -2.

sync_backend_from_watchdog() is only re-invoked on SIG_BACKEND_SYNC_REQUIRED,
which is raised only on WD_FAILOVER_END. No subsequent event fires after a
simultaneous restart, so the standby is stuck at -2 indefinitely.

Add a guard that keeps the local primary_node_id when the leader's value is
the -2 initial sentinel.

diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -3729,6 +3729,27 @@ sync_backend_from_watchdog(void)
 							Req_info->primary_node_id, backendStatus->nodeName),
 					 errdetail("keeping the current primary")));
 		}
+		else if (Req_info->primary_node_id >= 0 &&
+				 backendStatus->primary_node_id == -2)
+		{
+			/*
+			 * Leader watchdog is still initialising and has not yet run
+			 * find_primary_node_repeatedly(); its primary_node_id is still
+			 * the initial sentinel -2. Do not overwrite our locally-determined
+			 * primary with the leader's stale initial state.
+			 *
+			 * Without this guard, a simultaneous restart of all pgpool nodes
+			 * leaves every STANDBY watchdog with primary_node_id = -2 forever:
+			 * sync_backend_from_watchdog() is only re-invoked on
+			 * SIG_BACKEND_SYNC_REQUIRED (raised from WD_FAILOVER_END), so
+			 * there is no normal path that resyncs once the leader's own
+			 * find_primary_node_repeatedly() completes.
+			 */
+			ereport(LOG,
+					(errmsg("primary node on leader watchdog node \"%s\" is still in the initial state",
+							backendStatus->nodeName),
+					 errdetail("keeping the locally-detected primary node:%d", Req_info->primary_node_id)));
+		}
 		else
 		{
 			Req_info->primary_node_id = backendStatus->primary_node_id;

^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Primary node detection race at clean startup
@ 2026-05-19 12:40  Tatsuo Ishii <[email protected]>
  parent: Emond Papegaaij <[email protected]>
  1 sibling, 1 reply; 5+ messages in thread

From: Tatsuo Ishii @ 2026-05-19 12:40 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Emond,

> Hi,
> 
> In our tests, we've found an issue that can cause all Pgpool nodes to
> report an incorrect 'Role: standby':
> Role                   : standby    ← stale, never updated on this node
> Backend Role           : primary    ← actual SR-check result
> 
> This can happen if all nodes in a watchdog cluster start with a clean
> state at the same time. If the first node is still trying to determine
> the primary database, it's primary_node_id is -2. This value is then
> synced to other nodes in the cluster, causing all nodes to report the
> stale state indefinitely. Attached is a patch against 4.7 that should
> fix this.
> 
> Note that this analysis was done by Claude Code and it also created
> the patch. The failure on our CI was real though and I think the
> explanation makes sense.

I have looked into the patch. Although I failed to reproduce the
issue, I agree with you: the explanation makes sense. Also I have run
the regression test and all test passed. I am going to push the patch
to all supported branches.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp





^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Primary node detection race at clean startup
@ 2026-05-19 12:55  Tatsuo Ishii <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Tatsuo Ishii @ 2026-05-19 12:55 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

Hi Emond,

>> Hi,
>> 
>> In our tests, we've found an issue that can cause all Pgpool nodes to
>> report an incorrect 'Role: standby':
>> Role                   : standby    ← stale, never updated on this node
>> Backend Role           : primary    ← actual SR-check result
>> 
>> This can happen if all nodes in a watchdog cluster start with a clean
>> state at the same time. If the first node is still trying to determine
>> the primary database, it's primary_node_id is -2. This value is then
>> synced to other nodes in the cluster, causing all nodes to report the
>> stale state indefinitely. Attached is a patch against 4.7 that should
>> fix this.
>> 
>> Note that this analysis was done by Claude Code and it also created
>> the patch. The failure on our CI was real though and I think the
>> explanation makes sense.
> 
> I have looked into the patch. Although I failed to reproduce the
> issue, I agree with you: the explanation makes sense. Also I have run
> the regression test and all test passed. I am going to push the patch
> to all supported branches.

Done.

Note that the credits section in the commit message is as follows:

Reported-by: Emond Papegaaij <[email protected]>
Reported-by: Claude Code
Author: Tatsuo Ishii <[email protected]>
Discussion: https://www.postgresql.org/message-id/CAGXsc%2BZmBoLs3Mz%3DG-Bdm4JJG%2BfH1NpHfR3qVJVwW4eBKWwStQ%40ma...
Backpatch-through: v4.3

You are credited as a reporter, and Claude Code is also a reporter to
indicate that the report was created by AI. I credited myself as the
author. This is by following reasons:

1) AI cannot be the author.
2) The committer (me) is responsible for the commit.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp





^ permalink  raw  reply  [nested|flat] 5+ messages in thread

* Re: Primary node detection race at clean startup
@ 2026-05-19 13:25  Emond Papegaaij <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 0 replies; 5+ messages in thread

From: Emond Papegaaij @ 2026-05-19 13:25 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]

> Done.

Thanks!

> Note that the credits section in the commit message is as follows:
>
> Reported-by: Emond Papegaaij <[email protected]>
> Reported-by: Claude Code
> Author: Tatsuo Ishii <[email protected]>
> Discussion: https://www.postgresql.org/message-id/CAGXsc%2BZmBoLs3Mz%3DG-Bdm4JJG%2BfH1NpHfR3qVJVwW4eBKWwStQ%40ma...
> Backpatch-through: v4.3
>
> You are credited as a reporter, and Claude Code is also a reporter to
> indicate that the report was created by AI. I credited myself as the
> author. This is by following reasons:
>
> 1) AI cannot be the author.
> 2) The committer (me) is responsible for the commit.

Yes, that makes perfect sense. By default Claude adds a
'Co-Authored-By: Claude Opus 4.7 (1M context)
<[email protected]>'. I totally agree that as a human, you are
responsible for the final commit. AI is a very powerful tool, but it's
still just a tool. At our company, we have a very simple rule: ask
yourself if you're proud of the work you've created. If you're not,
then don't commit.

Best regards,
Emond






^ permalink  raw  reply  [nested|flat] 5+ messages in thread

end of thread, other threads:[~2026-05-19 13:25 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-05-12 08:38 Primary node detection race at clean startup Emond Papegaaij <[email protected]>
2026-05-12 10:49 ` Emond Papegaaij <[email protected]>
2026-05-19 12:40 ` Tatsuo Ishii <[email protected]>
2026-05-19 12:55   ` Tatsuo Ishii <[email protected]>
2026-05-19 13:25     ` Emond Papegaaij <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox