Primary node detection race at clean startup

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Emond Papegaaij <[email protected]>
To: [email protected]
Subject: Primary node detection race at clean startup
Date: Tue, 12 May 2026 10:38:08 +0200
Message-ID: <CAGXsc+ZmBoLs3Mz=G-Bdm4JJG+fH1NpHfR3qVJVwW4eBKWwStQ@mail.gmail.com> (raw)

Hi,

In our tests, we've found an issue that can cause all Pgpool nodes to
report an incorrect 'Role: standby':
Role                   : standby    ← stale, never updated on this node
Backend Role           : primary    ← actual SR-check result

This can happen if all nodes in a watchdog cluster start with a clean
state at the same time. If the first node is still trying to determine
the primary database, it's primary_node_id is -2. This value is then
synced to other nodes in the cluster, causing all nodes to report the
stale state indefinitely. Attached is a patch against 4.7 that should
fix this.

Note that this analysis was done by Claude Code and it also created
the patch. The failure on our CI was real though and I think the
explanation makes sense.

Best regards,
Emond Papegaaij

Attachments:

  [text/x-patch] pgpool-keep-local-primary-when-leader-initial.patch (3.1K, 2-pgpool-keep-local-primary-when-leader-initial.patch)
  download | inline diff:
Keep local primary_node_id when leader watchdog reports the initial -2 sentinel.

When all pgpool nodes in a cluster are stopped and started simultaneously
(e.g. via an admin "restart all pgpool nodes" action), every node initialises
Req_info->primary_node_id to -2 (the sentinel) and then runs
find_primary_node_repeatedly() locally to discover the real primary. The
watchdog elects a LEADER in parallel; the losing nodes transition to STANDBY,
receive SIG_WATCHDOG_STATE_CHANGED, and call sync_backend_from_watchdog() to
pull the leader's view.

If the leader's own find_primary_node_repeatedly() has not finished yet, the
leader serialises its still-uninitialised primary_node_id (-2) and the
standby's existing protective branch only covers -1 (quarantine). The -2
falls through to the else-clause and overwrites the standby's freshly-
determined valid primary_node_id with -2.

sync_backend_from_watchdog() is only re-invoked on SIG_BACKEND_SYNC_REQUIRED,
which is raised only on WD_FAILOVER_END. No subsequent event fires after a
simultaneous restart, so the standby is stuck at -2 indefinitely.

Add a guard that keeps the local primary_node_id when the leader's value is
the -2 initial sentinel.

diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -3719,18 +3719,39 @@ sync_backend_from_watchdog(void)
 		 * Note that Req_info->primary_node_id could be -2, which is the
 		 * initial value. So we need to avoid crash by checking the value is
 		 * not lower than 0. Otherwise we will get crash while looking up
 		 * BACKEND_INFO array. See Mantis bug id 614 for more details.
 		 */
 		if (Req_info->primary_node_id >= 0 &&
 			backendStatus->primary_node_id == -1 && BACKEND_INFO(Req_info->primary_node_id).backend_status != CON_DOWN)
 		{
 			ereport(LOG,
 					(errmsg("primary node:%d on leader watchdog node \"%s\" seems to be quarantined",
 							Req_info->primary_node_id, backendStatus->nodeName),
 					 errdetail("keeping the current primary")));
 		}
+		else if (Req_info->primary_node_id >= 0 &&
+				 backendStatus->primary_node_id == -2)
+		{
+			/*
+			 * Leader watchdog is still initialising and has not yet run
+			 * find_primary_node_repeatedly(); its primary_node_id is still
+			 * the initial sentinel -2. Do not overwrite our locally-determined
+			 * primary with the leader's stale initial state.
+			 *
+			 * Without this guard, a simultaneous restart of all pgpool nodes
+			 * leaves every STANDBY watchdog with primary_node_id = -2 forever:
+			 * sync_backend_from_watchdog() is only re-invoked on
+			 * SIG_BACKEND_SYNC_REQUIRED (raised from WD_FAILOVER_END), so
+			 * there is no normal path that resyncs once the leader's own
+			 * find_primary_node_repeatedly() completes.
+			 */
+			ereport(LOG,
+					(errmsg("primary node on leader watchdog node \"%s\" is still in the initial state",
+							backendStatus->nodeName),
+					 errdetail("keeping the locally-detected primary node:%d", Req_info->primary_node_id)));
+		}
 		else
 		{
 			Req_info->primary_node_id = backendStatus->primary_node_id;
 			primary_changed = true;
 		}

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Primary node detection race at clean startup
  In-Reply-To: <CAGXsc+ZmBoLs3Mz=G-Bdm4JJG+fH1NpHfR3qVJVwW4eBKWwStQ@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox