Re: Primary node detection race at clean startup

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Emond Papegaaij <[email protected]>
To: [email protected]
Subject: Re: Primary node detection race at clean startup
Date: Tue, 12 May 2026 12:49:31 +0200
Message-ID: <CAGXsc+bBbfW7qQ+2JJ4SY7xZDifrz329cJcxHaoG+kpfGqBJaQ@mail.gmail.com> (raw)
In-Reply-To: <CAGXsc+ZmBoLs3Mz=G-Bdm4JJG+fH1NpHfR3qVJVwW4eBKWwStQ@mail.gmail.com>
References: <CAGXsc+ZmBoLs3Mz=G-Bdm4JJG+fH1NpHfR3qVJVwW4eBKWwStQ@mail.gmail.com>

Hi,

Something was wrong with the attached patch. It is rejected by patch,
probably because of the large context. Attached is a new version that
also works with patch.

Best regards,
Emond Papegaaij

Op di 12 mei 2026 om 10:38 schreef Emond Papegaaij <[email protected]>:

>
> Hi,
>
> In our tests, we've found an issue that can cause all Pgpool nodes to
> report an incorrect 'Role: standby':
> Role                   : standby    ← stale, never updated on this node
> Backend Role           : primary    ← actual SR-check result
>
> This can happen if all nodes in a watchdog cluster start with a clean
> state at the same time. If the first node is still trying to determine
> the primary database, it's primary_node_id is -2. This value is then
> synced to other nodes in the cluster, causing all nodes to report the
> stale state indefinitely. Attached is a patch against 4.7 that should
> fix this.
>
> Note that this analysis was done by Claude Code and it also created
> the patch. The failure on our CI was real though and I think the
> explanation makes sense.
>
> Best regards,
> Emond Papegaaij

Attachments:

  [application/x-patch] pgpool-keep-local-primary-when-leader-initial.patch (2.5K, 2-pgpool-keep-local-primary-when-leader-initial.patch)
  download | inline diff:
Keep local primary_node_id when leader watchdog reports the initial -2 sentinel.

When all pgpool nodes in a cluster are stopped and started simultaneously
(e.g. via an admin "restart all pgpool nodes" action), every node initialises
Req_info->primary_node_id to -2 (the sentinel) and then runs
find_primary_node_repeatedly() locally to discover the real primary. The
watchdog elects a LEADER in parallel; the losing nodes transition to STANDBY,
receive SIG_WATCHDOG_STATE_CHANGED, and call sync_backend_from_watchdog() to
pull the leader's view.

If the leader's own find_primary_node_repeatedly() has not finished yet, the
leader serialises its still-uninitialised primary_node_id (-2) and the
standby's existing protective branch only covers -1 (quarantine). The -2
falls through to the else-clause and overwrites the standby's freshly-
determined valid primary_node_id with -2.

sync_backend_from_watchdog() is only re-invoked on SIG_BACKEND_SYNC_REQUIRED,
which is raised only on WD_FAILOVER_END. No subsequent event fires after a
simultaneous restart, so the standby is stuck at -2 indefinitely.

Add a guard that keeps the local primary_node_id when the leader's value is
the -2 initial sentinel.

diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -3729,6 +3729,27 @@ sync_backend_from_watchdog(void)
 							Req_info->primary_node_id, backendStatus->nodeName),
 					 errdetail("keeping the current primary")));
 		}
+		else if (Req_info->primary_node_id >= 0 &&
+				 backendStatus->primary_node_id == -2)
+		{
+			/*
+			 * Leader watchdog is still initialising and has not yet run
+			 * find_primary_node_repeatedly(); its primary_node_id is still
+			 * the initial sentinel -2. Do not overwrite our locally-determined
+			 * primary with the leader's stale initial state.
+			 *
+			 * Without this guard, a simultaneous restart of all pgpool nodes
+			 * leaves every STANDBY watchdog with primary_node_id = -2 forever:
+			 * sync_backend_from_watchdog() is only re-invoked on
+			 * SIG_BACKEND_SYNC_REQUIRED (raised from WD_FAILOVER_END), so
+			 * there is no normal path that resyncs once the leader's own
+			 * find_primary_node_repeatedly() completes.
+			 */
+			ereport(LOG,
+					(errmsg("primary node on leader watchdog node \"%s\" is still in the initial state",
+							backendStatus->nodeName),
+					 errdetail("keeping the locally-detected primary node:%d", Req_info->primary_node_id)));
+		}
 		else
 		{
 			Req_info->primary_node_id = backendStatus->primary_node_id;

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Primary node detection race at clean startup
  In-Reply-To: <CAGXsc+bBbfW7qQ+2JJ4SY7xZDifrz329cJcxHaoG+kpfGqBJaQ@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox