public inbox for [email protected]help / color / mirror / Atom feed
Primary node detection race at clean startup 5+ messages / 2 participants [nested] [flat]
* Primary node detection race at clean startup @ 2026-05-12 08:38 Emond Papegaaij <[email protected]> 0 siblings, 2 replies; 5+ messages in thread From: Emond Papegaaij @ 2026-05-12 08:38 UTC (permalink / raw) To: [email protected] Hi, In our tests, we've found an issue that can cause all Pgpool nodes to report an incorrect 'Role: standby': Role : standby ← stale, never updated on this node Backend Role : primary ← actual SR-check result This can happen if all nodes in a watchdog cluster start with a clean state at the same time. If the first node is still trying to determine the primary database, it's primary_node_id is -2. This value is then synced to other nodes in the cluster, causing all nodes to report the stale state indefinitely. Attached is a patch against 4.7 that should fix this. Note that this analysis was done by Claude Code and it also created the patch. The failure on our CI was real though and I think the explanation makes sense. Best regards, Emond Papegaaij Attachments: [text/x-patch] pgpool-keep-local-primary-when-leader-initial.patch (3.1K, 2-pgpool-keep-local-primary-when-leader-initial.patch) download | inline diff: Keep local primary_node_id when leader watchdog reports the initial -2 sentinel. When all pgpool nodes in a cluster are stopped and started simultaneously (e.g. via an admin "restart all pgpool nodes" action), every node initialises Req_info->primary_node_id to -2 (the sentinel) and then runs find_primary_node_repeatedly() locally to discover the real primary. The watchdog elects a LEADER in parallel; the losing nodes transition to STANDBY, receive SIG_WATCHDOG_STATE_CHANGED, and call sync_backend_from_watchdog() to pull the leader's view. If the leader's own find_primary_node_repeatedly() has not finished yet, the leader serialises its still-uninitialised primary_node_id (-2) and the standby's existing protective branch only covers -1 (quarantine). The -2 falls through to the else-clause and overwrites the standby's freshly- determined valid primary_node_id with -2. sync_backend_from_watchdog() is only re-invoked on SIG_BACKEND_SYNC_REQUIRED, which is raised only on WD_FAILOVER_END. No subsequent event fires after a simultaneous restart, so the standby is stuck at -2 indefinitely. Add a guard that keeps the local primary_node_id when the leader's value is the -2 initial sentinel. diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c --- a/src/main/pgpool_main.c +++ b/src/main/pgpool_main.c @@ -3719,18 +3719,39 @@ sync_backend_from_watchdog(void) * Note that Req_info->primary_node_id could be -2, which is the * initial value. So we need to avoid crash by checking the value is * not lower than 0. Otherwise we will get crash while looking up * BACKEND_INFO array. See Mantis bug id 614 for more details. */ if (Req_info->primary_node_id >= 0 && backendStatus->primary_node_id == -1 && BACKEND_INFO(Req_info->primary_node_id).backend_status != CON_DOWN) { ereport(LOG, (errmsg("primary node:%d on leader watchdog node \"%s\" seems to be quarantined", Req_info->primary_node_id, backendStatus->nodeName), errdetail("keeping the current primary"))); } + else if (Req_info->primary_node_id >= 0 && + backendStatus->primary_node_id == -2) + { + /* + * Leader watchdog is still initialising and has not yet run + * find_primary_node_repeatedly(); its primary_node_id is still + * the initial sentinel -2. Do not overwrite our locally-determined + * primary with the leader's stale initial state. + * + * Without this guard, a simultaneous restart of all pgpool nodes + * leaves every STANDBY watchdog with primary_node_id = -2 forever: + * sync_backend_from_watchdog() is only re-invoked on + * SIG_BACKEND_SYNC_REQUIRED (raised from WD_FAILOVER_END), so + * there is no normal path that resyncs once the leader's own + * find_primary_node_repeatedly() completes. + */ + ereport(LOG, + (errmsg("primary node on leader watchdog node \"%s\" is still in the initial state", + backendStatus->nodeName), + errdetail("keeping the locally-detected primary node:%d", Req_info->primary_node_id))); + } else { Req_info->primary_node_id = backendStatus->primary_node_id; primary_changed = true; } ^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Primary node detection race at clean startup @ 2026-05-12 10:49 Emond Papegaaij <[email protected]> parent: Emond Papegaaij <[email protected]> 1 sibling, 0 replies; 5+ messages in thread From: Emond Papegaaij @ 2026-05-12 10:49 UTC (permalink / raw) To: [email protected] Hi, Something was wrong with the attached patch. It is rejected by patch, probably because of the large context. Attached is a new version that also works with patch. Best regards, Emond Papegaaij Op di 12 mei 2026 om 10:38 schreef Emond Papegaaij <[email protected]>: > > Hi, > > In our tests, we've found an issue that can cause all Pgpool nodes to > report an incorrect 'Role: standby': > Role : standby ← stale, never updated on this node > Backend Role : primary ← actual SR-check result > > This can happen if all nodes in a watchdog cluster start with a clean > state at the same time. If the first node is still trying to determine > the primary database, it's primary_node_id is -2. This value is then > synced to other nodes in the cluster, causing all nodes to report the > stale state indefinitely. Attached is a patch against 4.7 that should > fix this. > > Note that this analysis was done by Claude Code and it also created > the patch. The failure on our CI was real though and I think the > explanation makes sense. > > Best regards, > Emond Papegaaij Attachments: [application/x-patch] pgpool-keep-local-primary-when-leader-initial.patch (2.5K, 2-pgpool-keep-local-primary-when-leader-initial.patch) download | inline diff: Keep local primary_node_id when leader watchdog reports the initial -2 sentinel. When all pgpool nodes in a cluster are stopped and started simultaneously (e.g. via an admin "restart all pgpool nodes" action), every node initialises Req_info->primary_node_id to -2 (the sentinel) and then runs find_primary_node_repeatedly() locally to discover the real primary. The watchdog elects a LEADER in parallel; the losing nodes transition to STANDBY, receive SIG_WATCHDOG_STATE_CHANGED, and call sync_backend_from_watchdog() to pull the leader's view. If the leader's own find_primary_node_repeatedly() has not finished yet, the leader serialises its still-uninitialised primary_node_id (-2) and the standby's existing protective branch only covers -1 (quarantine). The -2 falls through to the else-clause and overwrites the standby's freshly- determined valid primary_node_id with -2. sync_backend_from_watchdog() is only re-invoked on SIG_BACKEND_SYNC_REQUIRED, which is raised only on WD_FAILOVER_END. No subsequent event fires after a simultaneous restart, so the standby is stuck at -2 indefinitely. Add a guard that keeps the local primary_node_id when the leader's value is the -2 initial sentinel. diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c --- a/src/main/pgpool_main.c +++ b/src/main/pgpool_main.c @@ -3729,6 +3729,27 @@ sync_backend_from_watchdog(void) Req_info->primary_node_id, backendStatus->nodeName), errdetail("keeping the current primary"))); } + else if (Req_info->primary_node_id >= 0 && + backendStatus->primary_node_id == -2) + { + /* + * Leader watchdog is still initialising and has not yet run + * find_primary_node_repeatedly(); its primary_node_id is still + * the initial sentinel -2. Do not overwrite our locally-determined + * primary with the leader's stale initial state. + * + * Without this guard, a simultaneous restart of all pgpool nodes + * leaves every STANDBY watchdog with primary_node_id = -2 forever: + * sync_backend_from_watchdog() is only re-invoked on + * SIG_BACKEND_SYNC_REQUIRED (raised from WD_FAILOVER_END), so + * there is no normal path that resyncs once the leader's own + * find_primary_node_repeatedly() completes. + */ + ereport(LOG, + (errmsg("primary node on leader watchdog node \"%s\" is still in the initial state", + backendStatus->nodeName), + errdetail("keeping the locally-detected primary node:%d", Req_info->primary_node_id))); + } else { Req_info->primary_node_id = backendStatus->primary_node_id; ^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Primary node detection race at clean startup @ 2026-05-19 12:40 Tatsuo Ishii <[email protected]> parent: Emond Papegaaij <[email protected]> 1 sibling, 1 reply; 5+ messages in thread From: Tatsuo Ishii @ 2026-05-19 12:40 UTC (permalink / raw) To: [email protected]; +Cc: [email protected] Hi Emond, > Hi, > > In our tests, we've found an issue that can cause all Pgpool nodes to > report an incorrect 'Role: standby': > Role : standby ← stale, never updated on this node > Backend Role : primary ← actual SR-check result > > This can happen if all nodes in a watchdog cluster start with a clean > state at the same time. If the first node is still trying to determine > the primary database, it's primary_node_id is -2. This value is then > synced to other nodes in the cluster, causing all nodes to report the > stale state indefinitely. Attached is a patch against 4.7 that should > fix this. > > Note that this analysis was done by Claude Code and it also created > the patch. The failure on our CI was real though and I think the > explanation makes sense. I have looked into the patch. Although I failed to reproduce the issue, I agree with you: the explanation makes sense. Also I have run the regression test and all test passed. I am going to push the patch to all supported branches. Regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp ^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Primary node detection race at clean startup @ 2026-05-19 12:55 Tatsuo Ishii <[email protected]> parent: Tatsuo Ishii <[email protected]> 0 siblings, 1 reply; 5+ messages in thread From: Tatsuo Ishii @ 2026-05-19 12:55 UTC (permalink / raw) To: [email protected]; +Cc: [email protected] Hi Emond, >> Hi, >> >> In our tests, we've found an issue that can cause all Pgpool nodes to >> report an incorrect 'Role: standby': >> Role : standby ← stale, never updated on this node >> Backend Role : primary ← actual SR-check result >> >> This can happen if all nodes in a watchdog cluster start with a clean >> state at the same time. If the first node is still trying to determine >> the primary database, it's primary_node_id is -2. This value is then >> synced to other nodes in the cluster, causing all nodes to report the >> stale state indefinitely. Attached is a patch against 4.7 that should >> fix this. >> >> Note that this analysis was done by Claude Code and it also created >> the patch. The failure on our CI was real though and I think the >> explanation makes sense. > > I have looked into the patch. Although I failed to reproduce the > issue, I agree with you: the explanation makes sense. Also I have run > the regression test and all test passed. I am going to push the patch > to all supported branches. Done. Note that the credits section in the commit message is as follows: Reported-by: Emond Papegaaij <[email protected]> Reported-by: Claude Code Author: Tatsuo Ishii <[email protected]> Discussion: https://www.postgresql.org/message-id/CAGXsc%2BZmBoLs3Mz%3DG-Bdm4JJG%2BfH1NpHfR3qVJVwW4eBKWwStQ%40ma... Backpatch-through: v4.3 You are credited as a reporter, and Claude Code is also a reporter to indicate that the report was created by AI. I credited myself as the author. This is by following reasons: 1) AI cannot be the author. 2) The committer (me) is responsible for the commit. Regards, -- Tatsuo Ishii SRA OSS K.K. English: http://www.sraoss.co.jp/index_en/ Japanese:http://www.sraoss.co.jp ^ permalink raw reply [nested|flat] 5+ messages in thread
* Re: Primary node detection race at clean startup @ 2026-05-19 13:25 Emond Papegaaij <[email protected]> parent: Tatsuo Ishii <[email protected]> 0 siblings, 0 replies; 5+ messages in thread From: Emond Papegaaij @ 2026-05-19 13:25 UTC (permalink / raw) To: Tatsuo Ishii <[email protected]>; +Cc: [email protected] > Done. Thanks! > Note that the credits section in the commit message is as follows: > > Reported-by: Emond Papegaaij <[email protected]> > Reported-by: Claude Code > Author: Tatsuo Ishii <[email protected]> > Discussion: https://www.postgresql.org/message-id/CAGXsc%2BZmBoLs3Mz%3DG-Bdm4JJG%2BfH1NpHfR3qVJVwW4eBKWwStQ%40ma... > Backpatch-through: v4.3 > > You are credited as a reporter, and Claude Code is also a reporter to > indicate that the report was created by AI. I credited myself as the > author. This is by following reasons: > > 1) AI cannot be the author. > 2) The committer (me) is responsible for the commit. Yes, that makes perfect sense. By default Claude adds a 'Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>'. I totally agree that as a human, you are responsible for the final commit. AI is a very powerful tool, but it's still just a tool. At our company, we have a very simple rule: ask yourself if you're proud of the work you've created. If you're not, then don't commit. Best regards, Emond ^ permalink raw reply [nested|flat] 5+ messages in thread
end of thread, other threads:[~2026-05-19 13:25 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2026-05-12 08:38 Primary node detection race at clean startup Emond Papegaaij <[email protected]> 2026-05-12 10:49 ` Emond Papegaaij <[email protected]> 2026-05-19 12:40 ` Tatsuo Ishii <[email protected]> 2026-05-19 12:55 ` Tatsuo Ishii <[email protected]> 2026-05-19 13:25 ` Emond Papegaaij <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox