Message-ID:
From: "vlsi (@vlsi)"
To: "pgjdbc/pgjdbc"
Date: Thu, 21 May 2026 07:59:01 +0000
Subject: [pgjdbc/pgjdbc] issue #4084: Multi-Host Retry and Host-Status Model: Revisit Operator Semantics
List-Id:
X-GitHub-Author-Id: 213894
X-GitHub-Author-Login: vlsi
X-GitHub-Issue: 4084
X-GitHub-Repo: pgjdbc/pgjdbc
X-GitHub-State: open
X-GitHub-Type: issue
X-GitHub-Url: https://github.com/pgjdbc/pgjdbc/issues/4084
Content-Type: text/plain; charset=utf-8

Review multi-host connection behavior from an operator/SRE perspective, focusing on:

- shared `connectTimeout` budget across hosts
- cached host status in `GlobalHostStatusTracker`
- `hostRecheckSeconds` semantics
- how easy it is to reason about retry behavior during outages and failover

## Current behavior

Current implementation is centered in:

- `org.postgresql.core.v3.ConnectionFactoryImpl`
- `org.postgresql.hostchooser.MultiHostChooser`
- `org.postgresql.hostchooser.GlobalHostStatusTracker`

Observed behavior:

- `connectTimeout` is shared across hosts within one `getConnection()` call
- host statuses are cached JVM-wide
- hosts cached as `ConnectFail` are skipped until `hostRecheckSeconds` expires
- after TTL expiry, the host becomes eligible again on the next connection attempt
- `loadBalanceHosts`, target server type, and cached host state all participate in host ordering

## Why this is worth revisiting

The current model is much better than "fresh full timeout per host forever", but still not especially intuitive operationally.

Questions an operator may ask:

- How quickly will the driver stop trying a dead host?
- How long will it avoid that host?
- When does it start probing again?
- How do shared timeout budget and host-state TTL interact?
- What is the recommended shape for failover-oriented settings?

Those are answerable today, but not especially obvious.

## Candidate directions

### 1. Clarify the mental model

Possible framing:

- `connectTimeout` = attempt budget within one connection attempt
- `hostRecheckSeconds` = cache TTL for failed host status across attempts

That distinction should be explicit in behavior and docs.

### 2. Explore more operator-intent semantics

Potential future concepts:

- explicit failed-host backoff
- separate retry/backoff tuning for dead hosts vs role-mismatch hosts
- adaptive or exponential re-probe policy instead of fixed TTL only

### 3. Review whether cached host-state policy should be more transparent

Possible improvements:

- better logging / tracing when a host is skipped due to cached status
- better diagnostics around why a host was or was not retried

### 4. Revisit interaction with pools and request-retry loops

In practice, operators often reason about retries at a higher layer. The driver behavior should be predictable enough that pool/application retry policy can be layered on top without guesswork.

## Questions to resolve

- Is fixed `hostRecheckSeconds` sufficient, or should future backoff options be considered?
- Are `ConnectFail`, `Primary`, and `Secondary` TTL semantics equally appropriate?
- Should the driver expose better diagnostics for host skipping and re-probing?
- Should there be a more explicit "production failover profile" recommendation or opt-in behavior?

## Acceptance criteria

- multi-host behavior is explainable in one short operator-oriented model
- tests clearly cover:
  - shared `connectTimeout` across hosts
  - cached `ConnectFail` skipping until TTL expiry
  - re-probe after expiry
  - interaction with target server type and load balancing
- future improvements can be evaluated independently from other timeout work
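
To make the "one short operator-oriented model" concrete, here is a minimal sketch of the two knobs as described above: one `connectTimeout` budget shared by all hosts within a call, and a `hostRecheckSeconds`-style TTL on cached failures across calls. This is a toy simulation, not the code in `ConnectionFactoryImpl` or `GlobalHostStatusTracker`. `MultiHostModel`, `Result`, the simulated clock, and the fixed `failLatencyMillis` are all hypothetical names and assumptions introduced for illustration; it ignores target server type and load balancing entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the operator-facing semantics. NOT the pgjdbc implementation. */
final class MultiHostModel {

    /** Outcome of one simulated getConnection() call (hypothetical type). */
    static final class Result {
        String connectedHost;                            // null if no host accepted
        final List<String> tried = new ArrayList<>();    // hosts actually attempted
        final List<String> skipped = new ArrayList<>();  // hosts skipped via cached failure
        long elapsedMillis;                              // simulated time spent in the call
    }

    private final long connectTimeoutMillis;  // one budget shared by all hosts in a call
    private final long hostRecheckMillis;     // TTL of a cached failure across calls
    private final long failLatencyMillis;     // assumption: time a dead host takes to fail
    private final Map<String, Long> failedAt = new HashMap<>();

    MultiHostModel(long connectTimeoutMillis, long hostRecheckMillis, long failLatencyMillis) {
        this.connectTimeoutMillis = connectTimeoutMillis;
        this.hostRecheckMillis = hostRecheckMillis;
        this.failLatencyMillis = failLatencyMillis;
    }

    /** Simulates one call starting at nowMillis; hosts in deadHosts always fail. */
    Result connect(long nowMillis, List<String> hosts, Set<String> deadHosts) {
        Result result = new Result();
        long clock = nowMillis;
        long deadline = nowMillis + connectTimeoutMillis;
        for (String host : hosts) {
            Long failed = failedAt.get(host);
            if (failed != null && clock - failed < hostRecheckMillis) {
                result.skipped.add(host);   // cached failure has not expired yet
                continue;
            }
            long remaining = deadline - clock;
            if (remaining <= 0) {
                break;                      // shared budget exhausted
            }
            result.tried.add(host);
            if (deadHosts.contains(host)) {
                clock += Math.min(failLatencyMillis, remaining);
                failedAt.put(host, clock);  // remember the failure for later calls
            } else {
                failedAt.remove(host);
                result.connectedHost = host;
                break;
            }
        }
        result.elapsedMillis = clock - nowMillis;
        return result;
    }

    public static void main(String[] args) {
        MultiHostModel model = new MultiHostModel(10_000, 10_000, 3_000);
        List<String> hosts = Arrays.asList("a", "b");
        Set<String> dead = Collections.singleton("a");
        for (long t : new long[] {0, 5_000, 13_000}) {
            Result r = model.connect(t, hosts, dead);
            System.out.println("t=" + t + " tried=" + r.tried + " skipped=" + r.skipped
                + " connected=" + r.connectedHost + " elapsedMs=" + r.elapsedMillis);
        }
    }
}
```

In this model the first call pays the failure latency for the dead host, the second call skips it because the cached failure is still fresh, and the third call probes it again once the TTL has elapsed. Real tests for the acceptance criteria would have to exercise the driver classes themselves; the sketch only pins down which behaviors those tests should distinguish.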
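
For reference when discussing failover-oriented settings, this is roughly how the knobs named in this issue are set from client code today. The property names (`connectTimeout`, `hostRecheckSeconds`, `targetServerType`, `loadBalanceHosts`) are existing pgjdbc connection parameters; the host names, database name, and the specific values are placeholders chosen for illustration, not a proposed "production failover profile".

```java
import java.util.Properties;

/** Illustrative multi-host settings; values are placeholders, not a recommendation. */
final class FailoverProfileExample {

    // Hypothetical hosts and database name; replace with real ones.
    static final String URL =
        "jdbc:postgresql://db1.example.internal:5432,db2.example.internal:5432/appdb";

    static Properties props() {
        Properties p = new Properties();
        p.setProperty("targetServerType", "primary");  // only accept a primary
        p.setProperty("connectTimeout", "5");          // seconds
        p.setProperty("hostRecheckSeconds", "10");     // TTL for cached host status
        p.setProperty("loadBalanceHosts", "false");    // keep the listed host order
        return p;
    }

    public static void main(String[] args) {
        System.out.println(URL);
        props().forEach((k, v) -> System.out.println(k + "=" + v));
        // With the pgjdbc driver on the classpath and reachable hosts:
        // java.sql.Connection c = java.sql.DriverManager.getConnection(URL, props());
    }
}
```

Nothing in these four values tells an operator how the timeout budget and the status TTL interact during an outage, which is the documentation gap this issue is about.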