Message-ID:
From: "vlsi (@vlsi)"
To: "pgjdbc/pgjdbc"
Date: Thu, 21 May 2026 07:58:06 +0000
Subject: [pgjdbc/pgjdbc] issue #4082: Startup Deadline Model: Make Connection Establishment More Deadline-Based
List-Id:
X-GitHub-Author-Id: 213894
X-GitHub-Author-Login: vlsi
X-GitHub-Issue: 4082
X-GitHub-Repo: pgjdbc/pgjdbc
X-GitHub-State: open
X-GitHub-Type: issue
X-GitHub-Url: https://github.com/pgjdbc/pgjdbc/issues/4082
Content-Type: text/plain; charset=utf-8

Revisit how pgJDBC enforces startup-phase timeouts so connection establishment behaves more like a real deadline budget and less like a caller-side wait wrapper.

Today `loginTimeout` caps how long the caller waits, but the actual connection work continues in a daemon thread until it eventually succeeds or fails. This is understandable from an implementation standpoint, but it is less than ideal operationally under outages and partial network failures.

## Current behavior

Current implementation is centered in:

- `org.postgresql.Driver`
- `org.postgresql.core.v3.ConnectionFactoryImpl`

Notable behaviors:

- `loginTimeout` is enforced by running the connect attempt in a daemon thread and waiting on a condition variable.
- if the timeout fires, the caller gets `08001` / `Connection attempt timed out.`
- the worker thread keeps running in the background
- multi-host `connectTimeout` is now budgeted across hosts, but other startup phases are not modeled as one single propagated deadline

## Why this is worth revisiting

From an SRE perspective:

- timed-out attempts continuing in the background can amplify resource usage during incidents
- timeout semantics become harder to explain: "caller timed out" is not the same as "network operations stopped"
- operators typically want one answer to "how long can connection establishment take?"

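
To make the "caller timed out" versus "network operations stopped" distinction concrete, here is a minimal sketch of a caller-side wait wrapper. It is not pgJDBC's code: the class and method names are hypothetical, and it waits on a `CompletableFuture` rather than the condition variable described above. It only illustrates that the caller can give up while the worker keeps going.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Illustrative only; hypothetical names, not part of pgJDBC. */
final class CallerSideTimeoutDemo {

  /**
   * Runs {@code work} on a daemon thread and waits at most {@code timeoutMillis}
   * for its result. On timeout the caller gets a TimeoutException, but nothing
   * stops the worker: it keeps running until {@code work} returns or throws.
   */
  static <T> T callWithCallerSideTimeout(Callable<T> work, long timeoutMillis)
      throws InterruptedException, ExecutionException, TimeoutException {
    CompletableFuture<T> result = new CompletableFuture<>();
    Thread worker = new Thread(() -> {
      try {
        result.complete(work.call());
      } catch (Throwable t) {
        result.completeExceptionally(t);
      }
    }, "connect-worker-illustrative");
    worker.setDaemon(true);
    worker.start();
    // Only the wait is bounded here; the work itself has no deadline.
    return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
  }
}
```

If `work` blocks in socket I/O for 30 seconds and `timeoutMillis` is 5 seconds, the caller returns after about 5 seconds while the worker thread and its socket stay occupied for the remaining 25.
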
## Goals

- make startup timeout behavior closer to a real end-to-end deadline budget
- reduce abandoned in-flight work after caller-side timeout where feasible
- preserve compatibility where possible

## Candidate directions

### 1. Propagate a single absolute startup deadline internally

Instead of treating startup phases as mostly separate timeout domains, compute a single deadline and derive remaining phase budgets from it during:

- TCP connect
- SSL/GSS upgrade
- TLS / GSS handshake
- startup packet exchange
- authentication

### 2. Ensure worker-side blocking I/O obeys the deadline

Even if the background-thread model stays, the worker should ideally apply shrinking remaining timeouts to its own socket-level operations so timed-out attempts do not continue for much longer than the deadline.

### 3. Abort in-flight socket/stream on login timeout where possible

We cannot safely hard-stop Java threads, but if the connect attempt already owns a socket or stream, explicitly closing it on timeout may break the blocking I/O and let the worker unwind sooner.

### 4. Improve observability

If behavior remains two-layered, add clearer debug/trace hooks for:

- caller timeout fired
- in-flight socket was aborted or not
- worker later succeeded and discarded the connection
- worker later failed after caller had already returned

## Open questions

- Can the timeout owner safely close the worker's active socket without introducing races or cleanup hazards?
- Should `loginTimeout` remain implemented via detached worker, or is there a path toward more direct deadline enforcement?
- If a unified startup timeout property were added in the future, how should it interact with `loginTimeout` and `connectTimeout`?

## Compatibility notes

This likely needs phased work:

- tests / observability first
- internal deadline propagation next
- more aggressive cleanup/abort only with careful compatibility review
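
As a rough illustration of candidate directions 1 and 2, the sketch below shows one way a single absolute deadline could yield shrinking per-phase timeouts. `StartupDeadline` and its methods are hypothetical names invented for this example, not existing pgJDBC API. The clock reading is passed in explicitly so the arithmetic is easy to test; real code would pass `System.nanoTime()`.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper, not part of pgJDBC: one absolute deadline shared by all startup phases. */
final class StartupDeadline {
  // Absolute deadline on the System.nanoTime() clock.
  private final long deadlineNanos;

  private StartupDeadline(long deadlineNanos) {
    this.deadlineNanos = deadlineNanos;
  }

  /** Creates a deadline budgetMillis after nowNanos (assumes a budget far below the nanoTime range). */
  static StartupDeadline startingAt(long nowNanos, long budgetMillis) {
    return new StartupDeadline(nowNanos + TimeUnit.MILLISECONDS.toNanos(budgetMillis));
  }

  /** Remaining budget in milliseconds, rounded up; 0 once the deadline has passed. */
  long remainingMillis(long nowNanos) {
    long remainingNanos = deadlineNanos - nowNanos; // subtraction is safe across nanoTime wraparound
    if (remainingNanos <= 0) {
      return 0;
    }
    // Round up so a sub-millisecond remainder never becomes 0,
    // which java.net.Socket would interpret as "no timeout".
    return TimeUnit.NANOSECONDS.toMillis(remainingNanos + 999_999L);
  }

  /**
   * Timeout to use for the next blocking phase: the remaining budget, optionally
   * capped by a per-phase limit (phaseCapMillis <= 0 means "no extra cap").
   * Fails fast instead of returning 0 when the budget is already spent.
   */
  int phaseTimeoutMillis(long nowNanos, int phaseCapMillis) throws SocketTimeoutException {
    long remaining = remainingMillis(nowNanos);
    if (remaining == 0) {
      throw new SocketTimeoutException("Startup deadline exceeded");
    }
    long timeout = phaseCapMillis > 0 ? Math.min(remaining, phaseCapMillis) : remaining;
    return (int) Math.min(timeout, Integer.MAX_VALUE);
  }
}
```

A worker could then recompute the value before each phase, for example passing it to `Socket.connect(SocketAddress, int)` for the TCP connect and to `Socket.setSoTimeout(int)` before the handshake and authentication reads. How a per-phase cap such as `connectTimeout` should combine with the overall budget is one of the open questions above; taking the minimum is only one possible choice.
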