Message-ID: <gh-pgjdbc-pgjdbc-4082@github.com>
From: "vlsi (@vlsi)" <noreply+vlsi@github.com>
To: "pgjdbc/pgjdbc" <noreply+pgjdbc-pgjdbc@github.com>
Date: Thu, 21 May 2026 07:58:06 +0000
Subject: [pgjdbc/pgjdbc] issue #4082: Startup Deadline Model: Make Connection Establishment More Deadline-Based
List-Id: <gh-pgjdbc-pgjdbc.github.com>
X-GitHub-Author-Id: 213894
X-GitHub-Author-Login: vlsi
X-GitHub-Issue: 4082
X-GitHub-Repo: pgjdbc/pgjdbc
X-GitHub-State: open
X-GitHub-Type: issue
X-GitHub-Url: https://github.com/pgjdbc/pgjdbc/issues/4082
Content-Type: text/plain; charset=utf-8

Revisit how pgJDBC enforces startup-phase timeouts so connection establishment behaves more like a real deadline budget and less like a caller-side wait wrapper.

Today `loginTimeout` caps how long the caller waits, but the actual connection work continues in a daemon thread until it eventually succeeds or fails. This is understandable from an implementation standpoint, but it behaves poorly during outages and partial network failures.

## Current behavior

The current implementation is centered in:

- `org.postgresql.Driver`
- `org.postgresql.core.v3.ConnectionFactoryImpl`

Notable behaviors:

- `loginTimeout` is enforced by running the connect attempt in a daemon thread and waiting on a condition variable.
- if the timeout fires, the caller gets `08001` / `Connection attempt timed out.`
- the worker thread keeps running in the background
- multi-host `connectTimeout` is now budgeted across hosts, but the other startup phases are not modeled as a single propagated deadline
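
The caller-side wait pattern described above can be illustrated with a minimal sketch. This is not the driver's code: `CallerSideWaitSketch`, `connect`, and `workerFinished` are hypothetical names, a `CompletableFuture` stands in for the condition variable, and `Thread.sleep` stands in for the real connect work.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; names and structure are assumptions.
final class CallerSideWaitSketch {
  /** Set by the worker once its simulated connect attempt has finished. */
  static final AtomicBoolean workerFinished = new AtomicBoolean(false);

  /**
   * Waits at most loginTimeoutMillis for a simulated connect attempt that
   * takes workMillis. Returns true on success, false if the caller gave up.
   */
  static boolean connect(final long workMillis, long loginTimeoutMillis) {
    final CompletableFuture<String> result = new CompletableFuture<>();
    Thread worker = new Thread(() -> {
      try {
        Thread.sleep(workMillis); // stands in for TCP connect, TLS, authentication
        result.complete("connection");
      } catch (InterruptedException e) {
        result.completeExceptionally(e);
      } finally {
        workerFinished.set(true);
      }
    }, "connect-worker");
    worker.setDaemon(true);
    worker.start();
    try {
      result.get(loginTimeoutMillis, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      // The caller stops waiting here, but nothing tells the worker to stop.
      return false;
    } catch (ExecutionException e) {
      return false;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
  }
}
```

Calling `connect(300, 50)` returns `false` after roughly 50 ms, while the worker keeps running until its own work completes. That gap is the "caller timed out" versus "network operations stopped" distinction discussed below.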

## Why this is worth revisiting

From an SRE perspective:

- timed-out attempts continuing in the background can amplify resource usage during incidents
- timeout semantics become harder to explain: "caller timed out" is not the same as "network operations stopped"
- operators typically want one answer to "how long can connection establishment take?"

## Goals

- make startup timeout behavior closer to a real end-to-end deadline budget
- reduce abandoned in-flight work after caller-side timeout where feasible
- preserve compatibility where possible

## Candidate directions

### 1. Propagate a single absolute startup deadline internally

Instead of treating startup phases as mostly separate timeout domains, compute a single deadline and derive remaining phase budgets from it during:

- TCP connect
- SSL/GSS upgrade
- TLS/GSS handshake
- startup packet exchange
- authentication
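
One way to picture this is a small value object that holds the absolute deadline and hands out per-phase budgets. The sketch below is an assumption about shape, not an existing pgJDBC class: `StartupDeadline`, `after`, and `phaseTimeoutMillis` are hypothetical names.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: one absolute deadline, shrinking per-phase budgets.
final class StartupDeadline {
  private final long deadlineNanos;

  private StartupDeadline(long deadlineNanos) {
    this.deadlineNanos = deadlineNanos;
  }

  /** Creates a deadline that expires the given time from now. */
  static StartupDeadline after(long timeout, TimeUnit unit) {
    return new StartupDeadline(System.nanoTime() + unit.toNanos(timeout));
  }

  /** Time left in milliseconds, rounded up; 0 once the deadline has passed. */
  long remainingMillis() {
    long remaining = deadlineNanos - System.nanoTime();
    if (remaining <= 0) {
      return 0;
    }
    return remaining / 1_000_000 + (remaining % 1_000_000 == 0 ? 0 : 1);
  }

  boolean isExpired() {
    return deadlineNanos - System.nanoTime() <= 0;
  }

  /**
   * Budget for the next phase: the smaller of the phase's own cap and the
   * time left overall. A cap <= 0 means "no phase-specific cap".
   * Throws instead of returning 0, because java.net.Socket treats a
   * timeout of 0 as "wait forever".
   */
  int phaseTimeoutMillis(long phaseCapMillis) {
    long remaining = remainingMillis();
    if (remaining == 0) {
      throw new IllegalStateException("startup deadline exceeded");
    }
    long budget = phaseCapMillis > 0 ? Math.min(phaseCapMillis, remaining) : remaining;
    return (int) Math.min(budget, Integer.MAX_VALUE);
  }
}
```

With a 10-second deadline, `phaseTimeoutMillis(2000)` yields 2000 for TCP connect, and each later phase receives whatever is left of the 10 seconds at the moment it starts. Whether an expired deadline should surface as an `IllegalStateException`, a `SocketTimeoutException`, or a `PSQLException` is left open here.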

### 2. Ensure worker-side blocking I/O obeys the deadline

Even if the background-thread model stays, the worker should ideally derive the timeout for each of its socket-level operations from the time remaining, so that a timed-out attempt does not run much past the deadline.
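
A minimal sketch of what that could look like with plain `java.net.Socket`, assuming the deadline is passed around as a `System.nanoTime()` value. `DeadlineBoundStartup` and both method names are hypothetical, and the single-byte read stands in for the real startup exchange.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketAddress;
import java.net.SocketTimeoutException;

// Hypothetical sketch: every blocking call gets the time left, not a fixed timeout.
final class DeadlineBoundStartup {
  /**
   * Milliseconds left until deadlineNanos (a System.nanoTime() value).
   * Never returns 0, because Socket treats a timeout of 0 as "wait forever";
   * throws once the deadline has passed.
   */
  static int remainingMillisOrThrow(long deadlineNanos) throws SocketTimeoutException {
    long remainingNanos = deadlineNanos - System.nanoTime();
    if (remainingNanos <= 0) {
      throw new SocketTimeoutException("startup deadline exceeded");
    }
    long millis = Math.max(1, remainingNanos / 1_000_000);
    return (int) Math.min(millis, Integer.MAX_VALUE);
  }

  /** Connects and reads one byte, with both steps bounded by the same deadline. */
  static int connectAndReadFirstByte(SocketAddress address, long deadlineNanos)
      throws IOException {
    try (Socket socket = new Socket()) {
      socket.connect(address, remainingMillisOrThrow(deadlineNanos));
      // SO_TIMEOUT applies per read call, so a multi-read phase would need
      // to re-derive it before each blocking read.
      socket.setSoTimeout(remainingMillisOrThrow(deadlineNanos));
      return socket.getInputStream().read();
    }
  }
}
```

Against a server that accepts the connection but never responds, the read fails with `SocketTimeoutException` close to the deadline rather than after a separately configured socket timeout.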

### 3. Abort in-flight socket/stream on login timeout where possible

We cannot safely hard-stop Java threads, but if the connect attempt already owns a socket or stream, explicitly closing it on timeout may unblock the pending I/O call and let the worker unwind sooner.
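
The mechanism relies on documented `java.net.Socket` behavior: closing a socket makes any thread blocked in I/O on it throw a `SocketException`. A rough sketch of the hand-off between worker and timeout owner follows; `AbortableConnectAttempt` and its members are hypothetical, and the sketch covers only a single plain socket, not TLS wrappers or multi-host retries.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketAddress;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: the timeout owner closes the worker's in-flight socket.
final class AbortableConnectAttempt {
  private final AtomicReference<Socket> inFlight = new AtomicReference<>();
  private volatile boolean aborted;

  /** Worker side: publishes the socket before any blocking call. */
  int run(SocketAddress address) throws IOException {
    Socket socket = new Socket();
    inFlight.set(socket);
    try {
      // Covers the case where abort() ran before the socket was published.
      if (aborted) {
        throw new IOException("attempt aborted before connect");
      }
      socket.connect(address);
      return socket.getInputStream().read(); // blocks until data, EOF, or close()
    } finally {
      inFlight.compareAndSet(socket, null);
      socket.close(); // closing an already-closed socket is a no-op
    }
  }

  /** Timeout-owner side: returns true if an in-flight socket was closed. */
  boolean abort() {
    aborted = true;
    Socket socket = inFlight.getAndSet(null);
    if (socket == null) {
      return false;
    }
    try {
      socket.close(); // unblocks the worker's pending connect/read
    } catch (IOException ignored) {
      // best effort: the worker still sees the aborted flag or a closed socket
    }
    return true;
  }
}
```

The `aborted` flag plus the published reference handle the ordering between "socket created" and "timeout fired" in this toy case. The boolean returned by `abort()` is also the kind of signal that could feed an "in-flight socket was aborted or not" trace message.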

### 4. Improve observability

If behavior remains two-layered, add clearer debug/trace hooks for:

- caller timeout fired
- whether the in-flight socket was aborted
- worker later succeeded and discarded the connection
- worker later failed after caller had already returned

## Open questions

- Can the timeout owner safely close the worker's active socket without introducing races or cleanup hazards?
- Should `loginTimeout` remain implemented via a detached worker, or is there a path toward more direct deadline enforcement?
- If a unified startup timeout property were added in the future, how should it interact with `loginTimeout` and `connectTimeout`?

## Compatibility notes

This likely needs phased work:

- tests / observability first
- internal deadline propagation next
- more aggressive cleanup/abort only with careful compatibility review