[pgjdbc/pgjdbc] issue #4081: Timeout Semantics Overhaul for Operational Robustness

From: vlsi (@vlsi) <[email protected]>
To: pgjdbc/pgjdbc <[email protected]>
Subject: [pgjdbc/pgjdbc] issue #4081: Timeout Semantics Overhaul for Operational Robustness
Date: Thu, 21 May 2026 07:57:04 +0000
Message-ID: <[email protected]>

pgJDBC's timeout model has grown incrementally and now mixes:

- operator-intent properties like `loginTimeout`, `socketTimeout`, and `cancelSignalTimeout`
- lower-level protocol-phase properties like `sslResponseTimeout` and `gssResponseTimeout`
- multi-host behavior that has its own budgeting and cached host-state semantics
- connection-pool interactions where third-party code feeds its own timeouts into pgJDBC in non-obvious ways

The implementation is mostly internally consistent, but the resulting behavior is harder for operators to reason about than it should be during incidents, failovers, and partial network failures.

This umbrella issue tracks a timeout/SRE review with two goals:

1. Make timeout behavior easier to understand and easier to configure correctly in production.
2. Improve robustness under degraded network and failover conditions without breaking compatibility unnecessarily.

## Why this matters

From a production/SRE perspective, operators usually think in terms of:

- how long a connection attempt may take
- how long a query may run
- how long a blocked read may hang
- how long a cancel attempt may take
- how long a failed host should be avoided before trying again

Today, some pgJDBC behaviors are implementation-shaped rather than operator-shaped. Examples:

- `loginTimeout` is a caller-side wall-clock cap enforced by a background thread, not a unified startup deadline inside the network operations themselves.
- `sslResponseTimeout` / `gssResponseTimeout` bound only the one-byte upgrade response, not the whole secure-upgrade phase.
- `setQueryTimeout` is often read as "the query stops in N seconds", but the actual implementation is "attempt a backend cancel after N seconds".
- multi-host failover behavior spans `connectTimeout`, `hostRecheckSeconds`, target server type, and cached host status, and the combined effect of these can be non-obvious.
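
To make the `setQueryTimeout` point concrete, here is a minimal, driver-free simulation of "attempt a cancel after N seconds". `FakeQuery` and `runWithQueryTimeout` are hypothetical names invented for this sketch and are not pgJDBC types. The timer only requests a cancel, and the caller stays blocked until the simulated backend honors it.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelAfterTimeoutSketch {

    /** Hypothetical stand-in for a running query; not a pgJDBC type. */
    static final class FakeQuery {
        final AtomicBoolean cancelRequested = new AtomicBoolean(false);
        final long cancelLatencyMillis; // how long the simulated backend needs to honor a cancel

        FakeQuery(long cancelLatencyMillis) {
            this.cancelLatencyMillis = cancelLatencyMillis;
        }

        /** Blocks until a cancel was requested and honored; returns elapsed milliseconds. */
        long run() throws InterruptedException {
            long start = System.nanoTime();
            while (!cancelRequested.get()) {
                Thread.sleep(5); // the "query" keeps running until someone asks it to stop
            }
            Thread.sleep(cancelLatencyMillis); // the cancel itself takes time to land
            return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        }
    }

    /** Runs the fake query with a timeout that only requests a cancel. */
    static long runWithQueryTimeout(FakeQuery query, long timeoutMillis)
            throws InterruptedException {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        try {
            // After timeoutMillis, ask for a cancel; nothing forces the query to stop.
            timer.schedule(() -> query.cancelRequested.set(true),
                    timeoutMillis, TimeUnit.MILLISECONDS);
            return query.run();
        } finally {
            timer.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        long elapsed = runWithQueryTimeout(new FakeQuery(300), 100);
        System.out.println("timeout=100ms, call returned after about " + elapsed + "ms");
    }
}
```

In this simulation the call returns only after the timeout plus the cancel latency, which is the gap between "the query stops in N seconds" and "a cancel is attempted after N seconds".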

## Current fact base

This issue is based on a source-backed review of current pgJDBC behavior on the active branch, including:

- `org.postgresql.Driver`
- `org.postgresql.core.v3.ConnectionFactoryImpl`
- `org.postgresql.ssl.MakeSSL`
- `org.postgresql.core.QueryExecutorBase`
- `org.postgresql.jdbc.PgStatement`
- `org.postgresql.hostchooser.*`

The review also checked third-party interactions against current HikariCP sources for pool-related timeout propagation.

## Goals

- Define a cleaner mental model for connection-establishment, read, query, cancel, and failover timeouts.
- Identify compatibility-safe clarifications vs behavior changes.
- Improve deadline propagation across startup phases where feasible.
- Reduce "surprising but technically correct" timeout behavior in incident conditions.
- Keep existing properties working unless there is a strong reason not to.

## Non-goals

- Silent breaking changes to existing timeout defaults.
- Replacing all existing properties immediately.
- Forcing a single timeout model on all workloads, especially long-running ETL / batch / replication-style usage.

## Candidate workstreams

- startup deadline model (`connectTimeout`, `loginTimeout`, SSL/GSS/auth phases)
- secure-handshake timeout semantics (`sslResponseTimeout`, `gssResponseTimeout`)
- multi-host / host-status retry model (`connectTimeout`, `hostRecheckSeconds`, cached host states)
- query-timeout / cancel semantics
- compatibility aliases or clearer naming for future-facing properties
- pool integration guidance and possibly pool-aware behavior where safe
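
For orientation, the properties named in these workstreams can be set together through a standard `java.util.Properties` object. The sketch below is illustrative only: the values are placeholders rather than recommendations, and the units in the comments reflect my reading of the pgJDBC documentation, so verify them against the driver version in use. The class and method names are hypothetical.

```java
import java.util.Properties;
import java.util.TreeSet;

public class TimeoutPropertiesSketch {

    /** Builds a Properties object grouped by the workstreams above. Values are placeholders. */
    static Properties build() {
        Properties p = new Properties();

        // Startup deadline model (assumed unit: seconds)
        p.setProperty("connectTimeout", "10");
        p.setProperty("loginTimeout", "30");

        // Secure-handshake response timeouts (assumed unit: milliseconds)
        p.setProperty("sslResponseTimeout", "5000");
        p.setProperty("gssResponseTimeout", "5000");

        // Multi-host / host-status retry model (seconds, per the property name)
        p.setProperty("hostRecheckSeconds", "10");

        // Blocked reads and cancel attempts (assumed unit: seconds)
        p.setProperty("socketTimeout", "60");
        p.setProperty("cancelSignalTimeout", "10");
        return p;
    }

    public static void main(String[] args) {
        Properties p = build();
        // Print in a stable order so the configuration is easy to review.
        for (String name : new TreeSet<>(p.stringPropertyNames())) {
            System.out.println(name + "=" + p.getProperty(name));
        }
    }
}
```

Such an object would typically be handed to `DriverManager.getConnection(String, Properties)` together with a `jdbc:postgresql:` URL.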

## Open design questions

- Should `loginTimeout` remain a caller-side wait cap, or evolve toward a real startup deadline budget?
- Should the driver expose a single startup-phase timeout in addition to existing low-level properties?
- Should future clearer properties be introduced as aliases while preserving existing names?
- Can the driver offer a more production-safe opt-in profile without changing risky defaults?
- Should query-timeout semantics remain "best effort cancel" only, or should pgJDBC expose stronger or clearer APIs around timeout intent?
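
As one way to picture the "real startup deadline budget" option from the first question, the sketch below shares a single wall-clock budget across startup phases and clips each phase's own cap to whatever remains. `Deadline` and `phaseTimeoutMillis` are hypothetical helpers written for this illustration; nothing here reflects how pgJDBC is implemented today.

```java
import java.util.function.LongSupplier;

public class StartupDeadlineSketch {

    /** A single wall-clock budget shared by all startup phases (hypothetical helper). */
    static final class Deadline {
        private final LongSupplier nanoClock;
        private final long deadlineNanos;

        Deadline(long budgetMillis, LongSupplier nanoClock) {
            this.nanoClock = nanoClock;
            this.deadlineNanos = nanoClock.getAsLong() + budgetMillis * 1_000_000L;
        }

        /** Milliseconds left in the overall budget, never negative. */
        long remainingMillis() {
            return Math.max(0L, (deadlineNanos - nanoClock.getAsLong()) / 1_000_000L);
        }

        /**
         * Timeout for one phase: its own cap, clipped to what the budget still allows.
         * In this sketch a cap of 0 or less means "no phase-specific cap".
         * A result of 0 means the budget is exhausted; a caller must fail the attempt
         * rather than pass 0 to an API where 0 means "wait forever".
         */
        long phaseTimeoutMillis(long phaseCapMillis) {
            long remaining = remainingMillis();
            return phaseCapMillis <= 0 ? remaining : Math.min(phaseCapMillis, remaining);
        }
    }

    public static void main(String[] args) {
        long[] fakeNowNanos = {0L}; // injected clock so the example is deterministic
        Deadline deadline = new Deadline(10_000, () -> fakeNowNanos[0]);

        System.out.println("tcp connect: " + deadline.phaseTimeoutMillis(10_000)); // 10000

        fakeNowNanos[0] += 7_000L * 1_000_000L; // pretend the TCP connect took 7 s
        System.out.println("ssl upgrade: " + deadline.phaseTimeoutMillis(5_000)); // 3000

        fakeNowNanos[0] += 4_000L * 1_000_000L; // pretend the upgrade overran
        System.out.println("auth: " + deadline.phaseTimeoutMillis(0)); // 0
    }
}
```

The injected clock is a design choice for testability; a real implementation would more likely use `System::nanoTime`.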

## Possible future issues not yet split out

- Explore clearer timeout property aliases centered on operator intent
- Explore pool-friendly / production-safe timeout profiles
- Explore optional `socketTimeout` derivation heuristics from server-side timeouts such as `statement_timeout`
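
To give the last item some shape, one conceivable heuristic is to derive `socketTimeout` as the server-side `statement_timeout` plus a safety margin, so the server gets the first chance to end the statement before the client gives up on the read. This is purely a hypothetical sketch: the method name, the margin parameter, and the rounding rule are assumptions, not a proposal from the issue. It relies on `statement_timeout` being expressed in milliseconds with 0 meaning disabled, and on a `socketTimeout` of 0 meaning disabled.

```java
public class SocketTimeoutHeuristicSketch {

    /**
     * Hypothetical derivation: statement_timeout rounded up to whole seconds,
     * plus a margin. Returns 0 (disabled) when statement_timeout is disabled,
     * because no upper bound on statement duration is known in that case.
     */
    static int deriveSocketTimeoutSeconds(long statementTimeoutMillis, int marginSeconds) {
        if (statementTimeoutMillis <= 0) {
            return 0;
        }
        long seconds = (statementTimeoutMillis + 999) / 1000; // round up to whole seconds
        return (int) Math.min(Integer.MAX_VALUE, seconds + marginSeconds);
    }

    public static void main(String[] args) {
        System.out.println(deriveSocketTimeoutSeconds(30_000, 5)); // 35
        System.out.println(deriveSocketTimeoutSeconds(1_500, 5));  // 7
        System.out.println(deriveSocketTimeoutSeconds(0, 5));      // 0
    }
}
```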

## Compatibility guidance

Any proposed implementation changes should be classified as:

- documentation / observability only
- behavior clarification without externally visible semantic changes
- additive behavior (new property, alias, opt-in mode)
- behavior change with compatibility risk

That split should make it easier to deliver incremental improvements without bundling all timeout work into one risky release.
