public inbox for [email protected]  
client_connection_check_interval default value
5+ messages / 3 participants

* client_connection_check_interval default value
@ 2026-02-05 05:30 Jeremy Schneider <[email protected]>
  2026-02-05 13:36 ` Re: client_connection_check_interval default value Jeremy Schneider <[email protected]>
  2026-02-05 15:00 ` Re: client_connection_check_interval default value Greg Sabino Mullane <[email protected]>
  2026-02-05 17:26 ` Re: client_connection_check_interval default value Jacob Champion <[email protected]>
  0 siblings, 3 replies; 5+ messages in thread

From: Jeremy Schneider @ 2026-02-05 05:30 UTC (permalink / raw)
  To: [email protected] <[email protected]>; +Cc: Marat Buharov <[email protected]>

What would people here think about changing the default value of
client_connection_check_interval to 2000 ms? Right now this is disabled
by default.

The background is that I recently saw an incident where a blocking-lock
brownout escalated from a row-level problem to a complete system
outage, due to a combination of factors including a bug in golang's pgx
postgres client (PR 2481 has now been merged with a fix) and a pgbouncer
setup that was missing peers configuration. As a result, cancel
messages were getting dropped while postgres connections were waiting
on a blocked lock, golang aggressively timed out on context deadlines
and retried, and once the database reached max_connections the whole
system ground to a halt.

At the time I thought it was weird that postgres wasn't checking for
dead connections while those conns were waiting for locks; I spent a
bunch of time investigating this and reproduced it and wrote up what I
was able to figure out.

Then, yesterday, I saw a LinkedIn post from Marat at Data Egret who
mentioned that client_connection_check_interval exists.

Plugged this into my repro and confirmed it can prevent postgres from
escalating the blocking-lock brownout into a complete outage due to
connection exhaustion.
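
A toy model of the escalation described above may help make the
arithmetic concrete. It is only a sketch: the function names and all
numbers are hypothetical and are not taken from the incident or the
repro. It assumes every client abandons its session at each deadline and
immediately opens a new one, that the lock is never released, and that
with connection checks enabled an abandoned backend is cleaned up within
one check interval.

```python
import math


def backends_without_checks(clients: int, deadline_s: float, t_s: float) -> int:
    """Backends held at time t_s when abandoned sessions are never cleaned up."""
    retries = math.floor(t_s / deadline_s)
    # One live attempt per client plus one leaked backend per earlier retry.
    return clients * (1 + retries)


def seconds_to_exhaustion(clients: int, deadline_s: float, max_connections: int) -> float:
    """Time of the first retry at which the backend count reaches max_connections."""
    retries_needed = max(0, math.ceil(max_connections / clients) - 1)
    return retries_needed * deadline_s


def worst_case_with_checks(clients: int, deadline_s: float, check_interval_s: float) -> int:
    """Upper bound on backends if abandoned sessions are reaped within one interval."""
    # At most this many abandoned sessions per client can still be waiting to be reaped.
    lingering = math.ceil(check_interval_s / deadline_s)
    return clients * (1 + lingering)


if __name__ == "__main__":
    # Hypothetical workload: 10 clients, 1 s context deadline, max_connections = 100.
    print(seconds_to_exhaustion(10, 1.0, 100))
    print(backends_without_checks(10, 1.0, 9.0))
    print(worst_case_with_checks(10, 1.0, 2.0))
```

With these made-up inputs the model hits 100 backends after 9 seconds
when nothing cleans up abandoned sessions, while a 2 second check
interval bounds it at 30, so the damage stays confined to the sessions
waiting on the blocked row.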

While a fix has been merged in pgx for the most direct root cause of
the incident I saw, this setting just seems like a good behavior to
make Postgres more robust in general. 2000 ms seemed like a fairly
safe/conservative starting point for discussion. Thoughts?

-Jeremy


PS. Some more details and graphs are at
https://ardentperf.com/2026/02/04/postgres-client_connection_check_interval/


-- 
To know the thoughts and deeds that have marked man's progress is to
feel the great heart throbs of humanity through the centuries; and if
one does not feel in these pulsations a heavenward striving, one must
indeed be deaf to the harmonies of life.

Helen Keller, The Story Of My Life, 1902, 1903, 1905, introduction by
Ralph Barton Perry (Garden City, NY: Doubleday & Company, 1954), p90.



Attachments:

  [text/x-patch] 0001-Change-default-client_connection_check_interval-to-2.patch (2.0K, 2-0001-Change-default-client_connection_check_interval-to-2.patch)
From 7936fc4ad60e40b38629f372bff397bc42a0a7f5 Mon Sep 17 00:00:00 2001
From: Jeremy Schneider <[email protected]>
Date: Wed, 4 Feb 2026 21:08:32 -0800
Subject: [PATCH] Change default client_connection_check_interval to 2000ms

The default value of client_connection_check_interval is changed from 0
(disabled) to 2000ms (2 seconds). This enables periodic checking for client
disconnection during long-running queries by default, which can help detect
and clean up queries from disconnected clients more promptly.

A value of 0 continues to disable connection checking for users who prefer
the previous behavior.
---
 src/backend/utils/misc/guc_parameters.dat     | 2 +-
 src/backend/utils/misc/postgresql.conf.sample | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index f0260e6e412..91c0d740ce5 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -403,7 +403,7 @@
   long_desc => '0 disables connection checks.',
   flags => 'GUC_UNIT_MS',
   variable => 'client_connection_check_interval',
-  boot_val => '0',
+  boot_val => '2000',
   min => '0',
   max => 'INT_MAX',
   check_hook => 'check_client_connection_check_interval',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c4f92fcdac8..8dd89d6da4e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -87,7 +87,7 @@
 #tcp_user_timeout = 0                   # TCP_USER_TIMEOUT, in milliseconds;
                                         # 0 selects the system default
 
-#client_connection_check_interval = 0   # time between checks for client
+#client_connection_check_interval = 2000        # time between checks for client
                                         # disconnection while running queries;
                                         # 0 for never
 
-- 
2.43.0




* Re: client_connection_check_interval default value
  2026-02-05 05:30 client_connection_check_interval default value Jeremy Schneider <[email protected]>
@ 2026-02-05 13:36 ` Jeremy Schneider <[email protected]>
  2 siblings, 0 replies; 5+ messages in thread

From: Jeremy Schneider @ 2026-02-05 13:36 UTC (permalink / raw)
  To: [email protected] <[email protected]>; +Cc: Marat Buharov <[email protected]>

On Wed, 4 Feb 2026 21:30:32 -0800
Jeremy Schneider <[email protected]> wrote:

> What would people here think about changing the default value of
> client_connection_check_interval to 2000 ms? Right now this is
> disabled by default.

I forgot the doc update in the first patch. An updated patch is attached.

-Jeremy


Attachments:

  [text/x-patch] 0001-Change-default-client_connection_check_interval-to-2.patch (3.1K, 2-0001-Change-default-client_connection_check_interval-to-2.patch)
From b23d89338daaec9c381783fe6a9df9d148aeace7 Mon Sep 17 00:00:00 2001
From: Jeremy Schneider <[email protected]>
Date: Wed, 4 Feb 2026 21:08:32 -0800
Subject: [PATCH] Change default client_connection_check_interval to 2000ms

The default value of client_connection_check_interval is changed from 0
(disabled) to 2000ms (2 seconds). This enables periodic checking for client
disconnection during long-running queries by default, which can help detect
and clean up queries from disconnected clients more promptly.

A value of 0 continues to disable connection checking for users who prefer
the previous behavior.
---
 doc/src/sgml/config.sgml                      | 9 +++++----
 src/backend/utils/misc/guc_parameters.dat     | 2 +-
 src/backend/utils/misc/postgresql.conf.sample | 2 +-
 3 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 5560b95ee60..5bc7f029e80 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -1059,10 +1059,11 @@ include_dir 'conf.d'
        </para>
        <para>
         If the value is specified without units, it is taken as milliseconds.
-        The default value is <literal>0</literal>, which disables connection
-        checks.  Without connection checks, the server will detect the loss of
-        the connection only at the next interaction with the socket, when it
-        waits for, receives or sends data.
+        The default value is <literal>2000</literal> (2 seconds).  A value of
+        <literal>0</literal> disables connection checks.  Without connection
+        checks, the server will detect the loss of the connection only at the
+        next interaction with the socket, when it waits for, receives or sends
+        data.
        </para>
        <para>
         For the kernel itself to detect lost TCP connections reliably and within
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index f0260e6e412..91c0d740ce5 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -403,7 +403,7 @@
   long_desc => '0 disables connection checks.',
   flags => 'GUC_UNIT_MS',
   variable => 'client_connection_check_interval',
-  boot_val => '0',
+  boot_val => '2000',
   min => '0',
   max => 'INT_MAX',
   check_hook => 'check_client_connection_check_interval',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index c4f92fcdac8..8dd89d6da4e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -87,7 +87,7 @@
 #tcp_user_timeout = 0                   # TCP_USER_TIMEOUT, in milliseconds;
                                         # 0 selects the system default
 
-#client_connection_check_interval = 0   # time between checks for client
+#client_connection_check_interval = 2000        # time between checks for client
                                         # disconnection while running queries;
                                         # 0 for never
 
-- 
2.43.0




* Re: client_connection_check_interval default value
  2026-02-05 05:30 client_connection_check_interval default value Jeremy Schneider <[email protected]>
@ 2026-02-05 15:00 ` Greg Sabino Mullane <[email protected]>
  2 siblings, 0 replies; 5+ messages in thread

From: Greg Sabino Mullane @ 2026-02-05 15:00 UTC (permalink / raw)
  To: Jeremy Schneider <[email protected]>; +Cc: [email protected] <[email protected]>; Marat Buharov <[email protected]>

I'm a weak -1 on this. Certainly not 2s! That's a lot of context switching
for a busy system for no real reason. Also see this past discussion:

https://www.postgresql.org/message-id/CTEB8LNLOHKR.3I6NK8QVBAGSQ@gonk

--
Cheers,
Greg



* Re: client_connection_check_interval default value
  2026-02-05 05:30 client_connection_check_interval default value Jeremy Schneider <[email protected]>
@ 2026-02-05 17:26 ` Jacob Champion <[email protected]>
  2026-02-05 23:04   ` Re: client_connection_check_interval default value Jeremy Schneider <[email protected]>
  2 siblings, 1 reply; 5+ messages in thread

From: Jacob Champion @ 2026-02-05 17:26 UTC (permalink / raw)
  To: Jeremy Schneider <[email protected]>; +Cc: [email protected] <[email protected]>; Marat Buharov <[email protected]>

On Wed, Feb 4, 2026 at 9:30 PM Jeremy Schneider
<[email protected]> wrote:
> While a fix has been merged in pgx for the most direct root cause of
> the incident I saw, this setting just seems like a good behavior to
> make Postgres more robust in general.

At the risk of making perfect the enemy of better, the protocol-level
heartbeat mentioned in the original thread [1] would cover more use
cases, which might give it a better chance of eventually becoming
default behavior. It might also be a lot of work, though.

--Jacob

[1] https://postgr.es/m/CA%2BhUKGLyj5Aqt6ojYfSc%2BqSeB1x%3D3RbU61hnus5sL0BKqEBsLw%40mail.gmail.com



* Re: client_connection_check_interval default value
  2026-02-05 05:30 client_connection_check_interval default value Jeremy Schneider <[email protected]>
  2026-02-05 17:26 ` Re: client_connection_check_interval default value Jacob Champion <[email protected]>
@ 2026-02-05 23:04   ` Jeremy Schneider <[email protected]>
  0 siblings, 0 replies; 5+ messages in thread

From: Jeremy Schneider @ 2026-02-05 23:04 UTC (permalink / raw)
  To: Jacob Champion <[email protected]>; +Cc: [email protected] <[email protected]>; Marat Buharov <[email protected]>; Greg Sabino Mullane <[email protected]>; Thomas Munro <[email protected]>

One interesting thing to me - it seems like all of the past mail
threads were focused on a situation different from mine. Lots of
discussion about freeing resources like CPU.

In the outage I saw, the system was idle and we completely ran out of
max_connections because all sessions were waiting on a row lock.

Importantly, the app was closing these conns but we had sockets stacking
up on the server in CLOSE-WAIT state - and postgres simply never
cleaned them up until we had an outage. The processes were completely
idle waiting for a row lock that was not going to be released.
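
To illustrate the underlying idea, here is a minimal sketch of how a
process can notice that its peer has closed a socket without otherwise
touching the connection. It is not the server's implementation of
client_connection_check_interval; the helper name peer_has_closed is
hypothetical, and a local socketpair stands in for the client/server TCP
connection. A known limitation of this peek-based approach is that
unread data on the socket hides the close until it has been consumed.

```python
import socket


def peer_has_closed(sock: socket.socket) -> bool:
    """Return True if the peer closed its end and no unread data remains."""
    was_blocking = sock.getblocking()
    sock.setblocking(False)
    try:
        # MSG_PEEK inspects the receive queue without consuming anything.
        data = sock.recv(1, socket.MSG_PEEK)
    except BlockingIOError:
        # Nothing to read and no EOF: the peer is still connected.
        return False
    except ConnectionResetError:
        # The peer went away abruptly.
        return True
    finally:
        sock.setblocking(was_blocking)
    # An empty read means EOF; for TCP this socket would now be in CLOSE-WAIT.
    return data == b""


if __name__ == "__main__":
    server_side, client_side = socket.socketpair()
    print(peer_has_closed(server_side))  # client is still connected
    client_side.close()                  # client gives up, e.g. on a deadline
    print(peer_has_closed(server_side))  # the close is visible without any query I/O
    server_side.close()
```

A backend that only ever blocks on the lock never performs a check like
this on its own, which is why the closed sockets can pile up unnoticed
unless something periodically looks at them.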

Impact could have been isolated to sessions hitting that row (with this
GUC), but it escalated to a system outage. It's pretty simple to
reproduce this:
https://github.com/ardentperf/pg-idle-test/tree/main/conn_exhaustion


On Thu, 5 Feb 2026 09:26:34 -0800
Jacob Champion <[email protected]> wrote:

> On Wed, Feb 4, 2026 at 9:30 PM Jeremy Schneider
> <[email protected]> wrote:
> > While a fix has been merged in pgx for the most direct root cause of
> > the incident I saw, this setting just seems like a good behavior to
> > make Postgres more robust in general.  
> 
> At the risk of making perfect the enemy of better, the protocol-level
> heartbeat mentioned in the original thread [1] would cover more use
> cases, which might give it a better chance of eventually becoming
> default behavior. It might also be a lot of work, though.

It seems like a fair bit of discussion is around OS coverage - even
Thomas' message there references keepalive working as expected on
Linux. Tom objects in 2023 that "the default behavior would then be
platform-dependent and that's a documentation problem we could do
without."

But it's been five years - has there been further work on implementing
a postgres-level heartbeat? And I see other places in the docs where we
note platform differences. Is it really such a big problem to change
the default here?


On Thu, 5 Feb 2026 10:00:29 -0500
Greg Sabino Mullane <[email protected]> wrote:

> I'm a weak -1 on this. Certainly not 2s! That's a lot of context
> switching for a busy system for no real reason. Also see this past
> discussion:

In the other thread, the larger perf concerns I see seem to be about
some early implementations, before the patch was refactored.
Konstantin's message on 2019-08-02 said he didn't see much difference
and that the value of the timeout didn't seem to matter; if anything,
the marginal effect came simply from the presence of any timer (the
same effect as setting statement_timeout). Later in the thread it seems
like Thomas also saw minimal performance concern here.

I did see a real system outage that could have been prevented by an
appropriate default value here, since I didn't yet know to change it.

-Jeremy


Attachments:

  [image/png] client_connection_check.png (45.7K, 2-client_connection_check.png)


