Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug

From: hubert depesz lubaczewski <[email protected]>
To: Adrian Klaver <[email protected]>
Cc: PostgreSQL General <[email protected]>
Subject: Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug
Date: Thu, 21 Aug 2025 17:13:42 +0200
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>

On Thu, Aug 21, 2025 at 08:04:25AM -0700, Adrian Klaver wrote:
> > > > > "For ~ 1 second there are no logs going to log (we usually have at 5-20
> > > > > messages logged per second), no connection, nothing. And then we get
> > > > > bunch (30+) messages with the same milisecond time."
> > > > > Are the 30+ messages all coming in on one connection or multiple
> > > > > connections?
> > > > Multiple connections.
> > > > > Also to be clear these are statements that are being run on the replica
> > > > > locally, correct?
> > > > What do you mean locally?
> > > I should have been clearer. Are the queries being run against the replica or
> > > the primary?
> > All to replica. Primary has its own work, of course, but the problem
> > we're experiencing is on replicas.
> 
> If I am following there is more then one primary --> replica pair and the
> problem exists across all the pairs.

Not all. We have ~ 300 such clusters. The thing doesn't cause any
customer-visible issues (after all, it's just a 1 second delay every so
often), so it's generally overlooked when it happens.

But we were paying closer attention to one such cluster, and then a
couple of others, and we've seen this behavior there.

> > > How many applications servers are hitting the database?
> > 
> > To be honest, I'm not sure. I have visibility into dbs, and bouncers,
> > not really into Apps. I know that these are automatically dynamically
> > scaled, so number of app server is very varying.
> > 
> > I'd say anything from 40 to 200 app servers hit first layer of bouncers,
> > which we usually have 6-9 (2-3 per az).
> > 
> > These go to 2nd layer of bouncers, on the db server itself.
> 
> By bouncer I assume you mean something like pgBouncer, a connection pooler.
> Is it possible to determine what bouncer the queries in question are coming
> from?

From the POV of the db, all queries are coming from one of N localhost
bouncers. N is usually 2…6.
From the POV of the local bouncers, the queries come from a range of
remote bouncers.

Generally we haven't seen any correlation with queries coming from
specific ranges of IPs. Logged queries, the ones that we see with
a runtime of 1s, have comments that indicate the source, and they come
from "all-around". Specifically, "DISCARD ALL" queries are generated by
the bouncers themselves (both layers).

Just so that it's clear, I don't expect anyone to be able to
diagnose the problem based on the description. I'm looking more for
ideas on what to look for. The issue is that, with the situation being
pretty short and happening on servers with non-trivial query load, I
can't do things like, for example, strace.
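
For locating the stalls after the fact, without touching the running
server, the pattern quoted at the top (about 1 second of silence, then
30+ messages sharing one millisecond) can be searched for in the log
files. A minimal sketch, assuming log_line_prefix starts with %m so
every entry begins with a millisecond timestamp; the function name and
both thresholds are made up for illustration:

```python
import re
from datetime import datetime, timedelta

# Assumption: log_line_prefix starts with %m, so entries look like
# "2025-08-21 17:13:42.100 CEST [12345] LOG:  ..."
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")


def find_stalls(lines, min_gap=timedelta(milliseconds=800), min_burst=30):
    """Return (last_before_gap, first_after_gap, burst_size) for every
    silence of at least min_gap that is followed by at least min_burst
    log entries carrying the same timestamp."""
    stamps = []
    for line in lines:
        m = TS_RE.match(line)
        if m:  # continuation lines without a timestamp are skipped
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f"))

    stalls = []
    i = 1
    while i < len(stamps):
        if stamps[i] - stamps[i - 1] >= min_gap:
            # Count how many consecutive entries share the first
            # timestamp seen after the gap.
            j = i
            while j < len(stamps) and stamps[j] == stamps[i]:
                j += 1
            if j - i >= min_burst:
                stalls.append((stamps[i - 1], stamps[i], j - i))
            i = j
        else:
            i += 1
    return stalls
```

Calling find_stalls() on an open log file gives a list of stall
windows, which could then be lined up against other per-host data from
the same moments. It only shows when the stalls happened, not why.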

Best regards,

depesz
