Re: Possible causes of high_replay lag, given replication settings?

public inbox for [email protected]  


* Re: Possible causes of high_replay lag, given replication settings?
@ 2025-07-25 13:57  Jon Zeppieri <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Jon Zeppieri @ 2025-07-25 13:57 UTC (permalink / raw)
  To: Nick Cleaton <[email protected]>; +Cc: [email protected]

On Wed, Jul 23, 2025 at 4:27 PM Nick Cleaton <[email protected]> wrote:
>
> On Fri, 18 Jul 2025 at 21:29, Jon Zeppieri <[email protected]> wrote:
> >
> > I just had a situation where physical replication fell far behind
> > (hours). The write and flush lag times were 0, but replay_lag was
> > high. The replica has hot_standby_feedback on, and both
> > max_standby_streaming_delay and max_standby_archive_delay are set to
> > 30s.
> >
> > What could cause a situation like this? If the network were a problem,
> > I'd expect the other _lag times to be high. So it appears that the
> > replica was getting the WAL but was unable to apply it. Are there
> > situations where the replica cannot apply WAL other than the kinds of
> > conflicts that would be addressed by the _delay settings?
> >
> > I checked pg_stat_database_conflicts, but there was nothing in it -- all zeros.
>
> This can happen when there are several busy writing processes on the
> primary. The single replay process on the replica can't keep up with
> the writes.

Thanks for the response, Nick. I'm curious why the situation you
describe wouldn't also lead to write_lag and flush_lag being high.
If the problem is simply keeping up with the primary, wouldn't
you expect all three lag times to be elevated?

- Jon

* Re: Possible causes of high_replay lag, given replication settings?
@ 2025-07-25 23:12  Greg Sabino Mullane <[email protected]>
  parent: Jon Zeppieri <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Greg Sabino Mullane @ 2025-07-25 23:12 UTC (permalink / raw)
  To: Jon Zeppieri <[email protected]>; +Cc: Nick Cleaton <[email protected]>; [email protected]

On Fri, Jul 25, 2025 at 9:57 AM Jon Zeppieri <[email protected]> wrote:

> Thanks for the response, Nick. I'm curious why the situation you describe
> wouldn't also lead to write_lag and flush_lag being high. If the
> problem is simply keeping up with the primary, wouldn't you
> expect all three lag times to be elevated?
>

No - write and flush are pretty quick and simple; they just put the WAL
onto the local disk. Replay involves a lot more work, as we have to parse
the WAL and apply the changes, which means doing a lot of I/O across many
files. Still, *hours* to me indicates more than just a lot of extra
traffic. Check that recovery_min_apply_delay is still 0, then log onto the
replica and see what's going on with regard to open transactions and locks.
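
To make the write/flush/replay distinction concrete, here is a minimal Python sketch (not from anyone's actual tooling). It assumes the three lag values have already been fetched from pg_stat_replication on the primary with whatever driver is at hand. The function name slowest_stage and the 5-second threshold are hypothetical, chosen only for illustration. Each lag is measured from the primary's local flush, so the values are cumulative and the difference between adjacent ones approximates the time spent in that stage.

```python
from datetime import timedelta

# Run on the primary; these columns exist in pg_stat_replication (PostgreSQL 10+).
LAG_QUERY = """
SELECT application_name, write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
"""


def slowest_stage(write_lag, flush_lag, replay_lag,
                  threshold=timedelta(seconds=5)):
    """Name the stage that contributes most to the total lag.

    The lag columns are NULL (None) when there is nothing to report,
    so None is treated as zero. The threshold is an arbitrary,
    illustrative cutoff below which no stage is flagged.
    """
    zero = timedelta(0)
    w = write_lag or zero
    f = flush_lag or zero
    r = replay_lag or zero
    if r <= threshold:
        return "none"
    # Lags are cumulative, so per-stage cost is the increment over the
    # previous stage. Clamp at zero in case of sampling jitter.
    increments = {
        "write": w,
        "flush": max(f - w, zero),
        "replay": max(r - f, zero),
    }
    return max(increments, key=increments.get)


# The situation from this thread: write and flush at 0, replay hours behind.
print(slowest_stage(timedelta(0), timedelta(0), timedelta(hours=3)))
```

A "replay" result matches what was reported here: the replica is receiving and flushing WAL promptly but is slow (or blocked) applying it. A "write" result would point at the network or the WAL receiver instead.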

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support

* Re: Possible causes of high_replay lag, given replication settings?
@ 2025-07-26 15:43  Jon Zeppieri <[email protected]>
  parent: Greg Sabino Mullane <[email protected]>
  0 siblings, 0 replies; 3+ messages in thread

From: Jon Zeppieri @ 2025-07-26 15:43 UTC (permalink / raw)
  To: Greg Sabino Mullane <[email protected]>; +Cc: Nick Cleaton <[email protected]>; [email protected]

On Fri, Jul 25, 2025 at 7:13 PM Greg Sabino Mullane <[email protected]> wrote:
>
> On Fri, Jul 25, 2025 at 9:57 AM Jon Zeppieri <[email protected]> wrote:
>>
>> Thanks for the response, Nick. I'm curious why the situation you describe wouldn't also lead to write_lag and flush_lag being
>> high. If the problem is simply keeping up with the primary, wouldn't you expect all three lag times to be elevated?
>
>
> No - write and flush are pretty quick and simple; they just put the WAL onto the local disk. Replay involves a lot more work, as we have to parse the WAL and apply the changes, which means doing a lot of I/O across many files. Still, *hours* to me indicates more than just a lot of extra traffic. Check that recovery_min_apply_delay is still 0, then log onto the replica and see what's going on with regard to open transactions and locks.

Thanks, Greg. I just checked, and `recovery_min_apply_delay` is 0. Also, I
didn't mention in my initial post that the cause of the delay seemed to
be long-running queries on the replica, rather than on the primary. It's
possible, of course, that I'm wrong, but I was able to get the replica
moving again when I killed off old queries on the replica. If those were
the problem, though, then I don't understand why
max_standby_streaming_delay didn't prevent that situation.
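
For anyone wanting to reproduce the "find the old queries on the replica" step, a rough Python sketch follows. It assumes rows of (pid, xact_start) have been fetched from pg_stat_activity on the replica by some driver. The helper name pids_older_than is hypothetical, and the 30-second cutoff in the example only mirrors the delay settings mentioned in this thread. It does not explain why the delay setting did not cancel those queries; it only identifies candidates.

```python
from datetime import datetime, timedelta, timezone

# Run on the replica. pg_stat_activity.backend_type exists in PostgreSQL 10+.
OLD_QUERIES_SQL = """
SELECT pid, xact_start, state, query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND xact_start IS NOT NULL
ORDER BY xact_start;
"""


def pids_older_than(rows, max_age, now):
    """Return pids whose transaction started more than max_age before now.

    rows is an iterable of (pid, xact_start) pairs; a xact_start of None
    (no open transaction) is skipped.
    """
    return [
        pid
        for pid, xact_start in rows
        if xact_start is not None and now - xact_start > max_age
    ]


# Made-up sample data, for illustration only.
now = datetime(2025, 7, 26, 12, 0, tzinfo=timezone.utc)
sample = [
    (101, now - timedelta(hours=2)),     # long-running transaction
    (102, now - timedelta(seconds=10)),  # recent transaction
    (103, None),                         # idle, no transaction open
]
print(pids_older_than(sample, timedelta(seconds=30), now))
```

Each returned pid could then be passed to pg_cancel_backend(pid) or pg_terminate_backend(pid) on the replica, which is roughly the manual step described above.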

- Jon
