Better way to monitor for failed replication?

public inbox for [email protected]  
help / color / mirror / Atom feed

Better way to monitor for failed replication?
3+ messages / 2 participants
[nested] [flat]

* Better way to monitor for failed replication?
@ 2026-01-09 15:50 Ron Johnson <[email protected]>
  2026-01-09 17:41 ` Re: Better way to monitor for failed replication? Keith <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Ron Johnson @ 2026-01-09 15:50 UTC (permalink / raw)
  To: Pgsql-admin <[email protected]>

Currently, in a bash script, I run this SELECT statement against the
Primary server which is supposed to replicate to multiple servers.  If
active == f, I send an alter email.

postgres=# SELECT rs.slot_name, rs.active, sr.client_hostname
from pg_replication_slots rs
    left outer join pg_stat_replication sr on rs.active_pid = sr.pid;
  slot_name   | active | client_hostname
--------------+--------+-----------------
 pgstandby1   | t      | BBOPITCPGS302B
 replicate_dr | f      |
(2 rows)

Is there a better way to check for replication that's supposed to be
happening, but isn't (like PG on the replica was stopped for some reason)?

-- 
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Better way to monitor for failed replication?
  2026-01-09 15:50 Better way to monitor for failed replication? Ron Johnson <[email protected]>
@ 2026-01-09 17:41 ` Keith <[email protected]>
  2026-01-09 17:53   ` Re: Better way to monitor for failed replication? Ron Johnson <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Keith @ 2026-01-09 17:41 UTC (permalink / raw)
  To: Ron Johnson <[email protected]>; +Cc: Pgsql-admin <[email protected]>

On Fri, Jan 9, 2026 at 10:50 AM Ron Johnson <[email protected]> wrote:

> Currently, in a bash script, I run this SELECT statement against the
> Primary server which is supposed to replicate to multiple servers.  If
> active == f, I send an alter email.
>
> postgres=# SELECT rs.slot_name, rs.active, sr.client_hostname
> from pg_replication_slots rs
>     left outer join pg_stat_replication sr on rs.active_pid = sr.pid;
>   slot_name   | active | client_hostname
> --------------+--------+-----------------
>  pgstandby1   | t      | BBOPITCPGS302B
>  replicate_dr | f      |
> (2 rows)
>
> Is there a better way to check for replication that's supposed to be
> happening, but isn't (like PG on the replica was stopped for some reason)?
>
> --
> Death to <Redacted>, and butter sauce.
> Don't boil me, I'm still alive.
> <Redacted> lobster!
>

Your example only takes into account if you are using replication slots,
correct? If you're always using those, this is definitely a good metric to
have since the slot going down means WAL buildup, so I'd definitely keep
it. As for general replication monitoring, these have been the two queries
I use

On the Primary:

SELECT client_addr AS replica
        , client_hostname AS replica_hostname
        , client_port AS replica_port
        , pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes
        FROM pg_catalog.pg_stat_replication;

This checks for byte-lag for all active streaming replicas, physical or
logical. A count of zero or NULL from this metric means all replicas are
down. Can monitor a specific count if you have a known number of replicas.

On any Replica:

SELECT
       CASE
       WHEN (pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()) OR
(pg_is_in_recovery() = false) THEN 0
       ELSE EXTRACT (EPOCH FROM clock_timestamp() -
pg_last_xact_replay_timestamp())::INTEGER
       END
    AS replay_time
    ,  CASE
       WHEN pg_is_in_recovery() = false THEN 0
       ELSE EXTRACT (EPOCH FROM clock_timestamp() -
pg_last_xact_replay_timestamp())::INTEGER
       END
    AS received_time;

This monitors the lag in seconds from the replica. Technically it monitors
the last time a WAL file was received (received_time) and the last time WAL
was actually replayed (replay_time). The reason for both is that the
received time can be a false positive when there is no write activity on
the primary. If there's always supposed to be write activity, this can be a
another good metric to indicate that something is very wrong. The
replay_time metric avoids the false positive by only being considered when
receive is different than replay. This metric also works when you're doing
WAL-replay replication instead of streaming.

^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Better way to monitor for failed replication?
  2026-01-09 15:50 Better way to monitor for failed replication? Ron Johnson <[email protected]>
  2026-01-09 17:41 ` Re: Better way to monitor for failed replication? Keith <[email protected]>
@ 2026-01-09 17:53   ` Ron Johnson <[email protected]>
  0 siblings, 0 replies; 3+ messages in thread

From: Ron Johnson @ 2026-01-09 17:53 UTC (permalink / raw)
  To: Pgsql-admin <[email protected]>

On Fri, Jan 9, 2026 at 12:42 PM Keith <[email protected]> wrote:

>
>
> On Fri, Jan 9, 2026 at 10:50 AM Ron Johnson <[email protected]>
> wrote:
>
>> Currently, in a bash script, I run this SELECT statement against the
>> Primary server which is supposed to replicate to multiple servers.  If
>> active == f, I send an alter email.
>>
>> postgres=# SELECT rs.slot_name, rs.active, sr.client_hostname
>> from pg_replication_slots rs
>>     left outer join pg_stat_replication sr on rs.active_pid = sr.pid;
>>   slot_name   | active | client_hostname
>> --------------+--------+-----------------
>>  pgstandby1   | t      | BBOPITCPGS302B
>>  replicate_dr | f      |
>> (2 rows)
>>
>> Is there a better way to check for replication that's supposed to be
>> happening, but isn't (like PG on the replica was stopped for some reason)?
>>
>

>
> Your example only takes into account if you are using replication slots,
> correct? If you're always using those, this is definitely a good metric to
> have since the slot going down means WAL buildup, so I'd definitely keep it.
>

Yes, just replication slots.


> As for general replication monitoring, these have been the two queries I
> use
>
> On the Primary:
>
> SELECT client_addr AS replica
>         , client_hostname AS replica_hostname
>         , client_port AS replica_port
>         , pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes
>         FROM pg_catalog.pg_stat_replication;
>
> This checks for byte-lag for all active streaming replicas, physical or
> logical. A count of zero or NULL from this metric means all replicas are
> down. Can monitor a specific count if you have a known number of replicas.
>
> On any Replica:
>
> SELECT
>        CASE
>        WHEN (pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()) OR
> (pg_is_in_recovery() = false) THEN 0
>        ELSE EXTRACT (EPOCH FROM clock_timestamp() -
> pg_last_xact_replay_timestamp())::INTEGER
>        END
>     AS replay_time
>     ,  CASE
>        WHEN pg_is_in_recovery() = false THEN 0
>        ELSE EXTRACT (EPOCH FROM clock_timestamp() -
> pg_last_xact_replay_timestamp())::INTEGER
>        END
>     AS received_time;
>
> This monitors the lag in seconds from the replica. Technically it monitors
> the last time a WAL file was received (received_time) and the last time WAL
> was actually replayed (replay_time). The reason for both is that the
> received time can be a false positive when there is no write activity on
> the primary. If there's always supposed to be write activity, this can be a
> another good metric to indicate that something is very wrong. The
> replay_time metric avoids the false positive by only being considered when
> receive is different than replay.
>

I'll integrate this into the lag report.


-- 
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!


^ permalink  raw  reply  [nested|flat] 3+ messages in thread

end of thread, other threads:[~2026-01-09 17:53 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-01-09 15:50 Better way to monitor for failed replication? Ron Johnson <[email protected]>
2026-01-09 17:41 ` Keith <[email protected]>
2026-01-09 17:53   ` Ron Johnson <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox