public inbox for [email protected]
help / color / mirror / Atom feedFrom: Keith <[email protected]>
To: Ron Johnson <[email protected]>
Cc: Pgsql-admin <[email protected]>
Subject: Re: Better way to monitor for failed replication?
Date: Fri, 9 Jan 2026 12:41:57 -0500
Message-ID: <CAHw75vvaeoTDO6796G7O_zamiaFWoi81+2YDrjSh4mvFsnATkQ@mail.gmail.com> (raw)
In-Reply-To: <CANzqJaBG-MkG9YTt8pYTnHu+9U5wpwEcWQKzx2aOu85C6Uzn-w@mail.gmail.com>
References: <CANzqJaBG-MkG9YTt8pYTnHu+9U5wpwEcWQKzx2aOu85C6Uzn-w@mail.gmail.com>
On Fri, Jan 9, 2026 at 10:50 AM Ron Johnson <[email protected]> wrote:
> Currently, in a bash script, I run this SELECT statement against the
> Primary server which is supposed to replicate to multiple servers. If
> active == f, I send an alter email.
>
> postgres=# SELECT rs.slot_name, rs.active, sr.client_hostname
> from pg_replication_slots rs
> left outer join pg_stat_replication sr on rs.active_pid = sr.pid;
> slot_name | active | client_hostname
> --------------+--------+-----------------
> pgstandby1 | t | BBOPITCPGS302B
> replicate_dr | f |
> (2 rows)
>
> Is there a better way to check for replication that's supposed to be
> happening, but isn't (like PG on the replica was stopped for some reason)?
>
> --
> Death to <Redacted>, and butter sauce.
> Don't boil me, I'm still alive.
> <Redacted> lobster!
>
Your example only takes into account if you are using replication slots,
correct? If you're always using those, this is definitely a good metric to
have since the slot going down means WAL buildup, so I'd definitely keep
it. As for general replication monitoring, these have been the two queries
I use
On the Primary:
SELECT client_addr AS replica
, client_hostname AS replica_hostname
, client_port AS replica_port
, pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes
FROM pg_catalog.pg_stat_replication;
This checks for byte-lag for all active streaming replicas, physical or
logical. A count of zero or NULL from this metric means all replicas are
down. Can monitor a specific count if you have a known number of replicas.
On any Replica:
SELECT
CASE
WHEN (pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()) OR
(pg_is_in_recovery() = false) THEN 0
ELSE EXTRACT (EPOCH FROM clock_timestamp() -
pg_last_xact_replay_timestamp())::INTEGER
END
AS replay_time
, CASE
WHEN pg_is_in_recovery() = false THEN 0
ELSE EXTRACT (EPOCH FROM clock_timestamp() -
pg_last_xact_replay_timestamp())::INTEGER
END
AS received_time;
This monitors the lag in seconds from the replica. Technically it monitors
the last time a WAL file was received (received_time) and the last time WAL
was actually replayed (replay_time). The reason for both is that the
received time can be a false positive when there is no write activity on
the primary. If there's always supposed to be write activity, this can be a
another good metric to indicate that something is very wrong. The
replay_time metric avoids the false positive by only being considered when
receive is different than replay. This metric also works when you're doing
WAL-replay replication instead of streaming.
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected]
Subject: Re: Better way to monitor for failed replication?
In-Reply-To: <CAHw75vvaeoTDO6796G7O_zamiaFWoi81+2YDrjSh4mvFsnATkQ@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox