MIME-Version: 1.0
References: 
 <CANzqJaBG-MkG9YTt8pYTnHu+9U5wpwEcWQKzx2aOu85C6Uzn-w@mail.gmail.com>
In-Reply-To: 
 <CANzqJaBG-MkG9YTt8pYTnHu+9U5wpwEcWQKzx2aOu85C6Uzn-w@mail.gmail.com>
From: Keith <keith@keithf4.com>
Date: Fri, 9 Jan 2026 12:41:57 -0500
Message-ID: 
 <CAHw75vvaeoTDO6796G7O_zamiaFWoi81+2YDrjSh4mvFsnATkQ@mail.gmail.com>
Subject: Re: Better way to monitor for failed replication?
To: Ron Johnson <ronljohnsonjr@gmail.com>
Cc: Pgsql-admin <pgsql-admin@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000dc843a0647f80cf7"
Archived-At: 
 <https://www.postgresql.org/message-id/CAHw75vvaeoTDO6796G7O_zamiaFWoi81%2B2YDrjSh4mvFsnATkQ%40mail.gmail.com>
Precedence: bulk

--000000000000dc843a0647f80cf7
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Fri, Jan 9, 2026 at 10:50 AM Ron Johnson <ronljohnsonjr@gmail.com> wrote:

> Currently, in a bash script, I run this SELECT statement against the
> Primary server which is supposed to replicate to multiple servers.  If
> active == f, I send an alert email.
>
> postgres=# SELECT rs.slot_name, rs.active, sr.client_hostname
> from pg_replication_slots rs
>     left outer join pg_stat_replication sr on rs.active_pid = sr.pid;
>   slot_name   | active | client_hostname
> --------------+--------+-----------------
>  pgstandby1   | t      | BBOPITCPGS302B
>  replicate_dr | f      |
> (2 rows)
>
> Is there a better way to check for replication that's supposed to be
> happening, but isn't (like PG on the replica was stopped for some reason)?
>
> --
> Death to <Redacted>, and butter sauce.
> Don't boil me, I'm still alive.
> <Redacted> lobster!
>

Your example only works if you are using replication slots, correct? If
you always use them, it is definitely a good metric to have, since an
inactive slot means WAL builds up on the primary, so I'd keep it. For
general replication monitoring, these are the two queries I use:

On the Primary:

SELECT client_addr AS replica
        , client_hostname AS replica_hostname
        , client_port AS replica_port
        , pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes
        FROM pg_catalog.pg_stat_replication;

This checks byte lag (WAL sent but not yet replayed) for all active
streaming replicas, physical or logical. If the query returns no rows, so
that the metric's count is zero or NULL, all replicas are down. If you have
a known number of replicas, you can also monitor for that specific count.
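
A minimal sketch of that count check, assuming the query output is piped
from psql in unaligned, tuples-only mode into a small shell function. The
function name check_replica_count, the exit codes, and the expected count
of 2 are hypothetical and only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: reads unaligned psql output (one row per line)
# on stdin and compares the row count with the expected replica count.
check_replica_count() {
    expected="$1"
    # Every non-empty line is one row from pg_stat_replication.
    # grep -c exits non-zero when it counts 0 lines, hence "|| true".
    actual=$(grep -c . || true)
    if [ "$actual" -lt "$expected" ]; then
        echo "CRITICAL: $actual of $expected replicas connected"
        return 2
    fi
    echo "OK: $actual of $expected replicas connected"
}

# Assumed usage against the primary (2 expected replicas):
#   psql -XAt -c "SELECT pid FROM pg_catalog.pg_stat_replication" \
#       | check_replica_count 2
```

Selecting pid in the usage line keeps every row non-empty, since
client_addr is NULL for replicas connected over a Unix socket.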

On any Replica:

SELECT
       CASE
       WHEN (pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn())
            OR (pg_is_in_recovery() = false) THEN 0
       ELSE EXTRACT (EPOCH FROM clock_timestamp()
                     - pg_last_xact_replay_timestamp())::INTEGER
       END
    AS replay_time
    ,  CASE
       WHEN pg_is_in_recovery() = false THEN 0
       ELSE EXTRACT (EPOCH FROM clock_timestamp()
                     - pg_last_xact_replay_timestamp())::INTEGER
       END
    AS received_time;

This monitors the lag in seconds from the replica. Both columns are derived
from pg_last_xact_replay_timestamp(), the primary-side timestamp of the last
transaction replayed: received_time always reports how long ago that was,
while replay_time reports it only when WAL has been received but not yet
replayed. The reason for both is that received_time can be a false positive
when there is no write activity on the primary. If there is always supposed
to be write activity, it can be another good metric to indicate that
something is very wrong. The replay_time metric avoids the false positive by
reporting 0 whenever the receive and replay positions match. This query also
works when you're doing WAL-replay replication instead of streaming.
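
A minimal sketch of alerting on those two columns, assuming the replica
query is run through psql in unaligned, tuples-only mode so the row arrives
as "replay_time|received_time". The function name check_replay_lag, the
exit codes, and the thresholds are hypothetical and only for illustration:

```shell
#!/bin/sh
# Hypothetical helper: reads one "replay_time|received_time" row
# (psql -At output of the replica query) on stdin and applies thresholds.
check_replay_lag() {
    warn="$1"   # warning threshold in seconds (assumed value)
    crit="$2"   # critical threshold in seconds (assumed value)
    # read returns non-zero on a missing trailing newline; ignore that.
    IFS='|' read -r replay_time received_time || true
    # An empty field means the query returned NULL, e.g. nothing has
    # been replayed yet. Report that as unknown rather than healthy.
    if [ -z "$replay_time" ]; then
        echo "UNKNOWN: replay_time is NULL"
        return 3
    fi
    detail="replay_time=${replay_time}s received_time=${received_time}s"
    if [ "$replay_time" -ge "$crit" ]; then
        echo "CRITICAL: $detail"
        return 2
    elif [ "$replay_time" -ge "$warn" ]; then
        echo "WARNING: $detail"
        return 1
    fi
    echo "OK: $detail"
}

# Assumed usage on a replica, with the query saved in a file named
# replica_lag.sql (hypothetical) and thresholds of 60s and 300s:
#   psql -XAt -f replica_lag.sql | check_replay_lag 60 300
```

Only replay_time drives the alert here, so an idle primary does not page
anyone. The received_time value is included in the message for context.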

--000000000000dc843a0647f80cf7--