MIME-Version: 1.0
References: 
 <CANzqJaBG-MkG9YTt8pYTnHu+9U5wpwEcWQKzx2aOu85C6Uzn-w@mail.gmail.com>
 <CAHw75vvaeoTDO6796G7O_zamiaFWoi81+2YDrjSh4mvFsnATkQ@mail.gmail.com>
In-Reply-To: 
 <CAHw75vvaeoTDO6796G7O_zamiaFWoi81+2YDrjSh4mvFsnATkQ@mail.gmail.com>
From: Ron Johnson <ronljohnsonjr@gmail.com>
Date: Fri, 9 Jan 2026 12:53:51 -0500
Message-ID: 
 <CANzqJaBwJJj_-hoN1ZA21HLQi4KDdBYvO6FErCK7o61_MPmReg@mail.gmail.com>
Subject: Re: Better way to monitor for failed replication?
To: Pgsql-admin <pgsql-admin@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000eac0420647f8357a"
Archived-At: 
 <https://www.postgresql.org/message-id/CANzqJaBwJJj_-hoN1ZA21HLQi4KDdBYvO6FErCK7o61_MPmReg%40mail.gmail.com>
Precedence: bulk

--000000000000eac0420647f8357a
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Fri, Jan 9, 2026 at 12:42 PM Keith <keith@keithf4.com> wrote:

>
>
> On Fri, Jan 9, 2026 at 10:50 AM Ron Johnson <ronljohnsonjr@gmail.com> wrote:
>
>> Currently, in a bash script, I run this SELECT statement against the
>> Primary server, which is supposed to replicate to multiple servers.  If
>> active == f, I send an alert email.
>>
>> postgres=# SELECT rs.slot_name, rs.active, sr.client_hostname
>> from pg_replication_slots rs
>>     left outer join pg_stat_replication sr on rs.active_pid = sr.pid;
>>   slot_name   | active | client_hostname
>> --------------+--------+-----------------
>>  pgstandby1   | t      | BBOPITCPGS302B
>>  replicate_dr | f      |
>> (2 rows)
>>
>> Is there a better way to check for replication that's supposed to be
>> happening, but isn't (like PG on the replica was stopped for some reason)?
>>
>

>
> Your example only applies if you are using replication slots, correct?
> If you always use them, this is a good metric to have, since an inactive
> slot means WAL builds up on the primary, so I'd definitely keep it.
>

Yes, just replication slots.
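
For reference, a minimal sketch of that slot check as a shell function. This
is not the actual script: it assumes the query is run with psql -XAt (so
rows arrive as slot_name|active|client_hostname), and the function name, the
mail command and the address in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Sketch only: list replication slots that are not active.
# Expects psql -At output on stdin: slot_name|active|client_hostname

inactive_slots() {
    # Print the name of every slot whose "active" column is "f".
    awk -F'|' '$2 == "f" { print $1 }'
}

# Hypothetical usage on the primary (mail command and address are placeholders):
#   down=$(psql -XAt -c "SELECT rs.slot_name, rs.active, sr.client_hostname
#            FROM pg_replication_slots rs
#            LEFT OUTER JOIN pg_stat_replication sr ON rs.active_pid = sr.pid" \
#          | inactive_slots)
#   [ -n "$down" ] && echo "Inactive slots: $down" | mail -s "replication alert" dba@example.com
```

Keeping the parsing separate from the psql call means the function can be
exercised with canned rows before it is pointed at a real primary.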


> As for general replication monitoring, these are the two queries I use:
>
> On the Primary:
>
> SELECT client_addr AS replica
>         , client_hostname AS replica_hostname
>         , client_port AS replica_port
>         , pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes
>         FROM pg_catalog.pg_stat_replication;
>
> This checks byte lag for all active streaming replicas, physical or
> logical. A row count of zero (or NULL) from this metric means all replicas
> are down. You can also monitor for a specific row count if you have a known
> number of replicas.
>
> On any Replica:
>
> SELECT
>        CASE
>        WHEN (pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn()) OR (pg_is_in_recovery() = false) THEN 0
>        ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
>        END
>     AS replay_time
>     ,  CASE
>        WHEN pg_is_in_recovery() = false THEN 0
>        ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER
>        END
>     AS received_time;
>
> This monitors the lag in seconds from the replica. Technically it monitors
> the last time a WAL file was received (received_time) and the last time WAL
> was actually replayed (replay_time). The reason for both is that the
> received time can be a false positive when there is no write activity on
> the primary. If there's always supposed to be write activity, this can be
> another good metric to indicate that something is very wrong. The
> replay_time metric avoids the false positive by only being considered when
> receive is different from replay.
>

I'll integrate this into the lag report.
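
Roughly what I have in mind, as a sketch only. The function names, the
OK/WARN/CRIT wording and the thresholds are my own placeholders, and it
assumes both queries are run with psql -XAt so the output is bare rows.

```shell
#!/usr/bin/env bash
# Sketch only: turn the output of the two queries into report lines.
# Expected replica counts and thresholds are hypothetical, supplied by the caller.

# replica_count_status EXPECTED
# Reads the primary query's psql -At rows on stdin (one row per streaming
# replica) and compares the row count with the expected number of replicas.
replica_count_status() {
    local expected=$1 actual
    actual=$(awk 'NF { n++ } END { print n + 0 }')
    if [ "$actual" -lt "$expected" ]; then
        echo "CRIT: $actual of $expected replicas streaming"
    else
        echo "OK: $actual of $expected replicas streaming"
    fi
}

# lag_status SECONDS WARN CRIT
# Classifies a replay_time value from the replica query. An empty value
# (SQL NULL, e.g. nothing replayed yet) is reported as UNKNOWN.
lag_status() {
    local secs=$1 warn=$2 crit=$3
    if [ -z "$secs" ]; then
        echo "UNKNOWN: no replay timestamp"
    elif [ "$secs" -ge "$crit" ]; then
        echo "CRIT: replay lag ${secs}s"
    elif [ "$secs" -ge "$warn" ]; then
        echo "WARN: replay lag ${secs}s"
    else
        echo "OK: replay lag ${secs}s"
    fi
}
```

For example, lag_status 120 60 300 reports a WARN line, and piping the
primary query's rows into replica_count_status 2 flags a missing replica.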


-- 
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!


--000000000000eac0420647f8357a--