Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1veGga-00B8Hx-2c for pgsql-admin@arkaria.postgresql.org; Fri, 09 Jan 2026 17:54:09 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1veGgY-008DQh-1R for pgsql-admin@arkaria.postgresql.org; Fri, 09 Jan 2026 17:54:07 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1veGgY-008DQZ-06 for pgsql-admin@lists.postgresql.org; Fri, 09 Jan 2026 17:54:06 +0000 Received: from mail-oa1-x36.google.com ([2001:4860:4864:20::36]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1veGgW-005W4q-1P for pgsql-admin@lists.postgresql.org; Fri, 09 Jan 2026 17:54:06 +0000 Received: by mail-oa1-x36.google.com with SMTP id 586e51a60fabf-3f0cbfae787so3326949fac.3 for ; Fri, 09 Jan 2026 09:54:04 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1767981242; x=1768586042; darn=lists.postgresql.org; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :from:to:cc:subject:date:message-id:reply-to; bh=v8OGzQuxNG9bqiVglsV5PvdNgNotLs5RnxmnUiZyD7Y=; b=Hkipdbxz539POIZUu/zOH9v9FiiAKE3df3U/asoZuRXJLcvyBhMXRj+MpYR7oiDSXT sfDieMiLz3EmrBoIYAKK4H26ptvR08eGoaD2t8wy/2TwmLkYSQZCH8E0eTinFa1diFaL hG/vIImNrwPY278kHS2IEXDNw73jMpwJFyc0DPUIoOBHI5eyynfaFbEp8gaW32qTQkqA zF3jmlTgTrKjBIKoRbxIyEodXVHAO7Bk9HiNHd0XahxypVxO5yw+SFWpuRhMZvtww1YQ MxArGZBGudTiqU/PDsP80XzTguOh7xh7mbZt9q+a4t6/dsJfLUmtfKcG8Ks/UnVDkXoU iNdQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1767981242; x=1768586042; h=to:subject:message-id:date:from:in-reply-to:references:mime-version :x-gm-gg:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=v8OGzQuxNG9bqiVglsV5PvdNgNotLs5RnxmnUiZyD7Y=; b=EMHbFX9WrORE20F6dcXZeMwmCxAxllKtF0oM+JdVwSQEBbXLJ/5PM6qdipJPXkeTl0 PHaTmaRz42WhDqwhuYlVVsA8pK8AdPtGkhLvb1AcdFbeje/XOOpmZipjA3KPKFasbneR Vvuw6PVckbF+Qs1qMI1sykCFHK1gVl9vuZ0pzn37123UldiyjFpkMsjPexVAhds8pAJ0 rS1J6sMRpJeHTW+QM8lCylmP5gCaJfxZg59Se3hVD3z356S3YTuYkMGf7XXQcLBhOMP9 RH8f0fCCHF++XEeYms+oEBKDGYzavPJbkhA8+yTX9XcS7ZdFPkgEpVlZbWmwY1o2sLT3 +eRw== X-Gm-Message-State: AOJu0Yyy+lv/rIBLNoVnKRQlf5DuwnUCqlS40WefbBWTRf+H/+1ViXAb Me6/cR7Fp8l4JYiH46zQb9j3yPRaeRNYr80KunDgeuJmxC1BGQJZJPwzTpnJCRxG5lRxwoQRSIA u9oWtwjWbIVBoykfvZj8uptobgZMT8kTqUw== X-Gm-Gg: AY/fxX6hygNNnGiDBTYcp5zwXaGZJmRqjceoNb8R07+TGiKIDzwRLbFTWfek8M9Z9yb 6mZX95i1ySatMETXdi4S5gUt+/CFasZf+DzChuIfb2NIlCybPHvrvmzawQ7pWOEh+OkDyGhsh6h Ts9M+swQCz6WpGw7Q+NPwlYnsYCsU3ZgoVtITn06OQur7pSAf97kaXTD2tmIZGw99nYgNjd76y7 8m/yLT5I4Dgblwb7LyO5UErhNHo6sphZKYVBQPaW1E99JD6jaSInWRFU9a3L9F3QkKpCCJ2LngA nVKPNqM= X-Google-Smtp-Source: AGHT+IFd65s4SyjbPhg2kwZ/K6hJj8YV5fstdr9xRw6a9WWGfUl/1NeWBPnVd7QRkqgE2pzblWB8zZ7SFbz2WINJNxU= X-Received: by 2002:a05:6820:601:b0:65f:6bcd:e40 with SMTP id 006d021491bc7-65f6bcd1262mr815755eaf.74.1767981242300; Fri, 09 Jan 2026 09:54:02 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Ron Johnson Date: Fri, 9 Jan 2026 12:53:51 -0500 X-Gm-Features: AZwV_QgncUpA9tdUO3IPl1yHwpH0eXp7WM-u4RvcMW4QS2lant2Y3jBJOVSM1EU Message-ID: Subject: Re: Better way to monitor for failed replication? To: Pgsql-admin Content-Type: multipart/alternative; boundary="000000000000eac0420647f8357a" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000eac0420647f8357a Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Fri, Jan 9, 2026 at 12:42=E2=80=AFPM Keith wrote: > > > On Fri, Jan 9, 2026 at 10:50=E2=80=AFAM Ron Johnson > wrote: > >> Currently, in a bash script, I run this SELECT statement against the >> Primary server which is supposed to replicate to multiple servers. If >> active =3D=3D f, I send an alter email. >> >> postgres=3D# SELECT rs.slot_name, rs.active, sr.client_hostname >> from pg_replication_slots rs >> left outer join pg_stat_replication sr on rs.active_pid =3D sr.pid; >> slot_name | active | client_hostname >> --------------+--------+----------------- >> pgstandby1 | t | BBOPITCPGS302B >> replicate_dr | f | >> (2 rows) >> >> Is there a better way to check for replication that's supposed to be >> happening, but isn't (like PG on the replica was stopped for some reason= )? >> > > > Your example only takes into account if you are using replication slots, > correct? If you're always using those, this is definitely a good metric t= o > have since the slot going down means WAL buildup, so I'd definitely keep = it. > Yes, just replication slots. > As for general replication monitoring, these have been the two queries I > use > > On the Primary: > > SELECT client_addr AS replica > , client_hostname AS replica_hostname > , client_port AS replica_port > , pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes > FROM pg_catalog.pg_stat_replication; > > This checks for byte-lag for all active streaming replicas, physical or > logical. A count of zero or NULL from this metric means all replicas are > down. Can monitor a specific count if you have a known number of replicas= . > > On any Replica: > > SELECT > CASE > WHEN (pg_last_wal_receive_lsn() =3D pg_last_wal_replay_lsn()) OR > (pg_is_in_recovery() =3D false) THEN 0 > ELSE EXTRACT (EPOCH FROM clock_timestamp() - > pg_last_xact_replay_timestamp())::INTEGER > END > AS replay_time > , CASE > WHEN pg_is_in_recovery() =3D false THEN 0 > ELSE EXTRACT (EPOCH FROM clock_timestamp() - > pg_last_xact_replay_timestamp())::INTEGER > END > AS received_time; > > This monitors the lag in seconds from the replica. Technically it monitor= s > the last time a WAL file was received (received_time) and the last time W= AL > was actually replayed (replay_time). The reason for both is that the > received time can be a false positive when there is no write activity on > the primary. If there's always supposed to be write activity, this can be= a > another good metric to indicate that something is very wrong. The > replay_time metric avoids the false positive by only being considered whe= n > receive is different than replay. > I'll integrate this into the lag report. --=20 Death to , and butter sauce. Don't boil me, I'm still alive. lobster! --000000000000eac0420647f8357a Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
On Fri, Jan 9, 2026 at 12:42=E2=80=AFPM K= eith <keith@keithf4.com> wro= te:


On Fri, Jan 9, 2026 at 10:50=E2=80=AFAM Ron Johnson <ronljohnsonjr@gmail.= com> wrote:
Currently, in a bash script, I run this SELECT sta= tement against the Primary server which is supposed to=C2=A0replicate=C2=A0= to multiple servers.=C2=A0 If active =3D=3D f, I send an alter email.
=

postgres=3D# SELECT rs.slot_na= me, rs.active, sr.client_hostname
from pg_replication_slots rs
=C2=A0 =C2=A0 left outer join pg_stat_rep= lication sr on rs.active_pid =3D sr.pid;
=C2=A0 slot_name =C2=A0 | activ= e | client_hostname
--------------+--------+-----------------
=C2=A0= pgstandby1 =C2=A0 | t =C2=A0 =C2=A0 =C2=A0| BBOPITCPGS302B
=C2=A0replica= te_dr | f =C2=A0 =C2=A0 =C2=A0|
(2 rows)

Is there a better way to check=C2=A0for replication that's supposed t= o be happening, but isn't (like PG on the replica was stopped for some = reason)?
=C2=A0=

Your example = only takes into account if you are using replication slots, correct? If you= 're always using those, this is definitely a good metric to have since = the slot going down means WAL buildup, so I'd definitely keep it.
=

Yes, just replication sl= ots.
=C2=A0
As for= general replication monitoring, these have been the two queries I use

On the Primary:

SELECT client_addr AS replica
=C2=A0 =C2=A0 = =C2=A0 =C2=A0 , client_hostname AS replica_hostname
=C2=A0 =C2=A0 =C2=A0= =C2=A0 , client_port AS replica_port
=C2=A0 =C2=A0 =C2=A0 =C2=A0 , pg_w= al_lsn_diff(sent_lsn, replay_lsn) AS bytes
=C2=A0 =C2=A0 =C2=A0 =C2=A0 F= ROM pg_catalog.pg_stat_replication;=C2=A0

T= his checks for byte-lag for all active streaming replicas, physical or logi= cal. A count of zero or NULL from this metric means all replicas are down. = Can monitor a specific count if you have a known number of replicas.
<= div>
On any Replica:

SELECT
=C2=A0 =C2=A0 =C2=A0 =C2=A0CASE
=C2=A0 = =C2=A0 =C2=A0 =C2=A0WHEN (pg_last_wal_receive_lsn() =3D pg_last_wal_replay_= lsn()) OR (pg_is_in_recovery() =3D false) THEN 0
=C2=A0 =C2=A0 =C2=A0 = =C2=A0ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_time= stamp())::INTEGER
=C2=A0 =C2=A0 =C2=A0 =C2=A0END
=C2=A0 =C2=A0 AS rep= lay_time
=C2=A0 =C2=A0 , =C2=A0CASE
=C2=A0 =C2=A0 =C2=A0 =C2=A0WHEN p= g_is_in_recovery() =3D false THEN 0
=C2=A0 =C2=A0 =C2=A0 =C2=A0ELSE EXTR= ACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEG= ER
=C2=A0 =C2=A0 =C2=A0 =C2=A0END
=C2=A0 =C2=A0 AS received_time;

This monitors the lag in seconds from the repl= ica. Technically it monitors the last time a WAL file was received (receive= d_time) and the last time WAL was actually replayed (replay_time). The reas= on for both is that the received time can be a false positive when there is= no write activity on the primary. If there's always supposed to be wri= te activity, this can be a another good metric to indicate that something i= s very wrong. The replay_time metric avoids the false positive by only bein= g considered when receive is different than replay.
=

I'll integrate this into the lag repor= t.=C2=A0


--
Death to <Redacted>, and butter sauce.Don't boil me, I'm still alive.
<Redacted> lobs= ter!
--000000000000eac0420647f8357a--