Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1veGVT-00B5z5-1g for pgsql-admin@arkaria.postgresql.org; Fri, 09 Jan 2026 17:42:40 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1veGVS-0089xV-0M for pgsql-admin@arkaria.postgresql.org; Fri, 09 Jan 2026 17:42:38 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1veGVR-0089xM-1q for pgsql-admin@lists.postgresql.org; Fri, 09 Jan 2026 17:42:38 +0000 Received: from mail-oi1-x230.google.com ([2607:f8b0:4864:20::230]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1veGVP-005VRS-1e for pgsql-admin@lists.postgresql.org; Fri, 09 Jan 2026 17:42:37 +0000 Received: by mail-oi1-x230.google.com with SMTP id 5614622812f47-450b7ee2d4bso621627b6e.2 for ; Fri, 09 Jan 2026 09:42:35 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=foresightweb-net.20230601.gappssmtp.com; s=20230601; t=1767980553; x=1768585353; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=DabXs4SKHv0Khv5pRUBA2J25a67sV4jiHSRyP5csRDQ=; b=b4vHPDl27GbNHoB/KlGmoVp33tuiwODZp4YMO8hzavrv9YwWaO/zb4hhZeuYVPWKkO 43gOpumtu2AUz6d5migTlYmLPE+fZlBoF5P54Cg65BfrFiaiR0GvgouCXY2HVTuTRnAn eiVSS1uwPNvyrzzEVg7Lpnt3HB3RqrqCXg1pZb+G3lNTcIsyPwtY4TRFgP8sqiop0h/3 t7ZDeXLZ843Bz8RDTK2eSBzXGKGOch48OMojgfECbS9pWsHrMEFGbUCWFZWw59dQJY+k vVtMlJztYr8aQSWCk8p4N6hOjZ8VmUXW8S8CalWvynoKf9W186OtgdLgkZJa9lTNwlCG AWHw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1767980553; x=1768585353; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=DabXs4SKHv0Khv5pRUBA2J25a67sV4jiHSRyP5csRDQ=; b=BxRAdKaMqcc7gocjg8CDniOT979is88XPb44rd8aC5HmJwgXDuPKPlfGMkDnz0Qwog NNumGRetlOuUErs+kFBvGEATobD6JDMAUkTJ9Kb9SWYr2/yN6O6vUxPocZ0E6T5SEOhC PAfVm3VQPgJnyH5ZY6si78CUZm8InGdI3n+tOZlaX7nGnJeR6XIEqX7FUyVFA0CQB9Zb 1LM8bNH4ZNDQDXiqPnTA3YNoiK4PAXvdXsJVj0mWbJbPbcktNfPk2YJmDpsxJ/0FQfpn ki4Kq5U/Q4suf920bGPHVVtm00gTewqsO5PP7x7FilM9lQNsXDq0ThdwWPebk9Z+0TnQ rC6g== X-Gm-Message-State: AOJu0YxGKT00n+aTIAKaTiFFEUa2aGOdwJuXugwYHF8jGb6nweoiepsU SxgFNG5K3ADX/eqNIGBV5VYWCLdROIJXs+omn+E45kA7uD1IBKw60nnRE83HdY9kjzKucS7Rabg fRZF3yoSpc6Fps9V0dHdAp+HkmhgNFh0wDL8BkSqLTA== X-Gm-Gg: AY/fxX58lc5m8xDl9yA46r/PJtRTuXksArxhRgyws8cKfOfgeZPsf0gnzNg9zNuWoaj zs5l1InLEGA6IU+CJBDgddRE2uNRvkemR0eCuLfmH0Sz6TqXTIxPH8KjXh1d0CEBf/juZxbgQzF K3xZmhIJ1IAubCdCaxH7ElWRg6d82LXkiyYVo2st9UMycPYggogC/WfhTEEOQSTckKM2b98z7Gg cJ6eo5Ewf20EbJS43UVAtkGXze5V+lSOyndlVscuk3Vrxjou+e+ULC1QkArhIN5t3vUn8sB/60f PXXR0j4llrBm+q+BhR3DQ/9Sm86ffi1wltblgLp0wDbrjQfn8k193X1Rlg== X-Google-Smtp-Source: AGHT+IG2CwnbIOi8b++mOp/nzdirOzE1yKmTeGJbjvTqiUU9mGO86pDaAJpyduLauMGUlba1vhNixztjThe/FVjr8/o= X-Received: by 2002:a05:6808:2511:b0:453:79a6:fbf7 with SMTP id 5614622812f47-45a6bd5f398mr4201942b6e.1.1767980553499; Fri, 09 Jan 2026 09:42:33 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Keith Date: Fri, 9 Jan 2026 12:41:57 -0500 X-Gm-Features: AZwV_QiqDKqie4a_yhx0tmtpMs_hr7kgC7Ct2qc_SSEHLK4C3Ar1BsT968ZKiYk Message-ID: Subject: Re: Better way to monitor for failed replication? To: Ron Johnson Cc: Pgsql-admin Content-Type: multipart/alternative; boundary="000000000000dc843a0647f80cf7" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000dc843a0647f80cf7 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Fri, Jan 9, 2026 at 10:50=E2=80=AFAM Ron Johnson wrote: > Currently, in a bash script, I run this SELECT statement against the > Primary server which is supposed to replicate to multiple servers. If > active =3D=3D f, I send an alter email. > > postgres=3D# SELECT rs.slot_name, rs.active, sr.client_hostname > from pg_replication_slots rs > left outer join pg_stat_replication sr on rs.active_pid =3D sr.pid; > slot_name | active | client_hostname > --------------+--------+----------------- > pgstandby1 | t | BBOPITCPGS302B > replicate_dr | f | > (2 rows) > > Is there a better way to check for replication that's supposed to be > happening, but isn't (like PG on the replica was stopped for some reason)= ? > > -- > Death to , and butter sauce. > Don't boil me, I'm still alive. > lobster! > Your example only takes into account if you are using replication slots, correct? If you're always using those, this is definitely a good metric to have since the slot going down means WAL buildup, so I'd definitely keep it. As for general replication monitoring, these have been the two queries I use On the Primary: SELECT client_addr AS replica , client_hostname AS replica_hostname , client_port AS replica_port , pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes FROM pg_catalog.pg_stat_replication; This checks for byte-lag for all active streaming replicas, physical or logical. A count of zero or NULL from this metric means all replicas are down. Can monitor a specific count if you have a known number of replicas. On any Replica: SELECT CASE WHEN (pg_last_wal_receive_lsn() =3D pg_last_wal_replay_lsn()) OR (pg_is_in_recovery() =3D false) THEN 0 ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER END AS replay_time , CASE WHEN pg_is_in_recovery() =3D false THEN 0 ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_replay_timestamp())::INTEGER END AS received_time; This monitors the lag in seconds from the replica. Technically it monitors the last time a WAL file was received (received_time) and the last time WAL was actually replayed (replay_time). The reason for both is that the received time can be a false positive when there is no write activity on the primary. If there's always supposed to be write activity, this can be a another good metric to indicate that something is very wrong. The replay_time metric avoids the false positive by only being considered when receive is different than replay. This metric also works when you're doing WAL-replay replication instead of streaming. --000000000000dc843a0647f80cf7 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


On Fri, Jan 9, 2026 = at 10:50=E2=80=AFAM Ron Johnson <ronljohnsonjr@gmail.com> wrote:
Currentl= y, in a bash script, I run this SELECT statement against the Primary server= which is supposed to=C2=A0replicate=C2=A0to multiple servers.=C2=A0 If act= ive =3D=3D f, I send an alter email.

postgres=3D# SELECT rs.slot_name, rs.active, sr.client_hostname=
from pg_replication_slots rs =C2=A0 =C2=A0 left outer join pg_stat_replication sr on rs.active_pid =3D = sr.pid;
=C2=A0 slot_name =C2=A0 | active | client_hostname
---------= -----+--------+-----------------
=C2=A0pgstandby1 =C2=A0 | t =C2=A0 =C2= =A0 =C2=A0| BBOPITCPGS302B
=C2=A0replicate_dr | f =C2=A0 =C2=A0 =C2=A0| =
(2 rows)

Is there a better way to check= =C2=A0for replication that's supposed to be happening, but isn't (l= ike PG on the replica was stopped for some reason)?

--
Death to <Redacted>, and butter sa= uce.
Don't boil me, I'm still alive.
<Redacted&= gt; lobster!

Your example only takes into account if you are using replication slo= ts, correct? If you're always using those, this is definitely a good me= tric to have since the slot going down means WAL buildup, so I'd defini= tely keep it. As for general replication monitoring, these have been the tw= o queries I use

On the Primary:

SELECT client_addr AS replica<= br>=C2=A0 =C2=A0 =C2=A0 =C2=A0 , client_hostname AS replica_hostname
=C2= =A0 =C2=A0 =C2=A0 =C2=A0 , client_port AS replica_port
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 , pg_wal_lsn_diff(sent_lsn, replay_lsn) AS bytes
=C2=A0 =C2= =A0 =C2=A0 =C2=A0 FROM pg_catalog.pg_stat_replication;=C2=A0

This checks for byte-lag for all active streaming replica= s, physical or logical. A count of zero or NULL from this metric means all = replicas are down. Can monitor a specific count if you have a known number = of replicas.

On any Replica:

<= div>SELECT
=C2=A0 =C2=A0 =C2=A0 = =C2=A0CASE
=C2=A0 =C2=A0 =C2=A0 =C2=A0WHEN (pg_last_wal_receive_lsn() = =3D pg_last_wal_replay_lsn()) OR (pg_is_in_recovery() =3D false) THEN 0
= =C2=A0 =C2=A0 =C2=A0 =C2=A0ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_= last_xact_replay_timestamp())::INTEGER
=C2=A0 =C2=A0 =C2=A0 =C2=A0END=C2=A0 =C2=A0 AS replay_time
=C2=A0 =C2=A0 , =C2=A0CASE
=C2=A0 =C2= =A0 =C2=A0 =C2=A0WHEN pg_is_in_recovery() =3D false THEN 0
=C2=A0 =C2=A0= =C2=A0 =C2=A0ELSE EXTRACT (EPOCH FROM clock_timestamp() - pg_last_xact_rep= lay_timestamp())::INTEGER
=C2=A0 =C2=A0 =C2=A0 =C2=A0END
=C2=A0 =C2= =A0 AS received_time;

This monitors the lag= in seconds from the replica. Technically it monitors the last time a WAL f= ile was received (received_time) and the last time WAL was actually replaye= d (replay_time). The reason for both is that the received time can be a fal= se positive when there is no write activity on the primary. If there's = always supposed to be write activity, this can be a another good metric to = indicate that something is very wrong. The replay_time metric avoids the fa= lse positive by only being considered when receive is different than replay= . This metric also works when you're doing WAL-replay replication inste= ad of streaming.
--000000000000dc843a0647f80cf7--