Re: pg_stat_replication.*_lag sometimes shows NULL during active replication

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Fujii Masao <[email protected]>
To: Shinya Kato <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
Date: Fri, 13 Mar 2026 00:27:32 +0900
Message-ID: <CAHGQGwGPg20Rw2PydBDXiKHgn55s--E14-qRy7t3M+DHreCJww@mail.gmail.com> (raw)
In-Reply-To: <CAOzEurRF+OWcMZpfE=NV_Wcm6CFFGOnuxC6L9WWCjOMN0_eZMQ@mail.gmail.com>
References: <CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com>
	<CAHGQGwE=kyQ+YnGPn8zpZ959+3ywg8OR_Nu__uXxxuE0E+Y_Zg@mail.gmail.com>
	<CAOzEurRGiGE2Dfe+ySpb=+93ku=7ZC6RgAbHtLC6Xsq3g2XexA@mail.gmail.com>
	<CAHGQGwH2h_R7FWPvEs3+NWLwHZoj9r96tUyRKi5haqxMc6FXiQ@mail.gmail.com>
	<CAOzEurQiP3uebd1GMiC1Dzf5VJwF4ZBEpJ6QYQFE6Y+rVjxqNA@mail.gmail.com>
	<CAHGQGwEmMBBAE0RG-R3_LacfT4fbB55qGE6n9O5mNwrqvbNBtw@mail.gmail.com>
	<CAOzEurRF+OWcMZpfE=NV_Wcm6CFFGOnuxC6L9WWCjOMN0_eZMQ@mail.gmail.com>

On Wed, Mar 11, 2026 at 11:39 AM Shinya Kato <[email protected]> wrote:
>
> On Tue, Mar 10, 2026 at 10:54 AM Fujii Masao <[email protected]> wrote:
> > Even with your latest patch, if we remove fullyAppliedLastTime, and set
> > clearLagTimes to true when applyPtr == sentPtr && noLagSamples &&
> > positionsUnchanged,
> > wouldn't the time for the lag to become NULL be almost the same as
> > wal_receiver_status_interval?
> >
> > The documentation doesn't clearly specify how long it should take for
> > the lag to become NULL, so doubling that time might be acceptable.
> > However, if we can keep it roughly the same without much complexity,
> > I think that would be preferable.
> >
> > Thought?
>
> Thank you for the suggestion. I tested this by removing
> fullyAppliedLastTime, but even with synchronous replication, NULL
> still appears. Here is why:
>
> - Reply 1 (flush notification): positions = X. Lag samples are
> consumed with real values, so noLagSamples = false. clearLagTimes is
> not set, and prevPtrs = X is saved.
>
> - Reply 2 (force_reply): positions = X again. Here, noLagSamples =
> true and positionsUnchanged = true. Since applyPtr == sentPtr,
> clearLagTimes is set to true, resulting in a NULL value.
>
> Therefore, I believe fullyAppliedLastTime is still necessary to ensure
> that the previous reply also contained no lag samples.

Thanks for testing and for the clarification! You're right.

However, if we apply this change, the time required for the lag information to
be reset would effectively double. I start wondering if that's really
acceptable, especially for back branches. Although the docs doesn't clearly
specify this timing, doubling it could affect systems that monitor
replication lag, for example. It might still be reasonable to apply
such a change in master, though.

On further thought, the root cause seems to be that walreceiver can send
two consecutive status reply messages with identical WAL locations even
when wal_receiver_status_interval has not yet elapsed. Addressing that
behavior directly might resolve the issue you reported. I've attached a PoC
patch that does this. Thought?

Regards,

-- 
Fujii Masao


Attachments:

  [application/octet-stream] v4-0001-Avoid-sending-duplicate-WAL-locations-in-standby-.patch (9.1K, 2-v4-0001-Avoid-sending-duplicate-WAL-locations-in-standby-.patch)
  download | inline diff:
From c231d1b129d1e8a74bc47badcc9ee3e8d4e04e16 Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Thu, 12 Mar 2026 20:26:01 +0900
Subject: [PATCH v4] Avoid sending duplicate WAL locations in standby status
 replies.

Previously, when the startup process applied WAL and requested walreceiver
to send an apply notification to the primary, walreceiver sent a status reply
unconditionally, even if the WAL locations had not advanced since
the previous update.

As a result, the standby could send two consecutive status reply messages
with identical WAL locations even though wal_receiver_status_interval had
not yet elapsed. This could unexpectedly reset the reported replication lag,
making it difficult for users to monitor lag. The second message was also
unnecessary because it reported no progress.

This commit updates walreceiver to send a reply only when the apply location
has advanced since the last status update, even when the startup process
requests a notification.
---
 src/backend/replication/walreceiver.c | 70 ++++++++++++++++-----------
 src/backend/replication/walsender.c   |  2 +-
 src/include/replication/walreceiver.h |  4 +-
 3 files changed, 44 insertions(+), 32 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index fabe3c73034..f5d5379edc7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -143,7 +143,7 @@ static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr,
 							TimeLineID tli);
 static void XLogWalRcvFlush(bool dying, TimeLineID tli);
 static void XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, bool replyApply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
 static void WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now);
@@ -417,7 +417,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 				WalRcvComputeNextWakeup(i, now);
 
 			/* Send initial reply/feedback messages. */
-			XLogWalRcvSendReply(true, false);
+			XLogWalRcvSendReply(true, false, false);
 			XLogWalRcvSendHSFeedback(true);
 
 			/* Loop until end-of-streaming or error */
@@ -493,7 +493,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 					}
 
 					/* Let the primary know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, false);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -539,7 +539,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 					ResetLatch(MyLatch);
 					CHECK_FOR_INTERRUPTS();
 
-					if (walrcv->force_reply)
+					if (walrcv->reply_apply)
 					{
 						/*
 						 * The recovery process has asked us to send apply
@@ -547,9 +547,9 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
 						 */
-						walrcv->force_reply = false;
+						walrcv->reply_apply = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(false, false, true);
 					}
 				}
 				if (rc & WL_TIMEOUT)
@@ -595,7 +595,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 						wakeup[WALRCV_WAKEUP_PING] = TIMESTAMP_INFINITY;
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, false);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -886,7 +886,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len, TimeLineID tli)
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, false);
 				break;
 			}
 		default:
@@ -1053,7 +1053,7 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		/* Also let the primary know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, false);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1107,24 +1107,33 @@ XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli)
 }
 
 /*
- * Send reply message to primary, indicating our current WAL locations, oldest
- * xmin and the current time.
+ * Send reply message to primary, indicating our current WAL locations and
+ * time.
  *
- * If 'force' is not set, the message is only sent if enough time has
- * passed since last status update to reach wal_receiver_status_interval.
- * If wal_receiver_status_interval is disabled altogether and 'force' is
- * false, this is a no-op.
+ * The message is sent if 'force' is set, if enough time has passed since the
+ * last update to reach wal_receiver_status_interval, or if WAL locations have
+ * advanced since the previous status update. If wal_receiver_status_interval
+ * is disabled and 'force' is false, this function does nothing. Set 'force' to
+ * send the message unconditionally.
+ *
+ * Set 'replyApply' when the apply location is expected to have advanced from the
+ * previous update (for example, when the startup process requests an apply
+ * notification to be sent to the primary). In that case, the write, flush, and
+ * apply locations are compared to determine whether WAL has advanced.
+ * Otherwise the apply location is assumed unchanged and is not checked,
+ * so only the write and flush locations are considered.
  *
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbeats, when approaching
  * wal_receiver_timeout.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, bool replyApply)
 {
 	static XLogRecPtr writePtr = InvalidXLogRecPtr;
 	static XLogRecPtr flushPtr = InvalidXLogRecPtr;
-	XLogRecPtr	applyPtr;
+	static XLogRecPtr applyPtr = InvalidXLogRecPtr;
+	XLogRecPtr	latestApplyPtr = InvalidXLogRecPtr;
 	TimestampTz now;
 
 	/*
@@ -1140,17 +1149,19 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/*
 	 * We can compare the write and flush positions to the last message we
 	 * sent without taking any lock, but the apply position requires a spin
-	 * lock, so we don't check that unless something else has changed or 10
-	 * seconds have passed.  This means that the apply WAL location will
-	 * appear, from the primary's point of view, to lag slightly, but since
-	 * this is only for reporting purposes and only on idle systems, that's
-	 * probably OK.
+	 * lock, so we don't check that unless it is expected to advance since the
+	 * previsou update, i.e., when 'replyApply' is true.
 	 */
-	if (!force
-		&& writePtr == LogstreamResult.Write
-		&& flushPtr == LogstreamResult.Flush
-		&& now < wakeup[WALRCV_WAKEUP_REPLY])
-		return;
+	if (!force && now < wakeup[WALRCV_WAKEUP_REPLY])
+	{
+		if (replyApply)
+			latestApplyPtr = GetXLogReplayRecPtr(NULL);
+
+		if (writePtr == LogstreamResult.Write
+			&& flushPtr == LogstreamResult.Flush
+			&& (!replyApply || applyPtr == latestApplyPtr))
+			return;
+	}
 
 	/* Make sure we wake up when it's time to send another reply. */
 	WalRcvComputeNextWakeup(WALRCV_WAKEUP_REPLY, now);
@@ -1158,7 +1169,8 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/* Construct a new message */
 	writePtr = LogstreamResult.Write;
 	flushPtr = LogstreamResult.Flush;
-	applyPtr = GetXLogReplayRecPtr(NULL);
+	applyPtr = (latestApplyPtr == InvalidXLogRecPtr) ?
+		GetXLogReplayRecPtr(NULL) : latestApplyPtr;
 
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, PqReplMsg_StandbyStatusUpdate);
@@ -1382,7 +1394,7 @@ WalRcvForceReply(void)
 {
 	ProcNumber	procno;
 
-	WalRcv->force_reply = true;
+	WalRcv->reply_apply = true;
 	/* fetching the proc number is probably atomic, but don't rely on it */
 	SpinLockAcquire(&WalRcv->mutex);
 	procno = WalRcv->procno;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 79fc192b171..e672787ec04 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2502,7 +2502,7 @@ ProcessStandbyReplyMessage(void)
 	 * until more WAL traffic arrives.
 	 */
 	clearLagTimes = false;
-	if (applyPtr == sentPtr)
+	if (applyPtr == sentPtr && flushPtr == sentPtr)
 	{
 		if (fullyAppliedLastTime)
 			clearLagTimes = true;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9b9bd916314..024ebcf4f37 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -156,11 +156,11 @@ typedef struct
 	pg_atomic_uint64 writtenUpto;
 
 	/*
-	 * force walreceiver reply?  This doesn't need to be locked; memory
+	 * request walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
 	 * store semantics, so use sig_atomic_t.
 	 */
-	sig_atomic_t force_reply;	/* used as a bool */
+	sig_atomic_t reply_apply;	/* used as a bool */
 } WalRcvData;
 
 extern PGDLLIMPORT WalRcvData *WalRcv;
-- 
2.51.2

view thread (21+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
  In-Reply-To: <CAHGQGwGPg20Rw2PydBDXiKHgn55s--E14-qRy7t3M+DHreCJww@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox