pg_stat_replication.*_lag sometimes shows NULL during active replication

* pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-02-24 06:53  Shinya Kato <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Shinya Kato @ 2026-02-24 06:53 UTC (permalink / raw)
  To: PostgreSQL Hackers <[email protected]>

Hi hackers,

I have noticed that pg_stat_replication.*_lag sometimes shows NULL
when inserting a record per second for health checking. This happens
when the startup process replays WAL fast enough before the
walreceiver sends its flush notification to the walsender.

Here is the sequence that triggers the issue: (See normal.svg and
error.svg for diagrams of the normal and problematic cases.)

1. The walreceiver receives, writes, and flushes WAL, then wakes the
startup process via WakeupRecovery().

2. The startup process replays all available WAL quickly, then calls
WalRcvForceReply() to set force_reply = true and wakes the
walreceiver.

3. The walreceiver sends a flush notification to the walsender
(XLogWalRcvSendReply() in XLogWalRcvFlush()). Since the startup has
already replayed the WAL by this point, this message reports the
incremented applyPtr, which equals sentPtr. The walsender processes
this message, consuming the LagTracker samples and setting
fullyAppliedLastTime = true.

4. In the next loop iteration, the walreceiver sees force_reply = true
and sends another reply with the same positions. The walsender sees
applyPtr == sentPtr for the second consecutive time and sets
clearLagTimes = true. Since the LagTracker samples were already
consumed by step 3, all lag values are -1. With clearLagTimes = true,
these -1 values are written to walsnd->*Lag, causing
pg_stat_replication to show NULL.

The comment in ProcessStandbyReplyMessage() says:

     * If the standby reports that it has fully replayed the WAL in two
     * consecutive reply messages, then the second such message must result
     * from wal_receiver_status_interval expiring on the standby.

But as shown above, the second message can also come from
WalRcvForceReply(), violating this assumption.
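
To make steps 3 and 4 concrete, here is a minimal toy model of the clearing
heuristic as it behaves before the patch. It is only a sketch, not the
walsender code: the WalSenderModel struct, the model_* function names, and
the single combined lag value (instead of separate write/flush/apply lags)
are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;               /* stand-in for XLogRecPtr */

/* Hypothetical, heavily simplified view of the walsender's lag state. */
typedef struct WalSenderModel
{
    Lsn   sentPtr;                  /* last WAL position sent to the standby */
    bool  samplePending;            /* true while an unread lag sample exists */
    bool  fullyAppliedLastTime;     /* previous reply had applyPtr == sentPtr */
    long  shownLag;                 /* value exposed to the view; -1 means NULL */
} WalSenderModel;

/* Sending WAL up to 'lsn' records one lag sample. */
static void
model_send_wal(WalSenderModel *ws, Lsn lsn)
{
    ws->sentPtr = lsn;
    ws->samplePending = true;
}

/*
 * Process one standby reply with the pre-patch rule and return
 * clearLagTimes.  'measuredLag' is the lag the pending sample would yield.
 */
static bool
model_process_reply(WalSenderModel *ws, Lsn applyPtr, long measuredLag)
{
    long  lag = -1;
    bool  clearLagTimes = false;

    /* Reading the tracker consumes the sample; later reads see -1. */
    if (ws->samplePending && applyPtr >= ws->sentPtr)
    {
        lag = measuredLag;
        ws->samplePending = false;
    }

    if (applyPtr == ws->sentPtr)
    {
        if (ws->fullyAppliedLastTime)
            clearLagTimes = true;
        ws->fullyAppliedLastTime = true;
    }
    else
        ws->fullyAppliedLastTime = false;

    /* A -1 lag is stored only when clearLagTimes is set. */
    if (lag != -1 || clearLagTimes)
        ws->shownLag = lag;

    return clearLagTimes;
}
```

Feeding this model one model_send_wal() followed by two replies that both
report applyPtr == sentPtr reproduces the problem: the first reply stores the
measured lag, and the second reply overwrites it with -1.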

The attached patch fixes this by adding a check that all lag values
are -1 to the clearLagTimes condition. This ensures that clearLagTimes
only triggers when there are truly no new lag samples in two
consecutive messages (i.e., the system is genuinely idle), and not
when the samples were simply consumed by a preceding message in a
burst of replies.

Regards,

-- 
Best regards,
Shinya Kato
NTT OSS Center

Attachments:

  [application/octet-stream] v1-0001-Fix-pg_stat_replication.-_lag-showing-NULL-during.patch (3.3K, 2-v1-0001-Fix-pg_stat_replication.-_lag-showing-NULL-during.patch)
From 67eb950123b1bab1f1c3db5ba0f88ce1737b6574 Mon Sep 17 00:00:00 2001
From: Shinya Kato <[email protected]>
Date: Tue, 24 Feb 2026 15:45:04 +0900
Subject: [PATCH v1] Fix pg_stat_replication.*_lag showing NULL during active
 replication

When the startup process replays WAL quickly, the walreceiver's flush
notification and the subsequent force_reply message can both report
applyPtr == sentPtr in quick succession.  The clearLagTimes logic
assumed that two consecutive fully-applied messages meant the
wal_receiver_status_interval had expired, but this assumption is
violated when the second message comes from WalRcvForceReply().  In
that case, the LagTracker samples were already consumed by the first
message, so all lag values are -1; with clearLagTimes = true, these
-1 values were written to walsnd->*Lag, causing pg_stat_replication
to show NULL.

Fix this by also requiring that all lag values are -1 (no new
samples) in the clearLagTimes condition.  This ensures clearLagTimes
only triggers when the system is genuinely idle across two
consecutive messages, not when samples were consumed by a preceding
message in a burst of replies.

Author: Shinya Kato <[email protected]>
Reviewed-by:
Discussion: https://postgr.es/m/
---
 src/backend/replication/walsender.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2cde8ebc729..5c7bd0a13ad 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2493,15 +2493,25 @@ ProcessStandbyReplyMessage(void)
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);

 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL and there are
+	 * no new lag samples in two consecutive reply messages, then those
+	 * messages must result from wal_receiver_status_interval expiring on the
+	 * standby.  This is a convenient time to forget the lag times measured
+	 * when it last wrote/flushed/applied a WAL record, to avoid displaying
+	 * stale lag data until more WAL traffic arrives.
+	 *
+	 * We also require that no new lag samples are available (all lag values
+	 * are -1) in both messages to avoid a race condition: when the walreceiver
+	 * sends a flush notification followed immediately by a force_reply (to
+	 * report apply progress), both messages can have applyPtr == sentPtr if
+	 * the startup process replayed the WAL quickly.  In that case, the lag
+	 * tracker samples are consumed by the first message, causing the second
+	 * to see all lags as -1.  Without the lag check, clearLagTimes would
+	 * incorrectly trigger and overwrite valid lag values with -1 (NULL).
 	 */
 	clearLagTimes = false;
-	if (applyPtr == sentPtr)
+	if (applyPtr == sentPtr &&
+		writeLag == -1 && flushLag == -1 && applyLag == -1)
 	{
 		if (fullyAppliedLastTime)
 			clearLagTimes = true;
-- 
2.47.3

  [image/svg+xml] normal.svg (112.8K, 3-normal.svg)

  [image/svg+xml] error.svg (112.8K, 4-error.svg)


* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-02 14:44  Fujii Masao <[email protected]>
  parent: Shinya Kato <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Fujii Masao @ 2026-03-02 14:44 UTC (permalink / raw)
  To: Shinya Kato <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Tue, Feb 24, 2026 at 3:54 PM Shinya Kato <[email protected]> wrote:
>
> Hi hackers,
>
> I have noticed that pg_stat_replication.*_lag sometimes shows NULL
> when inserting a record per second for health checking. This happens
> when the startup process replays WAL fast enough before the
> walreceiver sends its flush notification to the walsender.
>
> Here is the sequence that triggers the issue: (See normal.svg and
> error.svg for diagrams of the normal and problematic cases.)
>
> 1. The walreceiver receives, writes, and flushes WAL, then wakes the
> startup process via WakeupRecovery().
>
> 2. The startup process replays all available WAL quickly, then calls
> WalRcvForceReply() to set force_reply = true and wakes the
> walreceiver.
>
> 3. The walreceiver sends a flush notification to the walsender
> (XLogWalRcvSendReply() in XLogWalRcvFlush()). Since the startup has
> already replayed the WAL by this point, this message reports the
> incremented applyPtr, which equals sentPtr. The walsender processes
> this message, consuming the LagTracker samples and setting
> fullyAppliedLastTime = true.
>
> 4. In the next loop iteration, the walreceiver sees force_reply = true
> and sends another reply with the same positions. The walsender sees
> applyPtr == sentPtr for the second consecutive time and sets
> clearLagTimes = true. Since the LagTracker samples were already
> consumed by step 3, all lag values are -1. With clearLagTimes = true,
> these -1 values are written to walsnd->*Lag, causing
> pg_stat_replication to show NULL.
>
> The comment in ProcessStandbyReplyMessage() says:
>
>      * If the standby reports that it has fully replayed the WAL in two
>      * consecutive reply messages, then the second such message must result
>      * from wal_receiver_status_interval expiring on the standby.
>
> But as shown above, the second message can also come from
> WalRcvForceReply(), violating this assumption.
>
> The attached patch fixes this by adding a check that all lag values
> are -1 to the clearLagTimes condition. This ensures that clearLagTimes
> only triggers when there are truly no new lag samples in two
> consecutive messages (i.e., the system is genuinely idle), and not
> when the samples were simply consumed by a preceding message in a
> burst of replies.

Thanks for the patch!

With the patch applied, I set up logical replication and inserted a row every
second. Even with continuous inserts, NULL was shown in the lag columns of
pg_stat_replication. That makes me wonder whether the patch's approach is
sufficient to address the issue.

Relying solely on replies from the standby or subscriber seems a bit fragile to
me. If the goal is to keep showing the last measured lag for some time,
perhaps we should introduce a rate limit on when NULL is displayed in the lag
columns?

For example, if there has been no activity (i.e., sentPtr == applyPtr and
applyPtr has not changed since the previous cycle) for, say, 10 seconds,
then we could allow NULL to be shown. Thoughts?
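
As a rough illustration of that rate limit, the check could look something
like the sketch below. Everything in it is hypothetical: the IdleTracker
struct, the idle_should_clear_lag() name, the millisecond clock, and the
fixed 10-second constant are assumptions for the example, not existing
walsender code.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;               /* stand-in for XLogRecPtr */

#define LAG_CLEAR_DELAY_MS 10000    /* hypothetical threshold: 10 seconds */

/* Hypothetical per-walsender state for the rate limit. */
typedef struct IdleTracker
{
    Lsn     lastApplyPtr;           /* applyPtr seen in the previous cycle */
    int64_t idleSinceMs;            /* start of current idle period, -1 if none */
} IdleTracker;

/*
 * Return true once the connection has been idle (sentPtr == applyPtr and
 * applyPtr unchanged since the previous cycle) for LAG_CLEAR_DELAY_MS.
 */
static bool
idle_should_clear_lag(IdleTracker *t, Lsn sentPtr, Lsn applyPtr, int64_t nowMs)
{
    bool idle = (applyPtr == sentPtr && applyPtr == t->lastApplyPtr);

    t->lastApplyPtr = applyPtr;

    if (!idle)
    {
        /* Any activity restarts the idle timer. */
        t->idleSinceMs = -1;
        return false;
    }

    if (t->idleSinceMs < 0)
        t->idleSinceMs = nowMs;

    return (nowMs - t->idleSinceMs) >= LAG_CLEAR_DELAY_MS;
}
```

With this shape, a burst of replies carrying identical positions cannot clear
the lag on its own; only elapsed idle time can.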

Regards,

-- 
Fujii Masao






* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-06 07:12  Shinya Kato <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Shinya Kato @ 2026-03-06 07:12 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Mon, Mar 2, 2026 at 11:44 PM Fujii Masao <[email protected]> wrote:
> With the patch applied, I set up logical replication and inserted a row every
> second. Even with continuous inserts, NULL was shown in the lag columns of
> pg_stat_replication. That makes me wonder whether the patch's approach is
> sufficient to address the issue.

Thank you for the review and testing! I had only considered the issue
in the context of physical replication, but as you pointed out, my
approach is insufficient for logical replication.

> Relying solely on replies from the standby or subscriber seems a bit fragile to
> me. If the goal is to keep showing the last measured lag for some time,
> perhaps we should introduce a rate limit on when NULL is displayed in the lag
> columns?

My primary goal was to ensure that the source code comments match the
actual behavior, as the comment stating "the second such message must
result from wal_receiver_status_interval expiring on the standby" is
inaccurate. However, as you noted, the patch alone is not sufficient
to fully address the issue.

> For example, if there has been no activity (i.e., sentPtr == applyPtr and
> applyPtr has not changed since the previous cycle) for, say, 10 seconds,
> then we could allow NULL to be shown. Thoughts?

I considered a time-based rate limit, but it is difficult to choose an
appropriate threshold. Furthermore, the walsender has no way of
knowing the standby's or subscriber's wal_receiver_status_interval
setting.

The attached v2 patch takes a different approach: it additionally
requires that all reported positions (write/flush/apply) remain
unchanged from the previous reply. This directly detects a truly idle
system without relying on timeouts—if any position has advanced, new
WAL activity must have occurred, so we should not clear the lag values
even if the lag tracker is empty.
--
Best regards,
Shinya Kato
NTT OSS Center


Attachments:

  [application/octet-stream] v2-0001-Fix-spurious-NULL-lag-in-pg_stat_replication.patch (3.9K, 2-v2-0001-Fix-spurious-NULL-lag-in-pg_stat_replication.patch)
From 8de9d904d70c362ca2af00bd4e73c2ad3bda9b6b Mon Sep 17 00:00:00 2001
From: Shinya Kato <[email protected]>
Date: Fri, 6 Mar 2026 16:10:59 +0900
Subject: [PATCH v2] Fix spurious NULL lag in pg_stat_replication

Previously, ProcessStandbyReplyMessage() cleared replication lag times
whenever the standby reported fully-applied WAL in two consecutive
reply messages.  This heuristic was too aggressive: in bursty reply
patterns one message could consume all lag tracker samples, and the
next message -- arriving before new samples accumulated -- would see
no samples and trigger clearing, even though the standby was still
actively replaying WAL.

Add two additional conditions before clearing lag times: (1) all three
LagTrackerRead() calls must return -1, indicating no new lag samples,
and (2) write/flush/apply positions must be unchanged from the
previous reply.  Together with the existing fully-applied check, this
ensures lag is only cleared when the standby is truly idle.

Author: Shinya Kato <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Discussion: https://postgr.es/m/CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com
---
 src/backend/replication/walsender.c | 34 ++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2cde8ebc729..59dcfa340a5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2456,11 +2456,16 @@ ProcessStandbyReplyMessage(void)
 	TimeOffset	writeLag,
 				flushLag,
 				applyLag;
-	bool		clearLagTimes;
+	bool		clearLagTimes,
+				noLagSamples,
+				positionsUnchanged;
 	TimestampTz now;
 	TimestampTz replyTime;
 
 	static bool fullyAppliedLastTime = false;
+	static XLogRecPtr prevWritePtr = InvalidXLogRecPtr;
+	static XLogRecPtr prevFlushPtr = InvalidXLogRecPtr;
+	static XLogRecPtr prevApplyPtr = InvalidXLogRecPtr;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -2492,16 +2497,25 @@ ProcessStandbyReplyMessage(void)
 	flushLag = LagTrackerRead(SYNC_REP_WAIT_FLUSH, flushPtr, now);
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
+	/* Precompute inputs for clearLagTimes decision below. */
+	noLagSamples = (writeLag == -1 && flushLag == -1 && applyLag == -1);
+	positionsUnchanged = (writePtr == prevWritePtr &&
+						  flushPtr == prevFlushPtr &&
+						  applyPtr == prevApplyPtr);
+
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL, there are
+	 * no new lag samples, and positions remain unchanged across two
+	 * consecutive reply messages, forget the lag times measured when it last
+	 * wrote/flushed/applied a WAL record.  This avoids displaying stale lag
+	 * data until more WAL traffic arrives.
+	 *
+	 * The position-unchanged check prevents spuriously clearing lag in
+	 * bursty reply patterns, where one reply consumes all lag tracker
+	 * samples and the next arrives before new samples accumulate.
 	 */
 	clearLagTimes = false;
-	if (applyPtr == sentPtr)
+	if (applyPtr == sentPtr && noLagSamples && positionsUnchanged)
 	{
 		if (fullyAppliedLastTime)
 			clearLagTimes = true;
@@ -2510,6 +2524,10 @@ ProcessStandbyReplyMessage(void)
 	else
 		fullyAppliedLastTime = false;
 
+	prevWritePtr = writePtr;
+	prevFlushPtr = flushPtr;
+	prevApplyPtr = applyPtr;
+
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
 		WalSndKeepalive(false, InvalidXLogRecPtr);
-- 
2.47.3




* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-09 11:21  Fujii Masao <[email protected]>
  parent: Shinya Kato <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Fujii Masao @ 2026-03-09 11:21 UTC (permalink / raw)
  To: Shinya Kato <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Fri, Mar 6, 2026 at 4:13 PM Shinya Kato <[email protected]> wrote:
>
> On Mon, Mar 2, 2026 at 11:44 PM Fujii Masao <[email protected]> wrote:
> > With the patch applied, I set up logical replication and inserted a row every
> > second. Even with continuous inserts, NULL was shown in the lag columns of
> > pg_stat_replication. That makes me wonder whether the patch's approach is
> > sufficient to address the issue.
>
> Thank you for the review and testing! I had only considered the issue
> in the context of physical replication, but as you pointed out, my
> approach is insufficient for logical replication.
>
> > Relying solely on replies from the standby or subscriber seems a bit fragile to
> > me. If the goal is to keep showing the last measured lag for some time,
> > perhaps we should introduce a rate limit on when NULL is displayed in the lag
> > columns?
>
> My primary goal was to ensure that the source code comments match the
> actual behavior, as the comment stating "the second such message must
> result from wal_receiver_status_interval expiring on the standby" is
> inaccurate. However, as you noted, the patch alone is not sufficient
> to fully address the issue.
>
> > For example, if there has been no activity (i.e., sentPtr == applyPtr and
> > applyPtr has not changed since the previous cycle) for, say, 10 seconds,
> > then we could allow NULL to be shown. Thoughts?
>
> I considered a time-based rate limit, but it is difficult to choose an
> appropriate threshold. Furthermore, the walsender has no way of
> knowing the standby's or subscriber's wal_receiver_status_interval
> setting.
>
> The attached v2 patch takes a different approach: it additionally
> requires that all reported positions (write/flush/apply) remain
> unchanged from the previous reply. This directly detects a truly idle
> system without relying on timeouts—if any position has advanced, new
> WAL activity must have occurred, so we should not clear the lag values
> even if the lag tracker is empty.

This approach looks good to me.

One comment: currently, the lag becomes NULL basically after about one
wal_receiver_status_interval during periods of no activity. OTOH, with this
approach, it seems it would take about twice wal_receiver_status_interval.
Is this understanding correct?

Regards,

-- 
Fujii Masao






* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-10 01:01  Shinya Kato <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Shinya Kato @ 2026-03-10 01:01 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Mon, Mar 9, 2026 at 8:21 PM Fujii Masao <[email protected]> wrote:
> > The attached v2 patch takes a different approach: it additionally
> > requires that all reported positions (write/flush/apply) remain
> > unchanged from the previous reply. This directly detects a truly idle
> > system without relying on timeouts—if any position has advanced, new
> > WAL activity must have occurred, so we should not clear the lag values
> > even if the lag tracker is empty.
>
> This approach looks good to me.

Thank you for looking into this.

> One comment: currently, the lag becomes NULL basically after about one
> wal_receiver_status_interval during periods of no activity. OTOH, with this
> approach, it seems it would take about twice wal_receiver_status_interval.
> Is this understanding correct?

Exactly. With this patch, it takes about two
wal_receiver_status_interval cycles to show NULL instead of one. I
think this is an acceptable trade-off because it is better to take a
bit longer to detect inactivity than to incorrectly show NULL during
active replication.
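
The extra cycle can be traced with a small model of the v2 rule. This is a
sketch only: the V2State struct and v2_process_reply() are illustrative
names, and noLagSamples is passed in by the caller instead of being derived
from LagTrackerRead().

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;               /* stand-in for XLogRecPtr */

/* Hypothetical container for the static variables used by the v2 patch. */
typedef struct V2State
{
    bool fullyAppliedLastTime;
    Lsn  prevWritePtr;
    Lsn  prevFlushPtr;
    Lsn  prevApplyPtr;
} V2State;

/* Return clearLagTimes for one reply under the v2 condition. */
static bool
v2_process_reply(V2State *st, Lsn sentPtr, Lsn writePtr, Lsn flushPtr,
                 Lsn applyPtr, bool noLagSamples)
{
    bool positionsUnchanged = (writePtr == st->prevWritePtr &&
                               flushPtr == st->prevFlushPtr &&
                               applyPtr == st->prevApplyPtr);
    bool clearLagTimes = false;

    if (applyPtr == sentPtr && noLagSamples && positionsUnchanged)
    {
        if (st->fullyAppliedLastTime)
            clearLagTimes = true;
        st->fullyAppliedLastTime = true;
    }
    else
        st->fullyAppliedLastTime = false;

    /* Remember the reported positions for the next reply. */
    st->prevWritePtr = writePtr;
    st->prevFlushPtr = flushPtr;
    st->prevApplyPtr = applyPtr;

    return clearLagTimes;
}
```

Consider a non-racy exchange: a flush notification (apply still behind), then
an apply notification that carries a fresh sample, then idle replies driven
by wal_receiver_status_interval. In this model the first idle reply only sets
fullyAppliedLastTime, and the second one clears the lag. Under the old rule
the apply notification already set the flag, so the first idle reply cleared
it.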

-- 
Best regards,
Shinya Kato
NTT OSS Center






* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-10 01:54  Fujii Masao <[email protected]>
  parent: Shinya Kato <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Fujii Masao @ 2026-03-10 01:54 UTC (permalink / raw)
  To: Shinya Kato <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Tue, Mar 10, 2026 at 10:02 AM Shinya Kato <[email protected]> wrote:
>
> On Mon, Mar 9, 2026 at 8:21 PM Fujii Masao <[email protected]> wrote:
> > > The attached v2 patch takes a different approach: it additionally
> > > requires that all reported positions (write/flush/apply) remain
> > > unchanged from the previous reply. This directly detects a truly idle
> > > system without relying on timeouts—if any position has advanced, new
> > > WAL activity must have occurred, so we should not clear the lag values
> > > even if the lag tracker is empty.
> >
> > This approach looks good to me.
>
> Thank you for looking into this.
>
> > One comment: currently, the lag becomes NULL basically after about one
> > wal_receiver_status_interval during periods of no activity. OTOH, with this
> > approach, it seems it would take about twice wal_receiver_status_interval.
> > Is this understanding correct?
>
> Exactly. With this patch, it takes about two
> wal_receiver_status_interval cycles to show NULL instead of one. I
> think this is an acceptable trade-off because it is better to take a
> bit longer to detect inactivity than to incorrectly show NULL during
> active replication.

Even with your latest patch, if we remove fullyAppliedLastTime, and set
clearLagTimes to true when applyPtr == sentPtr && noLagSamples &&
positionsUnchanged,
wouldn't the time for the lag to become NULL be almost the same as
wal_receiver_status_interval?

The documentation doesn't clearly specify how long it should take for
the lag to become NULL, so doubling that time might be acceptable.
However, if we can keep it roughly the same without much complexity,
I think that would be preferable.

Thoughts?

-- 
Fujii Masao






* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-11 02:38  Shinya Kato <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Shinya Kato @ 2026-03-11 02:38 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Tue, Mar 10, 2026 at 10:54 AM Fujii Masao <[email protected]> wrote:
> Even with your latest patch, if we remove fullyAppliedLastTime, and set
> clearLagTimes to true when applyPtr == sentPtr && noLagSamples &&
> positionsUnchanged,
> wouldn't the time for the lag to become NULL be almost the same as
> wal_receiver_status_interval?
>
> The documentation doesn't clearly specify how long it should take for
> the lag to become NULL, so doubling that time might be acceptable.
> However, if we can keep it roughly the same without much complexity,
> I think that would be preferable.
>
> Thoughts?

Thank you for the suggestion. I tested this by removing
fullyAppliedLastTime, but even with synchronous replication, NULL
still appears. Here is why:

- Reply 1 (flush notification): positions = X. Lag samples are
consumed with real values, so noLagSamples = false. clearLagTimes is
not set, and prevPtrs = X is saved.

- Reply 2 (force_reply): positions = X again. Here, noLagSamples =
true and positionsUnchanged = true. Since applyPtr == sentPtr,
clearLagTimes is set to true, resulting in a NULL value.

Therefore, I believe fullyAppliedLastTime is still necessary to ensure
that the previous reply also contained no lag samples.
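
The two-reply trace above can be reproduced with a small model of the
variant that drops fullyAppliedLastTime. This is a sketch under assumptions:
PrevPositions and clears_without_flag() are illustrative names, and
noLagSamples is supplied by the caller instead of coming from
LagTrackerRead().

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;               /* stand-in for XLogRecPtr */

/* Hypothetical holder for the positions reported by the previous reply. */
typedef struct PrevPositions
{
    Lsn write;
    Lsn flush;
    Lsn apply;
} PrevPositions;

/*
 * Variant without fullyAppliedLastTime: clear as soon as a single reply is
 * fully applied, has no lag samples, and repeats the previous positions.
 */
static bool
clears_without_flag(PrevPositions *prev, Lsn sentPtr, Lsn writePtr,
                    Lsn flushPtr, Lsn applyPtr, bool noLagSamples)
{
    bool positionsUnchanged = (writePtr == prev->write &&
                               flushPtr == prev->flush &&
                               applyPtr == prev->apply);
    bool clearLagTimes = (applyPtr == sentPtr &&
                          noLagSamples &&
                          positionsUnchanged);

    /* Remember the reported positions for the next reply. */
    prev->write = writePtr;
    prev->flush = flushPtr;
    prev->apply = applyPtr;

    return clearLagTimes;
}
```

Reply 1 (samples present) returns false but records positions X. Reply 2
(same positions, no samples) returns true immediately, which is the spurious
NULL.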

BTW I noticed an incorrect comment in walreceiver.c and have included
a fix for it. Patch 0001 remains unchanged.


-- 
Best regards,
Shinya Kato
NTT OSS Center


Attachments:

  [application/octet-stream] v3-0001-Fix-spurious-NULL-lag-in-pg_stat_replication.patch (3.9K, 2-v3-0001-Fix-spurious-NULL-lag-in-pg_stat_replication.patch)
From a06abff86337483ddcd4cd2a49ffbc03c30df966 Mon Sep 17 00:00:00 2001
From: Shinya Kato <[email protected]>
Date: Fri, 6 Mar 2026 16:10:59 +0900
Subject: [PATCH v3 1/2] Fix spurious NULL lag in pg_stat_replication

Previously, ProcessStandbyReplyMessage() cleared replication lag times
whenever the standby reported fully-applied WAL in two consecutive
reply messages.  This heuristic was too aggressive: in bursty reply
patterns one message could consume all lag tracker samples, and the
next message -- arriving before new samples accumulated -- would see
no samples and trigger clearing, even though the standby was still
actively replaying WAL.

Add two additional conditions before clearing lag times: (1) all three
LagTrackerRead() calls must return -1, indicating no new lag samples,
and (2) write/flush/apply positions must be unchanged from the
previous reply.  Together with the existing fully-applied check, this
ensures lag is only cleared when the standby is truly idle.

Author: Shinya Kato <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Discussion: https://postgr.es/m/CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com
---
 src/backend/replication/walsender.c | 34 ++++++++++++++++++++++-------
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 79fc192b171..e0b2ac29d74 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2457,11 +2457,16 @@ ProcessStandbyReplyMessage(void)
 	TimeOffset	writeLag,
 				flushLag,
 				applyLag;
-	bool		clearLagTimes;
+	bool		clearLagTimes,
+				noLagSamples,
+				positionsUnchanged;
 	TimestampTz now;
 	TimestampTz replyTime;
 
 	static bool fullyAppliedLastTime = false;
+	static XLogRecPtr prevWritePtr = InvalidXLogRecPtr;
+	static XLogRecPtr prevFlushPtr = InvalidXLogRecPtr;
+	static XLogRecPtr prevApplyPtr = InvalidXLogRecPtr;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -2493,16 +2498,25 @@ ProcessStandbyReplyMessage(void)
 	flushLag = LagTrackerRead(SYNC_REP_WAIT_FLUSH, flushPtr, now);
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
+	/* Precompute inputs for clearLagTimes decision below. */
+	noLagSamples = (writeLag == -1 && flushLag == -1 && applyLag == -1);
+	positionsUnchanged = (writePtr == prevWritePtr &&
+						  flushPtr == prevFlushPtr &&
+						  applyPtr == prevApplyPtr);
+
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL, there are
+	 * no new lag samples, and positions remain unchanged across two
+	 * consecutive reply messages, forget the lag times measured when it last
+	 * wrote/flushed/applied a WAL record.  This avoids displaying stale lag
+	 * data until more WAL traffic arrives.
+	 *
+	 * The position-unchanged check prevents spuriously clearing lag in
+	 * bursty reply patterns, where one reply consumes all lag tracker
+	 * samples and the next arrives before new samples accumulate.
 	 */
 	clearLagTimes = false;
-	if (applyPtr == sentPtr)
+	if (applyPtr == sentPtr && noLagSamples && positionsUnchanged)
 	{
 		if (fullyAppliedLastTime)
 			clearLagTimes = true;
@@ -2511,6 +2525,10 @@ ProcessStandbyReplyMessage(void)
 	else
 		fullyAppliedLastTime = false;
 
+	prevWritePtr = writePtr;
+	prevFlushPtr = flushPtr;
+	prevApplyPtr = applyPtr;
+
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
 		WalSndKeepalive(false, InvalidXLogRecPtr);
-- 
2.47.3



  [application/octet-stream] v3-0002-Fix-a-comment-in-walreceiver.c.patch (1.2K, 3-v3-0002-Fix-a-comment-in-walreceiver.c.patch)
From 50fddedb1c94e720a5858dc61cf3af42c1580fd5 Mon Sep 17 00:00:00 2001
From: Shinya Kato <[email protected]>
Date: Wed, 11 Mar 2026 11:28:00 +0900
Subject: [PATCH v3 2/2] Fix a comment in walreceiver.c

Remove outdated reference to "oldest xmin" in XLogWalRcvSendReply()
comment, since the function no longer reports xmin.

Author: Shinya Kato <[email protected]>
Reviewed-by:
Discussion: https://postgr.es/m/
---
 src/backend/replication/walreceiver.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index fabe3c73034..bd9a1377e1c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1107,8 +1107,8 @@ XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli)
 }
 
 /*
- * Send reply message to primary, indicating our current WAL locations, oldest
- * xmin and the current time.
+ * Send reply message to primary, indicating our current WAL locations and the
+ * current time.
  *
  * If 'force' is not set, the message is only sent if enough time has
  * passed since last status update to reach wal_receiver_status_interval.
-- 
2.47.3




* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-12 15:27  Fujii Masao <[email protected]>
  parent: Shinya Kato <[email protected]>
  0 siblings, 2 replies; 21+ messages in thread

From: Fujii Masao @ 2026-03-12 15:27 UTC (permalink / raw)
  To: Shinya Kato <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Wed, Mar 11, 2026 at 11:39 AM Shinya Kato <[email protected]> wrote:
>
> On Tue, Mar 10, 2026 at 10:54 AM Fujii Masao <[email protected]> wrote:
> > Even with your latest patch, if we remove fullyAppliedLastTime, and set
> > clearLagTimes to true when applyPtr == sentPtr && noLagSamples &&
> > positionsUnchanged,
> > wouldn't the time for the lag to become NULL be almost the same as
> > wal_receiver_status_interval?
> >
> > The documentation doesn't clearly specify how long it should take for
> > the lag to become NULL, so doubling that time might be acceptable.
> > However, if we can keep it roughly the same without much complexity,
> > I think that would be preferable.
> >
> > Thoughts?
>
> Thank you for the suggestion. I tested this by removing
> fullyAppliedLastTime, but even with synchronous replication, NULL
> still appears. Here is why:
>
> - Reply 1 (flush notification): positions = X. Lag samples are
> consumed with real values, so noLagSamples = false. clearLagTimes is
> not set, and prevPtrs = X is saved.
>
> - Reply 2 (force_reply): positions = X again. Here, noLagSamples =
> true and positionsUnchanged = true. Since applyPtr == sentPtr,
> clearLagTimes is set to true, resulting in a NULL value.
>
> Therefore, I believe fullyAppliedLastTime is still necessary to ensure
> that the previous reply also contained no lag samples.

Thanks for testing and for the clarification! You're right.

However, if we apply this change, the time required for the lag information to
be reset would effectively double. I'm starting to wonder whether that's really
acceptable, especially for back branches. Although the docs don't clearly
specify this timing, doubling it could affect systems that monitor
replication lag, for example. It might still be reasonable to apply
such a change in master, though.

On further thought, the root cause seems to be that walreceiver can send
two consecutive status reply messages with identical WAL locations even
when wal_receiver_status_interval has not yet elapsed. Addressing that
behavior directly might resolve the issue you reported. I've attached a PoC
patch that does this. Thoughts?
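
To make the intended behavior concrete, here is a minimal standalone
sketch of the send decision. It is a simplified model, not the patch
itself: Positions, should_send_reply() and demo_duplicate_reply() are
names made up for illustration, the LSNs are arbitrary integers, and the
timing check is reduced to a boolean. The real logic lives in
XLogWalRcvSendReply().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the write/flush/apply LSNs in a status reply. */
typedef struct
{
	uint64_t	write;
	uint64_t	flush;
	uint64_t	apply;
} Positions;

/*
 * Simplified model of the decision: send if forced, if the status interval
 * has elapsed, or if a tracked position moved since the last reply.  The
 * apply position is compared only when check_apply is set.
 */
static bool
should_send_reply(const Positions *last, const Positions *cur,
				  bool force, bool interval_elapsed, bool check_apply)
{
	if (force || interval_elapsed)
		return true;
	if (cur->write != last->write || cur->flush != last->flush)
		return true;
	return check_apply && cur->apply != last->apply;
}

/* Walk through the two-reply sequence described in the report. */
void
demo_duplicate_reply(void)
{
	Positions	last = {100, 100, 90};	/* previous reply */
	Positions	cur = {200, 200, 200};	/* WAL flushed and already replayed */

	/* Flush notification: write/flush advanced, so it is sent. */
	assert(should_send_reply(&last, &cur, false, false, false));
	last = cur;

	/* Startup's request: nothing advanced, so it is suppressed ... */
	assert(!should_send_reply(&last, &cur, false, false, true));

	/* ... whereas forcing the reply, as before, sends the duplicate. */
	assert(should_send_reply(&last, &cur, true, false, true));
}
```

The point of the sketch is only that the request from the startup process
changes from "send unconditionally" to "send if the apply location moved".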

Regards,

-- 
Fujii Masao


Attachments:

  [application/octet-stream] v4-0001-Avoid-sending-duplicate-WAL-locations-in-standby-.patch (9.1K, 2-v4-0001-Avoid-sending-duplicate-WAL-locations-in-standby-.patch)
From c231d1b129d1e8a74bc47badcc9ee3e8d4e04e16 Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Thu, 12 Mar 2026 20:26:01 +0900
Subject: [PATCH v4] Avoid sending duplicate WAL locations in standby status
 replies.

Previously, when the startup process applied WAL and requested walreceiver
to send an apply notification to the primary, walreceiver sent a status reply
unconditionally, even if the WAL locations had not advanced since
the previous update.

As a result, the standby could send two consecutive status reply messages
with identical WAL locations even though wal_receiver_status_interval had
not yet elapsed. This could unexpectedly reset the reported replication lag,
making it difficult for users to monitor lag. The second message was also
unnecessary because it reported no progress.

This commit updates walreceiver to send a reply only when the apply location
has advanced since the last status update, even when the startup process
requests a notification.
---
 src/backend/replication/walreceiver.c | 70 ++++++++++++++++-----------
 src/backend/replication/walsender.c   |  2 +-
 src/include/replication/walreceiver.h |  4 +-
 3 files changed, 44 insertions(+), 32 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index fabe3c73034..f5d5379edc7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -143,7 +143,7 @@ static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr,
 							TimeLineID tli);
 static void XLogWalRcvFlush(bool dying, TimeLineID tli);
 static void XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, bool replyApply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
 static void WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now);
@@ -417,7 +417,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 				WalRcvComputeNextWakeup(i, now);
 
 			/* Send initial reply/feedback messages. */
-			XLogWalRcvSendReply(true, false);
+			XLogWalRcvSendReply(true, false, false);
 			XLogWalRcvSendHSFeedback(true);
 
 			/* Loop until end-of-streaming or error */
@@ -493,7 +493,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 					}
 
 					/* Let the primary know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, false);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -539,7 +539,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 					ResetLatch(MyLatch);
 					CHECK_FOR_INTERRUPTS();
 
-					if (walrcv->force_reply)
+					if (walrcv->reply_apply)
 					{
 						/*
 						 * The recovery process has asked us to send apply
@@ -547,9 +547,9 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
 						 */
-						walrcv->force_reply = false;
+						walrcv->reply_apply = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(false, false, true);
 					}
 				}
 				if (rc & WL_TIMEOUT)
@@ -595,7 +595,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 						wakeup[WALRCV_WAKEUP_PING] = TIMESTAMP_INFINITY;
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, false);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -886,7 +886,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len, TimeLineID tli)
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, false);
 				break;
 			}
 		default:
@@ -1053,7 +1053,7 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		/* Also let the primary know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, false);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1107,24 +1107,33 @@ XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli)
 }
 
 /*
- * Send reply message to primary, indicating our current WAL locations, oldest
- * xmin and the current time.
+ * Send reply message to primary, indicating our current WAL locations and
+ * time.
  *
- * If 'force' is not set, the message is only sent if enough time has
- * passed since last status update to reach wal_receiver_status_interval.
- * If wal_receiver_status_interval is disabled altogether and 'force' is
- * false, this is a no-op.
+ * The message is sent if 'force' is set, if enough time has passed since the
+ * last update to reach wal_receiver_status_interval, or if WAL locations have
+ * advanced since the previous status update. If wal_receiver_status_interval
+ * is disabled and 'force' is false, this function does nothing. Set 'force' to
+ * send the message unconditionally.
+ *
+ * Set 'replyApply' when the apply location is expected to have advanced from the
+ * previous update (for example, when the startup process requests an apply
+ * notification to be sent to the primary). In that case, the write, flush, and
+ * apply locations are compared to determine whether WAL has advanced.
+ * Otherwise the apply location is assumed unchanged and is not checked,
+ * so only the write and flush locations are considered.
  *
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbeats, when approaching
  * wal_receiver_timeout.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, bool replyApply)
 {
 	static XLogRecPtr writePtr = InvalidXLogRecPtr;
 	static XLogRecPtr flushPtr = InvalidXLogRecPtr;
-	XLogRecPtr	applyPtr;
+	static XLogRecPtr applyPtr = InvalidXLogRecPtr;
+	XLogRecPtr	latestApplyPtr = InvalidXLogRecPtr;
 	TimestampTz now;
 
 	/*
@@ -1140,17 +1149,19 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/*
 	 * We can compare the write and flush positions to the last message we
 	 * sent without taking any lock, but the apply position requires a spin
-	 * lock, so we don't check that unless something else has changed or 10
-	 * seconds have passed.  This means that the apply WAL location will
-	 * appear, from the primary's point of view, to lag slightly, but since
-	 * this is only for reporting purposes and only on idle systems, that's
-	 * probably OK.
+	 * lock, so we don't check that unless it is expected to advance since the
+	 * previsou update, i.e., when 'replyApply' is true.
 	 */
-	if (!force
-		&& writePtr == LogstreamResult.Write
-		&& flushPtr == LogstreamResult.Flush
-		&& now < wakeup[WALRCV_WAKEUP_REPLY])
-		return;
+	if (!force && now < wakeup[WALRCV_WAKEUP_REPLY])
+	{
+		if (replyApply)
+			latestApplyPtr = GetXLogReplayRecPtr(NULL);
+
+		if (writePtr == LogstreamResult.Write
+			&& flushPtr == LogstreamResult.Flush
+			&& (!replyApply || applyPtr == latestApplyPtr))
+			return;
+	}
 
 	/* Make sure we wake up when it's time to send another reply. */
 	WalRcvComputeNextWakeup(WALRCV_WAKEUP_REPLY, now);
@@ -1158,7 +1169,8 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/* Construct a new message */
 	writePtr = LogstreamResult.Write;
 	flushPtr = LogstreamResult.Flush;
-	applyPtr = GetXLogReplayRecPtr(NULL);
+	applyPtr = (latestApplyPtr == InvalidXLogRecPtr) ?
+		GetXLogReplayRecPtr(NULL) : latestApplyPtr;
 
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, PqReplMsg_StandbyStatusUpdate);
@@ -1382,7 +1394,7 @@ WalRcvForceReply(void)
 {
 	ProcNumber	procno;
 
-	WalRcv->force_reply = true;
+	WalRcv->reply_apply = true;
 	/* fetching the proc number is probably atomic, but don't rely on it */
 	SpinLockAcquire(&WalRcv->mutex);
 	procno = WalRcv->procno;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 79fc192b171..e672787ec04 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2502,7 +2502,7 @@ ProcessStandbyReplyMessage(void)
 	 * until more WAL traffic arrives.
 	 */
 	clearLagTimes = false;
-	if (applyPtr == sentPtr)
+	if (applyPtr == sentPtr && flushPtr == sentPtr)
 	{
 		if (fullyAppliedLastTime)
 			clearLagTimes = true;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9b9bd916314..024ebcf4f37 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -156,11 +156,11 @@ typedef struct
 	pg_atomic_uint64 writtenUpto;
 
 	/*
-	 * force walreceiver reply?  This doesn't need to be locked; memory
+	 * request walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
 	 * store semantics, so use sig_atomic_t.
 	 */
-	sig_atomic_t force_reply;	/* used as a bool */
+	sig_atomic_t reply_apply;	/* used as a bool */
 } WalRcvData;
 
 extern PGDLLIMPORT WalRcvData *WalRcv;
-- 
2.51.2

* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-13 02:15  Chao Li <[email protected]>
  parent: Fujii Masao <[email protected]>
  1 sibling, 0 replies; 21+ messages in thread

From: Chao Li @ 2026-03-13 02:15 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Shinya Kato <[email protected]>; PostgreSQL Hackers <[email protected]>



> On Mar 12, 2026, at 23:27, Fujii Masao <[email protected]> wrote:
> 
> On Wed, Mar 11, 2026 at 11:39 AM Shinya Kato <[email protected]> wrote:
>> 
>> On Tue, Mar 10, 2026 at 10:54 AM Fujii Masao <[email protected]> wrote:
>>> Even with your latest patch, if we remove fullyAppliedLastTime, and set
>>> clearLagTimes to true when applyPtr == sentPtr && noLagSamples &&
>>> positionsUnchanged,
>>> wouldn't the time for the lag to become NULL be almost the same as
>>> wal_receiver_status_interval?
>>> 
>>> The documentation doesn't clearly specify how long it should take for
>>> the lag to become NULL, so doubling that time might be acceptable.
>>> However, if we can keep it roughly the same without much complexity,
>>> I think that would be preferable.
>>> 
>>> Thought?
>> 
>> Thank you for the suggestion. I tested this by removing
>> fullyAppliedLastTime, but even with synchronous replication, NULL
>> still appears. Here is why:
>> 
>> - Reply 1 (flush notification): positions = X. Lag samples are
>> consumed with real values, so noLagSamples = false. clearLagTimes is
>> not set, and prevPtrs = X is saved.
>> 
>> - Reply 2 (force_reply): positions = X again. Here, noLagSamples =
>> true and positionsUnchanged = true. Since applyPtr == sentPtr,
>> clearLagTimes is set to true, resulting in a NULL value.
>> 
>> Therefore, I believe fullyAppliedLastTime is still necessary to ensure
>> that the previous reply also contained no lag samples.
> 
> Thanks for testing and for the clarification! You're right.
> 
> However, if we apply this change, the time required for the lag information to
> be reset would effectively double. I'm starting to wonder whether that's really
> acceptable, especially for back branches. Although the docs don't clearly
> specify this timing, doubling it could affect systems that monitor
> replication lag, for example. It might still be reasonable to apply
> such a change in master, though.
> 
> On further thought, the root cause seems to be that walreceiver can send
> two consecutive status reply messages with identical WAL locations even
> when wal_receiver_status_interval has not yet elapsed. Addressing that
> behavior directly might resolve the issue you reported. I've attached a PoC
> patch that does this. Thoughts?
> 
> Regards,
> 
> -- 
> Fujii Masao
> <v4-0001-Avoid-sending-duplicate-WAL-locations-in-standby-.patch>

I just read v4. The approach looks good to me overall. I have a few comments about the naming.

This patch changes the old force reply logic to an applied-location-driven reply. Now a reply is sent only if the applied location has advanced. However, this applied-location-driven reply is still triggered from WalRcvForceReply(), so the function has effectively lost its original “force” semantics. Because of that, it might be better to rename WalRcvForceReply() to something like WalRcvRequestApplyStatusUpdate().

Then,
```
static void
XLogWalRcvSendReply(bool force, bool requestReply, bool replyApply)
```

replyApply reads like “send an apply reply”, but in reality it indicates that the applied location should be checked to decide whether to send the reply. So it might be clearer to rename it to something like checkApplyStatus.

Lastly,
```
    sig_atomic_t reply_apply; /* used as a bool */
```

reply_apply sounds like an action of “reply with apply”, but what it actually represents is that the startup process requested an applied-location-driven reply. If the applied location has not advanced, the reply won’t be sent. So a name like apply_update_requested might better reflect the meaning.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-16 00:25  Shinya Kato <[email protected]>
  parent: Fujii Masao <[email protected]>
  1 sibling, 1 reply; 21+ messages in thread

From: Shinya Kato @ 2026-03-16 00:25 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Fri, Mar 13, 2026 at 12:27 AM Fujii Masao <[email protected]> wrote:

> Thanks for testing and for the clarification! You're right.
>
> However, if we apply this change, the time required for the lag information to
> be reset would effectively double. I'm starting to wonder whether that's really
> acceptable, especially for back branches. Although the docs don't clearly
> specify this timing, doubling it could affect systems that monitor
> replication lag, for example. It might still be reasonable to apply
> such a change in master, though.

Yes, I agree. Doubling the lag reset time should be avoided in back
branches if possible.

> On further thought, the root cause seems to be that walreceiver can send
> two consecutive status reply messages with identical WAL locations even
> when wal_receiver_status_interval has not yet elapsed. Addressing that
> behavior directly might resolve the issue you reported. I've attached a PoC
> patch that does this. Thoughts?

Thank you for the v4 patch. I think this approach is better than mine.
I tested the patch and confirmed that the issue no longer reproduces
with physical replication. However, with logical replication, the lag
columns in pg_stat_replication still show NULL periodically at
wal_receiver_status_interval, since send_feedback() in worker.c can
still send duplicate positions.

+ * previsou update, i.e., when 'replyApply' is true.

One minor thing: there is a typo "previsou". It should be "previous".

-- 
Best regards,
Shinya Kato
NTT OSS Center

* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-17 02:00  Fujii Masao <[email protected]>
  parent: Shinya Kato <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Fujii Masao @ 2026-03-17 02:00 UTC (permalink / raw)
  To: Shinya Kato <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Mon, Mar 16, 2026 at 9:26 AM Shinya Kato <[email protected]> wrote:
> Thank you for the v4 patch. I think this approach is better than mine.
> I tested the patch and confirmed that the issue no longer reproduces
> with physical replication. However, with logical replication, the lag
> columns in pg_stat_replication still show NULL periodically at
> wal_receiver_status_interval, since send_feedback() in worker.c can
> still send duplicate positions.

I was thinking that if a feedback message triggered by
wal_receiver_status_interval has the same LSNs as the previous message,
it's expected for the lag columns to become NULL. But you see it differently,
don't you? Sorry, I failed to understand your point...

Regards,

-- 
Fujii Masao

* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-19 13:58  Shinya Kato <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Shinya Kato @ 2026-03-19 13:58 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Tue, Mar 17, 2026 at 11:00 AM Fujii Masao <[email protected]> wrote:
>
> On Mon, Mar 16, 2026 at 9:26 AM Shinya Kato <[email protected]> wrote:
> > Thank you for the v4 patch. I think this approach is better than mine.
> > I tested the patch and confirmed that the issue no longer reproduces
> > with physical replication. However, with logical replication, the lag
> > columns in pg_stat_replication still show NULL periodically at
> > wal_receiver_status_interval, since send_feedback() in worker.c can
> > still send duplicate positions.
>
> I was thinking that if a feedback message triggered by
> wal_receiver_status_interval has the same LSNs as the previous message,
> it's expected for the lag columns to become NULL. But you see it differently,
> don't you? Sorry, I failed to understand your point...

Sorry for the confusion. I ran a script inserting one row every 0.5
seconds under logical replication and confirmed that NULL still
appears in the lag columns even while replication is actively running.
I was initially mistaken that this was tied to
wal_receiver_status_interval timing — that turned out to be unrelated.

I haven't had time to investigate further, but my current impression
is that the existing approach may not be sufficient for logical
replication.


-- 
Best regards,
Shinya Kato
NTT OSS Center

* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-19 17:13  Fujii Masao <[email protected]>
  parent: Shinya Kato <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Fujii Masao @ 2026-03-19 17:13 UTC (permalink / raw)
  To: Shinya Kato <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Thu, Mar 19, 2026 at 10:58 PM Shinya Kato <[email protected]> wrote:
>
> On Tue, Mar 17, 2026 at 11:00 AM Fujii Masao <[email protected]> wrote:
> >
> > On Mon, Mar 16, 2026 at 9:26 AM Shinya Kato <[email protected]> wrote:
> > > Thank you for the v4 patch. I think this approach is better than mine.
> > > I tested the patch and confirmed that the issue no longer reproduces
> > > with physical replication. However, with logical replication, the lag
> > > columns in pg_stat_replication still show NULL periodically at
> > > wal_receiver_status_interval, since send_feedback() in worker.c can
> > > still send duplicate positions.
> >
> > I was thinking that if a feedback message triggered by
> > wal_receiver_status_interval has the same LSNs as the previous message,
> > it's expected for the lag columns to become NULL. But you see it differently,
> > don't you? Sorry, I failed to understand your point...
>
> Sorry for the confusion. I ran a script inserting one row every 0.5
> seconds under logical replication and confirmed that NULL still
> appears in the lag columns even while replication is actively running.
> I was initially mistaken that this was tied to
> wal_receiver_status_interval timing — that turned out to be unrelated.
>
> I haven't had time to investigate further, but my current impression
> is that the existing approach may not be sufficient for logical
> replication.

Thanks for the clarification! I understand your point now.

I think the issue occurs when the write, flush, and apply positions in the
first message all point to the same LSN (e.g., 0/030D5230), and in the second
message they again match one another but at a larger LSN (e.g., 0/030D52E0).

I've updated the patch to address this. It removes fullyAppliedLastTime,
tracks the positions from the previous reply, and clears the lag values only
when the positions remain unchanged across two consecutive messages.
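
For illustration only, the difference between the old and new clearing
rules can be modeled in a few lines of standalone C. This is a sketch
rather than the patch: the struct and function names are invented, and
the LSNs are the example values above written as plain integers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the positions carried by one reply message. */
typedef struct
{
	uint64_t	write;
	uint64_t	flush;
	uint64_t	apply;
} Positions;

/* Old rule: clear once two consecutive replies are fully applied. */
static bool
old_clear_lag(bool *fully_applied_last_time, Positions cur, uint64_t sent)
{
	bool		clear = false;

	if (cur.apply == sent)
	{
		clear = *fully_applied_last_time;
		*fully_applied_last_time = true;
	}
	else
		*fully_applied_last_time = false;
	return clear;
}

/* New rule: additionally require that no position moved since last reply. */
static bool
new_clear_lag(Positions *prev, Positions cur, uint64_t sent)
{
	bool		clear = (cur.apply == sent && cur.flush == sent &&
						 cur.write == prev->write &&
						 cur.flush == prev->flush &&
						 cur.apply == prev->apply);

	*prev = cur;
	return clear;
}

void
demo_clear_lag(void)
{
	const uint64_t lsn1 = 0x030D5230;	/* 0/030D5230 */
	const uint64_t lsn2 = 0x030D52E0;	/* 0/030D52E0 */
	Positions	m1 = {lsn1, lsn1, lsn1};
	Positions	m2 = {lsn2, lsn2, lsn2};
	bool		fully_applied = false;
	Positions	prev = {0, 0, 0};

	/* Old rule clears on the second reply even though the LSN advanced. */
	assert(!old_clear_lag(&fully_applied, m1, lsn1));
	assert(old_clear_lag(&fully_applied, m2, lsn2));

	/* New rule waits for a reply that repeats the same positions. */
	assert(!new_clear_lag(&prev, m1, lsn1));
	assert(!new_clear_lag(&prev, m2, lsn2));
	assert(new_clear_lag(&prev, m2, lsn2));
}
```

In this model the lag values survive the second message and are cleared
only when a later message repeats the same positions, which is the
behavior the patch aims for.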

Patch attached. Could you test and review this updated patch?

Regards,

-- 
Fujii Masao


Attachments:

  [application/octet-stream] v5-0001-Avoid-sending-duplicate-WAL-locations-in-standby-.patch (10.8K, 2-v5-0001-Avoid-sending-duplicate-WAL-locations-in-standby-.patch)
From f8732fcf673478aef5907817ca3384d9c48dceda Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Fri, 20 Mar 2026 00:34:06 +0900
Subject: [PATCH v5] Avoid sending duplicate WAL locations in standby status
 replies

Previously, when the startup process applied WAL and requested walreceiver
to send an apply notification to the primary, walreceiver sent a status reply
unconditionally, even if the WAL locations had not advanced since
the previous update.

As a result, the standby could send two consecutive status reply messages
with identical WAL locations even though wal_receiver_status_interval had
not yet elapsed. This could unexpectedly reset the reported replication lag,
making it difficult for users to monitor lag. The second message was also
unnecessary because it reported no progress.

This commit updates walreceiver to send a reply only when the apply location
has advanced since the last status update, even when the startup process
requests a notification.
---
 src/backend/replication/walreceiver.c | 70 ++++++++++++++++-----------
 src/backend/replication/walsender.c   | 35 ++++++++------
 src/include/replication/walreceiver.h |  4 +-
 3 files changed, 62 insertions(+), 47 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index fabe3c73034..f5d5379edc7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -143,7 +143,7 @@ static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr,
 							TimeLineID tli);
 static void XLogWalRcvFlush(bool dying, TimeLineID tli);
 static void XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, bool replyApply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
 static void WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now);
@@ -417,7 +417,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 				WalRcvComputeNextWakeup(i, now);
 
 			/* Send initial reply/feedback messages. */
-			XLogWalRcvSendReply(true, false);
+			XLogWalRcvSendReply(true, false, false);
 			XLogWalRcvSendHSFeedback(true);
 
 			/* Loop until end-of-streaming or error */
@@ -493,7 +493,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 					}
 
 					/* Let the primary know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, false);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -539,7 +539,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 					ResetLatch(MyLatch);
 					CHECK_FOR_INTERRUPTS();
 
-					if (walrcv->force_reply)
+					if (walrcv->reply_apply)
 					{
 						/*
 						 * The recovery process has asked us to send apply
@@ -547,9 +547,9 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
 						 */
-						walrcv->force_reply = false;
+						walrcv->reply_apply = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(false, false, true);
 					}
 				}
 				if (rc & WL_TIMEOUT)
@@ -595,7 +595,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 						wakeup[WALRCV_WAKEUP_PING] = TIMESTAMP_INFINITY;
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, false);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -886,7 +886,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len, TimeLineID tli)
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, false);
 				break;
 			}
 		default:
@@ -1053,7 +1053,7 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		/* Also let the primary know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, false);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1107,24 +1107,33 @@ XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli)
 }
 
 /*
- * Send reply message to primary, indicating our current WAL locations, oldest
- * xmin and the current time.
+ * Send reply message to primary, indicating our current WAL locations and
+ * time.
  *
- * If 'force' is not set, the message is only sent if enough time has
- * passed since last status update to reach wal_receiver_status_interval.
- * If wal_receiver_status_interval is disabled altogether and 'force' is
- * false, this is a no-op.
+ * The message is sent if 'force' is set, if enough time has passed since the
+ * last update to reach wal_receiver_status_interval, or if WAL locations have
+ * advanced since the previous status update. If wal_receiver_status_interval
+ * is disabled and 'force' is false, this function does nothing. Set 'force' to
+ * send the message unconditionally.
+ *
+ * Set 'replyApply' when the apply location is expected to have advanced from the
+ * previous update (for example, when the startup process requests an apply
+ * notification to be sent to the primary). In that case, the write, flush, and
+ * apply locations are compared to determine whether WAL has advanced.
+ * Otherwise the apply location is assumed unchanged and is not checked,
+ * so only the write and flush locations are considered.
  *
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbeats, when approaching
  * wal_receiver_timeout.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, bool replyApply)
 {
 	static XLogRecPtr writePtr = InvalidXLogRecPtr;
 	static XLogRecPtr flushPtr = InvalidXLogRecPtr;
-	XLogRecPtr	applyPtr;
+	static XLogRecPtr applyPtr = InvalidXLogRecPtr;
+	XLogRecPtr	latestApplyPtr = InvalidXLogRecPtr;
 	TimestampTz now;
 
 	/*
@@ -1140,17 +1149,19 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/*
 	 * We can compare the write and flush positions to the last message we
 	 * sent without taking any lock, but the apply position requires a spin
-	 * lock, so we don't check that unless something else has changed or 10
-	 * seconds have passed.  This means that the apply WAL location will
-	 * appear, from the primary's point of view, to lag slightly, but since
-	 * this is only for reporting purposes and only on idle systems, that's
-	 * probably OK.
+	 * lock, so we don't check that unless it is expected to advance since the
+	 * previsou update, i.e., when 'replyApply' is true.
 	 */
-	if (!force
-		&& writePtr == LogstreamResult.Write
-		&& flushPtr == LogstreamResult.Flush
-		&& now < wakeup[WALRCV_WAKEUP_REPLY])
-		return;
+	if (!force && now < wakeup[WALRCV_WAKEUP_REPLY])
+	{
+		if (replyApply)
+			latestApplyPtr = GetXLogReplayRecPtr(NULL);
+
+		if (writePtr == LogstreamResult.Write
+			&& flushPtr == LogstreamResult.Flush
+			&& (!replyApply || applyPtr == latestApplyPtr))
+			return;
+	}
 
 	/* Make sure we wake up when it's time to send another reply. */
 	WalRcvComputeNextWakeup(WALRCV_WAKEUP_REPLY, now);
@@ -1158,7 +1169,8 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/* Construct a new message */
 	writePtr = LogstreamResult.Write;
 	flushPtr = LogstreamResult.Flush;
-	applyPtr = GetXLogReplayRecPtr(NULL);
+	applyPtr = (latestApplyPtr == InvalidXLogRecPtr) ?
+		GetXLogReplayRecPtr(NULL) : latestApplyPtr;
 
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, PqReplMsg_StandbyStatusUpdate);
@@ -1382,7 +1394,7 @@ WalRcvForceReply(void)
 {
 	ProcNumber	procno;
 
-	WalRcv->force_reply = true;
+	WalRcv->reply_apply = true;
 	/* fetching the proc number is probably atomic, but don't rely on it */
 	SpinLockAcquire(&WalRcv->mutex);
 	procno = WalRcv->procno;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 08253103cb3..66507e9c2dd 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2472,7 +2472,9 @@ ProcessStandbyReplyMessage(void)
 	TimestampTz now;
 	TimestampTz replyTime;
 
-	static bool fullyAppliedLastTime = false;
+	static XLogRecPtr prevWritePtr = InvalidXLogRecPtr;
+	static XLogRecPtr prevFlushPtr = InvalidXLogRecPtr;
+	static XLogRecPtr prevApplyPtr = InvalidXLogRecPtr;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -2505,22 +2507,23 @@ ProcessStandbyReplyMessage(void)
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL, and the
+	 * write/flush/apply positions remain unchanged across two consecutive
+	 * reply messages, forget the lag times measured when it last
+	 * wrote/flushed/applied a WAL record.
+	 *
+	 * The second message with unchanged positions typically results from
+	 * wal_receiver_status_interval expiring on the standby, so lag values are
+	 * usually cleared after that interval when there is no activity. This
+	 * avoids displaying stale lag data until more WAL traffic arrives.
 	 */
-	clearLagTimes = false;
-	if (applyPtr == sentPtr)
-	{
-		if (fullyAppliedLastTime)
-			clearLagTimes = true;
-		fullyAppliedLastTime = true;
-	}
-	else
-		fullyAppliedLastTime = false;
+	clearLagTimes = (applyPtr == sentPtr && flushPtr == sentPtr &&
+					 writePtr == prevWritePtr && flushPtr == prevFlushPtr &&
+					 applyPtr == prevApplyPtr);
+
+	prevWritePtr = writePtr;
+	prevFlushPtr = flushPtr;
+	prevApplyPtr = applyPtr;
 
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9b9bd916314..024ebcf4f37 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -156,11 +156,11 @@ typedef struct
 	pg_atomic_uint64 writtenUpto;
 
 	/*
-	 * force walreceiver reply?  This doesn't need to be locked; memory
+	 * request walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
 	 * store semantics, so use sig_atomic_t.
 	 */
-	sig_atomic_t force_reply;	/* used as a bool */
+	sig_atomic_t reply_apply;	/* used as a bool */
 } WalRcvData;
 
 extern PGDLLIMPORT WalRcvData *WalRcv;
-- 
2.51.2

* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-21 02:04  Shinya Kato <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Shinya Kato @ 2026-03-21 02:04 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Fri, Mar 20, 2026 at 2:13 AM Fujii Masao <[email protected]> wrote:
> I think the issue occurs when the positions in the first message point to
> the same LSN (e.g., 0/030D5230), and the second message reports the same but
> larger LSN (e.g., 0/030D52E0).

Thanks for the explanation!

> I've updated the patch to address this. It removes fullyAppliedLastTime,
> tracks the positions from the previous reply, and clears the lag values only
> when the positions remain unchanged across two consecutive messages.
>
> Patch attached. Could you test and review this updated patch?

The patch works properly. I think it looks nice to me, except for the
typo I sent in the previous message.
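
For readers following the archive, the difference between the old and new clearing rules quoted above can be sketched as a small standalone C model. This is only an illustration, not the walsender code itself: the struct and function names (OldState, NewState, old_should_clear, new_should_clear) are hypothetical, XLogRecPtr is a local stand-in typedef, and the two predicates are transcribed from the walsender.c hunk of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for PostgreSQL's types; assumptions for this sketch. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Old rule: remembers only whether the previous reply was fully applied. */
typedef struct
{
	bool		fullyAppliedLastTime;
} OldState;

/* New rule: remembers the positions reported by the previous reply. */
typedef struct
{
	XLogRecPtr	prevWritePtr;
	XLogRecPtr	prevFlushPtr;
	XLogRecPtr	prevApplyPtr;
} NewState;

/*
 * Old behavior: clear the lag values when two consecutive replies both
 * report applyPtr == sentPtr, even if the positions advanced in between.
 */
static bool
old_should_clear(OldState *st, XLogRecPtr sentPtr, XLogRecPtr applyPtr)
{
	bool		clear = false;

	assert(st != NULL);
	if (applyPtr == sentPtr)
	{
		if (st->fullyAppliedLastTime)
			clear = true;
		st->fullyAppliedLastTime = true;
	}
	else
		st->fullyAppliedLastTime = false;
	return clear;
}

/*
 * New behavior: clear the lag values only when the standby is fully caught
 * up and none of the three positions changed since the previous reply.
 */
static bool
new_should_clear(NewState *st, XLogRecPtr sentPtr, XLogRecPtr writePtr,
				 XLogRecPtr flushPtr, XLogRecPtr applyPtr)
{
	bool		clear;

	assert(st != NULL);
	clear = (applyPtr == sentPtr && flushPtr == sentPtr &&
			 writePtr == st->prevWritePtr && flushPtr == st->prevFlushPtr &&
			 applyPtr == st->prevApplyPtr);

	/* Remember this reply's positions for the next comparison. */
	st->prevWritePtr = writePtr;
	st->prevFlushPtr = flushPtr;
	st->prevApplyPtr = applyPtr;
	return clear;
}
```

Feeding both models the scenario described above (a fully applied reply at 0/030D5230 followed by a fully applied reply at 0/030D52E0) makes the old rule clear the lag on the second reply, while the new rule waits for a further reply that repeats 0/030D52E0.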


-- 
Best regards,
Shinya Kato
NTT OSS Center

* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-23 15:31  Fujii Masao <[email protected]>
  parent: Shinya Kato <[email protected]>
  0 siblings, 2 replies; 21+ messages in thread

From: Fujii Masao @ 2026-03-23 15:31 UTC (permalink / raw)
  To: Shinya Kato <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Sat, Mar 21, 2026 at 11:05 AM Shinya Kato <[email protected]> wrote:
>
> On Fri, Mar 20, 2026 at 2:13 AM Fujii Masao <[email protected]> wrote:
> > I think the issue occurs when the positions in the first message point to
> > the same LSN (e.g., 0/030D5230), and the second message reports the same but
> > larger LSN (e.g., 0/030D52E0).
>
> Thanks for the explanation!
>
> > I've updated the patch to address this. It removes fullyAppliedLastTime,
> > tracks the positions from the previous reply, and clears the lag values only
> > when the positions remain unchanged across two consecutive messages.
> >
> > Patch attached. Could you test and review this updated patch?
>
> The patch works properly. I think it looks nice to me, except for the
> typo I sent in the previous message.

Thanks for the review!

I've fixed the typo and attached an updated patch. I also incorporated
Chao's comments from upthread. I'm planning to commit this to master.

As for backpatching, I'm hesitant to backpatch the full patch since it may
reduce the number of replication feedback messages, which feels too invasive
for stable branches.

That said, the patch's changes in walsender.c could be backpatched.
As discussed earlier, they don't fully address the reported issue,
but they do help mitigate cases where lag becomes NULL unexpectedly
in logical replication. So it might be worth considering those changes
for stable branches.

Thoughts?

Regards,

-- 
Fujii Masao


Attachments:

  [application/octet-stream] v6-0001-Avoid-sending-duplicate-WAL-locations-in-standby-.patch (12.3K, 2-v6-0001-Avoid-sending-duplicate-WAL-locations-in-standby-.patch)
  download | inline diff:
From c5f32902433ad5b7ce7d5ad45436d1bcb263bf97 Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Mon, 23 Mar 2026 21:19:41 +0900
Subject: [PATCH v6] Avoid sending duplicate WAL locations in standby status
 replies

Previously, when the startup process applied WAL and requested walreceiver
to send an apply notification to the primary, walreceiver sent a status reply
unconditionally, even if the WAL locations had not advanced since
the previous update.

As a result, the standby could send two consecutive status reply messages
with identical WAL locations even though wal_receiver_status_interval had
not yet elapsed. This could unexpectedly reset the reported replication lag,
making it difficult for users to monitor lag. The second message was also
unnecessary because it reported no progress.

This commit updates walreceiver to send a reply only when the apply location
has advanced since the last status update, even when the startup process
requests a notification.
---
 src/backend/access/transam/xlogrecovery.c |  4 +-
 src/backend/replication/walreceiver.c     | 74 ++++++++++++++---------
 src/backend/replication/walsender.c       | 35 ++++++-----
 src/include/replication/walreceiver.h     |  6 +-
 4 files changed, 68 insertions(+), 51 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6d2c4a86b96..fd1c36d061d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2015,7 +2015,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	if (doRequestWalReceiverReply)
 	{
 		doRequestWalReceiverReply = false;
-		WalRcvForceReply();
+		WalRcvRequestApplyReply();
 	}
 
 	/* Allow read-only connections if we're consistent now */
@@ -3970,7 +3970,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					if (!streaming_reply_sent)
 					{
-						WalRcvForceReply();
+						WalRcvRequestApplyReply();
 						streaming_reply_sent = true;
 					}
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index fabe3c73034..a437273cf9a 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -143,7 +143,7 @@ static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr,
 							TimeLineID tli);
 static void XLogWalRcvFlush(bool dying, TimeLineID tli);
 static void XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, bool checkApply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
 static void WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now);
@@ -417,7 +417,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 				WalRcvComputeNextWakeup(i, now);
 
 			/* Send initial reply/feedback messages. */
-			XLogWalRcvSendReply(true, false);
+			XLogWalRcvSendReply(true, false, false);
 			XLogWalRcvSendHSFeedback(true);
 
 			/* Loop until end-of-streaming or error */
@@ -493,7 +493,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 					}
 
 					/* Let the primary know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, false);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -539,7 +539,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 					ResetLatch(MyLatch);
 					CHECK_FOR_INTERRUPTS();
 
-					if (walrcv->force_reply)
+					if (walrcv->apply_reply_requested)
 					{
 						/*
 						 * The recovery process has asked us to send apply
@@ -547,9 +547,9 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
 						 */
-						walrcv->force_reply = false;
+						walrcv->apply_reply_requested = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(false, false, true);
 					}
 				}
 				if (rc & WL_TIMEOUT)
@@ -595,7 +595,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 						wakeup[WALRCV_WAKEUP_PING] = TIMESTAMP_INFINITY;
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, false);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -886,7 +886,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len, TimeLineID tli)
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, false);
 				break;
 			}
 		default:
@@ -1053,7 +1053,7 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		/* Also let the primary know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, false);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1107,24 +1107,35 @@ XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli)
 }
 
 /*
- * Send reply message to primary, indicating our current WAL locations, oldest
- * xmin and the current time.
+ * Send reply message to primary, indicating our current WAL locations and
+ * time.
  *
- * If 'force' is not set, the message is only sent if enough time has
- * passed since last status update to reach wal_receiver_status_interval.
- * If wal_receiver_status_interval is disabled altogether and 'force' is
- * false, this is a no-op.
+ * The message is sent if 'force' is set, if enough time has passed since the
+ * last update to reach wal_receiver_status_interval, or if WAL locations have
+ * advanced since the previous status update. If wal_receiver_status_interval
+ * is disabled and 'force' is false, this function does nothing. Set 'force' to
+ * send the message unconditionally.
+ *
+ * Whether WAL locations are considered "advanced" depends on 'checkApply'.
+ * If 'checkApply' is false, only the write and flush locations are checked.
+ * This should be used when the call is triggered by write/flush activity
+ * (e.g., after walreceiver writes or flushes WAL), and avoids the
+ * apply-location check, which requires a spinlock. If 'checkApply' is true,
+ * the apply location is also considered. This should be used when the apply
+ * location is expected to advance (e.g., when the startup process requests
+ * an apply notification).
  *
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbeats, when approaching
  * wal_receiver_timeout.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, bool checkApply)
 {
 	static XLogRecPtr writePtr = InvalidXLogRecPtr;
 	static XLogRecPtr flushPtr = InvalidXLogRecPtr;
-	XLogRecPtr	applyPtr;
+	static XLogRecPtr applyPtr = InvalidXLogRecPtr;
+	XLogRecPtr	latestApplyPtr = InvalidXLogRecPtr;
 	TimestampTz now;
 
 	/*
@@ -1140,17 +1151,19 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/*
 	 * We can compare the write and flush positions to the last message we
 	 * sent without taking any lock, but the apply position requires a spin
-	 * lock, so we don't check that unless something else has changed or 10
-	 * seconds have passed.  This means that the apply WAL location will
-	 * appear, from the primary's point of view, to lag slightly, but since
-	 * this is only for reporting purposes and only on idle systems, that's
-	 * probably OK.
+	 * lock, so we don't check that unless it is expected to advance since the
+	 * previous update, i.e., when 'checkApply' is true.
 	 */
-	if (!force
-		&& writePtr == LogstreamResult.Write
-		&& flushPtr == LogstreamResult.Flush
-		&& now < wakeup[WALRCV_WAKEUP_REPLY])
-		return;
+	if (!force && now < wakeup[WALRCV_WAKEUP_REPLY])
+	{
+		if (checkApply)
+			latestApplyPtr = GetXLogReplayRecPtr(NULL);
+
+		if (writePtr == LogstreamResult.Write
+			&& flushPtr == LogstreamResult.Flush
+			&& (!checkApply || applyPtr == latestApplyPtr))
+			return;
+	}
 
 	/* Make sure we wake up when it's time to send another reply. */
 	WalRcvComputeNextWakeup(WALRCV_WAKEUP_REPLY, now);
@@ -1158,7 +1171,8 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/* Construct a new message */
 	writePtr = LogstreamResult.Write;
 	flushPtr = LogstreamResult.Flush;
-	applyPtr = GetXLogReplayRecPtr(NULL);
+	applyPtr = (latestApplyPtr == InvalidXLogRecPtr) ?
+		GetXLogReplayRecPtr(NULL) : latestApplyPtr;
 
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, PqReplMsg_StandbyStatusUpdate);
@@ -1378,11 +1392,11 @@ WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now)
  * synchronous_commit = remote_apply.
  */
 void
-WalRcvForceReply(void)
+WalRcvRequestApplyReply(void)
 {
 	ProcNumber	procno;
 
-	WalRcv->force_reply = true;
+	WalRcv->apply_reply_requested = true;
 	/* fetching the proc number is probably atomic, but don't rely on it */
 	SpinLockAcquire(&WalRcv->mutex);
 	procno = WalRcv->procno;
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 08253103cb3..66507e9c2dd 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2472,7 +2472,9 @@ ProcessStandbyReplyMessage(void)
 	TimestampTz now;
 	TimestampTz replyTime;
 
-	static bool fullyAppliedLastTime = false;
+	static XLogRecPtr prevWritePtr = InvalidXLogRecPtr;
+	static XLogRecPtr prevFlushPtr = InvalidXLogRecPtr;
+	static XLogRecPtr prevApplyPtr = InvalidXLogRecPtr;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -2505,22 +2507,23 @@ ProcessStandbyReplyMessage(void)
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL, and the
+	 * write/flush/apply positions remain unchanged across two consecutive
+	 * reply messages, forget the lag times measured when it last
+	 * wrote/flushed/applied a WAL record.
+	 *
+	 * The second message with unchanged positions typically results from
+	 * wal_receiver_status_interval expiring on the standby, so lag values are
+	 * usually cleared after that interval when there is no activity. This
+	 * avoids displaying stale lag data until more WAL traffic arrives.
 	 */
-	clearLagTimes = false;
-	if (applyPtr == sentPtr)
-	{
-		if (fullyAppliedLastTime)
-			clearLagTimes = true;
-		fullyAppliedLastTime = true;
-	}
-	else
-		fullyAppliedLastTime = false;
+	clearLagTimes = (applyPtr == sentPtr && flushPtr == sentPtr &&
+					 writePtr == prevWritePtr && flushPtr == prevFlushPtr &&
+					 applyPtr == prevApplyPtr);
+
+	prevWritePtr = writePtr;
+	prevFlushPtr = flushPtr;
+	prevApplyPtr = applyPtr;
 
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9b9bd916314..85d24c87298 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -156,11 +156,11 @@ typedef struct
 	pg_atomic_uint64 writtenUpto;
 
 	/*
-	 * force walreceiver reply?  This doesn't need to be locked; memory
+	 * request walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
 	 * store semantics, so use sig_atomic_t.
 	 */
-	sig_atomic_t force_reply;	/* used as a bool */
+	sig_atomic_t apply_reply_requested; /* used as a bool */
 } WalRcvData;
 
 extern PGDLLIMPORT WalRcvData *WalRcv;
@@ -488,7 +488,7 @@ walrcv_clear_result(WalRcvExecResult *walres)
 
 /* prototypes for functions in walreceiver.c */
 pg_noreturn extern void WalReceiverMain(const void *startup_data, size_t startup_data_len);
-extern void WalRcvForceReply(void);
+extern void WalRcvRequestApplyReply(void);
 
 /* prototypes for functions in walreceiverfuncs.c */
 extern Size WalRcvShmemSize(void);
-- 
2.51.2

* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-24 05:32  Chao Li <[email protected]>
  parent: Fujii Masao <[email protected]>
  1 sibling, 1 reply; 21+ messages in thread

From: Chao Li @ 2026-03-24 05:32 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Shinya Kato <[email protected]>; PostgreSQL Hackers <[email protected]>



> On Mar 23, 2026, at 23:31, Fujii Masao <[email protected]> wrote:
> 
> On Sat, Mar 21, 2026 at 11:05 AM Shinya Kato <[email protected]> wrote:
>> 
>> On Fri, Mar 20, 2026 at 2:13 AM Fujii Masao <[email protected]> wrote:
>>> I think the issue occurs when the positions in the first message point to
>>> the same LSN (e.g., 0/030D5230), and the second message reports the same but
>>> larger LSN (e.g., 0/030D52E0).
>> 
>> Thanks for the explanation!
>> 
>>> I've updated the patch to address this. It removes fullyAppliedLastTime,
>>> tracks the positions from the previous reply, and clears the lag values only
>>> when the positions remain unchanged across two consecutive messages.
>>> 
>>> Patch attached. Could you test and review this updated patch?
>> 
>> The patch works properly. I think it looks nice to me, except for the
>> typo I sent in the previous message.
> 
> Thanks for the review!
> 
> I've fixed the typo and attached an updated patch. I also incorporated
> Chao's comments from upthread. I'm planning to commit this to master.
> 
> As for backpatching, I'm hesitant to backpatch the full patch since it may
> reduce the number of replication feedback messages, which feels too invasive
> for stable branches.
> 
> That said, the patch's changes in walsender.c could be backpatched.
> As discussed earlier, they don't fully address the reported issue,
> but they do help mitigate cases where lag becomes NULL unexpectedly
> in logical replication. So it might be worth considering those changes
> for stable branches.
> 
> Thoughts?
> 
> Regards,
> 
> -- 
> Fujii Masao
> <v6-0001-Avoid-sending-duplicate-WAL-locations-in-standby-.patch>

Thank you for updating the patch. I saw that the variable name and function name were changed to reflect my earlier comments.

v6 looks good to me.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-25 07:02  Shinya Kato <[email protected]>
  parent: Fujii Masao <[email protected]>
  1 sibling, 1 reply; 21+ messages in thread

From: Shinya Kato @ 2026-03-25 07:02 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Tue, Mar 24, 2026, 00:31 Fujii Masao <[email protected]> wrote:

> On Sat, Mar 21, 2026 at 11:05 AM Shinya Kato <[email protected]> wrote:
> >
> > On Fri, Mar 20, 2026 at 2:13 AM Fujii Masao <[email protected]> wrote:
> > > I think the issue occurs when the positions in the first message point to
> > > the same LSN (e.g., 0/030D5230), and the second message reports the same but
> > > larger LSN (e.g., 0/030D52E0).
> >
> > Thanks for the explanation!
> >
> > > I've updated the patch to address this. It removes fullyAppliedLastTime,
> > > tracks the positions from the previous reply, and clears the lag values only
> > > when the positions remain unchanged across two consecutive messages.
> > >
> > > Patch attached. Could you test and review this updated patch?
> >
> > The patch works properly. I think it looks nice to me, except for the
> > typo I sent in the previous message.
>
> Thanks for the review!
>
> I've fixed the typo and attached an updated patch. I also incorporated
> Chao's comments from upthread. I'm planning to commit this to master.
>
> As for backpatching, I'm hesitant to backpatch the full patch since it may
> reduce the number of replication feedback messages, which feels too invasive
> for stable branches.
>
> That said, the patch's changes in walsender.c could be backpatched.
> As discussed earlier, they don't fully address the reported issue,
> but they do help mitigate cases where lag becomes NULL unexpectedly
> in logical replication. So it might be worth considering those changes
> for stable branches.

Thanks for the updated patch. LGTM.

Regarding the backpatch, I'd personally appreciate it if the walsender.c
changes were backpatched to stable branches. As you noted, they don't fully
solve the reported issue, but they do help reduce the cases where lag
columns in pg_stat_replication unexpectedly become NULL.

Even a partial mitigation in the back branches would be valuable for users
running stable releases.

--
Best regards,
Shinya Kato

* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-25 15:27  Fujii Masao <[email protected]>
  parent: Chao Li <[email protected]>
  0 siblings, 0 replies; 21+ messages in thread

From: Fujii Masao @ 2026-03-25 15:27 UTC (permalink / raw)
  To: Chao Li <[email protected]>; +Cc: Shinya Kato <[email protected]>; PostgreSQL Hackers <[email protected]>

On Tue, Mar 24, 2026 at 2:32 PM Chao Li <[email protected]> wrote:
> Thank you for updating the patch. I saw that the variable name and function name were changed to reflect my earlier comments.
>
> v6 looks good to me.

Thanks for the review!

Regards,


-- 
Fujii Masao

* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-25 15:30  Fujii Masao <[email protected]>
  parent: Shinya Kato <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Fujii Masao @ 2026-03-25 15:30 UTC (permalink / raw)
  To: Shinya Kato <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Wed, Mar 25, 2026 at 4:03 PM Shinya Kato <[email protected]> wrote:
> Thanks for the updated patch. LGTM.
>
> Regarding the backpatch, I'd personally appreciate it if the walsender.c changes were backpatched to stable branches. As you noted, they don't fully solve the reported issue, but they do help reduce the cases where lag columns in pg_stat_replication unexpectedly become NULL.
>
> Even a partial mitigation in the back branches would be valuable for users running stable releases.

+1

I've split the changes into two patches.

Patch 0001 fixes premature NULL lag reporting in walsender. I will commit it
and backpatch it to all supported branches.

Patch 0002 avoids sending duplicate WAL locations in standby status replies.
I will commit this to master only.
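
As a rough illustration of what patch 0002 changes on the standby side, the send/skip decision in XLogWalRcvSendReply() can be modeled as a pure function. This is a simplified sketch under stated assumptions: LastReply and should_send_reply are hypothetical names, the timer check (now < wakeup[WALRCV_WAKEUP_REPLY]) is reduced to an intervalElapsed flag, and the early return taken when wal_receiver_status_interval is disabled is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for PostgreSQL's type; an assumption for this sketch. */
typedef uint64_t XLogRecPtr;

/* Positions carried by the last status reply that was actually sent. */
typedef struct
{
	XLogRecPtr	writePtr;
	XLogRecPtr	flushPtr;
	XLogRecPtr	applyPtr;
} LastReply;

/*
 * Decide whether a status reply should be sent.
 *
 * force           - send unconditionally
 * intervalElapsed - models wal_receiver_status_interval having expired
 * checkApply      - also treat an advanced apply position as progress
 *
 * When neither 'force' nor the timer applies, the reply is skipped unless a
 * position has changed.  The apply position is only compared when
 * 'checkApply' is true, mirroring the patch's wish to avoid the spinlock
 * needed to read it on the write/flush paths.
 */
static bool
should_send_reply(const LastReply *last,
				  XLogRecPtr curWrite, XLogRecPtr curFlush, XLogRecPtr curApply,
				  bool force, bool intervalElapsed, bool checkApply)
{
	assert(last != NULL);

	if (!force && !intervalElapsed)
	{
		if (curWrite == last->writePtr &&
			curFlush == last->flushPtr &&
			(!checkApply || curApply == last->applyPtr))
			return false;		/* nothing new to report */
	}
	return true;
}
```

In this model, an apply notification requested by the startup process corresponds to force = false with checkApply = true, so a request that arrives after the apply position was already reported no longer produces a second, identical reply.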

Regards,

-- 
Fujii Masao


Attachments:

  [application/octet-stream] v7-0002-Avoid-sending-duplicate-WAL-locations-in-standby-.patch (10.3K, 2-v7-0002-Avoid-sending-duplicate-WAL-locations-in-standby-.patch)
  download | inline diff:
From c48f7e1a7bb7c9aab9935ef5b4c5263b904201ae Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Wed, 25 Mar 2026 23:53:48 +0900
Subject: [PATCH v7 2/2] Avoid sending duplicate WAL locations in standby
 status replies

Previously, when the startup process applied WAL and requested walreceiver
to send an apply notification to the primary, walreceiver sent a status reply
unconditionally, even if the WAL locations had not advanced since
the previous update.

As a result, the standby could send two consecutive status reply messages
with identical WAL locations even though wal_receiver_status_interval had
not yet elapsed. This could unexpectedly reset the reported replication lag,
making it difficult for users to monitor lag. The second message was also
unnecessary because it reported no progress.

This commit updates walreceiver to send a reply only when the apply location
has advanced since the last status update, even when the startup process
requests a notification.

Author: Fujii Masao <[email protected]>
Reviewed-by: Shinya Kato <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com
---
 src/backend/access/transam/xlogrecovery.c |  4 +-
 src/backend/replication/walreceiver.c     | 74 ++++++++++++++---------
 src/include/replication/walreceiver.h     |  6 +-
 3 files changed, 49 insertions(+), 35 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6d2c4a86b96..fd1c36d061d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2015,7 +2015,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
 	if (doRequestWalReceiverReply)
 	{
 		doRequestWalReceiverReply = false;
-		WalRcvForceReply();
+		WalRcvRequestApplyReply();
 	}
 
 	/* Allow read-only connections if we're consistent now */
@@ -3970,7 +3970,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 */
 					if (!streaming_reply_sent)
 					{
-						WalRcvForceReply();
+						WalRcvRequestApplyReply();
 						streaming_reply_sent = true;
 					}
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index fabe3c73034..a437273cf9a 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -143,7 +143,7 @@ static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr,
 							TimeLineID tli);
 static void XLogWalRcvFlush(bool dying, TimeLineID tli);
 static void XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, bool checkApply);
 static void XLogWalRcvSendHSFeedback(bool immed);
 static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
 static void WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now);
@@ -417,7 +417,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 				WalRcvComputeNextWakeup(i, now);
 
 			/* Send initial reply/feedback messages. */
-			XLogWalRcvSendReply(true, false);
+			XLogWalRcvSendReply(true, false, false);
 			XLogWalRcvSendHSFeedback(true);
 
 			/* Loop until end-of-streaming or error */
@@ -493,7 +493,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 					}
 
 					/* Let the primary know that we received some data. */
-					XLogWalRcvSendReply(false, false);
+					XLogWalRcvSendReply(false, false, false);
 
 					/*
 					 * If we've written some records, flush them to disk and
@@ -539,7 +539,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 					ResetLatch(MyLatch);
 					CHECK_FOR_INTERRUPTS();
 
-					if (walrcv->force_reply)
+					if (walrcv->apply_reply_requested)
 					{
 						/*
 						 * The recovery process has asked us to send apply
@@ -547,9 +547,9 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 						 * false in shared memory before sending the reply, so
 						 * we don't miss a new request for a reply.
 						 */
-						walrcv->force_reply = false;
+						walrcv->apply_reply_requested = false;
 						pg_memory_barrier();
-						XLogWalRcvSendReply(true, false);
+						XLogWalRcvSendReply(false, false, true);
 					}
 				}
 				if (rc & WL_TIMEOUT)
@@ -595,7 +595,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
 						wakeup[WALRCV_WAKEUP_PING] = TIMESTAMP_INFINITY;
 					}
 
-					XLogWalRcvSendReply(requestReply, requestReply);
+					XLogWalRcvSendReply(requestReply, requestReply, false);
 					XLogWalRcvSendHSFeedback(false);
 				}
 			}
@@ -886,7 +886,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len, TimeLineID tli)
 
 				/* If the primary requested a reply, send one immediately */
 				if (replyRequested)
-					XLogWalRcvSendReply(true, false);
+					XLogWalRcvSendReply(true, false, false);
 				break;
 			}
 		default:
@@ -1053,7 +1053,7 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
 		/* Also let the primary know that we made some progress */
 		if (!dying)
 		{
-			XLogWalRcvSendReply(false, false);
+			XLogWalRcvSendReply(false, false, false);
 			XLogWalRcvSendHSFeedback(false);
 		}
 	}
@@ -1107,24 +1107,35 @@ XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli)
 }
 
 /*
- * Send reply message to primary, indicating our current WAL locations, oldest
- * xmin and the current time.
+ * Send reply message to primary, indicating our current WAL locations and
+ * time.
  *
- * If 'force' is not set, the message is only sent if enough time has
- * passed since last status update to reach wal_receiver_status_interval.
- * If wal_receiver_status_interval is disabled altogether and 'force' is
- * false, this is a no-op.
+ * The message is sent if 'force' is set, if enough time has passed since the
+ * last update to reach wal_receiver_status_interval, or if WAL locations have
+ * advanced since the previous status update. If wal_receiver_status_interval
+ * is disabled and 'force' is false, this function does nothing. Set 'force' to
+ * send the message unconditionally.
+ *
+ * Whether WAL locations are considered "advanced" depends on 'checkApply'.
+ * If 'checkApply' is false, only the write and flush locations are checked.
+ * This should be used when the call is triggered by write/flush activity
+ * (e.g., after walreceiver writes or flushes WAL), and avoids the
+ * apply-location check, which requires a spinlock. If 'checkApply' is true,
+ * the apply location is also considered. This should be used when the apply
+ * location is expected to advance (e.g., when the startup process requests
+ * an apply notification).
  *
  * If 'requestReply' is true, requests the server to reply immediately upon
  * receiving this message. This is used for heartbeats, when approaching
  * wal_receiver_timeout.
  */
 static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, bool checkApply)
 {
 	static XLogRecPtr writePtr = InvalidXLogRecPtr;
 	static XLogRecPtr flushPtr = InvalidXLogRecPtr;
-	XLogRecPtr	applyPtr;
+	static XLogRecPtr applyPtr = InvalidXLogRecPtr;
+	XLogRecPtr	latestApplyPtr = InvalidXLogRecPtr;
 	TimestampTz now;
 
 	/*
@@ -1140,17 +1151,19 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/*
 	 * We can compare the write and flush positions to the last message we
 	 * sent without taking any lock, but the apply position requires a spin
-	 * lock, so we don't check that unless something else has changed or 10
-	 * seconds have passed.  This means that the apply WAL location will
-	 * appear, from the primary's point of view, to lag slightly, but since
-	 * this is only for reporting purposes and only on idle systems, that's
-	 * probably OK.
+	 * lock, so we don't check that unless it is expected to advance since the
+	 * previous update, i.e., when 'checkApply' is true.
 	 */
-	if (!force
-		&& writePtr == LogstreamResult.Write
-		&& flushPtr == LogstreamResult.Flush
-		&& now < wakeup[WALRCV_WAKEUP_REPLY])
-		return;
+	if (!force && now < wakeup[WALRCV_WAKEUP_REPLY])
+	{
+		if (checkApply)
+			latestApplyPtr = GetXLogReplayRecPtr(NULL);
+
+		if (writePtr == LogstreamResult.Write
+			&& flushPtr == LogstreamResult.Flush
+			&& (!checkApply || applyPtr == latestApplyPtr))
+			return;
+	}
 
 	/* Make sure we wake up when it's time to send another reply. */
 	WalRcvComputeNextWakeup(WALRCV_WAKEUP_REPLY, now);
@@ -1158,7 +1171,8 @@ XLogWalRcvSendReply(bool force, bool requestReply)
 	/* Construct a new message */
 	writePtr = LogstreamResult.Write;
 	flushPtr = LogstreamResult.Flush;
-	applyPtr = GetXLogReplayRecPtr(NULL);
+	applyPtr = (latestApplyPtr == InvalidXLogRecPtr) ?
+		GetXLogReplayRecPtr(NULL) : latestApplyPtr;
 
 	resetStringInfo(&reply_message);
 	pq_sendbyte(&reply_message, PqReplMsg_StandbyStatusUpdate);
@@ -1378,11 +1392,11 @@ WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now)
  * synchronous_commit = remote_apply.
  */
 void
-WalRcvForceReply(void)
+WalRcvRequestApplyReply(void)
 {
 	ProcNumber	procno;
 
-	WalRcv->force_reply = true;
+	WalRcv->apply_reply_requested = true;
 	/* fetching the proc number is probably atomic, but don't rely on it */
 	SpinLockAcquire(&WalRcv->mutex);
 	procno = WalRcv->procno;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9b9bd916314..85d24c87298 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -156,11 +156,11 @@ typedef struct
 	pg_atomic_uint64 writtenUpto;
 
 	/*
-	 * force walreceiver reply?  This doesn't need to be locked; memory
+	 * request walreceiver reply?  This doesn't need to be locked; memory
 	 * barriers for ordering are sufficient.  But we do need atomic fetch and
 	 * store semantics, so use sig_atomic_t.
 	 */
-	sig_atomic_t force_reply;	/* used as a bool */
+	sig_atomic_t apply_reply_requested; /* used as a bool */
 } WalRcvData;
 
 extern PGDLLIMPORT WalRcvData *WalRcv;
@@ -488,7 +488,7 @@ walrcv_clear_result(WalRcvExecResult *walres)
 
 /* prototypes for functions in walreceiver.c */
 pg_noreturn extern void WalReceiverMain(const void *startup_data, size_t startup_data_len);
-extern void WalRcvForceReply(void);
+extern void WalRcvRequestApplyReply(void);
 
 /* prototypes for functions in walreceiverfuncs.c */
 extern Size WalRcvShmemSize(void);
-- 
2.51.2



  [application/octet-stream] v7-0001-Fix-premature-NULL-lag-reporting-in-pg_stat_repli.patch (4.3K, 3-v7-0001-Fix-premature-NULL-lag-reporting-in-pg_stat_repli.patch)
From a7a4bfb8d58a15f1cae109e403e22e169898e59d Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Wed, 25 Mar 2026 22:09:40 +0900
Subject: [PATCH v7 1/2] Fix premature NULL lag reporting in
 pg_stat_replication

pg_stat_replication is documented to keep the last measured lag values for
a short time after the standby catches up, and then set them to NULL when
there is no WAL activity. However, previously lag values could become NULL
prematurely even while WAL activity was ongoing, especially in logical
replication.

This happened because the code cleared lag when two consecutive reply messages
indicated that the apply location had caught up with the send location.
It did not verify that the reported positions were unchanged, so lag could be
cleared even when positions had advanced between messages. In logical
replication, where the apply location often quickly catches up, this issue was
more likely to occur.

This commit fixes the issue by clearing lag only when the standby reports that
it has fully replayed WAL (i.e., both flush and apply locations have caught up
with the send location) and the write/flush/apply positions remain unchanged
across two consecutive reply messages.

The second message with unchanged positions typically results from
wal_receiver_status_interval, so lag values are cleared after that interval
when there is no activity. This avoids showing stale lag data while preventing
premature NULL values.

Even with this fix, lag may occasionally become NULL during activity if
identical position reports are sent repeatedly. Eliminating such duplicate
messages would address this fully, but that change is considered too invasive
for stable branches and will be made later, in master only.

Backpatch to all supported branches.

Author: Shinya Kato <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Discussion: https://postgr.es/m/CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com
Backpatch-through: 14
---
 src/backend/replication/walsender.c | 35 ++++++++++++++++-------------
 1 file changed, 19 insertions(+), 16 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 08253103cb3..66507e9c2dd 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2472,7 +2472,9 @@ ProcessStandbyReplyMessage(void)
 	TimestampTz now;
 	TimestampTz replyTime;
 
-	static bool fullyAppliedLastTime = false;
+	static XLogRecPtr prevWritePtr = InvalidXLogRecPtr;
+	static XLogRecPtr prevFlushPtr = InvalidXLogRecPtr;
+	static XLogRecPtr prevApplyPtr = InvalidXLogRecPtr;
 
 	/* the caller already consumed the msgtype byte */
 	writePtr = pq_getmsgint64(&reply_message);
@@ -2505,22 +2507,23 @@ ProcessStandbyReplyMessage(void)
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby.  This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL, and the
+	 * write/flush/apply positions remain unchanged across two consecutive
+	 * reply messages, forget the lag times measured when it last
+	 * wrote/flushed/applied a WAL record.
+	 *
+	 * The second message with unchanged positions typically results from
+	 * wal_receiver_status_interval expiring on the standby, so lag values are
+	 * usually cleared after that interval when there is no activity. This
+	 * avoids displaying stale lag data until more WAL traffic arrives.
 	 */
-	clearLagTimes = false;
-	if (applyPtr == sentPtr)
-	{
-		if (fullyAppliedLastTime)
-			clearLagTimes = true;
-		fullyAppliedLastTime = true;
-	}
-	else
-		fullyAppliedLastTime = false;
+	clearLagTimes = (applyPtr == sentPtr && flushPtr == sentPtr &&
+					 writePtr == prevWritePtr && flushPtr == prevFlushPtr &&
+					 applyPtr == prevApplyPtr);
+
+	prevWritePtr = writePtr;
+	prevFlushPtr = flushPtr;
+	prevApplyPtr = applyPtr;
 
 	/* Send a reply if the standby requested one. */
 	if (replyRequested)
-- 
2.51.2




* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-26 05:40  Shinya Kato <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 21+ messages in thread

From: Shinya Kato @ 2026-03-26 05:40 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Thu, Mar 26, 2026, 00:30 Fujii Masao <[email protected]> wrote:

> On Wed, Mar 25, 2026 at 4:03 PM Shinya Kato <[email protected]>
> wrote:
> > Thanks for the updated patch. LGTM.
> >
> > Regarding the backpatch, I'd personally appreciate it if the walsender.c
> > changes were backpatched to stable branches. As you noted, they don't
> > fully solve the reported issue, but they do help reduce the cases where
> > lag columns in pg_stat_replication unexpectedly become NULL.
> >
> > Even a partial mitigation in the back branches would be valuable for
> > users running stable releases.
>
> +1
>
> I've split the changes into two patches.
>
> Patch 0001 fixes premature NULL lag reporting in walsender. I will commit
> it and backpatch it to all supported branches.
>
> Patch 0002 avoids sending duplicate WAL locations in standby status
> replies. I will commit this to master only.
>

Thanks, LGTM.

Best regards,
Shinya Kato


* Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
@ 2026-03-26 11:56  Fujii Masao <[email protected]>
  parent: Shinya Kato <[email protected]>
  0 siblings, 0 replies; 21+ messages in thread

From: Fujii Masao @ 2026-03-26 11:56 UTC (permalink / raw)
  To: Shinya Kato <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

On Thu, Mar 26, 2026 at 2:40 PM Shinya Kato <[email protected]> wrote:
> Thanks, LGTM.

I've pushed the patches. Thanks!

Regards,

-- 
Fujii Masao





Thread overview: 21+ messages
2026-02-24 06:53 pg_stat_replication.*_lag sometimes shows NULL during active replication Shinya Kato <[email protected]>
2026-03-02 14:44 ` Fujii Masao <[email protected]>
2026-03-06 07:12   ` Shinya Kato <[email protected]>
2026-03-09 11:21     ` Fujii Masao <[email protected]>
2026-03-10 01:01       ` Shinya Kato <[email protected]>
2026-03-10 01:54         ` Fujii Masao <[email protected]>
2026-03-11 02:38           ` Shinya Kato <[email protected]>
2026-03-12 15:27             ` Fujii Masao <[email protected]>
2026-03-13 02:15               ` Chao Li <[email protected]>
2026-03-16 00:25               ` Shinya Kato <[email protected]>
2026-03-17 02:00                 ` Fujii Masao <[email protected]>
2026-03-19 13:58                   ` Shinya Kato <[email protected]>
2026-03-19 17:13                     ` Fujii Masao <[email protected]>
2026-03-21 02:04                       ` Shinya Kato <[email protected]>
2026-03-23 15:31                         ` Fujii Masao <[email protected]>
2026-03-24 05:32                           ` Chao Li <[email protected]>
2026-03-25 15:27                             ` Fujii Masao <[email protected]>
2026-03-25 07:02                           ` Shinya Kato <[email protected]>
2026-03-25 15:30                             ` Fujii Masao <[email protected]>
2026-03-26 05:40                               ` Shinya Kato <[email protected]>
2026-03-26 11:56                                 ` Fujii Masao <[email protected]>
