public inbox for [email protected]
help / color / mirror / Atom feedFrom: Fujii Masao <[email protected]>
To: Shinya Kato <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
Date: Thu, 26 Mar 2026 00:30:42 +0900
Message-ID: <CAHGQGwHLK5ObQShRx15OwBK4whO7MQ0VXveBtxDsMiibGUGK_g@mail.gmail.com> (raw)
In-Reply-To: <CAOzEurRHuBYBt4BXyFnD8DL9dDgbAexqSC9HOT2=6swiyLS1Kw@mail.gmail.com>
References: <CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com>
<CAHGQGwE=kyQ+YnGPn8zpZ959+3ywg8OR_Nu__uXxxuE0E+Y_Zg@mail.gmail.com>
<CAOzEurRGiGE2Dfe+ySpb=+93ku=7ZC6RgAbHtLC6Xsq3g2XexA@mail.gmail.com>
<CAHGQGwH2h_R7FWPvEs3+NWLwHZoj9r96tUyRKi5haqxMc6FXiQ@mail.gmail.com>
<CAOzEurQiP3uebd1GMiC1Dzf5VJwF4ZBEpJ6QYQFE6Y+rVjxqNA@mail.gmail.com>
<CAHGQGwEmMBBAE0RG-R3_LacfT4fbB55qGE6n9O5mNwrqvbNBtw@mail.gmail.com>
<CAOzEurRF+OWcMZpfE=NV_Wcm6CFFGOnuxC6L9WWCjOMN0_eZMQ@mail.gmail.com>
<CAHGQGwGPg20Rw2PydBDXiKHgn55s--E14-qRy7t3M+DHreCJww@mail.gmail.com>
<CAOzEurSDzFsRXjofhq7mbNgoL8HaVbNeEhWBm7m9_K2ZNQnaBw@mail.gmail.com>
<CAHGQGwFzg9J_z71kwnjXOxaoO3upMW8RdT4ieE8MBLq6h-ojZg@mail.gmail.com>
<CAOzEurRV117N2neo1N_O+xWPv=-R7qou+i7k-h79JjTP9sO1Fg@mail.gmail.com>
<CAHGQGwGLUXmjC1+A1fzg-ynP1pdKC-0yfmLYcnnu4YJSEDnuQw@mail.gmail.com>
<CAOzEurSAvOf-vxzn9yQaCipgD6bSr4ncoTYjFceAPDO5fzMNNw@mail.gmail.com>
<CAHGQGwGRLOYj95OfQ5T+mJYmyQ8YrvQ1SNOkiviV+5aWT1iCoA@mail.gmail.com>
<CAOzEurRHuBYBt4BXyFnD8DL9dDgbAexqSC9HOT2=6swiyLS1Kw@mail.gmail.com>
On Wed, Mar 25, 2026 at 4:03 PM Shinya Kato <[email protected]> wrote:
> Thanks for the updated patch. LGTM.
>
> Regarding the backpatch, I'd personally appreciate it if the walsender.c changes were backpatched to stable branches. As you noted, it don't fully solve the reported issue, but they do help reduce the cases where lag columns in pg_stat_replication unexpectedly become NULL.
>
> Even a partial mitigation in the back branches would be valuable for users running stable releases.
+1
I've split the changes into two patches.
Patch 0001 fixes premature NULL lag reporting in walsender. I will commit it
and backpatch it to all supported branches.
Patch 0002 avoids sending duplicate WAL locations in standby status replies.
I will commit this to master only.
Regards,
--
Fujii Masao
Attachments:
[application/octet-stream] v7-0002-Avoid-sending-duplicate-WAL-locations-in-standby-.patch (10.3K, 2-v7-0002-Avoid-sending-duplicate-WAL-locations-in-standby-.patch)
download | inline diff:
From c48f7e1a7bb7c9aab9935ef5b4c5263b904201ae Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Wed, 25 Mar 2026 23:53:48 +0900
Subject: [PATCH v7 2/2] Avoid sending duplicate WAL locations in standby
status replies
Previously, when the startup process applied WAL and requested walreceiver
to send an apply notification to the primary, walreceiver sent a status reply
unconditionally, even if the WAL locations had not advanced since
the previous update.
As a result, the standby could send two consecutive status reply messages
with identical WAL locations even though wal_receiver_status_interval had
not yet elapsed. This could unexpectedly reset the reported replication lag,
making it difficult for users to monitor lag. The second message was also
unnecessary because it reported no progress.
This commit updates walreceiver to send a reply only when the apply location
has advanced since the last status update, even when the startup process
requests a notification.
Author: Fujii Masao <[email protected]>
Reviewed-by: Shinya Kato <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com
---
src/backend/access/transam/xlogrecovery.c | 4 +-
src/backend/replication/walreceiver.c | 74 ++++++++++++++---------
src/include/replication/walreceiver.h | 6 +-
3 files changed, 49 insertions(+), 35 deletions(-)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6d2c4a86b96..fd1c36d061d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -2015,7 +2015,7 @@ ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, TimeLineID *repl
if (doRequestWalReceiverReply)
{
doRequestWalReceiverReply = false;
- WalRcvForceReply();
+ WalRcvRequestApplyReply();
}
/* Allow read-only connections if we're consistent now */
@@ -3970,7 +3970,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
*/
if (!streaming_reply_sent)
{
- WalRcvForceReply();
+ WalRcvRequestApplyReply();
streaming_reply_sent = true;
}
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index fabe3c73034..a437273cf9a 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -143,7 +143,7 @@ static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr,
TimeLineID tli);
static void XLogWalRcvFlush(bool dying, TimeLineID tli);
static void XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli);
-static void XLogWalRcvSendReply(bool force, bool requestReply);
+static void XLogWalRcvSendReply(bool force, bool requestReply, bool checkApply);
static void XLogWalRcvSendHSFeedback(bool immed);
static void ProcessWalSndrMessage(XLogRecPtr walEnd, TimestampTz sendTime);
static void WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now);
@@ -417,7 +417,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
WalRcvComputeNextWakeup(i, now);
/* Send initial reply/feedback messages. */
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(true, false, false);
XLogWalRcvSendHSFeedback(true);
/* Loop until end-of-streaming or error */
@@ -493,7 +493,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
}
/* Let the primary know that we received some data. */
- XLogWalRcvSendReply(false, false);
+ XLogWalRcvSendReply(false, false, false);
/*
* If we've written some records, flush them to disk and
@@ -539,7 +539,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
ResetLatch(MyLatch);
CHECK_FOR_INTERRUPTS();
- if (walrcv->force_reply)
+ if (walrcv->apply_reply_requested)
{
/*
* The recovery process has asked us to send apply
@@ -547,9 +547,9 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
* false in shared memory before sending the reply, so
* we don't miss a new request for a reply.
*/
- walrcv->force_reply = false;
+ walrcv->apply_reply_requested = false;
pg_memory_barrier();
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(false, false, true);
}
}
if (rc & WL_TIMEOUT)
@@ -595,7 +595,7 @@ WalReceiverMain(const void *startup_data, size_t startup_data_len)
wakeup[WALRCV_WAKEUP_PING] = TIMESTAMP_INFINITY;
}
- XLogWalRcvSendReply(requestReply, requestReply);
+ XLogWalRcvSendReply(requestReply, requestReply, false);
XLogWalRcvSendHSFeedback(false);
}
}
@@ -886,7 +886,7 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len, TimeLineID tli)
/* If the primary requested a reply, send one immediately */
if (replyRequested)
- XLogWalRcvSendReply(true, false);
+ XLogWalRcvSendReply(true, false, false);
break;
}
default:
@@ -1053,7 +1053,7 @@ XLogWalRcvFlush(bool dying, TimeLineID tli)
/* Also let the primary know that we made some progress */
if (!dying)
{
- XLogWalRcvSendReply(false, false);
+ XLogWalRcvSendReply(false, false, false);
XLogWalRcvSendHSFeedback(false);
}
}
@@ -1107,24 +1107,35 @@ XLogWalRcvClose(XLogRecPtr recptr, TimeLineID tli)
}
/*
- * Send reply message to primary, indicating our current WAL locations, oldest
- * xmin and the current time.
+ * Send reply message to primary, indicating our current WAL locations and
+ * time.
*
- * If 'force' is not set, the message is only sent if enough time has
- * passed since last status update to reach wal_receiver_status_interval.
- * If wal_receiver_status_interval is disabled altogether and 'force' is
- * false, this is a no-op.
+ * The message is sent if 'force' is set, if enough time has passed since the
+ * last update to reach wal_receiver_status_interval, or if WAL locations have
+ * advanced since the previous status update. If wal_receiver_status_interval
+ * is disabled and 'force' is false, this function does nothing. Set 'force' to
+ * send the message unconditionally.
+ *
+ * Whether WAL locations are considered "advanced" depends on 'checkApply'.
+ * If 'checkApply' is false, only the write and flush locations are checked.
+ * This should be used when the call is triggered by write/flush activity
+ * (e.g., after walreceiver writes or flushes WAL), and avoids the
+ * apply-location check, which requires a spinlock. If 'checkApply' is true,
+ * the apply location is also considered. This should be used when the apply
+ * location is expected to advance (e.g., when the startup process requests
+ * an apply notification).
*
* If 'requestReply' is true, requests the server to reply immediately upon
* receiving this message. This is used for heartbeats, when approaching
* wal_receiver_timeout.
*/
static void
-XLogWalRcvSendReply(bool force, bool requestReply)
+XLogWalRcvSendReply(bool force, bool requestReply, bool checkApply)
{
static XLogRecPtr writePtr = InvalidXLogRecPtr;
static XLogRecPtr flushPtr = InvalidXLogRecPtr;
- XLogRecPtr applyPtr;
+ static XLogRecPtr applyPtr = InvalidXLogRecPtr;
+ XLogRecPtr latestApplyPtr = InvalidXLogRecPtr;
TimestampTz now;
/*
@@ -1140,17 +1151,19 @@ XLogWalRcvSendReply(bool force, bool requestReply)
/*
* We can compare the write and flush positions to the last message we
* sent without taking any lock, but the apply position requires a spin
- * lock, so we don't check that unless something else has changed or 10
- * seconds have passed. This means that the apply WAL location will
- * appear, from the primary's point of view, to lag slightly, but since
- * this is only for reporting purposes and only on idle systems, that's
- * probably OK.
+ * lock, so we don't check that unless it is expected to advance since the
+ * previous update, i.e., when 'checkApply' is true.
*/
- if (!force
- && writePtr == LogstreamResult.Write
- && flushPtr == LogstreamResult.Flush
- && now < wakeup[WALRCV_WAKEUP_REPLY])
- return;
+ if (!force && now < wakeup[WALRCV_WAKEUP_REPLY])
+ {
+ if (checkApply)
+ latestApplyPtr = GetXLogReplayRecPtr(NULL);
+
+ if (writePtr == LogstreamResult.Write
+ && flushPtr == LogstreamResult.Flush
+ && (!checkApply || applyPtr == latestApplyPtr))
+ return;
+ }
/* Make sure we wake up when it's time to send another reply. */
WalRcvComputeNextWakeup(WALRCV_WAKEUP_REPLY, now);
@@ -1158,7 +1171,8 @@ XLogWalRcvSendReply(bool force, bool requestReply)
/* Construct a new message */
writePtr = LogstreamResult.Write;
flushPtr = LogstreamResult.Flush;
- applyPtr = GetXLogReplayRecPtr(NULL);
+ applyPtr = (latestApplyPtr == InvalidXLogRecPtr) ?
+ GetXLogReplayRecPtr(NULL) : latestApplyPtr;
resetStringInfo(&reply_message);
pq_sendbyte(&reply_message, PqReplMsg_StandbyStatusUpdate);
@@ -1378,11 +1392,11 @@ WalRcvComputeNextWakeup(WalRcvWakeupReason reason, TimestampTz now)
* synchronous_commit = remote_apply.
*/
void
-WalRcvForceReply(void)
+WalRcvRequestApplyReply(void)
{
ProcNumber procno;
- WalRcv->force_reply = true;
+ WalRcv->apply_reply_requested = true;
/* fetching the proc number is probably atomic, but don't rely on it */
SpinLockAcquire(&WalRcv->mutex);
procno = WalRcv->procno;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 9b9bd916314..85d24c87298 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -156,11 +156,11 @@ typedef struct
pg_atomic_uint64 writtenUpto;
/*
- * force walreceiver reply? This doesn't need to be locked; memory
+ * request walreceiver reply? This doesn't need to be locked; memory
* barriers for ordering are sufficient. But we do need atomic fetch and
* store semantics, so use sig_atomic_t.
*/
- sig_atomic_t force_reply; /* used as a bool */
+ sig_atomic_t apply_reply_requested; /* used as a bool */
} WalRcvData;
extern PGDLLIMPORT WalRcvData *WalRcv;
@@ -488,7 +488,7 @@ walrcv_clear_result(WalRcvExecResult *walres)
/* prototypes for functions in walreceiver.c */
pg_noreturn extern void WalReceiverMain(const void *startup_data, size_t startup_data_len);
-extern void WalRcvForceReply(void);
+extern void WalRcvRequestApplyReply(void);
/* prototypes for functions in walreceiverfuncs.c */
extern Size WalRcvShmemSize(void);
--
2.51.2
[application/octet-stream] v7-0001-Fix-premature-NULL-lag-reporting-in-pg_stat_repli.patch (4.3K, 3-v7-0001-Fix-premature-NULL-lag-reporting-in-pg_stat_repli.patch)
download | inline diff:
From a7a4bfb8d58a15f1cae109e403e22e169898e59d Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Wed, 25 Mar 2026 22:09:40 +0900
Subject: [PATCH v7 1/2] Fix premature NULL lag reporting in
pg_stat_replication
pg_stat_replication is documented to keep the last measured lag values for
a short time after the standby catches up, and then set them to NULL when
there is no WAL activity. However, previously lag values could become NULL
prematurely even while WAL activity was ongoing, especially in logical
replication.
This happened because the code cleared lag when two consecutive reply messages
indicated that the apply location had caught up with the send location.
It did not verify that the reported positions were unchanged, so lag could be
cleared even when positions had advanced between messages. In logical
replication, where the apply location often quickly catches up, this issue was
more likely to occur.
This commit fixes the issue by clearing lag only when the standby reports that
it has fully replayed WAL (i.e., both flush and apply locations have caught up
with the send location) and the write/flush/apply positions remain unchanged
across two consecutive reply messages.
The second message with unchanged positions typically results from
wal_receiver_status_interval, so lag values are cleared after that interval
when there is no activity. This avoids showing stale lag data while preventing
premature NULL values.
Even with this fix, lag may rarely become NULL during activity if identical
position reports are sent repeatedly. Eliminating such duplicate messages
would address this fully, but that change is considered too invasive for stable
branches and will be handled in master only later.
Backpatch to all supported branches.
Author: Shinya Kato <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Discussion: https://postgr.es/m/CAOzEurTzcUrEzrH97DD7+Yz=HGPU81kzWQonKZvqBwYhx2G9_A@mail.gmail.com
Backpatch-through: 14
---
src/backend/replication/walsender.c | 35 ++++++++++++++++-------------
1 file changed, 19 insertions(+), 16 deletions(-)
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 08253103cb3..66507e9c2dd 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2472,7 +2472,9 @@ ProcessStandbyReplyMessage(void)
TimestampTz now;
TimestampTz replyTime;
- static bool fullyAppliedLastTime = false;
+ static XLogRecPtr prevWritePtr = InvalidXLogRecPtr;
+ static XLogRecPtr prevFlushPtr = InvalidXLogRecPtr;
+ static XLogRecPtr prevApplyPtr = InvalidXLogRecPtr;
/* the caller already consumed the msgtype byte */
writePtr = pq_getmsgint64(&reply_message);
@@ -2505,22 +2507,23 @@ ProcessStandbyReplyMessage(void)
applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
/*
- * If the standby reports that it has fully replayed the WAL in two
- * consecutive reply messages, then the second such message must result
- * from wal_receiver_status_interval expiring on the standby. This is a
- * convenient time to forget the lag times measured when it last
- * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
- * until more WAL traffic arrives.
+ * If the standby reports that it has fully replayed the WAL, and the
+ * write/flush/apply positions remain unchanged across two consecutive
+ * reply messages, forget the lag times measured when it last
+ * wrote/flushed/applied a WAL record.
+ *
+ * The second message with unchanged positions typically results from
+ * wal_receiver_status_interval expiring on the standby, so lag values are
+ * usually cleared after that interval when there is no activity. This
+ * avoids displaying stale lag data until more WAL traffic arrives.
*/
- clearLagTimes = false;
- if (applyPtr == sentPtr)
- {
- if (fullyAppliedLastTime)
- clearLagTimes = true;
- fullyAppliedLastTime = true;
- }
- else
- fullyAppliedLastTime = false;
+ clearLagTimes = (applyPtr == sentPtr && flushPtr == sentPtr &&
+ writePtr == prevWritePtr && flushPtr == prevFlushPtr &&
+ applyPtr == prevApplyPtr);
+
+ prevWritePtr = writePtr;
+ prevFlushPtr = flushPtr;
+ prevApplyPtr = applyPtr;
/* Send a reply if the standby requested one. */
if (replyRequested)
--
2.51.2
view thread (21+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected]
Subject: Re: pg_stat_replication.*_lag sometimes shows NULL during active replication
In-Reply-To: <CAHGQGwHLK5ObQShRx15OwBK4whO7MQ0VXveBtxDsMiibGUGK_g@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox