From 67eb950123b1bab1f1c3db5ba0f88ce1737b6574 Mon Sep 17 00:00:00 2001
From: Shinya Kato
Date: Tue, 24 Feb 2026 15:45:04 +0900
Subject: [PATCH v1] Fix pg_stat_replication.*_lag showing NULL during active
 replication

When the startup process replays WAL quickly, the walreceiver's flush
notification and the subsequent force_reply message can both report
applyPtr == sentPtr in quick succession. The clearLagTimes logic assumed
that two consecutive fully-applied messages meant the
wal_receiver_status_interval had expired, but this assumption is violated
when the second message comes from WalRcvForceReply(). In that case, the
LagTracker samples were already consumed by the first message, so all lag
values are -1; with clearLagTimes = true, these -1 values were written to
walsnd->*Lag, causing pg_stat_replication to show NULL.

Fix this by also requiring that all lag values are -1 (no new samples) in
the clearLagTimes condition. This ensures clearLagTimes only triggers when
the system is genuinely idle across two consecutive messages, not when
samples were consumed by a preceding message in a burst of replies.

Author: Shinya Kato
Reviewed-by:
Discussion: https://postgr.es/m/
---
 src/backend/replication/walsender.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 2cde8ebc729..5c7bd0a13ad 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2493,15 +2493,25 @@ ProcessStandbyReplyMessage(void)
 	applyLag = LagTrackerRead(SYNC_REP_WAIT_APPLY, applyPtr, now);
 
 	/*
-	 * If the standby reports that it has fully replayed the WAL in two
-	 * consecutive reply messages, then the second such message must result
-	 * from wal_receiver_status_interval expiring on the standby. This is a
-	 * convenient time to forget the lag times measured when it last
-	 * wrote/flushed/applied a WAL record, to avoid displaying stale lag data
-	 * until more WAL traffic arrives.
+	 * If the standby reports that it has fully replayed the WAL and there are
+	 * no new lag samples in two consecutive reply messages, then those
+	 * messages must result from wal_receiver_status_interval expiring on the
+	 * standby. This is a convenient time to forget the lag times measured
+	 * when it last wrote/flushed/applied a WAL record, to avoid displaying
+	 * stale lag data until more WAL traffic arrives.
+	 *
+	 * We also require that no new lag samples are available (all lag values
+	 * are -1) in both messages to avoid a race condition: when the walreceiver
+	 * sends a flush notification followed immediately by a force_reply (to
+	 * report apply progress), both messages can have applyPtr == sentPtr if
+	 * the startup process replayed the WAL quickly. In that case, the lag
+	 * tracker samples are consumed by the first message, causing the second
+	 * to see all lags as -1. Without the lag check, clearLagTimes would
+	 * incorrectly trigger and overwrite valid lag values with -1 (NULL).
 	 */
 	clearLagTimes = false;
-	if (applyPtr == sentPtr)
+	if (applyPtr == sentPtr &&
+		writeLag == -1 && flushLag == -1 && applyLag == -1)
 	{
 		if (fullyAppliedLastTime)
 			clearLagTimes = true;
-- 
2.47.3