Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wI7Ao-007nol-1D for pgsql-hackers@arkaria.postgresql.org; Wed, 29 Apr 2026 15:50:03 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wI7Am-004Btt-1o for pgsql-hackers@arkaria.postgresql.org; Wed, 29 Apr 2026 15:50:00 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wI7Al-004Btc-1a for pgsql-hackers@lists.postgresql.org; Wed, 29 Apr 2026 15:50:00 +0000 Received: from fout-a4-smtp.messagingengine.com ([103.168.172.147]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1wI7Ai-00000003pC1-1Y8m for pgsql-hackers@postgresql.org; Wed, 29 Apr 2026 15:49:59 +0000 Received: from phl-compute-07.internal (phl-compute-07.internal [10.202.2.47]) by mailfout.phl.internal (Postfix) with ESMTP id 0FC2DEC0101; Wed, 29 Apr 2026 11:49:54 -0400 (EDT) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-07.internal (MEProxy); Wed, 29 Apr 2026 11:49:54 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:content-type:content-type:date:date:from:from:in-reply-to :message-id:mime-version:reply-to:subject:subject:to:to; s=fm2; t=1777477794; x=1777564194; bh=5SRHMajEzQW46rUdRWGhBAZ0apY+f1IF ujzNmo0NW5w=; b=uczZD4nTWGCtDgxfWQ7+8L7v0lJAxbQnQVAvNlOR8NbKB2rN T33lrxLmUr4uD+6jRcfdPeJoLmbSvaJ+o18ArOWrfuHDX/izhqU0m8LvPwDRhgjo x51heBRs7Ua2kyRoG43BWBbgWPy0J+NqmeH2E88r+2dR6M2b7OlxCKtXh4CCnItW 7HrzqwxrLe5x18W/LuBO4lpivCeIFehcEZFnwogxh2aRgNUB5Kj7DIXo+pQW8dT4 ZKLelXZmscWBb3QObD6odFPBe3WsCaLzfXBHJxw3ihzQmB7M5G3pRLm1Ko+e/pjq WbJJYP36hdl5wE9T4xnN1ez70PcgQW3hd+fvkA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:message-id :mime-version:reply-to:subject:subject:to:to:x-me-proxy :x-me-sender:x-me-sender:x-sasl-enc; s=fm2; t=1777477794; x= 1777564194; bh=5SRHMajEzQW46rUdRWGhBAZ0apY+f1IFujzNmo0NW5w=; b=g Ov12+usKkOa/LXzDOWPfMQ0RoRUdRFt9u3LVhYW3ySOVA6zaBNcmLo8H4+ji/fxP A8dOfFLxSqAlwfe8Fj/gy+JHsfX8zQaXAk4WPq6hatmONO3tlOe5MjgbUccqaIMw RHRTG/GG81Elb2nihIAQeE9Wgba9xYkQPjMyHxYLOO1O4X7PlB4jMbBqLrsBxXn1 VFvpEaanqAL6ZqieM3tNpXGrKfEhzaKJWotEucNpudEMg9s3hyJcqte422H9uv3E XxcZTjrf9uv9CR+wrbh9J5CpcWojdb9mwOmfpTs6BmbYS6Zz99/3JI0Mktf79mxz iX2Bq9O2ew2h04/iMluLQ== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefhedrtddtgdekgeekgecutefuodetggdotefrod ftvfcurfhrohhfihhlvgemucfhrghsthforghilhdpuffrtefokffrpgfnqfghnecuuegr ihhlohhuthemuceftddtnecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjug hrpeffhffvuffkgggtugesthdtsfdttddtvdenucfhrhhomheptehnughrvghsucfhrhgv uhhnugcuoegrnhgurhgvshesrghnrghrrgiivghlrdguvgeqnecuggftrfgrthhtvghrnh ephedtffeihfefudelffduffekfeelleeffeejueeiiedtvedutdffjedukeetieejnecu vehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomheprghnughrvg hssegrnhgrrhgriigvlhdruggvpdhnsggprhgtphhtthhopeegpdhmohguvgepshhmthhp ohhuthdprhgtphhtthhopehmvghlrghnihgvphhlrghgvghmrghnsehgmhgrihhlrdgtoh hmpdhrtghpthhtohepthhhohhmrghsrdhmuhhnrhhosehgmhgrihhlrdgtohhmpdhrtghp thhtohepphhgshhqlhdqhhgrtghkvghrshesphhoshhtghhrvghsqhhlrdhorhhgpdhrtg hpthhtohepthhglhesshhsshdrphhghhdrphgrrdhush X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Wed, 29 Apr 2026 11:49:53 -0400 (EDT) Date: Wed, 29 Apr 2026 11:49:53 -0400 From: Andres Freund To: pgsql-hackers@postgresql.org, Tom Lane , Melanie Plageman , Thomas Munro Subject: Bogus recovery conflicts due to outdated XLogReceiptTime Message-ID: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, While testing something independent, I had a replication setup with pgbench r/w running on the primary and pgbench r/o on the standby, with a replication slot setup and hot_standby_feedback=on. When I saw recovery conflicts, I got worried that we screwed up something in the recent changes around pruning horizons. But that wasn't it. All the conflicts I was seeing were due to buffer pin conflicts. As the replica was keeping up perfectly, with replay lag well below max_standby_streaming_delay, that should never have triggered conflicts. Investigating this was not helped by the fact that we don't log *anything* from the startup process when LockBufferForCleanup() -> ResolveRecoveryConflictWithBufferPin() enters the "We're already behind" path to SendRecoveryConflictWithBufferPin(). I added some logging and found that GetStandbyLimitTime() thought we were quite far behind, even though the actual replay lag was < 1s. I think the reason for these conflicts is very old and it's that basically the basis on which we compute GetStandbyLimitTime() is bogus, when streaming: /* * Walreceiver is active, so see if new data has arrived. * * We only advance XLogReceiptTime when we obtain fresh * WAL from walreceiver and observe that we had already * processed everything before the most recent "chunk" * that it flushed to disk. In steady state where we are * keeping up with the incoming data, XLogReceiptTime will * be updated on each cycle. When we are behind, * XLogReceiptTime will not advance, so the grace time * allotted to conflicting queries will decrease. */ if (RecPtr < flushedUpto) havedata = true; else { XLogRecPtr latestChunkStart; flushedUpto = GetWalRcvFlushRecPtr(&latestChunkStart, &receiveTLI); if (RecPtr < flushedUpto && receiveTLI == curFileTLI) { havedata = true; if (latestChunkStart <= RecPtr) { XLogReceiptTime = GetCurrentTimestamp(); SetCurrentChunkStartTime(XLogReceiptTime); } } else havedata = false; I think the logic explained in the comment doesn't make much sense (but it's not trivial to come up with something good, see below). If the standby is keeping up, the primary is busy and walsender & walreceiver are keeping up, it's entirely normal that the walreceiver would have received one or more additional "chunks" from the primary. If the primary is committing frequently and the standby is keeping up receiving the data, walsender will send out data in small chunks. Thereby making it very likely that the latest chunk is newer than what we just processed. I've seen now-rtime in GetStandbyLimitTime() be many minutes, even though the actual replay lag was always under 2s, and most of the time < 0.5s. Tracing through the history (a lot of moving around), this seems to have originated in: commit e76c1a0f4d2127f11c72c02b3d73a5dcb4517173 Author: Tom Lane Date: 2010-07-03 20:43:58 +0000 Replace max_standby_delay with two parameters, max_standby_archive_delay and max_standby_streaming_delay, and revise the implementation to avoid assuming that timestamps found in WAL records can meaningfully be compared to clock time on the standby server. Instead, the delay limits are compared to the elapsed time since we last obtained a new WAL segment from archive or since we were last "caught up" to WAL data arriving via streaming replication. This avoids problems with clock skew between primary and standby, as well as other corner cases that the original coding would misbehave in, such as the primary server having significant idle time between transactions. Per my complaint some time ago and considerable ensuing discussion. Do some desultory editing on the hot standby documentation, too. I think I've observed this problem in production setups, i.e. recovery conflicts being triggered despite everything seemingly keeping up, with no "source" of recovery conflicts apparent in the log. I just never had enough data to figure out what's going on. It really doesn't help that XLogReceiptTime is basically unobservable in stock postgres. Recovery doesn't log the fact that it thinks we are behind (or how far), it doesn't even log that it triggers a recovery conflicts in that case (the cancellation of is logged though). Now, how to fix that: It doesn't seem trivial. As the commit message cited above explains, we don't want to rely on the timestamps from the primary, so we can't just use XLogRecoveryCtl->recoveryLastXTime or such. And we can't just always update XLogReceiptTime in WaitForWALToBecomeAvailable() (or even more so in walreceiver whenever we receive data), because that'd lead to XLogReceiptTime being advanced way too aggressively. I wonder if a decent minimal improvement would be to update it whenever we had to update the XLogReceiptTime whenever we actually had to refresh flushedUpto from the walreceiver and the new position is >= RecPtr. While that doesn't entirely avoid updating XLogReceiptTime too aggressively, I think it may bound it sufficiently, because if we are behind, we don't need to call GetWalRcvFlushRecPtr() very often - we only need to do so again, after we replayed all the WAL that was available the last time. It be a lot more accurate than what we do today, because it'd avoid the "latestChunkStart <= RecPtr" condition which is often unreachable today. I guess we could also use recoveryLastXTime as an additional lower bound for the apply delay? That way it'd "help" avoid a too stale XLogReceiptTime if the clocks are in sync, but not hurt if the clocks are out of sync? A better solution would probably be a proper "LSN range -> received time" mapping. In walsender we have LagTracker, which keeps a ringbuffer of LSN to time mappings and computes lags for write/flush/replay. Perhaps we could generalize that? There's an easy way to reproduce this issue: Configure the standby to use recovery_apply_delay=1s and have workload on the primary that commits more than once a second. If you then configure walreceiver to log debug2 messages (or change the log level in ProcessWalSndrMessage(), which makes the volume easier to deal with), you can see that the "replication apply delay" in those messages creep up very rapidly (as we never hit the "latest chunk" logic anymore). Greetings, Andres Freund