Re: Implement waiting for wal lsn replay: reloaded

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Xuneng Zhou <[email protected]>
To: Andres Freund <[email protected]>
Cc: Thomas Munro <[email protected]>
Cc: Alexander Korotkov <[email protected]>
Cc: Álvaro Herrera <[email protected]>
Cc: Chao Li <[email protected]>
Cc: pgsql-hackers <[email protected]>
Cc: Michael Paquier <[email protected]>
Cc: jian he <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Yura Sokolov <[email protected]>
Subject: Re: Implement waiting for wal lsn replay: reloaded
Date: Wed, 7 Jan 2026 12:08:18 +0800
Message-ID: <CABPTF7WN_3kDPBYPxaKKcp2kO5BLB5bK_YGz70VTzTCivHibZA@mail.gmail.com> (raw)
In-Reply-To: <iuzfab2lr54iarbt5ijxtkdbrkv5qnioykluu7pg53hn7gv2te@hz2bkg4aw7bi>
References: <CABPTF7Xs-64GQNjmbimZNhj2YSKbBny+evz6=cp3X2fkJS+vMQ@mail.gmail.com>
	<[email protected]>
	<CAPpHfduDVNo4VXgnQFZUg9=2yHQJfUUqjokbi3qVxuJiKNfcwg@mail.gmail.com>
	<CABPTF7UkwQZGx5ub731Q+=+rU8yx4ruqMdDt__L_dm9_32LsMw@mail.gmail.com>
	<CAPpHfds-KiZRuCruc0jHxLSxLqzKcHJGwOFFA0b_RgaJvtUOEQ@mail.gmail.com>
	<CABPTF7Vy4A0S0W7=B-fVd9Bo_15XG_FRYfKUF2CB_i=5svwEZQ@mail.gmail.com>
	<CA+hUKG+L0OQR8dQtsNBSaA3FNNyQeFabdeRVzurm0b7Xtp592g@mail.gmail.com>
	<iuzfab2lr54iarbt5ijxtkdbrkv5qnioykluu7pg53hn7gv2te@hz2bkg4aw7bi>

Hi,

On Wed, Jan 7, 2026 at 8:32 AM Andres Freund <[email protected]> wrote:
>
> Hi,
>
> On 2026-01-06 18:42:59 +1300, Thomas Munro wrote:
> > Could this be causing the recent flapping failures on CI/macOS in
> > recovery/031_recovery_conflict?  I didn't have time to dig personally
> > but f30848cb looks relevant:
> >
> > Waiting for replication conn standby's replay_lsn to pass 0/03467F58 on primary
> > error running SQL: 'psql:<stdin>:1: ERROR:  canceling statement due to
> > conflict with recovery
> > DETAIL:  User was or might have been using tablespace that must be dropped.'
> > while running 'psql --no-psqlrc --no-align --tuples-only --quiet
> > --dbname port=25195
> > host=/var/folders/g9/7rkt8rt1241bwwhd3_s8ndp40000gn/T/LqcCJnsueI
> > dbname='postgres' --file - --variable ON_ERROR_STOP=1' with sql 'WAIT
> > FOR LSN '0/03467F58' WITH (MODE 'standby_replay', timeout '180s',
> > no_throw);' at /Users/admin/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm
> > line 2300.
> >
> > https://cirrus-ci.com/task/5771274900733952
> >
> > The master branch in time-descending order, macOS tasks only:
> >
> >      task_id      | substring |  status
> > ------------------+-----------+-----------
> >  6460882231754752 | c970bdc0  | FAILED
> >  5771274900733952 | 6ca8506e  | FAILED
> >  6217757068361728 | 63ed3bc7  | FAILED
> >  5980650261446656 | ae283736  | FAILED
> >  6585898394976256 | 5f13999a  | COMPLETED
> >  4527474786172928 | 7f9acc9b  | COMPLETED
> >  4826100842364928 | e8d4e94a  | COMPLETED
> >  4540563027918848 | b9ee5f2d  | FAILED
> >  6358528648019968 | c5af141c  | FAILED
> >  5998005284765696 | e212a0f8  | COMPLETED
> >  6488580526178304 | b85d5dc0  | FAILED
> >  5034091344560128 | 7dc95cc3  | ABORTED
> >  5688692477526016 | bb048e31  | COMPLETED
> >  5481187977723904 | d351063e  | COMPLETED
> >  5101831568752640 | f30848cb  | COMPLETED <-- the change
> >  6395317408497664 | 3f33b63d  | COMPLETED
> >  6741325208354816 | 877ae5db  | COMPLETED
> >  4594007789010944 | de746e0d  | COMPLETED
> >  6497208998035456 | 461b8cc9  | COMPLETED
>
> The failure rates of this are very high - the majority of the CI runs on the
> postgres/postgres repos failed since the change went in. Which then also means
> cfbot has a very high spurious failure rate. I think we need to revert this
> change until the problem has been verified as fixed.

This specific failure can be reproduced with this patch v1.

I guess the potential race condition is: when
wait_for_replay_catchup() runs WAIT FOR LSN on the standby, if a
tablespace conflict fires during that wait, the WAIT FOR LSN session
is killed even though it doesn't use the tablespace.

In my test, the failure won't occur after applying the v2 patch.

-- 
Best,
Xuneng


Attachments:

  [application/octet-stream] v1-0001-reproduce-the-failure-in-031_recovery_conflict.pl.patch (1.7K, 2-v1-0001-reproduce-the-failure-in-031_recovery_conflict.pl.patch)
  download | inline diff:
From ca73929687f9bf7d4aaa258f8e413ff2c3eea6aa Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Wed, 7 Jan 2026 11:39:41 +0800
Subject: [PATCH v1] reproduce the failure in 031_recovery_conflict.pl

---
 src/test/recovery/t/031_recovery_conflict.pl | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/src/test/recovery/t/031_recovery_conflict.pl b/src/test/recovery/t/031_recovery_conflict.pl
index 7a740f69806..39061fcc0a8 100644
--- a/src/test/recovery/t/031_recovery_conflict.pl
+++ b/src/test/recovery/t/031_recovery_conflict.pl
@@ -198,10 +198,25 @@ like($res, qr/^6000$/m,
 	"$sect: cursor with conflicting temp file established");
 
 # Drop the tablespace currently containing spill files for the query on the
-# standby
+# standby.  We pause replay before the DROP, then resume it via a background
+# session.  This forces wait_for_replay_catchup's internal WAIT FOR LSN to be
+# running when the conflict fires, exercising the recovery conflict handling
+# in Cluster.pm.
+$node_standby->safe_psql('postgres', "SELECT pg_wal_replay_pause()");
 $node_primary->safe_psql($test_db, qq[DROP TABLESPACE $tablespace1;]);
 
+# Start a background session that waits 1 second then resumes replay.
+# This triggers the conflict while wait_for_replay_catchup is running.
+my $resume_session = $node_standby->background_psql('postgres');
+$resume_session->query_until(
+	qr/start/, qq[
+	\\echo start
+	SELECT pg_sleep(1);
+	SELECT pg_wal_replay_resume();
+]);
+
 $node_primary->wait_for_replay_catchup($node_standby);
+$resume_session->quit;
 
 check_conflict_log(
 	"User was or might have been using tablespace that must be dropped");
-- 
2.51.0



  [application/octet-stream] v2-0001-Fix-wait_for_catchup-failure-when-standby-session.patch (4.4K, 3-v2-0001-Fix-wait_for_catchup-failure-when-standby-session.patch)
  download | inline diff:
From 1eaf36cbfafb75c91734615529dcc8f0ed7d7999 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Tue, 6 Jan 2026 20:55:43 +0800
Subject: [PATCH v2] Fix wait_for_catchup() failure when standby session is
 killed by recovery conflict

Commit f30848cb optimized wait_for_catchup() to use WAIT FOR LSN on
the standby instead of polling pg_stat_replication on the primary.
However, this introduced a failure mode: the WAIT FOR LSN session
can be killed by recovery conflicts on the standby, causing the
test helper to die unexpectedly.

This manifests as flapping failures in tests like 031_recovery_conflict,
where DROP TABLESPACE on the primary triggers
ResolveRecoveryConflictWithTablespace() on the standby. That function
kills all backends indiscriminately, including the innocent WAIT FOR
LSN session that happens to be connected at that moment.

Fix by wrapping the WAIT FOR LSN call in an eval block and falling
back to the original polling approach when the session is killed by
a recovery conflict. The fallback is selective:

- If WAIT FOR LSN succeeds with 'success': return immediately
- If WAIT FOR LSN returns non-success (timeout, not_in_recovery):
  fail immediately with diagnostics
- If the session is killed by a recovery conflict (error contains
  "conflict with recovery"): fall back to polling on the primary
- For any other error: fail immediately to avoid masking real problems

The polling fallback is immune to standby-side conflicts because it
queries pg_stat_replication on the primary, not the standby.
---
 src/test/perl/PostgreSQL/Test/Cluster.pm | 53 +++++++++++++++++++-----
 1 file changed, 42 insertions(+), 11 deletions(-)

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index a28ea89aa10..08379aeb8fb 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -3401,22 +3401,52 @@ sub wait_for_catchup
 			my $timeout = $PostgreSQL::Test::Utils::timeout_default;
 			my $wait_query =
 			  qq[WAIT FOR LSN '${target_lsn}' WITH (MODE '${wait_mode}', timeout '${timeout}s', no_throw);];
-			my $output = $standby_node->safe_psql('postgres', $wait_query);
-			chomp($output);
 
-			if ($output ne 'success')
+			# Try WAIT FOR LSN. If it succeeds, we're done. If it returns a
+			# non-success status (timeout, not_in_recovery), fail immediately.
+			# If the session is interrupted (e.g., killed by recovery conflict),
+			# fall back to polling on the upstream which is immune to standby-
+			# side conflicts.
+			my $output;
+			local $@;
+			my $wait_succeeded = eval {
+				$output = $standby_node->safe_psql('postgres', $wait_query);
+				chomp($output);
+				1;
+			};
+
+			if ($wait_succeeded && $output eq 'success')
+			{
+				print "done\n";
+				return;
+			}
+
+			# If WAIT FOR LSN executed but returned non-success (e.g., timeout,
+			# not_in_recovery), fail immediately with diagnostic info. Falling
+			# back to polling would just waste time.
+			if ($wait_succeeded)
 			{
-				# Fetch additional detail for debugging purposes
 				my $details = $self->safe_psql('postgres',
 					"SELECT * FROM pg_catalog.pg_stat_replication");
-				diag qq(WAIT FOR LSN failed with status:
-	${output});
-				diag qq(Last pg_stat_replication contents:
-	${details});
-				croak "failed waiting for catchup";
+				diag qq(WAIT FOR LSN returned '$output'
+pg_stat_replication on upstream:
+${details});
+				croak "WAIT FOR LSN '$wait_mode' returned '$output'";
+			}
+
+			# WAIT FOR LSN was interrupted. Only fall back to polling if this
+			# looks like a recovery conflict - the canonical PostgreSQL error
+			# message contains "conflict with recovery". Other errors should
+			# fail immediately rather than being masked by a silent fallback.
+			if ($@ =~ /conflict with recovery/i)
+			{
+				diag qq(WAIT FOR LSN interrupted, falling back to polling:
+$@);
+			}
+			else
+			{
+				croak "WAIT FOR LSN failed: $@";
 			}
-			print "done\n";
-			return;
 		}
 	}
 
@@ -3424,6 +3454,7 @@ sub wait_for_catchup
 	# - 'sent' mode (no corresponding WAIT FOR LSN mode)
 	# - When standby_name is a string (e.g., subscription name)
 	# - When the standby is no longer in recovery (was promoted)
+	# - When WAIT FOR LSN was interrupted (e.g., killed by a recovery conflict)
 	my $query = qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
          FROM pg_catalog.pg_stat_replication
          WHERE application_name IN ('$standby_name', 'walreceiver')];
-- 
2.51.0

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Implement waiting for wal lsn replay: reloaded
  In-Reply-To: <CABPTF7WN_3kDPBYPxaKKcp2kO5BLB5bK_YGz70VTzTCivHibZA@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox