public inbox for [email protected]  
help / color / mirror / Atom feed
From: Xuneng Zhou <[email protected]>
To: Hayato Kuroda (Fujitsu) <[email protected]>
Cc: Bertrand Drouvot <[email protected]>
Cc: Alexander Lakhin <[email protected]>
Cc: pgsql-hackers <[email protected]>
Subject: Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline
Date: Mon, 8 Jun 2026 21:00:17 +0800
Message-ID: <CABPTF7U_B9pdC563fFONQLX_FXCZtxZgM+jN9snzMVg9b9MRfg@mail.gmail.com> (raw)
In-Reply-To: <CABPTF7WmLKOJcSPod7zT2hynUFQcUs0VoQyR-p=XFsafvmGm7g@mail.gmail.com>
References: <[email protected]>
	<aiP/X1FThlZRCFiN@bdtpg>
	<OS9PR01MB1214908BA67A7811BD6281208F51C2@OS9PR01MB12149.jpnprd01.prod.outlook.com>
	<CABPTF7WmLKOJcSPod7zT2hynUFQcUs0VoQyR-p=XFsafvmGm7g@mail.gmail.com>

On Mon, Jun 8, 2026 at 4:31 PM Xuneng Zhou <[email protected]> wrote:
>
> Hi Kuroda-san,
>
> On Mon, Jun 8, 2026 at 12:25 PM Hayato Kuroda (Fujitsu)
> <[email protected]> wrote:
> >
> > Hi Alexander, Bertrand, Xuneng,
> >
> > Thanks for seeing the failure. Our team also recognized but could not find the reason.
> >
> > > Yeah, it looks like there is a race condition here. I think we should check if
> > > the insertion timeline has already been set (like the walsummarizer is doing).
> >
> > Sorry for stupid question; I tried to reproduce the failure but could not, see attached.
> >
> > IIUC, the issue can happen if the walsender must read the WAL record generated
> > after the promotion but the timeline could not be updated.
>
> I think the race is that the logical walsender is trying to read the
> pre-promotion wal from the slot's restart_lsn.
>
> 1) In 035_standby_logical_decoding.pl, the active slot /
> pg_recvlogical process is started before:
> INSERT rows 1..4   # before promotion
> promote standby
> INSERT rows 5..7   # after promotion
>
> 2) Alexander’s reproducer sleeps immediately after:
> ReplicationSlotAcquire(cmd->slotname, true, true);
> pg_usleep(200000);
>
> So with the delay:
> slot is active but walsender has not yet create decoding context and
> called xlogbeginread();
>
> 3) Then the test continues and promotes the standby while the
> walsender still sleeping, therefore promotion could happen before the
> walsender starts its actual wal read.
>
> 4)  The logical walsender is trying to read the pre-promotion wal from
> the slot's restart_lsn. The failing log implies this:
> Streaming transactions committing after 0/06487A20,
> reading WAL from 0/064879E8.
>
> This line is emitted from:
> errdetail("Streaming transactions committing after %X/%08X, reading
> WAL from %X/%08X.",
>           slot->data.confirmed_flush,
>           slot->data.restart_lsn)
>
> Function StartLogicalReplication() initiates the xlogreader at the restart lsn:
> XLogBeginRead(logical_decoding_ctx->reader,
>               MyReplicationSlot->data.restart_lsn);
>
> Because the slot was created and acquired before promotion, its
> restart_lsn is from the standby-era state which points to wal required
> for decoding the pre-promotion stream. In the test, the first expected
> output is rows 1..4, inserted before promotion.
>
> The failing output was empty:
> got: ''
> expected:
> BEGIN
> rows 1..4
> COMMIT
> BEGIN
> rows 5..7
> COMMIT
>
> Empty output suggests that the walsender died before emitting even the
> first pre-promotion transaction.
>
> 5) The walsender try to read wal but failed
> Failed run:
>  LOG:  !!!WalSndSegmentOpen| nextSegNo: 6,
> path: pg_wal/000000010000000000000006
> ERROR:  requested WAL segment
> 000000010000000000000006 has already been removed
>
> Passed run
> LOG:  !!!WalSndSegmentOpen| nextSegNo: 6,
> path: pg_wal/000000020000000000000006
>
> As the log suggested, the walsender of the successful run readed the
> wal from timeline 2 but failed to do so for timeline1.
> This comment in xlogutils.is almost exactly the issue:
>
> We care about timelines ... reading xlog generated prior to a promotion
> ...
> The server copied the segment to the new timeline ... there's no guarantee
> the old segment will still exist. It may have been deleted or renamed with
> a .partial suffix
>
> So the need for tli1 does not always imply that the wal was generated
> after promotion. It could be bytes from before the switchpoint, but
> they now be available through the new tli segment file.
>
> 6) XLogReadDetermineTimeline() needs the current system-wide TLI.
>
> The current code in does roughly:
>
>          am_cascading_walsender = RecoveryInProgress();
>
>          if (am_cascading_walsender)
>                  GetXLogReplayRecPtr(&currTLI);
>          else
>                  currTLI = GetWALInsertionTimeLine();
>
> During promotion, we set the wal insertion timeline before marking
> recovery as fully done. So there is a window where:
>
> - InsertTimeLineID is already the new timeline, e.g. tli 2.
> - RecoveryInProgress() can still return true in another backend.
> - A logical walsender that started before promotion can wake up and
> choose its read timeline using the old recovery-state test.
>
> In the race, replay tli 1 is chosen even though the promotion has
> created the new tli 2 wal segment copy. Then WalSndSegmentOpen() tries
> to open 000000010000000000000006, which may already be gone or
> renamed, while the correct file is 000000020000000000000006.
>
> 7) Bertrand's proposal to fix this
> Recognizing the intermediate promotion state -- if recovery still
> shows progress but the insertion tli is already advanced, use that
> insertion tli for the timeline to be chosen.
>
> > However, I think logical_read_xlog_page() is called after the new WAL records
> > are generated, i.e., am_cascading_walsender has already been false at that time.
> > So not sure where is the race?
>
> This function is not only called after new post-promotion records are
> generated. When START_REPLICATION begins, CreateDecodingContext() /
> XLogBeginRead() starts from the slot's restart_lsn, which can be well
> before promotion.
>

I tweaked the reproducer based on the theory outlined above. The main
changes from the original reproducer are:

1) blocks at logical-walsender-after-slot-acquire in walsender.c,
before the decoding context is created and before the reader starts
from restart_lsn, matching the delay set by Alexander

2) Forces the first read to occur during promotion. It inserts rows
1..4, waits for replay, starts promotion with pg_promote(false), holds
startup at startup-logical-decoding-status-change-end-of-recovery,
then wakes the walsender.

The test failed on HEAD and passed after applying the patch by Bertrand.

-- 
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.


Attachments:

  [application/octet-stream] v1-0001-Reproducer-for-logical-decoding-choosing-the-wron.patch (5.9K, 2-v1-0001-Reproducer-for-logical-decoding-choosing-the-wron.patch)
  download | inline diff:
From 15333302b07bac64e2eb0b984c2b4e037e2566ab Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Mon, 8 Jun 2026 20:47:56 +0800
Subject: [PATCH v1] Reproducer for logical decoding choosing the wrong
 timeline while a standby is being promoted.

---
 src/backend/replication/walsender.c |   5 +
 src/test/recovery/t/099_repro.pl    | 139 ++++++++++++++++++++++++++++
 2 files changed, 144 insertions(+)
 create mode 100644 src/test/recovery/t/099_repro.pl

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 04aa770d981..dbd48d17250 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -94,6 +94,7 @@
 #include "utils/acl.h"
 #include "utils/builtins.h"
 #include "utils/guc.h"
+#include "utils/injection_point.h"
 #include "utils/lsyscache.h"
 #include "utils/memutils.h"
 #include "utils/pg_lsn.h"
@@ -1103,6 +1104,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	 */
 	am_cascading_walsender = RecoveryInProgress();
 
+	INJECTION_POINT("logical-read-xlog-page-before-tli", NULL);
+
 	if (am_cascading_walsender)
 		GetXLogReplayRecPtr(&currTLI);
 	else
@@ -1501,6 +1504,8 @@ StartLogicalReplication(StartReplicationCmd *cmd)
 
 	ReplicationSlotAcquire(cmd->slotname, true, true);
 
+	INJECTION_POINT("logical-walsender-after-slot-acquire", NULL);
+
 	/*
 	 * Force a disconnect, so that the decoding code doesn't need to care
 	 * about an eventual switch from running in recovery, to running in a
diff --git a/src/test/recovery/t/099_repro.pl b/src/test/recovery/t/099_repro.pl
new file mode 100644
index 00000000000..83a42ae507e
--- /dev/null
+++ b/src/test/recovery/t/099_repro.pl
@@ -0,0 +1,139 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+# Reproducer for logical decoding choosing the wrong timeline while a standby
+# is being promoted.
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+	plan skip_all => 'Injection points not supported by this build';
+}
+
+my ($stdout, $stderr);
+
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+my $default_timeout = $PostgreSQL::Test::Utils::timeout_default;
+
+my $primary_slotname = 'primary_physical';
+
+$node_primary->init(allows_streaming => 1, has_archiving => 1);
+$node_primary->append_conf(
+	'postgresql.conf', q[
+wal_level = 'logical'
+max_replication_slots = 4
+max_wal_senders = 4
+autovacuum = off
+]);
+$node_primary->start;
+
+if (!$node_primary->check_extension('injection_points'))
+{
+	plan skip_all => 'Extension injection_points not installed';
+}
+
+$node_primary->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
+
+$node_primary->safe_psql('postgres',
+	qq[SELECT * FROM pg_create_physical_replication_slot('$primary_slotname');]
+);
+$node_primary->safe_psql('postgres',
+	'CREATE TABLE decoding_test(x integer, y text);');
+
+my $backup_name = 'b1';
+$node_primary->backup($backup_name);
+
+$node_standby->init_from_backup(
+	$node_primary, $backup_name,
+	has_streaming => 1,
+	has_restoring => 1);
+$node_standby->append_conf(
+	'postgresql.conf',
+	qq[primary_slot_name = '$primary_slotname'
+max_replication_slots = 4]);
+$node_standby->start;
+$node_primary->wait_for_replay_catchup($node_standby);
+
+$node_standby->create_logical_slot_on_standby($node_primary, 'testslot',
+	'postgres');
+
+$node_standby->safe_psql('postgres',
+	"SELECT injection_points_attach('logical-walsender-after-slot-acquire', 'wait');"
+);
+
+my $handle = IPC::Run::start(
+	[
+		'pg_recvlogical',
+		'--dbname' => $node_standby->connstr('postgres'),
+		'--slot' => 'testslot',
+		'--option' => 'include-xids=0',
+		'--option' => 'skip-empty-xacts=1',
+		'--file' => '-',
+		'--no-loop',
+		'--start',
+	],
+	'>' => \$stdout,
+	'2>' => \$stderr,
+	IPC::Run::timeout($default_timeout));
+
+$node_standby->wait_for_event('walsender',
+	'logical-walsender-after-slot-acquire');
+
+$node_primary->safe_psql('postgres',
+	qq[INSERT INTO decoding_test(x,y)
+	   SELECT s, s::text FROM generate_series(1,4) s;]);
+$node_primary->wait_for_replay_catchup($node_standby);
+
+$node_standby->safe_psql('postgres',
+	qq[SELECT injection_points_attach('startup-logical-decoding-status-change-end-of-recovery', 'wait');]
+);
+
+$node_standby->safe_psql('postgres', 'SELECT pg_promote(false);');
+$node_standby->wait_for_event('startup',
+	'startup-logical-decoding-status-change-end-of-recovery');
+
+$node_standby->safe_psql('postgres',
+	"SELECT injection_points_wakeup('logical-walsender-after-slot-acquire');"
+);
+
+my $pump_timeout = IPC::Run::timer($PostgreSQL::Test::Utils::timeout_default);
+ok(pump_until($handle, $pump_timeout, \$stdout, qr/^.*COMMIT$/s),
+	'decoded pre-promotion transaction while promotion is in progress');
+
+$node_standby->safe_psql('postgres',
+	qq[SELECT injection_points_wakeup('startup-logical-decoding-status-change-end-of-recovery');]
+);
+
+$node_standby->poll_query_until('postgres',
+	"SELECT NOT pg_is_in_recovery();")
+  or die "standby did not finish promotion";
+
+$node_standby->safe_psql('postgres',
+	qq[INSERT INTO decoding_test(x,y)
+	   SELECT s, s::text FROM generate_series(5,7) s;]);
+
+ok(pump_until($handle, $pump_timeout, \$stdout, qr/^.*COMMIT.*COMMIT$/s),
+	'decoded pre- and post-promotion transactions');
+
+my $expected = q{BEGIN
+table public.decoding_test: INSERT: x[integer]:1 y[text]:'1'
+table public.decoding_test: INSERT: x[integer]:2 y[text]:'2'
+table public.decoding_test: INSERT: x[integer]:3 y[text]:'3'
+table public.decoding_test: INSERT: x[integer]:4 y[text]:'4'
+COMMIT
+BEGIN
+table public.decoding_test: INSERT: x[integer]:5 y[text]:'5'
+table public.decoding_test: INSERT: x[integer]:6 y[text]:'6'
+table public.decoding_test: INSERT: x[integer]:7 y[text]:'7'
+COMMIT};
+
+chomp($stdout);
+is($stdout, $expected, 'got expected output from pg_recvlogical');
+
+done_testing();
-- 
2.51.0



view thread (24+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: Re: t/035_standby_logical_decoding.pl might fail on attempt to read wrong timeline
  In-Reply-To: <CABPTF7U_B9pdC563fFONQLX_FXCZtxZgM+jN9snzMVg9b9MRfg@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox