public inbox for [email protected]
help / color / mirror / Atom feedFrom: Xuneng Zhou <[email protected]>
To: Zhijie Hou (Fujitsu) <[email protected]>
To: Fujii Masao <[email protected]>
Cc: Srinath Reddy Sadipiralla <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: Fix race in ReplicationSlotRelease for ephemeral slots
Date: Tue, 2 Jun 2026 14:00:29 +0800
Message-ID: <CABPTF7VyH1-W2xnDspECDEzFGQj=WTFpZBCqKfM11OAZa6gQHQ@mail.gmail.com> (raw)
In-Reply-To: <TY4PR01MB177180A7CE60BCDF286B1C6F594172@TY4PR01MB17718.jpnprd01.prod.outlook.com>
References: <TY4PR01MB177184FF9EE916F577E1F554194082@TY4PR01MB17718.jpnprd01.prod.outlook.com>
<CAFC+b6o-hD5VxVLZQovmHSYykF8Qzq3eiuBU-U1F_yR9-y6P_w@mail.gmail.com>
<TY4PR01MB177180A7CE60BCDF286B1C6F594172@TY4PR01MB17718.jpnprd01.prod.outlook.com>
Hi,
On Sat, May 30, 2026 at 4:14 PM Zhijie Hou (Fujitsu)
<[email protected]> wrote:
>
> On Saturday, May 30, 2026 1:44 AM Srinath Reddy Sadipiralla <[email protected]> wrote:
>
> > > On Wed, May 27, 2026 at 5:20 PM Zhijie Hou (Fujitsu) <mailto:[email protected]> wrote:
> > > I haven't attached a test for this fix, as the change is straightforward and the
> > > Likelihood of encountering this bug is low, so it may not be worth adding test
> > > cycles for it. However, if others feel differently, I'm OK to add one.
> >
> > +1 for a test. The fix is just an else, so a future refactor could change it and silently
> > reintroduce the corruption, since it scribbles on an unrelated reused slot, nothing
> > would catch it. Injection points make it deterministic; I've attached a diff patch that adds
> > a test that fails without the fix and passes with it.
>
> Thanks for the test.
>
> I'm not sure if adding an injection point for this rare case is worthwhile. Even
> if we were to add one, future refactoring of that function could shift the
> position of the injection point, so its long-term usefulness is uncertain. I
> don't have a strong opinion on this, so I'll leave it to Fujii-San to decide.
Thanks for reporting this issue. I can reproduce it with lldb on Mac.
postgres=# SELECT pg_create_logical_replication_slot(
postgres(# 'test_slot_dropped',
postgres(# 'pgoutput2',
postgres(# false,
postgres(# false,
postgres(# true
postgres(# );
ERROR: could not access file "pgoutput2": No such file or directory
787 * decoding be disabled.
788 */
789 ReplicationSlotDropAcquired(is_logical);
-> 790 }
791
792 /*
793 * If slot needed to temporarily restrain both data and catalog xmin to
Target 0: (postgres) stopped.
(lldb) expr -- slot->data.name.data
(char[64]) $0 = "test_slot_created"
(lldb) expr -- slot->data.persistency
(ReplicationSlotPersistency) $1 = RS_PERSISTENT
(lldb) expr -- slot->active_proc
(ProcNumber) $2 = 126
The fix looks good to me.
There's an adjacent bug around drop_local_obsolete_slots. The root
cause of them looks similar -- ReplicationSlot * is a pointer to a
reusable shared-memory array cell, not a durable identity for the same
slot. In drop_local_obsolete_slots, the issue is that the slot has
been freed after ReplicationSlotDropAcquired(false); however, another
backend may reuse the same cell before the unlock/log reads. This
seems less severe -- it does not normally corrupt slot state, because
the code only read after the drop. But it can still unlock/log
misusing the identity of a different slot. Attached a test using
injection point to reproduce it and a patch to fix it.
--
Regards,
Xuneng Zhou
HighGo Software Co., Ltd.
Attachments:
[text/x-perl-script] repro_slotsync_stale_pointer.pl (3.2K, 2-repro_slotsync_stale_pointer.pl)
download | inline:
use strict;
use warnings FATAL => 'all';
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;
my $tag = $$;
my $primary = PostgreSQL::Test::Cluster->new("repro_ip_primary_$tag");
$primary->init(allows_streaming => 'logical');
$primary->append_conf(
'postgresql.conf', qq(
autovacuum = off
log_min_messages = 'debug2'
));
$primary->start;
if (!$primary->check_extension('injection_points'))
{
plan skip_all => 'Extension injection_points not installed';
}
# Create the extension before the base backup so the standby can call its
# functions while in recovery.
$primary->safe_psql('postgres', 'CREATE EXTENSION injection_points;');
my $backup_name = 'backup';
$primary->backup($backup_name);
my $standby = PostgreSQL::Test::Cluster->new("repro_ip_standby_$tag");
$standby->init_from_backup(
$primary, $backup_name,
has_streaming => 1,
has_restoring => 1);
my $primary_connstr = $primary->connstr;
$standby->append_conf(
'postgresql.conf', qq(
hot_standby_feedback = on
primary_slot_name = 'phys_slot'
primary_conninfo = '$primary_connstr dbname=postgres'
log_min_messages = 'debug2'
));
$primary->safe_psql(
'postgres',
q{
SELECT pg_create_logical_replication_slot('victim_slot', 'pgoutput', false, false, true);
SELECT pg_create_physical_replication_slot('phys_slot');
});
$standby->start;
$primary->advance_wal(1);
$primary->wait_for_replay_catchup($standby);
note('attach injection point on standby');
$standby->safe_psql(
'postgres',
q{SELECT injection_points_attach('slotsync-obsolete-after-drop', 'wait');}
);
note('sync failover slot to standby');
$standby->safe_psql('postgres', 'SELECT pg_sync_replication_slots();');
is( $standby->safe_psql(
'postgres',
q{SELECT synced FROM pg_replication_slots WHERE slot_name = 'victim_slot';}
),
't',
'victim slot is synchronized to the standby');
note('drop remote failover slot');
$primary->safe_psql('postgres',
q{SELECT pg_drop_replication_slot('victim_slot');});
my $log_offset = -s $standby->logfile;
note('start slot sync and wait at injection point after local drop');
my $sync = $standby->background_psql('postgres', on_error_stop => 0);
$sync->query_until(
qr/start_sync/,
q(
\echo start_sync
SELECT pg_sync_replication_slots();
));
$standby->wait_for_event('client backend', 'slotsync-obsolete-after-drop');
ok( $standby->poll_query_until(
'postgres',
q{SELECT NOT EXISTS (SELECT 1 FROM pg_replication_slots WHERE slot_name = 'victim_slot');}
),
'victim slot has been dropped locally');
note('reuse freed slot array cell with a physical slot');
$standby->safe_psql('postgres',
q{SELECT pg_create_physical_replication_slot('replacement_slot');});
note('release injection point');
$standby->safe_psql(
'postgres',
q{
SELECT injection_points_wakeup('slotsync-obsolete-after-drop');
SELECT injection_points_detach('slotsync-obsolete-after-drop');
});
ok( $standby->wait_for_log(
qr/dropped replication slot "replacement_slot" of database with OID 0|you don't own a lock of type AccessShareLock/,
$log_offset),
'stale local_slot pointer was observed after drop');
is( $standby->safe_psql(
'postgres',
q{SELECT slot_type FROM pg_replication_slots WHERE slot_name = 'replacement_slot';}
),
'physical',
'replacement slot still exists');
$sync->quit;
done_testing();
[application/octet-stream] v1-0001-Avoid-stale-slot-access-after-dropping-obsolete-s.patch (3.0K, 3-v1-0001-Avoid-stale-slot-access-after-dropping-obsolete-s.patch)
download | inline diff:
From e920adcb5c01510f94ee75f26133e0efa6089038 Mon Sep 17 00:00:00 2001
From: alterego655 <[email protected]>
Date: Tue, 2 Jun 2026 13:14:54 +0800
Subject: [PATCH v1] Avoid stale slot access after dropping obsolete synced
slots
drop_local_obsolete_slots() kept using local_slot after calling
ReplicationSlotDropAcquired(). Once the drop completes, the slot array entry can
be reused by another backend, so later reads of local_slot->data could refer to a
different slot.
Copy the slot name and database OID before dropping the slot, and use those
saved values for unlocking and logging after the drop.
---
src/backend/replication/logical/slotsync.c | 25 ++++++++++++++--------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/src/backend/replication/logical/slotsync.c b/src/backend/replication/logical/slotsync.c
index 93f41be32af..ea73f0aa262 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -541,6 +541,9 @@ drop_local_obsolete_slots(List *remote_slot_list)
/* Drop the local slot if it is not required to be retained. */
if (!local_sync_slot_required(local_slot, remote_slot_list))
{
+ bool dropped = false;
+ NameData slot_name = {0};
+ Oid slot_database = local_slot->data.database;
bool synced_slot;
/*
@@ -548,8 +551,8 @@ drop_local_obsolete_slots(List *remote_slot_list)
* ReplicationSlotsDropDBSlots(), trying to drop the same slot
* during a drop-database operation.
*/
- LockSharedObject(DatabaseRelationId, local_slot->data.database,
- 0, AccessShareLock);
+ LockSharedObject(DatabaseRelationId, slot_database, 0,
+ AccessShareLock);
/*
* In the small window between getting the slot to drop and
@@ -562,6 +565,8 @@ drop_local_obsolete_slots(List *remote_slot_list)
*/
SpinLockAcquire(&local_slot->mutex);
synced_slot = local_slot->in_use && local_slot->data.synced;
+ if (synced_slot)
+ slot_name = local_slot->data.name;
SpinLockRelease(&local_slot->mutex);
if (synced_slot)
@@ -572,17 +577,19 @@ drop_local_obsolete_slots(List *remote_slot_list)
* a standby, which derives its logical decoding state from
* the primary, it would be wrong to do so.
*/
- ReplicationSlotAcquire(NameStr(local_slot->data.name), true, false);
+ ReplicationSlotAcquire(NameStr(slot_name), true, false);
ReplicationSlotDropAcquired(false);
+ dropped = true;
}
- UnlockSharedObject(DatabaseRelationId, local_slot->data.database,
- 0, AccessShareLock);
+ UnlockSharedObject(DatabaseRelationId, slot_database, 0,
+ AccessShareLock);
- ereport(LOG,
- errmsg("dropped replication slot \"%s\" of database with OID %u",
- NameStr(local_slot->data.name),
- local_slot->data.database));
+ if (dropped)
+ ereport(LOG,
+ errmsg("dropped replication slot \"%s\" of database with OID %u",
+ NameStr(slot_name),
+ slot_database));
}
}
}
--
2.51.0
view thread (27+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected], [email protected]
Subject: Re: Fix race in ReplicationSlotRelease for ephemeral slots
In-Reply-To: <CABPTF7VyH1-W2xnDspECDEzFGQj=WTFpZBCqKfM11OAZa6gQHQ@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox