MIME-Version: 1.0
References: <CAHg+QDfU7rOebrLDESPpHSgdiadKbpCOmBokcbmM6Gr+A5VobQ@mail.gmail.com>
 <CAE9k0PnP0cPuisVeXM+Bma7n6J+HYqhVO5LffosXuHSw7drEDQ@mail.gmail.com>
In-Reply-To: <CAE9k0PnP0cPuisVeXM+Bma7n6J+HYqhVO5LffosXuHSw7drEDQ@mail.gmail.com>
From: Ashutosh Sharma <ashu.coek88@gmail.com>
Date: Thu, 26 Feb 2026 10:28:31 +0530
Message-ID: <CAE9k0Pm_6+4zW-X9zgBHhyLa9dqNKLM=zzUnVeH+ikoh45iALw@mail.gmail.com>
Subject: Re: synchronized_standby_slots behavior inconsistent with
 quorum-based synchronous replication
To: SATYANARAYANA NARLAPURAM <satyanarlapuram@gmail.com>
Cc: PostgreSQL-development <pgsql-hackers@postgresql.org>, 
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/CAE9k0Pm_6%2B4zW-X9zgBHhyLa9dqNKLM%3DzzUnVeH%2Bikoh45iALw%40mail.gmail.com>
Precedence: bulk

Hi,


On Wed, Feb 25, 2026 at 7:21 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi Satya,
>
> On Wed, Feb 25, 2026 at 3:38 AM SATYANARAYANA NARLAPURAM
> <satyanarlapuram@gmail.com> wrote:
> >
> >
> > Hi hackers,
> >
> > synchronized_standby_slots requires that every physical slot listed in the GUC has caught up before a logical failover slot is allowed to proceed with decoding. This is an ALL-of-N slots semantic. The logical slot availability model does not align with quorum replication semantics set using synchronous_standby_names which can be configured for quorum commit (ANY M of N).
> >
> > In a typical 3 Node HA deployment with quorum sync rep:
> >
> > Primary, standby1 (corresponds to sb1_slot), standby2 (corresponds to sb2_slot)
> > synchronized_standby_slots = ' sb1_slot,  sb2_slot'
> > synchronous_standby_names = 'Any 1 ('standby1','standby2')'
> >
> > If standby1 goes down, synchronous commits still succeed because standby2 satisfies the quorum. However, logical decoding blocks indefinitely in WaitForStandbyConfirmation(), waiting for sb1_slot (corresponds to standby1) to catch up, even though the transaction is already safely committed on a quorum of synchronous standbys. This blocks logical decoding consumers from progressing and is inconsistent with the availability guarantee the DBA intended by choosing quorum commit.
>
> +1. This can indeed be a blocker for failover enabled logical
> replication. It not only has the potential to disrupt logical
> replication, but can also impact the primary server. Over time, it may
> silently lead to significant WAL accumulation on the primary,
> eventually causing disk-full scenarios and degrading the performance
> of applications running on the primary instance. Therefore, I too
> strongly believe this needs to be addressed to prevent such
> potentially disruptive situations.
>
> >
> >
> > Proposal:
> >
> > Make synchronized_standby_slots quorum aware i.e. extend the GUC to accept an ANY M (slot1, slot2, ...) syntax similar to synchronous_standby_names, so StandbySlotsHaveCaughtup() can return true when M of N slots (where M <= N and M >= 1) have caught up. I still prefer two different GUCs for this as the list of slots to be synchronized can still be different (for example, DBA may want to ensure Geo standby to be sync before allowing the logical decoding client to read the changes). I kept synchronized_standby_slots parse logic similar to synchronous_standby_names to keep things simple. The default behavior is also not changed for synchronized_standby_slots.
> >
>
> Thank you for the proposal. I can spend some time reviewing the
> changes and help take this forward. I would also be happy to hear
> others' thoughts and feedback on the proposal.
>

Thinking about this further, a quorum setting for
synchronized_standby_slots can leave at least one sync standby lagging
behind the logical replica, which would probably make it impossible to
continue with the existing logical replication setup after a failover
to the lagging standby. Here is what I mean:

Let's say we have 2 synchronous standbys with
"synchronized_standby_slots" configured as ANY 1 (sync_standby1,
sync_standby2). With this quorum setting, WAL only needs to be
confirmed by any one of the two standbys before it can be forwarded to
the logical replica. Now consider a scenario where sync_standby1 is
ahead of sync_standby2, new WAL gets confirmed by sync_standby1 and
subsequently delivered to the logical replica. If sync_standby1 then
goes down and we fail over to sync_standby2, the new primary will be at
a lower LSN than the logical replica, since sync_standby2 never
received that WAL. At this point, the logical replication slot on the
new primary is essentially stale, and the logical replication setup
that existed before the failover cannot be resumed. Hence, I think
it's important to ensure that the WAL (including all the necessary
data needed for logical replication) gets delivered to all the
servers/slots specified in synchronized_standby_slots before it gets
delivered to the logical replica.
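
To make the scenario above concrete, here is a small toy model. It is
only a sketch and not PostgreSQL code: LSNs are plain integers, and the
Slot class and max_lsn_releasable() helper are hypothetical names made
up for illustration. It computes the highest LSN that could be released
to the logical replica under ANY 1 versus the current ALL behaviour, and
compares it with the LSN of the standby we fail over to.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    confirmed_lsn: int  # highest LSN confirmed by the standby using this slot


def max_lsn_releasable(slots, required):
    """Highest LSN confirmed by at least `required` of the given slots.

    required == len(slots) models the current ALL-of-N behaviour;
    required == 1 models a hypothetical ANY 1 setting.
    """
    lsns = sorted((s.confirmed_lsn for s in slots), reverse=True)
    # The required-th highest LSN has been confirmed by `required` slots.
    return lsns[required - 1]


# sync_standby1 is ahead of sync_standby2 (LSN values are illustrative).
slots = [Slot("sync_standby1", 300), Slot("sync_standby2", 200)]

any1 = max_lsn_releasable(slots, required=1)
all_n = max_lsn_releasable(slots, required=len(slots))

# sync_standby1 goes down and we fail over to sync_standby2.
new_primary_lsn = slots[1].confirmed_lsn

print("ANY 1: logical replica ahead of new primary?", any1 > new_primary_lsn)
print("ALL:   logical replica ahead of new primary?", all_n > new_primary_lsn)
```

With ANY 1 the logical replica may already have received changes up to
the LSN confirmed only by sync_standby1, which the new primary never
saw. With ALL, nothing beyond what every listed slot has confirmed is
released.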

While I agree that not allowing quorum-like settings here can cause WAL
to accumulate on the primary and stall logical replication, I think we
can explore other ways to mitigate that concern separately.

Let's see what the experts have to say on this.

--
With Regards,
Ashutosh Sharma.