MIME-Version: 1.0
From: Edwin UY <edwin.uy@gmail.com>
Date: Thu, 30 Oct 2025 22:30:00 +1300
Message-ID: 
 <CA+wokJ_sf=EZrdHgPnPn_6vUG43B0yTHS50MaM5jLeS1z6LYyQ@mail.gmail.com>
Subject: Replication Question / Issue - PRIMARY with SYNC and ASYNC
 Replication
To: Pgsql-admin <pgsql-admin@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000db443006425ce68b"
Archived-At: 
 <https://www.postgresql.org/message-id/CA%2BwokJ_sf%3DEZrdHgPnPn_6vUG43B0yTHS50MaM5jLeS1z6LYyQ%40mail.gmail.com>
Precedence: bulk

--000000000000db443006425ce68b
Content-Type: text/plain; charset="UTF-8"

Hi,

Apologies for the long email; I assume as much information as possible will
help with troubleshooting.
PostgreSQL is version 11. I know it's old, but I don't have a choice due to
the application.

There is a PRIMARY and 2 replicas, SYNC and ASYNC.
We had a network outage that rendered the application unusable for some
reason, even though we still had a PRIMARY and a replication server in
place.
This is now resolved since the network has been restored, so I am just
looking for some guidance on a quicker resolution in the future.

I am not really sure how to confirm which one is SYNC or ASYNC.
Running select * from pg_stat_replication on the PRIMARY returns no rows.
So I am left with no choice but to trust the documentation, which says:

SERVER-E = PRIMARY
SERVER-F = ASYNC
SERVER-G = SYNC
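
For reference, the roles above would normally be confirmed with checks along
these lines. This is only a sketch, assuming psql access to each server; the
views and columns are the standard PostgreSQL 11 ones, and nothing here
reflects actual output from these hosts.

```sql
-- On the PRIMARY: one row per connected standby; sync_state is
-- 'sync', 'potential', 'quorum' or 'async'.
SELECT application_name, client_addr, state, sync_state
FROM pg_stat_replication;

-- On the PRIMARY: which standby names are treated as synchronous.
SHOW synchronous_standby_names;

-- On any server: true on a standby (in recovery), false on a primary.
SELECT pg_is_in_recovery();

-- On a standby: state of its WAL receiver and the upstream it connects to.
SELECT status, conninfo FROM pg_stat_wal_receiver;
```

An empty pg_stat_replication on a server only says that no standby is
currently streaming from that particular server.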

During the network issue, SERVER-E and SERVER-F were accessible and could
communicate with each other. SERVER-G was not accessible. However, the
application connection was intermittently dropping.

The primary is showing several errors like below:
STATEMENT:  ROLLBACK PREPARED 'gid'
ERROR:  prepared transaction with identifier "gid" is busy

SERVER-F is showing
FATAL:  could not connect to the primary server: could not connect to
server: No route to host
                Is the server running on host "SERVER-G" and accepting
                TCP/IP connections on port 5432?

Can't check SERVER-G as it is not accessible.

I assume the prepared transactions are from the replication, not from the
application.
The error from SERVER-F is as expected since SERVER-G is not accessible.
Under this scenario, the application was intermittently having issues
connecting to the database. I am not sure why.
We restarted both databases, SERVER-E and SERVER-F, and also cleared up the
prepared transactions following
https://www.cybertec-postgresql.com/en/prepared-transactions.
After startup we could see that the prepared transactions were gone:
pg_prepared_xacts was empty, and would then show one prepared transaction
that is active according to pg_stat_activity.
select * from pg_stat_replication still returned nothing.
To resolve the SERVER-F error, we edited recovery.conf and changed
primary_conninfo to point to SERVER-E.
This still did not resolve the application issue, and the primary log still
showed the following every so often:

STATEMENT:  ROLLBACK PREPARED 'gid'
ERROR:  prepared transaction with identifier "gid" is busy

At this stage, I thought maybe the PRIMARY and the replicas are configured
in such a way that the PRIMARY must receive confirmation from both that they
have committed too; otherwise it will just keep waiting.
Under this scenario, it is not able to, since SERVER-G is not accessible.
Does that make sense?

Anyway, maybe someone will be interested to read this email and can shed
some light on this and can advise whether there's some configuration
setting somewhere that we should have modified as a temporary workaround.
Could it be because of synchronous_commit = on? Maybe we should have changed
this while SERVER-G was not accessible?

Everything went back to normal once SERVER-G became accessible again.
That took about 6 hours though :( and it doesn't explain why things stop
working normally when a replica is down and the PRIMARY is still accessible.
Does that mean that if both replicas are down and only the PRIMARY is
accessible, we have to turn off / disable replication entirely?
If we do need to break the replication when the PRIMARY is up and both
replicas are inaccessible, do we just unset synchronous_standby_names?
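
To make that last question concrete, a minimal sketch of what unsetting it
would look like, assuming superuser access on the PRIMARY. Whether this is
the appropriate workaround for our situation is exactly what I am asking.

```sql
-- Run on the PRIMARY as a superuser.
-- An empty value means no standby is treated as synchronous.
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();
SHOW synchronous_standby_names;

-- Later, to drop the override and fall back to the postgresql.conf value:
ALTER SYSTEM RESET synchronous_standby_names;
SELECT pg_reload_conf();
```

ALTER SYSTEM writes the override to postgresql.auto.conf, so the original
postgresql.conf is left untouched and the change is easy to revert.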

Any reply is much appreciated. Thanks in advance.

Regards,
Ed

--000000000000db443006425ce68b--