Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1vEOzV-00BHXE-Dq for pgsql-admin@arkaria.postgresql.org; Thu, 30 Oct 2025 09:30:44 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1vEOzU-006xR8-Bb for pgsql-admin@arkaria.postgresql.org; Thu, 30 Oct 2025 09:30:43 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1vEOzT-006xQi-VX for pgsql-admin@lists.postgresql.org; Thu, 30 Oct 2025 09:30:43 +0000 Received: from mail-pl1-x632.google.com ([2607:f8b0:4864:20::632]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vEOzQ-0052jq-0g for pgsql-admin@lists.postgresql.org; Thu, 30 Oct 2025 09:30:42 +0000 Received: by mail-pl1-x632.google.com with SMTP id d9443c01a7336-294fb21b068so5729335ad.1 for ; Thu, 30 Oct 2025 02:30:40 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1761816638; x=1762421438; darn=lists.postgresql.org; h=to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=gc7Js5qWid4oElkX30Wlp97YTmo5c/wEgUAoZDUZJxY=; b=LzmhxCxsV0myZsquEQiAmYFUblCwnj3XVDrxORut0z/kZWaPOfDCDCqgvujCrMqqXs F+LEOYOpJXbTTZlq0IfxQub3glJ/CRT6T5JzYTt6SzP1hQdVVjCOUBHdQmROsEMG2LDp LjhHSTqVpu1jIPx+1wCYfCrA7rpfZajSlyeazDYEqirRdiIMDj5Au+vEwjfRQSNsJCrJ qfN07Sw6612tbc5UpoJjR4OwmSfzPd8Dv2dWt2UzFCvSlWVJVFZyWX4MvbP3TRA7Zpv6 ncZYZzdShWt0AXPGdHPm22q0Z3EXAMxANQAEe+xQticVpY4Mt5Yvb4+kQjhqQkd8yKo5 5vjg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1761816638; x=1762421438; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=gc7Js5qWid4oElkX30Wlp97YTmo5c/wEgUAoZDUZJxY=; b=FVvQgNfxzTELXoK7bAUswnEtYmJerai+zhosq88EsDqBc7r0Bc4Qr9wRzuN1arIBaw Fy62AVJTgu+/SFZKfjCujckqv1s4gYycbiOtPzN5HsPCGFmWkto243yNHg6nnufEISPa L2AUmS0UAZzSKWuZCTb7IPw0FZLWscgCgXI4a4UCcDqqCUBJkXB3ZFk3c/dduk9RcpqZ l7Cd8gk8qmSYBNOkQk9eg1LH8fmcR9pxJGtQI+7uHEyijUImBozL9Rf0xksExpqcK5gP J35/kMeW1JKKXQ1vXDLMBn62Rg98s8efqsLVIfSFB8RL1TSKbBeTH+V2HacGo8cVXwFx 21AA== X-Gm-Message-State: AOJu0YxSIsNskzWoFvwaxvHCJ8++N43eREAFL8itj1+TMUqGjShTJIKi NnRU4uJQrChbmSuS3E19/09VvVgPakzCLCCQ/QiBZNFONlh+Gndo2O1RK2KH2vW6IPFbhUwYNkJ 6LlkEDR8G/yvx5xbZsz4F36s4TcSyJq+f+kVx8aE= X-Gm-Gg: ASbGncs+cdUsu+7sd5jIn3T5LHVVgSXxTbeoFG8ibfE0UinoqVMTM4ltlVFItm3HUd2 WT+HMbFvpO1agRCAjLnxtSXqxksjOHCGgQ5+JY5niz+75B/hlB0IQ6+NvlCo8w6MrMX2Rh4Dwvm wCSyKGrM78O41NsJNI8LGmA6lajMK2W36Z2fbMarSo+7p21l66H50E0JsW/Tum5FYxaOk0MDYyr 993h2kKz39Wsf36yeIXsLPsuBxPkrfRBTdom6K8P1iB6Ic/kcYElzJ21bYs X-Google-Smtp-Source: AGHT+IH8I7MRH+xCi2eRTSqYrVDhn67BAarXJz7qRRPJueEmjFPN4OAvai5DFvDhxbsUdSi4ktMfByyGTwyMy2M2ODU= X-Received: by 2002:a17:902:e28f:b0:292:c97c:3569 with SMTP id d9443c01a7336-294ee45113fmr22109835ad.48.1761816637816; Thu, 30 Oct 2025 02:30:37 -0700 (PDT) MIME-Version: 1.0 From: Edwin UY Date: Thu, 30 Oct 2025 22:30:00 +1300 X-Gm-Features: AWmQ_bmtqtGKjOjaE8o0A870Be53rhsYpupPwJ137upR9YsY3E5MBKs7bdpZba0 Message-ID: Subject: Replication Question / Issue - PRIMARY with SYNC and ASYNC Replication To: Pgsql-admin Content-Type: multipart/alternative; boundary="000000000000db443006425ce68b" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000db443006425ce68b Content-Type: text/plain; charset="UTF-8" Hi, Apologies for a long email. I suppose as much information as possible will help with troubleshooting PostgreSQL is Version 11. I know, it's old, I don't have a choice due to the application. There is a PRIMARY and 2 replicas, SYNC and ASYNC. We had a network outage that rendered the application unusable for some reason even though we still have a PRIMARY and a replication server in place. This is now resolved since the network is restored so I am just wanting to get some guidance for a quick resolution in the future. Not really sure how to confirm which one is SYNC or ASYNC. select * from pg_stat_replication from the PRIMARY shows nothing So, I am left with no choice but to trust the documentation where it says SERVER -E = PRIMARY SERVER -F = ASYNC SERVER -G = SYNC When we have the network issue. SERVER-E and SERVER-F are accessible and they can communicate to each other. SERVER-G is not accessible. However the application connection is intermittently dropping. The primary is showing several errors like below: STATEMENT: ROLLBACK PREPARED 'gid' ERROR: prepared transaction with identifier "gid" is busy SERVER-F is showing FATAL: could not connect to the primary server: could not connect to server: No route to host Is the server running on host "SERVER-G" and accepting TCP/IP connections on port 5432? Can't check SERVER-G as it is not accessible. I assume the prepared transactions are from the replication, not from the application. The error from SERVER-F is as expected since SERVER-G is not accessible. Under this scenario, the application is intermittently having issues connecting to the database. Not sure why. We have re-started both databases SERVER-E and SERVER-F and clear up the prepared transaction as well using https://www.cybertec-postgresql.com/en/prepared-transactions. After startup we can see the prepared transaction gone, pg_prepared_xacts is emptty and then will show one one prepare transaction that is active based on pg_stat_activity. select * from pg_stat_replication still shows nothing. To resolve the SERVER-F error, we change the recovery.conf and change primary_conninfo to use SERVER-E. This still did not resolve the application issue and the primary log still shows the following every so often. STATEMENT: ROLLBACK PREPARED 'gid' ERROR: prepared transaction with identifier "gid" is busy At this stage, I thought maybe the PRIMARY and the replicas are configured in such a way that the PRIMARY must receive confirmation from both that it has committed too otherwise it will just continue waiting. Under this scenario, it is not able too since SERVER-G is not accessible. Does that make sense? Anyway, maybe someone will be interested to read this email and can shed some light on this and can advise whether there's some configuration setting somewhere that we should have modified as a temporary workaround. Could it be because of synchronous_commit= on? Maybe we should have changed this when SERVER-G is not accessible? Everything is back to normal once SERVER-G has become accessible again. That is about 6 hours though :( and doesn't explain why things will stop working normally when a replica is down and the PRIMARY is still accessible. Does that mean, if both replicas are down and only the PRIMARY is accessible, we have to totally turn off / disable replication? If we do need to break the replica, when the PRIMARY is UP and both replicas are inaccessible, do we just unset synchronous_standby_names? Any reply is much appreciated. Thanks in advance. Regards, Ed --000000000000db443006425ce68b Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi,

Apologies for a long email. I suppose as much = information as possible will help with troubleshooting
PostgreSQL is Version 11. I know, it'= ;s old, I don't have a choice due to the application.
=
There is a PRIMARY and 2 replicas, SYNC and ASYNC.
We had a network outa= ge that rendered the application unusable=C2=A0for some reason even though = we still have a PRIMARY and a replication server in place.
This is now resolved since the netwo= rk is restored so I am just wanting to get some guidance for a quick resolu= tion in the future.
=
Not really sure= how to confirm which one is SYNC or ASYNC.
select * from pg_stat_replication from the PRIMARY = shows nothing
So, I = am left with no choice but to trust the documentation where it says<= /div>

= SERVER -E =3D PRIMARY
SERVER -F=C2=A0 =3D ASYNC
SERVER -G=C2=A0=3D SYNC

When we have the networ= k issue.
SERVER-E an= d SERVER-F are accessible and they can communicate to each other. SERVER-G = is not accessible. However the application connection is intermittently=C2= =A0dropping.

The primary is showing= several errors like below:
STATEMENT: =C2=A0ROLLBACK PREPARED 'gid'
ERROR: =C2=A0pr= epared transaction with identifier "gid" is busy

SERVER-F is showing
FATAL: =C2=A0could not connect to the primary serve= r: could not connect to server: No route to host
=C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 Is the server running on host "SERV= ER-G" and accepting
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 TCP/IP connections on port 5432?

Can't check SERVER-G as it is not accessible.

I assume the prepared transactions are from the rep= lication, not from the application.
The error from SERVER-F is as expected since SERVER-G is no= t accessible.
Under = this scenario, the application is intermittently having issues connecting t= o the database. Not sure why.
We have re-started both databases SERVER-E and SERVER-F and clear= up the prepared transaction as well using=C2=A0https://www.cybertec-postgres= ql.com/en/prepared-transactions.
After startup we can see the prepared transaction gone,=C2= =A0pg_prepared_xacts is emptty and then will show one one prepare transacti= on that is active based on pg_stat_activity.
select * from pg_stat_replication still shows nothing.
To resolve the SERVER-F error, we change= the recovery.conf and change=C2=A0primary_conninfo to use SERVER-E.=
This still did not resolve= the application issue and the primary log still shows the following every = so often.

STATEMENT: =C2=A0ROLLBACK PREPAR= ED 'gid'
ERROR: =C2=A0prepared transaction with identifier "= ;gid" is busy

At this stage, I thought maybe the PRIMARY and the r= eplicas are configured in such a way that the PRIMARY must receive confirma= tion from both that it has committed too otherwise it will just continue wa= iting.
Under this sc= enario, it is not able too since SERVER-G is not accessible. Does that make= sense?

<= /div>
Anyway, maybe someone will = be interested to read this email and can shed some light on this and can ad= vise whether there's some configuration setting somewhere that we shoul= d have modified as a temporary workaround.
Could it be because of=C2=A0synchronous_commit=3D on= ? Maybe we should have changed this when SERVER-G is not accessible?=

Everything is back to normal onc= e SERVER-G has become accessible again.
That is about 6 hours though :( and doesn't explain why things will stop=20 working normally when a replica is down and the PRIMARY is still=20 accessible.
Does tha= t mean, if both replicas are down and only the PRIMARY is accessible, we ha= ve to totally turn off / disable replication?
If we do need to break the replica, when th= e PRIMARY is UP and both replicas are inaccessible, do we just unset=C2=A0s= ynchronous_standby_names?

Any reply= is much appreciated. Thanks in advance.

Regards,
Ed

--000000000000db443006425ce68b--