MIME-Version: 1.0
References: 
 <CAG9Amsj61Oy7YGJw40uv7fedZwabLu2b5dVmJK4aDrZLDMVj+w@mail.gmail.com>
 <20251216.084325.87549965646144515.ishii@postgresql.org>
 <CAG9AmsigfRELb2B1K17vJ7YB_5xw0eVXY9AUmH3vj74RcLY08w@mail.gmail.com>
 <20251217.102935.700318527231632102.ishii@postgresql.org>
In-Reply-To: <20251217.102935.700318527231632102.ishii@postgresql.org>
From: Adam Blomeke <adam.blomeke@gmail.com>
Date: Tue, 30 Dec 2025 16:59:57 -0500
Message-ID: 
 <CAG9AmshGJjxEdjRuYJKYt4nJMOuvs2M_mZ1T7cYLwbbrmG7yzw@mail.gmail.com>
Subject: Re: Pgpool can't detect database status properly
To: Tatsuo Ishii <ishii@postgresql.org>
Cc: pgpool-general@lists.postgresql.org
Content-Type: multipart/alternative; boundary="000000000000a2f5260647327b11"
Archived-At: 
 <https://www.postgresql.org/message-id/CAG9AmshGJjxEdjRuYJKYt4nJMOuvs2M_mZ1T7cYLwbbrmG7yzw%40mail.gmail.com>
Precedence: bulk

--000000000000a2f5260647327b11
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Following up on this. I've set auto_failback to off, but pgpool runs the
follow primary script against the node as soon as I detach it. Is this
expected behavior?

[postgres@awaprodxtrldbpgpool1 ~]$ grep auto_failback /etc/pgpool-II/pgpool.conf
auto_failback = off
auto_failback_interval = 1
                                   # Min interval of executing auto_failback in
[postgres@awaprodxtrldbpgpool1 ~]$ # Checkpoint on primary
psql -h 10.6.1.199 -d postgres -c "CHECKPOINT;"
# Detach primary to trigger failover
pcp_detach_node -h 10.6.1.54 -U pgpool -n 0
# Verify node 1 is now primary
pcp_node_info -a -U pgpool -h 10.6.1.54
CHECKPOINT
pcp_detach_node -- Command Successful
10.6.1.199 5432 3 0.500000 down up primary primary 0 none none 2025-12-30 14:22:10
10.6.1.200 5432 2 0.500000 up up standby standby 0 none none 2025-12-30 12:04:09
[postgres@awaprodxtrldbpgpool1 ~]$ pcp_node_info -a -U pgpool -h 10.6.1.54
10.6.1.199 5432 3 0.500000 down down standby unknown 0 none none 2025-12-30 14:22:11
10.6.1.200 5432 2 0.500000 up up primary primary 0 none none 2025-12-30 14:22:11
[postgres@awaprodxtrldbpgpool1 ~]$ ssh postgres@10.6.1.199 "/usr/pgsql-18/bin/pg_ctl -D /opt/data/data18 stop"
Authorized uses only. All activity may be monitored and reported.
pg_ctl: PID file "/opt/data/data18/postmaster.pid" does not exist
Is server running?
[postgres@awaprodxtrldbpgpool1 ~]$ ssh postgres@10.6.1.199
Authorized uses only. All activity may be monitored and reported.
Last login: Tue Dec 30 11:45:09 2025 from 10.6.1.196
-bash: typeset: TMOUT: readonly variable
[postgres@awaproddbvmnasdwusers1 ~]$ cd /opt/data
[postgres@awaproddbvmnasdwusers1 data]$ ll
total 1236
drwxr-x---.  2 postgres postgres       6 Dec 30 14:22 archive18
drwx------. 23 postgres postgres    4096 Dec 29 12:46 data15
drwx------. 13 postgres postgres    4096 Dec 30 14:22 data18
drwx------.  2 postgres postgres      74 Dec  1 15:50 dbservercert
-rw-r-----.  1 postgres postgres 1255731 Oct 24 18:04 pg_basebackup.log
[postgres@awaproddbvmnasdwusers1 data]$ cd data18/
[postgres@awaproddbvmnasdwusers1 data18]$ ll
total 8
-rw-------. 1 postgres postgres  233 Dec 30 14:22 backup_label
drwx------. 6 postgres postgres   66 Dec 30 14:22 base
drwx------. 2 postgres postgres 4096 Dec 30 14:22 global
drwx------. 2 postgres postgres   26 Dec 30 14:22 pg_commit_ts
drwx------. 2 postgres postgres   10 Dec 30 14:22 pg_dynshmem
drwx------. 4 postgres postgres   48 Dec 30 14:22 pg_multixact
drwx------. 2 postgres postgres   10 Dec 30 14:22 pg_notify
drwx------. 2 postgres postgres   10 Dec 30 14:22 pg_serial
drwx------. 2 postgres postgres   10 Dec 30 14:22 pg_snapshots
drwx------. 2 postgres postgres   10 Dec 30 14:22 pg_subtrans
drwx------. 2 postgres postgres   10 Dec 30 14:22 pg_twophase
drwx------. 4 postgres postgres  121 Dec 30 14:22 pg_wal
[postgres@awaproddbvmnasdwusers1 data18]$
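
To read the pcp_node_info lines above at a glance, here is a minimal sketch. It assumes the column order seen in this transcript (host, port, status code, weight, pgpool status, actual backend status, pgpool role, actual role), which may differ in other pgpool-II versions. The function name summarize_nodes is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: condense `pcp_node_info -a` output read from stdin.
# Assumed column order (taken from the transcript, not verified elsewhere):
#   $1 host, $3 status code, $5 pgpool status, $6 actual backend status,
#   $7 pgpool role, $8 actual role.
summarize_nodes() {
  awk 'NF >= 8 {
    printf "%s code=%s pgpool=%s/%s actual=%s/%s\n", $1, $3, $5, $7, $6, $8
  }'
}
```

Usage would look like `pcp_node_info -a -U pgpool -h 10.6.1.54 | summarize_nodes`, which puts pgpool's view and the actual backend state side by side for each node.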

Cheers,
Adam


On Tue, Dec 16, 2025 at 5:29 PM Tatsuo Ishii <ishii@postgresql.org> wrote:

> > Thanks for the reply.
> >
> > Your response makes sense as I'm still setting this cluster up, so it's
> > just me trying to connect to it.
> >
> > I'm curious then what the right process is for when I need to pull a node
> > out of the cluster for maintenance (e.g. patching). I was under the
> > impression that I should drop the node, do a pg_rewind, manually set it as
> > a standby if it was the primary, and then add the node back in pgpool. I
> > guess I can't do that with auto failback turned on?
>
> Yes, during maintenance you should turn off auto failback.
> In fact, it's written in the documentation:
>
> https://www.pgpool.net/docs/47/en/html/runtime-config-failover.html#RUNTIME-CONFIG-FAILOVER-SETTINGS
>
>     If you plan to detach standby node for maintenance, set this
>     parameter to off beforehand. Otherwise it's possible that standby
>     node is reattached against your intention.
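
The documented advice can be sketched as a dry-run checklist. This is a sketch under assumptions, not a tested procedure: the function name print_maintenance_plan is hypothetical, and it only prints commands rather than running them. pcp_reload_config, pcp_detach_node, and pcp_attach_node are standard PCP commands, but whether a reload alone applies auto_failback on a given version should be checked against the documentation.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints (does not execute) a standby
# maintenance sequence. Arguments: PCP host, PCP user, backend node id.
print_maintenance_plan() {
  local pcp_host=$1 pcp_user=$2 node_id=$3
  echo "# 1. set auto_failback = off in pgpool.conf on every pgpool node"
  echo "pcp_reload_config -h $pcp_host -U $pcp_user"
  echo "pcp_detach_node -h $pcp_host -U $pcp_user -n $node_id"
  echo "# 2. patch the detached node, then confirm PostgreSQL is running as a standby"
  echo "pcp_attach_node -h $pcp_host -U $pcp_user -n $node_id"
  echo "pcp_node_info -a -h $pcp_host -U $pcp_user"
}
```

For example, `print_maintenance_plan 10.6.1.54 pgpool 1` prints the sequence for node 1, so each step can be reviewed before anything is run by hand.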
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
> > Cheers,
> > Adam
> >
> >
> > On Mon, Dec 15, 2025 at 6:43 PM Tatsuo Ishii <ishii@postgresql.org> wrote:
> >
> >> > I'm resending this as it's been sitting in the moderation queue for a
> >> > while. Possibly because I didn't have a subject line? Anyways, any help
> >> > would be great. Thanks!
> >>
> >> I received your email this time.
> >>
> >> > I’m setting up a pgpool cluster to replace a single node database in my
> >> > environment. The single node is separate from the cluster at the moment.
> >> > When it’s time to implement the DB I’m going to redo the backup/restore,
> >> > throw in an upgrade from pg15->18, and then bring the cluster up and take
> >> > over the old IP.
> >> >
> >> >
> >> >
> >> > *Environment:*
> >> >
> >> >    - pgpool-II version: 4.6.3 (chirikoboshi)
> >> >    - PostgreSQL version: 18
> >> >    - OS: RHEL9
> >> >    - Cluster topology: 3 pgpool nodes (10.6.1.196, 10.6.1.197, 10.6.1.198)
> >> >    + 2 PostgreSQL nodes (10.6.1.199 primary, 10.6.1.200 standby)
> >> >
> >> >
> >> >
> >> > *Issue:*
> >> >
> >> > I have pgpool configured and I’ve set it up using the scripts and config
> >> > files from a different instance, one which has been running just fine for a
> >> > year and a half or so. The issue I’m experiencing is that when I
> >> > detach/reattach a node, it sits in waiting constantly. It never transitions
> >> > to up.
> >>
> >> If you connect to 10.6.1.196 (or 10.6.1.197, 10.6.1.198) using psql
> >> and issue an SQL command, for example "SELECT 1", does it work? If it
> >> works, it means pgpool works fine.
> >>
> >> > I have to manually change the status file to up for it to get to
> >> > agree that it is,
> >>
> >> The pgpool status "waiting" means that the backend node has not yet
> >> received any query from pgpool clients. You can safely assume
> >> that pgpool is up and running. Once pgpool receives queries, the
> >> status should change from "waiting" to "up".
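
A quick way to generate that traffic is one trivial query per pgpool node. Below is a minimal dry-run sketch that only prints the psql commands. The function name print_smoke_test and the PGPOOL_PORT variable are hypothetical, and port 9999 (the pgpool-II default) is an assumption about this cluster.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints one psql command per pgpool host.
# The port defaults to 9999 (the pgpool-II default listen port), which is
# an assumption; override it with the PGPOOL_PORT environment variable.
print_smoke_test() {
  local port=${PGPOOL_PORT:-9999} host
  for host in "$@"; do
    echo "psql -h $host -p $port -d postgres -c 'SELECT 1'"
  done
}
```

For example, `print_smoke_test 10.6.1.196 10.6.1.197 10.6.1.198` prints one command per pgpool node. After running them, pcp_node_info should show whether the status moved from "waiting" to "up".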
> >>
> >> > and when I try to drop the node it doesn't actually drop
> >> > it. It just goes into waiting again.
> >>
> >> Sounds like an effect of auto failback. Because you set:
> >>
> >>  auto_failback_interval = 1
> >>
> >> pgpool brings the node back online almost immediately.
> >>
> >> > I also don’t see any connection
> >> > attempts from the pgpool server to the postgres nodes if I look at postgres
> >> > logs. I've confirmed that it can run the postgres commands from the command
> >> > line. I've tried this both running pgpool as a service and running it
> >> > directly from the command line. No difference in behavior.
> >>
> >> Probably there's something wrong in the configuration, or pgpool is trying
> >> to connect to the wrong IP and/or port. Please turn on log_client_messages
> >> and log_per_node_statement, then send an SQL command to pgpool, and
> >> examine the pgpool log.
> >>
> >> > Here’s the log output:
> >> >
> >> > 2025-12-03 14:20:49.037: main pid 1085028: LOG:  === Starting fail back. reconnect host 10.6.1.200(5432) ===
> >> > 2025-12-03 14:20:49.037: main pid 1085028: LOCATION:  pgpool_main.c:4169
> >> > 2025-12-03 14:20:49.037: main pid 1085028: LOG:  Node 0 is not down (status: 2)
> >> > 2025-12-03 14:20:49.037: main pid 1085028: LOCATION:  pgpool_main.c:1524
> >> > 2025-12-03 14:20:49.038: main pid 1085028: LOG:  Do not restart children because we are failing back node id 1 host: 10.6.1.200 port: 5432 and we are in streaming replication mode and not all backends were down
> >> > 2025-12-03 14:20:49.038: main pid 1085028: LOCATION:  pgpool_main.c:4370
> >> > 2025-12-03 14:20:49.038: main pid 1085028: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
> >> > 2025-12-03 14:20:49.038: main pid 1085028: LOCATION:  pgpool_main.c:2896
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG:  find_primary_node: primary node is 0
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION:  pgpool_main.c:2815
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG:  find_primary_node: standby node is 1
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION:  pgpool_main.c:2821
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG:  failover: set new primary node: 0
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION:  pgpool_main.c:4660
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG:  failover: set new main node: 0
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION:  pgpool_main.c:4667
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOG:  === Failback done. reconnect host 10.6.1.200(5432) ===
> >> > 2025-12-03 14:20:49.189: main pid 1085028: LOCATION:  pgpool_main.c:4763
> >> > 2025-12-03 14:20:49.189: sr_check_worker pid 1085088: LOG:  worker process received restart request
> >> > 2025-12-03 14:20:49.189: sr_check_worker pid 1085088: LOCATION:  pool_worker_child.c:182
> >> > 2025-12-03 14:20:50.189: pcp_main pid 1085087: LOG:  restart request received in pcp child process
> >> > 2025-12-03 14:20:50.189: pcp_main pid 1085087: LOCATION:  pcp_child.c:173
> >> > 2025-12-03 14:20:50.193: main pid 1085028: LOG:  PCP child 1085087 exits with status 0 in failover()
> >> > 2025-12-03 14:20:50.193: main pid 1085028: LOCATION:  pgpool_main.c:4850
> >> > 2025-12-03 14:20:50.193: main pid 1085028: LOG:  fork a new PCP child pid 1085089 in failover()
> >> > 2025-12-03 14:20:50.193: main pid 1085028: LOCATION:  pgpool_main.c:4854
> >> > 2025-12-03 14:20:50.193: pcp_main pid 1085089: LOG:  PCP process: 1085089 started
> >> > 2025-12-03 14:20:50.193: pcp_main pid 1085089: LOCATION:  pcp_child.c:165
> >> > 2025-12-03 14:20:50.194: sr_check_worker pid 1085090: LOG:  process started
> >> > 2025-12-03 14:20:50.194: sr_check_worker pid 1085090: LOCATION:  pgpool_main.c:905
> >> > 2025-12-03 14:22:31.460: pcp_main pid 1085089: LOG:  forked new pcp worker, pid=1085093 socket=7
> >> > 2025-12-03 14:22:31.460: pcp_main pid 1085089: LOCATION:  pcp_child.c:327
> >> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOG:  PCP process with pid: 1085093 exit with SUCCESS.
> >> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOCATION:  pcp_child.c:384
> >> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOG:  PCP process with pid: 1085093 exits with status 0
> >> > 2025-12-03 14:22:31.721: pcp_main pid 1085089: LOCATION:  pcp_child.c:398
> >> > 2025-12-03 14:25:39.480: child pid 1085050: LOG:  failover or failback event detected
> >> > 2025-12-03 14:25:39.480: child pid 1085050: DETAIL:  restarting myself
> >> > 2025-12-03 14:25:39.480: child pid 1085050: LOCATION:  child.c:1524
> >> > 2025-12-03 14:25:39.480: child pid 1085038: LOG:  failover or failback event detected
> >> > 2025-12-03 14:25:39.481: child pid 1085038: DETAIL:  restarting myself
> >> > 2025-12-03 14:25:39.481: child pid 1085038: LOCATION:  child.c:1524
> >> > 2025-12-03 14:25:39.481: child pid 1085035: LOG:  failover or failback event detected
> >> > 2025-12-03 14:25:39.481: child pid 1085035: DETAIL:  restarting myself
> >> > 2025-12-03 14:25:39.481: child pid 1085035: LOCATION:  child.c:1524
> >> > 2025-12-03 14:25:39.481: child pid 1085061: LOG:  failover or failback event detected
> >> > 2025-12-03 14:25:39.481: child pid 1085061: DETAIL:  restarting myself
> >> > 2025-12-03 14:25:39.481: child pid 1085061: LOCATION:  child.c:1524
> >> > 2025-12-03 14:25:39.483: child pid 1085053: LOG:  failover or failback event detected
> >> > 2025-12-03 14:25:39.483: child pid 1085053: DETAIL:  restarting myself
> >> > 2025-12-03 14:25:39.483: child pid 1085053: LOCATION:  child.c:1524
> >> > 2025-12-03 14:25:39.483: child pid 1085059: LOG:  failover or failback event detected
> >> >
> >> > ......over and over and over again.
> >> >
> >> >
> >> >
> >> >
> >> > pcp_node_info output:
> >> >
> >> > 10.6.1.199 5432 1 0.500000 waiting up primary primary 0 none none 2025-12-03 14:04:39
> >> >
> >> > 10.6.1.200 5432 1 0.500000 waiting up standby standby 0 streaming async 2025-12-03 14:04:39
> >> >
> >> > Logs show:
> >> >
> >> > node status[0]: 1
> >> >
> >> > node status[1]: 2
> >> >
> >> > Node 0 (primary) gets status 1 (waiting), node 1 (standby) gets status 2
> >> > (up).
> >>
> >> No, this does not show the backend status. Instead, it says
> >>
> >> > node status[0]: 1
> >>
> >> This means backend 0 is primary.
> >>
> >> > node status[1]: 2
> >>
> >> This means backend 1 is standby.
> >>
> >> > *auto_failback behavior:*
> >> >
> >> >    - When a node is detached (pcp_detach_node), it goes to status 3 (down)
> >> >    - auto_failback triggers and moves it to status 1 (waiting)
> >> >    - Node never transitions from waiting to up
> >>
> >> Sounds like pgpool has not received queries.
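
For reference, the numeric status in the third column of pcp_node_info can be mapped to a name as in this minimal sketch. The meanings follow the pcp_node_info documentation as I understand it (0 used only during initialization, 1 up with no connections yet, 2 up with pooled connections, 3 down). The function name backend_status_name is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a pgpool backend status code to a short name.
# 0 = only used during initialization, 1 = up but no connections yet,
# 2 = up with pooled connections, 3 = down.
backend_status_name() {
  case "$1" in
    0) echo "unused" ;;
    1) echo "waiting" ;;
    2) echo "up" ;;
    3) echo "down" ;;
    *) echo "unknown" ;;
  esac
}
```

With this mapping, a node that reports 1 is reachable but has not yet served a client connection, which is different from 3 (down).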
> >>
> >> > *Key configuration:*
> >> >
> >> > backend_clustering_mode = 'streaming_replication'
> >> > backend_hostname0 = '10.6.1.199'
> >> > backend_hostname1 = '10.6.1.200'
> >> > backend_application_name0 = 'nasdw_users_1'
> >> > backend_application_name1 = 'nasdw_users_2'
> >> >
> >> > use_watchdog = on
> >> > # 3 watchdog nodes configured
> >> >
> >> > auto_failback = on
> >> > auto_failback_interval = 1
> >> >
> >> > sr_check_period = 10
> >> > sr_check_user = 'pgpool'
> >> > sr_check_database = 'nasdw_users'
> >> >
> >> > health_check_period = 1
> >> > health_check_user = 'pgpool'
> >> > health_check_database = 'nasdw_users'
> >> >
> >> > failover_when_quorum_exists = on (default)
> >> > failover_require_consensus = on (default)
> >> >
> >> > Cheers,
> >> > Adam
> >>
>

--000000000000a2f5260647327b11--