From: Adam Blomeke
Date: Fri, 12 Dec 2025 16:25:59 -0500
Subject: Pgpool can't detect database status properly
To: pgpool-general@lists.postgresql.org

I'm resending this as it's been sitting in the moderation queue for a while. Possibly because I didn't have a subject line? Anyways, any help would be great. Thanks!

I'm setting up a pgpool cluster to replace a single node database in my environment. The single node is separate from the cluster at the moment. When it's time to implement the DB I'm going to redo the backup/restore, throw in an upgrade from pg15->18, and then bring the cluster up and have it take over the old IP.

*Environment:*

- pgpool-II version: 4.6.3 (chirikoboshi)
- PostgreSQL version: 18
- OS: RHEL9
- Cluster topology: 3 pgpool nodes (10.6.1.196, 10.6.1.197, 10.6.1.198) + 2 PostgreSQL nodes (10.6.1.199 primary, 10.6.1.200 standby)

*Issue:*

I have pgpool configured, and I've set it up using the scripts and config files from a different instance, one which has been running just fine for a year and a half or so. The issue I'm experiencing is that when I detach/reattach a node, it sits in waiting constantly. It never transitions to up. I have to manually change the status file to up to get pgpool to agree that it is, and when I try to drop the node it doesn't actually drop it. It just goes into waiting again. I also don't see any connection attempts from the pgpool server to the postgres nodes if I look at the postgres logs. I've confirmed that it can run the postgres commands from the command line. I've tried this both running pgpool as a service and running it directly from the command line. No difference in behavior.

Here's the log output:

2025-12-03 14:20:49.037: main pid 1085028: LOG: === Starting fail back. reconnect host 10.6.1.200(5432) ===
2025-12-03 14:20:49.037: main pid 1085028: LOCATION: pgpool_main.c:4169
2025-12-03 14:20:49.037: main pid 1085028: LOG: Node 0 is not down (status: 2)
2025-12-03 14:20:49.037: main pid 1085028: LOCATION: pgpool_main.c:1524
2025-12-03 14:20:49.038: main pid 1085028: LOG: Do not restart children because we are failing back node id 1 host: 10.6.1.200 port: 5432 and we are in streaming replication mode and not all backends were down
2025-12-03 14:20:49.038: main pid 1085028: LOCATION: pgpool_main.c:4370
2025-12-03 14:20:49.038: main pid 1085028: LOG: find_primary_node_repeatedly: waiting for finding a primary node
2025-12-03 14:20:49.038: main pid 1085028: LOCATION: pgpool_main.c:2896
2025-12-03 14:20:49.189: main pid 1085028: LOG: find_primary_node: primary node is 0
2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:2815
2025-12-03 14:20:49.189: main pid 1085028: LOG: find_primary_node: standby node is 1
2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:2821
2025-12-03 14:20:49.189: main pid 1085028: LOG: failover: set new primary node: 0
2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:4660
2025-12-03 14:20:49.189: main pid 1085028: LOG: failover: set new main node: 0
2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:4667
2025-12-03 14:20:49.189: main pid 1085028: LOG: === Failback done. reconnect host 10.6.1.200(5432) ===
2025-12-03 14:20:49.189: main pid 1085028: LOCATION: pgpool_main.c:4763
2025-12-03 14:20:49.189: sr_check_worker pid 1085088: LOG: worker process received restart request
2025-12-03 14:20:49.189: sr_check_worker pid 1085088: LOCATION: pool_worker_child.c:182
2025-12-03 14:20:50.189: pcp_main pid 1085087: LOG: restart request received in pcp child process
2025-12-03 14:20:50.189: pcp_main pid 1085087: LOCATION: pcp_child.c:173
2025-12-03 14:20:50.193: main pid 1085028: LOG: PCP child 1085087 exits with status 0 in failover()
2025-12-03 14:20:50.193: main pid 1085028: LOCATION: pgpool_main.c:4850
2025-12-03 14:20:50.193: main pid 1085028: LOG: fork a new PCP child pid 1085089 in failover()
2025-12-03 14:20:50.193: main pid 1085028: LOCATION: pgpool_main.c:4854
2025-12-03 14:20:50.193: pcp_main pid 1085089: LOG: PCP process: 1085089 started
2025-12-03 14:20:50.193: pcp_main pid 1085089: LOCATION: pcp_child.c:165
2025-12-03 14:20:50.194: sr_check_worker pid 1085090: LOG: process started
2025-12-03 14:20:50.194: sr_check_worker pid 1085090: LOCATION: pgpool_main.c:905
2025-12-03 14:22:31.460: pcp_main pid 1085089: LOG: forked new pcp worker, pid=1085093 socket=7
2025-12-03 14:22:31.460: pcp_main pid 1085089: LOCATION: pcp_child.c:327
2025-12-03 14:22:31.721: pcp_main pid 1085089: LOG: PCP process with pid: 1085093 exit with SUCCESS.
2025-12-03 14:22:31.721: pcp_main pid 1085089: LOCATION: pcp_child.c:384
2025-12-03 14:22:31.721: pcp_main pid 1085089: LOG: PCP process with pid: 1085093 exits with status 0
2025-12-03 14:22:31.721: pcp_main pid 1085089: LOCATION: pcp_child.c:398
2025-12-03 14:25:39.480: child pid 1085050: LOG: failover or failback event detected
2025-12-03 14:25:39.480: child pid 1085050: DETAIL: restarting myself
2025-12-03 14:25:39.480: child pid 1085050: LOCATION: child.c:1524
2025-12-03 14:25:39.480: child pid 1085038: LOG: failover or failback event detected
2025-12-03 14:25:39.481: child pid 1085038: DETAIL: restarting myself
2025-12-03 14:25:39.481: child pid 1085038: LOCATION: child.c:1524
2025-12-03 14:25:39.481: child pid 1085035: LOG: failover or failback event detected
2025-12-03 14:25:39.481: child pid 1085035: DETAIL: restarting myself
2025-12-03 14:25:39.481: child pid 1085035: LOCATION: child.c:1524
2025-12-03 14:25:39.481: child pid 1085061: LOG: failover or failback event detected
2025-12-03 14:25:39.481: child pid 1085061: DETAIL: restarting myself
2025-12-03 14:25:39.481: child pid 1085061: LOCATION: child.c:1524
2025-12-03 14:25:39.483: child pid 1085053: LOG: failover or failback event detected
2025-12-03 14:25:39.483: child pid 1085053: DETAIL: restarting myself
2025-12-03 14:25:39.483: child pid 1085053: LOCATION: child.c:1524
2025-12-03 14:25:39.483: child pid 1085059: LOG: failover or failback event detected

......over and over and over again.

pcp_node_info output:

10.6.1.199 5432 1 0.500000 waiting up primary primary 0 none none 2025-12-03 14:04:39
10.6.1.200 5432 1 0.500000 waiting up standby standby 0 streaming async 2025-12-03 14:04:39

Logs show:

node status[0]: 1
node status[1]: 2

Node 0 (primary) gets status 1 (waiting), node 1 (standby) gets status 2 (up).

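For anyone who wants to read the pcp_node_info lines field by field, here is a minimal Python sketch that splits them and flags nodes pgpool reports as "waiting" while the backend itself is reported "up". The field order and the status-code meanings (1 = up with no connections yet, 2 = up with pooled connections, 3 = down) are my reading of the pgpool-II documentation, so treat them as assumptions; the names `NodeInfo`, `parse_node_info` and `waiting_but_up` are made up for illustration.

```python
from dataclasses import dataclass

# Assumed meaning of the numeric status column (from my reading of the
# pgpool-II docs, not verified against the 4.6.3 source).
STATUS_MEANING = {
    0: "initializing",
    1: "up, no connections yet (shown as 'waiting')",
    2: "up, connections pooled (shown as 'up')",
    3: "down",
}


@dataclass
class NodeInfo:
    host: str
    port: int
    status_code: int
    lb_weight: float
    status_name: str         # status as pgpool sees it
    actual_status: str       # status of the backend itself
    role: str
    actual_role: str
    replication_delay: int
    replication_state: str
    sync_state: str
    last_status_change: str  # date and time, joined back together


def parse_node_info(line: str) -> NodeInfo:
    """Parse one whitespace-separated pcp_node_info line (assumed field order)."""
    parts = line.split()
    if len(parts) != 13:
        raise ValueError(f"expected 13 fields, got {len(parts)}: {line!r}")
    return NodeInfo(
        host=parts[0],
        port=int(parts[1]),
        status_code=int(parts[2]),
        lb_weight=float(parts[3]),
        status_name=parts[4],
        actual_status=parts[5],
        role=parts[6],
        actual_role=parts[7],
        replication_delay=int(parts[8]),
        replication_state=parts[9],
        sync_state=parts[10],
        last_status_change=" ".join(parts[11:13]),
    )


def waiting_but_up(node: NodeInfo) -> bool:
    """True when pgpool says 'waiting' (code 1) but the backend reports 'up'."""
    return node.status_code == 1 and node.actual_status == "up"


# The two lines quoted in this post.
SAMPLE_LINES = [
    "10.6.1.199 5432 1 0.500000 waiting up primary primary 0 none none 2025-12-03 14:04:39",
    "10.6.1.200 5432 1 0.500000 waiting up standby standby 0 streaming async 2025-12-03 14:04:39",
]

if __name__ == "__main__":
    for raw in SAMPLE_LINES:
        node = parse_node_info(raw)
        meaning = STATUS_MEANING.get(node.status_code, "unknown")
        print(f"{node.host}:{node.port} status={node.status_code} ({meaning}), "
              f"backend={node.actual_status}, waiting_but_up={waiting_but_up(node)}")
```

Under those assumptions, both lines above parse as status 1 with the backend reported up, which is the combination I keep seeing.
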
*auto_failback behavior:*

- When a node is detached (pcp_detach_node), it goes to status 3 (down)
- auto_failback triggers and moves it to status 1 (waiting)
- Node never transitions from waiting to up

*Key configuration:*

backend_clustering_mode = 'streaming_replication'
backend_hostname0 = '10.6.1.199'
backend_hostname1 = '10.6.1.200'
backend_application_name0 = 'nasdw_users_1'
backend_application_name1 = 'nasdw_users_2'

use_watchdog = on
# 3 watchdog nodes configured

auto_failback = on
auto_failback_interval = 1

sr_check_period = 10
sr_check_user = 'pgpool'
sr_check_database = 'nasdw_users'

health_check_period = 1
health_check_user = 'pgpool'
health_check_database = 'nasdw_users'

failover_when_quorum_exists = on (default)
failover_require_consensus = on (default)

Cheers,
Adam
