MIME-Version: 1.0
From: TV <tvfan2014@gmail.com>
Date: Wed, 16 Jul 2025 14:13:49 +0200
Message-ID: 
 <CAFjdVW8-2kJLr+bRzGXqs3Xg6cgq-hwDg9N5rg8cgnJZ6=WcOA@mail.gmail.com>
Subject: Poor load balancing performance in PGPool 4.6 on PG13,
 any config suggestions?
To: pgpool-general@lists.postgresql.org
Content-Type: multipart/alternative; boundary="000000000000f2618a063a0ad333"
Archived-At: 
 <https://www.postgresql.org/message-id/CAFjdVW8-2kJLr%2BbRzGXqs3Xg6cgq-hwDg9N5rg8cgnJZ6%3DWcOA%40mail.gmail.com>
Precedence: bulk

--000000000000f2618a063a0ad333
Content-Type: text/plain; charset="UTF-8"

Just to give a bit of background: we recently migrated from our old setup to
new physical servers and are running Ubuntu 24 with the latest version of
pgpool (4.6.2).  The migration went fairly well, but performance is no
better than on the old servers; frankly, it seems worse.  Could some of the
pgpool pros look over our config and recommend changes or tuning?  The
hardware is pretty beefy: 1TB of RAM and 80 logical cores (2 processors,
each with 20 physical cores and 40 virtual), so hardware doesn't seem to be
the problem.  Some 'highlights' from pgpool.conf follow; feel free to ask
for other settings if they'll help clear up the picture:

num_init_children = 3500
max_pool = 1
child_life_time = 0
child_max_connections = 0
connection_life_time = 500
client_idle_limit = 600
process_management_mode = dynamic
process_management_strategy = gentle
min_spare_children = 50
max_spare_children = 100
connection_cache = on
load_balance_mode = on
disable_load_balance_on_write = 'transaction'
statement_level_load_balance = on

This is a 4-node cluster running PG13, and backend_weight is set to 1 for
all 4 nodes.
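
As a rough sanity check on the settings above, here is a minimal sketch of
the connection arithmetic they imply.  It assumes the usual pgpool sizing
rule that num_init_children * max_pool should not exceed max_connections
minus superuser_reserved_connections on each backend.  The function names
are hypothetical, and the max_connections values in the usage lines are
placeholders, since our actual backend setting is not shown in this mail.

```python
def backend_connection_ceiling(num_init_children: int, max_pool: int) -> int:
    """Upper bound on connections pgpool may hold open to one backend."""
    return num_init_children * max_pool


def fits_backend(ceiling: int, max_connections: int,
                 superuser_reserved_connections: int = 3) -> bool:
    """True if the backend can accept the worst-case number of connections.

    superuser_reserved_connections defaults to 3, the PostgreSQL default.
    """
    return ceiling <= max_connections - superuser_reserved_connections


# Values copied from the pgpool.conf excerpt above.
ceiling = backend_connection_ceiling(num_init_children=3500, max_pool=1)
print("worst-case connections per backend:", ceiling)

# Hypothetical backend settings, for illustration only.
print("fits max_connections=4000:", fits_backend(ceiling, 4000))
print("fits max_connections=100:", fits_backend(ceiling, 100))
```

With process_management_mode = dynamic the number of running children can
sit well below this ceiling, so the figure is a worst case rather than the
steady state.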

Some of the errors we are seeing in pgpool logs:
2025-07-15 10:57:32: pid 2629089: CONTEXT:  while checking replication time lag
2025-07-15 10:57:32: pid 2629089: LOCATION:  pool_worker_child.c:644
2025-07-15 10:57:33: pid 3892376: LOG:  Error message from backend: DB node id: 2 message: "canceling statement due to conflict with recovery"
2025-07-15 10:57:33: pid 3892376: LOCATION:  pool_proto_modules.c:3226
2025-07-15 10:57:33: pid 3892376: FATAL:  unable to read data from DB node 2
2025-07-15 10:57:33: pid 3892376: DETAIL:  EOF encountered with backend
2025-07-15 10:57:33: pid 3892376: LOCATION:  pool_stream.c:274
2025-07-15 10:57:33: pid 2629004: LOG:  child process with pid: 3892376 exited with success and will not be restarted
2025-07-15 10:57:33: pid 2629004: LOCATION:  pgpool_main.c:2059

Also this:
2025-07-15 11:02:22: pid 3892505: ERROR:  unable to read data from DB node 2
2025-07-15 11:02:22: pid 3892505: DETAIL:  do not failover because failover_on_backend_error is off
2025-07-15 11:02:22: pid 3892505: LOCATION:  pool_stream.c:407
2025-07-15 11:02:22: pid 3892505: WARNING:  write on backend 2 failed with error :"Broken pipe"
2025-07-15 11:02:22: pid 3892505: DETAIL:  while trying to write data from offset: 0 wlen: 17
2025-07-15 11:02:22: pid 3892505: LOCATION:  pool_stream.c:714
2025-07-15 11:02:22: pid 3892505: WARNING:  write on backend 2 failed with error :"Broken pipe"
2025-07-15 11:02:22: pid 3892505: DETAIL:  while trying to write data from offset: 0 wlen: 5
2025-07-15 11:02:22: pid 3892505: LOCATION:  pool_stream.c:714


We saw this as well:
2025-07-15 11:05:12: pid 2629089: CONTEXT:  while checking replication time lag
2025-07-15 11:05:12: pid 2629089: LOCATION:  pool_worker_child.c:644
2025-07-15 11:05:19: pid 3891928: ERROR:  unable to read data from frontend
2025-07-15 11:05:19: pid 3891928: DETAIL:  socket read function returned -1
2025-07-15 11:05:19: pid 3891928: LOCATION:  pool_stream.c:414
2025-07-15 11:05:19: pid 3891928: LOG:  pool_send_and_wait: Error or notice message from backend: DB node id: 1 backend pid: 3938180 statement: "ABORT" message:
"terminating connection due to conflict with recovery"
2025-07-15 11:05:19: pid 3891928: LOCATION:  pool_proto_modules.c:3955
2025-07-15 11:05:19: pid 3891928: LOG:  pool_send_and_wait: Error or notice message from backend: DB node id: 2 backend pid: 3929256 statement: "ABORT" message:
"terminating connection due to conflict with recovery"
2025-07-15 11:05:19: pid 3891928: LOCATION:  pool_proto_modules.c:3955
2025-07-15 11:05:19: pid 3891928: LOG:  pool_send_and_wait: Error or notice message from backend: DB node id: 3 backend pid: 3929098 statement: "ABORT" message:
"terminating connection due to conflict with recovery"
2025-07-15 11:05:19: pid 3891928: LOCATION:  pool_proto_modules.c:3955
2025-07-15 11:05:19: pid 3891928: LOG:  pool_send_and_wait: Error or notice message from backend: DB node id: 0 backend pid: 3060000 statement: "ABORT" message:
"terminating connection due to idle-in-transaction timeout"
2025-07-15 11:05:19: pid 3891928: LOCATION:  pool_proto_modules.c:3955
2025-07-15 11:05:19: pid 3891928: WARNING:  write on backend 1 failed with error :"Broken pipe"
2025-07-15 11:05:19: pid 3891928: DETAIL:  while trying to write data from offset: 0 wlen: 5
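
To see whether one backend accounts for most of these messages, a minimal
sketch like the following could tally log lines per DB node.  The sample
lines are copied from the excerpts in this mail.  The regular expression
and the function name are my own assumptions, not anything pgpool ships,
and the pattern only covers the message shapes quoted here.

```python
import re
from collections import Counter

# Sample lines copied from the log excerpts in this mail.
SAMPLE_LOG = """\
2025-07-15 10:57:33: pid 3892376: LOG:  Error message from backend: DB node id: 2 message: "canceling statement due to conflict with recovery"
2025-07-15 10:57:33: pid 3892376: FATAL:  unable to read data from DB node 2
2025-07-15 11:02:22: pid 3892505: ERROR:  unable to read data from DB node 2
2025-07-15 11:02:22: pid 3892505: WARNING:  write on backend 2 failed with error :"Broken pipe"
2025-07-15 11:02:22: pid 3892505: WARNING:  write on backend 2 failed with error :"Broken pipe"
2025-07-15 11:05:19: pid 3891928: WARNING:  write on backend 1 failed with error :"Broken pipe"
"""

# Matches "DB node id: N", "DB node N" and "backend N" (hypothetical pattern).
NODE_RE = re.compile(r"(?:DB node id: |DB node |backend )(\d+)")


def tally_by_node(log_text: str) -> Counter:
    """Count log lines that mention a backend node, keyed by node id."""
    counts = Counter()
    for line in log_text.splitlines():
        match = NODE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


for node_id, count in sorted(tally_by_node(SAMPLE_LOG).items()):
    print(f"node {node_id}: {count} line(s)")
```

Run against the full pgpool log rather than this small sample, the same
tally would show whether node 2 really stands out or whether all standbys
are affected about equally.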

Some of these seem to suggest connectivity problems.  Is there anything you
can suggest we look into?  It's also worth noting that if we bypass the
pgpool VIP and connect the applications directly to the DB master node, no
problems are reported, so it does seem to be something in our pgpool
setup...

Any help will be much appreciated.

--000000000000f2618a063a0ad333--