Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1upMwe-003nGD-4B for pgsql-admin@arkaria.postgresql.org; Fri, 22 Aug 2025 08:16:21 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1upMwc-003voX-Ta for pgsql-admin@arkaria.postgresql.org; Fri, 22 Aug 2025 08:16:19 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1upMwc-003vli-8s for pgsql-admin@lists.postgresql.org; Fri, 22 Aug 2025 08:16:19 +0000 Received: from cloud.gatewaynet.com ([185.90.37.94]) by makus.postgresql.org with smtp (Exim 4.96) (envelope-from ) id 1upMwY-0019id-1i for pgsql-admin@lists.postgresql.org; Fri, 22 Aug 2025 08:16:17 +0000 Content-Type: multipart/alternative; boundary="------------BvA70L79Zx5x5Hhf4K93KlP4" Message-ID: <1332203e-fe17-4800-aadc-4de4a93fc85d@cloud.gatewaynet.com> Date: Fri, 22 Aug 2025 09:16:11 +0100 MIME-Version: 1.0 Subject: The same again with 16.9 : was Re: PostgreSQL 16.6 , query stuck with STAT Ssl, wait_event_type : IPC , wait_event : ParallelFinish To: pgsql-admin@lists.postgresql.org References: <3049794.1748751598@sss.pgh.pa.us> <6be8d715-a9a7-4be8-8ebc-8b6bdb98da2e@cloud.gatewaynet.com> Content-Language: en-US From: Achilleas Mantzios In-Reply-To: <6be8d715-a9a7-4be8-8ebc-8b6bdb98da2e@cloud.gatewaynet.com> List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk This is a multi-part message in MIME format. --------------BvA70L79Zx5x5Hhf4K93KlP4 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable Hi , we had the same problem today again. postgres@[local]/dynacom=3D# select * from pg_stat_activity where=20 application_name~*'dbmirr'; -[ RECORD 1=20 ]----+-------------------------------------------------------------------= -------------------------------------------------------------------------= ------------------------------ -------------------------------------------------------------------------= -------------------------------------------------------------------------= -----------=20 datid =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0|= 207491653 datname =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| dynacom pid =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0| 1821681 leader_pid =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| usesysid =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| 10 usename =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| postgres application_name | DBMIRROR client_addr =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| 10.9.0.10 client_hostname =C2=A0| client_port =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| 45051 backend_start =C2=A0=C2=A0=C2=A0| 2025-08-22 03:58:32.321683+03 xact_start =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| 2025-08-22 04:06:08.8972= 52+03 query_start =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| 2025-08-22 04:06:09.254048+03 state_change =C2=A0=C2=A0=C2=A0=C2=A0| 2025-08-22 04:06:09.254049+03 wait_event_type =C2=A0| IPC wait_event =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| ParallelFinish state =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0|= active backend_xid =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| backend_xmin =C2=A0=C2=A0=C2=A0=C2=A0| 222705697 query_id =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| -3929522546936= 394707 query =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0|= SELECT pd.XID,MAX(SeqId) FROM dbmirror_Pending pd=20 LEFT JOIN dbmirror_MirroredTransaction mt INNER JOIN dbmirror_MirrorHost=20 mh ON mt.MirrorHostId =3D =C2=A0mh.MirrorHostId AND m h.HostName=3D '192.168.145.1' =C2=A0ON pd.XID =3D mt.XID WHERE mt.XID is = null and=20 (pd.slaveid is null or pd.slaveid =3D '4826') =C2=A0GROUP BY pd.XID =C2=A0= ORDER BY=20 MAX(pd.SeqId) backend_type =C2=A0=C2=A0=C2=A0=C2=A0| client backend postgres@smadb:~$ ps -u -p 1821681 USER =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0PID %CPU %MEM =C2=A0= =C2=A0=C2=A0VSZ =C2=A0=C2=A0RSS TTY =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0STAT ST= ART =C2=A0=C2=A0TIME COMMAND postgres 1821681 =C2=A00.5 =C2=A04.8 37111844 3177260 ? =C2=A0=C2=A0=C2=A0= Ssl =C2=A003:58 =C2=A0=C2=A02:25=20 postgres: postgres dynacom 10.9.0.10(45051) SELECT postgres@smadb:~$ Again the process is stuck with this Ssl state. Again, postgres@smadb:~$ strace -p 1821681 strace: Process 1821681 attached epoll_wait(12, postgres@smadb:~$ sudo lsof -p 1821681 | egrep -e ' 12u| 3. | 4. ' postgres 1821681 postgres =C2=A0DEL =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0R= EG =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A00,1=20 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A033 /SYSV00041c60 postgres 1821681 postgres =C2=A0=C2=A0=C2=A03u =C2=A0a_inode =C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A00= ,15 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A00=20 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A01059 [signalfd] postgres 1821681 postgres =C2=A0=C2=A0=C2=A04r =C2=A0=C2=A0=C2=A0=C2=A0FI= FO =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A00,14 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A00t0=20 1292545666 pipe postgres 1821681 postgres =C2=A0=C2=A012u =C2=A0a_inode =C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A00,15 =C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A00=20 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A01059 [eventpoll:3,4] Again, pg_cancel_backend, pg_terminate_backend, kill pid (-TERM) did=20 nothing. Only -HUP killed it . 10.9.0.10(45051) [1821681] 68a7c0b8.1bcbf1 2025-08-22 11:09:32.517 EEST=20 DBMIRROR postgres@dynacom line:26 FATAL: =C2=A0terminating connection due= to=20 administrator command 10.9.0.10(45051) [1821681] 68a7c0b8.1bcbf1 2025-08-22 11:09:32.517 EEST=20 DBMIRROR postgres@dynacom line:27 STATEMENT: =C2=A0SELECT pd.XID,MAX(SeqI= d)=20 FROM dbmirror_Pending pd LEFT JOIN dbmirror _MirroredTransaction mt INNER JOIN dbmirror_MirrorHost mh ON=20 mt.MirrorHostId =3D =C2=A0mh.MirrorHostId AND mh.HostName=3D '192.168.145= .1' =C2=A0ON=20 pd.XID =3D mt.XID WHERE mt.XID is null and (pd.slaveid is null or pd.slaveid =3D '4826') =C2=A0GROUP BY pd.XID =C2=A0ORDER BY MA= X(pd.SeqId) 10.9.0.10(45051) [1821681] 68a7c0b8.1bcbf1 2025-08-22 11:09:32.519 EEST=20 DBMIRROR postgres@dynacom line:28 LOG: =C2=A0disconnection: session time:= =20 7:11:00.197 user=3Dpostgres database=3Ddynacom ho st=3D10.9.0.10 port=3D45051 The client is written in PERL (a variant of the old DBMIRROR perl), and=20 calls nothing but pure SQL, no perl or python functions. This program is called 38592times per day and the previous such incident=20 was on 1st June, so 2,392,704times between the two incidents. On 6/1/25 06:32, Achilleas Mantzios wrote: > > On 1/6/25 07:19, Tom Lane wrote: > >> Achilleas Mantzios writes: >>>> a query is stuck with the above, it seems it waits for parallel work= er >>>> to finish, however , there are no parallel works running : >> You didn't explain the subject about "STAT Ssl", but if you mean >> that that was what ps was showing for the backend process, there's >> something very wrong there. According to "man ps", the "l" means >> >> l is multi-threaded (using CLONE_THREAD, like NPTL = pthreads >> do) > Yes. sorry, I didn't include this info, you are spot on, yes this the=20 > output of ps aux . >> which is something that a Postgres backend should never be >> (in existing releases anyway). So I'm speculating that >> the process somehow became multi-threaded and then some >> wakeup signal went to the wrong thread. >> >> We've had issues with perl or python introducing multi-threading >> because of plperl or plpython functions doing things they >> probably shouldn't. Do you have any of those in your system? > > Yes we have two perl functions only that I'd be happy to get rid off : > > postgres@[local]/dynacom=3D# select p.proname, l.lanname from=20 > pg_language l, pg_proc p where p.prolang=3Dl.oid and l.lanname ~*=20 > '.*perl.*'; > proname =C2=A0| lanname > ----------+--------- > basename | plperlu > filetype | plperlu > (2 rows) > > Nothing used in the app, just some two utility functions to help us=20 > batch insert some attachments, guess mimetype etc. However the calling=20 > client is Perl , based on libpg-perl (not DBI), basically this is a=20 > descendant of DBMirror.pl (we are still using it). > > The strange thing is that we run pgsql 16.* since November, also we=20 > run our version of DBMirror since 2005 (and PostgreSQL since 2001) and=20 > we never had this problem before (at least from what I know). > >> regards, tom lane --------------BvA70L79Zx5x5Hhf4K93KlP4 Content-Type: text/html; charset=UTF-8 Content-Transfer-Encoding: quoted-printable

Hi ,

we had the same problem today again.

postgres@[loc= al]/dynacom=3D# select * from pg_stat_activity where application_name~*'dbmirr';
-[ RECORD 1 ]----+-------------------------------------------------------------------= -------------------------------------------------------------------------= ------------------------------
-------------------------------------------------------------------------= -------------------------------------------------------------------------= -----------
datid =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0| 207491653
datname =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| d= ynacom
pid =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0| 1821681
leader_pid =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| =C2=A0
usesysid =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| 10
usename =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| p= ostgres
application_name | DBMIRROR
client_addr =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| 10.9.0.10
client_hostname =C2=A0| =C2=A0
client_port =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| 45051
backend_start =C2=A0=C2=A0=C2=A0| 2025-08-22 03:58:32.321683+03
xact_start =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| 2025-08-22 04:06= :08.897252+03
query_start =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| 2025-08-22 04:06:09.2= 54048+03
state_change =C2=A0=C2=A0=C2=A0=C2=A0| 2025-08-22 04:06:09.254049= +03
wait_event_type =C2=A0| IPC
wait_event =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| ParallelFinish
state =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0| active
backend_xid =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| =C2=A0
backend_xmin =C2=A0=C2=A0=C2=A0=C2=A0| 222705697
query_id =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0| -39295= 22546936394707
query =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0| SELECT pd.XID,MAX(SeqId) FROM dbmirror_Pending pd LEFT JOIN dbmirror_MirroredTransaction mt INNER JOIN dbmirror_MirrorHost mh ON mt.MirrorHostId =3D =C2=A0mh.MirrorHostId AND m
h.HostName=3D '192.168.145.1' =C2=A0ON pd.XID =3D mt.XID WHERE mt= .XID is null and (pd.slaveid is null or pd.slaveid =3D '4826') =C2=A0GROU= P BY pd.XID =C2=A0ORDER BY MAX(pd.SeqId)
backend_type =C2=A0=C2=A0=C2=A0=C2=A0| client backend

= postgr= es@smadb:~$ ps -u -p 1821681
USER =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0PID %CPU %ME= M =C2=A0=C2=A0=C2=A0VSZ =C2=A0=C2=A0RSS TTY =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= STAT START =C2=A0=C2=A0TIME COMMAND
postgres 1821681 =C2=A00.5 =C2=A04.8 37111844 3177260 ? =C2=A0=C2= =A0=C2=A0Ssl =C2=A003:58 =C2=A0=C2=A02:25 postgres: postgres dynacom 10.9.0.10(45051) SELE= CT
postgr= es@smadb:~$

Again the process is stuck with this Ssl state.

Again,
= postgr= es@smadb:~$ strace -p 1821681
strace: Process 1821681 attached
epoll_wait(12,
=
= postgr= es@smadb:~$ sudo lsof -= p 1821681 | egrep -e ' 12u| 3. | 4. ' =C2=A0
postgres 1821681 postgres =C2=A0DEL =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0REG =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A00,1 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A033 /SYSV00041c60
postgres 1821681 postgres =C2=A0=C2=A0=C2=A03u =C2=A0a_inode =C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A00,15 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A00 =C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A01059 [signalfd]
postgres 1821681 postgres =C2=A0=C2=A0=C2=A04r =C2=A0=C2=A0=C2=A0= =C2=A0FIFO =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A00,14 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A00t0 1292545666 pipe
postgres 1821681 postgres =C2=A0=C2=A012u =C2=A0a_inode =C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= 0,15 =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A00 =C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A01059 [eventpoll:3,4]

Again, pg_cancel_backend, pg_terminate_backend, kill pid (-TERM) did nothing. Only -HUP killed it .

= 10.9.0.10(450= 51) [1821681] 68a7c0b8.1bcbf1 2025-08-22 11:09:32.517 EEST DBMIRROR postgres@dynacom line:26 FATAL: =C2=A0terminating connection due to administrator command
10.9.0.10(45051) [1821681] 68a7c0b8.1bcbf1 2025-08-22 11:09:32.517 EEST DBMIRROR postgres@dynacom line:27 STATEMENT: =C2=A0SELECT pd.XID,MAX(SeqId) FROM dbmirror_Pending pd LEFT JOIN dbmirror
_MirroredTransaction mt INNER JOIN dbmirror_MirrorHost mh ON mt.MirrorHostId =3D =C2=A0mh.MirrorHostId AND mh.HostName=3D '192.168.145.1' =C2=A0ON pd.XID =3D mt.XID WHERE mt.XID is null a= nd (pd.slaveid
is null or pd.slaveid =3D '4826') =C2=A0GROUP BY pd.XID =C2=A0ORD= ER BY MAX(pd.SeqId)
10.9.0.10(45051) [1821681] 68a7c0b8.1bcbf1 2025-08-22 11:09:32.519 EEST DBMIRROR postgres@dynacom line:28 LOG: =C2=A0disconnection: session time: 7:11:00.197 user=3Dpostgres database=3Ddynacom ho
st=3D10.9.0.10 port=3D45051


=
The client is written in PERL (a variant of the old DBMIRROR perl), and calls nothing but pure SQL, no perl or python functions.

This program is called 38592 times per day and the previous such incident was on 1st June, so 2,392,704 times between the two incidents.


On 6/1/25 06:32, Achilleas Mantzios wrote:

On 1/6/25 07:19, Tom Lane wrote:

Achilleas Mantzios <a.mantzios@cloud.gatewaynet.com><=
/a> writes:
a query is stuck with =
the above, it seems it waits for parallel worker=20
to finish, however , there are no parallel works running :
You didn't explain the sub=
ject about "STAT Ssl", but if you mean
that that was what ps was showing for the backend process, there's
something very wrong there.  According to "man ps", the "l" means

               l    is multi-threaded (using CLONE_THREAD, like NPTL pthr=
eads
                    do)
Yes. sorry, I didn't include this info, you are spot on, yes this the output of ps aux .
which is something that a =
Postgres backend should never be
(in existing releases anyway).  So I'm speculating that
the process somehow became multi-threaded and then some
wakeup signal went to the wrong thread.

We've had issues with perl or python introducing multi-threading
because of plperl or plpython functions doing things they
probably shouldn't.  Do you have any of those in your system?

Yes we have two perl functions only that I'd be happy to get rid off :

postgres@[l= ocal]/dynacom=3D# select p.proname, l.lanname from pg_language l, pg_proc p where p.prolang=3Dl.oid and l.lanname ~* '.*perl.*'; <= br> proname =C2=A0| lanname =C2=A0
----------+---------
basename | plperlu
filetype | plperlu
(2 rows)

Nothing used in the app, just some two utility functions to help us batch insert some attachments, guess mimetype etc. However the calling client is Perl , based on libpg-perl = (not DBI), basically this is a descendant of DBMirror.pl (we are still using it).

The strange thing is that we run pgsql 16.* since November, also we run our version of DBMirror since 2005 (and PostgreSQL since 2001) and we never had this problem before (at least from what I know).

			regards, tom lane
--------------BvA70L79Zx5x5Hhf4K93KlP4--