debug a lockup

public inbox for [email protected]  

4+ messages / 3 participants

* debug a lockup
@ 2026-02-10 16:55 Scott Ribe <[email protected]>
  2026-02-10 17:37 ` Re: debug a lockup Tom Lane <[email protected]>
  2026-02-11 00:00 ` Re: debug a lockup Aislan Luiz Wendling <[email protected]>
  0 siblings, 2 replies; 4+ messages in thread

From: Scott Ribe @ 2026-02-10 16:55 UTC (permalink / raw)
  To: Pgsql-admin <[email protected]>

PostgreSQL appears locked up. A pgbench run that should have completed in a few seconds has been running for 14 hours. A psql invocation also locks up. No CPU usage shows in top.

I personally suspect infra issues (k8s pod, Pure block storage), but I'm getting pushback pointing the finger at PG. It's 18.1, and pgbench is the only client, FWIW.

Any way to introspect the current non-debug build to get a clue what's going on in there?

--
Scott Ribe
[email protected]
https://www.linkedin.com/in/scottribe/


* Re: debug a lockup
  2026-02-10 16:55 debug a lockup Scott Ribe <[email protected]>
@ 2026-02-10 17:37 ` Tom Lane <[email protected]>
  1 sibling, 0 replies; 4+ messages in thread

From: Tom Lane @ 2026-02-10 17:37 UTC (permalink / raw)
  To: Scott Ribe <[email protected]>; +Cc: Pgsql-admin <[email protected]>

Scott Ribe <[email protected]> writes:
> Any way to introspect the current non-debug build to get a clue what's going on in there?

Backend stack traces taken with gdb should yield at least some clue
even if you don't have debug symbols.
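
A minimal sketch of collecting such a trace non-interactively, assuming gdb is installed in the pod and the caller is allowed to attach to the backend (typically the same user or root; containers may also need ptrace permission). The helper names gdb_backtrace_command and backend_backtrace are hypothetical, not part of PostgreSQL or gdb.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb command line that prints one backtrace and exits."""
    # -p attaches to a running process, -batch exits once the commands
    # have run, and -ex executes a single gdb command ("bt" = backtrace).
    return ["gdb", "-p", str(pid), "-batch", "-ex", "bt"]


def backend_backtrace(pid, timeout=30):
    """Run gdb against a backend PID and return its combined output.

    Assumption: gdb is on PATH and attaching to the process is permitted.
    """
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout + result.stderr
```

Calling backend_backtrace(pid) once per stuck backend PID (as listed by ps) gives one trace per process; without debug symbols the frames still show function names from the exported symbol table where available.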

			regards, tom lane






* Re: debug a lockup
  2026-02-10 16:55 debug a lockup Scott Ribe <[email protected]>
@ 2026-02-11 00:00 ` Aislan Luiz Wendling <[email protected]>
  2026-02-11 00:12   ` Re: debug a lockup Scott Ribe <[email protected]>
  1 sibling, 1 reply; 4+ messages in thread

From: Aislan Luiz Wendling @ 2026-02-11 00:00 UTC (permalink / raw)
  To: Scott Ribe <[email protected]>; Pgsql-admin <[email protected]>

Hello,

Does it repeat on every run?

If possible, try to stop PostgreSQL gracefully.

Not working? Try an immediate stop (pg_ctl stop -m immediate). That is the last resort, since PostgreSQL has no separate "abort" mode.

If the PostgreSQL service does not stop, try to kill the pgbench process.
First try kill -15 <pgbench PID>, and if that does not work, kill -9 <pgbench PID>.

If nothing works, reboot the VM.

Open two terminals and start pgbench in one. In the other, run ps -ef | grep pgbench.

Find the parent process ID and run strace -f -p <PID> (the syntax may differ on your system; the point is to trace the process and its forks).

This shows which system call the process is waiting in. While the process is active, the output usually scrolls too fast to read; when it stops, the process is waiting for something.

Hope it helps.

ALW

* Re: debug a lockup
  2026-02-10 16:55 debug a lockup Scott Ribe <[email protected]>
  2026-02-11 00:00 ` Re: debug a lockup Aislan Luiz Wendling <[email protected]>
@ 2026-02-11 00:12   ` Scott Ribe <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: Scott Ribe @ 2026-02-11 00:12 UTC (permalink / raw)
  To: Aislan Luiz Wendling <[email protected]>; Tom Lane <[email protected]>; +Cc: Pgsql-admin <[email protected]>

OK, we figured it out--I think.

pgbench was stuck in restart_syscall(<...resuming interrupted read...

it was set to open 100 connections

there were ~20 pg sessions in idle, and the last one (highest pid) in auth

that one was in write to fd 2
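
The observations above (process state, where it is sleeping, what fd 2 points at) can also be read from /proc without attaching a tracer. A minimal sketch, assuming a Linux /proc filesystem and permission to read the target process's entries; describe_process is a hypothetical helper.

```python
import os


def describe_process(pid):
    """Return a small dict describing what a Linux process is doing.

    Assumption: Linux /proc layout. Fields stay None when unreadable.
    """
    base = f"/proc/{pid}"
    info = {"pid": pid, "state": None, "wchan": None, "stderr": None}

    # The "State:" line reads e.g. "S (sleeping)" or "D (disk sleep)".
    try:
        with open(f"{base}/status") as f:
            for line in f:
                if line.startswith("State:"):
                    info["state"] = line.split(":", 1)[1].strip()
                    break
    except OSError:
        pass

    # wchan names the kernel function the process is sleeping in, if any.
    try:
        with open(f"{base}/wchan") as f:
            info["wchan"] = f.read().strip() or None
    except OSError:
        pass

    # fd 2 is stderr; the link target shows a pipe, a pty, or a file.
    try:
        info["stderr"] = os.readlink(f"{base}/fd/2")
    except OSError:
        pass

    return info
```

A stderr value of the form pipe:[inode] on a sleeping backend would be consistent with a process waiting on whoever is supposed to read the other end of that pipe.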

So... This is running in Kubernetes. I was doing some load testing against a storage service (thus 100 connections). PG was launched manually in a bash session connected to the pod, in k9s. There were ~20 total bash sessions open in k9s across 15 nodes.

Theory: k9s glitched and stopped reading the piped file descriptor, the buffer filled, and PG blocked on the write. (I have seen prior evidence of less-than-perfect handling of output by k9s.) In particular, I had logging of connections on, so at auth it would have been writing to stderr.
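
The mechanism in this theory can be reproduced in isolation: a pipe has a finite kernel buffer, and once nobody drains it, the next write stalls. A minimal sketch on a Unix-like system; fill_pipe is a hypothetical helper, and non-blocking mode is used only so the demo reports the stall point instead of hanging the way a blocking writer would.

```python
import os


def fill_pipe(chunk_size=4096):
    """Write to a pipe nobody reads until the kernel buffer is full.

    Returns the number of bytes accepted before a write would block.
    """
    read_fd, write_fd = os.pipe()
    # Non-blocking mode turns "would hang forever" into BlockingIOError,
    # marking the point at which a blocking writer would stall.
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            written += os.write(write_fd, b"x" * chunk_size)
    except BlockingIOError:
        pass
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written
```

The returned value is the pipe's capacity on the machine where it runs (the Linux default is commonly 64 KiB). A server whose stderr is such a pipe can log only that much after the reader stops before its next write to fd 2 blocks.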

This happened in one of probably over 100 runs of the same test, so it is not readily reproducible, and I wanted to autopsy it before killing off the hung processes. Unless someone pokes a hole in my theory, at this point I think it is neither pgbench nor PG nor Pure/Portworx at fault.

--
Scott Ribe
[email protected]
https://www.linkedin.com/in/scottribe/
