Yet another way for pg_ctl stop to fail on Windows

public inbox for [email protected]  
help / color / mirror / Atom feed

Yet another way for pg_ctl stop to fail on Windows
2+ messages / 2 participants
[nested] [flat]

* Yet another way for pg_ctl stop to fail on Windows
@ 2024-09-07 12:00 Alexander Lakhin <[email protected]>
  2024-09-07 18:11 ` Re: Yet another way for pg_ctl stop to fail on Windows Noah Misch <[email protected]>
  0 siblings, 1 reply; 2+ messages in thread

From: Alexander Lakhin @ 2024-09-07 12:00 UTC (permalink / raw)
  To: pgsql-hackers; Noah Misch <[email protected]>

Hello hackers,

While trying to reproduce a recent fairywren (a Windows animal) failure,
I ran amcheck/amcheck/003_cic_2pc in parallel inside a slowed-down
VM and came across another issue:
### Stopping node "CIC_2PC_test" using mode fast
# Running: pg_ctl -D C:\src\postgresql\build/testrun/amcheck_17/003_cic_2pc\data/t_003_cic_2pc_CIC_2PC_test_data/pgdata 
-m fast stop
waiting for server to shut down..... failed
pg_ctl: server does not shut down
# pg_ctl stop failed: 256
# Postmaster PID for node "CIC_2PC_test" is 6048
[08:24:52.915](12.792s) Bail out!  pg_ctl stop failed

So "pg_ctl stop" failed due to not a timeout, but some other reason.

With extra logging added, I got:
### Stopping node "CIC_2PC_test" using mode fast
# Running: pg_ctl -D C:\src\postgresql\build/testrun/amcheck_3/003_cic_2pc\data/t_003_cic_2pc_CIC_2PC_test_data/pgdata 
-m fast stop
waiting for server to shut down......!!!pgkill| GetLastError(): 231
postmaster (9596) died untimely? res: -1, errno: 22
  failed

Thus, CallNamedPipe() in pgkill() returned ERROR_PIPE_BUSY (All pipe
instances are busy) and it was handled as an unexpected error.
(The error code 231 returned 10 times out of 10 failures of this ilk for
me.)

Noah, what do you think of handling this error in line with handling of
ERROR_BROKEN_PIPE and ERROR_BAD_PIPE (which was done in 0ea1f2a3a)?

I tried the following change:
         switch (GetLastError())
         {
                 case ERROR_BROKEN_PIPE:
                 case ERROR_BAD_PIPE:
+               case ERROR_PIPE_BUSY:
and saw no issues.

The reason I'd like to bring your attention to the issue (if you don't
mind), is that it's impossible to understand the reason of such false
failure if it happens in the buildfarm/CI.

Best regards,
Alexander






^ permalink  raw  reply  [nested|flat] 2+ messages in thread

* Re: Yet another way for pg_ctl stop to fail on Windows
  2024-09-07 12:00 Yet another way for pg_ctl stop to fail on Windows Alexander Lakhin <[email protected]>
@ 2024-09-07 18:11 ` Noah Misch <[email protected]>
  0 siblings, 0 replies; 2+ messages in thread

From: Noah Misch @ 2024-09-07 18:11 UTC (permalink / raw)
  To: Alexander Lakhin <[email protected]>; +Cc: pgsql-hackers

On Sat, Sep 07, 2024 at 03:00:00PM +0300, Alexander Lakhin wrote:
> With extra logging added, I got:
> ### Stopping node "CIC_2PC_test" using mode fast
> # Running: pg_ctl -D C:\src\postgresql\build/testrun/amcheck_3/003_cic_2pc\data/t_003_cic_2pc_CIC_2PC_test_data/pgdata
> -m fast stop
> waiting for server to shut down......!!!pgkill| GetLastError(): 231
> postmaster (9596) died untimely? res: -1, errno: 22
>  failed
> 
> Thus, CallNamedPipe() in pgkill() returned ERROR_PIPE_BUSY (All pipe
> instances are busy) and it was handled as an unexpected error.
> (The error code 231 returned 10 times out of 10 failures of this ilk for
> me.)

Thanks for discovering that.

> Noah, what do you think of handling this error in line with handling of
> ERROR_BROKEN_PIPE and ERROR_BAD_PIPE (which was done in 0ea1f2a3a)?
> 
> I tried the following change:
>         switch (GetLastError())
>         {
>                 case ERROR_BROKEN_PIPE:
>                 case ERROR_BAD_PIPE:
> +               case ERROR_PIPE_BUSY:
> and saw no issues.

That would be a strict improvement over returning EINVAL like we do today.  We
do use PIPE_UNLIMITED_INSTANCES, so I expect the causes of ERROR_PIPE_BUSY are
process exit and ENOMEM-like situations.  While that change is the best thing
if the process is exiting, it could silently drop the signal in ENOMEM-like
situations.  Consider the following alternative.  If sig==0, just return 0
like you propose, because the process isn't completely gone.  Otherwise, sleep
and retry the signal, like pgwin32_open_handle() retries after certain errors.
What do you think of that?






^ permalink  raw  reply  [nested|flat] 2+ messages in thread

end of thread, other threads:[~2024-09-07 18:11 UTC | newest]

Thread overview: 2+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-09-07 12:00 Yet another way for pg_ctl stop to fail on Windows Alexander Lakhin <[email protected]>
2024-09-07 18:11 ` Noah Misch <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox