public inbox for [email protected]  
help / color / mirror / Atom feed
Re: Yet another way for pg_ctl stop to fail on Windows
3+ messages / 2 participants
[nested] [flat]

* Re: Yet another way for pg_ctl stop to fail on Windows
@ 2024-09-08 15:00  Alexander Lakhin <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Alexander Lakhin @ 2024-09-08 15:00 UTC (permalink / raw)
  To: Noah Misch <[email protected]>; +Cc: pgsql-hackers

07.09.2024 21:11, Noah Misch wrote:

>
>> Noah, what do you think of handling this error in line with handling of
>> ERROR_BROKEN_PIPE and ERROR_BAD_PIPE (which was done in 0ea1f2a3a)?
>>
>> I tried the following change:
>>          switch (GetLastError())
>>          {
>>                  case ERROR_BROKEN_PIPE:
>>                  case ERROR_BAD_PIPE:
>> +               case ERROR_PIPE_BUSY:
>> and saw no issues.
> That would be a strict improvement over returning EINVAL like we do today.  We
> do use PIPE_UNLIMITED_INSTANCES, so I expect the causes of ERROR_PIPE_BUSY are
> process exit and ENOMEM-like situations.  While that change is the best thing
> if the process is exiting, it could silently drop the signal in ENOMEM-like
> situations.  Consider the following alternative.  If sig==0, just return 0
> like you propose, because the process isn't completely gone.  Otherwise, sleep
> and retry the signal, like pgwin32_open_handle() retries after certain errors.
> What do you think of that?

Thank you for your attention to the issue!

I agree with your approach. It looks like Microsoft recommends to loop on
ERROR_PIPE_BUSY: [1] (they say "Calling CallNamedPipe is equivalent to
calling the CreateFile ..." at [2]).

So if we aim to not only fix "pg_ctl stop", but to make pgkill() robust,
it's the way to go, IMHO. I'm not sure about an infinite loop they show,
I'd vote for a loop with the same characteristics as in
pgwin32_open_handle().

I've managed to trigger ERROR_PIPE_BUSY with "pg_ctl reload", when running
20 TAP tests (see attached) in parallel (with 20 threads executing
$node->reload() in each), and with the kill() call inside do_reload()
repeated x100 (see the modification patch attached too):
...
# Running: pg_ctl -D .../099_test_pgkill\data/t_099_test_pgkill_node_data/pgdata reload
### Reloading node "node"
# Running: pg_ctl -D .../099_test_pgkill\data/t_099_test_pgkill_node_data/pgdata reload
[13:41:46.850](2.400s) # 18
server signaled
server signaled
server signaled
server signaled
server signaled
server signaled
server signaled
server signaled
!!!pgkill| GetLastError(): 231
pg_ctl: could not send reload signal (PID: 3912), iteration: 81, res: -1, errno: 22: Invalid argument
server signaled
[13:41:52.594](5.744s) # 19
...

[1] https://learn.microsoft.com/en-us/windows/win32/ipc/named-pipe-client
[2] https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-callnamedpipea

Best regards,
Alexander

Attachments:

  [text/x-patch] pgkill-debugging.patch (1.1K, 2-pgkill-debugging.patch)
  download | inline diff:
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index e7e878c22f..cab6eb9472 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -1167,12 +1174,17 @@ do_reload(void)
 		exit(1);
 	}
 
-	if (kill(pid, sig) != 0)
+for (int i = 1; i <= 100; i++)
+{
+int res, en;
+	if ((res = kill(pid, sig)) != 0)
 	{
-		write_stderr(_("%s: could not send reload signal (PID: %d): %m\n"),
-					 progname, (int) pid);
+en = errno;
+		write_stderr(_("%s: could not send reload signal (PID: %d), iteration: %d, res: %d, errno: %d: %m\n"),
+					 progname, (int) pid, i, res, en);
 		exit(1);
 	}
+}
 
 	print_msg(_("server signaled\n"));
 }
diff --git a/src/port/kill.c b/src/port/kill.c
index 412c2f19c1..b3d2c00796 100644
--- a/src/port/kill.c
+++ b/src/port/kill.c
@@ -70,6 +70,7 @@ pgkill(int pid, int sig)
 		return 0;
 	}
 
+int le = GetLastError();
 	switch (GetLastError())
 	{
 		case ERROR_BROKEN_PIPE:
@@ -89,6 +90,7 @@ pgkill(int pid, int sig)
 			errno = EPERM;
 			return -1;
 		default:
+fprintf(stderr, "!!!pgkill| GetLastError(): %d\n", le);
 			errno = EINVAL;		/* unexpected */
 			return -1;
 	}


  [application/x-perl] 099_test_pgkill.pl (1.5K, 3-099_test_pgkill.pl)
  download

^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Yet another way for pg_ctl stop to fail on Windows
@ 2024-09-08 16:53  Noah Misch <[email protected]>
  parent: Alexander Lakhin <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Noah Misch @ 2024-09-08 16:53 UTC (permalink / raw)
  To: Alexander Lakhin <[email protected]>; +Cc: pgsql-hackers

On Sun, Sep 08, 2024 at 06:00:00PM +0300, Alexander Lakhin wrote:
> 07.09.2024 21:11, Noah Misch wrote:

> > > Noah, what do you think of handling this error in line with handling of
> > > ERROR_BROKEN_PIPE and ERROR_BAD_PIPE (which was done in 0ea1f2a3a)?
> > > 
> > > I tried the following change:
> > >          switch (GetLastError())
> > >          {
> > >                  case ERROR_BROKEN_PIPE:
> > >                  case ERROR_BAD_PIPE:
> > > +               case ERROR_PIPE_BUSY:
> > > and saw no issues.
> > That would be a strict improvement over returning EINVAL like we do today.  We
> > do use PIPE_UNLIMITED_INSTANCES, so I expect the causes of ERROR_PIPE_BUSY are
> > process exit and ENOMEM-like situations.  While that change is the best thing
> > if the process is exiting, it could silently drop the signal in ENOMEM-like
> > situations.  Consider the following alternative.  If sig==0, just return 0
> > like you propose, because the process isn't completely gone.  Otherwise, sleep
> > and retry the signal, like pgwin32_open_handle() retries after certain errors.
> > What do you think of that?

> I agree with your approach. It looks like Microsoft recommends to loop on
> ERROR_PIPE_BUSY: [1] (they say "Calling CallNamedPipe is equivalent to
> calling the CreateFile ..." at [2]).

I see Microsoft suggests WaitNamedPipeA() as opposed to just polling.
WaitNamedPipeA() should be more responsive.  Given how rare this has been, it
likely doesn't matter whether we use WaitNamedPipeA() or polling.  I'd lean
toward whichever makes the code simpler, probably polling.

> So if we aim to not only fix "pg_ctl stop", but to make pgkill() robust,
> it's the way to go, IMHO. I'm not sure about an infinite loop they show,
> I'd vote for a loop with the same characteristics as in
> pgwin32_open_handle().

I agree with bounding the total time of each kill(), like
pgwin32_open_handle() does for open().

> [1] https://learn.microsoft.com/en-us/windows/win32/ipc/named-pipe-client
> [2] https://learn.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-callnamedpipea






^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Yet another way for pg_ctl stop to fail on Windows
@ 2026-04-26 15:00  Alexander Lakhin <[email protected]>
  parent: Noah Misch <[email protected]>
  0 siblings, 0 replies; 3+ messages in thread

From: Alexander Lakhin @ 2026-04-26 15:00 UTC (permalink / raw)
  To: Noah Misch <[email protected]>; +Cc: pgsql-hackers

Hello Noah,

08.09.2024 19:53, Noah Misch wrote:
> I see Microsoft suggests WaitNamedPipeA() as opposed to just polling.
> WaitNamedPipeA() should be more responsive.  Given how rare this has been, it
> likely doesn't matter whether we use WaitNamedPipeA() or polling.  I'd lean
> toward whichever makes the code simpler, probably polling.
>
>> So if we aim to not only fix "pg_ctl stop", but to make pgkill() robust,
>> it's the way to go, IMHO. I'm not sure about an infinite loop they show,
>> I'd vote for a loop with the same characteristics as in
>> pgwin32_open_handle().
> I agree with bounding the total time of each kill(), like
> pgwin32_open_handle() does for open().

While investigating a recent buildfarm failure [1] (which also happened
before: [2]):
# # --- C:/prog/bf/root/HEAD/pgsql/src/test/regress/expected/misc_functions.out 2026-04-08 01:57:31.191354200 +0000
# # +++ C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/results/misc_functions.out 2026-04-24 
02:45:42.446717500 +0000
# # @@ -316,9 +316,10 @@
# #  -- permissions are set properly.
# #  --
# #  SELECT pg_log_backend_memory_contexts(pg_backend_pid());
# # +WARNING:  could not send signal to process 3940: Invalid argument
# #   pg_log_backend_memory_contexts
# #  --------------------------------
# # - t
# # + f
# #  (1 row)
# #
# #  SELECT pg_log_backend_memory_contexts(pid) FROM pg_stat_activity
# # @@ -347,9 +348,10 @@
# #
# #  SET ROLE regress_log_memory;
# #  SELECT pg_log_backend_memory_contexts(pg_backend_pid());
# # +WARNING:  could not send signal to process 3940: Invalid argument
# #   pg_log_backend_memory_contexts
# #  --------------------------------
# # - t
# # + f
# #  (1 row)
# #
# #  RESET ROLE;
# # 1 of 248 tests failed.
# # The differences that caused some tests to fail can be viewed in the file 
"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/regression.diffs".
# # A copy of the test summary that you see above is saved in the file 
"C:/prog/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/002_pg_upgrade/data/regression.out".
# ------------------------------------
# Looks like you failed 1 test of 20.

I've managed to reproduced it locally, by running five
pg_upgrade/002_pg_upgrade tests (with
"SELECT pg_log_backend_memory_contexts(pg_backend_pid());" x100 in
misc_functions.sql) concurrently on one-core VM:
iteration 3
1/5 postgresql:pg_upgrade_5 / pg_upgrade_5/002_pg_upgrade OK              184.42s   5 subtests passed
2/5 postgresql:pg_upgrade_1 / pg_upgrade_1/002_pg_upgrade ERROR           184.79s   exit status 1

pg_upgrade_1\002_pg_upgrade\log\002_pg_upgrade_old_node.log:
2026-04-26 13:13:44.238 UTC client backend[688] pg_regress/misc_functions WARNING:  could not send signal to process 
688: Invalid argument

With the logging added inside pgkill():
@@ -89,6 +93,9 @@ pgkill(int pid, int sig)
                         errno = EPERM;
                         return -1;
                 default:
+#ifndef FRONTEND
+elog(LOG, "!!!pgkill| CallNamedPipe() returned %d", GetLastError());
+#endif
                         errno = EINVAL;         /* unexpected */
                         return -1;
         }

and backtrace_functions = 'pgkill', I could see
2026-04-26 13:13:44.237 UTC client backend[688] pg_regress/misc_functions LOG:  !!!pgkill| CallNamedPipe() returned 231
2026-04-26 13:13:44.237 UTC client backend[688] pg_regress/misc_functions BACKTRACE:
     pgkill+0x15f [0x7ff6c05279ef] [C:\src\postgresql\src\port\kill.c:103]
     SendProcSignal+0xb1 [0x7ff6c03202e1] [C:\src\postgresql\src\backend\storage\ipc\procsignal.c:341]
     pg_log_backend_memory_contexts+0x7b [0x7ff6c040df1b] [C:\src\postgresql\src\backend\utils\adt\mcxtfuncs.c:301]
     ExecInterpExpr+0x684 [0x7ff6c00ec264] [C:\src\postgresql\src\backend\executor\execExprInterp.c:630]
...

That is, pg_log_backend_memory_contexts(() fails due to the same
ERROR_PIPE_BUSY.

Moreover, I can see other instances of this error in the log, which go
unnoticed:
2026-04-26 13:12:19.049 UTC client backend[2948] pg_regress/subselect LOG:  !!!pgkill| CallNamedPipe() returned 231
2026-04-26 13:12:19.049 UTC client backend[2948] pg_regress/subselect BACKTRACE:
     pgkill+0x15f [0x7ff6c05279ef] [C:\src\postgresql\src\port\kill.c:103]
     SendProcSignal+0xb1 [0x7ff6c03202e1] [C:\src\postgresql\src\backend\storage\ipc\procsignal.c:341]
     SICleanupQueue+0x1e1 [0x7ff6c03252b1] [C:\src\postgresql\src\backend\storage\ipc\sinvaladt.c:677]
     SIInsertDataEntries+0x91 [0x7ff6c0325511] [C:\src\postgresql\src\backend\storage\ipc\sinvaladt.c:411]
     AtEOXact_Inval+0x86 [0x7ff6c04b0636] [C:\src\postgresql\src\backend\utils\cache\inval.c:1225]
     CommitTransaction+0x2b4 [0x7ff6bffa5604] [C:\src\postgresql\src\backend\access\transam\xact.c:2477]
     CommitTransactionCommandInternal+0x8e [0x7ff6bffa588e] [C:\src\postgresql\src\backend\access\transam\xact.c:3253]
     CommitTransactionCommand+0x9 [0x7ff6bffa57e9] [C:\src\postgresql\src\backend\access\transam\xact.c:3213]
     RemoveTempRelationsCallback+0x6a [0x7ff6bfff51fa] [C:\src\postgresql\src\backend\catalog\namespace.c:4710]
     proc_exit_prepare+0xc6 [0x7ff6c03178d6] [C:\src\postgresql\src\backend\storage\ipc\ipc.c:199]
     proc_exit+0x56 [0x7ff6c03177c6] [C:\src\postgresql\src\backend\storage\ipc\ipc.c:155]
     PostgresMain+0xd9a [0x7ff6c034e2ea] [C:\src\postgresql\src\backend\tcop\postgres.c:5091]
     BackendMain+0xd5 [0x7ff6c034acb5] [C:\src\postgresql\src\backend\tcop\backend_startup.c:124]
...
2026-04-26 13:12:45.010 UTC client backend[8104] pg_regress/replica_identity LOG:  !!!pgkill| CallNamedPipe() returned 231
2026-04-26 13:12:45.010 UTC client backend[8104] pg_regress/replica_identity BACKTRACE:
     pgkill+0x15f [0x7ff6c05279ef] [C:\src\postgresql\src\port\kill.c:103]
     SendProcSignal+0xb1 [0x7ff6c03202e1] [C:\src\postgresql\src\backend\storage\ipc\procsignal.c:341]
     SICleanupQueue+0x1e1 [0x7ff6c03252b1] [C:\src\postgresql\src\backend\storage\ipc\sinvaladt.c:677]
     SIInsertDataEntries+0x91 [0x7ff6c0325511] [C:\src\postgresql\src\backend\storage\ipc\sinvaladt.c:411]
     AtInplace_Inval+0x67 [0x7ff6c04b0727] [C:\src\postgresql\src\backend\utils\cache\inval.c:1270]
     heap_inplace_update_and_unlock+0x266 [0x7ff6bff2a366] [C:\src\postgresql\src\backend\access\heap\heapam.c:6591]
     systable_inplace_update_finish+0x20 [0x7ff6bff4c160] [C:\src\postgresql\src\backend\access\index\genam.c:904]
     index_update_stats+0x21e [0x7ff6bffefefe] [C:\src\postgresql\src\backend\catalog\index.c:2982]
     index_build+0x2fc [0x7ff6bffed2fc] [C:\src\postgresql\src\backend\catalog\index.c:3180]
     index_create+0xc05 [0x7ff6bffef035] [C:\src\postgresql\src\backend\catalog\index.c:1291]
     DefineIndex+0x12fc [0x7ff6c006544c] [C:\src\postgresql\src\backend\commands\indexcmds.c:1265]
     ProcessUtilitySlow+0x924 [0x7ff6c0358524] [C:\src\postgresql\src\backend\tcop\utility.c:1566]
     standard_ProcessUtility+0x9f0 [0x7ff6c0359ec0] [C:\src\postgresql\src\backend\tcop\utility.c:1078]
     ProcessUtility+0x70 [0x7ff6c0357aa0] [C:\src\postgresql\src\backend\tcop\utility.c:531]
     ProcessUtilitySlow+0x3a0 [0x7ff6c0357fa0] [C:\src\postgresql\src\backend\tcop\utility.c:1262]
     standard_ProcessUtility+0x9f0 [0x7ff6c0359ec0] [C:\src\postgresql\src\backend\tcop\utility.c:1078]
     ProcessUtility+0x70 [0x7ff6c0357aa0] [C:\src\postgresql\src\backend\tcop\utility.c:531]


[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2026-04-24%2001%3A26%3A26
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-09-26%2018%3A15%3A06

Best regards,
Alexander

^ permalink  raw  reply  [nested|flat] 3+ messages in thread


end of thread, other threads:[~2026-04-26 15:00 UTC | newest]

Thread overview: 3+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-09-08 15:00 Re: Yet another way for pg_ctl stop to fail on Windows Alexander Lakhin <[email protected]>
2024-09-08 16:53 ` Noah Misch <[email protected]>
2026-04-26 15:00   ` Alexander Lakhin <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox