public inbox for [email protected]help / color / mirror / Atom feed
[PATCH] postmaster: fix stale PM_STARTUP comment 8+ messages / 3 participants [nested] [flat]
* [PATCH] postmaster: fix stale PM_STARTUP comment @ 2026-04-15 13:57 Ayush Tiwari <[email protected]> 0 siblings, 1 reply; 8+ messages in thread From: Ayush Tiwari @ 2026-04-15 13:57 UTC (permalink / raw) To: pgsql-hackers Hi, The comment above the PM_STARTUP startup-process-failure case still says that there are no other processes running yet, so the postmaster can just exit. That no longer matches the current startup flow: PM_STARTUP may already have auxiliary processes running by that point. The attached patch updates that comment to describe the current behavior. Regards, Ayush Attachments: [application/octet-stream] 0001-postmaster-fix-stale-pm_startup-comment.patch (1.3K, 3-0001-postmaster-fix-stale-pm_startup-comment.patch) download | inline diff: From 972c14fb9134fdfd76ea6ebcf98a55a945bbc988 Mon Sep 17 00:00:00 2001 From: Ayush Tiwari <[email protected]> Date: Wed, 15 Apr 2026 18:54:56 +0530 Subject: [PATCH] postmaster: fix stale PM_STARTUP comment The comment above the PM_STARTUP startup-process-failure case still says there are no other processes running yet. That is no longer accurate: other postmaster children may already be running by this point. Adjust the comment to describe the current behavior more accurately. --- src/backend/postmaster/postmaster.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 6e0f41d2661e..d0b2e0bc8d2e 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -2307,8 +2307,9 @@ process_pm_child_exit(void) /* * Unexpected exit of startup process (including FATAL exit) - * during PM_STARTUP is treated as catastrophic. There are no - * other processes running yet, so we can just exit. + * during PM_STARTUP is treated as catastrophic. While other + * postmaster children may already be running, this leaves the + * system in an unrecoverable state requiring immediate exit. */ if (pmState == PM_STARTUP && StartupStatus != STARTUP_SIGNALED && -- 2.34.1 ^ permalink raw reply [nested|flat] 8+ messages in thread
* Re: [PATCH] postmaster: fix stale PM_STARTUP comment @ 2026-04-15 16:51 Heikki Linnakangas <[email protected]> parent: Ayush Tiwari <[email protected]> 0 siblings, 1 reply; 8+ messages in thread From: Heikki Linnakangas @ 2026-04-15 16:51 UTC (permalink / raw) To: Ayush Tiwari <[email protected]>; pgsql-hackers On 15/04/2026 16:57, Ayush Tiwari wrote: > Hi, > > The comment above the PM_STARTUP startup-process-failure case still says > that there are no other processes running yet, so the postmaster can just > exit. > > That no longer matches the current startup flow: PM_STARTUP may already > have auxiliary processes running by that point. The attached patch updates > that comment to describe the current behavior. Hmm, shouldn't the postmaster kill and wait for the auxiliary processes to exit first in that case? ISTM we need code changes here, not just comments. - Heikki ^ permalink raw reply [nested|flat] 8+ messages in thread
* Re: [PATCH] postmaster: fix stale PM_STARTUP comment @ 2026-04-15 18:01 Ayush Tiwari <[email protected]> parent: Heikki Linnakangas <[email protected]> 0 siblings, 1 reply; 8+ messages in thread From: Ayush Tiwari @ 2026-04-15 18:01 UTC (permalink / raw) To: Heikki Linnakangas <[email protected]>; +Cc: pgsql-hackers; [email protected] <[email protected]> Hi On Wed, 15 Apr 2026 at 22:21, Heikki Linnakangas <[email protected]> wrote: > On 15/04/2026 16:57, Ayush Tiwari wrote: > > Hi, > > > > The comment above the PM_STARTUP startup-process-failure case still says > > that there are no other processes running yet, so the postmaster can just > > exit. > > > > That no longer matches the current startup flow: PM_STARTUP may already > > have auxiliary processes running by that point. The attached patch > updates > > that comment to describe the current behavior. > > Hmm, shouldn't the postmaster kill and wait for the auxiliary processes > to exit first in that case? ISTM we need code changes here, not just > comments. > > - Heikki > > Yes, I agree, code change is required here. The proper thing is to route this through the existing crash-handling path so the postmaster SIGQUITs the aux children and waits for them to exit before terminating. I think the minimal change is: 1. Replace the ExitPostmaster(1) shortcut in the PM_STARTUP startup-failure case with HandleChildCrash(), which calls TerminateChildren(SIGQUIT) and transitions through the state machine. Set StartupStatus = STARTUP_CRASHED so the state machine does not try to reinitialize. 2. Let HandleFatalError() handle PM_STARTUP by transitioning to PM_WAIT_BACKENDS, instead of the current Assert(false). The state machine already handles STARTUP_CRASHED at PM_NO_CHILDREN ("shutting down due to startup process failure"), so the exit path is already correct once all children have drained. This issue was discussed in an older thread by Noah too, so, adding him in cc. I can send in a proper patch if you think this is the right way to go. Regards, Ayush ^ permalink raw reply [nested|flat] 8+ messages in thread
* Re: [PATCH] postmaster: fix stale PM_STARTUP comment @ 2026-04-17 13:47 Ayush Tiwari <[email protected]> parent: Ayush Tiwari <[email protected]> 0 siblings, 1 reply; 8+ messages in thread From: Ayush Tiwari @ 2026-04-17 13:47 UTC (permalink / raw) To: Heikki Linnakangas <[email protected]>; +Cc: pgsql-hackers; [email protected] <[email protected]> Hi, On Wed, 15 Apr 2026 at 23:31, Ayush Tiwari <[email protected]> wrote: > Hi > > On Wed, 15 Apr 2026 at 22:21, Heikki Linnakangas <[email protected]> wrote: > >> On 15/04/2026 16:57, Ayush Tiwari wrote: >> > Hi, >> > >> > The comment above the PM_STARTUP startup-process-failure case still says >> > that there are no other processes running yet, so the postmaster can >> just >> > exit. >> > >> > That no longer matches the current startup flow: PM_STARTUP may already >> > have auxiliary processes running by that point. The attached patch >> updates >> > that comment to describe the current behavior. >> >> Hmm, shouldn't the postmaster kill and wait for the auxiliary processes >> to exit first in that case? ISTM we need code changes here, not just >> comments. >> >> - Heikki >> >> > Yes, I agree, code change is required here. > > The proper thing is to > route this through the existing crash-handling path so the postmaster > SIGQUITs the aux children and waits for them to exit before terminating. > > I think the minimal change is: > > 1. Replace the ExitPostmaster(1) shortcut in the PM_STARTUP > startup-failure case with HandleChildCrash(), which calls > TerminateChildren(SIGQUIT) and transitions through the state > machine. Set StartupStatus = STARTUP_CRASHED so the state > machine does not try to reinitialize. > > 2. Let HandleFatalError() handle PM_STARTUP by transitioning to > PM_WAIT_BACKENDS, instead of the current Assert(false). > > The minimal fix turned out to be smaller than I first described, the existing paragraph immediately below the ExitPostmaster(1) block already handles !EXIT_STATUS_0 with StartupStatus != STARTUP_SIGNALED correctly (sets STARTUP_CRASHED and HandleChildCrash). So, likely fix would be: 1. Deleting the PM_STARTUP ExitPostmaster(1) shortcut, and letting execution fall through to the next stanza. 2. Replacing the Assert(false) for PM_STARTUP in HandleFatalError() with a fall-through to UpdatePMState(PM_WAIT_BACKENDS). Verification that I did for patch: On a fresh initdb'd cluster, I zeroed out the first WAL segment to force the startup process to FATAL at StartupXLOG, then ran PG in foreground under strace. Before (master): LOG: startup process (PID N) exited with exit code 1 LOG: aborting startup due to startup process failure LOG: database system is shut down strace of the postmaster PID shows 0 kill() calls to children before exit_group(1). Checkpointer, bgwriter and io workers were running at the time of the failure and were orphaned. After (patched): LOG: startup process (PID N) exited with exit code 1 LOG: terminating any other active server processes <state transitions PM_STARTUP -> PM_WAIT_BACKENDS -> PM_WAIT_DEAD_END -> PM_NO_CHILDREN> LOG: shutting down due to startup process failure LOG: database system is shut down strace shows 8 SIGQUIT deliveries (4 children, each signaled by PID and by process-group) before the postmaster's own exit_group(1). I've attached a patch, please review and let me know your thoughts. Regards, Ayush Attachments: [application/octet-stream] 0001-postmaster-drain-aux-processes-on-startup-process-fa.patch (3.2K, 3-0001-postmaster-drain-aux-processes-on-startup-process-fa.patch) download | inline diff: From 6267a416cc773cc88fe17322908961128764c254 Mon Sep 17 00:00:00 2001 From: Ayush Tiwari <[email protected]> Date: Fri, 17 Apr 2026 18:49:34 +0530 Subject: [PATCH] postmaster: drain aux processes on startup-process failure during PM_STARTUP When the startup process exits with a non-zero status during PM_STARTUP, the postmaster called ExitPostmaster(1) immediately. But by the time PM_STARTUP is active, checkpointer, bgwriter, io workers, and BgWorkerStart_PostmasterStart background workers may already be running. Exiting immediately orphaned them. Route this path through the existing crash-handling machinery: fall through to the following stanza which sets StartupStatus = STARTUP_CRASHED and calls HandleChildCrash(), causing HandleFatalError() to SIGQUIT the aux children and transition to PM_WAIT_BACKENDS. The state machine then drains through PM_WAIT_DEAD_END to PM_NO_CHILDREN, where the existing STARTUP_CRASHED check logs 'shutting down due to startup process failure' and calls ExitPostmaster(1). Also replace the Assert(false) for PM_STARTUP in HandleFatalError() with a transition to PM_WAIT_BACKENDS. The assert was a latent bug: any aux process crash during PM_STARTUP (not just startup-process failure) would reach it via HandleChildCrash -> HandleFatalError. --- src/backend/postmaster/postmaster.c | 29 +++++------------------------ 1 file changed, 5 insertions(+), 24 deletions(-) diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index b6fd332f196..01df0f634e3 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -2305,26 +2305,10 @@ process_pm_child_exit(void) } /* - * Unexpected exit of startup process (including FATAL exit) - * during PM_STARTUP is treated as catastrophic. There are no - * other processes running yet, so we can just exit. - */ - if (pmState == PM_STARTUP && - StartupStatus != STARTUP_SIGNALED && - !EXIT_STATUS_0(exitstatus)) - { - LogChildExit(LOG, _("startup process"), - pid, exitstatus); - ereport(LOG, - (errmsg("aborting startup due to startup process failure"))); - ExitPostmaster(1); - } - - /* - * After PM_STARTUP, any unexpected exit (including FATAL exit) of - * the startup process is catastrophic, so kill other children, - * and set StartupStatus so we don't try to reinitialize after - * they're gone. Exception: if StartupStatus is STARTUP_SIGNALED, + * Any unexpected exit (including FATAL exit) of the startup + * process is catastrophic, so kill other children, and set + * StartupStatus so we don't try to reinitialize after they're + * gone. Exception: if StartupStatus is STARTUP_SIGNALED, * then we previously sent the startup process a SIGQUIT; so * that's probably the reason it died, and we do want to try to * restart in that case. @@ -2780,12 +2764,9 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt) /* shouldn't have any children */ Assert(false); break; - case PM_STARTUP: - /* should have been handled in process_pm_child_exit */ - Assert(false); - break; /* wait for children to die */ + case PM_STARTUP: case PM_RECOVERY: case PM_HOT_STANDBY: case PM_RUN: -- 2.34.1 ^ permalink raw reply [nested|flat] 8+ messages in thread
* Re: [PATCH] postmaster: fix stale PM_STARTUP comment @ 2026-04-17 22:54 Michael Paquier <[email protected]> parent: Ayush Tiwari <[email protected]> 0 siblings, 1 reply; 8+ messages in thread From: Michael Paquier @ 2026-04-17 22:54 UTC (permalink / raw) To: Ayush Tiwari <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; pgsql-hackers; [email protected] <[email protected]> On Fri, Apr 17, 2026 at 07:17:18PM +0530, Ayush Tiwari wrote: > I've attached a patch, please review and let me know your thoughts. /* - * Unexpected exit of startup process (including FATAL exit) - * during PM_STARTUP is treated as catastrophic. There are no - * other processes running yet, so we can just exit. - */ - if (pmState == PM_STARTUP && - StartupStatus != STARTUP_SIGNALED && - !EXIT_STATUS_0(exitstatus)) The assumption that only the startup process is running while we are in a PM_STARTUP state is wrong since 7ff23c6d277d, and this exit code path has been missed the fact that now the checkpointer and the bgwriter are started during recovery. So this means a backpatch. Removing the assertion is the right move, yes, there are other children, so again that's an issue with 7ff23c6d277d. I am planning to double-check the shutdown sequence while switching to PM_WAIT_BACKENDS under a PM_STARTUP, just note that bgworkers that use BgWorkerStart_PostmasterStart cannot request a database connection, meaning that PM_WAIT_BACKENDS should be pointeless, but perhaps it makes the whole shutdown flow easier to reason about, as you are suggesting, as making this path more complicated would lead to the addition of more postmaster states. Making this code simpler, not more complicated, is always useful. -- Michael Attachments: [application/pgp-signature] signature.asc (833B, 2-signature.asc) download ^ permalink raw reply [nested|flat] 8+ messages in thread
* Re: [PATCH] postmaster: fix stale PM_STARTUP comment @ 2026-04-20 06:41 Ayush Tiwari <[email protected]> parent: Michael Paquier <[email protected]> 0 siblings, 1 reply; 8+ messages in thread From: Ayush Tiwari @ 2026-04-20 06:41 UTC (permalink / raw) To: Michael Paquier <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; pgsql-hackers; [email protected] <[email protected]> Hi, Thanks a lot for the review!. On Sat, 18 Apr 2026 at 04:24, Michael Paquier <[email protected]> wrote: > On Fri, Apr 17, 2026 at 07:17:18PM +0530, Ayush Tiwari wrote: > > I've attached a patch, please review and let me know your thoughts. > > /* > - * Unexpected exit of startup process (including FATAL exit) > - * during PM_STARTUP is treated as catastrophic. There are no > - * other processes running yet, so we can just exit. > - */ > - if (pmState == PM_STARTUP && > - StartupStatus != STARTUP_SIGNALED && > - !EXIT_STATUS_0(exitstatus)) > > The assumption that only the startup process is running while we are > in a PM_STARTUP state is wrong since 7ff23c6d277d, and this exit code > path has been missed the fact that now the checkpointer and the > bgwriter are started during recovery. So this means a backpatch. > Agreed, this is latent since 7ff23c6d277d (v15). I can prepare a back-branch versions patch for v15..master once the master patch shape is settled > > Removing the assertion is the right move, yes, there are other > children, so again that's an issue with 7ff23c6d277d. I am planning > to double-check the shutdown sequence while switching to > PM_WAIT_BACKENDS under a PM_STARTUP, just note that bgworkers that use > BgWorkerStart_PostmasterStart cannot request a database connection, > meaning that PM_WAIT_BACKENDS should be pointeless, but perhaps it > makes the whole shutdown flow easier to reason about, as you are > suggesting, as making this path more complicated would lead to the > addition of more postmaster states. Making this code simpler, not > more complicated, is always useful. > -- > Michael > Right, I believe at PM_STARTUP the only children we can possibly have are the auxiliary processes (checkpointer, bgwriter, walwriter (if applicable), IO workers) and BgWorkerStart_PostmasterStart bgworkers, none of which should hold backend slots. So, PM_WAIT_BACKENDS is functionally a no-op in this case and we transition straight through to PM_WAIT_DEAD_END. I considered short-circuiting directly to PM_WAIT_DEAD_END from PM_STARTUP, but routing through the same state path the rest of the crash-handling code uses keeps the state machine uniform and avoids a special case in HandleFatalError() / PostmasterStateMachine(). Making the code simpler, as you mentioned. Regards, Ayush ^ permalink raw reply [nested|flat] 8+ messages in thread
* Re: [PATCH] postmaster: fix stale PM_STARTUP comment @ 2026-04-21 01:00 Michael Paquier <[email protected]> parent: Ayush Tiwari <[email protected]> 0 siblings, 1 reply; 8+ messages in thread From: Michael Paquier @ 2026-04-21 01:00 UTC (permalink / raw) To: Ayush Tiwari <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; pgsql-hackers; [email protected] <[email protected]> On Mon, Apr 20, 2026 at 12:11:45PM +0530, Ayush Tiwari wrote: > Agreed, this is latent since 7ff23c6d277d (v15). I can prepare a > back-branch versions patch for v15..master once the master patch > shape is settled. No need to. I have poked at this problem a bit more, stracing my way as you did, and after more testing across v15~HEAD, I have applied it. For v15, a difference becomes necessary at HandleChildCrash(), or we would begin to fail the shutdown sequence should the startup process have the idea to PANIC. This maps with the changes in v18 and HEAD where this has been replaced by a switch/case. Another thing that I have spent a long time looking at is process_pm_child_exit() and the interference that this could generate for the startup process case, but here as well I did not spot any issue, so I think that we are in the clear. There was also a comment at the top of postmaster.c that incorrectly claimed that the checkpointer and the background writer were only started after switch to PM_RECOVERY, which was wrong. I have tweaked that while on it. That was a good catch overall. -- Michael Attachments: [application/pgp-signature] signature.asc (833B, 2-signature.asc) download ^ permalink raw reply [nested|flat] 8+ messages in thread
* Re: [PATCH] postmaster: fix stale PM_STARTUP comment @ 2026-04-21 05:03 Ayush Tiwari <[email protected]> parent: Michael Paquier <[email protected]> 0 siblings, 0 replies; 8+ messages in thread From: Ayush Tiwari @ 2026-04-21 05:03 UTC (permalink / raw) To: Michael Paquier <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; pgsql-hackers; [email protected] <[email protected]> On Tue, 21 Apr 2026 at 06:30, Michael Paquier <[email protected]> wrote: > On Mon, Apr 20, 2026 at 12:11:45PM +0530, Ayush Tiwari wrote: > > No need to. I have poked at this problem a bit more, stracing my way > as you did, and after more testing across v15~HEAD, I have applied it. > For v15, a difference becomes necessary at HandleChildCrash(), or we > would begin to fail the shutdown sequence should the startup process > have the idea to PANIC. This maps with the changes in v18 and HEAD > where this has been replaced by a switch/case. > > Another thing that I have spent a long time looking at is > process_pm_child_exit() and the interference that this could generate > for the startup process case, but here as well I did not spot any > issue, so I think that we are in the clear. > > There was also a comment at the top of postmaster.c that incorrectly > claimed that the checkpointer and the background writer were only > started after switch to PM_RECOVERY, which was wrong. I have tweaked > that while on it. > > That was a good catch overall. > > Thanks a lot Michael for looking into it and pushing it! Regards, Ayush ^ permalink raw reply [nested|flat] 8+ messages in thread
end of thread, other threads:[~2026-04-21 05:03 UTC | newest] Thread overview: 8+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2026-04-15 13:57 [PATCH] postmaster: fix stale PM_STARTUP comment Ayush Tiwari <[email protected]> 2026-04-15 16:51 ` Heikki Linnakangas <[email protected]> 2026-04-15 18:01 ` Ayush Tiwari <[email protected]> 2026-04-17 13:47 ` Ayush Tiwari <[email protected]> 2026-04-17 22:54 ` Michael Paquier <[email protected]> 2026-04-20 06:41 ` Ayush Tiwari <[email protected]> 2026-04-21 01:00 ` Michael Paquier <[email protected]> 2026-04-21 05:03 ` Ayush Tiwari <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox