Re: [PATCH] Fix REPACK decoding worker not cleaned up on FATAL exit

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: [PATCH] Fix REPACK decoding worker not cleaned up on FATAL exit
4+ messages / 2 participants
[nested] [flat]

* Re: [PATCH] Fix REPACK decoding worker not cleaned up on FATAL exit
@ 2026-05-19 18:45  Alvaro Herrera <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Alvaro Herrera @ 2026-05-19 18:45 UTC (permalink / raw)
  To: Baji Shaik <[email protected]>; +Cc: [email protected]

On 2026-May-17, Baji Shaik wrote:

> v3 uses PG_ENSURE_ERROR_CLEANUP which:
> - Handles both ERROR and FATAL exits
> - Automatically cancels the callback on normal completion
>   (no slot leak)
> - Runs before memory contexts are destroyed (no use-after-free)

Yeah, looks good.  I have pushed it, with some comment wordsmithing and
other cosmetic changes.

While looking at it, I realized that I didn't like the way
stop_repack_decoding_worker() works, mainly because if there's no
handle, we leak everything else -- and the way we initialize things
means we leak the shared memory segment.  This is maybe a rare case and
just a small memory leak, but it seems better to do it nicely.  So
here's a followup patch that reworks that code.  This also forced me to
understand more clearly what is going on, so I rewrote the comments.

> - 20 REPACK (CONCURRENTLY) in same session completes without
>   slot exhaustion

FWIW I tested this by doing "repack (concurrently) foo \watch 0.1" and
letting it run for some time.  I happened to notice that if I have two
psqls running, one with the above and the second with the equivalent for
table bar, when they run together, each runs more quickly than when only
one of them is running.  I don't know what causes this; I suspect/assume
it's because the WAL messages for initial historic snapshot creation
from one gets the other running.

> I have not added a dedicated regression test for the
> pg_terminate_backend scenario yet, but I can write one using
> injection points if needed.

I don't feel a need for that.

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: [PATCH] Fix REPACK decoding worker not cleaned up on FATAL exit
@ 2026-05-20 07:54  Antonin Houska <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Antonin Houska @ 2026-05-20 07:54 UTC (permalink / raw)
  To: Alvaro Herrera <[email protected]>; +Cc: Baji Shaik <[email protected]>; [email protected]

Alvaro Herrera <[email protected]> wrote:

> On 2026-May-17, Baji Shaik wrote:
> 
> > v3 uses PG_ENSURE_ERROR_CLEANUP which:
> > - Handles both ERROR and FATAL exits
> > - Automatically cancels the callback on normal completion
> >   (no slot leak)
> > - Runs before memory contexts are destroyed (no use-after-free)
> 
> Yeah, looks good.  I have pushed it, with some comment wordsmithing and
> other cosmetic changes.
> 
> While looking at it, I realized that I didn't like the way
> stop_repack_decoding_worker() works, mainly because if there's no
> handle, we leak everything else -- and the way we initialize things
> means we leak the shared memory segment.  This is maybe a rare case and
> just a small memory leak, but it seems better to do it nicely.  So
> here's a followup patch that reworks that code.  This also forced me to
> understand more clearly what is going on, so I rewrote the comments.

The call of shm_mq_detach() got lost, or do you rely on dsm_detach() to call
shm_mq_detach_callback() ? The latter does not free ->mqh_buffer. Since each
REPACK runs in a separate transaction, I wouldn't consider that a leak, but I
still think that explicit call of shm_mq_detach() makes the code a bit easier
to read (i.e. no need for the developer to check if the detaching happens
automatically).


 	/*
-	 * If we could not cancel the current sleep due to ERROR, do that before
-	 * we detach from the shared memory the condition variable is located in.
-	 * If we did not, the bgworker ERROR handling code would try and fail
-	 * badly.
+	 * Now detach from our shared memory segment.  In error cases there might
+	 * still be messages from the worker in the queue, which ProcessInterrupts
+	 * would try to read; this is pointless (and causes an assertion failure),
+	 * so set the global pointer to NULL to have ProcessRepackMessages ignore
+	 * them.
 	 */
-	ConditionVariableCancelSleep();
-
-	dsm_detach(decoding_worker->seg);
+	dsmseg = decoding_worker->seg;
 	pfree(decoding_worker);
 	decoding_worker = NULL;
+
+	/* We must also cancel the current sleep, if one is still set up */
+	ConditionVariableCancelSleep();
+
+	if (dsmseg != NULL)
+		dsm_detach(dsmseg);


I suppose the reason for the assertion failure was reading from the queue
after the backend had detached from it? Thanks for fixing that.

-- 
Antonin Houska
Web: https://www.cybertec-postgresql.com






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: [PATCH] Fix REPACK decoding worker not cleaned up on FATAL exit
@ 2026-05-20 20:13  Alvaro Herrera <[email protected]>
  parent: Antonin Houska <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Alvaro Herrera @ 2026-05-20 20:13 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Baji Shaik <[email protected]>; [email protected]

On 2026-May-20, Antonin Houska wrote:

> Alvaro Herrera <[email protected]> wrote:

> The call of shm_mq_detach() got lost, or do you rely on dsm_detach() to call
> shm_mq_detach_callback() ? The latter does not free ->mqh_buffer.

Hmm, this I think this is a code documentation bug then, because the
comment for shm_mq_attach() says

 * If seg != NULL, the queue will be automatically detached when that dynamic
 * shared memory segment is detached.

I think it's strange, or buggy even, to say that "the queue is
automatically detached", but that you still have to call dsm_mq_detach()
afterwards.

I can put back the shm_mq_detach() call, of course.

> +	 * Now detach from our shared memory segment.  In error cases there might
> +	 * still be messages from the worker in the queue, which ProcessInterrupts
> +	 * would try to read; this is pointless (and causes an assertion failure),
> +	 * so set the global pointer to NULL to have ProcessRepackMessages ignore
> +	 * them.

> I suppose the reason for the assertion failure was reading from the queue
> after the backend had detached from it? Thanks for fixing that.

Yeah, that's exactly what happened.

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"La vida es para el que se aventura"






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: [PATCH] Fix REPACK decoding worker not cleaned up on FATAL exit
@ 2026-05-26 15:26  Alvaro Herrera <[email protected]>
  parent: Alvaro Herrera <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: Alvaro Herrera @ 2026-05-26 15:26 UTC (permalink / raw)
  To: Antonin Houska <[email protected]>; +Cc: Baji Shaik <[email protected]>; [email protected]

On 2026-May-20, Alvaro Herrera wrote:

> On 2026-May-20, Antonin Houska wrote:
> 
> > The call of shm_mq_detach() got lost, or do you rely on dsm_detach() to call
> > shm_mq_detach_callback() ? The latter does not free ->mqh_buffer.

> I can put back the shm_mq_detach() call, of course.

I did that and pushed.

Many thanks!

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"Los dioses no protegen a los insensatos.  Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

end of thread, other threads:[~2026-05-26 15:26 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-05-19 18:45 Re: [PATCH] Fix REPACK decoding worker not cleaned up on FATAL exit Alvaro Herrera <[email protected]>
2026-05-20 07:54 ` Antonin Houska <[email protected]>
2026-05-20 20:13   ` Alvaro Herrera <[email protected]>
2026-05-26 15:26     ` Alvaro Herrera <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox