public inbox for [email protected]  
help / color / mirror / Atom feed
[BUG] Race in online checksums launcher_exit()
3+ messages / 2 participants
[nested] [flat]

* [BUG] Race in online checksums launcher_exit()
@ 2026-04-19 20:09  Ayush Tiwari <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Ayush Tiwari @ 2026-04-19 20:09 UTC (permalink / raw)
  To: pgsql-hackers

Hi hackers,

While using the pg_enable_data_checksums() feature, I found a likely bug, a
race condition in  datachecksum_state.c's launcher_exit().

When pg_enable_data_checksums() is called twice before the first launcher
starts, two bg workers are registered (the code expects this).  The
redundant launcher exits early, but it's launcher_exit() callback
unconditionally clears the shared launcher_running flag and may call
SetDataChecksumsOff() -- even though it never owned the flag.

This allows a third pg_enable_data_checksums() call to launch another
launcher concurrently with the first (duplicate work, doubled I/O, spurious
warnings).  Worse, if the redundant launcher initialized after the winner
transitioned to inprogress-on, its exit handler calls
SetDataChecksumsOff(), silently aborting the enable operation.  (I have
not triggered the SetDataChecksumsOff part though calling out ad it can be
a likely scenario based on timing of workers)

Reproduced by firing three calls in quick succession:

  psql -c "SELECT pg_enable_data_checksums();" &
  psql -c "SELECT pg_enable_data_checksums();" &
  sleep 0.5
  psql -c "SELECT pg_enable_data_checksums();" &

Log shows two launchers processing databases concurrently:

  [2093292] LOG:  enabling data checksums requested
  [2093293] LOG:  already running, exiting
  [2093299] LOG:  enabling data checksums requested     -- third launcher
admitted
  [2093292] LOG:  processing database "postgres"
  [2093299] LOG:  processing database "postgres"        -- same DB,
concurrently
  [2093299] WARNING:  cannot set data checksums to "on", current state is
not "inprogress-on"

I think the process-local launcher_running flag exists for this purpose and
is already used for the worker-kill block, but the flag-clear and
state-revert blocks do not use it.

The attached patch returns early from launcher_exit() when the local flag
is false. Thoughts?

Regards,
Ayush


Attachments:

  [application/octet-stream] 0001-Fix-race-in-online-checksums-launcher_exit.patch (2.5K, 3-0001-Fix-race-in-online-checksums-launcher_exit.patch)
  download | inline diff:
From 336d5b671157f974cceef15385e394ca27dd58f2 Mon Sep 17 00:00:00 2001
From: Ayush Tiwari <[email protected]>
Date: Mon, 20 Apr 2026 00:52:00 +0530
Subject: [BUG] Fix race in online checksums launcher_exit()

When pg_enable_data_checksums() is called twice before the first
launcher starts, two launcher processes are registered.  The second
(redundant) launcher exits early after seeing launcher_running is
already set, but its launcher_exit() callback unconditionally clears
the shared DataChecksumState->launcher_running flag and may call
SetDataChecksumsOff().  This allows a third launcher to start
concurrently with the first, and can silently revert the cluster
checksum state to off while the first launcher is still working.

Fix by returning early from launcher_exit() when the process-local
launcher_running flag is false, indicating this process never claimed
the launcher role.
---
 src/backend/postmaster/datachecksum_state.c | 25 +++++++++++++--------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/src/backend/postmaster/datachecksum_state.c b/src/backend/postmaster/datachecksum_state.c
index 18797a8ee3d..76f5aa00f2b 100644
--- a/src/backend/postmaster/datachecksum_state.c
+++ b/src/backend/postmaster/datachecksum_state.c
@@ -887,17 +887,24 @@ launcher_exit(int code, Datum arg)
 {
 	abort_requested = false;
 
-	if (launcher_running)
+	/*
+	 * Only perform cleanup if we actually claimed the launcher role by
+	 * setting the shared launcher_running flag.  A redundant launcher that
+	 * found another launcher already running will have exited early without
+	 * setting the local launcher_running flag, and must not touch the shared
+	 * state owned by the active launcher.
+	 */
+	if (!launcher_running)
+		return;
+
+	LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
+	if (DataChecksumState->worker_pid != InvalidPid)
 	{
-		LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
-		if (DataChecksumState->worker_pid != InvalidPid)
-		{
-			ereport(LOG,
-					errmsg("data checksums launcher exiting while worker is still running, signalling worker"));
-			kill(DataChecksumState->worker_pid, SIGTERM);
-		}
-		LWLockRelease(DataChecksumsWorkerLock);
+		ereport(LOG,
+				errmsg("data checksums launcher exiting while worker is still running, signalling worker"));
+		kill(DataChecksumState->worker_pid, SIGTERM);
 	}
+	LWLockRelease(DataChecksumsWorkerLock);
 
 	/*
 	 * If the launcher is exiting before data checksums are enabled then set
-- 
2.34.1



^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: [BUG] Race in online checksums launcher_exit()
@ 2026-04-19 20:17  Daniel Gustafsson <[email protected]>
  parent: Ayush Tiwari <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Daniel Gustafsson @ 2026-04-19 20:17 UTC (permalink / raw)
  To: Ayush Tiwari <[email protected]>; +Cc: pgsql-hackers

> On 19 Apr 2026, at 22:09, Ayush Tiwari <[email protected]> wrote:
> 
> Hi hackers,
> 
> While using the pg_enable_data_checksums() feature, I found a likely bug, a race condition in  datachecksum_state.c's launcher_exit().

Thanks for your report.  Tomas and I have worked over the past couple of days
on a fixup series due to a rare race condition which was found after extensive
longrunning testing.  While hacking on that we identified what I believe is the
same bug you found and we have a fix for that, the patchset will be shared very
shortly (we am literally putting the final touches on it as I write this).
I'll compare notes and will if applicable incorporate your patch into it.

--
Daniel Gustafsson






^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: [BUG] Race in online checksums launcher_exit()
@ 2026-04-20 06:08  Ayush Tiwari <[email protected]>
  parent: Daniel Gustafsson <[email protected]>
  0 siblings, 0 replies; 3+ messages in thread

From: Ayush Tiwari @ 2026-04-20 06:08 UTC (permalink / raw)
  To: Daniel Gustafsson <[email protected]>; +Cc: pgsql-hackers

Hi,

On Mon, 20 Apr 2026 at 01:47, Daniel Gustafsson <[email protected]> wrote:

> > On 19 Apr 2026, at 22:09, Ayush Tiwari <[email protected]>
> wrote:
> >
> > Hi hackers,
> >
> > While using the pg_enable_data_checksums() feature, I found a likely
> bug, a race condition in  datachecksum_state.c's launcher_exit().
>
> Thanks for your report.  Tomas and I have worked over the past couple of
> days
> on a fixup series due to a rare race condition which was found after
> extensive
> longrunning testing.  While hacking on that we identified what I believe
> is the
> same bug you found and we have a fix for that, the patchset will be shared
> very
> shortly (we am literally putting the final touches on it as I write this).
> I'll compare notes and will if applicable incorporate your patch into it.
>
> --
> Daniel Gustafsson
>
>
Thanks Daniel! Glad to hear it's being addressed. I would be happy to
test the patchset when it's posted.

Regards,
Ayush


^ permalink  raw  reply  [nested|flat] 3+ messages in thread


end of thread, other threads:[~2026-04-20 06:08 UTC | newest]

Thread overview: 3+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-04-19 20:09 [BUG] Race in online checksums launcher_exit() Ayush Tiwari <[email protected]>
2026-04-19 20:17 ` Daniel Gustafsson <[email protected]>
2026-04-20 06:08   ` Ayush Tiwari <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox