From: Ayush Tiwari <[email protected]>
To: [email protected]
Subject: [BUG] Race in online checksums launcher_exit()
Date: Mon, 20 Apr 2026 01:39:51 +0530
Message-ID: <CAJTYsWWg6tFrdMhWs5PkwESTNeeUUsMuY17O4UmPPh771c3stA@mail.gmail.com>
Hi hackers,
While using the pg_enable_data_checksums() feature, I found a likely bug, a
race condition in datachecksum_state.c's launcher_exit().
When pg_enable_data_checksums() is called twice before the first launcher
starts, two bg workers are registered (the code expects this). The
redundant launcher exits early, but its launcher_exit() callback
unconditionally clears the shared launcher_running flag and may call
SetDataChecksumsOff() -- even though it never owned the flag.
This allows a third pg_enable_data_checksums() call to launch another
launcher concurrently with the first (duplicate work, doubled I/O, spurious
warnings). Worse, if the redundant launcher initialized after the winner
transitioned to inprogress-on, its exit handler calls
SetDataChecksumsOff(), silently aborting the enable operation. (I have
not triggered the SetDataChecksumsOff() part, but I am calling it out as a
likely scenario depending on the timing of the workers.)
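To make the sequence concrete, here is a minimal single-process model of the
flag handling described above. It is only a sketch, not the PostgreSQL code:
SharedState, Launcher, launcher_start() and launcher_exit_unguarded() are
hypothetical names, the model tracks only the launcher_running flags, and it
leaves out locking and the SetDataChecksumsOff() path.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared-memory state. */
typedef struct
{
    bool launcher_running;  /* shared: some launcher owns the role */
} SharedState;

/* Hypothetical stand-in for one launcher process. */
typedef struct
{
    bool launcher_running;  /* process-local: this process owns the role */
} Launcher;

/* Try to claim the launcher role; returns true if this process claimed it. */
bool
launcher_start(SharedState *shared, Launcher *self)
{
    self->launcher_running = false;
    if (shared->launcher_running)
        return false;       /* "already running, exiting" */
    shared->launcher_running = true;
    self->launcher_running = true;
    return true;
}

/*
 * Exit callback as described in the report: it clears the shared flag
 * without checking whether this process ever owned it.
 */
void
launcher_exit_unguarded(SharedState *shared, Launcher *self)
{
    (void) self;            /* the local flag is never consulted */
    shared->launcher_running = false;
}

/*
 * Replay the reported sequence. Returns true if a third launcher is
 * admitted while the first one still believes it owns the role.
 */
bool
third_launcher_admitted(void)
{
    SharedState shared = {false};
    Launcher    first, redundant, third;
    bool        first_claimed = launcher_start(&shared, &first);

    assert(first_claimed);
    if (!launcher_start(&shared, &redundant))
        launcher_exit_unguarded(&shared, &redundant);   /* clears a flag it never owned */

    return launcher_start(&shared, &third) && first.launcher_running;
}
```

In this model third_launcher_admitted() returns true, which corresponds to
two launchers running at the same time.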
Reproduced by firing three calls in quick succession:
psql -c "SELECT pg_enable_data_checksums();" &
psql -c "SELECT pg_enable_data_checksums();" &
sleep 0.5
psql -c "SELECT pg_enable_data_checksums();" &
Log shows two launchers processing databases concurrently:
[2093292] LOG: enabling data checksums requested
[2093293] LOG: already running, exiting
[2093299] LOG: enabling data checksums requested -- third launcher admitted
[2093292] LOG: processing database "postgres"
[2093299] LOG: processing database "postgres" -- same DB, concurrently
[2093299] WARNING: cannot set data checksums to "on", current state is not "inprogress-on"
I think the process-local launcher_running flag exists for this purpose and
is already used for the worker-kill block, but the flag-clear and
state-revert blocks do not use it.
The attached patch returns early from launcher_exit() when the local flag
is false. Thoughts?
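As a rough illustration of the intended behaviour, the following standalone
sketch models an exit callback that checks the process-local flag first. It
is a simplified model under assumptions, not the patched function:
SharedState, Launcher, launcher_start() and launcher_exit_guarded() are
hypothetical names, and locking, worker signalling and the state revert are
left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the shared and process-local flags. */
typedef struct
{
    bool launcher_running;
} SharedState;

typedef struct
{
    bool launcher_running;
} Launcher;

/* Try to claim the launcher role; returns true if this process claimed it. */
bool
launcher_start(SharedState *shared, Launcher *self)
{
    self->launcher_running = false;
    if (shared->launcher_running)
        return false;
    shared->launcher_running = true;
    self->launcher_running = true;
    return true;
}

/* Exit callback that only cleans up if this process owned the role. */
void
launcher_exit_guarded(SharedState *shared, Launcher *self)
{
    if (!self->launcher_running)
        return;             /* never claimed the role: leave shared state alone */
    shared->launcher_running = false;
    self->launcher_running = false;
}

/* A redundant launcher's exit must not let a third launcher in. */
bool
redundant_exit_keeps_flag(void)
{
    SharedState shared = {false};
    Launcher    winner, redundant, third;
    bool        claimed = launcher_start(&shared, &winner);

    assert(claimed);
    if (!launcher_start(&shared, &redundant))
        launcher_exit_guarded(&shared, &redundant);

    return shared.launcher_running && !launcher_start(&shared, &third);
}

/* The owning launcher's exit must still release the role. */
bool
owner_exit_releases_flag(void)
{
    SharedState shared = {false};
    Launcher    winner;
    bool        claimed = launcher_start(&shared, &winner);

    assert(claimed);
    launcher_exit_guarded(&shared, &winner);
    return !shared.launcher_running;
}
```

Both checks return true in this model: the redundant exit leaves the shared
flag set, and the owner's exit still clears it.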
Regards,
Ayush
Attachments:
[application/octet-stream] 0001-Fix-race-in-online-checksums-launcher_exit.patch (2.5K, 3-0001-Fix-race-in-online-checksums-launcher_exit.patch)
From 336d5b671157f974cceef15385e394ca27dd58f2 Mon Sep 17 00:00:00 2001
From: Ayush Tiwari <[email protected]>
Date: Mon, 20 Apr 2026 00:52:00 +0530
Subject: [BUG] Fix race in online checksums launcher_exit()
When pg_enable_data_checksums() is called twice before the first
launcher starts, two launcher processes are registered. The second
(redundant) launcher exits early after seeing launcher_running is
already set, but its launcher_exit() callback unconditionally clears
the shared DataChecksumState->launcher_running flag and may call
SetDataChecksumsOff(). This allows a third launcher to start
concurrently with the first, and can silently revert the cluster
checksum state to off while the first launcher is still working.
Fix by returning early from launcher_exit() when the process-local
launcher_running flag is false, indicating this process never claimed
the launcher role.
---
src/backend/postmaster/datachecksum_state.c | 25 +++++++++++++--------
1 file changed, 16 insertions(+), 9 deletions(-)
diff --git a/src/backend/postmaster/datachecksum_state.c b/src/backend/postmaster/datachecksum_state.c
index 18797a8ee3d..76f5aa00f2b 100644
--- a/src/backend/postmaster/datachecksum_state.c
+++ b/src/backend/postmaster/datachecksum_state.c
@@ -887,17 +887,24 @@ launcher_exit(int code, Datum arg)
 {
 	abort_requested = false;
 
-	if (launcher_running)
+	/*
+	 * Only perform cleanup if we actually claimed the launcher role by
+	 * setting the shared launcher_running flag. A redundant launcher that
+	 * found another launcher already running will have exited early without
+	 * setting the local launcher_running flag, and must not touch the shared
+	 * state owned by the active launcher.
+	 */
+	if (!launcher_running)
+		return;
+
+	LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
+	if (DataChecksumState->worker_pid != InvalidPid)
 	{
-		LWLockAcquire(DataChecksumsWorkerLock, LW_EXCLUSIVE);
-		if (DataChecksumState->worker_pid != InvalidPid)
-		{
-			ereport(LOG,
-					errmsg("data checksums launcher exiting while worker is still running, signalling worker"));
-			kill(DataChecksumState->worker_pid, SIGTERM);
-		}
-		LWLockRelease(DataChecksumsWorkerLock);
+		ereport(LOG,
+				errmsg("data checksums launcher exiting while worker is still running, signalling worker"));
+		kill(DataChecksumState->worker_pid, SIGTERM);
 	}
+	LWLockRelease(DataChecksumsWorkerLock);
 
 	/*
 	 * If the launcher is exiting before data checksums are enabled then set
--
2.34.1