Change checkpoint‑record‑missing PANIC to FATAL

public inbox for [email protected]  
help / color / mirror / Atom feed

 Change checkpoint‑record‑missing PANIC to FATAL
7+ messages / 2 participants
[nested] [flat]

*  Change checkpoint‑record‑missing PANIC to FATAL
@ 2025-12-16 10:55 Nitin Jadhav <[email protected]>
  2025-12-17 03:49 ` Re: Change checkpoint‑record‑missing PANIC to FATAL Michael Paquier <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Nitin Jadhav @ 2025-12-16 10:55 UTC (permalink / raw)
  To: pgsql-hackers; +Cc: Michael Paquier <[email protected]>

Hi,

While working on [1], we discussed whether the redo-record-missing error
should be a PANIC or a FATAL. We concluded that FATAL is more appropriate,
as it is more appropriate for the current situation and achieves the
intended behavior and also it is consistent with the backup_label path,
which already reports FATAL in the same scenario.

However, when the checkpoint record is missing, the behavior remains
inconsistent: Without a backup_label, we currently raise a PANIC. With a
backup_label, the same code path reports a FATAL.Since we have already made
the redo‑record‑missing case to FATAL in 15f68ce
<https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=15f68cebdcec;,
it seems reasonable to align the checkpoint‑record‑missing case as well.
The existing PANIC dates back to an era before online backups and archive
recovery existed, when external manipulation of WAL was not expected and
such conditions were treated as internal faults. With all such features, it
is much more realistic for WAL segments to go missing due to operational
issues, and such cases are often recoverable. So switching this to FATAL
appears appropriate.

Please share your thoughts.

I am happy to share a patch including a TAP test to cover this behavior
once we agree to proceed.

[1]:
https://www.postgresql.org/message-id/flat/CAMm1aWaaJi2w49c0RiaDBfhdCL6ztbr9m%3DdaGqiOuVdizYWYaA%40m...

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: Change checkpoint‑record‑missing PANIC to FATAL
  2025-12-16 10:55  Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
@ 2025-12-17 03:49 ` Michael Paquier <[email protected]>
  2025-12-29 15:09   `  Re: Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Michael Paquier @ 2025-12-17 03:49 UTC (permalink / raw)
  To: Nitin Jadhav <[email protected]>; +Cc: pgsql-hackers

On Tue, Dec 16, 2025 at 04:25:37PM +0530, Nitin Jadhav wrote:
> it seems reasonable to align the checkpoint‑record‑missing case as well.
> The existing PANIC dates back to an era before online backups and archive
> recovery existed, when external manipulation of WAL was not expected and
> such conditions were treated as internal faults. With all such features, it
> is much more realistic for WAL segments to go missing due to operational
> issues, and such cases are often recoverable. So switching this to FATAL
> appears appropriate.
> 
> Please share your thoughts.

FWIW, I think that we should lift the PANIC pattern in this case, at
least to be able to provide more tests around the manipulation of WAL
segments when triggering recovery, with or without a backup_label as
much as with or without a recovery/standby.signal defined in the tree. 
The PANIC pattern to blow up the backend when missing a checkpoint
record at the beginning of recovery is a historical artifact of
4d14fe0048cf.  The backend has evolved a lot since, particularly with
WAL archives that came much later than that.  Lowering that to a FATAL
does not imply a loss of information, just the lack of a backtrace
that can be triggered depending on how one has set of a cluster to
start (say a recovery.signal was forgotten and pg_wal/ has no
contents, etc.).  And IMO I doubt that a trace is really useful anyway
in this specific code path.

I'd love to hear the opinion of others on the matter, so if anybody
has comments, feel free.

I'd be curious to look at the amount of tests related to recovery
startup you have in mind anyway, Nitin.
--
Michael

Attachments:

  [application/pgp-signature] signature.asc (833B, 2-signature.asc)
  download

^ permalink  raw  reply  [nested|flat] 7+ messages in thread

*  Re: Change checkpoint‑record‑missing PANIC to FATAL
  2025-12-16 10:55  Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  2025-12-17 03:49 ` Re: Change checkpoint‑record‑missing PANIC to FATAL Michael Paquier <[email protected]>
@ 2025-12-29 15:09   ` Nitin Jadhav <[email protected]>
  2026-03-06 05:32     `  Re: Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Nitin Jadhav @ 2025-12-29 15:09 UTC (permalink / raw)
  To: Michael Paquier <[email protected]>; +Cc: pgsql-hackers

> I'd be curious to look at the amount of tests related to recovery
> startup you have in mind anyway, Nitin.

Apologies for the delay.
At a high level, the recovery startup cases we want to test fall into
two main buckets:
(1) with a backup_label file and (2) without a backup_label file.



^ permalink  raw  reply  [nested|flat] 7+ messages in thread

*  Re: Change checkpoint‑record‑missing PANIC to FATAL
  2025-12-16 10:55  Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  2025-12-17 03:49 ` Re: Change checkpoint‑record‑missing PANIC to FATAL Michael Paquier <[email protected]>
  2025-12-29 15:09   `  Re: Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
@ 2026-03-06 05:32     ` Nitin Jadhav <[email protected]>
  2026-03-10 03:58       ` Re: Change checkpoint‑record‑missing PANIC to FATAL Michael Paquier <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Nitin Jadhav @ 2026-03-06 05:32 UTC (permalink / raw)
  To: Michael Paquier <[email protected]>; +Cc: pgsql-hackers

> Thanks, Nitin.  Perhaps it would be a better approach to split the
> patch into multiple pieces, with the most relevant PANIC->FATAL
> switches and the most critical tests on top of the rest.  It would be
> nice to get most of that by the end of the release cycle, or a rather
> "good" chunk of it.

Thanks for the suggestion, Michael, and apologies for the delay. I had
intended to send this sooner, but it took a bit more time.

I’m sharing the two patches below.

Patch 0001 adjusts the error severity during crash recovery when the
checkpoint record referenced by pg_control cannot be located and no
backup_label file is present. The error is lowered from PANIC to
FATAL. This patch also adds a new TAP test that verifies startup fails
with a clear FATAL error. The test is straightforward: it removes the
WAL segment containing the checkpoint record and confirms that the
server reports the expected error.

Patch 0002 adds two TAP tests that exercise similar missing‑WAL
scenarios during backup recovery, i.e., when a backup_label file is
present:

Missing checkpoint WAL segment referenced by backup_label:
This test uses an online backup to create a backup_label file,
extracts the checkpoint record information from it, removes the
corresponding WAL segment, and verifies that the server reports the
expected error.

Missing redo WAL segment referenced by the checkpoint:
In this test, redo and checkpoint records are forced into different
WAL segments using injection points. A cold backup is then taken, with
an explicit backup_label created in the restored cluster. The WAL
segment containing the redo record is removed, and startup is expected
to fail with the appropriate error message.

Please review and share your feedback. I’m happy to adjust the patches
if there are better ways to handle any of these cases, especially the
054_missing_redo_with_backup_label test.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Thu, Feb 19, 2026 at 9:48 AM Michael Paquier <[email protected]> wrote:
>
> On Thu, Feb 19, 2026 at 08:24:02AM +0530, Nitin Jadhav wrote:
> > I had a quick look at the existing recovery TAP tests and didn’t
> > immediately find a case where simply adding log checks would cover
> > these error paths, but I’ll double‑check once more before sending the
> > patch. I’ll work on this and share the patch soon.
>
> Thanks, Nitin.  Perhaps it would be a better approach to split the
> patch into multiple pieces, with the most relevant PANIC->FATAL
> switches and the most critical tests on top of the rest.  It would be
> nice to get most of that by the end of the release cycle, or a rather
> "good" chunk of it.
> --
> Michael


Attachments:

  [application/octet-stream] 0002-Add-TAP-tests-for-missing-redo-checkpoint-during-bac.patch (10.0K, 2-0002-Add-TAP-tests-for-missing-redo-checkpoint-during-bac.patch)
  download | inline diff:
From 80c9db24c4937e1782703c123ab67d74d8b20690 Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <[email protected]>
Date: Fri, 6 Mar 2026 04:52:17 +0000
Subject: [PATCH 2/2] Add TAP tests for missing redo/checkpoint during backup
 recovery

Add two recovery TAP tests to validate PostgreSQL behavior when WAL
records required for startup are missing in the presence of a
backup_label file.

The first test covers the case where the checkpoint record referenced
by backup_label is missing, and verifies that recovery fails with a
clear FATAL error.

The second test covers the case where the redo record referenced by the
checkpoint is missing while a backup_label file is present, with redo
and checkpoint records forced into different WAL segments using
injection points.
---
 ...53_missing_checkpoint_with_backup_label.pl |  86 ++++++++++
 .../t/054_missing_redo_with_backup_label.pl   | 152 ++++++++++++++++++
 2 files changed, 238 insertions(+)
 create mode 100644 src/test/recovery/t/053_missing_checkpoint_with_backup_label.pl
 create mode 100644 src/test/recovery/t/054_missing_redo_with_backup_label.pl

diff --git a/src/test/recovery/t/053_missing_checkpoint_with_backup_label.pl b/src/test/recovery/t/053_missing_checkpoint_with_backup_label.pl
new file mode 100644
index 00000000000..7be070abfea
--- /dev/null
+++ b/src/test/recovery/t/053_missing_checkpoint_with_backup_label.pl
@@ -0,0 +1,86 @@
+# Copyright (c) 2025-2026, PostgreSQL Global Development Group
+#
+# Verify crash recovery behavior when the WAL segment containing the
+# checkpoint record referenced by backup_label is missing.
+#
+# Expected behavior: startup fails with FATAL and logs a message about
+# not being able to locate a valid checkpoint record.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('testnode');
+$node->init(allows_streaming => 1);
+$node->append_conf('postgresql.conf', 'wal_level = replica');
+$node->start;
+
+# Generate WAL and force a checkpoint
+$node->safe_psql('postgres',
+    q{CREATE TABLE t(a int); INSERT INTO t VALUES (1),(2),(3);});
+$node->safe_psql('postgres', 'CHECKPOINT');
+
+# Take a physical base backup (creates backup_label)
+my $backupname = 'fs_bkp';
+$node->backup($backupname);
+
+my $reco = PostgreSQL::Test::Cluster->new('recovery_from_backup_ckpt');
+$reco->init_from_backup(
+    $node,
+    $backupname,
+    has_restored => 1,
+);
+
+# Assert backup_label exists
+my $backup_label = $reco->data_dir . '/backup_label';
+ok(
+    -e $backup_label,
+    'backup_label exists'
+);
+
+# Determine WAL file containing the checkpoint record
+my $backup_label_path = $reco->data_dir . '/backup_label';
+my $backup_label_contents = slurp_file($backup_label_path);
+
+my ($checkpoint_walfile) =
+    $backup_label_contents =~
+        /\(file\s+([0-9A-F]{24})\)/;
+ok(
+    defined $checkpoint_walfile,
+    "extracted checkpoint WAL file from backup_label: $checkpoint_walfile"
+);
+
+# Remove the WAL segment containing the checkpoint record
+my $pgwal = $reco->data_dir . '/pg_wal';
+ok(-d $pgwal, 'pg_wal directory exists');
+
+my $target = "$pgwal/$checkpoint_walfile";
+ok(
+    -e $target,
+    "checkpoint WAL segment exists before removal: $target"
+) or die "Expected WAL segment $target not found";
+
+unlink($target)
+  or die "unlink $target failed: $!";
+
+# Start the server and confirm that recovery has failed, as expected.
+command_fails(
+    [
+        'pg_ctl',
+        '--pgdata' => $reco->data_dir,
+        '--log'    => $reco->logfile,
+        'start',
+    ],
+    'startup fails when checkpoint WAL is missing with backup_label present'
+);
+
+my $log = slurp_file($reco->logfile);
+like(
+    $log,
+    qr/(?:FATAL|PANIC): .*could not locate required checkpoin record/i,
+    'server log reports missing checkpoint record'
+);
+
+done_testing();
\ No newline at end of file
diff --git a/src/test/recovery/t/054_missing_redo_with_backup_label.pl b/src/test/recovery/t/054_missing_redo_with_backup_label.pl
new file mode 100644
index 00000000000..f2d3352c46f
--- /dev/null
+++ b/src/test/recovery/t/054_missing_redo_with_backup_label.pl
@@ -0,0 +1,152 @@
+# Copyright (c) 2025-2026, PostgreSQL Global Development Group
+#
+# Verify recovery behavior when a WAL segment containing the redo record is
+# missing, with a checkpoint record located in a different segment, in the
+# presence of a backup_label file.
+#
+# Expected behavior: startup fails with FATAL and logs a message about not
+# being able to find the redo location referenced by the checkpoint record.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+if ($ENV{enable_injection_points} ne 'yes')
+{
+    plan skip_all => 'Injection points not supported by this build';
+}
+
+my $node = PostgreSQL::Test::Cluster->new('testnode');
+$node->init;
+$node->append_conf('postgresql.conf', 'log_checkpoints = on');
+$node->start;
+
+# Check if the extension injection_points is available.
+if (!$node->check_extension('injection_points'))
+{
+    plan skip_all => 'Extension injection_points not installed';
+}
+$node->safe_psql('postgres', q(CREATE EXTENSION injection_points));
+
+# Note that this uses two injection points based on waits, not one.  This
+# may look strange, but this works as a workaround to enforce all memory
+# allocations to happen outside the critical section of the checkpoint
+# required for this test.
+# First, "create-checkpoint-initial" is run outside the critical section
+# section, and is used as a way to initialize the shared memory required
+# for the wait machinery with its DSM registry.
+# Then, "create-checkpoint-run" is loaded outside the critical section of
+# a checkpoint to allocate any memory required by the library load, and
+# its callback is run inside the critical section.
+$node->safe_psql('postgres',
+    q{SELECT injection_points_attach('create-checkpoint-initial', 'wait')});
+$node->safe_psql('postgres',
+    q{SELECT injection_points_attach('create-checkpoint-run', 'wait')});
+
+# Start a psql session to run the checkpoint in the background and make
+# the test wait on the injection point so the checkpoint stops just after
+# it starts.
+my $checkpoint = $node->background_psql('postgres');
+$checkpoint->query_until(
+    qr/starting_checkpoint/,
+    q(\echo starting_checkpoint
+checkpoint;
+));
+
+# Wait for the initial point to finish, the checkpointer is still
+# outside its critical section.  Then release to reach the second
+# point.
+$node->wait_for_event('checkpointer', 'create-checkpoint-initial');
+$node->safe_psql('postgres',
+    q{SELECT injection_points_wakeup('create-checkpoint-initial')});
+
+# Wait until the checkpoint has reached the second injection point.
+# We are now in the middle of a checkpoint running, after the redo
+# record has been logged.
+$node->wait_for_event('checkpointer', 'create-checkpoint-run');
+
+# Switch WAL segment to ensure redo and checkpoint records are in different
+# segments.
+$node->safe_psql('postgres', 'SELECT pg_switch_wal()');
+
+# Continue checkpoint and wait for completion.
+my $log_offset = -s $node->logfile;
+$node->safe_psql('postgres',
+    q{SELECT injection_points_wakeup('create-checkpoint-run')});
+$node->wait_for_log(qr/checkpoint complete/, $log_offset);
+
+$checkpoint->quit;
+
+# Retrieve the WAL file names for the redo record and checkpoint record.
+my $redo_lsn = $node->safe_psql('postgres',
+    q{SELECT redo_lsn FROM pg_control_checkpoint()});
+my $checkpoint_lsn = $node->safe_psql('postgres',
+    q{SELECT checkpoint_lsn FROM pg_control_checkpoint()});
+my $redo_walfile_name =
+  $node->safe_psql('postgres', "SELECT pg_walfile_name('$redo_lsn')");
+my $checkpoint_walfile_name =
+  $node->safe_psql('postgres', "SELECT pg_walfile_name('$checkpoint_lsn')");
+
+# Redo record and checkpoint record should be on different segments.
+isnt($redo_walfile_name, $checkpoint_walfile_name,
+    'redo and checkpoint records on different segments');
+
+# Stop and take a cold filesystem backup of the stopped server.
+$node->stop('immediate');
+my $backupname = 'cold_bkp';
+$node->backup_fs_cold($backupname);
+
+# Restore cold backup into a new node.
+my $reco = PostgreSQL::Test::Cluster->new('reco_with_backup_label');
+$reco->init_from_backup($node, $backupname, has_restored => 1);
+
+# Manually create backup_label in restored cluster to force backup recovery.
+my $backup_label_path = $reco->data_dir . '/backup_label';
+
+# Extract timeline from WAL filename (first 8 hex digits).
+my $tli_hex = substr($checkpoint_walfile_name, 0, 8);
+my $tli = hex($tli_hex);
+
+open(my $bl, '>', $backup_label_path)
+  or die "could not create backup_label: $!";
+print $bl "START WAL LOCATION: $redo_lsn (file $redo_walfile_name)\n";
+print $bl "CHECKPOINT LOCATION: $checkpoint_lsn\n";
+print $bl "BACKUP METHOD: test\n";
+print $bl "BACKUP FROM: primary\n";
+print $bl "START TIMELINE: $tli\n";
+print $bl "CHECKPOINT TIMELINE: $tli\n";
+print $bl "LABEL: redo missing with backup_label\n";
+close($bl);
+
+ok(-e $backup_label_path, 'backup_label exists before startup');
+
+# Remove the WAL segment containing the redo record.
+my $redo_path = $reco->data_dir . "/pg_wal/$redo_walfile_name";
+my $ckpt_path = $reco->data_dir . "/pg_wal/$checkpoint_walfile_name";
+
+ok(-e $ckpt_path, "checkpoint WAL segment exists: $ckpt_path");
+ok(-e $redo_path, "redo WAL segment exists before removal: $redo_path")
+  or die "Expected WAL segment $redo_path not found";
+
+unlink($redo_path)
+  or die "could not remove redo WAL file: $!";
+
+# Use run_log instead of node->start because this test expects that
+# the server ends with an error during recovery.
+run_log(
+	[
+		'pg_ctl',
+		'--pgdata' => $reco->data_dir,
+		'--log'    => $reco->logfile,
+		'start',
+	]);
+
+# Confirm that recovery has failed, as expected.
+my $logfile = slurp_file($reco->logfile());
+ok( $logfile =~
+	  qr/FATAL: .* could not find redo location .* referenced by checkpoint record at .*/,
+	"ends with FATAL because it could not find redo location");
+
+done_testing();
\ No newline at end of file
-- 
2.43.0



  [application/octet-stream] 0001-Lower-PANIC-to-FATAL-for-missing-checkpoint-error-in.patch (4.1K, 3-0001-Lower-PANIC-to-FATAL-for-missing-checkpoint-error-in.patch)
  download | inline diff:
From c6007d8909f2dee64a8fbd664aa97dd89f944884 Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <[email protected]>
Date: Thu, 19 Feb 2026 08:28:34 +0000
Subject: [PATCH 1/2] Lower PANIC to FATAL for missing checkpoint error in
 no-backup_label path

Crash recovery without a backup_label currently PANICs if the checkpoint
record cannot be found.

Report this as FATAL instead, as missing WAL is a plausible operational
failure and the backup_label recovery path already treats it as a FATAL
error.
---
 src/backend/access/transam/xlogrecovery.c     |  2 +-
 ...pl => 050_missing_redo_no_backup_label.pl} |  0
 .../052_missing_checkpoint_no_backup_label.pl | 60 +++++++++++++++++++
 3 files changed, 61 insertions(+), 1 deletion(-)
 rename src/test/recovery/t/{050_redo_segment_missing.pl => 050_missing_redo_no_backup_label.pl} (100%)
 create mode 100644 src/test/recovery/t/052_missing_checkpoint_no_backup_label.pl

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c0c2744d45b..857dbb33f19 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -798,7 +798,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			 * can't read the last checkpoint because this allows us to
 			 * simplify processing around checkpoints.
 			 */
-			ereport(PANIC,
+			ereport(FATAL,
 					errmsg("could not locate a valid checkpoint record at %X/%08X",
 						   LSN_FORMAT_ARGS(CheckPointLoc)));
 		}
diff --git a/src/test/recovery/t/050_redo_segment_missing.pl b/src/test/recovery/t/050_missing_redo_no_backup_label.pl
similarity index 100%
rename from src/test/recovery/t/050_redo_segment_missing.pl
rename to src/test/recovery/t/050_missing_redo_no_backup_label.pl
diff --git a/src/test/recovery/t/052_missing_checkpoint_no_backup_label.pl b/src/test/recovery/t/052_missing_checkpoint_no_backup_label.pl
new file mode 100644
index 00000000000..1f2182f9f31
--- /dev/null
+++ b/src/test/recovery/t/052_missing_checkpoint_no_backup_label.pl
@@ -0,0 +1,60 @@
+# Copyright (c) 2025-2026, PostgreSQL Global Development Group
+#
+# Verify crash recovery behavior when the WAL segment containing the
+# checkpoint record referenced by pg_controldata is missing, in the code path
+# where there is no backup_label file.
+#
+# Expected behavior: startup fails with FATAL and logs a message about
+# not being able to locate a checkpoint record.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('testnode');
+$node->init;
+$node->append_conf('postgresql.conf', 'log_checkpoints = on');
+$node->start;
+
+# Force a checkpoint so pg_controldata points to a checkpoint record we can target.
+$node->safe_psql('postgres', 'CHECKPOINT;');
+
+# Retrieve the checkpoint LSN and derive the WAL segment name.
+my $checkpoint_lsn = $node->safe_psql('postgres',
+    "SELECT checkpoint_lsn FROM pg_control_checkpoint()");
+my $checkpoint_walfile =
+  $node->safe_psql('postgres', "SELECT pg_walfile_name('$checkpoint_lsn')");
+
+ok($checkpoint_walfile ne '', "derived checkpoint WAL file name: $checkpoint_walfile");
+
+# Stop the node.
+$node->stop('immediate');
+
+# Remove the WAL segment containing the checkpoint record.
+my $walpath = $node->data_dir . "/pg_wal/$checkpoint_walfile";
+ok(-f $walpath, "checkpoint WAL file exists before deletion: $walpath");
+
+unlink $walpath
+  or die "could not remove WAL file $walpath: $!";
+
+ok(!-e $walpath, "checkpoint WAL file removed: $walpath");
+
+# Use run_log instead of node->start because this test expects that
+# the server ends with an error during recovery.
+run_log(
+	[
+		'pg_ctl',
+		'--pgdata' => $node->data_dir,
+		'--log' => $node->logfile,
+		'start',
+	]);
+
+# Confirm that recovery has failed, as expected.
+my $logfile = slurp_file($node->logfile());
+ok( $logfile =~
+	  qr/FATAL: .* could not locate a valid checkpoint record at .*/,
+	"FATAL logged for missing checkpoint record (no backup_label path)");
+
+done_testing();
\ No newline at end of file
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: Change checkpoint‑record‑missing PANIC to FATAL
  2025-12-16 10:55  Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  2025-12-17 03:49 ` Re: Change checkpoint‑record‑missing PANIC to FATAL Michael Paquier <[email protected]>
  2025-12-29 15:09   `  Re: Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  2026-03-06 05:32     `  Re: Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
@ 2026-03-10 03:58       ` Michael Paquier <[email protected]>
  2026-03-20 14:26         `  Re: Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Michael Paquier @ 2026-03-10 03:58 UTC (permalink / raw)
  To: Nitin Jadhav <[email protected]>; +Cc: pgsql-hackers

On Fri, Mar 06, 2026 at 11:02:41AM +0530, Nitin Jadhav wrote:
> Patch 0001 adjusts the error severity during crash recovery when the
> checkpoint record referenced by pg_control cannot be located and no
> backup_label file is present. The error is lowered from PANIC to
> FATAL. This patch also adds a new TAP test that verifies startup fails
> with a clear FATAL error. The test is straightforward: it removes the
> WAL segment containing the checkpoint record and confirms that the
> server reports the expected error.

There could be one possibility for 0001 to be unstable, actually.  One
could imagine that by getting the segment from a live server things
could fail: if the run is very slow then we may finish with an
incorrect record.  It would be possibe to use pg_controldata once the
server has been shut down, but we should be on the same segment anyway
because nothing happens on the server, so I have let the test are you
have suggested, and applied this one.  You have missed an update in
meson.build, and I am not sure that there was a need for renaming 050,
either.

> Missing checkpoint WAL segment referenced by backup_label:
> This test uses an online backup to create a backup_label file,
> extracts the checkpoint record information from it, removes the
> corresponding WAL segment, and verifies that the server reports the
> expected error.
> 
> Missing redo WAL segment referenced by the checkpoint:
> In this test, redo and checkpoint records are forced into different
> WAL segments using injection points. A cold backup is then taken, with
> an explicit backup_label created in the restored cluster. The WAL
> segment containing the redo record is removed, and startup is expected
> to fail with the appropriate error message.

These two are very close to the existing tests that we have on HEAD
now, could it be better to group them together instead?  Particularly,
054 reuses the injection point trick in what is clearly a copy-paste
taken from 050.  Avoiding this duplication may be nice.
--
Michael


Attachments:

  [application/pgp-signature] signature.asc (833B, 2-signature.asc)
  download

^ permalink  raw  reply  [nested|flat] 7+ messages in thread

*  Re: Change checkpoint‑record‑missing PANIC to FATAL
  2025-12-16 10:55  Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  2025-12-17 03:49 ` Re: Change checkpoint‑record‑missing PANIC to FATAL Michael Paquier <[email protected]>
  2025-12-29 15:09   `  Re: Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  2026-03-06 05:32     `  Re: Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  2026-03-10 03:58       ` Re: Change checkpoint‑record‑missing PANIC to FATAL Michael Paquier <[email protected]>
@ 2026-03-20 14:26         ` Nitin Jadhav <[email protected]>
  2026-03-23 00:55           ` Re: Change checkpoint‑record‑missing PANIC to FATAL Michael Paquier <[email protected]>
  0 siblings, 1 reply; 7+ messages in thread

From: Nitin Jadhav @ 2026-03-20 14:26 UTC (permalink / raw)
  To: Michael Paquier <[email protected]>; +Cc: pgsql-hackers

> There could be one possibility for 0001 to be unstable, actually.  One
> could imagine that by getting the segment from a live server things
> could fail: if the run is very slow then we may finish with an
> incorrect record.  It would be possibe to use pg_controldata once the
> server has been shut down, but we should be on the same segment anyway
> because nothing happens on the server, so I have let the test are you
> have suggested, and applied this one.  You have missed an update in
> meson.build, and I am not sure that there was a need for renaming 050,
> either.

Good catch—thanks. This does seem like it could apply to the earlier
redo‑record‑missing case as well. That said, since there’s no activity
on the server during the test, it should be safe as you mentioned. My
apologies for missing the meson.build update again—thanks for taking
care of it. On the renaming, I was aiming to separate things into four
tests to make the differences explicit, but I’m happy to stick with
two test files.

> These two are very close to the existing tests that we have on HEAD
> now, could it be better to group them together instead?  Particularly,
> 054 reuses the injection point trick in what is clearly a copy-paste
> taken from 050.  Avoiding this duplication may be nice.

That's a good point. I agree these new cases are very close to what we
already have on HEAD. I can rework this to avoid duplication by
grouping them with the existing tests.

Best Regards,
Nitin Jadhav
Azure Database for PostgreSQL
Microsoft

On Tue, Mar 10, 2026 at 9:28 AM Michael Paquier <[email protected]> wrote:
>
> On Fri, Mar 06, 2026 at 11:02:41AM +0530, Nitin Jadhav wrote:
> > Patch 0001 adjusts the error severity during crash recovery when the
> > checkpoint record referenced by pg_control cannot be located and no
> > backup_label file is present. The error is lowered from PANIC to
> > FATAL. This patch also adds a new TAP test that verifies startup fails
> > with a clear FATAL error. The test is straightforward: it removes the
> > WAL segment containing the checkpoint record and confirms that the
> > server reports the expected error.
>
> There could be one possibility for 0001 to be unstable, actually.  One
> could imagine that by getting the segment from a live server things
> could fail: if the run is very slow then we may finish with an
> incorrect record.  It would be possibe to use pg_controldata once the
> server has been shut down, but we should be on the same segment anyway
> because nothing happens on the server, so I have let the test are you
> have suggested, and applied this one.  You have missed an update in
> meson.build, and I am not sure that there was a need for renaming 050,
> either.
>
> > Missing checkpoint WAL segment referenced by backup_label:
> > This test uses an online backup to create a backup_label file,
> > extracts the checkpoint record information from it, removes the
> > corresponding WAL segment, and verifies that the server reports the
> > expected error.
> >
> > Missing redo WAL segment referenced by the checkpoint:
> > In this test, redo and checkpoint records are forced into different
> > WAL segments using injection points. A cold backup is then taken, with
> > an explicit backup_label created in the restored cluster. The WAL
> > segment containing the redo record is removed, and startup is expected
> > to fail with the appropriate error message.
>
> These two are very close to the existing tests that we have on HEAD
> now, could it be better to group them together instead?  Particularly,
> 054 reuses the injection point trick in what is clearly a copy-paste
> taken from 050.  Avoiding this duplication may be nice.
> --
> Michael





^ permalink  raw  reply  [nested|flat] 7+ messages in thread

* Re: Change checkpoint‑record‑missing PANIC to FATAL
  2025-12-16 10:55  Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  2025-12-17 03:49 ` Re: Change checkpoint‑record‑missing PANIC to FATAL Michael Paquier <[email protected]>
  2025-12-29 15:09   `  Re: Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  2026-03-06 05:32     `  Re: Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
  2026-03-10 03:58       ` Re: Change checkpoint‑record‑missing PANIC to FATAL Michael Paquier <[email protected]>
  2026-03-20 14:26         `  Re: Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
@ 2026-03-23 00:55           ` Michael Paquier <[email protected]>
  0 siblings, 0 replies; 7+ messages in thread

From: Michael Paquier @ 2026-03-23 00:55 UTC (permalink / raw)
  To: Nitin Jadhav <[email protected]>; +Cc: pgsql-hackers

On Fri, Mar 20, 2026 at 07:56:46PM +0530, Nitin Jadhav wrote:
> That's a good point. I agree these new cases are very close to what we
> already have on HEAD. I can rework this to avoid duplication by
> grouping them with the existing tests.

Thanks.  I am not sure if I will be able to follow up the work of this
thread for this commit fest, so I'll probably consider more
improvements in this area once v20 opens for business around the end
of June.
--
Michael


Attachments:

  [application/pgp-signature] signature.asc (833B, 2-signature.asc)
  download

^ permalink  raw  reply  [nested|flat] 7+ messages in thread

end of thread, other threads:[~2026-03-23 00:55 UTC | newest]

Thread overview: 7+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-12-16 10:55  Change checkpoint‑record‑missing PANIC to FATAL Nitin Jadhav <[email protected]>
2025-12-17 03:49 ` Michael Paquier <[email protected]>
2025-12-29 15:09   ` Nitin Jadhav <[email protected]>
2026-03-06 05:32     ` Nitin Jadhav <[email protected]>
2026-03-10 03:58       ` Michael Paquier <[email protected]>
2026-03-20 14:26         ` Nitin Jadhav <[email protected]>
2026-03-23 00:55           ` Michael Paquier <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox