public inbox for [email protected]  
help / color / mirror / Atom feed
Propagate XLogFindNextRecord error to callers
6+ messages / 2 participants
[nested] [flat]

* Propagate XLogFindNextRecord error to callers
@ 2025-12-12 10:39  Anthonin Bonnefoy <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Anthonin Bonnefoy @ 2025-12-12 10:39 UTC (permalink / raw)
  To: pgsql-hackers

Hi,

Currently, XLogFindNextRecord errormsg is ignored and callers will only
output a generic 'could not find a valid record' message without details.
Additionally, invalid page header won't go through XLogReadRecord, leaving
the error in state->errormsg_buf.

This patch propagates XLogFindNextRecord's error message to the callers and
displays it. In case of an invalid page header, the errormsg is filled with
errormsg_buf content.

With this patch, pg_waldump will now have the following output when reading
a file with an invalid header:
pg_waldump: error: could not find a valid record after D80/5C000000:
invalid magic number D116 in WAL segment 0000001400000D8000000017, LSN
D80/5C000000, offset 0

Regards,
Anthonin Bonnefoy


Attachments:

  [application/octet-stream] v1-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patch (6.9K, 3-v1-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patch)
  download | inline diff:
From f807ad4e9aaf5e68601f65482cfcd33eeaea5e0c Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <[email protected]>
Date: Fri, 12 Dec 2025 09:56:00 +0100
Subject: Propage errormsg to XLogFindNextRecord caller

Currently, XLogFindNextRecord errormsg is ignored and callers will only
output a generic 'could not find a valid record' message without
details.
Additionally, invalid page header won't go through XLogReadRecord,
leaving the error in state->errormsg_buf.

This patch propagates XLogFindNextRecord's error message to the callers
and displays it. In case of an invalid page header, the errormsg is
filled with errormsg_buf content.
---
 contrib/pg_walinspect/pg_walinspect.c   | 16 ++++++++++++----
 src/backend/access/transam/xlogreader.c | 16 +++++++++++++---
 src/backend/postmaster/walsummarizer.c  | 17 ++++++++++++-----
 src/bin/pg_waldump/pg_waldump.c         | 12 +++++++++---
 src/include/access/xlogreader.h         |  2 +-
 5 files changed, 47 insertions(+), 16 deletions(-)

diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index 6945bac1306..da997e4fd86 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -97,6 +97,7 @@ InitXLogReaderState(XLogRecPtr lsn)
 	XLogReaderState *xlogreader;
 	ReadLocalXLogPageNoWaitPrivate *private_data;
 	XLogRecPtr	first_valid_record;
+	char	   *errormsg;
 
 	/*
 	 * Reading WAL below the first page of the first segments isn't allowed.
@@ -124,12 +125,19 @@ InitXLogReaderState(XLogRecPtr lsn)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 
 	/* first find a valid recptr to start from */
-	first_valid_record = XLogFindNextRecord(xlogreader, lsn);
+	first_valid_record = XLogFindNextRecord(xlogreader, lsn, &errormsg);
 
 	if (!XLogRecPtrIsValid(first_valid_record))
-		ereport(ERROR,
-				errmsg("could not find a valid record after %X/%08X",
-					   LSN_FORMAT_ARGS(lsn)));
+	{
+		if (errormsg)
+			ereport(ERROR,
+					errmsg("could not find a valid record after %X/%08X: %s",
+						   LSN_FORMAT_ARGS(lsn), errormsg));
+		else
+			ereport(ERROR,
+					errmsg("could not find a valid record after %X/%08X",
+						   LSN_FORMAT_ARGS(lsn)));
+	}
 
 	return xlogreader;
 }
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5e5001b2101..269c0ff7f47 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1391,12 +1391,11 @@ XLogReaderResetError(XLogReaderState *state)
  * XLogReadRecord() will read the next valid record.
  */
 XLogRecPtr
-XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
+XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 {
 	XLogRecPtr	tmpRecPtr;
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
-	char	   *errormsg;
 
 	Assert(XLogRecPtrIsValid(RecPtr));
 
@@ -1481,7 +1480,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, errormsg) != NULL)
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1496,6 +1495,17 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 err:
 	XLogReaderInvalReadState(state);
 
+	/*
+	 * we may have reported errors due to invalid WAL header, propagate the
+	 * error message to the caller.
+	 */
+	if (state->errormsg_deferred)
+	{
+		if (state->errormsg_buf[0] != '\0')
+			*errormsg = state->errormsg_buf;
+		state->errormsg_deferred = false;
+	}
+
 	return InvalidXLogRecPtr;
 }
 
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index e7e4d652f97..3e7114b6942 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -918,6 +918,7 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 	WalSummaryIO io;
 	BlockRefTable *brtab = CreateEmptyBlockRefTable();
 	bool		fast_forward = true;
+	char	   *errormsg;
 
 	/* Initialize private data for xlogreader. */
 	private_data = palloc0_object(SummarizerReadLocalXLogPrivate);
@@ -969,7 +970,7 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 	}
 	else
 	{
-		summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+		summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn, &errormsg);
 		if (!XLogRecPtrIsValid(summary_start_lsn))
 		{
 			/*
@@ -998,9 +999,16 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 				switch_lsn = xlogreader->EndRecPtr;
 			}
 			else
-				ereport(ERROR,
-						errmsg("could not find a valid record after %X/%08X",
-							   LSN_FORMAT_ARGS(start_lsn)));
+			{
+				if (errormsg)
+					ereport(ERROR,
+							errmsg("could not find a valid record after %X/%08X: %s",
+								   LSN_FORMAT_ARGS(start_lsn), errormsg));
+				else
+					ereport(ERROR,
+							errmsg("could not find a valid record after %X/%08X",
+								   LSN_FORMAT_ARGS(start_lsn)));
+			}
 		}
 
 		/* We shouldn't go backward. */
@@ -1013,7 +1021,6 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 	while (1)
 	{
 		int			block_id;
-		char	   *errormsg;
 		XLogRecord *record;
 		uint8		rmid;
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 1c1ccf59f65..d73f44e6394 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1211,11 +1211,17 @@ main(int argc, char **argv)
 		pg_fatal("out of memory while allocating a WAL reading processor");
 
 	/* first find a valid recptr to start from */
-	first_record = XLogFindNextRecord(xlogreader_state, private.startptr);
+	first_record = XLogFindNextRecord(xlogreader_state, private.startptr, &errormsg);
 
 	if (!XLogRecPtrIsValid(first_record))
-		pg_fatal("could not find a valid record after %X/%08X",
-				 LSN_FORMAT_ARGS(private.startptr));
+	{
+		if (errormsg)
+			pg_fatal("could not find a valid record after %X/%X: %s",
+					 LSN_FORMAT_ARGS(private.startptr), errormsg);
+		else
+			pg_fatal("could not find a valid record after %X/%X",
+					 LSN_FORMAT_ARGS(private.startptr));
+	}
 
 	/*
 	 * Display a message that we're skipping data if `from` wasn't a pointer
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index dfabbbd57d4..4f68fe0e4ea 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -342,7 +342,7 @@ extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
 
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
-extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
+extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg);
 
 /* Return values from XLogPageReadCB. */
 typedef enum XLogPageReadResult
-- 
2.51.0



^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Propagate XLogFindNextRecord error to callers
@ 2026-03-17 09:48  Fujii Masao <[email protected]>
  parent: Anthonin Bonnefoy <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Fujii Masao @ 2026-03-17 09:48 UTC (permalink / raw)
  To: Anthonin Bonnefoy <[email protected]>; +Cc: Chao Li <[email protected]>; Mircea Cadariu <[email protected]>; pgsql-hackers; Japin Li <[email protected]>

On Thu, Feb 26, 2026 at 5:19 PM Anthonin Bonnefoy
<[email protected]> wrote:
>
> Thanks for the comments!
>
> On Tue, Feb 24, 2026 at 5:00 AM Chao Li <[email protected]> wrote:
> > From a design perspective, I’m not sure we need to add a new errormsg parameter to XLogFindNextRecord(). The new parameter ultimately just exposes state->errormsg_buf, so the returned errormsg implicitly depends on the lifetime of state, and we also need extra handling for cases like errormsg == NULL.
> >
> > Instead, perhaps we could add a helper function, say XLogReaderGetLastError(XLogReaderState *state). which internally pstrdup()s state->errormsg_buf (after checking errormsg_deferred, etc.). That way the caller owns the returned string explicitly, and there’s no hidden dependency on the reader state’s lifetime.
> >
> > This would also avoid changing the XLogFindNextRecord() signature while making the ownership semantics clearer.
>
> One issue I see is that it introduces another way to get the error,
> with XLogReadRecord and XLogNextRecord using an errormsg parameter,
> and XLogFindNextRecord using the helper function. Maybe the solution
> would be to change both XLogReadRecord and XLogNextRecord to use this
> new function to stay consistent, but that means changing their
> signatures.
>
> Also, I see the errormsg parameter as a way to signal the caller that
> "this function can fail, the detailed error will be available here".
> With the XLogReaderGetLastError, it becomes the caller's
> responsibility to know which function may fill the error message and
> check it accordingly.
>
> The error message is likely printed shortly after the function's call,
> so I suspect the risk of using the errormsg after its intended
> lifetime is low.
>
> You bring up a good point about the errormsg's lifetime, which is
> definitely something to mention in the function's comments. I've
> updated the patch with the additional comments.

Since this patch is marked Ready for Committer, I've started reviewing it.

+ * When set, *errormsg points to an internal buffer that's valid until the next
+ * call to XLogReadRecord.

Could that buffer also be invalidated by other functions that modify
"XLogReaderState *state", such as XLogBeginRead()?


+# Wrong WAL version. We copy an existing wal file and set the
+# page's magic value to 0000.

Would it be better to describe the purpose of this test at the top?
For example:

    # Test that pg_waldump reports a detailed error message when dumping
    # a WAL file with an invalid magic number (0000).
    {
            # The broken WAL file is created by copying a valid WAL file and
            # overwriting its magic number with 0000.
            my $broken_wal_dir = PostgreSQL::Test::Utils::tempdir_short();


+ open($fh, '+<', $broken_wal);
+ close($fh);

Should we add error handling like:

    open(my $fh, '+<', $broken_wal)
        or BAIL_OUT("open($broken_wal) failed: $!");
    close($fh)
        or BAIL_OUT("close failed: $!");


Also, other similar tests seem to call binmode. Is it really unnecessary
in this case?

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Propagate XLogFindNextRecord error to callers
@ 2026-03-23 08:15  Anthonin Bonnefoy <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Anthonin Bonnefoy @ 2026-03-23 08:15 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Chao Li <[email protected]>; Mircea Cadariu <[email protected]>; pgsql-hackers; Japin Li <[email protected]>

On Tue, Mar 17, 2026 at 10:49 AM Fujii Masao <[email protected]> wrote:
> Since this patch is marked Ready for Committer, I've started reviewing it.

Thanks for the review!

On Tue, Mar 17, 2026 at 10:49 AM Fujii Masao <[email protected]> wrote:
> + * When set, *errormsg points to an internal buffer that's valid until the next
> + * call to XLogReadRecord.
>
> Could that buffer also be invalidated by other functions that modify
> "XLogReaderState *state", such as XLogBeginRead()?

Yes, XLogBeginRead() will clear the buffer, but I don't think there's
a pattern where XLogBeginRead() is called after XLogFindNextRecord()
on the same reader?
The comment was more focused on how the function would be used, with
XLogReadRecord() being called after the reader state is initialized,
and talking about XLogBeginRead() here could be confusing.

> +# Wrong WAL version. We copy an existing wal file and set the
> +# page's magic value to 0000.
>
> Would it be better to describe the purpose of this test at the top?
> For example:
>
>     # Test that pg_waldump reports a detailed error message when dumping
>     # a WAL file with an invalid magic number (0000).
>     {
>             # The broken WAL file is created by copying a valid WAL file and
>             # overwriting its magic number with 0000.
>             my $broken_wal_dir = PostgreSQL::Test::Utils::tempdir_short();

Sounds good, I've updated the patch.

> + open($fh, '+<', $broken_wal);
> + close($fh);
>
> Should we add error handling like:
>
>     open(my $fh, '+<', $broken_wal)
>         or BAIL_OUT("open($broken_wal) failed: $!");
>     close($fh)
>         or BAIL_OUT("close failed: $!");

Added.

> Also, other similar tests seem to call binmode. Is it really unnecessary
> in this case?

Ha right, I've missed those calls. I'm not sure if it's strictly
necessary for the functions used there, but it's probably a lot saner
to read and write WAL using binary mode. I've added a binmode call.

Regards,
Anthonin Bonnefoy


Attachments:

  [application/octet-stream] v7-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patch (8.7K, 2-v7-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patch)
  download | inline diff:
From a09e289393ce4b54efcb22aab7b6bed78bc41450 Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <[email protected]>
Date: Mon, 23 Mar 2026 08:41:44 +0100
Subject: Propagate errormsg to XLogFindNextRecord caller

Currently, XLogFindNextRecord errormsg is ignored and callers will only
output a generic 'could not find a valid record' message without
details.
Additionally, invalid page header won't go through XLogReadRecord,
leaving the error in state->errormsg_buf.

This patch propagates XLogFindNextRecord's error message to the callers
and displays it. In case of an invalid page header, the errormsg is
filled with errormsg_buf content.
---
 contrib/pg_walinspect/pg_walinspect.c   | 16 ++++++++++----
 src/backend/access/transam/xlogreader.c | 24 +++++++++++++++++---
 src/backend/postmaster/walsummarizer.c  | 17 ++++++++++-----
 src/bin/pg_waldump/pg_waldump.c         | 12 +++++++---
 src/bin/pg_waldump/t/001_basic.pl       | 29 +++++++++++++++++++++++++
 src/include/access/xlogreader.h         |  3 ++-
 6 files changed, 85 insertions(+), 16 deletions(-)

diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index f6f5f0792d2..4cf6e41e2f5 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -99,6 +99,7 @@ InitXLogReaderState(XLogRecPtr lsn)
 	XLogReaderState *xlogreader;
 	ReadLocalXLogPageNoWaitPrivate *private_data;
 	XLogRecPtr	first_valid_record;
+	char	   *errormsg;
 
 	/*
 	 * Reading WAL below the first page of the first segments isn't allowed.
@@ -126,12 +127,19 @@ InitXLogReaderState(XLogRecPtr lsn)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 
 	/* first find a valid recptr to start from */
-	first_valid_record = XLogFindNextRecord(xlogreader, lsn);
+	first_valid_record = XLogFindNextRecord(xlogreader, lsn, &errormsg);
 
 	if (!XLogRecPtrIsValid(first_valid_record))
-		ereport(ERROR,
-				errmsg("could not find a valid record after %X/%08X",
-					   LSN_FORMAT_ARGS(lsn)));
+	{
+		if (errormsg)
+			ereport(ERROR,
+					errmsg("could not find a valid record after %X/%08X: %s",
+						   LSN_FORMAT_ARGS(lsn), errormsg));
+		else
+			ereport(ERROR,
+					errmsg("could not find a valid record after %X/%08X",
+						   LSN_FORMAT_ARGS(lsn)));
+	}
 
 	return xlogreader;
 }
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 8cb2110cb99..8849610db00 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1390,14 +1390,21 @@ XLogReaderResetError(XLogReaderState *state)
  *
  * This positions the reader, like XLogBeginRead(), so that the next call to
  * XLogReadRecord() will read the next valid record.
+ *
+ * On failure, InvalidXLogRecPtr is returned, and *errormsg is set to a string
+ * with details of the failure.
+ *
+ * When set, *errormsg points to an internal buffer that's valid until the next
+ * call to XLogReadRecord.
  */
 XLogRecPtr
-XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
+XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 {
 	XLogRecPtr	tmpRecPtr;
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
-	char	   *errormsg;
+
+	*errormsg = NULL;
 
 	Assert(XLogRecPtrIsValid(RecPtr));
 
@@ -1482,7 +1489,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, errormsg) != NULL)
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1497,6 +1504,17 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 err:
 	XLogReaderInvalReadState(state);
 
+	/*
+	 * We may have reported errors due to invalid WAL header, propagate the
+	 * error message to the caller.
+	 */
+	if (state->errormsg_deferred)
+	{
+		if (state->errormsg_buf[0] != '\0')
+			*errormsg = state->errormsg_buf;
+		state->errormsg_deferred = false;
+	}
+
 	return InvalidXLogRecPtr;
 }
 
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index e1aa102f41d..0c0670f7da9 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -915,6 +915,7 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 	WalSummaryIO io;
 	BlockRefTable *brtab = CreateEmptyBlockRefTable();
 	bool		fast_forward = true;
+	char	   *errormsg;
 
 	/* Initialize private data for xlogreader. */
 	private_data = palloc0_object(SummarizerReadLocalXLogPrivate);
@@ -966,7 +967,7 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 	}
 	else
 	{
-		summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+		summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn, &errormsg);
 		if (!XLogRecPtrIsValid(summary_start_lsn))
 		{
 			/*
@@ -995,9 +996,16 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 				switch_lsn = xlogreader->EndRecPtr;
 			}
 			else
-				ereport(ERROR,
-						errmsg("could not find a valid record after %X/%08X",
-							   LSN_FORMAT_ARGS(start_lsn)));
+			{
+				if (errormsg)
+					ereport(ERROR,
+							errmsg("could not find a valid record after %X/%08X: %s",
+								   LSN_FORMAT_ARGS(start_lsn), errormsg));
+				else
+					ereport(ERROR,
+							errmsg("could not find a valid record after %X/%08X",
+								   LSN_FORMAT_ARGS(start_lsn)));
+			}
 		}
 
 		/* We shouldn't go backward. */
@@ -1010,7 +1018,6 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 	while (1)
 	{
 		int			block_id;
-		char	   *errormsg;
 		XLogRecord *record;
 		uint8		rmid;
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index f82507ef696..630d9859882 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1384,11 +1384,17 @@ main(int argc, char **argv)
 		pg_fatal("out of memory while allocating a WAL reading processor");
 
 	/* first find a valid recptr to start from */
-	first_record = XLogFindNextRecord(xlogreader_state, private.startptr);
+	first_record = XLogFindNextRecord(xlogreader_state, private.startptr, &errormsg);
 
 	if (!XLogRecPtrIsValid(first_record))
-		pg_fatal("could not find a valid record after %X/%08X",
-				 LSN_FORMAT_ARGS(private.startptr));
+	{
+		if (errormsg)
+			pg_fatal("could not find a valid record after %X/%08X: %s",
+					 LSN_FORMAT_ARGS(private.startptr), errormsg);
+		else
+			pg_fatal("could not find a valid record after %X/%08X",
+					 LSN_FORMAT_ARGS(private.startptr));
+	}
 
 	/*
 	 * Display a message that we're skipping data if `from` wasn't a pointer
diff --git a/src/bin/pg_waldump/t/001_basic.pl b/src/bin/pg_waldump/t/001_basic.pl
index 11df7e092bf..7ff9e918ff2 100644
--- a/src/bin/pg_waldump/t/001_basic.pl
+++ b/src/bin/pg_waldump/t/001_basic.pl
@@ -10,6 +10,7 @@ use Test::More;
 use List::Util qw(shuffle);
 
 my $tar = $ENV{TAR};
+use File::Copy;
 
 program_help_ok('pg_waldump');
 program_version_ok('pg_waldump');
@@ -246,6 +247,34 @@ command_like(
 	qr/^$/,
 	'no output with --quiet option');
 
+# Wrong WAL version. We copy an existing wal file and set the
+# page's magic value to 0000.
+{
+	# The broken WAL file is created by copying a valid WAL file and
+	# overwriting its magic number with 0000.
+	my $broken_wal_dir = PostgreSQL::Test::Utils::tempdir_short();
+	my $broken_wal = "$broken_wal_dir/$start_walfile";
+	copy($node->data_dir . '/pg_wal/' . $start_walfile, $broken_wal)
+		|| die "copying $start_walfile $!";
+
+	my $fh;
+	open($fh, '+<', $broken_wal)
+	  or BAIL_OUT("open failed: $!");
+	binmode $fh;
+	sysseek($fh, 0, 0)
+		or BAIL_OUT("sysseek failed: $!");
+	syswrite($fh, pack("S", 0))
+		or BAIL_OUT("syswrite failed: $!");
+	close($fh)
+	  or BAIL_OUT("close failed: $!");
+
+	command_fails_like(
+		[ 'pg_waldump', $broken_wal ],
+			qr/invalid magic number 0000/i,
+			'detailed error message shown for invalid WAL page magic'
+	);
+}
+
 # Test for: Display a message that we're skipping data if `from`
 # wasn't a pointer to the start of a record.
 sub test_pg_waldump_skip_bytes
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 80f1a548e08..c2a48bfe809 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -342,7 +342,8 @@ extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
 
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
-extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
+extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr,
+                                     char **errormsg);
 
 /* Return values from XLogPageReadCB. */
 typedef enum XLogPageReadResult
-- 
2.53.0



^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Propagate XLogFindNextRecord error to callers
@ 2026-03-23 12:00  Fujii Masao <[email protected]>
  parent: Anthonin Bonnefoy <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Fujii Masao @ 2026-03-23 12:00 UTC (permalink / raw)
  To: Anthonin Bonnefoy <[email protected]>; +Cc: Chao Li <[email protected]>; Mircea Cadariu <[email protected]>; pgsql-hackers; Japin Li <[email protected]>

On Mon, Mar 23, 2026 at 5:15 PM Anthonin Bonnefoy
<[email protected]> wrote:
> > +# Wrong WAL version. We copy an existing wal file and set the
> > +# page's magic value to 0000.
> >
> > Would it be better to describe the purpose of this test at the top?
> > For example:
> >
> >     # Test that pg_waldump reports a detailed error message when dumping
> >     # a WAL file with an invalid magic number (0000).
> >     {
> >             # The broken WAL file is created by copying a valid WAL file and
> >             # overwriting its magic number with 0000.
> >             my $broken_wal_dir = PostgreSQL::Test::Utils::tempdir_short();
>
> Sounds good, I've updated the patch.

Thanks for updating the patch!

It looks like the first part of the comment above wasn't included,
so I've added it. I also made a few cosmetic changes and updated
the commit message. Attached is the updated version of the patch.

Unless there are objections, I'll commit this version.

Regards,

-- 
Fujii Masao


Attachments:

  [application/octet-stream] v8-0001-Report-detailed-errors-from-XLogFindNextRecord-fa.patch (9.3K, 2-v8-0001-Report-detailed-errors-from-XLogFindNextRecord-fa.patch)
  download | inline diff:
From 7624005ad6d4413e27c133ef250381114a527b78 Mon Sep 17 00:00:00 2001
From: Fujii Masao <[email protected]>
Date: Mon, 23 Mar 2026 18:37:56 +0900
Subject: [PATCH v8] Report detailed errors from XLogFindNextRecord() failures.

Previously, XLogFindNextRecord() did not return detailed error information
when it failed to find a valid WAL record. As a result, callers such as
the WAL summarizer, pg_waldump, and pg_walinspect could only report generic
errors (e.g., "could not find a valid record after ..."), making
troubleshooting difficult.

This commit fix the issue by extending XLogFindNextRecord() to return
detailed error information on failure, and updating its callers to include
those details in their error messages.

For example, when pg_waldump is run on a WAL file with an invalid magic number,
it now reports not only the generic error but also the specific cause
(e.g., "invalid magic number").

Author: Anthonin Bonnefoy <[email protected]>
Reviewed-by: Mircea Cadariu <[email protected]>
Reviewed-by: Japin Li <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Reviewed-by: Fujii Masao <[email protected]>
Discussion: https://postgr.es/m/CAO6_XqoxJXddcT4wkd9Xd+cD6Sz-fyspRGuV4Bq-wbXG4pVNzA@mail.gmail.com
---
 contrib/pg_walinspect/pg_walinspect.c   | 16 ++++++++++----
 src/backend/access/transam/xlogreader.c | 24 ++++++++++++++++++---
 src/backend/postmaster/walsummarizer.c  | 17 ++++++++++-----
 src/bin/pg_waldump/pg_waldump.c         | 12 ++++++++---
 src/bin/pg_waldump/t/001_basic.pl       | 28 +++++++++++++++++++++++++
 src/include/access/xlogreader.h         |  3 ++-
 6 files changed, 84 insertions(+), 16 deletions(-)

diff --git a/contrib/pg_walinspect/pg_walinspect.c b/contrib/pg_walinspect/pg_walinspect.c
index f6f5f0792d2..4cf6e41e2f5 100644
--- a/contrib/pg_walinspect/pg_walinspect.c
+++ b/contrib/pg_walinspect/pg_walinspect.c
@@ -99,6 +99,7 @@ InitXLogReaderState(XLogRecPtr lsn)
 	XLogReaderState *xlogreader;
 	ReadLocalXLogPageNoWaitPrivate *private_data;
 	XLogRecPtr	first_valid_record;
+	char	   *errormsg;
 
 	/*
 	 * Reading WAL below the first page of the first segments isn't allowed.
@@ -126,12 +127,19 @@ InitXLogReaderState(XLogRecPtr lsn)
 				 errdetail("Failed while allocating a WAL reading processor.")));
 
 	/* first find a valid recptr to start from */
-	first_valid_record = XLogFindNextRecord(xlogreader, lsn);
+	first_valid_record = XLogFindNextRecord(xlogreader, lsn, &errormsg);
 
 	if (!XLogRecPtrIsValid(first_valid_record))
-		ereport(ERROR,
-				errmsg("could not find a valid record after %X/%08X",
-					   LSN_FORMAT_ARGS(lsn)));
+	{
+		if (errormsg)
+			ereport(ERROR,
+					errmsg("could not find a valid record after %X/%08X: %s",
+						   LSN_FORMAT_ARGS(lsn), errormsg));
+		else
+			ereport(ERROR,
+					errmsg("could not find a valid record after %X/%08X",
+						   LSN_FORMAT_ARGS(lsn)));
+	}
 
 	return xlogreader;
 }
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 8cb2110cb99..8849610db00 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1390,14 +1390,21 @@ XLogReaderResetError(XLogReaderState *state)
  *
  * This positions the reader, like XLogBeginRead(), so that the next call to
  * XLogReadRecord() will read the next valid record.
+ *
+ * On failure, InvalidXLogRecPtr is returned, and *errormsg is set to a string
+ * with details of the failure.
+ *
+ * When set, *errormsg points to an internal buffer that's valid until the next
+ * call to XLogReadRecord.
  */
 XLogRecPtr
-XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
+XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr, char **errormsg)
 {
 	XLogRecPtr	tmpRecPtr;
 	XLogRecPtr	found = InvalidXLogRecPtr;
 	XLogPageHeader header;
-	char	   *errormsg;
+
+	*errormsg = NULL;
 
 	Assert(XLogRecPtrIsValid(RecPtr));
 
@@ -1482,7 +1489,7 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 	 * or we just jumped over the remaining data of a continuation.
 	 */
 	XLogBeginRead(state, tmpRecPtr);
-	while (XLogReadRecord(state, &errormsg) != NULL)
+	while (XLogReadRecord(state, errormsg) != NULL)
 	{
 		/* past the record we've found, break out */
 		if (RecPtr <= state->ReadRecPtr)
@@ -1497,6 +1504,17 @@ XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr)
 err:
 	XLogReaderInvalReadState(state);
 
+	/*
+	 * We may have reported errors due to invalid WAL header, propagate the
+	 * error message to the caller.
+	 */
+	if (state->errormsg_deferred)
+	{
+		if (state->errormsg_buf[0] != '\0')
+			*errormsg = state->errormsg_buf;
+		state->errormsg_deferred = false;
+	}
+
 	return InvalidXLogRecPtr;
 }
 
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index e1aa102f41d..0c0670f7da9 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -915,6 +915,7 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 	WalSummaryIO io;
 	BlockRefTable *brtab = CreateEmptyBlockRefTable();
 	bool		fast_forward = true;
+	char	   *errormsg;
 
 	/* Initialize private data for xlogreader. */
 	private_data = palloc0_object(SummarizerReadLocalXLogPrivate);
@@ -966,7 +967,7 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 	}
 	else
 	{
-		summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn);
+		summary_start_lsn = XLogFindNextRecord(xlogreader, start_lsn, &errormsg);
 		if (!XLogRecPtrIsValid(summary_start_lsn))
 		{
 			/*
@@ -995,9 +996,16 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 				switch_lsn = xlogreader->EndRecPtr;
 			}
 			else
-				ereport(ERROR,
-						errmsg("could not find a valid record after %X/%08X",
-							   LSN_FORMAT_ARGS(start_lsn)));
+			{
+				if (errormsg)
+					ereport(ERROR,
+							errmsg("could not find a valid record after %X/%08X: %s",
+								   LSN_FORMAT_ARGS(start_lsn), errormsg));
+				else
+					ereport(ERROR,
+							errmsg("could not find a valid record after %X/%08X",
+								   LSN_FORMAT_ARGS(start_lsn)));
+			}
 		}
 
 		/* We shouldn't go backward. */
@@ -1010,7 +1018,6 @@ SummarizeWAL(TimeLineID tli, XLogRecPtr start_lsn, bool exact,
 	while (1)
 	{
 		int			block_id;
-		char	   *errormsg;
 		XLogRecord *record;
 		uint8		rmid;
 
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index f82507ef696..630d9859882 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -1384,11 +1384,17 @@ main(int argc, char **argv)
 		pg_fatal("out of memory while allocating a WAL reading processor");
 
 	/* first find a valid recptr to start from */
-	first_record = XLogFindNextRecord(xlogreader_state, private.startptr);
+	first_record = XLogFindNextRecord(xlogreader_state, private.startptr, &errormsg);
 
 	if (!XLogRecPtrIsValid(first_record))
-		pg_fatal("could not find a valid record after %X/%08X",
-				 LSN_FORMAT_ARGS(private.startptr));
+	{
+		if (errormsg)
+			pg_fatal("could not find a valid record after %X/%08X: %s",
+					 LSN_FORMAT_ARGS(private.startptr), errormsg);
+		else
+			pg_fatal("could not find a valid record after %X/%08X",
+					 LSN_FORMAT_ARGS(private.startptr));
+	}
 
 	/*
 	 * Display a message that we're skipping data if `from` wasn't a pointer
diff --git a/src/bin/pg_waldump/t/001_basic.pl b/src/bin/pg_waldump/t/001_basic.pl
index 11df7e092bf..8bb8fa225f6 100644
--- a/src/bin/pg_waldump/t/001_basic.pl
+++ b/src/bin/pg_waldump/t/001_basic.pl
@@ -4,6 +4,7 @@
 use strict;
 use warnings FATAL => 'all';
 use Cwd;
+use File::Copy;
 use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
@@ -246,6 +247,33 @@ command_like(
 	qr/^$/,
 	'no output with --quiet option');
 
+# Test that pg_waldump reports a detailed error message when dumping
+# a WAL file with an invalid magic number (0000).
+#
+# The broken WAL file is created by copying a valid WAL file and
+# overwriting its magic number with 0000.
+my $broken_wal_dir = PostgreSQL::Test::Utils::tempdir_short();
+my $broken_wal = "$broken_wal_dir/$start_walfile";
+copy($node->data_dir . '/pg_wal/' . $start_walfile, $broken_wal)
+  || die "copying $start_walfile $!";
+
+my $fh;
+open($fh, '+<', $broken_wal)
+  or BAIL_OUT("open failed: $!");
+binmode $fh;
+
+sysseek($fh, 0, 0)
+  or BAIL_OUT("sysseek failed: $!");
+syswrite($fh, pack("S", 0))
+  or BAIL_OUT("syswrite failed: $!");
+close($fh)
+  or BAIL_OUT("close failed: $!");
+
+command_fails_like(
+	[ 'pg_waldump', $broken_wal ],
+	qr/invalid magic number 0000/i,
+	'detailed error message shown for invalid WAL page magic');
+
 # Test for: Display a message that we're skipping data if `from`
 # wasn't a pointer to the start of a record.
 sub test_pg_waldump_skip_bytes
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 80f1a548e08..97eae2c1dab 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -342,7 +342,8 @@ extern void XLogReaderSetDecodeBuffer(XLogReaderState *state,
 
 /* Position the XLogReader to given record */
 extern void XLogBeginRead(XLogReaderState *state, XLogRecPtr RecPtr);
-extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr);
+extern XLogRecPtr XLogFindNextRecord(XLogReaderState *state, XLogRecPtr RecPtr,
+									 char **errormsg);
 
 /* Return values from XLogPageReadCB. */
 typedef enum XLogPageReadResult
-- 
2.51.2



^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Propagate XLogFindNextRecord error to callers
@ 2026-03-23 12:07  Anthonin Bonnefoy <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 1 reply; 6+ messages in thread

From: Anthonin Bonnefoy @ 2026-03-23 12:07 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: Chao Li <[email protected]>; Mircea Cadariu <[email protected]>; pgsql-hackers; Japin Li <[email protected]>

On Mon, Mar 23, 2026 at 1:00 PM Fujii Masao <[email protected]> wrote:
> It looks like the first part of the comment above wasn't included,
> so I've added it. I also made a few cosmetic changes and updated
> the commit message. Attached is the updated version of the patch.

Right, I've missed the first part... Thanks for the fix!





^ permalink  raw  reply  [nested|flat] 6+ messages in thread

* Re: Propagate XLogFindNextRecord error to callers
@ 2026-03-24 13:35  Fujii Masao <[email protected]>
  parent: Anthonin Bonnefoy <[email protected]>
  0 siblings, 0 replies; 6+ messages in thread

From: Fujii Masao @ 2026-03-24 13:35 UTC (permalink / raw)
  To: Anthonin Bonnefoy <[email protected]>; +Cc: Chao Li <[email protected]>; Mircea Cadariu <[email protected]>; pgsql-hackers; Japin Li <[email protected]>

On Mon, Mar 23, 2026 at 9:07 PM Anthonin Bonnefoy
<[email protected]> wrote:
>
> On Mon, Mar 23, 2026 at 1:00 PM Fujii Masao <[email protected]> wrote:
> > It looks like the first part of the comment above wasn't included,
> > so I've added it. I also made a few cosmetic changes and updated
> > the commit message. Attached is the updated version of the patch.
>
> Right, I've missed the first part... Thanks for the fix!

I've pushed the patch. Thanks!

Regards,

-- 
Fujii Masao





^ permalink  raw  reply  [nested|flat] 6+ messages in thread


end of thread, other threads:[~2026-03-24 13:35 UTC | newest]

Thread overview: 6+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-12-12 10:39 Propagate XLogFindNextRecord error to callers Anthonin Bonnefoy <[email protected]>
2026-03-17 09:48 ` Fujii Masao <[email protected]>
2026-03-23 08:15   ` Anthonin Bonnefoy <[email protected]>
2026-03-23 12:00     ` Fujii Masao <[email protected]>
2026-03-23 12:07       ` Anthonin Bonnefoy <[email protected]>
2026-03-24 13:35         ` Fujii Masao <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox