public inbox for [email protected]  
help / color / mirror / Atom feed
From: Anthonin Bonnefoy <[email protected]>
To: PostgreSQL Hackers <[email protected]>
Subject: Re: Shutdown indefinitely stuck due to unflushed FPI_FOR_HINT record
Date: Tue, 24 Feb 2026 10:46:51 +0100
Message-ID: <CAO6_XqrZEREa5d+dyjahX6bteBhoN=8Jid-3a4f6Q35sWrv9eg@mail.gmail.com> (raw)
In-Reply-To: <CAO6_Xqo3co3BuUVEVzkaBVw9LidBgeeQ_2hfxeLMQcXwovB3GQ@mail.gmail.com>
References: <CAO6_Xqo3co3BuUVEVzkaBVw9LidBgeeQ_2hfxeLMQcXwovB3GQ@mail.gmail.com>

Some additional informations:

XLogSendLogical is stuck in the following infinite loop:
- It attempt to read the next record with XLogReadAhead + XLogDecodeNextRecord
- The page with the record header is read
- It has the record header, it goes back to XLogDecodeNextRecord
- tot_len > len, the record needs to be reassembled
- The next page containing the rest of the record is read with
ReadPageInternal. It fails since this page was never written.
- It jumps to the err label, XLogReaderInvalReadState(state) is called
and reset the reader state
- It goes back to the start of WalSndLoop's loop

There are some attempts done by the walsender to flush the WAL using
XLogBackgroundFlush:
  /*
   * If we're shutting down, trigger pending WAL to be written out,
   * otherwise we'd possibly end up waiting for WAL that never gets
   * written, because walwriter has shut down already.
   */
  if (got_STOPPING)
    XLogBackgroundFlush();

However, XLogBackgroundFlush only writes completed blocks or the
latest async xact known. With the issue triggered, I have the
following state:

XLogCtl->LogwrtRqst: (Write = 39128056, Flush = 39124992)
LogwrtResult: (Write = 39124992, Flush = 39124992)
XLogCtl->asyncXactLSN: 39119776

There are 3064 bytes (39128056 - 39124992) that contain the next page
with the rest of the cont record that still needs to be written.
However, XLogBackgroundFlush backs off to the previous page boundary:
  /* back off to last completed page boundary */
  WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
Meaning WriteRqst.Write will be 39124992, which is already written and
flushed and asyncXactLSN is behind both write and flush.

So, it looks like the root issue is more that the async LSN isn't
updated when a transaction without xid is rollbacked.
When going through CommitTransaction, such a transaction would still
go through XLogSetAsyncXactLSN.

I've updated the patch with this new approach: XLogSetAsyncXactLSN is
now called in RecordTransactionAbort even when a xid wasn't assigned.
With this, the logical walsender is able to force the flush of the
last partial page using XLogBackgroundFlush.


Attachments:

  [application/octet-stream] v2-0001-Fix-stuck-shutdown-due-to-unflushed-records.patch (2.7K, 2-v2-0001-Fix-stuck-shutdown-due-to-unflushed-records.patch)
  download | inline diff:
From 28f552b9ccb076627d2577b7daeb23a93a1e50ef Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <[email protected]>
Date: Tue, 24 Feb 2026 09:24:48 +0100
Subject: Fix stuck shutdown due to unflushed records

Shutdown sequence may be stuck indefinitely under the following
circumstances:
- Data checksums is enabled
- A logical replication walsender is running
- A select in an explicit transaction tries to prune a full heap page,
  wrote a FPI_FOR_HINT record which crosses the page boundary
- The select is rollbacked (or killed)
- 'pg_ctl stop' is sent

The FPI_FOR_HINT record is likely going to be a contrecord and starts a
new page. However, as the select is rollbacked, XLogSetAsyncXactLSN
isn't called to advance the LSN to include this record.

When the checkpointer starts ShutdownXLOG(), all walsenders will be
notified to stop. However, the logical replication walsender will be
stuck in the following infinite loop:
- Tries to read the last FPI_FOR_HINT record
- The page with the record header is read
- tot_len > len, the record needs to be reassembled
- Tries to read the next page containing the rest of the record. It fails since this page was never written.
- xlog reader state is reset with XLogReaderInvalReadState
- It goes back to the start of WalSndLoop's loop

There are some attempts done by the walsender to flush the WAL using
XLogBackgroundFlush. However, XLogBackgroundFlush only writes completed
blocks, or up to the latest known async lsn.

Since the select was rollbacked, XLogBackgroundFlush doesn't flush the
next partial page.

This patch fixes the issue by advancing the async LSN, even when the
transaction doesn't have an assigned xid. This allows
XLogBackgroundFlush to write the necessary partial page when called by
the walsender.
---
 src/backend/access/transam/xact.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index eba4f063168..0a49d5b603c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1787,7 +1787,18 @@ RecordTransactionAbort(bool isSubXact)
 	{
 		/* Reset XactLastRecEnd until the next transaction writes something */
 		if (!isSubXact)
+		{
+			/*
+			 * Even if no xid was assigned, some records may have been written
+			 * in the WAL. Report the latest async LSN, so that the WAL writer
+			 * knows to flush those records. This is important when shutting
+			 * down, walsender may use XLogBackgroundFlush to trigger pending
+			 * WAL to be written out. If they're not tracked by async xact
+			 * lsn, they won't be written by XLogBackgroundFlush.
+			 */
+			XLogSetAsyncXactLSN(XactLastRecEnd);
 			XactLastRecEnd = 0;
+		}
 		return InvalidTransactionId;
 	}
 
-- 
2.52.0



view thread (3+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected]
  Subject: Re: Shutdown indefinitely stuck due to unflushed FPI_FOR_HINT record
  In-Reply-To: <CAO6_XqrZEREa5d+dyjahX6bteBhoN=8Jid-3a4f6Q35sWrv9eg@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox