public inbox for [email protected]  
Shutdown indefinitely stuck due to unflushed FPI_FOR_HINT record
3+ messages / 1 participants

* Shutdown indefinitely stuck due to unflushed FPI_FOR_HINT record
@ 2026-02-20 17:55 Anthonin Bonnefoy <[email protected]>
  2026-02-24 09:46 ` Re: Shutdown indefinitely stuck due to unflushed FPI_FOR_HINT record Anthonin Bonnefoy <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Anthonin Bonnefoy @ 2026-02-20 17:55 UTC (permalink / raw)
  To: pgsql-hackers

Hi,

Shutdown may be indefinitely stuck under the following circumstances:
- Data checksums are enabled (needed to generate FPI_FOR_HINT record)
- A logical replication walsender is running
- A select in an explicit ongoing transaction pruned a heap page and
logged an FPI_FOR_HINT record. This record is likely to be a
contrecord that continues onto a new WAL page.

Starting the shutdown will kill this ongoing transaction. Since the
transaction doesn't have an allocated xid, the FPI_FOR_HINT record
will be left unflushed.

When the checkpointer calls ShutdownXLOG(), all walsenders will be
notified to stop. However, the logical replication walsender will be
stuck in an infinite loop, trying to read this unflushed record and
never reaching the stop state, blocking the whole shutdown sequence.

This can be reproduced with the following script (this assumes
`pgbench -i` was run to create pgbench_accounts and that a logical
replication walsender is running):

TRUNCATE pgbench_accounts;
-- Completely fill the first heap page
INSERT INTO pgbench_accounts SELECT *, *, *, '' FROM generate_series(0, 62);
-- This should tag the page's metadata as full
BEGIN;
UPDATE pgbench_accounts SET bid=4 where aid=1;
ROLLBACK;
-- Force checkpoint so next change will be a FPW
CHECKPOINT;
-- Open an explicit transaction
BEGIN;
-- Select will do an opportunistic pruning, find nothing to prune, but
-- will still unset the page full flag, writing an FPI_FOR_HINT
SELECT ctid, * FROM pgbench_accounts WHERE aid=2;

Then shut down the database with 'pg_ctl stop' while the transaction
is still open. The shutdown will hang and the logical replication
walsender will spin at 100% CPU.

I've managed to reproduce this issue on PostgreSQL 14 and on the current HEAD.

The attached (tentative) patch fixes the issue by flushing all records
before signaling walsenders to stop. At that point, all backends
should have been killed, so flushing leftover records felt like a
correct approach.

Regards,
Anthonin Bonnefoy


Attachments:

  [application/octet-stream] v1-0001-Fix-stuck-shutdown-due-to-unflushed-records.patch (2.1K, 2-v1-0001-Fix-stuck-shutdown-due-to-unflushed-records.patch)
  download | inline diff:
From f62b6b45594b4d58ffe739ca42ecd3aca6605c4c Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <[email protected]>
Date: Fri, 20 Feb 2026 18:15:12 +0100
Subject: Fix stuck shutdown due to unflushed records

Shutdown sequence may be stuck indefinitely under the following
circumstances:
- Data checksums are enabled
- A logical replication walsender is running
- A select in an explicit ongoing transaction pruned a heap page and
  logged a FPI_FOR_HINT record. This record is likely going to be a
  contrecord and start a new page.

Starting the shutdown will kill this ongoing transaction. Since the
transaction doesn't have an allocated xid, the FPI_FOR_HINT record will
be left unflushed.

When the checkpointer starts ShutdownXLOG(), all walsenders will be
notified to stop. However, the logical replication walsender will be
stuck in an infinite loop, trying to read this unflushed record, never
reaching the stop state and blocking the whole shutdown sequence.

This patch fixes the issue by flushing all records before signaling
walsenders to stop.
---
 src/backend/access/transam/xlog.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 13ec6225b85..aa490176aaf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6727,6 +6727,8 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 void
 ShutdownXLOG(int code, Datum arg)
 {
+	XLogRecPtr	WriteRqstPtr;
+
 	/*
 	 * We should have an aux process resource owner to use, and we should not
 	 * be in a transaction that's installed some other resowner.
@@ -6740,6 +6742,15 @@ ShutdownXLOG(int code, Datum arg)
 	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
 			(errmsg("shutting down")));
 
+	/*
+	 * We may have unflushed records, make sure everything is flushed before
+	 * stopping the walsenders.
+	 */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	WriteRqstPtr = XLogCtl->LogwrtRqst.Write;
+	SpinLockRelease(&XLogCtl->info_lck);
+	XLogFlush(WriteRqstPtr);
+
 	/*
 	 * Signal walsenders to move to stopping state.
 	 */
-- 
2.52.0




* Re: Shutdown indefinitely stuck due to unflushed FPI_FOR_HINT record
  2026-02-20 17:55 Shutdown indefinitely stuck due to unflushed FPI_FOR_HINT record Anthonin Bonnefoy <[email protected]>
@ 2026-02-24 09:46 ` Anthonin Bonnefoy <[email protected]>
  2026-02-26 10:35   ` Re: Shutdown indefinitely stuck due to unflushed FPI_FOR_HINT record Anthonin Bonnefoy <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Anthonin Bonnefoy @ 2026-02-24 09:46 UTC (permalink / raw)
  To: pgsql-hackers

Some additional information:

XLogSendLogical is stuck in the following infinite loop:
- It attempts to read the next record with XLogReadAhead + XLogDecodeNextRecord
- The page with the record header is read
- It has the record header, it goes back to XLogDecodeNextRecord
- tot_len > len, the record needs to be reassembled
- The next page containing the rest of the record is read with
ReadPageInternal. It fails since this page was never written.
- It jumps to the err label, XLogReaderInvalReadState(state) is called
and resets the reader state
- It goes back to the start of WalSndLoop's loop
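
The loop above can be pictured with a toy model. This is only an
illustrative sketch, not the actual xlogreader or walsender code: the
toy_* names are hypothetical, XLOG_BLCKSZ is assumed to be the default
8192, and the page-read rule (a read succeeds only if WAL is flushed up
to page start + requested length) is a simplification. Treating the 3064
pending bytes as the length of the record tail is also an approximation.

```c
#include <stdbool.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192u       /* assumed default WAL page size */

typedef uint64_t XLogRecPtr;

/*
 * Hypothetical stand-in for the page-read callback: reading req_len bytes
 * from the page starting at page_ptr succeeds only if WAL has been flushed
 * at least that far.
 */
static bool
toy_read_page(XLogRecPtr page_ptr, uint32_t req_len, XLogRecPtr flushed_upto)
{
    return page_ptr + req_len <= flushed_upto;
}

/*
 * One decode attempt for a record whose header sits on header_page and
 * whose tail (tail_len bytes) continues on the following page.
 */
static bool
toy_decode_attempt(XLogRecPtr header_page, uint32_t tail_len,
                   XLogRecPtr flushed_upto)
{
    /* The page holding the record header must be readable first. */
    if (!toy_read_page(header_page, XLOG_BLCKSZ, flushed_upto))
        return false;

    /* tot_len > len: the rest of the record lives on the next page. */
    return toy_read_page(header_page + XLOG_BLCKSZ, tail_len, flushed_upto);
}

/* Count how many of max_attempts fail while the flush position stays put. */
static int
toy_count_failed_attempts(XLogRecPtr header_page, uint32_t tail_len,
                          XLogRecPtr flushed_upto, int max_attempts)
{
    int         failed = 0;

    for (int i = 0; i < max_attempts; i++)
    {
        /* Nothing moves flushed_upto, so every retry sees the same state. */
        if (!toy_decode_attempt(header_page, tail_len, flushed_upto))
            failed++;
    }
    return failed;
}
```

With the flush position fixed at 39124992, a call such as
toy_decode_attempt(39116800, 3064, 39124992) fails on every retry; in
this model it only succeeds once the flush position reaches 39128056.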

The walsender does try to flush the WAL using
XLogBackgroundFlush:
  /*
   * If we're shutting down, trigger pending WAL to be written out,
   * otherwise we'd possibly end up waiting for WAL that never gets
   * written, because walwriter has shut down already.
   */
  if (got_STOPPING)
    XLogBackgroundFlush();

However, XLogBackgroundFlush only writes completed blocks, or up to
the latest known async xact LSN. With the issue triggered, I have the
following state:

XLogCtl->LogwrtRqst: (Write = 39128056, Flush = 39124992)
LogwrtResult: (Write = 39124992, Flush = 39124992)
XLogCtl->asyncXactLSN: 39119776

There are 3064 bytes (39128056 - 39124992) still to be written: they
sit on the next page and hold the rest of the contrecord.
However, XLogBackgroundFlush backs off to the previous page boundary:
  /* back off to last completed page boundary */
  WriteRqst.Write -= WriteRqst.Write % XLOG_BLCKSZ;
This means WriteRqst.Write will be 39124992, which is already written
and flushed, and asyncXactLSN is behind both the write and flush positions.

So, it looks like the root issue is rather that the async LSN isn't
updated when a transaction without an xid is rolled back.
When going through CommitTransaction, such a transaction would still
go through XLogSetAsyncXactLSN.

I've updated the patch with this new approach: XLogSetAsyncXactLSN is
now called in RecordTransactionAbort even when an xid wasn't assigned.
With this, the logical walsender is able to force the flush of the
last partial page using XLogBackgroundFlush.
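
As a rough sketch (not the actual xlog.c code), the way the flush target
is picked can be modelled as below. sketch_flush_target is a hypothetical
name, XLOG_BLCKSZ is assumed to be the default 8192, and a return value of
0 stands for "nothing to write". The "after" case assumes that the
advanced async LSN ends up equal to LogwrtRqst.Write (39128056), which is
an assumption and not a value reported in this thread.

```c
#include <stdint.h>

#define XLOG_BLCKSZ 8192u       /* assumed default WAL page size */

typedef uint64_t XLogRecPtr;

/*
 * Simplified model of the target selection: back off to the last completed
 * page boundary, fall back to the async xact LSN if that is already
 * flushed, and give up if the fallback is already flushed too.
 */
static XLogRecPtr
sketch_flush_target(XLogRecPtr write_rqst, XLogRecPtr flushed,
                    XLogRecPtr async_xact_lsn)
{
    /* back off to last completed page boundary */
    XLogRecPtr  target = write_rqst - write_rqst % XLOG_BLCKSZ;

    /* already flushed that far: consider the async xact LSN instead */
    if (target <= flushed)
        target = async_xact_lsn;

    /* still nothing new: report "nothing to write" */
    if (target <= flushed)
        return 0;

    return target;
}
```

With the reported state, sketch_flush_target(39128056, 39124992, 39119776)
yields 0, so the partial page is never written. If the async LSN is
advanced to 39128056 instead, the same call returns 39128056 and the
partial page becomes eligible for writing.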


Attachments:

  [application/octet-stream] v2-0001-Fix-stuck-shutdown-due-to-unflushed-records.patch (2.7K, 2-v2-0001-Fix-stuck-shutdown-due-to-unflushed-records.patch)
  download | inline diff:
From 28f552b9ccb076627d2577b7daeb23a93a1e50ef Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <[email protected]>
Date: Tue, 24 Feb 2026 09:24:48 +0100
Subject: Fix stuck shutdown due to unflushed records

Shutdown sequence may be stuck indefinitely under the following
circumstances:
- Data checksums are enabled
- A logical replication walsender is running
- A select in an explicit transaction tries to prune a full heap page
  and writes an FPI_FOR_HINT record which crosses the page boundary
- The select is rolled back (or killed)
- 'pg_ctl stop' is sent

The FPI_FOR_HINT record is likely going to be a contrecord and start a
new page. However, as the select is rolled back, XLogSetAsyncXactLSN
isn't called to advance the LSN to include this record.

When the checkpointer starts ShutdownXLOG(), all walsenders will be
notified to stop. However, the logical replication walsender will be
stuck in the following infinite loop:
- Tries to read the last FPI_FOR_HINT record
- The page with the record header is read
- tot_len > len, the record needs to be reassembled
- Tries to read the next page containing the rest of the record. It
  fails since this page was never written.
- xlog reader state is reset with XLogReaderInvalReadState
- It goes back to the start of WalSndLoop's loop

There are some attempts done by the walsender to flush the WAL using
XLogBackgroundFlush. However, XLogBackgroundFlush only writes completed
blocks, or up to the latest known async lsn.

Since the select was rolled back, XLogBackgroundFlush doesn't flush the
next partial page.

This patch fixes the issue by advancing the async LSN, even when the
transaction doesn't have an assigned xid. This allows
XLogBackgroundFlush to write the necessary partial page when called by
the walsender.
---
 src/backend/access/transam/xact.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index eba4f063168..0a49d5b603c 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -1787,7 +1787,18 @@ RecordTransactionAbort(bool isSubXact)
 	{
 		/* Reset XactLastRecEnd until the next transaction writes something */
 		if (!isSubXact)
+		{
+			/*
+			 * Even if no xid was assigned, some records may have been written
+			 * in the WAL. Report the latest async LSN, so that the WAL writer
+			 * knows to flush those records. This is important when shutting
+			 * down, walsender may use XLogBackgroundFlush to trigger pending
+			 * WAL to be written out. If they're not tracked by async xact
+			 * lsn, they won't be written by XLogBackgroundFlush.
+			 */
+			XLogSetAsyncXactLSN(XactLastRecEnd);
 			XactLastRecEnd = 0;
+		}
 		return InvalidTransactionId;
 	}
 
-- 
2.52.0




* Re: Shutdown indefinitely stuck due to unflushed FPI_FOR_HINT record
  2026-02-20 17:55 Shutdown indefinitely stuck due to unflushed FPI_FOR_HINT record Anthonin Bonnefoy <[email protected]>
  2026-02-24 09:46 ` Re: Shutdown indefinitely stuck due to unflushed FPI_FOR_HINT record Anthonin Bonnefoy <[email protected]>
@ 2026-02-26 10:35   ` Anthonin Bonnefoy <[email protected]>
  0 siblings, 0 replies; 3+ messages in thread

From: Anthonin Bonnefoy @ 2026-02-26 10:35 UTC (permalink / raw)
  To: pgsql-hackers

And here's a script reproducing the issue. It creates the clusters,
sets up the logical replication and runs the necessary query to leave
FPI_FOR_HINT as the last written record.

If successful, the script should leave 'pg_ctl stop' stuck on 'waiting for
server to shut down.......', with the walsender stuck at 100% CPU.


Attachments:

  [text/x-sh] reproduce_stuck_shutdown.sh (1.3K, 2-reproduce_stuck_shutdown.sh)
  download | inline:
#!/bin/bash
set -eu

export PGDATABASE=postgres

# Setup primary
initdb -k -D primary
echo "port = 5432
wal_level = logical
# Just make it easier to gdb into the walsender without getting it killed
wal_receiver_status_interval = 0
wal_sender_timeout = 0" > primary/postgresql.conf

# Start it
pg_ctl -D primary -l primary.log -U postgres start

# Setup replica
initdb -k -D replica
echo "port = 5433
wal_receiver_timeout = 0" > replica/postgresql.conf

# Start it
pg_ctl -D replica -l replica.log -U postgres start

# Create empty pgbench tables
pgbench -i -Idtp
pgbench -i -Idtp -p 5433

# Start logical replication
psql -c 'CREATE PUBLICATION pgbench_accounts_replication for table pgbench_accounts;'
psql -p 5433 -c "CREATE SUBSCRIPTION my_subscription CONNECTION 'host=127.0.0.1 port=5432' PUBLICATION pgbench_accounts_replication;"

# Fill the first heap page
psql -c "INSERT INTO pgbench_accounts SELECT *, *, *, '' FROM generate_series(0, 62);"

# Set page full hint bit
psql -c "BEGIN; UPDATE pgbench_accounts SET bid=4 where aid=1; ROLLBACK;"

# Force next change to be a FPI
psql -c "CHECKPOINT;"

# Trigger the FPI_FOR_HINT as the last written record in the WAL
psql -c "BEGIN; SELECT ctid, * FROM pgbench_accounts WHERE aid=2; ROLLBACK;"

# Stop the primary, it should be blocked with the walsender stuck at 100% CPU
pg_ctl stop -D primary
