public inbox for [email protected]  
From: Anthonin Bonnefoy <[email protected]>
To: PostgreSQL Hackers <[email protected]>
Subject: Shutdown indefinitely stuck due to unflushed FPI_FOR_HINT record
Date: Fri, 20 Feb 2026 18:55:33 +0100
Message-ID: <CAO6_Xqo3co3BuUVEVzkaBVw9LidBgeeQ_2hfxeLMQcXwovB3GQ@mail.gmail.com>

Hi,

Shutdown may be indefinitely stuck under the following circumstances:
- Data checksums are enabled (needed to generate FPI_FOR_HINT records)
- A logical replication walsender is running
- A SELECT in an ongoing explicit transaction pruned a heap page and
logged an FPI_FOR_HINT record. This record is likely to be a
contrecord, continuing onto a new WAL page.

Starting the shutdown terminates this ongoing transaction. Since the
transaction has no assigned xid, the FPI_FOR_HINT record is left
unflushed.

When the checkpointer calls ShutdownXLOG(), all walsenders are told to
stop. The logical replication walsender, however, loops forever trying
to read this unflushed record and never reaches the stopped state,
which blocks the whole shutdown sequence.

This can be reproduced with the following script, assuming `pgbench -i`
has been run to create pgbench_accounts and a logical replication
walsender is running:

TRUNCATE pgbench_accounts;
-- Completely fill the first heap page
INSERT INTO pgbench_accounts SELECT *, *, *, '' FROM generate_series(0, 62);
-- This should set the page's "full" hint flag (PD_PAGE_FULL)
BEGIN;
UPDATE pgbench_accounts SET bid=4 WHERE aid=1;
ROLLBACK;
-- Force a checkpoint so the next change to the page is a full-page write
CHECKPOINT;
-- Open an explicit transaction
BEGIN;
-- The SELECT does opportunistic pruning; it finds nothing to prune but
-- still clears the page-full flag, writing an FPI_FOR_HINT record
SELECT ctid, * FROM pgbench_accounts WHERE aid=2;

Then shut down the database with `pg_ctl stop` while leaving the
transaction open. The shutdown hangs, and the logical replication
walsender spins at 100% CPU.

I've reproduced this issue on PostgreSQL 14 and on current HEAD.

The attached (tentative) patch fixes the issue by flushing all WAL
records before signaling the walsenders to stop. By that point, all
backends should already have exited, so flushing any leftover records
seemed like the right approach.

Regards,
Anthonin Bonnefoy


Attachments:

  [application/octet-stream] v1-0001-Fix-stuck-shutdown-due-to-unflushed-records.patch (2.1K)
From f62b6b45594b4d58ffe739ca42ecd3aca6605c4c Mon Sep 17 00:00:00 2001
From: Anthonin Bonnefoy <[email protected]>
Date: Fri, 20 Feb 2026 18:15:12 +0100
Subject: Fix stuck shutdown due to unflushed records

Shutdown sequence may be stuck indefinitely under the following
circumstances:
- Data checksums are enabled
- A logical replication walsender is running
- A SELECT in an ongoing explicit transaction pruned a heap page and
  logged an FPI_FOR_HINT record. This record is likely to be a
  contrecord, continuing onto a new WAL page.

Starting the shutdown will kill this ongoing transaction. Since the
transaction doesn't have an allocated xid, the FPI_FOR_HINT record will
be left unflushed.

When the checkpointer starts ShutdownXLOG(), all walsenders will be
notified to stop. However, the logical replication walsender will be
stuck in an infinite loop, trying to read this unflushed record, never
reaching the stop state and blocking the whole shutdown sequence.

This patch fixes the issue by flushing all records before signaling
walsenders to stop.
---
 src/backend/access/transam/xlog.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 13ec6225b85..aa490176aaf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -6727,6 +6727,8 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 void
 ShutdownXLOG(int code, Datum arg)
 {
+	XLogRecPtr	WriteRqstPtr;
+
 	/*
 	 * We should have an aux process resource owner to use, and we should not
 	 * be in a transaction that's installed some other resowner.
@@ -6740,6 +6742,15 @@ ShutdownXLOG(int code, Datum arg)
 	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
 			(errmsg("shutting down")));
 
+	/*
+	 * We may have unflushed records; make sure everything is flushed before
+	 * stopping the walsenders.
+	 */
+	SpinLockAcquire(&XLogCtl->info_lck);
+	WriteRqstPtr = XLogCtl->LogwrtRqst.Write;
+	SpinLockRelease(&XLogCtl->info_lck);
+	XLogFlush(WriteRqstPtr);
+
 	/*
 	 * Signal walsenders to move to stopping state.
 	 */
-- 
2.52.0


