Deadlock detector fails to activate on a hot standby replica

public inbox for [email protected]  
help / color / mirror / Atom feed

Deadlock detector fails to activate on a hot standby replica
4+ messages / 2 participants
[nested] [flat]

* Deadlock detector fails to activate on a hot standby replica
@ 2026-01-19 12:43  Vitaly Davydov <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Vitaly Davydov @ 2026-01-19 12:43 UTC (permalink / raw)
  To: [email protected]

Dear Hackers,

The deadlock detection mechanism fails to activate when a deadlock occurs
between startup and backend processes on a hot standby replica, resulting in
unforeseen delays in the recovery. The deadlock may happen, when processing
XLOG_HEAP2_PRUNE_* messages. Automatic resolution of deadlocks remains
possible when reaching the specified max_standby_streaming_delay value, if it is
set. Sometimes this value is set to -1 which disables this timeout. This
issue appears consistently in versions 15 and later, when
log_startup_progress_interval was introduced.

The startup process notify the conflicting backend process to check for deadlocks
when deadlock_timeout is reached. It works in general, but doesn't work in some
scenarios. If to set deadlock_timeout to be greater than
log_startup_progress_interval, the deadlock detector will never be triggered,
but the startup process will wait for the the deadlock resolution until
max_standby_streaming_delay timeout is reached (if it is set).

It is reproducible with the attached tap test 900_startup_backend_deadlock.pl.
To reproduce, just copy this test into src/test/recovery/t and run it.

The problem seems to appear in timeout.c functionality, or in
ResolveRecoveryConflictWithBufferPin depending on how to understand the
semantics of the timeout api. The root cause - handle_sig_alarm (SIGALRM handler)
may be called when no active timeouts are reached. It sets the process latch
unconditionally, this, waking up the process.

The problem may be in an optimization when setitimer may not be called,
when the closest final time of active timeouts is greater than already set time.
The SIGARLM handler may be called when no active timeouts are reached.

Below is the scenatio when deadlock timeout is not activated:

(1) The startup process sets startup_progress_interval to 1000ms and continues
with the recovery of the received WAL.

(2) When processing XLOG_HEAP2_PRUNE_*, the startup process tries to lock the
buffer using LockBufferForCleanup that calls ResolveRecoveryConflictWithBufferPin.
The deadlock of startup and backend processes is possible (see
src/test/recovery/t/031_recovery_conflict.pl test). Image, we come to the deadlock.

(3) ResolveRecoveryConflictWithBufferPin sets deadlock timeout to 3000 ms and
waits for buffer pin to be unlocked or for the timeout using ProcWaitForSignal.

(4) When the startup process in ProcWaitForSignal, handle_sig_alarm is called
because startup_progress_interval is reached (the timeout was disabled, but
the real timer was not reset). It sets the process latch unconditionally and
reschedules timers - the current active timer will be rescheduled in ~2000 ms in
our case, if XLOG_HEAP2_PRUNE_ was received right after step (1).
It means, the next call of handle_sig_alarm will be in 2000 ms.

(5) ResolveRecoveryConflictWithBufferPin continues after ProcWaitForSignal,
disables all active timeouts and returns. LockBufferForCleanup sees that the
buffer is still locked and calls ResolveRecoveryConflictWithBufferPin again.

(6) ResolveRecoveryConflictWithBufferPin sets deadlock timeout to 3000 ms, but
the real timer is not changed - it will be triggered in 2000 ms. And, then,
wits for timeout in ProcWaitForSignal.

(7) The SIGALRM handler (handle_sig_alarm) is called in 2000 ms, it sets the
process latch, but the deadlock timeout is not yet reached. Once, it is not
reached, the startup process will not signal to the conflicting backend to check
for deadlocks. ResolveRecoveryConflictWithBufferPin resets all timeouts again
and transfer control to LockBufferForCleanup. The buffer is still locked, it
calls ResolveRecoveryConflictWithBufferPin again.

(8) And so on... The startup process will run forever. It will loop in
LockBufferForCleanup without any progress in recovery.

The problem is here - if an unforeseen SIGALRM is received before deadlock
timeout, it can lead to infinite loop in LockBufferForCleanup.

I see a couple of possible solutions:
1. Call seitimer every time when needed (see the demo patch [1]).
2. Redesign LockBufferForCleanup logic to support the cases when SIGALRM may
come unexpectedly.
3. Call SetLatch in handle_sig_alarm only if some timeout is reached.

The solution 1 is a simpler one, but it can not guarantee that some other
functionaly will set a timeout and will affect LockBufferForCleanup. The
solution 2 seems to be more robust, but it is harder to implement. Furthermore,
I can not exclude some other places, where the timeout functionality is used in
a wrong way. Solution 3 seems to be the simplest but there is an opinion, that
any SIGALRM should wake up the process (set the latch).

Any ideas?

[1] 900_startup_backend_deadlock.pl
[2] 0001-Fix-deadlock-detector-activation-in-startup-process.patch

Attachments:

  [application/x-perl] 900_startup_backend_deadlock.pl (4.4K, 2-900_startup_backend_deadlock.pl)
  download

  [text/x-patch] 0001-Fix-deadlock-detector-activation-in-startup-process.patch (877B, 3-0001-Fix-deadlock-detector-activation-in-startup-process.patch)
  download | inline diff:
From f50bdfc0beea8da43265eb279d3bc1e2a8c86b8b Mon Sep 17 00:00:00 2001
From: Vitaly Davydov <[email protected]>
Date: Mon, 19 Jan 2026 14:30:36 +0300
Subject: [PATCH] Fix deadlock detector activation in startup process

---
 src/backend/utils/misc/timeout.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/src/backend/utils/misc/timeout.c b/src/backend/utils/misc/timeout.c
index ddba5dc607c..3ef819949de 100644
--- a/src/backend/utils/misc/timeout.c
+++ b/src/backend/utils/misc/timeout.c
@@ -313,8 +313,10 @@ schedule_alarm(TimestampTz now)
 		 * to trigger the interrupt is likely to be a bit later than
 		 * signal_due_at.  That's fine, for the same reasons described above.
 		 */
+		/*
 		if (signal_pending && nearest_timeout >= signal_due_at)
 			return;
+		*/

 		/*
 		 * As with calling enable_alarm(), we must set signal_pending *before*
-- 
2.43.0

^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Deadlock detector fails to activate on a hot standby replica
@ 2026-01-23 11:51  Vitaly Davydov <[email protected]>
  parent: Vitaly Davydov <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Vitaly Davydov @ 2026-01-23 11:51 UTC (permalink / raw)
  To: [email protected]

Dear Hackers,

I would like to propose a patch that fixes the problem, which has the roots in
the possibility of spontaneous SIGALRM signals when waiting for some timeouts.
The idea of the patch - ignore spontaneous SIGALRM signals and continue waiting
for expected timeouts or buffer unpinning by the conflicting backend. This
patch is not a final version. I plan to add a tap-test for this case.

I'm in doubt to put the calls of some page buffer specific functions into
ResolveRecoveryConflictWithBufferPin (standby.c), but otherwise we have to
do more changes in LockBufferForCleanup and ResolveRecoveryConflictWithBufferPin.

I also think, we have to add some description of the found problem in timeout.c,
because the implemented optimization of setitimer calls leads to some not
evident consequences. The optimization seems to be implemented in the commit:
09cf1d52267644cdbdb734294012cf1228745aaa

With best regards,
Vitaly

Attachments:

  [text/x-patch] v1-0001-Fix-deadlock-detector-activation-in-a-recovery-confl.patch (6.5K, 2-v1-0001-Fix-deadlock-detector-activation-in-a-recovery-confl.patch)
  download | inline diff:
From f31ae975dfba0271d8dd61b44008fc2de14df68b Mon Sep 17 00:00:00 2001
From: Vitaly Davydov <[email protected]>
Date: Fri, 23 Jan 2026 14:20:05 +0300
Subject: [PATCH] Fix deadlock detector activation in a recovery conflict

When the startup process in a deadlock with a backend, it sends the
signal to the backend to trigger the deadlock detector when
the deadlock timeout is elapsed (deadlock_timeout guc). Due to some
optimization in timeout.c, when spontaneous SIGALRM signals are
possible, which doesn't relate to any enabled timeout, the function
ResolveRecoveryConflictWithBufferPin can never send the signal to the
conflicting backend, becase the deadlock timeout will never be
triggered.

The patch fixes ResolveRecoveryConflictWithBufferPin by ignoring
spontaneous SIGALRM signals, that are possible in the current
implementation of timeout.c functionality.
---
 src/backend/storage/buffer/bufmgr.c |  2 +-
 src/backend/storage/ipc/standby.c   | 80 +++++++++++++++++++----------
 src/include/storage/standby.h       |  2 +-
 3 files changed, 54 insertions(+), 30 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 6f935648ae9..3b9fb78842a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -6627,7 +6627,7 @@ LockBufferForCleanup(Buffer buffer)
 			/* Publish the bufid that Startup process waits on */
 			SetStartupBufferPinWaitBufId(buffer - 1);
 			/* Set alarm and then wait to be signaled by UnpinBuffer() */
-			ResolveRecoveryConflictWithBufferPin();
+			ResolveRecoveryConflictWithBufferPin(buffer);
 			/* Reset the published bufid */
 			SetStartupBufferPinWaitBufId(-1);
 		}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index afffab77106..fde5f45781f 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -30,6 +30,7 @@
 #include "storage/procarray.h"
 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
+#include "storage/buf_internals.h"
 #include "utils/hsearch.h"
 #include "utils/injection_point.h"
 #include "utils/ps_status.h"
@@ -790,11 +791,13 @@ cleanup:
  * at least deadlock_timeout.
  */
 void
-ResolveRecoveryConflictWithBufferPin(void)
+ResolveRecoveryConflictWithBufferPin(Buffer buffer)
 {
 	TimestampTz ltime;
 
 	Assert(InHotStandby);
+	Assert(BufferIsValid(buffer));
+	Assert(!BufferIsLocal(buffer));
 
 	ltime = GetStandbyLimitTime();
 
@@ -831,35 +834,56 @@ ResolveRecoveryConflictWithBufferPin(void)
 		enable_timeouts(timeouts, cnt);
 	}
 
-	/*
-	 * Wait to be signaled by UnpinBuffer() or for the wait to be interrupted
-	 * by one of the timeouts established above.
-	 *
-	 * We assume that only UnpinBuffer() and the timeout requests established
-	 * above can wake us up here. WakeupRecovery() called by walreceiver or
-	 * SIGHUP signal handler, etc cannot do that because it uses the different
-	 * latch from that ProcWaitForSignal() waits on.
-	 */
-	ProcWaitForSignal(WAIT_EVENT_BUFFER_CLEANUP);
-
-	if (got_standby_delay_timeout)
-		SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
-	else if (got_standby_deadlock_timeout)
+	for (;;)
 	{
 		/*
-		 * Send out a request for hot-standby backends to check themselves for
-		 * deadlocks.
-		 *
-		 * XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
-		 * to be signaled by UnpinBuffer() again and send a request for
-		 * deadlocks check if deadlock_timeout happens. This causes the
-		 * request to continue to be sent every deadlock_timeout until the
-		 * buffer is unpinned or ltime is reached. This would increase the
-		 * workload in the startup process and backends. In practice it may
-		 * not be so harmful because the period that the buffer is kept pinned
-		 * is basically no so long. But we should fix this?
-		 */
-		SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+		* Wait to be signaled by UnpinBuffer() or for the wait to be interrupted
+		* by one of the timeouts established above.
+		*
+		* We assume that only UnpinBuffer() and the timeout requests established
+		* above can wake us up here. WakeupRecovery() called by walreceiver or
+		* SIGHUP signal handler, etc cannot do that because it uses the different
+		* latch from that ProcWaitForSignal() waits on.
+		*/
+		ProcWaitForSignal(WAIT_EVENT_BUFFER_CLEANUP);
+
+		if (got_standby_delay_timeout)
+			SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
+		else if (got_standby_deadlock_timeout)
+		{
+			/*
+			* Send out a request for hot-standby backends to check themselves for
+			* deadlocks.
+			*
+			* XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+			* to be signaled by UnpinBuffer() again and send a request for
+			* deadlocks check if deadlock_timeout happens. This causes the
+			* request to continue to be sent every deadlock_timeout until the
+			* buffer is unpinned or ltime is reached. This would increase the
+			* workload in the startup process and backends. In practice it may
+			* not be so harmful because the period that the buffer is kept pinned
+			* is basically no so long. But we should fix this?
+			*/
+			SendRecoveryConflictWithBufferPin(PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK);
+		}
+		else
+		{
+			BufferDesc *bufHdr = GetBufferDescriptor(buffer - 1);
+			uint64		buf_state = LockBufHdr(bufHdr);
+			uint32		buf_refcount = BUF_STATE_GET_REFCOUNT(buf_state);
+
+			UnlockBufHdr(bufHdr);
+
+			// When the buffer’s reference count exceeds 1, the exclusive lock
+			// remains unacquired. A SIGALRM signal appears to have been received
+			// unexpectedly, and it is not associated with any active timeout.
+			// The system should wait until either the buffer becomes unlocked
+			// or the anticipated timeout period elapses.
+			if (buf_refcount > 1)
+				continue;
+		}
+
+		break;
 	}
 
 	/*
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 7b10932635a..e15dcd22e8f 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -38,7 +38,7 @@ extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
 extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
 
 extern void ResolveRecoveryConflictWithLock(LOCKTAG locktag, bool logging_conflict);
-extern void ResolveRecoveryConflictWithBufferPin(void);
+extern void ResolveRecoveryConflictWithBufferPin(Buffer buffer);
 extern void CheckRecoveryConflictDeadlock(void);
 extern void StandbyDeadLockHandler(void);
 extern void StandbyTimeoutHandler(void);
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Deadlock detector fails to activate on a hot standby replica
@ 2026-05-20 16:26  Fujii Masao <[email protected]>
  parent: Vitaly Davydov <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Fujii Masao @ 2026-05-20 16:26 UTC (permalink / raw)
  To: Vitaly Davydov <[email protected]>; +Cc: [email protected]

On Fri, Jan 23, 2026 at 8:52 PM Vitaly Davydov <[email protected]> wrote:
>
> Dear Hackers,
>
> I would like to propose a patch that fixes the problem, which has the roots in
> the possibility of spontaneous SIGALRM signals when waiting for some timeouts.
> The idea of the patch - ignore spontaneous SIGALRM signals and continue waiting
> for expected timeouts or buffer unpinning by the conflicting backend. This
> patch is not a final version. I plan to add a tap-test for this case.

Thanks for the patch!


 #include "storage/sinvaladt.h"
 #include "storage/standby.h"
+#include "storage/buf_internals.h"

This include should be placed in alphabetical order.


+ * We assume that only UnpinBuffer() and the timeout requests established
+ * above can wake us up here. WakeupRecovery() called by walreceiver or
+ * SIGHUP signal handler, etc cannot do that because it uses the different
+ * latch from that ProcWaitForSignal() waits on.

As your investigation showed, that assumption does not seem to hold. If so,
I think something like the following would be more accurate:

    ---------------------------------
    ProcWaitForSignal() can also wake up for unrelated reasons, so recheck
    whether we're still the waiter after each wakeup. If we are and no timeout
    fired, continue waiting without resetting the active timeouts.
    ---------------------------------


+ uint32 buf_refcount = BUF_STATE_GET_REFCOUNT(buf_state);
<snip>
+ if (buf_refcount > 1)
+ continue;

Wouldn't it be better to check explicitly whether we're still the waiter,
instead of using BUF_STATE_GET_REFCOUNT()?

    (buf_state & BM_PIN_COUNT_WAITER) != 0 &&
    bufHdr->wait_backend_pgprocno == MyProcNumber


The current control flow in the loop feels a bit hard to follow.
Would something like the following be simpler?

    for (;;)
    {
        ....
        ProcWaitForSignal(...);

        if (!StillWaitingForBufferPin(...))
            break;

        if (got_standby_delay_timeout)
        {
            SendRecoveryConflictWithBufferPin(...);
            break;
        }
        else if (got_standby_deadlock_timeout)
        {
            SendRecoveryConflictWithBufferPin(...);
            break;
        }
    }

Regards,

-- 
Fujii Masao






^ permalink  raw  reply  [nested|flat] 4+ messages in thread

* Re: Deadlock detector fails to activate on a hot standby replica
@ 2026-05-25 14:20  Vitaly Davydov <[email protected]>
  parent: Fujii Masao <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: Vitaly Davydov @ 2026-05-25 14:20 UTC (permalink / raw)
  To: Fujii Masao <[email protected]>; +Cc: [email protected]

Dear Fujii Masao,

Thank you for the review of the patch. Based on your comments I propose
a new version of the patch.

 > +#include "storage/buf_internals.h"
 > This include should be placed in alphabetical order.

I removed the include of the header file. I'm not sure, that it is a good
way to expose buffer internals here. Instead, I implemented a new function
BufferGetRefCount to be used in standby.c.

 > + uint32 buf_refcount = BUF_STATE_GET_REFCOUNT(buf_state);
 > <snip>
 > + if (buf_refcount > 1)
 > + continue;
 >
 > Wouldn't it be better to check explicitly whether we're still the waiter,
 > instead of using BUF_STATE_GET_REFCOUNT()?
 >
 >     (buf_state & BM_PIN_COUNT_WAITER) != 0 &&
 >     bufHdr->wait_backend_pgprocno == MyProcNumber

This is a precondition for the ResolveRecoveryConflictWithBufferPin function
that the current process should be a waiter process. See LockBufferForCleanup
where this state is set before the call of ResolveRecoveryConflictWithBufferPin.
I believe, there is no sense to check the waiter state because it doesn't
change in this function. Instead, I think, it is enough to just check for
refcount. One of the suggestions is to add an assert to check that the current
process is a waiter process, but I'm not sure about it. Let me know please if
I missed anything.

 > The current control flow in the loop feels a bit hard to follow.
 > Would something like the following be simpler?

I reorganized the control flow as well.

I also added a new tap-test as a part of the patch. I did some changes in the
tap test to make it stable. Let me know, please, if it should be in a separate
commit.

With best regards,
Vitaly

Attachments:

  [text/x-patch] v2-0001-Fix-deadlock-detector-activation-in-a-recovery-confl.patch (11.8K, 2-v2-0001-Fix-deadlock-detector-activation-in-a-recovery-confl.patch)
  download | inline diff:
From fc0b3b6df48d7c5d69edf891cd84f8c4848a0e65 Mon Sep 17 00:00:00 2001
From: Vitaly Davydov <[email protected]>
Date: Wed, 20 May 2026 11:48:34 +0300
Subject: [PATCH] Fix deadlock detector activation in a recovery conflict

When the startup process in a deadlock with a backend, it sends the
signal to the backend to trigger the deadlock detector when
the deadlock timeout is elapsed (deadlock_timeout guc). Due to some
optimization in timeout.c, when spontaneous SIGALRM signals are
possible, which doesn't relate to any enabled timeout, the function
ResolveRecoveryConflictWithBufferPin can never send the signal to the
conflicting backend, becase the deadlock timeout will never be
triggered.

The patch fixes ResolveRecoveryConflictWithBufferPin by ignoring
spontaneous SIGALRM signals, that are possible in the current
implementation of timeout.c functionality.
---
 src/backend/storage/buffer/bufmgr.c           |  25 +++-
 src/backend/storage/ipc/standby.c             |  67 +++++----
 src/include/storage/bufmgr.h                  |   1 +
 src/include/storage/standby.h                 |   2 +-
 .../t/053_startup_backend_deadlock.pl         | 127 ++++++++++++++++++
 5 files changed, 194 insertions(+), 28 deletions(-)
 create mode 100644 src/test/recovery/t/053_startup_backend_deadlock.pl

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index cc398db124d..ce5871fbd9b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -4749,6 +4749,29 @@ BufferGetLSNAtomic(Buffer buffer)
 #endif
 }
 
+/*
+ * BufferGetRefCount
+ *		Return the current reference counter.
+ */
+uint32
+BufferGetRefCount(Buffer buffer)
+{
+	BufferDesc *bufHdr;
+	uint64		buf_state;
+	uint32		buf_refcount;
+
+	Assert(BufferIsValid(buffer));
+	Assert(!BufferIsLocal(buffer));
+
+	bufHdr = GetBufferDescriptor(buffer - 1);
+
+	buf_state = LockBufHdr(bufHdr);
+	buf_refcount = BUF_STATE_GET_REFCOUNT(buf_state);
+	UnlockBufHdr(bufHdr);
+
+	return buf_refcount;
+}
+
 /* ---------------------------------------------------------------------
  *		DropRelationBuffers
  *
@@ -6789,7 +6812,7 @@ LockBufferForCleanup(Buffer buffer)
 			/* Publish the bufid that Startup process waits on */
 			SetStartupBufferPinWaitBufId(buffer - 1);
 			/* Set alarm and then wait to be signaled by UnpinBuffer() */
-			ResolveRecoveryConflictWithBufferPin();
+			ResolveRecoveryConflictWithBufferPin(buffer);
 			/* Reset the published bufid */
 			SetStartupBufferPinWaitBufId(-1);
 		}
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index de9092fdf5b..00399eafee3 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -790,13 +790,17 @@ cleanup:
  * Deadlocks are extremely rare, and relatively expensive to check for,
  * so we don't do a deadlock check right away ... only if we have had to wait
  * at least deadlock_timeout.
+ *
+ * The precondition: the current process should be the waiter process.
  */
 void
-ResolveRecoveryConflictWithBufferPin(void)
+ResolveRecoveryConflictWithBufferPin(Buffer buffer)
 {
 	TimestampTz ltime;
 
 	Assert(InHotStandby);
+	Assert(BufferIsValid(buffer));
+	Assert(!BufferIsLocal(buffer));
 
 	ltime = GetStandbyLimitTime();
 
@@ -833,35 +837,46 @@ ResolveRecoveryConflictWithBufferPin(void)
 		enable_timeouts(timeouts, cnt);
 	}
 
-	/*
-	 * Wait to be signaled by UnpinBuffer() or for the wait to be interrupted
-	 * by one of the timeouts established above.
-	 *
-	 * We assume that only UnpinBuffer() and the timeout requests established
-	 * above can wake us up here. WakeupRecovery() called by walreceiver or
-	 * SIGHUP signal handler, etc cannot do that because it uses the different
-	 * latch from that ProcWaitForSignal() waits on.
-	 */
-	ProcWaitForSignal(WAIT_EVENT_BUFFER_CLEANUP);
-
-	if (got_standby_delay_timeout)
-		SendRecoveryConflictWithBufferPin(RECOVERY_CONFLICT_BUFFERPIN);
-	else if (got_standby_deadlock_timeout)
+	for (;;)
 	{
 		/*
-		 * Send out a request for hot-standby backends to check themselves for
-		 * deadlocks.
+		 * Wait to be signaled by UnpinBuffer() or for the wait to be interrupted
+		 * by one of the timeouts established above.
 		 *
-		 * XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
-		 * to be signaled by UnpinBuffer() again and send a request for
-		 * deadlocks check if deadlock_timeout happens. This causes the
-		 * request to continue to be sent every deadlock_timeout until the
-		 * buffer is unpinned or ltime is reached. This would increase the
-		 * workload in the startup process and backends. In practice it may
-		 * not be so harmful because the period that the buffer is kept pinned
-		 * is basically no so long. But we should fix this?
+		 * ProcWaitForSignal() can also wake up for unrelated reasons, so recheck
+		 * that the buffer is pinned by the current waiter process only (reference
+		 * counter should be 1). Continue waiting, if no registered timeout is
+		 * fired or the buffer is still pinned by other processes as well.
 		 */
-		SendRecoveryConflictWithBufferPin(RECOVERY_CONFLICT_BUFFERPIN_DEADLOCK);
+		ProcWaitForSignal(WAIT_EVENT_BUFFER_CLEANUP);
+
+		/*
+		 * Once the reference count is less or equal to 1 and we are the waiter
+		 * process, no one uses the buffer at the moment. There is a chance to
+		 * lock the buffer exclusively.
+		 */
+		if (BufferGetRefCount(buffer) <= 1)
+			break;
+
+		if (got_standby_delay_timeout)
+			SendRecoveryConflictWithBufferPin(RECOVERY_CONFLICT_BUFFERPIN);
+		else if (got_standby_deadlock_timeout)
+		{
+			/*
+			* Send out a request for hot-standby backends to check themselves for
+			* deadlocks.
+			*
+			* XXX The subsequent ResolveRecoveryConflictWithBufferPin() will wait
+			* to be signaled by UnpinBuffer() again and send a request for
+			* deadlocks check if deadlock_timeout happens. This causes the
+			* request to continue to be sent every deadlock_timeout until the
+			* buffer is unpinned or ltime is reached. This would increase the
+			* workload in the startup process and backends. In practice it may
+			* not be so harmful because the period that the buffer is kept pinned
+			* is basically no so long. But we should fix this?
+			*/
+			SendRecoveryConflictWithBufferPin(RECOVERY_CONFLICT_BUFFERPIN_DEADLOCK);
+		}
 	}
 
 	/*
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 6837b35fc6d..b7dcff85a8d 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -313,6 +313,7 @@ extern bool BufferIsPermanent(Buffer buffer);
 extern XLogRecPtr BufferGetLSNAtomic(Buffer buffer);
 extern void BufferGetTag(Buffer buffer, RelFileLocator *rlocator,
 						 ForkNumber *forknum, BlockNumber *blknum);
+uint32		BufferGetRefCount(Buffer buffer);
 
 extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
 
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index 6a314c693cd..c75d87b7ddc 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -79,7 +79,7 @@ extern void ResolveRecoveryConflictWithTablespace(Oid tsid);
 extern void ResolveRecoveryConflictWithDatabase(Oid dbid);
 
 extern void ResolveRecoveryConflictWithLock(LOCKTAG locktag, bool logging_conflict);
-extern void ResolveRecoveryConflictWithBufferPin(void);
+extern void ResolveRecoveryConflictWithBufferPin(Buffer buffer);
 extern void CheckRecoveryConflictDeadlock(void);
 extern void StandbyDeadLockHandler(void);
 extern void StandbyTimeoutHandler(void);
diff --git a/src/test/recovery/t/053_startup_backend_deadlock.pl b/src/test/recovery/t/053_startup_backend_deadlock.pl
new file mode 100644
index 00000000000..5bd24f315e6
--- /dev/null
+++ b/src/test/recovery/t/053_startup_backend_deadlock.pl
@@ -0,0 +1,127 @@
+#!/usr/bin/perl
+#
+# Test a deadlock between a backend and the startup processes when processing
+# XLOG_PRUNE_PAGE wal record. The test is based on 031_recovery_conflict.pl
+# vanilla test.
+#
+
+use strict;
+use warnings;
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Test settings
+my $deadlock_timeout = 3000; # ms
+my $log_startup_progress_interval = 2000; # ms
+my $testdb = "testdb";
+my $backup_name = 'my_backup';
+my $table1 = "table1";
+my $table2 = "table2";
+
+# Set up nodes
+my $node_primary = PostgreSQL::Test::Cluster->new('primary');
+$node_primary->init(allows_streaming => 1);
+
+$node_primary->append_conf('postgresql.conf', qq[
+
+	# for deadlock test
+	max_prepared_transactions = 10
+
+	# wait some to test the wait paths as well, but not long for obvious reasons
+	max_standby_streaming_delay = -1
+
+	# Some of the recovery conflict logging code only gets exercised after
+	# deadlock_timeout. The test doesn't rely on that additional output, but it's
+	# nice to get some minimal coverage of that code.
+	log_recovery_conflict_waits = on
+
+	log_startup_progress_interval = ${log_startup_progress_interval}
+	deadlock_timeout = ${deadlock_timeout}
+	autovacuum = off
+]);
+
+$node_primary->start;
+$node_primary->backup($backup_name);
+
+my $node_standby = PostgreSQL::Test::Cluster->new('standby');
+$node_standby->init_from_backup($node_primary, $backup_name, has_streaming => 1);
+$node_standby->start();
+
+my $log_location = -s $node_standby->logfile;
+
+# use a new database, to trigger database recovery conflict
+$node_primary->safe_psql('postgres', "CREATE DATABASE $testdb");
+#$node_primary->safe_psql('postgres', "CREATE TABLE $table2(a int, b int)");
+
+$node_primary->safe_psql($testdb, qq[
+	CREATE TABLE $table1(a int, b int);
+	CREATE TABLE $table2(a int, b int);
+	INSERT INTO $table1 VALUES (1);
+]);
+
+# Generate a few dead rows, to later be cleaned up by vacuum. Then acquire a
+# lock on another relation in a prepared xact, so it's held continuously by
+# the startup process. The standby psql will block acquiring that lock while
+# holding a pin that vacuum needs, triggering the deadlock.
+$node_primary->safe_psql($testdb, qq[
+	BEGIN;
+	INSERT INTO $table1(a) SELECT generate_series(1, 100) i;
+	ROLLBACK;
+]);
+
+$node_primary->safe_psql($testdb, qq[
+	BEGIN;
+	LOCK TABLE $table2;
+	PREPARE TRANSACTION 'lock';
+	INSERT INTO $table1(a) VALUES (170);
+	SELECT txid_current();
+]);
+
+$node_primary->wait_for_catchup($node_standby, 'replay', $node_primary->lsn('write'));
+
+my $psql_standby = $node_standby->background_psql($testdb, on_error_stop => 0);
+
+$psql_standby->query_until(qr/^1$/m, qq[
+	BEGIN;
+	-- hold pin
+	DECLARE test_recovery_conflict_cursor CURSOR FOR SELECT a FROM $table1;
+	FETCH FORWARD FROM test_recovery_conflict_cursor;
+	-- wait for lock held by prepared transaction
+	SELECT * FROM $table2;
+]);
+
+ok(1, "cursor holding conflicting pin, also waiting for lock, established");
+
+# VACUUM will prune away rows, causing a buffer pin conflict, while standby
+# psql is waiting on lock
+$node_primary->safe_psql($testdb, qq[VACUUM $table1;]);
+
+# Wait and check that the deadlock detector was triggered and found a deadlock
+# in the backend process (not in startup process).
+note("Waiting for deadlock detector to launch...");
+check_conflict_log("User transaction caused buffer deadlock with recovery.");
+
+# clean up for next tests
+$node_primary->safe_psql($testdb, qq[ROLLBACK PREPARED 'lock';]);
+
+# Check that expected number of conflicts show in pg_stat_database. Needs to
+# be tested before database is dropped, for obvious reasons.
+my $nconflicts = $node_standby->safe_psql($testdb, "SELECT conflicts FROM pg_stat_database WHERE datname = '$testdb'");
+note("number of recovery conflicts: $nconflicts");
+
+$psql_standby->quit();
+
+sub check_conflict_log
+{
+	my $message = shift;
+	my $old_log_location = $log_location;
+
+	$log_location = $node_standby->wait_for_log(qr/$message/, $log_location);
+
+	cmp_ok($log_location, '>', $old_log_location,
+		"logfile contains terminated connection due to recovery conflict"
+	);
+}
+
+done_testing();
-- 
2.43.0



^ permalink  raw  reply  [nested|flat] 4+ messages in thread

end of thread, other threads:[~2026-05-25 14:20 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-01-19 12:43 Deadlock detector fails to activate on a hot standby replica Vitaly Davydov <[email protected]>
2026-01-23 11:51 ` Vitaly Davydov <[email protected]>
2026-05-20 16:26   ` Fujii Masao <[email protected]>
2026-05-25 14:20     ` Vitaly Davydov <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox