Re: Postgresql 16.9 fast shutdown hangs with walsenders eating 100% CPU


From: Laurenz Albe <[email protected]>
To: Klaus Darilion <[email protected]>
To: [email protected] <[email protected]>
Subject: Re: Postgresql 16.9 fast shutdown hangs with walsenders eating 100% CPU
Date: Mon, 21 Jul 2025 15:35:11 +0200
Message-ID: <[email protected]>
In-Reply-To: <DBAPR03MB6358854AD71C8ABA5CA10A8DF15DA@DBAPR03MB6358.eurprd03.prod.outlook.com>
References: <DBAPR03MB6358854AD71C8ABA5CA10A8DF15DA@DBAPR03MB6358.eurprd03.prod.outlook.com>

On Mon, 2025-07-21 at 10:47 +0000, Klaus Darilion wrote:
> (Note: I have also attached the whole email for better readability of the logs)

Your mail looks good enough the way it is:
https://postgr.es/m/DBAPR03MB6358854AD71C8ABA5CA10A8DF15DA%40DBAPR03MB6358.eurprd03.prod.outlook.com

> Our setup: 5 Node Patroni Cluster with PostgreSQL 16.9.
> db1: current leader
> db2: sync-replica
> db3/4/5: replica
>  
> The replicas connect to the leader using the host IP of the leader. So there are
> 4 walsenders for patroni, 1 sync and 3 async.
>  
> The patroni cluster utilizes a service IP-address (VIP). The VIP is used by all
> clients connecting to the current leader. These clients are:
> - some web-apps doing normal DB queries (read/write)
> - 2 barman backup clients using streaming replication
> - 58 logical replication clients
>  
> Additionally we use https://github.com/EnterpriseDB/pg_failover_slots to sync and
> advance the logical replication slots on the replicas. The failover_slots plugin
> periodically connects to the leader using the VIP.
>  
> We had a planned maintenance and wanted to switch the leader from db1 to db2:
> 12:32:18: patronictl switchover --leader db1 --candidate db2
>  
> So postmaster received the fast shutdown request from Patroni and started
> shutting down the client connection processes:
>  
> Usually the switchover only takes a few seconds. After waiting a few minutes
> we became anxious and started debugging.
>  
> Using "ps -Alf|grep postgres" we saw that there were no more normal client
> connections, but still 58 logical replication walsender processes and
> 6 streaming replication walsenders.
> "top" revealed that the walsenders were eating CPU.

We have had a somewhat similar report:
https://www.postgresql.org/message-id/flat/18985-64431d78bcabae95%40postgresql.org

What is the logical decoding plugin you are using?

If it is "pgoutput", what are the walsenders doing? You can try "strace" and
use "gdb" to break into the walsenders and take a stack trace.
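
A minimal sketch of how one might collect the walsender PIDs and build the
"strace" and "gdb" invocations for each of them. It assumes that the process
title contains "postgres:" and "walsender" (the default when
update_process_title is on) and that it is fed the output of
"ps -eo pid,args". The function names are made up for illustration.

```python
def walsender_pids(ps_output: str) -> list[int]:
    """Extract the PIDs of walsender processes from `ps -eo pid,args` output."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip the header line and anything malformed
        pid, args = parts
        # Assumption: the title looks like "postgres: walsender ..." or
        # "postgres: <cluster_name>: walsender ..." when cluster_name is set.
        if "postgres:" in args and "walsender" in args:
            pids.append(int(pid))
    return pids


def debug_commands(pid: int) -> list[str]:
    """Build the commands to inspect one walsender; they are not executed here."""
    return [
        f"strace -p {pid}",             # shows system calls; stop with Ctrl-C
        f"gdb -p {pid} -batch -ex bt",  # attach, print a backtrace, detach
    ]


def main() -> None:
    """Print the commands for every walsender found on this machine."""
    import subprocess

    ps = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    )
    for pid in walsender_pids(ps.stdout):
        for command in debug_commands(pid):
            print(command)
```

Calling main() on the leader prints the commands, which you can then run as
root or as the operating system user that owns the PostgreSQL processes. The
backtrace is only readable if the debug symbols for PostgreSQL are installed.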

Yours,
Laurenz Albe
