Date: Tue, 19 Aug 2025 19:21:47 +0200
From: hubert depesz lubaczewski <depesz@depesz.com>
Sender: depesz@depesz.com
To: PostgreSQL General <pgsql-general@lists.postgresql.org>
Subject: Streaming replica hangs periodically for ~ 1 second - how to
 diagnose/debug
Message-ID: <aKSyq74zIuLS-2B5@depesz.com>
Reply-To: depesz@depesz.com
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Archived-At: <https://www.postgresql.org/message-id/aKSyq74zIuLS-2B5%40depesz.com>
Precedence: bulk

Hi,
we have following situation: Pg 14.17 (yes, I know, but it can't be
upgraded now/soon), on Ubuntu focal, in AWS cloud on EC2 server using
arm64 architecture.

All works, is fine.

Every now and then (usually every 3-5 minutes, but not through the whole
day), we see situations where every query suddently takes ~ 1 second.

I'm talkign about "prepare" for preparing statements. binds. also
"discard all".

We have logging enabled to csv log, which offers milisecond precicision.
For ~ 1 second there are no logs going to log (we usually have at 5-20
messages logged per second), no connection, nothing. And then we get
bunch (30+) messages with the same milisecond time.

And they all have insane durations - 800-1300ms for virtually anything.

After such second, everything works fine, without any problems.

Up to next case.

Unfortunately due to short duration of such things, and the fact that
literally *everything* is paused for this 1 second, it's hard to
debug/diagnose.

Servers have memory that is almost 2x total db size (200gb vs. 384gb of
ram), so disk shouldn't be an issue.

Aside from this, we don't see any other problems.

Any idea how to look at it, what to look for, to be able to diagnose the
issue?

We do use some savepoints, but it's hard to tell when exactly they
happen (we usually log only queries that take more than 250ms, and
queries that use savepoints are usually much faster).

The DB server in question has ~ 150 connections, and handles, at this
time, we had ~ 40-50 ktps.

Logging is set using:

               name                │            setting
═══════════════════════════════════╪════════════════════════════════
 log_autovacuum_min_duration       │ 0
 log_checkpoints                   │ on
 log_connections                   │ on
 log_destination                   │ csvlog
 log_directory                     │ /cache/postgres_logs
 log_disconnections                │ off
 log_duration                      │ off
 log_error_verbosity               │ default
 log_executor_stats                │ off
 log_file_mode                     │ 0600
 log_filename                      │ postgresql-%Y-%m-%d_%H%M%S.log
 log_hostname                      │ off
 log_line_prefix                   │ db=%d,user=%u
 log_lock_waits                    │ on
 log_min_duration_sample           │ 0
 log_min_duration_statement        │ 0
 log_min_error_statement           │ error
 log_min_messages                  │ warning
 log_parameter_max_length          │ -1
 log_parameter_max_length_on_error │ 0
 log_parser_stats                  │ off
 log_planner_stats                 │ off
 log_recovery_conflict_waits       │ off
 log_replication_commands          │ off
 log_rotation_age                  │ 60
 log_rotation_size                 │ 51200
 log_statement                     │ none
 log_statement_sample_rate         │ 0.0001
 log_statement_stats               │ off
 log_temp_files                    │ 0
 log_timezone                      │ UTC
 log_transaction_sample_rate       │ 0
 log_truncate_on_rotation          │ off
 logging_collector                 │ on

and it doesn't seem to show anything interesting. Checkpoints happen,
but they don't seem correlated in any way.

Any ideas?

Best regards,

depesz