Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uoQ1u-003cKr-RY for pgsql-general@arkaria.postgresql.org; Tue, 19 Aug 2025 17:21:52 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uoQ1u-00FAs5-9Z for pgsql-general@arkaria.postgresql.org; Tue, 19 Aug 2025 17:21:50 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uoQ1t-00FArw-TR for pgsql-general@lists.postgresql.org; Tue, 19 Aug 2025 17:21:50 +0000 Received: from lana.depesz.com ([88.198.49.178] helo=depesz.com) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1uoQ1s-000khY-1M for pgsql-general@lists.postgresql.org; Tue, 19 Aug 2025 17:21:50 +0000 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=depesz.com; s=20170201; h=Content-Transfer-Encoding:Content-Type:MIME-Version:Reply-To: Message-ID:Subject:To:Sender:From:Date:Cc:Content-ID:Content-Description: In-Reply-To:References; bh=y8acjYWIB3iDgIVBxOZa8djPdbQ0rYBXVedM7Odncf8=; b=bt dmG7nypFdrybgil2U36v8DgFbs5XxpF6EH5zHG4pL4O2T8qBSxTJJCXbsG9/nsfPm/l/RaDkoV9ul FgNt+OTvnLtIh8cD3gbjR1sP3h+6YE2FMKKgag10QzmwgCt8d+B8NDwH4iHqwXoJrEyqAt9gYV+0o p8tsuQ6ZYZVpsJ4=; Received: from depesz by depesz.com with local (Exim 4.96) (envelope-from ) id 1uoQ1r-006tm7-08 for pgsql-general@lists.postgresql.org; Tue, 19 Aug 2025 19:21:47 +0200 Date: Tue, 19 Aug 2025 19:21:47 +0200 From: hubert depesz lubaczewski Sender: depesz@depesz.com To: PostgreSQL General Subject: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug Message-ID: Reply-To: depesz@depesz.com MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, we have following situation: Pg 14.17 (yes, I know, but it can't be upgraded now/soon), on Ubuntu focal, in AWS cloud on EC2 server using arm64 architecture. All works, is fine. Every now and then (usually every 3-5 minutes, but not through the whole day), we see situations where every query suddently takes ~ 1 second. I'm talkign about "prepare" for preparing statements. binds. also "discard all". We have logging enabled to csv log, which offers milisecond precicision. For ~ 1 second there are no logs going to log (we usually have at 5-20 messages logged per second), no connection, nothing. And then we get bunch (30+) messages with the same milisecond time. And they all have insane durations - 800-1300ms for virtually anything. After such second, everything works fine, without any problems. Up to next case. Unfortunately due to short duration of such things, and the fact that literally *everything* is paused for this 1 second, it's hard to debug/diagnose. Servers have memory that is almost 2x total db size (200gb vs. 384gb of ram), so disk shouldn't be an issue. Aside from this, we don't see any other problems. Any idea how to look at it, what to look for, to be able to diagnose the issue? We do use some savepoints, but it's hard to tell when exactly they happen (we usually log only queries that take more than 250ms, and queries that use savepoints are usually much faster). The DB server in question has ~ 150 connections, and handles, at this time, we had ~ 40-50 ktps. Logging is set using: name │ setting ═══════════════════════════════════╪════════════════════════════════ log_autovacuum_min_duration │ 0 log_checkpoints │ on log_connections │ on log_destination │ csvlog log_directory │ /cache/postgres_logs log_disconnections │ off log_duration │ off log_error_verbosity │ default log_executor_stats │ off log_file_mode │ 0600 log_filename │ postgresql-%Y-%m-%d_%H%M%S.log log_hostname │ off log_line_prefix │ db=%d,user=%u log_lock_waits │ on log_min_duration_sample │ 0 log_min_duration_statement │ 0 log_min_error_statement │ error log_min_messages │ warning log_parameter_max_length │ -1 log_parameter_max_length_on_error │ 0 log_parser_stats │ off log_planner_stats │ off log_recovery_conflict_waits │ off log_replication_commands │ off log_rotation_age │ 60 log_rotation_size │ 51200 log_statement │ none log_statement_sample_rate │ 0.0001 log_statement_stats │ off log_temp_files │ 0 log_timezone │ UTC log_transaction_sample_rate │ 0 log_truncate_on_rotation │ off logging_collector │ on and it doesn't seem to show anything interesting. Checkpoints happen, but they don't seem correlated in any way. Any ideas? Best regards, depesz