Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1up40a-00Fiya-GX for pgsql-general@arkaria.postgresql.org; Thu, 21 Aug 2025 12:03:10 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1up40Z-00G8aR-U1 for pgsql-general@arkaria.postgresql.org; Thu, 21 Aug 2025 12:03:08 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1up40Z-00G8a9-K0 for pgsql-general@lists.postgresql.org; Thu, 21 Aug 2025 12:03:08 +0000 Received: from lana.depesz.com ([88.198.49.178] helo=depesz.com) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1up40Y-0015Yh-0d for pgsql-general@lists.postgresql.org; Thu, 21 Aug 2025 12:03:07 +0000 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=depesz.com; s=20170201; h=In-Reply-To:Content-Transfer-Encoding:Content-Type:MIME-Version :References:Reply-To:Message-ID:Subject:Cc:To:Sender:From:Date:Content-ID: Content-Description; bh=Zbid78+koNpukS+Tdo4/1SOb6lloNDlZaUuP6dxe7O8=; b=g4oCu NZxyLkRzLreei1FGXgJ0Ybwd/11yw5dVpYYgG08eoJDBO4RtzqYPRav+CQ9XATryYGMYDcKewW0qP jY76DjvKOeAuTZslGoknLgCL4R/2r6FFeaN7Nvlb59fUJazxAgF1TaRAyEMY19A2zKPYbZ6YKzJSn mrRHUEMR2IYk=; Received: from depesz by depesz.com with local (Exim 4.96) (envelope-from ) id 1up40U-0032rh-0I; Thu, 21 Aug 2025 14:03:02 +0200 Date: Thu, 21 Aug 2025 14:03:02 +0200 From: hubert depesz lubaczewski Sender: depesz@depesz.com To: Thom Brown Cc: Adrian Klaver , PostgreSQL General Subject: Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug Message-ID: Reply-To: depesz@depesz.com References: <05969854-0d19-4726-ae1b-586659dd443b@aklaver.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Thu, Aug 21, 2025 at 12:41:44PM +0100, Thom Brown wrote: > Ah, yeah I meant transparent hugepage: > cat /sys/kernel/mm/transparent_hugepage/enabled > This should show it being set as "never". Ah. Sorry, couldn't decipher. Yes, it's "never". > > # grep -oP '^2025-08-19 22:09:2\d\.\d+ UTC' postgresql-2025-08-19_220000.csv | uniq -c | grep -C3 -P '^\s*\d\d' > > 2 2025-08-19 22:09:29.084 UTC > > 1 2025-08-19 22:09:29.094 UTC > > 2 2025-08-19 22:09:29.097 UTC > > 70 2025-08-19 22:09:29.109 UTC > > 90 2025-08-19 22:09:29.110 UTC > > 6 2025-08-19 22:09:29.111 UTC > > 1 2025-08-19 22:09:29.153 UTC > > 1 2025-08-19 22:09:29.555 UTC … > > 22:10:54 all 2.41 0.00 0.28 0.22 0.00 0.10 0.00 0.00 0.00 96.99 > > 22:10:59 all 2.83 0.00 0.29 0.19 0.00 0.12 0.00 0.00 0.00 96.57 > > This output looks fine, so it doesn't show anything concerning, so > suggests the issue is somehow on the Postgres side. > > Did you happen to poll pg_stat_activity at the time to see whether you > had lots of IPC waits? I'm wondering whether the storage layer is > freezing up for a moment. So, we get select * from pg_stat_activity, for client backends that are not idle, every 29 seconds. So, 1 second "freeze" is impossible to cathc. Plus - I suspect that if I ran select * from pg_stat_activity while "in freeze", it would also get frozen. Anyway, I have data from 22:09:22 and 22:09:51. In both cases only 4 non-idle backend. 6 of them had NULL in wait_event* one was Client/ClientRead and one was IPC/BgWorkerShutdown. State_change for the IPC/BgWorkerShutdown backend was 2025-08-19 22:09:51.79504+00 so it was well past the moment when the problem struck. Best regards, depesz