Date: Fri, 22 Aug 2025 17:30:22 +0200
From: hubert depesz lubaczewski
To: Tom Lane
Cc: Adrian Klaver, PostgreSQL General, Chris Wilson
Subject: Re: Streaming replica hangs periodically for ~ 1 second - how to diagnose/debug
In-Reply-To: <1882312.1755876082@sss.pgh.pa.us>

On Fri, Aug 22, 2025 at 11:21:22AM -0400, Tom Lane wrote:
> hubert depesz lubaczewski writes:
> > I got repeatable case today. It is breaking on its own every
> > ~ 5 minutes.
>
> Interesting. That futex call is presumably caused by interaction
> with some other process within the standby server, and the only
> plausible candidate really is the startup process (which is replaying
> WAL received from the primary). There are cases where WAL replay
> will take locks that can block queries on the standby. Can you
> correlate the delays on the standby server with any DDL events
> occurring on the primary?

Nope. Plus, these cases repeat with a certain regularity, so even if
I missed *some* create table/alter, DDL just isn't going to be happening
every 4-5 minutes.

For example, looking at the logs for the last ~2h, and checking only the
cases where there are 20 or more messages in the same millisecond, I can
see (message count, then timestamp):

    108 14:02:03.149
     25 14:04:01.619
    110 14:05:36.924
     77 14:05:36.925
    108 14:09:28.155
     38 14:13:52.481
     63 14:13:52.482
     73 14:13:52.484
    146 14:18:19.338
     39 14:18:19.339
     24 14:20:01.694
     82 14:23:07.352
     55 14:23:07.353
     37 14:23:07.353
     45 14:27:44.125
    132 14:27:44.126
    109 14:31:41.593
     70 14:31:41.594
     24 14:32:01.205
     21 14:34:01.477
     79 14:35:36.761
    104 14:35:36.762
     22 14:39:49.541
    151 14:39:49.542
     22 14:39:49.543
    112 14:44:15.607
     73 14:44:15.608
     28 14:48:01.256
     50 14:48:25.588
    131 14:48:25.589
    139 14:52:44.391
     74 14:57:02.369
    117 14:57:02.370
     20 15:00:02.008
    137 15:00:43.982
     34 15:00:43.983
     20 15:01:01.110
     22 15:04:21.037
    153 15:04:21.038
     20 15:08:01.136
     31 15:08:55.798
    126 15:08:55.799
     76 15:13:46.654
     83 15:13:46.655
     20 15:17:01.700
    107 15:18:42.112
     72 15:18:42.113
    124 15:23:48.689
     32 15:23:48.690
     25 15:23:48.690
     28 15:24:01.397

So, while there are outliers, I'd say that most of the problems happen
every 3-5 minutes.

depesz
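
A per-millisecond count like the one described here could be produced roughly as follows. This is only a minimal sketch, not the command actually used for the numbers in this message. It assumes every log line carries an HH:MM:SS.mmm timestamp (for example from %m in log_line_prefix); the function names `bursts` and `gaps_seconds`, the threshold default, and the `min_gap` merge window are all hypothetical choices made for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: each log line contains a timestamp with millisecond precision,
# e.g. "2025-08-22 14:02:03.149 UTC ...". Only the time-of-day part is used.
TS_RE = re.compile(r"(\d{2}:\d{2}:\d{2}\.\d{3})")


def bursts(lines, threshold=20):
    """Return sorted (timestamp, count) pairs for milliseconds that have
    at least `threshold` log lines."""
    counts = Counter()
    for line in lines:
        match = TS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return sorted((ts, n) for ts, n in counts.items() if n >= threshold)


def gaps_seconds(timestamps, min_gap=1.0):
    """Return the gaps (in seconds) between consecutive burst timestamps.

    Gaps shorter than `min_gap` are skipped, so bursts that spill over
    into the next millisecond are treated as one event. Timestamps must be
    sorted and from the same day (no midnight rollover handling).
    """
    times = [datetime.strptime(ts, "%H:%M:%S.%f") for ts in timestamps]
    gaps = []
    for prev, cur in zip(times, times[1:]):
        delta = (cur - prev).total_seconds()
        if delta >= min_gap:
            gaps.append(delta)
    return gaps
```

Feeding the burst timestamps into `gaps_seconds` gives the spacing between events, which makes it easier to judge how regular the "every 3-5 minutes" pattern is and to separate it from the smaller outliers.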