Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1svFYv-00DWrX-RY for pgsql-admin@arkaria.postgresql.org; Mon, 30 Sep 2024 12:31:38 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1svFYu-00HRDc-Kb for pgsql-admin@arkaria.postgresql.org; Mon, 30 Sep 2024 12:31:36 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1stx2X-00FPmU-CJ for pgsql-admin@lists.postgresql.org; Thu, 26 Sep 2024 22:32:49 +0000 Received: from mail-yw1-x1133.google.com ([2607:f8b0:4864:20::1133]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1stx2U-001Dt9-EG for pgsql-admin@lists.postgresql.org; Thu, 26 Sep 2024 22:32:47 +0000 Received: by mail-yw1-x1133.google.com with SMTP id 00721157ae682-68518bc1407so17536407b3.2 for ; Thu, 26 Sep 2024 15:32:46 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=datadoghq.com; s=google; t=1727389964; x=1727994764; darn=lists.postgresql.org; h=to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=e8cU/sbmWX1BFRDhAEpN439EWh7VwjiaydKNyeReeRw=; b=THdZ9MVhDn3IDl6Kd/MCLaqzqN08rANx3PZr1dJLffoWJVRWw+B6WHNUXgCiw+gWC8 BzbTWhlOMC3barX8u2nSIwGsw5tpo5DZmGFqA7AAN22lhYXJ9zZyrxRmB41QHqCvf/kY af4iP4K7EtBPaORy4b48c2sQtsg3LNRkBOuVA= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1727389964; x=1727994764; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=e8cU/sbmWX1BFRDhAEpN439EWh7VwjiaydKNyeReeRw=; b=OPw8HwX3sOIqtdYqFprzDXZrXs2EMIRNMIPQ8UaBFW/gxq07IVqbepDlVj/09LIfm8 Wdm1W7JRW1RXImWN6gMYu2vzexHNJF/i1cuNfuFEgLAduVyN3xWJdQ4pc3gtwsjkeQ9P CswDOJKmz/KiENzKvoOCsTp9uXSweNo6bI2jwsV0A89Lcqg0Mj64XngJCKOLfrjFK5El MXSjZKdx3qQ1ZS9mcvBXEKCR1TvuzXCoMOCdp0lQ2i5F2TGWKbmVWAmdFbIWpgQ2z8N1 ibVD5SzdrKo43XLVXskDy3+dedJWrXmt5JqCukERuya3tfsD6pVF7P57Vx5akSwRFSyD JXLg== X-Gm-Message-State: AOJu0YxvRkl5I/sv4QqMZ9cfqwPpzKzeHMF0VJhL/SIzIGU5qc0lGBLD 3O2N9KZnjHNEUeRmfUS2YaTMj3IoXi8lqQr39aKIa9wM46htif5vvaYd/cxy5BvhjPPmWkgdTBc N4xgWdZHPS6R5IUBwUfA7xaeNmEumYCcXHmqhPb7a8tBD+eBSKmM= X-Google-Smtp-Source: AGHT+IG1CLCYYbmS8bnDWE7fSks3l6eHXrSxlyV29sClyWObHCrvGRsPVBC/us8ttHN2ALcHlRiMi2pUmr2ruQLb+TI= X-Received: by 2002:a05:690c:6612:b0:6b3:a6ff:768e with SMTP id 00721157ae682-6e2473a90c5mr14006327b3.0.1727389964066; Thu, 26 Sep 2024 15:32:44 -0700 (PDT) MIME-Version: 1.0 From: Alex Adriaanse Date: Thu, 26 Sep 2024 17:32:33 -0500 Message-ID: Subject: Long running vacuum and logical replication delay To: pgsql-admin@lists.postgresql.org Content-Type: text/plain; charset="UTF-8" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi pgsql-admins, We observed an instance of growing replication delay (measured using pg_current_wal_lsn() - confirmed_flush_lsn) for a logical replication slot used by Debezium that strongly correlated with autovacuum activity. Logical replication delay started growing after a burst in deletes, performed in batches/transactions of 1,000 rows over a span of 40 minutes. This delete resulted in 68 million dead rows, representing 20% of total rows in the given table, which triggered an autovacuum. After the deletes finished, logical replication delay continued to grow from normal write activity as long as autovacuum was running. During this time WALs grew to 68 GiB. When autovacuum completed almost 3 hours later, logical replication delay immediately started going down, and there was an increase in network traffic to Debezium as replication started catching up. We did not see CPU or I/O saturation during the time that logical replication struggled to keep up. Would this be an indicator to increase logical_decoding_work_mem? If not, what symptoms would indicate a need to increase this parameter? Also, what could cause this correlated behavior? Brainstorming ideas on potential causes: - Does the WAL activity generated by vacuums slow down walsender or logical decoding enough to cause this issue? - Do autovacuum and logical replication share resources? - Does a running vacuum block logical replication altogether for a given table? - Does the presence of dead rows slow down logical replication? - Could the large number of accumulated WAL files slow down logical replication (e.g. due to syscalls taking longer to return)? - Could causality be reversed: instead of autovacuum causing replication delay, did replication delay from the large delete cause autovacuum to not finish until the delete had fully replicated? Thanks, Alex