Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sknLM-002vkb-9G for pgsql-general@arkaria.postgresql.org; Sun, 01 Sep 2024 16:22:25 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sknLK-00Gf4T-Rk for pgsql-general@arkaria.postgresql.org; Sun, 01 Sep 2024 16:22:23 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sknLK-00Gf4E-CN for pgsql-general@lists.postgresql.org; Sun, 01 Sep 2024 16:22:22 +0000 Received: from mail-ua1-x92f.google.com ([2607:f8b0:4864:20::92f]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sknLC-0009X2-BS for pgsql-general@postgresql.org; Sun, 01 Sep 2024 16:22:19 +0000 Received: by mail-ua1-x92f.google.com with SMTP id a1e0cc1a2514c-846bdc20098so408521241.3 for ; Sun, 01 Sep 2024 09:22:14 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1725207733; x=1725812533; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=N6SazDCsVZONEvymBEG8+SmHtxIi8LA86N5M/tpjor0=; b=njGlGtSY/L6KWbjj7LEv+buoFa0uKHueMEx/lUcIuCcl7wuLkikW1jrMfbMpSlhGQr etAyvjo7gMejv96W/VoZJ6D2+I70PxbMMA/iMpycnA7X6s6uNXG2cu1E77XG3+tHl/dI WuFTqOQEeRrWceSMMqkq44Jbj5b/H/g7Io2e4X6+ddQkA9/imsjXxC//S4WI1JWOP+an lFClsV6//Gua4uyjsdGJw5V6KDsYKXDHxceGAaQtsxFguNjjpmLEFaMc1oFPISCaTd0O YH/qr9WRNmjeS7xbpxMAeWXXMq4amxSoeb4mW0qC4VxqKbJjE9sqbh1PISp/Sen2czNZ p1Aw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725207733; x=1725812533; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=N6SazDCsVZONEvymBEG8+SmHtxIi8LA86N5M/tpjor0=; b=bxdPPxvBWsYbr8aoT2sRi9Yx6HqOLNMfbCnLanXGfKAy7iAMl/gyM3G7zCMJsjeQOF iOAwvBHTxSlWfQdGyP900QvhJhrSq80FZL9zRdi12gj0HYfHn1O+8LAOXgSXi/JKYyPK rWFPKaEbRJt+ZylvYGat08WO4ogGKY24Sjika/4qD6CYjz6JLuHlf2qKfW+lLx3pR5rj 8LSYtOdTPV/ofeUHfJR+tS7PZJEVyaJwvZlZyg40Yv3S1cjU0qs6IZPx7wfF2AsNubF7 DRuQKDyztcnthzAa/XCfNSgVKymdOOrsE0dytknXOGDs35VJlp4VbgG4Y4tnRU7BKbb+ Dcbw== X-Gm-Message-State: AOJu0YxXjFd6jitmon7YwhE/bpQ8RmpoNq54Vrrfa8L4w7IqYn9xnZJd qN/OlKF30MCLDHS7mGPGeY0pcuAmBZuPVqowxO6RJU8LJX3NHjqf/4lfFI9IVL8G4BY64iNf6Da kwPNF8Aw223Mkr0ez9wXP5BDTil1QKQ== X-Google-Smtp-Source: AGHT+IEKfcDZjqQwdvcEBELfkbimDtrnNiJsOe6bekpRNM32dVHekS5YdlsogjqYtzsxmfSBSOeeb9x+9Un1sjFiBAk= X-Received: by 2002:a05:6122:1787:b0:4f5:2b7f:f1a7 with SMTP id 71dfb90a1353d-5009ad5c534mr6187395e0c.13.1725207732576; Sun, 01 Sep 2024 09:22:12 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Shaheed Haque Date: Sun, 1 Sep 2024 17:22:01 +0100 Message-ID: Subject: Re: Postgres Logical Replication - how to see what subscriber is doing with received data? To: Michael Jaskiewicz Cc: "pgsql-general@postgresql.org" Content-Type: multipart/alternative; boundary="000000000000106d370621113ac9" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000106d370621113ac9 Content-Type: text/plain; charset="UTF-8" Since nobody more knowledgeable has replied... I'm very interested in this area and still surprised that there is no official/convenient/standard way to approach this (see https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com ). Based partly on that thread, I ended up with a script that connects to both ends of the replication, and basically loops while comparing the counts in each table. On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, wrote: > I've got two Postgres 13 databases on AWS RDS. > > - One is a master, the other a slave using logical replication. > - Replication has fallen behind by about 350Gb. > - The slave was maxed out in terms of CPU for the past four days > because of some jobs that were ongoing so I'm not sure what logical > replication was able to replicate during that time. > - I killed those jobs and now CPU on the master and slave are both low. > - I look at the subscriber via `select * from pg_stat_subscription;` > and see that latest_end_lsn is advancing albeit very slowly. > - The publisher says write/flush/replay lags are all 13 minutes behind > but it's been like that for most of the day. > - I see no errors in the logs on either the publisher or subscriber > outside of some simple SQL errors that users have been making. > - CloudWatch reports low CPU utilization, low I/O, and low network. > > > > Is there anything I can do here? Previously I set wal_receiver_timeout > timeout to 0 because I had replication issues, and that helped things. I > wish I had *some* visibility here to get any kind of confidence that it's > going to pull through, but other than these lsn values and database logs, > I'm not sure what to check. > > > > Sincerely, > > mj > --000000000000106d370621113ac9 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Since nobody more knowledgeable has replied...

I'm very interested in this area and still surprised tha= t there is no official/convenient/standard way to approach this (see https://www.postgresql.org/message-= id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com).

Based partly on that thread, I ended up with a script that c= onnects to both ends of the replication, and basically loops while comparin= g the counts in each table.


I've got two Postgres 13 databases on AWS RDS.

  • One is a master, the other a slave using logical re= plication.
  • Replication has fallen= behind by about 350Gb.
  • The slave= was maxed out in terms of CPU for the past four days because of some jobs = that were ongoing so I'm not sure what logical replication was able to = replicate during that time.
  • I kil= led those jobs and now CPU on the master and slave are both low.<= /u>
  • I look at the subscriber via `select * from= pg_stat_subscription;` and see that latest_end_lsn is advancing albeit ver= y slowly.
  • The publisher says writ= e/flush/replay lags are all 13 minutes behind but it's been like that f= or most of the day.
  • I see no erro= rs in the logs on either the publisher or subscriber outside of some simple= SQL errors that users have been making.
  • CloudWatch reports low CPU utilization, low I/O, and low network.

=C2=A0

Is there anything I can do here? Previously I set wa= l_receiver_timeout timeout to 0 because I had replication issues, and that = helped things. I wish I had some visibility here to get any kind of confidence that it's goi= ng to pull through, but other than these lsn values and database logs, I= 9;m not sure what to check.

=C2=A0

Sincerely,

mj

--000000000000106d370621113ac9--