Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sl0F1-004Pn1-6V for pgsql-general@arkaria.postgresql.org; Mon, 02 Sep 2024 06:08:44 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sl0Ez-008d3L-Pp for pgsql-general@arkaria.postgresql.org; Mon, 02 Sep 2024 06:08:42 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sl0Ez-008d3C-Cx for pgsql-general@lists.postgresql.org; Mon, 02 Sep 2024 06:08:41 +0000 Received: from mail-lj1-x22d.google.com ([2a00:1450:4864:20::22d]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sl0Ew-000FOZ-SH for pgsql-general@postgresql.org; Mon, 02 Sep 2024 06:08:41 +0000 Received: by mail-lj1-x22d.google.com with SMTP id 38308e7fff4ca-2f4f24b4d7fso8259951fa.2 for ; Sun, 01 Sep 2024 23:08:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1725257317; x=1725862117; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=Osoq4Uc5Aq0GR2ejlEkeW+A2aDiC8FdGMEcMdxz9CpM=; b=GTSIgWBYb1seur44d3bbmo9q3ccgiJYIC03jktZcG4tC5033pfmiZSHeXyiOxWiDxA 1bGjDqat1eiZ/hJTesMcMwSQQu5cRNY9PPnkvp4lIWkTj5gzYdXQRYh3xrKBDmIKGC90 lNT1rkkJc/DIC4GivrMtS+oGcK2RzheUlPzd8bNXzAlWcT1ISvyupBrYe8kooRNHY9qM ND+bMa2fyHtxSG11/6BQ7Ipl9GI4Yl+fBQoJOLsv2pE6lx1s8iioV6FFbFbPShY5XAOX XoA9dbwgLn0nOVRKdt8g6L4KRZgOpWTAYOcE5ZCSjUITJCBfDI3qm61gxagZuyu6mLa1 a2rw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725257317; x=1725862117; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=Osoq4Uc5Aq0GR2ejlEkeW+A2aDiC8FdGMEcMdxz9CpM=; b=O6rX1JHCIPLzWTwcohggshMgVB/XyQxqwkRXyo5TcVjpBXxmsfGjdV0SeKMvYlfTcs f+XX34poh+BHMANLI94FWlJlp5OcFqlKilsxHhxELWuz/M2LYFf9jySddKJaUfWv+Zbd KJ9mxbfWr54Ebd4YK4Z0Vg8XuGc4lhQttFeX1Rp8l6dHoVd9AU9/HrrLVEO7Vcm9Iu5v ZQqkXFeCXVH3OujaBPiQH4xPjttZ1hA2U12yfyhLIPP+PIZYf/XTyz+lo8tBFfTV51ao w4FnMGPhTVE9Jnt6W7L+Y+iLs2sUmpZcMWU76OUxfDRP9llRMp6xjUQ2hzYDgMPo2q7B uxlg== X-Forwarded-Encrypted: i=1; AJvYcCV38BN+1x3oO2Mugd29o1HZ1FrQ6riRIkmNgdmsjBF+04mW68wwkMeHOgkupk/l/H/8b05dXPecAle0sUU2@postgresql.org X-Gm-Message-State: AOJu0Yywnb19nzQW2DG4Gjw8Dn7Qj5ERb5c/xGmLD8+1/wk1jj2yjrXk ucR+cTTBdLxednGi4uaM4VMZ1799hPi753YfLEBK0W3MhXoSrjAoGoysCofJuYOgpP9TUaBhDhM yiJhrB9g7JiqeYTckPFKLS4K47cU/oTgH X-Google-Smtp-Source: AGHT+IHSYVR38OBrr3/034b8SrvdulsIwT59ELe3YtrAvFbylMiPHdQ3m/7BQE267eS1qHy4DW1Gv/5QK3u0X5n7kHQ= X-Received: by 2002:a05:6512:3095:b0:52e:fb8b:32d0 with SMTP id 2adb3069b0e04-53546afa27cmr3337462e87.1.1725257316455; Sun, 01 Sep 2024 23:08:36 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Muhammad Ikram Date: Mon, 2 Sep 2024 11:08:20 +0500 Message-ID: Subject: Re: Postgres Logical Replication - how to see what subscriber is doing with received data? To: Shaheed Haque Cc: Michael Jaskiewicz , "pgsql-general@postgresql.org" Content-Type: multipart/alternative; boundary="0000000000007e5a3706211cc5a1" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000007e5a3706211cc5a1 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi Shaheed, Maybe these considerations could help you or give any hint to the problem ? Check if wal_receiver_timeout being set to 0 could potentially cause issues, like not detecting network issues quickly enough. Consider re-evaluating this setting if you see connection issues. If you notice that some data is missing on subscriber then could you increase max_slot_wal_keep_size on publisher so that WALs are not deleted until they are applied on subscriber. Do you have flexibility to increase max_worker_processes and max_logical_replication_workers, work_mem and maintenance_work_mem on subscriber (In case bottleneck exists on subscriber) If there's significant lag, consider whether it might be more efficient to drop the subscription and re-initialize it from scratch using a new base backup, depending on the data volume and how long it might take for the existing replication to catch up. Regards, Muhammad Ikram On Sun, Sep 1, 2024 at 9:22=E2=80=AFPM Shaheed Haque wrote: > Since nobody more knowledgeable has replied... > > I'm very interested in this area and still surprised that there is no > official/convenient/standard way to approach this (see > https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXC= N9KjZOvuTNJaAAC_hg%40mail.gmail.com > ). > > Based partly on that thread, I ended up with a script that connects to > both ends of the replication, and basically loops while comparing the > counts in each table. > > On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, > wrote: > >> I've got two Postgres 13 databases on AWS RDS. >> >> - One is a master, the other a slave using logical replication. >> - Replication has fallen behind by about 350Gb. >> - The slave was maxed out in terms of CPU for the past four days >> because of some jobs that were ongoing so I'm not sure what logical >> replication was able to replicate during that time. >> - I killed those jobs and now CPU on the master and slave are both >> low. >> - I look at the subscriber via `select * from pg_stat_subscription;` >> and see that latest_end_lsn is advancing albeit very slowly. >> - The publisher says write/flush/replay lags are all 13 minutes >> behind but it's been like that for most of the day. >> - I see no errors in the logs on either the publisher or subscriber >> outside of some simple SQL errors that users have been making. >> - CloudWatch reports low CPU utilization, low I/O, and low network. >> >> >> >> Is there anything I can do here? Previously I set wal_receiver_timeout >> timeout to 0 because I had replication issues, and that helped things. I >> wish I had *some* visibility here to get any kind of confidence that >> it's going to pull through, but other than these lsn values and database >> logs, I'm not sure what to check. >> >> >> >> Sincerely, >> >> mj >> > --=20 Muhammad Ikram --0000000000007e5a3706211cc5a1 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi Shaheed,

Maybe these considerations = could help you or give any hint to the problem ?


Check if wal_re= ceiver_timeout being set to 0 could potentially cause issues, like not dete= cting network issues quickly enough. Consider re-evaluating this setting if= you see connection issues.

If you notice that some data is missing = on subscriber then could you increase max_slot_wal_keep_size on publisher s= o that WALs are not deleted until they are applied on subscriber.

Do= you have flexibility to increase max_worker_processes and max_logical_repl= ication_workers, work_mem and maintenance_work_mem on subscriber (In case b= ottleneck exists on subscriber)

If there's significant lag, cons= ider whether it might be more efficient to drop the subscription and re-ini= tialize it from scratch using a new base backup, depending on the data volu= me and how long it might take for the existing replication to catch up.
=

=C2=A0Regards,
Muhammad Ikram


On = Sun, Sep 1, 2024 at 9:22=E2=80=AFPM Shaheed Haque <shaheedhaque@gmail.com> wrote:

Since nobody more= knowledgeable has replied...

I'm very interested in this area and still surprised tha= t there is no official/convenient/standard way to approach this (see https://www.postg= resql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%= 40mail.gmail.com).

Based partly on that thread, I ended up with a script that c= onnects to both ends of the replication, and basically loops while comparin= g the counts in each table.


On Fri= , 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiewicz@ghx.com> wrote:

I've got two Postgres 13 databases on AWS RDS.

  • One is a master, the other a slave using logical re= plication.
  • Replication has fallen= behind by about 350Gb.
  • The slave= was maxed out in terms of CPU for the past four days because of some jobs = that were ongoing so I'm not sure what logical replication was able to = replicate during that time.
  • I kil= led those jobs and now CPU on the master and slave are both low.<= /u>
  • I look at the subscriber via `select * from= pg_stat_subscription;` and see that latest_end_lsn is advancing albeit ver= y slowly.
  • The publisher says writ= e/flush/replay lags are all 13 minutes behind but it's been like that f= or most of the day.
  • I see no erro= rs in the logs on either the publisher or subscriber outside of some simple= SQL errors that users have been making.
  • CloudWatch reports low CPU utilization, low I/O, and low network.

=C2=A0

Is there anything I can do here? Previously I set wa= l_receiver_timeout timeout to 0 because I had replication issues, and that = helped things. I wish I had some visibility here to get any kind of confidence that it's goi= ng to pull through, but other than these lsn values and database logs, I= 9;m not sure what to check.

=C2=A0

Sincerely,

mj



--
Muhammad Ikram

--0000000000007e5a3706211cc5a1--