Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sl2gv-004hmv-Aw for pgsql-general@arkaria.postgresql.org; Mon, 02 Sep 2024 08:45:42 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sl2gt-00AWGU-FF for pgsql-general@arkaria.postgresql.org; Mon, 02 Sep 2024 08:45:39 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sl2gs-00AWFx-QS for pgsql-general@lists.postgresql.org; Mon, 02 Sep 2024 08:45:39 +0000 Received: from mail-lj1-x22d.google.com ([2a00:1450:4864:20::22d]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1sl2gp-000GL7-Vr for pgsql-general@postgresql.org; Mon, 02 Sep 2024 08:45:37 +0000 Received: by mail-lj1-x22d.google.com with SMTP id 38308e7fff4ca-2f3f922ad1eso2574221fa.3 for ; Mon, 02 Sep 2024 01:45:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1725266734; x=1725871534; darn=postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=zwhuUStzHef5hABVxMnLRziA/NC0xizLdruKYMCYd3A=; b=XjR6/WiraRrfyaYYDk2k0f9r9ctAreQ9VQIPWyT/1+6lrBQQasx35DNXMXpNdu3NQp SOJ9a/fjUoEdcKK1miH5Lz6Ayw1vQczpgxKa23SoIN+Uq8c7kCRYJvEB9QcgKtF3N3D4 WExpT5lpESUpx5G3ITcbHyCEar4G5fZE/hFUuwawkqZ9g6e6FKLO9AihsIA5jQ0Xf6yy dcNm5nagnBLiCjv8emsrkv73USYgF/OVLbhrJIU9RfRMpQi8RilluxX0wdcBCOr5WexJ Gf4Fdwbzl+uQS5jUK61ENzxfroHPEPu5CUrInbV73pMyzKBWBRw1LCgWQkCU9HV4mp+8 aozg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725266734; x=1725871534; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=zwhuUStzHef5hABVxMnLRziA/NC0xizLdruKYMCYd3A=; b=wdWht+tWp34eyRM1/jIKXSHpbRpiyL7q2Zly+9h8wiP4f8rdfVrdwSH4qLPb/EdgU7 2+WeyKzv27JddsiKkwJUTHMR/wepDce2MNrEczhc9UJd2nCLRd+X4qq+cTBc9ntDB3bv 6FLwCapqCZMpNb27YDTP77c9Bjc5HJEgMbpd8MgCxhWxvkNCNRJbrfNKbpxuKa03H0cy 4+9rIzjEURO1eAUz7nO/9FfIubNtIT+jh00qcfcgu6/h+wV4fy8P5vc15chxOyv+pCs2 hrl2fUTu7l0rFrUn4PzQoA8Emdv16JuXGHQjzAhoSkFs9nOoXYVcmI7TXmqbA1xkJZye 5L3g== X-Forwarded-Encrypted: i=1; AJvYcCUWUdEiReioDBkutHJkUoCnUyUmCblruUfr7CEPrfum2olwXHqzp3A7du2QtktkT9RnHe0YXt48o0JEqfOn@postgresql.org X-Gm-Message-State: AOJu0YzMP2lSeQVLcPpfp/OGO2L6PL10FQ91cbixOYCsuTdnj6j7RBRk Y+d7uYWN42Fg6TAJYzUncYCZA1+Y31J9uxMPqAbibl+AUGRFGAB+uwm6v3dBPrT3GuXcTUsAJez jjuQm8tuZ9bF2Dfmj3aIXsDWLw5PzsKeq X-Google-Smtp-Source: AGHT+IFP8Rd8HBqOyXPvDzlaJ0Q6SOzwi77gyCqzQzg0WuKacaYozspMZiSjNqL/2GQPanrBHVt4P2JjlEw9bf47Dz4= X-Received: by 2002:a2e:a99c:0:b0:2f4:f3e7:2a36 with SMTP id 38308e7fff4ca-2f61e032809mr38712311fa.3.1725266733383; Mon, 02 Sep 2024 01:45:33 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Muhammad Ikram Date: Mon, 2 Sep 2024 13:45:16 +0500 Message-ID: Subject: Re: Postgres Logical Replication - how to see what subscriber is doing with received data? To: Shaheed Haque Cc: Michael Jaskiewicz , pgsql-general@postgresql.org Content-Type: multipart/alternative; boundary="000000000000c94a7f06211ef6aa" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000c94a7f06211ef6aa Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hi Shaheed, I think you must have already analyzed the outcome of queries on pg_replication_slots, pg_current_wal_lsn(), pg_stat_subscription etc. I could find a query SELECT pg_size_pretty(pg_wal_lsn_diff('', '')); As a side note if you want to see what has been applied to subscribers vs what exists on publisher then here is something from my previous experience. We used to have a Data Validation tool for checking tables/rows across publisher/subscriber. We also used pg_dump for another tool that was meant for making copies of schemas. Regards, Muhammad Ikram On Mon, Sep 2, 2024 at 12:42=E2=80=AFPM Shaheed Haque wrote: > Hi Muhammad, > > On Mon, 2 Sep 2024, 07:08 Muhammad Ikram, wrote: > >> Hi Shaheed, >> >> Maybe these considerations could help you or give any hint to the proble= m >> ? >> >> >> Check if wal_receiver_timeout being set to 0 could potentially cause >> issues, like not detecting network issues quickly enough. Consider >> re-evaluating this setting if you see connection issues. >> >> If you notice that some data is missing on subscriber then could you >> increase max_slot_wal_keep_size on publisher so that WALs are not delete= d >> until they are applied on subscriber. >> >> Do you have flexibility to increase max_worker_processes and >> max_logical_replication_workers, work_mem and maintenance_work_mem on >> subscriber (In case bottleneck exists on subscriber) >> >> If there's significant lag, consider whether it might be more efficient >> to drop the subscription and re-initialize it from scratch using a new b= ase >> backup, depending on the data volume and how long it might take for the >> existing replication to catch up. >> > > Thanks for the kind hints, I'll certainly look into those. > > My main interest however was with the "visibility" question, i.e. to get > an understanding of the gap between the two ends of a replication slot, > ideally in human terms (e.g. tables x records). > > I understand the difficulties of trying to produce a meaningful metric > that spans two (or more) systems but let's be honest, trying to diagnose > which knobs to tweak (whether in application, PG, the OS or the network) = is > basically black magic when all we really have is a pair of opaque LSNs. > > > > >> >> Regards, >> Muhammad Ikram >> >> >> On Sun, Sep 1, 2024 at 9:22=E2=80=AFPM Shaheed Haque >> wrote: >> >>> Since nobody more knowledgeable has replied... >>> >>> I'm very interested in this area and still surprised that there is no >>> official/convenient/standard way to approach this (see >>> https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5W= XCN9KjZOvuTNJaAAC_hg%40mail.gmail.com >>> ). >>> >>> Based partly on that thread, I ended up with a script that connects to >>> both ends of the replication, and basically loops while comparing the >>> counts in each table. >>> >>> On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, >>> wrote: >>> >>>> I've got two Postgres 13 databases on AWS RDS. >>>> >>>> - One is a master, the other a slave using logical replication. >>>> - Replication has fallen behind by about 350Gb. >>>> - The slave was maxed out in terms of CPU for the past four days >>>> because of some jobs that were ongoing so I'm not sure what logical >>>> replication was able to replicate during that time. >>>> - I killed those jobs and now CPU on the master and slave are both >>>> low. >>>> - I look at the subscriber via `select * from >>>> pg_stat_subscription;` and see that latest_end_lsn is advancing alb= eit very >>>> slowly. >>>> - The publisher says write/flush/replay lags are all 13 minutes >>>> behind but it's been like that for most of the day. >>>> - I see no errors in the logs on either the publisher or subscriber >>>> outside of some simple SQL errors that users have been making. >>>> - CloudWatch reports low CPU utilization, low I/O, and low network. >>>> >>>> >>>> >>>> Is there anything I can do here? Previously I set wal_receiver_timeout >>>> timeout to 0 because I had replication issues, and that helped things.= I >>>> wish I had *some* visibility here to get any kind of confidence that >>>> it's going to pull through, but other than these lsn values and databa= se >>>> logs, I'm not sure what to check. >>>> >>>> >>>> >>>> Sincerely, >>>> >>>> mj >>>> >>> >> >> -- >> Muhammad Ikram >> >> --=20 Muhammad Ikram --000000000000c94a7f06211ef6aa Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi=C2=A0Shaheed,
I think you must have already analyze= d the outcome of queries on=C2=A0pg_replication_slots,=C2=A0 pg_current_wal= _lsn(),=C2=A0pg_stat_subscription etc. I could find a query SELECT pg_size_= pretty(pg_wal_lsn_diff('<publisher_restart_lsn>', '<su= bscriber_replayed_lsn>'));

As a side note i= f you want to see what has been applied to subscribers vs what exists on pu= blisher then here is something from my previous experience. We used to have= a Data Validation tool=C2=A0for checking tables/rows across publisher/subs= criber. We also used pg_dump for another tool that was meant for making cop= ies of schemas.

Regards,
Muhammad Ikram<= /div>



On Mon, Sep 2, 2024 at 12:42=E2=80=AFPM = Shaheed Haque <shaheedhaque@gm= ail.com> wrote:
Hi Muhammad,=C2=A0

On Mon, 2 Sep 2024, 07:08 Muh= ammad Ikram, <mmi= kram@gmail.com> wrote:
Hi Shaheed,

Maybe these c= onsiderations could help you or give any hint to the problem ?


C= heck if wal_receiver_timeout being set to 0 could potentially cause issues,= like not detecting network issues quickly enough. Consider re-evaluating t= his setting if you see connection issues.

If you notice that some da= ta is missing on subscriber then could you increase max_slot_wal_keep_size = on publisher so that WALs are not deleted until they are applied on subscri= ber.

Do you have flexibility to increase max_worker_processes and ma= x_logical_replication_workers, work_mem and maintenance_work_mem on subscri= ber (In case bottleneck exists on subscriber)

If there's signifi= cant lag, consider whether it might be more efficient to drop the subscript= ion and re-initialize it from scratch using a new base backup, depending on= the data volume and how long it might take for the existing replication to= catch up.

Thanks for the kind hints, I'll certainly look in= to those.=C2=A0

My main = interest however was with the "visibility" question, i.e. to get = an understanding of the gap between the two ends of a replication slot, ide= ally in human terms (e.g. tables x records).

I understand the difficulties of trying to produce a m= eaningful metric that spans two (or more) systems but let's be honest, = trying to diagnose which knobs to tweak (whether in application, PG, the OS= or the network) is basically black magic when all we really have is a pair= of opaque LSNs.=C2=A0

<= br>

<= div>

=C2=A0Regards,
Muhammad Ikram


On Sun, Sep 1, 2024 at 9:22=E2=80=AFPM Shaheed Haque <shaheedhaque@= gmail.com> wrote:

Since nobody more knowledgeable has replied...

I'm very interested in this area and still surprised tha= t there is no official/convenient/standard way to approach this (see https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXC= N9KjZOvuTNJaAAC_hg%40mail.gmail.com).

Based partly on that thread, I ended up with a script that c= onnects to both ends of the replication, and basically loops while comparin= g the counts in each table.


On Fri= , 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiewicz@ghx.com> w= rote:

I've got two Postgres 13 databases on AWS RDS.

  • One is a master, the other a slave using logical re= plication.
  • Replication has fallen= behind by about 350Gb.
  • The slave= was maxed out in terms of CPU for the past four days because of some jobs = that were ongoing so I'm not sure what logical replication was able to = replicate during that time.
  • I kil= led those jobs and now CPU on the master and slave are both low.<= /u>
  • I look at the subscriber via `select * from= pg_stat_subscription;` and see that latest_end_lsn is advancing albeit ver= y slowly.
  • The publisher says writ= e/flush/replay lags are all 13 minutes behind but it's been like that f= or most of the day.
  • I see no erro= rs in the logs on either the publisher or subscriber outside of some simple= SQL errors that users have been making.
  • CloudWatch reports low CPU utilization, low I/O, and low network.

=C2=A0

Is there anything I can do here? Previously I set wa= l_receiver_timeout timeout to 0 because I had replication issues, and that = helped things. I wish I had some visibility here to get any kind of confidence that it's goi= ng to pull through, but other than these lsn values and database logs, I= 9;m not sure what to check.

=C2=A0

Sincerely,

mj



--
Muhammad Ikram



--
Muhammad Ikram

--000000000000c94a7f06211ef6aa--