MIME-Version: 1.0
References: <BN9PR03MB59965B5087688309C0D79DCEB7962@BN9PR03MB5996.namprd03.prod.outlook.com>
In-Reply-To: <BN9PR03MB59965B5087688309C0D79DCEB7962@BN9PR03MB5996.namprd03.prod.outlook.com>
From: Shaheed Haque <shaheedhaque@gmail.com>
Date: Sun, 1 Sep 2024 17:22:01 +0100
Message-ID: <CAHAc2je=cCww1xNmZ1TpZG6vwOxut8htJ7NdWotGjrHGuj8ELg@mail.gmail.com>
Subject: Re: Postgres Logical Replication - how to see what subscriber is
 doing with received data?
To: Michael Jaskiewicz <mjaskiewicz@ghx.com>
Cc: "pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000106d370621113ac9"
Archived-At: <https://www.postgresql.org/message-id/CAHAc2je%3DcCww1xNmZ1TpZG6vwOxut8htJ7NdWotGjrHGuj8ELg%40mail.gmail.com>
Precedence: bulk

--000000000000106d370621113ac9
Content-Type: text/plain; charset="UTF-8"

Since nobody more knowledgeable has replied...

I'm very interested in this area and still surprised that there is no
official/convenient/standard way to approach this (see
https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com
).

Based partly on that thread, I ended up with a script that connects to both
ends of the replication and loops, comparing the row counts of each table on
the two sides.
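
For illustration, the general shape of such a script might look like the
sketch below. This is not the actual script, only a minimal outline: the
function names, the table list, and the polling parameters are all
hypothetical, and it assumes you already hold two DB-API 2.0 connections
(for example from psycopg2), one to the publisher and one to the subscriber.

```python
import time


def table_counts(conn, tables):
    """Return {table: row count} using any DB-API 2.0 connection."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        # Identifiers cannot be bind parameters, so the (hypothetical)
        # table list must come from a trusted, hard-coded source.
        cur.execute(f"SELECT count(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    cur.close()
    return counts


def count_differences(publisher, subscriber, tables):
    """Return {table: (publisher_count, subscriber_count)} for mismatches."""
    pub = table_counts(publisher, tables)
    sub = table_counts(subscriber, tables)
    return {t: (pub[t], sub[t]) for t in tables if pub[t] != sub[t]}


def wait_until_in_sync(publisher, subscriber, tables,
                       interval=5.0, max_polls=None):
    """Poll until all counts match.

    Returns True once the counts agree, or False if max_polls is reached.
    """
    polls = 0
    while True:
        diffs = count_differences(publisher, subscriber, tables)
        if not diffs:
            return True
        for table, (p, s) in sorted(diffs.items()):
            print(f"{table}: publisher={p} subscriber={s} behind={p - s}")
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        time.sleep(interval)
```

With psycopg2 you would probably set `conn.autocommit = True` on both
connections so that the loop does not hold a transaction open. Note that
matching counts only show that inserts and deletes have arrived, not that
updates have been applied, and count(*) can be slow on large tables.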

On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiewicz@ghx.com> wrote:

> I've got two Postgres 13 databases on AWS RDS.
>
>    - One is a master, the other a slave using logical replication.
>    - Replication has fallen behind by about 350 GB.
>    - The slave was maxed out in terms of CPU for the past four days
>    because of some jobs that were ongoing so I'm not sure what logical
>    replication was able to replicate during that time.
>    - I killed those jobs and now CPU usage on both the master and the
>    slave is low.
>    - I look at the subscriber via `select * from pg_stat_subscription;`
>    and see that latest_end_lsn is advancing, albeit very slowly.
>    - The publisher says write/flush/replay lags are all 13 minutes behind
>    but it's been like that for most of the day.
>    - I see no errors in the logs on either the publisher or subscriber
>    outside of some simple SQL errors that users have been making.
>    - CloudWatch reports low CPU utilization, low I/O, and low network.
>
>
>
> Is there anything I can do here? Previously I set wal_receiver_timeout to 0
> because I had replication issues, and that helped things. I
> wish I had *some* visibility here to get any kind of confidence that it's
> going to pull through, but other than these lsn values and database logs,
> I'm not sure what to check.
>
>
>
> Sincerely,
>
> mj
>
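
On the visibility question, one rough option is to turn the LSN values into
numbers and watch how fast the gap closes. The sketch below assumes you
sample `SELECT pg_current_wal_lsn()` on the publisher and
`SELECT latest_end_lsn FROM pg_stat_subscription` on the subscriber at two
points in time. The helper names are hypothetical, and the estimate assumes
the apply rate stays constant, which it may well not.

```python
def lsn_to_int(lsn):
    """Convert a pg_lsn string such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(publisher_lsn, subscriber_lsn):
    """Bytes of WAL between the publisher's and subscriber's positions."""
    return lsn_to_int(publisher_lsn) - lsn_to_int(subscriber_lsn)


def catch_up_estimate(lag_then, lag_now, seconds_between):
    """Return (rate, eta_seconds) from two lag samples.

    rate is the number of bytes per second by which the lag shrinks.
    eta_seconds is None when the lag is flat or growing.
    """
    rate = (lag_then - lag_now) / seconds_between
    if rate <= 0:
        return rate, None
    return rate, lag_now / rate
```

For example, if the lag was 1000 bytes and is 400 bytes a minute later,
`catch_up_estimate(1000, 400, 60)` gives a rate of 10 bytes per second and
about 40 seconds to go. The same difference can be computed server-side
with pg_wal_lsn_diff(). Bear in mind that WAL distance covers all changes
on the publisher, not only those in published tables, so it overstates the
amount of data the subscriber still has to apply.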

--000000000000106d370621113ac9--