MIME-Version: 1.0
References: <BN9PR03MB59965B5087688309C0D79DCEB7962@BN9PR03MB5996.namprd03.prod.outlook.com>
 <CAHAc2je=cCww1xNmZ1TpZG6vwOxut8htJ7NdWotGjrHGuj8ELg@mail.gmail.com>
 <CAGeimVrMDJnxZaQj9ia2SJ3pOpsvreeb+s3j9qXukoGuncH9Zw@mail.gmail.com> <CAHAc2jcNjz6rA=n9RkeRj36KEcV-TR3BD7nhNKqndoTKYMXTtg@mail.gmail.com>
In-Reply-To: <CAHAc2jcNjz6rA=n9RkeRj36KEcV-TR3BD7nhNKqndoTKYMXTtg@mail.gmail.com>
From: Muhammad Ikram <mmikram@gmail.com>
Date: Mon, 2 Sep 2024 13:45:16 +0500
Message-ID: <CAGeimVrgXKejvzGB+Dks9ieZtnYYOF6RHLC5mDBxPCb-KLZLVw@mail.gmail.com>
Subject: Re: Postgres Logical Replication - how to see what subscriber is
 doing with received data?
To: Shaheed Haque <shaheedhaque@gmail.com>
Cc: Michael Jaskiewicz <mjaskiewicz@ghx.com>, pgsql-general@postgresql.org
Content-Type: multipart/alternative; boundary="000000000000c94a7f06211ef6aa"
Archived-At: <https://www.postgresql.org/message-id/CAGeimVrgXKejvzGB%2BDks9ieZtnYYOF6RHLC5mDBxPCb-KLZLVw%40mail.gmail.com>
Precedence: bulk

--000000000000c94a7f06211ef6aa
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Shaheed,
I assume you have already looked at the output of pg_replication_slots,
pg_current_wal_lsn(), pg_stat_subscription and so on. The one query I
could find that turns a pair of LSNs into a readable size is:

SELECT pg_size_pretty(pg_wal_lsn_diff('<publisher_restart_lsn>',
                                      '<subscriber_replayed_lsn>'));

Note that this reports the gap in bytes of WAL, not in tables or rows.
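
If it helps, the same arithmetic can also be done on the client side once
you have the two LSNs as text. The following is only a sketch in Python,
not something that ships with PostgreSQL: the function names are my own,
pretty_size() only approximates pg_size_pretty(), and the LSNs in the last
line are made up.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.strip().split("/")
    # An LSN is a 64-bit WAL position printed as two hex halves.
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap_bytes(publisher_lsn: str, subscriber_lsn: str) -> int:
    """Bytes of WAL between two positions (same idea as pg_wal_lsn_diff)."""
    return lsn_to_int(publisher_lsn) - lsn_to_int(subscriber_lsn)


def pretty_size(num_bytes: int) -> str:
    """Rough human-readable size; not identical to pg_size_pretty."""
    size = float(num_bytes)
    for unit in ("bytes", "kB", "MB", "GB"):
        if abs(size) < 1024:
            return f"{size:.1f} {unit}"
        size /= 1024
    return f"{size:.1f} TB"


# Made-up LSNs, for illustration only.
print(pretty_size(lsn_gap_bytes("5A/0", "2/80000000")))
```

Subtracting in this order gives a positive number while the publisher is
ahead of the subscriber, which makes it easy to log or graph over time.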

As a side note, if you want to compare what has been applied on the
subscriber with what exists on the publisher, here is what we did in the
past. We had a data validation tool that checked tables and rows across
publisher and subscriber. We also used pg_dump in a separate tool that
was meant for making copies of schemas.
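
As an illustration only (this is not the tool we used), the comparison
step of such a check could look like the Python sketch below. It assumes
the per-table row counts have already been collected on each side, for
example with SELECT count(*) through whichever driver you prefer; the
function name and the sample table names are hypothetical.

```python
from typing import Dict, List, Tuple


def compare_counts(
    publisher: Dict[str, int], subscriber: Dict[str, int]
) -> List[Tuple[str, int, int, int]]:
    """Return (table, pub_rows, sub_rows, gap) for tables that differ.

    A table absent on one side is treated as having 0 rows there.
    """
    report = []
    for table in sorted(set(publisher) | set(subscriber)):
        pub_rows = publisher.get(table, 0)
        sub_rows = subscriber.get(table, 0)
        if pub_rows != sub_rows:
            report.append((table, pub_rows, sub_rows, pub_rows - sub_rows))
    return report


# Made-up counts, as if collected from each side with SELECT count(*).
pub = {"public.orders": 1200, "public.users": 50}
sub = {"public.orders": 900, "public.users": 50}
for table, p, s, gap in compare_counts(pub, sub):
    print(f"{table}: publisher={p} subscriber={s} behind by {gap}")
```

On a busy system the two counts are never taken at exactly the same
moment, so small differences can be noise; the trend over several runs is
more telling than a single comparison.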

Regards,
Muhammad Ikram


On Mon, Sep 2, 2024 at 12:42 PM Shaheed Haque <shaheedhaque@gmail.com>
wrote:

> Hi Muhammad,
>
> On Mon, 2 Sep 2024, 07:08 Muhammad Ikram, <mmikram@gmail.com> wrote:
>
>> Hi Shaheed,
>>
>> Maybe these considerations could help you or give any hint to the
>> problem?
>>
>>
>> Check if wal_receiver_timeout being set to 0 could potentially cause
>> issues, like not detecting network issues quickly enough. Consider
>> re-evaluating this setting if you see connection issues.
>>
>> If you notice that some data is missing on subscriber then could you
>> increase max_slot_wal_keep_size on publisher so that WALs are not
>> deleted until they are applied on subscriber.
>>
>> Do you have flexibility to increase max_worker_processes and
>> max_logical_replication_workers, work_mem and maintenance_work_mem on
>> subscriber (In case bottleneck exists on subscriber)
>>
>> If there's significant lag, consider whether it might be more efficient
>> to drop the subscription and re-initialize it from scratch with a fresh
>> initial table copy, depending on the data volume and how long it might
>> take for the existing replication to catch up.
>>
>
> Thanks for the kind hints, I'll certainly look into those.
>
> My main interest however was with the "visibility" question, i.e. to get
> an understanding of the gap between the two ends of a replication slot,
> ideally in human terms (e.g. tables x records).
>
> I understand the difficulties of trying to produce a meaningful metric
> that spans two (or more) systems but let's be honest, trying to diagnose
> which knobs to tweak (whether in application, PG, the OS or the network)
> is basically black magic when all we really have is a pair of opaque
> LSNs.
>
>
>
>
>>
>>  Regards,
>> Muhammad Ikram
>>
>>
>> On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedhaque@gmail.com>
>> wrote:
>>
>>> Since nobody more knowledgeable has replied...
>>>
>>> I'm very interested in this area and still surprised that there is no
>>> official/convenient/standard way to approach this (see
>>> https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com
>>> ).
>>>
>>> Based partly on that thread, I ended up with a script that connects to
>>> both ends of the replication, and basically loops while comparing the
>>> counts in each table.
>>>
>>> On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiewicz@ghx.com>
>>> wrote:
>>>
>>>> I've got two Postgres 13 databases on AWS RDS.
>>>>
>>>>    - One is a master, the other a slave using logical replication.
>>>>    - Replication has fallen behind by about 350 GB.
>>>>    - The slave was maxed out in terms of CPU for the past four days
>>>>    because of some jobs that were ongoing so I'm not sure what logical
>>>>    replication was able to replicate during that time.
>>>>    - I killed those jobs and now CPU on the master and slave are both
>>>>    low.
>>>>    - I look at the subscriber via `select * from
>>>>    pg_stat_subscription;` and see that latest_end_lsn is advancing,
>>>>    albeit very slowly.
>>>>    - The publisher says write/flush/replay lags are all 13 minutes
>>>>    behind but it's been like that for most of the day.
>>>>    - I see no errors in the logs on either the publisher or subscriber
>>>>    outside of some simple SQL errors that users have been making.
>>>>    - CloudWatch reports low CPU utilization, low I/O, and low network.
>>>>
>>>>
>>>>
>>>> Is there anything I can do here? Previously I set wal_receiver_timeout
>>>> to 0 because I had replication issues, and that helped things. I
>>>> wish I had *some* visibility here to get any kind of confidence that
>>>> it's going to pull through, but other than these lsn values and
>>>> database logs, I'm not sure what to check.
>>>>
>>>>
>>>>
>>>> Sincerely,
>>>>
>>>> mj
>>>>
>>>
>>
>> --
>> Muhammad Ikram
>>
>>

--=20
Muhammad Ikram

--000000000000c94a7f06211ef6aa--