MIME-Version: 1.0
References: <BN9PR03MB59965B5087688309C0D79DCEB7962@BN9PR03MB5996.namprd03.prod.outlook.com>
 <CAHAc2je=cCww1xNmZ1TpZG6vwOxut8htJ7NdWotGjrHGuj8ELg@mail.gmail.com>
 <CAGeimVrMDJnxZaQj9ia2SJ3pOpsvreeb+s3j9qXukoGuncH9Zw@mail.gmail.com>
 <CAHAc2jcNjz6rA=n9RkeRj36KEcV-TR3BD7nhNKqndoTKYMXTtg@mail.gmail.com> <CAGeimVrgXKejvzGB+Dks9ieZtnYYOF6RHLC5mDBxPCb-KLZLVw@mail.gmail.com>
In-Reply-To: <CAGeimVrgXKejvzGB+Dks9ieZtnYYOF6RHLC5mDBxPCb-KLZLVw@mail.gmail.com>
From: Shaheed Haque <shaheedhaque@gmail.com>
Date: Mon, 2 Sep 2024 14:27:24 +0100
Message-ID: <CAHAc2jcTzdRmvE6oCZ2EPBenadHKbFaBOXM=WN7o1x5e383BNA@mail.gmail.com>
Subject: Re: Postgres Logical Replication - how to see what subscriber is
 doing with received data?
To: Muhammad Ikram <mmikram@gmail.com>
Cc: Michael Jaskiewicz <mjaskiewicz@ghx.com>, pgsql-general@postgresql.org
Content-Type: multipart/alternative; boundary="0000000000007e8f3b062122e776"
Archived-At: <https://www.postgresql.org/message-id/CAHAc2jcTzdRmvE6oCZ2EPBenadHKbFaBOXM%3DWN7o1x5e383BNA%40mail.gmail.com>
Precedence: bulk

--0000000000007e8f3b062122e776
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Muhammad,

On Mon, 2 Sep 2024, 09:45 Muhammad Ikram, <mmikram@gmail.com> wrote:

> Hi Shaheed,
> I think you must have already analyzed the outcome of queries
> on pg_replication_slots, pg_current_wal_lsn(), pg_stat_subscription etc. I
> could find a query SELECT
> pg_size_pretty(pg_wal_lsn_diff('<publisher_restart_lsn>',
> '<subscriber_replayed_lsn>'));
>

Yes. My point is that it is hard to go from byte numbers to table entries.
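
To make that concrete, here is a minimal sketch (illustrative only;
lsn_to_int and lsn_gap_bytes are hypothetical helper names, not PostgreSQL
functions). It assumes the usual textual LSN format of two hexadecimal
numbers separated by a slash, and mirrors in Python roughly what
pg_wal_lsn_diff() reports: a byte count, with nothing in it that names a
table or a row.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position.

    The part before the slash is the high 32 bits and the part after it
    is the low 32 bits, both written in hexadecimal.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_gap_bytes(publisher_lsn: str, subscriber_lsn: str) -> int:
    """Byte distance between two LSNs (positive if the publisher is ahead)."""
    return lsn_to_int(publisher_lsn) - lsn_to_int(subscriber_lsn)


# Example values are made up for illustration.
print(lsn_gap_bytes("16/B374D848", "16/B3740000"))
```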

> As a side note, if you want to see what has been applied to subscribers vs
> what exists on publisher, then here is something from my previous
> experience. We used to have a Data Validation tool for checking tables/rows
> across publisher/subscriber.
>

Ack. That's pretty much what I had to build.

> We also used pg_dump for another tool that was meant for making copies of
> schemas.
>

I'm somewhat fortunate to have a simple use case where all I am doing is a
copy of the "old" deployment to a "new" deployment such that when the two
ends are in close sync, I can freeze traffic to the old deployment, pause
for any final catchup, and then run a Django migration on the new, before
switching on the new (thereby minimising the down time for the app).

What I found by just looking at LSN numbers was that the database LSNs were
close but NOT the same. Once I built the tool, I was able to see which
tables were still in play, and saw that some previously overlooked
background timers were expiring, causing the activity.

Net result: the LSNs can tell you if you are not in sync, but not the
reason why. (Again, I understand that row counts worked for me, but might
not work for others).
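
For illustration, the row-count comparison can be sketched roughly as below.
This is not the actual tool, just a minimal sketch: quote_ident,
table_counts and count_gaps are hypothetical names, it assumes DB-API 2.0
connections to both ends (for PostgreSQL that would typically be something
like psycopg2) and unqualified table names, and the demo uses two in-memory
SQLite databases as stand-ins so that it runs without a server.

```python
import sqlite3  # used only for the self-contained demo below


def quote_ident(name):
    """Double-quote an SQL identifier (valid in PostgreSQL and SQLite)."""
    return '"' + name.replace('"', '""') + '"'


def table_counts(conn, tables):
    """Return {table: row count} using any DB-API 2.0 connection."""
    counts = {}
    cur = conn.cursor()
    for table in tables:
        cur.execute("SELECT count(*) FROM " + quote_ident(table))
        counts[table] = cur.fetchone()[0]
    cur.close()
    return counts


def count_gaps(publisher, subscriber, tables):
    """Return {table: (publisher rows, subscriber rows)} where they differ."""
    pub = table_counts(publisher, tables)
    sub = table_counts(subscriber, tables)
    return {t: (pub[t], sub[t]) for t in tables if pub[t] != sub[t]}


# Demo: "old" stands in for the publisher, "new" for the subscriber.
old = sqlite3.connect(":memory:")
new = sqlite3.connect(":memory:")
for conn, timer_rows in ((old, 3), (new, 1)):
    conn.execute("CREATE TABLE users (id INTEGER)")
    conn.execute("CREATE TABLE timers (id INTEGER)")
    conn.executemany("INSERT INTO users VALUES (?)", [(1,), (2,)])
    rows = [(i,) for i in range(timer_rows)]
    conn.executemany("INSERT INTO timers VALUES (?)", rows)

gaps = count_gaps(old, new, ["users", "timers"])
print(gaps)  # only the tables that are still out of step
```

In practice one would call count_gaps() in a loop with a short sleep until
it returns an empty dict, bearing in mind that equal counts do not prove
equal contents.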

Thanks for your kind help and pointers!


> Regards,
> Muhammad Ikram
>
>
>
> On Mon, Sep 2, 2024 at 12:42 PM Shaheed Haque <shaheedhaque@gmail.com>
> wrote:
>
>> Hi Muhammad,
>>
>> On Mon, 2 Sep 2024, 07:08 Muhammad Ikram, <mmikram@gmail.com> wrote:
>>
>>> Hi Shaheed,
>>>
>>> Maybe these considerations could help you or give a hint about the
>>> problem?
>>>
>>>
>>> Check whether wal_receiver_timeout being set to 0 could potentially cause
>>> issues, like not detecting network issues quickly enough. Consider
>>> re-evaluating this setting if you see connection issues.
>>>
>>> If you notice that some data is missing on the subscriber, could you
>>> increase max_slot_wal_keep_size on the publisher so that WALs are not
>>> deleted until they are applied on the subscriber?
>>>
>>> Do you have the flexibility to increase max_worker_processes,
>>> max_logical_replication_workers, work_mem and maintenance_work_mem on the
>>> subscriber (in case a bottleneck exists on the subscriber)?
>>>
>>> If there's significant lag, consider whether it might be more efficient
>>> to drop the subscription and re-initialize it from scratch using a new base
>>> backup, depending on the data volume and how long it might take for the
>>> existing replication to catch up.
>>>
>>
>> Thanks for the kind hints, I'll certainly look into those.
>>
>> My main interest however was with the "visibility" question, i.e. to get
>> an understanding of the gap between the two ends of a replication slot,
>> ideally in human terms (e.g. tables x records).
>>
>> I understand the difficulties of trying to produce a meaningful metric
>> that spans two (or more) systems but let's be honest, trying to diagnose
>> which knobs to tweak (whether in application, PG, the OS or the network) is
>> basically black magic when all we really have is a pair of opaque LSNs.
>>
>>
>>
>>
>>>
>>>  Regards,
>>> Muhammad Ikram
>>>
>>>
>>> On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedhaque@gmail.com>
>>> wrote:
>>>
>>>> Since nobody more knowledgeable has replied...
>>>>
>>>> I'm very interested in this area and still surprised that there is no
>>>> official/convenient/standard way to approach this (see
>>>> https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com
>>>> ).
>>>>
>>>> Based partly on that thread, I ended up with a script that connects to
>>>> both ends of the replication, and basically loops while comparing the
>>>> counts in each table.
>>>>
>>>> On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiewicz@ghx.com>
>>>> wrote:
>>>>
>>>>> I've got two Postgres 13 databases on AWS RDS.
>>>>>
>>>>>    - One is a master, the other a slave using logical replication.
>>>>>    - Replication has fallen behind by about 350 GB.
>>>>>    - The slave was maxed out in terms of CPU for the past four days
>>>>>    because of some jobs that were ongoing so I'm not sure what logical
>>>>>    replication was able to replicate during that time.
>>>>>    - I killed those jobs and now CPU on the master and slave are both
>>>>>    low.
>>>>>    - I look at the subscriber via `select * from
>>>>>    pg_stat_subscription;` and see that latest_end_lsn is advancing albeit
>>>>>    very slowly.
>>>>>    - The publisher says write/flush/replay lags are all 13 minutes
>>>>>    behind but it's been like that for most of the day.
>>>>>    - I see no errors in the logs on either the publisher or
>>>>>    subscriber outside of some simple SQL errors that users have been
>>>>>    making.
>>>>>    - CloudWatch reports low CPU utilization, low I/O, and low
>>>>>    network.
>>>>>
>>>>>
>>>>>
>>>>> Is there anything I can do here? Previously I set wal_receiver_timeout
>>>>> to 0 because I had replication issues, and that helped things. I
>>>>> wish I had *some* visibility here to get any kind of confidence that
>>>>> it's going to pull through, but other than these lsn values and database
>>>>> logs, I'm not sure what to check.
>>>>>
>>>>>
>>>>>
>>>>> Sincerely,
>>>>>
>>>>> mj
>>>>>
>>>>
>>>
>>> --
>>> Muhammad Ikram
>>>
>>>
>
> --
> Muhammad Ikram
>
>

--0000000000007e8f3b062122e776--