MIME-Version: 1.0
References: <BN9PR03MB59965B5087688309C0D79DCEB7962@BN9PR03MB5996.namprd03.prod.outlook.com>
 <CAHAc2je=cCww1xNmZ1TpZG6vwOxut8htJ7NdWotGjrHGuj8ELg@mail.gmail.com>
In-Reply-To: <CAHAc2je=cCww1xNmZ1TpZG6vwOxut8htJ7NdWotGjrHGuj8ELg@mail.gmail.com>
From: Muhammad Ikram <mmikram@gmail.com>
Date: Mon, 2 Sep 2024 11:08:20 +0500
Message-ID: <CAGeimVrMDJnxZaQj9ia2SJ3pOpsvreeb+s3j9qXukoGuncH9Zw@mail.gmail.com>
Subject: Re: Postgres Logical Replication - how to see what subscriber is
 doing with received data?
To: Shaheed Haque <shaheedhaque@gmail.com>
Cc: Michael Jaskiewicz <mjaskiewicz@ghx.com>, 
	"pgsql-general@postgresql.org" <pgsql-general@postgresql.org>
Content-Type: multipart/alternative; boundary="0000000000007e5a3706211cc5a1"
Archived-At: <https://www.postgresql.org/message-id/CAGeimVrMDJnxZaQj9ia2SJ3pOpsvreeb%2Bs3j9qXukoGuncH9Zw%40mail.gmail.com>
Precedence: bulk

--0000000000007e5a3706211cc5a1
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hi Shaheed,

Maybe these considerations could help, or at least give a hint about the
problem:


Setting wal_receiver_timeout to 0 disables the timeout entirely, so the
subscriber may not notice a dropped or stalled connection. Consider
restoring a non-zero value if you see connection issues.

If you notice that data is missing on the subscriber, check
max_slot_wal_keep_size on the publisher. If it is set to a limit, consider
raising it so that WAL is not removed before the subscriber has applied it.
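
To judge how much WAL the slot is holding back, one option is to compare
the slot's confirmed_flush_lsn with pg_current_wal_lsn() on the publisher.
Below is a minimal sketch of that arithmetic. The function names and the
SLOT_QUERY constant are my own illustration, not anything from your setup,
and the query assumes Postgres 13 or later, where pg_replication_slots has
the wal_status and safe_wal_size columns.

```python
# Illustrative query to run on the publisher (assumes Postgres 13+).
SLOT_QUERY = """
SELECT slot_name, wal_status, safe_wal_size,
       confirmed_flush_lsn, pg_current_wal_lsn() AS current_lsn
FROM pg_replication_slots
WHERE slot_type = 'logical';
"""


def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string such as '16/B374D848' to a byte position.

    A pg_lsn is printed as two hexadecimal numbers separated by a slash:
    the high 32 bits and the low 32 bits of a 64-bit WAL position.
    """
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_lag_bytes(current_lsn: str, confirmed_flush_lsn: str) -> int:
    """Bytes of WAL generated on the publisher but not yet confirmed."""
    return lsn_to_int(current_lsn) - lsn_to_int(confirmed_flush_lsn)
```

The same difference can be computed server-side with pg_wal_lsn_diff(); the
Python version is only convenient if a monitoring script already has both
values as strings.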

Do you have the flexibility to increase max_worker_processes,
max_logical_replication_workers, work_mem and maintenance_work_mem on the
subscriber (in case the bottleneck is on the subscriber)?

If the lag is significant, consider whether it would be more efficient to
drop the subscription and re-initialize it from scratch with a fresh
initial copy of the tables, depending on the data volume and how long the
existing replication might take to catch up.
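
One rough way to make that comparison is to sample the lag in bytes twice
and extrapolate. This is only a sketch: the function name is hypothetical,
and it assumes the lag drains at a constant rate, which it may well not.

```python
from typing import Optional


def estimate_catchup_seconds(
    lag_then: int, lag_now: int, interval_s: float
) -> Optional[float]:
    """Extrapolate the time to zero lag from two lag samples (in bytes).

    lag_then is the earlier sample, lag_now the later one, and interval_s
    the number of seconds between them. Returns None when the lag is not
    shrinking, in which case the subscriber will not catch up at this rate.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    drain_rate = (lag_then - lag_now) / interval_s  # bytes per second
    if drain_rate <= 0:
        return None
    return lag_now / drain_rate
```

If the estimate is much longer than a fresh copy of the tables would take,
or the function returns None, re-initializing is probably the better
option.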


 Regards,
Muhammad Ikram


On Sun, Sep 1, 2024 at 9:22 PM Shaheed Haque <shaheedhaque@gmail.com> wrote:

> Since nobody more knowledgeable has replied...
>
> I'm very interested in this area and still surprised that there is no
> official/convenient/standard way to approach this (see
> https://www.postgresql.org/message-id/CAHAc2jdAHvp7tFZBP37awcth%3DT3h5WXCN9KjZOvuTNJaAAC_hg%40mail.gmail.com
> ).
>
> Based partly on that thread, I ended up with a script that connects to
> both ends of the replication, and basically loops while comparing the
> counts in each table.
>
> On Fri, 30 Aug 2024, 12:38 Michael Jaskiewicz, <mjaskiewicz@ghx.com>
> wrote:
>
>> I've got two Postgres 13 databases on AWS RDS.
>>
>>    - One is a master, the other a slave using logical replication.
>>    - Replication has fallen behind by about 350Gb.
>>    - The slave was maxed out in terms of CPU for the past four days
>>    because of some jobs that were ongoing so I'm not sure what logical
>>    replication was able to replicate during that time.
>>    - I killed those jobs and now CPU on the master and slave are both
>>    low.
>>    - I look at the subscriber via `select * from pg_stat_subscription;`
>>    and see that latest_end_lsn is advancing albeit very slowly.
>>    - The publisher says write/flush/replay lags are all 13 minutes
>>    behind but it's been like that for most of the day.
>>    - I see no errors in the logs on either the publisher or subscriber
>>    outside of some simple SQL errors that users have been making.
>>    - CloudWatch reports low CPU utilization, low I/O, and low network.
>>
>>
>>
>> Is there anything I can do here? Previously I set wal_receiver_timeout
>> to 0 because I had replication issues, and that helped things. I
>> wish I had *some* visibility here to get any kind of confidence that
>> it's going to pull through, but other than these lsn values and database
>> logs, I'm not sure what to check.
>>
>>
>>
>> Sincerely,
>>
>> mj
>>
>

--=20
Muhammad Ikram

--0000000000007e5a3706211cc5a1--