MIME-Version: 1.0
From: Marko Sutic <marko.sutic@gmail.com>
Date: Mon, 8 Apr 2024 14:48:33 +0200
Message-ID: <CAMD6WPd77S67F41YXuyJbVr-6SiD_2fm2scNS3mZhG7577E10g@mail.gmail.com>
Subject: LwLocks contention (MultiXactOffsetControlLock/multixact_offset) when
 running logical replication initial snapshot
To: pgsql-general@postgresql.org
Content-Type: multipart/alternative; boundary="000000000000558c0b0615953b5b"
Archived-At: <https://www.postgresql.org/message-id/CAMD6WPd77S67F41YXuyJbVr-6SiD_2fm2scNS3mZhG7577E10g%40mail.gmail.com>
Precedence: bulk

--000000000000558c0b0615953b5b
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Hello,
We are currently running a shared PostgreSQL cluster (version 11.18) that
supports over ten databases. To reduce the load on this cluster, we have
decided to migrate certain databases to dedicated clusters using the native
logical replication feature. This approach has been applied successfully to
between 50 and 100 databases without issues. However, we have recently
run into a problem with LWLock contention.

The problem occurred while taking the initial snapshot of a somewhat
larger database, approximately 500GB, with a single table accounting for
300GB. Although the database remained operational, its performance degraded
significantly for some services. Threads experienced delays of 20-30
seconds per simple execution while waiting on the
“LWLock:MultiXactOffsetControlLock” and “multixact_offset” locks, which
also blocked other processes. The issue did not appear immediately, but
only after the initial snapshot creation required for logical replication
had been running for a few hours.
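
A sketch of how waits like these can be sampled from pg_stat_activity. It
is illustrative only: the state filter and the grouping are assumptions,
not the exact monitoring behind the numbers above. Run repeatedly, it
shows which databases have backends queued on the two wait events named
above.

```sql
-- Count active backends per database that are currently waiting on the
-- multixact offset lock or its SLRU buffers (wait event names as of v11).
-- Needs a role that can see other sessions (superuser or pg_read_all_stats).
SELECT datname,
       wait_event_type,
       wait_event,
       count(*) AS backends
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event_type = 'LWLock'
  AND wait_event IN ('MultiXactOffsetControlLock', 'multixact_offset')
GROUP BY datname, wait_event_type, wait_event
ORDER BY backends DESC;
```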

Interestingly, not all databases or queries were impacted. The performance
degradation primarily affected specific queries, which I've listed below
with anonymized table names for confidentiality:

Database "migrated_db":
Insert Query: INSERT INTO library_books (author_id, genre_id, book_id,
publisher, library_id, section_key, content) VALUES ($1, $2, $3, $4, $5,
$6, $7);
Select Query: SELECT $2 FROM ONLY "academic_records"."lecture_series" x
WHERE "professor_id" = $1 FOR KEY SHARE OF x;

Database "other_db":
Update Query: UPDATE "vehicle_registry" SET "mileage_count" = mileage_count
+ $1 WHERE "vehicle_id" = $2 RETURNING "mileage_count";

These queries showed significant increases in execution time and in
shared buffer reads per call. The "library_books" table swelled from 500KB
to nearly 800MB, with increased bloat and a growing oldest row age. A
noticeable drop in transaction rate was visible for the affected services.
Once we stopped the replication, the locks were released and the
"library_books" table returned to its original size of 500KB, with
performance recovering correspondingly.
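
One possibility worth ruling out, stated here as an assumption rather
than a finding, is that the long-running snapshot held during the initial
table copy keeps back the xmin horizon, so that vacuum cannot remove dead
rows. That would fit the growth of "library_books". A sketch of queries
that show which sessions and replication slots hold the oldest xmin:

```sql
-- Sessions holding the oldest snapshots; a walsender that has been
-- copying a large table for hours would be expected near the top.
SELECT pid,
       datname,
       backend_type,
       state,
       xact_start,
       age(backend_xmin) AS xmin_age
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;

-- Replication slots and how far back they pin xmin / catalog_xmin.
SELECT slot_name,
       slot_type,
       active,
       age(xmin)         AS xmin_age,
       age(catalog_xmin) AS catalog_xmin_age
FROM pg_replication_slots;
```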

Could you please explain how the initial snapshot for logical
replication could cause this LWLock contention? Furthermore,
why are only certain queries affected, including some in non-migrated
databases?
Would it help to take the initial snapshot with pg_dump instead, to
reduce or temporarily remove the workload behind the affected queries, or
to adjust certain parameters?

Thank you for your assistance and insights.

Best regards,
Marko

--000000000000558c0b0615953b5b--