MIME-Version: 1.0
From: Sam Persoon <sam@qargo.com>
Date: Sat, 28 Jun 2025 16:42:59 +0200
Message-ID: <CAHiJKbnX=8gFOFPbafbzQMzOfojnNnsBnOf9sO_buzdSkbL3Uw@mail.gmail.com>
Subject: LWLocks lock_manager occurrences and timeouts
To: pgsql-general@postgresql.org
Content-Type: multipart/alternative; boundary="00000000000050f6db0638a2d0a1"
Archived-At: <https://www.postgresql.org/message-id/CAHiJKbnX%3D8gFOFPbafbzQMzOfojnNnsBnOf9sO_buzdSkbL3Uw%40mail.gmail.com>
Precedence: bulk

--00000000000050f6db0638a2d0a1
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Postgres version: PostgreSQL 17.5 on x86_64-pc-linux-gnu, compiled by
Debian clang version 12.0.1, 64-bit

Postgres hosting: Google Cloud Platform, Cloud SQL

Postgres CPU: 40 vCPU

Postgres Memory: 200 GB

Postgres setup: 1 primary with 1 read-replica (hot_standby_feedback flag is
on)

Issue:

We see a lot of LWLock:lock_manager wait events on our read replica,
i.e. 150+. They seem to come mostly in bursts, and the occurrences have
been steadily increasing while our load has increased only very slightly.
The waits also seem concentrated in one or two specific queries, even
though we run 100+ different queries, some with a higher frequency and
as many joins as the "problematic" one. Note that the total number of
locks during that time is more than 10k.

These lock_manager waits end up lasting dozens of seconds, and because of
that we run into our statement timeout of 60s.

Some statistics on our database:

   - Average number of queries per minute is 12-15k
   - Maximum number of concurrent requests is ~100, but it goes up to ~150
   when the lock waits rise
   - The relevant tables have between 10 and 20 indexes, mostly on foreign
   keys
   - The relevant tables are ~20-40 GB
   - Total database size is 981 GB
   - The average replication lag on the read replica is <100ms
   - According to the Buffers output of the query plan, the problematic
   query reads between 100MB and 1GB of data
   - We don't have any partitions

Information on when you run into this lock is somewhat limited, but we
found
https://www.postgresql.org/message-id/E1ss4gX-000IvX-63%40gemulon.postgresq=
l.org
which mentions that the number of fast-path locks should have been
configurable since that release through the max_locks_per_transaction
parameter. However, when we check the pg_locks view, the number of
fast-path locks per pid and lock mode is still at most 16, even though
max_locks_per_transaction is set to 64.
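
To illustrate why that limit of 16 might matter for tables with 10 to 20
indexes, here is a minimal Python sketch. It rests on assumptions, not on
anything we measured: that a query takes one relation lock per table plus
one per index on that table, and that every lock beyond a backend's
fast-path slots has to go through the shared lock table, which is where
LWLock:lock_manager waits come from. The function names and the 3-table
join are hypothetical.

```python
def relation_locks(tables, indexes_per_table):
    """Relation locks for one query: each table plus all its indexes."""
    return tables * (1 + indexes_per_table)


def spilled_locks(tables, indexes_per_table, fast_path_slots):
    """Locks that do not fit into the backend's fast-path slots."""
    return max(
        0, relation_locks(tables, indexes_per_table) - fast_path_slots
    )


# Illustrative numbers only: a hypothetical 3-table join, 10 or 20
# indexes per table, and the 16 fast-path locks seen in pg_locks.
for indexes in (10, 20):
    print(indexes,
          relation_locks(3, indexes),
          spilled_locks(3, indexes, 16))
```

With these illustrative numbers, 17 to 47 relation locks per query would
not fit into 16 fast-path slots.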

If we increase the memory of the read replica to 300GB with the same
number of vCPUs, we see the waits occur far less often, which makes us
think that increasing the fast-path slots wouldn't really solve the issue
and that something else is going on. Maybe we didn't give enough memory
for the number of vCPUs.

When we add a second replica with the same vCPUs and memory, we don't see
any LWLock:lock_manager waits (or only a negligible number). Traffic is
spread randomly across both replicas, so they should hit the same data,
and we would expect to see the issue to some extent in this setup as
well. Yet the behaviour differs from having one larger read replica.

Additionally, we notice some periodicity in these locking issues, roughly
every 10 minutes, which we can't correlate with any load increase.

So our questions are:

   - When do LWLock:lock_manager waits occur?
   - Why do they occur in waves rather than consistently?
   - Why do they seem to correlate with the amount of memory given?
   - How can we solve this?


*Sam Persoon*
Team Lead Frontend
qargo.com <https://www.qargo.com>

--00000000000050f6db0638a2d0a1--