Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uWAvZ-009Noy-IC for pgsql-general@arkaria.postgresql.org; Mon, 30 Jun 2025 09:35:53 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uWAvW-00FjUk-ML for pgsql-general@arkaria.postgresql.org; Mon, 30 Jun 2025 09:35:51 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uVWlw-008GVx-MJ for pgsql-general@lists.postgresql.org; Sat, 28 Jun 2025 14:43:17 +0000 Received: from mail-ed1-x535.google.com ([2a00:1450:4864:20::535]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1uVWlu-004a9w-0Q for pgsql-general@postgresql.org; Sat, 28 Jun 2025 14:43:16 +0000 Received: by mail-ed1-x535.google.com with SMTP id 4fb4d7f45d1cf-60867565fb5so5268472a12.3 for ; Sat, 28 Jun 2025 07:43:13 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=qargo-com.20230601.gappssmtp.com; s=20230601; t=1751121791; x=1751726591; darn=postgresql.org; h=to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=XByGFk+XafpxazXZivV9V/WhH0e0Dxiama8LTKIIvzI=; b=zhgcvIS1MGWY4/2htsrooof/4c7yZkoa43KqZ4mFnGngtdDOQ4fKDM4mnXRqp24QQ4 WCk224PGSXgtg6BJKJrRf2gqRM/jl7zCnd1WUkToY4Va8RUuXUn5YlQHeR7hF6Hf2lH7 8+BFNolygLGW2loRwWgGXuSsBc/sdPQGrBAlMVg3y/6PKLktlhullxAYZlliYelF0RD4 4np8nNGlX3KReG7+NR6VK3TnBJ0OedSBW86NZQiF5UHLDAu1/vveiauUaDgeke3CFHpx Fy8gj7f/dkG0o02BR/2i/yQt7cEveFttemg8mMvUe5d9cpdGQkxCcJ7gdOcJMboW5HYo yvfQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1751121791; x=1751726591; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=XByGFk+XafpxazXZivV9V/WhH0e0Dxiama8LTKIIvzI=; b=jhMP/4PlANZ/mtqQYsr2xTkOV9DmW61X92AqUBTKfeHVc+XEX6pLWfjdxcIvxE5z9I oRvlY6ZSLTEY2PIYYKaVzrrZTCSrLjHGysr8FGNKKTHIE2oHMsPhm86iIUs0x+f4Da// tk7ivTK0Kd71FJ7div/TLRLLrvNJZwQiq+mpooqw63+Qfrqd1IDuir9aElC/UaColdsp eIscC27f0zElwfqfvjaWW4wTM/EDiVLzO+dwtHimtoerGz8d3NQIVz+ztsbRmx8m0Rjc mqeVgD1J+HPGCnSNt2fjrk8mIugfGsNygwo75cBI1xr2m9nzv44klj5jjM6fkkHHtYFX 98mA== X-Gm-Message-State: AOJu0Ywa4bLTyVTZWUdrdLmwY0KClIHS3UvSKRcU5UGbrFGc362q8FPA eOcCTzH1Gqpw6qNd0zhkq1I2zNaVpFH8J8G/F1SLYrmgpOZoFR9NtZEES7BZxOwXCNf3BvcVZbR 0hR/GtKfRUJ3GRLjnGzoQ5dBPPkHESJ+bxNIS1j2Z5OCNKN2Hw9K+BfaQzQ== X-Gm-Gg: ASbGncucjSIwbSQw7ONGdqYmUBPQu8njrMTQFuLnz7iiMcqPh4NyiWyxuVkiqrPlo1r 7ITPvveyId1nGpux0K1VQRe0mQN+slQlmNS75Joz5AZpcZZpQJZsRceexBsb1/k1EGAEUw6ECgg jzaIUDV/+ESpFMt977zm0D1C8ugJikHBoMva1+eIlRIqEY X-Google-Smtp-Source: AGHT+IHJIMaNy6hf6O77YvLia54XSqRTw1aqlf4fdrE2FHTzn6r0G00RqAVwbfPYtAVD/urOD2aGMW3evsgUGGNx4hU= X-Received: by 2002:a05:6402:5190:b0:607:f55d:7c51 with SMTP id 4fb4d7f45d1cf-60c88e1f5f8mr6184706a12.17.1751121791082; Sat, 28 Jun 2025 07:43:11 -0700 (PDT) MIME-Version: 1.0 From: Sam Persoon Date: Sat, 28 Jun 2025 16:42:59 +0200 X-Gm-Features: Ac12FXwegh3cZCs3D5UNb8pQeHX-QjlmwRdNwLZiL0tBCWGIJUbfIQaqf_f34eo Message-ID: Subject: LWLocks lock_manager occurences and timeouts To: pgsql-general@postgresql.org Content-Type: multipart/alternative; boundary="00000000000050f6db0638a2d0a1" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000050f6db0638a2d0a1 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Postgres version: PostgreSQL 17.5 on x86_64-pc-linux-gnu, compiled by Debian clang version 12.0.1, 64-bit Postgres hosting: Google Cloud Platform, Cloud SQL Postgres CPU: 40 vCPU Postgres Memory: 200 GB Postgres setup: 1 primary with 1 read-replica (hot_standby_feedback flag is on) Issue: We experience a lot of LWLock:lock_manager events on our read-replica, i.e. 150+. They seem to come mostly in bursts and the occurrences seem to have been steadily increasing while our load did only very slightly. The amount of these lock_manager locks also seem concentrated to mostly one or two specific queries while we have 100+ different queries, some with a higher frequency and as many joins than the =E2=80=9Cproblematic=E2=80=9D one. Not= e that the total amount of locks during that time is more than 10k. These lock_manager locks end up taking dozens amount of seconds and because of that we run into our statement timeouts of 60s. Some statistics on our database: - Average amount of queries per minute is 12-15k - Maximum amount of concurrent requests is ~100, but when the lock rises goes up to ~150 - The relevant tables have between 10 and 20 indexes, mostly foreign key= s - The relevant tables are ~20-40 GB - Total database size is 981 GB - The average replication lag on the read replica is <100ms - Generating a query plan of the problematic query it seems to read between 100MB and 1GB of data reported in the Buffers - We don=E2=80=99t have any partitions The amount of information on when you run into this lock is somewhat limited, but we found https://www.postgresql.org/message-id/E1ss4gX-000IvX-63%40gemulon.postgresq= l.org that mentions that the number of fast-path locks should have been configurable since that release using the max_locks_per_transaction parameter. Although if we check the pg_locks table we see that the amount of fast-path locks per pid and lock mode is still a maximum of 16, even though max_locks_per_transaction is set to 64. If we increase the memory of the read-replica to 300GB with the same amount of CPU=E2=80=99s we see it occur way less, which makes us think that increa= sing the fast-path slots wouldn=E2=80=99t really solve the issue, but something else= is going on. Maybe we didn=E2=80=99t give enough memory for the amount of vCPU= =E2=80=99s. When adding a second replica with the same vCPU=E2=80=99s and memory, we do= n=E2=80=99t see any LWLock:lock_manager waits (or negligible at least). The traffic to both replica=E2=80=99s is random, so they should hit the same data, so we would = expect to see the issue somewhat in this case as well, but the behaviour seems to differ than having one larger read-replica. Additionally we notice somewhat of a periodicity in these locking issues about every 10 minutes, which we can=E2=80=99t correlate with any load incr= ease. So the questions we have, are: - When would the LWLock:lock_manager locks occur? - Why do they not occur consistently but in waves? - Why do they seem to correlate with the amount of memory given? - How we can solve this? *Sam Persoon* Team Lead Frontend qargo.com --00000000000050f6db0638a2d0a1 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Postgres version: PostgreSQL 17.5 on x86_64-pc-lin= ux-gnu, compiled by Debian clang version 12.0.1, 64-bit

Postgres hosting: Google Cloud Platform, Cloud SQL

Postgres CPU: 40 vCPU

Postgres Memory: 200 GB

Postgres setup: 1 primary with 1 read-replica (hot_standby_feedback flag= is on)

Issue:

We experience a lot of LWLock:lock_manager events on our read-replica, i= .e. 150+. They seem to come mostly in bursts and the occurrences seem to ha= ve been steadily increasing while our load did only very slightly. The amou= nt of these lock_manager locks also seem concentrated to mostly one or two = specific queries while we have 100+ different queries, some with a higher f= requency and as many joins than the =E2=80=9Cproblematic=E2=80=9D one. Note= that the total amount of locks during that time is more than 10k.

These lock_manager locks end up taking dozens amount of seconds and beca= use of that we run into our statement timeouts of 60s.

Some statistics on our database:

  • Average amount of queries per minute is 12-15k
  • Maximum amou= nt of concurrent requests is ~100, but when the lock rises goes up to ~150<= /li>
  • The relevant tables have between 10 and 20 indexes, mostly foreign = keys
  • The relevant tables are ~20-40 GB
  • Total database size = is 981 GB
  • The average replication lag on the read replica is <10= 0ms
  • Generating a query plan of the problematic query it seems to re= ad between 100MB and 1GB of data reported in the Buffers
  • We don=E2= =80=99t have any partitions

The amount of information on when you run into this lock is somewhat lim= ited, but we found https://www.postgresql.org/message-id/= E1ss4gX-000IvX-63%40gemulon.postgresql.org that mentions that the numbe= r of fast-path locks should have been configurable since that release using= the max_locks_per_transaction parameter. Although if we check the pg= _locks table we see that the amount of fast-path locks per pid and l= ock mode is still a maximum of 16, even though max_locks_per_transaction is= set to 64.

If we increase the memory of the read-replica to 300GB with the same amo= unt of CPU=E2=80=99s we see it occur way less, which makes us think that in= creasing the fast-path slots wouldn=E2=80=99t really solve the issue, but s= omething else is going on. Maybe we didn=E2=80=99t give enough memory for t= he amount of vCPU=E2=80=99s.

When adding a second replica with the same vCPU=E2=80=99s and memory, we= don=E2=80=99t see any LWLock:lock_manager waits (or negligible at least). = The traffic to both replica=E2=80=99s is random, so they should hit the sam= e data, so we would expect to see the issue somewhat in this case as well, = but the behaviour seems to differ than having one larger read-replica.

Additionally we notice somewhat of a periodicity in these locking issues= about every 10 minutes, which we can=E2=80=99t correlate with any load inc= rease.

So the questions we have, are:

  • When would the LWLock:lock_manager locks occur?
  • Why do they not occur consistently but in waves?
  • Why do they seem= to correlate with the amount of memory given?
  • How we can solve thi= s?


Sam Persoon
Team Lead Frontend
qargo.com

--00000000000050f6db0638a2d0a1--