Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rtrbN-000QAC-V9 for pgsql-general@arkaria.postgresql.org; Mon, 08 Apr 2024 16:12:10 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1rtrbN-009R7t-74 for pgsql-general@arkaria.postgresql.org; Mon, 08 Apr 2024 16:12:09 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1rtoR1-006Cu0-Tm for pgsql-general@lists.postgresql.org; Mon, 08 Apr 2024 12:49:16 +0000 Received: from mail-lf1-x134.google.com ([2a00:1450:4864:20::134]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1rtoQy-001pE4-CF for pgsql-general@postgresql.org; Mon, 08 Apr 2024 12:49:14 +0000 Received: by mail-lf1-x134.google.com with SMTP id 2adb3069b0e04-516cdb21b34so5545170e87.1 for ; Mon, 08 Apr 2024 05:49:12 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1712580550; x=1713185350; darn=postgresql.org; h=to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=qQK2XTw8rQEy8FZH4iafhqfvIZ/6a+Ea92wTCCPWqL8=; b=EgAhBEYh62xuqHa6ZL1zBfLt2LhgPuACNk5m+7zTKpAd5HfkYmEadZmafp041waNXt 2kmz/FCNwkHRpMd2ABmaw1rrnahrJ9DO/bTPlSdeM3HGPxT7j+6dzgMOJeiXz4Fk+Y0+ B3HbxrTjDg4X/3L++RlVTAzO1yJcpHNTkXNUU6yQ8RVhVflts+oNKOxYyPIqOGpocqwf uMNVcEop0cf6OiVmvtru4XAFnS97QL1ev+uOEThQtnitiHtan0JWKTXgeiXa9n5TrmXl EvPSdXRHzkAIVYl68s4YUFBInMBvOtoKxUP1JGGtb00UrvGkIZMjlVhKu5KcIAg99+js sr/w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1712580550; x=1713185350; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=qQK2XTw8rQEy8FZH4iafhqfvIZ/6a+Ea92wTCCPWqL8=; b=W57O23T5I5bUsrRRqBXk2FAlDfJ3Qdo1f1nAyLuCxWQDBY2EMu4CuxWRToKhoi4RYv R5jD5FsLZnhNx2dm3EvYTOvvpkZjC9tW46PS+bu+GoUZy0XF912h2cEClJQBn9f8tAPV jDWz2h5oWemcWjs76LhrW6KXXKD3l31NCfEM1Si/RAPCnwvDqQ/IbTf+2z0L3RfGJPLd h0njWa6DgZzCTE+0BF9AaIQgmel/KxYvF2hYvjavRjMMehT8hJjpFZx4dS83FCvs54lI A0hLgqTTAn4J642F6QKZLNkhR6ooMxOl5b1QMhdoDFaPNVBNoVAJ3/8W/Mml180cUWr7 Y1Gw== X-Gm-Message-State: AOJu0YxGRltXwPc69U6g7zvv1auiwXe/FNqFMlbQTgjga1CeLcU87F+z sRATOl5N+KfWWbPFhgmnvIlZAll60JnjHbyoHINCI4TnogHjTJxqiZHggcpqFO6FxhlDpgm76PH oMmXyW95pftuoJkH97g4EkyFCc3e20p/L X-Google-Smtp-Source: AGHT+IHJoHfyR52VtNECwFsCFqSqWQUyM3FpJeZ5+rMJIxqqV7IpusSFOhS0R/oRGHrtkyfQEr3RxfeP/30Buz5+c3Y= X-Received: by 2002:a19:690a:0:b0:515:d5c9:e685 with SMTP id e10-20020a19690a000000b00515d5c9e685mr5785677lfc.33.1712580550033; Mon, 08 Apr 2024 05:49:10 -0700 (PDT) MIME-Version: 1.0 From: Marko Sutic Date: Mon, 8 Apr 2024 14:48:33 +0200 Message-ID: Subject: LwLocks contention (MultiXactOffsetControlLock/multixact_offset) when running logical replication initial snapshot To: pgsql-general@postgresql.org Content-Type: multipart/alternative; boundary="000000000000558c0b0615953b5b" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000558c0b0615953b5b Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Hello, We are currently using a shared PostgreSQL cluster (version 11.18) that supports over ten databases. To alleviate the load on this cluster, we've decided to migrate certain databases to dedicated clusters using the native logical replication feature. This approach has been successfully applied to between 50 and 100 databases without issues. However, we have recently encountered an issue related to LWLocks contention. The problem happened during the taking of an initial snapshot of a slightly bigger database, approximately 500GB, with a single table accounting for 300GB. Although the database remained operational, its performance degraded significantly for some services. Threads experienced delays of 20-30 seconds per simple execution when waiting for the =E2=80=9CLWLock:MultiXactOffsetControlLock=E2=80=9D and =E2=80=9Cmultixact_= offset=E2=80=9D locks, which also blocked other processes. This issue did not happen immediately but after a few hours running initial snapshot creation required for logical replication. Interestingly, not all databases or queries were impacted. The performance degradation primarily affected specific queries, which I've listed below with anonymized table names for confidentiality: Database "migrated_db": Insert Query: INSERT INTO library_books (author_id, genre_id, book_id, publisher, library_id, section_key, content) VALUES ($1, $2, $3, $4, $5, $6, $7); Select Query: SELECT $2 FROM ONLY "academic_records"."lecture_series" x WHERE "professor_id" =3D $1 FOR KEY SHARE OF x; Database "other_db": Update Query: UPDATE "vehicle_registry" SET "mileage_count" =3D mileage_cou= nt + $1 WHERE "vehicle_id" =3D $2 RETURNING "mileage_count"; These queries experienced significant increases in execution time and shared buffer reads per call. The "library_books" table swelled from 500KB to nearly 800MB, showing increased bloat and the oldest row age. Noticeable drop in transaction rate was visible for affected services. Upon discontinuing the replication, the locks were released, and the "library_books" table returned to its original size of 500KB, with performance levels improving correspondingly. Could you please provide insights on how the initial snapshot for logical replication could be causing these LWLocks contention issues? Furthermore, why are only certain queries affected, including some from non-migrated databases? Would initiating the snapshot with pg_dump, reducing or temporarily removing the workload on the affected queries, or making certain parameter adjustments help resolve this issue? Thank you for your assistance and insights. Best regards, Marko --000000000000558c0b0615953b5b Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hello,
We are cur= rently using a shared PostgreSQL cluster (version 11.18) that supports over= ten databases. To alleviate the load on this cluster, we've decided to= migrate certain databases to dedicated clusters using the native logical r= eplication feature. This approach has been successfully applied to between = 50 and 100 databases without issues. However, we have recently encountered = an issue related to LWLocks contention.
<= br>The problem happened during the taking of an initial snapshot of a sligh= tly bigger database, approximately 500GB, with a single table accounting fo= r 300GB. Although the database remained operational, its performance degrad= ed significantly for some services. Threads experienced delays of 20-30 sec= onds per simple execution when waiting for the =E2=80=9CLWLock:MultiXactOff= setControlLock=E2=80=9D and =E2=80=9Cmultixact_offset=E2=80=9D locks, which= also blocked other processes. This issue did not happen immediately but af= ter a few hours running initial snapshot creation required for logical repl= ication.

Interestingly, not all d= atabases or queries were impacted. The performance degradation primarily af= fected specific queries, which I've listed below with anonymized table = names for confidentiality:

Databa= se "migrated_db":
Insert Query: INSERT INTO library_books (aut= hor_id, genre_id, book_id, publisher, library_id, section_key, content) VAL= UES ($1, $2, $3, $4, $5, $6, $7);
Select = Query: SELECT $2 FROM ONLY "academic_records"."lecture_serie= s" x WHERE "professor_id" =3D $1 FOR KEY SHARE OF x;

=
Database "other_db":
Update= Query: UPDATE "vehicle_registry" SET "mileage_count" = =3D mileage_count + $1 WHERE "vehicle_id" =3D $2 RETURNING "= mileage_count";

These queries experienced significant increases= in execution time and shared buffer reads per call. The "library_book= s" table swelled from 500KB to nearly 800MB, showing increased bloat a= nd the oldest row age. Noticeable drop in transaction rate was visible for = affected services.
Upon discontinuing the replication, the locks were re= leased, and the "library_books" table returned to its original si= ze of 500KB, with performance levels improving correspondingly.

Could you please provide insights on how the= initial snapshot for logical replication could be causing these LWLocks co= ntention issues? Furthermore, why are only certain queries affected, includ= ing some from non-migrated databases?
Would initiating the snapshot with= pg_dump, reducing or temporarily removing the workload on the affected que= ries, or making certain parameter adjustments help resolve this issue?
<= br>
Thank you for your assistance and ins= ights.

Best regards,
Marko

--000000000000558c0b0615953b5b--