Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1scMpv-006cZa-Ui for pgsql-general@arkaria.postgresql.org; Fri, 09 Aug 2024 10:27:08 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1scMpu-002bdW-F1 for pgsql-general@arkaria.postgresql.org; Fri, 09 Aug 2024 10:27:06 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1scMpt-002bdO-MA for pgsql-general@lists.postgresql.org; Fri, 09 Aug 2024 10:27:05 +0000 Received: from mail-oo1-xc2c.google.com ([2607:f8b0:4864:20::c2c]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1scMpq-003s0U-8W for pgsql-general@lists.postgresql.org; Fri, 09 Aug 2024 10:27:04 +0000 Received: by mail-oo1-xc2c.google.com with SMTP id 006d021491bc7-5d5d077c60aso999851eaf.1 for ; Fri, 09 Aug 2024 03:27:01 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ilm.com; s=google; t=1723199220; x=1723804020; darn=lists.postgresql.org; h=to:subject:message-id:date:from:mime-version:from:to:cc:subject :date:message-id:reply-to; bh=2GCUZGuNqzdISdIuybnAOqqmhY9XEvwMVcND2WCBJL8=; b=E0/DoS3cS36gMwisHopg2VDd+IezlA5ac2z+G2IJg+uPE59v1r83apDIWdtewmu+I4 AWna6yE462Z9x4M28Gon/IboR2ecL4WefkgtZLIusdsxBzD6XqnWDGTozhlPAm1UcA+Z roiy3Dy24f4Uj+fXuCnVaQzmTzLpr0AAlMW4A= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1723199220; x=1723804020; h=to:subject:message-id:date:from:mime-version:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=2GCUZGuNqzdISdIuybnAOqqmhY9XEvwMVcND2WCBJL8=; b=aeshHeAfiyek9UgBtoXdIr7FhizKvZZKgE0+CI7BxZlyWBw+mK9vVR1RxBm6gQCNGK aNsj0OW3kZO06RpPlPz9ogFwMTl6d8aGkzJXvjt8CVHmjJldU8B2xZSy+NBQPjxUr85v 1IuyZyHuK9n0LZgJur3i2kYJFDhqJzaLrJAuGepxjywPnhdbLHlgm8+TKMEGOVqR7PXV A+WYUwbyswd2O2Xnaw31nEWkOWGc0RZehyCBT0izkM9BuUx/TQPSYoIq1Pa452gzxqK9 xE5vE4/w5+WFTwlkLSU23ZT89sIKy9Laf2lECmcdDW+FBhMmfyT1I3ubHgkejMA3kOvH pxOA== X-Gm-Message-State: AOJu0YwwCxDChPAve/31+AxWdj/F3tT5FJ7X3911BR2q5OKd3PYh9KP2 uIhCDxkiq0DcwDLqfwzc+nfb+oj38LRGTrKKTee0OmU9i9XrF2lJ+8LMMS3sbpRkx0TO+lI6CJy m95w6/GytmO6v1oBY2l+pLCngF0O9rDAeMC5owLGmhEnFV1Q8PE4= X-Google-Smtp-Source: AGHT+IHrWa18IfqIGUoA4S1MreDJS1IluBQREL8hAQ+/cvYF8SeJ3upYNLAl0KIqTxAYN3enndiDUxUPuvwKkhfSl/k= X-Received: by 2002:a05:6358:430c:b0:1ab:86f1:25bb with SMTP id e5c5f4694b2df-1b177198514mr109694255d.27.1723199220379; Fri, 09 Aug 2024 03:27:00 -0700 (PDT) MIME-Version: 1.0 From: Jim Vanns Date: Fri, 9 Aug 2024 11:26:49 +0100 Message-ID: Subject: Trouble understanding how to avoid/manage ERROR: multixact "members" limit exceeded To: pgsql-general@lists.postgresql.org Content-Type: multipart/alternative; boundary="0000000000006898ac061f3d9599" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --0000000000006898ac061f3d9599 Content-Type: text/plain; charset="UTF-8" Hi pggen community! I am struggling with this error almost daily now and despite various efforts, not succeeding in avoiding or dealing with it; ERROR: multixact "members" limit exceeded DETAIL: This command would create a multixact with 2 members, but the remaining space is only enough for 0 members. HINT: Execute a database-wide VACUUM in database with OID 16467 with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings. Runtime details follow below, but before that; I am now seeing the above error almost daily after approximately 12 hours of normal or expected behaviour and throughput. Then it hits and all writes are blocked etc. and the service is largely unusable/unable to recover. Restarting PG does allow autovacuum processes to kick in with aggressive vacuuming to handle the multixact freezing, but that isn't a suitable solution! Although having read sources that now explain why multixact XIDs exist and when they're used, I am not able to properly figure out how to plan for it or configure postgresql appropriately to handle it given our workload. My questions are; 1) How should I be managing this? Although not new to PG, I am new to this particular problem. 2) How can I confirm what processes/tables are contributing to this multixact "members" limit? 3) What are the units for vacuum_multixact_freeze_min_age/vacuum_multixact_freeze_table_age and how should I be setting them appropriately for my rates etc.? I can't really find anything that explains this clearly. 4) How can I check that autovacuum workers are specifically able to freeze multixact XIDs and thus avoid this? 5) Can I check if autovacuum is actually succeeding in its work? Schema (brief): 10 tables 1 table (the 'hottest) partitioned by 2h ranges; so an extra 12x partitions/relations a day, though only the most recent one might be considered 'active' System (brief): PG: 15.5 w/ TimescaleDB 2.14 extension Write heavy workload; Mean Txn/s (low): 8k Mean Txn/s (high): 10k Mean rows/s: 100k Concurrency: 32 threads (local socket sessions) for 'upserts' via primary service plus auxiliary processes (background workers/autovacuum workers etc.) Pattern (brief): COPY (binary) to session-local temporary tables (ON COMMIT DELETE) INSERT FROM tt TO main_table(s) UPDATE FROM tt TO main_table(s) VACUUM tt (every 100k txns) Config (excerpt): # - Transactions - (based on a period mean of ~8k txn/s) # See/ref; # www.postgresql.org/docs/15/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND # www.percona.com/blog/tuning-autovacuum-in-postgresql-and-autovacuum-internals # blog.sentry.io/transaction-id-wraparound-in-postgres # https://blog.danslimmon.com/2023/12/11/concurrent-locks-and-multixacts-in-postgres # https://pganalyze.com/blog/5mins-postgres-multiXact-ids-foreign-keys-performance # *What unit is 'age' here? No. of. transactions?* vacuum_freeze_min_age = 28800000 # 1h @ 8k/s vacuum_multixact_freeze_min_age = 28800000 # 1h @ 8k/s autovacuum_freeze_max_age = 157600000 # 2h @ 8k/s autovacuum_multixact_freeze_max_age = 57600000 # 2h @ 8k/s vacuum_multixact_freeze_table_age = 115200000 # 4h @ 8k/s # We don't UPDATE or DELETE often; we never DELETE and only perform # UPDATE operations every 6h due to internal cache(s). So we set this to # zero so the formula doesn't consider it a major factor autovacuum_vacuum_threshold = 0 # In tuples/rows UPDATEd or DELETEd autovacuum_vacuum_insert_threshold = 5000 # In tuples/rows INSERTed autovacuum_vacuum_scale_factor = 0.1 # 10% autovacuum_analyze_scale_factor = 0.1 # 10% autovacuum_vacuum_insert_scale_factor = 0.1 # 10% autovacuum_naptime = 60 autovacuum_max_workers = 8 # Give autovacuum more credits to ensure a better chance at scanning autovacuum_vacuum_cost_limit = 2000 # 10x the default Thanks for your help and any guidance/knowledge you can share! Jim -- Jim Vanns Principal Production Engineer Industrial Light & Magic, London --0000000000006898ac061f3d9599 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hi pggen community!

I am struggling wit= h this error almost daily now and despite various efforts, not succeeding i= n avoiding or dealing with it;

ERROR: =C2=A0multix= act "members" limit exceeded
DETAIL: =C2=A0This command= would create a multixact with 2 members, but the remaining space is only e= nough for 0 members.
HINT: =C2=A0Execute a database-wide VACUUM i= n database with OID 16467 with reduced vacuum_multixact_freeze_min_age and = vacuum_multixact_freeze_table_age settings.

Runtime details follow below, but before that; I am now seeing the ab= ove error almost daily after approximately 12 hours of normal or expected b= ehaviour and throughput. Then it hits and all writes are blocked etc. and t= he service is largely unusable/unable to recover. Restarting PG does allow = autovacuum processes to kick in with aggressive vacuuming to handle the mul= tixact freezing, but that isn't a suitable=C2=A0solution! Although havi= ng read sources that now explain why multixact XIDs exist and when they'= ;re used, I am not able to properly figure out how to plan for it or config= ure postgresql appropriately to handle it given our workload.
My questions are;

1) How should I be m= anaging this? Although not new to PG, I am new to this particular problem.<= /div>
2) How can I confirm what processes/tables are contributing to th= is multixact "members" limit?
3) What are the units for= vacuum_multixact_freeze_min_age/vacuum_multixact_freeze_table_age and how = should I be=C2=A0
=C2=A0 =C2=A0 setting them appropriately for my= rates etc.? I can't really find anything that explains this clearly.
4) How can I check that autovacuum workers are specifically able t= o freeze multixact XIDs and thus avoid this?
5) Can I check if au= tovacuum is actually succeeding in its work?

Schem= a (brief):
10 tables
1 table (the 'hottest) partiti= oned=C2=A0by 2h ranges; so an extra 12x partitions/relations a day, though = only the most recent one might be considered 'active'
System (brief):
PG: 15.5 w/ TimescaleDB 2.14 extensio= n
Write heavy workload;
Mean Txn/s (low):=C2=A0= =C2=A08k
Mean Txn/s (high): 10k
Mean rows/s: 100k
Concurrency: 32 threads (local socket sessions) for 'upserts'= via primary service plus auxiliary=C2=A0processes (background workers/auto= vacuum workers etc.)

Pattern (brief):
CO= PY (binary) to session-local temporary tables (ON COMMIT DELETE)
= INSERT FROM tt TO main_table(s)
UPDATE FROM tt TO main_table(s)
VACUUM tt (every 100k txns)

Config (excer= pt):
# Wha= t unit is 'age' here? No. of. transactions?
vacuum_freeze_mi= n_age =3D 28800000 # 1h @ 8k/s
vacuum_multixact_freeze_min_age =3D 28800= 000 # 1h @ 8k/s
autovacuum_freeze_max_age =3D 157600000 # 2h @ 8k/s
a= utovacuum_multixact_freeze_max_age =3D 57600000 # 2h @ 8k/s
vacuum_multi= xact_freeze_table_age =3D 115200000 # 4h @ 8k/s

# We don't UPDAT= E or DELETE often; we never DELETE and only perform
# UPDATE operations = every 6h due to internal cache(s). So we set this=C2=A0to
# zero = so the formula doesn't consider it a major factor
autovacuum_vacuum_= threshold =3D 0 # In tuples/rows UPDATEd or DELETEd
autovacuum_vacuum_in= sert_threshold =3D 5000 # In tuples/rows INSERTed

autovacuum_vacuum_= scale_factor =3D 0.1 # 10%
autovacuum_analyze_scale_factor =3D 0.1 # 10%=
autovacuum_vacuum_insert_scale_factor =3D 0.1 # 10%

a= utovacuum_naptime =3D 60
autovacuum_max_workers =3D 8

# Give autovacuum more credits to ensure a better chance at scanning=
autovacuum_vacuum_cost_limit =3D 2000 # 10x the default

Thanks for your help and any guidance/knowledge you ca= n share!

Jim

--
Jim Vanns
= Principal Production Engineer
Industrial Light & Magic, London
=
--0000000000006898ac061f3d9599--