BUG #19430: Autovacuums stopped working possible due to problem with vacuuming shared catalog pg

public inbox for [email protected]  
help / color / mirror / Atom feed

BUG #19430: Autovacuums stopped working possible due to problem with vacuuming shared catalog pg_authid
2+ messages / 2 participants
[nested] [flat]

* BUG #19430: Autovacuums stopped working possible due to problem with vacuuming shared catalog pg_authid
@ 2026-03-12 09:24 PG Bug reporting form <[email protected]>
  2026-03-12 19:47 ` Re: BUG #19430: Autovacuums stopped working possible due to problem with vacuuming shared catalog pg_authid Laurenz Albe <[email protected]>
  0 siblings, 1 reply; 2+ messages in thread

From: PG Bug reporting form @ 2026-03-12 09:24 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

The following bug has been logged on the website:

Bug reference:      19430
Logged by:          Alexey Ermakov
Email address:      [email protected]
PostgreSQL version: 14.15
Operating system:   Ubuntu 22.04
Description:        

On a PostgreSQL 14.15 server with about 100 databases running on Ubuntu
22.04 I observed following problem:

1) According to the DB logs (log_autovacuum_min_duration = 1s) and
monitoring graphs autovacuums almost completely stopped working across all
DBs at once
2) This is also confirmed by checking pg_stat_user_tables.last_autovacuum in
all databases, there were no autovacuums anywhere for several days
3) Counters in pg_stat_user_tables updated normally, no anomalies there
4) pg_stat_all_tables.last_autovacuum shows one anomaly with pg_authid
catalog: only that catalog was autovacuumed recently (there are not many
dead rows there, looks normal). There might be other anomalies in other
databases as I only checked one but I doubt it.
5) I found nothing interesting in the logs around the time when problem
started (I checked for ERROR/WARNING/FATAL messages and possible DDL queries
but sadly log_statement = none there, log_min_duration_statement = 1s)
6) Autovacuum launcher process appears to be running normally, periodically
wakes up and consume 1-3% CPU of single core, pg_stat_activity shows the
wait event as AutoVacuumMain
7) Just in case, I checked dmesg entries and the time offset graph in
monitoring (to verify that time wasn't moving backwards for some reason) -
no anomalies there
8) Running vacuum on one of the bloated tables (which cause performance
problems) worked without issues
9) There is no long running transactions or lagging standbys with
hot_standby_feedback on
10) pg_authid catalog has `age(relfrozenxid)` around 290M but
autovacuum_freeze_max_age is 200M and anti-wraparound autovacuum should've
been triggered
11) After running `vacuum verbose pg_authid;` nothing changed (it wasn't an
aggressive vacuum) but running `vacuum (freeze, verbose) pg_authid;` fixed
problem and autovacuums started working properly. Output of these commands
are at the end of the report.

I assume that problem might have started when age(relfrozenxid) for
pg_authid approached autovacuum_freeze_max_age. A rough estimate based on
the current rate of age(relfrozenxid) confirms that we might indeed did 90M
transactions during the autovacuum malfunction.

An interesting observation: the query `ALTER ROLE xxx WITH ;` from some
permission-granting business logic triggered several autovacuums on ~half of
databases at same second for some time but didn't fix the problem
completely. Perhaps some cache invalidation did that?

Unfortunately, 14.15 is more than 1 year old. An update to 14.22 was
discussed recently but haven't been made yet. There were several fixes
related to vacuum/autovacuum but they didn't seem directly related to the
issue but I might be wrong.

Output of vacuum commands:

vacuum verbose pg_authid;
INFO:  vacuuming "pg_catalog.pg_authid"
INFO:  table "pg_authid": found 0 removable, 7 nonremovable row versions in
1 out of 57 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 3828262030
Skipped 0 pages due to buffer pins, 55 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  vacuuming "pg_toast.pg_toast_1260"
INFO:  table "pg_toast_1260": found 0 removable, 0 nonremovable row versions
in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 3828262030
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

vacuum (freeze, verbose) pg_authid;
INFO:  aggressively vacuuming "pg_catalog.pg_authid"
INFO:  table "pg_authid": found 0 removable, 196 nonremovable row versions
in 4 out of 57 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 3828278683
Skipped 0 pages due to buffer pins, 53 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO:  aggressively vacuuming "pg_toast.pg_toast_1260"
INFO:  table "pg_toast_1260": found 0 removable, 0 nonremovable row versions
in 0 out of 0 pages
DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 3828278683
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
VACUUM

Thanks,
Alexey Ermakov

^ permalink  raw  reply  [nested|flat] 2+ messages in thread

* Re: BUG #19430: Autovacuums stopped working possible due to problem with vacuuming shared catalog pg_authid
  2026-03-12 09:24 BUG #19430: Autovacuums stopped working possible due to problem with vacuuming shared catalog pg_authid PG Bug reporting form <[email protected]>
@ 2026-03-12 19:47 ` Laurenz Albe <[email protected]>
  0 siblings, 0 replies; 2+ messages in thread

From: Laurenz Albe @ 2026-03-12 19:47 UTC (permalink / raw)
  To: [email protected]; [email protected]

On Thu, 2026-03-12 at 09:24 +0000, PG Bug reporting form wrote:
> PostgreSQL version: 14.15
> Operating system:   Ubuntu 22.04
> 
> On a PostgreSQL 14.15 server with about 100 databases running on Ubuntu
> 22.04 I observed following problem:
> 
> 1) According to the DB logs (log_autovacuum_min_duration = 1s) and
> monitoring graphs autovacuums almost completely stopped working across all
> DBs at once
> 2) This is also confirmed by checking pg_stat_user_tables.last_autovacuum in
> all databases, there were no autovacuums anywhere for several days
> 3) Counters in pg_stat_user_tables updated normally, no anomalies there
> 4) pg_stat_all_tables.last_autovacuum shows one anomaly with pg_authid
> catalog: only that catalog was autovacuumed recently (there are not many
> dead rows there, looks normal). There might be other anomalies in other
> databases as I only checked one but I doubt it.
> 5) I found nothing interesting in the logs around the time when problem
> started (I checked for ERROR/WARNING/FATAL messages and possible DDL queries
> but sadly log_statement = none there, log_min_duration_statement = 1s)
> 6) Autovacuum launcher process appears to be running normally, periodically
> wakes up and consume 1-3% CPU of single core, pg_stat_activity shows the
> wait event as AutoVacuumMain
> 7) Just in case, I checked dmesg entries and the time offset graph in
> monitoring (to verify that time wasn't moving backwards for some reason) -
> no anomalies there
> 8) Running vacuum on one of the bloated tables (which cause performance
> problems) worked without issues
> 9) There is no long running transactions or lagging standbys with
> hot_standby_feedback on
> 10) pg_authid catalog has `age(relfrozenxid)` around 290M but
> autovacuum_freeze_max_age is 200M and anti-wraparound autovacuum should've
> been triggered
> 11) After running `vacuum verbose pg_authid;` nothing changed (it wasn't an
> aggressive vacuum) but running `vacuum (freeze, verbose) pg_authid;` fixed
> problem and autovacuums started working properly. Output of these commands
> are at the end of the report.
> 
> I assume that problem might have started when age(relfrozenxid) for
> pg_authid approached autovacuum_freeze_max_age. A rough estimate based on
> the current rate of age(relfrozenxid) confirms that we might indeed did 90M
> transactions during the autovacuum malfunction.
> 
> An interesting observation: the query `ALTER ROLE xxx WITH ;` from some
> permission-granting business logic triggered several autovacuums on ~half of
> databases at same second for some time but didn't fix the problem
> completely. Perhaps some cache invalidation did that?
> 
> Unfortunately, 14.15 is more than 1 year old. An update to 14.22 was
> discussed recently but haven't been made yet. There were several fixes
> related to vacuum/autovacuum but they didn't seem directly related to the
> issue but I might be wrong.
> 
> Output of vacuum commands:
> 
> vacuum verbose pg_authid;
> INFO:  vacuuming "pg_catalog.pg_authid"
> INFO:  table "pg_authid": found 0 removable, 7 nonremovable row versions in
> 1 out of 57 pages
> DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 3828262030
> Skipped 0 pages due to buffer pins, 55 frozen pages.
> CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
> INFO:  vacuuming "pg_toast.pg_toast_1260"
> INFO:  table "pg_toast_1260": found 0 removable, 0 nonremovable row versions
> in 0 out of 0 pages
> DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 3828262030
> Skipped 0 pages due to buffer pins, 0 frozen pages.
> CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
> VACUUM
> 
> vacuum (freeze, verbose) pg_authid;
> INFO:  aggressively vacuuming "pg_catalog.pg_authid"
> INFO:  table "pg_authid": found 0 removable, 196 nonremovable row versions
> in 4 out of 57 pages
> DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 3828278683
> Skipped 0 pages due to buffer pins, 53 frozen pages.
> CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
> INFO:  aggressively vacuuming "pg_toast.pg_toast_1260"
> INFO:  table "pg_toast_1260": found 0 removable, 0 nonremovable row versions
> in 0 out of 0 pages
> DETAIL:  0 dead row versions cannot be removed yet, oldest xmin: 3828278683
> Skipped 0 pages due to buffer pins, 0 frozen pages.
> CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
> VACUUM

This change in 14.21 looks relevant:

  - Fix issues around in-place catalog updates (Noah Misch)

  Send a nontransactional invalidation message for an in-place update, since
  such an update will survive transaction rollback. Also ensure that the
  update is WAL-logged before other sessions can see it. These fixes
  primarily prevent scenarios in which relations' frozen-XID attributes
  become inconsistent, possibly allowing premature CLOG truncation and
  subsequent “could not access status of transaction” errors.

Yours,
Laurenz Albe






^ permalink  raw  reply  [nested|flat] 2+ messages in thread

end of thread, other threads:[~2026-03-12 19:47 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-03-12 09:24 BUG #19430: Autovacuums stopped working possible due to problem with vacuuming shared catalog pg_authid PG Bug reporting form <[email protected]>
2026-03-12 19:47 ` Laurenz Albe <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox