public inbox for [email protected]  
another autovacuum scheduling thread
143+ messages / 15 participants

* another autovacuum scheduling thread
@ 2025-10-08 15:18  Nathan Bossart <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Nathan Bossart @ 2025-10-08 15:18 UTC (permalink / raw)
  To: pgsql-hackers

/me dons flame-proof suit

My goal with this thread is to produce some incremental autovacuum
scheduling improvements for v19, but realistically speaking, I know that
it's a bit of a long-shot.  There have been many discussions over the
years, and I've read through a few of them [0] [1] [2] [3] [4], but there
are certainly others I haven't found.  Since this seems to be a contentious
topic, I figured I'd start small to see if we can get _something_
committed.

While I am by no means wedded to a specific idea, my current concrete
proposal (proof-of-concept patch attached) is to start by ordering the
tables a worker will process by (M)XID age.  Here are the reasons:

* We do some amount of prioritization of databases at risk of wraparound at
database level, per the following comment from autovacuum.c:

	 * Choose a database to connect to.  We pick the database that was least
	 * recently auto-vacuumed, or one that needs vacuuming to prevent Xid
	 * wraparound-related data loss.  If any db at risk of Xid wraparound is
	 * found, we pick the one with oldest datfrozenxid, independently of
	 * autovacuum times; similarly we pick the one with the oldest datminmxid
	 * if any is in MultiXactId wraparound.  Note that those in Xid wraparound
	 * danger are given more priority than those in multi wraparound danger.

However, we do no such prioritization of the tables within a database.  In
fact, the ordering of the tables is effectively random.  IMHO this gives us
quite a bit of wiggle room to experiment; since we are processing tables in
no specific order today, changing the order to something vacuuming-related
seems more likely to help than it is to harm.

* Prioritizing tables based on their (M)XID age might help avoid more
aggressive vacuums, not to mention wraparound.  Of course, there are
scenarios where this doesn't work.  For example, the age of a table may
have changed greatly between the time we recorded it and the time we
process it.  Or maybe there is another table in a different database that
is more important from a wraparound perspective.  We could complicate the
patch to try to handle some of these things, but I maintain that even some
basic, incremental scheduling improvements would be better than the status
quo.  And we can always change it further in the future to handle these
problems and to consider other things like bloat.

The attached patch works by storing the maximum of the XID age and the MXID
age in the list with the OIDs and sorting it prior to processing.
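
A minimal sketch of the ordering described above, written in Python rather than the patch's C. The `TableEntry` type, the function name, and the OIDs and ages are hypothetical and only illustrate "store max(XID age, MXID age) next to each OID, then sort":

```python
from typing import NamedTuple


class TableEntry(NamedTuple):
    # Hypothetical stand-in for one entry in the worker's to-do list.
    oid: int
    xid_age: int
    mxid_age: int


def order_by_wraparound_age(tables):
    """Return OIDs so the table with the largest max(XID age, MXID age) is first."""
    keyed = [(max(t.xid_age, t.mxid_age), t.oid) for t in tables]
    # Sort on the age only; ties keep their original (catalog) order.
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [oid for _, oid in keyed]


tables = [
    TableEntry(oid=16384, xid_age=50_000_000, mxid_age=1_000),
    TableEntry(oid=16390, xid_age=10_000, mxid_age=180_000_000),
    TableEntry(oid=16401, xid_age=120_000_000, mxid_age=5_000),
]
ordered = order_by_wraparound_age(tables)
```

Note that the ages are a snapshot taken when the list is built, which is the staleness caveat mentioned earlier in this message.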

Thoughts?

[0] https://postgr.es/m/CA%2BTgmoafJPjB3WVqB3FrGWUU4NLRc3VHx8GXzLL-JM%2B%2BJPwK%2BQ%40mail.gmail.com
[1] https://postgr.es/m/CAEG8a3%2B3fwQbgzak%2Bh3Q7Bp%3DvK_aWhw1X7w7g5RCgEW9ufdvtA%40mail.gmail.com
[2] https://postgr.es/m/CAD21AoBUaSRBypA6pd9ZD%3DU-2TJCHtbyZRmrS91Nq0eVQ0B3BA%40mail.gmail.com
[3] https://postgr.es/m/CA%2BTgmobT3m%3D%2BdU5HF3VGVqiZ2O%2Bv6P5wN1Gj%2BPrq%2Bhj7dAm9AQ%40mail.gmail.com
[4] https://postgr.es/m/20130124215715.GE4528%40alvh.no-ip.org

-- 
nathan

* Re: another autovacuum scheduling thread
@ 2025-10-08 17:06  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 3 replies; 143+ messages in thread

From: Sami Imseih @ 2025-10-08 17:06 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: pgsql-hackers

Thanks for raising this topic! I agree that autovacuum scheduling
could be improved.

> * Prioritizing tables based on their (M)XID age might help avoid more
> aggressive vacuums, not to mention wraparound.  Of course, there are
> scenarios where this doesn't work.  For example, the age of a table may
> have changed greatly between the time we recorded it and the time we
> process it.  Or maybe there is another table in a different database that
> is more important from a wraparound perspective.  We could complicate the
> patch to try to handle some of these things, but I maintain that even some
> basic, incremental scheduling improvements would be better than the status
> quo.  And we can always change it further in the future to handle these
> problems and to consider other things like bloat.

One risk I see with this approach is that we will end up autovacuuming
tables that also take the longest time to complete, which could cause
smaller, quick-to-process tables to be neglected.

It’s not always the case that the oldest tables in terms of (M)XID age
are also the most expensive to vacuum, but that is often more true
than not.

I'm not saying that the current approach, which as you mention is
random, is any better; however, this approach will likely make it more
common for large tables to saturate workers.

But I also do see the merit of this approach when we know we are
in failsafe territory, because I would want my oldest aged tables to be
a/v'd first.

--
Sami Imseih
Amazon Web Services (AWS)

* Re: another autovacuum scheduling thread
@ 2025-10-08 17:20  Álvaro Herrera <[email protected]>
  parent: Sami Imseih <[email protected]>
  2 siblings, 0 replies; 143+ messages in thread

From: Álvaro Herrera @ 2025-10-08 17:20 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; pgsql-hackers

On 2025-Oct-08, Sami Imseih wrote:

> One risk I see with this approach is that we will end up autovacuuming
> tables that also take the longest time to complete, which could cause
> smaller, quick-to-process tables to be neglected.

Perhaps we can have autovacuum workers decide on a mode to use at
startup (or launcher decides for them), and use different prioritization
heuristics depending on the mode.  For instance if we're past max freeze
age for any tables then we know we have to first vacuum tables with
higher MXID ages regardless of size considerations, but if there's at
least one worker in that mode then we use the mode where smaller
high-churn tables go first.
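
One possible reading of the mode idea above, as a hedged Python sketch. The mode names, the `Table` fields, and the "dead tuples per page" churn metric are assumptions made for illustration; only the 200 million default for autovacuum_freeze_max_age is an actual PostgreSQL default:

```python
from dataclasses import dataclass

FREEZE_MAX_AGE = 200_000_000  # default autovacuum_freeze_max_age


@dataclass
class Table:
    name: str
    age: int          # max of XID age and MXID age
    size_pages: int
    dead_tuples: int


def choose_mode(tables, workers_in_wraparound_mode):
    """Pick a mode at worker startup (mode names are hypothetical)."""
    past_limit = any(t.age > FREEZE_MAX_AGE for t in tables)
    if past_limit and workers_in_wraparound_mode == 0:
        return "wraparound"
    return "churn"


def order_tables(tables, mode):
    if mode == "wraparound":
        # Oldest first, regardless of size.
        return sorted(tables, key=lambda t: t.age, reverse=True)
    # Assumed churn metric: dead tuples per page, so small busy tables lead.
    return sorted(
        tables,
        key=lambda t: t.dead_tuples / max(t.size_pages, 1),
        reverse=True,
    )


tables = [
    Table("big_old", age=250_000_000, size_pages=1_000_000, dead_tuples=10_000),
    Table("small_busy", age=1_000_000, size_pages=100, dead_tuples=5_000),
]
first_worker = order_tables(tables, choose_mode(tables, 0))
second_worker = order_tables(tables, choose_mode(tables, 1))
```

Here the first worker starts with big_old and the second, seeing a worker already in wraparound mode, starts with small_busy.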

-- 
Álvaro Herrera        Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"No nos atrevemos a muchas cosas porque son difíciles,
pero son difíciles porque no nos atrevemos a hacerlas" (Séneca)

* Re: another autovacuum scheduling thread
@ 2025-10-08 17:37  Andres Freund <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Andres Freund @ 2025-10-08 17:37 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: pgsql-hackers

Hi,

On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
> However, we do no such prioritization of the tables within a database.  In
> fact, the ordering of the tables is effectively random.

We don't prioritize tables, but I don't think the order really is random?
Isn't it basically in the order in which the data is in pg_class? That
typically won't change from one autovacuum pass to the next...


> * Prioritizing tables based on their (M)XID age might help avoid more
> aggressive vacuums, not to mention wraparound.  Of course, there are
> scenarios where this doesn't work.  For example, the age of a table may
> have changed greatly between the time we recorded it and the time we
> process it.

> Or maybe there is another table in a different database that
> is more important from a wraparound perspective.

That seems like something no ordering within a single AV worker can address. I
think it's fine to just define that to be out of scope.


> We could complicate the patch to try to handle some of these things, but I
> maintain that even some basic, incremental scheduling improvements would be
> better than the status quo.  And we can always change it further in the
> future to handle these problems and to consider other things like bloat.

Agreed!  It doesn't take much to be better at scheduling than "order in
pg_class".


> The attached patch works by storing the maximum of the XID age and the MXID
> age in the list with the OIDs and sorting it prior to processing.

I think it may be worth trying to avoid reliably using the same order -
otherwise e.g. a corrupt index on the first scheduled table can cause
autovacuum to reliably fail on the same relation, never allowing it to
progress past that point.

Greetings,

Andres Freund

* Re: another autovacuum scheduling thread
@ 2025-10-08 17:47  Sami Imseih <[email protected]>
  parent: Sami Imseih <[email protected]>
  2 siblings, 0 replies; 143+ messages in thread

From: Sami Imseih @ 2025-10-08 17:47 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: pgsql-hackers

> I'm not saying that the current approach, which as you mention is
> random, is any better; however, this approach will likely make it more
> common for large tables to saturate workers.

Maybe it would be good to allocate some workers to the oldest tables
and the remaining workers to some random list? This could balance
things out between the oldest (large) tables and everything else to
avoid this problem.

--
Sami Imseih
Amazon Web Services (AWS)

* Re: another autovacuum scheduling thread
@ 2025-10-08 23:40  Jeremy Schneider <[email protected]>
  parent: Sami Imseih <[email protected]>
  2 siblings, 1 reply; 143+ messages in thread

From: Jeremy Schneider @ 2025-10-08 23:40 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; pgsql-hackers

On Wed, 8 Oct 2025 12:06:29 -0500
Sami Imseih <[email protected]> wrote:
> 
> One risk I see with this approach is that we will end up autovacuuming
> tables that also take the longest time to complete, which could cause
> smaller, quick-to-process tables to be neglected.
> 
> It’s not always the case that the oldest tables in terms of (M)XID age
> are also the most expensive to vacuum, but that is often more true
> than not.

I think an approach of doing largest objects first actually might work
really well for balancing work amongst autovacuum workers. Many years
ago I designed a system to back up many databases with a pool of workers
and used this same simple & naive algorithm of just reverse sorting on
db size, and it worked remarkably well. If you have one big thing then
you probably want someone to get started on that first. As long as
there's a pool of workers available, as you work through the queue, you
can actually end up with pretty optimal use of all the workers.
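
A small simulation of the "largest first" idea, as a sketch only. It models a fixed batch of tasks with no new work arriving, and the durations are hypothetical. Each task is handed to whichever worker frees up first:

```python
import heapq


def simulate_pool(durations, n_workers):
    """Assign tasks in the given order to the next free worker.

    Returns the makespan, i.e. the time at which the last worker finishes.
    """
    free_at = [0] * n_workers
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)       # earliest-available worker
        heapq.heappush(free_at, start + d)
    return max(free_at)


durations = [2, 3, 4, 1, 10]  # hypothetical per-table processing times
arbitrary_order = simulate_pool(durations, 2)
largest_first = simulate_pool(sorted(durations, reverse=True), 2)
```

With these numbers the arbitrary order finishes at time 14, because the long task starts last, while largest-first finishes at time 10, which is the best possible for 20 units of work on two workers.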

-Jeremy

* Re: another autovacuum scheduling thread
@ 2025-10-08 23:59  David Rowley <[email protected]>
  parent: Jeremy Schneider <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-10-08 23:59 UTC (permalink / raw)
  To: Jeremy Schneider <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; pgsql-hackers

On Thu, 9 Oct 2025 at 12:41, Jeremy Schneider <[email protected]> wrote:
> I think an approach of doing largest objects first actually might work
> really well for balancing work amongst autovacuum workers. Many years
> ago I designed a system to back up many databases with a pool of workers
> and used this same simple & naive algorithm of just reverse sorting on
> db size, and it worked remarkably well. If you have one big thing then
> you probably want someone to get started on that first. As long as
> there's a pool of workers available, as you work through the queue, you
> can actually end up with pretty optimal use of all the workers.

I believe that methodology for processing work applies much better
in scenarios where there's no new work continually arriving and
there are no adverse effects from giving a lower priority to certain
portions of the work. I don't think you can apply that so easily to
autovacuum as there are scenarios where the work can pile up faster
than it can be handled.  Also, smaller tables can bloat in terms of
growth proportional to the original table size much more quickly than
larger tables and that could have huge consequences for queries to
small tables which are not indexed sufficiently to handle
becoming bloated and large.
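
A quick arithmetic illustration of the proportional-bloat point, using hypothetical row counts. The same number of dead rows is a rounding error for a large table but multiplies the size of a small one:

```python
def bloat_ratio(live_rows, dead_rows):
    """Rows a sequential scan must visit, relative to the live rows alone."""
    return (live_rows + dead_rows) / live_rows


dead = 50_000                                # same backlog on both tables
small_table = bloat_ratio(1_000, dead)       # 51x the live data
large_table = bloat_ratio(10_000_000, dead)  # about 1.005x the live data
```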

David

* Re: another autovacuum scheduling thread
@ 2025-10-09 00:27  Jeremy Schneider <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Jeremy Schneider @ 2025-10-09 00:27 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; pgsql-hackers

On Thu, 9 Oct 2025 12:59:23 +1300
David Rowley <[email protected]> wrote:

> I believe that methodology for processing work applies much better
> in scenarios where there's no new work continually arriving and
> there are no adverse effects from giving a lower priority to certain
> portions of the work. I don't think you can apply that so easily to
> autovacuum as there are scenarios where the work can pile up faster
> than it can be handled.  Also, smaller tables can bloat in terms of
> growth proportional to the original table size much more quickly than
> larger tables and that could have huge consequences for queries to
> small tables which are not indexed sufficiently to handle
> becoming bloated and large.

I'm arguing that it works well with autovacuum. Not saying there aren't
going to be certain workloads that it's suboptimal for. We're talking
about sorting by (M)XID age. As the clock continues to move forward any
table that doesn't get processed naturally moves up the queue for the
next autovac run. I think the concerns are minimal here and this would
be a good change in general.

-Jeremy


-- 
To know the thoughts and deeds that have marked man's progress is to
feel the great heart throbs of humanity through the centuries; and if
one does not feel in these pulsations a heavenward striving, one must
indeed be deaf to the harmonies of life.

Helen Keller, The Story Of My Life, 1902, 1903, 1905, introduction by
Ralph Barton Perry (Garden City, NY: Doubleday & Company, 1954), p90.

* Re: another autovacuum scheduling thread
@ 2025-10-09 00:30  Jeremy Schneider <[email protected]>
  parent: Jeremy Schneider <[email protected]>
  1 sibling, 0 replies; 143+ messages in thread

From: Jeremy Schneider @ 2025-10-09 00:30 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; pgsql-hackers

On Wed, 8 Oct 2025 17:27:27 -0700
Jeremy Schneider <[email protected]> wrote:

> On Thu, 9 Oct 2025 12:59:23 +1300
> David Rowley <[email protected]> wrote:
> 
> > I believe that methodology for processing work applies much
> > better in scenarios where there's no new work continually arriving
> > and there are no adverse effects from giving a lower priority to
> > certain portions of the work. I don't think you can apply that so
> > easily to autovacuum as there are scenarios where the work can pile
> > up faster than it can be handled.  Also, smaller tables can bloat
> > in terms of growth proportional to the original table size much
> > more quickly than larger tables and that could have huge
> > consequences for queries to small tables which are not indexed
> > sufficiently to handle becoming bloated and large.
> 
> I'm arguing that it works well with autovacuum. Not saying there
> aren't going to be certain workloads that it's suboptimal for. We're
> talking about sorting by (M)XID age. As the clock continues to move
> forward any table that doesn't get processed naturally moves up the
> queue for the next autovac run. I think the concerns are minimal here
> and this would be a good change in general.

Hmm, it doesn't work quite like that if the full queue needs to be
processed before the next iteration, but at steady state these small
tables are going to get processed at the same rate whether they were
at the top or bottom of the queue, right?

And in non-steady-state conditions, this seems like a better order than
pg_class ordering?

-Jeremy

* Re: another autovacuum scheduling thread
@ 2025-10-09 01:03  David Rowley <[email protected]>
  parent: Jeremy Schneider <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-10-09 01:03 UTC (permalink / raw)
  To: Jeremy Schneider <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; pgsql-hackers

On Thu, 9 Oct 2025 at 13:27, Jeremy Schneider <[email protected]> wrote:
> I'm arguing that it works well with autovacuum. Not saying there aren't
> going to be certain workloads that it's suboptimal for. We're talking
> about sorting by (M)XID age. As the clock continues to move forward any
> table that doesn't get processed naturally moves up the queue for the
> next autovac run. I think the concerns are minimal here and this would
> be a good change in general.

I thought that if we're to have a priority queue, it would be hard to
argue against sorting by how far over the given autovacuum threshold
the table is.  Assume that a table that just meets the dead rows
required to trigger autovacuum based on the
autovacuum_vacuum_scale_factor setting gets a priority of 1.0, and
another table that has n_mod_since_analyze at twice the
autovacuum_analyze_scale_factor trigger point gets a priority of 2.0.
Effectively, prioritise by the percentage over the given threshold the
table is.  That way users could still tune things when they weren't
happy with the priority given to a table by adjusting the corresponding
reloption.

It just seems strange to me to only account for 1 of the 4 trigger
points for autovacuum when it's possible to account for all 4 without
much extra trouble.

David

* Re: another autovacuum scheduling thread
@ 2025-10-09 01:25  Jeremy Schneider <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Jeremy Schneider @ 2025-10-09 01:25 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; pgsql-hackers

On Thu, 9 Oct 2025 14:03:34 +1300
David Rowley <[email protected]> wrote:

> I thought that if we're to have a priority queue, it would be hard to
> argue against sorting by how far over the given autovacuum threshold
> the table is.  Assume that a table that just meets the dead rows
> required to trigger autovacuum based on the
> autovacuum_vacuum_scale_factor setting gets a priority of 1.0, and
> another table that has n_mod_since_analyze at twice the
> autovacuum_analyze_scale_factor trigger point gets a priority of 2.0.
> Effectively, prioritise by the percentage over the given threshold the
> table is.  That way users could still tune things when they weren't
> happy with the priority given to a table by adjusting the corresponding
> reloption.

If users are tuning this thing then I feel like we've already lost the
battle :)

On a healthy system, autovac runs continually and hits tables at
regular intervals based on their steady state change rates. We have
existing knobs (for better or worse) that people can use to tell PG to
hit certain tables more frequently, to get rid of sleeps/delays, etc.

With our fleet of PG databases here, my current approach is geared
toward setting log_autovacuum_min_duration to some conservative value
fleet-wide, then monitoring based on the logs for any cases where it
runs longer than a defined threshold. I'm able to catch problems sooner
this way, versus monitoring on xid age alone.

Whenever there are problems with autovacuum, the actual issue is never
going to be resolved by what order autovacuum processes tables. I don't
think we should encourage any tunables here... to me it seems like
putting focus entirely in the wrong place.

-Jeremy

* Re: another autovacuum scheduling thread
@ 2025-10-09 01:47  Jeremy Schneider <[email protected]>
  parent: Jeremy Schneider <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Jeremy Schneider @ 2025-10-09 01:47 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; pgsql-hackers

On Wed, 8 Oct 2025 18:25:20 -0700
Jeremy Schneider <[email protected]> wrote:

> On Thu, 9 Oct 2025 14:03:34 +1300
> David Rowley <[email protected]> wrote:
> 
> > I thought that if we're to have a priority queue, it would be hard
> > to argue against sorting by how far over the given autovacuum
> > threshold the table is.  Assume that a table that just meets the
> > dead rows required to trigger autovacuum based on the
> > autovacuum_vacuum_scale_factor setting gets a priority of 1.0, and
> > another table that has n_mod_since_analyze at twice the
> > autovacuum_analyze_scale_factor trigger point gets a priority of
> > 2.0.  Effectively, prioritise by the percentage over the given
> > threshold the table is.  That way users could still tune things
> > when they weren't happy with the priority given to a table by
> > adjusting the corresponding reloption.
> 
> If users are tuning this thing then I feel like we've already lost the
> battle :)

I replied too quickly. Re-reading your email, I think you're proposing a
different algorithm, taking tuple counts into account. No tunables. Is
there a fully fleshed out version of the proposed alternative algorithm
somewhere? (one of the older threads?) I guess this is why it's so hard
to get anything committed in this area...

-J

* Re: another autovacuum scheduling thread
@ 2025-10-09 03:13  David Rowley <[email protected]>
  parent: Jeremy Schneider <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-10-09 03:13 UTC (permalink / raw)
  To: Jeremy Schneider <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; pgsql-hackers

On Thu, 9 Oct 2025 at 14:47, Jeremy Schneider <[email protected]> wrote:
>
> On Wed, 8 Oct 2025 18:25:20 -0700
> Jeremy Schneider <[email protected]> wrote:
> > If users are tuning this thing then I feel like we've already lost the
> > battle :)
>
> I replied too quickly. Re-reading your email, I think you're proposing a
> different algorithm, taking tuple counts into account. No tunables. Is
> there a fully fleshed out version of the proposed alternative algorithm
> somewhere? (one of the older threads?) I guess this is why it's so hard
> to get anything committed in this area...

It's along the lines of the "1a)" from [1]. I don't think that post
does a great job of explaining it.

I think the best way to understand it is if you look at
relation_needs_vacanalyze() and see how it calculates boolean values
for boolean output params. So, rather than calculating just a boolean
value, it calculates a float4 where < 1.0 means don't do the
operation and anything >= 1.0 means do the operation. For example,
let's say a table has 600 dead rows and the scale factor and threshold
settings mean that autovacuum will trigger at 200 (so the table has 3
times as many dead tuples as the trigger point). That would result in
the value of 3.0 (600 / 200).  The priority for the relfrozenxid
portion is basically age(relfrozenxid) / autovacuum_freeze_max_age
(plus we need to account for mxid by doing the same for that and taking
the maximum of the two values).  The priority for autovacuum would then
be the maximum of those component "scores".

Effectively, it's a method of aligning the different units of measure,
transactions or tuples into a single value which is calculated based
on the very same values that we use today to trigger autovacuums.
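
A hedged Python sketch of the scoring described above. The class and function names are hypothetical, the insert-based vacuum trigger is left out for brevity, and the defaults are PostgreSQL's documented defaults for the corresponding autovacuum settings. The trigger points use the usual "threshold + scale_factor * reltuples" form:

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    reltuples: float
    dead_tuples: int
    mod_since_analyze: int
    xid_age: int
    mxid_age: int


@dataclass
class Settings:
    vacuum_threshold: int = 50
    vacuum_scale_factor: float = 0.2
    analyze_threshold: int = 50
    analyze_scale_factor: float = 0.1
    freeze_max_age: int = 200_000_000
    multixact_freeze_max_age: int = 400_000_000


def component_scores(t, s):
    """Each score is 'how far along to its trigger point'; >= 1.0 means due."""
    vac_trigger = s.vacuum_threshold + s.vacuum_scale_factor * t.reltuples
    anl_trigger = s.analyze_threshold + s.analyze_scale_factor * t.reltuples
    return {
        "vacuum": t.dead_tuples / vac_trigger,
        "analyze": t.mod_since_analyze / anl_trigger,
        "xid": t.xid_age / s.freeze_max_age,
        "mxid": t.mxid_age / s.multixact_freeze_max_age,
    }


def priority(t, s):
    # The table's priority is the maximum of its component scores.
    return max(component_scores(t, s).values())


# 750 tuples with default settings gives a vacuum trigger of 50 + 0.2 * 750 = 200.
example = TableStats(reltuples=750, dead_tuples=600, mod_since_analyze=100,
                     xid_age=100_000_000, mxid_age=0)
scores = component_scores(example, Settings())
example_priority = priority(example, Settings())
```

With 600 dead tuples against a trigger point of 200, the vacuum score is 3.0, matching the example in the text, and it dominates the analyze (0.8) and XID (0.5) scores.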

David

[1] https://postgr.es/m/CAApHDvo8DWyt4CWhF=NPeRstz_78SteEuuNDfYO7cjp=7YTK4g@mail.gmail.com

* Re: another autovacuum scheduling thread
@ 2025-10-09 16:01  Nathan Bossart <[email protected]>
  parent: Andres Freund <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-09 16:01 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers

On Wed, Oct 08, 2025 at 01:37:22PM -0400, Andres Freund wrote:
> On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
>> The attached patch works by storing the maximum of the XID age and the MXID
>> age in the list with the OIDs and sorting it prior to processing.
> 
> I think it may be worth trying to avoid reliably using the same order -
> otherwise e.g. a corrupt index on the first scheduled table can cause
> autovacuum to reliably fail on the same relation, never allowing it to
> progress past that point.

Hm.  What if we kept a short array of "failed" tables in shared memory?
Each worker would consult this table before processing.  If the table is
there, it would remove it from the shared table and skip processing it.
Then the next worker would try processing the table again.
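
A rough sketch of that skip-once behavior, under several assumptions: a plain Python list stands in for the shared-memory array (no locking), the capacity and oldest-first eviction are invented, and the worker itself records the failure when the hypothetical `vacuum` callback raises:

```python
class FailedTableList:
    """Hypothetical stand-in for a short shared array of failed table OIDs."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self._oids = []

    def record_failure(self, oid):
        if oid in self._oids:
            return
        if len(self._oids) >= self.capacity:
            self._oids.pop(0)  # assumed policy: evict the oldest entry
        self._oids.append(oid)

    def should_skip(self, oid):
        """Skip once: remove the entry so the next worker retries the table."""
        if oid in self._oids:
            self._oids.remove(oid)
            return True
        return False


def run_worker(table_oids, failed, vacuum):
    processed = []
    for oid in table_oids:
        if failed.should_skip(oid):
            continue
        try:
            vacuum(oid)
            processed.append(oid)
        except RuntimeError:  # stand-in for a vacuum error
            failed.record_failure(oid)
    return processed


failed = FailedTableList()
attempts = []


def fake_vacuum(oid):
    attempts.append(oid)
    if oid == 16384:
        raise RuntimeError("simulated vacuum failure")


passes = []
for _ in range(3):
    attempts.clear()
    run_worker([16384, 16390], failed, fake_vacuum)
    passes.append(list(attempts))
```

In this run the failing table is attempted on the first pass, skipped on the second, and retried on the third. The sketch only covers failures that surface as an error.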

I also wonder how hard it would be to gracefully catch the error and let
the worker continue with the rest of its list...

-- 
nathan

* Re: another autovacuum scheduling thread
@ 2025-10-09 16:13  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-09 16:13 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Thu, Oct 09, 2025 at 04:13:23PM +1300, David Rowley wrote:
> I think the best way to understand it is if you look at
> relation_needs_vacanalyze() and see how it calculates boolean values
> for boolean output params. So, rather than calculating just a boolean
> value, it calculates a float4 where < 1.0 means don't do the
> operation and anything >= 1.0 means do the operation. For example,
> let's say a table has 600 dead rows and the scale factor and threshold
> settings mean that autovacuum will trigger at 200 (so the table has 3
> times as many dead tuples as the trigger point). That would result in
> the value of 3.0 (600 / 200).  The priority for the relfrozenxid
> portion is basically age(relfrozenxid) / autovacuum_freeze_max_age
> (plus we need to account for mxid by doing the same for that and
> taking the maximum of the two values).  The priority for autovacuum
> would then be the maximum of those component "scores".
> 
> Effectively, it's a method of aligning the different units of measure,
> transactions or tuples into a single value which is calculated based
> on the very same values that we use today to trigger autovacuums.

I like the idea of a "score" approach, but I'm worried that we'll never
come to an agreement on the formula to use.  Perhaps we'd have more luck
getting consensus on a multifaceted strategy if we kept it brutally simple.
IMHO it's worth a try...

-- 
nathan

* Re: another autovacuum scheduling thread
@ 2025-10-09 16:15  Andres Freund <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Andres Freund @ 2025-10-09 16:15 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: pgsql-hackers

Hi,

On 2025-10-09 11:01:16 -0500, Nathan Bossart wrote:
> On Wed, Oct 08, 2025 at 01:37:22PM -0400, Andres Freund wrote:
> > On 2025-10-08 10:18:17 -0500, Nathan Bossart wrote:
> >> The attached patch works by storing the maximum of the XID age and the MXID
> >> age in the list with the OIDs and sorting it prior to processing.
> > 
> > I think it may be worth trying to avoid reliably using the same order -
> > otherwise e.g. a corrupt index on the first scheduled table can cause
> > autovacuum to reliably fail on the same relation, never allowing it to
> > progress past that point.
> 
> Hm.  What if we kept a short array of "failed" tables in shared memory?

I've thought about having that as part of pgstats...


> Each worker would consult this table before processing.  If the table is
> there, it would remove it from the shared table and skip processing it.
> Then the next worker would try processing the table again.
> 
> I also wonder how hard it would be to gracefully catch the error and let
> the worker continue with the rest of its list...

The main set of cases I've seen are when workers get hung up permanently in
corrupt indexes. There never is actually an error, the autovacuums just get
terminated as part of whatever independent reason there is to restart. The
problem with that is that you'll never actually have vacuum fail...

Greetings,

Andres Freund

* Re: another autovacuum scheduling thread
@ 2025-10-09 16:33  Nathan Bossart <[email protected]>
  parent: Andres Freund <[email protected]>
  1 sibling, 0 replies; 143+ messages in thread

From: Nathan Bossart @ 2025-10-09 16:33 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers

On Thu, Oct 09, 2025 at 12:15:31PM -0400, Andres Freund wrote:
> On 2025-10-09 11:01:16 -0500, Nathan Bossart wrote:
>> I also wonder how hard it would be to gracefully catch the error and let
>> the worker continue with the rest of its list...
> 
> The main set of cases I've seen are when workers get hung up permanently in
> corrupt indexes. There never is actually an error, the autovacuums just get
> terminated as part of whatever independent reason there is to restart. The
> problem with that is that you'll never actually have vacuum fail...

Ah.  Wouldn't the other workers skip that table in that scenario?  I'm not
following the great advantage of varying the order in this case.  I suppose
the full set of workers might be able to process more tables before one
inevitably gets stuck.  Is that it?

-- 
nathan

* Re: another autovacuum scheduling thread
@ 2025-10-09 19:45  Peter Geoghegan <[email protected]>
  parent: Andres Freund <[email protected]>
  1 sibling, 0 replies; 143+ messages in thread

From: Peter Geoghegan @ 2025-10-09 19:45 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Nathan Bossart <[email protected]>; pgsql-hackers

On Thu, Oct 9, 2025 at 12:15 PM Andres Freund <[email protected]> wrote:
> > Each worker would consult this table before processing.  If the table is
> > there, it would remove it from the shared table and skip processing it.
> > Then the next worker would try processing the table again.
> >
> > I also wonder how hard it would be to gracefully catch the error and let
> > the worker continue with the rest of its list...
>
> The main set of cases I've seen are when workers get hung up permanently in
> corrupt indexes.

How recently was this? I'm aware of problems like that that we
discussed around 2018, but they were greatly mitigated.
First by your commit 3a01f68e, then by my commit c34787f9.

In general, there's no particularly good reason why (at least with
nbtree indexes) VACUUM should ever hang forever. The access pattern is
overwhelmingly simple, sequential access. The only exception is nbtree
page deletion (plus backtracking), where it isn't particularly hard to
just be very careful about self-deadlock.

> There never is actually an error, the autovacuums just get
> terminated as part of whatever independent reason there is to restart.

What do you mean?

In general I'd expect nbtree VACUUM of a corrupt index to either not
fail at all (we'll soldier on to the best of our ability when page
deletion encounters an inconsistency), or to get permanently stuck due
to locking the same page twice/self-deadlock (though as I said, those
problems were mitigated, and might even be almost impossible these
days). Every other case involves some kind of error (e.g., an OOM is
just about possible).

I agree with you about using a perfectly deterministic order coming
with real downsides, without any upside. Don't interpret what I've
said as expressing opposition to that idea.


--
Peter Geoghegan






* Re: another autovacuum scheduling thread
@ 2025-10-10 17:31  Nathan Bossart <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-10 17:31 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Thu, Oct 09, 2025 at 11:13:48AM -0500, Nathan Bossart wrote:
> On Thu, Oct 09, 2025 at 04:13:23PM +1300, David Rowley wrote:
>> I think the best way to understand it is if you look at
>> relation_needs_vacanalyze() and see how it calculates boolean values
>> for boolean output params. So, instead of calculating just a boolean
>> value it instead calculates a float4 where < 1.0 means don't do the
>> operation and anything >= 1.0 means do the operation. For example,
>> let's say a table has 600 dead rows and the scale factor and threshold
>> settings mean that autovacuum will trigger at 200 (3 times more dead
>> tuples than the trigger point). That would result in the value of 3.0
>> (600 / 200).  The priority for relfrozenxid portion is basically
>> age(relfrozenxid) / autovacuum_freeze_max_age (plus need to account
>> for mxid by doing the same for that and taking the maximum of each
>> value).  For each of those component "scores", the priority for
>> autovacuum would be the maximum of each of those.
>> 
>> Effectively, it's a method of aligning the different units of measure,
>> transactions or tuples into a single value which is calculated based
>> on the very same values that we use today to trigger autovacuums.
> 
> I like the idea of a "score" approach, but I'm worried that we'll never
> come to an agreement on the formula to use.  Perhaps we'd have more luck
> getting consensus on a multifaceted strategy if we kept it brutally simple.
> IMHO it's worth a try...

Here's a prototype of a "score" approach.  Two notes:

* I've given special priority to anti-wraparound vacuums.  I think this is
important to avoid focusing too much on bloat when wraparound is imminent.
In any case, we need a separate wraparound score in case autovacuum is
disabled.

* I didn't include the analyze threshold in the score because it doesn't
apply to TOAST tables, and therefore would artificially lower their
priority.  Perhaps there is another way to deal with this.

This is very much just a prototype of the basic idea.  As-is, I think it'll
favor processing tables with lots of bloat unless we're in an
anti-wraparound scenario.  Maybe that's okay.  I'm not sure how scientific
we want to be about all of this, but I do intend to try some long-running
tests.

-- 
nathan



* Re: another autovacuum scheduling thread
@ 2025-10-10 18:42  Robert Haas <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Robert Haas @ 2025-10-10 18:42 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Fri, Oct 10, 2025 at 1:31 PM Nathan Bossart <[email protected]> wrote:
> Here's a prototype of a "score" approach.  Two notes:
>
> * I've given special priority to anti-wraparound vacuums.  I think this is
> important to avoid focusing too much on bloat when wraparound is imminent.
> In any case, we need a separate wraparound score in case autovacuum is
> disabled.
>
> * I didn't include the analyze threshold in the score because it doesn't
> apply to TOAST tables, and therefore would artificially lower their
> priority.  Perhaps there is another way to deal with this.
>
> This is very much just a prototype of the basic idea.  As-is, I think it'll
> favor processing tables with lots of bloat unless we're in an
> anti-wraparound scenario.  Maybe that's okay.  I'm not sure how scientific
> we want to be about all of this, but I do intend to try some long-running
> tests.

I think this is a reasonable starting point, although I'm surprised
that you chose to combine the sub-scores using + rather than Max.

I think it will take a lot of experimentation to figure out whether
this particular algorithm (or any other) works well in practice. My
intuition (for whatever that is worth to you, which may not be much)
is that what will anger users is cases when we ignore a horrible
problem to deal with a routine problem. Figuring out how to design the
scoring system to avoid such outcomes is the hard part of this
problem, IMHO. For this particular algorithm, the main hazards that
spring to mind for me are:

- The wraparound score can't be more than about 10, but the bloat
score could be arbitrarily large, especially for tables with few
tuples, so there may be lots of cases in which the wraparound score
has no impact on the behavior.

- The patch attempts to guard against this by disregarding the
non-wraparound portion of the score once the wraparound portion
reaches 1.0, but that results in an abrupt behavior shift at that
point. Suddenly we go from mostly ignoring the wraparound score to
entirely ignoring the bloat score. This might result in the system
abruptly ignoring tables that are bloating extremely rapidly in favor
of trying to catch up in a wraparound situation that is not yet
terribly urgent.

When I've thought about this problem -- and I can't claim to have
thought about it very hard -- it's seemed to me that we need to (1)
somehow normalize everything to somewhat similar units and (2) make
sure that severe wraparound danger always wins over every other
consideration, but mild wraparound danger can lose to severe bloat.

-- 
Robert Haas
EDB: http://www.enterprisedb.com






* Re: another autovacuum scheduling thread
@ 2025-10-10 19:44  Nathan Bossart <[email protected]>
  parent: Robert Haas <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-10 19:44 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: David Rowley <[email protected]>; Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

Thanks for taking a look.

On Fri, Oct 10, 2025 at 02:42:57PM -0400, Robert Haas wrote:
> I think this is a reasonable starting point, although I'm surprised
> that you chose to combine the sub-scores using + rather than Max.

My thinking was that we should consider as many factors as we can in the
score, not just the worst one.  If a table has medium bloat and medium
wraparound risk, should it always be lower in priority to something with
large bloat and small wraparound risk?  It seems worth exploring.  I am
curious why you first thought of Max.

> When I've thought about this problem -- and I can't claim to have
> thought about it very hard -- it's seemed to me that we need to (1)
> somehow normalize everything to somewhat similar units and (2) make
> sure that severe wraparound danger always wins over every other
> consideration, but mild wraparound danger can lose to severe bloat.

Agreed.  I need to think about this some more.  While I'm optimistic that
we could come up with some sort of normalization framework, I desperately
want to avoid super complicated formulas and GUCs, as those seem like
sure-fire ways of ensuring nothing ever gets committed.

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-10-10 20:24  Robert Haas <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Robert Haas @ 2025-10-10 20:24 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Fri, Oct 10, 2025 at 3:44 PM Nathan Bossart <[email protected]> wrote:
> On Fri, Oct 10, 2025 at 02:42:57PM -0400, Robert Haas wrote:
> > I think this is a reasonable starting point, although I'm surprised
> > that you chose to combine the sub-scores using + rather than Max.
>
> My thinking was that we should consider as many factors as we can in the
> score, not just the worst one.  If a table has medium bloat and medium
> wraparound risk, should it always be lower in priority to something with
> large bloat and small wraparound risk?  It seems worth exploring.  I am
> curious why you first thought of Max.

The right answer depends a good bit on how exactly you do the scoring,
but it seems to me that it would be easy to overweight secondary
problems. Consider a table with an XID age of 900m and an MXID age of
900m and another table with an XID age of 1.8b. I think it is VERY
clear that the second one is MUCH worse; but just adding things up
will make them seem equal.
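
To make that concrete, here is a rough Python sketch of the two
tables.  It assumes, purely for the sake of the arithmetic, that XID and
MXID ages are both normalized by the same hypothetical limit of 200
million; the names are made up and nothing here is taken from the patch.

```python
# Hypothetical normalization limit, applied to both XID and MXID ages.
LIMIT = 200_000_000


def sub_scores(xid_age, mxid_age, limit=LIMIT):
    """Return the (xid, mxid) sub-scores as plain age / limit ratios."""
    return (xid_age / limit, mxid_age / limit)


def combine_sum(scores):
    return sum(scores)


def combine_max(scores):
    return max(scores)


table_1 = sub_scores(900_000_000, 900_000_000)  # two medium problems
table_2 = sub_scores(1_800_000_000, 0)          # one severe problem

print("sum:", combine_sum(table_1), combine_sum(table_2))
print("max:", combine_max(table_1), combine_max(table_2))
```

Under that assumption the sums tie, while Max ranks the 1.8b table
clearly ahead of the other one.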

> Agreed.  I need to think about this some more.  While I'm optimistic that
> we could come up with some sort of normalization framework, I desperately
> want to avoid super complicated formulas and GUCs, as those seem like
> sure-fire ways of ensuring nothing ever gets committed.

IMHO, the trick here is to come up with something that's neither too
simple nor too complicated. If it's too simple, we'll easily come up
with cases where it sucks, and possibly where it's worse than what we
do now (an impressive achievement, to be sure). If it's too
complicated, it will be full of arbitrary things that will provoke
dissent and probably not work out well in practice. I don't think we
need something dramatically awesome to make a change to the status
quo, but if it's extremely easy to think up simple scenarios in which
a given idea will fail spectacularly, I'd be inclined to suspect that
there will be a lot of real-world spectacular failures.

-- 
Robert Haas
EDB: http://www.enterprisedb.com






* Re: another autovacuum scheduling thread
@ 2025-10-10 21:59  Jeremy Schneider <[email protected]>
  parent: Robert Haas <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Jeremy Schneider @ 2025-10-10 21:59 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Nathan Bossart <[email protected]>; David Rowley <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Fri, 10 Oct 2025 16:24:51 -0400
Robert Haas <[email protected]> wrote:

> I don't think we
> need something dramatically awesome to make a change to the status
> quo, but if it's extremely easy to think up simple scenarios in which
> a given idea will fail spectacularly, I'd be inclined to suspect that
> there will be a lot of real-world spectacular failures.

What does a real-world spectacular failure look like?

"If those 3 autovac workers had processed tables in a different order
everything would have been peachy"

But if autovac is going to get jammed up long enough to wraparound the
system, does it matter whether or not it did a one-time processing of a
bunch of small tables before it got jammed?

One particular table always scoring high shouldn't block autovac from
other tables, because it doesn't start a new iteration until it goes all
the way through the list from its current iteration, right? And one
iteration of autovac needs to process everything in the list... so it
should take the same overall time regardless of order?

The spectacular failures I've seen with autovac usually come down to
things like too much sleeping (cost_delay) or too few workers, where
better ordering would be nice but probably wouldn't fix any real
problems leading to the spectacular failures.

From Robert's 2024 pgConf.dev talk:
1. slow - forward progress not fast enough
2. stuck - no forward progress
3. spinning - not accomplishing anything
4. skipped - thinks not needed
5. starvation - can't keep up

I don't think any of these are really addressed by simply changing
table order.

From Robert's 2022 email to hackers:
> A few people have proposed scoring systems, which I think is closer
> to the right idea, because our basic goal is to start vacuuming any
> given table soon enough that we finish vacuuming it before some
> catastrophe strikes.
...
> If table A will cause wraparound in 2 hours and take 2 hours to
> vacuum, and table B will cause wraparound in 1 hour and take 10
> minutes to vacuum, table A is more urgent even though the catastrophe
> is further out.
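
One way to read that quoted example is as a ranking by "slack", i.e. time
until the catastrophe minus time needed to vacuum.  A minimal Python
sketch, assuming we somehow had both estimates per table (the class and
field names here are hypothetical, not anything autovacuum tracks today):

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    minutes_to_wraparound: int  # hypothetical estimate
    minutes_to_vacuum: int      # hypothetical estimate

    @property
    def slack(self) -> int:
        # Time left over if we started vacuuming this table right now.
        return self.minutes_to_wraparound - self.minutes_to_vacuum


tables = [
    Table("A", minutes_to_wraparound=120, minutes_to_vacuum=120),
    Table("B", minutes_to_wraparound=60, minutes_to_vacuum=10),
]

# Least slack first: the table we can least afford to postpone.
by_urgency = sorted(tables, key=lambda t: t.slack)
for t in by_urgency:
    print(t.name, t.slack)
```

Table A has no slack at all and sorts first, even though table B hits
wraparound sooner.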

Robert, it sounds to me like the main use case you're focused on here
is where basically wraparound is imminent - we are already screwed - and
our very last hope was that a last-ditch autovac can finish just in time.

Failsafe and dynamic cost updates were huge advancements. Do we allow
dynamic adjustment to worker count yet?

I hope y'all just pick something and commit it without getting too lost
in the details. I honestly think in the list of improvements around
autovac, this is the lowest priority on my list of hopes and dreams as a
user for wraparound prevention :) because if this ever matters to me for
avoiding wraparound, I was screwed long before we got to this point and
this is not going to fix my underlying problems.

-Jeremy






* Re: another autovacuum scheduling thread
@ 2025-10-12 06:27  David Rowley <[email protected]>
  parent: Robert Haas <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-10-12 06:27 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Sat, 11 Oct 2025 at 07:43, Robert Haas <[email protected]> wrote:
> I think this is a reasonable starting point, although I'm surprised
> that you chose to combine the sub-scores using + rather than Max.

Adding up the component scores doesn't make sense to me either. That
means you could have 0.5 for inserted tuples, 0.5 for dead tuples and,
say, 0.1 for analyze threshold, which all add up to 1.1, but no
component score is high enough for auto-vacuum to have to do anything
yet. With Max(), we'd clearly see that there's nothing to do since the
overall score isn't >= 1.0.
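
A tiny Python sketch of that case, using the three example numbers
above and treating "score >= 1.0" as "there is work to do" (the
dictionary keys and helper name are just for illustration):

```python
def needs_work(score):
    """A score of 1.0 or more means some trigger threshold was reached."""
    return score >= 1.0


components = {
    "inserted tuples": 0.5,
    "dead tuples": 0.5,
    "analyze threshold": 0.1,
}

summed = sum(components.values())
largest = max(components.values())

print("sum:", summed, "needs work?", needs_work(summed))
print("max:", largest, "needs work?", needs_work(largest))
```

The summed score crosses 1.0 even though no individual threshold has
been reached; the Max() score stays at 0.5.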

> - The wraparound score can't be more than about 10, but the bloat
> score could be arbitrarily large, especially for tables with few
> tuples, so there may be lots of cases in which the wraparound score
> has no impact on the behavior.

That's a good point. I think we definitely do want to make it so
tables in near danger of causing the database to stop accepting
transactions are dealt with ASAP.

Maybe the score calculation could change when the relevant age() goes
above vacuum_failsafe_age / vacuum_multixact_failsafe_age and start
scaling it very aggressively beyond that. There's plenty to debate,
but at a first cut, maybe something like the following (coded in SQL
for ease of result viewing):

select xidage as "age(relfrozenxid)",case xidage::float8 <
current_setting('vacuum_failsafe_age')::float8 when true then xidage /
current_setting('autovacuum_freeze_max_age')::float8 else power(xidage
/ current_setting('autovacuum_freeze_max_age')::float8,xidage::float8
/ 100_000_000) end xid_age_score from
generate_series(0,2_000_000_000,100_000_000) xidage;

which gives 1e+20 for age of 2 billion. It would take quite an
unreasonable amount of bloat to score higher than that.
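
For anyone who would rather poke at this outside of psql, here is a
rough Python transcription of the query above.  It hard-codes the stock
defaults (autovacuum_freeze_max_age = 200 million, vacuum_failsafe_age =
1.6 billion) instead of reading current_setting(), and the function name
is made up for illustration.

```python
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000  # stock default
VACUUM_FAILSAFE_AGE = 1_600_000_000      # stock default


def xid_age_score(xid_age,
                  freeze_max_age=AUTOVACUUM_FREEZE_MAX_AGE,
                  failsafe_age=VACUUM_FAILSAFE_AGE):
    """Linear below the failsafe age, steeply exponential at and above it."""
    ratio = xid_age / freeze_max_age
    if xid_age < failsafe_age:
        return ratio
    # Same exponent as the SQL: the age in units of 100 million.
    return ratio ** (xid_age / 100_000_000)


for age in range(0, 2_000_000_001, 100_000_000):
    print(f"{age:>13,d}  {xid_age_score(age):.6g}")
```

The jump at the failsafe age is large: with these defaults the score
goes from 7.5 at 1.5 billion to 8^16 (about 2.8e14) at 1.6 billion.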

I guess someone might argue that we should start taking it more
seriously before the table's relfrozenxid age gets to
vacuum_failsafe_age. Maybe that's true. I just don't know what. In any
case, if a table's age gets that old, then something's probably not
configured very well and needs attention. I did think maybe we could
keep the addressing of auto-vacuum being configured to run too slowly
as a separate thread.

David






* Re: another autovacuum scheduling thread
@ 2025-10-13 12:32  Robert Haas <[email protected]>
  parent: Jeremy Schneider <[email protected]>
  0 siblings, 0 replies; 143+ messages in thread

From: Robert Haas @ 2025-10-13 12:32 UTC (permalink / raw)
  To: Jeremy Schneider <[email protected]>; +Cc: Nathan Bossart <[email protected]>; David Rowley <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Fri, Oct 10, 2025 at 6:00 PM Jeremy Schneider
<[email protected]> wrote:
> The spectacular failures I've seen with autovac usually come down to
> things like too much sleeping (cost_delay) or too few workers, where
> better ordering would be nice but probably wouldn't fix any real
> problems leading to the spectacular failures.

Since I have said the same thing myself, I can hardly disagree.
However, there are probably a few exceptions. For instance, if
autovacuum on a certain table is failing repeatedly or accomplishing
nothing without removing the apparent need to autovacuum, and happens
to be the first one in pg_class, it could divert a lot of attention
from other tables.

> Robert, it sounds to me like the main use case you're focused on here
> is where basically wraparound is imminent - we are already screwed - and
> our very last hope was that a last-ditch autovac can finish just in time.

Yes, I would argue that this is the scenario that really matters. As
you say above, the main thing is having little enough sleeping and a
sufficient number of workers. When that's the case, we can do the work
in any order and life will mostly be fine. However, if we get into a
desperate situation by, say, having one table that can't be vacuumed,
and eventually someone fixes that, say by dropping the corrupt index
that is preventing vacuuming of that table, we might like it if
autovacuum focused on getting that table vacuumed rather than getting
lost in the sauce. Of course, if we have the pretty common situation
where autovacuum gets behind on all tables, say due to a stale
replication slot, then this is less critical, although a perfect
system would probably prioritize vacuuming the *largest* tables in
this situation, since those will take the longest to finish, and it's
when a vacuum of every table in the cluster has been *completed* that
the XID horizons can advance.

> I hope y'all just pick something and commit it without getting too lost
> in the details. I honestly think in the list of improvements around
> autovac, this is the lowest priority on my list of hopes and dreams as a
> user for wraparound prevention :) because if this ever matters to me for
> avoiding wraparound, I was screwed long before we got to this point and
> this is not going to fix my underlying problems.

I'm not sure if this was your intention, but to me this kind of reads
like "well, it's not going to matter anyway so just do whatever and
move on" and I don't agree with that. I think that if we're not going
to do high-quality engineering here, we just shouldn't change anything
at all. It's better to keep having the same bad behavior than for each
release to have new and different bad behavior. One possible positive
result of leaning into this prioritization problem is that whoever's
working on it (Nathan, in this case) might gain some useful insights
about how to tackle some of the other problems in this space. All of
this is hard enough that we haven't really had any major improvements
in this area since, I want to say, 8.3, and it's desirable to break
that logjam even if we don't all agree on which problems are most
urgent. Even if I ultimately don't agree with whatever Nathan wants to
do or proposes, I'm glad he's trying to do something, which is (in my
experience) generally much better than making no effort at all.

-- 
Robert Haas
EDB: http://www.enterprisedb.com






* Re: another autovacuum scheduling thread
@ 2025-10-21 14:38  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-21 14:38 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Sun, Oct 12, 2025 at 07:27:10PM +1300, David Rowley wrote:
> On Sat, 11 Oct 2025 at 07:43, Robert Haas <[email protected]> wrote:
>> I think this is a reasonable starting point, although I'm surprised
>> that you chose to combine the sub-scores using + rather than Max.
> 
> Adding up the component scores doesn't make sense to me either. That
> means you could have 0.5 for inserted tuples, 0.5 for dead tuples and,
> say, 0.1 for analyze threshold, which all add up to 1.1, but no
> component score is high enough for auto-vacuum to have to do anything
> yet. With Max(), we'd clearly see that there's nothing to do since the
> overall score isn't >= 1.0.

In v3, I switched to Max().

> Maybe the score calculation could change when the relevant age() goes
> above vacuum_failsafe_age / vacuum_multixact_failsafe_age and start
> scaling it very aggressively beyond that. There's plenty to debate,
> but at a first cut, maybe something like the following (coded in SQL
> for ease of result viewing):
> 
> select xidage as "age(relfrozenxid)",case xidage::float8 <
> current_setting('vacuum_failsafe_age')::float8 when true then xidage /
> current_setting('autovacuum_freeze_max_age')::float8 else power(xidage
> / current_setting('autovacuum_freeze_max_age')::float8,xidage::float8
> / 100_000_000) end xid_age_score from
> generate_series(0,2_000_000_000,100_000_000) xidage;
> 
> which gives 1e+20 for age of 2 billion. It would take quite an
> unreasonable amount of bloat to score higher than that.
> 
> I guess someone might argue that we should start taking it more
> seriously before the table's relfrozenxid age gets to
> vacuum_failsafe_age. Maybe that's true. I just don't know what. In any
> case, if a table's age gets that old, then something's probably not
> configured very well and needs attention. I did think maybe we could
> keep the addressing of auto-vacuum being configured to run too slowly
> as a separate thread.

I did something similar to this in v3, although I used the *_freeze_max_age
parameters as the point to start scaling aggressively, and I simply raised
the score to the power of 10.

I've yet to do any real testing with this stuff.

-- 
nathan



* Re: another autovacuum scheduling thread
@ 2025-10-21 20:07  David Rowley <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-10-21 20:07 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Wed, 22 Oct 2025 at 03:38, Nathan Bossart <[email protected]> wrote:
> I did something similar to this in v3, although I used the *_freeze_max_age
> parameters as the point to start scaling aggressively, and I simply raised
> the score to the power of 10.
>
> I've yet to do any real testing with this stuff.

I've not tested it or compiled it, but the patch looks good. I did
think that the freeze vacuum isn't that big a deal if it's just over
the *freeze_max_age and thought it should become aggressive very
quickly at the failsafe age, but that leaves a much smaller window of
time to do the freezing if autovacuum has been busy with other higher
priority tables.  Your scaling is much more gentle and comes out (with
standard settings) with a score of 1 billion for a table at the
failsafe age, and about 1 million at half the failsafe age. That seems
reasonable as it's hard to imagine a table having a 1 billion bloat
score.

However, just thinking of non-standard settings... I do wonder if it'll
be aggressive enough if someone did something like raise the
*freeze_max_age to 1 billion (it's certainly common that people raise
this). With a 1.6 billion vacuum_failsafe_age, a table at the
failsafe age only scores in at 110. I guess there's no reason we
couldn't keep your calc and then scale the score further once over
vacuum_failsafe_age to ensure those are the highest priority. There is
a danger that if a table scores too low when age(relfrozenxid) >
vacuum_failsafe_age that autovacuum dawdles along handling bloated
tables while oblivious to the nearing armageddon.
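
The numbers above can be reproduced with a short Python sketch.  This
is only a reading of the description "raise the score to the power of
10 once past *_freeze_max_age"; it is not copied from the v3 patch, and
the function name is hypothetical.

```python
def wraparound_score(age, freeze_max_age):
    """age / freeze_max_age, raised to the 10th power once it reaches 1.0."""
    ratio = age / freeze_max_age
    if ratio >= 1.0:
        return ratio ** 10
    return ratio


FAILSAFE_AGE = 1_600_000_000  # stock vacuum_failsafe_age

# Stock autovacuum_freeze_max_age of 200 million.
print(wraparound_score(FAILSAFE_AGE, 200_000_000))       # 8^10
print(wraparound_score(FAILSAFE_AGE // 2, 200_000_000))  # 4^10

# freeze_max_age raised to 1 billion.
print(wraparound_score(1_000_000_000, 1_000_000_000))    # 1^10
print(wraparound_score(FAILSAFE_AGE, 1_000_000_000))     # 1.6^10
```

Under this reading, the stock settings give roughly 1 billion at the
failsafe age and roughly 1 million at half of it, while with
freeze_max_age raised to 1 billion a table sitting at the failsafe age
only reaches about 110.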

Is it worth writing a comment explaining the philosophy behind the
scoring system to make it easier for people to understand that it aims
to standardise the priority of vacuums and unify the various trigger
thresholds into a single number to determine which tables are most
important to vacuum and/or analyze first?

Thanks for working on this.

David






* Re: another autovacuum scheduling thread
@ 2025-10-22 18:40  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-22 18:40 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Wed, Oct 22, 2025 at 09:07:33AM +1300, David Rowley wrote:
> However, just thinking of non-standard settings... I do wonder if it'll
> be aggressive enough if someone did something like raise the
> *freeze_max_age to 1 billion (it's certainly common that people raise
> this). With a 1.6 billion vacuum_failsafe_age, a table at the
> failsafe age only scores in at 110. I guess there's no reason we
> couldn't keep your calc and then scale the score further once over
> vacuum_failsafe_age to ensure those are the highest priority. There is
> a danger that if a table scores too low when age(relfrozenxid) >
> vacuum_failsafe_age that autovacuum dawdles along handling bloated
> tables while oblivious to the nearing armageddon.

That's a good point.  I wonder if we should try to make the wraparound
score independent of the *_freeze_max_age parameters (once the table age
surpasses said parameters).  Else, different settings will greatly impact
how aggressively tables are prioritized the closer they are to wraparound.
Even if autovacuum_freeze_max_age is set to 200M, it's not critically
important for autovacuum to pick up tables right away as soon as their age
reaches 200M.  But if the parameter is set to 2B, we _do_ want autovacuum
to prioritize tables right away once their age reaches 2B.

> Is it worth writing a comment explaining the philosophy behind the
> scoring system to make it easier for people to understand that it aims
> to standardise the priority of vacuums and unify the various trigger
> thresholds into a single number to determine which tables are most
> important to vacuum and/or analyze first?

Yes, I think so.
 
> Thanks for working on this.

I appreciate the discussion.

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-10-22 18:58  Nathan Bossart <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-22 18:58 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Wed, Oct 22, 2025 at 01:40:11PM -0500, Nathan Bossart wrote:
> On Wed, Oct 22, 2025 at 09:07:33AM +1300, David Rowley wrote:
>> However, just thinking of non-standard settings... I do wonder if it'll
>> be aggressive enough if someone did something like raise the
>> *freeze_max_age to 1 billion (it's certainly common that people raise
>> this). With a 1.6 billion vacuum_failsafe_age, a table at the
>> failsafe age only scores in at 110. I guess there's no reason we
>> couldn't keep your calc and then scale the score further once over
>> vacuum_failsafe_age to ensure those are the highest priority. There is
>> a danger that if a table scores too low when age(relfrozenxid) >
>> vacuum_failsafe_age that autovacuum dawdles along handling bloated
>> tables while oblivious to the nearing armageddon.
> 
> That's a good point.  I wonder if we should try to make the wraparound
> score independent of the *_freeze_max_age parameters (once the table age
> surpasses said parameters).  Else, different settings will greatly impact
> how aggressively tables are prioritized the closer they are to wraparound.
> Even if autovacuum_freeze_max_age is set to 200M, it's not critically
> important for autovacuum to pick up tables right away as soon as their age
> reaches 200M.  But if the parameter is set to 2B, we _do_ want autovacuum
> to prioritize tables right away once their age reaches 2B.

I'm imagining something a bit like the following:

    select xidage "age(relfrozenxid)",
    power(1.001, xidage::float8 / (select min_val
    from pg_settings where name = 'autovacuum_freeze_max_age')::float8)
    xid_age_score from generate_series(0,2_000_000_000,100_000_000) xidage;

     age(relfrozenxid) |   xid_age_score
    -------------------+--------------------
                     0 |                  1
             100000000 | 2.7169239322355936
             200000000 |   7.38167565355452
             300000000 | 20.055451243143093
             400000000 |  54.48913545427955
             500000000 |  148.0428361625591
             600000000 | 402.22112456608977
             700000000 |  1092.804199384323
             800000000 |  2969.065882554825
             900000000 |  8066.726152697397
            1000000000 | 21916.681339054314
            1100000000 | 59545.956045257895
            1200000000 |  161781.8330472099
            1300000000 |  439548.9340069078
            1400000000 | 1194221.0181920114
            1500000000 |  3244607.664704634
            1600000000 |   8815352.21495106
            1700000000 | 23950641.403886583
            1800000000 |  65072070.82261215
            1900000000 | 176795866.53808445
            2000000000 |  480340920.9176516
    (21 rows)
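
The same curve as a small Python sketch, in case it is easier to
experiment with.  The divisor 100000 is the min_val that pg_settings
reports for autovacuum_freeze_max_age, which is what the subquery above
picks up; the function name is made up for illustration.

```python
# min_val of autovacuum_freeze_max_age as reported by pg_settings.
FREEZE_MAX_AGE_MIN_VAL = 100_000


def xid_age_score(xid_age, base=1.001, unit=FREEZE_MAX_AGE_MIN_VAL):
    """Exponential score that does not depend on the configured GUC value."""
    return base ** (xid_age / unit)


for age in range(0, 2_000_000_001, 100_000_000):
    print(f"{age:>13,d}  {xid_age_score(age):.6g}")
```

Because neither the base nor the divisor depends on the configured
autovacuum_freeze_max_age, the curve is the same for every setting: it
starts at 1 for age 0 and grows by a factor of about 2.72 for every 100
million XIDs.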

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-10-22 19:34  David Rowley <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: David Rowley @ 2025-10-22 19:34 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Thu, 23 Oct 2025 at 07:58, Nathan Bossart <[email protected]> wrote:
> > That's a good point.  I wonder if we should try to make the wraparound
> > score independent of the *_freeze_max_age parameters (once the table age
> > surpasses said parameters).  Else, different settings will greatly impact
> > how aggressively tables are prioritized the closer they are to wraparound.
> > Even if autovacuum_freeze_max_age is set to 200M, it's not critically
> > important for autovacuum to pick up tables right away as soon as their age
> > reaches 200M.  But if the parameter is set to 2B, we _do_ want autovacuum
> > to prioritize tables right away once their age reaches 2B.
>
> I'm imagining something a bit like the following:
>
>     select xidage "age(relfrozenxid)",
>     power(1.001, xidage::float8 / (select min_val
>     from pg_settings where name = 'autovacuum_freeze_max_age')::float8)
>     xid_age_score from generate_series(0,2_000_000_000,100_000_000) xidage;
>
>      age(relfrozenxid) |   xid_age_score
>     -------------------+--------------------
>                      0 |                  1
>              100000000 | 2.7169239322355936
>              200000000 |   7.38167565355452
>              300000000 | 20.055451243143093

This does start to put the score > 1 before the table reaches
autovacuum_freeze_max_age. I don't think that's great as the score of
1.0 was meant to represent that the table now requires some autovacuum
work.

The main reason I wanted the score to scale with the percentage by which
the table is over the given threshold was that I had imagined we could
use the score to start reducing the sleep taken each time
autovacuum_vacuum_cost_limit is reached when the highest-scoring table
stays high for too long. I was considering this as a fix for the
misconfigured-autovacuum problem that so many people have. If we scaled
the score similarly to the query above, it would look high even before
the table reaches the limit.  This is why the version I sent scales the
score linearly with autovacuum_freeze_max_age and only scales it
exponentially after the failsafe age.
I wanted to talk about the "reducing the cost delay" feature
separately so as not to load up this thread and widen the scope for
varying opinions, but in its most trivial form, the
vacuum_cost_limit() code could be adjusted to only sleep for
autovacuum_vacuum_cost_delay / <the table's score>.
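
As a very rough sketch of that trivial form (purely illustrative; scaled_cost_delay_ms is a made-up name and not an existing PostgreSQL function, and treating scores below 1.0 as 1.0 is an assumption added here so the delay is never lengthened):

```c
/*
 * Hypothetical helper: shrink the cost-based sleep as the table's score
 * grows.  Scores below 1.0 are clamped to 1.0 (an assumption made for
 * this sketch) so that the configured delay is never increased.
 */
static double
scaled_cost_delay_ms(double cost_delay_ms, double score)
{
    if (score < 1.0)
        score = 1.0;
    return cost_delay_ms / score;
}
```

With a 2ms autovacuum_vacuum_cost_delay, a table with a score of 4 would sleep for 0.5ms under this scheme.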

I think the one I proposed in [1] does this quite well. The table
remains eligible to be autovacuumed with any score >= 1.0, and there's
still a huge window of time to freeze a table once it's over
autovacuum_freeze_max_age before there are issues. The exponential
scaling once over the failsafe age should ensure that the table is at
the top of the list when the failsafe code kicks in and removes the cost
limit. If we had the varying sleep time as I mentioned above, the
failsafe code could even be removed, as the
"autovacuum_vacuum_cost_delay / <table's score>" calculation would
effectively zero the sleep time for any table > failsafe age.

David

[1] https://postgr.es/m/CAApHDvqrd=SHVUytdRj55OWnLH98Rvtzqam5zq2f4XKRZa7t9Q@mail.gmail.com






* Re: another autovacuum scheduling thread
@ 2025-10-22 19:43  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-22 19:43 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Thu, Oct 23, 2025 at 08:34:49AM +1300, David Rowley wrote:
> On Thu, 23 Oct 2025 at 07:58, Nathan Bossart <[email protected]> wrote:
>> I'm imagining something a bit like the following:
>>
>>     select xidage "age(relfrozenxid)",
>>     power(1.001, xidage::float8 / (select min_val
>>     from pg_settings where name = 'autovacuum_freeze_max_age')::float8)
>>     xid_age_score from generate_series(0,2_000_000_000,100_000_000) xidage;
>>
>>      age(relfrozenxid) |   xid_age_score
>>     -------------------+--------------------
>>                      0 |                  1
>>              100000000 | 2.7169239322355936
>>              200000000 |   7.38167565355452
>>              300000000 | 20.055451243143093
> 
> This does start to put the score > 1 before the table reaches
> autovacuum_freeze_max_age. I don't think that's great as the score of
> 1.0 was meant to represent that the table now requires some autovacuum
> work.

My thinking was that this formula would only be used once the table reaches
autovacuum_freeze_max_age.  If the age is less than that, we'd do something
else, such as dividing the age by the *_max_age setting.

> The main reason I was trying to keep the score scaling with the
> percentage over the given threshold that the table is was that I had
> imagined we could use the score number to start reducing the sleep
> time between autovacuum_vacuum_cost_limit when the highest scoring
> table persists in being high for too long. I was considering this to
> fix the misconfigured autovacuum problem that so many people have. If
> we scaled it the way similar to the query above, the score would look
> high even before it reaches the limit.  This is the reason I was
> scaling the score linear with the autovacuum_freeze_max_age with the
> version I sent and only scaling exponentially after the failsafe age.
> I wanted to talk about the "reducing the cost delay" feature
> separately so as not to load up this thread and widen the scope for
> varying opinions, but in its most trivial form, the
> vacuum_cost_limit() code could be adjusted to only sleep for
> autovacuum_vacuum_cost_delay / <the table's score>.

I see.

> I think the one I proposed in [1] does this quite well. The table
> remains eligible to be autovacuumed with any score >= 1.0, and there's
> still a huge window of time to freeze a table once it's over
> autovacuum_freeze_max_age before there are issues and the exponential
> scaling once over failsafe age should ensure that the table is top of
> the list for when the failsafe code kicks in and removes the cost
> limit.

Yeah.  I'll update the patch with that formula.

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-10-23 18:22  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-10-23 18:22 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> > I think the one I proposed in [1] does this quite well. The table
> > remains eligible to be autovacuumed with any score >= 1.0, and there's
> > still a huge window of time to freeze a table once it's over
> > autovacuum_freeze_max_age before there are issues and the exponential
> > scaling once over failsafe age should ensure that the table is top of
> > the list for when the failsafe code kicks in and removes the cost
> > limit.
>
> Yeah.  I'll update the patch with that formula.

I was looking at v3, and I understand the formula will be updated in the
next version. However, do you think we should benchmark the approach
of using an intermediary list to store the eligible tables and sorting
that list? It may cause larger performance overhead for databases with
hundreds of tables that may all be eligible for autovacuum. I do think
such cases are common, particularly in multi-tenant type databases,
where each tenant could be one or more tables.

What do you think?

--
Sami






* Re: another autovacuum scheduling thread
@ 2025-10-23 18:47  Nathan Bossart <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-23 18:47 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, Oct 23, 2025 at 01:22:24PM -0500, Sami Imseih wrote:
> I was looking at v3, and I understand the formula will be updated in the
> next version. However, do you think we should benchmark the approach
> of using an intermediary list to store the eligible tables and sorting
> that list,
> which may cause larger performance overhead for databases with hundreds
> of tables that may all be eligible for autovacuum. I do think such cases
> out there are common, particularly in multi-tenant type databases, where
> each tenant could be one or more tables.

We already have an intermediary list of table OIDs, so the additional
overhead is ultimately just the score calculation and the sort operation.
I'd be quite surprised if that added up to anything remotely worrisome,
even for thousands of eligible tables.
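
To give a feel for how little extra work that is, here is a minimal standalone sketch of sorting such a list by score with qsort(). The TableScore struct and both function names are hypothetical and chosen for this example; the patch's actual data structure may well differ.

```c
#include <stdlib.h>

/* Hypothetical list entry: a table OID plus its computed score. */
typedef struct TableScore
{
    unsigned int oid;
    double       score;
} TableScore;

/* qsort comparator: highest score first. */
static int
table_score_cmp(const void *a, const void *b)
{
    double sa = ((const TableScore *) a)->score;
    double sb = ((const TableScore *) b)->score;

    if (sa > sb)
        return -1;
    if (sa < sb)
        return 1;
    return 0;
}

/* Order the eligible tables so the most urgent one is processed first. */
static void
sort_tables_by_score(TableScore *tables, size_t ntables)
{
    qsort(tables, ntables, sizeof(TableScore), table_score_cmp);
}
```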

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-10-23 19:32  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-10-23 19:32 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> On Thu, Oct 23, 2025 at 01:22:24PM -0500, Sami Imseih wrote:
> > I was looking at v3, and I understand the formula will be updated in the
> > next version. However, do you think we should benchmark the approach
> > of using an intermediary list to store the eligible tables and sorting
> > that list,
> > which may cause larger performance overhead for databases with hundreds
> > of tables that may all be eligible for autovacuum. I do think such cases
> > out there are common, particularly in multi-tenant type databases, where
> > each tenant could be one or more tables.
>
> We already have an intermediary list of table OIDs, so the additional
> overhead is ultimately just the score calculation and the sort operation.
> I'd be quite surprised if that added up to anything remotely worrisome,
> even for thousands of eligible tables.

Yeah, you’re correct, the list already exists; sorry I missed that. My
main concern is the additional overhead of the sort operation, especially
if we have many eligible tables and an aggressive autovacuum_naptime. I
don’t think we should make the existing performance with many relations
any worse with an additional sort. That said, in such cases the sort may
not even be the main performance bottleneck, since the catalog scan itself
already doesn’t scale well with many relations. With our current approach,
we have more options to improve this, but if we add a sort, we may not be
able to avoid a full scan.

--
Sami






* Re: another autovacuum scheduling thread
@ 2025-10-23 20:24  David Rowley <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-10-23 20:24 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Fri, 24 Oct 2025 at 08:33, Sami Imseih <[email protected]> wrote:
> Yeah, you’re correct, the list already exists; sorry I missed that. My
> main concern is
> the additional overhead of the sort operation, especially if we have
> many eligible
> tables and an aggressive autovacuum_naptime.

It is true that there are reasons that millions of tables could
suddenly become eligible for autovacuum work with the consumption of a
single xid, but I imagine sorting the list of tables is probably the
least of the DBA's worries for that case, as sorting the
tables_to_process list is going to take a tiny fraction of the time
that doing the vacuum work will take.

If your concern is that the sort could take too large a portion of
someone's 1sec autovacuum_naptime instance, then you also need to
consider that the list isn't likely to be very long as there's very
little time for tables to become eligible in such a short naptime, and
if the tables are piling up because autovacuum is configured to run
too slowly, then let's fix that at the root cause rather than be
worried about improving one area because another area needs work. If
we think like that, we'll remain gridlocked and autovacuum will never
be improved. TBH, I think that mindset has likely contributed quite a
bit to the fact that we've made about zero improvements in this area
despite nobody thinking that nothing needs to be done.

There are also things that could be done if we were genuinely
concerned and had actual proof that this could reasonably be a
problem. sort_template.h would reduce the constant factor of the
indirect function call overhead by quite a bit. On a quick test here
with a table containing 1 million random float8 values, a Seq Scan and
in-memory Sort, EXPLAIN ANALYZE reports the sort took about 21ms:
(actual time=172.273..193.824). I really doubt anyone will be
concerned with 21ms when there's a list of 1 million tables needing to
be autovacuumed.

David






* Re: another autovacuum scheduling thread
@ 2025-10-23 20:48  Sami Imseih <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-10-23 20:48 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> On Fri, 24 Oct 2025 at 08:33, Sami Imseih <[email protected]> wrote:
> > Yeah, you’re correct, the list already exists; sorry I missed that. My
> > main concern is
> > the additional overhead of the sort operation, especially if we have
> > many eligible
> > tables and an aggressive autovacuum_naptime.
>
> It is true that there are reasons that millions of tables could
> suddenly become eligible for autovacuum work with the consumption of a
> single xid, but I imagine sorting the list of tables is probably the
> least of the DBAs worries for that case as sorting the
> tables_to_process list is going to take a tiny fraction of the time
> that doing the vacuum work will take.

Yes, in my last reply, I did indicate that the sort will likely not be
the operation that tips the performance over; rather, it is the
catalog scan itself that I have seen scale poorly as the number of
relations grows (in cases of thousands or hundreds of thousands of tables).
If we are to prioritize vacuuming by (M)XID, then it will be hard to avoid the
catalog scan in a future improvement.

> TBH, I think that mindset has likely contributed quite a
> bit to the fact that we've made about zero improvements in this area
> despite nobody thinking that nothing needs to be done.

I am not against this idea, just thinking out loud about the high relation
cases I have seen in the past.

--
Sami






* Re: another autovacuum scheduling thread
@ 2025-10-23 22:39  David Rowley <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-10-23 22:39 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Fri, 24 Oct 2025 at 09:48, Sami Imseih <[email protected]> wrote:
> Yes, in my last reply, I did indicate that the sort will likely not be
> the operation that will tip the performance over, but the
> catalog scan itself that I have seen not scale well as the number of
> relations grow ( in cases of thousands or hundreds of thousands of tables).
> If we are to prioritize vacuuming by M(XID), then it will be hard to avoid the
> catalog scan anymore in a future improvement.

I grant you that I can see that being a problem for a
sufficiently large number of tables and small enough
autovacuum_naptime, but I don't see how anything being proposed here
moves the goalposts on the requirements to scan pg_class. We at least
need to get the relopts from somewhere, plus reltuples, relpages,
relallfrozen. We can't magic those values out of thin air. So, since
nothing is changing in regards to the scan of pg_class or which
columns we need to look at in that table, I don't know why we'd
consider it a topic to discuss on this thread. If this thread becomes
a dumping ground for unrelated problems, then nothing will be done to
fix the problem at hand.

David






* Re: another autovacuum scheduling thread
@ 2025-10-24 15:08  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-24 15:08 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

Here is an updated patch based on the latest discussion.

-- 
nathan



* Re: another autovacuum scheduling thread
@ 2025-10-24 21:13  Peter Geoghegan <[email protected]>
  parent: David Rowley <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Peter Geoghegan @ 2025-10-24 21:13 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Wed, Oct 22, 2025 at 3:35 PM David Rowley <[email protected]> wrote:
> If we had the varying sleep time as I mentioned above, the
> failsafe code could even be removed as the
> "autovacuum_vacuum_cost_delay / <tables score>" calculation would
> effectively zero the sleep time with any table > failsafe age.

I'm not sure what you mean by "the failsafe could be removed".
Importantly, the failsafe will abandon all further index vacuuming.
That's why it's presented as something that you as a user are not
supposed to rely on.

-- 
Peter Geoghegan






* Re: another autovacuum scheduling thread
@ 2025-10-24 22:25  David Rowley <[email protected]>
  parent: Peter Geoghegan <[email protected]>
  0 siblings, 0 replies; 143+ messages in thread

From: David Rowley @ 2025-10-24 22:25 UTC (permalink / raw)
  To: Peter Geoghegan <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; Sami Imseih <[email protected]>; pgsql-hackers

On Sat, 25 Oct 2025 at 10:14, Peter Geoghegan <[email protected]> wrote:
>
> On Wed, Oct 22, 2025 at 3:35 PM David Rowley <[email protected]> wrote:
> > If we had the varying sleep time as I mentioned above, the
> > failsafe code could even be removed as the
> > "autovacuum_vacuum_cost_delay / <tables score>" calculation would
> > effectively zero the sleep time with any table > failsafe age.
>
> I'm not sure what you mean by "the failsafe could be removed".
> Importantly, the failsafe will abandon all further index vacuuming.
> That's why it's presented as something that you as a user are not
> supposed to rely on.

I didn't realise it did that too. I thought it just dropped the delay
to zero. In that case, I revoke the statement.

David






* Re: another autovacuum scheduling thread
@ 2025-10-26 01:25  David Rowley <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-10-26 01:25 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Sat, 25 Oct 2025 at 04:08, Nathan Bossart <[email protected]> wrote:
> Here is an updated patch based on the latest discussion.

Thanks. I've just had a look at it. A few comments and questions.

1) The subtraction here looks back to front:

+ xid_age = TransactionIdIsNormal(relfrozenxid) ? relfrozenxid - recentXid : 0;
+ mxid_age = MultiXactIdIsValid(relminmxid) ? relminmxid - recentMulti : 0;

2) Would it be better to move all the code that sets the xid_score and
mxid_score to under an "if (force_vacuum)"? Those two variables could
be declared in there too.

3) Could the following be refactored a bit so we only check the "relid
!= StatisticRelationId" condition once?

+ if (relid != StatisticRelationId &&
+ classForm->relkind != RELKIND_TOASTVALUE)

Something like:

/*
 * ANALYZE refuses to work with pg_statistic and we don't analyze
 * toast tables
 */
if (anltuples > anlthresh && relid != StatisticRelationId &&
    classForm->relkind != RELKIND_TOASTVALUE)
{
    *doanalyze = true;
    /* calc analyze score and Max with *score */
}
else
    *doanalyze = false;

then delete:

/* ANALYZE refuses to work with pg_statistic */
if (relid == StatisticRelationId)
    *doanalyze = false;

4) Should these be TransactionIds?

+ uint32 xid_age;
+ uint32 mxid_age;

5) Instead of:

+ double score = 0.0;

Is it better to zero the score inside relation_needs_vacanalyze() so
it works the same as the other output parameters?
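
To illustrate point 1, here is a small standalone sketch of the intended direction of the subtraction. It is not the patch's code: xid_age and FIRST_NORMAL_XID are names made up for this example, and the sketch uses plain uint32_t rather than PostgreSQL's own types and macros.

```c
#include <stdint.h>

/* XIDs below 3 are reserved (invalid, bootstrap, frozen); normal XIDs start at 3. */
#define FIRST_NORMAL_XID 3u

/*
 * Hypothetical helper: age of relfrozenxid relative to recent_xid.
 * Unsigned subtraction wraps modulo 2^32, so the result is still the
 * right distance when the XID counter has wrapped around past
 * relfrozenxid.  Non-normal XIDs get an age of 0.
 */
static uint32_t
xid_age(uint32_t recent_xid, uint32_t relfrozenxid)
{
    if (relfrozenxid < FIRST_NORMAL_XID)
        return 0;
    return recent_xid - relfrozenxid;
}
```

With the operands the other way around, a table that is only 100 XIDs old would come out with an age of 4294967196.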

David






* Re: another autovacuum scheduling thread
@ 2025-10-27 16:06  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Nathan Bossart @ 2025-10-27 16:06 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Sun, Oct 26, 2025 at 02:25:48PM +1300, David Rowley wrote:
> Thanks. I've just had a look at it. A few comments and questions.

Thanks.

> 1) The subtraction here looks back to front:
> 
> + xid_age = TransactionIdIsNormal(relfrozenxid) ? relfrozenxid - recentXid : 0;
> + mxid_age = MultiXactIdIsValid(relminmxid) ? relminmxid - recentMulti : 0;

D'oh.

> 2) Would it be better to move all the code that sets the xid_score and
> mxid_score to under an "if (force_vacuum)"? Those two variables could
> be declared in there too.

Seems reasonable.

> 3) Could the following be refactored a bit so we only check the "relid
> != StatisticRelationId" condition once?

Yes.  We can update the vacuum part to follow the same pattern, too.

> 4) Should these be TransactionIds?
> 
> + uint32 xid_age;
> + uint32 mxid_age;

Probably.

> 5) Instead of:
> 
> + double score = 0.0;
> 
> Is it better to zero the score inside relation_needs_vacanalyze() so
> it works the same as the other output parameters?

My only concern about this is that some compilers might complain about
potentially-uninitialized uses.  But we can still zero it in the function
regardless.

-- 
nathan



* Re: another autovacuum scheduling thread
@ 2025-10-27 17:47  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-10-27 17:47 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

I spent some time looking at this, and I am not sure how much this
will move the needle, since most of the time the bottleneck for
autovacuum is the limited number of workers and large tables that
take a long time to process.

That said, this is a good change for the simple reason that it is
better to have a well-defined prioritization strategy for autovacuum
than something that is somewhat random, as mentioned earlier.

Just a couple of comments on v5:

1/ Should we add documentation explaining this prioritization behavior in [0]?

I wrote a SQL query that returns the tables and their scores, which I
found useful when I was testing this out, so having the actual rules
spelled out in the docs would be very useful.

If we don't want to go that much in depth, at minimum the docs should say:

"Autovacuum prioritizes tables based on how far they exceed their thresholds
or if they are approaching wraparound limits." so a DBA can understand
this behavior.

2/
* The score is calculated as the maximum of the ratios of each of the table's
* relevant values to its threshold. For example, if the number of inserted
* tuples is 100, and the insert threshold for the table is 80, the insert
* score is 1.25.

Should we consider clamping the score when reltuples = -1? Otherwise,
the scores for such tables (new tables with a large amount of ingested
data) will be over-inflated. Perhaps, if reltuples = -1 (number of
reltuples not known), we could give a score of 0.5, so we are not
over-prioritizing such tables but also not pushing them down to the bottom?
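
For reference, the rule described in that comment can be sketched as follows. This is only my reading of the comment, not the patch's code: ratio_score and table_score are hypothetical names, and only the dead-tuple and insert ratios are shown.

```c
/*
 * Hypothetical helper: ratio of a counter to its threshold.  The
 * threshold is clamped to at least 1 to avoid dividing by zero.
 */
static double
ratio_score(double value, double threshold)
{
    if (threshold < 1.0)
        threshold = 1.0;
    return value / threshold;
}

/* Hypothetical helper: the table's score is the largest of its ratios. */
static double
table_score(double vactuples, double vacthresh,
            double instuples, double vacinsthresh)
{
    double dead_score = ratio_score(vactuples, vacthresh);
    double ins_score = ratio_score(instuples, vacinsthresh);

    return (dead_score > ins_score) ? dead_score : ins_score;
}
```

With 100 inserted tuples against an insert threshold of 80, the insert ratio is 1.25, as in the comment's example.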

[0] https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM

--
Sami Imseih
Amazon Web Services






* Re: another autovacuum scheduling thread
@ 2025-10-27 21:15  Nathan Bossart <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-27 21:15 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Mon, Oct 27, 2025 at 12:47:15PM -0500, Sami Imseih wrote:
> 1/ Should we add documentation explaining this prioritization behavior in [0]?
> 
> I wrote a sql that returns the tables and scores, which I found was
> useful when I was testing this out, so having the actually rules spelled out
> in docs will actually be super useful.

Can you elaborate on how it would be useful?  I'd be open to adding a short
note that autovacuum attempts to prioritize the tables in a smart way, but
I'm not sure I see the value of documenting every detail.  I also don't
want to add too much friction to future changes to the prioritization
logic.

> If we don't want to go that much in depth, at minimum the docs should say:
> 
> "Autovacuum prioritizes tables based on how far they exceed their thresholds
> or if they are approaching wraparound limits." so a DBA can understand
> this behavior.

Yeah, I would probably choose to keep it relatively vague like this.

> * The score is calculated as the maximum of the ratios of each of the table's
> * relevant values to its threshold. For example, if the number of inserted
> * tuples is 100, and the insert threshold for the table is 80, the insert
> * score is 1.25.
> 
> Should we consider clamping down on the score when
> reltuples = -1, otherwise the scores for such tables ( new tables
> with a large amount of ingested data ) will be over-inflated? Perhaps,
> if reltuples = -1 ( # of reltuples not known ), then give a score of .5,
> so we are not over-prioritizing but not pushing down to the bottom?

I'm not sure it's worth expending too much energy to deal with this.  In
the worst case, the table will be given an arbitrarily high priority the
first time it is vacuumed, but AFAICT that's it.  But that's already the
case, as the thresholds will be artificially low before the first
VACUUM/ANALYZE.

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-10-27 22:35  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-10-27 22:35 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> > I wrote a sql that returns the tables and scores, which I found was
> > useful when I was testing this out, so having the actually rules spelled out
> > in docs will actually be super useful.
>
> Can you elaborate on how it would be useful?  I'd be open to adding a short
> note that autovacuum attempts to prioritize the tables in a smart way, but
> I'm not sure I see the value of documenting every detail.

We discuss the threshold calculations in the documentation, and users
can write scripts to monitor which tables are eligible. However, there
is nothing that indicates which table autovacuum will work on next (I
have been asked that question by users a few times, sometimes out of
curiosity, or because they are monitoring vacuum activity and wondering
when their important table will get a vacuum cycle, or if they should
kick off a manual vacuum). With the scoring system, it will be much more
difficult to explain, unless someone walks through the code.

> I also don't
> want to add too much friction to future changes to the prioritization
> logic.

Maybe future changes are a good reason to document the way autovacuum
prioritizes, since this is a user-facing change.

> > If we don't want to go that much in depth, at minimum the docs should say:
> >
> > "Autovacuum prioritizes tables based on how far they exceed their thresholds
> > or if they are approaching wraparound limits." so a DBA can understand
> > this behavior.
>
> Yeah, I would probably choose to keep it relatively vague like this.

With all the above said, starting with something small is definitely better
than nothing.

> > * The score is calculated as the maximum of the ratios of each of the table's
> > * relevant values to its threshold. For example, if the number of inserted
> > * tuples is 100, and the insert threshold for the table is 80, the insert
> > * score is 1.25.
> >
> > Should we consider clamping down on the score when
> > reltuples = -1, otherwise the scores for such tables ( new tables
> > with a large amount of ingested data ) will be over-inflated? Perhaps,
> > if reltuples = -1 ( # of reltuples not known ), then give a score of .5,
> > so we are not over-prioritizing but not pushing down to the bottom?
>
> I'm not sure it's worth expending too much energy to deal with this.  In
> the worst case, the table will be given an arbitrarily high priority the
> first time it is vacuumed, but AFAICT that's it.  But that's already the
> case, as the thresholds will be artificially low before the first
> VACUUM/ANALYZE.

I can think of scenarios where there may be workloads that create/drop
staging tables and load some data (like batch processing) where this
may become an issue, because we are now forcing such tables to the top
of the list, potentially preventing other tables from getting vacuum cycles.
It could happen now, but the difference with this change is that we are
forcing these tables to the top of the priority list based on an unknown
value (pg_class.reltuples = -1).

--
Sami Imseih
Amazon Web Services (AWS)






* Re: another autovacuum scheduling thread
@ 2025-10-27 22:47  David Rowley <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-10-27 22:47 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

The patch is starting to look good. Here's a review of v5:

1. I think the following code at the bottom of
relation_needs_vacanalyze() can be deleted. You've added the check to
ensure *doanalyze never gets set to true for pg_statistic.

/* ANALYZE refuses to work with pg_statistic */
if (relid == StatisticRelationId)
    *doanalyze = false;

2. As #1, but in recheck_relation_needs_vacanalyze(), the following I
think can now be removed:

/* ignore ANALYZE for toast tables */
if (classForm->relkind == RELKIND_TOASTVALUE)
    *doanalyze = false;

3. Would you be able to include an explanation of the idea behind the
* 1.05 in the preceding comment?

On Tue, 28 Oct 2025 at 05:06, Nathan Bossart <[email protected]> wrote:
> +        effective_xid_failsafe_age = Max(vacuum_failsafe_age,
> +                                         autovacuum_freeze_max_age * 1.05);
> +        effective_mxid_failsafe_age = Max(vacuum_multixact_failsafe_age,
> +                                          autovacuum_multixact_freeze_max_age * 1.05);

I assume it's to work around some strange configuration settings, but
don't know for sure, or why 1.05 is a good value.

4. I think it might be neater to format the following as 3 separate "if" tests:

> +        if (force_vacuum ||
> +            vactuples > vacthresh ||
> +            (vac_ins_base_thresh >= 0 && instuples > vacinsthresh))
> +        {
> +            *dovacuum = true;
> +            *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
> +            if (vac_ins_base_thresh >= 0)
> +                *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
> +        }
> +        else
> +            *dovacuum = false;

i.e:

        if (force_vacuum)
            *dovacuum = true;

        if (vactuples > vacthresh)
        {
            *dovacuum = true;
            *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
        }

        if (vac_ins_base_thresh >= 0 && instuples > vacinsthresh)
        {
            *dovacuum = true;
            *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
        }

and also get rid of all the "else *dovacuum = false;" (and *dovacuum =
false) in favour of setting those to false at the top of the function.
It's just getting harder to track that those parameters are getting
set in all cases when they're meant to be.

Doing that also gets rid of the duplicative "if (vac_ins_base_thresh
>= 0)" check and also saves doing the score calc when the inputs to it
don't make sense. The current code is relying on Max always picking
the current *score when the threshold isn't met.
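
As a standalone illustration of that shape (a sketch only, not the patch
itself), the function below sets both outputs at the top and then applies
the three independent tests. The VacInputs struct and the function name
are hypothetical, and starting *score at 0 is an assumption; in the real
function the score may already carry other components by this point.

```c
#include <assert.h>
#include <stdbool.h>

#define Max(x, y) ((x) > (y) ? (x) : (y))

/* Hypothetical bundle of the values the checks depend on. */
typedef struct VacInputs
{
    bool        force_vacuum;
    double      vactuples;
    double      vacthresh;
    double      instuples;
    double      vacinsthresh;
    double      vac_ins_base_thresh;    /* negative disables insert vacuums */
} VacInputs;

static void
needs_vacuum(const VacInputs *in, bool *dovacuum, double *score)
{
    /* Initialize outputs up front so every path leaves them set. */
    *dovacuum = false;
    *score = 0.0;

    if (in->force_vacuum)
        *dovacuum = true;

    if (in->vactuples > in->vacthresh)
    {
        *dovacuum = true;
        *score = Max(*score, in->vactuples / Max(in->vacthresh, 1));
    }

    if (in->vac_ins_base_thresh >= 0 && in->instuples > in->vacinsthresh)
    {
        *dovacuum = true;
        *score = Max(*score, in->instuples / Max(in->vacinsthresh, 1));
    }
}
```

Written this way, a forced vacuum whose thresholds are not met sets
*dovacuum but leaves the score untouched, and a disabled insert threshold
never contributes to the score.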

David





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-27 23:16  David Rowley <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-10-27 23:16 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, 28 Oct 2025 at 11:35, Sami Imseih <[email protected]> wrote:
> We discuss the threshold calculations in the documentation, and users
> can write scripts to monitor which tables are eligible. However, there
> is nothing that indicates which table autovacuum will work on next (I
> have been asked that question by users a few times, sometimes out of
> curiosity, or because they are monitoring vacuum activity and wondering
> when their important table will get a vacuum cycle, or if they should
> kick off a manual vacuum). With the scoring system, it will be much more
> difficult to explain, unless someone walks through the code.

I think it's reasonable to want to document how autovacuum prioritises
tables, but maybe not in too much detail. Longer term, I think it
would be good to have a pg_catalog view for this which showed the
relid or schema/relname, and the output values of
relation_needs_vacanalyze(). If we had that and we documented that
autovacuum workers work from that list, but they just may have an
older snapshot of it, then that might help make the score easier to
document. It would also allow people to question the scores as I
expect at least some people might not agree with the priorities. That
would allow us to consider tuning the score calculation if someone
points out a deficiency with the current calculation.

Also, longer-term, it doesn't seem that unreasonable that the
autovacuum worker might want to refresh the tables_to_process once it
finishes a table and if autovacuum_naptime * $value units of time have
passed since it was last checked. That would allow the worker to
react accordingly when scores have changed significantly
since it last checked.  I mean, it might be days between when
autovacuum calculates the scores and finally vacuums the table when
the list is long, or if it was tied up with large tables. Other
workers may have gotten to some of the tables too, so the score may
have dropped, but again made its way above the threshold, but to a
lesser extent.
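
The refresh condition could look roughly like the sketch below. Nothing
like this exists in the patch; the function name, the Seconds type, and
the naptime_multiplier parameter (standing in for "$value" above) are all
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical time representation: whole seconds since some epoch. */
typedef int64_t Seconds;

/*
 * Decide, after finishing a table, whether the worker's list of tables
 * (and their scores) is old enough to be worth recomputing.
 */
static bool
should_refresh_table_list(Seconds last_scored_at, Seconds now,
                          int naptime_secs, int naptime_multiplier)
{
    Seconds     max_age;

    assert(naptime_secs > 0 && naptime_multiplier > 0);

    max_age = (Seconds) naptime_secs * naptime_multiplier;
    return (now - last_scored_at) >= max_age;
}
```

For example, with a 60 second naptime and a multiplier of 10, the list
would be rebuilt once it is at least 600 seconds old, but only at the
point where the worker has just finished a table.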

David





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-28 21:03  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-28 21:03 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, Oct 28, 2025 at 11:47:08AM +1300, David Rowley wrote:
> 1. I think the following code at the bottom of
> relation_needs_vacanalyze() can be deleted. You've added the check to
> ensure *doanalyze never gets set to true for pg_statistic.
> 
> /* ANALYZE refuses to work with pg_statistic */
> if (relid == StatisticRelationId)
>     *doanalyze = false;
> 
> 2. As #1, but in recheck_relation_needs_vacanalyze(), the following I
> think can now be removed:
> 
> /* ignore ANALYZE for toast tables */
> if (classForm->relkind == RELKIND_TOASTVALUE)
>     *doanalyze = false;

Removed.

> 3. Would you be able to include what the idea behind the * 1.05 in the
> preceding comment?
> 
> On Tue, 28 Oct 2025 at 05:06, Nathan Bossart <[email protected]> wrote:
>> +        effective_xid_failsafe_age = Max(vacuum_failsafe_age,
>> +                                         autovacuum_freeze_max_age * 1.05);
>> +        effective_mxid_failsafe_age = Max(vacuum_multixact_failsafe_age,
>> +                                          autovacuum_multixact_freeze_max_age * 1.05);
> 
> I assume it's to workaround some strange configuration settings, but
> don't know for sure, or why 1.05 is a good value.

This is lifted from vacuum_xid_failsafe_check().  As noted in the docs, the
failsafe settings are silently limited to 105% of *_freeze_max_age.  I
expanded on this in the comment atop these lines.
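
For anyone following along, here is a tiny standalone sketch of that
clamp. The function name is made up for illustration; only the Max(...,
... * 1.05) expression comes from the patch hunk quoted above.

```c
#include <assert.h>

#define Max(x, y) ((x) > (y) ? (x) : (y))

/*
 * Effective failsafe age: the configured failsafe age, but never less
 * than 105% of the corresponding *_freeze_max_age setting.
 */
static double
effective_failsafe_age(int failsafe_age, int freeze_max_age)
{
    assert(failsafe_age >= 0 && freeze_max_age >= 0);
    return Max((double) failsafe_age, freeze_max_age * 1.05);
}
```

With vacuum_failsafe_age = 1600000000 and autovacuum_freeze_max_age =
200000000, the result is simply 1600000000. The 1.05 factor only matters
for unusual settings, e.g. a failsafe age of 100000000 with the same
freeze max age yields roughly 210000000.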

> 4. I think it might be neater to format the following as 3 separate "if" tests:
> 
>> +        if (force_vacuum ||
>> +            vactuples > vacthresh ||
>> +            (vac_ins_base_thresh >= 0 && instuples > vacinsthresh))
>> +        {
>> +            *dovacuum = true;
>> +            *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
>> +            if (vac_ins_base_thresh >= 0)
>> +                *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
>> +        }
>> +        else
>> +            *dovacuum = false;
> 
> i.e:
> 
>         if (force_vacuum)
>             *dovacuum = true;
> 
>         if (vactuples > vacthresh)
>         {
>             *dovacuum = true;
>             *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
>         }
> 
>         if (vac_ins_base_thresh >= 0 && instuples > vacinsthresh)
>         {
>             *dovacuum = true;
>             *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
>         }
> 
> and also get rid of all the "else *dovacuum = false;" (and *dovacuum =
> false) in favour of setting those to false at the top of the function.
> It's just getting harder to track that those parameters are getting
> set in all cases when they're meant to be.
> 
> doing that also gets rid of the duplicative "if (vac_ins_base_thresh
> >= 0)" check and also saves doing the score calc when the inputs to it
> don't make sense. The current code is relying on Max always picking
> the current *score when the threshold isn't met.

Done.

-- 
nathan


^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-28 21:06  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-28 21:06 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, Oct 28, 2025 at 12:16:28PM +1300, David Rowley wrote:
> I think it's reasonable to want to document how autovacuum prioritises
> tables, but maybe not in too much detail. Longer term, I think it
> would be good to have a pg_catalog view for this which showed the
> relid or schema/relname, and the output values of
> relation_needs_vacanalyze(). If we had that and we documented that
> autovacuum workers work from that list, but they just may have an
> older snapshot of it, then that might help make the score easier to
> document. It would also allow people to question the scores as I
> expect at least some people might not agree with the priorities. That
> would allow us to consider tuning the score calculation if someone
> points out a deficiency with the current calculation.
> 
> Also, longer-term, it also doesn't seem that unreasonable that the
> autovacuum worker might want to refresh the tables_to_process once it
> finishes a table and if autovacuum_naptime * $value units of time have
> passed since it was last checked. That would allow the worker to deal
> with and react accordingly when scores have changed significantly
> since it last checked.  I mean, it might be days between when
> autovacuum calculates the scores and finally vacuums the table when
> the list is long, or if it was tied up with large tables. Other
> workers may have gotten to some of the tables too, so the score may
> have dropped, but again made its way above the threshold, but to a
> lesser extent.

Agreed on both points.

-- 
nathan





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-28 22:44  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Sami Imseih @ 2025-10-28 22:44 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> Done.

My compiler is complaining about v6:

"../src/backend/postmaster/autovacuum.c:3293:32: warning: operation on
‘*score’ may be undefined [-Wsequence-point]
 3293 |                         *score = *score = Max(*score, (double)
instuples / Max(vacinsthresh, 1));
[2/2] Linking target src/backend/postgres"

Shouldn't it just be like below?

*score = Max(*score, (double) instuples / Max(vacinsthresh, 1));


--
Sami





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-29 03:10  wenhui qiu <[email protected]>
  parent: Sami Imseih <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: wenhui qiu @ 2025-10-29 03:10 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

Hi Nathan Bossart,

> + if (vactuples > vacthresh)
> + {
> + *dovacuum = true;
> + *score = Max(*score, (double) vactuples / Max(vacthresh, 1));
> + }
> +
> + if (vac_ins_base_thresh >= 0 && instuples > vacinsthresh)
> + {
> + *dovacuum = true;
> + *score = *score = Max(*score, (double) instuples / Max(vacinsthresh, 1));
> + }
I believe the doubled assignment (*score = *score = Max(*score, (double)
instuples / Max(vacinsthresh, 1));) must be a slip of the hand on your
part, having copied an extra "*score =".
I also suggest adding a debug log for the score:
    ereport(DEBUG2,
            (errmsg("autovacuum candidate: %s (score=%.3f)",
                    get_rel_name(table->oid), table->score)));

> + effective_xid_failsafe_age = Max(vacuum_failsafe_age,
> + autovacuum_freeze_max_age * 1.05);
Typically, DBAs avoid setting autovacuum_freeze_max_age too close to
vacuum_failsafe_age. Therefore, your logic most likely uses the
vacuum_failsafe_age value.
Would taking the average of the two be a better approach?
root@localhost:/data/pgsql/pg18data# grep vacuum_failsafe_age postgresql.conf
#vacuum_failsafe_age = 1600000000
root@localhost:/data/pgsql/pg18data# grep autovacuum_freeze_max_age postgresql.conf
#autovacuum_freeze_max_age = 200000000 # maximum XID age before forced vacuum



Thanks



^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-29 15:24  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-10-29 15:24 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> On Tue, Oct 28, 2025 at 12:16:28PM +1300, David Rowley wrote:
> > I think it's reasonable to want to document how autovacuum prioritises
> > tables, but maybe not in too much detail. Longer term, I think it
> > would be good to have a pg_catalog view for this which showed the
> > relid or schema/relname, and the output values of
> > relation_needs_vacanalyze(). If we had that and we documented that
> > autovacuum workers work from that list, but they just may have an
> > older snapshot of it, then that might help make the score easier to
> > document. It would also allow people to question the scores as I
> > expect at least some people might not agree with the priorities. That
> > would allow us to consider tuning the score calculation if someone
> > points out a deficiency with the current calculation.
> >
> > Also, longer-term, it also doesn't seem that unreasonable that the
> > autovacuum worker might want to refresh the tables_to_process once it
> > finishes a table and if autovacuum_naptime * $value units of time have
> > passed since it was last checked. That would allow the worker to deal
> > with and react accordingly when scores have changed significantly
> > since it last checked.  I mean, it might be days between when
> > autovacuum calculates the scores and finally vacuums the table when
> > the list is long, or if it was tied up with large tables. Other
> > workers may have gotten to some of the tables too, so the score may
> > have dropped, but again made its way above the threshold, but to a
> > lesser extent.
>
> Agreed on both points.

I think we do need some documentation about this behavior, which v6 is
still missing.

Another thing I have been contemplating is what the change in
prioritization, and the resulting difference in the order in which
tables are vacuumed, means for workloads in which autovacuum tuning
that was done with the current assumptions will no longer be beneficial.

Let's imagine staging tables that get created and dropped during
some batch processing window and they see huge data
ingestion/changes. The current scan will make these less of a priority
naturally in relation to other permanent tables, but with the new priority,
we are making these staging tables more of a priority. Users will now
need to maybe turn off autovacuum on a per-table level to prevent this
scenario. That is just one example.

What I am also asking is whether we should provide a way (I hate
to say a GUC) for users to go back to the old behavior. Or am I
overstating the risk here?

--
Sami Imseih
Amazon Web Services (AWS)





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-29 15:51  Nathan Bossart <[email protected]>
  parent: Sami Imseih <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-29 15:51 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, Oct 28, 2025 at 05:44:37PM -0500, Sami Imseih wrote:
> My compiler is complaining about v6
> 
> "../src/backend/postmaster/autovacuum.c:3293:32: warning: operation on
> ‘*score’ may be undefined [-Wsequence-point]
>  3293 |                         *score = *score = Max(*score, (double)
> instuples / Max(vacinsthresh, 1));
> [2/2] Linking target src/backend/postgres"
> 
> shouldn't just be like below?
> 
> *score =Max(*score, (double) instuples / Max(vacinsthresh, 1));

Oops.  I fixed that typo in v7.

-- 
nathan


^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-29 15:58  Nathan Bossart <[email protected]>
  parent: wenhui qiu <[email protected]>
  0 siblings, 0 replies; 143+ messages in thread

From: Nathan Bossart @ 2025-10-29 15:58 UTC (permalink / raw)
  To: wenhui qiu <[email protected]>; +Cc: Sami Imseih <[email protected]>; David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Wed, Oct 29, 2025 at 11:10:55AM +0800, wenhui qiu wrote:
> Typically, DBAs avoid setting autovacuum_freeze_max_age too close to
> vacuum_failsafe_age. Therefore, your logic most likely uses the
> vacuum_failsafe_age value.
> Would taking the average of the two be a better approach?

That approach would begin aggressively scaling the priority of tables
sooner, but I don't know if that's strictly better.  In any case, I'd like
to avoid making the score calculation too magical.

-- 
nathan





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-29 16:07  Nathan Bossart <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-29 16:07 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Wed, Oct 29, 2025 at 10:24:17AM -0500, Sami Imseih wrote:
> I think we do need some documentation about this behavior, which v6 is
> still missing.

Would you be interested in giving that part a try?

> Another thing I have been contemplating is what the change in
> prioritization, and the resulting difference in the order in which
> tables are vacuumed, means for workloads in which autovacuum tuning
> that was done with the current assumptions will no longer be beneficial.
> 
> Let's imagine staging tables that get created and dropped during
> some batch processing window and they see huge data
> ingestion/changes. The current scan will make these less of a priority
> naturally in relation to other permanent tables, but with the new priority,
> we are making these staging tables more of a priority. Users will now
> need to maybe turn off autovacuum on a per-table level to prevent this
> scenario. That is just one example.
> 
> What I am also asking is whether we should provide a way (I hate
> to say a GUC) for users to go back to the old behavior. Or am I
> overstating the risk here?

It's probably worth testing out this scenario, but I can't say I'm terribly
worried.  Those kinds of tables are already getting chosen by autovacuum
earlier due to reltuples == -1, and this patch will just move them to the
front of the list that autovacuum creates.  In any case, I'd really like to
avoid a GUC or fallback switch here.

-- 
nathan





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-30 02:58  wenhui qiu <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: wenhui qiu @ 2025-10-30 02:58 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Sami Imseih <[email protected]>; David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

Hi Nathan,

> That approach would begin aggressively scaling the priority of tables
> sooner, but I don't know if that's strictly better.  In any case, I'd like
> to avoid making the score calculation too magical.
In fact, with the introduction of the vacuum_max_eager_freeze_failure_rate
feature, if a table’s age still exceeds 1.x times the
autovacuum_freeze_max_age, it suggests that the vacuum freeze process is
not functioning properly. Once the age surpasses vacuum_failsafe_age,
wraparound issues are likely to occur soon. Taking the average of
vacuum_failsafe_age and autovacuum_freeze_max_age is not a complex
approach. Under the default configuration, this average already exceeds
four times the autovacuum_freeze_max_age. At that stage, a DBA should have
already intervened to investigate and resolve why the table age is not
decreasing.

Thanks



^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-30 03:41  David Rowley <[email protected]>
  parent: wenhui qiu <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-10-30 03:41 UTC (permalink / raw)
  To: wenhui qiu <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, 30 Oct 2025 at 15:58, wenhui qiu <[email protected]> wrote:
> In fact, with the introduction of the
> vacuum_max_eager_freeze_failure_rate feature, if a table’s age still
> exceeds 1.x times the autovacuum_freeze_max_age, it suggests that
> the vacuum freeze process is not functioning properly. Once the age
> surpasses vacuum_failsafe_age, wraparound issues are likely to occur
> soon. Taking the average of vacuum_failsafe_age and
> autovacuum_freeze_max_age is not a complex approach. Under the default
> configuration, this average already exceeds four times the
> autovacuum_freeze_max_age. At that stage, a DBA should have already
> intervened to investigate and resolve why the table age is not decreasing.

I don't think anyone would like to modify PostgreSQL in any way that
increases the chances that a table gets as old as vacuum_failsafe_age.
Regardless of the order in which tables are vacuumed, if a table gets
as old as that then vacuum is configured to run too slowly, or there
are not enough workers configured to cope with the given amount of
work. I think we need to tackle prioritisation and rate limiting as
two separate items. Nathan is proposing to improve the prioritisation
in this thread and it seems to me that your concerns are with rate
limiting. I've suggested an idea that might help with reducing the
cost_delay based on the score of the table in this thread. I'd rather
not introduce that as a topic for further discussion here (I imagine
Nathan agrees). It's not as if the server is going to consume 1
billion xids in 5 mins. It's going to take at least a day, if not days
or longer, for that to happen and if autovacuum has not managed to get on
top of the workload in that time, then it's configured to run too
slowly and the cost_limit or delay needs to be adjusted.

My concern is that there are countless problems with autovacuum and if
you try and lump them all into a single thread to fix them all at
once, we'll get nowhere. Autovacuum was added to core in 8.1, 20 years
ago and I don't believe we've done anything to change the rate limiting
aside from reducing the default cost_delay since then. It'd be good to
fix that at some point, just not here, please.

FWIW, I agree with Nathan about keeping the score calculation
non-magical. The score should be simple and easy to document. We can
introduce complexity to it as and when it's needed and when the
supporting evidence arrives, rather than from people waving their
hands.

David





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-30 06:48  wenhui qiu <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: wenhui qiu @ 2025-10-30 06:48 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

Hi,
     I think there might be some misunderstanding. I’m only suggesting
changing
effective_xid_failsafe_age = Max(vacuum_failsafe_age,
                                 autovacuum_freeze_max_age * 1.05);
to
effective_xid_failsafe_age = (vacuum_failsafe_age +
                              autovacuum_freeze_max_age) / 2.0;
In the current logic, effective_xid_failsafe_age is almost always equal to
vacuum_failsafe_age.
As a result, increasing the vacuum priority only when a table’s age reaches
vacuum_failsafe_age is too late.
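
To show the difference in numbers, here is a small standalone sketch of
the two formulas side by side. The function names are made up for this
example; the expressions themselves are the ones written above.

```c
#include <assert.h>

#define Max(x, y) ((x) > (y) ? (x) : (y))

/* Current logic: failsafe age, floored at 105% of freeze max age. */
static double
failsafe_age_max(double failsafe_age, double freeze_max_age)
{
    assert(failsafe_age >= 0 && freeze_max_age >= 0);
    return Max(failsafe_age, freeze_max_age * 1.05);
}

/* Suggested alternative: midpoint of the two settings. */
static double
failsafe_age_avg(double failsafe_age, double freeze_max_age)
{
    assert(failsafe_age >= 0 && freeze_max_age >= 0);
    return (failsafe_age + freeze_max_age) / 2.0;
}
```

With the default settings (vacuum_failsafe_age = 1600000000,
autovacuum_freeze_max_age = 200000000), the first form gives 1600000000
and the second gives 900000000, i.e. 4.5 times
autovacuum_freeze_max_age.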


Thanks



^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-30 10:36  David Rowley <[email protected]>
  parent: wenhui qiu <[email protected]>
  0 siblings, 0 replies; 143+ messages in thread

From: David Rowley @ 2025-10-30 10:36 UTC (permalink / raw)
  To: wenhui qiu <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, 30 Oct 2025 at 19:48, wenhui qiu <[email protected]> wrote:
>      I think there might be some misunderstanding — I’m only suggesting changing
> effective_xid_failsafe_age = Max(vacuum_failsafe_age,
>                                  autovacuum_freeze_max_age * 1.05);
> to
> effective_xid_failsafe_age = (vacuum_failsafe_age + autovacuum_freeze_max_age) / 2.0;
> In the current logic, effective_xid_failsafe_age is almost always equal to vacuum_failsafe_age.
> As a result, increasing the vacuum priority only when a table’s age reaches vacuum_failsafe_age is too late.

I understand your proposal. Autovacuum will trigger an anti-wraparound
vacuum at autovacuum_freeze_max_age, so if autovacuum still has not
gotten to the table by the time its age reaches
vacuum_failsafe_age, it means autovacuum isn't working quickly enough
to get through the workload. Therefore the problem is with the speed
of autovacuum, not the priority of autovacuum.

David





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-10-30 20:05  Robert Haas <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Robert Haas @ 2025-10-30 20:05 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Sami Imseih <[email protected]>; David Rowley <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Wed, Oct 29, 2025 at 11:51 AM Nathan Bossart
<[email protected]> wrote:
> Oops.  I fixed that typo in v7.

Are you planning to do some practical experimentation with this? I
feel like it would be a good idea to set up some kind of a test case
where this is expected to provide a benefit and see if it actually
does; and also maybe set up a test case where it will reorder the
tables but with no practical difference in the outcome expected and
verify that, in fact, nothing changes.

-- 
Robert Haas
EDB: http://www.enterprisedb.com






* Re: another autovacuum scheduling thread
@ 2025-10-30 21:02  Nathan Bossart <[email protected]>
  parent: Robert Haas <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-30 21:02 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Sami Imseih <[email protected]>; David Rowley <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, Oct 30, 2025 at 04:05:19PM -0400, Robert Haas wrote:
> Are you planning to do some practical experimentation with this? I
> feel like it would be a good idea to set up some kind of a test case
> where this is expected to provide a benefit and see if it actually
> does; and also maybe set up a test case where it will reorder the
> tables but with no practical difference in the outcome expected and
> verify that, in fact, nothing changes.

Yes.  I've been thinking through how I want to test this but have yet to
actually do so.  If you have ideas, I'm all ears.

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-10-30 21:05  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-10-30 21:05 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Robert Haas <[email protected]>; David Rowley <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> On Thu, Oct 30, 2025 at 04:05:19PM -0400, Robert Haas wrote:
> > Are you planning to do some practical experimentation with this? I
> > feel like it would be a good idea to set up some kind of a test case
> > where this is expected to provide a benefit and see if it actually
> > does; and also maybe set up a test case where it will reorder the
> > tables but with no practical difference in the outcome expected and
> > verify that, in fact, nothing changes.
>
> Yes.  I've been thinking through how I want to test this but have yet to
> actually do so.  If you have ideas, I'm all ears.

FWIW, I've been putting some scripts together to test some workloads
and I will share shortly what I have.

--
Sami Imseih
Amazon Web Services (AWS)






* Re: another autovacuum scheduling thread
@ 2025-10-31 00:38  Sami Imseih <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-10-31 00:38 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Robert Haas <[email protected]>; David Rowley <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> FWIW, I've been putting some scripts together to test some workloads
> and I will share shortly what I have.

Here is my attempt to test the behavior with the new prioritization.
I wanted a way to run the same tests with different workloads, both with
and without the prioritization patch, and to see if anything stands out as
suspicious in terms of autovacuum or autoanalyze activity. For example,
certain tables showing too little or too much autovacuum activity.

The scripts I put together (attached) run a busy update workload (OLTP)
and a separate batch workload. They use pgbench to execute custom scripts
that are generated on the fly.

The results are summarized by the average number of autovacuum and
autoanalyze runs *per table*, along with some other DML activity stats to
ensure that the workloads being compared have similar DML activity.

Using the scripts:

Place the attached scripts in a specific directory, and modify the
section under "Caller should adjust these values" in run_workloads.sh
to adjust the workload. The scripts assume you have a running cluster with
your specific config file adjusted for the test.

Once ready, call run_workloads.sh and at the end a summary will show up
as you see below. Hopefully it works for you :)

The summary.sh script can also be run while the workloads are executing.

Here is an example of a test I wanted to run based on the discussion [0]:

This scenario is one that was mentioned, but there are others in which a
batch process performing inserts only is prioritized over the update
workload.

I ran this test for 10 minutes, using 200 clients for the update workload
and 5 clients for the batch workload, with the following configuration:

```
max_connections = 1000
autovacuum_naptime = '10s'
shared_buffers = '4GB'
autovacuum_max_workers = 6
```


-- HEAD

```
Total Activity
-[ RECORD 1 ]-------------+----------
total_n_dead_tup          | 985183
total_n_mod_since_analyze | 220294866
total_reltuples           | 247690373
total_autovacuum_count    | 137
total_autoanalyze_count   | 470
total_n_tup_upd           | 7720012
total_n_tup_ins           | 446683000
table_count               | 105

Activity By Workload Type
-[ RECORD 1 ]-----------------+----------------
table_group                   | batch_tables
**      avg_autovacuum_count  | 7.400
**      avg_autoanalyze_count | 8.000
avg_vacuum_count              | 0.000
avg_analyze_count             | 0.000
rows_inserted                 | 436683000
rows_updated                  | 0
rows_hot_updated              | 0
table_count                   | 5
-[ RECORD 2 ]-----------------+----------------
table_group                   | numbered_tables
**      avg_autovacuum_count  | 1.000
**      avg_autoanalyze_count | 4.300
avg_vacuum_count              | 1.000
avg_analyze_count             | 0.000
rows_inserted                 | 10000000
rows_updated                  | 7720012
rows_hot_updated              | 7094573
table_count                   | 100

```

-- with v7 applied

```
Total Activity
-[ RECORD 1 ]-------------+----------
total_n_dead_tup          | 1233045
total_n_mod_since_analyze | 137843507
total_reltuples           | 350704437
total_autovacuum_count    | 146
total_autoanalyze_count   | 605
total_n_tup_upd           | 7896354
total_n_tup_ins           | 487974000
table_count               | 105

Activity By Workload Type
-[ RECORD 1 ]-----------------+----------------
table_group                   | batch_tables
**      avg_autovacuum_count  | 11.000
**      avg_autoanalyze_count | 13.200
avg_vacuum_count              | 0.000
avg_analyze_count             | 0.000
rows_inserted                 | 477974000
rows_updated                  | 0
rows_hot_updated              | 0
table_count                   | 5
-[ RECORD 2 ]-----------------+----------------
table_group                   | numbered_tables
**      avg_autovacuum_count  | 0.910
**      avg_autoanalyze_count | 5.390
avg_vacuum_count              | 1.000
avg_analyze_count             | 0.000
rows_inserted                 | 10000000
rows_updated                  | 7896354
rows_hot_updated              | 7123134
table_count                   | 100
```

The results above show what I expected: the batch tables receive higher
priority, as seen from the averages of autovacuum and autoanalyze runs.
This behavior is expected, but it may catch some users by surprise after
an upgrade, since certain tables will now receive more attention than
others. Longer tests might also show more bloat accumulating on heavily
updated tables. In such cases, a user may need to adjust autovacuum
settings on a per-table basis to restore the previous behavior.

So, I am not quite sure what the best way to test is, other than trying
to find these non-steady-state workloads and seeing the impact of the
prioritization change on (auto)vacuum/analyze activity.

Maybe there is a better way?

[0] https://www.postgresql.org/message-id/aQI7tGEs8IOPxG64%40nathan


--
Sami Imseih
Amazon Web Services (AWS)


Attachments:

  [application/x-sh] batch.sh (1.4K, 2-batch.sh)
  [application/x-sh] summary.sh (1.5K, 3-summary.sh)
  [application/x-sh] oltp.sh (3.1K, 4-oltp.sh)
  [application/x-sh] run_workloads.sh (3.3K, 5-run_workloads.sh)


* Re: another autovacuum scheduling thread
@ 2025-10-31 20:12  Nathan Bossart <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-10-31 20:12 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Robert Haas <[email protected]>; David Rowley <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, Oct 30, 2025 at 07:38:15PM -0500, Sami Imseih wrote:
> Here is my attempt to test the behavior with the new prioritization.

Thanks.

> The results above show what I expected: the batch tables receive higher
> priority, as seen from the averages of autovacuum and autoanalyze runs.
> This behavior is expected, but it may catch some users by surprise after
> an upgrade, since certain tables will now receive more attention than
> others. Longer tests might also show more bloat accumulating on heavily
> updated tables. In such cases, a user may need to adjust autovacuum
> settings on a per-table basis to restore the previous behavior.

Interesting.  From these results, it almost sounds as if we're further
amplifying the intended effect of commit 06eae9e.  That could be a good
thing.  Something else I'm curious about is datfrozenxid, i.e., whether
prioritization keeps the database (M)XID ages lower.

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-11-01 01:50  David Rowley <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-11-01 01:50 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Sat, 1 Nov 2025 at 09:12, Nathan Bossart <[email protected]> wrote:
>
> On Thu, Oct 30, 2025 at 07:38:15PM -0500, Sami Imseih wrote:
> > The results above show what I expected: the batch tables receive higher
> > priority, as seen from the averages of autovacuum and autoanalyze runs.
> > This behavior is expected, but it may catch some users by surprise after
> > an upgrade, since certain tables will now receive more attention than
> > others. Longer tests might also show more bloat accumulating on heavily
> > updated tables. In such cases, a user may need to adjust autovacuum
> > settings on a per-table basis to restore the previous behavior.
>
> Interesting.  From these results, it almost sounds as if we're further
> amplifying the intended effect of commit 06eae9e.  That could be a good
> thing.  Something else I'm curious about is datfrozenxid, i.e., whether
> prioritization keeps the database (M)XID ages lower.

I wonder if it would be more realistic to throttle the work simulation
to a certain speed with pgbench -R rather than having it go flat out.
The results show quite a bit higher "rows_inserted" for the
batch_tables with the patched version. Sami didn't mention any changes
to vacuum_cost_limit, so I suspect that autovacuum would be getting
quite behind on this run, which isn't ideal.  Rate limiting to
something that the given vacuum_cost_limit could keep up with seems
more realistic. The fact that the patched version did more insert work
in the batch tables does seem a bit unfair as that gave autovacuum
more work to do in the patched test run which would result in the
lower-scoring tables being neglected more in the patched version.
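
As a quick check on the size of that gap, the relative increase can be
worked out from the rows_inserted and rows_updated figures in the posted
HEAD and v7 summaries. This is only a sketch; pct_increase is a helper
name invented here.

```c
/*
 * Relative increase of "patched" over "baseline", in percent.
 * (hypothetical helper, not part of any script in this thread)
 */
double
pct_increase(double baseline, double patched)
{
	return (patched - baseline) / baseline * 100.0;
}
```

Plugging in the posted numbers, batch_tables went from 436,683,000 to
477,974,000 inserted rows (roughly 9.5% more) and numbered_tables went
from 7,720,012 to 7,896,354 updated rows (roughly 2.3% more).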

This makes me wonder if we should log the score of the table when the
autovacuum starts for the table.  We do calculate the score again in
recheck_relation_needs_vacanalyze() just before doing the
vacuum/analyze, so maybe the score can be stored in the autovac_table
struct and displayed somewhere. Including it in the
log_autovacuum_min_duration / log_autoanalyze_min_duration output might
be useful. It would give DBAs some visibility into how bad things got
before autovacuum got around to working on a given table.

If we logged the score, we could do the "unpatched" test with the
patched code, just with commenting out the
list_sort(tables_to_process, TableToProcessComparator); It'd then be
interesting to zero the log_auto*_min_duration settings and review the
order differences and how high the scores got. Would the average score
be higher or lower with patched version? I'd guess lower since the
higher scoring tables would tend to get vacuumed later with the
unpatched version and their score would be even higher by the time
autovacuum got to them. I think if the average score has gone down at
the point that the vacuum starts, then that's a very good thing. Maybe
we'd need to write a patch to recalculate the "tables_to_process" List
after a table is vacuumed and autovacuum_naptime has elapsed for us to
see this, else the priorities might have become too outdated. I'd
expect that to be even more true when vacuum_cost_limit is configured
too low.
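
To make the comparison concrete, here is a rough sketch of the two
measurements in plain C. It assumes the patch orders tables from highest
to lowest score. The TableScore struct and both function names are
hypothetical stand-ins, not the patch's actual definitions, and qsort()
stands in for list_sort().

```c
#include <stdlib.h>

/* Hypothetical stand-in for the patch's per-table entry. */
typedef struct TableScore
{
	unsigned int oid;
	double		score;
} TableScore;

/* qsort() comparator: highest score first (assumed ordering). */
static int
score_desc_cmp(const void *a, const void *b)
{
	const TableScore *ta = (const TableScore *) a;
	const TableScore *tb = (const TableScore *) b;

	if (ta->score > tb->score)
		return -1;
	if (ta->score < tb->score)
		return 1;
	return 0;
}

/* The step that the "unpatched" run would skip. */
void
sort_by_score_desc(TableScore *tables, size_t ntables)
{
	qsort(tables, ntables, sizeof(TableScore), score_desc_cmp);
}

/* Mean of the logged scores; returns 0 for an empty set. */
double
average_score(const TableScore *tables, size_t ntables)
{
	double		sum = 0.0;

	for (size_t i = 0; i < ntables; i++)
		sum += tables[i].score;
	return ntables > 0 ? sum / ntables : 0.0;
}
```

Running the same workload with and without the sort step, then comparing
average_score() over the logged values, would show whether prioritization
lowers the typical score at the point the vacuum starts.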

David






* Re: another autovacuum scheduling thread
@ 2025-11-01 03:29  David Rowley <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-11-01 03:29 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Sat, 1 Nov 2025 at 14:50, David Rowley <[email protected]> wrote:
> If we logged the score, we could do the "unpatched" test with the
> patched code, just with commenting out the
> list_sort(tables_to_process, TableToProcessComparator); It'd then be
> interesting to zero the log_auto*_min_duration settings and review the
> order differences and how high the scores got. Would the average score
> be higher or lower with patched version? I'd guess lower since the
> higher scoring tables would tend to get vacuumed later with the
> unpatched version and their score would be even higher by the time
> autovacuum got to them. I think if the average score has gone down at
> the point that the vacuum starts, then that's a very good thing. Maybe
> we'd need to write a patch to recalculate the "tables_to_process" List
> after a table is vacuumed and autovacuum_naptime has elapsed for us to
> see this, else the priorities might have become too outdated. I'd
> expect that to be even more true when vacuum_cost_limit is configured
> too low.

I'm not yet sure how meaningful it is, but I tried adding the
following to recheck_relation_needs_vacanalyze():

elog(LOG, "Performing autovacuum of table \"%s\" with score = %f",
get_rel_name(relid), score);

then after grepping the logs and loading the data into a table and performing:

select case patched when true then 'v7' else 'master' end as patched,
       case when left(tab, 11) = 'table_batch' then 'table_batch_*'
            when left(tab, 6) = 'table_' then 'table_*'
            else 'other' end,
       avg(score) as avg_score,
       count(*) as count
from autovac
where score > 0 and score < 2000
group by rollup(1, 2)
order by 2, 1;

with vacuum_cost_limit = 5000, I got:

 patched |     case      |     avg_score      | count
---------+---------------+--------------------+-------
 master  | other         |  2.004997014705882 |    68
 v7      | other         | 1.9668087323943668 |    71
 master  | table_*       |  1.196698981375357 |  1396
 v7      | table_*       | 1.2134741693430646 |  1370
 master  | table_batch_* | 2.1887380086206902 |   116
 v7      | table_batch_* | 1.8882025693430664 |   137
 master  |               | 1.3043197367088595 |  1580
 v7      |               | 1.3059485323193893 |  1578
         |               | 1.3051336187460454 |  3158

It would still be good to do the rate limiting as there's more work
being done in the patched version. Seems to be about 1.1% more rows in
batch_tables and 0.48% more updates in the numbered_tables in the
patched version.

David






* Re: another autovacuum scheduling thread
@ 2025-11-06 22:21  Sami Imseih <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-11-06 22:21 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

Thanks for the ideas on improving the test!

I am still trying to see how useful this type of testing is,
but I will share what I have done.

> I wonder if it would be more realistic to throttle the work simulation
> to a certain speed with pgbench -R rather than having it go flat out.

good point

> > If we logged the score, we could do the "unpatched" test with the
> > patched code, just with commenting out the
> > list_sort(tables_to_process, TableToProcessComparator); It'd then be
> > interesting to zero the log_auto*_min_duration settings and review the
> > order differences and how high the scores got. Would the average score
> > be higher or lower with patched version?

I agree. I attached a patch on top of v7 that implements a debug GUC
to enable or disable sorting for testing purposes.

> I'm not yet sure how meaningful it is, but I tried adding the
> following to recheck_relation_needs_vacanalyze():
>
> elog(LOG, "Performing autovacuum of table \"%s\" with score = %f",
> get_rel_name(relid), score);

The same attached patch also implements this log.

I also spent more time working on the test script. I cleaned it up and
combined it into a single script. I added a few things:

- Ability to run with or without the batch workload.
- OLTP tables are no longer the same size; they are created with
different row counts using a minimum and maximum row count and a
multiplier for scaling the next table.
- A background collector for pg_stat_all_tables on relevant tables,
stored in relstats_monitor.log.
- Logs are saved after the run for further analysis, such as examining
the scores.
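
For reference, the table sizing can be sketched as below. The wrap-around
back to the starting row count once the maximum is exceeded is inferred
from the table names in the attached analysis (table_13_4096000 is
followed by table_14_1000), and oltp_table_rows is a made-up name, so
treat this as an approximation of what the script does.

```c
#include <stddef.h>

/*
 * Fill sizes[0..ntables-1] with per-table row counts: start at "start",
 * multiply by "multiplier" for each following table, and wrap back to
 * "start" once the next value would exceed "max_rows" (inferred rule).
 */
void
oltp_table_rows(long start, long multiplier, long max_rows,
				long *sizes, size_t ntables)
{
	long		rows = start;

	for (size_t i = 0; i < ntables; i++)
	{
		sizes[i] = rows;
		rows *= multiplier;
		if (rows > max_rows)
			rows = start;
	}
}
```

With OLTP_ROWS_START=1000, OLTP_ROWS_MULTIPLIER=2, OLTP_MAX_ROWS=5000000
and 16 tables, this yields 1000, 2000, and so on up to 4096000 for the
13th table, then 1000, 2000 and 4000 for the last three.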

Also attached is analysis for a run with 16 OLTP tables and 3 batch tables.
It shows that with sorting enabled or disabled, the vacuum/analyze activity
does not show any major differences. OLTP had very similar DML and
autovacuum/autoanalyze activity. A few points to highlight:

1/ In the sorted run, we had an equal number of autovacuums/autoanalyze
on the smaller OLTP tables, as if every eligible table needed both
autovacuum and autoanalyze. The unsorted run was less consistent on
the smaller tables. I observed this on several runs. I don't think it's a big
deal, but interesting nonetheless.

2/ Batch tables in the sorted run had less autovacuum time (962,794 ms
vs 1,257,821 ms unsorted), but very similar autovacuum counts.

3/ OLTP tables, on the other hand, had more autovacuum time in the
sorted run (3,852,460 ms vs 3,590,964 ms unsorted), but I do not see
much difference in autovacuum/autoanalyze counts.

Other tests I plan on running:
- batch updates/deletes, since the current batch option only tests append-only
tables.
- OLTP only test.

Also, I am thinking about another sorting strategy based on average
autovacuum/autoanalyze time per table. The idea is to sort ascending by
the greater of the two averages, so workers process quicker tables first
instead of all workers potentially getting hung on the slowest tables.
We can calculate the average now that v18 includes total_autovacuum_time
and total_autoanalyze_time.

The way I see it, regardless of prioritization, a few large tables may
still monopolize autovacuum workers. But at least this way, the quick tables
get a chance to get processed first. Would this be an idea worth testing out?
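
A minimal sketch of that ordering, in plain C rather than backend code:
each table's cost is the larger of its average autovacuum and
autoanalyze times, and tables are sorted ascending by that cost. The
struct and function names are hypothetical; the fields mirror the
pg_stat_all_tables columns. Treating a table with no history as cost 0
is an assumption of this sketch, not something settled above.

```c
#include <stdlib.h>

/* Hypothetical per-table stats, mirroring pg_stat_all_tables columns. */
typedef struct TableTiming
{
	const char *relname;
	double		total_autovacuum_time;	/* ms */
	long		autovacuum_count;
	double		total_autoanalyze_time; /* ms */
	long		autoanalyze_count;
} TableTiming;

/* Average time per run; 0 when there is no history yet (assumption). */
static double
average_ms(double total, long count)
{
	return count > 0 ? total / count : 0.0;
}

/* The greater of the two per-run averages. */
double
expected_cost_ms(const TableTiming *t)
{
	double		av = average_ms(t->total_autovacuum_time, t->autovacuum_count);
	double		aa = average_ms(t->total_autoanalyze_time, t->autoanalyze_count);

	return av > aa ? av : aa;
}

/* qsort() comparator: cheapest table first. */
static int
cost_asc_cmp(const void *a, const void *b)
{
	double		ca = expected_cost_ms((const TableTiming *) a);
	double		cb = expected_cost_ms((const TableTiming *) b);

	if (ca < cb)
		return -1;
	if (ca > cb)
		return 1;
	return 0;
}

void
sort_quickest_first(TableTiming *tables, size_t ntables)
{
	qsort(tables, ntables, sizeof(TableTiming), cost_asc_cmp);
}
```

One caveat with this sketch: a table that has only ever been analyzed
(such as table_13_4096000 in the unsorted run) is ranked by its analyze
average alone, which likely understates how long its first vacuum takes.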

--
Sami Imseih
Amazon Web Services (AWS)

From c24eeeb074acfb790fe44b60b2177482b9afe3c3 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Tue, 4 Nov 2025 15:55:40 +0000
Subject: [PATCH 1/1] autovacuum score logging and sort enable/disable

---
 src/backend/postmaster/autovacuum.c       |  7 ++++++-
 src/backend/utils/misc/guc_parameters.dat | 10 ++++++++++
 src/backend/utils/misc/guc_tables.c       |  6 ++++++
 src/include/postmaster/autovacuum.h       |  8 ++++++++
 4 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index e48bb06253b..ca9c5c615dc 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -333,6 +333,8 @@ static WorkerInfo MyWorkerInfo = NULL;
 /* PID of launcher, valid only in worker while shutting down */
 int			AutovacuumLauncherPid = 0;
 
+int			debug_autovacuum_sort = DEBUG_AUTOVACUUM_SORT_ON;
+
 static Oid	do_start_worker(void);
 static void ProcessAutoVacLauncherInterrupts(void);
 pg_noreturn static void AutoVacLauncherShutdown(void);
@@ -2083,6 +2085,7 @@ do_autovacuum(void)
 			table->oid = relid;
 			table->score = score;
 
+			elog(LOG, "adding table:%s,%lf,av=%d,aa=%d", get_rel_name(table->oid), score, dovacuum, doanalyze);
 			tables_to_process = lappend(tables_to_process, table);
 		}
 
@@ -2184,6 +2187,7 @@ do_autovacuum(void)
 			table->oid = relid;
 			table->score = score;
 
+			elog(LOG, "adding table:%s,%lf,av=1,aa=0", get_rel_name(table->oid), score);
 			tables_to_process = lappend(tables_to_process, table);
 		}
 
@@ -2309,7 +2313,8 @@ do_autovacuum(void)
 		MemoryContextSwitchTo(AutovacMemCxt);
 	}
 
-	list_sort(tables_to_process, TableToProcessComparator);
+	if (debug_autovacuum_sort == DEBUG_AUTOVACUUM_SORT_ON)
+		list_sort(tables_to_process, TableToProcessComparator);
 
 	/*
 	 * Optionally, create a buffer access strategy object for VACUUM to use.
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index d6fc8333850..2bf9ce4ed27 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -3445,6 +3445,16 @@
   options => 'debug_parallel_query_options',
 },
 
+{ name => 'debug_autovacuum_sort', type => 'enum', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Enables/Disables the autovacuum sort of eligible tables.',
+  long_desc => 'This can be useful for testing the effect of sorting eligible tables in autovacuum.',
+  flags => 'GUC_NOT_IN_SAMPLE | GUC_EXPLAIN',
+  variable => 'debug_autovacuum_sort',
+  boot_val => 'DEBUG_AUTOVACUUM_SORT_ON',
+  options => 'debug_autovacuum_sort_options',
+},
+
+
 { name => 'password_encryption', type => 'enum', context => 'PGC_USERSET', group => 'CONN_AUTH_AUTH',
   short_desc => 'Chooses the algorithm for encrypting passwords.',
   variable => 'Password_encryption',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 00c8376cf4d..aaa93d35187 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -404,6 +404,12 @@ static const struct config_enum_entry debug_parallel_query_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry debug_autovacuum_sort_options[] = {
+	{"off", DEBUG_AUTOVACUUM_SORT_OFF, false},
+	{"on", DEBUG_AUTOVACUUM_SORT_ON, false},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry plan_cache_mode_options[] = {
 	{"auto", PLAN_CACHE_MODE_AUTO, false},
 	{"force_generic_plan", PLAN_CACHE_MODE_FORCE_GENERIC_PLAN, false},
diff --git a/src/include/postmaster/autovacuum.h b/src/include/postmaster/autovacuum.h
index 023ac6d5fa8..32688784a06 100644
--- a/src/include/postmaster/autovacuum.h
+++ b/src/include/postmaster/autovacuum.h
@@ -25,6 +25,14 @@ typedef enum
 	AVW_BRINSummarizeRange,
 } AutoVacuumWorkItemType;
 
+/* possible values for debug_autovacuum_sort */
+typedef enum
+{
+	DEBUG_AUTOVACUUM_SORT_OFF,
+	DEBUG_AUTOVACUUM_SORT_ON,
+}			DebugAutovacuumSortMode;
+
+extern PGDLLIMPORT int debug_autovacuum_sort;
 
 /* GUC variables */
 extern PGDLLIMPORT bool autovacuum_start_daemon;
-- 
2.43.0



################################################################
## SORT OFF
################################################################

### from config

BASE_DIR=$HOME/test_autovacuum_prioritization
OLTP_TABLES=16
OLTP_ROWS_START=1000
OLTP_ROWS_MULTIPLIER=2
OLTP_MAX_ROWS=5000000
BATCH_TABLES=3
BATCH_SIZE=100000
BATCH_CONNECTIONS=5
OLTP_CONNECTIONS=200
TIMEOUT=1800
OLTP_RATE=15000
BATCH_SLEEP=5
BUCKETS=15

### from summary_report.txt

=== Database Settings ===
                 name                  |  setting  
---------------------------------------+-----------
 autovacuum                            | on
 autovacuum_analyze_scale_factor       | 0.1
 autovacuum_analyze_threshold          | 50
 autovacuum_freeze_max_age             | 200000000
 autovacuum_max_workers                | 6
 autovacuum_multixact_freeze_max_age   | 400000000
 autovacuum_naptime                    | 5
 autovacuum_vacuum_cost_delay          | 2
 autovacuum_vacuum_cost_limit          | -1
 autovacuum_vacuum_insert_scale_factor | 0.2
 autovacuum_vacuum_insert_threshold    | 1000
 autovacuum_vacuum_max_threshold       | 100000000
 autovacuum_vacuum_scale_factor        | 0.2
 autovacuum_vacuum_threshold           | 50
 autovacuum_work_mem                   | -1
 autovacuum_worker_slots               | 16
 debug_autovacuum_sort                 | off    <<<-----------------
 log_autovacuum_min_duration           | 600000
 max_connections                       | 1000
 shared_buffers                        | 1048576
(20 rows)

=== Total Activity ===
Expanded display is on.
-[ RECORD 1 ]----------------+----------
total_n_dead_tup             | 3555172
total_n_mod_since_analyze    | 20714669
total_reltuples              | 141795153
total_autovacuum_count       | 1890
total_autoanalyze_count      | 2004
total_n_tup_upd              | 26995189
total_n_tup_hot_upd          | 2
total_n_tup_newpage_upd      | 3868918
total_n_tup_ins              | 161298000
total_total_autovacuum_time  | 4847051
total_total_autoanalyze_time | 695004
avg_autovacuum_time          | 2564.58
avg_autoanalyze_time         | 346.81
table_count                  | 19

### last snapshot from relstats_monitor.log

 ?column? |          timestamp           |     relname      | reltuples | n_dead_tup | av_count | aa_count | total_av_time | total_aa_time | n_tup_upd | n_tup_hot_upd | n_tup_ins | avg_av_time | avg_aa_time 
----------+------------------------------+------------------+-----------+------------+----------+----------+---------------+---------------+-----------+---------------+-----------+-------------+-------------
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_batch_1    |  49200700 |          0 |       17 |       30 |        541848 |         46902 |         0 |             0 |  52000000 |    31873.41 |     1563.40
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_batch_2    |  44600500 |          0 |       17 |       26 |        375595 |         53137 |         0 |             0 |  49300000 |    22093.82 |     2043.73
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_batch_3    |  39800400 |          0 |       18 |       28 |        340378 |         43500 |         0 |             0 |  51200000 |    18909.89 |     1553.57
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_13_4096000 |   4094410 |    1676409 |        0 |        2 |             0 |          9029 |   1680467 |             0 |   4096000 |        0.00 |     4514.50
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_12_2048000 |   2046860 |     803638 |        2 |        4 |        931598 |         41897 |   1680863 |             0 |   2048000 |   465799.00 |    10474.25
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_11_1024000 |   1023180 |     493988 |        5 |        8 |        924292 |         54659 |   1681582 |             0 |   1024000 |   184858.40 |     6832.38
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_10_512000  |    511256 |     171819 |       12 |       19 |        728662 |         77863 |   1680512 |             0 |    512000 |    60721.83 |     4098.05
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_9_256000   |    255997 |      93717 |       24 |       41 |        251528 |         61078 |   1680115 |             0 |    256000 |    10480.33 |     1489.71
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_8_128000   |    127970 |      37869 |       42 |       60 |        166034 |         59070 |   1678672 |             0 |    128000 |     3953.19 |      984.50
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_7_64000    |     63991 |      44945 |       74 |       99 |        116048 |         54583 |   1680251 |             0 |     64000 |     1568.22 |      551.34
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_6_32000    |     32000 |       9673 |      104 |      116 |         84778 |         41956 |   1682235 |             0 |     32000 |      815.17 |      361.69
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_5_16000    |     16000 |      49469 |      186 |      190 |         74828 |         36245 |   1680406 |             0 |     16000 |      402.30 |      190.76
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_4_8000     |      7996 |       4309 |      174 |      179 |         53884 |         22075 |   1682471 |             0 |      8000 |      309.68 |      123.32
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_3_4000     |      3992 |      98236 |      214 |      215 |         49471 |         19975 |   1681034 |             0 |      4000 |      231.17 |       92.91
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_16_4000    |      3988 |      39577 |      184 |      182 |         45433 |         16430 |   1679804 |             0 |      4000 |      246.92 |       90.27
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_15_2000    |      2000 |       1753 |      199 |      194 |         42538 |         14346 |   1682234 |             0 |      2000 |      213.76 |       73.95
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_2_2000     |      1999 |       5668 |      203 |      200 |         42430 |         14802 |   1681089 |             0 |      2000 |      209.01 |       74.01
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_14_1000    |      1000 |       3934 |      203 |      199 |         38499 |         13840 |   1678710 |             1 |      1000 |      189.65 |       69.55
 TAB_DATA | 2025-11-06 20:13:17.08327+00 | table_1_1000     |       903 |      55323 |      210 |      210 |         36441 |         12661 |   1681126 |             1 |      1000 |      173.53 |       60.29
(19 rows)


################################################################
## SORT ON
################################################################

### from config
BASE_DIR=$HOME/test_autovacuum_prioritization
OLTP_TABLES=16
OLTP_ROWS_START=1000
OLTP_ROWS_MULTIPLIER=2
OLTP_MAX_ROWS=5000000
BATCH_TABLES=3
BATCH_SIZE=100000
BATCH_CONNECTIONS=5
OLTP_CONNECTIONS=200
TIMEOUT=1800
OLTP_RATE=15000
BATCH_SLEEP=5
BUCKETS=15

### from summary_report.txt

=== Database Settings ===
                 name                  |  setting  
---------------------------------------+-----------
 autovacuum                            | on
 autovacuum_analyze_scale_factor       | 0.1
 autovacuum_analyze_threshold          | 50
 autovacuum_freeze_max_age             | 200000000
 autovacuum_max_workers                | 6
 autovacuum_multixact_freeze_max_age   | 400000000
 autovacuum_naptime                    | 5
 autovacuum_vacuum_cost_delay          | 2
 autovacuum_vacuum_cost_limit          | -1
 autovacuum_vacuum_insert_scale_factor | 0.2
 autovacuum_vacuum_insert_threshold    | 1000
 autovacuum_vacuum_max_threshold       | 100000000
 autovacuum_vacuum_scale_factor        | 0.2
 autovacuum_vacuum_threshold           | 50
 autovacuum_work_mem                   | -1
 autovacuum_worker_slots               | 16
 debug_autovacuum_sort                 | on       <<-----------
 log_autovacuum_min_duration           | 600000
 max_connections                       | 1000
 shared_buffers                        | 1048576
(20 rows)

=== Total Activity ===
Expanded display is on.
-[ RECORD 1 ]----------------+----------
total_n_dead_tup             | 3596571
total_n_mod_since_analyze    | 33295739
total_reltuples              | 128920339
total_autovacuum_count       | 1923
total_autoanalyze_count      | 2036
total_n_tup_upd              | 26992249
total_n_tup_hot_upd          | 4
total_n_tup_newpage_upd      | 3714297
total_n_tup_ins              | 161398000
total_total_autovacuum_time  | 5040892
total_total_autoanalyze_time | 735082
avg_autovacuum_time          | 2621.37
avg_autoanalyze_time         | 361.04
table_count                  | 19

### last snapshot from relstats_monitor.log

 ?column? |           timestamp           |     relname      | reltuples | n_dead_tup | av_count | aa_count | total_av_time | total_aa_time | n_tup_upd | n_tup_hot_upd | n_tup_ins | avg_av_time | avg_aa_time 
----------+-------------------------------+------------------+-----------+------------+----------+----------+---------------+---------------+-----------+---------------+-----------+-------------+-------------
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_batch_2    |  45647800 |          0 |       19 |       32 |        397861 |         58865 |         0 |             0 |  52000000 |    20940.05 |     1839.53
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_batch_3    |  42198400 |          0 |       16 |       26 |        549624 |         48589 |         0 |             0 |  47100000 |    34351.50 |     1868.81
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_batch_1    |  32878300 |          0 |       16 |       29 |        244947 |         35940 |         0 |             0 |  53500000 |    15309.19 |     1239.31
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_13_4096000 |   4095080 |    1674368 |        0 |        2 |             0 |         11957 |   1680719 |             0 |   4096000 |        0.00 |     5978.50
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_12_2048000 |   2048010 |     785834 |        2 |        4 |        973686 |         47742 |   1683000 |             0 |   2048000 |   486843.00 |    11935.50
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_11_1024000 |   1023560 |     471569 |        5 |        8 |       1012300 |         61222 |   1677795 |             0 |   1024000 |   202460.00 |     7652.75
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_10_512000  |    513070 |      65104 |       12 |       18 |        771222 |         64852 |   1679348 |             0 |    512000 |    64268.50 |     3602.89
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_9_256000   |    255877 |     104537 |       22 |       34 |        263788 |         52513 |   1680026 |             0 |    256000 |    11990.36 |     1544.50
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_8_128000   |    127974 |      52589 |       40 |       57 |        182622 |         58427 |   1679095 |             0 |    128000 |     4565.55 |     1025.04
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_7_64000    |     63995 |      15459 |       71 |       88 |        123084 |         50789 |   1680143 |             0 |     64000 |     1733.58 |      577.15
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_6_32000    |     31998 |      18651 |      115 |      130 |         97063 |         47557 |   1678753 |             0 |     32000 |      844.03 |      365.82
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_5_16000    |     16000 |      21281 |      170 |      170 |         77925 |         35527 |   1680055 |             0 |     16000 |      458.38 |      208.98
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_4_8000     |      7997 |      23487 |      195 |      197 |         65038 |         27777 |   1682290 |             0 |      8000 |      333.53 |      141.00
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_3_4000     |      4000 |      26826 |      206 |      206 |         56811 |         22326 |   1682540 |             1 |      4000 |      275.78 |      108.38
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_16_4000    |      3999 |      25514 |      206 |      206 |         56515 |         22142 |   1681183 |             0 |      4000 |      274.34 |      107.49
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_15_2000    |      2000 |      28429 |      207 |      207 |         48227 |         19677 |   1681553 |             2 |      2000 |      232.98 |       95.06
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_2_2000     |      2000 |      28247 |      207 |      207 |         47682 |         19536 |   1679615 |             0 |      2000 |      230.35 |       94.38
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_1_1000     |      1000 |      30309 |      207 |      207 |         36442 |         16492 |   1680101 |             0 |      1000 |      176.05 |       79.67
 TAB_DATA | 2025-11-06 20:46:26.073153+00 | table_14_1000    |       999 |      30314 |      207 |      207 |         36055 |         16875 |   1681990 |             1 |      1000 |      174.18 |       81.52
(19 rows)


Attachments:

  [text/plain] 0001-autovacuum-score-logging-and-sort-enable-disable.txt (4.1K, 2-0001-autovacuum-score-logging-and-sort-enable-disable.txt)
  download | inline diff:
From c24eeeb074acfb790fe44b60b2177482b9afe3c3 Mon Sep 17 00:00:00 2001
From: Ubuntu <[email protected]>
Date: Tue, 4 Nov 2025 15:55:40 +0000
Subject: [PATCH 1/1] autovacuum score logging and sort enable/disable

---
 src/backend/postmaster/autovacuum.c       |  7 ++++++-
 src/backend/utils/misc/guc_parameters.dat | 10 ++++++++++
 src/backend/utils/misc/guc_tables.c       |  6 ++++++
 src/include/postmaster/autovacuum.h       |  8 ++++++++
 4 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index e48bb06253b..ca9c5c615dc 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -333,6 +333,8 @@ static WorkerInfo MyWorkerInfo = NULL;
 /* PID of launcher, valid only in worker while shutting down */
 int			AutovacuumLauncherPid = 0;
 
+int			debug_autovacuum_sort = DEBUG_AUTOVACUUM_SORT_ON;
+
 static Oid	do_start_worker(void);
 static void ProcessAutoVacLauncherInterrupts(void);
 pg_noreturn static void AutoVacLauncherShutdown(void);
@@ -2083,6 +2085,7 @@ do_autovacuum(void)
 			table->oid = relid;
 			table->score = score;
 
+			elog(LOG, "adding table:%s,%lf,av=%d,aa=%d", get_rel_name(table->oid), score, dovacuum, doanalyze);
 			tables_to_process = lappend(tables_to_process, table);
 		}
 
@@ -2184,6 +2187,7 @@ do_autovacuum(void)
 			table->oid = relid;
 			table->score = score;
 
+			elog(LOG, "adding table:%s,%lf,av=1,aa=0", get_rel_name(table->oid), score);
 			tables_to_process = lappend(tables_to_process, table);
 		}
 
@@ -2309,7 +2313,8 @@ do_autovacuum(void)
 		MemoryContextSwitchTo(AutovacMemCxt);
 	}
 
-	list_sort(tables_to_process, TableToProcessComparator);
+	if (debug_autovacuum_sort == DEBUG_AUTOVACUUM_SORT_ON)
+		list_sort(tables_to_process, TableToProcessComparator);
 
 	/*
 	 * Optionally, create a buffer access strategy object for VACUUM to use.
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index d6fc8333850..2bf9ce4ed27 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -3445,6 +3445,16 @@
   options => 'debug_parallel_query_options',
 },
 
+{ name => 'debug_autovacuum_sort', type => 'enum', context => 'PGC_USERSET', group => 'DEVELOPER_OPTIONS',
+  short_desc => 'Enables/Disables the autovacuum sort of eligible tables.',
+  long_desc => 'This can be useful for testing the effect of sorting eligible tables in autovacuum.',
+  flags => 'GUC_NOT_IN_SAMPLE | GUC_EXPLAIN',
+  variable => 'debug_autovacuum_sort',
+  boot_val => 'DEBUG_AUTOVACUUM_SORT_ON',
+  options => 'debug_autovacuum_sort_options',
+},
+
+
 { name => 'password_encryption', type => 'enum', context => 'PGC_USERSET', group => 'CONN_AUTH_AUTH',
   short_desc => 'Chooses the algorithm for encrypting passwords.',
   variable => 'Password_encryption',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 00c8376cf4d..aaa93d35187 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -404,6 +404,12 @@ static const struct config_enum_entry debug_parallel_query_options[] = {
 	{NULL, 0, false}
 };
 
+static const struct config_enum_entry debug_autovacuum_sort_options[] = {
+	{"off", DEBUG_AUTOVACUUM_SORT_OFF, false},
+	{"on", DEBUG_AUTOVACUUM_SORT_ON, false},
+	{NULL, 0, false}
+};
+
 static const struct config_enum_entry plan_cache_mode_options[] = {
 	{"auto", PLAN_CACHE_MODE_AUTO, false},
 	{"force_generic_plan", PLAN_CACHE_MODE_FORCE_GENERIC_PLAN, false},
diff --git a/src/include/postmaster/autovacuum.h b/src/include/postmaster/autovacuum.h
index 023ac6d5fa8..32688784a06 100644
--- a/src/include/postmaster/autovacuum.h
+++ b/src/include/postmaster/autovacuum.h
@@ -25,6 +25,14 @@ typedef enum
 	AVW_BRINSummarizeRange,
 } AutoVacuumWorkItemType;
 
+/* possible values for debug_autovacuum_sort */
+typedef enum
+{
+	DEBUG_AUTOVACUUM_SORT_OFF,
+	DEBUG_AUTOVACUUM_SORT_ON,
+}			DebugAutovacuumSortMode;
+
+extern PGDLLIMPORT int debug_autovacuum_sort;
 
 /* GUC variables */
 extern PGDLLIMPORT bool autovacuum_start_daemon;
-- 
2.43.0



  [text/plain] analysis.txt (12.8K, 3-analysis.txt)
  download | inline:


  [application/x-sh] test_autovacuum_prioritization.sh (14.9K, 4-test_autovacuum_prioritization.sh)
  download

^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-11-06 23:05  David Rowley <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-11-06 23:05 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Fri, 7 Nov 2025 at 11:21, Sami Imseih <[email protected]> wrote:
> Also, I am thinking about another sorting strategy based on average
> autovacuum/autoanalyze time per table. The idea is to sort ascending by
> the greater of the two averages, so workers process quicker tables first
> instead of all workers potentially getting hung on the slowest tables.
> We can calculate the average now that v18 includes total_autovacuum_time
> and total_autoanalyze_time.
>
> The way I see it, regardless of prioritization, a few large tables may
> still monopolize autovacuum workers. But at least this way, the quick tables
> get a chance to get processed first. Will this be an idea worth testing out?

This sounds like a terrible idea to me. It'll mean any table that
starts taking longer due to autovacuum neglect will have its priority
dropped for next time which will result in further neglect. If
vacuum_cost_limit is too low, then the tables in need of vacuum the
most could end up last in the queue. I also don't see how you'd handle
the fact that analyze is likely to be faster than vacuum. Tables that
only need an analyze would just come last with no regard for how
outdated the statistics are?

I'm confused at why we'd have set up our autovacuum trigger points as
they are today because we think those are good times to do a
vacuum/analyze, but then prioritise on something completely different.
Surely if we think 20% dead tuples is worth a vacuum, we must
therefore think that 40% dead tuples are even more worthwhile?! I just
cannot comprehend why we'd deviate from making the priority the
percentage over the trigger point here.  If we come to the conclusion
that we want something else, then maybe our trigger point threshold
method also needs to be redefined. There certainly have been
complaints about 20% of a huge table being too much (I guess
autovacuum_vacuum_max_threshold is our answer to trying to fix that
one).
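
A minimal sketch of "percentage over the trigger point" as a priority, not
taken from any patch in this thread.  It assumes the trigger point is
autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples,
capped by autovacuum_vacuum_max_threshold; the struct and function names
are hypothetical.

```c
#include <assert.h>

/* Hypothetical per-table inputs; field names are illustrative only. */
typedef struct TableStats
{
    const char *relname;
    double      reltuples;      /* estimated live tuples */
    double      n_dead_tup;     /* dead tuples reported by statistics */
} TableStats;

/*
 * Dead-tuple trigger point: base threshold plus scale factor times
 * reltuples, capped at max_thresh when the cap is non-negative.
 */
static double
vacuum_trigger_point(const TableStats *t, double base_thresh,
                     double scale_factor, double max_thresh)
{
    double      thresh = base_thresh + scale_factor * t->reltuples;

    if (max_thresh >= 0 && thresh > max_thresh)
        thresh = max_thresh;
    return thresh;
}

/* Score: how far past the trigger point a table is (1.0 = exactly at it). */
static double
vacuum_score(const TableStats *t, double base_thresh,
             double scale_factor, double max_thresh)
{
    double      thresh = vacuum_trigger_point(t, base_thresh,
                                              scale_factor, max_thresh);

    assert(thresh > 0);
    return t->n_dead_tup / thresh;
}
```

With the settings from the test runs (threshold 50, scale factor 0.2) and
the SORT OFF snapshot, table_3_4000 (3992 tuples, 98236 dead) scores
roughly 116, while table_13_4096000 (4094410 tuples, 1676409 dead) scores
roughly 2, so the small table would sort far ahead of the large one.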

David






^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-11-07 19:22  Sami Imseih <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-11-07 19:22 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> On Fri, 7 Nov 2025 at 11:21, Sami Imseih <[email protected]> wrote:
> > Also, I am thinking about another sorting strategy based on average
> > autovacuum/autoanalyze time per table. The idea is to sort ascending by
> > the greater of the two averages, so workers process quicker tables first
> > instead of all workers potentially getting hung on the slowest tables.
> > We can calculate the average now that v18 includes total_autovacuum_time
> > and total_autoanalyze_time.
> >
> > The way I see it, regardless of prioritization, a few large tables may
> > still monopolize autovacuum workers. But at least this way, the quick tables
> > get a chance to get processed first. Will this be an idea worth testing out?
>
> This sounds like a terrible idea to me. It'll mean any table that
> starts taking longer due to autovacuum neglect will have its priority
> dropped for next time which will result in further neglect.

Yes, that is a possibility, but I am not sure how we can actually
avoid these scenarios. The flip side is that we are giving the
eligible fast tables more of a chance to get vacuumed, rather than
being backed up because the workers are all occupied on the
larger tables.

> If vacuum_cost_limit is too low, then the tables in need of vacuum the
> most could end up last in the queue. I also don't see how you'd handle
> the fact that analyze is likely to be faster than vacuum. Tables that
> only need an analyze would just come last with no regard for how
> outdated the statistics are?

In the "doanalyze"-only case, we will look at the average autoanalyze time,
which will push these types of tables to the front of the queue, not the back.

> I'm confused at why we'd have set up our autovacuum trigger points as
> they are today because we think those are good times to do a
> vacuum/analyze, but then prioritise on something completely different.
> Surely if we think 20% dead tuples is worth a vacuum, we must
> therefore think that 40% dead tuples are even more worthwhile?!

Sure, but thresholds alone don't indicate anything about how quickly
the table can be vacuumed, # of indexes, per-table a/v settings, etc.
The average a/v time is a good proxy to determine this.

What I am suggesting here is we think beyond thresholds for
prioritization, and to give a chance for more eligible tables to get
autovacuumed rather than workers being saturated on some
of the slowest-to-vacuum tables.
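
A rough sketch of the ordering proposed above: sort ascending by the
greater of the average autovacuum and autoanalyze times.  This is only an
illustration, not code from any posted patch; the struct, its fields, and
the function names are hypothetical, and treating a zero count as an
average of 0 is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-table timing data, mirroring the v18 stats columns. */
typedef struct TableTiming
{
    const char *relname;
    double      total_av_time;  /* total autovacuum time, in ms */
    long        av_count;       /* number of autovacuums */
    double      total_aa_time;  /* total autoanalyze time, in ms */
    long        aa_count;       /* number of autoanalyzes */
} TableTiming;

/* Average time per run; assumed to be 0 when there were no runs. */
static double
avg_or_zero(double total, long count)
{
    return count > 0 ? total / (double) count : 0.0;
}

/* Sort key: the greater of the two averages. */
static double
sort_key(const TableTiming *t)
{
    double      av = avg_or_zero(t->total_av_time, t->av_count);
    double      aa = avg_or_zero(t->total_aa_time, t->aa_count);

    return av > aa ? av : aa;
}

/* qsort comparator: ascending by sort key, so quicker tables come first. */
static int
compare_by_avg_time(const void *a, const void *b)
{
    double      ka = sort_key((const TableTiming *) a);
    double      kb = sort_key((const TableTiming *) b);

    return (ka > kb) - (ka < kb);
}

static void
sort_quickest_first(TableTiming *tables, size_t ntables)
{
    assert(tables != NULL || ntables == 0);
    qsort(tables, ntables, sizeof(TableTiming), compare_by_avg_time);
}
```

One consequence of the zero-count assumption: a table that has never been
autovacuumed, such as table_13_4096000 in the snapshots, is keyed only by
its analyze average and so sorts earlier than its vacuum cost would suggest.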

--
Sami Imseih
Amazon Web Services (AWS)





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-11-11 00:58  David Rowley <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: David Rowley @ 2025-11-11 00:58 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Sat, 8 Nov 2025 at 08:23, Sami Imseih <[email protected]> wrote:
> > I'm confused at why we'd have set up our autovacuum trigger points as
> > they are today because we think those are good times to do a
> > vacuum/analyze, but then prioritise on something completely different.
> > Surely if we think 20% dead tuples is worth a vacuum, we must
> > therefore think that 40% dead tuples are even more worthwhile?!
>
> Sure, but thresholds alone don't indicate anything about how quickly
> the table can be vacuumed, # of indexes, per-table a/v settings, etc.
> The average a/v time is a good proxy to determine this.
>
> What I am suggesting here is we think beyond thresholds for
> prioritization, and to give a chance for more eligible tables to get
> autovacuumed rather than workers being saturated on some
> of the slowest-to-vacuum tables.

Can you define "more eligible" here?

I think I'm not really grasping this because I don't understand why
faster-to-vacuum tables should be prioritised over slower-to-vacuum
tables. Can you explain why you think this is important?

I do understand that in your script the OLTP tables received less
attention than when unpatched, but it wasn't obvious to me why this was an
issue. If it's a case of autovacuum acting on a stale score after it
obtained the list of tables and their scores, do things look different
if we have the autovacuum worker refresh the list and scores after
it's done with a table and autovacuum_naptime has elapsed since the
list was last refreshed?
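
A small sketch of that refresh rule, under the assumption that a worker
checks it each time it finishes a table.  Nothing here comes from the
posted patches; the struct and function names are hypothetical and
naptime_secs merely stands in for autovacuum_naptime.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical worker-side bookkeeping for the table list. */
typedef struct WorkerListState
{
    time_t      last_refresh;   /* when the list and scores were last built */
    int         naptime_secs;   /* stand-in for autovacuum_naptime */
} WorkerListState;

/* The list is stale once at least naptime_secs have passed since the refresh. */
static bool
list_is_stale(const WorkerListState *state, time_t now)
{
    assert(state->naptime_secs >= 0);
    return difftime(now, state->last_refresh) >= (double) state->naptime_secs;
}

/*
 * Called after a worker finishes a table.  Returns true if the list was
 * (notionally) rebuilt; a real worker would recompute and re-sort here.
 */
static bool
maybe_refresh(WorkerListState *state, time_t now)
{
    if (!list_is_stale(state, now))
        return false;
    state->last_refresh = now;
    return true;
}
```

Because the check only runs between tables, a worker stuck on one slow
table still acts on the old list until that table is done.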

David





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-11-11 16:36  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  1 sibling, 2 replies; 143+ messages in thread

From: Nathan Bossart @ 2025-11-11 16:36 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers


Still catching up on the latest discussion, but here's a v8 patch that
amends the DEBUG3 in relation_needs_vacanalyze() to also log the score.  I
might attempt to add some sort of brief documentation about autovacuum
prioritization next.

From skimming the latest discussion, I gather we might want to consider
re-sorting the list periodically.  Is the idea that we'll re-sort the
remaining tables in the list, or that we'll basically restart
do_autovacuum()?  If it's the latter, then we'll need to come up with some
way to decide when to stop for the current database.  Right now, we just go
through pg_class and call it a day.



^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-11-11 19:43  Robert Treat <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Robert Treat @ 2025-11-11 19:43 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, Nov 11, 2025 at 11:36 AM Nathan Bossart
<[email protected]> wrote:
>
> Still catching up on the latest discussion, but here's a v8 patch that
> amends the DEBUG3 in relation_needs_vacanalyze() to also log the score.  I
> might attempt to add some sort of brief documentation about autovacuum
> prioritization next.
>
> From skimming the latest discussion, I gather we might want to consider
> re-sorting the list periodically.  Is the idea that we'll re-sort the
> remaining tables in the list, or that we'll basically restart
> do_autovacuum()?  If it's the latter, then we'll need to come up with some
> way to decide when to stop for the current database.  Right now, we just go
> through pg_class and call it a day.
>

FWIW, when I have built these types of systems in the past, and when I
wanted an aggressive recheck-type mechanism, the most common methods
involved tying it to autovacuum_max_workers. This usually was done
under the assumption that generating the list was relatively cheap and
that higher xid age would generate higher priority candidates. Of
course I also was biased towards having it be user controllable at the
database level (i.e., no need to modify some control file or cron job or
whatever). To the degree those things are aligned here, there is at
least some anecdata that this is a usable setting.


Robert Treat
https://xzilla.net






* Re: another autovacuum scheduling thread
@ 2025-11-11 19:48  Nathan Bossart <[email protected]>
  parent: Robert Treat <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-11-11 19:48 UTC (permalink / raw)
  To: Robert Treat <[email protected]>; +Cc: David Rowley <[email protected]>; Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, Nov 11, 2025 at 02:43:19PM -0500, Robert Treat wrote:
> FWIW, when I have built these types of systems in the past, and when I
> wanted an aggressive recheck-type mechanism, the most common methods
> involved tying it to autovacuum_max_workers.

Would you mind elaborating on this point?  Do you mean that you'd rebuild
the list every a_m_w tables, or something else?

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-11-11 19:50  Robert Treat <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Robert Treat @ 2025-11-11 19:50 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, Nov 11, 2025 at 2:49 PM Nathan Bossart <[email protected]> wrote:
> On Tue, Nov 11, 2025 at 02:43:19PM -0500, Robert Treat wrote:
> > FWIW, when I have built these types of systems in the past, and when I
> > wanted an aggressive recheck-type mechanism, the most common methods
> > involved tying it to autovacuum_max_workers.
>
> Would you mind elaborating on this point?  Do you mean that you'd rebuild
> the list every a_m_w tables, or something else?
>

Yes.


Robert Treat
https://xzilla.net






* Re: another autovacuum scheduling thread
@ 2025-11-11 20:03  David Rowley <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-11-11 20:03 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Wed, 12 Nov 2025 at 05:36, Nathan Bossart <[email protected]> wrote:
> From skimming the latest discussion, I gather we might want to consider
> re-sorting the list periodically.  Is the idea that we'll re-sort the
> remaining tables in the list, or that we'll basically restart
> do_autovacuum()?  If it's the latter, then we'll need to come up with some
> way to decide when to stop for the current database.  Right now, we just go
> through pg_class and call it a day.

I'm still trying to work out what Sami sees in the results that he
doesn't think is good. I resuggested he try coding up the periodic
refresh-the-list code to see if it makes the thing he sees better. I
was hoping that we could get away with not doing that for stage 1 of
this. My concern there is that these change-the-way-autovacuum-works
patches seem to blow up quickly as everyone chips in with autovacuum
problems they want fixed and expects the patch to do it all.

That said, the periodic refresh probably isn't too hard. I suspected
it was something like:

    /*
     * When enough time has passed, refresh the list to ensure the
     * scores aren't too out-of-date.
     */
    if (time is > lastcheck + autovacuum_naptime * <something>)
    {
        list_free_deep(tables_to_process);
        goto the_top;
    }
} /* end of foreach(cell, tables_to_process) */
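
For illustration only, the time check in the pseudocode above could look
something like the standalone sketch below. It is not code from the patch:
list_needs_refresh() and its parameters are made-up names, times are plain
seconds rather than TimestampTz, and "multiplier" stands in for the
<something> placeholder, whose value is still undecided.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): decide whether the worker's
 * table list is stale enough to rebuild.  All times are in seconds.
 */
bool
list_needs_refresh(int64_t now, int64_t last_refresh,
                   int naptime_secs, int multiplier)
{
    assert(naptime_secs > 0 && multiplier > 0);

    /* Refresh only once strictly more than naptime * multiplier has passed. */
    return now - last_refresh > (int64_t) naptime_secs * multiplier;
}
```

A worker would call this after finishing each table and, when it returns
true, free the list and start over from the top.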

Perhaps if the test cases we're going to give this involve lengthy
autovacuum runs, then we might need that patch sooner. I'm uncertain
if that's the case with Sami's test. There were some 50GB tables, so I
imagine some of the runs could take a long time, especially so when
running standard vacuum_cost_limit.

David






* Re: another autovacuum scheduling thread
@ 2025-11-11 20:13  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2025-11-11 20:13 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Wed, Nov 12, 2025 at 09:03:54AM +1300, David Rowley wrote:
> I'm still trying to work out what Sami sees in the results that he
> doesn't think is good. I resuggested he try coding up the periodic
> refresh-the-list code to see if it makes the thing he sees better. I
> was hoping that we could get away with not doing that for stage 1 of
> this. My concern there is that these change-the-way-autovacuum-works
> patches seem to blow up quickly as everyone chips in with autovacuum
> problems they want fixed and expects the patch to do it all.

+1

> That said, the periodic refresh probably isn't too hard. I suspected
> it was something like:
> 
>     /*
>      * When enough time has passed, refresh the list to ensure the
>      * scores aren't too out-of-date.
>      */
>     if (time is > lastcheck + autovacuum_naptime * <something>)
>     {
>         list_free_deep(tables_to_process);
>         goto the_top;
>     }
> } /* end of foreach(cell, tables_to_process) */

My concern is that this might add already-processed tables back to the
list, so a worker might never be able to clear it.  Maybe that's not a real
problem in practice for some reason, but it does feel like a step too far
for stage 1, as you said above.

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-11-11 20:16  Nathan Bossart <[email protected]>
  parent: Robert Treat <[email protected]>
  0 siblings, 0 replies; 143+ messages in thread

From: Nathan Bossart @ 2025-11-11 20:16 UTC (permalink / raw)
  To: Robert Treat <[email protected]>; +Cc: David Rowley <[email protected]>; Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, Nov 11, 2025 at 02:50:55PM -0500, Robert Treat wrote:
> On Tue, Nov 11, 2025 at 2:49 PM Nathan Bossart <[email protected]> wrote:
>> On Tue, Nov 11, 2025 at 02:43:19PM -0500, Robert Treat wrote:
>> > FWIW, when I have built these types of systems in the past, and when I
>> > wanted an aggressive recheck-type mechanism, the most common methods
>> > involved tying it to autovacuum_max_workers.
>>
>> Would you mind elaborating on this point?  Do you mean that you'd rebuild
>> the list every a_m_w tables, or something else?
> 
> Yes.

Interesting.  With our defaults, that would mean rebuilding the list every
few tables, which seems quite aggressive.  I'd start worrying about the
pg_class scanning overhead a little...
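
To put a rough number on that worry, here is a back-of-the-envelope sketch.
It assumes the set of tables needing work stays the same size between
rebuilds, which real workloads won't guarantee, and rebuilds_needed() is a
made-up name, not anything in autovacuum.c.

```c
#include <assert.h>

/*
 * Hypothetical estimate: how many times the table list gets built
 * (i.e., how many pg_class scans happen) while processing ntables
 * if the list is rebuilt after every "batch" tables.  The initial
 * build is counted, so the result is never less than 1.
 */
int
rebuilds_needed(int ntables, int batch)
{
    assert(ntables >= 0 && batch > 0);

    if (ntables == 0)
        return 1;               /* the initial scan still happens */

    return (ntables + batch - 1) / batch;   /* ceiling division */
}
```

With 1000 eligible tables and a batch of 3 (autovacuum_max_workers at its
default, as far as I know), that is 334 scans instead of the single scan we
do today.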

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-11-11 20:25  Sami Imseih <[email protected]>
  parent: David Rowley <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-11-11 20:25 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> On Sat, 8 Nov 2025 at 08:23, Sami Imseih <[email protected]> wrote:
> > > I'm confused at why we'd have set up our autovacuum trigger points as
> > > they are today because we think those are good times to do a
> > > vacuum/analyze, but then prioritise on something completely different.
> > > Surely if we think 20% dead tuples is worth a vacuum, we must
> > > therefore think that 40% dead tuples are even more worthwhile?!
> >
> > Sure, but thresholds alone don't indicate anything about how quickly
> > the table can be vacuumed, # of indexes, per-table a/v settings, etc.
> > The average a/v time is a good proxy to determine this.
> >
> > What I am suggesting here is we think beyond thresholds for
> > prioritization, and to give a chance for more eligible tables to get
> > autovacuumed rather than workers being saturated on some
> > of the slowest-to-vacuum tables.
>
> Can you define "more eligible" here?

What I mean by “more eligible” is that once a worker has its list of tables
that meet the autovacuum thresholds, it’s trying to get through as many
of them as possible within some time window.

If the workers always go after the slowest tables first, they’ll spend most
of that time on just a few heavy ones, and a lot of other eligible tables might
end up waiting much longer to get processed.
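
A toy illustration of that effect, assuming a single worker and made-up
per-table vacuum durations (tables_done_in_window() is a hypothetical helper,
not part of any patch):

```c
#include <stddef.h>

/*
 * Count how many tables one worker finishes within window_secs when it
 * processes them strictly in array order.  durations[i] is the assumed
 * time, in seconds, to vacuum table i.
 */
size_t
tables_done_in_window(const int *durations, size_t n, int window_secs)
{
    size_t  done = 0;
    long    elapsed = 0;

    for (size_t i = 0; i < n; i++)
    {
        elapsed += durations[i];
        if (elapsed > window_secs)
            break;              /* this table would not finish in time */
        done++;
    }
    return done;
}
```

With two 3000-second tables and four 10-second tables in a one-hour window,
slowest-first finishes 1 table and fastest-first finishes 5.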

Eventually the slow tables will be the bottleneck anyway.

> I think I'm not really grasping this because I don't understand why
> faster-to-vacuum tables should be prioritised over slower-to-vacuum
> tables. Can you explain why you think this is important?

The thing I’m hoping to address is something I’ve seen many times in practice.
Autovacuum workers can get stuck on specific large or slow tables, and when
that happens, users often end up running manual vacuums on those tables
just to keep things moving for the smaller/faster vacuumed tables.

Now, I am not so sure any type of autovacuum prioritization could actually
help in these cases. What does help is adding more autovacuum workers.

> if we have the autovacuum worker refresh the list and scores after
> it's done with a table and autovacuum_naptime has elapsed since the
> list was last refreshed?

That is an interesting idea, but refreshing the list that often may not
be such a good idea; it could be quite expensive on large catalogs.

--
Sami Imseih
Amazon Web Services (AWS)






* Re: another autovacuum scheduling thread
@ 2025-11-11 20:26  David Rowley <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-11-11 20:26 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Wed, 12 Nov 2025 at 09:13, Nathan Bossart <[email protected]> wrote:
>
> On Wed, Nov 12, 2025 at 09:03:54AM +1300, David Rowley wrote:
> >     /*
> >      * When enough time has passed, refresh the list to ensure the
> >      * scores aren't too out-of-date.
> >      */
> >     if (time is > lastcheck + autovacuum_naptime * <something>)
> >     {
> >         list_free_deep(tables_to_process);
> >         goto the_top;
> >     }
> > } /* end of foreach(cell, tables_to_process) */
>
> My concern is that this might add already-processed tables back to the
> list, so a worker might never be able to clear it.  Maybe that's not a real
> problem in practice for some reason, but it does feel like a step too far
> for stage 1, as you said above.

Oh, that's a good point. That's a very valid concern. I guess that
could be fixed with a hashtable of vacuumed tables and skipping tables
that exist in there, but the problem with that is that the table might
genuinely need to be vacuumed again. It's a bit tricky to know when a
2nd vacuum is a legit requirement and when it's not. Figuring that out
might be more logic than this code wants to know about.
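
A rough sketch of the skip-list idea, just to make it concrete. A small
fixed-size array with a linear scan stands in for the hash table here, and
DoneSet, done_set_add() and filter_candidates() are all made-up names. It
also has exactly the drawback described above: a table that genuinely needs
a second vacuum gets filtered out too.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DONE 1024           /* arbitrary capacity for the sketch */

/* Table OIDs this worker has already vacuumed in the current run. */
typedef struct DoneSet
{
    uint32_t    oids[MAX_DONE];
    size_t      count;
} DoneSet;

bool
done_set_contains(const DoneSet *set, uint32_t oid)
{
    for (size_t i = 0; i < set->count; i++)
    {
        if (set->oids[i] == oid)
            return true;
    }
    return false;
}

/* Remember a vacuumed table; returns false only if the set is full. */
bool
done_set_add(DoneSet *set, uint32_t oid)
{
    if (done_set_contains(set, oid))
        return true;
    if (set->count == MAX_DONE)
        return false;
    set->oids[set->count++] = oid;
    return true;
}

/*
 * After refreshing the list, drop already-vacuumed tables in place while
 * keeping the remaining order.  Returns the new number of candidates.
 */
size_t
filter_candidates(uint32_t *candidates, size_t n, const DoneSet *done)
{
    size_t  kept = 0;

    for (size_t i = 0; i < n; i++)
    {
        if (!done_set_contains(done, candidates[i]))
            candidates[kept++] = candidates[i];
    }
    return kept;
}
```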

David






* Re: another autovacuum scheduling thread
@ 2025-11-11 20:43  David Rowley <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-11-11 20:43 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Wed, 12 Nov 2025 at 09:25, Sami Imseih <[email protected]> wrote:
> The thing I’m hoping to address is something I’ve seen many times in practice.
> Autovacuum workers can get stuck on specific large or slow tables, and when
> that happens, users often end up running manual vacuums on those tables
> just to keep things moving for the smaller/faster vacuumed tables.
>
> Now, I am not so sure any type of autovacuum prioritization could actually
> help in these cases. What does help is adding more autovacuum workers.

Thanks for explaining. I think the scoring system in Nathan's patch
helps with this, as any smaller tables which are neglected continue to
bloat, and because they're smaller, their scores will increase more
quickly, and eventually they'll have a higher score than the larger
tables.  I think the situation you're talking about is when *all*
autovacuum workers are busy with large tables and no workers remain
to deal with the now-higher-scoring smaller tables, so they bloat
severely or statistics become wildly outdated as a result.

I'm aware of that problem. It seems to happen mostly when large tables
are busy receiving an anti-wraparound vacuum. I'm not sure what to do
about it, but I don't think changing the scoring system is the right
thing. Maybe we can have it configurable so that 1 worker can be
configured to not work on tables above a given size, so there's at
least 1 worker that is less likely to be tied up for extended periods
of time. I don't know if that's a good idea, and also don't know what
realistic values for "given size" are.
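
To make the idea concrete, a minimal sketch of the per-worker check might
look like this. Nothing here exists today: worker_may_process(), the notion
of a designated "small-table worker", and the size cap are all assumptions
for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical filter: a worker flagged as the small-table worker skips
 * any relation larger than max_bytes; all other workers take anything.
 */
bool
worker_may_process(bool is_small_table_worker,
                   uint64_t rel_bytes, uint64_t max_bytes)
{
    if (!is_small_table_worker)
        return true;

    return rel_bytes <= max_bytes;
}
```

A real version would presumably need an exception for tables in wraparound
danger, which this sketch ignores.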

David






* Re: another autovacuum scheduling thread
@ 2025-11-11 20:53  Sami Imseih <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 0 replies; 143+ messages in thread

From: Sami Imseih @ 2025-11-11 20:53 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> > The thing I’m hoping to address is something I’ve seen many times in practice.
> > Autovacuum workers can get stuck on specific large or slow tables, and when
> > that happens, users often end up running manual vacuums on those tables
> > just to keep things moving for the smaller/faster vacuumed tables.
> >
> > Now, I am not so sure any type of autovacuum prioritization could actually
> > help in these cases. What does help is adding more autovacuum workers.
>
> Thanks for explaining. I think the scoring system in Nathan's patch
> helps with this, as any smaller tables which are neglected continue to
> bloat, and because they're smaller, their scores will increase more
> quickly,

That is true.

> Maybe we can have it configurable so that 1 worker can be
> configured to not work on tables above a given size, so there's at
> least 1 worker that is less likely to be tied up for extended periods
> of time. I don't know if that's a good idea, and also don't know what
> realistic values for "given size" are.

Another approach would be to signal for more autovacuum workers to
be spun up (the user could set a minimum and a maximum number of workers)
if all workers have been processing the list for a long time (I'm also
not sure what the "long" threshold would be). This "auto-tuning" of
workers could perhaps reduce the need for manual vacuums. It would still
not prevent all workers from being tied up, but it might reduce the
likelihood.
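
Roughly, the decision could look like the sketch below. This is only an
illustration under assumptions: there are no min/max worker settings today,
and desired_worker_count() and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical policy: return how many workers we would like running.
 * shortest_busy_secs is how long the most recently started busy worker
 * has been processing its list.
 */
int
desired_worker_count(int current_workers, int min_workers, int max_workers,
                     int busy_workers, int64_t shortest_busy_secs,
                     int64_t long_threshold_secs)
{
    int     target = current_workers;

    assert(min_workers >= 1 && min_workers <= max_workers);

    /* Grow by one only when every worker has been busy for "a long time". */
    if (busy_workers >= current_workers &&
        shortest_busy_secs >= long_threshold_secs)
        target = current_workers + 1;

    /* Clamp to the user-configured bounds. */
    if (target < min_workers)
        target = min_workers;
    if (target > max_workers)
        target = max_workers;

    return target;
}
```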

--
Sami






* Re: another autovacuum scheduling thread
@ 2025-11-11 23:22  Robert Treat <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Robert Treat @ 2025-11-11 23:22 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, Nov 11, 2025 at 3:27 PM David Rowley <[email protected]> wrote:
> On Wed, 12 Nov 2025 at 09:13, Nathan Bossart <[email protected]> wrote:
> > On Wed, Nov 12, 2025 at 09:03:54AM +1300, David Rowley wrote:
> > >     /*
> > >      * When enough time has passed, refresh the list to ensure the
> > >      * scores aren't too out-of-date.
> > >      */
> > >     if (time is > lastcheck + autovacuum_naptime * <something>)
> > >     {
> > >         list_free_deep(tables_to_process);
> > >         goto the_top;
> > >     }
> > > } /* end of foreach(cell, tables_to_process) */
> >
> > My concern is that this might add already-processed tables back to the
> > list, so a worker might never be able to clear it.  Maybe that's not a real
> > problem in practice for some reason, but it does feel like a step too far
> > for stage 1, as you said above.
>
> Oh, that's a good point. That's a very valid concern. I guess that
> could be fixed with a hashtable of vacuumed tables and skipping tables
> that exist in there, but the problem with that is that the table might
> genuinely need to be vacuumed again. It's a bit tricky to know when a
> 2nd vacuum is a legit requirement and when it's not. Figuring that out
> might be more logic than this code wants to know about.
>

Yeah, there is a common theoretical pattern that always comes up in
these discussions where autovacuum gets stuck behind N big tables +
(AVMW - N) small tables that keep filtering up to the top of the list,
and I'm not saying that would never be a problem, but assuming the
algorithm is working correctly, this should be fairly avoidable,
because the use of xid age essentially works as a "hash of vacuumed
tables" equivalent for tracking purposes.

Walking through it, once a table is vacuumed, it should go to the
bottom of the list. The only way it crops back up quickly is due to
significant activity on it, but even then, you need a special set of
circumstances, like it needs to be a small enough table with heavy
updates and a small autovacuum_vacuum_threshold. This type of combo
would cause the table to look like it is excessively bloated and in
need of vacuuming, but even in this scenario, eventually other tables
will get an xid age high enough that they will "out rank" the high
activity table and get their turn. TBH I'm not sure if we need to do
replanning, but in the scenarios where I have used it, having more
accurate information on the state of the database has generally been
better than relying on more stale information. Of course it isn't
100%, but the current implementation isn't either, and don't forget we
still have the failsafe_age as, well, a failsafe.


Robert Treat
https://xzilla.net






* Re: another autovacuum scheduling thread
@ 2025-11-12 20:10  Nathan Bossart <[email protected]>
  parent: Robert Treat <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Nathan Bossart @ 2025-11-12 20:10 UTC (permalink / raw)
  To: Robert Treat <[email protected]>; +Cc: David Rowley <[email protected]>; Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, Nov 11, 2025 at 06:22:36PM -0500, Robert Treat wrote:
> On Tue, Nov 11, 2025 at 3:27 PM David Rowley <[email protected]> wrote:
>> On Wed, 12 Nov 2025 at 09:13, Nathan Bossart <[email protected]> wrote:
>> > My concern is that this might add already-processed tables back to the
>> > list, so a worker might never be able to clear it.  Maybe that's not a real
>> > problem in practice for some reason, but it does feel like a step too far
>> > for stage 1, as you said above.
>>
>> Oh, that's a good point. That's a very valid concern. I guess that
>> could be fixed with a hashtable of vacuumed tables and skipping tables
>> that exist in there, but the problem with that is that the table might
>> genuinely need to be vacuumed again. It's a bit tricky to know when a
>> 2nd vacuum is a legit requirement and when it's not. Figuring that out
>> might be more logic than this code wants to know about.
> 
> Yeah, there is a common theoretical pattern that always comes up in
> these discussions where autovacuum gets stuck behind N big tables +
> (AVMW - N) small tables that keep filtering up to the top of the list,
> and I'm not saying that would never be a problem, but assuming the
> algorithm is working correctly, this should be fairly avoidable,
> because the use of xid age essentially works as a "hash of vacuumed
> tables" equivalent for tracking purposes.

I do think re-prioritization is worth considering, but IMHO we should leave
it out of phase 1.  I think it's pretty easy to reason about one round of
prioritization being okay.  The order is completely arbitrary today, so how
could ordering by vacuum-related criteria make things any worse?  In my
view, changing the list contents in fancier ways (e.g., adding
just-processed tables back to the list) is a step further that requires
more discussion and testing.

To be clear, I am totally for serious consideration of reprioritization,
adjusting cost delay settings, etc., but as David has repeatedly stressed,
we are unlikely to get anything committed if we try to boil the ocean.  I'd
love for this thread to spin off into all kinds of other autovacuum-related
threads, but we should be taking baby steps if we want to accomplish
anything here.

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-11-12 22:10  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-11-12 22:10 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Robert Treat <[email protected]>; David Rowley <[email protected]>; Robert Haas <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> I do think re-prioritization is worth considering, but IMHO we should leave
> it out of phase 1.  I think it's pretty easy to reason about one round of
> prioritization being okay.  The order is completely arbitrary today, so how
> could ordering by vacuum-related criteria make things any worse?

While it’s true that the current table order is arbitrary, that arbitrariness
naturally helps distribute vacuum work across tables of various sizes
at a given time.

The proposal, by design, now forces all the most bloated tables, which
will require more I/O to vacuum, to be vacuumed at the same time
by all workers. Users may observe this after they upgrade and wonder
why their I/O profile changed and perhaps slowed other non-vacuum-related
processing down. They also don't have a knob to go back to
the previous behavior.

Of course, this behavior can and will happen now, but with this
prioritization, we are forcing it.

Is this a concern?

--
Sami Imseih
Amazon Web Services (AWS)






* Re: another autovacuum scheduling thread
@ 2025-11-12 23:51  Jeremy Schneider <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Jeremy Schneider @ 2025-11-12 23:51 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; David Rowley <[email protected]>; Robert Haas <[email protected]>; pgsql-hackers


> On Nov 12, 2025, at 5:10 PM, Sami Imseih <[email protected]> wrote:
> 
> 
>> 
>> I do think re-prioritization is worth considering, but IMHO we should leave
>> it out of phase 1.  I think it's pretty easy to reason about one round of
>> prioritization being okay.  The order is completely arbitrary today, so how
>> could ordering by vacuum-related criteria make things any worse?
> 
> While it’s true that the current table order is arbitrary, that arbitrariness
> naturally helps distribute vacuum work across tables of various sizes
> at a given time.
> 
> The proposal, by design, now forces all the most bloated tables, which
> will require more I/O to vacuum, to be vacuumed at the same time
> by all workers. Users may observe this after they upgrade and wonder
> why their I/O profile changed and perhaps slowed other non-vacuum-related
> processing down. They also don't have a knob to go back to
> the previous behavior.
> 
> Of course, this behavior can and will happen now, but with this
> prioritization, we are forcing it.
> 
> Is this a concern?

It’s still possible to tune the cost delay, the number of autovacuum
workers, etc., if someone needs to rein in too much autovacuum I/O
concurrency and dial it back down a little bit. I think that’s sufficient.

-Jeremy







* Re: another autovacuum scheduling thread
@ 2025-11-13 00:32  Sami Imseih <[email protected]>
  parent: Jeremy Schneider <[email protected]>
  0 siblings, 0 replies; 143+ messages in thread

From: Sami Imseih @ 2025-11-13 00:32 UTC (permalink / raw)
  To: Jeremy Schneider <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; David Rowley <[email protected]>; Robert Haas <[email protected]>; pgsql-hackers

> > On Nov 12, 2025, at 5:10 PM, Sami Imseih <[email protected]> wrote:
> >
> >> I do think re-prioritization is worth considering, but IMHO we should leave
> >> it out of phase 1.  I think it's pretty easy to reason about one round of
> >> prioritization being okay.  The order is completely arbitrary today, so how
> >> could ordering by vacuum-related criteria make things any worse?
> >
> > While it’s true that the current table order is arbitrary, that arbitrariness
> > naturally helps distribute vacuum work across tables of various sizes
> > at a given time.
> >
> > The proposal, by design, now forces all the most bloated tables, which
> > will require more I/O to vacuum, to be vacuumed at the same time
> > by all workers. Users may observe this after they upgrade and wonder
> > why their I/O profile changed and perhaps slowed other non-vacuum-related
> > processing down. They also don't have a knob to go back to
> > the previous behavior.
> >
> > Of course, this behavior can and will happen now, but with this
> > prioritization, we are forcing it.
> >
> > Is this a concern?
>
> It’s still possible to tune the cost delay, the number of autovacuum
> workers, etc., if someone needs to rein in too much autovacuum I/O
> concurrency and dial it back down a little bit. I think that’s sufficient.
>

Yes, the need to tune a/v for I/O (lower cost limit, higher cost delay)
will likely be greater with this change.

--
Sami



* Re: Re: another autovacuum scheduling thread
@ 2025-11-14 02:25  段坤仁(刻韧) <[email protected]>
  0 siblings, 0 replies; 143+ messages in thread

From: 段坤仁(刻韧) @ 2025-11-14 02:25 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Sami Imseih <[email protected]>; Jeremy Schneider <[email protected]>; Robert Treat <[email protected]>; David Rowley <[email protected]>; Robert Haas <[email protected]>; pgsql-hackers

Thank you for your reply. You are absolutely right, and I apologize for
straying off-topic in this thread. I have moved my thoughts to a separate
thread [0].

[0] https://www.postgresql.org/message-id/fffdd62e-4d97-4701-ad57-0cd3ef1ebef4.duankunren.dkr%40alibaba-...

------------------------------------------------------------------
From: Nathan Bossart <[email protected]>
Sent: Friday, November 14, 2025, 02:47
To: "段坤仁(刻韧)" <[email protected]>
Cc: Sami Imseih <[email protected]>; Jeremy Schneider <[email protected]>; Robert Treat <[email protected]>; David Rowley <[email protected]>; Robert Haas <[email protected]>; "pgsql-hackers" <[email protected]>
Subject: Re: Re: another autovacuum scheduling thread

On Fri, Nov 14, 2025 at 02:37:31AM +0800, 段坤仁(刻韧) wrote:
> Hi all, I have read the discussion in this thread and am pleased to see the
> community working collaboratively to address some long-standing autovacuum
> problems. Nathan's patch implementation is quite promising and demonstrates
> considerable potential. I have previously attempted similar approaches; however,
> I was unable to develop such a comprehensive and well-formulated calculation
> framework. We would certainly welcome the integration of this improvement into
> v19 or subsequent versions.

Thanks!

> I have tried Nathan's V7 patch and implemented some cost delay mechanisms
> based on it that might be helpful for the issues you guys mentioned.

Unfortunately, cost delay adjustments are off-topic for this thread, as I
hinted yesterday [0]. I'd certainly like to explore this idea, but if we
can't keep the discussion focused, it'll be hard to get anything committed.

[0] https://postgr.es/m/aRTpqMleDpoQm9OO%40nathan

-- 
nathan



* Re: another autovacuum scheduling thread
@ 2025-11-20 14:30  Robert Haas <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Robert Haas @ 2025-11-20 14:30 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Robert Treat <[email protected]>; David Rowley <[email protected]>; Sami Imseih <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Wed, Nov 12, 2025 at 3:10 PM Nathan Bossart <[email protected]> wrote:
> I do think re-prioritization is worth considering, but IMHO we should leave
> it out of phase 1.  I think it's pretty easy to reason about one round of
> prioritization being okay.  The order is completely arbitrary today, so how
> could ordering by vacuum-related criteria make things any worse?  In my
> view, changing the list contents in fancier ways (e.g., adding
> just-processed tables back to the list) is a step further that requires
> more discussion and testing.

I agree with your view around reprioritization. To answer your
rhetorical question, the way that reordering the list could hurt is if
the current ordering (pg_class scan order) happened to be a
near-optimal choice. For example, suppose the last table in pg_class
order is in a state where vacuuming appears to be necessary but will be
painful and/or useless (VACUUM will error, xmin will prevent all or
most tuple removal, located on an incredibly slow disk with nothing
cached, whatever). Re-sorting the list figures to move that table
earlier, which will not work out for the best. I suspect that
reprioritization actually increases the danger of this kind of failure
mode. The more aggressive you are about making sure that the
highest-priority tables actually get handled first, the more important
it is to be correct about the real order of priority.

I do think in the long term a really good system is probably going to
accumulate a bunch of extra logic to deal with cases like this. For
example, if the first table in the queue causes VACUUM to spend an
hour chugging away and then fail with an I/O error, we would ideally
want to make sure to wait a while before retrying that table, so that
others don't get starved. But like you say, there's no need to solve
every problem at once.
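
As a rough illustration of the "wait a while before retrying" idea, here
is a minimal Python sketch. Nothing like this exists in autovacuum today;
the class name RetryGate, the one-hour cooldown, and the table names are
all hypothetical, chosen only to show the shape of the logic.

```python
class RetryGate:
    """Holds back tables whose last VACUUM attempt failed (hypothetical)."""

    def __init__(self, cooldown_seconds=3600.0):
        self.cooldown_seconds = cooldown_seconds
        self._retry_after = {}  # table name -> earliest time to retry

    def record_failure(self, table, now):
        # A failed table becomes eligible again only after the cooldown.
        self._retry_after[table] = now + self.cooldown_seconds

    def record_success(self, table):
        # A successful VACUUM clears any pending cooldown.
        self._retry_after.pop(table, None)

    def eligible(self, tables, now):
        # Keep the caller's ordering; only filter out cooling-down tables.
        return [t for t in tables if self._retry_after.get(t, 0.0) <= now]


gate = RetryGate(cooldown_seconds=3600.0)
queue = ["big_broken", "orders", "events"]  # already sorted by priority
gate.record_failure("big_broken", now=0.0)

soon = gate.eligible(queue, now=60.0)     # big_broken is skipped
later = gate.eligible(queue, now=3600.0)  # big_broken is back at the head
print(soon, later)
```

With a gate like this in front of the sorted list, a table that keeps
failing cannot monopolize a worker, at the cost of delaying it by one
cooldown period.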

What seems important to me for this patch is that we don't choose an
actively bad sort order. For instance, if we don't get the balance
between prioritizing anti-wraparound activity and controlling runaway
bloat correct, and especially if there's no way to recover by tweaking
settings, to me that's a scary scenario. I do think it's fairly
realistic for a bad choice of sort order to end up being a regression
over the current lack of a sort order. You might just be getting lucky
right now -- say, because the catalog tables all occur first in the
catalog and vacuuming those tends to be important, and among user
tables, the ones you created first are actually the ones that are most
important. That's not a particularly crazy scenario, IMHO.

Point being: I think we need to avoid the mindset that we can't be
stupider than we are now. I don't think there's any way we would
commit something that is GENERALLY stupider than we are now, but it's
not about averages. It's about whether there are specific cases that
are common enough to worry about which end up getting regressed. I'm
honestly not sure how much of a risk that is, and, again, I'm not
trying to kill the patch. It might well be that the patch is already
good enough that such scenarios will be extremely rare. However, it's
easy to get overconfident when replacing a completely unintelligent
system with a smarter one. The risk of something backfiring can
sometimes be higher than one anticipates.

One idea that might be worth considering is adding a reloption of some
kind that lets the user exert positive control over the sort order. I
know that's scope creep, so maybe it's a bad idea for that reason. But
I think it would be a better idea than Sami's proposal to score system
catalogs more highly, not so much because his idea is necessarily
wrong-headed as because it doesn't help with what I see as the
principal danger here, namely, that whatever we do will sometimes turn
out to be wrong. Trying to be right 100% of the time is not going to
work out as well as having a backup plan for the cases where we are
wrong.

-- 
Robert Haas
EDB: http://www.enterprisedb.com






* Re: another autovacuum scheduling thread
@ 2025-11-20 16:25  Nathan Bossart <[email protected]>
  parent: Robert Haas <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Nathan Bossart @ 2025-11-20 16:25 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Robert Treat <[email protected]>; David Rowley <[email protected]>; Sami Imseih <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, Nov 20, 2025 at 09:30:42AM -0500, Robert Haas wrote:
> Point being: I think we need to avoid the mindset that we can't be
> stupider than we are now. I don't think there's any way we would
> commit something that is GENERALLY stupider than we are now, but it's
> not about averages. It's about whether there are specific cases that
> are common enough to worry about which end up getting regressed. I'm
> honestly not sure how much of a risk that is, and, again, I'm not
> trying to kill the patch. It might well be that the patch is already
> good enough that such scenarios will be extremely rare. However, it's
> easy to get overconfident when replacing a completely unintelligent
> system with a smarter one. The risk of something backfiring can
> sometimes be higher than one anticipates.

That's a fair point.  The possibly-entirely-theoretical case that's in my
head is when your oldest and lowest-OID table is also the biggest and most
active.  That seems like it could be a popular pattern in the field, and it
probably benefits to some degree from the sequential scan returning it
earlier.  I don't know how much to worry about stuff like this.

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2025-11-20 16:34  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-11-20 16:34 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Robert Haas <[email protected]>; Robert Treat <[email protected]>; David Rowley <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> On Thu, Nov 20, 2025 at 09:30:42AM -0500, Robert Haas wrote:
> > Point being: I think we need to avoid the mindset that we can't be
> > stupider than we are now. I don't think there's any way we would
> > commit something that is GENERALLY stupider than we are now, but it's
> > not about averages. It's about whether there are specific cases that
> > are common enough to worry about which end up getting regressed. I'm
> > honestly not sure how much of a risk that is, and, again, I'm not
> > trying to kill the patch. It might well be that the patch is already
> > good enough that such scenarios will be extremely rare. However, it's
> > easy to get overconfident when replacing a completely unintelligent
> > system with a smarter one. The risk of something backfiring can
> > sometimes be higher than one anticipates.
>
> That's a fair point.  The possibly-entirely-theoretical case that's in my
> head is when your oldest and lowest-OID table is also the biggest and most
> active.  That seems like it could be a popular pattern in the field, and it
> probably benefits to some degree from the sequential scan returning it
> earlier.  I don't know how much to worry about stuff like this.

I think it would be difficult to introduce this new prioritization
system without a GUC to control the prioritization behavior. Since
ordering by pg_class has been the only behavior ever since autovacuum was
released, there should be a way for users to revert back to this. The
default could be the new prioritization strategy.

Introducing new GUCs is something to be avoided if possible, but this
case seems like a clear one to me.

--
Sami Imseih
Amazon Web Services (AWS)






* Re: another autovacuum scheduling thread
@ 2025-11-20 18:35  Robert Haas <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Robert Haas @ 2025-11-20 18:35 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; David Rowley <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, Nov 20, 2025 at 11:34 AM Sami Imseih <[email protected]> wrote:
> I think it would be difficult to introduce this new prioritization
> system without a GUC to control the prioritization behavior. Since
> ordering by pg_class has been the only behavior ever since autovacuum was
> released, there should be a way for users to revert back to this. The
> default could be the new prioritization strategy.
>
> Introducing new GUCs is something to be avoided if possible, but I think this
> case is a clear one to me.

As I sort of alluded to in my previous message, I'd rather see us
introduce something that lets you get the behavior you want than
something that just lets you get back to the old behavior.
Technically, the latter is good enough to avoid any claim that we've
regressed things: you can always just turn the new thing off, and so by
definition there are no unavoidable regressions. But that only caters
to the scenario where the current behavior is good by accident
(because it can never be good for any other reason).

Don't take this too literally, but just mooting ideas wildly, suppose
the scoring has a wraparound component, a bloat component, and a
reloption-driven component, and the former two have a weighting factor
that can be adjusted via GUCs. If you want to shut off the new
behavior, you can set the weighting factors to 0. If you want to
keep the new behavior but adjust the trade-off between the wraparound
and bloat components, you can adjust the relative weighting factors
between the two. If you want to take more manual control, you can use
the reloption, a choice that you can layer on top of the default
strategy or any of the alternate strategies just proposed. Of course,
making this all too complicated is a recipe for failure, but I suspect
that making it at least somewhat configurable is a good idea.
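
To make the weighting idea a bit more concrete, here is a minimal Python
sketch. It is not taken from any patch: the field names, the two weight
parameters (standing in for GUCs), and the "priority" reloption are all
hypothetical, and the way each component is normalized is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TableStats:
    name: str
    wraparound: float      # assumed: age(relfrozenxid) / freeze max age
    bloat: float           # assumed: dead tuples / vacuum threshold
    priority: float = 0.0  # hypothetical reloption-driven component


def score(t, wraparound_weight=1.0, bloat_weight=1.0):
    # The two weights stand in for the GUC-adjustable weighting factors.
    return (wraparound_weight * t.wraparound
            + bloat_weight * t.bloat
            + t.priority)


def order_tables(tables, wraparound_weight=1.0, bloat_weight=1.0):
    # sorted() is stable even with reverse=True, so tables with equal
    # scores keep their input order (here standing in for scan order).
    return sorted(
        tables,
        key=lambda t: score(t, wraparound_weight, bloat_weight),
        reverse=True,
    )


tables = [  # input order plays the role of today's pg_class scan order
    TableStats("sessions", wraparound=0.1, bloat=0.5),
    TableStats("audit_log", wraparound=1.5, bloat=0.1),
    TableStats("orders", wraparound=0.2, bloat=3.0),
]

default_order = [t.name for t in order_tables(tables)]
old_order = [t.name for t in order_tables(tables, 0.0, 0.0)]
xid_heavy = [t.name for t in order_tables(tables, wraparound_weight=5.0)]
print(default_order, old_order, xid_heavy)
```

Setting both weights to 0 leaves every score tied, so the input order
survives unchanged, while a nonzero priority on one table would still
move it to the front.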

-- 
Robert Haas
EDB: http://www.enterprisedb.com






* Re: another autovacuum scheduling thread
@ 2025-11-20 20:21  Sami Imseih <[email protected]>
  parent: Robert Haas <[email protected]>
  1 sibling, 0 replies; 143+ messages in thread

From: Sami Imseih @ 2025-11-20 20:21 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; David Rowley <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> something that just lets you get back to the old behavior.
> Technically, the latter is good enough to avoid any claim that we've
> regressed things:

Yes, that is the intention with the GUC. I am worried about
cases in which a user has no way to go back to the old
behavior if the new prioritization strategy causes pain, for
some reason.

> But that only caters
> to the scenario where the current behavior is good by accident
> (because it can never be good for any other reason).

Well, maybe it was never really good, but it was the only behavior,
and users tuned and tested their autovacuum settings with this
behavior, whether or not they actually knew it was based on pg_class
ordering (I have worked with users who did not realize this).

I think if we are to think about how we can improve prioritization, the
thing to keep in mind is what we can do so that users are no longer
required to schedule manual vacuums for specific tables (which is
essentially how users are currently prioritizing tables). If we go with
the rigid strategy that is being proposed now, will it reduce or
eliminate the need for manual scheduling? I am not so sure.

> Don't take this too literally, but just mooting ideas wildly, suppose
> the scoring has a wraparound component, a bloat component, and a
> reloption-driven component, and the former two have a weighting factor
> that can be adjusted via GUCs. If you want to shut off the new
> behavior, you can set the weighting factors to 0.

Something like this could be better, since it both gives control over
prioritization and allows reverting to the current behavior.

--
Sami Imseih
Amazon Web Services (AWS)






* Re: another autovacuum scheduling thread
@ 2025-11-20 20:58  David Rowley <[email protected]>
  parent: Robert Haas <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-11-20 20:58 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Fri, 21 Nov 2025 at 07:36, Robert Haas <[email protected]> wrote:
> If you want to take more manual control, you can use
> the reloption, a choice that you can layer on top of the default
> strategy or any of the alternate strategies just proposed. Of course,
> making this all too complicated is a recipe for failure, but I suspect
> that making it at least somewhat configurable is a good idea.

But it is configurable... you're free to change any of
autovacuum_freeze_max_age, autovacuum_multixact_freeze_max_age,
autovacuum_vacuum_scale_factor, autovacuum_vacuum_insert_scale_factor
and autovacuum_analyze_scale_factor, plus all the other
autovacuum_vacuum*_threshold GUCs and reloptions to adjust the score.
The design is no accident. Of course, that does also affect the
eligibility for the table to be vacuumed, not just the order, but it's
not like there's no way for users to influence the order. If we really
do discover that pg_catalog tables need vacuum attention sooner, then
maybe we should consider defaulting a reloption for that, or maybe
there's only a subset of pg_catalog tables that that matters for.
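
For readers following along, here is a minimal Python sketch of how a
scale factor feeds into such a score. The threshold formula (base
threshold plus scale factor times the number of tuples) and the defaults
of 50 and 0.2 are the documented autovacuum behavior; treating the score
as dead tuples divided by that threshold is an assumption about the
percentage-over-threshold method, and the exact formula in the patch may
differ.

```python
def vacuum_threshold(reltuples, base_threshold=50, scale_factor=0.2):
    """Dead-tuple count above which autovacuum considers a VACUUM.

    Defaults mirror autovacuum_vacuum_threshold (50) and
    autovacuum_vacuum_scale_factor (0.2).
    """
    return base_threshold + scale_factor * reltuples


def over_threshold_score(n_dead_tup, reltuples,
                         base_threshold=50, scale_factor=0.2):
    """Assumed score: 1.0 means the table is exactly at its threshold."""
    return n_dead_tup / vacuum_threshold(reltuples, base_threshold,
                                         scale_factor)


# A table with one million rows and 300k dead tuples.
default_score = over_threshold_score(300_000, 1_000_000)

# The same table with a per-table scale factor of 0.05.
tuned_score = over_threshold_score(300_000, 1_000_000, scale_factor=0.05)

print(f"default: {default_score:.2f}, tuned: {tuned_score:.2f}")
```

Lowering the scale factor raises the score from roughly 1.5 to roughly
6, which would move the table earlier in a descending sort, and it also
makes the table eligible for vacuum sooner.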

For the record, I don't deny that it is possible that there is some
scenario where the pg_class order is better than sorting by the
percentage-over-threshold method, but IMO, it seems quite extreme to
go adding a series of new reloptions to weight the scores based on no
evidence that there's an actual problem or that it's even a good
solution to fixing some currently unknown problem. If we later
discover there is no issue, then reloptions are quite painful to
remove due to pg_dump (or rather failed restores). I think the vacuum
options are complex enough without risking adding a few new ones that
we don't even know are required or are even useful to anyone.

As for the GUC, I think we should at least commit the patch first and
add an open item to "Decisions to Recheck Mid-Beta" for v19 to see if
anyone still thinks a GUC is a good escape hatch, or if we'd prefer to
revert the patch because it's causing trouble. As I see it, we've got
about 6 months or maybe a bit more of testing how well this works
before we need to make a decision. My vote is to use as much of that
time as possible rather than using it to allow people to dream up
hypothetical problems that might or might not exist.

David






* Re: another autovacuum scheduling thread
@ 2025-11-20 21:16  Robert Haas <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Robert Haas @ 2025-11-20 21:16 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, Nov 20, 2025 at 3:58 PM David Rowley <[email protected]> wrote:
> before we need to make a decision. My vote is to use as much of that
> time as possible rather than using it to allow people to dream up
> hypothetical problems that might or might not exist.

That seems a little harsh. I think the only hypothesis necessary for
my concern to be valid is the hypothesis that whatever algorithm we've
selected may not always work well. I admit that I could be wrong in
thinking so; there are plenty of heuristics in PostgreSQL that are so
effective that nobody ever cares about tuning them. But there's enough
problems with autovacuum that I don't think it's a particularly
adventurous hypothesis, either.

That said, I accept your point that even if we were to agree that
something ought to be made tunable here, we would still have the problem
of deciding exactly what GUCs or reloptions to add, and that might be
hard to figure out without more information. Unfortunately, I have a
feeling that unless you or someone here is planning to make a
determined testing effort over the coming months, we're more likely to
get feedback after final release than during development or even beta.
But I do also understand that you don't want us to be paralyzed and
never move forward.

-- 
Robert Haas
EDB: http://www.enterprisedb.com






* Re: another autovacuum scheduling thread
@ 2025-11-20 22:12  David Rowley <[email protected]>
  parent: Robert Haas <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-11-20 22:12 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Fri, 21 Nov 2025 at 10:16, Robert Haas <[email protected]> wrote:
>
> On Thu, Nov 20, 2025 at 3:58 PM David Rowley <[email protected]> wrote:
> > before we need to make a decision. My vote is to use as much of that
> > time as possible rather than using it to allow people to dream up
> > hypothetical problems that might or might not exist.
>
> That seems a little harsh.

It wasn't intended to be offensive. It's an observation that there've
been quite a few posts on this thread about various extra things we
should account for in the score without any evidence that they're
worthy of a special case. I used "dream up" since I don't recall any
of those posts arriving with evidence of an actual problem or that the
proposed solution was a valid fix for it, and that the proposed
solution didn't make something else worse.

> That said, I accept your point that even if we were to agree that
> something ought to be made tunable here, we would still have the problem
> of deciding exactly what GUCs or reloptions to add, and that might be
> hard to figure out without more information. Unfortunately, I have a
> feeling that unless you or someone here is planning to make a
> determined testing effort over the coming months, we're more likely to
> get feedback after final release than during development or even beta.

You might be right. Or after a week we might discover a good reason
why the percentage-over-threshold method is rubbish and revert it. The
key is probably in how we act on getting no negative feedback.

I suspect the most likely area the new prioritisation order could
cause issues is from the lack of randomness. Will multiple workers
working in the same database be more likely to bump into each other
somehow in a bad way? Maybe that's a good area to focus testing.

> But I do also understand that you don't want us to be paralyzed and
> never move forward.

Yeah partly, but mostly I just really doubt that this matters that
much. It's been said on this thread already that prioritisation isn't
as important as the autovacuum-configured-to-run-too-slowly issue, and
I agree with that. I just find it hard to believe that the highly
volatile pg_class order has been just perfect all these years and that
sorting by percentage-over-threshold-desc will make things worse
overall. There was mention that pg_catalog tables are first in
pg_class, but I don't really agree with that, since if I create some new
tables on a fresh database, I see those getting lower ctids than any
pg_catalog table. The space for that is finite, but there's no
shortage of other reasons for user tables to become mentioned in
pg_class before catalogue tables as the database gets used. I see that
table_beginscan_catalog() uses SO_ALLOW_SYNC too, so there's an extra
layer of randomness from sync scans. I don't recall any complaints
about the order autovacuum works on tables, so, to me, it just seems
strange to think that the volatile order of pg_class just happened to
be right all these years. I suspect what's happening is that the extra
bloat or stale statistics that people get as a result of the
pg_class-order autovacuum just goes unnoticed, is ignored or is attended to
via adjustments to the corresponding scale_factor reloption.

David






* Re: another autovacuum scheduling thread
@ 2025-11-22 11:28  Robert Haas <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Robert Haas @ 2025-11-22 11:28 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, Nov 20, 2025 at 5:12 PM David Rowley <[email protected]> wrote:
> It wasn't intended to be offensive.

OK.

> I suspect the most likely area the new prioritisation order could
> cause issues is from the lack of randomness. Will multiple workers
> working into the same database be more likely to bump into each other
> somehow in a bad way? Maybe that's a good area to focus testing.

I agree that lack of randomness could cause problems, but I don't see
how it could cause regressions, because the current system isn't
random, either. Even if the order of pg_class is unpredictable, it may
(depending on the workload) not change very much from one day to the
next.

> Yeah partly, but mostly I just really doubt that this matters that
> much. It's been said on this thread already that prioritisation isn't
> as important as the autovacuum-configured-to-run-too-slowly issue, and
> I agree with that. I just find it hard to believe that the highly
> volatile pg_class order has been just perfect all these years and that
> sorting by percentage-over-threshold-desc will make things worse
> overall. There was mention that pg_catalog tables are first in
> pg_class, but I don't really agree with that as if I create some new
> tables on a fresh database, I see those getting lower ctids than any
> pg_catalog table. The space for that is finite, but there's no
> shortage of other reasons for user tables to become mentioned in
> pg_class before catalogue tables as the database gets used. I see that
> table_beginscan_catalog() uses SO_ALLOW_SYNC too, so there's an extra
> layer of randomness from sync scans. I don't recall any complaints
> from the order autovacuum works on tables, so, to me, it just seems
> strange to think that the volatile order of pg_class just happened to
> be right all these years. I suspect what's happening is that the extra
> bloat or stale statistics that people get as a result of the
> pg_class-order autovacuum just gets unnoticed, ignored or attended to
> via adjustments to the corresponding scale_factor reloption.

Interesting. I don't have any real knowledge of how jumbled-up the
order of pg_class is on real production systems, and I agree that if
the answer is "it's usually quite jumbled up" then that is good news
for this patch. In any case, I'm not trying to say that prioritization
is an intrinsically bad idea, because I don't believe that. What I'm
trying to say is that there's a limited number of ways for this patch
to make things worse, and one of them is if someone is winning right
now by accident, so therefore we should think about how many people
might be in that situation. I would argue that if a large number of
users end up with a very similar pattern in terms of how pg_class is
ordered, that makes the patch higher-risk than if, as I think you're
arguing here, there's enough randomness in terms of where things end
up in pg_class to prevent any particular pattern from predominating.
In the latter case, one or two really unlucky users could end up worse
off, but that's not really an issue. What would be an issue is if we
regressed some kind of common pattern. I admit that's a bit
speculative and I'm probably being a little paranoid here: doing smart
things is typically better than doing dumb things, and what we're
doing right now is dumb.

On the other hand, once we ship something, we can't pull it back. If
it causes a problem, someone will call me at 2am and need their system
fixed right now. If my answer is "well, there are no configuration
knobs we can change and no way to get back to the old behavior and I'm
sorry you're having that problem but the only answer is for you to run
all your VACUUMs manually until two years from now when maybe the
algorithm will have been improved," it's not going to be a very good
night. After 15 years at EDB, I've learned that the problem isn't
being wrong per se; it's having no way to get out from under being
wrong. It is absolutely inevitable that I will screw up, you will
screw up, the project as a whole will screw up, and that doesn't worry
me a bit. What does worry me is the prospect that we won't have
thought hard enough about what we're going to do if and when that
happens. Most of the customers that I've gotten to work with over the
years are very gracious about things going wrong with the software as
long as there are some options to deal with the problem. I fully admit
that this patch may already be good enough that I'll never hear a
single customer complain about it, but the time to think through the
reverse scenario, where some users are unhappy, is before we ship, not
after. That necessarily involves some speculation about what might go
wrong and some of that speculation may be groundless, but speculation
causes a lot less pain than angry customers whose problems you can't
fix.

-- 
Robert Haas
EDB: http://www.enterprisedb.com






* Re: another autovacuum scheduling thread
@ 2025-11-22 13:07  Dilip Kumar <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 0 replies; 143+ messages in thread

From: Dilip Kumar @ 2025-11-22 13:07 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Robert Haas <[email protected]>; Robert Treat <[email protected]>; David Rowley <[email protected]>; Sami Imseih <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, Nov 20, 2025 at 9:55 PM Nathan Bossart <[email protected]> wrote:
>
Thanks for working on this problem. We frequently hear about the
autovacuum scheduling issue. I believe prioritizing based on the
wraparound and vacuum threshold limits is a great starting point.

However, my vision for addressing this problem has always involved
maintaining two distinct priority queues (or sorted lists). Each of
these queues would contain tables, with the tables within each queue
sorted by their respective scores.

Queue 1: Wraparound-Critical: This queue contains tables that require
immediate action because their XID or MultiXact ID age is critical,
especially those approaching the failsafe limit.
Queue 2: Threshold-Based: This queue includes tables needing VACUUM
due to crossing other thresholds.

Both queues would be maintained as sorted lists, with the highest
priority score at the head.  The autovacuum worker dynamically selects
tables for processing from the head of these 2 queues. For instance,
if a table is initially chosen from the threshold queue but processing
takes too long, and another table approaches its failsafe limit due to
a high rate of concurrent XID generation, the latter can be
prioritized from the wraparound queue.  I believe this 2 queue
approach offers more flexibility than attempting to merge these
distinct concerns into a single scoring dimension.

Tables may exist in both queues. If a table is selected and vacuumed,
it will be removed from both queues to prevent redundant efforts.
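
A minimal Python sketch of the two-queue idea follows. The class and
method names are hypothetical, the scores are placeholders, and always
draining the wraparound queue before the threshold queue is an
assumption about the selection policy rather than something specified
above.

```python
import heapq


class TwoQueueScheduler:
    """Two priority queues with lazy removal of already-vacuumed tables."""

    def __init__(self):
        self._wraparound = []  # heap of (-score, table): highest first
        self._threshold = []
        self._done = set()     # tables already handed to a worker

    def add(self, table, wraparound_score=None, threshold_score=None):
        # A table may be placed in one queue or in both.
        if wraparound_score is not None:
            heapq.heappush(self._wraparound, (-wraparound_score, table))
        if threshold_score is not None:
            heapq.heappush(self._threshold, (-threshold_score, table))

    def _pop(self, heap):
        # Skip entries for tables that were already picked via the other
        # queue, which has the effect of removing them from both.
        while heap:
            _, table = heapq.heappop(heap)
            if table not in self._done:
                return table
        return None

    def next_table(self):
        # Assumed policy: wraparound-critical tables always go first.
        table = self._pop(self._wraparound)
        if table is None:
            table = self._pop(self._threshold)
        if table is not None:
            self._done.add(table)
        return table


sched = TwoQueueScheduler()
sched.add("orders", threshold_score=3.0)
sched.add("events", threshold_score=5.0)
sched.add("accounts", wraparound_score=1.2, threshold_score=4.0)

picked = [sched.next_table(), sched.next_table()]
# A table becomes wraparound-critical while the worker is busy.
sched.add("audit_log", wraparound_score=1.5)
picked += [sched.next_table(), sched.next_table(), sched.next_table()]
print(picked)
```

Because the worker consults the wraparound queue before every pick, a
table that becomes critical mid-run is handled next, without re-sorting
a single combined list.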

-- 
Regards,
Dilip Kumar
Google






* Re: another autovacuum scheduling thread
@ 2025-11-22 17:28  Sami Imseih <[email protected]>
  parent: Robert Haas <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2025-11-22 17:28 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: David Rowley <[email protected]>; Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> > I suspect the most likely area the new prioritisation order could
> > cause issues is from the lack of randomness. Will multiple workers
> > working into the same database be more likely to bump into each other
> > somehow in a bad way? Maybe that's a good area to focus testing.
>
> I agree that lack of randomness could cause problems, but I don't see
> how it could cause regressions, because the current system isn't
> random, either. Even if the order of pg_class is unpredictable, it may
> (depending on the workload) not change very much from one day to the
> next.
>
> > Yeah partly, but mostly I just really doubt that this matters that
> > much. It's been said on this thread already that prioritisation isn't
> > as important as the autovacuum-configured-to-run-too-slowly issue, and
> > I agree with that. I just find it hard to believe that the highly
> > volatile pg_class order has been just perfect all these years and that
> > sorting by percentage-over-threshold-desc will make things worse
> > overall. There was mention that pg_catalog tables are first in
> > pg_class, but I don't really agree with that as if I create some new
> > tables on a fresh database, I see those getting lower ctids than any
> > pg_catalog table. The space for that is finite, but there's no
> > shortage of other reasons for user tables to become mentioned in
> > pg_class before catalogue tables as the database gets used. I see that
> > table_beginscan_catalog() uses SO_ALLOW_SYNC too, so there's an extra
> > layer of randomness from sync scans. I don't recall any complaints
> > from the order autovacuum works on tables, so, to me, it just seems
> > strange to think that the volatile order of pg_class just happened to
> > be right all these years. I suspect what's happening is that the extra
> > bloat or stale statistics that people get as a result of the
> > pg_class-order autovacuum just gets unnoticed, ignored or attended to
> > via adjustments to the corresponding scale_factor reloption.
>
> Interesting. I don't have any real knowledge of how jumbled-up the
> order of pg_class is on real production systems, and I agree that if
> the answer is "it's usually quite jumbled up" then that is good news
> for this patch. In any case, I'm not trying to say that prioritization
> is an intrinsically bad idea, because I don't believe that. What I'm
> trying to say is that there's a limited number of ways for this patch
> to make things worse

What I have not been able to prove from my tests is that the processing
order of tables by autovacuum will actually make things any better or any
worse. My tests have been short 30-minute tests that count how many
vacuum cycles tables with various DML activity and sizes received.
I have not found much difference. I am also not sure how valuable
these short-duration tests are.

In the field is where the real test occurs, and it may be discovered that
the new strategy improves the majority of the cases, and there may also
be cases where the existing strategy is somehow better. Having the
ability to go back to the existing behavior seems like the best way we
can roll this out and learn over time.

These may be the only two strategies we will ever need, or we may find out
that a third strategy in which individual tables are assigned a prioritization
score will also be useful.

--
Sami






* Re: another autovacuum scheduling thread
@ 2025-11-22 18:35  Robert Haas <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Robert Haas @ 2025-11-22 18:35 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: David Rowley <[email protected]>; Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Sat, Nov 22, 2025 at 12:28 PM Sami Imseih <[email protected]> wrote:
> What I have not been able to prove from my tests is that the processing
> order of tables by autovacuum will actually make things any better or any
> worse. My tests have been short 30 minute tests that count how many
> vacuum cycles tables with various DML activity and sizes received.
> I have not found much difference. I am also not sure  how valuable
> these short-duration tests are either.

Yeah, I'm not sure that would be the right way to look for a benefit
from something like this. I think that a better test scenario might
involve figuring out how fast we can recover from a bad situation. As
we've discussed before, if VACUUM is chronically unable to keep up
with the workload, then the system is going to get into a very bad
state and there's not really any help for it. But if we start to get
into a bad situation due to some outside interference and then someone
removes the interference, we might hope that this patch would help us
get back on our feet more quickly.

For instance, suppose that we have a database with a stale replication
slot, so the oldest-XID value for the cluster keeps getting older and
older. autovacuum is probably running but it can't clean anything up.
Then at some point, the DBA realizes that bad things are happening and
drops the replication slot. You might hope that, with the patch,
autovacuum would do a better job getting the system back to a working
state. If you set up some kind of test scenario, you could ask
questions like "what is the largest age(relfrozenxid) that we observe
in the database at any point during the test?" or "from the time the
replication slot is dropped, how much time passes before
age(datfrozenxid) drops to normal?" or "what is the maximum observed
amount of bloat during the test?".

The same kind of idea could apply to anything else that stops vacuum
from running or makes it unproductive: a full table lock on a key
table, an open transaction, a table where VACUUM is failing. I
actually don't know exactly what kind of scenario would be good to
test here, because I struggle to think of a concrete scenario in which
we'd be better off with this than without it (which might be a reason
not to proceed with it, despite the fact that I think we all agree
that, from a theoretical point of view, the idea of prioritizing
sounds better than the idea of not prioritizing). But I think that if
the patch has a benefit, it won't be one where the system is in a
steady state where vacuum is able to keep up. It might be one where
we're in a steady state where vacuum is not able to keep up and things
are getting worse and worse, but the patch allows us to survive for
longer before terrible things happen. But I would say that the most
promising scenario for this patch would be something like what I
describe above, where we're not in a steady state at all: something
bad has happened and now we're trying to recover.

-- 
Robert Haas
EDB: http://www.enterprisedb.com





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-11-22 20:03  Nathan Bossart <[email protected]>
  parent: Robert Haas <[email protected]>
  1 sibling, 0 replies; 143+ messages in thread

From: Nathan Bossart @ 2025-11-22 20:03 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: David Rowley <[email protected]>; Sami Imseih <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Sat, Nov 22, 2025 at 06:28:13AM -0500, Robert Haas wrote:
> What would be an issue is if we
> regressed some kind of common pattern. I admit that's a bit
> speculative and I'm probably being a little paranoid here: doing smart
> things is typically better than doing dumb things, and what we're
> doing right now is dumb.
> 
> On the other hand, once we ship something, we can't pull it back. If
> it causes a problem, someone will call me at 2am and need their system
> fixed right now. If my answer is "well, there are no configuration
> knobs we can change and no way to get back to the old behavior and I'm
> sorry you're having that problem but the only answer is for you to run
> all your VACUUMs manually until two years from now when maybe the
> algorithm will have been improved," it's not going to be a very good
> night. After 15 years at EDB, I've learned that the problem isn't
> being wrong per se; it's having no way to get out from under being
> wrong.

Yeah.  I'm tempted to code up the "weighting factor" GUCs for the next
revision.  As you've noted, those would be useful for tuning and for
reverting back to pre-v19 behavior.  Sure, we might end up with a handful
of retail GUCs that most users don't need, but that's not so terrible.

-- 
nathan





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-11-23 09:55  David Rowley <[email protected]>
  parent: Robert Haas <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: David Rowley @ 2025-11-23 09:55 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Sun, 23 Nov 2025 at 07:35, Robert Haas <[email protected]> wrote:
>
> On Sat, Nov 22, 2025 at 12:28 PM Sami Imseih <[email protected]> wrote:
> > What I have not been able to prove from my tests is that the processing
> > order of tables by autovacuum will actually make things any better or any
> > worse. My tests have been short 30 minute tests that count how many
> > vacuum cycles tables with various DML activity and sizes received.
> > I have not found much difference. I am also not sure  how valuable
> > these short-duration tests are either.
>
> Yeah, I'm not sure that would be the right way to look for a benefit
> from something like this. I think that a better test scenario might
> involve figuring out how fast we can recover from a bad situation. As
> we've discussed before, if VACUUM is chronically unable to keep up
> with the workload, then the system is going to get into a very bad
> state and there's not really any help for it. But if we start to get
> into a bad situation due to some outside interference and then someone
> removes the interference, we might hope that this patch would help us
> get back on our feet more quickly.

One thing that seems to be getting forgotten again is that the "/* Stop
applying cost limits from this point on */" added in 1e55e7d17 is only
going to be applied when the table *currently* being vacuumed is over
the failsafe limit. Without Nathan's patch, the worker might end up
idling along carefully obeying the cost limits on dozens of other
tables before it gets around to vacuuming the table that's over the
failsafe limit, then suddenly drop the cost delay code and rush to get
the table frozen, before Postgres stops accepting transactions. With
the patch, Nathan has added some aggressive score scaling, which
should mean any table over the failsafe limit has the highest score
and gets attended to first.

David





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-11-24 15:19  Sami Imseih <[email protected]>
  parent: Robert Haas <[email protected]>
  1 sibling, 0 replies; 143+ messages in thread

From: Sami Imseih @ 2025-11-24 15:19 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: David Rowley <[email protected]>; Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> > What I have not been able to prove from my tests is that the processing
> > order of tables by autovacuum will actually make things any better or any
> > worse. My tests have been short 30 minute tests that count how many
> > vacuum cycles tables with various DML activity and sizes received.
> > I have not found much difference. I am also not sure  how valuable
> > these short-duration tests are either.
>
> Yeah, I'm not sure that would be the right way to look for a benefit
> from something like this. I think that a better test scenario might
> involve figuring out how fast we can recover from a bad situation. As
> we've discussed before, if VACUUM is chronically unable to keep up
> with the workload, then the system is going to get into a very bad
> state and there's not really any help for it. But if we start to get
> into a bad situation due to some outside interference and then someone
> removes the interference, we might hope that this patch would help us
> get back on our feet more quickly.
>
> For instance, suppose that we have a database with a stale replication
> slot, so the oldest-XID value for the cluster keeps getting older and
> older. autovacuum is probably running but it can't clean anything up.
> Then at some point, the DBA realizes that bad things are happening and
> drops the replication slot. You might hope that, with the patch,
> autovacuum would do a better job getting the system back to a working
> state. If you set up some kind of test scenario, you could ask
> questions like "what is the largest age(relfrozenxid) that we observe
> in the database at any point during the test?" or "from the time the
> replication slot is dropped, how much time passes before
> age(datfrozenxid) drops to normal?" or "what is the maximum observed
> amount of bloat during the test?".



^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2025-11-24 19:59  Robert Haas <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Robert Haas @ 2025-11-24 19:59 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Nathan Bossart <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Sun, Nov 23, 2025 at 4:55 AM David Rowley <[email protected]> wrote:
> One thing that seems to be getting forgotten again is the "/* Stop
> applying cost limits from this point on */" added in 1e55e7d17 is only
> going to be applied when the table *currently* being vaccumed is over
> the failsafe limit. Without Nathan's patch, the worker might end up
> idling along carefully obeying the cost limits on dozens of other
> tables before it gets around to vacuuming the table that's over the
> failsafe limit, then suddenly drop the cost delay code and rush to get
> the table frozen, before Postgres stops accepting transactions. With
> the patch, Nathan has added some aggressive score scaling, which
> should mean any table over the failsafe limit has the highest score
> and gets attended to first.

Right, so can we use that to construct a specific, concrete scenario
where we can see that the patch ends up delivering better behavior
than we have today? I think it would be really good to have at least
one fully worked-out case where we can say "look, if you run this
series of commands without the patch, X happens, and with the patch, Y
happens, and look! Y is better."

-- 
Robert Haas
EDB: http://www.enterprisedb.com





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-05 17:03  Nathan Bossart <[email protected]>
  parent: Robert Haas <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2026-03-05 17:03 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: David Rowley <[email protected]>; Sami Imseih <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

My apologies for getting distracted from this work.  It might be a v20 item
at this point.  I haven't addressed any feedback since the v8 patch, but I
did some testing.

On Mon, Nov 24, 2025 at 02:59:33PM -0500, Robert Haas wrote:
> On Sun, Nov 23, 2025 at 4:55 AM David Rowley <[email protected]> wrote:
>> One thing that seems to be getting forgotten again is the "/* Stop
>> applying cost limits from this point on */" added in 1e55e7d17 is only
>> going to be applied when the table *currently* being vaccumed is over
>> the failsafe limit. Without Nathan's patch, the worker might end up
>> idling along carefully obeying the cost limits on dozens of other
>> tables before it gets around to vacuuming the table that's over the
>> failsafe limit, then suddenly drop the cost delay code and rush to get
>> the table frozen, before Postgres stops accepting transactions. With
>> the patch, Nathan has added some aggressive score scaling, which
>> should mean any table over the failsafe limit has the highest score
>> and gets attended to first.
> 
> Right, so can we use that to construct a specific, concrete scenario
> where we can see that the patch ends up delivering better behavior
> than we have today? I think it would be a really good to have at least
> one fully worked-out case where we can say "look, if you run this
> series of commands without the patch, X happens, and with the patch, Y
> happens, and look! Y is better."

I used the xid_wraparound module to artificially induce a situation that
looked like this:

     table_name |    age
    ------------+------------
     t1         | 1950000020
     t2         | 1560000016
     t3         | 1170000013
     t4         |  780000010
     t5         |  390000007
    (5 rows)

Each table had 1M updates, and all other tables on the cluster were frozen.
I created the tables in reverse so that t1 is listed later in pg_class than
t5.

Without the patch, autovacuum goes straight for t5, and then processes t4,
t3, etc.:

     table_name |    age
    ------------+------------
     t1         | 1950000021
     t2         | 1560000017
     t3         | 1170000014
     t4         |  780000011
     t5         |          1
    (5 rows)

With the patch, it processes t1 first:

     table_name |    age
    ------------+------------
     t2         | 1560000017
     t3         | 1170000014
     t4         |  780000011
     t5         |  390000008
     t1         |          1
    (5 rows)

I'll admit this is a pretty extreme/contrived example, but it at least
shows the intended behavior.  As alluded to elsewhere, this prioritization
work might be more useful once we are automatically adjusting the cost
limits in more cases.
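
For readers following along, the ordering behavior shown above can be
sketched as a plain sort by score, highest first.  This is only an
illustration: the TableToProcess struct, table_score_cmp(), and
sort_tables_by_score() are hypothetical names, not taken from the patch,
and the real code may organize this differently.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-table entry; the patch's actual struct may differ. */
typedef struct TableToProcess
{
	unsigned int oid;		/* table OID */
	double		score;		/* priority score; higher means more urgent */
} TableToProcess;

/* qsort() comparator: order entries by score, highest first. */
static int
table_score_cmp(const void *a, const void *b)
{
	const TableToProcess *ta = (const TableToProcess *) a;
	const TableToProcess *tb = (const TableToProcess *) b;

	if (ta->score > tb->score)
		return -1;
	if (ta->score < tb->score)
		return 1;
	return 0;
}

/* Sort the candidate list so the most urgent table comes first. */
static void
sort_tables_by_score(TableToProcess *tables, size_t ntables)
{
	assert(tables != NULL || ntables == 0);
	qsort(tables, ntables, sizeof(TableToProcess), table_score_cmp);
}
```

With a list like this, the position of a table in pg_class no longer
matters; only its score decides when the worker gets to it.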

-- 
nathan





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-10 15:06  Nathan Bossart <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2026-03-10 15:06 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: David Rowley <[email protected]>; Sami Imseih <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, Mar 05, 2026 at 11:03:50AM -0600, Nathan Bossart wrote:
> My apologies for getting distracted from this work.  It might be a v20 item
> at this point.  I haven't addressed any feedback since the v8 patch, but I
> did some testing.

Here's an updated patch with new GUCs that control how much each component
contributes to the autovacuum score for a table.  They default to 1.0, but
can be set anywhere from 0.0 to 1.0 (inclusive).  In theory, setting all of
them to 0.0 should restore the original pg_class order prioritization that
we have today.  I haven't added corresponding reloptions for these GUCs, as
I'm not convinced we need them, but I can add them if folks think they
would be useful.

-- 
nathan


^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-10 16:19  Nathan Bossart <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2026-03-10 16:19 UTC (permalink / raw)
  To: Robert Haas <[email protected]>; +Cc: David Rowley <[email protected]>; Sami Imseih <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, Mar 10, 2026 at 10:06:44AM -0500, Nathan Bossart wrote:
> Here's an updated patch with new GUCs that control how much each component
> contributes to the autovacuum score for a table.  They default to 1.0, but
> can be set anywhere from 0.0 to 1.0 (inclusive).  In theory, setting all of
> them to 0.0 should restore the original pg_class order prioritization that
> we have today.  I haven't added corresponding reloptions for these GUCs, as
> I'm not convinced we need them, but I can add them if folks think they
> would be useful.

Apologies for the noise.  cfbot alerted me to a missing #include.

I've been thinking about how we might eventually translate these scores
into automatic cost limit adjustments.  ISTM that might be a bit difficult
because the scores are basically boundless, so we'll need to get creative.
Unfortunately, I have no concrete ideas to propose at the moment, but
that's v20 (or later) material, anyway.

-- 
nathan


^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-11 00:11  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2026-03-11 00:11 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Robert Haas <[email protected]>; David Rowley <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> Here's an updated patch with new GUCs that control how much each component
> contributes to the autovacuum score for a table.  They default to 1.0, but
> can be set anywhere from 0.0 to 1.0 (inclusive).  In theory, setting all of
> them to 0.0 should restore the original pg_class order prioritization that
> we have today.  I haven't added corresponding reloptions for these GUCs, as
> I'm not convinced we need them, but I can add them if folks think they
> would be useful.

Starting with GUCs is OK by me.

Just a few things:

1/
+        Oid            xid_age;
+        Oid            mxid_age;

Is using Oid here intentional? I'm curious why not use uint32 for clarity?

2/
The new GUC docs say "...component of the score..." without
introducing the concept of the prioritization score.
I think we should expand a bit more on this topic to help a user
understand and tune these more effectively. Attached is my
proposal for the docs. I tried to keep it informative without
being too verbose, and avoided making specific recommendations.

--
Sami Imseih
Amazon Web Services (AWS)


Attachments:

  [application/octet-stream] v1-0001-autovacuum-scheduling-improvements-docs.patch (4.5K, 2-v1-0001-autovacuum-scheduling-improvements-docs.patch)
  download | inline diff:
From eee71cdfdaff3295d52c1213d47ec1754e87a1f8 Mon Sep 17 00:00:00 2001
From: Sami Imseih <[email protected]>
Date: Tue, 10 Mar 2026 19:04:03 -0500
Subject: [PATCH v1 1/1] autovacuum scheduling improvements - docs

---
 doc/src/sgml/maintenance.sgml | 95 +++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 7c958b06273..16b50f8e5b6 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -1054,6 +1054,99 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
     not automatically interrupted.
    </para>
 
+   <sect3 id="autovacuum-priority">
+    <title>Processing Priority</title>
+
+   <para>
+    Autovacuum decides what to process in two steps: first it picks a
+    database, then it orders the tables within that database.
+   </para>
+
+   <para>
+    The launcher first checks for databases at risk of wraparound,
+    with transaction ID wraparound taking precedence over multixact
+    wraparound.  If no database is at risk, the least recently
+    auto-vacuumed database is selected.  Databases that have never been
+    connected to, or that have had no activity since the statistics were
+    last reset, are not considered except when at risk of wraparound.
+   </para>
+
+   <para>
+    Within a database, the autovacuum worker builds a list of all tables
+    that need vacuuming or analyzing and sorts them by a
+    <firstterm>priority score</firstterm>.  Tables with higher scores are
+    processed first.
+   </para>
+
+   <para>
+    The score for each table is calculated as the maximum of several
+    component scores, each representing how far the table has exceeded a
+    particular threshold.  Each component is multiplied by a configurable
+    weight parameter:
+
+    <itemizedlist>
+     <listitem>
+      <para>
+       <xref linkend="guc-autovacuum-vacuum-score-weight"/>: the ratio of
+       dead tuples to the table's vacuum threshold.  For example, if a table
+       has 100 dead tuples and its vacuum threshold is 80, this component's
+       score is 1.25.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       <xref linkend="guc-autovacuum-vacuum-insert-score-weight"/>: the ratio
+       of inserted tuples (since the last vacuum) to the table's insert
+       vacuum threshold.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       <xref linkend="guc-autovacuum-analyze-score-weight"/>: the ratio of
+       modified tuples (since the last analyze) to the table's analyze
+       threshold.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       <xref linkend="guc-autovacuum-freeze-score-weight"/>: the ratio of
+       the table's transaction ID age
+       (<structfield>relfrozenxid</structfield>) to
+       <xref linkend="guc-autovacuum-freeze-max-age"/>.  This component is
+       only considered for tables that have already exceeded their freeze max
+       age.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       <xref linkend="guc-autovacuum-multixact-freeze-score-weight"/>: the
+       ratio of the table's multixact age
+       (<structfield>relminmxid</structfield>) to
+       <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Like the
+       freeze score, this is only considered for tables past their multixact
+       freeze max age.
+      </para>
+     </listitem>
+    </itemizedlist>
+
+    The final score is the maximum of these weighted components.
+   </para>
+
+   <para>
+    Tables that are approaching transaction ID or multixact wraparound receive
+    additional priority.  Once a table's age surpasses
+    <xref linkend="guc-vacuum-failsafe-age"/> or
+    <xref linkend="guc-vacuum-multixact-failsafe-age"/>, its freeze score is
+    scaled aggressively so that it sorts toward the top of the list.
+   </para>
+
+   <para>
+    All weights default to 1.0.  Reducing a weight to a value below 1.0
+    decreases the influence of that component on the final score, making
+    tables that exceed that particular threshold less likely to be processed
+    first.
+   </para>
+
    <warning>
     <para>
      Regularly running commands that acquire locks conflicting with a
@@ -1061,6 +1154,8 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
      effectively prevent autovacuums from ever completing.
     </para>
    </warning>
+
+   </sect3>
   </sect2>
  </sect1>
 
-- 
2.50.1 (Apple Git-155)
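
As a rough illustration of the scoring described in the documentation
patch above, the calculation could be sketched as follows.  The
ScoreInputs and ScoreWeights structs and the priority_score() function
are hypothetical names used only for this sketch, the failsafe scaling is
deliberately left out, and the actual patch may compute things
differently.

```c
#include <assert.h>

/* Hypothetical inputs for one table; names are illustrative only. */
typedef struct ScoreInputs
{
	double		dead_tuples, vac_threshold;		/* vacuum component */
	double		ins_tuples, ins_threshold;		/* insert component */
	double		mod_tuples, anl_threshold;		/* analyze component */
	double		xid_age, freeze_max_age;		/* freeze component */
	double		mxid_age, mxid_freeze_max_age;	/* multixact component */
} ScoreInputs;

/* Hypothetical weights, standing in for the new GUCs (0.0 to 1.0). */
typedef struct ScoreWeights
{
	double		vac, ins, anl, freeze, mxid_freeze;
} ScoreWeights;

static double
max2(double a, double b)
{
	return (a > b) ? a : b;
}

/* Final score: the maximum of the weighted component ratios. */
static double
priority_score(const ScoreInputs *in, const ScoreWeights *w)
{
	double		score = 0.0;

	assert(in->vac_threshold > 0 && in->ins_threshold > 0 &&
		   in->anl_threshold > 0 && in->freeze_max_age > 0 &&
		   in->mxid_freeze_max_age > 0);

	score = max2(score, w->vac * in->dead_tuples / in->vac_threshold);
	score = max2(score, w->ins * in->ins_tuples / in->ins_threshold);
	score = max2(score, w->anl * in->mod_tuples / in->anl_threshold);

	/* Freeze components count only once the table is past the max age. */
	if (in->xid_age > in->freeze_max_age)
		score = max2(score, w->freeze * in->xid_age / in->freeze_max_age);
	if (in->mxid_age > in->mxid_freeze_max_age)
		score = max2(score,
					 w->mxid_freeze * in->mxid_age / in->mxid_freeze_max_age);

	return score;
}
```

In this sketch, a table with 100 dead tuples and a vacuum threshold of 80
scores 1.25 with all weights at 1.0, matching the example in the patch
text, and setting a weight to 0.0 removes that component from
consideration.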



^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-11 01:08  Sami Imseih <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Sami Imseih @ 2026-03-11 01:08 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Robert Haas <[email protected]>; David Rowley <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> Just a few things:
>
> 1/
> +        Oid            xid_age;
> +        Oid            mxid_age;
>
> Is using Oid here intentional? I'm curious why not use uint32 for clarity?
>
> 2/
> The new GUC docs says  "...component of the score...", but without
> introducing the concept of the prioritization score.
> I think we should expand a bit more on this topic to help a user
> understand and tune these more effectively. Attached is my
> proposal for the docs. I tried to keep it informative without
> being too verbose, and avoided making specific recommendations.

My apologies. I found something else that may need
addressing.

+               if (xid_age >= effective_xid_failsafe_age)
+                       xid_score = pow(xid_score, Max(1.0, (double) xid_age / 100000000));
+               if (mxid_age >= effective_mxid_failsafe_age)
+                       mxid_score = pow(mxid_score, Max(1.0, (double) mxid_age / 100000000));
+

The current scaling calculation for force_vacuum could lead to
exorbitantly high scores.
Using DEBUG3 and consume_xids_until(2000000000), notice how the score goes
from 7.93 to 661828682916018.125 once past failsafe age.

36), anl: 0 (threshold 97929), score: 7.930
2026-03-10 19:41:11.979 CDT [74007] DEBUG:  foo: vac: 0 (threshold 195809), ins: 0 (threshold 176836), anl: 0 (threshold 97929), score: 7.930
2026-03-10 19:41:32.062 CDT [74038] DEBUG:  foo: vac: 0 (threshold 195809), ins: 0 (threshold 176836), anl: 0 (threshold 97929), score: 661828682916018.125
2026-03-10 19:41:32.063 CDT [74038] DEBUG:  foo: vac: 0 (threshold 195809), ins: 0 (threshold 176836), anl: 0 (threshold 97929), score: 661828682916018.125
2026-03-10 19:41:51.961 CDT [74066] DEBUG:  foo: vac: 0 (threshold 195809), ins: 0 (threshold 176836), anl: 0 (threshold 97929), score: 26761249940789168.000

Do you think it would be better to just add the age to the
score?

               if (xid_age >= effective_xid_failsafe_age)
                       xid_score += (double) xid_age;
               if (mxid_age >= effective_mxid_failsafe_age)
                       mxid_score += (double) mxid_age;

For most cases, this should make the score high enough to
sort to the top, as mentioned in the comments:

+                * As in vacuum_xid_failsafe_check(), the effective failsafe age is no
+                * less than 105% the value of the respective *_freeze_max_age
+                * parameter.  Note that per-table settings could result in a low
+                * score even if the table surpasses the failsafe settings.  However,
+                * this is a strange enough corner case that we don't bother trying to
+                * handle it.
+                */
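
For reference, the "no less than 105%" rule in that comment can be
sketched as below.  effective_failsafe_age() is a hypothetical helper
written for this message, not a function from the patch or from vacuum.c.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch: the effective failsafe age is the configured failsafe age,
 * but never less than 105% of the corresponding *_freeze_max_age.
 */
static uint32_t
effective_failsafe_age(uint32_t failsafe_age, uint32_t freeze_max_age)
{
	double		floor_age = (double) freeze_max_age * 1.05;

	/* The GUC maximums keep this well inside the uint32 range. */
	assert(floor_age < 4294967295.0);

	if ((double) failsafe_age > floor_age)
		return failsafe_age;
	return (uint32_t) floor_age;
}
```

With the defaults (1.6 billion and 200 million), the configured failsafe
age wins.  The floor only matters when the failsafe age is set close to or
below the freeze max age.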

--
Sami Imseih
Amazon Web Services (AWS)





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-11 03:56  wenhui qiu <[email protected]>
  parent: Sami Imseih <[email protected]>
  1 sibling, 0 replies; 143+ messages in thread

From: wenhui qiu @ 2026-03-11 03:56 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Robert Haas <[email protected]>; David Rowley <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

Hi Sami,
> Do you think it will be better to just to add the age to the
> score?
>
>                if (xid_age >= effective_xid_failsafe_age)
>                        xid_score += (double) xid_age;
>                if (mxid_age >= effective_mxid_failsafe_age)
>                        mxid_score += (double) mxid_age
>
> For most cases, this should be high enough to to make the
> score high enough to sort to the top, as mentioned in the
> comments:
Agreed, +1. The calculation results in a very large number. I previously
suggested modifying the algorithm like this:
effective_xid_failsafe_age = (vacuum_failsafe_age +
autovacuum_freeze_max_age) / 2.0. Typically, the `vacuum_failsafe_age`
parameter is rarely adjusted by DBAs. My view has always been that tables
whose age cannot be reduced should be prioritized, while we should try to
avoid tables whose age is already close to vacuum_failsafe_age.




Thanks


On Wed, Mar 11, 2026 at 9:08 AM Sami Imseih <[email protected]> wrote:

> > Just a few things:
> >
> > 1/
> > +        Oid            xid_age;
> > +        Oid            mxid_age;
> >
> > Is using Oid here intentional? I'm curious why not use uint32 for
> clarity?
> >
> > 2/
> > The new GUC docs says  "...component of the score...", but without
> > introducing the concept of the prioritization score.
> > I think we should expand a bit more on this topic to help a user
> > understand and tune these more effectively. Attached is my
> > proposal for the docs. I tried to keep it informative without
> > being too verbose, and avoided making specific recommendations.
>
> My apologies. I found something else that may need
> addressing.
>
> +               if (xid_age >= effective_xid_failsafe_age)
> +                       xid_score = pow(xid_score, Max(1.0, (double)
> xid_age / 100000000));
> +               if (mxid_age >= effective_mxid_failsafe_age)
> +                       mxid_score = pow(mxid_score, Max(1.0, (double)
> mxid_age / 100000000));
> +
>
> The current scaling calculation for force_vacuum could lead to
> exorbitantly high scores.
> Using DEBUG3 and consume_xids_until(2000000000), notice how the score goes
> from 7.93 to 661828682916018.125 once past failsafe age.
>
> 36), anl: 0 (threshold 97929), score: 7.930
> 2026-03-10 19:41:11.979 CDT [74007] DEBUG:  foo: vac: 0 (threshold
> 195809), ins: 0 (threshold 176836), anl: 0 (threshold 97929), score:
> 7.930
> 2026-03-10 19:41:32.062 CDT [74038] DEBUG:  foo: vac: 0 (threshold
> 195809), ins: 0 (threshold 176836), anl: 0 (threshold 97929), score:
> 661828682916018.125
> 2026-03-10 19:41:32.063 CDT [74038] DEBUG:  foo: vac: 0 (threshold
> 195809), ins: 0 (threshold 176836), anl: 0 (threshold 97929), score:
> 661828682916018.125
> 2026-03-10 19:41:51.961 CDT [74066] DEBUG:  foo: vac: 0 (threshold
> 195809), ins: 0 (threshold 176836), anl: 0 (threshold 97929), score:
> 26761249940789168.000
>
> Do you think it will be better to just to add the age to the
> score?
>
>                if (xid_age >= effective_xid_failsafe_age)
>                        xid_score += (double) xid_age;
>                if (mxid_age >= effective_mxid_failsafe_age)
>                        mxid_score += (double) mxid_age
>
> For most cases, this should be high enough to to make the
> score high enough to sort to the top, as mentioned in the
> comments:
>
> +                * As in vacuum_xid_failsafe_check(), the effective
> failsafe age is no
> +                * less than 105% the value of the respective
> *_freeze_max_age
> +                * parameter.  Note that per-table settings could
> result in a low
> +                * score even if the table surpasses the failsafe
> settings.  However,
> +                * this is a strange enough corner case that we don't
> bother trying to
> +                * handle it.
> +                */
>
> --
> Sami Imseih
> Amazon Web Services (AWS)
>
>
>


^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-11 15:53  Nathan Bossart <[email protected]>
  parent: Sami Imseih <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2026-03-11 15:53 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Robert Haas <[email protected]>; David Rowley <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, Mar 10, 2026 at 08:08:26PM -0500, Sami Imseih wrote:
>> +        Oid            xid_age;
>> +        Oid            mxid_age;
>>
>> Is using Oid here intentional? I'm curious why not use uint32 for clarity?

Fixed.

>> The new GUC docs says  "...component of the score...", but without
>> introducing the concept of the prioritization score.
>> I think we should expand a bit more on this topic to help a user
>> understand and tune these more effectively. Attached is my
>> proposal for the docs. I tried to keep it informative without
>> being too verbose, and avoided making specific recommendations.

Good idea.  I put my own spin on it in the attached.  Please let me know
what you think.

> The current scaling calculation for force_vacuum could lead to
> exorbitantly high scores.
> Using DEBUG3 and consume_xids_until(2000000000), notice how the score goes
> from 7.93 to 661828682916018.125 once past failsafe age.
> 
> [...]
> 
> Do you think it will be better to just to add the age to the
> score?

I mean, that's kind of the point.  Once a table surpasses one of the
failsafe thresholds, we want its score to be so exorbitantly high that
autovacuum is all but guaranteed to process it first.  I see no particular
advantage to tempering the score in that case.

-- 
nathan


^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-11 17:08  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2026-03-11 17:08 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Robert Haas <[email protected]>; David Rowley <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> Good idea.  I put my own spin on it in the attached.  Please let me know
> what you think.

This looks OK to me.

> > The current scaling calculation for force_vacuum could lead to
> > exorbitantly high scores.
> > Using DEBUG3 and consume_xids_until(2000000000), notice how the score goes
> > from 7.93 to 661828682916018.125 once past failsafe age.
> >
> > [...]
> >
> > Do you think it will be better to just to add the age to the
> > score?
>
> I mean, that's kind of the point.  Once a table surpasses one of the
> failsafe thresholds, we want its score to be so exorbitantly high that
> autovacuum is all but guaranteed to process it first.  I see no particular
> advantage to tempering the score in that case.

The main issue is that the scores can reach quadrillions, or even billions,
which feels excessive, especially if exposed in DEBUG3 or in a future
prioritization view.
So scaling the scores down seems like the right thing to do.

We could also do this by dividing the score by a large constant, or by
using log10 to compress the score. Both methods should keep the sort
order unchanged.

However, if everyone agrees with the current approach, I will not push
back further.
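
As a rough check of the claim that such a compression keeps the sort order, here is a minimal standalone sketch. It is not code from the patch: scale_by_constant(), order_is_preserved(), and the 1e9 divisor are hypothetical names and values chosen only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical divisor, picked only for illustration. */
#define SCORE_DIVISOR 1e9

/* One of the two suggested compressions: divide by a large constant. */
double
scale_by_constant(double score)
{
	return score / SCORE_DIVISOR;
}

/*
 * Returns true if no pair of scores is inverted by transform().  Ties
 * introduced by rounding are tolerated; only a swapped pair counts.
 */
bool
order_is_preserved(const double *scores, size_t n,
				   double (*transform) (double))
{
	for (size_t i = 0; i < n; i++)
	{
		for (size_t j = 0; j < n; j++)
		{
			if (scores[i] > scores[j] &&
				transform(scores[i]) < transform(scores[j]))
				return false;
		}
	}
	return true;
}
```

A thin wrapper around log10() from <math.h> could be passed as transform in the same way. Note that log10(0) is negative infinity, which still sorts below every positive score, and that either compression may turn two nearly equal scores into a tie.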

--
Sami Imseih
Amazon Web Services (AWS)





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-11 17:28  Nathan Bossart <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2026-03-11 17:28 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Robert Haas <[email protected]>; David Rowley <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Wed, Mar 11, 2026 at 12:08:52PM -0500, Sami Imseih wrote:
> The main issue is that the scores can reach quadrillions, or even billions,
> which feels excessive, especially if exposed in DEBUG3 or in a future
> prioritization view.

But why is that an issue?  Because the number looks big when there's
extremely verbose logging enabled?  I'm not following your objection.  IMHO
we _want_ the score to be excessively high in these cases so that there's
basically zero chance a table with unreasonable bloat takes priority.  This
was discussed a bit upthread [0].

[0] https://postgr.es/m/CAApHDvqrd%3DSHVUytdRj55OWnLH98Rvtzqam5zq2f4XKRZa7t9Q%40mail.gmail.com

-- 
nathan





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-11 17:59  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2026-03-11 17:59 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Robert Haas <[email protected]>; David Rowley <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

>
> On Wed, Mar 11, 2026 at 12:08:52PM -0500, Sami Imseih wrote:
> > The main issue is that the scores can reach quadrillions, or even
> billions,
> > which feels excessive, especially if exposed in DEBUG3 or in a future
> > prioritization view.
>
> But why is that an issue?  Because the number looks big when there's
> extremely verbose logging enabled?  I'm not following your objection.


Yes, purely the look of such an excessively large number could seem wrong
to a user. Putting on my user hat, I would be confused and would honestly
think this is a bug in the calculation. If we weren't exposing the numbers,
I would not care.

But this could just be me.

The comment "this component increases greatly once the age surpasses" is
perhaps good enough.

> we _want_ the score to be excessively high in these cases so that there's
> basically zero chance a table with unreasonable bloat takes priority.  This
> was discussed a bit upthread [0].
>
> [0]
> https://postgr.es/m/CAApHDvqrd%3DSHVUytdRj55OWnLH98Rvtzqam5zq2f4XKRZa7t9Q%40mail.gmail.com
>

Yes, I definitely agree with this.

--
Sami Imseih
Amazon Web Services (AWS)


^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-12 19:20  Nathan Bossart <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2026-03-12 19:20 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Robert Haas <[email protected]>; David Rowley <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

I'm debating whether I should move forward with committing this [0] for
v19.  On one hand, I think I've addressed all the latest feedback, and I'm
not aware of any sustained objections to the approach.  But on the other
hand, there hasn't been much discussion since November (my fault), and I
can't quite tell if this patch has enough support.  At the moment, I'm
leaning towards committing it, but I'm curious what folks think.

[0] https://postgr.es/m/attachment/191721/v11-0001-autovacuum-scheduling-improvements.patch

-- 
nathan





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-17 23:06  David Rowley <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2026-03-17 23:06 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Fri, 13 Mar 2026 at 08:20, Nathan Bossart <[email protected]> wrote:
> I'm debating whether I should move forward with committing this [0] for
> v19.  On one hand, I think I've addressed all the latest feedback, and I'm
> not aware of any sustained objections to the approach.  But on the other
> hand, there hasn't been much discussion since November (my fault), and I
> can't quite tell if this patch has enough support.  At the moment, I'm
> leaning towards committing it, but I'm curious what folks think.

I think it would have been better to have done this about 3 months
ago, but I think it's probably still fine to do now. Feature freeze is
still quite a long way from release. I do expect that the most likely
time that someone would find issues with this change would be during
beta or RC, as that's when people would give PostgreSQL production
workloads to try out. During the dev cycle, I expect it's *mostly*
just hackers giving the database toy workloads in a very targeted way
to something specific that they're hacking on.  Anyway, now since
you've added the GUCs to control the weights, there's a way for users
to have some influence, so at least there's an escape hatch.

I think the GUCs are probably a good idea. I expect the most likely
change that people might want to make would be to raise the priority
of analyze over vacuum since that's often much faster to complete. We
know that some workloads are very sensitive to outdated statistics.

On the other hand, we shouldn't be taking adding 5 new autovacuum GUCs
lightly as there are already so many. If we are going to come up with
something better than this in a few years then it could be better to
wait to reduce the pain of having to remove GUCs in the future. I
don't personally have any better ideas, so I'd rather see it go in
than not.

I didn't look at the patch in detail, but noticed that you might want
to add a "See Section N.N.N for more information." to the new GUCs to
link to the text you've added on how they're used.

Do you think it's worth making the call to
list_sort(tables_to_process, TableToProcessComparator) conditional on
a variable set like: sort_required |= (score != 0.0);? I recall that
someone had concerns that the actual sort itself could add overhead.
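
A minimal sketch of that idea, assuming a plain array and qsort() in place of the patch's List and list_sort(). The Candidate struct and both function names are hypothetical. In the real code the flag would presumably be set while the list is being built rather than in a separate loop.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the patch's per-table entry. */
typedef struct Candidate
{
	unsigned int oid;
	double		score;
} Candidate;

/* qsort() comparator: highest score first. */
int
candidate_cmp(const void *a, const void *b)
{
	const Candidate *ca = a;
	const Candidate *cb = b;

	if (ca->score > cb->score)
		return -1;
	if (ca->score < cb->score)
		return 1;
	return 0;
}

/*
 * Sorts only if at least one candidate has a nonzero score.  Returns true
 * if qsort() was actually called.
 */
bool
sort_candidates_if_needed(Candidate *cands, size_t n)
{
	bool		sort_required = false;

	for (size_t i = 0; i < n; i++)
		sort_required |= (cands[i].score != 0.0);

	if (sort_required)
		qsort(cands, n, sizeof(Candidate), candidate_cmp);

	return sort_required;
}
```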

David





^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-18 16:09  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2026-03-18 16:09 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Wed, Mar 18, 2026 at 12:06:34PM +1300, David Rowley wrote:
> I think it would have been better to have done this about 3 months
> ago, but I think it's probably still fine to do now. Feature freeze is
> still quite a long way from release. I do expect that the most likely
> time that someone would find issues with this change would be during
> beta or RC, as that's when people would give PostgreSQL production
> workloads to try out. During the dev cycle, I expect it's *mostly*
> just hackers giving the database toy workloads in a very targeted way
> to something specific that they're hacking on.  Anyway, now since
> you've added the GUCs to control the weights, there's a way for users
> to have some influence, so at least there's an escape hatch.

Thanks for chiming in.

> I think the GUCs are probably a good idea. I expect the most likely
> change that people might want to make would be to raise the priority
> of analyze over vacuum since that's often much faster to complete. We
> know that some workloads are very sensitive to outdated statistics.
> 
> On the other hand, we shouldn't be taking adding 5 new autovacuum GUCs
> lightly as there are already so many. If we are going to come up with
> something better than this in a few years then it could be better to
> wait to reduce the pain of having to remove GUCs in the future. I
> don't personally have any better ideas, so I'd rather see it go in
> than not.

Yeah, adding these GUCs feels a bit like etching in stone, but if folks
want configurability, and nobody has better ideas, I'm not sure what else
to do.

> I didn't look at the patch in detail, but noticed that you might want
> to add a "See Section N.N.N for more information." to the new GUCs to
> link to the text you've added on how they're used.

Good idea.  I've added that.

> Do you think it's worth making the call to
> list_sort(tables_to_process, TableToProcessComparator) conditional on
> a variable set like: sort_required |= (score != 0.0);? I recall that
> someone had concerns that the actual sort itself could add overhead.

I don't think it's necessary.  I tested sorting 1M and 10M tables with
randomly generated scores (on my MacBook, with assertions enabled).  The
former took ~150 milliseconds, and the latter took ~1770 milliseconds.  I
suspect there are far bigger problems to worry about if you have anywhere
near that many tables.
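
For anyone who wants to try something similar, a rough standalone sketch of such a timing test follows. It is not the test I ran: it sorts a bare array of doubles with qsort() rather than a List with list_sort(), and score_cmp_desc() and time_score_sort_ms() are hypothetical names, so the numbers it produces are not comparable to the ones above.

```c
#include <stdlib.h>
#include <time.h>

/* qsort() comparator: highest score first. */
int
score_cmp_desc(const void *a, const void *b)
{
	double		sa = *(const double *) a;
	double		sb = *(const double *) b;

	if (sa > sb)
		return -1;
	if (sa < sb)
		return 1;
	return 0;
}

/*
 * Fills an array with n pseudo-random scores, sorts it in descending
 * order, and returns the CPU time spent in qsort() in milliseconds.
 * Returns -1.0 if the allocation fails.
 */
double
time_score_sort_ms(size_t n, unsigned int seed)
{
	double	   *scores = malloc(n * sizeof(double));
	clock_t		start;
	clock_t		end;

	if (scores == NULL)
		return -1.0;

	srand(seed);
	for (size_t i = 0; i < n; i++)
		scores[i] = (double) rand() / RAND_MAX * 100.0;

	start = clock();
	qsort(scores, n, sizeof(double), score_cmp_desc);
	end = clock();

	free(scores);
	return (double) (end - start) * 1000.0 / CLOCKS_PER_SEC;
}
```

Calling time_score_sort_ms(1000000, 42) gives a ballpark figure for one million entries on the local machine.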

-- 
nathan


^ permalink  raw  reply  [nested|flat] 143+ messages in thread

* Re: another autovacuum scheduling thread
@ 2026-03-19 09:55  Greg Burd <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Greg Burd @ 2026-03-19 09:55 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; David Rowley <[email protected]>; +Cc: Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers


On Wed, Mar 18, 2026, at 12:09 PM, Nathan Bossart wrote:
> On Wed, Mar 18, 2026 at 12:06:34PM +1300, David Rowley wrote:
>> I think it would have been better to have done this about 3 months
>> ago, but I think it's probably still fine to do now. Feature freeze is
>> still quite a long way from release. I do expect that the most likely
>> time that someone would find issues with this change would be during
>> beta or RC, as that's when people would give PostgreSQL production
>> workloads to try out. During the dev cycle, I expect it's *mostly*
>> just hackers giving the database toy workloads in a very targeted way
>> to something specific that they're hacking on.  Anyway, now since
>> you've added the GUCs to control the weights, there's a way for users
>> to have some influence, so at least there's an escape hatch.
>
> Thanks for chiming in.

Hey Nathan, et al., I'll chime in too!  Apologies in advance for the length of the message.  I need to learn how to be more concise...

First off, thanks for looking into this!  I think your work is a significant improvement over where we are today and should be in v19.  Put simply, I think that the v12 patch is no worse than pg_class order we use today and in many cases much better.  That said, I think we should tweak it a bit more while we're at it.

>> I think the GUCs are probably a good idea. I expect the most likely
>> change that people might want to make would be to raise the priority
>> of analyze over vacuum since that's often much faster to complete. We
>> know that some workloads are very sensitive to outdated statistics.
>> 
>> On the other hand, we shouldn't be taking adding 5 new autovacuum GUCs
>> lightly as there are already so many. If we are going to come up with
>> something better than this in a few years then it could be better to
>> wait to reduce the pain of having to remove GUCs in the future. I
>> don't personally have any better ideas, so I'd rather see it go in
>> than not.
>
> Yeah, adding these GUCs feels a bit like etching in stone, but if folks
> want configurability, and nobody has better ideas, I'm not sure what else
> to do.

I'm late in the review process. I know David Rowley proposed the unified scoring approach that became the foundation of this patch, and I think that's a great direction. However, I'm concerned that the patch's default scoring weights don't give XID-age urgency sufficient priority over dead-tuple urgency. The weight GUCs (autovacuum_vacuum_score_weight, etc.) can address this, but they max at 1.0, meaning you can only reduce dead-tuple priority, not increase XID priority.

>> I didn't look at the patch in detail, but noticed that you might want
>> to add a "See Section N.N.N for more information." to the new GUCs to
>> link to the text you've added on how they're used.
>
> Good idea.  I've added that.
>
>> Do you think it's worth making the call to
>> list_sort(tables_to_process, TableToProcessComparator) conditional on
>> a variable set like: sort_required |= (score != 0.0);? I recall that
>> someone had concerns that the actual sort itself could add overhead.
>
> I don't think it's necessary.  I tested sorting 1M and 10M tables with
> randomly generated scores (on my MacBook, with assertions enabled).  The
> former took ~150 milliseconds, and the latter took ~1770 milliseconds.  I
> suspect there are far bigger problems to worry about if you have anywhere
> near that many tables.
>
> -- 
> nathan
>
> Attachments:
> * v12-0001-autovacuum-scheduling-improvements.patch

I decided to model before/after behavior using discrete-event simulation. With a bit of LLM help, we now have autovacuum_simulation.py and autovacuum_simulation_fixed.py to compare all three approaches (Before/v12/Proposed small fix to v12) across 20 runs with randomized OID assignments.

Results from the models are at the end of this email.  They simulate the before/after/fixed behavior of the v12 autovacuum scheduling patch and a suggested fix (I'll get to why in a minute).

They test a contention scenario: 100 tables (5 already past freeze_max_age, 15 approaching it at staggered rates, and 80 high-churn tables constantly exceeding their dead-tuple vacuum thresholds) competing for 3 autovacuum workers over 7 days.  I'm sure there are other scenarios that we could test, but this felt representative enough to me based on what I've seen.

With the first test, the "Before" scheduler processes tables in OID order; the "After" scheduler uses the patch's urgency scoring with all default GUC values.

The model of the v12 patch produces a clear system-wide improvement: the number of tables simultaneously past freeze_max_age peaks at roughly 20 under score-based scheduling versus 80+ under OID order. This happens because high-churn tables earn high dead-tuple scores, get vacuumed frequently, and their relfrozenxid ages reset as a side effect - preventing them from ever reaching freeze_max_age. Under OID order, high-OID active tables are *starved* (for no other reason than they were created later than their counterparts!) and their XID ages grow needlessly into the danger zone. The v12 patch also correctly prioritizes the most dangerous force-vacuum tables first (those closest to failsafe age), whereas OID order's success with any particular table is coincidental.

But there's a downside in v12: the simulation reveals a scoring-scale concern under default weights. Active tables accumulate dead-tuple scores of 18–70+ within minutes of their last vacuum, while force-vacuum tables that have just crossed freeze_max_age carry XID scores of only 1.0–4.0 (age/freeze_max_age). The exponential boost doesn't activate until failsafe age (1.6B), which is 8× the freeze_max_age threshold. In the gap between 200M and 1.6B, force-vacuum tables are consistently outranked by ordinary dead-tuple work. In the tested scenario this meant only 2 of 20 at-risk tables were actually vacuumed by the score-based scheduler (versus 5 by OID due to luck), and average per-table wraparound exposure was 26% worse.
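
To make the scale mismatch concrete, here is a small sketch of the arithmetic. The XID component (age/freeze_max_age) is as described above, and the vacuum threshold is the documented base threshold plus scale factor times number of tuples. Treating the dead-tuple component as dead tuples divided by that threshold is my assumption about how the score is formed, and all three function names are hypothetical.

```c
#include <stdint.h>

/* XID-age component as described above: age / freeze_max_age. */
double
xid_age_component(uint32_t xid_age, uint32_t freeze_max_age)
{
	return (double) xid_age / (double) freeze_max_age;
}

/* Documented trigger: base threshold + scale factor * number of tuples. */
double
vacuum_threshold(double base_threshold, double scale_factor, double reltuples)
{
	return base_threshold + scale_factor * reltuples;
}

/* Assumed dead-tuple component: dead tuples relative to the threshold. */
double
dead_tuple_component(double dead_tuples, double base_threshold,
					 double scale_factor, double reltuples)
{
	return dead_tuples /
		vacuum_threshold(base_threshold, scale_factor, reltuples);
}
```

With the simulated settings (threshold 50, scale factor 0.2, freeze_max_age 200,000,000), a table at XID age 250,000,000 scores 1.25, and one at the 1.6B failsafe age scores 8.0 before any boost. A made-up 10,000-row table with 41,000 dead tuples has a threshold of 2,050 and so scores about 20, well above either.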

One possible remedy within the current design would be to either raise the default autovacuum_freeze_score_weight or apply a floor multiplier when force_vacuum is true, for example by ensuring that any force-vacuum score is at least as large as the maximum non-force-vacuum score in the current candidate set. Alternatively, the weight GUC range could be expanded above 1.0 to allow DBAs to explicitly boost XID-age priority. The existing 0.0–1.0 range only allows reducing component priority, which makes it difficult to express "wraparound prevention is more important than bloat control."
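
One way to read the floor idea, sketched under my own assumptions: rather than clamping, add the largest routine score to every force-vacuum score, so that each one ends up at least as large as any routine score while the relative order among force-vacuum tables is kept. FloorCandidate and apply_wraparound_floor() are hypothetical and not part of any posted patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a scored autovacuum candidate. */
typedef struct FloorCandidate
{
	unsigned int oid;
	double		score;
	bool		wraparound;		/* true if force_vacuum applies */
} FloorCandidate;

/*
 * Lifts every wraparound candidate above the routine work by adding the
 * maximum routine score to it.  Routine scores are left untouched.
 */
void
apply_wraparound_floor(FloorCandidate *cands, size_t n)
{
	double		max_routine = 0.0;

	/* First pass: find the highest score among routine candidates. */
	for (size_t i = 0; i < n; i++)
	{
		if (!cands[i].wraparound && cands[i].score > max_routine)
			max_routine = cands[i].score;
	}

	/* Second pass: offset the wraparound candidates by that amount. */
	for (size_t i = 0; i < n; i++)
	{
		if (cands[i].wraparound)
			cands[i].score += max_routine;
	}
}
```

This needs a second pass over the candidate set and changes the reported scores, which is part of why the tiered comparison described next looks simpler to me.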

Tiered sorting using the existing wraparound flag might be the simplest and safest fix. While the thread discussed using scoring with exponential boost to prioritize wraparound tables, I think a two-tier approach (wraparound tier vs routine tier) would be more robust. We already treat wraparound as non-negotiable (force_vacuum bypasses av_enabled, triggers emergency behavior). The scoring system should reflect this by making wraparound a hard priority tier, not a score component competing with bloat cleanup.

Instead of sorting solely by score, sort first by wraparound, then by score within each tier. The score computation stays exactly as-is; it's only used for relative ordering among force_vacuum tables and among non-force-vacuum tables.

Something like:

/*
 * Comparison function for sorting TableToProcess candidates.
 *
 * Tables at risk of wraparound are always processed before routine
 * maintenance work.  Within each tier, tables are sorted by descending
 * urgency score.
 */
static int
TableToProcessComparator(const ListCell *a, const ListCell *b)
{
	TableToProcess *ta = (TableToProcess *) lfirst(a);
	TableToProcess *tb = (TableToProcess *) lfirst(b);

	/* Wraparound prevention always takes priority */
	if (ta->wraparound && !tb->wraparound)
		return -1;
	if (!ta->wraparound && tb->wraparound)
		return 1;

	/* Within same tier, highest score first */
	if (ta->score > tb->score)
		return -1;
	if (ta->score < tb->score)
		return 1;
	return 0;
}
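
For experimenting outside the backend, the same tiered ordering can be sketched with a plain array and qsort(). DemoTable, demo_table_cmp(), and sort_tiered() are hypothetical stand-ins for TableToProcess and the list_sort() call, not code from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for TableToProcess. */
typedef struct DemoTable
{
	unsigned int oid;
	double		score;
	bool		wraparound;
} DemoTable;

/* Wraparound tier first; within a tier, highest score first. */
int
demo_table_cmp(const void *a, const void *b)
{
	const DemoTable *ta = a;
	const DemoTable *tb = b;

	if (ta->wraparound && !tb->wraparound)
		return -1;
	if (!ta->wraparound && tb->wraparound)
		return 1;

	if (ta->score > tb->score)
		return -1;
	if (ta->score < tb->score)
		return 1;
	return 0;
}

/* Sorts the candidates in place using the two-tier comparison. */
void
sort_tiered(DemoTable *tables, size_t n)
{
	qsort(tables, n, sizeof(DemoTable), demo_table_cmp);
}
```

Given routine tables scoring 70 and 18 and wraparound tables scoring 1.25 and 4.0, this puts the two wraparound tables first (4.0, then 1.25), followed by the routine tables (70, then 18).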

The simulation code, workload generator, and visualizations are attached (in "foo_tgz" because my last attempt at this email was stuck in the moderation queue). I'd welcome feedback on whether the scenario is representative. The autovacuum_simulation_fixed.py script includes the proposed fix and also runs a number of iterations in which it randomizes the table OIDs, so as to remove any dependency on ordering that may be implied in our current algorithm (the "before", or v18, algorithm).

The proposed addition to your fix is in the v20260318 patch attached.

best.

-greg


$ ./autovacuum_simulation.py
======================================================================
AUTOVACUUM SCHEDULING SIMULATION (v12 patch)
Accurate model of score-based prioritization
======================================================================

PostgreSQL config:
 autovacuum_freeze_max_age     =   200,000,000
 vacuum_failsafe_age           = 1,600,000,000
 autovacuum_vacuum_threshold   = 50
 autovacuum_vacuum_scale_factor= 0.2
 autovacuum_max_workers        = 3

Simulation: 7 days, 60s steps, seed=42

Generating tables...
   5 critical   — already past freeze_max_age
  15 aging      — approaching freeze_max_age
  80 active     — high dead-tuple rate

BEFORE simulation (catalog OID order):
 [OID order  ]   0%
 [OID order  ]  20%
 [OID order  ]  40%
 [OID order  ]  60%
 [OID order  ]  80%
 [OID order  ] 100%

AFTER simulation (score-based priority):
 [Score-based]   0%
 [Score-based]  20%
 [Score-based]  40%
 [Score-based]  60%
 [Score-based]  80%
 [Score-based] 100%

==============================================================================
RESULTS: Exposure time (minutes at risk before vacuum)
Table               OID   Crossed     Before      After   Change
------------------------------------------------------------------------------
critical_0        16465   day 0.0    10080m     10080m      +0%
critical_1        16398   day 0.0       14m     10080m  -71900%
critical_2        16387   day 0.0        4m     10080m  -251900%
critical_3        16478   day 0.0    10080m         4m    +100%
critical_4        16419   day 0.0    10080m         4m    +100%
aging_0           16415   day 0.3     9648m      9648m      +0%
aging_1           16412   day 0.7     9114m      9114m      +0%
aging_7           16459   day 0.7     9027m      9027m      +0%
aging_6           16395   day 0.8        9m      8982m  -99700%
aging_10          16481   day 0.9     8721m      8721m      +0%
aging_11          16472   day 0.9     8721m      8721m      +0%
aging_2           16401   day 1.0      415m      8579m   -1967%
aging_3           16397   day 1.0       12m      8578m  -71383%
aging_4           16470   day 1.1     8499m      8499m      +0%
aging_12          16411   day 1.1     8483m      8483m      +0%
aging_8           16438   day 1.2     8296m      8296m      +0%
aging_5           16453   day 1.3     8178m      8178m      +0%
aging_14          16448   day 2.1     7101m      7101m      +0%
aging_9           16388   day 2.2        4m      6979m  -174375%
aging_13          16413   day 2.2     6959m      6959m      +0%
------------------------------------------------------------------------------
AVERAGE                               6172m      7806m     -26%
MAXIMUM                              10080m     10080m      +0%
==============================================================================

Generating visualization...
 ✓ output/autovacuum_scheduling_impact.png

Done.


-------------------------------------------------------

$ ./autovacuum_simulation_fixed.py
========================================================================
THREE-WAY AUTOVACUUM SCHEDULING COMPARISON
Before (OID) vs v12 Patch (Score) vs Proposed Fix (Tiered)
========================================================================

Config: 3 workers, 7-day sim, 60s steps, 20 runs
Tables: 5 critical + 15 aging + 80 active = 100
freeze_max_age = 200,000,000
Estimated runtime: 3-8 minutes

Run      OID avg    Score avg   Tiered avg
-------------------------------------------
  1       7222m       7892m          4m
  2       7642m       7892m          4m
  3       7961m       7892m          4m
  4       6333m       7892m          4m
  5       8110m       7892m          4m
  6       6359m       7892m          4m
  7       6629m       7892m          4m
  8       8526m       7892m          4m
  9       6385m       7892m          4m
 10       6813m       7892m          4m
 11       8588m       7892m          4m
 12       7261m       7892m          4m
 13       6682m       7892m          4m
 14       8035m       7892m          4m
 15       5667m       7892m          4m
 16       7595m       7892m          4m
 17       6394m       7892m          4m
 18       6686m       7892m          4m
 19       7819m       7892m          4m
 20       7178m       7892m          4m

========================================================================
AGGREGATE RESULTS
========================================================================

Avg exposure per run (minutes):
 OID       :    7194 ± 816      (min=5667, max=8588)
 Score     :    7892 ± 0        (min=7892, max=7892)
 Tiered    :       4 ± 0        (min=4, max=4)

Peak concurrent force_vacuum tables:
 OID       :      82 ± 3        (min=79, max=88)
 Score     :      20 ± 0        (min=20, max=20)
 Tiered    :       5 ± 0        (min=5, max=5)

Pairwise wins (lower avg exposure = better):
 Score beats OID: 5/20   loses: 15/20   ties: 0/20
 Tiered beats OID: 20/20   loses: 0/20   ties: 0/20
 Tiered beats Score: 20/20   loses: 0/20   ties: 0/20

Variance (std dev of avg exposure across runs):
 OID       : 816 min
 Score     : 0 min
 Tiered    : 0 min

Per-table mean exposure (minutes):
 Table                      OID           Score          Tiered
 --------------------------------------------------------------
 critical_0        7564±4470    10080±0           7±0
 critical_1        8577±3671    10080±0           7±0
 critical_2        9073±3099     8938±0           3±0
 critical_3        8581±3661        4±0           4±0
 critical_4        9167±2826        3±0           3±0
 aging_0           7253±4256     9648±0           4±0
 aging_1           7324±3675     9114±0           4±0
 aging_2           6865±3517     8579±0           5±0
 aging_3           7253±3714     8851±0           4±0
 aging_4           6018±4014     8579±0           4±0
 aging_5           6856±2951     8064±0           4±0
 aging_6           6799±3482     8496±0           3±0
 aging_7           5765±4318     8853±0           4±0
 aging_8           7091±3227     8644±0           4±0
 aging_9           6358±2738     7479±0           4±0
 aging_10          7106±3620     8870±0           4±0
 aging_11          7175±3673     8965±0           4±0
 aging_12          6499±3299     8106±0           4±0
 aging_13          6080±3108     7595±0           4±0
 aging_14          6479±3913     8891±0           4±0

========================================================================
Completed in 108 seconds (1.8 minutes)
========================================================================

Generating visualization...
 ✓ output/three_way_comparison.png

Attachments:

  [application/octet-stream] v20260318-0001-autovacuum-scheduling-improvements.patch (29.9K, 2-v20260318-0001-autovacuum-scheduling-improvements.patch)
  download | inline diff:
From b907e947e495f717832644f5c740ec75782dc5d9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <[email protected]>
Date: Fri, 10 Oct 2025 12:28:37 -0500
Subject: [PATCH v20260318] autovacuum scheduling improvements

---
 doc/src/sgml/config.sgml                      |  90 ++++++
 doc/src/sgml/maintenance.sgml                 |  92 ++++++
 src/backend/postmaster/autovacuum.c           | 263 ++++++++++++++----
 src/backend/utils/misc/guc_parameters.dat     |  40 +++
 src/backend/utils/misc/postgresql.conf.sample |   5 +
 src/include/postmaster/autovacuum.h           |   6 +-
 src/tools/pgindent/typedefs.list              |   1 +
 7 files changed, 449 insertions(+), 48 deletions(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 8cdd826fbd3..229f41353eb 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9395,6 +9395,96 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
        </listitem>
       </varlistentry>
 
+      <varlistentry id="guc-autovacuum-freeze-score-weight" xreflabel="autovacuum_freeze_score_weight">
+       <term><varname>autovacuum_freeze_score_weight</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>autovacuum_freeze_score_weight</varname></primary>
+        <secondary>configuration parameter</secondary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Specifies the scaling factor of the transaction ID age component of
+         the score used by autovacuum for prioritization purposes.  The default
+         is <literal>1.0</literal>.  This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.  See <xref linkend="autovacuum-priority"/> for more information.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-autovacuum-multixact-freeze-score-weight" xreflabel="autovacuum_multixact_freeze_score_weight">
+       <term><varname>autovacuum_multixact_freeze_score_weight</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>autovacuum_multixact_freeze_score_weight</varname></primary>
+        <secondary>configuration parameter</secondary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Specifies the scaling factor of the multixact ID age component of the
+         score used by autovacuum for prioritization purposes.  The default is
+         <literal>1.0</literal>.  This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.  See <xref linkend="autovacuum-priority"/> for more information.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-autovacuum-vacuum-score-weight" xreflabel="autovacuum_vacuum_score_weight">
+       <term><varname>autovacuum_vacuum_score_weight</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>autovacuum_vacuum_score_weight</varname></primary>
+        <secondary>configuration parameter</secondary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Specifies the scaling factor of the vacuum threshold component of the
+         score used by autovacuum for prioritization purposes.  The default is
+         <literal>1.0</literal>.  This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.  See <xref linkend="autovacuum-priority"/> for more information.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-autovacuum-vacuum-insert-score-weight" xreflabel="autovacuum_vacuum_insert_score_weight">
+       <term><varname>autovacuum_vacuum_insert_score_weight</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>autovacuum_vacuum_insert_score_weight</varname></primary>
+        <secondary>configuration parameter</secondary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Specifies the scaling factor of the vacuum insert threshold component
+         of the score used by autovacuum for prioritization purposes.  The
+         default is <literal>1.0</literal>.  This parameter can only be set in
+         the <filename>postgresql.conf</filename> file or on the server command
+         line.  See <xref linkend="autovacuum-priority"/> for more information.
+        </para>
+       </listitem>
+      </varlistentry>
+
+      <varlistentry id="guc-autovacuum-analyze-score-weight" xreflabel="autovacuum_analyze_score_weight">
+       <term><varname>autovacuum_analyze_score_weight</varname> (<type>floating point</type>)
+       <indexterm>
+        <primary><varname>autovacuum_analyze_score_weight</varname></primary>
+        <secondary>configuration parameter</secondary>
+       </indexterm>
+       </term>
+       <listitem>
+        <para>
+         Specifies the scaling factor of the analyze threshold component of the
+         score used by autovacuum for prioritization purposes.  The default is
+         <literal>1.0</literal>.  This parameter can only be set in the
+         <filename>postgresql.conf</filename> file or on the server command
+         line.  See <xref linkend="autovacuum-priority"/> for more information.
+        </para>
+       </listitem>
+      </varlistentry>
+
      </variablelist>
     </sect2>
 
diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index 7c958b06273..b609f05be07 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -1061,6 +1061,98 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
      effectively prevent autovacuums from ever completing.
     </para>
    </warning>
+
+   <sect3 id="autovacuum-priority">
+    <title>Autovacuum Prioritization</title>
+
+    <para>
+     Autovacuum decides what to process in two steps: first it chooses a
+     database, then it chooses the tables within that database.  The autovacuum
+     launcher process prioritizes databases at risk of transaction ID or
+     multixact ID wraparound, else it chooses the database processed least
+     recently.  As an exception, it skips databases with no connections or no
+     activity since the last statistics reset, unless at risk of wraparound.
+    </para>
+
+    <para>
+     Within a database, the autovacuum worker process builds a list of tables
+     that require vacuum or analyze and sorts them using a scoring system.  It
+     scores each table by taking the maximum value of several component scores
+     representing various criteria important to vacuum or analyze.  Those
+     components are as follows:
+    </para>
+
+    <itemizedlist>
+     <listitem>
+      <para>
+       The <emphasis>transaction ID</emphasis> component measures the age in
+       transactions of the table's
+       <structname>pg_class</structname>.<structfield>relfrozenxid</structfield>
+       field as compared to <xref linkend="guc-autovacuum-freeze-max-age"/>.
+       Furthermore, this component increases greatly once the age surpasses
+       <xref linkend="guc-vacuum-failsafe-age"/>.  The final value for this
+       component can be adjusted via
+       <xref linkend="guc-autovacuum-freeze-score-weight"/>.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The <emphasis>multixact ID</emphasis> component measures the age in
+       multixacts of the table's
+       <structname>pg_class</structname>.<structfield>relminmxid</structfield>
+       field as compared to
+       <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Furthermore,
+       this component increases greatly once the age surpasses
+       <xref linkend="guc-vacuum-multixact-failsafe-age"/>.  The final value
+       for this component can be adjusted via
+       <xref linkend="guc-autovacuum-multixact-freeze-score-weight"/>.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The <emphasis>vacuum</emphasis> component measures the number of updated
+       or deleted tuples as compared to the threshold calculated with
+       <xref linkend="guc-autovacuum-vacuum-threshold"/>,
+       <xref linkend="guc-autovacuum-vacuum-scale-factor"/>, and
+       <xref linkend="guc-autovacuum-vacuum-max-threshold"/>.  The final value
+       for this component can be adjusted via
+       <xref linkend="guc-autovacuum-vacuum-score-weight"/>.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The <emphasis>vacuum insert</emphasis> component measures the number of
+       inserted tuples as compared to the threshold calculated with
+       <xref linkend="guc-autovacuum-vacuum-insert-threshold"/> and
+       <xref linkend="guc-autovacuum-vacuum-insert-scale-factor"/>.  The final
+       value for this component can be adjusted via
+       <xref linkend="guc-autovacuum-vacuum-insert-score-weight"/>.
+      </para>
+     </listitem>
+
+     <listitem>
+      <para>
+       The <emphasis>analyze</emphasis> component measures the number of
+       inserted, updated, or deleted tuples as compared to the threshold
+       calculated with
+       <xref linkend="guc-autovacuum-analyze-threshold"/> and
+       <xref linkend="guc-autovacuum-analyze-scale-factor"/>.  The final value
+       for this component can be adjusted via
+       <xref linkend="guc-autovacuum-analyze-score-weight"/>.
+      </para>
+     </listitem>
+    </itemizedlist>
+
+    <para>
+     To revert to the prioritization strategy used before
+     <productname>PostgreSQL</productname> 19 (i.e., the order the tables are
+     listed in the <structname>pg_class</structname> system catalog), set all of the
+     aforementioned <quote>weight</quote> parameters to <literal>0.0</literal>.
+    </para>
+   </sect3>
   </sect2>
  </sect1>
 
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index 219673db930..53712f80849 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -62,6 +62,7 @@
  */
 #include "postgres.h"
 
+#include <math.h>
 #include <signal.h>
 #include <sys/time.h>
 #include <unistd.h>
@@ -130,7 +131,11 @@ int			autovacuum_anl_thresh;
 double		autovacuum_anl_scale;
 int			autovacuum_freeze_max_age;
 int			autovacuum_multixact_freeze_max_age;
-
+double		autovacuum_freeze_score_weight = 1.0;
+double		autovacuum_multixact_freeze_score_weight = 1.0;
+double		autovacuum_vacuum_score_weight = 1.0;
+double		autovacuum_vacuum_insert_score_weight = 1.0;
+double		autovacuum_analyze_score_weight = 1.0;
 double		autovacuum_vac_cost_delay;
 int			autovacuum_vac_cost_limit;
 
@@ -312,6 +317,13 @@ static AutoVacuumShmemStruct *AutoVacuumShmem;
 static dlist_head DatabaseList = DLIST_STATIC_INIT(DatabaseList);
 static MemoryContext DatabaseListCxt = NULL;
 
+typedef struct
+{
+	Oid			oid;
+	double		score;
+	bool		wraparound;
+} TableToProcess;
+
 /*
  * Dummy pointer to persuade Valgrind that we've not leaked the array of
  * avl_dbase structs.  Make it global to ensure the compiler doesn't
@@ -350,7 +362,8 @@ static void relation_needs_vacanalyze(Oid relid, AutoVacOpts *relopts,
 									  Form_pg_class classForm,
 									  PgStat_StatTabEntry *tabentry,
 									  int effective_multixact_freeze_max_age,
-									  bool *dovacuum, bool *doanalyze, bool *wraparound);
+									  bool *dovacuum, bool *doanalyze, bool *wraparound,
+									  double *score);
 
 static void autovacuum_do_vac_analyze(autovac_table *tab,
 									  BufferAccessStrategy bstrategy);
@@ -1867,6 +1880,33 @@ get_database_list(void)
 	return dblist;
 }
 
+/*
+ * Comparison function for sorting autovac_table candidates.
+ *
+ * Tables at risk of wraparound (force_vacuum) are always processed
+ * before routine maintenance work.  Within each tier, tables are
+ * sorted by descending urgency score.
+ */
+static int
+TableToProcessComparator(const ListCell *a, const ListCell *b)
+{
+	TableToProcess *ta = (TableToProcess *) lfirst(a);
+	TableToProcess *tb = (TableToProcess *) lfirst(b);
+
+	/* Wraparound prevention always takes priority */
+	if (ta->wraparound && !tb->wraparound)
+		return -1;
+	if (!ta->wraparound && tb->wraparound)
+		return 1;
+
+	/* Within same tier, highest score first */
+	if (ta->score > tb->score)
+		return -1;
+	if (ta->score < tb->score)
+		return 1;
+	return 0;
+}
+
 /*
  * Process a database table-by-table
  *
@@ -1880,7 +1920,7 @@ do_autovacuum(void)
 	HeapTuple	tuple;
 	TableScanDesc relScan;
 	Form_pg_database dbForm;
-	List	   *table_oids = NIL;
+	List	   *tables_to_process = NIL;
 	List	   *orphan_oids = NIL;
 	HASHCTL		ctl;
 	HTAB	   *table_toast_map;
@@ -1992,6 +2032,7 @@ do_autovacuum(void)
 		bool		dovacuum;
 		bool		doanalyze;
 		bool		wraparound;
+		double		score = 0.0;
 
 		if (classForm->relkind != RELKIND_RELATION &&
 			classForm->relkind != RELKIND_MATVIEW)
@@ -2032,11 +2073,19 @@ do_autovacuum(void)
 		/* Check if it needs vacuum or analyze */
 		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
 								  effective_multixact_freeze_max_age,
-								  &dovacuum, &doanalyze, &wraparound);
+								  &dovacuum, &doanalyze, &wraparound,
+								  &score);
 
-		/* Relations that need work are added to table_oids */
+		/* Relations that need work are added to tables_to_process */
 		if (dovacuum || doanalyze)
-			table_oids = lappend_oid(table_oids, relid);
+		{
+			TableToProcess *table = palloc_object(TableToProcess);
+
+			table->oid = relid;
+			table->score = score;
+			table->wraparound = wraparound;
+			tables_to_process = lappend(tables_to_process, table);
+		}
 
 		/*
 		 * Remember TOAST associations for the second pass.  Note: we must do
@@ -2092,6 +2141,7 @@ do_autovacuum(void)
 		bool		dovacuum;
 		bool		doanalyze;
 		bool		wraparound;
+		double		score = 0.0;
 
 		/*
 		 * We cannot safely process other backends' temp tables, so skip 'em.
@@ -2124,11 +2174,18 @@ do_autovacuum(void)
 
 		relation_needs_vacanalyze(relid, relopts, classForm, tabentry,
 								  effective_multixact_freeze_max_age,
-								  &dovacuum, &doanalyze, &wraparound);
+								  &dovacuum, &doanalyze, &wraparound,
+								  &score);
 
 		/* ignore analyze for toast tables */
 		if (dovacuum)
-			table_oids = lappend_oid(table_oids, relid);
+		{
+			TableToProcess *table = palloc_object(TableToProcess);
+
+			table->oid = relid;
+			table->score = score;
+			tables_to_process = lappend(tables_to_process, table);
+		}
 
 		/* Release stuff to avoid leakage */
 		if (free_relopts)
@@ -2252,6 +2309,8 @@ do_autovacuum(void)
 		MemoryContextSwitchTo(AutovacMemCxt);
 	}
 
+	list_sort(tables_to_process, TableToProcessComparator);
+
 	/*
 	 * Optionally, create a buffer access strategy object for VACUUM to use.
 	 * We use the same BufferAccessStrategy object for all tables VACUUMed by
@@ -2280,9 +2339,9 @@ do_autovacuum(void)
 	/*
 	 * Perform operations on collected tables.
 	 */
-	foreach(cell, table_oids)
+	foreach_ptr(TableToProcess, table, tables_to_process)
 	{
-		Oid			relid = lfirst_oid(cell);
+		Oid			relid = table->oid;
 		HeapTuple	classTup;
 		autovac_table *tab;
 		bool		isshared;
@@ -2513,7 +2572,7 @@ deleted:
 		pg_atomic_test_set_flag(&MyWorkerInfo->wi_dobalance);
 	}
 
-	list_free(table_oids);
+	list_free_deep(tables_to_process);
 
 	/*
 	 * Perform additional work items, as requested by backends.
@@ -2915,6 +2974,7 @@ recheck_relation_needs_vacanalyze(Oid relid,
 								  bool *wraparound)
 {
 	PgStat_StatTabEntry *tabentry;
+	double		score;
 
 	/* fetch the pgstat table entry */
 	tabentry = pgstat_fetch_stat_tabentry_ext(classForm->relisshared,
@@ -2922,15 +2982,12 @@ recheck_relation_needs_vacanalyze(Oid relid,
 
 	relation_needs_vacanalyze(relid, avopts, classForm, tabentry,
 							  effective_multixact_freeze_max_age,
-							  dovacuum, doanalyze, wraparound);
+							  dovacuum, doanalyze, wraparound,
+							  &score);
 
 	/* Release tabentry to avoid leakage */
 	if (tabentry)
 		pfree(tabentry);
-
-	/* ignore ANALYZE for toast tables */
-	if (classForm->relkind == RELKIND_TOASTVALUE)
-		*doanalyze = false;
 }
 
 /*
@@ -2971,6 +3028,43 @@ recheck_relation_needs_vacanalyze(Oid relid,
  * autovacuum_vacuum_threshold GUC variable.  Similarly, a vac_scale_factor
  * value < 0 is substituted with the value of
  * autovacuum_vacuum_scale_factor GUC variable.  Ditto for analyze.
+ *
+ * This function also returns a score that can be used to sort the list of
+ * tables to process.  The idea is to have autovacuum prioritize tables that
+ * are furthest beyond their thresholds (e.g., a table nearing transaction ID
+ * wraparound should be vacuumed first).  This prioritization scheme is
+ * certainly far from perfect; there are simply too many possibilities for any
+ * scoring technique to work across all workloads, and the situation might
+ * change significantly between the time we calculate the score and the time
+ * that autovacuum gets to processing it.  However, we have attempted to
+ * develop something that is expected to work for a large portion of workloads
+ * with reasonable parameter settings.
+ *
+ * The score is calculated as the maximum of the ratios of each of the table's
+ * relevant values to its threshold.  For example, if the number of inserted
+ * tuples is 100, and the insert threshold for the table is 80, the insert
+ * score is 1.25.  If all other scores are below that value, the returned score
+ * will be 1.25.  The other criteria considered for the score are the table
+ * ages (both relfrozenxid and relminmxid) compared to the corresponding
+ * freeze-max-age setting, the number of updated/deleted tuples compared to the
+ * vacuum threshold, and the number of inserted/updated/deleted tuples compared
+ * to the analyze threshold.
+ *
+ * One exception to the previous paragraph is for tables nearing wraparound,
+ * i.e., those that have surpassed the effective failsafe ages.  In that case,
+ * the relfrozen/relminmxid-based score is scaled aggressively so that the
+ * table has a decent chance of sorting to the top of the list.
+ *
+ * To adjust how strongly each component contributes to the score, the
+ * following parameters can be adjusted from their default of 1.0 to anywhere
+ * between 0.0 and 1.0 (inclusive).  Setting all of these to 0.0 restores
+ * pre-v19 prioritization behavior:
+ *
+ *     autovacuum_freeze_score_weight
+ *     autovacuum_multixact_freeze_score_weight
+ *     autovacuum_vacuum_score_weight
+ *     autovacuum_vacuum_insert_score_weight
+ *     autovacuum_analyze_score_weight
  */
 static void
 relation_needs_vacanalyze(Oid relid,
@@ -2981,7 +3075,8 @@ relation_needs_vacanalyze(Oid relid,
  /* output params below */
 						  bool *dovacuum,
 						  bool *doanalyze,
-						  bool *wraparound)
+						  bool *wraparound,
+						  double *score)
 {
 	bool		force_vacuum;
 	bool		av_enabled;
@@ -3010,11 +3105,16 @@ relation_needs_vacanalyze(Oid relid,
 	int			multixact_freeze_max_age;
 	TransactionId xidForceLimit;
 	TransactionId relfrozenxid;
+	MultiXactId relminmxid;
 	MultiXactId multiForceLimit;
 
 	Assert(classForm != NULL);
 	Assert(OidIsValid(relid));
 
+	*score = 0.0;
+	*dovacuum = false;
+	*doanalyze = false;
+
 	/*
 	 * Determine vacuum/analyze equation parameters.  We have two possible
 	 * sources: the passed reloptions (which could be a main table or a toast
@@ -3062,17 +3162,17 @@ relation_needs_vacanalyze(Oid relid,
 
 	av_enabled = (relopts ? relopts->enabled : true);
 
+	relfrozenxid = classForm->relfrozenxid;
+	relminmxid = classForm->relminmxid;
+
 	/* Force vacuum if table is at risk of wraparound */
 	xidForceLimit = recentXid - freeze_max_age;
 	if (xidForceLimit < FirstNormalTransactionId)
 		xidForceLimit -= FirstNormalTransactionId;
-	relfrozenxid = classForm->relfrozenxid;
 	force_vacuum = (TransactionIdIsNormal(relfrozenxid) &&
 					TransactionIdPrecedes(relfrozenxid, xidForceLimit));
 	if (!force_vacuum)
 	{
-		MultiXactId relminmxid = classForm->relminmxid;
-
 		multiForceLimit = recentMulti - multixact_freeze_max_age;
 		if (multiForceLimit < FirstMultiXactId)
 			multiForceLimit -= FirstMultiXactId;
@@ -3081,13 +3181,58 @@ relation_needs_vacanalyze(Oid relid,
 	}
 	*wraparound = force_vacuum;
 
+	/* Update the score. */
+	if (force_vacuum)
+	{
+		uint32		xid_age;
+		uint32		mxid_age;
+		double		xid_score;
+		double		mxid_score;
+		int			effective_xid_failsafe_age;
+		int			effective_mxid_failsafe_age;
+
+		/*
+		 * To calculate the (M)XID age portion of the score, divide the age by
+		 * its respective *_freeze_max_age parameter.
+		 */
+		xid_age = TransactionIdIsNormal(relfrozenxid) ? recentXid - relfrozenxid : 0;
+		mxid_age = MultiXactIdIsValid(relminmxid) ? recentMulti - relminmxid : 0;
+
+		xid_score = (double) xid_age / freeze_max_age;
+		mxid_score = (double) mxid_age / multixact_freeze_max_age;
+
+		/*
+		 * To ensure tables are given increased priority once they begin
+		 * approaching wraparound, we scale the score aggressively if the ages
+		 * surpass vacuum_failsafe_age or vacuum_multixact_failsafe_age.
+		 *
+		 * As in vacuum_xid_failsafe_check(), the effective failsafe age is no
+		 * less than 105% the value of the respective *_freeze_max_age
+		 * parameter.  Note that per-table settings could result in a low
+		 * score even if the table surpasses the failsafe settings.  However,
+		 * this is a strange enough corner case that we don't bother trying to
+		 * handle it.
+		 */
+		effective_xid_failsafe_age = Max(vacuum_failsafe_age,
+										 autovacuum_freeze_max_age * 1.05);
+		effective_mxid_failsafe_age = Max(vacuum_multixact_failsafe_age,
+										  autovacuum_multixact_freeze_max_age * 1.05);
+
+		if (xid_age >= effective_xid_failsafe_age)
+			xid_score = pow(xid_score, Max(1.0, (double) xid_age / 100000000));
+		if (mxid_age >= effective_mxid_failsafe_age)
+			mxid_score = pow(mxid_score, Max(1.0, (double) mxid_age / 100000000));
+
+		xid_score *= autovacuum_freeze_score_weight;
+		mxid_score *= autovacuum_multixact_freeze_score_weight;
+
+		*score = Max(xid_score, mxid_score);
+		*dovacuum = true;
+	}
+
 	/* User disabled it in pg_class.reloptions?  (But ignore if at risk) */
 	if (!av_enabled && !force_vacuum)
-	{
-		*doanalyze = false;
-		*dovacuum = false;
 		return;
-	}
 
 	/*
 	 * If we found stats for the table, and autovacuum is currently enabled,
@@ -3136,34 +3281,58 @@ relation_needs_vacanalyze(Oid relid,
 			vac_ins_scale_factor * reltuples * pcnt_unfrozen;
 		anlthresh = (float4) anl_base_thresh + anl_scale_factor * reltuples;
 
+		/*
+		 * Determine if this table needs vacuum, and update the score if it
+		 * does.
+		 */
+		if (vactuples > vacthresh)
+		{
+			double		vacthresh_score;
+
+			vacthresh_score = (double) vactuples / Max(vacthresh, 1);
+			vacthresh_score *= autovacuum_vacuum_score_weight;
+
+			*score = Max(*score, vacthresh_score);
+			*dovacuum = true;
+		}
+
+		if (vac_ins_base_thresh >= 0 && instuples > vacinsthresh)
+		{
+			double		vacinsthresh_score;
+
+			vacinsthresh_score = (double) instuples / Max(vacinsthresh, 1);
+			vacinsthresh_score *= autovacuum_vacuum_insert_score_weight;
+
+			*score = Max(*score, vacinsthresh_score);
+			*dovacuum = true;
+		}
+
+		/*
+		 * Determine if this table needs analyze, and update the score if it
+		 * does.  Note that we don't analyze TOAST tables and pg_statistic.
+		 */
+		if (anltuples > anlthresh &&
+			relid != StatisticRelationId &&
+			classForm->relkind != RELKIND_TOASTVALUE)
+		{
+			double		anlthresh_score;
+
+			anlthresh_score = (double) anltuples / Max(anlthresh, 1);
+			anlthresh_score *= autovacuum_analyze_score_weight;
+
+			*score = Max(*score, anlthresh_score);
+			*doanalyze = true;
+		}
+
 		if (vac_ins_base_thresh >= 0)
-			elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), ins: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
+			elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), ins: %.0f (threshold %.0f), anl: %.0f (threshold %.0f), score: %.3f",
 				 NameStr(classForm->relname),
-				 vactuples, vacthresh, instuples, vacinsthresh, anltuples, anlthresh);
+				 vactuples, vacthresh, instuples, vacinsthresh, anltuples, anlthresh, *score);
 		else
-			elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), ins: (disabled), anl: %.0f (threshold %.0f)",
+			elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), ins: (disabled), anl: %.0f (threshold %.0f), score: %.3f",
 				 NameStr(classForm->relname),
-				 vactuples, vacthresh, anltuples, anlthresh);
-
-		/* Determine if this table needs vacuum or analyze. */
-		*dovacuum = force_vacuum || (vactuples > vacthresh) ||
-			(vac_ins_base_thresh >= 0 && instuples > vacinsthresh);
-		*doanalyze = (anltuples > anlthresh);
+				 vactuples, vacthresh, anltuples, anlthresh, *score);
 	}
-	else
-	{
-		/*
-		 * Skip a table not found in stat hash, unless we have to force vacuum
-		 * for anti-wrap purposes.  If it's not acted upon, there's no need to
-		 * vacuum it.
-		 */
-		*dovacuum = force_vacuum;
-		*doanalyze = false;
-	}
-
-	/* ANALYZE refuses to work with pg_statistic */
-	if (relid == StatisticRelationId)
-		*doanalyze = false;
 }
 
 /*
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 0c9854ad8fc..b47bc534ee1 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -136,6 +136,14 @@
   max => '100.0',
 },
 
+{ name => 'autovacuum_analyze_score_weight', type => 'real', context => 'PGC_SIGHUP', group => 'VACUUM_AUTOVACUUM',
+  short_desc => 'Scaling factor of analyze score for autovacuum prioritization.',
+  variable => 'autovacuum_analyze_score_weight',
+  boot_val => '1.0',
+  min => '0.0',
+  max => '1.0',
+},
+
 { name => 'autovacuum_analyze_threshold', type => 'int', context => 'PGC_SIGHUP', group => 'VACUUM_AUTOVACUUM',
   short_desc => 'Minimum number of tuple inserts, updates, or deletes prior to analyze.',
   variable => 'autovacuum_anl_thresh',
@@ -154,6 +162,14 @@
   max => '2000000000',
 },
 
+{ name => 'autovacuum_freeze_score_weight', type => 'real', context => 'PGC_SIGHUP', group => 'VACUUM_AUTOVACUUM',
+  short_desc => 'Scaling factor of freeze score for autovacuum prioritization.',
+  variable => 'autovacuum_freeze_score_weight',
+  boot_val => '1.0',
+  min => '0.0',
+  max => '1.0',
+},
+
 { name => 'autovacuum_max_workers', type => 'int', context => 'PGC_SIGHUP', group => 'VACUUM_AUTOVACUUM',
   short_desc => 'Sets the maximum number of simultaneously running autovacuum worker processes.',
   variable => 'autovacuum_max_workers',
@@ -171,6 +187,14 @@
   max => '2000000000',
 },
 
+{ name => 'autovacuum_multixact_freeze_score_weight', type => 'real', context => 'PGC_SIGHUP', group => 'VACUUM_AUTOVACUUM',
+  short_desc => 'Scaling factor of multixact freeze score for autovacuum prioritization.',
+  variable => 'autovacuum_multixact_freeze_score_weight',
+  boot_val => '1.0',
+  min => '0.0',
+  max => '1.0',
+},
+
 { name => 'autovacuum_naptime', type => 'int', context => 'PGC_SIGHUP', group => 'VACUUM_AUTOVACUUM',
   short_desc => 'Time to sleep between autovacuum runs.',
   flags => 'GUC_UNIT_S',
@@ -207,6 +231,14 @@
   max => '100.0',
 },
 
+{ name => 'autovacuum_vacuum_insert_score_weight', type => 'real', context => 'PGC_SIGHUP', group => 'VACUUM_AUTOVACUUM',
+  short_desc => 'Scaling factor of vacuum insert score for autovacuum prioritization.',
+  variable => 'autovacuum_vacuum_insert_score_weight',
+  boot_val => '1.0',
+  min => '0.0',
+  max => '1.0',
+},
+
 { name => 'autovacuum_vacuum_insert_threshold', type => 'int', context => 'PGC_SIGHUP', group => 'VACUUM_AUTOVACUUM',
   short_desc => 'Minimum number of tuple inserts prior to vacuum.',
   long_desc => '-1 disables insert vacuums.',
@@ -233,6 +265,14 @@
   max => '100.0',
 },
 
+{ name => 'autovacuum_vacuum_score_weight', type => 'real', context => 'PGC_SIGHUP', group => 'VACUUM_AUTOVACUUM',
+  short_desc => 'Scaling factor of vacuum score for autovacuum prioritization.',
+  variable => 'autovacuum_vacuum_score_weight',
+  boot_val => '1.0',
+  min => '0.0',
+  max => '1.0',
+},
+
 { name => 'autovacuum_vacuum_threshold', type => 'int', context => 'PGC_SIGHUP', group => 'VACUUM_AUTOVACUUM',
   short_desc => 'Minimum number of tuple updates or deletes prior to vacuum.',
   variable => 'autovacuum_vac_thresh',
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e4abe6c0077..ab9cbebbc3f 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -733,6 +733,11 @@
 #autovacuum_multixact_freeze_max_age = 400000000        # maximum multixact age
                                                         # before forced vacuum
                                                         # (change requires restart)
+#autovacuum_freeze_score_weight = 1.0
+#autovacuum_multixact_freeze_score_weight = 1.0
+#autovacuum_vacuum_score_weight = 1.0
+#autovacuum_vacuum_insert_score_weight = 1.0
+#autovacuum_analyze_score_weight = 1.0
 #autovacuum_vacuum_cost_delay = 2ms     # default vacuum cost delay for
                                         # autovacuum, in milliseconds;
                                         # -1 means use vacuum_cost_delay
diff --git a/src/include/postmaster/autovacuum.h b/src/include/postmaster/autovacuum.h
index 5aa0f3a8ac1..b21d111d4d5 100644
--- a/src/include/postmaster/autovacuum.h
+++ b/src/include/postmaster/autovacuum.h
@@ -43,7 +43,11 @@ extern PGDLLIMPORT int autovacuum_freeze_max_age;
 extern PGDLLIMPORT int autovacuum_multixact_freeze_max_age;
 extern PGDLLIMPORT double autovacuum_vac_cost_delay;
 extern PGDLLIMPORT int autovacuum_vac_cost_limit;
-
+extern PGDLLIMPORT double autovacuum_freeze_score_weight;
+extern PGDLLIMPORT double autovacuum_multixact_freeze_score_weight;
+extern PGDLLIMPORT double autovacuum_vacuum_score_weight;
+extern PGDLLIMPORT double autovacuum_vacuum_insert_score_weight;
+extern PGDLLIMPORT double autovacuum_analyze_score_weight;
 extern PGDLLIMPORT int Log_autovacuum_min_duration;
 extern PGDLLIMPORT int Log_autoanalyze_min_duration;
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 174e2798443..bc4e6444a2d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3088,6 +3088,7 @@ TableScanDesc
 TableScanDescData
 TableSpaceCacheEntry
 TableSpaceOpts
+TableToProcess
 TablespaceList
 TablespaceListCell
 TapeBlockTrailer
-- 
2.51.2



  [application/octet-stream] foo_tgz (608.7K, 3-foo_tgz)

  [image/png] three_way_comparison.png (377.2K, 4-three_way_comparison.png)


* Re: another autovacuum scheduling thread
@ 2026-03-19 11:44  David Rowley <[email protected]>
  parent: Greg Burd <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2026-03-19 11:44 UTC (permalink / raw)
  To: Greg Burd <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, 19 Mar 2026 at 22:57, Greg Burd <[email protected]> wrote:
> I'm late in the review process. I know David Rowley proposed the unified scoring approach that became the foundation of this patch, and I think that's a great direction. However, I'm concerned that the patch's default scoring weights don't give XID-age urgency sufficient priority over dead-tuple urgency. The weight GUCs (autovacuum_vacuum_score_weight, etc.) can address this, but they max at 1.0, meaning you can only reduce dead-tuple priority, not increase XID priority.

I think that it would be good if you could state *why* you disagree
with the proposed scoring rather than *that* you disagree. All this
stuff was talked about around [1]. For me, I don't see what's
particularly alarming about a table reaching
autovacuum_freeze_max_age. By default, that GUC is set to less than 10%
of the transaction ID age by which the table must be frozen. Why is
it you think these should take priority over everything else? SLRU
buffers have been configurable since v17, so having to look up the clog
for a wider range of xids isn't as big an issue as it used to be, plus
memory and L3 sizes are bigger than they used to be. Are slow clog
lookups what you're concerned about? You didn't really say.

Having said that, I'd not realised that Nathan capped the new GUCs at
1.0. I think we should allow those to be set higher, likely at least
to 10.0. Maybe we could consider adjusting the code that's setting the
xid_score/mxid_score so that we start scaling the score aggressively
once (xid_age >= effective_xid_failsafe_age /
Max(autovacuum_freeze_score_weight, 1.0)) becomes true. Then, if people
want to play it safer, they can set
autovacuum_freeze_score_weight = 2.0 and have the aggressive scaling
kick in at 800 million, or whatever half of effective_xid_failsafe_age
is set to. You could set yours to 8.0, if you really want tables over
autovacuum_freeze_max_age to take priority over everything else. I
just don't see or understand the reason why you'd want to.
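
To make the arithmetic above concrete, here is a minimal standalone
sketch (not code from the patch). The helper name scaling_start_age is
made up for illustration, and 1.6 billion is assumed as the effective
failsafe age, matching the default vacuum_failsafe_age.

```c
#include <assert.h>

/*
 * Hypothetical helper, not part of the patch: returns the XID age at
 * which the aggressive scaling would begin if the effective failsafe
 * age were divided by Max(weight, 1.0), as suggested above.
 */
double
scaling_start_age(double effective_failsafe_age, double freeze_score_weight)
{
	/* Weights below 1.0 never push the threshold past the failsafe age. */
	double		divisor = freeze_score_weight > 1.0 ? freeze_score_weight : 1.0;

	assert(effective_failsafe_age > 0.0);

	return effective_failsafe_age / divisor;
}
```

Under these assumptions, a weight of 2.0 gives 800 million and a weight
of 8.0 gives 200 million, while any weight at or below 1.0 leaves the
threshold at the failsafe age itself.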

It's a fairly common misconception that a wraparound vacuum is
something to be alarmed about. Maybe you've fallen for that? I recall
a few proposals to adjust the wording that's shown in pg_stat_activity
to make them seem less alarming.

David

[1] https://www.postgresql.org/message-id/CAApHDvqobtKMwJbhKB_c%3D3-TM%3DTgS3bcuvzcWMm3ee1c0mz9hw%40mail...






* Re: another autovacuum scheduling thread
@ 2026-03-19 15:49  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Nathan Bossart @ 2026-03-19 15:49 UTC (permalink / raw)
  To: Greg Burd <[email protected]>; +Cc: David Rowley <[email protected]>; Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, Mar 19, 2026 at 09:49:34AM -0400, Greg Burd wrote:
> My concern isn't that wraparound vacuums are inherently alarming, I agree
> with you that reaching freeze_max_age isn't a crisis. The issue is a
> scoring-scale problem in the gap between freeze_max_age (200M) and
> failsafe age (1.6B).
> 
> In that 1.4B XID window, force_vacuum tables have XID scores of 1.0–8.0
> (age/freeze_max_age), while typical active tables accumulate dead-tuple
> scores of 18–70+ within hours of their last vacuum. The exponential boost
> doesn't activate until failsafe age, so force_vacuum tables are
> systematically outranked by routine bloat cleanup for what could be days
> or weeks in production.

I think "systematically outranked" makes the problem sound worse than it
is.  Once the freeze age is reached, the table is going to get added to the
list no matter what, it just might be sorted lower.

>> Having said that, I'd not realised that Nathan capped the new GUCs at
>> 1.0. I think we should allow those to be set higher, likely at least
>> to 10.0.
> 
> That would definitely help. If autovacuum_freeze_score_weight could be
> set to 8.0–10.0, DBAs could manually restore the priority we want.

Done in the attached.
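
As a rough illustration of what a higher cap allows (a standalone
sketch, not code from the attached patch): the helper names below are
hypothetical and the inputs are made-up example values. It only models
the plain ratio-times-weight part of the score, before any failsafe
scaling, and it does not model the "is the threshold exceeded" check.

```c
#include <assert.h>

/*
 * Hypothetical helper: XID-age component of the score, i.e. the table's
 * age divided by the freeze max age, scaled by the weight.
 */
double
xid_component(double xid_age, double freeze_max_age, double weight)
{
	assert(freeze_max_age > 0.0);

	return (xid_age / freeze_max_age) * weight;
}

/*
 * Hypothetical helper: dead-tuple component of the score, i.e. the
 * number of updated/deleted tuples divided by the vacuum threshold
 * (clamped to at least 1), scaled by the weight.
 */
double
vacuum_component(double vactuples, double vacthresh, double weight)
{
	double		denom = vacthresh > 1.0 ? vacthresh : 1.0;

	return (vactuples / denom) * weight;
}
```

Assuming the default autovacuum_freeze_max_age of 200 million, a table
at an XID age of 400 million scores 2.0 with a weight of 1.0 and 20.0
with a weight of 10.0.  The latter would sort ahead of a table whose
dead-tuple score is 18.0 (for example, 1800 dead tuples against a
threshold of 100).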

>> Maybe we could consider adjusting the code that's setting the
>> xid_score/mxid_score so that we start scaling the score aggressively
>> when if (xid_age >= effective_xid_failsafe_age /
>> Max(autovacuum_freeze_score_weight,1.0)) becomes true
> 
> This is clever, it would make the aggressive scaling kick in earlier when
> the weight is higher. At weight=8.0, you'd get exponential boost starting
> at 200M (failsafe/8) instead of 1.6B.

Seems reasonable.  I've added this, too.  Something else we might want to
consider is scaling the score once the freeze age is reached, just much
less aggressively than we do at the failsafe age.  It probably doesn't make
sense to start scaling too much at 200M, but at 1.5B, yeah, we should
probably process the table sooner than later.

-- 
nathan



* Re: another autovacuum scheduling thread
@ 2026-03-19 17:36  Greg Burd <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 0 replies; 143+ messages in thread

From: Greg Burd @ 2026-03-19 17:36 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Sami Imseih <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers


On Thu, Mar 19, 2026, at 11:49 AM, Nathan Bossart wrote:
> On Thu, Mar 19, 2026 at 09:49:34AM -0400, Greg Burd wrote:
>> My concern isn't that wraparound vacuums are inherently alarming, I agree
>> with you that reaching freeze_max_age isn't a crisis. The issue is a
>> scoring-scale problem in the gap between freeze_max_age (200M) and
>> failsafe age (1.6B).
>> 
>> In that 1.4B XID window, force_vacuum tables have XID scores of 1.0–8.0
>> (age/freeze_max_age), while typical active tables accumulate dead-tuple
>> scores of 18–70+ within hours of their last vacuum. The exponential boost
>> doesn't activate until failsafe age, so force_vacuum tables are
>> systematically outranked by routine bloat cleanup for what could be days
>> or weeks in production.
>
> I think "systematically outranked" makes the problem sound worse than it
> is.  Once the freeze age is reached, the table is going to get added to the
> list no matter what, it just might be sorted lower.

Yeah, that was a bit of hyperbole on my part. :)

>>> Having said that, I'd not realised that Nathan capped the new GUCs at
>>> 1.0. I think we should allow those to be set higher, likely at least
>>> to 10.0.
>> 
>> That would definitely help. If autovacuum_freeze_score_weight could be
>> set to 8.0–10.0, DBAs could manually restore the priority we want.
>
> Done in the attached.

+1

>>> Maybe we could consider adjusting the code that's setting the
>>> xid_score/mxid_score so that we start scaling the score aggressively
>>> when if (xid_age >= effective_xid_failsafe_age /
>>> Max(autovacuum_freeze_score_weight,1.0)) becomes true
>> 
>> This is clever, it would make the aggressive scaling kick in earlier when
>> the weight is higher. At weight=8.0, you'd get exponential boost starting
>> at 200M (failsafe/8) instead of 1.6B.
>
> Seems reasonable.  I've added this, too.

+1

> Something else we might want to
> consider is scaling the score once the freeze age is reached, just much
> less aggressively than we do at the failsafe age.  It probably doesn't make
> sense to start scaling too much at 200M, but at 1.5B, yeah, we should
> probably process the table sooner than later.

So a scaling factor relative to some point like 200M?  Maybe... but for now I think what you have in v13 is about right and a solid improvement over what's there now.

> -- 
> nathan
>
> Attachments:
> * v13-0001-autovacuum-scheduling-improvements.patch

LGTM!

best.

-greg






* Re: another autovacuum scheduling thread
@ 2026-03-20 04:22  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2026-03-20 04:22 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Greg Burd <[email protected]>; David Rowley <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> Done in the attached.

Thanks!

I have a few comments.

#1.
+ * between 0.0 and 1.0 (inclusive).  Setting all of these to 0.0 restores

This should be "0.0 and 10.0"

#2.
typo:

+                * being scaling aggressively.

This should be "begin"

#3.
This reads more like documentation than a code comment.

+ * To adjust how strongly each component contributes to the score, the
+ * following parameters can be adjusted from their default of 1.0 to anywhere
+ * between 0.0 and 1.0 (inclusive).  Setting all of these to 0.0 restores
+ * pre-v19 prioritization behavior:
+ *
+ *     autovacuum_freeze_score_weight
+ *     autovacuum_multixact_freeze_score_weight
+ *     autovacuum_vacuum_score_weight
+ *     autovacuum_vacuum_insert_score_weight
+ *     autovacuum_analyze_score_weight

I don't actually think this section adds any value at all to the comments.

#4.
+                       elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), ins: (disabled), anl: %.0f (threshold %.0f), score %.3f",

A missing colon after "score", unlike the other occurrence which has it.

#5.
+               if (autovacuum_freeze_score_weight > 1.0)
+                       effective_xid_failsafe_age /= autovacuum_freeze_score_weight;
+               if (autovacuum_multixact_freeze_score_weight > 1.0)
+                       effective_mxid_failsafe_age /= autovacuum_multixact_freeze_score_weight;
+

Shouldn't this be "if (autovacuum_freeze_score_weight > 0.0)" ?
A weight > 0 should always adjust the threshold, right? we should only
prevent division by 0 here.

--
Sami Imseih
Amazon Web Services (AWS)






* Re: another autovacuum scheduling thread
@ 2026-03-20 04:40  David Rowley <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2026-03-20 04:40 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Fri, 20 Mar 2026 at 17:22, Sami Imseih <[email protected]> wrote:
> #5.
> +               if (autovacuum_freeze_score_weight > 1.0)
> +                       effective_xid_failsafe_age /= autovacuum_freeze_score_weight;
> +               if (autovacuum_multixact_freeze_score_weight > 1.0)
> +                       effective_mxid_failsafe_age /= autovacuum_multixact_freeze_score_weight;
> +
>
> Shouldn't this be "if (autovacuum_freeze_score_weight > 0.0)" ?
> A weight > 0 should always adjust the threshold, right? we should only
> prevent division by 0 here.

We really do want to ensure that tables are scaling very aggressively
when they reach the failsafe age. We don't want any quirky user
settings changing that. Prioritising anything else over a table at
failsafe age would be a very bad thing. The only point in doing what I
suggested was to allow users to give themselves more margin to get the
freezing done before failsafe age, certainly not less margin. In any
case, effective_xid_failsafe_age and effective_mxid_failsafe_age are
signed ints and default to 1.6 billion. There's just not enough
bit-space to divide them by any number much below 1.0 before they'll
wrap.
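
To illustrate the point, here is a small standalone sketch.  The function
names are hypothetical and this is not the patch's code; it only mirrors
the "> 1.0" guard from the quoted hunk and checks, in double arithmetic,
whether a division would still fit in a signed int.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the guard in the quoted hunk: only
 * weights above 1.0 lower the age at which aggressive scaling begins.
 */
int
effective_failsafe_age(int failsafe_age, double weight)
{
    assert(failsafe_age >= 0);
    if (weight > 1.0)
        failsafe_age = (int) (failsafe_age / weight);   /* never exceeds the input */
    return failsafe_age;
}

/*
 * Hypothetical check: would failsafe_age / weight exceed INT_MAX?  This
 * is evaluated as a double, before any conversion back to int.
 */
bool
division_would_overflow(int failsafe_age, double weight)
{
    assert(weight > 0.0);
    return (double) failsafe_age / weight > (double) INT_MAX;
}
```

With the 1.6 billion default, a weight of 8.0 moves the threshold to 200M,
while a weight of 0.5 would need 3.2 billion, which does not fit in a
signed int.  The cutoff is at a weight of roughly 0.745.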

David






* Re: another autovacuum scheduling thread
@ 2026-03-20 05:03  Sami Imseih <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2026-03-20 05:03 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

>
> case, effective_xid_failsafe_age and effective_mxid_failsafe_age are
> signed ints and default to 1.6 billion. There's just not enough
> bit-space to divide them by any number much below 1.0 before they'll
> wrap.
>

Of course. It would not be safe to divide by < 1.

Thanks!

--
Sami



* Re: another autovacuum scheduling thread
@ 2026-03-20 20:31  Nathan Bossart <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Nathan Bossart @ 2026-03-20 20:31 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: David Rowley <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

Thanks for the feedback.  Here is an updated patch.

I kept the comment about the weight parameters in autovacuum.c.  Since
there's a bunch of code related to them, IMHO we should have an explanatory
note somewhere.

-- 
nathan



* Re: another autovacuum scheduling thread
@ 2026-03-21 14:55  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 0 replies; 143+ messages in thread

From: Sami Imseih @ 2026-03-21 14:55 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> Thanks for the feedback.  Here is an updated patch.

LGTM.

--
Sami Imseih
Amazon Web Services (AWS)






* Re: another autovacuum scheduling thread
@ 2026-03-22 00:08  Bharath Rupireddy <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Bharath Rupireddy @ 2026-03-22 00:08 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Sami Imseih <[email protected]>; David Rowley <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Fri, Mar 20, 2026 at 1:31 PM Nathan Bossart <[email protected]> wrote:
>
> Thanks for the feedback.  Here is an updated patch.
>
> I kept the comment about the weight parameters in autovacuum.c.  Since
> there's a bunch of code related to them, IMHO we should have an explanatory
> note somewhere.

Hi Nathan, Thank you for working on this feature. I'm late to this
thread. I read the patch, and here are some general thoughts. I
haven't read the full thread, so some of these thoughts may be
repetitive - thanks for bearing with me.

1/ The Autovacuum Prioritization section in the docs is a good start
for explaining the usage aspects of the new scoring system. However,
IMHO, adding a couple of production-like scenarios showing how these
scores need to be adjusted and used would be a good addition to this
section.

2/ Any plans to extend the new scoring system to the table level
(i.e., reloptions)? I think it would help in situations where there
are huge tables that need to be prioritized for vacuum over others, so
setting the scoring high for those tables would allow the next
autovacuum to pick them up first.

3/ Any plans to add tests to demo how each of these parameters could
help in various situations - when the system needs freezing to be
prioritized for avoiding XID wraparound over cleaning up dead tuples,
and when it needs analyze to be prioritized to get correct plans,
etc.? If needed, we could add elog(DEBUGX) messages to emit the reason
or effectiveness of these new scores so that we can verify them in
tests.

4/ Is adding a reason (such as how each of these scores influenced the
autovacuum to pick this table) to vacuum progress reporting a good
idea? This helps answer some of the why and how questions when the
autovacuum is in progress.

-- 
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com






* Re: another autovacuum scheduling thread
@ 2026-03-23 15:22  Nathan Bossart <[email protected]>
  parent: Bharath Rupireddy <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2026-03-23 15:22 UTC (permalink / raw)
  To: Bharath Rupireddy <[email protected]>; +Cc: Sami Imseih <[email protected]>; David Rowley <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Sat, Mar 21, 2026 at 05:08:56PM -0700, Bharath Rupireddy wrote:
> Hi Nathan, Thank you for working on this feature. I'm late to this
> thread. I read the patch, and here are some general thoughts. I
> haven't read the full thread, so some of these thoughts may be
> repetitive - thanks for bearing with me.

Thanks for looking.

> 1/ The Autovacuum Prioritization section in the docs is a good start
> for explaining the usage aspects of the new scoring system. However,
> IMHO, adding a couple of production-like scenarios showing how these
> scores need to be adjusted and used would be a good addition to this
> section.

IMO these scores generally shouldn't need adjusting outside of cases where
the DBA wants to revert to the previous scheduling strategy or there's some
other unique scenario where the prioritization isn't working for them.  I
think it'd be hard to give any general guidance, at least at this point
since we have no field experience yet.

> 2/ Any plans to extend the new scoring system to the table level
> (i.e., reloptions)? I think it would help in situations where there
> are huge tables that need to be prioritized for vacuum over others, so
> setting the scoring high for those tables would allow the next
> autovacuum to pick them up first.

I thought about that, but decided to leave those out because 1) the scoring
parameters are meant to be global and affect workers' prioritization of all
tables and 2) there are already a number of parameters that can be adjusted
to affect the score.  If users find that they really need reloptions for
some reason, we can always add them later.  Removing them seems more
difficult.

> 3/ Any plans to add tests to demo how each of these parameters could
> help in various situations - when the system needs freezing to be
> prioritized for avoiding XID wraparound over cleaning up dead tuples,
> and when it needs analyze to be prioritized to get correct plans,
> etc.? If needed, we could add elog(DEBUGX) messages to emit the reason
> or effectiveness of these new scores so that we can verify them in
> tests.

I hadn't planned on doing any additional tests/demonstrations here.  And
I'd really like to avoid burying all this information in DEBUG log
messages.  IMO we ought to eventually be exposing this stuff in system
views and the like, as has been discussed elsewhere.

> 4/ Is adding a reason (such as how each of these scores influenced the
> autovacuum to pick this table) to vacuum progress reporting a good
> idea? This helps answer some of the why and how questions when the
> autovacuum is in progress.

Yeah, adding that in addition to a system view, etc. could be nice.  I'm a
little hesitant to start making big additions to the patch at this point,
but I can give it a whirl if folks think something like this should be
added for v19.

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2026-03-23 19:01  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Sami Imseih @ 2026-03-23 19:01 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Bharath Rupireddy <[email protected]>; David Rowley <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> > 4/ Is adding a reason (such as how each of these scores influenced the
> > autovacuum to pick this table) to vacuum progress reporting a good
> > idea? This helps answer some of the why and how questions when the
> > autovacuum is in progress.
>
> Yeah, adding that in addition to a system view, etc. could be nice.  I'm a
> little hesitant to start making big additions to the patch at this point,
> but I can give it a whirl if folks think something like this should be
> added for v19.

Adding a system view will be nice. I am attaching a version I used in
earlier testing (cleaned up with docs), if we are inclined to get this in.
I think it will be useful.

This follows the same setup as do_autovacuum().  The SQL-callable function
snapshots recentXid/recentMulti, scans pg_class, filters relation kinds and
temp tables, and computes effective_multixact_freeze_max_age, while a
wrapper, compute_autovac_score(), handles the per-relation setup (fetching
reloptions and the pgstat entry) before calling
relation_needs_vacanalyze().  The function holds only an AccessShareLock on
pg_class for the duration of the scan, so this should be relatively
lightweight.

```
test=# select * from pg_stat_autovacuum_priority order by score desc ;
 relid |     schemaname     |           relname           | dovacuum | doanalyze | wraparound |         score
-------+--------------------+-----------------------------+----------+-----------+------------+-----------------------
 16400 | public             | pgbench_accounts            | t        | f         | t          | 1.055318563196673e+16
 16404 | public             | pgbench_branches            | t        | t         | t          |    442.01666666666665
 16396 | public             | pgbench_tellers             | t        | t         | t          |    172.97333333333333
 16393 | public             | pgbench_history             | t        | t         | t          |     4.703261221642761
 14227 | pg_toast           | pg_toast_14224              | t        | f         | t          |            2.08555407
```
Note that in the test above, I used xid_wraparound to produce a score that
includes the failsafe POW() adjustment.  As discussed earlier [1], this
yields a very high score.  v14 documents this as "scaled aggressively so
that the table has a decent chance of sorting to the top of the list."

Maybe the doc should say something like "scaled aggressively, which can
produce very large values, to ensure the table sorts to the top of the
list."
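
To show why a very large value is enough to put a table first, here is a
standalone sketch of a descending sort by score.  It assumes relations are
processed from highest score to lowest, as the view's documentation
describes.  The struct and function names are made up for illustration and
are not from the patch.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for one row of the view output above. */
typedef struct ScoredRel
{
    unsigned int relid;
    double       score;
} ScoredRel;

/* qsort() comparator: higher scores sort first. */
int
score_desc_cmp(const void *a, const void *b)
{
    double sa = ((const ScoredRel *) a)->score;
    double sb = ((const ScoredRel *) b)->score;

    /* negative when a has the larger score, so a is placed before b */
    return (sa < sb) - (sa > sb);
}

/* Sort rels[] in place so the most urgent relation comes first. */
void
sort_by_score(ScoredRel *rels, size_t nrels)
{
    assert(rels != NULL);
    qsort(rels, nrels, sizeof(ScoredRel), score_desc_cmp);
}
```

Fed the five scores from the output above in any order, this puts relid
16400 (score around 1e16) first and the TOAST table 14227 last.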


[1] [https://www.postgresql.org/message-id/CAA5RZ0vfhAnFBp4HrBQc%2BALaJMx6vCvMtnBi39ST_4nH9PZEjA%40mail.g...]

--
Sami Imseih
Amazon Web Services (AWS)


Attachments:

  [application/octet-stream] v1-0001-Add-pg_stat_autovacuum_priority-view.patch (15.3K, 2-v1-0001-Add-pg_stat_autovacuum_priority-view.patch)
  download | inline diff:
From 2e08f022fef177c586838d1adb3f14202ea48578 Mon Sep 17 00:00:00 2001
From: Sami Imseih <[email protected]>
Date: Mon, 23 Mar 2026 17:03:59 +0000
Subject: [PATCH v1 1/1] Add pg_stat_autovacuum_priority view

Add a new system view that exposes the autovacuum
priority score for each relation in the current
database.  This allows users to inspect each table's
autovacuum eligibility and priority.

The columns returned are: relid, schemaname, relname,
needs_vacuum, needs_analyze, wraparound, and score.

The view results are based on the output of
relation_needs_vacanalyze(), in which the same setup
as do_autovacuum() is performed before calling
relation_needs_vacanalyze().  pg_class is scanned with
an AccessShareLock, so it is relatively lightweight.

Unlike do_autovacuum(), we don't need to derive
pg_toast relationships to the relation in advance, and
we just treat TOAST tables as another relation coming
in from pg_class.
---
 doc/src/sgml/maintenance.sgml        |   6 ++
 doc/src/sgml/monitoring.sgml         | 108 ++++++++++++++++++++++
 src/backend/catalog/system_views.sql |  13 +++
 src/backend/postmaster/autovacuum.c  | 132 +++++++++++++++++++++++++--
 src/include/catalog/catversion.h     |   2 +-
 src/include/catalog/pg_proc.dat      |   9 ++
 src/test/regress/expected/rules.out  |  10 ++
 7 files changed, 270 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/maintenance.sgml b/doc/src/sgml/maintenance.sgml
index b5a191c130b..1a262fa1244 100644
--- a/doc/src/sgml/maintenance.sgml
+++ b/doc/src/sgml/maintenance.sgml
@@ -1153,6 +1153,12 @@ analyze threshold = analyze base threshold + analyze scale factor * number of tu
      listed in the <literal>pg_class</literal> system catalog), set all of the
      aforementioned "weight" parameters to <literal>0.0</literal>.
     </para>
+
+    <para>
+     The <link linkend="monitoring-pg-stat-autovacuum-priority-view">
+     <structname>pg_stat_autovacuum_priority</structname></link> view can be
+     used to inspect each table's autovacuum eligibility and priority score.
+    </para>
    </sect3>
   </sect2>
  </sect1>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 462019a972c..901dd704804 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -463,6 +463,15 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_autovacuum_priority</structname><indexterm><primary>pg_stat_autovacuum_priority</primary></indexterm></entry>
+      <entry>One row per relation in the current database, showing
+       a table's autovacuum eligibility and priority. See
+       <link linkend="monitoring-pg-stat-autovacuum-priority-view">
+       <structname>pg_stat_autovacuum_priority</structname></link> for details.
+      </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_bgwriter</structname><indexterm><primary>pg_stat_bgwriter</primary></indexterm></entry>
       <entry>One row only, showing statistics about the
@@ -2847,6 +2856,105 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   </para>
  </sect2>
 
+ <sect2 id="monitoring-pg-stat-autovacuum-priority-view">
+  <title><structname>pg_stat_autovacuum_priority</structname></title>
+
+  <indexterm>
+   <primary>pg_stat_autovacuum_priority</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_stat_autovacuum_priority</structname> view contains
+   one row per relation in the current database, showing a table's
+   autovacuum eligibility and priority.
+  </para>
+
+  <table id="pg-stat-autovacuum-priority-view" xreflabel="pg_stat_autovacuum_priority">
+   <title><structname>pg_stat_autovacuum_priority</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       Column Type
+      </para>
+      <para>
+       Description
+      </para></entry>
+     </row>
+    </thead>
+
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>relid</structfield> <type>oid</type>
+      </para>
+      <para>
+       OID of the relation
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>schemaname</structfield> <type>name</type>
+      </para>
+      <para>
+       Name of the schema that this table is in
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>relname</structfield> <type>name</type>
+      </para>
+      <para>
+       Name of the relation
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>needs_vacuum</structfield> <type>boolean</type>
+      </para>
+      <para>
+       True if autovacuum considers this relation in need of vacuuming
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>needs_analyze</structfield> <type>boolean</type>
+      </para>
+      <para>
+       True if autovacuum considers this relation in need of analyzing
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wraparound</structfield> <type>boolean</type>
+      </para>
+      <para>
+       True if vacuuming is needed to prevent transaction ID or
+       multixact ID wraparound
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>score</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Priority score used by autovacuum to order which relations to
+       process first. Higher values indicate greater urgency. Zero if
+       the relation does not currently need vacuuming or analyzing.
+      </para></entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+ </sect2>
+
  <sect2 id="monitoring-pg-stat-io-view">
   <title><structname>pg_stat_io</structname></title>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f1ed7b58f13..f6ec7653204 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -795,6 +795,19 @@ CREATE VIEW pg_stat_xact_user_tables AS
     WHERE schemaname NOT IN ('pg_catalog', 'information_schema') AND
           schemaname !~ '^pg_toast';
 
+CREATE VIEW pg_stat_autovacuum_priority AS
+    SELECT
+            S.relid,
+            N.nspname AS schemaname,
+            C.relname AS relname,
+            S.needs_vacuum,
+            S.needs_analyze,
+            S.wraparound,
+            S.score
+    FROM pg_stat_get_autovacuum_priority() S
+         JOIN pg_class C ON C.oid = S.relid
+         LEFT JOIN pg_namespace N ON N.oid = C.relnamespace;
+
 CREATE VIEW pg_statio_all_tables AS
     SELECT
             C.oid AS relid,
diff --git a/src/backend/postmaster/autovacuum.c b/src/backend/postmaster/autovacuum.c
index b5c153a8835..8a73f167653 100644
--- a/src/backend/postmaster/autovacuum.c
+++ b/src/backend/postmaster/autovacuum.c
@@ -80,6 +80,7 @@
 #include "catalog/pg_namespace.h"
 #include "commands/vacuum.h"
 #include "common/int.h"
+#include "funcapi.h"
 #include "lib/ilist.h"
 #include "libpq/pqsignal.h"
 #include "miscadmin.h"
@@ -111,6 +112,7 @@
 #include "utils/syscache.h"
 #include "utils/timeout.h"
 #include "utils/timestamp.h"
+#include "utils/tuplestore.h"
 #include "utils/wait_event.h"
 
 
@@ -372,6 +374,10 @@ static void perform_work_item(AutoVacuumWorkItem *workitem);
 static void autovac_report_activity(autovac_table *tab);
 static void autovac_report_workitem(AutoVacuumWorkItem *workitem,
 									const char *nspname, const char *relname);
+static void compute_autovac_score(HeapTuple tuple, TupleDesc pg_class_desc,
+								  int effective_multixact_freeze_max_age,
+								  bool *dovacuum, bool *doanalyze,
+								  bool *wraparound, double *score);
 static void avl_sigusr2_handler(SIGNAL_ARGS);
 static bool av_worker_available(void);
 static void check_av_worker_gucs(void);
@@ -2057,6 +2063,13 @@ do_autovacuum(void)
 								  &dovacuum, &doanalyze, &wraparound,
 								  &score);
 
+		elog(DEBUG3, "%s: dovacuum: %s, doanalyze: %s, wraparound: %s, score: %.3f",
+			 NameStr(classForm->relname),
+			 dovacuum ? "yes" : "no",
+			 doanalyze ? "yes" : "no",
+			 wraparound ? "yes" : "no",
+			 score);
+
 		/* Relations that need work are added to tables_to_process */
 		if (dovacuum || doanalyze)
 		{
@@ -2157,6 +2170,12 @@ do_autovacuum(void)
 								  &dovacuum, &doanalyze, &wraparound,
 								  &score);
 
+		elog(DEBUG3, "%s: dovacuum: %s, wraparound: %s, score: %.3f",
+			 NameStr(classForm->relname),
+			 dovacuum ? "yes" : "no",
+			 wraparound ? "yes" : "no",
+			 score);
+
 		/* ignore analyze for toast tables */
 		if (dovacuum)
 		{
@@ -3312,15 +3331,6 @@ relation_needs_vacanalyze(Oid relid,
 			*score = Max(*score, anlthresh_score);
 			*doanalyze = true;
 		}
-
-		if (vac_ins_base_thresh >= 0)
-			elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), ins: %.0f (threshold %.0f), anl: %.0f (threshold %.0f), score: %.3f",
-				 NameStr(classForm->relname),
-				 vactuples, vacthresh, instuples, vacinsthresh, anltuples, anlthresh, *score);
-		else
-			elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), ins: (disabled), anl: %.0f (threshold %.0f), score: %.3f",
-				 NameStr(classForm->relname),
-				 vactuples, vacthresh, anltuples, anlthresh, *score);
 	}
 }
 
@@ -3635,3 +3645,107 @@ check_av_worker_gucs(void)
 				 errdetail("The server will only start up to \"autovacuum_worker_slots\" (%d) autovacuum workers at a given time.",
 						   autovacuum_worker_slots)));
 }
+
+/*
+ * compute_autovac_score
+ *		Wrapper around relation_needs_vacanalyze() that handles the
+ *		per-relation setup similar to do_autovacuum() before calling
+ *		relation_needs_vacanalyze().
+ */
+static void
+compute_autovac_score(HeapTuple tuple, TupleDesc pg_class_desc,
+					  int effective_multixact_freeze_max_age,
+					  bool *dovacuum, bool *doanalyze,
+					  bool *wraparound, double *score)
+{
+	Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
+	AutoVacOpts *relopts;
+	PgStat_StatTabEntry *tabentry;
+
+	relopts = extract_autovac_opts(tuple, pg_class_desc);
+
+	tabentry = pgstat_fetch_stat_tabentry_ext(classForm->relisshared,
+											  classForm->oid);
+
+	relation_needs_vacanalyze(classForm->oid, relopts, classForm, tabentry,
+							  effective_multixact_freeze_max_age,
+							  dovacuum, doanalyze, wraparound, score);
+
+	if (relopts)
+		pfree(relopts);
+	if (tabentry)
+		pfree(tabentry);
+}
+
+/*
+ * pg_stat_get_autovacuum_priority
+ *		Returns the autovacuum priority score for each relation in the
+ *		current database.
+ *
+ *		This follows the same setup as do_autovacuum(): snapshotting
+ *		recentXid/recentMulti, scanning pg_class, filtering relation kinds
+ *		and temp tables, and computing effective_multixact_freeze_max_age
+ *		are done here, while compute_autovac_score() handles the per-relation
+ *		setup (fetching reloptions and the pgstat entry).
+ */
+#define NUM_AV_SCORE_COLS 5
+
+Datum
+pg_stat_get_autovacuum_priority(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	Relation	classRel;
+	TableScanDesc relScan;
+	HeapTuple	tuple;
+	TupleDesc	pg_class_desc;
+	int			effective_multixact_freeze_max_age;
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	effective_multixact_freeze_max_age = MultiXactMemberFreezeThreshold();
+
+	/* Snapshot once before the scan, like do_autovacuum()'s caller. */
+	recentXid = ReadNextTransactionId();
+	recentMulti = ReadNextMultiXactId();
+
+	classRel = table_open(RelationRelationId, AccessShareLock);
+	pg_class_desc = CreateTupleDescCopy(RelationGetDescr(classRel));
+
+	relScan = table_beginscan_catalog(classRel, 0, NULL);
+	while ((tuple = heap_getnext(relScan, ForwardScanDirection)) != NULL)
+	{
+		Form_pg_class classForm = (Form_pg_class) GETSTRUCT(tuple);
+		bool		dovacuum;
+		bool		doanalyze;
+		bool		wraparound;
+		double		score = 0.0;
+		Datum		values[NUM_AV_SCORE_COLS];
+		bool		nulls[NUM_AV_SCORE_COLS] = {false};
+
+		if (classForm->relkind != RELKIND_RELATION &&
+			classForm->relkind != RELKIND_MATVIEW &&
+			classForm->relkind != RELKIND_TOASTVALUE)
+			continue;
+
+		if (classForm->relpersistence == RELPERSISTENCE_TEMP)
+			continue;
+
+		compute_autovac_score(tuple, pg_class_desc,
+							  effective_multixact_freeze_max_age,
+							  &dovacuum, &doanalyze, &wraparound, &score);
+
+		values[0] = ObjectIdGetDatum(classForm->oid);
+		values[1] = BoolGetDatum(dovacuum);
+		values[2] = BoolGetDatum(doanalyze);
+		values[3] = BoolGetDatum(wraparound);
+		values[4] = Float8GetDatum(score);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+	}
+	table_endscan(relScan);
+
+	table_close(classRel, AccessShareLock);
+
+	return (Datum) 0;
+}
diff --git a/src/include/catalog/catversion.h b/src/include/catalog/catversion.h
index 420850293f8..bce64758823 100644
--- a/src/include/catalog/catversion.h
+++ b/src/include/catalog/catversion.h
@@ -57,6 +57,6 @@
  */
 
 /*							yyyymmddN */
-#define CATALOG_VERSION_NO	202603201
+#define CATALOG_VERSION_NO	202603231
 
 #endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 84e7adde0e5..e52420898ce 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5667,6 +5667,15 @@
   proname => 'pg_stat_get_total_autoanalyze_time', provolatile => 's',
   proparallel => 'r', prorettype => 'float8', proargtypes => 'oid',
   prosrc => 'pg_stat_get_total_autoanalyze_time' },
+{ oid => '8409',
+  descr => 'statistics: autovacuum priority scores for all relations',
+  proname => 'pg_stat_get_autovacuum_priority', prorows => '100',
+  proretset => 't', provolatile => 'v', proparallel => 'r',
+  prorettype => 'record', proargtypes => '',
+  proallargtypes => '{oid,bool,bool,bool,float8}',
+  proargmodes => '{o,o,o,o,o}',
+  proargnames => '{relid,needs_vacuum,needs_analyze,wraparound,score}',
+  prosrc => 'pg_stat_get_autovacuum_priority' },
 { oid => '1936', descr => 'statistics: currently active backend IDs',
   proname => 'pg_stat_get_backend_idset', prorows => '100', proretset => 't',
   provolatile => 's', proparallel => 'r', prorettype => 'int4',
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 32bea58db2c..257f21be004 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1860,6 +1860,16 @@ pg_stat_archiver| SELECT archived_count,
     last_failed_time,
     stats_reset
    FROM pg_stat_get_archiver() s(archived_count, last_archived_wal, last_archived_time, failed_count, last_failed_wal, last_failed_time, stats_reset);
+pg_stat_autovacuum_priority| SELECT s.relid,
+    n.nspname AS schemaname,
+    c.relname,
+    s.needs_vacuum,
+    s.needs_analyze,
+    s.wraparound,
+    s.score
+   FROM ((pg_stat_get_autovacuum_priority() s(relid, needs_vacuum, needs_analyze, wraparound, score)
+     JOIN pg_class c ON ((c.oid = s.relid)))
+     LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)));
 pg_stat_bgwriter| SELECT pg_stat_get_bgwriter_buf_written_clean() AS buffers_clean,
     pg_stat_get_bgwriter_maxwritten_clean() AS maxwritten_clean,
     pg_stat_get_buf_alloc() AS buffers_alloc,
-- 
2.47.3




* Re: another autovacuum scheduling thread
@ 2026-03-23 21:01  Nathan Bossart <[email protected]>
  parent: Sami Imseih <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2026-03-23 21:01 UTC (permalink / raw)
  To: Sami Imseih <[email protected]>; +Cc: Bharath Rupireddy <[email protected]>; David Rowley <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Mon, Mar 23, 2026 at 02:01:22PM -0500, Sami Imseih wrote:
> Adding a system view will be nice. I am attaching a version I used in earlier
> testing (cleaned up with docs), if we are inclined to get this in. I
> think it will be
> useful.

Thanks.  IMHO we should continue to focus on the main patch and get that
committed first.

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2026-03-23 22:27  Jim Nasby <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Jim Nasby @ 2026-03-23 22:27 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Sami Imseih <[email protected]>; Bharath Rupireddy <[email protected]>; David Rowley <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Mon, Mar 23, 2026 at 4:01 PM Nathan Bossart <[email protected]>
wrote:

> On Mon, Mar 23, 2026 at 02:01:22PM -0500, Sami Imseih wrote:
> > Adding a system view will be nice. I am attaching a version I used in
> earlier
> > testing (cleaned up with docs), if we are inclined to get this in. I
> > think it will be
> > useful.
>
> Thanks.  IMHO we should continue to focus on the main patch and get that
> committed first.
>

+1 ... for one thing if we're going to add a view meant for monitoring
autovac decisions I'd like to think about ways to measure how many tables
are "close" to being eligible for autovac. In particular, the scenario
where you've just done an MVU via some form of logical, so now the freeze
ages on all your tables are extremely similar.

It might be nice if we had an official means of publishing things that
we'd really like users to kick the tires on, but with a clear understanding
that we promise no backwards compatibility, no continued support, etc.
I can't see how that'd work with backend code, but it could certainly be
done for anything in userspace.



* Re: another autovacuum scheduling thread
@ 2026-03-24 00:32  David Rowley <[email protected]>
  parent: Jim Nasby <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2026-03-24 00:32 UTC (permalink / raw)
  To: Jim Nasby <[email protected]>; +Cc: Nathan Bossart <[email protected]>; Sami Imseih <[email protected]>; Bharath Rupireddy <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, 24 Mar 2026 at 11:27, Jim Nasby <[email protected]> wrote:
>
> On Mon, Mar 23, 2026 at 4:01 PM Nathan Bossart <[email protected]> wrote:
>> Thanks.  IMHO we should continue to focus on the main patch and get that
>> committed first.
>
>
> +1 ... for one thing if we're going to add a view meant for monitoring autovac decisions I'd like to think about ways to measure how many tables are "close" to being eligible for autovac. In particular, the scenario where you've just done an MVU via some form of logical, so now the freeze ages on all your tables are extremely similar.

+1 for main patch first. I do think a view would be useful as a
follow-up. However, which columns we put in that view might have some
influence on how the current patch should look. I think the view
should show the individual scores and the total score as the Max() of
the individual scores.  If we didn't do that, it might be confusing to
the user which aspect of the score the final score is derived from.
That might mean that it'd be better to have
relation_needs_vacanalyze() output the scores individually, or perhaps
populate a struct that we pass in that gets allocated on the stack
during do_autovacuum(). That'd mean a bit less churn if we go with the
view containing individual scores.
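
As a rough illustration of the struct idea above, here is a minimal standalone sketch. The struct name, field names, and helper functions are hypothetical and not taken from the patch; the only thing assumed from the discussion is that the total is the maximum of the individual components.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct; names are illustrative, not the patch's own. */
typedef struct AutoVacScoresSketch
{
	double		xid;	/* transaction ID age component */
	double		mxid;	/* multixact ID age component */
	double		vac;	/* vacuum (dead tuple) threshold component */
	double		ins;	/* insert threshold component */
	double		anl;	/* analyze threshold component */
	double		max;	/* total score: largest of the components */
} AutoVacScoresSketch;

/* Return the larger of two doubles. */
static double
max2(double a, double b)
{
	return (a > b) ? a : b;
}

/*
 * Fill in the total as the maximum of the individual components, so a
 * caller (or a view) can still see which component produced it.
 */
void
finalize_scores(AutoVacScoresSketch *scores)
{
	assert(scores != NULL);

	scores->max = max2(max2(scores->xid, scores->mxid),
					   max2(max2(scores->vac, scores->ins), scores->anl));
}
```

A caller would declare the struct on its own stack, have the scoring code fill in the five components, and then call finalize_scores() before reading the total, which keeps the per-component values available for logging or a future view.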

I think it would be good to have the view show tables that are not
eligible for autovacuum too. It should be easy for users to filter
those out for cases where they're not needed. Doing that would make it
very easy for anyone who wanted to code up a script to run off-peak to
vacuum tables that might need attention on the next peak. Something
like:

SELECT 'VACUUM ' || vacrelid::regclass || ';' FROM
pg_stat_autovacuum_priority WHERE vacuum_score BETWEEN 0.75 AND 1.0
ORDER BY vacuum_score DESC
\gexec

David






* Re: another autovacuum scheduling thread
@ 2026-03-24 17:15  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Nathan Bossart @ 2026-03-24 17:15 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Jim Nasby <[email protected]>; Sami Imseih <[email protected]>; Bharath Rupireddy <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Tue, Mar 24, 2026 at 01:32:40PM +1300, David Rowley wrote:
> +1 for main patch first. I do think a view would be useful as a
> follow-up. However, which columns we put in that view might have some
> influence on how the current patch should look. I think the view
> should show the individual scores and the total score as the Max() of
> the individual scores.  If we didn't do that, it might be confusing to
> the user which aspect of the score the final score is derived from.
> That might mean that it'd be better to have
> relation_needs_vacanalyze() output the scores individually, or perhaps
> populate a struct that we pass in that gets allocated on the stack
> during do_autovacuum(). That'd mean a bit less churn if we go with the
> view containing individual scores.

Agreed.  Here's a first try at that.  I also updated the DEBUG3 at the end
of relation_needs_vacanalyze() to show the individual scores.  The comment
above that function might need some work, and we might need a bit of
additional commentary elsewhere.

-- 
nathan



* Re: another autovacuum scheduling thread
@ 2026-03-24 21:12  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 0 replies; 143+ messages in thread

From: Sami Imseih @ 2026-03-24 21:12 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Jim Nasby <[email protected]>; Bharath Rupireddy <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> On Tue, Mar 24, 2026 at 01:32:40PM +1300, David Rowley wrote:
> > +1 for main patch first. I do think a view would be useful as a
> > follow-up. However, which columns we put in that view might have some
> > influence on how the current patch should look. I think the view
> > should show the individual scores and the total score as the Max() of
> > the individual scores.  If we didn't do that, it might be confusing to
> > the user which aspect of the score the final score is derived from.
> > That might mean that it'd be better to have
> > relation_needs_vacanalyze() output the scores individually, or perhaps
> > populate a struct that we pass in that gets allocated on the stack
> > during do_autovacuum(). That'd mean a bit less churn if we go with the
> > view containing individual scores.
>
> Agreed.  Here's a first try at that.  I also updated the DEBUG3 at the end
> of relation_needs_vacanalyze() to show the individual scores.  The comment
> above that function might need some work, and we might need a bit of
> additional commentary elsewhere.

This is good and these values could be exposed in the future view
individually. I like this.

v15 also LGTM.

--
Sami






* Re: another autovacuum scheduling thread
@ 2026-03-25 21:12  Bharath Rupireddy <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: Bharath Rupireddy @ 2026-03-25 21:12 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Jim Nasby <[email protected]>; Sami Imseih <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

Hi,

On Tue, Mar 24, 2026 at 10:15 AM Nathan Bossart
<[email protected]> wrote:
>
> Agreed.  Here's a first try at that.  I also updated the DEBUG3 at the end
> of relation_needs_vacanalyze() to show the individual scores.  The comment
> above that function might need some work, and we might need a bit of
> additional commentary elsewhere.

Sorry for the late response. Thank you for sending the latest patch.

+1 for getting the main patch in first and then the scoring stats view.

I looked closely at the v15 patch today and it LGTM.

A couple of thoughts (I don't intend to block the main patch from
getting in :)):

1/ If a large bloated table scores highest and takes hours to vacuum,
other tables' XID ages keep advancing (not to the failsafe limits yet,
but approaching them). By the time the autovacuum worker finishes,
a table that now needs freezing is still stuck at its old position in
the sorted table list.

Would it make sense to recompute scores and re-sort the remaining
table list after each table is processed in do_autovacuum()'s main
loop - say, after a certain amount of time spent vacuuming the large
table(s)? This would catch the above scenarios. I see that the scores
per table are being calculated in relation_needs_vacanalyze, but they
are ignored in the recheck path (table_recheck_autovac ->
recheck_relation_needs_vacanalyze -> relation_needs_vacanalyze).

IMHO, this could be a useful future addition.

2/ Nit: IIUC, autovacuum_vacuum_score_weight and the other existing vacuum
threshold and scale factor GUCs are the ones to tune to change which
tables are prioritized first. I still think it would be nice to have a
document explaining these scenarios - maybe after the main patch gets
in.

--
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com






* Re: another autovacuum scheduling thread
@ 2026-03-25 21:18  Nathan Bossart <[email protected]>
  parent: Bharath Rupireddy <[email protected]>
  0 siblings, 2 replies; 143+ messages in thread

From: Nathan Bossart @ 2026-03-25 21:18 UTC (permalink / raw)
  To: Bharath Rupireddy <[email protected]>; +Cc: David Rowley <[email protected]>; Jim Nasby <[email protected]>; Sami Imseih <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Wed, Mar 25, 2026 at 02:12:16PM -0700, Bharath Rupireddy wrote:
> Would it make sense to recompute scores and re-sort the remaining
> table list after each table is processed in do_autovacuum()'s main
> loop - say, after a certain amount of time spent vacuuming the large
> table(s)? This would catch the above scenarios. I see that the scores
> per table are being calculated in relation_needs_vacanalyze, but they
> are ignored in the recheck path (table_recheck_autovac ->
> recheck_relation_needs_vacanalyze -> relation_needs_vacanalyze).

I think this was discussed a bit upthread, and we decided to leave it out
for now.  But things like reprioritization and automatic cost limit
adjustments seem worth considering for v20.

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2026-03-25 21:24  Bharath Rupireddy <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 0 replies; 143+ messages in thread

From: Bharath Rupireddy @ 2026-03-25 21:24 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Jim Nasby <[email protected]>; Sami Imseih <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

Hi,

On Wed, Mar 25, 2026 at 2:18 PM Nathan Bossart <[email protected]> wrote:
>
> On Wed, Mar 25, 2026 at 02:12:16PM -0700, Bharath Rupireddy wrote:
> > Would it make sense to recompute scores and re-sort the remaining
> > table list after each table is processed in do_autovacuum()'s main
> > loop - say, after a certain amount of time spent vacuuming the large
> > table(s)? This would catch the above scenarios. I see that the scores
> > per table are being calculated in relation_needs_vacanalyze, but they
> > are ignored in the recheck path (table_recheck_autovac ->
> > recheck_relation_needs_vacanalyze -> relation_needs_vacanalyze).
>
> I think this was discussed a bit upthread, and we decided to leave it out
> for now.  But things like reprioritization and automatic cost limit
> adjustments seem worth considering for v20.

+1. Thanks for the clarification.

-- 
Bharath Rupireddy
Amazon Web Services: https://aws.amazon.com






* Re: another autovacuum scheduling thread
@ 2026-03-25 22:00  David Rowley <[email protected]>
  parent: Nathan Bossart <[email protected]>
  1 sibling, 1 reply; 143+ messages in thread

From: David Rowley @ 2026-03-25 22:00 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Bharath Rupireddy <[email protected]>; Jim Nasby <[email protected]>; Sami Imseih <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, 26 Mar 2026 at 10:18, Nathan Bossart <[email protected]> wrote:
>
> On Wed, Mar 25, 2026 at 02:12:16PM -0700, Bharath Rupireddy wrote:
> > Would it make sense to recompute scores and re-sort the remaining
> > table list after each table is processed in do_autovacuum()'s main
> > loop - say, after a certain amount of time spent vacuuming the large
> > table(s)? This would catch the above scenarios. I see that the scores
> > per table are being calculated in relation_needs_vacanalyze, but they
> > are ignored in the recheck path (table_recheck_autovac ->
> > recheck_relation_needs_vacanalyze -> relation_needs_vacanalyze).
>
> I think this was discussed a bit upthread, and we decided to leave it out
> for now.  But things like reprioritization and automatic cost limit
> adjustments seem worth considering for v20.

Agreed. I think the reason you mentioned in [1] was a good reason not
to do this.

There are also other autovacuum workers that may be calculating a more
up-to-date list. They may well process the table that's increased
score before the worker with the slightly stale list makes it there.
That seems fine and natural to me.

David

[1] https://postgr.es/m/aROY-MUVO_mYTl2f%40nathan






* Re: another autovacuum scheduling thread
@ 2026-03-25 22:10  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2026-03-25 22:10 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Bharath Rupireddy <[email protected]>; Jim Nasby <[email protected]>; Sami Imseih <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

Here is what I have staged for commit.  I'll give this at least one more
close read-through beforehand, but I'm hoping to commit it Thursday or
Friday.  Thanks everybody for the thoughtful discussion.

-- 
nathan



* Re: another autovacuum scheduling thread
@ 2026-03-26 00:28  David Rowley <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2026-03-26 00:28 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Bharath Rupireddy <[email protected]>; Jim Nasby <[email protected]>; Sami Imseih <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, 26 Mar 2026 at 11:10, Nathan Bossart <[email protected]> wrote:
>
> Here is what I have staged for commit.  I'll give this at least one more
> close read-through beforehand, but I'm hoping to commit it Thursday or
> Friday.  Thanks everybody for the thoughtful discussion.

A review:

1. I don't think the following is exactly true:

+       <xref linkend="guc-autovacuum-multixact-freeze-max-age"/>.  Furthermore,
+       this component increases greatly once the age surpasses
+       <xref linkend="guc-vacuum-multixact-failsafe-age"/>.  The final value

Should it mention dividing by the autovacuum_freeze_score_weight?

2. Is it worth expanding the following paragraph with some examples?
For example, raising the priority of analyze, someone might want to
set autovacuum_analyze_score_weight to 2.0, effectively doubling the
analyze scores.

+    <para>
+     To revert to the prioritization strategy used before
+     <productname>PostgreSQL</productname> 19 (i.e., the order the tables are
+     listed in the <literal>pg_class</literal> system catalog), set all of the
+     aforementioned "weight" parameters to <literal>0.0</literal>.
+    </para>

3. We talked about this a bit, but after reading the below comment and
looking at sort_template.h, I wonder if we might be putting too much
faith into the current qsort implementation. The presorted check
should pass when all the scores are 0.0. There's also the code that
runs before that for when n < 7 which shouldn't do any swapping. I'm
wondering if that comment might be putting a little too much faith
into that remaining true in the future. You're also calling the same
thing out in #2, so maybe we better take some safer measures to ensure
it remains true instead of relying on the current qsort code not
changing.

+ * between 0.0 and 10.0 (inclusive).  Setting all of these to 0.0 restores
+ * pre-v19 prioritization behavior:

4. I think we tend to favour not breaking ERROR strings up like this:

- elog(DEBUG3, "%s: vac: %.0f (threshold %.0f), ins: %.0f (threshold %.0f), anl: %.0f (threshold %.0f)",
+ elog(DEBUG3,
+ "%s: "
+ "vac: %.0f (thresh %.0f, score %.2f), "
+ "ins: %.0f (thresh %.0f, score %.2f), "
+ "anl: %.0f (thresh %.0f, score %.2f), "
+ "xid score: %.2f, mxid score: %.2f",

It reduces grepability of error messages. I guess you could argue that
this has so many format specifiers that it's not that greppable
anyway... I sometimes do guess those to grep for things.

5. Maybe worth specifying the range in a comment in the following:

+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -733,6 +733,11 @@
 #autovacuum_multixact_freeze_max_age = 400000000        # maximum multixact age
                                                         # before forced vacuum
                                                         # (change requires restart)
+#autovacuum_freeze_score_weight = 1.0
+#autovacuum_multixact_freeze_score_weight = 1.0
+#autovacuum_vacuum_score_weight = 1.0
+#autovacuum_vacuum_insert_score_weight = 1.0
+#autovacuum_analyze_score_weight = 1.0
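
To make the example in point 2 concrete, here is a small standalone sketch of how a weight might scale one score component. The function name is hypothetical, and treating the weight as a plain multiplier is an assumption based on the "doubling" wording above rather than on the patch itself; the 0.0 to 10.0 range comes from the comment quoted in point 3.

```c
#include <assert.h>

/*
 * Hypothetical helper: scale one raw score component by its weight
 * setting.  Plain multiplication is an assumption for illustration.
 */
double
apply_score_weight(double raw_score, double weight)
{
	/* The weights are described as ranging from 0.0 to 10.0 inclusive. */
	assert(weight >= 0.0 && weight <= 10.0);

	return raw_score * weight;
}
```

Under this assumption, a weight of 2.0 for the analyze component doubles its contribution relative to the default of 1.0, and a weight of 0.0 removes that component from the ordering entirely, which is why setting every weight to 0.0 leaves all tables with the same score.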

David






* Re: another autovacuum scheduling thread
@ 2026-03-26 16:49  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2026-03-26 16:49 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Bharath Rupireddy <[email protected]>; Jim Nasby <[email protected]>; Sami Imseih <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Thu, Mar 26, 2026 at 01:28:16PM +1300, David Rowley wrote:
> A review:

Thanks.  I believe I've addressed all your feedback.

-- 
nathan



* Re: another autovacuum scheduling thread
@ 2026-03-26 21:29  David Rowley <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: David Rowley @ 2026-03-26 21:29 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: Bharath Rupireddy <[email protected]>; Jim Nasby <[email protected]>; Sami Imseih <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Fri, 27 Mar 2026 at 05:49, Nathan Bossart <[email protected]> wrote:
>
> On Thu, Mar 26, 2026 at 01:28:16PM +1300, David Rowley wrote:
> > A review:
>
> Thanks.  I believe I've addressed all your feedback.

It might just be a personal taste thing, but I'd have done the
following differently:

+ if (autovacuum_freeze_score_weight != 0.0 ||
+ autovacuum_multixact_freeze_score_weight != 0.0 ||
+ autovacuum_vacuum_score_weight != 0.0 ||
+ autovacuum_vacuum_insert_score_weight != 0.0 ||
+ autovacuum_analyze_score_weight != 0.0)
+ list_sort(tables_to_process, TableToProcessComparator);

I'd have done:

+ table->score = scores.max;
+ sort_required |= (scores.max != 0.0);
+ tables_to_process = lappend(tables_to_process, table);

...

if (sort_required)
    list_sort(tables_to_process, TableToProcessComparator);

But, I'm fine if you'd rather keep it the way you have it.
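
For readers following along, the sort_required variant can be sketched outside the backend roughly as follows. This is a standalone approximation: TableSketch, table_score_cmp(), and prioritize_tables() are hypothetical names, an array with qsort() stands in for the List and list_sort() used in the patch, and the descending sort order is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-table list entry. */
typedef struct TableSketch
{
	unsigned int oid;
	double		score;
} TableSketch;

/* Order by descending score so the highest-priority table comes first. */
static int
table_score_cmp(const void *a, const void *b)
{
	const TableSketch *ta = (const TableSketch *) a;
	const TableSketch *tb = (const TableSketch *) b;

	if (ta->score > tb->score)
		return -1;
	if (ta->score < tb->score)
		return 1;
	return 0;
}

/*
 * Sort only when at least one table has a non-zero score.  When every
 * score is 0.0 the array is left untouched, so the original order is kept
 * without relying on how the sort treats equal keys.  Returns whether a
 * sort was performed.
 */
bool
prioritize_tables(TableSketch *tables, size_t ntables)
{
	bool		sort_required = false;

	assert(tables != NULL || ntables == 0);

	for (size_t i = 0; i < ntables; i++)
		sort_required |= (tables[i].score != 0.0);

	if (sort_required)
		qsort(tables, ntables, sizeof(TableSketch), table_score_cmp);

	return sort_required;
}
```

The point of tracking the flag while building the list is that the decision depends on the computed scores rather than on the weight settings, so any future way of producing all-zero scores would also skip the sort.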

No further comments at this stage.

Thanks for working on this.

David






* Re: another autovacuum scheduling thread
@ 2026-03-27 15:18  Nathan Bossart <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 143+ messages in thread

From: Nathan Bossart @ 2026-03-27 15:18 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Bharath Rupireddy <[email protected]>; Jim Nasby <[email protected]>; Sami Imseih <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

On Fri, Mar 27, 2026 at 10:29:47AM +1300, David Rowley wrote:
> Thanks for working on this.

Committed!

-- 
nathan






* Re: another autovacuum scheduling thread
@ 2026-03-27 23:17  Sami Imseih <[email protected]>
  parent: Nathan Bossart <[email protected]>
  0 siblings, 0 replies; 143+ messages in thread

From: Sami Imseih @ 2026-03-27 23:17 UTC (permalink / raw)
  To: Nathan Bossart <[email protected]>; +Cc: David Rowley <[email protected]>; Bharath Rupireddy <[email protected]>; Jim Nasby <[email protected]>; Greg Burd <[email protected]>; Robert Haas <[email protected]>; Robert Treat <[email protected]>; Jeremy Schneider <[email protected]>; pgsql-hackers

> Committed!

Thanks for pushing this!

Also, [1] for the priority view discussed earlier [2].

[1] https://www.postgresql.org/message-id/CAA5RZ0s4xjMrB-VAnLccC7kY8d0-4806-Lsac-czJsdA1LXtAw%40mail.gma...
[2] https://www.postgresql.org/message-id/CAApHDvqQN-B2sQov8nsfZOmx-VeJMauSf4kLa3A8LsK1tUyBNw%40mail.gma...

--
Sami Imseih
Amazon Web Services (AWS)







end of thread, other threads:[~2026-03-27 23:17 UTC | newest]

Thread overview: 143+ messages
2025-10-08 15:18 another autovacuum scheduling thread Nathan Bossart <[email protected]>
2025-10-08 17:06 ` Sami Imseih <[email protected]>
2025-10-08 17:20   ` Álvaro Herrera <[email protected]>
2025-10-08 17:47   ` Sami Imseih <[email protected]>
2025-10-08 23:40   ` Jeremy Schneider <[email protected]>
2025-10-08 23:59     ` David Rowley <[email protected]>
2025-10-09 00:27       ` Jeremy Schneider <[email protected]>
2025-10-09 00:30         ` Jeremy Schneider <[email protected]>
2025-10-09 01:03         ` David Rowley <[email protected]>
2025-10-09 01:25           ` Jeremy Schneider <[email protected]>
2025-10-09 01:47             ` Jeremy Schneider <[email protected]>
2025-10-09 03:13               ` David Rowley <[email protected]>
2025-10-09 16:13                 ` Nathan Bossart <[email protected]>
2025-10-10 17:31                   ` Nathan Bossart <[email protected]>
2025-10-10 18:42                     ` Robert Haas <[email protected]>
2025-10-10 19:44                       ` Nathan Bossart <[email protected]>
2025-10-10 20:24                         ` Robert Haas <[email protected]>
2025-10-10 21:59                           ` Jeremy Schneider <[email protected]>
2025-10-13 12:32                             ` Robert Haas <[email protected]>
2025-10-12 06:27                       ` David Rowley <[email protected]>
2025-10-21 14:38                         ` Nathan Bossart <[email protected]>
2025-10-21 20:07                           ` David Rowley <[email protected]>
2025-10-22 18:40                             ` Nathan Bossart <[email protected]>
2025-10-22 18:58                               ` Nathan Bossart <[email protected]>
2025-10-22 19:34                                 ` David Rowley <[email protected]>
2025-10-22 19:43                                   ` Nathan Bossart <[email protected]>
2025-10-23 18:22                                     ` Sami Imseih <[email protected]>
2025-10-23 18:47                                       ` Nathan Bossart <[email protected]>
2025-10-23 19:32                                         ` Sami Imseih <[email protected]>
2025-10-23 20:24                                           ` David Rowley <[email protected]>
2025-10-23 20:48                                             ` Sami Imseih <[email protected]>
2025-10-23 22:39                                               ` David Rowley <[email protected]>
2025-10-24 15:08                                                 ` Nathan Bossart <[email protected]>
2025-10-26 01:25                                                   ` David Rowley <[email protected]>
2025-10-27 16:06                                                     ` Nathan Bossart <[email protected]>
2025-10-27 17:47                                                       ` Sami Imseih <[email protected]>
2025-10-27 21:15                                                         ` Nathan Bossart <[email protected]>
2025-10-27 22:35                                                           ` Sami Imseih <[email protected]>
2025-10-27 23:16                                                             ` David Rowley <[email protected]>
2025-10-28 21:06                                                               ` Nathan Bossart <[email protected]>
2025-10-29 15:24                                                                 ` Sami Imseih <[email protected]>
2025-10-29 16:07                                                                   ` Nathan Bossart <[email protected]>
2025-10-30 02:58                                                                     ` wenhui qiu <[email protected]>
2025-10-30 03:41                                                                       ` David Rowley <[email protected]>
2025-10-30 06:48                                                                         ` wenhui qiu <[email protected]>
2025-10-30 10:36                                                                           ` David Rowley <[email protected]>
2025-10-27 22:47                                                       ` David Rowley <[email protected]>
2025-10-28 21:03                                                         ` Nathan Bossart <[email protected]>
2025-10-28 22:44                                                           ` Sami Imseih <[email protected]>
2025-10-29 03:10                                                             ` wenhui qiu <[email protected]>
2025-10-29 15:58                                                               ` Nathan Bossart <[email protected]>
2025-10-29 15:51                                                             ` Nathan Bossart <[email protected]>
2025-10-30 20:05                                                               ` Robert Haas <[email protected]>
2025-10-30 21:02                                                                 ` Nathan Bossart <[email protected]>
2025-10-30 21:05                                                                   ` Sami Imseih <[email protected]>
2025-10-31 00:38                                                                     ` Sami Imseih <[email protected]>
2025-10-31 20:12                                                                       ` Nathan Bossart <[email protected]>
2025-11-01 01:50                                                                         ` David Rowley <[email protected]>
2025-11-01 03:29                                                                           ` David Rowley <[email protected]>
2025-11-06 22:21                                                                             ` Sami Imseih <[email protected]>
2025-11-06 23:05                                                                               ` David Rowley <[email protected]>
2025-11-07 19:22                                                                                 ` Sami Imseih <[email protected]>
2025-11-11 00:58                                                                                   ` David Rowley <[email protected]>
2025-11-11 16:36                                                                                     ` Nathan Bossart <[email protected]>
2025-11-11 19:43                                                                                       ` Robert Treat <[email protected]>
2025-11-11 19:48                                                                                         ` Nathan Bossart <[email protected]>
2025-11-11 19:50                                                                                           ` Robert Treat <[email protected]>
2025-11-11 20:16                                                                                             ` Nathan Bossart <[email protected]>
2025-11-11 20:03                                                                                       ` David Rowley <[email protected]>
2025-11-11 20:13                                                                                         ` Nathan Bossart <[email protected]>
2025-11-11 20:26                                                                                           ` David Rowley <[email protected]>
2025-11-11 23:22                                                                                             ` Robert Treat <[email protected]>
2025-11-12 20:10                                                                                               ` Nathan Bossart <[email protected]>
2025-11-12 22:10                                                                                                 ` Sami Imseih <[email protected]>
2025-11-12 23:51                                                                                                   ` Jeremy Schneider <[email protected]>
2025-11-13 00:32                                                                                                     ` Sami Imseih <[email protected]>
2025-11-20 14:30                                                                                                 ` Robert Haas <[email protected]>
2025-11-20 16:25                                                                                                   ` Nathan Bossart <[email protected]>
2025-11-20 16:34                                                                                                     ` Sami Imseih <[email protected]>
2025-11-20 18:35                                                                                                       ` Robert Haas <[email protected]>
2025-11-20 20:21                                                                                                         ` Sami Imseih <[email protected]>
2025-11-20 20:58                                                                                                         ` David Rowley <[email protected]>
2025-11-20 21:16                                                                                                           ` Robert Haas <[email protected]>
2025-11-20 22:12                                                                                                             ` David Rowley <[email protected]>
2025-11-22 11:28                                                                                                               ` Robert Haas <[email protected]>
2025-11-22 17:28                                                                                                                 ` Sami Imseih <[email protected]>
2025-11-22 18:35                                                                                                                   ` Robert Haas <[email protected]>
2025-11-23 09:55                                                                                                                     ` David Rowley <[email protected]>
2025-11-24 19:59                                                                                                                       ` Robert Haas <[email protected]>
2026-03-05 17:03                                                                                                                         ` Nathan Bossart <[email protected]>
2026-03-10 15:06                                                                                                                           ` Nathan Bossart <[email protected]>
2026-03-10 16:19                                                                                                                             ` Nathan Bossart <[email protected]>
2026-03-11 00:11                                                                                                                               ` Sami Imseih <[email protected]>
2026-03-11 01:08                                                                                                                                 ` Sami Imseih <[email protected]>
2026-03-11 03:56                                                                                                                                   ` wenhui qiu <[email protected]>
2026-03-11 15:53                                                                                                                                   ` Nathan Bossart <[email protected]>
2026-03-11 17:08                                                                                                                                     ` Sami Imseih <[email protected]>
2026-03-11 17:28                                                                                                                                       ` Nathan Bossart <[email protected]>
2026-03-11 17:59                                                                                                                                         ` Sami Imseih <[email protected]>
2026-03-12 19:20                                                                                                                                           ` Nathan Bossart <[email protected]>
2026-03-17 23:06                                                                                                                                             ` David Rowley <[email protected]>
2026-03-18 16:09                                                                                                                                               ` Nathan Bossart <[email protected]>
2026-03-19 09:55                                                                                                                                                 ` Greg Burd <[email protected]>
2026-03-19 11:44                                                                                                                                                   ` David Rowley <[email protected]>
2026-03-19 15:49                                                                                                                                                     ` Nathan Bossart <[email protected]>
2026-03-19 17:36                                                                                                                                                       ` Greg Burd <[email protected]>
2026-03-20 04:22                                                                                                                                                       ` Sami Imseih <[email protected]>
2026-03-20 04:40                                                                                                                                                         ` David Rowley <[email protected]>
2026-03-20 05:03                                                                                                                                                           ` Sami Imseih <[email protected]>
2026-03-20 20:31                                                                                                                                                             ` Nathan Bossart <[email protected]>
2026-03-21 14:55                                                                                                                                                               ` Sami Imseih <[email protected]>
2026-03-22 00:08                                                                                                                                                               ` Bharath Rupireddy <[email protected]>
2026-03-23 15:22                                                                                                                                                                 ` Nathan Bossart <[email protected]>
2026-03-23 19:01                                                                                                                                                                   ` Sami Imseih <[email protected]>
2026-03-23 21:01                                                                                                                                                                     ` Nathan Bossart <[email protected]>
2026-03-23 22:27                                                                                                                                                                       ` Jim Nasby <[email protected]>
2026-03-24 00:32                                                                                                                                                                         ` David Rowley <[email protected]>
2026-03-24 17:15                                                                                                                                                                           ` Nathan Bossart <[email protected]>
2026-03-24 21:12                                                                                                                                                                             ` Sami Imseih <[email protected]>
2026-03-25 21:12                                                                                                                                                                             ` Bharath Rupireddy <[email protected]>
2026-03-25 21:18                                                                                                                                                                               ` Nathan Bossart <[email protected]>
2026-03-25 21:24                                                                                                                                                                                 ` Bharath Rupireddy <[email protected]>
2026-03-25 22:00                                                                                                                                                                                 ` David Rowley <[email protected]>
2026-03-25 22:10                                                                                                                                                                                   ` Nathan Bossart <[email protected]>
2026-03-26 00:28                                                                                                                                                                                     ` David Rowley <[email protected]>
2026-03-26 16:49                                                                                                                                                                                       ` Nathan Bossart <[email protected]>
2026-03-26 21:29                                                                                                                                                                                         ` David Rowley <[email protected]>
2026-03-27 15:18                                                                                                                                                                                           ` Nathan Bossart <[email protected]>
2026-03-27 23:17                                                                                                                                                                                             ` Sami Imseih <[email protected]>
2025-11-24 15:19                                                                                                                     ` Sami Imseih <[email protected]>
2025-11-22 20:03                                                                                                                 ` Nathan Bossart <[email protected]>
2025-11-22 13:07                                                                                                     ` Dilip Kumar <[email protected]>
2025-11-11 20:25                                                                                     ` Sami Imseih <[email protected]>
2025-11-11 20:43                                                                                       ` David Rowley <[email protected]>
2025-11-11 20:53                                                                                         ` Sami Imseih <[email protected]>
2025-10-24 21:13                                   ` Peter Geoghegan <[email protected]>
2025-10-24 22:25                                     ` David Rowley <[email protected]>
2025-10-08 17:37 ` Andres Freund <[email protected]>
2025-10-09 16:01   ` Nathan Bossart <[email protected]>
2025-10-09 16:15     ` Andres Freund <[email protected]>
2025-10-09 16:33       ` Nathan Bossart <[email protected]>
2025-10-09 19:45       ` Peter Geoghegan <[email protected]>
2025-11-14 02:25  Re: Re: another autovacuum scheduling thread 段坤仁(刻韧) <[email protected]>
