Should we document the cost of pg_database

public inbox for [email protected]  
help / color / mirror / Atom feed

Should we document the cost of pg_database_size()? Alternatives?
3+ messages / 2 participants
[nested] [flat]

* Should we document the cost of pg_database_size()? Alternatives?
@ 2025-07-18 00:54  Craig Ringer <[email protected]>
  0 siblings, 2 replies; 3+ messages in thread

From: Craig Ringer @ 2025-07-18 00:54 UTC (permalink / raw)
  To: [email protected]; +Cc: Gregory Bulloch <[email protected]>; Shan Shan Zhao <[email protected]>

Hi all

(Long time!)

I recently ran into an issue where a monitoring component was calling
pg_catalog.pg_database_size() at a ~15s interval on a large
schema-sharded database. It took so long that the query was timing out
before the next 15s scrape interval would start.

The resulting filesystem metadata activity was also hammering the
inode cache etc, affecting normal operations.

================
pg_database_size() is expensive
================

The pg_database_size() function will stat every fork of every
relation, which is pretty expensive on a big DB with a lot of extents
and high relation counts. In my toy database:

test=# select pg_database_size('craig');
 pg_database_size
------------------
        338084643
(1 row)

# strace -c -p 1009203
[...running...]

test=# select pg_database_size('craig');
 pg_database_size
------------------
        338084643
(1 row)

output of strace is:

strace: Process 1009203 attached
^Cstrace: Process 1009203 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 91.18    0.001148           2       393           newfstatat
  3.42    0.000043          14         3           sendto
  1.59    0.000020           6         3         1 openat
  1.35    0.000017           4         4           getdents64
  1.27    0.000016           8         2           close
  0.95    0.000012           6         2         1 recvfrom
  0.24    0.000003           1         2           fstat
  0.00    0.000000           0         1           epoll_wait
------ ----------- ----------- --------- --------- ----------------
100.00    0.001259           3       410         2 total

# find base/16385 | wc -l
394

so it's fstatat()ing (almost) every file including all forks, indexes,
extents, etc.

No surprise it was taking forever on a very complex DB.

================
Docs for pg_database_size() don't mention performance impact
================

When I checked the docs to verify my recollection that
pg_database_size() would stat() every relfilenode (all extents, all
forks including maps, etc) I was surprised to see the docs don't say
much at all about it:
https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-ADMIN-DBOBJECT

> pg_database_size ( oid ) → bigint
>
> Computes the total disk space used by the database with the specified name or OID. To use this function, you must have CONNECT privilege on the specified database (which is granted by default) or have privileges of the pg_read_all_stats role.

It strikes me that we should probably warn about the cost of this
function on large DBs in the docs.

================
Lower-impact alternatives?
================

Which leads to ... what's the alternative, then, if we should tell
users not to poll pg_database_size() from monitoring systems?

FS-based sizing isn't really enough
----------------

Asking users to monitor at the filesystem level works, kind-of, but
it'll lead to confusion due to WAL and temp files in simple installs.
To get decent results they will need to have a separate dedicated
volume for pg_wal. And which temp files are counted will differ; IIRC
pg_database_size() does not count extents created by an in-progress
REINDEX etc, but DOES count temp table sizes, for example. FS-based
monitoring will also include things like spilled pg_replslot spilled
reorder buffers, which can be considerable and aren't reasonably
considered part of the "database size" or included in
pg_database_size(). And of course it can see only the sum of all
database sizes on a multi-database postgres instance unless the user
has one volume per database using distinct tablespaces. So
filesystem-based monitoring is not really a proper replacement.

Using relpages from pg_class is closer but problematic
-----------------

Should the user aggregate relpages from pg_class, multiply by the page
size, and call it kind-of-good-enough?

That seems awfully crude, and user-unfriendly at that, though works in
trivial cases:

test=# select pg_size_pretty(pg_database_size('test'));
 pg_size_pretty
----------------
 309 MB
(1 row)

test=# select pg_size_pretty(sum(relpages)*(1024*8)) from pg_class;
 pg_size_pretty
----------------
 308 MB
(1 row)

However, it breaks down pretty easily with anything less trivial. For
one thing, relpages lags behind VACUUM, so quickly growing or changing
tables and temp tables won't generally be captured. E.g. run

CREATE TEMPORARY TABLE bar AS
SELECT repeat('foofoo',x) AS filler1, repeat('morefillerhere',x) AS filler2
FROM generate_series(1,10000) x;

... then pg_database_size() will reflect the temp table, but summing
relpages won't, even if you're running it in the session that created
the temp table. The relpages for bar is initially 0 until after
VACUUM.

For example, I got a pg_database_size() of 322MB, an initial relpages
sum based estimate of 308MB (unchanged from before temp table
creation) and a relpages sum estimate after VACUUM of 319MB. The
on-disk size of the temp table is 11MB.

Meanwhile the actual size of base/16385 is 323MB according to `du` so
... good enough.

But the overall size of the whole datadir is 1093MB at the moment due
to WAL. Excluding WAL it's 356, mostly due to other DBs in the same
instance's base/.

==============
Any other alternatives?
==============

I'm interested in any other options anyone might suggest on better
ways to track DB size.

Obviously there's going to be a trade-off between freshness and cost;
pg_database_size() is expensive because it's immediate and current,
whereas using relpages is inaccurate because it waits for vacuum.

Any suggestions for a middle ground?

If using relpages is considered good-enough, would anyone be
interested if I was to cook up a patch for a more user-friendly
interface like pg_database_size_cached()  to present it to users?

Whew. /novel.

--
Craig Ringer
EnterpriseDB

^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Should we document the cost of pg_database_size()? Alternatives?
@ 2025-07-18 02:10  Craig Ringer <[email protected]>
  parent: Craig Ringer <[email protected]>
  1 sibling, 0 replies; 3+ messages in thread

From: Craig Ringer @ 2025-07-18 02:10 UTC (permalink / raw)
  To: [email protected]; +Cc: Gregory Bulloch <[email protected]>; Shan Shan Zhao <[email protected]>

On Fri, 18 Jul 2025 at 12:54, Craig Ringer
<[email protected]> wrote:
> [...]
> I recently ran into an issue where a monitoring component was calling
> pg_catalog.pg_database_size() at a ~15s interval on a large
> schema-sharded database. It took so long that the query was timing out
> before the next 15s scrape interval would start.

Fast-follow on this, because I thought to do a local experiment on performance.

I suspect that there's more at work with the reported issue than just
the cost of statting all the files, because even if I create a 100,000
table db with at least 2 indexes per table (and all the auxiliary
forks etc that will result in)

I created 100,000 tables to fake up a DB that looks like a
schema-sharded one, then tested timing on pg_database_size().

test=# select count(1) from pg_class;
 count
-------
 40496
(1 row)

and now

test=# select pg_database_size(current_database());
 pg_database_size
------------------
        660013571
(1 row)

Time: 75.614 ms

... it still takes hardly any time at all.

If I drop my disk cache on my workstation:

echo 3 > /proc/sys/vm/drop_caches

then repeating pg_database_size()  only takes 254 ms the first time,
then back to 74ms.

In other words, even if the DB has many millions of tables, there's no
way that pg_database_size() could reasonably take minutes to run and
contribute meaningfully to a timeout of a monitoring or scrape
process.

Not unless there's something (or several somethings) else badly wrong
on the DB instance - extreme inode cache thrashing, excessive block
device read-ahead, unreasonably high block device I/O latencies, etc.

I still think it's worth mentioning pg_database_size() needing to stat
every file in the docs, but it's clear there's more going on in the
particular case I'm seeing than that alone. I'll try to report back if
I learn anything interesting that explains the rest of the performance
issues.

FYI the crude method used to create the schema since I couldn't be
bothered scripting something sensible up was:

turn fsync off (never do this unless you're happy to completely
destroy all data in your postgres instance), set
max_locks_per_transaction=1000 and restart Pg

then

DO LANGUAGE plpgsql
$$
BEGIN FOR i IN 1..10000 LOOP
  EXECUTE format('CREATE TABLE %I(x serial primary key, y integer
unique)', 'padding_table_b_'||i);
  EXECUTE format('INSERT INTO %I(y) VALUES (0)', 'padding_table_b_'||i);
  IF i % 100 = 0 THEN RAISE NOTICE 'up to: %', i; END IF;
END LOOP;
END;
$$;

--
Craig Ringer
EnterpriseDB

^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Should we document the cost of pg_database_size()? Alternatives?
@ 2025-07-18 03:34  Ron Johnson <[email protected]>
  parent: Craig Ringer <[email protected]>
  1 sibling, 0 replies; 3+ messages in thread

From: Ron Johnson @ 2025-07-18 03:34 UTC (permalink / raw)
  To: pgsql-general

On Thu, Jul 17, 2025 at 8:55 PM Craig Ringer <[email protected]>
wrote:
[snip]

>
> FS-based sizing isn't really enough
> ----------------
>
> Asking users to monitor at the filesystem level works, kind-of, but
> it'll lead to confusion due to WAL and temp files in simple installs.
> To get decent results they will need to have a separate dedicated
> volume for pg_wal. And which temp files are counted will differ; IIRC
> pg_database_size() does not count extents created by an in-progress
> REINDEX etc, but DOES count temp table sizes, for example. FS-based
> monitoring will also include things like spilled pg_replslot spilled
> reorder buffers, which can be considerable and aren't reasonably
> considered part of the "database size" or included in
> pg_database_size(). And of course it can see only the sum of all
> database sizes on a multi-database postgres instance unless the user
> has one volume per database using distinct tablespaces. So
> filesystem-based monitoring is not really a proper replacement.
>

Whether the filesystem creeps above 90%, 95%, etc because of WAL files or
temp files or because of REINDEX or VACUUM FULL / CLUSTER / PG_REPACK is
irrelevant. it's the filesystem at 100% that will ruin your day,

Thus, we monitor filesystems, and don't monitor database size.

If the alarm does ever go off, *then* I check the cause.  (This isn't as
reactionary as it sounds, because I regularly check replication backlog,
for orphan slots, do REINDEXING and CLUSTER one table at a time, and don't
let junk onto the cluster disk.)

-- 
Death to <Redacted>, and butter sauce.
Don't boil me, I'm still alive.
<Redacted> lobster!

^ permalink  raw  reply  [nested|flat] 3+ messages in thread

end of thread, other threads:[~2025-07-18 03:34 UTC | newest]

Thread overview: 3+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-07-18 00:54 Should we document the cost of pg_database_size()? Alternatives? Craig Ringer <[email protected]>
2025-07-18 02:10 ` Craig Ringer <[email protected]>
2025-07-18 03:34 ` Ron Johnson <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox