Re: Stack-based tracking of per-node WAL/buffer usage

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Lukas Fittl <[email protected]>
To: Andres Freund <[email protected]>
Cc: Heikki Linnakangas <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: Peter Smith <[email protected]>
Cc: Zsolt Parragi <[email protected]>
Subject: Re: Stack-based tracking of per-node WAL/buffer usage
Date: Tue, 7 Apr 2026 15:27:45 -0700
Message-ID: <CAP53PkzpJ39HFRGaMR7n5kSsNuqui3TJ3Gw_FU7k_qjdBvT6hg@mail.gmail.com> (raw)
In-Reply-To: <rhah5kx5ftfztu532zck4mhc4hxtfis4z552laisd53sf6ycrn@wnbste4hleyg>
References: <57biou6l65r7gr4nunoe6lignz2x6m3w45gihoypaez4pc46di@txj3bakhj66l>
	<CAP53PkxDupm5U6TV0LF_YWHQ2wfcSiLsgnDBcO6b5AchLJhp=A@mail.gmail.com>
	<mtjyijvuv7xavvcdy3rosg43ycy3t5sluioxkef46se25ajtxp@e3e2xvesqhzu>
	<CAP53Pky4zHCueAyPkWE-cH6SqE_m+Kppmy=zE5K36X3HvCFZLw@mail.gmail.com>
	<pgvy7lsoa5jldwgyay2g757xivzbmgyan547wealc7cwtduvra@dysyw6dtqdgs>
	<3xbje45m5knff52mye5dfnrjdnwv7it2bzmqac3jqe66fvop4a@xvhy6zx7n6sb>
	<CAP53Pkzec5L=PDvF+zrPei2kM1FZH6pD2aD=zFWXwzW8oKXJBg@mail.gmail.com>
	<CAN4CZFM11mo-Bo3nhdcvVQ_ue7w3u_p8AL6xyDs054Q8kSv-xQ@mail.gmail.com>
	<CAP53PkzfeTzoC0WTL-ftVsn00ZV9wTe2_6g71txE5dN3Zq1y5g@mail.gmail.com>
	<CAP53Pkyqsht+exJQYRsjhSWYKu+vFGHhPub7m6PmFD6Or0=p1g@mail.gmail.com>
	<rhah5kx5ftfztu532zck4mhc4hxtfis4z552laisd53sf6ycrn@wnbste4hleyg>

On Tue, Apr 7, 2026 at 3:19 PM Andres Freund <[email protected]> wrote:
>
> Hi,
>
> On 2026-04-07 13:30:11 -0700, Lukas Fittl wrote:
> > 0001 is the change to make queryDesc->totaltime be allocated by
> > ExecutorStart instead of plugins themselves, and adds a
> > queryDesc->totaltime_options to have plugins request which level of
> > summary instrumentation they need. This change is pretty simple, and
> > could still make sense to get into 19. Because of the earlier
> > Instrumentation refactoring that was pushed (thanks!) we're already
> > asking extensions allocating queryDesc->totaltime to modify their use
> > of InstrAlloc, so I think we might as well clean this up now.
>
> Hm.  That's a fair argument. They indeed would have to again change next
> release
>
>  It's not a complicated change and removes more lines than it adds.
>
> I guess one thing I'm not sure is whether the fields shouldn't be renamed at
> the same time:
>
> a) To prevent extensions from continuing to set it, most of them do not test
>    against assert enabled builds. With a different name they would get a
>    compiler error.
>
> b) "totaltime" and "totaltime_options" are pretty poor descriptors of tracking
>    query level statistics.  If everyone has to change anyway, this is a good
>    occasion.
>
> 'query_instr[_options]'?
>
>
> Any opinions?

I think renaming makes sense - both to make sure extensions reconsider
how they use it, and because "totaltime" is a bad name anyway, because
its not just about timing (and hasn't been for many releases).

"query_instr[_options]" seems reasonable to me, although we could drop
the "query_" since it'd be "queryDesc->query_instr" vs
"queryDesc->instr".

>
> > 0002 is just ExecProcNodeInstr moved to instrument.c, as Andres had
> > suggested previously. We still get some quick performance wins from
> > doing that (see end of email), and again, its a simple change, so
> > could be considered if someone has bandwidth remaining. I've added a
> > later patch that then does the more complex inlining and gets us the
> > full speed up.
>
> Here it needs a few more inlines to get the full performance, otherwise it
> doesn't inline all the helpers.  I think on balance I didn't like the
> prototype in instrument.h, that's too widely included, and it might even cause
> some circularity issues.  It seems better to do the somewhat ugly thing and
> have the prototype be in executor.h.

Yeah, that makes sense.

>
> > 0002 measurements (with current master and TSC clock source used for
> > timing, best of three):
> >
> > CREATE TABLE lotsarows(key int not null);
> > INSERT INTO lotsarows SELECT generate_series(1, 50000000);
> > VACUUM FREEZE lotsarows;
>
> With the somewhat more extreme benchmark I used in the rdtsc thread and the
> added inline mentioned above I see a bit bigger wins. See the attached
> explainbench.sql - it doesn't quite cover all the combinations, but I think it
> gives a good enough overview.
>
> c=1 pgbench -f ~/tmp/explainbench.sql -P5 -r -t 10
>
> master:
> statement latencies in milliseconds and failures:
>        200.800           0 SELECT pg_prewarm('pgbench_accounts');
>          0.098           0 PREPARE query AS SELECT * FROM pgbench_accounts OFFSET 5000000 LIMIT 1;
>        212.010           0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING OFF)
>        268.648           0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING ON)
>        232.421           0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING OFF)
>        283.531           0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING ON)
>          0.030           0 DEALLOCATE query;
>
>
> 0002:
>
> statement latencies in milliseconds and failures:
>        201.558           0 SELECT pg_prewarm('pgbench_accounts');
>          0.103           0 PREPARE query AS SELECT * FROM pgbench_accounts OFFSET 5000000 LIMIT 1;
>        188.696           0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING OFF)
>        244.479           0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING ON)
>        223.773           0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING OFF)
>        266.947           0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING ON)
>          0.034           0 DEALLOCATE query;
>
> That's something like 4-12%.
>
> Pretty nice for a patch that just adds a few lines around and adds a few
> inlines.

Agreed.


> > @@ -334,6 +334,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
> >
> >       if (auto_explain_enabled())
> >       {
> > +             /* We're always interested in runtime */
> > +             queryDesc->totaltime_options |= INSTRUMENT_TIMER;
>
> > -                     queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
>
> Not that it's going to make a significant difference, but it is nice that this
> now would need to track less.

Yup.

>
> Kinda wonder about having
>   EXPLAIN (ANALYZE BUFFERS totals_only, WAL totals_only) ...;
>
> in plenty cases that'd be all one needs, at substantially lower cost.

True. I don't like the name "totals_only", but I like the concept.
Today someone has to go to pg_stat_statements to get just the total
numbers, without running them for all nodes with EXPLAIN ANALYZE (and
incurring its overhead).

Thanks,
Lukas

-- 
Lukas Fittl

view thread (42+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Stack-based tracking of per-node WAL/buffer usage
  In-Reply-To: <CAP53PkzpJ39HFRGaMR7n5kSsNuqui3TJ3Gw_FU7k_qjdBvT6hg@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox