public inbox for [email protected]  
help / color / mirror / Atom feed
From: Andres Freund <[email protected]>
To: Lukas Fittl <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: Stack-based tracking of per-node WAL/buffer usage
Date: Thu, 4 Sep 2025 16:23:06 -0400
Message-ID: <h7zq426fhzt3oqkps53qxfwl3dlz3ifcwqmybuvxntlcaqwb33@m6huopxovgle> (raw)
In-Reply-To: <CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com>
References: <CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com>

Hi,

On 2025-08-31 16:57:01 -0700, Lukas Fittl wrote:
> Please find attached a patch series that introduces a new paradigm for how
> per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce
> overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated
> distinct buffer hits [0].

I like this for a third reason: To separate out buffer access statistics for
the index and the table in index scans. Right now it's very hard to figure out
if a query is slow because of the index lookups or finding the tuples in the
table.


> 0001: Separate node instrumentation from other use of Instrumentation struct
> 
>     Previously different places (e.g. query "total time") were repurposing
> the per-node Instrumentation struct. Instead, simplify the Instrumentation
> struct to only track time, WAL/buffer usage, and tuple counts. Similarly,
> drop the use of InstrEndLoop outside of per-node instrumentation. Introduce
> the NodeInstrumentation struct to carry forward the per-node
> instrumentation information.

It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).


> 0003: Introduce stack for tracking per-node WAL/buffer usage


> From 4375fcb4141f18d6cd927659970518553aa3fe94 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sun, 31 Aug 2025 16:37:05 -0700
> Subject: [PATCH v1 3/3] Introduce stack for tracking per-node WAL/buffer usage


> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
> index b83ced9a57a..1c2268bc608 100644
> --- a/src/backend/executor/execMain.c
> +++ b/src/backend/executor/execMain.c
> @@ -312,6 +312,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
>  	DestReceiver *dest;
>  	bool		sendTuples;
>  	MemoryContext oldcontext;
> +	InstrStackResource *res;
>  
>  	/* sanity checks */
>  	Assert(queryDesc != NULL);
> @@ -333,6 +334,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
>  	if (queryDesc->totaltime)
>  		InstrStart(queryDesc->totaltime);
>  
> +	/* Start up per-query node level instrumentation */
> +	res = InstrStartQuery();
> +
>  	/*
>  	 * extract information from the query descriptor and the query feature.
>  	 */
> @@ -382,6 +386,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
>  	if (sendTuples)
>  		dest->rShutdown(dest);
>  
> +	/* Shut down per-query node level instrumentation */
> +	InstrShutdownQuery(res);
> +
>  	if (queryDesc->totaltime)
>  		InstrStop(queryDesc->totaltime, estate->es_processed);


Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.


> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d286471254b..7436f307994 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
>  
>  	/* Stop the node if we started it above, reporting 0 tuples. */
>  	if (node->instrument && node->instrument->running)
> +	{
>  		InstrStopNode(node->instrument, 0);
>  
> +		/*
> +		 * Propagate WAL/buffer stats to the parent node on the
> +		 * instrumentation stack (which is where InstrStopNode returned us
> +		 * to).
> +		 */
> +		InstrNodeAddToCurrent(&node->instrument->stack);
> +	}
> +
>  	return false;
>  }

Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:

	/*
	 * If we know we won't need to back up, we can release resources at this
	 * point.
	 */
	if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
		ExecShutdownNode(planstate);


> +static void
> +ResOwnerReleaseInstrStack(Datum res)
> +{
> +	/*
> +	 * XXX: Registered resources are *not* called in reverse order, i.e. we'll
> +	 * get what was first registered first at shutdown. To avoid handling
> +	 * that, we are resetting the stack here on abort (instead of recovering
> +	 * to previous).
> +	 */
> +	pgInstrStack = NULL;
> +}

Hm, doesn't that mean we loose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally triggers
an error and catches it?

I wonder if the solution could be to walk the stack and search for the
to-be-released element.

Greetings,

Andres Freund





view thread (42+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: Stack-based tracking of per-node WAL/buffer usage
  In-Reply-To: <h7zq426fhzt3oqkps53qxfwl3dlz3ifcwqmybuvxntlcaqwb33@m6huopxovgle>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox