Re: Stack-based tracking of per-node WAL/buffer usage

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: Stack-based tracking of per-node WAL/buffer usage
2+ messages / 2 participants
[nested] [flat]

* Re: Stack-based tracking of per-node WAL/buffer usage
@ 2026-01-09 19:38  Andres Freund <[email protected]>
  0 siblings, 1 reply; 2+ messages in thread

From: Andres Freund @ 2026-01-09 19:38 UTC (permalink / raw)
  To: Lukas Fittl <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>

Hi,


On 2025-10-31 00:18:04 -0700, Lukas Fittl wrote:
> Attached v4 patchset that addresses feedback (unless otherwise noted below)
> and is rebased on master. Other changes:
>
> [...]
> - Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that
> illustrates how the stack mechanism can be used to separate index and table
> buffer accesses in the EXPLAIN for Index Scans

Nice!


I pushed 0001. The only changes I made were to break a few long lines.


> From 324666b35d4513676783f0c352ad3a27371c08d8 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sat, 1 Mar 2025 19:31:30 -0800
> Subject: [PATCH v4 2/7] Separate node and trigger instrumentation from other
>  use of Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the Instrumentation struct initially introduced for capturing per-node
> statistics during execution. This overuse of the same struct is confusing,
> e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
> code paths, and prevents future refactorings.
>
> Instead, simplify the Instrumentation struct to only track time and
> WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
> per-node instrumentation - these calls were added without any apparent
> benefit since the relevant fields were never read.
>
> Introduce the NodeInstrumentation struct to carry forward the per-node
> instrumentation information, and introduce TriggerInstrumentation to
> capture trigger timing and firings (previously counted in "ntuples").

Hm. It looks to me that after moving trigger instrumentation to its own thing,
there are no users of InstrAlloc() with n > 1. I think it may make sense to
drop that?


> diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
> index f098a5557cf..e87810d292e 100644
> --- a/src/backend/executor/execParallel.c
> +++ b/src/backend/executor/execParallel.c
> @@ -85,7 +85,7 @@ typedef struct FixedParallelExecutorState
>   * instrument_options: Same meaning here as in instrument.c.
>   *
>   * instrument_offset: Offset, relative to the start of this structure,
> - * of the first Instrumentation object.  This will depend on the length of
> + * of the first NodeInstrumentation object.  This will depend on the length of
>   * the plan_node_id array.
>   *
>   * num_workers: Number of workers.
> @@ -102,11 +102,15 @@ struct SharedExecutorInstrumentation
>  	int			num_workers;
>  	int			num_plan_nodes;
>  	int			plan_node_id[FLEXIBLE_ARRAY_MEMBER];
> -	/* array of num_plan_nodes * num_workers Instrumentation objects follows */
> +
> +	/*
> +	 * array of num_plan_nodes * num_workers NodeInstrumentation objects
> +	 * follows
> +	 */
>  };

Spurious change, I think?



> +
> +/* Trigger instrumentation handling */
> +TriggerInstrumentation *
> +InstrAllocTrigger(int n, int instrument_options)
> +{
> +	TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
> +	bool		need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> +	bool		need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> +	bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> +	int			i;
> +
> +	for (i = 0; i < n; i++)
> +	{
> +		tginstr[i].instr.need_bufusage = need_buffers;
> +		tginstr[i].instr.need_walusage = need_wal;
> +		tginstr[i].instr.need_timer = need_timer;
> +	}
> +
> +	return tginstr;

Hm. Not that it's a huge difference, but I wonder if we ought to use
InstrInit() here and in InstrAlloc(), InstrAllocNode(), to avoid repetition?

> +/* Trigger instrumentation */
> +typedef struct TriggerInstrumentation
> +{
> +	Instrumentation instr;
> +	int			firings;		/* number of times the instrumented trigger
> +								 * was fired */
> +}			TriggerInstrumentation;
> +
> +/*
> + * Specialized instrumentation for per-node execution statistics
> + */
> +typedef struct NodeInstrumentation
>  {
>  	/* Parameters set at node creation: */
>  	bool		need_timer;		/* true if we need timer data */
> @@ -92,25 +125,34 @@ typedef struct Instrumentation
>  	double		nfiltered2;		/* # of tuples removed by "other" quals */
>  	BufferUsage bufusage;		/* total buffer usage */
>  	WalUsage	walusage;		/* total WAL usage */
> -} Instrumentation;
> +}			NodeInstrumentation;

The indentation here will look better if you add TriggerInstrumentation,
NodeInstrumentation to typedefs.list


>  typedef struct WorkerInstrumentation
>  {
>  	int			num_workers;	/* # of structures that follow */
> -	Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
> +	NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
>  } WorkerInstrumentation;

Hm. Shouldn't this be WorkerNodeInstrumentation now?


> From 74f44adc505a436a65d6069b286c8a878d4fe4af Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sun, 31 Aug 2025 16:34:42 -0700
> Subject: [PATCH v4 3/7] Replace direct changes of pgBufferUsage/pgWalUsage
>  with INSTR_* macros

Looks sane to me.

I'd move that to the head of the queue in the next revision.


> From e03c96cbd3079c03ae63b6427937b79edaa9562b Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Tue, 9 Sep 2025 02:16:59 -0700
> Subject: [PATCH v4 4/7] Optimize measuring WAL/buffer usage through
>  stack-based instrumentation
>
> Previously, in order to determine the buffer/WAL usage of a given code section,
> we utilized continuously incrementing global counters that get updated when the
> actual activity (e.g. shared block read) occurred, and then calculated a diff when
> the code section ended. This resulted in a bottleneck for executor node instrumentation
> specifically, with the function BufferUsageAccumDiff showing up in profiles and
> in some cases adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
>
> Instead, introduce a stack-based mechanism, where the actual activity writes
> into the current stack entry. In the case of executor nodes, this means that
> each node gets its own stack entry that is pushed at InstrStartNode, and popped
> at InstrEndNode. Stack entries are zero initialized (avoiding the diff mechanism)
> and get added to their parent entry when they are finalized, i.e. no more
> modifications can occur.
>
> To correctly handle abort situations, any use of instrumentation stacks must
> involve either a top-level Instrumentation struct, and its associated InstrStart/
> InstrStop helpers (which use resource owners to handle aborts), or dedicated
> PG_TRY/PG_FINALLY calls that ensure the stack is in a consistent state after
> an abort.

I think it may be good to have some tests for edge cases. E.g. testing that a
query that an explain analyze of a query that executes a plpgsql function
which in turn does another explain analyze or just another query without
explain analyze, does something reasonable.


> @@ -1089,8 +1074,13 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
>  				   PGSS_EXEC,
>  				   INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
>  				   queryDesc->estate->es_total_processed,
> -				   &queryDesc->totaltime->bufusage,
> -				   &queryDesc->totaltime->walusage,
> +
> +		/*
> +		 * Check if stack is initialized - it is not when ExecutorRun wasn't
> +		 * called
> +		 */
> +				   queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->bufusage : NULL,
> +				   queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->walusage : NULL,
>  				   queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
>  				   NULL,
>  				   queryDesc->estate->es_parallel_workers_to_launch,

Maybe it's just the diff, but this looks a bit odd, indentation wise.



> @@ -1266,7 +1280,15 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
>  		resultRelInfo->ri_TrigWhenExprs = (ExprState **)
>  			palloc0(n * sizeof(ExprState *));
>  		if (instrument_options)
> -			resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
> +		{
> +			/*
> +			 * Triggers do not individually track buffer/WAL usage, even if
> +			 * otherwise tracked
> +			 */

Why is that?


> @@ -59,15 +125,31 @@ InstrStart(Instrumentation *instr)
>  		!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
>  		elog(ERROR, "InstrStart called twice in a row");
>
> -	if (instr->need_bufusage)
> -		instr->bufusage_start = pgBufferUsage;
> +	if (instr->need_bufusage || instr->need_walusage)
> +	{
> +		Assert(CurrentResourceOwner != NULL);
> +		instr->owner = CurrentResourceOwner;
> +
> +		/*
> +		 * Allocate the stack resource in a memory context that survives
> +		 * during an abort. This will be freed by InstrStop (regular
> +		 * execution) or ResOwnerReleaseInstrumentation (abort).
> +		 *
> +		 * We don't do this in InstrAlloc to avoid leaking when InstrStart +
> +		 * InstrStop isn't called.
> +		 */
> +		if (instr->stack == NULL)
> +			instr->stack = MemoryContextAllocZero(CurTransactionContext, sizeof(InstrStack));
>
> -	if (instr->need_walusage)
> -		instr->walusage_start = pgWalUsage;
> +		ResourceOwnerEnlarge(instr->owner);
> +		ResourceOwnerRememberInstrStack(instr->owner, instr->stack);
> +
> +		InstrPushStack(instr->stack);
> +	}
>  }

I'm still worried that allocating something in CurTransactionContext will bite
us eventually. It seems like it'd be pretty sane to want to have
instrumentation around procedure execution, for example.


>  void
> -InstrStop(Instrumentation *instr)
> +InstrStop(Instrumentation *instr, bool finalize)
>  {
>  	instr_time	endtime;
>
> @@ -83,14 +165,28 @@ InstrStop(Instrumentation *instr)
>  		INSTR_TIME_SET_ZERO(instr->starttime);
>  	}
>
> -	/* Add delta of buffer usage since entry to node's totals */
> -	if (instr->need_bufusage)
> -		BufferUsageAccumDiff(&instr->bufusage,
> -							 &pgBufferUsage, &instr->bufusage_start);
> +	if (instr->need_bufusage || instr->need_walusage)
> +	{
> +		InstrPopStack(instr->stack, finalize);
>
> -	if (instr->need_walusage)
> -		WalUsageAccumDiff(&instr->walusage,
> -						  &pgWalUsage, &instr->walusage_start);
> +		Assert(instr->owner != NULL);
> +		ResourceOwnerForgetInstrStack(instr->owner, instr->stack);
> +		instr->owner = NULL;
> +
> +		if (finalize)
> +		{
> +			/*
> +			 * To avoid keeping memory allocated beyond when its needed, copy
> +			 * the result to the current memory context, and free it in the
> +			 * transaction context.
> +			 */
> +			InstrStack *stack = palloc(sizeof(InstrStack));
> +
> +			memcpy(stack, instr->stack, sizeof(InstrStack));
> +			pfree(instr->stack);
> +			instr->stack = stack;
> +		}
> +	}
>  }

Hm. I'm not entirely sure I understand the gain due to the palloc & copy &
pfree. Won't that typically increase memory usage due to temporarily having
two versions of the InstrStack?


>  /* Trigger instrumentation handling */
> @@ -98,15 +194,20 @@ TriggerInstrumentation *
>  InstrAllocTrigger(int n, int instrument_options)
>  {
>  	TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
> -	bool		need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> -	bool		need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
>  	bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
>  	int			i;
>
> +	/*
> +	 * To avoid having to determine when the last trigger fired, we never
> +	 * track WAL/buffer usage for now
> +	 */
> +	Assert((instrument_options & INSTRUMENT_BUFFERS) == 0);
> +	Assert((instrument_options & INSTRUMENT_WAL) == 0);
> +
>  	for (i = 0; i < n; i++)
>  	{
> -		tginstr[i].instr.need_bufusage = need_buffers;
> -		tginstr[i].instr.need_walusage = need_wal;
> +		tginstr[i].instr.need_bufusage = false;
> +		tginstr[i].instr.need_walusage = false;
>  		tginstr[i].instr.need_timer = need_timer;
>  	}
>

That does seem problematic to me.




> @@ -400,21 +398,22 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
>  }
>
>  /* start instrumentation during parallel executor startup */
> -void
> +Instrumentation *
>  InstrStartParallelQuery(void)
>  {
> -	save_pgBufferUsage = pgBufferUsage;
> -	save_pgWalUsage = pgWalUsage;
> +	Instrumentation *instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
> +
> +	InstrStart(instr);
> +	return instr;
>  }

>  /* report usage after parallel executor shutdown */
>  void
> -InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> +InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage)
>  {
> -	memset(bufusage, 0, sizeof(BufferUsage));
> -	BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
> -	memset(walusage, 0, sizeof(WalUsage));
> -	WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
> +	InstrStop(instr, true);
> +	memcpy(bufusage, &instr->stack->bufusage, sizeof(BufferUsage));
> +	memcpy(walusage, &instr->stack->walusage, sizeof(WalUsage));
>  }


Wonder if we should slap a pg_nodiscard on InstrStartParallelQuery() to make
it easier for extensions to notice this? OTOH, InstrEndParallelQuery() will
make that pretty clear...



>  		if (verbose || params.log_vacuum_min_duration == 0 ||
>  			TimestampDifferenceExceeds(starttime, endtime,
>  									   params.log_vacuum_min_duration))
>  		{
>  			long		secs_dur;
>  			int			usecs_dur;
> -			WalUsage	walusage;
> -			BufferUsage bufferusage;
>  			StringInfoData buf;
>  			char	   *msgfmt;
>  			int32		diff;
> @@ -975,19 +976,17 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
>  			int64		total_blks_hit;
>  			int64		total_blks_read;
>  			int64		total_blks_dirtied;
> +			BufferUsage bufusage = instr->stack->bufusage;
> +			WalUsage	walusage = instr->stack->walusage;

>  			TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
> -			memset(&walusage, 0, sizeof(WalUsage));
> -			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
> -			memset(&bufferusage, 0, sizeof(BufferUsage));
> -			BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
> -
> -			total_blks_hit = bufferusage.shared_blks_hit +
> -				bufferusage.local_blks_hit;
> -			total_blks_read = bufferusage.shared_blks_read +
> -				bufferusage.local_blks_read;
> -			total_blks_dirtied = bufferusage.shared_blks_dirtied +
> -				bufferusage.local_blks_dirtied;
> +
> +			total_blks_hit = bufusage.shared_blks_hit +
> +				bufusage.local_blks_hit;
> +			total_blks_read = bufusage.shared_blks_read +
> +				bufusage.local_blks_read;
> +			total_blks_dirtied = bufusage.shared_blks_dirtied +
> +				bufusage.local_blks_dirtied;

Why rename and move the declaration of bufferusage here? That seems to make
the diff unnecessarily noisy?

Kinda going the other way: Why copy this on the stack, rather than use a
pointer?

> diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
> index 34b6410d6a2..77b4c59e71c 100644
> --- a/src/backend/commands/prepare.c
> +++ b/src/backend/commands/prepare.c
> @@ -578,14 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
>  	ListCell   *p;
>  	ParamListInfo paramLI = NULL;
>  	EState	   *estate = NULL;
> -	instr_time	planstart;
> -	instr_time	planduration;
> -	BufferUsage bufusage_start,
> -				bufusage;
> +	Instrumentation *instr = NULL;
>  	MemoryContextCounters mem_counters;
>  	MemoryContext planner_ctx = NULL;
>  	MemoryContext saved_ctx = NULL;
>
> +	if (es->buffers)
> +		instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
> +	else
> +		instr = InstrAlloc(1, INSTRUMENT_TIMER);

I'd not duplicate the call to InstrAlloc().



> From ccea6e453872a0ae63351b3ba4360845035ec621 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Thu, 30 Oct 2025 22:27:30 -0700
> Subject: [PATCH v4 7/7] Index scans: Split heap and index buffer access
>  reporting in EXPLAIN
>
> This makes it clear whether activity was on the index directly, or
> on the table based on heap fetches.

Yay^2.


> ---
>  src/backend/commands/explain.c       | 56 ++++++++++++++++------------
>  src/backend/executor/execProcnode.c  | 13 +++++++
>  src/backend/executor/instrument.c    | 25 +++++++++++++
>  src/backend/executor/nodeIndexscan.c | 15 +++++++-
>  src/include/access/genam.h           |  3 ++
>  src/include/executor/instrument.h    |  3 ++
>  6 files changed, 91 insertions(+), 24 deletions(-)
>
> diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
> index dd3bf615581..fb96dd5248c 100644

> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d00cf820a27..f19af428d97 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -854,7 +854,20 @@ ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
>  	planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);
>
>  	if (node->instrument && node->instrument->stack.previous)
> +	{

Perhaps we should instead (probably in the earlier patch) just make the
function return early if we don't have instrumentation / a previous node? We
might gain a few more special cases, and it's nice if they're less indented.


> +		/*
> +		 * Index Scan nodes account for heap buffer usage separately, so we
> +		 * need to explitly add here
> +		 */

s/heap/table/


Greetings,

Andres Freund






^ permalink  raw  reply  [nested|flat] 2+ messages in thread

* Re: Stack-based tracking of per-node WAL/buffer usage
@ 2026-01-15 22:05  Peter Smith <[email protected]>
  parent: Andres Freund <[email protected]>
  0 siblings, 0 replies; 2+ messages in thread

From: Peter Smith @ 2026-01-15 22:05 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Lukas Fittl <[email protected]>; PostgreSQL Hackers <[email protected]>

On Sat, Jan 10, 2026 at 6:38 AM Andres Freund <[email protected]> wrote:
>
> Hi,
>
>
> On 2025-10-31 00:18:04 -0700, Lukas Fittl wrote:
> > Attached v4 patchset that addresses feedback (unless otherwise noted below)
> > and is rebased on master. Other changes:
> >
> > [...]
> > - Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that
> > illustrates how the stack mechanism can be used to separate index and table
> > buffer accesses in the EXPLAIN for Index Scans
>
> Nice!
>
>
> I pushed 0001. The only changes I made were to break a few long lines.
>
>

I happened to be reading the code in this recent push [1] and saw this
new macro:

+#define INSTR_TIME_LT(x,y) \
+ ((x).ticks > (y).ticks)

Is that macro name OK? It seemed backwards to me. Shouldn't it be
called INSTR_TIME_GT because it is checking that x is "Greater Than"
y?

======
[1] https://github.com/postgres/postgres/commit/e5a5e0a90750d665cab417322b9f85c806430d85#diff-d927962e25...

Kind Regards.
Peter Smith.
Fujitsu Australia






^ permalink  raw  reply  [nested|flat] 2+ messages in thread

end of thread, other threads:[~2026-01-15 22:05 UTC | newest]

Thread overview: 2+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-01-09 19:38 Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2026-01-15 22:05 ` Peter Smith <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox