public inbox for [email protected]  
help / color / mirror / Atom feed
From: Lukas Fittl <[email protected]>
To: Andres Freund <[email protected]>
Cc: PostgreSQL Hackers <[email protected]>
Subject: Re: Stack-based tracking of per-node WAL/buffer usage
Date: Tue, 9 Sep 2025 12:35:43 -0700
Message-ID: <CAP53Pkw1RpyAVsvJwUOMgGnVoQUibMk+EB+w-8WqY8kzsC4WjA@mail.gmail.com> (raw)
In-Reply-To: <h7zq426fhzt3oqkps53qxfwl3dlz3ifcwqmybuvxntlcaqwb33@m6huopxovgle>
References: <CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com>
	<h7zq426fhzt3oqkps53qxfwl3dlz3ifcwqmybuvxntlcaqwb33@m6huopxovgle>

Hi Andres,

Thanks for the review!

Attached an updated patch set that addresses the feedback, and also adds
the complete removal of the global pgBufferUsage variable in later patches
(0005-0007), to avoid counting both the stack and the variable.

FWIW, pgWalUsage would also be nice to remove, but it has some interesting
interactions with the cumulative statistics system
and heap_page_prune_and_freeze, and seems less performance critical.

On Thu, Sep 4, 2025 at 1:23 PM Andres Freund <[email protected]> wrote:

> > 0001: Separate node instrumentation from other use of Instrumentation
> struct
> ...
> It's mildly odd that the two types of instrumentation have a different
> definition of 'total' (one double, one instr_time).
>

Yeah, agreed. I added in a new 0001 patch that changes this to instr_time
consistently. I don't see a good reason to keep the transformation to
seconds in the Instrumentation logic, since all in-tree callers convert it
to milliseconds anyway.


> > 0003: Introduce stack for tracking per-node WAL/buffer usage
>
> Why are we doing Instr{Start,Stop}Query when not using instrumentation?
> Resowner stuff ain't free, so I'd skip them when not necessary.
>

Makes sense, I've adjusted that to be conditional (in the now renamed 0004
patch).

In the updated patch I've also decided to piggyback on QueryDesc totaltime
as the "owning" Instrumentation here for the query's lifetime. It seems
simpler that way and avoids having special purpose methods. To go along
with that I've changed the general purpose Instrumentation struct to use
stack-based instrumentation at the same time.

> diff --git a/src/backend/executor/execProcnode.c
> b/src/backend/executor/execProcnode.c
> > index d286471254b..7436f307994 100644
> > --- a/src/backend/executor/execProcnode.c
> > +++ b/src/backend/executor/execProcnode.c
> > @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void
> *context)
> ...
> > +             InstrNodeAddToCurrent(&node->instrument->stack);
> ...
>
> Can we rely on this being reached? Note that ExecutePlan() calls
> ExecShutdownNode() conditionally:


You are of course correct, I didn't consider cursors correctly here.

It seems there isn't an existing executor node walk that could be
repurposed, so I added a new one in ExecutorFinish
("ExecAccumNodeInstrumentation"). From my read of the code there are no use
cases where we need aggregated instrumentation data before ExecutorFinish.

> +static void
> > +ResOwnerReleaseInstrStack(Datum res)
> > +{
> > +     /*
> > +      * XXX: Registered resources are *not* called in reverse order,
> i.e. we'll
> > +      * get what was first registered first at shutdown. To avoid
> handling
> > +      * that, we are resetting the stack here on abort (instead of
> recovering
> > +      * to previous).
> > +      */
> > +     pgInstrStack = NULL;
> > +}
>
> Hm, doesn't that mean we loose track of instrumentation if you e.g. do an
> EXPLAIN ANALYZE of a query that executes a function, which internally
> triggers
> an error and catches it?
>
> I wonder if the solution could be to walk the stack and search for the
> to-be-released element.
>

Yes, good point, I did not consider that case. I've addressed this by only
updating the current stack if its not already a parent of the element being
released. We are also always adding the element's statistics to the
(updated) active stack element at abort.

Thanks,
Lukas

-- 
Lukas Fittl


Attachments:

  [application/octet-stream] v2-0006-Introduce-alternate-Instrumentation-stack-mechani.patch (4.8K, 3-v2-0006-Introduce-alternate-Instrumentation-stack-mechani.patch)
  download | inline diff:
From 071887cb7979b6537ca91834ccfe8dbc92bc34d2 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:26:02 -0700
Subject: [PATCH v2 6/7] Introduce alternate Instrumentation stack mechanism
 relying on PG_FINALLY

The resource owner-based Instrumentation stack cannot handle wrapping
certain utility commands that close and re-open the top-level transaction,
like the CLUSTER command. This is a problem for pg_stat_statements tracking
of utility commands specifically. To support tracking such activity, allow
issuing explicit InstrPushStack/InstrPopStack commands to modify the stack,
with the InstrPopStack in a PG_FINALLY to ensure cleanup on abort.
---
 .../pg_stat_statements/pg_stat_statements.c   | 50 +++++--------------
 src/include/executor/instrument.h             |  3 ++
 2 files changed, 15 insertions(+), 38 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index deb6d43a47f..2d21dfbfcfb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -909,21 +909,13 @@ pgss_planner(Query *parse,
 	{
 		instr_time	start;
 		instr_time	duration;
-		BufferUsage bufusage_start,
-					bufusage;
-		WalUsage	walusage_start,
-					walusage;
+		InstrStack *stack;
 
-		/* We need to track buffer usage as the planner can access them. */
-		bufusage_start = pgBufferUsage;
-
-		/*
-		 * Similarly the planner could write some WAL records in some cases
-		 * (e.g. setting a hint bit with those being WAL-logged)
-		 */
-		walusage_start = pgWalUsage;
 		INSTR_TIME_SET_CURRENT(start);
 
+		/* We need to track buffer/WAL usage as the planner can access them. */
+		stack = InstrPushStack();
+
 		nesting_level++;
 		PG_TRY();
 		{
@@ -936,6 +928,7 @@ pgss_planner(Query *parse,
 		}
 		PG_FINALLY();
 		{
+			InstrPopStack(stack);
 			nesting_level--;
 		}
 		PG_END_TRY();
@@ -943,14 +936,6 @@ pgss_planner(Query *parse,
 		INSTR_TIME_SET_CURRENT(duration);
 		INSTR_TIME_SUBTRACT(duration, start);
 
-		/* calc differences of buffer counters. */
-		memset(&bufusage, 0, sizeof(BufferUsage));
-		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
-		/* calc differences of WAL counters. */
-		memset(&walusage, 0, sizeof(WalUsage));
-		WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
 		pgss_store(query_string,
 				   parse->queryId,
 				   parse->stmt_location,
@@ -958,8 +943,8 @@ pgss_planner(Query *parse,
 				   PGSS_PLAN,
 				   INSTR_TIME_GET_MILLISEC(duration),
 				   0,
-				   &bufusage,
-				   &walusage,
+				   &stack->bufusage,
+				   &stack->walusage,
 				   NULL,
 				   NULL,
 				   0,
@@ -1155,14 +1140,10 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 		instr_time	start;
 		instr_time	duration;
 		uint64		rows;
-		BufferUsage bufusage_start,
-					bufusage;
-		WalUsage	walusage_start,
-					walusage;
+		InstrStack *stack;
 
-		bufusage_start = pgBufferUsage;
-		walusage_start = pgWalUsage;
 		INSTR_TIME_SET_CURRENT(start);
+		stack = InstrPushStack();
 
 		nesting_level++;
 		PG_TRY();
@@ -1178,6 +1159,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 		}
 		PG_FINALLY();
 		{
+			InstrPopStack(stack);
 			nesting_level--;
 		}
 		PG_END_TRY();
@@ -1206,14 +1188,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 					   qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
 			qc->nprocessed : 0;
 
-		/* calc differences of buffer counters. */
-		memset(&bufusage, 0, sizeof(BufferUsage));
-		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
-		/* calc differences of WAL counters. */
-		memset(&walusage, 0, sizeof(WalUsage));
-		WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
 		pgss_store(queryString,
 				   saved_queryId,
 				   saved_stmt_location,
@@ -1221,8 +1195,8 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 				   PGSS_EXEC,
 				   INSTR_TIME_GET_MILLISEC(duration),
 				   rows,
-				   &bufusage,
-				   &walusage,
+				   &stack->bufusage,
+				   &stack->walusage,
 				   NULL,
 				   NULL,
 				   0,
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index bf766706580..8804ee64311 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -147,6 +147,9 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStart(Instrumentation *instr);
 extern void InstrStop(Instrumentation *instr, double nTuples, bool finalize);
 
+extern InstrStack * InstrPushStack(void);
+extern void InstrPopStack(InstrStack * res);
+
 extern NodeInstrumentation * InstrAllocNode(int n, int instrument_options,
 											bool async_mode);
 extern void InstrInitNode(NodeInstrumentation * instr, int instrument_options);
-- 
2.47.1



  [application/octet-stream] v2-0007-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch (17.7K, 4-v2-0007-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch)
  download | inline diff:
From 5626b4a35f90826e7835a4f047b99fd41648e724 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:26:56 -0700
Subject: [PATCH v2 7/7] Convert remaining users of pgBufferUsage to use
 InstrStart/InstrStop, drop the global

---
 src/backend/access/heap/vacuumlazy.c | 35 ++++++++---------
 src/backend/commands/analyze.c       | 35 ++++++++---------
 src/backend/commands/explain.c       | 26 +++++--------
 src/backend/commands/explain_dr.c    | 31 ++++++++-------
 src/backend/commands/prepare.c       | 26 +++++--------
 src/backend/executor/instrument.c    | 56 +++++++++++-----------------
 src/include/executor/instrument.h    |  8 +---
 7 files changed, 90 insertions(+), 127 deletions(-)

diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 981d9380a92..b7978b385f3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -628,8 +628,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
 	TimestampTz starttime = 0;
 	PgStat_Counter startreadtime = 0,
 				startwritetime = 0;
-	WalUsage	startwalusage = pgWalUsage;
-	BufferUsage startbufferusage = pgBufferUsage;
+	Instrumentation *instr = NULL;
 	ErrorContextCallback errcallback;
 	char	  **indnames = NULL;
 
@@ -644,6 +643,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
 			startreadtime = pgStatBlockReadTime;
 			startwritetime = pgStatBlockWriteTime;
 		}
+		instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+		InstrStart(instr);
 	}
 
 	/* Used for instrumentation and stats report */
@@ -946,14 +947,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
 
+		InstrStop(instr, 0, true);
+
 		if (verbose || params.log_min_duration == 0 ||
 			TimestampDifferenceExceeds(starttime, endtime,
 									   params.log_min_duration))
 		{
 			long		secs_dur;
 			int			usecs_dur;
-			WalUsage	walusage;
-			BufferUsage bufferusage;
 			StringInfoData buf;
 			char	   *msgfmt;
 			int32		diff;
@@ -964,17 +965,13 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
 			int64		total_blks_dirtied;
 
 			TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
-			memset(&walusage, 0, sizeof(WalUsage));
-			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-			memset(&bufferusage, 0, sizeof(BufferUsage));
-			BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
-
-			total_blks_hit = bufferusage.shared_blks_hit +
-				bufferusage.local_blks_hit;
-			total_blks_read = bufferusage.shared_blks_read +
-				bufferusage.local_blks_read;
-			total_blks_dirtied = bufferusage.shared_blks_dirtied +
-				bufferusage.local_blks_dirtied;
+
+			total_blks_hit = INSTR_GET_BUFUSAGE(instr).shared_blks_hit +
+				INSTR_GET_BUFUSAGE(instr).local_blks_hit;
+			total_blks_read = INSTR_GET_BUFUSAGE(instr).shared_blks_read +
+				INSTR_GET_BUFUSAGE(instr).local_blks_read;
+			total_blks_dirtied = INSTR_GET_BUFUSAGE(instr).shared_blks_dirtied +
+				INSTR_GET_BUFUSAGE(instr).local_blks_dirtied;
 
 			initStringInfo(&buf);
 			if (verbose)
@@ -1136,10 +1133,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
 							 total_blks_dirtied);
 			appendStringInfo(&buf,
 							 _("WAL usage: %" PRId64 " records, %" PRId64 " full page images, %" PRIu64 " bytes, %" PRId64 " buffers full\n"),
-							 walusage.wal_records,
-							 walusage.wal_fpi,
-							 walusage.wal_bytes,
-							 walusage.wal_buffers_full);
+							 INSTR_GET_WALUSAGE(instr).wal_records,
+							 INSTR_GET_WALUSAGE(instr).wal_fpi,
+							 INSTR_GET_WALUSAGE(instr).wal_bytes,
+							 INSTR_GET_WALUSAGE(instr).wal_buffers_full);
 			appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
 
 			ereport(verbose ? INFO : LOG,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8ea2913d906..a407d85f864 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -302,9 +302,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
-	WalUsage	startwalusage = pgWalUsage;
-	BufferUsage startbufferusage = pgBufferUsage;
-	BufferUsage bufferusage;
+	Instrumentation *instr = NULL;
 	PgStat_Counter startreadtime = 0;
 	PgStat_Counter startwritetime = 0;
 
@@ -355,6 +353,9 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 		}
 
 		pg_rusage_init(&ru0);
+
+		instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+		InstrStart(instr);
 	}
 
 	/* Used for instrumentation and stats report */
@@ -735,12 +736,13 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
 
+		InstrStop(instr, 0, true);
+
 		if (verbose || params.log_min_duration == 0 ||
 			TimestampDifferenceExceeds(starttime, endtime,
 									   params.log_min_duration))
 		{
 			long		delay_in_ms;
-			WalUsage	walusage;
 			double		read_rate = 0;
 			double		write_rate = 0;
 			char	   *msgfmt;
@@ -749,17 +751,12 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			int64		total_blks_read;
 			int64		total_blks_dirtied;
 
-			memset(&bufferusage, 0, sizeof(BufferUsage));
-			BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
-			memset(&walusage, 0, sizeof(WalUsage));
-			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
-			total_blks_hit = bufferusage.shared_blks_hit +
-				bufferusage.local_blks_hit;
-			total_blks_read = bufferusage.shared_blks_read +
-				bufferusage.local_blks_read;
-			total_blks_dirtied = bufferusage.shared_blks_dirtied +
-				bufferusage.local_blks_dirtied;
+			total_blks_hit = INSTR_GET_BUFUSAGE(instr).shared_blks_hit +
+				INSTR_GET_BUFUSAGE(instr).local_blks_hit;
+			total_blks_read = INSTR_GET_BUFUSAGE(instr).shared_blks_read +
+				INSTR_GET_BUFUSAGE(instr).local_blks_read;
+			total_blks_dirtied = INSTR_GET_BUFUSAGE(instr).shared_blks_dirtied +
+				INSTR_GET_BUFUSAGE(instr).local_blks_dirtied;
 
 			/*
 			 * We do not expect an analyze to take > 25 days and it simplifies
@@ -832,10 +829,10 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 							 total_blks_dirtied);
 			appendStringInfo(&buf,
 							 _("WAL usage: %" PRId64 " records, %" PRId64 " full page images, %" PRIu64 " bytes, %" PRId64 " buffers full\n"),
-							 walusage.wal_records,
-							 walusage.wal_fpi,
-							 walusage.wal_bytes,
-							 walusage.wal_buffers_full);
+							 INSTR_GET_WALUSAGE(instr).wal_records,
+							 INSTR_GET_WALUSAGE(instr).wal_fpi,
+							 INSTR_GET_WALUSAGE(instr).wal_bytes,
+							 INSTR_GET_WALUSAGE(instr).wal_buffers_full);
 			appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
 
 			ereport(verbose ? INFO : LOG,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1c4d0e14334..aabb5e1dd08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -321,14 +321,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
 						 QueryEnvironment *queryEnv)
 {
 	PlannedStmt *plan;
-	instr_time	planstart,
-				planduration;
-	BufferUsage bufusage_start,
-				bufusage;
+	Instrumentation *instr = NULL;
 	MemoryContextCounters mem_counters;
 	MemoryContext planner_ctx = NULL;
 	MemoryContext saved_ctx = NULL;
 
+	if (es->buffers)
+		instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+	else
+		instr = InstrAlloc(1, INSTRUMENT_TIMER);
+
 	if (es->memory)
 	{
 		/*
@@ -345,15 +347,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
 		saved_ctx = MemoryContextSwitchTo(planner_ctx);
 	}
 
-	if (es->buffers)
-		bufusage_start = pgBufferUsage;
-	INSTR_TIME_SET_CURRENT(planstart);
+	InstrStart(instr);
 
 	/* plan the query */
 	plan = pg_plan_query(query, queryString, cursorOptions, params);
 
-	INSTR_TIME_SET_CURRENT(planduration);
-	INSTR_TIME_SUBTRACT(planduration, planstart);
+	InstrStop(instr, 0, true);
 
 	if (es->memory)
 	{
@@ -361,16 +360,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
 		MemoryContextMemConsumed(planner_ctx, &mem_counters);
 	}
 
-	/* calc differences of buffer counters. */
-	if (es->buffers)
-	{
-		memset(&bufusage, 0, sizeof(BufferUsage));
-		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-	}
-
 	/* run it (if needed) and produce output */
 	ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
-				   &planduration, (es->buffers ? &bufusage : NULL),
+				   &instr->total, (es->buffers ? &INSTR_GET_BUFUSAGE(instr) : NULL),
 				   es->memory ? &mem_counters : NULL);
 }
 
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 5715546cf43..51f498aa1b4 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -109,15 +109,20 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
 	MemoryContext oldcontext;
 	StringInfo	buf = &myState->buf;
 	int			natts = typeinfo->natts;
-	instr_time	start,
-				end;
-	BufferUsage instr_start;
+	Instrumentation *instr = NULL;
 
 	/* only measure time, buffers if requested */
-	if (myState->es->timing)
-		INSTR_TIME_SET_CURRENT(start);
-	if (myState->es->buffers)
-		instr_start = pgBufferUsage;
+	if (myState->es->timing || myState->es->buffers)
+	{
+		InstrumentOption instrument_options = 0;
+
+		if (myState->es->timing)
+			instrument_options |= INSTRUMENT_TIMER;
+		if (myState->es->buffers)
+			instrument_options |= INSTRUMENT_BUFFERS;
+		instr = InstrAlloc(1, instrument_options);
+		InstrStart(instr);
+	}
 
 	/* Set or update my derived attribute info, if needed */
 	if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -185,18 +190,16 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
 
+	if (myState->es->timing || myState->es->buffers)
+		InstrStop(instr, 0, true);
+
 	/* Update timing data */
 	if (myState->es->timing)
-	{
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
-	}
+		INSTR_TIME_ADD(myState->metrics.timeSpent, instr->total);
 
 	/* Update buffer metrics */
 	if (myState->es->buffers)
-		BufferUsageAccumDiff(&myState->metrics.bufferUsage,
-							 &pgBufferUsage,
-							 &instr_start);
+		BufferUsageAdd(&myState->metrics.bufferUsage, &INSTR_GET_BUFUSAGE(instr));
 
 	return true;
 }
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 34b6410d6a2..d92aeb6a1df 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -578,14 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	ListCell   *p;
 	ParamListInfo paramLI = NULL;
 	EState	   *estate = NULL;
-	instr_time	planstart;
-	instr_time	planduration;
-	BufferUsage bufusage_start,
-				bufusage;
+	Instrumentation *instr = NULL;
 	MemoryContextCounters mem_counters;
 	MemoryContext planner_ctx = NULL;
 	MemoryContext saved_ctx = NULL;
 
+	if (es->buffers)
+		instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+	else
+		instr = InstrAlloc(1, INSTRUMENT_TIMER);
+
 	if (es->memory)
 	{
 		/* See ExplainOneQuery about this */
@@ -596,9 +598,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 		saved_ctx = MemoryContextSwitchTo(planner_ctx);
 	}
 
-	if (es->buffers)
-		bufusage_start = pgBufferUsage;
-	INSTR_TIME_SET_CURRENT(planstart);
+	InstrStart(instr);
 
 	/* Look it up in the hash table */
 	entry = FetchPreparedStatement(execstmt->name, true);
@@ -633,8 +633,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	cplan = GetCachedPlan(entry->plansource, paramLI,
 						  CurrentResourceOwner, pstate->p_queryEnv);
 
-	INSTR_TIME_SET_CURRENT(planduration);
-	INSTR_TIME_SUBTRACT(planduration, planstart);
+	InstrStop(instr, 0, true);
 
 	if (es->memory)
 	{
@@ -642,13 +641,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 		MemoryContextMemConsumed(planner_ctx, &mem_counters);
 	}
 
-	/* calc differences of buffer counters. */
-	if (es->buffers)
-	{
-		memset(&bufusage, 0, sizeof(BufferUsage));
-		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-	}
-
 	plan_list = cplan->stmt_list;
 
 	/* Explain each query */
@@ -658,7 +650,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 
 		if (pstmt->commandType != CMD_UTILITY)
 			ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
-						   &planduration, (es->buffers ? &bufusage : NULL),
+						   &instr->total, (es->buffers ? &INSTR_GET_BUFUSAGE(instr) : NULL),
 						   es->memory ? &mem_counters : NULL);
 		else
 			ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index d5fdbecb025..d61830a7fd8 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -18,11 +18,9 @@
 #include "executor/instrument.h"
 #include "utils/memutils.h"
 
-BufferUsage pgBufferUsage;
 WalUsage	pgWalUsage;
 InstrStack *pgInstrStack = NULL;
 
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 static void WalUsageAdd(WalUsage *dst, WalUsage *add);
 
 /*
@@ -113,6 +111,27 @@ ResOwnerReleaseInstrumentation(Datum res)
 	instr->finalized = true;
 }
 
+InstrStack *
+InstrPushStack()
+{
+	InstrStack *stack = palloc0(sizeof(InstrStack));
+
+	stack->previous = pgInstrStack;
+	pgInstrStack = stack;
+
+	return stack;
+}
+
+void
+InstrPopStack(InstrStack * stack)
+{
+	Assert(stack != NULL);
+
+	pgInstrStack = stack->previous;
+	if (pgInstrStack)
+		InstrStackAdd(pgInstrStack, stack);
+}
+
 /* General purpose instrumentation handling */
 Instrumentation *
 InstrAlloc(int n, int instrument_options)
@@ -393,12 +412,11 @@ InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
 		WalUsageAdd(&dst->walusage, walusage);
 	}
 
-	BufferUsageAdd(&pgBufferUsage, bufusage);
 	WalUsageAdd(&pgWalUsage, walusage);
 }
 
 /* dst += add */
-static void
+void
 BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 {
 	dst->shared_blks_hit += add->shared_blks_hit;
@@ -419,36 +437,6 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
 }
 
-/* dst += add - sub */
-void
-BufferUsageAccumDiff(BufferUsage *dst,
-					 const BufferUsage *add,
-					 const BufferUsage *sub)
-{
-	dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
-	dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
-	dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
-	dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
-	dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
-	dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
-	dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
-	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
-	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
-	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
-	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
-						  add->shared_blk_read_time, sub->shared_blk_read_time);
-	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
-						  add->shared_blk_write_time, sub->shared_blk_write_time);
-	INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
-						  add->local_blk_read_time, sub->local_blk_read_time);
-	INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
-						  add->local_blk_write_time, sub->local_blk_write_time);
-	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
-						  add->temp_blk_read_time, sub->temp_blk_read_time);
-	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
-						  add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
 /* helper functions for WAL usage accumulation */
 static void
 WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 8804ee64311..e45c452bc79 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -139,7 +139,6 @@ typedef struct WorkerInstrumentation
 	NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
 } WorkerInstrumentation;
 
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
 extern PGDLLIMPORT WalUsage pgWalUsage;
 extern PGDLLIMPORT InstrStack * pgInstrStack;
 
@@ -162,9 +161,8 @@ extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
 extern Instrumentation *InstrStartParallelQuery(void);
 extern void InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage);
 extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
-								 const BufferUsage *add, const BufferUsage *sub);
 extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
 							  const WalUsage *sub);
 
@@ -175,22 +173,18 @@ extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
 	instr->stack.walusage
 
 #define INSTR_BUFUSAGE_INCR(fld) do { \
-		pgBufferUsage.fld++; \
 		if (pgInstrStack) \
 			pgInstrStack->bufusage.fld++; \
 	} while(0)
 #define INSTR_BUFUSAGE_ADD(fld,val) do { \
-		pgBufferUsage.fld += val; \
 		if (pgInstrStack) \
 			pgInstrStack->bufusage.fld += val; \
 	} while(0)
 #define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
-	INSTR_TIME_ADD(pgBufferUsage.fld, val); \
 	if (pgInstrStack) \
 		INSTR_TIME_ADD(pgInstrStack->bufusage.fld, val); \
 	} while (0)
 #define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
-	INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
 	if (pgInstrStack) \
 		INSTR_TIME_ACCUM_DIFF(pgInstrStack->bufusage.fld, endval, startval); \
 	} while (0)
-- 
2.47.1



  [application/octet-stream] v2-0001-Instrumentation-Keep-time-fields-as-instrtime-req.patch (7.3K, 5-v2-0001-Instrumentation-Keep-time-fields-as-instrtime-req.patch)
  download | inline diff:
From b874638131c4d22b746bd635dcb1e3034b9ef5dc Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 31 Aug 2025 16:37:05 -0700
Subject: [PATCH v2 1/7] Instrumentation: Keep time fields as instrtime,
 require caller to convert

Previously the Instrumentation logic always converted to seconds, only for many
of the callers to do unnecessary division to get to milliseconds. Since an upcoming
refactoring will split the Instrumentation struct, utilize instrtime always to
keep things simpler.
---
 contrib/auto_explain/auto_explain.c           |  2 +-
 .../pg_stat_statements/pg_stat_statements.c   |  2 +-
 src/backend/commands/explain.c                |  8 ++++----
 src/backend/executor/instrument.c             | 20 ++++++++-----------
 src/include/executor/instrument.h             |  6 +++---
 src/include/portability/instr_time.h          |  2 ++
 6 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 1f4badb4928..c10f2fc0f25 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -388,7 +388,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 		InstrEndLoop(queryDesc->totaltime);
 
 		/* Log plan if duration is exceeded. */
-		msec = queryDesc->totaltime->total * 1000.0;
+		msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
 		if (msec >= auto_explain_log_min_duration)
 		{
 			ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1cb368c8590..b9c971de1e5 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1091,7 +1091,7 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 				   queryDesc->plannedstmt->stmt_location,
 				   queryDesc->plannedstmt->stmt_len,
 				   PGSS_EXEC,
-				   queryDesc->totaltime->total * 1000.0,	/* convert to msec */
+				   INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
 				   queryDesc->estate->es_total_processed,
 				   &queryDesc->totaltime->bufusage,
 				   &queryDesc->totaltime->walusage,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8345bc0264b..95b7a9d227f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1830,8 +1830,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		planstate->instrument && planstate->instrument->nloops > 0)
 	{
 		double		nloops = planstate->instrument->nloops;
-		double		startup_ms = 1000.0 * planstate->instrument->startup / nloops;
-		double		total_ms = 1000.0 * planstate->instrument->total / nloops;
+		double		startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
+		double		total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
 		double		rows = planstate->instrument->ntuples / nloops;
 
 		if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1896,8 +1896,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 			if (nloops <= 0)
 				continue;
-			startup_ms = 1000.0 * instrument->startup / nloops;
-			total_ms = 1000.0 * instrument->total / nloops;
+			startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
+			total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
 			rows = instrument->ntuples / nloops;
 
 			ExplainOpenWorker(n, es);
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 56e635f4700..1c92abe6761 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -114,7 +114,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 	if (!instr->running)
 	{
 		instr->running = true;
-		instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
+		instr->firsttuple = instr->counter;
 	}
 	else
 	{
@@ -123,7 +123,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 		 * this might be the first tuple
 		 */
 		if (instr->async_mode && save_tuplecount < 1.0)
-			instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
+			instr->firsttuple = instr->counter;
 	}
 }
 
@@ -139,8 +139,6 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
 void
 InstrEndLoop(Instrumentation *instr)
 {
-	double		totaltime;
-
 	/* Skip if nothing has happened, or already shut down */
 	if (!instr->running)
 		return;
@@ -149,10 +147,8 @@ InstrEndLoop(Instrumentation *instr)
 		elog(ERROR, "InstrEndLoop called on running node");
 
 	/* Accumulate per-cycle statistics into totals */
-	totaltime = INSTR_TIME_GET_DOUBLE(instr->counter);
-
-	instr->startup += instr->firsttuple;
-	instr->total += totaltime;
+	INSTR_TIME_ADD(instr->startup, instr->firsttuple);
+	INSTR_TIME_ADD(instr->total, instr->counter);
 	instr->ntuples += instr->tuplecount;
 	instr->nloops += 1;
 
@@ -160,7 +156,7 @@ InstrEndLoop(Instrumentation *instr)
 	instr->running = false;
 	INSTR_TIME_SET_ZERO(instr->starttime);
 	INSTR_TIME_SET_ZERO(instr->counter);
-	instr->firsttuple = 0;
+	INSTR_TIME_SET_ZERO(instr->firsttuple);
 	instr->tuplecount = 0;
 }
 
@@ -173,14 +169,14 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 		dst->running = true;
 		dst->firsttuple = add->firsttuple;
 	}
-	else if (dst->running && add->running && dst->firsttuple > add->firsttuple)
+	else if (dst->running && add->running && INSTR_TIME_CMP_LT(dst->firsttuple, add->firsttuple))
 		dst->firsttuple = add->firsttuple;
 
 	INSTR_TIME_ADD(dst->counter, add->counter);
 
 	dst->tuplecount += add->tuplecount;
-	dst->startup += add->startup;
-	dst->total += add->total;
+	INSTR_TIME_ADD(dst->startup, add->startup);
+	INSTR_TIME_ADD(dst->total, add->total);
 	dst->ntuples += add->ntuples;
 	dst->ntuples2 += add->ntuples2;
 	dst->nloops += add->nloops;
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 03653ab6c6c..ba5c986907e 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -77,13 +77,13 @@ typedef struct Instrumentation
 	bool		running;		/* true if we've completed first tuple */
 	instr_time	starttime;		/* start time of current iteration of node */
 	instr_time	counter;		/* accumulated runtime for this node */
-	double		firsttuple;		/* time for first tuple of this cycle */
+	instr_time	firsttuple;		/* time for first tuple of this cycle */
 	double		tuplecount;		/* # of tuples emitted so far this cycle */
 	BufferUsage bufusage_start; /* buffer usage at start */
 	WalUsage	walusage_start; /* WAL usage at start */
 	/* Accumulated statistics across all completed cycles: */
-	double		startup;		/* total startup time (in seconds) */
-	double		total;			/* total time (in seconds) */
+	instr_time	startup;		/* total startup time */
+	instr_time	total;			/* total time */
 	double		ntuples;		/* total tuples produced */
 	double		ntuples2;		/* secondary node-specific tuple counter */
 	double		nloops;			/* # of run cycles for this node */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index f71a851b18d..646934020d1 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -184,6 +184,8 @@ GetTimerFrequency(void)
 #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
 	((x).ticks += (y).ticks - (z).ticks)
 
+#define INSTR_TIME_CMP_LT(x,y) \
+	((x).ticks > (y).ticks)
 
 #define INSTR_TIME_GET_DOUBLE(t) \
 	((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
-- 
2.47.1



  [application/octet-stream] v2-0005-Use-Instrumentation-stack-for-parallel-query-aggr.patch (9.5K, 6-v2-0005-Use-Instrumentation-stack-for-parallel-query-aggr.patch)
  download | inline diff:
From f52c067dc350299fb8bd20e3e08025f745b8e215 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:24:49 -0700
Subject: [PATCH v2 5/7] Use Instrumentation stack for parallel query
 aggregation in workers

---
 src/backend/access/brin/brin.c        |  6 ++++--
 src/backend/access/gin/gininsert.c    |  6 ++++--
 src/backend/access/nbtree/nbtsort.c   |  6 ++++--
 src/backend/commands/vacuumparallel.c |  6 ++++--
 src/backend/executor/execParallel.c   |  6 ++++--
 src/backend/executor/instrument.c     | 21 ++++++++++-----------
 src/include/executor/instrument.h     |  4 ++--
 7 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 7ff7467e462..4626093116e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2870,6 +2870,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	Relation	indexRel;
 	LOCKMODE	heapLockmode;
 	LOCKMODE	indexLockmode;
+	Instrumentation *instr;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	int			sortmem;
@@ -2919,7 +2920,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	tuplesort_attach_shared(sharedsort, seg);
 
 	/* Prepare to track buffer usage during parallel execution */
-	InstrStartParallelQuery();
+	instr = InstrStartParallelQuery();
 
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
@@ -2934,7 +2935,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
-	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+	InstrEndParallelQuery(instr,
+						  &bufferusage[ParallelWorkerNumber],
 						  &walusage[ParallelWorkerNumber]);
 
 	index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e9d4b27427e..022085ca645 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2079,6 +2079,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	Relation	indexRel;
 	LOCKMODE	heapLockmode;
 	LOCKMODE	indexLockmode;
+	Instrumentation *instr;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	int			sortmem;
@@ -2147,7 +2148,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	tuplesort_attach_shared(sharedsort, seg);
 
 	/* Prepare to track buffer usage during parallel execution */
-	InstrStartParallelQuery();
+	instr = InstrStartParallelQuery();
 
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
@@ -2162,7 +2163,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
-	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+	InstrEndParallelQuery(instr,
+						  &bufferusage[ParallelWorkerNumber],
 						  &walusage[ParallelWorkerNumber]);
 
 	index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8828a7a8f89..615fd1e03f7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1752,6 +1752,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	Relation	indexRel;
 	LOCKMODE	heapLockmode;
 	LOCKMODE	indexLockmode;
+	Instrumentation *instr;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	int			sortmem;
@@ -1827,7 +1828,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	}
 
 	/* Prepare to track buffer usage during parallel execution */
-	InstrStartParallelQuery();
+	instr = InstrStartParallelQuery();
 
 	/* Perform sorting of spool, and possibly a spool2 */
 	sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1837,7 +1838,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
-	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+	InstrEndParallelQuery(instr,
+						  &bufferusage[ParallelWorkerNumber],
 						  &walusage[ParallelWorkerNumber]);
 
 #ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..c5309a015e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -994,6 +994,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	PVIndStats *indstats;
 	PVShared   *shared;
 	TidStore   *dead_items;
+	Instrumentation *instr;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
 	int			nindexes;
@@ -1083,7 +1084,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	error_context_stack = &errcallback;
 
 	/* Prepare to track buffer usage during parallel execution */
-	InstrStartParallelQuery();
+	instr = InstrStartParallelQuery();
 
 	/* Process indexes to perform vacuum/cleanup */
 	parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,7 +1092,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	/* Report buffer/WAL usage during parallel execution */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
 	wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
-	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+	InstrEndParallelQuery(instr,
+						  &buffer_usage[ParallelWorkerNumber],
 						  &wal_usage[ParallelWorkerNumber]);
 
 	/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e87810d292e..061c6a4aa69 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1434,6 +1434,7 @@ void
 ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 {
 	FixedParallelExecutorState *fpes;
+	Instrumentation *instr;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
 	DestReceiver *receiver;
@@ -1494,7 +1495,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	 * leader, which also doesn't count buffer accesses and WAL activity that
 	 * occur during executor startup.
 	 */
-	InstrStartParallelQuery();
+	instr = InstrStartParallelQuery();
 
 	/*
 	 * Run the plan.  If we specified a tuple bound, be careful not to demand
@@ -1510,7 +1511,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	/* Report buffer/WAL usage during parallel execution. */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
-	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+	InstrEndParallelQuery(instr,
+						  &buffer_usage[ParallelWorkerNumber],
 						  &wal_usage[ParallelWorkerNumber]);
 
 	/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 8ef626721f3..d5fdbecb025 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,10 +19,8 @@
 #include "utils/memutils.h"
 
 BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
 InstrStack *pgInstrStack = NULL;
-static WalUsage save_pgWalUsage;
 
 static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -364,22 +362,23 @@ InstrStackAdd(InstrStack * dst, InstrStack * add)
 	WalUsageAdd(&dst->walusage, &add->walusage);
 }
 
-/* note current values during parallel executor startup */
-void
+/* start instrumentation during parallel executor startup */
+Instrumentation *
 InstrStartParallelQuery(void)
 {
-	save_pgBufferUsage = pgBufferUsage;
-	save_pgWalUsage = pgWalUsage;
+	Instrumentation *instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+	InstrStart(instr);
+	return instr;
 }
 
 /* report usage after parallel executor shutdown */
 void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage)
 {
-	memset(bufusage, 0, sizeof(BufferUsage));
-	BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
-	memset(walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
+	InstrStop(instr, 0, true);
+	memcpy(bufusage, &INSTR_GET_BUFUSAGE(instr), sizeof(BufferUsage));
+	memcpy(walusage, &INSTR_GET_WALUSAGE(instr), sizeof(WalUsage));
 }
 
 /* accumulate work done by workers in leader's stats */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d04607ce40c..bf766706580 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -156,8 +156,8 @@ extern void InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples);
 extern void InstrEndLoop(NodeInstrumentation * instr);
 extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
 
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern Instrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage);
 extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
 extern void BufferUsageAccumDiff(BufferUsage *dst,
 								 const BufferUsage *add, const BufferUsage *sub);
-- 
2.47.1



  [application/octet-stream] v2-0004-Introduce-stack-for-tracking-per-node-WAL-buffer-.patch (20.5K, 7-v2-0004-Introduce-stack-for-tracking-per-node-WAL-buffer-.patch)
  download | inline diff:
From 39e4a7cae8edc26bfcc9ec99756ad01cd2f587b9 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v2 4/7] Introduce stack for tracking per-node WAL/buffer usage

---
 .../pg_stat_statements/pg_stat_statements.c   |   4 +-
 src/backend/commands/explain.c                |   8 +-
 src/backend/commands/trigger.c                |   4 +-
 src/backend/executor/execMain.c               |  25 ++-
 src/backend/executor/execProcnode.c           |  29 +++
 src/backend/executor/instrument.c             | 199 ++++++++++++++----
 src/include/executor/executor.h               |   1 +
 src/include/executor/instrument.h             |  53 ++++-
 8 files changed, 260 insertions(+), 63 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 4ec33fbf470..deb6d43a47f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1087,8 +1087,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 				   PGSS_EXEC,
 				   INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
 				   queryDesc->estate->es_total_processed,
-				   &queryDesc->totaltime->bufusage,
-				   &queryDesc->totaltime->walusage,
+				   &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
+				   &INSTR_GET_WALUSAGE(queryDesc->totaltime),
 				   queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
 				   NULL,
 				   queryDesc->estate->es_parallel_workers_to_launch,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d005fcdbc98..1c4d0e14334 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2280,9 +2280,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 	/* Show buffer/WAL usage */
 	if (es->buffers && planstate->instrument)
-		show_buffer_usage(es, &planstate->instrument->bufusage);
+		show_buffer_usage(es, &planstate->instrument->stack.bufusage);
 	if (es->wal && planstate->instrument)
-		show_wal_usage(es, &planstate->instrument->walusage);
+		show_wal_usage(es, &planstate->instrument->stack.walusage);
 
 	/* Prepare per-worker buffer/WAL usage */
 	if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -2299,9 +2299,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 			ExplainOpenWorker(n, es);
 			if (es->buffers)
-				show_buffer_usage(es, &instrument->bufusage);
+				show_buffer_usage(es, &instrument->stack.bufusage);
 			if (es->wal)
-				show_wal_usage(es, &instrument->walusage);
+				show_wal_usage(es, &instrument->stack.walusage);
 			ExplainCloseWorker(n, es);
 		}
 	}
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 9b53dd99e99..67a2fdd034a 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2392,7 +2392,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
 	 * one "tuple returned" (really the number of firings).
 	 */
 	if (instr)
-		InstrStop(instr + tgindx, 1);
+		InstrStop(instr + tgindx, 1, false);
 
 	return (HeapTuple) DatumGetPointer(result);
 }
@@ -4607,7 +4607,7 @@ AfterTriggerExecute(EState *estate,
 	 * one "tuple returned" (really the number of firings).
 	 */
 	if (instr)
-		InstrStop(instr + tgindx, 1);
+		InstrStop(instr + tgindx, 1, false);
 }
 
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index e459b3aa797..37f00026dc6 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -329,6 +329,13 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	 */
 	oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
 
+	/*
+	 * Start up required top-level instrumentation stack for WAL/buffer
+	 * tracking
+	 */
+	if (!queryDesc->totaltime && (estate->es_instrument & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)))
+		queryDesc->totaltime = InstrAlloc(1, estate->es_instrument);
+
 	/* Allow instrumentation of Executor overall runtime */
 	if (queryDesc->totaltime)
 		InstrStart(queryDesc->totaltime);
@@ -383,7 +390,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 		dest->rShutdown(dest);
 
 	if (queryDesc->totaltime)
-		InstrStop(queryDesc->totaltime, estate->es_processed);
+		InstrStop(queryDesc->totaltime, estate->es_processed, false);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -442,8 +449,15 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
 	if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
 		AfterTriggerEndQuery(estate);
 
+	/*
+	 * Accumulate per node statistics, and then shut down instrumentation
+	 * stack
+	 */
+	if (queryDesc->totaltime && estate->es_instrument)
+		ExecAccumNodeInstrumentation(queryDesc->planstate);
+
 	if (queryDesc->totaltime)
-		InstrStop(queryDesc->totaltime, 0);
+		InstrStop(queryDesc->totaltime, 0, true);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -1266,7 +1280,12 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigWhenExprs = (ExprState **)
 			palloc0(n * sizeof(ExprState *));
 		if (instrument_options)
-			resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
+		{
+			if ((instrument_options & INSTRUMENT_TIMER) != 0)
+				resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_TIMER);
+			else
+				resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_ROWS);
+		}
 	}
 	else
 	{
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d286471254b..1b3b39222a9 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -122,6 +122,7 @@
 static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
 static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
 static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecAccumNodeInstrumentation_walker(PlanState *node, void *context);
 
 
 /* ------------------------------------------------------------------------
@@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void *context)
 	return false;
 }
 
+/*
+ * ExecAccumNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation stack).
+ */
+void
+ExecAccumNodeInstrumentation(PlanState *node)
+{
+	(void) ExecAccumNodeInstrumentation_walker(node, NULL);
+}
+
+static bool
+ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
+{
+	if (node == NULL)
+		return false;
+
+	check_stack_depth();
+
+	planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);
+
+	if (node->instrument && node->instrument->stack.previous)
+		InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack);
+
+	return false;
+}
+
 /*
  * ExecSetTupleBound
  *
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 1fe0f4204e5..8ef626721f3 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,56 +16,150 @@
 #include <unistd.h>
 
 #include "executor/instrument.h"
+#include "utils/memutils.h"
 
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
+InstrStack *pgInstrStack = NULL;
 static WalUsage save_pgWalUsage;
 
 static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 static void WalUsageAdd(WalUsage *dst, WalUsage *add);
 
+/*
+ * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+	.name = "instrumentation",
+	.release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
+	.release_priority = RELEASE_PRIO_FIRST,
+	.ReleaseResource = ResOwnerReleaseInstrumentation,
+	.DebugPrint = NULL,			/* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrStack(ResourceOwner owner, Instrumentation *instr)
+{
+	ResourceOwnerRemember(owner, PointerGetDatum(instr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrStack(ResourceOwner owner, Instrumentation *instr)
+{
+	ResourceOwnerForget(owner, PointerGetDatum(instr), &instrumentation_resowner_desc);
+}
+
+static void
+InstrPushStackResource(Instrumentation *res)
+{
+	ResourceOwner owner = CurrentResourceOwner;
+
+	Assert(owner != NULL);
+
+	res->owner = owner;
+
+	ResourceOwnerEnlarge(owner);
+	ResourceOwnerRememberInstrStack(owner, res);
+
+	res->stack.previous = pgInstrStack;
+	pgInstrStack = &res->stack;
+}
+
+static void
+InstrPopStackResource(Instrumentation *res)
+{
+	Assert(res != NULL);
+	Assert(res->owner != NULL);
+
+	pgInstrStack = res->stack.previous;
+
+	ResourceOwnerForgetInstrStack(res->owner, res);
+}
+
+static bool
+StackIsParent(InstrStack * stack, InstrStack * entry)
+{
+	if (entry->previous == NULL)
+		return false;
+
+	if (entry->previous == stack)
+		return true;
+
+	return StackIsParent(stack, entry->previous);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+	Instrumentation *instr = (Instrumentation *) DatumGetPointer(res);
+
+	/*
+	 * Because registered resources are *not* called in reverse order, we'll
+	 * get what was first registered first at shutdown. Thus, on any later
+	 * resources we need to not change the stack, which was already set to the
+	 * correct previous entry.
+	 */
+	if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
+		pgInstrStack = instr->stack.previous;
+
+	/*
+	 * Always accumulate all collected stats before the abort, even if we
+	 * already walked up the stack with an earlier resource.
+	 */
+	if (pgInstrStack)
+		InstrStackAdd(pgInstrStack, &instr->stack);
+
+	instr->finalized = true;
+}
 
 /* General purpose instrumentation handling */
 Instrumentation *
 InstrAlloc(int n, int instrument_options)
 {
-	Instrumentation *instr;
+	Instrumentation *instr = NULL;
+	bool		need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+	bool		need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
+	bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+	int			i;
+
+	/*
+	 * If resource owner will be used, we must allocate in the transaction
+	 * context (not the calling context, usually a lower context), because the
+	 * memory might otherwise be freed too early in an abort situation.
+	 */
+	if (need_buffers || need_wal)
+		instr = MemoryContextAllocZero(CurTransactionContext, n * sizeof(Instrumentation));
+	else
+		instr = palloc0(n * sizeof(Instrumentation));
 
-	/* initialize all fields to zeroes, then modify as needed */
-	instr = palloc0(n * sizeof(Instrumentation));
-	if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
+	for (i = 0; i < n; i++)
 	{
-		bool		need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
-		bool		need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
-		bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
-		int			i;
-
-		for (i = 0; i < n; i++)
-		{
-			instr[i].need_bufusage = need_buffers;
-			instr[i].need_walusage = need_wal;
-			instr[i].need_timer = need_timer;
-		}
+		instr[i].need_bufusage = need_buffers;
+		instr[i].need_walusage = need_wal;
+		instr[i].need_timer = need_timer;
 	}
 
 	return instr;
 }
+
 void
 InstrStart(Instrumentation *instr)
 {
+	Assert(!instr->finalized);
+
 	if (instr->need_timer &&
 		!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
 		elog(ERROR, "InstrStart called twice in a row");
 
-	if (instr->need_bufusage)
-		instr->bufusage_start = pgBufferUsage;
-
-	if (instr->need_walusage)
-		instr->walusage_start = pgWalUsage;
+	if (instr->need_bufusage || instr->need_walusage)
+		InstrPushStackResource(instr);
 }
+
 void
-InstrStop(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr, double nTuples, bool finalize)
 {
 	instr_time	endtime;
 
@@ -84,14 +178,15 @@ InstrStop(Instrumentation *instr, double nTuples)
 		INSTR_TIME_SET_ZERO(instr->starttime);
 	}
 
-	/* Add delta of buffer usage since entry to node's totals */
-	if (instr->need_bufusage)
-		BufferUsageAccumDiff(&instr->bufusage,
-							 &pgBufferUsage, &instr->bufusage_start);
+	if (instr->need_bufusage || instr->need_walusage)
+		InstrPopStackResource(instr);
 
-	if (instr->need_walusage)
-		WalUsageAccumDiff(&instr->walusage,
-						  &pgWalUsage, &instr->walusage_start);
+	if (finalize)
+	{
+		instr->finalized = true;
+		if (pgInstrStack)
+			InstrStackAdd(pgInstrStack, &instr->stack);
+	}
 }
 
 /* Allocate new node instrumentation structure(s) */
@@ -139,12 +234,14 @@ InstrStartNode(NodeInstrumentation * instr)
 		!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
 		elog(ERROR, "InstrStartNode called twice in a row");
 
-	/* save buffer usage totals at node entry, if needed */
-	if (instr->need_bufusage)
-		instr->bufusage_start = pgBufferUsage;
+	if (instr->need_bufusage || instr->need_walusage)
+	{
+		/* Ensure that we always have a parent, even at the top most node */
+		Assert(pgInstrStack != NULL);
 
-	if (instr->need_walusage)
-		instr->walusage_start = pgWalUsage;
+		instr->stack.previous = pgInstrStack;
+		pgInstrStack = &instr->stack;
+	}
 }
 
 /* Exit from a plan node */
@@ -169,14 +266,12 @@ InstrStopNode(NodeInstrumentation * instr, double nTuples)
 		INSTR_TIME_SET_ZERO(instr->starttime);
 	}
 
-	/* Add delta of buffer usage since entry to node's totals */
-	if (instr->need_bufusage)
-		BufferUsageAccumDiff(&instr->bufusage,
-							 &pgBufferUsage, &instr->bufusage_start);
-
-	if (instr->need_walusage)
-		WalUsageAccumDiff(&instr->walusage,
-						  &pgWalUsage, &instr->walusage_start);
+	if (instr->need_bufusage || instr->need_walusage)
+	{
+		/* Ensure that we always have a parent, even at the top most node */
+		Assert(instr->stack.previous != NULL);
+		pgInstrStack = instr->stack.previous;
+	}
 
 	/* Is this the first tuple of this cycle? */
 	if (!instr->running)
@@ -253,10 +348,20 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
 
 	/* Add delta of buffer usage since entry to node's totals */
 	if (dst->need_bufusage)
-		BufferUsageAdd(&dst->bufusage, &add->bufusage);
+		BufferUsageAdd(&dst->stack.bufusage, &add->stack.bufusage);
 
 	if (dst->need_walusage)
-		WalUsageAdd(&dst->walusage, &add->walusage);
+		WalUsageAdd(&dst->stack.walusage, &add->stack.walusage);
+}
+
+void
+InstrStackAdd(InstrStack * dst, InstrStack * add)
+{
+	Assert(dst != NULL);
+	Assert(add != NULL);
+
+	BufferUsageAdd(&dst->bufusage, &add->bufusage);
+	WalUsageAdd(&dst->walusage, &add->walusage);
 }
 
 /* note current values during parallel executor startup */
@@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
 void
 InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
 {
+	if (pgInstrStack != NULL)
+	{
+		InstrStack *dst = pgInstrStack;
+
+		BufferUsageAdd(&dst->bufusage, bufusage);
+		WalUsageAdd(&dst->walusage, walusage);
+	}
+
 	BufferUsageAdd(&pgBufferUsage, bufusage);
 	WalUsageAdd(&pgWalUsage, walusage);
 }
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 31133514e84..ba76a370d3f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -297,6 +297,7 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern void ExecShutdownNode(PlanState *node);
+extern void ExecAccumNodeInstrumentation(PlanState *node);
 extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
 
 
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 78d3653997b..d04607ce40c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
 #define INSTRUMENT_H
 
 #include "portability/instr_time.h"
+#include "utils/resowner.h"
 
 
 /*
@@ -66,11 +67,23 @@ typedef enum InstrumentOption
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
+/* Stack of WAL/buffer usage used for per-node instrumentation */
+typedef struct InstrStack
+{
+	struct InstrStack *previous;
+	BufferUsage bufusage;
+	WalUsage	walusage;
+}			InstrStack;
+
 /*
  * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
  *
  * Initialized through InstrAlloc, followed by one or more calls to a pair of
  * InstrStart/InstrStop (activity is measured inbetween).
+ *
+ * Uses resource owner mechanism for handling aborts, as such, the caller *must* not exit out of
+ * top level transaction between InstrStart/InstrStop calls in regular execution. If this is needed,
+ * directly use InstrPushStack/InstrPopStack in a PG_TRY/PG_FINALLY block instead.
  */
 typedef struct Instrumentation
 {
@@ -79,18 +92,22 @@ typedef struct Instrumentation
 	bool		need_bufusage;	/* true if we need buffer usage data */
 	bool		need_walusage;	/* true if we need WAL usage data */
 	/* Internal state keeping: */
+	bool		finalized;		/* true if no more InstrStart calls are
+								 * allowed */
 	instr_time	starttime;		/* start time of last InstrStart */
-	BufferUsage bufusage_start; /* buffer usage at start */
-	WalUsage	walusage_start; /* WAL usage at start */
 	/* Accumulated statistics: */
 	instr_time	total;			/* total runtime */
 	double		ntuples;		/* total tuples counted in InstrStop */
-	BufferUsage bufusage;		/* total buffer usage */
-	WalUsage	walusage;		/* total WAL usage */
+	InstrStack	stack;			/* stack tracking buffer/WAL usage */
+	ResourceOwner owner;
 } Instrumentation;
 
 /*
  * Specialized instrumentation for per-node execution statistics
+ *
+ * Requires use of an outer InstrStart/InstrStop to handle the stack used for WAL/buffer
+ * usage statistics, and relies on it for managing aborts. Solely intended for
+ * the executor and anyone reporting about its activities (e.g. EXPLAIN ANALYZE).
  */
 typedef struct NodeInstrumentation
 {
@@ -105,8 +122,6 @@ typedef struct NodeInstrumentation
 	instr_time	counter;		/* accumulated runtime for this node */
 	instr_time	firsttuple;		/* time for first tuple of this cycle */
 	double		tuplecount;		/* # of tuples emitted so far this cycle */
-	BufferUsage bufusage_start; /* buffer usage at start */
-	WalUsage	walusage_start; /* WAL usage at start */
 	/* Accumulated statistics across all completed cycles: */
 	instr_time	startup;		/* total startup time */
 	instr_time	total;			/* total time */
@@ -115,8 +130,7 @@ typedef struct NodeInstrumentation
 	double		nloops;			/* # of run cycles for this node */
 	double		nfiltered1;		/* # of tuples removed by scanqual or joinqual */
 	double		nfiltered2;		/* # of tuples removed by "other" quals */
-	BufferUsage bufusage;		/* total buffer usage */
-	WalUsage	walusage;		/* total WAL usage */
+	InstrStack	stack;			/* stack tracking buffer/WAL usage */
 }			NodeInstrumentation;
 
 typedef struct WorkerInstrumentation
@@ -127,10 +141,11 @@ typedef struct WorkerInstrumentation
 
 extern PGDLLIMPORT BufferUsage pgBufferUsage;
 extern PGDLLIMPORT WalUsage pgWalUsage;
+extern PGDLLIMPORT InstrStack * pgInstrStack;
 
 extern Instrumentation *InstrAlloc(int n, int instrument_options);
 extern void InstrStart(Instrumentation *instr);
-extern void InstrStop(Instrumentation *instr, double nTuples);
+extern void InstrStop(Instrumentation *instr, double nTuples, bool finalize);
 
 extern NodeInstrumentation * InstrAllocNode(int n, int instrument_options,
 											bool async_mode);
@@ -146,26 +161,46 @@ extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
 extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
 extern void BufferUsageAccumDiff(BufferUsage *dst,
 								 const BufferUsage *add, const BufferUsage *sub);
+extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
 extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
 							  const WalUsage *sub);
 
+#define INSTR_GET_BUFUSAGE(instr) \
+	instr->stack.bufusage
+
+#define INSTR_GET_WALUSAGE(instr) \
+	instr->stack.walusage
+
 #define INSTR_BUFUSAGE_INCR(fld) do { \
 		pgBufferUsage.fld++; \
+		if (pgInstrStack) \
+			pgInstrStack->bufusage.fld++; \
 	} while(0)
 #define INSTR_BUFUSAGE_ADD(fld,val) do { \
 		pgBufferUsage.fld += val; \
+		if (pgInstrStack) \
+			pgInstrStack->bufusage.fld += val; \
 	} while(0)
 #define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
 	INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+	if (pgInstrStack) \
+		INSTR_TIME_ADD(pgInstrStack->bufusage.fld, val); \
 	} while (0)
 #define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
 	INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+	if (pgInstrStack) \
+		INSTR_TIME_ACCUM_DIFF(pgInstrStack->bufusage.fld, endval, startval); \
 	} while (0)
+
 #define INSTR_WALUSAGE_INCR(fld) do { \
 		pgWalUsage.fld++; \
+		if (pgInstrStack) \
+			pgInstrStack->walusage.fld++; \
 	} while(0)
 #define INSTR_WALUSAGE_ADD(fld,val) do { \
 		pgWalUsage.fld += val; \
+		if (pgInstrStack) \
+			pgInstrStack->walusage.fld += val; \
 	} while(0)
 
 #endif							/* INSTRUMENT_H */
-- 
2.47.1



  [application/octet-stream] v2-0003-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch (9.0K, 8-v2-0003-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch)
  download | inline diff:
From 4f2bb304213b600c7f368d16547cba52641157db Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 31 Aug 2025 16:34:42 -0700
Subject: [PATCH v2 3/7] Replace direct changes of pgBufferUsage/pgWalUsage
 with INSTR_* macros

This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
---
 src/backend/access/transam/xlog.c      |  8 ++++----
 src/backend/storage/buffer/bufmgr.c    | 24 ++++++++++++------------
 src/backend/storage/buffer/localbuf.c  |  6 +++---
 src/backend/storage/file/buffile.c     |  8 ++++----
 src/backend/utils/activity/pgstat_io.c |  8 ++++----
 src/include/executor/instrument.h      | 19 +++++++++++++++++++
 6 files changed, 46 insertions(+), 27 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..61516f35676 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1078,9 +1078,9 @@ XLogInsertRecord(XLogRecData *rdata,
 	/* Report WAL traffic to the instrumentation. */
 	if (inserted)
 	{
-		pgWalUsage.wal_bytes += rechdr->xl_tot_len;
-		pgWalUsage.wal_records++;
-		pgWalUsage.wal_fpi += num_fpi;
+		INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+		INSTR_WALUSAGE_INCR(wal_records);
+		INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
 
 		/* Required for the flush of pending stats WAL data */
 		pgstat_report_fixed = true;
@@ -2060,7 +2060,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 					WriteRqst.Flush = 0;
 					XLogWrite(WriteRqst, tli, false);
 					LWLockRelease(WALWriteLock);
-					pgWalUsage.wal_buffers_full++;
+					INSTR_WALUSAGE_INCR(wal_buffers_full);
 					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 
 					/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fe470de63f2..d872d9efb93 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -705,7 +705,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 		{
 			PinLocalBuffer(bufHdr, true);
 
-			pgBufferUsage.local_blks_hit++;
+			INSTR_BUFUSAGE_INCR(local_blks_hit);
 
 			return true;
 		}
@@ -737,7 +737,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 			else
 				PinBuffer_Locked(bufHdr);	/* pin for first time */
 
-			pgBufferUsage.shared_blks_hit++;
+			INSTR_BUFUSAGE_INCR(shared_blks_hit);
 
 			return true;
 		}
@@ -1147,14 +1147,14 @@ PinBufferForBlock(Relation rel,
 	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
 		if (*foundPtr)
-			pgBufferUsage.local_blks_hit++;
+			INSTR_BUFUSAGE_INCR(local_blks_hit);
 	}
 	else
 	{
 		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
 							 strategy, foundPtr, io_context);
 		if (*foundPtr)
-			pgBufferUsage.shared_blks_hit++;
+			INSTR_BUFUSAGE_INCR(shared_blks_hit);
 	}
 	if (rel)
 	{
@@ -1888,9 +1888,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 										  true);
 
 		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_hit += 1;
+			INSTR_BUFUSAGE_INCR(local_blks_hit);
 		else
-			pgBufferUsage.shared_blks_hit += 1;
+			INSTR_BUFUSAGE_INCR(shared_blks_hit);
 
 		if (operation->rel)
 			pgstat_count_buffer_hit(operation->rel);
@@ -1958,9 +1958,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 								io_start, 1, io_buffers_len * BLCKSZ);
 
 		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_read += io_buffers_len;
+			INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
 		else
-			pgBufferUsage.shared_blks_read += io_buffers_len;
+			INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
 
 		/*
 		 * Track vacuum cost when issuing IO, not after waiting for it.
@@ -2842,7 +2842,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
 	}
 
-	pgBufferUsage.shared_blks_written += extend_by;
+	INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
 
 	*extended_by = extend_by;
 
@@ -2960,7 +2960,7 @@ MarkBufferDirty(Buffer buffer)
 	 */
 	if (!(old_buf_state & BM_DIRTY))
 	{
-		pgBufferUsage.shared_blks_dirtied++;
+		INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageDirty;
 	}
@@ -4368,7 +4368,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
 							IOOP_WRITE, io_start, 1, BLCKSZ);
 
-	pgBufferUsage.shared_blks_written++;
+	INSTR_BUFUSAGE_INCR(shared_blks_written);
 
 	/*
 	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
@@ -5524,7 +5524,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 
 		if (dirtied)
 		{
-			pgBufferUsage.shared_blks_dirtied++;
+			INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
 			if (VacuumCostActive)
 				VacuumCostBalance += VacuumCostPageDirty;
 		}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 04fef13409b..74c95a3fc59 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -216,7 +216,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
 	/* Mark not-dirty */
 	TerminateLocalBufferIO(bufHdr, true, 0, false);
 
-	pgBufferUsage.local_blks_written++;
+	INSTR_BUFUSAGE_INCR(local_blks_written);
 }
 
 static Buffer
@@ -476,7 +476,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
 
 	*extended_by = extend_by;
 
-	pgBufferUsage.local_blks_written += extend_by;
+	INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
 
 	return first_block;
 }
@@ -507,7 +507,7 @@ MarkLocalBufferDirty(Buffer buffer)
 	buf_state = pg_atomic_read_u32(&bufHdr->state);
 
 	if (!(buf_state & BM_DIRTY))
-		pgBufferUsage.local_blks_dirtied++;
+		INSTR_BUFUSAGE_INCR(local_blks_dirtied);
 
 	buf_state |= BM_DIRTY;
 
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 366d70d38a1..9d39df998cb 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -474,13 +474,13 @@ BufFileLoadBuffer(BufFile *file)
 	if (track_io_timing)
 	{
 		INSTR_TIME_SET_CURRENT(io_time);
-		INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+		INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
 	}
 
 	/* we choose not to advance curOffset here */
 
 	if (file->nbytes > 0)
-		pgBufferUsage.temp_blks_read++;
+		INSTR_BUFUSAGE_INCR(temp_blks_read);
 }
 
 /*
@@ -548,13 +548,13 @@ BufFileDumpBuffer(BufFile *file)
 		if (track_io_timing)
 		{
 			INSTR_TIME_SET_CURRENT(io_time);
-			INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+			INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
 		}
 
 		file->curOffset += bytestowrite;
 		wpos += bytestowrite;
 
-		pgBufferUsage.temp_blks_written++;
+		INSTR_BUFUSAGE_INCR(temp_blks_written);
 	}
 	file->dirty = false;
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 13ae57ed649..4f6274eb573 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			{
 				pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 				if (io_object == IOOBJECT_RELATION)
-					INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+					INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
 				else if (io_object == IOOBJECT_TEMP_RELATION)
-					INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+					INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
 			}
 			else if (io_op == IOOP_READ)
 			{
 				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 				if (io_object == IOOBJECT_RELATION)
-					INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+					INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
 				else if (io_object == IOOBJECT_TEMP_RELATION)
-					INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+					INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
 			}
 		}
 
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1ae533f6704..78d3653997b 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -149,4 +149,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
 extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
 							  const WalUsage *sub);
 
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+		pgBufferUsage.fld++; \
+	} while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+		pgBufferUsage.fld += val; \
+	} while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+	INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+	} while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+	INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+	} while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+		pgWalUsage.fld++; \
+	} while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+		pgWalUsage.fld += val; \
+	} while(0)
+
 #endif							/* INSTRUMENT_H */
-- 
2.47.1



  [application/octet-stream] v2-0002-Separate-node-instrumentation-from-other-use-of-I.patch (21.4K, 9-v2-0002-Separate-node-instrumentation-from-other-use-of-I.patch)
  download | inline diff:
From a6893036163db67839c0cf2a40d0032858032424 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v2 2/7] Separate node instrumentation from other use of
 Instrumentation struct

Previously different places (e.g. query "total time") were repurposing
the Instrumentation struct initially introduced for capturing per-node
statistics during execution. This dual use of the struct is confusing,
e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
code paths, and prevents future refactorings.

Instead, simplify the Instrumentation struct to only track time,
WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop
outside of per-node instrumentation. Introduce the NodeInstrumentation
struct to carry forward the per-node instrumentation information.
---
 contrib/auto_explain/auto_explain.c           |  8 +-
 .../pg_stat_statements/pg_stat_statements.c   |  8 +-
 contrib/postgres_fdw/postgres_fdw.c           |  2 +-
 src/backend/commands/explain.c                | 11 +--
 src/backend/commands/trigger.c                |  8 +-
 src/backend/executor/execMain.c               | 10 +--
 src/backend/executor/execParallel.c           | 22 +++--
 src/backend/executor/execProcnode.c           |  4 +-
 src/backend/executor/instrument.c             | 86 +++++++++++++++++--
 src/include/executor/instrument.h             | 51 ++++++++---
 src/include/nodes/execnodes.h                 |  3 +-
 11 files changed, 151 insertions(+), 62 deletions(-)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index c10f2fc0f25..ee0c3b4c91b 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
 			MemoryContext oldcxt;
 
 			oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
-			queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+			queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
 			MemoryContextSwitchTo(oldcxt);
 		}
 	}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 		 */
 		oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
 
-		/*
-		 * Make sure stats accumulation is done.  (Note: it's okay if several
-		 * levels of hook all do this.)
-		 */
-		InstrEndLoop(queryDesc->totaltime);
-
 		/* Log plan if duration is exceeded. */
 		msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
 		if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index b9c971de1e5..4ec33fbf470 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1021,7 +1021,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 			MemoryContext oldcxt;
 
 			oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
-			queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+			queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
 			MemoryContextSwitchTo(oldcxt);
 		}
 	}
@@ -1080,12 +1080,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 	if (queryId != INT64CONST(0) && queryDesc->totaltime &&
 		pgss_enabled(nesting_level))
 	{
-		/*
-		 * Make sure stats accumulation is done.  (Note: it's okay if several
-		 * levels of hook all do this.)
-		 */
-		InstrEndLoop(queryDesc->totaltime);
-
 		pgss_store(queryDesc->sourceText,
 				   queryId,
 				   queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 456b267f70b..7619ac486c0 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2778,7 +2778,7 @@ postgresIterateDirectModify(ForeignScanState *node)
 	if (!resultRelInfo->ri_projectReturning)
 	{
 		TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
-		Instrumentation *instr = node->ss.ps.instrument;
+		NodeInstrumentation *instr = node->ss.ps.instrument;
 
 		Assert(!dmstate->has_returning);
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 95b7a9d227f..d005fcdbc98 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1102,9 +1102,6 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
 		char	   *relname;
 		char	   *conname = NULL;
 
-		/* Must clean up instrumentation state */
-		InstrEndLoop(instr);
-
 		/*
 		 * We ignore triggers that were never invoked; they likely aren't
 		 * relevant to the current query type.
@@ -1135,7 +1132,7 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
 				appendStringInfo(es->str, " on %s", relname);
 			if (es->timing)
 				appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
-								 1000.0 * instr->total, instr->ntuples);
+								 1000.0 * INSTR_TIME_GET_DOUBLE(instr->total), instr->ntuples);
 			else
 				appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
 		}
@@ -1146,7 +1143,7 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
 				ExplainPropertyText("Constraint Name", conname, es);
 			ExplainPropertyText("Relation", relname, es);
 			if (es->timing)
-				ExplainPropertyFloat("Time", "ms", 1000.0 * instr->total, 3,
+				ExplainPropertyFloat("Time", "ms", 1000.0 * INSTR_TIME_GET_DOUBLE(instr->total), 3,
 									 es);
 			ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
 		}
@@ -1888,7 +1885,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 		for (int n = 0; n < w->num_workers; n++)
 		{
-			Instrumentation *instrument = &w->instrument[n];
+			NodeInstrumentation *instrument = &w->instrument[n];
 			double		nloops = instrument->nloops;
 			double		startup_ms;
 			double		total_ms;
@@ -2294,7 +2291,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 		for (int n = 0; n < w->num_workers; n++)
 		{
-			Instrumentation *instrument = &w->instrument[n];
+			NodeInstrumentation *instrument = &w->instrument[n];
 			double		nloops = instrument->nloops;
 
 			if (nloops <= 0)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 579ac8d76ae..9b53dd99e99 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2344,7 +2344,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
 	 * If doing EXPLAIN ANALYZE, start charging time to this trigger.
 	 */
 	if (instr)
-		InstrStartNode(instr + tgindx);
+		InstrStart(instr + tgindx);
 
 	/*
 	 * Do the function evaluation in the per-tuple memory context, so that
@@ -2392,7 +2392,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
 	 * one "tuple returned" (really the number of firings).
 	 */
 	if (instr)
-		InstrStopNode(instr + tgindx, 1);
+		InstrStop(instr + tgindx, 1);
 
 	return (HeapTuple) DatumGetPointer(result);
 }
@@ -4381,7 +4381,7 @@ AfterTriggerExecute(EState *estate,
 	 * to include time spent re-fetching tuples in the trigger cost.
 	 */
 	if (instr)
-		InstrStartNode(instr + tgindx);
+		InstrStart(instr + tgindx);
 
 	/*
 	 * Fetch the required tuple(s).
@@ -4607,7 +4607,7 @@ AfterTriggerExecute(EState *estate,
 	 * one "tuple returned" (really the number of firings).
 	 */
 	if (instr)
-		InstrStopNode(instr + tgindx, 1);
+		InstrStop(instr + tgindx, 1);
 }
 
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index ff12e2e1364..e459b3aa797 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -331,7 +331,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 
 	/* Allow instrumentation of Executor overall runtime */
 	if (queryDesc->totaltime)
-		InstrStartNode(queryDesc->totaltime);
+		InstrStart(queryDesc->totaltime);
 
 	/*
 	 * extract information from the query descriptor and the query feature.
@@ -383,7 +383,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 		dest->rShutdown(dest);
 
 	if (queryDesc->totaltime)
-		InstrStopNode(queryDesc->totaltime, estate->es_processed);
+		InstrStop(queryDesc->totaltime, estate->es_processed);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -433,7 +433,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
 
 	/* Allow instrumentation of Executor overall runtime */
 	if (queryDesc->totaltime)
-		InstrStartNode(queryDesc->totaltime);
+		InstrStart(queryDesc->totaltime);
 
 	/* Run ModifyTable nodes to completion */
 	ExecPostprocessPlan(estate);
@@ -443,7 +443,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
 		AfterTriggerEndQuery(estate);
 
 	if (queryDesc->totaltime)
-		InstrStopNode(queryDesc->totaltime, 0);
+		InstrStop(queryDesc->totaltime, 0);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -1266,7 +1266,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigWhenExprs = (ExprState **)
 			palloc0(n * sizeof(ExprState *));
 		if (instrument_options)
-			resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+			resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
 	}
 	else
 	{
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557cf..e87810d292e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -85,7 +85,7 @@ typedef struct FixedParallelExecutorState
  * instrument_options: Same meaning here as in instrument.c.
  *
  * instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object.  This will depend on the length of
+ * of the first NodeInstrumentation object.  This will depend on the length of
  * the plan_node_id array.
  *
  * num_workers: Number of workers.
@@ -102,11 +102,15 @@ struct SharedExecutorInstrumentation
 	int			num_workers;
 	int			num_plan_nodes;
 	int			plan_node_id[FLEXIBLE_ARRAY_MEMBER];
-	/* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+	/*
+	 * array of num_plan_nodes * num_workers NodeInstrumentation objects
+	 * follows
+	 */
 };
 #define GetInstrumentationArray(sei) \
 	(AssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
-	 (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+	 (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
 
 /* Context object for ExecParallelEstimate. */
 typedef struct ExecParallelEstimateContext
@@ -713,7 +717,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 		instrumentation_len = MAXALIGN(instrumentation_len);
 		instrument_offset = instrumentation_len;
 		instrumentation_len +=
-			mul_size(sizeof(Instrumentation),
+			mul_size(sizeof(NodeInstrumentation),
 					 mul_size(e.nnodes, nworkers));
 		shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -799,7 +803,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	 */
 	if (estate->es_instrument)
 	{
-		Instrumentation *instrument;
+		NodeInstrumentation *instrument;
 		int			i;
 
 		instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -809,7 +813,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 		instrumentation->num_plan_nodes = e.nnodes;
 		instrument = GetInstrumentationArray(instrumentation);
 		for (i = 0; i < nworkers * e.nnodes; ++i)
-			InstrInit(&instrument[i], estate->es_instrument);
+			InstrInitNode(&instrument[i], estate->es_instrument);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
 					   instrumentation);
 		pei->instrumentation = instrumentation;
@@ -1036,7 +1040,7 @@ static bool
 ExecParallelRetrieveInstrumentation(PlanState *planstate,
 									SharedExecutorInstrumentation *instrumentation)
 {
-	Instrumentation *instrument;
+	NodeInstrumentation *instrument;
 	int			i;
 	int			n;
 	int			ibytes;
@@ -1064,7 +1068,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 	 * Switch into per-query memory context.
 	 */
 	oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
-	ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+	ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
 	planstate->worker_instrument =
 		palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
 	MemoryContextSwitchTo(oldcontext);
@@ -1296,7 +1300,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 {
 	int			i;
 	int			plan_node_id = planstate->plan->plan_node_id;
-	Instrumentation *instrument;
+	NodeInstrumentation *instrument;
 
 	InstrEndLoop(planstate->instrument);
 
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f5f9cfbeead..d286471254b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -413,8 +413,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 
 	/* Set up instrumentation for this node if requested */
 	if (estate->es_instrument)
-		result->instrument = InstrAlloc(1, estate->es_instrument,
-										result->async_capable);
+		result->instrument = InstrAllocNode(1, estate->es_instrument,
+											result->async_capable);
 
 	return result;
 }
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 1c92abe6761..1fe0f4204e5 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,9 +26,9 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 static void WalUsageAdd(WalUsage *dst, WalUsage *add);
 
 
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
 Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int n, int instrument_options)
 {
 	Instrumentation *instr;
 
@@ -41,6 +41,74 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
 		bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
 		int			i;
 
+		for (i = 0; i < n; i++)
+		{
+			instr[i].need_bufusage = need_buffers;
+			instr[i].need_walusage = need_wal;
+			instr[i].need_timer = need_timer;
+		}
+	}
+
+	return instr;
+}
+void
+InstrStart(Instrumentation *instr)
+{
+	if (instr->need_timer &&
+		!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
+		elog(ERROR, "InstrStart called twice in a row");
+
+	if (instr->need_bufusage)
+		instr->bufusage_start = pgBufferUsage;
+
+	if (instr->need_walusage)
+		instr->walusage_start = pgWalUsage;
+}
+void
+InstrStop(Instrumentation *instr, double nTuples)
+{
+	instr_time	endtime;
+
+	/* count the specified tuples */
+	instr->ntuples += nTuples;
+
+	/* let's update the time only if the timer was requested */
+	if (instr->need_timer)
+	{
+		if (INSTR_TIME_IS_ZERO(instr->starttime))
+			elog(ERROR, "InstrStop called without start");
+
+		INSTR_TIME_SET_CURRENT(endtime);
+		INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+
+		INSTR_TIME_SET_ZERO(instr->starttime);
+	}
+
+	/* Add delta of buffer usage since entry to node's totals */
+	if (instr->need_bufusage)
+		BufferUsageAccumDiff(&instr->bufusage,
+							 &pgBufferUsage, &instr->bufusage_start);
+
+	if (instr->need_walusage)
+		WalUsageAccumDiff(&instr->walusage,
+						  &pgWalUsage, &instr->walusage_start);
+}
+
+/* Allocate new node instrumentation structure(s) */
+NodeInstrumentation *
+InstrAllocNode(int n, int instrument_options, bool async_mode)
+{
+	NodeInstrumentation *instr;
+
+	/* initialize all fields to zeroes, then modify as needed */
+	instr = palloc0(n * sizeof(NodeInstrumentation));
+	if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
+	{
+		bool		need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+		bool		need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
+		bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+		int			i;
+
 		for (i = 0; i < n; i++)
 		{
 			instr[i].need_bufusage = need_buffers;
@@ -55,9 +123,9 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
 
 /* Initialize a pre-allocated instrumentation structure. */
 void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitNode(NodeInstrumentation * instr, int instrument_options)
 {
-	memset(instr, 0, sizeof(Instrumentation));
+	memset(instr, 0, sizeof(NodeInstrumentation));
 	instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
 	instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
 	instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
@@ -65,7 +133,7 @@ InstrInit(Instrumentation *instr, int instrument_options)
 
 /* Entry to a plan node */
 void
-InstrStartNode(Instrumentation *instr)
+InstrStartNode(NodeInstrumentation * instr)
 {
 	if (instr->need_timer &&
 		!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
@@ -81,7 +149,7 @@ InstrStartNode(Instrumentation *instr)
 
 /* Exit from a plan node */
 void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStopNode(NodeInstrumentation * instr, double nTuples)
 {
 	double		save_tuplecount = instr->tuplecount;
 	instr_time	endtime;
@@ -129,7 +197,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 
 /* Update tuple count */
 void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples)
 {
 	/* count the returned tuples */
 	instr->tuplecount += nTuples;
@@ -137,7 +205,7 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
 
 /* Finish a run cycle for a plan node */
 void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation * instr)
 {
 	/* Skip if nothing has happened, or already shut down */
 	if (!instr->running)
@@ -162,7 +230,7 @@ InstrEndLoop(Instrumentation *instr)
 
 /* aggregate instrumentation information */
 void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
 {
 	if (!dst->running && add->running)
 	{
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index ba5c986907e..1ae533f6704 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -66,7 +66,33 @@ typedef enum InstrumentOption
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
+/*
+ * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured inbetween).
+ */
 typedef struct Instrumentation
+{
+	/* Parameters set at creation: */
+	bool		need_timer;		/* true if we need timer data */
+	bool		need_bufusage;	/* true if we need buffer usage data */
+	bool		need_walusage;	/* true if we need WAL usage data */
+	/* Internal state keeping: */
+	instr_time	starttime;		/* start time of last InstrStart */
+	BufferUsage bufusage_start; /* buffer usage at start */
+	WalUsage	walusage_start; /* WAL usage at start */
+	/* Accumulated statistics: */
+	instr_time	total;			/* total runtime */
+	double		ntuples;		/* total tuples counted in InstrStop */
+	BufferUsage bufusage;		/* total buffer usage */
+	WalUsage	walusage;		/* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
 {
 	/* Parameters set at node creation: */
 	bool		need_timer;		/* true if we need timer data */
@@ -91,25 +117,30 @@ typedef struct Instrumentation
 	double		nfiltered2;		/* # of tuples removed by "other" quals */
 	BufferUsage bufusage;		/* total buffer usage */
 	WalUsage	walusage;		/* total WAL usage */
-} Instrumentation;
+}			NodeInstrumentation;
 
 typedef struct WorkerInstrumentation
 {
 	int			num_workers;	/* # of structures that follow */
-	Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+	NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
 } WorkerInstrumentation;
 
 extern PGDLLIMPORT BufferUsage pgBufferUsage;
 extern PGDLLIMPORT WalUsage pgWalUsage;
 
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
-								   bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int n, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr, double nTuples);
+
+extern NodeInstrumentation * InstrAllocNode(int n, int instrument_options,
+											bool async_mode);
+extern void InstrInitNode(NodeInstrumentation * instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation * instr);
+extern void InstrStopNode(NodeInstrumentation * instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation * instr);
+extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
+
 extern void InstrStartParallelQuery(void);
 extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
 extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index de782014b2d..9b3bd66d401 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1169,7 +1169,8 @@ typedef struct PlanState
 	ExecProcNodeMtd ExecProcNodeReal;	/* actual function, if above is a
 										 * wrapper */
 
-	Instrumentation *instrument;	/* Optional runtime stats for this node */
+	NodeInstrumentation *instrument;	/* Optional runtime stats for this
+										 * node */
 	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
 
 	/* Per-worker JIT instrumentation */
-- 
2.47.1



view thread (42+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: Stack-based tracking of per-node WAL/buffer usage
  In-Reply-To: <CAP53Pkw1RpyAVsvJwUOMgGnVoQUibMk+EB+w-8WqY8kzsC4WjA@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox