Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1veIJm-00BTuw-0P for pgsql-hackers@arkaria.postgresql.org; Fri, 09 Jan 2026 19:38:43 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1veIJh-008VUR-2K for pgsql-hackers@arkaria.postgresql.org; Fri, 09 Jan 2026 19:38:38 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1veIJg-008VUD-2I for pgsql-hackers@lists.postgresql.org; Fri, 09 Jan 2026 19:38:38 +0000 Received: from fout-b3-smtp.messagingengine.com ([202.12.124.146]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1veIJd-005XJz-2i for pgsql-hackers@lists.postgresql.org; Fri, 09 Jan 2026 19:38:36 +0000 Received: from phl-compute-05.internal (phl-compute-05.internal [10.202.2.45]) by mailfout.stl.internal (Postfix) with ESMTP id 7A44E1D00035; Fri, 9 Jan 2026 14:38:31 -0500 (EST) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-05.internal (MEProxy); Fri, 09 Jan 2026 14:38:31 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm2; t=1767987511; x=1768073911; bh=HiZsCzdsGU 9HvjOH3zzMV0E0CQFY0yWyT6bVbi1DnVY=; b=KbMqdpDRxUcMPUCy2+teSq6l6p f77vsBWEk0g3UcR4LF4uWfx2IiN3iD12hDqrfo8dnLGxrGkrQJ6Lzd1k7x0jgT0y kWbAogNlbjFQK1Da3BkbDg4rQ91UfnBwSYFWH5rEa6mrxa0jvpgTRQ2Rn2N7+uq8 VFaTNWzHbEYpYkeoOLNlbWpqtZVrwogZIu/KdRqNDvDlodhlE24DIxqHCdqzwTpH 4GgKAXqsjK0Akq91kFv/D6hqieKih3AouG8l2GUDyCm6IhiKLWFyzgq4k4ctnmdc M45H72gfyBcL0JRQ0nIL8cG2RzGeH8vPW4K8gYL6Hu4+dEZhNjT2JPzkHBRQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm2; t= 1767987511; x=1768073911; bh=HiZsCzdsGU9HvjOH3zzMV0E0CQFY0yWyT6b Vbi1DnVY=; b=LuvtOdTMFoZUOVfUqSaQYO3IISd7NpOZYLgG9MBGU3xZ63a3pDK S3EDSJhwEaMBwxlTDYdsI7OqZDEr3j4u/5NM/dvf93G3f7LblgmTY4tdJSfu0W0I izUrkGqwGOaDLh9KDBnf9LwVSQ7SrZSDpX3gwaeNUoAASO21Os2i4dPuHocqrocV QGakYWqhohU/j0ZCL2blN4OC4W4ouPRRDIY3VkqOVhffOnkLXGGDYmQ7plE+6ciE E9QqxG9MYorZQUmJ8+ac5va7a/xNghlqiUnyC9klzZyJbe6x9ThE1QVGH8LXiIWx gtoTvnJj4O7SLhH2a/bXxYzGHAThRipAW8g== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgddutdelieejucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujf gurhepfffhvfevuffkfhggtggujgesthdtsfdttddtvdenucfhrhhomheptehnughrvghs ucfhrhgvuhhnugcuoegrnhgurhgvshesrghnrghrrgiivghlrdguvgeqnecuggftrfgrth htvghrnhepfeffgfelvdffgedtveelgfdtgefghfdvkefggeetieevjeekteduleevjefh ueegnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomheprg hnughrvghssegrnhgrrhgriigvlhdruggvpdhnsggprhgtphhtthhopedvpdhmohguvgep shhmthhpohhuthdprhgtphhtthhopehluhhkrghssehfihhtthhlrdgtohhmpdhrtghpth htohepphhgshhqlhdqhhgrtghkvghrsheslhhishhtshdrphhoshhtghhrvghsqhhlrdho rhhg X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Fri, 9 Jan 2026 14:38:30 -0500 (EST) Date: Fri, 9 Jan 2026 14:38:30 -0500 From: Andres Freund To: Lukas Fittl Cc: PostgreSQL Hackers Subject: Re: Stack-based tracking of per-node WAL/buffer usage Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2025-10-31 00:18:04 -0700, Lukas Fittl wrote: > Attached v4 patchset that addresses feedback (unless otherwise noted below) > and is rebased on master. Other changes: > > [...] > - Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that > illustrates how the stack mechanism can be used to separate index and table > buffer accesses in the EXPLAIN for Index Scans Nice! I pushed 0001. The only changes I made were to break a few long lines. > From 324666b35d4513676783f0c352ad3a27371c08d8 Mon Sep 17 00:00:00 2001 > From: Lukas Fittl > Date: Sat, 1 Mar 2025 19:31:30 -0800 > Subject: [PATCH v4 2/7] Separate node and trigger instrumentation from other > use of Instrumentation struct > > Previously different places (e.g. query "total time") were repurposing > the Instrumentation struct initially introduced for capturing per-node > statistics during execution. This overuse of the same struct is confusing, > e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated > code paths, and prevents future refactorings. > > Instead, simplify the Instrumentation struct to only track time and > WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of > per-node instrumentation - these calls were added without any apparent > benefit since the relevant fields were never read. > > Introduce the NodeInstrumentation struct to carry forward the per-node > instrumentation information, and introduce TriggerInstrumentation to > capture trigger timing and firings (previously counted in "ntuples"). Hm. It looks to me that after moving trigger instrumentation to its own thing, there are no users of InstrAlloc() with n > 1. I think it may make sense to drop that? > diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c > index f098a5557cf..e87810d292e 100644 > --- a/src/backend/executor/execParallel.c > +++ b/src/backend/executor/execParallel.c > @@ -85,7 +85,7 @@ typedef struct FixedParallelExecutorState > * instrument_options: Same meaning here as in instrument.c. > * > * instrument_offset: Offset, relative to the start of this structure, > - * of the first Instrumentation object. This will depend on the length of > + * of the first NodeInstrumentation object. This will depend on the length of > * the plan_node_id array. > * > * num_workers: Number of workers. > @@ -102,11 +102,15 @@ struct SharedExecutorInstrumentation > int num_workers; > int num_plan_nodes; > int plan_node_id[FLEXIBLE_ARRAY_MEMBER]; > - /* array of num_plan_nodes * num_workers Instrumentation objects follows */ > + > + /* > + * array of num_plan_nodes * num_workers NodeInstrumentation objects > + * follows > + */ > }; Spurious change, I think? > + > +/* Trigger instrumentation handling */ > +TriggerInstrumentation * > +InstrAllocTrigger(int n, int instrument_options) > +{ > + TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation)); > + bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0; > + bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0; > + bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0; > + int i; > + > + for (i = 0; i < n; i++) > + { > + tginstr[i].instr.need_bufusage = need_buffers; > + tginstr[i].instr.need_walusage = need_wal; > + tginstr[i].instr.need_timer = need_timer; > + } > + > + return tginstr; Hm. Not that it's a huge difference, but I wonder if we ought to use InstrInit() here and in InstrAlloc(), InstrAllocNode(), to avoid repetition? > +/* Trigger instrumentation */ > +typedef struct TriggerInstrumentation > +{ > + Instrumentation instr; > + int firings; /* number of times the instrumented trigger > + * was fired */ > +} TriggerInstrumentation; > + > +/* > + * Specialized instrumentation for per-node execution statistics > + */ > +typedef struct NodeInstrumentation > { > /* Parameters set at node creation: */ > bool need_timer; /* true if we need timer data */ > @@ -92,25 +125,34 @@ typedef struct Instrumentation > double nfiltered2; /* # of tuples removed by "other" quals */ > BufferUsage bufusage; /* total buffer usage */ > WalUsage walusage; /* total WAL usage */ > -} Instrumentation; > +} NodeInstrumentation; The indentation here will look better if you add TriggerInstrumentation, NodeInstrumentation to typedefs.list > typedef struct WorkerInstrumentation > { > int num_workers; /* # of structures that follow */ > - Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER]; > + NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER]; > } WorkerInstrumentation; Hm. Shouldn't this be WorkerNodeInstrumentation now? > From 74f44adc505a436a65d6069b286c8a878d4fe4af Mon Sep 17 00:00:00 2001 > From: Lukas Fittl > Date: Sun, 31 Aug 2025 16:34:42 -0700 > Subject: [PATCH v4 3/7] Replace direct changes of pgBufferUsage/pgWalUsage > with INSTR_* macros Looks sane to me. I'd move that to the head of the queue in the next revision. > From e03c96cbd3079c03ae63b6427937b79edaa9562b Mon Sep 17 00:00:00 2001 > From: Lukas Fittl > Date: Tue, 9 Sep 2025 02:16:59 -0700 > Subject: [PATCH v4 4/7] Optimize measuring WAL/buffer usage through > stack-based instrumentation > > Previously, in order to determine the buffer/WAL usage of a given code section, > we utilized continuously incrementing global counters that get updated when the > actual activity (e.g. shared block read) occurred, and then calculated a diff when > the code section ended. This resulted in a bottleneck for executor node instrumentation > specifically, with the function BufferUsageAccumDiff showing up in profiles and > in some cases adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run. > > Instead, introduce a stack-based mechanism, where the actual activity writes > into the current stack entry. In the case of executor nodes, this means that > each node gets its own stack entry that is pushed at InstrStartNode, and popped > at InstrEndNode. Stack entries are zero initialized (avoiding the diff mechanism) > and get added to their parent entry when they are finalized, i.e. no more > modifications can occur. > > To correctly handle abort situations, any use of instrumentation stacks must > involve either a top-level Instrumentation struct, and its associated InstrStart/ > InstrStop helpers (which use resource owners to handle aborts), or dedicated > PG_TRY/PG_FINALLY calls that ensure the stack is in a consistent state after > an abort. I think it may be good to have some tests for edge cases. E.g. testing that a query that an explain analyze of a query that executes a plpgsql function which in turn does another explain analyze or just another query without explain analyze, does something reasonable. > @@ -1089,8 +1074,13 @@ pgss_ExecutorEnd(QueryDesc *queryDesc) > PGSS_EXEC, > INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total), > queryDesc->estate->es_total_processed, > - &queryDesc->totaltime->bufusage, > - &queryDesc->totaltime->walusage, > + > + /* > + * Check if stack is initialized - it is not when ExecutorRun wasn't > + * called > + */ > + queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->bufusage : NULL, > + queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->walusage : NULL, > queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL, > NULL, > queryDesc->estate->es_parallel_workers_to_launch, Maybe it's just the diff, but this looks a bit odd, indentation wise. > @@ -1266,7 +1280,15 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo, > resultRelInfo->ri_TrigWhenExprs = (ExprState **) > palloc0(n * sizeof(ExprState *)); > if (instrument_options) > - resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options); > + { > + /* > + * Triggers do not individually track buffer/WAL usage, even if > + * otherwise tracked > + */ Why is that? > @@ -59,15 +125,31 @@ InstrStart(Instrumentation *instr) > !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime)) > elog(ERROR, "InstrStart called twice in a row"); > > - if (instr->need_bufusage) > - instr->bufusage_start = pgBufferUsage; > + if (instr->need_bufusage || instr->need_walusage) > + { > + Assert(CurrentResourceOwner != NULL); > + instr->owner = CurrentResourceOwner; > + > + /* > + * Allocate the stack resource in a memory context that survives > + * during an abort. This will be freed by InstrStop (regular > + * execution) or ResOwnerReleaseInstrumentation (abort). > + * > + * We don't do this in InstrAlloc to avoid leaking when InstrStart + > + * InstrStop isn't called. > + */ > + if (instr->stack == NULL) > + instr->stack = MemoryContextAllocZero(CurTransactionContext, sizeof(InstrStack)); > > - if (instr->need_walusage) > - instr->walusage_start = pgWalUsage; > + ResourceOwnerEnlarge(instr->owner); > + ResourceOwnerRememberInstrStack(instr->owner, instr->stack); > + > + InstrPushStack(instr->stack); > + } > } I'm still worried that allocating something in CurTransactionContext will bite us eventually. It seems like it'd be pretty sane to want to have instrumentation around procedure execution, for example. > void > -InstrStop(Instrumentation *instr) > +InstrStop(Instrumentation *instr, bool finalize) > { > instr_time endtime; > > @@ -83,14 +165,28 @@ InstrStop(Instrumentation *instr) > INSTR_TIME_SET_ZERO(instr->starttime); > } > > - /* Add delta of buffer usage since entry to node's totals */ > - if (instr->need_bufusage) > - BufferUsageAccumDiff(&instr->bufusage, > - &pgBufferUsage, &instr->bufusage_start); > + if (instr->need_bufusage || instr->need_walusage) > + { > + InstrPopStack(instr->stack, finalize); > > - if (instr->need_walusage) > - WalUsageAccumDiff(&instr->walusage, > - &pgWalUsage, &instr->walusage_start); > + Assert(instr->owner != NULL); > + ResourceOwnerForgetInstrStack(instr->owner, instr->stack); > + instr->owner = NULL; > + > + if (finalize) > + { > + /* > + * To avoid keeping memory allocated beyond when its needed, copy > + * the result to the current memory context, and free it in the > + * transaction context. > + */ > + InstrStack *stack = palloc(sizeof(InstrStack)); > + > + memcpy(stack, instr->stack, sizeof(InstrStack)); > + pfree(instr->stack); > + instr->stack = stack; > + } > + } > } Hm. I'm not entirely sure I understand the gain due to the palloc & copy & pfree. Won't that typically increase memory usage due to temporarily having two versions of the InstrStack? > /* Trigger instrumentation handling */ > @@ -98,15 +194,20 @@ TriggerInstrumentation * > InstrAllocTrigger(int n, int instrument_options) > { > TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation)); > - bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0; > - bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0; > bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0; > int i; > > + /* > + * To avoid having to determine when the last trigger fired, we never > + * track WAL/buffer usage for now > + */ > + Assert((instrument_options & INSTRUMENT_BUFFERS) == 0); > + Assert((instrument_options & INSTRUMENT_WAL) == 0); > + > for (i = 0; i < n; i++) > { > - tginstr[i].instr.need_bufusage = need_buffers; > - tginstr[i].instr.need_walusage = need_wal; > + tginstr[i].instr.need_bufusage = false; > + tginstr[i].instr.need_walusage = false; > tginstr[i].instr.need_timer = need_timer; > } > That does seem problematic to me. > @@ -400,21 +398,22 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add) > } > > /* start instrumentation during parallel executor startup */ > -void > +Instrumentation * > InstrStartParallelQuery(void) > { > - save_pgBufferUsage = pgBufferUsage; > - save_pgWalUsage = pgWalUsage; > + Instrumentation *instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL); > + > + InstrStart(instr); > + return instr; > } > /* report usage after parallel executor shutdown */ > void > -InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage) > +InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage) > { > - memset(bufusage, 0, sizeof(BufferUsage)); > - BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage); > - memset(walusage, 0, sizeof(WalUsage)); > - WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage); > + InstrStop(instr, true); > + memcpy(bufusage, &instr->stack->bufusage, sizeof(BufferUsage)); > + memcpy(walusage, &instr->stack->walusage, sizeof(WalUsage)); > } Wonder if we should slap a pg_nodiscard on InstrStartParallelQuery() to make it easier for extensions to notice this? OTOH, InstrEndParallelQuery() will make that pretty clear... > if (verbose || params.log_vacuum_min_duration == 0 || > TimestampDifferenceExceeds(starttime, endtime, > params.log_vacuum_min_duration)) > { > long secs_dur; > int usecs_dur; > - WalUsage walusage; > - BufferUsage bufferusage; > StringInfoData buf; > char *msgfmt; > int32 diff; > @@ -975,19 +976,17 @@ heap_vacuum_rel(Relation rel, const VacuumParams params, > int64 total_blks_hit; > int64 total_blks_read; > int64 total_blks_dirtied; > + BufferUsage bufusage = instr->stack->bufusage; > + WalUsage walusage = instr->stack->walusage; > TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur); > - memset(&walusage, 0, sizeof(WalUsage)); > - WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage); > - memset(&bufferusage, 0, sizeof(BufferUsage)); > - BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage); > - > - total_blks_hit = bufferusage.shared_blks_hit + > - bufferusage.local_blks_hit; > - total_blks_read = bufferusage.shared_blks_read + > - bufferusage.local_blks_read; > - total_blks_dirtied = bufferusage.shared_blks_dirtied + > - bufferusage.local_blks_dirtied; > + > + total_blks_hit = bufusage.shared_blks_hit + > + bufusage.local_blks_hit; > + total_blks_read = bufusage.shared_blks_read + > + bufusage.local_blks_read; > + total_blks_dirtied = bufusage.shared_blks_dirtied + > + bufusage.local_blks_dirtied; Why rename and move the declaration of bufferusage here? That seems to make the diff unnecessarily noisy? Kinda going the other way: Why copy this on the stack, rather than use a pointer? > diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c > index 34b6410d6a2..77b4c59e71c 100644 > --- a/src/backend/commands/prepare.c > +++ b/src/backend/commands/prepare.c > @@ -578,14 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es, > ListCell *p; > ParamListInfo paramLI = NULL; > EState *estate = NULL; > - instr_time planstart; > - instr_time planduration; > - BufferUsage bufusage_start, > - bufusage; > + Instrumentation *instr = NULL; > MemoryContextCounters mem_counters; > MemoryContext planner_ctx = NULL; > MemoryContext saved_ctx = NULL; > > + if (es->buffers) > + instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS); > + else > + instr = InstrAlloc(1, INSTRUMENT_TIMER); I'd not duplicate the call to InstrAlloc(). > From ccea6e453872a0ae63351b3ba4360845035ec621 Mon Sep 17 00:00:00 2001 > From: Lukas Fittl > Date: Thu, 30 Oct 2025 22:27:30 -0700 > Subject: [PATCH v4 7/7] Index scans: Split heap and index buffer access > reporting in EXPLAIN > > This makes it clear whether activity was on the index directly, or > on the table based on heap fetches. Yay^2. > --- > src/backend/commands/explain.c | 56 ++++++++++++++++------------ > src/backend/executor/execProcnode.c | 13 +++++++ > src/backend/executor/instrument.c | 25 +++++++++++++ > src/backend/executor/nodeIndexscan.c | 15 +++++++- > src/include/access/genam.h | 3 ++ > src/include/executor/instrument.h | 3 ++ > 6 files changed, 91 insertions(+), 24 deletions(-) > > diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c > index dd3bf615581..fb96dd5248c 100644 > diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c > index d00cf820a27..f19af428d97 100644 > --- a/src/backend/executor/execProcnode.c > +++ b/src/backend/executor/execProcnode.c > @@ -854,7 +854,20 @@ ExecAccumNodeInstrumentation_walker(PlanState *node, void *context) > planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context); > > if (node->instrument && node->instrument->stack.previous) > + { Perhaps we should instead (probably in the earlier patch) just make the function return early if we don't have instrumentation / a previous node? We might gain a few more special cases, and it's nice if they're less indented. > + /* > + * Index Scan nodes account for heap buffer usage separately, so we > + * need to explitly add here > + */ s/heap/table/ Greetings, Andres Freund