public inbox for [email protected]
* Stack-based tracking of per-node WAL/buffer usage
@ 2025-08-31 23:57 Lukas Fittl <[email protected]>
2025-09-04 20:23 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
From: Lukas Fittl @ 2025-08-31 23:57 UTC
To: PostgreSQL Hackers <[email protected]>; +Cc: Andres Freund <[email protected]>
Hi,
Please find attached a patch series that introduces a new paradigm for how
per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce
overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated
distinct buffer hits [0].
Currently we utilize pgWalUsage/pgBufferUsage as global counters, and in
InstrStopNode we call the rather expensive
BufferUsageAccumDiff/WalUsageAccumDiff to know how much activity happened
within a given node cycle.
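For readers less familiar with the existing scheme, here is a rough sketch of the snapshot-and-diff pattern described above. It is not the PostgreSQL code: SnapUsage is a hypothetical two-field stand-in for BufferUsage, and all snap_* names are made up for illustration. The point is only that every node exit pays for a field-by-field subtraction over the whole struct.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for BufferUsage (the real struct has many more fields). */
typedef struct SnapUsage
{
	int64_t		shared_blks_hit;
	int64_t		shared_blks_read;
} SnapUsage;

/* Plays the role of the global pgBufferUsage counter. */
static SnapUsage snap_global;

typedef struct SnapInstr
{
	SnapUsage	start;		/* copy of the global taken on node entry */
	SnapUsage	total;		/* usage attributed to this node so far */
} SnapInstr;

/* dst += add - sub, field by field; the work grows with the number of fields. */
static void
snap_accum_diff(SnapUsage *dst, const SnapUsage *add, const SnapUsage *sub)
{
	/* the global counters only ever grow */
	assert(add->shared_blks_hit >= sub->shared_blks_hit);
	assert(add->shared_blks_read >= sub->shared_blks_read);

	dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
	dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
}

/* Node entry: remember where the global counters stood. */
static void
snap_start_node(SnapInstr *instr)
{
	instr->start = snap_global;
}

/* Node exit: charge everything that happened since entry to this node. */
static void
snap_stop_node(SnapInstr *instr)
{
	snap_accum_diff(&instr->total, &snap_global, &instr->start);
}

/* Two node cycles with unrelated activity in between; returns the node's hits. */
static int64_t
snap_demo(SnapInstr *instr)
{
	snap_start_node(instr);
	snap_global.shared_blks_hit += 2;
	snap_global.shared_blks_read += 1;
	snap_stop_node(instr);

	snap_global.shared_blks_hit += 5;	/* happens outside the node */

	snap_start_node(instr);
	snap_global.shared_blks_hit += 1;
	snap_stop_node(instr);

	return instr->total.shared_blks_hit;
}
```

Calling snap_demo() on a zero-initialized SnapInstr attributes only the activity between each start/stop pair to the node, while the global keeps counting everything.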
This proposal instead uses a stack, where each time we enter a node
(InstrStartNode) we point a new global (pgInstrStack) to the current stack
entry. Whilst we're in that node we add buffer/WAL usage statistics
directly to the stack entry. On exit (InstrStopNode) we restore the previous
entry.
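A minimal sketch of the stack idea described above, under stated assumptions: StackEntry, stack_instr_stack (standing in for pgInstrStack) and the STACK_* macros are hypothetical names, and the struct is reduced to a single counter. How a child's usage gets folded into its parent's totals is not spelled out in this mail, so the sketch leaves it out and each entry only sees its own activity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical single-counter stand-in for the buffer usage fields. */
typedef struct StackUsage
{
	int64_t		shared_blks_hit;
} StackUsage;

typedef struct StackEntry
{
	StackUsage	usage;				/* counters charged to this node */
	struct StackEntry *previous;	/* entry to restore on exit */
} StackEntry;

/* Plays the role of the new global (pgInstrStack): the entry currently being charged. */
static StackEntry *stack_instr_stack = NULL;

/* Counting is a plain increment on whatever entry is current; no diffing needed. */
#define STACK_BUFUSAGE_INCR(fld) do { \
		if (stack_instr_stack) \
			stack_instr_stack->usage.fld++; \
	} while (0)
#define STACK_BUFUSAGE_ADD(fld, val) do { \
		if (stack_instr_stack) \
			stack_instr_stack->usage.fld += (val); \
	} while (0)

/* Node entry: push this node's entry. */
static void
stack_start_node(StackEntry *entry)
{
	entry->previous = stack_instr_stack;
	stack_instr_stack = entry;
}

/* Node exit: restore the previous entry. */
static void
stack_stop_node(StackEntry *entry)
{
	assert(stack_instr_stack == entry);	/* entries must be popped in order */
	stack_instr_stack = entry->previous;
}

/* An outer node that does some work itself and runs an inner node once. */
static void
stack_demo(StackEntry *outer, StackEntry *inner)
{
	stack_start_node(outer);
	STACK_BUFUSAGE_INCR(shared_blks_hit);	/* charged to outer */

	stack_start_node(inner);
	STACK_BUFUSAGE_ADD(shared_blks_hit, 3);	/* charged to inner */
	stack_stop_node(inner);

	STACK_BUFUSAGE_INCR(shared_blks_hit);	/* charged to outer again */
	stack_stop_node(outer);
}
```

Compared with the snapshot approach, node entry and exit shrink to a couple of pointer assignments, and the per-counter cost moves to the increment sites.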
This change provides about a 10% performance benefit for EXPLAIN ANALYZE on
paths that repeatedly enter InstrStopNode, e.g. SELECT COUNT(*):
CREATE TABLE test(id int);
INSERT INTO test SELECT * FROM generate_series(0, 1000000);
master (124ms, best out of 3):
postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16925.01..16925.02 rows=1 width=8) (actual time=124.910..124.910 rows=1.00 loops=1)
   Buffers: shared hit=752 read=3673
   ->  Seq Scan on test  (cost=0.00..14425.01 rows=1000001 width=0) (actual time=0.201..62.228 rows=1000001.00 loops=1)
         Buffers: shared hit=752 read=3673
 Planning Time: 0.116 ms
 Execution Time: 124.961 ms
patched (109ms, best out of 3):
postgres=# EXPLAIN (ANALYZE) SELECT COUNT(*) FROM test;
                                QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16925.01..16925.02 rows=1 width=8) (actual time=109.788..109.788 rows=1.00 loops=1)
   Buffers: shared hit=940 read=3485
   ->  Seq Scan on test  (cost=0.00..14425.01 rows=1000001 width=0) (actual time=0.153..69.368 rows=1000001.00 loops=1)
         Buffers: shared hit=940 read=3485
 Planning Time: 0.134 ms
 Execution Time: 109.837 ms
(6 rows)
I have also prototyped a more ambitious approach that completely removes
pgWalUsage/pgBufferUsage (utilizing the stack-collected data for e.g.
pg_stat_statements). For now this patch set does not include that change
and instead keeps adding to these legacy globals as well.
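To illustrate what "adding to the legacy globals as well" could mean at an increment site, here is a hedged sketch. DualWalUsage, dual_global_wal (standing in for pgWalUsage), dual_current_wal (standing in for the WAL counters of the current stack entry) and the DUAL_* macros are all hypothetical; the real macros in the patches may well be shaped differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cut-down stand-in for WalUsage. */
typedef struct DualWalUsage
{
	int64_t		wal_records;
	uint64_t	wal_bytes;
} DualWalUsage;

/* Plays the role of the legacy global pgWalUsage. */
static DualWalUsage dual_global_wal;

/* Hypothetical: WAL counters of the current stack entry, or NULL outside any node. */
static DualWalUsage *dual_current_wal = NULL;

/* Each macro updates the legacy global and, if a node is active, its entry too. */
#define DUAL_WALUSAGE_INCR(fld) do { \
		dual_global_wal.fld++; \
		if (dual_current_wal) \
			dual_current_wal->fld++; \
	} while (0)

/* Note: val is evaluated twice here, so it must not have side effects. */
#define DUAL_WALUSAGE_ADD(fld, val) do { \
		dual_global_wal.fld += (val); \
		if (dual_current_wal) \
			dual_current_wal->fld += (val); \
	} while (0)

/* Stand-in for a WAL insertion site reporting one record of len bytes. */
static void
dual_insert_record(uint64_t len)
{
	assert(len > 0);			/* a record always has some length */
	DUAL_WALUSAGE_ADD(wal_bytes, len);
	DUAL_WALUSAGE_INCR(wal_records);
}
```

With this shape, consumers that still read the global see every record, while a node's entry only sees records inserted while it was current.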
Patches attached:
0001: Separate node instrumentation from other use of Instrumentation struct
Previously different places (e.g. query "total time") were repurposing
the per-node Instrumentation struct. Instead, simplify the Instrumentation
struct to only track time, WAL/buffer usage, and tuple counts. Similarly,
drop the use of InstrEndLoop outside of per-node instrumentation. Introduce
the NodeInstrumentation struct to carry forward the per-node
instrumentation information.
0002: Replace direct changes of pgBufferUsage/pgWalUsage with INSTR_* macros
0003: Introduce stack for tracking per-node WAL/buffer usage
Feedback/thoughts welcome!
CCing Andres since he had expressed interest in this off-list.
[0]: See lightning talk slides from PGConf.Dev discussing an HLL-based
EXPLAIN (BUFFERS DISTINCT):
https://resources.pganalyze.com/pganalyze_PGConf.dev_2025_shared_blks_hit_distinct.pdf
Thanks,
Lukas
--
Lukas Fittl
Attachments:
[application/octet-stream] v1-0002-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch (9.0K, 3-v1-0002-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch)
From 942d8eb9b0f6a8d95c7cfd6a995d93ea9c667151 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 31 Aug 2025 16:34:42 -0700
Subject: [PATCH v1 2/3] Replace direct changes of pgBufferUsage/pgWalUsage
with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
---
src/backend/access/transam/xlog.c | 8 ++++----
src/backend/storage/buffer/bufmgr.c | 24 ++++++++++++------------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
6 files changed, 46 insertions(+), 27 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..61516f35676 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1078,9 +1078,9 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2060,7 +2060,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = 0;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 350cc0402aa..41f7c729c1d 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -705,7 +705,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -737,7 +737,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
else
PinBuffer_Locked(bufHdr); /* pin for first time */
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
@@ -1147,14 +1147,14 @@ PinBufferForBlock(Relation rel,
{
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
if (*foundPtr)
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
}
else
{
bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
strategy, foundPtr, io_context);
if (*foundPtr)
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
}
if (rel)
{
@@ -1888,9 +1888,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
if (operation->rel)
pgstat_count_buffer_hit(operation->rel);
@@ -1958,9 +1958,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it.
@@ -2865,7 +2865,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -2983,7 +2983,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4391,7 +4391,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
@@ -5547,7 +5547,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
if (dirtied)
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 3c0d20f4659..46bc57812df 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -216,7 +216,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -476,7 +476,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -507,7 +507,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u32(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 366d70d38a1..9d39df998cb 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -474,13 +474,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -548,13 +548,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 13ae57ed649..4f6274eb573 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 8c563510f4c..3a280f4caae 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -149,4 +149,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += val; \
+ } while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += val; \
+ } while(0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
[application/octet-stream] v1-0001-Separate-node-instrumentation-from-other-use-of-I.patch (21.8K, 4-v1-0001-Separate-node-instrumentation-from-other-use-of-I.patch)
From 3e35547712f1fb10bce4fb3908912a282196b198 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v1 1/3] Separate node instrumentation from other use of
Instrumentation struct
Previously different places (e.g. query "total time") were repurposing
the Instrumentation struct initially introduced for capturing per-node
statistics during execution. This dual use of the struct is confusing,
e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
code paths, and prevents future refactorings.
Instead, simplify the Instrumentation struct to only track time,
WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop
outside of per-node instrumentation. Introduce the NodeInstrumentation
struct to carry forward the per-node instrumentation information.
---
contrib/auto_explain/auto_explain.c | 10 +--
.../pg_stat_statements/pg_stat_statements.c | 10 +--
contrib/postgres_fdw/postgres_fdw.c | 2 +-
src/backend/commands/explain.c | 11 +--
src/backend/commands/trigger.c | 8 +-
src/backend/executor/execMain.c | 10 +--
src/backend/executor/execParallel.c | 22 +++--
src/backend/executor/execProcnode.c | 4 +-
src/backend/executor/instrument.c | 86 +++++++++++++++++--
src/include/executor/instrument.h | 51 ++++++++---
src/include/nodes/execnodes.h | 3 +-
11 files changed, 153 insertions(+), 64 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 1f4badb4928..bd059f12224 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -381,14 +381,8 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
- msec = queryDesc->totaltime->total * 1000.0;
+ msec = INSTR_TIME_GET_DOUBLE(queryDesc->totaltime->total) * 1000.0;
if (msec >= auto_explain_log_min_duration)
{
ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1cb368c8590..a1e69789c73 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1021,7 +1021,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -1080,18 +1080,12 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
if (queryId != INT64CONST(0) && queryDesc->totaltime &&
pgss_enabled(nesting_level))
{
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
pgss_store(queryDesc->sourceText,
queryId,
queryDesc->plannedstmt->stmt_location,
queryDesc->plannedstmt->stmt_len,
PGSS_EXEC,
- queryDesc->totaltime->total * 1000.0, /* convert to msec */
+ INSTR_TIME_GET_DOUBLE(queryDesc->totaltime->total) * 1000.0, /* convert to msec */
queryDesc->estate->es_total_processed,
&queryDesc->totaltime->bufusage,
&queryDesc->totaltime->walusage,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 456b267f70b..7619ac486c0 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2778,7 +2778,7 @@ postgresIterateDirectModify(ForeignScanState *node)
if (!resultRelInfo->ri_projectReturning)
{
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- Instrumentation *instr = node->ss.ps.instrument;
+ NodeInstrumentation *instr = node->ss.ps.instrument;
Assert(!dmstate->has_returning);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8345bc0264b..6a135e51996 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1102,9 +1102,6 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
char *relname;
char *conname = NULL;
- /* Must clean up instrumentation state */
- InstrEndLoop(instr);
-
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
@@ -1135,7 +1132,7 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- 1000.0 * instr->total, instr->ntuples);
+ 1000.0 * INSTR_TIME_GET_DOUBLE(instr->total), instr->ntuples);
else
appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
}
@@ -1146,7 +1143,7 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Constraint Name", conname, es);
ExplainPropertyText("Relation", relname, es);
if (es->timing)
- ExplainPropertyFloat("Time", "ms", 1000.0 * instr->total, 3,
+ ExplainPropertyFloat("Time", "ms", 1000.0 * INSTR_TIME_GET_DOUBLE(instr->total), 3,
es);
ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
}
@@ -1888,7 +1885,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
double startup_ms;
double total_ms;
@@ -2294,7 +2291,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
if (nloops <= 0)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 579ac8d76ae..9b53dd99e99 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2344,7 +2344,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStart(instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2392,7 +2392,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* one "tuple returned" (really the number of firings).
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStop(instr + tgindx, 1);
return (HeapTuple) DatumGetPointer(result);
}
@@ -4381,7 +4381,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStart(instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4607,7 +4607,7 @@ AfterTriggerExecute(EState *estate,
* one "tuple returned" (really the number of firings).
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStop(instr + tgindx, 1);
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b8b9d2a85f7..b83ced9a57a 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -331,7 +331,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/*
* extract information from the query descriptor and the query feature.
@@ -383,7 +383,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime, estate->es_processed);
MemoryContextSwitchTo(oldcontext);
}
@@ -433,7 +433,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -443,7 +443,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
AfterTriggerEndQuery(estate);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime, 0);
MemoryContextSwitchTo(oldcontext);
@@ -1247,7 +1247,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0(n * sizeof(ExprState *));
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+ resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
}
else
{
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557cf..e87810d292e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -85,7 +85,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object. This will depend on the length of
+ * of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -102,11 +102,15 @@ struct SharedExecutorInstrumentation
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
- /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+ /*
+ * array of num_plan_nodes * num_workers NodeInstrumentation objects
+ * follows
+ */
};
#define GetInstrumentationArray(sei) \
(AssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
- (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+ (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
@@ -713,7 +717,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len = MAXALIGN(instrumentation_len);
instrument_offset = instrumentation_len;
instrumentation_len +=
- mul_size(sizeof(Instrumentation),
+ mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -799,7 +803,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*/
if (estate->es_instrument)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -809,7 +813,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInit(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1036,7 +1040,7 @@ static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
int n;
int ibytes;
@@ -1064,7 +1068,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* Switch into per-query memory context.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
planstate->worker_instrument =
palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
MemoryContextSwitchTo(oldcontext);
@@ -1296,7 +1300,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
{
int i;
int plan_node_id = planstate->plan->plan_node_id;
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
InstrEndLoop(planstate->instrument);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f5f9cfbeead..d286471254b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -413,8 +413,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument,
- result->async_capable);
+ result->instrument = InstrAllocNode(1, estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 56e635f4700..c53480d8030 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,9 +26,9 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int n, int instrument_options)
{
Instrumentation *instr;
@@ -41,6 +41,74 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
int i;
+ for (i = 0; i < n; i++)
+ {
+ instr[i].need_bufusage = need_buffers;
+ instr[i].need_walusage = need_wal;
+ instr[i].need_timer = need_timer;
+ }
+ }
+
+ return instr;
+}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer &&
+ !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
+ elog(ERROR, "InstrStart called twice in a row");
+
+ if (instr->need_bufusage)
+ instr->bufusage_start = pgBufferUsage;
+
+ if (instr->need_walusage)
+ instr->walusage_start = pgWalUsage;
+}
+void
+InstrStop(Instrumentation *instr, double nTuples)
+{
+ instr_time endtime;
+
+ /* count the specified tuples */
+ instr->ntuples += nTuples;
+
+ /* let's update the time only if the timer was requested */
+ if (instr->need_timer)
+ {
+ if (INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStop called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
+ }
+
+ /* Add delta of buffer usage since entry to node's totals */
+ if (instr->need_bufusage)
+ BufferUsageAccumDiff(&instr->bufusage,
+ &pgBufferUsage, &instr->bufusage_start);
+
+ if (instr->need_walusage)
+ WalUsageAccumDiff(&instr->walusage,
+ &pgWalUsage, &instr->walusage_start);
+}
+
+/* Allocate new node instrumentation structure(s) */
+NodeInstrumentation *
+InstrAllocNode(int n, int instrument_options, bool async_mode)
+{
+ NodeInstrumentation *instr;
+
+ /* initialize all fields to zeroes, then modify as needed */
+ instr = palloc0(n * sizeof(NodeInstrumentation));
+ if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
+ {
+ bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+ bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
+ bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+ int i;
+
for (i = 0; i < n; i++)
{
instr[i].need_bufusage = need_buffers;
@@ -55,9 +123,9 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
/* Initialize a pre-allocated instrumentation structure. */
void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitNode(NodeInstrumentation * instr, int instrument_options)
{
- memset(instr, 0, sizeof(Instrumentation));
+ memset(instr, 0, sizeof(NodeInstrumentation));
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
@@ -65,7 +133,7 @@ InstrInit(Instrumentation *instr, int instrument_options)
/* Entry to a plan node */
void
-InstrStartNode(Instrumentation *instr)
+InstrStartNode(NodeInstrumentation * instr)
{
if (instr->need_timer &&
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
@@ -81,7 +149,7 @@ InstrStartNode(Instrumentation *instr)
/* Exit from a plan node */
void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStopNode(NodeInstrumentation * instr, double nTuples)
{
double save_tuplecount = instr->tuplecount;
instr_time endtime;
@@ -129,7 +197,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
/* Update tuple count */
void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples)
{
/* count the returned tuples */
instr->tuplecount += nTuples;
@@ -137,7 +205,7 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation * instr)
{
double totaltime;
@@ -166,7 +234,7 @@ InstrEndLoop(Instrumentation *instr)
/* aggregate instrumentation information */
void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
{
if (!dst->running && add->running)
{
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 03653ab6c6c..8c563510f4c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -66,7 +66,33 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/*
+ * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured inbetween).
+ */
typedef struct Instrumentation
+{
+ /* Parameters set at creation: */
+ bool need_timer; /* true if we need timer data */
+ bool need_bufusage; /* true if we need buffer usage data */
+ bool need_walusage; /* true if we need WAL usage data */
+ /* Internal state keeping: */
+ instr_time starttime; /* start time of current iteration of node */
+ BufferUsage bufusage_start; /* buffer usage at start */
+ WalUsage walusage_start; /* WAL usage at start */
+ /* Accumulated statistics: */
+ instr_time total; /* total runtime */
+ double ntuples; /* total tuples counted in InstrStop */
+ BufferUsage bufusage; /* total buffer usage */
+ WalUsage walusage; /* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
{
/* Parameters set at node creation: */
bool need_timer; /* true if we need timer data */
@@ -91,25 +117,30 @@ typedef struct Instrumentation
double nfiltered2; /* # of tuples removed by "other" quals */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
typedef struct WorkerInstrumentation
{
int num_workers; /* # of structures that follow */
- Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+ NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
- bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int n, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr, double nTuples);
+
+extern NodeInstrumentation * InstrAllocNode(int n, int instrument_options,
+ bool async_mode);
+extern void InstrInitNode(NodeInstrumentation * instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation * instr);
+extern void InstrStopNode(NodeInstrumentation * instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation * instr);
+extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
+
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index de782014b2d..9b3bd66d401 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1169,7 +1169,8 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
- Instrumentation *instrument; /* Optional runtime stats for this node */
+ NodeInstrumentation *instrument; /* Optional runtime stats for this
+ * node */
WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
/* Per-worker JIT instrumentation */
--
2.47.1
[application/octet-stream] v1-0003-Introduce-stack-for-tracking-per-node-WAL-buffer-.patch (12.3K, 5-v1-0003-Introduce-stack-for-tracking-per-node-WAL-buffer-.patch)
download | inline diff:
From 4375fcb4141f18d6cd927659970518553aa3fe94 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 31 Aug 2025 16:37:05 -0700
Subject: [PATCH v1 3/3] Introduce stack for tracking per-node WAL/buffer usage
---
src/backend/commands/explain.c | 8 +-
src/backend/executor/execMain.c | 7 ++
src/backend/executor/execProcnode.c | 9 +++
src/backend/executor/instrument.c | 111 ++++++++++++++++++++++++----
src/include/executor/instrument.h | 42 ++++++++++-
5 files changed, 155 insertions(+), 22 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 6a135e51996..584f0adbcc1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2280,9 +2280,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage);
+ show_buffer_usage(es, &planstate->instrument->stack.bufusage);
if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ show_wal_usage(es, &planstate->instrument->stack.walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -2299,9 +2299,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->bufusage);
+ show_buffer_usage(es, &instrument->stack.bufusage);
if (es->wal)
- show_wal_usage(es, &instrument->walusage);
+ show_wal_usage(es, &instrument->stack.walusage);
ExplainCloseWorker(n, es);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b83ced9a57a..1c2268bc608 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -312,6 +312,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
DestReceiver *dest;
bool sendTuples;
MemoryContext oldcontext;
+ InstrStackResource *res;
/* sanity checks */
Assert(queryDesc != NULL);
@@ -333,6 +334,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
if (queryDesc->totaltime)
InstrStart(queryDesc->totaltime);
+ /* Start up per-query node level instrumentation */
+ res = InstrStartQuery();
+
/*
* extract information from the query descriptor and the query feature.
*/
@@ -382,6 +386,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
if (sendTuples)
dest->rShutdown(dest);
+ /* Shut down per-query node level instrumentation */
+ InstrShutdownQuery(res);
+
if (queryDesc->totaltime)
InstrStop(queryDesc->totaltime, estate->es_processed);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d286471254b..7436f307994 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
/* Stop the node if we started it above, reporting 0 tuples. */
if (node->instrument && node->instrument->running)
+ {
InstrStopNode(node->instrument, 0);
+ /*
+ * Propagate WAL/buffer stats to the parent node on the
+ * instrumentation stack (which is where InstrStopNode returned us
+ * to).
+ */
+ InstrNodeAddToCurrent(&node->instrument->stack);
+ }
+
return false;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index c53480d8030..040d1fdecbd 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,15 +16,40 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
static WalUsage save_pgWalUsage;
+InstrStack *pgInstrStack = NULL;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
+/*
+ * Node-specific instrumentation handling uses ResourceOwner mechanism to
+ * reset pgInstrStack on abort.
+ */
+static void ResOwnerReleaseInstrStack(Datum res);
+static const ResourceOwnerDesc instr_stack_resowner_desc =
+{
+ .name = "instrumentation stack scope",
+ .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
+ .release_priority = RELEASE_PRIO_FIRST,
+ .ReleaseResource = ResOwnerReleaseInstrStack,
+ .DebugPrint = NULL, /* default message is fine */
+};
+static inline void
+ResourceOwnerRememberInstrStack(ResourceOwner owner, InstrStackResource * scope)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(scope), &instr_stack_resowner_desc);
+}
+static inline void
+ResourceOwnerForgetInstrStack(ResourceOwner owner, InstrStackResource * scope)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(scope), &instr_stack_resowner_desc);
+}
/* General purpose instrumentation handling */
Instrumentation *
@@ -139,12 +164,17 @@ InstrStartNode(NodeInstrumentation * instr)
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
elog(ERROR, "InstrStartNode called twice in a row");
- /* save buffer usage totals at node entry, if needed */
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_bufusage || instr->need_walusage)
+ {
+ /*
+ * Ensure that we have an active pgInstrStack (InstrStartQuery must
+ * have been called)
+ */
+ Assert(pgInstrStack != NULL);
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ instr->stack.previous = pgInstrStack;
+ pgInstrStack = &instr->stack;
+ }
}
/* Exit from a plan node */
@@ -169,14 +199,12 @@ InstrStopNode(NodeInstrumentation * instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
-
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ if (instr->need_bufusage || instr->need_walusage)
+ {
+ /* Ensure that there is a stack entry above the top-most node */
+ Assert(instr->stack.previous != NULL);
+ pgInstrStack = instr->stack.previous;
+ }
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -257,10 +285,65 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
/* Add delta of buffer usage since entry to node's totals */
if (dst->need_bufusage)
- BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ BufferUsageAdd(&dst->stack.bufusage, &add->stack.bufusage);
if (dst->need_walusage)
+ WalUsageAdd(&dst->stack.walusage, &add->stack.walusage);
+}
+
+InstrStackResource *
+InstrStartQuery()
+{
+ InstrStack *usage = MemoryContextAllocZero(CurTransactionContext, sizeof(InstrStack));
+ InstrStackResource *usageRes = MemoryContextAllocZero(CurTransactionContext, sizeof(InstrStackResource));
+ ResourceOwner owner = CurrentResourceOwner;
+
+ Assert(owner != NULL);
+
+ usageRes->owner = owner;
+
+ ResourceOwnerEnlarge(owner);
+ ResourceOwnerRememberInstrStack(owner, usageRes);
+
+ usage->previous = pgInstrStack;
+ pgInstrStack = usage;
+
+ return usageRes;
+}
+
+void
+InstrShutdownQuery(InstrStackResource * res)
+{
+ Assert(res != NULL);
+ Assert(res->owner != NULL);
+
+ pgInstrStack = res->previous;
+
+ ResourceOwnerForgetInstrStack(res->owner, res);
+}
+
+static void
+ResOwnerReleaseInstrStack(Datum res)
+{
+ /*
+ * XXX: Registered resources are *not* called in reverse order, i.e. we'll
+ * get what was first registered first at shutdown. To avoid handling
+ * that, we are resetting the stack here on abort (instead of recovering
+ * to previous).
+ */
+ pgInstrStack = NULL;
+}
+
+void
+InstrNodeAddToCurrent(InstrStack * add)
+{
+ if (pgInstrStack != NULL)
+ {
+ InstrStack *dst = pgInstrStack;
+
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
WalUsageAdd(&dst->walusage, &add->walusage);
+ }
}
/* note current values during parallel executor startup */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 3a280f4caae..a98efab5f93 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
#define INSTRUMENT_H
#include "portability/instr_time.h"
+#include "utils/resowner.h"
/*
@@ -66,6 +67,21 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/* Stack of WAL/buffer usage used for per-node instrumentation */
+typedef struct InstrStack
+{
+ struct InstrStack *previous;
+ BufferUsage bufusage;
+ WalUsage walusage;
+} InstrStack;
+
+/* Used to manage resetting of instrumentation stack on abort. */
+typedef struct InstrStackResource
+{
+ InstrStack *previous;
+ ResourceOwner owner;
+} InstrStackResource;
+
/*
* General purpose instrumentation that can capture time, WAL/buffer usage and tuples
*
@@ -91,6 +107,10 @@ typedef struct Instrumentation
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Requires use of InstrStartQuery to initialize the stack used for WAL/buffer
+ * usage statistics, and cleanup through InstrShutdownQuery. Solely intended for
+ * the executor and anyone reporting about its activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -105,8 +125,6 @@ typedef struct NodeInstrumentation
instr_time counter; /* accumulated runtime for this node */
double firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
double startup; /* total startup time (in seconds) */
double total; /* total time (in seconds) */
@@ -115,8 +133,7 @@ typedef struct NodeInstrumentation
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
+ InstrStack stack; /* stack tracking buffer/WAL usage */
} NodeInstrumentation;
typedef struct WorkerInstrumentation
@@ -127,6 +144,7 @@ typedef struct WorkerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
+extern PGDLLIMPORT InstrStack * pgInstrStack;
extern Instrumentation *InstrAlloc(int n, int instrument_options);
extern void InstrStart(Instrumentation *instr);
@@ -135,11 +153,14 @@ extern void InstrStop(Instrumentation *instr, double nTuples);
extern NodeInstrumentation * InstrAllocNode(int n, int instrument_options,
bool async_mode);
extern void InstrInitNode(NodeInstrumentation * instr, int instrument_options);
+extern InstrStackResource * InstrStartQuery(void);
+extern void InstrShutdownQuery(InstrStackResource * res);
extern void InstrStartNode(NodeInstrumentation * instr);
extern void InstrStopNode(NodeInstrumentation * instr, double nTuples);
extern void InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation * instr);
extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
+extern void InstrNodeAddToCurrent(InstrStack * stack);
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
@@ -151,21 +172,34 @@ extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
#define INSTR_BUFUSAGE_INCR(fld) do { \
pgBufferUsage.fld++; \
+ if (pgInstrStack) \
+ pgInstrStack->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
pgBufferUsage.fld += val; \
+ if (pgInstrStack) \
+ pgInstrStack->bufusage.fld += val; \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ if (pgInstrStack) \
+ INSTR_TIME_ADD(pgInstrStack->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ if (pgInstrStack) \
+ INSTR_TIME_ACCUM_DIFF(pgInstrStack->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ if (pgInstrStack) \
+ pgInstrStack->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += val; \
+ if (pgInstrStack) \
+ pgInstrStack->walusage.fld += val; \
} while(0)
#endif /* INSTRUMENT_H */
--
2.47.1
* Re: Stack-based tracking of per-node WAL/buffer usage
2025-08-31 23:57 Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
@ 2025-09-04 20:23 ` Andres Freund <[email protected]>
2025-09-09 19:35 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Andres Freund @ 2025-09-04 20:23 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>
Hi,
On 2025-08-31 16:57:01 -0700, Lukas Fittl wrote:
> Please find attached a patch series that introduces a new paradigm for how
> per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce
> overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated
> distinct buffer hits [0].
I like this for a third reason: it would let us separate out buffer access
statistics for the index and the table in index scans. Right now it's very
hard to figure out whether a query is slow because of the index lookups or
because of finding the tuples in the table.
> 0001: Separate node instrumentation from other use of Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the per-node Instrumentation struct. Instead, simplify the Instrumentation
> struct to only track time, WAL/buffer usage, and tuple counts. Similarly,
> drop the use of InstrEndLoop outside of per-node instrumentation. Introduce
> the NodeInstrumentation struct to carry forward the per-node
> instrumentation information.
It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).
> 0003: Introduce stack for tracking per-node WAL/buffer usage
> From 4375fcb4141f18d6cd927659970518553aa3fe94 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sun, 31 Aug 2025 16:37:05 -0700
> Subject: [PATCH v1 3/3] Introduce stack for tracking per-node WAL/buffer usage
> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
> index b83ced9a57a..1c2268bc608 100644
> --- a/src/backend/executor/execMain.c
> +++ b/src/backend/executor/execMain.c
> @@ -312,6 +312,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
> DestReceiver *dest;
> bool sendTuples;
> MemoryContext oldcontext;
> + InstrStackResource *res;
>
> /* sanity checks */
> Assert(queryDesc != NULL);
> @@ -333,6 +334,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
> if (queryDesc->totaltime)
> InstrStart(queryDesc->totaltime);
>
> + /* Start up per-query node level instrumentation */
> + res = InstrStartQuery();
> +
> /*
> * extract information from the query descriptor and the query feature.
> */
> @@ -382,6 +386,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
> if (sendTuples)
> dest->rShutdown(dest);
>
> + /* Shut down per-query node level instrumentation */
> + InstrShutdownQuery(res);
> +
> if (queryDesc->totaltime)
> InstrStop(queryDesc->totaltime, estate->es_processed);
Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.
> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d286471254b..7436f307994 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
>
> /* Stop the node if we started it above, reporting 0 tuples. */
> if (node->instrument && node->instrument->running)
> + {
> InstrStopNode(node->instrument, 0);
>
> + /*
> + * Propagate WAL/buffer stats to the parent node on the
> + * instrumentation stack (which is where InstrStopNode returned us
> + * to).
> + */
> + InstrNodeAddToCurrent(&node->instrument->stack);
> + }
> +
> return false;
> }
Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:
/*
* If we know we won't need to back up, we can release resources at this
* point.
*/
if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
ExecShutdownNode(planstate);
> +static void
> +ResOwnerReleaseInstrStack(Datum res)
> +{
> + /*
> + * XXX: Registered resources are *not* called in reverse order, i.e. we'll
> + * get what was first registered first at shutdown. To avoid handling
> + * that, we are resetting the stack here on abort (instead of recovering
> + * to previous).
> + */
> + pgInstrStack = NULL;
> +}
Hm, doesn't that mean we lose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally triggers
an error and catches it?
I wonder if the solution could be to walk the stack and search for the
to-be-released element.
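A minimal sketch of that idea, assuming a simplified InstrStack that carries a single counter instead of the full BufferUsage/WalUsage pair. The helper names instr_push and instr_release are hypothetical and not part of the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the patch's InstrStack (assumption: one counter). */
typedef struct InstrStack
{
	struct InstrStack *previous;
	long		blks_hit;
} InstrStack;

static InstrStack *pgInstrStack = NULL;

/* Hypothetical helper: make 'entry' the new top of the stack. */
static void
instr_push(InstrStack *entry)
{
	assert(entry != NULL);
	entry->previous = pgInstrStack;
	pgInstrStack = entry;
}

/*
 * Hypothetical release callback: walk down from the current top looking for
 * 'released'.  If it is found, unwind to the entry below it, which also
 * discards anything that was pushed above it.  If it is not found, it was
 * already unwound, so the stack is left untouched.
 */
static bool
instr_release(InstrStack *released)
{
	for (InstrStack *cur = pgInstrStack; cur != NULL; cur = cur->previous)
	{
		if (cur == released)
		{
			pgInstrStack = released->previous;
			return true;
		}
	}
	return false;
}
```

With this shape, an error caught inside a function would only unwind the entries belonging to the aborted scope, and the outer query's entry would stay active rather than being reset to NULL.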
Greetings,
Andres Freund
* Re: Stack-based tracking of per-node WAL/buffer usage
2025-08-31 23:57 Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-09-04 20:23 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
@ 2025-09-09 19:35 ` Lukas Fittl <[email protected]>
2025-10-22 11:28 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Lukas Fittl @ 2025-09-09 19:35 UTC (permalink / raw)
To: Andres Freund <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>
Hi Andres,
Thanks for the review!
Attached is an updated patch set that addresses the feedback. It also
removes the global pgBufferUsage variable entirely in later patches
(0005-0007), to avoid counting in both the stack and the variable.
FWIW, pgWalUsage would also be nice to remove, but it has some interesting
interactions with the cumulative statistics system and
heap_page_prune_and_freeze, and seems less performance critical.
On Thu, Sep 4, 2025 at 1:23 PM Andres Freund <[email protected]> wrote:
> > 0001: Separate node instrumentation from other use of Instrumentation struct
> ...
> It's mildly odd that the two types of instrumentation have a different
> definition of 'total' (one double, one instr_time).
>
Yeah, agreed. I added a new 0001 patch that changes this to instr_time
consistently. I don't see a good reason to keep the transformation to
seconds in the Instrumentation logic, since all in-tree callers convert it
to milliseconds anyway.
> > 0003: Introduce stack for tracking per-node WAL/buffer usage
>
> Why are we doing Instr{Start,Stop}Query when not using instrumentation?
> Resowner stuff ain't free, so I'd skip them when not necessary.
>
Makes sense, I've adjusted that to be conditional (in the now renamed 0004
patch).
In the updated patch I've also decided to piggyback on QueryDesc totaltime
as the "owning" Instrumentation here for the query's lifetime. It seems
simpler that way and avoids having special purpose methods. To go along
with that I've changed the general purpose Instrumentation struct to use
stack-based instrumentation at the same time.
> > diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> > index d286471254b..7436f307994 100644
> > --- a/src/backend/executor/execProcnode.c
> > +++ b/src/backend/executor/execProcnode.c
> > @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
> ...
> > + InstrNodeAddToCurrent(&node->instrument->stack);
> ...
>
> Can we rely on this being reached? Note that ExecutePlan() calls
> ExecShutdownNode() conditionally:
You are of course correct, I didn't consider cursors correctly here.
It seems there isn't an existing executor node walk that could be
repurposed, so I added a new one in ExecutorFinish
("ExecAccumNodeInstrumentation"). From my read of the code there are no use
cases where we need aggregated instrumentation data before ExecutorFinish.
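To illustrate why such a walk is needed: with the stack, counters land only on the entry that is on top at the time, so each node initially holds just its own activity, while EXPLAIN reports totals that include the children. A rough sketch of a post-order accumulation follows, using a hypothetical PlanNode struct and accum_node function with a single counter. The actual ExecAccumNodeInstrumentation in the patch may be structured differently:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified plan node: two children and one usage counter. */
typedef struct PlanNode
{
	struct PlanNode *left;
	struct PlanNode *right;
	long		blks_hit;		/* hits counted while this node was on top */
} PlanNode;

/*
 * Post-order walk: fold each child's (already inclusive) total into its
 * parent, so that every node ends up with its own activity plus that of its
 * whole subtree.  Returns the inclusive total for 'node'.
 */
static long
accum_node(PlanNode *node)
{
	if (node == NULL)
		return 0;

	node->blks_hit += accum_node(node->left);
	node->blks_hit += accum_node(node->right);

	return node->blks_hit;
}
```

Running the walk once at the end of execution keeps the per-tuple path (InstrStartNode/InstrStopNode) free of any accumulation work.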
> > +static void
> > +ResOwnerReleaseInstrStack(Datum res)
> > +{
> > + /*
> > + * XXX: Registered resources are *not* called in reverse order, i.e. we'll
> > + * get what was first registered first at shutdown. To avoid handling
> > + * that, we are resetting the stack here on abort (instead of recovering
> > + * to previous).
> > + */
> > + pgInstrStack = NULL;
> > +}
>
> Hm, doesn't that mean we lose track of instrumentation if you e.g. do an
> EXPLAIN ANALYZE of a query that executes a function, which internally
> triggers an error and catches it?
>
> I wonder if the solution could be to walk the stack and search for the
> to-be-released element.
>
Yes, good point, I did not consider that case. I've addressed this by only
updating the current stack if it's not already a parent of the element being
released. We are also always adding the element's statistics to the
(updated) active stack element at abort.
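In simplified form, the abort handling described above could look roughly like this. It is only a sketch with a single counter standing in for BufferUsage/WalUsage, and the names instr_is_on_stack and instr_release_on_abort are made up for illustration rather than taken from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for InstrStack (assumption: one counter). */
typedef struct InstrStack
{
	struct InstrStack *previous;
	long		blks_hit;
} InstrStack;

static InstrStack *pgInstrStack = NULL;

/* Hypothetical helper: is 'entry' reachable from the current top? */
static bool
instr_is_on_stack(const InstrStack *entry)
{
	for (const InstrStack *cur = pgInstrStack; cur != NULL; cur = cur->previous)
	{
		if (cur == entry)
			return true;
	}
	return false;
}

/*
 * Hypothetical abort-time release.  Resources may be released in
 * registration order rather than reverse order, so 'released' may already
 * have been unwound by an earlier release of one of its parents.
 */
static void
instr_release_on_abort(InstrStack *released)
{
	assert(released != NULL);

	/* Only move the top if the released entry is still on the stack. */
	if (instr_is_on_stack(released))
		pgInstrStack = released->previous;

	/* Always fold the released entry's counters into the active entry. */
	if (pgInstrStack != NULL)
		pgInstrStack->blks_hit += released->blks_hit;
}
```

Releasing an outer entry before an inner one then still leaves the stack pointing at the surviving parent, and both entries' counters end up on that parent.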
Thanks,
Lukas
--
Lukas Fittl
Attachments:
[application/octet-stream] v2-0006-Introduce-alternate-Instrumentation-stack-mechani.patch (4.8K, 3-v2-0006-Introduce-alternate-Instrumentation-stack-mechani.patch)
download | inline diff:
From 071887cb7979b6537ca91834ccfe8dbc92bc34d2 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:26:02 -0700
Subject: [PATCH v2 6/7] Introduce alternate Instrumentation stack mechanism
relying on PG_FINALLY
The resource owner-based Instrumentation stack cannot handle wrapping
certain utility commands that close and re-open the top-level transaction,
like the CLUSTER command. This is a problem for pg_stat_statements tracking
of utility commands specifically. To support tracking such activity, allow
issuing explicit InstrPushStack/InstrPopStack commands to modify the stack,
with the InstrPopStack in a PG_FINALLY to ensure cleanup on abort.
---
.../pg_stat_statements/pg_stat_statements.c | 50 +++++--------------
src/include/executor/instrument.h | 3 ++
2 files changed, 15 insertions(+), 38 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index deb6d43a47f..2d21dfbfcfb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -909,21 +909,13 @@ pgss_planner(Query *parse,
{
instr_time start;
instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ InstrStack *stack;
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
-
- /*
- * Similarly the planner could write some WAL records in some cases
- * (e.g. setting a hint bit with those being WAL-logged)
- */
- walusage_start = pgWalUsage;
INSTR_TIME_SET_CURRENT(start);
+ /* We need to track buffer/WAL usage as the planner can access them. */
+ stack = InstrPushStack();
+
nesting_level++;
PG_TRY();
{
@@ -936,6 +928,7 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrPopStack(stack);
nesting_level--;
}
PG_END_TRY();
@@ -943,14 +936,6 @@ pgss_planner(Query *parse,
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
@@ -958,8 +943,8 @@ pgss_planner(Query *parse,
PGSS_PLAN,
INSTR_TIME_GET_MILLISEC(duration),
0,
- &bufusage,
- &walusage,
+ &stack->bufusage,
+ &stack->walusage,
NULL,
NULL,
0,
@@ -1155,14 +1140,10 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
instr_time start;
instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ InstrStack *stack;
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
INSTR_TIME_SET_CURRENT(start);
+ stack = InstrPushStack();
nesting_level++;
PG_TRY();
@@ -1178,6 +1159,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrPopStack(stack);
nesting_level--;
}
PG_END_TRY();
@@ -1206,14 +1188,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
@@ -1221,8 +1195,8 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
PGSS_EXEC,
INSTR_TIME_GET_MILLISEC(duration),
rows,
- &bufusage,
- &walusage,
+ &stack->bufusage,
+ &stack->walusage,
NULL,
NULL,
0,
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index bf766706580..8804ee64311 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -147,6 +147,9 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
extern void InstrStart(Instrumentation *instr);
extern void InstrStop(Instrumentation *instr, double nTuples, bool finalize);
+extern InstrStack * InstrPushStack(void);
+extern void InstrPopStack(InstrStack * res);
+
extern NodeInstrumentation * InstrAllocNode(int n, int instrument_options,
bool async_mode);
extern void InstrInitNode(NodeInstrumentation * instr, int instrument_options);
--
2.47.1
[application/octet-stream] v2-0007-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch (17.7K, 4-v2-0007-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch)
download | inline diff:
From 5626b4a35f90826e7835a4f047b99fd41648e724 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:26:56 -0700
Subject: [PATCH v2 7/7] Convert remaining users of pgBufferUsage to use
InstrStart/InstrStop, drop the global
---
src/backend/access/heap/vacuumlazy.c | 35 ++++++++---------
src/backend/commands/analyze.c | 35 ++++++++---------
src/backend/commands/explain.c | 26 +++++--------
src/backend/commands/explain_dr.c | 31 ++++++++-------
src/backend/commands/prepare.c | 26 +++++--------
src/backend/executor/instrument.c | 56 +++++++++++-----------------
src/include/executor/instrument.h | 8 +---
7 files changed, 90 insertions(+), 127 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 981d9380a92..b7978b385f3 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -628,8 +628,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ Instrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
@@ -644,6 +643,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -946,14 +947,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr, 0, true);
+
if (verbose || params.log_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -964,17 +965,13 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
int64 total_blks_dirtied;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+
+ total_blks_hit = INSTR_GET_BUFUSAGE(instr).shared_blks_hit +
+ INSTR_GET_BUFUSAGE(instr).local_blks_hit;
+ total_blks_read = INSTR_GET_BUFUSAGE(instr).shared_blks_read +
+ INSTR_GET_BUFUSAGE(instr).local_blks_read;
+ total_blks_dirtied = INSTR_GET_BUFUSAGE(instr).shared_blks_dirtied +
+ INSTR_GET_BUFUSAGE(instr).local_blks_dirtied;
initStringInfo(&buf);
if (verbose)
@@ -1136,10 +1133,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
total_blks_dirtied);
appendStringInfo(&buf,
_("WAL usage: %" PRId64 " records, %" PRId64 " full page images, %" PRIu64 " bytes, %" PRId64 " buffers full\n"),
- walusage.wal_records,
- walusage.wal_fpi,
- walusage.wal_bytes,
- walusage.wal_buffers_full);
+ INSTR_GET_WALUSAGE(instr).wal_records,
+ INSTR_GET_WALUSAGE(instr).wal_fpi,
+ INSTR_GET_WALUSAGE(instr).wal_bytes,
+ INSTR_GET_WALUSAGE(instr).wal_buffers_full);
appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
ereport(verbose ? INFO : LOG,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8ea2913d906..a407d85f864 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -302,9 +302,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ Instrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -355,6 +353,9 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -735,12 +736,13 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr, 0, true);
+
if (verbose || params.log_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -749,17 +751,12 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
int64 total_blks_read;
int64 total_blks_dirtied;
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ total_blks_hit = INSTR_GET_BUFUSAGE(instr).shared_blks_hit +
+ INSTR_GET_BUFUSAGE(instr).local_blks_hit;
+ total_blks_read = INSTR_GET_BUFUSAGE(instr).shared_blks_read +
+ INSTR_GET_BUFUSAGE(instr).local_blks_read;
+ total_blks_dirtied = INSTR_GET_BUFUSAGE(instr).shared_blks_dirtied +
+ INSTR_GET_BUFUSAGE(instr).local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
@@ -832,10 +829,10 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
total_blks_dirtied);
appendStringInfo(&buf,
_("WAL usage: %" PRId64 " records, %" PRId64 " full page images, %" PRIu64 " bytes, %" PRId64 " buffers full\n"),
- walusage.wal_records,
- walusage.wal_fpi,
- walusage.wal_bytes,
- walusage.wal_buffers_full);
+ INSTR_GET_WALUSAGE(instr).wal_records,
+ INSTR_GET_WALUSAGE(instr).wal_fpi,
+ INSTR_GET_WALUSAGE(instr).wal_bytes,
+ INSTR_GET_WALUSAGE(instr).wal_buffers_full);
appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
ereport(verbose ? INFO : LOG,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 1c4d0e14334..aabb5e1dd08 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -321,14 +321,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+ else
+ instr = InstrAlloc(1, INSTRUMENT_TIMER);
+
if (es->memory)
{
/*
@@ -345,15 +347,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(instr, 0, true);
if (es->memory)
{
@@ -361,16 +360,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->total, (es->buffers ? &INSTR_GET_BUFUSAGE(instr) : NULL),
es->memory ? &mem_counters : NULL);
}
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 5715546cf43..51f498aa1b4 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -109,15 +109,20 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ Instrumentation *instr = NULL;
/* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ if (myState->es->timing || myState->es->buffers)
+ {
+ InstrumentOption instrument_options = 0;
+
+ if (myState->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (myState->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ instr = InstrAlloc(1, instrument_options);
+ InstrStart(instr);
+ }
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -185,18 +190,16 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
+ if (myState->es->timing || myState->es->buffers)
+ InstrStop(instr, 0, true);
+
/* Update timing data */
if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
+ INSTR_TIME_ADD(myState->metrics.timeSpent, instr->total);
/* Update buffer metrics */
if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ BufferUsageAdd(&myState->metrics.bufferUsage, &INSTR_GET_BUFUSAGE(instr));
return true;
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 34b6410d6a2..d92aeb6a1df 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -578,14 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+ else
+ instr = InstrAlloc(1, INSTRUMENT_TIMER);
+
if (es->memory)
{
/* See ExplainOneQuery about this */
@@ -596,9 +598,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -633,8 +633,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(instr, 0, true);
if (es->memory)
{
@@ -642,13 +641,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -658,7 +650,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->total, (es->buffers ? &INSTR_GET_BUFUSAGE(instr) : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index d5fdbecb025..d61830a7fd8 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -18,11 +18,9 @@
#include "executor/instrument.h"
#include "utils/memutils.h"
-BufferUsage pgBufferUsage;
WalUsage pgWalUsage;
InstrStack *pgInstrStack = NULL;
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
/*
@@ -113,6 +111,27 @@ ResOwnerReleaseInstrumentation(Datum res)
instr->finalized = true;
}
+InstrStack *
+InstrPushStack(void)
+{
+ InstrStack *stack = palloc0(sizeof(InstrStack));
+
+ stack->previous = pgInstrStack;
+ pgInstrStack = stack;
+
+ return stack;
+}
+
+void
+InstrPopStack(InstrStack * stack)
+{
+ Assert(stack != NULL);
+
+ pgInstrStack = stack->previous;
+ if (pgInstrStack)
+ InstrStackAdd(pgInstrStack, stack);
+}
+
/* General purpose instrumentation handling */
Instrumentation *
InstrAlloc(int n, int instrument_options)
@@ -393,12 +412,11 @@ InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
WalUsageAdd(&dst->walusage, walusage);
}
- BufferUsageAdd(&pgBufferUsage, bufusage);
WalUsageAdd(&pgWalUsage, walusage);
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -419,36 +437,6 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
-void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 8804ee64311..e45c452bc79 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -139,7 +139,6 @@ typedef struct WorkerInstrumentation
NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern PGDLLIMPORT InstrStack * pgInstrStack;
@@ -162,9 +161,8 @@ extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
extern Instrumentation *InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
@@ -175,22 +173,18 @@ extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
instr->stack.walusage
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
if (pgInstrStack) \
pgInstrStack->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += val; \
if (pgInstrStack) \
pgInstrStack->bufusage.fld += val; \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
if (pgInstrStack) \
INSTR_TIME_ADD(pgInstrStack->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
if (pgInstrStack) \
INSTR_TIME_ACCUM_DIFF(pgInstrStack->bufusage.fld, endval, startval); \
} while (0)
--
2.47.1
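For readers following the thread, the InstrPushStack/InstrPopStack pair in the patch above can be illustrated with a minimal standalone sketch. Everything named Sketch* or sketch_* is hypothetical and not part of the patch: a single counter stands in for BufferUsage/WalUsage, and the caller supplies the storage instead of palloc0. It only aims to show the idea that the active entry receives the counts and that a popped entry is folded into its parent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the patch's InstrStack. */
typedef struct SketchStack
{
    struct SketchStack *previous;   /* entry that was active before this one */
    int64_t     shared_blks_hit;    /* one counter standing in for BufferUsage */
} SketchStack;

/* Stand-in for pgInstrStack: the entry that currently receives counts. */
SketchStack *sketch_current = NULL;

/* Make "entry" the active entry, remembering the one it replaces. */
void
sketch_push(SketchStack *entry)
{
    entry->previous = sketch_current;
    entry->shared_blks_hit = 0;
    sketch_current = entry;
}

/* Restore the previous entry and fold the popped entry's counts into it. */
void
sketch_pop(SketchStack *entry)
{
    assert(entry != NULL && sketch_current == entry);
    sketch_current = entry->previous;
    if (sketch_current != NULL)
        sketch_current->shared_blks_hit += entry->shared_blks_hit;
}

/* Analogue of INSTR_BUFUSAGE_INCR: count only when an entry is active. */
void
sketch_count_hit(void)
{
    if (sketch_current != NULL)
        sketch_current->shared_blks_hit++;
}

/* The outer entry sees one hit of its own plus two from the nested entry. */
int64_t
sketch_demo(void)
{
    SketchStack outer;
    SketchStack inner;

    sketch_push(&outer);
    sketch_count_hit();
    sketch_push(&inner);
    sketch_count_hit();
    sketch_count_hit();
    sketch_pop(&inner);
    sketch_pop(&outer);
    return outer.shared_blks_hit;
}
```

In this sketch no start/stop snapshot of a global counter is needed: the per-entry totals fall out of the push/pop bookkeeping, which appears to be the property the series relies on to drop BufferUsageAccumDiff.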
[application/octet-stream] v2-0001-Instrumentation-Keep-time-fields-as-instrtime-req.patch (7.3K, 5-v2-0001-Instrumentation-Keep-time-fields-as-instrtime-req.patch)
From b874638131c4d22b746bd635dcb1e3034b9ef5dc Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 31 Aug 2025 16:37:05 -0700
Subject: [PATCH v2 1/7] Instrumentation: Keep time fields as instrtime,
require caller to convert
Previously the Instrumentation logic always converted to seconds, only for many
of the callers to then multiply by 1000 to get back to milliseconds. Since an
upcoming refactoring will split the Instrumentation struct, always use
instr_time to keep things simpler.
---
contrib/auto_explain/auto_explain.c | 2 +-
.../pg_stat_statements/pg_stat_statements.c | 2 +-
src/backend/commands/explain.c | 8 ++++----
src/backend/executor/instrument.c | 20 ++++++++-----------
src/include/executor/instrument.h | 6 +++---
src/include/portability/instr_time.h | 2 ++
6 files changed, 19 insertions(+), 21 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 1f4badb4928..c10f2fc0f25 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -388,7 +388,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
InstrEndLoop(queryDesc->totaltime);
/* Log plan if duration is exceeded. */
- msec = queryDesc->totaltime->total * 1000.0;
+ msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
{
ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 1cb368c8590..b9c971de1e5 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1091,7 +1091,7 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
queryDesc->plannedstmt->stmt_location,
queryDesc->plannedstmt->stmt_len,
PGSS_EXEC,
- queryDesc->totaltime->total * 1000.0, /* convert to msec */
+ INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
queryDesc->estate->es_total_processed,
&queryDesc->totaltime->bufusage,
&queryDesc->totaltime->walusage,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8345bc0264b..95b7a9d227f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1830,8 +1830,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate->instrument && planstate->instrument->nloops > 0)
{
double nloops = planstate->instrument->nloops;
- double startup_ms = 1000.0 * planstate->instrument->startup / nloops;
- double total_ms = 1000.0 * planstate->instrument->total / nloops;
+ double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
+ double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
double rows = planstate->instrument->ntuples / nloops;
if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1896,8 +1896,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (nloops <= 0)
continue;
- startup_ms = 1000.0 * instrument->startup / nloops;
- total_ms = 1000.0 * instrument->total / nloops;
+ startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
+ total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
rows = instrument->ntuples / nloops;
ExplainOpenWorker(n, es);
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 56e635f4700..1c92abe6761 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -114,7 +114,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (!instr->running)
{
instr->running = true;
- instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
+ instr->firsttuple = instr->counter;
}
else
{
@@ -123,7 +123,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
* this might be the first tuple
*/
if (instr->async_mode && save_tuplecount < 1.0)
- instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
+ instr->firsttuple = instr->counter;
}
}
@@ -139,8 +139,6 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
void
InstrEndLoop(Instrumentation *instr)
{
- double totaltime;
-
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
return;
@@ -149,10 +147,8 @@ InstrEndLoop(Instrumentation *instr)
elog(ERROR, "InstrEndLoop called on running node");
/* Accumulate per-cycle statistics into totals */
- totaltime = INSTR_TIME_GET_DOUBLE(instr->counter);
-
- instr->startup += instr->firsttuple;
- instr->total += totaltime;
+ INSTR_TIME_ADD(instr->startup, instr->firsttuple);
+ INSTR_TIME_ADD(instr->total, instr->counter);
instr->ntuples += instr->tuplecount;
instr->nloops += 1;
@@ -160,7 +156,7 @@ InstrEndLoop(Instrumentation *instr)
instr->running = false;
INSTR_TIME_SET_ZERO(instr->starttime);
INSTR_TIME_SET_ZERO(instr->counter);
- instr->firsttuple = 0;
+ INSTR_TIME_SET_ZERO(instr->firsttuple);
instr->tuplecount = 0;
}
@@ -173,14 +169,14 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->running = true;
dst->firsttuple = add->firsttuple;
}
- else if (dst->running && add->running && dst->firsttuple > add->firsttuple)
+ else if (dst->running && add->running && INSTR_TIME_CMP_LT(add->firsttuple, dst->firsttuple))
dst->firsttuple = add->firsttuple;
INSTR_TIME_ADD(dst->counter, add->counter);
dst->tuplecount += add->tuplecount;
- dst->startup += add->startup;
- dst->total += add->total;
+ INSTR_TIME_ADD(dst->startup, add->startup);
+ INSTR_TIME_ADD(dst->total, add->total);
dst->ntuples += add->ntuples;
dst->ntuples2 += add->ntuples2;
dst->nloops += add->nloops;
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 03653ab6c6c..ba5c986907e 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -77,13 +77,13 @@ typedef struct Instrumentation
bool running; /* true if we've completed first tuple */
instr_time starttime; /* start time of current iteration of node */
instr_time counter; /* accumulated runtime for this node */
- double firsttuple; /* time for first tuple of this cycle */
+ instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
BufferUsage bufusage_start; /* buffer usage at start */
WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
- double startup; /* total startup time (in seconds) */
- double total; /* total time (in seconds) */
+ instr_time startup; /* total startup time */
+ instr_time total; /* total time */
double ntuples; /* total tuples produced */
double ntuples2; /* secondary node-specific tuple counter */
double nloops; /* # of run cycles for this node */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index f71a851b18d..646934020d1 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -184,6 +184,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
+#define INSTR_TIME_CMP_LT(x,y) \
+ ((x).ticks < (y).ticks)
#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
--
2.47.1
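As a rough illustration of the idea in the patch above (accumulate raw ticks, convert only where a caller needs milliseconds), here is a small standalone sketch. The Sketch* names are hypothetical stand-ins, and the tick unit is assumed to be nanoseconds for simplicity; the real instr_time and INSTR_TIME_GET_MILLISEC may differ per platform.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for instr_time: a raw tick count (assumed ns). */
typedef struct SketchTime
{
    int64_t     ticks;
} SketchTime;

/* Analogues of INSTR_TIME_ADD and INSTR_TIME_GET_MILLISEC. */
#define SKETCH_TIME_ADD(x, y)        ((x).ticks += (y).ticks)
#define SKETCH_TIME_GET_MILLISEC(t)  ((double) (t).ticks / 1000000.0)

/* Hypothetical, trimmed-down Instrumentation with only the timing fields. */
typedef struct SketchInstr
{
    SketchTime  counter;        /* runtime of the current cycle */
    SketchTime  total;          /* runtime across completed cycles */
    double      nloops;         /* number of completed cycles */
} SketchInstr;

/* Analogue of InstrEndLoop: fold the cycle into the totals, still in ticks. */
void
sketch_end_loop(SketchInstr *instr)
{
    SKETCH_TIME_ADD(instr->total, instr->counter);
    instr->counter.ticks = 0;
    instr->nloops += 1;
}

/* The caller converts once, at the point where milliseconds are reported. */
double
sketch_total_ms_per_loop(const SketchInstr *instr)
{
    return SKETCH_TIME_GET_MILLISEC(instr->total) / instr->nloops;
}

/* Two cycles of 2 ms and 4 ms average to 3 ms per loop. */
double
sketch_time_demo(void)
{
    SketchInstr instr = {{0}, {0}, 0};

    instr.counter.ticks = 2000000;
    sketch_end_loop(&instr);
    instr.counter.ticks = 4000000;
    sketch_end_loop(&instr);
    return sketch_total_ms_per_loop(&instr);
}
```

Keeping the accumulator integral avoids the seconds-then-back-to-milliseconds round trip that the callers in the diff used to perform.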
[application/octet-stream] v2-0005-Use-Instrumentation-stack-for-parallel-query-aggr.patch (9.5K, 6-v2-0005-Use-Instrumentation-stack-for-parallel-query-aggr.patch)
From f52c067dc350299fb8bd20e3e08025f745b8e215 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:24:49 -0700
Subject: [PATCH v2 5/7] Use Instrumentation stack for parallel query
aggregation in workers
---
src/backend/access/brin/brin.c | 6 ++++--
src/backend/access/gin/gininsert.c | 6 ++++--
src/backend/access/nbtree/nbtsort.c | 6 ++++--
src/backend/commands/vacuumparallel.c | 6 ++++--
src/backend/executor/execParallel.c | 6 ++++--
src/backend/executor/instrument.c | 21 ++++++++++-----------
src/include/executor/instrument.h | 4 ++--
7 files changed, 32 insertions(+), 23 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 7ff7467e462..4626093116e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2870,6 +2870,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ Instrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2919,7 +2920,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2934,7 +2935,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e9d4b27427e..022085ca645 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2079,6 +2079,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ Instrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2147,7 +2148,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2162,7 +2163,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8828a7a8f89..615fd1e03f7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1752,6 +1752,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ Instrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1827,7 +1828,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1837,7 +1838,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..c5309a015e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -994,6 +994,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ Instrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1083,7 +1084,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,7 +1092,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e87810d292e..061c6a4aa69 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1434,6 +1434,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ Instrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1494,7 +1495,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1510,7 +1511,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 8ef626721f3..d5fdbecb025 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,10 +19,8 @@
#include "utils/memutils.h"
BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
InstrStack *pgInstrStack = NULL;
-static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -364,22 +362,23 @@ InstrStackAdd(InstrStack * dst, InstrStack * add)
WalUsageAdd(&dst->walusage, &add->walusage);
}
-/* note current values during parallel executor startup */
-void
+/* start instrumentation during parallel executor startup */
+Instrumentation *
InstrStartParallelQuery(void)
{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
+ Instrumentation *instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrStart(instr);
+ return instr;
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage)
{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
+ InstrStop(instr, 0, true);
+ memcpy(bufusage, &INSTR_GET_BUFUSAGE(instr), sizeof(BufferUsage));
+ memcpy(walusage, &INSTR_GET_WALUSAGE(instr), sizeof(WalUsage));
}
/* accumulate work done by workers in leader's stats */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d04607ce40c..bf766706580 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -156,8 +156,8 @@ extern void InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation * instr);
extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern Instrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
--
2.47.1
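The worker-side flow in the patch above (start an instrumentation object, run, then copy its totals into the worker's shared-memory slot) could be sketched as follows. All Sketch* names are hypothetical, a plain array stands in for the shm_toc-backed usage arrays, and the loop merely simulates two workers in one process.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the BufferUsage/WalUsage pair. */
typedef struct SketchUsage
{
    int64_t     blks_hit;
    int64_t     wal_records;
} SketchUsage;

/* Hypothetical worker-local instrumentation object. */
typedef struct SketchWorkerInstr
{
    int         running;
    SketchUsage usage;
} SketchWorkerInstr;

/* Analogue of InstrStartParallelQuery: begin with zeroed counters. */
void
sketch_worker_start(SketchWorkerInstr *instr)
{
    memset(instr, 0, sizeof(*instr));
    instr->running = 1;
}

/* Analogue of InstrEndParallelQuery: stop, then copy into the shared slot. */
void
sketch_worker_end(SketchWorkerInstr *instr, SketchUsage *slot)
{
    assert(instr->running);
    instr->running = 0;
    memcpy(slot, &instr->usage, sizeof(SketchUsage));
}

/* Simulate two workers reporting 10 and 20 hits; the leader sums the slots. */
int64_t
sketch_parallel_demo(void)
{
    SketchUsage shared[2];
    int64_t     leader_total = 0;
    int         w;

    memset(shared, 0, sizeof(shared));
    for (w = 0; w < 2; w++)
    {
        SketchWorkerInstr instr;

        sketch_worker_start(&instr);
        instr.usage.blks_hit += (w + 1) * 10;
        sketch_worker_end(&instr, &shared[w]);
    }
    for (w = 0; w < 2; w++)
        leader_total += shared[w].blks_hit;
    return leader_total;
}
```

Because each worker's object starts from zero, a plain copy into the slot is enough here; no saved global snapshot and no diff against it is required.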
[application/octet-stream] v2-0004-Introduce-stack-for-tracking-per-node-WAL-buffer-.patch (20.5K, 7-v2-0004-Introduce-stack-for-tracking-per-node-WAL-buffer-.patch)
From 39e4a7cae8edc26bfcc9ec99756ad01cd2f587b9 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v2 4/7] Introduce stack for tracking per-node WAL/buffer usage
---
.../pg_stat_statements/pg_stat_statements.c | 4 +-
src/backend/commands/explain.c | 8 +-
src/backend/commands/trigger.c | 4 +-
src/backend/executor/execMain.c | 25 ++-
src/backend/executor/execProcnode.c | 29 +++
src/backend/executor/instrument.c | 199 ++++++++++++++----
src/include/executor/executor.h | 1 +
src/include/executor/instrument.h | 53 ++++-
8 files changed, 260 insertions(+), 63 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 4ec33fbf470..deb6d43a47f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1087,8 +1087,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
PGSS_EXEC,
INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
queryDesc->estate->es_total_processed,
- &queryDesc->totaltime->bufusage,
- &queryDesc->totaltime->walusage,
+ &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
+ &INSTR_GET_WALUSAGE(queryDesc->totaltime),
queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
NULL,
queryDesc->estate->es_parallel_workers_to_launch,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d005fcdbc98..1c4d0e14334 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2280,9 +2280,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage);
+ show_buffer_usage(es, &planstate->instrument->stack.bufusage);
if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ show_wal_usage(es, &planstate->instrument->stack.walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -2299,9 +2299,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->bufusage);
+ show_buffer_usage(es, &instrument->stack.bufusage);
if (es->wal)
- show_wal_usage(es, &instrument->walusage);
+ show_wal_usage(es, &instrument->stack.walusage);
ExplainCloseWorker(n, es);
}
}
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 9b53dd99e99..67a2fdd034a 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2392,7 +2392,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* one "tuple returned" (really the number of firings).
*/
if (instr)
- InstrStop(instr + tgindx, 1);
+ InstrStop(instr + tgindx, 1, false);
return (HeapTuple) DatumGetPointer(result);
}
@@ -4607,7 +4607,7 @@ AfterTriggerExecute(EState *estate,
* one "tuple returned" (really the number of firings).
*/
if (instr)
- InstrStop(instr + tgindx, 1);
+ InstrStop(instr + tgindx, 1, false);
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index e459b3aa797..37f00026dc6 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -329,6 +329,13 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+ /*
+ * Start up required top-level instrumentation stack for WAL/buffer
+ * tracking
+ */
+ if (!queryDesc->totaltime && (estate->es_instrument & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)))
+ queryDesc->totaltime = InstrAlloc(1, estate->es_instrument);
+
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
InstrStart(queryDesc->totaltime);
@@ -383,7 +390,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime, estate->es_processed, false);
MemoryContextSwitchTo(oldcontext);
}
@@ -442,8 +449,15 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
+ /*
+ * Accumulate per-node statistics, and then shut down instrumentation
+ * stack
+ */
+ if (queryDesc->totaltime && estate->es_instrument)
+ ExecAccumNodeInstrumentation(queryDesc->planstate);
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime, 0, true);
MemoryContextSwitchTo(oldcontext);
@@ -1266,7 +1280,12 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0(n * sizeof(ExprState *));
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
+ {
+ if ((instrument_options & INSTRUMENT_TIMER) != 0)
+ resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_TIMER);
+ else
+ resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_ROWS);
+ }
}
else
{
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d286471254b..1b3b39222a9 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -122,6 +122,7 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecAccumNodeInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecAccumNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation stack).
+ */
+void
+ExecAccumNodeInstrumentation(PlanState *node)
+{
+ (void) ExecAccumNodeInstrumentation_walker(node, NULL);
+}
+
+static bool
+ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ if (node == NULL)
+ return false;
+
+ check_stack_depth();
+
+ planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);
+
+ if (node->instrument && node->instrument->stack.previous)
+ InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack);
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 1fe0f4204e5..8ef626721f3 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,56 +16,150 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
+InstrStack *pgInstrStack = NULL;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
+/*
+ * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
+ .release_priority = RELEASE_PRIO_FIRST,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrStack(ResourceOwner owner, Instrumentation *instr)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(instr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrStack(ResourceOwner owner, Instrumentation *instr)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(instr), &instrumentation_resowner_desc);
+}
+
+static void
+InstrPushStackResource(Instrumentation *res)
+{
+ ResourceOwner owner = CurrentResourceOwner;
+
+ Assert(owner != NULL);
+
+ res->owner = owner;
+
+ ResourceOwnerEnlarge(owner);
+ ResourceOwnerRememberInstrStack(owner, res);
+
+ res->stack.previous = pgInstrStack;
+ pgInstrStack = &res->stack;
+}
+
+static void
+InstrPopStackResource(Instrumentation *res)
+{
+ Assert(res != NULL);
+ Assert(res->owner != NULL);
+
+ pgInstrStack = res->stack.previous;
+
+ ResourceOwnerForgetInstrStack(res->owner, res);
+}
+
+static bool
+StackIsParent(InstrStack * stack, InstrStack * entry)
+{
+ if (entry->previous == NULL)
+ return false;
+
+ if (entry->previous == stack)
+ return true;
+
+ return StackIsParent(stack, entry->previous);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ Instrumentation *instr = (Instrumentation *) DatumGetPointer(res);
+
+ /*
+ * Because registered resources are *not* called in reverse order, we'll
+ * get what was first registered first at shutdown. Thus, on any later
+ * resources we need to not change the stack, which was already set to the
+ * correct previous entry.
+ */
+ if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
+ pgInstrStack = instr->stack.previous;
+
+ /*
+ * Always accumulate all collected stats before the abort, even if we
+ * already walked up the stack with an earlier resource.
+ */
+ if (pgInstrStack)
+ InstrStackAdd(pgInstrStack, &instr->stack);
+
+ instr->finalized = true;
+}
/* General purpose instrumentation handling */
Instrumentation *
InstrAlloc(int n, int instrument_options)
{
- Instrumentation *instr;
+ Instrumentation *instr = NULL;
+ bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+ bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
+ bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+ int i;
+
+ /*
+ * If resource owner will be used, we must allocate in the transaction
+ * context (not the calling context, usually a lower context), because the
+ * memory might otherwise be freed too early in an abort situation.
+ */
+ if (need_buffers || need_wal)
+ instr = MemoryContextAllocZero(CurTransactionContext, n * sizeof(Instrumentation));
+ else
+ instr = palloc0(n * sizeof(Instrumentation));
- /* initialize all fields to zeroes, then modify as needed */
- instr = palloc0(n * sizeof(Instrumentation));
- if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
+ for (i = 0; i < n; i++)
{
- bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
- bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
- int i;
-
- for (i = 0; i < n; i++)
- {
- instr[i].need_bufusage = need_buffers;
- instr[i].need_walusage = need_wal;
- instr[i].need_timer = need_timer;
- }
+ instr[i].need_bufusage = need_buffers;
+ instr[i].need_walusage = need_wal;
+ instr[i].need_timer = need_timer;
}
return instr;
}
+
void
InstrStart(Instrumentation *instr)
{
+ Assert(!instr->finalized);
+
if (instr->need_timer &&
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
elog(ERROR, "InstrStart called twice in a row");
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
-
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ if (instr->need_bufusage || instr->need_walusage)
+ InstrPushStackResource(instr);
}
+
void
-InstrStop(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr, double nTuples, bool finalize)
{
instr_time endtime;
@@ -84,14 +178,15 @@ InstrStop(Instrumentation *instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
+ if (instr->need_bufusage || instr->need_walusage)
+ InstrPopStackResource(instr);
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ if (finalize)
+ {
+ instr->finalized = true;
+ if (pgInstrStack)
+ InstrStackAdd(pgInstrStack, &instr->stack);
+ }
}
/* Allocate new node instrumentation structure(s) */
@@ -139,12 +234,14 @@ InstrStartNode(NodeInstrumentation * instr)
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
elog(ERROR, "InstrStartNode called twice in a row");
- /* save buffer usage totals at node entry, if needed */
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_bufusage || instr->need_walusage)
+ {
+ /* Ensure that we always have a parent, even at the topmost node */
+ Assert(pgInstrStack != NULL);
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ instr->stack.previous = pgInstrStack;
+ pgInstrStack = &instr->stack;
+ }
}
/* Exit from a plan node */
@@ -169,14 +266,12 @@ InstrStopNode(NodeInstrumentation * instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
-
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ if (instr->need_bufusage || instr->need_walusage)
+ {
+ /* Ensure that we always have a parent, even at the topmost node */
+ Assert(instr->stack.previous != NULL);
+ pgInstrStack = instr->stack.previous;
+ }
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -253,10 +348,20 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
/* Add delta of buffer usage since entry to node's totals */
if (dst->need_bufusage)
- BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ BufferUsageAdd(&dst->stack.bufusage, &add->stack.bufusage);
if (dst->need_walusage)
- WalUsageAdd(&dst->walusage, &add->walusage);
+ WalUsageAdd(&dst->stack.walusage, &add->stack.walusage);
+}
+
+void
+InstrStackAdd(InstrStack * dst, InstrStack * add)
+{
+ Assert(dst != NULL);
+ Assert(add != NULL);
+
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
}
/* note current values during parallel executor startup */
@@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
void
InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
{
+ if (pgInstrStack != NULL)
+ {
+ InstrStack *dst = pgInstrStack;
+
+ BufferUsageAdd(&dst->bufusage, bufusage);
+ WalUsageAdd(&dst->walusage, walusage);
+ }
+
BufferUsageAdd(&pgBufferUsage, bufusage);
WalUsageAdd(&pgWalUsage, walusage);
}
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 31133514e84..ba76a370d3f 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -297,6 +297,7 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecAccumNodeInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 78d3653997b..d04607ce40c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
#define INSTRUMENT_H
#include "portability/instr_time.h"
+#include "utils/resowner.h"
/*
@@ -66,11 +67,23 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/* Stack of WAL/buffer usage used for per-node instrumentation */
+typedef struct InstrStack
+{
+ struct InstrStack *previous;
+ BufferUsage bufusage;
+ WalUsage walusage;
+} InstrStack;
+
/*
* General purpose instrumentation that can capture time, WAL/buffer usage and tuples
*
* Initialized through InstrAlloc, followed by one or more calls to a pair of
* InstrStart/InstrStop (activity is measured inbetween).
+ *
+ * Uses the resource owner mechanism to handle aborts. As such, the caller must *not* exit the
+ * top-level transaction between InstrStart/InstrStop calls in regular execution. If this is needed,
+ * directly use InstrPushStack/InstrPopStack in a PG_TRY/PG_FINALLY block instead.
*/
typedef struct Instrumentation
{
@@ -79,18 +92,22 @@ typedef struct Instrumentation
bool need_bufusage; /* true if we need buffer usage data */
bool need_walusage; /* true if we need WAL usage data */
/* Internal state keeping: */
+ bool finalized; /* true if no more InstrStart calls are
+ * allowed */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
double ntuples; /* total tuples counted in InstrStop */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
+ InstrStack stack; /* stack tracking buffer/WAL usage */
+ ResourceOwner owner;
} Instrumentation;
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Requires use of an outer InstrStart/InstrStop to handle the stack used for WAL/buffer
+ * usage statistics, and relies on it for managing aborts. Solely intended for
+ * the executor and anyone reporting about its activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -105,8 +122,6 @@ typedef struct NodeInstrumentation
instr_time counter; /* accumulated runtime for this node */
instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
instr_time startup; /* total startup time */
instr_time total; /* total time */
@@ -115,8 +130,7 @@ typedef struct NodeInstrumentation
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
+ InstrStack stack; /* stack tracking buffer/WAL usage */
} NodeInstrumentation;
typedef struct WorkerInstrumentation
@@ -127,10 +141,11 @@ typedef struct WorkerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
+extern PGDLLIMPORT InstrStack * pgInstrStack;
extern Instrumentation *InstrAlloc(int n, int instrument_options);
extern void InstrStart(Instrumentation *instr);
-extern void InstrStop(Instrumentation *instr, double nTuples);
+extern void InstrStop(Instrumentation *instr, double nTuples, bool finalize);
extern NodeInstrumentation * InstrAllocNode(int n, int instrument_options,
bool async_mode);
@@ -146,26 +161,46 @@ extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_GET_BUFUSAGE(instr) \
+ ((instr)->stack.bufusage)
+
+#define INSTR_GET_WALUSAGE(instr) \
+ ((instr)->stack.walusage)
+
#define INSTR_BUFUSAGE_INCR(fld) do { \
pgBufferUsage.fld++; \
+ if (pgInstrStack) \
+ pgInstrStack->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
pgBufferUsage.fld += val; \
+ if (pgInstrStack) \
+ pgInstrStack->bufusage.fld += val; \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ if (pgInstrStack) \
+ INSTR_TIME_ADD(pgInstrStack->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ if (pgInstrStack) \
+ INSTR_TIME_ACCUM_DIFF(pgInstrStack->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ if (pgInstrStack) \
+ pgInstrStack->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += val; \
+ if (pgInstrStack) \
+ pgInstrStack->walusage.fld += val; \
} while(0)
#endif /* INSTRUMENT_H */
--
2.47.1
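For readers skimming the diff above, the core idea of the stack can be shown in isolation. The following is only a minimal sketch under simplifying assumptions, not the patch's code: UsageStack, current_stack, stack_push, stack_pop, count_hit, stack_add and demo are hypothetical names, a single counter stands in for BufferUsage/WalUsage, and the ResourceOwner-based abort handling is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for InstrStack: one counter only. */
typedef struct UsageStack
{
    struct UsageStack *previous;    /* entry that was active before this one */
    long        blks_hit;           /* activity charged to this entry only */
} UsageStack;

/* Plays the role of pgInstrStack: the innermost active entry, if any. */
static UsageStack *current_stack = NULL;

/* Enter a node: all later increments are charged to this entry. */
static void
stack_push(UsageStack *entry)
{
    entry->previous = current_stack;
    current_stack = entry;
}

/* Leave a node: restore the entry that was active before it. */
static void
stack_pop(UsageStack *entry)
{
    assert(current_stack == entry);
    current_stack = entry->previous;
}

/* Rough counterpart of INSTR_BUFUSAGE_INCR: charge the innermost entry. */
static void
count_hit(void)
{
    if (current_stack)
        current_stack->blks_hit++;
}

/* Rough counterpart of InstrStackAdd: fold a child's totals into dst. */
static void
stack_add(UsageStack *dst, const UsageStack *add)
{
    dst->blks_hit += add->blks_hit;
}

/* A tiny two-level "plan": one query-level entry with one child node. */
void
demo(UsageStack *query, UsageStack *node)
{
    stack_push(query);
    count_hit();                /* charged to query */

    stack_push(node);
    count_hit();                /* charged to node only */
    count_hit();
    stack_pop(node);

    count_hit();                /* charged to query again */
    stack_pop(query);

    /* At shutdown, the child's totals are added to its parent once. */
    stack_add(node->previous, node);
}
```

Calling demo() with two zero-initialized entries leaves the node with only its own two hits, while the query-level entry ends up with its own two hits plus the node's two. No snapshot or diff of a global counter is taken on node entry or exit; the cost moves to a single accumulation pass at the end, which is the trade-off the patch appears to be making.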
[application/octet-stream] v2-0003-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch (9.0K, 8-v2-0003-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch)
From 4f2bb304213b600c7f368d16547cba52641157db Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 31 Aug 2025 16:34:42 -0700
Subject: [PATCH v2 3/7] Replace direct changes of pgBufferUsage/pgWalUsage
with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
---
src/backend/access/transam/xlog.c | 8 ++++----
src/backend/storage/buffer/bufmgr.c | 24 ++++++++++++------------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
6 files changed, 46 insertions(+), 27 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7ffb2179151..61516f35676 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1078,9 +1078,9 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2060,7 +2060,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = 0;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index fe470de63f2..d872d9efb93 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -705,7 +705,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -737,7 +737,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
else
PinBuffer_Locked(bufHdr); /* pin for first time */
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
@@ -1147,14 +1147,14 @@ PinBufferForBlock(Relation rel,
{
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
if (*foundPtr)
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
}
else
{
bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
strategy, foundPtr, io_context);
if (*foundPtr)
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
}
if (rel)
{
@@ -1888,9 +1888,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
if (operation->rel)
pgstat_count_buffer_hit(operation->rel);
@@ -1958,9 +1958,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it.
@@ -2842,7 +2842,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -2960,7 +2960,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4368,7 +4368,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
@@ -5524,7 +5524,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
if (dirtied)
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 04fef13409b..74c95a3fc59 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -216,7 +216,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -476,7 +476,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -507,7 +507,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u32(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 366d70d38a1..9d39df998cb 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -474,13 +474,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -548,13 +548,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 13ae57ed649..4f6274eb573 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1ae533f6704..78d3653997b 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -149,4 +149,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += val; \
+ } while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += val; \
+ } while(0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
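As a side note on the macro shape used in this patch, the do { ... } while (0) wrapper is what lets a macro be dropped into an unbraced if/else, as in the bufmgr.c hunks, and later grow a second statement without touching call sites. A minimal sketch, with hypothetical names (DemoUsage, demo_usage, DEMO_USAGE_INCR, DEMO_USAGE_ADD, record_io) that are not part of PostgreSQL:

```c
#include <assert.h>

/* Hypothetical counters standing in for pgBufferUsage. */
typedef struct DemoUsage
{
    long        hits;
    long        bytes;
} DemoUsage;

static DemoUsage demo_usage;

/*
 * The field name is pasted in as a token, so one macro serves every counter.
 * The do/while(0) wrapper makes the expansion a single statement, so a
 * second statement can be added inside later without breaking callers.
 */
#define DEMO_USAGE_INCR(fld) do { \
        demo_usage.fld++; \
    } while (0)
#define DEMO_USAGE_ADD(fld, val) do { \
        demo_usage.fld += (val); \
    } while (0)

/* Unbraced if/else, mirroring how the call sites in the patch look. */
void
record_io(int cached, long nbytes)
{
    if (cached)
        DEMO_USAGE_INCR(hits);
    else
        DEMO_USAGE_ADD(bytes, nbytes);
}
```

Had the macros expanded to a bare { ... } block, the semicolon after DEMO_USAGE_INCR(hits) would end the if statement and leave the else dangling, which is a compile error. With the wrapper, call sites stay unchanged when the macro body is extended.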
[application/octet-stream] v2-0002-Separate-node-instrumentation-from-other-use-of-I.patch (21.4K, 9-v2-0002-Separate-node-instrumentation-from-other-use-of-I.patch)
From a6893036163db67839c0cf2a40d0032858032424 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v2 2/7] Separate node instrumentation from other use of
Instrumentation struct
Previously different places (e.g. query "total time") were repurposing
the Instrumentation struct initially introduced for capturing per-node
statistics during execution. This dual use of the struct is confusing,
e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
code paths, and prevents future refactorings.
Instead, simplify the Instrumentation struct to only track time,
WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop
outside of per-node instrumentation. Introduce the NodeInstrumentation
struct to carry forward the per-node instrumentation information.
---
contrib/auto_explain/auto_explain.c | 8 +-
.../pg_stat_statements/pg_stat_statements.c | 8 +-
contrib/postgres_fdw/postgres_fdw.c | 2 +-
src/backend/commands/explain.c | 11 +--
src/backend/commands/trigger.c | 8 +-
src/backend/executor/execMain.c | 10 +--
src/backend/executor/execParallel.c | 22 +++--
src/backend/executor/execProcnode.c | 4 +-
src/backend/executor/instrument.c | 86 +++++++++++++++++--
src/include/executor/instrument.h | 51 ++++++++---
src/include/nodes/execnodes.h | 3 +-
11 files changed, 151 insertions(+), 62 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index c10f2fc0f25..ee0c3b4c91b 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index b9c971de1e5..4ec33fbf470 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1021,7 +1021,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -1080,12 +1080,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
if (queryId != INT64CONST(0) && queryDesc->totaltime &&
pgss_enabled(nesting_level))
{
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
pgss_store(queryDesc->sourceText,
queryId,
queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 456b267f70b..7619ac486c0 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2778,7 +2778,7 @@ postgresIterateDirectModify(ForeignScanState *node)
if (!resultRelInfo->ri_projectReturning)
{
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- Instrumentation *instr = node->ss.ps.instrument;
+ NodeInstrumentation *instr = node->ss.ps.instrument;
Assert(!dmstate->has_returning);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 95b7a9d227f..d005fcdbc98 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1102,9 +1102,6 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
char *relname;
char *conname = NULL;
- /* Must clean up instrumentation state */
- InstrEndLoop(instr);
-
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
@@ -1135,7 +1132,7 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- 1000.0 * instr->total, instr->ntuples);
+ 1000.0 * INSTR_TIME_GET_DOUBLE(instr->total), instr->ntuples);
else
appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
}
@@ -1146,7 +1143,7 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Constraint Name", conname, es);
ExplainPropertyText("Relation", relname, es);
if (es->timing)
- ExplainPropertyFloat("Time", "ms", 1000.0 * instr->total, 3,
+ ExplainPropertyFloat("Time", "ms", 1000.0 * INSTR_TIME_GET_DOUBLE(instr->total), 3,
es);
ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
}
@@ -1888,7 +1885,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
double startup_ms;
double total_ms;
@@ -2294,7 +2291,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
if (nloops <= 0)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 579ac8d76ae..9b53dd99e99 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2344,7 +2344,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStart(instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2392,7 +2392,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* one "tuple returned" (really the number of firings).
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStop(instr + tgindx, 1);
return (HeapTuple) DatumGetPointer(result);
}
@@ -4381,7 +4381,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStart(instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4607,7 +4607,7 @@ AfterTriggerExecute(EState *estate,
* one "tuple returned" (really the number of firings).
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStop(instr + tgindx, 1);
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index ff12e2e1364..e459b3aa797 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -331,7 +331,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/*
* extract information from the query descriptor and the query feature.
@@ -383,7 +383,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime, estate->es_processed);
MemoryContextSwitchTo(oldcontext);
}
@@ -433,7 +433,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -443,7 +443,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
AfterTriggerEndQuery(estate);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime, 0);
MemoryContextSwitchTo(oldcontext);
@@ -1266,7 +1266,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0(n * sizeof(ExprState *));
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+ resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
}
else
{
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557cf..e87810d292e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -85,7 +85,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object. This will depend on the length of
+ * of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -102,11 +102,15 @@ struct SharedExecutorInstrumentation
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
- /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+ /*
+ * array of num_plan_nodes * num_workers NodeInstrumentation objects
+ * follows
+ */
};
#define GetInstrumentationArray(sei) \
(AssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
- (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+ (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
@@ -713,7 +717,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len = MAXALIGN(instrumentation_len);
instrument_offset = instrumentation_len;
instrumentation_len +=
- mul_size(sizeof(Instrumentation),
+ mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -799,7 +803,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*/
if (estate->es_instrument)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -809,7 +813,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInit(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1036,7 +1040,7 @@ static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
int n;
int ibytes;
@@ -1064,7 +1068,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* Switch into per-query memory context.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
planstate->worker_instrument =
palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
MemoryContextSwitchTo(oldcontext);
@@ -1296,7 +1300,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
{
int i;
int plan_node_id = planstate->plan->plan_node_id;
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
InstrEndLoop(planstate->instrument);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f5f9cfbeead..d286471254b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -413,8 +413,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument,
- result->async_capable);
+ result->instrument = InstrAllocNode(1, estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 1c92abe6761..1fe0f4204e5 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,9 +26,9 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int n, int instrument_options)
{
Instrumentation *instr;
@@ -41,6 +41,74 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
int i;
+ for (i = 0; i < n; i++)
+ {
+ instr[i].need_bufusage = need_buffers;
+ instr[i].need_walusage = need_wal;
+ instr[i].need_timer = need_timer;
+ }
+ }
+
+ return instr;
+}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer &&
+ !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
+ elog(ERROR, "InstrStart called twice in a row");
+
+ if (instr->need_bufusage)
+ instr->bufusage_start = pgBufferUsage;
+
+ if (instr->need_walusage)
+ instr->walusage_start = pgWalUsage;
+}
+void
+InstrStop(Instrumentation *instr, double nTuples)
+{
+ instr_time endtime;
+
+ /* count the specified tuples */
+ instr->ntuples += nTuples;
+
+ /* let's update the time only if the timer was requested */
+ if (instr->need_timer)
+ {
+ if (INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStop called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
+ }
+
+ /* Add delta of buffer usage since entry to node's totals */
+ if (instr->need_bufusage)
+ BufferUsageAccumDiff(&instr->bufusage,
+ &pgBufferUsage, &instr->bufusage_start);
+
+ if (instr->need_walusage)
+ WalUsageAccumDiff(&instr->walusage,
+ &pgWalUsage, &instr->walusage_start);
+}
+
+/* Allocate new node instrumentation structure(s) */
+NodeInstrumentation *
+InstrAllocNode(int n, int instrument_options, bool async_mode)
+{
+ NodeInstrumentation *instr;
+
+ /* initialize all fields to zeroes, then modify as needed */
+ instr = palloc0(n * sizeof(NodeInstrumentation));
+ if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
+ {
+ bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+ bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
+ bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+ int i;
+
for (i = 0; i < n; i++)
{
instr[i].need_bufusage = need_buffers;
@@ -55,9 +123,9 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
/* Initialize a pre-allocated instrumentation structure. */
void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitNode(NodeInstrumentation * instr, int instrument_options)
{
- memset(instr, 0, sizeof(Instrumentation));
+ memset(instr, 0, sizeof(NodeInstrumentation));
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
@@ -65,7 +133,7 @@ InstrInit(Instrumentation *instr, int instrument_options)
/* Entry to a plan node */
void
-InstrStartNode(Instrumentation *instr)
+InstrStartNode(NodeInstrumentation * instr)
{
if (instr->need_timer &&
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
@@ -81,7 +149,7 @@ InstrStartNode(Instrumentation *instr)
/* Exit from a plan node */
void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStopNode(NodeInstrumentation * instr, double nTuples)
{
double save_tuplecount = instr->tuplecount;
instr_time endtime;
@@ -129,7 +197,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
/* Update tuple count */
void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples)
{
/* count the returned tuples */
instr->tuplecount += nTuples;
@@ -137,7 +205,7 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation * instr)
{
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
@@ -162,7 +230,7 @@ InstrEndLoop(Instrumentation *instr)
/* aggregate instrumentation information */
void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
{
if (!dst->running && add->running)
{
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index ba5c986907e..1ae533f6704 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -66,7 +66,33 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/*
+ * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured in between).
+ */
typedef struct Instrumentation
+{
+ /* Parameters set at creation: */
+ bool need_timer; /* true if we need timer data */
+ bool need_bufusage; /* true if we need buffer usage data */
+ bool need_walusage; /* true if we need WAL usage data */
+ /* Internal state keeping: */
+ instr_time starttime; /* start time of last InstrStart */
+ BufferUsage bufusage_start; /* buffer usage at start */
+ WalUsage walusage_start; /* WAL usage at start */
+ /* Accumulated statistics: */
+ instr_time total; /* total runtime */
+ double ntuples; /* total tuples counted in InstrStop */
+ BufferUsage bufusage; /* total buffer usage */
+ WalUsage walusage; /* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
{
/* Parameters set at node creation: */
bool need_timer; /* true if we need timer data */
@@ -91,25 +117,30 @@ typedef struct Instrumentation
double nfiltered2; /* # of tuples removed by "other" quals */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
typedef struct WorkerInstrumentation
{
int num_workers; /* # of structures that follow */
- Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+ NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
- bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int n, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr, double nTuples);
+
+extern NodeInstrumentation * InstrAllocNode(int n, int instrument_options,
+ bool async_mode);
+extern void InstrInitNode(NodeInstrumentation * instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation * instr);
+extern void InstrStopNode(NodeInstrumentation * instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation * instr);
+extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
+
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index de782014b2d..9b3bd66d401 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1169,7 +1169,8 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
- Instrumentation *instrument; /* Optional runtime stats for this node */
+ NodeInstrumentation *instrument; /* Optional runtime stats for this
+ * node */
WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
/* Per-worker JIT instrumentation */
--
2.47.1
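
The general-purpose InstrStart/InstrStop pair in the patch above measures activity by copying a global counter on entry and adding the difference to a running total on exit. A minimal sketch of that snapshot-and-diff pattern follows. DemoUsage, DemoInstr, demo_global_usage and the demo_* functions are hypothetical names invented for illustration; they are not PostgreSQL APIs and leave out the timer and WAL handling.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a global usage counter such as pgBufferUsage. */
typedef struct DemoUsage
{
	int64_t		hits;
	int64_t		reads;
} DemoUsage;

/* Bumped by whatever code performs the work being measured. */
static DemoUsage demo_global_usage;

/* Hypothetical, trimmed-down analogue of the general-purpose struct. */
typedef struct DemoInstr
{
	int			started;		/* guards against unbalanced start/stop */
	DemoUsage	usage_start;	/* snapshot of the global counter at start */
	DemoUsage	usage;			/* totals accumulated over all windows */
	double		ntuples;		/* tuples counted at stop */
} DemoInstr;

/* Begin a measurement window: remember the current global counter values. */
static inline void
demo_instr_start(DemoInstr *instr)
{
	assert(!instr->started);
	instr->started = 1;
	instr->usage_start = demo_global_usage;
}

/* End a measurement window: add (current - snapshot) to the totals. */
static inline void
demo_instr_stop(DemoInstr *instr, double ntuples)
{
	assert(instr->started);
	instr->started = 0;
	instr->ntuples += ntuples;
	instr->usage.hits += demo_global_usage.hits - instr->usage_start.hits;
	instr->usage.reads += demo_global_usage.reads - instr->usage_start.reads;
}
```

Activity that happens between a stop and the next start is never attributed to the structure. The cost of this approach is one struct copy per start and one field-by-field subtraction per stop.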
* Re: Stack-based tracking of per-node WAL/buffer usage
2025-08-31 23:57 Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-09-04 20:23 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-09-09 19:35 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
@ 2025-10-22 11:28 ` Lukas Fittl <[email protected]>
2025-10-22 12:59 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
From: Lukas Fittl @ 2025-10-22 11:28 UTC (permalink / raw)
To: Andres Freund <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>
On Tue, Sep 9, 2025 at 10:35 PM Lukas Fittl <[email protected]> wrote:
> Attached an updated patch set that addresses the feedback, and also adds
> the complete removal of the global pgBufferUsage variable in later patches
> (0005-0007), to avoid counting both the stack and the variable.
>
Attached is the same patch set, rebased on the latest master.
Thanks,
Lukas
--
Lukas Fittl
Attachments:
[application/octet-stream] v3-0001-Instrumentation-Keep-time-fields-as-instrtime-req.patch (7.3K, 3-v3-0001-Instrumentation-Keep-time-fields-as-instrtime-req.patch)
From d40f69cce15dfa10479c8be31917b33a49d01477 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 31 Aug 2025 16:37:05 -0700
Subject: [PATCH v3 1/7] Instrumentation: Keep time fields as instrtime,
require caller to convert
Previously the Instrumentation logic always converted times to seconds, only
for many of the callers to convert again (multiplying by 1000) to get
milliseconds. Since an upcoming refactoring will split the Instrumentation
struct, always use instr_time to keep things simpler.
---
contrib/auto_explain/auto_explain.c | 2 +-
.../pg_stat_statements/pg_stat_statements.c | 2 +-
src/backend/commands/explain.c | 8 ++++----
src/backend/executor/instrument.c | 20 ++++++++-----------
src/include/executor/instrument.h | 6 +++---
src/include/portability/instr_time.h | 2 ++
6 files changed, 19 insertions(+), 21 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 1f4badb4928..c10f2fc0f25 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -388,7 +388,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
InstrEndLoop(queryDesc->totaltime);
/* Log plan if duration is exceeded. */
- msec = queryDesc->totaltime->total * 1000.0;
+ msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
{
ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index f2187167c5c..fe987ceaf40 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1093,7 +1093,7 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
queryDesc->plannedstmt->stmt_location,
queryDesc->plannedstmt->stmt_len,
PGSS_EXEC,
- queryDesc->totaltime->total * 1000.0, /* convert to msec */
+ INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
queryDesc->estate->es_total_processed,
&queryDesc->totaltime->bufusage,
&queryDesc->totaltime->walusage,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e6edae0845c..46c5bf252fc 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1835,8 +1835,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate->instrument && planstate->instrument->nloops > 0)
{
double nloops = planstate->instrument->nloops;
- double startup_ms = 1000.0 * planstate->instrument->startup / nloops;
- double total_ms = 1000.0 * planstate->instrument->total / nloops;
+ double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
+ double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
double rows = planstate->instrument->ntuples / nloops;
if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1901,8 +1901,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (nloops <= 0)
continue;
- startup_ms = 1000.0 * instrument->startup / nloops;
- total_ms = 1000.0 * instrument->total / nloops;
+ startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
+ total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
rows = instrument->ntuples / nloops;
ExplainOpenWorker(n, es);
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 56e635f4700..1c92abe6761 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -114,7 +114,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (!instr->running)
{
instr->running = true;
- instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
+ instr->firsttuple = instr->counter;
}
else
{
@@ -123,7 +123,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
* this might be the first tuple
*/
if (instr->async_mode && save_tuplecount < 1.0)
- instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
+ instr->firsttuple = instr->counter;
}
}
@@ -139,8 +139,6 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
void
InstrEndLoop(Instrumentation *instr)
{
- double totaltime;
-
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
return;
@@ -149,10 +147,8 @@ InstrEndLoop(Instrumentation *instr)
elog(ERROR, "InstrEndLoop called on running node");
/* Accumulate per-cycle statistics into totals */
- totaltime = INSTR_TIME_GET_DOUBLE(instr->counter);
-
- instr->startup += instr->firsttuple;
- instr->total += totaltime;
+ INSTR_TIME_ADD(instr->startup, instr->firsttuple);
+ INSTR_TIME_ADD(instr->total, instr->counter);
instr->ntuples += instr->tuplecount;
instr->nloops += 1;
@@ -160,7 +156,7 @@ InstrEndLoop(Instrumentation *instr)
instr->running = false;
INSTR_TIME_SET_ZERO(instr->starttime);
INSTR_TIME_SET_ZERO(instr->counter);
- instr->firsttuple = 0;
+ INSTR_TIME_SET_ZERO(instr->firsttuple);
instr->tuplecount = 0;
}
@@ -173,14 +169,14 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->running = true;
dst->firsttuple = add->firsttuple;
}
- else if (dst->running && add->running && dst->firsttuple > add->firsttuple)
+ else if (dst->running && add->running && INSTR_TIME_CMP_LT(dst->firsttuple, add->firsttuple))
dst->firsttuple = add->firsttuple;
INSTR_TIME_ADD(dst->counter, add->counter);
dst->tuplecount += add->tuplecount;
- dst->startup += add->startup;
- dst->total += add->total;
+ INSTR_TIME_ADD(dst->startup, add->startup);
+ INSTR_TIME_ADD(dst->total, add->total);
dst->ntuples += add->ntuples;
dst->ntuples2 += add->ntuples2;
dst->nloops += add->nloops;
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 03653ab6c6c..ba5c986907e 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -77,13 +77,13 @@ typedef struct Instrumentation
bool running; /* true if we've completed first tuple */
instr_time starttime; /* start time of current iteration of node */
instr_time counter; /* accumulated runtime for this node */
- double firsttuple; /* time for first tuple of this cycle */
+ instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
BufferUsage bufusage_start; /* buffer usage at start */
WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
- double startup; /* total startup time (in seconds) */
- double total; /* total time (in seconds) */
+ instr_time startup; /* total startup time */
+ instr_time total; /* total time */
double ntuples; /* total tuples produced */
double ntuples2; /* secondary node-specific tuple counter */
double nloops; /* # of run cycles for this node */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index f71a851b18d..646934020d1 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -184,6 +184,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
+#define INSTR_TIME_CMP_LT(x,y) \
+ ((x).ticks > (y).ticks)
#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
--
2.47.1
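
The patch above keeps accumulated times as tick counts and leaves unit conversion to the code that reports them. A minimal sketch of that idea follows, assuming a tick is one nanosecond. DemoTime, DemoNodeTimes and the demo_* helpers are hypothetical names for illustration only; they are not the macros from instr_time.h. In particular, demo_time_lt follows the conventional reading of "less than" (true when the first argument is strictly smaller), which is an assumption made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for instr_time: a tick count, assumed to be in ns. */
typedef struct DemoTime
{
	int64_t		ticks;
} DemoTime;

#define DEMO_NS_PER_MS 1000000.0

/* Accumulate in integer ticks; no floating-point conversion on this path. */
static inline void
demo_time_add(DemoTime *dst, DemoTime add)
{
	assert(add.ticks >= 0);
	dst->ticks += add.ticks;
}

/* True when a is strictly smaller than b. */
static inline int
demo_time_lt(DemoTime a, DemoTime b)
{
	return a.ticks < b.ticks;
}

/* Convert once, where the value is reported. */
static inline double
demo_time_get_millisec(DemoTime t)
{
	return (double) t.ticks / DEMO_NS_PER_MS;
}

/* Hypothetical per-node timing state: current cycle plus completed totals. */
typedef struct DemoNodeTimes
{
	DemoTime	counter;		/* runtime of the current cycle */
	DemoTime	total;			/* runtime of all completed cycles */
	double		nloops;			/* number of completed cycles */
} DemoNodeTimes;

/* Fold the current cycle into the totals and reset it. */
static inline void
demo_end_loop(DemoNodeTimes *n)
{
	demo_time_add(&n->total, n->counter);
	n->counter.ticks = 0;
	n->nloops += 1;
}

/* Average milliseconds per loop; 0 if no loop has completed. */
static inline double
demo_avg_ms(const DemoNodeTimes *n)
{
	if (n->nloops <= 0)
		return 0.0;
	return demo_time_get_millisec(n->total) / n->nloops;
}
```

Keeping the totals as integers means the accumulation path does no floating-point work, and each reporting site chooses its own unit with a single conversion.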
[application/octet-stream] v3-0004-Introduce-stack-for-tracking-per-node-WAL-buffer-.patch (20.5K, 4-v3-0004-Introduce-stack-for-tracking-per-node-WAL-buffer-.patch)
From aa1acccb3dfa6a5d81a9a049d8cb63762a3d7cf7 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v3 4/7] Introduce stack for tracking per-node WAL/buffer usage
---
.../pg_stat_statements/pg_stat_statements.c | 4 +-
src/backend/commands/explain.c | 8 +-
src/backend/commands/trigger.c | 4 +-
src/backend/executor/execMain.c | 25 ++-
src/backend/executor/execProcnode.c | 29 +++
src/backend/executor/instrument.c | 199 ++++++++++++++----
src/include/executor/executor.h | 1 +
src/include/executor/instrument.h | 53 ++++-
8 files changed, 260 insertions(+), 63 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index f43a33b3787..eeabd820d8e 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
PGSS_EXEC,
INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
queryDesc->estate->es_total_processed,
- &queryDesc->totaltime->bufusage,
- &queryDesc->totaltime->walusage,
+ &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
+ &INSTR_GET_WALUSAGE(queryDesc->totaltime),
queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
NULL,
queryDesc->estate->es_parallel_workers_to_launch,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index de66e48366d..bb0689b95d4 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2286,9 +2286,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage);
+ show_buffer_usage(es, &planstate->instrument->stack.bufusage);
if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ show_wal_usage(es, &planstate->instrument->stack.walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -2305,9 +2305,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->bufusage);
+ show_buffer_usage(es, &instrument->stack.bufusage);
if (es->wal)
- show_wal_usage(es, &instrument->walusage);
+ show_wal_usage(es, &instrument->stack.walusage);
ExplainCloseWorker(n, es);
}
}
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 9b53dd99e99..67a2fdd034a 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2392,7 +2392,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* one "tuple returned" (really the number of firings).
*/
if (instr)
- InstrStop(instr + tgindx, 1);
+ InstrStop(instr + tgindx, 1, false);
return (HeapTuple) DatumGetPointer(result);
}
@@ -4607,7 +4607,7 @@ AfterTriggerExecute(EState *estate,
* one "tuple returned" (really the number of firings).
*/
if (instr)
- InstrStop(instr + tgindx, 1);
+ InstrStop(instr + tgindx, 1, false);
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 9bc7b4e20f7..6cedac373a0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -329,6 +329,13 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+ /*
+ * Start up required top-level instrumentation stack for WAL/buffer
+ * tracking
+ */
+ if (!queryDesc->totaltime && (estate->es_instrument & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)))
+ queryDesc->totaltime = InstrAlloc(1, estate->es_instrument);
+
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
InstrStart(queryDesc->totaltime);
@@ -383,7 +390,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime, estate->es_processed, false);
MemoryContextSwitchTo(oldcontext);
}
@@ -442,8 +449,15 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
+ /*
+ * Accumulate per-node statistics, and then shut down instrumentation
+ * stack
+ */
+ if (queryDesc->totaltime && estate->es_instrument)
+ ExecAccumNodeInstrumentation(queryDesc->planstate);
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime, 0, true);
MemoryContextSwitchTo(oldcontext);
@@ -1266,7 +1280,12 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0(n * sizeof(ExprState *));
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
+ {
+ if ((instrument_options & INSTRUMENT_TIMER) != 0)
+ resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_TIMER);
+ else
+ resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_ROWS);
+ }
}
else
{
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d286471254b..1b3b39222a9 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -122,6 +122,7 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecAccumNodeInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecAccumNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation stack).
+ */
+void
+ExecAccumNodeInstrumentation(PlanState *node)
+{
+ (void) ExecAccumNodeInstrumentation_walker(node, NULL);
+}
+
+static bool
+ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ if (node == NULL)
+ return false;
+
+ check_stack_depth();
+
+ planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);
+
+ if (node->instrument && node->instrument->stack.previous)
+ InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack);
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 1fe0f4204e5..8ef626721f3 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,56 +16,150 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
+InstrStack *pgInstrStack = NULL;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
+/*
+ * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
+ .release_priority = RELEASE_PRIO_FIRST,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrStack(ResourceOwner owner, Instrumentation *instr)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(instr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrStack(ResourceOwner owner, Instrumentation *instr)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(instr), &instrumentation_resowner_desc);
+}
+
+static void
+InstrPushStackResource(Instrumentation *res)
+{
+ ResourceOwner owner = CurrentResourceOwner;
+
+ Assert(owner != NULL);
+
+ res->owner = owner;
+
+ ResourceOwnerEnlarge(owner);
+ ResourceOwnerRememberInstrStack(owner, res);
+
+ res->stack.previous = pgInstrStack;
+ pgInstrStack = &res->stack;
+}
+
+static void
+InstrPopStackResource(Instrumentation *res)
+{
+ Assert(res != NULL);
+ Assert(res->owner != NULL);
+
+ pgInstrStack = res->stack.previous;
+
+ ResourceOwnerForgetInstrStack(res->owner, res);
+}
+
+static bool
+StackIsParent(InstrStack * stack, InstrStack * entry)
+{
+ if (entry->previous == NULL)
+ return false;
+
+ if (entry->previous == stack)
+ return true;
+
+ return StackIsParent(stack, entry->previous);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ Instrumentation *instr = (Instrumentation *) DatumGetPointer(res);
+
+ /*
+ * Registered resources are *not* released in reverse order of
+ * registration, so the entry that was registered first is also released
+ * first. For any entry released after that we must leave the stack
+ * alone, since it already points at the correct previous entry.
+ */
+ if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
+ pgInstrStack = instr->stack.previous;
+
+ /*
+ * Always accumulate all collected stats before the abort, even if we
+ * already walked up the stack with an earlier resource.
+ */
+ if (pgInstrStack)
+ InstrStackAdd(pgInstrStack, &instr->stack);
+
+ instr->finalized = true;
+}
/* General purpose instrumentation handling */
Instrumentation *
InstrAlloc(int n, int instrument_options)
{
- Instrumentation *instr;
+ Instrumentation *instr = NULL;
+ bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+ bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
+ bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+ int i;
+
+ /*
+ * If a resource owner will be used, we must allocate in the transaction
+ * context (not the calling context, usually a lower context), because the
+ * memory might otherwise be freed too early in an abort situation.
+ */
+ if (need_buffers || need_wal)
+ instr = MemoryContextAllocZero(CurTransactionContext, n * sizeof(Instrumentation));
+ else
+ instr = palloc0(n * sizeof(Instrumentation));
- /* initialize all fields to zeroes, then modify as needed */
- instr = palloc0(n * sizeof(Instrumentation));
- if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
+ for (i = 0; i < n; i++)
{
- bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
- bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
- int i;
-
- for (i = 0; i < n; i++)
- {
- instr[i].need_bufusage = need_buffers;
- instr[i].need_walusage = need_wal;
- instr[i].need_timer = need_timer;
- }
+ instr[i].need_bufusage = need_buffers;
+ instr[i].need_walusage = need_wal;
+ instr[i].need_timer = need_timer;
}
return instr;
}
+
void
InstrStart(Instrumentation *instr)
{
+ Assert(!instr->finalized);
+
if (instr->need_timer &&
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
elog(ERROR, "InstrStart called twice in a row");
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
-
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ if (instr->need_bufusage || instr->need_walusage)
+ InstrPushStackResource(instr);
}
+
void
-InstrStop(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr, double nTuples, bool finalize)
{
instr_time endtime;
@@ -84,14 +178,15 @@ InstrStop(Instrumentation *instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
+ if (instr->need_bufusage || instr->need_walusage)
+ InstrPopStackResource(instr);
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ if (finalize)
+ {
+ instr->finalized = true;
+ if (pgInstrStack)
+ InstrStackAdd(pgInstrStack, &instr->stack);
+ }
}
/* Allocate new node instrumentation structure(s) */
@@ -139,12 +234,14 @@ InstrStartNode(NodeInstrumentation * instr)
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
elog(ERROR, "InstrStartNode called twice in a row");
- /* save buffer usage totals at node entry, if needed */
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_bufusage || instr->need_walusage)
+ {
+ /* Ensure that we always have a parent, even at the topmost node */
+ Assert(pgInstrStack != NULL);
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ instr->stack.previous = pgInstrStack;
+ pgInstrStack = &instr->stack;
+ }
}
/* Exit from a plan node */
@@ -169,14 +266,12 @@ InstrStopNode(NodeInstrumentation * instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
-
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ if (instr->need_bufusage || instr->need_walusage)
+ {
+ /* Ensure that we always have a parent, even at the topmost node */
+ Assert(instr->stack.previous != NULL);
+ pgInstrStack = instr->stack.previous;
+ }
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -253,10 +348,20 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
/* Add delta of buffer usage since entry to node's totals */
if (dst->need_bufusage)
- BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ BufferUsageAdd(&dst->stack.bufusage, &add->stack.bufusage);
if (dst->need_walusage)
- WalUsageAdd(&dst->walusage, &add->walusage);
+ WalUsageAdd(&dst->stack.walusage, &add->stack.walusage);
+}
+
+void
+InstrStackAdd(InstrStack * dst, InstrStack * add)
+{
+ Assert(dst != NULL);
+ Assert(add != NULL);
+
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
}
/* note current values during parallel executor startup */
@@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
void
InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
{
+ if (pgInstrStack != NULL)
+ {
+ InstrStack *dst = pgInstrStack;
+
+ BufferUsageAdd(&dst->bufusage, bufusage);
+ WalUsageAdd(&dst->walusage, walusage);
+ }
+
BufferUsageAdd(&pgBufferUsage, bufusage);
WalUsageAdd(&pgWalUsage, walusage);
}
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 3248e78cd28..66c308506ab 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -297,6 +297,7 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecAccumNodeInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 78d3653997b..d04607ce40c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -14,6 +14,7 @@
#define INSTRUMENT_H
#include "portability/instr_time.h"
+#include "utils/resowner.h"
/*
@@ -66,11 +67,23 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/* Stack of WAL/buffer usage used for per-node instrumentation */
+typedef struct InstrStack
+{
+ struct InstrStack *previous;
+ BufferUsage bufusage;
+ WalUsage walusage;
+} InstrStack;
+
/*
* General purpose instrumentation that can capture time, WAL/buffer usage and tuples
*
* Initialized through InstrAlloc, followed by one or more calls to a pair of
* InstrStart/InstrStop (activity is measured inbetween).
+ *
+ * Uses the resource owner mechanism to handle aborts. The caller therefore must *not* exit the
+ * top-level transaction between InstrStart and InstrStop calls in regular execution. If that is
+ * needed, use InstrPushStack/InstrPopStack directly inside a PG_TRY/PG_FINALLY block instead.
*/
typedef struct Instrumentation
{
@@ -79,18 +92,22 @@ typedef struct Instrumentation
bool need_bufusage; /* true if we need buffer usage data */
bool need_walusage; /* true if we need WAL usage data */
/* Internal state keeping: */
+ bool finalized; /* true if no more InstrStart calls are
+ * allowed */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
double ntuples; /* total tuples counted in InstrStop */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
+ InstrStack stack; /* stack tracking buffer/WAL usage */
+ ResourceOwner owner;
} Instrumentation;
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Requires use of an outer InstrStart/InstrStop to handle the stack used for WAL/buffer
+ * usage statistics, and relies on it for managing aborts. Solely intended for
+ * the executor and anyone reporting about its activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -105,8 +122,6 @@ typedef struct NodeInstrumentation
instr_time counter; /* accumulated runtime for this node */
instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
instr_time startup; /* total startup time */
instr_time total; /* total time */
@@ -115,8 +130,7 @@ typedef struct NodeInstrumentation
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
+ InstrStack stack; /* stack tracking buffer/WAL usage */
} NodeInstrumentation;
typedef struct WorkerInstrumentation
@@ -127,10 +141,11 @@ typedef struct WorkerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
+extern PGDLLIMPORT InstrStack * pgInstrStack;
extern Instrumentation *InstrAlloc(int n, int instrument_options);
extern void InstrStart(Instrumentation *instr);
-extern void InstrStop(Instrumentation *instr, double nTuples);
+extern void InstrStop(Instrumentation *instr, double nTuples, bool finalize);
extern NodeInstrumentation * InstrAllocNode(int n, int instrument_options,
bool async_mode);
@@ -146,26 +161,46 @@ extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_GET_BUFUSAGE(instr) \
+ ((instr)->stack.bufusage)
+
+#define INSTR_GET_WALUSAGE(instr) \
+ ((instr)->stack.walusage)
+
#define INSTR_BUFUSAGE_INCR(fld) do { \
pgBufferUsage.fld++; \
+ if (pgInstrStack) \
+ pgInstrStack->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
pgBufferUsage.fld += val; \
+ if (pgInstrStack) \
+ pgInstrStack->bufusage.fld += val; \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ if (pgInstrStack) \
+ INSTR_TIME_ADD(pgInstrStack->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ if (pgInstrStack) \
+ INSTR_TIME_ACCUM_DIFF(pgInstrStack->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ if (pgInstrStack) \
+ pgInstrStack->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += val; \
+ if (pgInstrStack) \
+ pgInstrStack->walusage.fld += val; \
} while(0)
#endif /* INSTRUMENT_H */
--
2.47.1
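For readers following along, here is a minimal standalone sketch of the stack idea in the patch above. It is not the patch's code: DemoStack, demo_current and the demo_* functions are hypothetical stand-ins for InstrStack, pgInstrStack, InstrStartNode/InstrStopNode and InstrStackAdd, and only a single counter is tracked. The assert in demo_pop is an extra check added for illustration; the patch itself only restores the previous entry.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for InstrStack: one counter instead of BufferUsage/WalUsage. */
typedef struct DemoStack
{
	struct DemoStack *previous;
	long		shared_blks_hit;
} DemoStack;

/* Plays the role of pgInstrStack: the entry that currently receives counts. */
static DemoStack *demo_current = NULL;

/* Enter a scope: remember the parent and become the current entry. */
static void
demo_push(DemoStack *entry)
{
	entry->previous = demo_current;
	demo_current = entry;
}

/* Leave a scope: restore the parent (the assert is illustrative only). */
static void
demo_pop(DemoStack *entry)
{
	assert(demo_current == entry);
	demo_current = entry->previous;
}

/* Comparable to the stack half of INSTR_BUFUSAGE_INCR: charge the current entry. */
static void
demo_count_hit(void)
{
	if (demo_current)
		demo_current->shared_blks_hit++;
}

/* Comparable to InstrStackAdd: fold one entry's totals into another. */
static void
demo_add(DemoStack *dst, const DemoStack *add)
{
	dst->shared_blks_hit += add->shared_blks_hit;
}

/* One "query" scope with one nested "node" scope; returns the query total. */
static long
demo_run(long *node_hits)
{
	DemoStack	query = {NULL, 0};
	DemoStack	node = {NULL, 0};

	demo_push(&query);
	demo_count_hit();			/* charged to the query entry */
	demo_push(&node);
	demo_count_hit();			/* charged to the node entry */
	demo_count_hit();
	demo_pop(&node);
	demo_pop(&query);

	/* Deferred accumulation into the parent, done once at the end. */
	demo_add(node.previous, &node);

	*node_hits = node.shared_blks_hit;
	return query.shared_blks_hit;
}
```

Note that while the node is active its hits are charged only to the node entry, which is why the parent's total is only complete after the final accumulation step (the job ExecAccumNodeInstrumentation does in the patch).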
[application/octet-stream] v3-0005-Use-Instrumentation-stack-for-parallel-query-aggr.patch (9.5K, 5-v3-0005-Use-Instrumentation-stack-for-parallel-query-aggr.patch)
From ed8e8daf913ed8547b05d7485accd065a6f109c7 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:24:49 -0700
Subject: [PATCH v3 5/7] Use Instrumentation stack for parallel query
aggregation in workers
---
src/backend/access/brin/brin.c | 6 ++++--
src/backend/access/gin/gininsert.c | 6 ++++--
src/backend/access/nbtree/nbtsort.c | 6 ++++--
src/backend/commands/vacuumparallel.c | 6 ++++--
src/backend/executor/execParallel.c | 6 ++++--
src/backend/executor/instrument.c | 21 ++++++++++-----------
src/include/executor/instrument.h | 4 ++--
7 files changed, 32 insertions(+), 23 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2f7d1437919..a36606eed0e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2870,6 +2870,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ Instrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2919,7 +2920,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2934,7 +2935,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3d71b442aa9..b454934c109 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2083,6 +2083,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ Instrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2151,7 +2152,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2166,7 +2167,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 8828a7a8f89..615fd1e03f7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1752,6 +1752,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ Instrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1827,7 +1828,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1837,7 +1838,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..c5309a015e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -994,6 +994,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ Instrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1083,7 +1084,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,7 +1092,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e87810d292e..061c6a4aa69 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1434,6 +1434,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ Instrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1494,7 +1495,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1510,7 +1511,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 8ef626721f3..d5fdbecb025 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,10 +19,8 @@
#include "utils/memutils.h"
BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
InstrStack *pgInstrStack = NULL;
-static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -364,22 +362,23 @@ InstrStackAdd(InstrStack * dst, InstrStack * add)
WalUsageAdd(&dst->walusage, &add->walusage);
}
-/* note current values during parallel executor startup */
-void
+/* start instrumentation during parallel executor startup */
+Instrumentation *
InstrStartParallelQuery(void)
{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
+ Instrumentation *instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrStart(instr);
+ return instr;
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage)
{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
+ InstrStop(instr, 0, true);
+ memcpy(bufusage, &INSTR_GET_BUFUSAGE(instr), sizeof(BufferUsage));
+ memcpy(walusage, &INSTR_GET_WALUSAGE(instr), sizeof(WalUsage));
}
/* accumulate work done by workers in leader's stats */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d04607ce40c..bf766706580 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -156,8 +156,8 @@ extern void InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation * instr);
extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern Instrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
--
2.47.1
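As a rough illustration of the worker-side change in this patch, the sketch below shows the "start returns a handle, end copies the totals out" shape that InstrStartParallelQuery/InstrEndParallelQuery take on. All names here (DemoUsage, DemoInstr, demo_top and the demo_* functions) are hypothetical, calloc/free stand in for PostgreSQL memory contexts, and resource owner handling is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for the BufferUsage/WalUsage pair. */
typedef struct DemoUsage
{
	long		blks_hit;
	long		wal_records;
} DemoUsage;

/* Simplified stand-in for Instrumentation with its embedded stack entry. */
typedef struct DemoInstr
{
	struct DemoInstr *previous;
	DemoUsage	usage;
} DemoInstr;

/* Plays the role of pgInstrStack. */
static DemoInstr *demo_top = NULL;

/* Like the new InstrStartParallelQuery: allocate, push, hand back a handle. */
static DemoInstr *
demo_start_worker(void)
{
	DemoInstr  *instr = calloc(1, sizeof(DemoInstr));

	if (instr == NULL)
		abort();
	instr->previous = demo_top;
	demo_top = instr;
	return instr;
}

/* Like the new InstrEndParallelQuery: pop, then copy totals to the shared slot. */
static void
demo_end_worker(DemoInstr *instr, DemoUsage *out)
{
	demo_top = instr->previous;
	memcpy(out, &instr->usage, sizeof(DemoUsage));
	free(instr);
}

static void
demo_touch_buffer(void)
{
	if (demo_top)
		demo_top->usage.blks_hit++;
}

static void
demo_write_wal(void)
{
	if (demo_top)
		demo_top->usage.wal_records++;
}

/* One worker run; the slot starts with stale bytes to show it is overwritten. */
static DemoUsage
demo_worker_main(void)
{
	DemoUsage	shared_slot;
	DemoInstr  *instr;

	memset(&shared_slot, 0xFF, sizeof(shared_slot));

	instr = demo_start_worker();
	demo_touch_buffer();
	demo_touch_buffer();
	demo_write_wal();
	demo_end_worker(instr, &shared_slot);

	return shared_slot;
}
```

Because the totals are collected directly in the worker's own entry, a plain copy replaces the earlier memset plus AccumDiff against saved global counters.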
[application/octet-stream] v3-0002-Separate-node-instrumentation-from-other-use-of-I.patch (21.4K, 6-v3-0002-Separate-node-instrumentation-from-other-use-of-I.patch)
From 7546f855d138d0dac0d8c22ea5915314810f13e5 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v3 2/7] Separate node instrumentation from other use of
Instrumentation struct
Previously, several places (e.g. the query "total time") repurposed the
Instrumentation struct, which was originally introduced to capture per-node
statistics during execution. This dual use of the struct is confusing,
since it scatters InstrStartNode/InstrStopNode calls across unrelated
code paths, and it prevents future refactoring.
Instead, simplify the Instrumentation struct to only track time,
WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop
outside of per-node instrumentation. Introduce the NodeInstrumentation
struct to carry forward the per-node instrumentation information.
---
contrib/auto_explain/auto_explain.c | 8 +-
.../pg_stat_statements/pg_stat_statements.c | 8 +-
contrib/postgres_fdw/postgres_fdw.c | 2 +-
src/backend/commands/explain.c | 11 +--
src/backend/commands/trigger.c | 8 +-
src/backend/executor/execMain.c | 10 +--
src/backend/executor/execParallel.c | 22 +++--
src/backend/executor/execProcnode.c | 4 +-
src/backend/executor/instrument.c | 86 +++++++++++++++++--
src/include/executor/instrument.h | 51 ++++++++---
src/include/nodes/execnodes.h | 3 +-
11 files changed, 151 insertions(+), 62 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index c10f2fc0f25..ee0c3b4c91b 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index fe987ceaf40..f43a33b3787 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1023,7 +1023,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -1082,12 +1082,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
if (queryId != INT64CONST(0) && queryDesc->totaltime &&
pgss_enabled(nesting_level))
{
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
pgss_store(queryDesc->sourceText,
queryId,
queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 456b267f70b..7619ac486c0 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2778,7 +2778,7 @@ postgresIterateDirectModify(ForeignScanState *node)
if (!resultRelInfo->ri_projectReturning)
{
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- Instrumentation *instr = node->ss.ps.instrument;
+ NodeInstrumentation *instr = node->ss.ps.instrument;
Assert(!dmstate->has_returning);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 46c5bf252fc..de66e48366d 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1103,9 +1103,6 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
char *relname;
char *conname = NULL;
- /* Must clean up instrumentation state */
- InstrEndLoop(instr);
-
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
@@ -1136,7 +1133,7 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- 1000.0 * instr->total, instr->ntuples);
+ 1000.0 * INSTR_TIME_GET_DOUBLE(instr->total), instr->ntuples);
else
appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
}
@@ -1147,7 +1144,7 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Constraint Name", conname, es);
ExplainPropertyText("Relation", relname, es);
if (es->timing)
- ExplainPropertyFloat("Time", "ms", 1000.0 * instr->total, 3,
+ ExplainPropertyFloat("Time", "ms", 1000.0 * INSTR_TIME_GET_DOUBLE(instr->total), 3,
es);
ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
}
@@ -1893,7 +1890,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
double startup_ms;
double total_ms;
@@ -2300,7 +2297,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
if (nloops <= 0)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 579ac8d76ae..9b53dd99e99 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2344,7 +2344,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStart(instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2392,7 +2392,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* one "tuple returned" (really the number of firings).
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStop(instr + tgindx, 1);
return (HeapTuple) DatumGetPointer(result);
}
@@ -4381,7 +4381,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStart(instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4607,7 +4607,7 @@ AfterTriggerExecute(EState *estate,
* one "tuple returned" (really the number of firings).
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStop(instr + tgindx, 1);
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 713e926329c..9bc7b4e20f7 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -331,7 +331,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/*
* extract information from the query descriptor and the query feature.
@@ -383,7 +383,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime, estate->es_processed);
MemoryContextSwitchTo(oldcontext);
}
@@ -433,7 +433,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -443,7 +443,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
AfterTriggerEndQuery(estate);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime, 0);
MemoryContextSwitchTo(oldcontext);
@@ -1266,7 +1266,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0(n * sizeof(ExprState *));
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+ resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
}
else
{
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557cf..e87810d292e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -85,7 +85,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object. This will depend on the length of
+ * of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -102,11 +102,15 @@ struct SharedExecutorInstrumentation
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
- /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+ /*
+ * array of num_plan_nodes * num_workers NodeInstrumentation objects
+ * follows
+ */
};
#define GetInstrumentationArray(sei) \
(AssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
- (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+ (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
@@ -713,7 +717,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len = MAXALIGN(instrumentation_len);
instrument_offset = instrumentation_len;
instrumentation_len +=
- mul_size(sizeof(Instrumentation),
+ mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -799,7 +803,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*/
if (estate->es_instrument)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -809,7 +813,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInit(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1036,7 +1040,7 @@ static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
int n;
int ibytes;
@@ -1064,7 +1068,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* Switch into per-query memory context.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
planstate->worker_instrument =
palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
MemoryContextSwitchTo(oldcontext);
@@ -1296,7 +1300,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
{
int i;
int plan_node_id = planstate->plan->plan_node_id;
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
InstrEndLoop(planstate->instrument);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f5f9cfbeead..d286471254b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -413,8 +413,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument,
- result->async_capable);
+ result->instrument = InstrAllocNode(1, estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 1c92abe6761..1fe0f4204e5 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,9 +26,9 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int n, int instrument_options)
{
Instrumentation *instr;
@@ -41,6 +41,74 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
int i;
+ for (i = 0; i < n; i++)
+ {
+ instr[i].need_bufusage = need_buffers;
+ instr[i].need_walusage = need_wal;
+ instr[i].need_timer = need_timer;
+ }
+ }
+
+ return instr;
+}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer &&
+ !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
+ elog(ERROR, "InstrStart called twice in a row");
+
+ if (instr->need_bufusage)
+ instr->bufusage_start = pgBufferUsage;
+
+ if (instr->need_walusage)
+ instr->walusage_start = pgWalUsage;
+}
+void
+InstrStop(Instrumentation *instr, double nTuples)
+{
+ instr_time endtime;
+
+ /* count the specified tuples */
+ instr->ntuples += nTuples;
+
+ /* let's update the time only if the timer was requested */
+ if (instr->need_timer)
+ {
+ if (INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStop called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
+ }
+
+ /* Add delta of buffer usage since entry to node's totals */
+ if (instr->need_bufusage)
+ BufferUsageAccumDiff(&instr->bufusage,
+ &pgBufferUsage, &instr->bufusage_start);
+
+ if (instr->need_walusage)
+ WalUsageAccumDiff(&instr->walusage,
+ &pgWalUsage, &instr->walusage_start);
+}
+
+/* Allocate new node instrumentation structure(s) */
+NodeInstrumentation *
+InstrAllocNode(int n, int instrument_options, bool async_mode)
+{
+ NodeInstrumentation *instr;
+
+ /* initialize all fields to zeroes, then modify as needed */
+ instr = palloc0(n * sizeof(NodeInstrumentation));
+ if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
+ {
+ bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+ bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
+ bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+ int i;
+
for (i = 0; i < n; i++)
{
instr[i].need_bufusage = need_buffers;
@@ -55,9 +123,9 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
/* Initialize a pre-allocated instrumentation structure. */
void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitNode(NodeInstrumentation * instr, int instrument_options)
{
- memset(instr, 0, sizeof(Instrumentation));
+ memset(instr, 0, sizeof(NodeInstrumentation));
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
@@ -65,7 +133,7 @@ InstrInit(Instrumentation *instr, int instrument_options)
/* Entry to a plan node */
void
-InstrStartNode(Instrumentation *instr)
+InstrStartNode(NodeInstrumentation * instr)
{
if (instr->need_timer &&
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
@@ -81,7 +149,7 @@ InstrStartNode(Instrumentation *instr)
/* Exit from a plan node */
void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStopNode(NodeInstrumentation * instr, double nTuples)
{
double save_tuplecount = instr->tuplecount;
instr_time endtime;
@@ -129,7 +197,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
/* Update tuple count */
void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples)
{
/* count the returned tuples */
instr->tuplecount += nTuples;
@@ -137,7 +205,7 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation * instr)
{
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
@@ -162,7 +230,7 @@ InstrEndLoop(Instrumentation *instr)
/* aggregate instrumentation information */
void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
{
if (!dst->running && add->running)
{
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index ba5c986907e..1ae533f6704 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -66,7 +66,33 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/*
+ * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured in between).
+ */
typedef struct Instrumentation
+{
+ /* Parameters set at creation: */
+ bool need_timer; /* true if we need timer data */
+ bool need_bufusage; /* true if we need buffer usage data */
+ bool need_walusage; /* true if we need WAL usage data */
+ /* Internal state keeping: */
+ instr_time starttime; /* start time of last InstrStart */
+ BufferUsage bufusage_start; /* buffer usage at start */
+ WalUsage walusage_start; /* WAL usage at start */
+ /* Accumulated statistics: */
+ instr_time total; /* total runtime */
+ double ntuples; /* total tuples counted in InstrStop */
+ BufferUsage bufusage; /* total buffer usage */
+ WalUsage walusage; /* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
{
/* Parameters set at node creation: */
bool need_timer; /* true if we need timer data */
@@ -91,25 +117,30 @@ typedef struct Instrumentation
double nfiltered2; /* # of tuples removed by "other" quals */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
typedef struct WorkerInstrumentation
{
int num_workers; /* # of structures that follow */
- Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+ NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
- bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int n, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr, double nTuples);
+
+extern NodeInstrumentation * InstrAllocNode(int n, int instrument_options,
+ bool async_mode);
+extern void InstrInitNode(NodeInstrumentation * instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation * instr);
+extern void InstrStopNode(NodeInstrumentation * instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation * instr);
+extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
+
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a36653c37f9..eb0b8f835c2 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1166,7 +1166,8 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
- Instrumentation *instrument; /* Optional runtime stats for this node */
+ NodeInstrumentation *instrument; /* Optional runtime stats for this
+ * node */
WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
/* Per-worker JIT instrumentation */
--
2.47.1
[application/octet-stream] v3-0006-Introduce-alternate-Instrumentation-stack-mechani.patch (4.8K, 7-v3-0006-Introduce-alternate-Instrumentation-stack-mechani.patch)
From 314e3e7305da740a8fadb2c481a096cc0ca7fff0 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:26:02 -0700
Subject: [PATCH v3 6/7] Introduce alternate Instrumentation stack mechanism
relying on PG_FINALLY
The resource owner-based Instrumentation stack cannot wrap certain utility
commands that close and re-open the top-level transaction, such as CLUSTER.
This is specifically a problem for pg_stat_statements tracking of utility
commands. To support tracking such activity, allow explicit
InstrPushStack/InstrPopStack calls to modify the stack, with InstrPopStack
placed in a PG_FINALLY block to ensure cleanup on abort.
---
.../pg_stat_statements/pg_stat_statements.c | 50 +++++--------------
src/include/executor/instrument.h | 3 ++
2 files changed, 15 insertions(+), 38 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index eeabd820d8e..0e57ce65062 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -911,21 +911,13 @@ pgss_planner(Query *parse,
{
instr_time start;
instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ InstrStack *stack;
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
-
- /*
- * Similarly the planner could write some WAL records in some cases
- * (e.g. setting a hint bit with those being WAL-logged)
- */
- walusage_start = pgWalUsage;
INSTR_TIME_SET_CURRENT(start);
+ /* Track buffer/WAL usage, as the planner can access buffers and write WAL. */
+ stack = InstrPushStack();
+
nesting_level++;
PG_TRY();
{
@@ -938,6 +930,7 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrPopStack(stack);
nesting_level--;
}
PG_END_TRY();
@@ -945,14 +938,6 @@ pgss_planner(Query *parse,
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
@@ -960,8 +945,8 @@ pgss_planner(Query *parse,
PGSS_PLAN,
INSTR_TIME_GET_MILLISEC(duration),
0,
- &bufusage,
- &walusage,
+ &stack->bufusage,
+ &stack->walusage,
NULL,
NULL,
0,
@@ -1157,14 +1142,10 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
instr_time start;
instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ InstrStack *stack;
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
INSTR_TIME_SET_CURRENT(start);
+ stack = InstrPushStack();
nesting_level++;
PG_TRY();
@@ -1180,6 +1161,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrPopStack(stack);
nesting_level--;
}
PG_END_TRY();
@@ -1208,14 +1190,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
@@ -1223,8 +1197,8 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
PGSS_EXEC,
INSTR_TIME_GET_MILLISEC(duration),
rows,
- &bufusage,
- &walusage,
+ &stack->bufusage,
+ &stack->walusage,
NULL,
NULL,
0,
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index bf766706580..8804ee64311 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -147,6 +147,9 @@ extern Instrumentation *InstrAlloc(int n, int instrument_options);
extern void InstrStart(Instrumentation *instr);
extern void InstrStop(Instrumentation *instr, double nTuples, bool finalize);
+extern InstrStack * InstrPushStack(void);
+extern void InstrPopStack(InstrStack * res);
+
extern NodeInstrumentation * InstrAllocNode(int n, int instrument_options,
bool async_mode);
extern void InstrInitNode(NodeInstrumentation * instr, int instrument_options);
--
2.47.1
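The explicit push/pop usage in the patch above can be pictured with a small standalone model. This is a sketch under assumptions, not the patch's implementation: DemoStack, demo_current and the demo_* functions are hypothetical names, a single counter stands in for the buffer/WAL fields, and the body of InstrStackAdd is not shown in this excerpt, so the sketch assumes it is a plain field-wise addition into the parent entry.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for InstrStack, reduced to one counter. */
typedef struct DemoStack
{
	struct DemoStack *previous;	/* entry that was current before the push */
	int64_t		blks_hit;
} DemoStack;

/* Hypothetical stand-in for the pgInstrStack global. */
DemoStack  *demo_current = NULL;

/* Make a fresh, zeroed entry current and remember the old one. */
DemoStack *
demo_push(void)
{
	DemoStack  *stack = calloc(1, sizeof(DemoStack));

	if (stack == NULL)
		abort();
	stack->previous = demo_current;
	demo_current = stack;
	return stack;
}

/* Restore the previous entry and fold this entry's counts into it. */
void
demo_pop(DemoStack *stack)
{
	assert(stack != NULL);
	demo_current = stack->previous;
	if (demo_current)
		demo_current->blks_hit += stack->blks_hit;
}

/* Counting site: increment only the current entry, if there is one. */
void
demo_count_hit(void)
{
	if (demo_current)
		demo_current->blks_hit++;
}
```

In this model a caller reads its own totals straight from the entry it pushed, which is the role `&stack->bufusage` and `&stack->walusage` play in the pgss_store calls, while the enclosing entry still sees the nested activity once the pop has run.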
[application/octet-stream] v3-0007-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch (17.8K, 8-v3-0007-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch)
From 5983589a18f286029d1796c3d6363de326ff4463 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:26:56 -0700
Subject: [PATCH v3 7/7] Convert remaining users of pgBufferUsage to use
InstrStart/InstrStop, drop the global
---
src/backend/access/heap/vacuumlazy.c | 35 ++++++++---------
src/backend/commands/analyze.c | 35 ++++++++---------
src/backend/commands/explain.c | 26 +++++--------
src/backend/commands/explain_dr.c | 31 ++++++++-------
src/backend/commands/prepare.c | 26 +++++--------
src/backend/executor/instrument.c | 56 +++++++++++-----------------
src/include/executor/instrument.h | 8 +---
7 files changed, 90 insertions(+), 127 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d2b031fdd06..338d540aa01 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -641,8 +641,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ Instrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
@@ -657,6 +656,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -959,14 +960,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr, 0, true);
+
if (verbose || params.log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_vacuum_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -977,17 +978,13 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
int64 total_blks_dirtied;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+
+ total_blks_hit = INSTR_GET_BUFUSAGE(instr).shared_blks_hit +
+ INSTR_GET_BUFUSAGE(instr).local_blks_hit;
+ total_blks_read = INSTR_GET_BUFUSAGE(instr).shared_blks_read +
+ INSTR_GET_BUFUSAGE(instr).local_blks_read;
+ total_blks_dirtied = INSTR_GET_BUFUSAGE(instr).shared_blks_dirtied +
+ INSTR_GET_BUFUSAGE(instr).local_blks_dirtied;
initStringInfo(&buf);
if (verbose)
@@ -1149,10 +1146,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
total_blks_dirtied);
appendStringInfo(&buf,
_("WAL usage: %" PRId64 " records, %" PRId64 " full page images, %" PRIu64 " bytes, %" PRId64 " buffers full\n"),
- walusage.wal_records,
- walusage.wal_fpi,
- walusage.wal_bytes,
- walusage.wal_buffers_full);
+ INSTR_GET_WALUSAGE(instr).wal_records,
+ INSTR_GET_WALUSAGE(instr).wal_fpi,
+ INSTR_GET_WALUSAGE(instr).wal_bytes,
+ INSTR_GET_WALUSAGE(instr).wal_buffers_full);
appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
ereport(verbose ? INFO : LOG,
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c2e216563c6..92ca59778c7 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -302,9 +302,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ Instrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -355,6 +353,9 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -735,12 +736,13 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr, 0, true);
+
if (verbose || params.log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_analyze_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -749,17 +751,12 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
int64 total_blks_read;
int64 total_blks_dirtied;
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ total_blks_hit = INSTR_GET_BUFUSAGE(instr).shared_blks_hit +
+ INSTR_GET_BUFUSAGE(instr).local_blks_hit;
+ total_blks_read = INSTR_GET_BUFUSAGE(instr).shared_blks_read +
+ INSTR_GET_BUFUSAGE(instr).local_blks_read;
+ total_blks_dirtied = INSTR_GET_BUFUSAGE(instr).shared_blks_dirtied +
+ INSTR_GET_BUFUSAGE(instr).local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
@@ -832,10 +829,10 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
total_blks_dirtied);
appendStringInfo(&buf,
_("WAL usage: %" PRId64 " records, %" PRId64 " full page images, %" PRIu64 " bytes, %" PRId64 " buffers full\n"),
- walusage.wal_records,
- walusage.wal_fpi,
- walusage.wal_bytes,
- walusage.wal_buffers_full);
+ INSTR_GET_WALUSAGE(instr).wal_records,
+ INSTR_GET_WALUSAGE(instr).wal_fpi,
+ INSTR_GET_WALUSAGE(instr).wal_bytes,
+ INSTR_GET_WALUSAGE(instr).wal_buffers_full);
appendStringInfo(&buf, _("system usage: %s"), pg_rusage_show(&ru0));
ereport(verbose ? INFO : LOG,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index bb0689b95d4..4a396575bae 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -322,14 +322,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+ else
+ instr = InstrAlloc(1, INSTRUMENT_TIMER);
+
if (es->memory)
{
/*
@@ -346,15 +348,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(instr, 0, true);
if (es->memory)
{
@@ -362,16 +361,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->total, (es->buffers ? &INSTR_GET_BUFUSAGE(instr) : NULL),
es->memory ? &mem_counters : NULL);
}
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 95685d7e88d..9fa0b51e62a 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,20 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ Instrumentation *instr = NULL;
/* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ if (myState->es->timing || myState->es->buffers)
+ {
+ InstrumentOption instrument_options = 0;
+
+ if (myState->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (myState->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ instr = InstrAlloc(1, instrument_options);
+ InstrStart(instr);
+ }
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +191,16 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
+ if (myState->es->timing || myState->es->buffers)
+ InstrStop(instr, 0, true);
+
/* Update timing data */
if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
+ INSTR_TIME_ADD(myState->metrics.timeSpent, instr->total);
/* Update buffer metrics */
if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ BufferUsageAdd(&myState->metrics.bufferUsage, &INSTR_GET_BUFUSAGE(instr));
return true;
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 34b6410d6a2..d92aeb6a1df 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -578,14 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+ else
+ instr = InstrAlloc(1, INSTRUMENT_TIMER);
+
if (es->memory)
{
/* See ExplainOneQuery about this */
@@ -596,9 +598,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -633,8 +633,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(instr, 0, true);
if (es->memory)
{
@@ -642,13 +641,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -658,7 +650,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->total, (es->buffers ? &INSTR_GET_BUFUSAGE(instr) : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index d5fdbecb025..d61830a7fd8 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -18,11 +18,9 @@
#include "executor/instrument.h"
#include "utils/memutils.h"
-BufferUsage pgBufferUsage;
WalUsage pgWalUsage;
InstrStack *pgInstrStack = NULL;
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
/*
@@ -113,6 +111,27 @@ ResOwnerReleaseInstrumentation(Datum res)
instr->finalized = true;
}
+InstrStack *
+InstrPushStack(void)
+{
+ InstrStack *stack = palloc0(sizeof(InstrStack));
+
+ stack->previous = pgInstrStack;
+ pgInstrStack = stack;
+
+ return stack;
+}
+
+void
+InstrPopStack(InstrStack * stack)
+{
+ Assert(stack != NULL);
+
+ pgInstrStack = stack->previous;
+ if (pgInstrStack)
+ InstrStackAdd(pgInstrStack, stack);
+}
+
/* General purpose instrumentation handling */
Instrumentation *
InstrAlloc(int n, int instrument_options)
@@ -393,12 +412,11 @@ InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
WalUsageAdd(&dst->walusage, walusage);
}
- BufferUsageAdd(&pgBufferUsage, bufusage);
WalUsageAdd(&pgWalUsage, walusage);
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -419,36 +437,6 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
-void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 8804ee64311..e45c452bc79 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -139,7 +139,6 @@ typedef struct WorkerInstrumentation
NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern PGDLLIMPORT InstrStack * pgInstrStack;
@@ -162,9 +161,8 @@ extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
extern Instrumentation *InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
@@ -175,22 +173,18 @@ extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
instr->stack.walusage
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
if (pgInstrStack) \
pgInstrStack->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += val; \
if (pgInstrStack) \
pgInstrStack->bufusage.fld += val; \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
if (pgInstrStack) \
INSTR_TIME_ADD(pgInstrStack->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
if (pgInstrStack) \
INSTR_TIME_ACCUM_DIFF(pgInstrStack->bufusage.fld, endval, startval); \
} while (0)
--
2.47.1
[application/octet-stream] v3-0003-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch (9.1K, 9-v3-0003-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch)
From 4b7d15a6950f9374df6d05f84e213d84e11d54a1 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 31 Aug 2025 16:34:42 -0700
Subject: [PATCH v3 3/7] Replace direct changes of pgBufferUsage/pgWalUsage
with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
---
src/backend/access/transam/xlog.c | 8 ++++----
src/backend/storage/buffer/bufmgr.c | 24 ++++++++++++------------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
6 files changed, 46 insertions(+), 27 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eceab341255..e324e5a78ce 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1078,9 +1078,9 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2060,7 +2060,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = 0;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e8544acb784..92cb4ea5645 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -706,7 +706,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -727,7 +727,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
if (BufferTagsEqual(&tag, &bufHdr->tag))
{
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
UnpinBuffer(bufHdr);
@@ -1128,14 +1128,14 @@ PinBufferForBlock(Relation rel,
{
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
if (*foundPtr)
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
}
else
{
bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
strategy, foundPtr, io_context);
if (*foundPtr)
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
}
if (rel)
{
@@ -1869,9 +1869,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
if (operation->rel)
pgstat_count_buffer_hit(operation->rel);
@@ -1939,9 +1939,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it.
@@ -2819,7 +2819,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -2975,7 +2975,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4394,7 +4394,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
@@ -5557,7 +5557,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
if (dirtied)
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 15aac7d1c9f..4481920ea5f 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -217,7 +217,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -478,7 +478,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -509,7 +509,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u32(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 366d70d38a1..9d39df998cb 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -474,13 +474,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -548,13 +548,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 13ae57ed649..4f6274eb573 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1ae533f6704..78d3653997b 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -149,4 +149,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += val; \
+ } while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += val; \
+ } while(0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
* Re: Stack-based tracking of per-node WAL/buffer usage
2025-08-31 23:57 Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-09-04 20:23 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-09-09 19:35 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 11:28 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
@ 2025-10-22 12:59 ` Andres Freund <[email protected]>
2025-10-31 07:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Andres Freund @ 2025-10-22 12:59 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>
On 2025-10-22 14:28:24 +0300, Lukas Fittl wrote:
> On Tue, Sep 9, 2025 at 10:35 PM Lukas Fittl <[email protected]> wrote:
>
> > Attached an updated patch set that addresses the feedback, and also adds
> > the complete removal of the global pgBufferUsage variable in later patches
> > (0005-0007), to avoid counting both the stack and the variable.
> >
>
> See attached the same patch set rebased on latest master.
> From d40f69cce15dfa10479c8be31917b33a49d01477 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sun, 31 Aug 2025 16:37:05 -0700
> Subject: [PATCH v3 1/7] Instrumentation: Keep time fields as instrtime,
> require caller to convert
>
> Previously the Instrumentation logic always converted to seconds, only for many
> of the callers to do unnecessary division to get to milliseconds. Since an upcoming
> refactoring will split the Instrumentation struct, utilize instrtime always to
> keep things simpler.
LGTM, think we should apply this regardless of the rest of the patches.
> @@ -173,14 +169,14 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
> dst->running = true;
> dst->firsttuple = add->firsttuple;
> }
> - else if (dst->running && add->running && dst->firsttuple > add->firsttuple)
> + else if (dst->running && add->running && INSTR_TIME_CMP_LT(dst->firsttuple, add->firsttuple))
> dst->firsttuple = add->firsttuple;
This isn't due to this patch, but it seems a bit odd that we use the minimum
time for the first tuple, but the average time for the node's completion...
> diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
> index f71a851b18d..646934020d1 100644
> --- a/src/include/portability/instr_time.h
> +++ b/src/include/portability/instr_time.h
> @@ -184,6 +184,8 @@ GetTimerFrequency(void)
> #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
> ((x).ticks += (y).ticks - (z).ticks)
>
> +#define INSTR_TIME_CMP_LT(x,y) \
> + ((x).ticks > (y).ticks)
>
> #define INSTR_TIME_GET_DOUBLE(t) \
> ((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
> --
> 2.47.1
Any reason to actually have _CMP_ in the name? Other operations like _ADD
don't have such an additional verb in the name.
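To illustrate the naming point, here is a minimal sketch of a comparison helper that follows the existing verb-less style, with the comparison direction matching the name. It assumes a simplified instr_time that holds only a tick count; INSTR_TIME_LT is a hypothetical name, not something taken from the patch.

```c
#include <stdint.h>

/* Simplified stand-in for instr_time; assumed to hold only a tick count. */
typedef struct instr_time
{
    int64_t     ticks;
} instr_time;

/* Existing-style operation, named without an extra verb. */
#define INSTR_TIME_ADD(x,y) \
    ((x).ticks += (y).ticks)

/* Hypothetical name for the comparison: true if x is earlier than y. */
#define INSTR_TIME_LT(x,y) \
    ((x).ticks < (y).ticks)
```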
> From 7546f855d138d0dac0d8c22ea5915314810f13e5 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sat, 1 Mar 2025 19:31:30 -0800
> Subject: [PATCH v3 2/7] Separate node instrumentation from other use of
> Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the Instrumentation struct initially introduced for capturing per-node
> statistics during execution. This dual use of the struct is confusing,
> e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
> code paths, and prevents future refactorings.
>
> Instead, simplify the Instrumentation struct to only track time,
> WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop
> outside of per-node instrumentation. Introduce the NodeInstrumentation
> struct to carry forward the per-node instrumentation information.
> @@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
> */
> oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
>
> - /*
> - * Make sure stats accumulation is done. (Note: it's okay if several
> - * levels of hook all do this.)
> - */
> - InstrEndLoop(queryDesc->totaltime);
> -
> /* Log plan if duration is exceeded. */
> msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
> if (msec >= auto_explain_log_min_duration)
Maybe add a comment about the removal of these InstrEndLoop() calls to the
commit message? If I understand correctly they were superfluous before, but
that's not entirely obvious when just looking at the patch.
> +/*
> + * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
> + *
> + * Initialized through InstrAlloc, followed by one or more calls to a pair of
> + * InstrStart/InstrStop (activity is measured inbetween).
> + */
> typedef struct Instrumentation
> +{
> + /* Parameters set at creation: */
> + bool need_timer; /* true if we need timer data */
> + bool need_bufusage; /* true if we need buffer usage data */
> + bool need_walusage; /* true if we need WAL usage data */
> + /* Internal state keeping: */
> + instr_time starttime; /* start time of last InstrStart */
> + BufferUsage bufusage_start; /* buffer usage at start */
> + WalUsage walusage_start; /* WAL usage at start */
> + /* Accumulated statistics: */
> + instr_time total; /* total runtime */
> + double ntuples; /* total tuples counted in InstrStop */
> + BufferUsage bufusage; /* total buffer usage */
> + WalUsage walusage; /* total WAL usage */
> +} Instrumentation;
Maybe add a comment explaining why ntuples is in here?
> From aa1acccb3dfa6a5d81a9a049d8cb63762a3d7cf7 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Tue, 9 Sep 2025 02:16:59 -0700
> Subject: [PATCH v3 4/7] Introduce stack for tracking per-node WAL/buffer usage
Could use a commit message :)
> ---
> .../pg_stat_statements/pg_stat_statements.c | 4 +-
> src/backend/commands/explain.c | 8 +-
> src/backend/commands/trigger.c | 4 +-
> src/backend/executor/execMain.c | 25 ++-
> src/backend/executor/execProcnode.c | 29 +++
> src/backend/executor/instrument.c | 199 ++++++++++++++----
> src/include/executor/executor.h | 1 +
> src/include/executor/instrument.h | 53 ++++-
> 8 files changed, 260 insertions(+), 63 deletions(-)
>
> diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
> index f43a33b3787..eeabd820d8e 100644
> --- a/contrib/pg_stat_statements/pg_stat_statements.c
> +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> @@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
> PGSS_EXEC,
> INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
> queryDesc->estate->es_total_processed,
> - &queryDesc->totaltime->bufusage,
> - &queryDesc->totaltime->walusage,
> + &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
> + &INSTR_GET_WALUSAGE(queryDesc->totaltime),
Getting a pointer to something returned by a macro is a bit ugly... Perhaps
it'd be better to just pass the &queryDesc->totaltime? But ugh, that's not
easily possible given how pgss_planner() currently tracks things :(
Maybe it's worth refactoring this a bit in a precursor patch?
> @@ -1266,7 +1280,12 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
> resultRelInfo->ri_TrigWhenExprs = (ExprState **)
> palloc0(n * sizeof(ExprState *));
> if (instrument_options)
> - resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options);
> + {
> + if ((instrument_options & INSTRUMENT_TIMER) != 0)
> + resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_TIMER);
> + else
> + resultRelInfo->ri_TrigInstrument = InstrAlloc(n, INSTRUMENT_ROWS);
> + }
> }
> else
> {
I'd not duplicate the InstrAlloc(), but compute the flags separately.
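Roughly what I have in mind, as a sketch only: the flag values below are illustrative stand-ins for the definitions in instrument.h, and trigger_instrument_flags is a hypothetical helper name.

```c
/* Illustrative flag values; the real definitions live in instrument.h. */
#define INSTRUMENT_TIMER    (1 << 0)
#define INSTRUMENT_BUFFERS  (1 << 1)
#define INSTRUMENT_ROWS     (1 << 2)
#define INSTRUMENT_WAL      (1 << 3)

/*
 * Hypothetical helper: reduce the caller's options to what trigger
 * instrumentation needs, so that a single InstrAlloc() call suffices.
 */
static int
trigger_instrument_flags(int instrument_options)
{
    if ((instrument_options & INSTRUMENT_TIMER) != 0)
        return INSTRUMENT_TIMER;
    return INSTRUMENT_ROWS;
}
```

The call site would then have a single InstrAlloc(n, trigger_instrument_flags(instrument_options)) instead of one call per branch.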
> /* ------------------------------------------------------------------------
> @@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void *context)
> return false;
> }
>
> +/*
> + * ExecAccumNodeInstrumentation
> + *
> + * Accumulate instrumentation stats from all execution nodes to their respective
> + * parents (or the original parent instrumentation stack).
> + */
> +void
> +ExecAccumNodeInstrumentation(PlanState *node)
> +{
> + (void) ExecAccumNodeInstrumentation_walker(node, NULL);
> +}
I wonder if this is too narrow a name. There might be other uses of a pass
across the node tree at that point. OTOH, it's probably better to just rename
it at that later point.
> +static bool
> +ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
> +{
> + if (node == NULL)
> + return false;
> +
> + check_stack_depth();
> +
> + planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);
There already is a check_stack_depth() in planstate_tree_walker().
> + if (node->instrument && node->instrument->stack.previous)
> + InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack);
> +
> + return false;
> +}
E.g. in ExecShutdownNode_walker we use planstate_tree_walker(), but then also
have special handling for a few node types. Do we need something like that
here too? It probably is ok, but it's worth explicitly checking and adding a
comment.
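For reference, the shape of the walk is a post-order fold: children are visited first, then each node adds its totals to its parent. A toy sketch of that shape, with hypothetical type and field names (ToyNode, blks_hit) standing in for the PlanState tree and the stack entries:

```c
#include <stddef.h>

/* Toy stand-ins; the field and type names here are hypothetical. */
typedef struct ToyNode
{
    long        blks_hit;       /* counter owned by this node */
    struct ToyNode *parent;     /* plays the role of stack.previous */
    struct ToyNode *left;
    struct ToyNode *right;
} ToyNode;

/* Visit children first, then fold this node's total into its parent. */
static void
toy_accum_walker(ToyNode *node)
{
    if (node == NULL)
        return;

    toy_accum_walker(node->left);
    toy_accum_walker(node->right);

    if (node->parent != NULL)
        node->parent->blks_hit += node->blks_hit;
}
```

Any child that the generic walker does not reach would simply never be folded into its parent, which is why the special-cased node types are worth checking.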
> +/*
> + * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
> + */
> +static void ResOwnerReleaseInstrumentation(Datum res);
> +static const ResourceOwnerDesc instrumentation_resowner_desc =
> +{
> + .name = "instrumentation",
> + .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
> + .release_priority = RELEASE_PRIO_FIRST,
> + .ReleaseResource = ResOwnerReleaseInstrumentation,
> + .DebugPrint = NULL, /* default message is fine */
> +};
Is there a reason to do the release here before the lock release? And why
_FIRST?
> +static void
> +ResOwnerReleaseInstrumentation(Datum res)
> +{
> + Instrumentation *instr = (Instrumentation *) DatumGetPointer(res);
> +
> + /*
> + * Because registered resources are *not* called in reverse order, we'll
> + * get what was first registered first at shutdown. Thus, on any later
> + * resources we need to not change the stack, which was already set to the
> + * correct previous entry.
> + */
FWIW, the release order is not guaranteed to be in that order either,
e.g. once resowner switches to hashing, it'll essentially be random.
> + if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
> + pgInstrStack = instr->stack.previous;
Hm - this is effectively O(stack-depth^2), right? It's probably fine, given
that we have fairly limited nesting (explain + pg_stat_statements +
auto_explain is probably the current max), but seems worth noting in a
comment?
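The quadratic bound follows if the parent check walks the previous chain: one walk per released entry, each up to the stack depth. A minimal sketch of such a walk, assuming that is roughly what StackIsParent does (its definition is not quoted here; ToyStack and toy_stack_is_ancestor are hypothetical names):

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyStack
{
    struct ToyStack *previous;
} ToyStack;

/*
 * Hypothetical check: is "candidate" reachable from "entry" by following
 * previous links?  One call costs O(depth).
 */
static bool
toy_stack_is_ancestor(const ToyStack *candidate, const ToyStack *entry)
{
    const ToyStack *cur;

    for (cur = entry->previous; cur != NULL; cur = cur->previous)
    {
        if (cur == candidate)
            return true;
    }
    return false;
}
```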
> + /*
> + * Always accumulate all collected stats before the abort, even if we
> + * already walked up the stack with an earlier resource.
> + */
> + if (pgInstrStack)
> + InstrStackAdd(pgInstrStack, &instr->stack);
Why are we accumulating stats in case of errors? It's probably fine, but doing
less as part of cleanup is preferable...
> /* General purpose instrumentation handling */
> Instrumentation *
> InstrAlloc(int n, int instrument_options)
> {
> - Instrumentation *instr;
> + Instrumentation *instr = NULL;
> + bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> + bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> + bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> + int i;
> +
> + /*
> + * If resource owner will be used, we must allocate in the transaction
> + * context (not the calling context, usually a lower context), because the
> + * memory might otherwise be freed too early in an abort situation.
> + */
> + if (need_buffers || need_wal)
> + instr = MemoryContextAllocZero(CurTransactionContext, n * sizeof(Instrumentation));
> + else
> + instr = palloc0(n * sizeof(Instrumentation));
Is this long-lived enough? I'm e.g. wondering about utility statements that
internally start transactions; wouldn't that cause problems with a user
like pgss tracking something like CIC?
> - /* initialize all fields to zeroes, then modify as needed */
> - instr = palloc0(n * sizeof(Instrumentation));
> - if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
> + for (i = 0; i < n; i++)
> {
> - bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> - bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> - bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> - int i;
> -
> - for (i = 0; i < n; i++)
> - {
> - instr[i].need_bufusage = need_buffers;
> - instr[i].need_walusage = need_wal;
> - instr[i].need_timer = need_timer;
> - }
> + instr[i].need_bufusage = need_buffers;
> + instr[i].need_walusage = need_wal;
> + instr[i].need_timer = need_timer;
> }
>
> return instr;
> }
> +
> void
> InstrStart(Instrumentation *instr)
> {
> + Assert(!instr->finalized);
> +
> if (instr->need_timer &&
> !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
> elog(ERROR, "InstrStart called twice in a row");
>
> - if (instr->need_bufusage)
> - instr->bufusage_start = pgBufferUsage;
> -
> - if (instr->need_walusage)
> - instr->walusage_start = pgWalUsage;
> + if (instr->need_bufusage || instr->need_walusage)
> + InstrPushStackResource(instr);
> }
> +
> void
> -InstrStop(Instrumentation *instr, double nTuples)
> +InstrStop(Instrumentation *instr, double nTuples, bool finalize)
> {
> instr_time endtime;
>
> @@ -84,14 +178,15 @@ InstrStop(Instrumentation *instr, double nTuples)
> INSTR_TIME_SET_ZERO(instr->starttime);
> }
>
> - /* Add delta of buffer usage since entry to node's totals */
> - if (instr->need_bufusage)
> - BufferUsageAccumDiff(&instr->bufusage,
> - &pgBufferUsage, &instr->bufusage_start);
> + if (instr->need_bufusage || instr->need_walusage)
> + InstrPopStackResource(instr);
>
> - if (instr->need_walusage)
> - WalUsageAccumDiff(&instr->walusage,
> - &pgWalUsage, &instr->walusage_start);
> + if (finalize)
> + {
> + instr->finalized = true;
> + if (pgInstrStack)
> + InstrStackAdd(pgInstrStack, &instr->stack);
> + }
> }
>
> /* Allocate new node instrumentation structure(s) */
> @@ -139,12 +234,14 @@ InstrStartNode(NodeInstrumentation * instr)
> !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
> elog(ERROR, "InstrStartNode called twice in a row");
>
> - /* save buffer usage totals at node entry, if needed */
> - if (instr->need_bufusage)
> - instr->bufusage_start = pgBufferUsage;
> + if (instr->need_bufusage || instr->need_walusage)
> + {
> + /* Ensure that we always have a parent, even at the top most node */
> + Assert(pgInstrStack != NULL);
>
> - if (instr->need_walusage)
> - instr->walusage_start = pgWalUsage;
> + instr->stack.previous = pgInstrStack;
> + pgInstrStack = &instr->stack;
> + }
> }
>
> /* Exit from a plan node */
> @@ -169,14 +266,12 @@ InstrStopNode(NodeInstrumentation * instr, double nTuples)
> INSTR_TIME_SET_ZERO(instr->starttime);
> }
>
> - /* Add delta of buffer usage since entry to node's totals */
> - if (instr->need_bufusage)
> - BufferUsageAccumDiff(&instr->bufusage,
> - &pgBufferUsage, &instr->bufusage_start);
> -
> - if (instr->need_walusage)
> - WalUsageAccumDiff(&instr->walusage,
> - &pgWalUsage, &instr->walusage_start);
> + if (instr->need_bufusage || instr->need_walusage)
> + {
> + /* Ensure that we always have a parent, even at the top most node */
> + Assert(instr->stack.previous != NULL);
> + pgInstrStack = instr->stack.previous;
> + }
>
> /* Is this the first tuple of this cycle? */
> if (!instr->running)
> @@ -253,10 +348,20 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
>
> /* Add delta of buffer usage since entry to node's totals */
> if (dst->need_bufusage)
> - BufferUsageAdd(&dst->bufusage, &add->bufusage);
> + BufferUsageAdd(&dst->stack.bufusage, &add->stack.bufusage);
>
> if (dst->need_walusage)
> - WalUsageAdd(&dst->walusage, &add->walusage);
> + WalUsageAdd(&dst->stack.walusage, &add->stack.walusage);
> +}
> +
> +void
> +InstrStackAdd(InstrStack * dst, InstrStack * add)
> +{
> + Assert(dst != NULL);
> + Assert(add != NULL);
> +
> + BufferUsageAdd(&dst->bufusage, &add->bufusage);
> + WalUsageAdd(&dst->walusage, &add->walusage);
> }
>
Do we want to do BufferUsageAdd() etc even if we are not tracking buffer
usage? Those operations aren't cheap...
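One possible shape for skipping the work, as a sketch under assumptions: ToyUsage has two counters instead of the full BufferUsage field list, and the per-entry need_bufusage flag on the stack entry is hypothetical.

```c
#include <stdbool.h>

/* Toy counters; the real BufferUsage/WalUsage structs have many more fields. */
typedef struct ToyUsage
{
    long        blks_hit;
    long        blks_read;
} ToyUsage;

typedef struct ToyStackEntry
{
    bool        need_bufusage;  /* hypothetical per-entry flag */
    ToyUsage    bufusage;
} ToyStackEntry;

/* dst += add, skipped entirely when the destination does not track it. */
static void
toy_stack_add(ToyStackEntry *dst, const ToyStackEntry *add)
{
    if (!dst->need_bufusage)
        return;

    dst->bufusage.blks_hit += add->bufusage.blks_hit;
    dst->bufusage.blks_read += add->bufusage.blks_read;
}
```

Whether skipping is safe depends on whether an untracked entry can sit between two tracked ones, since counts folded through it would be dropped.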
> /* note current values during parallel executor startup */
> @@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> void
> InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> {
> + if (pgInstrStack != NULL)
> + {
> + InstrStack *dst = pgInstrStack;
> +
> + BufferUsageAdd(&dst->bufusage, bufusage);
> + WalUsageAdd(&dst->walusage, walusage);
> + }
> +
> BufferUsageAdd(&pgBufferUsage, bufusage);
> WalUsageAdd(&pgWalUsage, walusage);
> }
Is the pgInstrStack == NULL case actually reachable?
> diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
> index 78d3653997b..d04607ce40c 100644
> --- a/src/include/executor/instrument.h
> +++ b/src/include/executor/instrument.h
> @@ -14,6 +14,7 @@
> #define INSTRUMENT_H
>
> #include "portability/instr_time.h"
> +#include "utils/resowner.h"
I'd probably not include resowner here but just forward declare the typedef.
> @@ -146,26 +161,46 @@ extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
> extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
> extern void BufferUsageAccumDiff(BufferUsage *dst,
> const BufferUsage *add, const BufferUsage *sub);
> +extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
> extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
> const WalUsage *sub);
>
> +#define INSTR_GET_BUFUSAGE(instr) \
> + instr->stack.bufusage
> +
> +#define INSTR_GET_WALUSAGE(instr) \
> + instr->stack.walusage
Not convinced that having these macros is worthwhile.
At this point I reached return -ENEEDCOFFEE :)
Greetings,
Andres Freund
* Re: Stack-based tracking of per-node WAL/buffer usage
2025-08-31 23:57 Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-09-04 20:23 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-09-09 19:35 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 11:28 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 12:59 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
@ 2025-10-31 07:18 ` Lukas Fittl <[email protected]>
2026-03-08 04:27 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Lukas Fittl @ 2025-10-31 07:18 UTC (permalink / raw)
To: Andres Freund <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>
Hi Andres,
Thanks for the detailed review!
Attached v4 patchset that addresses feedback (unless otherwise noted below)
and is rebased on master. Other changes:
- Ensured each patch is individually pgindent clean (and compiles)
- Refactored 0003 a bit to consistently use InstrPushStack/InstrPopStack
helpers for modifying the active stack entry
- Building on that refactoring, merged v3/0006 "Introduce alternate
Instrumentation stack mechanism relying on PG_FINALLY" into the main commit
introducing the stack mechanism (the "alternate" mechanism is just using
these helpers directly and making sure InstrPopStack is called via
PG_FINALLY, instead of using resource owners)
- Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that
illustrates how the stack mechanism can be used to separate index and table
buffer accesses in the EXPLAIN for Index Scans
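For readers skimming the thread, the push/pop helpers boil down to the following toy sketch. It uses a single counter instead of BufferUsage/WalUsage, and the toy_* names are hypothetical; only the push/pop logic mirrors InstrPushStack/InstrPopStack.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stack entry with a single counter instead of BufferUsage/WalUsage. */
typedef struct ToyInstrStack
{
    struct ToyInstrStack *previous;
    long        blks_hit;
} ToyInstrStack;

/* Currently active entry; NULL when nothing is being tracked. */
static ToyInstrStack *toy_current = NULL;

static void
toy_push(ToyInstrStack *stack)
{
    stack->previous = toy_current;
    toy_current = stack;
}

/* Restore the previous entry and fold the popped totals into it. */
static void
toy_pop(ToyInstrStack *stack)
{
    assert(stack != NULL);

    toy_current = stack->previous;
    if (toy_current)
        toy_current->blks_hit += stack->blks_hit;
}

/* Counter bump that only touches the active entry, if there is one. */
#define TOY_BUFUSAGE_INCR() do { \
        if (toy_current) \
            toy_current->blks_hit++; \
    } while (0)
```

In the PG_FINALLY variant the pop call would sit in the PG_FINALLY block, so that the previous entry is restored on error as well.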
On Wed, Oct 22, 2025 at 5:59 AM Andres Freund <[email protected]> wrote:
> > +/*
> > + * General purpose instrumentation that can capture time, WAL/buffer usage and tuples
> > + *
> > + * Initialized through InstrAlloc, followed by one or more calls to a pair of
> > + * InstrStart/InstrStop (activity is measured inbetween).
> > + */
> > typedef struct Instrumentation
> > +{
> > + /* Parameters set at creation: */
> > + bool need_timer; /* true if we need timer data */
> > + bool need_bufusage; /* true if we need buffer usage data */
> > + bool need_walusage; /* true if we need WAL usage data */
> > + /* Internal state keeping: */
> > + instr_time starttime; /* start time of last InstrStart */
> > + BufferUsage bufusage_start; /* buffer usage at start */
> > + WalUsage walusage_start; /* WAL usage at start */
> > + /* Accumulated statistics: */
> > + instr_time total; /* total runtime */
> > + double ntuples; /* total tuples counted in InstrStop */
> > + BufferUsage bufusage; /* total buffer usage */
> > + WalUsage walusage; /* total WAL usage */
> > +} Instrumentation;
>
> Maybe add a comment explaining why ntuples is in here?
>
After thinking about this some more, I think we should just go ahead and
special-case trigger instrumentation, and specifically count firings of the
trigger (which were counted in "ntuples" before).
I've adjusted the 0002 patch accordingly to split out both node and trigger
instrumentation.
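As a rough picture of what a trigger-specific struct could look like, here is a small standalone sketch. The names DemoTriggerInstr, firings and demo_trigger_stop are hypothetical and are not taken from the 0002 patch; time is kept as plain nanoseconds for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical trigger-only instrumentation: no tuple counter at all. */
typedef struct DemoTriggerInstr
{
	int64_t		total_ns;		/* accumulated runtime, in nanoseconds */
	int64_t		firings;		/* number of times the trigger fired */
} DemoTriggerInstr;

/* Called once per trigger invocation with the measured duration. */
void
demo_trigger_stop(DemoTriggerInstr *instr, int64_t elapsed_ns)
{
	assert(elapsed_ns >= 0);
	instr->total_ns += elapsed_ns;
	instr->firings += 1;
}
```

The point of the split is that the counter's name says what is counted, instead of reusing a tuple counter for trigger calls.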
>
>
>
> > ---
> > .../pg_stat_statements/pg_stat_statements.c | 4 +-
> > src/backend/commands/explain.c | 8 +-
> > src/backend/commands/trigger.c | 4 +-
> > src/backend/executor/execMain.c | 25 ++-
> > src/backend/executor/execProcnode.c | 29 +++
> > src/backend/executor/instrument.c | 199 ++++++++++++++----
> > src/include/executor/executor.h | 1 +
> > src/include/executor/instrument.h | 53 ++++-
> > 8 files changed, 260 insertions(+), 63 deletions(-)
> >
> > diff --git a/contrib/pg_stat_statements/pg_stat_statements.c
> b/contrib/pg_stat_statements/pg_stat_statements.c
> > index f43a33b3787..eeabd820d8e 100644
> > --- a/contrib/pg_stat_statements/pg_stat_statements.c
> > +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> > @@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
> > PGSS_EXEC,
> >
> INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
> > queryDesc->estate->es_total_processed,
> > - &queryDesc->totaltime->bufusage,
> > - &queryDesc->totaltime->walusage,
> > +
> &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
> > +
> &INSTR_GET_WALUSAGE(queryDesc->totaltime),
>
> Getting a pointer to something returned by a macro is a bit ugly... Perhaps
> it'd be better to just pass the &queryDesc->totaltime? But ugh, that's not
> easily possible given how pgss_planner() currently tracks things :(
>
> Maybe it's worth refactoring this a bit in a precursor patch?
>
I've dropped the macro (it's just one additional indirection after all) - I
think we could refactor this further (i.e. pass the stack), but that
doesn't seem strictly necessary.
> /*
> ------------------------------------------------------------------------
> > @@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void
> *context)
> > return false;
> > }
> >
> > +/*
> > + * ExecAccumNodeInstrumentation
> > + *
> > + * Accumulate instrumentation stats from all execution nodes to their
> respective
> > + * parents (or the original parent instrumentation stack).
> > + */
> > +void
> > +ExecAccumNodeInstrumentation(PlanState *node)
> > +{
> > + (void) ExecAccumNodeInstrumentation_walker(node, NULL);
> > +}
>
> I wonder if this is too narrow a name. There might be other uses of a pass
> across the node tree at that point. OTOH, it's probably better to just
> rename
> it at that later point.
>
Yeah, I can't think of a better name, so I've left this the same for now.
>
> > + if (node->instrument && node->instrument->stack.previous)
> > + InstrStackAdd(node->instrument->stack.previous,
> &node->instrument->stack);
> > +
> > + return false;
> > +}
>
> E.g. in ExecShutdownNode_walker we use planstate_tree_walker(), but then
> also
> have special handling for a few node types. Do we need something like that
> here too? It probably is ok, but it's worth explicitly checking and
> adding a
> comment.
>
I went through and double checked - in my understanding we don't need to
special case these node types, and I couldn't see any cases where
intermediary nodes would be removed in shutdown either (i.e.
causing dropped stats because a node was removed before its associated stats
were added to the parent).
I added a comment to make it clear this must run after ExecShutdownNode.
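The accumulation pass can be pictured with the following standalone sketch. It assumes a post-order walk (children first, then the node itself), so that a leaf's counters reach the root through every intermediate node; DemoNode and demo_accum are hypothetical and much simpler than PlanState and the real walker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plan node with at most two children. */
typedef struct DemoNode
{
	struct DemoNode *parent;
	struct DemoNode *children[2];
	int64_t		blks;			/* activity this node recorded for itself */
} DemoNode;

/* Post-order walk: finish the children, then fold this node into its parent. */
void
demo_accum(DemoNode *node)
{
	for (int i = 0; i < 2; i++)
	{
		if (node->children[i] != NULL)
			demo_accum(node->children[i]);
	}

	if (node->parent != NULL)
		node->parent->blks += node->blks;
}

/* Builds root -> mid -> leaf and returns the root total after the walk. */
int64_t
demo_accum_example(void)
{
	DemoNode	root = {NULL, {NULL, NULL}, 1};
	DemoNode	mid = {&root, {NULL, NULL}, 2};
	DemoNode	leaf = {&mid, {NULL, NULL}, 4};

	root.children[0] = &mid;
	mid.children[0] = &leaf;

	demo_accum(&root);
	return root.blks;
}
```

If a node were removed from the tree before this pass ran, its counters would never reach the parent, which is the failure mode checked for above.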
> > +/*
> > + * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
> > + */
> > +static void ResOwnerReleaseInstrumentation(Datum res);
> > +static const ResourceOwnerDesc instrumentation_resowner_desc =
> > +{
> > + .name = "instrumentation",
> > + .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
> > + .release_priority = RELEASE_PRIO_FIRST,
> > + .ReleaseResource = ResOwnerReleaseInstrumentation,
> > + .DebugPrint = NULL, /* default message is fine
> */
> > +};
>
> Is there a reason to do the release here before the lock release? And why
> _FIRST?
>
Adjusted to have its own phase, and after lock release.
> > + if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
> > + pgInstrStack = instr->stack.previous;
>
> Hm - this is effectively O(stack-depth^2), right? It's probably fine,
> given
> that we have fairly limited nesting (explain + pg_stat_statements +
> auto_explain is probably the current max), but seems worth noting in a
> comment?
>
Yeah, I added a comment - I don't see a case where this is a bottleneck
today, given the limited nesting of stacks that use resource owners.
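For reference, the cost comes from walking the chain of previous entries on every release. A standalone sketch of that check, under the assumption that StackIsParent(a, b) means "a is reachable from b via previous links" (demo_is_ancestor and demo_release are hypothetical names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stack entry, reduced to its link. */
typedef struct DemoStack
{
	struct DemoStack *previous;
} DemoStack;

/* True if "candidate" is a strict ancestor of "entry": O(depth) per call. */
bool
demo_is_ancestor(const DemoStack *candidate, const DemoStack *entry)
{
	for (const DemoStack *s = entry->previous; s != NULL; s = s->previous)
	{
		if (s == candidate)
			return true;
	}
	return false;
}

/*
 * Release "entry" during abort and return the new active entry. If an
 * earlier release already moved the active entry above "entry", keep it.
 */
DemoStack *
demo_release(DemoStack *current, DemoStack *entry)
{
	if (current != NULL && !demo_is_ancestor(current, entry))
		return entry->previous;
	return current;
}
```

With d nested entries each release may walk up to d links, hence the quadratic bound; with three or four levels of nesting this stays negligible.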
> > + /*
> > + * Always accumulate all collected stats before the abort, even if
> we
> > + * already walked up the stack with an earlier resource.
> > + */
> > + if (pgInstrStack)
> > + InstrStackAdd(pgInstrStack, &instr->stack);
>
> Why are we accumulating stats in case of errors? It's probably fine, but
> doing
> less as part of cleanup is preferable...
>
In my understanding, we need to do this in case of functions called in a
query that catch a rollback/error, since we'd otherwise not account for
that function's activity as part of the top-level query.
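A standalone sketch of that situation, using setjmp/longjmp as a stand-in for PG_TRY and a thrown error. All names are hypothetical; the only point is that the work done before the error still ends up in the top-level entry.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stack entry with one WAL counter. */
typedef struct DemoStack
{
	struct DemoStack *previous;
	int64_t		wal_records;
} DemoStack;

DemoStack  *demo_active = NULL;
jmp_buf		demo_catch;

/* Abort-time cleanup: unwind to the previous entry, but keep the counts. */
void
demo_release_on_abort(DemoStack *entry)
{
	demo_active = entry->previous;
	if (demo_active != NULL)
		demo_active->wal_records += entry->wal_records;
}

/* Stand-in for a function body that writes WAL and then raises an error. */
void
demo_failing_function(DemoStack *entry)
{
	entry->previous = demo_active;
	demo_active = entry;
	demo_active->wal_records += 5;
	longjmp(demo_catch, 1);		/* the "error" */
}

/* Top-level query that calls the function and catches its error. */
int64_t
demo_toplevel_after_caught_error(void)
{
	static DemoStack top;
	static DemoStack func;

	top.previous = NULL;
	top.wal_records = 0;
	func.previous = NULL;
	func.wal_records = 0;

	demo_active = &top;
	if (setjmp(demo_catch) == 0)
		demo_failing_function(&func);
	else
		demo_release_on_abort(&func);	/* error was caught */

	demo_active = NULL;
	return top.wal_records;
}
```

Without the addition in demo_release_on_abort, the five records written by the failing function would be missing from the top-level total.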
> /* General purpose instrumentation handling */
> > Instrumentation *
> > InstrAlloc(int n, int instrument_options)
> > {
> ...
> > + if (need_buffers || need_wal)
> > + instr = MemoryContextAllocZero(CurTransactionContext, n *
> sizeof(Instrumentation));
> > + else
> > + instr = palloc0(n * sizeof(Instrumentation));
>
> Is this long-lived enough? I'm e.g. wondering about utility statements that
> internally start transactions, wouldn't that cause problems with a user
> like pgss tracking something like CIC?
>
I ended up refactoring this a bit, since it seemed useful to do an explicit
pfree at InstrStop or when aborting, both to avoid leaks, and to
theoretically support using TopMemoryContext here.
That said, from my testing I think CurTransactionContext is sufficient,
because we just need something that lives long enough during resource owner
abort situations (e.g. per-query context doesn't work, since the abort
frees it before we do our resource owner handling).
The pgss+CIC case isn't relevant here (I think), because utility statements
don't use the resource owner mechanism at all (with the exception of
EXPLAIN, which calls into ExecutorStart); instead we use a PG_TRY/PG_FINALLY
in pg_stat_statements to pop the stack.
> > +void
> > +InstrStackAdd(InstrStack * dst, InstrStack * add)
> > +{
> > + Assert(dst != NULL);
> > + Assert(add != NULL);
> > +
> > + BufferUsageAdd(&dst->bufusage, &add->bufusage);
> > + WalUsageAdd(&dst->walusage, &add->walusage);
> > }
> >
>
> Do we want to do BufferUsageAdd() etc even if we are not tracking buffer
> usage? Those operations aren't cheap...
>
I briefly considered whether we could add this to the InstrStack itself
(i.e. whether we actually care about buffer, WAL usage or both), but I
think where it gets messy is that we can have indirect requirements to
track this - you might have pg_stat_statements capturing both, but e.g. the
utility statement being executed only caring about emitting WAL usage in
the log.
I'm also not familiar with an in-core use case today where only WAL (but
not buffers) is needed, short of doing something like "EXPLAIN (ANALYZE,
BUFFERS OFF, WAL ON)" without having pg_stat_statements or similar enabled.
Do you have a specific example where this could help?
> /* note current values during parallel executor startup */
> > @@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage,
> WalUsage *walusage)
> > void
> > InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> > {
> > + if (pgInstrStack != NULL)
> > + {
> > + InstrStack *dst = pgInstrStack;
> > +
> > + BufferUsageAdd(&dst->bufusage, bufusage);
> > + WalUsageAdd(&dst->walusage, walusage);
> > + }
> > +
> > BufferUsageAdd(&pgBufferUsage, bufusage);
> > WalUsageAdd(&pgWalUsage, walusage);
> > }
>
> Is the pgInstrStack == NULL case actually reachable?
>
In my reading of the code, it's necessary because we unconditionally track
WAL/buffer usage in parallel workers, even if the leader doesn't actually
need it.
We could be smarter about this (i.e. tell the workers not to collect the
information in the first place), but for now it seemed easiest to just
discard it.
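A standalone sketch of the NULL check in the leader, with hypothetical types and names and a single counter: the worker's numbers always go to the global counters, and to the stack only when an entry is active.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced usage struct and stack entry. */
typedef struct DemoUsage
{
	int64_t		shared_blks_hit;
} DemoUsage;

typedef struct DemoStack
{
	DemoUsage	bufusage;
} DemoStack;

DemoUsage	demo_global;		/* plays the role of pgBufferUsage */
DemoStack  *demo_active = NULL;	/* plays the role of pgInstrStack */

/* Leader side: fold one worker's usage into the leader's counters. */
void
demo_accum_parallel(const DemoUsage *worker)
{
	/* No active entry means nobody asked for per-stack data: skip it. */
	if (demo_active != NULL)
		demo_active->bufusage.shared_blks_hit += worker->shared_blks_hit;

	demo_global.shared_blks_hit += worker->shared_blks_hit;
}
```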
Thanks,
Lukas
--
Lukas Fittl
Attachments:
[application/octet-stream] v4-0003-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch (9.1K, 3-v4-0003-Replace-direct-changes-of-pgBufferUsage-pgWalUsag.patch)
From 74f44adc505a436a65d6069b286c8a878d4fe4af Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 31 Aug 2025 16:34:42 -0700
Subject: [PATCH v4 3/7] Replace direct changes of pgBufferUsage/pgWalUsage
with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
---
src/backend/access/transam/xlog.c | 10 +++++-----
src/backend/storage/buffer/bufmgr.c | 24 ++++++++++++------------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
6 files changed, 47 insertions(+), 28 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fd91bcd68ec..8b3697d8820 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1079,10 +1079,10 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
- pgWalUsage.wal_fpi_bytes += fpi_bytes;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
+ INSTR_WALUSAGE_ADD(wal_fpi_bytes, fpi_bytes);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2062,7 +2062,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = 0;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e8544acb784..92cb4ea5645 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -706,7 +706,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -727,7 +727,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
if (BufferTagsEqual(&tag, &bufHdr->tag))
{
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
UnpinBuffer(bufHdr);
@@ -1128,14 +1128,14 @@ PinBufferForBlock(Relation rel,
{
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
if (*foundPtr)
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
}
else
{
bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
strategy, foundPtr, io_context);
if (*foundPtr)
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
}
if (rel)
{
@@ -1869,9 +1869,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
if (operation->rel)
pgstat_count_buffer_hit(operation->rel);
@@ -1939,9 +1939,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it.
@@ -2819,7 +2819,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -2975,7 +2975,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4394,7 +4394,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
@@ -5557,7 +5557,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
if (dirtied)
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 15aac7d1c9f..4481920ea5f 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -217,7 +217,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -478,7 +478,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -509,7 +509,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u32(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 366d70d38a1..9d39df998cb 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -474,13 +474,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -548,13 +548,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 13ae57ed649..4f6274eb573 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 4986f6cea54..8e435e1f92c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -161,4 +161,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += val; \
+ } while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += val; \
+ } while(0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
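The macro indirection used by this patch can be tried out in isolation with the sketch below. The DEMO_* macros and demoBufferUsage are hypothetical copies of the pattern, not the patch's definitions; the sketch also parenthesizes the value argument, which the patch's macros do not.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced version of BufferUsage. */
typedef struct DemoBufferUsage
{
	int64_t		shared_blks_hit;
	int64_t		local_blks_hit;
	int64_t		shared_blks_read;
} DemoBufferUsage;

DemoBufferUsage demoBufferUsage;

/* Call sites name only the field; where it lives is decided here. */
#define DEMO_BUFUSAGE_INCR(fld) do { \
		demoBufferUsage.fld++; \
	} while (0)
#define DEMO_BUFUSAGE_ADD(fld, val) do { \
		demoBufferUsage.fld += (val); \
	} while (0)

void
demo_count_hit(int is_temp)
{
	/* do { } while (0) keeps each macro a single statement in if/else */
	if (is_temp)
		DEMO_BUFUSAGE_INCR(local_blks_hit);
	else
		DEMO_BUFUSAGE_INCR(shared_blks_hit);
}

void
demo_count_reads(int nblocks)
{
	DEMO_BUFUSAGE_ADD(shared_blks_read, nblocks);
}
```

Because callers no longer touch the global directly, a later change can redirect the increments (for example to an active stack entry) by editing the macro bodies only.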
[application/octet-stream] v4-0001-Instrumentation-Keep-time-fields-as-instrtime-req.patch (8.1K, 4-v4-0001-Instrumentation-Keep-time-fields-as-instrtime-req.patch)
From 1e8e2d54997602cfc289d8eb850b2a8e9745040c Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 31 Aug 2025 16:37:05 -0700
Subject: [PATCH v4 1/7] Instrumentation: Keep time fields as instrtime,
require caller to convert
Previously the Instrumentation logic always converted to seconds, only for many
of the callers to do unnecessary division to get to milliseconds. Since an upcoming
refactoring will split the Instrumentation struct, utilize instrtime always to
keep things simpler.
---
contrib/auto_explain/auto_explain.c | 2 +-
.../pg_stat_statements/pg_stat_statements.c | 2 +-
src/backend/commands/explain.c | 12 +++++------
src/backend/executor/instrument.c | 20 ++++++++-----------
src/include/executor/instrument.h | 6 +++---
src/include/portability/instr_time.h | 2 ++
6 files changed, 21 insertions(+), 23 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 1f4badb4928..c10f2fc0f25 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -388,7 +388,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
InstrEndLoop(queryDesc->totaltime);
/* Log plan if duration is exceeded. */
- msec = queryDesc->totaltime->total * 1000.0;
+ msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
{
ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 39208f80b5b..47de4c98ae3 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1093,7 +1093,7 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
queryDesc->plannedstmt->stmt_location,
queryDesc->plannedstmt->stmt_len,
PGSS_EXEC,
- queryDesc->totaltime->total * 1000.0, /* convert to msec */
+ INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
queryDesc->estate->es_total_processed,
&queryDesc->totaltime->bufusage,
&queryDesc->totaltime->walusage,
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 7e699f8595e..ba34cce26eb 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1136,7 +1136,7 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- 1000.0 * instr->total, instr->ntuples);
+ INSTR_TIME_GET_MILLISEC(instr->total), instr->ntuples);
else
appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
}
@@ -1147,7 +1147,7 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Constraint Name", conname, es);
ExplainPropertyText("Relation", relname, es);
if (es->timing)
- ExplainPropertyFloat("Time", "ms", 1000.0 * instr->total, 3,
+ ExplainPropertyFloat("Time", "ms", INSTR_TIME_GET_MILLISEC(instr->total), 3,
es);
ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
}
@@ -1835,8 +1835,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
planstate->instrument && planstate->instrument->nloops > 0)
{
double nloops = planstate->instrument->nloops;
- double startup_ms = 1000.0 * planstate->instrument->startup / nloops;
- double total_ms = 1000.0 * planstate->instrument->total / nloops;
+ double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
+ double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
double rows = planstate->instrument->ntuples / nloops;
if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1901,8 +1901,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (nloops <= 0)
continue;
- startup_ms = 1000.0 * instrument->startup / nloops;
- total_ms = 1000.0 * instrument->total / nloops;
+ startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
+ total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
rows = instrument->ntuples / nloops;
ExplainOpenWorker(n, es);
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 9e11c662a7c..20653a5c4c4 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -114,7 +114,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (!instr->running)
{
instr->running = true;
- instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
+ instr->firsttuple = instr->counter;
}
else
{
@@ -123,7 +123,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
* this might be the first tuple
*/
if (instr->async_mode && save_tuplecount < 1.0)
- instr->firsttuple = INSTR_TIME_GET_DOUBLE(instr->counter);
+ instr->firsttuple = instr->counter;
}
}
@@ -139,8 +139,6 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
void
InstrEndLoop(Instrumentation *instr)
{
- double totaltime;
-
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
return;
@@ -149,10 +147,8 @@ InstrEndLoop(Instrumentation *instr)
elog(ERROR, "InstrEndLoop called on running node");
/* Accumulate per-cycle statistics into totals */
- totaltime = INSTR_TIME_GET_DOUBLE(instr->counter);
-
- instr->startup += instr->firsttuple;
- instr->total += totaltime;
+ INSTR_TIME_ADD(instr->startup, instr->firsttuple);
+ INSTR_TIME_ADD(instr->total, instr->counter);
instr->ntuples += instr->tuplecount;
instr->nloops += 1;
@@ -160,7 +156,7 @@ InstrEndLoop(Instrumentation *instr)
instr->running = false;
INSTR_TIME_SET_ZERO(instr->starttime);
INSTR_TIME_SET_ZERO(instr->counter);
- instr->firsttuple = 0;
+ INSTR_TIME_SET_ZERO(instr->firsttuple);
instr->tuplecount = 0;
}
@@ -173,14 +169,14 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->running = true;
dst->firsttuple = add->firsttuple;
}
- else if (dst->running && add->running && dst->firsttuple > add->firsttuple)
+ else if (dst->running && add->running && INSTR_TIME_LT(add->firsttuple, dst->firsttuple))
dst->firsttuple = add->firsttuple;
INSTR_TIME_ADD(dst->counter, add->counter);
dst->tuplecount += add->tuplecount;
- dst->startup += add->startup;
- dst->total += add->total;
+ INSTR_TIME_ADD(dst->startup, add->startup);
+ INSTR_TIME_ADD(dst->total, add->total);
dst->ntuples += add->ntuples;
dst->ntuples2 += add->ntuples2;
dst->nloops += add->nloops;
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index ffe470f2b84..dfc8b3c9765 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -78,13 +78,13 @@ typedef struct Instrumentation
bool running; /* true if we've completed first tuple */
instr_time starttime; /* start time of current iteration of node */
instr_time counter; /* accumulated runtime for this node */
- double firsttuple; /* time for first tuple of this cycle */
+ instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
BufferUsage bufusage_start; /* buffer usage at start */
WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
- double startup; /* total startup time (in seconds) */
- double total; /* total time (in seconds) */
+ instr_time startup; /* total startup time */
+ instr_time total; /* total time */
double ntuples; /* total tuples produced */
double ntuples2; /* secondary node-specific tuple counter */
double nloops; /* # of run cycles for this node */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index f71a851b18d..1c1c18f780a 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -184,6 +184,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
+#define INSTR_TIME_LT(x,y) \
+ ((x).ticks < (y).ticks)
#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
--
2.47.1
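The idea of this patch (accumulate in the native tick unit, convert once at the point of use) can be sketched standalone as follows. DemoInstrTime and the DEMO_* macros are hypothetical; the sketch assumes one tick equals one nanosecond, which is not true of every instr_time implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for instr_time; assumes 1 tick == 1 nanosecond. */
typedef struct DemoInstrTime
{
	int64_t		ticks;
} DemoInstrTime;

#define DEMO_TIME_ADD(x, y) \
	((x).ticks += (y).ticks)
#define DEMO_TIME_GET_MILLISEC(t) \
	((double) (t).ticks / 1000000.0)

/* Sum per-cycle durations as integers, convert to milliseconds once. */
double
demo_total_ms(const DemoInstrTime *cycles, int ncycles)
{
	DemoInstrTime total = {0};

	for (int i = 0; i < ncycles; i++)
		DEMO_TIME_ADD(total, cycles[i]);

	return DEMO_TIME_GET_MILLISEC(total);
}
```

Callers that want milliseconds then need a single conversion, rather than a conversion to seconds followed by a multiplication by 1000.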
[application/octet-stream] v4-0005-Use-Instrumentation-stack-for-parallel-query-aggr.patch (9.4K, 5-v4-0005-Use-Instrumentation-stack-for-parallel-query-aggr.patch)
From cd43ed5f81d929ed21b8d3b015b3625abdfdeeba Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:24:49 -0700
Subject: [PATCH v4 5/7] Use Instrumentation stack for parallel query
aggregation in workers
---
src/backend/access/brin/brin.c | 6 ++++--
src/backend/access/gin/gininsert.c | 6 ++++--
src/backend/access/nbtree/nbtsort.c | 6 ++++--
src/backend/commands/vacuumparallel.c | 6 ++++--
src/backend/executor/execParallel.c | 6 ++++--
src/backend/executor/instrument.c | 19 +++++++++----------
src/include/executor/instrument.h | 4 ++--
7 files changed, 31 insertions(+), 22 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2f7d1437919..a36606eed0e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2870,6 +2870,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ Instrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2919,7 +2920,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2934,7 +2935,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 3d71b442aa9..b454934c109 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -2083,6 +2083,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ Instrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2151,7 +2152,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2166,7 +2167,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index e90964080ca..717d6c1a11f 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1750,6 +1750,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ Instrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1825,7 +1826,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1835,7 +1836,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 0feea1d30ec..c5309a015e6 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -994,6 +994,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ Instrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1083,7 +1084,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,7 +1092,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index e87810d292e..061c6a4aa69 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1434,6 +1434,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ Instrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1494,7 +1495,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1510,7 +1511,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 37055d01f61..02c33b7dead 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -20,10 +20,8 @@
#include "utils/resowner.h"
BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
InstrStack *pgInstrStack = NULL;
-static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -400,21 +398,22 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
}
/* start instrumentation during parallel executor startup */
-void
+Instrumentation *
InstrStartParallelQuery(void)
{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
+ Instrumentation *instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrStart(instr);
+ return instr;
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage)
{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
+ InstrStop(instr, true);
+ memcpy(bufusage, &instr->stack->bufusage, sizeof(BufferUsage));
+ memcpy(walusage, &instr->stack->walusage, sizeof(WalUsage));
}
/* accumulate work done by workers in leader's stats */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 30d81fceaaa..adcbc75a757 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -186,8 +186,8 @@ extern void InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation * instr);
extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern Instrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
--
2.47.1
[application/octet-stream] v4-0002-Separate-node-and-trigger-instrumentation-from-ot.patch (25.7K, 6-v4-0002-Separate-node-and-trigger-instrumentation-from-ot.patch)
From 324666b35d4513676783f0c352ad3a27371c08d8 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v4 2/7] Separate node and trigger instrumentation from other
use of Instrumentation struct
Previously, several places (e.g. query "total time") repurposed the
Instrumentation struct, which was originally introduced to capture per-node
statistics during execution. This overuse of one struct is confusing (it
scatters InstrStartNode/InstrStopNode calls across unrelated code paths)
and prevents future refactorings.
Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.
Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information, and introduce TriggerInstrumentation to
capture trigger timing and firings (previously counted in "ntuples").
---
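To illustrate the split described above, here is a minimal, self-contained sketch, not the patch's actual code. It assumes a single counter in place of BufferUsage/WalUsage, and every name prefixed with Demo/demo_ is hypothetical. It shows a general-purpose struct that snapshots a global counter at start and accumulates the difference at stop, and a trigger-specific wrapper that embeds it and counts firings as an integer.

```c
#include <assert.h>

/* Hypothetical stand-in for BufferUsage/WalUsage: a single counter. */
typedef struct DemoUsage
{
    long        blks_hit;
} DemoUsage;

/* Hypothetical stand-in for the global, ever-increasing usage counters. */
static DemoUsage demo_global_usage;

/* General-purpose instrumentation: start snapshot plus accumulated total. */
typedef struct DemoInstr
{
    DemoUsage   usage_start;    /* global counter value at last start */
    DemoUsage   usage;          /* accumulated usage over all start/stop pairs */
} DemoInstr;

/* Trigger instrumentation embeds the general struct and adds a counter. */
typedef struct DemoTriggerInstr
{
    DemoInstr   instr;
    int         firings;        /* number of times the trigger fired */
} DemoTriggerInstr;

static void
demo_instr_start(DemoInstr *instr)
{
    /* remember where the global counter stood when measurement began */
    instr->usage_start = demo_global_usage;
}

static void
demo_instr_stop(DemoInstr *instr)
{
    /* add the delta since start to the accumulated total */
    instr->usage.blks_hit +=
        demo_global_usage.blks_hit - instr->usage_start.blks_hit;
}

static void
demo_trigger_start(DemoTriggerInstr *tginstr)
{
    demo_instr_start(&tginstr->instr);
}

static void
demo_trigger_stop(DemoTriggerInstr *tginstr, int firings)
{
    demo_instr_stop(&tginstr->instr);
    tginstr->firings += firings;
}
```

Two start/stop pairs on one DemoTriggerInstr, with activity only in between, leave firings at 2 and the accumulated usage equal to the sum of the two deltas. Activity outside the pairs is not charged to the trigger.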
contrib/auto_explain/auto_explain.c | 8 +-
.../pg_stat_statements/pg_stat_statements.c | 8 +-
contrib/postgres_fdw/postgres_fdw.c | 2 +-
src/backend/commands/explain.c | 21 ++-
src/backend/commands/trigger.c | 22 ++--
src/backend/executor/execMain.c | 10 +-
src/backend/executor/execParallel.c | 22 ++--
src/backend/executor/execProcnode.c | 4 +-
src/backend/executor/instrument.c | 120 ++++++++++++++++--
src/include/executor/instrument.h | 62 +++++++--
src/include/nodes/execnodes.h | 5 +-
11 files changed, 209 insertions(+), 75 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index c10f2fc0f25..ee0c3b4c91b 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 47de4c98ae3..7f56592f536 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1023,7 +1023,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -1082,12 +1082,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
if (queryId != INT64CONST(0) && queryDesc->totaltime &&
pgss_enabled(nesting_level))
{
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
pgss_store(queryDesc->sourceText,
queryId,
queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 456b267f70b..7619ac486c0 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2778,7 +2778,7 @@ postgresIterateDirectModify(ForeignScanState *node)
if (!resultRelInfo->ri_projectReturning)
{
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- Instrumentation *instr = node->ss.ps.instrument;
+ NodeInstrumentation *instr = node->ss.ps.instrument;
Assert(!dmstate->has_returning);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ba34cce26eb..fee782d1c55 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1099,18 +1099,15 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
for (nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
{
Trigger *trig = rInfo->ri_TrigDesc->triggers + nt;
- Instrumentation *instr = rInfo->ri_TrigInstrument + nt;
+ TriggerInstrumentation *tginstr = rInfo->ri_TrigInstrument + nt;
char *relname;
char *conname = NULL;
- /* Must clean up instrumentation state */
- InstrEndLoop(instr);
-
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
*/
- if (instr->ntuples == 0)
+ if (tginstr->firings == 0)
continue;
ExplainOpenGroup("Trigger", NULL, true, es);
@@ -1135,10 +1132,10 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
if (show_relname)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
- appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- INSTR_TIME_GET_MILLISEC(instr->total), instr->ntuples);
+ appendStringInfo(es->str, ": time=%.3f calls=%d\n",
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total), tginstr->firings);
else
- appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
+ appendStringInfo(es->str, ": calls=%d\n", tginstr->firings);
}
else
{
@@ -1147,9 +1144,9 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Constraint Name", conname, es);
ExplainPropertyText("Relation", relname, es);
if (es->timing)
- ExplainPropertyFloat("Time", "ms", INSTR_TIME_GET_MILLISEC(instr->total), 3,
+ ExplainPropertyFloat("Time", "ms", INSTR_TIME_GET_MILLISEC(tginstr->instr.total), 3,
es);
- ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
+ ExplainPropertyInteger("Calls", NULL, tginstr->firings, es);
}
if (conname)
@@ -1893,7 +1890,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
double startup_ms;
double total_ms;
@@ -2300,7 +2297,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
if (nloops <= 0)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 579ac8d76ae..19d49eacafb 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -90,7 +90,7 @@ static bool TriggerEnabled(EState *estate, ResultRelInfo *relinfo,
static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation * instr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2309,7 +2309,7 @@ static HeapTuple
ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation * instr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2344,7 +2344,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2389,10 +2389,10 @@ ExecCallTriggerFunc(TriggerData *trigdata,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
return (HeapTuple) DatumGetPointer(result);
}
@@ -3947,7 +3947,7 @@ static void AfterTriggerExecute(EState *estate,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation * instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2);
@@ -4340,7 +4340,7 @@ AfterTriggerExecute(EState *estate,
ResultRelInfo *src_relInfo,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
- FmgrInfo *finfo, Instrumentation *instr,
+ FmgrInfo *finfo, TriggerInstrumentation * instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2)
@@ -4381,7 +4381,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4604,10 +4604,10 @@ AfterTriggerExecute(EState *estate,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
}
@@ -4723,7 +4723,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
Relation rel = NULL;
TriggerDesc *trigdesc = NULL;
FmgrInfo *finfo = NULL;
- Instrumentation *instr = NULL;
+ TriggerInstrumentation *instr = NULL;
TupleTableSlot *slot1 = NULL,
*slot2 = NULL;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 27c9eec697b..a97977d988a 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -331,7 +331,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/*
* extract information from the query descriptor and the query feature.
@@ -383,7 +383,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -433,7 +433,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -443,7 +443,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
AfterTriggerEndQuery(estate);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
@@ -1266,7 +1266,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0(n * sizeof(ExprState *));
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
}
else
{
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f098a5557cf..e87810d292e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -85,7 +85,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object. This will depend on the length of
+ * of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -102,11 +102,15 @@ struct SharedExecutorInstrumentation
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
- /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+ /*
+ * array of num_plan_nodes * num_workers NodeInstrumentation objects
+ * follows
+ */
};
#define GetInstrumentationArray(sei) \
(AssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
- (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+ (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
@@ -713,7 +717,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len = MAXALIGN(instrumentation_len);
instrument_offset = instrumentation_len;
instrumentation_len +=
- mul_size(sizeof(Instrumentation),
+ mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -799,7 +803,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*/
if (estate->es_instrument)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -809,7 +813,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInit(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1036,7 +1040,7 @@ static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
int n;
int ibytes;
@@ -1064,7 +1068,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* Switch into per-query memory context.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
planstate->worker_instrument =
palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
MemoryContextSwitchTo(oldcontext);
@@ -1296,7 +1300,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
{
int i;
int plan_node_id = planstate->plan->plan_node_id;
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
InstrEndLoop(planstate->instrument);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f5f9cfbeead..d286471254b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -413,8 +413,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument,
- result->async_capable);
+ result->instrument = InstrAllocNode(1, estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 20653a5c4c4..41a342cab7f 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,9 +26,9 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int n, int instrument_options)
{
Instrumentation *instr;
@@ -41,6 +41,108 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
int i;
+ for (i = 0; i < n; i++)
+ {
+ instr[i].need_bufusage = need_buffers;
+ instr[i].need_walusage = need_wal;
+ instr[i].need_timer = need_timer;
+ }
+ }
+
+ return instr;
+}
+
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer &&
+ !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
+ elog(ERROR, "InstrStart called twice in a row");
+
+ if (instr->need_bufusage)
+ instr->bufusage_start = pgBufferUsage;
+
+ if (instr->need_walusage)
+ instr->walusage_start = pgWalUsage;
+}
+
+void
+InstrStop(Instrumentation *instr)
+{
+ instr_time endtime;
+
+ /* let's update the time only if the timer was requested */
+ if (instr->need_timer)
+ {
+ if (INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStop called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
+ }
+
+ /* Add delta of buffer usage since entry to node's totals */
+ if (instr->need_bufusage)
+ BufferUsageAccumDiff(&instr->bufusage,
+ &pgBufferUsage, &instr->bufusage_start);
+
+ if (instr->need_walusage)
+ WalUsageAccumDiff(&instr->walusage,
+ &pgWalUsage, &instr->walusage_start);
+}
+
+/* Trigger instrumentation handling */
+TriggerInstrumentation *
+InstrAllocTrigger(int n, int instrument_options)
+{
+ TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+ bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
+ bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+ int i;
+
+ for (i = 0; i < n; i++)
+ {
+ tginstr[i].instr.need_bufusage = need_buffers;
+ tginstr[i].instr.need_walusage = need_wal;
+ tginstr[i].instr.need_timer = need_timer;
+ }
+
+ return tginstr;
+}
+
+void
+InstrStartTrigger(TriggerInstrumentation * tginstr)
+{
+ InstrStart(&tginstr->instr);
+}
+
+void
+InstrStopTrigger(TriggerInstrumentation * tginstr, int firings)
+{
+ InstrStop(&tginstr->instr);
+ tginstr->firings += firings;
+}
+
+/* Node instrumentation handling */
+
+/* Allocate new node instrumentation structure(s) */
+NodeInstrumentation *
+InstrAllocNode(int n, int instrument_options, bool async_mode)
+{
+ NodeInstrumentation *instr;
+
+ /* initialize all fields to zeroes, then modify as needed */
+ instr = palloc0(n * sizeof(NodeInstrumentation));
+ if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
+ {
+ bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+ bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
+ bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+ int i;
+
for (i = 0; i < n; i++)
{
instr[i].need_bufusage = need_buffers;
@@ -55,9 +157,9 @@ InstrAlloc(int n, int instrument_options, bool async_mode)
/* Initialize a pre-allocated instrumentation structure. */
void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitNode(NodeInstrumentation * instr, int instrument_options)
{
- memset(instr, 0, sizeof(Instrumentation));
+ memset(instr, 0, sizeof(NodeInstrumentation));
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
@@ -65,7 +167,7 @@ InstrInit(Instrumentation *instr, int instrument_options)
/* Entry to a plan node */
void
-InstrStartNode(Instrumentation *instr)
+InstrStartNode(NodeInstrumentation * instr)
{
if (instr->need_timer &&
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
@@ -81,7 +183,7 @@ InstrStartNode(Instrumentation *instr)
/* Exit from a plan node */
void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStopNode(NodeInstrumentation * instr, double nTuples)
{
double save_tuplecount = instr->tuplecount;
instr_time endtime;
@@ -129,7 +231,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
/* Update tuple count */
void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples)
{
/* count the returned tuples */
instr->tuplecount += nTuples;
@@ -137,7 +239,7 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation * instr)
{
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
@@ -162,7 +264,7 @@ InstrEndLoop(Instrumentation *instr)
/* aggregate instrumentation information */
void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
{
if (!dst->running && add->running)
{
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index dfc8b3c9765..4986f6cea54 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -67,7 +67,40 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/*
+ * General purpose instrumentation that can capture time and WAL/buffer usage
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured in between).
+ */
typedef struct Instrumentation
+{
+ /* Parameters set at creation: */
+ bool need_timer; /* true if we need timer data */
+ bool need_bufusage; /* true if we need buffer usage data */
+ bool need_walusage; /* true if we need WAL usage data */
+ /* Internal state keeping: */
+ instr_time starttime; /* start time of last InstrStart */
+ BufferUsage bufusage_start; /* buffer usage at start */
+ WalUsage walusage_start; /* WAL usage at start */
+ /* Accumulated statistics: */
+ instr_time total; /* total runtime */
+ BufferUsage bufusage; /* total buffer usage */
+ WalUsage walusage; /* total WAL usage */
+} Instrumentation;
+
+/* Trigger instrumentation */
+typedef struct TriggerInstrumentation
+{
+ Instrumentation instr;
+ int firings; /* number of times the instrumented trigger
+ * was fired */
+} TriggerInstrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
{
/* Parameters set at node creation: */
bool need_timer; /* true if we need timer data */
@@ -92,25 +125,34 @@ typedef struct Instrumentation
double nfiltered2; /* # of tuples removed by "other" quals */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
typedef struct WorkerInstrumentation
{
int num_workers; /* # of structures that follow */
- Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+ NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
- bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int n, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr);
+
+extern TriggerInstrumentation * InstrAllocTrigger(int n, int instrument_options);
+extern void InstrStartTrigger(TriggerInstrumentation * tginstr);
+extern void InstrStopTrigger(TriggerInstrumentation * tginstr, int firings);
+
+extern NodeInstrumentation * InstrAllocNode(int n, int instrument_options,
+ bool async_mode);
+extern void InstrInitNode(NodeInstrumentation * instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation * instr);
+extern void InstrStopNode(NodeInstrumentation * instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation * instr);
+extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
+
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 18ae8f0d4bb..6d53456ddc8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -521,7 +521,7 @@ typedef struct ResultRelInfo
ExprState **ri_TrigWhenExprs;
/* optional runtime measurements for triggers */
- Instrumentation *ri_TrigInstrument;
+ TriggerInstrumentation *ri_TrigInstrument;
/* On-demand created slots for triggers / returning processing */
TupleTableSlot *ri_ReturningSlot; /* for trigger output tuples */
@@ -1172,7 +1172,8 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
- Instrumentation *instrument; /* Optional runtime stats for this node */
+ NodeInstrumentation *instrument; /* Optional runtime stats for this
+ * node */
WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
/* Per-worker JIT instrumentation */
--
2.47.1
[application/octet-stream] v4-0004-Optimize-measuring-WAL-buffer-usage-through-stack.patch (29.6K, 7-v4-0004-Optimize-measuring-WAL-buffer-usage-through-stack.patch)
From e03c96cbd3079c03ae63b6427937b79edaa9562b Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v4 4/7] Optimize measuring WAL/buffer usage through
stack-based instrumentation
Previously, in order to determine the buffer/WAL usage of a given code section,
we utilized continuously incrementing global counters that get updated when the
actual activity (e.g. shared block read) occurred, and then calculated a diff when
the code section ended. This resulted in a bottleneck for executor node instrumentation
specifically, with the function BufferUsageAccumDiff showing up in profiles and
in some cases adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
Instead, introduce a stack-based mechanism, where the actual activity writes
into the current stack entry. In the case of executor nodes, this means that
each node gets its own stack entry that is pushed at InstrStartNode, and popped
at InstrStopNode. Stack entries are zero-initialized (avoiding the diff mechanism)
and get added to their parent entry when they are finalized, i.e. no more
modifications can occur.
To handle aborts correctly, any use of instrumentation stacks must go through
either a top-level Instrumentation struct with its associated InstrStart/
InstrStop helpers (which use resource owners to handle aborts), or dedicated
PG_TRY/PG_FINALLY blocks that ensure the stack is in a consistent state after
an abort.
---
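The push/pop scheme described above can be sketched in a few lines of standalone C. This is an illustration under stated assumptions, not the patch's implementation: the Demo/demo_ names are hypothetical, only two counters are tracked, and the second argument of demo_pop is assumed to mean "add this entry's totals to its parent", which the commit message describes but the excerpt does not spell out for InstrPopStack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stack entry holding two usage counters. */
typedef struct DemoStack
{
    struct DemoStack *previous; /* entry that was current before the push */
    long        shared_blks_hit;
    long        wal_records;
} DemoStack;

/* Hypothetical global pointing at the entry that receives activity. */
static DemoStack *demo_current = NULL;

/* Make a zero-initialized entry the current target for counters. */
static void
demo_push(DemoStack *entry)
{
    entry->previous = demo_current;
    demo_current = entry;
}

/* Restore the previous entry; optionally fold totals into the parent. */
static void
demo_pop(DemoStack *entry, int accumulate_to_parent)
{
    assert(demo_current == entry);  /* pushes and pops must be balanced */
    demo_current = entry->previous;

    if (accumulate_to_parent && demo_current != NULL)
    {
        demo_current->shared_blks_hit += entry->shared_blks_hit;
        demo_current->wal_records += entry->wal_records;
    }
}

/* Activity writes straight into the current entry; no diff is needed. */
static void
demo_count_buffer_hit(void)
{
    if (demo_current != NULL)
        demo_current->shared_blks_hit++;
}

static void
demo_count_wal_record(void)
{
    if (demo_current != NULL)
        demo_current->wal_records++;
}
```

With an outer and an inner entry, activity counted while the inner entry is current is charged to the inner entry only, and reaches the outer entry when the inner one is popped with accumulation enabled. The per-exit cost is then a few additions on the entry being popped rather than a full-struct diff against saved global counters.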
.../pg_stat_statements/pg_stat_statements.c | 105 ++++-----
src/backend/commands/explain.c | 8 +-
src/backend/executor/execMain.c | 28 ++-
src/backend/executor/execProcnode.c | 31 +++
src/backend/executor/instrument.c | 218 ++++++++++++++----
src/include/executor/executor.h | 1 +
src/include/executor/instrument.h | 64 ++++-
src/include/utils/resowner.h | 1 +
8 files changed, 333 insertions(+), 123 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 7f56592f536..1ed3660cf9b 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -911,21 +911,13 @@ pgss_planner(Query *parse,
{
instr_time start;
instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ InstrStack stack = {0};
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
-
- /*
- * Similarly the planner could write some WAL records in some cases
- * (e.g. setting a hint bit with those being WAL-logged)
- */
- walusage_start = pgWalUsage;
INSTR_TIME_SET_CURRENT(start);
+ /* We need to track buffer/WAL usage as the planner can access them. */
+ InstrPushStack(&stack);
+
nesting_level++;
PG_TRY();
{
@@ -938,6 +930,7 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrPopStack(&stack, true);
nesting_level--;
}
PG_END_TRY();
@@ -945,14 +938,6 @@ pgss_planner(Query *parse,
INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
@@ -960,8 +945,8 @@ pgss_planner(Query *parse,
PGSS_PLAN,
INSTR_TIME_GET_MILLISEC(duration),
0,
- &bufusage,
- &walusage,
+ &stack.bufusage,
+ &stack.walusage,
NULL,
NULL,
0,
@@ -1089,8 +1074,13 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
PGSS_EXEC,
INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
queryDesc->estate->es_total_processed,
- &queryDesc->totaltime->bufusage,
- &queryDesc->totaltime->walusage,
+
+ /*
+ * Check if stack is initialized - it is not when ExecutorRun wasn't
+ * called
+ */
+ queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->bufusage : NULL,
+ queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->walusage : NULL,
queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
NULL,
queryDesc->estate->es_parallel_workers_to_launch,
@@ -1157,14 +1147,10 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
instr_time start;
instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ InstrStack stack = {0};
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
INSTR_TIME_SET_CURRENT(start);
+ InstrPushStack(&stack);
nesting_level++;
PG_TRY();
@@ -1180,6 +1166,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrPopStack(&stack, true);
nesting_level--;
}
PG_END_TRY();
@@ -1208,14 +1195,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
@@ -1223,8 +1202,8 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
PGSS_EXEC,
INSTR_TIME_GET_MILLISEC(duration),
rows,
- &bufusage,
- &walusage,
+ &stack.bufusage,
+ &stack.walusage,
NULL,
NULL,
0,
@@ -1454,27 +1433,33 @@ pgss_store(const char *query, int64 queryId,
}
}
entry->counters.rows += rows;
- entry->counters.shared_blks_hit += bufusage->shared_blks_hit;
- entry->counters.shared_blks_read += bufusage->shared_blks_read;
- entry->counters.shared_blks_dirtied += bufusage->shared_blks_dirtied;
- entry->counters.shared_blks_written += bufusage->shared_blks_written;
- entry->counters.local_blks_hit += bufusage->local_blks_hit;
- entry->counters.local_blks_read += bufusage->local_blks_read;
- entry->counters.local_blks_dirtied += bufusage->local_blks_dirtied;
- entry->counters.local_blks_written += bufusage->local_blks_written;
- entry->counters.temp_blks_read += bufusage->temp_blks_read;
- entry->counters.temp_blks_written += bufusage->temp_blks_written;
- entry->counters.shared_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->shared_blk_read_time);
- entry->counters.shared_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->shared_blk_write_time);
- entry->counters.local_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->local_blk_read_time);
- entry->counters.local_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->local_blk_write_time);
- entry->counters.temp_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->temp_blk_read_time);
- entry->counters.temp_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->temp_blk_write_time);
+ if (bufusage)
+ {
+ entry->counters.shared_blks_hit += bufusage->shared_blks_hit;
+ entry->counters.shared_blks_read += bufusage->shared_blks_read;
+ entry->counters.shared_blks_dirtied += bufusage->shared_blks_dirtied;
+ entry->counters.shared_blks_written += bufusage->shared_blks_written;
+ entry->counters.local_blks_hit += bufusage->local_blks_hit;
+ entry->counters.local_blks_read += bufusage->local_blks_read;
+ entry->counters.local_blks_dirtied += bufusage->local_blks_dirtied;
+ entry->counters.local_blks_written += bufusage->local_blks_written;
+ entry->counters.temp_blks_read += bufusage->temp_blks_read;
+ entry->counters.temp_blks_written += bufusage->temp_blks_written;
+ entry->counters.shared_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->shared_blk_read_time);
+ entry->counters.shared_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->shared_blk_write_time);
+ entry->counters.local_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->local_blk_read_time);
+ entry->counters.local_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->local_blk_write_time);
+ entry->counters.temp_blk_read_time += INSTR_TIME_GET_MILLISEC(bufusage->temp_blk_read_time);
+ entry->counters.temp_blk_write_time += INSTR_TIME_GET_MILLISEC(bufusage->temp_blk_write_time);
+ }
entry->counters.usage += USAGE_EXEC(total_time);
- entry->counters.wal_records += walusage->wal_records;
- entry->counters.wal_fpi += walusage->wal_fpi;
- entry->counters.wal_bytes += walusage->wal_bytes;
- entry->counters.wal_buffers_full += walusage->wal_buffers_full;
+ if (walusage)
+ {
+ entry->counters.wal_records += walusage->wal_records;
+ entry->counters.wal_fpi += walusage->wal_fpi;
+ entry->counters.wal_bytes += walusage->wal_bytes;
+ entry->counters.wal_buffers_full += walusage->wal_buffers_full;
+ }
if (jitusage)
{
entry->counters.jit_functions += jitusage->created_functions;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index fee782d1c55..545185148ac 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -2286,9 +2286,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage);
+ show_buffer_usage(es, &planstate->instrument->stack.bufusage);
if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ show_wal_usage(es, &planstate->instrument->stack.walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -2305,9 +2305,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->bufusage);
+ show_buffer_usage(es, &instrument->stack.bufusage);
if (es->wal)
- show_wal_usage(es, &instrument->walusage);
+ show_wal_usage(es, &instrument->stack.walusage);
ExplainCloseWorker(n, es);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index a97977d988a..baf2bef17d6 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -329,6 +329,13 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+ /*
+ * Start up required top-level instrumentation stack for WAL/buffer
+ * tracking
+ */
+ if (!queryDesc->totaltime && (estate->es_instrument & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)))
+ queryDesc->totaltime = InstrAlloc(1, estate->es_instrument);
+
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
InstrStart(queryDesc->totaltime);
@@ -383,7 +390,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrStop(queryDesc->totaltime, false);
MemoryContextSwitchTo(oldcontext);
}
@@ -442,8 +449,15 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
+ /*
+ * Accumulate per-node statistics, and then shut down instrumentation
+ * stack
+ */
+ if (queryDesc->totaltime && estate->es_instrument)
+ ExecAccumNodeInstrumentation(queryDesc->planstate);
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrStop(queryDesc->totaltime, true);
MemoryContextSwitchTo(oldcontext);
@@ -1266,7 +1280,15 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0(n * sizeof(ExprState *));
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
+ {
+ /*
+ * Triggers do not individually track buffer/WAL usage, even if
+ * otherwise tracked
+ */
+ int opts = (instrument_options & INSTRUMENT_TIMER) != 0 ? INSTRUMENT_TIMER : INSTRUMENT_ROWS;
+
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, opts);
+ }
}
else
{
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d286471254b..d00cf820a27 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -122,6 +122,7 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecAccumNodeInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -828,6 +829,36 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecAccumNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation stack).
+ *
+ * This must run after the cleanup done by ExecShutdownNode, and not rely on any
+ * resources cleaned up by it. We also expect shutdown actions to have occurred,
+ * e.g. parallel worker instrumentation to have been added to the leader.
+ */
+void
+ExecAccumNodeInstrumentation(PlanState *node)
+{
+ (void) ExecAccumNodeInstrumentation_walker(node, NULL);
+}
+
+static bool
+ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ if (node == NULL)
+ return false;
+
+ planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);
+
+ if (node->instrument && node->instrument->stack.previous)
+ InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack);
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 41a342cab7f..37055d01f61 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,37 +16,103 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
+InstrStack *pgInstrStack = NULL;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
+/*
+ * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+ .release_priority = RELEASE_PRIO_INSTRUMENTATION,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrStack(ResourceOwner owner, InstrStack * stack)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(stack), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrStack(ResourceOwner owner, InstrStack * stack)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(stack), &instrumentation_resowner_desc);
+}
+
+static bool
+StackIsParent(InstrStack * stack, InstrStack * entry)
+{
+ if (entry->previous == NULL)
+ return false;
+
+ if (entry->previous == stack)
+ return true;
+
+ return StackIsParent(stack, entry->previous);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ InstrStack *stack = (InstrStack *) DatumGetPointer(res);
+
+ if (pgInstrStack)
+ {
+ /*
+ * Because registered resources are *not* cleaned up in a guaranteed
+ * order, we may get a child context after we've processed the parent.
+ * Thus, we only change the stack if it's not already a parent of the
+ * stack being released.
+ *
+ * If we already walked up the stack with an earlier resource, simply
+ * accumulate all collected stats before the abort to the current
+ * stack.
+ *
+ * Note that StackIsParent will recurse as needed, so it is
+ * inadvisable to use deeply nested stacks.
+ */
+ if (!StackIsParent(pgInstrStack, stack))
+ InstrPopStack(stack, true);
+ else
+ InstrStackAdd(pgInstrStack, stack);
+ }
+
+ /*
+ * Ensure long-lived memory is freed now, as we don't expect InstrStop to
+ * be called
+ */
+ pfree(stack);
+}
/* General purpose instrumentation handling */
Instrumentation *
InstrAlloc(int n, int instrument_options)
{
- Instrumentation *instr;
+ Instrumentation *instr = palloc0(n * sizeof(Instrumentation));
+ bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+ bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
+ bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+ int i;
- /* initialize all fields to zeroes, then modify as needed */
- instr = palloc0(n * sizeof(Instrumentation));
- if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
+ for (i = 0; i < n; i++)
{
- bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
- bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
- int i;
-
- for (i = 0; i < n; i++)
- {
- instr[i].need_bufusage = need_buffers;
- instr[i].need_walusage = need_wal;
- instr[i].need_timer = need_timer;
- }
+ instr[i].need_bufusage = need_buffers;
+ instr[i].need_walusage = need_wal;
+ instr[i].need_timer = need_timer;
}
return instr;
@@ -59,15 +125,31 @@ InstrStart(Instrumentation *instr)
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
elog(ERROR, "InstrStart called twice in a row");
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_bufusage || instr->need_walusage)
+ {
+ Assert(CurrentResourceOwner != NULL);
+ instr->owner = CurrentResourceOwner;
+
+ /*
+ * Allocate the stack resource in a memory context that survives
+ * during an abort. This will be freed by InstrStop (regular
+ * execution) or ResOwnerReleaseInstrumentation (abort).
+ *
+ * We don't do this in InstrAlloc to avoid leaking when InstrStart +
+ * InstrStop isn't called.
+ */
+ if (instr->stack == NULL)
+ instr->stack = MemoryContextAllocZero(CurTransactionContext, sizeof(InstrStack));
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ ResourceOwnerEnlarge(instr->owner);
+ ResourceOwnerRememberInstrStack(instr->owner, instr->stack);
+
+ InstrPushStack(instr->stack);
+ }
}
void
-InstrStop(Instrumentation *instr)
+InstrStop(Instrumentation *instr, bool finalize)
{
instr_time endtime;
@@ -83,14 +165,28 @@ InstrStop(Instrumentation *instr)
INSTR_TIME_SET_ZERO(instr->starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
+ if (instr->need_bufusage || instr->need_walusage)
+ {
+ InstrPopStack(instr->stack, finalize);
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ Assert(instr->owner != NULL);
+ ResourceOwnerForgetInstrStack(instr->owner, instr->stack);
+ instr->owner = NULL;
+
+ if (finalize)
+ {
+ /*
+ * To avoid keeping memory allocated beyond when it's needed, copy
+ * the result to the current memory context, and free it in the
+ * transaction context.
+ */
+ InstrStack *stack = palloc(sizeof(InstrStack));
+
+ memcpy(stack, instr->stack, sizeof(InstrStack));
+ pfree(instr->stack);
+ instr->stack = stack;
+ }
+ }
}
/* Trigger instrumentation handling */
@@ -98,15 +194,20 @@ TriggerInstrumentation *
InstrAllocTrigger(int n, int instrument_options)
{
TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
- bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
int i;
+ /*
+ * To avoid having to determine when the last trigger fired, we never
+ * track WAL/buffer usage for now
+ */
+ Assert((instrument_options & INSTRUMENT_BUFFERS) == 0);
+ Assert((instrument_options & INSTRUMENT_WAL) == 0);
+
for (i = 0; i < n; i++)
{
- tginstr[i].instr.need_bufusage = need_buffers;
- tginstr[i].instr.need_walusage = need_wal;
+ tginstr[i].instr.need_bufusage = false;
+ tginstr[i].instr.need_walusage = false;
tginstr[i].instr.need_timer = need_timer;
}
@@ -122,7 +223,11 @@ InstrStartTrigger(TriggerInstrumentation * tginstr)
void
InstrStopTrigger(TriggerInstrumentation * tginstr, int firings)
{
- InstrStop(&tginstr->instr);
+ /*
+ * Trigger instrumentation does not track WAL/buffer usage, so it's okay to
+ * never finalize
+ */
+ InstrStop(&tginstr->instr, false);
tginstr->firings += firings;
}
@@ -173,12 +278,13 @@ InstrStartNode(NodeInstrumentation * instr)
!INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
elog(ERROR, "InstrStartNode called twice in a row");
- /* save buffer usage totals at node entry, if needed */
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_bufusage || instr->need_walusage)
+ {
+ /* Ensure that we always have a parent, even at the topmost node */
+ Assert(pgInstrStack != NULL);
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ InstrPushStack(&instr->stack);
+ }
}
/* Exit from a plan node */
@@ -203,14 +309,14 @@ InstrStopNode(NodeInstrumentation * instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
+ if (instr->need_bufusage || instr->need_walusage)
+ {
+ /* Ensure that we always have a parent, even at the topmost node */
+ Assert(instr->stack.previous != NULL);
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ /* Adding to parent is handled by ExecAccumNodeInstrumentation */
+ InstrPopStack(&instr->stack, false);
+ }
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -287,13 +393,13 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
/* Add delta of buffer usage since entry to node's totals */
if (dst->need_bufusage)
- BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ BufferUsageAdd(&dst->stack.bufusage, &add->stack.bufusage);
if (dst->need_walusage)
- WalUsageAdd(&dst->walusage, &add->walusage);
+ WalUsageAdd(&dst->stack.walusage, &add->stack.walusage);
}
-/* note current values during parallel executor startup */
+/* start instrumentation during parallel executor startup */
void
InstrStartParallelQuery(void)
{
@@ -315,10 +421,28 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
void
InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
{
+ if (pgInstrStack != NULL)
+ {
+ InstrStack *dst = pgInstrStack;
+
+ BufferUsageAdd(&dst->bufusage, bufusage);
+ WalUsageAdd(&dst->walusage, walusage);
+ }
+
BufferUsageAdd(&pgBufferUsage, bufusage);
WalUsageAdd(&pgWalUsage, walusage);
}
+void
+InstrStackAdd(InstrStack * dst, InstrStack * add)
+{
+ Assert(dst != NULL);
+ Assert(add != NULL);
+
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
+}
+
/* dst += add */
static void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 8e7a5453064..692e3182e62 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -297,6 +297,7 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecAccumNodeInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 8e435e1f92c..30d81fceaaa 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -67,12 +67,25 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/* Stack of WAL/buffer usage used for per-node instrumentation */
+typedef struct InstrStack
+{
+ struct InstrStack *previous;
+ BufferUsage bufusage;
+ WalUsage walusage;
+} InstrStack;
+
/*
* General purpose instrumentation that can capture time and WAL/buffer usage
*
* Initialized through InstrAlloc, followed by one or more calls to a pair of
* InstrStart/InstrStop (activity is measured inbetween).
+ *
+ * Uses the resource owner mechanism to handle aborts. As such, the caller must *not* exit the
+ * top-level transaction between InstrStart/InstrStop calls in regular execution. If that is needed,
+ * use InstrPushStack/InstrPopStack directly in a PG_TRY/PG_FINALLY block instead.
*/
+struct ResourceOwnerData;
typedef struct Instrumentation
{
/* Parameters set at creation: */
@@ -81,12 +94,10 @@ typedef struct Instrumentation
bool need_walusage; /* true if we need WAL usage data */
/* Internal state keeping: */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
+ InstrStack *stack; /* stack tracking buffer/WAL usage */
+ struct ResourceOwnerData *owner;
} Instrumentation;
/* Trigger instrumentation */
@@ -99,6 +110,10 @@ typedef struct TriggerInstrumentation
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Requires use of an outer InstrStart/InstrStop to handle the stack used for WAL/buffer
+ * usage statistics, and relies on it for managing aborts. Solely intended for
+ * the executor and anyone reporting about its activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -113,8 +128,6 @@ typedef struct NodeInstrumentation
instr_time counter; /* accumulated runtime for this node */
instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
instr_time startup; /* total startup time */
instr_time total; /* total time */
@@ -123,8 +136,7 @@ typedef struct NodeInstrumentation
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
+ InstrStack stack; /* stack tracking buffer/WAL usage */
} NodeInstrumentation;
typedef struct WorkerInstrumentation
@@ -135,10 +147,31 @@ typedef struct WorkerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
+extern PGDLLIMPORT InstrStack * pgInstrStack;
+
+extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
+
+static inline void
+InstrPushStack(InstrStack * stack)
+{
+ stack->previous = pgInstrStack;
+ pgInstrStack = stack;
+}
+
+static inline void
+InstrPopStack(InstrStack * stack, bool add_to_parent)
+{
+ Assert(stack != NULL);
+
+ pgInstrStack = stack->previous;
+
+ if (pgInstrStack && add_to_parent)
+ InstrStackAdd(pgInstrStack, stack);
+}
extern Instrumentation *InstrAlloc(int n, int instrument_options);
extern void InstrStart(Instrumentation *instr);
-extern void InstrStop(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr, bool finalize);
extern TriggerInstrumentation * InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation * tginstr);
@@ -163,21 +196,34 @@ extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
#define INSTR_BUFUSAGE_INCR(fld) do { \
pgBufferUsage.fld++; \
+ if (pgInstrStack) \
+ pgInstrStack->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
pgBufferUsage.fld += val; \
+ if (pgInstrStack) \
+ pgInstrStack->bufusage.fld += val; \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ if (pgInstrStack) \
+ INSTR_TIME_ADD(pgInstrStack->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ if (pgInstrStack) \
+ INSTR_TIME_ACCUM_DIFF(pgInstrStack->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ if (pgInstrStack) \
+ pgInstrStack->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += val; \
+ if (pgInstrStack) \
+ pgInstrStack->walusage.fld += val; \
} while(0)
#endif /* INSTRUMENT_H */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index aede4bfc820..c02b75480ff 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -75,6 +75,7 @@ typedef uint32 ResourceReleasePriority;
#define RELEASE_PRIO_SNAPSHOT_REFS 500
#define RELEASE_PRIO_FILES 600
#define RELEASE_PRIO_WAITEVENTSETS 700
+#define RELEASE_PRIO_INSTRUMENTATION 800
/* 0 is considered invalid */
#define RELEASE_PRIO_FIRST 1
--
2.47.1
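
For readers following along, the push/pop mechanism in the patch above can be sketched outside the PostgreSQL tree roughly as follows. This is only an illustration under simplifying assumptions, not the patch's code: UsageCounters, current_stack, stack_push, stack_pop, stack_add, count_buffer_hit and demo_outer_total are hypothetical names standing in for BufferUsage/WalUsage, pgInstrStack, InstrPushStack, InstrPopStack, InstrStackAdd and the INSTR_BUFUSAGE_INCR macro, and the counters are cut down to two fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for BufferUsage/WalUsage. */
typedef struct UsageCounters
{
    int64_t shared_blks_hit;
    int64_t wal_records;
} UsageCounters;

/* Same shape as the patch's InstrStack: a link to the parent plus counters. */
typedef struct InstrStack
{
    struct InstrStack *previous;
    UsageCounters usage;
} InstrStack;

/* Stand-in for the pgInstrStack global: the entry currently being charged. */
static InstrStack *current_stack = NULL;

/* dst += add, analogous to InstrStackAdd. */
static void
stack_add(InstrStack *dst, const InstrStack *add)
{
    dst->usage.shared_blks_hit += add->usage.shared_blks_hit;
    dst->usage.wal_records += add->usage.wal_records;
}

/* Make "stack" the active entry, remembering the one it displaces. */
static void
stack_push(InstrStack *stack)
{
    stack->previous = current_stack;
    current_stack = stack;
}

/* Restore the parent entry, optionally folding our counters into it. */
static void
stack_pop(InstrStack *stack, bool add_to_parent)
{
    assert(stack != NULL);

    current_stack = stack->previous;

    if (current_stack != NULL && add_to_parent)
        stack_add(current_stack, stack);
}

/* Stand-in for INSTR_BUFUSAGE_INCR: charge only the active entry. */
static void
count_buffer_hit(void)
{
    if (current_stack != NULL)
        current_stack->usage.shared_blks_hit++;
}

/*
 * One hit is charged to the outer entry and two to the inner entry.
 * Returns the outer entry's total after both entries have been popped.
 */
int64_t
demo_outer_total(bool add_to_parent)
{
    InstrStack outer = {0};
    InstrStack inner = {0};

    stack_push(&outer);
    count_buffer_hit();         /* charged to outer */

    stack_push(&inner);
    count_buffer_hit();         /* charged to inner */
    count_buffer_hit();         /* charged to inner */
    stack_pop(&inner, add_to_parent);

    stack_pop(&outer, false);
    return outer.usage.shared_blks_hit;
}
```

With add_to_parent set, the inner entry's two hits are folded into the outer entry on pop, so the outer total includes its children. Without it, the outer entry keeps only its own hit, which corresponds to how InstrStopNode defers accumulation to ExecAccumNodeInstrumentation in the patch.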
[application/octet-stream] v4-0006-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch (15.9K, 8-v4-0006-Convert-remaining-users-of-pgBufferUsage-to-use-I.patch)
From 38d44a3af0da95796374a063a20a3cd38a3a6d0f Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:26:56 -0700
Subject: [PATCH v4 6/7] Convert remaining users of pgBufferUsage to use
InstrStart/InstrStop, drop the global
---
src/backend/access/heap/vacuumlazy.c | 29 +++++++++++------------
src/backend/commands/analyze.c | 31 ++++++++++++------------
src/backend/commands/explain.c | 26 +++++++--------------
src/backend/commands/explain_dr.c | 31 +++++++++++++-----------
src/backend/commands/prepare.c | 26 +++++++--------------
src/backend/executor/instrument.c | 35 +---------------------------
src/include/executor/instrument.h | 8 +------
7 files changed, 66 insertions(+), 120 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d2b031fdd06..f29f2c0784c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -641,8 +641,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ Instrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
@@ -657,6 +656,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -959,14 +960,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr, true);
+
if (verbose || params.log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_vacuum_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -975,19 +976,17 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
+ BufferUsage bufusage = instr->stack->bufusage;
+ WalUsage walusage = instr->stack->walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+
+ total_blks_hit = bufusage.shared_blks_hit +
+ bufusage.local_blks_hit;
+ total_blks_read = bufusage.shared_blks_read +
+ bufusage.local_blks_read;
+ total_blks_dirtied = bufusage.shared_blks_dirtied +
+ bufusage.local_blks_dirtied;
initStringInfo(&buf);
if (verbose)
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c2e216563c6..c9a25b4df7b 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -302,9 +302,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ Instrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -355,6 +353,9 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -735,12 +736,13 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr, true);
+
if (verbose || params.log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_analyze_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -748,18 +750,15 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
-
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ BufferUsage bufusage = instr->stack->bufusage;
+ WalUsage walusage = instr->stack->walusage;
+
+ total_blks_hit = bufusage.shared_blks_hit +
+ bufusage.local_blks_hit;
+ total_blks_read = bufusage.shared_blks_read +
+ bufusage.local_blks_read;
+ total_blks_dirtied = bufusage.shared_blks_dirtied +
+ bufusage.local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 545185148ac..dd3bf615581 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -322,14 +322,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+ else
+ instr = InstrAlloc(1, INSTRUMENT_TIMER);
+
if (es->memory)
{
/*
@@ -346,15 +348,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(instr, true);
if (es->memory)
{
@@ -362,16 +361,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->total, (es->buffers ? &instr->stack->bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 95685d7e88d..56e924b9ec1 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,20 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ Instrumentation *instr = NULL;
/* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ if (myState->es->timing || myState->es->buffers)
+ {
+ InstrumentOption instrument_options = 0;
+
+ if (myState->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (myState->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ instr = InstrAlloc(1, instrument_options);
+ InstrStart(instr);
+ }
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +191,16 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
+ if (myState->es->timing || myState->es->buffers)
+ InstrStop(instr, true);
+
/* Update timing data */
if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
+ INSTR_TIME_ADD(myState->metrics.timeSpent, instr->total);
/* Update buffer metrics */
if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ BufferUsageAdd(&myState->metrics.bufferUsage, &instr->stack->bufusage);
return true;
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 34b6410d6a2..77b4c59e71c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -578,14 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+ else
+ instr = InstrAlloc(1, INSTRUMENT_TIMER);
+
if (es->memory)
{
/* See ExplainOneQuery about this */
@@ -596,9 +598,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -633,8 +633,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(instr, true);
if (es->memory)
{
@@ -642,13 +641,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -658,7 +650,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->total, (es->buffers ? &instr->stack->bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 02c33b7dead..2a141fdae07 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -19,11 +19,9 @@
#include "utils/memutils.h"
#include "utils/resowner.h"
-BufferUsage pgBufferUsage;
WalUsage pgWalUsage;
InstrStack *pgInstrStack = NULL;
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
/*
@@ -428,7 +426,6 @@ InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
WalUsageAdd(&dst->walusage, walusage);
}
- BufferUsageAdd(&pgBufferUsage, bufusage);
WalUsageAdd(&pgWalUsage, walusage);
}
@@ -443,7 +440,7 @@ InstrStackAdd(InstrStack * dst, InstrStack * add)
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -464,36 +461,6 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
-void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index adcbc75a757..a627cfcac2d 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -145,7 +145,6 @@ typedef struct WorkerInstrumentation
NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern PGDLLIMPORT InstrStack * pgInstrStack;
@@ -189,28 +188,23 @@ extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
extern Instrumentation *InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
if (pgInstrStack) \
pgInstrStack->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += val; \
if (pgInstrStack) \
pgInstrStack->bufusage.fld += val; \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
if (pgInstrStack) \
INSTR_TIME_ADD(pgInstrStack->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
if (pgInstrStack) \
INSTR_TIME_ACCUM_DIFF(pgInstrStack->bufusage.fld, endval, startval); \
} while (0)
--
2.47.1
[application/octet-stream] v4-0007-Index-scans-Split-heap-and-index-buffer-access-re.patch (12.2K, 9-v4-0007-Index-scans-Split-heap-and-index-buffer-access-re.patch)
From ccea6e453872a0ae63351b3ba4360845035ec621 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Thu, 30 Oct 2025 22:27:30 -0700
Subject: [PATCH v4 7/7] Index scans: Split heap and index buffer access
reporting in EXPLAIN
This makes it clear whether activity was on the index directly, or
on the table based on heap fetches.
---
src/backend/commands/explain.c | 56 ++++++++++++++++------------
src/backend/executor/execProcnode.c | 13 +++++++
src/backend/executor/instrument.c | 25 +++++++++++++
src/backend/executor/nodeIndexscan.c | 15 +++++++-
src/include/access/genam.h | 3 ++
src/include/executor/instrument.h | 3 ++
6 files changed, 91 insertions(+), 24 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index dd3bf615581..fb96dd5248c 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -143,7 +143,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -603,7 +603,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
- show_buffer_usage(es, bufusage);
+ show_buffer_usage(es, bufusage, NULL);
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -1020,7 +1020,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
if (es->buffers && peek_buffer_usage(es, &metrics->bufferUsage))
{
es->indent++;
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->bufferUsage, NULL);
es->indent--;
}
}
@@ -1034,7 +1034,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->bufferUsage, NULL);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -1960,6 +1960,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexScanState *) planstate)->iss_Instrument.table_stack.bufusage, "Table");
break;
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -2278,7 +2281,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->stack.bufusage);
+ show_buffer_usage(es, &planstate->instrument->stack.bufusage,
+ IsA(plan, IndexScan) ? "Index" : NULL);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->stack.walusage);
@@ -2297,7 +2301,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->stack.bufusage);
+ show_buffer_usage(es, &instrument->stack.bufusage, NULL);
if (es->wal)
show_wal_usage(es, &instrument->stack.walusage);
ExplainCloseWorker(n, es);
@@ -4097,7 +4101,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
* Show buffer usage details. This better be sync with peek_buffer_usage.
*/
static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
{
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -4122,6 +4126,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "Buffers:");
if (has_shared)
@@ -4177,6 +4183,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared_timing || has_local_timing || has_temp_timing)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "I/O Timings:");
if (has_shared_timing)
@@ -4218,44 +4226,46 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
else
{
- ExplainPropertyInteger("Shared Hit Blocks", NULL,
+ char *prefix = title ? psprintf("%s ", title) : pstrdup("");
+
+ ExplainPropertyInteger(psprintf("%sShared Hit Blocks", prefix), NULL,
usage->shared_blks_hit, es);
- ExplainPropertyInteger("Shared Read Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sShared Read Blocks", prefix), NULL,
usage->shared_blks_read, es);
- ExplainPropertyInteger("Shared Dirtied Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sShared Dirtied Blocks", prefix), NULL,
usage->shared_blks_dirtied, es);
- ExplainPropertyInteger("Shared Written Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sShared Written Blocks", prefix), NULL,
usage->shared_blks_written, es);
- ExplainPropertyInteger("Local Hit Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sLocal Hit Blocks", prefix), NULL,
usage->local_blks_hit, es);
- ExplainPropertyInteger("Local Read Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sLocal Read Blocks", prefix), NULL,
usage->local_blks_read, es);
- ExplainPropertyInteger("Local Dirtied Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sLocal Dirtied Blocks", prefix), NULL,
usage->local_blks_dirtied, es);
- ExplainPropertyInteger("Local Written Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sLocal Written Blocks", prefix), NULL,
usage->local_blks_written, es);
- ExplainPropertyInteger("Temp Read Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sTemp Read Blocks", prefix), NULL,
usage->temp_blks_read, es);
- ExplainPropertyInteger("Temp Written Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sTemp Written Blocks", prefix), NULL,
usage->temp_blks_written, es);
if (track_io_timing)
{
- ExplainPropertyFloat("Shared I/O Read Time", "ms",
+ ExplainPropertyFloat(psprintf("%sShared I/O Read Time", prefix), "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
3, es);
- ExplainPropertyFloat("Shared I/O Write Time", "ms",
+ ExplainPropertyFloat(psprintf("%sShared I/O Write Time", prefix), "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time),
3, es);
- ExplainPropertyFloat("Local I/O Read Time", "ms",
+ ExplainPropertyFloat(psprintf("%sLocal I/O Read Time", prefix), "ms",
INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time),
3, es);
- ExplainPropertyFloat("Local I/O Write Time", "ms",
+ ExplainPropertyFloat(psprintf("%sLocal I/O Write Time", prefix), "ms",
INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time),
3, es);
- ExplainPropertyFloat("Temp I/O Read Time", "ms",
+ ExplainPropertyFloat(psprintf("%sTemp I/O Read Time", prefix), "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time),
3, es);
- ExplainPropertyFloat("Temp I/O Write Time", "ms",
+ ExplainPropertyFloat(psprintf("%sTemp I/O Write Time", prefix), "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
3, es);
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d00cf820a27..f19af428d97 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -854,7 +854,20 @@ ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);
if (node->instrument && node->instrument->stack.previous)
+ {
+ /*
+ * Index Scan nodes account for heap buffer usage separately, so we
+ * need to explicitly add it here
+ */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ InstrStackAdd(node->instrument->stack.previous, &iss->iss_Instrument.table_stack);
+ }
+
InstrStackAdd(node->instrument->stack.previous, &node->instrument->stack);
+ }
return false;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2a141fdae07..f6abc8d0c19 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -395,6 +395,31 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
WalUsageAdd(&dst->stack.walusage, &add->stack.walusage);
}
+void
+InstrStartNodeStack(NodeInstrumentation * instr, InstrStack * stack)
+{
+ if (instr->need_bufusage || instr->need_walusage)
+ {
+ /* Ensure that we always have a parent, even at the top most node */
+ Assert(pgInstrStack != NULL);
+
+ InstrPushStack(stack);
+ }
+}
+
+void
+InstrStopNodeStack(NodeInstrumentation * instr, InstrStack * stack)
+{
+ if (instr->need_bufusage || instr->need_walusage)
+ {
+ /* Ensure that we always have a parent, even at the top most node */
+ Assert(stack->previous != NULL);
+
+ /* Adding to parent is handled by ExecAccumNodeInstrumentation */
+ InstrPopStack(stack, false);
+ }
+}
+
/* start instrumentation during parallel executor startup */
Instrumentation *
InstrStartParallelQuery(void)
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 7fcaa37fe62..fcb611acb4a 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -83,6 +83,7 @@ IndexNext(IndexScanState *node)
ExprContext *econtext;
ScanDirection direction;
IndexScanDesc scandesc;
+ ItemPointer tid;
TupleTableSlot *slot;
/*
@@ -128,8 +129,20 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
+ if (node->ss.ps.instrument)
+ InstrStartNodeStack(node->ss.ps.instrument, &node->iss_Instrument.table_stack);
+
+ if (unlikely(!index_fetch_heap(scandesc, slot)))
+ continue;
+
+ if (node->ss.ps.instrument)
+ InstrStopNodeStack(node->ss.ps.instrument, &node->iss_Instrument.table_stack);
+
+ if (scandesc->xs_heap_continue)
+ elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+
CHECK_FOR_INTERRUPTS();
/*
diff --git a/src/include/access/genam.h b/src/include/access/genam.h
index 9200a22bd9f..7813b4688f5 100644
--- a/src/include/access/genam.h
+++ b/src/include/access/genam.h
@@ -17,6 +17,7 @@
#include "access/htup.h"
#include "access/sdir.h"
#include "access/skey.h"
+#include "executor/instrument.h"
#include "nodes/tidbitmap.h"
#include "storage/buf.h"
#include "storage/lockdefs.h"
@@ -40,6 +41,8 @@ typedef struct IndexScanInstrumentation
{
/* Index search count (incremented with pgstat_count_index_scan call) */
uint64 nsearches;
+ /* Buffer usage of heap access during index scans */
+ InstrStack table_stack;
} IndexScanInstrumentation;
/*
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a627cfcac2d..9bff6d8303f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -185,6 +185,9 @@ extern void InstrUpdateTupleCount(NodeInstrumentation * instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation * instr);
extern void InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add);
+extern void InstrStartNodeStack(NodeInstrumentation * dst, InstrStack * stack);
+extern void InstrStopNodeStack(NodeInstrumentation * dst, InstrStack * stack);
+
extern Instrumentation *InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
--
2.47.1
* Re: Stack-based tracking of per-node WAL/buffer usage
2025-08-31 23:57 Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-09-04 20:23 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-09-09 19:35 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 11:28 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 12:59 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-10-31 07:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
@ 2026-03-08 04:27 ` Lukas Fittl <[email protected]>
2026-03-08 04:31 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-09 21:55 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
0 siblings, 2 replies; 42+ messages in thread
From: Lukas Fittl @ 2026-03-08 04:27 UTC (permalink / raw)
To: PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; +Cc: Peter Smith <[email protected]>
On Mon, Feb 23, 2026 at 8:18 PM Lukas Fittl <[email protected]> wrote:
> 0001 addresses the issue that Peter raised and corrects the macro name
> + adds a missing comment.
This has been pushed by Andres, thanks again Peter for noting this!
See attached v7, with the changes as noted later. But first, some
fresh performance numbers that take into account extra inlining added
in a new v7/0006:
Example (default shared_buffers, runtimes are best out of 3-ish):
CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;
250ms actual runtime (no instrumentation)
BUFFERS OFF, TIMING OFF:
295ms master
295ms with stack-based instrumentation only (v7/0005) -- no change
because BUFFERS OFF
260ms with ExecProcNodeInstr inlining work (v7/0006)
BUFFERS ON, TIMING OFF:
380ms master
305ms with stack-based instrumentation only (v7/0005)
280ms with ExecProcNodeInstr inlining work (v7/0006)
In summary, relative to the 250ms uninstrumented runtime: for BUFFERS ON,
we're going from 52% overhead in this stress test to 12% overhead (22%
without the ExecProcNodeInstr change). With rows instrumentation only
(BUFFERS OFF, TIMING OFF), we go from 18% to about 4% overhead.
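As a rough cross-check of these percentages, here is a minimal C sketch of
the arithmetic. The helper name overhead_pct is made up for illustration and
is not part of the patch set; it only assumes that overhead is measured
against the uninstrumented runtime.

```c
#include <assert.h>

/*
 * Overhead of an instrumented run relative to the uninstrumented
 * baseline, in percent. Hypothetical helper, not part of the patches.
 */
static double
overhead_pct(double instrumented_ms, double baseline_ms)
{
    assert(baseline_ms > 0.0);
    return (instrumented_ms - baseline_ms) / baseline_ms * 100.0;
}
```

With the rounded runtimes listed above and the 250ms baseline,
overhead_pct(380, 250) gives 52, overhead_pct(305, 250) gives 22 and
overhead_pct(280, 250) gives 12 for BUFFERS ON, while overhead_pct(295, 250)
gives 18 and overhead_pct(260, 250) gives 4 for the rows-only case (up to
floating-point rounding).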
> 0002/0003 are the same as in v5 and are preparatory commits that
> should be in good shape to go.
I split out the trigger instrumentation refactoring into its own
commit (0001) per Andres' request off-list.
0002 is the node instrumentation refactoring, which changed slightly,
because Instrumentation is now treated as a base struct and
NodeInstrumentation now contains an Instrumentation "instr" field to
reduce duplication.
0003 replaces pgBufferUsage/pgWalUsage with INSTR_* macros, pretty
much unchanged from v6/0002.
> 0004 adds new regression tests that pass on both master and the
> patched version, and stress test some of the nested instrumentation
> handling.
This patch is the same in v7 as in v6.
> 0005 is what was previously 0003, and introduces the stack-based
> instrumentation mechanism.
0005 is still the stack-based instrumentation commit, but has several
improvements over v6:
- Andres suggested a different stack structure in an off-list
conversation: we track the stack entries in a separately kept array
instead of with inline pointers on the Instrumentation structure. That
ends up making the whole change a lot easier to reason about, and
clarifies the abort handling as well.
- I've separated out the ResOwner handling into a new
"QueryInstrumentation" struct. This makes it clearer that one can use
Instrumentation directly with certain caveats (i.e. one must deal with
aborts directly), and feels better now that we're including
Instrumentation in NodeInstrumentation as well.
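To illustrate the general shape of keeping stack entries in a separate
array, here is a minimal sketch. It is not the code from v7/0005: DemoInstr,
demo_stack, DEMO_STACK_MAX and the demo_* functions are all invented for
this example, the fixed depth is an arbitrary simplification, and the
abort handling shown (folding abandoned entries into their parent) is just
one plausible policy.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_STACK_MAX 64       /* hypothetical fixed depth, illustration only */

/* Hypothetical stand-in for a per-node usage accumulator. */
typedef struct DemoInstr
{
    int64_t     shared_blks_hit;
} DemoInstr;

/* Entries live in a separate array, not as pointers inside DemoInstr. */
static DemoInstr *demo_stack[DEMO_STACK_MAX];
static int  demo_depth = 0;

static void
demo_push(DemoInstr *instr)
{
    assert(demo_depth < DEMO_STACK_MAX);
    demo_stack[demo_depth++] = instr;
}

/* Pop the current entry; optionally fold its counters into the new top. */
static void
demo_pop(DemoInstr *instr, int accumulate_to_parent)
{
    assert(demo_depth > 0 && demo_stack[demo_depth - 1] == instr);
    demo_depth--;
    if (accumulate_to_parent && demo_depth > 0)
        demo_stack[demo_depth - 1]->shared_blks_hit += instr->shared_blks_hit;
}

/* Charge one buffer hit to the top entry only, if there is one. */
static void
demo_count_hit(void)
{
    if (demo_depth > 0)
        demo_stack[demo_depth - 1]->shared_blks_hit++;
}

/*
 * On abort, unwind to a remembered depth. Each abandoned entry is folded
 * into its parent so that activity recorded before the error is not lost.
 */
static void
demo_unwind_to(int saved_depth)
{
    assert(saved_depth >= 0);
    while (demo_depth > saved_depth)
        demo_pop(demo_stack[demo_depth - 1], 1);
}
```

Because the stack depth is a single integer in this sketch, an abort
handler only has to remember the depth at entry and unwind to it, which is
one way the array layout could make abort handling easier to follow than
chasing per-entry "previous" pointers.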
> 0006 could probably be merged into 0005, and is the existing
> mechanical change to drop the pgBufferUsage global, but kept separate
> for easier review for now.
I've merged that together now into 0005 because it feels better to
reason about this as a whole and drop pgBufferUsage within the same
commit where we're adding the stack-based instrumentation.
There is now a new 0006 patch that further optimizes ExecProcNodeInstr
(with performance numbers as noted in the beginning of this mail).
During my testing I realized that the conditional checks in
InstrStartNode/InstrStopNode are quite unnecessary - we never change
what gets instrumented during execution, and so we can create
specialized functions for the different instrumentation combinations,
giving the compiler a much better chance at generating optimal
instructions. The assembly here looks a lot better now.
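The idea of specializing on instrumentation options that never change
during execution can be sketched roughly as follows. This is only an
illustration under assumptions: DemoNodeInstr, demo_start_impl,
demo_choose_start and the demo_start_* variants are invented names, the
counters stand in for the real timer and stack work, and the actual
v7/0006 patch may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-node state; the need_* flags are fixed at setup time. */
typedef struct DemoNodeInstr
{
    bool        need_timer;
    bool        need_bufusage;
    int64_t     timer_starts;       /* stand-in for starting a timer */
    int64_t     bufusage_pushes;    /* stand-in for pushing a stack entry */
} DemoNodeInstr;

typedef void (*DemoStartFn) (DemoNodeInstr *instr);

/*
 * Generic body. The options arrive as parameters, so a caller that passes
 * compile-time constants gives the compiler the chance to drop the
 * branches that can never be taken once the call is inlined.
 */
static inline void
demo_start_impl(DemoNodeInstr *instr, bool timer, bool bufusage)
{
    if (timer)
        instr->timer_starts++;
    if (bufusage)
        instr->bufusage_pushes++;
}

/* One specialized entry point per option combination. */
static void
demo_start_rows_only(DemoNodeInstr *instr)
{
    demo_start_impl(instr, false, false);
}

static void
demo_start_timer(DemoNodeInstr *instr)
{
    demo_start_impl(instr, true, false);
}

static void
demo_start_buffers(DemoNodeInstr *instr)
{
    demo_start_impl(instr, false, true);
}

static void
demo_start_timer_buffers(DemoNodeInstr *instr)
{
    demo_start_impl(instr, true, true);
}

/* Pick the variant once at setup instead of testing the flags per call. */
static DemoStartFn
demo_choose_start(const DemoNodeInstr *instr)
{
    if (instr->need_timer && instr->need_bufusage)
        return demo_start_timer_buffers;
    if (instr->need_timer)
        return demo_start_timer;
    if (instr->need_bufusage)
        return demo_start_buffers;
    return demo_start_rows_only;
}
```

The trade-off in a scheme like this is a few more small functions in
exchange for per-tuple code paths without flag checks; whether the
compiler actually removes the branches depends on inlining, so it is worth
confirming in the generated assembly.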
> 0007 is the stack-mechanism for splitting out "Table Buffers" for
> Index Scans.
This stayed pretty much the same as the earlier version, but I
reworked it a bit to avoid special Instr functions for dealing with
this case; instead, it re-uses NodeInstrumentation and
InstrStartNode/InstrStopNode.
> 0008 adds a pg_session_buffer_usage() module for testing the global
> counters. This helps to verify the handling of aborts behave sanely
> (and can run on both master and patched), but I don't think we should
> commit this.
This is identical to v6/0008, but continues to prove very useful in
testing the refactorings.
> Two questions on my mind:
>
> 1) Should we call this "instrumentation context" (InstrContext?)
> instead of "instrumentation stack" (InstrStack)? When writing comments
> I found the difference between "stack entry" and "stack" confusing at
> times, and "context" feels a bit more clear as a term (e.g. as in
> "CurrentInstrContext" or "CurrentInstrumentationContext").
With the refactoring of the stack mechanism done now, I think we don't
need to change the naming here, since "InstrStack" as a commonly used
struct is gone now (it's just Instrumentation now), so I'd suggest
keeping the "stack" terminology to describe the approach itself.
> 2) For 0007, "Table Buffers" feels a bit inconsistent with "Heap
> Fetches" and "Heap Blocks" used elsewhere, should we potentially use
> "Heap Buffers", or change the existing names to "Table ..." instead?
That is still an open question.
Thanks,
Lukas
--
Lukas Fittl
Attachments:
[application/octet-stream] v7-0004-instrumentation-Add-additional-regression-tests-c.patch (23.5K, 2-v7-0004-instrumentation-Add-additional-regression-tests-c.patch)
From 03eb693edd0c20386a5c1fb87bdd343444f3a0f4 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 22 Feb 2026 16:12:48 -0800
Subject: [PATCH v7 4/8] instrumentation: Add additional regression tests
covering buffer usage
This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.
Testing this is challenging because there can be different sources of
buffer activity, so we rely on temporary tables where we can to prove
that activity was captured and not lost. This supports a future commit
that will rework some of the instrumentation logic that could cause
areas covered by these tests to fail.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
.../pg_stat_statements/expected/utility.out | 70 +++++++
contrib/pg_stat_statements/expected/wal.out | 48 +++++
contrib/pg_stat_statements/sql/utility.sql | 56 +++++
contrib/pg_stat_statements/sql/wal.sql | 33 +++
src/test/regress/expected/explain.out | 197 ++++++++++++++++++
src/test/regress/sql/explain.sql | 194 +++++++++++++++++
6 files changed, 598 insertions(+)
diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index e4d6564ea5b..cba487f6be5 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -289,6 +289,76 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
1 | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
(3 rows)
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT run_explain_buffers_test();
+ run_explain_buffers_test
+--------------------------
+
+(1 row)
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+ query | has_buffer_stats
+-----------------------------------+------------------
+ SELECT run_explain_buffers_test() | t
+(1 row)
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+CALL pgss_call_rollback_proc();
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------------+-------+---------------+---------------------+-----------------------
+ CALL pgss_call_rollback_proc() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/expected/wal.out b/contrib/pg_stat_statements/expected/wal.out
index 977e382d848..611213daef6 100644
--- a/contrib/pg_stat_statements/expected/wal.out
+++ b/contrib/pg_stat_statements/expected/wal.out
@@ -28,3 +28,51 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
t
(1 row)
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT pgss_error_func();
+ pgss_error_func
+-----------------
+
+(1 row)
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------+-------+---------------+---------------------+-----------------------
+ SELECT pgss_error_func() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index dd97203c210..7540e49c73c 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -152,6 +152,62 @@ EXPLAIN (costs off) SELECT a FROM generate_series(1,10) AS tab(a) WHERE a = 7;
SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+SELECT run_explain_buffers_test();
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+CALL pgss_call_rollback_proc();
+
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
+
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/sql/wal.sql b/contrib/pg_stat_statements/sql/wal.sql
index 1dc1552a81e..467e321b206 100644
--- a/contrib/pg_stat_statements/sql/wal.sql
+++ b/contrib/pg_stat_statements/sql/wal.sql
@@ -18,3 +18,36 @@ wal_records > 0 as wal_records_generated,
wal_records >= rows as wal_records_ge_rows
FROM pg_stat_statements ORDER BY query COLLATE "C";
SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT pgss_error_func();
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..e28e7543693 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,200 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+ exception_buffers_nested_visible
+----------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_cursor_explain_buffers();
+ noscroll_ok | scroll_ok
+-------------+-----------
+ t | t
+(1 row)
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_parallel_explain_buffers();
+ ratio
+-------
+ 1
+(1 row)
+
+DROP FUNCTION check_parallel_explain_buffers;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..cf5c6335a19 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,197 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_cursor_explain_buffers();
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_parallel_explain_buffers();
+
+DROP FUNCTION check_parallel_explain_buffers;
--
2.47.1
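
The two sanity checks used by the cursor and parallel regression tests in the patch above reduce to simple arithmetic, sketched here in C. This is only an illustration: the helper names buffers_within_window and rounded_ratio are hypothetical and appear nowhere in the patch, and the rounding assumes non-negative counts. One consequence worth noting is that a rounded ratio of 1 accepts any parallel count from 0.5x up to (but not including) 1.5x the serial count.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the cursor test's tolerance window:
 * the observed count must be at least half the baseline and strictly
 * less than twice the baseline (the "not double-counted" condition).
 */
bool
buffers_within_window(int observed, int baseline)
{
	return observed >= baseline * 0.5 && observed < baseline * 2;
}

/*
 * Hypothetical helper mirroring
 *   round(parallel_buffers::numeric / GREATEST(serial_buffers, 1))
 * for non-negative inputs only.
 */
long
rounded_ratio(int parallel_buffers, int serial_buffers)
{
	int			denom = serial_buffers > 1 ? serial_buffers : 1;
	double		ratio;

	assert(parallel_buffers >= 0);
	ratio = (double) parallel_buffers / denom;

	/* Adding 0.5 and truncating rounds half up, valid since ratio >= 0. */
	return (long) (ratio + 0.5);
}
```

Guarding the denominator with a floor of 1 keeps the ratio defined even if the serial scan reports no shared buffers at all.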
[application/octet-stream] v7-0002-instrumentation-Separate-per-node-logic-from-othe.patch (26.3K, 3-v7-0002-instrumentation-Separate-per-node-logic-from-othe.patch)
From 0032774bacdeff43d6565e43e04510c1b1eaf6c2 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 01:19:50 -0800
Subject: [PATCH v7 2/8] instrumentation: Separate per-node logic from other uses
Previously, several places (e.g. the query "total time") repurposed the
Instrumentation struct, which was originally introduced to capture per-node
statistics during execution. This overuse of one struct is confusing, since
it clutters unrelated code paths with InstrStartNode/InstrStopNode calls,
and it gets in the way of future refactoring.
Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.
Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.
In passing, drop the "n" argument to InstrAlloc, as all remaining callers
need exactly one Instrumentation struct.
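
As a rough illustration of the split described above, the sketch below models a general-purpose struct embedded in a per-node struct. It is not the patch's code: the Toy* types and toy_* functions are hypothetical, the clock is a plain integer rather than instr_time, and WAL/buffer fields are left out. The only detail taken from the patch is that the embedded member is called "instr", as the explain.c hunks suggest (planstate->instrument->instr.total). Whether the real InstrStopNode delegates to InstrStop in this way is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the slimmed-down general-purpose struct. */
typedef struct ToyInstrumentation
{
	bool		need_timer;
	bool		running;		/* true between start and stop */
	int64_t		starttime;		/* toy clock value, not instr_time */
	int64_t		total;			/* accumulated elapsed ticks */
} ToyInstrumentation;

/* Hypothetical per-node struct embedding the general one as "instr". */
typedef struct ToyNodeInstrumentation
{
	ToyInstrumentation instr;
	double		tuplecount;		/* tuples counted for this node */
} ToyNodeInstrumentation;

/* General-purpose start: only timing, no per-node bookkeeping. */
void
toy_instr_start(ToyInstrumentation *instr, int64_t now)
{
	if (!instr->need_timer)
		return;
	assert(!instr->running);	/* analogous to "called twice in a row" */
	instr->starttime = now;
	instr->running = true;
}

/* General-purpose stop: accumulate elapsed time into total. */
void
toy_instr_stop(ToyInstrumentation *instr, int64_t now)
{
	if (!instr->need_timer)
		return;
	assert(instr->running);		/* analogous to "called without start" */
	instr->total += now - instr->starttime;
	instr->running = false;
}

/* Per-node stop: count tuples, then reuse the general-purpose stop. */
void
toy_instr_stop_node(ToyNodeInstrumentation *node, int64_t now, double ntuples)
{
	node->tuplecount += ntuples;
	toy_instr_stop(&node->instr, now);
}
```

With this shape, a caller that only needs query-level timing works with ToyInstrumentation alone and never touches tuple counts, while per-node callers reach the shared fields through the embedded member.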
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
contrib/auto_explain/auto_explain.c | 8 +-
.../pg_stat_statements/pg_stat_statements.c | 8 +-
contrib/postgres_fdw/postgres_fdw.c | 2 +-
src/backend/commands/explain.c | 20 +--
src/backend/executor/execMain.c | 8 +-
src/backend/executor/execParallel.c | 24 +--
src/backend/executor/execProcnode.c | 4 +-
src/backend/executor/instrument.c | 142 ++++++++++++------
src/include/executor/instrument.h | 60 +++++---
src/include/nodes/execnodes.h | 6 +-
src/tools/pgindent/typedefs.list | 3 +-
11 files changed, 172 insertions(+), 113 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index e856cd35a6f..39bf2543b70 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 4a427533bd8..388b068ccec 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1023,7 +1023,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -1082,12 +1082,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
if (queryId != INT64CONST(0) && queryDesc->totaltime &&
pgss_enabled(nesting_level))
{
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
pgss_store(queryDesc->sourceText,
queryId,
queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 60d90329a65..6f0cb2a285b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2778,7 +2778,7 @@ postgresIterateDirectModify(ForeignScanState *node)
if (!resultRelInfo->ri_projectReturning)
{
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- Instrumentation *instr = node->ss.ps.instrument;
+ NodeInstrumentation *instr = node->ss.ps.instrument;
Assert(!dmstate->has_returning);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 09b13807d92..389181b8d9b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1835,7 +1835,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
double nloops = planstate->instrument->nloops;
double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
- double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
+ double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->instr.total) / nloops;
double rows = planstate->instrument->ntuples / nloops;
if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1888,11 +1888,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* prepare per-worker general execution details */
if (es->workers_state && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
double startup_ms;
double total_ms;
@@ -1901,7 +1901,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (nloops <= 0)
continue;
startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
- total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
+ total_ms = INSTR_TIME_GET_MILLISEC(instrument->instr.total) / nloops;
rows = instrument->ntuples / nloops;
ExplainOpenWorker(n, es);
@@ -2288,18 +2288,18 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage);
if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ show_wal_usage(es, &planstate->instrument->instr.walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
if (nloops <= 0)
@@ -2307,9 +2307,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage);
if (es->wal)
- show_wal_usage(es, &instrument->walusage);
+ show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 1a3b8021600..c0b174cfbc0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -331,7 +331,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/*
* extract information from the query descriptor and the query feature.
@@ -383,7 +383,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -433,7 +433,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -443,7 +443,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
AfterTriggerEndQuery(estate);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ac84af294c9..c153d5c1c3b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,7 +87,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object. This will depend on the length of
+ * of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -104,11 +104,15 @@ struct SharedExecutorInstrumentation
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
- /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+ /*
+ * array of num_plan_nodes * num_workers NodeInstrumentation objects
+ * follows
+ */
};
#define GetInstrumentationArray(sei) \
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
- (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+ (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
@@ -725,7 +729,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len = MAXALIGN(instrumentation_len);
instrument_offset = instrumentation_len;
instrumentation_len +=
- mul_size(sizeof(Instrumentation),
+ mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -811,7 +815,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*/
if (estate->es_instrument)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -821,7 +825,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInit(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1053,7 +1057,7 @@ static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
int n;
int ibytes;
@@ -1081,9 +1085,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* Switch into per-query memory context.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
planstate->worker_instrument =
- palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+ palloc(ibytes + offsetof(WorkerNodeInstrumentation, instrument));
MemoryContextSwitchTo(oldcontext);
planstate->worker_instrument->num_workers = instrumentation->num_workers;
@@ -1313,7 +1317,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
{
int i;
int plan_node_id = planstate->plan->plan_node_id;
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
InstrEndLoop(planstate->instrument);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7e40b852517..1846661b503 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -413,8 +413,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument,
- result->async_capable);
+ result->instrument = InstrAllocNode(estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 9354ad7be12..bc551f95a08 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,51 +26,30 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int instrument_options)
{
- Instrumentation *instr;
-
- /* initialize all fields to zeroes, then modify as needed */
- instr = palloc0(n * sizeof(Instrumentation));
- if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
- {
- bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
- bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
- int i;
-
- for (i = 0; i < n; i++)
- {
- instr[i].need_bufusage = need_buffers;
- instr[i].need_walusage = need_wal;
- instr[i].need_timer = need_timer;
- instr[i].async_mode = async_mode;
- }
- }
-
+ Instrumentation *instr = palloc0(sizeof(Instrumentation));
+ InstrInitOptions(instr, instrument_options);
return instr;
}
-/* Initialize a pre-allocated instrumentation structure. */
void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- memset(instr, 0, sizeof(Instrumentation));
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-/* Entry to a plan node */
void
-InstrStartNode(Instrumentation *instr)
+InstrStart(Instrumentation *instr)
{
if (instr->need_timer)
{
if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ elog(ERROR, "InstrStart called twice in a row");
else
INSTR_TIME_SET_CURRENT(instr->starttime);
}
@@ -83,24 +62,19 @@ InstrStartNode(Instrumentation *instr)
instr->walusage_start = pgWalUsage;
}
-/* Exit from a plan node */
void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
-
/* let's update the time only if the timer was requested */
if (instr->need_timer)
{
if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStopNode called without start");
+ elog(ERROR, "InstrStop called without start");
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
}
@@ -113,6 +87,74 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (instr->need_walusage)
WalUsageAccumDiff(&instr->walusage,
&pgWalUsage, &instr->walusage_start);
+}
+
+/* Node instrumentation handling */
+
+/* Allocate new node instrumentation structure */
+NodeInstrumentation *
+InstrAllocNode(int instrument_options, bool async_mode)
+{
+ NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+
+ InstrInitNode(instr, instrument_options);
+ instr->async_mode = async_mode;
+
+ return instr;
+}
+
+/* Initialize a pre-allocated instrumentation structure. */
+void
+InstrInitNode(NodeInstrumentation *instr, int instrument_options)
+{
+ memset(instr, 0, sizeof(NodeInstrumentation));
+ InstrInitOptions(&instr->instr, instrument_options);
+}
+
+/* Entry to a plan node */
+void
+InstrStartNode(NodeInstrumentation *instr)
+{
+ InstrStart(&instr->instr);
+}
+
+/* Exit from a plan node */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ double save_tuplecount = instr->tuplecount;
+ instr_time endtime;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
+
+ /*
+ * Update the time only if the timer was requested.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+ if (instr->instr.need_timer)
+ {
+ if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
+ elog(ERROR, "InstrStopNode called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
+ }
+
+ /* Add delta of buffer usage since entry to node's totals */
+ if (instr->instr.need_bufusage)
+ BufferUsageAccumDiff(&instr->instr.bufusage,
+ &pgBufferUsage, &instr->instr.bufusage_start);
+
+ if (instr->instr.need_walusage)
+ WalUsageAccumDiff(&instr->instr.walusage,
+ &pgWalUsage, &instr->instr.walusage_start);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -133,7 +175,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
/* Update tuple count */
void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
{
/* count the returned tuples */
instr->tuplecount += nTuples;
@@ -141,24 +183,24 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation *instr)
{
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
elog(ERROR, "InstrEndLoop called on running node");
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
- INSTR_TIME_ADD(instr->total, instr->counter);
+ INSTR_TIME_ADD(instr->instr.total, instr->counter);
instr->ntuples += instr->tuplecount;
instr->nloops += 1;
/* Reset for next cycle (if any) */
instr->running = false;
- INSTR_TIME_SET_ZERO(instr->starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
INSTR_TIME_SET_ZERO(instr->counter);
INSTR_TIME_SET_ZERO(instr->firsttuple);
instr->tuplecount = 0;
@@ -166,7 +208,7 @@ InstrEndLoop(Instrumentation *instr)
/* aggregate instrumentation information */
void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
{
if (!dst->running && add->running)
{
@@ -181,7 +223,7 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->tuplecount += add->tuplecount;
INSTR_TIME_ADD(dst->startup, add->startup);
- INSTR_TIME_ADD(dst->total, add->total);
+ INSTR_TIME_ADD(dst->instr.total, add->instr.total);
dst->ntuples += add->ntuples;
dst->ntuples2 += add->ntuples2;
dst->nloops += add->nloops;
@@ -189,11 +231,11 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->need_bufusage)
- BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ if (dst->instr.need_bufusage)
+ BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
- if (dst->need_walusage)
- WalUsageAdd(&dst->walusage, &add->walusage);
+ if (dst->instr.need_walusage)
+ WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
}
/* Trigger instrumentation handling */
@@ -204,7 +246,7 @@ InstrAllocTrigger(int n, int instrument_options)
int i;
for (i = 0; i < n; i++)
- InstrInit(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, instrument_options);
return tginstr;
}
@@ -212,13 +254,13 @@ InstrAllocTrigger(int n, int instrument_options)
void
InstrStartTrigger(TriggerInstrumentation *tginstr)
{
- InstrStartNode(&tginstr->instr);
+ InstrStart(&tginstr->instr);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
- InstrStopNode(&tginstr->instr, 0);
+ InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a9c2233227f..b11d64633b5 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -67,38 +67,55 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/*
+ * General purpose instrumentation that can capture time and WAL/buffer usage
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured in between).
+ */
typedef struct Instrumentation
{
- /* Parameters set at node creation: */
+ /* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
bool need_bufusage; /* true if we need buffer usage data */
bool need_walusage; /* true if we need WAL usage data */
+ /* Internal state keeping: */
+ instr_time starttime; /* start time of last InstrStart */
+ BufferUsage bufusage_start; /* buffer usage at start */
+ WalUsage walusage_start; /* WAL usage at start */
+ /* Accumulated statistics: */
+ instr_time total; /* total runtime */
+ BufferUsage bufusage; /* total buffer usage */
+ WalUsage walusage; /* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
+{
+ Instrumentation instr;
+ /* Parameters set at node creation: */
bool async_mode; /* true if node is in async mode */
/* Info about current plan cycle: */
bool running; /* true if we've completed first tuple */
- instr_time starttime; /* start time of current iteration of node */
instr_time counter; /* accumulated runtime for this node */
instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
instr_time startup; /* total startup time */
- instr_time total; /* total time */
double ntuples; /* total tuples produced */
double ntuples2; /* secondary node-specific tuple counter */
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
-typedef struct WorkerInstrumentation
+typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
- Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
-} WorkerInstrumentation;
+ NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+} WorkerNodeInstrumentation;
typedef struct TriggerInstrumentation
{
@@ -110,14 +127,19 @@ typedef struct TriggerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
- bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int instrument_options);
+extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr);
+
+extern NodeInstrumentation *InstrAllocNode(int instrument_options,
+ bool async_mode);
+extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation *instr);
+extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation *instr);
+extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a43bd428a91..605c7a6cc39 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1175,8 +1175,10 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
- Instrumentation *instrument; /* Optional runtime stats for this node */
- WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
+ NodeInstrumentation *instrument; /* Optional runtime stats for this
+ * node */
+ WorkerNodeInstrumentation *worker_instrument; /* per-worker
+ * instrumentation */
/* Per-worker JIT instrumentation */
struct SharedJitInstrumentation *worker_jit_instrument;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fd3ec7a7236..7dc0073ab68 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1785,6 +1785,7 @@ NextSampleBlock_function
NextSampleTuple_function
NextValueExpr
Node
+NodeInstrumentation
NodeTag
NonEmptyRange
NoneCompressorState
@@ -3377,9 +3378,9 @@ WorkTableScan
WorkTableScanState
WorkerInfo
WorkerInfoData
-WorkerInstrumentation
WorkerJobDumpPtrType
WorkerJobRestorePtrType
+WorkerNodeInstrumentation
Working_State
WriteBufPtrType
WriteBytePtrType
--
2.47.1
[application/octet-stream] v7-0001-instrumentation-Separate-trigger-logic-from-other.patch (9.7K, 4-v7-0001-instrumentation-Separate-trigger-logic-from-other.patch)
From 2c0c9c172cfd23cb8926803ee8db51915b7ccc7d Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v7 1/8] instrumentation: Separate trigger logic from other
uses
Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.
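
As a rough illustration of the split described above (not the patch's code), the sketch below wraps a simplified stand-in for Instrumentation in a trigger-specific struct with its own firings counter. The sketch_* helpers and the total_ms field are hypothetical; the real struct uses instr_time and also tracks buffer/WAL usage.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for the generic struct (assumption: timing only). */
typedef struct Instrumentation
{
    double      total_ms;       /* accumulated runtime in milliseconds */
} Instrumentation;

/* Trigger-specific wrapper: generic part plus a dedicated firings counter. */
typedef struct TriggerInstrumentation
{
    Instrumentation instr;
    int         firings;        /* number of times the trigger was fired */
} TriggerInstrumentation;

/* Hypothetical helper: allocate n zero-initialized entries. */
static TriggerInstrumentation *
sketch_alloc_trigger(int n)
{
    assert(n > 0);
    return calloc((size_t) n, sizeof(TriggerInstrumentation));
}

/* Hypothetical helper: charge elapsed time and count firings separately. */
static void
sketch_stop_trigger(TriggerInstrumentation *tginstr, double elapsed_ms,
                    int firings)
{
    tginstr->instr.total_ms += elapsed_ms;
    tginstr->firings += firings;
}

/* Triggers that never fired are skipped when reporting. */
static int
sketch_should_report(const TriggerInstrumentation *tginstr)
{
    return tginstr->firings != 0;
}
```

Keeping the counter as an int in the wrapper is what lets the EXPLAIN output switch from a float "ntuples" to an integer "Calls" value.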
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/commands/explain.c | 19 ++++++++-----------
src/backend/commands/trigger.c | 22 +++++++++++-----------
src/backend/executor/execMain.c | 2 +-
src/backend/executor/instrument.c | 26 ++++++++++++++++++++++++++
src/include/executor/instrument.h | 12 ++++++++++++
src/include/nodes/execnodes.h | 2 +-
src/tools/pgindent/typedefs.list | 1 +
7 files changed, 60 insertions(+), 24 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 93918a223b8..09b13807d92 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1099,18 +1099,15 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
for (nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
{
Trigger *trig = rInfo->ri_TrigDesc->triggers + nt;
- Instrumentation *instr = rInfo->ri_TrigInstrument + nt;
+ TriggerInstrumentation *tginstr = rInfo->ri_TrigInstrument + nt;
char *relname;
char *conname = NULL;
- /* Must clean up instrumentation state */
- InstrEndLoop(instr);
-
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
*/
- if (instr->ntuples == 0)
+ if (tginstr->firings == 0)
continue;
ExplainOpenGroup("Trigger", NULL, true, es);
@@ -1135,11 +1132,11 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
if (show_relname)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
- appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- INSTR_TIME_GET_MILLISEC(instr->total),
- instr->ntuples);
+ appendStringInfo(es->str, ": time=%.3f calls=%d\n",
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total),
+ tginstr->firings);
else
- appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
+ appendStringInfo(es->str, ": calls=%d\n", tginstr->firings);
}
else
{
@@ -1149,9 +1146,9 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Relation", relname, es);
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- INSTR_TIME_GET_MILLISEC(instr->total), 3,
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total), 3,
es);
- ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
+ ExplainPropertyInteger("Calls", NULL, tginstr->firings, es);
}
if (conname)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 98d402c0a3b..c3360073141 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -90,7 +90,7 @@ static bool TriggerEnabled(EState *estate, ResultRelInfo *relinfo,
static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2309,7 +2309,7 @@ static HeapTuple
ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2344,7 +2344,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2389,10 +2389,10 @@ ExecCallTriggerFunc(TriggerData *trigdata,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
return (HeapTuple) DatumGetPointer(result);
}
@@ -3936,7 +3936,7 @@ static void AfterTriggerExecute(EState *estate,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2);
@@ -4330,7 +4330,7 @@ AfterTriggerExecute(EState *estate,
ResultRelInfo *src_relInfo,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
- FmgrInfo *finfo, Instrumentation *instr,
+ FmgrInfo *finfo, TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2)
@@ -4371,7 +4371,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4588,10 +4588,10 @@ AfterTriggerExecute(EState *estate,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
}
@@ -4707,7 +4707,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
Relation rel = NULL;
TriggerDesc *trigdesc = NULL;
FmgrInfo *finfo = NULL;
- Instrumentation *instr = NULL;
+ TriggerInstrumentation *instr = NULL;
TupleTableSlot *slot1 = NULL,
*slot2 = NULL;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index bfd3ebc601e..1a3b8021600 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1270,7 +1270,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
}
else
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index a40610bc252..9354ad7be12 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -196,6 +196,32 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
WalUsageAdd(&dst->walusage, &add->walusage);
}
+/* Trigger instrumentation handling */
+TriggerInstrumentation *
+InstrAllocTrigger(int n, int instrument_options)
+{
+ TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ int i;
+
+ for (i = 0; i < n; i++)
+ InstrInit(&tginstr[i].instr, instrument_options);
+
+ return tginstr;
+}
+
+void
+InstrStartTrigger(TriggerInstrumentation *tginstr)
+{
+ InstrStartNode(&tginstr->instr);
+}
+
+void
+InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
+{
+ InstrStopNode(&tginstr->instr, 0);
+ tginstr->firings += firings;
+}
+
/* note current values during parallel executor startup */
void
InstrStartParallelQuery(void)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..a9c2233227f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -100,6 +100,13 @@ typedef struct WorkerInstrumentation
Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
+typedef struct TriggerInstrumentation
+{
+ Instrumentation instr;
+ int firings; /* number of times the instrumented trigger
+ * was fired */
+} TriggerInstrumentation;
+
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
@@ -111,6 +118,11 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+
+extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
+extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
+
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 63c067d5aae..a43bd428a91 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -524,7 +524,7 @@ typedef struct ResultRelInfo
ExprState **ri_TrigWhenExprs;
/* optional runtime measurements for triggers */
- Instrumentation *ri_TrigInstrument;
+ TriggerInstrumentation *ri_TrigInstrument;
/* On-demand created slots for triggers / returning processing */
TupleTableSlot *ri_ReturningSlot; /* for trigger output tuples */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 77e3c04144e..fd3ec7a7236 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3156,6 +3156,7 @@ TriggerDesc
TriggerEvent
TriggerFlags
TriggerInfo
+TriggerInstrumentation
TriggerTransition
TruncateStmt
TsmRoutine
--
2.47.1
[application/octet-stream] v7-0003-instrumentation-Replace-direct-changes-of-pgBuffe.patch (9.9K, 5-v7-0003-instrumentation-Replace-direct-changes-of-pgBuffe.patch)
From 29f9820901383800c659c082876c734cde723601 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 12:12:39 -0800
Subject: [PATCH v7 3/8] instrumentation: Replace direct changes of
pgBufferUsage/pgWalUsage with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
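
To illustrate why routing updates through macros helps (a sketch only, with hypothetical SKETCH_* names and a reduced struct), the example below has call sites update counters through macros that write to whatever sketch_usage_target points at. Redirecting that pointer changes where the counts land without touching any call site, which is the kind of later refactoring this encapsulation is meant to allow.

```c
#include <assert.h>
#include <stdint.h>

/* Reduced stand-in for BufferUsage (assumption: two counters only). */
typedef struct SketchBufferUsage
{
    int64_t     shared_blks_hit;
    int64_t     shared_blks_read;
} SketchBufferUsage;

static SketchBufferUsage sketch_global_usage;

/* Hypothetical indirection: the macros write to the current target. */
static SketchBufferUsage *sketch_usage_target = &sketch_global_usage;

#define SKETCH_BUFUSAGE_INCR(fld) do { \
        sketch_usage_target->fld++; \
    } while (0)
#define SKETCH_BUFUSAGE_ADD(fld, val) do { \
        sketch_usage_target->fld += (val); \
    } while (0)

/* A call site that never names the counter storage directly. */
static void
sketch_read_blocks(int nblocks, int nhits)
{
    assert(nblocks >= 0 && nhits >= 0);
    for (int i = 0; i < nhits; i++)
        SKETCH_BUFUSAGE_INCR(shared_blks_hit);
    SKETCH_BUFUSAGE_ADD(shared_blks_read, nblocks);
}
```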
Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
---
src/backend/access/transam/xlog.c | 10 +++++-----
src/backend/executor/instrument.c | 1 -
src/backend/storage/buffer/bufmgr.c | 24 ++++++++++++------------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
7 files changed, 47 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b9b678f3722..f6ac3c530b9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1081,10 +1081,10 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
- pgWalUsage.wal_fpi_bytes += fpi_bytes;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
+ INSTR_WALUSAGE_ADD(wal_fpi_bytes, fpi_bytes);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2063,7 +2063,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = InvalidXLogRecPtr;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index bc551f95a08..6a4a08ebb0c 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -54,7 +54,6 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- /* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
instr->bufusage_start = pgBufferUsage;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5f3d083e938..59b098ba0a0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -827,7 +827,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -848,7 +848,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
if (BufferTagsEqual(&tag, &bufHdr->tag))
{
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
UnpinBuffer(bufHdr);
@@ -1249,14 +1249,14 @@ PinBufferForBlock(Relation rel,
{
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
if (*foundPtr)
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
}
else
{
bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
strategy, foundPtr, io_context);
if (*foundPtr)
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
}
if (rel)
{
@@ -1990,9 +1990,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
if (operation->rel)
pgstat_count_buffer_hit(operation->rel);
@@ -2060,9 +2060,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it.
@@ -2955,7 +2955,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -3100,7 +3100,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4529,7 +4529,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
@@ -5690,7 +5690,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
if (dirtied)
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 04a540379a2..e6054e745e8 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -217,7 +217,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -478,7 +478,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -509,7 +509,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u64(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c4afe4d368a..8b501dfcadd 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -475,13 +475,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -549,13 +549,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 28de24538dc..9e7a88ec0d0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b11d64633b5..1139be8333e 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -153,4 +153,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += val; \
+ } while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += val; \
+ } while(0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
[application/octet-stream] v7-0005-Optimize-measuring-WAL-buffer-usage-through-stack.patch (66.3K, 6-v7-0005-Optimize-measuring-WAL-buffer-usage-through-stack.patch)
From a59c9778b9a1cdd98d0ea58f1a6a391efa84b007 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v7 5/8] Optimize measuring WAL/buffer usage through
stack-based instrumentation
Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that get
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode, and popped at InstrStopNode. Stack entries are zero
initialized (avoiding the diff mechanism) and get added to their parent
entry when they are finalized, i.e. no more modifications can occur.
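The push/pop/finalize flow described above can be illustrated with a minimal, self-contained sketch. Everything in it (UsageSketch, EntrySketch, the sketch_* functions) is a hypothetical stand-in invented for illustration and is not the patch's actual definitions; it only assumes what the text states: activity is written to the current entry, entries start zeroed, and a finalized entry is added to its parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified counters; the real BufferUsage has many more fields. */
typedef struct UsageSketch
{
	long		blks_hit;
	long		blks_read;
} UsageSketch;

/* Hypothetical stack entry: counters plus the entry to restore on pop. */
typedef struct EntrySketch
{
	UsageSketch usage;
	struct EntrySketch *prev;
} EntrySketch;

/* Points at whichever entry is currently collecting activity. */
static EntrySketch *current_entry = NULL;

static void
sketch_push(EntrySketch *e)
{
	e->prev = current_entry;
	current_entry = e;
}

static void
sketch_pop(EntrySketch *e)
{
	assert(current_entry == e);
	current_entry = e->prev;
}

/* Activity goes straight into the current entry, so no diff is needed later. */
static void
sketch_count_hit(void)
{
	if (current_entry)
		current_entry->usage.blks_hit++;
}

static void
sketch_count_read(void)
{
	if (current_entry)
		current_entry->usage.blks_read++;
}

/* Once a child can no longer change, fold its totals into the parent. */
static void
sketch_finalize(EntrySketch *parent, const EntrySketch *child)
{
	parent->usage.blks_hit += child->usage.blks_hit;
	parent->usage.blks_read += child->usage.blks_read;
}

/* One query-level entry and one node entry that is entered three times. */
void
sketch_run_demo(UsageSketch *query_out, UsageSketch *scan_out)
{
	EntrySketch query = {0};
	EntrySketch scan = {0};

	sketch_push(&query);
	sketch_count_read();		/* attributed to the query entry only */
	for (int i = 0; i < 3; i++)
	{
		sketch_push(&scan);
		sketch_count_hit();		/* attributed to the node entry */
		sketch_pop(&scan);
	}
	sketch_finalize(&query, &scan);
	sketch_pop(&query);

	*query_out = query.usage;
	*scan_out = scan.usage;
}
```

In this sketch the node's per-cycle cost is a pointer swap on entry and exit, which is the property the text contrasts with computing a diff of global counters at the end of every node cycle.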
To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct and its
associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.
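As a rough illustration of why the abort handling matters, the sketch below uses plain setjmp/longjmp as a stand-in for PG_TRY/PG_FINALLY and shows the stack pointer being restored on the error path. All names (EntrySketch, failing_work, sketch_guarded_section) are hypothetical, and unlike PG_FINALLY this sketch does not re-throw the error after cleanup.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stack entry, reduced to a single counter. */
typedef struct EntrySketch
{
	long		blks_hit;
	struct EntrySketch *prev;
} EntrySketch;

static EntrySketch *current_entry = NULL;
static jmp_buf abort_target;

/* Records some activity, then simulates an error raised mid-section. */
static void
failing_work(void)
{
	current_entry->blks_hit++;
	longjmp(abort_target, 1);
}

/*
 * Push an entry, run work that aborts, and restore the previous entry on
 * both the normal and the abort path.
 */
void
sketch_guarded_section(EntrySketch *e)
{
	/* volatile so the value is well defined after longjmp */
	EntrySketch *volatile saved = current_entry;

	e->prev = current_entry;
	current_entry = e;

	if (setjmp(abort_target) == 0)
		failing_work();

	/* Cleanup reached either way; without it the stack would dangle on e. */
	assert(current_entry == e);
	current_entry = saved;
}
```

If the cleanup step were skipped, current_entry would keep pointing at an entry whose owner has already gone away, so later activity would be written into stale memory; that is the inconsistent state the text asks callers to rule out.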
This also drops the global pgBufferUsage; callers interested in
measuring buffer activity should instead use InstrStart/InstrStop.
The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and heap_page_prune_and_freeze for
measuring FPIs.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
contrib/auto_explain/auto_explain.c | 16 +-
.../pg_stat_statements/pg_stat_statements.c | 87 +---
src/backend/access/brin/brin.c | 10 +-
src/backend/access/gin/gininsert.c | 10 +-
src/backend/access/heap/vacuumlazy.c | 15 +-
src/backend/access/nbtree/nbtsort.c | 10 +-
src/backend/commands/analyze.c | 31 +-
src/backend/commands/explain.c | 26 +-
src/backend/commands/explain_dr.c | 31 +-
src/backend/commands/prepare.c | 27 +-
src/backend/commands/vacuumparallel.c | 10 +-
src/backend/executor/execMain.c | 66 ++-
src/backend/executor/execParallel.c | 8 +-
src/backend/executor/execProcnode.c | 84 +++-
src/backend/executor/instrument.c | 389 +++++++++++++-----
src/backend/storage/buffer/bufmgr.c | 6 +-
src/backend/utils/activity/pgstat_io.c | 6 +-
src/include/executor/execdesc.h | 4 +-
src/include/executor/executor.h | 2 +
src/include/executor/instrument.h | 179 +++++++-
src/include/utils/resowner.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
22 files changed, 720 insertions(+), 300 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 39bf2543b70..4be81489ff4 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -305,19 +305,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
if (auto_explain_enabled())
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -382,7 +372,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
/* Log plan if duration is exceeded. */
- msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
+ msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total);
if (msec >= auto_explain_log_min_duration)
{
ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 388b068ccec..8448f9c13fa 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -909,22 +909,11 @@ pgss_planner(Query *parse,
&& pgss_track_planning && query_string
&& parse->queryId != INT64CONST(0))
{
- instr_time start;
- instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
-
- /*
- * Similarly the planner could write some WAL records in some cases
- * (e.g. setting a hint bit with those being WAL-logged)
- */
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ /* Track time and buffer/WAL usage as the planner can access them. */
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -938,30 +927,20 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
parse->stmt_len,
PGSS_PLAN,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
0,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
@@ -1013,19 +992,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
*/
if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -1087,10 +1056,10 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
queryDesc->plannedstmt->stmt_location,
queryDesc->plannedstmt->stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
+ INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total),
queryDesc->estate->es_total_processed,
- &queryDesc->totaltime->bufusage,
- &queryDesc->totaltime->walusage,
+ &queryDesc->totaltime->instr.bufusage,
+ &queryDesc->totaltime->instr.walusage,
queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
NULL,
queryDesc->estate->es_parallel_workers_to_launch,
@@ -1154,17 +1123,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
!IsA(parsetree, ExecuteStmt) &&
!IsA(parsetree, PrepareStmt))
{
- instr_time start;
- instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -1180,6 +1143,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
@@ -1194,9 +1158,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
* former value, which'd otherwise be a good idea.
*/
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
/*
* Track the total number of rows retrieved or affected by the utility
* statements of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED
@@ -1208,23 +1169,15 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
saved_stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
rows,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 146ee97a47d..5ab571b29fa 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2435,8 +2435,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2887,6 +2887,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2936,7 +2937,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2951,7 +2952,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index c7e38dbe193..bb91cc600eb 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -984,8 +984,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2110,6 +2110,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2178,7 +2179,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2193,7 +2194,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 82c5b28e0ad..b4cbd0e682c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -641,8 +641,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ QueryInstrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -658,6 +657,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -983,14 +984,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ instr = InstrQueryStopFinalize(instr);
+
if (verbose || params.log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_vacuum_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -999,12 +1000,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
+ BufferUsage bufferusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
total_blks_hit = bufferusage.shared_blks_hit +
bufferusage.local_blks_hit;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 69ef1527e06..dfe4fd9459c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1465,8 +1465,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -1752,6 +1752,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1827,7 +1828,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1837,7 +1838,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 53adac9139b..38f8b379fa4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -308,9 +308,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ QueryInstrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -361,6 +359,9 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -741,12 +742,13 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ instr = InstrQueryStopFinalize(instr);
+
if (verbose || params.log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_analyze_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -754,18 +756,15 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
-
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ BufferUsage bufusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
+
+ total_blks_hit = bufusage.shared_blks_hit +
+ bufusage.local_blks_hit;
+ total_blks_read = bufusage.shared_blks_read +
+ bufusage.local_blks_read;
+ total_blks_dirtied = bufusage.shared_blks_dirtied +
+ bufusage.local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 389181b8d9b..aa76f68bd10 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -322,14 +322,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ QueryInstrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instr = InstrQueryAlloc(INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+ else
+ instr = InstrQueryAlloc(INSTRUMENT_TIMER);
+
if (es->memory)
{
/*
@@ -346,15 +348,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrQueryStart(instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ instr = InstrQueryStopFinalize(instr);
if (es->memory)
{
@@ -362,16 +361,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 3c96061cf32..6868d8972ac 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,20 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ QueryInstrumentation *instr = NULL;
/* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ if (myState->es->timing || myState->es->buffers)
+ {
+ InstrumentOption instrument_options = 0;
+
+ if (myState->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (myState->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ instr = InstrQueryAlloc(instrument_options);
+ InstrQueryStart(instr);
+ }
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +191,16 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
+ if (myState->es->timing || myState->es->buffers)
+ instr = InstrQueryStopFinalize(instr);
+
/* Update timing data */
if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
+ INSTR_TIME_ADD(myState->metrics.timeSpent, instr->instr.total);
/* Update buffer metrics */
if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ BufferUsageAdd(&myState->metrics.bufferUsage, &instr->instr.bufusage);
return true;
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 5b86a727587..d81f6b30e9c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -578,13 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ QueryInstrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ int instrument_options = INSTRUMENT_TIMER;
+
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -596,9 +599,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrQueryStart(instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -633,8 +634,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ instr = InstrQueryStopFinalize(instr);
if (es->memory)
{
@@ -642,13 +642,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -658,7 +651,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 279108ca89f..75074fe4efa 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -308,8 +308,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgBufferUsage or
- * pgWalUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -995,6 +995,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1084,7 +1085,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1092,7 +1093,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c0b174cfbc0..82253317e96 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -76,6 +76,7 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
/* decls for local routines only used within this module */
static void InitPlan(QueryDesc *queryDesc, int eflags);
static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecFinalizeTriggerInstrumentation(EState *estate);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(QueryDesc *queryDesc,
@@ -329,9 +330,28 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
- /* Allow instrumentation of Executor overall runtime */
+ /*
+ * Start up required top-level instrumentation stack for WAL/buffer
+ * tracking
+ */
+ if (!queryDesc->totaltime && (estate->es_instrument & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)))
+ queryDesc->totaltime = InstrQueryAlloc(estate->es_instrument);
+
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ {
+ /* Allow instrumentation of Executor overall runtime */
+ InstrQueryStart(queryDesc->totaltime);
+
+ /*
+ * Remember all node entries for abort recovery. We do this once here
+ * after the first call to InstrQueryStart has pushed the parent
+ * entry.
+ */
+ if ((estate->es_instrument & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) &&
+ !queryDesc->already_executed)
+ ExecRememberNodeInstrumentation(queryDesc->planstate,
+ queryDesc->totaltime);
+ }
/*
* extract information from the query descriptor and the query feature.
@@ -383,7 +403,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -433,7 +453,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ InstrQueryStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -442,8 +462,26 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
+ /*
+ * Accumulate per-node and trigger statistics to their respective parent
+ * instrumentation stacks.
+ *
+ * We skip this in parallel workers because their per-node stats are
+ * reported individually via ExecParallelReportInstrumentation, and the
+ * leader's own ExecFinalizeNodeInstrumentation handles propagation. If
+ * we accumulated here, the leader would double-count: worker parent nodes
+ * would already include their children's stats, and then the leader's
+ * accumulation would add the children again.
+ */
+ if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ queryDesc->totaltime = InstrQueryStopFinalize(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
@@ -1484,6 +1522,24 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
return resultRelInfo->ri_ancestorResultRels;
}
+static void
+ExecFinalizeTriggerInstrumentation(EState *estate)
+{
+ List *rels = NIL;
+
+ rels = list_concat(rels, estate->es_tuple_routing_result_relations);
+ rels = list_concat(rels, estate->es_opened_result_relations);
+ rels = list_concat(rels, estate->es_trig_target_relations);
+
+ foreach_node(ResultRelInfo, rInfo, rels)
+ {
+ TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
+
+ if (ti && (ti->instr.need_bufusage || ti->instr.need_walusage))
+ InstrAccum(instr_stack.current, &ti->instr);
+ }
+}
+
/* ----------------------------------------------------------------
* ExecPostprocessPlan
*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index c153d5c1c3b..e6ad86cb887 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -694,7 +694,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
- * looking at pgBufferUsage, so do it unconditionally.
+ * looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1456,6 +1456,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1516,7 +1517,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1532,7 +1533,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 1846661b503..c788b5b00f9 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -122,6 +122,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
+static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -787,10 +789,10 @@ ExecShutdownNode_walker(PlanState *node, void *context)
* at least once already. We don't expect much CPU consumption during
* node shutdown, but in the case of Gather or Gather Merge, we may shut
* down workers at this stage. If so, their buffer usage will get
- * propagated into pgBufferUsage at this point, and we want to make sure
- * that it gets associated with the Gather node. We skip this if the node
- * has never been executed, so as to avoid incorrectly making it appear
- * that it has.
+ * propagated into the current instrumentation stack entry at this point,
+ * and we want to make sure that it gets associated with the Gather node.
+ * We skip this if the node has never been executed, so as to avoid
+ * incorrectly making it appear that it has.
*/
if (node->instrument && node->instrument->running)
InstrStartNode(node->instrument);
@@ -828,6 +830,80 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecRememberNodeInstrumentation
+ *
+ * Register all per-node instrumentation entries as unfinalized children of
+ * the executor's instrumentation. This is needed for abort recovery: if the
+ * executor aborts, we need to walk each per-node entry to recover buffer/WAL
+ * data from nodes that never got finalized, so that aggregate totals remain
+ * accurate for anyone consuming them.
+ */
+void
+ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent)
+{
+ (void) ExecRememberNodeInstrumentation_walker(node, parent);
+}
+
+static bool
+ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ QueryInstrumentation *parent = (QueryInstrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ if (node->instrument)
+ InstrQueryRememberNode(parent, node->instrument);
+
+ return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
+}
+
+/*
+ * ExecFinalizeNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation).
+ *
+ * This must run after the cleanup done by ExecShutdownNode, and not rely on any
+ * resources cleaned up by it. We also expect shutdown actions to have occurred,
+ * e.g. parallel worker instrumentation to have been added to the leader.
+ */
+void
+ExecFinalizeNodeInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeNodeInstrumentation_walker(node, instr_stack.current);
+}
+
+static bool
+ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ Instrumentation *parent = (Instrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ /*
+ * Recurse into children first (bottom-up accumulation), passing our
+ * instrumentation as the parent context. This ensures children can
+ * accumulate to us even if they were never executed by the leader (e.g.
+ * nodes beneath Gather that only workers ran).
+ */
+ planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
+ node->instrument ? &node->instrument->instr : parent);
+
+ if (!node->instrument)
+ return false;
+
+ node->instrument = InstrFinalizeNode(node->instrument, parent);
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6a4a08ebb0c..1afa5e94960 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,25 +16,31 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
-BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
-static WalUsage save_pgWalUsage;
+Instrumentation instr_top;
+InstrStackState instr_stack = {0, 0, NULL, &instr_top};
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
-static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-
-
-/* General purpose instrumentation handling */
-Instrumentation *
-InstrAlloc(int instrument_options)
+void
+InstrStackGrow(void)
{
- Instrumentation *instr = palloc0(sizeof(Instrumentation));
- InstrInitOptions(instr, instrument_options);
- return instr;
+ if (instr_stack.entries == NULL)
+ {
+ instr_stack.stack_space = 10; /* Allocate sufficient initial space
+ * for typical activity */
+ instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
+ sizeof(Instrumentation *) * instr_stack.stack_space);
+ }
+ else
+ {
+ instr_stack.stack_space *= 2;
+ instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, instr_stack.stack_space);
+ }
}
+/* General purpose instrumentation handling */
void
InstrInitOptions(Instrumentation *instr, int instrument_options)
{
@@ -54,38 +60,249 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_bufusage || instr->need_walusage)
+ InstrPushStack(instr);
+}
+
+static void
+InstrStopTimer(Instrumentation *instr)
+{
+ instr_time endtime;
+
+ /* need_timer is checked by the caller; verify the timer was started */
+ if (INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStop called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ INSTR_TIME_SET_ZERO(instr->starttime);
}
void
InstrStop(Instrumentation *instr)
{
- instr_time endtime;
+ if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ if (instr->need_bufusage || instr->need_walusage)
+ InstrPopStack(instr);
+}
+
+/*
+ * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
+ *
+ * Note that this intentionally allows passing a stack entry that is not the
+ * current top, as can happen with PG_FINALLY or resource owners, which don't
+ * have a guaranteed cleanup order.
+ *
+ * We are careful here to achieve two goals:
+ *
+ * 1) Reset the stack to the parent of whichever of the released stack entries
+ * has the lowest index
+ * 2) Accumulate all instrumentation to the currently active instrumentation,
+ * so that callers get a complete picture of activity, even after an abort
+ */
+void
+InstrStopFinalize(Instrumentation *instr)
+{
+ int idx = -1;
+
+ for (int i = instr_stack.stack_size - 1; i >= 0; i--)
+ {
+ if (instr_stack.entries[i] == instr)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ if (idx >= 0)
+ {
+ while (instr_stack.stack_size > idx + 1)
+ instr_stack.stack_size--;
+
+ InstrPopStack(instr);
+ }
- /* let's update the time only if the timer was requested */
if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ InstrAccum(instr_stack.current, instr);
+}
+
+
+/* Query instrumentation handling */
+
+/*
+ * Use ResourceOwner mechanism to correctly reset instr_stack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+ .release_priority = RELEASE_PRIO_INSTRUMENTATION,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ QueryInstrumentation *qinstr = (QueryInstrumentation *) DatumGetPointer(res);
+ slist_mutable_iter iter;
+
+ /* Accumulate data from all unfinalized child node entries. */
+ slist_foreach_modify(iter, &qinstr->unfinalized_children)
{
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ NodeInstrumentation *child = slist_container(NodeInstrumentation, unfinalized_node, iter.cur);
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+ InstrAccum(&qinstr->instr, &child->instr);
- INSTR_TIME_SET_ZERO(instr->starttime);
+ /*
+ * Free NodeInstrumentation now, since InstrFinalizeNode won't be
+ * called
+ */
+ pfree(child);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
+ /* Ensure the stack is reset as expected, and we accumulate to the parent */
+ InstrStopFinalize(&qinstr->instr);
+
+ /* Free QueryInstrumentation now, since InstrStop won't be called */
+ pfree(qinstr);
+}
+
+/*
+ * Allocate in TopMemoryContext so that the Instrumentation survives
+ * transaction abort — ResourceOwner release needs to access it.
+ */
+QueryInstrumentation *
+InstrQueryAlloc(int instrument_options)
+{
+ QueryInstrumentation *instr = MemoryContextAllocZero(TopMemoryContext, sizeof(QueryInstrumentation));
+
+ InstrInitOptions(&instr->instr, instrument_options);
+ return instr;
+}
+
+void
+InstrQueryStart(QueryInstrumentation *qinstr)
+{
+ InstrStart(&qinstr->instr);
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ if (qinstr->instr.need_bufusage || qinstr->instr.need_walusage)
+ {
+ Assert(CurrentResourceOwner != NULL);
+ qinstr->owner = CurrentResourceOwner;
+
+ ResourceOwnerEnlarge(qinstr->owner);
+ ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
+ }
+}
+
+void
+InstrQueryStop(QueryInstrumentation *qinstr)
+{
+ InstrStop(&qinstr->instr);
+
+ if (qinstr->instr.need_bufusage || qinstr->instr.need_walusage)
+ {
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+ }
+}
+
+QueryInstrumentation *
+InstrQueryStopFinalize(QueryInstrumentation *qinstr)
+{
+ QueryInstrumentation *copy;
+
+ InstrStopFinalize(&qinstr->instr);
+
+ if (qinstr->instr.need_bufusage || qinstr->instr.need_walusage)
+ {
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+ }
+
+ /*
+ * Copy to the current memory context so the caller doesn't need to
+ * explicitly free the TopMemoryContext allocation.
+ */
+ copy = palloc(sizeof(QueryInstrumentation));
+ memcpy(copy, qinstr, sizeof(QueryInstrumentation));
+ pfree(qinstr);
+ return copy;
+}
+
+/*
+ * Register a child NodeInstrumentation entry for abort processing.
+ *
+ * On abort, ResOwnerReleaseInstrumentation will walk the parent's list to
+ * recover buffer/WAL data from entries that were never finalized, in order for
+ * aggregate totals to be accurate despite the query erroring out.
+ */
+void
+InstrQueryRememberNode(QueryInstrumentation *parent, NodeInstrumentation *child)
+{
+ if (child->instr.need_bufusage || child->instr.need_walusage)
+ slist_push_head(&parent->unfinalized_children, &child->unfinalized_node);
+}
+
+/* start instrumentation during parallel executor startup */
+QueryInstrumentation *
+InstrStartParallelQuery(void)
+{
+ QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrQueryStart(qinstr);
+ return qinstr;
+}
+
+/* report usage after parallel executor shutdown */
+void
+InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+{
+ qinstr = InstrQueryStopFinalize(qinstr);
+ memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+}
+
+/*
+ * Accumulate work done by parallel workers in the leader's stats.
+ *
+ * Note that what gets added here effectively depends on whether per-node
+ * instrumentation is active. If it's active the parallel worker intentionally
+ * skips ExecFinalizeNodeInstrumentation on executor shutdown, because it would
+ * cause double counting. Instead, this only accumulates any extra activity
+ * outside of nodes.
+ *
+ * Otherwise this is responsible for making sure that the complete query
+ * activity is accumulated.
+ */
+void
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+{
+ BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
+ WalUsageAdd(&instr_stack.current->walusage, walusage);
+
+ WalUsageAdd(&pgWalUsage, walusage);
}
/* Node instrumentation handling */
@@ -94,7 +311,13 @@ InstrStop(Instrumentation *instr)
NodeInstrumentation *
InstrAllocNode(int instrument_options, bool async_mode)
{
- NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+ /*
+ * We can utilize TopTransactionContext instead of TopMemoryContext here
+ * because nodes don't get used for utility commands that restart
+ * transactions, which would require a context that survives longer
+ * (EXPLAIN ANALYZE is fine).
+ */
+ NodeInstrumentation *instr = MemoryContextAlloc(TopTransactionContext, sizeof(NodeInstrumentation));
InstrInitNode(instr, instrument_options);
instr->async_mode = async_mode;
@@ -117,6 +340,7 @@ InstrStartNode(NodeInstrumentation *instr)
InstrStart(&instr->instr);
}
+
/* Exit from a plan node */
void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
@@ -146,14 +370,12 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->instr.starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->instr.need_bufusage)
- BufferUsageAccumDiff(&instr->instr.bufusage,
- &pgBufferUsage, &instr->instr.bufusage_start);
-
- if (instr->instr.need_walusage)
- WalUsageAccumDiff(&instr->instr.walusage,
- &pgWalUsage, &instr->instr.walusage_start);
+ /*
+ * Only pop the stack, accumulation runs in
+ * ExecFinalizeNodeInstrumentation
+ */
+ if (instr->instr.need_bufusage || instr->instr.need_walusage)
+ InstrPopStack(&instr->instr);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -172,6 +394,22 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
}
}
+/* Add per-node instrumentation to the parent and move into per-query memory context */
+NodeInstrumentation *
+InstrFinalizeNode(NodeInstrumentation *instr, Instrumentation *parent)
+{
+ NodeInstrumentation *dst = palloc(sizeof(NodeInstrumentation));
+
+ memcpy(dst, instr, sizeof(NodeInstrumentation));
+ pfree(instr);
+
+ /* Accumulate node's buffer/WAL usage to the parent */
+ if (dst->instr.need_bufusage || dst->instr.need_walusage)
+ InstrAccum(parent, &dst->instr);
+
+ return dst;
+}
+
/* Update tuple count */
void
InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
@@ -188,8 +426,8 @@ InstrEndLoop(NodeInstrumentation *instr)
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrEndLoop called on running node");
+ /* Ensure InstrStopNode was called */
+ Assert(INSTR_TIME_IS_ZERO(instr->instr.starttime));
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
@@ -259,38 +497,27 @@ InstrStartTrigger(TriggerInstrumentation *tginstr)
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
+ /*
+ * This trigger may be called again, so we don't finalize instrumentation
+ * here. Accumulation to the parent happens at ExecutorFinish through
+ * ExecFinalizeTriggerInstrumentation.
+ */
InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
-/* note current values during parallel executor startup */
-void
-InstrStartParallelQuery(void)
-{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
-}
-
-/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccum(Instrumentation *dst, Instrumentation *add)
{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
-}
+ Assert(dst != NULL);
+ Assert(add != NULL);
-/* accumulate work done by workers in leader's stats */
-void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- BufferUsageAdd(&pgBufferUsage, bufusage);
- WalUsageAdd(&pgWalUsage, walusage);
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -311,39 +538,9 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
+/* dst += add */
void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
-/* helper functions for WAL usage accumulation */
-static void
-WalUsageAdd(WalUsage *dst, WalUsage *add)
+WalUsageAdd(WalUsage *dst, const WalUsage *add)
{
dst->wal_bytes += add->wal_bytes;
dst->wal_records += add->wal_records;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 59b098ba0a0..1ff50aecc10 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1261,9 +1261,9 @@ PinBufferForBlock(Relation rel,
if (rel)
{
/*
- * While pgBufferUsage's "read" counter isn't bumped unless we reach
- * WaitReadBuffers() (so, not for hits, and not for buffers that are
- * zeroed instead), the per-relation stats always count them.
+ * While the current buffer usage "read" counter isn't bumped unless
+ * we reach WaitReadBuffers() (so, not for hits, and not for buffers
+ * that are zeroed instead), the per-relation stats always count them.
*/
pgstat_count_buffer_read(rel);
if (*foundPtr)
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 9e7a88ec0d0..60400f0c81f 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -114,9 +114,9 @@ pgstat_prepare_io_time(bool track_io_guc)
* pg_stat_database only counts block read and write times, these are done for
* IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
*
- * pgBufferUsage is used for EXPLAIN. pgBufferUsage has write and read stats
- * for shared, local and temporary blocks. pg_stat_io does not track the
- * activity of temporary blocks, so these are ignored here.
+ * Executor instrumentation is used for EXPLAIN. Buffer usage tracked there has
+ * write and read stats for shared, local and temporary blocks. pg_stat_io
+ * does not track the activity of temporary blocks, so these are ignored here.
*/
void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index d3a57242844..340029a2034 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -51,8 +51,8 @@ typedef struct QueryDesc
/* This field is set by ExecutePlan */
bool already_executed; /* true if previously executed */
- /* This is always set NULL by the core system, but plugins can change it */
- struct Instrumentation *totaltime; /* total time spent in ExecutorRun */
+ /* This field is set by ExecutorRun, or plugins */
+ struct QueryInstrumentation *totaltime; /* total time spent in ExecutorRun */
} QueryDesc;
/* in pquery.c */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d46ba59895d..c22199c6869 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -300,6 +300,8 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent);
+extern void ExecFinalizeNodeInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1139be8333e..cc33b32af1e 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -13,6 +13,7 @@
#ifndef INSTRUMENT_H
#define INSTRUMENT_H
+#include "lib/ilist.h"
#include "portability/instr_time.h"
@@ -68,10 +69,22 @@ typedef enum InstrumentOption
} InstrumentOption;
/*
- * General purpose instrumentation that can capture time and WAL/buffer usage
+ * Instrumentation base class for capturing time and WAL/buffer usage
*
- * Initialized through InstrAlloc, followed by one or more calls to a pair of
- * InstrStart/InstrStop (activity is measured inbetween).
+ * If used directly:
+ * - Allocate on the stack and zero initialize the struct
+ * - Call InstrInitOptions to set instrumentation options
+ * - Call InstrStart before the activity you want to measure
+ * - Call InstrStop / InstrStopFinalize after the activity to capture totals
+ *
+ * InstrStart/InstrStop may be called multiple times. The last stop call must
+ * be to InstrStopFinalize to ensure parent stack entries get the accumulated
+ * totals. If there is risk of transaction aborts you must call
+ * InstrStopFinalize in a PG_TRY/PG_FINALLY block to avoid corrupting the
+ * instrumentation stack.
+ *
+ * In a query context use QueryInstrumentation instead, which handles aborts
+ * using the resource owner logic.
*/
typedef struct Instrumentation
{
@@ -81,16 +94,52 @@ typedef struct Instrumentation
bool need_walusage; /* true if we need WAL usage data */
/* Internal state keeping: */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
} Instrumentation;
+/*
+ * Query-related instrumentation tracking.
+ *
+ * Usage:
+ * - Allocate on the heap using InstrQueryAlloc (required for abort handling)
+ * - Call InstrQueryStart before the activity you want to measure
+ * - Call InstrQueryStop / InstrQueryStopFinalize afterwards to capture totals
+ *
+ * InstrQueryStart/InstrQueryStop may be called multiple times. The last stop
+ * call must be to InstrQueryStopFinalize to ensure parent stack entries get
+ * the accumulated totals.
+ *
+ * Uses the resource owner mechanism for handling aborts. As such, the caller
+ * *must* not exit the top-level transaction after having called
+ * InstrQueryStart without first calling InstrQueryStop. In the case of a
+ * transaction abort, logic equivalent to InstrQueryStop will be called
+ * automatically.
+ */
+struct ResourceOwnerData;
+typedef struct QueryInstrumentation
+{
+ Instrumentation instr;
+
+ /* Resource owner used for cleanup for aborts between InstrStart/InstrStop */
+ struct ResourceOwnerData *owner;
+
+ /*
+ * NodeInstrumentation child entries that need to be cleaned up on abort,
+ * since they are not registered as a resource owner themselves.
+ */
+ slist_head unfinalized_children; /* head of unfinalized children list */
+} QueryInstrumentation;
+
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Relies on an outer QueryInstrumentation having been set up to handle the
+ * stack used for WAL/buffer usage statistics, and relies on it for managing
+ * aborts. Solely intended for the executor and anyone reporting about its
+ * activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -109,8 +158,15 @@ typedef struct NodeInstrumentation
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
+
+ /* Abort handling */
+ slist_node unfinalized_node; /* node in parent's unfinalized list */
} NodeInstrumentation;
+/*
+ * Care must be taken with any pointers contained within this struct, as this
+ * gets copied across processes during parallel query execution.
+ */
typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
@@ -124,19 +180,110 @@ typedef struct TriggerInstrumentation
* was fired */
} TriggerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
+/*
+ * Dynamic array-based stack for tracking current WAL/buffer usage context.
+ *
+ * When the stack is empty, 'current' points to instr_top which accumulates
+ * session-level totals.
+ */
+typedef struct InstrStackState
+{
+ int stack_space; /* allocated capacity of entries array */
+ int stack_size; /* current number of entries */
+
+ Instrumentation **entries; /* dynamic array of pointers */
+ Instrumentation *current; /* top of stack, or &instr_top when empty */
+} InstrStackState;
+
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int instrument_options);
+/*
+ * The top instrumentation represents a running total of the current backend
+ * WAL/buffer usage information. This will not be updated immediately, but
+ * rather when the current stack entry gets accumulated which typically happens
+ * at query end.
+ *
+ * Care must be taken when utilizing this in the parallel worker context:
+ * Parallel workers will report back their instrumentation to the caller,
+ * and this gets added to the caller's stack. If this were to be used in the
+ * shared memory stats infrastructure it would need to be skipped on parallel
+ * workers to avoid double counting.
+ */
+extern PGDLLIMPORT Instrumentation instr_top;
+
+/*
+ * The instrumentation stack state. The 'current' field points to the
+ * currently active stack entry that is getting updated as activity happens,
+ * and will be accumulated to parent stacks when it gets finalized by
+ * InstrStopFinalize (non-executor use), ExecFinalizeNodeInstrumentation
+ * (executor finish) or ResOwnerReleaseInstrumentation on abort.
+ */
+extern PGDLLIMPORT InstrStackState instr_stack;
+
+extern void InstrStackGrow(void);
+
+/*
+ * Pushes the stack so that all WAL/buffer usage updates go to the passed in
+ * instrumentation entry.
+ *
+ * Any caller using this directly must manage the passed in entry and call
+ * InstrPopStack on its own again, typically by using a PG_FINALLY block to
+ * ensure the stack gets reset via InstrPopStack on abort. Use InstrQueryStart
+ * instead when you want automatic handling of abort cases using the resource
+ * owner infrastructure.
+ */
+static inline void
+InstrPushStack(Instrumentation *instr)
+{
+ if (unlikely(instr_stack.stack_size == instr_stack.stack_space))
+ InstrStackGrow();
+
+ instr_stack.entries[instr_stack.stack_size++] = instr;
+ instr_stack.current = instr;
+}
+
+/*
+ * Pops the stack entry back to the previous one that was effective at
+ * InstrPushStack.
+ *
+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, call InstrStopFinalize instead, which can skip intermediate stack
+ * entries, or use InstrStart/InstrStop.
+ */
+static inline void
+InstrPopStack(Instrumentation *instr)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.entries[instr_stack.stack_size - 1] == instr);
+ instr_stack.stack_size--;
+ instr_stack.current = instr_stack.stack_size > 0
+ ? instr_stack.entries[instr_stack.stack_size - 1]
+ : &instr_top;
+}
+
extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
extern void InstrStart(Instrumentation *instr);
extern void InstrStop(Instrumentation *instr);
+extern void InstrStopFinalize(Instrumentation *instr);
+extern void InstrAccum(Instrumentation *dst, Instrumentation *add);
+
+extern QueryInstrumentation *InstrQueryAlloc(int instrument_options);
+extern void InstrQueryStart(QueryInstrumentation *instr);
+extern void InstrQueryStop(QueryInstrumentation *instr);
+extern QueryInstrumentation *InstrQueryStopFinalize(QueryInstrumentation *instr);
+extern void InstrQueryRememberNode(QueryInstrumentation *parent, NodeInstrumentation *instr);
+
+pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern NodeInstrumentation *InstrAllocNode(int instrument_options,
bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
extern void InstrStartNode(NodeInstrumentation *instr);
extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern NodeInstrumentation *InstrFinalizeNode(NodeInstrumentation *instr, Instrumentation *parent);
extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
@@ -145,31 +292,31 @@ extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+extern void WalUsageAdd(WalUsage *dst, const WalUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
+ instr_stack.current->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += val; \
+ instr_stack.current->bufusage.fld += val; \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ INSTR_TIME_ADD(instr_stack.current->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ INSTR_TIME_ACCUM_DIFF(instr_stack.current->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ instr_stack.current->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += val; \
+ instr_stack.current->walusage.fld += val; \
} while(0)
#endif /* INSTRUMENT_H */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index eb6033b4fdb..5463bc921f0 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -75,6 +75,7 @@ typedef uint32 ResourceReleasePriority;
#define RELEASE_PRIO_SNAPSHOT_REFS 500
#define RELEASE_PRIO_FILES 600
#define RELEASE_PRIO_WAITEVENTSETS 700
+#define RELEASE_PRIO_INSTRUMENTATION 800
/* 0 is considered invalid */
#define RELEASE_PRIO_FIRST 1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7dc0073ab68..5580a080210 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1320,6 +1320,7 @@ InjectionPointSharedState
InjectionPointsCtl
InlineCodeBlock
InsertStmt
+InstrStackState
Instrumentation
Int128AggState
Int8TransTypeData
@@ -2428,6 +2429,7 @@ QueryCompletion
QueryDesc
QueryEnvironment
QueryInfo
+QueryInstrumentation
QueryItem
QueryItemType
QueryMode
--
2.47.1
[application/octet-stream] v7-0006-instrumentation-Optimize-ExecProcNodeInstr-instru.patch (11.6K, 7-v7-0006-instrumentation-Optimize-ExecProcNodeInstr-instru.patch)
From ddc08b652cd95bdd2bd10c4fb3fca3d292e2c625 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 17:52:24 -0800
Subject: [PATCH v7 6/8] instrumentation: Optimize ExecProcNodeInstr
instructions by inlining
For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.
Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). These
conditionals are checked for every tuple emitted, which adds up and causes
the compiler to generate a non-optimal set of instructions.
Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. For a stress
test with a large COUNT(*) that does many ExecProcNode calls, this reduces
the overhead of EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) from ~ 20% on
top of actual runtime to ~ 3%. With BUFFERS ON, the overhead for the same
query goes from ~ 20% to ~ 10% on top of actual runtime.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/executor/execProcnode.c | 22 +--
src/backend/executor/instrument.c | 224 +++++++++++++++++++++-------
src/include/executor/instrument.h | 5 +
3 files changed, 174 insertions(+), 77 deletions(-)
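To illustrate the idea outside of the executor, here is a minimal
standalone sketch of picking a specialized wrapper once and then calling
it through a function pointer. All names (ToyNode, ToyInstr,
toy_setup_exec, ...) are made up for this note and are not part of the
patch or of PostgreSQL; the counters merely stand in for the real timer
and instrumentation stack work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy types, not PostgreSQL's PlanState/NodeInstrumentation. */
typedef struct ToyNode ToyNode;
typedef int (*ToyExecFn) (ToyNode *node);   /* 1 = produced a tuple, 0 = done */

typedef struct ToyInstr
{
    bool        need_timer;
    bool        need_buf;
    long        timer_calls;    /* stands in for timer start/stop work */
    long        stack_pushes;   /* stands in for stack push/pop work */
    double      tuplecount;
} ToyInstr;

struct ToyNode
{
    ToyExecFn   exec;           /* what callers invoke */
    ToyExecFn   exec_real;      /* the uninstrumented node function */
    ToyInstr    instr;
    int         remaining;      /* tuples the toy scan still has to emit */
};

/* Toy "real" node function: emits `remaining` tuples, then reports done. */
int
toy_scan(ToyNode *node)
{
    if (node->remaining <= 0)
        return 0;
    node->remaining--;
    return 1;
}

/* One wrapper per option combination; none of them tests a flag. */
static int
toy_exec_full(ToyNode *node)
{
    int         got;

    node->instr.stack_pushes++;
    node->instr.timer_calls++;
    got = node->exec_real(node);
    node->instr.tuplecount += got;
    return got;
}

static int
toy_exec_buf_only(ToyNode *node)
{
    int         got;

    node->instr.stack_pushes++;
    got = node->exec_real(node);
    node->instr.tuplecount += got;
    return got;
}

static int
toy_exec_timer_only(ToyNode *node)
{
    int         got;

    node->instr.timer_calls++;
    got = node->exec_real(node);
    node->instr.tuplecount += got;
    return got;
}

static int
toy_exec_rows_only(ToyNode *node)
{
    int         got = node->exec_real(node);

    node->instr.tuplecount += got;
    return got;
}

/* The flags are inspected exactly once, when the pointer is chosen. */
static ToyExecFn
toy_setup_exec(const ToyInstr *instr)
{
    if (instr->need_timer && instr->need_buf)
        return toy_exec_full;
    else if (instr->need_buf)
        return toy_exec_buf_only;
    else if (instr->need_timer)
        return toy_exec_timer_only;
    else
        return toy_exec_rows_only;
}

/* Drain the node through the selected wrapper; returns the tuple count. */
long
toy_run(ToyNode *node)
{
    long        n = 0;

    assert(node->exec_real != NULL);
    node->exec = toy_setup_exec(&node->instr);
    while (node->exec(node))
        n++;
    return n;
}
```

In this toy model the per-call cost of an option that is switched off is
zero, since the selected wrapper simply does not contain that code. Note
that the wrapper is also invoked for the final call that returns no
tuple, so the stand-in counters advance once more than tuplecount.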
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index c788b5b00f9..6a74ca516ae 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -120,7 +120,6 @@
#include "nodes/nodeFuncs.h"
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
-static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
@@ -464,7 +463,7 @@ ExecProcNodeFirst(PlanState *node)
* have ExecProcNode() directly call the relevant function from now on.
*/
if (node->instrument)
- node->ExecProcNode = ExecProcNodeInstr;
+ node->ExecProcNode = InstrNodeSetupExecProcNode(node->instrument);
else
node->ExecProcNode = node->ExecProcNodeReal;
@@ -472,25 +471,6 @@ ExecProcNodeFirst(PlanState *node)
}
-/*
- * ExecProcNode wrapper that performs instrumentation calls. By keeping
- * this a separate function, we avoid overhead in the normal case where
- * no instrumentation is wanted.
- */
-static TupleTableSlot *
-ExecProcNodeInstr(PlanState *node)
-{
- TupleTableSlot *result;
-
- InstrStartNode(node->instrument);
-
- result = node->ExecProcNodeReal(node);
-
- InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
- return result;
-}
-
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 1afa5e94960..602facec401 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -49,29 +49,20 @@ InstrInitOptions(Instrumentation *instr, int instrument_options)
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-void
-InstrStart(Instrumentation *instr)
+static inline void
+InstrStartTimer(Instrumentation *instr)
{
- if (instr->need_timer)
- {
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStart called twice in a row");
- else
- INSTR_TIME_SET_CURRENT(instr->starttime);
- }
+ Assert(INSTR_TIME_IS_ZERO(instr->starttime));
- if (instr->need_bufusage || instr->need_walusage)
- InstrPushStack(instr);
+ INSTR_TIME_SET_CURRENT(instr->starttime);
}
-static void
+static inline void
InstrStopTimer(Instrumentation *instr)
{
instr_time endtime;
- /* let's update the time only if the timer was requested */
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ Assert(!INSTR_TIME_IS_ZERO(instr->starttime));
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
@@ -79,6 +70,16 @@ InstrStopTimer(Instrumentation *instr)
INSTR_TIME_SET_ZERO(instr->starttime);
}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer)
+ InstrStartTimer(instr);
+
+ if (instr->need_bufusage || instr->need_walusage)
+ InstrPushStack(instr);
+}
+
void
InstrStop(Instrumentation *instr)
{
@@ -333,65 +334,57 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options)
InstrInitOptions(&instr->instr, instrument_options);
}
-/* Entry to a plan node */
+/* Entry to a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
void
InstrStartNode(NodeInstrumentation *instr)
{
InstrStart(&instr->instr);
}
-
-/* Exit from a plan node */
-void
-InstrStopNode(NodeInstrumentation *instr, double nTuples)
+/*
+ * Updates the node instrumentation time counter.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+static inline void
+InstrStopNodeTimer(NodeInstrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
+ Assert(!INSTR_TIME_IS_ZERO(instr->instr.starttime));
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
/*
- * Update the time only if the timer was requested.
+ * Is this the first tuple of this cycle?
*
- * Note this is different from InstrStop because total is only updated in
- * InstrEndLoop. We need the separate counter variable because we need to
- * calculate start-up time for the first tuple in each cycle, and then
- * accumulate it together.
+ * In async mode, if the plan node hadn't emitted any tuples before, this
+ * might be the first tuple
*/
- if (instr->instr.need_timer)
- {
- if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrStopNode called without start");
-
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ if (!instr->running || (instr->async_mode && instr->tuplecount < 1.0))
+ instr->firsttuple = instr->counter;
+}
- INSTR_TIME_SET_ZERO(instr->instr.starttime);
- }
+/* Exit from a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ if (instr->instr.need_timer)
+ InstrStopNodeTimer(instr);
- /*
- * Only pop the stack, accumulation runs in
- * ExecFinalizeNodeInstrumentation
- */
+ /* Only pop the stack, accumulation runs in InstrFinalizeNode */
if (instr->instr.need_bufusage || instr->instr.need_walusage)
InstrPopStack(&instr->instr);
- /* Is this the first tuple of this cycle? */
- if (!instr->running)
- {
- instr->running = true;
- instr->firsttuple = instr->counter;
- }
- else
- {
- /*
- * In async mode, if the plan node hadn't emitted any tuples before,
- * this might be the first tuple
- */
- if (instr->async_mode && save_tuplecount < 1.0)
- instr->firsttuple = instr->counter;
- }
+ instr->running = true;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
}
/* Add per-node instrumentation to the parent and move into per-query memory context */
@@ -475,6 +468,125 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
}
+/*
+ * Specialized handling of instrumented ExecProcNode
+ *
+ * These functions are equivalent to running ExecProcNodeReal wrapped in
+ * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
+ * by checking the instrumentation options when the ExecProcNode pointer gets
+ * first set, and then using a special-purpose function for each. This results
+ * in a more optimized set of compiled instructions.
+ */
+
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
+
+/* Simplified pop: restore saved state instead of re-deriving from array */
+static inline void
+InstrPopStackTo(Instrumentation *prev)
+{
+ Assert(instr_stack.stack_size > 0);
+ instr_stack.stack_size--;
+ instr_stack.current = prev;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrFull(PlanState *node)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ InstrPushStack(&instr->instr);
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ InstrStopNodeTimer(instr);
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsBuffersWalOnly(PlanState *node)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ InstrPushStack(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsTimerOnly(PlanState *node)
+{
+ NodeInstrumentation *instr = node->instrument;
+ TupleTableSlot *result;
+
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ InstrStopNodeTimer(instr);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsOnly(PlanState *node)
+{
+ NodeInstrumentation *instr = node->instrument;
+ TupleTableSlot *result;
+
+ result = node->ExecProcNodeReal(node);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+/*
+ * Returns an ExecProcNode wrapper that performs instrumentation calls,
+ * tailored to the instrumentation options enabled for the node.
+ */
+ExecProcNodeMtd
+InstrNodeSetupExecProcNode(NodeInstrumentation *instr)
+{
+ bool need_timer = instr->instr.need_timer;
+ bool need_buf = (instr->instr.need_bufusage ||
+ instr->instr.need_walusage);
+
+ if (need_timer && need_buf)
+ return ExecProcNodeInstrFull;
+ else if (need_buf)
+ return ExecProcNodeInstrRowsBuffersWalOnly;
+ else if (need_timer)
+ return ExecProcNodeInstrRowsTimerOnly;
+ else
+ return ExecProcNodeInstrRowsOnly;
+}
+
/* Trigger instrumentation handling */
TriggerInstrumentation *
InstrAllocTrigger(int n, int instrument_options)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index cc33b32af1e..ac7f0d21c37 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -288,6 +288,11 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
+typedef struct TupleTableSlot TupleTableSlot;
+typedef struct PlanState PlanState;
+typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+extern ExecProcNodeMtd InstrNodeSetupExecProcNode(NodeInstrumentation *instr);
+
extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
--
2.47.1
[application/octet-stream] v7-0007-Index-scans-Show-table-buffer-accesses-separately.patch (16.8K, 8-v7-0007-Index-scans-Show-table-buffer-accesses-separately.patch)
From 3ed273fa607d47346b20eecce0e48f33500a297f Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:19 -0800
Subject: [PATCH v7 7/8] Index scans: Show table buffer accesses separately in
EXPLAIN ANALYZE
This sets up a separate instrumentation stack that is used whilst an
Index Scan accesses the table, for example because additional data is
needed.
EXPLAIN ANALYZE will now show "Table Buffers" that represent such activity.
The activity is also included in regular "Buffers" together with index
activity and that of any child nodes.
Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by:
Discussion:
---
doc/src/sgml/perform.sgml | 13 ++++--
doc/src/sgml/ref/explain.sgml | 1 +
src/backend/commands/explain.c | 55 +++++++++++++++-----------
src/backend/executor/execProcnode.c | 35 ++++++++++++++++
src/backend/executor/nodeIndexscan.c | 33 +++++++++++++++-
src/include/executor/instrument_node.h | 6 +++
src/include/nodes/execnodes.h | 7 ++++
7 files changed, 122 insertions(+), 28 deletions(-)
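As a rough standalone model of the accounting described above (not the
patch's code), the sketch below charges buffer hits to whichever entry is
current, pushes a separate entry around the table fetch, and later adds
the table entry into the node's own totals. ToyUsage, toy_push,
toy_index_next and friends are hypothetical names invented for this
note, and the single shared_hit counter stands in for the full
BufferUsage struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for BufferUsage: a single counter. */
typedef struct ToyUsage
{
    long        shared_hit;
} ToyUsage;

static ToyUsage toy_top;                    /* session-level running total */
static ToyUsage *toy_current = &toy_top;    /* entry that gets charged */

/* Make `entry` current and hand back the previous entry for restoring. */
static ToyUsage *
toy_push(ToyUsage *entry)
{
    ToyUsage   *prev = toy_current;

    toy_current = entry;
    return prev;
}

static void
toy_pop(ToyUsage *prev)
{
    toy_current = prev;
}

/* Every buffer access is charged to the current entry only. */
static void
toy_buffer_hit(void)
{
    toy_current->shared_hit++;
}

/* Fold a finished child entry into its parent. */
void
toy_finalize(ToyUsage *parent, const ToyUsage *child)
{
    assert(parent != child);
    parent->shared_hit += child->shared_hit;
}

/* One toy index scan step: index pages first, then the table fetch. */
void
toy_index_next(ToyUsage *node, ToyUsage *table,
               int index_pages, int heap_pages)
{
    ToyUsage   *outer = toy_push(node);
    ToyUsage   *prev;
    int         i;

    for (i = 0; i < index_pages; i++)
        toy_buffer_hit();       /* charged to the node entry */

    prev = toy_push(table);
    for (i = 0; i < heap_pages; i++)
        toy_buffer_hit();       /* charged to the table entry */
    toy_pop(prev);

    toy_pop(outer);
}
```

After toy_finalize(&node, &table) the node's counter covers both index
and table accesses while the table entry keeps its own value, which is
the relationship between "Buffers" and "Table Buffers" in the EXPLAIN
output.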
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 5f6f1db0467..9219625faf6 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -734,6 +734,7 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10)
Index Cond: (unique2 = t1.unique2)
Index Searches: 10
+ Table Buffers: shared hit=10
Buffers: shared hit=24 read=6
Planning:
Buffers: shared hit=15 dirtied=9
@@ -949,7 +950,8 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
Index Cond: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Index Recheck: 1
Index Searches: 1
- Buffers: shared hit=1
+ Table Buffers: shared hit=1
+ Buffers: shared hit=2
Planning Time: 0.039 ms
Execution Time: 0.098 ms
</screen>
@@ -958,7 +960,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
then rejected by a recheck of the index condition. This happens because a
GiST index is <quote>lossy</quote> for polygon containment tests: it actually
returns the rows with polygons that overlap the target, and then we have
- to do the exact containment test on those rows.
+ to do the exact containment test on those rows. The <literal>Table Buffers</literal>
+ counts indicate how many buffer accesses were performed on the table rather
+ than the index. These counts are included in the <literal>Buffers</literal> counts.
</para>
<para>
@@ -1147,13 +1151,14 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
QUERY PLAN
-------------------------------------------------------------------&zwsp;------------------------------------------------------------
Limit (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2.00 loops=1)
- Buffers: shared hit=16
+ Buffers: shared hit=14
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2.00 loops=1)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Index Searches: 1
- Buffers: shared hit=16
+ Table Buffers: shared hit=11
+ Buffers: shared hit=14
Planning Time: 0.077 ms
Execution Time: 0.086 ms
</screen>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 7dee77fd366..912c96f2ff5 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -506,6 +506,7 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
-> Index Scan using test_pkey on test (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99.00 loops=1)
Index Cond: ((id > 100) AND (id < 200))
Index Searches: 1
+ Table Buffers: shared hit=1
Buffers: shared hit=4
Planning Time: 0.244 ms
Execution Time: 0.073 ms
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index aa76f68bd10..437246d1aa4 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -143,7 +143,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -603,7 +603,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
- show_buffer_usage(es, bufusage);
+ show_buffer_usage(es, bufusage, NULL);
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -1020,7 +1020,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
if (es->buffers && peek_buffer_usage(es, &metrics->bufferUsage))
{
es->indent++;
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->bufferUsage, NULL);
es->indent--;
}
}
@@ -1034,7 +1034,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->bufferUsage, NULL);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -1962,6 +1962,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexScanState *) planstate)->iss_InstrumentTable->instr.bufusage, "Table");
break;
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -2280,7 +2283,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->instr.bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage, NULL);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
@@ -2299,7 +2302,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->instr.bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage, NULL);
if (es->wal)
show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
@@ -4099,7 +4102,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
* Show buffer usage details. This better be sync with peek_buffer_usage.
*/
static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
{
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -4124,6 +4127,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "Buffers:");
if (has_shared)
@@ -4179,6 +4184,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared_timing || has_local_timing || has_temp_timing)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "I/O Timings:");
if (has_shared_timing)
@@ -4220,44 +4227,46 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
else
{
- ExplainPropertyInteger("Shared Hit Blocks", NULL,
+ char *prefix = title ? psprintf("%s ", title) : pstrdup("");
+
+ ExplainPropertyInteger(psprintf("%sShared Hit Blocks", prefix), NULL,
usage->shared_blks_hit, es);
- ExplainPropertyInteger("Shared Read Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sShared Read Blocks", prefix), NULL,
usage->shared_blks_read, es);
- ExplainPropertyInteger("Shared Dirtied Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sShared Dirtied Blocks", prefix), NULL,
usage->shared_blks_dirtied, es);
- ExplainPropertyInteger("Shared Written Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sShared Written Blocks", prefix), NULL,
usage->shared_blks_written, es);
- ExplainPropertyInteger("Local Hit Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sLocal Hit Blocks", prefix), NULL,
usage->local_blks_hit, es);
- ExplainPropertyInteger("Local Read Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sLocal Read Blocks", prefix), NULL,
usage->local_blks_read, es);
- ExplainPropertyInteger("Local Dirtied Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sLocal Dirtied Blocks", prefix), NULL,
usage->local_blks_dirtied, es);
- ExplainPropertyInteger("Local Written Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sLocal Written Blocks", prefix), NULL,
usage->local_blks_written, es);
- ExplainPropertyInteger("Temp Read Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sTemp Read Blocks", prefix), NULL,
usage->temp_blks_read, es);
- ExplainPropertyInteger("Temp Written Blocks", NULL,
+ ExplainPropertyInteger(psprintf("%sTemp Written Blocks", prefix), NULL,
usage->temp_blks_written, es);
if (track_io_timing)
{
- ExplainPropertyFloat("Shared I/O Read Time", "ms",
+ ExplainPropertyFloat(psprintf("%sShared I/O Read Time", prefix), "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
3, es);
- ExplainPropertyFloat("Shared I/O Write Time", "ms",
+ ExplainPropertyFloat(psprintf("%sShared I/O Write Time", prefix), "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time),
3, es);
- ExplainPropertyFloat("Local I/O Read Time", "ms",
+ ExplainPropertyFloat(psprintf("%sLocal I/O Read Time", prefix), "ms",
INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time),
3, es);
- ExplainPropertyFloat("Local I/O Write Time", "ms",
+ ExplainPropertyFloat(psprintf("%sLocal I/O Write Time", prefix), "ms",
INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time),
3, es);
- ExplainPropertyFloat("Temp I/O Read Time", "ms",
+ ExplainPropertyFloat(psprintf("%sTemp I/O Read Time", prefix), "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time),
3, es);
- ExplainPropertyFloat("Temp I/O Write Time", "ms",
+ ExplainPropertyFloat(psprintf("%sTemp I/O Write Time", prefix), "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
3, es);
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 6a74ca516ae..5e476939edf 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -414,9 +414,24 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
+ {
result->instrument = InstrAllocNode(estate->es_instrument,
result->async_capable);
+ /* IndexScan tracks table access separately from index access. */
+ if (IsA(result, IndexScanState) && (estate->es_instrument & INSTRUMENT_BUFFERS) != 0)
+ {
+ IndexScanState *iss = castNode(IndexScanState, result);
+
+ /*
+ * We intentionally don't collect timing here (even if enabled),
+ * since we don't need it, and IndexNext calls InstrPushStack /
+ * InstrPopStack (instead of InstrNode*) to reduce overhead.
+ */
+ iss->iss_InstrumentTable = InstrAllocNode(INSTRUMENT_BUFFERS, false);
+ }
+ }
+
return result;
}
@@ -836,8 +851,19 @@ ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
return false;
if (node->instrument)
+ {
InstrQueryRememberNode(parent, node->instrument);
+ /* IndexScan has a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ if (iss->iss_InstrumentTable)
+ InstrQueryRememberNode(parent, iss->iss_InstrumentTable);
+ }
+ }
+
return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
}
@@ -879,6 +905,15 @@ ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
if (!node->instrument)
return false;
+ /* IndexScan has a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ if (iss->iss_InstrumentTable)
+ iss->iss_InstrumentTable = InstrFinalizeNode(iss->iss_InstrumentTable, &node->instrument->instr);
+ }
+
node->instrument = InstrFinalizeNode(node->instrument, parent);
return false;
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index a616abff04c..4794095092e 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -83,7 +83,9 @@ IndexNext(IndexScanState *node)
ExprContext *econtext;
ScanDirection direction;
IndexScanDesc scandesc;
+ ItemPointer tid;
TupleTableSlot *slot;
+ bool found;
/*
* extract necessary information from index scan node
@@ -128,8 +130,22 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
+ if (node->iss_InstrumentTable)
+ InstrPushStack(&node->iss_InstrumentTable->instr);
+
+ found = index_fetch_heap(scandesc, slot);
+
+ if (node->iss_InstrumentTable)
+ InstrPopStack(&node->iss_InstrumentTable->instr);
+
+ if (unlikely(!found))
+ continue;
+
+ if (scandesc->xs_heap_continue)
+ elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+
CHECK_FOR_INTERRUPTS();
/*
@@ -812,6 +828,11 @@ ExecEndIndexScan(IndexScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->iss_Instrument.nsearches;
+ if (node->iss_InstrumentTable)
+ {
+ BufferUsageAdd(&winstrument->worker_table_bufusage, &node->iss_InstrumentTable->instr.bufusage);
+ WalUsageAdd(&winstrument->worker_table_walusage, &node->iss_InstrumentTable->instr.walusage);
+ }
}
/*
@@ -1819,4 +1840,14 @@ ExecIndexScanRetrieveInstrumentation(IndexScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->iss_SharedInfo = palloc(size);
memcpy(node->iss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ if (node->iss_InstrumentTable)
+ for (int i = 0; i < node->iss_SharedInfo->num_workers; i++)
+ {
+ BufferUsageAdd(&node->iss_InstrumentTable->instr.bufusage,
+ &node->iss_SharedInfo->winstrument[i].worker_table_bufusage);
+ WalUsageAdd(&node->iss_InstrumentTable->instr.walusage,
+ &node->iss_SharedInfo->winstrument[i].worker_table_walusage);
+ }
}
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 8847d7f94fa..170b6143ef6 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -18,6 +18,8 @@
#ifndef INSTRUMENT_NODE_H
#define INSTRUMENT_NODE_H
+#include "executor/instrument.h"
+
/* ---------------------
* Instrumentation information for aggregate function execution
@@ -48,6 +50,10 @@ typedef struct IndexScanInstrumentation
{
/* Index search count (incremented with pgstat_count_index_scan call) */
uint64 nsearches;
+
+ /* Used for passing iss_InstrumentTable data from parallel workers */
+ BufferUsage worker_table_bufusage;
+ WalUsage worker_table_walusage;
} IndexScanInstrumentation;
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 605c7a6cc39..c778641c13d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1728,6 +1728,13 @@ typedef struct IndexScanState
IndexScanInstrumentation iss_Instrument;
SharedIndexScanInstrumentation *iss_SharedInfo;
+ /*
+ * Instrumentation utilized for tracking table access. This is separate
+ * from iss_Instrument since it needs to be allocated in the right context
+ * and IndexScanInstrumentation shouldn't contain pointers.
+ */
+ NodeInstrumentation *iss_InstrumentTable;
+
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
bool iss_ReachedEnd;
--
2.47.1
[application/octet-stream] v7-0008-Add-pg_session_buffer_usage-contrib-module.patch (25.5K, 9-v7-0008-Add-pg_session_buffer_usage-contrib-module.patch)
From 15900843fffa2c6405a673f50f98c37c09e617e8 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:41 -0800
Subject: [PATCH v7 8/8] Add pg_session_buffer_usage contrib module
This is intended for testing instrumentation-related logic as it pertains
to the top-level stack that is maintained as a running total. There is
currently no in-core user that utilizes the top-level values in this
manner. Especially in abort situations, this helps ensure we don't
lose activity due to incorrect handling of unfinalized node stacks.
---
contrib/meson.build | 1 +
contrib/pg_session_buffer_usage/Makefile | 23 ++
.../expected/pg_session_buffer_usage.out | 283 ++++++++++++++++++
contrib/pg_session_buffer_usage/meson.build | 34 +++
.../pg_session_buffer_usage--1.0.sql | 31 ++
.../pg_session_buffer_usage.c | 95 ++++++
.../pg_session_buffer_usage.control | 5 +
.../sql/pg_session_buffer_usage.sql | 204 +++++++++++++
8 files changed, 676 insertions(+)
create mode 100644 contrib/pg_session_buffer_usage/Makefile
create mode 100644 contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
create mode 100644 contrib/pg_session_buffer_usage/meson.build
create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
create mode 100644 contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
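To make the abort concern concrete, here is a small standalone model of
the property the module is meant to help test: activity recorded in
entries that were never finalized must still end up in the top-level
running total. This is only a sketch under simplifying assumptions; it
folds entries into their parent at pop time, which is not how the patch
series splits pop and finalize, and ToyStack, toy_abort_cleanup and the
other names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_DEPTH 8

/* Hypothetical stand-in for BufferUsage: a single counter. */
typedef struct ToyUsage
{
    long        shared_hit;
} ToyUsage;

typedef struct ToyStack
{
    ToyUsage    top;                        /* running session total */
    ToyUsage   *entries[TOY_MAX_DEPTH];     /* entries still open */
    int         depth;
} ToyStack;

/* The innermost open entry, or the top-level total if none is open. */
static ToyUsage *
toy_current(ToyStack *s)
{
    return s->depth > 0 ? s->entries[s->depth - 1] : &s->top;
}

void
toy_push(ToyStack *s, ToyUsage *entry)
{
    assert(s->depth < TOY_MAX_DEPTH);
    s->entries[s->depth++] = entry;
}

void
toy_hit(ToyStack *s)
{
    toy_current(s)->shared_hit++;
}

/* Close the innermost entry and add its counts to whatever is below it. */
static void
toy_pop_and_finalize(ToyStack *s)
{
    ToyUsage   *entry;

    assert(s->depth > 0);
    entry = s->entries[--s->depth];
    toy_current(s)->shared_hit += entry->shared_hit;
    entry->shared_hit = 0;
}

/* Abort path: unwind every entry that is still open so nothing is lost. */
void
toy_abort_cleanup(ToyStack *s)
{
    while (s->depth > 0)
        toy_pop_and_finalize(s);
}
```

If an error interrupts execution with two entries still open, calling
toy_abort_cleanup() leaves depth at zero and moves all of their hits
into top; skipping that step would silently drop them from the running
total, which is the kind of bug a session-level view can expose.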
diff --git a/contrib/meson.build b/contrib/meson.build
index def13257cbe..cab1b211678 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -50,6 +50,7 @@ subdir('pg_logicalinspect')
subdir('pg_overexplain')
subdir('pg_prewarm')
subdir('pgrowlocks')
+subdir('pg_session_buffer_usage')
subdir('pg_stat_statements')
subdir('pgstattuple')
subdir('pg_surgery')
diff --git a/contrib/pg_session_buffer_usage/Makefile b/contrib/pg_session_buffer_usage/Makefile
new file mode 100644
index 00000000000..75bd8e09b3d
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_session_buffer_usage/Makefile
+
+MODULE_big = pg_session_buffer_usage
+OBJS = \
+ $(WIN32RES) \
+ pg_session_buffer_usage.o
+
+EXTENSION = pg_session_buffer_usage
+DATA = pg_session_buffer_usage--1.0.sql
+PGFILEDESC = "pg_session_buffer_usage - show buffer usage statistics for the current session"
+
+REGRESS = pg_session_buffer_usage
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_session_buffer_usage
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out b/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
new file mode 100644
index 00000000000..242b4003950
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
@@ -0,0 +1,283 @@
+LOAD 'pg_session_buffer_usage';
+CREATE EXTENSION pg_session_buffer_usage;
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM pg_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+ ok
+----
+ t
+(1 row)
+
+-- Verify counters increase after buffer activity
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM pg_session_buffer_usage();
+ blocks_increased
+------------------
+ t
+(1 row)
+
+DROP TABLE test_buf_activity;
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM pg_session_buffer_usage(), dc_serial_result s;
+ leader_buffers_match
+----------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_dc_tab, dc_serial_result;
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+ERROR: division by zero
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM pg_session_buffer_usage();
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+DROP TABLE exc_tab;
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+ERROR: duplicate key value violates unique constraint "unique_tab_a_key"
+DETAIL: Key (a)=(1) already exists.
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM pg_session_buffer_usage();
+ unique_violation_buffers_visible
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE unique_tab;
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT subxact_exc_func();
+ subxact_exc_func
+------------------
+ caught
+(1 row)
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM pg_session_buffer_usage();
+ subxact_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT cursor_exc_func();
+ cursor_exc_func
+-----------------------
+ caught after 250 rows
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM pg_session_buffer_usage();
+ cursor_subxact_buffers_visible
+--------------------------------
+ t
+(1 row)
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CONTEXT: parallel worker
+RESET debug_parallel_query;
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM pg_session_buffer_usage(), par_abort_serial_result s;
+ worker_abort_buffers_not_propagated
+-------------------------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_abort_tab, par_abort_serial_result;
+-- Cleanup
+DROP EXTENSION pg_session_buffer_usage;
diff --git a/contrib/pg_session_buffer_usage/meson.build b/contrib/pg_session_buffer_usage/meson.build
new file mode 100644
index 00000000000..34c7502beb4
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/meson.build
@@ -0,0 +1,34 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+pg_session_buffer_usage_sources = files(
+ 'pg_session_buffer_usage.c',
+)
+
+if host_system == 'windows'
+ pg_session_buffer_usage_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_session_buffer_usage',
+ '--FILEDESC', 'pg_session_buffer_usage - show buffer usage statistics for the current session',])
+endif
+
+pg_session_buffer_usage = shared_module('pg_session_buffer_usage',
+ pg_session_buffer_usage_sources,
+ kwargs: contrib_mod_args,
+)
+contrib_targets += pg_session_buffer_usage
+
+install_data(
+ 'pg_session_buffer_usage--1.0.sql',
+ 'pg_session_buffer_usage.control',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'pg_session_buffer_usage',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'pg_session_buffer_usage',
+ ],
+ },
+}
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql b/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
new file mode 100644
index 00000000000..b300fdbc643
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
@@ -0,0 +1,31 @@
+/* contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_session_buffer_usage" to load this file. \quit
+
+CREATE FUNCTION pg_session_buffer_usage(
+ OUT shared_blks_hit bigint,
+ OUT shared_blks_read bigint,
+ OUT shared_blks_dirtied bigint,
+ OUT shared_blks_written bigint,
+ OUT local_blks_hit bigint,
+ OUT local_blks_read bigint,
+ OUT local_blks_dirtied bigint,
+ OUT local_blks_written bigint,
+ OUT temp_blks_read bigint,
+ OUT temp_blks_written bigint,
+ OUT shared_blk_read_time double precision,
+ OUT shared_blk_write_time double precision,
+ OUT local_blk_read_time double precision,
+ OUT local_blk_write_time double precision,
+ OUT temp_blk_read_time double precision,
+ OUT temp_blk_write_time double precision
+)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_session_buffer_usage'
+LANGUAGE C PARALLEL RESTRICTED;
+
+CREATE FUNCTION pg_session_buffer_usage_reset()
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_session_buffer_usage_reset'
+LANGUAGE C PARALLEL RESTRICTED;
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
new file mode 100644
index 00000000000..f869956b3a9
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_session_buffer_usage.c
+ * show buffer usage statistics for the current session
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/instrument.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "pg_session_buffer_usage",
+ .version = PG_VERSION
+);
+
+#define NUM_BUFFER_USAGE_COLUMNS 16
+
+PG_FUNCTION_INFO_V1(pg_session_buffer_usage);
+PG_FUNCTION_INFO_V1(pg_session_buffer_usage_reset);
+
+#define HAVE_INSTR_STACK 1 /* Change to 0 when testing before stack
+ * change */
+
+/*
+ * SQL function: pg_session_buffer_usage()
+ *
+ * Returns a single row with all BufferUsage counters accumulated since the
+ * start of the session. Excludes any usage not yet added to the top of the
+ * stack (e.g. if this gets called inside a statement that also had buffer
+ * activity).
+ */
+Datum
+pg_session_buffer_usage(PG_FUNCTION_ARGS)
+{
+ TupleDesc tupdesc;
+ Datum values[NUM_BUFFER_USAGE_COLUMNS];
+ bool nulls[NUM_BUFFER_USAGE_COLUMNS];
+ BufferUsage *usage;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ memset(nulls, 0, sizeof(nulls));
+
+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif
+
+ values[0] = Int64GetDatum(usage->shared_blks_hit);
+ values[1] = Int64GetDatum(usage->shared_blks_read);
+ values[2] = Int64GetDatum(usage->shared_blks_dirtied);
+ values[3] = Int64GetDatum(usage->shared_blks_written);
+ values[4] = Int64GetDatum(usage->local_blks_hit);
+ values[5] = Int64GetDatum(usage->local_blks_read);
+ values[6] = Int64GetDatum(usage->local_blks_dirtied);
+ values[7] = Int64GetDatum(usage->local_blks_written);
+ values[8] = Int64GetDatum(usage->temp_blks_read);
+ values[9] = Int64GetDatum(usage->temp_blks_written);
+ values[10] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time));
+ values[11] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time));
+ values[12] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time));
+ values[13] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time));
+ values[14] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+ values[15] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * SQL function: pg_session_buffer_usage_reset()
+ *
+ * Resets all BufferUsage counters on the top instrumentation stack to zero.
+ * Useful in tests to avoid the baseline/delta pattern.
+ */
+Datum
+pg_session_buffer_usage_reset(PG_FUNCTION_ARGS)
+{
+#if HAVE_INSTR_STACK
+ memset(&instr_top.bufusage, 0, sizeof(BufferUsage));
+#else
+ memset(&pgBufferUsage, 0, sizeof(BufferUsage));
+#endif
+
+ PG_RETURN_VOID();
+}
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
new file mode 100644
index 00000000000..fabd05ee024
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
@@ -0,0 +1,5 @@
+# pg_session_buffer_usage extension
+comment = 'show buffer usage statistics for the current session'
+default_version = '1.0'
+module_pathname = '$libdir/pg_session_buffer_usage'
+relocatable = true
diff --git a/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql b/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
new file mode 100644
index 00000000000..8f5810fadd3
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
@@ -0,0 +1,204 @@
+LOAD 'pg_session_buffer_usage';
+CREATE EXTENSION pg_session_buffer_usage;
+
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM pg_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+
+-- Verify counters increase after buffer activity
+SELECT pg_session_buffer_usage_reset();
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM pg_session_buffer_usage();
+
+DROP TABLE test_buf_activity;
+
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+SELECT count(*) FROM par_dc_tab;
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM pg_session_buffer_usage(), dc_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_dc_tab, dc_serial_result;
+
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+
+SELECT pg_session_buffer_usage_reset();
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP TABLE exc_tab;
+
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+
+SELECT pg_session_buffer_usage_reset();
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP TABLE unique_tab;
+
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT subxact_exc_func();
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT cursor_exc_func();
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+
+SELECT pg_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+RESET debug_parallel_query;
+
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM pg_session_buffer_usage(), par_abort_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_abort_tab, par_abort_serial_result;
+
+-- Cleanup
+DROP EXTENSION pg_session_buffer_usage;
--
2.47.1
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-03-08 04:31 UTC (permalink / raw)
To: PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; +Cc: Peter Smith <[email protected]>
On Sat, Mar 7, 2026 at 8:27 PM Lukas Fittl <[email protected]> wrote:
> Example (default shared_buffers, runtimes are best out of 3-ish):
>
> CREATE TABLE lotsarows(key int not null);
> INSERT INTO lotsarows SELECT generate_series(1, 50000000);
> VACUUM FREEZE lotsarows;
>
> 250ms actual runtime (no instrumentation)
>
> BUFFERS OFF, TIMING OFF:
> 295ms master
> 295ms with stack-based instrumentation only (v7/0005) -- no change
> because BUFFERS OFF
> 260ms with ExecProcNodeInstr inlining work (v7/0006)
>
> BUFFERS ON, TIMING OFF:
> 380ms master
> 305ms with stack-based instrumentation only (v7/0005)
> 280ms with ExecProcNodeInstr inlining work (v7/0006)
>
> In summary: For BUFFERS ON, we're going from 52% overhead in this
> stress test, to 12% overhead (22% without the ExecProcNodeInstr
> change). With rows instrumentation only, we go from 18% to 3%
> overhead.
Erm, and I forgot the query here: this is testing "SELECT count(*)
FROM lotsarows;", just like over in [0].
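For reference, the overhead percentages in the quoted summary follow from
dividing the extra runtime by the 250ms uninstrumented baseline. The
sketch below redoes that arithmetic; the function name overhead_pct is
made up for illustration and is not part of the patch. With the rounded
runtimes quoted above, the 260ms case works out to 4%, so the quoted 3%
presumably reflects unrounded timings.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Relative overhead of an instrumented run over the uninstrumented
 * baseline, in percent. Hypothetical helper, for illustration only.
 */
double
overhead_pct(double instrumented_ms, double baseline_ms)
{
	assert(baseline_ms > 0.0);
	return (instrumented_ms - baseline_ms) / baseline_ms * 100.0;
}

/* Print the overhead for one labelled measurement. */
void
print_overhead(const char *label, double instrumented_ms, double baseline_ms)
{
	printf("%-40s %6.1f%%\n", label,
		   overhead_pct(instrumented_ms, baseline_ms));
}
```

Calling print_overhead("BUFFERS ON, master", 380.0, 250.0) reproduces
the 52% figure; 305.0 and 280.0 give the 22% and 12% figures, and 295.0
gives the 18% rows-only figure for master.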
Thanks,
Lukas
[0]: https://www.postgresql.org/message-id/flat/[email protected]
--
Lukas Fittl
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Zsolt Parragi @ 2026-03-09 21:55 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; Peter Smith <[email protected]>
Hello
+ if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ queryDesc->totaltime = InstrQueryStopFinalize(queryDesc->totaltime);
In ExecFinalizeNodeInstrumentation, InstrFinalizeNode pfrees the
original instrumentation but doesn't remove it from the
unfinalized_children list. In normal execution,
ResourceOwnerForgetInstrumentation in InstrQueryStopFinalize handles
this, but what about the error path, if something fails between the
two?
Won't we end up in ResOwnerReleaseInstrumentation, do use-after-free
reads, and then double free the now-invalid pointer?
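To make the concern concrete, here is a minimal standalone sketch of the
pattern that would avoid it: unlink an entry from the list of unfinalized
children before freeing it, so that a later abort-time cleanup never sees
a dangling pointer. This is not the patch's code. All Demo* names are
hypothetical, and malloc/free stand in for palloc/pfree.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-node instrumentation entry. */
typedef struct DemoInstr
{
	long		blks_hit;
	struct DemoInstr *next;
} DemoInstr;

/* Hypothetical query-level owner of the not-yet-finalized entries. */
typedef struct DemoQuery
{
	long		total_blks_hit;
	DemoInstr  *unfinalized;
} DemoQuery;

/* Allocate an entry and register it on the unfinalized list. */
DemoInstr *
demo_remember_node(DemoQuery *query, long blks_hit)
{
	DemoInstr  *instr = malloc(sizeof(DemoInstr));

	assert(instr != NULL);
	instr->blks_hit = blks_hit;
	instr->next = query->unfinalized;
	query->unfinalized = instr;
	return instr;
}

/* Normal path: unlink the entry first, then accumulate and free it. */
void
demo_finalize_node(DemoQuery *query, DemoInstr *instr)
{
	DemoInstr **link = &query->unfinalized;

	while (*link != NULL && *link != instr)
		link = &(*link)->next;
	assert(*link == instr);		/* entry must still be registered */
	*link = instr->next;

	query->total_blks_hit += instr->blks_hit;
	free(instr);
}

/* Abort path: only entries still on the list are touched and freed. */
int
demo_release_on_abort(DemoQuery *query)
{
	int			released = 0;

	while (query->unfinalized != NULL)
	{
		DemoInstr  *instr = query->unfinalized;

		query->unfinalized = instr->next;
		query->total_blks_hit += instr->blks_hit;
		free(instr);
		released++;
	}
	return released;
}
```

If demo_finalize_node freed the entry without unlinking it, a subsequent
demo_release_on_abort would read and free the same memory again, which
is the hazard described above.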
+ if (myState->es->timing || myState->es->buffers)
+ instr = InstrQueryStopFinalize(instr);
+
Is it okay to leak 1 instrumentation copy per tuple in the query
context? This freshly palloced object will be out of scope a few lines
after this.
@@ -128,8 +130,22 @@ IndexNext(IndexScanState *node)
IndexNextWithReorder doesn't need the same handling?
+ if (scandesc->xs_heap_continue)
+ elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
Shouldn't this say index scans? (There's another preexisting
index-only scan mention in this file, but that also seems wrong.)
+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif
I don't see pgBufferUsage anywhere else in the current code. Was it
perhaps removed between rebases?
+ char *prefix = title ? psprintf("%s ", title) : pstrdup("");
+
+ ExplainPropertyInteger(psprintf("%sShared Hit Blocks", prefix), NULL,
usage->shared_blks_hit, es);
(and many similar ExplainProperty* calls after this)
title is NULL most of the time, and this results in 16 allocations for
that common case - isn't there a better solution like using
ExplainOpenGroup or something?
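One possible shape for this, sketched outside of PostgreSQL: return the
constant label untouched when there is no title, and only format into a
caller-provided buffer otherwise. The helper name demo_prefixed_label is
hypothetical, and the sketch assumes the consumer uses the label before
the buffer goes out of scope.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Return the property label to use. Without a title the constant label
 * is returned as-is and buf is not written; with a title,
 * "<title> <label>" is formatted into the caller-provided buffer.
 * No heap allocation happens in either case.
 */
const char *
demo_prefixed_label(const char *title, const char *label,
					char *buf, size_t buflen)
{
	if (title == NULL)
		return label;

	assert(buflen > 0);
	snprintf(buf, buflen, "%s %s", title, label);
	return buf;
}
```

A caller would keep one small buffer on the stack and pass
demo_prefixed_label(title, "Shared Hit Blocks", buf, sizeof(buf)) as the
label argument, so the common title == NULL case does no formatting at
all.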
+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, instead call InstrPopAndFinalizeStack which can skip intermediate
+ * stack entries, or instead use InstrStart/InstrStop.
InstrPopAndFinalizeStack doesn't exist in the latest patch version.
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-03-09 23:45 UTC (permalink / raw)
To: Zsolt Parragi <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; Peter Smith <[email protected]>
Hi Zsolt,
Thanks for reviewing!
On Mon, Mar 9, 2026 at 2:55 PM Zsolt Parragi <[email protected]> wrote:
> + if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
> + {
> + ExecFinalizeNodeInstrumentation(queryDesc->planstate);
> +
> + ExecFinalizeTriggerInstrumentation(estate);
> + }
> +
> if (queryDesc->totaltime)
> - InstrStop(queryDesc->totaltime);
> + queryDesc->totaltime = InstrQueryStopFinalize(queryDesc->totaltime);
>
> In ExecFinalizeNodeInstrumentation InstrFinalizeNode pfrees the
> original instrumentation, but doesn't remove it from the
> unfinalized_children list. In normal execution in
> InstrQueryStopFinalize ResourceOwnerForgetInstrumentation handles
> this, but what about the error path, if something happens between the
> two?
> Won't we end up in ResOwnerReleaseInstrumentation and do use after
> free reads and then a double free on the now invalid pointer?
ResourceOwnerForgetInstrumentation directly follows the call to
ExecFinalizeNodeInstrumentation in standard_ExecutorFinish, so I'm not
sure which error case you're thinking of?
We could explicitly zap the unfinalized_children list at the end of
ExecFinalizeNodeInstrumentation (and have it take a
QueryInstrumentation argument) to protect against this, but I don't
think that makes much of a difference with the code as currently
written.
Maybe I'm misunderstanding which situation you're thinking of?
> + if (myState->es->timing || myState->es->buffers)
> + instr = InstrQueryStopFinalize(instr);
> +
>
> Is it okay to leak 1 instrumentation copy per tuple in the query
> context? This freshly palloced object will be out of scope a few lines
> after this.
I don't think that's a permanent leak, since it would be in the memory
context of the caller, i.e. the per-query memory context, but yeah,
this doesn't seem ideal, since we're keeping this memory around for
each tuple processed.
First of all, we could just do a pfree in serializeAnalyzeReceive
after we added the stats, but that said, this extra instrumentation
for EXPLAIN (SERIALIZE) is a bit of a curious edge case I suppose,
since serializeAnalyzeReceive gets called once per tuple - and so
doing the ResOwner dance for each tuple is probably not ideal.
I wonder if maybe we shouldn't treat this as a case of
NodeInstrumentation instead, and attach to the query's totaltime
instrumentation for abort safety. The only thing that's a bit
inconvenient about that is that calling
InstrQueryRememberNode/InstrFinalizeNode requires passing in that
parent QueryInstrumentation, and the SerializeDestReceiver doesn't
currently have easy access to that. If we brute-force solve that we
could just add an extra method to set it after CreateQueryDesc is
called, but:
Stepping back, from an overall API design perspective, we could also
track the current QueryInstrumentation in a global variable - that'd
make it easier to attach/finalize extra instrumentation on it even if
we don't have direct access to querydesc->totaltime, or we're in a
non-executor context that uses QueryInstrumentation (like some utility
commands) for future use cases of extra sub-query-level
instrumentation. For example, that could also come in handy if we
wanted VACUUM VERBOSE to break out the buffer access it does on indexes
vs tables.
> @@ -128,8 +130,22 @@ IndexNext(IndexScanState *node)
>
> IndexNextWithReorder doesn't need the same handling?
Yes - good point, that's an unintentional omission.
For context, I hadn't spent substantial amounts of time on this part
of the series (rather on getting the design of the stack mechanism
right), but will fix this in the next revision to also handle
IndexNextWithReorder.
I've also been wondering whether we should teach Index-only Scans to
break out the per-table buffer stats for the (rare) heap fetches. I
guess we should? (it'd be fairly easy to instrument that
index_fetch_heap there now)
> + if (scandesc->xs_heap_continue)
> + elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
>
> Shouldn't this say index scans? (there's another preexisting
> indexonylscan mention in this file, but that also seems wrong)
Hmm. Looking at this again, I think the handling of xs_heap_continue
isn't right altogether. Basically it should be this instead, I think,
so we correctly call the table AM's table_index_fetch_tuple again if
call_again gets set:
    while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
    {
        if (node->iss_InstrumentTable)
            InstrPushStack(&node->iss_InstrumentTable->instr);
        for (;;)
        {
            found = index_fetch_heap(scandesc, slot);
            if (found || !scandesc->xs_heap_continue)
                break;
        }
        if (node->iss_InstrumentTable)
            InstrPopStack(&node->iss_InstrumentTable->instr);
        if (unlikely(!found))
            continue;
This matches the loop in index_getnext_slot (i.e. keep calling
index_fetch_heap if xs_heap_continue=true). Based on that, we wouldn't
need the elog at all.
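To make the retry behaviour concrete, here is a standalone sketch of the "keep fetching while the continue flag is set" pattern in plain C. It does not use PostgreSQL's real types: DemoScan, demo_fetch_heap and demo_fetch_for_tid are hypothetical names made up for illustration, and the mock fetch routine simply examines one tuple version per call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a scan descriptor; NOT PostgreSQL's IndexScanDesc. */
typedef struct DemoScan
{
	int			versions_left;	/* tuple versions not yet examined for this TID */
	int			visible_at;		/* value of versions_left at which a version is
								 * visible; 0 means no version is visible */
	bool		heap_continue;	/* plays the role of xs_heap_continue */
	int			fetch_calls;	/* how often the fetch routine was called */
} DemoScan;

/* Examine one tuple version per call and report whether more remain. */
static bool
demo_fetch_heap(DemoScan *scan)
{
	bool		found = (scan->versions_left == scan->visible_at);

	scan->fetch_calls++;
	scan->versions_left--;
	scan->heap_continue = (scan->versions_left > 0);
	return found;
}

/*
 * Fetch for a single TID, retrying while the fetch routine asks to be called
 * again. Returns the number of fetch calls if a visible version was found,
 * or the negated number of calls if none was.
 */
static int
demo_fetch_for_tid(int versions, int visible_at)
{
	DemoScan	scan = {versions, visible_at, false, 0};
	bool		found;

	for (;;)
	{
		found = demo_fetch_heap(&scan);
		if (found || !scan.heap_continue)
			break;
	}
	return found ? scan.fetch_calls : -scan.fetch_calls;
}
```

With three versions of which only the last examined one is visible, demo_fetch_for_tid(3, 1) needs three fetch calls, whereas a single-call version would have given up after the first one.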
> +#if HAVE_INSTR_STACK
> + usage = &instr_top.bufusage;
> +#else
> + usage = &pgBufferUsage;
> +#endif
>
> I don't see pgBufferUsage anywhere else in the current code, it was
> probably removed between rebases?
That's intentional to allow testing pg_session_buffer_usage both on
master (HAVE_INSTR_STACK=0) and with the patched version
(HAVE_INSTR_STACK=1) - mainly to ensure we're matching behavior for
anyone interested in the top-line buffer or WAL numbers. I don't think
we should commit pg_session_buffer_usage (at least not as-is), it's
just there for testing the earlier changes.
> + char *prefix = title ? psprintf("%s ", title) : pstrdup("");
> +
> + ExplainPropertyInteger(psprintf("%sShared Hit Blocks", prefix), NULL,
> usage->shared_blks_hit, es);
>
> (And many similar ExplainPropery after this)
>
> title is NULL most of the time, and this results in 16 allocations for
> that common case - isn't there a better solution like using
> ExplainOpenGroup or something?
Yeah, I guess that's fair - I don't know if the extra allocations
really matter, but I can see your point.
If we used ExplainOpenGroup that would make it a nested structure in
the JSON/etc representation, so it'd look like this:
"Shared Read Blocks": 123.0,
...
"Temp I/O Write Time": 42.0,
"Table": {
  "Shared Read Blocks": 123.0,
  ...
  "Temp I/O Write Time": 42.0
}
compared to text representation (simplified):
Buffers: shared read=123.0
I/O Timings: temp write=42.0
Table Buffers: shared read=123.0
Table I/O Timings: temp write=42.0
I think that can work, and we have precedent for using groups in a
node like this, e.g. with "Workers".
If we go with this, I do wonder if we should clarify the JSON/etc
group key as "Table Access" (one group) or "Table Buffers" / "Table
I/O" (two groups), instead of just "Table"?
> + * Callers must ensure that no intermediate stack entries are skipped, to
> + * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
> + * block, instead call InstrPopAndFinalizeStack which can skip intermediate
> + * stack entries, or instead use InstrStart/InstrStop.
>
> InstrPopAndFinalizeStack doesn't exists in the latest patch version
Ah, yeah, that should say InstrStopFinalize (basically direct use of
InstrPushStack/InstrPopStack is discouraged over use of the
Instrumentation functions), I'll fix it in the next revision.
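As a rough illustration of why skipped stack entries matter, the following standalone C sketch models a push/pop instrumentation stack. All names (DemoInstr, demo_push, demo_pop, demo_unwind_to) are hypothetical and not taken from the patch. Adding a child's counters to its parent on pop is a simplification made for this sketch; the series may do its accumulation at a different point, for example during finalization.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stack entry; the real patch uses its own structs. */
typedef struct DemoInstr
{
	long		shared_blks_hit;
	struct DemoInstr *previous;	/* entry that was current before the push */
} DemoInstr;

static DemoInstr demo_top;		/* bottom of the stack, holds overall totals */
static DemoInstr *demo_current = &demo_top;

static void
demo_push(DemoInstr *entry)
{
	entry->previous = demo_current;
	demo_current = entry;
}

/* Plain pop: only valid if "entry" is on top, i.e. nothing was skipped. */
static void
demo_pop(DemoInstr *entry)
{
	assert(demo_current == entry);
	demo_current = entry->previous;
	demo_current->shared_blks_hit += entry->shared_blks_hit;
}

/* Counters are charged to whatever entry is current. */
static void
demo_count_hit(void)
{
	demo_current->shared_blks_hit++;
}

/* Abort-style unwinding: pop intermediate entries first, then "entry". */
static void
demo_unwind_to(DemoInstr *entry)
{
	while (demo_current != entry)
	{
		assert(demo_current != &demo_top);	/* entry must be on the stack */
		demo_pop(demo_current);
	}
	demo_pop(entry);
}

/* One hit in the outer entry, two in the inner; returns the overall total. */
static long
demo_nested_example(int abort_in_inner)
{
	DemoInstr	outer = {0, NULL};
	DemoInstr	inner = {0, NULL};

	demo_top.shared_blks_hit = 0;
	demo_current = &demo_top;

	demo_push(&outer);
	demo_count_hit();
	demo_push(&inner);
	demo_count_hit();
	demo_count_hit();

	if (abort_in_inner)
		demo_unwind_to(&outer);	/* inner is popped on the way */
	else
	{
		demo_pop(&inner);
		demo_pop(&outer);
	}
	return demo_top.shared_blks_hit;
}
```

Both the normal path and the unwinding path end with all three hits in the totals. Calling demo_pop(&outer) directly while inner is still current would trip the assertion, which is the situation the comment warns about.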
Thanks,
Lukas
--
Lukas Fittl
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Zsolt Parragi @ 2026-03-10 08:12 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; Peter Smith <[email protected]>
> ResourceOwnerForgetInstrumentation directly follows the call to
> ExecFinalizeNodeInstrumentation in standard_ExecutorFinish, so I'm not
> sure which error case you're thinking of?
There are a few pallocs between them, so OOM is possible, even if
unlikely. I mainly mentioned this because even if unlikely it can
happen in theory, and the fix seems simple to me.
> I don't think that's a permanent leak, since it would be in the memory
> context of the caller, i.e. the per-query memory context
Yes, it's definitely not permanent, but could be bad with many tuples.
> and so
> doing the ResOwner dance for each tuple is probably not ideal.
These approaches are interesting, but also add complexity, so I'm
unsure which is better for this, the pfree calls add one line and
solve the main issue with the current code.
> Basically it should be this instead, I think,
> so we correctly call the table AM's table_index_fetch_tuple again if
> call_again gets set:
Right, this code will be better.
> I don't know if the extra allocations
> really matter, but I can see your point.
Yeah, probably doesn't matter that much, but the code also wasn't that
nice in that form. I didn't try to actually modify it, but by just
looking at it the grouped option seemed cleaner to me, and the output
should also be self-explanatory and logical to users.
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-03-14 20:49 UTC (permalink / raw)
To: Zsolt Parragi <[email protected]>; PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; +Cc: Peter Smith <[email protected]>
On Tue, Mar 10, 2026 at 1:13 AM Zsolt Parragi <[email protected]> wrote:
> > ResourceOwnerForgetInstrumentation directly follows the call to
> > ExecFinalizeNodeInstrumentation in standard_ExecutorFinish, so I'm not
> > sure which error case you're thinking of?
>
> There are a few pallocs between them, so OOM is possible, even if
> unlikely. I mainly mentioned this because even if unlikely it can
> happen in theory, and the fix seems simple to me.
Ah, yeah, I didn't consider the pallocs in InstrFinalizeNode causing
an OOM that would cause an abort - good thinking!
I've adjusted this to use a dlist instead of an slist and
InstrFinalizeNode now deletes the node from the list.
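To spell out why unlinking at finalize time closes the window, here is a standalone C sketch with a minimal circular doubly linked list. It is not the patch code and does not use PostgreSQL's ilist.h; DemoNode and the demo_* functions are hypothetical. A doubly linked list is used because an arbitrary node can be unlinked in constant time without knowing its predecessor.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list node; the list head is a sentinel of the same type. */
typedef struct DemoNode
{
	struct DemoNode *prev;
	struct DemoNode *next;
} DemoNode;

static void
demo_list_init(DemoNode *head)
{
	head->prev = head->next = head;
}

static void
demo_list_push(DemoNode *head, DemoNode *node)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

static void
demo_list_delete(DemoNode *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
}

/* Finalize: unlink first, then free, so cleanup can never see the node. */
static void
demo_finalize(DemoNode *node)
{
	demo_list_delete(node);
	free(node);
}

/* Abort-style cleanup: free whatever is still linked, return the count. */
static int
demo_release_remaining(DemoNode *head)
{
	int			freed = 0;

	while (head->next != head)
	{
		DemoNode   *node = head->next;

		demo_list_delete(node);
		free(node);
		freed++;
	}
	return freed;
}

/* Register "total" nodes, finalize some, then simulate an abort. */
static int
demo_abort_after_finalizing(int total, int finalized)
{
	DemoNode	head;
	DemoNode   *nodes[8];
	int			i;

	assert(total >= 0 && total <= 8 && finalized >= 0 && finalized <= total);
	demo_list_init(&head);
	for (i = 0; i < total; i++)
	{
		nodes[i] = malloc(sizeof(DemoNode));
		if (nodes[i] == NULL)
			abort();
		demo_list_push(&head, nodes[i]);
	}
	for (i = 0; i < finalized; i++)
		demo_finalize(nodes[i]);
	return demo_release_remaining(&head);
}
```

If an error interrupts finalization after one of three nodes, the cleanup frees exactly the two nodes that are still linked, so each node is freed once.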
> > I don't think that's a permanent leak, since it would be in the memory
> > context of the caller, i.e. the per-query memory context
>
> Yes, it's definitely not permanent, but could be bad with many tuples.
>
> > and so
> > doing the ResOwner dance for each tuple is probably not ideal.
>
> These approaches are interesting, but also add complexity, so I'm
> unsure which is better for this, the pfree calls add one line and
> solve the main issue with the current code.
Yeah, fair point on complexity - I've just added the pfree for now.
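For illustration, the per-tuple pattern with the added free can be sketched in plain C as below. This is a standalone sketch under simplifying assumptions, not the code in the patch: DemoUsage, demo_stop_finalize and demo_receive_tuples are hypothetical names, malloc/free stand in for palloc/pfree, and each tuple is assumed to report two buffer hits.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical finalized-usage copy handed back once per tuple. */
typedef struct DemoUsage
{
	long		shared_blks_hit;
} DemoUsage;

static int	demo_live_copies = 0;	/* copies allocated but not yet freed */

/* Stand-in for a "stop and finalize" call that returns a fresh copy. */
static DemoUsage *
demo_stop_finalize(long hits)
{
	DemoUsage  *copy = malloc(sizeof(DemoUsage));

	if (copy == NULL)
		abort();
	copy->shared_blks_hit = hits;
	demo_live_copies++;
	return copy;
}

static void
demo_release(DemoUsage *copy)
{
	free(copy);
	demo_live_copies--;
}

/* Per-tuple receive loop: add the stats, then free the copy right away. */
static long
demo_receive_tuples(int ntuples)
{
	long		total = 0;
	int			i;

	for (i = 0; i < ntuples; i++)
	{
		DemoUsage  *copy = demo_stop_finalize(2);

		total += copy->shared_blks_hit;
		demo_release(copy);		/* without this, one copy per tuple piles up */
	}
	return total;
}
```

After processing any number of tuples, demo_live_copies is back at zero, whereas without the release call it would grow by one per tuple until the surrounding context is reset.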
> > Basically it should be this instead, I think,
> > so we correctly call the table AM's table_index_fetch_tuple again if
> > call_again gets set:
>
> Right, this code will be better.
Implemented this fix in IndexNext, and also expanded the tracking of
table access to IndexNextWithReorder.
Regarding Index-Only Scans, I did not add instrumentation for table
access yet - I might add that in a follow-up revision or we could also
do it in a follow-on patch.
> > I don't know if the extra allocations
> > really matter, but I can see your point.
>
> Yeah, probably doesn't matter that much, but the code also wasn't that
> nice in that form. I didn't try to actually modify it, but by just
> looking at it the grouped option seemed cleaner to me, and the output
> should also be self-explanatory and logical to users.
Yep, fair point - I've now added two groups "Table Buffers" and "Table
I/O Timings" that get used in structured output.
---
See attached v8, rebased on latest master, which also fixes the issues
Zsolt pointed out in 0005 and 0007.
Additionally, two other minor changes in 0005 (the commit that adds
stack-based instrumentation):
1) Change the allocation for query/node instrumentation so that we
only use top-level memory context when WAL/buffer usage is requested
(i.e. instrumentation stack is needed) - easy enough to do, and makes
the timing/rows only case a bit cheaper.
2) Fix a missing addition to the remaining pgWalUsage global in the
parallel query case, when instrumentation is used. For context, to
avoid double counting we don't have the parallel workers call
ExecFinalizeNodeInstrumentation (instead the per-node numbers get
reported back to the leader, and that one bubbles them up), but that
also means that InstrAccumParallelQuery doesn't see the per-node
activity. This now gets added in ExecParallelRetrieveInstrumentation.
All other patches as before.
Thanks,
Lukas
--
Lukas Fittl
Attachments:
[application/octet-stream] v8-0001-instrumentation-Separate-trigger-logic-from-other.patch (9.7K, 2-v8-0001-instrumentation-Separate-trigger-logic-from-other.patch)
From 74b6480f37d2b923e84c8ad55f2ff8bce0e619ab Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v8 1/8] instrumentation: Separate trigger logic from other
uses
Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/commands/explain.c | 19 ++++++++-----------
src/backend/commands/trigger.c | 22 +++++++++++-----------
src/backend/executor/execMain.c | 2 +-
src/backend/executor/instrument.c | 26 ++++++++++++++++++++++++++
src/include/executor/instrument.h | 12 ++++++++++++
src/include/nodes/execnodes.h | 2 +-
src/tools/pgindent/typedefs.list | 1 +
7 files changed, 60 insertions(+), 24 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 93918a223b8..09b13807d92 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1099,18 +1099,15 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
for (nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
{
Trigger *trig = rInfo->ri_TrigDesc->triggers + nt;
- Instrumentation *instr = rInfo->ri_TrigInstrument + nt;
+ TriggerInstrumentation *tginstr = rInfo->ri_TrigInstrument + nt;
char *relname;
char *conname = NULL;
- /* Must clean up instrumentation state */
- InstrEndLoop(instr);
-
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
*/
- if (instr->ntuples == 0)
+ if (tginstr->firings == 0)
continue;
ExplainOpenGroup("Trigger", NULL, true, es);
@@ -1135,11 +1132,11 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
if (show_relname)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
- appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- INSTR_TIME_GET_MILLISEC(instr->total),
- instr->ntuples);
+ appendStringInfo(es->str, ": time=%.3f calls=%d\n",
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total),
+ tginstr->firings);
else
- appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
+ appendStringInfo(es->str, ": calls=%d\n", tginstr->firings);
}
else
{
@@ -1149,9 +1146,9 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Relation", relname, es);
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- INSTR_TIME_GET_MILLISEC(instr->total), 3,
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total), 3,
es);
- ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
+ ExplainPropertyInteger("Calls", NULL, tginstr->firings, es);
}
if (conname)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 98d402c0a3b..c3360073141 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -90,7 +90,7 @@ static bool TriggerEnabled(EState *estate, ResultRelInfo *relinfo,
static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2309,7 +2309,7 @@ static HeapTuple
ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2344,7 +2344,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2389,10 +2389,10 @@ ExecCallTriggerFunc(TriggerData *trigdata,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
return (HeapTuple) DatumGetPointer(result);
}
@@ -3936,7 +3936,7 @@ static void AfterTriggerExecute(EState *estate,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2);
@@ -4330,7 +4330,7 @@ AfterTriggerExecute(EState *estate,
ResultRelInfo *src_relInfo,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
- FmgrInfo *finfo, Instrumentation *instr,
+ FmgrInfo *finfo, TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2)
@@ -4371,7 +4371,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4588,10 +4588,10 @@ AfterTriggerExecute(EState *estate,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
}
@@ -4707,7 +4707,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
Relation rel = NULL;
TriggerDesc *trigdesc = NULL;
FmgrInfo *finfo = NULL;
- Instrumentation *instr = NULL;
+ TriggerInstrumentation *instr = NULL;
TupleTableSlot *slot1 = NULL,
*slot2 = NULL;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index bfd3ebc601e..1a3b8021600 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1270,7 +1270,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
}
else
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index a40610bc252..9354ad7be12 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -196,6 +196,32 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
WalUsageAdd(&dst->walusage, &add->walusage);
}
+/* Trigger instrumentation handling */
+TriggerInstrumentation *
+InstrAllocTrigger(int n, int instrument_options)
+{
+ TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ int i;
+
+ for (i = 0; i < n; i++)
+ InstrInit(&tginstr[i].instr, instrument_options);
+
+ return tginstr;
+}
+
+void
+InstrStartTrigger(TriggerInstrumentation *tginstr)
+{
+ InstrStartNode(&tginstr->instr);
+}
+
+void
+InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
+{
+ InstrStopNode(&tginstr->instr, 0);
+ tginstr->firings += firings;
+}
+
/* note current values during parallel executor startup */
void
InstrStartParallelQuery(void)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..a9c2233227f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -100,6 +100,13 @@ typedef struct WorkerInstrumentation
Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
+typedef struct TriggerInstrumentation
+{
+ Instrumentation instr;
+ int firings; /* number of times the instrumented trigger
+ * was fired */
+} TriggerInstrumentation;
+
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
@@ -111,6 +118,11 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+
+extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
+extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
+
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 63c067d5aae..a43bd428a91 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -524,7 +524,7 @@ typedef struct ResultRelInfo
ExprState **ri_TrigWhenExprs;
/* optional runtime measurements for triggers */
- Instrumentation *ri_TrigInstrument;
+ TriggerInstrumentation *ri_TrigInstrument;
/* On-demand created slots for triggers / returning processing */
TupleTableSlot *ri_ReturningSlot; /* for trigger output tuples */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ec8513d90b5..6e78674e282 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3163,6 +3163,7 @@ TriggerDesc
TriggerEvent
TriggerFlags
TriggerInfo
+TriggerInstrumentation
TriggerTransition
TruncateStmt
TsmRoutine
--
2.47.1
[application/octet-stream] v8-0004-instrumentation-Add-additional-regression-tests-c.patch (23.5K, 3-v8-0004-instrumentation-Add-additional-regression-tests-c.patch)
From a0ea75a6504eb9bc96e6d1e093f10e4d04dac8ef Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 22 Feb 2026 16:12:48 -0800
Subject: [PATCH v8 4/8] instrumentation: Add additional regression tests
covering buffer usage
This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.
Testing this is challenging because there can be different sources of
buffer activity, so we rely on temporary tables where we can to prove
that activity was captured and not lost. This supports a future commit
that will rework some of the instrumentation logic that could cause
areas covered by these tests to fail.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
.../pg_stat_statements/expected/utility.out | 70 +++++++
contrib/pg_stat_statements/expected/wal.out | 48 +++++
contrib/pg_stat_statements/sql/utility.sql | 56 +++++
contrib/pg_stat_statements/sql/wal.sql | 33 +++
src/test/regress/expected/explain.out | 197 ++++++++++++++++++
src/test/regress/sql/explain.sql | 194 +++++++++++++++++
6 files changed, 598 insertions(+)
diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index e4d6564ea5b..cba487f6be5 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -289,6 +289,76 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
1 | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
(3 rows)
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT run_explain_buffers_test();
+ run_explain_buffers_test
+--------------------------
+
+(1 row)
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+ query | has_buffer_stats
+-----------------------------------+------------------
+ SELECT run_explain_buffers_test() | t
+(1 row)
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+CALL pgss_call_rollback_proc();
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------------+-------+---------------+---------------------+-----------------------
+ CALL pgss_call_rollback_proc() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/expected/wal.out b/contrib/pg_stat_statements/expected/wal.out
index 977e382d848..611213daef6 100644
--- a/contrib/pg_stat_statements/expected/wal.out
+++ b/contrib/pg_stat_statements/expected/wal.out
@@ -28,3 +28,51 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
t
(1 row)
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT pgss_error_func();
+ pgss_error_func
+-----------------
+
+(1 row)
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------+-------+---------------+---------------------+-----------------------
+ SELECT pgss_error_func() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index dd97203c210..7540e49c73c 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -152,6 +152,62 @@ EXPLAIN (costs off) SELECT a FROM generate_series(1,10) AS tab(a) WHERE a = 7;
SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+SELECT run_explain_buffers_test();
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+CALL pgss_call_rollback_proc();
+
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
+
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/sql/wal.sql b/contrib/pg_stat_statements/sql/wal.sql
index 1dc1552a81e..467e321b206 100644
--- a/contrib/pg_stat_statements/sql/wal.sql
+++ b/contrib/pg_stat_statements/sql/wal.sql
@@ -18,3 +18,36 @@ wal_records > 0 as wal_records_generated,
wal_records >= rows as wal_records_ge_rows
FROM pg_stat_statements ORDER BY query COLLATE "C";
SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT pgss_error_func();
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..e28e7543693 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,200 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+ exception_buffers_nested_visible
+----------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_cursor_explain_buffers();
+ noscroll_ok | scroll_ok
+-------------+-----------
+ t | t
+(1 row)
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_parallel_explain_buffers();
+ ratio
+-------
+ 1
+(1 row)
+
+DROP FUNCTION check_parallel_explain_buffers;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..cf5c6335a19 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,197 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_cursor_explain_buffers();
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_parallel_explain_buffers();
+
+DROP FUNCTION check_parallel_explain_buffers;
--
2.47.1
[application/octet-stream] v8-0002-instrumentation-Separate-per-node-logic-from-othe.patch (26.3K, 4-v8-0002-instrumentation-Separate-per-node-logic-from-othe.patch)
From 9288c1968b8004ef0496cf8ed1a8e30c436d7450 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 01:19:50 -0800
Subject: [PATCH v8 2/8] instrumentation: Separate per-node logic from other uses
Previously, several places (e.g. the query "total time") repurposed the
Instrumentation struct, which was originally introduced to capture per-node
statistics during execution. This overuse of one struct is confusing,
since it scatters InstrStartNode/InstrStopNode calls across unrelated
code paths, and it hinders future refactoring.
Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.
Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.
In passing, drop the "n" argument to InstrAlloc, as all remaining callers
need exactly one Instrumentation struct.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
contrib/auto_explain/auto_explain.c | 8 +-
.../pg_stat_statements/pg_stat_statements.c | 8 +-
contrib/postgres_fdw/postgres_fdw.c | 2 +-
src/backend/commands/explain.c | 20 +--
src/backend/executor/execMain.c | 8 +-
src/backend/executor/execParallel.c | 24 +--
src/backend/executor/execProcnode.c | 4 +-
src/backend/executor/instrument.c | 142 ++++++++++++------
src/include/executor/instrument.h | 60 +++++---
src/include/nodes/execnodes.h | 6 +-
src/tools/pgindent/typedefs.list | 3 +-
11 files changed, 172 insertions(+), 113 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index e856cd35a6f..39bf2543b70 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 4a427533bd8..388b068ccec 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1023,7 +1023,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -1082,12 +1082,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
if (queryId != INT64CONST(0) && queryDesc->totaltime &&
pgss_enabled(nesting_level))
{
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
pgss_store(queryDesc->sourceText,
queryId,
queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 60d90329a65..6f0cb2a285b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2778,7 +2778,7 @@ postgresIterateDirectModify(ForeignScanState *node)
if (!resultRelInfo->ri_projectReturning)
{
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- Instrumentation *instr = node->ss.ps.instrument;
+ NodeInstrumentation *instr = node->ss.ps.instrument;
Assert(!dmstate->has_returning);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 09b13807d92..389181b8d9b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1835,7 +1835,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
double nloops = planstate->instrument->nloops;
double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
- double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
+ double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->instr.total) / nloops;
double rows = planstate->instrument->ntuples / nloops;
if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1888,11 +1888,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* prepare per-worker general execution details */
if (es->workers_state && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
double startup_ms;
double total_ms;
@@ -1901,7 +1901,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (nloops <= 0)
continue;
startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
- total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
+ total_ms = INSTR_TIME_GET_MILLISEC(instrument->instr.total) / nloops;
rows = instrument->ntuples / nloops;
ExplainOpenWorker(n, es);
@@ -2288,18 +2288,18 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage);
if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ show_wal_usage(es, &planstate->instrument->instr.walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
if (nloops <= 0)
@@ -2307,9 +2307,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage);
if (es->wal)
- show_wal_usage(es, &instrument->walusage);
+ show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 1a3b8021600..c0b174cfbc0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -331,7 +331,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/*
* extract information from the query descriptor and the query feature.
@@ -383,7 +383,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -433,7 +433,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -443,7 +443,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
AfterTriggerEndQuery(estate);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ac84af294c9..c153d5c1c3b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,7 +87,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object. This will depend on the length of
+ * of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -104,11 +104,15 @@ struct SharedExecutorInstrumentation
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
- /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+ /*
+ * array of num_plan_nodes * num_workers NodeInstrumentation objects
+ * follows
+ */
};
#define GetInstrumentationArray(sei) \
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
- (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+ (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
@@ -725,7 +729,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len = MAXALIGN(instrumentation_len);
instrument_offset = instrumentation_len;
instrumentation_len +=
- mul_size(sizeof(Instrumentation),
+ mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -811,7 +815,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*/
if (estate->es_instrument)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -821,7 +825,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInit(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1053,7 +1057,7 @@ static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
int n;
int ibytes;
@@ -1081,9 +1085,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* Switch into per-query memory context.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
planstate->worker_instrument =
- palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+ palloc(ibytes + offsetof(WorkerNodeInstrumentation, instrument));
MemoryContextSwitchTo(oldcontext);
planstate->worker_instrument->num_workers = instrumentation->num_workers;
@@ -1313,7 +1317,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
{
int i;
int plan_node_id = planstate->plan->plan_node_id;
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
InstrEndLoop(planstate->instrument);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7e40b852517..1846661b503 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -413,8 +413,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument,
- result->async_capable);
+ result->instrument = InstrAllocNode(estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 9354ad7be12..bc551f95a08 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,51 +26,30 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int instrument_options)
{
- Instrumentation *instr;
-
- /* initialize all fields to zeroes, then modify as needed */
- instr = palloc0(n * sizeof(Instrumentation));
- if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
- {
- bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
- bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
- int i;
-
- for (i = 0; i < n; i++)
- {
- instr[i].need_bufusage = need_buffers;
- instr[i].need_walusage = need_wal;
- instr[i].need_timer = need_timer;
- instr[i].async_mode = async_mode;
- }
- }
-
+ Instrumentation *instr = palloc0(sizeof(Instrumentation));
+ InstrInitOptions(instr, instrument_options);
return instr;
}
-/* Initialize a pre-allocated instrumentation structure. */
void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- memset(instr, 0, sizeof(Instrumentation));
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-/* Entry to a plan node */
void
-InstrStartNode(Instrumentation *instr)
+InstrStart(Instrumentation *instr)
{
if (instr->need_timer)
{
if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ elog(ERROR, "InstrStart called twice in a row");
else
INSTR_TIME_SET_CURRENT(instr->starttime);
}
@@ -83,24 +62,19 @@ InstrStartNode(Instrumentation *instr)
instr->walusage_start = pgWalUsage;
}
-/* Exit from a plan node */
void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
-
/* let's update the time only if the timer was requested */
if (instr->need_timer)
{
if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStopNode called without start");
+ elog(ERROR, "InstrStop called without start");
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
}
@@ -113,6 +87,74 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (instr->need_walusage)
WalUsageAccumDiff(&instr->walusage,
&pgWalUsage, &instr->walusage_start);
+}
+
+/* Node instrumentation handling */
+
+/* Allocate new node instrumentation structure */
+NodeInstrumentation *
+InstrAllocNode(int instrument_options, bool async_mode)
+{
+ NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+
+ InstrInitNode(instr, instrument_options);
+ instr->async_mode = async_mode;
+
+ return instr;
+}
+
+/* Initialize a pre-allocated instrumentation structure. */
+void
+InstrInitNode(NodeInstrumentation *instr, int instrument_options)
+{
+ memset(instr, 0, sizeof(NodeInstrumentation));
+ InstrInitOptions(&instr->instr, instrument_options);
+}
+
+/* Entry to a plan node */
+void
+InstrStartNode(NodeInstrumentation *instr)
+{
+ InstrStart(&instr->instr);
+}
+
+/* Exit from a plan node */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ double save_tuplecount = instr->tuplecount;
+ instr_time endtime;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
+
+ /*
+ * Update the time only if the timer was requested.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+ if (instr->instr.need_timer)
+ {
+ if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
+ elog(ERROR, "InstrStopNode called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
+ }
+
+ /* Add delta of buffer usage since entry to node's totals */
+ if (instr->instr.need_bufusage)
+ BufferUsageAccumDiff(&instr->instr.bufusage,
+ &pgBufferUsage, &instr->instr.bufusage_start);
+
+ if (instr->instr.need_walusage)
+ WalUsageAccumDiff(&instr->instr.walusage,
+ &pgWalUsage, &instr->instr.walusage_start);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -133,7 +175,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
/* Update tuple count */
void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
{
/* count the returned tuples */
instr->tuplecount += nTuples;
@@ -141,24 +183,24 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation *instr)
{
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
elog(ERROR, "InstrEndLoop called on running node");
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
- INSTR_TIME_ADD(instr->total, instr->counter);
+ INSTR_TIME_ADD(instr->instr.total, instr->counter);
instr->ntuples += instr->tuplecount;
instr->nloops += 1;
/* Reset for next cycle (if any) */
instr->running = false;
- INSTR_TIME_SET_ZERO(instr->starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
INSTR_TIME_SET_ZERO(instr->counter);
INSTR_TIME_SET_ZERO(instr->firsttuple);
instr->tuplecount = 0;
@@ -166,7 +208,7 @@ InstrEndLoop(Instrumentation *instr)
/* aggregate instrumentation information */
void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
{
if (!dst->running && add->running)
{
@@ -181,7 +223,7 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->tuplecount += add->tuplecount;
INSTR_TIME_ADD(dst->startup, add->startup);
- INSTR_TIME_ADD(dst->total, add->total);
+ INSTR_TIME_ADD(dst->instr.total, add->instr.total);
dst->ntuples += add->ntuples;
dst->ntuples2 += add->ntuples2;
dst->nloops += add->nloops;
@@ -189,11 +231,11 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->need_bufusage)
- BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ if (dst->instr.need_bufusage)
+ BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
- if (dst->need_walusage)
- WalUsageAdd(&dst->walusage, &add->walusage);
+ if (dst->instr.need_walusage)
+ WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
}
/* Trigger instrumentation handling */
@@ -204,7 +246,7 @@ InstrAllocTrigger(int n, int instrument_options)
int i;
for (i = 0; i < n; i++)
- InstrInit(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, instrument_options);
return tginstr;
}
@@ -212,13 +254,13 @@ InstrAllocTrigger(int n, int instrument_options)
void
InstrStartTrigger(TriggerInstrumentation *tginstr)
{
- InstrStartNode(&tginstr->instr);
+ InstrStart(&tginstr->instr);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
- InstrStopNode(&tginstr->instr, 0);
+ InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a9c2233227f..b11d64633b5 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -67,38 +67,55 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/*
+ * General purpose instrumentation that can capture time and WAL/buffer usage
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured in between).
+ */
typedef struct Instrumentation
{
- /* Parameters set at node creation: */
+ /* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
bool need_bufusage; /* true if we need buffer usage data */
bool need_walusage; /* true if we need WAL usage data */
+ /* Internal state keeping: */
+ instr_time starttime; /* start time of last InstrStart */
+ BufferUsage bufusage_start; /* buffer usage at start */
+ WalUsage walusage_start; /* WAL usage at start */
+ /* Accumulated statistics: */
+ instr_time total; /* total runtime */
+ BufferUsage bufusage; /* total buffer usage */
+ WalUsage walusage; /* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
+{
+ Instrumentation instr;
+ /* Parameters set at node creation: */
bool async_mode; /* true if node is in async mode */
/* Info about current plan cycle: */
bool running; /* true if we've completed first tuple */
- instr_time starttime; /* start time of current iteration of node */
instr_time counter; /* accumulated runtime for this node */
instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
instr_time startup; /* total startup time */
- instr_time total; /* total time */
double ntuples; /* total tuples produced */
double ntuples2; /* secondary node-specific tuple counter */
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
-typedef struct WorkerInstrumentation
+typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
- Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
-} WorkerInstrumentation;
+ NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+} WorkerNodeInstrumentation;
typedef struct TriggerInstrumentation
{
@@ -110,14 +127,19 @@ typedef struct TriggerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
- bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int instrument_options);
+extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr);
+
+extern NodeInstrumentation *InstrAllocNode(int instrument_options,
+ bool async_mode);
+extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation *instr);
+extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation *instr);
+extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a43bd428a91..605c7a6cc39 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1175,8 +1175,10 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
- Instrumentation *instrument; /* Optional runtime stats for this node */
- WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
+ NodeInstrumentation *instrument; /* Optional runtime stats for this
+ * node */
+ WorkerNodeInstrumentation *worker_instrument; /* per-worker
+ * instrumentation */
/* Per-worker JIT instrumentation */
struct SharedJitInstrumentation *worker_jit_instrument;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6e78674e282..e05a1e52db4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1786,6 +1786,7 @@ NextSampleBlock_function
NextSampleTuple_function
NextValueExpr
Node
+NodeInstrumentation
NodeTag
NonEmptyRange
NoneCompressorState
@@ -3384,9 +3385,9 @@ WorkTableScan
WorkTableScanState
WorkerInfo
WorkerInfoData
-WorkerInstrumentation
WorkerJobDumpPtrType
WorkerJobRestorePtrType
+WorkerNodeInstrumentation
Working_State
WriteBufPtrType
WriteBytePtrType
--
2.47.1
[application/octet-stream] v8-0003-instrumentation-Replace-direct-changes-of-pgBuffe.patch (9.9K, 5-v8-0003-instrumentation-Replace-direct-changes-of-pgBuffe.patch)
From 7ffc07433d15880aa1ba848e1b7b4544aedaaa78 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 12:12:39 -0800
Subject: [PATCH v8 3/8] instrumentation: Replace direct changes of
pgBufferUsage/pgWalUsage with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
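As a rough illustration of why routing the updates through macros helps (a sketch, not code from this patch): once call sites name only the field, the storage behind the macro can change without touching them. The Demo* names below are hypothetical stand-ins for BufferUsage/pgBufferUsage, and the pointer indirection is only an assumption about what a later refactoring could look like.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for BufferUsage; only two counters are shown. */
typedef struct DemoBufferUsage
{
    int64_t shared_blks_hit;
    int64_t shared_blks_read;
} DemoBufferUsage;

/* The global counter, plus the location the macros currently write to. */
DemoBufferUsage demo_global_usage;
DemoBufferUsage *demo_target = &demo_global_usage;

/* Call sites name only the field; where it lives is the macro's business. */
#define DEMO_BUFUSAGE_INCR(fld) do { \
        demo_target->fld++; \
    } while (0)
#define DEMO_BUFUSAGE_ADD(fld, val) do { \
        demo_target->fld += (val); \
    } while (0)

/* Simulated buffer manager activity: nhit cache hits and one multi-block read. */
void
demo_read_blocks(int nhit, int nread)
{
    assert(nhit >= 0 && nread >= 0);

    for (int i = 0; i < nhit; i++)
        DEMO_BUFUSAGE_INCR(shared_blks_hit);
    DEMO_BUFUSAGE_ADD(shared_blks_read, nread);
}
```

Pointing demo_target at a different DemoBufferUsage redirects every call site at once, which is the kind of change that would be much harder if each site wrote to the global directly.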
Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
---
src/backend/access/transam/xlog.c | 10 +++++-----
src/backend/executor/instrument.c | 1 -
src/backend/storage/buffer/bufmgr.c | 24 ++++++++++++------------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
7 files changed, 47 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f5c9a34374d..9b33584f454 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1081,10 +1081,10 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
- pgWalUsage.wal_fpi_bytes += fpi_bytes;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
+ INSTR_WALUSAGE_ADD(wal_fpi_bytes, fpi_bytes);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2063,7 +2063,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = InvalidXLogRecPtr;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index bc551f95a08..6a4a08ebb0c 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -54,7 +54,6 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- /* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
instr->bufusage_start = pgBufferUsage;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 00bc609529a..dfa37e5ed44 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -835,7 +835,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -856,7 +856,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
if (BufferTagsEqual(&tag, &bufHdr->tag))
{
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
UnpinBuffer(bufHdr);
@@ -1257,14 +1257,14 @@ PinBufferForBlock(Relation rel,
{
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
if (*foundPtr)
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
}
else
{
bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
strategy, foundPtr, io_context);
if (*foundPtr)
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
}
if (rel)
{
@@ -1998,9 +1998,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
if (operation->rel)
pgstat_count_buffer_hit(operation->rel);
@@ -2068,9 +2068,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it.
@@ -2959,7 +2959,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -3105,7 +3105,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4520,7 +4520,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state.
@@ -5663,7 +5663,7 @@ MarkSharedBufferDirtyHint(Buffer buffer, BufferDesc *bufHdr, uint64 lockstate,
UnlockBufHdr(bufHdr);
}
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 404c6bccbdd..8845b0aeed6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -217,7 +217,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -478,7 +478,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -509,7 +509,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u64(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c4afe4d368a..8b501dfcadd 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -475,13 +475,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -549,13 +549,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 28de24538dc..9e7a88ec0d0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b11d64633b5..1139be8333e 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -153,4 +153,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += val; \
+ } while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += val; \
+ } while(0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
[application/octet-stream] v8-0005-Optimize-measuring-WAL-buffer-usage-through-stack.patch (68.2K, 6-v8-0005-Optimize-measuring-WAL-buffer-usage-through-stack.patch)
From 8b70adeaa5017f26cf9747fdd28995f12e241e91 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v8 5/8] Optimize measuring WAL/buffer usage through
stack-based instrumentation
Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that get
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode and popped at InstrStopNode. Stack entries are
zero-initialized (avoiding the diff mechanism) and get added to their parent
entry when they are finalized, i.e. once no more modifications can occur.
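The mechanism described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the patch's implementation: all Demo*/demo_* names are hypothetical, only two counters are tracked, and the parent is passed explicitly at finalize time to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced set of usage counters. */
typedef struct DemoUsage
{
    int64_t blks_hit;
    int64_t wal_records;
} DemoUsage;

/* One stack entry; counters start at zero, so no start snapshot is needed. */
typedef struct DemoInstr
{
    DemoUsage usage;            /* activity recorded while entry was current */
    struct DemoInstr *previous; /* entry to restore when this one is popped */
} DemoInstr;

/* Current top of the stack; NULL when nothing is being measured. */
DemoInstr *demo_instr_stack = NULL;

/* Enter a measured section: make instr the target of all counting. */
void
demo_instr_push(DemoInstr *instr)
{
    instr->previous = demo_instr_stack;
    demo_instr_stack = instr;
}

/* Leave a measured section: restore whatever was current before. */
void
demo_instr_pop(DemoInstr *instr)
{
    assert(demo_instr_stack == instr);
    demo_instr_stack = instr->previous;
}

/* Once a child can no longer change, fold its counters into the parent. */
void
demo_instr_finalize(const DemoInstr *child, DemoInstr *parent)
{
    parent->usage.blks_hit += child->usage.blks_hit;
    parent->usage.wal_records += child->usage.wal_records;
}

/* The activity itself writes straight into the current entry. */
void
demo_count_hit(void)
{
    if (demo_instr_stack != NULL)
        demo_instr_stack->usage.blks_hit++;
}
```

In this sketch the per-exit cost is a single pointer assignment in demo_instr_pop, whereas a diff-based approach has to subtract every counter field each time a section ends.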
To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct and its
associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.
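To illustrate the second option in isolation, the sketch below uses plain setjmp/longjmp in place of PG_TRY/PG_FINALLY. It is an assumption-laden simplification rather than the patch's code: the demo_* names are hypothetical, a single global jmp_buf means measured sections cannot be nested, and the error is reported through a return value instead of being re-thrown as PG_FINALLY would do.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stack entry with a single counter. */
typedef struct DemoEntry
{
    struct DemoEntry *previous;
    long        hits;
} DemoEntry;

DemoEntry  *demo_stack_top = NULL;
jmp_buf     demo_error_jmp;     /* one level only; no nesting in this sketch */

/* Work that completes normally. */
void
demo_ok_work(void)
{
    demo_stack_top->hits += 2;
}

/* Work that records some activity and then "aborts". */
void
demo_failing_work(void)
{
    demo_stack_top->hits++;
    longjmp(demo_error_jmp, 1);
}

/*
 * Run work() with entry as the current stack entry. The stack pointer is
 * restored on both the normal and the error path, so an abort cannot leave
 * it pointing at an entry that is about to go out of scope.
 * Returns 1 if work() aborted, 0 otherwise.
 */
int
demo_measure(DemoEntry *entry, void (*work) (void))
{
    DemoEntry  *saved_top = demo_stack_top; /* not modified after setjmp */
    int         failed = 0;

    assert(entry != NULL && work != NULL);

    entry->previous = saved_top;
    demo_stack_top = entry;

    if (setjmp(demo_error_jmp) == 0)
        work();
    else
        failed = 1;

    /* Cleanup shared by both paths, in the role PG_FINALLY plays. */
    demo_stack_top = saved_top;
    return failed;
}
```

Activity recorded before the abort stays in the entry, so a caller can still decide whether to report or discard it.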
This also drops the global pgBufferUsage; any callers interested in
measuring buffer activity should instead utilize InstrStart/InstrStop.
The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and heap_page_prune_and_freeze for
measuring FPIs.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
contrib/auto_explain/auto_explain.c | 16 +-
.../pg_stat_statements/pg_stat_statements.c | 87 +---
src/backend/access/brin/brin.c | 10 +-
src/backend/access/gin/gininsert.c | 10 +-
src/backend/access/heap/vacuumlazy.c | 15 +-
src/backend/access/nbtree/nbtsort.c | 10 +-
src/backend/commands/analyze.c | 31 +-
src/backend/commands/explain.c | 26 +-
src/backend/commands/explain_dr.c | 34 +-
src/backend/commands/prepare.c | 27 +-
src/backend/commands/vacuumparallel.c | 10 +-
src/backend/executor/execMain.c | 66 ++-
src/backend/executor/execParallel.c | 22 +-
src/backend/executor/execProcnode.c | 84 +++-
src/backend/executor/instrument.c | 412 ++++++++++++++----
src/backend/storage/buffer/bufmgr.c | 6 +-
src/backend/utils/activity/pgstat_io.c | 6 +-
src/include/executor/execdesc.h | 4 +-
src/include/executor/executor.h | 2 +
src/include/executor/instrument.h | 179 +++++++-
src/include/utils/resowner.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
22 files changed, 760 insertions(+), 300 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 39bf2543b70..4be81489ff4 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -305,19 +305,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
if (auto_explain_enabled())
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -382,7 +372,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
/* Log plan if duration is exceeded. */
- msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
+ msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total);
if (msec >= auto_explain_log_min_duration)
{
ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 388b068ccec..8448f9c13fa 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -909,22 +909,11 @@ pgss_planner(Query *parse,
&& pgss_track_planning && query_string
&& parse->queryId != INT64CONST(0))
{
- instr_time start;
- instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
-
- /*
- * Similarly the planner could write some WAL records in some cases
- * (e.g. setting a hint bit with those being WAL-logged)
- */
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ /* Track time and buffer/WAL usage as the planner can access them. */
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -938,30 +927,20 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
parse->stmt_len,
PGSS_PLAN,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
0,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
@@ -1013,19 +992,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
*/
if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -1087,10 +1056,10 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
queryDesc->plannedstmt->stmt_location,
queryDesc->plannedstmt->stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
+ INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total),
queryDesc->estate->es_total_processed,
- &queryDesc->totaltime->bufusage,
- &queryDesc->totaltime->walusage,
+ &queryDesc->totaltime->instr.bufusage,
+ &queryDesc->totaltime->instr.walusage,
queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
NULL,
queryDesc->estate->es_parallel_workers_to_launch,
@@ -1154,17 +1123,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
!IsA(parsetree, ExecuteStmt) &&
!IsA(parsetree, PrepareStmt))
{
- instr_time start;
- instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -1180,6 +1143,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
@@ -1194,9 +1158,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
* former value, which'd otherwise be a good idea.
*/
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
/*
* Track the total number of rows retrieved or affected by the utility
* statements of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED
@@ -1208,23 +1169,15 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
saved_stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
rows,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1909c3254b5..d62eb7dee9c 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2432,8 +2432,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2884,6 +2884,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2933,7 +2934,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2948,7 +2949,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 97cea5f7d4e..8cdcd2a9bec 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -988,8 +988,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2114,6 +2114,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2182,7 +2183,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2197,7 +2198,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 82c5b28e0ad..b4cbd0e682c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -641,8 +641,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ QueryInstrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -658,6 +657,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -983,14 +984,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ instr = InstrQueryStopFinalize(instr);
+
if (verbose || params.log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_vacuum_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -999,12 +1000,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
+ BufferUsage bufferusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
total_blks_hit = bufferusage.shared_blks_hit +
bufferusage.local_blks_hit;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 69ef1527e06..dfe4fd9459c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1465,8 +1465,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -1752,6 +1752,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1827,7 +1828,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1837,7 +1838,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 53adac9139b..38f8b379fa4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -308,9 +308,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ QueryInstrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -361,6 +359,9 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -741,12 +742,13 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ instr = InstrQueryStopFinalize(instr);
+
if (verbose || params.log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_analyze_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -754,18 +756,15 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
-
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ BufferUsage bufusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
+
+ total_blks_hit = bufusage.shared_blks_hit +
+ bufusage.local_blks_hit;
+ total_blks_read = bufusage.shared_blks_read +
+ bufusage.local_blks_read;
+ total_blks_dirtied = bufusage.shared_blks_dirtied +
+ bufusage.local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 389181b8d9b..aa76f68bd10 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -322,14 +322,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ QueryInstrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instr = InstrQueryAlloc(INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+ else
+ instr = InstrQueryAlloc(INSTRUMENT_TIMER);
+
if (es->memory)
{
/*
@@ -346,15 +348,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrQueryStart(instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ instr = InstrQueryStopFinalize(instr);
if (es->memory)
{
@@ -362,16 +361,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 3c96061cf32..c9695b03a60 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,20 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ QueryInstrumentation *instr = NULL;
/* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ if (myState->es->timing || myState->es->buffers)
+ {
+ InstrumentOption instrument_options = 0;
+
+ if (myState->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (myState->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ instr = InstrQueryAlloc(instrument_options);
+ InstrQueryStart(instr);
+ }
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +191,19 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
+ if (myState->es->timing || myState->es->buffers)
+ instr = InstrQueryStopFinalize(instr);
+
/* Update timing data */
if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
+ INSTR_TIME_ADD(myState->metrics.timeSpent, instr->instr.total);
/* Update buffer metrics */
if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ BufferUsageAdd(&myState->metrics.bufferUsage, &instr->instr.bufusage);
+
+ if (myState->es->timing || myState->es->buffers)
+ pfree(instr);
return true;
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 5b86a727587..d81f6b30e9c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -578,13 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ QueryInstrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ int instrument_options = INSTRUMENT_TIMER;
+
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -596,9 +599,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrQueryStart(instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -633,8 +634,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ instr = InstrQueryStopFinalize(instr);
if (es->memory)
{
@@ -642,13 +642,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -658,7 +651,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 279108ca89f..75074fe4efa 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -308,8 +308,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgBufferUsage or
- * pgWalUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -995,6 +995,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1084,7 +1085,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1092,7 +1093,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c0b174cfbc0..82253317e96 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -76,6 +76,7 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
/* decls for local routines only used within this module */
static void InitPlan(QueryDesc *queryDesc, int eflags);
static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecFinalizeTriggerInstrumentation(EState *estate);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(QueryDesc *queryDesc,
@@ -329,9 +330,28 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
- /* Allow instrumentation of Executor overall runtime */
+ /*
+ * Start up required top-level instrumentation stack for WAL/buffer
+ * tracking
+ */
+ if (!queryDesc->totaltime && (estate->es_instrument & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)))
+ queryDesc->totaltime = InstrQueryAlloc(estate->es_instrument);
+
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ {
+ /* Allow instrumentation of Executor overall runtime */
+ InstrQueryStart(queryDesc->totaltime);
+
+ /*
+ * Remember all node entries for abort recovery. We do this once here
+ * after the first call to InstrQueryStart has pushed the parent
+ * entry.
+ */
+ if ((estate->es_instrument & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) &&
+ !queryDesc->already_executed)
+ ExecRememberNodeInstrumentation(queryDesc->planstate,
+ queryDesc->totaltime);
+ }
/*
* extract information from the query descriptor and the query feature.
@@ -383,7 +403,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -433,7 +453,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ InstrQueryStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -442,8 +462,26 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
+ /*
+ * Accumulate per-node and trigger statistics to their respective parent
+ * instrumentation stacks.
+ *
+ * We skip this in parallel workers because their per-node stats are
+ * reported individually via ExecParallelReportInstrumentation, and the
+ * leader's own ExecFinalizeNodeInstrumentation handles propagation. If
+ * we accumulated here, the leader would double-count: worker parent nodes
+ * would already include their children's stats, and then the leader's
+ * accumulation would add the children again.
+ */
+ if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ queryDesc->totaltime = InstrQueryStopFinalize(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
@@ -1484,6 +1522,24 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
return resultRelInfo->ri_ancestorResultRels;
}
+static void
+ExecFinalizeTriggerInstrumentation(EState *estate)
+{
+ List *rels = NIL;
+
+ rels = list_concat(rels, estate->es_tuple_routing_result_relations);
+ rels = list_concat(rels, estate->es_opened_result_relations);
+ rels = list_concat(rels, estate->es_trig_target_relations);
+
+ foreach_node(ResultRelInfo, rInfo, rels)
+ {
+ TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
+
+ if (ti && (ti->instr.need_bufusage || ti->instr.need_walusage))
+ InstrAccum(instr_stack.current, &ti->instr);
+ }
+}
+
/* ----------------------------------------------------------------
* ExecPostprocessPlan
*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index c153d5c1c3b..73534fa6c7e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -694,7 +694,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
- * looking at pgBufferUsage, so do it unconditionally.
+ * looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1075,8 +1075,22 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
instrument = GetInstrumentationArray(instrumentation);
instrument += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
+ {
InstrAggNode(planstate->instrument, &instrument[n]);
+ /*
+ * Also add worker WAL usage to the global pgWalUsage counter.
+ *
+ * When per-node instrumentation is active, parallel workers skip
+ * ExecFinalizeNodeInstrumentation (to avoid double-counting in
+ * EXPLAIN), so per-node WAL activity is not rolled up into the
+ * query-level stats that InstrAccumParallelQuery receives. Without
+ * this, pgWalUsage would under-report WAL generated by parallel
+ * workers when instrumentation is active.
+ */
+ WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
+ }
+
/*
* Also store the per-worker detail.
*
@@ -1456,6 +1470,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1516,7 +1531,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1532,7 +1547,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 1846661b503..c788b5b00f9 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -122,6 +122,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
+static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -787,10 +789,10 @@ ExecShutdownNode_walker(PlanState *node, void *context)
* at least once already. We don't expect much CPU consumption during
* node shutdown, but in the case of Gather or Gather Merge, we may shut
* down workers at this stage. If so, their buffer usage will get
- * propagated into pgBufferUsage at this point, and we want to make sure
- * that it gets associated with the Gather node. We skip this if the node
- * has never been executed, so as to avoid incorrectly making it appear
- * that it has.
+ * propagated into the current instrumentation stack entry at this point,
+ * and we want to make sure that it gets associated with the Gather node.
+ * We skip this if the node has never been executed, so as to avoid
+ * incorrectly making it appear that it has.
*/
if (node->instrument && node->instrument->running)
InstrStartNode(node->instrument);
@@ -828,6 +830,80 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecRememberNodeInstrumentation
+ *
+ * Register all per-node instrumentation entries as unfinalized children of
+ * the executor's instrumentation. This is needed for abort recovery: if the
+ * executor aborts, we need to walk each per-node entry to recover buffer/WAL
+ * data from nodes that never got finalized, so that aggregate totals still
+ * include their activity.
+ */
+void
+ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent)
+{
+ (void) ExecRememberNodeInstrumentation_walker(node, parent);
+}
+
+static bool
+ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ QueryInstrumentation *parent = (QueryInstrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ if (node->instrument)
+ InstrQueryRememberNode(parent, node->instrument);
+
+ return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
+}
+
+/*
+ * ExecFinalizeNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation).
+ *
+ * This must run after the cleanup done by ExecShutdownNode, and not rely on any
+ * resources cleaned up by it. We also expect shutdown actions to have occurred,
+ * e.g. parallel worker instrumentation to have been added to the leader.
+ */
+void
+ExecFinalizeNodeInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeNodeInstrumentation_walker(node, instr_stack.current);
+}
+
+static bool
+ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ Instrumentation *parent = (Instrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ /*
+ * Recurse into children first (bottom-up accumulation), passing our
+ * instrumentation as the parent context. This ensures children can
+ * accumulate to us even if they were never executed by the leader (e.g.
+ * nodes beneath Gather that only workers ran).
+ */
+ planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
+ node->instrument ? &node->instrument->instr : parent);
+
+ if (!node->instrument)
+ return false;
+
+ node->instrument = InstrFinalizeNode(node->instrument, parent);
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
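The bottom-up accumulation done by ExecFinalizeNodeInstrumentation_walker above can be illustrated with a small standalone sketch. This is an assumption-laden simplification, not the patch's code: SketchNode, its fields and finalize_tree are hypothetical names, a binary tree stands in for the plan state tree, and a single counter stands in for the buffer/WAL usage structs. The point it shows is that children are finalized first, and that an uninstrumented node passes its own parent down so that its children still accumulate somewhere.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical plan node; blks_hit excludes children until finalized. */
typedef struct SketchNode
{
	long		blks_hit;		/* usage charged directly to this node */
	int			instrumented;	/* nonzero if this node tracks usage */
	struct SketchNode *left;
	struct SketchNode *right;
} SketchNode;

/*
 * Walk the tree bottom-up. Each instrumented node first receives its
 * children's totals, then adds its own (now inclusive) total to its parent.
 */
static void
finalize_tree(SketchNode *node, long *parent_total)
{
	long	   *own;

	if (node == NULL)
		return;

	/* Uninstrumented nodes forward the parent's accumulator to children. */
	own = node->instrumented ? &node->blks_hit : parent_total;

	finalize_tree(node->left, own);
	finalize_tree(node->right, own);

	if (node->instrumented)
		*parent_total += node->blks_hit;
}
```

With a scan node that recorded 5 hits below an aggregate that recorded 2, finalizing from the root leaves the scan at 5, the aggregate at 7, and adds 7 to the query-level total. Running the same accumulation again in a second place would count the scan twice, which is the double-counting concern the patch's comments raise for parallel workers.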
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6a4a08ebb0c..bd8ae3fdcc0 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,25 +16,31 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
-BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
-static WalUsage save_pgWalUsage;
+Instrumentation instr_top;
+InstrStackState instr_stack = {0, 0, NULL, &instr_top};
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
-static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-
-
-/* General purpose instrumentation handling */
-Instrumentation *
-InstrAlloc(int instrument_options)
+void
+InstrStackGrow(void)
{
- Instrumentation *instr = palloc0(sizeof(Instrumentation));
- InstrInitOptions(instr, instrument_options);
- return instr;
+ if (instr_stack.entries == NULL)
+ {
+ instr_stack.stack_space = 10; /* Allocate sufficient initial space
+ * for typical activity */
+ instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
+ sizeof(Instrumentation *) * instr_stack.stack_space);
+ }
+ else
+ {
+ instr_stack.stack_space *= 2;
+ instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, instr_stack.stack_space);
+ }
}
+/* General purpose instrumentation handling */
void
InstrInitOptions(Instrumentation *instr, int instrument_options)
{
@@ -54,38 +60,257 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_bufusage || instr->need_walusage)
+ InstrPushStack(instr);
+}
+
+static void
+InstrStopTimer(Instrumentation *instr)
+{
+ instr_time endtime;
+
+ /* the timer must have been started before it can be stopped */
+ if (INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStop called without start");
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
}
void
InstrStop(Instrumentation *instr)
{
- instr_time endtime;
+ if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ if (instr->need_bufusage || instr->need_walusage)
+ InstrPopStack(instr);
+}
+
+/*
+ * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
+ *
+ * Note that this intentionally allows passing a stack entry that is not the
+ * current top, as can happen with PG_FINALLY or resource owners, which don't
+ * have a guaranteed cleanup order.
+ *
+ * We are careful here to achieve two goals:
+ *
+ * 1) Reset the stack to the parent of whichever of the released stack entries
+ * has the lowest index
+ * 2) Accumulate all instrumentation to the currently active instrumentation,
+ * so that callers get a complete picture of activity, even after an abort
+ */
+void
+InstrStopFinalize(Instrumentation *instr)
+{
+ int idx = -1;
+
+ for (int i = instr_stack.stack_size - 1; i >= 0; i--)
+ {
+ if (instr_stack.entries[i] == instr)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ if (idx >= 0)
+ {
+ while (instr_stack.stack_size > idx + 1)
+ instr_stack.stack_size--;
+
+ InstrPopStack(instr);
+ }
- /* let's update the time only if the timer was requested */
if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ InstrAccum(instr_stack.current, instr);
+}
+
+
+/* Query instrumentation handling */
+
+/*
+ * Use ResourceOwner mechanism to correctly reset instr_stack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+ .release_priority = RELEASE_PRIO_INSTRUMENTATION,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ QueryInstrumentation *qinstr = (QueryInstrumentation *) DatumGetPointer(res);
+ dlist_mutable_iter iter;
+
+ /* Accumulate data from all unfinalized child node entries. */
+ dlist_foreach_modify(iter, &qinstr->unfinalized_children)
{
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ NodeInstrumentation *child = dlist_container(NodeInstrumentation, unfinalized_node, iter.cur);
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+ InstrAccum(&qinstr->instr, &child->instr);
- INSTR_TIME_SET_ZERO(instr->starttime);
+ /*
+ * Free NodeInstrumentation now, since InstrFinalizeNode won't be
+ * called
+ */
+ pfree(child);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
+ /* Ensure the stack is reset as expected, and we accumulate to the parent */
+ InstrStopFinalize(&qinstr->instr);
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ /* Free QueryInstrumentation now, since InstrQueryStopFinalize won't be called */
+ pfree(qinstr);
+}
+
+QueryInstrumentation *
+InstrQueryAlloc(int instrument_options)
+{
+ QueryInstrumentation *instr;
+
+ /*
+ * If needed, allocate in TopMemoryContext so that the Instrumentation
+ * survives transaction abort, because ResourceOwner release needs to
+ * access it.
+ */
+ if ((instrument_options & INSTRUMENT_BUFFERS) != 0 || (instrument_options & INSTRUMENT_WAL) != 0)
+ instr = MemoryContextAllocZero(TopMemoryContext, sizeof(QueryInstrumentation));
+ else
+ instr = palloc0(sizeof(QueryInstrumentation));
+
+ InstrInitOptions(&instr->instr, instrument_options);
+ dlist_init(&instr->unfinalized_children);
+
+ return instr;
+}
+
+void
+InstrQueryStart(QueryInstrumentation *qinstr)
+{
+ InstrStart(&qinstr->instr);
+
+ if (qinstr->instr.need_bufusage || qinstr->instr.need_walusage)
+ {
+ Assert(CurrentResourceOwner != NULL);
+ qinstr->owner = CurrentResourceOwner;
+
+ ResourceOwnerEnlarge(qinstr->owner);
+ ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
+ }
+}
+
+void
+InstrQueryStop(QueryInstrumentation *qinstr)
+{
+ InstrStop(&qinstr->instr);
+
+ if (qinstr->instr.need_bufusage || qinstr->instr.need_walusage)
+ {
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+ }
+}
+
+QueryInstrumentation *
+InstrQueryStopFinalize(QueryInstrumentation *qinstr)
+{
+ QueryInstrumentation *copy;
+
+ InstrStopFinalize(&qinstr->instr);
+
+ if (!qinstr->instr.need_bufusage && !qinstr->instr.need_walusage)
+ return qinstr;
+
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+
+ /*
+ * Copy to the current memory context so the caller doesn't need to
+ * explicitly free the TopMemoryContext allocation.
+ */
+ copy = palloc(sizeof(QueryInstrumentation));
+ memcpy(copy, qinstr, sizeof(QueryInstrumentation));
+ pfree(qinstr);
+ return copy;
+}
+
+/*
+ * Register a child NodeInstrumentation entry for abort processing.
+ *
+ * On abort, ResOwnerReleaseInstrumentation will walk the parent's list to
+ * recover buffer/WAL data from entries that were never finalized, in order for
+ * aggregate totals to be accurate despite the query erroring out.
+ */
+void
+InstrQueryRememberNode(QueryInstrumentation *parent, NodeInstrumentation *child)
+{
+ if (child->instr.need_bufusage || child->instr.need_walusage)
+ dlist_push_head(&parent->unfinalized_children, &child->unfinalized_node);
+}
+
+/* start instrumentation during parallel executor startup */
+QueryInstrumentation *
+InstrStartParallelQuery(void)
+{
+ QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrQueryStart(qinstr);
+ return qinstr;
+}
+
+/* report usage after parallel executor shutdown */
+void
+InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+{
+ qinstr = InstrQueryStopFinalize(qinstr);
+ memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+}
+
+/*
+ * Accumulate work done by parallel workers in the leader's stats.
+ *
+ * Note that what gets added here effectively depends on whether per-node
+ * instrumentation is active. If it's active the parallel worker intentionally
+ * skips ExecFinalizeNodeInstrumentation on executor shutdown, because it would
+ * cause double counting. Instead, this only accumulates any extra activity
+ * outside of nodes.
+ *
+ * Otherwise this is responsible for making sure that the complete query
+ * activity is accumulated.
+ */
+void
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+{
+ BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
+ WalUsageAdd(&instr_stack.current->walusage, walusage);
+
+ WalUsageAdd(&pgWalUsage, walusage);
}
/* Node instrumentation handling */
@@ -94,7 +319,19 @@ InstrStop(Instrumentation *instr)
NodeInstrumentation *
InstrAllocNode(int instrument_options, bool async_mode)
{
- NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+ NodeInstrumentation *instr;
+
+ /*
+ * If needed, allocate in a context that supports stack-based
+ * instrumentation abort handling. We can utilize TopTransactionContext
+ * instead of TopMemoryContext here because nodes don't get used for
+ * utility commands that restart transactions, which would require a
+ * context that survives longer (EXPLAIN ANALYZE is fine).
+ */
+ if ((instrument_options & INSTRUMENT_BUFFERS) != 0 || (instrument_options & INSTRUMENT_WAL) != 0)
+ instr = MemoryContextAlloc(TopTransactionContext, sizeof(NodeInstrumentation));
+ else
+ instr = palloc(sizeof(NodeInstrumentation));
InstrInitNode(instr, instrument_options);
instr->async_mode = async_mode;
@@ -117,6 +354,7 @@ InstrStartNode(NodeInstrumentation *instr)
InstrStart(&instr->instr);
}
+
/* Exit from a plan node */
void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
@@ -146,14 +384,12 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->instr.starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->instr.need_bufusage)
- BufferUsageAccumDiff(&instr->instr.bufusage,
- &pgBufferUsage, &instr->instr.bufusage_start);
-
- if (instr->instr.need_walusage)
- WalUsageAccumDiff(&instr->instr.walusage,
- &pgWalUsage, &instr->instr.walusage_start);
+ /*
+ * Only pop the stack here; accumulation to the parent happens later in
+ * ExecFinalizeNodeInstrumentation.
+ */
+ if (instr->instr.need_bufusage || instr->instr.need_walusage)
+ InstrPopStack(&instr->instr);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -172,6 +408,31 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
}
}
+NodeInstrumentation *
+InstrFinalizeNode(NodeInstrumentation *instr, Instrumentation *parent)
+{
+ NodeInstrumentation *dst;
+
+ /* If we didn't use stack based instrumentation, nothing to be done */
+ if (!instr->instr.need_bufusage && !instr->instr.need_walusage)
+ return instr;
+
+ /* Copy into per-query memory context */
+ dst = palloc(sizeof(NodeInstrumentation));
+ memcpy(dst, instr, sizeof(NodeInstrumentation));
+
+ /* Accumulate node's buffer/WAL usage to the parent */
+ InstrAccum(parent, &dst->instr);
+
+ /* Unregister from query's unfinalized list before freeing */
+ if (instr->instr.need_bufusage || instr->instr.need_walusage)
+ dlist_delete(&instr->unfinalized_node);
+
+ pfree(instr);
+
+ return dst;
+}
+
/* Update tuple count */
void
InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
@@ -188,8 +449,8 @@ InstrEndLoop(NodeInstrumentation *instr)
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrEndLoop called on running node");
+ /* Ensure InstrStopNode was called */
+ Assert(INSTR_TIME_IS_ZERO(instr->instr.starttime));
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
@@ -259,38 +520,27 @@ InstrStartTrigger(TriggerInstrumentation *tginstr)
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
+ /*
+ * This trigger may be called again, so we don't finalize instrumentation
+ * here. Accumulation to the parent happens at ExecutorFinish through
+ * ExecFinalizeTriggerInstrumentation.
+ */
InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
-/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrAccum(Instrumentation *dst, Instrumentation *add)
{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
-}
+ Assert(dst != NULL);
+ Assert(add != NULL);
-/* report usage after parallel executor shutdown */
-void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
-}
-
-/* accumulate work done by workers in leader's stats */
-void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- BufferUsageAdd(&pgBufferUsage, bufusage);
- WalUsageAdd(&pgWalUsage, walusage);
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -311,39 +561,9 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
+/* dst += add */
void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
-/* helper functions for WAL usage accumulation */
-static void
-WalUsageAdd(WalUsage *dst, WalUsage *add)
+WalUsageAdd(WalUsage *dst, const WalUsage *add)
{
dst->wal_bytes += add->wal_bytes;
dst->wal_records += add->wal_records;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dfa37e5ed44..41a0baa3449 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1269,9 +1269,9 @@ PinBufferForBlock(Relation rel,
if (rel)
{
/*
- * While pgBufferUsage's "read" counter isn't bumped unless we reach
- * WaitReadBuffers() (so, not for hits, and not for buffers that are
- * zeroed instead), the per-relation stats always count them.
+ * While the current buffer usage "read" counter isn't bumped unless
+ * we reach WaitReadBuffers() (so, not for hits, and not for buffers
+ * that are zeroed instead), the per-relation stats always count them.
*/
pgstat_count_buffer_read(rel);
if (*foundPtr)
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 9e7a88ec0d0..60400f0c81f 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -114,9 +114,9 @@ pgstat_prepare_io_time(bool track_io_guc)
* pg_stat_database only counts block read and write times, these are done for
* IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
*
- * pgBufferUsage is used for EXPLAIN. pgBufferUsage has write and read stats
- * for shared, local and temporary blocks. pg_stat_io does not track the
- * activity of temporary blocks, so these are ignored here.
+ * Executor instrumentation is used for EXPLAIN. Buffer usage tracked there has
+ * write and read stats for shared, local and temporary blocks. pg_stat_io
+ * does not track the activity of temporary blocks, so these are ignored here.
*/
void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index d3a57242844..340029a2034 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -51,8 +51,8 @@ typedef struct QueryDesc
/* This field is set by ExecutePlan */
bool already_executed; /* true if previously executed */
- /* This is always set NULL by the core system, but plugins can change it */
- struct Instrumentation *totaltime; /* total time spent in ExecutorRun */
+ /* This field is set by ExecutorRun, or plugins */
+ struct QueryInstrumentation *totaltime; /* total time spent in ExecutorRun */
} QueryDesc;
/* in pquery.c */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 82c442d23f8..b902cfcbe6e 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -300,6 +300,8 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent);
+extern void ExecFinalizeNodeInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1139be8333e..a5e48d8cc45 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -13,6 +13,7 @@
#ifndef INSTRUMENT_H
#define INSTRUMENT_H
+#include "lib/ilist.h"
#include "portability/instr_time.h"
@@ -68,10 +69,22 @@ typedef enum InstrumentOption
} InstrumentOption;
/*
- * General purpose instrumentation that can capture time and WAL/buffer usage
+ * Instrumentation base class for capturing time and WAL/buffer usage
*
- * Initialized through InstrAlloc, followed by one or more calls to a pair of
- * InstrStart/InstrStop (activity is measured inbetween).
+ * If used directly:
+ * - Allocate on the stack and zero initialize the struct
+ * - Call InstrInitOptions to set instrumentation options
+ * - Call InstrStart before the activity you want to measure
+ * - Call InstrStop / InstrStopFinalize after the activity to capture totals
+ *
+ * InstrStart/InstrStop may be called multiple times. The last stop call must
+ * be to InstrStopFinalize to ensure parent stack entries get the accumulated
+ * totals. If there is risk of transaction aborts you must call
+ * InstrStopFinalize in a PG_TRY/PG_FINALLY block to avoid corrupting the
+ * instrumentation stack.
+ *
+ * In a query context use QueryInstrumentation instead, which handles aborts
+ * using the resource owner logic.
*/
typedef struct Instrumentation
{
@@ -81,16 +94,52 @@ typedef struct Instrumentation
bool need_walusage; /* true if we need WAL usage data */
/* Internal state keeping: */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
} Instrumentation;
+/*
+ * Query-related instrumentation tracking.
+ *
+ * Usage:
+ * - Allocate on the heap using InstrQueryAlloc (required for abort handling)
+ * - Call InstrQueryStart before the activity you want to measure
+ * - Call InstrQueryStop / InstrQueryStopFinalize afterwards to capture totals
+ *
+ * InstrQueryStart/InstrQueryStop may be called multiple times. The last stop
+ * call must be to InstrQueryStopFinalize to ensure parent stack entries get
+ * the accumulated totals.
+ *
+ * Uses the resource owner mechanism for handling aborts. As such, the caller
+ * *must not* exit the top-level transaction after having called
+ * InstrQueryStart without first calling InstrQueryStop or
+ * InstrQueryStopFinalize. In the case of a transaction abort, logic equivalent
+ * to InstrQueryStopFinalize will be called automatically.
+ */
+struct ResourceOwnerData;
+typedef struct QueryInstrumentation
+{
+ Instrumentation instr;
+
+ /* Resource owner used for cleanup for aborts between InstrStart/InstrStop */
+ struct ResourceOwnerData *owner;
+
+ /*
+ * NodeInstrumentation child entries that need to be cleaned up on abort,
+ * since they are not registered as a resource owner themselves.
+ */
+ dlist_head unfinalized_children; /* head of unfinalized children list */
+} QueryInstrumentation;
+
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Relies on an outer QueryInstrumentation having been set up to handle the
+ * stack used for WAL/buffer usage statistics, and relies on it for managing
+ * aborts. Solely intended for the executor and anyone reporting about its
+ * activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -109,8 +158,15 @@ typedef struct NodeInstrumentation
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
+
+ /* Abort handling */
+ dlist_node unfinalized_node; /* node in parent's unfinalized list */
} NodeInstrumentation;
+/*
+ * Care must be taken with any pointers contained within this struct, as this
+ * gets copied across processes during parallel query execution.
+ */
typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
@@ -124,19 +180,110 @@ typedef struct TriggerInstrumentation
* was fired */
} TriggerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
+/*
+ * Dynamic array-based stack for tracking current WAL/buffer usage context.
+ *
+ * When the stack is empty, 'current' points to instr_top which accumulates
+ * session-level totals.
+ */
+typedef struct InstrStackState
+{
+ int stack_space; /* allocated capacity of entries array */
+ int stack_size; /* current number of entries */
+
+ Instrumentation **entries; /* dynamic array of pointers */
+ Instrumentation *current; /* top of stack, or &instr_top when empty */
+} InstrStackState;
+
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int instrument_options);
+/*
+ * The top instrumentation represents a running total of the current backend
+ * WAL/buffer usage information. This will not be updated immediately, but
+ * rather when the current stack entry gets accumulated which typically happens
+ * at query end.
+ *
+ * Care must be taken when utilizing this in the parallel worker context:
+ * Parallel workers will report back their instrumentation to the caller,
+ * and this gets added to the caller's stack. If this were to be used in the
+ * shared memory stats infrastructure it would need to be skipped on parallel
+ * workers to avoid double counting.
+ */
+extern PGDLLIMPORT Instrumentation instr_top;
+
+/*
+ * The instrumentation stack state. The 'current' field points to the
+ * currently active stack entry that is getting updated as activity happens,
+ * and will be accumulated to parent stacks when it gets finalized by
+ * InstrStop (for non-executor use cases), ExecFinalizeNodeInstrumentation
+ * (executor finish) or ResOwnerReleaseInstrumentation on abort.
+ */
+extern PGDLLIMPORT InstrStackState instr_stack;
+
+extern void InstrStackGrow(void);
+
+/*
+ * Pushes the stack so that all WAL/buffer usage updates go to the passed in
+ * instrumentation entry.
+ *
+ * Any caller using this directly must manage the passed in entry and call
+ * InstrPopStack on its own again, typically by using a PG_FINALLY block to
+ * ensure the stack gets reset via InstrPopStack on abort. Use InstrQueryStart
+ * instead when you want automatic handling of abort cases using the resource
+ * owner infrastructure.
+ */
+static inline void
+InstrPushStack(Instrumentation *instr)
+{
+ if (unlikely(instr_stack.stack_size == instr_stack.stack_space))
+ InstrStackGrow();
+
+ instr_stack.entries[instr_stack.stack_size++] = instr;
+ instr_stack.current = instr;
+}
+
+/*
+ * Pops the stack entry back to the previous one that was effective at
+ * InstrPushStack.
+ *
+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, consider instead using InstrStart + InstrStopFinalize which can skip
+ * intermediate stack entries.
+ */
+static inline void
+InstrPopStack(Instrumentation *instr)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.entries[instr_stack.stack_size - 1] == instr);
+ instr_stack.stack_size--;
+ instr_stack.current = instr_stack.stack_size > 0
+ ? instr_stack.entries[instr_stack.stack_size - 1]
+ : &instr_top;
+}
+
extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
extern void InstrStart(Instrumentation *instr);
extern void InstrStop(Instrumentation *instr);
+extern void InstrStopFinalize(Instrumentation *instr);
+extern void InstrAccum(Instrumentation *dst, Instrumentation *add);
+
+extern QueryInstrumentation *InstrQueryAlloc(int instrument_options);
+extern void InstrQueryStart(QueryInstrumentation *instr);
+extern void InstrQueryStop(QueryInstrumentation *instr);
+extern QueryInstrumentation *InstrQueryStopFinalize(QueryInstrumentation *instr);
+extern void InstrQueryRememberNode(QueryInstrumentation *parent, NodeInstrumentation *instr);
+
+pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern NodeInstrumentation *InstrAllocNode(int instrument_options,
bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
extern void InstrStartNode(NodeInstrumentation *instr);
extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern NodeInstrumentation *InstrFinalizeNode(NodeInstrumentation *instr, Instrumentation *parent);
extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
@@ -145,31 +292,31 @@ extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+extern void WalUsageAdd(WalUsage *dst, const WalUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
+ instr_stack.current->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += val; \
+ instr_stack.current->bufusage.fld += val; \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ INSTR_TIME_ADD(instr_stack.current->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ INSTR_TIME_ACCUM_DIFF(instr_stack.current->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ instr_stack.current->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += val; \
+ instr_stack.current->walusage.fld += val; \
} while(0)
#endif /* INSTRUMENT_H */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index eb6033b4fdb..5463bc921f0 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -75,6 +75,7 @@ typedef uint32 ResourceReleasePriority;
#define RELEASE_PRIO_SNAPSHOT_REFS 500
#define RELEASE_PRIO_FILES 600
#define RELEASE_PRIO_WAITEVENTSETS 700
+#define RELEASE_PRIO_INSTRUMENTATION 800
/* 0 is considered invalid */
#define RELEASE_PRIO_FIRST 1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e05a1e52db4..1c9be944c5a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1320,6 +1320,7 @@ InjectionPointSharedState
InjectionPointsCtl
InlineCodeBlock
InsertStmt
+InstrStackState
Instrumentation
Int128AggState
Int8TransTypeData
@@ -2431,6 +2432,7 @@ QueryCompletion
QueryDesc
QueryEnvironment
QueryInfo
+QueryInstrumentation
QueryItem
QueryItemType
QueryMode
--
2.47.1
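
For readers skimming the patch above, the push/pop/accumulate flow can be illustrated with a small standalone sketch. This is not code from the patch: the Usage struct, the fixed-size array, and the names push_usage, pop_usage, count_hit, accum and demo are hypothetical simplifications (the patch itself uses Instrumentation, a growable entries array, InstrPushStack/InstrPopStack and InstrAccum). It assumes a single counter and a bounded stack depth, purely to show that counters go to the current entry while a node runs and are only added to the parent at finalization.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for BufferUsage: a single counter. */
typedef struct Usage
{
	long		blks_hit;
} Usage;

#define STACK_MAX 16			/* assumption: fixed depth, no growth */

static Usage top_usage;			/* plays the role of instr_top */
static Usage *entries[STACK_MAX];
static int	stack_size = 0;
static Usage *current = &top_usage;

/* Route all subsequent counter updates to 'u'. */
static void
push_usage(Usage *u)
{
	assert(stack_size < STACK_MAX);
	entries[stack_size++] = u;
	current = u;
}

/* Restore the entry that was active before 'u' was pushed. */
static void
pop_usage(Usage *u)
{
	assert(stack_size > 0 && entries[stack_size - 1] == u);
	stack_size--;
	current = stack_size > 0 ? entries[stack_size - 1] : &top_usage;
}

/* What a buffer hit does: bump the counter of the current entry only. */
static void
count_hit(void)
{
	current->blks_hit++;
}

/* dst += add, done once at finalization rather than on every node exit. */
static void
accum(Usage *dst, const Usage *add)
{
	dst->blks_hit += add->blks_hit;
}

/* Parent node with one child; returns the parent's finalized total. */
static long
demo(void)
{
	Usage		parent = {0};
	Usage		child = {0};

	push_usage(&parent);
	count_hit();				/* charged to parent */
	push_usage(&child);
	count_hit();				/* charged to child */
	count_hit();				/* charged to child */
	pop_usage(&child);
	count_hit();				/* charged to parent */
	pop_usage(&parent);

	/* Finalization: child into parent, then parent into the running total. */
	accum(&parent, &child);
	accum(&top_usage, &parent);

	return parent.blks_hit;
}
```

Calling demo() once leaves the parent with its own two hits plus the child's two, and the same total in top_usage. The point of the design is that the hot path (count_hit, push, pop) touches only a pointer and one counter, while the additions happen once per node at the end.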
[application/octet-stream] v8-0008-Add-pg_session_buffer_usage-contrib-module.patch (25.5K, 7-v8-0008-Add-pg_session_buffer_usage-contrib-module.patch)
From a0543e3b3aa11becca07a05e7d19a9e83276898f Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:41 -0800
Subject: [PATCH v8 8/8] Add pg_session_buffer_usage contrib module
This is intended for testing instrumentation-related logic as it pertains
to the top-level stack entry that is maintained as a running total. There
is currently no in-core user of the top-level values, so this module helps
ensure, especially in abort situations, that we don't lose activity due to
incorrect handling of unfinalized node stacks.
---
contrib/meson.build | 1 +
contrib/pg_session_buffer_usage/Makefile | 23 ++
.../expected/pg_session_buffer_usage.out | 283 ++++++++++++++++++
contrib/pg_session_buffer_usage/meson.build | 34 +++
.../pg_session_buffer_usage--1.0.sql | 31 ++
.../pg_session_buffer_usage.c | 95 ++++++
.../pg_session_buffer_usage.control | 5 +
.../sql/pg_session_buffer_usage.sql | 204 +++++++++++++
8 files changed, 676 insertions(+)
create mode 100644 contrib/pg_session_buffer_usage/Makefile
create mode 100644 contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
create mode 100644 contrib/pg_session_buffer_usage/meson.build
create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
create mode 100644 contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
diff --git a/contrib/meson.build b/contrib/meson.build
index 5a752eac347..2b1399e56f3 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -51,6 +51,7 @@ subdir('pg_overexplain')
subdir('pg_plan_advice')
subdir('pg_prewarm')
subdir('pgrowlocks')
+subdir('pg_session_buffer_usage')
subdir('pg_stat_statements')
subdir('pgstattuple')
subdir('pg_surgery')
diff --git a/contrib/pg_session_buffer_usage/Makefile b/contrib/pg_session_buffer_usage/Makefile
new file mode 100644
index 00000000000..75bd8e09b3d
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_session_buffer_usage/Makefile
+
+MODULE_big = pg_session_buffer_usage
+OBJS = \
+ $(WIN32RES) \
+ pg_session_buffer_usage.o
+
+EXTENSION = pg_session_buffer_usage
+DATA = pg_session_buffer_usage--1.0.sql
+PGFILEDESC = "pg_session_buffer_usage - show buffer usage statistics for the current session"
+
+REGRESS = pg_session_buffer_usage
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_session_buffer_usage
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out b/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
new file mode 100644
index 00000000000..242b4003950
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
@@ -0,0 +1,283 @@
+LOAD 'pg_session_buffer_usage';
+CREATE EXTENSION pg_session_buffer_usage;
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM pg_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+ ok
+----
+ t
+(1 row)
+
+-- Verify counters increase after buffer activity
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM pg_session_buffer_usage();
+ blocks_increased
+------------------
+ t
+(1 row)
+
+DROP TABLE test_buf_activity;
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM pg_session_buffer_usage(), dc_serial_result s;
+ leader_buffers_match
+----------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_dc_tab, dc_serial_result;
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+ERROR: division by zero
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM pg_session_buffer_usage();
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+DROP TABLE exc_tab;
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+ERROR: duplicate key value violates unique constraint "unique_tab_a_key"
+DETAIL: Key (a)=(1) already exists.
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM pg_session_buffer_usage();
+ unique_violation_buffers_visible
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE unique_tab;
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT subxact_exc_func();
+ subxact_exc_func
+------------------
+ caught
+(1 row)
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM pg_session_buffer_usage();
+ subxact_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT cursor_exc_func();
+ cursor_exc_func
+-----------------------
+ caught after 250 rows
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM pg_session_buffer_usage();
+ cursor_subxact_buffers_visible
+--------------------------------
+ t
+(1 row)
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CONTEXT: parallel worker
+RESET debug_parallel_query;
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM pg_session_buffer_usage(), par_abort_serial_result s;
+ worker_abort_buffers_not_propagated
+-------------------------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_abort_tab, par_abort_serial_result;
+-- Cleanup
+DROP EXTENSION pg_session_buffer_usage;
diff --git a/contrib/pg_session_buffer_usage/meson.build b/contrib/pg_session_buffer_usage/meson.build
new file mode 100644
index 00000000000..34c7502beb4
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/meson.build
@@ -0,0 +1,34 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+pg_session_buffer_usage_sources = files(
+ 'pg_session_buffer_usage.c',
+)
+
+if host_system == 'windows'
+ pg_session_buffer_usage_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_session_buffer_usage',
+ '--FILEDESC', 'pg_session_buffer_usage - show buffer usage statistics for the current session',])
+endif
+
+pg_session_buffer_usage = shared_module('pg_session_buffer_usage',
+ pg_session_buffer_usage_sources,
+ kwargs: contrib_mod_args,
+)
+contrib_targets += pg_session_buffer_usage
+
+install_data(
+ 'pg_session_buffer_usage--1.0.sql',
+ 'pg_session_buffer_usage.control',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'pg_session_buffer_usage',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'pg_session_buffer_usage',
+ ],
+ },
+}
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql b/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
new file mode 100644
index 00000000000..b300fdbc643
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
@@ -0,0 +1,31 @@
+/* contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_session_buffer_usage" to load this file. \quit
+
+CREATE FUNCTION pg_session_buffer_usage(
+ OUT shared_blks_hit bigint,
+ OUT shared_blks_read bigint,
+ OUT shared_blks_dirtied bigint,
+ OUT shared_blks_written bigint,
+ OUT local_blks_hit bigint,
+ OUT local_blks_read bigint,
+ OUT local_blks_dirtied bigint,
+ OUT local_blks_written bigint,
+ OUT temp_blks_read bigint,
+ OUT temp_blks_written bigint,
+ OUT shared_blk_read_time double precision,
+ OUT shared_blk_write_time double precision,
+ OUT local_blk_read_time double precision,
+ OUT local_blk_write_time double precision,
+ OUT temp_blk_read_time double precision,
+ OUT temp_blk_write_time double precision
+)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_session_buffer_usage'
+LANGUAGE C PARALLEL RESTRICTED;
+
+CREATE FUNCTION pg_session_buffer_usage_reset()
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_session_buffer_usage_reset'
+LANGUAGE C PARALLEL RESTRICTED;
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
new file mode 100644
index 00000000000..f869956b3a9
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_session_buffer_usage.c
+ * show buffer usage statistics for the current session
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/instrument.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "pg_session_buffer_usage",
+ .version = PG_VERSION
+);
+
+#define NUM_BUFFER_USAGE_COLUMNS 16
+
+PG_FUNCTION_INFO_V1(pg_session_buffer_usage);
+PG_FUNCTION_INFO_V1(pg_session_buffer_usage_reset);
+
+#define HAVE_INSTR_STACK 1 /* Change to 0 when testing before stack
+ * change */
+
+/*
+ * SQL function: pg_session_buffer_usage()
+ *
+ * Returns a single row with all BufferUsage counters accumulated since the
+ * start of the session. Excludes any usage not yet added to the top of the
+ * stack (e.g. if this gets called inside a statement that also had buffer
+ * activity).
+ */
+Datum
+pg_session_buffer_usage(PG_FUNCTION_ARGS)
+{
+ TupleDesc tupdesc;
+ Datum values[NUM_BUFFER_USAGE_COLUMNS];
+ bool nulls[NUM_BUFFER_USAGE_COLUMNS];
+ BufferUsage *usage;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ memset(nulls, 0, sizeof(nulls));
+
+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif
+
+ values[0] = Int64GetDatum(usage->shared_blks_hit);
+ values[1] = Int64GetDatum(usage->shared_blks_read);
+ values[2] = Int64GetDatum(usage->shared_blks_dirtied);
+ values[3] = Int64GetDatum(usage->shared_blks_written);
+ values[4] = Int64GetDatum(usage->local_blks_hit);
+ values[5] = Int64GetDatum(usage->local_blks_read);
+ values[6] = Int64GetDatum(usage->local_blks_dirtied);
+ values[7] = Int64GetDatum(usage->local_blks_written);
+ values[8] = Int64GetDatum(usage->temp_blks_read);
+ values[9] = Int64GetDatum(usage->temp_blks_written);
+ values[10] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time));
+ values[11] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time));
+ values[12] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time));
+ values[13] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time));
+ values[14] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+ values[15] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * SQL function: pg_session_buffer_usage_reset()
+ *
+ * Resets all BufferUsage counters on the top instrumentation stack to zero.
+ * Useful in tests to avoid the baseline/delta pattern.
+ */
+Datum
+pg_session_buffer_usage_reset(PG_FUNCTION_ARGS)
+{
+#if HAVE_INSTR_STACK
+ memset(&instr_top.bufusage, 0, sizeof(BufferUsage));
+#else
+ memset(&pgBufferUsage, 0, sizeof(BufferUsage));
+#endif
+
+ PG_RETURN_VOID();
+}
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
new file mode 100644
index 00000000000..fabd05ee024
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
@@ -0,0 +1,5 @@
+# pg_session_buffer_usage extension
+comment = 'show buffer usage statistics for the current session'
+default_version = '1.0'
+module_pathname = '$libdir/pg_session_buffer_usage'
+relocatable = true
diff --git a/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql b/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
new file mode 100644
index 00000000000..8f5810fadd3
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
@@ -0,0 +1,204 @@
+LOAD 'pg_session_buffer_usage';
+CREATE EXTENSION pg_session_buffer_usage;
+
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM pg_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+
+-- Verify counters increase after buffer activity
+SELECT pg_session_buffer_usage_reset();
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM pg_session_buffer_usage();
+
+DROP TABLE test_buf_activity;
+
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+SELECT count(*) FROM par_dc_tab;
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM pg_session_buffer_usage(), dc_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_dc_tab, dc_serial_result;
+
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+
+SELECT pg_session_buffer_usage_reset();
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP TABLE exc_tab;
+
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+
+SELECT pg_session_buffer_usage_reset();
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP TABLE unique_tab;
+
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT subxact_exc_func();
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT cursor_exc_func();
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+
+SELECT pg_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+RESET debug_parallel_query;
+
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM pg_session_buffer_usage(), par_abort_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_abort_tab, par_abort_serial_result;
+
+-- Cleanup
+DROP EXTENSION pg_session_buffer_usage;
--
2.47.1
[application/octet-stream] v8-0007-Index-scans-Show-table-buffer-accesses-separately.patch (17.6K, 8-v8-0007-Index-scans-Show-table-buffer-accesses-separately.patch)
download | inline diff:
From 42dc8cdda3346bedb7839ceda44d61db4e21f72d Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:19 -0800
Subject: [PATCH v8 7/8] Index scans: Show table buffer accesses separately in
EXPLAIN ANALYZE
This sets up a separate instrumentation stack entry that is active while
an Index Scan fetches from the table, for example because data that is
not present in the index is needed.
EXPLAIN ANALYZE will now show "Table Buffers" that represent such activity.
The activity is also included in regular "Buffers" together with index
activity and that of any child nodes.
Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
doc/src/sgml/perform.sgml | 13 ++-
doc/src/sgml/ref/explain.sgml | 1 +
src/backend/commands/explain.c | 44 ++++++++--
src/backend/executor/execProcnode.c | 35 ++++++++
src/backend/executor/nodeIndexscan.c | 110 +++++++++++++++++++------
src/include/executor/instrument_node.h | 6 ++
src/include/nodes/execnodes.h | 7 ++
7 files changed, 182 insertions(+), 34 deletions(-)
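
The commit message above describes pushing a separate entry while the
table is accessed and reporting that activity both on its own and as part
of the node's regular "Buffers". A minimal sketch of that idea follows,
assuming a single counter and a roll-up at finalization time. All demo_*
names and the DemoInstr struct are hypothetical stand-ins for illustration;
they are not the InstrPushStack/InstrPopStack/InstrFinalizeNode
implementations from this series.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in; this is not a PostgreSQL struct. */
typedef struct DemoInstr
{
    int64_t     shared_blks_hit;    /* the only counter modelled here */
    struct DemoInstr *prev;         /* entry to restore on pop */
} DemoInstr;

static DemoInstr demo_top;                      /* session-level entry */
static DemoInstr *demo_current = &demo_top;     /* where hits are charged */

static void
demo_push(DemoInstr *entry)
{
    entry->prev = demo_current;
    demo_current = entry;
}

static void
demo_pop(DemoInstr *entry)
{
    assert(demo_current == entry);  /* pushes and pops must pair up */
    demo_current = entry->prev;
}

/* A buffer hit is charged to whichever entry is currently active. */
static void
demo_buffer_hit(void)
{
    demo_current->shared_blks_hit++;
}

/* Assumption: a child's counts are added to its parent when finalized. */
static void
demo_finalize(const DemoInstr *child, DemoInstr *parent)
{
    parent->shared_blks_hit += child->shared_blks_hit;
}

/* Simulate one index scan: 3 index page hits, then 1 table page hit. */
static void
demo_index_scan(DemoInstr *node, DemoInstr *table)
{
    demo_push(node);
    for (int i = 0; i < 3; i++)
        demo_buffer_hit();          /* index access, charged to node */

    demo_push(table);
    demo_buffer_hit();              /* table access, charged to table */
    demo_pop(table);

    demo_pop(node);
    demo_finalize(table, node);     /* table counts roll up into the node */
}

/* Print the two counters in the style of the EXPLAIN text output. */
static void
demo_report(const DemoInstr *node, const DemoInstr *table)
{
    printf("Table Buffers: shared hit=%lld\n",
           (long long) table->shared_blks_hit);
    printf("Buffers: shared hit=%lld\n",
           (long long) node->shared_blks_hit);
}
```

Calling demo_index_scan() on two zeroed entries leaves the table entry at
1 hit and the node entry at 4, the same shape as the "Table Buffers: shared
hit=1" / "Buffers: shared hit=4" pair in the explain.sgml hunk below.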
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 5f6f1db0467..9219625faf6 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -734,6 +734,7 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10)
Index Cond: (unique2 = t1.unique2)
Index Searches: 10
+ Table Buffers: shared hit=10
Buffers: shared hit=24 read=6
Planning:
Buffers: shared hit=15 dirtied=9
@@ -949,7 +950,8 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
Index Cond: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Index Recheck: 1
Index Searches: 1
- Buffers: shared hit=1
+ Table Buffers: shared hit=1
+ Buffers: shared hit=2
Planning Time: 0.039 ms
Execution Time: 0.098 ms
</screen>
@@ -958,7 +960,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
then rejected by a recheck of the index condition. This happens because a
GiST index is <quote>lossy</quote> for polygon containment tests: it actually
returns the rows with polygons that overlap the target, and then we have
- to do the exact containment test on those rows.
+ to do the exact containment test on those rows. The <literal>Table Buffers</literal>
+ counts indicate how many buffer accesses were performed on the table rather than
+ the index. These accesses are also included in the <literal>Buffers</literal> counts.
</para>
<para>
@@ -1147,13 +1151,14 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
QUERY PLAN
-------------------------------------------------------------------&zwsp;------------------------------------------------------------
Limit (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2.00 loops=1)
- Buffers: shared hit=16
+ Buffers: shared hit=14
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2.00 loops=1)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Index Searches: 1
- Buffers: shared hit=16
+ Table Buffers: shared hit=11
+ Buffers: shared hit=14
Planning Time: 0.077 ms
Execution Time: 0.086 ms
</screen>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 7dee77fd366..912c96f2ff5 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -506,6 +506,7 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
-> Index Scan using test_pkey on test (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99.00 loops=1)
Index Cond: ((id > 100) AND (id < 200))
Index Searches: 1
+ Table Buffers: shared hit=1
Buffers: shared hit=4
Planning Time: 0.244 ms
Execution Time: 0.073 ms
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index aa76f68bd10..8a641f9d05f 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -143,7 +143,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -603,7 +603,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
- show_buffer_usage(es, bufusage);
+ show_buffer_usage(es, bufusage, NULL);
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -1020,7 +1020,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
if (es->buffers && peek_buffer_usage(es, &metrics->bufferUsage))
{
es->indent++;
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->bufferUsage, NULL);
es->indent--;
}
}
@@ -1034,7 +1034,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->bufferUsage, NULL);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -1962,6 +1962,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexScanState *) planstate)->iss_InstrumentTable->instr.bufusage, "Table");
break;
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -2280,7 +2283,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->instr.bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage, NULL);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
@@ -2299,7 +2302,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->instr.bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage, NULL);
if (es->wal)
show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
@@ -4099,7 +4102,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
* Show buffer usage details. This better be sync with peek_buffer_usage.
*/
static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
{
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -4124,6 +4127,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "Buffers:");
if (has_shared)
@@ -4179,6 +4184,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared_timing || has_local_timing || has_temp_timing)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "I/O Timings:");
if (has_shared_timing)
@@ -4220,6 +4227,14 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
else
{
+ char *buffers_title = NULL;
+
+ if (title)
+ {
+ buffers_title = psprintf("%s Buffers", title);
+ ExplainOpenGroup(buffers_title, buffers_title, true, es);
+ }
+
ExplainPropertyInteger("Shared Hit Blocks", NULL,
usage->shared_blks_hit, es);
ExplainPropertyInteger("Shared Read Blocks", NULL,
@@ -4240,8 +4255,20 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
usage->temp_blks_read, es);
ExplainPropertyInteger("Temp Written Blocks", NULL,
usage->temp_blks_written, es);
+
+ if (buffers_title)
+ ExplainCloseGroup(buffers_title, buffers_title, true, es);
+
if (track_io_timing)
{
+ char *timings_title = NULL;
+
+ if (title)
+ {
+ timings_title = psprintf("%s I/O Timings", title);
+ ExplainOpenGroup(timings_title, timings_title, true, es);
+ }
+
ExplainPropertyFloat("Shared I/O Read Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
3, es);
@@ -4260,6 +4287,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
ExplainPropertyFloat("Temp I/O Write Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
3, es);
+
+ if (timings_title)
+ ExplainCloseGroup(timings_title, timings_title, true, es);
}
}
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 6a74ca516ae..5e476939edf 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -414,9 +414,24 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
+ {
result->instrument = InstrAllocNode(estate->es_instrument,
result->async_capable);
+ /* IndexScan tracks table access separately from index access. */
+ if (IsA(result, IndexScanState) && (estate->es_instrument & INSTRUMENT_BUFFERS) != 0)
+ {
+ IndexScanState *iss = castNode(IndexScanState, result);
+
+ /*
+ * We intentionally don't collect timing here (even if enabled),
+ * since we don't need it, and IndexNext calls InstrPushStack /
+ * InstrPopStack (instead of InstrNode*) to reduce overhead.
+ */
+ iss->iss_InstrumentTable = InstrAllocNode(INSTRUMENT_BUFFERS, false);
+ }
+ }
+
return result;
}
@@ -836,8 +851,19 @@ ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
return false;
if (node->instrument)
+ {
InstrQueryRememberNode(parent, node->instrument);
+ /* IndexScan has a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ if (iss->iss_InstrumentTable)
+ InstrQueryRememberNode(parent, iss->iss_InstrumentTable);
+ }
+ }
+
return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
}
@@ -879,6 +905,15 @@ ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
if (!node->instrument)
return false;
+ /* IndexScan has a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ if (iss->iss_InstrumentTable)
+ iss->iss_InstrumentTable = InstrFinalizeNode(iss->iss_InstrumentTable, &node->instrument->instr);
+ }
+
node->instrument = InstrFinalizeNode(node->instrument, parent);
return false;
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 4513b1f7a90..3bd35392168 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -83,7 +83,9 @@ IndexNext(IndexScanState *node)
ExprContext *econtext;
ScanDirection direction;
IndexScanDesc scandesc;
+ ItemPointer tid;
TupleTableSlot *slot;
+ bool found;
/*
* extract necessary information from index scan node
@@ -128,8 +130,24 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
+ if (node->iss_InstrumentTable)
+ InstrPushStack(&node->iss_InstrumentTable->instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (node->iss_InstrumentTable)
+ InstrPopStack(&node->iss_InstrumentTable->instr);
+
+ if (unlikely(!found))
+ continue;
+
CHECK_FOR_INTERRUPTS();
/*
@@ -257,36 +275,67 @@ IndexNextWithReorder(IndexScanState *node)
}
/*
- * Fetch next tuple from the index.
+ * Fetch next valid tuple from the index.
*/
-next_indextuple:
- if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+ for (;;)
{
+ ItemPointer tid;
+ bool found;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scandesc, ForwardScanDirection);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ /*
+ * No more tuples from the index. But we still need to drain
+ * any remaining tuples from the queue before we're done.
+ */
+ node->iss_ReachedEnd = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scandesc->xs_heaptid));
+
+ if (node->iss_InstrumentTable)
+ InstrPushStack(&node->iss_InstrumentTable->instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (node->iss_InstrumentTable)
+ InstrPopStack(&node->iss_InstrumentTable->instr);
+
/*
- * No more tuples from the index. But we still need to drain any
- * remaining tuples from the queue before we're done.
+ * If the index was lossy, we have to recheck the index quals and
+ * ORDER BY expressions using the fetched tuple.
*/
- node->iss_ReachedEnd = true;
- continue;
- }
-
- /*
- * If the index was lossy, we have to recheck the index quals and
- * ORDER BY expressions using the fetched tuple.
- */
- if (scandesc->xs_recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->indexqualorig, econtext))
+ if (found && scandesc->xs_recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- /* allow this loop to be cancellable */
- CHECK_FOR_INTERRUPTS();
- goto next_indextuple;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->indexqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ /* allow this loop to be cancellable */
+ CHECK_FOR_INTERRUPTS();
+ continue;
+ }
}
+
+ if (found)
+ break;
}
+ /* No more index entries, re-run to clear the reorder queue */
+ if (node->iss_ReachedEnd)
+ continue;
+
if (scandesc->xs_recheckorderby)
{
econtext->ecxt_scantuple = slot;
@@ -812,6 +861,11 @@ ExecEndIndexScan(IndexScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->iss_Instrument.nsearches;
+ if (node->iss_InstrumentTable)
+ {
+ BufferUsageAdd(&winstrument->worker_table_bufusage, &node->iss_InstrumentTable->instr.bufusage);
+ WalUsageAdd(&winstrument->worker_table_walusage, &node->iss_InstrumentTable->instr.walusage);
+ }
}
/*
@@ -1819,4 +1873,14 @@ ExecIndexScanRetrieveInstrumentation(IndexScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->iss_SharedInfo = palloc(size);
memcpy(node->iss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ if (node->iss_InstrumentTable)
+ for (int i = 0; i < node->iss_SharedInfo->num_workers; i++)
+ {
+ BufferUsageAdd(&node->iss_InstrumentTable->instr.bufusage,
+ &node->iss_SharedInfo->winstrument[i].worker_table_bufusage);
+ WalUsageAdd(&node->iss_InstrumentTable->instr.walusage,
+ &node->iss_SharedInfo->winstrument[i].worker_table_walusage);
+ }
}
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 8847d7f94fa..170b6143ef6 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -18,6 +18,8 @@
#ifndef INSTRUMENT_NODE_H
#define INSTRUMENT_NODE_H
+#include "executor/instrument.h"
+
/* ---------------------
* Instrumentation information for aggregate function execution
@@ -48,6 +50,10 @@ typedef struct IndexScanInstrumentation
{
/* Index search count (incremented with pgstat_count_index_scan call) */
uint64 nsearches;
+
+ /* Used for passing iss_InstrumentTable data from parallel workers */
+ BufferUsage worker_table_bufusage;
+ WalUsage worker_table_walusage;
} IndexScanInstrumentation;
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 605c7a6cc39..c778641c13d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1728,6 +1728,13 @@ typedef struct IndexScanState
IndexScanInstrumentation iss_Instrument;
SharedIndexScanInstrumentation *iss_SharedInfo;
+ /*
+ * Instrumentation utilized for tracking table access. This is separate
+ * from iss_Instrument since it needs to be allocated in the right context
+ * and IndexScanInstrumentation shouldn't contain pointers.
+ */
+ NodeInstrumentation *iss_InstrumentTable;
+
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
bool iss_ReachedEnd;
--
2.47.1
[application/octet-stream] v8-0006-instrumentation-Optimize-ExecProcNodeInstr-instru.patch (11.5K, 9-v8-0006-instrumentation-Optimize-ExecProcNodeInstr-instru.patch)
download | inline diff:
From 0c9c1c2e91b8aa4f2a0817deaf129e1e246ac92c Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 17:52:24 -0800
Subject: [PATCH v8 6/8] instrumentation: Optimize ExecProcNodeInstr
instructions by inlining
For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.
Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). These
conditionals are checked for each tuple emitted, which adds up and causes
the compiler to generate a non-optimal set of instructions.
Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. This reduces
the overhead of EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
test with a large COUNT(*) that does many ExecProcNode calls from ~20% on
top of actual runtime to ~3%. When using BUFFERS ON the same query goes
from ~20% to ~10% on top of actual runtime.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/executor/execProcnode.c | 22 +--
src/backend/executor/instrument.c | 224 +++++++++++++++++++++-------
src/include/executor/instrument.h | 5 +
3 files changed, 174 insertions(+), 77 deletions(-)
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index c788b5b00f9..6a74ca516ae 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -120,7 +120,6 @@
#include "nodes/nodeFuncs.h"
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
-static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
@@ -464,7 +463,7 @@ ExecProcNodeFirst(PlanState *node)
* have ExecProcNode() directly call the relevant function from now on.
*/
if (node->instrument)
- node->ExecProcNode = ExecProcNodeInstr;
+ node->ExecProcNode = InstrNodeSetupExecProcNode(node->instrument);
else
node->ExecProcNode = node->ExecProcNodeReal;
@@ -472,25 +471,6 @@ ExecProcNodeFirst(PlanState *node)
}
-/*
- * ExecProcNode wrapper that performs instrumentation calls. By keeping
- * this a separate function, we avoid overhead in the normal case where
- * no instrumentation is wanted.
- */
-static TupleTableSlot *
-ExecProcNodeInstr(PlanState *node)
-{
- TupleTableSlot *result;
-
- InstrStartNode(node->instrument);
-
- result = node->ExecProcNodeReal(node);
-
- InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
- return result;
-}
-
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index bd8ae3fdcc0..2727e7b5ce4 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -49,29 +49,20 @@ InstrInitOptions(Instrumentation *instr, int instrument_options)
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-void
-InstrStart(Instrumentation *instr)
+static inline void
+InstrStartTimer(Instrumentation *instr)
{
- if (instr->need_timer)
- {
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStart called twice in a row");
- else
- INSTR_TIME_SET_CURRENT(instr->starttime);
- }
+ Assert(INSTR_TIME_IS_ZERO(instr->starttime));
- if (instr->need_bufusage || instr->need_walusage)
- InstrPushStack(instr);
+ INSTR_TIME_SET_CURRENT(instr->starttime);
}
-static void
+static inline void
InstrStopTimer(Instrumentation *instr)
{
instr_time endtime;
- /* let's update the time only if the timer was requested */
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ Assert(!INSTR_TIME_IS_ZERO(instr->starttime));
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
@@ -79,6 +70,16 @@ InstrStopTimer(Instrumentation *instr)
INSTR_TIME_SET_ZERO(instr->starttime);
}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer)
+ InstrStartTimer(instr);
+
+ if (instr->need_bufusage || instr->need_walusage)
+ InstrPushStack(instr);
+}
+
void
InstrStop(Instrumentation *instr)
{
@@ -347,65 +348,57 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options)
InstrInitOptions(&instr->instr, instrument_options);
}
-/* Entry to a plan node */
+/* Entry to a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
void
InstrStartNode(NodeInstrumentation *instr)
{
InstrStart(&instr->instr);
}
-
-/* Exit from a plan node */
-void
-InstrStopNode(NodeInstrumentation *instr, double nTuples)
+/*
+ * Updates the node instrumentation time counter.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+static inline void
+InstrStopNodeTimer(NodeInstrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
+ Assert(!INSTR_TIME_IS_ZERO(instr->instr.starttime));
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
/*
- * Update the time only if the timer was requested.
+ * Is this the first tuple of this cycle?
*
- * Note this is different from InstrStop because total is only updated in
- * InstrEndLoop. We need the separate counter variable because we need to
- * calculate start-up time for the first tuple in each cycle, and then
- * accumulate it together.
+ * In async mode, if the plan node hadn't emitted any tuples before, this
+ * might be the first tuple
*/
- if (instr->instr.need_timer)
- {
- if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrStopNode called without start");
-
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ if (!instr->running || (instr->async_mode && instr->tuplecount < 1.0))
+ instr->firsttuple = instr->counter;
+}
- INSTR_TIME_SET_ZERO(instr->instr.starttime);
- }
+/* Exit from a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ if (instr->instr.need_timer)
+ InstrStopNodeTimer(instr);
- /*
- * Only pop the stack, accumulation runs in
- * ExecFinalizeNodeInstrumentation
- */
+ /* Only pop the stack, accumulation runs in InstrFinalizeNode */
if (instr->instr.need_bufusage || instr->instr.need_walusage)
InstrPopStack(&instr->instr);
- /* Is this the first tuple of this cycle? */
- if (!instr->running)
- {
- instr->running = true;
- instr->firsttuple = instr->counter;
- }
- else
- {
- /*
- * In async mode, if the plan node hadn't emitted any tuples before,
- * this might be the first tuple
- */
- if (instr->async_mode && save_tuplecount < 1.0)
- instr->firsttuple = instr->counter;
- }
+ instr->running = true;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
}
NodeInstrumentation *
@@ -498,6 +491,125 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
}
+/*
+ * Specialized handling of instrumented ExecProcNode
+ *
+ * These functions are equivalent to running ExecProcNodeReal wrapped in
+ * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
+ * by checking the instrumentation options when the ExecProcNode pointer gets
+ * first set, and then using a special-purpose function for each. This results
+ * in a more optimized set of compiled instructions.
+ */
+
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
+
+/* Simplified pop: restore saved state instead of re-deriving from array */
+static inline void
+InstrPopStackTo(Instrumentation *prev)
+{
+ Assert(instr_stack.stack_size > 0);
+ instr_stack.stack_size--;
+ instr_stack.current = prev;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrFull(PlanState *node)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ InstrPushStack(&instr->instr);
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ InstrStopNodeTimer(instr);
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsBuffersWalOnly(PlanState *node)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ InstrPushStack(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsTimerOnly(PlanState *node)
+{
+ NodeInstrumentation *instr = node->instrument;
+ TupleTableSlot *result;
+
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ InstrStopNodeTimer(instr);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsOnly(PlanState *node)
+{
+ NodeInstrumentation *instr = node->instrument;
+ TupleTableSlot *result;
+
+ result = node->ExecProcNodeReal(node);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+/*
+ * Returns an ExecProcNode wrapper that performs instrumentation calls,
+ * tailored to the instrumentation options enabled for the node.
+ */
+ExecProcNodeMtd
+InstrNodeSetupExecProcNode(NodeInstrumentation *instr)
+{
+ bool need_timer = instr->instr.need_timer;
+ bool need_buf = (instr->instr.need_bufusage ||
+ instr->instr.need_walusage);
+
+ if (need_timer && need_buf)
+ return ExecProcNodeInstrFull;
+ else if (need_buf)
+ return ExecProcNodeInstrRowsBuffersWalOnly;
+ else if (need_timer)
+ return ExecProcNodeInstrRowsTimerOnly;
+ else
+ return ExecProcNodeInstrRowsOnly;
+}
+
/* Trigger instrumentation handling */
TriggerInstrumentation *
InstrAllocTrigger(int n, int instrument_options)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a5e48d8cc45..2b5484861a9 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -288,6 +288,11 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
+typedef struct TupleTableSlot TupleTableSlot;
+typedef struct PlanState PlanState;
+typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+extern ExecProcNodeMtd InstrNodeSetupExecProcNode(NodeInstrumentation *instr);
+
extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
--
2.47.1
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Zsolt Parragi @ 2026-03-17 06:21 UTC (permalink / raw)
To: Tomas Vondra <[email protected]>; +Cc: Lukas Fittl <[email protected]>; PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; Peter Smith <[email protected]>
+ TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
+
+ if (ti && (ti->instr.need_bufusage || ti->instr.need_walusage))
+ InstrAccum(instr_stack.current, &ti->instr);
I think there's one more bug here, isn't ti an array? This seems to
only process the first entry, not all of them.
+InstrPopStackTo(Instrumentation *prev)
+{
+ Assert(instr_stack.stack_size > 0);
+ instr_stack.stack_size--;
+ instr_stack.current = prev;
+}
+
Shouldn't this have the same additional assertion as InstrPopStack?
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-03-17 08:18 UTC (permalink / raw)
To: Zsolt Parragi <[email protected]>; Tomas Vondra <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; Peter Smith <[email protected]>
Hi Tomas and Zsolt,
On Mon, Mar 16, 2026 at 4:50 PM Tomas Vondra <[email protected]> wrote:
> On 3/14/26 21:49, Lukas Fittl wrote:
> > Regarding Index-Only Scans, I did not add instrumentation for table
> > access yet - I might add that in a follow-up revision or we could also
> > do it in a follow-on patch.
> >
>
> I think we should, and it probably should be done in the same commit as
> for plain index scans. Mostly for consistency / less confusion.
>
> Every now and then there's an index-only scan that has to do a lot of
> heap fetches, possibly just as many as the plain index scan. But the IOS
> version would not say how many buffer accesses are for table, and users
> might assume an index-only scan does not access table. Confusing.
Makes sense - added support for index-only scans in the same commit now.
> I only started to look at the patch today, so I don't have any real
> review comments. But I noticed the pg_session_buffer_usage is added only
> to the contrib/meson.build and not to the Makefile. I assume that's not
> intentional.
Yeah, that was missed, thanks!
For context, pg_session_buffer_usage is not necessarily meant to be
committed; it's just testing a few edge cases around aborts that can't
be easily tested otherwise.
On Mon, Mar 16, 2026 at 11:21 PM Zsolt Parragi
<[email protected]> wrote:
>
> + TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
> +
> + if (ti && (ti->instr.need_bufusage || ti->instr.need_walusage))
> + InstrAccum(instr_stack.current, &ti->instr);
>
> I think there's one more bug here, isn't ti an array? This seems to
> only process the first entry, not all of them.
Good catch, fixed!
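
To illustrate the array point above, here is a minimal standalone sketch, not the patch's actual code. The struct layouts are reduced toy stand-ins, and toy_instr_accum, toy_accum_triggers and the ntriggers parameter are hypothetical names; how the real code obtains the element count is an assumption left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the real structs; only the fields used here. */
typedef struct Instrumentation
{
	bool		need_bufusage;
	bool		need_walusage;
	long		shared_blks_hit;	/* stand-in for BufferUsage */
	long		wal_records;		/* stand-in for WalUsage */
} Instrumentation;

typedef struct TriggerInstrumentation
{
	Instrumentation instr;
	int			firings;
} TriggerInstrumentation;

/* Hypothetical stand-in for InstrAccum: add src's counters into dst. */
static void
toy_instr_accum(Instrumentation *dst, const Instrumentation *src)
{
	dst->shared_blks_hit += src->shared_blks_hit;
	dst->wal_records += src->wal_records;
}

/*
 * Accumulate every entry of the trigger instrumentation array, not just
 * the first one. ntriggers is a hypothetical parameter for the array length.
 */
static void
toy_accum_triggers(Instrumentation *dst,
				   const TriggerInstrumentation *ti, int ntriggers)
{
	if (ti == NULL)
		return;

	for (int i = 0; i < ntriggers; i++)
	{
		if (ti[i].instr.need_bufusage || ti[i].instr.need_walusage)
			toy_instr_accum(dst, &ti[i].instr);
	}
}
```

The per-entry check mirrors the quoted condition, so entries that track neither buffer nor WAL usage are skipped rather than added to the parent.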
> +InstrPopStackTo(Instrumentation *prev)
> +{
> + Assert(instr_stack.stack_size > 0);
> + instr_stack.stack_size--;
> + instr_stack.current = prev;
> +}
> +
>
> Shouldn't this have the same additional assertion as InstrPopStack?
Yep, added. It has to be slightly different because we're passing the
previous stack entry (hence the "To" in the function name), vs the
current one (to save instructions), but probably helpful to have that
assurance.
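
As a rough sketch of the difference between the two pops, assuming a simple array-backed stack: the ToyInstrStack layout, the toy_* names and the exact form of both extra assertions are assumptions for illustration, since the real InstrPopStack assertion is not quoted in this message.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_STACK_MAX 16

typedef struct Instrumentation
{
	long		shared_blks_hit;
} Instrumentation;

/* Hypothetical array-backed stack; the real instr_stack may differ. */
typedef struct ToyInstrStack
{
	int			stack_size;
	Instrumentation *entries[TOY_STACK_MAX];
	Instrumentation *current;
} ToyInstrStack;

static ToyInstrStack toy_stack;

static void
toy_push(Instrumentation *instr)
{
	assert(toy_stack.stack_size < TOY_STACK_MAX);
	toy_stack.entries[toy_stack.stack_size++] = instr;
	toy_stack.current = instr;
}

/* Regular pop: the caller passes the entry it expects to be on top. */
static void
toy_pop(Instrumentation *instr)
{
	assert(toy_stack.stack_size > 0);
	assert(toy_stack.entries[toy_stack.stack_size - 1] == instr);
	toy_stack.stack_size--;
	toy_stack.current = toy_stack.stack_size > 0 ?
		toy_stack.entries[toy_stack.stack_size - 1] : NULL;
}

/*
 * Pop-to: the caller passes the previously saved entry, so the check is
 * against the entry below the top (or NULL if the stack becomes empty).
 */
static void
toy_pop_to(Instrumentation *prev)
{
	assert(toy_stack.stack_size > 0);
	assert(prev == (toy_stack.stack_size > 1 ?
					toy_stack.entries[toy_stack.stack_size - 2] : NULL));
	toy_stack.stack_size--;
	toy_stack.current = prev;
}
```

In a non-assert build toy_pop_to reduces to a decrement and a store, which is the instruction saving the "To" variant is after; the regular pop has to re-derive the new current entry from the array.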
---
See attached v9, with 0001 to 0004 the same as before.
0005 has the following additional changes beyond what's mentioned above:
- Replaced individual need_bufusage/need_walusage flags on
Instrumentation with a single "need_stack" (to fix duplication of
checks noted by Tomas over in [0])
- Renamed InstrAccum to InstrAccumStack to clarify it only accumulates
stack entries, not total time (which is also part of Instrumentation,
but doesn't use the stack mechanism)
- Fixed a stale comment on InstrPushStack regarding resource owner /
PG_FINALLY use
0006 is the patch I shared over in [0], moved here to help with the
IOUsage case. I think the patches up to and including 0006 are what is
required to allow the work in that thread to use the stack-based
instrumentation.
0007 (ExecProcNodeInstr optimization) has a minor change to adjust
ExecProcNodeInstr function naming to be less specific to WAL/buffer
usage, in anticipation of potentially adding IOUsage.
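
For readers skimming the series, the idea behind the ExecProcNodeInstr optimization can be sketched in isolation as follows. This is a toy model under stated assumptions, not the patch itself: ToyNode, the toy_* functions and the call counters are hypothetical, and need_stack merely stands in for the combined buffer/WAL flag mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyNode ToyNode;
typedef int (*ToyExecFn) (ToyNode *node);

struct ToyNode
{
	bool		need_timer;
	bool		need_stack;
	int			timer_calls;	/* counts simulated timer start/stop pairs */
	int			stack_calls;	/* counts simulated stack push/pop pairs */
	int			tuplecount;
	ToyExecFn	exec_real;		/* the uninstrumented node function */
	ToyExecFn	exec;			/* the wrapper chosen once at setup */
};

/* Each wrapper does a fixed set of work, with no per-call option checks. */
static int
toy_exec_full(ToyNode *node)
{
	int			result;

	node->stack_calls++;
	node->timer_calls++;
	result = node->exec_real(node);
	if (result)
		node->tuplecount++;
	return result;
}

static int
toy_exec_stack_only(ToyNode *node)
{
	int			result;

	node->stack_calls++;
	result = node->exec_real(node);
	if (result)
		node->tuplecount++;
	return result;
}

static int
toy_exec_timer_only(ToyNode *node)
{
	int			result;

	node->timer_calls++;
	result = node->exec_real(node);
	if (result)
		node->tuplecount++;
	return result;
}

static int
toy_exec_rows_only(ToyNode *node)
{
	int			result = node->exec_real(node);

	if (result)
		node->tuplecount++;
	return result;
}

/* The options are inspected once here, not on every call. */
static ToyExecFn
toy_setup_exec(const ToyNode *node)
{
	if (node->need_timer && node->need_stack)
		return toy_exec_full;
	else if (node->need_stack)
		return toy_exec_stack_only;
	else if (node->need_timer)
		return toy_exec_timer_only;
	else
		return toy_exec_rows_only;
}

/* Toy "real" node function: always reports one row. */
static int
toy_always_one_row(ToyNode *node)
{
	(void) node;
	return 1;
}
```

The trade-off is some duplicated wrapper code in exchange for branch-free hot paths, which is why any change to the generic start/stop logic has to be mirrored in each specialized variant.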
0008 (Index scans: Show table buffer accesses) adds Index Only Scan
support, and switches IndexScanInstrumentation to use the Instrumentation
struct instead of individual BufferUsage/WalUsage fields for passing back
parallel worker info. That avoids accidentally missing new types of
stack-based instrumentation being passed back, which would then be
missing from query totals.
Thanks,
Lukas
[0]: https://www.postgresql.org/message-id/[email protected]
--
Lukas Fittl
Attachments:
[application/octet-stream] v9-0002-instrumentation-Separate-per-node-logic-from-othe.patch (26.3K, 2-v9-0002-instrumentation-Separate-per-node-logic-from-othe.patch)
From 167bc013a583723fc9bed071caade7d3dd0198a1 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 01:19:50 -0800
Subject: [PATCH v9 2/9] instrumentation: Separate per-node logic from other
uses
Previously different places (e.g. query "total time") were repurposing
the Instrumentation struct initially introduced for capturing per-node
statistics during execution. This overuse of the same struct is confusing,
e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
code paths, and prevents future refactorings.
Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.
Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.
In passing, drop the "n" argument to InstrAlloc, as all remaining callers
need exactly one Instrumentation struct.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
contrib/auto_explain/auto_explain.c | 8 +-
.../pg_stat_statements/pg_stat_statements.c | 8 +-
contrib/postgres_fdw/postgres_fdw.c | 2 +-
src/backend/commands/explain.c | 20 +--
src/backend/executor/execMain.c | 8 +-
src/backend/executor/execParallel.c | 24 +--
src/backend/executor/execProcnode.c | 4 +-
src/backend/executor/instrument.c | 142 ++++++++++++------
src/include/executor/instrument.h | 60 +++++---
src/include/nodes/execnodes.h | 6 +-
src/tools/pgindent/typedefs.list | 3 +-
11 files changed, 172 insertions(+), 113 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index e856cd35a6f..39bf2543b70 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3..3e79108846e 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1024,7 +1024,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -1083,12 +1083,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
if (queryId != INT64CONST(0) && queryDesc->totaltime &&
pgss_enabled(nesting_level))
{
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
pgss_store(queryDesc->sourceText,
queryId,
queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 41e47cc795b..cc8ec24c30e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2779,7 +2779,7 @@ postgresIterateDirectModify(ForeignScanState *node)
if (!resultRelInfo->ri_projectReturning)
{
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- Instrumentation *instr = node->ss.ps.instrument;
+ NodeInstrumentation *instr = node->ss.ps.instrument;
Assert(!dmstate->has_returning);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 8391c5ab2da..bebad57eff5 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1837,7 +1837,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
double nloops = planstate->instrument->nloops;
double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
- double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
+ double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->instr.total) / nloops;
double rows = planstate->instrument->ntuples / nloops;
if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1890,11 +1890,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* prepare per-worker general execution details */
if (es->workers_state && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
double startup_ms;
double total_ms;
@@ -1903,7 +1903,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (nloops <= 0)
continue;
startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
- total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
+ total_ms = INSTR_TIME_GET_MILLISEC(instrument->instr.total) / nloops;
rows = instrument->ntuples / nloops;
ExplainOpenWorker(n, es);
@@ -2290,18 +2290,18 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage);
if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ show_wal_usage(es, &planstate->instrument->instr.walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
if (nloops <= 0)
@@ -2309,9 +2309,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage);
if (es->wal)
- show_wal_usage(es, &instrument->walusage);
+ show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 53631163dd6..1b950040597 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -333,7 +333,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/*
* extract information from the query descriptor and the query feature.
@@ -385,7 +385,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -435,7 +435,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -445,7 +445,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
AfterTriggerEndQuery(estate);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ac84af294c9..c153d5c1c3b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,7 +87,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object. This will depend on the length of
+ * of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -104,11 +104,15 @@ struct SharedExecutorInstrumentation
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
- /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+ /*
+ * array of num_plan_nodes * num_workers NodeInstrumentation objects
+ * follows
+ */
};
#define GetInstrumentationArray(sei) \
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
- (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+ (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
@@ -725,7 +729,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len = MAXALIGN(instrumentation_len);
instrument_offset = instrumentation_len;
instrumentation_len +=
- mul_size(sizeof(Instrumentation),
+ mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -811,7 +815,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*/
if (estate->es_instrument)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -821,7 +825,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInit(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1053,7 +1057,7 @@ static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
int n;
int ibytes;
@@ -1081,9 +1085,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* Switch into per-query memory context.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
planstate->worker_instrument =
- palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+ palloc(ibytes + offsetof(WorkerNodeInstrumentation, instrument));
MemoryContextSwitchTo(oldcontext);
planstate->worker_instrument->num_workers = instrumentation->num_workers;
@@ -1313,7 +1317,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
{
int i;
int plan_node_id = planstate->plan->plan_node_id;
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
InstrEndLoop(planstate->instrument);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d35976925ae..132fe37ef60 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -414,8 +414,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument,
- result->async_capable);
+ result->instrument = InstrAllocNode(estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 9354ad7be12..bc551f95a08 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,51 +26,30 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int instrument_options)
{
- Instrumentation *instr;
-
- /* initialize all fields to zeroes, then modify as needed */
- instr = palloc0(n * sizeof(Instrumentation));
- if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
- {
- bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
- bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
- int i;
-
- for (i = 0; i < n; i++)
- {
- instr[i].need_bufusage = need_buffers;
- instr[i].need_walusage = need_wal;
- instr[i].need_timer = need_timer;
- instr[i].async_mode = async_mode;
- }
- }
-
+ Instrumentation *instr = palloc0(sizeof(Instrumentation));
+ InstrInitOptions(instr, instrument_options);
return instr;
}
-/* Initialize a pre-allocated instrumentation structure. */
void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- memset(instr, 0, sizeof(Instrumentation));
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-/* Entry to a plan node */
void
-InstrStartNode(Instrumentation *instr)
+InstrStart(Instrumentation *instr)
{
if (instr->need_timer)
{
if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ elog(ERROR, "InstrStart called twice in a row");
else
INSTR_TIME_SET_CURRENT(instr->starttime);
}
@@ -83,24 +62,19 @@ InstrStartNode(Instrumentation *instr)
instr->walusage_start = pgWalUsage;
}
-/* Exit from a plan node */
void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
-
/* let's update the time only if the timer was requested */
if (instr->need_timer)
{
if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStopNode called without start");
+ elog(ERROR, "InstrStop called without start");
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
}
@@ -113,6 +87,74 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (instr->need_walusage)
WalUsageAccumDiff(&instr->walusage,
&pgWalUsage, &instr->walusage_start);
+}
+
+/* Node instrumentation handling */
+
+/* Allocate new node instrumentation structure */
+NodeInstrumentation *
+InstrAllocNode(int instrument_options, bool async_mode)
+{
+ NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+
+ InstrInitNode(instr, instrument_options);
+ instr->async_mode = async_mode;
+
+ return instr;
+}
+
+/* Initialize a pre-allocated instrumentation structure. */
+void
+InstrInitNode(NodeInstrumentation *instr, int instrument_options)
+{
+ memset(instr, 0, sizeof(NodeInstrumentation));
+ InstrInitOptions(&instr->instr, instrument_options);
+}
+
+/* Entry to a plan node */
+void
+InstrStartNode(NodeInstrumentation *instr)
+{
+ InstrStart(&instr->instr);
+}
+
+/* Exit from a plan node */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ double save_tuplecount = instr->tuplecount;
+ instr_time endtime;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
+
+ /*
+ * Update the time only if the timer was requested.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+ if (instr->instr.need_timer)
+ {
+ if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
+ elog(ERROR, "InstrStopNode called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
+ }
+
+ /* Add delta of buffer usage since entry to node's totals */
+ if (instr->instr.need_bufusage)
+ BufferUsageAccumDiff(&instr->instr.bufusage,
+ &pgBufferUsage, &instr->instr.bufusage_start);
+
+ if (instr->instr.need_walusage)
+ WalUsageAccumDiff(&instr->instr.walusage,
+ &pgWalUsage, &instr->instr.walusage_start);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -133,7 +175,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
/* Update tuple count */
void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
{
/* count the returned tuples */
instr->tuplecount += nTuples;
@@ -141,24 +183,24 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation *instr)
{
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
elog(ERROR, "InstrEndLoop called on running node");
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
- INSTR_TIME_ADD(instr->total, instr->counter);
+ INSTR_TIME_ADD(instr->instr.total, instr->counter);
instr->ntuples += instr->tuplecount;
instr->nloops += 1;
/* Reset for next cycle (if any) */
instr->running = false;
- INSTR_TIME_SET_ZERO(instr->starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
INSTR_TIME_SET_ZERO(instr->counter);
INSTR_TIME_SET_ZERO(instr->firsttuple);
instr->tuplecount = 0;
@@ -166,7 +208,7 @@ InstrEndLoop(Instrumentation *instr)
/* aggregate instrumentation information */
void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
{
if (!dst->running && add->running)
{
@@ -181,7 +223,7 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->tuplecount += add->tuplecount;
INSTR_TIME_ADD(dst->startup, add->startup);
- INSTR_TIME_ADD(dst->total, add->total);
+ INSTR_TIME_ADD(dst->instr.total, add->instr.total);
dst->ntuples += add->ntuples;
dst->ntuples2 += add->ntuples2;
dst->nloops += add->nloops;
@@ -189,11 +231,11 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->need_bufusage)
- BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ if (dst->instr.need_bufusage)
+ BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
- if (dst->need_walusage)
- WalUsageAdd(&dst->walusage, &add->walusage);
+ if (dst->instr.need_walusage)
+ WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
}
/* Trigger instrumentation handling */
@@ -204,7 +246,7 @@ InstrAllocTrigger(int n, int instrument_options)
int i;
for (i = 0; i < n; i++)
- InstrInit(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, instrument_options);
return tginstr;
}
@@ -212,13 +254,13 @@ InstrAllocTrigger(int n, int instrument_options)
void
InstrStartTrigger(TriggerInstrumentation *tginstr)
{
- InstrStartNode(&tginstr->instr);
+ InstrStart(&tginstr->instr);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
- InstrStopNode(&tginstr->instr, 0);
+ InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a9c2233227f..b11d64633b5 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -67,38 +67,55 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/*
+ * General purpose instrumentation that can capture time and WAL/buffer usage
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured in between).
+ */
typedef struct Instrumentation
{
- /* Parameters set at node creation: */
+ /* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
bool need_bufusage; /* true if we need buffer usage data */
bool need_walusage; /* true if we need WAL usage data */
+ /* Internal state keeping: */
+ instr_time starttime; /* start time of last InstrStart */
+ BufferUsage bufusage_start; /* buffer usage at start */
+ WalUsage walusage_start; /* WAL usage at start */
+ /* Accumulated statistics: */
+ instr_time total; /* total runtime */
+ BufferUsage bufusage; /* total buffer usage */
+ WalUsage walusage; /* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
+{
+ Instrumentation instr;
+ /* Parameters set at node creation: */
bool async_mode; /* true if node is in async mode */
/* Info about current plan cycle: */
bool running; /* true if we've completed first tuple */
- instr_time starttime; /* start time of current iteration of node */
instr_time counter; /* accumulated runtime for this node */
instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
instr_time startup; /* total startup time */
- instr_time total; /* total time */
double ntuples; /* total tuples produced */
double ntuples2; /* secondary node-specific tuple counter */
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
-typedef struct WorkerInstrumentation
+typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
- Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
-} WorkerInstrumentation;
+ NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+} WorkerNodeInstrumentation;
typedef struct TriggerInstrumentation
{
@@ -110,14 +127,19 @@ typedef struct TriggerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
- bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int instrument_options);
+extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr);
+
+extern NodeInstrumentation *InstrAllocNode(int instrument_options,
+ bool async_mode);
+extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation *instr);
+extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation *instr);
+extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 22ae052cdd3..fbf13683581 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1184,8 +1184,10 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
- Instrumentation *instrument; /* Optional runtime stats for this node */
- WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
+ NodeInstrumentation *instrument; /* Optional runtime stats for this
+ * node */
+ WorkerNodeInstrumentation *worker_instrument; /* per-worker
+ * instrumentation */
/* Per-worker JIT instrumentation */
struct SharedJitInstrumentation *worker_jit_instrument;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 08e55cbb294..caa0caef324 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1801,6 +1801,7 @@ NextSampleBlock_function
NextSampleTuple_function
NextValueExpr
Node
+NodeInstrumentation
NodeTag
NonEmptyRange
NoneCompressorState
@@ -3404,9 +3405,9 @@ WorkTableScan
WorkTableScanState
WorkerInfo
WorkerInfoData
-WorkerInstrumentation
WorkerJobDumpPtrType
WorkerJobRestorePtrType
+WorkerNodeInstrumentation
Working_State
WriteBufPtrType
WriteBytePtrType
--
2.47.1
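As a reading aid for the split made in the patch above, here is a minimal standalone sketch, not the PostgreSQL code itself. It assumes simplified stand-ins: `Instr` and `NodeInstr` for Instrumentation and NodeInstrumentation, and `fake_clock` and `fake_buffer_hits` for instr_time and pgBufferUsage. All of these names are hypothetical. It shows the general-purpose stop adding straight into `total`, while the node-level stop adds into a per-cycle `counter` that is folded into `total` only at end of loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for the wall clock and the global pgBufferUsage. */
static long fake_clock = 1;
static long fake_buffer_hits = 0;

/* General-purpose part: options, start snapshot, accumulated totals. */
typedef struct Instr
{
	bool		need_timer;
	bool		need_bufusage;
	long		starttime;		/* 0 means "not started" */
	long		bufusage_start;
	long		total;
	long		bufusage;
} Instr;

/* Node-specific part embeds the general struct and adds per-cycle state. */
typedef struct NodeInstr
{
	Instr		instr;
	bool		running;
	long		counter;		/* runtime accumulated in the current cycle */
	double		tuplecount;
	double		ntuples;
	double		nloops;
} NodeInstr;

static void
instr_start(Instr *i)
{
	if (i->need_timer)
	{
		assert(i->starttime == 0);	/* start called twice in a row? */
		i->starttime = fake_clock;
	}
	if (i->need_bufusage)
		i->bufusage_start = fake_buffer_hits;
}

/* General-purpose stop: elapsed time goes directly into total. */
static void
instr_stop(Instr *i)
{
	if (i->need_timer)
	{
		assert(i->starttime != 0);	/* stop called without start? */
		i->total += fake_clock - i->starttime;
		i->starttime = 0;
	}
	if (i->need_bufusage)
		i->bufusage += fake_buffer_hits - i->bufusage_start;
}

static void
node_start(NodeInstr *n)
{
	instr_start(&n->instr);
}

/* Node stop: elapsed time goes into the per-cycle counter, not total. */
static void
node_stop(NodeInstr *n, double ntuples)
{
	n->tuplecount += ntuples;
	if (n->instr.need_timer)
	{
		assert(n->instr.starttime != 0);
		n->counter += fake_clock - n->instr.starttime;
		n->instr.starttime = 0;
	}
	if (n->instr.need_bufusage)
		n->instr.bufusage += fake_buffer_hits - n->instr.bufusage_start;
	n->running = true;
}

/* End of a run cycle: fold per-cycle values into the totals and reset. */
static void
node_end_loop(NodeInstr *n)
{
	if (!n->running)
		return;
	n->instr.total += n->counter;
	n->ntuples += n->tuplecount;
	n->nloops += 1;
	n->running = false;
	n->counter = 0;
	n->tuplecount = 0;
}
```

In this sketch a caller would zero the struct, set the `need_*` flags, and then bracket each unit of work with a start/stop pair; for a node, `total` stays unchanged until `node_end_loop` runs.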
[application/octet-stream] v9-0001-instrumentation-Separate-trigger-logic-from-other.patch (9.7K, 3-v9-0001-instrumentation-Separate-trigger-logic-from-other.patch)
From 8d044728e3785e4b314c7e2557120a14f4b40740 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v9 1/9] instrumentation: Separate trigger logic from other
uses
Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/commands/explain.c | 19 ++++++++-----------
src/backend/commands/trigger.c | 22 +++++++++++-----------
src/backend/executor/execMain.c | 2 +-
src/backend/executor/instrument.c | 26 ++++++++++++++++++++++++++
src/include/executor/instrument.h | 12 ++++++++++++
src/include/nodes/execnodes.h | 2 +-
src/tools/pgindent/typedefs.list | 1 +
7 files changed, 60 insertions(+), 24 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 296ea8a1ed2..8391c5ab2da 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1101,18 +1101,15 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
for (nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
{
Trigger *trig = rInfo->ri_TrigDesc->triggers + nt;
- Instrumentation *instr = rInfo->ri_TrigInstrument + nt;
+ TriggerInstrumentation *tginstr = rInfo->ri_TrigInstrument + nt;
char *relname;
char *conname = NULL;
- /* Must clean up instrumentation state */
- InstrEndLoop(instr);
-
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
*/
- if (instr->ntuples == 0)
+ if (tginstr->firings == 0)
continue;
ExplainOpenGroup("Trigger", NULL, true, es);
@@ -1137,11 +1134,11 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
if (show_relname)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
- appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- INSTR_TIME_GET_MILLISEC(instr->total),
- instr->ntuples);
+ appendStringInfo(es->str, ": time=%.3f calls=%d\n",
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total),
+ tginstr->firings);
else
- appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
+ appendStringInfo(es->str, ": calls=%d\n", tginstr->firings);
}
else
{
@@ -1151,9 +1148,9 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Relation", relname, es);
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- INSTR_TIME_GET_MILLISEC(instr->total), 3,
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total), 3,
es);
- ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
+ ExplainPropertyInteger("Calls", NULL, tginstr->firings, es);
}
if (conname)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 9c0438a125a..db7a1f75650 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -92,7 +92,7 @@ static bool TriggerEnabled(EState *estate, ResultRelInfo *relinfo,
static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2311,7 +2311,7 @@ static HeapTuple
ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2346,7 +2346,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2391,10 +2391,10 @@ ExecCallTriggerFunc(TriggerData *trigdata,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
return (HeapTuple) DatumGetPointer(result);
}
@@ -3938,7 +3938,7 @@ static void AfterTriggerExecute(EState *estate,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2);
@@ -4332,7 +4332,7 @@ AfterTriggerExecute(EState *estate,
ResultRelInfo *src_relInfo,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
- FmgrInfo *finfo, Instrumentation *instr,
+ FmgrInfo *finfo, TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2)
@@ -4373,7 +4373,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4590,10 +4590,10 @@ AfterTriggerExecute(EState *estate,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
}
@@ -4709,7 +4709,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
Relation rel = NULL;
TriggerDesc *trigdesc = NULL;
FmgrInfo *finfo = NULL;
- Instrumentation *instr = NULL;
+ TriggerInstrumentation *instr = NULL;
TupleTableSlot *slot1 = NULL,
*slot2 = NULL;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 58b84955c2b..53631163dd6 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1285,7 +1285,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
}
else
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index a40610bc252..9354ad7be12 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -196,6 +196,32 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
WalUsageAdd(&dst->walusage, &add->walusage);
}
+/* Trigger instrumentation handling */
+TriggerInstrumentation *
+InstrAllocTrigger(int n, int instrument_options)
+{
+ TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ int i;
+
+ for (i = 0; i < n; i++)
+ InstrInit(&tginstr[i].instr, instrument_options);
+
+ return tginstr;
+}
+
+void
+InstrStartTrigger(TriggerInstrumentation *tginstr)
+{
+ InstrStartNode(&tginstr->instr);
+}
+
+void
+InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
+{
+ InstrStopNode(&tginstr->instr, 0);
+ tginstr->firings += firings;
+}
+
/* note current values during parallel executor startup */
void
InstrStartParallelQuery(void)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..a9c2233227f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -100,6 +100,13 @@ typedef struct WorkerInstrumentation
Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
+typedef struct TriggerInstrumentation
+{
+ Instrumentation instr;
+ int firings; /* number of times the instrumented trigger
+ * was fired */
+} TriggerInstrumentation;
+
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
@@ -111,6 +118,11 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+
+extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
+extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
+
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 0716c5a9aed..22ae052cdd3 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -533,7 +533,7 @@ typedef struct ResultRelInfo
ExprState **ri_TrigWhenExprs;
/* optional runtime measurements for triggers */
- Instrumentation *ri_TrigInstrument;
+ TriggerInstrumentation *ri_TrigInstrument;
/* On-demand created slots for triggers / returning processing */
TupleTableSlot *ri_ReturningSlot; /* for trigger output tuples */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 52f8603a7be..08e55cbb294 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3183,6 +3183,7 @@ TriggerDesc
TriggerEvent
TriggerFlags
TriggerInfo
+TriggerInstrumentation
TriggerTransition
TruncateStmt
TsmRoutine
--
2.47.1
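The idea in patch 1/9 above, a trigger-specific wrapper that counts firings in its own integer field instead of reusing a tuple counter, can be sketched in isolation roughly as follows. This is an illustrative model under stated assumptions, not the patch's code: `Timing`, `TrigInstr`, `fake_clock`, and the `trig_*` helpers are hypothetical names, and `calloc` stands in for palloc0.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the clock source. */
static long fake_clock = 1;

/* Simplified timing-only base struct (assumption, not Instrumentation). */
typedef struct Timing
{
	long		starttime;		/* 0 means "not started" */
	long		total;
} Timing;

/* Wrapper: base timing plus a dedicated integer firing counter. */
typedef struct TrigInstr
{
	Timing		instr;
	int			firings;
} TrigInstr;

/* Allocate one zeroed entry per trigger. */
static TrigInstr *
trig_alloc(int n)
{
	return calloc((size_t) n, sizeof(TrigInstr));
}

static void
trig_start(TrigInstr *t)
{
	assert(t->instr.starttime == 0);
	t->instr.starttime = fake_clock;
}

static void
trig_stop(TrigInstr *t, int firings)
{
	assert(t->instr.starttime != 0);
	t->instr.total += fake_clock - t->instr.starttime;
	t->instr.starttime = 0;
	t->firings += firings;
}

/* Print triggers that fired at least once; return how many were printed. */
static int
trig_report(const TrigInstr *t, int n, FILE *out)
{
	int			reported = 0;

	for (int i = 0; i < n; i++)
	{
		if (t[i].firings == 0)
			continue;			/* never invoked, skip it */
		fprintf(out, "trigger %d: time=%ld calls=%d\n",
				i, t[i].instr.total, t[i].firings);
		reported++;
	}
	return reported;
}
```

Keeping the count as an `int` is what lets the report use an integer format such as `calls=%d` instead of formatting a double.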
[application/octet-stream] v9-0004-instrumentation-Add-additional-regression-tests-c.patch (23.5K, 4-v9-0004-instrumentation-Add-additional-regression-tests-c.patch)
From f0789709d2ca151fdda0ef97613373c47e8ddcd0 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 22 Feb 2026 16:12:48 -0800
Subject: [PATCH v9 4/9] instrumentation: Add additional regression tests
covering buffer usage
This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.
Testing this is challenging because there can be different sources of
buffer activity, so we rely on temporary tables where we can to prove
that activity was captured and not lost. This supports a future commit
that will rework some of the instrumentation logic that could cause
areas covered by these tests to fail.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
.../pg_stat_statements/expected/utility.out | 70 +++++++
contrib/pg_stat_statements/expected/wal.out | 48 +++++
contrib/pg_stat_statements/sql/utility.sql | 56 +++++
contrib/pg_stat_statements/sql/wal.sql | 33 +++
src/test/regress/expected/explain.out | 197 ++++++++++++++++++
src/test/regress/sql/explain.sql | 194 +++++++++++++++++
6 files changed, 598 insertions(+)
diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index e4d6564ea5b..cba487f6be5 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -289,6 +289,76 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
1 | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
(3 rows)
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT run_explain_buffers_test();
+ run_explain_buffers_test
+--------------------------
+
+(1 row)
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+ query | has_buffer_stats
+-----------------------------------+------------------
+ SELECT run_explain_buffers_test() | t
+(1 row)
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+CALL pgss_call_rollback_proc();
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------------+-------+---------------+---------------------+-----------------------
+ CALL pgss_call_rollback_proc() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/expected/wal.out b/contrib/pg_stat_statements/expected/wal.out
index 977e382d848..611213daef6 100644
--- a/contrib/pg_stat_statements/expected/wal.out
+++ b/contrib/pg_stat_statements/expected/wal.out
@@ -28,3 +28,51 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
t
(1 row)
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT pgss_error_func();
+ pgss_error_func
+-----------------
+
+(1 row)
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------+-------+---------------+---------------------+-----------------------
+ SELECT pgss_error_func() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index dd97203c210..7540e49c73c 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -152,6 +152,62 @@ EXPLAIN (costs off) SELECT a FROM generate_series(1,10) AS tab(a) WHERE a = 7;
SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+SELECT run_explain_buffers_test();
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+CALL pgss_call_rollback_proc();
+
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
+
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/sql/wal.sql b/contrib/pg_stat_statements/sql/wal.sql
index 1dc1552a81e..467e321b206 100644
--- a/contrib/pg_stat_statements/sql/wal.sql
+++ b/contrib/pg_stat_statements/sql/wal.sql
@@ -18,3 +18,36 @@ wal_records > 0 as wal_records_generated,
wal_records >= rows as wal_records_ge_rows
FROM pg_stat_statements ORDER BY query COLLATE "C";
SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT pgss_error_func();
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..e28e7543693 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,200 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+ exception_buffers_nested_visible
+----------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_cursor_explain_buffers();
+ noscroll_ok | scroll_ok
+-------------+-----------
+ t | t
+(1 row)
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_parallel_explain_buffers();
+ ratio
+-------
+ 1
+(1 row)
+
+DROP FUNCTION check_parallel_explain_buffers;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..cf5c6335a19 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,197 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_cursor_explain_buffers();
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_parallel_explain_buffers();
+
+DROP FUNCTION check_parallel_explain_buffers;
--
2.47.1
[application/octet-stream] v9-0005-Optimize-measuring-WAL-buffer-usage-through-stack.patch (68.7K, 5-v9-0005-Optimize-measuring-WAL-buffer-usage-through-stack.patch)
download | inline diff:
From 376589f8c52f37a9eec8d1b11b16be653107ae18 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v9 5/9] Optimize measuring WAL/buffer usage through
stack-based instrumentation
Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that get
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode, and popped at InstrStopNode. Stack entries are zero
initialized (avoiding the diff mechanism) and get added to their parent
entry when they are finalized, i.e. no more modifications can occur.
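As a rough illustration of the paragraph above, the following is a minimal standalone sketch of stack-based usage counting. It is not the patch's code: UsageEntry, current_entry and the usage_* and count_* helpers are hypothetical names, and only two counters are modelled where the real BufferUsage/WalUsage structs have many more.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for a per-node stack entry. */
typedef struct UsageEntry
{
	int64_t		blks_hit;
	int64_t		blks_read;
	struct UsageEntry *prev;	/* entry that was current before the push */
} UsageEntry;

/* Hypothetical stand-in for the global "current stack entry" pointer. */
static UsageEntry *current_entry = NULL;

/* Entries start at zero, so no start/end diff is needed later. */
static void
usage_init(UsageEntry *entry)
{
	entry->blks_hit = 0;
	entry->blks_read = 0;
	entry->prev = NULL;
}

/* On node entry: remember the previous entry and become current. */
static void
usage_push(UsageEntry *entry)
{
	entry->prev = current_entry;
	current_entry = entry;
}

/* On node exit: restore whatever was current before. */
static void
usage_pop(UsageEntry *entry)
{
	assert(current_entry == entry);
	current_entry = entry->prev;
}

/* Activity is charged directly to the current entry, if there is one. */
static void
count_buffer_hit(void)
{
	if (current_entry != NULL)
		current_entry->blks_hit++;
}

static void
count_buffer_read(void)
{
	if (current_entry != NULL)
		current_entry->blks_read++;
}

/* Once an entry can no longer change, fold its totals into the parent. */
static void
usage_finalize(const UsageEntry *entry, UsageEntry *parent)
{
	parent->blks_hit += entry->blks_hit;
	parent->blks_read += entry->blks_read;
}
```

In this sketch an entry may be pushed and popped many times (once per node call) and is folded into its parent only once, which is what lets the per-call path avoid any diff computation.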
To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct and its
associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.
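To show why the stack needs explicit cleanup on abort, here is a small standalone sketch that uses setjmp/longjmp as a stand-in for PG_TRY/PG_FINALLY. All names (StackEntry, error_handler, measured_call, raise_error and the two work functions) are hypothetical, and unlike PG_FINALLY the sketch reports the error through a return value instead of re-throwing it.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical one-counter stack entry. */
typedef struct StackEntry
{
	long		blks_hit;
	struct StackEntry *prev;
} StackEntry;

static StackEntry *current_entry = NULL;
static jmp_buf *error_handler = NULL;	/* innermost active handler */

/* Stand-in for an error being raised: jump to the innermost handler. */
static void
raise_error(void)
{
	assert(error_handler != NULL);
	longjmp(*error_handler, 1);
}

static void
count_hit(void)
{
	if (current_entry != NULL)
		current_entry->blks_hit++;
}

/* Work that touches a buffer and completes normally. */
static void
ok_work(void)
{
	count_hit();
}

/* Work that touches a buffer and then aborts. */
static void
failing_work(void)
{
	count_hit();
	raise_error();
}

/*
 * Run fn() with 'entry' as the current stack entry. The cleanup after the
 * if/else runs on both the normal and the error path, so the stack pointer
 * is always restored. Returns 1 if fn() aborted, 0 otherwise.
 */
static int
measured_call(StackEntry *entry, void (*fn) (void))
{
	jmp_buf		local;
	jmp_buf    *saved_handler = error_handler;
	StackEntry *saved_entry = current_entry;
	int			failed = 0;

	entry->prev = saved_entry;
	current_entry = entry;
	error_handler = &local;

	if (setjmp(local) == 0)
		fn();
	else
		failed = 1;

	/* "finally": restore handler and stack regardless of outcome */
	error_handler = saved_handler;
	current_entry = saved_entry;
	return failed;
}
```

Because the entry is owned by the caller of measured_call, the counts recorded before the abort remain readable afterwards, and the global pointer never keeps referring to an entry whose section has ended.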
This also drops the global pgBufferUsage; any callers interested in
measuring buffer activity should instead utilize InstrStart/InstrStop.
The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and heap_page_prune_and_freeze for
measuring FPIs.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
contrib/auto_explain/auto_explain.c | 16 +-
.../pg_stat_statements/pg_stat_statements.c | 87 +---
src/backend/access/brin/brin.c | 10 +-
src/backend/access/gin/gininsert.c | 10 +-
src/backend/access/heap/vacuumlazy.c | 15 +-
src/backend/access/nbtree/nbtsort.c | 10 +-
src/backend/commands/analyze.c | 31 +-
src/backend/commands/explain.c | 26 +-
src/backend/commands/explain_dr.c | 34 +-
src/backend/commands/prepare.c | 27 +-
src/backend/commands/vacuumparallel.c | 10 +-
src/backend/executor/execMain.c | 72 ++-
src/backend/executor/execParallel.c | 22 +-
src/backend/executor/execProcnode.c | 84 +++-
src/backend/executor/instrument.c | 422 +++++++++++++-----
src/backend/storage/buffer/bufmgr.c | 6 +-
src/backend/utils/activity/pgstat_io.c | 6 +-
src/include/executor/execdesc.h | 4 +-
src/include/executor/executor.h | 2 +
src/include/executor/instrument.h | 178 +++++++-
src/include/utils/resowner.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
22 files changed, 766 insertions(+), 309 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 39bf2543b70..4be81489ff4 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -305,19 +305,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
if (auto_explain_enabled())
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -382,7 +372,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
/* Log plan if duration is exceeded. */
- msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
+ msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total);
if (msec >= auto_explain_log_min_duration)
{
ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 3e79108846e..9856dec3a5f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -910,22 +910,11 @@ pgss_planner(Query *parse,
&& pgss_track_planning && query_string
&& parse->queryId != INT64CONST(0))
{
- instr_time start;
- instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
-
- /*
- * Similarly the planner could write some WAL records in some cases
- * (e.g. setting a hint bit with those being WAL-logged)
- */
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ /* Track time and buffer/WAL usage as the planner can access them. */
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -939,30 +928,20 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
parse->stmt_len,
PGSS_PLAN,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
0,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
@@ -1014,19 +993,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
*/
if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -1088,10 +1057,10 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
queryDesc->plannedstmt->stmt_location,
queryDesc->plannedstmt->stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
+ INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total),
queryDesc->estate->es_total_processed,
- &queryDesc->totaltime->bufusage,
- &queryDesc->totaltime->walusage,
+ &queryDesc->totaltime->instr.bufusage,
+ &queryDesc->totaltime->instr.walusage,
queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
NULL,
queryDesc->estate->es_parallel_workers_to_launch,
@@ -1155,17 +1124,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
!IsA(parsetree, ExecuteStmt) &&
!IsA(parsetree, PrepareStmt))
{
- instr_time start;
- instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -1181,6 +1144,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
@@ -1195,9 +1159,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
* former value, which'd otherwise be a good idea.
*/
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
/*
* Track the total number of rows retrieved or affected by the utility
* statements of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED
@@ -1209,23 +1170,15 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
saved_stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
rows,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2a0f8c8e3b8..1ceb2306954 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2886,6 +2886,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2935,7 +2936,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2950,7 +2951,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e54782d9dd8..04cd53916ca 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -991,8 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2117,6 +2117,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2185,7 +2186,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2200,7 +2201,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 82c5b28e0ad..b4cbd0e682c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -641,8 +641,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ QueryInstrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -658,6 +657,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -983,14 +984,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ instr = InstrQueryStopFinalize(instr);
+
if (verbose || params.log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_vacuum_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -999,12 +1000,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
+ BufferUsage bufferusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
total_blks_hit = bufferusage.shared_blks_hit +
bufferusage.local_blks_hit;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 47a9bda30c9..6a261c8dcbd 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1466,8 +1466,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -1753,6 +1753,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1828,7 +1829,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1838,7 +1839,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index eeed91be266..670b5b0e8ce 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -309,9 +309,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ QueryInstrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -362,6 +360,9 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -742,12 +743,13 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ instr = InstrQueryStopFinalize(instr);
+
if (verbose || params.log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_analyze_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -755,18 +757,15 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
-
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ BufferUsage bufusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
+
+ total_blks_hit = bufusage.shared_blks_hit +
+ bufusage.local_blks_hit;
+ total_blks_read = bufusage.shared_blks_read +
+ bufusage.local_blks_read;
+ total_blks_dirtied = bufusage.shared_blks_dirtied +
+ bufusage.local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index bebad57eff5..fa250b0196a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -324,14 +324,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ QueryInstrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instr = InstrQueryAlloc(INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+ else
+ instr = InstrQueryAlloc(INSTRUMENT_TIMER);
+
if (es->memory)
{
/*
@@ -348,15 +350,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrQueryStart(instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ instr = InstrQueryStopFinalize(instr);
if (es->memory)
{
@@ -364,16 +363,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 3c96061cf32..c9695b03a60 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,20 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ QueryInstrumentation *instr = NULL;
/* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ if (myState->es->timing || myState->es->buffers)
+ {
+ InstrumentOption instrument_options = 0;
+
+ if (myState->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (myState->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ instr = InstrQueryAlloc(instrument_options);
+ InstrQueryStart(instr);
+ }
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +191,19 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
+ if (myState->es->timing || myState->es->buffers)
+ instr = InstrQueryStopFinalize(instr);
+
/* Update timing data */
if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
+ INSTR_TIME_ADD(myState->metrics.timeSpent, instr->instr.total);
/* Update buffer metrics */
if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ BufferUsageAdd(&myState->metrics.bufferUsage, &instr->instr.bufusage);
+
+ if (myState->es->timing || myState->es->buffers)
+ pfree(instr);
return true;
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 876aad2100a..6e98e0742b8 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -580,13 +580,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ QueryInstrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ int instrument_options = INSTRUMENT_TIMER;
+
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -598,9 +601,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrQueryStart(instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -635,8 +636,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ instr = InstrQueryStopFinalize(instr);
if (es->memory)
{
@@ -644,13 +644,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -660,7 +653,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 279108ca89f..75074fe4efa 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -308,8 +308,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgBufferUsage or
- * pgWalUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -995,6 +995,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1084,7 +1085,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1092,7 +1093,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 1b950040597..771b1286015 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -78,6 +78,7 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
/* decls for local routines only used within this module */
static void InitPlan(QueryDesc *queryDesc, int eflags);
static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecFinalizeTriggerInstrumentation(EState *estate);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(QueryDesc *queryDesc,
@@ -331,9 +332,28 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
- /* Allow instrumentation of Executor overall runtime */
+ /*
+ * Start up required top-level instrumentation stack for WAL/buffer
+ * tracking
+ */
+ if (!queryDesc->totaltime && (estate->es_instrument & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)))
+ queryDesc->totaltime = InstrQueryAlloc(estate->es_instrument);
+
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ {
+ /* Allow instrumentation of Executor overall runtime */
+ InstrQueryStart(queryDesc->totaltime);
+
+ /*
+ * Remember all node entries for abort recovery. We do this once here
+ * after the first call to InstrQueryStart has pushed the parent
+ * entry.
+ */
+ if ((estate->es_instrument & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) &&
+ !queryDesc->already_executed)
+ ExecRememberNodeInstrumentation(queryDesc->planstate,
+ queryDesc->totaltime);
+ }
/*
* extract information from the query descriptor and the query feature.
@@ -385,7 +405,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -435,7 +455,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ InstrQueryStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -444,8 +464,26 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
+ /*
+ * Accumulate per-node and trigger statistics to their respective parent
+ * instrumentation stacks.
+ *
+ * We skip this in parallel workers because their per-node stats are
+ * reported individually via ExecParallelReportInstrumentation, and the
+ * leader's own ExecFinalizeNodeInstrumentation handles propagation. If
+ * we accumulated here, the leader would double-count: worker parent nodes
+ * would already include their children's stats, and then the leader's
+ * accumulation would add the children again.
+ */
+ if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ queryDesc->totaltime = InstrQueryStopFinalize(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
@@ -1499,6 +1537,30 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
return resultRelInfo->ri_ancestorResultRels;
}
+static void
+ExecFinalizeTriggerInstrumentation(EState *estate)
+{
+ List *rels = NIL;
+
+ rels = list_concat(rels, estate->es_tuple_routing_result_relations);
+ rels = list_concat(rels, estate->es_opened_result_relations);
+ rels = list_concat(rels, estate->es_trig_target_relations);
+
+ foreach_node(ResultRelInfo, rInfo, rels)
+ {
+ TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
+
+ if (ti == NULL || rInfo->ri_TrigDesc == NULL)
+ continue;
+
+ for (int nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
+ {
+ if (ti[nt].instr.need_stack)
+ InstrAccumStack(instr_stack.current, &ti[nt].instr);
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecPostprocessPlan
*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index c153d5c1c3b..73534fa6c7e 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -694,7 +694,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
- * looking at pgBufferUsage, so do it unconditionally.
+ * looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1075,8 +1075,22 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
instrument = GetInstrumentationArray(instrumentation);
instrument += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
+ {
InstrAggNode(planstate->instrument, &instrument[n]);
+ /*
+ * Also add worker WAL usage to the global pgWalUsage counter.
+ *
+ * When per-node instrumentation is active, parallel workers skip
+ * ExecFinalizeNodeInstrumentation (to avoid double-counting in
+ * EXPLAIN), so per-node WAL activity is not rolled up into the
+ * query-level stats that InstrAccumParallelQuery receives. Without
+ * this, pgWalUsage would under-report WAL generated by parallel
+ * workers when instrumentation is active.
+ */
+ WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
+ }
+
/*
* Also store the per-worker detail.
*
@@ -1456,6 +1470,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1516,7 +1531,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1532,7 +1547,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 132fe37ef60..b1181715c30 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -123,6 +123,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
+static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -788,10 +790,10 @@ ExecShutdownNode_walker(PlanState *node, void *context)
* at least once already. We don't expect much CPU consumption during
* node shutdown, but in the case of Gather or Gather Merge, we may shut
* down workers at this stage. If so, their buffer usage will get
- * propagated into pgBufferUsage at this point, and we want to make sure
- * that it gets associated with the Gather node. We skip this if the node
- * has never been executed, so as to avoid incorrectly making it appear
- * that it has.
+ * propagated into the current instrumentation stack entry at this point,
+ * and we want to make sure that it gets associated with the Gather node.
+ * We skip this if the node has never been executed, so as to avoid
+ * incorrectly making it appear that it has.
*/
if (node->instrument && node->instrument->running)
InstrStartNode(node->instrument);
@@ -829,6 +831,80 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecRememberNodeInstrumentation
+ *
+ * Register all per-node instrumentation entries as unfinalized children of
+ * the executor's instrumentation. This is needed for abort recovery: if the
+ * executor aborts, we walk each per-node entry to recover buffer/WAL data
+ * from nodes that were never finalized, so that aggregate totals stay
+ * accurate for anyone interested in them.
+ */
+void
+ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent)
+{
+ (void) ExecRememberNodeInstrumentation_walker(node, parent);
+}
+
+static bool
+ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ QueryInstrumentation *parent = (QueryInstrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ if (node->instrument)
+ InstrQueryRememberNode(parent, node->instrument);
+
+ return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
+}
+
+/*
+ * ExecFinalizeNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation).
+ *
+ * This must run after the cleanup done by ExecShutdownNode, and not rely on any
+ * resources cleaned up by it. We also expect shutdown actions to have occurred,
+ * e.g. parallel worker instrumentation to have been added to the leader.
+ */
+void
+ExecFinalizeNodeInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeNodeInstrumentation_walker(node, instr_stack.current);
+}
+
+static bool
+ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ Instrumentation *parent = (Instrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ /*
+ * Recurse into children first (bottom-up accumulation), passing our
+ * instrumentation as the parent context. This ensures children can
+ * accumulate to us even if they were never executed by the leader (e.g.
+ * nodes beneath Gather that only workers ran).
+ */
+ planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
+ node->instrument ? &node->instrument->instr : parent);
+
+ if (!node->instrument)
+ return false;
+
+ node->instrument = InstrFinalizeNode(node->instrument, parent);
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
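For readers following the executor hunks above, here is a toy model of the stack-based scheme they rely on. It is only a sketch under stated assumptions, not code from the patch: ToyInstr, toy_push, toy_pop, toy_count_hit, toy_accum and toy_demo are hypothetical names, and the stack layout (entries hold the pushed items, with a catch-all top entry) is a guess at what InstrPushStack/InstrPopStack do, since their bodies are not part of this hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the patch's Instrumentation struct. */
typedef struct ToyInstr
{
	long		blks_hit;	/* buffer hits charged directly to this entry */
} ToyInstr;

#define TOY_STACK_MAX 16

static ToyInstr toy_top;	/* catch-all entry, playing the role of instr_top */
static ToyInstr *toy_entries[TOY_STACK_MAX];
static int	toy_size = 0;
static ToyInstr *toy_current = &toy_top;

/* Enter a node: push the entry and make it the current accumulation target. */
static void
toy_push(ToyInstr *instr)
{
	assert(toy_size < TOY_STACK_MAX);
	toy_entries[toy_size++] = instr;
	toy_current = instr;
}

/* Leave a node: restore the previous entry. No diffing, no accumulation. */
static void
toy_pop(void)
{
	assert(toy_size > 0);
	toy_size--;
	toy_current = (toy_size > 0) ? toy_entries[toy_size - 1] : &toy_top;
}

/* Hot path: a buffer hit is a single increment on the current entry. */
static void
toy_count_hit(void)
{
	toy_current->blks_hit++;
}

/* Finalization step: fold a child's totals into its parent (dst += add). */
static void
toy_accum(ToyInstr *dst, const ToyInstr *add)
{
	dst->blks_hit += add->blks_hit;
}

/* Aggregate over Seq Scan: one hit in the parent, two in the child. */
static long
toy_demo(void)
{
	ToyInstr	agg = {0};
	ToyInstr	scan = {0};

	toy_push(&agg);
	toy_count_hit();		/* charged to agg only */
	toy_push(&scan);
	toy_count_hit();		/* charged to scan only */
	toy_count_hit();
	toy_pop();
	toy_pop();

	/* Bottom-up finalization, child first, then the parent into the top. */
	toy_accum(&agg, &scan);
	toy_accum(&toy_top, &agg);

	return agg.blks_hit;
}
```

The point of the sketch is that entering and leaving a node only moves a pointer, while the per-node totals become inclusive of their children once, at finalization, instead of being diffed against global counters on every exit.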
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6a4a08ebb0c..d80c9ff2d41 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,30 +16,35 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
-BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
-static WalUsage save_pgWalUsage;
+Instrumentation instr_top;
+InstrStackState instr_stack = {0, 0, NULL, &instr_top};
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
-static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-
-
-/* General purpose instrumentation handling */
-Instrumentation *
-InstrAlloc(int instrument_options)
+void
+InstrStackGrow(void)
{
- Instrumentation *instr = palloc0(sizeof(Instrumentation));
- InstrInitOptions(instr, instrument_options);
- return instr;
+ if (instr_stack.entries == NULL)
+ {
+ instr_stack.stack_space = 10; /* Allocate sufficient initial space
+ * for typical activity */
+ instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
+ sizeof(Instrumentation *) * instr_stack.stack_space);
+ }
+ else
+ {
+ instr_stack.stack_space *= 2;
+ instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, instr_stack.stack_space);
+ }
}
+/* General purpose instrumentation handling */
void
InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
+ instr->need_stack = (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
@@ -54,38 +59,257 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
+static void
+InstrStopTimer(Instrumentation *instr)
+{
+ instr_time endtime;
+
+ /* callers check need_timer; here we only verify the timer was started */
+ if (INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStop called without start");
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
}
void
InstrStop(Instrumentation *instr)
{
- instr_time endtime;
+ if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ if (instr->need_stack)
+ InstrPopStack(instr);
+}
+
+/*
+ * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
+ *
+ * Note that this intentionally allows passing a stack entry that is not the
+ * current top, as can happen with PG_FINALLY or resource owners, which don't
+ * have a guaranteed cleanup order.
+ *
+ * We are careful here to achieve two goals:
+ *
+ * 1) Reset the stack to the parent of whichever of the released stack entries
+ * has the lowest index
+ * 2) Accumulate all instrumentation to the currently active instrumentation,
+ * so that callers get a complete picture of activity, even after an abort
+ */
+void
+InstrStopFinalize(Instrumentation *instr)
+{
+ int idx = -1;
+
+ for (int i = instr_stack.stack_size - 1; i >= 0; i--)
+ {
+ if (instr_stack.entries[i] == instr)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ if (idx >= 0)
+ {
+ while (instr_stack.stack_size > idx + 1)
+ instr_stack.stack_size--;
+
+ InstrPopStack(instr);
+ }
- /* let's update the time only if the timer was requested */
if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ InstrAccumStack(instr_stack.current, instr);
+}
+
+
+/* Query instrumentation handling */
+
+/*
+ * Use ResourceOwner mechanism to correctly reset instr_stack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+ .release_priority = RELEASE_PRIO_INSTRUMENTATION,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ QueryInstrumentation *qinstr = (QueryInstrumentation *) DatumGetPointer(res);
+ dlist_mutable_iter iter;
+
+ /* Accumulate data from all unfinalized child node entries. */
+ dlist_foreach_modify(iter, &qinstr->unfinalized_children)
{
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ NodeInstrumentation *child = dlist_container(NodeInstrumentation, unfinalized_node, iter.cur);
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+ InstrAccumStack(&qinstr->instr, &child->instr);
- INSTR_TIME_SET_ZERO(instr->starttime);
+ /*
+ * Free NodeInstrumentation now, since InstrFinalizeNode won't be
+ * called
+ */
+ pfree(child);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
+ /* Ensure the stack is reset as expected, and we accumulate to the parent */
+ InstrStopFinalize(&qinstr->instr);
+
+ /* Free QueryInstrumentation now, since InstrQueryStopFinalize won't be called */
+ pfree(qinstr);
+}
+
+QueryInstrumentation *
+InstrQueryAlloc(int instrument_options)
+{
+ QueryInstrumentation *instr;
+
+ /*
+ * If needed, allocate in TopMemoryContext so that the Instrumentation
+ * survives transaction abort, because ResourceOwner release needs to
+ * access it.
+ */
+ if ((instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) != 0)
+ instr = MemoryContextAllocZero(TopMemoryContext, sizeof(QueryInstrumentation));
+ else
+ instr = palloc0(sizeof(QueryInstrumentation));
+
+ InstrInitOptions(&instr->instr, instrument_options);
+ dlist_init(&instr->unfinalized_children);
+
+ return instr;
+}
+
+void
+InstrQueryStart(QueryInstrumentation *qinstr)
+{
+ InstrStart(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(CurrentResourceOwner != NULL);
+ qinstr->owner = CurrentResourceOwner;
+
+ ResourceOwnerEnlarge(qinstr->owner);
+ ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
+ }
+}
+
+void
+InstrQueryStop(QueryInstrumentation *qinstr)
+{
+ InstrStop(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+ }
+}
+
+QueryInstrumentation *
+InstrQueryStopFinalize(QueryInstrumentation *qinstr)
+{
+ QueryInstrumentation *copy;
+
+ InstrStopFinalize(&qinstr->instr);
+
+ if (!qinstr->instr.need_stack)
+ return qinstr;
+
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+
+ /*
+ * Copy to the current memory context so the caller doesn't need to
+ * explicitly free the TopMemoryContext allocation.
+ */
+ copy = palloc(sizeof(QueryInstrumentation));
+ memcpy(copy, qinstr, sizeof(QueryInstrumentation));
+ pfree(qinstr);
+ return copy;
+}
+
+/*
+ * Register a child NodeInstrumentation entry for abort processing.
+ *
+ * On abort, ResOwnerReleaseInstrumentation will walk the parent's list to
+ * recover buffer/WAL data from entries that were never finalized, in order for
+ * aggregate totals to be accurate despite the query erroring out.
+ */
+void
+InstrQueryRememberNode(QueryInstrumentation *parent, NodeInstrumentation *child)
+{
+ if (child->instr.need_stack)
+ dlist_push_head(&parent->unfinalized_children, &child->unfinalized_node);
+}
+
+/* start instrumentation during parallel executor startup */
+QueryInstrumentation *
+InstrStartParallelQuery(void)
+{
+ QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrQueryStart(qinstr);
+ return qinstr;
+}
+
+/* report usage after parallel executor shutdown */
+void
+InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+{
+ qinstr = InstrQueryStopFinalize(qinstr);
+ memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+}
+
+/*
+ * Accumulate work done by parallel workers in the leader's stats.
+ *
+ * Note that what gets added here effectively depends on whether per-node
+ * instrumentation is active. If it's active the parallel worker intentionally
+ * skips ExecFinalizeNodeInstrumentation on executor shutdown, because it would
+ * cause double counting. Instead, this only accumulates any extra activity
+ * outside of nodes.
+ *
+ * Otherwise this is responsible for making sure that the complete query
+ * activity is accumulated.
+ */
+void
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+{
+ BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
+ WalUsageAdd(&instr_stack.current->walusage, walusage);
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ WalUsageAdd(&pgWalUsage, walusage);
}
/* Node instrumentation handling */
@@ -94,7 +318,19 @@ InstrStop(Instrumentation *instr)
NodeInstrumentation *
InstrAllocNode(int instrument_options, bool async_mode)
{
- NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+ NodeInstrumentation *instr;
+
+ /*
+ * If needed, allocate in a context that supports stack-based
+ * instrumentation abort handling. We can utilize TopTransactionContext
+ * instead of TopMemoryContext here because nodes don't get used for
+ * utility commands that restart transactions, which would require a
+ * context that survives longer (EXPLAIN ANALYZE is fine).
+ */
+ if ((instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) != 0)
+ instr = MemoryContextAlloc(TopTransactionContext, sizeof(NodeInstrumentation));
+ else
+ instr = palloc(sizeof(NodeInstrumentation));
InstrInitNode(instr, instrument_options);
instr->async_mode = async_mode;
@@ -117,6 +353,7 @@ InstrStartNode(NodeInstrumentation *instr)
InstrStart(&instr->instr);
}
+
/* Exit from a plan node */
void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
@@ -146,14 +383,12 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->instr.starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->instr.need_bufusage)
- BufferUsageAccumDiff(&instr->instr.bufusage,
- &pgBufferUsage, &instr->instr.bufusage_start);
-
- if (instr->instr.need_walusage)
- WalUsageAccumDiff(&instr->instr.walusage,
- &pgWalUsage, &instr->instr.walusage_start);
+ /*
+ * Only pop the stack here; accumulation runs later in
+ * ExecFinalizeNodeInstrumentation
+ */
+ if (instr->instr.need_stack)
+ InstrPopStack(&instr->instr);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -172,6 +407,31 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
}
}
+NodeInstrumentation *
+InstrFinalizeNode(NodeInstrumentation *instr, Instrumentation *parent)
+{
+ NodeInstrumentation *dst;
+
+ /* If we didn't use stack-based instrumentation, nothing to be done */
+ if (!instr->instr.need_stack)
+ return instr;
+
+ /* Copy into per-query memory context */
+ dst = palloc(sizeof(NodeInstrumentation));
+ memcpy(dst, instr, sizeof(NodeInstrumentation));
+
+ /* Accumulate node's buffer/WAL usage to the parent */
+ InstrAccumStack(parent, &dst->instr);
+
+ /* Unregister from query's unfinalized list before freeing */
+ if (instr->instr.need_stack)
+ dlist_delete(&instr->unfinalized_node);
+
+ pfree(instr);
+
+ return dst;
+}
+
/* Update tuple count */
void
InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
@@ -188,8 +448,8 @@ InstrEndLoop(NodeInstrumentation *instr)
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrEndLoop called on running node");
+ /* Ensure InstrStopNode was called */
+ Assert(INSTR_TIME_IS_ZERO(instr->instr.starttime));
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
@@ -230,11 +490,8 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->instr.need_bufusage)
- BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
-
- if (dst->instr.need_walusage)
- WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
+ if (dst->instr.need_stack)
+ InstrAccumStack(&dst->instr, &add->instr);
}
/* Trigger instrumentation handling */
@@ -259,38 +516,27 @@ InstrStartTrigger(TriggerInstrumentation *tginstr)
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
+ /*
+ * This trigger may be called again, so we don't finalize instrumentation
+ * here. Accumulation to the parent happens at ExecutorFinish through
+ * ExecFinalizeTriggerInstrumentation.
+ */
InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
-/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrAccumStack(Instrumentation *dst, Instrumentation *add)
{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
-}
+ Assert(dst != NULL);
+ Assert(add != NULL);
-/* report usage after parallel executor shutdown */
-void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
-}
-
-/* accumulate work done by workers in leader's stats */
-void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- BufferUsageAdd(&pgBufferUsage, bufusage);
- WalUsageAdd(&pgWalUsage, walusage);
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -311,39 +557,9 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
+/* dst += add */
void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
-/* helper functions for WAL usage accumulation */
-static void
-WalUsageAdd(WalUsage *dst, WalUsage *add)
+WalUsageAdd(WalUsage *dst, const WalUsage *add)
{
dst->wal_bytes += add->wal_bytes;
dst->wal_records += add->wal_records;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dfa37e5ed44..41a0baa3449 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1269,9 +1269,9 @@ PinBufferForBlock(Relation rel,
if (rel)
{
/*
- * While pgBufferUsage's "read" counter isn't bumped unless we reach
- * WaitReadBuffers() (so, not for hits, and not for buffers that are
- * zeroed instead), the per-relation stats always count them.
+ * While the current buffer usage "read" counter isn't bumped unless
+ * we reach WaitReadBuffers() (so, not for hits, and not for buffers
+ * that are zeroed instead), the per-relation stats always count them.
*/
pgstat_count_buffer_read(rel);
if (*foundPtr)
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 9e7a88ec0d0..60400f0c81f 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -114,9 +114,9 @@ pgstat_prepare_io_time(bool track_io_guc)
* pg_stat_database only counts block read and write times, these are done for
* IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
*
- * pgBufferUsage is used for EXPLAIN. pgBufferUsage has write and read stats
- * for shared, local and temporary blocks. pg_stat_io does not track the
- * activity of temporary blocks, so these are ignored here.
+ * Executor instrumentation is used for EXPLAIN. Buffer usage tracked there has
+ * write and read stats for shared, local and temporary blocks. pg_stat_io
+ * does not track the activity of temporary blocks, so these are ignored here.
*/
void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index d3a57242844..340029a2034 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -51,8 +51,8 @@ typedef struct QueryDesc
/* This field is set by ExecutePlan */
bool already_executed; /* true if previously executed */
- /* This is always set NULL by the core system, but plugins can change it */
- struct Instrumentation *totaltime; /* total time spent in ExecutorRun */
+ /* This field is set by ExecutorRun, or plugins */
+ struct QueryInstrumentation *totaltime; /* total time spent in ExecutorRun */
} QueryDesc;
/* in pquery.c */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 064df01811e..2f0d16db586 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -301,6 +301,8 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent);
+extern void ExecFinalizeNodeInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1139be8333e..b92cc65f159 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -13,6 +13,7 @@
#ifndef INSTRUMENT_H
#define INSTRUMENT_H
+#include "lib/ilist.h"
#include "portability/instr_time.h"
@@ -68,29 +69,76 @@ typedef enum InstrumentOption
} InstrumentOption;
/*
- * General purpose instrumentation that can capture time and WAL/buffer usage
+ * Instrumentation base class for capturing time and WAL/buffer usage
*
- * Initialized through InstrAlloc, followed by one or more calls to a pair of
- * InstrStart/InstrStop (activity is measured inbetween).
+ * If used directly:
+ * - Allocate on the stack and zero initialize the struct
+ * - Call InstrInitOptions to set instrumentation options
+ * - Call InstrStart before the activity you want to measure
+ * - Call InstrStop / InstrStopFinalize after the activity to capture totals
+ *
+ * InstrStart/InstrStop may be called multiple times. The last stop call must
+ * be to InstrStopFinalize to ensure parent stack entries get the accumulated
+ * totals. If there is risk of transaction aborts you must call
+ * InstrStopFinalize in a PG_TRY/PG_FINALLY block to avoid corrupting the
+ * instrumentation stack.
+ *
+ * In a query context use QueryInstrumentation instead, which handles aborts
+ * using the resource owner logic.
*/
typedef struct Instrumentation
{
/* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
- bool need_bufusage; /* true if we need buffer usage data */
- bool need_walusage; /* true if we need WAL usage data */
+ bool need_stack; /* true if we need WAL/buffer usage data */
/* Internal state keeping: */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
} Instrumentation;
+/*
+ * Query-related instrumentation tracking.
+ *
+ * Usage:
+ * - Allocate on the heap using InstrQueryAlloc (required for abort handling)
+ * - Call InstrQueryStart before the activity you want to measure
+ * - Call InstrQueryStop / InstrQueryStopFinalize afterwards to capture totals
+ *
+ * InstrQueryStart/InstrQueryStop may be called multiple times. The last stop
+ * call must be to InstrQueryStopFinalize to ensure parent stack entries get
+ * the accumulated totals.
+ *
+ * Uses the resource owner mechanism for handling aborts. As such, the caller
+ * *must not* exit out of the top level transaction after having called
+ * InstrQueryStart without first calling InstrQueryStop or
+ * InstrQueryStopFinalize. In the case of a transaction abort, logic equivalent
+ * to InstrQueryStopFinalize will be called automatically.
+ */
+struct ResourceOwnerData;
+typedef struct QueryInstrumentation
+{
+ Instrumentation instr;
+
+ /* Resource owner used for cleanup for aborts between InstrStart/InstrStop */
+ struct ResourceOwnerData *owner;
+
+ /*
+ * NodeInstrumentation child entries that need to be cleaned up on abort,
+ * since they are not registered as a resource owner themselves.
+ */
+ dlist_head unfinalized_children; /* head of unfinalized children list */
+} QueryInstrumentation;
+
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Relies on an outer QueryInstrumentation having been set up to handle the
+ * stack used for WAL/buffer usage statistics, and relies on it for managing
+ * aborts. Solely intended for the executor and anyone reporting about its
+ * activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -109,8 +157,15 @@ typedef struct NodeInstrumentation
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
+
+ /* Abort handling */
+ dlist_node unfinalized_node; /* node in parent's unfinalized list */
} NodeInstrumentation;
+/*
+ * Care must be taken with any pointers contained within this struct, as this
+ * gets copied across processes during parallel query execution.
+ */
typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
@@ -124,19 +179,106 @@ typedef struct TriggerInstrumentation
* was fired */
} TriggerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
+/*
+ * Dynamic array-based stack for tracking current WAL/buffer usage context.
+ *
+ * When the stack is empty, 'current' points to instr_top which accumulates
+ * session-level totals.
+ */
+typedef struct InstrStackState
+{
+ int stack_space; /* allocated capacity of entries array */
+ int stack_size; /* current number of entries */
+
+ Instrumentation **entries; /* dynamic array of pointers */
+ Instrumentation *current; /* top of stack, or &instr_top when empty */
+} InstrStackState;
+
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int instrument_options);
+/*
+ * The top instrumentation represents a running total of the current backend
+ * WAL/buffer usage information. This will not be updated immediately, but
+ * rather when the current stack entry gets accumulated which typically happens
+ * at query end.
+ *
+ * Care must be taken when utilizing this in the parallel worker context:
+ * Parallel workers will report back their instrumentation to the caller,
+ * and this gets added to the caller's stack. If this were to be used in the
+ * shared memory stats infrastructure it would need to be skipped on parallel
+ * workers to avoid double counting.
+ */
+extern PGDLLIMPORT Instrumentation instr_top;
+
+/*
+ * The instrumentation stack state. The 'current' field points to the
+ * currently active stack entry that is getting updated as activity happens,
+ * and will be accumulated to parent stacks when it gets finalized by
+ * InstrStop (for non-executor use cases), ExecFinalizeNodeInstrumentation
+ * (executor finish) or ResOwnerReleaseInstrumentation on abort.
+ */
+extern PGDLLIMPORT InstrStackState instr_stack;
+
+extern void InstrStackGrow(void);
+
+/*
+ * Pushes the stack so that all WAL/buffer usage updates go to the passed in
+ * instrumentation entry.
+ *
+ * See note on InstrPopStack regarding safe use of these functions.
+ */
+static inline void
+InstrPushStack(Instrumentation *instr)
+{
+ if (unlikely(instr_stack.stack_size == instr_stack.stack_space))
+ InstrStackGrow();
+
+ instr_stack.entries[instr_stack.stack_size++] = instr;
+ instr_stack.current = instr;
+}
+
+/*
+ * Pops the stack entry back to the previous one that was effective at
+ * InstrPushStack.
+ *
+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, consider instead using InstrStart + InstrStopFinalize which can skip
+ * intermediate stack entries.
+ */
+static inline void
+InstrPopStack(Instrumentation *instr)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.entries[instr_stack.stack_size - 1] == instr);
+ instr_stack.stack_size--;
+ instr_stack.current = instr_stack.stack_size > 0
+ ? instr_stack.entries[instr_stack.stack_size - 1]
+ : &instr_top;
+}
+
extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
extern void InstrStart(Instrumentation *instr);
extern void InstrStop(Instrumentation *instr);
+extern void InstrStopFinalize(Instrumentation *instr);
+extern void InstrAccumStack(Instrumentation *dst, Instrumentation *add);
+
+extern QueryInstrumentation *InstrQueryAlloc(int instrument_options);
+extern void InstrQueryStart(QueryInstrumentation *instr);
+extern void InstrQueryStop(QueryInstrumentation *instr);
+extern QueryInstrumentation *InstrQueryStopFinalize(QueryInstrumentation *instr);
+extern void InstrQueryRememberNode(QueryInstrumentation *parent, NodeInstrumentation *instr);
+
+pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern NodeInstrumentation *InstrAllocNode(int instrument_options,
bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
extern void InstrStartNode(NodeInstrumentation *instr);
extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern NodeInstrumentation *InstrFinalizeNode(NodeInstrumentation *instr, Instrumentation *parent);
extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
@@ -145,31 +287,31 @@ extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+extern void WalUsageAdd(WalUsage *dst, const WalUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
+ instr_stack.current->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += val; \
+ instr_stack.current->bufusage.fld += val; \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ INSTR_TIME_ADD(instr_stack.current->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ INSTR_TIME_ACCUM_DIFF(instr_stack.current->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ instr_stack.current->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += val; \
+ instr_stack.current->walusage.fld += val; \
} while(0)
#endif /* INSTRUMENT_H */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index eb6033b4fdb..5463bc921f0 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -75,6 +75,7 @@ typedef uint32 ResourceReleasePriority;
#define RELEASE_PRIO_SNAPSHOT_REFS 500
#define RELEASE_PRIO_FILES 600
#define RELEASE_PRIO_WAITEVENTSETS 700
+#define RELEASE_PRIO_INSTRUMENTATION 800
/* 0 is considered invalid */
#define RELEASE_PRIO_FIRST 1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index caa0caef324..6ea86c2eaac 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1335,6 +1335,7 @@ InjectionPointSharedState
InjectionPointsCtl
InlineCodeBlock
InsertStmt
+InstrStackState
Instrumentation
Int128AggState
Int8TransTypeData
@@ -2450,6 +2451,7 @@ QueryCompletion
QueryDesc
QueryEnvironment
QueryInfo
+QueryInstrumentation
QueryItem
QueryItemType
QueryMode
--
2.47.1
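
To make the push/pop scheme in the patch above easier to follow, here is a minimal standalone sketch of the same idea. It is not the patch's code: DemoInstr, DemoStack, demo_push, demo_pop, demo_accum and DEMO_INCR are hypothetical stand-ins for Instrumentation, InstrStackState, InstrPushStack, InstrPopStack, InstrAccumStack and INSTR_BUFUSAGE_INCR, a single counter stands in for the full BufferUsage/WalUsage structs, and plain realloc is assumed in place of whatever allocation InstrStackGrow actually uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the patch's Instrumentation struct. */
typedef struct DemoInstr
{
    long        blks_hit;       /* stands in for bufusage.shared_blks_hit */
} DemoInstr;

/* Hypothetical stand-in for InstrStackState. */
typedef struct DemoStack
{
    int         space;          /* allocated capacity of entries array */
    int         size;           /* current number of entries */
    DemoInstr **entries;        /* dynamic array of pointers */
    DemoInstr  *current;        /* top of stack, or &demo_top when empty */
} DemoStack;

static DemoInstr demo_top;      /* running total when the stack is empty */
static DemoStack demo_stack = {0, 0, NULL, &demo_top};

/* Grow the entries array (assumption: doubling, starting at 8 slots). */
static void
demo_grow(void)
{
    int         newspace = demo_stack.space ? demo_stack.space * 2 : 8;
    DemoInstr **p = realloc(demo_stack.entries, newspace * sizeof(DemoInstr *));

    if (p == NULL)
        abort();
    demo_stack.entries = p;
    demo_stack.space = newspace;
}

/* Route all subsequent counter updates to 'instr'. */
static void
demo_push(DemoInstr *instr)
{
    if (demo_stack.size == demo_stack.space)
        demo_grow();
    demo_stack.entries[demo_stack.size++] = instr;
    demo_stack.current = instr;
}

/* Restore the entry that was active before the matching demo_push. */
static void
demo_pop(DemoInstr *instr)
{
    assert(demo_stack.size > 0);
    assert(demo_stack.entries[demo_stack.size - 1] == instr);
    demo_stack.size--;
    demo_stack.current = demo_stack.size > 0
        ? demo_stack.entries[demo_stack.size - 1]
        : &demo_top;
}

/* Counter updates only ever touch the current stack entry. */
#define DEMO_INCR(fld) do { \
        demo_stack.current->fld++; \
    } while (0)

/* dst += add, used when a finished child is folded into its parent. */
static void
demo_accum(DemoInstr *dst, const DemoInstr *add)
{
    dst->blks_hit += add->blks_hit;
}
```

With this layout, activity recorded while a child entry is on top is invisible to the parent until the child is accumulated into it, which is why the patch requires a finalize step before the parent's totals are read.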
[application/octet-stream] v9-0003-instrumentation-Replace-direct-changes-of-pgBuffe.patch (9.9K, 6-v9-0003-instrumentation-Replace-direct-changes-of-pgBuffe.patch)
From 8a9763f3be3db95d521c55e206af75cef04bcbd4 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 12:12:39 -0800
Subject: [PATCH v9 3/9] instrumentation: Replace direct changes of
pgBufferUsage/pgWalUsage with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
---
src/backend/access/transam/xlog.c | 10 +++++-----
src/backend/executor/instrument.c | 1 -
src/backend/storage/buffer/bufmgr.c | 24 ++++++++++++------------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
7 files changed, 47 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f5c9a34374d..9b33584f454 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1081,10 +1081,10 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
- pgWalUsage.wal_fpi_bytes += fpi_bytes;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
+ INSTR_WALUSAGE_ADD(wal_fpi_bytes, fpi_bytes);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2063,7 +2063,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = InvalidXLogRecPtr;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index bc551f95a08..6a4a08ebb0c 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -54,7 +54,6 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- /* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
instr->bufusage_start = pgBufferUsage;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 00bc609529a..dfa37e5ed44 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -835,7 +835,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -856,7 +856,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
if (BufferTagsEqual(&tag, &bufHdr->tag))
{
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
UnpinBuffer(bufHdr);
@@ -1257,14 +1257,14 @@ PinBufferForBlock(Relation rel,
{
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
if (*foundPtr)
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
}
else
{
bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
strategy, foundPtr, io_context);
if (*foundPtr)
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
}
if (rel)
{
@@ -1998,9 +1998,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
if (operation->rel)
pgstat_count_buffer_hit(operation->rel);
@@ -2068,9 +2068,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it.
@@ -2959,7 +2959,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -3105,7 +3105,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4520,7 +4520,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state.
@@ -5663,7 +5663,7 @@ MarkSharedBufferDirtyHint(Buffer buffer, BufferDesc *bufHdr, uint64 lockstate,
UnlockBufHdr(bufHdr);
}
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 404c6bccbdd..8845b0aeed6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -217,7 +217,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -478,7 +478,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -509,7 +509,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u64(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c4afe4d368a..8b501dfcadd 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -475,13 +475,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -549,13 +549,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 28de24538dc..9e7a88ec0d0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b11d64633b5..1139be8333e 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -153,4 +153,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += val; \
+ } while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += val; \
+ } while(0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
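
The point of the macro layer in the patch above is that call sites name only the counter field, not the storage behind it, so a later patch can redirect the updates without touching callers. A small standalone sketch of that pattern follows. The names DemoUsage, demo_usage, DEMO_USAGE_INCR, DEMO_USAGE_ADD and demo_count_access are hypothetical, and parenthesizing `val` is a choice made in this sketch rather than something taken from the patch.

```c
#include <assert.h>

/* Hypothetical stand-in for BufferUsage with two counters. */
typedef struct DemoUsage
{
    long        blks_hit;
    long        blks_read;
} DemoUsage;

/* Hypothetical global, playing the role of pgBufferUsage. */
static DemoUsage demo_usage;

/*
 * Call sites pass only the field name. Changing where the counters live
 * later means editing these two macros, not every caller.
 */
#define DEMO_USAGE_INCR(fld) do { \
        demo_usage.fld++; \
    } while (0)
#define DEMO_USAGE_ADD(fld, val) do { \
        demo_usage.fld += (val); \
    } while (0)

/*
 * The do { } while (0) wrapper makes each macro behave as one statement,
 * so it is safe in an unbraced if/else like this one.
 */
static void
demo_count_access(int found, int nblocks)
{
    if (found)
        DEMO_USAGE_INCR(blks_hit);
    else
        DEMO_USAGE_ADD(blks_read, nblocks);
}
```

A caller such as `demo_count_access(0, 4)` adds four to `blks_read` without ever referring to `demo_usage` directly.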
[application/octet-stream] v9-0006-instrumentation-Use-Instrumentation-struct-for-pa.patch (29.1K, 7-v9-0006-instrumentation-Use-Instrumentation-struct-for-pa.patch)
From 6ebb1389d94b2c1df8a1e7962d42aea372fa124b Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 15 Mar 2026 21:44:58 -0700
Subject: [PATCH v9 6/9] instrumentation: Use Instrumentation struct for
parallel workers
This simplifies the DSM allocations a bit, since we no longer need to
allocate WAL and buffer usage separately, and makes it easier to add a
third stack-based struct that is currently under discussion.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/access/brin/brin.c | 43 ++++++-----------
src/backend/access/gin/gininsert.c | 43 ++++++-----------
src/backend/access/nbtree/nbtsort.c | 43 ++++++-----------
src/backend/commands/vacuumparallel.c | 52 ++++++++-------------
src/backend/executor/execParallel.c | 66 ++++++++++++---------------
src/backend/executor/instrument.c | 13 +++---
src/include/executor/execParallel.h | 5 +-
src/include/executor/instrument.h | 4 +-
8 files changed, 98 insertions(+), 171 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1ceb2306954..1c95ec9f605 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -51,8 +51,7 @@
#define PARALLEL_KEY_BRIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -148,8 +147,7 @@ typedef struct BrinLeader
BrinShared *brinshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BrinLeader;
/*
@@ -2387,8 +2385,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BrinShared *brinshared;
Sharedsort *sharedsort;
BrinLeader *brinleader = palloc0_object(BrinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -2430,18 +2427,14 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -2514,15 +2507,12 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -2533,8 +2523,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
brinleader->snapshot = snapshot;
- brinleader->walusage = walusage;
- brinleader->bufferusage = bufferusage;
+ brinleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -2573,7 +2562,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2887,8 +2876,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2949,11 +2937,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 04cd53916ca..51bb098a2a2 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -45,8 +45,7 @@
#define PARALLEL_KEY_GIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -138,8 +137,7 @@ typedef struct GinLeader
GinBuildShared *ginshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} GinLeader;
typedef struct
@@ -945,8 +943,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
GinBuildShared *ginshared;
Sharedsort *sharedsort;
GinLeader *ginleader = palloc0_object(GinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -987,18 +984,14 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1066,15 +1059,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1085,8 +1075,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
ginleader->ginshared = ginshared;
ginleader->sharedsort = sharedsort;
ginleader->snapshot = snapshot;
- ginleader->walusage = walusage;
- ginleader->bufferusage = bufferusage;
+ ginleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1125,7 +1114,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2118,8 +2107,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2199,11 +2187,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 6a261c8dcbd..504b34cc906 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -66,8 +66,7 @@
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
#define PARALLEL_KEY_TUPLESORT_SPOOL2 UINT64CONST(0xA000000000000003)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000004)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xA000000000000005)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xA000000000000006)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xA000000000000005)
/*
* DISABLE_LEADER_PARTICIPATION disables the leader's participation in
@@ -195,8 +194,7 @@ typedef struct BTLeader
Sharedsort *sharedsort;
Sharedsort *sharedsort2;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BTLeader;
/*
@@ -1408,8 +1406,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
Sharedsort *sharedsort2;
BTSpool *btspool = buildstate->spool;
BTLeader *btleader = palloc0_object(BTLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -1462,18 +1459,14 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1560,15 +1553,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1580,8 +1570,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btleader->sharedsort = sharedsort;
btleader->sharedsort2 = sharedsort2;
btleader->snapshot = snapshot;
- btleader->walusage = walusage;
- btleader->bufferusage = bufferusage;
+ btleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1620,7 +1609,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1754,8 +1743,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
#ifdef BTREE_BUILD_STATS
@@ -1837,11 +1825,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
sharedsort2, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 75074fe4efa..753dd965d78 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -47,9 +47,8 @@
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 2
-#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 3
-#define PARALLEL_VACUUM_KEY_WAL_USAGE 4
-#define PARALLEL_VACUUM_KEY_INDEX_STATS 5
+#define PARALLEL_VACUUM_KEY_INSTRUMENTATION 3
+#define PARALLEL_VACUUM_KEY_INDEX_STATS 4
/*
* Shared information among parallel workers. So this is allocated in the DSM
@@ -188,11 +187,8 @@ struct ParallelVacuumState
/* Shared dead items space among parallel vacuum workers */
TidStore *dead_items;
- /* Points to buffer usage area in DSM */
- BufferUsage *buffer_usage;
-
- /* Points to WAL usage area in DSM */
- WalUsage *wal_usage;
+ /* Points to instrumentation area in DSM */
+ Instrumentation *instr;
/*
* False if the index is totally unsuitable target for all parallel
@@ -250,8 +246,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PVShared *shared;
TidStore *dead_items;
PVIndStats *indstats;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *instr;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
@@ -304,18 +299,15 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage and WalUsage --
- * PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
+ * Estimate space for Instrumentation --
+ * PARALLEL_VACUUM_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
@@ -396,17 +388,13 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
pvs->shared = shared;
/*
- * Allocate space for each worker's BufferUsage and WalUsage; no need to
- * initialize
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
*/
- buffer_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, buffer_usage);
- pvs->buffer_usage = buffer_usage;
- wal_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
- pvs->wal_usage = wal_usage;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, instr);
+ pvs->instr = instr;
/* Store query string for workers */
if (debug_query_string)
@@ -738,7 +726,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->instr[i]);
}
/*
@@ -996,8 +984,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVShared *shared;
TidStore *dead_items;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *worker_instr;
int nindexes;
char *sharedquery;
ErrorContextCallback errcallback;
@@ -1091,11 +1078,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
parallel_vacuum_process_safe_indexes(&pvs);
/* Report buffer/WAL usage during parallel execution */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 73534fa6c7e..ebab6bc1652 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -60,13 +60,12 @@
#define PARALLEL_KEY_EXECUTOR_FIXED UINT64CONST(0xE000000000000001)
#define PARALLEL_KEY_PLANNEDSTMT UINT64CONST(0xE000000000000002)
#define PARALLEL_KEY_PARAMLISTINFO UINT64CONST(0xE000000000000003)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xE000000000000004)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000004)
#define PARALLEL_KEY_TUPLE_QUEUE UINT64CONST(0xE000000000000005)
-#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000006)
+#define PARALLEL_KEY_NODE_INSTRUMENTATION UINT64CONST(0xE000000000000006)
#define PARALLEL_KEY_DSA UINT64CONST(0xE000000000000007)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -625,8 +624,6 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_data;
char *pstmt_space;
char *paramlistinfo_space;
- BufferUsage *bufusage_space;
- WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
@@ -690,21 +687,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage.
+ * Estimate space for Instrumentation.
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
* looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
-
- /*
- * Same thing for WalUsage.
- */
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Estimate space for tuple queues. */
@@ -790,17 +780,18 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMLISTINFO, paramlistinfo_space);
SerializeParamList(estate->es_param_list_info, &paramlistinfo_space);
- /* Allocate space for each worker's BufferUsage; no need to initialize. */
- bufusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufusage_space);
- pei->buffer_usage = bufusage_space;
+ /*
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
+ */
+ {
+ Instrumentation *instr;
- /* Same for WalUsage. */
- walusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
- pei->wal_usage = walusage_space;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
+ pei->instrumentation = instr;
+ }
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -826,9 +817,9 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
InstrInitNode(&instrument[i], estate->es_instrument);
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_NODE_INSTRUMENTATION,
instrumentation);
- pei->instrumentation = instrumentation;
+ pei->node_instrumentation = instrumentation;
if (estate->es_jit_flags != PGJIT_NONE)
{
@@ -1230,7 +1221,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
* finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->instrumentation[i]);
pei->finished = true;
}
@@ -1244,10 +1235,10 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
- /* Accumulate instrumentation, if any. */
- if (pei->instrumentation)
+ /* Accumulate node instrumentation, if any. */
+ if (pei->node_instrumentation)
ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
+ pei->node_instrumentation);
/* Accumulate JIT instrumentation, if any. */
if (pei->jit_instrumentation)
@@ -1471,8 +1462,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
@@ -1487,7 +1476,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Set up DestReceiver, SharedExecutorInstrumentation, and QueryDesc. */
receiver = ExecParallelGetReceiver(seg, toc);
- instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
+ instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_NODE_INSTRUMENTATION, true);
if (instrumentation != NULL)
instrument_options = instrumentation->instrument_options;
jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
@@ -1545,11 +1534,12 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecutorFinish(queryDesc);
/* Report buffer/WAL usage during parallel execution. */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ {
+ Instrumentation *worker_instr;
+
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
+ }
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index d80c9ff2d41..e7528c55f0b 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -284,11 +284,11 @@ InstrStartParallelQuery(void)
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst)
{
qinstr = InstrQueryStopFinalize(qinstr);
- memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
- memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+ memcpy(&dst->bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(&dst->walusage, &qinstr->instr.walusage, sizeof(WalUsage));
}
/*
@@ -304,12 +304,11 @@ InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUs
* activity is accumulated.
*/
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(Instrumentation *instr)
{
- BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
- WalUsageAdd(&instr_stack.current->walusage, walusage);
+ InstrAccumStack(instr_stack.current, instr);
- WalUsageAdd(&pgWalUsage, walusage);
+ WalUsageAdd(&pgWalUsage, &instr->walusage);
}
/* Node instrumentation handling */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..6c8b602d07f 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -25,9 +25,8 @@ typedef struct ParallelExecutorInfo
{
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
- BufferUsage *buffer_usage; /* points to bufusage area in DSM */
- WalUsage *wal_usage; /* walusage area in DSM */
- SharedExecutorInstrumentation *instrumentation; /* optional */
+ Instrumentation *instrumentation; /* instrumentation area in DSM */
+ SharedExecutorInstrumentation *node_instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
dsa_pointer param_exec; /* serialized PARAM_EXEC parameters */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b92cc65f159..e567cd691b4 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -270,8 +270,8 @@ extern QueryInstrumentation *InstrQueryStopFinalize(QueryInstrumentation *instr)
extern void InstrQueryRememberNode(QueryInstrumentation *parent, NodeInstrumentation *instr);
pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst);
+extern void InstrAccumParallelQuery(Instrumentation *instr);
extern NodeInstrumentation *InstrAllocNode(int instrument_options,
bool async_mode);
--
2.47.1
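
The change in the patch above replaces the separate per-worker BufferUsage and WalUsage arrays in shared memory with a single per-worker Instrumentation array. A minimal sketch of that pattern follows. The struct definitions are simplified stand-ins (the real PostgreSQL structs carry many more fields), and worker_report() and leader_accumulate() are hypothetical names that only loosely mirror InstrEndParallelQuery() and InstrAccumParallelQuery(). An ordinary array stands in for the DSM area.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-ins, NOT the real PostgreSQL definitions. */
typedef struct BufferUsage
{
	int64_t		shared_blks_hit;
	int64_t		shared_blks_read;
} BufferUsage;

typedef struct WalUsage
{
	int64_t		wal_records;
	uint64_t	wal_bytes;
} WalUsage;

/* One slot per worker now carries both kinds of counters. */
typedef struct Instrumentation
{
	BufferUsage bufusage;
	WalUsage	walusage;
} Instrumentation;

/*
 * Worker side (hypothetical helper): copy the worker's final counters
 * into its own slot of the shared array.
 */
void
worker_report(Instrumentation *slots, int nworkers, int worker_number,
			  const Instrumentation *local)
{
	assert(worker_number >= 0 && worker_number < nworkers);
	memcpy(&slots[worker_number].bufusage, &local->bufusage,
		   sizeof(BufferUsage));
	memcpy(&slots[worker_number].walusage, &local->walusage,
		   sizeof(WalUsage));
}

/*
 * Leader side (hypothetical helper): fold one worker's slot into the
 * leader's running totals.
 */
void
leader_accumulate(Instrumentation *total, const Instrumentation *src)
{
	total->bufusage.shared_blks_hit += src->bufusage.shared_blks_hit;
	total->bufusage.shared_blks_read += src->bufusage.shared_blks_read;
	total->walusage.wal_records += src->walusage.wal_records;
	total->walusage.wal_bytes += src->walusage.wal_bytes;
}
```

With one array there is a single allocation, a single TOC key, and a single pointer to pass around per call site, which is what lets each of the index build, vacuum, and executor paths above drop its second estimate/allocate/lookup pair.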
[application/octet-stream] v9-0009-Add-pg_session_buffer_usage-contrib-module.patch (25.9K, 8-v9-0009-Add-pg_session_buffer_usage-contrib-module.patch)
From 3c88838fe0a93f195275067eaaeabe6b3c489241 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:41 -0800
Subject: [PATCH v9 9/9] Add pg_session_buffer_usage contrib module
This is intended for testing instrumentation-related logic as it pertains
to the top-level stack, which is maintained as a running total. There is
currently no in-core user that utilizes the top-level values in this
manner. Especially in abort situations, this module helps ensure we don't
lose activity due to incorrect handling of unfinalized node stacks.
---
contrib/Makefile | 1 +
contrib/meson.build | 1 +
contrib/pg_session_buffer_usage/Makefile | 23 ++
.../expected/pg_session_buffer_usage.out | 283 ++++++++++++++++++
contrib/pg_session_buffer_usage/meson.build | 34 +++
.../pg_session_buffer_usage--1.0.sql | 31 ++
.../pg_session_buffer_usage.c | 95 ++++++
.../pg_session_buffer_usage.control | 5 +
.../sql/pg_session_buffer_usage.sql | 204 +++++++++++++
9 files changed, 677 insertions(+)
create mode 100644 contrib/pg_session_buffer_usage/Makefile
create mode 100644 contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
create mode 100644 contrib/pg_session_buffer_usage/meson.build
create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
create mode 100644 contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
diff --git a/contrib/Makefile b/contrib/Makefile
index dd04c20acd2..ac04f9eb997 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -36,6 +36,7 @@ SUBDIRS = \
pg_overexplain \
pg_plan_advice \
pg_prewarm \
+ pg_session_buffer_usage \
pg_stat_statements \
pg_surgery \
pg_trgm \
diff --git a/contrib/meson.build b/contrib/meson.build
index 5a752eac347..2b1399e56f3 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -51,6 +51,7 @@ subdir('pg_overexplain')
subdir('pg_plan_advice')
subdir('pg_prewarm')
subdir('pgrowlocks')
+subdir('pg_session_buffer_usage')
subdir('pg_stat_statements')
subdir('pgstattuple')
subdir('pg_surgery')
diff --git a/contrib/pg_session_buffer_usage/Makefile b/contrib/pg_session_buffer_usage/Makefile
new file mode 100644
index 00000000000..75bd8e09b3d
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_session_buffer_usage/Makefile
+
+MODULE_big = pg_session_buffer_usage
+OBJS = \
+ $(WIN32RES) \
+ pg_session_buffer_usage.o
+
+EXTENSION = pg_session_buffer_usage
+DATA = pg_session_buffer_usage--1.0.sql
+PGFILEDESC = "pg_session_buffer_usage - show buffer usage statistics for the current session"
+
+REGRESS = pg_session_buffer_usage
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_session_buffer_usage
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out b/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
new file mode 100644
index 00000000000..242b4003950
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
@@ -0,0 +1,283 @@
+LOAD 'pg_session_buffer_usage';
+CREATE EXTENSION pg_session_buffer_usage;
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM pg_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+ ok
+----
+ t
+(1 row)
+
+-- Verify counters increase after buffer activity
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM pg_session_buffer_usage();
+ blocks_increased
+------------------
+ t
+(1 row)
+
+DROP TABLE test_buf_activity;
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM pg_session_buffer_usage(), dc_serial_result s;
+ leader_buffers_match
+----------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_dc_tab, dc_serial_result;
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+ERROR: division by zero
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM pg_session_buffer_usage();
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+DROP TABLE exc_tab;
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+ERROR: duplicate key value violates unique constraint "unique_tab_a_key"
+DETAIL: Key (a)=(1) already exists.
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM pg_session_buffer_usage();
+ unique_violation_buffers_visible
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE unique_tab;
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT subxact_exc_func();
+ subxact_exc_func
+------------------
+ caught
+(1 row)
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM pg_session_buffer_usage();
+ subxact_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT cursor_exc_func();
+ cursor_exc_func
+-----------------------
+ caught after 250 rows
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM pg_session_buffer_usage();
+ cursor_subxact_buffers_visible
+--------------------------------
+ t
+(1 row)
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CONTEXT: parallel worker
+RESET debug_parallel_query;
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM pg_session_buffer_usage(), par_abort_serial_result s;
+ worker_abort_buffers_not_propagated
+-------------------------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_abort_tab, par_abort_serial_result;
+-- Cleanup
+DROP EXTENSION pg_session_buffer_usage;
diff --git a/contrib/pg_session_buffer_usage/meson.build b/contrib/pg_session_buffer_usage/meson.build
new file mode 100644
index 00000000000..34c7502beb4
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/meson.build
@@ -0,0 +1,34 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+pg_session_buffer_usage_sources = files(
+ 'pg_session_buffer_usage.c',
+)
+
+if host_system == 'windows'
+ pg_session_buffer_usage_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_session_buffer_usage',
+ '--FILEDESC', 'pg_session_buffer_usage - show buffer usage statistics for the current session',])
+endif
+
+pg_session_buffer_usage = shared_module('pg_session_buffer_usage',
+ pg_session_buffer_usage_sources,
+ kwargs: contrib_mod_args,
+)
+contrib_targets += pg_session_buffer_usage
+
+install_data(
+ 'pg_session_buffer_usage--1.0.sql',
+ 'pg_session_buffer_usage.control',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'pg_session_buffer_usage',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'pg_session_buffer_usage',
+ ],
+ },
+}
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql b/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
new file mode 100644
index 00000000000..b300fdbc643
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
@@ -0,0 +1,31 @@
+/* contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_session_buffer_usage" to load this file. \quit
+
+CREATE FUNCTION pg_session_buffer_usage(
+ OUT shared_blks_hit bigint,
+ OUT shared_blks_read bigint,
+ OUT shared_blks_dirtied bigint,
+ OUT shared_blks_written bigint,
+ OUT local_blks_hit bigint,
+ OUT local_blks_read bigint,
+ OUT local_blks_dirtied bigint,
+ OUT local_blks_written bigint,
+ OUT temp_blks_read bigint,
+ OUT temp_blks_written bigint,
+ OUT shared_blk_read_time double precision,
+ OUT shared_blk_write_time double precision,
+ OUT local_blk_read_time double precision,
+ OUT local_blk_write_time double precision,
+ OUT temp_blk_read_time double precision,
+ OUT temp_blk_write_time double precision
+)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_session_buffer_usage'
+LANGUAGE C PARALLEL RESTRICTED;
+
+CREATE FUNCTION pg_session_buffer_usage_reset()
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_session_buffer_usage_reset'
+LANGUAGE C PARALLEL RESTRICTED;
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
new file mode 100644
index 00000000000..f869956b3a9
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_session_buffer_usage.c
+ * show buffer usage statistics for the current session
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/instrument.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "pg_session_buffer_usage",
+ .version = PG_VERSION
+);
+
+#define NUM_BUFFER_USAGE_COLUMNS 16
+
+PG_FUNCTION_INFO_V1(pg_session_buffer_usage);
+PG_FUNCTION_INFO_V1(pg_session_buffer_usage_reset);
+
+#define HAVE_INSTR_STACK 1 /* Change to 0 when testing before stack
+ * change */
+
+/*
+ * SQL function: pg_session_buffer_usage()
+ *
+ * Returns a single row with all BufferUsage counters accumulated since the
+ * start of the session or the last reset. Excludes any usage not yet added
+ * to the top of the stack (e.g. if this gets called inside a statement that
+ * also had buffer activity).
+ */
+Datum
+pg_session_buffer_usage(PG_FUNCTION_ARGS)
+{
+ TupleDesc tupdesc;
+ Datum values[NUM_BUFFER_USAGE_COLUMNS];
+ bool nulls[NUM_BUFFER_USAGE_COLUMNS];
+ BufferUsage *usage;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ memset(nulls, 0, sizeof(nulls));
+
+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif
+
+ values[0] = Int64GetDatum(usage->shared_blks_hit);
+ values[1] = Int64GetDatum(usage->shared_blks_read);
+ values[2] = Int64GetDatum(usage->shared_blks_dirtied);
+ values[3] = Int64GetDatum(usage->shared_blks_written);
+ values[4] = Int64GetDatum(usage->local_blks_hit);
+ values[5] = Int64GetDatum(usage->local_blks_read);
+ values[6] = Int64GetDatum(usage->local_blks_dirtied);
+ values[7] = Int64GetDatum(usage->local_blks_written);
+ values[8] = Int64GetDatum(usage->temp_blks_read);
+ values[9] = Int64GetDatum(usage->temp_blks_written);
+ values[10] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time));
+ values[11] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time));
+ values[12] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time));
+ values[13] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time));
+ values[14] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+ values[15] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * SQL function: pg_session_buffer_usage_reset()
+ *
+ * Resets all BufferUsage counters of the top-level stack entry to zero.
+ * Useful in tests to avoid the baseline/delta pattern.
+ */
+Datum
+pg_session_buffer_usage_reset(PG_FUNCTION_ARGS)
+{
+#if HAVE_INSTR_STACK
+ memset(&instr_top.bufusage, 0, sizeof(BufferUsage));
+#else
+ memset(&pgBufferUsage, 0, sizeof(BufferUsage));
+#endif
+
+ PG_RETURN_VOID();
+}
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
new file mode 100644
index 00000000000..fabd05ee024
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
@@ -0,0 +1,5 @@
+# pg_session_buffer_usage extension
+comment = 'show buffer usage statistics for the current session'
+default_version = '1.0'
+module_pathname = '$libdir/pg_session_buffer_usage'
+relocatable = true
diff --git a/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql b/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
new file mode 100644
index 00000000000..8f5810fadd3
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
@@ -0,0 +1,204 @@
+LOAD 'pg_session_buffer_usage';
+CREATE EXTENSION pg_session_buffer_usage;
+
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM pg_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+
+-- Verify counters increase after buffer activity
+SELECT pg_session_buffer_usage_reset();
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM pg_session_buffer_usage();
+
+DROP TABLE test_buf_activity;
+
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+SELECT count(*) FROM par_dc_tab;
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM pg_session_buffer_usage(), dc_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_dc_tab, dc_serial_result;
+
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+
+SELECT pg_session_buffer_usage_reset();
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP TABLE exc_tab;
+
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+
+SELECT pg_session_buffer_usage_reset();
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP TABLE unique_tab;
+
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT subxact_exc_func();
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT cursor_exc_func();
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+
+SELECT pg_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+RESET debug_parallel_query;
+
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM pg_session_buffer_usage(), par_abort_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_abort_tab, par_abort_serial_result;
+
+-- Cleanup
+DROP EXTENSION pg_session_buffer_usage;
--
2.47.1
[application/octet-stream] v9-0008-Index-scans-Show-table-buffer-accesses-separately.patch (21.1K, 9-v9-0008-Index-scans-Show-table-buffer-accesses-separately.patch)
From 61d4014bc4366314b41615e60e3b8c5ab855b44f Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:19 -0800
Subject: [PATCH v9 8/9] Index scans: Show table buffer accesses separately in
EXPLAIN ANALYZE
This sets up a separate instrumentation stack entry that is used while an
Index Scan or Index Only Scan accesses the table, for example because
additional data is needed.
EXPLAIN ANALYZE will now show "Table Buffers" that represent such activity.
The activity is also included in regular "Buffers" together with index
activity and that of any child nodes.
Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
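A toy model of the accounting described in the commit message above, for
readers who want to see why "Table Buffers" ends up as a subset of "Buffers".
This is only a sketch: ToyInstr and the toy_* functions are hypothetical
stand-ins, not the InstrPushStack / InstrPopStack / InstrFinalizeNode
implementations from this series, and only a single counter is modelled. It
assumes that popping merely restores the previous entry, and that a child
entry is folded into its parent when it is finalized.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one instrumentation stack entry. */
typedef struct ToyInstr
{
	long		shared_blks_hit;	/* the only counter modelled here */
	struct ToyInstr *prev;		/* entry to restore when popped */
} ToyInstr;

static ToyInstr toy_top = {0, NULL};		/* session-level entry */
static ToyInstr *toy_current = &toy_top;	/* entry receiving counts */

/* Make "entry" the current entry, remembering the previous one. */
static void
toy_push(ToyInstr *entry)
{
	entry->prev = toy_current;
	toy_current = entry;
}

/* Restore the entry that was current before "entry" was pushed. */
static void
toy_pop(ToyInstr *entry)
{
	assert(toy_current == entry);
	toy_current = entry->prev;
}

/* A buffer hit is charged to whichever entry is current. */
static void
toy_buffer_hit(void)
{
	toy_current->shared_blks_hit++;
}

/* Fold a finished entry's counts into its parent. */
static void
toy_finalize(const ToyInstr *entry, ToyInstr *parent)
{
	parent->shared_blks_hit += entry->shared_blks_hit;
}

/*
 * Simulate one index scan: index page hits are charged to "node", table
 * page hits to "table", and both are folded upwards at the end.
 */
static void
toy_run_index_scan(ToyInstr *node, ToyInstr *table,
				   int index_hits, int table_hits)
{
	toy_push(node);
	for (int i = 0; i < index_hits; i++)
		toy_buffer_hit();

	toy_push(table);
	for (int i = 0; i < table_hits; i++)
		toy_buffer_hit();
	toy_pop(table);

	toy_pop(node);

	toy_finalize(table, node);		/* table usage is included in node */
	toy_finalize(node, &toy_top);	/* node usage is included in session */
}
```

Calling toy_run_index_scan(&node, &table, 3, 11) on two zeroed entries leaves
table.shared_blks_hit at 11 and node.shared_blks_hit at 14, which mirrors the
"Table Buffers: shared hit=11" / "Buffers: shared hit=14" pair in the tenk1
example of the perform.sgml hunk.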
doc/src/sgml/perform.sgml | 13 ++-
doc/src/sgml/ref/explain.sgml | 1 +
src/backend/commands/explain.c | 47 ++++++++--
src/backend/executor/execProcnode.c | 57 ++++++++++++
src/backend/executor/nodeIndexonlyscan.c | 25 +++++-
src/backend/executor/nodeIndexscan.c | 107 ++++++++++++++++++-----
src/include/executor/instrument_node.h | 5 ++
src/include/nodes/execnodes.h | 15 ++++
8 files changed, 235 insertions(+), 35 deletions(-)
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 5f6f1db0467..9219625faf6 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -734,6 +734,7 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10)
Index Cond: (unique2 = t1.unique2)
Index Searches: 10
+ Table Buffers: shared hit=10
Buffers: shared hit=24 read=6
Planning:
Buffers: shared hit=15 dirtied=9
@@ -949,7 +950,8 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
Index Cond: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Index Recheck: 1
Index Searches: 1
- Buffers: shared hit=1
+ Table Buffers: shared hit=1
+ Buffers: shared hit=2
Planning Time: 0.039 ms
Execution Time: 0.098 ms
</screen>
@@ -958,7 +960,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
then rejected by a recheck of the index condition. This happens because a
GiST index is <quote>lossy</quote> for polygon containment tests: it actually
returns the rows with polygons that overlap the target, and then we have
- to do the exact containment test on those rows.
+ to do the exact containment test on those rows. The <literal>Table Buffers</literal>
+ counts show how many buffer accesses were made to the table rather than
+ the index. These accesses are also included in the <literal>Buffers</literal> counts.
</para>
<para>
@@ -1147,13 +1151,14 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
QUERY PLAN
-------------------------------------------------------------------&zwsp;------------------------------------------------------------
Limit (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2.00 loops=1)
- Buffers: shared hit=16
+ Buffers: shared hit=14
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2.00 loops=1)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Index Searches: 1
- Buffers: shared hit=16
+ Table Buffers: shared hit=11
+ Buffers: shared hit=14
Planning Time: 0.077 ms
Execution Time: 0.086 ms
</screen>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 7dee77fd366..912c96f2ff5 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -506,6 +506,7 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
-> Index Scan using test_pkey on test (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99.00 loops=1)
Index Cond: ((id > 100) AND (id < 200))
Index Searches: 1
+ Table Buffers: shared hit=1
Buffers: shared hit=4
Planning Time: 0.244 ms
Execution Time: 0.073 ms
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index fa250b0196a..8dd56394cf2 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -144,7 +144,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -605,7 +605,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
- show_buffer_usage(es, bufusage);
+ show_buffer_usage(es, bufusage, NULL);
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -1022,7 +1022,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
if (es->buffers && peek_buffer_usage(es, &metrics->bufferUsage))
{
es->indent++;
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->bufferUsage, NULL);
es->indent--;
}
}
@@ -1036,7 +1036,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->bufferUsage, NULL);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -1964,6 +1964,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexScanState *) planstate)->iss_InstrumentTable->instr.bufusage, "Table");
break;
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1981,6 +1984,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyFloat("Heap Fetches", NULL,
planstate->instrument->ntuples2, 0, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexOnlyScanState *) planstate)->ioss_InstrumentTable->instr.bufusage, "Table");
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
@@ -2282,7 +2288,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->instr.bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage, NULL);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
@@ -2301,7 +2307,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->instr.bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage, NULL);
if (es->wal)
show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
@@ -4101,7 +4107,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
* Show buffer usage details. This better be sync with peek_buffer_usage.
*/
static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
{
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -4126,6 +4132,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "Buffers:");
if (has_shared)
@@ -4181,6 +4189,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared_timing || has_local_timing || has_temp_timing)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "I/O Timings:");
if (has_shared_timing)
@@ -4222,6 +4232,14 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
else
{
+ char *buffers_title = NULL;
+
+ if (title)
+ {
+ buffers_title = psprintf("%s Buffers", title);
+ ExplainOpenGroup(buffers_title, buffers_title, true, es);
+ }
+
ExplainPropertyInteger("Shared Hit Blocks", NULL,
usage->shared_blks_hit, es);
ExplainPropertyInteger("Shared Read Blocks", NULL,
@@ -4242,8 +4260,20 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
usage->temp_blks_read, es);
ExplainPropertyInteger("Temp Written Blocks", NULL,
usage->temp_blks_written, es);
+
+ if (buffers_title)
+ ExplainCloseGroup(buffers_title, buffers_title, true, es);
+
if (track_io_timing)
{
+ char *timings_title = NULL;
+
+ if (title)
+ {
+ timings_title = psprintf("%s I/O Timings", title);
+ ExplainOpenGroup(timings_title, timings_title, true, es);
+ }
+
ExplainPropertyFloat("Shared I/O Read Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
3, es);
@@ -4262,6 +4292,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
ExplainPropertyFloat("Temp I/O Write Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
3, es);
+
+ if (timings_title)
+ ExplainCloseGroup(timings_title, timings_title, true, es);
}
}
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 828a1fe3b1d..eeebe2ce64f 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -415,9 +415,32 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
+ {
result->instrument = InstrAllocNode(estate->es_instrument,
result->async_capable);
+ /*
+ * IndexScan / IndexOnlyScan track table and index access separately.
+ *
+ * We intentionally don't collect timing for them (even if enabled),
+ * since we don't need it, and the executor nodes call InstrPushStack
+ * / InstrPopStack (instead of the full InstrNode*) to reduce
+ * overhead.
+ */
+ if (IsA(result, IndexScanState) && (estate->es_instrument & INSTRUMENT_BUFFERS) != 0)
+ {
+ IndexScanState *iss = castNode(IndexScanState, result);
+
+ iss->iss_InstrumentTable = InstrAllocNode(INSTRUMENT_BUFFERS, false);
+ }
+ else if (IsA(result, IndexOnlyScanState) && (estate->es_instrument & INSTRUMENT_BUFFERS) != 0)
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, result);
+
+ ioss->ioss_InstrumentTable = InstrAllocNode(INSTRUMENT_BUFFERS, false);
+ }
+ }
+
return result;
}
@@ -837,8 +860,26 @@ ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
return false;
if (node->instrument)
+ {
InstrQueryRememberNode(parent, node->instrument);
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ if (iss->iss_InstrumentTable)
+ InstrQueryRememberNode(parent, iss->iss_InstrumentTable);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ if (ioss->ioss_InstrumentTable)
+ InstrQueryRememberNode(parent, ioss->ioss_InstrumentTable);
+ }
+ }
+
return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
}
@@ -880,6 +921,22 @@ ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
if (!node->instrument)
return false;
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ if (iss->iss_InstrumentTable)
+ iss->iss_InstrumentTable = InstrFinalizeNode(iss->iss_InstrumentTable, &node->instrument->instr);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ if (ioss->ioss_InstrumentTable)
+ ioss->ioss_InstrumentTable = InstrFinalizeNode(ioss->ioss_InstrumentTable, &node->instrument->instr);
+ }
+
node->instrument = InstrFinalizeNode(node->instrument, parent);
return false;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index c8db357e69f..3f94671b55c 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -163,11 +163,22 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
&node->ioss_VMBuffer))
{
+ bool found;
+
/*
* Rats, we have to visit the heap to check visibility.
*/
InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+
+ if (node->ioss_InstrumentTable)
+ InstrPushStack(&node->ioss_InstrumentTable->instr);
+
+ found = index_fetch_heap(scandesc, node->ioss_TableSlot);
+
+ if (node->ioss_InstrumentTable)
+ InstrPopStack(&node->ioss_InstrumentTable->instr);
+
+ if (!found)
continue; /* no visible tuple, try next index entry */
ExecClearTuple(node->ioss_TableSlot);
@@ -434,6 +445,10 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->ioss_Instrument.nsearches;
+ if (node->ioss_InstrumentTable)
+ {
+ InstrAccumStack(&winstrument->worker_table_instr, &node->ioss_InstrumentTable->instr);
+ }
}
/*
@@ -889,4 +904,12 @@ ExecIndexOnlyScanRetrieveInstrumentation(IndexOnlyScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->ioss_SharedInfo = palloc(size);
memcpy(node->ioss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ if (node->ioss_InstrumentTable)
+ for (int i = 0; i < node->ioss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->ioss_InstrumentTable->instr,
+ &node->ioss_SharedInfo->winstrument[i].worker_table_instr);
+ }
}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index bd83e4712b3..281b92c2299 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,7 +85,9 @@ IndexNext(IndexScanState *node)
ExprContext *econtext;
ScanDirection direction;
IndexScanDesc scandesc;
+ ItemPointer tid;
TupleTableSlot *slot;
+ bool found;
/*
* extract necessary information from index scan node
@@ -130,8 +132,24 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
+ if (node->iss_InstrumentTable)
+ InstrPushStack(&node->iss_InstrumentTable->instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (node->iss_InstrumentTable)
+ InstrPopStack(&node->iss_InstrumentTable->instr);
+
+ if (unlikely(!found))
+ continue;
+
CHECK_FOR_INTERRUPTS();
/*
@@ -259,36 +277,67 @@ IndexNextWithReorder(IndexScanState *node)
}
/*
- * Fetch next tuple from the index.
+ * Fetch next valid tuple from the index.
*/
-next_indextuple:
- if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+ for (;;)
{
+ ItemPointer tid;
+ bool found;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scandesc, ForwardScanDirection);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ /*
+ * No more tuples from the index. But we still need to drain
+ * any remaining tuples from the queue before we're done.
+ */
+ node->iss_ReachedEnd = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scandesc->xs_heaptid));
+
+ if (node->iss_InstrumentTable)
+ InstrPushStack(&node->iss_InstrumentTable->instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (node->iss_InstrumentTable)
+ InstrPopStack(&node->iss_InstrumentTable->instr);
+
/*
- * No more tuples from the index. But we still need to drain any
- * remaining tuples from the queue before we're done.
+ * If the index was lossy, we have to recheck the index quals and
+ * ORDER BY expressions using the fetched tuple.
*/
- node->iss_ReachedEnd = true;
- continue;
- }
-
- /*
- * If the index was lossy, we have to recheck the index quals and
- * ORDER BY expressions using the fetched tuple.
- */
- if (scandesc->xs_recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->indexqualorig, econtext))
+ if (found && scandesc->xs_recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- /* allow this loop to be cancellable */
- CHECK_FOR_INTERRUPTS();
- goto next_indextuple;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->indexqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ /* allow this loop to be cancellable */
+ CHECK_FOR_INTERRUPTS();
+ continue;
+ }
}
+
+ if (found)
+ break;
}
+ /* No more index entries, re-run to clear the reorder queue */
+ if (node->iss_ReachedEnd)
+ continue;
+
if (scandesc->xs_recheckorderby)
{
econtext->ecxt_scantuple = slot;
@@ -814,6 +863,10 @@ ExecEndIndexScan(IndexScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->iss_Instrument.nsearches;
+ if (node->iss_InstrumentTable)
+ {
+ InstrAccumStack(&winstrument->worker_table_instr, &node->iss_InstrumentTable->instr);
+ }
}
/*
@@ -1822,4 +1875,12 @@ ExecIndexScanRetrieveInstrumentation(IndexScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->iss_SharedInfo = palloc(size);
memcpy(node->iss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ if (node->iss_InstrumentTable)
+ for (int i = 0; i < node->iss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->iss_InstrumentTable->instr,
+ &node->iss_SharedInfo->winstrument[i].worker_table_instr);
+ }
}
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 8847d7f94fa..d6dd46692ef 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -18,6 +18,8 @@
#ifndef INSTRUMENT_NODE_H
#define INSTRUMENT_NODE_H
+#include "executor/instrument.h"
+
/* ---------------------
* Instrumentation information for aggregate function execution
@@ -48,6 +50,9 @@ typedef struct IndexScanInstrumentation
{
/* Index search count (incremented with pgstat_count_index_scan call) */
uint64 nsearches;
+
+ /* Used for passing i(o)ss_InstrumentTable data from parallel workers */
+ Instrumentation worker_table_instr;
} IndexScanInstrumentation;
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index fbf13683581..cd6736acc80 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1737,6 +1737,13 @@ typedef struct IndexScanState
IndexScanInstrumentation iss_Instrument;
SharedIndexScanInstrumentation *iss_SharedInfo;
+ /*
+ * Instrumentation utilized for tracking table access. This is separate
+ * from iss_Instrument since it needs to be allocated in the right context
+ * and IndexScanInstrumentation shouldn't contain pointers.
+ */
+ NodeInstrumentation *iss_InstrumentTable;
+
/* These are needed for re-checking ORDER BY expr ordering */
pairingheap *iss_ReorderQueue;
bool iss_ReachedEnd;
@@ -1787,6 +1794,14 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
IndexScanInstrumentation ioss_Instrument;
SharedIndexScanInstrumentation *ioss_SharedInfo;
+
+ /*
+ * Instrumentation utilized for tracking table access. This is separate
+ * from ioss_Instrument since it needs to be allocated in the right
+ * context and IndexScanInstrumentation shouldn't contain pointers.
+ */
+ NodeInstrumentation *ioss_InstrumentTable;
+
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
Size ioss_PscanLen;
--
2.47.1
[application/octet-stream] v9-0007-instrumentation-Optimize-ExecProcNodeInstr-instru.patch (11.7K, 10-v9-0007-instrumentation-Optimize-ExecProcNodeInstr-instru.patch)
From b2da642a3485028ffa88db25b0fd808d23b01a48 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 17:52:24 -0800
Subject: [PATCH v9 7/9] instrumentation: Optimize ExecProcNodeInstr
instructions by inlining
For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.
Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). These
conditionals being checked for each tuple being emitted add up, and cause
a non-optimal set of instructions to be generated by the compiler.
Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. This reduces
the overhead of EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
test with a large COUNT(*) that does many ExecProcNode calls from ~ 20% on
top of actual runtime to ~ 3%. When using BUFFERS ON the same query goes
from ~ 20% to ~ 10% on top of actual runtime.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxFP7i7-wy98ZmEJ11edYq-RrPvJoa4kzGhBBjERA4Nyw%40mail.gmail.com#e8dfd018a07d7f8d41565a079d40c564
---
src/backend/executor/execProcnode.c | 22 +--
src/backend/executor/instrument.c | 224 +++++++++++++++++++++-------
src/include/executor/instrument.h | 5 +
3 files changed, 174 insertions(+), 77 deletions(-)
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index b1181715c30..828a1fe3b1d 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,7 +121,6 @@
#include "nodes/nodeFuncs.h"
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
-static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
@@ -465,7 +464,7 @@ ExecProcNodeFirst(PlanState *node)
* have ExecProcNode() directly call the relevant function from now on.
*/
if (node->instrument)
- node->ExecProcNode = ExecProcNodeInstr;
+ node->ExecProcNode = InstrNodeSetupExecProcNode(node->instrument);
else
node->ExecProcNode = node->ExecProcNodeReal;
@@ -473,25 +472,6 @@ ExecProcNodeFirst(PlanState *node)
}
-/*
- * ExecProcNode wrapper that performs instrumentation calls. By keeping
- * this a separate function, we avoid overhead in the normal case where
- * no instrumentation is wanted.
- */
-static TupleTableSlot *
-ExecProcNodeInstr(PlanState *node)
-{
- TupleTableSlot *result;
-
- InstrStartNode(node->instrument);
-
- result = node->ExecProcNodeReal(node);
-
- InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
- return result;
-}
-
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index e7528c55f0b..95d3f83d46b 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -48,29 +48,20 @@ InstrInitOptions(Instrumentation *instr, int instrument_options)
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-void
-InstrStart(Instrumentation *instr)
+static inline void
+InstrStartTimer(Instrumentation *instr)
{
- if (instr->need_timer)
- {
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStart called twice in a row");
- else
- INSTR_TIME_SET_CURRENT(instr->starttime);
- }
+ Assert(INSTR_TIME_IS_ZERO(instr->starttime));
- if (instr->need_stack)
- InstrPushStack(instr);
+ INSTR_TIME_SET_CURRENT(instr->starttime);
}
-static void
+static inline void
InstrStopTimer(Instrumentation *instr)
{
instr_time endtime;
- /* let's update the time only if the timer was requested */
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ Assert(!INSTR_TIME_IS_ZERO(instr->starttime));
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
@@ -78,6 +69,16 @@ InstrStopTimer(Instrumentation *instr)
INSTR_TIME_SET_ZERO(instr->starttime);
}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer)
+ InstrStartTimer(instr);
+
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
void
InstrStop(Instrumentation *instr)
{
@@ -345,65 +346,57 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options)
InstrInitOptions(&instr->instr, instrument_options);
}
-/* Entry to a plan node */
+/* Entry to a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
void
InstrStartNode(NodeInstrumentation *instr)
{
InstrStart(&instr->instr);
}
-
-/* Exit from a plan node */
-void
-InstrStopNode(NodeInstrumentation *instr, double nTuples)
+/*
+ * Updates the node instrumentation time counter.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+static inline void
+InstrStopNodeTimer(NodeInstrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
+ Assert(!INSTR_TIME_IS_ZERO(instr->instr.starttime));
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
/*
- * Update the time only if the timer was requested.
+ * Is this the first tuple of this cycle?
*
- * Note this is different from InstrStop because total is only updated in
- * InstrEndLoop. We need the separate counter variable because we need to
- * calculate start-up time for the first tuple in each cycle, and then
- * accumulate it together.
+ * In async mode, if the plan node hadn't emitted any tuples before, this
+ * might be the first tuple
*/
- if (instr->instr.need_timer)
- {
- if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrStopNode called without start");
-
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ if (!instr->running || (instr->async_mode && instr->tuplecount < 1.0))
+ instr->firsttuple = instr->counter;
+}
- INSTR_TIME_SET_ZERO(instr->instr.starttime);
- }
+/* Exit from a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ if (instr->instr.need_timer)
+ InstrStopNodeTimer(instr);
- /*
- * Only pop the stack, accumulation runs in
- * ExecFinalizeNodeInstrumentation
- */
+ /* Only pop the stack, accumulation runs in InstrFinalizeNode */
if (instr->instr.need_stack)
InstrPopStack(&instr->instr);
- /* Is this the first tuple of this cycle? */
- if (!instr->running)
- {
- instr->running = true;
- instr->firsttuple = instr->counter;
- }
- else
- {
- /*
- * In async mode, if the plan node hadn't emitted any tuples before,
- * this might be the first tuple
- */
- if (instr->async_mode && save_tuplecount < 1.0)
- instr->firsttuple = instr->counter;
- }
+ instr->running = true;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
}
NodeInstrumentation *
@@ -493,6 +486,125 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
InstrAccumStack(&dst->instr, &add->instr);
}
+/*
+ * Specialized handling of instrumented ExecProcNode
+ *
+ * These functions are equivalent to running ExecProcNodeReal wrapped in
+ * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
+ * by checking the instrumentation options when the ExecProcNode pointer gets
+ * first set, and then using a special-purpose function for each. This results
+ * in a more optimized set of compiled instructions.
+ */
+
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
+
+/* Simplified pop: restore saved state instead of re-deriving from array */
+static inline void
+InstrPopStackTo(Instrumentation *prev)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.stack_size > 1 ? instr_stack.entries[instr_stack.stack_size - 2] == prev : &instr_top == prev);
+ instr_stack.stack_size--;
+ instr_stack.current = prev;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrFull(PlanState *node)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ InstrPushStack(&instr->instr);
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ InstrStopNodeTimer(instr);
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsStackOnly(PlanState *node)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ InstrPushStack(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsTimerOnly(PlanState *node)
+{
+ NodeInstrumentation *instr = node->instrument;
+ TupleTableSlot *result;
+
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ InstrStopNodeTimer(instr);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsOnly(PlanState *node)
+{
+ NodeInstrumentation *instr = node->instrument;
+ TupleTableSlot *result;
+
+ result = node->ExecProcNodeReal(node);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+/*
+ * Returns an ExecProcNode wrapper that performs instrumentation calls,
+ * tailored to the instrumentation options enabled for the node.
+ */
+ExecProcNodeMtd
+InstrNodeSetupExecProcNode(NodeInstrumentation *instr)
+{
+ bool need_timer = instr->instr.need_timer;
+ bool need_stack = instr->instr.need_stack;
+
+ if (need_timer && need_stack)
+ return ExecProcNodeInstrFull;
+ else if (need_stack)
+ return ExecProcNodeInstrRowsStackOnly;
+ else if (need_timer)
+ return ExecProcNodeInstrRowsTimerOnly;
+ else
+ return ExecProcNodeInstrRowsOnly;
+}
+
/* Trigger instrumentation handling */
TriggerInstrumentation *
InstrAllocTrigger(int n, int instrument_options)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index e567cd691b4..ce8d9589363 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -283,6 +283,11 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
+typedef struct TupleTableSlot TupleTableSlot;
+typedef struct PlanState PlanState;
+typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+extern ExecProcNodeMtd InstrNodeSetupExecProcNode(NodeInstrumentation *instr);
+
extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
--
2.47.1
* Re: Stack-based tracking of per-node WAL/buffer usage
2025-08-31 23:57 Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-09-04 20:23 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-09-09 19:35 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 11:28 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 12:59 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-10-31 07:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-08 04:27 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-09 21:55 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-09 23:45 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-10 08:12 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-14 20:49 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-17 06:21 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-17 08:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
@ 2026-03-18 20:49 ` Zsolt Parragi <[email protected]>
2026-03-18 23:36 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Zsolt Parragi @ 2026-03-18 20:49 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; Peter Smith <[email protected]>
+ instr_stack.stack_space *= 2;
+ instr_stack.entries = repalloc_array(instr_stack.entries,
Instrumentation *, instr_stack.stack_space);
Can't this also cause issues with OOM? repalloc_array failing, but we
already doubled stack_space.
The initialization above uses the same order, but that should be safe
as entries is initially NULL.
+ * 2) Accumulate all instrumentation to the currently active instrumentation,
+ * so that callers get a complete picture of activity, even after an abort
...
+ if (idx >= 0)
+ {
+ while (instr_stack.stack_size > idx + 1)
+ instr_stack.stack_size--;
+
+ InstrPopStack(instr);
+ }
There seems to be one more bug in this:
1. EXPLAIN ANALYZE fires a trigger
2. The trigger function throws ERROR, InstrStopTrigger never runs
3. ResOwnerReleaseInstrumentation runs but only checks
unfinalized_children, not triggers
4. InstrStopFinalize discards the trigger entry
5. Trigger instrumentation information shows 0
* Re: Stack-based tracking of per-node WAL/buffer usage
@ 2026-03-18 23:36 ` Lukas Fittl <[email protected]>
2026-03-19 00:45 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Lukas Fittl @ 2026-03-18 23:36 UTC (permalink / raw)
To: Zsolt Parragi <[email protected]>; +Cc: Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; Peter Smith <[email protected]>
On Wed, Mar 18, 2026 at 1:49 PM Zsolt Parragi <[email protected]> wrote:
>
> + instr_stack.stack_space *= 2;
> + instr_stack.entries = repalloc_array(instr_stack.entries,
> Instrumentation *, instr_stack.stack_space);
>
> Can't this also cause issues with OOM? repalloc_array failing, but we
> already doubled stack_space.
> The initialization above uses the same order, but that should be safe
> as entries is initially NULL.
Yeah, that's fair, I think it's reasonable to update
instr_stack.stack_space only after the allocation has succeeded. I also
think it's good to do that even for the initialization case - it seems
extremely unlikely, but the fact that entries is NULL doesn't protect
us against a subsequent InstrPushStack running after an OOM, and the
check whether InstrStackGrow should be called is based on the stack
space, not the entries array.
I have a fix for that staged for the next iteration.
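To illustrate the ordering discussed above, here is a minimal standalone sketch, not the patch's actual code. DemoStack, demo_stack_grow and demo_stack_push are hypothetical names, and plain realloc() stands in for repalloc_array (which raises an error rather than returning NULL). The point is only that the capacity field is written after the allocation has succeeded, so a failed grow leaves the stack consistent for a later push.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the instrumentation stack in the patch. */
typedef struct DemoStack
{
	int	  **entries;		/* array of entry pointers, NULL until first grow */
	size_t	stack_space;	/* allocated capacity, in entries */
	size_t	stack_size;		/* entries currently in use */
} DemoStack;

/*
 * Grow the entries array. The capacity is updated only after the
 * allocation succeeded, so on failure both fields keep their old values.
 */
bool
demo_stack_grow(DemoStack *stack)
{
	size_t	new_space = (stack->stack_space == 0) ? 4 : stack->stack_space * 2;
	int	  **new_entries = realloc(stack->entries, new_space * sizeof(int *));

	if (new_entries == NULL)
		return false;		/* entries and stack_space are untouched */

	stack->entries = new_entries;
	stack->stack_space = new_space;
	return true;
}

/* Push an entry, growing first if the capacity check says so. */
bool
demo_stack_push(DemoStack *stack, int *entry)
{
	if (stack->stack_size >= stack->stack_space && !demo_stack_grow(stack))
		return false;

	assert(stack->stack_size < stack->stack_space);
	stack->entries[stack->stack_size++] = entry;
	return true;
}
```

Because the push path decides whether to grow by looking at the capacity, a capacity that was bumped before a failed allocation would let a later push write past the end of the (unchanged or NULL) array; updating it last avoids that in both the initial and the doubling case.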
> + * 2) Accumulate all instrumentation to the currently active instrumentation,
> + * so that callers get a complete picture of activity, even after an abort
> ...
> + if (idx >= 0)
> + {
> + while (instr_stack.stack_size > idx + 1)
> + instr_stack.stack_size--;
> +
> + InstrPopStack(instr);
> + }
>
> There seems to be one more bug in this:
>
> 1. EXPLAIN ANALYZE fires a trigger
> 2. The trigger function throws ERROR, InstrStopTrigger never runs
> 3. ResOwnerReleaseInstrumentation runs but only checks
> unfinalized_children, not triggers
> 4. InstrStopFinalize discards the trigger entry
> 5. Trigger instrumentation information shows 0
Hmm, so I think you're correct that a trigger function error would
cause any stack-based instrumentation from the trigger to get lost.
In practice that doesn't matter today, since triggers never capture
WAL/buffer usage data (only timing), but it's maybe a design flaw
because trigger instrumentation is its own thing without a defined
relationship with the stack, unlike NodeInstrumentation which is
registered to the query that handles the aborts.
In a sense this is a similar situation to the EXPLAIN (SERIALIZE)
per-tuple handling we talked about previously - we have
instrumentation that's related to a query, but it's not per-node, so
using NodeInstrumentation doesn't really make sense.
I could imagine three ways forward, if we want to address that now (vs
documenting that this isn't handled but effectively not a problem):
1) We add a second list for unfinalized trigger instrumentations on
QueryInstrumentation, and accumulate that too in
ResOwnerReleaseInstrumentation
2) We call InstrStopFinalize in ExecCallTriggerFunc if an error was
thrown from the trigger function (i.e. turn the existing PG_FINALLY
block into a PG_CATCH)
3) We generalize the QueryInstrumentation children handling to allow
other types of Instrumentation (i.e. not just NodeInstrumentation)
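As a rough illustration of option (2) above, the following standalone sketch uses setjmp/longjmp in place of PostgreSQL's PG_TRY/PG_CATCH machinery. All names (DemoInstr, demo_call_trigger, and so on) are hypothetical, and the "finalize" step here is simply assumed to mean: restore the previous stack entry and add the trigger's counters to it. It is not the patch's InstrStopFinalize.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical, heavily simplified instrumentation entry. */
typedef struct DemoInstr
{
	long	buffer_hits;
} DemoInstr;

DemoInstr	demo_top;					/* catch-all entry at the bottom */
DemoInstr  *demo_current = &demo_top;	/* entry that receives new counts */
jmp_buf		demo_error_jmp;				/* stand-in for the error stack */

typedef void (*DemoTriggerFn) (void);

/* Counts activity against whichever entry is currently active. */
void
demo_count_hit(void)
{
	demo_current->buffer_hits++;
}

/*
 * Runs a trigger function with its own entry active. Returns 0 on success
 * and 1 if the function raised an error. On error, the entry is finalized
 * here: its counts are folded into the previous entry so they are not lost.
 */
int
demo_call_trigger(DemoInstr *trig, DemoTriggerFn fn)
{
	DemoInstr  *prev = demo_current;

	demo_current = trig;
	if (setjmp(demo_error_jmp) == 0)
	{
		fn();
		demo_current = prev;	/* clean exit: only pop */
		return 0;
	}

	/* error path: pop and accumulate into the previously active entry */
	demo_current = prev;
	prev->buffer_hits += trig->buffer_hits;
	return 1;
}

void
demo_trigger_ok(void)
{
	demo_count_hit();
}

void
demo_trigger_fails(void)
{
	demo_count_hit();
	demo_count_hit();
	longjmp(demo_error_jmp, 1);	/* simulated ERROR */
}
```

In the real executor the catch block would presumably also need to re-throw the error (PG_RE_THROW) after finalizing; the sketch returns a status code instead to stay self-contained.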
I think (3) would be ideal and would let us deal with EXPLAIN
(SERIALIZE) too, but is complicated by the fact that
ResOwnerReleaseInstrumentation needs to have a reference to the full
allocated struct (not just the Instrumentation contained within) so it
can call pfree to avoid leaking memory until top transaction end.
In a prior iteration we had the Instrumentation allocated separately
inside NodeInstrumentation (so it's a pointer and can thus be freed
independently / replaced with a copy on clean exit), which allows
ResOwnerReleaseInstrumentation to just deal with Instrumentation, but
that becomes inconvenient when dealing with parallel workers.
There is an alternate design I had considered, which is to basically
keep two copies of Instrumentation in NodeInstrumentation: One is a
pointer to the running instrumentation (allocated in a memory context
that survives long enough in an abort, and which will be freed upon
abort or clean exit), and one is a direct member of the containing
struct (like we have it today), and gets updated via a memcpy() upon a
clean exit. I think that'd make the API easier to use and the same
concept could then be applied to TriggerInstrumentation, but the big
downside is that we'd be doubling memory usage because whilst we're
running we'd both have the allocation in the higher memory context,
and the direct member of the containing struct (to return the result
to the caller).
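The alternate design described above could look roughly like the following standalone sketch. The names (DemoUsage, DemoNodeInstr, demo_instr_begin/finish/abort) are hypothetical, calloc/free stand in for allocation in a longer-lived memory context, and the abort behavior (adding the running counts to a totals struct) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified usage counters. */
typedef struct DemoUsage
{
	long	buffer_hits;
	long	wal_records;
} DemoUsage;

typedef struct DemoNodeInstr
{
	DemoUsage  *running;	/* separately allocated, survives an abort */
	DemoUsage	result;		/* direct member, filled in on clean exit */
} DemoNodeInstr;

/* Start tracking: the running copy is where counters get incremented. */
bool
demo_instr_begin(DemoNodeInstr *ni)
{
	memset(&ni->result, 0, sizeof(ni->result));
	ni->running = calloc(1, sizeof(DemoUsage));
	return ni->running != NULL;
}

/* Clean exit: copy the running counters into the direct member, then free. */
void
demo_instr_finish(DemoNodeInstr *ni)
{
	assert(ni->running != NULL);
	memcpy(&ni->result, ni->running, sizeof(DemoUsage));
	free(ni->running);
	ni->running = NULL;
}

/* Abort: only the running copy is needed; fold it into totals and free it. */
void
demo_instr_abort(DemoNodeInstr *ni, DemoUsage *totals)
{
	assert(ni->running != NULL);
	totals->buffer_hits += ni->running->buffer_hits;
	totals->wal_records += ni->running->wal_records;
	free(ni->running);
	ni->running = NULL;
}
```

The memory cost mentioned above is visible in the struct: between begin and finish, both the separately allocated running copy and the embedded result member exist for every instrumented entry.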
Because of that, I feel like we should do (1) or (2) for now - but
I'll also wait to see if Andres or others have additional feedback on
0005 before proceeding with further changes.
I also do think that the 0001-0004 patches are good to be committed
unless anyone has additional feedback there.
Thanks,
Lukas
--
Lukas Fittl
* Re: Stack-based tracking of per-node WAL/buffer usage
@ 2026-03-19 00:45 ` Lukas Fittl <[email protected]>
2026-03-23 14:41 ` Re: Stack-based tracking of per-node WAL/buffer usage Heikki Linnakangas <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Lukas Fittl @ 2026-03-19 00:45 UTC (permalink / raw)
To: Zsolt Parragi <[email protected]>; +Cc: Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; Peter Smith <[email protected]>
On Wed, Mar 18, 2026 at 4:36 PM Lukas Fittl <[email protected]> wrote:
> On Wed, Mar 18, 2026 at 1:49 PM Zsolt Parragi <[email protected]> wrote:
> > There seems to be one more bug in this:
> >
> > 1. EXPLAIN ANALYZE fires a trigger
> > 2. The trigger function throws ERROR, InstrStopTrigger never runs
> > 3. ResOwnerReleaseInstrumentation runs but only checks
> > unfinalized_children, not triggers
> > 4. InstrStopFinalize discards the trigger entry
> > 5. Trigger instrumentation information shows 0
>
> Hmm, so I think you're correct that a trigger function error would
> cause any stack-based instrumentation from the trigger to get lost.
>
> In practice that doesn't matter today, since triggers never capture
> WAL/buffer usage data (only timing),
After twisting and turning this in my head more, I realize that's
actually not correct - as it stands, trigger instrumentation is
inheriting the instrumentation options from the overall query, and so
that will cause a typical EXPLAIN (ANALYZE) to also capture Buffer/WAL
usage for triggers - it just won't be shown in EXPLAIN.
Since it's not used in practice, we could fix that by explicitly
setting INSTRUMENT_TIMER for triggers, but AFAIR Andres had noted on a
prior iteration that special casing this doesn't seem right, since we
should probably output buffer/WAL usage for triggers anyway.
So I guess that brings us back to fixing it with one of the
ways I mentioned. FWIW, I was able to create a test case in the
pg_session_buffer_usage module to that effect, so there is indeed a
current issue where activity during triggers gets lost and won't be
added to the overall totals on abort.
Thanks,
Lukas
--
Lukas Fittl
* Re: Stack-based tracking of per-node WAL/buffer usage
@ 2026-03-23 14:41 ` Heikki Linnakangas <[email protected]>
2026-03-23 19:07 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Heikki Linnakangas @ 2026-03-23 14:41 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; Zsolt Parragi <[email protected]>; +Cc: Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; Peter Smith <[email protected]>
On 19/03/2026 02:45, Lukas Fittl wrote:
> On Wed, Mar 18, 2026 at 4:36 PM Lukas Fittl <[email protected]> wrote:
>> On Wed, Mar 18, 2026 at 1:49 PM Zsolt Parragi <[email protected]> wrote:
>>> There seems to be one more bug in this:
>>>
>>> 1. EXPLAIN ANALYZE fires a trigger
>>> 2. The trigger function throws ERROR, InstrStopTrigger never runs
>>> 3. ResOwnerReleaseInstrumentation runs but only checks
>>> unfinalized_children, not triggers
>>> 4. InstrStopFinalize discards the trigger entry
>>> 5. Trigger instrumentation information shows 0
>>
>> Hmm, so I think you're correct that a trigger function error would
>> cause any stack-based instrumentation from the trigger to get lost.
>>
>> In practice that doesn't matter today, since triggers never capture
>> WAL/buffer usage data (only timing),
>
> After twisting and turning this in my head more, I realize that's
> actually not correct - as it stands, trigger instrumentation is
> inheriting the instrumentation options from the overall query, and so
> that will cause a typical EXPLAIN (ANALYZE) to also capture Buffer/WAL
> usage for triggers - it just won't be shown in EXPLAIN.
>
> Since its not used in practice, we could fix that by explicitly
> setting INSTRUMENT_TIMER for triggers, but AFAIR Andres had noted on a
> prior iteration that special casing this doesn't seem right, since we
> should probably output buffer/WAL usage for triggers anyway.
>
> So I guess that brings us back to, we should fix it with one of the
> ways I mentioned. FWIW, I was able to create a test case in the
> pg_session_buffer_usage module to that effect, so there is indeed a
> current issue where activity during triggers gets lost and won't be
> added to the overall totals on abort.
I'm looking at the finalize-at-resowner part of this patch, and this
may be a stupid question, but:
Why does the instrumentation need to be "finalized" on abort? If you run
EXPLAIN ANALYZE and the query aborts, you don't get to see the stats anyway.
- Heikki
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Zsolt Parragi @ 2026-03-23 19:07 UTC (permalink / raw)
To: Heikki Linnakangas <[email protected]>; +Cc: Lukas Fittl <[email protected]>; Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; Peter Smith <[email protected]>
> I'm looking at this finalize at resowner part of this patch, and this
> maybe a stupid question, but:
>
> Why does the instrumentation need to be "finalized" on abort? If you run
> EXPLAIN ANALYZE and the query aborts, you don't get to see the stats anyway.
The pg_session_buffer_usage module in 0009 makes the information
available; I was able to see the issue with failing triggers using it.
Even if that part doesn't get committed in the end, a third-party
extension could still implement the same thing and notice the missing
statistics. (And maybe it is useful to see some statistics about
failing queries?)
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-03-23 20:03 UTC (permalink / raw)
To: Zsolt Parragi <[email protected]>; Heikki Linnakangas <[email protected]>; +Cc: Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; Andres Freund <[email protected]>; Peter Smith <[email protected]>
On Mon, Mar 23, 2026 at 12:07 PM Zsolt Parragi
<[email protected]> wrote:
>
> > I'm looking at this finalize at resowner part of this patch, and this
> > maybe a stupid question, but:
> >
> > Why does the instrumentation need to be "finalized" on abort? If you run
> > EXPLAIN ANALYZE and the query aborts, you don't get to see the stats anyway.
>
> The pg_session_buffer_usage in 0009 makes the information available, I
> was able to see the issue with failing triggers with that. Even if
> that part doesn't get committed in the end, a 3rd party extension
> could still implement the same thing, and notice the missing
> statistics. (And maybe it is useful to see some statistics about
> failing queries?)
Right, there are basically two reasons we need the resource owner
logic, or something equivalent:
1) To ensure the active stack entry gets reset correctly to where it
was before, so that we don't corrupt it after an abort (if that's not
done, we'll fail during regular regression tests)
2) To accumulate statistics to parent stack entries that did not
participate in the abort
As Zsolt noted, (2) matters for the pg_session_buffer_usage module,
which is mainly intended to ensure that the logic we're adding keeps a
top-level instrumentation available that includes the activity of
aborted transactions/queries, since we're removing pgBufferUsage. It
also matters for some edge cases in-tree today, e.g. procedures being
tracked in pg_stat_statements. And I do see us being interested in
tracking failed query activities in the future as Zsolt noted.
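To make (1) and (2) concrete, here is a toy model of the unwind step. It is
only a sketch, not code from the patch: every toy_* name is made up for
illustration, and a failing trigger is simulated by a function that pushes
its entry and returns without popping it. The cleanup then walks back to the
entry that was active before the query started, folding each abandoned
entry's counters into its parent on the way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for one stack entry; all names here are hypothetical. */
typedef struct ToyInstr
{
    struct ToyInstr *parent;    /* entry that was active when this one was pushed */
    long        buffer_hits;    /* activity recorded while this entry was active */
} ToyInstr;

static ToyInstr toy_top = {NULL, 0};        /* session-level totals */
static ToyInstr *toy_current = &toy_top;    /* the active stack entry */

/* Enter a section: zero the entry and make it the active one. */
static void
toy_push(ToyInstr *entry)
{
    entry->parent = toy_current;
    entry->buffer_hits = 0;
    toy_current = entry;
}

/* Leave a section: restore the parent and add our counters to it. */
static void
toy_pop_finalize(ToyInstr *entry)
{
    assert(toy_current == entry);
    toy_current = entry->parent;
    toy_current->buffer_hits += entry->buffer_hits;
}

/* Called wherever the tracked activity happens. */
static void
toy_count_hit(void)
{
    toy_current->buffer_hits++;
}

/*
 * Abort cleanup: pop every entry above 'target'.  This restores the active
 * entry (reason 1) and accumulates the lost activity upwards (reason 2).
 */
static void
toy_unwind_to(ToyInstr *target)
{
    while (toy_current != target)
        toy_pop_finalize(toy_current);
}

/* Simulated trigger that errors out before its matching pop runs. */
static bool
toy_run_failing_trigger(ToyInstr *trigger)
{
    toy_push(trigger);
    toy_count_hit();
    toy_count_hit();
    return false;
}

/* One hit in the query, two in the failing trigger, then abort cleanup. */
static long
toy_demo_abort(void)
{
    ToyInstr    query;
    ToyInstr    trigger;
    ToyInstr   *saved = toy_current;

    toy_top.buffer_hits = 0;
    toy_push(&query);
    toy_count_hit();
    if (!toy_run_failing_trigger(&trigger))
        toy_unwind_to(saved);
    return toy_top.buffer_hits;
}
```

Without the unwind, toy_current would keep pointing at the trigger's entry
after the error, and the three hits would never reach the top-level totals.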
---
FWIW, on the topic of resource owners and allocations, I've done a
test over the weekend, and here is a question:
It seems we could switch the Instrumentation allocations we're doing
when inside a portal to PortalContext, and CurrentMemoryContext when
outside a portal - instead of allocating in
TopMemoryContext/TopTransactionContext. That works in practice,
because resource owner cleanup happens before PortalContext cleanup,
and simplifies the code a bit since we can skip copying into the
current memory context (a copy that is otherwise needed because the
caller wants to be able to use the result after the finalize call).
And if we leak, we'd only leak until
PortalContext gets cleaned up, instead of TopMemoryContext.
To expand on that, in the previously posted v9 we have the following
allocations:
A) InstrStackState allocated under TopMemoryContext (long-lived, never freed)
B) QueryInstrumentation allocated under TopMemoryContext (short-lived
during query execution, explicitly freed up on abort or finalize call)
C) NodeInstrumentation allocated under TopTransactionContext
(short-lived during query execution, explicitly freed up on abort or
finalize call)
D) In other use cases, e.g. ANALYZE command that logs buffer usage,
QueryInstrumentation allocated under TopMemoryContext (short-lived
during command execution, explicitly freed up on abort or finalize
call)
And we could switch it instead to:
A) InstrStackState allocated under TopMemoryContext (long-lived, never freed)
B) QueryInstrumentation allocated under PortalContext (short-lived
during query execution, *automatically* freed up on abort, manually on
ExecutorEnd to avoid waiting for holdable cursors to free
PortalContext)
C) NodeInstrumentation allocated under PortalContext (short-lived
during query execution, *automatically* freed up on abort, manually on
ExecutorEnd to avoid waiting for holdable cursors to free
PortalContext)
D) In other use cases, e.g. ANALYZE command that logs buffer usage,
QueryInstrumentation allocated under CurrentMemoryContext (short-lived
during command execution, *automatically* freed up on abort and
success case)
However, this goes against the principle noted by Heikki over in [0]
that ResOwners should use TopMemoryContext to avoid relying on the
ordering of clean up operations.
Thoughts?
Thanks,
Lukas
[0]: https://www.postgresql.org/message-id/flat/a3197b31-f40d-4164-872d-906d8e9b374a%40iki.fi#526984c1be0...
--
Lukas Fittl
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-03-24 06:03 UTC (permalink / raw)
To: Zsolt Parragi <[email protected]>; Heikki Linnakangas <[email protected]>; Andres Freund <[email protected]>; +Cc: Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; Peter Smith <[email protected]>
On Mon, Mar 23, 2026 at 1:03 PM Lukas Fittl <[email protected]> wrote:
> FWIW, on the topic of resource owners and allocations, I've done a
> test over the weekend, and here is a question:
>
> It seems we could switch the Instrumentation allocations we're doing
> when inside a portal to PortalContext, and CurrentMemoryContext when
> outside a portal - instead of allocating in
> TopMemoryContext/TopTransactionContext. That works in practice,
> because resource owner cleanup happens before PortalContext cleanup,
> and simplifies the code a bit since we can skip copying into the
> current memory context (because the caller wants to be able to use the
> result after the finalize call). And if we leak we'd only leak until
> PortalContext gets cleaned up, instead of TopMemoryContext.
>
> To expand on that, in the previously posted v9 we have the following
> allocations:
>
> A) InstrStackState allocated under TopMemoryContext (long-lived, never freed)
> B) QueryInstrumentation allocated under TopMemoryContext (short-lived
> during query execution, explicitly freed up on abort or finalize call)
> C) NodeInstrumentation allocated under TopTransactionContext
> (short-lived during query execution, explicitly freed up on abort or
> finalize call)
> D) In other use cases, e.g. ANALYZE command that logs buffer usage,
> QueryInstrumentation allocated under TopMemoryContext (short-lived
> during command execution, explicitly freed up on abort or finalize
> call)
>
> And we could switch it instead to:
>
> A) InstrStackState allocated under TopMemoryContext (long-lived, never freed)
> B) QueryInstrumentation allocated under PortalContext (short-lived
> during query execution, *automatically* freed up on abort, manually on
> ExecutorEnd to avoid waiting for holdable cursors to free
> PortalContext)
> C) NodeInstrumentation allocated under PortalContext (short-lived
> during query execution, *automatically* freed up on abort, manually on
> ExecutorEnd to avoid waiting for holdable cursors to free
> PortalContext)
> D) In other use cases, e.g. ANALYZE command that logs buffer usage,
> QueryInstrumentation allocated under CurrentMemoryContext (short-lived
> during command execution, *automatically* freed up on abort and
> success case)
>
> However, this goes against the principle noted by Heikki over in [0]
> that ResOwners should use TopMemoryContext to avoid relying on the
> ordering of clean up operations.
I've pondered this question more today, and I think maybe this
complexity isn't the right way to approach this.
Instead I've tried introducing a memory context for instrumentation
managed as a resource owner, and I am (for now) convinced that
this is the right trade-off for the problem at hand.
The benefit of using our own memory context is that we can free it all
at once (which is a lot less brittle when different types of
instrumentation are involved), *and* we can re-assign the context
parent to be that of the current context on finalize, cleanly moving
it out of TopMemoryContext without doing a copy. It also makes it
easier for callers to allocate in the right context, without having to
introduce a bunch more "Alloc" methods (e.g. relevant for the table
stack tracking for index scans). We also have precedent for the use
of small memory contexts in the executor with the existence of
per-tuple memory contexts.
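To illustrate the "free all at once" and "re-parent without a copy"
properties, here is a self-contained toy context tree. It is a sketch under
stated assumptions, not the patch's code and not PostgreSQL's allocator: all
toy_* names are hypothetical. In the real tree the corresponding operations
would be MemoryContextSetParent and MemoryContextDelete.

```c
#include <assert.h>
#include <stdlib.h>

/* One allocation owned by a context; the payload is just a counter. */
typedef struct ToyChunk
{
    struct ToyChunk *next;
    long        payload;
} ToyChunk;

/* Toy memory context: a tree node that owns a list of chunks. */
typedef struct ToyContext
{
    struct ToyContext *parent;
    struct ToyContext *first_child;
    struct ToyContext *next_sibling;
    ToyChunk   *chunks;
} ToyContext;

static void
toy_link(ToyContext *cxt, ToyContext *parent)
{
    cxt->parent = parent;
    cxt->next_sibling = parent->first_child;
    parent->first_child = cxt;
}

static void
toy_unlink(ToyContext *cxt)
{
    ToyContext **p;

    if (cxt->parent == NULL)
        return;
    for (p = &cxt->parent->first_child; *p != cxt; p = &(*p)->next_sibling)
        ;
    *p = cxt->next_sibling;
    cxt->parent = NULL;
    cxt->next_sibling = NULL;
}

static ToyContext *
toy_context_create(ToyContext *parent)
{
    ToyContext *cxt = calloc(1, sizeof(ToyContext));

    assert(cxt != NULL);
    if (parent != NULL)
        toy_link(cxt, parent);
    return cxt;
}

/* Allocate a counter that lives exactly as long as its context. */
static long *
toy_alloc_counter(ToyContext *cxt)
{
    ToyChunk   *chunk = calloc(1, sizeof(ToyChunk));

    assert(chunk != NULL);
    chunk->next = cxt->chunks;
    cxt->chunks = chunk;
    return &chunk->payload;
}

/* Move a context under a new parent; its chunks are not touched. */
static void
toy_set_parent(ToyContext *cxt, ToyContext *new_parent)
{
    toy_unlink(cxt);
    toy_link(cxt, new_parent);
}

/* Free a context, its chunks, and all of its children in one call. */
static void
toy_context_delete(ToyContext *cxt)
{
    while (cxt->first_child != NULL)
        toy_context_delete(cxt->first_child);
    toy_unlink(cxt);
    while (cxt->chunks != NULL)
    {
        ToyChunk   *next = cxt->chunks->next;

        free(cxt->chunks);
        cxt->chunks = next;
    }
    free(cxt);
}

/* "Finalize" by re-parenting: the caller keeps using the same pointer. */
static long
toy_demo_reparent(void)
{
    ToyContext *top = toy_context_create(NULL);
    ToyContext *caller = toy_context_create(top);
    ToyContext *instr = toy_context_create(top);
    long       *hits = toy_alloc_counter(instr);
    long        result;

    *hits = 42;
    toy_set_parent(instr, caller);
    assert(instr->parent == caller && top->first_child == caller);
    result = *hits;
    toy_context_delete(top);    /* frees caller, instr and the counter */
    return result;
}
```

The point of the demo is that the counter allocated in the instrumentation
context stays valid across the re-parenting, so nothing has to be copied for
the caller, and a single delete on abort releases everything.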
The main downside is that for the cases where we don't have child
instrumentation, but want the resource owner logic (e.g. ANALYZE
command, or regular query execution with pg_stat_statements enabled),
we have more memory overhead: 1kB (ALLOCSET_SMALL_SIZES minimum) for
what could otherwise be ~200B. I think that's probably okay for
current use cases, but we could avoid that by only using the separate
contexts when we have child instrumentations that will be tracked.
See attached v10, rebased, with these additional changes:
In 0001/0002 I've added forward declarations in execnodes.h, which are
necessary since fba4233c8328.
In 0005 (stack-based instrumentation) I've also addressed the
previously raised concerns about trigger and EXPLAIN (SERIALIZE)
handling, and it now treats both kinds as children of the query's
instrumentation context. To assist with initializing that, we have to
add a query instrumentation reference to EState, but I think that's
acceptable. To reduce code churn I've repurposed the existing
es_instrument field for that, and we now remember the instrumentation
options on QueryInstrumentation.
In 0007 (Optimize ExecProcNodeInstr instructions by inlining) I've
adjusted the ExecProcNodeInstr logic to use a single function that
contains the logic, with separate callers that pass in fixed
constants, to let the compiler figure out the different variants with
less code duplication, per an off-list suggestion from Andres.
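As a rough illustration of that pattern (a sketch only; the toy_* names and
the two flags are made up and do not correspond to the actual
ExecProcNodeInstr variants), the shared logic lives in one inline function
whose flag arguments are compile-time constants at every call site, so the
compiler can drop the untaken branches in each wrapper:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-node counters. */
typedef struct ToyNodeStats
{
    long        calls;
    long        timed_calls;
    long        rows;
} ToyNodeStats;

/*
 * Single copy of the logic.  Each wrapper passes literal true/false, so
 * after inlining the 'if' tests below can be folded away per variant.
 * A real implementation might force inlining with a compiler attribute.
 */
static inline long
toy_exec_common(ToyNodeStats *stats, long input,
                const bool need_timer, const bool count_rows)
{
    long        result = input * 2; /* stand-in for the node's real work */

    stats->calls++;
    if (need_timer)
        stats->timed_calls++;
    if (count_rows)
        stats->rows++;
    return result;
}

/* Variant with timing enabled. */
static long
toy_exec_timed(ToyNodeStats *stats, long input)
{
    return toy_exec_common(stats, input, true, true);
}

/* Variant without timing. */
static long
toy_exec_untimed(ToyNodeStats *stats, long input)
{
    return toy_exec_common(stats, input, false, true);
}

typedef long (*ToyExecFn) (ToyNodeStats *stats, long input);

/* Pick the variant once, up front, instead of branching on every call. */
static ToyExecFn
toy_choose_exec(bool need_timer)
{
    return need_timer ? toy_exec_timed : toy_exec_untimed;
}

/* Run one untimed and one timed call and return the summed results. */
static long
toy_demo_variants(void)
{
    ToyNodeStats stats = {0, 0, 0};
    ToyExecFn   fn;
    long        sum = 0;

    fn = toy_choose_exec(false);
    sum += fn(&stats, 5);
    fn = toy_choose_exec(true);
    sum += fn(&stats, 7);
    assert(stats.calls == 2 && stats.timed_calls == 1 && stats.rows == 2);
    return sum;
}
```

The design trade-off is that the logic is written once while the generated
code can still have one specialized body per flag combination.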
In 0008 (Index scans: Show table buffer accesses) this now utilizes
the fact that f026fbf059f2 made IndexScanInstrumentation a heap
allocation, and puts that allocation in the instrumentation memory
context, so it can participate directly in the stack with an inlined
Instrumentation field to track table access, avoiding a duplicate
field previously necessary.
Thanks,
Lukas
--
Lukas Fittl
Attachments:
[application/octet-stream] v10-0005-Optimize-measuring-WAL-buffer-usage-through-stac.patch (85.9K, 2-v10-0005-Optimize-measuring-WAL-buffer-usage-through-stac.patch)
From c72bf29037194ff0ce24b243edf03fce7d35c292 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v10 5/9] Optimize measuring WAL/buffer usage through
stack-based instrumentation
Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that get
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode, and popped at InstrStopNode. Stack entries are zero
initialized (avoiding the diff mechanism) and get added to their parent
entry when they are finalized, i.e. no more modifications can occur.
To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct, and its
associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.
This also drops the global pgBufferUsage; any callers interested in
measuring buffer activity should instead utilize InstrStart/InstrStop.
The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and heap_page_prune_and_freeze for
measuring FPIs.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
contrib/auto_explain/auto_explain.c | 16 +-
.../pg_stat_statements/pg_stat_statements.c | 87 +---
src/backend/access/brin/brin.c | 10 +-
src/backend/access/gin/gininsert.c | 10 +-
src/backend/access/heap/vacuumlazy.c | 15 +-
src/backend/access/nbtree/nbtsort.c | 10 +-
src/backend/commands/analyze.c | 31 +-
src/backend/commands/explain.c | 43 +-
src/backend/commands/explain_dr.c | 57 ++-
src/backend/commands/prepare.c | 27 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/trigger.c | 17 +-
src/backend/commands/vacuumparallel.c | 10 +-
src/backend/executor/execMain.c | 83 +++-
src/backend/executor/execParallel.c | 32 +-
src/backend/executor/execPartition.c | 2 +-
src/backend/executor/execProcnode.c | 84 +++-
src/backend/executor/execUtils.c | 2 +-
src/backend/executor/instrument.c | 447 ++++++++++++++----
src/backend/replication/logical/worker.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 6 +-
src/backend/utils/activity/pgstat_io.c | 6 +-
src/include/commands/explain_dr.h | 5 +-
src/include/executor/execdesc.h | 4 +-
src/include/executor/executor.h | 5 +-
src/include/executor/instrument.h | 198 +++++++-
src/include/nodes/execnodes.h | 3 +-
src/include/utils/resowner.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
29 files changed, 861 insertions(+), 356 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 39bf2543b70..4be81489ff4 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -305,19 +305,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
if (auto_explain_enabled())
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -382,7 +372,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
/* Log plan if duration is exceeded. */
- msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
+ msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total);
if (msec >= auto_explain_log_min_duration)
{
ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 3e79108846e..9856dec3a5f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -910,22 +910,11 @@ pgss_planner(Query *parse,
&& pgss_track_planning && query_string
&& parse->queryId != INT64CONST(0))
{
- instr_time start;
- instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
-
- /*
- * Similarly the planner could write some WAL records in some cases
- * (e.g. setting a hint bit with those being WAL-logged)
- */
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ /* Track time and buffer/WAL usage as the planner can access them. */
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -939,30 +928,20 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
parse->stmt_len,
PGSS_PLAN,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
0,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
@@ -1014,19 +993,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
*/
if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -1088,10 +1057,10 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
queryDesc->plannedstmt->stmt_location,
queryDesc->plannedstmt->stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
+ INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total),
queryDesc->estate->es_total_processed,
- &queryDesc->totaltime->bufusage,
- &queryDesc->totaltime->walusage,
+ &queryDesc->totaltime->instr.bufusage,
+ &queryDesc->totaltime->instr.walusage,
queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
NULL,
queryDesc->estate->es_parallel_workers_to_launch,
@@ -1155,17 +1124,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
!IsA(parsetree, ExecuteStmt) &&
!IsA(parsetree, PrepareStmt))
{
- instr_time start;
- instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -1181,6 +1144,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
@@ -1195,9 +1159,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
* former value, which'd otherwise be a good idea.
*/
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
/*
* Track the total number of rows retrieved or affected by the utility
* statements of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED
@@ -1209,23 +1170,15 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
saved_stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
rows,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2a0f8c8e3b8..1ceb2306954 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2886,6 +2886,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2935,7 +2936,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2950,7 +2951,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e54782d9dd8..04cd53916ca 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -991,8 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2117,6 +2117,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2185,7 +2186,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2200,7 +2201,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1a446050d85..a7db507fa0d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -643,8 +643,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ QueryInstrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -660,6 +659,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -990,14 +991,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrQueryStopFinalize(instr);
+
if (verbose || params.log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_vacuum_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -1006,12 +1007,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
+ BufferUsage bufferusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
total_blks_hit = bufferusage.shared_blks_hit +
bufferusage.local_blks_hit;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 47a9bda30c9..6a261c8dcbd 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1466,8 +1466,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -1753,6 +1753,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1828,7 +1829,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1838,7 +1839,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index eeed91be266..c21b2019eab 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -309,9 +309,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ QueryInstrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -362,6 +360,9 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -742,12 +743,13 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrQueryStopFinalize(instr);
+
if (verbose || params.log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_analyze_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -755,18 +757,15 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
-
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ BufferUsage bufusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
+
+ total_blks_hit = bufusage.shared_blks_hit +
+ bufusage.local_blks_hit;
+ total_blks_read = bufusage.shared_blks_read +
+ bufusage.local_blks_read;
+ total_blks_dirtied = bufusage.shared_blks_dirtied +
+ bufusage.local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e73dc129132..dc5e63955bc 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -324,14 +324,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ QueryInstrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instr = InstrQueryAlloc(INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+ else
+ instr = InstrQueryAlloc(INSTRUMENT_TIMER);
+
if (es->memory)
{
/*
@@ -348,15 +350,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrQueryStart(instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrQueryStopFinalize(instr);
if (es->memory)
{
@@ -364,16 +363,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
@@ -590,7 +582,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* grab serialization metrics before we destroy the DestReceiver */
if (es->serialize != EXPLAIN_SERIALIZE_NONE)
- serializeMetrics = GetSerializationMetrics(dest);
+ {
+ SerializeMetrics *metrics = GetSerializationMetrics(dest);
+
+ if (metrics)
+ memcpy(&serializeMetrics, metrics, sizeof(SerializeMetrics));
+ }
/* call the DestReceiver's destroy method even during explain */
dest->rDestroy(dest);
@@ -1019,7 +1016,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
ExplainIndentText(es);
if (es->timing)
appendStringInfo(es->str, "Serialization: time=%.3f ms output=" UINT64_FORMAT "kB format=%s\n",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
else
@@ -1027,10 +1024,10 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
- if (es->buffers && peek_buffer_usage(es, &metrics->bufferUsage))
+ if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
es->indent--;
}
}
@@ -1038,13 +1035,13 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
{
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
3, es);
ExplainPropertyUInteger("Output Volume", "kB",
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 3c96061cf32..e1fc723c758 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,11 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ Instrumentation *instr = &myState->metrics.instr;
/* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ if (instr->need_timer || instr->need_stack)
+ InstrStart(instr);
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +182,9 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
- /* Update timing data */
- if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
-
- /* Update buffer metrics */
- if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ /* Stop per-tuple measurement */
+ if (instr->need_timer || instr->need_stack)
+ InstrStop(instr);
return true;
}
@@ -233,9 +220,17 @@ serializeAnalyzeStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
/* The output buffer is re-used across rows, as in printtup.c */
initStringInfo(&receiver->buf);
- /* Initialize results counters */
+ /* Initialize metrics and per-tuple instrumentation */
memset(&receiver->metrics, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(receiver->metrics.timeSpent);
+ {
+ int instrument_options = 0;
+
+ if (receiver->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (receiver->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ InstrInitOptions(&receiver->metrics.instr, instrument_options);
+ }
}
/*
@@ -246,6 +241,8 @@ serializeAnalyzeShutdown(DestReceiver *self)
{
SerializeDestReceiver *receiver = (SerializeDestReceiver *) self;
+ InstrFinalizeChild(&receiver->metrics.instr, instr_stack.current);
+
if (receiver->finfos)
pfree(receiver->finfos);
receiver->finfos = NULL;
@@ -296,16 +293,18 @@ CreateExplainSerializeDestReceiver(ExplainState *es)
* receiver if the subject statement is CREATE TABLE AS. In that
* case, return all-zeroes stats.
*/
-SerializeMetrics
+/*
+ * GetSerializationMetrics - get serialization metrics
+ *
+ * Returns a pointer to the SerializeMetrics inside the dest receiver,
+ * or NULL if the receiver is not a SerializeDestReceiver (e.g. an IntoRel
+ * receiver for CREATE TABLE AS).
+ */
+SerializeMetrics *
GetSerializationMetrics(DestReceiver *dest)
{
- SerializeMetrics empty;
-
if (dest->mydest == DestExplainSerialize)
- return ((SerializeDestReceiver *) dest)->metrics;
-
- memset(&empty, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(empty.timeSpent);
+ return &((SerializeDestReceiver *) dest)->metrics;
- return empty;
+ return NULL;
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 876aad2100a..f7e158e4dd9 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -580,13 +580,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ QueryInstrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ int instrument_options = INSTRUMENT_TIMER;
+
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -598,9 +601,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrQueryStart(instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -635,8 +636,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrQueryStopFinalize(instr);
if (es->memory)
{
@@ -644,13 +644,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -660,7 +653,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c69c12dc014..90ac5ccaacd 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2139,7 +2139,7 @@ ExecuteTruncateGuts(List *explicit_rels,
rel,
0, /* dummy rangetable index */
NULL,
- 0);
+ NULL);
estate->es_opened_result_relations =
lappend(estate->es_opened_result_relations, resultRelInfo);
resultRelInfo++;
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 29b80d75143..f2597b917e1 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -93,6 +93,7 @@ static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
TriggerInstrumentation *instr,
+ QueryInstrumentation *qinstr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2312,6 +2313,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
TriggerInstrumentation *instr,
+ QueryInstrumentation *qinstr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2346,7 +2348,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartTrigger(instr + tgindx);
+ InstrStartTrigger(qinstr, instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2441,6 +2443,7 @@ ExecBSInsertTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -2502,6 +2505,7 @@ ExecBRInsertTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2606,6 +2610,7 @@ ExecIRInsertTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2670,6 +2675,7 @@ ExecBSDeleteTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -2780,6 +2786,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2884,6 +2891,7 @@ ExecIRDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (rettuple == NULL)
return false; /* Delete was suppressed */
@@ -2942,6 +2950,7 @@ ExecBSUpdateTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -3094,6 +3103,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
@@ -3258,6 +3268,7 @@ ExecIRUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -3316,6 +3327,7 @@ ExecBSTruncateTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -4373,7 +4385,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartTrigger(instr + tgindx);
+ InstrStartTrigger(estate->es_instrument, instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4561,6 +4573,7 @@ AfterTriggerExecute(EState *estate,
tgindx,
finfo,
NULL,
+ NULL,
per_tuple_context);
if (rettuple != NULL &&
rettuple != LocTriggerData.tg_trigtuple &&
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 77834b96a21..c330c891c03 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -308,8 +308,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgBufferUsage or
- * pgWalUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1006,6 +1006,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1095,7 +1096,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1103,7 +1104,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 1b950040597..5366c1e801c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -78,6 +78,7 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
/* decls for local routines only used within this module */
static void InitPlan(QueryDesc *queryDesc, int eflags);
static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecFinalizeTriggerInstrumentation(EState *estate);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(QueryDesc *queryDesc,
@@ -247,9 +248,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
estate->es_top_eflags = eflags;
- estate->es_instrument = queryDesc->instrument_options;
estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
+ /*
+ * Set up query-level instrumentation if needed. We do this before
+ * InitPlan so that node and trigger instrumentation can be allocated
+ * within the query's dedicated instrumentation memory context.
+ */
+ if (!queryDesc->totaltime && queryDesc->instrument_options)
+ {
+ queryDesc->totaltime = InstrQueryAlloc(queryDesc->instrument_options);
+ estate->es_instrument = queryDesc->totaltime;
+ }
+
/*
* Set up an AFTER-trigger statement context, unless told not to, or
* unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
@@ -331,9 +342,21 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
- /* Allow instrumentation of Executor overall runtime */
+ /* Start up instrumentation for this execution run */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ {
+ InstrQueryStart(queryDesc->totaltime);
+
+ /*
+ * Remember all node entries for abort recovery. We do this once here
+ * after InstrQueryStart has pushed the parent stack entry.
+ */
+ if (estate->es_instrument &&
+ estate->es_instrument->instr.need_stack &&
+ !queryDesc->already_executed)
+ ExecRememberNodeInstrumentation(queryDesc->planstate,
+ queryDesc->totaltime);
+ }
/*
* extract information from the query descriptor and the query feature.
@@ -385,7 +408,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -435,7 +458,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ InstrQueryStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -444,8 +467,26 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
+ /*
+ * Accumulate per-node and trigger statistics to their respective parent
+ * instrumentation stacks.
+ *
+ * We skip this in parallel workers because their per-node stats are
+ * reported individually via ExecParallelReportInstrumentation, and the
+ * leader's own ExecFinalizeNodeInstrumentation handles propagation. If
+ * we accumulated here, the leader would double-count: worker parent nodes
+ * would already include their children's stats, and then the leader's
+ * accumulation would add the children again.
+ */
+ if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStopFinalize(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
@@ -1263,7 +1304,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options)
+ QueryInstrumentation *qinstr)
{
MemSet(resultRelInfo, 0, sizeof(ResultRelInfo));
resultRelInfo->type = T_ResultRelInfo;
@@ -1284,8 +1325,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
palloc0_array(FmgrInfo, n);
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
- if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
+ if (qinstr)
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, n);
}
else
{
@@ -1499,6 +1540,30 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
return resultRelInfo->ri_ancestorResultRels;
}
+static void
+ExecFinalizeTriggerInstrumentation(EState *estate)
+{
+ List *rels = NIL;
+
+ rels = list_concat(rels, estate->es_tuple_routing_result_relations);
+ rels = list_concat(rels, estate->es_opened_result_relations);
+ rels = list_concat(rels, estate->es_trig_target_relations);
+
+ foreach_node(ResultRelInfo, rInfo, rels)
+ {
+ TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
+
+ if (ti == NULL || rInfo->ri_TrigDesc == NULL)
+ continue;
+
+ for (int nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
+ {
+ if (ti[nt].instr.need_stack)
+ InstrAccumStack(&estate->es_instrument->instr, &ti[nt].instr);
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecPostprocessPlan
*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index c153d5c1c3b..0b18a05c434 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -694,7 +694,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
- * looking at pgBufferUsage, so do it unconditionally.
+ * looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -819,13 +819,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
- instrumentation->instrument_options = estate->es_instrument;
+ instrumentation->instrument_options = estate->es_instrument->instrument_options;
instrumentation->instrument_offset = instrument_offset;
instrumentation->num_workers = nworkers;
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInitNode(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument->instrument_options);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1075,14 +1075,28 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
instrument = GetInstrumentationArray(instrumentation);
instrument += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
+ {
InstrAggNode(planstate->instrument, &instrument[n]);
+ /*
+ * Also add worker WAL usage to the global pgWalUsage counter.
+ *
+ * When per-node instrumentation is active, parallel workers skip
+ * ExecFinalizeNodeInstrumentation (to avoid double-counting in
+ * EXPLAIN), so per-node WAL activity is not rolled up into the
+ * query-level stats that InstrAccumParallelQuery receives. Without
+ * this, pgWalUsage would under-report WAL generated by parallel
+ * workers when instrumentation is active.
+ */
+ WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
+ }
+
/*
* Also store the per-worker detail.
*
- * Worker instrumentation should be allocated in the same context as the
- * regular instrumentation information, which is the per-query context.
- * Switch into per-query memory context.
+ * Ensure worker instrumentation is allocated in the per-query context. We
+ * don't need to place this in the instrumentation context since no more
+ * stack-based instrumentation work is being done.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
@@ -1456,6 +1470,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1516,7 +1531,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1532,7 +1547,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d96d4f9947b..6f2909a1bc3 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1381,7 +1381,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
{
ResultRelInfo *rri = makeNode(ResultRelInfo);
- InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
+ InitResultRelInfo(rri, rel, 0, rootResultRelInfo, NULL);
proute->nonleaf_partitions[dispatchidx] = rri;
}
else
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 132fe37ef60..21ad1b04a57 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -123,6 +123,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
+static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -788,10 +790,10 @@ ExecShutdownNode_walker(PlanState *node, void *context)
* at least once already. We don't expect much CPU consumption during
* node shutdown, but in the case of Gather or Gather Merge, we may shut
* down workers at this stage. If so, their buffer usage will get
- * propagated into pgBufferUsage at this point, and we want to make sure
- * that it gets associated with the Gather node. We skip this if the node
- * has never been executed, so as to avoid incorrectly making it appear
- * that it has.
+ * propagated into the current instrumentation stack entry at this point,
+ * and we want to make sure that it gets associated with the Gather node.
+ * We skip this if the node has never been executed, so as to avoid
+ * incorrectly making it appear that it has.
*/
if (node->instrument && node->instrument->running)
InstrStartNode(node->instrument);
@@ -829,6 +831,80 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecRememberNodeInstrumentation
+ *
+ * Register all per-node instrumentation entries as unfinalized children of
+ * the executor's instrumentation. This is needed for abort recovery: if the
+ * executor aborts, we need to walk each per-node entry to recover buffer/WAL
+ * data from nodes that never got finalized, so that aggregate totals remain
+ * accurate for anyone consuming them.
+ */
+void
+ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent)
+{
+ (void) ExecRememberNodeInstrumentation_walker(node, parent);
+}
+
+static bool
+ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ QueryInstrumentation *parent = (QueryInstrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ if (node->instrument)
+ InstrQueryRememberChild(parent, &node->instrument->instr);
+
+ return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
+}
+
+/*
+ * ExecFinalizeNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation).
+ *
+ * This must run after the cleanup done by ExecShutdownNode, and not rely on any
+ * resources cleaned up by it. We also expect shutdown actions to have occurred,
+ * e.g. parallel worker instrumentation to have been added to the leader.
+ */
+void
+ExecFinalizeNodeInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeNodeInstrumentation_walker(node, instr_stack.current);
+}
+
+static bool
+ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ Instrumentation *parent = (Instrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ /*
+ * Recurse into children first (bottom-up accumulation), passing our
+ * instrumentation as the parent context. This ensures children can
+ * accumulate to us even if they were never executed by the leader (e.g.
+ * nodes beneath Gather that only workers ran).
+ */
+ planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
+ node->instrument ? &node->instrument->instr : parent);
+
+ if (!node->instrument)
+ return false;
+
+ InstrFinalizeChild(&node->instrument->instr, parent);
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
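
[Editor's sketch, not part of the patch] The bottom-up accumulation done by ExecFinalizeNodeInstrumentation_walker above can be illustrated with a minimal standalone model. Everything here is hypothetical: the Node struct, the single blks_hit counter and finalize_tree() only stand in for PlanState, BufferUsage/WalUsage and the real walker. The point it shows is that a node without instrumentation passes its parent's accumulator through to its children, so nothing is lost.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a PlanState with optional instrumentation. */
typedef struct Node
{
	int			has_instr;	/* stands in for node->instrument != NULL */
	long		blks_hit;	/* stands in for the node's buffer/WAL totals */
	struct Node *left;
	struct Node *right;
} Node;

/*
 * Children are finalized first and accumulate into this node (or, if this
 * node is not instrumented, into the accumulator we were handed). Only then
 * does this node add its own, now complete, total to its parent.
 */
static void
finalize_tree(Node *node, long *parent_total)
{
	long	   *my_total;

	if (node == NULL)
		return;

	my_total = node->has_instr ? &node->blks_hit : parent_total;

	finalize_tree(node->left, my_total);
	finalize_tree(node->right, my_total);

	if (node->has_instr)
		*parent_total += node->blks_hit;
}
```

With an instrumented root that recorded 1 hit, an uninstrumented middle node, and two instrumented leaves that recorded 2 and 3 hits, the root ends up with 6 and so does the accumulator passed in for the query.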
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 9886ab06b69..c20dc50f6fd 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -150,7 +150,7 @@ CreateExecutorState(void)
estate->es_total_processed = 0;
estate->es_top_eflags = 0;
- estate->es_instrument = 0;
+ estate->es_instrument = NULL;
estate->es_finished = false;
estate->es_exprcontexts = NIL;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6a4a08ebb0c..6892706a83a 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,30 +16,46 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
-BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
-static WalUsage save_pgWalUsage;
+Instrumentation instr_top;
+InstrStackState instr_stack = {0, 0, NULL, &instr_top};
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
-static void WalUsageAdd(WalUsage *dst, WalUsage *add);
+void
+InstrStackGrow(void)
+{
+ int space = instr_stack.stack_space;
+
+ if (instr_stack.entries == NULL)
+ {
+ space = 10; /* Allocate sufficient initial space for
+ * typical activity */
+ instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
+ sizeof(Instrumentation *) * space);
+ }
+ else
+ {
+ space *= 2;
+ instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, space);
+ }
+ /* Update stack space after allocation succeeded to protect against OOMs */
+ instr_stack.stack_space = space;
+}
/* General purpose instrumentation handling */
-Instrumentation *
-InstrAlloc(int instrument_options)
+static inline bool
+InstrNeedStack(int instrument_options)
{
- Instrumentation *instr = palloc0(sizeof(Instrumentation));
- InstrInitOptions(instr, instrument_options);
- return instr;
+ return (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) != 0;
}
void
InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
+ instr->need_stack = InstrNeedStack(instrument_options);
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
@@ -54,49 +70,295 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
+static void
+InstrStopTimer(Instrumentation *instr)
+{
+ instr_time endtime;
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ /* let's update the time only if the timer was requested */
+ if (INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStop called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
}
void
InstrStop(Instrumentation *instr)
{
- instr_time endtime;
+ if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ if (instr->need_stack)
+ InstrPopStack(instr);
+}
+
+/*
+ * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
+ *
+ * Note that this intentionally allows passing a stack entry that is not the
+ * current top, as can happen with PG_FINALLY or resource owners, which don't
+ * have a guaranteed cleanup order.
+ *
+ * We are careful here to achieve two goals:
+ *
+ * 1) Reset the stack to the parent of whichever of the released stack entries
+ * has the lowest index
+ * 2) Accumulate all instrumentation to the currently active instrumentation,
+ * so that callers get a complete picture of activity, even after an abort
+ */
+void
+InstrStopFinalize(Instrumentation *instr)
+{
+ int idx = -1;
+
+ for (int i = instr_stack.stack_size - 1; i >= 0; i--)
+ {
+ if (instr_stack.entries[i] == instr)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ if (idx >= 0)
+ {
+ while (instr_stack.stack_size > idx + 1)
+ instr_stack.stack_size--;
+
+ InstrPopStack(instr);
+ }
- /* let's update the time only if the timer was requested */
if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ InstrAccumStack(instr_stack.current, instr);
+}
+
+/*
+ * Finalize child instrumentation by accumulating buffer/WAL usage to the
+ * provided instrumentation, which may be the current entry, or one the caller
+ * treats as a parent and will add to the totals later.
+ *
+ * Also deletes the unfinalized entry to avoid double counting in an abort
+ * situation, e.g. during executor finish.
+ */
+void
+InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent)
+{
+ if (instr->need_stack)
{
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ if (!dlist_node_is_detached(&instr->unfinalized_entry))
+ dlist_delete_thoroughly(&instr->unfinalized_entry);
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+ InstrAccumStack(parent, instr);
+ }
+}
+
+
+/* Query instrumentation handling */
+
+/*
+ * Use ResourceOwner mechanism to correctly reset instr_stack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+ .release_priority = RELEASE_PRIO_INSTRUMENTATION,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ QueryInstrumentation *qinstr = (QueryInstrumentation *) DatumGetPointer(res);
+ MemoryContext instr_cxt = qinstr->instr_cxt;
+ dlist_mutable_iter iter;
- INSTR_TIME_SET_ZERO(instr->starttime);
+ /* Accumulate data from all unfinalized child entries (nodes, triggers) */
+ dlist_foreach_modify(iter, &qinstr->unfinalized_entries)
+ {
+ Instrumentation *child = dlist_container(Instrumentation, unfinalized_entry, iter.cur);
+
+ InstrAccumStack(&qinstr->instr, child);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
+ /* Ensure the stack is reset as expected, and we accumulate to the parent */
+ InstrStopFinalize(&qinstr->instr);
+
+ /*
+ * Destroy the dedicated instrumentation context, which frees the
+ * QueryInstrumentation and all child allocations.
+ */
+ MemoryContextDelete(instr_cxt);
+}
+
+QueryInstrumentation *
+InstrQueryAlloc(int instrument_options)
+{
+ QueryInstrumentation *instr;
+ MemoryContext instr_cxt;
+
+ /*
+ * When the instrumentation stack is used, create a dedicated memory
+ * context for this query's instrumentation allocations. This context is a
+ * child of TopMemoryContext so it survives transaction abort —
+ * ResourceOwner release needs to access it.
+ *
+ * For simpler cases (timer/rows only), use the current memory context.
+ *
+ * All child instrumentation allocations (nodes, triggers, etc) must be
+ * allocated within this context to ensure correct clean up on abort.
+ */
+ if (InstrNeedStack(instrument_options))
+ instr_cxt = AllocSetContextCreate(TopMemoryContext,
+ "Instrumentation",
+ ALLOCSET_SMALL_SIZES);
+ else
+ instr_cxt = CurrentMemoryContext;
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ instr = MemoryContextAllocZero(instr_cxt, sizeof(QueryInstrumentation));
+ instr->instrument_options = instrument_options;
+ instr->instr_cxt = instr_cxt;
+
+ InstrInitOptions(&instr->instr, instrument_options);
+ dlist_init(&instr->unfinalized_entries);
+
+ return instr;
+}
+
+void
+InstrQueryStart(QueryInstrumentation *qinstr)
+{
+ InstrStart(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(CurrentResourceOwner != NULL);
+ qinstr->owner = CurrentResourceOwner;
+
+ ResourceOwnerEnlarge(qinstr->owner);
+ ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
+ }
+}
+
+void
+InstrQueryStop(QueryInstrumentation *qinstr)
+{
+ InstrStop(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+ }
+}
+
+void
+InstrQueryStopFinalize(QueryInstrumentation *qinstr)
+{
+ InstrStopFinalize(&qinstr->instr);
+
+ if (!qinstr->instr.need_stack)
+ return;
+
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+
+ /*
+ * Reparent the dedicated instrumentation context under the current memory
+ * context, so that its lifetime is now tied to the caller's context
+ * rather than TopMemoryContext.
+ */
+ MemoryContextSetParent(qinstr->instr_cxt, CurrentMemoryContext);
+}
+
+/*
+ * Register a child Instrumentation entry for abort processing.
+ *
+ * On abort, ResOwnerReleaseInstrumentation will walk the parent's list to
+ * recover buffer/WAL data from entries that were never finalized, in order for
+ * aggregate totals to be accurate despite the query erroring out.
+ */
+void
+InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *child)
+{
+ if (child->need_stack)
+ dlist_push_head(&parent->unfinalized_entries, &child->unfinalized_entry);
+}
+
+/* start instrumentation during parallel executor startup */
+QueryInstrumentation *
+InstrStartParallelQuery(void)
+{
+ QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrQueryStart(qinstr);
+ return qinstr;
+}
+
+/* report usage after parallel executor shutdown */
+void
+InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+{
+ InstrQueryStopFinalize(qinstr);
+ memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+}
+
+/*
+ * Accumulate work done by parallel workers in the leader's stats.
+ *
+ * Note that what gets added here effectively depends on whether per-node
+ * instrumentation is active. If it's active the parallel worker intentionally
+ * skips ExecFinalizeNodeInstrumentation on executor shutdown, because it would
+ * cause double counting. Instead, this only accumulates any extra activity
+ * outside of nodes.
+ *
+ * Otherwise this is responsible for making sure that the complete query
+ * activity is accumulated.
+ */
+void
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+{
+ BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
+ WalUsageAdd(&instr_stack.current->walusage, walusage);
+
+ WalUsageAdd(&pgWalUsage, walusage);
}
/* Node instrumentation handling */
/* Allocate new node instrumentation structure */
NodeInstrumentation *
-InstrAllocNode(int instrument_options, bool async_mode)
+InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode)
{
- NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+ NodeInstrumentation *instr = MemoryContextAlloc(qinstr->instr_cxt, sizeof(NodeInstrumentation));
- InstrInitNode(instr, instrument_options);
+ InstrInitNode(instr, qinstr->instrument_options);
instr->async_mode = async_mode;
return instr;
@@ -117,6 +379,7 @@ InstrStartNode(NodeInstrumentation *instr)
InstrStart(&instr->instr);
}
+
/* Exit from a plan node */
void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
@@ -146,14 +409,12 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->instr.starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->instr.need_bufusage)
- BufferUsageAccumDiff(&instr->instr.bufusage,
- &pgBufferUsage, &instr->instr.bufusage_start);
-
- if (instr->instr.need_walusage)
- WalUsageAccumDiff(&instr->instr.walusage,
- &pgWalUsage, &instr->instr.walusage_start);
+ /*
+ * Only pop the stack, accumulation runs in
+ * ExecFinalizeNodeInstrumentation
+ */
+ if (instr->instr.need_stack)
+ InstrPopStack(&instr->instr);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -188,8 +449,8 @@ InstrEndLoop(NodeInstrumentation *instr)
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrEndLoop called on running node");
+ /* Ensure InstrStopNode was called */
+ Assert(INSTR_TIME_IS_ZERO(instr->instr.starttime));
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
@@ -230,67 +491,73 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->instr.need_bufusage)
- BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
-
- if (dst->instr.need_walusage)
- WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
+ if (dst->instr.need_stack)
+ InstrAccumStack(&dst->instr, &add->instr);
}
/* Trigger instrumentation handling */
TriggerInstrumentation *
-InstrAllocTrigger(int n, int instrument_options)
+InstrAllocTrigger(QueryInstrumentation *qinstr, int n)
{
- TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ TriggerInstrumentation *tginstr;
int i;
+ /*
+ * Allocate in the query's dedicated instrumentation context so all
+ * instrumentation data is grouped together and cleaned up as a unit.
+ */
+ Assert(qinstr != NULL && qinstr->instr_cxt != NULL);
+ tginstr = MemoryContextAllocZero(qinstr->instr_cxt,
+ n * sizeof(TriggerInstrumentation));
+
for (i = 0; i < n; i++)
- InstrInitOptions(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, qinstr->instrument_options);
return tginstr;
}
void
-InstrStartTrigger(TriggerInstrumentation *tginstr)
+InstrStartTrigger(QueryInstrumentation *qinstr, TriggerInstrumentation *tginstr)
{
InstrStart(&tginstr->instr);
+
+ /*
+ * On first call, register with the parent QueryInstrumentation for abort
+ * recovery.
+ */
+ if (qinstr && tginstr->instr.need_stack &&
+ dlist_node_is_detached(&tginstr->instr.unfinalized_entry))
+ dlist_push_head(&qinstr->unfinalized_entries,
+ &tginstr->instr.unfinalized_entry);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
+ /*
+ * This trigger may be called again, so we don't finalize instrumentation
+ * here. Accumulation to the parent happens at ExecutorFinish through
+ * ExecFinalizeTriggerInstrumentation.
+ */
InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
-/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrAccumStack(Instrumentation *dst, Instrumentation *add)
{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
-}
+ Assert(dst != NULL);
+ Assert(add != NULL);
-/* report usage after parallel executor shutdown */
-void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
-}
+ if (!add->need_stack)
+ return;
-/* accumulate work done by workers in leader's stats */
-void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- BufferUsageAdd(&pgBufferUsage, bufusage);
- WalUsageAdd(&pgWalUsage, walusage);
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -311,39 +578,9 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
+/* dst += add */
void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
-/* helper functions for WAL usage accumulation */
-static void
-WalUsageAdd(WalUsage *dst, WalUsage *add)
+WalUsageAdd(WalUsage *dst, const WalUsage *add)
{
dst->wal_bytes += add->wal_bytes;
dst->wal_records += add->wal_records;
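
[Editor's sketch, not part of the patch] For readers skimming the instrument.c changes, the core idea (count into whatever entry is current, then pop and add to the parent on finalize) can be modelled in a few lines. This is a simplified sketch under stated assumptions: the Usage struct, the fixed-size array and all function names are hypothetical, the real code grows its array via InstrStackGrow, and the pop behaviour shown is my reading of the description rather than the patch's InstrPopStack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the patch's Instrumentation struct. */
typedef struct Usage
{
	long		blks_hit;	/* a single counter, for brevity */
} Usage;

#define MAX_DEPTH 16		/* fixed capacity; the patch grows dynamically */

static Usage usage_top;		/* session-level totals, like instr_top */
static Usage *entries[MAX_DEPTH];
static int	stack_size = 0;
static Usage *current = &usage_top;

/* Make 'u' the entry that receives all subsequent counter updates. */
static void
push_usage(Usage *u)
{
	assert(stack_size < MAX_DEPTH);
	entries[stack_size++] = u;
	current = u;
}

/* Restore the previous entry, or the top-level totals if the stack is empty. */
static void
pop_usage(Usage *u)
{
	assert(stack_size > 0 && entries[stack_size - 1] == u);
	stack_size--;
	current = stack_size > 0 ? entries[stack_size - 1] : &usage_top;
}

/* Pop, then add the finished entry's totals to whatever is now current. */
static void
stop_finalize(Usage *u)
{
	pop_usage(u);
	current->blks_hit += u->blks_hit;
}

/* What a buffer access would do: one increment, no diffing of global counters. */
static void
count_buffer_hit(void)
{
	current->blks_hit++;
}
```

The hot path is a single increment through the current pointer, which is where the saving over BufferUsageAccumDiff/WalUsageAccumDiff comes from. The cost moves to finalization, which runs once per entry instead of once per node cycle.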
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 2d7708805a6..f655035e213 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -903,7 +903,7 @@ create_edata_for_relation(LogicalRepRelMapEntry *rel)
* Use Relation opened by logicalrep_rel_open() instead of opening it
* again.
*/
- InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
+ InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, NULL);
/*
* We put the ResultRelInfo in the es_opened_result_relations list, even
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index dfa37e5ed44..41a0baa3449 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1269,9 +1269,9 @@ PinBufferForBlock(Relation rel,
if (rel)
{
/*
- * While pgBufferUsage's "read" counter isn't bumped unless we reach
- * WaitReadBuffers() (so, not for hits, and not for buffers that are
- * zeroed instead), the per-relation stats always count them.
+ * While the current buffer usage "read" counter isn't bumped unless
+ * we reach WaitReadBuffers() (so, not for hits, and not for buffers
+ * that are zeroed instead), the per-relation stats always count them.
*/
pgstat_count_buffer_read(rel);
if (*foundPtr)
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 9e7a88ec0d0..60400f0c81f 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -114,9 +114,9 @@ pgstat_prepare_io_time(bool track_io_guc)
* pg_stat_database only counts block read and write times, these are done for
* IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
*
- * pgBufferUsage is used for EXPLAIN. pgBufferUsage has write and read stats
- * for shared, local and temporary blocks. pg_stat_io does not track the
- * activity of temporary blocks, so these are ignored here.
+ * Executor instrumentation is used for EXPLAIN. Buffer usage tracked there has
+ * write and read stats for shared, local and temporary blocks. pg_stat_io
+ * does not track the activity of temporary blocks, so these are ignored here.
*/
void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
diff --git a/src/include/commands/explain_dr.h b/src/include/commands/explain_dr.h
index f98eaae1864..fa98d29589f 100644
--- a/src/include/commands/explain_dr.h
+++ b/src/include/commands/explain_dr.h
@@ -23,11 +23,10 @@ typedef struct ExplainState ExplainState;
typedef struct SerializeMetrics
{
uint64 bytesSent; /* # of bytes serialized */
- instr_time timeSpent; /* time spent serializing */
- BufferUsage bufferUsage; /* buffers accessed during serialization */
+ Instrumentation instr; /* per-tuple timing/buffer measurement */
} SerializeMetrics;
extern DestReceiver *CreateExplainSerializeDestReceiver(ExplainState *es);
-extern SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
+extern SerializeMetrics *GetSerializationMetrics(DestReceiver *dest);
#endif
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index d3a57242844..340029a2034 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -51,8 +51,8 @@ typedef struct QueryDesc
/* This field is set by ExecutePlan */
bool already_executed; /* true if previously executed */
- /* This is always set NULL by the core system, but plugins can change it */
- struct Instrumentation *totaltime; /* total time spent in ExecutorRun */
+ /* This field is set by ExecutorRun, or plugins */
+ struct QueryInstrumentation *totaltime; /* total time spent in ExecutorRun */
} QueryDesc;
/* in pquery.c */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 07f4b1f7490..f56b13841fb 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -233,6 +233,7 @@ ExecGetJunkAttribute(TupleTableSlot *slot, AttrNumber attno, bool *isNull)
/*
* prototypes from functions in execMain.c
*/
+typedef struct QueryInstrumentation QueryInstrumentation;
extern void ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void standard_ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void ExecutorRun(QueryDesc *queryDesc,
@@ -254,7 +255,7 @@ extern void InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options);
+ QueryInstrumentation *qinstr);
extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid,
ResultRelInfo *rootRelInfo);
extern List *ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo);
@@ -301,6 +302,8 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent);
+extern void ExecFinalizeNodeInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1139be8333e..2d218dc2a15 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -13,6 +13,7 @@
#ifndef INSTRUMENT_H
#define INSTRUMENT_H
+#include "lib/ilist.h"
#include "portability/instr_time.h"
@@ -68,29 +69,91 @@ typedef enum InstrumentOption
} InstrumentOption;
/*
- * General purpose instrumentation that can capture time and WAL/buffer usage
+ * Instrumentation base class for capturing time and WAL/buffer usage
*
- * Initialized through InstrAlloc, followed by one or more calls to a pair of
- * InstrStart/InstrStop (activity is measured inbetween).
+ * If used directly:
+ * - Allocate on the stack and zero initialize the struct
+ * - Call InstrInitOptions to set instrumentation options
+ * - Call InstrStart before the activity you want to measure
+ * - Call InstrStop / InstrStopFinalize after the activity to capture totals
+ *
+ * InstrStart/InstrStop may be called multiple times. The last stop call must
+ * be to InstrStopFinalize to ensure parent stack entries get the accumulated
+ * totals. If there is risk of transaction aborts you must call
+ * InstrStopFinalize in a PG_TRY/PG_FINALLY block to avoid corrupting the
+ * instrumentation stack.
+ *
+ * In a query context use QueryInstrumentation instead, which handles aborts
+ * using the resource owner logic.
*/
typedef struct Instrumentation
{
/* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
- bool need_bufusage; /* true if we need buffer usage data */
- bool need_walusage; /* true if we need WAL usage data */
+ bool need_stack; /* true if we need WAL/buffer usage data */
/* Internal state keeping: */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
+ /* Abort handling: link in parent QueryInstrumentation's unfinalized list */
+ dlist_node unfinalized_entry;
} Instrumentation;
+/*
+ * Query-related instrumentation tracking.
+ *
+ * Usage:
+ * - Allocate on the heap using InstrQueryAlloc (required for abort handling)
+ * - Call InstrQueryStart before the activity you want to measure
+ * - Call InstrQueryStop / InstrQueryStopFinalize afterwards to capture totals
+ *
+ * InstrQueryStart/InstrQueryStop may be called multiple times. The last stop
+ * call must be to InstrQueryStopFinalize to ensure parent stack entries get
+ * the accumulated totals.
+ *
+ * Uses the resource owner mechanism for handling aborts. As such, the caller
+ * *must not* exit the top-level transaction after having called
+ * InstrQueryStart without first calling InstrQueryStop or
+ * InstrQueryStopFinalize. In the case of a transaction abort, logic equivalent
+ * to InstrQueryStopFinalize will be called automatically.
+ */
+struct ResourceOwnerData;
+typedef struct QueryInstrumentation
+{
+ Instrumentation instr;
+
+ /* Original instrument_options flags used to create this instrumentation */
+ int instrument_options;
+
+ /* Resource owner used for cleanup for aborts between InstrStart/InstrStop */
+ struct ResourceOwnerData *owner;
+
+ /*
+ * Dedicated memory context for all instrumentation allocations belonging
+ * to this query (node instrumentation, trigger instrumentation, etc.).
+ * Initially a child of TopMemoryContext so it survives transaction abort
+ * for ResourceOwner cleanup; it is then reparented to the current memory
+ * context by InstrQueryStopFinalize.
+ */
+ MemoryContext instr_cxt;
+
+ /*
+ * Child entries that need to be cleaned up on abort, since they are not
+ * registered as a resource owner themselves. Contains both node and
+ * trigger instrumentation entries linked via instr.unfinalized_entry.
+ */
+ dlist_head unfinalized_entries;
+} QueryInstrumentation;
+
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Relies on an outer QueryInstrumentation having been set up to handle the
+ * stack used for WAL/buffer usage statistics, and relies on it for managing
+ * aborts. Solely intended for the executor and anyone reporting about its
+ * activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -111,6 +174,10 @@ typedef struct NodeInstrumentation
double nfiltered2; /* # of tuples removed by "other" quals */
} NodeInstrumentation;
+/*
+ * Care must be taken with any pointers contained within this struct, as this
+ * gets copied across processes during parallel query execution.
+ */
typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
@@ -124,16 +191,102 @@ typedef struct TriggerInstrumentation
* was fired */
} TriggerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
+/*
+ * Dynamic array-based stack for tracking current WAL/buffer usage context.
+ *
+ * When the stack is empty, 'current' points to instr_top which accumulates
+ * session-level totals.
+ */
+typedef struct InstrStackState
+{
+ int stack_space; /* allocated capacity of entries array */
+ int stack_size; /* current number of entries */
+
+ Instrumentation **entries; /* dynamic array of pointers */
+ Instrumentation *current; /* top of stack, or &instr_top when empty */
+} InstrStackState;
+
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int instrument_options);
+/*
+ * The top instrumentation represents a running total of the current backend
+ * WAL/buffer usage information. This will not be updated immediately, but
+ * rather when the current stack entry gets accumulated, which typically
+ * happens at query end.
+ *
+ * Care must be taken when utilizing this in the parallel worker context:
+ * Parallel workers will report back their instrumentation to the caller,
+ * and this gets added to the caller's stack. If this were to be used in the
+ * shared memory stats infrastructure it would need to be skipped on parallel
+ * workers to avoid double counting.
+ */
+extern PGDLLIMPORT Instrumentation instr_top;
+
+/*
+ * The instrumentation stack state. The 'current' field points to the
+ * currently active stack entry that is getting updated as activity happens,
+ * and will be accumulated to parent stacks when it gets finalized by
+ * InstrStop (for non-executor use cases), ExecFinalizeNodeInstrumentation
+ * (executor finish) or ResOwnerReleaseInstrumentation on abort.
+ */
+extern PGDLLIMPORT InstrStackState instr_stack;
+
+extern void InstrStackGrow(void);
+
+/*
+ * Pushes the stack so that all WAL/buffer usage updates go to the passed in
+ * instrumentation entry.
+ *
+ * See note on InstrPopStack regarding safe use of these functions.
+ */
+static inline void
+InstrPushStack(Instrumentation *instr)
+{
+ if (unlikely(instr_stack.stack_size == instr_stack.stack_space))
+ InstrStackGrow();
+
+ instr_stack.entries[instr_stack.stack_size++] = instr;
+ instr_stack.current = instr;
+}
+
+/*
+ * Pops the stack entry back to the previous one that was effective at
+ * InstrPushStack.
+ *
+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, consider instead using InstrStart + InstrStopFinalize which can skip
+ * intermediate stack entries.
+ */
+static inline void
+InstrPopStack(Instrumentation *instr)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.entries[instr_stack.stack_size - 1] == instr);
+ instr_stack.stack_size--;
+ instr_stack.current = instr_stack.stack_size > 0
+ ? instr_stack.entries[instr_stack.stack_size - 1]
+ : &instr_top;
+}
+
extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
extern void InstrStart(Instrumentation *instr);
extern void InstrStop(Instrumentation *instr);
+extern void InstrStopFinalize(Instrumentation *instr);
+extern void InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent);
+extern void InstrAccumStack(Instrumentation *dst, Instrumentation *add);
-extern NodeInstrumentation *InstrAllocNode(int instrument_options,
- bool async_mode);
+extern QueryInstrumentation *InstrQueryAlloc(int instrument_options);
+extern void InstrQueryStart(QueryInstrumentation *instr);
+extern void InstrQueryStop(QueryInstrumentation *instr);
+extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
+extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
+
+pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+
+extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
extern void InstrStartNode(NodeInstrumentation *instr);
extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
@@ -141,35 +294,36 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
-extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
-extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr, int n);
+extern void InstrStartTrigger(QueryInstrumentation *qinstr,
+ TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+extern void WalUsageAdd(WalUsage *dst, const WalUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
+ instr_stack.current->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += val; \
+ instr_stack.current->bufusage.fld += val; \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ INSTR_TIME_ADD(instr_stack.current->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ INSTR_TIME_ACCUM_DIFF(instr_stack.current->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ instr_stack.current->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += val; \
+ instr_stack.current->walusage.fld += val; \
} while(0)
#endif /* INSTRUMENT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 502ad4f2da5..aef1003f608 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -53,6 +53,7 @@ typedef struct Instrumentation Instrumentation;
typedef struct pairingheap pairingheap;
typedef struct PlanState PlanState;
typedef struct QueryEnvironment QueryEnvironment;
+typedef struct QueryInstrumentation QueryInstrumentation;
typedef struct RelationData *Relation;
typedef Relation *RelationPtr;
typedef struct ScanKeyData ScanKeyData;
@@ -731,7 +732,7 @@ typedef struct EState
* ExecutorRun() calls. */
int es_top_eflags; /* eflags passed to ExecutorStart */
- int es_instrument; /* OR of InstrumentOption flags */
+ QueryInstrumentation *es_instrument; /* query-level instrumentation */
bool es_finished; /* true when ExecutorFinish is done */
List *es_exprcontexts; /* List of ExprContexts within EState */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index eb6033b4fdb..5463bc921f0 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -75,6 +75,7 @@ typedef uint32 ResourceReleasePriority;
#define RELEASE_PRIO_SNAPSHOT_REFS 500
#define RELEASE_PRIO_FILES 600
#define RELEASE_PRIO_WAITEVENTSETS 700
+#define RELEASE_PRIO_INSTRUMENTATION 800
/* 0 is considered invalid */
#define RELEASE_PRIO_FIRST 1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e9059d1b255..cae1e2a8857 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,6 +1341,7 @@ InjectionPointSharedState
InjectionPointsCtl
InlineCodeBlock
InsertStmt
+InstrStackState
Instrumentation
Int128AggState
Int8TransTypeData
@@ -2458,6 +2459,7 @@ QueryCompletion
QueryDesc
QueryEnvironment
QueryInfo
+QueryInstrumentation
QueryItem
QueryItemType
QueryMode
--
2.47.1
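
The stack handling added to instrument.h in the patch above (InstrPushStack and InstrPopStack, with instr_top as the fallback target) can be sketched in isolation roughly as follows. This is only a simplified illustration, not the patch's code: Usage, UsageStack, usage_top and all usage_* functions are hypothetical names, the struct is reduced to a single counter, and plain realloc stands in for whatever allocation the backend uses.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for Instrumentation, reduced to one counter. */
typedef struct Usage
{
    long        blks_hit;
} Usage;

/* Same shape as InstrStackState in the patch, with hypothetical names. */
typedef struct UsageStack
{
    int         space;      /* allocated capacity of entries */
    int         size;       /* number of entries in use */
    Usage     **entries;    /* dynamic array of pointers */
    Usage      *current;    /* top of stack, or &usage_top when empty */
} UsageStack;

Usage       usage_top;      /* fallback target when the stack is empty */
UsageStack  usage_stack = {0, 0, NULL, &usage_top};

/* Double the capacity of the entries array (assumed growth policy). */
void
usage_stack_grow(void)
{
    int         new_space = usage_stack.space > 0 ? usage_stack.space * 2 : 4;
    Usage     **grown = realloc(usage_stack.entries,
                                (size_t) new_space * sizeof(Usage *));

    if (grown == NULL)
        abort();
    usage_stack.entries = grown;
    usage_stack.space = new_space;
}

/* Route all subsequent counter updates to 'u'. */
void
usage_push(Usage *u)
{
    if (usage_stack.size == usage_stack.space)
        usage_stack_grow();
    usage_stack.entries[usage_stack.size++] = u;
    usage_stack.current = u;
}

/* Restore the previous entry; 'u' must be the current top of the stack. */
void
usage_pop(Usage *u)
{
    assert(usage_stack.size > 0);
    assert(usage_stack.entries[usage_stack.size - 1] == u);
    usage_stack.size--;
    usage_stack.current = usage_stack.size > 0
        ? usage_stack.entries[usage_stack.size - 1]
        : &usage_top;
}

/* What a counter macro boils down to: bump the active entry only. */
void
usage_count_hit(void)
{
    usage_stack.current->blks_hit++;
}

/* Fold a finished child's counts into its parent. */
void
usage_accum(Usage *dst, const Usage *add)
{
    dst->blks_hit += add->blks_hit;
}
```

In this sketch each entry only sees activity that happened while it was the top of the stack, so a parent's inclusive total has to be produced by an explicit accumulation step once the child is finished. That appears to be the role of InstrAccumStack and the finalize functions declared in the patch, but the details here are an assumption.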
[application/octet-stream] v10-0001-instrumentation-Separate-trigger-logic-from-othe.patch (10.1K, 3-v10-0001-instrumentation-Separate-trigger-logic-from-othe.patch)
From 30eef87a653d2c9da919927442b6e91bbc720ae2 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v10 1/9] instrumentation: Separate trigger logic from other
uses
Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/commands/explain.c | 19 ++++++++-----------
src/backend/commands/trigger.c | 22 +++++++++++-----------
src/backend/executor/execMain.c | 2 +-
src/backend/executor/instrument.c | 26 ++++++++++++++++++++++++++
src/include/executor/instrument.h | 12 ++++++++++++
src/include/nodes/execnodes.h | 3 ++-
src/tools/pgindent/typedefs.list | 1 +
7 files changed, 61 insertions(+), 24 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e4b70166b0e..eb6ef23c2d6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1101,18 +1101,15 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
for (nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
{
Trigger *trig = rInfo->ri_TrigDesc->triggers + nt;
- Instrumentation *instr = rInfo->ri_TrigInstrument + nt;
+ TriggerInstrumentation *tginstr = rInfo->ri_TrigInstrument + nt;
char *relname;
char *conname = NULL;
- /* Must clean up instrumentation state */
- InstrEndLoop(instr);
-
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
*/
- if (instr->ntuples == 0)
+ if (tginstr->firings == 0)
continue;
ExplainOpenGroup("Trigger", NULL, true, es);
@@ -1137,11 +1134,11 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
if (show_relname)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
- appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- INSTR_TIME_GET_MILLISEC(instr->total),
- instr->ntuples);
+ appendStringInfo(es->str, ": time=%.3f calls=%d\n",
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total),
+ tginstr->firings);
else
- appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
+ appendStringInfo(es->str, ": calls=%d\n", tginstr->firings);
}
else
{
@@ -1151,9 +1148,9 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Relation", relname, es);
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- INSTR_TIME_GET_MILLISEC(instr->total), 3,
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total), 3,
es);
- ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
+ ExplainPropertyInteger("Calls", NULL, tginstr->firings, es);
}
if (conname)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 6596843a8d8..29b80d75143 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -92,7 +92,7 @@ static bool TriggerEnabled(EState *estate, ResultRelInfo *relinfo,
static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2311,7 +2311,7 @@ static HeapTuple
ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2346,7 +2346,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2391,10 +2391,10 @@ ExecCallTriggerFunc(TriggerData *trigdata,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
return (HeapTuple) DatumGetPointer(result);
}
@@ -3938,7 +3938,7 @@ static void AfterTriggerExecute(EState *estate,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2);
@@ -4332,7 +4332,7 @@ AfterTriggerExecute(EState *estate,
ResultRelInfo *src_relInfo,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
- FmgrInfo *finfo, Instrumentation *instr,
+ FmgrInfo *finfo, TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2)
@@ -4373,7 +4373,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4590,10 +4590,10 @@ AfterTriggerExecute(EState *estate,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
}
@@ -4709,7 +4709,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
Relation rel = NULL;
TriggerDesc *trigdesc = NULL;
FmgrInfo *finfo = NULL;
- Instrumentation *instr = NULL;
+ TriggerInstrumentation *instr = NULL;
TupleTableSlot *slot1 = NULL,
*slot2 = NULL;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 58b84955c2b..53631163dd6 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1285,7 +1285,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
}
else
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index a40610bc252..9354ad7be12 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -196,6 +196,32 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
WalUsageAdd(&dst->walusage, &add->walusage);
}
+/* Trigger instrumentation handling */
+TriggerInstrumentation *
+InstrAllocTrigger(int n, int instrument_options)
+{
+ TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ int i;
+
+ for (i = 0; i < n; i++)
+ InstrInit(&tginstr[i].instr, instrument_options);
+
+ return tginstr;
+}
+
+void
+InstrStartTrigger(TriggerInstrumentation *tginstr)
+{
+ InstrStartNode(&tginstr->instr);
+}
+
+void
+InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
+{
+ InstrStopNode(&tginstr->instr, 0);
+ tginstr->firings += firings;
+}
+
/* note current values during parallel executor startup */
void
InstrStartParallelQuery(void)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..a9c2233227f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -100,6 +100,13 @@ typedef struct WorkerInstrumentation
Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
+typedef struct TriggerInstrumentation
+{
+ Instrumentation instr;
+ int firings; /* number of times the instrumented trigger
+ * was fired */
+} TriggerInstrumentation;
+
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
@@ -111,6 +118,11 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+
+extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
+extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
+
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 684e398f824..178229c5c44 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -59,6 +59,7 @@ typedef struct ScanKeyData ScanKeyData;
typedef struct SnapshotData *Snapshot;
typedef struct SortSupportData *SortSupport;
typedef struct TIDBitmap TIDBitmap;
+typedef struct TriggerInstrumentation TriggerInstrumentation;
typedef struct TupleConversionMap TupleConversionMap;
typedef struct TupleDescData *TupleDesc;
typedef struct Tuplesortstate Tuplesortstate;
@@ -533,7 +534,7 @@ typedef struct ResultRelInfo
ExprState **ri_TrigWhenExprs;
/* optional runtime measurements for triggers */
- Instrumentation *ri_TrigInstrument;
+ TriggerInstrumentation *ri_TrigInstrument;
/* On-demand created slots for triggers / returning processing */
TupleTableSlot *ri_ReturningSlot; /* for trigger output tuples */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0c07c945f05..6ec3cdeff5b 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3191,6 +3191,7 @@ TriggerDesc
TriggerEvent
TriggerFlags
TriggerInfo
+TriggerInstrumentation
TriggerTransition
TruncateStmt
TsmRoutine
--
2.47.1
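
The pattern used by patch 1/9 above (a specialized struct that embeds the generic instrumentation and keeps its own firings counter instead of reusing ntuples) can be illustrated with a small standalone sketch. Everything here is hypothetical: BaseInstr, TrigInstr and the trig_* functions are invented for the example, and timestamps are passed in as plain millisecond values rather than read from a clock, so that the behaviour is deterministic.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the generic Instrumentation struct. */
typedef struct BaseInstr
{
    int         running;    /* nonzero between start and stop */
    double      start_ms;   /* timestamp of the last start */
    double      total_ms;   /* accumulated elapsed time */
} BaseInstr;

/* Specialized wrapper, analogous in shape to TriggerInstrumentation. */
typedef struct TrigInstr
{
    BaseInstr   instr;
    int         firings;    /* number of times the trigger was fired */
} TrigInstr;

/* Allocate a zeroed array with one entry per trigger. */
TrigInstr *
trig_alloc(int n)
{
    TrigInstr  *arr = calloc((size_t) n, sizeof(TrigInstr));

    if (arr == NULL)
        abort();
    return arr;
}

/* Start charging time to this trigger. */
void
trig_start(TrigInstr *t, double now_ms)
{
    assert(!t->instr.running);
    t->instr.running = 1;
    t->instr.start_ms = now_ms;
}

/* Stop charging time and record the firings separately from timing. */
void
trig_stop(TrigInstr *t, double now_ms, int firings)
{
    assert(t->instr.running);
    t->instr.running = 0;
    t->instr.total_ms += now_ms - t->instr.start_ms;
    t->firings += firings;
}

/* Count the triggers a report would show, skipping never-fired ones. */
int
trig_count_fired(const TrigInstr *arr, int n)
{
    int         fired = 0;

    for (int i = 0; i < n; i++)
    {
        if (arr[i].firings > 0)
            fired++;
    }
    return fired;
}
```

Keeping the counter as an int in the wrapper is what lets the reporting side print it with an integer format, as the explain.c hunk does, rather than formatting a double tuple count.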
[application/octet-stream] v10-0003-instrumentation-Replace-direct-changes-of-pgBuff.patch (9.9K, 4-v10-0003-instrumentation-Replace-direct-changes-of-pgBuff.patch)
download | inline diff:
From 1e88eba5ad45baa0768fbb15023ff532f64e5d2f Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 12:12:39 -0800
Subject: [PATCH v10 3/9] instrumentation: Replace direct changes of
pgBufferUsage/pgWalUsage with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
---
src/backend/access/transam/xlog.c | 10 +++++-----
src/backend/executor/instrument.c | 1 -
src/backend/storage/buffer/bufmgr.c | 24 ++++++++++++------------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
7 files changed, 47 insertions(+), 29 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f5c9a34374d..9b33584f454 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1081,10 +1081,10 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
- pgWalUsage.wal_fpi_bytes += fpi_bytes;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
+ INSTR_WALUSAGE_ADD(wal_fpi_bytes, fpi_bytes);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2063,7 +2063,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = InvalidXLogRecPtr;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index bc551f95a08..6a4a08ebb0c 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -54,7 +54,6 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- /* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
instr->bufusage_start = pgBufferUsage;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 00bc609529a..dfa37e5ed44 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -835,7 +835,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -856,7 +856,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
if (BufferTagsEqual(&tag, &bufHdr->tag))
{
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
UnpinBuffer(bufHdr);
@@ -1257,14 +1257,14 @@ PinBufferForBlock(Relation rel,
{
bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
if (*foundPtr)
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
}
else
{
bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
strategy, foundPtr, io_context);
if (*foundPtr)
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
}
if (rel)
{
@@ -1998,9 +1998,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
if (operation->rel)
pgstat_count_buffer_hit(operation->rel);
@@ -2068,9 +2068,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it.
@@ -2959,7 +2959,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -3105,7 +3105,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4520,7 +4520,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state.
@@ -5663,7 +5663,7 @@ MarkSharedBufferDirtyHint(Buffer buffer, BufferDesc *bufHdr, uint64 lockstate,
UnlockBufHdr(bufHdr);
}
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 404c6bccbdd..8845b0aeed6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -217,7 +217,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -478,7 +478,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -509,7 +509,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u64(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c4afe4d368a..8b501dfcadd 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -475,13 +475,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -549,13 +549,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 28de24538dc..9e7a88ec0d0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b11d64633b5..1139be8333e 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -153,4 +153,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += val; \
+ } while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += val; \
+ } while(0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
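
To illustrate why routing every counter update through macros, as patch 3/9 above does, makes a later refactoring easier, here is a minimal sketch. It is not the patch's code: BufUsage, global_usage, usage_target, the USAGE_* macros and read_blocks are hypothetical, and the pointer indirection is an assumption made for the example, chosen to show that call sites stay unchanged when the destination of the updates changes.

```c
#include <assert.h>

/* Hypothetical, trimmed-down counter struct. */
typedef struct BufUsage
{
    long        shared_blks_hit;
    long        shared_blks_read;
} BufUsage;

BufUsage    global_usage;                   /* process-wide totals */
BufUsage   *usage_target = &global_usage;   /* where updates currently go */

/* Call sites name only the field; the macros decide where it lives. */
#define USAGE_INCR(fld) do { \
        usage_target->fld++; \
    } while (0)
#define USAGE_ADD(fld, val) do { \
        usage_target->fld += (val); \
    } while (0)

/* Example call site: record 'nhits' cache hits out of 'nblocks' requests. */
void
read_blocks(int nblocks, int nhits)
{
    assert(nhits >= 0 && nhits <= nblocks);
    for (int i = 0; i < nhits; i++)
        USAGE_INCR(shared_blks_hit);
    USAGE_ADD(shared_blks_read, nblocks - nhits);
}
```

Pointing usage_target at a different struct redirects every call site at once, without touching read_blocks. The last hunk of the first patch in this message changes the INSTR_BUFUSAGE_* macro bodies to update instr_stack.current in a similar way.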
[application/octet-stream] v10-0004-instrumentation-Add-additional-regression-tests-.patch (23.5K, 5-v10-0004-instrumentation-Add-additional-regression-tests-.patch)
From 176955f3400a7dbd1d2163faba954bb363df65d5 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 22 Feb 2026 16:12:48 -0800
Subject: [PATCH v10 4/9] instrumentation: Add additional regression tests
covering buffer usage
This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.
Testing this is challenging because there can be different sources of
buffer activity, so where possible we rely on temporary tables to prove
that activity was captured and not lost. These tests support a future
commit that reworks some of the instrumentation logic, a change that
could cause the areas covered here to fail.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
.../pg_stat_statements/expected/utility.out | 70 +++++++
contrib/pg_stat_statements/expected/wal.out | 48 +++++
contrib/pg_stat_statements/sql/utility.sql | 56 +++++
contrib/pg_stat_statements/sql/wal.sql | 33 +++
src/test/regress/expected/explain.out | 197 ++++++++++++++++++
src/test/regress/sql/explain.sql | 194 +++++++++++++++++
6 files changed, 598 insertions(+)
diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index e4d6564ea5b..cba487f6be5 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -289,6 +289,76 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
1 | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
(3 rows)
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT run_explain_buffers_test();
+ run_explain_buffers_test
+--------------------------
+
+(1 row)
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+ query | has_buffer_stats
+-----------------------------------+------------------
+ SELECT run_explain_buffers_test() | t
+(1 row)
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+CALL pgss_call_rollback_proc();
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------------+-------+---------------+---------------------+-----------------------
+ CALL pgss_call_rollback_proc() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/expected/wal.out b/contrib/pg_stat_statements/expected/wal.out
index 977e382d848..611213daef6 100644
--- a/contrib/pg_stat_statements/expected/wal.out
+++ b/contrib/pg_stat_statements/expected/wal.out
@@ -28,3 +28,51 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
t
(1 row)
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT pgss_error_func();
+ pgss_error_func
+-----------------
+
+(1 row)
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------+-------+---------------+---------------------+-----------------------
+ SELECT pgss_error_func() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index dd97203c210..7540e49c73c 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -152,6 +152,62 @@ EXPLAIN (costs off) SELECT a FROM generate_series(1,10) AS tab(a) WHERE a = 7;
SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+SELECT run_explain_buffers_test();
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+CALL pgss_call_rollback_proc();
+
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
+
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/sql/wal.sql b/contrib/pg_stat_statements/sql/wal.sql
index 1dc1552a81e..467e321b206 100644
--- a/contrib/pg_stat_statements/sql/wal.sql
+++ b/contrib/pg_stat_statements/sql/wal.sql
@@ -18,3 +18,36 @@ wal_records > 0 as wal_records_generated,
wal_records >= rows as wal_records_ge_rows
FROM pg_stat_statements ORDER BY query COLLATE "C";
SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT pgss_error_func();
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..e28e7543693 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,200 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+ exception_buffers_nested_visible
+----------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_cursor_explain_buffers();
+ noscroll_ok | scroll_ok
+-------------+-----------
+ t | t
+(1 row)
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_parallel_explain_buffers();
+ ratio
+-------
+ 1
+(1 row)
+
+DROP FUNCTION check_parallel_explain_buffers;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..cf5c6335a19 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,197 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_cursor_explain_buffers();
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_parallel_explain_buffers();
+
+DROP FUNCTION check_parallel_explain_buffers;
--
2.47.1
[application/octet-stream] v10-0002-instrumentation-Separate-per-node-logic-from-oth.patch (27.1K, 6-v10-0002-instrumentation-Separate-per-node-logic-from-oth.patch)
From 02925109417e3a5737f86e24758585b7fd9a9d68 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 01:19:50 -0800
Subject: [PATCH v10 2/9] instrumentation: Separate per-node logic from other
uses
Previously, different places (e.g. query "total time") repurposed the
Instrumentation struct, which was initially introduced to capture per-node
statistics during execution. This overuse of the same struct is confusing,
e.g. it clutters unrelated code paths with calls to
InstrStartNode/InstrStopNode, and it prevents future refactorings.
Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.
Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.
In passing, drop the "n" argument to InstrAlloc, as all remaining callers
need exactly one Instrumentation struct.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
contrib/auto_explain/auto_explain.c | 8 +-
.../pg_stat_statements/pg_stat_statements.c | 8 +-
contrib/postgres_fdw/postgres_fdw.c | 2 +-
src/backend/commands/explain.c | 20 +--
src/backend/executor/execMain.c | 8 +-
src/backend/executor/execParallel.c | 24 +--
src/backend/executor/execProcnode.c | 4 +-
src/backend/executor/instrument.c | 142 ++++++++++++------
src/include/executor/instrument.h | 60 +++++---
src/include/nodes/execnodes.h | 9 +-
src/tools/pgindent/typedefs.list | 3 +-
11 files changed, 174 insertions(+), 114 deletions(-)
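For readers skimming the diff, the split described in the commit message can
be modelled roughly as below. This is only a stand-alone sketch, not code from
the patch: every Toy* name, the fake clock, and the idea that the node-level
stop function delegates to the general one are assumptions made for
illustration. What the diff does confirm is that the per-node struct reaches
the shared fields through a member named "instr" (e.g. instrument->instr.total),
and that InstrStart/InstrStop no longer take a tuple count.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical option flags; the real ones live in executor/instrument.h. */
#define TOY_INSTRUMENT_TIMER   (1 << 0)
#define TOY_INSTRUMENT_BUFFERS (1 << 1)

/* General-purpose part: timing only in this toy (no WAL/buffer counters). */
typedef struct ToyInstrumentation
{
    bool        need_timer;
    bool        need_bufusage;
    bool        running;        /* true between start and stop */
    int64_t     starttime;      /* fake clock value at start */
    int64_t     total;          /* accumulated elapsed ticks */
} ToyInstrumentation;

/* Per-node part: embeds the general struct and adds node-only counters. */
typedef struct ToyNodeInstrumentation
{
    ToyInstrumentation instr;
    double      tuplecount;     /* tuples emitted so far */
} ToyNodeInstrumentation;

/* Fake clock so the sketch is deterministic; callers advance it by hand. */
int64_t     toy_clock = 0;

void
ToyInstrInitOptions(ToyInstrumentation *instr, int options)
{
    instr->need_timer = (options & TOY_INSTRUMENT_TIMER) != 0;
    instr->need_bufusage = (options & TOY_INSTRUMENT_BUFFERS) != 0;
}

void
ToyInstrStart(ToyInstrumentation *instr)
{
    if (instr->need_timer)
    {
        assert(!instr->running);    /* "called twice in a row" */
        instr->running = true;
        instr->starttime = toy_clock;
    }
}

void
ToyInstrStop(ToyInstrumentation *instr)
{
    if (instr->need_timer)
    {
        assert(instr->running);     /* "called without start" */
        instr->total += toy_clock - instr->starttime;
        instr->running = false;
    }
}

/* Node-level wrappers: assumed here to delegate to the general functions. */
void
ToyInstrStartNode(ToyNodeInstrumentation *node)
{
    ToyInstrStart(&node->instr);
}

void
ToyInstrStopNode(ToyNodeInstrumentation *node, double ntuples)
{
    node->tuplecount += ntuples;    /* tuple counting stays per-node only */
    ToyInstrStop(&node->instr);
}
```

In this model a "total time" caller only ever touches ToyInstrumentation via
ToyInstrStart/ToyInstrStop, while plan nodes use the wrappers and read the
elapsed time back as node.instr.total. That mirrors the executor call sites in
the diff, where queryDesc->totaltime switches from InstrStartNode/InstrStopNode
to InstrStart/InstrStop.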
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index e856cd35a6f..39bf2543b70 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3..3e79108846e 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1024,7 +1024,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -1083,12 +1083,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
if (queryId != INT64CONST(0) && queryDesc->totaltime &&
pgss_enabled(nesting_level))
{
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
pgss_store(queryDesc->sourceText,
queryId,
queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 41e47cc795b..cc8ec24c30e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2779,7 +2779,7 @@ postgresIterateDirectModify(ForeignScanState *node)
if (!resultRelInfo->ri_projectReturning)
{
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- Instrumentation *instr = node->ss.ps.instrument;
+ NodeInstrumentation *instr = node->ss.ps.instrument;
Assert(!dmstate->has_returning);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index eb6ef23c2d6..e73dc129132 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1837,7 +1837,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
double nloops = planstate->instrument->nloops;
double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
- double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
+ double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->instr.total) / nloops;
double rows = planstate->instrument->ntuples / nloops;
if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1890,11 +1890,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* prepare per-worker general execution details */
if (es->workers_state && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
double startup_ms;
double total_ms;
@@ -1903,7 +1903,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (nloops <= 0)
continue;
startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
- total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
+ total_ms = INSTR_TIME_GET_MILLISEC(instrument->instr.total) / nloops;
rows = instrument->ntuples / nloops;
ExplainOpenWorker(n, es);
@@ -2290,18 +2290,18 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage);
if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ show_wal_usage(es, &planstate->instrument->instr.walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
if (nloops <= 0)
@@ -2309,9 +2309,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage);
if (es->wal)
- show_wal_usage(es, &instrument->walusage);
+ show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 53631163dd6..1b950040597 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -333,7 +333,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/*
* extract information from the query descriptor and the query feature.
@@ -385,7 +385,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -435,7 +435,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -445,7 +445,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
AfterTriggerEndQuery(estate);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ac84af294c9..c153d5c1c3b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,7 +87,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object. This will depend on the length of
+ * of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -104,11 +104,15 @@ struct SharedExecutorInstrumentation
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
- /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+ /*
+ * array of num_plan_nodes * num_workers NodeInstrumentation objects
+ * follows
+ */
};
#define GetInstrumentationArray(sei) \
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
- (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+ (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
@@ -725,7 +729,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len = MAXALIGN(instrumentation_len);
instrument_offset = instrumentation_len;
instrumentation_len +=
- mul_size(sizeof(Instrumentation),
+ mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -811,7 +815,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*/
if (estate->es_instrument)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -821,7 +825,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInit(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1053,7 +1057,7 @@ static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
int n;
int ibytes;
@@ -1081,9 +1085,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* Switch into per-query memory context.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
planstate->worker_instrument =
- palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+ palloc(ibytes + offsetof(WorkerNodeInstrumentation, instrument));
MemoryContextSwitchTo(oldcontext);
planstate->worker_instrument->num_workers = instrumentation->num_workers;
@@ -1313,7 +1317,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
{
int i;
int plan_node_id = planstate->plan->plan_node_id;
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
InstrEndLoop(planstate->instrument);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d35976925ae..132fe37ef60 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -414,8 +414,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument,
- result->async_capable);
+ result->instrument = InstrAllocNode(estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 9354ad7be12..bc551f95a08 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,51 +26,30 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int instrument_options)
{
- Instrumentation *instr;
-
- /* initialize all fields to zeroes, then modify as needed */
- instr = palloc0(n * sizeof(Instrumentation));
- if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
- {
- bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
- bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
- int i;
-
- for (i = 0; i < n; i++)
- {
- instr[i].need_bufusage = need_buffers;
- instr[i].need_walusage = need_wal;
- instr[i].need_timer = need_timer;
- instr[i].async_mode = async_mode;
- }
- }
-
+ Instrumentation *instr = palloc0(sizeof(Instrumentation));
+ InstrInitOptions(instr, instrument_options);
return instr;
}
-/* Initialize a pre-allocated instrumentation structure. */
void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- memset(instr, 0, sizeof(Instrumentation));
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-/* Entry to a plan node */
void
-InstrStartNode(Instrumentation *instr)
+InstrStart(Instrumentation *instr)
{
if (instr->need_timer)
{
if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ elog(ERROR, "InstrStart called twice in a row");
else
INSTR_TIME_SET_CURRENT(instr->starttime);
}
@@ -83,24 +62,19 @@ InstrStartNode(Instrumentation *instr)
instr->walusage_start = pgWalUsage;
}
-/* Exit from a plan node */
void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
-
/* let's update the time only if the timer was requested */
if (instr->need_timer)
{
if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStopNode called without start");
+ elog(ERROR, "InstrStop called without start");
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
}
@@ -113,6 +87,74 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (instr->need_walusage)
WalUsageAccumDiff(&instr->walusage,
&pgWalUsage, &instr->walusage_start);
+}
+
+/* Node instrumentation handling */
+
+/* Allocate new node instrumentation structure */
+NodeInstrumentation *
+InstrAllocNode(int instrument_options, bool async_mode)
+{
+ NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+
+ InstrInitNode(instr, instrument_options);
+ instr->async_mode = async_mode;
+
+ return instr;
+}
+
+/* Initialize a pre-allocated instrumentation structure. */
+void
+InstrInitNode(NodeInstrumentation *instr, int instrument_options)
+{
+ memset(instr, 0, sizeof(NodeInstrumentation));
+ InstrInitOptions(&instr->instr, instrument_options);
+}
+
+/* Entry to a plan node */
+void
+InstrStartNode(NodeInstrumentation *instr)
+{
+ InstrStart(&instr->instr);
+}
+
+/* Exit from a plan node */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ double save_tuplecount = instr->tuplecount;
+ instr_time endtime;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
+
+ /*
+ * Update the time only if the timer was requested.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+ if (instr->instr.need_timer)
+ {
+ if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
+ elog(ERROR, "InstrStopNode called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
+ }
+
+ /* Add delta of buffer usage since entry to node's totals */
+ if (instr->instr.need_bufusage)
+ BufferUsageAccumDiff(&instr->instr.bufusage,
+ &pgBufferUsage, &instr->instr.bufusage_start);
+
+ if (instr->instr.need_walusage)
+ WalUsageAccumDiff(&instr->instr.walusage,
+ &pgWalUsage, &instr->instr.walusage_start);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -133,7 +175,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
/* Update tuple count */
void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
{
/* count the returned tuples */
instr->tuplecount += nTuples;
@@ -141,24 +183,24 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation *instr)
{
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
elog(ERROR, "InstrEndLoop called on running node");
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
- INSTR_TIME_ADD(instr->total, instr->counter);
+ INSTR_TIME_ADD(instr->instr.total, instr->counter);
instr->ntuples += instr->tuplecount;
instr->nloops += 1;
/* Reset for next cycle (if any) */
instr->running = false;
- INSTR_TIME_SET_ZERO(instr->starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
INSTR_TIME_SET_ZERO(instr->counter);
INSTR_TIME_SET_ZERO(instr->firsttuple);
instr->tuplecount = 0;
@@ -166,7 +208,7 @@ InstrEndLoop(Instrumentation *instr)
/* aggregate instrumentation information */
void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
{
if (!dst->running && add->running)
{
@@ -181,7 +223,7 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->tuplecount += add->tuplecount;
INSTR_TIME_ADD(dst->startup, add->startup);
- INSTR_TIME_ADD(dst->total, add->total);
+ INSTR_TIME_ADD(dst->instr.total, add->instr.total);
dst->ntuples += add->ntuples;
dst->ntuples2 += add->ntuples2;
dst->nloops += add->nloops;
@@ -189,11 +231,11 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->need_bufusage)
- BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ if (dst->instr.need_bufusage)
+ BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
- if (dst->need_walusage)
- WalUsageAdd(&dst->walusage, &add->walusage);
+ if (dst->instr.need_walusage)
+ WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
}
/* Trigger instrumentation handling */
@@ -204,7 +246,7 @@ InstrAllocTrigger(int n, int instrument_options)
int i;
for (i = 0; i < n; i++)
- InstrInit(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, instrument_options);
return tginstr;
}
@@ -212,13 +254,13 @@ InstrAllocTrigger(int n, int instrument_options)
void
InstrStartTrigger(TriggerInstrumentation *tginstr)
{
- InstrStartNode(&tginstr->instr);
+ InstrStart(&tginstr->instr);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
- InstrStopNode(&tginstr->instr, 0);
+ InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a9c2233227f..b11d64633b5 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -67,38 +67,55 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/*
+ * General purpose instrumentation that can capture time and WAL/buffer usage
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured in between).
+ */
typedef struct Instrumentation
{
- /* Parameters set at node creation: */
+ /* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
bool need_bufusage; /* true if we need buffer usage data */
bool need_walusage; /* true if we need WAL usage data */
+ /* Internal state keeping: */
+ instr_time starttime; /* start time of last InstrStart */
+ BufferUsage bufusage_start; /* buffer usage at start */
+ WalUsage walusage_start; /* WAL usage at start */
+ /* Accumulated statistics: */
+ instr_time total; /* total runtime */
+ BufferUsage bufusage; /* total buffer usage */
+ WalUsage walusage; /* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
+{
+ Instrumentation instr;
+ /* Parameters set at node creation: */
bool async_mode; /* true if node is in async mode */
/* Info about current plan cycle: */
bool running; /* true if we've completed first tuple */
- instr_time starttime; /* start time of current iteration of node */
instr_time counter; /* accumulated runtime for this node */
instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
instr_time startup; /* total startup time */
- instr_time total; /* total time */
double ntuples; /* total tuples produced */
double ntuples2; /* secondary node-specific tuple counter */
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
-typedef struct WorkerInstrumentation
+typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
- Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
-} WorkerInstrumentation;
+ NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+} WorkerNodeInstrumentation;
typedef struct TriggerInstrumentation
{
@@ -110,14 +127,19 @@ typedef struct TriggerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
- bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int instrument_options);
+extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr);
+
+extern NodeInstrumentation *InstrAllocNode(int instrument_options,
+ bool async_mode);
+extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation *instr);
+extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation *instr);
+extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 178229c5c44..502ad4f2da5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -59,6 +59,7 @@ typedef struct ScanKeyData ScanKeyData;
typedef struct SnapshotData *Snapshot;
typedef struct SortSupportData *SortSupport;
typedef struct TIDBitmap TIDBitmap;
+typedef struct NodeInstrumentation NodeInstrumentation;
typedef struct TriggerInstrumentation TriggerInstrumentation;
typedef struct TupleConversionMap TupleConversionMap;
typedef struct TupleDescData *TupleDesc;
@@ -67,7 +68,7 @@ typedef struct Tuplestorestate Tuplestorestate;
typedef struct TupleTableSlot TupleTableSlot;
typedef struct TupleTableSlotOps TupleTableSlotOps;
typedef struct WalUsage WalUsage;
-typedef struct WorkerInstrumentation WorkerInstrumentation;
+typedef struct WorkerNodeInstrumentation WorkerNodeInstrumentation;
/* ----------------
@@ -1185,8 +1186,10 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
- Instrumentation *instrument; /* Optional runtime stats for this node */
- WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
+ NodeInstrumentation *instrument; /* Optional runtime stats for this
+ * node */
+ WorkerNodeInstrumentation *worker_instrument; /* per-worker
+ * instrumentation */
/* Per-worker JIT instrumentation */
struct SharedJitInstrumentation *worker_jit_instrument;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6ec3cdeff5b..e9059d1b255 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1807,6 +1807,7 @@ NextSampleBlock_function
NextSampleTuple_function
NextValueExpr
Node
+NodeInstrumentation
NodeTag
NonEmptyRange
NoneCompressorState
@@ -3413,9 +3414,9 @@ WorkTableScan
WorkTableScanState
WorkerInfo
WorkerInfoData
-WorkerInstrumentation
WorkerJobDumpPtrType
WorkerJobRestorePtrType
+WorkerNodeInstrumentation
Working_State
WriteBufPtrType
WriteBytePtrType
--
2.47.1
[application/octet-stream] v10-0007-instrumentation-Optimize-ExecProcNodeInstr-instr.patch (11.2K, 7-v10-0007-instrumentation-Optimize-ExecProcNodeInstr-instr.patch)
From be4de5f3b73d55e2b19f9495167e6ef337e92549 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 17:52:24 -0800
Subject: [PATCH v10 7/9] instrumentation: Optimize ExecProcNodeInstr
instructions by inlining
For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.
Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). These
conditionals, checked for every tuple emitted, add up and cause the
compiler to generate a non-optimal set of instructions.
Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. This reduces
the overhead of EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
test with a large COUNT(*) that does many ExecProcNode calls from ~20% on
top of actual runtime to ~3%. With BUFFERS ON, the same query goes
from ~20% to ~10% on top of actual runtime.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxFP7i7-wy98ZmEJ11edYq-RrPvJoa4kzGhBBjERA4Nyw%40mail.gmail.com#e8dfd018a07d7f8d41565a079d40c564
---
src/backend/executor/execProcnode.c | 22 +---
src/backend/executor/instrument.c | 198 ++++++++++++++++++++--------
src/include/executor/instrument.h | 5 +
3 files changed, 148 insertions(+), 77 deletions(-)
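To illustrate the specialization idea described in the commit message, here
is a minimal standalone sketch. It is not the patch's code: DemoNode,
demo_instr, demo_setup and the counters are hypothetical stand-ins for
PlanState, ExecProcNodeInstr and InstrNodeSetupExecProcNode. The point is
only that a single inline body called with constant flags lets a compiler
typically drop the untaken branches in each wrapper, so the per-call path
no longer tests the options.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for PlanState/NodeInstrumentation; not PostgreSQL code. */
typedef struct DemoNode DemoNode;
typedef int (*DemoProcFn) (DemoNode *node);

struct DemoNode
{
	bool		need_timer;		/* option fixed at setup time */
	bool		need_stack;		/* option fixed at setup time */
	long		timer_calls;	/* counts simulated timer start/stop pairs */
	long		stack_calls;	/* counts simulated stack push/pop pairs */
	long		tuples;			/* tuples emitted so far */
	DemoProcFn	proc_real;		/* the uninstrumented function */
	DemoProcFn	proc;			/* wrapper chosen once by demo_setup */
};

/* The uninstrumented work: pretend each call emits one tuple. */
static int
demo_real(DemoNode *node)
{
	(void) node;
	return 1;
}

/*
 * Shared body. Each wrapper below passes compile-time constants, so after
 * inlining the compiler can usually remove the branches that are not taken.
 */
static inline int
demo_instr(DemoNode *node, bool need_timer, bool need_stack)
{
	int			ntuples;

	assert(node->proc_real != NULL);

	if (need_stack)
		node->stack_calls++;
	if (need_timer)
		node->timer_calls++;

	ntuples = node->proc_real(node);
	node->tuples += ntuples;
	return ntuples;
}

static int demo_full(DemoNode *node) { return demo_instr(node, true, true); }
static int demo_stack_only(DemoNode *node) { return demo_instr(node, false, true); }
static int demo_timer_only(DemoNode *node) { return demo_instr(node, true, false); }
static int demo_rows_only(DemoNode *node) { return demo_instr(node, false, false); }

/* Inspect the options once and return the matching wrapper. */
static DemoProcFn
demo_setup(const DemoNode *node)
{
	if (node->need_timer && node->need_stack)
		return demo_full;
	else if (node->need_stack)
		return demo_stack_only;
	else if (node->need_timer)
		return demo_timer_only;
	else
		return demo_rows_only;
}

/* Pick the wrapper once, then call it ncalls times; returns tuples emitted. */
static long
demo_run(DemoNode *node, int ncalls)
{
	node->proc_real = demo_real;
	node->proc = demo_setup(node);
	for (int i = 0; i < ncalls; i++)
		node->proc(node);
	return node->tuples;
}
```

The trade-off in this sketch is one extra small function per option
combination in exchange for a branch-free hot path; the option check itself
happens only once, in demo_setup.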
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 21ad1b04a57..9f5698063f0 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,7 +121,6 @@
#include "nodes/nodeFuncs.h"
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
-static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
@@ -465,7 +464,7 @@ ExecProcNodeFirst(PlanState *node)
* have ExecProcNode() directly call the relevant function from now on.
*/
if (node->instrument)
- node->ExecProcNode = ExecProcNodeInstr;
+ node->ExecProcNode = InstrNodeSetupExecProcNode(node->instrument);
else
node->ExecProcNode = node->ExecProcNodeReal;
@@ -473,25 +472,6 @@ ExecProcNodeFirst(PlanState *node)
}
-/*
- * ExecProcNode wrapper that performs instrumentation calls. By keeping
- * this a separate function, we avoid overhead in the normal case where
- * no instrumentation is wanted.
- */
-static TupleTableSlot *
-ExecProcNodeInstr(PlanState *node)
-{
- TupleTableSlot *result;
-
- InstrStartNode(node->instrument);
-
- result = node->ExecProcNodeReal(node);
-
- InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
- return result;
-}
-
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 09d5ffe8651..4ea807e295f 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -59,29 +59,20 @@ InstrInitOptions(Instrumentation *instr, int instrument_options)
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-void
-InstrStart(Instrumentation *instr)
+static inline void
+InstrStartTimer(Instrumentation *instr)
{
- if (instr->need_timer)
- {
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStart called twice in a row");
- else
- INSTR_TIME_SET_CURRENT(instr->starttime);
- }
+ Assert(INSTR_TIME_IS_ZERO(instr->starttime));
- if (instr->need_stack)
- InstrPushStack(instr);
+ INSTR_TIME_SET_CURRENT(instr->starttime);
}
-static void
+static inline void
InstrStopTimer(Instrumentation *instr)
{
instr_time endtime;
- /* let's update the time only if the timer was requested */
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ Assert(!INSTR_TIME_IS_ZERO(instr->starttime));
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
@@ -89,6 +80,16 @@ InstrStopTimer(Instrumentation *instr)
INSTR_TIME_SET_ZERO(instr->starttime);
}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer)
+ InstrStartTimer(instr);
+
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
void
InstrStop(Instrumentation *instr)
{
@@ -372,65 +373,57 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options)
InstrInitOptions(&instr->instr, instrument_options);
}
-/* Entry to a plan node */
+/* Entry to a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
void
InstrStartNode(NodeInstrumentation *instr)
{
InstrStart(&instr->instr);
}
-
-/* Exit from a plan node */
-void
-InstrStopNode(NodeInstrumentation *instr, double nTuples)
+/*
+ * Updates the node instrumentation time counter.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+static inline void
+InstrStopNodeTimer(NodeInstrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
+ Assert(!INSTR_TIME_IS_ZERO(instr->instr.starttime));
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
/*
- * Update the time only if the timer was requested.
+ * Is this the first tuple of this cycle?
*
- * Note this is different from InstrStop because total is only updated in
- * InstrEndLoop. We need the separate counter variable because we need to
- * calculate start-up time for the first tuple in each cycle, and then
- * accumulate it together.
+ * In async mode, if the plan node hadn't emitted any tuples before, this
+ * might be the first tuple
*/
- if (instr->instr.need_timer)
- {
- if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrStopNode called without start");
-
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ if (!instr->running || (instr->async_mode && instr->tuplecount < 1.0))
+ instr->firsttuple = instr->counter;
+}
- INSTR_TIME_SET_ZERO(instr->instr.starttime);
- }
+/* Exit from a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ if (instr->instr.need_timer)
+ InstrStopNodeTimer(instr);
- /*
- * Only pop the stack, accumulation runs in
- * ExecFinalizeNodeInstrumentation
- */
+ /* Only pop the stack, accumulation runs in InstrFinalizeNode */
if (instr->instr.need_stack)
InstrPopStack(&instr->instr);
- /* Is this the first tuple of this cycle? */
- if (!instr->running)
- {
- instr->running = true;
- instr->firsttuple = instr->counter;
- }
- else
- {
- /*
- * In async mode, if the plan node hadn't emitted any tuples before,
- * this might be the first tuple
- */
- if (instr->async_mode && save_tuplecount < 1.0)
- instr->firsttuple = instr->counter;
- }
+ instr->running = true;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
}
/* Update tuple count */
@@ -495,6 +488,99 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
InstrAccumStack(&dst->instr, &add->instr);
}
+/*
+ * Specialized handling of instrumented ExecProcNode
+ *
+ * These functions are equivalent to running ExecProcNodeReal wrapped in
+ * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
+ * by checking the instrumentation options when the ExecProcNode pointer gets
+ * first set, and then using a special-purpose function for each. This results
+ * in a more optimized set of compiled instructions.
+ */
+
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
+
+/* Simplified pop: restore saved state instead of re-deriving from array */
+static inline void
+InstrPopStackTo(Instrumentation *prev)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.stack_size > 1 ? instr_stack.entries[instr_stack.stack_size - 2] == prev : &instr_top == prev);
+ instr_stack.stack_size--;
+ instr_stack.current = prev;
+}
+
+static inline TupleTableSlot *
+ExecProcNodeInstr(PlanState *node, bool need_timer, bool need_stack)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ if (need_stack)
+ InstrPushStack(&instr->instr);
+ if (need_timer)
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ if (need_timer)
+ InstrStopNodeTimer(instr);
+ if (need_stack)
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrFull(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsStackOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsTimerOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, false);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, false);
+}
+
+/*
+ * Returns an ExecProcNode wrapper that performs instrumentation calls,
+ * tailored to the instrumentation options enabled for the node.
+ */
+ExecProcNodeMtd
+InstrNodeSetupExecProcNode(NodeInstrumentation *instr)
+{
+ bool need_timer = instr->instr.need_timer;
+ bool need_stack = instr->instr.need_stack;
+
+ if (need_timer && need_stack)
+ return ExecProcNodeInstrFull;
+ else if (need_stack)
+ return ExecProcNodeInstrRowsStackOnly;
+ else if (need_timer)
+ return ExecProcNodeInstrRowsTimerOnly;
+ else
+ return ExecProcNodeInstrRowsOnly;
+}
+
/* Trigger instrumentation handling */
TriggerInstrumentation *
InstrAllocTrigger(QueryInstrumentation *qinstr, int n)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index c34a1cfff42..a2590a2e5b1 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -294,6 +294,11 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
+typedef struct TupleTableSlot TupleTableSlot;
+typedef struct PlanState PlanState;
+typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+extern ExecProcNodeMtd InstrNodeSetupExecProcNode(NodeInstrumentation *instr);
+
extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr, int n);
extern void InstrStartTrigger(QueryInstrumentation *qinstr,
TriggerInstrumentation *tginstr);
--
2.47.1
[application/octet-stream] v10-0006-instrumentation-Use-Instrumentation-struct-for-p.patch (29.2K, 8-v10-0006-instrumentation-Use-Instrumentation-struct-for-p.patch)
From 01b2b833fdf45b29dc97c17ba3c69645276a8a3b Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 15 Mar 2026 21:44:58 -0700
Subject: [PATCH v10 6/9] instrumentation: Use Instrumentation struct for
parallel workers
This simplifies the DSM allocations a bit since we don't need to
separately allocate WAL and buffer usage, and makes it easier to add a
third stack-based struct, currently under discussion, in the future.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/access/brin/brin.c | 43 ++++++-----------
src/backend/access/gin/gininsert.c | 43 ++++++-----------
src/backend/access/nbtree/nbtsort.c | 43 ++++++-----------
src/backend/commands/vacuumparallel.c | 52 ++++++++-------------
src/backend/executor/execParallel.c | 66 ++++++++++++---------------
src/backend/executor/instrument.c | 14 +++---
src/include/executor/execParallel.h | 5 +-
src/include/executor/instrument.h | 4 +-
8 files changed, 99 insertions(+), 171 deletions(-)
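As a rough illustration of the allocation change, the sketch below keeps
one array with a combined per-worker slot instead of two parallel arrays.
It is a simplified model, not the patch's code: DemoInstr, DemoBufferUsage,
DemoWalUsage and the demo_* helpers are hypothetical, calloc stands in for
the shared-memory allocation, and the real Instrumentation struct carries
more fields than shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, reduced counters; the real structs have many more fields. */
typedef struct DemoBufferUsage
{
	long		hits;
	long		reads;
} DemoBufferUsage;

typedef struct DemoWalUsage
{
	long		records;
	long		bytes;
} DemoWalUsage;

/* Combined per-worker slot: one struct instead of two parallel arrays. */
typedef struct DemoInstr
{
	DemoBufferUsage buf;
	DemoWalUsage wal;
} DemoInstr;

/* One allocation (and, in the real code, one key) covers all workers. */
static DemoInstr *
demo_alloc_workers(size_t nworkers)
{
	return calloc(nworkers, sizeof(DemoInstr));
}

/* A worker writes its usage into its own slot when it finishes. */
static void
demo_worker_report(DemoInstr *slots, size_t worker, long hits, long wal_bytes)
{
	slots[worker].buf.hits += hits;
	slots[worker].wal.bytes += wal_bytes;
	slots[worker].wal.records += 1;
}

/* The leader sums the slots of all launched workers. */
static DemoInstr
demo_leader_accum(const DemoInstr *slots, size_t nlaunched)
{
	DemoInstr	total = {{0, 0}, {0, 0}};

	for (size_t i = 0; i < nlaunched; i++)
	{
		total.buf.hits += slots[i].buf.hits;
		total.buf.reads += slots[i].buf.reads;
		total.wal.records += slots[i].wal.records;
		total.wal.bytes += slots[i].wal.bytes;
	}
	return total;
}
```

With a single slot type, adding another kind of usage counter later means
extending the struct rather than introducing another array and another
lookup key at every call site.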
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1ceb2306954..1c95ec9f605 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -51,8 +51,7 @@
#define PARALLEL_KEY_BRIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -148,8 +147,7 @@ typedef struct BrinLeader
BrinShared *brinshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BrinLeader;
/*
@@ -2387,8 +2385,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BrinShared *brinshared;
Sharedsort *sharedsort;
BrinLeader *brinleader = palloc0_object(BrinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -2430,18 +2427,14 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -2514,15 +2507,12 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -2533,8 +2523,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
brinleader->snapshot = snapshot;
- brinleader->walusage = walusage;
- brinleader->bufferusage = bufferusage;
+ brinleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -2573,7 +2562,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2887,8 +2876,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2949,11 +2937,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 04cd53916ca..51bb098a2a2 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -45,8 +45,7 @@
#define PARALLEL_KEY_GIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -138,8 +137,7 @@ typedef struct GinLeader
GinBuildShared *ginshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} GinLeader;
typedef struct
@@ -945,8 +943,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
GinBuildShared *ginshared;
Sharedsort *sharedsort;
GinLeader *ginleader = palloc0_object(GinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -987,18 +984,14 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1066,15 +1059,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1085,8 +1075,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
ginleader->ginshared = ginshared;
ginleader->sharedsort = sharedsort;
ginleader->snapshot = snapshot;
- ginleader->walusage = walusage;
- ginleader->bufferusage = bufferusage;
+ ginleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1125,7 +1114,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2118,8 +2107,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2199,11 +2187,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 6a261c8dcbd..504b34cc906 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -66,8 +66,7 @@
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
#define PARALLEL_KEY_TUPLESORT_SPOOL2 UINT64CONST(0xA000000000000003)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000004)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xA000000000000005)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xA000000000000006)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xA000000000000005)
/*
* DISABLE_LEADER_PARTICIPATION disables the leader's participation in
@@ -195,8 +194,7 @@ typedef struct BTLeader
Sharedsort *sharedsort;
Sharedsort *sharedsort2;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BTLeader;
/*
@@ -1408,8 +1406,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
Sharedsort *sharedsort2;
BTSpool *btspool = buildstate->spool;
BTLeader *btleader = palloc0_object(BTLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -1462,18 +1459,14 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1560,15 +1553,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1580,8 +1570,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btleader->sharedsort = sharedsort;
btleader->sharedsort2 = sharedsort2;
btleader->snapshot = snapshot;
- btleader->walusage = walusage;
- btleader->bufferusage = bufferusage;
+ btleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1620,7 +1609,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1754,8 +1743,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
#ifdef BTREE_BUILD_STATS
@@ -1837,11 +1825,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
sharedsort2, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index c330c891c03..b5fed54fb85 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -47,9 +47,8 @@
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 2
-#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 3
-#define PARALLEL_VACUUM_KEY_WAL_USAGE 4
-#define PARALLEL_VACUUM_KEY_INDEX_STATS 5
+#define PARALLEL_VACUUM_KEY_INSTRUMENTATION 3
+#define PARALLEL_VACUUM_KEY_INDEX_STATS 4
/*
* Shared information among parallel workers. So this is allocated in the DSM
@@ -188,11 +187,8 @@ struct ParallelVacuumState
/* Shared dead items space among parallel vacuum workers */
TidStore *dead_items;
- /* Points to buffer usage area in DSM */
- BufferUsage *buffer_usage;
-
- /* Points to WAL usage area in DSM */
- WalUsage *wal_usage;
+ /* Points to instrumentation area in DSM */
+ Instrumentation *instr;
/*
* False if the index is totally unsuitable target for all parallel
@@ -250,8 +246,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PVShared *shared;
TidStore *dead_items;
PVIndStats *indstats;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *instr;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
@@ -304,18 +299,15 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage and WalUsage --
- * PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
+ * Estimate space for Instrumentation --
+ * PARALLEL_VACUUM_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
@@ -396,17 +388,13 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
pvs->shared = shared;
/*
- * Allocate space for each worker's BufferUsage and WalUsage; no need to
- * initialize
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
*/
- buffer_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, buffer_usage);
- pvs->buffer_usage = buffer_usage;
- wal_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
- pvs->wal_usage = wal_usage;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, instr);
+ pvs->instr = instr;
/* Store query string for workers */
if (debug_query_string)
@@ -749,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->instr[i]);
}
/*
@@ -1007,8 +995,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVShared *shared;
TidStore *dead_items;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *worker_instr;
int nindexes;
char *sharedquery;
ErrorContextCallback errcallback;
@@ -1102,11 +1089,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
parallel_vacuum_process_safe_indexes(&pvs);
/* Report buffer/WAL usage during parallel execution */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 0b18a05c434..7a390350564 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -60,13 +60,12 @@
#define PARALLEL_KEY_EXECUTOR_FIXED UINT64CONST(0xE000000000000001)
#define PARALLEL_KEY_PLANNEDSTMT UINT64CONST(0xE000000000000002)
#define PARALLEL_KEY_PARAMLISTINFO UINT64CONST(0xE000000000000003)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xE000000000000004)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000004)
#define PARALLEL_KEY_TUPLE_QUEUE UINT64CONST(0xE000000000000005)
-#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000006)
+#define PARALLEL_KEY_NODE_INSTRUMENTATION UINT64CONST(0xE000000000000006)
#define PARALLEL_KEY_DSA UINT64CONST(0xE000000000000007)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -625,8 +624,6 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_data;
char *pstmt_space;
char *paramlistinfo_space;
- BufferUsage *bufusage_space;
- WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
@@ -690,21 +687,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage.
+ * Estimate space for Instrumentation.
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
* looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
-
- /*
- * Same thing for WalUsage.
- */
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Estimate space for tuple queues. */
@@ -790,17 +780,18 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMLISTINFO, paramlistinfo_space);
SerializeParamList(estate->es_param_list_info, &paramlistinfo_space);
- /* Allocate space for each worker's BufferUsage; no need to initialize. */
- bufusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufusage_space);
- pei->buffer_usage = bufusage_space;
+ /*
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
+ */
+ {
+ Instrumentation *instr;
- /* Same for WalUsage. */
- walusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
- pei->wal_usage = walusage_space;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
+ pei->instrumentation = instr;
+ }
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -826,9 +817,9 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
InstrInitNode(&instrument[i], estate->es_instrument->instrument_options);
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_NODE_INSTRUMENTATION,
instrumentation);
- pei->instrumentation = instrumentation;
+ pei->node_instrumentation = instrumentation;
if (estate->es_jit_flags != PGJIT_NONE)
{
@@ -1230,7 +1221,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
* finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->instrumentation[i]);
pei->finished = true;
}
@@ -1244,10 +1235,10 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
- /* Accumulate instrumentation, if any. */
- if (pei->instrumentation)
+ /* Accumulate node instrumentation, if any. */
+ if (pei->node_instrumentation)
ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
+ pei->node_instrumentation);
/* Accumulate JIT instrumentation, if any. */
if (pei->jit_instrumentation)
@@ -1471,8 +1462,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
@@ -1487,7 +1476,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Set up DestReceiver, SharedExecutorInstrumentation, and QueryDesc. */
receiver = ExecParallelGetReceiver(seg, toc);
- instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
+ instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_NODE_INSTRUMENTATION, true);
if (instrumentation != NULL)
instrument_options = instrumentation->instrument_options;
jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
@@ -1545,11 +1534,12 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecutorFinish(queryDesc);
/* Report buffer/WAL usage during parallel execution. */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ {
+ Instrumentation *worker_instr;
+
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
+ }
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6892706a83a..09d5ffe8651 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -322,11 +322,12 @@ InstrStartParallelQuery(void)
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst)
{
InstrQueryStopFinalize(qinstr);
- memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
- memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+ dst->need_stack = qinstr->instr.need_stack;
+ memcpy(&dst->bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(&dst->walusage, &qinstr->instr.walusage, sizeof(WalUsage));
}
/*
@@ -342,12 +343,11 @@ InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUs
* activity is accumulated.
*/
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(Instrumentation *instr)
{
- BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
- WalUsageAdd(&instr_stack.current->walusage, walusage);
+ InstrAccumStack(instr_stack.current, instr);
- WalUsageAdd(&pgWalUsage, walusage);
+ WalUsageAdd(&pgWalUsage, &instr->walusage);
}
/* Node instrumentation handling */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..6c8b602d07f 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -25,9 +25,8 @@ typedef struct ParallelExecutorInfo
{
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
- BufferUsage *buffer_usage; /* points to bufusage area in DSM */
- WalUsage *wal_usage; /* walusage area in DSM */
- SharedExecutorInstrumentation *instrumentation; /* optional */
+ Instrumentation *instrumentation; /* instrumentation area in DSM */
+ SharedExecutorInstrumentation *node_instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
dsa_pointer param_exec; /* serialized PARAM_EXEC parameters */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 2d218dc2a15..c34a1cfff42 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -283,8 +283,8 @@ extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst);
+extern void InstrAccumParallelQuery(Instrumentation *instr);
extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
--
2.47.1
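
For anyone skimming the diff above, here is a minimal standalone sketch of the
shape this patch moves to: one combined per-worker slot in shared memory
instead of two parallel arrays (BufferUsage[] and WalUsage[]). All Demo* types
and demo_* functions are hypothetical, heavily simplified stand-ins and not the
PostgreSQL definitions. The real Instrumentation struct carries more fields
(need_stack, for instance), and the real InstrAccumParallelQuery goes through
InstrAccumStack and also updates pgWalUsage, none of which is modelled here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins; the real BufferUsage/WalUsage have many more fields. */
typedef struct DemoBufferUsage
{
	int64_t		shared_blks_hit;
	int64_t		shared_blks_read;
} DemoBufferUsage;

typedef struct DemoWalUsage
{
	int64_t		wal_records;
	uint64_t	wal_bytes;
} DemoWalUsage;

/* One combined slot per worker, replacing two separately keyed arrays. */
typedef struct DemoInstrumentation
{
	DemoBufferUsage bufusage;
	DemoWalUsage walusage;
} DemoInstrumentation;

/*
 * Worker side: publish the worker's local totals into its own slot.
 * Loosely mirrors what InstrEndParallelQuery does with its dst argument.
 */
static void
demo_end_parallel_query(const DemoInstrumentation *local,
						DemoInstrumentation *slots,
						int nworkers, int worker_number)
{
	assert(worker_number >= 0 && worker_number < nworkers);
	memcpy(&slots[worker_number].bufusage, &local->bufusage,
		   sizeof(DemoBufferUsage));
	memcpy(&slots[worker_number].walusage, &local->walusage,
		   sizeof(DemoWalUsage));
}

/*
 * Leader side: after all workers have finished, fold every launched
 * worker's slot into a running total with a single loop over one array.
 */
static void
demo_accum_parallel_query(DemoInstrumentation *total,
						  const DemoInstrumentation *slots,
						  int nworkers_launched)
{
	for (int i = 0; i < nworkers_launched; i++)
	{
		total->bufusage.shared_blks_hit += slots[i].bufusage.shared_blks_hit;
		total->bufusage.shared_blks_read += slots[i].bufusage.shared_blks_read;
		total->walusage.wal_records += slots[i].walusage.wal_records;
		total->walusage.wal_bytes += slots[i].walusage.wal_bytes;
	}
}
```

In this sketch a caller would allocate nworkers slots once, have each worker
call demo_end_parallel_query() with its own worker number, and have the leader
call demo_accum_parallel_query() only after waiting for the workers, which is
the same ordering the existing "must wait for workers to finish" comments
require.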
[application/octet-stream] v10-0009-Add-pg_session_buffer_usage-contrib-module.patch (29.3K, 9-v10-0009-Add-pg_session_buffer_usage-contrib-module.patch)
From 7f30babc3ae5f2e4470fcee9bf6d7aa20fadb2ce Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:41 -0800
Subject: [PATCH v10 9/9] Add pg_session_buffer_usage contrib module
This is intended for testing instrumentation-related logic as it pertains
to the top-level stack, which is maintained as a running total. There is
currently no in-core user that consumes the top-level values in this
manner. The module is especially useful in abort situations, where it helps
ensure we don't lose activity due to incorrect handling of unfinalized node
stacks.
---
contrib/Makefile | 1 +
contrib/meson.build | 1 +
contrib/pg_session_buffer_usage/Makefile | 23 ++
.../expected/pg_session_buffer_usage.out | 342 ++++++++++++++++++
contrib/pg_session_buffer_usage/meson.build | 34 ++
.../pg_session_buffer_usage--1.0.sql | 31 ++
.../pg_session_buffer_usage.c | 95 +++++
.../pg_session_buffer_usage.control | 5 +
.../sql/pg_session_buffer_usage.sql | 245 +++++++++++++
9 files changed, 777 insertions(+)
create mode 100644 contrib/pg_session_buffer_usage/Makefile
create mode 100644 contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
create mode 100644 contrib/pg_session_buffer_usage/meson.build
create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
create mode 100644 contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
diff --git a/contrib/Makefile b/contrib/Makefile
index dd04c20acd2..ac04f9eb997 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -36,6 +36,7 @@ SUBDIRS = \
pg_overexplain \
pg_plan_advice \
pg_prewarm \
+ pg_session_buffer_usage \
pg_stat_statements \
pg_surgery \
pg_trgm \
diff --git a/contrib/meson.build b/contrib/meson.build
index 5a752eac347..2b1399e56f3 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -51,6 +51,7 @@ subdir('pg_overexplain')
subdir('pg_plan_advice')
subdir('pg_prewarm')
subdir('pgrowlocks')
+subdir('pg_session_buffer_usage')
subdir('pg_stat_statements')
subdir('pgstattuple')
subdir('pg_surgery')
diff --git a/contrib/pg_session_buffer_usage/Makefile b/contrib/pg_session_buffer_usage/Makefile
new file mode 100644
index 00000000000..75bd8e09b3d
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_session_buffer_usage/Makefile
+
+MODULE_big = pg_session_buffer_usage
+OBJS = \
+ $(WIN32RES) \
+ pg_session_buffer_usage.o
+
+EXTENSION = pg_session_buffer_usage
+DATA = pg_session_buffer_usage--1.0.sql
+PGFILEDESC = "pg_session_buffer_usage - show buffer usage statistics for the current session"
+
+REGRESS = pg_session_buffer_usage
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_session_buffer_usage
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out b/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
new file mode 100644
index 00000000000..5e3f90a7b69
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
@@ -0,0 +1,342 @@
+LOAD 'pg_session_buffer_usage';
+CREATE EXTENSION pg_session_buffer_usage;
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM pg_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+ ok
+----
+ t
+(1 row)
+
+-- Verify counters increase after buffer activity
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM pg_session_buffer_usage();
+ blocks_increased
+------------------
+ t
+(1 row)
+
+DROP TABLE test_buf_activity;
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM pg_session_buffer_usage(), dc_serial_result s;
+ leader_buffers_match
+----------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_dc_tab, dc_serial_result;
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+ERROR: division by zero
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM pg_session_buffer_usage();
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+DROP TABLE exc_tab;
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+ERROR: duplicate key value violates unique constraint "unique_tab_a_key"
+DETAIL: Key (a)=(1) already exists.
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM pg_session_buffer_usage();
+ unique_violation_buffers_visible
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE unique_tab;
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT subxact_exc_func();
+ subxact_exc_func
+------------------
+ caught
+(1 row)
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM pg_session_buffer_usage();
+ subxact_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT cursor_exc_func();
+ cursor_exc_func
+-----------------------
+ caught after 250 rows
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM pg_session_buffer_usage();
+ cursor_subxact_buffers_visible
+--------------------------------
+ t
+(1 row)
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM pg_session_buffer_usage();
+-- Now trigger the same scan via a trigger that errors
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+ERROR: trigger error
+CONTEXT: PL/pgSQL function trig_err_func() line 4 at RAISE
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM pg_session_buffer_usage(), trig_serial_result s;
+ trigger_abort_buffers_propagated
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset
+-------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CONTEXT: parallel worker
+RESET debug_parallel_query;
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM pg_session_buffer_usage(), par_abort_serial_result s;
+ worker_abort_buffers_not_propagated
+-------------------------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_abort_tab, par_abort_serial_result;
+-- Cleanup
+DROP EXTENSION pg_session_buffer_usage;
diff --git a/contrib/pg_session_buffer_usage/meson.build b/contrib/pg_session_buffer_usage/meson.build
new file mode 100644
index 00000000000..34c7502beb4
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/meson.build
@@ -0,0 +1,34 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+pg_session_buffer_usage_sources = files(
+ 'pg_session_buffer_usage.c',
+)
+
+if host_system == 'windows'
+ pg_session_buffer_usage_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'pg_session_buffer_usage',
+ '--FILEDESC', 'pg_session_buffer_usage - show buffer usage statistics for the current session',])
+endif
+
+pg_session_buffer_usage = shared_module('pg_session_buffer_usage',
+ pg_session_buffer_usage_sources,
+ kwargs: contrib_mod_args,
+)
+contrib_targets += pg_session_buffer_usage
+
+install_data(
+ 'pg_session_buffer_usage--1.0.sql',
+ 'pg_session_buffer_usage.control',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'pg_session_buffer_usage',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'pg_session_buffer_usage',
+ ],
+ },
+}
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql b/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
new file mode 100644
index 00000000000..b300fdbc643
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
@@ -0,0 +1,31 @@
+/* contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_session_buffer_usage" to load this file. \quit
+
+CREATE FUNCTION pg_session_buffer_usage(
+ OUT shared_blks_hit bigint,
+ OUT shared_blks_read bigint,
+ OUT shared_blks_dirtied bigint,
+ OUT shared_blks_written bigint,
+ OUT local_blks_hit bigint,
+ OUT local_blks_read bigint,
+ OUT local_blks_dirtied bigint,
+ OUT local_blks_written bigint,
+ OUT temp_blks_read bigint,
+ OUT temp_blks_written bigint,
+ OUT shared_blk_read_time double precision,
+ OUT shared_blk_write_time double precision,
+ OUT local_blk_read_time double precision,
+ OUT local_blk_write_time double precision,
+ OUT temp_blk_read_time double precision,
+ OUT temp_blk_write_time double precision
+)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_session_buffer_usage'
+LANGUAGE C PARALLEL RESTRICTED;
+
+CREATE FUNCTION pg_session_buffer_usage_reset()
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_session_buffer_usage_reset'
+LANGUAGE C PARALLEL RESTRICTED;
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
new file mode 100644
index 00000000000..f869956b3a9
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_session_buffer_usage.c
+ * show buffer usage statistics for the current session
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/instrument.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "pg_session_buffer_usage",
+ .version = PG_VERSION
+);
+
+#define NUM_BUFFER_USAGE_COLUMNS 16
+
+PG_FUNCTION_INFO_V1(pg_session_buffer_usage);
+PG_FUNCTION_INFO_V1(pg_session_buffer_usage_reset);
+
+#define HAVE_INSTR_STACK 1 /* Change to 0 when testing before stack
+ * change */
+
+/*
+ * SQL function: pg_session_buffer_usage()
+ *
+ * Returns a single row with all BufferUsage counters accumulated since the
+ * start of the session. Excludes any usage not yet added to the top of the
+ * stack (e.g. if this gets called inside a statement that also had buffer
+ * activity).
+ */
+Datum
+pg_session_buffer_usage(PG_FUNCTION_ARGS)
+{
+ TupleDesc tupdesc;
+ Datum values[NUM_BUFFER_USAGE_COLUMNS];
+ bool nulls[NUM_BUFFER_USAGE_COLUMNS];
+ BufferUsage *usage;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ memset(nulls, 0, sizeof(nulls));
+
+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif
+
+ values[0] = Int64GetDatum(usage->shared_blks_hit);
+ values[1] = Int64GetDatum(usage->shared_blks_read);
+ values[2] = Int64GetDatum(usage->shared_blks_dirtied);
+ values[3] = Int64GetDatum(usage->shared_blks_written);
+ values[4] = Int64GetDatum(usage->local_blks_hit);
+ values[5] = Int64GetDatum(usage->local_blks_read);
+ values[6] = Int64GetDatum(usage->local_blks_dirtied);
+ values[7] = Int64GetDatum(usage->local_blks_written);
+ values[8] = Int64GetDatum(usage->temp_blks_read);
+ values[9] = Int64GetDatum(usage->temp_blks_written);
+ values[10] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time));
+ values[11] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time));
+ values[12] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time));
+ values[13] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time));
+ values[14] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+ values[15] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * SQL function: pg_session_buffer_usage_reset()
+ *
+ * Resets all BufferUsage counters at the top of the instrumentation stack to
+ * zero. Useful in tests to avoid the baseline/delta pattern.
+ */
+Datum
+pg_session_buffer_usage_reset(PG_FUNCTION_ARGS)
+{
+#if HAVE_INSTR_STACK
+ memset(&instr_top.bufusage, 0, sizeof(BufferUsage));
+#else
+ memset(&pgBufferUsage, 0, sizeof(BufferUsage));
+#endif
+
+ PG_RETURN_VOID();
+}
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
new file mode 100644
index 00000000000..fabd05ee024
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
@@ -0,0 +1,5 @@
+# pg_session_buffer_usage extension
+comment = 'show buffer usage statistics for the current session'
+default_version = '1.0'
+module_pathname = '$libdir/pg_session_buffer_usage'
+relocatable = true
diff --git a/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql b/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
new file mode 100644
index 00000000000..e92068168b2
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
@@ -0,0 +1,245 @@
+LOAD 'pg_session_buffer_usage';
+CREATE EXTENSION pg_session_buffer_usage;
+
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM pg_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+
+-- Verify counters increase after buffer activity
+SELECT pg_session_buffer_usage_reset();
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM pg_session_buffer_usage();
+
+DROP TABLE test_buf_activity;
+
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+SELECT count(*) FROM par_dc_tab;
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM pg_session_buffer_usage(), dc_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_dc_tab, dc_serial_result;
+
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+
+SELECT pg_session_buffer_usage_reset();
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP TABLE exc_tab;
+
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+
+SELECT pg_session_buffer_usage_reset();
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP TABLE unique_tab;
+
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT subxact_exc_func();
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT cursor_exc_func();
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT pg_session_buffer_usage_reset();
+SELECT count(*) FROM trig_work_tab;
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM pg_session_buffer_usage();
+
+-- Now trigger the same scan via a trigger that errors
+SELECT pg_session_buffer_usage_reset();
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM pg_session_buffer_usage(), trig_serial_result s;
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+
+SELECT pg_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+RESET debug_parallel_query;
+
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM pg_session_buffer_usage(), par_abort_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_abort_tab, par_abort_serial_result;
+
+-- Cleanup
+DROP EXTENSION pg_session_buffer_usage;
--
2.47.1
[application/octet-stream] v10-0008-Index-scans-Show-table-buffer-accesses-separatel.patch (22.7K, 10-v10-0008-Index-scans-Show-table-buffer-accesses-separatel.patch)
From 6cef5306c481e55a6e3c4a985b9943605774f707 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:19 -0800
Subject: [PATCH v10 8/9] Index scans: Show table buffer accesses separately in
EXPLAIN ANALYZE
This sets up a separate instrumentation stack that is used while an
Index Scan or Index Only Scan accesses the table, for example because
additional data is needed.
EXPLAIN ANALYZE now shows "Table Buffers" representing that activity.
The activity is also included in the regular "Buffers" counts, together
with index activity and that of any child nodes.
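For illustration only, here is a minimal sketch of the accounting described above, using a toy model rather than the real executor code. All names (ToyInstr, toy_push, toy_pop, toy_count_hit, toy_finalize_child, toy_index_scan) are hypothetical and are not part of PostgreSQL. The sketch assumes that table accesses are counted against a separate entry while it is pushed, and that this entry is added into the node's own counters at finalize time.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a per-node instrumentation entry (hypothetical). */
typedef struct ToyInstr
{
	long		shared_hit;
} ToyInstr;

/* Entry that currently receives buffer counts; NULL means "not tracking". */
static ToyInstr *toy_current = NULL;

/* Make 'entry' the current target and return the previous one. */
static ToyInstr *
toy_push(ToyInstr *entry)
{
	ToyInstr   *prev = toy_current;

	toy_current = entry;
	return prev;
}

/* Restore the entry that was current before the matching toy_push(). */
static void
toy_pop(ToyInstr *prev)
{
	toy_current = prev;
}

/* Record one shared buffer hit against whatever entry is current. */
static void
toy_count_hit(void)
{
	if (toy_current != NULL)
		toy_current->shared_hit++;
}

/* Fold a child's counters into its parent (assumed finalize behaviour). */
static void
toy_finalize_child(const ToyInstr *child, ToyInstr *parent)
{
	parent->shared_hit += child->shared_hit;
}

/*
 * Simulate an index scan: index page hits go to 'node', heap page hits go
 * to 'table' while it is pushed, and 'table' is added to 'node' at the end.
 */
static void
toy_index_scan(ToyInstr *node, ToyInstr *table, int index_pages, int heap_pages)
{
	ToyInstr   *outer = toy_push(node);

	for (int i = 0; i < index_pages; i++)
		toy_count_hit();

	for (int i = 0; i < heap_pages; i++)
	{
		ToyInstr   *prev = toy_push(table);

		toy_count_hit();
		toy_pop(prev);
	}

	toy_pop(outer);
	toy_finalize_child(table, node);
}
```

Calling toy_index_scan() with two zeroed entries, 3 index pages and 11 heap pages leaves table.shared_hit at 11 and node.shared_hit at 14. This has the same shape as the "Table Buffers: shared hit=11" and "Buffers: shared hit=14" lines in the perform.sgml hunk of this patch. The split into 3 index pages is an assumption made for the sketch.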
Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
doc/src/sgml/perform.sgml | 13 ++-
doc/src/sgml/ref/explain.sgml | 1 +
src/backend/commands/explain.c | 47 +++++++--
src/backend/executor/execProcnode.c | 53 ++++++++++
src/backend/executor/nodeBitmapIndexscan.c | 2 +-
src/backend/executor/nodeIndexonlyscan.c | 27 ++++-
src/backend/executor/nodeIndexscan.c | 113 ++++++++++++++++-----
src/include/executor/instrument_node.h | 5 +
src/include/nodes/execnodes.h | 1 +
9 files changed, 224 insertions(+), 38 deletions(-)
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 604e8578a8d..d28f4f22535 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -734,6 +734,7 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10)
Index Cond: (unique2 = t1.unique2)
Index Searches: 10
+ Table Buffers: shared hit=10
Buffers: shared hit=24 read=6
Planning:
Buffers: shared hit=15 dirtied=9
@@ -1005,7 +1006,8 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
Index Cond: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Index Recheck: 1
Index Searches: 1
- Buffers: shared hit=1
+ Table Buffers: shared hit=1
+ Buffers: shared hit=2
Planning Time: 0.039 ms
Execution Time: 0.098 ms
</screen>
@@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
then rejected by a recheck of the index condition. This happens because a
GiST index is <quote>lossy</quote> for polygon containment tests: it actually
returns the rows with polygons that overlap the target, and then we have
- to do the exact containment test on those rows.
+ to do the exact containment test on those rows. The <literal>Table Buffers</literal>
+ counts indicate how many buffer accesses were performed on the table rather than
+ the index. These counts are also included in the <literal>Buffers</literal> counts.
</para>
<para>
@@ -1203,13 +1207,14 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
QUERY PLAN
-------------------------------------------------------------------&zwsp;------------------------------------------------------------
Limit (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2.00 loops=1)
- Buffers: shared hit=16
+ Buffers: shared hit=14
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2.00 loops=1)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Index Searches: 1
- Buffers: shared hit=16
+ Table Buffers: shared hit=11
+ Buffers: shared hit=14
Planning Time: 0.077 ms
Execution Time: 0.086 ms
</screen>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5b8b521802e..71070736acb 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -509,6 +509,7 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
-> Index Scan using test_pkey on test (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99.00 loops=1)
Index Cond: ((id > 100) AND (id < 200))
Index Searches: 1
+ Table Buffers: shared hit=1
Buffers: shared hit=4
Planning Time: 0.244 ms
Execution Time: 0.073 ms
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index dc5e63955bc..eef343a9d97 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -144,7 +144,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -610,7 +610,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
- show_buffer_usage(es, bufusage);
+ show_buffer_usage(es, bufusage, NULL);
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -1027,7 +1027,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
es->indent--;
}
}
@@ -1041,7 +1041,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -1969,6 +1969,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexScanState *) planstate)->iss_Instrument->table_instr.bufusage, "Table");
break;
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1986,6 +1989,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyFloat("Heap Fetches", NULL,
planstate->instrument->ntuples2, 0, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexOnlyScanState *) planstate)->ioss_Instrument->table_instr.bufusage, "Table");
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
@@ -2287,7 +2293,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->instr.bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage, NULL);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
@@ -2306,7 +2312,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->instr.bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage, NULL);
if (es->wal)
show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
@@ -4106,7 +4112,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
* Show buffer usage details. This better be sync with peek_buffer_usage.
*/
static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
{
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -4131,6 +4137,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "Buffers:");
if (has_shared)
@@ -4186,6 +4194,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared_timing || has_local_timing || has_temp_timing)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "I/O Timings:");
if (has_shared_timing)
@@ -4227,6 +4237,14 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
else
{
+ char *buffers_title = NULL;
+
+ if (title)
+ {
+ buffers_title = psprintf("%s Buffers", title);
+ ExplainOpenGroup(buffers_title, buffers_title, true, es);
+ }
+
ExplainPropertyInteger("Shared Hit Blocks", NULL,
usage->shared_blks_hit, es);
ExplainPropertyInteger("Shared Read Blocks", NULL,
@@ -4247,8 +4265,20 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
usage->temp_blks_read, es);
ExplainPropertyInteger("Temp Written Blocks", NULL,
usage->temp_blks_written, es);
+
+ if (buffers_title)
+ ExplainCloseGroup(buffers_title, buffers_title, true, es);
+
if (track_io_timing)
{
+ char *timings_title = NULL;
+
+ if (title)
+ {
+ timings_title = psprintf("%s I/O Timings", title);
+ ExplainOpenGroup(timings_title, timings_title, true, es);
+ }
+
ExplainPropertyFloat("Shared I/O Read Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
3, es);
@@ -4267,6 +4297,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
ExplainPropertyFloat("Temp I/O Write Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
3, es);
+
+ if (timings_title)
+ ExplainCloseGroup(timings_title, timings_title, true, es);
}
}
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9f5698063f0..71a897f2b84 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -418,6 +418,29 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
result->instrument = InstrAllocNode(estate->es_instrument,
result->async_capable);
+ /*
+ * IndexScan / IndexOnlyScan track table and index access separately.
+ *
+ * We intentionally don't collect timing for them (even if enabled), since
+ * we don't need it, and executor nodes call InstrPushStack /
+ * InstrPopStack (instead of the full InstrNode*) to reduce overhead.
+ */
+ if (estate->es_instrument && (estate->es_instrument->instrument_options & INSTRUMENT_BUFFERS) != 0)
+ {
+ if (IsA(result, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, result);
+
+ InstrInitOptions(&iss->iss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ }
+ else if (IsA(result, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, result);
+
+ InstrInitOptions(&ioss->ioss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ }
+ }
+
return result;
}
@@ -837,8 +860,24 @@ ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
return false;
if (node->instrument)
+ {
InstrQueryRememberChild(parent, &node->instrument->instr);
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ InstrQueryRememberChild(parent, &iss->iss_Instrument->table_instr);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ InstrQueryRememberChild(parent, &ioss->ioss_Instrument->table_instr);
+ }
+ }
+
return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
}
@@ -880,6 +919,20 @@ ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
if (!node->instrument)
return false;
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ InstrFinalizeChild(&iss->iss_Instrument->table_instr, &node->instrument->instr);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ InstrFinalizeChild(&ioss->ioss_Instrument->table_instr, &node->instrument->instr);
+ }
+
InstrFinalizeChild(&node->instrument->instr, parent);
return false;
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 70c55ee6d61..63e24a0bcd4 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -276,7 +276,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of bitmap index scans if requested */
if (estate->es_instrument)
- indexstate->biss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->biss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9eab81fd1c8..66b02788b3c 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -67,6 +67,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -83,6 +84,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->ioss_Instrument && node->ioss_Instrument->table_instr.need_stack)
+ table_instr = &node->ioss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -163,11 +167,22 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
&node->ioss_VMBuffer))
{
+ bool found;
+
/*
* Rats, we have to visit the heap to check visibility.
*/
InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ found = index_fetch_heap(scandesc, node->ioss_TableSlot);
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (!found)
continue; /* no visible tuple, try next index entry */
ExecClearTuple(node->ioss_TableSlot);
@@ -434,6 +449,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->ioss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->ioss_Instrument->table_instr);
}
/*
@@ -608,7 +624,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
/* Set up instrumentation of index-only scans if requested */
if (estate->es_instrument)
- indexstate->ioss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->ioss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -893,4 +909,11 @@ ExecIndexOnlyScanRetrieveInstrumentation(IndexOnlyScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->ioss_SharedInfo = palloc(size);
memcpy(node->ioss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->ioss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->ioss_Instrument->table_instr,
+ &node->ioss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 06143e94c5a..e66b6d6407b 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,7 +85,10 @@ IndexNext(IndexScanState *node)
ExprContext *econtext;
ScanDirection direction;
IndexScanDesc scandesc;
+ ItemPointer tid;
TupleTableSlot *slot;
+ bool found;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -102,6 +105,9 @@ IndexNext(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -130,8 +136,24 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (unlikely(!found))
+ continue;
+
CHECK_FOR_INTERRUPTS();
/*
@@ -179,6 +201,7 @@ IndexNextWithReorder(IndexScanState *node)
Datum *lastfetched_vals;
bool *lastfetched_nulls;
int cmp;
+ Instrumentation *table_instr = NULL;
estate = node->ss.ps.state;
@@ -198,6 +221,9 @@ IndexNextWithReorder(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -259,36 +285,67 @@ IndexNextWithReorder(IndexScanState *node)
}
/*
- * Fetch next tuple from the index.
+ * Fetch next valid tuple from the index.
*/
-next_indextuple:
- if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+ for (;;)
{
+ ItemPointer tid;
+ bool found;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scandesc, ForwardScanDirection);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ /*
+ * No more tuples from the index. But we still need to drain
+ * any remaining tuples from the queue before we're done.
+ */
+ node->iss_ReachedEnd = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scandesc->xs_heaptid));
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
/*
- * No more tuples from the index. But we still need to drain any
- * remaining tuples from the queue before we're done.
+ * If the index was lossy, we have to recheck the index quals and
+ * ORDER BY expressions using the fetched tuple.
*/
- node->iss_ReachedEnd = true;
- continue;
- }
-
- /*
- * If the index was lossy, we have to recheck the index quals and
- * ORDER BY expressions using the fetched tuple.
- */
- if (scandesc->xs_recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->indexqualorig, econtext))
+ if (found && scandesc->xs_recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- /* allow this loop to be cancellable */
- CHECK_FOR_INTERRUPTS();
- goto next_indextuple;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->indexqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ /* allow this loop to be cancellable */
+ CHECK_FOR_INTERRUPTS();
+ continue;
+ }
}
+
+ if (found)
+ break;
}
+ /* No more index entries, re-run to clear the reorder queue */
+ if (node->iss_ReachedEnd)
+ continue;
+
if (scandesc->xs_recheckorderby)
{
econtext->ecxt_scantuple = slot;
@@ -814,6 +871,7 @@ ExecEndIndexScan(IndexScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->iss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->iss_Instrument->table_instr);
}
/*
@@ -976,7 +1034,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of index scans if requested */
if (estate->es_instrument)
- indexstate->iss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->iss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -1826,4 +1884,11 @@ ExecIndexScanRetrieveInstrumentation(IndexScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->iss_SharedInfo = palloc(size);
memcpy(node->iss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->iss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->iss_Instrument->table_instr,
+ &node->iss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 8847d7f94fa..e8531b84efa 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -18,6 +18,8 @@
#ifndef INSTRUMENT_NODE_H
#define INSTRUMENT_NODE_H
+#include "executor/instrument.h"
+
/* ---------------------
* Instrumentation information for aggregate function execution
@@ -48,6 +50,9 @@ typedef struct IndexScanInstrumentation
{
/* Index search count (incremented with pgstat_count_index_scan call) */
uint64 nsearches;
+
+ /* Instrumentation utilized for tracking buffer usage during table access */
+ Instrumentation table_instr;
} IndexScanInstrumentation;
/*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index aef1003f608..ef951643339 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1790,6 +1790,7 @@ typedef struct IndexOnlyScanState
struct IndexScanDescData *ioss_ScanDesc;
IndexScanInstrumentation *ioss_Instrument;
SharedIndexScanInstrumentation *ioss_SharedInfo;
+
TupleTableSlot *ioss_TableSlot;
Buffer ioss_VMBuffer;
Size ioss_PscanLen;
--
2.47.1
^ permalink raw reply [nested|flat] 42+ messages in thread
* Re: Stack-based tracking of per-node WAL/buffer usage
2025-08-31 23:57 Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-09-04 20:23 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-09-09 19:35 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 11:28 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 12:59 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-10-31 07:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-08 04:27 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-09 21:55 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-09 23:45 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-10 08:12 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-14 20:49 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-17 06:21 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-17 08:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-18 20:49 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-18 23:36 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-19 00:45 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-23 14:41 ` Re: Stack-based tracking of per-node WAL/buffer usage Heikki Linnakangas <[email protected]>
2026-03-23 19:07 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-23 20:03 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-24 06:03 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
@ 2026-03-24 22:59 ` Zsolt Parragi <[email protected]>
2026-03-25 05:34 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
1 sibling, 1 reply; 42+ messages in thread
From: Zsolt Parragi @ 2026-03-24 22:59 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; Andres Freund <[email protected]>; Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; Peter Smith <[email protected]>
I like the new approach, but doesn't `EXPLAIN (BUFFERS)` leak some
memory because the resource owner isn't registered on that path? It
seems to be visible with pg_log_backend_memory_contexts.
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += val; \
+ instr_stack.current->bufusage.fld += val; \
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += val; \
+ instr_stack.current->walusage.fld += val; \
Nitpick, but these could use += (val)
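For readers wondering why the parentheses matter: below is a minimal sketch using hypothetical macros (SCALE_BAD and SCALE_GOOD are made up for illustration and are not in the patch). With `+=` the unparenthesized form is unlikely to misparse, because assignment binds more loosely than almost any expression passed as an argument, so for the macros above this is mostly defensive style. The hazard is easier to see with a tighter-binding operator:

```c
#include <assert.h>

/* Hypothetical macros for illustration only; not part of the patch. */
#define SCALE_BAD(x, val)   ((x) * val)     /* argument not parenthesized */
#define SCALE_GOOD(x, val)  ((x) * (val))   /* argument parenthesized */

/* Returns 1 if the two macros disagree for a compound argument. */
static int
macro_args_differ(void)
{
    int     bad = SCALE_BAD(2, 3 + 4);      /* expands to ((2) * 3 + 4) */
    int     good = SCALE_GOOD(2, 3 + 4);    /* expands to ((2) * (3 + 4)) */

    return bad != good;
}
```

Parenthesizing every macro argument keeps the expansion independent of whatever expression the caller passes in.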
^ permalink raw reply [nested|flat] 42+ messages in thread
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-03-25 05:34 UTC (permalink / raw)
To: Zsolt Parragi <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; Andres Freund <[email protected]>; Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; Peter Smith <[email protected]>
Hi Zsolt,
On Tue, Mar 24, 2026 at 3:59 PM Zsolt Parragi <[email protected]> wrote:
> I like the new approach, but doesn't `EXPLAIN (BUFFERS)` leak some
> memory because the resource owner isn't registered on that path? It
> seems to be visible with pg_log_backend_memory_contexts.
Ah, yes, good catch! It's nice how the separate memory context makes
this very clear now.
I think this was actually an existing problem that only surfaced now.
The issue is that EXPLAIN (BUFFERS) will allocate the per-query
instrumentation, but not actually use it, since the buffer tracking is
only relevant for planning, which has its own instrumentation. I can
fix this locally by adding the following to FreeExecutorState:
/*
* Make sure the instrumentation context gets freed. This usually gets
* re-parented under the per-query context in InstrQueryStopFinalize, but
* that won't happen during EXPLAIN (BUFFERS) since ExecutorFinish never
* gets called, so we would otherwise leak it in TopMemoryContext.
*/
if (estate->es_instrument && estate->es_instrument->instr.need_stack)
MemoryContextDelete(estate->es_instrument->instr_cxt);
I'll include this in the next revision unless I come up with a better
idea. FWIW, I also considered just not setting INSTRUMENT_BUFFERS in
ExplainOnePlan unless ANALYZE is active, but I think there might be
other cases where that doesn't work as expected, so I think the
explicit delete is better.
>
> #define INSTR_BUFUSAGE_ADD(fld,val) do { \
> - pgBufferUsage.fld += val; \
> + instr_stack.current->bufusage.fld += val; \
>
> #define INSTR_WALUSAGE_ADD(fld,val) do { \
> pgWalUsage.fld += val; \
> + instr_stack.current->walusage.fld += val; \
>
> Nitpick, but these could use += (val)
Ack, makes sense - I'll adjust in the next revision.
I'll give it a day or so for further feedback before posting the next update.
Thanks,
Lukas
--
Lukas Fittl
^ permalink raw reply [nested|flat] 42+ messages in thread
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Heikki Linnakangas @ 2026-03-25 10:47 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; Zsolt Parragi <[email protected]>; Andres Freund <[email protected]>; +Cc: Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; Peter Smith <[email protected]>
On 24/03/2026 08:03, Lukas Fittl wrote:
> Instead I've tried introducing a memory context for instrumentation
> managed as a resource owner, and I am now (for now) convinced that
> this is the right trade-off for the problem at hand.
Yes, that seems better.
This patch could use an overview README file; I'm struggling to
understand how this all works. Here's my understanding so far,
please correct me if I'm wrong:
There are *two* data structures tracking the Instrumentation nodes. The
patch only talks about a stack, but I think there's also implicitly a
tree in there.
Tree
----
All Instrumentation nodes are part of a tree. For example, if you have
two portals open, the tree might look like this:
Session - Query A - NestLoop - Seq Scan A
- Seq Scan B
- Query B - Seq Scan C
When a node is "finalized", its counters are added to its parent.
This tree is somewhat implicit in the patch. Each QueryInstrumentation
has a list of child nodes, but only unfinalized ones. Don't we need that
at the session level too? When a Query is released on abort, its
counters need to be added to the parent too. If I understand correctly,
the patch tries to use the stack for that, but it's confusing.
I think it would make the patch more clear to talk explicitly about the
tree, and represent it explicitly in the Instrumentation nodes. I.e. add
a "parent" pointer, or a "children" list, or both to the Instrumentation
struct.
Stack
-----
At all times, there's a stack that tracks what is the Instrumentation in
the tree that is *currently* executing. For example, while executing the
Seq Scan B, the stack would look like this:
0: Session
1: Query A
2: NestLoop
3: Seq Scan B
And when the code is sending a result row back to the client, while the
query is being executed, the stack would be just:
0: Session
In the patch, the stack is represented by an array. It could also be
implemented with a CurrentInstrumentation global variable, similar to
CurrentMemoryContext and CurrentResourceOwner.
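To make the stack description above concrete, here is a minimal sketch of an array-based stack. All names (DemoInstr, demo_push, demo_pop, demo_count_hit) are hypothetical and heavily simplified; the patch's real structs and functions differ. It only shows how a counter increment gets charged to whichever entry is currently on top:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified type; the patch's real struct differs. */
typedef struct DemoInstr
{
    const char *name;
    long        shared_hits;    /* stand-in for one buffer usage counter */
} DemoInstr;

#define DEMO_STACK_MAX 16

static DemoInstr *demo_stack[DEMO_STACK_MAX];
static int  demo_depth = 0;

/* Make 'instr' the entry that receives counter increments. */
static void
demo_push(DemoInstr *instr)
{
    assert(demo_depth < DEMO_STACK_MAX);
    demo_stack[demo_depth++] = instr;
}

/* Restore the entry that was current before the matching push. */
static void
demo_pop(DemoInstr *instr)
{
    assert(demo_depth > 0 && demo_stack[demo_depth - 1] == instr);
    demo_depth--;
}

/* Charge one buffer hit to whichever entry is on top of the stack. */
static void
demo_count_hit(void)
{
    assert(demo_depth > 0);
    demo_stack[demo_depth - 1]->shared_hits++;
}

/* Walk through the "Seq Scan B" example; returns hits charged to the scan. */
static long
demo_run(DemoInstr *session, DemoInstr *query,
         DemoInstr *nestloop, DemoInstr *scan_b)
{
    demo_push(session);
    demo_push(query);
    demo_push(nestloop);
    demo_push(scan_b);
    demo_count_hit();           /* attributed to Seq Scan B */
    demo_pop(scan_b);
    demo_count_hit();           /* attributed to NestLoop */
    demo_pop(nestloop);
    demo_pop(query);
    demo_count_hit();           /* attributed to Session */
    demo_pop(session);
    return scan_b->shared_hits;
}
```

In this sketch the counters stay with the entry that was on top when the activity happened; rolling them up to a parent is a separate step.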
Abort handling
--------------
On abort, two things need to happen:
1. Reset the stack to the appropriate level. This ensures that we
don't later try to update the counters on an Instrumentation node that
is going away with the abort. In the above example, the stack would be
reset to the 0: Session level.
2. Finalize all the Instrumentation nodes as part of the ResourceOwner
cleanup. All Instrumentation nodes that are released roll up their
counters to their parents.
Questions:
Is the stack always a path from the root of the tree, down to some node?
Or could you have e.g. recursion like A -> B -> C -> A? (I don't know if
it makes a difference, just wondering)
What happens if you release e.g. the NestLoop before its children? All
the Instrumentation nodes belonging to a query would usually be part of
the same ResourceOwner and there's no guarantee on what order the
resources are released.
- Heikki
^ permalink raw reply [nested|flat] 42+ messages in thread
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-03-26 00:41 UTC (permalink / raw)
To: Heikki Linnakangas <[email protected]>; +Cc: Zsolt Parragi <[email protected]>; Andres Freund <[email protected]>; Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; Peter Smith <[email protected]>
Hi Heikki,
On Wed, Mar 25, 2026 at 3:47 AM Heikki Linnakangas <[email protected]> wrote:
>
> On 24/03/2026 08:03, Lukas Fittl wrote:
> > Instead I've tried introducing a memory context for instrumentation
> > managed as a resource owner, and I am now (for now) convinced that
> > this is the right trade-off for the problem at hand.
>
> Yes, that seems better.
Thanks for reviewing!
> This patch could use an overview README file; I'm struggling to
> understand how this all works. Here's my understanding so far,
> please correct me if I'm wrong:
Sure, happy to put this together. I wonder where best to place it,
probably src/backend/executor/README.instrument?
> There are *two* data structures tracking the Instrumentation nodes. The
> patch only talks about a stack, but I think there's also implicitly a
> tree in there.
>
> Tree
> ----
>
> All Instrumentation nodes are part of a tree. For example, if you have
> two portals open, the tree might look like this:
>
> Session - Query A - NestLoop - Seq Scan A
> - Seq Scan B
>
> - Query B - Seq Scan C
>
> When a node is "finalized", its counters are added to its parent.
>
> This tree is somewhat implicit in the patch. Each QueryInstrumentation
> has a list of child nodes, but only unfinalized ones. Don't we need that
> at the session level too? When a Query is released on abort, its
> counters need to be added to the parent too. If I understand correctly,
> the patch tries to use the stack for that, but it's confusing.
If I follow you correctly, we're talking about the work that
InstrStopFinalize is doing (both in regular flow, and in abort):
void
InstrStopFinalize(Instrumentation *instr)
{
...
InstrAccumStack(instr_stack.current, instr);
}
The "instr_stack.current" global referenced here is effectively the
instrumentation that was active before InstrStart was called, and
would be a parent in the tree in that sense.
It's worth noting that on abort we don't care about the tree structure
below the aborted activity, i.e. each QueryInstrumentation acts as a
finalization point of sorts, with any tree structure below (e.g. that
of executor nodes) not being finalized to their respective parents,
but instead just getting added to the QueryInstrumentation they were
attached to (which then gets finalized into "instr_stack.current").
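As a rough illustration of the abort behavior described above, here is a minimal sketch. Every name in it (DemoInstr, DemoQuery, saved_depth, demo_query_abort) is hypothetical and does not match the patch; it assumes the query remembers the stack depth from before it started and keeps a flat list of its unfinalized children:

```c
#include <assert.h>

/* Hypothetical, simplified types; field and function names are made up. */
typedef struct DemoInstr
{
    long        shared_hits;    /* stand-in for one buffer usage counter */
} DemoInstr;

#define DEMO_MAX_CHILDREN 8
#define DEMO_STACK_MAX 16

typedef struct DemoQuery
{
    DemoInstr   instr;          /* the query's own totals */
    DemoInstr  *children[DEMO_MAX_CHILDREN];    /* unfinalized node entries */
    int         nchildren;
    int         saved_depth;    /* stack depth before the query started */
} DemoQuery;

static DemoInstr *demo_stack[DEMO_STACK_MAX];
static int  demo_depth = 0;

static void
demo_push(DemoInstr *instr)
{
    assert(demo_depth < DEMO_STACK_MAX);
    demo_stack[demo_depth++] = instr;
}

static void
demo_query_start(DemoQuery *query)
{
    query->saved_depth = demo_depth;
    demo_push(&query->instr);
}

/* Abort path: no matching pops will run, so unwind by depth instead. */
static void
demo_query_abort(DemoQuery *query)
{
    /* 1. Cancel out any pushes made since the query started. */
    demo_depth = query->saved_depth;

    /* 2. Fold every unfinalized child straight into the query, in any order. */
    for (int i = 0; i < query->nchildren; i++)
        query->instr.shared_hits += query->children[i]->shared_hits;
    query->nchildren = 0;

    /* 3. Add the query's total to the entry that was current before it. */
    assert(demo_depth > 0);
    demo_stack[demo_depth - 1]->shared_hits += query->instr.shared_hits;
}
```

Because step 2 ignores the parent/child relationships among executor nodes, the order in which those nodes would otherwise be released does not affect the totals in this sketch.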
> I think it would make the patch more clear to talk explicitly about the
> tree, and represent it explicitly in the Instrumentation nodes. I.e. add
> a "parent" pointer, or a "children" list, or both to the Instrumentation
> struct.
I'm happy to clarify the mechanism, but I'm hesitant to add more
pointers to Instrumentation, since it's a base struct that gets re-used
in different places, and also gets copied to parallel workers (so any
pointer requires extra scrutiny to avoid misuse). I also don't think
we actually need to track the parent pointer, since in practice it
will always be the current stack entry.
>
>
> Stack
> -----
>
> At all times, there's a stack that tracks what is the Instrumentation in
> the tree that is *currently* executing. For example, while executing the
> Seq Scan B, the stack would look like this:
>
> 0: Session
> 1: Query A
> 2: NestLoop
> 3: Seq Scan B
>
> And when the code is sending a result row back to the client, while the
> query is being executed, the stack would be just:
>
> 0: Session
>
>
> In the patch, the stack is represented by an array. It could also be
> implemented with a CurrentInstrumentation global variable, similar to
> CurrentMemoryContext and CurrentResourceOwner.
It could be, and in fact earlier iterations were closer to that, but I
changed it to the current version based on Andres' feedback: the
array structure is a lot easier to work with during abort when things
execute out-of-order (as you also note later).
>
>
> Abort handling
> --------------
>
> On abort, two things need to happen:
>
> 1. Reset the stack to the appropriate level. This ensures that we
> don't later try to update the counters on an Instrumentation node that
> is going away with the abort. In the above example, the stack would be
> reset to the 0: Session level.
Correct, but just to clarify, the main problem we deal with in terms
of reset is making sure that we cancel out any InstrPushStack that was
done, but will now no longer have a matching InstrPopStack getting
called.
We need to make sure that we get to the stack entry that was active
before the aborted activity started.
> 2. Finalize all the Instrumentation nodes as part of the ResourceOwner
> cleanup. All Instrumentation nodes that are released roll up their
> counters to their parents.
>
>
> Questions:
>
> Is the stack always a path from the root of the tree, down to some node?
> Or could you have e.g. recursion like A -> B -> C -> A? (I don't know if
> it makes a difference, just wondering)
I don't think this happens in practice. For the stack itself
that'd probably be fine (since, in a sense, you're just putting back an
entry that was on the stack before), but it'd e.g. result in timer
instrumentation behaving incorrectly.
We could probably add an assert to explicitly prevent that if we're
worried about it, but the existing Start/Stop instrumentation calls
haven't seen this issue I think, and they'd already have had a problem
with that.
> What happens if you release e.g. the NestLoop before its children? All
> the Instrumentation nodes belonging to a query would usually be part of
> the same ResourceOwner and there's no guarantee on what order the
> resources are released.
Correct, and in fact prior versions of the patch struggled with that
exact problem. It's both an issue for resource owner managed cleanup,
and when you have PG_FINALLY in the picture (e.g. pg_stat_statements).
But that's exactly why it's not really a full tree - in the abort case
we do not care about the relationship of child instrumentations
underneath the QueryInstrumentation - we just make sure that the stack
is reset to the entry that was active before the QueryInstrumentation
started, and that all activity that occurred is added to the
QueryInstrumentation.
If you had a situation that had two QueryInstrumentations active (i.e.
both registered as resource owner), we go up to whichever one of the
two is higher up in the stack, per logic in InstrStopFinalize.
Thanks for thinking this through & hopefully this clarifies things a bit?
Thanks,
Lukas
--
Lukas Fittl
^ permalink raw reply [nested|flat] 42+ messages in thread
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-03-27 07:21 UTC (permalink / raw)
To: Heikki Linnakangas <[email protected]>; +Cc: Zsolt Parragi <[email protected]>; Andres Freund <[email protected]>; Tomas Vondra <[email protected]>; PostgreSQL Hackers <[email protected]>; Peter Smith <[email protected]>
On Wed, Mar 25, 2026 at 5:41 PM Lukas Fittl <[email protected]> wrote:
> On Wed, Mar 25, 2026 at 3:47 AM Heikki Linnakangas <[email protected]> wrote:
> > This patch could use an overview README file, I'm struggling to
> > understand how this all works. Here's my understanding so far,
> > please correct me if I'm wrong:
>
> Sure, happy to put this together - I wonder where we would best place
> that - probably src/backend/executor/README.instrument ?
I've gone ahead and added that in
src/backend/executor/README.instrument for now, trying to take some of
your prior email as inspiration, whilst not fully committing to
describing it as a tree - but happy to revise that if we feel it's
important for clarity.
See attached v11, rebased after the pgBufferUsage calls were moved in
df09452c3209, and also fixing the issues that Zsolt noted in a
previous email reviewing v10.
I've also moved pg_session_buffer_usage to be a
test_session_buffer_usage module instead, since it's not intended as a
user-accessible module. If we wanted to commit that (not sure if it's
worth the cycles), we could potentially merge it with the 0004 commit
that expands the main regression tests.
Thanks,
Lukas
--
Lukas Fittl
Attachments:
[application/octet-stream] v11-0001-instrumentation-Separate-trigger-logic-from-othe.patch (10.1K, 2-v11-0001-instrumentation-Separate-trigger-logic-from-othe.patch)
From aef72daa02b5cd48df3b5bc131a87f53a306b680 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v11 1/9] instrumentation: Separate trigger logic from other
uses
Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/commands/explain.c | 19 ++++++++-----------
src/backend/commands/trigger.c | 22 +++++++++++-----------
src/backend/executor/execMain.c | 2 +-
src/backend/executor/instrument.c | 26 ++++++++++++++++++++++++++
src/include/executor/instrument.h | 12 ++++++++++++
src/include/nodes/execnodes.h | 3 ++-
src/tools/pgindent/typedefs.list | 1 +
7 files changed, 61 insertions(+), 24 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e4b70166b0e..eb6ef23c2d6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1101,18 +1101,15 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
for (nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
{
Trigger *trig = rInfo->ri_TrigDesc->triggers + nt;
- Instrumentation *instr = rInfo->ri_TrigInstrument + nt;
+ TriggerInstrumentation *tginstr = rInfo->ri_TrigInstrument + nt;
char *relname;
char *conname = NULL;
- /* Must clean up instrumentation state */
- InstrEndLoop(instr);
-
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
*/
- if (instr->ntuples == 0)
+ if (tginstr->firings == 0)
continue;
ExplainOpenGroup("Trigger", NULL, true, es);
@@ -1137,11 +1134,11 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
if (show_relname)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
- appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- INSTR_TIME_GET_MILLISEC(instr->total),
- instr->ntuples);
+ appendStringInfo(es->str, ": time=%.3f calls=%d\n",
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total),
+ tginstr->firings);
else
- appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
+ appendStringInfo(es->str, ": calls=%d\n", tginstr->firings);
}
else
{
@@ -1151,9 +1148,9 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Relation", relname, es);
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- INSTR_TIME_GET_MILLISEC(instr->total), 3,
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total), 3,
es);
- ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
+ ExplainPropertyInteger("Calls", NULL, tginstr->firings, es);
}
if (conname)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 6596843a8d8..29b80d75143 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -92,7 +92,7 @@ static bool TriggerEnabled(EState *estate, ResultRelInfo *relinfo,
static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2311,7 +2311,7 @@ static HeapTuple
ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2346,7 +2346,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2391,10 +2391,10 @@ ExecCallTriggerFunc(TriggerData *trigdata,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
return (HeapTuple) DatumGetPointer(result);
}
@@ -3938,7 +3938,7 @@ static void AfterTriggerExecute(EState *estate,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2);
@@ -4332,7 +4332,7 @@ AfterTriggerExecute(EState *estate,
ResultRelInfo *src_relInfo,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
- FmgrInfo *finfo, Instrumentation *instr,
+ FmgrInfo *finfo, TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2)
@@ -4373,7 +4373,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4590,10 +4590,10 @@ AfterTriggerExecute(EState *estate,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
}
@@ -4709,7 +4709,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
Relation rel = NULL;
TriggerDesc *trigdesc = NULL;
FmgrInfo *finfo = NULL;
- Instrumentation *instr = NULL;
+ TriggerInstrumentation *instr = NULL;
TupleTableSlot *slot1 = NULL,
*slot2 = NULL;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 58b84955c2b..53631163dd6 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1285,7 +1285,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
}
else
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index a40610bc252..9354ad7be12 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -196,6 +196,32 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
WalUsageAdd(&dst->walusage, &add->walusage);
}
+/* Trigger instrumentation handling */
+TriggerInstrumentation *
+InstrAllocTrigger(int n, int instrument_options)
+{
+ TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ int i;
+
+ for (i = 0; i < n; i++)
+ InstrInit(&tginstr[i].instr, instrument_options);
+
+ return tginstr;
+}
+
+void
+InstrStartTrigger(TriggerInstrumentation *tginstr)
+{
+ InstrStartNode(&tginstr->instr);
+}
+
+void
+InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
+{
+ InstrStopNode(&tginstr->instr, 0);
+ tginstr->firings += firings;
+}
+
/* note current values during parallel executor startup */
void
InstrStartParallelQuery(void)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..a9c2233227f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -100,6 +100,13 @@ typedef struct WorkerInstrumentation
Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
+typedef struct TriggerInstrumentation
+{
+ Instrumentation instr;
+ int firings; /* number of times the instrumented trigger
+ * was fired */
+} TriggerInstrumentation;
+
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
@@ -111,6 +118,11 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+
+extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
+extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
+
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 684e398f824..178229c5c44 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -59,6 +59,7 @@ typedef struct ScanKeyData ScanKeyData;
typedef struct SnapshotData *Snapshot;
typedef struct SortSupportData *SortSupport;
typedef struct TIDBitmap TIDBitmap;
+typedef struct TriggerInstrumentation TriggerInstrumentation;
typedef struct TupleConversionMap TupleConversionMap;
typedef struct TupleDescData *TupleDesc;
typedef struct Tuplesortstate Tuplesortstate;
@@ -533,7 +534,7 @@ typedef struct ResultRelInfo
ExprState **ri_TrigWhenExprs;
/* optional runtime measurements for triggers */
- Instrumentation *ri_TrigInstrument;
+ TriggerInstrumentation *ri_TrigInstrument;
/* On-demand created slots for triggers / returning processing */
TupleTableSlot *ri_ReturningSlot; /* for trigger output tuples */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 712d84128ca..f778c283034 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3196,6 +3196,7 @@ TriggerDesc
TriggerEvent
TriggerFlags
TriggerInfo
+TriggerInstrumentation
TriggerTransition
TruncateStmt
TsmRoutine
--
2.47.1
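The wrapper pattern in the patch above (a generic instrumentation struct embedded in a trigger-specific struct that carries its own integer firing count) can be sketched in isolation. This is a toy model, not PostgreSQL code: the Toy* types and toy_* functions are hypothetical, elapsed time is passed in directly instead of being measured, and calloc stands in for palloc0.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical generic part: only timing, no tuple counts. */
typedef struct ToyInstrumentation
{
	double		total_ms;
} ToyInstrumentation;

/* Hypothetical trigger wrapper: generic part plus an integer counter. */
typedef struct ToyTriggerInstrumentation
{
	ToyInstrumentation instr;
	int			firings;		/* number of times the trigger fired */
} ToyTriggerInstrumentation;

/* Allocate a zeroed array with one entry per trigger. */
static ToyTriggerInstrumentation *
toy_alloc_trigger(int n)
{
	return calloc((size_t) n, sizeof(ToyTriggerInstrumentation));
}

/* Charge elapsed time to the generic part, count firings separately. */
static void
toy_stop_trigger(ToyTriggerInstrumentation *tginstr, double elapsed_ms,
				 int firings)
{
	tginstr->instr.total_ms += elapsed_ms;
	tginstr->firings += firings;
}

/* Count the triggers a report would show, skipping those never fired. */
static int
toy_count_reported(const ToyTriggerInstrumentation *tginstr, int n)
{
	int			reported = 0;

	for (int i = 0; i < n; i++)
	{
		if (tginstr[i].firings == 0)
			continue;
		reported++;
	}
	return reported;
}
```

Keeping the count as an int in the wrapper is what lets the report print it with an integer format instead of formatting a double with zero decimals.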
[application/octet-stream] v11-0003-instrumentation-Replace-direct-changes-of-pgBuff.patch (9.0K, 3-v11-0003-instrumentation-Replace-direct-changes-of-pgBuff.patch)
From 7790a6b5a81cc19d3f1c812b91170f622cca08f4 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Thu, 26 Mar 2026 23:31:04 -0700
Subject: [PATCH v11 3/9] instrumentation: Replace direct changes of
pgBufferUsage/pgWalUsage with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
---
src/backend/access/transam/xlog.c | 10 +++++-----
src/backend/storage/buffer/bufmgr.c | 20 ++++++++++----------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
6 files changed, 45 insertions(+), 26 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f5c9a34374d..9b33584f454 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1081,10 +1081,10 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
- pgWalUsage.wal_fpi_bytes += fpi_bytes;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
+ INSTR_WALUSAGE_ADD(wal_fpi_bytes, fpi_bytes);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2063,7 +2063,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = InvalidXLogRecPtr;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e212f6110f2..ce1af4ad563 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -840,7 +840,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -861,7 +861,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
if (BufferTagsEqual(&tag, &bufHdr->tag))
{
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
UnpinBuffer(bufHdr);
@@ -1716,9 +1716,9 @@ TrackBufferHit(IOObject io_object, IOContext io_context,
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
@@ -2092,9 +2092,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it. Otherwise
@@ -2981,7 +2981,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -3127,7 +3127,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4542,7 +4542,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state.
@@ -5685,7 +5685,7 @@ MarkSharedBufferDirtyHint(Buffer buffer, BufferDesc *bufHdr, uint64 lockstate,
UnlockBufHdr(bufHdr);
}
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 404c6bccbdd..8845b0aeed6 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -217,7 +217,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -478,7 +478,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -509,7 +509,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u64(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c4afe4d368a..8b501dfcadd 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -475,13 +475,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -549,13 +549,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 28de24538dc..9e7a88ec0d0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b11d64633b5..d4769f3da7b 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -153,4 +153,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += (val); \
+ } while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += (val); \
+ } while(0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
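One way to see why routing every counter update through macros helps a later refactoring: once call sites no longer name the global directly, the macro body alone decides where the increment lands. The sketch below is an assumption about the general direction (the thread's opening mail describes incrementing the current stack entry), not the actual follow-up patch; ToyBufferUsage, toy_current_usage and the TOY_* macros are hypothetical.

```c
#include <assert.h>

/* Hypothetical, trimmed-down stand-in for BufferUsage. */
typedef struct ToyBufferUsage
{
	long		shared_blks_hit;
	long		shared_blks_read;
} ToyBufferUsage;

/* Session-wide counters, plus a pointer to wherever updates should go. */
static ToyBufferUsage toy_global_usage;
static ToyBufferUsage *toy_current_usage = &toy_global_usage;

/* Call sites use only these macros and never touch the target directly. */
#define TOY_BUFUSAGE_INCR(fld) do { \
		toy_current_usage->fld++; \
	} while (0)
#define TOY_BUFUSAGE_ADD(fld,val) do { \
		toy_current_usage->fld += (val); \
	} while (0)

/* A call site: it stays unchanged no matter where the counters live. */
static void
toy_read_blocks(int nblocks)
{
	TOY_BUFUSAGE_ADD(shared_blks_read, nblocks);
}
```

Pointing toy_current_usage at a per-node struct redirects all subsequent updates there without editing toy_read_blocks or any other call site.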
[application/octet-stream] v11-0002-instrumentation-Separate-per-node-logic-from-oth.patch (27.1K, 4-v11-0002-instrumentation-Separate-per-node-logic-from-oth.patch)
From c14acb9509eb60559459d36c4b045835884243c4 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 01:19:50 -0800
Subject: [PATCH v11 2/9] instrumentation: Separate per-node logic from other
uses
Previously different places (e.g. query "total time") were repurposing
the Instrumentation struct initially introduced for capturing per-node
statistics during execution. This overuse of the same struct is confusing,
e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
code paths, and prevents future refactorings.
Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.
Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.
In passing, drop the "n" argument to InstrAlloc, as all remaining callers
need exactly one Instrumentation struct.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
contrib/auto_explain/auto_explain.c | 8 +-
.../pg_stat_statements/pg_stat_statements.c | 8 +-
contrib/postgres_fdw/postgres_fdw.c | 2 +-
src/backend/commands/explain.c | 20 +--
src/backend/executor/execMain.c | 8 +-
src/backend/executor/execParallel.c | 24 +--
src/backend/executor/execProcnode.c | 4 +-
src/backend/executor/instrument.c | 142 ++++++++++++------
src/include/executor/instrument.h | 60 +++++---
src/include/nodes/execnodes.h | 9 +-
src/tools/pgindent/typedefs.list | 3 +-
11 files changed, 174 insertions(+), 114 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index e856cd35a6f..39bf2543b70 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 6cb14824ec3..3e79108846e 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1024,7 +1024,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -1083,12 +1083,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
if (queryId != INT64CONST(0) && queryDesc->totaltime &&
pgss_enabled(nesting_level))
{
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
pgss_store(queryDesc->sourceText,
queryId,
queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 41e47cc795b..cc8ec24c30e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2779,7 +2779,7 @@ postgresIterateDirectModify(ForeignScanState *node)
if (!resultRelInfo->ri_projectReturning)
{
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- Instrumentation *instr = node->ss.ps.instrument;
+ NodeInstrumentation *instr = node->ss.ps.instrument;
Assert(!dmstate->has_returning);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index eb6ef23c2d6..e73dc129132 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1837,7 +1837,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
double nloops = planstate->instrument->nloops;
double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
- double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
+ double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->instr.total) / nloops;
double rows = planstate->instrument->ntuples / nloops;
if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1890,11 +1890,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* prepare per-worker general execution details */
if (es->workers_state && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
double startup_ms;
double total_ms;
@@ -1903,7 +1903,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (nloops <= 0)
continue;
startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
- total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
+ total_ms = INSTR_TIME_GET_MILLISEC(instrument->instr.total) / nloops;
rows = instrument->ntuples / nloops;
ExplainOpenWorker(n, es);
@@ -2290,18 +2290,18 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage);
if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ show_wal_usage(es, &planstate->instrument->instr.walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
if (nloops <= 0)
@@ -2309,9 +2309,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage);
if (es->wal)
- show_wal_usage(es, &instrument->walusage);
+ show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 53631163dd6..1b950040597 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -333,7 +333,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/*
* extract information from the query descriptor and the query feature.
@@ -385,7 +385,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -435,7 +435,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -445,7 +445,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
AfterTriggerEndQuery(estate);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ac84af294c9..c153d5c1c3b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,7 +87,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object. This will depend on the length of
+ * of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -104,11 +104,15 @@ struct SharedExecutorInstrumentation
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
- /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+ /*
+ * array of num_plan_nodes * num_workers NodeInstrumentation objects
+ * follows
+ */
};
#define GetInstrumentationArray(sei) \
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
- (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+ (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
@@ -725,7 +729,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len = MAXALIGN(instrumentation_len);
instrument_offset = instrumentation_len;
instrumentation_len +=
- mul_size(sizeof(Instrumentation),
+ mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -811,7 +815,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*/
if (estate->es_instrument)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -821,7 +825,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInit(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1053,7 +1057,7 @@ static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
int n;
int ibytes;
@@ -1081,9 +1085,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* Switch into per-query memory context.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
planstate->worker_instrument =
- palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+ palloc(ibytes + offsetof(WorkerNodeInstrumentation, instrument));
MemoryContextSwitchTo(oldcontext);
planstate->worker_instrument->num_workers = instrumentation->num_workers;
@@ -1313,7 +1317,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
{
int i;
int plan_node_id = planstate->plan->plan_node_id;
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
InstrEndLoop(planstate->instrument);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d35976925ae..132fe37ef60 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -414,8 +414,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument,
- result->async_capable);
+ result->instrument = InstrAllocNode(estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 9354ad7be12..bc551f95a08 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,51 +26,30 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int instrument_options)
{
- Instrumentation *instr;
-
- /* initialize all fields to zeroes, then modify as needed */
- instr = palloc0(n * sizeof(Instrumentation));
- if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
- {
- bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
- bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
- int i;
-
- for (i = 0; i < n; i++)
- {
- instr[i].need_bufusage = need_buffers;
- instr[i].need_walusage = need_wal;
- instr[i].need_timer = need_timer;
- instr[i].async_mode = async_mode;
- }
- }
-
+ Instrumentation *instr = palloc0(sizeof(Instrumentation));
+ InstrInitOptions(instr, instrument_options);
return instr;
}
-/* Initialize a pre-allocated instrumentation structure. */
void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- memset(instr, 0, sizeof(Instrumentation));
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-/* Entry to a plan node */
void
-InstrStartNode(Instrumentation *instr)
+InstrStart(Instrumentation *instr)
{
if (instr->need_timer)
{
if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ elog(ERROR, "InstrStart called twice in a row");
else
INSTR_TIME_SET_CURRENT(instr->starttime);
}
@@ -83,24 +62,19 @@ InstrStartNode(Instrumentation *instr)
instr->walusage_start = pgWalUsage;
}
-/* Exit from a plan node */
void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
-
/* let's update the time only if the timer was requested */
if (instr->need_timer)
{
if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStopNode called without start");
+ elog(ERROR, "InstrStop called without start");
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
}
@@ -113,6 +87,74 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (instr->need_walusage)
WalUsageAccumDiff(&instr->walusage,
&pgWalUsage, &instr->walusage_start);
+}
+
+/* Node instrumentation handling */
+
+/* Allocate new node instrumentation structure */
+NodeInstrumentation *
+InstrAllocNode(int instrument_options, bool async_mode)
+{
+ NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+
+ InstrInitNode(instr, instrument_options);
+ instr->async_mode = async_mode;
+
+ return instr;
+}
+
+/* Initialize a pre-allocated instrumentation structure. */
+void
+InstrInitNode(NodeInstrumentation *instr, int instrument_options)
+{
+ memset(instr, 0, sizeof(NodeInstrumentation));
+ InstrInitOptions(&instr->instr, instrument_options);
+}
+
+/* Entry to a plan node */
+void
+InstrStartNode(NodeInstrumentation *instr)
+{
+ InstrStart(&instr->instr);
+}
+
+/* Exit from a plan node */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ double save_tuplecount = instr->tuplecount;
+ instr_time endtime;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
+
+ /*
+ * Update the time only if the timer was requested.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+ if (instr->instr.need_timer)
+ {
+ if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
+ elog(ERROR, "InstrStopNode called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
+ }
+
+ /* Add delta of buffer usage since entry to node's totals */
+ if (instr->instr.need_bufusage)
+ BufferUsageAccumDiff(&instr->instr.bufusage,
+ &pgBufferUsage, &instr->instr.bufusage_start);
+
+ if (instr->instr.need_walusage)
+ WalUsageAccumDiff(&instr->instr.walusage,
+ &pgWalUsage, &instr->instr.walusage_start);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -133,7 +175,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
/* Update tuple count */
void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
{
/* count the returned tuples */
instr->tuplecount += nTuples;
@@ -141,24 +183,24 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation *instr)
{
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
elog(ERROR, "InstrEndLoop called on running node");
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
- INSTR_TIME_ADD(instr->total, instr->counter);
+ INSTR_TIME_ADD(instr->instr.total, instr->counter);
instr->ntuples += instr->tuplecount;
instr->nloops += 1;
/* Reset for next cycle (if any) */
instr->running = false;
- INSTR_TIME_SET_ZERO(instr->starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
INSTR_TIME_SET_ZERO(instr->counter);
INSTR_TIME_SET_ZERO(instr->firsttuple);
instr->tuplecount = 0;
@@ -166,7 +208,7 @@ InstrEndLoop(Instrumentation *instr)
/* aggregate instrumentation information */
void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
{
if (!dst->running && add->running)
{
@@ -181,7 +223,7 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->tuplecount += add->tuplecount;
INSTR_TIME_ADD(dst->startup, add->startup);
- INSTR_TIME_ADD(dst->total, add->total);
+ INSTR_TIME_ADD(dst->instr.total, add->instr.total);
dst->ntuples += add->ntuples;
dst->ntuples2 += add->ntuples2;
dst->nloops += add->nloops;
@@ -189,11 +231,11 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->need_bufusage)
- BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ if (dst->instr.need_bufusage)
+ BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
- if (dst->need_walusage)
- WalUsageAdd(&dst->walusage, &add->walusage);
+ if (dst->instr.need_walusage)
+ WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
}
/* Trigger instrumentation handling */
@@ -204,7 +246,7 @@ InstrAllocTrigger(int n, int instrument_options)
int i;
for (i = 0; i < n; i++)
- InstrInit(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, instrument_options);
return tginstr;
}
@@ -212,13 +254,13 @@ InstrAllocTrigger(int n, int instrument_options)
void
InstrStartTrigger(TriggerInstrumentation *tginstr)
{
- InstrStartNode(&tginstr->instr);
+ InstrStart(&tginstr->instr);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
- InstrStopNode(&tginstr->instr, 0);
+ InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a9c2233227f..b11d64633b5 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -67,38 +67,55 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/*
+ * General purpose instrumentation that can capture time and WAL/buffer usage
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured in between).
+ */
typedef struct Instrumentation
{
- /* Parameters set at node creation: */
+ /* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
bool need_bufusage; /* true if we need buffer usage data */
bool need_walusage; /* true if we need WAL usage data */
+ /* Internal state keeping: */
+ instr_time starttime; /* start time of last InstrStart */
+ BufferUsage bufusage_start; /* buffer usage at start */
+ WalUsage walusage_start; /* WAL usage at start */
+ /* Accumulated statistics: */
+ instr_time total; /* total runtime */
+ BufferUsage bufusage; /* total buffer usage */
+ WalUsage walusage; /* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
+{
+ Instrumentation instr;
+ /* Parameters set at node creation: */
bool async_mode; /* true if node is in async mode */
/* Info about current plan cycle: */
bool running; /* true if we've completed first tuple */
- instr_time starttime; /* start time of current iteration of node */
instr_time counter; /* accumulated runtime for this node */
instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
instr_time startup; /* total startup time */
- instr_time total; /* total time */
double ntuples; /* total tuples produced */
double ntuples2; /* secondary node-specific tuple counter */
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
-typedef struct WorkerInstrumentation
+typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
- Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
-} WorkerInstrumentation;
+ NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+} WorkerNodeInstrumentation;
typedef struct TriggerInstrumentation
{
@@ -110,14 +127,19 @@ typedef struct TriggerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
- bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int instrument_options);
+extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr);
+
+extern NodeInstrumentation *InstrAllocNode(int instrument_options,
+ bool async_mode);
+extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation *instr);
+extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation *instr);
+extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 178229c5c44..502ad4f2da5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -59,6 +59,7 @@ typedef struct ScanKeyData ScanKeyData;
typedef struct SnapshotData *Snapshot;
typedef struct SortSupportData *SortSupport;
typedef struct TIDBitmap TIDBitmap;
+typedef struct NodeInstrumentation NodeInstrumentation;
typedef struct TriggerInstrumentation TriggerInstrumentation;
typedef struct TupleConversionMap TupleConversionMap;
typedef struct TupleDescData *TupleDesc;
@@ -67,7 +68,7 @@ typedef struct Tuplestorestate Tuplestorestate;
typedef struct TupleTableSlot TupleTableSlot;
typedef struct TupleTableSlotOps TupleTableSlotOps;
typedef struct WalUsage WalUsage;
-typedef struct WorkerInstrumentation WorkerInstrumentation;
+typedef struct WorkerNodeInstrumentation WorkerNodeInstrumentation;
/* ----------------
@@ -1185,8 +1186,10 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
- Instrumentation *instrument; /* Optional runtime stats for this node */
- WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
+ NodeInstrumentation *instrument; /* Optional runtime stats for this
+ * node */
+ WorkerNodeInstrumentation *worker_instrument; /* per-worker
+ * instrumentation */
/* Per-worker JIT instrumentation */
struct SharedJitInstrumentation *worker_jit_instrument;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f778c283034..00dd1fc6ff9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1808,6 +1808,7 @@ NextSampleBlock_function
NextSampleTuple_function
NextValueExpr
Node
+NodeInstrumentation
NodeTag
NonEmptyRange
NoneCompressorState
@@ -3418,9 +3419,9 @@ WorkTableScan
WorkTableScanState
WorkerInfo
WorkerInfoData
-WorkerInstrumentation
WorkerJobDumpPtrType
WorkerJobRestorePtrType
+WorkerNodeInstrumentation
Working_State
WriteBufPtrType
WriteBytePtrType
--
2.47.1
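The split above, a general-purpose Instrumentation embedded as the first member of NodeInstrumentation, can be illustrated with a minimal standalone sketch. Everything named Demo* or demo_* is hypothetical and is not part of the patch; the single long counter stands in for pgBufferUsage, and timing is left out entirely. In the patch itself InstrStopNode does not simply delegate to InstrStop, because it accumulates time into a separate per-cycle counter; the delegation below is a simplification that only holds for the buffer-usage part.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the global pgBufferUsage counter. */
static long demo_buffer_hits = 0;

/* Simplified general-purpose instrumentation (loosely after Instrumentation). */
typedef struct DemoInstr
{
	bool		need_bufusage;	/* set at creation */
	long		bufusage_start; /* snapshot of the global at start */
	long		bufusage;		/* accumulated usage */
} DemoInstr;

/* Simplified per-node instrumentation embedding the general struct. */
typedef struct DemoNodeInstr
{
	DemoInstr	instr;
	double		tuplecount;		/* # of tuples emitted so far */
} DemoNodeInstr;

/* Only sets the option flags; the caller is expected to zero the struct. */
static void
demo_instr_init_options(DemoInstr *instr, bool need_bufusage)
{
	instr->need_bufusage = need_bufusage;
}

static void
demo_instr_start(DemoInstr *instr)
{
	if (instr->need_bufusage)
		instr->bufusage_start = demo_buffer_hits;
}

static void
demo_instr_stop(DemoInstr *instr)
{
	/* Add the delta since start to the accumulated total. */
	if (instr->need_bufusage)
		instr->bufusage += demo_buffer_hits - instr->bufusage_start;
}

static void
demo_node_init(DemoNodeInstr *node, bool need_bufusage)
{
	memset(node, 0, sizeof(DemoNodeInstr));
	demo_instr_init_options(&node->instr, need_bufusage);
}

static void
demo_node_start(DemoNodeInstr *node)
{
	demo_instr_start(&node->instr);
}

static void
demo_node_stop(DemoNodeInstr *node, double ntuples)
{
	node->tuplecount += ntuples;
	demo_instr_stop(&node->instr);
}

/* Only activity between start and stop is attributed to the node. */
static long
demo_example(void)
{
	DemoNodeInstr node;

	demo_node_init(&node, true);

	demo_buffer_hits += 3;		/* before the node: not counted */

	demo_node_start(&node);
	demo_buffer_hits += 5;		/* inside the node: counted */
	demo_node_stop(&node, 1.0);

	demo_buffer_hits += 2;		/* after the node: not counted */

	assert(node.tuplecount == 1.0);
	return node.instr.bufusage;
}
```

Calling demo_example() from a main() returns the usage attributed to the node, independent of whatever the global counter held beforehand. The wrapper functions reach the shared fields through the embedded member, in the same way the patch's node-level code uses instr->instr.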
[application/octet-stream] v11-0004-instrumentation-Add-additional-regression-tests-.patch (23.5K, 5-v11-0004-instrumentation-Add-additional-regression-tests-.patch)
From 14b2de993ecf2d2258797d64bb15f39e1d7cb67a Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 22 Feb 2026 16:12:48 -0800
Subject: [PATCH v11 4/9] instrumentation: Add additional regression tests
covering buffer usage
This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.
Testing this is challenging because there can be different sources of
buffer activity, so we rely on temporary tables where we can to prove
that activity was captured and not lost. This supports a future commit
that reworks some of the instrumentation logic, a change that could cause
areas covered by these tests to fail.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
.../pg_stat_statements/expected/utility.out | 70 +++++++
contrib/pg_stat_statements/expected/wal.out | 48 +++++
contrib/pg_stat_statements/sql/utility.sql | 56 +++++
contrib/pg_stat_statements/sql/wal.sql | 33 +++
src/test/regress/expected/explain.out | 197 ++++++++++++++++++
src/test/regress/sql/explain.sql | 194 +++++++++++++++++
6 files changed, 598 insertions(+)
diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index e4d6564ea5b..cba487f6be5 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -289,6 +289,76 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
1 | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
(3 rows)
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT run_explain_buffers_test();
+ run_explain_buffers_test
+--------------------------
+
+(1 row)
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+ query | has_buffer_stats
+-----------------------------------+------------------
+ SELECT run_explain_buffers_test() | t
+(1 row)
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+CALL pgss_call_rollback_proc();
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------------+-------+---------------+---------------------+-----------------------
+ CALL pgss_call_rollback_proc() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/expected/wal.out b/contrib/pg_stat_statements/expected/wal.out
index 977e382d848..611213daef6 100644
--- a/contrib/pg_stat_statements/expected/wal.out
+++ b/contrib/pg_stat_statements/expected/wal.out
@@ -28,3 +28,51 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
t
(1 row)
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT pgss_error_func();
+ pgss_error_func
+-----------------
+
+(1 row)
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------+-------+---------------+---------------------+-----------------------
+ SELECT pgss_error_func() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index dd97203c210..7540e49c73c 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -152,6 +152,62 @@ EXPLAIN (costs off) SELECT a FROM generate_series(1,10) AS tab(a) WHERE a = 7;
SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+SELECT run_explain_buffers_test();
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+CALL pgss_call_rollback_proc();
+
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
+
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/sql/wal.sql b/contrib/pg_stat_statements/sql/wal.sql
index 1dc1552a81e..467e321b206 100644
--- a/contrib/pg_stat_statements/sql/wal.sql
+++ b/contrib/pg_stat_statements/sql/wal.sql
@@ -18,3 +18,36 @@ wal_records > 0 as wal_records_generated,
wal_records >= rows as wal_records_ge_rows
FROM pg_stat_statements ORDER BY query COLLATE "C";
SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT pgss_error_func();
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..e28e7543693 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,200 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+ exception_buffers_nested_visible
+----------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_cursor_explain_buffers();
+ noscroll_ok | scroll_ok
+-------------+-----------
+ t | t
+(1 row)
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_parallel_explain_buffers();
+ ratio
+-------
+ 1
+(1 row)
+
+DROP FUNCTION check_parallel_explain_buffers;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..cf5c6335a19 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,197 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_cursor_explain_buffers();
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_parallel_explain_buffers();
+
+DROP FUNCTION check_parallel_explain_buffers;
--
2.47.1
[application/octet-stream] v11-0005-Optimize-measuring-WAL-buffer-usage-through-stac.patch (96.5K, 6-v11-0005-Optimize-measuring-WAL-buffer-usage-through-stac.patch)
download | inline diff:
From 26ae52fb9e9ecb4475b7db71773da92ea86ad8a4 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v11 5/9] Optimize measuring WAL/buffer usage through
stack-based instrumentation
Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that get
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode, and popped at InstrStopNode. Stack entries are
zero-initialized (avoiding the diff mechanism) and get added to their parent
entry when they are finalized, i.e. once no more modifications can occur.
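As an illustration only, the push/pop/finalize idea described above can be
sketched in standalone C. The names UsageEntry, usage_push, usage_pop,
usage_finalize, count_hit and count_read are hypothetical and are not taken
from the patch; the real code operates on Instrumentation/BufferUsage and
tracks many more counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-in for a per-node stack entry. */
typedef struct UsageEntry
{
    int64_t     blks_hit;
    int64_t     blks_read;
    struct UsageEntry *prev;    /* entry to restore when this one is popped */
} UsageEntry;

/* The entry that buffer activity is currently charged to (NULL = none). */
static UsageEntry *current_entry = NULL;

/* Entries start at zero, so no "diff against a global counter" is needed. */
static void
usage_init(UsageEntry *e)
{
    e->blks_hit = 0;
    e->blks_read = 0;
    e->prev = NULL;
}

/* Entering a code section: make its entry the current one. */
static void
usage_push(UsageEntry *e)
{
    e->prev = current_entry;
    current_entry = e;
}

/* Leaving a code section: restore the previous entry. */
static void
usage_pop(UsageEntry *e)
{
    assert(current_entry == e);
    current_entry = e->prev;
}

/* Once a child can no longer change, add its totals to the parent. */
static void
usage_finalize(UsageEntry *child, UsageEntry *parent)
{
    parent->blks_hit += child->blks_hit;
    parent->blks_read += child->blks_read;
}

/* The "actual activity" writes directly into the current entry. */
static void
count_hit(void)
{
    if (current_entry)
        current_entry->blks_hit++;
}

static void
count_read(void)
{
    if (current_entry)
        current_entry->blks_read++;
}

/* Demo: a parent section with a child section that is entered twice. */
static int64_t
demo_parent_hits(void)
{
    UsageEntry  parent;
    UsageEntry  child;

    usage_init(&parent);
    usage_init(&child);

    usage_push(&parent);
    count_hit();                /* charged to parent */

    usage_push(&child);         /* first cycle of the child */
    count_hit();
    count_hit();
    count_read();
    usage_pop(&child);

    usage_push(&child);         /* second cycle of the child */
    count_hit();
    usage_pop(&child);

    usage_finalize(&child, &parent);    /* child totals roll up once */
    usage_pop(&parent);

    return parent.blks_hit;     /* 1 own hit + 3 child hits */
}
```

In this sketch the per-cycle cost is two pointer assignments, whereas the
diff-based approach has to subtract every counter field on each exit.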
To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct together with
its associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.
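To make the abort requirement concrete, here is a rough standalone sketch
that emulates a PG_TRY/PG_FINALLY style guard with plain setjmp/longjmp.
Everything in it (Entry, risky_work, run_guarded, error_jmp) is a
hypothetical stand-in for illustration; it is not the patch's code and it
does not model resource owners.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical minimal stack entry with a single counter. */
typedef struct Entry
{
    long        hits;
    struct Entry *prev;
} Entry;

static Entry *current_entry = NULL;
static jmp_buf error_jmp;       /* stand-in for the error-recovery target */

/* Does some accounted work, then optionally "throws" an error. */
static void
risky_work(int fail)
{
    if (current_entry)
        current_entry->hits++;
    if (fail)
        longjmp(error_jmp, 1);
}

/*
 * Push an entry, run the work, and always pop the entry again, whether the
 * work finished normally or bailed out via longjmp. Returns 1 on error.
 */
static int
run_guarded(Entry *e, int fail)
{
    int         failed;

    assert(e != NULL);
    e->prev = current_entry;
    current_entry = e;

    if (setjmp(error_jmp) == 0)
    {
        risky_work(fail);
        failed = 0;
    }
    else
        failed = 1;             /* reached via longjmp from risky_work */

    /* "finally": restore the previous entry on both paths */
    current_entry = e->prev;
    return failed;
}

/* Demo: after an error in the inner section, the outer entry is current. */
static int
demo_stack_restored_after_error(void)
{
    Entry       outer = {0, NULL};
    Entry       inner = {0, NULL};
    int         failed;
    int         ok;

    current_entry = &outer;
    failed = run_guarded(&inner, 1);
    ok = failed && current_entry == &outer && inner.hits == 1;
    current_entry = NULL;
    return ok;
}
```

Without the restore step on the error path, current_entry would keep
pointing at an entry whose owner has already gone away, which is the
inconsistent state the paragraph above warns about.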
This also drops the global pgBufferUsage; any callers interested in
measuring buffer activity should instead use InstrStart/InstrStop.
The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and heap_page_prune_and_freeze for
measuring FPIs.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
contrib/auto_explain/auto_explain.c | 16 +-
.../pg_stat_statements/pg_stat_statements.c | 87 +---
src/backend/access/brin/brin.c | 10 +-
src/backend/access/gin/gininsert.c | 10 +-
src/backend/access/heap/vacuumlazy.c | 15 +-
src/backend/access/nbtree/nbtsort.c | 10 +-
src/backend/commands/analyze.c | 31 +-
src/backend/commands/explain.c | 43 +-
src/backend/commands/explain_dr.c | 57 ++-
src/backend/commands/prepare.c | 27 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/trigger.c | 17 +-
src/backend/commands/vacuumparallel.c | 10 +-
src/backend/executor/README.instrument | 227 +++++++++
src/backend/executor/execMain.c | 83 +++-
src/backend/executor/execParallel.c | 32 +-
src/backend/executor/execPartition.c | 2 +-
src/backend/executor/execProcnode.c | 84 +++-
src/backend/executor/execUtils.c | 11 +-
src/backend/executor/instrument.c | 448 +++++++++++++-----
src/backend/replication/logical/worker.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 6 +-
src/backend/utils/activity/pgstat_io.c | 6 +-
src/include/commands/explain_dr.h | 5 +-
src/include/executor/execdesc.h | 4 +-
src/include/executor/executor.h | 5 +-
src/include/executor/instrument.h | 198 +++++++-
src/include/nodes/execnodes.h | 3 +-
src/include/utils/resowner.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
30 files changed, 1097 insertions(+), 357 deletions(-)
create mode 100644 src/backend/executor/README.instrument
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 39bf2543b70..4be81489ff4 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -305,19 +305,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
if (auto_explain_enabled())
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -382,7 +372,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
/* Log plan if duration is exceeded. */
- msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
+ msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total);
if (msec >= auto_explain_log_min_duration)
{
ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 3e79108846e..9856dec3a5f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -910,22 +910,11 @@ pgss_planner(Query *parse,
&& pgss_track_planning && query_string
&& parse->queryId != INT64CONST(0))
{
- instr_time start;
- instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
-
- /*
- * Similarly the planner could write some WAL records in some cases
- * (e.g. setting a hint bit with those being WAL-logged)
- */
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ /* Track time and buffer/WAL usage as the planner can access them. */
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -939,30 +928,20 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
parse->stmt_len,
PGSS_PLAN,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
0,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
@@ -1014,19 +993,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
*/
if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -1088,10 +1057,10 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
queryDesc->plannedstmt->stmt_location,
queryDesc->plannedstmt->stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
+ INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total),
queryDesc->estate->es_total_processed,
- &queryDesc->totaltime->bufusage,
- &queryDesc->totaltime->walusage,
+ &queryDesc->totaltime->instr.bufusage,
+ &queryDesc->totaltime->instr.walusage,
queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
NULL,
queryDesc->estate->es_parallel_workers_to_launch,
@@ -1155,17 +1124,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
!IsA(parsetree, ExecuteStmt) &&
!IsA(parsetree, PrepareStmt))
{
- instr_time start;
- instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -1181,6 +1144,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
@@ -1195,9 +1159,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
* former value, which'd otherwise be a good idea.
*/
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
/*
* Track the total number of rows retrieved or affected by the utility
* statements of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED
@@ -1209,23 +1170,15 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
saved_stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
rows,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 2a0f8c8e3b8..1ceb2306954 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2886,6 +2886,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2935,7 +2936,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2950,7 +2951,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index e54782d9dd8..04cd53916ca 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -991,8 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2117,6 +2117,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2185,7 +2186,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2200,7 +2201,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index f698c2d899b..c95e6801ead 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -637,8 +637,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ QueryInstrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -654,6 +653,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -984,14 +985,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrQueryStopFinalize(instr);
+
if (verbose || params.log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_vacuum_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -1000,12 +1001,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
+ BufferUsage bufferusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
total_blks_hit = bufferusage.shared_blks_hit +
bufferusage.local_blks_hit;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 47a9bda30c9..6a261c8dcbd 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1466,8 +1466,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -1753,6 +1753,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1828,7 +1829,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1838,7 +1839,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index eeed91be266..c21b2019eab 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -309,9 +309,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ QueryInstrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -362,6 +360,9 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -742,12 +743,13 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrQueryStopFinalize(instr);
+
if (verbose || params.log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params.log_analyze_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -755,18 +757,15 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
-
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ BufferUsage bufusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
+
+ total_blks_hit = bufusage.shared_blks_hit +
+ bufusage.local_blks_hit;
+ total_blks_read = bufusage.shared_blks_read +
+ bufusage.local_blks_read;
+ total_blks_dirtied = bufusage.shared_blks_dirtied +
+ bufusage.local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e73dc129132..dc5e63955bc 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -324,14 +324,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ QueryInstrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instr = InstrQueryAlloc(INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+ else
+ instr = InstrQueryAlloc(INSTRUMENT_TIMER);
+
if (es->memory)
{
/*
@@ -348,15 +350,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrQueryStart(instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrQueryStopFinalize(instr);
if (es->memory)
{
@@ -364,16 +363,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
@@ -590,7 +582,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* grab serialization metrics before we destroy the DestReceiver */
if (es->serialize != EXPLAIN_SERIALIZE_NONE)
- serializeMetrics = GetSerializationMetrics(dest);
+ {
+ SerializeMetrics *metrics = GetSerializationMetrics(dest);
+
+ if (metrics)
+ memcpy(&serializeMetrics, metrics, sizeof(SerializeMetrics));
+ }
/* call the DestReceiver's destroy method even during explain */
dest->rDestroy(dest);
@@ -1019,7 +1016,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
ExplainIndentText(es);
if (es->timing)
appendStringInfo(es->str, "Serialization: time=%.3f ms output=" UINT64_FORMAT "kB format=%s\n",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
else
@@ -1027,10 +1024,10 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
- if (es->buffers && peek_buffer_usage(es, &metrics->bufferUsage))
+ if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
es->indent--;
}
}
@@ -1038,13 +1035,13 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
{
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
3, es);
ExplainPropertyUInteger("Output Volume", "kB",
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 3c96061cf32..e1fc723c758 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,11 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ Instrumentation *instr = &myState->metrics.instr;
/* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ if (instr->need_timer || instr->need_stack)
+ InstrStart(instr);
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +182,9 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
- /* Update timing data */
- if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
-
- /* Update buffer metrics */
- if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ /* Stop per-tuple measurement */
+ if (instr->need_timer || instr->need_stack)
+ InstrStop(instr);
return true;
}
@@ -233,9 +220,17 @@ serializeAnalyzeStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
/* The output buffer is re-used across rows, as in printtup.c */
initStringInfo(&receiver->buf);
- /* Initialize results counters */
+ /* Initialize metrics and per-tuple instrumentation */
memset(&receiver->metrics, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(receiver->metrics.timeSpent);
+ {
+ int instrument_options = 0;
+
+ if (receiver->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (receiver->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ InstrInitOptions(&receiver->metrics.instr, instrument_options);
+ }
}
/*
@@ -246,6 +241,8 @@ serializeAnalyzeShutdown(DestReceiver *self)
{
SerializeDestReceiver *receiver = (SerializeDestReceiver *) self;
+ InstrFinalizeChild(&receiver->metrics.instr, instr_stack.current);
+
if (receiver->finfos)
pfree(receiver->finfos);
receiver->finfos = NULL;
@@ -296,16 +293,18 @@ CreateExplainSerializeDestReceiver(ExplainState *es)
* receiver if the subject statement is CREATE TABLE AS. In that
* case, return all-zeroes stats.
*/
-SerializeMetrics
+/*
+ * GetSerializationMetrics - get serialization metrics
+ *
+ * Returns a pointer to the SerializeMetrics inside the dest receiver,
+ * or NULL if the receiver is not a SerializeDestReceiver (e.g. an IntoRel
+ * receiver for CREATE TABLE AS).
+ */
+SerializeMetrics *
GetSerializationMetrics(DestReceiver *dest)
{
- SerializeMetrics empty;
-
if (dest->mydest == DestExplainSerialize)
- return ((SerializeDestReceiver *) dest)->metrics;
-
- memset(&empty, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(empty.timeSpent);
+ return &((SerializeDestReceiver *) dest)->metrics;
- return empty;
+ return NULL;
}
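Reviewer's illustration, not part of the patch: GetSerializationMetrics() now returns a pointer, or NULL for a non-serialize receiver, instead of a zeroed struct by value, so callers have to handle the NULL case themselves. Below is a minimal standalone sketch of that calling convention. Every Toy* type and toy_* function is a hypothetical stand-in, not PostgreSQL's definition.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the receiver and metrics types. */
typedef enum ToyDestKind
{
	TOY_DEST_EXPLAIN_SERIALIZE,
	TOY_DEST_INTO_REL
} ToyDestKind;

typedef struct ToyMetrics
{
	uint64_t	bytes_sent;
} ToyMetrics;

typedef struct ToyReceiver
{
	ToyDestKind kind;
	ToyMetrics	metrics;
} ToyReceiver;

/* Return a pointer into the receiver, or NULL if it carries no metrics. */
static ToyMetrics *
toy_get_metrics(ToyReceiver *dest)
{
	if (dest->kind == TOY_DEST_EXPLAIN_SERIALIZE)
		return &dest->metrics;
	return NULL;
}

/* Caller-side pattern: treat a missing metrics struct as "nothing sent". */
static uint64_t
toy_bytes_sent_or_zero(ToyReceiver *dest)
{
	ToyMetrics *m = toy_get_metrics(dest);

	return m ? m->bytes_sent : 0;
}
```

Returning a pointer avoids copying the struct, which matters more now that the metrics embed an Instrumentation, but it moves the "all zeroes" fallback to each call site.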
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 876aad2100a..f7e158e4dd9 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -580,13 +580,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ QueryInstrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ int instrument_options = INSTRUMENT_TIMER;
+
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -598,9 +601,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrQueryStart(instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -635,8 +636,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrQueryStopFinalize(instr);
if (es->memory)
{
@@ -644,13 +644,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -660,7 +653,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c69c12dc014..90ac5ccaacd 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2139,7 +2139,7 @@ ExecuteTruncateGuts(List *explicit_rels,
rel,
0, /* dummy rangetable index */
NULL,
- 0);
+ NULL);
estate->es_opened_result_relations =
lappend(estate->es_opened_result_relations, resultRelInfo);
resultRelInfo++;
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 29b80d75143..f2597b917e1 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -93,6 +93,7 @@ static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
TriggerInstrumentation *instr,
+ QueryInstrumentation *qinstr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2312,6 +2313,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
TriggerInstrumentation *instr,
+ QueryInstrumentation *qinstr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2346,7 +2348,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartTrigger(instr + tgindx);
+ InstrStartTrigger(qinstr, instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2441,6 +2443,7 @@ ExecBSInsertTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -2502,6 +2505,7 @@ ExecBRInsertTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2606,6 +2610,7 @@ ExecIRInsertTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2670,6 +2675,7 @@ ExecBSDeleteTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -2780,6 +2786,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2884,6 +2891,7 @@ ExecIRDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (rettuple == NULL)
return false; /* Delete was suppressed */
@@ -2942,6 +2950,7 @@ ExecBSUpdateTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -3094,6 +3103,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
@@ -3258,6 +3268,7 @@ ExecIRUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -3316,6 +3327,7 @@ ExecBSTruncateTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -4373,7 +4385,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartTrigger(instr + tgindx);
+ InstrStartTrigger(estate->es_instrument, instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4561,6 +4573,7 @@ AfterTriggerExecute(EState *estate,
tgindx,
finfo,
NULL,
+ NULL,
per_tuple_context);
if (rettuple != NULL &&
rettuple != LocTriggerData.tg_trigtuple &&
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 77834b96a21..c330c891c03 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -308,8 +308,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgBufferUsage or
- * pgWalUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1006,6 +1006,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1095,7 +1096,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1103,7 +1104,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/README.instrument b/src/backend/executor/README.instrument
new file mode 100644
index 00000000000..580fd5d85e0
--- /dev/null
+++ b/src/backend/executor/README.instrument
@@ -0,0 +1,227 @@
+src/backend/executor/README.instrument
+
+Instrumentation
+===============
+
+The instrumentation subsystem measures time, buffer usage and WAL activity
+during query execution and other similar activities. It is used by
+EXPLAIN ANALYZE, pg_stat_statements, and other consumers that need
+activity and/or timing metrics over a section of code.
+
+The design has two central goals:
+
+* Make it cheap to measure activity in a section of code, even when
+ that section is called many times and the aggregate is what is used
+ (as is the case with per-node instrumentation in the executor)
+
+* Ensure nested instrumentation measures activity and timing accurately:
+ counter updates are written to the currently active instrumentation
+ and accumulated upward to parent nodes at finalization, including
+ when execution is aborted by an error.
+
+The key data structures are defined in src/include/executor/instrument.h
+and the implementation lives in src/backend/executor/instrument.c.
+
+
+Instrumentation Options
+-----------------------
+
+Callers specify what to measure with a bitmask of InstrumentOption flags:
+
+ INSTRUMENT_ROWS -- row counts only (used with NodeInstrumentation)
+ INSTRUMENT_TIMER -- wall-clock timing and row counts
+ INSTRUMENT_BUFFERS -- buffer hit/read/dirtied/written counts and I/O time
+ INSTRUMENT_WAL -- WAL records, FPI, bytes
+
+INSTRUMENT_BUFFERS and INSTRUMENT_WAL utilize the instrumentation stack
+(described below) for efficient handling of counter values.
+
+
+Struct Hierarchy
+----------------
+
+There are four instrumentation structs, each specialized for a different
+scope:
+
+Instrumentation Base struct. Holds timing and buffer/WAL counters.
+
+QueryInstrumentation Extends Instrumentation for query-level tracking. When
+ stack-based tracking is enabled, it owns a dedicated
+ MemoryContext and uses the ResourceOwner mechanism for
+ abort cleanup.
+
+NodeInstrumentation Extends Instrumentation for per-plan-node statistics
+ (startup time, tuple counts, loop counts, etc).
+
+TriggerInstrumentation Extends Instrumentation with a firing count.
+
+
+Stack-based instrumentation
+===========================
+
+For tracking WAL or buffer usage counters, the specialized stack-based
+instrumentation is used.
+
+At all times, there is a stack that tracks which Instrumentation is currently
+active. The stack is represented by instr_stack, a per-backend global
+that holds a dynamic array of Instrumentation pointers. The field
+instr_stack.current always points to the current stack entry that should
+be updated when activity occurs. When the stack array is empty,
+instr_stack.current points to instr_top.
+
+For example, if a backend has two portals open, the overall nesting of
+Instrumentation and their respective InstrStart/InstrStop calls creates a
+tree-like structure like this:
+
+ Session (instr_top)
+ |
+ +-- Query A (QueryInstrumentation)
+ | |
+ | +-- NestLoop (NodeInstrumentation)
+ | |
+ | +-- Seq Scan A (NodeInstrumentation)
+ | +-- Seq Scan B (NodeInstrumentation)
+ |
+ +-- Query B (QueryInstrumentation)
+ |
+ +-- Seq Scan C (NodeInstrumentation)
+
+While executing Seq Scan B, the stack looks like:
+
+ instr_top (implicit bottom, not in the entries array)
+ 0: Query A
+ 1: NestLoop
+ 2: Seq Scan B <-- instr_stack.current
+
+When no query is running, the stack is empty (stack_size == 0) and
+instr_stack.current points to instr_top.
+
+Any buffer or WAL counter update (via the INSTR_BUFUSAGE_* and
+INSTR_WALUSAGE_* macros in the buffer manager, WAL insertion code, etc.)
+writes directly into instr_stack.current. Each instrumentation node starts
+zeroed, so the values it accumulates while on top of the stack represent
+exactly the activity that occurred during that time.
+
+Every Instrumentation node has a target, or parent, it will be accumulated
+into, which is typically the Instrumentation that was the current stack
+entry when it was created.
+
+For example, when Seq Scan A is finalized in regular execution via
+ExecutorFinish, its instrumentation data is added to its immediate parent
+in the execution tree, the NestLoop, which is in turn added to Query A's
+QueryInstrumentation, which then accumulates to its own parent.
+
+While we can typically think of this as a tree, the NodeInstrumentation
+underneath a particular QueryInstrumentation could behave differently --
+for example, it could propagate directly to the QueryInstrumentation, in
+order to not show cumulative numbers in EXPLAIN ANALYZE.
+
+Note these relationships are partially implicit, especially when it comes
+to NodeInstrumentation. Each QueryInstrumentation maintains a list of its
+unfinalized child nodes. The parent of a QueryInstrumentation itself is
+determined by the stack (see below): when a query is finalized or cleaned
+up on abort, its counters are accumulated to whatever entry is then current
+on the stack, which is typically instr_top.
+
+
+Finalization and Abort Safety
+=============================
+
+Finalization is the process of rolling up a node's buffer/WAL counters to
+its parent. In normal execution, nodes are pushed onto the stack when they
+start and popped when they stop; at finalization time their accumulated
+counters are added to the parent.
+
+Due to the use of longjmp for error handling, functions can exit abruptly
+without executing their normal cleanup code. On abort, two things need
+to happen:
+
+1. Reset the stack to the appropriate level. This ensures that we don't
+ later try to update counters on a freed stack entry. We also need to
+ ensure that the stack entry that was current before a particular
+ Instrumentation started is current again after it stops.
+
+2. Finalize all affected Instrumentation nodes, rolling up their counters
+ to the highest surviving Instrumentation, so that data is not lost.
+
+For example, if Seq Scan B aborts while the stack is:
+
+ instr_top (implicit bottom)
+ 0: Query A
+ 1: NestLoop
+ 2: Seq Scan B
+
+The abort handler for Query A accumulates all unfinalized children (Seq
+Scan A, Seq Scan B, NestLoop) directly into Query A's counters, then
+unwinds the stack and accumulates Query A's counters to instr_top.
+
+Note that on abort the children do not accumulate through each other (Seq
+Scan B -> NestLoop -> Query A); they all accumulate directly to their
+parent QueryInstrumentation. This means the order in which children are
+released does not matter -- important because ResourceOwner cleanup does
+not guarantee a particular release order. The per-node breakdown is lost,
+but the query-level total is what survives the abort.
+
+If multiple QueryInstrumentations are active on the stack (e.g. nested
+portals), each one's abort handler uses InstrStopFinalize to unwind to
+whichever entry is higher up, so they compose correctly regardless of
+release order.
+
+There are two mechanisms for achieving abort safety:
+
+Resource Owner (QueryInstrumentation)
+-------------------------------------
+
+QueryInstrumentation registers with the current ResourceOwner at start.
+On transaction abort, the resource owner system calls the release callback,
+which walks unfinalized child entries, accumulates their data, unwinds the
+stack, and destroys the dedicated memory context (freeing the
+QueryInstrumentation and all child allocations as a unit).
+
+This is the recommended approach when the instrumented code already has an
+appropriate resource owner (e.g. it runs inside a portal). The query
+executor uses this path.
+
+PG_FINALLY (base Instrumentation)
+----------------------------------
+
+When no suitable resource owner exists, or when the caller wants to inspect
+the instrumentation data even after an error, the base Instrumentation can
+be used with a PG_TRY/PG_FINALLY block that calls InstrStopFinalize().
+
+Both mechanisms add overhead, so neither is suitable for high-frequency
+instrumentation like per-node measurements in the executor. Instead,
+plan node and trigger children rely on their parent QueryInstrumentation
+for abort safety: they are allocated in the parent's memory context and
+registered in its unfinalized-entries list, so the parent's abort handler
+recovers their data automatically. In normal execution, children are
+finalized explicitly by the caller.
+
+Parallel Query
+--------------
+
+Parallel workers get their own QueryInstrumentation so they can measure
+buffer and WAL activity independently, then copy the totals into shared
+memory at shutdown. The leader accumulates these into its own stack.
+
+When per-node instrumentation is active, parallel workers skip per-node
+finalization at shutdown to avoid double-counting; the per-node data is
+aggregated separately through InstrAggNode().
+
+
+Memory Handling
+===============
+
+Instrumentation objects that use the stack must survive until finalization
+runs, including the abort case. To ensure this, QueryInstrumentation
+creates a dedicated "Instrumentation" MemoryContext (instr_cxt) as a child
+of TopMemoryContext. All child instrumentation (nodes, triggers) should be
+allocated in this context.
+
+On successful completion, instr_cxt is reparented to CurrentMemoryContext
+so its lifetime is tied to the caller's context. On abort, the
+ResourceOwner cleanup resets the stack, accumulates the instrumentation
+data to the then-current stack entry, and then frees instr_cxt.
+
+When the stack is not needed (timer/rows only), Instrumentation allocations
+happen in CurrentMemoryContext instead of TopMemoryContext.
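Reviewer's illustration, not part of the patch: the README's stack model can be sketched in a few lines of standalone C. All names here (ToyInstr, toy_start, toy_stop, toy_finalize, and so on) are hypothetical stand-ins for instr_top, instr_stack.current, InstrStart/InstrStop and the finalization step. The real code uses a dynamic array, separate BufferUsage/WalUsage structs, and abort handling that this sketch leaves out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for an Instrumentation entry. */
typedef struct ToyInstr
{
	uint64_t	shared_blks_hit;	/* stands in for buffer counters */
	uint64_t	wal_records;		/* stands in for WAL counters */
} ToyInstr;

#define TOY_STACK_MAX 32			/* fixed size only to keep the sketch short */

static ToyInstr toy_top;			/* implicit bottom, like instr_top */
static ToyInstr *toy_entries[TOY_STACK_MAX];
static int	toy_stack_size = 0;
static ToyInstr *toy_current = &toy_top;	/* like instr_stack.current */

/* Activity is charged to whatever entry is current; no diffing of globals. */
static void
toy_count_buffer_hit(void)
{
	toy_current->shared_blks_hit++;
}

/* Push an entry and make it current. */
static void
toy_start(ToyInstr *instr)
{
	assert(toy_stack_size < TOY_STACK_MAX);
	toy_entries[toy_stack_size++] = instr;
	toy_current = instr;
}

/* Pop the entry and restore whatever was current before it started. */
static void
toy_stop(ToyInstr *instr)
{
	assert(toy_stack_size > 0 && toy_entries[toy_stack_size - 1] == instr);
	toy_stack_size--;
	toy_current = (toy_stack_size > 0) ? toy_entries[toy_stack_size - 1] : &toy_top;
}

/* Finalization: roll a child's counters up into its parent. */
static void
toy_finalize(ToyInstr *child, ToyInstr *parent)
{
	parent->shared_blks_hit += child->shared_blks_hit;
	parent->wal_records += child->wal_records;
}

/* One query with one scan node; returns the session-level hit count. */
static uint64_t
toy_demo(void)
{
	ToyInstr	query = {0, 0};
	ToyInstr	scan = {0, 0};

	toy_start(&query);
	toy_count_buffer_hit();		/* charged to query */
	toy_start(&scan);
	toy_count_buffer_hit();		/* charged to scan */
	toy_count_buffer_hit();		/* charged to scan */
	toy_stop(&scan);
	toy_stop(&query);

	toy_finalize(&scan, &query);	/* query now includes the scan's hits */
	toy_finalize(&query, &toy_top);
	return toy_top.shared_blks_hit;
}
```

In this sketch each entry holds only the activity that happened while it was on top of the stack, and cumulative totals appear only once toy_finalize() has run. The README describes the same split between counting and finalization.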
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 1b950040597..5366c1e801c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -78,6 +78,7 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
/* decls for local routines only used within this module */
static void InitPlan(QueryDesc *queryDesc, int eflags);
static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecFinalizeTriggerInstrumentation(EState *estate);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(QueryDesc *queryDesc,
@@ -247,9 +248,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
estate->es_top_eflags = eflags;
- estate->es_instrument = queryDesc->instrument_options;
estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
+ /*
+ * Set up query-level instrumentation if needed. We do this before
+ * InitPlan so that node and trigger instrumentation can be allocated
+ * within the query's dedicated instrumentation memory context.
+ */
+ if (!queryDesc->totaltime && queryDesc->instrument_options)
+ {
+ queryDesc->totaltime = InstrQueryAlloc(queryDesc->instrument_options);
+ estate->es_instrument = queryDesc->totaltime;
+ }
+
/*
* Set up an AFTER-trigger statement context, unless told not to, or
* unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
@@ -331,9 +342,21 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
- /* Allow instrumentation of Executor overall runtime */
+ /* Start up instrumentation for this execution run */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ {
+ InstrQueryStart(queryDesc->totaltime);
+
+ /*
+ * Remember all node entries for abort recovery. We do this once here
+ * after InstrQueryStart has pushed the parent stack entry.
+ */
+ if (estate->es_instrument &&
+ estate->es_instrument->instr.need_stack &&
+ !queryDesc->already_executed)
+ ExecRememberNodeInstrumentation(queryDesc->planstate,
+ queryDesc->totaltime);
+ }
/*
* extract information from the query descriptor and the query feature.
@@ -385,7 +408,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -435,7 +458,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ InstrQueryStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -444,8 +467,26 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
+ /*
+ * Accumulate per-node and trigger statistics to their respective parent
+ * instrumentation stacks.
+ *
+ * We skip this in parallel workers because their per-node stats are
+ * reported individually via ExecParallelReportInstrumentation, and the
+ * leader's own ExecFinalizeNodeInstrumentation handles propagation. If
+ * we accumulated here, the leader would double-count: worker parent nodes
+ * would already include their children's stats, and then the leader's
+ * accumulation would add the children again.
+ */
+ if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStopFinalize(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
@@ -1263,7 +1304,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options)
+ QueryInstrumentation *qinstr)
{
MemSet(resultRelInfo, 0, sizeof(ResultRelInfo));
resultRelInfo->type = T_ResultRelInfo;
@@ -1284,8 +1325,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
palloc0_array(FmgrInfo, n);
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
- if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
+ if (qinstr)
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, n);
}
else
{
@@ -1499,6 +1540,30 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
return resultRelInfo->ri_ancestorResultRels;
}
+static void
+ExecFinalizeTriggerInstrumentation(EState *estate)
+{
+ List *rels = NIL;
+
+ rels = list_concat(rels, estate->es_tuple_routing_result_relations);
+ rels = list_concat(rels, estate->es_opened_result_relations);
+ rels = list_concat(rels, estate->es_trig_target_relations);
+
+ foreach_node(ResultRelInfo, rInfo, rels)
+ {
+ TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
+
+ if (ti == NULL || rInfo->ri_TrigDesc == NULL)
+ continue;
+
+ for (int nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
+ {
+ if (ti[nt].instr.need_stack)
+ InstrAccumStack(&estate->es_instrument->instr, &ti[nt].instr);
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecPostprocessPlan
*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index c153d5c1c3b..0b18a05c434 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -694,7 +694,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
- * looking at pgBufferUsage, so do it unconditionally.
+ * looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -819,13 +819,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
- instrumentation->instrument_options = estate->es_instrument;
+ instrumentation->instrument_options = estate->es_instrument->instrument_options;
instrumentation->instrument_offset = instrument_offset;
instrumentation->num_workers = nworkers;
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInitNode(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument->instrument_options);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1075,14 +1075,28 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
instrument = GetInstrumentationArray(instrumentation);
instrument += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
+ {
InstrAggNode(planstate->instrument, &instrument[n]);
+ /*
+ * Also add worker WAL usage to the global pgWalUsage counter.
+ *
+ * When per-node instrumentation is active, parallel workers skip
+ * ExecFinalizeNodeInstrumentation (to avoid double-counting in
+ * EXPLAIN), so per-node WAL activity is not rolled up into the
+ * query-level stats that InstrAccumParallelQuery receives. Without
+ * this, pgWalUsage would under-report WAL generated by parallel
+ * workers when instrumentation is active.
+ */
+ WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
+ }
+
/*
* Also store the per-worker detail.
*
- * Worker instrumentation should be allocated in the same context as the
- * regular instrumentation information, which is the per-query context.
- * Switch into per-query memory context.
+ * Ensure worker instrumentation is allocated in the per-query context. We
+ * don't need to place this in the instrumentation context since no more
+ * stack-based instrumentation work is being done.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
@@ -1456,6 +1470,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1516,7 +1531,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1532,7 +1547,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d96d4f9947b..6f2909a1bc3 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1381,7 +1381,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
{
ResultRelInfo *rri = makeNode(ResultRelInfo);
- InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
+ InitResultRelInfo(rri, rel, 0, rootResultRelInfo, NULL);
proute->nonleaf_partitions[dispatchidx] = rri;
}
else
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 132fe37ef60..21ad1b04a57 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -123,6 +123,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
+static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -788,10 +790,10 @@ ExecShutdownNode_walker(PlanState *node, void *context)
* at least once already. We don't expect much CPU consumption during
* node shutdown, but in the case of Gather or Gather Merge, we may shut
* down workers at this stage. If so, their buffer usage will get
- * propagated into pgBufferUsage at this point, and we want to make sure
- * that it gets associated with the Gather node. We skip this if the node
- * has never been executed, so as to avoid incorrectly making it appear
- * that it has.
+ * propagated into the current instrumentation stack entry at this point,
+ * and we want to make sure that it gets associated with the Gather node.
+ * We skip this if the node has never been executed, so as to avoid
+ * incorrectly making it appear that it has.
*/
if (node->instrument && node->instrument->running)
InstrStartNode(node->instrument);
@@ -829,6 +831,80 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecRememberNodeInstrumentation
+ *
+ * Register all per-node instrumentation entries as unfinalized children of
+ * the executor's instrumentation. This is needed for abort recovery: if the
+ * executor aborts, we need to walk each per-node entry to recover buffer/WAL
+ * data from nodes that never got finalized, so that aggregate totals remain
+ * accurate for anyone consuming them.
+ */
+void
+ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent)
+{
+ (void) ExecRememberNodeInstrumentation_walker(node, parent);
+}
+
+static bool
+ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ QueryInstrumentation *parent = (QueryInstrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ if (node->instrument)
+ InstrQueryRememberChild(parent, &node->instrument->instr);
+
+ return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
+}
+
+/*
+ * ExecFinalizeNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation).
+ *
+ * This must run after the cleanup done by ExecShutdownNode, and not rely on any
+ * resources cleaned up by it. We also expect shutdown actions to have occurred,
+ * e.g. parallel worker instrumentation to have been added to the leader.
+ */
+void
+ExecFinalizeNodeInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeNodeInstrumentation_walker(node, instr_stack.current);
+}
+
+static bool
+ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ Instrumentation *parent = (Instrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ /*
+ * Recurse into children first (bottom-up accumulation), passing our
+ * instrumentation as the parent context. This ensures children can
+ * accumulate to us even if they were never executed by the leader (e.g.
+ * nodes beneath Gather that only workers ran).
+ */
+ planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
+ node->instrument ? &node->instrument->instr : parent);
+
+ if (!node->instrument)
+ return false;
+
+ InstrFinalizeChild(&node->instrument->instr, parent);
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 9886ab06b69..0da0ff6b339 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -150,7 +150,7 @@ CreateExecutorState(void)
estate->es_total_processed = 0;
estate->es_top_eflags = 0;
- estate->es_instrument = 0;
+ estate->es_instrument = NULL;
estate->es_finished = false;
estate->es_exprcontexts = NIL;
@@ -227,6 +227,15 @@ FreeExecutorState(EState *estate)
estate->es_partition_directory = NULL;
}
+ /*
+ * Make sure the instrumentation context gets freed. This usually gets
+ * re-parented under the per-query context in InstrQueryStopFinalize, but
+ * that won't happen during EXPLAIN (BUFFERS) since ExecutorFinish never
+ * gets called, so we would otherwise leak it in TopMemoryContext.
+ */
+ if (estate->es_instrument && estate->es_instrument->instr.need_stack)
+ MemoryContextDelete(estate->es_instrument->instr_cxt);
+
/*
* Free the per-query memory context, thereby releasing all working
* memory, including the EState node itself.
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index bc551f95a08..6892706a83a 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,30 +16,46 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
-BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
-static WalUsage save_pgWalUsage;
+Instrumentation instr_top;
+InstrStackState instr_stack = {0, 0, NULL, &instr_top};
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
-static void WalUsageAdd(WalUsage *dst, WalUsage *add);
+void
+InstrStackGrow(void)
+{
+ int space = instr_stack.stack_space;
+
+ if (instr_stack.entries == NULL)
+ {
+ space = 10; /* Allocate sufficient initial space for
+ * typical activity */
+ instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
+ sizeof(Instrumentation *) * space);
+ }
+ else
+ {
+ space *= 2;
+ instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, space);
+ }
+ /* Update stack space after allocation succeeded to protect against OOMs */
+ instr_stack.stack_space = space;
+}
/* General purpose instrumentation handling */
-Instrumentation *
-InstrAlloc(int instrument_options)
+static inline bool
+InstrNeedStack(int instrument_options)
{
- Instrumentation *instr = palloc0(sizeof(Instrumentation));
- InstrInitOptions(instr, instrument_options);
- return instr;
+ return (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) != 0;
}
void
InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
+ instr->need_stack = InstrNeedStack(instrument_options);
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
@@ -54,50 +70,295 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- /* save buffer usage totals at node entry, if needed */
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
+static void
+InstrStopTimer(Instrumentation *instr)
+{
+ instr_time endtime;
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ /* caller has checked need_timer; a zero starttime means no matching start */
+ if (INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStop called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
}
void
InstrStop(Instrumentation *instr)
{
- instr_time endtime;
+ if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ if (instr->need_stack)
+ InstrPopStack(instr);
+}
+
+/*
+ * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
+ *
+ * Note that this intentionally allows passing a stack entry that is not the
+ * current top, as can happen with PG_FINALLY or resource owners, which don't
+ * have a guaranteed cleanup order.
+ *
+ * We are careful here to achieve two goals:
+ *
+ * 1) Reset the stack to the parent of whichever of the released stack entries
+ * has the lowest index
+ * 2) Accumulate all instrumentation to the currently active instrumentation,
+ * so that callers get a complete picture of activity, even after an abort
+ */
+void
+InstrStopFinalize(Instrumentation *instr)
+{
+ int idx = -1;
+
+ for (int i = instr_stack.stack_size - 1; i >= 0; i--)
+ {
+ if (instr_stack.entries[i] == instr)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ if (idx >= 0)
+ {
+ while (instr_stack.stack_size > idx + 1)
+ instr_stack.stack_size--;
+
+ InstrPopStack(instr);
+ }
- /* let's update the time only if the timer was requested */
if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ InstrAccumStack(instr_stack.current, instr);
+}
+
+/*
+ * Finalize child instrumentation by accumulating buffer/WAL usage to the
+ * provided instrumentation, which may be the current entry, or one the caller
+ * treats as a parent and will add to the totals later.
+ *
+ * Also deletes the unfinalized entry to avoid double counting in an abort
+ * situation, e.g. during executor finish.
+ */
+void
+InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent)
+{
+ if (instr->need_stack)
{
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ if (!dlist_node_is_detached(&instr->unfinalized_entry))
+ dlist_delete_thoroughly(&instr->unfinalized_entry);
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+ InstrAccumStack(parent, instr);
+ }
+}
+
+
+/* Query instrumentation handling */
+
+/*
+ * Use ResourceOwner mechanism to correctly reset instr_stack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+ .release_priority = RELEASE_PRIO_INSTRUMENTATION,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ QueryInstrumentation *qinstr = (QueryInstrumentation *) DatumGetPointer(res);
+ MemoryContext instr_cxt = qinstr->instr_cxt;
+ dlist_mutable_iter iter;
- INSTR_TIME_SET_ZERO(instr->starttime);
+ /* Accumulate data from all unfinalized child entries (nodes, triggers) */
+ dlist_foreach_modify(iter, &qinstr->unfinalized_entries)
+ {
+ Instrumentation *child = dlist_container(Instrumentation, unfinalized_entry, iter.cur);
+
+ InstrAccumStack(&qinstr->instr, child);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
+ /* Ensure the stack is reset as expected, and we accumulate to the parent */
+ InstrStopFinalize(&qinstr->instr);
+
+ /*
+ * Destroy the dedicated instrumentation context, which frees the
+ * QueryInstrumentation and all child allocations.
+ */
+ MemoryContextDelete(instr_cxt);
+}
+
+QueryInstrumentation *
+InstrQueryAlloc(int instrument_options)
+{
+ QueryInstrumentation *instr;
+ MemoryContext instr_cxt;
+
+ /*
+ * When the instrumentation stack is used, create a dedicated memory
+ * context for this query's instrumentation allocations. This context is a
+ * child of TopMemoryContext so it survives transaction abort —
+ * ResourceOwner release needs to access it.
+ *
+ * For simpler cases (timer/rows only), use the current memory context.
+ *
+ * All child instrumentation allocations (nodes, triggers, etc) must be
+ * allocated within this context to ensure correct clean up on abort.
+ */
+ if (InstrNeedStack(instrument_options))
+ instr_cxt = AllocSetContextCreate(TopMemoryContext,
+ "Instrumentation",
+ ALLOCSET_SMALL_SIZES);
+ else
+ instr_cxt = CurrentMemoryContext;
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ instr = MemoryContextAllocZero(instr_cxt, sizeof(QueryInstrumentation));
+ instr->instrument_options = instrument_options;
+ instr->instr_cxt = instr_cxt;
+
+ InstrInitOptions(&instr->instr, instrument_options);
+ dlist_init(&instr->unfinalized_entries);
+
+ return instr;
+}
+
+void
+InstrQueryStart(QueryInstrumentation *qinstr)
+{
+ InstrStart(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(CurrentResourceOwner != NULL);
+ qinstr->owner = CurrentResourceOwner;
+
+ ResourceOwnerEnlarge(qinstr->owner);
+ ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
+ }
+}
+
+void
+InstrQueryStop(QueryInstrumentation *qinstr)
+{
+ InstrStop(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+ }
+}
+
+void
+InstrQueryStopFinalize(QueryInstrumentation *qinstr)
+{
+ InstrStopFinalize(&qinstr->instr);
+
+ if (!qinstr->instr.need_stack)
+ return;
+
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+
+ /*
+ * Reparent the dedicated instrumentation context under the current memory
+ * context, so that its lifetime is now tied to the caller's context
+ * rather than TopMemoryContext.
+ */
+ MemoryContextSetParent(qinstr->instr_cxt, CurrentMemoryContext);
+}
+
+/*
+ * Register a child Instrumentation entry for abort processing.
+ *
+ * On abort, ResOwnerReleaseInstrumentation will walk the parent's list to
+ * recover buffer/WAL data from entries that were never finalized, in order for
+ * aggregate totals to be accurate despite the query erroring out.
+ */
+void
+InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *child)
+{
+ if (child->need_stack)
+ dlist_push_head(&parent->unfinalized_entries, &child->unfinalized_entry);
+}
+
+/* start instrumentation during parallel executor startup */
+QueryInstrumentation *
+InstrStartParallelQuery(void)
+{
+ QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrQueryStart(qinstr);
+ return qinstr;
+}
+
+/* report usage after parallel executor shutdown */
+void
+InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+{
+ InstrQueryStopFinalize(qinstr);
+ memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+}
+
+/*
+ * Accumulate work done by parallel workers in the leader's stats.
+ *
+ * Note that what gets added here effectively depends on whether per-node
+ * instrumentation is active. If it's active the parallel worker intentionally
+ * skips ExecFinalizeNodeInstrumentation on executor shutdown, because it would
+ * cause double counting. Instead, this only accumulates any extra activity
+ * outside of nodes.
+ *
+ * Otherwise this is responsible for making sure that the complete query
+ * activity is accumulated.
+ */
+void
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+{
+ BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
+ WalUsageAdd(&instr_stack.current->walusage, walusage);
+
+ WalUsageAdd(&pgWalUsage, walusage);
}
/* Node instrumentation handling */
/* Allocate new node instrumentation structure */
NodeInstrumentation *
-InstrAllocNode(int instrument_options, bool async_mode)
+InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode)
{
- NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+ NodeInstrumentation *instr = MemoryContextAlloc(qinstr->instr_cxt, sizeof(NodeInstrumentation));
- InstrInitNode(instr, instrument_options);
+ InstrInitNode(instr, qinstr->instrument_options);
instr->async_mode = async_mode;
return instr;
@@ -118,6 +379,7 @@ InstrStartNode(NodeInstrumentation *instr)
InstrStart(&instr->instr);
}
+
/* Exit from a plan node */
void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
@@ -147,14 +409,12 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->instr.starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->instr.need_bufusage)
- BufferUsageAccumDiff(&instr->instr.bufusage,
- &pgBufferUsage, &instr->instr.bufusage_start);
-
- if (instr->instr.need_walusage)
- WalUsageAccumDiff(&instr->instr.walusage,
- &pgWalUsage, &instr->instr.walusage_start);
+ /*
+ * Only pop the stack, accumulation runs in
+ * ExecFinalizeNodeInstrumentation
+ */
+ if (instr->instr.need_stack)
+ InstrPopStack(&instr->instr);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -189,8 +449,8 @@ InstrEndLoop(NodeInstrumentation *instr)
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrEndLoop called on running node");
+ /* Ensure InstrStopNode was called */
+ Assert(INSTR_TIME_IS_ZERO(instr->instr.starttime));
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
@@ -231,67 +491,73 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->instr.need_bufusage)
- BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
-
- if (dst->instr.need_walusage)
- WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
+ if (dst->instr.need_stack)
+ InstrAccumStack(&dst->instr, &add->instr);
}
/* Trigger instrumentation handling */
TriggerInstrumentation *
-InstrAllocTrigger(int n, int instrument_options)
+InstrAllocTrigger(QueryInstrumentation *qinstr, int n)
{
- TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ TriggerInstrumentation *tginstr;
int i;
+ /*
+ * Allocate in the query's dedicated instrumentation context so all
+ * instrumentation data is grouped together and cleaned up as a unit.
+ */
+ Assert(qinstr != NULL && qinstr->instr_cxt != NULL);
+ tginstr = MemoryContextAllocZero(qinstr->instr_cxt,
+ n * sizeof(TriggerInstrumentation));
+
for (i = 0; i < n; i++)
- InstrInitOptions(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, qinstr->instrument_options);
return tginstr;
}
void
-InstrStartTrigger(TriggerInstrumentation *tginstr)
+InstrStartTrigger(QueryInstrumentation *qinstr, TriggerInstrumentation *tginstr)
{
InstrStart(&tginstr->instr);
+
+ /*
+ * On first call, register with the parent QueryInstrumentation for abort
+ * recovery.
+ */
+ if (qinstr && tginstr->instr.need_stack &&
+ dlist_node_is_detached(&tginstr->instr.unfinalized_entry))
+ dlist_push_head(&qinstr->unfinalized_entries,
+ &tginstr->instr.unfinalized_entry);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
+ /*
+ * This trigger may be called again, so we don't finalize instrumentation
+ * here. Accumulation to the parent happens at ExecutorFinish through
+ * ExecFinalizeTriggerInstrumentation.
+ */
InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
-/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrAccumStack(Instrumentation *dst, Instrumentation *add)
{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
-}
+ Assert(dst != NULL);
+ Assert(add != NULL);
-/* report usage after parallel executor shutdown */
-void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
-}
+ if (!add->need_stack)
+ return;
-/* accumulate work done by workers in leader's stats */
-void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- BufferUsageAdd(&pgBufferUsage, bufusage);
- WalUsageAdd(&pgWalUsage, walusage);
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -312,39 +578,9 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
+/* dst += add */
void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
-/* helper functions for WAL usage accumulation */
-static void
-WalUsageAdd(WalUsage *dst, WalUsage *add)
+WalUsageAdd(WalUsage *dst, const WalUsage *add)
{
dst->wal_bytes += add->wal_bytes;
dst->wal_records += add->wal_records;
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 27d398d576d..4f7f097be2f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -903,7 +903,7 @@ create_edata_for_relation(LogicalRepRelMapEntry *rel)
* Use Relation opened by logicalrep_rel_open() instead of opening it
* again.
*/
- InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
+ InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, NULL);
/*
* We put the ResultRelInfo in the es_opened_result_relations list, even
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index ce1af4ad563..0a623b68996 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1259,9 +1259,9 @@ PinBufferForBlock(Relation rel,
if (rel)
{
/*
- * While pgBufferUsage's "read" counter isn't bumped unless we reach
- * WaitReadBuffers() (so, not for hits, and not for buffers that are
- * zeroed instead), the per-relation stats always count them.
+ * While the current buffer usage "read" counter isn't bumped unless
+ * we reach WaitReadBuffers() (so, not for hits, and not for buffers
+ * that are zeroed instead), the per-relation stats always count them.
*/
pgstat_count_buffer_read(rel);
}
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 9e7a88ec0d0..60400f0c81f 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -114,9 +114,9 @@ pgstat_prepare_io_time(bool track_io_guc)
* pg_stat_database only counts block read and write times, these are done for
* IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
*
- * pgBufferUsage is used for EXPLAIN. pgBufferUsage has write and read stats
- * for shared, local and temporary blocks. pg_stat_io does not track the
- * activity of temporary blocks, so these are ignored here.
+ * Executor instrumentation is used for EXPLAIN. Buffer usage tracked there has
+ * write and read stats for shared, local and temporary blocks. pg_stat_io
+ * does not track the activity of temporary blocks, so these are ignored here.
*/
void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
diff --git a/src/include/commands/explain_dr.h b/src/include/commands/explain_dr.h
index f98eaae1864..fa98d29589f 100644
--- a/src/include/commands/explain_dr.h
+++ b/src/include/commands/explain_dr.h
@@ -23,11 +23,10 @@ typedef struct ExplainState ExplainState;
typedef struct SerializeMetrics
{
uint64 bytesSent; /* # of bytes serialized */
- instr_time timeSpent; /* time spent serializing */
- BufferUsage bufferUsage; /* buffers accessed during serialization */
+ Instrumentation instr; /* per-tuple timing/buffer measurement */
} SerializeMetrics;
extern DestReceiver *CreateExplainSerializeDestReceiver(ExplainState *es);
-extern SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
+extern SerializeMetrics *GetSerializationMetrics(DestReceiver *dest);
#endif
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index d3a57242844..340029a2034 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -51,8 +51,8 @@ typedef struct QueryDesc
/* This field is set by ExecutePlan */
bool already_executed; /* true if previously executed */
- /* This is always set NULL by the core system, but plugins can change it */
- struct Instrumentation *totaltime; /* total time spent in ExecutorRun */
+ /* This field is set by ExecutorRun, or plugins */
+ struct QueryInstrumentation *totaltime; /* total time spent in ExecutorRun */
} QueryDesc;
/* in pquery.c */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 07f4b1f7490..f56b13841fb 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -233,6 +233,7 @@ ExecGetJunkAttribute(TupleTableSlot *slot, AttrNumber attno, bool *isNull)
/*
* prototypes from functions in execMain.c
*/
+typedef struct QueryInstrumentation QueryInstrumentation;
extern void ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void standard_ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void ExecutorRun(QueryDesc *queryDesc,
@@ -254,7 +255,7 @@ extern void InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options);
+ QueryInstrumentation *qinstr);
extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid,
ResultRelInfo *rootRelInfo);
extern List *ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo);
@@ -301,6 +302,8 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent);
+extern void ExecFinalizeNodeInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d4769f3da7b..f49c3f99cf2 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -13,6 +13,7 @@
#ifndef INSTRUMENT_H
#define INSTRUMENT_H
+#include "lib/ilist.h"
#include "portability/instr_time.h"
@@ -68,29 +69,91 @@ typedef enum InstrumentOption
} InstrumentOption;
/*
- * General purpose instrumentation that can capture time and WAL/buffer usage
+ * Instrumentation base class for capturing time and WAL/buffer usage
*
- * Initialized through InstrAlloc, followed by one or more calls to a pair of
- * InstrStart/InstrStop (activity is measured inbetween).
+ * If used directly:
+ * - Allocate on the stack and zero initialize the struct
+ * - Call InstrInitOptions to set instrumentation options
+ * - Call InstrStart before the activity you want to measure
+ * - Call InstrStop / InstrStopFinalize after the activity to capture totals
+ *
+ * InstrStart/InstrStop may be called multiple times. The last stop call must
+ * be to InstrStopFinalize to ensure parent stack entries get the accumulated
+ * totals. If there is risk of transaction aborts you must call
+ * InstrStopFinalize in a PG_TRY/PG_FINALLY block to avoid corrupting the
+ * instrumentation stack.
+ *
+ * In a query context use QueryInstrumentation instead, which handles aborts
+ * using the resource owner logic.
*/
typedef struct Instrumentation
{
/* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
- bool need_bufusage; /* true if we need buffer usage data */
- bool need_walusage; /* true if we need WAL usage data */
+ bool need_stack; /* true if we need WAL/buffer usage data */
/* Internal state keeping: */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
+ /* Abort handling: link in parent QueryInstrumentation's unfinalized list */
+ dlist_node unfinalized_entry;
} Instrumentation;
+/*
+ * Query-related instrumentation tracking.
+ *
+ * Usage:
+ * - Allocate on the heap using InstrQueryAlloc (required for abort handling)
+ * - Call InstrQueryStart before the activity you want to measure
+ * - Call InstrQueryStop / InstrQueryStopFinalize afterwards to capture totals
+ *
+ * InstrQueryStart/InstrQueryStop may be called multiple times. The last stop
+ * call must be to InstrQueryStopFinalize to ensure parent stack entries get
+ * the accumulated totals.
+ *
+ * Uses the resource owner mechanism for handling aborts. As such, the caller
+ * *must* not exit out of the top level transaction after having called
+ * InstrQueryStart, without first calling InstrQueryStop or
+ * InstrQueryStopFinalize. In the case of a transaction abort, logic equivalent
+ * to InstrQueryStopFinalize will be called automatically.
+ */
+struct ResourceOwnerData;
+typedef struct QueryInstrumentation
+{
+ Instrumentation instr;
+
+ /* Original instrument_options flags used to create this instrumentation */
+ int instrument_options;
+
+ /*
+ * Resource owner used for cleanup on aborts between InstrQueryStart and
+ * InstrQueryStop / InstrQueryStopFinalize
+ */
+ struct ResourceOwnerData *owner;
+
+ /*
+ * Dedicated memory context for all instrumentation allocations belonging
+ * to this query (node instrumentation, trigger instrumentation, etc.).
+ * Initially a child of TopMemoryContext so it survives transaction abort
+ * for ResourceOwner cleanup, which is then reassigned to the current
+ * memory context on InstrQueryStopFinalize.
+ */
+ MemoryContext instr_cxt;
+
+ /*
+ * Child entries that need to be cleaned up on abort, since they are not
+ * registered as a resource owner themselves. Contains both node and
+ * trigger instrumentation entries linked via instr.unfinalized_entry.
+ */
+ dlist_head unfinalized_entries;
+} QueryInstrumentation;
+
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Relies on an outer QueryInstrumentation having been set up to handle the
+ * stack used for WAL/buffer usage statistics, and relies on it for managing
+ * aborts. Solely intended for the executor and anyone reporting about its
+ * activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -111,6 +174,10 @@ typedef struct NodeInstrumentation
double nfiltered2; /* # of tuples removed by "other" quals */
} NodeInstrumentation;
+/*
+ * Care must be taken with any pointers contained within this struct, as this
+ * gets copied across processes during parallel query execution.
+ */
typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
@@ -124,16 +191,102 @@ typedef struct TriggerInstrumentation
* was fired */
} TriggerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
+/*
+ * Dynamic array-based stack for tracking current WAL/buffer usage context.
+ *
+ * When the stack is empty, 'current' points to instr_top which accumulates
+ * session-level totals.
+ */
+typedef struct InstrStackState
+{
+ int stack_space; /* allocated capacity of entries array */
+ int stack_size; /* current number of entries */
+
+ Instrumentation **entries; /* dynamic array of pointers */
+ Instrumentation *current; /* top of stack, or &instr_top when empty */
+} InstrStackState;
+
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int instrument_options);
+/*
+ * The top instrumentation represents a running total of the current backend
+ * WAL/buffer usage information. This will not be updated immediately, but
+ * rather when the current stack entry gets accumulated which typically happens
+ * at query end.
+ *
+ * Care must be taken when utilizing this in the parallel worker context:
+ * Parallel workers will report back their instrumentation to the caller,
+ * and this gets added to the caller's stack. If this were to be used in the
+ * shared memory stats infrastructure it would need to be skipped on parallel
+ * workers to avoid double counting.
+ */
+extern PGDLLIMPORT Instrumentation instr_top;
+
+/*
+ * The instrumentation stack state. The 'current' field points to the
+ * currently active stack entry that is getting updated as activity happens,
+ * and will be accumulated to parent stacks when it gets finalized by
+ * InstrStop (for non-executor use cases), ExecFinalizeNodeInstrumentation
+ * (executor finish) or ResOwnerReleaseInstrumentation on abort.
+ */
+extern PGDLLIMPORT InstrStackState instr_stack;
+
+extern void InstrStackGrow(void);
+
+/*
+ * Pushes the stack so that all WAL/buffer usage updates go to the passed in
+ * instrumentation entry.
+ *
+ * See note on InstrPopStack regarding safe use of these functions.
+ */
+static inline void
+InstrPushStack(Instrumentation *instr)
+{
+ if (unlikely(instr_stack.stack_size == instr_stack.stack_space))
+ InstrStackGrow();
+
+ instr_stack.entries[instr_stack.stack_size++] = instr;
+ instr_stack.current = instr;
+}
+
+/*
+ * Pops the stack entry back to the previous one that was effective at
+ * InstrPushStack.
+ *
+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, consider instead using InstrStart + InstrStopFinalize which can skip
+ * intermediate stack entries.
+ */
+static inline void
+InstrPopStack(Instrumentation *instr)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.entries[instr_stack.stack_size - 1] == instr);
+ instr_stack.stack_size--;
+ instr_stack.current = instr_stack.stack_size > 0
+ ? instr_stack.entries[instr_stack.stack_size - 1]
+ : &instr_top;
+}
+
extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
extern void InstrStart(Instrumentation *instr);
extern void InstrStop(Instrumentation *instr);
+extern void InstrStopFinalize(Instrumentation *instr);
+extern void InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent);
+extern void InstrAccumStack(Instrumentation *dst, Instrumentation *add);
-extern NodeInstrumentation *InstrAllocNode(int instrument_options,
- bool async_mode);
+extern QueryInstrumentation *InstrQueryAlloc(int instrument_options);
+extern void InstrQueryStart(QueryInstrumentation *instr);
+extern void InstrQueryStop(QueryInstrumentation *instr);
+extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
+extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
+
+pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+
+extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
extern void InstrStartNode(NodeInstrumentation *instr);
extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
@@ -141,35 +294,36 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
-extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
-extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr, int n);
+extern void InstrStartTrigger(QueryInstrumentation *qinstr,
+ TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+extern void WalUsageAdd(WalUsage *dst, const WalUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
+ instr_stack.current->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += (val); \
+ instr_stack.current->bufusage.fld += (val); \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ INSTR_TIME_ADD(instr_stack.current->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ INSTR_TIME_ACCUM_DIFF(instr_stack.current->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ instr_stack.current->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += (val); \
+ instr_stack.current->walusage.fld += (val); \
} while(0)
#endif /* INSTRUMENT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 502ad4f2da5..aef1003f608 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -53,6 +53,7 @@ typedef struct Instrumentation Instrumentation;
typedef struct pairingheap pairingheap;
typedef struct PlanState PlanState;
typedef struct QueryEnvironment QueryEnvironment;
+typedef struct QueryInstrumentation QueryInstrumentation;
typedef struct RelationData *Relation;
typedef Relation *RelationPtr;
typedef struct ScanKeyData ScanKeyData;
@@ -731,7 +732,7 @@ typedef struct EState
* ExecutorRun() calls. */
int es_top_eflags; /* eflags passed to ExecutorStart */
- int es_instrument; /* OR of InstrumentOption flags */
+ QueryInstrumentation *es_instrument; /* query-level instrumentation */
bool es_finished; /* true when ExecutorFinish is done */
List *es_exprcontexts; /* List of ExprContexts within EState */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index eb6033b4fdb..5463bc921f0 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -75,6 +75,7 @@ typedef uint32 ResourceReleasePriority;
#define RELEASE_PRIO_SNAPSHOT_REFS 500
#define RELEASE_PRIO_FILES 600
#define RELEASE_PRIO_WAITEVENTSETS 700
+#define RELEASE_PRIO_INSTRUMENTATION 800
/* 0 is considered invalid */
#define RELEASE_PRIO_FIRST 1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 00dd1fc6ff9..d3203ac5e9a 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1341,6 +1341,7 @@ InjectionPointSharedState
InjectionPointsCtl
InlineCodeBlock
InsertStmt
+InstrStackState
Instrumentation
Int128AggState
Int8TransTypeData
@@ -2463,6 +2464,7 @@ QueryCompletion
QueryDesc
QueryEnvironment
QueryInfo
+QueryInstrumentation
QueryItem
QueryItemType
QueryMode
--
2.47.1
[application/octet-stream] v11-0009-Add-test_session_buffer_usage-test-module.patch (30.0K, 7-v11-0009-Add-test_session_buffer_usage-test-module.patch)
From 4c4c6438bacee3669920962a391322572b927da4 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:41 -0800
Subject: [PATCH v11 9/9] Add test_session_buffer_usage test module
This module tests the instrumentation logic for the top-level stack entry,
which is maintained as a running total. No in-core code currently consumes
the top-level values in this manner, so the module helps ensure, especially
in abort situations, that we don't lose activity due to incorrect handling
of unfinalized node stacks.
---
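For readers of this patch, here is a minimal standalone sketch of the running-total idea the module exercises. It is not the patch's code: every name prefixed with sketch_ or Sketch is hypothetical, the stack is a fixed-size array rather than the dynamically grown one in instrument.h, only a single counter is tracked, and the finalize step does not skip intermediate entries the way InstrStopFinalize is described as doing.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_DEPTH 64		/* assumption: fixed depth, no growth */

/* Hypothetical stand-in for Instrumentation, with one buffer counter. */
typedef struct SketchInstr
{
	long		blks_hit;
} SketchInstr;

/* Running total; receives activity once entries are finalized. */
static SketchInstr sketch_top;
static SketchInstr *sketch_entries[SKETCH_MAX_DEPTH];
static int	sketch_size = 0;
static SketchInstr *sketch_current = &sketch_top;

/* Make 'instr' the entry that receives all subsequent counter updates. */
static void
sketch_push(SketchInstr *instr)
{
	assert(sketch_size < SKETCH_MAX_DEPTH);
	sketch_entries[sketch_size++] = instr;
	sketch_current = instr;
}

/* Restore the previous entry, or the top-level total if the stack is empty. */
static void
sketch_pop(SketchInstr *instr)
{
	assert(sketch_size > 0);
	assert(sketch_entries[sketch_size - 1] == instr);
	sketch_size--;
	sketch_current = sketch_size > 0
		? sketch_entries[sketch_size - 1]
		: &sketch_top;
}

/* Pop 'instr' and fold its counters into whatever entry is current now. */
static void
sketch_stop_finalize(SketchInstr *instr)
{
	sketch_pop(instr);
	sketch_current->blks_hit += instr->blks_hit;
}

/* Counterpart of an INSTR_BUFUSAGE_INCR-style macro: bump the current entry. */
static void
sketch_count_hit(void)
{
	sketch_current->blks_hit++;
}
```

In this sketch the top-level total only changes when an entry is finalized, so an error path that unwinds without finalizing would silently drop the activity. That is the kind of loss the abort tests in this module are meant to catch.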
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
.../test_session_buffer_usage/Makefile | 23 ++
.../expected/test_session_buffer_usage.out | 342 ++++++++++++++++++
.../test_session_buffer_usage/meson.build | 33 ++
.../sql/test_session_buffer_usage.sql | 245 +++++++++++++
.../test_session_buffer_usage--1.0.sql | 31 ++
.../test_session_buffer_usage.c | 95 +++++
.../test_session_buffer_usage.control | 5 +
9 files changed, 776 insertions(+)
create mode 100644 src/test/modules/test_session_buffer_usage/Makefile
create mode 100644 src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
create mode 100644 src/test/modules/test_session_buffer_usage/meson.build
create mode 100644 src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 28ce3b35eda..4f1380286a6 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -47,6 +47,7 @@ SUBDIRS = \
test_resowner \
test_rls_hooks \
test_saslprep \
+ test_session_buffer_usage \
test_shm_mq \
test_slru \
test_tidstore \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 3ac291656c1..41e0c3895e8 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -48,6 +48,7 @@ subdir('test_regex')
subdir('test_resowner')
subdir('test_rls_hooks')
subdir('test_saslprep')
+subdir('test_session_buffer_usage')
subdir('test_shm_mq')
subdir('test_slru')
subdir('test_tidstore')
diff --git a/src/test/modules/test_session_buffer_usage/Makefile b/src/test/modules/test_session_buffer_usage/Makefile
new file mode 100644
index 00000000000..1252b222cb9
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_session_buffer_usage/Makefile
+
+MODULE_big = test_session_buffer_usage
+OBJS = \
+ $(WIN32RES) \
+ test_session_buffer_usage.o
+
+EXTENSION = test_session_buffer_usage
+DATA = test_session_buffer_usage--1.0.sql
+PGFILEDESC = "test_session_buffer_usage - show buffer usage statistics for the current session"
+
+REGRESS = test_session_buffer_usage
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_session_buffer_usage
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out b/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
new file mode 100644
index 00000000000..5f7d349871a
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
@@ -0,0 +1,342 @@
+LOAD 'test_session_buffer_usage';
+CREATE EXTENSION test_session_buffer_usage;
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM test_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+ ok
+----
+ t
+(1 row)
+
+-- Verify counters increase after buffer activity
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM test_session_buffer_usage();
+ blocks_increased
+------------------
+ t
+(1 row)
+
+DROP TABLE test_buf_activity;
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM test_session_buffer_usage(), dc_serial_result s;
+ leader_buffers_match
+----------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_dc_tab, dc_serial_result;
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+ERROR: division by zero
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM test_session_buffer_usage();
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+DROP TABLE exc_tab;
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+ERROR: duplicate key value violates unique constraint "unique_tab_a_key"
+DETAIL: Key (a)=(1) already exists.
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM test_session_buffer_usage();
+ unique_violation_buffers_visible
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE unique_tab;
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT subxact_exc_func();
+ subxact_exc_func
+------------------
+ caught
+(1 row)
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM test_session_buffer_usage();
+ subxact_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT cursor_exc_func();
+ cursor_exc_func
+-----------------------
+ caught after 250 rows
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM test_session_buffer_usage();
+ cursor_subxact_buffers_visible
+--------------------------------
+ t
+(1 row)
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM test_session_buffer_usage();
+-- Now trigger the same scan via a trigger that errors
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+ERROR: trigger error
+CONTEXT: PL/pgSQL function trig_err_func() line 4 at RAISE
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM test_session_buffer_usage(), trig_serial_result s;
+ trigger_abort_buffers_propagated
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CONTEXT: parallel worker
+RESET debug_parallel_query;
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM test_session_buffer_usage(), par_abort_serial_result s;
+ worker_abort_buffers_not_propagated
+-------------------------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_abort_tab, par_abort_serial_result;
+-- Cleanup
+DROP EXTENSION test_session_buffer_usage;
diff --git a/src/test/modules/test_session_buffer_usage/meson.build b/src/test/modules/test_session_buffer_usage/meson.build
new file mode 100644
index 00000000000..b96f67dc7fe
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+test_session_buffer_usage_sources = files(
+ 'test_session_buffer_usage.c',
+)
+
+if host_system == 'windows'
+ test_session_buffer_usage_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_session_buffer_usage',
+ '--FILEDESC', 'test_session_buffer_usage - show buffer usage statistics for the current session',])
+endif
+
+test_session_buffer_usage = shared_module('test_session_buffer_usage',
+ test_session_buffer_usage_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_session_buffer_usage
+
+test_install_data += files(
+ 'test_session_buffer_usage.control',
+ 'test_session_buffer_usage--1.0.sql',
+)
+
+tests += {
+ 'name': 'test_session_buffer_usage',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_session_buffer_usage',
+ ],
+ },
+}
diff --git a/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql b/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
new file mode 100644
index 00000000000..daf2159c4a6
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
@@ -0,0 +1,245 @@
+LOAD 'test_session_buffer_usage';
+CREATE EXTENSION test_session_buffer_usage;
+
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM test_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+
+-- Verify counters increase after buffer activity
+SELECT test_session_buffer_usage_reset();
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM test_session_buffer_usage();
+
+DROP TABLE test_buf_activity;
+
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+SELECT count(*) FROM par_dc_tab;
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM test_session_buffer_usage(), dc_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_dc_tab, dc_serial_result;
+
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+
+SELECT test_session_buffer_usage_reset();
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP TABLE exc_tab;
+
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+
+SELECT test_session_buffer_usage_reset();
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP TABLE unique_tab;
+
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT test_session_buffer_usage_reset();
+SELECT subxact_exc_func();
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT test_session_buffer_usage_reset();
+SELECT cursor_exc_func();
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM trig_work_tab;
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM test_session_buffer_usage();
+
+-- Now trigger the same scan via a trigger that errors
+SELECT test_session_buffer_usage_reset();
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM test_session_buffer_usage(), trig_serial_result s;
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT test_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+
+SELECT test_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+RESET debug_parallel_query;
+
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM test_session_buffer_usage(), par_abort_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_abort_tab, par_abort_serial_result;
+
+-- Cleanup
+DROP EXTENSION test_session_buffer_usage;
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
new file mode 100644
index 00000000000..e9833be470a
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
@@ -0,0 +1,31 @@
+/* src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_session_buffer_usage" to load this file. \quit
+
+CREATE FUNCTION test_session_buffer_usage(
+ OUT shared_blks_hit bigint,
+ OUT shared_blks_read bigint,
+ OUT shared_blks_dirtied bigint,
+ OUT shared_blks_written bigint,
+ OUT local_blks_hit bigint,
+ OUT local_blks_read bigint,
+ OUT local_blks_dirtied bigint,
+ OUT local_blks_written bigint,
+ OUT temp_blks_read bigint,
+ OUT temp_blks_written bigint,
+ OUT shared_blk_read_time double precision,
+ OUT shared_blk_write_time double precision,
+ OUT local_blk_read_time double precision,
+ OUT local_blk_write_time double precision,
+ OUT temp_blk_read_time double precision,
+ OUT temp_blk_write_time double precision
+)
+RETURNS record
+AS 'MODULE_PATHNAME', 'test_session_buffer_usage'
+LANGUAGE C PARALLEL RESTRICTED;
+
+CREATE FUNCTION test_session_buffer_usage_reset()
+RETURNS void
+AS 'MODULE_PATHNAME', 'test_session_buffer_usage_reset'
+LANGUAGE C PARALLEL RESTRICTED;
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
new file mode 100644
index 00000000000..50eb1a2ffe6
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_session_buffer_usage.c
+ * show buffer usage statistics for the current session
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/instrument.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "test_session_buffer_usage",
+ .version = PG_VERSION
+);
+
+#define NUM_BUFFER_USAGE_COLUMNS 16
+
+PG_FUNCTION_INFO_V1(test_session_buffer_usage);
+PG_FUNCTION_INFO_V1(test_session_buffer_usage_reset);
+
+#define HAVE_INSTR_STACK 1 /* Change to 0 when testing before stack
+ * change */
+
+/*
+ * SQL function: test_session_buffer_usage()
+ *
+ * Returns a single row with all BufferUsage counters accumulated since the
+ * start of the session. Excludes any usage not yet added to the top of the
+ * stack (e.g. if this gets called inside a statement that also had buffer
+ * activity).
+ */
+Datum
+test_session_buffer_usage(PG_FUNCTION_ARGS)
+{
+ TupleDesc tupdesc;
+ Datum values[NUM_BUFFER_USAGE_COLUMNS];
+ bool nulls[NUM_BUFFER_USAGE_COLUMNS];
+ BufferUsage *usage;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ memset(nulls, 0, sizeof(nulls));
+
+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif
+
+ values[0] = Int64GetDatum(usage->shared_blks_hit);
+ values[1] = Int64GetDatum(usage->shared_blks_read);
+ values[2] = Int64GetDatum(usage->shared_blks_dirtied);
+ values[3] = Int64GetDatum(usage->shared_blks_written);
+ values[4] = Int64GetDatum(usage->local_blks_hit);
+ values[5] = Int64GetDatum(usage->local_blks_read);
+ values[6] = Int64GetDatum(usage->local_blks_dirtied);
+ values[7] = Int64GetDatum(usage->local_blks_written);
+ values[8] = Int64GetDatum(usage->temp_blks_read);
+ values[9] = Int64GetDatum(usage->temp_blks_written);
+ values[10] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time));
+ values[11] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time));
+ values[12] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time));
+ values[13] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time));
+ values[14] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+ values[15] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * SQL function: test_session_buffer_usage_reset()
+ *
+ * Resets all BufferUsage counters on the top instrumentation stack to zero.
+ * Useful in tests to avoid the baseline/delta pattern.
+ */
+Datum
+test_session_buffer_usage_reset(PG_FUNCTION_ARGS)
+{
+#if HAVE_INSTR_STACK
+ memset(&instr_top.bufusage, 0, sizeof(BufferUsage));
+#else
+ memset(&pgBufferUsage, 0, sizeof(BufferUsage));
+#endif
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
new file mode 100644
index 00000000000..41cfb15a765
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
@@ -0,0 +1,5 @@
+# test_session_buffer_usage extension
+comment = 'show buffer usage statistics for the current session'
+default_version = '1.0'
+module_pathname = '$libdir/test_session_buffer_usage'
+relocatable = true
--
2.47.1
[application/octet-stream] v11-0006-instrumentation-Use-Instrumentation-struct-for-p.patch (29.2K, 8-v11-0006-instrumentation-Use-Instrumentation-struct-for-p.patch)
From 64d056536a2ccd06755a6be20f1ea7ae41613682 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 15 Mar 2026 21:44:58 -0700
Subject: [PATCH v11 6/9] instrumentation: Use Instrumentation struct for
parallel workers
This simplifies the DSM allocations a bit, since WAL and buffer usage no
longer need to be allocated separately, and makes it easier to add the
third stack-based struct that is currently being discussed.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/access/brin/brin.c | 43 ++++++-----------
src/backend/access/gin/gininsert.c | 43 ++++++-----------
src/backend/access/nbtree/nbtsort.c | 43 ++++++-----------
src/backend/commands/vacuumparallel.c | 52 ++++++++-------------
src/backend/executor/execParallel.c | 66 ++++++++++++---------------
src/backend/executor/instrument.c | 14 +++---
src/include/executor/execParallel.h | 5 +-
src/include/executor/instrument.h | 4 +-
8 files changed, 99 insertions(+), 171 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 1ceb2306954..1c95ec9f605 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -51,8 +51,7 @@
#define PARALLEL_KEY_BRIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -148,8 +147,7 @@ typedef struct BrinLeader
BrinShared *brinshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BrinLeader;
/*
@@ -2387,8 +2385,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BrinShared *brinshared;
Sharedsort *sharedsort;
BrinLeader *brinleader = palloc0_object(BrinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -2430,18 +2427,14 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -2514,15 +2507,12 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -2533,8 +2523,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
brinleader->snapshot = snapshot;
- brinleader->walusage = walusage;
- brinleader->bufferusage = bufferusage;
+ brinleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -2573,7 +2562,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2887,8 +2876,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2949,11 +2937,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 04cd53916ca..51bb098a2a2 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -45,8 +45,7 @@
#define PARALLEL_KEY_GIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -138,8 +137,7 @@ typedef struct GinLeader
GinBuildShared *ginshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} GinLeader;
typedef struct
@@ -945,8 +943,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
GinBuildShared *ginshared;
Sharedsort *sharedsort;
GinLeader *ginleader = palloc0_object(GinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -987,18 +984,14 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1066,15 +1059,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1085,8 +1075,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
ginleader->ginshared = ginshared;
ginleader->sharedsort = sharedsort;
ginleader->snapshot = snapshot;
- ginleader->walusage = walusage;
- ginleader->bufferusage = bufferusage;
+ ginleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1125,7 +1114,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2118,8 +2107,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2199,11 +2187,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 6a261c8dcbd..504b34cc906 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -66,8 +66,7 @@
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
#define PARALLEL_KEY_TUPLESORT_SPOOL2 UINT64CONST(0xA000000000000003)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000004)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xA000000000000005)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xA000000000000006)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xA000000000000005)
/*
* DISABLE_LEADER_PARTICIPATION disables the leader's participation in
@@ -195,8 +194,7 @@ typedef struct BTLeader
Sharedsort *sharedsort;
Sharedsort *sharedsort2;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BTLeader;
/*
@@ -1408,8 +1406,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
Sharedsort *sharedsort2;
BTSpool *btspool = buildstate->spool;
BTLeader *btleader = palloc0_object(BTLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -1462,18 +1459,14 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1560,15 +1553,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1580,8 +1570,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btleader->sharedsort = sharedsort;
btleader->sharedsort2 = sharedsort2;
btleader->snapshot = snapshot;
- btleader->walusage = walusage;
- btleader->bufferusage = bufferusage;
+ btleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1620,7 +1609,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1754,8 +1743,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
#ifdef BTREE_BUILD_STATS
@@ -1837,11 +1825,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
sharedsort2, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index c330c891c03..b5fed54fb85 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -47,9 +47,8 @@
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 2
-#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 3
-#define PARALLEL_VACUUM_KEY_WAL_USAGE 4
-#define PARALLEL_VACUUM_KEY_INDEX_STATS 5
+#define PARALLEL_VACUUM_KEY_INSTRUMENTATION 3
+#define PARALLEL_VACUUM_KEY_INDEX_STATS 4
/*
* Shared information among parallel workers. So this is allocated in the DSM
@@ -188,11 +187,8 @@ struct ParallelVacuumState
/* Shared dead items space among parallel vacuum workers */
TidStore *dead_items;
- /* Points to buffer usage area in DSM */
- BufferUsage *buffer_usage;
-
- /* Points to WAL usage area in DSM */
- WalUsage *wal_usage;
+ /* Points to instrumentation area in DSM */
+ Instrumentation *instr;
/*
* False if the index is totally unsuitable target for all parallel
@@ -250,8 +246,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PVShared *shared;
TidStore *dead_items;
PVIndStats *indstats;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *instr;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
@@ -304,18 +299,15 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage and WalUsage --
- * PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
+ * Estimate space for Instrumentation --
+ * PARALLEL_VACUUM_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
@@ -396,17 +388,13 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
pvs->shared = shared;
/*
- * Allocate space for each worker's BufferUsage and WalUsage; no need to
- * initialize
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
*/
- buffer_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, buffer_usage);
- pvs->buffer_usage = buffer_usage;
- wal_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
- pvs->wal_usage = wal_usage;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, instr);
+ pvs->instr = instr;
/* Store query string for workers */
if (debug_query_string)
@@ -749,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->instr[i]);
}
/*
@@ -1007,8 +995,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVShared *shared;
TidStore *dead_items;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *worker_instr;
int nindexes;
char *sharedquery;
ErrorContextCallback errcallback;
@@ -1102,11 +1089,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
parallel_vacuum_process_safe_indexes(&pvs);
/* Report buffer/WAL usage during parallel execution */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 0b18a05c434..7a390350564 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -60,13 +60,12 @@
#define PARALLEL_KEY_EXECUTOR_FIXED UINT64CONST(0xE000000000000001)
#define PARALLEL_KEY_PLANNEDSTMT UINT64CONST(0xE000000000000002)
#define PARALLEL_KEY_PARAMLISTINFO UINT64CONST(0xE000000000000003)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xE000000000000004)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000004)
#define PARALLEL_KEY_TUPLE_QUEUE UINT64CONST(0xE000000000000005)
-#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000006)
+#define PARALLEL_KEY_NODE_INSTRUMENTATION UINT64CONST(0xE000000000000006)
#define PARALLEL_KEY_DSA UINT64CONST(0xE000000000000007)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -625,8 +624,6 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_data;
char *pstmt_space;
char *paramlistinfo_space;
- BufferUsage *bufusage_space;
- WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
@@ -690,21 +687,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage.
+ * Estimate space for Instrumentation.
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
* looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
-
- /*
- * Same thing for WalUsage.
- */
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Estimate space for tuple queues. */
@@ -790,17 +780,18 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMLISTINFO, paramlistinfo_space);
 SerializeParamList(estate->es_param_list_info, &paramlistinfo_space);
- /* Allocate space for each worker's BufferUsage; no need to initialize. */
- bufusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufusage_space);
- pei->buffer_usage = bufusage_space;
+ /*
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
+ */
+ {
+ Instrumentation *instr;
- /* Same for WalUsage. */
- walusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
- pei->wal_usage = walusage_space;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
+ pei->instrumentation = instr;
+ }
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -826,9 +817,9 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
InstrInitNode(&instrument[i], estate->es_instrument->instrument_options);
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_NODE_INSTRUMENTATION,
instrumentation);
- pei->instrumentation = instrumentation;
+ pei->node_instrumentation = instrumentation;
if (estate->es_jit_flags != PGJIT_NONE)
{
@@ -1230,7 +1221,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
* finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->instrumentation[i]);
pei->finished = true;
}
@@ -1244,10 +1235,10 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
- /* Accumulate instrumentation, if any. */
- if (pei->instrumentation)
+ /* Accumulate node instrumentation, if any. */
+ if (pei->node_instrumentation)
ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
+ pei->node_instrumentation);
/* Accumulate JIT instrumentation, if any. */
if (pei->jit_instrumentation)
@@ -1471,8 +1462,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
@@ -1487,7 +1476,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Set up DestReceiver, SharedExecutorInstrumentation, and QueryDesc. */
receiver = ExecParallelGetReceiver(seg, toc);
- instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
+ instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_NODE_INSTRUMENTATION, true);
if (instrumentation != NULL)
instrument_options = instrumentation->instrument_options;
jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
@@ -1545,11 +1534,12 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecutorFinish(queryDesc);
/* Report buffer/WAL usage during parallel execution. */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ {
+ Instrumentation *worker_instr;
+
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
+ }
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6892706a83a..09d5ffe8651 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -322,11 +322,12 @@ InstrStartParallelQuery(void)
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst)
{
InstrQueryStopFinalize(qinstr);
- memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
- memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+ dst->need_stack = qinstr->instr.need_stack;
+ memcpy(&dst->bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(&dst->walusage, &qinstr->instr.walusage, sizeof(WalUsage));
}
/*
@@ -342,12 +343,11 @@ InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUs
* activity is accumulated.
*/
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(Instrumentation *instr)
{
- BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
- WalUsageAdd(&instr_stack.current->walusage, walusage);
+ InstrAccumStack(instr_stack.current, instr);
- WalUsageAdd(&pgWalUsage, walusage);
+ WalUsageAdd(&pgWalUsage, &instr->walusage);
}
/* Node instrumentation handling */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..6c8b602d07f 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -25,9 +25,8 @@ typedef struct ParallelExecutorInfo
{
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
- BufferUsage *buffer_usage; /* points to bufusage area in DSM */
- WalUsage *wal_usage; /* walusage area in DSM */
- SharedExecutorInstrumentation *instrumentation; /* optional */
+ Instrumentation *instrumentation; /* instrumentation area in DSM */
+ SharedExecutorInstrumentation *node_instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
dsa_pointer param_exec; /* serialized PARAM_EXEC parameters */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index f49c3f99cf2..b30a15bc027 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -283,8 +283,8 @@ extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst);
+extern void InstrAccumParallelQuery(Instrumentation *instr);
extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
--
2.47.1
[application/octet-stream] v11-0007-instrumentation-Optimize-ExecProcNodeInstr-instr.patch (11.2K, 9-v11-0007-instrumentation-Optimize-ExecProcNodeInstr-instr.patch)
From 7b2e31cc1a11444ab20b045ddc4052a48f83602c Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 17:52:24 -0800
Subject: [PATCH v11 7/9] instrumentation: Optimize ExecProcNodeInstr
instructions by inlining
For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.
Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). Checking these
conditionals for each emitted tuple adds up, and causes the compiler to
generate a non-optimal set of instructions.
Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. For a stress
test with a large COUNT(*) that does many ExecProcNode calls, this reduces
the overhead of EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) from ~20% on
top of actual runtime to ~3%. With BUFFERS ON, the same query goes
from ~20% to ~10% on top of actual runtime.
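To illustrate the technique described above, here is a minimal standalone sketch, not the patch itself. All names (DemoNode, demo_proc_instr, demo_setup_proc, and so on) are hypothetical stand-ins, and the counters merely mark where timer and stack work would happen. The idea is that one static inline body takes the options as arguments, each thin wrapper passes compile-time constants, and the wrapper is chosen once at setup time, so a compiler that inlines the body can drop the untaken branches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the PostgreSQL structs. */
typedef struct DemoNode DemoNode;
typedef int (*DemoProcFn) (DemoNode *node);

struct DemoNode
{
	DemoProcFn	proc;			/* entry point installed at setup time */
	DemoProcFn	proc_real;		/* uninstrumented implementation */
	bool		need_timer;
	bool		need_stack;
	long		timer_calls;	/* marks where timer work would happen */
	long		stack_calls;	/* marks where stack work would happen */
	long		tuples;
};

/*
 * Shared body. The flags are arguments rather than struct fields, so each
 * wrapper below can pass constants and let the compiler fold the branches.
 */
static inline int
demo_proc_instr(DemoNode *node, bool need_timer, bool need_stack)
{
	int			result;

	assert(node->proc_real != NULL);

	if (need_stack)
		node->stack_calls++;
	if (need_timer)
		node->timer_calls++;

	result = node->proc_real(node);

	if (result != 0)
		node->tuples++;

	return result;
}

static int
demo_proc_full(DemoNode *node)
{
	return demo_proc_instr(node, true, true);
}

static int
demo_proc_stack_only(DemoNode *node)
{
	return demo_proc_instr(node, false, true);
}

static int
demo_proc_timer_only(DemoNode *node)
{
	return demo_proc_instr(node, true, false);
}

static int
demo_proc_rows_only(DemoNode *node)
{
	return demo_proc_instr(node, false, false);
}

/* Check the options once and return the matching specialized wrapper. */
static DemoProcFn
demo_setup_proc(const DemoNode *node)
{
	if (node->need_timer && node->need_stack)
		return demo_proc_full;
	else if (node->need_stack)
		return demo_proc_stack_only;
	else if (node->need_timer)
		return demo_proc_timer_only;
	else
		return demo_proc_rows_only;
}

/* Trivial "real" implementation that always returns one tuple. */
static int
demo_real(DemoNode *node)
{
	(void) node;
	return 1;
}
```

A caller would set proc_real and the two flags, assign node.proc = demo_setup_proc(&node) once, and then call node.proc(&node) per tuple without re-checking the flags. Whether the branches are actually eliminated depends on the compiler and optimization level.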
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxFP7i7-wy98ZmEJ11edYq-RrPvJoa4kzGhBBjERA4Nyw%40mail.gmail.com#e8dfd018a07d7f8d41565a079d40c564
---
src/backend/executor/execProcnode.c | 22 +---
src/backend/executor/instrument.c | 198 ++++++++++++++++++++--------
src/include/executor/instrument.h | 5 +
3 files changed, 148 insertions(+), 77 deletions(-)
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 21ad1b04a57..9f5698063f0 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,7 +121,6 @@
#include "nodes/nodeFuncs.h"
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
-static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
@@ -465,7 +464,7 @@ ExecProcNodeFirst(PlanState *node)
* have ExecProcNode() directly call the relevant function from now on.
*/
if (node->instrument)
- node->ExecProcNode = ExecProcNodeInstr;
+ node->ExecProcNode = InstrNodeSetupExecProcNode(node->instrument);
else
node->ExecProcNode = node->ExecProcNodeReal;
@@ -473,25 +472,6 @@ ExecProcNodeFirst(PlanState *node)
}
-/*
- * ExecProcNode wrapper that performs instrumentation calls. By keeping
- * this a separate function, we avoid overhead in the normal case where
- * no instrumentation is wanted.
- */
-static TupleTableSlot *
-ExecProcNodeInstr(PlanState *node)
-{
- TupleTableSlot *result;
-
- InstrStartNode(node->instrument);
-
- result = node->ExecProcNodeReal(node);
-
- InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
- return result;
-}
-
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 09d5ffe8651..4ea807e295f 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -59,29 +59,20 @@ InstrInitOptions(Instrumentation *instr, int instrument_options)
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-void
-InstrStart(Instrumentation *instr)
+static inline void
+InstrStartTimer(Instrumentation *instr)
{
- if (instr->need_timer)
- {
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStart called twice in a row");
- else
- INSTR_TIME_SET_CURRENT(instr->starttime);
- }
+ Assert(INSTR_TIME_IS_ZERO(instr->starttime));
- if (instr->need_stack)
- InstrPushStack(instr);
+ INSTR_TIME_SET_CURRENT(instr->starttime);
}
-static void
+static inline void
InstrStopTimer(Instrumentation *instr)
{
instr_time endtime;
- /* let's update the time only if the timer was requested */
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ Assert(!INSTR_TIME_IS_ZERO(instr->starttime));
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
@@ -89,6 +80,16 @@ InstrStopTimer(Instrumentation *instr)
INSTR_TIME_SET_ZERO(instr->starttime);
}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer)
+ InstrStartTimer(instr);
+
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
void
InstrStop(Instrumentation *instr)
{
@@ -372,65 +373,57 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options)
InstrInitOptions(&instr->instr, instrument_options);
}
-/* Entry to a plan node */
+/* Entry to a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
void
InstrStartNode(NodeInstrumentation *instr)
{
InstrStart(&instr->instr);
}
-
-/* Exit from a plan node */
-void
-InstrStopNode(NodeInstrumentation *instr, double nTuples)
+/*
+ * Updates the node instrumentation time counter.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+static inline void
+InstrStopNodeTimer(NodeInstrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
+ Assert(!INSTR_TIME_IS_ZERO(instr->instr.starttime));
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
/*
- * Update the time only if the timer was requested.
+ * Is this the first tuple of this cycle?
*
- * Note this is different from InstrStop because total is only updated in
- * InstrEndLoop. We need the separate counter variable because we need to
- * calculate start-up time for the first tuple in each cycle, and then
- * accumulate it together.
+ * In async mode, if the plan node hadn't emitted any tuples before, this
+ * might be the first tuple
*/
- if (instr->instr.need_timer)
- {
- if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrStopNode called without start");
-
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ if (!instr->running || (instr->async_mode && instr->tuplecount < 1.0))
+ instr->firsttuple = instr->counter;
+}
- INSTR_TIME_SET_ZERO(instr->instr.starttime);
- }
+/* Exit from a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ if (instr->instr.need_timer)
+ InstrStopNodeTimer(instr);
- /*
- * Only pop the stack, accumulation runs in
- * ExecFinalizeNodeInstrumentation
- */
+ /* Only pop the stack, accumulation runs in InstrFinalizeNode */
if (instr->instr.need_stack)
InstrPopStack(&instr->instr);
- /* Is this the first tuple of this cycle? */
- if (!instr->running)
- {
- instr->running = true;
- instr->firsttuple = instr->counter;
- }
- else
- {
- /*
- * In async mode, if the plan node hadn't emitted any tuples before,
- * this might be the first tuple
- */
- if (instr->async_mode && save_tuplecount < 1.0)
- instr->firsttuple = instr->counter;
- }
+ instr->running = true;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
}
/* Update tuple count */
@@ -495,6 +488,99 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
InstrAccumStack(&dst->instr, &add->instr);
}
+/*
+ * Specialized handling of instrumented ExecProcNode
+ *
+ * These functions are equivalent to running ExecProcNodeReal wrapped in
+ * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
+ * by checking the instrumentation options when the ExecProcNode pointer gets
+ * first set, and then using a special-purpose function for each. This results
+ * in a more optimized set of compiled instructions.
+ */
+
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
+
+/* Simplified pop: restore saved state instead of re-deriving from array */
+static inline void
+InstrPopStackTo(Instrumentation *prev)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.stack_size > 1 ? instr_stack.entries[instr_stack.stack_size - 2] == prev : &instr_top == prev);
+ instr_stack.stack_size--;
+ instr_stack.current = prev;
+}
+
+static inline TupleTableSlot *
+ExecProcNodeInstr(PlanState *node, bool need_timer, bool need_stack)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ if (need_stack)
+ InstrPushStack(&instr->instr);
+ if (need_timer)
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ if (need_timer)
+ InstrStopNodeTimer(instr);
+ if (need_stack)
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrFull(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsStackOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsTimerOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, false);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, false);
+}
+
+/*
+ * Returns an ExecProcNode wrapper that performs instrumentation calls,
+ * tailored to the instrumentation options enabled for the node.
+ */
+ExecProcNodeMtd
+InstrNodeSetupExecProcNode(NodeInstrumentation *instr)
+{
+ bool need_timer = instr->instr.need_timer;
+ bool need_stack = instr->instr.need_stack;
+
+ if (need_timer && need_stack)
+ return ExecProcNodeInstrFull;
+ else if (need_stack)
+ return ExecProcNodeInstrRowsStackOnly;
+ else if (need_timer)
+ return ExecProcNodeInstrRowsTimerOnly;
+ else
+ return ExecProcNodeInstrRowsOnly;
+}
+
/* Trigger instrumentation handling */
TriggerInstrumentation *
InstrAllocTrigger(QueryInstrumentation *qinstr, int n)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b30a15bc027..cad052a3a90 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -294,6 +294,11 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
+typedef struct TupleTableSlot TupleTableSlot;
+typedef struct PlanState PlanState;
+typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+extern ExecProcNodeMtd InstrNodeSetupExecProcNode(NodeInstrumentation *instr);
+
extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr, int n);
extern void InstrStartTrigger(QueryInstrumentation *qinstr,
TriggerInstrumentation *tginstr);
--
2.47.1
[application/octet-stream] v11-0008-Index-scans-Show-table-buffer-accesses-separatel.patch (22.2K, 10-v11-0008-Index-scans-Show-table-buffer-accesses-separatel.patch)
From a6edd51c369fc871d7c40509cb82caaad0eca2c3 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:19 -0800
Subject: [PATCH v11 8/9] Index scans: Show table buffer accesses separately in
EXPLAIN ANALYZE
This sets up a separate instrumentation stack that is used whilst an
Index Scan or Index Only Scan scans the table, for example because
additional data is needed.
EXPLAIN ANALYZE will now show "Table Buffers" that represent such activity.
The activity is also included in regular "Buffers" together with index
activity and that of any child nodes.
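The accounting described above can be sketched in a few lines of standalone C. This is a simplified model under stated assumptions, not the patch's implementation: the names (DemoUsage, demo_push, demo_pop, demo_finalize) are hypothetical, the "stack" is reduced to a single current pointer with the previous entry saved by the caller, and every page access is counted as a shared hit. While the table is being accessed, a separate entry receives the counts; at finalization that entry is added to the node's own entry, so the "Table Buffers" number is also part of "Buffers".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified usage counter; not PostgreSQL's BufferUsage. */
typedef struct DemoUsage
{
	long		shared_hit;
} DemoUsage;

static DemoUsage demo_top;
static DemoUsage *demo_current = &demo_top;

/* Make "entry" the target of subsequent counts; return the previous one. */
static inline DemoUsage *
demo_push(DemoUsage *entry)
{
	DemoUsage  *prev = demo_current;

	assert(entry != NULL);
	demo_current = entry;
	return prev;
}

/* Restore the entry that was current before the matching demo_push. */
static inline void
demo_pop(DemoUsage *prev)
{
	assert(prev != NULL);
	demo_current = prev;
}

/* Stand-in for a buffer access: charge it to whatever entry is current. */
static inline void
demo_buffer_hit(void)
{
	demo_current->shared_hit++;
}

/*
 * Model of an index scan that touches one index page and one table page
 * per tuple. Table accesses are charged to "table", the rest to "node".
 */
static void
demo_index_scan(DemoUsage *node, DemoUsage *table, int ntuples)
{
	DemoUsage  *outer = demo_push(node);

	for (int i = 0; i < ntuples; i++)
	{
		DemoUsage  *prev;

		demo_buffer_hit();		/* index page */

		prev = demo_push(table);
		demo_buffer_hit();		/* table page */
		demo_pop(prev);
	}

	demo_pop(outer);
}

/* Fold a child's counts into its parent once execution has finished. */
static void
demo_finalize(const DemoUsage *child, DemoUsage *parent)
{
	parent->shared_hit += child->shared_hit;
}
```

With three tuples, the table entry ends at 3 hits and the node entry at 3 before finalization; after demo_finalize(&table, &node) the node entry reports 6, which corresponds to the relationship between the "Table Buffers" and "Buffers" lines in the documentation examples in this patch.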
Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
doc/src/sgml/perform.sgml | 13 ++-
doc/src/sgml/ref/explain.sgml | 1 +
src/backend/commands/explain.c | 47 +++++++--
src/backend/executor/execProcnode.c | 53 ++++++++++
src/backend/executor/nodeBitmapIndexscan.c | 2 +-
src/backend/executor/nodeIndexonlyscan.c | 27 ++++-
src/backend/executor/nodeIndexscan.c | 113 ++++++++++++++++-----
src/include/executor/instrument_node.h | 5 +
8 files changed, 223 insertions(+), 38 deletions(-)
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 604e8578a8d..d28f4f22535 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -734,6 +734,7 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10)
Index Cond: (unique2 = t1.unique2)
Index Searches: 10
+ Table Buffers: shared hit=10
Buffers: shared hit=24 read=6
Planning:
Buffers: shared hit=15 dirtied=9
@@ -1005,7 +1006,8 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
Index Cond: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Index Recheck: 1
Index Searches: 1
- Buffers: shared hit=1
+ Table Buffers: shared hit=1
+ Buffers: shared hit=2
Planning Time: 0.039 ms
Execution Time: 0.098 ms
</screen>
@@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
then rejected by a recheck of the index condition. This happens because a
GiST index is <quote>lossy</quote> for polygon containment tests: it actually
returns the rows with polygons that overlap the target, and then we have
- to do the exact containment test on those rows.
+ to do the exact containment test on those rows. The <literal>Table Buffers</literal>
+ counts indicate how many buffer accesses were performed on the table rather than
+ the index. These counts are included in the <literal>Buffers</literal> counts.
</para>
<para>
@@ -1203,13 +1207,14 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
QUERY PLAN
-------------------------------------------------------------------&zwsp;------------------------------------------------------------
Limit (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2.00 loops=1)
- Buffers: shared hit=16
+ Buffers: shared hit=14
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2.00 loops=1)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Index Searches: 1
- Buffers: shared hit=16
+ Table Buffers: shared hit=11
+ Buffers: shared hit=14
Planning Time: 0.077 ms
Execution Time: 0.086 ms
</screen>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5b8b521802e..71070736acb 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -509,6 +509,7 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
-> Index Scan using test_pkey on test (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99.00 loops=1)
Index Cond: ((id > 100) AND (id < 200))
Index Searches: 1
+ Table Buffers: shared hit=1
Buffers: shared hit=4
Planning Time: 0.244 ms
Execution Time: 0.073 ms
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index dc5e63955bc..eef343a9d97 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -144,7 +144,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -610,7 +610,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
- show_buffer_usage(es, bufusage);
+ show_buffer_usage(es, bufusage, NULL);
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -1027,7 +1027,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
es->indent--;
}
}
@@ -1041,7 +1041,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -1969,6 +1969,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexScanState *) planstate)->iss_Instrument->table_instr.bufusage, "Table");
break;
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1986,6 +1989,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyFloat("Heap Fetches", NULL,
planstate->instrument->ntuples2, 0, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexOnlyScanState *) planstate)->ioss_Instrument->table_instr.bufusage, "Table");
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
@@ -2287,7 +2293,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->instr.bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage, NULL);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
@@ -2306,7 +2312,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->instr.bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage, NULL);
if (es->wal)
show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
@@ -4106,7 +4112,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
* Show buffer usage details. This better be sync with peek_buffer_usage.
*/
static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
{
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -4131,6 +4137,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "Buffers:");
if (has_shared)
@@ -4186,6 +4194,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared_timing || has_local_timing || has_temp_timing)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "I/O Timings:");
if (has_shared_timing)
@@ -4227,6 +4237,14 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
else
{
+ char *buffers_title = NULL;
+
+ if (title)
+ {
+ buffers_title = psprintf("%s Buffers", title);
+ ExplainOpenGroup(buffers_title, buffers_title, true, es);
+ }
+
ExplainPropertyInteger("Shared Hit Blocks", NULL,
usage->shared_blks_hit, es);
ExplainPropertyInteger("Shared Read Blocks", NULL,
@@ -4247,8 +4265,20 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
usage->temp_blks_read, es);
ExplainPropertyInteger("Temp Written Blocks", NULL,
usage->temp_blks_written, es);
+
+ if (buffers_title)
+ ExplainCloseGroup(buffers_title, buffers_title, true, es);
+
if (track_io_timing)
{
+ char *timings_title = NULL;
+
+ if (title)
+ {
+ timings_title = psprintf("%s I/O Timings", title);
+ ExplainOpenGroup(timings_title, timings_title, true, es);
+ }
+
ExplainPropertyFloat("Shared I/O Read Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
3, es);
@@ -4267,6 +4297,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
ExplainPropertyFloat("Temp I/O Write Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
3, es);
+
+ if (timings_title)
+ ExplainCloseGroup(timings_title, timings_title, true, es);
}
}
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9f5698063f0..71a897f2b84 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -418,6 +418,29 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
result->instrument = InstrAllocNode(estate->es_instrument,
result->async_capable);
+ /*
+ * IndexScan / IndexOnlyScan track table and index access separately.
+ *
+ * We intentionally don't collect timing for them (even if enabled), since
+ * we don't need it, and executor nodes call InstrPushStack /
+ * InstrPopStack (instead of the full InstrNode*) to reduce overhead.
+ */
+ if (estate->es_instrument && (estate->es_instrument->instrument_options & INSTRUMENT_BUFFERS) != 0)
+ {
+ if (IsA(result, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, result);
+
+ InstrInitOptions(&iss->iss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ }
+ else if (IsA(result, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, result);
+
+ InstrInitOptions(&ioss->ioss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ }
+ }
+
return result;
}
@@ -837,8 +860,24 @@ ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
return false;
if (node->instrument)
+ {
InstrQueryRememberChild(parent, &node->instrument->instr);
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ InstrQueryRememberChild(parent, &iss->iss_Instrument->table_instr);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ InstrQueryRememberChild(parent, &ioss->ioss_Instrument->table_instr);
+ }
+ }
+
return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
}
@@ -880,6 +919,20 @@ ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
if (!node->instrument)
return false;
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ InstrFinalizeChild(&iss->iss_Instrument->table_instr, &node->instrument->instr);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ InstrFinalizeChild(&ioss->ioss_Instrument->table_instr, &node->instrument->instr);
+ }
+
InstrFinalizeChild(&node->instrument->instr, parent);
return false;
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 70c55ee6d61..63e24a0bcd4 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -276,7 +276,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of bitmap index scans if requested */
if (estate->es_instrument)
- indexstate->biss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->biss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index 9eab81fd1c8..66b02788b3c 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -67,6 +67,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -83,6 +84,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->ioss_Instrument && node->ioss_Instrument->table_instr.need_stack)
+ table_instr = &node->ioss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -163,11 +167,22 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
&node->ioss_VMBuffer))
{
+ bool found;
+
/*
* Rats, we have to visit the heap to check visibility.
*/
InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ found = index_fetch_heap(scandesc, node->ioss_TableSlot);
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (!found)
continue; /* no visible tuple, try next index entry */
ExecClearTuple(node->ioss_TableSlot);
@@ -434,6 +449,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->ioss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->ioss_Instrument->table_instr);
}
/*
@@ -608,7 +624,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
/* Set up instrumentation of index-only scans if requested */
if (estate->es_instrument)
- indexstate->ioss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->ioss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -893,4 +909,11 @@ ExecIndexOnlyScanRetrieveInstrumentation(IndexOnlyScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->ioss_SharedInfo = palloc(size);
memcpy(node->ioss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->ioss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->ioss_Instrument->table_instr,
+ &node->ioss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 06143e94c5a..e66b6d6407b 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,7 +85,10 @@ IndexNext(IndexScanState *node)
ExprContext *econtext;
ScanDirection direction;
IndexScanDesc scandesc;
+ ItemPointer tid;
TupleTableSlot *slot;
+ bool found;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -102,6 +105,9 @@ IndexNext(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -130,8 +136,24 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (unlikely(!found))
+ continue;
+
CHECK_FOR_INTERRUPTS();
/*
@@ -179,6 +201,7 @@ IndexNextWithReorder(IndexScanState *node)
Datum *lastfetched_vals;
bool *lastfetched_nulls;
int cmp;
+ Instrumentation *table_instr = NULL;
estate = node->ss.ps.state;
@@ -198,6 +221,9 @@ IndexNextWithReorder(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -259,36 +285,67 @@ IndexNextWithReorder(IndexScanState *node)
}
/*
- * Fetch next tuple from the index.
+ * Fetch next valid tuple from the index.
*/
-next_indextuple:
- if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+ for (;;)
{
+ ItemPointer tid;
+ bool found;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scandesc, ForwardScanDirection);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ /*
+ * No more tuples from the index. But we still need to drain
+ * any remaining tuples from the queue before we're done.
+ */
+ node->iss_ReachedEnd = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scandesc->xs_heaptid));
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
/*
- * No more tuples from the index. But we still need to drain any
- * remaining tuples from the queue before we're done.
+ * If the index was lossy, we have to recheck the index quals and
+ * ORDER BY expressions using the fetched tuple.
*/
- node->iss_ReachedEnd = true;
- continue;
- }
-
- /*
- * If the index was lossy, we have to recheck the index quals and
- * ORDER BY expressions using the fetched tuple.
- */
- if (scandesc->xs_recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->indexqualorig, econtext))
+ if (found && scandesc->xs_recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- /* allow this loop to be cancellable */
- CHECK_FOR_INTERRUPTS();
- goto next_indextuple;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->indexqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ /* allow this loop to be cancellable */
+ CHECK_FOR_INTERRUPTS();
+ continue;
+ }
}
+
+ if (found)
+ break;
}
+ /* No more index entries, re-run to clear the reorder queue */
+ if (node->iss_ReachedEnd)
+ continue;
+
if (scandesc->xs_recheckorderby)
{
econtext->ecxt_scantuple = slot;
@@ -814,6 +871,7 @@ ExecEndIndexScan(IndexScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->iss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->iss_Instrument->table_instr);
}
/*
@@ -976,7 +1034,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of index scans if requested */
if (estate->es_instrument)
- indexstate->iss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->iss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -1826,4 +1884,11 @@ ExecIndexScanRetrieveInstrumentation(IndexScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->iss_SharedInfo = palloc(size);
memcpy(node->iss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->iss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->iss_Instrument->table_instr,
+ &node->iss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 8847d7f94fa..e8531b84efa 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -18,6 +18,8 @@
#ifndef INSTRUMENT_NODE_H
#define INSTRUMENT_NODE_H
+#include "executor/instrument.h"
+
/* ---------------------
* Instrumentation information for aggregate function execution
@@ -48,6 +50,9 @@ typedef struct IndexScanInstrumentation
{
/* Index search count (incremented with pgstat_count_index_scan call) */
uint64 nsearches;
+
+ /* Instrumentation utilized for tracking buffer usage during table access */
+ Instrumentation table_instr;
} IndexScanInstrumentation;
/*
--
2.47.1
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-04-04 09:43 UTC (permalink / raw)
To: Andres Freund <[email protected]>; PostgreSQL Hackers <[email protected]>; +Cc: Zsolt Parragi <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>; Heikki Linnakangas <[email protected]>
Hi,
Attached v12, rebased, otherwise no changes.
I realize the feature freeze is getting close, and whilst I'd love to see
this go in, I'm also realistic, so I'll just do my best to support
review on the off chance we can make it for this release.
On that note, I think 0001 and 0002 are independently useful
refactorings that split the different kinds of instrumentation. They
should be ready to go, and I don't think they should conflict much with
other patches in this commitfest.
Thanks,
Lukas
--
Lukas Fittl
Attachments:
[application/x-patch] v12-0002-instrumentation-Separate-per-node-logic-from-oth.patch (27.1K, 2-v12-0002-instrumentation-Separate-per-node-logic-from-oth.patch)
From cd12175c8c11b4e8709c73b64a901d5a5d4ea418 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 01:19:50 -0800
Subject: [PATCH v12 2/9] instrumentation: Separate per-node logic from other
uses
Previously, different places (e.g. query "total time") repurposed
the Instrumentation struct, which was initially introduced for capturing
per-node statistics during execution. This overuse of the same struct is
confusing, e.g. it clutters unrelated code paths with calls to
InstrStartNode/InstrStopNode, and it prevents future refactorings.
Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.
Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.
In passing, drop the "n" argument to InstrAlloc, as all remaining callers
need exactly one Instrumentation struct.
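As an illustration of the split described above, here is a simplified,
standalone C sketch (not code from this patch). It assumes a hand-advanced
fake clock and leaves out the buffer/WAL fields; the lower-case names
(instr_start, node_stop, node_end_loop, fake_clock) are hypothetical
stand-ins for InstrStart, InstrStopNode and friends. It shows the
general-purpose struct accumulating straight into total, while the per-node
wrapper accumulates into counter and only folds that into total at the end
of a loop.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical clock, advanced by hand so the sketch is deterministic. */
static double fake_clock = 1.0;

/* General-purpose part: a timer only; buffer/WAL usage is omitted here. */
typedef struct Instrumentation
{
    bool   need_timer;
    double starttime;   /* 0 means "not currently running" */
    double total;       /* accumulated runtime */
} Instrumentation;

/* Per-node part embeds the general-purpose struct. */
typedef struct NodeInstrumentation
{
    Instrumentation instr;
    bool   running;     /* true once the node was stopped in this cycle */
    double counter;     /* runtime accumulated in the current cycle */
    double tuplecount;  /* tuples emitted in the current cycle */
    double ntuples;     /* tuples across all completed cycles */
    double nloops;      /* number of completed cycles */
} NodeInstrumentation;

static void
instr_start(Instrumentation *instr)
{
    if (instr->need_timer)
    {
        /* guards against start being called twice in a row */
        assert(instr->starttime == 0);
        instr->starttime = fake_clock;
    }
}

static void
instr_stop(Instrumentation *instr)
{
    if (instr->need_timer)
    {
        /* guards against stop being called without a start */
        assert(instr->starttime != 0);
        instr->total += fake_clock - instr->starttime;
        instr->starttime = 0;
    }
}

static void
node_start(NodeInstrumentation *node)
{
    instr_start(&node->instr);
}

static void
node_stop(NodeInstrumentation *node, double ntuples)
{
    node->tuplecount += ntuples;
    if (node->instr.need_timer)
    {
        assert(node->instr.starttime != 0);
        /* unlike instr_stop, time goes to the per-cycle counter */
        node->counter += fake_clock - node->instr.starttime;
        node->instr.starttime = 0;
    }
    node->running = true;
}

static void
node_end_loop(NodeInstrumentation *node)
{
    if (!node->running)
        return;                 /* nothing happened in this cycle */
    node->instr.total += node->counter;
    node->ntuples += node->tuplecount;
    node->nloops += 1;
    /* reset per-cycle state for the next cycle, if any */
    node->running = false;
    node->counter = 0;
    node->tuplecount = 0;
}
```

With this shape, a caller such as the query "total time" tracking only
needs the start/stop pair and can read total right after the stop, with no
end-of-loop step.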
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
contrib/auto_explain/auto_explain.c | 8 +-
.../pg_stat_statements/pg_stat_statements.c | 8 +-
contrib/postgres_fdw/postgres_fdw.c | 2 +-
src/backend/commands/explain.c | 20 +--
src/backend/executor/execMain.c | 8 +-
src/backend/executor/execParallel.c | 24 +--
src/backend/executor/execProcnode.c | 4 +-
src/backend/executor/instrument.c | 142 ++++++++++++------
src/include/executor/instrument.h | 60 +++++---
src/include/nodes/execnodes.h | 9 +-
src/tools/pgindent/typedefs.list | 3 +-
11 files changed, 174 insertions(+), 114 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index e856cd35a6f..39bf2543b70 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 5494d41dca1..fbf32f0e72c 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1025,7 +1025,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -1084,12 +1084,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
if (queryId != INT64CONST(0) && queryDesc->totaltime &&
pgss_enabled(nesting_level))
{
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
pgss_store(queryDesc->sourceText,
queryId,
queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 41e47cc795b..cc8ec24c30e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2779,7 +2779,7 @@ postgresIterateDirectModify(ForeignScanState *node)
if (!resultRelInfo->ri_projectReturning)
{
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- Instrumentation *instr = node->ss.ps.instrument;
+ NodeInstrumentation *instr = node->ss.ps.instrument;
Assert(!dmstate->has_returning);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index eb6ef23c2d6..e73dc129132 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1837,7 +1837,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
double nloops = planstate->instrument->nloops;
double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
- double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
+ double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->instr.total) / nloops;
double rows = planstate->instrument->ntuples / nloops;
if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1890,11 +1890,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* prepare per-worker general execution details */
if (es->workers_state && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
double startup_ms;
double total_ms;
@@ -1903,7 +1903,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (nloops <= 0)
continue;
startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
- total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
+ total_ms = INSTR_TIME_GET_MILLISEC(instrument->instr.total) / nloops;
rows = instrument->ntuples / nloops;
ExplainOpenWorker(n, es);
@@ -2290,18 +2290,18 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage);
if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ show_wal_usage(es, &planstate->instrument->instr.walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
if (nloops <= 0)
@@ -2309,9 +2309,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage);
if (es->wal)
- show_wal_usage(es, &instrument->walusage);
+ show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0237d8c3b1d..b0f636bf8b6 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -333,7 +333,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/*
* extract information from the query descriptor and the query feature.
@@ -385,7 +385,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -435,7 +435,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -445,7 +445,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
AfterTriggerEndQuery(estate);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 755191b51ef..78f60c1530c 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,7 +87,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object. This will depend on the length of
+ * of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -104,11 +104,15 @@ struct SharedExecutorInstrumentation
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
- /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+ /*
+ * array of num_plan_nodes * num_workers NodeInstrumentation objects
+ * follows
+ */
};
#define GetInstrumentationArray(sei) \
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
- (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+ (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
@@ -731,7 +735,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len = MAXALIGN(instrumentation_len);
instrument_offset = instrumentation_len;
instrumentation_len +=
- mul_size(sizeof(Instrumentation),
+ mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -817,7 +821,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*/
if (estate->es_instrument)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -827,7 +831,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInit(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1059,7 +1063,7 @@ static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
int n;
int ibytes;
@@ -1087,9 +1091,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* Switch into per-query memory context.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
planstate->worker_instrument =
- palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+ palloc(ibytes + offsetof(WorkerNodeInstrumentation, instrument));
MemoryContextSwitchTo(oldcontext);
planstate->worker_instrument->num_workers = instrumentation->num_workers;
@@ -1319,7 +1323,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
{
int i;
int plan_node_id = planstate->plan->plan_node_id;
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
InstrEndLoop(planstate->instrument);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d35976925ae..132fe37ef60 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -414,8 +414,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument,
- result->async_capable);
+ result->instrument = InstrAllocNode(estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 9354ad7be12..bc551f95a08 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,51 +26,30 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int instrument_options)
{
- Instrumentation *instr;
-
- /* initialize all fields to zeroes, then modify as needed */
- instr = palloc0(n * sizeof(Instrumentation));
- if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
- {
- bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
- bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
- int i;
-
- for (i = 0; i < n; i++)
- {
- instr[i].need_bufusage = need_buffers;
- instr[i].need_walusage = need_wal;
- instr[i].need_timer = need_timer;
- instr[i].async_mode = async_mode;
- }
- }
-
+ Instrumentation *instr = palloc0(sizeof(Instrumentation));
+ InstrInitOptions(instr, instrument_options);
return instr;
}
-/* Initialize a pre-allocated instrumentation structure. */
void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- memset(instr, 0, sizeof(Instrumentation));
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-/* Entry to a plan node */
void
-InstrStartNode(Instrumentation *instr)
+InstrStart(Instrumentation *instr)
{
if (instr->need_timer)
{
if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ elog(ERROR, "InstrStart called twice in a row");
else
INSTR_TIME_SET_CURRENT(instr->starttime);
}
@@ -83,24 +62,19 @@ InstrStartNode(Instrumentation *instr)
instr->walusage_start = pgWalUsage;
}
-/* Exit from a plan node */
void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
-
/* let's update the time only if the timer was requested */
if (instr->need_timer)
{
if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStopNode called without start");
+ elog(ERROR, "InstrStop called without start");
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
}
@@ -113,6 +87,74 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (instr->need_walusage)
WalUsageAccumDiff(&instr->walusage,
&pgWalUsage, &instr->walusage_start);
+}
+
+/* Node instrumentation handling */
+
+/* Allocate new node instrumentation structure */
+NodeInstrumentation *
+InstrAllocNode(int instrument_options, bool async_mode)
+{
+ NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+
+ InstrInitNode(instr, instrument_options);
+ instr->async_mode = async_mode;
+
+ return instr;
+}
+
+/* Initialize a pre-allocated instrumentation structure. */
+void
+InstrInitNode(NodeInstrumentation *instr, int instrument_options)
+{
+ memset(instr, 0, sizeof(NodeInstrumentation));
+ InstrInitOptions(&instr->instr, instrument_options);
+}
+
+/* Entry to a plan node */
+void
+InstrStartNode(NodeInstrumentation *instr)
+{
+ InstrStart(&instr->instr);
+}
+
+/* Exit from a plan node */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ double save_tuplecount = instr->tuplecount;
+ instr_time endtime;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
+
+ /*
+ * Update the time only if the timer was requested.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+ if (instr->instr.need_timer)
+ {
+ if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
+ elog(ERROR, "InstrStopNode called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
+ }
+
+ /* Add delta of buffer usage since entry to node's totals */
+ if (instr->instr.need_bufusage)
+ BufferUsageAccumDiff(&instr->instr.bufusage,
+ &pgBufferUsage, &instr->instr.bufusage_start);
+
+ if (instr->instr.need_walusage)
+ WalUsageAccumDiff(&instr->instr.walusage,
+ &pgWalUsage, &instr->instr.walusage_start);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -133,7 +175,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
/* Update tuple count */
void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
{
/* count the returned tuples */
instr->tuplecount += nTuples;
@@ -141,24 +183,24 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation *instr)
{
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
elog(ERROR, "InstrEndLoop called on running node");
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
- INSTR_TIME_ADD(instr->total, instr->counter);
+ INSTR_TIME_ADD(instr->instr.total, instr->counter);
instr->ntuples += instr->tuplecount;
instr->nloops += 1;
/* Reset for next cycle (if any) */
instr->running = false;
- INSTR_TIME_SET_ZERO(instr->starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
INSTR_TIME_SET_ZERO(instr->counter);
INSTR_TIME_SET_ZERO(instr->firsttuple);
instr->tuplecount = 0;
@@ -166,7 +208,7 @@ InstrEndLoop(Instrumentation *instr)
/* aggregate instrumentation information */
void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
{
if (!dst->running && add->running)
{
@@ -181,7 +223,7 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->tuplecount += add->tuplecount;
INSTR_TIME_ADD(dst->startup, add->startup);
- INSTR_TIME_ADD(dst->total, add->total);
+ INSTR_TIME_ADD(dst->instr.total, add->instr.total);
dst->ntuples += add->ntuples;
dst->ntuples2 += add->ntuples2;
dst->nloops += add->nloops;
@@ -189,11 +231,11 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->need_bufusage)
- BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ if (dst->instr.need_bufusage)
+ BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
- if (dst->need_walusage)
- WalUsageAdd(&dst->walusage, &add->walusage);
+ if (dst->instr.need_walusage)
+ WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
}
/* Trigger instrumentation handling */
@@ -204,7 +246,7 @@ InstrAllocTrigger(int n, int instrument_options)
int i;
for (i = 0; i < n; i++)
- InstrInit(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, instrument_options);
return tginstr;
}
@@ -212,13 +254,13 @@ InstrAllocTrigger(int n, int instrument_options)
void
InstrStartTrigger(TriggerInstrumentation *tginstr)
{
- InstrStartNode(&tginstr->instr);
+ InstrStart(&tginstr->instr);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
- InstrStopNode(&tginstr->instr, 0);
+ InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a9c2233227f..b11d64633b5 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -67,38 +67,55 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/*
+ * General purpose instrumentation that can capture time and WAL/buffer usage
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured in between).
+ */
typedef struct Instrumentation
{
- /* Parameters set at node creation: */
+ /* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
bool need_bufusage; /* true if we need buffer usage data */
bool need_walusage; /* true if we need WAL usage data */
+ /* Internal state keeping: */
+ instr_time starttime; /* start time of last InstrStart */
+ BufferUsage bufusage_start; /* buffer usage at start */
+ WalUsage walusage_start; /* WAL usage at start */
+ /* Accumulated statistics: */
+ instr_time total; /* total runtime */
+ BufferUsage bufusage; /* total buffer usage */
+ WalUsage walusage; /* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
+{
+ Instrumentation instr;
+ /* Parameters set at node creation: */
bool async_mode; /* true if node is in async mode */
/* Info about current plan cycle: */
bool running; /* true if we've completed first tuple */
- instr_time starttime; /* start time of current iteration of node */
instr_time counter; /* accumulated runtime for this node */
instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
instr_time startup; /* total startup time */
- instr_time total; /* total time */
double ntuples; /* total tuples produced */
double ntuples2; /* secondary node-specific tuple counter */
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
-typedef struct WorkerInstrumentation
+typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
- Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
-} WorkerInstrumentation;
+ NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+} WorkerNodeInstrumentation;
typedef struct TriggerInstrumentation
{
@@ -110,14 +127,19 @@ typedef struct TriggerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
- bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int instrument_options);
+extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr);
+
+extern NodeInstrumentation *InstrAllocNode(int instrument_options,
+ bool async_mode);
+extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation *instr);
+extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation *instr);
+extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 908898aa7c9..3ecae7552fc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -60,6 +60,7 @@ typedef struct ScanKeyData ScanKeyData;
typedef struct SnapshotData *Snapshot;
typedef struct SortSupportData *SortSupport;
typedef struct TIDBitmap TIDBitmap;
+typedef struct NodeInstrumentation NodeInstrumentation;
typedef struct TriggerInstrumentation TriggerInstrumentation;
typedef struct TupleConversionMap TupleConversionMap;
typedef struct TupleDescData *TupleDesc;
@@ -68,7 +69,7 @@ typedef struct Tuplestorestate Tuplestorestate;
typedef struct TupleTableSlot TupleTableSlot;
typedef struct TupleTableSlotOps TupleTableSlotOps;
typedef struct WalUsage WalUsage;
-typedef struct WorkerInstrumentation WorkerInstrumentation;
+typedef struct WorkerNodeInstrumentation WorkerNodeInstrumentation;
/* ----------------
@@ -1207,8 +1208,10 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
- Instrumentation *instrument; /* Optional runtime stats for this node */
- WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
+ NodeInstrumentation *instrument; /* Optional runtime stats for this
+ * node */
+ WorkerNodeInstrumentation *worker_instrument; /* per-worker
+ * instrumentation */
/* Per-worker JIT instrumentation */
struct SharedJitInstrumentation *worker_jit_instrument;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7ddf970fb97..449acca8dc1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1822,6 +1822,7 @@ NextSampleBlock_function
NextSampleTuple_function
NextValueExpr
Node
+NodeInstrumentation
NodeTag
NonEmptyRange
NoneCompressorState
@@ -3436,9 +3437,9 @@ WorkTableScan
WorkTableScanState
WorkerInfo
WorkerInfoData
-WorkerInstrumentation
WorkerJobDumpPtrType
WorkerJobRestorePtrType
+WorkerNodeInstrumentation
Working_State
WriteBufPtrType
WriteBytePtrType
--
2.47.1
[application/x-patch] v12-0005-Optimize-measuring-WAL-buffer-usage-through-stac.patch (96.5K, 3-v12-0005-Optimize-measuring-WAL-buffer-usage-through-stac.patch)
From 90a7ed18f14c09c8a1299db3a015747fc6b6761c Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v12 5/9] Optimize measuring WAL/buffer usage through
stack-based instrumentation
Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that get
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode and popped at InstrStopNode. Stack entries are
zero-initialized (avoiding the diff mechanism) and get added to their
parent entry when they are finalized, i.e. once no more modifications
can occur.
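As a rough illustration of the mechanism described above, here is a minimal
standalone sketch in plain C. It is not the code from this patch: the names
UsageEntry, usage_current, usage_push, usage_pop, usage_finalize and
count_block_hit are hypothetical, and the real structs carry many more
counters. It only shows the idea that activity is written into the current
stack entry, and that a child's totals are added to its parent once, at
finalization, instead of diffing global counters on every node exit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a per-node usage entry. */
typedef struct UsageEntry
{
	int64_t		blks_hit;		/* zero-initialized, so no diff is needed */
	struct UsageEntry *parent;	/* entry that was current when we were pushed */
} UsageEntry;

/* Hypothetical global pointing at the entry that receives new activity. */
static UsageEntry *usage_current = NULL;

/* Enter a section: make "entry" the target for subsequent activity. */
static void
usage_push(UsageEntry *entry)
{
	entry->parent = usage_current;
	usage_current = entry;
}

/* Leave a section: restore whatever entry was current before the push. */
static void
usage_pop(UsageEntry *entry)
{
	assert(usage_current == entry);
	usage_current = entry->parent;
}

/* Called where the activity happens, e.g. on a shared buffer hit. */
static void
count_block_hit(void)
{
	if (usage_current != NULL)
		usage_current->blks_hit++;
}

/* Once no more activity can occur in "child", fold it into "parent". */
static void
usage_finalize(UsageEntry *child, UsageEntry *parent)
{
	parent->blks_hit += child->blks_hit;
}
```

In this sketch, pushing and popping only swap a pointer, so repeated node
entry and exit stays cheap; the addition into the parent happens once per
entry rather than once per cycle.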
To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct and its
associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.
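The second option above (restoring the stack from a PG_FINALLY block) can be
sketched outside the backend with standard setjmp/longjmp, which is what
PG_TRY is built on. Everything named here (StackEntry, stack_current,
error_handler, raise_error, run_instrumented, work_ok, work_fail) is
hypothetical and exists only for this illustration. Unlike PG_FINALLY, which
re-throws the error after running the cleanup block, this sketch swallows the
error and returns a status code so that it can run standalone.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical, simplified stack entry. */
typedef struct StackEntry
{
	struct StackEntry *parent;
	long		hits;
} StackEntry;

static StackEntry *stack_current = NULL;	/* entry receiving activity */
static jmp_buf *error_handler = NULL;		/* innermost active handler */

typedef void (*work_fn) (void);

/* Stand-in for an ERROR: jump to the innermost handler. */
static void
raise_error(void)
{
	assert(error_handler != NULL);
	longjmp(*error_handler, 1);
}

/*
 * Run "work" with "entry" pushed on the stack.  Returns 1 if work completed
 * and 0 if it raised an error; either way the stack and the handler chain
 * are restored before returning.
 */
static int
run_instrumented(StackEntry *entry, work_fn work)
{
	jmp_buf		local;
	jmp_buf    *saved_handler = error_handler;
	StackEntry *saved_stack = stack_current;
	volatile int ok = 1;

	entry->parent = stack_current;
	stack_current = entry;

	if (setjmp(local) == 0)
	{
		error_handler = &local;
		work();
	}
	else
		ok = 0;					/* arrived here via raise_error() */

	/* "finally" part: runs on both the normal and the error path */
	error_handler = saved_handler;
	stack_current = saved_stack;

	return ok;
}

/* Example workloads: one succeeds, one fails after recording activity. */
static void
work_ok(void)
{
	stack_current->hits++;
}

static void
work_fail(void)
{
	stack_current->hits++;
	raise_error();
}
```

The point of the sketch is only that the current-entry pointer is restored on
the error path as well; without that, later activity would be charged to an
entry whose section has already ended.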
This also drops the global pgBufferUsage; any callers interested in
measuring buffer activity should instead utilize InstrStart/InstrStop.
The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and in heap_page_prune_and_freeze to
measure FPIs.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
contrib/auto_explain/auto_explain.c | 16 +-
.../pg_stat_statements/pg_stat_statements.c | 87 +---
src/backend/access/brin/brin.c | 10 +-
src/backend/access/gin/gininsert.c | 10 +-
src/backend/access/heap/vacuumlazy.c | 15 +-
src/backend/access/nbtree/nbtsort.c | 10 +-
src/backend/commands/analyze.c | 31 +-
src/backend/commands/explain.c | 43 +-
src/backend/commands/explain_dr.c | 57 ++-
src/backend/commands/prepare.c | 27 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/trigger.c | 17 +-
src/backend/commands/vacuumparallel.c | 10 +-
src/backend/executor/README.instrument | 227 +++++++++
src/backend/executor/execMain.c | 83 +++-
src/backend/executor/execParallel.c | 32 +-
src/backend/executor/execPartition.c | 2 +-
src/backend/executor/execProcnode.c | 84 +++-
src/backend/executor/execUtils.c | 11 +-
src/backend/executor/instrument.c | 448 +++++++++++++-----
src/backend/replication/logical/worker.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 6 +-
src/backend/utils/activity/pgstat_io.c | 6 +-
src/include/commands/explain_dr.h | 5 +-
src/include/executor/execdesc.h | 4 +-
src/include/executor/executor.h | 5 +-
src/include/executor/instrument.h | 198 +++++++-
src/include/nodes/execnodes.h | 3 +-
src/include/utils/resowner.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
30 files changed, 1097 insertions(+), 357 deletions(-)
create mode 100644 src/backend/executor/README.instrument
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 39bf2543b70..4be81489ff4 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -305,19 +305,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
if (auto_explain_enabled())
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -382,7 +372,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
/* Log plan if duration is exceeded. */
- msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
+ msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total);
if (msec >= auto_explain_log_min_duration)
{
ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index fbf32f0e72c..78f1518c940 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -911,22 +911,11 @@ pgss_planner(Query *parse,
&& pgss_track_planning && query_string
&& parse->queryId != INT64CONST(0))
{
- instr_time start;
- instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
-
- /*
- * Similarly the planner could write some WAL records in some cases
- * (e.g. setting a hint bit with those being WAL-logged)
- */
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ /* Track time and buffer/WAL usage as the planner can access them. */
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -940,30 +929,20 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
parse->stmt_len,
PGSS_PLAN,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
0,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
@@ -1015,19 +994,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
*/
if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -1089,10 +1058,10 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
queryDesc->plannedstmt->stmt_location,
queryDesc->plannedstmt->stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
+ INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total),
queryDesc->estate->es_total_processed,
- &queryDesc->totaltime->bufusage,
- &queryDesc->totaltime->walusage,
+ &queryDesc->totaltime->instr.bufusage,
+ &queryDesc->totaltime->instr.walusage,
queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
NULL,
queryDesc->estate->es_parallel_workers_to_launch,
@@ -1156,17 +1125,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
!IsA(parsetree, ExecuteStmt) &&
!IsA(parsetree, PrepareStmt))
{
- instr_time start;
- instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -1182,6 +1145,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
@@ -1196,9 +1160,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
* former value, which'd otherwise be a good idea.
*/
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
/*
* Track the total number of rows retrieved or affected by the utility
* statements of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED
@@ -1210,23 +1171,15 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
saved_stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
rows,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index bdb30752e09..3a5176c76c7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2887,6 +2887,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2936,7 +2937,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2951,7 +2952,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 9d83a495775..0d80f72a0b0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -991,8 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2118,6 +2118,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2186,7 +2187,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2201,7 +2202,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 88c71cd85b6..291d9d67bc2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -637,8 +637,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ QueryInstrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -654,6 +653,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -984,14 +985,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrQueryStopFinalize(instr);
+
if (verbose || params->log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params->log_vacuum_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -1000,12 +1001,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
+ BufferUsage bufferusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
total_blks_hit = bufferusage.shared_blks_hit +
bufferusage.local_blks_hit;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 756dfa3dcf4..2d7b7cef912 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1466,8 +1466,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -1753,6 +1753,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1828,7 +1829,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1838,7 +1839,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 49a5cdf579c..10f8a2dc81c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -309,9 +309,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ QueryInstrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -362,6 +360,9 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -742,12 +743,13 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrQueryStopFinalize(instr);
+
if (verbose || params->log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params->log_analyze_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -755,18 +757,15 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
-
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ BufferUsage bufusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
+
+ total_blks_hit = bufusage.shared_blks_hit +
+ bufusage.local_blks_hit;
+ total_blks_read = bufusage.shared_blks_read +
+ bufusage.local_blks_read;
+ total_blks_dirtied = bufusage.shared_blks_dirtied +
+ bufusage.local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e73dc129132..dc5e63955bc 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -324,14 +324,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ QueryInstrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instr = InstrQueryAlloc(INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+ else
+ instr = InstrQueryAlloc(INSTRUMENT_TIMER);
+
if (es->memory)
{
/*
@@ -348,15 +350,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrQueryStart(instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrQueryStopFinalize(instr);
if (es->memory)
{
@@ -364,16 +363,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
@@ -590,7 +582,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* grab serialization metrics before we destroy the DestReceiver */
if (es->serialize != EXPLAIN_SERIALIZE_NONE)
- serializeMetrics = GetSerializationMetrics(dest);
+ {
+ SerializeMetrics *metrics = GetSerializationMetrics(dest);
+
+ if (metrics)
+ memcpy(&serializeMetrics, metrics, sizeof(SerializeMetrics));
+ }
/* call the DestReceiver's destroy method even during explain */
dest->rDestroy(dest);
@@ -1019,7 +1016,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
ExplainIndentText(es);
if (es->timing)
appendStringInfo(es->str, "Serialization: time=%.3f ms output=" UINT64_FORMAT "kB format=%s\n",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
else
@@ -1027,10 +1024,10 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
- if (es->buffers && peek_buffer_usage(es, &metrics->bufferUsage))
+ if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
es->indent--;
}
}
@@ -1038,13 +1035,13 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
{
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
3, es);
ExplainPropertyUInteger("Output Volume", "kB",
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 3c96061cf32..e1fc723c758 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,11 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ Instrumentation *instr = &myState->metrics.instr;
/* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ if (instr->need_timer || instr->need_stack)
+ InstrStart(instr);
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +182,9 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
- /* Update timing data */
- if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
-
- /* Update buffer metrics */
- if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ /* Stop per-tuple measurement */
+ if (instr->need_timer || instr->need_stack)
+ InstrStop(instr);
return true;
}
@@ -233,9 +220,17 @@ serializeAnalyzeStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
/* The output buffer is re-used across rows, as in printtup.c */
initStringInfo(&receiver->buf);
- /* Initialize results counters */
+ /* Initialize metrics and per-tuple instrumentation */
memset(&receiver->metrics, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(receiver->metrics.timeSpent);
+ {
+ int instrument_options = 0;
+
+ if (receiver->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (receiver->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ InstrInitOptions(&receiver->metrics.instr, instrument_options);
+ }
}
/*
@@ -246,6 +241,8 @@ serializeAnalyzeShutdown(DestReceiver *self)
{
SerializeDestReceiver *receiver = (SerializeDestReceiver *) self;
+ InstrFinalizeChild(&receiver->metrics.instr, instr_stack.current);
+
if (receiver->finfos)
pfree(receiver->finfos);
receiver->finfos = NULL;
@@ -296,16 +293,18 @@ CreateExplainSerializeDestReceiver(ExplainState *es)
* receiver if the subject statement is CREATE TABLE AS. In that
* case, return all-zeroes stats.
*/
-SerializeMetrics
+/*
+ * GetSerializationMetrics - get serialization metrics
+ *
+ * Returns a pointer to the SerializeMetrics inside the dest receiver,
+ * or NULL if the receiver is not a SerializeDestReceiver (e.g. an IntoRel
+ * receiver for CREATE TABLE AS).
+ */
+SerializeMetrics *
GetSerializationMetrics(DestReceiver *dest)
{
- SerializeMetrics empty;
-
if (dest->mydest == DestExplainSerialize)
- return ((SerializeDestReceiver *) dest)->metrics;
-
- memset(&empty, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(empty.timeSpent);
+ return &((SerializeDestReceiver *) dest)->metrics;
- return empty;
+ return NULL;
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 876aad2100a..f7e158e4dd9 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -580,13 +580,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ QueryInstrumentation *instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ int instrument_options = INSTRUMENT_TIMER;
+
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -598,9 +601,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrQueryStart(instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -635,8 +636,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrQueryStopFinalize(instr);
if (es->memory)
{
@@ -644,13 +644,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -660,7 +653,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0ce2e81f9c2..f72c1ac521a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2139,7 +2139,7 @@ ExecuteTruncateGuts(List *explicit_rels,
rel,
0, /* dummy rangetable index */
NULL,
- 0);
+ NULL);
estate->es_opened_result_relations =
lappend(estate->es_opened_result_relations, resultRelInfo);
resultRelInfo++;
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 4d4e96a5302..b8b8840345b 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -93,6 +93,7 @@ static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
TriggerInstrumentation *instr,
+ QueryInstrumentation *qinstr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2312,6 +2313,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
TriggerInstrumentation *instr,
+ QueryInstrumentation *qinstr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2346,7 +2348,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartTrigger(instr + tgindx);
+ InstrStartTrigger(qinstr, instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2441,6 +2443,7 @@ ExecBSInsertTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -2502,6 +2505,7 @@ ExecBRInsertTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2606,6 +2610,7 @@ ExecIRInsertTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2670,6 +2675,7 @@ ExecBSDeleteTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -2780,6 +2786,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2884,6 +2891,7 @@ ExecIRDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (rettuple == NULL)
return false; /* Delete was suppressed */
@@ -2942,6 +2950,7 @@ ExecBSUpdateTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -3094,6 +3103,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
@@ -3258,6 +3268,7 @@ ExecIRUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -3316,6 +3327,7 @@ ExecBSTruncateTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -4383,7 +4395,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartTrigger(instr + tgindx);
+ InstrStartTrigger(estate->es_instrument, instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4571,6 +4583,7 @@ AfterTriggerExecute(EState *estate,
tgindx,
finfo,
NULL,
+ NULL,
per_tuple_context);
if (rettuple != NULL &&
rettuple != LocTriggerData.tg_trigtuple &&
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 77834b96a21..c330c891c03 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -308,8 +308,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgBufferUsage or
- * pgWalUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1006,6 +1006,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1095,7 +1096,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1103,7 +1104,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/README.instrument b/src/backend/executor/README.instrument
new file mode 100644
index 00000000000..580fd5d85e0
--- /dev/null
+++ b/src/backend/executor/README.instrument
@@ -0,0 +1,227 @@
+src/backend/executor/README.instrument
+
+Instrumentation
+===============
+
+The instrumentation subsystem measures time, buffer usage and WAL activity
+during query execution and other similar activities. It is used by
+EXPLAIN ANALYZE, pg_stat_statements, and other consumers that need
+activity and/or timing metrics over a section of code.
+
+The design has two central goals:
+
+* Make it cheap to measure activity in a section of code, even when
+ that section is called many times and the aggregate is what is used
+ (as is the case with per-node instrumentation in the executor)
+
+* Ensure nested instrumentation measures activity/timing accurately:
+ counter updates are written to the currently active instrumentation
+ and accumulated upward to parent nodes at finalization, including
+ when execution aborts due to an error.
+
+The key data structures are defined in src/include/executor/instrument.h
+and the implementation lives in src/backend/executor/instrument.c.
+
+
+Instrumentation Options
+-----------------------
+
+Callers specify what to measure with a bitmask of InstrumentOption flags:
+
+ INSTRUMENT_ROWS -- row counts only (used with NodeInstrumentation)
+ INSTRUMENT_TIMER -- wall-clock timing and row counts
+ INSTRUMENT_BUFFERS -- buffer hit/read/dirtied/written counts and I/O time
+ INSTRUMENT_WAL -- WAL records, FPI, bytes
+
+INSTRUMENT_BUFFERS and INSTRUMENT_WAL utilize the instrumentation stack
+(described below) for efficient handling of counter values.
+
+
+Struct Hierarchy
+----------------
+
+There are four instrumentation structs, each specialized for a different
+scope:
+
+Instrumentation Base struct. Holds timing and buffer/WAL counters.
+
+QueryInstrumentation Extends Instrumentation for query-level tracking. When
+ stack-based tracking is enabled, it owns a dedicated
+ MemoryContext and uses the ResourceOwner mechanism for
+ abort cleanup.
+
+NodeInstrumentation Extends Instrumentation for per-plan-node statistics
+ (startup time, tuple counts, loop counts, etc).
+
+TriggerInstrumentation Extends Instrumentation with a firing count.
+
+
+Stack-based instrumentation
+===========================
+
+For tracking WAL or buffer usage counters, the specialized stack-based
+instrumentation is used.
+
+At all times, there is a stack that tracks which Instrumentation is currently
+active. The stack is represented by instr_stack, a per-backend global
+that holds a dynamic array of Instrumentation pointers. The field
+instr_stack.current always points to the current stack entry that should
+be updated when activity occurs. When the stack array is empty,
+instr_stack.current points to instr_top.
+
+For example, if a backend has two portals open, the overall nesting of
+Instrumentation and their respective InstrStart/InstrStop calls creates a
+tree-like structure like this:
+
+ Session (instr_top)
+ |
+ +-- Query A (QueryInstrumentation)
+ | |
+ | +-- NestLoop (NodeInstrumentation)
+ | |
+ | +-- Seq Scan A (NodeInstrumentation)
+ | +-- Seq Scan B (NodeInstrumentation)
+ |
+ +-- Query B (QueryInstrumentation)
+ |
+ +-- Seq Scan C (NodeInstrumentation)
+
+While executing Seq Scan B, the stack looks like:
+
+ instr_top (implicit bottom, not in the entries array)
+ 0: Query A
+ 1: NestLoop
+ 2: Seq Scan B <-- instr_stack.current
+
+When no query is running, the stack is empty (stack_size == 0) and
+instr_stack.current points to instr_top.
+
+Any buffer or WAL counter update (via the INSTR_BUFUSAGE_* and
+INSTR_WALUSAGE_* macros in the buffer manager, WAL insertion code, etc.)
+writes directly into instr_stack.current. Each instrumentation node starts
+zeroed, so the values it accumulates while on top of the stack represent
+exactly the activity that occurred during that time.
+
+Every Instrumentation node has a target, or parent, it will be accumulated
+into, which is typically the Instrumentation that was the current stack
+entry when it was created.
+
+For example, when Seq Scan A gets finalized in regular execution via
+ExecutorFinish, its instrumentation data gets added to its immediate parent
+in the execution tree, the NestLoop, which in turn gets added to Query A's
+QueryInstrumentation, which then accumulates to its own parent.
+
+While we can typically think of this as a tree, the NodeInstrumentation
+underneath a particular QueryInstrumentation could behave differently --
+for example, it could propagate directly to the QueryInstrumentation, in
+order to not show cumulative numbers in EXPLAIN ANALYZE.
+
+Note these relationships are partially implicit, especially when it comes
+to NodeInstrumentation. Each QueryInstrumentation maintains a list of its
+unfinalized child nodes. The parent of a QueryInstrumentation itself is
+determined by the stack (see below): when a query is finalized or cleaned
+up on abort, its counters are accumulated to whatever entry is then current
+on the stack, which is typically instr_top.
+
+
+Finalization and Abort Safety
+=============================
+
+Finalization is the process of rolling up a node's buffer/WAL counters to
+its parent. In normal execution, nodes are pushed onto the stack when they
+start and popped when they stop; at finalization time their accumulated
+counters are added to the parent.
+
+Due to the use of longjmp for error handling, functions can exit abruptly
+without executing their normal cleanup code. On abort, two things need
+to happen:
+
+1. Reset the stack to the appropriate level. This ensures that we don't
+ later try to update counters on a freed stack entry. We also need to
+ ensure that the stack entry that was current before a particular
+ Instrumentation started is current again after it stops.
+
+2. Finalize all affected Instrumentation nodes, rolling up their counters
+ to the highest surviving Instrumentation, so that data is not lost.
+
+For example, if Seq Scan B aborts while the stack is:
+
+ instr_top (implicit bottom)
+ 0: Query A
+ 1: NestLoop
+ 2: Seq Scan B
+
+The abort handler for Query A accumulates all unfinalized children (Seq
+Scan A, Seq Scan B, NestLoop) directly into Query A's counters, then
+unwinds the stack and accumulates Query A's counters to instr_top.
+
+Note that on abort the children do not accumulate through each other (Seq
+Scan B -> NestLoop -> Query A); they all accumulate directly to their
+parent QueryInstrumentation. This means the order in which children are
+released does not matter -- important because ResourceOwner cleanup does
+not guarantee a particular release order. The per-node breakdown is lost,
+but the query-level total is what survives the abort.
+
+If multiple QueryInstrumentations are active on the stack (e.g. nested
+portals), each one's abort handler uses InstrStopFinalize to unwind to
+whichever entry is higher up, so they compose correctly regardless of
+release order.
+
+There are two mechanisms for achieving abort safety:
+
+Resource Owner (QueryInstrumentation)
+-------------------------------------
+
+QueryInstrumentation registers with the current ResourceOwner at start.
+On transaction abort, the resource owner system calls the release callback,
+which walks unfinalized child entries, accumulates their data, unwinds the
+stack, and destroys the dedicated memory context (freeing the
+QueryInstrumentation and all child allocations as a unit).
+
+This is the recommended approach when the instrumented code already has an
+appropriate resource owner (e.g. it runs inside a portal). The query
+executor uses this path.
+
+PG_FINALLY (base Instrumentation)
+----------------------------------
+
+When no suitable resource owner exists, or when the caller wants to inspect
+the instrumentation data even after an error, the base Instrumentation can
+be used with a PG_TRY/PG_FINALLY block that calls InstrStopFinalize().
+
+Both mechanisms add overhead, so neither is suitable for high-frequency
+instrumentation like per-node measurements in the executor. Instead,
+plan node and trigger children rely on their parent QueryInstrumentation
+for abort safety: they are allocated in the parent's memory context and
+registered in its unfinalized-entries list, so the parent's abort handler
+recovers their data automatically. In normal execution, children are
+finalized explicitly by the caller.
+
+Parallel Query
+--------------
+
+Parallel workers get their own QueryInstrumentation so they can measure
+buffer and WAL activity independently, then copy the totals into shared
+memory at shutdown. The leader accumulates these into its own stack.
+
+When per-node instrumentation is active, parallel workers skip per-node
+finalization at shutdown to avoid double-counting; the per-node data is
+aggregated separately through InstrAggNode().
+
+
+Memory Handling
+===============
+
+Instrumentation objects that use the stack must survive until finalization
+runs, including the abort case. To ensure this, QueryInstrumentation
+creates a dedicated "Instrumentation" MemoryContext (instr_cxt) as a child
+of TopMemoryContext. All child instrumentation (nodes, triggers) should be
+allocated in this context.
+
+On successful completion, instr_cxt is reparented to CurrentMemoryContext
+so its lifetime is tied to the caller's context. On abort, the
+ResourceOwner cleanup first resets the stack, then accumulates the
+instrumentation data to the now-current stack entry, and frees instr_cxt.
+
+When the stack is not needed (timer/rows only), Instrumentation allocations
+happen in CurrentMemoryContext instead of TopMemoryContext.
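
The stack behaviour described in README.instrument can be modelled in a few lines of standalone C. This is only a toy sketch and not code from the patch: every name prefixed with toy_ or Toy is hypothetical, the stack is a fixed-size array rather than a growable one, and a single counter stands in for the full buffer/WAL counter set. It shows the three ideas from the text: counter updates go to whatever entry is current, popping restores the previous entry, and finalization adds a child's counters to its parent.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for an Instrumentation entry. */
typedef struct ToyInstr
{
    long        shared_blks_hit;    /* activity charged while current */
} ToyInstr;

#define TOY_STACK_MAX 16

static ToyInstr toy_top;                        /* plays the role of instr_top */
static ToyInstr *toy_entries[TOY_STACK_MAX];    /* fixed size, for brevity */
static int  toy_stack_size = 0;
static ToyInstr *toy_current = &toy_top;        /* entry that receives updates */

/* Make 'instr' the current entry (what starting a node would do). */
static void
toy_push(ToyInstr *instr)
{
    assert(toy_stack_size < TOY_STACK_MAX);
    toy_entries[toy_stack_size++] = instr;
    toy_current = instr;
}

/* Restore whichever entry was current before 'instr' was pushed. */
static void
toy_pop(ToyInstr *instr)
{
    assert(toy_stack_size > 0 && toy_entries[toy_stack_size - 1] == instr);
    toy_stack_size--;
    toy_current = (toy_stack_size > 0) ? toy_entries[toy_stack_size - 1] : &toy_top;
}

/* A counter update writes directly into the current entry. */
static void
toy_count_hit(void)
{
    toy_current->shared_blks_hit++;
}

/* Finalization: roll a child's counters up into its parent. */
static void
toy_finalize(ToyInstr *parent, const ToyInstr *child)
{
    parent->shared_blks_hit += child->shared_blks_hit;
}

/* Query -> NestLoop -> Seq Scan, with 1 hit in NestLoop and 2 in the scan. */
static long
toy_demo(void)
{
    ToyInstr    query = {0};
    ToyInstr    nestloop = {0};
    ToyInstr    scan = {0};

    toy_top.shared_blks_hit = 0;

    toy_push(&query);
    toy_push(&nestloop);
    toy_count_hit();            /* charged to nestloop */
    toy_push(&scan);
    toy_count_hit();            /* charged to scan */
    toy_count_hit();            /* charged to scan */
    toy_pop(&scan);
    toy_pop(&nestloop);
    toy_pop(&query);

    /* Bottom-up accumulation, as in regular (non-abort) finalization. */
    toy_finalize(&nestloop, &scan);
    toy_finalize(&query, &nestloop);
    toy_finalize(&toy_top, &query);

    return toy_top.shared_blks_hit;
}
```

In this model each entry holds only its own activity until finalization, and the cumulative totals appear only after the bottom-up roll-up. The abort path described in the README differs in that all unfinalized children are added directly to the query-level entry, so release order does not matter.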
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b0f636bf8b6..ff856f52eef 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -78,6 +78,7 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
/* decls for local routines only used within this module */
static void InitPlan(QueryDesc *queryDesc, int eflags);
static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecFinalizeTriggerInstrumentation(EState *estate);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(QueryDesc *queryDesc,
@@ -247,9 +248,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
estate->es_top_eflags = eflags;
- estate->es_instrument = queryDesc->instrument_options;
estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
+ /*
+ * Set up query-level instrumentation if needed. We do this before
+ * InitPlan so that node and trigger instrumentation can be allocated
+ * within the query's dedicated instrumentation memory context.
+ */
+ if (!queryDesc->totaltime && queryDesc->instrument_options)
+ {
+ queryDesc->totaltime = InstrQueryAlloc(queryDesc->instrument_options);
+ estate->es_instrument = queryDesc->totaltime;
+ }
+
/*
* Set up an AFTER-trigger statement context, unless told not to, or
* unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
@@ -331,9 +342,21 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
- /* Allow instrumentation of Executor overall runtime */
+ /* Start up instrumentation for this execution run */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ {
+ InstrQueryStart(queryDesc->totaltime);
+
+ /*
+ * Remember all node entries for abort recovery. We do this once here
+ * after InstrQueryStart has pushed the parent stack entry.
+ */
+ if (estate->es_instrument &&
+ estate->es_instrument->instr.need_stack &&
+ !queryDesc->already_executed)
+ ExecRememberNodeInstrumentation(queryDesc->planstate,
+ queryDesc->totaltime);
+ }
/*
* extract information from the query descriptor and the query feature.
@@ -385,7 +408,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -435,7 +458,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ InstrQueryStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -444,8 +467,26 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
+ /*
+ * Accumulate per-node and trigger statistics to their respective parent
+ * instrumentation stacks.
+ *
+ * We skip this in parallel workers because their per-node stats are
+ * reported individually via ExecParallelReportInstrumentation, and the
+ * leader's own ExecFinalizeNodeInstrumentation handles propagation. If
+ * we accumulated here, the leader would double-count: worker parent nodes
+ * would already include their children's stats, and then the leader's
+ * accumulation would add the children again.
+ */
+ if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStopFinalize(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
@@ -1263,7 +1304,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options)
+ QueryInstrumentation *qinstr)
{
MemSet(resultRelInfo, 0, sizeof(ResultRelInfo));
resultRelInfo->type = T_ResultRelInfo;
@@ -1284,8 +1325,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
palloc0_array(FmgrInfo, n);
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
- if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
+ if (qinstr)
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, n);
}
else
{
@@ -1500,6 +1541,30 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
return resultRelInfo->ri_ancestorResultRels;
}
+static void
+ExecFinalizeTriggerInstrumentation(EState *estate)
+{
+ List *rels = NIL;
+
+ rels = list_concat(rels, estate->es_tuple_routing_result_relations);
+ rels = list_concat(rels, estate->es_opened_result_relations);
+ rels = list_concat(rels, estate->es_trig_target_relations);
+
+ foreach_node(ResultRelInfo, rInfo, rels)
+ {
+ TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
+
+ if (ti == NULL || rInfo->ri_TrigDesc == NULL)
+ continue;
+
+ for (int nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
+ {
+ if (ti[nt].instr.need_stack)
+ InstrAccumStack(&estate->es_instrument->instr, &ti[nt].instr);
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecPostprocessPlan
*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 78f60c1530c..6bcd922eea5 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -700,7 +700,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
- * looking at pgBufferUsage, so do it unconditionally.
+ * looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -825,13 +825,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
- instrumentation->instrument_options = estate->es_instrument;
+ instrumentation->instrument_options = estate->es_instrument->instrument_options;
instrumentation->instrument_offset = instrument_offset;
instrumentation->num_workers = nworkers;
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInitNode(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument->instrument_options);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1081,14 +1081,28 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
instrument = GetInstrumentationArray(instrumentation);
instrument += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
+ {
InstrAggNode(planstate->instrument, &instrument[n]);
+ /*
+ * Also add worker WAL usage to the global pgWalUsage counter.
+ *
+ * When per-node instrumentation is active, parallel workers skip
+ * ExecFinalizeNodeInstrumentation (to avoid double-counting in
+ * EXPLAIN), so per-node WAL activity is not rolled up into the
+ * query-level stats that InstrAccumParallelQuery receives. Without
+ * this, pgWalUsage would under-report WAL generated by parallel
+ * workers when instrumentation is active.
+ */
+ WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
+ }
+
/*
* Also store the per-worker detail.
*
- * Worker instrumentation should be allocated in the same context as the
- * regular instrumentation information, which is the per-query context.
- * Switch into per-query memory context.
+ * Ensure worker instrumentation is allocated in the per-query context. We
+ * don't need to place this in the instrumentation context since no more
+ * stack-based instrumentation work is being done.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
@@ -1462,6 +1476,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1522,7 +1537,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1538,7 +1553,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d96d4f9947b..6f2909a1bc3 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1381,7 +1381,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
{
ResultRelInfo *rri = makeNode(ResultRelInfo);
- InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
+ InitResultRelInfo(rri, rel, 0, rootResultRelInfo, NULL);
proute->nonleaf_partitions[dispatchidx] = rri;
}
else
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 132fe37ef60..21ad1b04a57 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -123,6 +123,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
+static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -788,10 +790,10 @@ ExecShutdownNode_walker(PlanState *node, void *context)
* at least once already. We don't expect much CPU consumption during
* node shutdown, but in the case of Gather or Gather Merge, we may shut
* down workers at this stage. If so, their buffer usage will get
- * propagated into pgBufferUsage at this point, and we want to make sure
- * that it gets associated with the Gather node. We skip this if the node
- * has never been executed, so as to avoid incorrectly making it appear
- * that it has.
+ * propagated into the current instrumentation stack entry at this point,
+ * and we want to make sure that it gets associated with the Gather node.
+ * We skip this if the node has never been executed, so as to avoid
+ * incorrectly making it appear that it has.
*/
if (node->instrument && node->instrument->running)
InstrStartNode(node->instrument);
@@ -829,6 +831,80 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecRememberNodeInstrumentation
+ *
+ * Register all per-node instrumentation entries as unfinalized children of
+ * the executor's instrumentation. This is needed for abort recovery: if the
+ * executor aborts, we need to walk each per-node entry to recover buffer/WAL
+ * data from nodes that never got finalized, since callers may still be
+ * interested in the aggregate.
+ */
+void
+ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent)
+{
+ (void) ExecRememberNodeInstrumentation_walker(node, parent);
+}
+
+static bool
+ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ QueryInstrumentation *parent = (QueryInstrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ if (node->instrument)
+ InstrQueryRememberChild(parent, &node->instrument->instr);
+
+ return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
+}
+
+/*
+ * ExecFinalizeNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation).
+ *
+ * This must run after the cleanup done by ExecShutdownNode, and not rely on any
+ * resources cleaned up by it. We also expect shutdown actions to have occurred,
+ * e.g. parallel worker instrumentation to have been added to the leader.
+ */
+void
+ExecFinalizeNodeInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeNodeInstrumentation_walker(node, instr_stack.current);
+}
+
+static bool
+ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ Instrumentation *parent = (Instrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ /*
+ * Recurse into children first (bottom-up accumulation), passing our
+ * instrumentation as the parent context. This ensures children can
+ * accumulate to us even if they were never executed by the leader (e.g.
+ * nodes beneath Gather that only workers ran).
+ */
+ planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
+ node->instrument ? &node->instrument->instr : parent);
+
+ if (!node->instrument)
+ return false;
+
+ InstrFinalizeChild(&node->instrument->instr, parent);
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 1eb6b9f1f40..700764daf45 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -150,7 +150,7 @@ CreateExecutorState(void)
estate->es_total_processed = 0;
estate->es_top_eflags = 0;
- estate->es_instrument = 0;
+ estate->es_instrument = NULL;
estate->es_finished = false;
estate->es_exprcontexts = NIL;
@@ -227,6 +227,15 @@ FreeExecutorState(EState *estate)
estate->es_partition_directory = NULL;
}
+ /*
+ * Make sure the instrumentation context gets freed. This usually gets
+ * re-parented under the per-query context in InstrQueryStopFinalize, but
+ * that won't happen during EXPLAIN (BUFFERS) since ExecutorFinish never
+ * gets called, so we would otherwise leak it in TopMemoryContext.
+ */
+ if (estate->es_instrument && estate->es_instrument->instr.need_stack)
+ MemoryContextDelete(estate->es_instrument->instr_cxt);
+
/*
* Free the per-query memory context, thereby releasing all working
* memory, including the EState node itself.
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index bc551f95a08..6892706a83a 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,30 +16,46 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
-BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
-static WalUsage save_pgWalUsage;
+Instrumentation instr_top;
+InstrStackState instr_stack = {0, 0, NULL, &instr_top};
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
-static void WalUsageAdd(WalUsage *dst, WalUsage *add);
+void
+InstrStackGrow(void)
+{
+ int space = instr_stack.stack_space;
+
+ if (instr_stack.entries == NULL)
+ {
+ space = 10; /* Allocate sufficient initial space for
+ * typical activity */
+ instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
+ sizeof(Instrumentation *) * space);
+ }
+ else
+ {
+ space *= 2;
+ instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, space);
+ }
+ /* Update stack space after allocation succeeded to protect against OOMs */
+ instr_stack.stack_space = space;
+}
/* General purpose instrumentation handling */
-Instrumentation *
-InstrAlloc(int instrument_options)
+static inline bool
+InstrNeedStack(int instrument_options)
{
- Instrumentation *instr = palloc0(sizeof(Instrumentation));
- InstrInitOptions(instr, instrument_options);
- return instr;
+ return (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) != 0;
}
void
InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
+ instr->need_stack = InstrNeedStack(instrument_options);
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
@@ -54,50 +70,295 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- /* save buffer usage totals at node entry, if needed */
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
+static void
+InstrStopTimer(Instrumentation *instr)
+{
+ instr_time endtime;
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ /* caller has checked need_timer; the timer must have been started */
+ if (INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStop called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
}
void
InstrStop(Instrumentation *instr)
{
- instr_time endtime;
+ if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ if (instr->need_stack)
+ InstrPopStack(instr);
+}
+
+/*
+ * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
+ *
+ * Note that this intentionally allows passing a stack entry that is not the
+ * current top, as can happen with PG_FINALLY, or resource owners, which don't
+ * have a guaranteed cleanup order.
+ *
+ * We are careful here to achieve two goals:
+ *
+ * 1) Reset the stack to the parent of whichever of the released stack entries
+ * has the lowest index
+ * 2) Accumulate all instrumentation to the currently active instrumentation,
+ * so that callers get a complete picture of activity, even after an abort
+ */
+void
+InstrStopFinalize(Instrumentation *instr)
+{
+ int idx = -1;
+
+ for (int i = instr_stack.stack_size - 1; i >= 0; i--)
+ {
+ if (instr_stack.entries[i] == instr)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ if (idx >= 0)
+ {
+ while (instr_stack.stack_size > idx + 1)
+ instr_stack.stack_size--;
+
+ InstrPopStack(instr);
+ }
- /* let's update the time only if the timer was requested */
if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ InstrAccumStack(instr_stack.current, instr);
+}
+
+/*
+ * Finalize child instrumentation by accumulating buffer/WAL usage to the
+ * provided instrumentation, which may be the current entry, or one the caller
+ * treats as a parent and will add to the totals later.
+ *
+ * Also deletes the unfinalized entry to avoid double counting in an abort
+ * situation, e.g. during executor finish.
+ */
+void
+InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent)
+{
+ if (instr->need_stack)
{
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ if (!dlist_node_is_detached(&instr->unfinalized_entry))
+ dlist_delete_thoroughly(&instr->unfinalized_entry);
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+ InstrAccumStack(parent, instr);
+ }
+}
+
+
+/* Query instrumentation handling */
+
+/*
+ * Use ResourceOwner mechanism to correctly reset instr_stack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+ .release_priority = RELEASE_PRIO_INSTRUMENTATION,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ QueryInstrumentation *qinstr = (QueryInstrumentation *) DatumGetPointer(res);
+ MemoryContext instr_cxt = qinstr->instr_cxt;
+ dlist_mutable_iter iter;
- INSTR_TIME_SET_ZERO(instr->starttime);
+ /* Accumulate data from all unfinalized child entries (nodes, triggers) */
+ dlist_foreach_modify(iter, &qinstr->unfinalized_entries)
+ {
+ Instrumentation *child = dlist_container(Instrumentation, unfinalized_entry, iter.cur);
+
+ InstrAccumStack(&qinstr->instr, child);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
+ /* Ensure the stack is reset as expected, and we accumulate to the parent */
+ InstrStopFinalize(&qinstr->instr);
+
+ /*
+ * Destroy the dedicated instrumentation context, which frees the
+ * QueryInstrumentation and all child allocations.
+ */
+ MemoryContextDelete(instr_cxt);
+}
+
+QueryInstrumentation *
+InstrQueryAlloc(int instrument_options)
+{
+ QueryInstrumentation *instr;
+ MemoryContext instr_cxt;
+
+ /*
+ * When the instrumentation stack is used, create a dedicated memory
+ * context for this query's instrumentation allocations. This context is a
+ * child of TopMemoryContext so it survives transaction abort —
+ * ResourceOwner release needs to access it.
+ *
+ * For simpler cases (timer/rows only), use the current memory context.
+ *
+ * All child instrumentation allocations (nodes, triggers, etc) must be
+ * allocated within this context to ensure correct clean up on abort.
+ */
+ if (InstrNeedStack(instrument_options))
+ instr_cxt = AllocSetContextCreate(TopMemoryContext,
+ "Instrumentation",
+ ALLOCSET_SMALL_SIZES);
+ else
+ instr_cxt = CurrentMemoryContext;
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ instr = MemoryContextAllocZero(instr_cxt, sizeof(QueryInstrumentation));
+ instr->instrument_options = instrument_options;
+ instr->instr_cxt = instr_cxt;
+
+ InstrInitOptions(&instr->instr, instrument_options);
+ dlist_init(&instr->unfinalized_entries);
+
+ return instr;
+}
+
+void
+InstrQueryStart(QueryInstrumentation *qinstr)
+{
+ InstrStart(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(CurrentResourceOwner != NULL);
+ qinstr->owner = CurrentResourceOwner;
+
+ ResourceOwnerEnlarge(qinstr->owner);
+ ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
+ }
+}
+
+void
+InstrQueryStop(QueryInstrumentation *qinstr)
+{
+ InstrStop(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+ }
+}
+
+void
+InstrQueryStopFinalize(QueryInstrumentation *qinstr)
+{
+ InstrStopFinalize(&qinstr->instr);
+
+ if (!qinstr->instr.need_stack)
+ return;
+
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+
+ /*
+ * Reparent the dedicated instrumentation context under the current memory
+ * context, so that its lifetime is now tied to the caller's context
+ * rather than TopMemoryContext.
+ */
+ MemoryContextSetParent(qinstr->instr_cxt, CurrentMemoryContext);
+}
+
+/*
+ * Register a child Instrumentation entry for abort processing.
+ *
+ * On abort, ResOwnerReleaseInstrumentation will walk the parent's list to
+ * recover buffer/WAL data from entries that were never finalized, in order for
+ * aggregate totals to be accurate despite the query erroring out.
+ */
+void
+InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *child)
+{
+ if (child->need_stack)
+ dlist_push_head(&parent->unfinalized_entries, &child->unfinalized_entry);
+}
+
+/* start instrumentation during parallel executor startup */
+QueryInstrumentation *
+InstrStartParallelQuery(void)
+{
+ QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrQueryStart(qinstr);
+ return qinstr;
+}
+
+/* report usage after parallel executor shutdown */
+void
+InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+{
+ InstrQueryStopFinalize(qinstr);
+ memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+}
+
+/*
+ * Accumulate work done by parallel workers in the leader's stats.
+ *
+ * Note that what gets added here effectively depends on whether per-node
+ * instrumentation is active. If it's active the parallel worker intentionally
+ * skips ExecFinalizeNodeInstrumentation on executor shutdown, because it would
+ * cause double counting. Instead, this only accumulates any extra activity
+ * outside of nodes.
+ *
+ * Otherwise this is responsible for making sure that the complete query
+ * activity is accumulated.
+ */
+void
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+{
+ BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
+ WalUsageAdd(&instr_stack.current->walusage, walusage);
+
+ WalUsageAdd(&pgWalUsage, walusage);
}
/* Node instrumentation handling */
/* Allocate new node instrumentation structure */
NodeInstrumentation *
-InstrAllocNode(int instrument_options, bool async_mode)
+InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode)
{
- NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+ NodeInstrumentation *instr = MemoryContextAlloc(qinstr->instr_cxt, sizeof(NodeInstrumentation));
- InstrInitNode(instr, instrument_options);
+ InstrInitNode(instr, qinstr->instrument_options);
instr->async_mode = async_mode;
return instr;
@@ -118,6 +379,7 @@ InstrStartNode(NodeInstrumentation *instr)
InstrStart(&instr->instr);
}
+
/* Exit from a plan node */
void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
@@ -147,14 +409,12 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->instr.starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->instr.need_bufusage)
- BufferUsageAccumDiff(&instr->instr.bufusage,
- &pgBufferUsage, &instr->instr.bufusage_start);
-
- if (instr->instr.need_walusage)
- WalUsageAccumDiff(&instr->instr.walusage,
- &pgWalUsage, &instr->instr.walusage_start);
+ /*
+ * Only pop the stack here; accumulation runs in
+ * ExecFinalizeNodeInstrumentation.
+ */
+ if (instr->instr.need_stack)
+ InstrPopStack(&instr->instr);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -189,8 +449,8 @@ InstrEndLoop(NodeInstrumentation *instr)
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrEndLoop called on running node");
+ /* Ensure InstrStopNode was called */
+ Assert(INSTR_TIME_IS_ZERO(instr->instr.starttime));
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
@@ -231,67 +491,73 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->instr.need_bufusage)
- BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
-
- if (dst->instr.need_walusage)
- WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
+ if (dst->instr.need_stack)
+ InstrAccumStack(&dst->instr, &add->instr);
}
/* Trigger instrumentation handling */
TriggerInstrumentation *
-InstrAllocTrigger(int n, int instrument_options)
+InstrAllocTrigger(QueryInstrumentation *qinstr, int n)
{
- TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ TriggerInstrumentation *tginstr;
int i;
+ /*
+ * Allocate in the query's dedicated instrumentation context so all
+ * instrumentation data is grouped together and cleaned up as a unit.
+ */
+ Assert(qinstr != NULL && qinstr->instr_cxt != NULL);
+ tginstr = MemoryContextAllocZero(qinstr->instr_cxt,
+ n * sizeof(TriggerInstrumentation));
+
for (i = 0; i < n; i++)
- InstrInitOptions(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, qinstr->instrument_options);
return tginstr;
}
void
-InstrStartTrigger(TriggerInstrumentation *tginstr)
+InstrStartTrigger(QueryInstrumentation *qinstr, TriggerInstrumentation *tginstr)
{
InstrStart(&tginstr->instr);
+
+ /*
+ * On first call, register with the parent QueryInstrumentation for abort
+ * recovery.
+ */
+ if (qinstr && tginstr->instr.need_stack &&
+ dlist_node_is_detached(&tginstr->instr.unfinalized_entry))
+ dlist_push_head(&qinstr->unfinalized_entries,
+ &tginstr->instr.unfinalized_entry);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
+ /*
+ * This trigger may be called again, so we don't finalize instrumentation
+ * here. Accumulation to the parent happens at ExecutorFinish through
+ * ExecFinalizeTriggerInstrumentation.
+ */
InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
-/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrAccumStack(Instrumentation *dst, Instrumentation *add)
{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
-}
+ Assert(dst != NULL);
+ Assert(add != NULL);
-/* report usage after parallel executor shutdown */
-void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
-}
+ if (!add->need_stack)
+ return;
-/* accumulate work done by workers in leader's stats */
-void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- BufferUsageAdd(&pgBufferUsage, bufusage);
- WalUsageAdd(&pgWalUsage, walusage);
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -312,39 +578,9 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
+/* dst += add */
void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
-/* helper functions for WAL usage accumulation */
-static void
-WalUsageAdd(WalUsage *dst, WalUsage *add)
+WalUsageAdd(WalUsage *dst, const WalUsage *add)
{
dst->wal_bytes += add->wal_bytes;
dst->wal_records += add->wal_records;
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b38170f0fbe..3ca0a7a635d 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -904,7 +904,7 @@ create_edata_for_relation(LogicalRepRelMapEntry *rel)
* Use Relation opened by logicalrep_rel_open() instead of opening it
* again.
*/
- InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
+ InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, NULL);
/*
* We put the ResultRelInfo in the es_opened_result_relations list, even
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3e1c39160db..cf4f4246ca2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1266,9 +1266,9 @@ PinBufferForBlock(Relation rel,
if (rel)
{
/*
- * While pgBufferUsage's "read" counter isn't bumped unless we reach
- * WaitReadBuffers() (so, not for hits, and not for buffers that are
- * zeroed instead), the per-relation stats always count them.
+ * While the current buffer usage "read" counter isn't bumped unless
+ * we reach WaitReadBuffers() (so, not for hits, and not for buffers
+ * that are zeroed instead), the per-relation stats always count them.
*/
pgstat_count_buffer_read(rel);
}
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index e3829d7fe7c..e7fc7f071d8 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -114,9 +114,9 @@ pgstat_prepare_io_time(bool track_io_guc)
* pg_stat_database only counts block read and write times, these are done for
* IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
*
- * pgBufferUsage is used for EXPLAIN. pgBufferUsage has write and read stats
- * for shared, local and temporary blocks. pg_stat_io does not track the
- * activity of temporary blocks, so these are ignored here.
+ * Executor instrumentation is used for EXPLAIN. Buffer usage tracked there has
+ * write and read stats for shared, local and temporary blocks. pg_stat_io
+ * does not track the activity of temporary blocks, so these are ignored here.
*/
void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
diff --git a/src/include/commands/explain_dr.h b/src/include/commands/explain_dr.h
index f98eaae1864..fa98d29589f 100644
--- a/src/include/commands/explain_dr.h
+++ b/src/include/commands/explain_dr.h
@@ -23,11 +23,10 @@ typedef struct ExplainState ExplainState;
typedef struct SerializeMetrics
{
uint64 bytesSent; /* # of bytes serialized */
- instr_time timeSpent; /* time spent serializing */
- BufferUsage bufferUsage; /* buffers accessed during serialization */
+ Instrumentation instr; /* per-tuple timing/buffer measurement */
} SerializeMetrics;
extern DestReceiver *CreateExplainSerializeDestReceiver(ExplainState *es);
-extern SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
+extern SerializeMetrics *GetSerializationMetrics(DestReceiver *dest);
#endif
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index d3a57242844..340029a2034 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -51,8 +51,8 @@ typedef struct QueryDesc
/* This field is set by ExecutePlan */
bool already_executed; /* true if previously executed */
- /* This is always set NULL by the core system, but plugins can change it */
- struct Instrumentation *totaltime; /* total time spent in ExecutorRun */
+ /* This field is set by ExecutorRun, or plugins */
+ struct QueryInstrumentation *totaltime; /* total time spent in ExecutorRun */
} QueryDesc;
/* in pquery.c */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 491c4886506..75434f64ba7 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -233,6 +233,7 @@ ExecGetJunkAttribute(TupleTableSlot *slot, AttrNumber attno, bool *isNull)
/*
* prototypes from functions in execMain.c
*/
+typedef struct QueryInstrumentation QueryInstrumentation;
extern void ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void standard_ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void ExecutorRun(QueryDesc *queryDesc,
@@ -254,7 +255,7 @@ extern void InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options);
+ QueryInstrumentation *qinstr);
extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid,
ResultRelInfo *rootRelInfo);
extern List *ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo);
@@ -301,6 +302,8 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent);
+extern void ExecFinalizeNodeInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d4769f3da7b..f49c3f99cf2 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -13,6 +13,7 @@
#ifndef INSTRUMENT_H
#define INSTRUMENT_H
+#include "lib/ilist.h"
#include "portability/instr_time.h"
@@ -68,29 +69,91 @@ typedef enum InstrumentOption
} InstrumentOption;
/*
- * General purpose instrumentation that can capture time and WAL/buffer usage
+ * Instrumentation base class for capturing time and WAL/buffer usage
*
- * Initialized through InstrAlloc, followed by one or more calls to a pair of
- * InstrStart/InstrStop (activity is measured inbetween).
+ * If used directly:
+ * - Allocate on the stack and zero initialize the struct
+ * - Call InstrInitOptions to set instrumentation options
+ * - Call InstrStart before the activity you want to measure
+ * - Call InstrStop / InstrStopFinalize after the activity to capture totals
+ *
+ * InstrStart/InstrStop may be called multiple times. The last stop call must
+ * be to InstrStopFinalize to ensure parent stack entries get the accumulated
+ * totals. If there is risk of transaction aborts you must call
+ * InstrStopFinalize in a PG_TRY/PG_FINALLY block to avoid corrupting the
+ * instrumentation stack.
+ *
+ * In a query context use QueryInstrumentation instead, which handles aborts
+ * using the resource owner logic.
*/
typedef struct Instrumentation
{
/* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
- bool need_bufusage; /* true if we need buffer usage data */
- bool need_walusage; /* true if we need WAL usage data */
+ bool need_stack; /* true if we need WAL/buffer usage data */
/* Internal state keeping: */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
+ /* Abort handling: link in parent QueryInstrumentation's unfinalized list */
+ dlist_node unfinalized_entry;
} Instrumentation;
+/*
+ * Query-related instrumentation tracking.
+ *
+ * Usage:
+ * - Allocate on the heap using InstrQueryAlloc (required for abort handling)
+ * - Call InstrQueryStart before the activity you want to measure
+ * - Call InstrQueryStop / InstrQueryStopFinalize afterwards to capture totals
+ *
+ * InstrQueryStart/InstrQueryStop may be called multiple times. The last stop
+ * call must be to InstrQueryStopFinalize to ensure parent stack entries get
+ * the accumulated totals.
+ *
+ * Uses the resource owner mechanism for handling aborts. As such, the caller
+ * *must not* exit out of the top level transaction after having called
+ * InstrQueryStart without first calling InstrQueryStop or
+ * InstrQueryStopFinalize. In the case of a transaction abort, logic equivalent
+ * to InstrQueryStopFinalize will be called automatically.
+ */
+struct ResourceOwnerData;
+typedef struct QueryInstrumentation
+{
+ Instrumentation instr;
+
+ /* Original instrument_options flags used to create this instrumentation */
+ int instrument_options;
+
+ /* Resource owner used for cleanup for aborts between InstrStart/InstrStop */
+ struct ResourceOwnerData *owner;
+
+ /*
+ * Dedicated memory context for all instrumentation allocations belonging
+ * to this query (node instrumentation, trigger instrumentation, etc.).
+ * Initially a child of TopMemoryContext so it survives transaction abort
+ * for ResourceOwner cleanup; InstrQueryStopFinalize then reparents it
+ * under the current memory context.
+ */
+ MemoryContext instr_cxt;
+
+ /*
+ * Child entries that need to be cleaned up on abort, since they are not
+ * registered as a resource owner themselves. Contains both node and
+ * trigger instrumentation entries linked via instr.unfinalized_entry.
+ */
+ dlist_head unfinalized_entries;
+} QueryInstrumentation;
+
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Relies on an outer QueryInstrumentation having been set up to handle the
+ * stack used for WAL/buffer usage statistics, and relies on it for managing
+ * aborts. Solely intended for the executor and anyone reporting about its
+ * activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -111,6 +174,10 @@ typedef struct NodeInstrumentation
double nfiltered2; /* # of tuples removed by "other" quals */
} NodeInstrumentation;
+/*
+ * Care must be taken with any pointers contained within this struct, as this
+ * gets copied across processes during parallel query execution.
+ */
typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
@@ -124,16 +191,102 @@ typedef struct TriggerInstrumentation
* was fired */
} TriggerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
+/*
+ * Dynamic array-based stack for tracking current WAL/buffer usage context.
+ *
+ * When the stack is empty, 'current' points to instr_top which accumulates
+ * session-level totals.
+ */
+typedef struct InstrStackState
+{
+ int stack_space; /* allocated capacity of entries array */
+ int stack_size; /* current number of entries */
+
+ Instrumentation **entries; /* dynamic array of pointers */
+ Instrumentation *current; /* top of stack, or &instr_top when empty */
+} InstrStackState;
+
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int instrument_options);
+/*
+ * The top instrumentation represents a running total of the current backend
+ * WAL/buffer usage information. This will not be updated immediately, but
+ * rather when the current stack entry gets accumulated which typically happens
+ * at query end.
+ *
+ * Care must be taken when utilizing this in the parallel worker context:
+ * Parallel workers will report back their instrumentation to the caller,
+ * and this gets added to the caller's stack. If this were to be used in the
+ * shared memory stats infrastructure it would need to be skipped on parallel
+ * workers to avoid double counting.
+ */
+extern PGDLLIMPORT Instrumentation instr_top;
+
+/*
+ * The instrumentation stack state. The 'current' field points to the
+ * currently active stack entry that is getting updated as activity happens,
+ * and will be accumulated to parent stacks when it gets finalized by
+ * InstrStopFinalize (non-executor use cases), ExecFinalizeNodeInstrumentation
+ * (executor finish) or ResOwnerReleaseInstrumentation on abort.
+ */
+extern PGDLLIMPORT InstrStackState instr_stack;
+
+extern void InstrStackGrow(void);
+
+/*
+ * Pushes the stack so that all WAL/buffer usage updates go to the passed in
+ * instrumentation entry.
+ *
+ * See note on InstrPopStack regarding safe use of these functions.
+ */
+static inline void
+InstrPushStack(Instrumentation *instr)
+{
+ if (unlikely(instr_stack.stack_size == instr_stack.stack_space))
+ InstrStackGrow();
+
+ instr_stack.entries[instr_stack.stack_size++] = instr;
+ instr_stack.current = instr;
+}
+
+/*
+ * Pops the stack entry back to the previous one that was effective at
+ * InstrPushStack.
+ *
+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, consider instead using InstrStart + InstrStopFinalize which can skip
+ * intermediate stack entries.
+ */
+static inline void
+InstrPopStack(Instrumentation *instr)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.entries[instr_stack.stack_size - 1] == instr);
+ instr_stack.stack_size--;
+ instr_stack.current = instr_stack.stack_size > 0
+ ? instr_stack.entries[instr_stack.stack_size - 1]
+ : &instr_top;
+}
+
extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
extern void InstrStart(Instrumentation *instr);
extern void InstrStop(Instrumentation *instr);
+extern void InstrStopFinalize(Instrumentation *instr);
+extern void InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent);
+extern void InstrAccumStack(Instrumentation *dst, Instrumentation *add);
-extern NodeInstrumentation *InstrAllocNode(int instrument_options,
- bool async_mode);
+extern QueryInstrumentation *InstrQueryAlloc(int instrument_options);
+extern void InstrQueryStart(QueryInstrumentation *instr);
+extern void InstrQueryStop(QueryInstrumentation *instr);
+extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
+extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
+
+pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+
+extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
extern void InstrStartNode(NodeInstrumentation *instr);
extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
@@ -141,35 +294,36 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
-extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
-extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr, int n);
+extern void InstrStartTrigger(QueryInstrumentation *qinstr,
+ TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+extern void WalUsageAdd(WalUsage *dst, const WalUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
+ instr_stack.current->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += (val); \
+ instr_stack.current->bufusage.fld += (val); \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ INSTR_TIME_ADD(instr_stack.current->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ INSTR_TIME_ACCUM_DIFF(instr_stack.current->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ instr_stack.current->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += (val); \
+ instr_stack.current->walusage.fld += (val); \
} while(0)
#endif /* INSTRUMENT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3ecae7552fc..b28288aa1e8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -54,6 +54,7 @@ typedef struct Instrumentation Instrumentation;
typedef struct pairingheap pairingheap;
typedef struct PlanState PlanState;
typedef struct QueryEnvironment QueryEnvironment;
+typedef struct QueryInstrumentation QueryInstrumentation;
typedef struct RelationData *Relation;
typedef Relation *RelationPtr;
typedef struct ScanKeyData ScanKeyData;
@@ -753,7 +754,7 @@ typedef struct EState
* ExecutorRun() calls. */
int es_top_eflags; /* eflags passed to ExecutorStart */
- int es_instrument; /* OR of InstrumentOption flags */
+ QueryInstrumentation *es_instrument; /* query-level instrumentation */
bool es_finished; /* true when ExecutorFinish is done */
List *es_exprcontexts; /* List of ExprContexts within EState */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index eb6033b4fdb..5463bc921f0 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -75,6 +75,7 @@ typedef uint32 ResourceReleasePriority;
#define RELEASE_PRIO_SNAPSHOT_REFS 500
#define RELEASE_PRIO_FILES 600
#define RELEASE_PRIO_WAITEVENTSETS 700
+#define RELEASE_PRIO_INSTRUMENTATION 800
/* 0 is considered invalid */
#define RELEASE_PRIO_FIRST 1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 449acca8dc1..7393926e34d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1355,6 +1355,7 @@ InjectionPointSharedState
InjectionPointsCtl
InlineCodeBlock
InsertStmt
+InstrStackState
Instrumentation
Int128AggState
Int8TransTypeData
@@ -2477,6 +2478,7 @@ QueryCompletion
QueryDesc
QueryEnvironment
QueryInfo
+QueryInstrumentation
QueryItem
QueryItemType
QueryMode
--
2.47.1
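
Reader's note: the push/pop/accumulate flow that the patch above builds around instr_stack can be modelled in a few lines of standalone C. This is only a sketch under simplifying assumptions, not the patch's code: every name prefixed with sketch_ or Sketch is hypothetical, the struct carries a single counter instead of the full BufferUsage/WalUsage, and the stack is a fixed-size array where the patch grows its array via InstrStackGrow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Instrumentation: one counter, no timer, no WAL. */
typedef struct SketchInstr
{
    long        shared_blks_hit;
} SketchInstr;

#define SKETCH_STACK_MAX 16     /* assumption: fixed capacity for brevity */

static SketchInstr sketch_top;  /* plays the role of instr_top */
static SketchInstr *sketch_entries[SKETCH_STACK_MAX];
static int  sketch_size = 0;
static SketchInstr *sketch_current = &sketch_top;

/* Like InstrPushStack: counter updates now land in 'instr'. */
static void
sketch_push(SketchInstr *instr)
{
    assert(sketch_size < SKETCH_STACK_MAX);
    sketch_entries[sketch_size++] = instr;
    sketch_current = instr;
}

/* Like InstrPopStack: restore the previous entry, or the top-level total. */
static void
sketch_pop(SketchInstr *instr)
{
    assert(sketch_size > 0);
    assert(sketch_entries[sketch_size - 1] == instr);
    sketch_size--;
    sketch_current = sketch_size > 0
        ? sketch_entries[sketch_size - 1]
        : &sketch_top;
}

/* Like InstrAccumStack: dst += add. */
static void
sketch_accum(SketchInstr *dst, const SketchInstr *add)
{
    dst->shared_blks_hit += add->shared_blks_hit;
}

/* Like INSTR_BUFUSAGE_INCR: a plain increment on the current entry. */
static void
sketch_count_hit(void)
{
    sketch_current->shared_blks_hit++;
}

/* Hypothetical walk-through: a parent node that runs one child node. */
static void
sketch_demo(SketchInstr *parent, SketchInstr *child)
{
    sketch_push(parent);
    sketch_count_hit();         /* attributed to parent */
    sketch_push(child);
    sketch_count_hit();         /* attributed to child */
    sketch_count_hit();         /* attributed to child */
    sketch_pop(child);
    sketch_count_hit();         /* attributed to parent */
    sketch_pop(parent);

    /* Finalization step: child into parent, parent into the current entry. */
    sketch_accum(parent, child);
    sketch_accum(sketch_current, parent);
}
```

The point of the design is visible in sketch_count_hit: the hot path is a single increment through a pointer, and the additions across all counters happen once at finalization rather than as a diff on every node exit.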
[application/x-patch] v12-0001-instrumentation-Separate-trigger-logic-from-othe.patch (10.1K, 4-v12-0001-instrumentation-Separate-trigger-logic-from-othe.patch)
download | inline diff:
From 16e47ac288208a6ec5ba7fccbd6fc669ee537e63 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v12 1/9] instrumentation: Separate trigger logic from other
uses
Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
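
Reader's note: the split described in the commit message (a dedicated firings counter next to the timing data, instead of reusing ntuples) can be illustrated with a standalone sketch. All names here are hypothetical, the elapsed time is passed in as a plain double in milliseconds rather than measured with instr_time, and only the text-format output strings are taken from the explain.c hunk below.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for TriggerInstrumentation. */
typedef struct SketchTriggerInstr
{
    double      total_ms;       /* stands in for instr.total */
    int         firings;        /* number of times the trigger fired */
} SketchTriggerInstr;

/*
 * Rough analogue of InstrStopTrigger. Assumption: the caller supplies the
 * elapsed time, whereas the patch derives it from the timer started earlier.
 */
static void
sketch_stop_trigger(SketchTriggerInstr *tg, double elapsed_ms, int firings)
{
    tg->total_ms += elapsed_ms;
    tg->firings += firings;
}

/*
 * Rough analogue of the text-format branch in report_triggers. Returns 0
 * without writing anything for triggers that never fired.
 */
static int
sketch_report(const SketchTriggerInstr *tg, int timing,
              char *buf, size_t buflen)
{
    if (tg->firings == 0)
        return 0;

    if (timing)
        return snprintf(buf, buflen, ": time=%.3f calls=%d\n",
                        tg->total_ms, tg->firings);

    return snprintf(buf, buflen, ": calls=%d\n", tg->firings);
}
```

Keeping firings as an int is what lets the format string move from %.0f to %d in the hunk below.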
src/backend/commands/explain.c | 19 ++++++++-----------
src/backend/commands/trigger.c | 22 +++++++++++-----------
src/backend/executor/execMain.c | 2 +-
src/backend/executor/instrument.c | 26 ++++++++++++++++++++++++++
src/include/executor/instrument.h | 12 ++++++++++++
src/include/nodes/execnodes.h | 3 ++-
src/tools/pgindent/typedefs.list | 1 +
7 files changed, 61 insertions(+), 24 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e4b70166b0e..eb6ef23c2d6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1101,18 +1101,15 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
for (nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
{
Trigger *trig = rInfo->ri_TrigDesc->triggers + nt;
- Instrumentation *instr = rInfo->ri_TrigInstrument + nt;
+ TriggerInstrumentation *tginstr = rInfo->ri_TrigInstrument + nt;
char *relname;
char *conname = NULL;
- /* Must clean up instrumentation state */
- InstrEndLoop(instr);
-
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
*/
- if (instr->ntuples == 0)
+ if (tginstr->firings == 0)
continue;
ExplainOpenGroup("Trigger", NULL, true, es);
@@ -1137,11 +1134,11 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
if (show_relname)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
- appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- INSTR_TIME_GET_MILLISEC(instr->total),
- instr->ntuples);
+ appendStringInfo(es->str, ": time=%.3f calls=%d\n",
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total),
+ tginstr->firings);
else
- appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
+ appendStringInfo(es->str, ": calls=%d\n", tginstr->firings);
}
else
{
@@ -1151,9 +1148,9 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Relation", relname, es);
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- INSTR_TIME_GET_MILLISEC(instr->total), 3,
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total), 3,
es);
- ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
+ ExplainPropertyInteger("Calls", NULL, tginstr->firings, es);
}
if (conname)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 90e94fb8a5a..4d4e96a5302 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -92,7 +92,7 @@ static bool TriggerEnabled(EState *estate, ResultRelInfo *relinfo,
static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2311,7 +2311,7 @@ static HeapTuple
ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2346,7 +2346,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2391,10 +2391,10 @@ ExecCallTriggerFunc(TriggerData *trigdata,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
return (HeapTuple) DatumGetPointer(result);
}
@@ -3947,7 +3947,7 @@ static void AfterTriggerExecute(EState *estate,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2);
@@ -4342,7 +4342,7 @@ AfterTriggerExecute(EState *estate,
ResultRelInfo *src_relInfo,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
- FmgrInfo *finfo, Instrumentation *instr,
+ FmgrInfo *finfo, TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2)
@@ -4383,7 +4383,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4600,10 +4600,10 @@ AfterTriggerExecute(EState *estate,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
}
@@ -4719,7 +4719,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
Relation rel = NULL;
TriggerDesc *trigdesc = NULL;
FmgrInfo *finfo = NULL;
- Instrumentation *instr = NULL;
+ TriggerInstrumentation *instr = NULL;
TupleTableSlot *slot1 = NULL,
*slot2 = NULL;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 45e00c6af85..0237d8c3b1d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1285,7 +1285,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
}
else
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index a40610bc252..9354ad7be12 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -196,6 +196,32 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
WalUsageAdd(&dst->walusage, &add->walusage);
}
+/* Trigger instrumentation handling */
+TriggerInstrumentation *
+InstrAllocTrigger(int n, int instrument_options)
+{
+ TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ int i;
+
+ for (i = 0; i < n; i++)
+ InstrInit(&tginstr[i].instr, instrument_options);
+
+ return tginstr;
+}
+
+void
+InstrStartTrigger(TriggerInstrumentation *tginstr)
+{
+ InstrStartNode(&tginstr->instr);
+}
+
+void
+InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
+{
+ InstrStopNode(&tginstr->instr, 0);
+ tginstr->firings += firings;
+}
+
/* note current values during parallel executor startup */
void
InstrStartParallelQuery(void)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..a9c2233227f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -100,6 +100,13 @@ typedef struct WorkerInstrumentation
Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
+typedef struct TriggerInstrumentation
+{
+ Instrumentation instr;
+ int firings; /* number of times the instrumented trigger
+ * was fired */
+} TriggerInstrumentation;
+
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
@@ -111,6 +118,11 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+
+extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
+extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
+
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 090cfccf65f..908898aa7c9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -60,6 +60,7 @@ typedef struct ScanKeyData ScanKeyData;
typedef struct SnapshotData *Snapshot;
typedef struct SortSupportData *SortSupport;
typedef struct TIDBitmap TIDBitmap;
+typedef struct TriggerInstrumentation TriggerInstrumentation;
typedef struct TupleConversionMap TupleConversionMap;
typedef struct TupleDescData *TupleDesc;
typedef struct Tuplesortstate Tuplesortstate;
@@ -552,7 +553,7 @@ typedef struct ResultRelInfo
ExprState **ri_TrigWhenExprs;
/* optional runtime measurements for triggers */
- Instrumentation *ri_TrigInstrument;
+ TriggerInstrumentation *ri_TrigInstrument;
/* On-demand created slots for triggers / returning processing */
TupleTableSlot *ri_ReturningSlot; /* for trigger output tuples */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c72f6c59573..7ddf970fb97 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3213,6 +3213,7 @@ TriggerDesc
TriggerEvent
TriggerFlags
TriggerInfo
+TriggerInstrumentation
TriggerTransition
TruncateStmt
TsmRoutine
--
2.47.1
[application/x-patch] v12-0003-instrumentation-Replace-direct-changes-of-pgBuff.patch (9.0K, 5-v12-0003-instrumentation-Replace-direct-changes-of-pgBuff.patch)
From 486145723e9a9863c87dfe2f51cc36b905d2ef6a Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Thu, 26 Mar 2026 23:31:04 -0700
Subject: [PATCH v12 3/9] instrumentation: Replace direct changes of pgBufferUsage/pgWalUsage with INSTR_* macros
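Routing every update through INSTR_* macros keeps ownership of these
globals in instrument.h, so that a subsequent refactoring can change how
the counters are stored without touching each call site.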
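As a rough illustration of why the indirection helps, the standalone sketch
below routes counter updates through macros that write to whatever a
"current" pointer refers to. Everything named Sketch*/sketch_*/SKETCH_* is
hypothetical and not part of this patch, where the macros still update
pgBufferUsage/pgWalUsage directly. The sketch only assumes that a later
change might redirect the macro bodies, as the stack-based proposal in this
thread describes.

```c
#include <stdint.h>

/* Hypothetical, reduced stand-in for BufferUsage. */
typedef struct SketchBufferUsage
{
    int64_t     shared_blks_hit;
    int64_t     shared_blks_read;
} SketchBufferUsage;

/* Global totals, plus a pointer naming the entry currently being charged. */
static SketchBufferUsage sketch_global_usage;
static SketchBufferUsage *sketch_current = &sketch_global_usage;

/*
 * Call sites only ever use these macros, so changing where the counters
 * live means editing the macro bodies, not every caller.
 */
#define SKETCH_BUFUSAGE_INCR(fld) do { \
        sketch_current->fld++; \
    } while (0)
#define SKETCH_BUFUSAGE_ADD(fld,val) do { \
        sketch_current->fld += (val); \
    } while (0)

/* Example call sites, analogous to the buffer manager updates above. */
static void
sketch_track_hit(void)
{
    SKETCH_BUFUSAGE_INCR(shared_blks_hit);
}

static void
sketch_track_read(int nblocks)
{
    SKETCH_BUFUSAGE_ADD(shared_blks_read, nblocks);
}
```

Pointing sketch_current at a per-node entry before calling the tracking
functions charges the activity to that entry and leaves the global totals
untouched, which is the behaviour a stack-based scheme would rely on.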
Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
---
src/backend/access/transam/xlog.c | 10 +++++-----
src/backend/storage/buffer/bufmgr.c | 20 ++++++++++----------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
6 files changed, 45 insertions(+), 26 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9e8999bbb61..71c9a265662 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1103,10 +1103,10 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
- pgWalUsage.wal_fpi_bytes += fpi_bytes;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
+ INSTR_WALUSAGE_ADD(wal_fpi_bytes, fpi_bytes);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2085,7 +2085,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = InvalidXLogRecPtr;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cc0b0bdd92..3e1c39160db 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -840,7 +840,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -861,7 +861,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
if (BufferTagsEqual(&tag, &bufHdr->tag))
{
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
UnpinBuffer(bufHdr);
@@ -1684,9 +1684,9 @@ TrackBufferHit(IOObject io_object, IOContext io_context,
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
@@ -2148,9 +2148,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it. Otherwise
@@ -3043,7 +3043,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -3189,7 +3189,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4601,7 +4601,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state.
@@ -5796,7 +5796,7 @@ MarkSharedBufferDirtyHint(Buffer buffer, BufferDesc *bufHdr, uint64 lockstate,
UnlockBufHdr(bufHdr);
}
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 396da84b25c..851b99056d5 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -218,7 +218,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -479,7 +479,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -510,7 +510,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u64(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c4afe4d368a..8b501dfcadd 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -475,13 +475,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -549,13 +549,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 2be26e92283..e3829d7fe7c 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b11d64633b5..d4769f3da7b 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -153,4 +153,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while (0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += (val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while (0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += (val); \
+ } while (0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
[application/x-patch] v12-0004-instrumentation-Add-additional-regression-tests-.patch (23.5K, 6-v12-0004-instrumentation-Add-additional-regression-tests-.patch)
From ba18ab8d156609a563c153097c173dad8d53989e Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 22 Feb 2026 16:12:48 -0800
Subject: [PATCH v12 4/9] instrumentation: Add additional regression tests covering buffer usage
This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.
Testing this is challenging because there can be different sources of
buffer activity, so we rely on temporary tables where we can to prove
that activity was captured and not lost. This supports a future commit
that will rework some of the instrumentation logic that could cause
areas covered by these tests to fail.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
.../pg_stat_statements/expected/utility.out | 70 +++++++
contrib/pg_stat_statements/expected/wal.out | 48 +++++
contrib/pg_stat_statements/sql/utility.sql | 56 +++++
contrib/pg_stat_statements/sql/wal.sql | 33 +++
src/test/regress/expected/explain.out | 197 ++++++++++++++++++
src/test/regress/sql/explain.sql | 194 +++++++++++++++++
6 files changed, 598 insertions(+)
diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index e4d6564ea5b..cba487f6be5 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -289,6 +289,76 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
1 | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
(3 rows)
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT run_explain_buffers_test();
+ run_explain_buffers_test
+--------------------------
+
+(1 row)
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+ query | has_buffer_stats
+-----------------------------------+------------------
+ SELECT run_explain_buffers_test() | t
+(1 row)
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+CALL pgss_call_rollback_proc();
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------------+-------+---------------+---------------------+-----------------------
+ CALL pgss_call_rollback_proc() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/expected/wal.out b/contrib/pg_stat_statements/expected/wal.out
index 977e382d848..611213daef6 100644
--- a/contrib/pg_stat_statements/expected/wal.out
+++ b/contrib/pg_stat_statements/expected/wal.out
@@ -28,3 +28,51 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
t
(1 row)
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT pgss_error_func();
+ pgss_error_func
+-----------------
+
+(1 row)
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------+-------+---------------+---------------------+-----------------------
+ SELECT pgss_error_func() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index dd97203c210..7540e49c73c 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -152,6 +152,62 @@ EXPLAIN (costs off) SELECT a FROM generate_series(1,10) AS tab(a) WHERE a = 7;
SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+SELECT run_explain_buffers_test();
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+CALL pgss_call_rollback_proc();
+
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
+
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/sql/wal.sql b/contrib/pg_stat_statements/sql/wal.sql
index 1dc1552a81e..467e321b206 100644
--- a/contrib/pg_stat_statements/sql/wal.sql
+++ b/contrib/pg_stat_statements/sql/wal.sql
@@ -18,3 +18,36 @@ wal_records > 0 as wal_records_generated,
wal_records >= rows as wal_records_ge_rows
FROM pg_stat_statements ORDER BY query COLLATE "C";
SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT pgss_error_func();
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..e28e7543693 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,200 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+ exception_buffers_nested_visible
+----------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_cursor_explain_buffers();
+ noscroll_ok | scroll_ok
+-------------+-----------
+ t | t
+(1 row)
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_parallel_explain_buffers();
+ ratio
+-------
+ 1
+(1 row)
+
+DROP FUNCTION check_parallel_explain_buffers;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..cf5c6335a19 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,197 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_cursor_explain_buffers();
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_parallel_explain_buffers();
+
+DROP FUNCTION check_parallel_explain_buffers;
--
2.47.1
[application/x-patch] v12-0006-instrumentation-Use-Instrumentation-struct-for-p.patch (29.2K, 7-v12-0006-instrumentation-Use-Instrumentation-struct-for-p.patch)
From 80dbf65f79deca08f5e10872cac226d0d8edca0e Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 15 Mar 2026 21:44:58 -0700
Subject: [PATCH v12 6/9] instrumentation: Use Instrumentation struct for
parallel workers
This simplifies the DSM allocations a bit since we don't need to
separately allocate WAL and buffer usage, and allows the easier future
addition of a third stack-based struct being discussed.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/access/brin/brin.c | 43 ++++++-----------
src/backend/access/gin/gininsert.c | 43 ++++++-----------
src/backend/access/nbtree/nbtsort.c | 43 ++++++-----------
src/backend/commands/vacuumparallel.c | 52 ++++++++-------------
src/backend/executor/execParallel.c | 66 ++++++++++++---------------
src/backend/executor/instrument.c | 14 +++---
src/include/executor/execParallel.h | 5 +-
src/include/executor/instrument.h | 4 +-
8 files changed, 99 insertions(+), 171 deletions(-)
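A rough illustration of the layout change described in the commit message above, not code from the patch: each worker gets one slot in a single shared array rather than one slot in each of two parallel arrays. Every Sketch* and sketch_* name below is hypothetical, the structs carry only a few representative counters, and calloc stands in for the shm_toc_allocate call on the DSM segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, cut-down stand-ins for BufferUsage and WalUsage. */
typedef struct SketchBufferUsage
{
	int64_t		shared_blks_hit;
	int64_t		shared_blks_read;
} SketchBufferUsage;

typedef struct SketchWalUsage
{
	int64_t		wal_records;
	uint64_t	wal_bytes;
} SketchWalUsage;

/* One combined struct per worker, instead of two separate arrays. */
typedef struct SketchInstrumentation
{
	SketchBufferUsage bufusage;
	SketchWalUsage walusage;
} SketchInstrumentation;

/*
 * Leader, setup: a single allocation sized by the number of workers.
 * calloc zeroes the slots; the patch notes that the real DSM chunk
 * does not need to be initialized.
 */
static SketchInstrumentation *
sketch_alloc_worker_instr(int nworkers)
{
	assert(nworkers > 0);
	return calloc((size_t) nworkers, sizeof(SketchInstrumentation));
}

/* Worker, on exit: publish local counters into this worker's own slot. */
static void
sketch_end_parallel(const SketchInstrumentation *local,
					SketchInstrumentation *slot)
{
	*slot = *local;
}

/* Leader, after workers finish: fold one worker slot into the totals. */
static void
sketch_accum_parallel(SketchInstrumentation *total,
					  const SketchInstrumentation *slot)
{
	total->bufusage.shared_blks_hit += slot->bufusage.shared_blks_hit;
	total->bufusage.shared_blks_read += slot->bufusage.shared_blks_read;
	total->walusage.wal_records += slot->walusage.wal_records;
	total->walusage.wal_bytes += slot->walusage.wal_bytes;
}
```

With this shape, a further stack-based counter group would only need a new member in the combined struct; the allocation, the worker-side publish, and the leader-side loop over launched workers would keep their signatures.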
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3a5176c76c7..9e545b4ef0e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -51,8 +51,7 @@
#define PARALLEL_KEY_BRIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -148,8 +147,7 @@ typedef struct BrinLeader
BrinShared *brinshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BrinLeader;
/*
@@ -2387,8 +2385,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BrinShared *brinshared;
Sharedsort *sharedsort;
BrinLeader *brinleader = palloc0_object(BrinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -2430,18 +2427,14 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -2514,15 +2507,12 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -2533,8 +2523,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
brinleader->snapshot = snapshot;
- brinleader->walusage = walusage;
- brinleader->bufferusage = bufferusage;
+ brinleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -2573,7 +2562,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2888,8 +2877,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2950,11 +2938,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0d80f72a0b0..f3de62ce7f3 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -45,8 +45,7 @@
#define PARALLEL_KEY_GIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -138,8 +137,7 @@ typedef struct GinLeader
GinBuildShared *ginshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} GinLeader;
typedef struct
@@ -945,8 +943,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
GinBuildShared *ginshared;
Sharedsort *sharedsort;
GinLeader *ginleader = palloc0_object(GinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -987,18 +984,14 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1066,15 +1059,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1085,8 +1075,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
ginleader->ginshared = ginshared;
ginleader->sharedsort = sharedsort;
ginleader->snapshot = snapshot;
- ginleader->walusage = walusage;
- ginleader->bufferusage = bufferusage;
+ ginleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1125,7 +1114,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2119,8 +2108,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2200,11 +2188,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2d7b7cef912..cb238f862a7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -66,8 +66,7 @@
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
#define PARALLEL_KEY_TUPLESORT_SPOOL2 UINT64CONST(0xA000000000000003)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000004)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xA000000000000005)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xA000000000000006)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xA000000000000005)
/*
* DISABLE_LEADER_PARTICIPATION disables the leader's participation in
@@ -195,8 +194,7 @@ typedef struct BTLeader
Sharedsort *sharedsort;
Sharedsort *sharedsort2;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BTLeader;
/*
@@ -1408,8 +1406,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
Sharedsort *sharedsort2;
BTSpool *btspool = buildstate->spool;
BTLeader *btleader = palloc0_object(BTLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -1462,18 +1459,14 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1560,15 +1553,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1580,8 +1570,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btleader->sharedsort = sharedsort;
btleader->sharedsort2 = sharedsort2;
btleader->snapshot = snapshot;
- btleader->walusage = walusage;
- btleader->bufferusage = bufferusage;
+ btleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1620,7 +1609,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1754,8 +1743,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
#ifdef BTREE_BUILD_STATS
@@ -1837,11 +1825,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
sharedsort2, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index c330c891c03..b5fed54fb85 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -47,9 +47,8 @@
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 2
-#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 3
-#define PARALLEL_VACUUM_KEY_WAL_USAGE 4
-#define PARALLEL_VACUUM_KEY_INDEX_STATS 5
+#define PARALLEL_VACUUM_KEY_INSTRUMENTATION 3
+#define PARALLEL_VACUUM_KEY_INDEX_STATS 4
/*
* Shared information among parallel workers. So this is allocated in the DSM
@@ -188,11 +187,8 @@ struct ParallelVacuumState
/* Shared dead items space among parallel vacuum workers */
TidStore *dead_items;
- /* Points to buffer usage area in DSM */
- BufferUsage *buffer_usage;
-
- /* Points to WAL usage area in DSM */
- WalUsage *wal_usage;
+ /* Points to instrumentation area in DSM */
+ Instrumentation *instr;
/*
* False if the index is totally unsuitable target for all parallel
@@ -250,8 +246,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PVShared *shared;
TidStore *dead_items;
PVIndStats *indstats;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *instr;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
@@ -304,18 +299,15 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage and WalUsage --
- * PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
+ * Estimate space for Instrumentation --
+ * PARALLEL_VACUUM_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
@@ -396,17 +388,13 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
pvs->shared = shared;
/*
- * Allocate space for each worker's BufferUsage and WalUsage; no need to
- * initialize
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
*/
- buffer_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, buffer_usage);
- pvs->buffer_usage = buffer_usage;
- wal_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
- pvs->wal_usage = wal_usage;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, instr);
+ pvs->instr = instr;
/* Store query string for workers */
if (debug_query_string)
@@ -749,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->instr[i]);
}
/*
@@ -1007,8 +995,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVShared *shared;
TidStore *dead_items;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *worker_instr;
int nindexes;
char *sharedquery;
ErrorContextCallback errcallback;
@@ -1102,11 +1089,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
parallel_vacuum_process_safe_indexes(&pvs);
/* Report buffer/WAL usage during parallel execution */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 6bcd922eea5..63e9b1f4095 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -60,13 +60,12 @@
#define PARALLEL_KEY_EXECUTOR_FIXED UINT64CONST(0xE000000000000001)
#define PARALLEL_KEY_PLANNEDSTMT UINT64CONST(0xE000000000000002)
#define PARALLEL_KEY_PARAMLISTINFO UINT64CONST(0xE000000000000003)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xE000000000000004)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000004)
#define PARALLEL_KEY_TUPLE_QUEUE UINT64CONST(0xE000000000000005)
-#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000006)
+#define PARALLEL_KEY_NODE_INSTRUMENTATION UINT64CONST(0xE000000000000006)
#define PARALLEL_KEY_DSA UINT64CONST(0xE000000000000007)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -631,8 +630,6 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_data;
char *pstmt_space;
char *paramlistinfo_space;
- BufferUsage *bufusage_space;
- WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
@@ -696,21 +693,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage.
+ * Estimate space for Instrumentation.
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
* looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
-
- /*
- * Same thing for WalUsage.
- */
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Estimate space for tuple queues. */
@@ -796,17 +786,18 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMLISTINFO, paramlistinfo_space);
 SerializeParamList(estate->es_param_list_info, &paramlistinfo_space);
- /* Allocate space for each worker's BufferUsage; no need to initialize. */
- bufusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufusage_space);
- pei->buffer_usage = bufusage_space;
+ /*
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
+ */
+ {
+ Instrumentation *instr;
- /* Same for WalUsage. */
- walusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
- pei->wal_usage = walusage_space;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
+ pei->instrumentation = instr;
+ }
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -832,9 +823,9 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
InstrInitNode(&instrument[i], estate->es_instrument->instrument_options);
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_NODE_INSTRUMENTATION,
instrumentation);
- pei->instrumentation = instrumentation;
+ pei->node_instrumentation = instrumentation;
if (estate->es_jit_flags != PGJIT_NONE)
{
@@ -1236,7 +1227,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
* finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->instrumentation[i]);
pei->finished = true;
}
@@ -1250,10 +1241,10 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
- /* Accumulate instrumentation, if any. */
- if (pei->instrumentation)
+ /* Accumulate node instrumentation, if any. */
+ if (pei->node_instrumentation)
ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
+ pei->node_instrumentation);
/* Accumulate JIT instrumentation, if any. */
if (pei->jit_instrumentation)
@@ -1477,8 +1468,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
@@ -1493,7 +1482,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Set up DestReceiver, SharedExecutorInstrumentation, and QueryDesc. */
receiver = ExecParallelGetReceiver(seg, toc);
- instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
+ instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_NODE_INSTRUMENTATION, true);
if (instrumentation != NULL)
instrument_options = instrumentation->instrument_options;
jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
@@ -1551,11 +1540,12 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecutorFinish(queryDesc);
/* Report buffer/WAL usage during parallel execution. */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ {
+ Instrumentation *worker_instr;
+
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
+ }
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6892706a83a..09d5ffe8651 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -322,11 +322,12 @@ InstrStartParallelQuery(void)
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst)
{
InstrQueryStopFinalize(qinstr);
- memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
- memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+ dst->need_stack = qinstr->instr.need_stack;
+ memcpy(&dst->bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(&dst->walusage, &qinstr->instr.walusage, sizeof(WalUsage));
}
/*
@@ -342,12 +343,11 @@ InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUs
* activity is accumulated.
*/
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(Instrumentation *instr)
{
- BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
- WalUsageAdd(&instr_stack.current->walusage, walusage);
+ InstrAccumStack(instr_stack.current, instr);
- WalUsageAdd(&pgWalUsage, walusage);
+ WalUsageAdd(&pgWalUsage, &instr->walusage);
}
/* Node instrumentation handling */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..6c8b602d07f 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -25,9 +25,8 @@ typedef struct ParallelExecutorInfo
{
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
- BufferUsage *buffer_usage; /* points to bufusage area in DSM */
- WalUsage *wal_usage; /* walusage area in DSM */
- SharedExecutorInstrumentation *instrumentation; /* optional */
+ Instrumentation *instrumentation; /* instrumentation area in DSM */
+ SharedExecutorInstrumentation *node_instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
dsa_pointer param_exec; /* serialized PARAM_EXEC parameters */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index f49c3f99cf2..b30a15bc027 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -283,8 +283,8 @@ extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst);
+extern void InstrAccumParallelQuery(Instrumentation *instr);
extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
--
2.47.1
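For readers following the shared-memory change in the patch above, here is a minimal standalone sketch of the "one Instrumentation slot per worker" layout: each worker copies its final totals into its own slot, and the leader later adds every slot into its current entry. All Toy* names are hypothetical and do not appear in the patch; the real BufferUsage/WalUsage structs carry many more fields, and the real InstrAccumParallelQuery additionally adds the WAL usage to pgWalUsage.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-ins (assumption): the real structs have many more counters. */
typedef struct ToyBufferUsage
{
	int64_t		shared_blks_hit;
	int64_t		shared_blks_read;
} ToyBufferUsage;

typedef struct ToyWalUsage
{
	int64_t		wal_records;
	uint64_t	wal_bytes;
} ToyWalUsage;

typedef struct ToyInstrumentation
{
	ToyBufferUsage bufusage;
	ToyWalUsage walusage;
} ToyInstrumentation;

/* Worker side: publish final totals into this worker's slot of the array. */
static void
toy_end_parallel_query(const ToyInstrumentation *worker_total,
					   ToyInstrumentation *slots, int worker_number)
{
	assert(worker_number >= 0);
	memcpy(&slots[worker_number], worker_total, sizeof(ToyInstrumentation));
}

/* Leader side: add one worker's slot into the leader's current entry. */
static void
toy_accum(ToyInstrumentation *dst, const ToyInstrumentation *src)
{
	dst->bufusage.shared_blks_hit += src->bufusage.shared_blks_hit;
	dst->bufusage.shared_blks_read += src->bufusage.shared_blks_read;
	dst->walusage.wal_records += src->walusage.wal_records;
	dst->walusage.wal_bytes += src->walusage.wal_bytes;
}

/* Leader side: walk all worker slots once the workers have finished. */
static void
toy_accum_all(ToyInstrumentation *leader,
			  const ToyInstrumentation *slots, int nworkers)
{
	for (int i = 0; i < nworkers; i++)
		toy_accum(leader, &slots[i]);
}
```

The point of the single array is that one shm_toc key and one allocation replace the separate buffer-usage and WAL-usage arrays, so both sides index the same slot by worker number.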
[application/x-patch] v12-0009-Add-test_session_buffer_usage-test-module.patch (30.0K, 8-v12-0009-Add-test_session_buffer_usage-test-module.patch)
From 39690c12171c2db0d10e4d015eb7dc7801c262fe Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:41 -0800
Subject: [PATCH v12 9/9] Add test_session_buffer_usage test module
This is intended for testing instrumentation related logic as it pertains
to the top level stack that is maintained as a running total. There is
currently no in-core user that utilizes the top-level values in this
manner, and especially during abort situations this helps ensure we don't
lose activity due to incorrect handling of unfinalized node stacks.
---
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
.../test_session_buffer_usage/Makefile | 23 ++
.../expected/test_session_buffer_usage.out | 342 ++++++++++++++++++
.../test_session_buffer_usage/meson.build | 33 ++
.../sql/test_session_buffer_usage.sql | 245 +++++++++++++
.../test_session_buffer_usage--1.0.sql | 31 ++
.../test_session_buffer_usage.c | 95 +++++
.../test_session_buffer_usage.control | 5 +
9 files changed, 776 insertions(+)
create mode 100644 src/test/modules/test_session_buffer_usage/Makefile
create mode 100644 src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
create mode 100644 src/test/modules/test_session_buffer_usage/meson.build
create mode 100644 src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
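To illustrate the property the abort tests in this module check, here is a toy model of a usage stack in which entries still open at abort time are folded into the top-level running total instead of being dropped. This is only a sketch under assumptions: the Toy* names are hypothetical, a single counter stands in for BufferUsage, and adding a child's counts to its parent on pop is my simplification rather than a description of how the patch finalizes entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One stack entry; a single counter stands in for the full BufferUsage. */
typedef struct ToyEntry
{
	int64_t		blks_hit;
	struct ToyEntry *parent;
} ToyEntry;

/* Top-level entry that keeps the session-wide running total. */
static ToyEntry toy_top = {0, NULL};
static ToyEntry *toy_current = &toy_top;

/* Enter a node: make a fresh entry the current accumulation target. */
static void
toy_push(ToyEntry *e)
{
	e->blks_hit = 0;
	e->parent = toy_current;
	toy_current = e;
}

/* Buffer activity is charged to whichever entry is current. */
static void
toy_count_hit(void)
{
	toy_current->blks_hit++;
}

/* Leave a node: fold its counts into the parent and restore the parent. */
static void
toy_pop(void)
{
	ToyEntry   *e = toy_current;

	assert(e != &toy_top);
	e->parent->blks_hit += e->blks_hit;
	toy_current = e->parent;
}

/* Abort path: unwind every open entry so no activity is lost. */
static void
toy_abort(void)
{
	while (toy_current != &toy_top)
		toy_pop();
}
```

If the abort path simply reset toy_current to the top entry without unwinding, the counts held in the open entries would never reach the running total, which is the kind of loss the tests are meant to catch.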
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 864b407abcf..c5ace162fe2 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -48,6 +48,7 @@ SUBDIRS = \
test_resowner \
test_rls_hooks \
test_saslprep \
+ test_session_buffer_usage \
test_shm_mq \
test_slru \
test_tidstore \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index e5acacd5083..802cc93d71a 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -49,6 +49,7 @@ subdir('test_regex')
subdir('test_resowner')
subdir('test_rls_hooks')
subdir('test_saslprep')
+subdir('test_session_buffer_usage')
subdir('test_shm_mq')
subdir('test_slru')
subdir('test_tidstore')
diff --git a/src/test/modules/test_session_buffer_usage/Makefile b/src/test/modules/test_session_buffer_usage/Makefile
new file mode 100644
index 00000000000..1252b222cb9
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_session_buffer_usage/Makefile
+
+MODULE_big = test_session_buffer_usage
+OBJS = \
+ $(WIN32RES) \
+ test_session_buffer_usage.o
+
+EXTENSION = test_session_buffer_usage
+DATA = test_session_buffer_usage--1.0.sql
+PGFILEDESC = "test_session_buffer_usage - show buffer usage statistics for the current session"
+
+REGRESS = test_session_buffer_usage
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_session_buffer_usage
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out b/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
new file mode 100644
index 00000000000..5f7d349871a
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
@@ -0,0 +1,342 @@
+LOAD 'test_session_buffer_usage';
+CREATE EXTENSION test_session_buffer_usage;
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM test_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+ ok
+----
+ t
+(1 row)
+
+-- Verify counters increase after buffer activity
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM test_session_buffer_usage();
+ blocks_increased
+------------------
+ t
+(1 row)
+
+DROP TABLE test_buf_activity;
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM test_session_buffer_usage(), dc_serial_result s;
+ leader_buffers_match
+----------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_dc_tab, dc_serial_result;
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+ERROR: division by zero
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM test_session_buffer_usage();
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+DROP TABLE exc_tab;
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+ERROR: duplicate key value violates unique constraint "unique_tab_a_key"
+DETAIL: Key (a)=(1) already exists.
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM test_session_buffer_usage();
+ unique_violation_buffers_visible
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE unique_tab;
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT subxact_exc_func();
+ subxact_exc_func
+------------------
+ caught
+(1 row)
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM test_session_buffer_usage();
+ subxact_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT cursor_exc_func();
+ cursor_exc_func
+-----------------------
+ caught after 250 rows
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM test_session_buffer_usage();
+ cursor_subxact_buffers_visible
+--------------------------------
+ t
+(1 row)
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM test_session_buffer_usage();
+-- Now trigger the same scan via a trigger that errors
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+ERROR: trigger error
+CONTEXT: PL/pgSQL function trig_err_func() line 4 at RAISE
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM test_session_buffer_usage(), trig_serial_result s;
+ trigger_abort_buffers_propagated
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CONTEXT: parallel worker
+RESET debug_parallel_query;
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM test_session_buffer_usage(), par_abort_serial_result s;
+ worker_abort_buffers_not_propagated
+-------------------------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_abort_tab, par_abort_serial_result;
+-- Cleanup
+DROP EXTENSION test_session_buffer_usage;
diff --git a/src/test/modules/test_session_buffer_usage/meson.build b/src/test/modules/test_session_buffer_usage/meson.build
new file mode 100644
index 00000000000..b96f67dc7fe
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+test_session_buffer_usage_sources = files(
+ 'test_session_buffer_usage.c',
+)
+
+if host_system == 'windows'
+ test_session_buffer_usage_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_session_buffer_usage',
+ '--FILEDESC', 'test_session_buffer_usage - show buffer usage statistics for the current session',])
+endif
+
+test_session_buffer_usage = shared_module('test_session_buffer_usage',
+ test_session_buffer_usage_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_session_buffer_usage
+
+test_install_data += files(
+ 'test_session_buffer_usage.control',
+ 'test_session_buffer_usage--1.0.sql',
+)
+
+tests += {
+ 'name': 'test_session_buffer_usage',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_session_buffer_usage',
+ ],
+ },
+}
diff --git a/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql b/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
new file mode 100644
index 00000000000..daf2159c4a6
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
@@ -0,0 +1,245 @@
+LOAD 'test_session_buffer_usage';
+CREATE EXTENSION test_session_buffer_usage;
+
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM test_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+
+-- Verify counters increase after buffer activity
+SELECT test_session_buffer_usage_reset();
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM test_session_buffer_usage();
+
+DROP TABLE test_buf_activity;
+
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+SELECT count(*) FROM par_dc_tab;
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM test_session_buffer_usage(), dc_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_dc_tab, dc_serial_result;
+
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+
+SELECT test_session_buffer_usage_reset();
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP TABLE exc_tab;
+
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+
+SELECT test_session_buffer_usage_reset();
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP TABLE unique_tab;
+
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT test_session_buffer_usage_reset();
+SELECT subxact_exc_func();
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT test_session_buffer_usage_reset();
+SELECT cursor_exc_func();
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM trig_work_tab;
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM test_session_buffer_usage();
+
+-- Now trigger the same scan via a trigger that errors
+SELECT test_session_buffer_usage_reset();
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM test_session_buffer_usage(), trig_serial_result s;
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT test_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+
+SELECT test_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+RESET debug_parallel_query;
+
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM test_session_buffer_usage(), par_abort_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_abort_tab, par_abort_serial_result;
+
+-- Cleanup
+DROP EXTENSION test_session_buffer_usage;
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
new file mode 100644
index 00000000000..e9833be470a
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
@@ -0,0 +1,31 @@
+/* src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_session_buffer_usage" to load this file. \quit
+
+CREATE FUNCTION test_session_buffer_usage(
+ OUT shared_blks_hit bigint,
+ OUT shared_blks_read bigint,
+ OUT shared_blks_dirtied bigint,
+ OUT shared_blks_written bigint,
+ OUT local_blks_hit bigint,
+ OUT local_blks_read bigint,
+ OUT local_blks_dirtied bigint,
+ OUT local_blks_written bigint,
+ OUT temp_blks_read bigint,
+ OUT temp_blks_written bigint,
+ OUT shared_blk_read_time double precision,
+ OUT shared_blk_write_time double precision,
+ OUT local_blk_read_time double precision,
+ OUT local_blk_write_time double precision,
+ OUT temp_blk_read_time double precision,
+ OUT temp_blk_write_time double precision
+)
+RETURNS record
+AS 'MODULE_PATHNAME', 'test_session_buffer_usage'
+LANGUAGE C PARALLEL RESTRICTED;
+
+CREATE FUNCTION test_session_buffer_usage_reset()
+RETURNS void
+AS 'MODULE_PATHNAME', 'test_session_buffer_usage_reset'
+LANGUAGE C PARALLEL RESTRICTED;
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
new file mode 100644
index 00000000000..50eb1a2ffe6
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_session_buffer_usage.c
+ * show buffer usage statistics for the current session
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/instrument.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "test_session_buffer_usage",
+ .version = PG_VERSION
+);
+
+#define NUM_BUFFER_USAGE_COLUMNS 16
+
+PG_FUNCTION_INFO_V1(test_session_buffer_usage);
+PG_FUNCTION_INFO_V1(test_session_buffer_usage_reset);
+
+#define HAVE_INSTR_STACK 1 /* Change to 0 when testing before stack
+ * change */
+
+/*
+ * SQL function: test_session_buffer_usage()
+ *
+ * Returns a single row with all BufferUsage counters accumulated since the
+ * start of the session. Excludes any usage not yet added to the top of the
+ * stack (e.g. if this gets called inside a statement that also had buffer
+ * activity).
+ */
+Datum
+test_session_buffer_usage(PG_FUNCTION_ARGS)
+{
+ TupleDesc tupdesc;
+ Datum values[NUM_BUFFER_USAGE_COLUMNS];
+ bool nulls[NUM_BUFFER_USAGE_COLUMNS];
+ BufferUsage *usage;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ memset(nulls, 0, sizeof(nulls));
+
+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif
+
+ values[0] = Int64GetDatum(usage->shared_blks_hit);
+ values[1] = Int64GetDatum(usage->shared_blks_read);
+ values[2] = Int64GetDatum(usage->shared_blks_dirtied);
+ values[3] = Int64GetDatum(usage->shared_blks_written);
+ values[4] = Int64GetDatum(usage->local_blks_hit);
+ values[5] = Int64GetDatum(usage->local_blks_read);
+ values[6] = Int64GetDatum(usage->local_blks_dirtied);
+ values[7] = Int64GetDatum(usage->local_blks_written);
+ values[8] = Int64GetDatum(usage->temp_blks_read);
+ values[9] = Int64GetDatum(usage->temp_blks_written);
+ values[10] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time));
+ values[11] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time));
+ values[12] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time));
+ values[13] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time));
+ values[14] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+ values[15] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * SQL function: test_session_buffer_usage_reset()
+ *
+ * Resets all BufferUsage counters on the top instrumentation stack to zero.
+ * Useful in tests to avoid the baseline/delta pattern.
+ */
+Datum
+test_session_buffer_usage_reset(PG_FUNCTION_ARGS)
+{
+#if HAVE_INSTR_STACK
+ memset(&instr_top.bufusage, 0, sizeof(BufferUsage));
+#else
+ memset(&pgBufferUsage, 0, sizeof(BufferUsage));
+#endif
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
new file mode 100644
index 00000000000..41cfb15a765
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
@@ -0,0 +1,5 @@
+# test_session_buffer_usage extension
+comment = 'show buffer usage statistics for the current session'
+default_version = '1.0'
+module_pathname = '$libdir/test_session_buffer_usage'
+relocatable = true
--
2.47.1
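
For readers following along: the comment in test_session_buffer_usage() notes that usage "not yet added to the top of the stack" is excluded. Below is a toy model of that behaviour, assuming (as the later patches' comments suggest) that popping an entry only restores the previous one and that accumulation into the parent happens in a separate finalize step. All names here (ToyInstr, toy_push, toy_pop, toy_finalize, toy_demo) are hypothetical stand-ins and not the patch's actual API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-ins for BufferUsage/Instrumentation. */
typedef struct ToyUsage
{
	int64_t		shared_blks_hit;
} ToyUsage;

typedef struct ToyInstr
{
	ToyUsage	bufusage;
} ToyInstr;

/* Session-level entry, and a pointer to whichever entry is active. */
static ToyInstr toy_top;
static ToyInstr *toy_current = &toy_top;

/* Make "entry" the active one; return the previous entry for the later pop. */
static ToyInstr *
toy_push(ToyInstr *entry)
{
	ToyInstr   *prev = toy_current;

	toy_current = entry;
	return prev;
}

/* Popping only restores the previous entry; nothing is accumulated here. */
static void
toy_pop(ToyInstr *prev)
{
	toy_current = prev;
}

/* A buffer hit is charged to whichever entry is currently active. */
static void
toy_count_hit(void)
{
	toy_current->bufusage.shared_blks_hit++;
}

/* Accumulation into the parent happens in a separate step. */
static void
toy_finalize(ToyInstr *child, ToyInstr *parent)
{
	parent->bufusage.shared_blks_hit += child->bufusage.shared_blks_hit;
}

/*
 * Charge three hits to a node entry, then report the top-level counter
 * before (via the out parameter) and after (return value) finalization.
 */
static int64_t
toy_demo(int64_t *top_before_finalize)
{
	ToyInstr	node = {{0}};
	ToyInstr   *prev = toy_push(&node);

	toy_count_hit();
	toy_count_hit();
	toy_count_hit();
	toy_pop(prev);

	*top_before_finalize = toy_top.bufusage.shared_blks_hit;
	toy_finalize(&node, &toy_top);
	return toy_top.bufusage.shared_blks_hit;
}
```

In this model a reader of toy_top sees none of the node's activity until toy_finalize() runs, which is the effect the test module's comment warns about for calls made inside a statement that had its own buffer activity.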
[application/x-patch] v12-0007-instrumentation-Optimize-ExecProcNodeInstr-instr.patch (11.2K, 9-v12-0007-instrumentation-Optimize-ExecProcNodeInstr-instr.patch)
From 16e44d5508f91dd23da780901f3ec0126965628d Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 17:52:24 -0800
Subject: [PATCH v12 7/9] instrumentation: Optimize ExecProcNodeInstr
instructions by inlining
For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.
Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). These
conditionals, checked for each tuple emitted, add up and cause the
compiler to generate a non-optimal set of instructions.
Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. This reduces
the overhead of EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
test with a large COUNT(*) that does many ExecProcNode calls from ~ 20% on
top of actual runtime to ~ 3%. With BUFFERS ON, the same query goes
from ~ 20% to ~ 10% on top of actual runtime.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxFP7i7-wy98ZmEJ11edYq-RrPvJoa4kzGhBBjERA4Nyw%40mail.gmail.com#e8dfd018a07d7f8d41565a079d40c564
---
src/backend/executor/execProcnode.c | 22 +---
src/backend/executor/instrument.c | 198 ++++++++++++++++++++--------
src/include/executor/instrument.h | 5 +
3 files changed, 148 insertions(+), 77 deletions(-)
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 21ad1b04a57..9f5698063f0 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,7 +121,6 @@
#include "nodes/nodeFuncs.h"
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
-static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
@@ -465,7 +464,7 @@ ExecProcNodeFirst(PlanState *node)
* have ExecProcNode() directly call the relevant function from now on.
*/
if (node->instrument)
- node->ExecProcNode = ExecProcNodeInstr;
+ node->ExecProcNode = InstrNodeSetupExecProcNode(node->instrument);
else
node->ExecProcNode = node->ExecProcNodeReal;
@@ -473,25 +472,6 @@ ExecProcNodeFirst(PlanState *node)
}
-/*
- * ExecProcNode wrapper that performs instrumentation calls. By keeping
- * this a separate function, we avoid overhead in the normal case where
- * no instrumentation is wanted.
- */
-static TupleTableSlot *
-ExecProcNodeInstr(PlanState *node)
-{
- TupleTableSlot *result;
-
- InstrStartNode(node->instrument);
-
- result = node->ExecProcNodeReal(node);
-
- InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
- return result;
-}
-
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 09d5ffe8651..4ea807e295f 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -59,29 +59,20 @@ InstrInitOptions(Instrumentation *instr, int instrument_options)
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-void
-InstrStart(Instrumentation *instr)
+static inline void
+InstrStartTimer(Instrumentation *instr)
{
- if (instr->need_timer)
- {
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStart called twice in a row");
- else
- INSTR_TIME_SET_CURRENT(instr->starttime);
- }
+ Assert(INSTR_TIME_IS_ZERO(instr->starttime));
- if (instr->need_stack)
- InstrPushStack(instr);
+ INSTR_TIME_SET_CURRENT(instr->starttime);
}
-static void
+static inline void
InstrStopTimer(Instrumentation *instr)
{
instr_time endtime;
- /* let's update the time only if the timer was requested */
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ Assert(!INSTR_TIME_IS_ZERO(instr->starttime));
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
@@ -89,6 +80,16 @@ InstrStopTimer(Instrumentation *instr)
INSTR_TIME_SET_ZERO(instr->starttime);
}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer)
+ InstrStartTimer(instr);
+
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
void
InstrStop(Instrumentation *instr)
{
@@ -372,65 +373,57 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options)
InstrInitOptions(&instr->instr, instrument_options);
}
-/* Entry to a plan node */
+/* Entry to a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
void
InstrStartNode(NodeInstrumentation *instr)
{
InstrStart(&instr->instr);
}
-
-/* Exit from a plan node */
-void
-InstrStopNode(NodeInstrumentation *instr, double nTuples)
+/*
+ * Updates the node instrumentation time counter.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+static inline void
+InstrStopNodeTimer(NodeInstrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
+ Assert(!INSTR_TIME_IS_ZERO(instr->instr.starttime));
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
/*
- * Update the time only if the timer was requested.
+ * Is this the first tuple of this cycle?
*
- * Note this is different from InstrStop because total is only updated in
- * InstrEndLoop. We need the separate counter variable because we need to
- * calculate start-up time for the first tuple in each cycle, and then
- * accumulate it together.
+ * In async mode, if the plan node hadn't emitted any tuples before, this
+ * might be the first tuple
*/
- if (instr->instr.need_timer)
- {
- if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrStopNode called without start");
-
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ if (!instr->running || (instr->async_mode && instr->tuplecount < 1.0))
+ instr->firsttuple = instr->counter;
+}
- INSTR_TIME_SET_ZERO(instr->instr.starttime);
- }
+/* Exit from a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ if (instr->instr.need_timer)
+ InstrStopNodeTimer(instr);
- /*
- * Only pop the stack, accumulation runs in
- * ExecFinalizeNodeInstrumentation
- */
+ /* Only pop the stack, accumulation runs in InstrFinalizeNode */
if (instr->instr.need_stack)
InstrPopStack(&instr->instr);
- /* Is this the first tuple of this cycle? */
- if (!instr->running)
- {
- instr->running = true;
- instr->firsttuple = instr->counter;
- }
- else
- {
- /*
- * In async mode, if the plan node hadn't emitted any tuples before,
- * this might be the first tuple
- */
- if (instr->async_mode && save_tuplecount < 1.0)
- instr->firsttuple = instr->counter;
- }
+ instr->running = true;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
}
/* Update tuple count */
@@ -495,6 +488,99 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
InstrAccumStack(&dst->instr, &add->instr);
}
+/*
+ * Specialized handling of instrumented ExecProcNode
+ *
+ * These functions are equivalent to running ExecProcNodeReal wrapped in
+ * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
+ * by checking the instrumentation options when the ExecProcNode pointer gets
+ * first set, and then using a special-purpose function for each. This results
+ * in a more optimized set of compiled instructions.
+ */
+
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
+
+/* Simplified pop: restore saved state instead of re-deriving from array */
+static inline void
+InstrPopStackTo(Instrumentation *prev)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.stack_size > 1 ? instr_stack.entries[instr_stack.stack_size - 2] == prev : &instr_top == prev);
+ instr_stack.stack_size--;
+ instr_stack.current = prev;
+}
+
+static inline TupleTableSlot *
+ExecProcNodeInstr(PlanState *node, bool need_timer, bool need_stack)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ if (need_stack)
+ InstrPushStack(&instr->instr);
+ if (need_timer)
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ if (need_timer)
+ InstrStopNodeTimer(instr);
+ if (need_stack)
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrFull(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsStackOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsTimerOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, false);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, false);
+}
+
+/*
+ * Returns an ExecProcNode wrapper that performs instrumentation calls,
+ * tailored to the instrumentation options enabled for the node.
+ */
+ExecProcNodeMtd
+InstrNodeSetupExecProcNode(NodeInstrumentation *instr)
+{
+ bool need_timer = instr->instr.need_timer;
+ bool need_stack = instr->instr.need_stack;
+
+ if (need_timer && need_stack)
+ return ExecProcNodeInstrFull;
+ else if (need_stack)
+ return ExecProcNodeInstrRowsStackOnly;
+ else if (need_timer)
+ return ExecProcNodeInstrRowsTimerOnly;
+ else
+ return ExecProcNodeInstrRowsOnly;
+}
+
/* Trigger instrumentation handling */
TriggerInstrumentation *
InstrAllocTrigger(QueryInstrumentation *qinstr, int n)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b30a15bc027..cad052a3a90 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -294,6 +294,11 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
+typedef struct TupleTableSlot TupleTableSlot;
+typedef struct PlanState PlanState;
+typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+extern ExecProcNodeMtd InstrNodeSetupExecProcNode(NodeInstrumentation *instr);
+
extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr, int n);
extern void InstrStartTrigger(QueryInstrumentation *qinstr,
TriggerInstrumentation *tginstr);
--
2.47.1
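
As a reading aid for the patch above, here is a minimal standalone sketch of the specialization pattern it describes: one inline worker takes the options as constant arguments, thin wrappers pin those arguments, and a setup function picks the wrapper once so the per-tuple path carries no option checks. The names (ToyNode, toy_proc_instr, toy_setup_proc, toy_emit_one) are hypothetical, and the counters merely stand in for the real timer and stack work; whether a given compiler actually folds the branches away is an assumption that depends on inlining.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct ToyNode ToyNode;
typedef int (*ToyProcFn) (ToyNode *node);

/* Hypothetical stand-in for PlanState plus its instrumentation. */
struct ToyNode
{
	bool		need_timer;
	bool		need_stack;
	int			timer_calls;	/* stands in for start/stop timer work */
	int			stack_calls;	/* stands in for push/pop stack work */
	int			tuples;
	ToyProcFn	proc_real;		/* the uninstrumented node function */
	ToyProcFn	proc;			/* what callers actually invoke */
};

/*
 * Generic worker.  Each wrapper below passes literal true/false, so once
 * this is inlined the compiler can drop the branches that cannot be taken.
 */
static inline int
toy_proc_instr(ToyNode *node, bool need_timer, bool need_stack)
{
	int			result;

	if (need_stack)
		node->stack_calls++;
	if (need_timer)
		node->timer_calls++;

	result = node->proc_real(node);

	if (result)
		node->tuples++;
	return result;
}

static int
toy_proc_full(ToyNode *node)
{
	return toy_proc_instr(node, true, true);
}

static int
toy_proc_stack_only(ToyNode *node)
{
	return toy_proc_instr(node, false, true);
}

static int
toy_proc_timer_only(ToyNode *node)
{
	return toy_proc_instr(node, true, false);
}

static int
toy_proc_rows_only(ToyNode *node)
{
	return toy_proc_instr(node, false, false);
}

/* Inspect the options once and return the matching specialized wrapper. */
static ToyProcFn
toy_setup_proc(const ToyNode *node)
{
	if (node->need_timer && node->need_stack)
		return toy_proc_full;
	else if (node->need_stack)
		return toy_proc_stack_only;
	else if (node->need_timer)
		return toy_proc_timer_only;
	else
		return toy_proc_rows_only;
}

/* Trivial "real" node function that always produces one tuple. */
static int
toy_emit_one(ToyNode *node)
{
	(void) node;
	return 1;
}
```

A caller would assign node.proc = toy_setup_proc(&node) once and then call node.proc(&node) per tuple, mirroring how ExecProcNodeFirst installs the pointer returned by InstrNodeSetupExecProcNode.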
[application/x-patch] v12-0008-Index-scans-Show-table-buffer-accesses-separatel.patch (22.2K, 10-v12-0008-Index-scans-Show-table-buffer-accesses-separatel.patch)
From e0abf9505223b00dc361c915676987436d177599 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:19 -0800
Subject: [PATCH v12 8/9] Index scans: Show table buffer accesses separately in
EXPLAIN ANALYZE
This sets up a separate instrumentation stack that is used whilst an
Index Scan or Index Only Scan accesses the table, for example because
additional data is needed.
EXPLAIN ANALYZE will now show "Table Buffers" that represent such activity.
The activity is also included in regular "Buffers" together with index
activity and that of any child nodes.
Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
doc/src/sgml/perform.sgml | 13 ++-
doc/src/sgml/ref/explain.sgml | 1 +
src/backend/commands/explain.c | 47 +++++++--
src/backend/executor/execProcnode.c | 53 ++++++++++
src/backend/executor/nodeBitmapIndexscan.c | 2 +-
src/backend/executor/nodeIndexonlyscan.c | 27 ++++-
src/backend/executor/nodeIndexscan.c | 113 ++++++++++++++++-----
src/include/executor/instrument_node.h | 5 +
8 files changed, 223 insertions(+), 38 deletions(-)
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 604e8578a8d..d28f4f22535 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -734,6 +734,7 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10)
Index Cond: (unique2 = t1.unique2)
Index Searches: 10
+ Table Buffers: shared hit=10
Buffers: shared hit=24 read=6
Planning:
Buffers: shared hit=15 dirtied=9
@@ -1005,7 +1006,8 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
Index Cond: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Index Recheck: 1
Index Searches: 1
- Buffers: shared hit=1
+ Table Buffers: shared hit=1
+ Buffers: shared hit=2
Planning Time: 0.039 ms
Execution Time: 0.098 ms
</screen>
@@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
then rejected by a recheck of the index condition. This happens because a
GiST index is <quote>lossy</quote> for polygon containment tests: it actually
returns the rows with polygons that overlap the target, and then we have
- to do the exact containment test on those rows.
+ to do the exact containment test on those rows. The <literal>Table Buffers</literal>
+ counts indicate how many operations were performed on the table instead of
+ the index. This number is included in the <literal>Buffers</literal> counts.
</para>
<para>
@@ -1203,13 +1207,14 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
QUERY PLAN
-------------------------------------------------------------------&zwsp;------------------------------------------------------------
Limit (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2.00 loops=1)
- Buffers: shared hit=16
+ Buffers: shared hit=14
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2.00 loops=1)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Index Searches: 1
- Buffers: shared hit=16
+ Table Buffers: shared hit=11
+ Buffers: shared hit=14
Planning Time: 0.077 ms
Execution Time: 0.086 ms
</screen>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5b8b521802e..71070736acb 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -509,6 +509,7 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
-> Index Scan using test_pkey on test (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99.00 loops=1)
Index Cond: ((id > 100) AND (id < 200))
Index Searches: 1
+ Table Buffers: shared hit=1
Buffers: shared hit=4
Planning Time: 0.244 ms
Execution Time: 0.073 ms
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index dc5e63955bc..eef343a9d97 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -144,7 +144,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -610,7 +610,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
- show_buffer_usage(es, bufusage);
+ show_buffer_usage(es, bufusage, NULL);
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -1027,7 +1027,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
es->indent--;
}
}
@@ -1041,7 +1041,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -1969,6 +1969,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexScanState *) planstate)->iss_Instrument->table_instr.bufusage, "Table");
break;
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1986,6 +1989,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyFloat("Heap Fetches", NULL,
planstate->instrument->ntuples2, 0, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexOnlyScanState *) planstate)->ioss_Instrument->table_instr.bufusage, "Table");
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
@@ -2287,7 +2293,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->instr.bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage, NULL);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
@@ -2306,7 +2312,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->instr.bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage, NULL);
if (es->wal)
show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
@@ -4106,7 +4112,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
* Show buffer usage details. This better be sync with peek_buffer_usage.
*/
static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
{
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -4131,6 +4137,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "Buffers:");
if (has_shared)
@@ -4186,6 +4194,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared_timing || has_local_timing || has_temp_timing)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "I/O Timings:");
if (has_shared_timing)
@@ -4227,6 +4237,14 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
else
{
+ char *buffers_title = NULL;
+
+ if (title)
+ {
+ buffers_title = psprintf("%s Buffers", title);
+ ExplainOpenGroup(buffers_title, buffers_title, true, es);
+ }
+
ExplainPropertyInteger("Shared Hit Blocks", NULL,
usage->shared_blks_hit, es);
ExplainPropertyInteger("Shared Read Blocks", NULL,
@@ -4247,8 +4265,20 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
usage->temp_blks_read, es);
ExplainPropertyInteger("Temp Written Blocks", NULL,
usage->temp_blks_written, es);
+
+ if (buffers_title)
+ ExplainCloseGroup(buffers_title, buffers_title, true, es);
+
if (track_io_timing)
{
+ char *timings_title = NULL;
+
+ if (title)
+ {
+ timings_title = psprintf("%s I/O Timings", title);
+ ExplainOpenGroup(timings_title, timings_title, true, es);
+ }
+
ExplainPropertyFloat("Shared I/O Read Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
3, es);
@@ -4267,6 +4297,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
ExplainPropertyFloat("Temp I/O Write Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
3, es);
+
+ if (timings_title)
+ ExplainCloseGroup(timings_title, timings_title, true, es);
}
}
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 9f5698063f0..71a897f2b84 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -418,6 +418,29 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
result->instrument = InstrAllocNode(estate->es_instrument,
result->async_capable);
+ /*
+ * IndexScan / IndexOnlyScan track table and index access separately.
+ *
+ * We intentionally don't collect timing for them (even if enabled), since
+ * we don't need it, and executor nodes call InstrPushStack /
+ * InstrPopStack (instead of the full InstrNode*) to reduce overhead.
+ */
+ if (estate->es_instrument && (estate->es_instrument->instrument_options & INSTRUMENT_BUFFERS) != 0)
+ {
+ if (IsA(result, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, result);
+
+ InstrInitOptions(&iss->iss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ }
+ else if (IsA(result, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, result);
+
+ InstrInitOptions(&ioss->ioss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ }
+ }
+
return result;
}
@@ -837,8 +860,24 @@ ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
return false;
if (node->instrument)
+ {
InstrQueryRememberChild(parent, &node->instrument->instr);
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ InstrQueryRememberChild(parent, &iss->iss_Instrument->table_instr);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ InstrQueryRememberChild(parent, &ioss->ioss_Instrument->table_instr);
+ }
+ }
+
return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
}
@@ -880,6 +919,20 @@ ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
if (!node->instrument)
return false;
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ InstrFinalizeChild(&iss->iss_Instrument->table_instr, &node->instrument->instr);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ InstrFinalizeChild(&ioss->ioss_Instrument->table_instr, &node->instrument->instr);
+ }
+
InstrFinalizeChild(&node->instrument->instr, parent);
return false;
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 70c55ee6d61..63e24a0bcd4 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -276,7 +276,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of bitmap index scans if requested */
if (estate->es_instrument)
- indexstate->biss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->biss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index de6154fd541..0df828e3ed7 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -67,6 +67,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -83,6 +84,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->ioss_Instrument && node->ioss_Instrument->table_instr.need_stack)
+ table_instr = &node->ioss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -165,11 +169,22 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
&node->ioss_VMBuffer))
{
+ bool found;
+
/*
* Rats, we have to visit the heap to check visibility.
*/
InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ found = index_fetch_heap(scandesc, node->ioss_TableSlot);
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (!found)
continue; /* no visible tuple, try next index entry */
ExecClearTuple(node->ioss_TableSlot);
@@ -436,6 +451,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->ioss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->ioss_Instrument->table_instr);
}
/*
@@ -610,7 +626,7 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
/* Set up instrumentation of index-only scans if requested */
if (estate->es_instrument)
- indexstate->ioss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->ioss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -899,4 +915,11 @@ ExecIndexOnlyScanRetrieveInstrumentation(IndexOnlyScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->ioss_SharedInfo = palloc(size);
memcpy(node->ioss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->ioss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->ioss_Instrument->table_instr,
+ &node->ioss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 1620d146071..d32a59fb605 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,7 +85,10 @@ IndexNext(IndexScanState *node)
ExprContext *econtext;
ScanDirection direction;
IndexScanDesc scandesc;
+ ItemPointer tid;
TupleTableSlot *slot;
+ bool found;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -102,6 +105,9 @@ IndexNext(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -132,8 +138,24 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (unlikely(!found))
+ continue;
+
CHECK_FOR_INTERRUPTS();
/*
@@ -181,6 +203,7 @@ IndexNextWithReorder(IndexScanState *node)
Datum *lastfetched_vals;
bool *lastfetched_nulls;
int cmp;
+ Instrumentation *table_instr = NULL;
estate = node->ss.ps.state;
@@ -200,6 +223,9 @@ IndexNextWithReorder(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -263,36 +289,67 @@ IndexNextWithReorder(IndexScanState *node)
}
/*
- * Fetch next tuple from the index.
+ * Fetch next valid tuple from the index.
*/
-next_indextuple:
- if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+ for (;;)
{
+ ItemPointer tid;
+ bool found;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scandesc, ForwardScanDirection);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ /*
+ * No more tuples from the index. But we still need to drain
+ * any remaining tuples from the queue before we're done.
+ */
+ node->iss_ReachedEnd = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scandesc->xs_heaptid));
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
/*
- * No more tuples from the index. But we still need to drain any
- * remaining tuples from the queue before we're done.
+ * If the index was lossy, we have to recheck the index quals and
+ * ORDER BY expressions using the fetched tuple.
*/
- node->iss_ReachedEnd = true;
- continue;
- }
-
- /*
- * If the index was lossy, we have to recheck the index quals and
- * ORDER BY expressions using the fetched tuple.
- */
- if (scandesc->xs_recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->indexqualorig, econtext))
+ if (found && scandesc->xs_recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- /* allow this loop to be cancellable */
- CHECK_FOR_INTERRUPTS();
- goto next_indextuple;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->indexqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ /* allow this loop to be cancellable */
+ CHECK_FOR_INTERRUPTS();
+ continue;
+ }
}
+
+ if (found)
+ break;
}
+ /* No more index entries, re-run to clear the reorder queue */
+ if (node->iss_ReachedEnd)
+ continue;
+
if (scandesc->xs_recheckorderby)
{
econtext->ecxt_scantuple = slot;
@@ -818,6 +875,7 @@ ExecEndIndexScan(IndexScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->iss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->iss_Instrument->table_instr);
}
/*
@@ -980,7 +1038,7 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of index scans if requested */
if (estate->es_instrument)
- indexstate->iss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->iss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -1834,4 +1892,11 @@ ExecIndexScanRetrieveInstrumentation(IndexScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->iss_SharedInfo = palloc(size);
memcpy(node->iss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->iss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->iss_Instrument->table_instr,
+ &node->iss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 2a0ff377a73..e2315cef384 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -18,6 +18,8 @@
#ifndef INSTRUMENT_NODE_H
#define INSTRUMENT_NODE_H
+#include "executor/instrument.h"
+
/* ---------------------
* Instrumentation information for aggregate function execution
@@ -48,6 +50,9 @@ typedef struct IndexScanInstrumentation
{
/* Index search count (incremented with pgstat_count_index_scan call) */
uint64 nsearches;
+
+ /* Instrumentation utilized for tracking buffer usage during table access */
+ Instrumentation table_instr;
} IndexScanInstrumentation;
/*
--
2.47.1
* Re: Stack-based tracking of per-node WAL/buffer usage
2025-08-31 23:57 Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-09-04 20:23 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-09-09 19:35 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 11:28 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 12:59 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-10-31 07:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-08 04:27 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-09 21:55 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-09 23:45 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-10 08:12 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-14 20:49 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-17 06:21 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-17 08:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-18 20:49 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-18 23:36 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-19 00:45 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-23 14:41 ` Re: Stack-based tracking of per-node WAL/buffer usage Heikki Linnakangas <[email protected]>
2026-03-23 19:07 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-23 20:03 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-24 06:03 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-25 10:47 ` Re: Stack-based tracking of per-node WAL/buffer usage Heikki Linnakangas <[email protected]>
2026-03-26 00:41 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-27 07:21 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-04 09:43 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
@ 2026-04-04 19:39 ` Andres Freund <[email protected]>
2026-04-05 12:31 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Andres Freund @ 2026-04-04 19:39 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Zsolt Parragi <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>; Heikki Linnakangas <[email protected]>
Hi,
On 2026-04-04 02:43:50 -0700, Lukas Fittl wrote:
> Attached v12, rebased, otherwise no changes.
>
> I realize time to freeze is getting close, and whilst I'd love to see
> this go in, I'm also realistic - so I'll just do my best to support
> review in the off chance we can make it for this release.
I unfortunately think there's enough nontrivial design decisions - that I
don't have a sufficiently confident assessment of - that I would be pretty
hesitant to commit this at this stage of the cycle without some architectural
review by senior folks. If Heikki did another round or two of review, it'd be a
different story.
I think the high level design is a huge improvement and goes in the right
direction, but some of the lower level stuff I'm far less confident about.
> On that note, I think 0001 and 0002 are independently useful
> refactorings to split the different kinds of instrumentation that
> should be ready to go, and I don't think should conflict much with
> other patches in this commitfest.
Yea, I'll see that those get committed.
I could also see 0004 as potentially worth getting committed separately,
although I'm a bit worried about test stability.
I recently looked at a coverage report, in the context of the index
prefetching patch, and was a bit surprised that prominent parallel executor
nodes have no coverage for EXPLAIN ANALYZE.
parallel BHS is not covered:
https://coverage.postgresql.org/src/backend/executor/nodeBitmapHeapscan.c.gcov.html#L536
parallel IOS is not covered:
https://coverage.postgresql.org/src/backend/executor/nodeIndexonlyscan.c.gcov.html#L430
> From 90a7ed18f14c09c8a1299db3a015747fc6b6761c Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Tue, 9 Sep 2025 02:16:59 -0700
> Subject: [PATCH v12 5/9] Optimize measuring WAL/buffer usage through
> stack-based instrumentation
>
> Previously, in order to determine the buffer/WAL usage of a given code
> section, we utilized continuously incrementing global counters that get
> updated when the actual activity (e.g. shared block read) occurred, and
> then calculated a diff when the code section ended. This resulted in a
> bottleneck for executor node instrumentation specifically, with the
> function BufferUsageAccumDiff showing up in profiles and in some cases
> adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
>
> Instead, introduce a stack-based mechanism, where the actual activity
> writes into the current stack entry. In the case of executor nodes, this
> means that each node gets its own stack entry that is pushed at
> InstrStartNode, and popped at InstrEndNode. Stack entries are zero
> initialized (avoiding the diff mechanism) and get added to their parent
> entry when they are finalized, i.e. no more modifications can occur.
>
> To correctly handle abort situations, any use of instrumentation stacks
> must involve either a top-level QueryInstrumentation struct, and its
> associated InstrQueryStart/InstrQueryStop helpers (which use resource
> owners to handle aborts), or the Instrumentation struct itself with
> dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
> consistent state after an abort.
>
> This also drops the global pgBufferUsage, any callers interested in
> measuring buffer activity should instead utilize InstrStart/InstrStop.
>
> The related global pgWalUsage is kept for now due to its use in pgstat
> to track aggregate WAL activity and heap_page_prune_and_freeze for
> measuring FPIs.
Probably worth stating what the performance overhead of WAL and BUFFERS is
after this patch?
> @@ -1015,19 +994,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
> */
> if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
> {
> - /*
> - * Set up to track total elapsed time in ExecutorRun. Make sure the
> - * space is allocated in the per-query context so it will go away at
> - * ExecutorEnd.
> - */
> + /* Set up to track total elapsed time in ExecutorRun. */
> if (queryDesc->totaltime == NULL)
> - {
> - MemoryContext oldcxt;
> -
> - oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
> - queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
> - MemoryContextSwitchTo(oldcxt);
> - }
> + queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
> }
> }
Not at all the fault of this patch, but it does seem somewhat odd to me that
we handle pgss/auto_explain wanting instrumentation by them updating the
QueryDesc->totaltime, rather than having extensions add an eflag to ask
standard_ExecutorStart to do so.
> @@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
> * and PARALLEL_KEY_BUFFER_USAGE.
> *
> * If there are no extensions loaded that care, we could skip this. We
> - * have no way of knowing whether anyone's looking at pgWalUsage or
> - * pgBufferUsage, so do it unconditionally.
> + * have no way of knowing whether anyone's looking at instrumentation, so
> + * do it unconditionally.
> */
> shm_toc_estimate_chunk(&pcxt->estimator,
> mul_size(sizeof(WalUsage), pcxt->nworkers));
> @@ -2887,6 +2887,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> Relation indexRel;
> LOCKMODE heapLockmode;
> LOCKMODE indexLockmode;
> + QueryInstrumentation *instr;
> WalUsage *walusage;
> BufferUsage *bufferusage;
> int sortmem;
> @@ -2936,7 +2937,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> tuplesort_attach_shared(sharedsort, seg);
>
> /* Prepare to track buffer usage during parallel execution */
> - InstrStartParallelQuery();
> + instr = InstrStartParallelQuery();
>
> /*
> * Might as well use reliable figure when doling out maintenance_work_mem
> @@ -2951,7 +2952,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> /* Report WAL/buffer usage during parallel execution */
> bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
> walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
> - InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> + InstrEndParallelQuery(instr,
> + &bufferusage[ParallelWorkerNumber],
> &walusage[ParallelWorkerNumber]);
>
> index_close(indexRel, indexLockmode);
Again not your fault, but it feels like the parallel index build
infrastructure is all wrong. Reimplementing this stuff for every index type
makes no sense.
> @@ -324,14 +324,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
> QueryEnvironment *queryEnv)
> {
> PlannedStmt *plan;
> - instr_time planstart,
> - planduration;
> - BufferUsage bufusage_start,
> - bufusage;
> + QueryInstrumentation *instr = NULL;
> MemoryContextCounters mem_counters;
> MemoryContext planner_ctx = NULL;
> MemoryContext saved_ctx = NULL;
>
> + if (es->buffers)
> + instr = InstrQueryAlloc(INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
> + else
> + instr = InstrQueryAlloc(INSTRUMENT_TIMER);
I was momentarily confused why this only checks es->buffers, not es->wal, but
I now see this is just for the planning, and we haven't displayed WAL there so
far, even if it's possible that we would emit some WAL. Probably would name
it plan_instr or such. And I'd probably go for one InstrQueryAlloc() with a
flags variable that's set depending on es->buffers, because it seems likely
we'll add more stuff to track over time...
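Roughly what I mean, as a standalone sketch rather than a patch hunk. The SKETCH_* flag values and the helper name are made up for illustration; in the real code the result would simply be passed to a single InstrQueryAlloc() call:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for INSTRUMENT_TIMER / INSTRUMENT_BUFFERS. */
enum
{
	SKETCH_INSTRUMENT_TIMER = 1 << 0,
	SKETCH_INSTRUMENT_BUFFERS = 1 << 1
};

/*
 * Build the option bitmask in one place, so that further planning-time
 * options only need one more "if" instead of another allocation branch.
 */
static int
sketch_plan_instr_flags(bool buffers)
{
	int			flags = SKETCH_INSTRUMENT_TIMER;

	if (buffers)
		flags |= SKETCH_INSTRUMENT_BUFFERS;

	/* caller would then do: plan_instr = InstrQueryAlloc(flags); */
	return flags;
}
```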
> if (es->memory)
> {
> /*
> @@ -348,15 +350,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
> saved_ctx = MemoryContextSwitchTo(planner_ctx);
> }
>
> - if (es->buffers)
> - bufusage_start = pgBufferUsage;
> - INSTR_TIME_SET_CURRENT(planstart);
> + InstrQueryStart(instr);
>
> /* plan the query */
> plan = pg_plan_query(query, queryString, cursorOptions, params, es);
>
> - INSTR_TIME_SET_CURRENT(planduration);
> - INSTR_TIME_SUBTRACT(planduration, planstart);
> + InstrQueryStopFinalize(instr);
>
> if (es->memory)
> {
> @@ -364,16 +363,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
> MemoryContextMemConsumed(planner_ctx, &mem_counters);
> }
>
> - /* calc differences of buffer counters. */
> - if (es->buffers)
> - {
> - memset(&bufusage, 0, sizeof(BufferUsage));
> - BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
> - }
> -
> /* run it (if needed) and produce output */
> ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
> - &planduration, (es->buffers ? &bufusage : NULL),
> + &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
> es->memory ? &mem_counters : NULL);
> }
Kinda wonder if some of this could be moved into a preliminary patch, without
depending on the stack infrastructure. Even with the current instrumentation
buffers and timing could be handled that way.
> @@ -590,7 +582,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
>
> /* grab serialization metrics before we destroy the DestReceiver */
> if (es->serialize != EXPLAIN_SERIALIZE_NONE)
> - serializeMetrics = GetSerializationMetrics(dest);
> + {
> + SerializeMetrics *metrics = GetSerializationMetrics(dest);
> +
> + if (metrics)
> + memcpy(&serializeMetrics, metrics, sizeof(SerializeMetrics));
> + }
The current code just returning a made up zeroed metrics in that case is
somewhat weird...
> +++ b/src/backend/executor/README.instrument
> @@ -0,0 +1,227 @@
> +src/backend/executor/README.instrument
> +
> +Instrumentation
> +===============
> +
> +The instrumentation subsystem measures time, buffer usage and WAL activity
> +during query execution and other similar activities. It is used by
> +EXPLAIN ANALYZE, pg_stat_statements, and other consumers that need
> +activity and/or timing metrics over a section of code.
> +
> +The design has two central goals:
> +
> +* Make it cheap to measure activity in a section of code, even when
> + that section is called many times and the aggregate is what is used
> + (as is the case with per-node instrumentation in the executor)
> +
> +* Ensure nested instrumentation accurately measures activity/timing,
> + and counter updates from activity get written to the currently
> + active instrumentation and accumulated upward to parent nodes when
> + finalized, considering aborts due to errors.
The second paragraph assumes implementation choices that haven't been
introduced yet. Without knowing that it's something stack based it's very
unclear what "currently active" and "accumulate upward" mean.
I'd probably just say that nested instrumentation needs to work including in
the case of errors getting thrown.
> +Instrumentation Options
> +-----------------------
> +
> +Callers specify what to measure with a bitmask of InstrumentOption flags:
> +
> + INSTRUMENT_ROWS -- row counts only (used with NodeInstrumentation)
> + INSTRUMENT_TIMER -- wall-clock timing and row counts
> + INSTRUMENT_BUFFERS -- buffer hit/read/dirtied/written counts and I/O time
> + INSTRUMENT_WAL -- WAL records, FPI, bytes
> +
> +INSTRUMENT_BUFFERS and INSTRUMENT_WAL utilize the instrumentation stack
> +(described below) for efficient handling of counter values.
Not for now, but it seems like es->memory really should also be implemented
this way, rather than be implemented ad-hoc.
> +Struct Hierarchy
> +----------------
> +
> +There are four instrumentation structs, each specialized for a different
> +scope:
I'd just say "the following", otherwise the "four" will inevitably not get
updated when another type of instrumentation is introduced :)
> +Stack-based instrumentation
> +===========================
> +
> +For tracking WAL or buffer usage counters, the specialized stack-based
> +instrumentation is used.
I think I'd add a sentence explaining why a stack is used, i.e. that the
alternative approach of taking a snapshot of all stats at the start of the
instrumented section and diffing it at the end & accumulating the differences,
is quite expensive.
And then explain that the solution, i.e. to point to a struct that should
receive stats, has the problem of not supporting nesting, if done naively.
Without those it's a bit hard to understand the motivation for all this.
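To make that concrete, here is a minimal standalone sketch (not code from the patch) of the idea: counter updates write into whatever entry is current, push/pop save and restore the previous entry so that nesting works, and a finished child is added to its parent. All Sketch*/sketch_* names are made up for illustration; the patch's actual structs and functions (Instrumentation, InstrPushStack, InstrAccumStack, ...) differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a stack entry with a single counter. */
typedef struct SketchInstr
{
	uint64_t	shared_blks_hit;
	struct SketchInstr *prev;	/* entry that was current before the push */
} SketchInstr;

/* Implicit bottom entry, so the update site never has to check for NULL. */
static SketchInstr sketch_top;
static SketchInstr *sketch_current = &sketch_top;

/* Counter update site (think buffer manager): one increment, no diffing. */
static inline void
sketch_count_hit(void)
{
	sketch_current->shared_blks_hit++;
}

static inline void
sketch_push(SketchInstr *instr)
{
	instr->prev = sketch_current;
	sketch_current = instr;
}

static inline void
sketch_pop(SketchInstr *instr)
{
	assert(sketch_current == instr);
	sketch_current = instr->prev;
}

/* Finalization: roll a child's counters up into its parent. */
static inline void
sketch_accum(SketchInstr *dst, const SketchInstr *src)
{
	dst->shared_blks_hit += src->shared_blks_hit;
}

typedef struct SketchResult
{
	uint64_t	inner;			/* hits while inner was current */
	uint64_t	outer_own;		/* hits while outer itself was current */
	uint64_t	outer_total;	/* outer after the child was accumulated */
} SketchResult;

static SketchResult
sketch_demo(void)
{
	SketchInstr outer = {0};
	SketchInstr inner = {0};
	SketchResult res;

	sketch_push(&outer);
	sketch_count_hit();			/* charged to outer */
	sketch_push(&inner);
	sketch_count_hit();			/* charged to inner */
	sketch_count_hit();			/* charged to inner */
	sketch_pop(&inner);
	sketch_count_hit();			/* charged to outer again */
	sketch_pop(&outer);

	res.inner = inner.shared_blks_hit;
	res.outer_own = outer.shared_blks_hit;
	sketch_accum(&outer, &inner);
	res.outer_total = outer.shared_blks_hit;
	return res;
}
```

The naive "just point at a struct" variant corresponds to dropping the prev link: after the inner section ends, nothing would know to make outer current again.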
> +Any buffer or WAL counter update (via the INSTR_BUFUSAGE_* and
> +INSTR_WALUSAGE_* macros in the buffer manager, WAL insertion code, etc.)
> +writes directly into instr_stack.current. Each instrumentation node starts
> +zeroed, so the values it accumulates while on top of the stack represent
> +exactly the activity that occurred during that time.
> +
> +Every Instrumentation node has a target, or parent, it will be accumulated
> +into, which is typically the Instrumentation that was the current stack
> +entry when it was created.
*(other than instr_top)?
> +For example, when Seq Scan A gets finalized in regular execution via ExecutorFinish,
> +its instrumentation data gets added to the immediate parent in
> +the execution tree, the NestLoop, which will then get added to Query A's
> +QueryInstrumentation, which then accumulates to the parent.
> +
> +While we can typically think of this as a tree, the NodeInstrumentation
> +underneath a particular QueryInstrumentation could behave differently --
> +for example, it could propagate directly to the QueryInstrumentation, in
> +order to not show cumulative numbers in EXPLAIN ANALYZE.
Hm. This seems like a somewhat random example, why would one want this?
> +Note these relationships are partially implicit, especially when it comes
> +to NodeInstrumentation. Each QueryInstrumentation maintains a list of its
> +unfinalized child nodes. The parent of a QueryInstrumentation itself is
> +determined by the stack (see below): when a query is finalized or cleaned
> +up on abort, its counters are accumulated to whatever entry is then current
> +on the stack, which is typically instr_top.
> +
> +
> +Finalization and Abort Safety
> +=============================
> +
> +Finalization is the process of rolling up a node's buffer/WAL counters to
> +its parent. In normal execution, nodes are pushed onto the stack when they
> +start and popped when they stop; at finalization time their accumulated
> +counters are added to the parent.
> +Due to the use of longjmp for error handling, functions can exit abruptly
> +without executing their normal cleanup code. On abort, two things need
> +to happen:
> +
> +1. Reset the stack to the appropriate level. This ensures that we don't
s/Reset the stack to/The stack is reset to/
And maybe something like:
s/appropriate/to the level saved at the start of the aborting (sub-)transaction/
> + later try to update counters on a freed stack entry. We also need to
> + ensure that the stack entry that was current before a particular
> + Instrumentation started, is current again after it stops.
> +
> +2. Finalize all affected Instrumentation nodes, rolling up their counters
> + to the highest surviving Instrumentation, so that data is not lost.
I was about to say that to me it's the lowest surviving one, but I guess
that's just a question of whether you think of stacks as growing up or down...
Perhaps it should be something like "innermost surviving"?
> +For example, if Seq Scan B aborts while the stack is:
> +
> + instr_top (implicit bottom)
> + 0: Query A
> + 1: NestLoop
> + 2: Seq Scan B
> +
> +The abort handler for Query A accumulates all unfinalized children (Seq
> +Scan A, Seq Scan B, NestLoop) directly into Query A's counters, then
> +unwinds the stack and accumulates Query A's counters to instr_top.
s/stack/instrumentation stack/, otherwise it might be confused with the C
callstack.
> +Note that on abort the children do not accumulate through each other (Seq
> +Scan B -> NestLoop -> Query A); they all accumulate directly to their
> +parent QueryInstrumentation. This means the order in which children are
> +released does not matter -- important because ResourceOwner cleanup does
> +not guarantee a particular release order.
s/important/this is important/
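A small standalone illustration of why that matters, under made-up names and counts (this is not the patch's code): if children rolled up through each other, releasing the NestLoop before Seq Scan B would leave B's counts stranded in an entry that has already been accumulated, whereas accumulating each child directly into the query gives the same total in either order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node entry: one counter plus a pointer to its tree parent. */
typedef struct SketchNode
{
	uint64_t	hits;
	struct SketchNode *parent;
} SketchNode;

/*
 * Release NestLoop and Seq Scan B in the given order and return what the
 * query-level entry ends up with.
 *
 * direct = true:  every child accumulates straight into the query entry.
 * direct = false: every child accumulates into its tree parent (chained).
 */
static uint64_t
sketch_release(bool direct, bool parent_first)
{
	SketchNode	query = {0, NULL};
	SketchNode	nestloop = {1, &query};
	SketchNode	scan_b = {5, &nestloop};
	SketchNode *order[2];

	order[0] = parent_first ? &nestloop : &scan_b;
	order[1] = parent_first ? &scan_b : &nestloop;

	for (int i = 0; i < 2; i++)
	{
		SketchNode *target = direct ? &query : order[i]->parent;

		target->hits += order[i]->hits;
	}

	return query.hits;
}
```

With these numbers the direct variant yields 6 for both orders, while the chained variant yields 6 only when the child is released first and 1 otherwise.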
> +The per-node breakdown is lost, but the query-level total is what survives
> +the abort.
Is that actually the typical scenario? Most of the time - including afaict in
your example above - there won't be a query-level instrumentation that
survives, but the stats are accumulated into instr_top.
> +If multiple QueryInstrumentations are active on the stack (e.g. nested
> +portals), each one's abort handler uses InstrStopFinalize to unwind to
> +whichever entry is higher up, so they compose correctly regardless of
> +release order.
Maybe "the abort handler of each uses InstrStopFinalize() to accumulate the
statistics to its parent entry"?
> +Resource Owner (QueryInstrumentation)
> +-------------------------------------
> +
> +QueryInstrumentation registers with the current ResourceOwner at start.
> +On transaction abort, the resource owner system calls the release callback,
> +which walks unfinalized child entries, accumulates their data, unwinds the
> +stack, and destroys the dedicated memory context (freeing the
> +QueryInstrumentation and all child allocations as a unit).
> +
> +This is the recommended approach when the instrumented code already has an
> +appropriate resource owner (e.g. it runs inside a portal). The query
> +executor uses this path.
> +
> +PG_FINALLY (base Instrumentation)
> +----------------------------------
> +
> +When no suitable resource owner exists, or when the caller wants to inspect
> +the instrumentation data even after an error, the base Instrumentation can
> +be used with a PG_TRY/PG_FINALLY block that calls InstrStopFinalize().
> +
> +Both mechanisms add overhead, so neither is suitable for high-frequency
I guess the "both" here refers to "Resource Owner" and "PG_FINALLY"? Given
this paragraph looks to be in the PG_FINALLY section, that's not quite
obvious.
Might just need another newline or such to make it clearer that this is not in
the PG_FINALLY section. Or maybe the two options should be in a bulleted list
or such.
> +instrumentation like per-node measurements in the executor. Instead,
> +plan node and trigger children rely on their parent QueryInstrumentation
> +for abort safety: they are allocated in the parent's memory context and
> +registered in its unfinalized-entries list, so the parent's abort handler
> +recovers their data automatically. In normal execution, children are
> +finalized explicitly by the caller.
> +Parallel Query
> +--------------
> +
> +Parallel workers get their own QueryInstrumentation so they can measure
> +buffer and WAL activity independently, then copy the totals into shared
> +memory at shutdown. The leader accumulates these into its own stack.
s/shared/dynamic shared/
Maybe s/shutdown/worker shutdown/?
> +When per-node instrumentation is active, parallel workers skip per-node
> +finalization at shutdown to avoid double-counting; the per-node data is
> +aggregated separately through InstrAggNode().
That's a bit gnarly, but I don't really see a better option.
> +Memory Handling
> +===============
> +
> +Instrumentation objects that use the stack must survive until finalization
> +runs, including the abort case. To ensure this, QueryInstrumentation
> +creates a dedicated "Instrumentation" MemoryContext (instr_cxt) as a child
> +of TopMemoryContext. All child instrumentation (nodes, triggers) should be
> +allocated in this context.
> +On successful completion, instr_cxt is reparented to CurrentMemoryContext
> +so its lifetime is tied to the caller's context. On abort, the
> +ResourceOwner cleanup frees it after accumulating the instrumentation data
> +to the current stack entry after resetting the stack.
Makes sense.
I mildly wonder if we should create one minimally sized "Instrumentations"
node under TopMemoryContext, below which the "Instrumentation" contexts are
created, instead of doing so directly under TopMemoryContext. But that's
something that can easily be evolved later.
> @@ -247,9 +248,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
> estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
> estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
> estate->es_top_eflags = eflags;
> - estate->es_instrument = queryDesc->instrument_options;
> estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
>
> + /*
> + * Set up query-level instrumentation if needed. We do this before
> + * InitPlan so that node and trigger instrumentation can be allocated
> + * within the query's dedicated instrumentation memory context.
> + */
> + if (!queryDesc->totaltime && queryDesc->instrument_options)
> + {
> + queryDesc->totaltime = InstrQueryAlloc(queryDesc->instrument_options);
> + estate->es_instrument = queryDesc->totaltime;
> + }
> +
> /*
> * Set up an AFTER-trigger statement context, unless told not to, or
> * unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
It seems pretty weird to still have queryDesc->totaltime *sometimes* created
by pgss etc, but also create it in standard_ExecutorStart if not already
created. What if the explain options aren't compatible? Sure
pgss/auto_explain use ALL, but that's not a given.
> + /* Start up instrumentation for this execution run */
> if (queryDesc->totaltime)
> - InstrStart(queryDesc->totaltime);
> + {
> + InstrQueryStart(queryDesc->totaltime);
> +
> + /*
> + * Remember all node entries for abort recovery. We do this once here
> + * after InstrQueryStart has pushed the parent stack entry.
> + */
> + if (estate->es_instrument &&
> + estate->es_instrument->instr.need_stack &&
> + !queryDesc->already_executed)
> + ExecRememberNodeInstrumentation(queryDesc->planstate,
> + queryDesc->totaltime);
> + }
Hm. Was briefly worried about the overhead of
ExecRememberNodeInstrumentation() in the context of cursors. But I see it's
only done once.
But why do we not just associate the NodeInstrumentation's with the
QueryInstrumentation during the creation of the NodeInstrumentation?
> + /*
> + * Accumulate per-node and trigger statistics to their respective parent
> + * instrumentation stacks.
>
> + * We skip this in parallel workers because their per-node stats are
> + * reported individually via ExecParallelReportInstrumentation, and the
> + * leader's own ExecFinalizeNodeInstrumentation handles propagation. If
> + * we accumulated here, the leader would double-count: worker parent nodes
> + * would already include their children's stats, and then the leader's
> + * accumulation would add the children again.
> + */
Haven't looked into how this all works in sufficient detail, so I'm just
asking you: This works correctly even when using EXPLAIN (ANALYZE, VERBOSE)
showing per-worker "subtrees"?
> + if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
> + {
> + ExecFinalizeNodeInstrumentation(queryDesc->planstate);
> +
> + ExecFinalizeTriggerInstrumentation(estate);
> + }
> +
> if (queryDesc->totaltime)
> - InstrStop(queryDesc->totaltime);
> + InstrQueryStopFinalize(queryDesc->totaltime);
I'd probably move the estate->es_instrument && !IsParallelWorker() check into
the if (queryDesc->totaltime).
> @@ -1284,8 +1325,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
> palloc0_array(FmgrInfo, n);
> resultRelInfo->ri_TrigWhenExprs = (ExprState **)
> palloc0_array(ExprState *, n);
> - if (instrument_options)
> - resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
> + if (qinstr)
> + resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, n);
Hm. Why do we not need to pass down the instrument_options anymore? I guess
the assumption is that we always are going to use the flags from qinstr?
Is that right? Because right now pgss/auto_explain use _ALL, even when an
EXPLAIN ANALYZE doesn't.
> +static void
> +ExecFinalizeTriggerInstrumentation(EState *estate)
> +{
> + List *rels = NIL;
> +
> + rels = list_concat(rels, estate->es_tuple_routing_result_relations);
> + rels = list_concat(rels, estate->es_opened_result_relations);
> + rels = list_concat(rels, estate->es_trig_target_relations);
Maybe ExecGetTriggerResultRel() needs a comment about needing to update
ExecFinalizeTriggerInstrumentation() if trigger stuff were to be stored in yet
another place?
> @@ -1081,14 +1081,28 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
> instrument = GetInstrumentationArray(instrumentation);
> instrument += i * instrumentation->num_workers;
> for (n = 0; n < instrumentation->num_workers; ++n)
> + {
> InstrAggNode(planstate->instrument, &instrument[n]);
>
> + /*
> + * Also add worker WAL usage to the global pgWalUsage counter.
> + *
> + * When per-node instrumentation is active, parallel workers skip
> + * ExecFinalizeNodeInstrumentation (to avoid double-counting in
> + * EXPLAIN), so per-node WAL activity is not rolled up into the
> + * query-level stats that InstrAccumParallelQuery receives. Without
> + * this, pgWalUsage would under-report WAL generated by parallel
> + * workers when instrumentation is active.
> + */
> + WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
> + }
I'm not sure I understand why this doesn't also lead to double counting, given
that InstrAccumParallelQuery() does also add the worker's usage to pgWalUsage?
> +static bool
> +ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
> +{
> + Instrumentation *parent = (Instrumentation *) context;
> +
> + Assert(parent != NULL);
> +
> + if (node == NULL)
> + return false;
> +
> + /*
> + * Recurse into children first (bottom-up accumulation), passing our
> + * instrumentation as the parent context. This ensures children can
> + * accumulate to us even if they were never executed by the leader (e.g.
> + * nodes beneath Gather that only workers ran).
> + */
> + planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
> + node->instrument ? &node->instrument->instr : parent);
I don't think I understand that comment. What changes if the leader's node
was never executed?
> @@ -227,6 +227,15 @@ FreeExecutorState(EState *estate)
> estate->es_partition_directory = NULL;
> }
>
> + /*
> + * Make sure the instrumentation context gets freed. This usually gets
> + * re-parented under the per-query context in InstrQueryStopFinalize, but
> + * that won't happen during EXPLAIN (BUFFERS) since ExecutorFinish never
> + * gets called, so we would otherwise leak it in TopMemoryContext.
> + */
> + if (estate->es_instrument && estate->es_instrument->instr.need_stack)
> + MemoryContextDelete(estate->es_instrument->instr_cxt);
> +
Ugh.
> diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
> index bc551f95a08..6892706a83a 100644
> --- a/src/backend/executor/instrument.c
> +++ b/src/backend/executor/instrument.c
> @@ -16,30 +16,46 @@
> #include <unistd.h>
>
> #include "executor/instrument.h"
> +#include "utils/memutils.h"
> +#include "utils/resowner.h"
>
> -BufferUsage pgBufferUsage;
> -static BufferUsage save_pgBufferUsage;
> WalUsage pgWalUsage;
Why do we still need pgWalUsage if we have the same data in instr_stack?
> -static WalUsage save_pgWalUsage;
> +Instrumentation instr_top;
> +InstrStackState instr_stack = {0, 0, NULL, &instr_top};
I'd use designated initializers to make this easier to read.
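As an illustration of that suggestion, here is a minimal sketch of the same initializer written with designated initializers. The struct layout below is an assumption: the field names stack_space, stack_size, entries and current appear in the quoted patch, but their order and types are guessed, and the Instrumentation stand-in is a placeholder.

```c
#include <stddef.h>

/* Hypothetical stand-in; the real struct is defined by the patch. */
typedef struct Instrumentation
{
    long        placeholder_counter;
} Instrumentation;

/* Field order and types are assumed for this sketch. */
typedef struct InstrStackState
{
    int         stack_space;    /* allocated slots in entries[] */
    int         stack_size;     /* slots currently in use */
    Instrumentation **entries;  /* pushed entries, NULL until first grow */
    Instrumentation *current;   /* entry that activity is counted against */
} InstrStackState;

Instrumentation instr_top;

/* Every member is named, so the reader need not know the member order. */
InstrStackState instr_stack = {
    .stack_space = 0,
    .stack_size = 0,
    .entries = NULL,
    .current = &instr_top,
};
```

With named members, adding or reordering fields later cannot silently shift a positional value into the wrong slot.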
> -static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
> -static void WalUsageAdd(WalUsage *dst, WalUsage *add);
> +void
> +InstrStackGrow(void)
> +{
> + int space = instr_stack.stack_space;
> +
> + if (instr_stack.entries == NULL)
> + {
> + space = 10; /* Allocate sufficient initial space for
> + * typical activity */
> + instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
> + sizeof(Instrumentation *) * space);
> + }
> + else
> + {
> + space *= 2;
> + instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, space);
> + }
>
> + /* Update stack space after allocation succeeded to protect against OOMs */
> + instr_stack.stack_space = space;
> +}
Perhaps worth adding an assert to check that we actually needed space?
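A rough sketch of what such an assertion could look like, outside the backend. It assumes callers only grow the stack when every allocated slot is in use (stack_size == stack_space); that invariant is a guess, not something the quoted patch states. Standard assert/realloc/abort stand in for Assert, repalloc_array and the backend's out-of-memory error.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Instrumentation Instrumentation; /* opaque for this sketch */

typedef struct InstrStackState
{
    int         stack_space;    /* allocated slots */
    int         stack_size;     /* slots in use */
    Instrumentation **entries;
} InstrStackState;

static InstrStackState instr_stack = {0};

static void
InstrStackGrow(void)
{
    int         space = instr_stack.stack_space;
    Instrumentation **entries;

    /* Assumed invariant: growing is only needed when the stack is full. */
    assert(instr_stack.stack_size == instr_stack.stack_space);

    space = (space == 0) ? 10 : space * 2;
    entries = realloc(instr_stack.entries, sizeof(Instrumentation *) * space);
    if (entries == NULL)
        abort();                /* stand-in for the backend's OOM error */

    instr_stack.entries = entries;
    /* Publish the new capacity only after the allocation succeeded. */
    instr_stack.stack_space = space;
}
```

The assertion would catch a caller that grows on every push rather than only when the array is exhausted.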
> +/*
> + * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
> + *
> + * Note that this intentionally allows passing a stack that is not the current
> + * top, as can happen with PG_FINALLY, or resource owners, which don't have a
> + * guaranteed cleanup order.
> + *
> + * We are careful here to achieve two goals:
> + *
> + * 1) Reset the stack to the parent of whichever of the released stack entries
> + * has the lowest index
> + * 2) Accumulate all instrumentation to the currently active instrumentation,
> + * so that callers get a complete picture of activity, even after an abort
> + */
> +void
> +InstrStopFinalize(Instrumentation *instr)
> +{
> + int idx = -1;
> +
> + for (int i = instr_stack.stack_size - 1; i >= 0; i--)
> + {
> + if (instr_stack.entries[i] == instr)
> + {
> + idx = i;
> + break;
> + }
> + }
So this may not find a stack entry, because a prior call to
InstrStopFinalize() already removed it from the stack, right?
Makes it a bit more error prone. Maybe we should store whether the element is
still on the stack in the Instrumentation, that way we a) can error out if we
don't find it on the stack b) avoid searching the stack if already removed.
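To make the idea concrete, here is a minimal sketch with a hypothetical on_stack flag. None of this is the patch's code: the flag name, the fixed-size array and the single blks counter are assumptions chosen to keep the example short. It also shows the out-of-order release case, where an outer entry is finalized before an inner one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STACK_MAX 8             /* fixed size keeps the sketch short */

typedef struct Instrumentation
{
    bool        on_stack;       /* hypothetical flag suggested above */
    long        blks;           /* placeholder for buffer/WAL counters */
} Instrumentation;

static Instrumentation instr_top;
static Instrumentation *entries[STACK_MAX];
static int  stack_size = 0;
static Instrumentation *current = &instr_top;

static void
InstrPushStack(Instrumentation *instr)
{
    assert(stack_size < STACK_MAX);
    assert(!instr->on_stack);
    entries[stack_size++] = instr;
    instr->on_stack = true;
    current = instr;
}

static void
InstrStopFinalize(Instrumentation *instr)
{
    if (instr->on_stack)
    {
        int         idx = -1;

        for (int i = stack_size - 1; i >= 0; i--)
        {
            if (entries[i] == instr)
            {
                idx = i;
                break;
            }
        }
        /* The flag says it is on the stack, so a miss is a bug. */
        assert(idx >= 0);

        /* Everything at or above idx leaves the stack with it. */
        for (int i = idx; i < stack_size; i++)
            entries[i]->on_stack = false;
        stack_size = idx;
        current = (idx > 0) ? entries[idx - 1] : &instr_top;
    }

    /* Accumulate into whichever entry is active now. */
    current->blks += instr->blks;
}
```

An entry whose flag was already cleared by an earlier release skips the search entirely and only accumulates its counters.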
> if (instr->need_timer)
> + InstrStopTimer(instr);
> +
> + InstrAccumStack(instr_stack.current, instr);
> +}
Not that it's a huge issue, but seems like it'd be neater if the need_timer
thing weren't duplicated, but implemented by calling InstrStop()?
> +void
> +InstrQueryStart(QueryInstrumentation *qinstr)
> +{
> + InstrStart(&qinstr->instr);
> +
> + if (qinstr->instr.need_stack)
> + {
> + Assert(CurrentResourceOwner != NULL);
> + qinstr->owner = CurrentResourceOwner;
> +
> + ResourceOwnerEnlarge(qinstr->owner);
> + ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
> + }
> +}
> +
> +void
> +InstrQueryStop(QueryInstrumentation *qinstr)
> +{
> + InstrStop(&qinstr->instr);
> +
> + if (qinstr->instr.need_stack)
> + {
> + Assert(qinstr->owner != NULL);
> + ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
> + qinstr->owner = NULL;
> + }
> +}
> +
> +void
> +InstrQueryStopFinalize(QueryInstrumentation *qinstr)
> +{
> + InstrStopFinalize(&qinstr->instr);
Why are these Instr[Query]StopFinalize() rather than just
Instr[Query]Finalize()?
> + if (!qinstr->instr.need_stack)
> + return;
Perhaps worth asserting that qinstr->{instr_cxt,owner} are NULL in this case?
> +/* start instrumentation during parallel executor startup */
> +QueryInstrumentation *
> +InstrStartParallelQuery(void)
> +{
> + QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
> +
> + InstrQueryStart(qinstr);
> + return qinstr;
> +}
Why do we hardcode INSTRUMENT_BUFFERS | INSTRUMENT_WAL?
> From 80dbf65f79deca08f5e10872cac226d0d8edca0e Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sun, 15 Mar 2026 21:44:58 -0700
> Subject: [PATCH v12 6/9] instrumentation: Use Instrumentation struct for
> parallel workers
>
> This simplifies the DSM allocations a bit since we don't need to
> separately allocate WAL and buffer usage, and allows the easier future
> addition of a third stack-based struct being discussed.
Does look a bit nicer.
> From 16e44d5508f91dd23da780901f3ec0126965628d Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sat, 7 Mar 2026 17:52:24 -0800
> Subject: [PATCH v12 7/9] instrumentation: Optimize ExecProcNodeInstr
> instructions by inlining
>
> For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
> ExecProcNodeInstr when starting/stopping instrumentation for that node.
>
> Previously each ExecProcNodeInstr would check which instrumentation
> options are active in the InstrStartNode/InstrStopNode calls, and do the
> corresponding work (timers, instrumentation stack, etc.). These
> conditionals being checked for each tuple being emitted add up, and cause
> non-optimal set of instructions to be generated by the compiler.
>
> Because we already have an existing mechanism to specify a function
> pointer when instrumentation is enabled, we can instead create specialized
> functions that are tailored to the instrumentation options enabled, and
> avoid conditionals on subsequent ExecProcNodeInstr calls. This results in
> the overhead for EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
> test with a large COUNT(*) that does many ExecProcNode calls from ~ 20% on
> top of actual runtime to ~ 3%. When using BUFFERS ON the same query goes
> from ~ 20% to ~ 10% on top of actual runtime.
I assume this is to a significant degree due to allowing for inlining. Have
you checked how much of the effect you get by just putting ExecProcNodeInstr()
into instrument.c?
> +/*
> + * Specialized handling of instrumented ExecProcNode
> + *
> + * These functions are equivalent to running ExecProcNodeReal wrapped in
> + * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
> + * by checking the instrumentation options when the ExecProcNode pointer gets
> + * first set, and then using a special-purpose function for each. This results
> + * in a more optimized set of compiled instructions.
> + */
> +
> +#include "executor/tuptable.h"
> +#include "nodes/execnodes.h"
> +
> +/* Simplified pop: restore saved state instead of re-deriving from array */
> +static inline void
> +InstrPopStackTo(Instrumentation *prev)
> +{
> + Assert(instr_stack.stack_size > 0);
> + Assert(instr_stack.stack_size > 1 ? instr_stack.entries[instr_stack.stack_size - 2] == prev : &instr_top == prev);
> + instr_stack.stack_size--;
> + instr_stack.current = prev;
> +}
> +
> +static inline TupleTableSlot *
> +ExecProcNodeInstr(PlanState *node, bool need_timer, bool need_stack)
This might need pg_attribute_always_inline to be reliable, the compiler
otherwise might decide that it should not actually inline the function...
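A small sketch of the specialization pattern with forced inlining, using stand-in types. sketch_always_inline mimics what pg_attribute_always_inline expands to on GCC/Clang; NodeSketch and all function names here are hypothetical and only count calls instead of doing real instrumentation.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for pg_attribute_always_inline; plain inline on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define sketch_always_inline __attribute__((always_inline)) inline
#else
#define sketch_always_inline inline
#endif

typedef struct NodeSketch
{
    int         timer_calls;    /* times the timer branch ran */
    int         stack_pushes;   /* times the stack branch ran */
    int         tuples;         /* tuples "returned" so far */
} NodeSketch;

/* Generic body: the flags are compile-time constants in each wrapper. */
static sketch_always_inline int
ExecProcNodeInstrSketch(NodeSketch *node, bool need_timer, bool need_stack)
{
    if (need_timer)
        node->timer_calls++;
    if (need_stack)
        node->stack_pushes++;
    return ++node->tuples;
}

/* One specialized function per option combination. */
static int InstrRowsOnly(NodeSketch *n)  { return ExecProcNodeInstrSketch(n, false, false); }
static int InstrTimerOnly(NodeSketch *n) { return ExecProcNodeInstrSketch(n, true, false); }
static int InstrStackOnly(NodeSketch *n) { return ExecProcNodeInstrSketch(n, false, true); }
static int InstrFull(NodeSketch *n)      { return ExecProcNodeInstrSketch(n, true, true); }

typedef int (*ExecProcNodeSketchMtd) (NodeSketch *);

/* Pick the variant once, when the function pointer is first set. */
static ExecProcNodeSketchMtd
ChooseExecProcNodeSketch(bool need_timer, bool need_stack)
{
    if (need_timer)
        return need_stack ? InstrFull : InstrTimerOnly;
    return need_stack ? InstrStackOnly : InstrRowsOnly;
}
```

If the generic body is not inlined into each wrapper, the branches on need_timer and need_stack stay in the per-tuple path, which is why forcing the inline matters for this pattern.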
> @@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
> then rejected by a recheck of the index condition. This happens because a
> GiST index is <quote>lossy</quote> for polygon containment tests: it actually
> returns the rows with polygons that overlap the target, and then we have
> - to do the exact containment test on those rows.
> + to do the exact containment test on those rows. The <literal>Table Buffers</literal>
> + counts indicate how many operations were performed on the table instead of
> + the index. This number is included in the <literal>Buffers</literal> counts.
> </para>
>
> <para>
I wonder if listing "Index Buffers" separately, instead of "Table Buffers"
would make more sense, because normally the number of index accesses is much
smaller and therefore a bit easier to put into relation to "Buffers".
> @@ -418,6 +418,29 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
> result->instrument = InstrAllocNode(estate->es_instrument,
> result->async_capable);
>
> + /*
> + * IndexScan / IndexOnlyScan track table and index access separately.
> + *
> + * We intentionally don't collect timing for them (even if enabled), since
> + * we don't need it, and executor nodes call InstrPushStack /
> + * InstrPopStack (instead of the full InstrNode*) to reduce overhead.
> + */
> + if (estate->es_instrument && (estate->es_instrument->instrument_options & INSTRUMENT_BUFFERS) != 0)
> + {
> + if (IsA(result, IndexScanState))
> + {
> + IndexScanState *iss = castNode(IndexScanState, result);
> +
> + InstrInitOptions(&iss->iss_Instrument->table_instr, INSTRUMENT_BUFFERS);
> + }
> + else if (IsA(result, IndexOnlyScanState))
> + {
> + IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, result);
> +
> + InstrInitOptions(&ioss->ioss_Instrument->table_instr, INSTRUMENT_BUFFERS);
> + }
> + }
> +
> return result;
> }
Why do this in ExecInitNode(), rather than ExecInitIndexScan(),
ExecInitIndexOnlyScan()?
> @@ -165,11 +169,22 @@ IndexOnlyNext(IndexOnlyScanState *node)
> ItemPointerGetBlockNumber(tid),
> &node->ioss_VMBuffer))
> {
> + bool found;
> +
> /*
> * Rats, we have to visit the heap to check visibility.
> */
> InstrCountTuples2(node, 1);
> - if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
> +
> + if (table_instr)
> + InstrPushStack(table_instr);
> +
> + found = index_fetch_heap(scandesc, node->ioss_TableSlot);
> +
> + if (table_instr)
> + InstrPopStack(table_instr);
> +
> + if (!found)
> continue; /* no visible tuple, try next index entry */
>
> ExecClearTuple(node->ioss_TableSlot);
As-is this will unfortunately rather terribly conflict with the way the index
prefetching patch is restructuring things, as after it neither index nor
indexonly scan does the equivalent of index_fetch_heap() anymore. This all
goes through a tableam interface, which in turn will call to the index to get
the tids (to allow for tableam specific prefetching logic, obviously).
I think this would require putting this into the IndexScanDesc via the
IndexScanInstrumentation etc.
Might be good for you to look at how that stuff works after the index
prefetching patch and comment if you see a problem.
Greetings,
Andres Freund
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-04-05 12:31 UTC (permalink / raw)
To: Andres Freund <[email protected]>; Heikki Linnakangas <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Zsolt Parragi <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>
Hi Andres,
Thanks for reviewing!
On Sat, Apr 4, 2026 at 12:39 PM Andres Freund <[email protected]> wrote:
>
> Hi,
>
> On 2026-04-04 02:43:50 -0700, Lukas Fittl wrote:
> > Attached v12, rebased, otherwise no changes.
> >
> > I realize time to freeze is getting close, and whilst I'd love to see
> > this go in, I'm also realistic - so I'll just do my best to support
> > review in the off chance we can make it for this release.
>
> I unfortunately think there's enough nontrivial design decisions - that I
> don't have a sufficiently confident assessment of - that I would be pretty
> hesitant to commit this at this stage of the cycle without some architectural
> review by senior folks. If Heikki did another round or two of review, it'd be a
> different story.
Ack - I'll keep it updated as needed in the next days to help, but
understand that there are many competing priorities :)
FWIW, I do feel like Heikki has taken a look at the memory management
aspects, and Zsolt also had some good detailed feedback on lower level
logic and OOM behaviour - so I'm actually less worried about these
now.
Heikki, your further review is very welcome, if you have the time.
It'd also be great if you could review the README.instrument (now in
v13/0008) to see if that makes sense to you.
> I think the high level design is a huge improvement and goes in the right
> direction, but some of the lower level stuff I'm far less confident about.
Glad to hear - I also think that the direction here is right, and I
don't think there is a lot of variance on architectural choices
anymore.
>
> > On that note, I think 0001 and 0002 are independently useful
> > refactorings to split the different kinds of instrumentation that
> > should be ready to go, and I don't think should conflict much with
> > other patches in this commitfest.
>
> Yea, I'll see that those get committed.
>
> I could also see 0004 as potentially worth getting committed separately,
> although I'm a bit worried about test stability.
Yeah, I understand. FWIW, the current approach has been stable in CI
runs, but that doesn't mean something in the buildfarm won't complain.
> I recently looked at a coverage report, in the context of the index
> prefetching patch, and was a bit surprised that prominent parallel executor
> nodes have no coverage for EXPLAIN ANALYZE.
> parallel BHS is not covered:
> https://coverage.postgresql.org/src/backend/executor/nodeBitmapHeapscan.c.gcov.html#L536
> parallel IOS is not covered:
> https://coverage.postgresql.org/src/backend/executor/nodeIndexonlyscan.c.gcov.html#L430
Ugh. Good catch, that lack of coverage is not good. I've added two
tests for that, and guess what, I found a bug (unrelated to
stack-based instrumentation) - show_tidbitmap_info doesn't correctly
accumulate Heap Blocks information from the worker instrumentation -
it only ever shows that of the leader. For such auxiliary information
in a parallel context it needs to do the extra work, e.g. like
show_indexsearches_info does.
I've put that bugfix and BHS coverage into its own commit (0004)
before the other tests, because I suspect we may even want to
backpatch that. I've added coverage of IOS via a check for Index
Searches as a new patch as well (0005). Let me know if you'd prefer that I
start a new thread for those.
See attached v13, with feedback addressed, unless otherwise noted
below. I've also slightly simplified InstrAggNode, since I realized it
is never called when running=true.
>
> > From 90a7ed18f14c09c8a1299db3a015747fc6b6761c Mon Sep 17 00:00:00 2001
> > From: Lukas Fittl <[email protected]>
> > Date: Tue, 9 Sep 2025 02:16:59 -0700
> > Subject: [PATCH v12 5/9] Optimize measuring WAL/buffer usage through
> > stack-based instrumentation
> >
> > Previously, in order to determine the buffer/WAL usage of a given code
> > section, we utilized continuously incrementing global counters that get
> > updated when the actual activity (e.g. shared block read) occurred, and
> > then calculated a diff when the code section ended. This resulted in a
> > bottleneck for executor node instrumentation specifically, with the
> > function BufferUsageAccumDiff showing up in profiles and in some cases
> > adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
> >
> > Instead, introduce a stack-based mechanism, where the actual activity
> > writes into the current stack entry. In the case of executor nodes, this
> > means that each node gets its own stack entry that is pushed at
> > InstrStartNode, and popped at InstrEndNode. Stack entries are zero
> > initialized (avoiding the diff mechanism) and get added to their parent
> > entry when they are finalized, i.e. no more modifications can occur.
> >
> > To correctly handle abort situations, any use of instrumentation stacks
> > must involve either a top-level QueryInstrumentation struct, and its
> > associated InstrQueryStart/InstrQueryStop helpers (which use resource
> > owners to handle aborts), or the Instrumentation struct itself with
> > dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
> > consistent state after an abort.
> >
> > This also drops the global pgBufferUsage, any callers interested in
> > measuring buffer activity should instead utilize InstrStart/InstrStop.
> >
> > The related global pgWalUsage is kept for now due to its use in pgstat
> > to track aggregate WAL activity and heap_page_prune_and_freeze for
> > measuring FPIs.
>
> Probably worth stating what the performance overhead of WAL and BUFFERS is
> after this patch?
I've added a note re: BUFFERS referencing the COUNT(*) numbers from
earlier in the thread - not sure if WAL is really worth it to talk
about (it's already quite a long commit message).
>
>
> > @@ -1015,19 +994,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
> > */
> > if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
> > {
> > - /*
> > - * Set up to track total elapsed time in ExecutorRun. Make sure the
> > - * space is allocated in the per-query context so it will go away at
> > - * ExecutorEnd.
> > - */
> > + /* Set up to track total elapsed time in ExecutorRun. */
> > if (queryDesc->totaltime == NULL)
> > - {
> > - MemoryContext oldcxt;
> > -
> > - oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
> > - queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
> > - MemoryContextSwitchTo(oldcxt);
> > - }
> > + queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
> > }
> > }
>
> Not at all the fault of this patch, but it does seem somewhat odd to me that
> we handle pgss/auto_explain wanting instrumentation by them updating the
> QueryDesc->totaltime, rather than having extensions add an eflag to ask
> standard_ExecutorStart to do so.
Agreed, that interaction is a bit odd.
I think it'd be reasonable to do this as a separate refactoring,
especially now that I've stopped utilizing totaltime as the parent for
per-node instrumentation, per later notes.
>
> > @@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
> > * and PARALLEL_KEY_BUFFER_USAGE.
> > *
> > * If there are no extensions loaded that care, we could skip this. We
> > - * have no way of knowing whether anyone's looking at pgWalUsage or
> > - * pgBufferUsage, so do it unconditionally.
> > + * have no way of knowing whether anyone's looking at instrumentation, so
> > + * do it unconditionally.
> > */
> > shm_toc_estimate_chunk(&pcxt->estimator,
> > mul_size(sizeof(WalUsage), pcxt->nworkers));
> > @@ -2887,6 +2887,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> > Relation indexRel;
> > LOCKMODE heapLockmode;
> > LOCKMODE indexLockmode;
> > + QueryInstrumentation *instr;
> > WalUsage *walusage;
> > BufferUsage *bufferusage;
> > int sortmem;
> > @@ -2936,7 +2937,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> > tuplesort_attach_shared(sharedsort, seg);
> >
> > /* Prepare to track buffer usage during parallel execution */
> > - InstrStartParallelQuery();
> > + instr = InstrStartParallelQuery();
> >
> > /*
> > * Might as well use reliable figure when doling out maintenance_work_mem
> > @@ -2951,7 +2952,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
> > /* Report WAL/buffer usage during parallel execution */
> > bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
> > walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
> > - InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> > + InstrEndParallelQuery(instr,
> > + &bufferusage[ParallelWorkerNumber],
> > &walusage[ParallelWorkerNumber]);
> >
> > index_close(indexRel, indexLockmode);
>
> Again not your fault, but it feels like the parallel index build
> infrastructure is all wrong. Reimplementing this stuff for every index type
> makes no sense.
Fully agreed, the duplication here is quite something.
The 0006 patch cleans this up a little bit by at least using the
Instrumentation struct consistently. That is not required for the
stack-based commit, but would be helpful if e.g. we were to put IO
stats next to Buffer/WAL stats in the future.
> > +For example, when Seq Scan A gets finalized in regular execution via ExecutorFinish,
> > +its instrumentation data gets added to the immediate parent in
> > +the execution tree, the NestLoop, which will then get added to Query A's
> > +QueryInstrumentation, which then accumulates to the parent.
> > +
> > +While we can typically think of this as a tree, the NodeInstrumentation
> > +underneath a particular QueryInstrumentation could behave differently --
> > +for example, it could propagate directly to the QueryInstrumentation, in
> > +order to not show cumulative numbers in EXPLAIN ANALYZE.
>
> Hm. This seems like a somewhat random example, why would one want this?
>
Hmm, yeah. I mainly included this because the fact that accumulation
for the individual nodes in the EXPLAIN happens to be in a tree-like
structure is a choice, no longer a requirement. It would be just as
easy to only accumulate to the parent QueryInstrumentation, and let
explain.c present you a choice of "BUFFERS SELF" / etc - we couldn't
have done that previously.
I've left this in for now, but maybe it's better to drop it to avoid confusion?
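To illustrate the distinction, here is a toy sketch of the two accumulation choices. The types and function names are hypothetical and a single blks counter stands in for the real buffer/WAL counters; it is not how the patch implements finalization.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct InstrSketch
{
    long        blks;           /* placeholder usage counter */
} InstrSketch;

static void
InstrAccumSketch(InstrSketch *dst, const InstrSketch *src)
{
    dst->blks += src->blks;
}

/*
 * Finalize a two-node plan (join over scan) bottom-up.
 *
 * tree mode: each node adds to its plan parent, so the join ends up with
 *            cumulative numbers (what EXPLAIN shows today).
 * flat mode: each node adds straight to the query entry, so per-node
 *            numbers stay "self" values.
 */
static void
FinalizeSketch(InstrSketch *query, InstrSketch *join, InstrSketch *scan,
               bool flat)
{
    InstrAccumSketch(flat ? query : join, scan);
    InstrAccumSketch(query, join);
}
```

In both modes the query-level total comes out the same; only the numbers left on the intermediate nodes differ.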
> > +If multiple QueryInstrumentations are active on the stack (e.g. nested
> > +portals), each one's abort handler uses InstrStopFinalize to unwind to
> > +whichever entry is higher up, so they compose correctly regardless of
> > +release order.
>
> Maybe "the abort handler of each uses InstrStopFinalize() to accumulate the
> statistics to its parent entry"?
I revised this as follows:
If multiple QueryInstrumentations are active on the stack (e.g. nested
portals), the abort handler of each uses InstrStopFinalize() to accumulate
the statistics to the parent entry of either the entry being released, or a
previously released entry if it was higher up in the stack, so they compose
correctly regardless of release order.
i.e. it's worth noting that it's explicitly not the parent of the stack
entry being released, since that parent may already have been released
itself (since cleanup can be out of order).
> > +Memory Handling
> > +===============
> > +
> > +Instrumentation objects that use the stack must survive until finalization
> > +runs, including the abort case. To ensure this, QueryInstrumentation
> > +creates a dedicated "Instrumentation" MemoryContext (instr_cxt) as a child
> > +of TopMemoryContext. All child instrumentation (nodes, triggers) should be
> > +allocated in this context.
>
> > +On successful completion, instr_cxt is reparented to CurrentMemoryContext
> > +so its lifetime is tied to the caller's context. On abort, the
> > +ResourceOwner cleanup frees it after accumulating the instrumentation data
> > +to the current stack entry after resetting the stack.
>
> Makes sense.
>
> I mildly wonder if we should create one minimally sized "Instrumentations"
> node under TopMemoryContext, below which the "Instrumentation" contexts are
> created, instead of doing so directly under TopMemoryContext. But that's
> something that can easily be evolved later.
>
Sure, grouping the instrumentation memory contexts could make sense,
though the existing handling already serves the purpose of clearly
showing when there is an instrumentation leak.
I'll not make this change for now to avoid code churn, but as you note
we could always adjust that part later.
>
> > @@ -247,9 +248,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
> > estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
> > estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
> > estate->es_top_eflags = eflags;
> > - estate->es_instrument = queryDesc->instrument_options;
> > estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
> >
> > + /*
> > + * Set up query-level instrumentation if needed. We do this before
> > + * InitPlan so that node and trigger instrumentation can be allocated
> > + * within the query's dedicated instrumentation memory context.
> > + */
> > + if (!queryDesc->totaltime && queryDesc->instrument_options)
> > + {
> > + queryDesc->totaltime = InstrQueryAlloc(queryDesc->instrument_options);
> > + estate->es_instrument = queryDesc->totaltime;
> > + }
> > +
> > /*
> > * Set up an AFTER-trigger statement context, unless told not to, or
> > * unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
>
> It seems pretty weird to still have queryDesc->totaltime *sometimes* created
> by pgss etc, but also create it in standard_ExecutorStart if not already
> created. What if the explain options aren't compatible? Sure
> pgss/auto_explain use ALL, but that's not a given.
Yeah, I think in practice all use cases I've ever seen pass
INSTRUMENT_ALL (and in fact it won't behave sanely if this differs
between extensions), but you're right there is no guarantee.
Overall, there are two aspects to this:
1) Query instrumentation as the parent for node instrumentation,
driven by use of EXPLAIN or auto_explain setting
queryDesc->instrument_options
2) Instrumentation as a mechanism to measure the activity of a query,
as used by pg_stat_statements or auto_explain (to get the runtime /
aggregate buffer usage)
I could see two solutions:
A) Keep two separate QueryInstrumentations (EXPLAIN/auto_explain get
es_instrument, any extensions measuring aggregate activity get
queryDesc->totaltime)
B) Have one internal QueryInstrumentation (that's responsible for being
the abort "parent" of both node instrumentation and queryDesc->totaltime)
I was initially thinking we could maybe combine them creatively (i.e.
expand on what we've done so far), but I'm not sure there is a
reasonable design that isn't convoluted. We could also have a way for
extensions to "request" a certain level of instrumentation (instead of
directly allocating it), but it seems the current hooks are
insufficient for that.
I've gone with solution (A) for now, with es_instrument being
allocated when per-node instrumentation is needed. Obviously that gets
us two ResOwner cleanups instead of one when e.g. auto_explain is
active, but I think that's still acceptable. It also shows how easy
it is to do an extra level of nesting with the stack-based
instrumentation, without too much expense.
With this in place, I do wonder if we should avoid the full memory
context setup in InstrQueryAlloc (i.e. instead just make a direct
allocation), unless we know that children are going to be attached.
The downside of that would be that we can't just re-assign the
instr_cxt in InstrQueryStopFinalize (we'd have to go back to the
previous logic of doing a memcpy into the caller's context, for the
no-children case), but it might make a notable performance difference?
>
> > + /* Start up instrumentation for this execution run */
> > if (queryDesc->totaltime)
> > - InstrStart(queryDesc->totaltime);
> > + {
> > + InstrQueryStart(queryDesc->totaltime);
> > +
> > + /*
> > + * Remember all node entries for abort recovery. We do this once here
> > + * after InstrQueryStart has pushed the parent stack entry.
> > + */
> > + if (estate->es_instrument &&
> > + estate->es_instrument->instr.need_stack &&
> > + !queryDesc->already_executed)
> > + ExecRememberNodeInstrumentation(queryDesc->planstate,
> > + queryDesc->totaltime);
> > + }
>
> Hm. Was briefly worried about the overhead of
> ExecRememberNodeInstrumentation() in the context of cursors. But I see it's
> only done once.
>
> But why do we not just associate the NodeInstrumentation's with the
> QueryInstrumentation during the creation of the NodeInstrumentation?
That's a good point - if I recall correctly that was structured
differently in an earlier commit, hence the complexity. But this is no
longer necessary, and allows us to drop the
ExecRememberNodeInstrumentation machinery. Nice :)
>
> > + /*
> > + * Accumulate per-node and trigger statistics to their respective parent
> > + * instrumentation stacks.
> >
> > + * We skip this in parallel workers because their per-node stats are
> > + * reported individually via ExecParallelReportInstrumentation, and the
> > + * leader's own ExecFinalizeNodeInstrumentation handles propagation. If
> > + * we accumulated here, the leader would double-count: worker parent nodes
> > + * would already include their children's stats, and then the leader's
> > + * accumulation would add the children again.
> > + */
>
> Haven't looked into how this all works in sufficient detail, so I'm just
> asking you: This works correctly even when using EXPLAIN (ANALYZE, VERBOSE)
> showing per-worker "subtrees"?
Yeah, that's a good question, and you indeed found a bug - that was
not correctly accumulating up for the per-worker node information. The
main complexity here is the avoidance of double counting.
I can think of two very different approaches to solve this:
1) Have finalization only be responsible for accumulating into the
overall query instrumentation (or whichever instrumentation is
active), and not bother with adding per-node instrumentation to the
parent node at all. Then, in explain.c, do the accumulation. If we
ever wanted to invent a "BUFFERS SELF" type option (i.e. don't add
them to the parent), that would be the way to go. It'd also make it
easier to support accumulation for other types of statistics being
added (e.g. "EXPLAIN (IO)").
2) Specifically walk the worker instrumentation after it has been
retrieved (to avoid double counting), and add to each node's parents.
For now I've gone with (2) and added a dedicated
ExecFinalizeWorkerInstrumentation function to deal with this.
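To make approach (2) a bit more concrete, here is a toy sketch of the idea, not the actual ExecFinalizeWorkerInstrumentation. The WToyNode struct, the fixed worker/child counts and the function names are assumptions for illustration only. Each node starts with the per-worker counts as reported, and a single post-order walk adds every child's per-worker total to its parent exactly once:

```c
#include <assert.h>
#include <stddef.h>

#define WTOY_NUM_WORKERS   2
#define WTOY_MAX_CHILDREN  2

/* Hypothetical plan node carrying only per-worker counters. */
typedef struct WToyNode
{
    long        worker_hits[WTOY_NUM_WORKERS];  /* as reported by each worker */
    struct WToyNode *children[WTOY_MAX_CHILDREN];
} WToyNode;

/*
 * Post-order walk: finish each child's subtree first, then add the child's
 * per-worker totals to this node. Every parent/child edge is visited once,
 * so nothing is counted twice.
 */
static void
wtoy_finalize_workers(WToyNode *node)
{
    if (node == NULL)
        return;

    for (int c = 0; c < WTOY_MAX_CHILDREN; c++)
    {
        WToyNode   *child = node->children[c];

        if (child == NULL)
            continue;

        wtoy_finalize_workers(child);

        for (int w = 0; w < WTOY_NUM_WORKERS; w++)
            node->worker_hits[w] += child->worker_hits[w];
    }
}

/* Aggregate over a scan; encodes both worker totals of the parent. */
static long
wtoy_demo(void)
{
    WToyNode    scan = {{5, 7}, {NULL, NULL}};
    WToyNode    agg = {{1, 0}, {&scan, NULL}};

    wtoy_finalize_workers(&agg);

    return agg.worker_hits[0] * 100 + agg.worker_hits[1];
}
```

The important property is that the walk runs only after the worker data has been retrieved, and only once per worker entry.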
>
>
> > @@ -1284,8 +1325,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
> > palloc0_array(FmgrInfo, n);
> > resultRelInfo->ri_TrigWhenExprs = (ExprState **)
> > palloc0_array(ExprState *, n);
> > - if (instrument_options)
> > - resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
> > + if (qinstr)
> > + resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, n);
>
> Hm. Why do we not need to pass down the instrument_options anymore? I guess
> the assumption is that we always are going to use the flags from qinstr?
>
> Is that right? Because right now pgss/auto_explain use _ALL, even when an
> EXPLAIN ANALYZE doesn't.
>
With the solution mentioned earlier, where es_instrument is a separate
allocation, this problem now goes away without any extra changes
needed.
Overall, I think it's reasonable to make node/trigger instrumentation
be attached to a query instrumentation that has the instrumentation
options set that should be applied. That way we don't have to think about
edge cases like a query instrumentation that doesn't need a stack, but
children that do.
>
> > @@ -1081,14 +1081,28 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
> > instrument = GetInstrumentationArray(instrumentation);
> > instrument += i * instrumentation->num_workers;
> > for (n = 0; n < instrumentation->num_workers; ++n)
> > + {
> > InstrAggNode(planstate->instrument, &instrument[n]);
> >
> > + /*
> > + * Also add worker WAL usage to the global pgWalUsage counter.
> > + *
> > + * When per-node instrumentation is active, parallel workers skip
> > + * ExecFinalizeNodeInstrumentation (to avoid double-counting in
> > + * EXPLAIN), so per-node WAL activity is not rolled up into the
> > + * query-level stats that InstrAccumParallelQuery receives. Without
> > + * this, pgWalUsage would under-report WAL generated by parallel
> > + * workers when instrumentation is active.
> > + */
> > + WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
> > + }
>
> I'm not sure I understand why this doesn't also lead to double counting, given
> that InstrAccumParallelQuery() does also add the worker's usage to pgWalUsage?
>
That can be explained by the somewhat hard to follow difference
between InstrAccumParallelQuery and InstrAggNode, which is a
pre-existing situation:
InstrAccumParallelQuery is used for accumulating the top level worker
instrumentation into the instrumentation that's active when
ExecParallelFinish runs. The active instrumentation at that point is
either the query's instrumentation (or if not used, instr_top), or the
Gather node.
InstrAggNode is used for accumulating each worker's per-node
instrumentation into the leader's per-node instrumentation.
When per-node instrumentation is active (node->instrument is
initialized), the WalUsageAdd occurs in both InstrAccumParallelQuery
and InstrAggNode - but per the comment in standard_ExecutorFinish, we
don't aggregate the per-node instrumentation to the top level of the
parallel worker - and therefore InstrAccumParallelQuery would report
basically no activity.
I tried to explain this in the comment above WalUsageAdd, but maybe
this needs further clarification?
>
> > +static bool
> > +ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
> > +{
> > + Instrumentation *parent = (Instrumentation *) context;
> > +
> > + Assert(parent != NULL);
> > +
> > + if (node == NULL)
> > + return false;
> > +
> > + /*
> > + * Recurse into children first (bottom-up accumulation), passing our
> > + * instrumentation as the parent context. This ensures children can
> > + * accumulate to us even if they were never executed by the leader (e.g.
> > + * nodes beneath Gather that only workers ran).
> > + */
> > + planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
> > + node->instrument ? &node->instrument->instr : parent);
>
> I don't think I understand that comment. What changes if the leader's node
> was never executed?
I think that was a timing issue in an earlier iteration, where the
stack-based instrumentation data was a separate allocation from the
main node instrumentation.
Since that is no longer an issue, we can just require node->instrument
to be initialized here. Reworded the comment and added an assert.
>
> > diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
> > index bc551f95a08..6892706a83a 100644
> > --- a/src/backend/executor/instrument.c
> > +++ b/src/backend/executor/instrument.c
> > @@ -16,30 +16,46 @@
> > #include <unistd.h>
> >
> > #include "executor/instrument.h"
> > +#include "utils/memutils.h"
> > +#include "utils/resowner.h"
> >
> > -BufferUsage pgBufferUsage;
> > -static BufferUsage save_pgBufferUsage;
> > WalUsage pgWalUsage;
>
> Why do we still need pgWalUsage if we have the same data in instr_stack.
Yeah. That is because of two reasons:
1) The questionable use of pgWalUsage to inform pruneheap.c whether an
FPI occurred. I think using pgWalUsage for this is just wrong, it
should use its own flag/counter. This can't use the top level
instrumentation stack since it'd be updated too late (only on executor
finish, not as writes are going on).
2) The use of pgWalUsage to update cumulative WAL usage statistics. We
could adjust this by having separate "pgstat_count_wal_.." functions
(mirroring how we deal with cumulative buffer usage statistics), or by
pulling the information from instrumentation stack and accepting that
WAL statistics won't be refreshed whilst a query is executing (which
is probably not okay? i.e. we might then have to invent some mechanism
to periodically "flush" before the actual finalize).
Addressing (1) would be somewhat straightforward, so maybe the best
way forward is to do that, and then refactor this to use separate
"pgstat_count_wal_.." functions instead of keeping the pgWalUsage
global.
I'll not do that here for now, since I don't think the double writing
of WAL stats is performance critical, and we'd still do that anyway
when having separate "pgstat_count_wal_.." functions.
>
> > +/*
> > + * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
> > + *
> > + * Note that this intentionally allows passing a stack that is not the current
> > + * top, as can happen with PG_FINALLY, or resource owners, which don't have a
> > + * guaranteed cleanup order.
> > + *
> > + * We are careful here to achieve two goals:
> > + *
> > + * 1) Reset the stack to the parent of whichever of the released stack entries
> > + * has the lowest index
> > + * 2) Accumulate all instrumentation to the currently active instrumentation,
> > + * so that callers get a complete picture of activity, even after an abort
> > + */
> > +void
> > +InstrStopFinalize(Instrumentation *instr)
> > +{
> > + int idx = -1;
> > +
> > + for (int i = instr_stack.stack_size - 1; i >= 0; i--)
> > + {
> > + if (instr_stack.entries[i] == instr)
> > + {
> > + idx = i;
> > + break;
> > + }
> > + }
>
> So this may not find a stack entry, because a prior call to
> InstrStopFinalize() already removed it from the stack, right?
>
> Makes it a bit more error prone. Maybe we should store whether the element is
> still on the stack in the Instrumentation, that way we a) can error out if we
> don't find it on the stack b) avoid searching the stack if already removed.
Yeah, that seems doable, added that as suggested.
It does add an extra instruction to InstrPushStack/InstrPopStack, but
that's probably not significant enough to matter. We could always turn
it into an assert-only check if it turns out to be.
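For reference, the shape of the on-stack flag is roughly as in the following toy sketch. This is not the patch code: FlagInstr, flag_push, flag_stop_finalize and the integer return codes are made up for illustration, and the accumulation step is left out so that only the stack bookkeeping is shown:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FLAG_STACK_MAX 16

/* Hypothetical entry that remembers whether it is still on the stack. */
typedef struct FlagInstr
{
    bool        on_stack;       /* set on push, cleared when released */
    long        counter;
} FlagInstr;

static FlagInstr *flag_entries[FLAG_STACK_MAX];
static int  flag_stack_size = 0;

static void
flag_push(FlagInstr *instr)
{
    assert(flag_stack_size < FLAG_STACK_MAX);
    assert(!instr->on_stack);
    flag_entries[flag_stack_size++] = instr;
    instr->on_stack = true;
}

/*
 * Returns 0 if the entry was already released (no search needed), 1 if it
 * was found and released together with everything above it, and -1 if the
 * flag claims it is on the stack but it is not there (the case where real
 * code would raise an error).
 */
static int
flag_stop_finalize(FlagInstr *instr)
{
    if (!instr->on_stack)
        return 0;

    for (int i = flag_stack_size - 1; i >= 0; i--)
    {
        if (flag_entries[i] == instr)
        {
            /* Release this entry and all entries pushed after it. */
            for (int j = i; j < flag_stack_size; j++)
                flag_entries[j]->on_stack = false;
            flag_stack_size = i;
            return 1;
        }
    }

    return -1;
}

/* Out-of-order cleanup, as can happen with resource owners. */
static int
flag_demo(void)
{
    FlagInstr   outer = {false, 0};
    FlagInstr   inner = {false, 0};
    int         first;
    int         second;

    flag_push(&outer);
    flag_push(&inner);

    first = flag_stop_finalize(&outer);     /* releases outer and inner */
    second = flag_stop_finalize(&inner);    /* already released, no search */

    return first * 10 + second;
}
```

With the flag, a second cleanup call for an already released entry returns early, and a missing entry can be told apart from one that was already removed.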
> > if (instr->need_timer)
> > + InstrStopTimer(instr);
> > +
> > + InstrAccumStack(instr_stack.current, instr);
> > +}
>
> Not that it's a huge issue, but seems like it'd be neater if the need_timer
> thing weren't duplicated, but implemented by calling InstrStop()?
That'd be a problem since InstrStop also pops the stack (if
need_stack=true), and InstrStopFinalize already popped the stack right
before.
I think the only alternative here would be adding a flag on InstrStop,
but that seems worse to me.
>
> > +void
> > +InstrQueryStart(QueryInstrumentation *qinstr)
> > +{
> > + InstrStart(&qinstr->instr);
> > +
> > + if (qinstr->instr.need_stack)
> > + {
> > + Assert(CurrentResourceOwner != NULL);
> > + qinstr->owner = CurrentResourceOwner;
> > +
> > + ResourceOwnerEnlarge(qinstr->owner);
> > + ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
> > + }
> > +}
> > +
> > +void
> > +InstrQueryStop(QueryInstrumentation *qinstr)
> > +{
> > + InstrStop(&qinstr->instr);
> > +
> > + if (qinstr->instr.need_stack)
> > + {
> > + Assert(qinstr->owner != NULL);
> > + ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
> > + qinstr->owner = NULL;
> > + }
> > +}
> > +
> > +void
> > +InstrQueryStopFinalize(QueryInstrumentation *qinstr)
> > +{
> > + InstrStopFinalize(&qinstr->instr);
>
> Why are these Instr[Query]StopFinalize() rather than just
> Instr[Query]Finalize()?
If you're coming at this from a naming perspective: Mainly to make it
clear that these both stop the instrumentation, and finalize it. If we
only called it "Instr[Query]Finalize" it wouldn't be clear that there
isn't a missing "Stop" call.
Alternatively we could:
1) Require callers to do two separate function calls
2) Have a "finalize" argument to the Stop function. I had that in a
prior iteration, but felt it was easier to miss the subtle true/false
difference.
> > +/* start instrumentation during parallel executor startup */
> > +QueryInstrumentation *
> > +InstrStartParallelQuery(void)
> > +{
> > + QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
> > +
> > + InstrQueryStart(qinstr);
> > + return qinstr;
> > +}
>
> Why do we hardcode INSTRUMENT_BUFFERS | INSTRUMENT_WAL?
>
That's reflecting the fact that parallel workers can only transport
these two instrumentation types. The 0006 patch removes that
hardcoding. I'll add a comment in the earlier patch for now, for
clarity.
>
> > From 16e44d5508f91dd23da780901f3ec0126965628d Mon Sep 17 00:00:00 2001
> > From: Lukas Fittl <[email protected]>
> > Date: Sat, 7 Mar 2026 17:52:24 -0800
> > Subject: [PATCH v12 7/9] instrumentation: Optimize ExecProcNodeInstr
> > instructions by inlining
> >
> > For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
> > ExecProcNodeInstr when starting/stopping instrumentation for that node.
> >
> > Previously each ExecProcNodeInstr would check which instrumentation
> > options are active in the InstrStartNode/InstrStopNode calls, and do the
> > corresponding work (timers, instrumentation stack, etc.). These
> > conditionals being checked for each tuple being emitted add up, and cause
> > non-optimal set of instructions to be generated by the compiler.
> >
> > Because we already have an existing mechanism to specify a function
> > pointer when instrumentation is enabled, we can instead create specialized
> > functions that are tailored to the instrumentation options enabled, and
> > avoid conditionals on subsequent ExecProcNodeInstr calls. This results in
> > the overhead for EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
> > test with a large COUNT(*) that does many ExecProcNode calls from ~ 20% on
> > top of actual runtime to ~ 3%. When using BUFFERS ON the same query goes
> > from ~ 20% to ~ 10% on top of actual runtime.
>
> I assume this is to a significant degree due to allowing for inlining. Have
> you checked how much of the effort you get by just putting ExecProcNodeInstr()
> into instrument.c?
Worth a try - I haven't tested that yet - I'll come back to this
separately and verify how much that buys us, vs spelling out the
different variants.
>
> > @@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
> > then rejected by a recheck of the index condition. This happens because a
> > GiST index is <quote>lossy</quote> for polygon containment tests: it actually
> > returns the rows with polygons that overlap the target, and then we have
> > - to do the exact containment test on those rows.
> > + to do the exact containment test on those rows. The <literal>Table Buffers</literal>
> > + counts indicate how many operations were performed on the table instead of
> > + the index. This number is included in the <literal>Buffers</literal> counts.
> > </para>
> >
> > <para>
>
> I wonder if listing "Index Buffers" separately, instead of "Table Buffers"
> would make more sense, because normally the number of index accesses is much
> smaller and therefore a bit easier to put into relation to "Buffers".
>
I don't think changing this to be focused on index buffers makes sense
(but my opinion is weakly held). My arguments for why it doesn't make
sense:
1) The primary activity of the node is the index (only) scan. The fact
that it also does table access is what we're trying to call out, just
like we're calling out heap fetches.
2) For index only scans the inverse of what you noted is true, i.e.
you'd expect many more index buffers with very few table buffers.
The fact that there were any table buffers at all is worth calling
out.
>
> > @@ -165,11 +169,22 @@ IndexOnlyNext(IndexOnlyScanState *node)
> > ItemPointerGetBlockNumber(tid),
> > &node->ioss_VMBuffer))
> > {
> > + bool found;
> > +
> > /*
> > * Rats, we have to visit the heap to check visibility.
> > */
> > InstrCountTuples2(node, 1);
> > - if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
> > +
> > + if (table_instr)
> > + InstrPushStack(table_instr);
> > +
> > + found = index_fetch_heap(scandesc, node->ioss_TableSlot);
> > +
> > + if (table_instr)
> > + InstrPopStack(table_instr);
> > +
> > + if (!found)
> > continue; /* no visible tuple, try next index entry */
> >
> > ExecClearTuple(node->ioss_TableSlot);
>
> As-is this will unfortunately rather terribly conflict with the way the index
> prefetching patch is restructuring things, as after it neither index nor
> indexonly scan does the equivalent of index_fetch_heap() anymore. This all
> goes through a tableam interface, which in turn will call to the index to get
> the tids (to allow for tableam specific prefetching logic, obviously).
>
> I think this would require putting this into the IndexScanDesc via the
> IndexScanInstrumentation etc.
>
>
> Might be good for you to look at how that stuff works after the index
> prefetching patch and comment if you see a problem.
Agreed, I'll look at that tomorrow. Well, today, I suppose, looking at
the clock..
Thanks,
Lukas
--
Lukas Fittl
Attachments:
[application/octet-stream] v13-0002-instrumentation-Separate-per-node-logic-from-oth.patch (27.4K, 2-v13-0002-instrumentation-Separate-per-node-logic-from-oth.patch)
From bf620643238327dd6aa5192aa4940b5ff5791328 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 01:19:50 -0800
Subject: [PATCH v13 02/12] instrumentation: Separate per-node logic from other
uses
Previously different places (e.g. query "total time") were repurposing
the Instrumentation struct initially introduced for capturing per-node
statistics during execution. This overuse of the same struct is confusing,
e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
code paths, and prevents future refactorings.
Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.
Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.
In passing, drop the "n" argument to InstrAlloc, as all remaining callers
need exactly one Instrumentation struct. Additionally, clarify that
InstrAggNode is expected to only run after InstrEndLoop (as it does in
practice), and drop unused code.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
contrib/auto_explain/auto_explain.c | 8 +-
.../pg_stat_statements/pg_stat_statements.c | 8 +-
contrib/postgres_fdw/postgres_fdw.c | 2 +-
src/backend/commands/explain.c | 20 +--
src/backend/executor/execMain.c | 8 +-
src/backend/executor/execParallel.c | 24 +--
src/backend/executor/execProcnode.c | 4 +-
src/backend/executor/instrument.c | 158 +++++++++++-------
src/include/executor/instrument.h | 60 ++++---
src/include/nodes/execnodes.h | 9 +-
src/tools/pgindent/typedefs.list | 3 +-
11 files changed, 179 insertions(+), 125 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index e856cd35a6f..39bf2543b70 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 5494d41dca1..fbf32f0e72c 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1025,7 +1025,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -1084,12 +1084,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
if (queryId != INT64CONST(0) && queryDesc->totaltime &&
pgss_enabled(nesting_level))
{
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
pgss_store(queryDesc->sourceText,
queryId,
queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 41e47cc795b..cc8ec24c30e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2779,7 +2779,7 @@ postgresIterateDirectModify(ForeignScanState *node)
if (!resultRelInfo->ri_projectReturning)
{
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- Instrumentation *instr = node->ss.ps.instrument;
+ NodeInstrumentation *instr = node->ss.ps.instrument;
Assert(!dmstate->has_returning);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index eb6ef23c2d6..e73dc129132 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1837,7 +1837,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
double nloops = planstate->instrument->nloops;
double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
- double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
+ double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->instr.total) / nloops;
double rows = planstate->instrument->ntuples / nloops;
if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1890,11 +1890,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* prepare per-worker general execution details */
if (es->workers_state && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
double startup_ms;
double total_ms;
@@ -1903,7 +1903,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (nloops <= 0)
continue;
startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
- total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
+ total_ms = INSTR_TIME_GET_MILLISEC(instrument->instr.total) / nloops;
rows = instrument->ntuples / nloops;
ExplainOpenWorker(n, es);
@@ -2290,18 +2290,18 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage);
if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ show_wal_usage(es, &planstate->instrument->instr.walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
if (nloops <= 0)
@@ -2309,9 +2309,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage);
if (es->wal)
- show_wal_usage(es, &instrument->walusage);
+ show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0237d8c3b1d..b0f636bf8b6 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -333,7 +333,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/*
* extract information from the query descriptor and the query feature.
@@ -385,7 +385,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -435,7 +435,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -445,7 +445,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
AfterTriggerEndQuery(estate);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 755191b51ef..78f60c1530c 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,7 +87,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object. This will depend on the length of
+ * of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -104,11 +104,15 @@ struct SharedExecutorInstrumentation
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
- /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+ /*
+ * array of num_plan_nodes * num_workers NodeInstrumentation objects
+ * follows
+ */
};
#define GetInstrumentationArray(sei) \
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
- (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+ (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
@@ -731,7 +735,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len = MAXALIGN(instrumentation_len);
instrument_offset = instrumentation_len;
instrumentation_len +=
- mul_size(sizeof(Instrumentation),
+ mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -817,7 +821,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*/
if (estate->es_instrument)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -827,7 +831,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInit(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1059,7 +1063,7 @@ static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
int n;
int ibytes;
@@ -1087,9 +1091,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* Switch into per-query memory context.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
planstate->worker_instrument =
- palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+ palloc(ibytes + offsetof(WorkerNodeInstrumentation, instrument));
MemoryContextSwitchTo(oldcontext);
planstate->worker_instrument->num_workers = instrumentation->num_workers;
@@ -1319,7 +1323,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
{
int i;
int plan_node_id = planstate->plan->plan_node_id;
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
InstrEndLoop(planstate->instrument);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d35976925ae..132fe37ef60 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -414,8 +414,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument,
- result->async_capable);
+ result->instrument = InstrAllocNode(estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 9354ad7be12..e3d890a7f98 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,51 +26,31 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int instrument_options)
{
- Instrumentation *instr;
-
- /* initialize all fields to zeroes, then modify as needed */
- instr = palloc0(n * sizeof(Instrumentation));
- if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
- {
- bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
- bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
- int i;
-
- for (i = 0; i < n; i++)
- {
- instr[i].need_bufusage = need_buffers;
- instr[i].need_walusage = need_wal;
- instr[i].need_timer = need_timer;
- instr[i].async_mode = async_mode;
- }
- }
+ Instrumentation *instr = palloc0(sizeof(Instrumentation));
+ InstrInitOptions(instr, instrument_options);
return instr;
}
-/* Initialize a pre-allocated instrumentation structure. */
void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- memset(instr, 0, sizeof(Instrumentation));
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-/* Entry to a plan node */
void
-InstrStartNode(Instrumentation *instr)
+InstrStart(Instrumentation *instr)
{
if (instr->need_timer)
{
if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ elog(ERROR, "InstrStart called twice in a row");
else
INSTR_TIME_SET_CURRENT(instr->starttime);
}
@@ -83,24 +63,19 @@ InstrStartNode(Instrumentation *instr)
instr->walusage_start = pgWalUsage;
}
-/* Exit from a plan node */
void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
-
/* let's update the time only if the timer was requested */
if (instr->need_timer)
{
if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStopNode called without start");
+ elog(ERROR, "InstrStop called without start");
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
}
@@ -113,6 +88,74 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (instr->need_walusage)
WalUsageAccumDiff(&instr->walusage,
&pgWalUsage, &instr->walusage_start);
+}
+
+/* Node instrumentation handling */
+
+/* Allocate new node instrumentation structure */
+NodeInstrumentation *
+InstrAllocNode(int instrument_options, bool async_mode)
+{
+ NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+
+ InstrInitNode(instr, instrument_options);
+ instr->async_mode = async_mode;
+
+ return instr;
+}
+
+/* Initialize a pre-allocated instrumentation structure. */
+void
+InstrInitNode(NodeInstrumentation *instr, int instrument_options)
+{
+ memset(instr, 0, sizeof(NodeInstrumentation));
+ InstrInitOptions(&instr->instr, instrument_options);
+}
+
+/* Entry to a plan node */
+void
+InstrStartNode(NodeInstrumentation *instr)
+{
+ InstrStart(&instr->instr);
+}
+
+/* Exit from a plan node */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ double save_tuplecount = instr->tuplecount;
+ instr_time endtime;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
+
+ /*
+ * Update the time only if the timer was requested.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+ if (instr->instr.need_timer)
+ {
+ if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
+ elog(ERROR, "InstrStopNode called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
+ }
+
+ /* Add delta of buffer usage since entry to node's totals */
+ if (instr->instr.need_bufusage)
+ BufferUsageAccumDiff(&instr->instr.bufusage,
+ &pgBufferUsage, &instr->instr.bufusage_start);
+
+ if (instr->instr.need_walusage)
+ WalUsageAccumDiff(&instr->instr.walusage,
+ &pgWalUsage, &instr->instr.walusage_start);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -133,7 +176,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
/* Update tuple count */
void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
{
/* count the returned tuples */
instr->tuplecount += nTuples;
@@ -141,47 +184,40 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation *instr)
{
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
elog(ERROR, "InstrEndLoop called on running node");
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
- INSTR_TIME_ADD(instr->total, instr->counter);
+ INSTR_TIME_ADD(instr->instr.total, instr->counter);
instr->ntuples += instr->tuplecount;
instr->nloops += 1;
/* Reset for next cycle (if any) */
instr->running = false;
- INSTR_TIME_SET_ZERO(instr->starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
INSTR_TIME_SET_ZERO(instr->counter);
INSTR_TIME_SET_ZERO(instr->firsttuple);
instr->tuplecount = 0;
}
-/* aggregate instrumentation information */
+/*
+ * Aggregate instrumentation from parallel workers. Must be called after
+ * InstrEndLoop.
+ */
void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
{
- if (!dst->running && add->running)
- {
- dst->running = true;
- dst->firsttuple = add->firsttuple;
- }
- else if (dst->running && add->running &&
- INSTR_TIME_GT(dst->firsttuple, add->firsttuple))
- dst->firsttuple = add->firsttuple;
-
- INSTR_TIME_ADD(dst->counter, add->counter);
+ Assert(!add->running);
- dst->tuplecount += add->tuplecount;
INSTR_TIME_ADD(dst->startup, add->startup);
- INSTR_TIME_ADD(dst->total, add->total);
+ INSTR_TIME_ADD(dst->instr.total, add->instr.total);
dst->ntuples += add->ntuples;
dst->ntuples2 += add->ntuples2;
dst->nloops += add->nloops;
@@ -189,11 +225,11 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->need_bufusage)
- BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ if (dst->instr.need_bufusage)
+ BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
- if (dst->need_walusage)
- WalUsageAdd(&dst->walusage, &add->walusage);
+ if (dst->instr.need_walusage)
+ WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
}
/* Trigger instrumentation handling */
@@ -204,7 +240,7 @@ InstrAllocTrigger(int n, int instrument_options)
int i;
for (i = 0; i < n; i++)
- InstrInit(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, instrument_options);
return tginstr;
}
@@ -212,13 +248,13 @@ InstrAllocTrigger(int n, int instrument_options)
void
InstrStartTrigger(TriggerInstrumentation *tginstr)
{
- InstrStartNode(&tginstr->instr);
+ InstrStart(&tginstr->instr);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
- InstrStopNode(&tginstr->instr, 0);
+ InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a9c2233227f..b11d64633b5 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -67,38 +67,55 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/*
+ * General purpose instrumentation that can capture time and WAL/buffer usage
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured in between).
+ */
typedef struct Instrumentation
{
- /* Parameters set at node creation: */
+ /* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
bool need_bufusage; /* true if we need buffer usage data */
bool need_walusage; /* true if we need WAL usage data */
+ /* Internal state keeping: */
+ instr_time starttime; /* start time of last InstrStart */
+ BufferUsage bufusage_start; /* buffer usage at start */
+ WalUsage walusage_start; /* WAL usage at start */
+ /* Accumulated statistics: */
+ instr_time total; /* total runtime */
+ BufferUsage bufusage; /* total buffer usage */
+ WalUsage walusage; /* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
+{
+ Instrumentation instr;
+ /* Parameters set at node creation: */
bool async_mode; /* true if node is in async mode */
/* Info about current plan cycle: */
bool running; /* true if we've completed first tuple */
- instr_time starttime; /* start time of current iteration of node */
instr_time counter; /* accumulated runtime for this node */
instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
instr_time startup; /* total startup time */
- instr_time total; /* total time */
double ntuples; /* total tuples produced */
double ntuples2; /* secondary node-specific tuple counter */
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
-typedef struct WorkerInstrumentation
+typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
- Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
-} WorkerInstrumentation;
+ NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+} WorkerNodeInstrumentation;
typedef struct TriggerInstrumentation
{
@@ -110,14 +127,19 @@ typedef struct TriggerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
- bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int instrument_options);
+extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr);
+
+extern NodeInstrumentation *InstrAllocNode(int instrument_options,
+ bool async_mode);
+extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation *instr);
+extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation *instr);
+extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 908898aa7c9..3ecae7552fc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -60,6 +60,7 @@ typedef struct ScanKeyData ScanKeyData;
typedef struct SnapshotData *Snapshot;
typedef struct SortSupportData *SortSupport;
typedef struct TIDBitmap TIDBitmap;
+typedef struct NodeInstrumentation NodeInstrumentation;
typedef struct TriggerInstrumentation TriggerInstrumentation;
typedef struct TupleConversionMap TupleConversionMap;
typedef struct TupleDescData *TupleDesc;
@@ -68,7 +69,7 @@ typedef struct Tuplestorestate Tuplestorestate;
typedef struct TupleTableSlot TupleTableSlot;
typedef struct TupleTableSlotOps TupleTableSlotOps;
typedef struct WalUsage WalUsage;
-typedef struct WorkerInstrumentation WorkerInstrumentation;
+typedef struct WorkerNodeInstrumentation WorkerNodeInstrumentation;
/* ----------------
@@ -1207,8 +1208,10 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
- Instrumentation *instrument; /* Optional runtime stats for this node */
- WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
+ NodeInstrumentation *instrument; /* Optional runtime stats for this
+ * node */
+ WorkerNodeInstrumentation *worker_instrument; /* per-worker
+ * instrumentation */
/* Per-worker JIT instrumentation */
struct SharedJitInstrumentation *worker_jit_instrument;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7ddf970fb97..449acca8dc1 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1822,6 +1822,7 @@ NextSampleBlock_function
NextSampleTuple_function
NextValueExpr
Node
+NodeInstrumentation
NodeTag
NonEmptyRange
NoneCompressorState
@@ -3436,9 +3437,9 @@ WorkTableScan
WorkTableScanState
WorkerInfo
WorkerInfoData
-WorkerInstrumentation
WorkerJobDumpPtrType
WorkerJobRestorePtrType
+WorkerNodeInstrumentation
Working_State
WriteBufPtrType
WriteBytePtrType
--
2.47.1
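
The split above leaves InstrStop accumulating elapsed time straight into
"total", while InstrStopNode accumulates into the per-cycle "counter" and
only InstrEndLoop folds that into "total". A minimal sketch of that
difference follows. It is not the PostgreSQL code: the Demo* types and
demo_* functions are hypothetical stand-ins, time is a plain integer tick
passed in by the caller (0 meaning "not started"), elog(ERROR) is replaced
by assert, and the async_mode handling of the first tuple is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for Instrumentation, reduced to the timer fields. */
typedef struct DemoInstr
{
    int64_t starttime;   /* tick of the last start, 0 if not started */
    int64_t total;       /* accumulated runtime */
} DemoInstr;

/* Hypothetical stand-in for NodeInstrumentation: embeds the base struct. */
typedef struct DemoNodeInstr
{
    DemoInstr instr;     /* general purpose part, as in the patch */
    bool      running;   /* true once the first tuple of a cycle is done */
    int64_t   counter;   /* runtime accumulated in the current cycle */
    int64_t   firsttuple;/* time to first tuple in the current cycle */
    int64_t   startup;   /* startup time summed over completed cycles */
    double    nloops;    /* number of completed cycles */
} DemoNodeInstr;

static void
demo_start(DemoInstr *instr, int64_t now)
{
    assert(instr->starttime == 0);      /* "called twice in a row" */
    instr->starttime = now;
}

/* General purpose stop: elapsed time goes directly into total. */
static void
demo_stop(DemoInstr *instr, int64_t now)
{
    assert(instr->starttime != 0);      /* "called without start" */
    instr->total += now - instr->starttime;
    instr->starttime = 0;
}

/* Node stop: elapsed time goes into the per-cycle counter instead. */
static void
demo_stop_node(DemoNodeInstr *node, int64_t now)
{
    assert(node->instr.starttime != 0);
    node->counter += now - node->instr.starttime;
    node->instr.starttime = 0;

    /* First tuple of this cycle: remember how long it took to get here. */
    if (!node->running)
    {
        node->running = true;
        node->firsttuple = node->counter;
    }
}

/* End of a cycle: fold per-cycle values into the accumulated ones. */
static void
demo_end_loop(DemoNodeInstr *node)
{
    if (!node->running)
        return;
    node->startup += node->firsttuple;
    node->instr.total += node->counter;
    node->nloops += 1;
    node->running = false;
    node->counter = 0;
    node->firsttuple = 0;
}
```

In this sketch a node that is started and stopped twice in one cycle has
total == 0 until demo_end_loop runs, which is why the node variant cannot
simply reuse the general purpose stop function.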
[application/octet-stream] v13-0003-instrumentation-Use-Instrumentation-instead-of-m.patch (19.3K, 3-v13-0003-instrumentation-Use-Instrumentation-instead-of-m.patch)
From 1b84f1215269f91f7eef02ca6237fb8355fba3da Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 5 Apr 2026 05:08:23 -0700
Subject: [PATCH v13 03/12] instrumentation: Use Instrumentation instead of
manual buffer tracking
This replaces different repeated code blocks that read pgBufferUsage /
pgWalUsage, and may have also been running a timer to measure activity,
with the new Instrumentation struct and associated helpers.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
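
The pattern this commit message describes (snapshot a global counter, do
the work, then accumulate the difference) can be sketched in isolation as
below. This is an illustration under assumptions, not the patch itself:
the Demo* types, the demo_pgBufferUsage global and the demo_* functions are
hypothetical stand-ins for BufferUsage, pgBufferUsage, InstrStart and
InstrStop, and only two counters are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for BufferUsage with just two counters. */
typedef struct DemoBufferUsage
{
    long shared_blks_hit;
    long shared_blks_read;
} DemoBufferUsage;

/* Hypothetical stand-in for the backend-global pgBufferUsage. */
static DemoBufferUsage demo_pgBufferUsage;

/* Hypothetical stand-in for the buffer part of Instrumentation. */
typedef struct DemoInstrumentation
{
    bool            need_bufusage;   /* set once at initialization */
    DemoBufferUsage bufusage_start;  /* snapshot taken at start */
    DemoBufferUsage bufusage;        /* accumulated usage */
} DemoInstrumentation;

/* dst += add - sub, field by field (the shape of BufferUsageAccumDiff). */
static void
demo_accum_diff(DemoBufferUsage *dst,
                const DemoBufferUsage *add,
                const DemoBufferUsage *sub)
{
    dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
    dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
}

/* Snapshot the global counter when measurement begins. */
static void
demo_instr_start(DemoInstrumentation *instr)
{
    if (instr->need_bufusage)
        instr->bufusage_start = demo_pgBufferUsage;
}

/* Accumulate whatever the global counter gained since the snapshot. */
static void
demo_instr_stop(DemoInstrumentation *instr)
{
    if (instr->need_bufusage)
        demo_accum_diff(&instr->bufusage,
                        &demo_pgBufferUsage, &instr->bufusage_start);
}

/* Simulates buffer activity by bumping the global counter. */
static void
demo_do_work(long hits, long reads)
{
    assert(hits >= 0 && reads >= 0);
    demo_pgBufferUsage.shared_blks_hit += hits;
    demo_pgBufferUsage.shared_blks_read += reads;
}
```

Because the stop function accumulates rather than overwrites, the same
struct can wrap several start/stop pairs (as the per-tuple serialization
receiver in this patch does) and the caller reads the sum afterwards.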
.../pg_stat_statements/pg_stat_statements.c | 67 +++++--------------
src/backend/access/heap/vacuumlazy.c | 15 ++---
src/backend/commands/analyze.c | 31 +++++----
src/backend/commands/explain.c | 44 ++++++------
src/backend/commands/explain_dr.c | 56 +++++++---------
src/backend/commands/prepare.c | 28 +++-----
src/include/commands/explain_dr.h | 5 +-
7 files changed, 94 insertions(+), 152 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index fbf32f0e72c..63975706b87 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -911,22 +911,11 @@ pgss_planner(Query *parse,
&& pgss_track_planning && query_string
&& parse->queryId != INT64CONST(0))
{
- instr_time start;
- instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
-
- /*
- * Similarly the planner could write some WAL records in some cases
- * (e.g. setting a hint bit with those being WAL-logged)
- */
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ /* Track time and buffer/WAL usage as the planner can access them. */
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -940,30 +929,20 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrStop(&instr);
nesting_level--;
}
PG_END_TRY();
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
parse->stmt_len,
PGSS_PLAN,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
0,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
@@ -1156,17 +1135,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
!IsA(parsetree, ExecuteStmt) &&
!IsA(parsetree, PrepareStmt))
{
- instr_time start;
- instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -1182,6 +1155,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrStop(&instr);
nesting_level--;
}
PG_END_TRY();
@@ -1196,9 +1170,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
* former value, which'd otherwise be a good idea.
*/
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
/*
* Track the total number of rows retrieved or affected by the utility
* statements of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED
@@ -1210,23 +1181,15 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
saved_stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
rows,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 88c71cd85b6..30f589c9207 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -637,8 +637,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ Instrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -654,6 +653,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -984,14 +985,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr);
+
if (verbose || params->log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params->log_vacuum_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -1000,12 +1001,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
+ BufferUsage bufferusage = instr->bufusage;
+ WalUsage walusage = instr->walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
total_blks_hit = bufferusage.shared_blks_hit +
bufferusage.local_blks_hit;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 49a5cdf579c..8472fc0c280 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -309,9 +309,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ Instrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -362,6 +360,9 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -742,12 +743,13 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr);
+
if (verbose || params->log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params->log_analyze_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -755,18 +757,15 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
-
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ BufferUsage bufusage = instr->bufusage;
+ WalUsage walusage = instr->walusage;
+
+ total_blks_hit = bufusage.shared_blks_hit +
+ bufusage.local_blks_hit;
+ total_blks_read = bufusage.shared_blks_read +
+ bufusage.local_blks_read;
+ total_blks_dirtied = bufusage.shared_blks_dirtied +
+ bufusage.local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e73dc129132..e7550a8ac46 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -324,13 +324,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation plan_instr = {0};
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ int instrument_options = INSTRUMENT_TIMER;
+
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ InstrInitOptions(&plan_instr, instrument_options);
if (es->memory)
{
@@ -348,15 +351,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(&plan_instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(&plan_instr);
if (es->memory)
{
@@ -364,16 +364,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
@@ -590,7 +583,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* grab serialization metrics before we destroy the DestReceiver */
if (es->serialize != EXPLAIN_SERIALIZE_NONE)
- serializeMetrics = GetSerializationMetrics(dest);
+ {
+ SerializeMetrics *metrics = GetSerializationMetrics(dest);
+
+ if (metrics)
+ memcpy(&serializeMetrics, metrics, sizeof(SerializeMetrics));
+ }
/* call the DestReceiver's destroy method even during explain */
dest->rDestroy(dest);
@@ -1019,7 +1017,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
ExplainIndentText(es);
if (es->timing)
appendStringInfo(es->str, "Serialization: time=%.3f ms output=" UINT64_FORMAT "kB format=%s\n",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
else
@@ -1027,10 +1025,10 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
- if (es->buffers && peek_buffer_usage(es, &metrics->bufferUsage))
+ if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
es->indent--;
}
}
@@ -1038,13 +1036,13 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
{
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
3, es);
ExplainPropertyUInteger("Output Volume", "kB",
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 3c96061cf32..34fe4f8f6dd 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,11 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ Instrumentation *instr = &myState->metrics.instr;
/* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ if (instr->need_timer || instr->need_bufusage)
+ InstrStart(instr);
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +182,9 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
- /* Update timing data */
- if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
-
- /* Update buffer metrics */
- if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ /* Stop per-tuple measurement */
+ if (instr->need_timer || instr->need_bufusage)
+ InstrStop(instr);
return true;
}
@@ -233,9 +220,17 @@ serializeAnalyzeStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
/* The output buffer is re-used across rows, as in printtup.c */
initStringInfo(&receiver->buf);
- /* Initialize results counters */
+ /* Initialize metrics and per-tuple instrumentation */
memset(&receiver->metrics, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(receiver->metrics.timeSpent);
+ {
+ int instrument_options = 0;
+
+ if (receiver->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (receiver->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ InstrInitOptions(&receiver->metrics.instr, instrument_options);
+ }
}
/*
@@ -290,22 +285,17 @@ CreateExplainSerializeDestReceiver(ExplainState *es)
}
/*
- * GetSerializationMetrics - collect metrics
+ * GetSerializationMetrics - get serialization metrics
*
- * We have to be careful here since the receiver could be an IntoRel
- * receiver if the subject statement is CREATE TABLE AS. In that
- * case, return all-zeroes stats.
+ * Returns a pointer to the SerializeMetrics inside the dest receiver,
+ * or NULL if the receiver is not a SerializeDestReceiver (e.g. an IntoRel
+ * receiver for CREATE TABLE AS).
*/
-SerializeMetrics
+SerializeMetrics *
GetSerializationMetrics(DestReceiver *dest)
{
- SerializeMetrics empty;
-
if (dest->mydest == DestExplainSerialize)
- return ((SerializeDestReceiver *) dest)->metrics;
-
- memset(&empty, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(empty.timeSpent);
+ return &((SerializeDestReceiver *) dest)->metrics;
- return empty;
+ return NULL;
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 876aad2100a..bf9f2eb6149 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -22,6 +22,7 @@
#include "catalog/pg_type.h"
#include "commands/createas.h"
#include "commands/explain.h"
+#include "executor/instrument.h"
#include "commands/explain_format.h"
#include "commands/explain_state.h"
#include "commands/prepare.h"
@@ -580,14 +581,17 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation plan_instr = {0};
+ int instrument_options = INSTRUMENT_TIMER;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ InstrInitOptions(&plan_instr, instrument_options);
+
if (es->memory)
{
/* See ExplainOneQuery about this */
@@ -598,9 +602,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(&plan_instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -635,8 +637,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(&plan_instr);
if (es->memory)
{
@@ -644,13 +645,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -660,7 +654,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/include/commands/explain_dr.h b/src/include/commands/explain_dr.h
index f98eaae1864..ab5c53023e1 100644
--- a/src/include/commands/explain_dr.h
+++ b/src/include/commands/explain_dr.h
@@ -23,11 +23,10 @@ typedef struct ExplainState ExplainState;
typedef struct SerializeMetrics
{
uint64 bytesSent; /* # of bytes serialized */
- instr_time timeSpent; /* time spent serializing */
- BufferUsage bufferUsage; /* buffers accessed during serialization */
+ Instrumentation instr; /* time and buffer usage */
} SerializeMetrics;
extern DestReceiver *CreateExplainSerializeDestReceiver(ExplainState *es);
-extern SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
+extern SerializeMetrics *GetSerializationMetrics(DestReceiver *dest);
#endif
--
2.47.1
[application/octet-stream] v13-0005-Parallel-Bitmap-Heap-Scan-Fix-EXPLAIN-reporting-.patch (5.9K, 4-v13-0005-Parallel-Bitmap-Heap-Scan-Fix-EXPLAIN-reporting-.patch)
From 11c9364bb33d6c6c7a8de9e26bc247e761cb5808 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 5 Apr 2026 03:39:46 -0700
Subject: [PATCH v13 05/12] Parallel Bitmap Heap Scan: Fix EXPLAIN reporting of
"Heap Blocks"
Fix the missing accumulation of "Heap Blocks" from parallel query workers
to the leader, causing EXPLAIN (ANALYZE) to only show the leader statistics,
significantly undercounting the true value.
Additionally, add a regression test covering EXPLAIN (ANALYZE) of a
Parallel Bitmap Heap Scan, which previously was not tested at all.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
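
The fix described in this commit message amounts to summing the leader's
counters with each worker's counters before printing. A minimal sketch of
that accumulation, assuming simplified types: DemoScanStats and
demo_total_heap_blocks are hypothetical names, and the worker array stands
in for the shared per-worker instrumentation that the patch iterates over.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-process bitmap heap scan counters. */
typedef struct DemoScanStats
{
    uint64_t exact_pages;
    uint64_t lossy_pages;
} DemoScanStats;

/*
 * Start from the leader's counters and add every worker's counters.
 * A NULL worker array models a scan that ran without parallel workers.
 */
static DemoScanStats
demo_total_heap_blocks(const DemoScanStats *leader,
                       const DemoScanStats *workers,
                       int num_workers)
{
    DemoScanStats total;

    assert(leader != NULL);
    total = *leader;

    if (workers != NULL)
    {
        for (int n = 0; n < num_workers; n++)
        {
            total.exact_pages += workers[n].exact_pages;
            total.lossy_pages += workers[n].lossy_pages;
        }
    }

    return total;
}
```

With a leader at exact=10 and two workers at exact=20 and exact=30, the
pre-fix behaviour corresponds to reporting 10 and the summed value is 60.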
src/backend/commands/explain.c | 33 +++++++++++++++++++++------
src/test/regress/expected/explain.out | 33 +++++++++++++++++++++++++++
src/test/regress/sql/explain.sql | 31 +++++++++++++++++++++++++
3 files changed, 90 insertions(+), 7 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e7550a8ac46..79bd4d9d69e 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -3919,26 +3919,45 @@ show_indexsearches_info(PlanState *planstate, ExplainState *es)
static void
show_tidbitmap_info(BitmapHeapScanState *planstate, ExplainState *es)
{
+ uint64 exact_pages;
+ uint64 lossy_pages;
+
if (!es->analyze)
return;
+ /* Start with leader's stats */
+ exact_pages = planstate->stats.exact_pages;
+ lossy_pages = planstate->stats.lossy_pages;
+
+ /* Accumulate worker stats into node-level totals */
+ if (planstate->sinstrument != NULL)
+ {
+ for (int n = 0; n < planstate->sinstrument->num_workers; n++)
+ {
+ BitmapHeapScanInstrumentation *si = &planstate->sinstrument->sinstrument[n];
+
+ exact_pages += si->exact_pages;
+ lossy_pages += si->lossy_pages;
+ }
+ }
+
if (es->format != EXPLAIN_FORMAT_TEXT)
{
ExplainPropertyUInteger("Exact Heap Blocks", NULL,
- planstate->stats.exact_pages, es);
+ exact_pages, es);
ExplainPropertyUInteger("Lossy Heap Blocks", NULL,
- planstate->stats.lossy_pages, es);
+ lossy_pages, es);
}
else
{
- if (planstate->stats.exact_pages > 0 || planstate->stats.lossy_pages > 0)
+ if (exact_pages > 0 || lossy_pages > 0)
{
ExplainIndentText(es);
appendStringInfoString(es->str, "Heap Blocks:");
- if (planstate->stats.exact_pages > 0)
- appendStringInfo(es->str, " exact=" UINT64_FORMAT, planstate->stats.exact_pages);
- if (planstate->stats.lossy_pages > 0)
- appendStringInfo(es->str, " lossy=" UINT64_FORMAT, planstate->stats.lossy_pages);
+ if (exact_pages > 0)
+ appendStringInfo(es->str, " exact=" UINT64_FORMAT, exact_pages);
+ if (lossy_pages > 0)
+ appendStringInfo(es->str, " lossy=" UINT64_FORMAT, lossy_pages);
appendStringInfoChar(es->str, '\n');
}
}
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..58c5a512d74 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,36 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- Test parallel bitmap heap scan reports per-worker heap block stats.
+CREATE FUNCTION check_parallel_bitmap_heap_scan() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+BEGIN
+ SET LOCAL enable_seqscan = off;
+ SET LOCAL enable_indexscan = off;
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL min_parallel_index_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1 WHERE hundred > 1' INTO plan_json;
+
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL AND node->>'Node Type' != 'Bitmap Heap Scan' LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+
+ RETURN COALESCE((node->>'Exact Heap Blocks')::int, 0) > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_parallel_bitmap_heap_scan() AS parallel_bitmap_instrumentation;
+ parallel_bitmap_instrumentation
+---------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_parallel_bitmap_heap_scan;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..bac97522053 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,34 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- Test parallel bitmap heap scan reports per-worker heap block stats.
+CREATE FUNCTION check_parallel_bitmap_heap_scan() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+BEGIN
+ SET LOCAL enable_seqscan = off;
+ SET LOCAL enable_indexscan = off;
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL min_parallel_index_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1 WHERE hundred > 1' INTO plan_json;
+
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL AND node->>'Node Type' != 'Bitmap Heap Scan' LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+
+ RETURN COALESCE((node->>'Exact Heap Blocks')::int, 0) > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_parallel_bitmap_heap_scan() AS parallel_bitmap_instrumentation;
+
+DROP FUNCTION check_parallel_bitmap_heap_scan;
--
2.47.1
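The accumulation that the 0005 commit message describes can be sketched outside the PostgreSQL tree as follows. This is only an illustration: HeapBlockStats, SharedHeapBlockStats and total_heap_blocks() are hypothetical stand-ins for the real executor types, not names from the patch. The point it shows is that the reported totals start from the leader's counters and then add every worker's counters, and that a missing shared area (a non-parallel scan) leaves the leader's values unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-process heap block counters. */
typedef struct HeapBlockStats
{
	uint64_t	exact_pages;
	uint64_t	lossy_pages;
} HeapBlockStats;

/* Hypothetical stand-in for the shared array of per-worker counters. */
typedef struct SharedHeapBlockStats
{
	int			num_workers;
	const HeapBlockStats *workers;
} SharedHeapBlockStats;

/*
 * Return node-level totals: the leader's counters plus all worker counters.
 * "shared" may be NULL when the scan did not run in parallel.
 */
static HeapBlockStats
total_heap_blocks(const HeapBlockStats *leader,
				  const SharedHeapBlockStats *shared)
{
	HeapBlockStats total;

	assert(leader != NULL);
	total = *leader;

	if (shared != NULL)
	{
		for (int n = 0; n < shared->num_workers; n++)
		{
			total.exact_pages += shared->workers[n].exact_pages;
			total.lossy_pages += shared->workers[n].lossy_pages;
		}
	}

	return total;
}
```

The regression test in the patch sets parallel_leader_participation = off, so the leader presumably contributes no heap blocks of its own there. Under that assumption, a check of "Exact Heap Blocks" > 0 can only pass if the worker counters are folded in, which is the case the sketch covers with a zeroed leader.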
[application/octet-stream] v13-0001-instrumentation-Separate-trigger-logic-from-othe.patch (10.1K, 5-v13-0001-instrumentation-Separate-trigger-logic-from-othe.patch)
From 9134b8275c44dfdabeee8d08649da1a4b5c75daa Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v13 01/12] instrumentation: Separate trigger logic from other
uses
Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/commands/explain.c | 19 ++++++++-----------
src/backend/commands/trigger.c | 22 +++++++++++-----------
src/backend/executor/execMain.c | 2 +-
src/backend/executor/instrument.c | 26 ++++++++++++++++++++++++++
src/include/executor/instrument.h | 12 ++++++++++++
src/include/nodes/execnodes.h | 3 ++-
src/tools/pgindent/typedefs.list | 1 +
7 files changed, 61 insertions(+), 24 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e4b70166b0e..eb6ef23c2d6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1101,18 +1101,15 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
for (nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
{
Trigger *trig = rInfo->ri_TrigDesc->triggers + nt;
- Instrumentation *instr = rInfo->ri_TrigInstrument + nt;
+ TriggerInstrumentation *tginstr = rInfo->ri_TrigInstrument + nt;
char *relname;
char *conname = NULL;
- /* Must clean up instrumentation state */
- InstrEndLoop(instr);
-
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
*/
- if (instr->ntuples == 0)
+ if (tginstr->firings == 0)
continue;
ExplainOpenGroup("Trigger", NULL, true, es);
@@ -1137,11 +1134,11 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
if (show_relname)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
- appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- INSTR_TIME_GET_MILLISEC(instr->total),
- instr->ntuples);
+ appendStringInfo(es->str, ": time=%.3f calls=%d\n",
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total),
+ tginstr->firings);
else
- appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
+ appendStringInfo(es->str, ": calls=%d\n", tginstr->firings);
}
else
{
@@ -1151,9 +1148,9 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Relation", relname, es);
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- INSTR_TIME_GET_MILLISEC(instr->total), 3,
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total), 3,
es);
- ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
+ ExplainPropertyInteger("Calls", NULL, tginstr->firings, es);
}
if (conname)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 90e94fb8a5a..4d4e96a5302 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -92,7 +92,7 @@ static bool TriggerEnabled(EState *estate, ResultRelInfo *relinfo,
static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2311,7 +2311,7 @@ static HeapTuple
ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2346,7 +2346,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2391,10 +2391,10 @@ ExecCallTriggerFunc(TriggerData *trigdata,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
return (HeapTuple) DatumGetPointer(result);
}
@@ -3947,7 +3947,7 @@ static void AfterTriggerExecute(EState *estate,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2);
@@ -4342,7 +4342,7 @@ AfterTriggerExecute(EState *estate,
ResultRelInfo *src_relInfo,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
- FmgrInfo *finfo, Instrumentation *instr,
+ FmgrInfo *finfo, TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2)
@@ -4383,7 +4383,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4600,10 +4600,10 @@ AfterTriggerExecute(EState *estate,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
}
@@ -4719,7 +4719,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
Relation rel = NULL;
TriggerDesc *trigdesc = NULL;
FmgrInfo *finfo = NULL;
- Instrumentation *instr = NULL;
+ TriggerInstrumentation *instr = NULL;
TupleTableSlot *slot1 = NULL,
*slot2 = NULL;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 45e00c6af85..0237d8c3b1d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1285,7 +1285,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
}
else
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index a40610bc252..9354ad7be12 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -196,6 +196,32 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
WalUsageAdd(&dst->walusage, &add->walusage);
}
+/* Trigger instrumentation handling */
+TriggerInstrumentation *
+InstrAllocTrigger(int n, int instrument_options)
+{
+ TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ int i;
+
+ for (i = 0; i < n; i++)
+ InstrInit(&tginstr[i].instr, instrument_options);
+
+ return tginstr;
+}
+
+void
+InstrStartTrigger(TriggerInstrumentation *tginstr)
+{
+ InstrStartNode(&tginstr->instr);
+}
+
+void
+InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
+{
+ InstrStopNode(&tginstr->instr, 0);
+ tginstr->firings += firings;
+}
+
/* note current values during parallel executor startup */
void
InstrStartParallelQuery(void)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..a9c2233227f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -100,6 +100,13 @@ typedef struct WorkerInstrumentation
Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
+typedef struct TriggerInstrumentation
+{
+ Instrumentation instr;
+ int firings; /* number of times the instrumented trigger
+ * was fired */
+} TriggerInstrumentation;
+
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
@@ -111,6 +118,11 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+
+extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
+extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
+
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 090cfccf65f..908898aa7c9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -60,6 +60,7 @@ typedef struct ScanKeyData ScanKeyData;
typedef struct SnapshotData *Snapshot;
typedef struct SortSupportData *SortSupport;
typedef struct TIDBitmap TIDBitmap;
+typedef struct TriggerInstrumentation TriggerInstrumentation;
typedef struct TupleConversionMap TupleConversionMap;
typedef struct TupleDescData *TupleDesc;
typedef struct Tuplesortstate Tuplesortstate;
@@ -552,7 +553,7 @@ typedef struct ResultRelInfo
ExprState **ri_TrigWhenExprs;
/* optional runtime measurements for triggers */
- Instrumentation *ri_TrigInstrument;
+ TriggerInstrumentation *ri_TrigInstrument;
/* On-demand created slots for triggers / returning processing */
TupleTableSlot *ri_ReturningSlot; /* for trigger output tuples */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index c72f6c59573..7ddf970fb97 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3213,6 +3213,7 @@ TriggerDesc
TriggerEvent
TriggerFlags
TriggerInfo
+TriggerInstrumentation
TriggerTransition
TruncateStmt
TsmRoutine
--
2.47.1
[application/octet-stream] v13-0004-instrumentation-Replace-direct-changes-of-pgBuff.patch (9.0K, 6-v13-0004-instrumentation-Replace-direct-changes-of-pgBuff.patch)
From 5fc2d2519634b6a13658f3ec81c205190dfcfff1 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Thu, 26 Mar 2026 23:31:04 -0700
Subject: [PATCH v13 04/12] instrumentation: Replace direct changes of
pgBufferUsage/pgWalUsage with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
---
src/backend/access/transam/xlog.c | 10 +++++-----
src/backend/storage/buffer/bufmgr.c | 20 ++++++++++----------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
6 files changed, 45 insertions(+), 26 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9e8999bbb61..71c9a265662 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1103,10 +1103,10 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
- pgWalUsage.wal_fpi_bytes += fpi_bytes;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
+ INSTR_WALUSAGE_ADD(wal_fpi_bytes, fpi_bytes);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2085,7 +2085,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = InvalidXLogRecPtr;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cc0b0bdd92..3e1c39160db 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -840,7 +840,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -861,7 +861,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
if (BufferTagsEqual(&tag, &bufHdr->tag))
{
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
UnpinBuffer(bufHdr);
@@ -1684,9 +1684,9 @@ TrackBufferHit(IOObject io_object, IOContext io_context,
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
@@ -2148,9 +2148,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it. Otherwise
@@ -3043,7 +3043,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -3189,7 +3189,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4601,7 +4601,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state.
@@ -5796,7 +5796,7 @@ MarkSharedBufferDirtyHint(Buffer buffer, BufferDesc *bufHdr, uint64 lockstate,
UnlockBufHdr(bufHdr);
}
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 396da84b25c..851b99056d5 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -218,7 +218,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -479,7 +479,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -510,7 +510,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u64(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c4afe4d368a..8b501dfcadd 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -475,13 +475,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -549,13 +549,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 2be26e92283..e3829d7fe7c 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b11d64633b5..d4769f3da7b 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -153,4 +153,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while (0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += (val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while (0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += (val); \
+ } while (0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
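To make concrete why routing every counter update through macros helps a later refactoring, here is a minimal standalone sketch. It is not the code of any later patch in this series: UsageCounters, current_usage, USAGE_INCR/USAGE_ADD and usage_enter()/usage_exit() are hypothetical names, and the design is inferred only from the cover letter's description (point a global at the current stack entry on node entry, increment through it, restore the previous entry on exit). Once call sites use a macro, only the macro body has to change to redirect increments from a global struct to whatever entry is current.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced set of usage counters. */
typedef struct UsageCounters
{
	int64_t		shared_blks_hit;
	int64_t		shared_blks_read;
} UsageCounters;

/* Hypothetical entry that receives activity outside any tracked node. */
static UsageCounters top_level_usage;

/* Hypothetical pointer to the entry that is currently being charged. */
static UsageCounters *current_usage = &top_level_usage;

/* Call sites stay the same; only the target behind the macro changes. */
#define USAGE_INCR(fld) do { \
		current_usage->fld++; \
	} while (0)
#define USAGE_ADD(fld,val) do { \
		current_usage->fld += (val); \
	} while (0)

/* Make "entry" current and return the previous entry for later restore. */
static UsageCounters *
usage_enter(UsageCounters *entry)
{
	UsageCounters *prev = current_usage;

	current_usage = entry;
	return prev;
}

/* Restore the entry that was current before the matching usage_enter(). */
static void
usage_exit(UsageCounters *entry, UsageCounters *prev)
{
	assert(current_usage == entry);
	current_usage = prev;
}
```

In this sketch each entry only sees activity that happened while it was current, so a parent's counters exclude its children. Reporting inclusive totals would need an extra accumulation step, which is deliberately left out here because this part of the thread does not specify how or when that happens.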
[application/octet-stream] v13-0007-instrumentation-Add-additional-regression-tests-.patch (25.1K, 7-v13-0007-instrumentation-Add-additional-regression-tests-.patch)
From 73d3d43fe305871f62332098e414f9db84fc133c Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 5 Apr 2026 03:48:32 -0700
Subject: [PATCH v13 07/12] instrumentation: Add additional regression tests
covering buffer usage
This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.
Testing this is challenging because there can be different sources of
buffer activity, so we rely on temporary tables where we can to prove
that activity was captured and not lost. This guards a future commit
that reworks some of the instrumentation logic and could otherwise break
the areas covered by these tests.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
.../pg_stat_statements/expected/utility.out | 70 ++++++
contrib/pg_stat_statements/expected/wal.out | 48 ++++
contrib/pg_stat_statements/sql/utility.sql | 56 +++++
contrib/pg_stat_statements/sql/wal.sql | 33 +++
src/test/regress/expected/explain.out | 228 ++++++++++++++++++
src/test/regress/sql/explain.sql | 226 +++++++++++++++++
6 files changed, 661 insertions(+)
diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index e4d6564ea5b..cba487f6be5 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -289,6 +289,76 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
1 | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
(3 rows)
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT run_explain_buffers_test();
+ run_explain_buffers_test
+--------------------------
+
+(1 row)
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+ query | has_buffer_stats
+-----------------------------------+------------------
+ SELECT run_explain_buffers_test() | t
+(1 row)
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+CALL pgss_call_rollback_proc();
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------------+-------+---------------+---------------------+-----------------------
+ CALL pgss_call_rollback_proc() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/expected/wal.out b/contrib/pg_stat_statements/expected/wal.out
index 977e382d848..611213daef6 100644
--- a/contrib/pg_stat_statements/expected/wal.out
+++ b/contrib/pg_stat_statements/expected/wal.out
@@ -28,3 +28,51 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
t
(1 row)
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT pgss_error_func();
+ pgss_error_func
+-----------------
+
+(1 row)
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------+-------+---------------+---------------------+-----------------------
+ SELECT pgss_error_func() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index dd97203c210..7540e49c73c 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -152,6 +152,62 @@ EXPLAIN (costs off) SELECT a FROM generate_series(1,10) AS tab(a) WHERE a = 7;
SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+SELECT run_explain_buffers_test();
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+CALL pgss_call_rollback_proc();
+
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
+
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/sql/wal.sql b/contrib/pg_stat_statements/sql/wal.sql
index 1dc1552a81e..467e321b206 100644
--- a/contrib/pg_stat_statements/sql/wal.sql
+++ b/contrib/pg_stat_statements/sql/wal.sql
@@ -18,3 +18,36 @@ wal_records > 0 as wal_records_generated,
wal_records >= rows as wal_records_ge_rows
FROM pg_stat_statements ORDER BY query COLLATE "C";
SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT pgss_error_func();
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index b307e810ca5..f630acd5f54 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -889,3 +889,231 @@ SELECT check_parallel_indexonly_scan() AS parallel_indexonly_instrumentation;
(1 row)
DROP FUNCTION check_parallel_indexonly_scan;
+-- Test parallel query reports similar buffer stats to a serial run
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_parallel_explain_buffers();
+ ratio
+-------
+ 1
+(1 row)
+
+DROP FUNCTION check_parallel_explain_buffers;
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+ exception_buffers_nested_visible
+----------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly reported.
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_cursor_explain_buffers();
+ noscroll_ok | scroll_ok
+-------------+-----------
+ t | t
+(1 row)
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+-- Test trigger instrumentation.
+CREATE TEMP TABLE trig_test_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int);
+INSERT INTO trig_work_tab VALUES (1);
+CREATE FUNCTION trig_test_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM * FROM trig_work_tab;
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_test_trig
+ BEFORE INSERT ON trig_test_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_test_func();
+CREATE FUNCTION check_trigger_explain_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ trig json;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ INSERT INTO trig_test_tab VALUES (1)' INTO plan_json;
+ trig := plan_json->0->'Triggers'->0;
+ RETURN COALESCE((trig->>'Calls')::int, 0) > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_trigger_explain_buffers() AS trigger_buffers_visible;
+ trigger_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_trigger_explain_buffers;
+DROP TRIGGER trig_test_trig ON trig_test_tab;
+DROP FUNCTION trig_test_func;
+DROP TABLE trig_test_tab;
+DROP TABLE trig_work_tab;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index 3a13fa6ca69..74f605739f1 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -251,3 +251,229 @@ $$ LANGUAGE plpgsql;
SELECT check_parallel_indexonly_scan() AS parallel_indexonly_instrumentation;
DROP FUNCTION check_parallel_indexonly_scan;
+
+-- Test parallel query reports similar buffer stats to a serial run
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+ plan_json json;
+ serial_buffers int;
+ parallel_buffers int;
+ node json;
+BEGIN
+ -- Serial --
+ SET LOCAL max_parallel_workers_per_gather = 0;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ serial_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ -- Parallel --
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1' INTO plan_json;
+ node := plan_json->0->'Plan';
+ parallel_buffers :=
+ COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+ COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+ RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_parallel_explain_buffers();
+
+DROP FUNCTION check_parallel_explain_buffers;
+
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly reported.
+
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_cursor_explain_buffers();
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+
+-- Test trigger instrumentation.
+CREATE TEMP TABLE trig_test_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int);
+INSERT INTO trig_work_tab VALUES (1);
+
+CREATE FUNCTION trig_test_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM * FROM trig_work_tab;
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trig_test_trig
+ BEFORE INSERT ON trig_test_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_test_func();
+
+CREATE FUNCTION check_trigger_explain_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ trig json;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ INSERT INTO trig_test_tab VALUES (1)' INTO plan_json;
+ trig := plan_json->0->'Triggers'->0;
+ RETURN COALESCE((trig->>'Calls')::int, 0) > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_trigger_explain_buffers() AS trigger_buffers_visible;
+
+DROP FUNCTION check_trigger_explain_buffers;
+DROP TRIGGER trig_test_trig ON trig_test_tab;
+DROP FUNCTION trig_test_func;
+DROP TABLE trig_test_tab;
+DROP TABLE trig_work_tab;
--
2.47.1
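
The exception tests in the patch above check that buffer/WAL activity recorded inside a rolled-back subtransaction still shows up in the enclosing statement's totals. A minimal sketch of how a counter stack could preserve counts while unwinding is given below. All names (DemoUsage, demo_push, demo_pop, demo_unwind_to, and so on) are hypothetical, and folding child counts into the parent on pop is an assumption made for illustration, not the patch's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified counters; the real structs carry many more fields. */
typedef struct DemoUsage
{
    long local_blks_hit;
    long wal_records;
} DemoUsage;

#define DEMO_MAX_DEPTH 16

static DemoUsage *demo_stack[DEMO_MAX_DEPTH];
static int demo_depth = 0;

/* Current stack depth; callers save this before entering code that may fail. */
int
demo_current_depth(void)
{
    return demo_depth;
}

/* Enter a node or statement: its entry becomes the target for new counts. */
void
demo_push(DemoUsage *entry)
{
    assert(demo_depth < DEMO_MAX_DEPTH);
    demo_stack[demo_depth++] = entry;
}

/* Count one buffer hit against whichever entry is currently on top. */
void
demo_count_hit(void)
{
    if (demo_depth > 0)
        demo_stack[demo_depth - 1]->local_blks_hit++;
}

/* Count one WAL record against whichever entry is currently on top. */
void
demo_count_wal_record(void)
{
    if (demo_depth > 0)
        demo_stack[demo_depth - 1]->wal_records++;
}

/* Leave the top entry, folding its counts into the entry below it (assumption). */
void
demo_pop(void)
{
    DemoUsage *child;

    assert(demo_depth > 0);
    child = demo_stack[--demo_depth];
    if (demo_depth > 0)
    {
        DemoUsage *parent = demo_stack[demo_depth - 1];

        parent->local_blks_hit += child->local_blks_hit;
        parent->wal_records += child->wal_records;
    }
}

/* Error path: unwind to a previously saved depth without dropping any counts. */
void
demo_unwind_to(int saved_depth)
{
    while (demo_depth > saved_depth)
        demo_pop();
}
```

In this sketch an exception handler would call demo_unwind_to() with the depth saved at block entry, so counts from the aborted inner entries end up in the surrounding entry rather than being lost.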
[application/octet-stream] v13-0006-Add-regression-test-coverage-for-EXPLAIN-of-Para.patch (3.9K, 8-v13-0006-Add-regression-test-coverage-for-EXPLAIN-of-Para.patch)
From eb3be81df13b3b7ded84db0019bb68105ce3163a Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 5 Apr 2026 03:48:22 -0700
Subject: [PATCH v13 06/12] Add regression test coverage for EXPLAIN of Parallel Index Only Scans
The functions dealing with copying back parallel worker instrumentation
such as ExecIndexOnlyScanRetrieveInstrumentation were not exercised
at all in the regression tests, leading to a gap in coverage. Add a
query that verifies we correctly copy back "Index Searches" for
EXPLAIN ANALYZE of a Parallel Index Only Scan.
Reported-by: Andres Freund <[email protected]>
Author: Lukas Fittl <[email protected]>
Discussion:
---
src/test/regress/expected/explain.out | 34 +++++++++++++++++++++++++++
src/test/regress/sql/explain.sql | 32 +++++++++++++++++++++++++
2 files changed, 66 insertions(+)
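
The test added by this patch walks the JSON plan from the root through each node's first child until it reaches an "Index Only Scan" node, then checks that "Index Searches" is greater than zero. The same drill-down can be sketched in C over a toy tree. DemoPlanNode and both functions are hypothetical names used only for illustration; they are not PostgreSQL APIs.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one node of an EXPLAIN (FORMAT JSON) plan tree. */
typedef struct DemoPlanNode
{
    const char *node_type;              /* e.g. "Index Only Scan" */
    long index_searches;                /* 0 when the key is absent */
    const struct DemoPlanNode *first_child; /* NULL for a leaf */
} DemoPlanNode;

/*
 * Follow first children until a node of the wanted type is found or a
 * leaf is reached. Like the PL/pgSQL loop, this returns the last node
 * visited, which may be a leaf of a different type.
 */
const DemoPlanNode *
demo_drill_down(const DemoPlanNode *node, const char *wanted)
{
    while (node != NULL &&
           node->first_child != NULL &&
           strcmp(node->node_type, wanted) != 0)
        node = node->first_child;
    return node;
}

/* Mirrors the test's final check: did the reached node report any searches? */
int
demo_has_index_searches(const DemoPlanNode *root)
{
    const DemoPlanNode *node = demo_drill_down(root, "Index Only Scan");

    return node != NULL && node->index_searches > 0;
}
```

Because only the first child is followed, the check relies on the scan sitting on the leftmost path of the plan, as it does for a simple aggregate over a single table.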
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 58c5a512d74..b307e810ca5 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -855,3 +855,37 @@ SELECT check_parallel_bitmap_heap_scan() AS parallel_bitmap_instrumentation;
(1 row)
DROP FUNCTION check_parallel_bitmap_heap_scan;
+-- Test parallel index-only scan reports per-worker index search stats.
+CREATE FUNCTION check_parallel_indexonly_scan() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+BEGIN
+ SET LOCAL enable_seqscan = off;
+ SET LOCAL enable_bitmapscan = off;
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_index_scan_size = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1 WHERE thousand > 95' INTO plan_json;
+
+ -- Drill down to the Index Only Scan node
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL AND node->>'Node Type' != 'Index Only Scan' LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+
+ RETURN COALESCE((node->>'Index Searches')::int, 0) > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_parallel_indexonly_scan() AS parallel_indexonly_instrumentation;
+ parallel_indexonly_instrumentation
+------------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_parallel_indexonly_scan;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index bac97522053..3a13fa6ca69 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -219,3 +219,35 @@ $$ LANGUAGE plpgsql;
SELECT check_parallel_bitmap_heap_scan() AS parallel_bitmap_instrumentation;
DROP FUNCTION check_parallel_bitmap_heap_scan;
+
+-- Test parallel index-only scan reports per-worker index search stats.
+CREATE FUNCTION check_parallel_indexonly_scan() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+BEGIN
+ SET LOCAL enable_seqscan = off;
+ SET LOCAL enable_bitmapscan = off;
+ SET LOCAL parallel_setup_cost = 0;
+ SET LOCAL parallel_tuple_cost = 0;
+ SET LOCAL min_parallel_index_scan_size = 0;
+ SET LOCAL min_parallel_table_scan_size = 0;
+ SET LOCAL max_parallel_workers_per_gather = 2;
+ SET LOCAL parallel_leader_participation = off;
+
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT count(*) FROM tenk1 WHERE thousand > 95' INTO plan_json;
+
+ -- Drill down to the Index Only Scan node
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL AND node->>'Node Type' != 'Index Only Scan' LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+
+ RETURN COALESCE((node->>'Index Searches')::int, 0) > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_parallel_indexonly_scan() AS parallel_indexonly_instrumentation;
+
+DROP FUNCTION check_parallel_indexonly_scan;
--
2.47.1
[application/octet-stream] v13-0009-instrumentation-Use-Instrumentation-struct-for-p.patch (29.2K, 9-v13-0009-instrumentation-Use-Instrumentation-struct-for-p.patch)
From 9365bbd06e42fc452d400b870e2334ccc32ade8e Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 15 Mar 2026 21:44:58 -0700
Subject: [PATCH v13 09/12] instrumentation: Use Instrumentation struct for parallel workers
This simplifies the DSM allocations a bit since we don't need to
separately allocate WAL and buffer usage, and allows the easier future
addition of a third stack-based struct being discussed.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/access/brin/brin.c | 43 ++++++-----------
src/backend/access/gin/gininsert.c | 43 ++++++-----------
src/backend/access/nbtree/nbtsort.c | 43 ++++++-----------
src/backend/commands/vacuumparallel.c | 52 ++++++++-------------
src/backend/executor/execParallel.c | 66 ++++++++++++---------------
src/backend/executor/instrument.c | 14 +++---
src/include/executor/execParallel.h | 5 +-
src/include/executor/instrument.h | 4 +-
8 files changed, 99 insertions(+), 171 deletions(-)
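
As a rough illustration of the simplification described in the commit message, the sketch below keeps one array of a combined struct per worker instead of two parallel arrays. It is self-contained and uses hypothetical names throughout (DemoBufferUsage, DemoWalUsage, DemoInstrumentation, demo_alloc_worker_slots, demo_worker_report, demo_leader_accum). Plain calloc stands in for the shared-memory allocation, and the field set is an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, reduced versions of the buffer and WAL counters. */
typedef struct DemoBufferUsage
{
    long shared_blks_hit;
    long shared_blks_read;
} DemoBufferUsage;

typedef struct DemoWalUsage
{
    long wal_records;
    unsigned long long wal_bytes;
} DemoWalUsage;

/* Combined struct: one slot per worker holds both kinds of counters. */
typedef struct DemoInstrumentation
{
    DemoBufferUsage bufusage;
    DemoWalUsage walusage;
} DemoInstrumentation;

/* One chunk, one key: a single array with a slot for each worker. */
DemoInstrumentation *
demo_alloc_worker_slots(int nworkers)
{
    assert(nworkers > 0);
    return calloc((size_t) nworkers, sizeof(DemoInstrumentation));
}

/* Worker side: publish the worker's final counters into its own slot. */
void
demo_worker_report(DemoInstrumentation *slots, int worker_number,
                   const DemoInstrumentation *local)
{
    slots[worker_number] = *local;
}

/* Leader side: add one worker's slot to the leader's running totals. */
void
demo_leader_accum(DemoInstrumentation *total,
                  const DemoInstrumentation *worker)
{
    total->bufusage.shared_blks_hit += worker->bufusage.shared_blks_hit;
    total->bufusage.shared_blks_read += worker->bufusage.shared_blks_read;
    total->walusage.wal_records += worker->walusage.wal_records;
    total->walusage.wal_bytes += worker->walusage.wal_bytes;
}
```

With a single array, the leader's loop over launched workers passes one pointer per worker, which matches the shape of the one-argument InstrAccumParallelQuery calls in the diff below.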
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3a5176c76c7..9e545b4ef0e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -51,8 +51,7 @@
#define PARALLEL_KEY_BRIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -148,8 +147,7 @@ typedef struct BrinLeader
BrinShared *brinshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BrinLeader;
/*
@@ -2387,8 +2385,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BrinShared *brinshared;
Sharedsort *sharedsort;
BrinLeader *brinleader = palloc0_object(BrinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -2430,18 +2427,14 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -2514,15 +2507,12 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -2533,8 +2523,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
brinleader->snapshot = snapshot;
- brinleader->walusage = walusage;
- brinleader->bufferusage = bufferusage;
+ brinleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -2573,7 +2562,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2888,8 +2877,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2950,11 +2938,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0d80f72a0b0..f3de62ce7f3 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -45,8 +45,7 @@
#define PARALLEL_KEY_GIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -138,8 +137,7 @@ typedef struct GinLeader
GinBuildShared *ginshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} GinLeader;
typedef struct
@@ -945,8 +943,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
GinBuildShared *ginshared;
Sharedsort *sharedsort;
GinLeader *ginleader = palloc0_object(GinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -987,18 +984,14 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1066,15 +1059,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1085,8 +1075,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
ginleader->ginshared = ginshared;
ginleader->sharedsort = sharedsort;
ginleader->snapshot = snapshot;
- ginleader->walusage = walusage;
- ginleader->bufferusage = bufferusage;
+ ginleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1125,7 +1114,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2119,8 +2108,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2200,11 +2188,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2d7b7cef912..cb238f862a7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -66,8 +66,7 @@
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
#define PARALLEL_KEY_TUPLESORT_SPOOL2 UINT64CONST(0xA000000000000003)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000004)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xA000000000000005)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xA000000000000006)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xA000000000000005)
/*
* DISABLE_LEADER_PARTICIPATION disables the leader's participation in
@@ -195,8 +194,7 @@ typedef struct BTLeader
Sharedsort *sharedsort;
Sharedsort *sharedsort2;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BTLeader;
/*
@@ -1408,8 +1406,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
Sharedsort *sharedsort2;
BTSpool *btspool = buildstate->spool;
BTLeader *btleader = palloc0_object(BTLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -1462,18 +1459,14 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1560,15 +1553,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1580,8 +1570,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btleader->sharedsort = sharedsort;
btleader->sharedsort2 = sharedsort2;
btleader->snapshot = snapshot;
- btleader->walusage = walusage;
- btleader->bufferusage = bufferusage;
+ btleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1620,7 +1609,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1754,8 +1743,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
#ifdef BTREE_BUILD_STATS
@@ -1837,11 +1825,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
sharedsort2, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index c330c891c03..b5fed54fb85 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -47,9 +47,8 @@
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 2
-#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 3
-#define PARALLEL_VACUUM_KEY_WAL_USAGE 4
-#define PARALLEL_VACUUM_KEY_INDEX_STATS 5
+#define PARALLEL_VACUUM_KEY_INSTRUMENTATION 3
+#define PARALLEL_VACUUM_KEY_INDEX_STATS 4
/*
* Shared information among parallel workers. So this is allocated in the DSM
@@ -188,11 +187,8 @@ struct ParallelVacuumState
/* Shared dead items space among parallel vacuum workers */
TidStore *dead_items;
- /* Points to buffer usage area in DSM */
- BufferUsage *buffer_usage;
-
- /* Points to WAL usage area in DSM */
- WalUsage *wal_usage;
+ /* Points to instrumentation area in DSM */
+ Instrumentation *instr;
/*
* False if the index is totally unsuitable target for all parallel
@@ -250,8 +246,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PVShared *shared;
TidStore *dead_items;
PVIndStats *indstats;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *instr;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
@@ -304,18 +299,15 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage and WalUsage --
- * PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
+ * Estimate space for Instrumentation --
+ * PARALLEL_VACUUM_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
@@ -396,17 +388,13 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
pvs->shared = shared;
/*
- * Allocate space for each worker's BufferUsage and WalUsage; no need to
- * initialize
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
*/
- buffer_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, buffer_usage);
- pvs->buffer_usage = buffer_usage;
- wal_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
- pvs->wal_usage = wal_usage;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, instr);
+ pvs->instr = instr;
/* Store query string for workers */
if (debug_query_string)
@@ -749,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->instr[i]);
}
/*
@@ -1007,8 +995,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVShared *shared;
TidStore *dead_items;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *worker_instr;
int nindexes;
char *sharedquery;
ErrorContextCallback errcallback;
@@ -1102,11 +1089,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
parallel_vacuum_process_safe_indexes(&pvs);
/* Report buffer/WAL usage during parallel execution */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index c01e780f918..2e57136edfd 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -60,13 +60,12 @@
#define PARALLEL_KEY_EXECUTOR_FIXED UINT64CONST(0xE000000000000001)
#define PARALLEL_KEY_PLANNEDSTMT UINT64CONST(0xE000000000000002)
#define PARALLEL_KEY_PARAMLISTINFO UINT64CONST(0xE000000000000003)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xE000000000000004)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000004)
#define PARALLEL_KEY_TUPLE_QUEUE UINT64CONST(0xE000000000000005)
-#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000006)
+#define PARALLEL_KEY_NODE_INSTRUMENTATION UINT64CONST(0xE000000000000006)
#define PARALLEL_KEY_DSA UINT64CONST(0xE000000000000007)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -631,8 +630,6 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_data;
char *pstmt_space;
char *paramlistinfo_space;
- BufferUsage *bufusage_space;
- WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
@@ -696,21 +693,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage.
+ * Estimate space for Instrumentation.
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
* looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
-
- /*
- * Same thing for WalUsage.
- */
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Estimate space for tuple queues. */
@@ -796,17 +786,18 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMLISTINFO, paramlistinfo_space);
SerializeParamList(estate->es_param_list_info, &paramlistinfo_space);
- /* Allocate space for each worker's BufferUsage; no need to initialize. */
- bufusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufusage_space);
- pei->buffer_usage = bufusage_space;
+ /*
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
+ */
+ {
+ Instrumentation *instr;
- /* Same for WalUsage. */
- walusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
- pei->wal_usage = walusage_space;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
+ pei->instrumentation = instr;
+ }
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -832,9 +823,9 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
InstrInitNode(&instrument[i], estate->es_instrument->instrument_options);
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_NODE_INSTRUMENTATION,
instrumentation);
- pei->instrumentation = instrumentation;
+ pei->node_instrumentation = instrumentation;
if (estate->es_jit_flags != PGJIT_NONE)
{
@@ -1236,7 +1227,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
* finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->instrumentation[i]);
pei->finished = true;
}
@@ -1250,11 +1241,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
- /* Accumulate instrumentation, if any. */
- if (pei->instrumentation)
+ /* Accumulate node instrumentation, if any. */
+ if (pei->node_instrumentation)
{
ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
+ pei->node_instrumentation);
ExecFinalizeWorkerInstrumentation(pei->planstate);
}
@@ -1481,8 +1472,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
@@ -1497,7 +1486,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Set up DestReceiver, SharedExecutorInstrumentation, and QueryDesc. */
receiver = ExecParallelGetReceiver(seg, toc);
- instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
+ instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_NODE_INSTRUMENTATION, true);
if (instrumentation != NULL)
instrument_options = instrumentation->instrument_options;
jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
@@ -1555,11 +1544,12 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecutorFinish(queryDesc);
/* Report buffer/WAL usage during parallel execution. */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ {
+ Instrumentation *worker_instr;
+
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
+ }
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f9202b558d6..af64aa145eb 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -339,11 +339,12 @@ InstrStartParallelQuery(void)
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst)
{
InstrQueryStopFinalize(qinstr);
- memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
- memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+ dst->need_stack = qinstr->instr.need_stack;
+ memcpy(&dst->bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(&dst->walusage, &qinstr->instr.walusage, sizeof(WalUsage));
}
/*
@@ -359,12 +360,11 @@ InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUs
* activity is accumulated.
*/
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(Instrumentation *instr)
{
- BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
- WalUsageAdd(&instr_stack.current->walusage, walusage);
+ InstrAccumStack(instr_stack.current, instr);
- WalUsageAdd(&pgWalUsage, walusage);
+ WalUsageAdd(&pgWalUsage, &instr->walusage);
}
/* Node instrumentation handling */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..6c8b602d07f 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -25,9 +25,8 @@ typedef struct ParallelExecutorInfo
{
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
- BufferUsage *buffer_usage; /* points to bufusage area in DSM */
- WalUsage *wal_usage; /* walusage area in DSM */
- SharedExecutorInstrumentation *instrumentation; /* optional */
+ Instrumentation *instrumentation; /* instrumentation area in DSM */
+ SharedExecutorInstrumentation *node_instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
dsa_pointer param_exec; /* serialized PARAM_EXEC parameters */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d2f0191af27..b62619412a0 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -286,8 +286,8 @@ extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst);
+extern void InstrAccumParallelQuery(Instrumentation *instr);
extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
--
2.47.1
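For readers skimming the hunks above: they replace the two per-worker arrays (BufferUsage and WalUsage, each with its own TOC key) with a single per-worker array of a combined struct. Below is a minimal standalone sketch of that layout, not code from the patch. All Demo* and demo_* names are hypothetical, the counters are reduced to one field each, and calloc() merely stands in for shm_toc_allocate().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-ins for BufferUsage and WalUsage. */
typedef struct DemoBufUsage
{
	int64_t		blks_hit;
} DemoBufUsage;

typedef struct DemoWalUsage
{
	int64_t		wal_records;
} DemoWalUsage;

/* One combined slot per worker, instead of two parallel arrays. */
typedef struct DemoWorkerInstr
{
	DemoBufUsage bufusage;
	DemoWalUsage walusage;
} DemoWorkerInstr;

/* Leader: a single chunk sized nworkers * sizeof(combined struct). */
DemoWorkerInstr *
demo_alloc_slots(size_t nworkers)
{
	/* calloc() is only a stand-in for the shared memory allocation */
	return calloc(nworkers, sizeof(DemoWorkerInstr));
}

/* Worker: copy the locally collected totals into its own slot. */
void
demo_worker_report(DemoWorkerInstr *slots, int worker_number,
				   const DemoWorkerInstr *local)
{
	slots[worker_number] = *local;
}

/* Leader: after all workers have finished, add up the launched slots. */
void
demo_leader_accum(DemoWorkerInstr *total, const DemoWorkerInstr *slots,
				  int nworkers_launched)
{
	for (int i = 0; i < nworkers_launched; i++)
	{
		total->bufusage.blks_hit += slots[i].bufusage.blks_hit;
		total->walusage.wal_records += slots[i].walusage.wal_records;
	}
}
```

With one slot per worker, the leader needs a single lookup key and a single pointer, which is roughly what the removal of the separate *_WAL_USAGE keys in the hunks above amounts to.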
[application/octet-stream] v13-0008-Optimize-measuring-WAL-buffer-usage-through-stac.patch (89.6K, 10-v13-0008-Optimize-measuring-WAL-buffer-usage-through-stac.patch)
From 9386bc74a560dd979b1d2b4484cdc1420ab86b39 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v13 08/12] Optimize measuring WAL/buffer usage through
stack-based instrumentation
Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that were
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode, and popped at InstrStopNode. Stack entries are zero
initialized (avoiding the diff mechanism) and get added to their parent
entry when they are finalized, i.e. no more modifications can occur.
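As an illustration only, the mechanism described above can be sketched in a few lines of standalone C. This is a simplified model under stated assumptions, not the patch's implementation: the Demo* types and demo_* functions are hypothetical names, the counter set is reduced to two fields, and entries are assumed to be zero-initialized by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced counter set; not the real BufferUsage layout. */
typedef struct DemoUsage
{
	int64_t		blks_hit;
	int64_t		blks_read;
} DemoUsage;

/* One stack entry per instrumented section; callers zero-initialize it. */
typedef struct DemoEntry
{
	DemoUsage	usage;
	struct DemoEntry *parent;
} DemoEntry;

/* Root entry, so there is always a current entry to write into. */
DemoEntry	demo_root;
DemoEntry  *demo_current = &demo_root;

/* Enter a section: later activity is counted against "entry". */
void
demo_push(DemoEntry *entry)
{
	entry->parent = demo_current;
	demo_current = entry;
}

/* Leave a section: restore the previous entry. No diff is computed. */
void
demo_pop(DemoEntry *entry)
{
	assert(demo_current == entry);
	demo_current = entry->parent;
}

/* Called at the point where the activity actually happens. */
void
demo_count_hit(void)
{
	demo_current->usage.blks_hit++;
}

/* Once an entry can no longer change, fold its totals into its parent. */
void
demo_finalize(const DemoEntry *entry)
{
	assert(entry->parent != NULL);
	entry->parent->usage.blks_hit += entry->usage.blks_hit;
	entry->parent->usage.blks_read += entry->usage.blks_read;
}
```

The point of the sketch is that push and pop are each a pointer assignment, whereas the diff approach has to subtract every counter field each time a section ends.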
To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct, and its
associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.
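To make the abort requirement concrete, here is a rough standalone sketch of the "restore the stack on both paths" idea. It is an assumption-laden model: setjmp/longjmp stands in for PostgreSQL's error handling, all demo_* names are hypothetical, and unlike PG_FINALLY the sketch returns a flag rather than re-throwing the error.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stack entry with a single counter. */
typedef struct DemoEntry
{
	long		hits;
	struct DemoEntry *parent;
} DemoEntry;

DemoEntry	demo_top;
DemoEntry  *demo_cur = &demo_top;
jmp_buf		demo_abort_target;

/* Stand-in for an error being raised deep inside the instrumented code. */
void
demo_fail(void)
{
	longjmp(demo_abort_target, 1);
}

/*
 * Run an instrumented section. Returns 1 if the section aborted, 0 otherwise.
 * Either way, the current stack pointer is put back to what it was on entry.
 */
int
demo_run_guarded(DemoEntry *entry, int should_fail)
{
	DemoEntry  *saved = demo_cur;
	int			aborted = 0;

	entry->parent = saved;
	demo_cur = entry;

	if (setjmp(demo_abort_target) == 0)
	{
		demo_cur->hits++;		/* activity counted against the entry */
		if (should_fail)
			demo_fail();
	}
	else
		aborted = 1;

	/* Cleanup shared by the normal and the abort path. */
	demo_cur = saved;
	return aborted;
}
```

Without the shared cleanup step, an abort would leave the global pointer aimed at an entry whose owner may already be gone, and later activity would be written into it.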
In tests, the stack-based instrumentation mechanism reduces the overhead
of EXPLAIN (ANALYZE, BUFFERS ON, TIMING OFF) for a large COUNT(*) query
from about 50% to 22% on top of the actual runtime.
This also drops the global pgBufferUsage; any callers interested in
measuring buffer activity should instead utilize InstrStart/InstrStop.
The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and heap_page_prune_and_freeze for
measuring FPIs.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
contrib/auto_explain/auto_explain.c | 16 +-
.../pg_stat_statements/pg_stat_statements.c | 24 +-
src/backend/access/brin/brin.c | 10 +-
src/backend/access/gin/gininsert.c | 10 +-
src/backend/access/heap/vacuumlazy.c | 12 +-
src/backend/access/nbtree/nbtsort.c | 10 +-
src/backend/commands/analyze.c | 12 +-
src/backend/commands/explain.c | 10 +-
src/backend/commands/explain_dr.c | 6 +-
src/backend/commands/prepare.c | 10 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/trigger.c | 17 +-
src/backend/commands/vacuumparallel.c | 10 +-
src/backend/executor/README.instrument | 237 +++++++++
src/backend/executor/execMain.c | 84 +++-
src/backend/executor/execParallel.c | 36 +-
src/backend/executor/execPartition.c | 2 +-
src/backend/executor/execProcnode.c | 103 +++-
src/backend/executor/execUtils.c | 11 +-
src/backend/executor/instrument.c | 468 ++++++++++++++----
src/backend/replication/logical/worker.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 6 +-
src/backend/utils/activity/pgstat_io.c | 6 +-
src/include/executor/execdesc.h | 4 +-
src/include/executor/executor.h | 5 +-
src/include/executor/instrument.h | 201 +++++++-
src/include/nodes/execnodes.h | 3 +-
src/include/utils/resowner.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
29 files changed, 1084 insertions(+), 236 deletions(-)
create mode 100644 src/backend/executor/README.instrument
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 39bf2543b70..4be81489ff4 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -305,19 +305,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
if (auto_explain_enabled())
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -382,7 +372,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
/* Log plan if duration is exceeded. */
- msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
+ msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total);
if (msec >= auto_explain_log_min_duration)
{
ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 63975706b87..78f1518c940 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -929,7 +929,7 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
- InstrStop(&instr);
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
@@ -994,19 +994,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
*/
if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -1068,10 +1058,10 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
queryDesc->plannedstmt->stmt_location,
queryDesc->plannedstmt->stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
+ INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total),
queryDesc->estate->es_total_processed,
- &queryDesc->totaltime->bufusage,
- &queryDesc->totaltime->walusage,
+ &queryDesc->totaltime->instr.bufusage,
+ &queryDesc->totaltime->instr.walusage,
queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
NULL,
queryDesc->estate->es_parallel_workers_to_launch,
@@ -1155,7 +1145,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
- InstrStop(&instr);
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index bdb30752e09..3a5176c76c7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2887,6 +2887,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2936,7 +2937,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2951,7 +2952,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 9d83a495775..0d80f72a0b0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -991,8 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2118,6 +2118,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2186,7 +2187,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2201,7 +2202,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 30f589c9207..291d9d67bc2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -637,7 +637,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- Instrumentation *instr = NULL;
+ QueryInstrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -653,8 +653,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
- instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
- InstrStart(instr);
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -985,7 +985,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
- InstrStop(instr);
+ InstrQueryStopFinalize(instr);
if (verbose || params->log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
@@ -1001,8 +1001,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
- BufferUsage bufferusage = instr->bufusage;
- WalUsage walusage = instr->walusage;
+ BufferUsage bufferusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 756dfa3dcf4..2d7b7cef912 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1466,8 +1466,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -1753,6 +1753,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1828,7 +1829,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1838,7 +1839,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8472fc0c280..10f8a2dc81c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -309,7 +309,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- Instrumentation *instr = NULL;
+ QueryInstrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -361,8 +361,8 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
pg_rusage_init(&ru0);
- instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
- InstrStart(instr);
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -743,7 +743,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
- InstrStop(instr);
+ InstrQueryStopFinalize(instr);
if (verbose || params->log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
@@ -757,8 +757,8 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
- BufferUsage bufusage = instr->bufusage;
- WalUsage walusage = instr->walusage;
+ BufferUsage bufusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
total_blks_hit = bufusage.shared_blks_hit +
bufusage.local_blks_hit;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 79bd4d9d69e..9fc39cabdf8 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -324,7 +324,7 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- Instrumentation plan_instr = {0};
+ QueryInstrumentation *plan_instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -333,7 +333,7 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
if (es->buffers)
instrument_options |= INSTRUMENT_BUFFERS;
- InstrInitOptions(&plan_instr, instrument_options);
+ plan_instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -351,12 +351,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- InstrStart(&plan_instr);
+ InstrQueryStart(plan_instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- InstrStop(&plan_instr);
+ InstrQueryStopFinalize(plan_instr);
if (es->memory)
{
@@ -366,7 +366,7 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
+ &plan_instr->instr.total, (es->buffers ? &plan_instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 34fe4f8f6dd..9c1b30fb75b 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -113,7 +113,7 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
Instrumentation *instr = &myState->metrics.instr;
/* only measure time, buffers if requested */
- if (instr->need_timer || instr->need_bufusage)
+ if (instr->need_timer || instr->need_stack)
InstrStart(instr);
/* Set or update my derived attribute info, if needed */
@@ -183,7 +183,7 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextReset(myState->tmpcontext);
/* Stop per-tuple measurement */
- if (instr->need_timer || instr->need_bufusage)
+ if (instr->need_timer || instr->need_stack)
InstrStop(instr);
return true;
@@ -241,6 +241,8 @@ serializeAnalyzeShutdown(DestReceiver *self)
{
SerializeDestReceiver *receiver = (SerializeDestReceiver *) self;
+ InstrFinalizeChild(&receiver->metrics.instr, instr_stack.current);
+
if (receiver->finfos)
pfree(receiver->finfos);
receiver->finfos = NULL;
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index bf9f2eb6149..ee811357588 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -581,7 +581,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- Instrumentation plan_instr = {0};
+ QueryInstrumentation *plan_instr = NULL;
int instrument_options = INSTRUMENT_TIMER;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
@@ -590,7 +590,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (es->buffers)
instrument_options |= INSTRUMENT_BUFFERS;
- InstrInitOptions(&plan_instr, instrument_options);
+ plan_instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -602,7 +602,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- InstrStart(&plan_instr);
+ InstrQueryStart(plan_instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -637,7 +637,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- InstrStop(&plan_instr);
+ InstrQueryStopFinalize(plan_instr);
if (es->memory)
{
@@ -654,7 +654,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
+ &plan_instr->instr.total, (es->buffers ? &plan_instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0ce2e81f9c2..f72c1ac521a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2139,7 +2139,7 @@ ExecuteTruncateGuts(List *explicit_rels,
rel,
0, /* dummy rangetable index */
NULL,
- 0);
+ NULL);
estate->es_opened_result_relations =
lappend(estate->es_opened_result_relations, resultRelInfo);
resultRelInfo++;
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 4d4e96a5302..b8b8840345b 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -93,6 +93,7 @@ static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
TriggerInstrumentation *instr,
+ QueryInstrumentation *qinstr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2312,6 +2313,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
TriggerInstrumentation *instr,
+ QueryInstrumentation *qinstr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2346,7 +2348,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartTrigger(instr + tgindx);
+ InstrStartTrigger(qinstr, instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2441,6 +2443,7 @@ ExecBSInsertTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -2502,6 +2505,7 @@ ExecBRInsertTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2606,6 +2610,7 @@ ExecIRInsertTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2670,6 +2675,7 @@ ExecBSDeleteTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -2780,6 +2786,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2884,6 +2891,7 @@ ExecIRDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (rettuple == NULL)
return false; /* Delete was suppressed */
@@ -2942,6 +2950,7 @@ ExecBSUpdateTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -3094,6 +3103,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
@@ -3258,6 +3268,7 @@ ExecIRUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -3316,6 +3327,7 @@ ExecBSTruncateTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -4383,7 +4395,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartTrigger(instr + tgindx);
+ InstrStartTrigger(estate->es_instrument, instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4571,6 +4583,7 @@ AfterTriggerExecute(EState *estate,
tgindx,
finfo,
NULL,
+ NULL,
per_tuple_context);
if (rettuple != NULL &&
rettuple != LocTriggerData.tg_trigtuple &&
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 77834b96a21..c330c891c03 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -308,8 +308,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgBufferUsage or
- * pgWalUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1006,6 +1006,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1095,7 +1096,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1103,7 +1104,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/README.instrument b/src/backend/executor/README.instrument
new file mode 100644
index 00000000000..7df837dbc77
--- /dev/null
+++ b/src/backend/executor/README.instrument
@@ -0,0 +1,237 @@
+src/backend/executor/README.instrument
+
+Instrumentation
+===============
+
+The instrumentation subsystem measures time, buffer usage and WAL activity
+during query execution and other similar activities. It is used by
+EXPLAIN ANALYZE, pg_stat_statements, and other consumers that need
+activity and/or timing metrics over a section of code.
+
+The design has two central goals:
+
+* Make it cheap to measure activity in a section of code, even when
+ that section is called many times and the aggregate is what is used
+ (as is the case with per-node instrumentation in the executor)
+
+* Ensure nested instrumentation accurately measures activity/timing,
+ even when execution is aborted due to errors being thrown.
+
+The key data structures are defined in src/include/executor/instrument.h
+and the implementation lives in src/backend/executor/instrument.c.
+
+
+Instrumentation Options
+-----------------------
+
+Callers specify what to measure with a bitmask of InstrumentOption flags:
+
+ INSTRUMENT_ROWS -- row counts only (used with NodeInstrumentation)
+ INSTRUMENT_TIMER -- wall-clock timing and row counts
+ INSTRUMENT_BUFFERS -- buffer hit/read/dirtied/written counts and I/O time
+ INSTRUMENT_WAL -- WAL records, FPI, bytes
+
+INSTRUMENT_BUFFERS and INSTRUMENT_WAL utilize the instrumentation stack
+(described below) for efficient handling of counter values.
+
+
+Struct Hierarchy
+----------------
+
+The following instrumentation structs exist, each specialized for a
+different scope:
+
+Instrumentation Base struct. Holds timing and buffer/WAL counters.
+
+QueryInstrumentation Extends Instrumentation for query-level tracking. When
+ stack-based tracking is enabled, it owns a dedicated
+ MemoryContext and uses the ResourceOwner mechanism for
+ abort cleanup.
+
+NodeInstrumentation Extends Instrumentation for per-plan-node statistics
+ (startup time, tuple counts, loop counts, etc).
+
+TriggerInstrumentation Extends Instrumentation with a firing count.
+
+
+Stack-based Instrumentation
+===========================
+
+For tracking WAL or buffer usage counters, the specialized stack-based
+instrumentation is used.
+
+A simple approach to measuring buffer/WAL activity in a code section could be
+to have a set of global counters, snapshot all the counters at the start, and
+diff them at the end. But this is expensive in practice: BufferUsage alone
+has many fields, and the diff must be computed for every InstrStartNode /
+InstrStopNode cycle.
+
+An alternative is to write counter updates directly into the struct that
+should receive them, avoiding the diff. But that has two complexities: low-level
+code, such as the buffer manager, has no direct pointers to higher-level
+structs, such as plan nodes tracking buffer usage. And instrumentation is often
+nested: we might be interested in both the aggregate buffer usage of a query and
+the individual per-node details. Stack-based instrumentation addresses both:
+
+At all times, there is a stack that tracks which Instrumentation is currently
+active. The stack is represented by instr_stack, a per-backend global
+that holds a dynamic array of Instrumentation pointers. The field
+instr_stack.current always points to the current stack entry that should
+be updated when activity occurs. When the stack array is empty,
+instr_stack.current points to instr_top.
+
+For example, if a backend has two portals open, the overall nesting of
+Instrumentation and their respective InstrStart/InstrStop calls creates a
+tree-like structure like this:
+
+ Session (instr_top)
+ |
+ +-- Query A (QueryInstrumentation)
+ | |
+ | +-- NestLoop (NodeInstrumentation)
+ | |
+ | +-- Seq Scan A (NodeInstrumentation)
+ | +-- Seq Scan B (NodeInstrumentation)
+ |
+ +-- Query B (QueryInstrumentation)
+ |
+ +-- Seq Scan C (NodeInstrumentation)
+
+While executing Seq Scan B, the stack looks like:
+
+ instr_top (implicit bottom, not in the entries array)
+ 0: Query A
+ 1: NestLoop
+ 2: Seq Scan B <-- instr_stack.current
+
+When no query is running, the stack is empty (stack_size == 0) and
+instr_stack.current points to instr_top.
+
+Any buffer or WAL counter update (via the INSTR_BUFUSAGE_* and
+INSTR_WALUSAGE_* macros in the buffer manager, WAL insertion code, etc.)
+writes directly into instr_stack.current. Each instrumentation node starts
+zeroed, so the values it accumulates while on top of the stack represent
+exactly the activity that occurred during that time.
+
+Every Instrumentation node (except for instr_top) has a target, or parent, it
+will be accumulated into, which is typically the Instrumentation that was the
+current stack entry when it was created.
+
+For example, when Seq Scan A gets finalized in regular execution via ExecutorFinish,
+its instrumentation data is added to its immediate parent in the execution
+tree, the NestLoop. The NestLoop's data is in turn added to Query A's
+QueryInstrumentation, which then accumulates to its own parent.
+
+While we can typically think of this as a tree, the NodeInstrumentation
+underneath a particular QueryInstrumentation could behave differently --
+for example, it could propagate directly to the QueryInstrumentation, in
+order to not show cumulative numbers in EXPLAIN ANALYZE.
+
+Note these relationships are partially implicit, especially when it comes
+to NodeInstrumentation. Each QueryInstrumentation maintains a list of its
+unfinalized child nodes. The parent of a QueryInstrumentation itself is
+determined by the stack (see below): when a query is finalized or cleaned
+up on abort, its counters are accumulated to whatever entry is then current
+on the stack, which is typically instr_top.
+
+
+Finalization and Abort Safety
+=============================
+
+Finalization is the process of rolling up a node's buffer/WAL counters to
+its parent. In normal execution, nodes are pushed onto the stack when they
+start and popped when they stop; at finalization time their accumulated
+counters are added to the parent.
+
+Due to the use of longjmp for error handling, functions can exit abruptly
+without executing their normal cleanup code. On abort, two things need
+to happen:
+
+1. The stack is reset to the level saved at the start of the aborting
+ (sub-)transaction. This ensures that we don't later try to update
+ counters on a freed stack entry. We also need to ensure that the stack
+ entry that was current before a particular Instrumentation started is
+ current again after it stops.
+
+2. Finalize all affected Instrumentation nodes, rolling up their counters
+ to the innermost surviving Instrumentation, so that data is not lost.
+
+For example, if Seq Scan B aborts while the stack is:
+
+ instr_top (implicit bottom)
+ 0: Query A
+ 1: NestLoop
+ 2: Seq Scan B
+
+The abort handler for Query A accumulates all unfinalized children (Seq
+Scan A, Seq Scan B, NestLoop) directly into Query A's counters, then
+unwinds the instrumentation stack and accumulates Query A's counters to
+instr_top.
+
+Note that on abort the children do not accumulate through each other (Seq
+Scan B -> NestLoop -> Query A); they all accumulate directly to their
+parent QueryInstrumentation. This means the order in which children are
+released does not matter -- this is important because ResourceOwner cleanup
+does not guarantee a particular release order. The per-node breakdown is lost,
+but the instrumentation active when the query was started (instr_top in the
+above example) survives the abort, and its counters include the activity.
+
+If multiple QueryInstrumentations are active on the stack (e.g. nested
+portals), the abort handler of each uses InstrStopFinalize() to accumulate
+the statistics to the parent entry of either the entry being released, or a
+previously released entry if it was higher up in the stack, so they compose
+correctly regardless of release order.
+
+There are two mechanisms for achieving abort safety:
+
+* Resource Owner (QueryInstrumentation): registers with the current
+ ResourceOwner at start. On transaction abort, the resource owner system
+ calls the release callback, which walks unfinalized child entries,
+ accumulates their data, unwinds the stack, and destroys the dedicated
+ memory context (freeing the QueryInstrumentation and all child
+ allocations as a unit). This is the recommended approach when the
+ instrumented code already has an appropriate resource owner (e.g. it
+ runs inside a portal). The query executor uses this path.
+
+* PG_FINALLY (base Instrumentation): when no suitable resource owner
+ exists, or when the caller wants to inspect the instrumentation data
+ even after an error, the base Instrumentation can be used with a
+ PG_TRY/PG_FINALLY block that calls InstrStopFinalize().
+
+Both mechanisms add overhead, so neither is suitable for high-frequency
+instrumentation like per-node measurements in the executor. Instead,
+plan node and trigger children rely on their parent QueryInstrumentation
+for abort safety: they are allocated in the parent's memory context and
+registered in its unfinalized-entries list, so the parent's abort handler
+recovers their data automatically. In normal execution, children are
+finalized explicitly by the caller.
+
+Parallel Query
+--------------
+
+Parallel workers get their own QueryInstrumentation so they can measure
+buffer and WAL activity independently, then copy the totals into dynamic
+shared memory at worker shutdown. The leader accumulates these into its
+own stack.
+
+When per-node instrumentation is active, parallel workers skip per-node
+finalization at shutdown to avoid double-counting; the per-node data is
+aggregated separately through InstrAggNode().
+
+
+Memory Handling
+===============
+
+Instrumentation objects that use the stack must survive until finalization
+runs, including the abort case. To ensure this, QueryInstrumentation
+creates a dedicated "Instrumentation" MemoryContext (instr_cxt) as a child
+of TopMemoryContext. All child instrumentation (nodes, triggers) should be
+allocated in this context.
+
+On successful completion, instr_cxt is reparented to CurrentMemoryContext
+so its lifetime is tied to the caller's context. On abort, the
+ResourceOwner cleanup resets the stack, accumulates the instrumentation
+data to the then-current stack entry, and finally frees instr_cxt.
+
+When the stack is not needed (timer/rows only), Instrumentation allocations
+happen in CurrentMemoryContext instead of TopMemoryContext.
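Reviewer's note (not part of the patch): the push/pop and roll-up behaviour that README.instrument describes can be modelled with a small standalone sketch. Everything below is hypothetical and heavily simplified. The SketchInstr struct, the sketch_* functions and the fixed-size array are illustrative stand-ins, not the patch's Instrumentation, instr_stack or INSTR_BUFUSAGE_* definitions, and abort handling is left out entirely.

```c
#include <assert.h>

/* Hypothetical stand-in for the patch's Instrumentation struct. */
typedef struct SketchInstr
{
    long shared_blks_hit;   /* stand-in for one BufferUsage counter */
    long wal_records;       /* stand-in for one WalUsage counter */
} SketchInstr;

#define SKETCH_MAX_DEPTH 16 /* assumption: fixed depth, the README describes a dynamic array */

static SketchInstr sketch_top;                      /* plays the role of instr_top */
static SketchInstr *sketch_entries[SKETCH_MAX_DEPTH];
static int sketch_depth = 0;
static SketchInstr *sketch_current = &sketch_top;   /* plays the role of instr_stack.current */

/* Entering an instrumented section: make it the current entry. */
static void
sketch_push(SketchInstr *instr)
{
    assert(sketch_depth < SKETCH_MAX_DEPTH);
    sketch_entries[sketch_depth++] = instr;
    sketch_current = instr;
}

/* Leaving the section: restore whichever entry was current before. */
static void
sketch_pop(void)
{
    assert(sketch_depth > 0);
    sketch_depth--;
    sketch_current = (sketch_depth > 0) ? sketch_entries[sketch_depth - 1]
                                        : &sketch_top;
}

/* Finalization: roll a child's counters up into its parent. */
static void
sketch_accum(SketchInstr *dst, const SketchInstr *src)
{
    dst->shared_blks_hit += src->shared_blks_hit;
    dst->wal_records += src->wal_records;
}

/* Low-level code only ever touches the current entry; no diffing needed. */
static void
sketch_buffer_hit(void)
{
    sketch_current->shared_blks_hit++;
}

/* Query -> Seq Scan nesting; returns the total visible at the top level. */
long
sketch_demo(void)
{
    SketchInstr query = {0};
    SketchInstr scan = {0};

    sketch_push(&query);
    sketch_buffer_hit();            /* charged to query only */

    sketch_push(&scan);
    sketch_buffer_hit();            /* charged to scan only */
    sketch_buffer_hit();
    sketch_pop();

    assert(scan.shared_blks_hit == 2);
    assert(query.shared_blks_hit == 1);

    sketch_accum(&query, &scan);    /* finalize scan into query */
    sketch_pop();
    sketch_accum(&sketch_top, &query);  /* finalize query into the top entry */

    return sketch_top.shared_blks_hit;
}
```

In this model, each entry holds only the activity that happened while it was current, and the cumulative figure appears only once children are accumulated into their parent. That is the property the README relies on to avoid the snapshot-and-diff work in every InstrStartNode / InstrStopNode cycle.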
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b0f636bf8b6..d0cd34d286c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -78,6 +78,7 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
/* decls for local routines only used within this module */
static void InitPlan(QueryDesc *queryDesc, int eflags);
static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecFinalizeTriggerInstrumentation(EState *estate);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(QueryDesc *queryDesc,
@@ -247,9 +248,16 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
estate->es_top_eflags = eflags;
- estate->es_instrument = queryDesc->instrument_options;
estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
+ /*
+ * Set up per-node instrumentation if needed. We do this before InitPlan
+ * so that node and trigger instrumentation can be allocated within the
+ * query's dedicated instrumentation memory context.
+ */
+ if (!estate->es_instrument && queryDesc->instrument_options)
+ estate->es_instrument = InstrQueryAlloc(queryDesc->instrument_options);
+
/*
* Set up an AFTER-trigger statement context, unless told not to, or
* unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
@@ -331,9 +339,11 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
- /* Allow instrumentation of Executor overall runtime */
+ /* Start up instrumentation for this execution run */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ InstrQueryStart(queryDesc->totaltime);
+ if (estate->es_instrument)
+ InstrQueryStart(estate->es_instrument);
/*
* extract information from the query descriptor and the query feature.
@@ -384,8 +394,10 @@ standard_ExecutorRun(QueryDesc *queryDesc,
if (sendTuples)
dest->rShutdown(dest);
+ if (estate->es_instrument)
+ InstrQueryStop(estate->es_instrument);
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -435,7 +447,9 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ InstrQueryStart(queryDesc->totaltime);
+ if (estate->es_instrument)
+ InstrQueryStart(estate->es_instrument);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -444,8 +458,32 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
+ if (estate->es_instrument)
+ {
+ /*
+ * Accumulate per-node and trigger statistics to their respective
+ * parent instrumentation stacks.
+ *
+ * We skip this in parallel workers because their per-node stats are
+ * reported individually via ExecParallelReportInstrumentation, and
+ * the leader's own ExecFinalizeNodeInstrumentation handles
+ * propagation. If we accumulated here, the leader would
+ * double-count: worker parent nodes would already include their
+ * children's stats, and then the leader's accumulation would add the
+ * children again.
+ */
+ if (!IsParallelWorker())
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
+ InstrQueryStopFinalize(estate->es_instrument);
+ }
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStopFinalize(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
@@ -1263,7 +1301,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options)
+ QueryInstrumentation *qinstr)
{
MemSet(resultRelInfo, 0, sizeof(ResultRelInfo));
resultRelInfo->type = T_ResultRelInfo;
@@ -1284,8 +1322,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
palloc0_array(FmgrInfo, n);
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
- if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
+ if (qinstr)
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, n);
}
else
{
@@ -1358,6 +1396,10 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
* also provides a way for EXPLAIN ANALYZE to report the runtimes of such
* triggers.) So we make additional ResultRelInfo's as needed, and save them
* in es_trig_target_relations.
+ *
+ * Note: if new relation lists are searched here, they must also be added to
+ * ExecFinalizeTriggerInstrumentation so that trigger instrumentation data
+ * is properly accumulated.
*/
ResultRelInfo *
ExecGetTriggerResultRel(EState *estate, Oid relid,
@@ -1500,6 +1542,30 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
return resultRelInfo->ri_ancestorResultRels;
}
+static void
+ExecFinalizeTriggerInstrumentation(EState *estate)
+{
+ List *rels = NIL;
+
+ rels = list_concat(rels, estate->es_tuple_routing_result_relations);
+ rels = list_concat(rels, estate->es_opened_result_relations);
+ rels = list_concat(rels, estate->es_trig_target_relations);
+
+ foreach_node(ResultRelInfo, rInfo, rels)
+ {
+ TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
+
+ if (ti == NULL || rInfo->ri_TrigDesc == NULL)
+ continue;
+
+ for (int nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
+ {
+ if (ti[nt].instr.need_stack)
+ InstrAccumStack(&estate->es_instrument->instr, &ti[nt].instr);
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecPostprocessPlan
*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 78f60c1530c..c01e780f918 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -700,7 +700,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
- * looking at pgBufferUsage, so do it unconditionally.
+ * looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -825,13 +825,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
- instrumentation->instrument_options = estate->es_instrument;
+ instrumentation->instrument_options = estate->es_instrument->instrument_options;
instrumentation->instrument_offset = instrument_offset;
instrumentation->num_workers = nworkers;
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInitNode(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument->instrument_options);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1081,14 +1081,28 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
instrument = GetInstrumentationArray(instrumentation);
instrument += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
+ {
InstrAggNode(planstate->instrument, &instrument[n]);
+ /*
+ * Also add worker WAL usage to the global pgWalUsage counter.
+ *
+ * When per-node instrumentation is active, parallel workers skip
+ * ExecFinalizeNodeInstrumentation (to avoid double-counting in
+ * EXPLAIN), so per-node WAL activity is not rolled up into the
+ * query-level stats that InstrAccumParallelQuery receives. Without
+ * this, pgWalUsage would under-report WAL generated by parallel
+ * workers when instrumentation is active.
+ */
+ WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
+ }
+
/*
* Also store the per-worker detail.
*
- * Worker instrumentation should be allocated in the same context as the
- * regular instrumentation information, which is the per-query context.
- * Switch into per-query memory context.
+ * Ensure worker instrumentation is allocated in the per-query context. We
+ * don't need to place this in the instrumentation context since no more
+ * stack-based instrumentation work is being done.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
@@ -1238,9 +1252,13 @@ ExecParallelCleanup(ParallelExecutorInfo *pei)
{
/* Accumulate instrumentation, if any. */
if (pei->instrumentation)
+ {
ExecParallelRetrieveInstrumentation(pei->planstate,
pei->instrumentation);
+ ExecFinalizeWorkerInstrumentation(pei->planstate);
+ }
+
/* Accumulate JIT instrumentation, if any. */
if (pei->jit_instrumentation)
ExecParallelRetrieveJitInstrumentation(pei->planstate,
@@ -1462,6 +1480,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1522,7 +1541,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1538,7 +1557,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d96d4f9947b..6f2909a1bc3 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1381,7 +1381,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
{
ResultRelInfo *rri = makeNode(ResultRelInfo);
- InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
+ InitResultRelInfo(rri, rel, 0, rootResultRelInfo, NULL);
proute->nonleaf_partitions[dispatchidx] = rri;
}
else
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 132fe37ef60..3b3ec9850e8 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -123,6 +123,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
+static bool ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -788,10 +790,10 @@ ExecShutdownNode_walker(PlanState *node, void *context)
* at least once already. We don't expect much CPU consumption during
* node shutdown, but in the case of Gather or Gather Merge, we may shut
* down workers at this stage. If so, their buffer usage will get
- * propagated into pgBufferUsage at this point, and we want to make sure
- * that it gets associated with the Gather node. We skip this if the node
- * has never been executed, so as to avoid incorrectly making it appear
- * that it has.
+ * propagated into the current instrumentation stack entry at this point,
+ * and we want to make sure that it gets associated with the Gather node.
+ * We skip this if the node has never been executed, so as to avoid
+ * incorrectly making it appear that it has.
*/
if (node->instrument && node->instrument->running)
InstrStartNode(node->instrument);
@@ -829,6 +831,99 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecFinalizeNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation).
+ *
+ * This must run after the cleanup done by ExecShutdownNode, and not rely on any
+ * resources cleaned up by it. We also expect shutdown actions to have occurred,
+ * e.g. parallel worker instrumentation to have been added to the leader.
+ */
+void
+ExecFinalizeNodeInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeNodeInstrumentation_walker(node, instr_stack.current);
+}
+
+static bool
+ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ Instrumentation *parent = (Instrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ Assert(node->instrument != NULL);
+
+ /*
+ * Recurse into children first (bottom-up accumulation), and accumulate
+ * to this node's instrumentation as the parent context.
+ */
+ planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
+ &node->instrument->instr);
+
+ InstrFinalizeChild(&node->instrument->instr, parent);
+
+ return false;
+}
+
+/*
+ * ExecFinalizeWorkerInstrumentation
+ *
+ * Accumulate per-worker instrumentation stats from child nodes into their
+ * parents, mirroring what ExecFinalizeNodeInstrumentation does for the
+ * leader's own stats. Without this, per-worker buffer/WAL stats shown by
+ * EXPLAIN (ANALYZE, VERBOSE) would only reflect each node's own direct
+ * activity, not including children.
+ *
+ * This must run after ExecParallelRetrieveInstrumentation has populated
+ * worker_instrument for all nodes in the parallel subtree.
+ */
+void
+ExecFinalizeWorkerInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeWorkerInstrumentation_walker(node, NULL);
+}
+
+static bool
+ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context)
+{
+ PlanState *parent = (PlanState *) context;
+ int num_workers;
+
+ if (node == NULL)
+ return false;
+
+ /*
+ * Recurse into children first (bottom-up accumulation), passing this node
+ * as parent context if it has worker_instrument, otherwise pass through
+ * the previous parent.
+ */
+ planstate_tree_walker(node, ExecFinalizeWorkerInstrumentation_walker,
+ node->worker_instrument ? (void *) node : context);
+
+ if (!node->worker_instrument)
+ return false;
+
+ num_workers = node->worker_instrument->num_workers;
+
+ /* Accumulate this node's per-worker stats to parent's per-worker stats */
+ if (parent && parent->worker_instrument)
+ {
+ int parent_workers = parent->worker_instrument->num_workers;
+
+ for (int n = 0; n < Min(num_workers, parent_workers); n++)
+ InstrAccumStack(&parent->worker_instrument->instrument[n].instr,
+ &node->worker_instrument->instrument[n].instr);
+ }
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
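A standalone sketch (not part of the patch) of the bottom-up accumulation that ExecFinalizeNodeInstrumentation_walker performs: children are finalized first, so each node's counters become inclusive totals before they are folded into the parent. The Counters and Node types and the counters_add/finalize_tree helpers are hypothetical stand-ins for BufferUsage/WalUsage, PlanState, and InstrAccumStack/planstate_tree_walker; they are assumptions for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the BufferUsage/WalUsage counters. */
typedef struct Counters
{
	long		blks_hit;
	long		wal_records;
} Counters;

/* Hypothetical stand-in for a PlanState with at most two children. */
typedef struct Node
{
	Counters	own;			/* own activity; inclusive total once finalized */
	struct Node *left;
	struct Node *right;
} Node;

/* dst += add */
static void
counters_add(Counters *dst, const Counters *add)
{
	dst->blks_hit += add->blks_hit;
	dst->wal_records += add->wal_records;
}

/*
 * Finalize children first, passing this node as their parent, then fold
 * this node's (now inclusive) counters into the given parent.
 */
static void
finalize_tree(Node *node, Counters *parent)
{
	if (node == NULL)
		return;

	finalize_tree(node->left, &node->own);
	finalize_tree(node->right, &node->own);

	counters_add(parent, &node->own);
}
```

With a three-level chain (aggregate over sort over scan), calling finalize_tree on the root with a query-level Counters as parent leaves every level holding its own activity plus that of everything below it.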
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 1eb6b9f1f40..700764daf45 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -150,7 +150,7 @@ CreateExecutorState(void)
estate->es_total_processed = 0;
estate->es_top_eflags = 0;
- estate->es_instrument = 0;
+ estate->es_instrument = NULL;
estate->es_finished = false;
estate->es_exprcontexts = NIL;
@@ -227,6 +227,15 @@ FreeExecutorState(EState *estate)
estate->es_partition_directory = NULL;
}
+ /*
+ * Make sure the instrumentation context gets freed. This usually gets
+ * re-parented under the per-query context in InstrQueryStopFinalize, but
+ * that won't happen during EXPLAIN (BUFFERS) since ExecutorFinish never
+ * gets called, so we would otherwise leak it in TopMemoryContext.
+ */
+ if (estate->es_instrument && estate->es_instrument->instr.need_stack)
+ MemoryContextDelete(estate->es_instrument->instr_cxt);
+
/*
* Free the per-query memory context, thereby releasing all working
* memory, including the EState node itself.
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index e3d890a7f98..f9202b558d6 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,31 +16,53 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
-BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
-static WalUsage save_pgWalUsage;
+Instrumentation instr_top;
+InstrStackState instr_stack = {
+ .stack_space = 0,
+ .stack_size = 0,
+ .entries = NULL,
+ .current = &instr_top,
+};
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
-static void WalUsageAdd(WalUsage *dst, WalUsage *add);
+void
+InstrStackGrow(void)
+{
+ int space = instr_stack.stack_space;
+ Assert(instr_stack.stack_size >= instr_stack.stack_space);
+
+ if (instr_stack.entries == NULL)
+ {
+ space = 10; /* Allocate sufficient initial space for
+ * typical activity */
+ instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
+ sizeof(Instrumentation *) * space);
+ }
+ else
+ {
+ space *= 2;
+ instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, space);
+ }
+
+ /* Update stack space after allocation succeeded to protect against OOMs */
+ instr_stack.stack_space = space;
+}
/* General purpose instrumentation handling */
-Instrumentation *
-InstrAlloc(int instrument_options)
+static inline bool
+InstrNeedStack(int instrument_options)
{
- Instrumentation *instr = palloc0(sizeof(Instrumentation));
-
- InstrInitOptions(instr, instrument_options);
- return instr;
+ return (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) != 0;
}
void
InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
+ instr->need_stack = InstrNeedStack(instrument_options);
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
@@ -55,52 +77,309 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- /* save buffer usage totals at node entry, if needed */
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
+static void
+InstrStopTimer(Instrumentation *instr)
+{
+ instr_time endtime;
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ /* the timer must have been started by a prior InstrStart call */
+ if (INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStop called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
}
void
InstrStop(Instrumentation *instr)
{
- instr_time endtime;
+ if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ if (instr->need_stack)
+ InstrPopStack(instr);
+}
+
+/*
+ * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
+ *
+ * Note that this intentionally allows passing a stack that is not the current
+ * top, as can happen with PG_FINALLY, or resource owners, which don't have a
+ * guaranteed cleanup order.
+ *
+ * We are careful here to achieve two goals:
+ *
+ * 1) Reset the stack to the parent of whichever of the released stack entries
+ * has the lowest index
+ * 2) Accumulate all instrumentation to the currently active instrumentation,
+ * so that callers get a complete picture of activity, even after an abort
+ */
+void
+InstrStopFinalize(Instrumentation *instr)
+{
+ if (instr->on_stack)
+ {
+ int idx = -1;
+
+ for (int i = instr_stack.stack_size - 1; i >= 0; i--)
+ {
+ if (instr_stack.entries[i] == instr)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ if (idx < 0)
+ elog(ERROR, "instrumentation entry not found on stack");
+
+ /* Clear on_stack for any intermediate entries we're skipping over */
+ for (int i = instr_stack.stack_size - 1; i > idx; i--)
+ instr_stack.entries[i]->on_stack = false;
+
+ while (instr_stack.stack_size > idx + 1)
+ instr_stack.stack_size--;
+
+ InstrPopStack(instr);
+ }
- /* let's update the time only if the timer was requested */
if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ InstrAccumStack(instr_stack.current, instr);
+}
+
+/*
+ * Finalize child instrumentation by accumulating buffer/WAL usage to the
+ * provided instrumentation, which may be the current entry, or one the caller
+ * treats as a parent and will add to the totals later.
+ *
+ * Also deletes the unfinalized entry to avoid double counting in an abort
+ * situation, e.g. during executor finish.
+ */
+void
+InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent)
+{
+ if (instr->need_stack)
{
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ if (!dlist_node_is_detached(&instr->unfinalized_entry))
+ dlist_delete_thoroughly(&instr->unfinalized_entry);
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+ InstrAccumStack(parent, instr);
+ }
+}
+
+
+/* Query instrumentation handling */
+
+/*
+ * Use ResourceOwner mechanism to correctly reset instr_stack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+ .release_priority = RELEASE_PRIO_INSTRUMENTATION,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ QueryInstrumentation *qinstr = (QueryInstrumentation *) DatumGetPointer(res);
+ MemoryContext instr_cxt = qinstr->instr_cxt;
+ dlist_mutable_iter iter;
+
+ /* Accumulate data from all unfinalized child entries (nodes, triggers) */
+ dlist_foreach_modify(iter, &qinstr->unfinalized_entries)
+ {
+ Instrumentation *child = dlist_container(Instrumentation, unfinalized_entry, iter.cur);
- INSTR_TIME_SET_ZERO(instr->starttime);
+ InstrAccumStack(&qinstr->instr, child);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
+ /* Ensure the stack is reset as expected, and we accumulate to the parent */
+ InstrStopFinalize(&qinstr->instr);
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ /*
+ * Destroy the dedicated instrumentation context, which frees the
+ * QueryInstrumentation and all child allocations.
+ */
+ MemoryContextDelete(instr_cxt);
+}
+
+QueryInstrumentation *
+InstrQueryAlloc(int instrument_options)
+{
+ QueryInstrumentation *instr;
+ MemoryContext instr_cxt;
+
+ /*
+ * When the instrumentation stack is used, create a dedicated memory
+ * context for this query's instrumentation allocations. This context is a
+ * child of TopMemoryContext so it survives transaction abort, since
+ * ResourceOwner release needs to access it.
+ *
+ * For simpler cases (timer/rows only), use the current memory context.
+ *
+ * All child instrumentation allocations (nodes, triggers, etc) must be
+ * allocated within this context to ensure correct clean up on abort.
+ */
+ if (InstrNeedStack(instrument_options))
+ instr_cxt = AllocSetContextCreate(TopMemoryContext,
+ "Instrumentation",
+ ALLOCSET_SMALL_SIZES);
+ else
+ instr_cxt = CurrentMemoryContext;
+
+ instr = MemoryContextAllocZero(instr_cxt, sizeof(QueryInstrumentation));
+ instr->instrument_options = instrument_options;
+ instr->instr_cxt = instr_cxt;
+
+ InstrInitOptions(&instr->instr, instrument_options);
+ dlist_init(&instr->unfinalized_entries);
+
+ return instr;
+}
+
+void
+InstrQueryStart(QueryInstrumentation *qinstr)
+{
+ InstrStart(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(CurrentResourceOwner != NULL);
+ qinstr->owner = CurrentResourceOwner;
+
+ ResourceOwnerEnlarge(qinstr->owner);
+ ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
+ }
+}
+
+void
+InstrQueryStop(QueryInstrumentation *qinstr)
+{
+ InstrStop(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+ }
+}
+
+void
+InstrQueryStopFinalize(QueryInstrumentation *qinstr)
+{
+ InstrStopFinalize(&qinstr->instr);
+
+ if (!qinstr->instr.need_stack)
+ {
+ Assert(qinstr->owner == NULL);
+ return;
+ }
+
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+
+ /*
+ * Reparent the dedicated instrumentation context under the current memory
+ * context, so that its lifetime is now tied to the caller's context
+ * rather than TopMemoryContext.
+ */
+ MemoryContextSetParent(qinstr->instr_cxt, CurrentMemoryContext);
+}
+
+/*
+ * Register a child Instrumentation entry for abort processing.
+ *
+ * On abort, ResOwnerReleaseInstrumentation will walk the parent's list to
+ * recover buffer/WAL data from entries that were never finalized, in order for
+ * aggregate totals to be accurate despite the query erroring out.
+ */
+void
+InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *child)
+{
+ if (child->need_stack)
+ dlist_push_head(&parent->unfinalized_entries, &child->unfinalized_entry);
+}
+
+/* start instrumentation during parallel executor startup */
+QueryInstrumentation *
+InstrStartParallelQuery(void)
+{
+ QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrQueryStart(qinstr);
+ return qinstr;
+}
+
+/* report usage after parallel executor shutdown */
+void
+InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+{
+ InstrQueryStopFinalize(qinstr);
+ memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+}
+
+/*
+ * Accumulate work done by parallel workers in the leader's stats.
+ *
+ * Note that what gets added here effectively depends on whether per-node
+ * instrumentation is active. If it's active the parallel worker intentionally
+ * skips ExecFinalizeNodeInstrumentation on executor shutdown, because it would
+ * cause double counting. Instead, this only accumulates any extra activity
+ * outside of nodes.
+ *
+ * Otherwise this is responsible for making sure that the complete query
+ * activity is accumulated.
+ */
+void
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+{
+ BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
+ WalUsageAdd(&instr_stack.current->walusage, walusage);
+
+ WalUsageAdd(&pgWalUsage, walusage);
}
/* Node instrumentation handling */
/* Allocate new node instrumentation structure */
NodeInstrumentation *
-InstrAllocNode(int instrument_options, bool async_mode)
+InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode)
{
- NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+ NodeInstrumentation *instr = MemoryContextAlloc(qinstr->instr_cxt, sizeof(NodeInstrumentation));
- InstrInitNode(instr, instrument_options);
+ InstrInitNode(instr, qinstr->instrument_options);
instr->async_mode = async_mode;
+ InstrQueryRememberChild(qinstr, &instr->instr);
+
return instr;
}
@@ -119,6 +398,7 @@ InstrStartNode(NodeInstrumentation *instr)
InstrStart(&instr->instr);
}
+
/* Exit from a plan node */
void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
@@ -148,14 +428,12 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->instr.starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->instr.need_bufusage)
- BufferUsageAccumDiff(&instr->instr.bufusage,
- &pgBufferUsage, &instr->instr.bufusage_start);
-
- if (instr->instr.need_walusage)
- WalUsageAccumDiff(&instr->instr.walusage,
- &pgWalUsage, &instr->instr.walusage_start);
+ /*
+ * Only pop the stack here; accumulation to the parent runs later, in
+ * ExecFinalizeNodeInstrumentation
+ */
+ if (instr->instr.need_stack)
+ InstrPopStack(&instr->instr);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -190,8 +468,8 @@ InstrEndLoop(NodeInstrumentation *instr)
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrEndLoop called on running node");
+ /* Ensure InstrStopNode was called */
+ Assert(INSTR_TIME_IS_ZERO(instr->instr.starttime));
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
@@ -225,67 +503,73 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->instr.need_bufusage)
- BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
-
- if (dst->instr.need_walusage)
- WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
+ if (dst->instr.need_stack)
+ InstrAccumStack(&dst->instr, &add->instr);
}
/* Trigger instrumentation handling */
TriggerInstrumentation *
-InstrAllocTrigger(int n, int instrument_options)
+InstrAllocTrigger(QueryInstrumentation *qinstr, int n)
{
- TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ TriggerInstrumentation *tginstr;
int i;
+ /*
+ * Allocate in the query's dedicated instrumentation context so all
+ * instrumentation data is grouped together and cleaned up as a unit.
+ */
+ Assert(qinstr != NULL && qinstr->instr_cxt != NULL);
+ tginstr = MemoryContextAllocZero(qinstr->instr_cxt,
+ n * sizeof(TriggerInstrumentation));
+
for (i = 0; i < n; i++)
- InstrInitOptions(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, qinstr->instrument_options);
return tginstr;
}
void
-InstrStartTrigger(TriggerInstrumentation *tginstr)
+InstrStartTrigger(QueryInstrumentation *qinstr, TriggerInstrumentation *tginstr)
{
InstrStart(&tginstr->instr);
+
+ /*
+ * On first call, register with the parent QueryInstrumentation for abort
+ * recovery.
+ */
+ if (qinstr && tginstr->instr.need_stack &&
+ dlist_node_is_detached(&tginstr->instr.unfinalized_entry))
+ dlist_push_head(&qinstr->unfinalized_entries,
+ &tginstr->instr.unfinalized_entry);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
+ /*
+ * This trigger may be called again, so we don't finalize instrumentation
+ * here. Accumulation to the parent happens at ExecutorFinish through
+ * ExecFinalizeTriggerInstrumentation.
+ */
InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
-/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrAccumStack(Instrumentation *dst, Instrumentation *add)
{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
-}
+ Assert(dst != NULL);
+ Assert(add != NULL);
-/* report usage after parallel executor shutdown */
-void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
-}
+ if (!add->need_stack)
+ return;
-/* accumulate work done by workers in leader's stats */
-void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- BufferUsageAdd(&pgBufferUsage, bufusage);
- WalUsageAdd(&pgWalUsage, walusage);
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -306,39 +590,9 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
+/* dst += add */
void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
-/* helper functions for WAL usage accumulation */
-static void
-WalUsageAdd(WalUsage *dst, WalUsage *add)
+WalUsageAdd(WalUsage *dst, const WalUsage *add)
{
dst->wal_bytes += add->wal_bytes;
dst->wal_records += add->wal_records;
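A standalone sketch (not part of the patch) of the stack-based attribution described above: activity is counted against whichever entry is current, and a finished entry is accumulated into its parent only at finalization. The real InstrPushStack/InstrPopStack definitions are not shown in this hunk, so push_entry, pop_entry, count_hit and accum are hypothetical helpers, and the fixed MAX_DEPTH replaces the dynamically grown array purely as a simplifying assumption.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 16			/* fixed depth for the sketch only */

/* Hypothetical stand-in for Instrumentation, with a single counter. */
typedef struct Entry
{
	long		blks_hit;
} Entry;

static Entry top_entry;			/* plays the role of instr_top */
static Entry *stack[MAX_DEPTH];
static int	stack_size = 0;
static Entry *current = &top_entry;

/* Enter an instrumented region: make e the current entry. */
static void
push_entry(Entry *e)
{
	assert(stack_size < MAX_DEPTH);
	stack[stack_size++] = e;
	current = e;
}

/* Leave the region: restore the previous entry, or the top when empty. */
static void
pop_entry(Entry *e)
{
	assert(stack_size > 0 && stack[stack_size - 1] == e);
	stack_size--;
	current = stack_size > 0 ? stack[stack_size - 1] : &top_entry;
}

/* What a buffer hit does: bump the current entry, no diffing needed. */
static void
count_hit(void)
{
	current->blks_hit++;
}

/* dst += add, used when an entry is finalized into its parent. */
static void
accum(Entry *dst, const Entry *add)
{
	dst->blks_hit += add->blks_hit;
}
```

The point of the design is that entering and leaving a node costs one pointer swap, while the addition of counters happens once per entry at finalization rather than on every node exit.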
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b38170f0fbe..3ca0a7a635d 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -904,7 +904,7 @@ create_edata_for_relation(LogicalRepRelMapEntry *rel)
* Use Relation opened by logicalrep_rel_open() instead of opening it
* again.
*/
- InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
+ InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, NULL);
/*
* We put the ResultRelInfo in the es_opened_result_relations list, even
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3e1c39160db..cf4f4246ca2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1266,9 +1266,9 @@ PinBufferForBlock(Relation rel,
if (rel)
{
/*
- * While pgBufferUsage's "read" counter isn't bumped unless we reach
- * WaitReadBuffers() (so, not for hits, and not for buffers that are
- * zeroed instead), the per-relation stats always count them.
+ * While the current buffer usage "read" counter isn't bumped unless
+ * we reach WaitReadBuffers() (so, not for hits, and not for buffers
+ * that are zeroed instead), the per-relation stats always count them.
*/
pgstat_count_buffer_read(rel);
}
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index e3829d7fe7c..e7fc7f071d8 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -114,9 +114,9 @@ pgstat_prepare_io_time(bool track_io_guc)
* pg_stat_database only counts block read and write times, these are done for
* IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
*
- * pgBufferUsage is used for EXPLAIN. pgBufferUsage has write and read stats
- * for shared, local and temporary blocks. pg_stat_io does not track the
- * activity of temporary blocks, so these are ignored here.
+ * Executor instrumentation is used for EXPLAIN. Buffer usage tracked there has
+ * write and read stats for shared, local and temporary blocks. pg_stat_io
+ * does not track the activity of temporary blocks, so these are ignored here.
*/
void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index d3a57242844..340029a2034 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -51,8 +51,8 @@ typedef struct QueryDesc
/* This field is set by ExecutePlan */
bool already_executed; /* true if previously executed */
- /* This is always set NULL by the core system, but plugins can change it */
- struct Instrumentation *totaltime; /* total time spent in ExecutorRun */
+ /* This field is set by ExecutorRun, or plugins */
+ struct QueryInstrumentation *totaltime; /* total time spent in ExecutorRun */
} QueryDesc;
/* in pquery.c */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 491c4886506..03f0e864176 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -233,6 +233,7 @@ ExecGetJunkAttribute(TupleTableSlot *slot, AttrNumber attno, bool *isNull)
/*
* prototypes from functions in execMain.c
*/
+typedef struct QueryInstrumentation QueryInstrumentation;
extern void ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void standard_ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void ExecutorRun(QueryDesc *queryDesc,
@@ -254,7 +255,7 @@ extern void InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options);
+ QueryInstrumentation *qinstr);
extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid,
ResultRelInfo *rootRelInfo);
extern List *ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo);
@@ -301,6 +302,8 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecFinalizeNodeInstrumentation(PlanState *node);
+extern void ExecFinalizeWorkerInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d4769f3da7b..d2f0191af27 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -13,6 +13,7 @@
#ifndef INSTRUMENT_H
#define INSTRUMENT_H
+#include "lib/ilist.h"
#include "portability/instr_time.h"
@@ -68,29 +69,92 @@ typedef enum InstrumentOption
} InstrumentOption;
/*
- * General purpose instrumentation that can capture time and WAL/buffer usage
+ * Instrumentation base class for capturing time and WAL/buffer usage
*
- * Initialized through InstrAlloc, followed by one or more calls to a pair of
- * InstrStart/InstrStop (activity is measured inbetween).
+ * If used directly:
+ * - Allocate on the stack and zero initialize the struct
+ * - Call InstrInitOptions to set instrumentation options
+ * - Call InstrStart before the activity you want to measure
+ * - Call InstrStop / InstrStopFinalize after the activity to capture totals
+ *
+ * InstrStart/InstrStop may be called multiple times. The last stop call must
+ * be to InstrStopFinalize to ensure parent stack entries get the accumulated
+ * totals. If there is risk of transaction aborts you must call
+ * InstrStopFinalize in a PG_TRY/PG_FINALLY block to avoid corrupting the
+ * instrumentation stack.
+ *
+ * In a query context use QueryInstrumentation instead, which handles aborts
+ * using the resource owner logic.
*/
typedef struct Instrumentation
{
/* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
- bool need_bufusage; /* true if we need buffer usage data */
- bool need_walusage; /* true if we need WAL usage data */
+ bool need_stack; /* true if we need WAL/buffer usage data */
/* Internal state keeping: */
+ bool on_stack; /* true if currently on instr_stack */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
+ /* Abort handling: link in parent QueryInstrumentation's unfinalized list */
+ dlist_node unfinalized_entry;
} Instrumentation;
+/*
+ * Query-related instrumentation tracking.
+ *
+ * Usage:
+ * - Allocate on the heap using InstrQueryAlloc (required for abort handling)
+ * - Call InstrQueryStart before the activity you want to measure
+ * - Call InstrQueryStop / InstrQueryStopFinalize afterwards to capture totals
+ *
+ * InstrQueryStart/InstrQueryStop may be called multiple times. The last stop
+ * call must be to InstrQueryStopFinalize to ensure parent stack entries get
+ * the accumulated totals.
+ *
+ * Uses the resource owner mechanism for handling aborts. As such, the caller
+ * *must* not exit out of the top level transaction after having called
+ * InstrQueryStart, without first calling InstrQueryStop or
+ * InstrQueryStopFinalize. In the case of a transaction abort, logic equivalent
+ * to InstrQueryStopFinalize will be called automatically.
+ */
+struct ResourceOwnerData;
+typedef struct QueryInstrumentation
+{
+ Instrumentation instr;
+
+ /* Original instrument_options flags used to create this instrumentation */
+ int instrument_options;
+
+ /* Resource owner used for cleanup for aborts between InstrStart/InstrStop */
+ struct ResourceOwnerData *owner;
+
+ /*
+ * Dedicated memory context for all instrumentation allocations belonging
+ * to this query (node instrumentation, trigger instrumentation, etc.).
+ * Initially a child of TopMemoryContext so it survives transaction abort
+ * for ResourceOwner cleanup. It is then reparented under the current
+ * memory context in InstrQueryStopFinalize.
+ */
+ MemoryContext instr_cxt;
+
+ /*
+ * Child entries that need to be cleaned up on abort, since they are not
+ * registered as a resource owner themselves. Contains both node and
+ * trigger instrumentation entries linked via instr.unfinalized_entry.
+ */
+ dlist_head unfinalized_entries;
+} QueryInstrumentation;
+
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Relies on an outer QueryInstrumentation having been set up to handle the
+ * stack used for WAL/buffer usage statistics, and relies on it for managing
+ * aborts. Solely intended for the executor and anyone reporting about its
+ * activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -111,6 +175,10 @@ typedef struct NodeInstrumentation
double nfiltered2; /* # of tuples removed by "other" quals */
} NodeInstrumentation;
+/*
+ * Care must be taken with any pointers contained within this struct, as this
+ * gets copied across processes during parallel query execution.
+ */
typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
@@ -124,16 +192,104 @@ typedef struct TriggerInstrumentation
* was fired */
} TriggerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
+/*
+ * Dynamic array-based stack for tracking current WAL/buffer usage context.
+ *
+ * When the stack is empty, 'current' points to instr_top which accumulates
+ * session-level totals.
+ */
+typedef struct InstrStackState
+{
+ int stack_space; /* allocated capacity of entries array */
+ int stack_size; /* current number of entries */
+
+ Instrumentation **entries; /* dynamic array of pointers */
+ Instrumentation *current; /* top of stack, or &instr_top when empty */
+} InstrStackState;
+
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int instrument_options);
+/*
+ * The top instrumentation represents a running total of the current backend
+ * WAL/buffer usage information. This will not be updated immediately, but
+ * rather when the current stack entry gets accumulated which typically happens
+ * at query end.
+ *
+ * Care must be taken when utilizing this in the parallel worker context:
+ * Parallel workers will report back their instrumentation to the caller,
+ * and this gets added to the caller's stack. If this were to be used in the
+ * shared memory stats infrastructure it would need to be skipped on parallel
+ * workers to avoid double counting.
+ */
+extern PGDLLIMPORT Instrumentation instr_top;
+
+/*
+ * The instrumentation stack state. The 'current' field points to the
+ * currently active stack entry that is getting updated as activity happens,
+ * and will be accumulated to parent stacks when it gets finalized by
+ * InstrStop (for non-executor use cases), ExecFinalizeNodeInstrumentation
+ * (executor finish) or ResOwnerReleaseInstrumentation on abort.
+ */
+extern PGDLLIMPORT InstrStackState instr_stack;
+
+extern void InstrStackGrow(void);
+
+/*
+ * Pushes the stack so that all WAL/buffer usage updates go to the passed in
+ * instrumentation entry.
+ *
+ * See note on InstrPopStack regarding safe use of these functions.
+ */
+static inline void
+InstrPushStack(Instrumentation *instr)
+{
+ if (unlikely(instr_stack.stack_size == instr_stack.stack_space))
+ InstrStackGrow();
+
+ instr_stack.entries[instr_stack.stack_size++] = instr;
+ instr_stack.current = instr;
+ instr->on_stack = true;
+}
+
+/*
+ * Pops the stack entry back to the previous one that was effective at
+ * InstrPushStack.
+ *
+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, consider instead using InstrStart + InstrStopFinalize which can skip
+ * intermediate stack entries.
+ */
+static inline void
+InstrPopStack(Instrumentation *instr)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.entries[instr_stack.stack_size - 1] == instr);
+ instr_stack.stack_size--;
+ instr_stack.current = instr_stack.stack_size > 0
+ ? instr_stack.entries[instr_stack.stack_size - 1]
+ : &instr_top;
+ instr->on_stack = false;
+}
+
extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
extern void InstrStart(Instrumentation *instr);
extern void InstrStop(Instrumentation *instr);
+extern void InstrStopFinalize(Instrumentation *instr);
+extern void InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent);
+extern void InstrAccumStack(Instrumentation *dst, Instrumentation *add);
-extern NodeInstrumentation *InstrAllocNode(int instrument_options,
- bool async_mode);
+extern QueryInstrumentation *InstrQueryAlloc(int instrument_options);
+extern void InstrQueryStart(QueryInstrumentation *instr);
+extern void InstrQueryStop(QueryInstrumentation *instr);
+extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
+extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
+
+pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+
+extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
extern void InstrStartNode(NodeInstrumentation *instr);
extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
@@ -141,35 +297,36 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
-extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
-extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr, int n);
+extern void InstrStartTrigger(QueryInstrumentation *qinstr,
+ TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+extern void WalUsageAdd(WalUsage *dst, const WalUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
+ instr_stack.current->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += (val); \
+ instr_stack.current->bufusage.fld += (val); \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ INSTR_TIME_ADD(instr_stack.current->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ INSTR_TIME_ACCUM_DIFF(instr_stack.current->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ instr_stack.current->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += (val); \
+ instr_stack.current->walusage.fld += (val); \
} while(0)
#endif /* INSTRUMENT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3ecae7552fc..b28288aa1e8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -54,6 +54,7 @@ typedef struct Instrumentation Instrumentation;
typedef struct pairingheap pairingheap;
typedef struct PlanState PlanState;
typedef struct QueryEnvironment QueryEnvironment;
+typedef struct QueryInstrumentation QueryInstrumentation;
typedef struct RelationData *Relation;
typedef Relation *RelationPtr;
typedef struct ScanKeyData ScanKeyData;
@@ -753,7 +754,7 @@ typedef struct EState
* ExecutorRun() calls. */
int es_top_eflags; /* eflags passed to ExecutorStart */
- int es_instrument; /* OR of InstrumentOption flags */
+ QueryInstrumentation *es_instrument; /* query-level instrumentation */
bool es_finished; /* true when ExecutorFinish is done */
List *es_exprcontexts; /* List of ExprContexts within EState */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index eb6033b4fdb..5463bc921f0 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -75,6 +75,7 @@ typedef uint32 ResourceReleasePriority;
#define RELEASE_PRIO_SNAPSHOT_REFS 500
#define RELEASE_PRIO_FILES 600
#define RELEASE_PRIO_WAITEVENTSETS 700
+#define RELEASE_PRIO_INSTRUMENTATION 800
/* 0 is considered invalid */
#define RELEASE_PRIO_FIRST 1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 449acca8dc1..7393926e34d 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1355,6 +1355,7 @@ InjectionPointSharedState
InjectionPointsCtl
InlineCodeBlock
InsertStmt
+InstrStackState
Instrumentation
Int128AggState
Int8TransTypeData
@@ -2477,6 +2478,7 @@ QueryCompletion
QueryDesc
QueryEnvironment
QueryInfo
+QueryInstrumentation
QueryItem
QueryItemType
QueryMode
--
2.47.1
[application/octet-stream] v13-0010-instrumentation-Optimize-ExecProcNodeInstr-instr.patch (11.3K, 11-v13-0010-instrumentation-Optimize-ExecProcNodeInstr-instr.patch)
From d51ed5a5ebfe83116a4a740ba3b9d3f49687f226 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 17:52:24 -0800
Subject: [PATCH v13 10/12] instrumentation: Optimize ExecProcNodeInstr
instructions by inlining
For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.
Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). These
conditionals, checked for each tuple emitted, add up and cause the compiler
to generate a non-optimal set of instructions.
Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. This reduces
the overhead of EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
test with a large COUNT(*) that does many ExecProcNode calls from ~ 20% on
top of actual runtime to ~ 3%. With BUFFERS ON, the overhead for the same
query goes from ~ 20% to ~ 10% on top of actual runtime.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxFP7i7-wy98ZmEJ11edYq-RrPvJoa4kzGhBBjERA4Nyw%40mail.gmail.com#e8dfd018a07d7f8d41565a079d40c564
---
src/backend/executor/execProcnode.c | 22 +--
src/backend/executor/instrument.c | 199 ++++++++++++++++++++--------
src/include/executor/instrument.h | 5 +
3 files changed, 149 insertions(+), 77 deletions(-)
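
The specialization described in the commit message above can be sketched
outside of PostgreSQL as follows. This is only an illustrative model, not
the patch's code: DemoNode, DemoExecFn and all demo_* functions are
hypothetical stand-ins for PlanState, ExecProcNodeMtd and the
ExecProcNodeInstr* wrappers, and the counters merely stand in for the
timer and stack work. The point is that the option flags are checked once,
when the function pointer is chosen, and that each wrapper passes literal
flags into one inlined body so the compiler can drop the dead branches.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for PlanState and ExecProcNodeMtd. */
typedef struct DemoNode DemoNode;
typedef int (*DemoExecFn) (DemoNode *node);

struct DemoNode
{
    bool        need_timer;     /* fixed when the node is set up */
    bool        need_stack;
    int         timer_calls;    /* stands in for timer start/stop work */
    int         stack_calls;    /* stands in for stack push/pop work */
    int         tuples;
    DemoExecFn  real;           /* the uninstrumented node function */
};

/* Pretend node function: always "returns one tuple". */
static int
demo_real(DemoNode *node)
{
    (void) node;
    return 1;
}

/*
 * Generic body. Each wrapper below passes literal flags, so once this is
 * inlined the compiler can remove the branches that cannot be taken.
 * Plain "static inline" is only a hint; the patch uses
 * pg_attribute_always_inline to force the inlining.
 */
static inline int
demo_exec_instr(DemoNode *node, bool need_timer, bool need_stack)
{
    int         ntuples;

    /* The literal flags must match how the node was configured. */
    assert(node->need_timer == need_timer);
    assert(node->need_stack == need_stack);

    if (need_stack)
        node->stack_calls++;
    if (need_timer)
        node->timer_calls++;

    ntuples = node->real(node);
    node->tuples += ntuples;
    return ntuples;
}

static int
demo_exec_full(DemoNode *node)
{
    return demo_exec_instr(node, true, true);
}

static int
demo_exec_stack_only(DemoNode *node)
{
    return demo_exec_instr(node, false, true);
}

static int
demo_exec_timer_only(DemoNode *node)
{
    return demo_exec_instr(node, true, false);
}

static int
demo_exec_rows_only(DemoNode *node)
{
    return demo_exec_instr(node, false, false);
}

/* The options are inspected once here, not once per tuple. */
static DemoExecFn
demo_setup_exec(const DemoNode *node)
{
    if (node->need_timer && node->need_stack)
        return demo_exec_full;
    else if (node->need_stack)
        return demo_exec_stack_only;
    else if (node->need_timer)
        return demo_exec_timer_only;
    else
        return demo_exec_rows_only;
}
```

A caller would store the result of demo_setup_exec() in the node once and
then invoke that pointer for every tuple, which mirrors how
ExecProcNodeFirst assigns node->ExecProcNode in the diff below.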
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 3b3ec9850e8..6e8cbaeccf7 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,7 +121,6 @@
#include "nodes/nodeFuncs.h"
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
-static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
static bool ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context);
@@ -465,7 +464,7 @@ ExecProcNodeFirst(PlanState *node)
* have ExecProcNode() directly call the relevant function from now on.
*/
if (node->instrument)
- node->ExecProcNode = ExecProcNodeInstr;
+ node->ExecProcNode = InstrNodeSetupExecProcNode(node->instrument);
else
node->ExecProcNode = node->ExecProcNodeReal;
@@ -473,25 +472,6 @@ ExecProcNodeFirst(PlanState *node)
}
-/*
- * ExecProcNode wrapper that performs instrumentation calls. By keeping
- * this a separate function, we avoid overhead in the normal case where
- * no instrumentation is wanted.
- */
-static TupleTableSlot *
-ExecProcNodeInstr(PlanState *node)
-{
- TupleTableSlot *result;
-
- InstrStartNode(node->instrument);
-
- result = node->ExecProcNodeReal(node);
-
- InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
- return result;
-}
-
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index af64aa145eb..3183f00d693 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -66,29 +66,20 @@ InstrInitOptions(Instrumentation *instr, int instrument_options)
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-void
-InstrStart(Instrumentation *instr)
+static inline void
+InstrStartTimer(Instrumentation *instr)
{
- if (instr->need_timer)
- {
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStart called twice in a row");
- else
- INSTR_TIME_SET_CURRENT(instr->starttime);
- }
+ Assert(INSTR_TIME_IS_ZERO(instr->starttime));
- if (instr->need_stack)
- InstrPushStack(instr);
+ INSTR_TIME_SET_CURRENT(instr->starttime);
}
-static void
+static inline void
InstrStopTimer(Instrumentation *instr)
{
instr_time endtime;
- /* let's update the time only if the timer was requested */
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ Assert(!INSTR_TIME_IS_ZERO(instr->starttime));
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
@@ -96,6 +87,16 @@ InstrStopTimer(Instrumentation *instr)
INSTR_TIME_SET_ZERO(instr->starttime);
}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer)
+ InstrStartTimer(instr);
+
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
void
InstrStop(Instrumentation *instr)
{
@@ -391,65 +392,57 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options)
InstrInitOptions(&instr->instr, instrument_options);
}
-/* Entry to a plan node */
+/* Entry to a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
void
InstrStartNode(NodeInstrumentation *instr)
{
InstrStart(&instr->instr);
}
-
-/* Exit from a plan node */
-void
-InstrStopNode(NodeInstrumentation *instr, double nTuples)
+/*
+ * Updates the node instrumentation time counter.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+static inline void
+InstrStopNodeTimer(NodeInstrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
+ Assert(!INSTR_TIME_IS_ZERO(instr->instr.starttime));
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
/*
- * Update the time only if the timer was requested.
+ * Is this the first tuple of this cycle?
*
- * Note this is different from InstrStop because total is only updated in
- * InstrEndLoop. We need the separate counter variable because we need to
- * calculate start-up time for the first tuple in each cycle, and then
- * accumulate it together.
+ * In async mode, if the plan node hadn't emitted any tuples before, this
+ * might be the first tuple
*/
- if (instr->instr.need_timer)
- {
- if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrStopNode called without start");
-
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ if (!instr->running || (instr->async_mode && instr->tuplecount < 1.0))
+ instr->firsttuple = instr->counter;
+}
- INSTR_TIME_SET_ZERO(instr->instr.starttime);
- }
+/* Exit from a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ if (instr->instr.need_timer)
+ InstrStopNodeTimer(instr);
- /*
- * Only pop the stack, accumulation runs in
- * ExecFinalizeNodeInstrumentation
- */
+ /* Only pop the stack, accumulation runs in InstrFinalizeNode */
if (instr->instr.need_stack)
InstrPopStack(&instr->instr);
- /* Is this the first tuple of this cycle? */
- if (!instr->running)
- {
- instr->running = true;
- instr->firsttuple = instr->counter;
- }
- else
- {
- /*
- * In async mode, if the plan node hadn't emitted any tuples before,
- * this might be the first tuple
- */
- if (instr->async_mode && save_tuplecount < 1.0)
- instr->firsttuple = instr->counter;
- }
+ instr->running = true;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
}
/* Update tuple count */
@@ -507,6 +500,100 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
InstrAccumStack(&dst->instr, &add->instr);
}
+/*
+ * Specialized handling of instrumented ExecProcNode
+ *
+ * These functions are equivalent to running ExecProcNodeReal wrapped in
+ * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
+ * by checking the instrumentation options when the ExecProcNode pointer is
+ * first set, and then using a special-purpose function for each. This results
+ * in a more optimized set of compiled instructions.
+ */
+
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
+
+/* Simplified pop: restore saved state instead of re-deriving from array */
+static inline void
+InstrPopStackTo(Instrumentation *prev)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.stack_size > 1 ? instr_stack.entries[instr_stack.stack_size - 2] == prev : &instr_top == prev);
+ instr_stack.entries[instr_stack.stack_size - 1]->on_stack = false;
+ instr_stack.stack_size--;
+ instr_stack.current = prev;
+}
+
+static pg_attribute_always_inline TupleTableSlot *
+ExecProcNodeInstr(PlanState *node, bool need_timer, bool need_stack)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ if (need_stack)
+ InstrPushStack(&instr->instr);
+ if (need_timer)
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ if (need_timer)
+ InstrStopNodeTimer(instr);
+ if (need_stack)
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrFull(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsStackOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsTimerOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, false);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, false);
+}
+
+/*
+ * Returns an ExecProcNode wrapper that performs instrumentation calls,
+ * tailored to the instrumentation options enabled for the node.
+ */
+ExecProcNodeMtd
+InstrNodeSetupExecProcNode(NodeInstrumentation *instr)
+{
+ bool need_timer = instr->instr.need_timer;
+ bool need_stack = instr->instr.need_stack;
+
+ if (need_timer && need_stack)
+ return ExecProcNodeInstrFull;
+ else if (need_stack)
+ return ExecProcNodeInstrRowsStackOnly;
+ else if (need_timer)
+ return ExecProcNodeInstrRowsTimerOnly;
+ else
+ return ExecProcNodeInstrRowsOnly;
+}
+
/* Trigger instrumentation handling */
TriggerInstrumentation *
InstrAllocTrigger(QueryInstrumentation *qinstr, int n)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b62619412a0..bae8a9b0e62 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -297,6 +297,11 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
+typedef struct TupleTableSlot TupleTableSlot;
+typedef struct PlanState PlanState;
+typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+extern ExecProcNodeMtd InstrNodeSetupExecProcNode(NodeInstrumentation *instr);
+
extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr, int n);
extern void InstrStartTrigger(QueryInstrumentation *qinstr,
TriggerInstrumentation *tginstr);
--
2.47.1
[application/octet-stream] v13-0011-Index-scans-Show-table-buffer-accesses-separatel.patch (22.9K, 12-v13-0011-Index-scans-Show-table-buffer-accesses-separatel.patch)
From b57c4118984bd46b848607680afff11b7960f1bd Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:19 -0800
Subject: [PATCH v13 11/12] Index scans: Show table buffer accesses separately
in EXPLAIN ANALYZE
This sets up a separate instrumentation stack that is used whilst an
Index Scan or Index Only Scan accesses the table, for example because
additional data is needed.
EXPLAIN ANALYZE will now show "Table Buffers" that represent such activity.
The activity is also included in regular "Buffers" together with index
activity and that of any child nodes.
Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
doc/src/sgml/perform.sgml | 13 ++-
doc/src/sgml/ref/explain.sgml | 1 +
src/backend/commands/explain.c | 47 ++++++--
src/backend/executor/execProcnode.c | 46 ++++++++
src/backend/executor/nodeBitmapIndexscan.c | 2 +-
src/backend/executor/nodeIndexonlyscan.c | 41 ++++++-
src/backend/executor/nodeIndexscan.c | 127 +++++++++++++++++----
src/include/executor/instrument_node.h | 5 +
8 files changed, 244 insertions(+), 38 deletions(-)
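
To illustrate why "Table Buffers" is a subset of "Buffers", here is a
minimal standalone model of the idea in the commit message above. It is a
sketch under assumptions, not the patch's implementation: DemoInstr and
the demo_* functions are hypothetical, the stack is a fixed-size array
rather than the dynamic one in instrument.h, and finalization is assumed
to simply add the child's totals to the parent. Buffer hits are charged
to whichever entry is current, a nested entry is pushed only around table
access, and folding the child into the node afterwards yields the
combined total.

```c
#include <assert.h>

/* Hypothetical, reduced stand-in for Instrumentation. */
typedef struct DemoInstr
{
    long        hits;           /* stands in for bufusage.shared_blks_hit */
} DemoInstr;

#define DEMO_STACK_MAX 8

static DemoInstr demo_top;      /* receives activity when the stack is empty */
static DemoInstr *demo_entries[DEMO_STACK_MAX];
static int  demo_depth = 0;
static DemoInstr *demo_current = &demo_top;

static void
demo_push(DemoInstr *instr)
{
    assert(demo_depth < DEMO_STACK_MAX);
    demo_entries[demo_depth++] = instr;
    demo_current = instr;
}

static void
demo_pop(DemoInstr *instr)
{
    /* Entries must be popped in order; none may be skipped. */
    assert(demo_depth > 0 && demo_entries[demo_depth - 1] == instr);
    demo_depth--;
    demo_current = demo_depth > 0 ? demo_entries[demo_depth - 1] : &demo_top;
}

/* Every buffer access is charged to the currently active entry only. */
static void
demo_buffer_hit(void)
{
    demo_current->hits++;
}

/* Assumed finalization step: add the child's totals to its parent. */
static void
demo_finalize_child(DemoInstr *child, DemoInstr *parent)
{
    parent->hits += child->hits;
}

/* One pass of an index scan touching index pages and then table pages. */
static void
demo_index_scan(DemoInstr *node, DemoInstr *table,
                int index_pages, int table_pages)
{
    int         i;

    demo_push(node);
    for (i = 0; i < index_pages; i++)
        demo_buffer_hit();      /* charged to the node entry */

    demo_push(table);
    for (i = 0; i < table_pages; i++)
        demo_buffer_hit();      /* charged to the table entry */
    demo_pop(table);

    demo_pop(node);
}
```

With 3 index pages and 2 table pages, the node entry holds 3 hits and the
table entry 2 until demo_finalize_child() runs, after which the node
reports 5 while the table entry still reports 2. In this model that is
the relationship between the "Buffers" and "Table Buffers" lines.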
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 604e8578a8d..d28f4f22535 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -734,6 +734,7 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10)
Index Cond: (unique2 = t1.unique2)
Index Searches: 10
+ Table Buffers: shared hit=10
Buffers: shared hit=24 read=6
Planning:
Buffers: shared hit=15 dirtied=9
@@ -1005,7 +1006,8 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
Index Cond: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Index Recheck: 1
Index Searches: 1
- Buffers: shared hit=1
+ Table Buffers: shared hit=1
+ Buffers: shared hit=2
Planning Time: 0.039 ms
Execution Time: 0.098 ms
</screen>
@@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
then rejected by a recheck of the index condition. This happens because a
GiST index is <quote>lossy</quote> for polygon containment tests: it actually
returns the rows with polygons that overlap the target, and then we have
- to do the exact containment test on those rows.
+ to do the exact containment test on those rows. The <literal>Table Buffers</literal>
+ counts indicate how many buffer accesses were performed on the table rather than
+ the index. These counts are included in the <literal>Buffers</literal> counts.
</para>
<para>
@@ -1203,13 +1207,14 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
QUERY PLAN
-------------------------------------------------------------------&zwsp;------------------------------------------------------------
Limit (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2.00 loops=1)
- Buffers: shared hit=16
+ Buffers: shared hit=14
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2.00 loops=1)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Index Searches: 1
- Buffers: shared hit=16
+ Table Buffers: shared hit=11
+ Buffers: shared hit=14
Planning Time: 0.077 ms
Execution Time: 0.086 ms
</screen>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5b8b521802e..71070736acb 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -509,6 +509,7 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
-> Index Scan using test_pkey on test (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99.00 loops=1)
Index Cond: ((id > 100) AND (id < 200))
Index Searches: 1
+ Table Buffers: shared hit=1
Buffers: shared hit=4
Planning Time: 0.244 ms
Execution Time: 0.073 ms
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9fc39cabdf8..42fc00cbd34 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -144,7 +144,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -611,7 +611,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
- show_buffer_usage(es, bufusage);
+ show_buffer_usage(es, bufusage, NULL);
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -1028,7 +1028,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
es->indent--;
}
}
@@ -1042,7 +1042,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -1970,6 +1970,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexScanState *) planstate)->iss_Instrument->table_instr.bufusage, "Table");
break;
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1987,6 +1990,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyFloat("Heap Fetches", NULL,
planstate->instrument->ntuples2, 0, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexOnlyScanState *) planstate)->ioss_Instrument->table_instr.bufusage, "Table");
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
@@ -2288,7 +2294,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->instr.bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage, NULL);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
@@ -2307,7 +2313,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->instr.bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage, NULL);
if (es->wal)
show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
@@ -4126,7 +4132,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
* Show buffer usage details. This better be sync with peek_buffer_usage.
*/
static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
{
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -4151,6 +4157,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "Buffers:");
if (has_shared)
@@ -4206,6 +4214,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared_timing || has_local_timing || has_temp_timing)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "I/O Timings:");
if (has_shared_timing)
@@ -4247,6 +4257,14 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
else
{
+ char *buffers_title = NULL;
+
+ if (title)
+ {
+ buffers_title = psprintf("%s Buffers", title);
+ ExplainOpenGroup(buffers_title, buffers_title, true, es);
+ }
+
ExplainPropertyInteger("Shared Hit Blocks", NULL,
usage->shared_blks_hit, es);
ExplainPropertyInteger("Shared Read Blocks", NULL,
@@ -4267,8 +4285,20 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
usage->temp_blks_read, es);
ExplainPropertyInteger("Temp Written Blocks", NULL,
usage->temp_blks_written, es);
+
+ if (buffers_title)
+ ExplainCloseGroup(buffers_title, buffers_title, true, es);
+
if (track_io_timing)
{
+ char *timings_title = NULL;
+
+ if (title)
+ {
+ timings_title = psprintf("%s I/O Timings", title);
+ ExplainOpenGroup(timings_title, timings_title, true, es);
+ }
+
ExplainPropertyFloat("Shared I/O Read Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
3, es);
@@ -4287,6 +4317,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
ExplainPropertyFloat("Temp I/O Write Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
3, es);
+
+ if (timings_title)
+ ExplainCloseGroup(timings_title, timings_title, true, es);
}
}
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 6e8cbaeccf7..a59de0ef22b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -846,6 +846,20 @@ ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
&node->instrument->instr);
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ InstrFinalizeChild(&iss->iss_Instrument->table_instr, &node->instrument->instr);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ InstrFinalizeChild(&ioss->ioss_Instrument->table_instr, &node->instrument->instr);
+ }
+
InstrFinalizeChild(&node->instrument->instr, parent);
return false;
@@ -891,6 +905,38 @@ ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context)
num_workers = node->worker_instrument->num_workers;
+ /*
+ * Fold per-worker IndexScan/IndexOnlyScan table buffer stats into the
+ * per-worker node stats, matching what ExecFinalizeNodeInstrumentation
+ * does for the leader.
+ */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ if (iss->iss_SharedInfo)
+ {
+ int nworkers = Min(num_workers, iss->iss_SharedInfo->num_workers);
+
+ for (int n = 0; n < nworkers; n++)
+ InstrAccumStack(&node->worker_instrument->instrument[n].instr,
+ &iss->iss_SharedInfo->winstrument[n].table_instr);
+ }
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ if (ioss->ioss_SharedInfo)
+ {
+ int nworkers = Min(num_workers, ioss->ioss_SharedInfo->num_workers);
+
+ for (int n = 0; n < nworkers; n++)
+ InstrAccumStack(&node->worker_instrument->instrument[n].instr,
+ &ioss->ioss_SharedInfo->winstrument[n].table_instr);
+ }
+ }
+
/* Accumulate this node's per-worker stats to parent's per-worker stats */
if (parent && parent->worker_instrument)
{
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 70c55ee6d61..63e24a0bcd4 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -276,7 +276,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of bitmap index scans if requested */
if (estate->es_instrument)
- indexstate->biss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->biss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index de6154fd541..9e64ce2bd2d 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -67,6 +67,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -83,6 +84,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->ioss_Instrument && node->ioss_Instrument->table_instr.need_stack)
+ table_instr = &node->ioss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -165,11 +169,22 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
&node->ioss_VMBuffer))
{
+ bool found;
+
/*
* Rats, we have to visit the heap to check visibility.
*/
InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ found = index_fetch_heap(scandesc, node->ioss_TableSlot);
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (!found)
continue; /* no visible tuple, try next index entry */
ExecClearTuple(node->ioss_TableSlot);
@@ -436,6 +451,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->ioss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->ioss_Instrument->table_instr);
}
/*
@@ -610,7 +626,21 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
/* Set up instrumentation of index-only scans if requested */
if (estate->es_instrument)
- indexstate->ioss_Instrument = palloc0_object(IndexScanInstrumentation);
+ {
+ indexstate->ioss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
+
+ /*
+ * Track table and index access separately. We intentionally don't
+ * collect timing (even if enabled), since we don't need it, and
+ * IndexOnlyNext calls InstrPushStack / InstrPopStack (instead of the
+ * full InstrNode*) to reduce overhead.
+ */
+ if ((estate->es_instrument->instrument_options & INSTRUMENT_BUFFERS) != 0)
+ {
+ InstrInitOptions(&indexstate->ioss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ InstrQueryRememberChild(estate->es_instrument, &indexstate->ioss_Instrument->table_instr);
+ }
+ }
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -899,4 +929,11 @@ ExecIndexOnlyScanRetrieveInstrumentation(IndexOnlyScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->ioss_SharedInfo = palloc(size);
memcpy(node->ioss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->ioss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->ioss_Instrument->table_instr,
+ &node->ioss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 1620d146071..02ef9d124a3 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,7 +85,10 @@ IndexNext(IndexScanState *node)
ExprContext *econtext;
ScanDirection direction;
IndexScanDesc scandesc;
+ ItemPointer tid;
TupleTableSlot *slot;
+ bool found;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -102,6 +105,9 @@ IndexNext(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -132,8 +138,24 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (unlikely(!found))
+ continue;
+
CHECK_FOR_INTERRUPTS();
/*
@@ -181,6 +203,7 @@ IndexNextWithReorder(IndexScanState *node)
Datum *lastfetched_vals;
bool *lastfetched_nulls;
int cmp;
+ Instrumentation *table_instr = NULL;
estate = node->ss.ps.state;
@@ -200,6 +223,9 @@ IndexNextWithReorder(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -263,36 +289,67 @@ IndexNextWithReorder(IndexScanState *node)
}
/*
- * Fetch next tuple from the index.
+ * Fetch next valid tuple from the index.
*/
-next_indextuple:
- if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+ for (;;)
{
+ ItemPointer tid;
+ bool found;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scandesc, ForwardScanDirection);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ /*
+ * No more tuples from the index. But we still need to drain
+ * any remaining tuples from the queue before we're done.
+ */
+ node->iss_ReachedEnd = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scandesc->xs_heaptid));
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
/*
- * No more tuples from the index. But we still need to drain any
- * remaining tuples from the queue before we're done.
+ * If the index was lossy, we have to recheck the index quals and
+ * ORDER BY expressions using the fetched tuple.
*/
- node->iss_ReachedEnd = true;
- continue;
- }
-
- /*
- * If the index was lossy, we have to recheck the index quals and
- * ORDER BY expressions using the fetched tuple.
- */
- if (scandesc->xs_recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->indexqualorig, econtext))
+ if (found && scandesc->xs_recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- /* allow this loop to be cancellable */
- CHECK_FOR_INTERRUPTS();
- goto next_indextuple;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->indexqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ /* allow this loop to be cancellable */
+ CHECK_FOR_INTERRUPTS();
+ continue;
+ }
}
+
+ if (found)
+ break;
}
+ /* No more index entries, re-run to clear the reorder queue */
+ if (node->iss_ReachedEnd)
+ continue;
+
if (scandesc->xs_recheckorderby)
{
econtext->ecxt_scantuple = slot;
@@ -818,6 +875,7 @@ ExecEndIndexScan(IndexScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->iss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->iss_Instrument->table_instr);
}
/*
@@ -980,7 +1038,21 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of index scans if requested */
if (estate->es_instrument)
- indexstate->iss_Instrument = palloc0_object(IndexScanInstrumentation);
+ {
+ indexstate->iss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
+
+ /*
+ * Track table and index access separately. We intentionally don't
+ * collect timing (even if enabled), since we don't need it, and
+ * IndexNext / IndexNextWithReorder call InstrPushStack /
+ * InstrPopStack (instead of the full InstrNode*) to reduce overhead.
+ */
+ if ((estate->es_instrument->instrument_options & INSTRUMENT_BUFFERS) != 0)
+ {
+ InstrInitOptions(&indexstate->iss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ InstrQueryRememberChild(estate->es_instrument, &indexstate->iss_Instrument->table_instr);
+ }
+ }
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -1834,4 +1906,11 @@ ExecIndexScanRetrieveInstrumentation(IndexScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->iss_SharedInfo = palloc(size);
memcpy(node->iss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->iss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->iss_Instrument->table_instr,
+ &node->iss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 2a0ff377a73..e2315cef384 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -18,6 +18,8 @@
#ifndef INSTRUMENT_NODE_H
#define INSTRUMENT_NODE_H
+#include "executor/instrument.h"
+
/* ---------------------
* Instrumentation information for aggregate function execution
@@ -48,6 +50,9 @@ typedef struct IndexScanInstrumentation
{
/* Index search count (incremented with pgstat_count_index_scan call) */
uint64 nsearches;
+
+ /* Instrumentation utilized for tracking buffer usage during table access */
+ Instrumentation table_instr;
} IndexScanInstrumentation;
/*
--
2.47.1
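The table-access tracking in the patch above pushes a dedicated entry around each heap fetch, pops it afterwards, and folds it into the owning node's entry at finalize time. A minimal standalone sketch of that pattern follows. It is not the patch's implementation: UsageEntry, current_entry, push_entry, pop_entry, count_hit, accum_entry and demo_index_scan are hypothetical names used only for illustration, standing in loosely for Instrumentation, pgInstrStack, InstrPushStack, InstrPopStack and InstrAccumStack.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the buffer counters kept per stack entry. */
typedef struct UsageEntry
{
    long        blks_hit;
    struct UsageEntry *prev;    /* entry that was active before the push */
} UsageEntry;

/* Assumed global "current entry" pointer, similar in spirit to pgInstrStack. */
static UsageEntry *current_entry = NULL;

static void
push_entry(UsageEntry *e)
{
    e->prev = current_entry;
    current_entry = e;
}

static void
pop_entry(UsageEntry *e)
{
    assert(current_entry == e);
    current_entry = e->prev;
}

/* Buffer access: charge whichever entry is currently on top of the stack. */
static void
count_hit(void)
{
    if (current_entry != NULL)
        current_entry->blks_hit++;
}

static void
accum_entry(UsageEntry *dst, const UsageEntry *src)
{
    dst->blks_hit += src->blks_hit;
}

/* One index page hit plus one heap page hit per tuple. */
static void
demo_index_scan(UsageEntry *node, UsageEntry *table, int ntuples)
{
    push_entry(node);
    for (int i = 0; i < ntuples; i++)
    {
        count_hit();            /* index access, charged to the node entry */
        push_entry(table);
        count_hit();            /* heap access, charged to the table entry */
        pop_entry(table);
    }
    pop_entry(node);

    /* Finalize: fold table access into the node total. */
    accum_entry(node, table);
}
```

Calling demo_index_scan with two zero-initialized entries and ntuples = 3 leaves the table entry holding only the heap hits, while the node entry holds index and heap hits combined. That is the separation the table_instr field is meant to expose.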
[application/octet-stream] v13-0012-Add-test_session_buffer_usage-test-module.patch (30.0K, 13-v13-0012-Add-test_session_buffer_usage-test-module.patch)
From ff71ea65359af4cc1c6b8df4d9017da5fe87e4b7 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:41 -0800
Subject: [PATCH v13 12/12] Add test_session_buffer_usage test module
This module is intended for testing instrumentation-related logic as it
pertains to the top-level stack, which is maintained as a running total.
There is currently no in-core user that utilizes the top-level values in
this manner. The module is especially useful in abort situations, where it
helps ensure we don't lose activity due to incorrect handling of
unfinalized node stacks.
---
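The abort case described in the commit message can be illustrated with a small standalone sketch. It assumes a simplified model in which each entry folds into its parent as soon as it is left, and an abort handler unwinds whatever is still on the stack into the session-level running total. All names here (Usage, session_total, active, enter, leave, touch_buffer, unwind_on_abort) are hypothetical, and the patch's real mechanism may differ in when and how it finalizes entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-entry counter; not the patch's Instrumentation struct. */
typedef struct Usage
{
    long        blks_hit;
    struct Usage *prev;
} Usage;

/* Session-level running total sits at the bottom of the stack. */
static Usage session_total = {0, NULL};
static Usage *active = &session_total;

static void
enter(Usage *e)
{
    e->blks_hit = 0;
    e->prev = active;
    active = e;
}

/* Normal exit: restore the parent and fold this entry's activity into it. */
static void
leave(Usage *e)
{
    assert(active == e);
    active = e->prev;
    active->blks_hit += e->blks_hit;
}

static void
touch_buffer(void)
{
    active->blks_hit++;
}

/*
 * Abort path: entries that never reached leave() are still on the stack.
 * Unwind them so their activity is not lost from the running total.
 */
static void
unwind_on_abort(void)
{
    while (active != &session_total)
        leave(active);
}
```

If an error interrupts two nested entries before either is left, calling unwind_on_abort() still delivers all of their buffer hits to session_total. This is the property the regression tests in this module check through test_session_buffer_usage().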
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
.../test_session_buffer_usage/Makefile | 23 ++
.../expected/test_session_buffer_usage.out | 342 ++++++++++++++++++
.../test_session_buffer_usage/meson.build | 33 ++
.../sql/test_session_buffer_usage.sql | 245 +++++++++++++
.../test_session_buffer_usage--1.0.sql | 31 ++
.../test_session_buffer_usage.c | 95 +++++
.../test_session_buffer_usage.control | 5 +
9 files changed, 776 insertions(+)
create mode 100644 src/test/modules/test_session_buffer_usage/Makefile
create mode 100644 src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
create mode 100644 src/test/modules/test_session_buffer_usage/meson.build
create mode 100644 src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 864b407abcf..c5ace162fe2 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -48,6 +48,7 @@ SUBDIRS = \
test_resowner \
test_rls_hooks \
test_saslprep \
+ test_session_buffer_usage \
test_shm_mq \
test_slru \
test_tidstore \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index e5acacd5083..802cc93d71a 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -49,6 +49,7 @@ subdir('test_regex')
subdir('test_resowner')
subdir('test_rls_hooks')
subdir('test_saslprep')
+subdir('test_session_buffer_usage')
subdir('test_shm_mq')
subdir('test_slru')
subdir('test_tidstore')
diff --git a/src/test/modules/test_session_buffer_usage/Makefile b/src/test/modules/test_session_buffer_usage/Makefile
new file mode 100644
index 00000000000..1252b222cb9
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_session_buffer_usage/Makefile
+
+MODULE_big = test_session_buffer_usage
+OBJS = \
+ $(WIN32RES) \
+ test_session_buffer_usage.o
+
+EXTENSION = test_session_buffer_usage
+DATA = test_session_buffer_usage--1.0.sql
+PGFILEDESC = "test_session_buffer_usage - show buffer usage statistics for the current session"
+
+REGRESS = test_session_buffer_usage
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_session_buffer_usage
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out b/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
new file mode 100644
index 00000000000..5f7d349871a
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
@@ -0,0 +1,342 @@
+LOAD 'test_session_buffer_usage';
+CREATE EXTENSION test_session_buffer_usage;
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM test_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+ ok
+----
+ t
+(1 row)
+
+-- Verify counters increase after buffer activity
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM test_session_buffer_usage();
+ blocks_increased
+------------------
+ t
+(1 row)
+
+DROP TABLE test_buf_activity;
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM test_session_buffer_usage(), dc_serial_result s;
+ leader_buffers_match
+----------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_dc_tab, dc_serial_result;
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+ERROR: division by zero
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM test_session_buffer_usage();
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+DROP TABLE exc_tab;
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+ERROR: duplicate key value violates unique constraint "unique_tab_a_key"
+DETAIL: Key (a)=(1) already exists.
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM test_session_buffer_usage();
+ unique_violation_buffers_visible
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE unique_tab;
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT subxact_exc_func();
+ subxact_exc_func
+------------------
+ caught
+(1 row)
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM test_session_buffer_usage();
+ subxact_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT cursor_exc_func();
+ cursor_exc_func
+-----------------------
+ caught after 250 rows
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM test_session_buffer_usage();
+ cursor_subxact_buffers_visible
+--------------------------------
+ t
+(1 row)
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM test_session_buffer_usage();
+-- Now trigger the same scan via a trigger that errors
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+ERROR: trigger error
+CONTEXT: PL/pgSQL function trig_err_func() line 4 at RAISE
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM test_session_buffer_usage(), trig_serial_result s;
+ trigger_abort_buffers_propagated
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CONTEXT: parallel worker
+RESET debug_parallel_query;
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM test_session_buffer_usage(), par_abort_serial_result s;
+ worker_abort_buffers_not_propagated
+-------------------------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_abort_tab, par_abort_serial_result;
+-- Cleanup
+DROP EXTENSION test_session_buffer_usage;
diff --git a/src/test/modules/test_session_buffer_usage/meson.build b/src/test/modules/test_session_buffer_usage/meson.build
new file mode 100644
index 00000000000..b96f67dc7fe
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+test_session_buffer_usage_sources = files(
+ 'test_session_buffer_usage.c',
+)
+
+if host_system == 'windows'
+ test_session_buffer_usage_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_session_buffer_usage',
+ '--FILEDESC', 'test_session_buffer_usage - show buffer usage statistics for the current session',])
+endif
+
+test_session_buffer_usage = shared_module('test_session_buffer_usage',
+ test_session_buffer_usage_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_session_buffer_usage
+
+test_install_data += files(
+ 'test_session_buffer_usage.control',
+ 'test_session_buffer_usage--1.0.sql',
+)
+
+tests += {
+ 'name': 'test_session_buffer_usage',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_session_buffer_usage',
+ ],
+ },
+}
diff --git a/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql b/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
new file mode 100644
index 00000000000..daf2159c4a6
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
@@ -0,0 +1,245 @@
+LOAD 'test_session_buffer_usage';
+CREATE EXTENSION test_session_buffer_usage;
+
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM test_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+
+-- Verify counters increase after buffer activity
+SELECT test_session_buffer_usage_reset();
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM test_session_buffer_usage();
+
+DROP TABLE test_buf_activity;
+
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+SELECT count(*) FROM par_dc_tab;
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM test_session_buffer_usage(), dc_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_dc_tab, dc_serial_result;
+
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+
+SELECT test_session_buffer_usage_reset();
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP TABLE exc_tab;
+
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+
+SELECT test_session_buffer_usage_reset();
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP TABLE unique_tab;
+
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT test_session_buffer_usage_reset();
+SELECT subxact_exc_func();
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT test_session_buffer_usage_reset();
+SELECT cursor_exc_func();
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM trig_work_tab;
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM test_session_buffer_usage();
+
+-- Now trigger the same scan via a trigger that errors
+SELECT test_session_buffer_usage_reset();
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM test_session_buffer_usage(), trig_serial_result s;
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT test_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+
+SELECT test_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+RESET debug_parallel_query;
+
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM test_session_buffer_usage(), par_abort_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_abort_tab, par_abort_serial_result;
+
+-- Cleanup
+DROP EXTENSION test_session_buffer_usage;
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
new file mode 100644
index 00000000000..e9833be470a
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
@@ -0,0 +1,31 @@
+/* src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_session_buffer_usage" to load this file. \quit
+
+CREATE FUNCTION test_session_buffer_usage(
+ OUT shared_blks_hit bigint,
+ OUT shared_blks_read bigint,
+ OUT shared_blks_dirtied bigint,
+ OUT shared_blks_written bigint,
+ OUT local_blks_hit bigint,
+ OUT local_blks_read bigint,
+ OUT local_blks_dirtied bigint,
+ OUT local_blks_written bigint,
+ OUT temp_blks_read bigint,
+ OUT temp_blks_written bigint,
+ OUT shared_blk_read_time double precision,
+ OUT shared_blk_write_time double precision,
+ OUT local_blk_read_time double precision,
+ OUT local_blk_write_time double precision,
+ OUT temp_blk_read_time double precision,
+ OUT temp_blk_write_time double precision
+)
+RETURNS record
+AS 'MODULE_PATHNAME', 'test_session_buffer_usage'
+LANGUAGE C PARALLEL RESTRICTED;
+
+CREATE FUNCTION test_session_buffer_usage_reset()
+RETURNS void
+AS 'MODULE_PATHNAME', 'test_session_buffer_usage_reset'
+LANGUAGE C PARALLEL RESTRICTED;
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
new file mode 100644
index 00000000000..50eb1a2ffe6
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_session_buffer_usage.c
+ * show buffer usage statistics for the current session
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/instrument.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "test_session_buffer_usage",
+ .version = PG_VERSION
+);
+
+#define NUM_BUFFER_USAGE_COLUMNS 16
+
+PG_FUNCTION_INFO_V1(test_session_buffer_usage);
+PG_FUNCTION_INFO_V1(test_session_buffer_usage_reset);
+
+#define HAVE_INSTR_STACK 1 /* Change to 0 when testing before stack
+ * change */
+
+/*
+ * SQL function: test_session_buffer_usage()
+ *
+ * Returns a single row with all BufferUsage counters accumulated since the
+ * start of the session. Excludes any usage not yet added to the top of the
+ * stack (e.g. if this gets called inside a statement that also had buffer
+ * activity).
+ */
+Datum
+test_session_buffer_usage(PG_FUNCTION_ARGS)
+{
+ TupleDesc tupdesc;
+ Datum values[NUM_BUFFER_USAGE_COLUMNS];
+ bool nulls[NUM_BUFFER_USAGE_COLUMNS];
+ BufferUsage *usage;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ memset(nulls, 0, sizeof(nulls));
+
+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif
+
+ values[0] = Int64GetDatum(usage->shared_blks_hit);
+ values[1] = Int64GetDatum(usage->shared_blks_read);
+ values[2] = Int64GetDatum(usage->shared_blks_dirtied);
+ values[3] = Int64GetDatum(usage->shared_blks_written);
+ values[4] = Int64GetDatum(usage->local_blks_hit);
+ values[5] = Int64GetDatum(usage->local_blks_read);
+ values[6] = Int64GetDatum(usage->local_blks_dirtied);
+ values[7] = Int64GetDatum(usage->local_blks_written);
+ values[8] = Int64GetDatum(usage->temp_blks_read);
+ values[9] = Int64GetDatum(usage->temp_blks_written);
+ values[10] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time));
+ values[11] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time));
+ values[12] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time));
+ values[13] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time));
+ values[14] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+ values[15] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * SQL function: test_session_buffer_usage_reset()
+ *
+ * Resets all BufferUsage counters on the top instrumentation stack to zero.
+ * Useful in tests to avoid the baseline/delta pattern.
+ */
+Datum
+test_session_buffer_usage_reset(PG_FUNCTION_ARGS)
+{
+#if HAVE_INSTR_STACK
+ memset(&instr_top.bufusage, 0, sizeof(BufferUsage));
+#else
+ memset(&pgBufferUsage, 0, sizeof(BufferUsage));
+#endif
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
new file mode 100644
index 00000000000..41cfb15a765
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
@@ -0,0 +1,5 @@
+# test_session_buffer_usage extension
+comment = 'show buffer usage statistics for the current session'
+default_version = '1.0'
+module_pathname = '$libdir/test_session_buffer_usage'
+relocatable = true
--
2.47.1
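The test module above reads only the top-level entry (instr_top.bufusage), which its header comment says excludes usage not yet added to the top of the stack. A minimal sketch of that behaviour follows, assuming that a nested entry's counts are folded into its parent when the entry is popped. All names here (UsageEntry, usage_push, usage_pop, usage_count_hit, usage_session_hits) are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a BufferUsage-style counter set. */
typedef struct UsageEntry
{
	long		blks_hit;
	struct UsageEntry *parent;	/* entry to restore on pop */
} UsageEntry;

/* Session-level entry; plays the role the test module gives to instr_top. */
UsageEntry	top_entry = {0, NULL};

/* Entry that currently receives increments. */
UsageEntry *current_entry = &top_entry;

/* Enter a nested scope: later increments go to 'entry'. */
void
usage_push(UsageEntry *entry)
{
	entry->blks_hit = 0;
	entry->parent = current_entry;
	current_entry = entry;
}

/*
 * Leave the scope: restore the parent and fold the child's counts into it.
 * The folding step is an assumption of this sketch.
 */
void
usage_pop(UsageEntry *entry)
{
	assert(current_entry == entry);
	current_entry = entry->parent;
	current_entry->blks_hit += entry->blks_hit;
}

/* Record one buffer hit against whichever entry is active. */
void
usage_count_hit(void)
{
	current_entry->blks_hit++;
}

/* What a session-level reader sees: only the top entry. */
long
usage_session_hits(void)
{
	return top_entry.blks_hit;
}
```

In this model, hits counted between usage_push() and usage_pop() are invisible to usage_session_hits() until the pop. That matches the caveat in the module's comment about calling the function inside a statement that also had buffer activity.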
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Andres Freund @ 2026-04-05 18:13 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; PostgreSQL Hackers <[email protected]>; Zsolt Parragi <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>
Hi,
Not a real reply to your email, just looking at committing 0001/0002 to get
them out of the way.
Unfortunately I think 0001 on its own doesn't actually work correctly. I
luckily tried an EXPLAIN ANALYZE with triggers and noticed that the time is
reported as zeroes.
The only reason I tried is because I misread the diff and thought you'd changed
the calls=%.3f to calls=%d, even though the old state is calls=%.0f...
The reason it doesn't work is that explain shows tginstr->instr.total, but
with the patch the trigger instrumentation just computes
tginstr->instr.{counter,firsttuple}.
And of course we don't have any tests even showing trigger output. Not that
such a test would have been likely to catch this issue, as something like
the amount of time is nontrivial to test.
This is actually fixed by 0002, as it makes InstrStop() update ->total,
rather than ->counter as before.
But I'd prefer not to break the intermediary state ;).
I guess we could squash both patches?
But probably the least bad solution is to add an InstrEndLoop() call in 0001 and
remove it again in 0002.
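The failure mode described above can be modelled in a few lines: if the stop step only accumulates into a per-loop counter, a reporter that reads the total sees zero until an end-of-loop step folds the counter in. This is a simplified sketch with hypothetical names (TimingSketch, sketch_stop, sketch_end_loop, sketch_reported) and plain doubles instead of instr_time; it is not the code from 0001 or 0002.

```c
#include <assert.h>

/* Hypothetical, simplified timing fields; times are plain seconds here. */
typedef struct TimingSketch
{
	double		counter;	/* time accumulated in the current loop */
	double		total;		/* time folded in from finished loops */
} TimingSketch;

/* Stop step that only accumulates into counter, never into total. */
void
sketch_stop(TimingSketch *t, double elapsed)
{
	assert(elapsed >= 0.0);
	t->counter += elapsed;
}

/* End-of-loop step: fold counter into total and reset counter. */
void
sketch_end_loop(TimingSketch *t)
{
	t->total += t->counter;
	t->counter = 0.0;
}

/* What a reporter that reads only total would print. */
double
sketch_reported(const TimingSketch *t)
{
	return t->total;
}
```

With this model, calling sketch_stop() any number of times leaves sketch_reported() at zero. Only a call to sketch_end_loop() makes the accumulated time visible, which is the role the suggested InstrEndLoop() call would play for the trigger case.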
Re 0002
In passing, drop the "n" argument to InstrAlloc, as all remaining callers
need exactly one Instrumentation struct.
I think that probably should be in 0001?
I'm kinda wondering whether, to keep the line lengths manageable,
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1837,7 +1837,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
{
double nloops = planstate->instrument->nloops;
double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
- double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
+ double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->instr.total) / nloops;
double rows = planstate->instrument->ntuples / nloops;
Should store planstate->instrument in a local var and wrap after =.
But not sure it's worth bothering with.
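The local-variable suggestion can be sketched as follows. The struct and function names (InstrSketch, NodeInstrSketch, PlanStateSketch, PerLoopSketch, per_loop_averages) are hypothetical stand-ins, and the fields hold milliseconds directly rather than going through INSTR_TIME_GET_MILLISEC; only the shape of the refactoring is the point.

```c
#include <assert.h>

/* Hypothetical stand-ins for the real executor structs. */
typedef struct InstrSketch
{
	double		total_ms;
} InstrSketch;

typedef struct NodeInstrSketch
{
	InstrSketch instr;			/* nested struct, as in ->instr.total */
	double		startup_ms;
	double		ntuples;
	double		nloops;
} NodeInstrSketch;

typedef struct PlanStateSketch
{
	NodeInstrSketch *instrument;
} PlanStateSketch;

typedef struct PerLoopSketch
{
	double		startup_ms;
	double		total_ms;
	double		rows;
} PerLoopSketch;

/* Compute per-loop averages, dereferencing planstate->instrument once. */
PerLoopSketch
per_loop_averages(const PlanStateSketch *planstate)
{
	const NodeInstrSketch *ninstr = planstate->instrument;
	double		nloops = ninstr->nloops;
	PerLoopSketch result;

	assert(nloops > 0);
	result.startup_ms = ninstr->startup_ms / nloops;
	result.total_ms = ninstr->instr.total_ms / nloops;
	result.rows = ninstr->ntuples / nloops;
	return result;
}
```

Holding the pointer in a local variable shortens every right-hand side. In the real code the INSTR_TIME_GET_MILLISEC wrapper would make the lines longer than shown here, so wrapping after the = may still be needed.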
Greetings,
Andres Freund
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-04-05 19:38 UTC (permalink / raw)
To: Andres Freund <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; PostgreSQL Hackers <[email protected]>; Zsolt Parragi <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>
On Sun, Apr 5, 2026 at 11:13 AM Andres Freund <[email protected]> wrote:
> Unfortunately I think 0001 on its own doesn't actually work correctly. I
> luckily tried an EXPLAIN ANALYZE with triggers and noticed that the time is
> reported as zeroes.
>
> The only reason I tried is because I misread the diff and thought you'd changed
> the calls=%.3f to calls=%d, even though the old state is calls=%.0f...
>
>
> The reason it doesn't work is that explain shows tginstr->instr.total, but
> with the patch the trigger instrumentation just computes
> tginstr->instr.{counter,firsttuple}.
Argh, good catch. That's on me for not manually testing it when I
factored it out.
I've confirmed this works now, both with 0001 only, and with 0001+0002.
> But probably the least bad solution is to add an InstrEndLoop() call in 0001 and
> remove it again in 0002.
Yeah, I've done that for now.
>
> Re 0002
>
> In passing, drop the "n" argument to InstrAlloc, as all remaining callers
> need exactly one Instrumentation struct.
>
> I think that probably should be in 0001?
Ack, done.
>
>
> I'm kinda wondering whether, to keep the line lengths manageable,
> --- a/src/backend/commands/explain.c
> +++ b/src/backend/commands/explain.c
> @@ -1837,7 +1837,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
> {
> double nloops = planstate->instrument->nloops;
> double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
> - double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
> + double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->instr.total) / nloops;
> double rows = planstate->instrument->ntuples / nloops;
>
> Should store planstate->instrument in a local var and wrap after =.
>
> But not sure it's worth bothering with.
Sure, seems easy enough.
See attached v14 with changes to 0001 and 0002 only. I've also moved
the PBHS/PIOS patches to their own thread [0].
Thanks,
Lukas
[0]: https://www.postgresql.org/message-id/[email protected]...
--
Lukas Fittl
Attachments:
[application/octet-stream] v14-0005-instrumentation-Add-additional-regression-tests-.patch (22.5K, 2-v14-0005-instrumentation-Add-additional-regression-tests-.patch)
From d6aae133d25088e5d3c06123fe45369f4001246a Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 5 Apr 2026 03:48:32 -0700
Subject: [PATCH v14 05/10] instrumentation: Add additional regression tests
covering buffer usage
This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.
Testing this is challenging because there can be different sources of
buffer activity, so we rely on temporary tables where we can to prove
that activity was captured and not lost. This supports a future commit
that will rework some of the instrumentation logic that could cause
areas covered by these tests to fail.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
.../pg_stat_statements/expected/utility.out | 70 +++++++
contrib/pg_stat_statements/expected/wal.out | 48 +++++
contrib/pg_stat_statements/sql/utility.sql | 56 ++++++
contrib/pg_stat_statements/sql/wal.sql | 33 +++
src/test/regress/expected/explain.out | 188 ++++++++++++++++++
src/test/regress/sql/explain.sql | 188 ++++++++++++++++++
6 files changed, 583 insertions(+)
diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index e4d6564ea5b..cba487f6be5 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -289,6 +289,76 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
1 | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
(3 rows)
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT run_explain_buffers_test();
+ run_explain_buffers_test
+--------------------------
+
+(1 row)
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+ query | has_buffer_stats
+-----------------------------------+------------------
+ SELECT run_explain_buffers_test() | t
+(1 row)
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+CALL pgss_call_rollback_proc();
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------------+-------+---------------+---------------------+-----------------------
+ CALL pgss_call_rollback_proc() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/expected/wal.out b/contrib/pg_stat_statements/expected/wal.out
index 977e382d848..611213daef6 100644
--- a/contrib/pg_stat_statements/expected/wal.out
+++ b/contrib/pg_stat_statements/expected/wal.out
@@ -28,3 +28,51 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
t
(1 row)
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT pgss_error_func();
+ pgss_error_func
+-----------------
+
+(1 row)
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------+-------+---------------+---------------------+-----------------------
+ SELECT pgss_error_func() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index dd97203c210..7540e49c73c 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -152,6 +152,62 @@ EXPLAIN (costs off) SELECT a FROM generate_series(1,10) AS tab(a) WHERE a = 7;
SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+SELECT run_explain_buffers_test();
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+CALL pgss_call_rollback_proc();
+
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
+
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/sql/wal.sql b/contrib/pg_stat_statements/sql/wal.sql
index 1dc1552a81e..467e321b206 100644
--- a/contrib/pg_stat_statements/sql/wal.sql
+++ b/contrib/pg_stat_statements/sql/wal.sql
@@ -18,3 +18,36 @@ wal_records > 0 as wal_records_generated,
wal_records >= rows as wal_records_ge_rows
FROM pg_stat_statements ORDER BY query COLLATE "C";
SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT pgss_error_func();
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..5ff96491b0a 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,191 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+ exception_buffers_nested_visible
+----------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly reported.
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_cursor_explain_buffers();
+ noscroll_ok | scroll_ok
+-------------+-----------
+ t | t
+(1 row)
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+-- Test trigger instrumentation.
+CREATE TEMP TABLE trig_test_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int);
+INSERT INTO trig_work_tab VALUES (1);
+CREATE FUNCTION trig_test_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM * FROM trig_work_tab;
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_test_trig
+ BEFORE INSERT ON trig_test_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_test_func();
+CREATE FUNCTION check_trigger_explain_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ trig json;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ INSERT INTO trig_test_tab VALUES (1)' INTO plan_json;
+ trig := plan_json->0->'Triggers'->0;
+ RETURN COALESCE((trig->>'Calls')::int, 0) > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_trigger_explain_buffers() AS trigger_buffers_visible;
+ trigger_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_trigger_explain_buffers;
+DROP TRIGGER trig_test_trig ON trig_test_tab;
+DROP FUNCTION trig_test_func;
+DROP TABLE trig_test_tab;
+DROP TABLE trig_work_tab;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..9f0e8524497 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,191 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly reported.
+
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_cursor_explain_buffers();
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+
+-- Test trigger instrumentation.
+CREATE TEMP TABLE trig_test_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int);
+INSERT INTO trig_work_tab VALUES (1);
+
+CREATE FUNCTION trig_test_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM * FROM trig_work_tab;
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trig_test_trig
+ BEFORE INSERT ON trig_test_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_test_func();
+
+CREATE FUNCTION check_trigger_explain_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ trig json;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ INSERT INTO trig_test_tab VALUES (1)' INTO plan_json;
+ trig := plan_json->0->'Triggers'->0;
+ RETURN COALESCE((trig->>'Calls')::int, 0) > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_trigger_explain_buffers() AS trigger_buffers_visible;
+
+DROP FUNCTION check_trigger_explain_buffers;
+DROP TRIGGER trig_test_trig ON trig_test_tab;
+DROP FUNCTION trig_test_func;
+DROP TABLE trig_test_tab;
+DROP TABLE trig_work_tab;
--
2.47.1
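For readers following the tests above: the EXCEPTION-block and cursor cases check that buffer activity recorded at an inner level still shows up in the outer EXPLAIN (ANALYZE, BUFFERS) output, even when the inner level ends through an error. Below is a minimal standalone sketch of that idea, assuming a stack of usage entries where each entry's counts are folded into its parent when it is popped or unwound. All names (UsageEntry, usage_push, usage_pop, usage_unwind_to, count_buffer_hit) are hypothetical and not taken from the patch series; whether the actual patches accumulate into the parent at pop time or by some other means is not shown in this part of the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one stack entry of usage counters. */
typedef struct UsageEntry
{
	long		blks_hit;		/* buffer hits charged to this entry */
	struct UsageEntry *parent;	/* entry that was current when we were pushed */
} UsageEntry;

/* Top of the stack; NULL means nothing is being instrumented. */
static UsageEntry *current_entry = NULL;

/* Make 'entry' the current target for counter increments. */
static void
usage_push(UsageEntry *entry)
{
	entry->blks_hit = 0;
	entry->parent = current_entry;
	current_entry = entry;
}

/*
 * Restore the previous entry and fold our counts into it, so that the
 * enclosing level reports inclusive totals (an assumption of this sketch).
 */
static void
usage_pop(UsageEntry *entry)
{
	assert(current_entry == entry);
	current_entry = entry->parent;
	if (current_entry != NULL)
		current_entry->blks_hit += entry->blks_hit;
}

/* Called wherever a buffer hit happens: a single increment, no diffing. */
static void
count_buffer_hit(void)
{
	if (current_entry != NULL)
		current_entry->blks_hit++;
}

/*
 * Abort handling: pop every entry above 'keep', still folding counts into
 * the parent, so work done before the error is not lost.
 */
static void
usage_unwind_to(UsageEntry *keep)
{
	while (current_entry != NULL && current_entry != keep)
		usage_pop(current_entry);
}

/* Models the EXCEPTION-block case: the inner level errors out after 3 hits. */
long
example_exception_path(void)
{
	UsageEntry	outer;
	UsageEntry	inner;

	usage_push(&outer);
	count_buffer_hit();			/* charged to outer directly */

	usage_push(&inner);
	count_buffer_hit();
	count_buffer_hit();
	count_buffer_hit();
	/* an "error" is raised here, so inner is never popped normally */
	usage_unwind_to(&outer);

	usage_pop(&outer);
	return outer.blks_hit;		/* own hit plus the three inner hits */
}
```

Calling example_exception_path() from a small main() returns the outer entry's inclusive count. The point of the sketch is only that the hot path is a single increment on the current entry, and that unwinding has to visit abandoned entries so their counts are not dropped.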
[application/octet-stream] v14-0003-instrumentation-Use-Instrumentation-instead-of-m.patch (19.3K, 3-v14-0003-instrumentation-Use-Instrumentation-instead-of-m.patch)
From f4afa2ed300db98d184803f3eb3be4197064ecff Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 5 Apr 2026 05:08:23 -0700
Subject: [PATCH v14 03/10] instrumentation: Use Instrumentation instead of
manual buffer tracking
This replaces the repeated open-coded blocks that snapshot pgBufferUsage /
pgWalUsage (and, in some callers, also run a timer to measure elapsed time)
with the new Instrumentation struct and its associated helpers.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
.../pg_stat_statements/pg_stat_statements.c | 67 +++++--------------
src/backend/access/heap/vacuumlazy.c | 15 ++---
src/backend/commands/analyze.c | 31 +++++----
src/backend/commands/explain.c | 44 ++++++------
src/backend/commands/explain_dr.c | 56 +++++++---------
src/backend/commands/prepare.c | 28 +++-----
src/include/commands/explain_dr.h | 5 +-
7 files changed, 94 insertions(+), 152 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index fbf32f0e72c..63975706b87 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -911,22 +911,11 @@ pgss_planner(Query *parse,
&& pgss_track_planning && query_string
&& parse->queryId != INT64CONST(0))
{
- instr_time start;
- instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
-
- /*
- * Similarly the planner could write some WAL records in some cases
- * (e.g. setting a hint bit with those being WAL-logged)
- */
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ /* Track time and buffer/WAL usage as the planner can access them. */
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -940,30 +929,20 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrStop(&instr);
nesting_level--;
}
PG_END_TRY();
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
parse->stmt_len,
PGSS_PLAN,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
0,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
@@ -1156,17 +1135,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
!IsA(parsetree, ExecuteStmt) &&
!IsA(parsetree, PrepareStmt))
{
- instr_time start;
- instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -1182,6 +1155,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrStop(&instr);
nesting_level--;
}
PG_END_TRY();
@@ -1196,9 +1170,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
* former value, which'd otherwise be a good idea.
*/
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
/*
* Track the total number of rows retrieved or affected by the utility
* statements of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED
@@ -1210,23 +1181,15 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
saved_stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
rows,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 88c71cd85b6..30f589c9207 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -637,8 +637,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ Instrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -654,6 +653,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -984,14 +985,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr);
+
if (verbose || params->log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params->log_vacuum_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -1000,12 +1001,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
+ BufferUsage bufferusage = instr->bufusage;
+ WalUsage walusage = instr->walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
total_blks_hit = bufferusage.shared_blks_hit +
bufferusage.local_blks_hit;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 49a5cdf579c..8472fc0c280 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -309,9 +309,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ Instrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -362,6 +360,9 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -742,12 +743,13 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr);
+
if (verbose || params->log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params->log_analyze_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -755,18 +757,15 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
-
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ BufferUsage bufusage = instr->bufusage;
+ WalUsage walusage = instr->walusage;
+
+ total_blks_hit = bufusage.shared_blks_hit +
+ bufusage.local_blks_hit;
+ total_blks_read = bufusage.shared_blks_read +
+ bufusage.local_blks_read;
+ total_blks_dirtied = bufusage.shared_blks_dirtied +
+ bufusage.local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index de4c1a250d1..2afe858c441 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -324,13 +324,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation plan_instr = {0};
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ int instrument_options = INSTRUMENT_TIMER;
+
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ InstrInitOptions(&plan_instr, instrument_options);
if (es->memory)
{
@@ -348,15 +351,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(&plan_instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(&plan_instr);
if (es->memory)
{
@@ -364,16 +364,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
@@ -590,7 +583,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* grab serialization metrics before we destroy the DestReceiver */
if (es->serialize != EXPLAIN_SERIALIZE_NONE)
- serializeMetrics = GetSerializationMetrics(dest);
+ {
+ SerializeMetrics *metrics = GetSerializationMetrics(dest);
+
+ if (metrics)
+ memcpy(&serializeMetrics, metrics, sizeof(SerializeMetrics));
+ }
/* call the DestReceiver's destroy method even during explain */
dest->rDestroy(dest);
@@ -1019,7 +1017,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
ExplainIndentText(es);
if (es->timing)
appendStringInfo(es->str, "Serialization: time=%.3f ms output=" UINT64_FORMAT "kB format=%s\n",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
else
@@ -1027,10 +1025,10 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
- if (es->buffers && peek_buffer_usage(es, &metrics->bufferUsage))
+ if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
es->indent--;
}
}
@@ -1038,13 +1036,13 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
{
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
3, es);
ExplainPropertyUInteger("Output Volume", "kB",
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 3c96061cf32..34fe4f8f6dd 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,11 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ Instrumentation *instr = &myState->metrics.instr;
/* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ if (instr->need_timer || instr->need_bufusage)
+ InstrStart(instr);
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +182,9 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
- /* Update timing data */
- if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
-
- /* Update buffer metrics */
- if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ /* Stop per-tuple measurement */
+ if (instr->need_timer || instr->need_bufusage)
+ InstrStop(instr);
return true;
}
@@ -233,9 +220,17 @@ serializeAnalyzeStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
/* The output buffer is re-used across rows, as in printtup.c */
initStringInfo(&receiver->buf);
- /* Initialize results counters */
+ /* Initialize metrics and per-tuple instrumentation */
memset(&receiver->metrics, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(receiver->metrics.timeSpent);
+ {
+ int instrument_options = 0;
+
+ if (receiver->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (receiver->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ InstrInitOptions(&receiver->metrics.instr, instrument_options);
+ }
}
/*
@@ -290,22 +285,17 @@ CreateExplainSerializeDestReceiver(ExplainState *es)
}
/*
- * GetSerializationMetrics - collect metrics
+ * GetSerializationMetrics - get serialization metrics
*
- * We have to be careful here since the receiver could be an IntoRel
- * receiver if the subject statement is CREATE TABLE AS. In that
- * case, return all-zeroes stats.
+ * Returns a pointer to the SerializeMetrics inside the dest receiver,
+ * or NULL if the receiver is not a SerializeDestReceiver (e.g. an IntoRel
+ * receiver for CREATE TABLE AS).
*/
-SerializeMetrics
+SerializeMetrics *
GetSerializationMetrics(DestReceiver *dest)
{
- SerializeMetrics empty;
-
if (dest->mydest == DestExplainSerialize)
- return ((SerializeDestReceiver *) dest)->metrics;
-
- memset(&empty, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(empty.timeSpent);
+ return &((SerializeDestReceiver *) dest)->metrics;
- return empty;
+ return NULL;
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 876aad2100a..bf9f2eb6149 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -22,6 +22,7 @@
#include "catalog/pg_type.h"
#include "commands/createas.h"
#include "commands/explain.h"
+#include "executor/instrument.h"
#include "commands/explain_format.h"
#include "commands/explain_state.h"
#include "commands/prepare.h"
@@ -580,14 +581,17 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation plan_instr = {0};
+ int instrument_options = INSTRUMENT_TIMER;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ InstrInitOptions(&plan_instr, instrument_options);
+
if (es->memory)
{
/* See ExplainOneQuery about this */
@@ -598,9 +602,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(&plan_instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -635,8 +637,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(&plan_instr);
if (es->memory)
{
@@ -644,13 +645,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -660,7 +654,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/include/commands/explain_dr.h b/src/include/commands/explain_dr.h
index f98eaae1864..ab5c53023e1 100644
--- a/src/include/commands/explain_dr.h
+++ b/src/include/commands/explain_dr.h
@@ -23,11 +23,10 @@ typedef struct ExplainState ExplainState;
typedef struct SerializeMetrics
{
uint64 bytesSent; /* # of bytes serialized */
- instr_time timeSpent; /* time spent serializing */
- BufferUsage bufferUsage; /* buffers accessed during serialization */
+ Instrumentation instr; /* time and buffer usage */
} SerializeMetrics;
extern DestReceiver *CreateExplainSerializeDestReceiver(ExplainState *es);
-extern SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
+extern SerializeMetrics *GetSerializationMetrics(DestReceiver *dest);
#endif
--
2.47.1
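As an illustration of the refactoring in the patch above: each converted call site used to take a snapshot of the global counters, do its work, then zero a result struct and accumulate the difference. The sketch below contrasts that open-coded shape with a start/stop helper that bundles the same steps. It is a standalone miniature, not PostgreSQL code: MiniUsage, MiniInstr, mini_instr_start and mini_instr_stop are hypothetical names, and the helper is implemented with the same snapshot-and-diff arithmetic purely to keep the example short. The series' own InstrStart/InstrStop are defined in a different patch and may work differently internally.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical miniature of a usage counter set; not the real BufferUsage. */
typedef struct MiniUsage
{
	long		blks_hit;
	long		blks_read;
} MiniUsage;

/* Global running total, playing the role pgBufferUsage plays on master. */
static MiniUsage global_usage;

/* dst += (now - start), the same shape as BufferUsageAccumDiff. */
static void
mini_usage_accum_diff(MiniUsage *dst, const MiniUsage *now,
					  const MiniUsage *start)
{
	dst->blks_hit += now->blks_hit - start->blks_hit;
	dst->blks_read += now->blks_read - start->blks_read;
}

/* Hypothetical helper struct bundling the snapshot and the result. */
typedef struct MiniInstr
{
	MiniUsage	start;			/* snapshot taken by mini_instr_start */
	MiniUsage	usage;			/* accumulated result */
} MiniInstr;

static void
mini_instr_start(MiniInstr *instr)
{
	instr->start = global_usage;
}

static void
mini_instr_stop(MiniInstr *instr)
{
	mini_usage_accum_diff(&instr->usage, &global_usage, &instr->start);
}

/* Simulated work: 5 buffer hits and 2 reads. */
static void
do_work(void)
{
	global_usage.blks_hit += 5;
	global_usage.blks_read += 2;
}

/* Old shape: every caller repeats snapshot, memset and diff by hand. */
long
measure_open_coded(void)
{
	MiniUsage	start = global_usage;
	MiniUsage	usage;

	do_work();

	memset(&usage, 0, sizeof(usage));
	mini_usage_accum_diff(&usage, &global_usage, &start);
	return usage.blks_hit + usage.blks_read;
}

/* New shape: the caller only brackets the work with start/stop. */
long
measure_with_helper(void)
{
	MiniInstr	instr = {0};

	mini_instr_start(&instr);
	do_work();
	mini_instr_stop(&instr);
	return instr.usage.blks_hit + instr.usage.blks_read;
}
```

Both functions report the same total for the same work. What changes is that the bookkeeping lives in one place, which is what lets the implementation behind the helper be swapped later without touching the call sites.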
[application/octet-stream] v14-0002-instrumentation-Separate-per-node-logic-from-oth.patch (27.9K, 4-v14-0002-instrumentation-Separate-per-node-logic-from-oth.patch)
From af97c724475175846b416d88511e5471d73e690c Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 01:19:50 -0800
Subject: [PATCH v14 02/10] instrumentation: Separate per-node logic from other
uses
Previously, several places (e.g. query "total time") repurposed the
Instrumentation struct that was originally introduced to capture per-node
statistics during execution. This overuse of one struct is confusing, e.g.
it clutters unrelated code paths with calls to InstrStartNode/InstrStopNode,
and it gets in the way of future refactorings.
Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.
Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.
In passing, clarify that InstrAggNode is expected to only run after
InstrEndLoop (as it does in practice), and drop unused code.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com
---
contrib/auto_explain/auto_explain.c | 8 +-
.../pg_stat_statements/pg_stat_statements.c | 8 +-
contrib/postgres_fdw/postgres_fdw.c | 2 +-
src/backend/commands/explain.c | 30 ++--
src/backend/executor/execMain.c | 8 +-
src/backend/executor/execParallel.c | 24 +--
src/backend/executor/execProcnode.c | 4 +-
src/backend/executor/instrument.c | 150 +++++++++++-------
src/include/executor/instrument.h | 59 ++++---
src/include/nodes/execnodes.h | 9 +-
src/tools/pgindent/typedefs.list | 3 +-
11 files changed, 183 insertions(+), 122 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 5f5c1ff0da3..39bf2543b70 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
*/
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
/* Log plan if duration is exceeded. */
msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index ddbd5727ddf..fbf32f0e72c 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1025,7 +1025,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
MemoryContextSwitchTo(oldcxt);
}
}
@@ -1084,12 +1084,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
if (queryId != INT64CONST(0) && queryDesc->totaltime &&
pgss_enabled(nesting_level))
{
- /*
- * Make sure stats accumulation is done. (Note: it's okay if several
- * levels of hook all do this.)
- */
- InstrEndLoop(queryDesc->totaltime);
-
pgss_store(queryDesc->sourceText,
queryId,
queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 41e47cc795b..cc8ec24c30e 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2779,7 +2779,7 @@ postgresIterateDirectModify(ForeignScanState *node)
if (!resultRelInfo->ri_projectReturning)
{
TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
- Instrumentation *instr = node->ss.ps.instrument;
+ NodeInstrumentation *instr = node->ss.ps.instrument;
Assert(!dmstate->has_returning);
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 6b7b23049ca..de4c1a250d1 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1105,9 +1105,6 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
char *relname;
char *conname = NULL;
- /* Ensure total timing is updated from the internal counter */
- InstrEndLoop(&tginstr->instr);
-
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
@@ -1838,10 +1835,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (es->analyze &&
planstate->instrument && planstate->instrument->nloops > 0)
{
- double nloops = planstate->instrument->nloops;
- double startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
- double total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
- double rows = planstate->instrument->ntuples / nloops;
+ NodeInstrumentation *instr = planstate->instrument;
+ double nloops = instr->nloops;
+ double startup_ms = INSTR_TIME_GET_MILLISEC(instr->startup) / nloops;
+ double total_ms = INSTR_TIME_GET_MILLISEC(instr->instr.total) / nloops;
+ double rows = instr->ntuples / nloops;
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -1893,11 +1891,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* prepare per-worker general execution details */
if (es->workers_state && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
double startup_ms;
double total_ms;
@@ -1906,7 +1904,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (nloops <= 0)
continue;
startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
- total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
+ total_ms = INSTR_TIME_GET_MILLISEC(instrument->instr.total) / nloops;
rows = instrument->ntuples / nloops;
ExplainOpenWorker(n, es);
@@ -2293,18 +2291,18 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage);
if (es->wal && planstate->instrument)
- show_wal_usage(es, &planstate->instrument->walusage);
+ show_wal_usage(es, &planstate->instrument->instr.walusage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
{
- WorkerInstrumentation *w = planstate->worker_instrument;
+ WorkerNodeInstrumentation *w = planstate->worker_instrument;
for (int n = 0; n < w->num_workers; n++)
{
- Instrumentation *instrument = &w->instrument[n];
+ NodeInstrumentation *instrument = &w->instrument[n];
double nloops = instrument->nloops;
if (nloops <= 0)
@@ -2312,9 +2310,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage);
if (es->wal)
- show_wal_usage(es, &instrument->walusage);
+ show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
}
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 0237d8c3b1d..b0f636bf8b6 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -333,7 +333,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/*
* extract information from the query descriptor and the query feature.
@@ -385,7 +385,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
dest->rShutdown(dest);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, estate->es_processed);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -435,7 +435,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStartNode(queryDesc->totaltime);
+ InstrStart(queryDesc->totaltime);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -445,7 +445,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
AfterTriggerEndQuery(estate);
if (queryDesc->totaltime)
- InstrStopNode(queryDesc->totaltime, 0);
+ InstrStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 755191b51ef..78f60c1530c 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,7 +87,7 @@ typedef struct FixedParallelExecutorState
* instrument_options: Same meaning here as in instrument.c.
*
* instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object. This will depend on the length of
+ * of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
* num_workers: Number of workers.
@@ -104,11 +104,15 @@ struct SharedExecutorInstrumentation
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
- /* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+ /*
+ * array of num_plan_nodes * num_workers NodeInstrumentation objects
+ * follows
+ */
};
#define GetInstrumentationArray(sei) \
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
- (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+ (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
@@ -731,7 +735,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len = MAXALIGN(instrumentation_len);
instrument_offset = instrumentation_len;
instrumentation_len +=
- mul_size(sizeof(Instrumentation),
+ mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -817,7 +821,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*/
if (estate->es_instrument)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -827,7 +831,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInit(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1059,7 +1063,7 @@ static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation)
{
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
int i;
int n;
int ibytes;
@@ -1087,9 +1091,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
* Switch into per-query memory context.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
- ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+ ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
planstate->worker_instrument =
- palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+ palloc(ibytes + offsetof(WorkerNodeInstrumentation, instrument));
MemoryContextSwitchTo(oldcontext);
planstate->worker_instrument->num_workers = instrumentation->num_workers;
@@ -1319,7 +1323,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
{
int i;
int plan_node_id = planstate->plan->plan_node_id;
- Instrumentation *instrument;
+ NodeInstrumentation *instrument;
InstrEndLoop(planstate->instrument);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index a2047e4dbc6..132fe37ef60 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -414,8 +414,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(estate->es_instrument,
- result->async_capable);
+ result->instrument = InstrAllocNode(estate->es_instrument,
+ result->async_capable);
return result;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index abb391c114c..e3d890a7f98 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,43 +26,31 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure */
+/* General purpose instrumentation handling */
Instrumentation *
-InstrAlloc(int instrument_options, bool async_mode)
+InstrAlloc(int instrument_options)
{
- Instrumentation *instr;
-
- /* initialize all fields to zeroes, then modify as needed */
- instr = palloc0(sizeof(Instrumentation));
- if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
- {
- instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
- instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
- instr->async_mode = async_mode;
- }
+ Instrumentation *instr = palloc0(sizeof(Instrumentation));
+ InstrInitOptions(instr, instrument_options);
return instr;
}
-/* Initialize a pre-allocated instrumentation structure. */
void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- memset(instr, 0, sizeof(Instrumentation));
instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-/* Entry to a plan node */
void
-InstrStartNode(Instrumentation *instr)
+InstrStart(Instrumentation *instr)
{
if (instr->need_timer)
{
if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ elog(ERROR, "InstrStart called twice in a row");
else
INSTR_TIME_SET_CURRENT(instr->starttime);
}
@@ -75,24 +63,19 @@ InstrStartNode(Instrumentation *instr)
instr->walusage_start = pgWalUsage;
}
-/* Exit from a plan node */
void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
-
/* let's update the time only if the timer was requested */
if (instr->need_timer)
{
if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStopNode called without start");
+ elog(ERROR, "InstrStop called without start");
INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
}
@@ -105,6 +88,74 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (instr->need_walusage)
WalUsageAccumDiff(&instr->walusage,
&pgWalUsage, &instr->walusage_start);
+}
+
+/* Node instrumentation handling */
+
+/* Allocate new node instrumentation structure */
+NodeInstrumentation *
+InstrAllocNode(int instrument_options, bool async_mode)
+{
+ NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+
+ InstrInitNode(instr, instrument_options);
+ instr->async_mode = async_mode;
+
+ return instr;
+}
+
+/* Initialize a pre-allocated instrumentation structure. */
+void
+InstrInitNode(NodeInstrumentation *instr, int instrument_options)
+{
+ memset(instr, 0, sizeof(NodeInstrumentation));
+ InstrInitOptions(&instr->instr, instrument_options);
+}
+
+/* Entry to a plan node */
+void
+InstrStartNode(NodeInstrumentation *instr)
+{
+ InstrStart(&instr->instr);
+}
+
+/* Exit from a plan node */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ double save_tuplecount = instr->tuplecount;
+ instr_time endtime;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
+
+ /*
+ * Update the time only if the timer was requested.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+ if (instr->instr.need_timer)
+ {
+ if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
+ elog(ERROR, "InstrStopNode called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
+ }
+
+ /* Add delta of buffer usage since entry to node's totals */
+ if (instr->instr.need_bufusage)
+ BufferUsageAccumDiff(&instr->instr.bufusage,
+ &pgBufferUsage, &instr->instr.bufusage_start);
+
+ if (instr->instr.need_walusage)
+ WalUsageAccumDiff(&instr->instr.walusage,
+ &pgWalUsage, &instr->instr.walusage_start);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -125,7 +176,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
/* Update tuple count */
void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
{
/* count the returned tuples */
instr->tuplecount += nTuples;
@@ -133,47 +184,40 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
/* Finish a run cycle for a plan node */
void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation *instr)
{
/* Skip if nothing has happened, or already shut down */
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
elog(ERROR, "InstrEndLoop called on running node");
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
- INSTR_TIME_ADD(instr->total, instr->counter);
+ INSTR_TIME_ADD(instr->instr.total, instr->counter);
instr->ntuples += instr->tuplecount;
instr->nloops += 1;
/* Reset for next cycle (if any) */
instr->running = false;
- INSTR_TIME_SET_ZERO(instr->starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
INSTR_TIME_SET_ZERO(instr->counter);
INSTR_TIME_SET_ZERO(instr->firsttuple);
instr->tuplecount = 0;
}
-/* aggregate instrumentation information */
+/*
+ * Aggregate instrumentation from parallel workers. Must be called after
+ * InstrEndLoop.
+ */
void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
{
- if (!dst->running && add->running)
- {
- dst->running = true;
- dst->firsttuple = add->firsttuple;
- }
- else if (dst->running && add->running &&
- INSTR_TIME_GT(dst->firsttuple, add->firsttuple))
- dst->firsttuple = add->firsttuple;
-
- INSTR_TIME_ADD(dst->counter, add->counter);
+ Assert(!add->running);
- dst->tuplecount += add->tuplecount;
INSTR_TIME_ADD(dst->startup, add->startup);
- INSTR_TIME_ADD(dst->total, add->total);
+ INSTR_TIME_ADD(dst->instr.total, add->instr.total);
dst->ntuples += add->ntuples;
dst->ntuples2 += add->ntuples2;
dst->nloops += add->nloops;
@@ -181,11 +225,11 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->need_bufusage)
- BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ if (dst->instr.need_bufusage)
+ BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
- if (dst->need_walusage)
- WalUsageAdd(&dst->walusage, &add->walusage);
+ if (dst->instr.need_walusage)
+ WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
}
/* Trigger instrumentation handling */
@@ -196,7 +240,7 @@ InstrAllocTrigger(int n, int instrument_options)
int i;
for (i = 0; i < n; i++)
- InstrInit(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, instrument_options);
return tginstr;
}
@@ -204,13 +248,13 @@ InstrAllocTrigger(int n, int instrument_options)
void
InstrStartTrigger(TriggerInstrumentation *tginstr)
{
- InstrStartNode(&tginstr->instr);
+ InstrStart(&tginstr->instr);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
- InstrStopNode(&tginstr->instr, 0);
+ InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 056287739f2..b11d64633b5 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -67,38 +67,55 @@ typedef enum InstrumentOption
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
+/*
+ * General purpose instrumentation that can capture time and WAL/buffer usage
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured in between).
+ */
typedef struct Instrumentation
{
- /* Parameters set at node creation: */
+ /* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
bool need_bufusage; /* true if we need buffer usage data */
bool need_walusage; /* true if we need WAL usage data */
+ /* Internal state keeping: */
+ instr_time starttime; /* start time of last InstrStart */
+ BufferUsage bufusage_start; /* buffer usage at start */
+ WalUsage walusage_start; /* WAL usage at start */
+ /* Accumulated statistics: */
+ instr_time total; /* total runtime */
+ BufferUsage bufusage; /* total buffer usage */
+ WalUsage walusage; /* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
+{
+ Instrumentation instr;
+ /* Parameters set at node creation: */
bool async_mode; /* true if node is in async mode */
/* Info about current plan cycle: */
bool running; /* true if we've completed first tuple */
- instr_time starttime; /* start time of current iteration of node */
instr_time counter; /* accumulated runtime for this node */
instr_time firsttuple; /* time for first tuple of this cycle */
double tuplecount; /* # of tuples emitted so far this cycle */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics across all completed cycles: */
instr_time startup; /* total startup time */
- instr_time total; /* total time */
double ntuples; /* total tuples produced */
double ntuples2; /* secondary node-specific tuple counter */
double nloops; /* # of run cycles for this node */
double nfiltered1; /* # of tuples removed by scanqual or joinqual */
double nfiltered2; /* # of tuples removed by "other" quals */
- BufferUsage bufusage; /* total buffer usage */
- WalUsage walusage; /* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
-typedef struct WorkerInstrumentation
+typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
- Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
-} WorkerInstrumentation;
+ NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+} WorkerNodeInstrumentation;
typedef struct TriggerInstrumentation
{
@@ -110,13 +127,19 @@ typedef struct TriggerInstrumentation
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int instrument_options, bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int instrument_options);
+extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr);
+
+extern NodeInstrumentation *InstrAllocNode(int instrument_options,
+ bool async_mode);
+extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation *instr);
+extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation *instr);
+extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 908898aa7c9..3ecae7552fc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -60,6 +60,7 @@ typedef struct ScanKeyData ScanKeyData;
typedef struct SnapshotData *Snapshot;
typedef struct SortSupportData *SortSupport;
typedef struct TIDBitmap TIDBitmap;
+typedef struct NodeInstrumentation NodeInstrumentation;
typedef struct TriggerInstrumentation TriggerInstrumentation;
typedef struct TupleConversionMap TupleConversionMap;
typedef struct TupleDescData *TupleDesc;
@@ -68,7 +69,7 @@ typedef struct Tuplestorestate Tuplestorestate;
typedef struct TupleTableSlot TupleTableSlot;
typedef struct TupleTableSlotOps TupleTableSlotOps;
typedef struct WalUsage WalUsage;
-typedef struct WorkerInstrumentation WorkerInstrumentation;
+typedef struct WorkerNodeInstrumentation WorkerNodeInstrumentation;
/* ----------------
@@ -1207,8 +1208,10 @@ typedef struct PlanState
ExecProcNodeMtd ExecProcNodeReal; /* actual function, if above is a
* wrapper */
- Instrumentation *instrument; /* Optional runtime stats for this node */
- WorkerInstrumentation *worker_instrument; /* per-worker instrumentation */
+ NodeInstrumentation *instrument; /* Optional runtime stats for this
+ * node */
+ WorkerNodeInstrumentation *worker_instrument; /* per-worker
+ * instrumentation */
/* Per-worker JIT instrumentation */
struct SharedJitInstrumentation *worker_jit_instrument;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 6a328fceaee..ca0c86d9e59 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1824,6 +1824,7 @@ NextSampleBlock_function
NextSampleTuple_function
NextValueExpr
Node
+NodeInstrumentation
NodeTag
NonEmptyRange
NoneCompressorState
@@ -3438,9 +3439,9 @@ WorkTableScan
WorkTableScanState
WorkerInfo
WorkerInfoData
-WorkerInstrumentation
WorkerJobDumpPtrType
WorkerJobRestorePtrType
+WorkerNodeInstrumentation
Working_State
WriteBufPtrType
WriteBytePtrType
--
2.47.1
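
The split above can be hard to follow from the diff alone, so here is a minimal standalone sketch of the idea, not the patch's code. It assumes simplified stand-ins: a plain long tick counter instead of instr_time, no buffer/WAL fields, and hypothetical Sketch*/sketch_* names that do not exist in PostgreSQL. It shows the general struct accumulating directly into "total" on stop, while the per-node wrapper accumulates into a per-cycle "counter" that is only folded into "total" at the end of the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified stand-in for instr_time: a plain tick counter (assumption). */
typedef long sketch_time;

/* General-purpose part: start time plus accumulated total. */
typedef struct SketchInstr
{
    bool        need_timer;
    sketch_time starttime;      /* 0 means "not started" */
    sketch_time total;
} SketchInstr;

/* Per-node part embeds the general struct and adds per-cycle state. */
typedef struct SketchNodeInstr
{
    SketchInstr instr;
    bool        running;
    sketch_time counter;        /* runtime within the current cycle */
    double      tuplecount;     /* tuples within the current cycle */
    double      ntuples;        /* tuples across completed cycles */
    double      nloops;         /* number of completed cycles */
} SketchNodeInstr;

static void
sketch_start(SketchInstr *instr, sketch_time now)
{
    if (instr->need_timer)
        instr->starttime = now;
}

/* General stop: adds the elapsed time straight into total. */
static void
sketch_stop(SketchInstr *instr, sketch_time now)
{
    if (instr->need_timer)
    {
        instr->total += now - instr->starttime;
        instr->starttime = 0;
    }
}

/* Node stop: adds into the per-cycle counter and leaves total alone. */
static void
sketch_stop_node(SketchNodeInstr *node, sketch_time now, double ntuples)
{
    node->tuplecount += ntuples;
    if (node->instr.need_timer)
    {
        node->counter += now - node->instr.starttime;
        node->instr.starttime = 0;
    }
    node->running = true;       /* simplified: no first-tuple bookkeeping */
}

/* End of cycle: fold the per-cycle values into the totals and reset. */
static void
sketch_end_loop(SketchNodeInstr *node)
{
    if (!node->running)
        return;
    node->instr.total += node->counter;
    node->ntuples += node->tuplecount;
    node->nloops += 1;
    node->running = false;
    node->counter = 0;
    node->tuplecount = 0;
}

/* General usage: one start/stop pair from tick 10 to tick 15. */
sketch_time
sketch_demo_general_total(void)
{
    SketchInstr instr;

    memset(&instr, 0, sizeof(instr));
    instr.need_timer = true;
    sketch_start(&instr, 10);
    sketch_stop(&instr, 15);
    return instr.total;
}

/* Node usage: two start/stop pairs (5 and 7 ticks), then the loop ends. */
sketch_time
sketch_demo_node_total(void)
{
    SketchNodeInstr node;

    memset(&node, 0, sizeof(node));
    node.instr.need_timer = true;
    sketch_start(&node.instr, 10);
    sketch_stop_node(&node, 15, 1.0);
    sketch_start(&node.instr, 20);
    sketch_stop_node(&node, 27, 1.0);
    /* total is still untouched; only the per-cycle counter has advanced */
    assert(node.instr.total == 0);
    sketch_end_loop(&node);
    return node.instr.total;
}
```

Calling the two demo functions from a small main() is enough to try this out. In the sketch, as in the patch's comment on InstrStopNode, the node-level total only becomes meaningful after the end-of-loop step, which is presumably why InstrAggNode now asserts that the added entry is no longer running.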
[application/octet-stream] v14-0004-instrumentation-Replace-direct-changes-of-pgBuff.patch (9.0K, 5-v14-0004-instrumentation-Replace-direct-changes-of-pgBuff.patch)
From 69ed8a67ddb1aa7934ab104db1eae2c8892b67ed Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Thu, 26 Mar 2026 23:31:04 -0700
Subject: [PATCH v14 04/10] instrumentation: Replace direct changes of
pgBufferUsage/pgWalUsage with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
---
src/backend/access/transam/xlog.c | 10 +++++-----
src/backend/storage/buffer/bufmgr.c | 20 ++++++++++----------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
6 files changed, 45 insertions(+), 26 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9e8999bbb61..71c9a265662 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1103,10 +1103,10 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
- pgWalUsage.wal_fpi_bytes += fpi_bytes;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
+ INSTR_WALUSAGE_ADD(wal_fpi_bytes, fpi_bytes);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2085,7 +2085,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = InvalidXLogRecPtr;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cc0b0bdd92..3e1c39160db 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -840,7 +840,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -861,7 +861,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
if (BufferTagsEqual(&tag, &bufHdr->tag))
{
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
UnpinBuffer(bufHdr);
@@ -1684,9 +1684,9 @@ TrackBufferHit(IOObject io_object, IOContext io_context,
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
@@ -2148,9 +2148,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it. Otherwise
@@ -3043,7 +3043,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -3189,7 +3189,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4601,7 +4601,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state.
@@ -5796,7 +5796,7 @@ MarkSharedBufferDirtyHint(Buffer buffer, BufferDesc *bufHdr, uint64 lockstate,
UnlockBufHdr(bufHdr);
}
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 396da84b25c..851b99056d5 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -218,7 +218,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -479,7 +479,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -510,7 +510,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u64(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c4afe4d368a..8b501dfcadd 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -475,13 +475,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -549,13 +549,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 2be26e92283..e3829d7fe7c 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b11d64633b5..d4769f3da7b 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -153,4 +153,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += (val); \
+ } while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += (val); \
+ } while(0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
[application/octet-stream] v14-0001-instrumentation-Separate-trigger-logic-from-othe.patch (13.9K, 6-v14-0001-instrumentation-Separate-trigger-logic-from-othe.patch)
From fe51fb92449880cbfa507f77b2e81d3586b61bd7 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v14 01/10] instrumentation: Separate trigger logic from other
uses
Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.
In passing, drop the "n" argument to InstrAlloc, as all remaining callers
need exactly one Instrumentation struct.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com
---
contrib/auto_explain/auto_explain.c | 2 +-
.../pg_stat_statements/pg_stat_statements.c | 2 +-
src/backend/commands/explain.c | 20 ++++----
src/backend/commands/trigger.c | 22 ++++-----
src/backend/executor/execMain.c | 2 +-
src/backend/executor/execProcnode.c | 2 +-
src/backend/executor/instrument.c | 48 +++++++++++++------
src/include/executor/instrument.h | 15 +++++-
src/include/nodes/execnodes.h | 3 +-
src/tools/pgindent/typedefs.list | 1 +
10 files changed, 74 insertions(+), 43 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index e856cd35a6f..5f5c1ff0da3 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL, false);
MemoryContextSwitchTo(oldcxt);
}
}
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 5494d41dca1..ddbd5727ddf 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1025,7 +1025,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
MemoryContext oldcxt;
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+ queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL, false);
MemoryContextSwitchTo(oldcxt);
}
}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e4b70166b0e..6b7b23049ca 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1101,18 +1101,18 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
for (nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
{
Trigger *trig = rInfo->ri_TrigDesc->triggers + nt;
- Instrumentation *instr = rInfo->ri_TrigInstrument + nt;
+ TriggerInstrumentation *tginstr = rInfo->ri_TrigInstrument + nt;
char *relname;
char *conname = NULL;
- /* Must clean up instrumentation state */
- InstrEndLoop(instr);
+ /* Ensure total timing is updated from the internal counter */
+ InstrEndLoop(&tginstr->instr);
/*
* We ignore triggers that were never invoked; they likely aren't
* relevant to the current query type.
*/
- if (instr->ntuples == 0)
+ if (tginstr->firings == 0)
continue;
ExplainOpenGroup("Trigger", NULL, true, es);
@@ -1137,11 +1137,11 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
if (show_relname)
appendStringInfo(es->str, " on %s", relname);
if (es->timing)
- appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
- INSTR_TIME_GET_MILLISEC(instr->total),
- instr->ntuples);
+ appendStringInfo(es->str, ": time=%.3f calls=%d\n",
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total),
+ tginstr->firings);
else
- appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
+ appendStringInfo(es->str, ": calls=%d\n", tginstr->firings);
}
else
{
@@ -1151,9 +1151,9 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
ExplainPropertyText("Relation", relname, es);
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- INSTR_TIME_GET_MILLISEC(instr->total), 3,
+ INSTR_TIME_GET_MILLISEC(tginstr->instr.total), 3,
es);
- ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
+ ExplainPropertyInteger("Calls", NULL, tginstr->firings, es);
}
if (conname)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 90e94fb8a5a..4d4e96a5302 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -92,7 +92,7 @@ static bool TriggerEnabled(EState *estate, ResultRelInfo *relinfo,
static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2311,7 +2311,7 @@ static HeapTuple
ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2346,7 +2346,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2391,10 +2391,10 @@ ExecCallTriggerFunc(TriggerData *trigdata,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
return (HeapTuple) DatumGetPointer(result);
}
@@ -3947,7 +3947,7 @@ static void AfterTriggerExecute(EState *estate,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
FmgrInfo *finfo,
- Instrumentation *instr,
+ TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2);
@@ -4342,7 +4342,7 @@ AfterTriggerExecute(EState *estate,
ResultRelInfo *src_relInfo,
ResultRelInfo *dst_relInfo,
TriggerDesc *trigdesc,
- FmgrInfo *finfo, Instrumentation *instr,
+ FmgrInfo *finfo, TriggerInstrumentation *instr,
MemoryContext per_tuple_context,
TupleTableSlot *trig_tuple_slot1,
TupleTableSlot *trig_tuple_slot2)
@@ -4383,7 +4383,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartNode(instr + tgindx);
+ InstrStartTrigger(instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4600,10 +4600,10 @@ AfterTriggerExecute(EState *estate,
/*
* If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
- * one "tuple returned" (really the number of firings).
+ * the firing of the trigger.
*/
if (instr)
- InstrStopNode(instr + tgindx, 1);
+ InstrStopTrigger(instr + tgindx, 1);
}
@@ -4719,7 +4719,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
Relation rel = NULL;
TriggerDesc *trigdesc = NULL;
FmgrInfo *finfo = NULL;
- Instrumentation *instr = NULL;
+ TriggerInstrumentation *instr = NULL;
TupleTableSlot *slot1 = NULL,
*slot2 = NULL;
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 45e00c6af85..0237d8c3b1d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1285,7 +1285,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
}
else
{
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index d35976925ae..a2047e4dbc6 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -414,7 +414,7 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAlloc(1, estate->es_instrument,
+ result->instrument = InstrAlloc(estate->es_instrument,
result->async_capable);
return result;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index a40610bc252..abb391c114c 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,28 +26,20 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-/* Allocate new instrumentation structure(s) */
+/* Allocate new instrumentation structure */
Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int instrument_options, bool async_mode)
{
Instrumentation *instr;
/* initialize all fields to zeroes, then modify as needed */
- instr = palloc0(n * sizeof(Instrumentation));
+ instr = palloc0(sizeof(Instrumentation));
if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
{
- bool need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- bool need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
- bool need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
- int i;
-
- for (i = 0; i < n; i++)
- {
- instr[i].need_bufusage = need_buffers;
- instr[i].need_walusage = need_wal;
- instr[i].need_timer = need_timer;
- instr[i].async_mode = async_mode;
- }
+ instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
+ instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
+ instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
+ instr->async_mode = async_mode;
}
return instr;
@@ -196,6 +188,32 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
WalUsageAdd(&dst->walusage, &add->walusage);
}
+/* Trigger instrumentation handling */
+TriggerInstrumentation *
+InstrAllocTrigger(int n, int instrument_options)
+{
+ TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ int i;
+
+ for (i = 0; i < n; i++)
+ InstrInit(&tginstr[i].instr, instrument_options);
+
+ return tginstr;
+}
+
+void
+InstrStartTrigger(TriggerInstrumentation *tginstr)
+{
+ InstrStartNode(&tginstr->instr);
+}
+
+void
+InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
+{
+ InstrStopNode(&tginstr->instr, 0);
+ tginstr->firings += firings;
+}
+
/* note current values during parallel executor startup */
void
InstrStartParallelQuery(void)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..056287739f2 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -100,17 +100,28 @@ typedef struct WorkerInstrumentation
Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
} WorkerInstrumentation;
+typedef struct TriggerInstrumentation
+{
+ Instrumentation instr;
+ int firings; /* number of times the instrumented trigger
+ * was fired */
+} TriggerInstrumentation;
+
extern PGDLLIMPORT BufferUsage pgBufferUsage;
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
- bool async_mode);
+extern Instrumentation *InstrAlloc(int instrument_options, bool async_mode);
extern void InstrInit(Instrumentation *instr, int instrument_options);
extern void InstrStartNode(Instrumentation *instr);
extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+
+extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
+extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
+
extern void InstrStartParallelQuery(void);
extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 090cfccf65f..908898aa7c9 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -60,6 +60,7 @@ typedef struct ScanKeyData ScanKeyData;
typedef struct SnapshotData *Snapshot;
typedef struct SortSupportData *SortSupport;
typedef struct TIDBitmap TIDBitmap;
+typedef struct TriggerInstrumentation TriggerInstrumentation;
typedef struct TupleConversionMap TupleConversionMap;
typedef struct TupleDescData *TupleDesc;
typedef struct Tuplesortstate Tuplesortstate;
@@ -552,7 +553,7 @@ typedef struct ResultRelInfo
ExprState **ri_TrigWhenExprs;
/* optional runtime measurements for triggers */
- Instrumentation *ri_TrigInstrument;
+ TriggerInstrumentation *ri_TrigInstrument;
/* On-demand created slots for triggers / returning processing */
TupleTableSlot *ri_ReturningSlot; /* for trigger output tuples */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 0c5493bd47f..6a328fceaee 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3215,6 +3215,7 @@ TriggerDesc
TriggerEvent
TriggerFlags
TriggerInfo
+TriggerInstrumentation
TriggerTransition
TruncateStmt
TsmRoutine
--
2.47.1
[application/octet-stream] v14-0006-Optimize-measuring-WAL-buffer-usage-through-stac.patch (89.6K, 7-v14-0006-Optimize-measuring-WAL-buffer-usage-through-stac.patch)
From 9fe24ed6519322e08789eb31fc1f6fcc49822bc6 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v14 06/10] Optimize measuring WAL/buffer usage through
stack-based instrumentation
Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that get
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode, and popped at InstrStopNode. Stack entries are zero
initialized (avoiding the diff mechanism) and get added to their parent
entry when they are finalized, i.e. no more modifications can occur.
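To illustrate the idea in the paragraph above, here is a minimal standalone sketch. It is not code from this patch: every name in it (DemoUsage, DemoEntry, demo_push, demo_pop, demo_finalize, DEMO_USAGE_INCR) is hypothetical, and the counter struct is cut down to two fields. It only shows the shape of the technique: activity is counted through a single "current entry" pointer, entries start zeroed so no diff is computed on exit, and a child is folded into its parent once it is finalized.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced counter set; the real BufferUsage has more fields. */
typedef struct DemoUsage
{
	int64_t		blks_hit;
	int64_t		blks_read;
} DemoUsage;

/* One stack entry, e.g. one per executor node in this sketch. */
typedef struct DemoEntry
{
	DemoUsage	usage;			/* starts zeroed, so no diff is needed later */
	struct DemoEntry *parent;	/* entry that was current when we were pushed */
} DemoEntry;

static DemoEntry demo_top;		/* catch-all entry at the bottom of the stack */
static DemoEntry *demo_current = &demo_top;

/* What the I/O code would call: one increment through one pointer. */
#define DEMO_USAGE_INCR(fld) do { \
		demo_current->usage.fld++; \
	} while (0)

/* Entering a node: make its entry the target of all increments. */
static void
demo_push(DemoEntry *entry)
{
	entry->parent = demo_current;
	demo_current = entry;
}

/* Leaving a node: restore whatever was current before. */
static void
demo_pop(DemoEntry *entry)
{
	assert(demo_current == entry);
	demo_current = entry->parent;
}

/* Once a child can no longer change, add its totals to the parent. */
static void
demo_finalize(const DemoEntry *child, DemoEntry *parent)
{
	parent->usage.blks_hit += child->usage.blks_hit;
	parent->usage.blks_read += child->usage.blks_read;
}
```

In this sketch a node can be pushed and popped many times before it is finalized once, so the per-cycle cost is two pointer assignments rather than a field-by-field subtraction.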
To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct, and its
associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.
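The abort problem described above can be shown with a small standalone sketch. It uses plain setjmp/longjmp as a stand-in for PG_TRY/PG_FINALLY and does not use any PostgreSQL API; all names (GuardFrame, guard_push, guard_pop, guard_run, and the two work functions) are hypothetical. The point is only that an error thrown between push and pop skips the pop, so whoever catches the error has to put the "current entry" pointer back to a known-good value.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stack entry with a single counter. */
typedef struct GuardFrame
{
	struct GuardFrame *parent;
	int			hits;
} GuardFrame;

static GuardFrame guard_root;
static GuardFrame *guard_cur = &guard_root;
static jmp_buf guard_handler;	/* stand-in for the error handler stack */

static void
guard_push(GuardFrame *frame)
{
	assert(frame != NULL);
	frame->parent = guard_cur;
	guard_cur = frame;
}

static void
guard_pop(GuardFrame *frame)
{
	assert(guard_cur == frame);
	guard_cur = frame->parent;
}

/* Normal case: push and pop are balanced. */
static void
guard_ok_work(void)
{
	GuardFrame	inner = {0};

	guard_push(&inner);
	guard_cur->hits++;
	guard_pop(&inner);
}

/* Error case: the jump skips guard_pop(), leaving guard_cur dangling. */
static void
guard_failing_work(void)
{
	GuardFrame	inner = {0};

	guard_push(&inner);
	guard_cur->hits++;
	longjmp(guard_handler, 1);
}

/* Runs work(); returns 1 if it "errored". The stack is restored either way. */
static int
guard_run(void (*work) (void))
{
	GuardFrame *saved = guard_cur;

	if (setjmp(guard_handler) != 0)
	{
		guard_cur = saved;		/* the "finally" step after an abort */
		return 1;
	}
	work();
	guard_cur = saved;			/* same step on the normal path */
	return 0;
}
```

Without the restore in guard_run, guard_cur would keep pointing at a local variable of a function that has already been unwound, and the next increment would write into dead stack memory.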
In tests, the stack-based instrumentation mechanism reduces the overhead
of EXPLAIN (ANALYZE, BUFFERS ON, TIMING OFF) for a large COUNT(*) query
from about 50% to 22% on top of the actual runtime.
This also drops the global pgBufferUsage; any callers interested in
measuring buffer activity should instead utilize InstrStart/InstrStop.
The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and heap_page_prune_and_freeze for
measuring FPIs.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
contrib/auto_explain/auto_explain.c | 16 +-
.../pg_stat_statements/pg_stat_statements.c | 24 +-
src/backend/access/brin/brin.c | 10 +-
src/backend/access/gin/gininsert.c | 10 +-
src/backend/access/heap/vacuumlazy.c | 12 +-
src/backend/access/nbtree/nbtsort.c | 10 +-
src/backend/commands/analyze.c | 12 +-
src/backend/commands/explain.c | 10 +-
src/backend/commands/explain_dr.c | 6 +-
src/backend/commands/prepare.c | 10 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/trigger.c | 17 +-
src/backend/commands/vacuumparallel.c | 10 +-
src/backend/executor/README.instrument | 237 +++++++++
src/backend/executor/execMain.c | 84 +++-
src/backend/executor/execParallel.c | 36 +-
src/backend/executor/execPartition.c | 2 +-
src/backend/executor/execProcnode.c | 103 +++-
src/backend/executor/execUtils.c | 11 +-
src/backend/executor/instrument.c | 468 ++++++++++++++----
src/backend/replication/logical/worker.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 6 +-
src/backend/utils/activity/pgstat_io.c | 6 +-
src/include/executor/execdesc.h | 4 +-
src/include/executor/executor.h | 5 +-
src/include/executor/instrument.h | 201 +++++++-
src/include/nodes/execnodes.h | 3 +-
src/include/utils/resowner.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
29 files changed, 1084 insertions(+), 236 deletions(-)
create mode 100644 src/backend/executor/README.instrument
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 39bf2543b70..4be81489ff4 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -305,19 +305,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
if (auto_explain_enabled())
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -382,7 +372,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
/* Log plan if duration is exceeded. */
- msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
+ msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total);
if (msec >= auto_explain_log_min_duration)
{
ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 63975706b87..78f1518c940 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -929,7 +929,7 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
- InstrStop(&instr);
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
@@ -994,19 +994,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
*/
if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
+ /* Set up to track total elapsed time in ExecutorRun. */
if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
}
}
@@ -1068,10 +1058,10 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
queryDesc->plannedstmt->stmt_location,
queryDesc->plannedstmt->stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
+ INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total),
queryDesc->estate->es_total_processed,
- &queryDesc->totaltime->bufusage,
- &queryDesc->totaltime->walusage,
+ &queryDesc->totaltime->instr.bufusage,
+ &queryDesc->totaltime->instr.walusage,
queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
NULL,
queryDesc->estate->es_parallel_workers_to_launch,
@@ -1155,7 +1145,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
- InstrStop(&instr);
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index bdb30752e09..3a5176c76c7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2887,6 +2887,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2936,7 +2937,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2951,7 +2952,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 9d83a495775..0d80f72a0b0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -991,8 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2118,6 +2118,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2186,7 +2187,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2201,7 +2202,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 30f589c9207..291d9d67bc2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -637,7 +637,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- Instrumentation *instr = NULL;
+ QueryInstrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -653,8 +653,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
- instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
- InstrStart(instr);
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -985,7 +985,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
- InstrStop(instr);
+ InstrQueryStopFinalize(instr);
if (verbose || params->log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
@@ -1001,8 +1001,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
- BufferUsage bufferusage = instr->bufusage;
- WalUsage walusage = instr->walusage;
+ BufferUsage bufferusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 756dfa3dcf4..2d7b7cef912 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1466,8 +1466,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -1753,6 +1753,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1828,7 +1829,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1838,7 +1839,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8472fc0c280..10f8a2dc81c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -309,7 +309,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- Instrumentation *instr = NULL;
+ QueryInstrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -361,8 +361,8 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
pg_rusage_init(&ru0);
- instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
- InstrStart(instr);
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -743,7 +743,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
- InstrStop(instr);
+ InstrQueryStopFinalize(instr);
if (verbose || params->log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
@@ -757,8 +757,8 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
- BufferUsage bufusage = instr->bufusage;
- WalUsage walusage = instr->walusage;
+ BufferUsage bufusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
total_blks_hit = bufusage.shared_blks_hit +
bufusage.local_blks_hit;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 2afe858c441..a992dde6b8a 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -324,7 +324,7 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- Instrumentation plan_instr = {0};
+ QueryInstrumentation *plan_instr = NULL;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -333,7 +333,7 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
if (es->buffers)
instrument_options |= INSTRUMENT_BUFFERS;
- InstrInitOptions(&plan_instr, instrument_options);
+ plan_instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -351,12 +351,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- InstrStart(&plan_instr);
+ InstrQueryStart(plan_instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- InstrStop(&plan_instr);
+ InstrQueryStopFinalize(plan_instr);
if (es->memory)
{
@@ -366,7 +366,7 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
+ &plan_instr->instr.total, (es->buffers ? &plan_instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 34fe4f8f6dd..9c1b30fb75b 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -113,7 +113,7 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
Instrumentation *instr = &myState->metrics.instr;
/* only measure time, buffers if requested */
- if (instr->need_timer || instr->need_bufusage)
+ if (instr->need_timer || instr->need_stack)
InstrStart(instr);
/* Set or update my derived attribute info, if needed */
@@ -183,7 +183,7 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextReset(myState->tmpcontext);
/* Stop per-tuple measurement */
- if (instr->need_timer || instr->need_bufusage)
+ if (instr->need_timer || instr->need_stack)
InstrStop(instr);
return true;
@@ -241,6 +241,8 @@ serializeAnalyzeShutdown(DestReceiver *self)
{
SerializeDestReceiver *receiver = (SerializeDestReceiver *) self;
+ InstrFinalizeChild(&receiver->metrics.instr, instr_stack.current);
+
if (receiver->finfos)
pfree(receiver->finfos);
receiver->finfos = NULL;
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index bf9f2eb6149..ee811357588 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -581,7 +581,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- Instrumentation plan_instr = {0};
+ QueryInstrumentation *plan_instr = NULL;
int instrument_options = INSTRUMENT_TIMER;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
@@ -590,7 +590,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (es->buffers)
instrument_options |= INSTRUMENT_BUFFERS;
- InstrInitOptions(&plan_instr, instrument_options);
+ plan_instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -602,7 +602,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- InstrStart(&plan_instr);
+ InstrQueryStart(plan_instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -637,7 +637,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- InstrStop(&plan_instr);
+ InstrQueryStopFinalize(plan_instr);
if (es->memory)
{
@@ -654,7 +654,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
+ &plan_instr->instr.total, (es->buffers ? &plan_instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0ce2e81f9c2..f72c1ac521a 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2139,7 +2139,7 @@ ExecuteTruncateGuts(List *explicit_rels,
rel,
0, /* dummy rangetable index */
NULL,
- 0);
+ NULL);
estate->es_opened_result_relations =
lappend(estate->es_opened_result_relations, resultRelInfo);
resultRelInfo++;
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 4d4e96a5302..b8b8840345b 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -93,6 +93,7 @@ static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
TriggerInstrumentation *instr,
+ QueryInstrumentation *qinstr,
MemoryContext per_tuple_context);
static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
ResultRelInfo *src_partinfo,
@@ -2312,6 +2313,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
int tgindx,
FmgrInfo *finfo,
TriggerInstrumentation *instr,
+ QueryInstrumentation *qinstr,
MemoryContext per_tuple_context)
{
LOCAL_FCINFO(fcinfo, 0);
@@ -2346,7 +2348,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
* If doing EXPLAIN ANALYZE, start charging time to this trigger.
*/
if (instr)
- InstrStartTrigger(instr + tgindx);
+ InstrStartTrigger(qinstr, instr + tgindx);
/*
* Do the function evaluation in the per-tuple memory context, so that
@@ -2441,6 +2443,7 @@ ExecBSInsertTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -2502,6 +2505,7 @@ ExecBRInsertTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2606,6 +2610,7 @@ ExecIRInsertTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2670,6 +2675,7 @@ ExecBSDeleteTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -2780,6 +2786,7 @@ ExecBRDeleteTriggers(EState *estate, EPQState *epqstate,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -2884,6 +2891,7 @@ ExecIRDeleteTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (rettuple == NULL)
return false; /* Delete was suppressed */
@@ -2942,6 +2950,7 @@ ExecBSUpdateTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -3094,6 +3103,7 @@ ExecBRUpdateTriggers(EState *estate, EPQState *epqstate,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
@@ -3258,6 +3268,7 @@ ExecIRUpdateTriggers(EState *estate, ResultRelInfo *relinfo,
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple == NULL)
{
@@ -3316,6 +3327,7 @@ ExecBSTruncateTriggers(EState *estate, ResultRelInfo *relinfo)
i,
relinfo->ri_TrigFunctions,
relinfo->ri_TrigInstrument,
+ estate->es_instrument,
GetPerTupleMemoryContext(estate));
if (newtuple)
@@ -4383,7 +4395,7 @@ AfterTriggerExecute(EState *estate,
* to include time spent re-fetching tuples in the trigger cost.
*/
if (instr)
- InstrStartTrigger(instr + tgindx);
+ InstrStartTrigger(estate->es_instrument, instr + tgindx);
/*
* Fetch the required tuple(s).
@@ -4571,6 +4583,7 @@ AfterTriggerExecute(EState *estate,
tgindx,
finfo,
NULL,
+ NULL,
per_tuple_context);
if (rettuple != NULL &&
rettuple != LocTriggerData.tg_trigtuple &&
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 77834b96a21..c330c891c03 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -308,8 +308,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgBufferUsage or
- * pgWalUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1006,6 +1006,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1095,7 +1096,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1103,7 +1104,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/README.instrument b/src/backend/executor/README.instrument
new file mode 100644
index 00000000000..7df837dbc77
--- /dev/null
+++ b/src/backend/executor/README.instrument
@@ -0,0 +1,237 @@
+src/backend/executor/README.instrument
+
+Instrumentation
+===============
+
+The instrumentation subsystem measures time, buffer usage and WAL activity
+during query execution and other similar activities. It is used by
+EXPLAIN ANALYZE, pg_stat_statements, and other consumers that need
+activity and/or timing metrics over a section of code.
+
+The design has two central goals:
+
+* Make it cheap to measure activity in a section of code, even when
+ that section is called many times and the aggregate is what is used
+ (as is the case with per-node instrumentation in the executor)
+
+* Ensure nested instrumentation accurately measures activity/timing,
+ even when execution is aborted due to errors being thrown.
+
+The key data structures are defined in src/include/executor/instrument.h
+and the implementation lives in src/backend/executor/instrument.c.
+
+
+Instrumentation Options
+-----------------------
+
+Callers specify what to measure with a bitmask of InstrumentOption flags:
+
+ INSTRUMENT_ROWS -- row counts only (used with NodeInstrumentation)
+ INSTRUMENT_TIMER -- wall-clock timing and row counts
+ INSTRUMENT_BUFFERS -- buffer hit/read/dirtied/written counts and I/O time
+ INSTRUMENT_WAL -- WAL records, FPI, bytes
+
+INSTRUMENT_BUFFERS and INSTRUMENT_WAL utilize the instrumentation stack
+(described below) for efficient handling of counter values.
+
+
+Struct Hierarchy
+----------------
+
+There are four instrumentation structs, each specialized for a
+different scope:
+
+Instrumentation Base struct. Holds timing and buffer/WAL counters.
+
+QueryInstrumentation Extends Instrumentation for query-level tracking. When
+ stack-based tracking is enabled, it owns a dedicated
+ MemoryContext and uses the ResourceOwner mechanism for
+ abort cleanup.
+
+NodeInstrumentation Extends Instrumentation for per-plan-node statistics
+ (startup time, tuple counts, loop counts, etc).
+
+TriggerInstrumentation Extends Instrumentation with a firing count.
+
+
+Stack-based instrumentation
+===========================
+
+For tracking WAL or buffer usage counters, the specialized stack-based
+instrumentation is used.
+
+A simple approach to measuring buffer/WAL activity in a code section could be
+to have a set of global counters, snapshot all the counters at the start, and
+diff them at the end. But this is expensive in practice: BufferUsage alone
+has many fields, and the diff must be computed for every InstrStartNode /
+InstrStopNode cycle.
+
+An alternative is to write counter updates directly into the struct that
+should receive them, avoiding the diff. But that has two complications: low-level
+code, such as the buffer manager, has no direct pointers to higher-level
+structs, such as plan nodes tracking buffer usage. And instrumentation is often
+nested: we might be interested in both the aggregate buffer usage of a query and
+the individual per-node details. Stack-based instrumentation solves both problems:
+
+At all times, there is a stack that tracks which Instrumentation is currently
+active. The stack is represented by instr_stack, a per-backend global
+that holds a dynamic array of Instrumentation pointers. The field
+instr_stack.current always points to the current stack entry that should
+be updated when activity occurs. When the stack array is empty,
+instr_stack.current points to instr_top.
+
+For example, if a backend has two portals open, the overall nesting of
+Instrumentation and their respective InstrStart/InstrStop calls creates a
+tree-like structure like this:
+
+ Session (instr_top)
+ |
+ +-- Query A (QueryInstrumentation)
+ | |
+ | +-- NestLoop (NodeInstrumentation)
+ | |
+ | +-- Seq Scan A (NodeInstrumentation)
+ | +-- Seq Scan B (NodeInstrumentation)
+ |
+ +-- Query B (QueryInstrumentation)
+ |
+ +-- Seq Scan C (NodeInstrumentation)
+
+While executing Seq Scan B, the stack looks like:
+
+ instr_top (implicit bottom, not in the entries array)
+ 0: Query A
+ 1: NestLoop
+ 2: Seq Scan B <-- instr_stack.current
+
+When no query is running, the stack is empty (stack_size == 0) and
+instr_stack.current points to instr_top.
+
+Any buffer or WAL counter update (via the INSTR_BUFUSAGE_* and
+INSTR_WALUSAGE_* macros in the buffer manager, WAL insertion code, etc.)
+writes directly into instr_stack.current. Each instrumentation node starts
+zeroed, so the values it accumulates while on top of the stack represent
+exactly the activity that occurred during that time.
+
+Every Instrumentation node (except for instr_top) has a target, or parent, it
+will be accumulated into, which is typically the Instrumentation that was the
+current stack entry when it was created.
+
+For example, when Seq Scan A is finalized during regular execution via ExecutorFinish,
+its instrumentation data is added to its immediate parent in the execution
+tree, the NestLoop. The NestLoop's data is in turn added to Query A's
+QueryInstrumentation, which then accumulates to its own parent.
+
+While we can typically think of this as a tree, the NodeInstrumentation
+underneath a particular QueryInstrumentation could behave differently --
+for example, it could propagate directly to the QueryInstrumentation, in
+order to not show cumulative numbers in EXPLAIN ANALYZE.
+
+Note these relationships are partially implicit, especially when it comes
+to NodeInstrumentation. Each QueryInstrumentation maintains a list of its
+unfinalized child nodes. The parent of a QueryInstrumentation itself is
+determined by the stack (see below): when a query is finalized or cleaned
+up on abort, its counters are accumulated to whatever entry is then current
+on the stack, which is typically instr_top.
+
+
+Finalization and Abort Safety
+=============================
+
+Finalization is the process of rolling up a node's buffer/WAL counters to
+its parent. In normal execution, nodes are pushed onto the stack when they
+start and popped when they stop; at finalization time their accumulated
+counters are added to the parent.
+
+Due to the use of longjmp for error handling, functions can exit abruptly
+without executing their normal cleanup code. On abort, two things need
+to happen:
+
+1. The stack is reset to the level saved at the start of the aborting
+ (sub-)transaction. This ensures that we don't later try to update
+ counters on a freed stack entry. We also need to ensure that the stack
+ entry that was current before a particular Instrumentation started is
+ current again after it stops.
+
+2. Finalize all affected Instrumentation nodes, rolling up their counters
+ to the innermost surviving Instrumentation, so that data is not lost.
+
+For example, if Seq Scan B aborts while the stack is:
+
+ instr_top (implicit bottom)
+ 0: Query A
+ 1: NestLoop
+ 2: Seq Scan B
+
+The abort handler for Query A accumulates all unfinalized children (Seq
+Scan A, Seq Scan B, NestLoop) directly into Query A's counters, then
+unwinds the instrumentation stack and accumulates Query A's counters to
+instr_top.
+
+Note that on abort the children do not accumulate through each other (Seq
+Scan B -> NestLoop -> Query A); they all accumulate directly to their
+parent QueryInstrumentation. This means the order in which children are
+released does not matter -- this is important because ResourceOwner cleanup
+does not guarantee a particular release order. The per-node breakdown is lost,
+but the instrumentation active when the query was started (instr_top in the
+above example) survives the abort, and its counters include the activity.
+
+If multiple QueryInstrumentations are active on the stack (e.g. nested
+portals), the abort handler of each uses InstrStopFinalize() to accumulate
+its statistics to the parent of the entry being released or, if an entry
+higher up in the stack was released earlier, to the parent of that entry.
+The handlers therefore compose correctly regardless of release order.
+
+There are two mechanisms for achieving abort safety:
+
+* Resource Owner (QueryInstrumentation): registers with the current
+ ResourceOwner at start. On transaction abort, the resource owner system
+ calls the release callback, which walks unfinalized child entries,
+ accumulates their data, unwinds the stack, and destroys the dedicated
+ memory context (freeing the QueryInstrumentation and all child
+ allocations as a unit). This is the recommended approach when the
+ instrumented code already has an appropriate resource owner (e.g. it
+ runs inside a portal). The query executor uses this path.
+
+* PG_FINALLY (base Instrumentation): when no suitable resource owner
+ exists, or when the caller wants to inspect the instrumentation data
+ even after an error, the base Instrumentation can be used with a
+ PG_TRY/PG_FINALLY block that calls InstrStopFinalize().
+
+Both mechanisms add overhead, so neither is suitable for high-frequency
+instrumentation like per-node measurements in the executor. Instead,
+plan node and trigger children rely on their parent QueryInstrumentation
+for abort safety: they are allocated in the parent's memory context and
+registered in its unfinalized-entries list, so the parent's abort handler
+recovers their data automatically. In normal execution, children are
+finalized explicitly by the caller.
+
+Parallel Query
+--------------
+
+Parallel workers get their own QueryInstrumentation so they can measure
+buffer and WAL activity independently, then copy the totals into dynamic
+shared memory at worker shutdown. The leader accumulates these into its
+own stack.
+
+When per-node instrumentation is active, parallel workers skip per-node
+finalization at shutdown to avoid double-counting; the per-node data is
+aggregated separately through InstrAggNode().
+
+
+Memory Handling
+===============
+
+Instrumentation objects that use the stack must survive until finalization
+runs, including the abort case. To ensure this, QueryInstrumentation
+creates a dedicated "Instrumentation" MemoryContext (instr_cxt) as a child
+of TopMemoryContext. All child instrumentation (nodes, triggers) should be
+allocated in this context.
+
+On successful completion, instr_cxt is reparented to CurrentMemoryContext
+so its lifetime is tied to the caller's context. On abort, the
+ResourceOwner cleanup first resets the stack, then accumulates the
+instrumentation data to the current stack entry, and finally frees instr_cxt.
+
+When the stack is not needed (timer/rows only), Instrumentation allocations
+happen in CurrentMemoryContext instead of TopMemoryContext.
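To make the README's stack description concrete, here is a minimal, self-contained sketch of the idea in plain C. It is not code from the patch: ToyInstr, toy_push, toy_pop, toy_count_hit, toy_finalize and toy_stack_demo are hypothetical names, a single counter stands in for BufferUsage/WalUsage, and a fixed-size array stands in for the dynamic one. It only illustrates that counter updates go to the current entry and that finalization rolls a child up into its parent.

```c
#include <assert.h>

/* Hypothetical stand-in for Instrumentation: one counter only. */
typedef struct ToyInstr
{
	long		shared_hit;
} ToyInstr;

#define TOY_MAX_DEPTH 16

static ToyInstr toy_top;		/* plays the role of instr_top */
static ToyInstr *toy_entries[TOY_MAX_DEPTH];
static int	toy_depth = 0;
static ToyInstr *toy_current = &toy_top;	/* like instr_stack.current */

/* Entering an instrumented section: make it the current entry. */
static void
toy_push(ToyInstr *instr)
{
	assert(toy_depth < TOY_MAX_DEPTH);
	toy_entries[toy_depth++] = instr;
	toy_current = instr;
}

/* Leaving the section: restore the previous entry, or the implicit top. */
static void
toy_pop(void)
{
	assert(toy_depth > 0);
	toy_depth--;
	toy_current = (toy_depth > 0) ? toy_entries[toy_depth - 1] : &toy_top;
}

/* A low-level counter update is a single increment, with no diffing. */
static void
toy_count_hit(void)
{
	toy_current->shared_hit++;
}

/* Finalization: add the child's counters to its parent. */
static void
toy_finalize(ToyInstr *child, ToyInstr *parent)
{
	parent->shared_hit += child->shared_hit;
}

/* Query -> NestLoop -> Seq Scan, then bottom-up finalization. */
static long
toy_stack_demo(void)
{
	ToyInstr	query = {0};
	ToyInstr	nestloop = {0};
	ToyInstr	scan = {0};

	toy_top.shared_hit = 0;

	toy_push(&query);
	toy_count_hit();			/* charged to query */
	toy_push(&nestloop);
	toy_push(&scan);
	toy_count_hit();			/* charged to scan */
	toy_count_hit();			/* charged to scan */
	toy_pop();
	toy_count_hit();			/* charged to nestloop */
	toy_pop();
	toy_pop();

	/* Before finalization every entry holds only its own activity. */
	assert(scan.shared_hit == 2);
	assert(nestloop.shared_hit == 1);
	assert(query.shared_hit == 1);
	assert(toy_current == &toy_top);

	toy_finalize(&scan, &nestloop);
	toy_finalize(&nestloop, &query);
	toy_finalize(&query, &toy_top);

	return toy_top.shared_hit;
}
```

Calling toy_stack_demo() from a small main() returns the session-level total. Note that in this toy model the per-entry cost of a counter update is one pointer dereference and one increment, which is the property the README argues for; the real patch additionally has to handle aborts and memory lifetime, which this sketch leaves out.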
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b0f636bf8b6..d0cd34d286c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -78,6 +78,7 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
/* decls for local routines only used within this module */
static void InitPlan(QueryDesc *queryDesc, int eflags);
static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecFinalizeTriggerInstrumentation(EState *estate);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(QueryDesc *queryDesc,
@@ -247,9 +248,16 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
estate->es_top_eflags = eflags;
- estate->es_instrument = queryDesc->instrument_options;
estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
+ /*
+ * Set up per-node instrumentation if needed. We do this before InitPlan
+ * so that node and trigger instrumentation can be allocated within the
+ * query's dedicated instrumentation memory context.
+ */
+ if (!estate->es_instrument && queryDesc->instrument_options)
+ estate->es_instrument = InstrQueryAlloc(queryDesc->instrument_options);
+
/*
* Set up an AFTER-trigger statement context, unless told not to, or
* unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
@@ -331,9 +339,11 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
- /* Allow instrumentation of Executor overall runtime */
+ /* Start up instrumentation for this execution run */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ InstrQueryStart(queryDesc->totaltime);
+ if (estate->es_instrument)
+ InstrQueryStart(estate->es_instrument);
/*
* extract information from the query descriptor and the query feature.
@@ -384,8 +394,10 @@ standard_ExecutorRun(QueryDesc *queryDesc,
if (sendTuples)
dest->rShutdown(dest);
+ if (estate->es_instrument)
+ InstrQueryStop(estate->es_instrument);
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStop(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
}
@@ -435,7 +447,9 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
/* Allow instrumentation of Executor overall runtime */
if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ InstrQueryStart(queryDesc->totaltime);
+ if (estate->es_instrument)
+ InstrQueryStart(estate->es_instrument);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -444,8 +458,32 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
+ if (estate->es_instrument)
+ {
+ /*
+ * Accumulate per-node and trigger statistics to their respective
+ * parent instrumentation stacks.
+ *
+ * We skip this in parallel workers because their per-node stats are
+ * reported individually via ExecParallelReportInstrumentation, and
+ * the leader's own ExecFinalizeNodeInstrumentation handles
+ * propagation. If we accumulated here, the leader would
+ * double-count: worker parent nodes would already include their
+ * children's stats, and then the leader's accumulation would add the
+ * children again.
+ */
+ if (!IsParallelWorker())
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
+ InstrQueryStopFinalize(estate->es_instrument);
+ }
+
if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ InstrQueryStopFinalize(queryDesc->totaltime);
MemoryContextSwitchTo(oldcontext);
@@ -1263,7 +1301,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options)
+ QueryInstrumentation *qinstr)
{
MemSet(resultRelInfo, 0, sizeof(ResultRelInfo));
resultRelInfo->type = T_ResultRelInfo;
@@ -1284,8 +1322,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
palloc0_array(FmgrInfo, n);
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
- if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
+ if (qinstr)
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, n);
}
else
{
@@ -1358,6 +1396,10 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
* also provides a way for EXPLAIN ANALYZE to report the runtimes of such
* triggers.) So we make additional ResultRelInfo's as needed, and save them
* in es_trig_target_relations.
+ *
+ * Note: if new relation lists are searched here, they must also be added to
+ * ExecFinalizeTriggerInstrumentation so that trigger instrumentation data
+ * is properly accumulated.
*/
ResultRelInfo *
ExecGetTriggerResultRel(EState *estate, Oid relid,
@@ -1500,6 +1542,30 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
return resultRelInfo->ri_ancestorResultRels;
}
+static void
+ExecFinalizeTriggerInstrumentation(EState *estate)
+{
+ List *rels = NIL;
+
+ rels = list_concat(rels, estate->es_tuple_routing_result_relations);
+ rels = list_concat(rels, estate->es_opened_result_relations);
+ rels = list_concat(rels, estate->es_trig_target_relations);
+
+ foreach_node(ResultRelInfo, rInfo, rels)
+ {
+ TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
+
+ if (ti == NULL || rInfo->ri_TrigDesc == NULL)
+ continue;
+
+ for (int nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
+ {
+ if (ti[nt].instr.need_stack)
+ InstrAccumStack(&estate->es_instrument->instr, &ti[nt].instr);
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecPostprocessPlan
*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 78f60c1530c..c01e780f918 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -700,7 +700,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
- * looking at pgBufferUsage, so do it unconditionally.
+ * looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -825,13 +825,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
int i;
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
- instrumentation->instrument_options = estate->es_instrument;
+ instrumentation->instrument_options = estate->es_instrument->instrument_options;
instrumentation->instrument_offset = instrument_offset;
instrumentation->num_workers = nworkers;
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- InstrInitNode(&instrument[i], estate->es_instrument);
+ InstrInitNode(&instrument[i], estate->es_instrument->instrument_options);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1081,14 +1081,28 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
instrument = GetInstrumentationArray(instrumentation);
instrument += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
+ {
InstrAggNode(planstate->instrument, &instrument[n]);
+ /*
+ * Also add worker WAL usage to the global pgWalUsage counter.
+ *
+ * When per-node instrumentation is active, parallel workers skip
+ * ExecFinalizeNodeInstrumentation (to avoid double-counting in
+ * EXPLAIN), so per-node WAL activity is not rolled up into the
+ * query-level stats that InstrAccumParallelQuery receives. Without
+ * this, pgWalUsage would under-report WAL generated by parallel
+ * workers when instrumentation is active.
+ */
+ WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
+ }
+
/*
* Also store the per-worker detail.
*
- * Worker instrumentation should be allocated in the same context as the
- * regular instrumentation information, which is the per-query context.
- * Switch into per-query memory context.
+ * Ensure worker instrumentation is allocated in the per-query context. We
+ * don't need to place this in the instrumentation context since no more
+ * stack-based instrumentation work is being done.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
@@ -1238,9 +1252,13 @@ ExecParallelCleanup(ParallelExecutorInfo *pei)
{
/* Accumulate instrumentation, if any. */
if (pei->instrumentation)
+ {
ExecParallelRetrieveInstrumentation(pei->planstate,
pei->instrumentation);
+ ExecFinalizeWorkerInstrumentation(pei->planstate);
+ }
+
/* Accumulate JIT instrumentation, if any. */
if (pei->jit_instrumentation)
ExecParallelRetrieveJitInstrumentation(pei->planstate,
@@ -1462,6 +1480,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1522,7 +1541,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1538,7 +1557,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d96d4f9947b..6f2909a1bc3 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1381,7 +1381,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
{
ResultRelInfo *rri = makeNode(ResultRelInfo);
- InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
+ InitResultRelInfo(rri, rel, 0, rootResultRelInfo, NULL);
proute->nonleaf_partitions[dispatchidx] = rri;
}
else
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 132fe37ef60..3b3ec9850e8 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -123,6 +123,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
+static bool ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -788,10 +790,10 @@ ExecShutdownNode_walker(PlanState *node, void *context)
* at least once already. We don't expect much CPU consumption during
* node shutdown, but in the case of Gather or Gather Merge, we may shut
* down workers at this stage. If so, their buffer usage will get
- * propagated into pgBufferUsage at this point, and we want to make sure
- * that it gets associated with the Gather node. We skip this if the node
- * has never been executed, so as to avoid incorrectly making it appear
- * that it has.
+ * propagated into the current instrumentation stack entry at this point,
+ * and we want to make sure that it gets associated with the Gather node.
+ * We skip this if the node has never been executed, so as to avoid
+ * incorrectly making it appear that it has.
*/
if (node->instrument && node->instrument->running)
InstrStartNode(node->instrument);
@@ -829,6 +831,99 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecFinalizeNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation).
+ *
+ * This must run after the cleanup done by ExecShutdownNode, and not rely on any
+ * resources cleaned up by it. We also expect shutdown actions to have occurred,
+ * e.g. parallel worker instrumentation to have been added to the leader.
+ */
+void
+ExecFinalizeNodeInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeNodeInstrumentation_walker(node, instr_stack.current);
+}
+
+static bool
+ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ Instrumentation *parent = (Instrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ Assert(node->instrument != NULL);
+
+ /*
+ * Recurse into children first (bottom-up accumulation), and accumulate
+ * to this node's instrumentation as the parent context.
+ */
+ planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
+ &node->instrument->instr);
+
+ InstrFinalizeChild(&node->instrument->instr, parent);
+
+ return false;
+}
+
+/*
+ * ExecFinalizeWorkerInstrumentation
+ *
+ * Accumulate per-worker instrumentation stats from child nodes into their
+ * parents, mirroring what ExecFinalizeNodeInstrumentation does for the
+ * leader's own stats. Without this, per-worker buffer/WAL stats shown by
+ * EXPLAIN (ANALYZE, VERBOSE) would only reflect each node's own direct
+ * activity, not including children.
+ *
+ * This must run after ExecParallelRetrieveInstrumentation has populated
+ * worker_instrument for all nodes in the parallel subtree.
+ */
+void
+ExecFinalizeWorkerInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeWorkerInstrumentation_walker(node, NULL);
+}
+
+static bool
+ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context)
+{
+ PlanState *parent = (PlanState *) context;
+ int num_workers;
+
+ if (node == NULL)
+ return false;
+
+ /*
+ * Recurse into children first (bottom-up accumulation), passing this node
+ * as parent context if it has worker_instrument, otherwise pass through
+ * the previous parent.
+ */
+ planstate_tree_walker(node, ExecFinalizeWorkerInstrumentation_walker,
+ node->worker_instrument ? (void *) node : context);
+
+ if (!node->worker_instrument)
+ return false;
+
+ num_workers = node->worker_instrument->num_workers;
+
+ /* Accumulate this node's per-worker stats to parent's per-worker stats */
+ if (parent && parent->worker_instrument)
+ {
+ int parent_workers = parent->worker_instrument->num_workers;
+
+ for (int n = 0; n < Min(num_workers, parent_workers); n++)
+ InstrAccumStack(&parent->worker_instrument->instrument[n].instr,
+ &node->worker_instrument->instrument[n].instr);
+ }
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 1eb6b9f1f40..700764daf45 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -150,7 +150,7 @@ CreateExecutorState(void)
estate->es_total_processed = 0;
estate->es_top_eflags = 0;
- estate->es_instrument = 0;
+ estate->es_instrument = NULL;
estate->es_finished = false;
estate->es_exprcontexts = NIL;
@@ -227,6 +227,15 @@ FreeExecutorState(EState *estate)
estate->es_partition_directory = NULL;
}
+ /*
+ * Make sure the instrumentation context gets freed. This usually gets
+ * re-parented under the per-query context in InstrQueryStopFinalize, but
+ * that won't happen during EXPLAIN (BUFFERS) since ExecutorFinish never
+ * gets called, so we would otherwise leak it in TopMemoryContext.
+ */
+ if (estate->es_instrument && estate->es_instrument->instr.need_stack)
+ MemoryContextDelete(estate->es_instrument->instr_cxt);
+
/*
* Free the per-query memory context, thereby releasing all working
* memory, including the EState node itself.
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index e3d890a7f98..f9202b558d6 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,31 +16,53 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
-BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
-static WalUsage save_pgWalUsage;
+Instrumentation instr_top;
+InstrStackState instr_stack = {
+ .stack_space = 0,
+ .stack_size = 0,
+ .entries = NULL,
+ .current = &instr_top,
+};
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
-static void WalUsageAdd(WalUsage *dst, WalUsage *add);
+void
+InstrStackGrow(void)
+{
+ int space = instr_stack.stack_space;
+ Assert(instr_stack.stack_size >= instr_stack.stack_space);
+
+ if (instr_stack.entries == NULL)
+ {
+ space = 10; /* Allocate sufficient initial space for
+ * typical activity */
+ instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
+ sizeof(Instrumentation *) * space);
+ }
+ else
+ {
+ space *= 2;
+ instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, space);
+ }
+
+ /* Update stack space after allocation succeeded to protect against OOMs */
+ instr_stack.stack_space = space;
+}
/* General purpose instrumentation handling */
-Instrumentation *
-InstrAlloc(int instrument_options)
+static inline bool
+InstrNeedStack(int instrument_options)
{
- Instrumentation *instr = palloc0(sizeof(Instrumentation));
-
- InstrInitOptions(instr, instrument_options);
- return instr;
+ return (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) != 0;
}
void
InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
+ instr->need_stack = InstrNeedStack(instrument_options);
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
@@ -55,52 +77,309 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- /* save buffer usage totals at node entry, if needed */
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
+static void
+InstrStopTimer(Instrumentation *instr)
+{
+ instr_time endtime;
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ /* caller checked need_timer; a zero start time means InstrStart was skipped */
+ if (INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStop called without start");
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
}
void
InstrStop(Instrumentation *instr)
{
- instr_time endtime;
+ if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ if (instr->need_stack)
+ InstrPopStack(instr);
+}
+
+/*
+ * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
+ *
+ * Note that this intentionally allows passing a stack that is not the current
+ * top, as can happen with PG_FINALLY, or resource owners, which don't have a
+ * guaranteed cleanup order.
+ *
+ * We are careful here to achieve two goals:
+ *
+ * 1) Reset the stack to the parent of whichever of the released stack entries
+ * has the lowest index
+ * 2) Accumulate all instrumentation to the currently active instrumentation,
+ * so that callers get a complete picture of activity, even after an abort
+ */
+void
+InstrStopFinalize(Instrumentation *instr)
+{
+ if (instr->on_stack)
+ {
+ int idx = -1;
+
+ for (int i = instr_stack.stack_size - 1; i >= 0; i--)
+ {
+ if (instr_stack.entries[i] == instr)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ if (idx < 0)
+ elog(ERROR, "instrumentation entry not found on stack");
+
+ /* Clear on_stack for any intermediate entries we're skipping over */
+ for (int i = instr_stack.stack_size - 1; i > idx; i--)
+ instr_stack.entries[i]->on_stack = false;
+
+ while (instr_stack.stack_size > idx + 1)
+ instr_stack.stack_size--;
+
+ InstrPopStack(instr);
+ }
- /* let's update the time only if the timer was requested */
if (instr->need_timer)
+ InstrStopTimer(instr);
+
+ InstrAccumStack(instr_stack.current, instr);
+}
+
+/*
+ * Finalize child instrumentation by accumulating buffer/WAL usage to the
+ * provided instrumentation, which may be the current entry, or one the caller
+ * treats as a parent and will add to the totals later.
+ *
+ * Also deletes the unfinalized entry to avoid double counting in an abort
+ * situation, e.g. during executor finish.
+ */
+void
+InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent)
+{
+ if (instr->need_stack)
{
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ if (!dlist_node_is_detached(&instr->unfinalized_entry))
+ dlist_delete_thoroughly(&instr->unfinalized_entry);
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+ InstrAccumStack(parent, instr);
+ }
+}
+
+
+/* Query instrumentation handling */
+
+/*
+ * Use ResourceOwner mechanism to correctly reset instr_stack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+ .release_priority = RELEASE_PRIO_INSTRUMENTATION,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ QueryInstrumentation *qinstr = (QueryInstrumentation *) DatumGetPointer(res);
+ MemoryContext instr_cxt = qinstr->instr_cxt;
+ dlist_mutable_iter iter;
+
+ /* Accumulate data from all unfinalized child entries (nodes, triggers) */
+ dlist_foreach_modify(iter, &qinstr->unfinalized_entries)
+ {
+ Instrumentation *child = dlist_container(Instrumentation, unfinalized_entry, iter.cur);
- INSTR_TIME_SET_ZERO(instr->starttime);
+ InstrAccumStack(&qinstr->instr, child);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
+ /* Ensure the stack is reset as expected, and we accumulate to the parent */
+ InstrStopFinalize(&qinstr->instr);
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ /*
+ * Destroy the dedicated instrumentation context, which frees the
+ * QueryInstrumentation and all child allocations.
+ */
+ MemoryContextDelete(instr_cxt);
+}
+
+QueryInstrumentation *
+InstrQueryAlloc(int instrument_options)
+{
+ QueryInstrumentation *instr;
+ MemoryContext instr_cxt;
+
+ /*
+ * When the instrumentation stack is used, create a dedicated memory
+ * context for this query's instrumentation allocations. This context is a
+ * child of TopMemoryContext so it survives transaction abort —
+ * ResourceOwner release needs to access it.
+ *
+ * For simpler cases (timer/rows only), use the current memory context.
+ *
+ * All child instrumentation allocations (nodes, triggers, etc) must be
+ * allocated within this context to ensure correct clean up on abort.
+ */
+ if (InstrNeedStack(instrument_options))
+ instr_cxt = AllocSetContextCreate(TopMemoryContext,
+ "Instrumentation",
+ ALLOCSET_SMALL_SIZES);
+ else
+ instr_cxt = CurrentMemoryContext;
+
+ instr = MemoryContextAllocZero(instr_cxt, sizeof(QueryInstrumentation));
+ instr->instrument_options = instrument_options;
+ instr->instr_cxt = instr_cxt;
+
+ InstrInitOptions(&instr->instr, instrument_options);
+ dlist_init(&instr->unfinalized_entries);
+
+ return instr;
+}
+
+void
+InstrQueryStart(QueryInstrumentation *qinstr)
+{
+ InstrStart(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(CurrentResourceOwner != NULL);
+ qinstr->owner = CurrentResourceOwner;
+
+ ResourceOwnerEnlarge(qinstr->owner);
+ ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
+ }
+}
+
+void
+InstrQueryStop(QueryInstrumentation *qinstr)
+{
+ InstrStop(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+ }
+}
+
+void
+InstrQueryStopFinalize(QueryInstrumentation *qinstr)
+{
+ InstrStopFinalize(&qinstr->instr);
+
+ if (!qinstr->instr.need_stack)
+ {
+ Assert(qinstr->owner == NULL);
+ return;
+ }
+
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+
+ /*
+ * Reparent the dedicated instrumentation context under the current memory
+ * context, so that its lifetime is now tied to the caller's context
+ * rather than TopMemoryContext.
+ */
+ MemoryContextSetParent(qinstr->instr_cxt, CurrentMemoryContext);
+}
+
+/*
+ * Register a child Instrumentation entry for abort processing.
+ *
+ * On abort, ResOwnerReleaseInstrumentation will walk the parent's list to
+ * recover buffer/WAL data from entries that were never finalized, in order for
+ * aggregate totals to be accurate despite the query erroring out.
+ */
+void
+InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *child)
+{
+ if (child->need_stack)
+ dlist_push_head(&parent->unfinalized_entries, &child->unfinalized_entry);
+}
+
+/* start instrumentation during parallel executor startup */
+QueryInstrumentation *
+InstrStartParallelQuery(void)
+{
+ QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrQueryStart(qinstr);
+ return qinstr;
+}
+
+/* report usage after parallel executor shutdown */
+void
+InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+{
+ InstrQueryStopFinalize(qinstr);
+ memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+}
+
+/*
+ * Accumulate work done by parallel workers in the leader's stats.
+ *
+ * Note that what gets added here effectively depends on whether per-node
+ * instrumentation is active. If it's active the parallel worker intentionally
+ * skips ExecFinalizeNodeInstrumentation on executor shutdown, because it would
+ * cause double counting. Instead, this only accumulates any extra activity
+ * outside of nodes.
+ *
+ * Otherwise this is responsible for making sure that the complete query
+ * activity is accumulated.
+ */
+void
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+{
+ BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
+ WalUsageAdd(&instr_stack.current->walusage, walusage);
+
+ WalUsageAdd(&pgWalUsage, walusage);
}
/* Node instrumentation handling */
/* Allocate new node instrumentation structure */
NodeInstrumentation *
-InstrAllocNode(int instrument_options, bool async_mode)
+InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode)
{
- NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+ NodeInstrumentation *instr = MemoryContextAlloc(qinstr->instr_cxt, sizeof(NodeInstrumentation));
- InstrInitNode(instr, instrument_options);
+ InstrInitNode(instr, qinstr->instrument_options);
instr->async_mode = async_mode;
+ InstrQueryRememberChild(qinstr, &instr->instr);
+
return instr;
}
@@ -119,6 +398,7 @@ InstrStartNode(NodeInstrumentation *instr)
InstrStart(&instr->instr);
}
+
/* Exit from a plan node */
void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
@@ -148,14 +428,12 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
INSTR_TIME_SET_ZERO(instr->instr.starttime);
}
- /* Add delta of buffer usage since entry to node's totals */
- if (instr->instr.need_bufusage)
- BufferUsageAccumDiff(&instr->instr.bufusage,
- &pgBufferUsage, &instr->instr.bufusage_start);
-
- if (instr->instr.need_walusage)
- WalUsageAccumDiff(&instr->instr.walusage,
- &pgWalUsage, &instr->instr.walusage_start);
+ /*
+ * Only pop the stack, accumulation runs in
+ * ExecFinalizeNodeInstrumentation
+ */
+ if (instr->instr.need_stack)
+ InstrPopStack(&instr->instr);
/* Is this the first tuple of this cycle? */
if (!instr->running)
@@ -190,8 +468,8 @@ InstrEndLoop(NodeInstrumentation *instr)
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrEndLoop called on running node");
+ /* Ensure InstrStopNode was called */
+ Assert(INSTR_TIME_IS_ZERO(instr->instr.starttime));
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
@@ -225,67 +503,73 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
dst->nfiltered2 += add->nfiltered2;
/* Add delta of buffer usage since entry to node's totals */
- if (dst->instr.need_bufusage)
- BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
-
- if (dst->instr.need_walusage)
- WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
+ if (dst->instr.need_stack)
+ InstrAccumStack(&dst->instr, &add->instr);
}
/* Trigger instrumentation handling */
TriggerInstrumentation *
-InstrAllocTrigger(int n, int instrument_options)
+InstrAllocTrigger(QueryInstrumentation *qinstr, int n)
{
- TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+ TriggerInstrumentation *tginstr;
int i;
+ /*
+ * Allocate in the query's dedicated instrumentation context so all
+ * instrumentation data is grouped together and cleaned up as a unit.
+ */
+ Assert(qinstr != NULL && qinstr->instr_cxt != NULL);
+ tginstr = MemoryContextAllocZero(qinstr->instr_cxt,
+ n * sizeof(TriggerInstrumentation));
+
for (i = 0; i < n; i++)
- InstrInitOptions(&tginstr[i].instr, instrument_options);
+ InstrInitOptions(&tginstr[i].instr, qinstr->instrument_options);
return tginstr;
}
void
-InstrStartTrigger(TriggerInstrumentation *tginstr)
+InstrStartTrigger(QueryInstrumentation *qinstr, TriggerInstrumentation *tginstr)
{
InstrStart(&tginstr->instr);
+
+ /*
+ * On first call, register with the parent QueryInstrumentation for abort
+ * recovery.
+ */
+ if (qinstr && tginstr->instr.need_stack &&
+ dlist_node_is_detached(&tginstr->instr.unfinalized_entry))
+ dlist_push_head(&qinstr->unfinalized_entries,
+ &tginstr->instr.unfinalized_entry);
}
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
{
+ /*
+ * This trigger may be called again, so we don't finalize instrumentation
+ * here. Accumulation to the parent happens at ExecutorFinish through
+ * ExecFinalizeTriggerInstrumentation.
+ */
InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
-/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrAccumStack(Instrumentation *dst, Instrumentation *add)
{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
-}
+ Assert(dst != NULL);
+ Assert(add != NULL);
-/* report usage after parallel executor shutdown */
-void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
-}
+ if (!add->need_stack)
+ return;
-/* accumulate work done by workers in leader's stats */
-void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- BufferUsageAdd(&pgBufferUsage, bufusage);
- WalUsageAdd(&pgWalUsage, walusage);
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -306,39 +590,9 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
+/* dst += add */
void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
-/* helper functions for WAL usage accumulation */
-static void
-WalUsageAdd(WalUsage *dst, WalUsage *add)
+WalUsageAdd(WalUsage *dst, const WalUsage *add)
{
dst->wal_bytes += add->wal_bytes;
dst->wal_records += add->wal_records;
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b38170f0fbe..3ca0a7a635d 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -904,7 +904,7 @@ create_edata_for_relation(LogicalRepRelMapEntry *rel)
* Use Relation opened by logicalrep_rel_open() instead of opening it
* again.
*/
- InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
+ InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, NULL);
/*
* We put the ResultRelInfo in the es_opened_result_relations list, even
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3e1c39160db..cf4f4246ca2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1266,9 +1266,9 @@ PinBufferForBlock(Relation rel,
if (rel)
{
/*
- * While pgBufferUsage's "read" counter isn't bumped unless we reach
- * WaitReadBuffers() (so, not for hits, and not for buffers that are
- * zeroed instead), the per-relation stats always count them.
+ * While the current buffer usage "read" counter isn't bumped unless
+ * we reach WaitReadBuffers() (so, not for hits, and not for buffers
+ * that are zeroed instead), the per-relation stats always count them.
*/
pgstat_count_buffer_read(rel);
}
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index e3829d7fe7c..e7fc7f071d8 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -114,9 +114,9 @@ pgstat_prepare_io_time(bool track_io_guc)
* pg_stat_database only counts block read and write times, these are done for
* IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
*
- * pgBufferUsage is used for EXPLAIN. pgBufferUsage has write and read stats
- * for shared, local and temporary blocks. pg_stat_io does not track the
- * activity of temporary blocks, so these are ignored here.
+ * Executor instrumentation is used for EXPLAIN. Buffer usage tracked there has
+ * write and read stats for shared, local and temporary blocks. pg_stat_io
+ * does not track the activity of temporary blocks, so these are ignored here.
*/
void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index d3a57242844..340029a2034 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -51,8 +51,8 @@ typedef struct QueryDesc
/* This field is set by ExecutePlan */
bool already_executed; /* true if previously executed */
- /* This is always set NULL by the core system, but plugins can change it */
- struct Instrumentation *totaltime; /* total time spent in ExecutorRun */
+ /* This field is set by ExecutorRun, or plugins */
+ struct QueryInstrumentation *totaltime; /* total time spent in ExecutorRun */
} QueryDesc;
/* in pquery.c */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 491c4886506..03f0e864176 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -233,6 +233,7 @@ ExecGetJunkAttribute(TupleTableSlot *slot, AttrNumber attno, bool *isNull)
/*
* prototypes from functions in execMain.c
*/
+typedef struct QueryInstrumentation QueryInstrumentation;
extern void ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void standard_ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void ExecutorRun(QueryDesc *queryDesc,
@@ -254,7 +255,7 @@ extern void InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options);
+ QueryInstrumentation *qinstr);
extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid,
ResultRelInfo *rootRelInfo);
extern List *ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo);
@@ -301,6 +302,8 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecFinalizeNodeInstrumentation(PlanState *node);
+extern void ExecFinalizeWorkerInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d4769f3da7b..d2f0191af27 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -13,6 +13,7 @@
#ifndef INSTRUMENT_H
#define INSTRUMENT_H
+#include "lib/ilist.h"
#include "portability/instr_time.h"
@@ -68,29 +69,92 @@ typedef enum InstrumentOption
} InstrumentOption;
/*
- * General purpose instrumentation that can capture time and WAL/buffer usage
+ * Instrumentation base class for capturing time and WAL/buffer usage
*
- * Initialized through InstrAlloc, followed by one or more calls to a pair of
- * InstrStart/InstrStop (activity is measured inbetween).
+ * If used directly:
+ * - Allocate on the stack and zero initialize the struct
+ * - Call InstrInitOptions to set instrumentation options
+ * - Call InstrStart before the activity you want to measure
+ * - Call InstrStop / InstrStopFinalize after the activity to capture totals
+ *
+ * InstrStart/InstrStop may be called multiple times. The last stop call must
+ * be to InstrStopFinalize to ensure parent stack entries get the accumulated
+ * totals. If there is risk of transaction aborts you must call
+ * InstrStopFinalize in a PG_TRY/PG_FINALLY block to avoid corrupting the
+ * instrumentation stack.
+ *
+ * In a query context use QueryInstrumentation instead, which handles aborts
+ * using the resource owner logic.
*/
typedef struct Instrumentation
{
/* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
- bool need_bufusage; /* true if we need buffer usage data */
- bool need_walusage; /* true if we need WAL usage data */
+ bool need_stack; /* true if we need WAL/buffer usage data */
/* Internal state keeping: */
+ bool on_stack; /* true if currently on instr_stack */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
+ /* Abort handling: link in parent QueryInstrumentation's unfinalized list */
+ dlist_node unfinalized_entry;
} Instrumentation;
+/*
+ * Query-related instrumentation tracking.
+ *
+ * Usage:
+ * - Allocate on the heap using InstrQueryAlloc (required for abort handling)
+ * - Call InstrQueryStart before the activity you want to measure
+ * - Call InstrQueryStop / InstrQueryStopFinalize afterwards to capture totals
+ *
+ * InstrQueryStart/InstrQueryStop may be called multiple times. The last stop
+ * call must be to InstrQueryStopFinalize to ensure parent stack entries get
+ * the accumulated totals.
+ *
+ * Uses the resource owner mechanism for handling aborts. As such, the caller
+ * *must not* exit out of the top level transaction after having called
+ * InstrQueryStart without first calling InstrQueryStop or
+ * InstrQueryStopFinalize. In the case of a transaction abort, logic equivalent
+ * to InstrQueryStopFinalize will be called automatically.
+ */
+struct ResourceOwnerData;
+typedef struct QueryInstrumentation
+{
+ Instrumentation instr;
+
+ /* Original instrument_options flags used to create this instrumentation */
+ int instrument_options;
+
+ /* Resource owner used for abort cleanup between InstrQueryStart/InstrQueryStop */
+ struct ResourceOwnerData *owner;
+
+ /*
+ * Dedicated memory context for all instrumentation allocations belonging
+ * to this query (node instrumentation, trigger instrumentation, etc.).
+ * Initially a child of TopMemoryContext so it survives transaction abort
+ * for ResourceOwner cleanup; it is reparented under the current memory
+ * context in InstrQueryStopFinalize.
+ */
+ MemoryContext instr_cxt;
+
+ /*
+ * Child entries that need to be cleaned up on abort, since they are not
+ * registered as a resource owner themselves. Contains both node and
+ * trigger instrumentation entries linked via instr.unfinalized_entry.
+ */
+ dlist_head unfinalized_entries;
+} QueryInstrumentation;
+
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Relies on an outer QueryInstrumentation having been set up to handle the
+ * stack used for WAL/buffer usage statistics, and relies on it for managing
+ * aborts. Solely intended for the executor and anyone reporting about its
+ * activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -111,6 +175,10 @@ typedef struct NodeInstrumentation
double nfiltered2; /* # of tuples removed by "other" quals */
} NodeInstrumentation;
+/*
+ * Care must be taken with any pointers contained within this struct, as this
+ * gets copied across processes during parallel query execution.
+ */
typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
@@ -124,16 +192,104 @@ typedef struct TriggerInstrumentation
* was fired */
} TriggerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
+/*
+ * Dynamic array-based stack for tracking current WAL/buffer usage context.
+ *
+ * When the stack is empty, 'current' points to instr_top which accumulates
+ * session-level totals.
+ */
+typedef struct InstrStackState
+{
+ int stack_space; /* allocated capacity of entries array */
+ int stack_size; /* current number of entries */
+
+ Instrumentation **entries; /* dynamic array of pointers */
+ Instrumentation *current; /* top of stack, or &instr_top when empty */
+} InstrStackState;
+
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int instrument_options);
+/*
+ * The top instrumentation represents a running total of the current backend
+ * WAL/buffer usage information. This will not be updated immediately, but
+ * rather when the current stack entry gets accumulated which typically happens
+ * at query end.
+ *
+ * Care must be taken when utilizing this in the parallel worker context:
+ * Parallel workers will report back their instrumentation to the caller,
+ * and this gets added to the caller's stack. If this were to be used in the
+ * shared memory stats infrastructure it would need to be skipped on parallel
+ * workers to avoid double counting.
+ */
+extern PGDLLIMPORT Instrumentation instr_top;
+
+/*
+ * The instrumentation stack state. The 'current' field points to the
+ * currently active stack entry that is getting updated as activity happens,
+ * and will be accumulated to parent stacks when finalized by InstrStopFinalize
+ * (for non-executor use cases), ExecFinalizeNodeInstrumentation (executor
+ * finish) or ResOwnerReleaseInstrumentation on abort.
+ */
+extern PGDLLIMPORT InstrStackState instr_stack;
+
+extern void InstrStackGrow(void);
+
+/*
+ * Pushes the passed-in instrumentation entry onto the stack so that all
+ * WAL/buffer usage updates go to it.
+ *
+ * See note on InstrPopStack regarding safe use of these functions.
+ */
+static inline void
+InstrPushStack(Instrumentation *instr)
+{
+ if (unlikely(instr_stack.stack_size == instr_stack.stack_space))
+ InstrStackGrow();
+
+ instr_stack.entries[instr_stack.stack_size++] = instr;
+ instr_stack.current = instr;
+ instr->on_stack = true;
+}
+
+/*
+ * Pops the given entry off the stack, making the entry that was current
+ * before the matching InstrPushStack call current again.
+ *
+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, consider instead using InstrStart + InstrStopFinalize which can skip
+ * intermediate stack entries.
+ */
+static inline void
+InstrPopStack(Instrumentation *instr)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.entries[instr_stack.stack_size - 1] == instr);
+ instr_stack.stack_size--;
+ instr_stack.current = instr_stack.stack_size > 0
+ ? instr_stack.entries[instr_stack.stack_size - 1]
+ : &instr_top;
+ instr->on_stack = false;
+}
+
extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
extern void InstrStart(Instrumentation *instr);
extern void InstrStop(Instrumentation *instr);
+extern void InstrStopFinalize(Instrumentation *instr);
+extern void InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent);
+extern void InstrAccumStack(Instrumentation *dst, Instrumentation *add);
-extern NodeInstrumentation *InstrAllocNode(int instrument_options,
- bool async_mode);
+extern QueryInstrumentation *InstrQueryAlloc(int instrument_options);
+extern void InstrQueryStart(QueryInstrumentation *instr);
+extern void InstrQueryStop(QueryInstrumentation *instr);
+extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
+extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
+
+pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+
+extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
extern void InstrStartNode(NodeInstrumentation *instr);
extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
@@ -141,35 +297,36 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
-extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
-extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr, int n);
+extern void InstrStartTrigger(QueryInstrumentation *qinstr,
+ TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+extern void WalUsageAdd(WalUsage *dst, const WalUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
+ instr_stack.current->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += (val); \
+ instr_stack.current->bufusage.fld += (val); \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ INSTR_TIME_ADD(instr_stack.current->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ INSTR_TIME_ACCUM_DIFF(instr_stack.current->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ instr_stack.current->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += (val); \
+ instr_stack.current->walusage.fld += (val); \
} while(0)
#endif /* INSTRUMENT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3ecae7552fc..b28288aa1e8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -54,6 +54,7 @@ typedef struct Instrumentation Instrumentation;
typedef struct pairingheap pairingheap;
typedef struct PlanState PlanState;
typedef struct QueryEnvironment QueryEnvironment;
+typedef struct QueryInstrumentation QueryInstrumentation;
typedef struct RelationData *Relation;
typedef Relation *RelationPtr;
typedef struct ScanKeyData ScanKeyData;
@@ -753,7 +754,7 @@ typedef struct EState
* ExecutorRun() calls. */
int es_top_eflags; /* eflags passed to ExecutorStart */
- int es_instrument; /* OR of InstrumentOption flags */
+ QueryInstrumentation *es_instrument; /* query-level instrumentation */
bool es_finished; /* true when ExecutorFinish is done */
List *es_exprcontexts; /* List of ExprContexts within EState */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index eb6033b4fdb..5463bc921f0 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -75,6 +75,7 @@ typedef uint32 ResourceReleasePriority;
#define RELEASE_PRIO_SNAPSHOT_REFS 500
#define RELEASE_PRIO_FILES 600
#define RELEASE_PRIO_WAITEVENTSETS 700
+#define RELEASE_PRIO_INSTRUMENTATION 800
/* 0 is considered invalid */
#define RELEASE_PRIO_FIRST 1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index ca0c86d9e59..7f4e31da0d5 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1357,6 +1357,7 @@ InjectionPointSharedState
InjectionPointsCtl
InlineCodeBlock
InsertStmt
+InstrStackState
Instrumentation
Int128AggState
Int8TransTypeData
@@ -2479,6 +2480,7 @@ QueryCompletion
QueryDesc
QueryEnvironment
QueryInfo
+QueryInstrumentation
QueryItem
QueryItemType
QueryMode
--
2.47.1
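
For readers following the instrument.h changes above, the push/pop behaviour can be sketched as a small standalone C file. This is only an illustration under stated assumptions, not the patch's code: SketchInstr, sketch_stack, sketch_grow and the other sketch_* names are hypothetical stand-ins, a single counter replaces BufferUsage/WalUsage, and the doubling growth policy is assumed because the body of InstrStackGrow is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for Instrumentation: one counter only. */
typedef struct SketchInstr
{
	long		shared_blks_hit;
	bool		on_stack;
} SketchInstr;

/* Mirrors the shape of InstrStackState from the patch. */
typedef struct SketchStack
{
	int			space;		/* allocated capacity of entries array */
	int			size;		/* current number of entries */
	SketchInstr **entries;
	SketchInstr *current;	/* top of stack, or &sketch_top when empty */
} SketchStack;

static SketchInstr sketch_top;
static SketchStack sketch_stack = {0, 0, NULL, &sketch_top};

/* Assumed growth policy (doubling); the real InstrStackGrow may differ. */
static void
sketch_grow(void)
{
	int			newspace = sketch_stack.space > 0 ? sketch_stack.space * 2 : 4;
	SketchInstr **newentries;

	newentries = realloc(sketch_stack.entries,
						 (size_t) newspace * sizeof(*newentries));
	if (newentries == NULL)
		abort();
	sketch_stack.entries = newentries;
	sketch_stack.space = newspace;
}

static void
sketch_push(SketchInstr *instr)
{
	if (sketch_stack.size == sketch_stack.space)
		sketch_grow();
	sketch_stack.entries[sketch_stack.size++] = instr;
	sketch_stack.current = instr;
	instr->on_stack = true;
}

static void
sketch_pop(SketchInstr *instr)
{
	assert(sketch_stack.size > 0);
	assert(sketch_stack.entries[sketch_stack.size - 1] == instr);
	sketch_stack.size--;
	sketch_stack.current = sketch_stack.size > 0
		? sketch_stack.entries[sketch_stack.size - 1]
		: &sketch_top;
	instr->on_stack = false;
}

/* Plays the role of INSTR_BUFUSAGE_INCR(shared_blks_hit). */
static void
sketch_count_hit(void)
{
	sketch_stack.current->shared_blks_hit++;
}

/* Returns 1 if every hit landed in the entry that was current at the time. */
int
sketch_stack_demo(void)
{
	SketchInstr outer = {0, false};
	SketchInstr inner = {0, false};
	long		top_before = sketch_top.shared_blks_hit;

	sketch_push(&outer);
	sketch_count_hit();			/* counted in outer */
	sketch_push(&inner);
	sketch_count_hit();			/* counted in inner */
	sketch_count_hit();			/* counted in inner */
	sketch_pop(&inner);
	sketch_pop(&outer);
	sketch_count_hit();			/* stack is empty: counted in sketch_top */

	return outer.shared_blks_hit == 1 &&
		inner.shared_blks_hit == 2 &&
		sketch_top.shared_blks_hit == top_before + 1 &&
		sketch_stack.current == &sketch_top;
}
```

Note that popping does not add the inner entry's counters to the outer one. In the sketch, as in the inline functions above, each entry only holds what was counted while it was current, and folding a child into its parent is left to a separate finalize step.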
[application/octet-stream] v14-0009-Index-scans-Show-table-buffer-accesses-separatel.patch (22.9K, 8-v14-0009-Index-scans-Show-table-buffer-accesses-separatel.patch)
From 4e43c3075d7ddf031227d560c372291b901c8900 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:19 -0800
Subject: [PATCH v14 09/10] Index scans: Show table buffer accesses separately
in EXPLAIN ANALYZE
This sets up a separate instrumentation stack entry that is active whilst an
Index Scan or Index Only Scan accesses the table, for example because
additional data is needed.
EXPLAIN ANALYZE now shows "Table Buffers" that represent such activity.
This activity is also included in the regular "Buffers" counts, together with
index activity and that of any child nodes.
Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
doc/src/sgml/perform.sgml | 13 ++-
doc/src/sgml/ref/explain.sgml | 1 +
src/backend/commands/explain.c | 47 ++++++--
src/backend/executor/execProcnode.c | 46 ++++++++
src/backend/executor/nodeBitmapIndexscan.c | 2 +-
src/backend/executor/nodeIndexonlyscan.c | 41 ++++++-
src/backend/executor/nodeIndexscan.c | 127 +++++++++++++++++----
src/include/executor/instrument_node.h | 5 +
8 files changed, 244 insertions(+), 38 deletions(-)
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 604e8578a8d..d28f4f22535 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -734,6 +734,7 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10)
Index Cond: (unique2 = t1.unique2)
Index Searches: 10
+ Table Buffers: shared hit=10
Buffers: shared hit=24 read=6
Planning:
Buffers: shared hit=15 dirtied=9
@@ -1005,7 +1006,8 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
Index Cond: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Index Recheck: 1
Index Searches: 1
- Buffers: shared hit=1
+ Table Buffers: shared hit=1
+ Buffers: shared hit=2
Planning Time: 0.039 ms
Execution Time: 0.098 ms
</screen>
@@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
then rejected by a recheck of the index condition. This happens because a
GiST index is <quote>lossy</quote> for polygon containment tests: it actually
returns the rows with polygons that overlap the target, and then we have
- to do the exact containment test on those rows.
+ to do the exact containment test on those rows. The <literal>Table Buffers</literal>
+ counts show how many buffer accesses were made to the table rather than
+ the index. These accesses are also included in the <literal>Buffers</literal> counts.
</para>
<para>
@@ -1203,13 +1207,14 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
QUERY PLAN
-------------------------------------------------------------------&zwsp;------------------------------------------------------------
Limit (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2.00 loops=1)
- Buffers: shared hit=16
+ Buffers: shared hit=14
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2.00 loops=1)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Index Searches: 1
- Buffers: shared hit=16
+ Table Buffers: shared hit=11
+ Buffers: shared hit=14
Planning Time: 0.077 ms
Execution Time: 0.086 ms
</screen>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5b8b521802e..71070736acb 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -509,6 +509,7 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
-> Index Scan using test_pkey on test (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99.00 loops=1)
Index Cond: ((id > 100) AND (id < 200))
Index Searches: 1
+ Table Buffers: shared hit=1
Buffers: shared hit=4
Planning Time: 0.244 ms
Execution Time: 0.073 ms
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index a992dde6b8a..a6488a67461 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -144,7 +144,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -611,7 +611,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
- show_buffer_usage(es, bufusage);
+ show_buffer_usage(es, bufusage, NULL);
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -1028,7 +1028,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
es->indent--;
}
}
@@ -1042,7 +1042,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -1971,6 +1971,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexScanState *) planstate)->iss_Instrument->table_instr.bufusage, "Table");
break;
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1988,6 +1991,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyFloat("Heap Fetches", NULL,
planstate->instrument->ntuples2, 0, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexOnlyScanState *) planstate)->ioss_Instrument->table_instr.bufusage, "Table");
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
@@ -2289,7 +2295,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->instr.bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage, NULL);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
@@ -2308,7 +2314,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->instr.bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage, NULL);
if (es->wal)
show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
@@ -4108,7 +4114,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
* Show buffer usage details. This better be sync with peek_buffer_usage.
*/
static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
{
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -4133,6 +4139,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "Buffers:");
if (has_shared)
@@ -4188,6 +4196,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared_timing || has_local_timing || has_temp_timing)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "I/O Timings:");
if (has_shared_timing)
@@ -4229,6 +4239,14 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
else
{
+ char *buffers_title = NULL;
+
+ if (title)
+ {
+ buffers_title = psprintf("%s Buffers", title);
+ ExplainOpenGroup(buffers_title, buffers_title, true, es);
+ }
+
ExplainPropertyInteger("Shared Hit Blocks", NULL,
usage->shared_blks_hit, es);
ExplainPropertyInteger("Shared Read Blocks", NULL,
@@ -4249,8 +4267,20 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
usage->temp_blks_read, es);
ExplainPropertyInteger("Temp Written Blocks", NULL,
usage->temp_blks_written, es);
+
+ if (buffers_title)
+ ExplainCloseGroup(buffers_title, buffers_title, true, es);
+
if (track_io_timing)
{
+ char *timings_title = NULL;
+
+ if (title)
+ {
+ timings_title = psprintf("%s I/O Timings", title);
+ ExplainOpenGroup(timings_title, timings_title, true, es);
+ }
+
ExplainPropertyFloat("Shared I/O Read Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
3, es);
@@ -4269,6 +4299,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
ExplainPropertyFloat("Temp I/O Write Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
3, es);
+
+ if (timings_title)
+ ExplainCloseGroup(timings_title, timings_title, true, es);
}
}
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 6e8cbaeccf7..a59de0ef22b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -846,6 +846,20 @@ ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
&node->instrument->instr);
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ InstrFinalizeChild(&iss->iss_Instrument->table_instr, &node->instrument->instr);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ InstrFinalizeChild(&ioss->ioss_Instrument->table_instr, &node->instrument->instr);
+ }
+
InstrFinalizeChild(&node->instrument->instr, parent);
return false;
@@ -891,6 +905,38 @@ ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context)
num_workers = node->worker_instrument->num_workers;
+ /*
+ * Fold per-worker IndexScan/IndexOnlyScan table buffer stats into the
+ * per-worker node stats, matching what ExecFinalizeNodeInstrumentation
+ * does for the leader.
+ */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ if (iss->iss_SharedInfo)
+ {
+ int nworkers = Min(num_workers, iss->iss_SharedInfo->num_workers);
+
+ for (int n = 0; n < nworkers; n++)
+ InstrAccumStack(&node->worker_instrument->instrument[n].instr,
+ &iss->iss_SharedInfo->winstrument[n].table_instr);
+ }
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ if (ioss->ioss_SharedInfo)
+ {
+ int nworkers = Min(num_workers, ioss->ioss_SharedInfo->num_workers);
+
+ for (int n = 0; n < nworkers; n++)
+ InstrAccumStack(&node->worker_instrument->instrument[n].instr,
+ &ioss->ioss_SharedInfo->winstrument[n].table_instr);
+ }
+ }
+
/* Accumulate this node's per-worker stats to parent's per-worker stats */
if (parent && parent->worker_instrument)
{
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 70c55ee6d61..63e24a0bcd4 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -276,7 +276,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of bitmap index scans if requested */
if (estate->es_instrument)
- indexstate->biss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->biss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index de6154fd541..9e64ce2bd2d 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -67,6 +67,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -83,6 +84,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->ioss_Instrument && node->ioss_Instrument->table_instr.need_stack)
+ table_instr = &node->ioss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -165,11 +169,22 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
&node->ioss_VMBuffer))
{
+ bool found;
+
/*
* Rats, we have to visit the heap to check visibility.
*/
InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ found = index_fetch_heap(scandesc, node->ioss_TableSlot);
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (!found)
continue; /* no visible tuple, try next index entry */
ExecClearTuple(node->ioss_TableSlot);
@@ -436,6 +451,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->ioss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->ioss_Instrument->table_instr);
}
/*
@@ -610,7 +626,21 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
/* Set up instrumentation of index-only scans if requested */
if (estate->es_instrument)
- indexstate->ioss_Instrument = palloc0_object(IndexScanInstrumentation);
+ {
+ indexstate->ioss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
+
+ /*
+ * Track table and index access separately. We intentionally don't
+ * collect timing (even if enabled), since we don't need it, and
+ * IndexOnlyNext calls InstrPushStack / InstrPopStack (instead of the
+ * full InstrNode*) to reduce overhead.
+ */
+ if ((estate->es_instrument->instrument_options & INSTRUMENT_BUFFERS) != 0)
+ {
+ InstrInitOptions(&indexstate->ioss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ InstrQueryRememberChild(estate->es_instrument, &indexstate->ioss_Instrument->table_instr);
+ }
+ }
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -899,4 +929,11 @@ ExecIndexOnlyScanRetrieveInstrumentation(IndexOnlyScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->ioss_SharedInfo = palloc(size);
memcpy(node->ioss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->ioss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->ioss_Instrument->table_instr,
+ &node->ioss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 1620d146071..02ef9d124a3 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,7 +85,10 @@ IndexNext(IndexScanState *node)
ExprContext *econtext;
ScanDirection direction;
IndexScanDesc scandesc;
+ ItemPointer tid;
TupleTableSlot *slot;
+ bool found;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -102,6 +105,9 @@ IndexNext(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -132,8 +138,24 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (unlikely(!found))
+ continue;
+
CHECK_FOR_INTERRUPTS();
/*
@@ -181,6 +203,7 @@ IndexNextWithReorder(IndexScanState *node)
Datum *lastfetched_vals;
bool *lastfetched_nulls;
int cmp;
+ Instrumentation *table_instr = NULL;
estate = node->ss.ps.state;
@@ -200,6 +223,9 @@ IndexNextWithReorder(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -263,36 +289,67 @@ IndexNextWithReorder(IndexScanState *node)
}
/*
- * Fetch next tuple from the index.
+ * Fetch next valid tuple from the index.
*/
-next_indextuple:
- if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+ for (;;)
{
+ ItemPointer tid;
+ bool found;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scandesc, ForwardScanDirection);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ /*
+ * No more tuples from the index. But we still need to drain
+ * any remaining tuples from the queue before we're done.
+ */
+ node->iss_ReachedEnd = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scandesc->xs_heaptid));
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
/*
- * No more tuples from the index. But we still need to drain any
- * remaining tuples from the queue before we're done.
+ * If the index was lossy, we have to recheck the index quals and
+ * ORDER BY expressions using the fetched tuple.
*/
- node->iss_ReachedEnd = true;
- continue;
- }
-
- /*
- * If the index was lossy, we have to recheck the index quals and
- * ORDER BY expressions using the fetched tuple.
- */
- if (scandesc->xs_recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->indexqualorig, econtext))
+ if (found && scandesc->xs_recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- /* allow this loop to be cancellable */
- CHECK_FOR_INTERRUPTS();
- goto next_indextuple;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->indexqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ /* allow this loop to be cancellable */
+ CHECK_FOR_INTERRUPTS();
+ continue;
+ }
}
+
+ if (found)
+ break;
}
+ /* No more index entries, re-run to clear the reorder queue */
+ if (node->iss_ReachedEnd)
+ continue;
+
if (scandesc->xs_recheckorderby)
{
econtext->ecxt_scantuple = slot;
@@ -818,6 +875,7 @@ ExecEndIndexScan(IndexScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->iss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->iss_Instrument->table_instr);
}
/*
@@ -980,7 +1038,21 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of index scans if requested */
if (estate->es_instrument)
- indexstate->iss_Instrument = palloc0_object(IndexScanInstrumentation);
+ {
+ indexstate->iss_Instrument = MemoryContextAllocZero(estate->es_instrument->instr_cxt, sizeof(IndexScanInstrumentation));
+
+ /*
+ * Track table and index access separately. We intentionally don't
+ * collect timing (even if enabled), since we don't need it, and
+ * IndexNext / IndexNextWithReorder call InstrPushStack /
+ * InstrPopStack (instead of the full InstrNode*) to reduce overhead.
+ */
+ if ((estate->es_instrument->instrument_options & INSTRUMENT_BUFFERS) != 0)
+ {
+ InstrInitOptions(&indexstate->iss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ InstrQueryRememberChild(estate->es_instrument, &indexstate->iss_Instrument->table_instr);
+ }
+ }
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -1834,4 +1906,11 @@ ExecIndexScanRetrieveInstrumentation(IndexScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->iss_SharedInfo = palloc(size);
memcpy(node->iss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->iss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->iss_Instrument->table_instr,
+ &node->iss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 2a0ff377a73..e2315cef384 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -18,6 +18,8 @@
#ifndef INSTRUMENT_NODE_H
#define INSTRUMENT_NODE_H
+#include "executor/instrument.h"
+
/* ---------------------
* Instrumentation information for aggregate function execution
@@ -48,6 +50,9 @@ typedef struct IndexScanInstrumentation
{
/* Index search count (incremented with pgstat_count_index_scan call) */
uint64 nsearches;
+
+ /* Instrumentation utilized for tracking buffer usage during table access */
+ Instrumentation table_instr;
} IndexScanInstrumentation;
/*
--
2.47.1
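
The relationship between "Table Buffers" and "Buffers" in the patch above can be sketched as a small standalone C file. This is a simplified illustration rather than the executor code: the tb_* names and TbInstr are hypothetical, a fixed-depth array replaces the dynamic stack, and tb_finalize_child assumes that InstrFinalizeChild adds the child's counters to the parent, which this patch implies but whose body is not shown here. The 3 index hits and 11 table hits are chosen to reproduce the "Table Buffers: shared hit=11" / "Buffers: shared hit=14" lines from the perform.sgml example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Instrumentation: one counter only. */
typedef struct TbInstr
{
	long		shared_blks_hit;
} TbInstr;

#define TB_MAX_DEPTH 8			/* fixed depth, for brevity only */

static TbInstr tb_top;
static TbInstr *tb_entries[TB_MAX_DEPTH];
static int	tb_size = 0;
static TbInstr *tb_current = &tb_top;

static void
tb_push(TbInstr *instr)
{
	assert(tb_size < TB_MAX_DEPTH);
	tb_entries[tb_size++] = instr;
	tb_current = instr;
}

static void
tb_pop(TbInstr *instr)
{
	assert(tb_size > 0 && tb_entries[tb_size - 1] == instr);
	tb_size--;
	tb_current = tb_size > 0 ? tb_entries[tb_size - 1] : &tb_top;
}

/* A buffer hit is charged to whichever entry is current. */
static void
tb_buffer_hit(void)
{
	tb_current->shared_blks_hit++;
}

/* Assumed behaviour of InstrFinalizeChild: add child totals to the parent. */
static void
tb_finalize_child(const TbInstr *child, TbInstr *parent)
{
	parent->shared_blks_hit += child->shared_blks_hit;
}

/*
 * Simulates one index scan node: index page hits are charged to the node
 * entry, heap fetches are charged to the separate table entry, and the
 * table entry is folded into the node entry when the node is finalized.
 */
void
tb_index_scan_demo(TbInstr *node, TbInstr *table,
				   int index_hits, int table_hits)
{
	tb_push(node);

	for (int i = 0; i < index_hits; i++)
		tb_buffer_hit();		/* index access */

	for (int i = 0; i < table_hits; i++)
	{
		tb_push(table);
		tb_buffer_hit();		/* table (heap) access */
		tb_pop(table);
	}

	tb_pop(node);
	tb_finalize_child(table, node);
}
```

With 3 index hits and 11 table hits, the table entry ends at 11 and the node entry at 14, so the node total includes the table accesses while the table entry still reports them on their own.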
[application/octet-stream] v14-0007-instrumentation-Use-Instrumentation-struct-for-p.patch (29.2K, 9-v14-0007-instrumentation-Use-Instrumentation-struct-for-p.patch)
From 54b154d390d99e9180118194a1cfed56524c3d97 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 15 Mar 2026 21:44:58 -0700
Subject: [PATCH v14 07/10] instrumentation: Use Instrumentation struct for
parallel workers
This simplifies the DSM allocations a bit, since we no longer need to
allocate WAL and buffer usage separately, and makes it easier to add a
third stack-based struct that is being discussed.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/access/brin/brin.c | 43 ++++++-----------
src/backend/access/gin/gininsert.c | 43 ++++++-----------
src/backend/access/nbtree/nbtsort.c | 43 ++++++-----------
src/backend/commands/vacuumparallel.c | 52 ++++++++-------------
src/backend/executor/execParallel.c | 66 ++++++++++++---------------
src/backend/executor/instrument.c | 14 +++---
src/include/executor/execParallel.h | 5 +-
src/include/executor/instrument.h | 4 +-
8 files changed, 99 insertions(+), 171 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3a5176c76c7..9e545b4ef0e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -51,8 +51,7 @@
#define PARALLEL_KEY_BRIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -148,8 +147,7 @@ typedef struct BrinLeader
BrinShared *brinshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BrinLeader;
/*
@@ -2387,8 +2385,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BrinShared *brinshared;
Sharedsort *sharedsort;
BrinLeader *brinleader = palloc0_object(BrinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -2430,18 +2427,14 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -2514,15 +2507,12 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -2533,8 +2523,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
brinleader->snapshot = snapshot;
- brinleader->walusage = walusage;
- brinleader->bufferusage = bufferusage;
+ brinleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -2573,7 +2562,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2888,8 +2877,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2950,11 +2938,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0d80f72a0b0..f3de62ce7f3 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -45,8 +45,7 @@
#define PARALLEL_KEY_GIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -138,8 +137,7 @@ typedef struct GinLeader
GinBuildShared *ginshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} GinLeader;
typedef struct
@@ -945,8 +943,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
GinBuildShared *ginshared;
Sharedsort *sharedsort;
GinLeader *ginleader = palloc0_object(GinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -987,18 +984,14 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1066,15 +1059,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1085,8 +1075,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
ginleader->ginshared = ginshared;
ginleader->sharedsort = sharedsort;
ginleader->snapshot = snapshot;
- ginleader->walusage = walusage;
- ginleader->bufferusage = bufferusage;
+ ginleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1125,7 +1114,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2119,8 +2108,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2200,11 +2188,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2d7b7cef912..cb238f862a7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -66,8 +66,7 @@
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
#define PARALLEL_KEY_TUPLESORT_SPOOL2 UINT64CONST(0xA000000000000003)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000004)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xA000000000000005)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xA000000000000006)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xA000000000000005)
/*
* DISABLE_LEADER_PARTICIPATION disables the leader's participation in
@@ -195,8 +194,7 @@ typedef struct BTLeader
Sharedsort *sharedsort;
Sharedsort *sharedsort2;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BTLeader;
/*
@@ -1408,8 +1406,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
Sharedsort *sharedsort2;
BTSpool *btspool = buildstate->spool;
BTLeader *btleader = palloc0_object(BTLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -1462,18 +1459,14 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1560,15 +1553,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1580,8 +1570,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btleader->sharedsort = sharedsort;
btleader->sharedsort2 = sharedsort2;
btleader->snapshot = snapshot;
- btleader->walusage = walusage;
- btleader->bufferusage = bufferusage;
+ btleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1620,7 +1609,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1754,8 +1743,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
#ifdef BTREE_BUILD_STATS
@@ -1837,11 +1825,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
sharedsort2, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index c330c891c03..b5fed54fb85 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -47,9 +47,8 @@
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 2
-#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 3
-#define PARALLEL_VACUUM_KEY_WAL_USAGE 4
-#define PARALLEL_VACUUM_KEY_INDEX_STATS 5
+#define PARALLEL_VACUUM_KEY_INSTRUMENTATION 3
+#define PARALLEL_VACUUM_KEY_INDEX_STATS 4
/*
* Shared information among parallel workers. So this is allocated in the DSM
@@ -188,11 +187,8 @@ struct ParallelVacuumState
/* Shared dead items space among parallel vacuum workers */
TidStore *dead_items;
- /* Points to buffer usage area in DSM */
- BufferUsage *buffer_usage;
-
- /* Points to WAL usage area in DSM */
- WalUsage *wal_usage;
+ /* Points to instrumentation area in DSM */
+ Instrumentation *instr;
/*
* False if the index is totally unsuitable target for all parallel
@@ -250,8 +246,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PVShared *shared;
TidStore *dead_items;
PVIndStats *indstats;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *instr;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
@@ -304,18 +299,15 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage and WalUsage --
- * PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
+ * Estimate space for Instrumentation --
+ * PARALLEL_VACUUM_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
@@ -396,17 +388,13 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
pvs->shared = shared;
/*
- * Allocate space for each worker's BufferUsage and WalUsage; no need to
- * initialize
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
*/
- buffer_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, buffer_usage);
- pvs->buffer_usage = buffer_usage;
- wal_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
- pvs->wal_usage = wal_usage;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, instr);
+ pvs->instr = instr;
/* Store query string for workers */
if (debug_query_string)
@@ -749,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->instr[i]);
}
/*
@@ -1007,8 +995,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVShared *shared;
TidStore *dead_items;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *worker_instr;
int nindexes;
char *sharedquery;
ErrorContextCallback errcallback;
@@ -1102,11 +1089,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
parallel_vacuum_process_safe_indexes(&pvs);
/* Report buffer/WAL usage during parallel execution */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index c01e780f918..2e57136edfd 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -60,13 +60,12 @@
#define PARALLEL_KEY_EXECUTOR_FIXED UINT64CONST(0xE000000000000001)
#define PARALLEL_KEY_PLANNEDSTMT UINT64CONST(0xE000000000000002)
#define PARALLEL_KEY_PARAMLISTINFO UINT64CONST(0xE000000000000003)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xE000000000000004)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000004)
#define PARALLEL_KEY_TUPLE_QUEUE UINT64CONST(0xE000000000000005)
-#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000006)
+#define PARALLEL_KEY_NODE_INSTRUMENTATION UINT64CONST(0xE000000000000006)
#define PARALLEL_KEY_DSA UINT64CONST(0xE000000000000007)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -631,8 +630,6 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_data;
char *pstmt_space;
char *paramlistinfo_space;
- BufferUsage *bufusage_space;
- WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
@@ -696,21 +693,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage.
+ * Estimate space for Instrumentation.
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
* looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
-
- /*
- * Same thing for WalUsage.
- */
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Estimate space for tuple queues. */
@@ -796,17 +786,18 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMLISTINFO, paramlistinfo_space);
SerializeParamList(estate->es_param_list_info, &paramlistinfo_space);
- /* Allocate space for each worker's BufferUsage; no need to initialize. */
- bufusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufusage_space);
- pei->buffer_usage = bufusage_space;
+ /*
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
+ */
+ {
+ Instrumentation *instr;
- /* Same for WalUsage. */
- walusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
- pei->wal_usage = walusage_space;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
+ pei->instrumentation = instr;
+ }
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -832,9 +823,9 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
InstrInitNode(&instrument[i], estate->es_instrument->instrument_options);
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_NODE_INSTRUMENTATION,
instrumentation);
- pei->instrumentation = instrumentation;
+ pei->node_instrumentation = instrumentation;
if (estate->es_jit_flags != PGJIT_NONE)
{
@@ -1236,7 +1227,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
* finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->instrumentation[i]);
pei->finished = true;
}
@@ -1250,11 +1241,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
- /* Accumulate instrumentation, if any. */
- if (pei->instrumentation)
+ /* Accumulate node instrumentation, if any. */
+ if (pei->node_instrumentation)
{
ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
+ pei->node_instrumentation);
ExecFinalizeWorkerInstrumentation(pei->planstate);
}
@@ -1481,8 +1472,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
@@ -1497,7 +1486,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Set up DestReceiver, SharedExecutorInstrumentation, and QueryDesc. */
receiver = ExecParallelGetReceiver(seg, toc);
- instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
+ instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_NODE_INSTRUMENTATION, true);
if (instrumentation != NULL)
instrument_options = instrumentation->instrument_options;
jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
@@ -1555,11 +1544,12 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecutorFinish(queryDesc);
/* Report buffer/WAL usage during parallel execution. */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ {
+ Instrumentation *worker_instr;
+
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
+ }
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index f9202b558d6..af64aa145eb 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -339,11 +339,12 @@ InstrStartParallelQuery(void)
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst)
{
InstrQueryStopFinalize(qinstr);
- memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
- memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+ dst->need_stack = qinstr->instr.need_stack;
+ memcpy(&dst->bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(&dst->walusage, &qinstr->instr.walusage, sizeof(WalUsage));
}
/*
@@ -359,12 +360,11 @@ InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUs
* activity is accumulated.
*/
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(Instrumentation *instr)
{
- BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
- WalUsageAdd(&instr_stack.current->walusage, walusage);
+ InstrAccumStack(instr_stack.current, instr);
- WalUsageAdd(&pgWalUsage, walusage);
+ WalUsageAdd(&pgWalUsage, &instr->walusage);
}
/* Node instrumentation handling */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..6c8b602d07f 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -25,9 +25,8 @@ typedef struct ParallelExecutorInfo
{
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
- BufferUsage *buffer_usage; /* points to bufusage area in DSM */
- WalUsage *wal_usage; /* walusage area in DSM */
- SharedExecutorInstrumentation *instrumentation; /* optional */
+ Instrumentation *instrumentation; /* instrumentation area in DSM */
+ SharedExecutorInstrumentation *node_instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
dsa_pointer param_exec; /* serialized PARAM_EXEC parameters */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index d2f0191af27..b62619412a0 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -286,8 +286,8 @@ extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst);
+extern void InstrAccumParallelQuery(Instrumentation *instr);
extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr, bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
--
2.47.1
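For readers skimming the diff above: the recurring change is that each parallel caller now publishes one per-worker array under a single shm_toc key, instead of separate WalUsage and BufferUsage arrays. Below is a minimal standalone sketch of that layout, not code from the patch. The Demo* types and demo_* functions are hypothetical stand-ins: the real structs have more fields, and the real code uses shm_toc_allocate/shm_toc_insert and InstrAccumStack where this sketch uses calloc and manual addition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-ins for BufferUsage / WalUsage. */
typedef struct DemoBufferUsage
{
    int64_t     shared_blks_hit;
    int64_t     shared_blks_read;
} DemoBufferUsage;

typedef struct DemoWalUsage
{
    int64_t     wal_records;
    uint64_t    wal_bytes;
} DemoWalUsage;

/* One struct per worker replaces the two parallel arrays. */
typedef struct DemoInstrumentation
{
    bool            need_stack;
    DemoBufferUsage bufusage;
    DemoWalUsage    walusage;
} DemoInstrumentation;

/* Leader side: a single allocation with one slot per planned worker. */
static DemoInstrumentation *
demo_alloc_worker_slots(int nworkers)
{
    assert(nworkers > 0);
    return calloc((size_t) nworkers, sizeof(DemoInstrumentation));
}

/* Worker side: copy local totals into this worker's own slot. */
static void
demo_worker_report(DemoInstrumentation *slots, int worker_number,
                   const DemoInstrumentation *local)
{
    DemoInstrumentation *dst = &slots[worker_number];

    dst->need_stack = local->need_stack;
    memcpy(&dst->bufusage, &local->bufusage, sizeof(DemoBufferUsage));
    memcpy(&dst->walusage, &local->walusage, sizeof(DemoWalUsage));
}

/* Leader side: fold the slots of all launched workers into one total. */
static void
demo_leader_accumulate(DemoInstrumentation *total,
                       const DemoInstrumentation *slots,
                       int nworkers_launched)
{
    for (int i = 0; i < nworkers_launched; i++)
    {
        total->bufusage.shared_blks_hit += slots[i].bufusage.shared_blks_hit;
        total->bufusage.shared_blks_read += slots[i].bufusage.shared_blks_read;
        total->walusage.wal_records += slots[i].walusage.wal_records;
        total->walusage.wal_bytes += slots[i].walusage.wal_bytes;
    }
}
```

The sketch keeps the same three phases seen in each caller in the diff: the leader allocates, every worker writes only its own slot (indexed by ParallelWorkerNumber in the real code), and the leader sums the slots once the workers have finished.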
[application/octet-stream] v14-0008-instrumentation-Optimize-ExecProcNodeInstr-instr.patch (11.3K, 10-v14-0008-instrumentation-Optimize-ExecProcNodeInstr-instr.patch)
From 421a6802c1e019de937ffa6207270b4be3f0eb5b Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 17:52:24 -0800
Subject: [PATCH v14 08/10] instrumentation: Optimize ExecProcNodeInstr
instructions by inlining
For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.
Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). Checking these
conditionals for every emitted tuple adds up, and causes the compiler to
generate a non-optimal set of instructions.
Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. This reduces
the overhead of EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
test with a large COUNT(*) that does many ExecProcNode calls from ~20% on
top of actual runtime to ~3%. With BUFFERS ON, the same query goes
from ~20% to ~10% on top of actual runtime.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxFP7i7-wy98ZmEJ11edYq-RrPvJoa4kzGhBBjERA4Nyw%40mail.gmail.com#e8dfd018a07d7f8d41565a079d40c564
---
src/backend/executor/execProcnode.c | 22 +--
src/backend/executor/instrument.c | 199 ++++++++++++++++++++--------
src/include/executor/instrument.h | 5 +
3 files changed, 149 insertions(+), 77 deletions(-)
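As a rough illustration of the approach described in the commit message, here is a standalone sketch of picking a specialized wrapper once, when the function pointer is first set, so that later calls do not re-test the instrumentation options. This is not the patch's implementation: DemoNode and all demo_* names are hypothetical, and the timer and stack work are reduced to simple counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct DemoNode DemoNode;
typedef int (*DemoProcFn) (DemoNode *node);

struct DemoNode
{
    DemoProcFn  proc;       /* current entry point (cf. ExecProcNode) */
    DemoProcFn  proc_real;  /* uninstrumented work (cf. ExecProcNodeReal) */
    bool        need_timer;
    bool        need_stack;
    int         remaining;  /* tuples the fake node still has to emit */
    long        tuples;     /* tuples counted by the wrapper */
    long        timer_ops;  /* simulated timer start/stop pairs */
    long        stack_ops;  /* simulated stack push/pop pairs */
};

/* Fake node body: returns 1 per tuple, 0 once exhausted. */
static int
demo_real(DemoNode *node)
{
    if (node->remaining <= 0)
        return 0;
    node->remaining--;
    return 1;
}

/*
 * Shared body. Each wrapper below passes constant flags, so a compiler
 * that inlines this can typically drop the branches that do not apply.
 */
static inline int
demo_proc_common(DemoNode *node, bool timer, bool stack)
{
    int         got;

    if (timer)
        node->timer_ops++;
    if (stack)
        node->stack_ops++;
    got = node->proc_real(node);
    node->tuples += got;
    return got;
}

static int demo_proc_plain(DemoNode *n) { return demo_proc_common(n, false, false); }
static int demo_proc_timer(DemoNode *n) { return demo_proc_common(n, true, false); }
static int demo_proc_stack(DemoNode *n) { return demo_proc_common(n, false, true); }
static int demo_proc_both(DemoNode *n)  { return demo_proc_common(n, true, true); }

/* Inspect the options once and return the matching wrapper. */
static DemoProcFn
demo_setup_proc(const DemoNode *node)
{
    assert(node->proc_real != NULL);

    if (node->need_timer && node->need_stack)
        return demo_proc_both;
    if (node->need_timer)
        return demo_proc_timer;
    if (node->need_stack)
        return demo_proc_stack;
    return demo_proc_plain;
}

/* First call: install the specialized wrapper, then run it. */
static int
demo_proc_first(DemoNode *node)
{
    node->proc = demo_setup_proc(node);
    return node->proc(node);
}
```

A caller would set node.proc = demo_proc_first and then loop on node.proc(&node); after the first call the pointer refers directly to one of the four wrappers, so the option checks run once per node rather than once per tuple.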
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 3b3ec9850e8..6e8cbaeccf7 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,7 +121,6 @@
#include "nodes/nodeFuncs.h"
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
-static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
static bool ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context);
@@ -465,7 +464,7 @@ ExecProcNodeFirst(PlanState *node)
* have ExecProcNode() directly call the relevant function from now on.
*/
if (node->instrument)
- node->ExecProcNode = ExecProcNodeInstr;
+ node->ExecProcNode = InstrNodeSetupExecProcNode(node->instrument);
else
node->ExecProcNode = node->ExecProcNodeReal;
@@ -473,25 +472,6 @@ ExecProcNodeFirst(PlanState *node)
}
-/*
- * ExecProcNode wrapper that performs instrumentation calls. By keeping
- * this a separate function, we avoid overhead in the normal case where
- * no instrumentation is wanted.
- */
-static TupleTableSlot *
-ExecProcNodeInstr(PlanState *node)
-{
- TupleTableSlot *result;
-
- InstrStartNode(node->instrument);
-
- result = node->ExecProcNodeReal(node);
-
- InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
- return result;
-}
-
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index af64aa145eb..3183f00d693 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -66,29 +66,20 @@ InstrInitOptions(Instrumentation *instr, int instrument_options)
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-void
-InstrStart(Instrumentation *instr)
+static inline void
+InstrStartTimer(Instrumentation *instr)
{
- if (instr->need_timer)
- {
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStart called twice in a row");
- else
- INSTR_TIME_SET_CURRENT(instr->starttime);
- }
+ Assert(INSTR_TIME_IS_ZERO(instr->starttime));
- if (instr->need_stack)
- InstrPushStack(instr);
+ INSTR_TIME_SET_CURRENT(instr->starttime);
}
-static void
+static inline void
InstrStopTimer(Instrumentation *instr)
{
instr_time endtime;
- /* let's update the time only if the timer was requested */
- if (INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStop called without start");
+ Assert(!INSTR_TIME_IS_ZERO(instr->starttime));
INSTR_TIME_SET_CURRENT(endtime);
INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
@@ -96,6 +87,16 @@ InstrStopTimer(Instrumentation *instr)
INSTR_TIME_SET_ZERO(instr->starttime);
}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer)
+ InstrStartTimer(instr);
+
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
void
InstrStop(Instrumentation *instr)
{
@@ -391,65 +392,57 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options)
InstrInitOptions(&instr->instr, instrument_options);
}
-/* Entry to a plan node */
+/* Entry to a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
void
InstrStartNode(NodeInstrumentation *instr)
{
InstrStart(&instr->instr);
}
-
-/* Exit from a plan node */
-void
-InstrStopNode(NodeInstrumentation *instr, double nTuples)
+/*
+ * Updates the node instrumentation time counter.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+static inline void
+InstrStopNodeTimer(NodeInstrumentation *instr)
{
- double save_tuplecount = instr->tuplecount;
instr_time endtime;
- /* count the returned tuples */
- instr->tuplecount += nTuples;
+ Assert(!INSTR_TIME_IS_ZERO(instr->instr.starttime));
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ INSTR_TIME_SET_ZERO(instr->instr.starttime);
/*
- * Update the time only if the timer was requested.
+ * Is this the first tuple of this cycle?
*
- * Note this is different from InstrStop because total is only updated in
- * InstrEndLoop. We need the separate counter variable because we need to
- * calculate start-up time for the first tuple in each cycle, and then
- * accumulate it together.
+ * In async mode, if the plan node hadn't emitted any tuples before, this
+ * might be the first tuple.
*/
- if (instr->instr.need_timer)
- {
- if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrStopNode called without start");
-
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+ if (!instr->running || (instr->async_mode && instr->tuplecount < 1.0))
+ instr->firsttuple = instr->counter;
+}
- INSTR_TIME_SET_ZERO(instr->instr.starttime);
- }
+/* Exit from a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+ if (instr->instr.need_timer)
+ InstrStopNodeTimer(instr);
- /*
- * Only pop the stack, accumulation runs in
- * ExecFinalizeNodeInstrumentation
- */
+ /* Only pop the stack, accumulation runs in InstrFinalizeNode */
if (instr->instr.need_stack)
InstrPopStack(&instr->instr);
- /* Is this the first tuple of this cycle? */
- if (!instr->running)
- {
- instr->running = true;
- instr->firsttuple = instr->counter;
- }
- else
- {
- /*
- * In async mode, if the plan node hadn't emitted any tuples before,
- * this might be the first tuple
- */
- if (instr->async_mode && save_tuplecount < 1.0)
- instr->firsttuple = instr->counter;
- }
+ instr->running = true;
+
+ /* count the returned tuples */
+ instr->tuplecount += nTuples;
}
/* Update tuple count */
@@ -507,6 +500,100 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
InstrAccumStack(&dst->instr, &add->instr);
}
+/*
+ * Specialized handling of instrumented ExecProcNode
+ *
+ * These functions are equivalent to running ExecProcNodeReal wrapped in
+ * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
+ * by checking the instrumentation options once, when the ExecProcNode pointer
+ * is first set, and then using a special-purpose function per combination.
+ * This lets the compiler drop those branches from each variant.
+ */
+
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
+
+/* Simplified pop: restore saved state instead of re-deriving from array */
+static inline void
+InstrPopStackTo(Instrumentation *prev)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.stack_size > 1 ? instr_stack.entries[instr_stack.stack_size - 2] == prev : &instr_top == prev);
+ instr_stack.entries[instr_stack.stack_size - 1]->on_stack = false;
+ instr_stack.stack_size--;
+ instr_stack.current = prev;
+}
+
+static pg_attribute_always_inline TupleTableSlot *
+ExecProcNodeInstr(PlanState *node, bool need_timer, bool need_stack)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ if (need_stack)
+ InstrPushStack(&instr->instr);
+ if (need_timer)
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ if (need_timer)
+ InstrStopNodeTimer(instr);
+ if (need_stack)
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrFull(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsStackOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsTimerOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, false);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, false);
+}
+
+/*
+ * Returns an ExecProcNode wrapper that performs instrumentation calls,
+ * tailored to the instrumentation options enabled for the node.
+ */
+ExecProcNodeMtd
+InstrNodeSetupExecProcNode(NodeInstrumentation *instr)
+{
+ bool need_timer = instr->instr.need_timer;
+ bool need_stack = instr->instr.need_stack;
+
+ if (need_timer && need_stack)
+ return ExecProcNodeInstrFull;
+ else if (need_stack)
+ return ExecProcNodeInstrRowsStackOnly;
+ else if (need_timer)
+ return ExecProcNodeInstrRowsTimerOnly;
+ else
+ return ExecProcNodeInstrRowsOnly;
+}
+
/* Trigger instrumentation handling */
TriggerInstrumentation *
InstrAllocTrigger(QueryInstrumentation *qinstr, int n)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b62619412a0..bae8a9b0e62 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -297,6 +297,11 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
+typedef struct TupleTableSlot TupleTableSlot;
+typedef struct PlanState PlanState;
+typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+extern ExecProcNodeMtd InstrNodeSetupExecProcNode(NodeInstrumentation *instr);
+
extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr, int n);
extern void InstrStartTrigger(QueryInstrumentation *qinstr,
TriggerInstrumentation *tginstr);
--
2.47.1
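For readers skimming the patch above, here is a minimal standalone sketch of the
dispatch idea behind InstrNodeSetupExecProcNode: an inlined helper takes the
options as constant arguments, thin wrappers pin those constants, and a setup
function picks one wrapper once. All names below (Node, NodeFn, run_instr,
setup_wrapper, produce_one) are hypothetical stand-ins, not PostgreSQL APIs, and
the counters only simulate timer and stack work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for PlanState / ExecProcNodeMtd; not PostgreSQL types. */
typedef struct Node Node;
typedef int (*NodeFn) (Node *node);

struct Node
{
	NodeFn		real;			/* the uninstrumented function */
	NodeFn		wrapped;		/* wrapper chosen once at setup time */
	bool		need_timer;
	bool		need_stack;
	long		timer_calls;	/* simulated timer start/stop pairs */
	long		stack_calls;	/* simulated stack push/pop pairs */
	long		rows;			/* non-empty results returned */
};

/*
 * Generic body. Every caller passes literal true/false, so after inlining
 * the compiler can remove the conditionals from each wrapper.
 */
static inline int
run_instr(Node *node, bool need_timer, bool need_stack)
{
	int			result;

	assert(node->real != NULL);

	if (need_stack)
		node->stack_calls++;
	if (need_timer)
		node->timer_calls++;

	result = node->real(node);

	if (result != 0)
		node->rows++;
	return result;
}

/* One thin wrapper per combination of options. */
static int	run_full(Node *n)		{ return run_instr(n, true, true); }
static int	run_stack_only(Node *n)	{ return run_instr(n, false, true); }
static int	run_timer_only(Node *n)	{ return run_instr(n, true, false); }
static int	run_rows_only(Node *n)	{ return run_instr(n, false, false); }

/* Inspect the options once and return the matching wrapper. */
static NodeFn
setup_wrapper(const Node *node)
{
	if (node->need_timer && node->need_stack)
		return run_full;
	else if (node->need_stack)
		return run_stack_only;
	else if (node->need_timer)
		return run_timer_only;
	else
		return run_rows_only;
}

/* Toy "plan node" that always returns one row. */
static int
produce_one(Node *node)
{
	(void) node;
	return 1;
}
```

A caller would set node.wrapped = setup_wrapper(&node) once and then call
node.wrapped(&node) per row, so the per-row path never re-checks need_timer or
need_stack. Whether this pays off in practice depends on the compiler actually
inlining the helper, which is presumably why the patch uses
pg_attribute_always_inline.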
[application/octet-stream] v14-0010-Add-test_session_buffer_usage-test-module.patch (30.0K, 11-v14-0010-Add-test_session_buffer_usage-test-module.patch)
From 9aa98bde9304d7963ada924f7290e39d9047eb35 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:41 -0800
Subject: [PATCH v14 10/10] Add test_session_buffer_usage test module
This module is intended for testing instrumentation-related logic as it
pertains to the top-level stack, which is maintained as a running total.
There is currently no in-core user that utilizes the top-level values in
this manner. The module is especially useful in abort situations, where it
helps ensure we don't lose activity due to incorrect handling of
unfinalized node stacks.
---
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
.../test_session_buffer_usage/Makefile | 23 ++
.../expected/test_session_buffer_usage.out | 342 ++++++++++++++++++
.../test_session_buffer_usage/meson.build | 33 ++
.../sql/test_session_buffer_usage.sql | 245 +++++++++++++
.../test_session_buffer_usage--1.0.sql | 31 ++
.../test_session_buffer_usage.c | 95 +++++
.../test_session_buffer_usage.control | 5 +
9 files changed, 776 insertions(+)
create mode 100644 src/test/modules/test_session_buffer_usage/Makefile
create mode 100644 src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
create mode 100644 src/test/modules/test_session_buffer_usage/meson.build
create mode 100644 src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 864b407abcf..c5ace162fe2 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -48,6 +48,7 @@ SUBDIRS = \
test_resowner \
test_rls_hooks \
test_saslprep \
+ test_session_buffer_usage \
test_shm_mq \
test_slru \
test_tidstore \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index e5acacd5083..802cc93d71a 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -49,6 +49,7 @@ subdir('test_regex')
subdir('test_resowner')
subdir('test_rls_hooks')
subdir('test_saslprep')
+subdir('test_session_buffer_usage')
subdir('test_shm_mq')
subdir('test_slru')
subdir('test_tidstore')
diff --git a/src/test/modules/test_session_buffer_usage/Makefile b/src/test/modules/test_session_buffer_usage/Makefile
new file mode 100644
index 00000000000..1252b222cb9
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_session_buffer_usage/Makefile
+
+MODULE_big = test_session_buffer_usage
+OBJS = \
+ $(WIN32RES) \
+ test_session_buffer_usage.o
+
+EXTENSION = test_session_buffer_usage
+DATA = test_session_buffer_usage--1.0.sql
+PGFILEDESC = "test_session_buffer_usage - show buffer usage statistics for the current session"
+
+REGRESS = test_session_buffer_usage
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_session_buffer_usage
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out b/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
new file mode 100644
index 00000000000..5f7d349871a
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
@@ -0,0 +1,342 @@
+LOAD 'test_session_buffer_usage';
+CREATE EXTENSION test_session_buffer_usage;
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM test_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+ ok
+----
+ t
+(1 row)
+
+-- Verify counters increase after buffer activity
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM test_session_buffer_usage();
+ blocks_increased
+------------------
+ t
+(1 row)
+
+DROP TABLE test_buf_activity;
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM test_session_buffer_usage(), dc_serial_result s;
+ leader_buffers_match
+----------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_dc_tab, dc_serial_result;
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+ERROR: division by zero
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM test_session_buffer_usage();
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+DROP TABLE exc_tab;
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+ERROR: duplicate key value violates unique constraint "unique_tab_a_key"
+DETAIL: Key (a)=(1) already exists.
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM test_session_buffer_usage();
+ unique_violation_buffers_visible
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE unique_tab;
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT subxact_exc_func();
+ subxact_exc_func
+------------------
+ caught
+(1 row)
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM test_session_buffer_usage();
+ subxact_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT cursor_exc_func();
+ cursor_exc_func
+-----------------------
+ caught after 250 rows
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM test_session_buffer_usage();
+ cursor_subxact_buffers_visible
+--------------------------------
+ t
+(1 row)
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM test_session_buffer_usage();
+-- Now trigger the same scan via a trigger that errors
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+ERROR: trigger error
+CONTEXT: PL/pgSQL function trig_err_func() line 4 at RAISE
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM test_session_buffer_usage(), trig_serial_result s;
+ trigger_abort_buffers_propagated
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CONTEXT: parallel worker
+RESET debug_parallel_query;
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM test_session_buffer_usage(), par_abort_serial_result s;
+ worker_abort_buffers_not_propagated
+-------------------------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_abort_tab, par_abort_serial_result;
+-- Cleanup
+DROP EXTENSION test_session_buffer_usage;
diff --git a/src/test/modules/test_session_buffer_usage/meson.build b/src/test/modules/test_session_buffer_usage/meson.build
new file mode 100644
index 00000000000..b96f67dc7fe
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+test_session_buffer_usage_sources = files(
+ 'test_session_buffer_usage.c',
+)
+
+if host_system == 'windows'
+ test_session_buffer_usage_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_session_buffer_usage',
+ '--FILEDESC', 'test_session_buffer_usage - show buffer usage statistics for the current session',])
+endif
+
+test_session_buffer_usage = shared_module('test_session_buffer_usage',
+ test_session_buffer_usage_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_session_buffer_usage
+
+test_install_data += files(
+ 'test_session_buffer_usage.control',
+ 'test_session_buffer_usage--1.0.sql',
+)
+
+tests += {
+ 'name': 'test_session_buffer_usage',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_session_buffer_usage',
+ ],
+ },
+}
diff --git a/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql b/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
new file mode 100644
index 00000000000..daf2159c4a6
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
@@ -0,0 +1,245 @@
+LOAD 'test_session_buffer_usage';
+CREATE EXTENSION test_session_buffer_usage;
+
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM test_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+
+-- Verify counters increase after buffer activity
+SELECT test_session_buffer_usage_reset();
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM test_session_buffer_usage();
+
+DROP TABLE test_buf_activity;
+
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+SELECT count(*) FROM par_dc_tab;
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM test_session_buffer_usage(), dc_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_dc_tab, dc_serial_result;
+
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+
+SELECT test_session_buffer_usage_reset();
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP TABLE exc_tab;
+
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+
+SELECT test_session_buffer_usage_reset();
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP TABLE unique_tab;
+
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT test_session_buffer_usage_reset();
+SELECT subxact_exc_func();
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT test_session_buffer_usage_reset();
+SELECT cursor_exc_func();
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM trig_work_tab;
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM test_session_buffer_usage();
+
+-- Now trigger the same scan via a trigger that errors
+SELECT test_session_buffer_usage_reset();
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM test_session_buffer_usage(), trig_serial_result s;
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT test_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+
+SELECT test_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+RESET debug_parallel_query;
+
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM test_session_buffer_usage(), par_abort_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_abort_tab, par_abort_serial_result;
+
+-- Cleanup
+DROP EXTENSION test_session_buffer_usage;
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
new file mode 100644
index 00000000000..e9833be470a
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
@@ -0,0 +1,31 @@
+/* src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_session_buffer_usage" to load this file. \quit
+
+CREATE FUNCTION test_session_buffer_usage(
+ OUT shared_blks_hit bigint,
+ OUT shared_blks_read bigint,
+ OUT shared_blks_dirtied bigint,
+ OUT shared_blks_written bigint,
+ OUT local_blks_hit bigint,
+ OUT local_blks_read bigint,
+ OUT local_blks_dirtied bigint,
+ OUT local_blks_written bigint,
+ OUT temp_blks_read bigint,
+ OUT temp_blks_written bigint,
+ OUT shared_blk_read_time double precision,
+ OUT shared_blk_write_time double precision,
+ OUT local_blk_read_time double precision,
+ OUT local_blk_write_time double precision,
+ OUT temp_blk_read_time double precision,
+ OUT temp_blk_write_time double precision
+)
+RETURNS record
+AS 'MODULE_PATHNAME', 'test_session_buffer_usage'
+LANGUAGE C PARALLEL RESTRICTED;
+
+CREATE FUNCTION test_session_buffer_usage_reset()
+RETURNS void
+AS 'MODULE_PATHNAME', 'test_session_buffer_usage_reset'
+LANGUAGE C PARALLEL RESTRICTED;
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
new file mode 100644
index 00000000000..50eb1a2ffe6
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_session_buffer_usage.c
+ * show buffer usage statistics for the current session
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/instrument.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "test_session_buffer_usage",
+ .version = PG_VERSION
+);
+
+#define NUM_BUFFER_USAGE_COLUMNS 16
+
+PG_FUNCTION_INFO_V1(test_session_buffer_usage);
+PG_FUNCTION_INFO_V1(test_session_buffer_usage_reset);
+
+#define HAVE_INSTR_STACK 1 /* Change to 0 when testing before stack
+ * change */
+
+/*
+ * SQL function: test_session_buffer_usage()
+ *
+ * Returns a single row with all BufferUsage counters accumulated since the
+ * start of the session. Excludes any usage not yet added to the top of the
+ * stack (e.g. if this gets called inside a statement that also had buffer
+ * activity).
+ */
+Datum
+test_session_buffer_usage(PG_FUNCTION_ARGS)
+{
+ TupleDesc tupdesc;
+ Datum values[NUM_BUFFER_USAGE_COLUMNS];
+ bool nulls[NUM_BUFFER_USAGE_COLUMNS];
+ BufferUsage *usage;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ memset(nulls, 0, sizeof(nulls));
+
+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif
+
+ values[0] = Int64GetDatum(usage->shared_blks_hit);
+ values[1] = Int64GetDatum(usage->shared_blks_read);
+ values[2] = Int64GetDatum(usage->shared_blks_dirtied);
+ values[3] = Int64GetDatum(usage->shared_blks_written);
+ values[4] = Int64GetDatum(usage->local_blks_hit);
+ values[5] = Int64GetDatum(usage->local_blks_read);
+ values[6] = Int64GetDatum(usage->local_blks_dirtied);
+ values[7] = Int64GetDatum(usage->local_blks_written);
+ values[8] = Int64GetDatum(usage->temp_blks_read);
+ values[9] = Int64GetDatum(usage->temp_blks_written);
+ values[10] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time));
+ values[11] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time));
+ values[12] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time));
+ values[13] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time));
+ values[14] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+ values[15] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * SQL function: test_session_buffer_usage_reset()
+ *
+ * Resets all BufferUsage counters on the top instrumentation stack to zero.
+ * Useful in tests to avoid the baseline/delta pattern.
+ */
+Datum
+test_session_buffer_usage_reset(PG_FUNCTION_ARGS)
+{
+#if HAVE_INSTR_STACK
+ memset(&instr_top.bufusage, 0, sizeof(BufferUsage));
+#else
+ memset(&pgBufferUsage, 0, sizeof(BufferUsage));
+#endif
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
new file mode 100644
index 00000000000..41cfb15a765
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
@@ -0,0 +1,5 @@
+# test_session_buffer_usage extension
+comment = 'show buffer usage statistics for the current session'
+default_version = '1.0'
+module_pathname = '$libdir/test_session_buffer_usage'
+relocatable = true
--
2.47.1
* Re: Stack-based tracking of per-node WAL/buffer usage
2025-08-31 23:57 Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-09-04 20:23 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-09-09 19:35 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 11:28 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 12:59 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-10-31 07:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-08 04:27 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-09 21:55 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-09 23:45 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-10 08:12 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-14 20:49 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-17 06:21 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-17 08:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-18 20:49 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-18 23:36 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-19 00:45 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-23 14:41 ` Re: Stack-based tracking of per-node WAL/buffer usage Heikki Linnakangas <[email protected]>
2026-03-23 19:07 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-23 20:03 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-24 06:03 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-25 10:47 ` Re: Stack-based tracking of per-node WAL/buffer usage Heikki Linnakangas <[email protected]>
2026-03-26 00:41 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-27 07:21 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-04 09:43 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-04 19:39 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2026-04-05 12:31 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-05 18:13 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2026-04-05 19:38 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
@ 2026-04-05 21:02 ` Andres Freund <[email protected]>
2026-04-05 23:12 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Andres Freund @ 2026-04-05 21:02 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; PostgreSQL Hackers <[email protected]>; Zsolt Parragi <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>
Hi,
On 2026-04-05 12:38:58 -0700, Lukas Fittl wrote:
> On Sun, Apr 5, 2026 at 11:13 AM Andres Freund <[email protected]> wrote:
> > Unfortunately I think 0001 on its own doesn't actually work correctly. I
> > luckily tried an EXPLAIN ANALYZE with triggers and noticed that the time is
> > reported as zeroes.
> >
> > The only reason I tried is because I misread the diff and thought you'd changed
> > the calls=%.3f to calls=%d, even though the old state is calls=%.0f...
> >
> >
> > The reason it doesn't work is that explain shows tginstr->instr.total, but
> > with the patch the trigger instrumentation just computes
> > tginstr->instr.{counter,firsttuple}.
>
> Argh, good catch. That's on me for not manually testing it when I
> factored it out.
>
> I've confirmed this works now, both with 0001 only, and with 0001+0002.
I made 'firings' an int64, rather than int. Could have made it unsigned,
but ExplainPropertyInteger accepts an int64...
Because the patch did change those lines anyway, I replaced
palloc0(sizeof(Instrumentation)) with palloc_object(), and
palloc0(n * sizeof(TriggerInstrumentation)) with palloc_array().
It also seemed silly to have an if around the assignments of need_*:
    if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
    {
        instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
        instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
        instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
        instr->async_mode = async_mode;
but that gets cleared up in 0002 anyway.
But it did lead me to notice a pre-existing bug: We only set async_mode in the
    if (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL)
branch.
It looks like that doesn't matter today, because all async_mode is used for is
    /*
     * In async mode, if the plan node hadn't emitted any tuples before,
     * this might be the first tuple
     */
    if (instr->async_mode && save_tuplecount < 1.0)
        instr->firsttuple = instr->counter;
and without INSTRUMENT_TIMER instr->counter would be zero anyway.
But I guess it's worth noting that in the commit message for 0002?
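The argument above can be checked with a small stand-alone model. This is only a sketch: ToyInstr and the toy_* functions are hypothetical stand-ins, not the PostgreSQL structs or functions, and the flag values are illustrative. It contrasts setting async_mode only inside the guarded branch with setting it unconditionally, and shows that with neither timer, buffer, nor WAL instrumentation requested, firsttuple ends up as zero either way because counter never advances.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Option bits named after the flags quoted above; values are illustrative. */
#define INSTRUMENT_TIMER   (1 << 0)
#define INSTRUMENT_BUFFERS (1 << 1)
#define INSTRUMENT_ROWS    (1 << 2)
#define INSTRUMENT_WAL     (1 << 3)

/* Hypothetical, cut-down stand-in for the instrumentation struct. */
typedef struct ToyInstr
{
	bool		need_timer;
	bool		need_bufusage;
	bool		need_walusage;
	bool		async_mode;
	double		counter;		/* accumulated time; stays 0 without a timer */
	double		firsttuple;		/* time until the first tuple */
} ToyInstr;

/* Old shape: async_mode is only assigned inside the guarded branch. */
static void
toy_init_guarded(ToyInstr *instr, int options, bool async_mode)
{
	memset(instr, 0, sizeof(*instr));
	if (options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
	{
		instr->need_bufusage = (options & INSTRUMENT_BUFFERS) != 0;
		instr->need_walusage = (options & INSTRUMENT_WAL) != 0;
		instr->need_timer = (options & INSTRUMENT_TIMER) != 0;
		instr->async_mode = async_mode;
	}
}

/* Alternative shape: every field is assigned unconditionally. */
static void
toy_init_unconditional(ToyInstr *instr, int options, bool async_mode)
{
	memset(instr, 0, sizeof(*instr));
	instr->need_bufusage = (options & INSTRUMENT_BUFFERS) != 0;
	instr->need_walusage = (options & INSTRUMENT_WAL) != 0;
	instr->need_timer = (options & INSTRUMENT_TIMER) != 0;
	instr->async_mode = async_mode;
}

/* Models the only consumer of async_mode, following the quoted snippet. */
static void
toy_note_tuple(ToyInstr *instr, double elapsed, double save_tuplecount)
{
	assert(elapsed >= 0.0);
	if (instr->need_timer)
		instr->counter += elapsed;
	if (instr->async_mode && save_tuplecount < 1.0)
		instr->firsttuple = instr->counter;
}
```

With only the row-count flag set, the two initializers disagree on async_mode but agree on every value that would be reported, which is why the difference stays invisible in this model.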
I felt a bit silly leaving the instr->need_* stuff in InstrAlloc(), when there
is InstrInit() directly afterwards that does the same thing, but then that
leads to removing the redundant memset etc, so I left it for 0002.
With that I pushed 0001.
Greetings,
Andres Freund
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Andres Freund @ 2026-04-05 23:12 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; PostgreSQL Hackers <[email protected]>; Zsolt Parragi <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>
Hi,
On 2026-04-05 17:02:28 -0400, Andres Freund wrote:
> With that I pushed 0001.
For 0002 I:
- fixed a few comments still referring to node in the generic Instr* functions
- added comment about the async_mode buglet to the commit message
- added an async_mode argument to InstrInitNode(), as its callsite already
needed to be touched, and it felt wrong that InstrAllocNode() could do
things that were not possible with InstrInitNode()
- Deduplicated the code between InstrStop() and InstrStopNode() by introducing
InstrStopCommon()
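The deduplication in the last item follows a familiar pattern, sketched below under loud assumptions: the thread does not show what InstrStopCommon() contains, so ToyInstr, ToyNodeInstr and the toy_* functions are hypothetical and only illustrate a generic stop routine and a node-specific one sharing a single helper.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical generic instrumentation state. */
typedef struct ToyInstr
{
	bool		need_timer;
	double		starttime;		/* time at which the current cycle started */
	double		counter;		/* accumulated run time */
} ToyInstr;

/* Hypothetical per-node state that embeds the generic part. */
typedef struct ToyNodeInstr
{
	ToyInstr	instr;
	double		tuplecount;		/* tuples emitted in the current cycle */
} ToyNodeInstr;

/* Shared helper: the timing logic both stop functions previously repeated. */
static inline void
toy_stop_common(ToyInstr *instr, double now)
{
	if (instr->need_timer)
	{
		assert(now >= instr->starttime);
		instr->counter += now - instr->starttime;
		instr->starttime = 0.0;
	}
}

/* Generic variant: nothing beyond the shared part. */
static void
toy_stop(ToyInstr *instr, double now)
{
	toy_stop_common(instr, now);
}

/* Node variant: shared part plus node-only bookkeeping. */
static void
toy_stop_node(ToyNodeInstr *node, double now, double ntuples)
{
	toy_stop_common(&node->instr, now);
	node->tuplecount += ntuples;
}
```

Keeping the helper static inline leaves the compiler free to fold it into both callers, so in a model like this the deduplication need not cost an extra call on the hot path.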
After those (and some testing) I pushed this.
Greetings,
Andres Freund
* Re: Stack-based tracking of per-node WAL/buffer usage
From: Lukas Fittl @ 2026-04-06 09:58 UTC (permalink / raw)
To: Andres Freund <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; PostgreSQL Hackers <[email protected]>; Zsolt Parragi <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>
On Sun, Apr 5, 2026 at 4:12 PM Andres Freund <[email protected]> wrote:
>
> Hi,
>
> On 2026-04-05 17:02:28 -0400, Andres Freund wrote:
> > With that I pushed 0001.
>
> For 0002 I:
> - fixed a few comments still referring to node in the generic Instr* functions
> - added comment about the async_mode buglet to the commit message
> - added an async_mode argument to InstrInitNode(), as its callsite already
> needed to be touched, and it felt wrong that InstrAllocNode() could do
> things that were not possible with InstrInitNode()
> - Deduplicated the code between InstrStop() and InstrStopNode() by introducing
> InstrStopCommon()
>
> After those (and some testing) I pushed this.
>
Thanks for pushing these two! And appreciate the refinements, they
make sense to me.
Attached v15. Quick summary:
0001 converts direct users of pgBufferUsage/pgWalUsage to the new
general purpose Instrumentation just pushed.
0002 introduces the macros needed for the stack-based instrumentation,
same as before.
0003 adds additional test coverage for buffer usage, same as before.
0004 is new, and adds queryDesc->totaltime_options for extensions to
request a certain level of totaltime measurement (this solves the
problem Andres noted in a review comment)
0005 is the stack-based instrumentation commit, now smaller and more
digestible, with the same performance benefits.
-- if we get up to here, we get the main benefit --
0006 is the parallel instrumentation cleanup. I don't think we need
this right now unless the EXPLAIN (IO) work changes course.
0007 is the same ExecProcNodeInstr change as before (this one we could
simplify by simply moving the function, getting about half the
possible speedup)
0008 is the table-specific buffer measurement for index scans (for
current master)
0009 is the test module for top level instrumentation data.
I've also attached an alternate for 0008, that works on top of the
index prefetch work (v23) - the change actually gets smaller because
heap fetches are better encapsulated then.
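As a rough illustration of how the table-specific buffer measurement in 0008 relates to the stack-based tracking in 0005, here is a toy model. Everything in it is an assumption made for illustration: ToyStackEntry and the toy_* functions are hypothetical and much simpler than the patch's actual stack entries and helpers. Buffer hits are charged to whichever entry is current, a separate entry is pushed only around table fetches, and the table entry is folded into the node entry once at the end, so the node total includes the table activity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stack entry with a single counter. */
typedef struct ToyStackEntry
{
	int64_t		shared_blks_hit;
	struct ToyStackEntry *prev;	/* entry to restore on pop */
} ToyStackEntry;

static ToyStackEntry toy_top;						/* session-level entry */
static ToyStackEntry *toy_current = &toy_top;		/* where hits are charged */

static void
toy_push(ToyStackEntry *entry)
{
	entry->prev = toy_current;
	toy_current = entry;
}

static void
toy_pop(ToyStackEntry *entry)
{
	assert(toy_current == entry);	/* pushes and pops must nest */
	toy_current = entry->prev;
}

/* A buffer hit is charged to the current entry only. */
static void
toy_buffer_hit(void)
{
	toy_current->shared_blks_hit++;
}

/* Fold a finished child's counts into its parent. */
static void
toy_finalize_child(ToyStackEntry *child, ToyStackEntry *parent)
{
	parent->shared_blks_hit += child->shared_blks_hit;
}

/* Index scan model: index pages hit the node entry, table pages the table entry. */
static void
toy_index_scan(ToyStackEntry *node, ToyStackEntry *table,
			   int index_pages, int table_pages)
{
	toy_push(node);
	for (int i = 0; i < index_pages; i++)
		toy_buffer_hit();
	for (int i = 0; i < table_pages; i++)
	{
		toy_push(table);
		toy_buffer_hit();
		toy_pop(table);
	}
	toy_pop(node);
}
```

Running toy_index_scan() with 3 index pages and 11 table pages and then calling toy_finalize_child(&table, &node) leaves 11 hits in the table entry and 14 in the node entry, the same shape as the "Table Buffers: shared hit=11" and "Buffers: shared hit=14" pair in the updated perform.sgml example.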
Thanks,
Lukas
--
Lukas Fittl
Attachments:
[application/octet-stream] nocfbot-0008-post-index-prefetch-Index-scans-Show-table-buffer-accesses-separately.patch (20.4K, 2-nocfbot-0008-post-index-prefetch-Index-scans-Show-table-buffer-accesses-separately.patch)
From ecba8752d060f19f43ed3297af5b8314e26a7767 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:19 -0800
Subject: [PATCH v0 1/2] Index scans: Show table buffer accesses separately in
EXPLAIN ANALYZE
This sets up a separate instrumentation stack that is used whilst an
Index Scan or Index Only Scan does scanning on the table, for example due
to additional data being needed.
EXPLAIN ANALYZE will now show "Table Buffers" that represent such activity.
The activity is also included in regular "Buffers" together with index
activity and that of any child nodes.
Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]> (in an earlier version)
Reviewed-by: Tomas Vondra <[email protected]> (in an earlier version)
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
doc/src/sgml/perform.sgml | 13 ++++--
doc/src/sgml/ref/explain.sgml | 1 +
src/backend/access/heap/heapam_indexscan.c | 16 ++++++--
src/backend/commands/explain.c | 48 ++++++++++++++++++----
src/backend/executor/execProcnode.c | 46 +++++++++++++++++++++
src/backend/executor/nodeBitmapIndexscan.c | 2 +-
src/backend/executor/nodeIndexonlyscan.c | 24 ++++++++++-
src/backend/executor/nodeIndexscan.c | 24 ++++++++++-
src/include/executor/instrument_node.h | 5 +++
9 files changed, 162 insertions(+), 17 deletions(-)
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 604e8578a8d..d28f4f22535 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -734,6 +734,7 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10)
Index Cond: (unique2 = t1.unique2)
Index Searches: 10
+ Table Buffers: shared hit=10
Buffers: shared hit=24 read=6
Planning:
Buffers: shared hit=15 dirtied=9
@@ -1005,7 +1006,8 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
Index Cond: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Index Recheck: 1
Index Searches: 1
- Buffers: shared hit=1
+ Table Buffers: shared hit=1
+ Buffers: shared hit=2
Planning Time: 0.039 ms
Execution Time: 0.098 ms
</screen>
@@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
then rejected by a recheck of the index condition. This happens because a
GiST index is <quote>lossy</quote> for polygon containment tests: it actually
returns the rows with polygons that overlap the target, and then we have
- to do the exact containment test on those rows.
+ to do the exact containment test on those rows. The <literal>Table Buffers</literal>
+ counts indicate how many operations were performed on the table instead of
+ the index. This number is included in the <literal>Buffers</literal> counts.
</para>
<para>
@@ -1203,13 +1207,14 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
QUERY PLAN
-------------------------------------------------------------------&zwsp;------------------------------------------------------------
Limit (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2.00 loops=1)
- Buffers: shared hit=16
+ Buffers: shared hit=14
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2.00 loops=1)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Index Searches: 1
- Buffers: shared hit=16
+ Table Buffers: shared hit=11
+ Buffers: shared hit=14
Planning Time: 0.077 ms
Execution Time: 0.086 ms
</screen>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5b8b521802e..71070736acb 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -509,6 +509,7 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
-> Index Scan using test_pkey on test (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99.00 loops=1)
Index Cond: ((id > 100) AND (id < 200))
Index Searches: 1
+ Table Buffers: shared hit=1
Buffers: shared hit=4
Planning Time: 0.244 ms
Execution Time: 0.073 ms
diff --git a/src/backend/access/heap/heapam_indexscan.c b/src/backend/access/heap/heapam_indexscan.c
index fd70ad5bc2c..702419cbabf 100644
--- a/src/backend/access/heap/heapam_indexscan.c
+++ b/src/backend/access/heap/heapam_indexscan.c
@@ -572,11 +572,14 @@ heapam_index_fetch_tuple(Relation rel,
static pg_attribute_always_inline bool
heapam_index_fetch_heap_item(IndexScanDesc scan, IndexFetchHeapData *hscan,
TupleTableSlot *slot, bool *heap_continue,
- bool amgetbatch)
+ bool amgetbatch, Instrumentation *table_instr)
{
bool all_dead = false;
bool found;
+ if (table_instr)
+ InstrPushStack(table_instr);
+
found = heapam_index_fetch_tuple(scan->heapRelation, hscan,
&scan->xs_heaptid,
scan->xs_snapshot, slot,
@@ -607,6 +610,9 @@ heapam_index_fetch_heap_item(IndexScanDesc scan, IndexFetchHeapData *hscan,
}
}
+ if (table_instr)
+ InstrPopStack(table_instr);
+
return found;
}
@@ -1390,6 +1396,10 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
BlockNumber last_visited_block = InvalidBlockNumber;
uint8 n_visited_pages = 0;
ItemPointer tid = NULL;
+ Instrumentation *table_instr = NULL;
+
+ if (scan->instrument && scan->instrument->table_instr.need_stack)
+ table_instr = &scan->instrument->table_instr;
for (;;)
{
@@ -1434,7 +1444,7 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
scan->instrument->ntabletuplefetches++;
if (!heapam_index_fetch_heap_item(scan, hscan, slot,
- heap_continue, amgetbatch))
+ heap_continue, amgetbatch, table_instr))
{
/*
* No visible tuple. If caller set a visited-pages limit
@@ -1494,7 +1504,7 @@ heapam_index_getnext_slot(IndexScanDesc scan, ScanDirection direction,
* next TID from the index.
*/
if (heapam_index_fetch_heap_item(scan, hscan, slot, heap_continue,
- amgetbatch))
+ amgetbatch, table_instr))
return true;
}
}
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ae3258b3f5c..647be5d1286 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -144,7 +144,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -611,7 +611,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
- show_buffer_usage(es, bufusage);
+ show_buffer_usage(es, bufusage, NULL);
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -1028,7 +1028,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
es->indent--;
}
}
@@ -1042,7 +1042,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -2287,7 +2287,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->instr.bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage, NULL);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
@@ -2306,7 +2306,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->instr.bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage, NULL);
if (es->wal)
show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
@@ -3862,6 +3862,7 @@ show_indexscan_info(PlanState *planstate, ExplainState *es)
{
Plan *plan = planstate->plan;
SharedIndexScanInstrumentation *SharedInfo = NULL;
+ Instrumentation *table_instr = NULL;
uint64 nsearches = 0,
ntabletuplefetches = 0;
@@ -3877,6 +3878,7 @@ show_indexscan_info(PlanState *planstate, ExplainState *es)
nsearches = indexstate->iss_Instrument->nsearches;
SharedInfo = indexstate->iss_SharedInfo;
+ table_instr = &indexstate->iss_Instrument->table_instr;
break;
}
case T_IndexOnlyScan:
@@ -3886,6 +3888,7 @@ show_indexscan_info(PlanState *planstate, ExplainState *es)
nsearches = indexstate->ioss_Instrument->nsearches;
ntabletuplefetches = indexstate->ioss_Instrument->ntabletuplefetches;
SharedInfo = indexstate->ioss_SharedInfo;
+ table_instr = &indexstate->ioss_Instrument->table_instr;
break;
}
case T_BitmapIndexScan:
@@ -3894,6 +3897,7 @@ show_indexscan_info(PlanState *planstate, ExplainState *es)
nsearches = indexstate->biss_Instrument->nsearches;
SharedInfo = indexstate->biss_SharedInfo;
+ table_instr = &indexstate->biss_Instrument->table_instr;
break;
}
default:
@@ -3916,6 +3920,9 @@ show_indexscan_info(PlanState *planstate, ExplainState *es)
ExplainPropertyUInteger("Heap Fetches", NULL, ntabletuplefetches, es);
ExplainPropertyUInteger("Index Searches", NULL, nsearches, es);
+
+ if (es->buffers && table_instr)
+ show_buffer_usage(es, &table_instr->bufusage, "Table");
}
/*
@@ -4112,7 +4119,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
* Show buffer usage details. This better be sync with peek_buffer_usage.
*/
static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
{
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -4137,6 +4144,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "Buffers:");
if (has_shared)
@@ -4192,6 +4201,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared_timing || has_local_timing || has_temp_timing)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "I/O Timings:");
if (has_shared_timing)
@@ -4233,6 +4244,14 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
else
{
+ char *buffers_title = NULL;
+
+ if (title)
+ {
+ buffers_title = psprintf("%s Buffers", title);
+ ExplainOpenGroup(buffers_title, buffers_title, true, es);
+ }
+
ExplainPropertyInteger("Shared Hit Blocks", NULL,
usage->shared_blks_hit, es);
ExplainPropertyInteger("Shared Read Blocks", NULL,
@@ -4253,8 +4272,20 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
usage->temp_blks_read, es);
ExplainPropertyInteger("Temp Written Blocks", NULL,
usage->temp_blks_written, es);
+
+ if (buffers_title)
+ ExplainCloseGroup(buffers_title, buffers_title, true, es);
+
if (track_io_timing)
{
+ char *timings_title = NULL;
+
+ if (title)
+ {
+ timings_title = psprintf("%s I/O Timings", title);
+ ExplainOpenGroup(timings_title, timings_title, true, es);
+ }
+
ExplainPropertyFloat("Shared I/O Read Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
3, es);
@@ -4273,6 +4304,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
ExplainPropertyFloat("Temp I/O Write Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
3, es);
+
+ if (timings_title)
+ ExplainCloseGroup(timings_title, timings_title, true, es);
}
}
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index ac400670fea..28f1f666a3b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -847,6 +847,20 @@ ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
&node->instrument->instr);
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ InstrFinalizeChild(&iss->iss_Instrument->table_instr, &node->instrument->instr);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ InstrFinalizeChild(&ioss->ioss_Instrument->table_instr, &node->instrument->instr);
+ }
+
InstrFinalizeChild(&node->instrument->instr, parent);
return false;
@@ -892,6 +906,38 @@ ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context)
num_workers = node->worker_instrument->num_workers;
+ /*
+ * Fold per-worker IndexScan/IndexOnlyScan table buffer stats into the
+ * per-worker node stats, matching what ExecFinalizeNodeInstrumentation
+ * does for the leader.
+ */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ if (iss->iss_SharedInfo)
+ {
+ int nworkers = Min(num_workers, iss->iss_SharedInfo->num_workers);
+
+ for (int n = 0; n < nworkers; n++)
+ InstrAccumStack(&node->worker_instrument->instrument[n].instr,
+ &iss->iss_SharedInfo->winstrument[n].table_instr);
+ }
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ if (ioss->ioss_SharedInfo)
+ {
+ int nworkers = Min(num_workers, ioss->ioss_SharedInfo->num_workers);
+
+ for (int n = 0; n < nworkers; n++)
+ InstrAccumStack(&node->worker_instrument->instrument[n].instr,
+ &ioss->ioss_SharedInfo->winstrument[n].table_instr);
+ }
+ }
+
/* Accumulate this node's per-worker stats to parent's per-worker stats */
if (parent && parent->worker_instrument)
{
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index a9a3d2fb149..ff802b86446 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -277,7 +277,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of bitmap index scans if requested */
if (estate->es_instrument)
- indexstate->biss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->biss_Instrument = MemoryContextAllocZero(estate->es_query_instr->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index b6b9dbd1075..9ff77a25a95 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -347,6 +347,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
*/
winstrument->nsearches += node->ioss_Instrument->nsearches;
winstrument->ntabletuplefetches += node->ioss_Instrument->ntabletuplefetches;
+ InstrAccumStack(&winstrument->table_instr, &node->ioss_Instrument->table_instr);
}
/*
@@ -521,7 +522,21 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
/* Set up instrumentation of index-only scans if requested */
if (estate->es_instrument)
- indexstate->ioss_Instrument = palloc0_object(IndexScanInstrumentation);
+ {
+ indexstate->ioss_Instrument = MemoryContextAllocZero(estate->es_query_instr->instr_cxt, sizeof(IndexScanInstrumentation));
+
+ /*
+ * Track table and index access separately. We intentionally don't
+ * collect timing (even if enabled), since we don't need it, and
+ * IndexOnlyNext calls InstrPushStack / InstrPopStack (instead of the
+ * full InstrNode*) to reduce overhead.
+ */
+ if ((estate->es_instrument & INSTRUMENT_BUFFERS) != 0)
+ {
+ InstrInitOptions(&indexstate->ioss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ InstrQueryRememberChild(estate->es_query_instr, &indexstate->ioss_Instrument->table_instr);
+ }
+ }
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -811,4 +826,11 @@ ExecIndexOnlyScanRetrieveInstrumentation(IndexOnlyScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->ioss_SharedInfo = palloc(size);
memcpy(node->ioss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->ioss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->ioss_Instrument->table_instr,
+ &node->ioss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 2ac854da468..d3ae4d016c4 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -821,6 +821,7 @@ ExecEndIndexScan(IndexScanState *node)
*/
winstrument->nsearches += node->iss_Instrument->nsearches;
Assert(node->iss_Instrument->ntabletuplefetches == 0);
+ InstrAccumStack(&winstrument->table_instr, &node->iss_Instrument->table_instr);
}
/*
@@ -983,7 +984,21 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of index scans if requested */
if (estate->es_instrument)
- indexstate->iss_Instrument = palloc0_object(IndexScanInstrumentation);
+ {
+ indexstate->iss_Instrument = MemoryContextAllocZero(estate->es_query_instr->instr_cxt, sizeof(IndexScanInstrumentation));
+
+ /*
+ * Track table and index access separately. We intentionally don't
+ * collect timing (even if enabled), since we don't need it, and
+ * IndexNext / IndexNextWithReorder call InstrPushStack /
+ * InstrPopStack (instead of the full InstrNode*) to reduce overhead.
+ */
+ if ((estate->es_instrument & INSTRUMENT_BUFFERS) != 0)
+ {
+ InstrInitOptions(&indexstate->iss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ InstrQueryRememberChild(estate->es_query_instr, &indexstate->iss_Instrument->table_instr);
+ }
+ }
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -1839,4 +1854,11 @@ ExecIndexScanRetrieveInstrumentation(IndexScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->iss_SharedInfo = palloc(size);
memcpy(node->iss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->iss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->iss_Instrument->table_instr,
+ &node->iss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 78f810aabaf..bf0c4416dae 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -18,6 +18,8 @@
#ifndef INSTRUMENT_NODE_H
#define INSTRUMENT_NODE_H
+#include "executor/instrument.h"
+
/* ---------------------
* Instrumentation information for aggregate function execution
@@ -51,6 +53,9 @@ typedef struct IndexScanInstrumentation
/* Table tuples fetched count (incremented during index-only scans) */
uint64 ntabletuplefetches;
+
+ /* Instrumentation utilized for tracking buffer usage during table access */
+ Instrumentation table_instr;
} IndexScanInstrumentation;
/*
--
2.47.1
[application/octet-stream] v15-0003-instrumentation-Add-additional-regression-tests-.patch (22.5K, 3-v15-0003-instrumentation-Add-additional-regression-tests-.patch)
From 97681481fb96a5907830d405ed5c2564baddb872 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 5 Apr 2026 03:48:32 -0700
Subject: [PATCH v15 3/9] instrumentation: Add additional regression tests
covering buffer usage
This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.
Testing this is challenging because there can be different sources of
buffer activity, so we rely on temporary tables where we can to prove
that activity was captured and not lost. This supports a future commit
that will rework some of the instrumentation logic, a change that could
cause areas covered by these tests to fail.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
.../pg_stat_statements/expected/utility.out | 70 +++++++
contrib/pg_stat_statements/expected/wal.out | 48 +++++
contrib/pg_stat_statements/sql/utility.sql | 56 ++++++
contrib/pg_stat_statements/sql/wal.sql | 33 +++
src/test/regress/expected/explain.out | 188 ++++++++++++++++++
src/test/regress/sql/explain.sql | 188 ++++++++++++++++++
6 files changed, 583 insertions(+)
diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index e4d6564ea5b..cba487f6be5 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -289,6 +289,76 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
1 | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
(3 rows)
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT run_explain_buffers_test();
+ run_explain_buffers_test
+--------------------------
+
+(1 row)
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+ query | has_buffer_stats
+-----------------------------------+------------------
+ SELECT run_explain_buffers_test() | t
+(1 row)
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+CALL pgss_call_rollback_proc();
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------------+-------+---------------+---------------------+-----------------------
+ CALL pgss_call_rollback_proc() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/expected/wal.out b/contrib/pg_stat_statements/expected/wal.out
index 977e382d848..611213daef6 100644
--- a/contrib/pg_stat_statements/expected/wal.out
+++ b/contrib/pg_stat_statements/expected/wal.out
@@ -28,3 +28,51 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
t
(1 row)
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT pgss_error_func();
+ pgss_error_func
+-----------------
+
+(1 row)
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------+-------+---------------+---------------------+-----------------------
+ SELECT pgss_error_func() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index dd97203c210..7540e49c73c 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -152,6 +152,62 @@ EXPLAIN (costs off) SELECT a FROM generate_series(1,10) AS tab(a) WHERE a = 7;
SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+SELECT run_explain_buffers_test();
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+CALL pgss_call_rollback_proc();
+
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
+
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/sql/wal.sql b/contrib/pg_stat_statements/sql/wal.sql
index 1dc1552a81e..467e321b206 100644
--- a/contrib/pg_stat_statements/sql/wal.sql
+++ b/contrib/pg_stat_statements/sql/wal.sql
@@ -18,3 +18,36 @@ wal_records > 0 as wal_records_generated,
wal_records >= rows as wal_records_ge_rows
FROM pg_stat_statements ORDER BY query COLLATE "C";
SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT pgss_error_func();
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..5ff96491b0a 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,191 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+ exception_buffers_nested_visible
+----------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly reported.
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_cursor_explain_buffers();
+ noscroll_ok | scroll_ok
+-------------+-----------
+ t | t
+(1 row)
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+-- Test trigger instrumentation.
+CREATE TEMP TABLE trig_test_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int);
+INSERT INTO trig_work_tab VALUES (1);
+CREATE FUNCTION trig_test_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM * FROM trig_work_tab;
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_test_trig
+ BEFORE INSERT ON trig_test_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_test_func();
+CREATE FUNCTION check_trigger_explain_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ trig json;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ INSERT INTO trig_test_tab VALUES (1)' INTO plan_json;
+ trig := plan_json->0->'Triggers'->0;
+ RETURN COALESCE((trig->>'Calls')::int, 0) > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_trigger_explain_buffers() AS trigger_buffers_visible;
+ trigger_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_trigger_explain_buffers;
+DROP TRIGGER trig_test_trig ON trig_test_tab;
+DROP FUNCTION trig_test_func;
+DROP TABLE trig_test_tab;
+DROP TABLE trig_work_tab;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..9f0e8524497 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,191 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly reported.
+
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_cursor_explain_buffers();
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+
+-- Test trigger instrumentation.
+CREATE TEMP TABLE trig_test_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int);
+INSERT INTO trig_work_tab VALUES (1);
+
+CREATE FUNCTION trig_test_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM * FROM trig_work_tab;
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trig_test_trig
+ BEFORE INSERT ON trig_test_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_test_func();
+
+CREATE FUNCTION check_trigger_explain_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ trig json;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ INSERT INTO trig_test_tab VALUES (1)' INTO plan_json;
+ trig := plan_json->0->'Triggers'->0;
+ RETURN COALESCE((trig->>'Calls')::int, 0) > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_trigger_explain_buffers() AS trigger_buffers_visible;
+
+DROP FUNCTION check_trigger_explain_buffers;
+DROP TRIGGER trig_test_trig ON trig_test_tab;
+DROP FUNCTION trig_test_func;
+DROP TABLE trig_test_tab;
+DROP TABLE trig_work_tab;
--
2.47.1
[application/octet-stream] v15-0004-instrumentation-Allocate-queryDesc-totaltime-in-.patch (6.4K, 4-v15-0004-instrumentation-Allocate-queryDesc-totaltime-in-.patch)
From d444dcdd48f5712cfcae7e7f4cc8055f1c33f902 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v15 4/9] instrumentation: Allocate queryDesc->totaltime in
ExecutorStart if needed
This introduces a new field, queryDesc->totaltime_options, that extensions
can use to indicate whether they need queryDesc->totaltime populated,
and with which instrumentation options. Extensions should take care to
only add options they need, instead of replacing the options of others.
This replaces the practice of extensions allocating queryDesc->totaltime
themselves, which required them to always use INSTRUMENT_ALL for the
options argument. If they did not, another extension could silently
be affected. It also unnecessarily made extension hooks worry
about allocating in the per-query memory context.
Adjust pg_stat_statements and auto_explain to match, and lower the
requested instrumentation level for auto_explain to INSTRUMENT_TIMER,
since the only summary instrumentation it needs is runtime.
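To illustrate the "add, don't replace" rule for totaltime_options, here is a minimal standalone sketch. It is not part of the patch: the SketchQueryDesc struct, the sketch_* hook functions and the SKETCH_INSTRUMENT_* flag values are all hypothetical stand-ins, and the real definitions in PostgreSQL's headers may differ. It only shows why each hook should OR in its own bits.

```c
/*
 * Stand-in flag values for illustration only (assumption); the real
 * INSTRUMENT_* definitions live in PostgreSQL's executor/instrument.h.
 */
enum
{
	SKETCH_INSTRUMENT_TIMER = 1 << 0,
	SKETCH_INSTRUMENT_BUFFERS = 1 << 1,
	SKETCH_INSTRUMENT_WAL = 1 << 2
};

/* Hypothetical, heavily reduced stand-in for QueryDesc. */
typedef struct SketchQueryDesc
{
	int			totaltime_options;
} SketchQueryDesc;

/* An extension that only needs runtime: add its bit, keep all others. */
void
sketch_timer_only_start(SketchQueryDesc *queryDesc)
{
	queryDesc->totaltime_options |= SKETCH_INSTRUMENT_TIMER;
}

/* An extension that needs buffer and WAL counters as well. */
void
sketch_buffers_wal_start(SketchQueryDesc *queryDesc)
{
	queryDesc->totaltime_options |=
		SKETCH_INSTRUMENT_BUFFERS | SKETCH_INSTRUMENT_WAL;
}

/*
 * Anti-pattern: plain assignment discards whatever hooks that ran
 * earlier in the chain had already requested.
 */
void
sketch_overwriting_start(SketchQueryDesc *queryDesc)
{
	queryDesc->totaltime_options = SKETCH_INSTRUMENT_TIMER;
}
```

With the two OR-ing hooks, the order in which they run does not matter and the field ends up holding the union of all requests. If the overwriting hook runs last, the buffer/WAL request from the other extension is silently dropped.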
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
contrib/auto_explain/auto_explain.c | 20 +++------------
.../pg_stat_statements/pg_stat_statements.c | 25 ++++++-------------
src/backend/executor/execMain.c | 9 +++++++
src/backend/tcop/pquery.c | 1 +
src/include/executor/execdesc.h | 4 ++-
5 files changed, 23 insertions(+), 36 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 39bf2543b70..2f882026b50 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -284,6 +284,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
if (auto_explain_enabled())
{
+ /* We're always interested in runtime */
+ queryDesc->totaltime_options |= INSTRUMENT_TIMER;
+
/* Enable per-node instrumentation iff log_analyze is required. */
if (auto_explain_log_analyze && (eflags & EXEC_FLAG_EXPLAIN_ONLY) == 0)
{
@@ -302,23 +305,6 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
prev_ExecutorStart(queryDesc, eflags);
else
standard_ExecutorStart(queryDesc, eflags);
-
- if (auto_explain_enabled())
- {
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
- if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
- }
}
/*
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index b6863479e9f..346adb5599f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -983,11 +983,6 @@ pgss_planner(Query *parse,
static void
pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
{
- if (prev_ExecutorStart)
- prev_ExecutorStart(queryDesc, eflags);
- else
- standard_ExecutorStart(queryDesc, eflags);
-
/*
* If query has queryId zero, don't track it. This prevents double
* counting of optimizable statements that are directly contained in
@@ -995,20 +990,14 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
*/
if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
- if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ /* Request all summary instrumentation, i.e. timing, buffers and WAL */
+ queryDesc->totaltime_options |= INSTRUMENT_ALL;
}
+
+ if (prev_ExecutorStart)
+ prev_ExecutorStart(queryDesc, eflags);
+ else
+ standard_ExecutorStart(queryDesc, eflags);
}
/*
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b0f636bf8b6..7d74f6da402 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -250,6 +250,15 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
estate->es_instrument = queryDesc->instrument_options;
estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
+ /*
+ * Set up query-level instrumentation if extensions have requested it via
+ * totaltime_options. Ensure an extension has not allocated totaltime
+ * itself.
+ */
+ Assert(queryDesc->totaltime == NULL);
+ if (queryDesc->totaltime_options)
+ queryDesc->totaltime = InstrQueryAlloc(queryDesc->totaltime_options);
+
/*
* Set up an AFTER-trigger statement context, unless told not to, or
* unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index d8fc75d0bb9..e27f26ecd83 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -86,6 +86,7 @@ CreateQueryDesc(PlannedStmt *plannedstmt,
qd->params = params; /* parameter values passed into query */
qd->queryEnv = queryEnv;
qd->instrument_options = instrument_options; /* instrumentation wanted? */
+ qd->totaltime_options = 0;
/* null these fields until set by ExecutorStart */
qd->tupDesc = NULL;
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index d3a57242844..0d76a1c173e 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -42,6 +42,8 @@ typedef struct QueryDesc
ParamListInfo params; /* param values being passed in */
QueryEnvironment *queryEnv; /* query environment passed in */
int instrument_options; /* OR of InstrumentOption flags */
+ int totaltime_options; /* OR of InstrumentOption flags for
+ * totaltime */
/* These fields are set by ExecutorStart */
TupleDesc tupDesc; /* descriptor for result tuples */
@@ -51,7 +53,7 @@ typedef struct QueryDesc
/* This field is set by ExecutePlan */
bool already_executed; /* true if previously executed */
- /* This is always set NULL by the core system, but plugins can change it */
+ /* This field is allocated by ExecutorStart if needed */
struct Instrumentation *totaltime; /* total time spent in ExecutorRun */
} QueryDesc;
--
2.47.1
[application/octet-stream] v15-0002-instrumentation-Replace-direct-changes-of-pgBuff.patch (9.0K, 5-v15-0002-instrumentation-Replace-direct-changes-of-pgBuff.patch)
From 38d92b532d00d078b0b8333b17411585a81b8289 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Thu, 26 Mar 2026 23:31:04 -0700
Subject: [PATCH v15 2/9] instrumentation: Replace direct changes of
pgBufferUsage/pgWalUsage with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
---
src/backend/access/transam/xlog.c | 10 +++++-----
src/backend/storage/buffer/bufmgr.c | 20 ++++++++++----------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
6 files changed, 45 insertions(+), 26 deletions(-)
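Note for readers: the value of routing every counter update through a macro is that the destination can later change without touching any call site. The sketch below illustrates that idea under stated assumptions. The Demo*/demo_* names are hypothetical, and the pointer indirection (demo_target) is my own illustration of one possible follow-up; the macros in this patch write directly to pgBufferUsage / pgWalUsage.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down counterpart of a usage counter struct. */
typedef struct DemoBufferUsage
{
	int64_t		shared_blks_hit;
	int64_t		shared_blks_read;
} DemoBufferUsage;

/* The global counter, plus an (assumed) pointer to the current target. */
static DemoBufferUsage demoBufferUsage;
static DemoBufferUsage *demo_target = &demoBufferUsage;

/*
 * Call sites only ever name the field.  The do { } while (0) wrapper keeps
 * each macro a single statement, so it is safe in unbraced if/else arms.
 */
#define DEMO_BUFUSAGE_INCR(fld) do { \
		demo_target->fld++; \
	} while (0)
#define DEMO_BUFUSAGE_ADD(fld,val) do { \
		demo_target->fld += (val); \
	} while (0)

/* A call site in the style of the converted bufmgr.c code. */
static void
demo_read_blocks(int nblocks, int cached)
{
	assert(cached >= 0 && nblocks >= cached);

	for (int i = 0; i < cached; i++)
		DEMO_BUFUSAGE_INCR(shared_blks_hit);

	if (nblocks > cached)
		DEMO_BUFUSAGE_ADD(shared_blks_read, nblocks - cached);
	else
		DEMO_BUFUSAGE_ADD(shared_blks_read, 0);
}
```

Pointing demo_target at a different DemoBufferUsage redirects all subsequent updates made by demo_read_blocks() without editing that function, which is the kind of refactoring the encapsulation is meant to make cheap.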
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b82af9a85c0..470110f6774 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1115,10 +1115,10 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
- pgWalUsage.wal_fpi_bytes += fpi_bytes;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
+ INSTR_WALUSAGE_ADD(wal_fpi_bytes, fpi_bytes);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2097,7 +2097,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = InvalidXLogRecPtr;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cc0b0bdd92..3e1c39160db 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -840,7 +840,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -861,7 +861,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
if (BufferTagsEqual(&tag, &bufHdr->tag))
{
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
UnpinBuffer(bufHdr);
@@ -1684,9 +1684,9 @@ TrackBufferHit(IOObject io_object, IOContext io_context,
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
@@ -2148,9 +2148,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it. Otherwise
@@ -3043,7 +3043,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -3189,7 +3189,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4601,7 +4601,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state.
@@ -5796,7 +5796,7 @@ MarkSharedBufferDirtyHint(Buffer buffer, BufferDesc *bufHdr, uint64 lockstate,
UnlockBufHdr(bufHdr);
}
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 396da84b25c..851b99056d5 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -218,7 +218,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -479,7 +479,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -510,7 +510,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u64(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c4afe4d368a..8b501dfcadd 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -475,13 +475,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -549,13 +549,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 2be26e92283..e3829d7fe7c 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index cc9fbb0e2f0..5261356dba6 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -154,4 +154,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += (val); \
+ } while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += (val); \
+ } while(0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
[application/octet-stream] v15-0001-instrumentation-Use-Instrumentation-instead-of-m.patch (19.2K, 6-v15-0001-instrumentation-Use-Instrumentation-instead-of-m.patch)
From ae4383e786599a06d95924276a0e414f131d344d Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 5 Apr 2026 05:08:23 -0700
Subject: [PATCH v15 1/9] instrumentation: Use Instrumentation instead of
manual buffer tracking
This replaces several repeated code blocks that snapshot pgBufferUsage /
pgWalUsage, and in some cases also run a timer to measure activity,
with the new Instrumentation struct and associated helpers.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
.../pg_stat_statements/pg_stat_statements.c | 62 +++++--------------
src/backend/access/heap/vacuumlazy.c | 15 +++--
src/backend/commands/analyze.c | 31 +++++-----
src/backend/commands/explain.c | 44 +++++++------
src/backend/commands/explain_dr.c | 53 ++++++----------
src/backend/commands/prepare.c | 28 ++++-----
src/include/commands/explain_dr.h | 5 +-
7 files changed, 91 insertions(+), 147 deletions(-)
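Note for readers: the pattern being removed throughout this patch is "copy the global counters, run the work, diff against the copy", optionally wrapped in a timer. The sketch below shows how a small start/stop helper can own that bookkeeping. It is a simplified model only: the Demo*/demo_* names are hypothetical, the bodies are not the real InstrInitOptions/InstrStart/InstrStop implementations (those are not part of this patch), and clock() measures CPU time here as a portable stand-in for a wall-clock timer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical, trimmed-down usage counters and the global they live in. */
typedef struct DemoBufferUsage
{
	int64_t		shared_blks_hit;
	int64_t		shared_blks_read;
} DemoBufferUsage;

static DemoBufferUsage demoBufferUsage;

enum
{
	DEMO_TIMER = 1 << 0,
	DEMO_BUFFERS = 1 << 1
};

/* Hypothetical helper struct: options, running totals, start snapshots. */
typedef struct DemoInstr
{
	int			options;
	clock_t		starttime;		/* set by demo_instr_start */
	double		total;			/* accumulated seconds (CPU time here) */
	DemoBufferUsage bufusage_start; /* snapshot taken at start */
	DemoBufferUsage bufusage;	/* accumulated difference */
} DemoInstr;

static void
demo_instr_init(DemoInstr *instr, int options)
{
	memset(instr, 0, sizeof(*instr));
	instr->options = options;
}

static void
demo_instr_start(DemoInstr *instr)
{
	assert(instr != NULL);
	if (instr->options & DEMO_TIMER)
		instr->starttime = clock();
	if (instr->options & DEMO_BUFFERS)
		instr->bufusage_start = demoBufferUsage;
}

static void
demo_instr_stop(DemoInstr *instr)
{
	assert(instr != NULL);
	if (instr->options & DEMO_TIMER)
		instr->total += (double) (clock() - instr->starttime) / CLOCKS_PER_SEC;
	if (instr->options & DEMO_BUFFERS)
	{
		/* Accumulate, so repeated start/stop cycles add up. */
		instr->bufusage.shared_blks_hit +=
			demoBufferUsage.shared_blks_hit - instr->bufusage_start.shared_blks_hit;
		instr->bufusage.shared_blks_read +=
			demoBufferUsage.shared_blks_read - instr->bufusage_start.shared_blks_read;
	}
}
```

A caller then needs only init, start and stop around the measured region and reads the totals from the struct afterwards. Since the stop step accumulates rather than overwrites, the same struct can be started and stopped once per tuple, as the serialize receiver does in this patch.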
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 025215fcc90..b6863479e9f 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -906,22 +906,16 @@ pgss_planner(Query *parse,
&& pgss_track_planning && query_string
&& parse->queryId != INT64CONST(0))
{
- instr_time start;
- instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
-
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
+ Instrumentation instr = {0};
/*
+ * We need to track buffer usage as the planner can access them.
+ *
* Similarly the planner could write some WAL records in some cases
* (e.g. setting a hint bit with those being WAL-logged)
*/
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -939,26 +933,17 @@ pgss_planner(Query *parse,
}
PG_END_TRY();
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
+ InstrStop(&instr);
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
parse->stmt_len,
PGSS_PLAN,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
0,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
@@ -1151,17 +1136,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
!IsA(parsetree, ExecuteStmt) &&
!IsA(parsetree, PrepareStmt))
{
- instr_time start;
- instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -1191,8 +1170,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
* former value, which'd otherwise be a good idea.
*/
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
+ InstrStop(&instr);
/*
* Track the total number of rows retrieved or affected by the utility
@@ -1205,23 +1183,15 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
saved_stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
rows,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 88c71cd85b6..30f589c9207 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -637,8 +637,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ Instrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -654,6 +653,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -984,14 +985,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr);
+
if (verbose || params->log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params->log_vacuum_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -1000,12 +1001,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
+ BufferUsage bufferusage = instr->bufusage;
+ WalUsage walusage = instr->walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
total_blks_hit = bufferusage.shared_blks_hit +
bufferusage.local_blks_hit;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 49a5cdf579c..8472fc0c280 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -309,9 +309,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ Instrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -362,6 +360,9 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -742,12 +743,13 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr);
+
if (verbose || params->log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params->log_analyze_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -755,18 +757,15 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
-
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ BufferUsage bufusage = instr->bufusage;
+ WalUsage walusage = instr->walusage;
+
+ total_blks_hit = bufusage.shared_blks_hit +
+ bufusage.local_blks_hit;
+ total_blks_read = bufusage.shared_blks_read +
+ bufusage.local_blks_read;
+ total_blks_dirtied = bufusage.shared_blks_dirtied +
+ bufusage.local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 73eaaf176ac..d6dc7268438 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -324,14 +324,17 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation plan_instr = {0};
+ int instrument_options = INSTRUMENT_TIMER;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ InstrInitOptions(&plan_instr, instrument_options);
+
if (es->memory)
{
/*
@@ -348,15 +351,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(&plan_instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(&plan_instr);
if (es->memory)
{
@@ -364,16 +364,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
@@ -590,7 +583,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* grab serialization metrics before we destroy the DestReceiver */
if (es->serialize != EXPLAIN_SERIALIZE_NONE)
- serializeMetrics = GetSerializationMetrics(dest);
+ {
+ SerializeMetrics *metrics = GetSerializationMetrics(dest);
+
+ if (metrics)
+ memcpy(&serializeMetrics, metrics, sizeof(SerializeMetrics));
+ }
/* call the DestReceiver's destroy method even during explain */
dest->rDestroy(dest);
@@ -1019,7 +1017,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
ExplainIndentText(es);
if (es->timing)
appendStringInfo(es->str, "Serialization: time=%.3f ms output=" UINT64_FORMAT "kB format=%s\n",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
else
@@ -1027,10 +1025,10 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
- if (es->buffers && peek_buffer_usage(es, &metrics->bufferUsage))
+ if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
es->indent--;
}
}
@@ -1038,13 +1036,13 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
{
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
3, es);
ExplainPropertyUInteger("Output Volume", "kB",
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 3c96061cf32..df5ae5f4569 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,10 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ Instrumentation *instr = &myState->metrics.instr;
- /* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ /* Start per-tuple measurement */
+ InstrStart(instr);
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +181,8 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
- /* Update timing data */
- if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
-
- /* Update buffer metrics */
- if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ /* Stop per-tuple measurement */
+ InstrStop(instr);
return true;
}
@@ -209,6 +194,7 @@ static void
serializeAnalyzeStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
{
SerializeDestReceiver *receiver = (SerializeDestReceiver *) self;
+ int instrument_options = 0;
Assert(receiver->es != NULL);
@@ -233,9 +219,13 @@ serializeAnalyzeStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
/* The output buffer is re-used across rows, as in printtup.c */
initStringInfo(&receiver->buf);
- /* Initialize results counters */
+ /* Initialize metrics and per-tuple instrumentation */
memset(&receiver->metrics, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(receiver->metrics.timeSpent);
+ if (receiver->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (receiver->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ InstrInitOptions(&receiver->metrics.instr, instrument_options);
}
/*
@@ -290,22 +280,17 @@ CreateExplainSerializeDestReceiver(ExplainState *es)
}
/*
- * GetSerializationMetrics - collect metrics
+ * GetSerializationMetrics - get serialization metrics
*
- * We have to be careful here since the receiver could be an IntoRel
- * receiver if the subject statement is CREATE TABLE AS. In that
- * case, return all-zeroes stats.
+ * Returns a pointer to the SerializeMetrics inside the dest receiver,
+ * or NULL if the receiver is not a SerializeDestReceiver (e.g. an IntoRel
+ * receiver for CREATE TABLE AS).
*/
-SerializeMetrics
+SerializeMetrics *
GetSerializationMetrics(DestReceiver *dest)
{
- SerializeMetrics empty;
-
if (dest->mydest == DestExplainSerialize)
- return ((SerializeDestReceiver *) dest)->metrics;
-
- memset(&empty, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(empty.timeSpent);
+ return &((SerializeDestReceiver *) dest)->metrics;
- return empty;
+ return NULL;
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 876aad2100a..bf9f2eb6149 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -22,6 +22,7 @@
#include "catalog/pg_type.h"
#include "commands/createas.h"
#include "commands/explain.h"
+#include "executor/instrument.h"
#include "commands/explain_format.h"
#include "commands/explain_state.h"
#include "commands/prepare.h"
@@ -580,14 +581,17 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation plan_instr = {0};
+ int instrument_options = INSTRUMENT_TIMER;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ InstrInitOptions(&plan_instr, instrument_options);
+
if (es->memory)
{
/* See ExplainOneQuery about this */
@@ -598,9 +602,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(&plan_instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -635,8 +637,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(&plan_instr);
if (es->memory)
{
@@ -644,13 +645,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -660,7 +654,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/include/commands/explain_dr.h b/src/include/commands/explain_dr.h
index f98eaae1864..ab5c53023e1 100644
--- a/src/include/commands/explain_dr.h
+++ b/src/include/commands/explain_dr.h
@@ -23,11 +23,10 @@ typedef struct ExplainState ExplainState;
typedef struct SerializeMetrics
{
uint64 bytesSent; /* # of bytes serialized */
- instr_time timeSpent; /* time spent serializing */
- BufferUsage bufferUsage; /* buffers accessed during serialization */
+ Instrumentation instr; /* time and buffer usage */
} SerializeMetrics;
extern DestReceiver *CreateExplainSerializeDestReceiver(ExplainState *es);
-extern SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
+extern SerializeMetrics *GetSerializationMetrics(DestReceiver *dest);
#endif
--
2.47.1
[application/octet-stream] v15-0007-instrumentation-Optimize-ExecProcNodeInstr-instr.patch (9.4K, 7-v15-0007-instrumentation-Optimize-ExecProcNodeInstr-instr.patch)
From 9e527947d49be7715cb01addda2890ff54ed5c16 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 5 Apr 2026 19:30:56 -0700
Subject: [PATCH v15 7/9] instrumentation: Optimize ExecProcNodeInstr
instructions by inlining
For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.
Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). These
conditionals being checked for each tuple being emitted add up, and cause
a non-optimal set of instructions to be generated by the compiler.
Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. For a stress
test with a large COUNT(*) that does many ExecProcNode calls, this reduces
the overhead of EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) from ~ 20% on
top of actual runtime to ~ 3%. When using BUFFERS ON the same query goes
from ~ 20% to ~ 10% on top of actual runtime.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxFP7i7-wy98ZmEJ11edYq-RrPvJoa4kzGhBBjERA4Nyw%40mail.gmail.com#e8dfd018a07d7f8d41565a079d40c564
---
src/backend/executor/execProcnode.c | 22 +----
src/backend/executor/instrument.c | 144 ++++++++++++++++++++++++----
src/include/executor/instrument.h | 5 +
3 files changed, 130 insertions(+), 41 deletions(-)
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index f006931c94d..ac400670fea 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,7 +121,6 @@
#include "nodes/nodeFuncs.h"
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
-static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
static bool ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context);
@@ -466,7 +465,7 @@ ExecProcNodeFirst(PlanState *node)
* have ExecProcNode() directly call the relevant function from now on.
*/
if (node->instrument)
- node->ExecProcNode = ExecProcNodeInstr;
+ node->ExecProcNode = InstrNodeSetupExecProcNode(node->instrument);
else
node->ExecProcNode = node->ExecProcNodeReal;
@@ -474,25 +473,6 @@ ExecProcNodeFirst(PlanState *node)
}
-/*
- * ExecProcNode wrapper that performs instrumentation calls. By keeping
- * this a separate function, we avoid overhead in the normal case where
- * no instrumentation is wanted.
- */
-static TupleTableSlot *
-ExecProcNodeInstr(PlanState *node)
-{
- TupleTableSlot *result;
-
- InstrStartNode(node->instrument);
-
- result = node->ExecProcNodeReal(node);
-
- InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
- return result;
-}
-
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index cfedbad0eba..5a17be9aa53 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -66,19 +66,25 @@ InstrInitOptions(Instrumentation *instr, int instrument_options)
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-void
-InstrStart(Instrumentation *instr)
+static inline void
+InstrStartTimer(Instrumentation *instr)
{
- if (instr->need_timer)
- {
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStart called twice in a row");
- else
- INSTR_TIME_SET_CURRENT(instr->starttime);
- }
+ Assert(INSTR_TIME_IS_ZERO(instr->starttime));
- if (instr->need_stack)
- InstrPushStack(instr);
+ INSTR_TIME_SET_CURRENT(instr->starttime);
+}
+
+static inline void
+InstrStopTimer(Instrumentation *instr, instr_time *accum_time)
+{
+ instr_time endtime;
+
+ Assert(!INSTR_TIME_IS_ZERO(instr->starttime));
+
+ INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_ACCUM_DIFF(*accum_time, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
}
/*
@@ -88,18 +94,13 @@ InstrStart(Instrumentation *instr)
static inline void
InstrStopCommon(Instrumentation *instr, instr_time *accum_time)
{
- instr_time endtime;
-
/* update the time only if the timer was requested */
if (instr->need_timer)
{
if (INSTR_TIME_IS_ZERO(instr->starttime))
elog(ERROR, "InstrStop called without start");
- INSTR_TIME_SET_CURRENT(endtime);
- INSTR_TIME_ACCUM_DIFF(*accum_time, endtime, instr->starttime);
-
- INSTR_TIME_SET_ZERO(instr->starttime);
+ InstrStopTimer(instr, accum_time);
}
/* pop the stack, unless InstrStopFinalize previously cleaned up */
@@ -107,6 +108,16 @@ InstrStopCommon(Instrumentation *instr, instr_time *accum_time)
InstrPopStack(instr);
}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer)
+ InstrStartTimer(instr);
+
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
void
InstrStop(Instrumentation *instr)
{
@@ -398,15 +409,14 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options, bool async_mod
instr->async_mode = async_mode;
}
-/* Entry to a plan node */
+/* Entry to a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
void
InstrStartNode(NodeInstrumentation *instr)
{
InstrStart(&instr->instr);
}
-
-/* Exit from a plan node */
+/* Exit from a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
{
@@ -495,6 +505,100 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
InstrAccumStack(&dst->instr, &add->instr);
}
+/*
+ * Specialized handling of instrumented ExecProcNode
+ *
+ * These functions are equivalent to running ExecProcNodeReal wrapped in
+ * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
+ * by checking the instrumentation options when the ExecProcNode pointer gets
+ * first set, and then using a special-purpose function for each. This results
+ * in a more optimized set of compiled instructions.
+ */
+
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
+
+/* Simplified pop: restore saved state instead of re-deriving from array */
+static inline void
+InstrPopStackTo(Instrumentation *prev)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.stack_size > 1 ? instr_stack.entries[instr_stack.stack_size - 2] == prev : &instr_top == prev);
+ instr_stack.entries[instr_stack.stack_size - 1]->on_stack = false;
+ instr_stack.stack_size--;
+ instr_stack.current = prev;
+}
+
+static pg_attribute_always_inline TupleTableSlot *
+ExecProcNodeInstr(PlanState *node, bool need_timer, bool need_stack)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ if (need_stack)
+ InstrPushStack(&instr->instr);
+ if (need_timer)
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ if (need_timer)
+ InstrStopTimer(&instr->instr, &instr->counter);
+ if (need_stack)
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrFull(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsStackOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsTimerOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, false);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, false);
+}
+
+/*
+ * Returns an ExecProcNode wrapper that performs instrumentation calls,
+ * tailored to the instrumentation options enabled for the node.
+ */
+ExecProcNodeMtd
+InstrNodeSetupExecProcNode(NodeInstrumentation *instr)
+{
+ bool need_timer = instr->instr.need_timer;
+ bool need_stack = instr->instr.need_stack;
+
+ if (need_timer && need_stack)
+ return ExecProcNodeInstrFull;
+ else if (need_stack)
+ return ExecProcNodeInstrRowsStackOnly;
+ else if (need_timer)
+ return ExecProcNodeInstrRowsTimerOnly;
+ else
+ return ExecProcNodeInstrRowsOnly;
+}
+
/* Trigger instrumentation handling */
TriggerInstrumentation *
InstrAllocTrigger(QueryInstrumentation *qinstr, int instrument_options, int n)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 5bb698d686d..bd481afd0de 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -300,6 +300,11 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
+typedef struct TupleTableSlot TupleTableSlot;
+typedef struct PlanState PlanState;
+typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+extern ExecProcNodeMtd InstrNodeSetupExecProcNode(NodeInstrumentation *instr);
+
extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr,
int instrument_options, int n);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
--
2.47.1
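Editor's illustration (not part of the patch): the specialization technique described in the v15 7/9 commit message can be sketched in isolation roughly as follows. This is a minimal, self-contained sketch under stated assumptions: the `Node` struct, its counters, and every function name here (`produce`, `proc_instr`, `proc_full`, `select_proc`, and so on) are hypothetical stand-ins, not PostgreSQL's actual types or API. The counters stand in for the timer and stack work, and plain `static inline` is used where the patch uses `pg_attribute_always_inline`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for PlanState / NodeInstrumentation. */
typedef struct Node Node;
typedef int (*ProcFn) (Node *node);

struct Node
{
	ProcFn		proc;			/* pointer that callers go through */
	ProcFn		proc_real;		/* uninstrumented implementation */
	bool		need_timer;		/* option flags, checked once at setup */
	bool		need_stack;
	int			timer_calls;	/* stands in for start/stop timer work */
	int			stack_calls;	/* stands in for push/pop stack work */
	int			tuples;			/* number of non-empty results seen */
	int			remaining;		/* results the fake node still has */
};

/* Fake "real" node function: returns 1 per result, 0 when exhausted. */
static int
produce(Node *node)
{
	if (node->remaining <= 0)
		return 0;
	node->remaining--;
	return 1;
}

/*
 * Generic body. Each wrapper below passes literal constants for the two
 * flags, so once this is inlined the compiler can drop the dead branches.
 */
static inline int
proc_instr(Node *node, bool need_timer, bool need_stack)
{
	int			result;

	if (need_stack)
		node->stack_calls++;
	if (need_timer)
		node->timer_calls++;

	result = node->proc_real(node);

	if (result)
		node->tuples++;
	return result;
}

static int proc_full(Node *node) { return proc_instr(node, true, true); }
static int proc_stack_only(Node *node) { return proc_instr(node, false, true); }
static int proc_timer_only(Node *node) { return proc_instr(node, true, false); }
static int proc_rows_only(Node *node) { return proc_instr(node, false, false); }

/* Inspect the options once and return the matching specialized wrapper. */
ProcFn
select_proc(const Node *node)
{
	assert(node != NULL);

	if (node->need_timer && node->need_stack)
		return proc_full;
	else if (node->need_stack)
		return proc_stack_only;
	else if (node->need_timer)
		return proc_timer_only;
	else
		return proc_rows_only;
}
```

A caller would assign `node.proc = select_proc(&node)` once and then call `node.proc(&node)` per tuple, so the option flags are read only at setup time rather than on every call. Whether the compiler actually removes the branches depends on it inlining `proc_instr`, which is presumably why the patch forces inlining rather than leaving it to the optimizer.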
[application/octet-stream] v15-0006-instrumentation-Use-Instrumentation-struct-for-p.patch (29.1K, 8-v15-0006-instrumentation-Use-Instrumentation-struct-for-p.patch)
From 90128ad03216fab0a0d62a3521694f9dc1a93b52 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 15 Mar 2026 21:44:58 -0700
Subject: [PATCH v15 6/9] instrumentation: Use Instrumentation struct for
parallel workers
This simplifies the DSM allocations a bit since we don't need to
separately allocate WAL and buffer usage, and allows the easier future
addition of a third stack-based struct being discussed.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
src/backend/access/brin/brin.c | 43 ++++++-----------
src/backend/access/gin/gininsert.c | 43 ++++++-----------
src/backend/access/nbtree/nbtsort.c | 43 ++++++-----------
src/backend/commands/vacuumparallel.c | 52 ++++++++-------------
src/backend/executor/execParallel.c | 66 ++++++++++++---------------
src/backend/executor/instrument.c | 14 +++---
src/include/executor/execParallel.h | 5 +-
src/include/executor/instrument.h | 4 +-
8 files changed, 99 insertions(+), 171 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3a5176c76c7..9e545b4ef0e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -51,8 +51,7 @@
#define PARALLEL_KEY_BRIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -148,8 +147,7 @@ typedef struct BrinLeader
BrinShared *brinshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BrinLeader;
/*
@@ -2387,8 +2385,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BrinShared *brinshared;
Sharedsort *sharedsort;
BrinLeader *brinleader = palloc0_object(BrinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -2430,18 +2427,14 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -2514,15 +2507,12 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -2533,8 +2523,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
brinleader->snapshot = snapshot;
- brinleader->walusage = walusage;
- brinleader->bufferusage = bufferusage;
+ brinleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -2573,7 +2562,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2888,8 +2877,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2950,11 +2938,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0d80f72a0b0..f3de62ce7f3 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -45,8 +45,7 @@
#define PARALLEL_KEY_GIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -138,8 +137,7 @@ typedef struct GinLeader
GinBuildShared *ginshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} GinLeader;
typedef struct
@@ -945,8 +943,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
GinBuildShared *ginshared;
Sharedsort *sharedsort;
GinLeader *ginleader = palloc0_object(GinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -987,18 +984,14 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1066,15 +1059,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1085,8 +1075,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
ginleader->ginshared = ginshared;
ginleader->sharedsort = sharedsort;
ginleader->snapshot = snapshot;
- ginleader->walusage = walusage;
- ginleader->bufferusage = bufferusage;
+ ginleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1125,7 +1114,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2119,8 +2108,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2200,11 +2188,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2d7b7cef912..cb238f862a7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -66,8 +66,7 @@
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
#define PARALLEL_KEY_TUPLESORT_SPOOL2 UINT64CONST(0xA000000000000003)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000004)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xA000000000000005)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xA000000000000006)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xA000000000000005)
/*
* DISABLE_LEADER_PARTICIPATION disables the leader's participation in
@@ -195,8 +194,7 @@ typedef struct BTLeader
Sharedsort *sharedsort;
Sharedsort *sharedsort2;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BTLeader;
/*
@@ -1408,8 +1406,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
Sharedsort *sharedsort2;
BTSpool *btspool = buildstate->spool;
BTLeader *btleader = palloc0_object(BTLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -1462,18 +1459,14 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1560,15 +1553,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1580,8 +1570,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btleader->sharedsort = sharedsort;
btleader->sharedsort2 = sharedsort2;
btleader->snapshot = snapshot;
- btleader->walusage = walusage;
- btleader->bufferusage = bufferusage;
+ btleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1620,7 +1609,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1754,8 +1743,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
#ifdef BTREE_BUILD_STATS
@@ -1837,11 +1825,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
sharedsort2, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index c330c891c03..b5fed54fb85 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -47,9 +47,8 @@
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 2
-#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 3
-#define PARALLEL_VACUUM_KEY_WAL_USAGE 4
-#define PARALLEL_VACUUM_KEY_INDEX_STATS 5
+#define PARALLEL_VACUUM_KEY_INSTRUMENTATION 3
+#define PARALLEL_VACUUM_KEY_INDEX_STATS 4
/*
* Shared information among parallel workers. So this is allocated in the DSM
@@ -188,11 +187,8 @@ struct ParallelVacuumState
/* Shared dead items space among parallel vacuum workers */
TidStore *dead_items;
- /* Points to buffer usage area in DSM */
- BufferUsage *buffer_usage;
-
- /* Points to WAL usage area in DSM */
- WalUsage *wal_usage;
+ /* Points to instrumentation area in DSM */
+ Instrumentation *instr;
/*
* False if the index is totally unsuitable target for all parallel
@@ -250,8 +246,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PVShared *shared;
TidStore *dead_items;
PVIndStats *indstats;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *instr;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
@@ -304,18 +299,15 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage and WalUsage --
- * PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
+ * Estimate space for Instrumentation --
+ * PARALLEL_VACUUM_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
@@ -396,17 +388,13 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
pvs->shared = shared;
/*
- * Allocate space for each worker's BufferUsage and WalUsage; no need to
- * initialize
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
*/
- buffer_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, buffer_usage);
- pvs->buffer_usage = buffer_usage;
- wal_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
- pvs->wal_usage = wal_usage;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, instr);
+ pvs->instr = instr;
/* Store query string for workers */
if (debug_query_string)
@@ -749,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->instr[i]);
}
/*
@@ -1007,8 +995,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVShared *shared;
TidStore *dead_items;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *worker_instr;
int nindexes;
char *sharedquery;
ErrorContextCallback errcallback;
@@ -1102,11 +1089,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
parallel_vacuum_process_safe_indexes(&pvs);
/* Report buffer/WAL usage during parallel execution */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index a99e37c98e2..c09d51428a6 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -60,13 +60,12 @@
#define PARALLEL_KEY_EXECUTOR_FIXED UINT64CONST(0xE000000000000001)
#define PARALLEL_KEY_PLANNEDSTMT UINT64CONST(0xE000000000000002)
#define PARALLEL_KEY_PARAMLISTINFO UINT64CONST(0xE000000000000003)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xE000000000000004)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000004)
#define PARALLEL_KEY_TUPLE_QUEUE UINT64CONST(0xE000000000000005)
-#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000006)
+#define PARALLEL_KEY_NODE_INSTRUMENTATION UINT64CONST(0xE000000000000006)
#define PARALLEL_KEY_DSA UINT64CONST(0xE000000000000007)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -631,8 +630,6 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_data;
char *pstmt_space;
char *paramlistinfo_space;
- BufferUsage *bufusage_space;
- WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
@@ -696,21 +693,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage.
+ * Estimate space for Instrumentation.
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
* looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
-
- /*
- * Same thing for WalUsage.
- */
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Estimate space for tuple queues. */
@@ -796,17 +786,18 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMLISTINFO, paramlistinfo_space);
SerializeParamList(estate->es_param_list_info, &paramlistinfo_space);
- /* Allocate space for each worker's BufferUsage; no need to initialize. */
- bufusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufusage_space);
- pei->buffer_usage = bufusage_space;
+ /*
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
+ */
+ {
+ Instrumentation *instr;
- /* Same for WalUsage. */
- walusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
- pei->wal_usage = walusage_space;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
+ pei->instrumentation = instr;
+ }
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -832,9 +823,9 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
InstrInitNode(&instrument[i], estate->es_instrument, false);
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_NODE_INSTRUMENTATION,
instrumentation);
- pei->instrumentation = instrumentation;
+ pei->node_instrumentation = instrumentation;
if (estate->es_jit_flags != PGJIT_NONE)
{
@@ -1236,7 +1227,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
* finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->instrumentation[i]);
pei->finished = true;
}
@@ -1250,11 +1241,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
- /* Accumulate instrumentation, if any. */
- if (pei->instrumentation)
+ /* Accumulate node instrumentation, if any. */
+ if (pei->node_instrumentation)
{
ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
+ pei->node_instrumentation);
ExecFinalizeWorkerInstrumentation(pei->planstate);
}
@@ -1481,8 +1472,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
@@ -1497,7 +1486,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Set up DestReceiver, SharedExecutorInstrumentation, and QueryDesc. */
receiver = ExecParallelGetReceiver(seg, toc);
- instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
+ instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_NODE_INSTRUMENTATION, true);
if (instrumentation != NULL)
instrument_options = instrumentation->instrument_options;
jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
@@ -1555,11 +1544,12 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecutorFinish(queryDesc);
/* Report buffer/WAL usage during parallel execution. */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ {
+ Instrumentation *worker_instr;
+
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
+ }
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index b84c552c6f8..cfedbad0eba 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -345,11 +345,12 @@ InstrStartParallelQuery(void)
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst)
{
InstrQueryStopFinalize(qinstr);
- memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
- memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+ dst->need_stack = qinstr->instr.need_stack;
+ memcpy(&dst->bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(&dst->walusage, &qinstr->instr.walusage, sizeof(WalUsage));
}
/*
@@ -365,12 +366,11 @@ InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUs
* activity is accumulated.
*/
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(Instrumentation *instr)
{
- BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
- WalUsageAdd(&instr_stack.current->walusage, walusage);
+ InstrAccumStack(instr_stack.current, instr);
- WalUsageAdd(&pgWalUsage, walusage);
+ WalUsageAdd(&pgWalUsage, &instr->walusage);
}
/* Node instrumentation handling */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..6c8b602d07f 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -25,9 +25,8 @@ typedef struct ParallelExecutorInfo
{
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
- BufferUsage *buffer_usage; /* points to bufusage area in DSM */
- WalUsage *wal_usage; /* walusage area in DSM */
- SharedExecutorInstrumentation *instrumentation; /* optional */
+ Instrumentation *instrumentation; /* instrumentation area in DSM */
+ SharedExecutorInstrumentation *node_instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
dsa_pointer param_exec; /* serialized PARAM_EXEC parameters */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index f5cc6fb662b..5bb698d686d 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -286,8 +286,8 @@ extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst);
+extern void InstrAccumParallelQuery(Instrumentation *instr);
extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr,
int instrument_options,
--
2.47.1
[application/octet-stream] v15-0008-Index-scans-Show-table-buffer-accesses-separatel.patch (22.9K, 9-v15-0008-Index-scans-Show-table-buffer-accesses-separatel.patch)
From 9a352d49e19f5614aedc9511527b92bee3c6a38c Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:19 -0800
Subject: [PATCH v15 8/9] Index scans: Show table buffer accesses separately in
EXPLAIN ANALYZE
This sets up a separate instrumentation stack entry that is used while an
Index Scan or Index Only Scan accesses the table, for example to fetch the
tuple or, for Index Only Scans, to check visibility in the heap.
EXPLAIN ANALYZE will now show "Table Buffers" that represent such activity.
The activity is also included in regular "Buffers" together with index
activity and that of any child nodes.
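To illustrate how "Table Buffers" can be both shown separately and included
in "Buffers", here is a minimal standalone sketch. It is not the patch's
code: ToyBuf, toy_push, toy_pop, toy_read_page and toy_index_scan are
hypothetical names, and a single hit counter stands in for the full
BufferUsage struct. The table entry is pushed only around the heap fetch
and folded into the node's entry afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stack entry with a single counter instead of BufferUsage. */
typedef struct ToyBuf
{
	long		hits;
	struct ToyBuf *prev;
} ToyBuf;

/* Stand-in for the global "current stack entry" pointer. */
static ToyBuf *toy_cur = NULL;

static void
toy_push(ToyBuf *entry)
{
	entry->prev = toy_cur;
	toy_cur = entry;
}

static void
toy_pop(ToyBuf *entry)
{
	assert(toy_cur == entry);
	toy_cur = entry->prev;
}

/* A buffer hit is charged to whichever entry is currently on top. */
static void
toy_read_page(void)
{
	if (toy_cur)
		toy_cur->hits++;
}

/* One index page access followed by one heap page access. */
static void
toy_index_scan(long *table_hits, long *total_hits)
{
	ToyBuf		node = {0};
	ToyBuf		table = {0};

	toy_push(&node);
	toy_read_page();			/* index page: charged to the node entry */

	toy_push(&table);
	toy_read_page();			/* heap page: charged to the table entry */
	toy_pop(&table);

	toy_pop(&node);

	/* Finalize: fold the table entry into the node entry exactly once. */
	node.hits += table.hits;

	*table_hits = table.hits;
	*total_hits = node.hits;
}
```

With one index page and one heap page, the sketch yields a table count of 1
and a node total of 2, the same shape as the "Table Buffers: shared hit=1" /
"Buffers: shared hit=2" pair in the perform.sgml hunk of this patch.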
Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
Actually populate I(O)S table stack pre index prefetching merge
---
doc/src/sgml/perform.sgml | 13 ++-
doc/src/sgml/ref/explain.sgml | 1 +
src/backend/commands/explain.c | 47 ++++++--
src/backend/executor/execProcnode.c | 46 ++++++++
src/backend/executor/nodeBitmapIndexscan.c | 2 +-
src/backend/executor/nodeIndexonlyscan.c | 41 ++++++-
src/backend/executor/nodeIndexscan.c | 127 +++++++++++++++++----
src/include/executor/instrument_node.h | 5 +
8 files changed, 244 insertions(+), 38 deletions(-)
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 604e8578a8d..d28f4f22535 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -734,6 +734,7 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10)
Index Cond: (unique2 = t1.unique2)
Index Searches: 10
+ Table Buffers: shared hit=10
Buffers: shared hit=24 read=6
Planning:
Buffers: shared hit=15 dirtied=9
@@ -1005,7 +1006,8 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
Index Cond: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Index Recheck: 1
Index Searches: 1
- Buffers: shared hit=1
+ Table Buffers: shared hit=1
+ Buffers: shared hit=2
Planning Time: 0.039 ms
Execution Time: 0.098 ms
</screen>
@@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
then rejected by a recheck of the index condition. This happens because a
GiST index is <quote>lossy</quote> for polygon containment tests: it actually
returns the rows with polygons that overlap the target, and then we have
- to do the exact containment test on those rows.
+ to do the exact containment test on those rows. The <literal>Table Buffers</literal>
+ counts indicate how many operations were performed on the table instead of
+ the index. This number is included in the <literal>Buffers</literal> counts.
</para>
<para>
@@ -1203,13 +1207,14 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
QUERY PLAN
-------------------------------------------------------------------&zwsp;------------------------------------------------------------
Limit (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2.00 loops=1)
- Buffers: shared hit=16
+ Buffers: shared hit=14
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2.00 loops=1)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Index Searches: 1
- Buffers: shared hit=16
+ Table Buffers: shared hit=11
+ Buffers: shared hit=14
Planning Time: 0.077 ms
Execution Time: 0.086 ms
</screen>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5b8b521802e..71070736acb 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -509,6 +509,7 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
-> Index Scan using test_pkey on test (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99.00 loops=1)
Index Cond: ((id > 100) AND (id < 200))
Index Searches: 1
+ Table Buffers: shared hit=1
Buffers: shared hit=4
Planning Time: 0.244 ms
Execution Time: 0.073 ms
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c93e4cbee97..e5ed2524904 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -144,7 +144,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -611,7 +611,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
- show_buffer_usage(es, bufusage);
+ show_buffer_usage(es, bufusage, NULL);
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -1028,7 +1028,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
es->indent--;
}
}
@@ -1042,7 +1042,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -1972,6 +1972,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexScanState *) planstate)->iss_Instrument->table_instr.bufusage, "Table");
break;
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1989,6 +1992,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyFloat("Heap Fetches", NULL,
planstate->instrument->ntuples2, 0, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexOnlyScanState *) planstate)->ioss_Instrument->table_instr.bufusage, "Table");
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
@@ -2290,7 +2296,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->instr.bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage, NULL);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
@@ -2309,7 +2315,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->instr.bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage, NULL);
if (es->wal)
show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
@@ -4109,7 +4115,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
* Show buffer usage details. This better be sync with peek_buffer_usage.
*/
static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
{
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -4134,6 +4140,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "Buffers:");
if (has_shared)
@@ -4189,6 +4197,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared_timing || has_local_timing || has_temp_timing)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "I/O Timings:");
if (has_shared_timing)
@@ -4230,6 +4240,14 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
else
{
+ char *buffers_title = NULL;
+
+ if (title)
+ {
+ buffers_title = psprintf("%s Buffers", title);
+ ExplainOpenGroup(buffers_title, buffers_title, true, es);
+ }
+
ExplainPropertyInteger("Shared Hit Blocks", NULL,
usage->shared_blks_hit, es);
ExplainPropertyInteger("Shared Read Blocks", NULL,
@@ -4250,8 +4268,20 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
usage->temp_blks_read, es);
ExplainPropertyInteger("Temp Written Blocks", NULL,
usage->temp_blks_written, es);
+
+ if (buffers_title)
+ ExplainCloseGroup(buffers_title, buffers_title, true, es);
+
if (track_io_timing)
{
+ char *timings_title = NULL;
+
+ if (title)
+ {
+ timings_title = psprintf("%s I/O Timings", title);
+ ExplainOpenGroup(timings_title, timings_title, true, es);
+ }
+
ExplainPropertyFloat("Shared I/O Read Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
3, es);
@@ -4270,6 +4300,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
ExplainPropertyFloat("Temp I/O Write Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
3, es);
+
+ if (timings_title)
+ ExplainCloseGroup(timings_title, timings_title, true, es);
}
}
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index ac400670fea..28f1f666a3b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -847,6 +847,20 @@ ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
&node->instrument->instr);
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ InstrFinalizeChild(&iss->iss_Instrument->table_instr, &node->instrument->instr);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ InstrFinalizeChild(&ioss->ioss_Instrument->table_instr, &node->instrument->instr);
+ }
+
InstrFinalizeChild(&node->instrument->instr, parent);
return false;
@@ -892,6 +906,38 @@ ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context)
num_workers = node->worker_instrument->num_workers;
+ /*
+ * Fold per-worker IndexScan/IndexOnlyScan table buffer stats into the
+ * per-worker node stats, matching what ExecFinalizeNodeInstrumentation
+ * does for the leader.
+ */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ if (iss->iss_SharedInfo)
+ {
+ int nworkers = Min(num_workers, iss->iss_SharedInfo->num_workers);
+
+ for (int n = 0; n < nworkers; n++)
+ InstrAccumStack(&node->worker_instrument->instrument[n].instr,
+ &iss->iss_SharedInfo->winstrument[n].table_instr);
+ }
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ if (ioss->ioss_SharedInfo)
+ {
+ int nworkers = Min(num_workers, ioss->ioss_SharedInfo->num_workers);
+
+ for (int n = 0; n < nworkers; n++)
+ InstrAccumStack(&node->worker_instrument->instrument[n].instr,
+ &ioss->ioss_SharedInfo->winstrument[n].table_instr);
+ }
+ }
+
/* Accumulate this node's per-worker stats to parent's per-worker stats */
if (parent && parent->worker_instrument)
{
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 70c55ee6d61..657ee2d0667 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -276,7 +276,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of bitmap index scans if requested */
if (estate->es_instrument)
- indexstate->biss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->biss_Instrument = MemoryContextAllocZero(estate->es_query_instr->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index de6154fd541..d918570e684 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -67,6 +67,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -83,6 +84,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->ioss_Instrument && node->ioss_Instrument->table_instr.need_stack)
+ table_instr = &node->ioss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -165,11 +169,22 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
&node->ioss_VMBuffer))
{
+ bool found;
+
/*
* Rats, we have to visit the heap to check visibility.
*/
InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ found = index_fetch_heap(scandesc, node->ioss_TableSlot);
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (!found)
continue; /* no visible tuple, try next index entry */
ExecClearTuple(node->ioss_TableSlot);
@@ -436,6 +451,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->ioss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->ioss_Instrument->table_instr);
}
/*
@@ -610,7 +626,21 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
/* Set up instrumentation of index-only scans if requested */
if (estate->es_instrument)
- indexstate->ioss_Instrument = palloc0_object(IndexScanInstrumentation);
+ {
+ indexstate->ioss_Instrument = MemoryContextAllocZero(estate->es_query_instr->instr_cxt, sizeof(IndexScanInstrumentation));
+
+ /*
+ * Track table and index access separately. We intentionally don't
+ * collect timing (even if enabled), since we don't need it, and
+ * IndexOnlyNext calls InstrPushStack / InstrPopStack (instead of the
+ * full InstrNode*) to reduce overhead.
+ */
+ if ((estate->es_instrument & INSTRUMENT_BUFFERS) != 0)
+ {
+ InstrInitOptions(&indexstate->ioss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ InstrQueryRememberChild(estate->es_query_instr, &indexstate->ioss_Instrument->table_instr);
+ }
+ }
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -899,4 +929,11 @@ ExecIndexOnlyScanRetrieveInstrumentation(IndexOnlyScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->ioss_SharedInfo = palloc(size);
memcpy(node->ioss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->ioss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->ioss_Instrument->table_instr,
+ &node->ioss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 1620d146071..5041266984a 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,7 +85,10 @@ IndexNext(IndexScanState *node)
ExprContext *econtext;
ScanDirection direction;
IndexScanDesc scandesc;
+ ItemPointer tid;
TupleTableSlot *slot;
+ bool found;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -102,6 +105,9 @@ IndexNext(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -132,8 +138,24 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (unlikely(!found))
+ continue;
+
CHECK_FOR_INTERRUPTS();
/*
@@ -181,6 +203,7 @@ IndexNextWithReorder(IndexScanState *node)
Datum *lastfetched_vals;
bool *lastfetched_nulls;
int cmp;
+ Instrumentation *table_instr = NULL;
estate = node->ss.ps.state;
@@ -200,6 +223,9 @@ IndexNextWithReorder(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -263,36 +289,67 @@ IndexNextWithReorder(IndexScanState *node)
}
/*
- * Fetch next tuple from the index.
+ * Fetch next valid tuple from the index.
*/
-next_indextuple:
- if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+ for (;;)
{
+ ItemPointer tid;
+ bool found;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scandesc, ForwardScanDirection);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ /*
+ * No more tuples from the index. But we still need to drain
+ * any remaining tuples from the queue before we're done.
+ */
+ node->iss_ReachedEnd = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scandesc->xs_heaptid));
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
/*
- * No more tuples from the index. But we still need to drain any
- * remaining tuples from the queue before we're done.
+ * If the index was lossy, we have to recheck the index quals and
+ * ORDER BY expressions using the fetched tuple.
*/
- node->iss_ReachedEnd = true;
- continue;
- }
-
- /*
- * If the index was lossy, we have to recheck the index quals and
- * ORDER BY expressions using the fetched tuple.
- */
- if (scandesc->xs_recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->indexqualorig, econtext))
+ if (found && scandesc->xs_recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- /* allow this loop to be cancellable */
- CHECK_FOR_INTERRUPTS();
- goto next_indextuple;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->indexqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ /* allow this loop to be cancellable */
+ CHECK_FOR_INTERRUPTS();
+ continue;
+ }
}
+
+ if (found)
+ break;
}
+ /* No more index entries, re-run to clear the reorder queue */
+ if (node->iss_ReachedEnd)
+ continue;
+
if (scandesc->xs_recheckorderby)
{
econtext->ecxt_scantuple = slot;
@@ -818,6 +875,7 @@ ExecEndIndexScan(IndexScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->iss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->iss_Instrument->table_instr);
}
/*
@@ -980,7 +1038,21 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of index scans if requested */
if (estate->es_instrument)
- indexstate->iss_Instrument = palloc0_object(IndexScanInstrumentation);
+ {
+ indexstate->iss_Instrument = MemoryContextAllocZero(estate->es_query_instr->instr_cxt, sizeof(IndexScanInstrumentation));
+
+ /*
+ * Track table and index access separately. We intentionally don't
+ * collect timing (even if enabled), since we don't need it, and
+ * IndexNext / IndexNextWithReorder call InstrPushStack /
+ * InstrPopStack (instead of the full InstrNode*) to reduce overhead.
+ */
+ if ((estate->es_instrument & INSTRUMENT_BUFFERS) != 0)
+ {
+ InstrInitOptions(&indexstate->iss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ InstrQueryRememberChild(estate->es_query_instr, &indexstate->iss_Instrument->table_instr);
+ }
+ }
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -1834,4 +1906,11 @@ ExecIndexScanRetrieveInstrumentation(IndexScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->iss_SharedInfo = palloc(size);
memcpy(node->iss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->iss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->iss_Instrument->table_instr,
+ &node->iss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 2a0ff377a73..e2315cef384 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -18,6 +18,8 @@
#ifndef INSTRUMENT_NODE_H
#define INSTRUMENT_NODE_H
+#include "executor/instrument.h"
+
/* ---------------------
* Instrumentation information for aggregate function execution
@@ -48,6 +50,9 @@ typedef struct IndexScanInstrumentation
{
/* Index search count (incremented with pgstat_count_index_scan call) */
uint64 nsearches;
+
+ /* Instrumentation utilized for tracking buffer usage during table access */
+ Instrumentation table_instr;
} IndexScanInstrumentation;
/*
--
2.47.1
[application/octet-stream] v15-0005-Optimize-measuring-WAL-buffer-usage-through-stac.patch (81.2K, 10-v15-0005-Optimize-measuring-WAL-buffer-usage-through-stac.patch)
From bf8c303c85a6da53e4735b4271a0648a8f2a54d6 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Mon, 6 Apr 2026 01:20:45 -0700
Subject: [PATCH v15 5/9] Optimize measuring WAL/buffer usage through
stack-based instrumentation
Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that get
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode, and popped at InstrStopNode. Stack entries are zero
initialized (avoiding the diff mechanism) and get added to their parent
entry when they are finalized, i.e. no more modifications can occur.
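The mechanism described above can be sketched in standalone C as follows.
This is a simplified illustration under assumptions, not the code from this
patch: ToyUsage, ToyInstr, toy_current and the toy_* functions are
hypothetical, and only two counters are modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced counter set; the real BufferUsage has more fields. */
typedef struct ToyUsage
{
	int64_t		shared_blks_hit;
	int64_t		shared_blks_read;
} ToyUsage;

/* Hypothetical stack entry: counters plus a link to the displaced entry. */
typedef struct ToyInstr
{
	ToyUsage	usage;
	struct ToyInstr *prev;
} ToyInstr;

/* Stand-in for the global pointer to the current stack entry. */
static ToyInstr *toy_current = NULL;

static void
toy_push(ToyInstr *entry)
{
	entry->prev = toy_current;
	toy_current = entry;
}

static void
toy_pop(ToyInstr *entry)
{
	assert(toy_current == entry);
	toy_current = entry->prev;
}

/* Activity is written straight into the current entry; no diff is needed. */
static void
toy_count_hit(void)
{
	if (toy_current)
		toy_current->usage.shared_blks_hit++;
}

/* Finalizing adds a child's totals to its parent, once, at the end. */
static void
toy_finalize_child(const ToyInstr *child, ToyInstr *parent)
{
	parent->usage.shared_blks_hit += child->usage.shared_blks_hit;
	parent->usage.shared_blks_read += child->usage.shared_blks_read;
}

/* Parent records 1 hit itself, the child records 2, then the child is folded in. */
static int64_t
toy_example(void)
{
	ToyInstr	parent = {0};	/* zero-initialized, so no start snapshot */
	ToyInstr	child = {0};

	toy_push(&parent);
	toy_count_hit();			/* charged to parent */
	toy_push(&child);
	toy_count_hit();			/* charged to child */
	toy_count_hit();			/* charged to child */
	toy_pop(&child);
	toy_pop(&parent);

	toy_finalize_child(&child, &parent);
	return parent.usage.shared_blks_hit;
}
```

The per-cycle cost in this sketch is two pointer assignments for push/pop;
the struct-wide addition happens only once per entry, at finalization.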
To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct, and its
associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.
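As a rough illustration of why abort handling matters, the sketch below
uses plain setjmp/longjmp as a stand-in for PostgreSQL's error handling
(PG_TRY is built on sigsetjmp). All names here (ToyEntry, toy_top,
toy_guarded_section, and so on) are hypothetical and not part of the
patch. The point is only that an error can skip the pop, so the guarded
section has to restore the saved stack top itself.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stack entry. */
typedef struct ToyEntry
{
	long		hits;
	struct ToyEntry *prev;
} ToyEntry;

static ToyEntry *toy_top = NULL;
static jmp_buf toy_error_jmp;

static void
toy_enter(ToyEntry *entry)
{
	entry->prev = toy_top;
	toy_top = entry;
}

/* Simulates work that errors out while a nested entry is still pushed. */
static void
toy_failing_work(ToyEntry *nested)
{
	toy_enter(nested);
	toy_top->hits++;
	longjmp(toy_error_jmp, 1);	/* stand-in for an ERROR; no pop happens */
}

/* Returns 1 if the stack top is consistent again after the simulated error. */
static int
toy_guarded_section(void)
{
	ToyEntry	outer = {0};
	ToyEntry	nested = {0};
	ToyEntry   *volatile saved_top = toy_top;

	toy_enter(&outer);

	if (setjmp(toy_error_jmp) == 0)
		toy_failing_work(&nested);

	/*
	 * Reached on both the normal and the error path, like a "finally"
	 * block: discard whatever was left on the stack by restoring the
	 * saved top, so no pointer to a dead local entry survives.
	 */
	toy_top = saved_top;

	return toy_top == NULL;
}
```

Without the restore step, toy_top would still point at the nested entry
after the jump, and later activity would be charged to memory that is no
longer valid.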
In tests, the stack-based instrumentation mechanism reduces the overhead
of EXPLAIN (ANALYZE, BUFFERS ON, TIMING OFF) for a large COUNT(*) query
from about 50% to 22% on top of the actual runtime.
This also drops the global pgBufferUsage; any callers interested in
measuring buffer activity should instead utilize InstrStart/InstrStop.
The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and heap_page_prune_and_freeze for
measuring FPIs.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
.../pg_stat_statements/pg_stat_statements.c | 6 +-
src/backend/access/brin/brin.c | 10 +-
src/backend/access/gin/gininsert.c | 10 +-
src/backend/access/heap/vacuumlazy.c | 12 +-
src/backend/access/nbtree/nbtsort.c | 10 +-
src/backend/commands/analyze.c | 12 +-
src/backend/commands/explain.c | 10 +-
src/backend/commands/explain_dr.c | 2 +
src/backend/commands/prepare.c | 10 +-
src/backend/commands/tablecmds.c | 5 +-
src/backend/commands/vacuumparallel.c | 10 +-
src/backend/executor/README.instrument | 237 ++++++++++
src/backend/executor/execMain.c | 94 +++-
src/backend/executor/execParallel.c | 32 +-
src/backend/executor/execPartition.c | 5 +-
src/backend/executor/execProcnode.c | 106 ++++-
src/backend/executor/execUtils.c | 13 +-
src/backend/executor/instrument.c | 429 ++++++++++++++----
src/backend/replication/logical/worker.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 6 +-
src/backend/utils/activity/pgstat_io.c | 6 +-
src/include/executor/executor.h | 6 +-
src/include/executor/instrument.h | 199 +++++++-
src/include/nodes/execnodes.h | 2 +
src/include/utils/resowner.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
26 files changed, 1044 insertions(+), 193 deletions(-)
create mode 100644 src/backend/executor/README.instrument
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 346adb5599f..4fdd5ef8898 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -929,12 +929,11 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
- InstrStop(&instr);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
@@ -1145,6 +1144,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
@@ -1159,8 +1159,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
* former value, which'd otherwise be a good idea.
*/
- InstrStop(&instr);
-
/*
* Track the total number of rows retrieved or affected by the utility
* statements of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index bdb30752e09..3a5176c76c7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2887,6 +2887,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2936,7 +2937,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2951,7 +2952,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 9d83a495775..0d80f72a0b0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -991,8 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2118,6 +2118,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2186,7 +2187,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2201,7 +2202,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 30f589c9207..291d9d67bc2 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -637,7 +637,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- Instrumentation *instr = NULL;
+ QueryInstrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -653,8 +653,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
- instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
- InstrStart(instr);
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -985,7 +985,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
- InstrStop(instr);
+ InstrQueryStopFinalize(instr);
if (verbose || params->log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
@@ -1001,8 +1001,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
- BufferUsage bufferusage = instr->bufusage;
- WalUsage walusage = instr->walusage;
+ BufferUsage bufferusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 756dfa3dcf4..2d7b7cef912 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1466,8 +1466,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -1753,6 +1753,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1828,7 +1829,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1838,7 +1839,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8472fc0c280..10f8a2dc81c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -309,7 +309,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- Instrumentation *instr = NULL;
+ QueryInstrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -361,8 +361,8 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
pg_rusage_init(&ru0);
- instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
- InstrStart(instr);
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -743,7 +743,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
- InstrStop(instr);
+ InstrQueryStopFinalize(instr);
if (verbose || params->log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
@@ -757,8 +757,8 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
- BufferUsage bufusage = instr->bufusage;
- WalUsage walusage = instr->walusage;
+ BufferUsage bufusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
total_blks_hit = bufusage.shared_blks_hit +
bufusage.local_blks_hit;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index d6dc7268438..c93e4cbee97 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -324,7 +324,7 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- Instrumentation plan_instr = {0};
+ QueryInstrumentation *plan_instr = NULL;
int instrument_options = INSTRUMENT_TIMER;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
@@ -333,7 +333,7 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
if (es->buffers)
instrument_options |= INSTRUMENT_BUFFERS;
- InstrInitOptions(&plan_instr, instrument_options);
+ plan_instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -351,12 +351,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- InstrStart(&plan_instr);
+ InstrQueryStart(plan_instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- InstrStop(&plan_instr);
+ InstrQueryStopFinalize(plan_instr);
if (es->memory)
{
@@ -366,7 +366,7 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
+ &plan_instr->instr.total, (es->buffers ? &plan_instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index df5ae5f4569..836395d6992 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -236,6 +236,8 @@ serializeAnalyzeShutdown(DestReceiver *self)
{
SerializeDestReceiver *receiver = (SerializeDestReceiver *) self;
+ InstrFinalizeChild(&receiver->metrics.instr, instr_stack.current);
+
if (receiver->finfos)
pfree(receiver->finfos);
receiver->finfos = NULL;
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index bf9f2eb6149..ee811357588 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -581,7 +581,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- Instrumentation plan_instr = {0};
+ QueryInstrumentation *plan_instr = NULL;
int instrument_options = INSTRUMENT_TIMER;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
@@ -590,7 +590,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (es->buffers)
instrument_options |= INSTRUMENT_BUFFERS;
- InstrInitOptions(&plan_instr, instrument_options);
+ plan_instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -602,7 +602,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- InstrStart(&plan_instr);
+ InstrQueryStart(plan_instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -637,7 +637,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- InstrStop(&plan_instr);
+ InstrQueryStopFinalize(plan_instr);
if (es->memory)
{
@@ -654,7 +654,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
+ &plan_instr->instr.total, (es->buffers ? &plan_instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 0ce2e81f9c2..9fea019c39e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2139,7 +2139,7 @@ ExecuteTruncateGuts(List *explicit_rels,
rel,
0, /* dummy rangetable index */
NULL,
- 0);
+ 0, NULL);
estate->es_opened_result_relations =
lappend(estate->es_opened_result_relations, resultRelInfo);
resultRelInfo++;
@@ -6337,7 +6337,8 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap)
oldrel,
0, /* dummy rangetable index */
NULL,
- estate->es_instrument);
+ estate->es_instrument,
+ estate->es_query_instr);
MemoryContextSwitchTo(oldcontext);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 77834b96a21..c330c891c03 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -308,8 +308,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgBufferUsage or
- * pgWalUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1006,6 +1006,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1095,7 +1096,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1103,7 +1104,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/README.instrument b/src/backend/executor/README.instrument
new file mode 100644
index 00000000000..7df837dbc77
--- /dev/null
+++ b/src/backend/executor/README.instrument
@@ -0,0 +1,237 @@
+src/backend/executor/README.instrument
+
+Instrumentation
+===============
+
+The instrumentation subsystem measures time, buffer usage and WAL activity
+during query execution and other similar activities. It is used by
+EXPLAIN ANALYZE, pg_stat_statements, and other consumers that need
+activity and/or timing metrics over a section of code.
+
+The design has two central goals:
+
+* Make it cheap to measure activity in a section of code, even when
+ that section is called many times and the aggregate is what is used
+ (as is the case with per-node instrumentation in the executor)
+
+* Ensure nested instrumentation accurately measures activity/timing,
+ even when execution is aborted due to errors being thrown.
+
+The key data structures are defined in src/include/executor/instrument.h
+and the implementation lives in src/backend/executor/instrument.c.
+
+
+Instrumentation Options
+-----------------------
+
+Callers specify what to measure with a bitmask of InstrumentOption flags:
+
+ INSTRUMENT_ROWS -- row counts only (used with NodeInstrumentation)
+ INSTRUMENT_TIMER -- wall-clock timing and row counts
+ INSTRUMENT_BUFFERS -- buffer hit/read/dirtied/written counts and I/O time
+ INSTRUMENT_WAL -- WAL records, FPI, bytes
+
+INSTRUMENT_BUFFERS and INSTRUMENT_WAL utilize the instrumentation stack
+(described below) for efficient handling of counter values.
+
+
+Struct Hierarchy
+----------------
+
+The following instrumentation structs exist, each specialized for a
+different scope:
+
+Instrumentation Base struct. Holds timing and buffer/WAL counters.
+
+QueryInstrumentation Extends Instrumentation for query-level tracking. When
+ stack-based tracking is enabled, it owns a dedicated
+ MemoryContext and uses the ResourceOwner mechanism for
+ abort cleanup.
+
+NodeInstrumentation Extends Instrumentation for per-plan-node statistics
+ (startup time, tuple counts, loop counts, etc).
+
+TriggerInstrumentation Extends Instrumentation with a firing count.
+
+
+Stack-based instrumentation
+===========================
+
+For tracking WAL or buffer usage counters, the specialized stack-based
+instrumentation is used.
+
+A simple approach to measuring buffer/WAL activity in a code section could be
+to have a set of global counters, snapshot all the counters at the start, and
+diff them at the end. But this is expensive in practice: BufferUsage alone
+has many fields, and the diff must be computed for every InstrStartNode /
+InstrStopNode cycle.
+
+An alternative is to write counter updates directly into the struct that
+should receive them, avoiding the diff. But that has two complications: low-level
+code, such as the buffer manager, has no direct pointers to higher-level
+structs, such as plan nodes tracking buffer usage. And instrumentation is often
+nested: we might be interested in both the aggregate buffer usage of a query and
+the individual per-node details. Stack-based instrumentation solves both problems:
+
+At all times, there is a stack that tracks which Instrumentation is currently
+active. The stack is represented by instr_stack, a per-backend global
+that holds a dynamic array of Instrumentation pointers. The field
+instr_stack.current always points to the current stack entry that should
+be updated when activity occurs. When the stack array is empty,
+instr_stack.current points to instr_top.
+
+For example, if a backend has two portals open, the overall nesting of
+Instrumentation and their respective InstrStart/InstrStop calls creates a
+tree-like structure like this:
+
+ Session (instr_top)
+ |
+ +-- Query A (QueryInstrumentation)
+ | |
+ | +-- NestLoop (NodeInstrumentation)
+ | |
+ | +-- Seq Scan A (NodeInstrumentation)
+ | +-- Seq Scan B (NodeInstrumentation)
+ |
+ +-- Query B (QueryInstrumentation)
+ |
+ +-- Seq Scan C (NodeInstrumentation)
+
+While executing Seq Scan B, the stack looks like:
+
+ instr_top (implicit bottom, not in the entries array)
+ 0: Query A
+ 1: NestLoop
+ 2: Seq Scan B <-- instr_stack.current
+
+When no query is running, the stack is empty (stack_size == 0) and
+instr_stack.current points to instr_top.
+
+Any buffer or WAL counter update (via the INSTR_BUFUSAGE_* and
+INSTR_WALUSAGE_* macros in the buffer manager, WAL insertion code, etc.)
+writes directly into instr_stack.current. Each instrumentation node starts
+zeroed, so the values it accumulates while on top of the stack represent
+exactly the activity that occurred during that time.
+
+Every Instrumentation node (except for instr_top) has a target, or parent, it
+will be accumulated into, which is typically the Instrumentation that was the
+current stack entry when it was created.
+
+For example, when Seq Scan A gets finalized in regular execution via ExecutorFinish,
+its instrumentation data gets added to the immediate parent in
+the execution tree, the NestLoop, which will then get added to Query A's
+QueryInstrumentation, which then accumulates to the parent.
+
+While we can typically think of this as a tree, the NodeInstrumentation
+underneath a particular QueryInstrumentation could behave differently --
+for example, it could propagate directly to the QueryInstrumentation, in
+order to not show cumulative numbers in EXPLAIN ANALYZE.
+
+Note these relationships are partially implicit, especially when it comes
+to NodeInstrumentation. Each QueryInstrumentation maintains a list of its
+unfinalized child nodes. The parent of a QueryInstrumentation itself is
+determined by the stack (see below): when a query is finalized or cleaned
+up on abort, its counters are accumulated to whatever entry is then current
+on the stack, which is typically instr_top.
+
+
+Finalization and Abort Safety
+=============================
+
+Finalization is the process of rolling up a node's buffer/WAL counters to
+its parent. In normal execution, nodes are pushed onto the stack when they
+start and popped when they stop; at finalization time their accumulated
+counters are added to the parent.
+
+Due to the use of longjmp for error handling, functions can exit abruptly
+without executing their normal cleanup code. On abort, two things need
+to happen:
+
+1. The stack is reset to the level saved at the start of the aborting
+ (sub-)transaction. This ensures that we don't later try to update
+ counters on a freed stack entry. We also need to ensure that the stack
+ entry that was current before a particular Instrumentation started is
+ current again after it stops.
+
+2. Finalize all affected Instrumentation nodes, rolling up their counters
+ to the innermost surviving Instrumentation, so that data is not lost.
+
+For example, if Seq Scan B aborts while the stack is:
+
+ instr_top (implicit bottom)
+ 0: Query A
+ 1: NestLoop
+ 2: Seq Scan B
+
+The abort handler for Query A accumulates all unfinalized children (Seq
+Scan A, Seq Scan B, NestLoop) directly into Query A's counters, then
+unwinds the instrumentation stack and accumulates Query A's counters to
+instr_top.
+
+Note that on abort the children do not accumulate through each other (Seq
+Scan B -> NestLoop -> Query A); they all accumulate directly to their
+parent QueryInstrumentation. This means the order in which children are
+released does not matter -- this is important because ResourceOwner cleanup
+does not guarantee a particular release order. The per-node breakdown is lost,
+but the instrumentation active when the query was started (instr_top in the
+above example) survives the abort, and its counters include the activity.
+
+If multiple QueryInstrumentations are active on the stack (e.g. nested
+portals), the abort handler of each uses InstrStopFinalize() to accumulate
+the statistics to the parent entry of either the entry being released, or a
+previously released entry if it was higher up in the stack, so they compose
+correctly regardless of release order.
+
+There are two mechanisms for achieving abort safety:
+
+* Resource Owner (QueryInstrumentation): registers with the current
+ ResourceOwner at start. On transaction abort, the resource owner system
+ calls the release callback, which walks unfinalized child entries,
+ accumulates their data, unwinds the stack, and destroys the dedicated
+ memory context (freeing the QueryInstrumentation and all child
+ allocations as a unit). This is the recommended approach when the
+ instrumented code already has an appropriate resource owner (e.g. it
+ runs inside a portal). The query executor uses this path.
+
+* PG_FINALLY (base Instrumentation): when no suitable resource owner
+ exists, or when the caller wants to inspect the instrumentation data
+ even after an error, the base Instrumentation can be used with a
+ PG_TRY/PG_FINALLY block that calls InstrStopFinalize().
+
+Both mechanisms add overhead, so neither is suitable for high-frequency
+instrumentation like per-node measurements in the executor. Instead,
+plan node and trigger children rely on their parent QueryInstrumentation
+for abort safety: they are allocated in the parent's memory context and
+registered in its unfinalized-entries list, so the parent's abort handler
+recovers their data automatically. In normal execution, children are
+finalized explicitly by the caller.
+
+Parallel Query
+--------------
+
+Parallel workers get their own QueryInstrumentation so they can measure
+buffer and WAL activity independently, then copy the totals into dynamic
+shared memory at worker shutdown. The leader accumulates these into its
+own stack.
+
+When per-node instrumentation is active, parallel workers skip per-node
+finalization at shutdown to avoid double-counting; the per-node data is
+aggregated separately through InstrAggNode().
+
+
+Memory Handling
+===============
+
+Instrumentation objects that use the stack must survive until finalization
+runs, including the abort case. To ensure this, QueryInstrumentation
+creates a dedicated "Instrumentation" MemoryContext (instr_cxt) as a child
+of TopMemoryContext. All child instrumentation (nodes, triggers) should be
+allocated in this context.
+
+On successful completion, instr_cxt is reparented to CurrentMemoryContext
+so its lifetime is tied to the caller's context. On abort, the
+ResourceOwner cleanup first resets the stack, then accumulates the
+instrumentation data to the current stack entry, and finally frees instr_cxt.
+
+When the stack is not needed (timer/rows only), Instrumentation allocations
+happen in CurrentMemoryContext instead of TopMemoryContext.
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 7d74f6da402..44d4fea76eb 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -78,6 +78,7 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
/* decls for local routines only used within this module */
static void InitPlan(QueryDesc *queryDesc, int eflags);
static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecFinalizeTriggerInstrumentation(EState *estate);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(QueryDesc *queryDesc,
@@ -254,10 +255,18 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
* Set up query-level instrumentation if extensions have requested it via
* totaltime_options. Ensure an extension has not allocated totaltime
* itself.
+ *
+ * Alternatively, also set it up when running EXPLAIN (ANALYZE), as we
+ * utilize totaltime as the parent for node and trigger instrumentation.
*/
Assert(queryDesc->totaltime == NULL);
- if (queryDesc->totaltime_options)
- queryDesc->totaltime = InstrQueryAlloc(queryDesc->totaltime_options);
+ if (queryDesc->totaltime_options || queryDesc->instrument_options)
+ {
+ estate->es_query_instr = InstrQueryAlloc(queryDesc->instrument_options |
+ queryDesc->totaltime_options);
+
+ queryDesc->totaltime = &estate->es_query_instr->instr;
+ }
/*
* Set up an AFTER-trigger statement context, unless told not to, or
@@ -340,9 +349,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
- /* Allow instrumentation of Executor overall runtime */
- if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ /* Start up instrumentation for this execution run */
+ if (estate->es_query_instr)
+ InstrQueryStart(estate->es_query_instr);
/*
* extract information from the query descriptor and the query feature.
@@ -393,8 +402,8 @@ standard_ExecutorRun(QueryDesc *queryDesc,
if (sendTuples)
dest->rShutdown(dest);
- if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ if (estate->es_query_instr)
+ InstrQueryStop(estate->es_query_instr);
MemoryContextSwitchTo(oldcontext);
}
@@ -443,8 +452,8 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
/* Allow instrumentation of Executor overall runtime */
- if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ if (estate->es_query_instr)
+ InstrQueryStart(estate->es_query_instr);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -453,8 +462,29 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
- if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ if (estate->es_query_instr)
+ {
+ /*
+ * Accumulate per-node and trigger statistics to their respective
+ * parent instrumentation stacks.
+ *
+ * We skip this in parallel workers because their per-node stats are
+ * reported individually via ExecParallelReportInstrumentation, and
+ * the leader's own ExecFinalizeNodeInstrumentation handles
+ * propagation. If we accumulated here, the leader would
+ * double-count: worker parent nodes would already include their
+ * children's stats, and then the leader's accumulation would add the
+ * children again.
+ */
+ if (!IsParallelWorker() && estate->es_instrument)
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
+ InstrQueryStopFinalize(estate->es_query_instr);
+ }
MemoryContextSwitchTo(oldcontext);
@@ -1272,7 +1302,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options)
+ int instrument_options,
+ QueryInstrumentation *qinstr)
{
MemSet(resultRelInfo, 0, sizeof(ResultRelInfo));
resultRelInfo->type = T_ResultRelInfo;
@@ -1293,8 +1324,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
palloc0_array(FmgrInfo, n);
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
- if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
+ if (qinstr)
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, instrument_options, n);
}
else
{
@@ -1367,6 +1398,10 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
* also provides a way for EXPLAIN ANALYZE to report the runtimes of such
* triggers.) So we make additional ResultRelInfo's as needed, and save them
* in es_trig_target_relations.
+ *
+ * Note: if new relation lists are searched here, they must also be added to
+ * ExecFinalizeTriggerInstrumentation so that trigger instrumentation data
+ * is properly accumulated.
*/
ResultRelInfo *
ExecGetTriggerResultRel(EState *estate, Oid relid,
@@ -1433,7 +1468,8 @@ ExecGetTriggerResultRel(EState *estate, Oid relid,
rel,
0, /* dummy rangetable index */
rootRelInfo,
- estate->es_instrument);
+ estate->es_instrument,
+ estate->es_query_instr);
estate->es_trig_target_relations =
lappend(estate->es_trig_target_relations, rInfo);
MemoryContextSwitchTo(oldcontext);
@@ -1496,7 +1532,8 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
/* dummy rangetable index */
InitResultRelInfo(rInfo, ancRel, 0, NULL,
- estate->es_instrument);
+ estate->es_instrument,
+ estate->es_query_instr);
ancResultRels = lappend(ancResultRels, rInfo);
}
ancResultRels = lappend(ancResultRels, rootRelInfo);
@@ -1509,6 +1546,30 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
return resultRelInfo->ri_ancestorResultRels;
}
+static void
+ExecFinalizeTriggerInstrumentation(EState *estate)
+{
+ List *rels = NIL;
+
+ rels = list_concat(rels, estate->es_tuple_routing_result_relations);
+ rels = list_concat(rels, estate->es_opened_result_relations);
+ rels = list_concat(rels, estate->es_trig_target_relations);
+
+ foreach_node(ResultRelInfo, rInfo, rels)
+ {
+ TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
+
+ if (ti == NULL || rInfo->ri_TrigDesc == NULL)
+ continue;
+
+ for (int nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
+ {
+ if (ti[nt].instr.need_stack)
+ InstrAccumStack(&estate->es_query_instr->instr, &ti[nt].instr);
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecPostprocessPlan
*
@@ -3066,6 +3127,7 @@ EvalPlanQualStart(EPQState *epqstate, Plan *planTree)
/* es_trig_target_relations must NOT be copied */
rcestate->es_top_eflags = parentestate->es_top_eflags;
rcestate->es_instrument = parentestate->es_instrument;
+ rcestate->es_query_instr = parentestate->es_query_instr;
/* es_auxmodifytables must NOT be copied */
/*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 5e4a4a9740c..a99e37c98e2 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -700,7 +700,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
- * looking at pgBufferUsage, so do it unconditionally.
+ * looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1081,14 +1081,28 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
instrument = GetInstrumentationArray(instrumentation);
instrument += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
+ {
InstrAggNode(planstate->instrument, &instrument[n]);
+ /*
+ * Also add worker WAL usage to the global pgWalUsage counter.
+ *
+ * When per-node instrumentation is active, parallel workers skip
+ * ExecFinalizeNodeInstrumentation (to avoid double-counting in
+ * EXPLAIN), so per-node WAL activity is not rolled up into the
+ * query-level stats that InstrAccumParallelQuery receives. Without
+ * this, pgWalUsage would under-report WAL generated by parallel
+ * workers when instrumentation is active.
+ */
+ WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
+ }
+
/*
* Also store the per-worker detail.
*
- * Worker instrumentation should be allocated in the same context as the
- * regular instrumentation information, which is the per-query context.
- * Switch into per-query memory context.
+ * Ensure worker instrumentation is allocated in the per-query context. We
+ * don't need to place this in the instrumentation context since no more
+ * stack-based instrumentation work is being done.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
@@ -1238,9 +1252,13 @@ ExecParallelCleanup(ParallelExecutorInfo *pei)
{
/* Accumulate instrumentation, if any. */
if (pei->instrumentation)
+ {
ExecParallelRetrieveInstrumentation(pei->planstate,
pei->instrumentation);
+ ExecFinalizeWorkerInstrumentation(pei->planstate);
+ }
+
/* Accumulate JIT instrumentation, if any. */
if (pei->jit_instrumentation)
ExecParallelRetrieveJitInstrumentation(pei->planstate,
@@ -1462,6 +1480,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1522,7 +1541,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1538,7 +1557,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d96d4f9947b..6888fbe4278 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -586,7 +586,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
partrel,
0,
rootResultRelInfo,
- estate->es_instrument);
+ estate->es_instrument,
+ estate->es_query_instr);
/*
* Verify result relation is a valid target for an INSERT. An UPDATE of a
@@ -1381,7 +1382,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
{
ResultRelInfo *rri = makeNode(ResultRelInfo);
- InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
+ InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0, NULL);
proute->nonleaf_partitions[dispatchidx] = rri;
}
else
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 132fe37ef60..f006931c94d 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -123,6 +123,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
+static bool ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -414,7 +416,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAllocNode(estate->es_instrument,
+ result->instrument = InstrAllocNode(estate->es_query_instr,
+ estate->es_instrument,
result->async_capable);
return result;
@@ -788,10 +791,10 @@ ExecShutdownNode_walker(PlanState *node, void *context)
* at least once already. We don't expect much CPU consumption during
* node shutdown, but in the case of Gather or Gather Merge, we may shut
* down workers at this stage. If so, their buffer usage will get
- * propagated into pgBufferUsage at this point, and we want to make sure
- * that it gets associated with the Gather node. We skip this if the node
- * has never been executed, so as to avoid incorrectly making it appear
- * that it has.
+ * propagated into the current instrumentation stack entry at this point,
+ * and we want to make sure that it gets associated with the Gather node.
+ * We skip this if the node has never been executed, so as to avoid
+ * incorrectly making it appear that it has.
*/
if (node->instrument && node->instrument->running)
InstrStartNode(node->instrument);
@@ -829,6 +832,99 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecFinalizeNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation).
+ *
+ * This must run after the cleanup done by ExecShutdownNode, and must not rely
+ * on any resources cleaned up by it. We also expect shutdown actions to have
+ * occurred, e.g. parallel worker instrumentation to have been added to the leader.
+ */
+void
+ExecFinalizeNodeInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeNodeInstrumentation_walker(node, instr_stack.current);
+}
+
+static bool
+ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ Instrumentation *parent = (Instrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ Assert(node->instrument != NULL);
+
+ /*
+ * Recurse into children first (bottom-up accumulation), and accumulate
+ * to this node's instrumentation as the parent context.
+ */
+ planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
+ &node->instrument->instr);
+
+ InstrFinalizeChild(&node->instrument->instr, parent);
+
+ return false;
+}
+
+/*
+ * ExecFinalizeWorkerInstrumentation
+ *
+ * Accumulate per-worker instrumentation stats from child nodes into their
+ * parents, mirroring what ExecFinalizeNodeInstrumentation does for the
+ * leader's own stats. Without this, per-worker buffer/WAL stats shown by
+ * EXPLAIN (ANALYZE, VERBOSE) would only reflect each node's own direct
+ * activity, not including children.
+ *
+ * This must run after ExecParallelRetrieveInstrumentation has populated
+ * worker_instrument for all nodes in the parallel subtree.
+ */
+void
+ExecFinalizeWorkerInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeWorkerInstrumentation_walker(node, NULL);
+}
+
+static bool
+ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context)
+{
+ PlanState *parent = (PlanState *) context;
+ int num_workers;
+
+ if (node == NULL)
+ return false;
+
+ /*
+ * Recurse into children first (bottom-up accumulation), passing this node
+ * as parent context if it has worker_instrument, otherwise pass through
+ * the previous parent.
+ */
+ planstate_tree_walker(node, ExecFinalizeWorkerInstrumentation_walker,
+ node->worker_instrument ? (void *) node : context);
+
+ if (!node->worker_instrument)
+ return false;
+
+ num_workers = node->worker_instrument->num_workers;
+
+ /* Accumulate this node's per-worker stats to parent's per-worker stats */
+ if (parent && parent->worker_instrument)
+ {
+ int parent_workers = parent->worker_instrument->num_workers;
+
+ for (int n = 0; n < Min(num_workers, parent_workers); n++)
+ InstrAccumStack(&parent->worker_instrument->instrument[n].instr,
+ &node->worker_instrument->instrument[n].instr);
+ }
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 1eb6b9f1f40..8db2b70e5fe 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -151,6 +151,7 @@ CreateExecutorState(void)
estate->es_top_eflags = 0;
estate->es_instrument = 0;
+ estate->es_query_instr = NULL;
estate->es_finished = false;
estate->es_exprcontexts = NIL;
@@ -227,6 +228,15 @@ FreeExecutorState(EState *estate)
estate->es_partition_directory = NULL;
}
+ /*
+ * Make sure the instrumentation context gets freed. This usually gets
+ * re-parented under the per-query context in InstrQueryStopFinalize, but
+ * that won't happen during EXPLAIN (BUFFERS) since ExecutorFinish never
+ * gets called, so we would otherwise leak it in TopMemoryContext.
+ */
+ if (estate->es_query_instr && estate->es_query_instr->instr.need_stack)
+ MemoryContextDelete(estate->es_query_instr->instr_cxt);
+
/*
* Free the per-query memory context, thereby releasing all working
* memory, including the EState node itself.
@@ -913,7 +923,8 @@ ExecInitResultRelation(EState *estate, ResultRelInfo *resultRelInfo,
resultRelationDesc,
rti,
NULL,
- estate->es_instrument);
+ estate->es_instrument,
+ estate->es_query_instr);
if (estate->es_result_relations == NULL)
estate->es_result_relations = (ResultRelInfo **)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 011a9684df0..b84c552c6f8 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,31 +16,53 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
-BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
-static WalUsage save_pgWalUsage;
+Instrumentation instr_top;
+InstrStackState instr_stack = {
+ .stack_space = 0,
+ .stack_size = 0,
+ .entries = NULL,
+ .current = &instr_top,
+};
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
-static void WalUsageAdd(WalUsage *dst, WalUsage *add);
+void
+InstrStackGrow(void)
+{
+ int space = instr_stack.stack_space;
+
+ Assert(instr_stack.stack_size >= instr_stack.stack_space);
+
+ if (instr_stack.entries == NULL)
+ {
+ space = 10; /* Allocate sufficient initial space for
+ * typical activity */
+ instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
+ sizeof(Instrumentation *) * space);
+ }
+ else
+ {
+ space *= 2;
+ instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, space);
+ }
+ /* Update stack space after allocation succeeded to protect against OOMs */
+ instr_stack.stack_space = space;
+}
/* General purpose instrumentation handling */
-Instrumentation *
-InstrAlloc(int instrument_options)
+static inline bool
+InstrNeedStack(int instrument_options)
{
- Instrumentation *instr = palloc0_object(Instrumentation);
-
- InstrInitOptions(instr, instrument_options);
- return instr;
+ return (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) != 0;
}
void
InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
+ instr->need_stack = InstrNeedStack(instrument_options);
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
@@ -55,12 +77,8 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT(instr->starttime);
}
- /* save buffer usage totals at start, if needed */
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
-
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ if (instr->need_stack)
+ InstrPushStack(instr);
}
/*
@@ -84,14 +102,9 @@ InstrStopCommon(Instrumentation *instr, instr_time *accum_time)
INSTR_TIME_SET_ZERO(instr->starttime);
}
- /* Add delta of buffer usage since InstrStart to the totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
-
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ /* pop the stack, unless InstrStopFinalize previously cleaned up */
+ if (instr->on_stack)
+ InstrPopStack(instr);
}
void
@@ -100,16 +113,279 @@ InstrStop(Instrumentation *instr)
InstrStopCommon(instr, &instr->total);
}
+/*
+ * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
+ *
+ * Note that this intentionally allows passing a stack that is not the current
+ * top, as can happen with PG_FINALLY, or resource owners, which don't have a
+ * guaranteed cleanup order.
+ */
+void
+InstrStopFinalize(Instrumentation *instr)
+{
+ /*
+ * If our current node is on the stack, make sure we reset the stack to
+ * the parent of whichever of the released stack entries has the lowest
+ * index
+ */
+ if (instr->on_stack)
+ {
+ int idx = -1;
+
+ for (int i = instr_stack.stack_size - 1; i >= 0; i--)
+ {
+ if (instr_stack.entries[i] == instr)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ if (idx < 0)
+ elog(ERROR, "instrumentation entry not found on stack");
+
+ /* Clear on_stack for any intermediate entries we're skipping over */
+ for (int i = instr_stack.stack_size - 1; i > idx; i--)
+ instr_stack.entries[i]->on_stack = false;
+
+ while (instr_stack.stack_size > idx + 1)
+ instr_stack.stack_size--;
+ }
+
+ InstrStop(instr);
+
+ /*
+ * Accumulate all instrumentation to the currently active instrumentation,
+ * so that callers get a complete picture of activity, even after an abort
+ */
+ InstrAccumStack(instr_stack.current, instr);
+}
+
+/*
+ * Finalize child instrumentation by accumulating buffer/WAL usage to the
+ * provided instrumentation, which may be the current entry, or one the caller
+ * treats as a parent and will add to the totals later.
+ *
+ * Also deletes the unfinalized entry to avoid double counting in an abort
+ * situation, e.g. during executor finish.
+ */
+void
+InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent)
+{
+ if (instr->need_stack)
+ {
+ if (!dlist_node_is_detached(&instr->unfinalized_entry))
+ dlist_delete_thoroughly(&instr->unfinalized_entry);
+
+ InstrAccumStack(parent, instr);
+ }
+}
+
+
+/* Query instrumentation handling */
+
+/*
+ * Use ResourceOwner mechanism to correctly reset instr_stack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+ .release_priority = RELEASE_PRIO_INSTRUMENTATION,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ QueryInstrumentation *qinstr = (QueryInstrumentation *) DatumGetPointer(res);
+ MemoryContext instr_cxt = qinstr->instr_cxt;
+ dlist_mutable_iter iter;
+
+ /* Accumulate data from all unfinalized child entries (nodes, triggers) */
+ dlist_foreach_modify(iter, &qinstr->unfinalized_entries)
+ {
+ Instrumentation *child = dlist_container(Instrumentation, unfinalized_entry, iter.cur);
+
+ InstrAccumStack(&qinstr->instr, child);
+ }
+
+ /* Ensure the stack is reset as expected, and we accumulate to the parent */
+ InstrStopFinalize(&qinstr->instr);
+
+ /*
+ * Destroy the dedicated instrumentation context, which frees the
+ * QueryInstrumentation and all child allocations.
+ */
+ MemoryContextDelete(instr_cxt);
+}
+
+QueryInstrumentation *
+InstrQueryAlloc(int instrument_options)
+{
+ QueryInstrumentation *instr;
+ MemoryContext instr_cxt;
+
+ /*
+ * When the instrumentation stack is used, create a dedicated memory
+ * context for this query's instrumentation allocations. This context is a
+ * child of TopMemoryContext so it survives transaction abort —
+ * ResourceOwner release needs to access it.
+ *
+ * For simpler cases (timer/rows only), use the current memory context.
+ *
+ * All child instrumentation allocations (nodes, triggers, etc) must be
+ * allocated within this context to ensure correct clean up on abort.
+ */
+ if (InstrNeedStack(instrument_options))
+ instr_cxt = AllocSetContextCreate(TopMemoryContext,
+ "Instrumentation",
+ ALLOCSET_SMALL_SIZES);
+ else
+ instr_cxt = CurrentMemoryContext;
+
+ instr = MemoryContextAllocZero(instr_cxt, sizeof(QueryInstrumentation));
+ instr->instrument_options = instrument_options;
+ instr->instr_cxt = instr_cxt;
+
+ InstrInitOptions(&instr->instr, instrument_options);
+ dlist_init(&instr->unfinalized_entries);
+
+ return instr;
+}
+
+void
+InstrQueryStart(QueryInstrumentation *qinstr)
+{
+ InstrStart(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(CurrentResourceOwner != NULL);
+ qinstr->owner = CurrentResourceOwner;
+
+ ResourceOwnerEnlarge(qinstr->owner);
+ ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
+ }
+}
+
+void
+InstrQueryStop(QueryInstrumentation *qinstr)
+{
+ InstrStop(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+ }
+}
+
+void
+InstrQueryStopFinalize(QueryInstrumentation *qinstr)
+{
+ InstrStopFinalize(&qinstr->instr);
+
+ if (!qinstr->instr.need_stack)
+ {
+ Assert(qinstr->owner == NULL);
+ return;
+ }
+
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+
+ /*
+ * Reparent the dedicated instrumentation context under the current memory
+ * context, so that its lifetime is now tied to the caller's context
+ * rather than TopMemoryContext.
+ */
+ MemoryContextSetParent(qinstr->instr_cxt, CurrentMemoryContext);
+}
+
+/*
+ * Register a child Instrumentation entry for abort processing.
+ *
+ * On abort, ResOwnerReleaseInstrumentation will walk the parent's list to
+ * recover buffer/WAL data from entries that were never finalized, in order for
+ * aggregate totals to be accurate despite the query erroring out.
+ */
+void
+InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *child)
+{
+ if (child->need_stack)
+ dlist_push_head(&parent->unfinalized_entries, &child->unfinalized_entry);
+}
+
+/* start instrumentation during parallel executor startup */
+QueryInstrumentation *
+InstrStartParallelQuery(void)
+{
+ QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrQueryStart(qinstr);
+ return qinstr;
+}
+
+/* report usage after parallel executor shutdown */
+void
+InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+{
+ InstrQueryStopFinalize(qinstr);
+ memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+}
+
+/*
+ * Accumulate work done by parallel workers in the leader's stats.
+ *
+ * Note that what gets added here effectively depends on whether per-node
+ * instrumentation is active. If it's active the parallel worker intentionally
+ * skips ExecFinalizeNodeInstrumentation on executor shutdown, because it would
+ * cause double counting. Instead, this only accumulates any extra activity
+ * outside of nodes.
+ *
+ * Otherwise this is responsible for making sure that the complete query
+ * activity is accumulated.
+ */
+void
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+{
+ BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
+ WalUsageAdd(&instr_stack.current->walusage, walusage);
+
+ WalUsageAdd(&pgWalUsage, walusage);
+}
+
/* Node instrumentation handling */
/* Allocate new node instrumentation structure */
NodeInstrumentation *
-InstrAllocNode(int instrument_options, bool async_mode)
+InstrAllocNode(QueryInstrumentation *qinstr, int instrument_options,
+ bool async_mode)
{
- NodeInstrumentation *instr = palloc_object(NodeInstrumentation);
+ NodeInstrumentation *instr = MemoryContextAlloc(qinstr->instr_cxt, sizeof(NodeInstrumentation));
InstrInitNode(instr, instrument_options, async_mode);
+ InstrQueryRememberChild(qinstr, &instr->instr);
+
return instr;
}
@@ -129,6 +405,7 @@ InstrStartNode(NodeInstrumentation *instr)
InstrStart(&instr->instr);
}
+
/* Exit from a plan node */
void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
@@ -180,8 +457,8 @@ InstrEndLoop(NodeInstrumentation *instr)
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrEndLoop called on running node");
+ /* Ensure InstrStopNode was called */
+ Assert(INSTR_TIME_IS_ZERO(instr->instr.starttime));
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
@@ -214,22 +491,30 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
dst->nfiltered1 += add->nfiltered1;
dst->nfiltered2 += add->nfiltered2;
- if (dst->instr.need_bufusage)
- BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
-
- if (dst->instr.need_walusage)
- WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
+ if (dst->instr.need_stack)
+ InstrAccumStack(&dst->instr, &add->instr);
}
/* Trigger instrumentation handling */
TriggerInstrumentation *
-InstrAllocTrigger(int n, int instrument_options)
+InstrAllocTrigger(QueryInstrumentation *qinstr, int instrument_options, int n)
{
- TriggerInstrumentation *tginstr = palloc0_array(TriggerInstrumentation, n);
+ TriggerInstrumentation *tginstr;
int i;
+ /*
+ * Allocate in the query's dedicated instrumentation context so all
+ * instrumentation data is grouped together and cleaned up as a unit.
+ */
+ Assert(qinstr != NULL && qinstr->instr_cxt != NULL);
+ tginstr = MemoryContextAllocZero(qinstr->instr_cxt,
+ n * sizeof(TriggerInstrumentation));
+
for (i = 0; i < n; i++)
+ {
InstrInitOptions(&tginstr[i].instr, instrument_options);
+ InstrQueryRememberChild(qinstr, &tginstr[i].instr);
+ }
return tginstr;
}
@@ -243,38 +528,30 @@ InstrStartTrigger(TriggerInstrumentation *tginstr)
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int64 firings)
{
+ /*
+ * This trigger may be called again, so we don't finalize instrumentation
+ * here. Accumulation to the parent happens at ExecutorFinish through
+ * ExecFinalizeTriggerInstrumentation.
+ */
InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
-/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrAccumStack(Instrumentation *dst, Instrumentation *add)
{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
-}
+ Assert(dst != NULL);
+ Assert(add != NULL);
-/* report usage after parallel executor shutdown */
-void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
-}
+ if (!add->need_stack)
+ return;
-/* accumulate work done by workers in leader's stats */
-void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- BufferUsageAdd(&pgBufferUsage, bufusage);
- WalUsageAdd(&pgWalUsage, walusage);
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -295,39 +572,9 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
+/* dst += add */
void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
-/* helper functions for WAL usage accumulation */
-static void
-WalUsageAdd(WalUsage *dst, WalUsage *add)
+WalUsageAdd(WalUsage *dst, const WalUsage *add)
{
dst->wal_bytes += add->wal_bytes;
dst->wal_records += add->wal_records;
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b38170f0fbe..a829ddf5acb 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -904,7 +904,7 @@ create_edata_for_relation(LogicalRepRelMapEntry *rel)
* Use Relation opened by logicalrep_rel_open() instead of opening it
* again.
*/
- InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
+ InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0, NULL);
/*
* We put the ResultRelInfo in the es_opened_result_relations list, even
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3e1c39160db..cf4f4246ca2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1266,9 +1266,9 @@ PinBufferForBlock(Relation rel,
if (rel)
{
/*
- * While pgBufferUsage's "read" counter isn't bumped unless we reach
- * WaitReadBuffers() (so, not for hits, and not for buffers that are
- * zeroed instead), the per-relation stats always count them.
+ * While the current buffer usage "read" counter isn't bumped unless
+ * we reach WaitReadBuffers() (so, not for hits, and not for buffers
+ * that are zeroed instead), the per-relation stats always count them.
*/
pgstat_count_buffer_read(rel);
}
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index e3829d7fe7c..e7fc7f071d8 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -114,9 +114,9 @@ pgstat_prepare_io_time(bool track_io_guc)
* pg_stat_database only counts block read and write times, these are done for
* IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
*
- * pgBufferUsage is used for EXPLAIN. pgBufferUsage has write and read stats
- * for shared, local and temporary blocks. pg_stat_io does not track the
- * activity of temporary blocks, so these are ignored here.
+ * Executor instrumentation is used for EXPLAIN. Buffer usage tracked there has
+ * write and read stats for shared, local and temporary blocks. pg_stat_io
+ * does not track the activity of temporary blocks, so these are ignored here.
*/
void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 491c4886506..78961ae058b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -233,6 +233,7 @@ ExecGetJunkAttribute(TupleTableSlot *slot, AttrNumber attno, bool *isNull)
/*
* prototypes from functions in execMain.c
*/
+typedef struct QueryInstrumentation QueryInstrumentation;
extern void ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void standard_ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void ExecutorRun(QueryDesc *queryDesc,
@@ -254,7 +255,8 @@ extern void InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options);
+ int instrument_options,
+ QueryInstrumentation *qinstr);
extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid,
ResultRelInfo *rootRelInfo);
extern List *ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo);
@@ -301,6 +303,8 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecFinalizeNodeInstrumentation(PlanState *node);
+extern void ExecFinalizeWorkerInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 5261356dba6..f5cc6fb662b 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -13,6 +13,7 @@
#ifndef INSTRUMENT_H
#define INSTRUMENT_H
+#include "lib/ilist.h"
#include "portability/instr_time.h"
@@ -68,29 +69,92 @@ typedef enum InstrumentOption
} InstrumentOption;
/*
- * General purpose instrumentation that can capture time and WAL/buffer usage
+ * Instrumentation base class for capturing time and WAL/buffer usage
*
- * Initialized through InstrAlloc, followed by one or more calls to a pair of
- * InstrStart/InstrStop (activity is measured in between).
+ * If used directly:
+ * - Allocate on the stack and zero initialize the struct
+ * - Call InstrInitOptions to set instrumentation options
+ * - Call InstrStart before the activity you want to measure
+ * - Call InstrStop / InstrStopFinalize after the activity to capture totals
+ *
+ * InstrStart/InstrStop may be called multiple times. The last stop call must
+ * be to InstrStopFinalize to ensure parent stack entries get the accumulated
+ * totals. If there is risk of transaction aborts you must call
+ * InstrStopFinalize in a PG_TRY/PG_FINALLY block to avoid corrupting the
+ * instrumentation stack.
+ *
+ * In a query context use QueryInstrumentation instead, which handles aborts
+ * using the resource owner logic.
*/
typedef struct Instrumentation
{
/* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
- bool need_bufusage; /* true if we need buffer usage data */
- bool need_walusage; /* true if we need WAL usage data */
+ bool need_stack; /* true if we need WAL/buffer usage data */
/* Internal state keeping: */
+ bool on_stack; /* true if currently on instr_stack */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
+ /* Abort handling: link in parent QueryInstrumentation's unfinalized list */
+ dlist_node unfinalized_entry;
} Instrumentation;
+/*
+ * Query-related instrumentation tracking.
+ *
+ * Usage:
+ * - Allocate on the heap using InstrQueryAlloc (required for abort handling)
+ * - Call InstrQueryStart before the activity you want to measure
+ * - Call InstrQueryStop / InstrQueryStopFinalize afterwards to capture totals
+ *
+ * InstrQueryStart/InstrQueryStop may be called multiple times. The last stop
+ * call must be to InstrQueryStopFinalize to ensure parent stack entries get
+ * the accumulated totals.
+ *
+ * Uses the resource owner mechanism for handling aborts. As such, the caller
+ * *must not* exit out of the top-level transaction after having called
+ * InstrQueryStart without first calling InstrQueryStop or
+ * InstrQueryStopFinalize. In the case of a transaction abort, logic equivalent
+ * to InstrQueryStopFinalize will be called automatically.
+ */
+struct ResourceOwnerData;
+typedef struct QueryInstrumentation
+{
+ Instrumentation instr;
+
+ /* Original instrument_options flags used to create this instrumentation */
+ int instrument_options;
+
+ /* Resource owner used for cleanup for aborts between InstrStart/InstrStop */
+ struct ResourceOwnerData *owner;
+
+ /*
+ * Dedicated memory context for all instrumentation allocations belonging
+ * to this query (node instrumentation, trigger instrumentation, etc.).
+ * Initially a child of TopMemoryContext so that it survives transaction
+ * abort for ResourceOwner cleanup; it is then reassigned to the current
+ * memory context on InstrQueryStopFinalize.
+ */
+ MemoryContext instr_cxt;
+
+ /*
+ * Child entries that need to be cleaned up on abort, since they are not
+ * registered with a resource owner themselves. Contains both node and
+ * trigger instrumentation entries linked via instr.unfinalized_entry.
+ */
+ dlist_head unfinalized_entries;
+} QueryInstrumentation;
+
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Relies on an outer QueryInstrumentation having been set up to handle the
+ * stack used for WAL/buffer usage statistics, and relies on it for managing
+ * aborts. Solely intended for the executor and anyone reporting about its
+ * activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -111,6 +175,10 @@ typedef struct NodeInstrumentation
double nfiltered2; /* # of tuples removed by "other" quals */
} NodeInstrumentation;
+/*
+ * Care must be taken with any pointers contained within this struct, as this
+ * gets copied across processes during parallel query execution.
+ */
typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
@@ -124,15 +192,105 @@ typedef struct TriggerInstrumentation
* was fired */
} TriggerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
+/*
+ * Dynamic array-based stack for tracking current WAL/buffer usage context.
+ *
+ * When the stack is empty, 'current' points to instr_top which accumulates
+ * session-level totals.
+ */
+typedef struct InstrStackState
+{
+ int stack_space; /* allocated capacity of entries array */
+ int stack_size; /* current number of entries */
+
+ Instrumentation **entries; /* dynamic array of pointers */
+ Instrumentation *current; /* top of stack, or &instr_top when empty */
+} InstrStackState;
+
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int instrument_options);
+/*
+ * The top instrumentation represents a running total of the current backend
+ * WAL/buffer usage information. This will not be updated immediately, but
+ * rather when the current stack entry gets accumulated which typically happens
+ * at query end.
+ *
+ * Care must be taken when utilizing this in the parallel worker context:
+ * Parallel workers will report back their instrumentation to the caller,
+ * and this gets added to the caller's stack. If this were to be used in the
+ * shared memory stats infrastructure it would need to be skipped on parallel
+ * workers to avoid double counting.
+ */
+extern PGDLLIMPORT Instrumentation instr_top;
+
+/*
+ * The instrumentation stack state. The 'current' field points to the
+ * currently active stack entry that is getting updated as activity happens,
+ * and will be accumulated to parent stacks when it gets finalized by
+ * InstrStopFinalize (non-executor use cases), ExecFinalizeNodeInstrumentation
+ * (executor finish) or ResOwnerReleaseInstrumentation on abort.
+ */
+extern PGDLLIMPORT InstrStackState instr_stack;
+
+extern void InstrStackGrow(void);
+
+/*
+ * Pushes the stack so that all WAL/buffer usage updates go to the passed in
+ * instrumentation entry.
+ *
+ * See note on InstrPopStack regarding safe use of these functions.
+ */
+static inline void
+InstrPushStack(Instrumentation *instr)
+{
+ if (unlikely(instr_stack.stack_size == instr_stack.stack_space))
+ InstrStackGrow();
+
+ instr_stack.entries[instr_stack.stack_size++] = instr;
+ instr_stack.current = instr;
+ instr->on_stack = true;
+}
+
+/*
+ * Pops the stack entry back to the previous one that was effective at
+ * InstrPushStack.
+ *
+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, consider instead using InstrStart + InstrStopFinalize which can skip
+ * intermediate stack entries.
+ */
+static inline void
+InstrPopStack(Instrumentation *instr)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.entries[instr_stack.stack_size - 1] == instr);
+ instr_stack.stack_size--;
+ instr_stack.current = instr_stack.stack_size > 0
+ ? instr_stack.entries[instr_stack.stack_size - 1]
+ : &instr_top;
+ instr->on_stack = false;
+}
+
extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
extern void InstrStart(Instrumentation *instr);
extern void InstrStop(Instrumentation *instr);
+extern void InstrStopFinalize(Instrumentation *instr);
+extern void InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent);
+extern void InstrAccumStack(Instrumentation *dst, Instrumentation *add);
-extern NodeInstrumentation *InstrAllocNode(int instrument_options,
+extern QueryInstrumentation *InstrQueryAlloc(int instrument_options);
+extern void InstrQueryStart(QueryInstrumentation *instr);
+extern void InstrQueryStop(QueryInstrumentation *instr);
+extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
+extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
+
+pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+
+extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr,
+ int instrument_options,
bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options,
bool async_mode);
@@ -142,35 +300,36 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
-extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
+extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr,
+ int instrument_options, int n);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int64 firings);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+extern void WalUsageAdd(WalUsage *dst, const WalUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
+ instr_stack.current->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += (val); \
+ instr_stack.current->bufusage.fld += (val); \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ INSTR_TIME_ADD(instr_stack.current->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ INSTR_TIME_ACCUM_DIFF(instr_stack.current->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ instr_stack.current->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += (val); \
+ instr_stack.current->walusage.fld += (val); \
} while(0)
#endif /* INSTRUMENT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3ecae7552fc..491c4e272d8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -54,6 +54,7 @@ typedef struct Instrumentation Instrumentation;
typedef struct pairingheap pairingheap;
typedef struct PlanState PlanState;
typedef struct QueryEnvironment QueryEnvironment;
+typedef struct QueryInstrumentation QueryInstrumentation;
typedef struct RelationData *Relation;
typedef Relation *RelationPtr;
typedef struct ScanKeyData ScanKeyData;
@@ -754,6 +755,7 @@ typedef struct EState
int es_top_eflags; /* eflags passed to ExecutorStart */
int es_instrument; /* OR of InstrumentOption flags */
+ QueryInstrumentation *es_query_instr; /* query-level instrumentation */
bool es_finished; /* true when ExecutorFinish is done */
List *es_exprcontexts; /* List of ExprContexts within EState */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index eb6033b4fdb..5463bc921f0 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -75,6 +75,7 @@ typedef uint32 ResourceReleasePriority;
#define RELEASE_PRIO_SNAPSHOT_REFS 500
#define RELEASE_PRIO_FILES 600
#define RELEASE_PRIO_WAITEVENTSETS 700
+#define RELEASE_PRIO_INSTRUMENTATION 800
/* 0 is considered invalid */
#define RELEASE_PRIO_FIRST 1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 35acda59851..b639c360cea 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1357,6 +1357,7 @@ InjectionPointSharedState
InjectionPointsCtl
InlineCodeBlock
InsertStmt
+InstrStackState
Instrumentation
Int128AggState
Int8TransTypeData
@@ -2479,6 +2480,7 @@ QueryCompletion
QueryDesc
QueryEnvironment
QueryInfo
+QueryInstrumentation
QueryItem
QueryItemType
QueryMode
--
2.47.1
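
For readers following the instrument.h changes above, here is a minimal standalone sketch of the stack idea: counters are always charged to the current entry, and an entry's totals are added to its parent when it is finalized. All names (SketchInstr, sketch_push, sketch_pop_finalize, and so on) are hypothetical and not part of the patch. The sketch also folds "pop" and "finalize" into one step for brevity, whereas the patch keeps InstrPopStack separate from the finalize functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for the patch's Instrumentation struct. */
typedef struct SketchInstr
{
	long		shared_blks_hit;	/* stand-in for the BufferUsage counters */
} SketchInstr;

/* Hypothetical stand-in for InstrStackState. */
typedef struct SketchStack
{
	int			space;			/* allocated capacity of entries array */
	int			size;			/* current number of entries */
	SketchInstr **entries;		/* dynamic array of pointers */
	SketchInstr *current;		/* top of stack, or &sketch_top when empty */
} SketchStack;

static SketchInstr sketch_top;	/* running total for the whole session */
static SketchStack sketch_stack = {0, 0, NULL, &sketch_top};

/* Make 'instr' the entry that receives all subsequent counter updates. */
static void
sketch_push(SketchInstr *instr)
{
	if (sketch_stack.size == sketch_stack.space)
	{
		int			new_space = sketch_stack.space > 0 ? sketch_stack.space * 2 : 4;
		SketchInstr **grown = realloc(sketch_stack.entries,
									  (size_t) new_space * sizeof *grown);

		if (grown == NULL)
			abort();
		sketch_stack.entries = grown;
		sketch_stack.space = new_space;
	}
	sketch_stack.entries[sketch_stack.size++] = instr;
	sketch_stack.current = instr;
}

/* Pop 'instr' and add its counters to the entry that becomes current. */
static void
sketch_pop_finalize(SketchInstr *instr)
{
	assert(sketch_stack.size > 0);
	assert(sketch_stack.entries[sketch_stack.size - 1] == instr);
	sketch_stack.size--;
	sketch_stack.current = sketch_stack.size > 0
		? sketch_stack.entries[sketch_stack.size - 1]
		: &sketch_top;
	sketch_stack.current->shared_blks_hit += instr->shared_blks_hit;
}

/* Same idea as INSTR_BUFUSAGE_INCR: one increment on the current entry. */
static void
sketch_count_hit(void)
{
	sketch_stack.current->shared_blks_hit++;
}

/* One hit in the "query" entry, two in a nested "node" entry. */
static long
sketch_demo(void)
{
	SketchInstr query = {0};
	SketchInstr node = {0};

	sketch_push(&query);
	sketch_count_hit();				/* charged to query */
	sketch_push(&node);
	sketch_count_hit();				/* charged to node */
	sketch_count_hit();				/* charged to node */
	sketch_pop_finalize(&node);		/* query now holds its own hit plus node's */
	sketch_pop_finalize(&query);	/* session total receives everything */
	return sketch_top.shared_blks_hit;
}
```

Calling sketch_demo() leaves the stack empty again, with 'current' pointing back at the session-level entry. No start/stop snapshot diffing is needed on the hot path, which is the overhead the cover letter says this series aims to remove.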
[application/octet-stream] v15-0009-Add-test_session_buffer_usage-test-module.patch (30.0K, 11-v15-0009-Add-test_session_buffer_usage-test-module.patch)
From c95f0246b23f14205bb5eb68014b8d080c01cc03 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:41 -0800
Subject: [PATCH v15 9/9] Add test_session_buffer_usage test module
This is intended for testing instrumentation-related logic as it pertains
to the top-level stack that is maintained as a running total. There is
currently no in-core user that utilizes the top-level values in this
manner. The module is especially useful for abort situations, where it
helps ensure we don't lose activity due to incorrect handling of
unfinalized node stacks.
---
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
.../test_session_buffer_usage/Makefile | 23 ++
.../expected/test_session_buffer_usage.out | 342 ++++++++++++++++++
.../test_session_buffer_usage/meson.build | 33 ++
.../sql/test_session_buffer_usage.sql | 245 +++++++++++++
.../test_session_buffer_usage--1.0.sql | 31 ++
.../test_session_buffer_usage.c | 95 +++++
.../test_session_buffer_usage.control | 5 +
9 files changed, 776 insertions(+)
create mode 100644 src/test/modules/test_session_buffer_usage/Makefile
create mode 100644 src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
create mode 100644 src/test/modules/test_session_buffer_usage/meson.build
create mode 100644 src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
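
To illustrate the property this test module checks, the following standalone sketch shows one way activity can be kept when an error leaves entries on the stack: unwind down to the outermost affected entry and fold each popped entry into the one below it. This is only an assumption about the general shape of such cleanup, not the patch's actual abort code, and every name in it (AbortSketchEntry, abort_sketch_unwind, and so on) is hypothetical.

```c
#include <assert.h>

#define ABORT_SKETCH_MAX_DEPTH 8

/* Hypothetical stand-in for one instrumentation stack entry. */
typedef struct AbortSketchEntry
{
	long		local_blks_hit;
} AbortSketchEntry;

static AbortSketchEntry abort_sketch_top;	/* session-level running total */
static AbortSketchEntry *abort_sketch_entries[ABORT_SKETCH_MAX_DEPTH];
static int	abort_sketch_size = 0;

/* Entry that currently receives counter updates. */
static AbortSketchEntry *
abort_sketch_current(void)
{
	return abort_sketch_size > 0
		? abort_sketch_entries[abort_sketch_size - 1]
		: &abort_sketch_top;
}

static void
abort_sketch_push(AbortSketchEntry *entry)
{
	assert(abort_sketch_size < ABORT_SKETCH_MAX_DEPTH);
	abort_sketch_entries[abort_sketch_size++] = entry;
}

/*
 * Unwind until 'target' has been popped, folding every popped entry into the
 * entry below it. If 'target' is not on the stack, everything is unwound.
 */
static void
abort_sketch_unwind(AbortSketchEntry *target)
{
	while (abort_sketch_size > 0)
	{
		AbortSketchEntry *popped = abort_sketch_entries[--abort_sketch_size];

		abort_sketch_current()->local_blks_hit += popped->local_blks_hit;
		if (popped == target)
			break;
	}
}

/* A "trigger" entry is still on the stack when the error is raised. */
static long
abort_sketch_demo(void)
{
	AbortSketchEntry query = {0};
	AbortSketchEntry trigger = {0};

	abort_sketch_push(&query);
	abort_sketch_current()->local_blks_hit += 2;	/* charged to query */
	abort_sketch_push(&trigger);
	abort_sketch_current()->local_blks_hit += 5;	/* charged to trigger */

	/* Error path: the trigger entry never got its regular stop call. */
	abort_sketch_unwind(&query);
	return abort_sketch_top.local_blks_hit;
}
```

In this sketch the session total ends up with the activity of both entries even though the inner one was never stopped normally. The regression tests below probe the same kind of outcome through test_session_buffer_usage() after each error.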
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index f1b04c99969..e74e327701b 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -48,6 +48,7 @@ SUBDIRS = \
test_resowner \
test_rls_hooks \
test_saslprep \
+ test_session_buffer_usage \
test_shmem \
test_shm_mq \
test_slru \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index fc99552d9ab..5c46ec13918 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -49,6 +49,7 @@ subdir('test_regex')
subdir('test_resowner')
subdir('test_rls_hooks')
subdir('test_saslprep')
+subdir('test_session_buffer_usage')
subdir('test_shmem')
subdir('test_shm_mq')
subdir('test_slru')
diff --git a/src/test/modules/test_session_buffer_usage/Makefile b/src/test/modules/test_session_buffer_usage/Makefile
new file mode 100644
index 00000000000..1252b222cb9
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_session_buffer_usage/Makefile
+
+MODULE_big = test_session_buffer_usage
+OBJS = \
+ $(WIN32RES) \
+ test_session_buffer_usage.o
+
+EXTENSION = test_session_buffer_usage
+DATA = test_session_buffer_usage--1.0.sql
+PGFILEDESC = "test_session_buffer_usage - show buffer usage statistics for the current session"
+
+REGRESS = test_session_buffer_usage
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_session_buffer_usage
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out b/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
new file mode 100644
index 00000000000..5f7d349871a
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
@@ -0,0 +1,342 @@
+LOAD 'test_session_buffer_usage';
+CREATE EXTENSION test_session_buffer_usage;
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM test_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+ ok
+----
+ t
+(1 row)
+
+-- Verify counters increase after buffer activity
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM test_session_buffer_usage();
+ blocks_increased
+------------------
+ t
+(1 row)
+
+DROP TABLE test_buf_activity;
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM test_session_buffer_usage(), dc_serial_result s;
+ leader_buffers_match
+----------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_dc_tab, dc_serial_result;
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+ERROR: division by zero
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM test_session_buffer_usage();
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+DROP TABLE exc_tab;
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+ERROR: duplicate key value violates unique constraint "unique_tab_a_key"
+DETAIL: Key (a)=(1) already exists.
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM test_session_buffer_usage();
+ unique_violation_buffers_visible
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE unique_tab;
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT subxact_exc_func();
+ subxact_exc_func
+------------------
+ caught
+(1 row)
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM test_session_buffer_usage();
+ subxact_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT cursor_exc_func();
+ cursor_exc_func
+-----------------------
+ caught after 250 rows
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM test_session_buffer_usage();
+ cursor_subxact_buffers_visible
+--------------------------------
+ t
+(1 row)
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM test_session_buffer_usage();
+-- Now trigger the same scan via a trigger that errors
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+ERROR: trigger error
+CONTEXT: PL/pgSQL function trig_err_func() line 4 at RAISE
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM test_session_buffer_usage(), trig_serial_result s;
+ trigger_abort_buffers_propagated
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CONTEXT: parallel worker
+RESET debug_parallel_query;
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM test_session_buffer_usage(), par_abort_serial_result s;
+ worker_abort_buffers_not_propagated
+-------------------------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_abort_tab, par_abort_serial_result;
+-- Cleanup
+DROP EXTENSION test_session_buffer_usage;
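
The two parallel-query cases in the expected output above can be pictured with a small standalone model. It assumes only what the test comments state: workers report their usage back at the end, the leader adds the reported values to its own totals, and a worker that aborts never reaches the reporting step. The names and block counts are made up for illustration and do not correspond to functions in the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define PAR_SKETCH_WORKERS 2

/* Hypothetical stand-in for the usage a worker reports back. */
typedef struct ParSketchUsage
{
	long		shared_blks_hit;
} ParSketchUsage;

/* Stand-in for per-worker slots in shared memory. */
static ParSketchUsage par_sketch_slots[PAR_SKETCH_WORKERS];

/* A worker scans 'blocks' blocks and reports them only if it finishes. */
static void
par_sketch_run_worker(int worker, long blocks, bool aborts)
{
	ParSketchUsage local = {0};

	local.shared_blks_hit += blocks;
	if (aborts)
		return;					/* error path: the reporting step never runs */
	par_sketch_slots[worker] = local;
}

/* The leader adds whatever the workers reported to its own total. */
static long
par_sketch_leader_total(void)
{
	ParSketchUsage leader = {0};

	for (int i = 0; i < PAR_SKETCH_WORKERS; i++)
		leader.shared_blks_hit += par_sketch_slots[i].shared_blks_hit;
	return leader.shared_blks_hit;
}

/* Run two workers; optionally let the second one abort before reporting. */
static long
par_sketch_demo(bool second_worker_aborts)
{
	for (int i = 0; i < PAR_SKETCH_WORKERS; i++)
		par_sketch_slots[i].shared_blks_hit = 0;

	par_sketch_run_worker(0, 70, false);
	par_sketch_run_worker(1, 80, second_worker_aborts);
	return par_sketch_leader_total();
}
```

With both workers finishing, the leader sees the combined count, matching the "leader_buffers_match" check. With one worker aborting, its blocks are missing from the leader's total, which is the gap the "worker_abort_buffers_not_propagated" check documents.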
diff --git a/src/test/modules/test_session_buffer_usage/meson.build b/src/test/modules/test_session_buffer_usage/meson.build
new file mode 100644
index 00000000000..b96f67dc7fe
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+test_session_buffer_usage_sources = files(
+ 'test_session_buffer_usage.c',
+)
+
+if host_system == 'windows'
+ test_session_buffer_usage_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_session_buffer_usage',
+ '--FILEDESC', 'test_session_buffer_usage - show buffer usage statistics for the current session',])
+endif
+
+test_session_buffer_usage = shared_module('test_session_buffer_usage',
+ test_session_buffer_usage_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_session_buffer_usage
+
+test_install_data += files(
+ 'test_session_buffer_usage.control',
+ 'test_session_buffer_usage--1.0.sql',
+)
+
+tests += {
+ 'name': 'test_session_buffer_usage',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_session_buffer_usage',
+ ],
+ },
+}
diff --git a/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql b/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
new file mode 100644
index 00000000000..daf2159c4a6
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
@@ -0,0 +1,245 @@
+LOAD 'test_session_buffer_usage';
+CREATE EXTENSION test_session_buffer_usage;
+
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM test_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+
+-- Verify counters increase after buffer activity
+SELECT test_session_buffer_usage_reset();
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM test_session_buffer_usage();
+
+DROP TABLE test_buf_activity;
+
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+SELECT count(*) FROM par_dc_tab;
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM test_session_buffer_usage(), dc_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_dc_tab, dc_serial_result;
+
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+
+SELECT test_session_buffer_usage_reset();
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP TABLE exc_tab;
+
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+
+SELECT test_session_buffer_usage_reset();
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP TABLE unique_tab;
+
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT test_session_buffer_usage_reset();
+SELECT subxact_exc_func();
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT test_session_buffer_usage_reset();
+SELECT cursor_exc_func();
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM trig_work_tab;
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM test_session_buffer_usage();
+
+-- Now trigger the same scan via a trigger that errors
+SELECT test_session_buffer_usage_reset();
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM test_session_buffer_usage(), trig_serial_result s;
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT test_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+
+SELECT test_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+RESET debug_parallel_query;
+
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM test_session_buffer_usage(), par_abort_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_abort_tab, par_abort_serial_result;
+
+-- Cleanup
+DROP EXTENSION test_session_buffer_usage;
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
new file mode 100644
index 00000000000..e9833be470a
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
@@ -0,0 +1,31 @@
+/* src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_session_buffer_usage" to load this file. \quit
+
+CREATE FUNCTION test_session_buffer_usage(
+ OUT shared_blks_hit bigint,
+ OUT shared_blks_read bigint,
+ OUT shared_blks_dirtied bigint,
+ OUT shared_blks_written bigint,
+ OUT local_blks_hit bigint,
+ OUT local_blks_read bigint,
+ OUT local_blks_dirtied bigint,
+ OUT local_blks_written bigint,
+ OUT temp_blks_read bigint,
+ OUT temp_blks_written bigint,
+ OUT shared_blk_read_time double precision,
+ OUT shared_blk_write_time double precision,
+ OUT local_blk_read_time double precision,
+ OUT local_blk_write_time double precision,
+ OUT temp_blk_read_time double precision,
+ OUT temp_blk_write_time double precision
+)
+RETURNS record
+AS 'MODULE_PATHNAME', 'test_session_buffer_usage'
+LANGUAGE C PARALLEL RESTRICTED;
+
+CREATE FUNCTION test_session_buffer_usage_reset()
+RETURNS void
+AS 'MODULE_PATHNAME', 'test_session_buffer_usage_reset'
+LANGUAGE C PARALLEL RESTRICTED;
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
new file mode 100644
index 00000000000..50eb1a2ffe6
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_session_buffer_usage.c
+ * show buffer usage statistics for the current session
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/instrument.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "test_session_buffer_usage",
+ .version = PG_VERSION
+);
+
+#define NUM_BUFFER_USAGE_COLUMNS 16
+
+PG_FUNCTION_INFO_V1(test_session_buffer_usage);
+PG_FUNCTION_INFO_V1(test_session_buffer_usage_reset);
+
+#define HAVE_INSTR_STACK 1 /* Change to 0 when testing before stack
+ * change */
+
+/*
+ * SQL function: test_session_buffer_usage()
+ *
+ * Returns a single row with all BufferUsage counters accumulated since the
+ * start of the session. Excludes any usage not yet added to the top of the
+ * stack (e.g. if this gets called inside a statement that also had buffer
+ * activity).
+ */
+Datum
+test_session_buffer_usage(PG_FUNCTION_ARGS)
+{
+ TupleDesc tupdesc;
+ Datum values[NUM_BUFFER_USAGE_COLUMNS];
+ bool nulls[NUM_BUFFER_USAGE_COLUMNS];
+ BufferUsage *usage;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ memset(nulls, 0, sizeof(nulls));
+
+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif
+
+ values[0] = Int64GetDatum(usage->shared_blks_hit);
+ values[1] = Int64GetDatum(usage->shared_blks_read);
+ values[2] = Int64GetDatum(usage->shared_blks_dirtied);
+ values[3] = Int64GetDatum(usage->shared_blks_written);
+ values[4] = Int64GetDatum(usage->local_blks_hit);
+ values[5] = Int64GetDatum(usage->local_blks_read);
+ values[6] = Int64GetDatum(usage->local_blks_dirtied);
+ values[7] = Int64GetDatum(usage->local_blks_written);
+ values[8] = Int64GetDatum(usage->temp_blks_read);
+ values[9] = Int64GetDatum(usage->temp_blks_written);
+ values[10] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time));
+ values[11] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time));
+ values[12] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time));
+ values[13] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time));
+ values[14] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+ values[15] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * SQL function: test_session_buffer_usage_reset()
+ *
+ * Resets all BufferUsage counters on the top instrumentation stack to zero.
+ * Useful in tests to avoid the baseline/delta pattern.
+ */
+Datum
+test_session_buffer_usage_reset(PG_FUNCTION_ARGS)
+{
+#if HAVE_INSTR_STACK
+ memset(&instr_top.bufusage, 0, sizeof(BufferUsage));
+#else
+ memset(&pgBufferUsage, 0, sizeof(BufferUsage));
+#endif
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
new file mode 100644
index 00000000000..41cfb15a765
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
@@ -0,0 +1,5 @@
+# test_session_buffer_usage extension
+comment = 'show buffer usage statistics for the current session'
+default_version = '1.0'
+module_pathname = '$libdir/test_session_buffer_usage'
+relocatable = true
--
2.47.1
* Re: Stack-based tracking of per-node WAL/buffer usage
2025-08-31 23:57 Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-09-04 20:23 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-09-09 19:35 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 11:28 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 12:59 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-10-31 07:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-08 04:27 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-09 21:55 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-09 23:45 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-10 08:12 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-14 20:49 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-17 06:21 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-17 08:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-18 20:49 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-18 23:36 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-19 00:45 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-23 14:41 ` Re: Stack-based tracking of per-node WAL/buffer usage Heikki Linnakangas <[email protected]>
2026-03-23 19:07 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-23 20:03 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-24 06:03 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-25 10:47 ` Re: Stack-based tracking of per-node WAL/buffer usage Heikki Linnakangas <[email protected]>
2026-03-26 00:41 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-27 07:21 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-04 09:43 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-04 19:39 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2026-04-05 12:31 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-05 18:13 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2026-04-05 19:38 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-05 21:02 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2026-04-05 23:12 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2026-04-06 09:58 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
@ 2026-04-06 22:46 ` Zsolt Parragi <[email protected]>
2026-04-07 00:39 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Zsolt Parragi @ 2026-04-06 22:46 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Andres Freund <[email protected]>; Heikki Linnakangas <[email protected]>; PostgreSQL Hackers <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>
I couldn't find any issues with v15, all comments are stylistic/minor,
except maybe the first one.
+ /* Abort handling: link in parent QueryInstrumentation's unfinalized list */
+ dlist_node unfinalized_entry;
Is it okay to store a pointer in shared memory, even if it seems to be
always NULL there?
#ifndef INSTRUMENT_NODE_H
#define INSTRUMENT_NODE_H
+
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
+
Is it okay to include files in the middle of the file? Is there a good
reason why these can't be at the top of the file?
+ * Recurse into children first (bottom-up accumulation), and accummulate
+ * to this nodes instrumentation as the parent context.
Two typos (accumulate / this node's)
#define RELEASE_PRIO_FILES 600
#define RELEASE_PRIO_WAITEVENTSETS 700
+#define RELEASE_PRIO_INSTRUMENTATION 800
This is mainly a generic observation, not strictly related to this
patch, but this list could use some explanation of which of these
priorities are actually required by dependencies, and which are just
"put the new entry at the end of the list".
* Re: Stack-based tracking of per-node WAL/buffer usage
@ 2026-04-07 00:39 ` Lukas Fittl <[email protected]>
2026-04-07 20:30 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Lukas Fittl @ 2026-04-07 00:39 UTC (permalink / raw)
To: Zsolt Parragi <[email protected]>; +Cc: Andres Freund <[email protected]>; Heikki Linnakangas <[email protected]>; PostgreSQL Hackers <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>
On Mon, Apr 6, 2026 at 3:46 PM Zsolt Parragi <[email protected]> wrote:
>
> I couldn't find any issues with v15, all comments are stylistic/minor,
> except maybe the first one.
Thanks for reviewing!
>
> + /* Abort handling: link in parent QueryInstrumentation's unfinalized list */
> + dlist_node unfinalized_entry;
>
> Is it okay to store a pointer in shared memory, even if it seems to be
> always NULL there?
It's not ideal, mainly because a caller might interpret it incorrectly,
but as long as we don't read from it, it's safe in practice. In the
parallel instrumentation we just use the Instrumentation struct as a
way to transport data (with the 0006 patch applied), and we
copy/accumulate from it before it gets used elsewhere.
I've previously avoided putting the unfinalized_entry value in the
Instrumentation struct for that reason, but I don't think there is a
good way to avoid that without complicating the design.
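The argument above can be illustrated with a small self-contained model. This is only a sketch, not PostgreSQL code: `ToyListNode`, `ToyInstr`, `toy_publish` and `toy_accumulate` are hypothetical names, the byte buffer merely stands in for shared memory, and clearing the links before the copy is an assumption made for the illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for dlist_node: two raw pointers, meaningless in another process. */
typedef struct ToyListNode
{
	struct ToyListNode *prev;
	struct ToyListNode *next;
} ToyListNode;

/* Hypothetical, heavily simplified instrumentation struct. */
typedef struct ToyInstr
{
	int64_t		shared_blks_hit;
	int64_t		shared_blks_read;
	ToyListNode unfinalized_entry;	/* process-local bookkeeping only */
} ToyInstr;

/* Worker side: copy the struct into the transport area with links cleared. */
static void
toy_publish(void *shm, const ToyInstr *src)
{
	ToyInstr	tmp = *src;

	tmp.unfinalized_entry.prev = NULL;
	tmp.unfinalized_entry.next = NULL;
	memcpy(shm, &tmp, sizeof(ToyInstr));
}

/* Leader side: add up counters only; the link field is never followed. */
static void
toy_accumulate(ToyInstr *dst, const void *shm)
{
	ToyInstr	in;

	memcpy(&in, shm, sizeof(ToyInstr));
	dst->shared_blks_hit += in.shared_blks_hit;
	dst->shared_blks_read += in.shared_blks_read;
}
```

In this model the pointer field travels through the buffer as dead weight: the receiving side copies the counters out and never touches the links, which is the "transport only" usage described above.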
>
> #ifndef INSTRUMENT_NODE_H
> #define INSTRUMENT_NODE_H
>
> +
> +#include "executor/tuptable.h"
> +#include "nodes/execnodes.h"
> +
>
> Is it okay to include files in the middle of the file? Is there a good
> reason why these can't be at the top of the file?
Yeah, those need to be at the top of the file, good catch.
>
> + * Recurse into children first (bottom-up accumulation), and accummulate
> + * to this nodes instrumentation as the parent context.
>
> Two typos (accumulate / this node's)
Good catch, agreed those are typos.
>
> #define RELEASE_PRIO_FILES 600
> #define RELEASE_PRIO_WAITEVENTSETS 700
> +#define RELEASE_PRIO_INSTRUMENTATION 800
>
> This is mainly a generic observation, not strictly related to this
> patch, but this list could use some explanation of which of these
> priorities are actually required by dependencies, and which are just
> "put the new entry at the end of the list".
Agreed, that would be helpful. It'll require more investigation to
confirm particular ordering reasons that exist today, but it seems
worth explaining more clearly.
I'll hold off on posting another patch round since what you raised
were just small stylistic issues, and they don't apply to the
remaining prep patches before the stack patch itself.
Thanks,
Lukas
--
Lukas Fittl
* Re: Stack-based tracking of per-node WAL/buffer usage
@ 2026-04-07 20:30 ` Lukas Fittl <[email protected]>
2026-04-07 22:19 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Lukas Fittl @ 2026-04-07 20:30 UTC (permalink / raw)
To: Andres Freund <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; PostgreSQL Hackers <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>; Zsolt Parragi <[email protected]>
On Mon, Apr 6, 2026 at 5:39 PM Lukas Fittl <[email protected]> wrote:
>
> On Mon, Apr 6, 2026 at 3:46 PM Zsolt Parragi <[email protected]> wrote:
> >
> > I couldn't find any issues with v15, all comments are stylistic/minor,
> > except maybe the first one.
>
> Thanks for reviewing!
>
> ...
>
> I'll hold off on posting another patch round since what you raised
> were just small stylistic issues, and they don't apply to the
> remaining prep patches before the stack patch itself.
Attached v16, rebased with Zsolt's feedback addressed. I've also
re-ordered as follows:
0001 is the change to make queryDesc->totaltime be allocated by
ExecutorStart instead of plugins themselves, and adds a
queryDesc->totaltime_options to have plugins request which level of
summary instrumentation they need. This change is pretty simple, and
could still make sense to get into 19. Because of the earlier
Instrumentation refactoring that was pushed (thanks!) we're already
asking extensions allocating queryDesc->totaltime to modify their use
of InstrAlloc, so I think we might as well clean this up now.
0002 is just ExecProcNodeInstr moved to instrument.c, as Andres had
suggested previously. We still get some quick performance wins from
doing that (see end of email), and again, it's a simple change, so
could be considered if someone has bandwidth remaining. I've added a
later patch that then does the more complex inlining and gets us the
full speed up.
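As a rough illustration of the 0001 idea (plugins request options, the executor allocates), here is a self-contained toy model. It does not use the real PostgreSQL types or signatures: `ToyQueryDesc`, `ToyTotaltime`, the `TOY_INSTR_*` flags and both functions are hypothetical stand-ins, and only the field names `totaltime` and `totaltime_options` are taken from the description above.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical option bits; the real flag names and values may differ. */
enum
{
	TOY_INSTR_TIMER = 1 << 0,
	TOY_INSTR_BUFFERS = 1 << 1,
	TOY_INSTR_WAL = 1 << 2
};

typedef struct ToyTotaltime
{
	bool		need_timer;
	bool		need_bufusage;
	bool		need_walusage;
} ToyTotaltime;

typedef struct ToyQueryDesc
{
	int			totaltime_options;	/* union of what all plugins asked for */
	ToyTotaltime *totaltime;		/* allocated by the executor, not plugins */
} ToyQueryDesc;

/* Plugin side: only record the request, never allocate. */
static void
toy_plugin_request(ToyQueryDesc *qd, int wanted)
{
	qd->totaltime_options |= wanted;
}

/* Executor side: allocate once, covering every requested option. */
static void
toy_executor_start(ToyQueryDesc *qd)
{
	if (qd->totaltime_options == 0)
		return;					/* nobody asked for summary instrumentation */

	qd->totaltime = calloc(1, sizeof(ToyTotaltime));
	if (qd->totaltime == NULL)
		abort();

	qd->totaltime->need_timer = (qd->totaltime_options & TOY_INSTR_TIMER) != 0;
	qd->totaltime->need_bufusage = (qd->totaltime_options & TOY_INSTR_BUFFERS) != 0;
	qd->totaltime->need_walusage = (qd->totaltime_options & TOY_INSTR_WAL) != 0;
}
```

In this model, two plugins that want different levels of detail each OR their flags into the options field, and a single allocation then satisfies both, so neither has to know whether the other already allocated the struct.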
At this point I think it's safe to say that we should push the later
changes out to PG20, because they need another thorough look, and I don't
think Andres or Heikki have the capacity for that today (but I really
appreciate all the effort put in by both of you!).
---
0002 measurements (with current master and TSC clock source used for
timing, best of three):
CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;
master:
265.319 ms actual runtime
308.532 ms TIMING OFF, BUFFERS OFF
375.810 ms TIMING OFF, BUFFERS ON
381.701 ms TIMING ON, BUFFERS OFF
437.722 ms TIMING ON, BUFFERS ON
0002:
265.207 ms actual runtime
291.799 ms TIMING OFF, BUFFERS OFF
364.653 ms TIMING OFF, BUFFERS ON
359.759 ms TIMING ON, BUFFERS OFF
433.023 ms TIMING ON, BUFFERS ON
full patch set:
265.763 ms actual runtime
273.222 ms TIMING OFF, BUFFERS OFF
293.621 ms TIMING OFF, BUFFERS ON
331.926 ms TIMING ON, BUFFERS OFF
363.055 ms TIMING ON, BUFFERS ON
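To make the numbers above easier to compare, the instrumentation overhead can be expressed relative to the uninstrumented run. The helpers below are only a sketch: the function names are made up for this illustration, and treating the "actual runtime" row as the uninstrumented baseline is an assumption about how that row was measured.

```c
/* Instrumentation overhead as a percentage of the uninstrumented runtime. */
static double
overhead_pct(double instrumented_ms, double baseline_ms)
{
	return (instrumented_ms - baseline_ms) / baseline_ms * 100.0;
}

/*
 * Share of the overhead seen on master that a patched build removes,
 * in percent. Each build is compared against its own baseline so that
 * run-to-run noise in the baseline does not leak into the result.
 */
static double
overhead_reduction_pct(double master_ms, double master_base_ms,
					   double patched_ms, double patched_base_ms)
{
	double		before = master_ms - master_base_ms;
	double		after = patched_ms - patched_base_ms;

	return (before - after) / before * 100.0;
}
```

For the TIMING ON, BUFFERS ON rows, `overhead_pct(437.722, 265.319)` comes out at roughly 65% on master and `overhead_pct(363.055, 265.763)` at roughly 37% with the full patch set, so under this reading the full series removes a bit over 40% of the overhead in that configuration.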
Thanks,
Lukas
--
Lukas Fittl
Attachments:
[application/x-patch] v16-0002-instrumentation-Move-ExecProcNodeInstr-to-allow-.patch (4.5K, 2-v16-0002-instrumentation-Move-ExecProcNodeInstr-to-allow-.patch)
From eb0b385e6b4e43ec0643ab4ed4d4f1e17a9ab365 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 7 Apr 2026 12:32:36 -0700
Subject: [PATCH v16 02/10] instrumentation: Move ExecProcNodeInstr to allow
inlining
This moves the implementation of ExecProcNodeInstr, the ExecProcNode
variant that gets used when instrumentation is on, to be defined in
instrument.c instead of execProcNode.c, and marks functions it uses
as inline.
This allows compilers to generate an optimized implementation, and
shows a 2 to 5% reduction in instrumentation overhead for queries
that move lots of rows.
Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com
---
src/backend/executor/execProcnode.c | 20 --------------------
src/backend/executor/instrument.c | 27 ++++++++++++++++++++++++---
src/include/executor/instrument.h | 4 ++++
3 files changed, 28 insertions(+), 23 deletions(-)
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 132fe37ef60..7c4c66e323f 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -121,7 +121,6 @@
#include "nodes/nodeFuncs.h"
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
-static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
@@ -471,25 +470,6 @@ ExecProcNodeFirst(PlanState *node)
}
-/*
- * ExecProcNode wrapper that performs instrumentation calls. By keeping
- * this a separate function, we avoid overhead in the normal case where
- * no instrumentation is wanted.
- */
-static TupleTableSlot *
-ExecProcNodeInstr(PlanState *node)
-{
- TupleTableSlot *result;
-
- InstrStartNode(node->instrument);
-
- result = node->ExecProcNodeReal(node);
-
- InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
- return result;
-}
-
/* ----------------------------------------------------------------
* MultiExecProcNode
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 4c3aec7fdee..dd08fc99fb2 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,6 +16,8 @@
#include <unistd.h>
#include "executor/instrument.h"
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
#include "portability/instr_time.h"
#include "utils/guc_hooks.h"
@@ -46,7 +48,7 @@ InstrInitOptions(Instrumentation *instr, int instrument_options)
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-void
+inline void
InstrStart(Instrumentation *instr)
{
if (instr->need_timer)
@@ -125,14 +127,14 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options, bool async_mod
}
/* Entry to a plan node */
-void
+inline void
InstrStartNode(NodeInstrumentation *instr)
{
InstrStart(&instr->instr);
}
/* Exit from a plan node */
-void
+inline void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
{
double save_tuplecount = instr->tuplecount;
@@ -166,6 +168,25 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
}
}
+/*
+ * ExecProcNode wrapper that performs instrumentation calls. By keeping
+ * this a separate function, we avoid overhead in the normal case where
+ * no instrumentation is wanted.
+ */
+TupleTableSlot *
+ExecProcNodeInstr(PlanState *node)
+{
+ TupleTableSlot *result;
+
+ InstrStartNode(node->instrument);
+
+ result = node->ExecProcNodeReal(node);
+
+ InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
+
+ return result;
+}
+
/* Update tuple count */
void
InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index cc9fbb0e2f0..84ef0ad089c 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -142,6 +142,10 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
extern void InstrEndLoop(NodeInstrumentation *instr);
extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
+typedef struct TupleTableSlot TupleTableSlot;
+typedef struct PlanState PlanState;
+extern TupleTableSlot *ExecProcNodeInstr(PlanState *node);
+
extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int64 firings);
--
2.47.1
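A note on the ExecProcNodeInstr hunk above: its comment says that keeping the wrapper a separate function avoids overhead when no instrumentation is wanted. The following is a minimal standalone sketch of that dispatch pattern, not the executor code itself. DemoNode, DemoInstr and all demo_* functions are hypothetical names invented for illustration, and the "tuple" is reduced to an int flag.

```c
#include <assert.h>
#include <stddef.h>

typedef struct DemoNode DemoNode;

/* Returns 1 when a row was produced, 0 at end of data (hypothetical). */
typedef int (*DemoExecFn) (DemoNode *node);

typedef struct DemoInstr
{
	long		calls;			/* number of times the node was entered */
	double		tuples;			/* rows produced so far */
} DemoInstr;

struct DemoNode
{
	DemoExecFn	exec;			/* entry point that callers invoke */
	DemoExecFn	exec_real;		/* the node's actual implementation */
	DemoInstr  *instr;			/* NULL when instrumentation is off */
	int			remaining;		/* rows left to emit */
};

/* The "real" node function: emits rows until none remain. */
static int
demo_scan(DemoNode *node)
{
	if (node->remaining <= 0)
		return 0;
	node->remaining--;
	return 1;
}

/* Wrapper that brackets the real function with instrumentation. */
static int
demo_exec_instr(DemoNode *node)
{
	int			got;

	assert(node->instr != NULL);
	node->instr->calls++;		/* stands in for the "start" call */
	got = node->exec_real(node);
	node->instr->tuples += got ? 1.0 : 0.0; /* stands in for "stop" */
	return got;
}

/* Pick the entry point once, at setup time. */
static void
demo_node_init(DemoNode *node, int nrows, DemoInstr *instr)
{
	node->exec_real = demo_scan;
	node->instr = instr;
	node->remaining = nrows;
	node->exec = instr ? demo_exec_instr : demo_scan;
}

/* Pull rows until the node reports end of data. */
static int
demo_drain(DemoNode *node)
{
	int			n = 0;

	while (node->exec(node))
		n++;
	return n;
}
```

Because the choice is made once in demo_node_init, an uninstrumented node calls demo_scan directly and never tests for instrumentation on the per-row path.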
[application/x-patch] v16-0004-instrumentation-Replace-direct-changes-of-pgBuff.patch (9.0K, 3-v16-0004-instrumentation-Replace-direct-changes-of-pgBuff.patch)
From 29529ef91da0f9e2a05dc7574ad8cf0d03404b24 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Thu, 26 Mar 2026 23:31:04 -0700
Subject: [PATCH v16 04/10] instrumentation: Replace direct changes of
pgBufferUsage/pgWalUsage with INSTR_* macros
This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
---
src/backend/access/transam/xlog.c | 10 +++++-----
src/backend/storage/buffer/bufmgr.c | 20 ++++++++++----------
src/backend/storage/buffer/localbuf.c | 6 +++---
src/backend/storage/file/buffile.c | 8 ++++----
src/backend/utils/activity/pgstat_io.c | 8 ++++----
src/include/executor/instrument.h | 19 +++++++++++++++++++
6 files changed, 45 insertions(+), 26 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f85b5286086..bf905383a40 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1115,10 +1115,10 @@ XLogInsertRecord(XLogRecData *rdata,
/* Report WAL traffic to the instrumentation. */
if (inserted)
{
- pgWalUsage.wal_bytes += rechdr->xl_tot_len;
- pgWalUsage.wal_records++;
- pgWalUsage.wal_fpi += num_fpi;
- pgWalUsage.wal_fpi_bytes += fpi_bytes;
+ INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+ INSTR_WALUSAGE_INCR(wal_records);
+ INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
+ INSTR_WALUSAGE_ADD(wal_fpi_bytes, fpi_bytes);
/* Required for the flush of pending stats WAL data */
pgstat_report_fixed = true;
@@ -2097,7 +2097,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
WriteRqst.Flush = InvalidXLogRecPtr;
XLogWrite(WriteRqst, tli, false);
LWLockRelease(WALWriteLock);
- pgWalUsage.wal_buffers_full++;
+ INSTR_WALUSAGE_INCR(wal_buffers_full);
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
/*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3cc0b0bdd92..3e1c39160db 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -840,7 +840,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
PinLocalBuffer(bufHdr, true);
- pgBufferUsage.local_blks_hit++;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
return true;
}
@@ -861,7 +861,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
{
if (BufferTagsEqual(&tag, &bufHdr->tag))
{
- pgBufferUsage.shared_blks_hit++;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
return true;
}
UnpinBuffer(bufHdr);
@@ -1684,9 +1684,9 @@ TrackBufferHit(IOObject io_object, IOContext io_context,
true);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(local_blks_hit);
else
- pgBufferUsage.shared_blks_hit += 1;
+ INSTR_BUFUSAGE_INCR(shared_blks_hit);
pgstat_count_io_op(io_object, io_context, IOOP_HIT, 1, 0);
@@ -2148,9 +2148,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
io_start, 1, io_buffers_len * BLCKSZ);
if (persistence == RELPERSISTENCE_TEMP)
- pgBufferUsage.local_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
else
- pgBufferUsage.shared_blks_read += io_buffers_len;
+ INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
/*
* Track vacuum cost when issuing IO, not after waiting for it. Otherwise
@@ -3043,7 +3043,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
}
- pgBufferUsage.shared_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
*extended_by = extend_by;
@@ -3189,7 +3189,7 @@ MarkBufferDirty(Buffer buffer)
*/
if (!(old_buf_state & BM_DIRTY))
{
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
@@ -4601,7 +4601,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
IOOP_WRITE, io_start, 1, BLCKSZ);
- pgBufferUsage.shared_blks_written++;
+ INSTR_BUFUSAGE_INCR(shared_blks_written);
/*
* Mark the buffer as clean and end the BM_IO_IN_PROGRESS state.
@@ -5796,7 +5796,7 @@ MarkSharedBufferDirtyHint(Buffer buffer, BufferDesc *bufHdr, uint64 lockstate,
UnlockBufHdr(bufHdr);
}
- pgBufferUsage.shared_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
if (VacuumCostActive)
VacuumCostBalance += VacuumCostPageDirty;
}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 396da84b25c..851b99056d5 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -218,7 +218,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
/* Mark not-dirty */
TerminateLocalBufferIO(bufHdr, true, 0, false);
- pgBufferUsage.local_blks_written++;
+ INSTR_BUFUSAGE_INCR(local_blks_written);
}
static Buffer
@@ -479,7 +479,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
*extended_by = extend_by;
- pgBufferUsage.local_blks_written += extend_by;
+ INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
return first_block;
}
@@ -510,7 +510,7 @@ MarkLocalBufferDirty(Buffer buffer)
buf_state = pg_atomic_read_u64(&bufHdr->state);
if (!(buf_state & BM_DIRTY))
- pgBufferUsage.local_blks_dirtied++;
+ INSTR_BUFUSAGE_INCR(local_blks_dirtied);
buf_state |= BM_DIRTY;
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c4afe4d368a..8b501dfcadd 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -475,13 +475,13 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
}
/* we choose not to advance curOffset here */
if (file->nbytes > 0)
- pgBufferUsage.temp_blks_read++;
+ INSTR_BUFUSAGE_INCR(temp_blks_read);
}
/*
@@ -549,13 +549,13 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
INSTR_TIME_SET_CURRENT(io_time);
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+ INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
}
file->curOffset += bytestowrite;
wpos += bytestowrite;
- pgBufferUsage.temp_blks_written++;
+ INSTR_BUFUSAGE_INCR(temp_blks_written);
}
file->dirty = false;
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 2be26e92283..e3829d7fe7c 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
{
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
}
else if (io_op == IOOP_READ)
{
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
if (io_object == IOOBJECT_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
else if (io_object == IOOBJECT_TEMP_RELATION)
- INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+ INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
}
}
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 84ef0ad089c..4430c222493 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -158,4 +158,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+ pgBufferUsage.fld++; \
+ } while (0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+ pgBufferUsage.fld += (val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+ INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ } while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+ INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ } while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+ pgWalUsage.fld++; \
+ } while (0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+ pgWalUsage.fld += (val); \
+ } while (0)
+
#endif /* INSTRUMENT_H */
--
2.47.1
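As a reading aid for the INSTR_* macros in the patch above: wrapping each counter update in a macro means call sites no longer name the global directly, so the storage behind the macro can change later without touching them. Below is a minimal standalone sketch of the same idea. DemoUsage, demo_usage, DEMO_USAGE_INCR, DEMO_USAGE_ADD and demo_track are hypothetical names, not part of PostgreSQL.

```c
#include <assert.h>

/* Hypothetical counter struct standing in for a usage-tracking global. */
typedef struct DemoUsage
{
	long		blks_hit;
	long		blks_read;
} DemoUsage;

static DemoUsage demo_usage;

/*
 * The do { ... } while (0) wrapper makes each macro expand to a single
 * statement, so it is safe in an unbraced if/else like the one below.
 */
#define DEMO_USAGE_INCR(fld) do { \
		demo_usage.fld++; \
	} while (0)
#define DEMO_USAGE_ADD(fld,val) do { \
		demo_usage.fld += (val); \
	} while (0)

/* Call site: only the field name appears, never the global itself. */
static void
demo_track(int is_hit, long nblocks)
{
	if (is_hit)
		DEMO_USAGE_INCR(blks_hit);
	else
		DEMO_USAGE_ADD(blks_read, nblocks);
}
```

If the counters later moved behind a pointer, only the two macro bodies in this sketch would need to change; demo_track would stay as written.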
[application/x-patch] v16-0003-instrumentation-Use-Instrumentation-instead-of-m.patch (19.2K, 4-v16-0003-instrumentation-Use-Instrumentation-instead-of-m.patch)
From b00b04af4ec7ab5cd249544ceea4d2a59b5b4953 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 5 Apr 2026 05:08:23 -0700
Subject: [PATCH v16 03/10] instrumentation: Use Instrumentation instead of
manual buffer tracking
This replaces several repeated code blocks that snapshot pgBufferUsage /
pgWalUsage, and in some cases also run a timer to measure activity, with
the new Instrumentation struct and its associated helpers.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
.../pg_stat_statements/pg_stat_statements.c | 62 +++++--------------
src/backend/access/heap/vacuumlazy.c | 15 +++--
src/backend/commands/analyze.c | 31 +++++-----
src/backend/commands/explain.c | 44 +++++++------
src/backend/commands/explain_dr.c | 53 ++++++----------
src/backend/commands/prepare.c | 28 ++++-----
src/include/commands/explain_dr.h | 5 +-
7 files changed, 91 insertions(+), 147 deletions(-)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index c8981dcb5cf..5da71c9be16 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -901,22 +901,16 @@ pgss_planner(Query *parse,
&& pgss_track_planning && query_string
&& parse->queryId != INT64CONST(0))
{
- instr_time start;
- instr_time duration;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
-
- /* We need to track buffer usage as the planner can access them. */
- bufusage_start = pgBufferUsage;
+ Instrumentation instr = {0};
/*
+ * We need to track buffer usage as the planner can access them.
+ *
* Similarly the planner could write some WAL records in some cases
* (e.g. setting a hint bit with those being WAL-logged)
*/
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -934,26 +928,17 @@ pgss_planner(Query *parse,
}
PG_END_TRY();
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
-
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
+ InstrStop(&instr);
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
parse->stmt_len,
PGSS_PLAN,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
0,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
@@ -1135,17 +1120,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
!IsA(parsetree, ExecuteStmt) &&
!IsA(parsetree, PrepareStmt))
{
- instr_time start;
- instr_time duration;
uint64 rows;
- BufferUsage bufusage_start,
- bufusage;
- WalUsage walusage_start,
- walusage;
+ Instrumentation instr = {0};
- bufusage_start = pgBufferUsage;
- walusage_start = pgWalUsage;
- INSTR_TIME_SET_CURRENT(start);
+ InstrInitOptions(&instr, INSTRUMENT_ALL);
+ InstrStart(&instr);
nesting_level++;
PG_TRY();
@@ -1175,8 +1154,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
* former value, which'd otherwise be a good idea.
*/
- INSTR_TIME_SET_CURRENT(duration);
- INSTR_TIME_SUBTRACT(duration, start);
+ InstrStop(&instr);
/*
* Track the total number of rows retrieved or affected by the utility
@@ -1189,23 +1167,15 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
qc->nprocessed : 0;
- /* calc differences of buffer counters. */
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
- /* calc differences of WAL counters. */
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
pgss_store(queryString,
saved_queryId,
saved_stmt_location,
saved_stmt_len,
PGSS_EXEC,
- INSTR_TIME_GET_MILLISEC(duration),
+ INSTR_TIME_GET_MILLISEC(instr.total),
rows,
- &bufusage,
- &walusage,
+ &instr.bufusage,
+ &instr.walusage,
NULL,
NULL,
0,
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 39395aed0d5..6173e53c4ad 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -638,8 +638,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
+ Instrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -655,6 +654,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
+ instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -996,14 +997,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr);
+
if (verbose || params->log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params->log_vacuum_min_duration))
{
long secs_dur;
int usecs_dur;
- WalUsage walusage;
- BufferUsage bufferusage;
StringInfoData buf;
char *msgfmt;
int32 diff;
@@ -1012,12 +1013,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
+ BufferUsage bufferusage = instr->bufusage;
+ WalUsage walusage = instr->walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
total_blks_hit = bufferusage.shared_blks_hit +
bufferusage.local_blks_hit;
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 49a5cdf579c..8472fc0c280 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -309,9 +309,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- WalUsage startwalusage = pgWalUsage;
- BufferUsage startbufferusage = pgBufferUsage;
- BufferUsage bufferusage;
+ Instrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -362,6 +360,9 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
}
pg_rusage_init(&ru0);
+
+ instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrStart(instr);
}
/* Used for instrumentation and stats report */
@@ -742,12 +743,13 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
+ InstrStop(instr);
+
if (verbose || params->log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
params->log_analyze_min_duration))
{
long delay_in_ms;
- WalUsage walusage;
double read_rate = 0;
double write_rate = 0;
char *msgfmt;
@@ -755,18 +757,15 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
-
- memset(&bufferusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
- memset(&walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
- total_blks_hit = bufferusage.shared_blks_hit +
- bufferusage.local_blks_hit;
- total_blks_read = bufferusage.shared_blks_read +
- bufferusage.local_blks_read;
- total_blks_dirtied = bufferusage.shared_blks_dirtied +
- bufferusage.local_blks_dirtied;
+ BufferUsage bufusage = instr->bufusage;
+ WalUsage walusage = instr->walusage;
+
+ total_blks_hit = bufusage.shared_blks_hit +
+ bufusage.local_blks_hit;
+ total_blks_read = bufusage.shared_blks_read +
+ bufusage.local_blks_read;
+ total_blks_dirtied = bufusage.shared_blks_dirtied +
+ bufusage.local_blks_dirtied;
/*
* We do not expect an analyze to take > 25 days and it simplifies
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index f151f21f9b3..deaaba6f900 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -324,14 +324,17 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- instr_time planstart,
- planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation plan_instr = {0};
+ int instrument_options = INSTRUMENT_TIMER;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ InstrInitOptions(&plan_instr, instrument_options);
+
if (es->memory)
{
/*
@@ -348,15 +351,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(&plan_instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(&plan_instr);
if (es->memory)
{
@@ -364,16 +364,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
@@ -590,7 +583,12 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* grab serialization metrics before we destroy the DestReceiver */
if (es->serialize != EXPLAIN_SERIALIZE_NONE)
- serializeMetrics = GetSerializationMetrics(dest);
+ {
+ SerializeMetrics *metrics = GetSerializationMetrics(dest);
+
+ if (metrics)
+ memcpy(&serializeMetrics, metrics, sizeof(SerializeMetrics));
+ }
/* call the DestReceiver's destroy method even during explain */
dest->rDestroy(dest);
@@ -1019,7 +1017,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
ExplainIndentText(es);
if (es->timing)
appendStringInfo(es->str, "Serialization: time=%.3f ms output=" UINT64_FORMAT "kB format=%s\n",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
else
@@ -1027,10 +1025,10 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent),
format);
- if (es->buffers && peek_buffer_usage(es, &metrics->bufferUsage))
+ if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
es->indent--;
}
}
@@ -1038,13 +1036,13 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
{
if (es->timing)
ExplainPropertyFloat("Time", "ms",
- 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->timeSpent),
+ 1000.0 * INSTR_TIME_GET_DOUBLE(metrics->instr.total),
3, es);
ExplainPropertyUInteger("Output Volume", "kB",
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->bufferUsage);
+ show_buffer_usage(es, &metrics->instr.bufusage);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 3c96061cf32..df5ae5f4569 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,10 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContext oldcontext;
StringInfo buf = &myState->buf;
int natts = typeinfo->natts;
- instr_time start,
- end;
- BufferUsage instr_start;
+ Instrumentation *instr = &myState->metrics.instr;
- /* only measure time, buffers if requested */
- if (myState->es->timing)
- INSTR_TIME_SET_CURRENT(start);
- if (myState->es->buffers)
- instr_start = pgBufferUsage;
+ /* Start per-tuple measurement */
+ InstrStart(instr);
/* Set or update my derived attribute info, if needed */
if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +181,8 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
MemoryContextSwitchTo(oldcontext);
MemoryContextReset(myState->tmpcontext);
- /* Update timing data */
- if (myState->es->timing)
- {
- INSTR_TIME_SET_CURRENT(end);
- INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
- }
-
- /* Update buffer metrics */
- if (myState->es->buffers)
- BufferUsageAccumDiff(&myState->metrics.bufferUsage,
- &pgBufferUsage,
- &instr_start);
+ /* Stop per-tuple measurement */
+ InstrStop(instr);
return true;
}
@@ -209,6 +194,7 @@ static void
serializeAnalyzeStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
{
SerializeDestReceiver *receiver = (SerializeDestReceiver *) self;
+ int instrument_options = 0;
Assert(receiver->es != NULL);
@@ -233,9 +219,13 @@ serializeAnalyzeStartup(DestReceiver *self, int operation, TupleDesc typeinfo)
/* The output buffer is re-used across rows, as in printtup.c */
initStringInfo(&receiver->buf);
- /* Initialize results counters */
+ /* Initialize metrics and per-tuple instrumentation */
memset(&receiver->metrics, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(receiver->metrics.timeSpent);
+ if (receiver->es->timing)
+ instrument_options |= INSTRUMENT_TIMER;
+ if (receiver->es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+ InstrInitOptions(&receiver->metrics.instr, instrument_options);
}
/*
@@ -290,22 +280,17 @@ CreateExplainSerializeDestReceiver(ExplainState *es)
}
/*
- * GetSerializationMetrics - collect metrics
+ * GetSerializationMetrics - get serialization metrics
*
- * We have to be careful here since the receiver could be an IntoRel
- * receiver if the subject statement is CREATE TABLE AS. In that
- * case, return all-zeroes stats.
+ * Returns a pointer to the SerializeMetrics inside the dest receiver,
+ * or NULL if the receiver is not a SerializeDestReceiver (e.g. an IntoRel
+ * receiver for CREATE TABLE AS).
*/
-SerializeMetrics
+SerializeMetrics *
GetSerializationMetrics(DestReceiver *dest)
{
- SerializeMetrics empty;
-
if (dest->mydest == DestExplainSerialize)
- return ((SerializeDestReceiver *) dest)->metrics;
-
- memset(&empty, 0, sizeof(SerializeMetrics));
- INSTR_TIME_SET_ZERO(empty.timeSpent);
+ return &((SerializeDestReceiver *) dest)->metrics;
- return empty;
+ return NULL;
}
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 876aad2100a..bf9f2eb6149 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -22,6 +22,7 @@
#include "catalog/pg_type.h"
#include "commands/createas.h"
#include "commands/explain.h"
+#include "executor/instrument.h"
#include "commands/explain_format.h"
#include "commands/explain_state.h"
#include "commands/prepare.h"
@@ -580,14 +581,17 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- instr_time planstart;
- instr_time planduration;
- BufferUsage bufusage_start,
- bufusage;
+ Instrumentation plan_instr = {0};
+ int instrument_options = INSTRUMENT_TIMER;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
+ if (es->buffers)
+ instrument_options |= INSTRUMENT_BUFFERS;
+
+ InstrInitOptions(&plan_instr, instrument_options);
+
if (es->memory)
{
/* See ExplainOneQuery about this */
@@ -598,9 +602,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- if (es->buffers)
- bufusage_start = pgBufferUsage;
- INSTR_TIME_SET_CURRENT(planstart);
+ InstrStart(&plan_instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -635,8 +637,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- INSTR_TIME_SET_CURRENT(planduration);
- INSTR_TIME_SUBTRACT(planduration, planstart);
+ InstrStop(&plan_instr);
if (es->memory)
{
@@ -644,13 +645,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
MemoryContextMemConsumed(planner_ctx, &mem_counters);
}
- /* calc differences of buffer counters. */
- if (es->buffers)
- {
- memset(&bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
- }
-
plan_list = cplan->stmt_list;
/* Explain each query */
@@ -660,7 +654,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &planduration, (es->buffers ? &bufusage : NULL),
+ &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/include/commands/explain_dr.h b/src/include/commands/explain_dr.h
index f98eaae1864..ab5c53023e1 100644
--- a/src/include/commands/explain_dr.h
+++ b/src/include/commands/explain_dr.h
@@ -23,11 +23,10 @@ typedef struct ExplainState ExplainState;
typedef struct SerializeMetrics
{
uint64 bytesSent; /* # of bytes serialized */
- instr_time timeSpent; /* time spent serializing */
- BufferUsage bufferUsage; /* buffers accessed during serialization */
+ Instrumentation instr; /* time and buffer usage */
} SerializeMetrics;
extern DestReceiver *CreateExplainSerializeDestReceiver(ExplainState *es);
-extern SerializeMetrics GetSerializationMetrics(DestReceiver *dest);
+extern SerializeMetrics *GetSerializationMetrics(DestReceiver *dest);
#endif
--
2.47.1
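To illustrate what the call sites in the patch above gain: each one previously copied the global counters, did its work, then zeroed a result struct and computed the difference by hand. The sketch below shows that snapshot-and-diff logic folded into a start/stop pair. It is an assumption about the general shape only; the bodies of InstrStart and InstrStop are not part of this patch, and DemoCounters, DemoInstrumentation and the demo_* functions are hypothetical names.

```c
#include <assert.h>

/* Hypothetical stand-in for a global set of activity counters. */
typedef struct DemoCounters
{
	long		hits;
	long		reads;
} DemoCounters;

static DemoCounters demo_global;

/* Hypothetical helper struct: options, start snapshot, accumulated usage. */
typedef struct DemoInstrumentation
{
	int			need_counters;	/* were counters requested? */
	DemoCounters start;			/* snapshot taken at start */
	DemoCounters usage;			/* accumulated over all start/stop pairs */
} DemoInstrumentation;

/* Record the current global counters, if counters were requested. */
static void
demo_instr_start(DemoInstrumentation *instr)
{
	if (instr->need_counters)
		instr->start = demo_global;
}

/* Add the activity since the matching start call to the running total. */
static void
demo_instr_stop(DemoInstrumentation *instr)
{
	if (instr->need_counters)
	{
		instr->usage.hits += demo_global.hits - instr->start.hits;
		instr->usage.reads += demo_global.reads - instr->start.reads;
	}
}
```

A caller zero-initializes the struct, sets the options it needs, and then reads instr.usage after the stop call, which mirrors how the patched call sites read instr.bufusage and instr.walusage.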
[application/x-patch] v16-0001-instrumentation-Allocate-queryDesc-totaltime-in-.patch (6.4K, 5-v16-0001-instrumentation-Allocate-queryDesc-totaltime-in-.patch)
From 58919a10cf5542495e7af3afa9a908a923f3ddf6 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v16 01/10] instrumentation: Allocate queryDesc->totaltime in
ExecutorStart if needed
This introduces a new field, queryDesc->totaltime_options, that extensions
can use to indicate whether they need queryDesc->totaltime populated,
and with which instrumentation options. Extensions should take care to
only add options they need, instead of replacing the options of others.
This replaces the practice of extensions allocating queryDesc->totaltime
themselves, which required them to always pass INSTRUMENT_ALL as the
options argument. Had one of them passed fewer options, another extension
could have been silently affected, since whichever hook ran first did the
allocation. It also forced extension hooks to take care to allocate in
the per-query memory context.
Adjust pg_stat_statements and auto_explain to match, and lower the
requested instrumentation level for auto_explain to INSTRUMENT_TIMER,
since runtime is the only summary instrumentation it needs.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
contrib/auto_explain/auto_explain.c | 20 +++------------
.../pg_stat_statements/pg_stat_statements.c | 25 ++++++-------------
src/backend/executor/execMain.c | 9 +++++++
src/backend/tcop/pquery.c | 1 +
src/include/executor/execdesc.h | 4 ++-
5 files changed, 23 insertions(+), 36 deletions(-)
diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 6ceae1c69ce..4f9d35bc30b 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -334,6 +334,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
if (auto_explain_enabled())
{
+ /* We're always interested in runtime */
+ queryDesc->totaltime_options |= INSTRUMENT_TIMER;
+
/* Enable per-node instrumentation iff log_analyze is required. */
if (auto_explain_log_analyze && (eflags & EXEC_FLAG_EXPLAIN_ONLY) == 0)
{
@@ -352,23 +355,6 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
prev_ExecutorStart(queryDesc, eflags);
else
standard_ExecutorStart(queryDesc, eflags);
-
- if (auto_explain_enabled())
- {
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
- if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
- }
}
/*
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index b5000bc14b7..c8981dcb5cf 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -993,11 +993,6 @@ pgss_planner(Query *parse,
static void
pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
{
- if (prev_ExecutorStart)
- prev_ExecutorStart(queryDesc, eflags);
- else
- standard_ExecutorStart(queryDesc, eflags);
-
/*
* If query has queryId zero, don't track it. This prevents double
* counting of optimizable statements that are directly contained in
@@ -1005,20 +1000,14 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
*/
if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
{
- /*
- * Set up to track total elapsed time in ExecutorRun. Make sure the
- * space is allocated in the per-query context so it will go away at
- * ExecutorEnd.
- */
- if (queryDesc->totaltime == NULL)
- {
- MemoryContext oldcxt;
-
- oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
- queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
- MemoryContextSwitchTo(oldcxt);
- }
+ /* Request all summary instrumentation, i.e. timing, buffers and WAL */
+ queryDesc->totaltime_options |= INSTRUMENT_ALL;
}
+
+ if (prev_ExecutorStart)
+ prev_ExecutorStart(queryDesc, eflags);
+ else
+ standard_ExecutorStart(queryDesc, eflags);
}
/*
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b0f636bf8b6..f71f668883c 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -250,6 +250,15 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
estate->es_instrument = queryDesc->instrument_options;
estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
+ /*
+ * Set up query-level instrumentation if extensions have requested it via
+ * totaltime_options. Ensure an extension has not allocated totaltime
+ * itself.
+ */
+ Assert(queryDesc->totaltime == NULL);
+ if (queryDesc->totaltime_options)
+ queryDesc->totaltime = InstrAlloc(queryDesc->totaltime_options);
+
/*
* Set up an AFTER-trigger statement context, unless told not to, or
* unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
diff --git a/src/backend/tcop/pquery.c b/src/backend/tcop/pquery.c
index d8fc75d0bb9..e27f26ecd83 100644
--- a/src/backend/tcop/pquery.c
+++ b/src/backend/tcop/pquery.c
@@ -86,6 +86,7 @@ CreateQueryDesc(PlannedStmt *plannedstmt,
qd->params = params; /* parameter values passed into query */
qd->queryEnv = queryEnv;
qd->instrument_options = instrument_options; /* instrumentation wanted? */
+ qd->totaltime_options = 0;
/* null these fields until set by ExecutorStart */
qd->tupDesc = NULL;
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index d3a57242844..3f9143efb8c 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -42,6 +42,8 @@ typedef struct QueryDesc
ParamListInfo params; /* param values being passed in */
QueryEnvironment *queryEnv; /* query environment passed in */
int instrument_options; /* OR of InstrumentOption flags */
+ int totaltime_options; /* OR of InstrumentOption flags for
+ * totaltime */
/* These fields are set by ExecutorStart */
TupleDesc tupDesc; /* descriptor for result tuples */
@@ -51,7 +53,7 @@ typedef struct QueryDesc
/* This field is set by ExecutePlan */
bool already_executed; /* true if previously executed */
- /* This is always set NULL by the core system, but plugins can change it */
+ /* This field is allocated by ExecutorStart if needed */
struct Instrumentation *totaltime; /* total time spent in ExecutorRun */
} QueryDesc;
--
2.47.1
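For readers skimming the patch above, the totaltime_options handoff can be sketched as a standalone C program fragment. This is only an illustration under stated assumptions: every Mock* type, MOCK_* flag and mock_* function below is hypothetical and stands in for the real PostgreSQL definitions, whose actual values and layouts are not reproduced here. It shows the ordering the patch relies on: the extension ORs its flags into totaltime_options first, and only then does ExecutorStart allocate totaltime.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flag values, NOT the real InstrumentOption constants. */
enum
{
	MOCK_INSTRUMENT_TIMER = 1 << 0,
	MOCK_INSTRUMENT_BUFFERS = 1 << 1,
	MOCK_INSTRUMENT_WAL = 1 << 2,
	MOCK_INSTRUMENT_ALL = (1 << 0) | (1 << 1) | (1 << 2)
};

/* Hypothetical stand-in for struct Instrumentation. */
typedef struct MockInstrumentation
{
	int			options;
} MockInstrumentation;

/* Hypothetical stand-in for the two QueryDesc fields involved. */
typedef struct MockQueryDesc
{
	int			totaltime_options;	/* OR of flags requested by extensions */
	MockInstrumentation *totaltime; /* allocated by ExecutorStart if needed */
} MockQueryDesc;

/* Mirrors the new block in standard_ExecutorStart. */
void
mock_standard_ExecutorStart(MockQueryDesc *qd)
{
	/* An extension must not have allocated totaltime itself. */
	assert(qd->totaltime == NULL);

	if (qd->totaltime_options)
	{
		qd->totaltime = calloc(1, sizeof(MockInstrumentation));
		assert(qd->totaltime != NULL);
		qd->totaltime->options = qd->totaltime_options;
	}
}

/* Mirrors the reordered hook: request flags first, then chain onwards. */
void
mock_extension_ExecutorStart(MockQueryDesc *qd, int tracking_enabled)
{
	if (tracking_enabled)
		qd->totaltime_options |= MOCK_INSTRUMENT_ALL;

	mock_standard_ExecutorStart(qd);
}
```

If the hook chained to ExecutorStart before setting the flags, as the old pg_stat_statements code did, the request would arrive too late for the allocation, which is presumably why the patch moves the call to the end of pgss_ExecutorStart.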
[application/x-patch] v16-0005-instrumentation-Add-additional-regression-tests-.patch (22.5K, 6-v16-0005-instrumentation-Add-additional-regression-tests-.patch)
From 587f7eb93682826e03b2955fdd3257a18c0458d4 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 5 Apr 2026 03:48:32 -0700
Subject: [PATCH v16 05/10] instrumentation: Add additional regression tests
covering buffer usage
This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.
Testing this is challenging because buffer activity can come from
several sources, so where possible we rely on temporary tables to prove
that activity was captured and not lost. This supports a future commit
that reworks parts of the instrumentation logic and could otherwise
break the behaviour covered by these tests.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
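As a reading aid for the cursor test in explain.sql below: the acceptance window it applies to each cursor path can be written as a plain predicate. This is a minimal sketch, and the function name mock_cursor_buffers_ok is hypothetical; only the two bounds (at least half of the direct scan, strictly less than twice the direct scan) are taken from the test itself.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper mirroring the check in check_cursor_explain_buffers():
 * the cursor path must report at least half of the direct scan's local
 * buffer count (activity was not lost) and strictly less than twice that
 * count (activity was not double-counted).
 */
bool
mock_cursor_buffers_ok(int cursor_buf, int direct_buf)
{
	assert(direct_buf >= 0);

	return cursor_buf >= direct_buf * 0.5 &&
		cursor_buf < direct_buf * 2;
}
```

The lower bound is deliberately loose, so the test tolerates some variation in buffer counts between runs while still catching both lost and doubled counts.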
.../pg_stat_statements/expected/utility.out | 70 +++++++
contrib/pg_stat_statements/expected/wal.out | 48 +++++
contrib/pg_stat_statements/sql/utility.sql | 56 ++++++
contrib/pg_stat_statements/sql/wal.sql | 33 +++
src/test/regress/expected/explain.out | 188 ++++++++++++++++++
src/test/regress/sql/explain.sql | 188 ++++++++++++++++++
6 files changed, 583 insertions(+)
diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index e4d6564ea5b..cba487f6be5 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -289,6 +289,76 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
1 | 1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
(3 rows)
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT run_explain_buffers_test();
+ run_explain_buffers_test
+--------------------------
+
+(1 row)
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+ query | has_buffer_stats
+-----------------------------------+------------------
+ SELECT run_explain_buffers_test() | t
+(1 row)
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+CALL pgss_call_rollback_proc();
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------------+-------+---------------+---------------------+-----------------------
+ CALL pgss_call_rollback_proc() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/expected/wal.out b/contrib/pg_stat_statements/expected/wal.out
index 977e382d848..611213daef6 100644
--- a/contrib/pg_stat_statements/expected/wal.out
+++ b/contrib/pg_stat_statements/expected/wal.out
@@ -28,3 +28,51 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
t
(1 row)
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
+SELECT pgss_error_func();
+ pgss_error_func
+-----------------
+
+(1 row)
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+ query | calls | local_hitread | wal_bytes_generated | wal_records_generated
+--------------------------+-------+---------------+---------------------+-----------------------
+ SELECT pgss_error_func() | 1 | t | t | t
+(1 row)
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index dd97203c210..7540e49c73c 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -152,6 +152,62 @@ EXPLAIN (costs off) SELECT a FROM generate_series(1,10) AS tab(a) WHERE a = 7;
SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+SELECT run_explain_buffers_test();
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+ v int;
+BEGIN
+ EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+CALL pgss_call_rollback_proc();
+
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
+
-- CALL
CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
DECLARE
diff --git a/contrib/pg_stat_statements/sql/wal.sql b/contrib/pg_stat_statements/sql/wal.sql
index 1dc1552a81e..467e321b206 100644
--- a/contrib/pg_stat_statements/sql/wal.sql
+++ b/contrib/pg_stat_statements/sql/wal.sql
@@ -18,3 +18,36 @@ wal_records > 0 as wal_records_generated,
wal_records >= rows as wal_records_ge_rows
FROM pg_stat_statements ORDER BY query COLLATE "C";
SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT pgss_error_func();
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..5ff96491b0a 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,191 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
(9 rows)
reset work_mem;
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+ exception_buffers_nested_visible
+----------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly reported.
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_cursor_explain_buffers();
+ noscroll_ok | scroll_ok
+-------------+-----------
+ t | t
+(1 row)
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+-- Test trigger instrumentation.
+CREATE TEMP TABLE trig_test_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int);
+INSERT INTO trig_work_tab VALUES (1);
+CREATE FUNCTION trig_test_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM * FROM trig_work_tab;
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_test_trig
+ BEFORE INSERT ON trig_test_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_test_func();
+CREATE FUNCTION check_trigger_explain_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ trig json;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ INSERT INTO trig_test_tab VALUES (1)' INTO plan_json;
+ trig := plan_json->0->'Triggers'->0;
+ RETURN COALESCE((trig->>'Calls')::int, 0) > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_trigger_explain_buffers() AS trigger_buffers_visible;
+ trigger_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_trigger_explain_buffers;
+DROP TRIGGER trig_test_trig ON trig_test_tab;
+DROP FUNCTION trig_test_func;
+DROP TABLE trig_test_tab;
+DROP TABLE trig_work_tab;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..9f0e8524497 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,191 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
-- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
reset work_mem;
+
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+ v int;
+BEGIN
+ WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+ NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT explain_exc_func()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ node json;
+ total_buffers int;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT check_explain_exception_buffers()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ total_buffers :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+ RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly reported.
+
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+ cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+ rec RECORD;
+ cnt bigint := 0;
+BEGIN
+ OPEN cur;
+ LOOP
+ FETCH NEXT FROM cur INTO rec;
+ EXIT WHEN NOT FOUND;
+ cnt := cnt + 1;
+ END LOOP;
+ CLOSE cur;
+ RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+ plan_json json;
+ node json;
+ direct_buf int;
+ noscroll_buf int;
+ scroll_buf int;
+BEGIN
+ -- Direct scan: get leaf Seq Scan node buffers as baseline
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT * FROM cursor_buf_test' INTO plan_json;
+ node := plan_json->0->'Plan';
+ WHILE node->'Plans' IS NOT NULL LOOP
+ node := node->'Plans'->0;
+ END LOOP;
+ direct_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_noscroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ noscroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Scrollable cursor path: ExecShutdownNode is skipped
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ SELECT cursor_scroll_scan()' INTO plan_json;
+ node := plan_json->0->'Plan';
+ scroll_buf :=
+ COALESCE((node->>'Local Hit Blocks')::int, 0) +
+ COALESCE((node->>'Local Read Blocks')::int, 0);
+
+ -- Both cursor paths should report buffer counts about as high as
+ -- the direct scan (same data plus minor catalog overhead), and not
+ -- double-counted (< 2x the direct scan)
+ RETURN QUERY SELECT
+ (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+ (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_cursor_explain_buffers();
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+
+-- Test trigger instrumentation.
+CREATE TEMP TABLE trig_test_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int);
+INSERT INTO trig_work_tab VALUES (1);
+
+CREATE FUNCTION trig_test_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM * FROM trig_work_tab;
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trig_test_trig
+ BEFORE INSERT ON trig_test_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_test_func();
+
+CREATE FUNCTION check_trigger_explain_buffers() RETURNS boolean AS $$
+DECLARE
+ plan_json json;
+ trig json;
+BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+ INSERT INTO trig_test_tab VALUES (1)' INTO plan_json;
+ trig := plan_json->0->'Triggers'->0;
+ RETURN COALESCE((trig->>'Calls')::int, 0) > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_trigger_explain_buffers() AS trigger_buffers_visible;
+
+DROP FUNCTION check_trigger_explain_buffers;
+DROP TRIGGER trig_test_trig ON trig_test_tab;
+DROP FUNCTION trig_test_func;
+DROP TABLE trig_test_tab;
+DROP TABLE trig_work_tab;
--
2.47.1
[application/x-patch] v16-0007-instrumentation-Use-Instrumentation-struct-for-p.patch (29.1K, 7-v16-0007-instrumentation-Use-Instrumentation-struct-for-p.patch)
From f7b78fd21bda61c2366496f8cfe9a0aa76f87588 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 15 Mar 2026 21:44:58 -0700
Subject: [PATCH v16 07/10] instrumentation: Use Instrumentation struct for
parallel workers
This simplifies the DSM allocations a bit, since WAL and buffer usage no
longer need separate allocations, and makes it easier to add the third
stack-based struct currently being discussed.
Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
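To illustrate the shape of the change in the diffs below, here is a standalone sketch of the "one slot per worker" pattern: a single array of combined instrumentation structs replaces the two parallel WalUsage and BufferUsage arrays. All Mock* types and mock_* functions are hypothetical stand-ins with made-up fields, and plain calloc takes the place of the shm_toc allocation; it is not the actual InstrEndParallelQuery or InstrAccumParallelQuery code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-ins for the real usage structs. */
typedef struct MockBufferUsage
{
	long		shared_blks_hit;
	long		shared_blks_read;
} MockBufferUsage;

typedef struct MockWalUsage
{
	long		wal_records;
	long		wal_bytes;
} MockWalUsage;

/* One combined struct per worker instead of two separate arrays. */
typedef struct MockInstrumentation
{
	MockBufferUsage bufusage;
	MockWalUsage walusage;
} MockInstrumentation;

/* Leader: one allocation sized by the number of planned workers. */
MockInstrumentation *
mock_alloc_worker_slots(int nworkers)
{
	assert(nworkers > 0);
	return calloc((size_t) nworkers, sizeof(MockInstrumentation));
}

/* Worker: publish local counters into its own slot when it finishes. */
void
mock_worker_report(MockInstrumentation *slots, int worker_number,
				   const MockInstrumentation *local)
{
	slots[worker_number] = *local;
}

/* Leader: fold in only the slots of workers that were actually launched. */
void
mock_leader_accum(MockInstrumentation *total,
				  const MockInstrumentation *slots, int nworkers_launched)
{
	for (int i = 0; i < nworkers_launched; i++)
	{
		total->bufusage.shared_blks_hit += slots[i].bufusage.shared_blks_hit;
		total->bufusage.shared_blks_read += slots[i].bufusage.shared_blks_read;
		total->walusage.wal_records += slots[i].walusage.wal_records;
		total->walusage.wal_bytes += slots[i].walusage.wal_bytes;
	}
}
```

With a single array, a further usage struct would only need a new member in the combined struct rather than another DSM key, estimate and allocation in every caller.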
src/backend/access/brin/brin.c | 43 ++++++-----------
src/backend/access/gin/gininsert.c | 43 ++++++-----------
src/backend/access/nbtree/nbtsort.c | 43 ++++++-----------
src/backend/commands/vacuumparallel.c | 52 ++++++++-------------
src/backend/executor/execParallel.c | 66 ++++++++++++---------------
src/backend/executor/instrument.c | 14 +++---
src/include/executor/execParallel.h | 5 +-
src/include/executor/instrument.h | 4 +-
8 files changed, 99 insertions(+), 171 deletions(-)
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 3a5176c76c7..9e545b4ef0e 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -51,8 +51,7 @@
#define PARALLEL_KEY_BRIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -148,8 +147,7 @@ typedef struct BrinLeader
BrinShared *brinshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BrinLeader;
/*
@@ -2387,8 +2385,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
BrinShared *brinshared;
Sharedsort *sharedsort;
BrinLeader *brinleader = palloc0_object(BrinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -2430,18 +2427,14 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -2514,15 +2507,12 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -2533,8 +2523,7 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
brinleader->brinshared = brinshared;
brinleader->sharedsort = sharedsort;
brinleader->snapshot = snapshot;
- brinleader->walusage = walusage;
- brinleader->bufferusage = bufferusage;
+ brinleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -2573,7 +2562,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2888,8 +2877,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2950,11 +2938,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 0d80f72a0b0..f3de62ce7f3 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -45,8 +45,7 @@
#define PARALLEL_KEY_GIN_SHARED UINT64CONST(0xB000000000000001)
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xB000000000000002)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xB000000000000003)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xB000000000000004)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xB000000000000005)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xB000000000000004)
/*
* Status for index builds performed in parallel. This is allocated in a
@@ -138,8 +137,7 @@ typedef struct GinLeader
GinBuildShared *ginshared;
Sharedsort *sharedsort;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} GinLeader;
typedef struct
@@ -945,8 +943,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
GinBuildShared *ginshared;
Sharedsort *sharedsort;
GinLeader *ginleader = palloc0_object(GinLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -987,18 +984,14 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
shm_toc_estimate_keys(&pcxt->estimator, 2);
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1066,15 +1059,12 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1085,8 +1075,7 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
ginleader->ginshared = ginshared;
ginleader->sharedsort = sharedsort;
ginleader->snapshot = snapshot;
- ginleader->walusage = walusage;
- ginleader->bufferusage = bufferusage;
+ ginleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1125,7 +1114,7 @@ _gin_end_parallel(GinLeader *ginleader, GinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < ginleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&ginleader->bufferusage[i], &ginleader->walusage[i]);
+ InstrAccumParallelQuery(&ginleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(ginleader->snapshot))
@@ -2119,8 +2108,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
/*
@@ -2200,11 +2188,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
heapRel, indexRel, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 2d7b7cef912..cb238f862a7 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -66,8 +66,7 @@
#define PARALLEL_KEY_TUPLESORT UINT64CONST(0xA000000000000002)
#define PARALLEL_KEY_TUPLESORT_SPOOL2 UINT64CONST(0xA000000000000003)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xA000000000000004)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xA000000000000005)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xA000000000000006)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xA000000000000005)
/*
* DISABLE_LEADER_PARTICIPATION disables the leader's participation in
@@ -195,8 +194,7 @@ typedef struct BTLeader
Sharedsort *sharedsort;
Sharedsort *sharedsort2;
Snapshot snapshot;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
} BTLeader;
/*
@@ -1408,8 +1406,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
Sharedsort *sharedsort2;
BTSpool *btspool = buildstate->spool;
BTLeader *btleader = palloc0_object(BTLeader);
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *instr;
bool leaderparticipates = true;
int querylen;
@@ -1462,18 +1459,14 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Estimate space for WalUsage and BufferUsage -- PARALLEL_KEY_WAL_USAGE
- * and PARALLEL_KEY_BUFFER_USAGE.
+ * Estimate space for Instrumentation -- PARALLEL_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_KEY_QUERY_TEXT space */
@@ -1560,15 +1553,12 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
}
/*
- * Allocate space for each worker's WalUsage and BufferUsage; no need to
+ * Allocate space for each worker's Instrumentation; no need to
* initialize.
*/
- walusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage);
- bufferusage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufferusage);
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
/* Launch workers, saving status for leader/caller */
LaunchParallelWorkers(pcxt);
@@ -1580,8 +1570,7 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
btleader->sharedsort = sharedsort;
btleader->sharedsort2 = sharedsort2;
btleader->snapshot = snapshot;
- btleader->walusage = walusage;
- btleader->bufferusage = bufferusage;
+ btleader->instr = instr;
/* If no workers were successfully launched, back out (do serial build) */
if (pcxt->nworkers_launched == 0)
@@ -1620,7 +1609,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->instr[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1754,8 +1743,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
QueryInstrumentation *instr;
- WalUsage *walusage;
- BufferUsage *bufferusage;
+ Instrumentation *worker_instr;
int sortmem;
#ifdef BTREE_BUILD_STATS
@@ -1837,11 +1825,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
sharedsort2, sortmem, false);
/* Report WAL/buffer usage during parallel execution */
- bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 82bfbc6d492..f7a17e4f73f 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -56,9 +56,8 @@
*/
#define PARALLEL_VACUUM_KEY_SHARED 1
#define PARALLEL_VACUUM_KEY_QUERY_TEXT 2
-#define PARALLEL_VACUUM_KEY_BUFFER_USAGE 3
-#define PARALLEL_VACUUM_KEY_WAL_USAGE 4
-#define PARALLEL_VACUUM_KEY_INDEX_STATS 5
+#define PARALLEL_VACUUM_KEY_INSTRUMENTATION 3
+#define PARALLEL_VACUUM_KEY_INDEX_STATS 4
/*
* Struct for cost-based vacuum delay related parameters to share among an
@@ -236,11 +235,8 @@ struct ParallelVacuumState
/* Shared dead items space among parallel vacuum workers */
TidStore *dead_items;
- /* Points to buffer usage area in DSM */
- BufferUsage *buffer_usage;
-
- /* Points to WAL usage area in DSM */
- WalUsage *wal_usage;
+ /* Points to instrumentation area in DSM */
+ Instrumentation *instr;
/*
* False if the index is totally unsuitable target for all parallel
@@ -311,8 +307,7 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
PVShared *shared;
TidStore *dead_items;
PVIndStats *indstats;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *instr;
bool *will_parallel_vacuum;
Size est_indstats_len;
Size est_shared_len;
@@ -365,18 +360,15 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage and WalUsage --
- * PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
+ * Estimate space for Instrumentation --
+ * PARALLEL_VACUUM_KEY_INSTRUMENTATION.
*
* If there are no extensions loaded that care, we could skip this. We
* have no way of knowing whether anyone's looking at instrumentation, so
* do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Finally, estimate PARALLEL_VACUUM_KEY_QUERY_TEXT space */
@@ -474,17 +466,13 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
pvs->shared = shared;
/*
- * Allocate space for each worker's BufferUsage and WalUsage; no need to
- * initialize
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
*/
- buffer_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, buffer_usage);
- pvs->buffer_usage = buffer_usage;
- wal_usage = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_WAL_USAGE, wal_usage);
- pvs->wal_usage = wal_usage;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, instr);
+ pvs->instr = instr;
/* Store query string for workers */
if (debug_query_string)
@@ -945,7 +933,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->instr[i]);
}
/*
@@ -1203,8 +1191,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVShared *shared;
TidStore *dead_items;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
+ Instrumentation *worker_instr;
int nindexes;
char *sharedquery;
ErrorContextCallback errcallback;
@@ -1312,11 +1299,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
parallel_vacuum_process_safe_indexes(&pvs);
/* Report buffer/WAL usage during parallel execution */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ worker_instr = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
if (track_cost_delay_timing)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f32aa660294..934f4d9547f 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -60,13 +60,12 @@
#define PARALLEL_KEY_EXECUTOR_FIXED UINT64CONST(0xE000000000000001)
#define PARALLEL_KEY_PLANNEDSTMT UINT64CONST(0xE000000000000002)
#define PARALLEL_KEY_PARAMLISTINFO UINT64CONST(0xE000000000000003)
-#define PARALLEL_KEY_BUFFER_USAGE UINT64CONST(0xE000000000000004)
+#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000004)
#define PARALLEL_KEY_TUPLE_QUEUE UINT64CONST(0xE000000000000005)
-#define PARALLEL_KEY_INSTRUMENTATION UINT64CONST(0xE000000000000006)
+#define PARALLEL_KEY_NODE_INSTRUMENTATION UINT64CONST(0xE000000000000006)
#define PARALLEL_KEY_DSA UINT64CONST(0xE000000000000007)
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
-#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -650,8 +649,6 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_data;
char *pstmt_space;
char *paramlistinfo_space;
- BufferUsage *bufusage_space;
- WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
@@ -715,21 +712,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_estimate_keys(&pcxt->estimator, 1);
/*
- * Estimate space for BufferUsage.
+ * Estimate space for Instrumentation.
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
* looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_estimate_keys(&pcxt->estimator, 1);
-
- /*
- * Same thing for WalUsage.
- */
- shm_toc_estimate_chunk(&pcxt->estimator,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
/* Estimate space for tuple queues. */
@@ -815,17 +805,18 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_PARAMLISTINFO, paramlistinfo_space);
SerializeParamList(estate->es_param_list_info, &paramlistinfo_space);
- /* Allocate space for each worker's BufferUsage; no need to initialize. */
- bufusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(BufferUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_BUFFER_USAGE, bufusage_space);
- pei->buffer_usage = bufusage_space;
+ /*
+ * Allocate space for each worker's Instrumentation; no need to
+ * initialize.
+ */
+ {
+ Instrumentation *instr;
- /* Same for WalUsage. */
- walusage_space = shm_toc_allocate(pcxt->toc,
- mul_size(sizeof(WalUsage), pcxt->nworkers));
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
- pei->wal_usage = walusage_space;
+ instr = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(Instrumentation), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION, instr);
+ pei->instrumentation = instr;
+ }
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -851,9 +842,9 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
InstrInitNode(&instrument[i], estate->es_instrument, false);
- shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_NODE_INSTRUMENTATION,
instrumentation);
- pei->instrumentation = instrumentation;
+ pei->node_instrumentation = instrumentation;
if (estate->es_jit_flags != PGJIT_NONE)
{
@@ -1255,7 +1246,7 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
* finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->instrumentation[i]);
pei->finished = true;
}
@@ -1269,11 +1260,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
void
ExecParallelCleanup(ParallelExecutorInfo *pei)
{
- /* Accumulate instrumentation, if any. */
- if (pei->instrumentation)
+ /* Accumulate node instrumentation, if any. */
+ if (pei->node_instrumentation)
{
ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
+ pei->node_instrumentation);
ExecFinalizeWorkerInstrumentation(pei->planstate);
}
@@ -1510,8 +1501,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
QueryInstrumentation *instr;
- BufferUsage *buffer_usage;
- WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
@@ -1526,7 +1515,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Set up DestReceiver, SharedExecutorInstrumentation, and QueryDesc. */
receiver = ExecParallelGetReceiver(seg, toc);
- instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, true);
+ instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_NODE_INSTRUMENTATION, true);
if (instrumentation != NULL)
instrument_options = instrumentation->instrument_options;
jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
@@ -1584,11 +1573,12 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecutorFinish(queryDesc);
/* Report buffer/WAL usage during parallel execution. */
- buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
- wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(instr,
- &buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ {
+ Instrumentation *worker_instr;
+
+ worker_instr = shm_toc_lookup(toc, PARALLEL_KEY_INSTRUMENTATION, false);
+ InstrEndParallelQuery(instr, &worker_instr[ParallelWorkerNumber]);
+ }
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ef1a94800f3..94d57e3bc40 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -349,11 +349,12 @@ InstrStartParallelQuery(void)
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst)
{
InstrQueryStopFinalize(qinstr);
- memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
- memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+ dst->need_stack = qinstr->instr.need_stack;
+ memcpy(&dst->bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(&dst->walusage, &qinstr->instr.walusage, sizeof(WalUsage));
}
/*
@@ -369,12 +370,11 @@ InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUs
* activity is accumulated.
*/
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(Instrumentation *instr)
{
- BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
- WalUsageAdd(&instr_stack.current->walusage, walusage);
+ InstrAccumStack(instr_stack.current, instr);
- WalUsageAdd(&pgWalUsage, walusage);
+ WalUsageAdd(&pgWalUsage, &instr->walusage);
}
/* Node instrumentation handling */
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..6c8b602d07f 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -25,9 +25,8 @@ typedef struct ParallelExecutorInfo
{
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
- BufferUsage *buffer_usage; /* points to bufusage area in DSM */
- WalUsage *wal_usage; /* walusage area in DSM */
- SharedExecutorInstrumentation *instrumentation; /* optional */
+ Instrumentation *instrumentation; /* instrumentation area in DSM */
+ SharedExecutorInstrumentation *node_instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
dsa_pointer param_exec; /* serialized PARAM_EXEC parameters */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 6ee4ce2b521..72df21334ff 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -286,8 +286,8 @@ extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, Instrumentation *dst);
+extern void InstrAccumParallelQuery(Instrumentation *instr);
extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr,
int instrument_options,
--
2.47.1
[application/x-patch] v16-0006-Optimize-measuring-WAL-buffer-usage-through-stac.patch (81.7K, 8-v16-0006-Optimize-measuring-WAL-buffer-usage-through-stac.patch)
From 78347b717da8b95c9eb917f602045f4b29bfa215 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Mon, 6 Apr 2026 01:20:45 -0700
Subject: [PATCH v16 06/10] Optimize measuring WAL/buffer usage through
stack-based instrumentation
Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that get
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode, and popped at InstrStopNode. Stack entries are
zero-initialized (avoiding the diff mechanism) and get added to their
parent entry when they are finalized, i.e. once no more modifications can
occur.
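The push/pop/finalize cycle described above can be sketched roughly as
follows. This is a simplified standalone illustration, not the code from
this patch: UsageEntry, current_entry and the usage_* helpers are
hypothetical names, and only a single counter is tracked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a per-node stack entry. */
typedef struct UsageEntry
{
	int64_t		shared_blks_hit;	/* activity charged to this entry */
	struct UsageEntry *prev;		/* entry that was current before the push */
} UsageEntry;

/* The entry that buffer activity is currently charged to (assumed global). */
UsageEntry *current_entry = NULL;

/* Zero-initialize once, so no start/stop diff is ever needed. */
void
usage_init(UsageEntry *entry)
{
	entry->shared_blks_hit = 0;
	entry->prev = NULL;
}

/* Entering a code section: make this entry the current one. */
void
usage_push(UsageEntry *entry)
{
	entry->prev = current_entry;
	current_entry = entry;
}

/* Leaving a code section: restore whichever entry was current before. */
void
usage_pop(UsageEntry *entry)
{
	assert(current_entry == entry);
	current_entry = entry->prev;
}

/* Once an entry can no longer change, fold its totals into its parent. */
void
usage_finalize(UsageEntry *entry, UsageEntry *parent)
{
	parent->shared_blks_hit += entry->shared_blks_hit;
}

/* The actual activity writes straight into the current stack entry. */
void
count_buffer_hit(void)
{
	if (current_entry != NULL)
		current_entry->shared_blks_hit++;
}

/* One direct hit at query level, plus three node cycles of two hits each. */
int64_t
demo_total_hits(void)
{
	UsageEntry	query;
	UsageEntry	scan;

	usage_init(&query);
	usage_init(&scan);

	usage_push(&query);
	count_buffer_hit();
	for (int i = 0; i < 3; i++)
	{
		usage_push(&scan);
		count_buffer_hit();
		count_buffer_hit();
		usage_pop(&scan);
	}
	usage_finalize(&scan, &query);
	usage_pop(&query);

	return query.shared_blks_hit;
}
```

In this sketch the hot path per node cycle is two pointer assignments,
while the addition into the parent happens only once at finalize time.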
To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct with its
associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.
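To illustrate why the abort handling matters, here is a minimal
standalone sketch that uses plain setjmp/longjmp as a stand-in for
PG_TRY/PG_FINALLY. All names (StackEntry, stack_top, error_handler,
raise_error, inner_work, run_guarded) are hypothetical and not part of
this patch, and re-throwing the error is deliberately left out.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stack entry; only the link to the previous entry matters. */
typedef struct StackEntry
{
	struct StackEntry *prev;
	int			hits;
} StackEntry;

StackEntry *stack_top = NULL;		/* assumed global "current entry" */
jmp_buf    *error_handler = NULL;	/* innermost active handler, if any */

/* Stand-in for an error: jump to the innermost handler. */
void
raise_error(void)
{
	assert(error_handler != NULL);
	longjmp(*error_handler, 1);
}

/* Pushes an entry and, on failure, never gets the chance to pop it. */
void
inner_work(int fail)
{
	StackEntry	inner = {stack_top, 0};

	stack_top = &inner;
	stack_top->hits++;
	if (fail)
		raise_error();		/* stack_top is left pointing at 'inner' */
	stack_top = inner.prev;
}

/* Runs inner_work() and restores the stack on both exit paths. */
int
run_guarded(int fail)
{
	StackEntry *saved_top = stack_top;
	jmp_buf    *saved_handler = error_handler;
	jmp_buf		local;
	volatile int failed = 0;

	error_handler = &local;
	if (setjmp(local) == 0)
		inner_work(fail);
	else
		failed = 1;

	/* "finally" part: undo whatever the aborted code left behind */
	error_handler = saved_handler;
	stack_top = saved_top;

	return failed;
}
```

Without the restore step at the end of run_guarded(), stack_top would
keep pointing into a stack frame that no longer exists after the jump,
and later activity would be written through a dangling pointer.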
In tests, the stack-based instrumentation mechanism reduces the overhead
of EXPLAIN (ANALYZE, BUFFERS ON, TIMING OFF) for a large COUNT(*) query
from about 50% to 22% on top of the actual runtime.
This also drops the global pgBufferUsage; callers interested in
measuring buffer activity should instead use InstrStart/InstrStop.
The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and heap_page_prune_and_freeze for
measuring FPIs.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Heikki Linnakangas <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
---
.../pg_stat_statements/pg_stat_statements.c | 6 +-
src/backend/access/brin/brin.c | 10 +-
src/backend/access/gin/gininsert.c | 10 +-
src/backend/access/heap/vacuumlazy.c | 12 +-
src/backend/access/nbtree/nbtsort.c | 10 +-
src/backend/commands/analyze.c | 12 +-
src/backend/commands/explain.c | 10 +-
src/backend/commands/explain_dr.c | 2 +
src/backend/commands/prepare.c | 10 +-
src/backend/commands/repack.c | 2 +-
src/backend/commands/tablecmds.c | 5 +-
src/backend/commands/vacuumparallel.c | 10 +-
src/backend/executor/README.instrument | 237 ++++++++++
src/backend/executor/execMain.c | 94 +++-
src/backend/executor/execParallel.c | 32 +-
src/backend/executor/execPartition.c | 5 +-
src/backend/executor/execProcnode.c | 106 ++++-
src/backend/executor/execUtils.c | 13 +-
src/backend/executor/instrument.c | 429 ++++++++++++++----
src/backend/replication/logical/worker.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 6 +-
src/backend/utils/activity/pgstat_io.c | 6 +-
src/include/executor/executor.h | 6 +-
src/include/executor/instrument.h | 199 +++++++-
src/include/nodes/execnodes.h | 2 +
src/include/utils/resowner.h | 1 +
src/tools/pgindent/typedefs.list | 2 +
27 files changed, 1045 insertions(+), 194 deletions(-)
create mode 100644 src/backend/executor/README.instrument
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 5da71c9be16..a9cd1150ebb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -924,12 +924,11 @@ pgss_planner(Query *parse,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
- InstrStop(&instr);
-
pgss_store(query_string,
parse->queryId,
parse->stmt_location,
@@ -1140,6 +1139,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
}
PG_FINALLY();
{
+ InstrStopFinalize(&instr);
nesting_level--;
}
PG_END_TRY();
@@ -1154,8 +1154,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
* former value, which'd otherwise be a good idea.
*/
- InstrStop(&instr);
-
/*
* Track the total number of rows retrieved or affected by the utility
* statements of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index bdb30752e09..3a5176c76c7 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2434,8 +2434,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2887,6 +2887,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2936,7 +2937,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2951,7 +2952,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index 9d83a495775..0d80f72a0b0 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -991,8 +991,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2118,6 +2118,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -2186,7 +2187,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2201,7 +2202,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 6173e53c4ad..dd285863062 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -638,7 +638,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
TimestampTz starttime = 0;
PgStat_Counter startreadtime = 0,
startwritetime = 0;
- Instrumentation *instr = NULL;
+ QueryInstrumentation *instr = NULL;
ErrorContextCallback errcallback;
char **indnames = NULL;
Size dead_items_max_bytes = 0;
@@ -654,8 +654,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
startreadtime = pgStatBlockReadTime;
startwritetime = pgStatBlockWriteTime;
}
- instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
- InstrStart(instr);
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -997,7 +997,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
- InstrStop(instr);
+ InstrQueryStopFinalize(instr);
if (verbose || params->log_vacuum_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
@@ -1013,8 +1013,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
- BufferUsage bufferusage = instr->bufusage;
- WalUsage walusage = instr->walusage;
+ BufferUsage bufferusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 756dfa3dcf4..2d7b7cef912 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1466,8 +1466,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
* and PARALLEL_KEY_BUFFER_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgWalUsage or
- * pgBufferUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -1753,6 +1753,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
Relation indexRel;
LOCKMODE heapLockmode;
LOCKMODE indexLockmode;
+ QueryInstrumentation *instr;
WalUsage *walusage;
BufferUsage *bufferusage;
int sortmem;
@@ -1828,7 +1829,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1838,7 +1839,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &bufferusage[ParallelWorkerNumber],
&walusage[ParallelWorkerNumber]);
#ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 8472fc0c280..10f8a2dc81c 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -309,7 +309,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
Oid save_userid;
int save_sec_context;
int save_nestlevel;
- Instrumentation *instr = NULL;
+ QueryInstrumentation *instr = NULL;
PgStat_Counter startreadtime = 0;
PgStat_Counter startwritetime = 0;
@@ -361,8 +361,8 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
pg_rusage_init(&ru0);
- instr = InstrAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
- InstrStart(instr);
+ instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+ InstrQueryStart(instr);
}
/* Used for instrumentation and stats report */
@@ -743,7 +743,7 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
{
TimestampTz endtime = GetCurrentTimestamp();
- InstrStop(instr);
+ InstrQueryStopFinalize(instr);
if (verbose || params->log_analyze_min_duration == 0 ||
TimestampDifferenceExceeds(starttime, endtime,
@@ -757,8 +757,8 @@ do_analyze_rel(Relation onerel, const VacuumParams *params,
int64 total_blks_hit;
int64 total_blks_read;
int64 total_blks_dirtied;
- BufferUsage bufusage = instr->bufusage;
- WalUsage walusage = instr->walusage;
+ BufferUsage bufusage = instr->instr.bufusage;
+ WalUsage walusage = instr->instr.walusage;
total_blks_hit = bufusage.shared_blks_hit +
bufusage.local_blks_hit;
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index deaaba6f900..e2b1d343cca 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -324,7 +324,7 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
QueryEnvironment *queryEnv)
{
PlannedStmt *plan;
- Instrumentation plan_instr = {0};
+ QueryInstrumentation *plan_instr = NULL;
int instrument_options = INSTRUMENT_TIMER;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
@@ -333,7 +333,7 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
if (es->buffers)
instrument_options |= INSTRUMENT_BUFFERS;
- InstrInitOptions(&plan_instr, instrument_options);
+ plan_instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -351,12 +351,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- InstrStart(&plan_instr);
+ InstrQueryStart(plan_instr);
/* plan the query */
plan = pg_plan_query(query, queryString, cursorOptions, params, es);
- InstrStop(&plan_instr);
+ InstrQueryStopFinalize(plan_instr);
if (es->memory)
{
@@ -366,7 +366,7 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
- &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
+ &plan_instr->instr.total, (es->buffers ? &plan_instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
}
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index df5ae5f4569..836395d6992 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -236,6 +236,8 @@ serializeAnalyzeShutdown(DestReceiver *self)
{
SerializeDestReceiver *receiver = (SerializeDestReceiver *) self;
+ InstrFinalizeChild(&receiver->metrics.instr, instr_stack.current);
+
if (receiver->finfos)
pfree(receiver->finfos);
receiver->finfos = NULL;
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index bf9f2eb6149..ee811357588 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -581,7 +581,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
ListCell *p;
ParamListInfo paramLI = NULL;
EState *estate = NULL;
- Instrumentation plan_instr = {0};
+ QueryInstrumentation *plan_instr = NULL;
int instrument_options = INSTRUMENT_TIMER;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
@@ -590,7 +590,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (es->buffers)
instrument_options |= INSTRUMENT_BUFFERS;
- InstrInitOptions(&plan_instr, instrument_options);
+ plan_instr = InstrQueryAlloc(instrument_options);
if (es->memory)
{
@@ -602,7 +602,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
saved_ctx = MemoryContextSwitchTo(planner_ctx);
}
- InstrStart(&plan_instr);
+ InstrQueryStart(plan_instr);
/* Look it up in the hash table */
entry = FetchPreparedStatement(execstmt->name, true);
@@ -637,7 +637,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
cplan = GetCachedPlan(entry->plansource, paramLI,
CurrentResourceOwner, pstate->p_queryEnv);
- InstrStop(&plan_instr);
+ InstrQueryStopFinalize(plan_instr);
if (es->memory)
{
@@ -654,7 +654,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
- &plan_instr.total, (es->buffers ? &plan_instr.bufusage : NULL),
+ &plan_instr->instr.total, (es->buffers ? &plan_instr->instr.bufusage : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/repack.c b/src/backend/commands/repack.c
index 20dad22c4b7..b2ff5d2ff44 100644
--- a/src/backend/commands/repack.c
+++ b/src/backend/commands/repack.c
@@ -2890,7 +2890,7 @@ initialize_change_context(ChangeContext *chgcxt,
chgcxt->cc_estate = CreateExecutorState();
chgcxt->cc_rri = (ResultRelInfo *) palloc(sizeof(ResultRelInfo));
- InitResultRelInfo(chgcxt->cc_rri, relation, 0, 0, 0);
+ InitResultRelInfo(chgcxt->cc_rri, relation, 0, 0, 0, NULL);
ExecOpenIndices(chgcxt->cc_rri, false);
/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index eec09ba1ded..f86e4ac67cd 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -2139,7 +2139,7 @@ ExecuteTruncateGuts(List *explicit_rels,
rel,
0, /* dummy rangetable index */
NULL,
- 0);
+ 0, NULL);
estate->es_opened_result_relations =
lappend(estate->es_opened_result_relations, resultRelInfo);
resultRelInfo++;
@@ -6338,7 +6338,8 @@ ATRewriteTable(AlteredTableInfo *tab, Oid OIDNewHeap)
oldrel,
0, /* dummy rangetable index */
NULL,
- estate->es_instrument);
+ estate->es_instrument,
+ estate->es_query_instr);
MemoryContextSwitchTo(oldcontext);
}
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 979c2be4abd..82bfbc6d492 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -369,8 +369,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
* PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
*
* If there are no extensions loaded that care, we could skip this. We
- * have no way of knowing whether anyone's looking at pgBufferUsage or
- * pgWalUsage, so do it unconditionally.
+ * have no way of knowing whether anyone's looking at instrumentation, so
+ * do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1202,6 +1202,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
PVIndStats *indstats;
PVShared *shared;
TidStore *dead_items;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
int nindexes;
@@ -1305,7 +1306,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1313,7 +1314,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/README.instrument b/src/backend/executor/README.instrument
new file mode 100644
index 00000000000..7df837dbc77
--- /dev/null
+++ b/src/backend/executor/README.instrument
@@ -0,0 +1,237 @@
+src/backend/executor/README.instrument
+
+Instrumentation
+===============
+
+The instrumentation subsystem measures time, buffer usage and WAL activity
+during query execution and other similar activities. It is used by
+EXPLAIN ANALYZE, pg_stat_statements, and other consumers that need
+activity and/or timing metrics over a section of code.
+
+The design has two central goals:
+
+* Make it cheap to measure activity in a section of code, even when
+ that section is called many times and the aggregate is what is used
+ (as is the case with per-node instrumentation in the executor)
+
+* Ensure nested instrumentation accurately measures activity/timing,
+ even when execution is aborted due to errors being thrown.
+
+The key data structures are defined in src/include/executor/instrument.h
+and the implementation lives in src/backend/executor/instrument.c.
+
+
+Instrumentation Options
+-----------------------
+
+Callers specify what to measure with a bitmask of InstrumentOption flags:
+
+ INSTRUMENT_ROWS -- row counts only (used with NodeInstrumentation)
+ INSTRUMENT_TIMER -- wall-clock timing and row counts
+ INSTRUMENT_BUFFERS -- buffer hit/read/dirtied/written counts and I/O time
+ INSTRUMENT_WAL -- WAL records, FPI, bytes
+
+INSTRUMENT_BUFFERS and INSTRUMENT_WAL utilize the instrumentation stack
+(described below) for efficient handling of counter values.
+
+
+Struct Hierarchy
+----------------
+
+The following instrumentation structs are defined, each specialized for a
+different scope:
+
+Instrumentation Base struct. Holds timing and buffer/WAL counters.
+
+QueryInstrumentation Extends Instrumentation for query-level tracking. When
+ stack-based tracking is enabled, it owns a dedicated
+ MemoryContext and uses the ResourceOwner mechanism for
+ abort cleanup.
+
+NodeInstrumentation Extends Instrumentation for per-plan-node statistics
+ (startup time, tuple counts, loop counts, etc).
+
+TriggerInstrumentation Extends Instrumentation with a firing count.
+
+
+Stack-based instrumentation
+===========================
+
+For tracking WAL or buffer usage counters, the specialized stack-based
+instrumentation is used.
+
+A simple approach to measuring buffer/WAL activity in a code section could be
+to have a set of global counters, snapshot all the counters at the start, and
+diff them at the end. But this is expensive in practice: BufferUsage alone
+has many fields, and the diff must be computed for every InstrStartNode /
+InstrStopNode cycle.
+
+An alternative is to write counter updates directly into the struct that
+should receive them, avoiding the diff. But that has two complexities:
+low-level code, such as the buffer manager, has no direct pointers to
+higher-level structs, such as plan nodes tracking buffer usage. And
+instrumentation is often nested: we might be interested in both the aggregate
+buffer usage of a query and the individual per-node details. Stack-based
+instrumentation addresses both:
+
+At all times, there is a stack that tracks which Instrumentation is currently
+active. The stack is represented by instr_stack, a per-backend global
+that holds a dynamic array of Instrumentation pointers. The field
+instr_stack.current always points to the current stack entry that should
+be updated when activity occurs. When the stack array is empty,
+instr_stack.current points to instr_top.
+
+For example, if a backend has two portals open, the overall nesting of
+Instrumentation and their respective InstrStart/InstrStop calls creates a
+tree-like structure like this:
+
+ Session (instr_top)
+ |
+ +-- Query A (QueryInstrumentation)
+ | |
+ | +-- NestLoop (NodeInstrumentation)
+ | |
+ | +-- Seq Scan A (NodeInstrumentation)
+ | +-- Seq Scan B (NodeInstrumentation)
+ |
+ +-- Query B (QueryInstrumentation)
+ |
+ +-- Seq Scan C (NodeInstrumentation)
+
+While executing Seq Scan B, the stack looks like:
+
+ instr_top (implicit bottom, not in the entries array)
+ 0: Query A
+ 1: NestLoop
+ 2: Seq Scan B <-- instr_stack.current
+
+When no query is running, the stack is empty (stack_size == 0) and
+instr_stack.current points to instr_top.
+
+Any buffer or WAL counter update (via the INSTR_BUFUSAGE_* and
+INSTR_WALUSAGE_* macros in the buffer manager, WAL insertion code, etc.)
+writes directly into instr_stack.current. Each instrumentation node starts
+zeroed, so the values it accumulates while on top of the stack represent
+exactly the activity that occurred during that time.
+
+Every Instrumentation node (except for instr_top) has a target, or parent, it
+will be accumulated into, which is typically the Instrumentation that was the
+current stack entry when it was created.
+
+For example, when Seq Scan A gets finalized in regular execution via
+ExecutorFinish, its instrumentation data gets added to its immediate parent
+in the execution tree, the NestLoop. The NestLoop's data is in turn added to
+Query A's QueryInstrumentation, which then accumulates to its own parent.
+
+While we can typically think of this as a tree, the NodeInstrumentation
+underneath a particular QueryInstrumentation could behave differently --
+for example, it could propagate directly to the QueryInstrumentation, in
+order to not show cumulative numbers in EXPLAIN ANALYZE.
+
+Note these relationships are partially implicit, especially when it comes
+to NodeInstrumentation. Each QueryInstrumentation maintains a list of its
+unfinalized child nodes. The parent of a QueryInstrumentation itself is
+determined by the stack (see below): when a query is finalized or cleaned
+up on abort, its counters are accumulated to whatever entry is then current
+on the stack, which is typically instr_top.
+
+
+Finalization and Abort Safety
+=============================
+
+Finalization is the process of rolling up a node's buffer/WAL counters to
+its parent. In normal execution, nodes are pushed onto the stack when they
+start and popped when they stop; at finalization time their accumulated
+counters are added to the parent.
+
+Due to the use of longjmp for error handling, functions can exit abruptly
+without executing their normal cleanup code. On abort, two things need
+to happen:
+
+1. The stack is reset to the level saved at the start of the aborting
+ (sub-)transaction. This ensures that we don't later try to update
+ counters on a freed stack entry. We also need to ensure that the stack
+ entry that was current before a particular Instrumentation started is
+ current again after it stops.
+
+2. Finalize all affected Instrumentation nodes, rolling up their counters
+ to the innermost surviving Instrumentation, so that data is not lost.
+
+For example, if Seq Scan B aborts while the stack is:
+
+ instr_top (implicit bottom)
+ 0: Query A
+ 1: NestLoop
+ 2: Seq Scan B
+
+The abort handler for Query A accumulates all unfinalized children (Seq
+Scan A, Seq Scan B, NestLoop) directly into Query A's counters, then
+unwinds the instrumentation stack and accumulates Query A's counters to
+instr_top.
+
+Note that on abort the children do not accumulate through each other (Seq
+Scan B -> NestLoop -> Query A); they all accumulate directly to their
+parent QueryInstrumentation. This means the order in which children are
+released does not matter -- this is important because ResourceOwner cleanup
+does not guarantee a particular release order. The per-node breakdown is lost,
+but the instrumentation active when the query was started (instr_top in the
+above example) survives the abort, and its counters include the activity.
+
+If multiple QueryInstrumentations are active on the stack (e.g. nested
+portals), the abort handler of each uses InstrStopFinalize() to accumulate
+the statistics to the parent entry of either the entry being released, or a
+previously released entry if it was higher up in the stack, so they compose
+correctly regardless of release order.
+
+There are two mechanisms for achieving abort safety:
+
+* Resource Owner (QueryInstrumentation): registers with the current
+ ResourceOwner at start. On transaction abort, the resource owner system
+ calls the release callback, which walks unfinalized child entries,
+ accumulates their data, unwinds the stack, and destroys the dedicated
+ memory context (freeing the QueryInstrumentation and all child
+ allocations as a unit). This is the recommended approach when the
+ instrumented code already has an appropriate resource owner (e.g. it
+ runs inside a portal). The query executor uses this path.
+
+* PG_FINALLY (base Instrumentation): when no suitable resource owner
+ exists, or when the caller wants to inspect the instrumentation data
+ even after an error, the base Instrumentation can be used with a
+ PG_TRY/PG_FINALLY block that calls InstrStopFinalize().
+
+Both mechanisms add overhead, so neither is suitable for high-frequency
+instrumentation like per-node measurements in the executor. Instead,
+plan node and trigger children rely on their parent QueryInstrumentation
+for abort safety: they are allocated in the parent's memory context and
+registered in its unfinalized-entries list, so the parent's abort handler
+recovers their data automatically. In normal execution, children are
+finalized explicitly by the caller.
+
+Parallel Query
+--------------
+
+Parallel workers get their own QueryInstrumentation so they can measure
+buffer and WAL activity independently, then copy the totals into dynamic
+shared memory at worker shutdown. The leader accumulates these into its
+own stack.
+
+When per-node instrumentation is active, parallel workers skip per-node
+finalization at shutdown to avoid double-counting; the per-node data is
+aggregated separately through InstrAggNode().
+
+
+Memory Handling
+===============
+
+Instrumentation objects that use the stack must survive until finalization
+runs, including the abort case. To ensure this, QueryInstrumentation
+creates a dedicated "Instrumentation" MemoryContext (instr_cxt) as a child
+of TopMemoryContext. All child instrumentation (nodes, triggers) should be
+allocated in this context.
+
+On successful completion, instr_cxt is reparented to CurrentMemoryContext
+so its lifetime is tied to the caller's context. On abort, the
+ResourceOwner cleanup resets the stack, accumulates the instrumentation
+data to the then-current stack entry, and then frees instr_cxt.
+
+When the stack is not needed (timer/rows only), Instrumentation allocations
+happen in CurrentMemoryContext instead of TopMemoryContext.
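As a rough illustration of the stack model that README.instrument describes, the following is a toy sketch and not code from the patch. All names here (ToyUsage, ToyInstr, toy_top, toy_push, toy_pop, toy_count_hit, toy_accum, toy_demo) are hypothetical stand-ins, and the fixed-size array is an assumption made for brevity, whereas the README describes a dynamic array. The point it tries to show is that low-level code only increments the current entry, and that rolling up to the parent happens once, at finalization.

```c
#include <assert.h>

/* Toy stand-in for BufferUsage; the real struct has many more fields. */
typedef struct ToyUsage
{
	long		blks_hit;
	long		blks_read;
} ToyUsage;

/* Toy stand-in for Instrumentation (hypothetical, for illustration). */
typedef struct ToyInstr
{
	ToyUsage	usage;
} ToyInstr;

#define TOY_STACK_MAX 16

static ToyInstr toy_top;		/* plays the role of instr_top */
static ToyInstr *toy_entries[TOY_STACK_MAX];	/* fixed-size for brevity */
static int	toy_stack_size = 0;
static ToyInstr *toy_current = &toy_top;

/* Entering an instrumented section: make it the current entry. */
static void
toy_push(ToyInstr *instr)
{
	assert(toy_stack_size < TOY_STACK_MAX);
	toy_entries[toy_stack_size++] = instr;
	toy_current = instr;
}

/* Leaving the section: restore whatever was current before. */
static void
toy_pop(void)
{
	assert(toy_stack_size > 0);
	toy_stack_size--;
	toy_current = (toy_stack_size > 0) ?
		toy_entries[toy_stack_size - 1] : &toy_top;
}

/* What low-level code does: no snapshot, no diff, just an increment. */
static void
toy_count_hit(void)
{
	toy_current->usage.blks_hit++;
}

/* Finalization: roll a child's counters up into its parent. */
static void
toy_accum(ToyInstr *parent, const ToyInstr *child)
{
	parent->usage.blks_hit += child->usage.blks_hit;
	parent->usage.blks_read += child->usage.blks_read;
}

/* Runs one query with one scan node; returns hits seen by toy_top. */
long
toy_demo(void)
{
	ToyInstr	query = {0};
	ToyInstr	scan = {0};

	toy_top = (ToyInstr) {0};

	toy_push(&query);
	toy_count_hit();			/* charged to query */
	toy_push(&scan);
	toy_count_hit();			/* charged to scan */
	toy_count_hit();			/* charged to scan */
	toy_pop();					/* query is current again */
	toy_pop();					/* back to toy_top */

	toy_accum(&query, &scan);	/* scan -> query */
	toy_accum(&toy_top, &query);	/* query -> toy_top */
	return toy_top.usage.blks_hit;
}
```

In this toy, each entry holds only the activity that happened while it was current, and the cumulative number seen by toy_top exists only after the two toy_accum() calls. The abort handling, memory contexts and ResourceOwner integration described above are deliberately left out.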
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index f71f668883c..44d4fea76eb 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -78,6 +78,7 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
/* decls for local routines only used within this module */
static void InitPlan(QueryDesc *queryDesc, int eflags);
static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecFinalizeTriggerInstrumentation(EState *estate);
static void ExecPostprocessPlan(EState *estate);
static void ExecEndPlan(PlanState *planstate, EState *estate);
static void ExecutePlan(QueryDesc *queryDesc,
@@ -254,10 +255,18 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
* Set up query-level instrumentation if extensions have requested it via
* totaltime_options. Ensure an extension has not allocated totaltime
* itself.
+ *
+ * Alternatively, also set it up when running EXPLAIN (ANALYZE), as we
+ * utilize totaltime as the parent for node and trigger instrumentation.
*/
Assert(queryDesc->totaltime == NULL);
- if (queryDesc->totaltime_options)
- queryDesc->totaltime = InstrAlloc(queryDesc->totaltime_options);
+ if (queryDesc->totaltime_options || queryDesc->instrument_options)
+ {
+ estate->es_query_instr = InstrQueryAlloc(queryDesc->instrument_options |
+ queryDesc->totaltime_options);
+
+ queryDesc->totaltime = &estate->es_query_instr->instr;
+ }
/*
* Set up an AFTER-trigger statement context, unless told not to, or
@@ -340,9 +349,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
*/
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
- /* Allow instrumentation of Executor overall runtime */
- if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ /* Start up instrumentation for this execution run */
+ if (estate->es_query_instr)
+ InstrQueryStart(estate->es_query_instr);
/*
* extract information from the query descriptor and the query feature.
@@ -393,8 +402,8 @@ standard_ExecutorRun(QueryDesc *queryDesc,
if (sendTuples)
dest->rShutdown(dest);
- if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ if (estate->es_query_instr)
+ InstrQueryStop(estate->es_query_instr);
MemoryContextSwitchTo(oldcontext);
}
@@ -443,8 +452,8 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
/* Allow instrumentation of Executor overall runtime */
- if (queryDesc->totaltime)
- InstrStart(queryDesc->totaltime);
+ if (estate->es_query_instr)
+ InstrQueryStart(estate->es_query_instr);
/* Run ModifyTable nodes to completion */
ExecPostprocessPlan(estate);
@@ -453,8 +462,29 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
AfterTriggerEndQuery(estate);
- if (queryDesc->totaltime)
- InstrStop(queryDesc->totaltime);
+ if (estate->es_query_instr)
+ {
+ /*
+ * Accumulate per-node and trigger statistics to their respective
+ * parent instrumentation stacks.
+ *
+ * We skip this in parallel workers because their per-node stats are
+ * reported individually via ExecParallelReportInstrumentation, and
+ * the leader's own ExecFinalizeNodeInstrumentation handles
+ * propagation. If we accumulated here, the leader would
+ * double-count: worker parent nodes would already include their
+ * children's stats, and then the leader's accumulation would add the
+ * children again.
+ */
+ if (!IsParallelWorker() && estate->es_instrument)
+ {
+ ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+ ExecFinalizeTriggerInstrumentation(estate);
+ }
+
+ InstrQueryStopFinalize(estate->es_query_instr);
+ }
MemoryContextSwitchTo(oldcontext);
@@ -1272,7 +1302,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options)
+ int instrument_options,
+ QueryInstrumentation *qinstr)
{
MemSet(resultRelInfo, 0, sizeof(ResultRelInfo));
resultRelInfo->type = T_ResultRelInfo;
@@ -1293,8 +1324,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
palloc0_array(FmgrInfo, n);
resultRelInfo->ri_TrigWhenExprs = (ExprState **)
palloc0_array(ExprState *, n);
- if (instrument_options)
- resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
+ if (qinstr)
+ resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, instrument_options, n);
}
else
{
@@ -1367,6 +1398,10 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
* also provides a way for EXPLAIN ANALYZE to report the runtimes of such
* triggers.) So we make additional ResultRelInfo's as needed, and save them
* in es_trig_target_relations.
+ *
+ * Note: if new relation lists are searched here, they must also be added to
+ * ExecFinalizeTriggerInstrumentation so that trigger instrumentation data
+ * is properly accumulated.
*/
ResultRelInfo *
ExecGetTriggerResultRel(EState *estate, Oid relid,
@@ -1433,7 +1468,8 @@ ExecGetTriggerResultRel(EState *estate, Oid relid,
rel,
0, /* dummy rangetable index */
rootRelInfo,
- estate->es_instrument);
+ estate->es_instrument,
+ estate->es_query_instr);
estate->es_trig_target_relations =
lappend(estate->es_trig_target_relations, rInfo);
MemoryContextSwitchTo(oldcontext);
@@ -1496,7 +1532,8 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
/* dummy rangetable index */
InitResultRelInfo(rInfo, ancRel, 0, NULL,
- estate->es_instrument);
+ estate->es_instrument,
+ estate->es_query_instr);
ancResultRels = lappend(ancResultRels, rInfo);
}
ancResultRels = lappend(ancResultRels, rootRelInfo);
@@ -1509,6 +1546,30 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
return resultRelInfo->ri_ancestorResultRels;
}
+static void
+ExecFinalizeTriggerInstrumentation(EState *estate)
+{
+ List *rels = NIL;
+
+ rels = list_concat(rels, estate->es_tuple_routing_result_relations);
+ rels = list_concat(rels, estate->es_opened_result_relations);
+ rels = list_concat(rels, estate->es_trig_target_relations);
+
+ foreach_node(ResultRelInfo, rInfo, rels)
+ {
+ TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
+
+ if (ti == NULL || rInfo->ri_TrigDesc == NULL)
+ continue;
+
+ for (int nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
+ {
+ if (ti[nt].instr.need_stack)
+ InstrAccumStack(&estate->es_query_instr->instr, &ti[nt].instr);
+ }
+ }
+}
+
/* ----------------------------------------------------------------
* ExecPostprocessPlan
*
@@ -3066,6 +3127,7 @@ EvalPlanQualStart(EPQState *epqstate, Plan *planTree)
/* es_trig_target_relations must NOT be copied */
rcestate->es_top_eflags = parentestate->es_top_eflags;
rcestate->es_instrument = parentestate->es_instrument;
+ rcestate->es_query_instr = parentestate->es_query_instr;
/* es_auxmodifytables must NOT be copied */
/*
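The double-counting concern in the standard_ExecutorFinish comment above can be sketched with a toy model. This is an illustration under simplifying assumptions, not the patch's code: ToyNode, toy_finalize and toy_parallel_demo are hypothetical names, a single counter stands in for the full buffer/WAL structs, the leader is assumed to have no activity of its own, and the per-node addition only loosely mirrors what InstrAggNode() does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 2

/* Toy plan node with one counter (hypothetical, for illustration). */
typedef struct ToyNode
{
	long		blks_hit;		/* activity counted while node was current */
	struct ToyNode *children[TOY_MAX_CHILDREN];
} ToyNode;

/*
 * Post-order walk: fold the children into this node first, then add this
 * node's (now cumulative) counter to its parent.
 */
static void
toy_finalize(ToyNode *node, long *parent_hits)
{
	if (node == NULL)
		return;

	for (int i = 0; i < TOY_MAX_CHILDREN; i++)
		toy_finalize(node->children[i], &node->blks_hit);

	*parent_hits += node->blks_hit;
}

/* Returns the query-level total the leader ends up with. */
long
toy_parallel_demo(bool worker_finalizes)
{
	/* Worker-side tree: NestLoop over two scans. */
	ToyNode		w_scan_a = {.blks_hit = 5};
	ToyNode		w_scan_b = {.blks_hit = 7};
	ToyNode		w_nestloop = {.blks_hit = 1,
							  .children = {&w_scan_a, &w_scan_b}};

	/* Leader-side tree with no activity of its own (assumption). */
	ToyNode		l_scan_a = {0};
	ToyNode		l_scan_b = {0};
	ToyNode		l_nestloop = {.children = {&l_scan_a, &l_scan_b}};

	long		worker_query = 0;
	long		leader_query = 0;

	if (worker_finalizes)
		toy_finalize(&w_nestloop, &worker_query);

	/* Per-node aggregation of worker data into the leader's nodes. */
	l_scan_a.blks_hit += w_scan_a.blks_hit;
	l_scan_b.blks_hit += w_scan_b.blks_hit;
	l_nestloop.blks_hit += w_nestloop.blks_hit;

	/* The leader finalizes its own tree exactly once. */
	toy_finalize(&l_nestloop, &leader_query);
	return leader_query;
}
```

With these toy numbers the worker really did 1 + 5 + 7 = 13 hits. If the worker skips finalization, the leader's total is 13. If the worker finalizes first, its NestLoop already contains both scans, and the leader's walk adds them a second time, giving 25.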
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 1a5ec0c305f..f32aa660294 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -719,7 +719,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
*
* If EXPLAIN is not in use and there are no extensions loaded that care,
* we could skip this. But we have no way of knowing whether anyone's
- * looking at pgBufferUsage, so do it unconditionally.
+ * looking at instrumentation, so do it unconditionally.
*/
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1100,14 +1100,28 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
instrument = GetInstrumentationArray(instrumentation);
instrument += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
+ {
InstrAggNode(planstate->instrument, &instrument[n]);
+ /*
+ * Also add worker WAL usage to the global pgWalUsage counter.
+ *
+ * When per-node instrumentation is active, parallel workers skip
+ * ExecFinalizeNodeInstrumentation (to avoid double-counting in
+ * EXPLAIN), so per-node WAL activity is not rolled up into the
+ * query-level stats that InstrAccumParallelQuery receives. Without
+ * this, pgWalUsage would under-report WAL generated by parallel
+ * workers when instrumentation is active.
+ */
+ WalUsageAdd(&pgWalUsage, &instrument[n].instr.walusage);
+ }
+
/*
* Also store the per-worker detail.
*
- * Worker instrumentation should be allocated in the same context as the
- * regular instrumentation information, which is the per-query context.
- * Switch into per-query memory context.
+ * Ensure worker instrumentation is allocated in the per-query context. We
+ * don't need to place this in the instrumentation context since no more
+ * stack-based instrumentation work is being done.
*/
oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
@@ -1257,9 +1271,13 @@ ExecParallelCleanup(ParallelExecutorInfo *pei)
{
/* Accumulate instrumentation, if any. */
if (pei->instrumentation)
+ {
ExecParallelRetrieveInstrumentation(pei->planstate,
pei->instrumentation);
+ ExecFinalizeWorkerInstrumentation(pei->planstate);
+ }
+
/* Accumulate JIT instrumentation, if any. */
if (pei->jit_instrumentation)
ExecParallelRetrieveJitInstrumentation(pei->planstate,
@@ -1491,6 +1509,7 @@ void
ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
+ QueryInstrumentation *instr;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
DestReceiver *receiver;
@@ -1551,7 +1570,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ instr = InstrStartParallelQuery();
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1567,7 +1586,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+ InstrEndParallelQuery(instr,
+ &buffer_usage[ParallelWorkerNumber],
&wal_usage[ParallelWorkerNumber]);
/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d96d4f9947b..6888fbe4278 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -586,7 +586,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
partrel,
0,
rootResultRelInfo,
- estate->es_instrument);
+ estate->es_instrument,
+ estate->es_query_instr);
/*
* Verify result relation is a valid target for an INSERT. An UPDATE of a
@@ -1381,7 +1382,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
{
ResultRelInfo *rri = makeNode(ResultRelInfo);
- InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
+ InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0, NULL);
proute->nonleaf_partitions[dispatchidx] = rri;
}
else
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7c4c66e323f..5ca8d91344b 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -122,6 +122,8 @@
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
+static bool ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context);
/* ------------------------------------------------------------------------
@@ -413,7 +415,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
/* Set up instrumentation for this node if requested */
if (estate->es_instrument)
- result->instrument = InstrAllocNode(estate->es_instrument,
+ result->instrument = InstrAllocNode(estate->es_query_instr,
+ estate->es_instrument,
result->async_capable);
return result;
@@ -768,10 +771,10 @@ ExecShutdownNode_walker(PlanState *node, void *context)
* at least once already. We don't expect much CPU consumption during
* node shutdown, but in the case of Gather or Gather Merge, we may shut
* down workers at this stage. If so, their buffer usage will get
- * propagated into pgBufferUsage at this point, and we want to make sure
- * that it gets associated with the Gather node. We skip this if the node
- * has never been executed, so as to avoid incorrectly making it appear
- * that it has.
+ * propagated into the current instrumentation stack entry at this point,
+ * and we want to make sure that it gets associated with the Gather node.
+ * We skip this if the node has never been executed, so as to avoid
+ * incorrectly making it appear that it has.
*/
if (node->instrument && node->instrument->running)
InstrStartNode(node->instrument);
@@ -809,6 +812,99 @@ ExecShutdownNode_walker(PlanState *node, void *context)
return false;
}
+/*
+ * ExecFinalizeNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation).
+ *
+ * This must run after the cleanup done by ExecShutdownNode, and not rely on any
+ * resources cleaned up by it. We also expect shutdown actions to have occurred,
+ * e.g. parallel worker instrumentation to have been added to the leader.
+ */
+void
+ExecFinalizeNodeInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeNodeInstrumentation_walker(node, instr_stack.current);
+}
+
+static bool
+ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
+{
+ Instrumentation *parent = (Instrumentation *) context;
+
+ Assert(parent != NULL);
+
+ if (node == NULL)
+ return false;
+
+ Assert(node->instrument != NULL);
+
+ /*
+ * Recurse into children first (bottom-up accumulation), and accumulate to
+ * this node's instrumentation as the parent context.
+ */
+ planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
+ &node->instrument->instr);
+
+ InstrFinalizeChild(&node->instrument->instr, parent);
+
+ return false;
+}
+
+/*
+ * ExecFinalizeWorkerInstrumentation
+ *
+ * Accumulate per-worker instrumentation stats from child nodes into their
+ * parents, mirroring what ExecFinalizeNodeInstrumentation does for the
+ * leader's own stats. Without this, per-worker buffer/WAL stats shown by
+ * EXPLAIN (ANALYZE, VERBOSE) would only reflect each node's own direct
+ * activity, not including children.
+ *
+ * This must run after ExecParallelRetrieveInstrumentation has populated
+ * worker_instrument for all nodes in the parallel subtree.
+ */
+void
+ExecFinalizeWorkerInstrumentation(PlanState *node)
+{
+ (void) ExecFinalizeWorkerInstrumentation_walker(node, NULL);
+}
+
+static bool
+ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context)
+{
+ PlanState *parent = (PlanState *) context;
+ int num_workers;
+
+ if (node == NULL)
+ return false;
+
+ /*
+ * Recurse into children first (bottom-up accumulation), passing this node
+ * as parent context if it has worker_instrument, otherwise pass through
+ * the previous parent.
+ */
+ planstate_tree_walker(node, ExecFinalizeWorkerInstrumentation_walker,
+ node->worker_instrument ? (void *) node : context);
+
+ if (!node->worker_instrument)
+ return false;
+
+ num_workers = node->worker_instrument->num_workers;
+
+ /* Accumulate this node's per-worker stats to parent's per-worker stats */
+ if (parent && parent->worker_instrument)
+ {
+ int parent_workers = parent->worker_instrument->num_workers;
+
+ for (int n = 0; n < Min(num_workers, parent_workers); n++)
+ InstrAccumStack(&parent->worker_instrument->instrument[n].instr,
+ &node->worker_instrument->instrument[n].instr);
+ }
+
+ return false;
+}
+
/*
* ExecSetTupleBound
*
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 1eb6b9f1f40..8db2b70e5fe 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -151,6 +151,7 @@ CreateExecutorState(void)
estate->es_top_eflags = 0;
estate->es_instrument = 0;
+ estate->es_query_instr = NULL;
estate->es_finished = false;
estate->es_exprcontexts = NIL;
@@ -227,6 +228,15 @@ FreeExecutorState(EState *estate)
estate->es_partition_directory = NULL;
}
+ /*
+ * Make sure the instrumentation context gets freed. This usually gets
+ * re-parented under the per-query context in InstrQueryStopFinalize, but
+ * that won't happen during EXPLAIN (BUFFERS) since ExecutorFinish never
+ * gets called, so we would otherwise leak it in TopMemoryContext.
+ */
+ if (estate->es_query_instr && estate->es_query_instr->instr.need_stack)
+ MemoryContextDelete(estate->es_query_instr->instr_cxt);
+
/*
* Free the per-query memory context, thereby releasing all working
* memory, including the EState node itself.
@@ -913,7 +923,8 @@ ExecInitResultRelation(EState *estate, ResultRelInfo *resultRelInfo,
resultRelationDesc,
rti,
NULL,
- estate->es_instrument);
+ estate->es_instrument,
+ estate->es_query_instr);
if (estate->es_result_relations == NULL)
estate->es_result_relations = (ResultRelInfo **)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index dd08fc99fb2..ef1a94800f3 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -20,31 +20,53 @@
#include "nodes/execnodes.h"
#include "portability/instr_time.h"
#include "utils/guc_hooks.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
-BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
WalUsage pgWalUsage;
-static WalUsage save_pgWalUsage;
+Instrumentation instr_top;
+InstrStackState instr_stack = {
+ .stack_space = 0,
+ .stack_size = 0,
+ .entries = NULL,
+ .current = &instr_top,
+};
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
-static void WalUsageAdd(WalUsage *dst, WalUsage *add);
+void
+InstrStackGrow(void)
+{
+ int space = instr_stack.stack_space;
+
+ Assert(instr_stack.stack_size >= instr_stack.stack_space);
+
+ if (instr_stack.entries == NULL)
+ {
+ space = 10; /* Allocate sufficient initial space for
+ * typical activity */
+ instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
+ sizeof(Instrumentation *) * space);
+ }
+ else
+ {
+ space *= 2;
+ instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, space);
+ }
+ /* Update stack space after allocation succeeded to protect against OOMs */
+ instr_stack.stack_space = space;
+}
/* General purpose instrumentation handling */
-Instrumentation *
-InstrAlloc(int instrument_options)
+static inline bool
+InstrNeedStack(int instrument_options)
{
- Instrumentation *instr = palloc0_object(Instrumentation);
-
- InstrInitOptions(instr, instrument_options);
- return instr;
+ return (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) != 0;
}
void
InstrInitOptions(Instrumentation *instr, int instrument_options)
{
- instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
- instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
+ instr->need_stack = InstrNeedStack(instrument_options);
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
@@ -59,12 +81,8 @@ InstrStart(Instrumentation *instr)
INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
}
- /* save buffer usage totals at start, if needed */
- if (instr->need_bufusage)
- instr->bufusage_start = pgBufferUsage;
-
- if (instr->need_walusage)
- instr->walusage_start = pgWalUsage;
+ if (instr->need_stack)
+ InstrPushStack(instr);
}
/*
@@ -88,14 +106,9 @@ InstrStopCommon(Instrumentation *instr, instr_time *accum_time)
INSTR_TIME_SET_ZERO(instr->starttime);
}
- /* Add delta of buffer usage since InstrStart to the totals */
- if (instr->need_bufusage)
- BufferUsageAccumDiff(&instr->bufusage,
- &pgBufferUsage, &instr->bufusage_start);
-
- if (instr->need_walusage)
- WalUsageAccumDiff(&instr->walusage,
- &pgWalUsage, &instr->walusage_start);
+ /* pop the stack, unless InstrStopFinalize previously cleaned up */
+ if (instr->on_stack)
+ InstrPopStack(instr);
}
void
@@ -104,16 +117,279 @@ InstrStop(Instrumentation *instr)
InstrStopCommon(instr, &instr->total);
}
+/*
+ * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
+ *
+ * Note that this intentionally allows passing a stack that is not the current
+ * top, as can happen with PG_FINALLY, or resource owners, which don't have a
+ * guaranteed cleanup order.
+ */
+void
+InstrStopFinalize(Instrumentation *instr)
+{
+ /*
+ * If our current node is on the stack, make sure we reset the stack to
+ * the parent of whichever of the released stack entries has the lowest
+ * index
+ */
+ if (instr->on_stack)
+ {
+ int idx = -1;
+
+ for (int i = instr_stack.stack_size - 1; i >= 0; i--)
+ {
+ if (instr_stack.entries[i] == instr)
+ {
+ idx = i;
+ break;
+ }
+ }
+
+ if (idx < 0)
+ elog(ERROR, "instrumentation entry not found on stack");
+
+ /* Clear on_stack for any intermediate entries we're skipping over */
+ for (int i = instr_stack.stack_size - 1; i > idx; i--)
+ instr_stack.entries[i]->on_stack = false;
+
+ while (instr_stack.stack_size > idx + 1)
+ instr_stack.stack_size--;
+ }
+
+ InstrStop(instr);
+
+ /*
+ * Accumulate all instrumentation to the currently active instrumentation,
+ * so that callers get a complete picture of activity, even after an abort
+ */
+ InstrAccumStack(instr_stack.current, instr);
+}
+
+/*
+ * Finalize child instrumentation by accumulating buffer/WAL usage to the
+ * provided instrumentation, which may be the current entry, or one the caller
+ * treats as a parent and will add to the totals later.
+ *
+ * Also deletes the unfinalized entry to avoid double counting in an abort
+ * situation, e.g. during executor finish.
+ */
+void
+InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent)
+{
+ if (instr->need_stack)
+ {
+ if (!dlist_node_is_detached(&instr->unfinalized_entry))
+ dlist_delete_thoroughly(&instr->unfinalized_entry);
+
+ InstrAccumStack(parent, instr);
+ }
+}
+
+
+/* Query instrumentation handling */
+
+/*
+ * Use ResourceOwner mechanism to correctly reset instr_stack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+ .name = "instrumentation",
+ .release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+ .release_priority = RELEASE_PRIO_INSTRUMENTATION,
+ .ReleaseResource = ResOwnerReleaseInstrumentation,
+ .DebugPrint = NULL, /* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerRemember(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+ ResourceOwnerForget(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+ QueryInstrumentation *qinstr = (QueryInstrumentation *) DatumGetPointer(res);
+ MemoryContext instr_cxt = qinstr->instr_cxt;
+ dlist_mutable_iter iter;
+
+ /* Accumulate data from all unfinalized child entries (nodes, triggers) */
+ dlist_foreach_modify(iter, &qinstr->unfinalized_entries)
+ {
+ Instrumentation *child = dlist_container(Instrumentation, unfinalized_entry, iter.cur);
+
+ InstrAccumStack(&qinstr->instr, child);
+ }
+
+ /* Ensure the stack is reset as expected, and we accumulate to the parent */
+ InstrStopFinalize(&qinstr->instr);
+
+ /*
+ * Destroy the dedicated instrumentation context, which frees the
+ * QueryInstrumentation and all child allocations.
+ */
+ MemoryContextDelete(instr_cxt);
+}
+
+QueryInstrumentation *
+InstrQueryAlloc(int instrument_options)
+{
+ QueryInstrumentation *instr;
+ MemoryContext instr_cxt;
+
+ /*
+ * When the instrumentation stack is used, create a dedicated memory
+ * context for this query's instrumentation allocations. This context is a
+ * child of TopMemoryContext so it survives transaction abort, since
+ * ResourceOwner release needs to access it.
+ *
+ * For simpler cases (timer/rows only), use the current memory context.
+ *
+ * All child instrumentation allocations (nodes, triggers, etc) must be
+ * allocated within this context to ensure correct clean up on abort.
+ */
+ if (InstrNeedStack(instrument_options))
+ instr_cxt = AllocSetContextCreate(TopMemoryContext,
+ "Instrumentation",
+ ALLOCSET_SMALL_SIZES);
+ else
+ instr_cxt = CurrentMemoryContext;
+
+ instr = MemoryContextAllocZero(instr_cxt, sizeof(QueryInstrumentation));
+ instr->instrument_options = instrument_options;
+ instr->instr_cxt = instr_cxt;
+
+ InstrInitOptions(&instr->instr, instrument_options);
+ dlist_init(&instr->unfinalized_entries);
+
+ return instr;
+}
+
+void
+InstrQueryStart(QueryInstrumentation *qinstr)
+{
+ InstrStart(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(CurrentResourceOwner != NULL);
+ qinstr->owner = CurrentResourceOwner;
+
+ ResourceOwnerEnlarge(qinstr->owner);
+ ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
+ }
+}
+
+void
+InstrQueryStop(QueryInstrumentation *qinstr)
+{
+ InstrStop(&qinstr->instr);
+
+ if (qinstr->instr.need_stack)
+ {
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+ }
+}
+
+void
+InstrQueryStopFinalize(QueryInstrumentation *qinstr)
+{
+ InstrStopFinalize(&qinstr->instr);
+
+ if (!qinstr->instr.need_stack)
+ {
+ Assert(qinstr->owner == NULL);
+ return;
+ }
+
+ Assert(qinstr->owner != NULL);
+ ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+ qinstr->owner = NULL;
+
+ /*
+ * Reparent the dedicated instrumentation context under the current memory
+ * context, so that its lifetime is now tied to the caller's context
+ * rather than TopMemoryContext.
+ */
+ MemoryContextSetParent(qinstr->instr_cxt, CurrentMemoryContext);
+}
+
+/*
+ * Register a child Instrumentation entry for abort processing.
+ *
+ * On abort, ResOwnerReleaseInstrumentation will walk the parent's list to
+ * recover buffer/WAL data from entries that were never finalized, in order for
+ * aggregate totals to be accurate despite the query erroring out.
+ */
+void
+InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *child)
+{
+ if (child->need_stack)
+ dlist_push_head(&parent->unfinalized_entries, &child->unfinalized_entry);
+}
+
+/* start instrumentation during parallel executor startup */
+QueryInstrumentation *
+InstrStartParallelQuery(void)
+{
+ QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+ InstrQueryStart(qinstr);
+ return qinstr;
+}
+
+/* report usage after parallel executor shutdown */
+void
+InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+{
+ InstrQueryStopFinalize(qinstr);
+ memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+ memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+}
+
+/*
+ * Accumulate work done by parallel workers in the leader's stats.
+ *
+ * Note that what gets added here effectively depends on whether per-node
+ * instrumentation is active. If it's active the parallel worker intentionally
+ * skips ExecFinalizeNodeInstrumentation on executor shutdown, because it would
+ * cause double counting. Instead, this only accumulates any extra activity
+ * outside of nodes.
+ *
+ * Otherwise this is responsible for making sure that the complete query
+ * activity is accumulated.
+ */
+void
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+{
+ BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
+ WalUsageAdd(&instr_stack.current->walusage, walusage);
+
+ WalUsageAdd(&pgWalUsage, walusage);
+}
+
/* Node instrumentation handling */
/* Allocate new node instrumentation structure */
NodeInstrumentation *
-InstrAllocNode(int instrument_options, bool async_mode)
+InstrAllocNode(QueryInstrumentation *qinstr, int instrument_options,
+ bool async_mode)
{
- NodeInstrumentation *instr = palloc_object(NodeInstrumentation);
+ NodeInstrumentation *instr = MemoryContextAlloc(qinstr->instr_cxt, sizeof(NodeInstrumentation));
InstrInitNode(instr, instrument_options, async_mode);
+ InstrQueryRememberChild(qinstr, &instr->instr);
+
return instr;
}
@@ -133,6 +409,7 @@ InstrStartNode(NodeInstrumentation *instr)
InstrStart(&instr->instr);
}
+
/* Exit from a plan node */
inline void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
@@ -203,8 +480,8 @@ InstrEndLoop(NodeInstrumentation *instr)
if (!instr->running)
return;
- if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
- elog(ERROR, "InstrEndLoop called on running node");
+ /* Ensure InstrStopNode was called */
+ Assert(INSTR_TIME_IS_ZERO(instr->instr.starttime));
/* Accumulate per-cycle statistics into totals */
INSTR_TIME_ADD(instr->startup, instr->firsttuple);
@@ -237,22 +514,30 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
dst->nfiltered1 += add->nfiltered1;
dst->nfiltered2 += add->nfiltered2;
- if (dst->instr.need_bufusage)
- BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
-
- if (dst->instr.need_walusage)
- WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
+ if (dst->instr.need_stack)
+ InstrAccumStack(&dst->instr, &add->instr);
}
/* Trigger instrumentation handling */
TriggerInstrumentation *
-InstrAllocTrigger(int n, int instrument_options)
+InstrAllocTrigger(QueryInstrumentation *qinstr, int instrument_options, int n)
{
- TriggerInstrumentation *tginstr = palloc0_array(TriggerInstrumentation, n);
+ TriggerInstrumentation *tginstr;
int i;
+ /*
+ * Allocate in the query's dedicated instrumentation context so all
+ * instrumentation data is grouped together and cleaned up as a unit.
+ */
+ Assert(qinstr != NULL && qinstr->instr_cxt != NULL);
+ tginstr = MemoryContextAllocZero(qinstr->instr_cxt,
+ n * sizeof(TriggerInstrumentation));
+
for (i = 0; i < n; i++)
+ {
InstrInitOptions(&tginstr[i].instr, instrument_options);
+ InstrQueryRememberChild(qinstr, &tginstr[i].instr);
+ }
return tginstr;
}
@@ -266,38 +551,30 @@ InstrStartTrigger(TriggerInstrumentation *tginstr)
void
InstrStopTrigger(TriggerInstrumentation *tginstr, int64 firings)
{
+ /*
+ * This trigger may be called again, so we don't finalize instrumentation
+ * here. Accumulation to the parent happens at ExecutorFinish through
+ * ExecFinalizeTriggerInstrumentation.
+ */
InstrStop(&tginstr->instr);
tginstr->firings += firings;
}
-/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrAccumStack(Instrumentation *dst, Instrumentation *add)
{
- save_pgBufferUsage = pgBufferUsage;
- save_pgWalUsage = pgWalUsage;
-}
+ Assert(dst != NULL);
+ Assert(add != NULL);
-/* report usage after parallel executor shutdown */
-void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- memset(bufusage, 0, sizeof(BufferUsage));
- BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
- memset(walusage, 0, sizeof(WalUsage));
- WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
-}
+ if (!add->need_stack)
+ return;
-/* accumulate work done by workers in leader's stats */
-void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
- BufferUsageAdd(&pgBufferUsage, bufusage);
- WalUsageAdd(&pgWalUsage, walusage);
+ BufferUsageAdd(&dst->bufusage, &add->bufusage);
+ WalUsageAdd(&dst->walusage, &add->walusage);
}
/* dst += add */
-static void
+void
BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
{
dst->shared_blks_hit += add->shared_blks_hit;
@@ -318,39 +595,9 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
}
-/* dst += add - sub */
+/* dst += add */
void
-BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add,
- const BufferUsage *sub)
-{
- dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
- dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
- dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
- dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
- dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
- dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
- dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
- dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
- dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
- dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
- add->shared_blk_read_time, sub->shared_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
- add->shared_blk_write_time, sub->shared_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
- add->local_blk_read_time, sub->local_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
- add->local_blk_write_time, sub->local_blk_write_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
- add->temp_blk_read_time, sub->temp_blk_read_time);
- INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
- add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
-/* helper functions for WAL usage accumulation */
-static void
-WalUsageAdd(WalUsage *dst, WalUsage *add)
+WalUsageAdd(WalUsage *dst, const WalUsage *add)
{
dst->wal_bytes += add->wal_bytes;
dst->wal_records += add->wal_records;
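Reader's note on the instrument.c changes above: the push/pop/accumulate flow can be modelled in a few lines of standalone C. This is only a minimal sketch under simplifying assumptions, not the patch's implementation. Every name prefixed with sketch_ or Sketch is hypothetical; the stack is a fixed-size array rather than the growable one managed by InstrStackGrow; and two plain counters stand in for BufferUsage and WalUsage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the patch's Instrumentation struct. */
typedef struct SketchInstr
{
	long		blks_hit;		/* stand-in for BufferUsage counters */
	long		wal_bytes;		/* stand-in for WalUsage counters */
} SketchInstr;

#define SKETCH_MAX_DEPTH 16		/* assumption: fixed depth, no growth */

static SketchInstr sketch_top;	/* plays the role of instr_top */
static SketchInstr *sketch_entries[SKETCH_MAX_DEPTH];
static int	sketch_size = 0;
static SketchInstr *sketch_current = &sketch_top;

/* Make 'instr' the entry that receives all subsequent activity. */
static void
sketch_push(SketchInstr *instr)
{
	assert(sketch_size < SKETCH_MAX_DEPTH);
	sketch_entries[sketch_size++] = instr;
	sketch_current = instr;
}

/* Restore the previous entry; 'instr' must be the top of the stack. */
static void
sketch_pop(SketchInstr *instr)
{
	assert(sketch_size > 0 && sketch_entries[sketch_size - 1] == instr);
	sketch_size--;
	sketch_current = sketch_size > 0
		? sketch_entries[sketch_size - 1]
		: &sketch_top;
}

/* dst += add, done once at finalization instead of on every node exit */
static void
sketch_accum(SketchInstr *dst, const SketchInstr *add)
{
	dst->blks_hit += add->blks_hit;
	dst->wal_bytes += add->wal_bytes;
}

/* Activity is charged to whichever entry is current; no diffing needed. */
static void
sketch_count_hit(void)
{
	sketch_current->blks_hit++;
}

/* One parent with one child; returns the parent's total after finalizing. */
static long
sketch_demo(void)
{
	SketchInstr parent = {0, 0};
	SketchInstr child = {0, 0};

	sketch_push(&parent);
	sketch_count_hit();			/* charged to parent */
	sketch_push(&child);
	sketch_count_hit();			/* charged to child */
	sketch_count_hit();			/* charged to child */
	sketch_pop(&child);
	sketch_accum(&parent, &child);	/* child totals roll up to parent */
	sketch_pop(&parent);
	sketch_accum(sketch_current, &parent);	/* parent rolls up to the top */

	return parent.blks_hit;
}
```

In this model the hot path (sketch_count_hit) is a single increment through a pointer, and the field-by-field addition happens only when an entry is finalized, which seems to be the trade-off the patch is after.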
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index b38170f0fbe..a829ddf5acb 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -904,7 +904,7 @@ create_edata_for_relation(LogicalRepRelMapEntry *rel)
* Use Relation opened by logicalrep_rel_open() instead of opening it
* again.
*/
- InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0);
+ InitResultRelInfo(resultRelInfo, rel->localrel, 1, NULL, 0, NULL);
/*
* We put the ResultRelInfo in the es_opened_result_relations list, even
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3e1c39160db..cf4f4246ca2 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1266,9 +1266,9 @@ PinBufferForBlock(Relation rel,
if (rel)
{
/*
- * While pgBufferUsage's "read" counter isn't bumped unless we reach
- * WaitReadBuffers() (so, not for hits, and not for buffers that are
- * zeroed instead), the per-relation stats always count them.
+ * While the current buffer usage "read" counter isn't bumped unless
+ * we reach WaitReadBuffers() (so, not for hits, and not for buffers
+ * that are zeroed instead), the per-relation stats always count them.
*/
pgstat_count_buffer_read(rel);
}
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index e3829d7fe7c..e7fc7f071d8 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -114,9 +114,9 @@ pgstat_prepare_io_time(bool track_io_guc)
* pg_stat_database only counts block read and write times, these are done for
* IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
*
- * pgBufferUsage is used for EXPLAIN. pgBufferUsage has write and read stats
- * for shared, local and temporary blocks. pg_stat_io does not track the
- * activity of temporary blocks, so these are ignored here.
+ * Executor instrumentation is used for EXPLAIN. Buffer usage tracked there has
+ * write and read stats for shared, local and temporary blocks. pg_stat_io
+ * does not track the activity of temporary blocks, so these are ignored here.
*/
void
pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 491c4886506..78961ae058b 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -233,6 +233,7 @@ ExecGetJunkAttribute(TupleTableSlot *slot, AttrNumber attno, bool *isNull)
/*
* prototypes from functions in execMain.c
*/
+typedef struct QueryInstrumentation QueryInstrumentation;
extern void ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void standard_ExecutorStart(QueryDesc *queryDesc, int eflags);
extern void ExecutorRun(QueryDesc *queryDesc,
@@ -254,7 +255,8 @@ extern void InitResultRelInfo(ResultRelInfo *resultRelInfo,
Relation resultRelationDesc,
Index resultRelationIndex,
ResultRelInfo *partition_root_rri,
- int instrument_options);
+ int instrument_options,
+ QueryInstrumentation *qinstr);
extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid,
ResultRelInfo *rootRelInfo);
extern List *ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo);
@@ -301,6 +303,8 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
extern Node *MultiExecProcNode(PlanState *node);
extern void ExecEndNode(PlanState *node);
extern void ExecShutdownNode(PlanState *node);
+extern void ExecFinalizeNodeInstrumentation(PlanState *node);
+extern void ExecFinalizeWorkerInstrumentation(PlanState *node);
extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 4430c222493..6ee4ce2b521 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -13,6 +13,7 @@
#ifndef INSTRUMENT_H
#define INSTRUMENT_H
+#include "lib/ilist.h"
#include "portability/instr_time.h"
@@ -68,29 +69,92 @@ typedef enum InstrumentOption
} InstrumentOption;
/*
- * General purpose instrumentation that can capture time and WAL/buffer usage
+ * Instrumentation base class for capturing time and WAL/buffer usage
*
- * Initialized through InstrAlloc, followed by one or more calls to a pair of
- * InstrStart/InstrStop (activity is measured in between).
+ * If used directly:
+ * - Allocate on the stack and zero initialize the struct
+ * - Call InstrInitOptions to set instrumentation options
+ * - Call InstrStart before the activity you want to measure
+ * - Call InstrStop / InstrStopFinalize after the activity to capture totals
+ *
+ * InstrStart/InstrStop may be called multiple times. The last stop call must
+ * be to InstrStopFinalize to ensure parent stack entries get the accumulated
+ * totals. If there is risk of transaction aborts you must call
+ * InstrStopFinalize in a PG_TRY/PG_FINALLY block to avoid corrupting the
+ * instrumentation stack.
+ *
+ * In a query context use QueryInstrumentation instead, which handles aborts
+ * using the resource owner logic.
*/
typedef struct Instrumentation
{
/* Parameters set at creation: */
bool need_timer; /* true if we need timer data */
- bool need_bufusage; /* true if we need buffer usage data */
- bool need_walusage; /* true if we need WAL usage data */
+ bool need_stack; /* true if we need WAL/buffer usage data */
/* Internal state keeping: */
+ bool on_stack; /* true if currently on instr_stack */
instr_time starttime; /* start time of last InstrStart */
- BufferUsage bufusage_start; /* buffer usage at start */
- WalUsage walusage_start; /* WAL usage at start */
/* Accumulated statistics: */
instr_time total; /* total runtime */
BufferUsage bufusage; /* total buffer usage */
WalUsage walusage; /* total WAL usage */
+ /* Abort handling: link in parent QueryInstrumentation's unfinalized list */
+ dlist_node unfinalized_entry;
} Instrumentation;
+/*
+ * Query-related instrumentation tracking.
+ *
+ * Usage:
+ * - Allocate on the heap using InstrQueryAlloc (required for abort handling)
+ * - Call InstrQueryStart before the activity you want to measure
+ * - Call InstrQueryStop / InstrQueryStopFinalize afterwards to capture totals
+ *
+ * InstrQueryStart/InstrQueryStop may be called multiple times. The last stop
+ * call must be to InstrQueryStopFinalize to ensure parent stack entries get
+ * the accumulated totals.
+ *
+ * Aborts are handled through the resource owner mechanism. The caller *must*
+ * therefore not exit the top-level transaction after calling InstrQueryStart
+ * without first calling InstrQueryStop or InstrQueryStopFinalize. On
+ * transaction abort, logic equivalent to InstrQueryStopFinalize runs
+ * automatically.
+ */
+struct ResourceOwnerData;
+typedef struct QueryInstrumentation
+{
+ Instrumentation instr;
+
+ /* Original instrument_options flags used to create this instrumentation */
+ int instrument_options;
+
+ /* Resource owner used for abort cleanup between InstrQueryStart/InstrQueryStop */
+ struct ResourceOwnerData *owner;
+
+ /*
+ * Dedicated memory context for all instrumentation allocations belonging
+ * to this query (node instrumentation, trigger instrumentation, etc.).
+ * Initially a child of TopMemoryContext so it survives transaction abort
+ * for ResourceOwner cleanup; InstrQueryStopFinalize then reparents it under
+ * the current memory context.
+ */
+ MemoryContext instr_cxt;
+
+ /*
+ * Child entries that need to be cleaned up on abort, since they are not
+ * registered as a resource owner themselves. Contains both node and
+ * trigger instrumentation entries linked via instr.unfinalized_entry.
+ */
+ dlist_head unfinalized_entries;
+} QueryInstrumentation;
+
/*
* Specialized instrumentation for per-node execution statistics
+ *
+ * Relies on an outer QueryInstrumentation having been set up to handle the
+ * stack used for WAL/buffer usage statistics, and relies on it for managing
+ * aborts. Solely intended for the executor and anyone reporting about its
+ * activities (e.g. EXPLAIN ANALYZE).
*/
typedef struct NodeInstrumentation
{
@@ -111,6 +175,10 @@ typedef struct NodeInstrumentation
double nfiltered2; /* # of tuples removed by "other" quals */
} NodeInstrumentation;
+/*
+ * Care must be taken with any pointers contained within this struct, as this
+ * gets copied across processes during parallel query execution.
+ */
typedef struct WorkerNodeInstrumentation
{
int num_workers; /* # of structures that follow */
@@ -124,15 +192,105 @@ typedef struct TriggerInstrumentation
* was fired */
} TriggerInstrumentation;
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
+/*
+ * Dynamic array-based stack for tracking current WAL/buffer usage context.
+ *
+ * When the stack is empty, 'current' points to instr_top which accumulates
+ * session-level totals.
+ */
+typedef struct InstrStackState
+{
+ int stack_space; /* allocated capacity of entries array */
+ int stack_size; /* current number of entries */
+
+ Instrumentation **entries; /* dynamic array of pointers */
+ Instrumentation *current; /* top of stack, or &instr_top when empty */
+} InstrStackState;
+
extern PGDLLIMPORT WalUsage pgWalUsage;
-extern Instrumentation *InstrAlloc(int instrument_options);
+/*
+ * The top instrumentation represents a running total of the current backend
+ * WAL/buffer usage information. This will not be updated immediately, but
+ * rather when the current stack entry gets accumulated which typically happens
+ * at query end.
+ *
+ * Care must be taken when utilizing this in the parallel worker context:
+ * Parallel workers will report back their instrumentation to the caller,
+ * and this gets added to the caller's stack. If this were to be used in the
+ * shared memory stats infrastructure it would need to be skipped on parallel
+ * workers to avoid double counting.
+ */
+extern PGDLLIMPORT Instrumentation instr_top;
+
+/*
+ * The instrumentation stack state. The 'current' field points to the
+ * currently active stack entry that is getting updated as activity happens,
+ * and will be accumulated to parent stacks when it gets finalized by
+ * InstrStopFinalize (non-executor use), ExecFinalizeNodeInstrumentation
+ * (executor finish) or ResOwnerReleaseInstrumentation on abort.
+ */
+extern PGDLLIMPORT InstrStackState instr_stack;
+
+extern void InstrStackGrow(void);
+
+/*
+ * Pushes the stack so that all WAL/buffer usage updates go to the passed in
+ * instrumentation entry.
+ *
+ * See note on InstrPopStack regarding safe use of these functions.
+ */
+static inline void
+InstrPushStack(Instrumentation *instr)
+{
+ if (unlikely(instr_stack.stack_size == instr_stack.stack_space))
+ InstrStackGrow();
+
+ instr_stack.entries[instr_stack.stack_size++] = instr;
+ instr_stack.current = instr;
+ instr->on_stack = true;
+}
+
+/*
+ * Pops the stack entry back to the previous one that was effective at
+ * InstrPushStack.
+ *
+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, consider instead using InstrStart + InstrStopFinalize which can skip
+ * intermediate stack entries.
+ */
+static inline void
+InstrPopStack(Instrumentation *instr)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.entries[instr_stack.stack_size - 1] == instr);
+ instr_stack.stack_size--;
+ instr_stack.current = instr_stack.stack_size > 0
+ ? instr_stack.entries[instr_stack.stack_size - 1]
+ : &instr_top;
+ instr->on_stack = false;
+}
+
extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
extern void InstrStart(Instrumentation *instr);
extern void InstrStop(Instrumentation *instr);
+extern void InstrStopFinalize(Instrumentation *instr);
+extern void InstrFinalizeChild(Instrumentation *instr, Instrumentation *parent);
+extern void InstrAccumStack(Instrumentation *dst, Instrumentation *add);
-extern NodeInstrumentation *InstrAllocNode(int instrument_options,
+extern QueryInstrumentation *InstrQueryAlloc(int instrument_options);
+extern void InstrQueryStart(QueryInstrumentation *instr);
+extern void InstrQueryStop(QueryInstrumentation *instr);
+extern void InstrQueryStopFinalize(QueryInstrumentation *instr);
+extern void InstrQueryRememberChild(QueryInstrumentation *parent, Instrumentation *instr);
+
+pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+
+extern NodeInstrumentation *InstrAllocNode(QueryInstrumentation *qinstr,
+ int instrument_options,
bool async_mode);
extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options,
bool async_mode);
@@ -146,35 +304,36 @@ typedef struct TupleTableSlot TupleTableSlot;
typedef struct PlanState PlanState;
extern TupleTableSlot *ExecProcNodeInstr(PlanState *node);
-extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
+extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr,
+ int instrument_options, int n);
extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int64 firings);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
- const BufferUsage *add, const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+extern void WalUsageAdd(WalUsage *dst, const WalUsage *add);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
#define INSTR_BUFUSAGE_INCR(fld) do { \
- pgBufferUsage.fld++; \
+ instr_stack.current->bufusage.fld++; \
} while(0)
#define INSTR_BUFUSAGE_ADD(fld,val) do { \
- pgBufferUsage.fld += (val); \
+ instr_stack.current->bufusage.fld += (val); \
} while(0)
#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
- INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+ INSTR_TIME_ADD(instr_stack.current->bufusage.fld, val); \
} while (0)
#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
- INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+ INSTR_TIME_ACCUM_DIFF(instr_stack.current->bufusage.fld, endval, startval); \
} while (0)
+
#define INSTR_WALUSAGE_INCR(fld) do { \
pgWalUsage.fld++; \
+ instr_stack.current->walusage.fld++; \
} while(0)
#define INSTR_WALUSAGE_ADD(fld,val) do { \
pgWalUsage.fld += (val); \
+ instr_stack.current->walusage.fld += (val); \
} while(0)
#endif /* INSTRUMENT_H */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 3ecae7552fc..491c4e272d8 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -54,6 +54,7 @@ typedef struct Instrumentation Instrumentation;
typedef struct pairingheap pairingheap;
typedef struct PlanState PlanState;
typedef struct QueryEnvironment QueryEnvironment;
+typedef struct QueryInstrumentation QueryInstrumentation;
typedef struct RelationData *Relation;
typedef Relation *RelationPtr;
typedef struct ScanKeyData ScanKeyData;
@@ -754,6 +755,7 @@ typedef struct EState
int es_top_eflags; /* eflags passed to ExecutorStart */
int es_instrument; /* OR of InstrumentOption flags */
+ QueryInstrumentation *es_query_instr; /* query-level instrumentation */
bool es_finished; /* true when ExecutorFinish is done */
List *es_exprcontexts; /* List of ExprContexts within EState */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index eb6033b4fdb..5463bc921f0 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -75,6 +75,7 @@ typedef uint32 ResourceReleasePriority;
#define RELEASE_PRIO_SNAPSHOT_REFS 500
#define RELEASE_PRIO_FILES 600
#define RELEASE_PRIO_WAITEVENTSETS 700
+#define RELEASE_PRIO_INSTRUMENTATION 800
/* 0 is considered invalid */
#define RELEASE_PRIO_FIRST 1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a998bb5e882..32b866611c9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1361,6 +1361,7 @@ InjectionPointSharedState
InjectionPointsCtl
InlineCodeBlock
InsertStmt
+InstrStackState
Instrumentation
Int128AggState
Int8TransTypeData
@@ -2484,6 +2485,7 @@ QueryCompletion
QueryDesc
QueryEnvironment
QueryInfo
+QueryInstrumentation
QueryItem
QueryItemType
QueryMode
--
2.47.1
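For readers following along, the push/pop scheme from the instrument.h hunk above can be modeled outside of PostgreSQL. This is only a minimal sketch under stated assumptions. All toy_* names and the single "hits" counter are hypothetical stand-ins, and the capacity is fixed rather than grown via InstrStackGrow. Folding a finished entry into its parent is my reading of the header comments on instr_top and InstrAccumStack, not a copy of the patch's finalize logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for Instrumentation: one counter only. */
typedef struct ToyInstr
{
    long        hits;           /* stand-in for bufusage.shared_blks_hit */
} ToyInstr;

#define TOY_STACK_MAX 16        /* assumption: fixed capacity, no growth */

static ToyInstr toy_top;        /* running total, in the role of instr_top */
static ToyInstr *toy_entries[TOY_STACK_MAX];
static int  toy_size = 0;
static ToyInstr *toy_current = &toy_top;    /* role of instr_stack.current */

/* Route all subsequent counter updates to 'instr'. */
static void
toy_push(ToyInstr *instr)
{
    assert(toy_size < TOY_STACK_MAX);
    toy_entries[toy_size++] = instr;
    toy_current = instr;
}

/* Restore the previous entry; entries must be popped in LIFO order. */
static void
toy_pop(ToyInstr *instr)
{
    assert(toy_size > 0 && toy_entries[toy_size - 1] == instr);
    toy_size--;
    toy_current = toy_size > 0 ? toy_entries[toy_size - 1] : &toy_top;
}

/* A counter increment only touches the current entry, no diffing needed. */
static void
toy_count_hit(void)
{
    toy_current->hits++;
}

/* Fold a finished entry into its parent. */
static void
toy_accum(ToyInstr *dst, const ToyInstr *add)
{
    dst->hits += add->hits;
}

/* Outer node does 1 hit itself and calls an inner node that does 2 hits. */
static void
toy_demo(ToyInstr *outer, ToyInstr *inner)
{
    toy_push(outer);
    toy_count_hit();

    toy_push(inner);
    toy_count_hit();
    toy_count_hit();
    toy_pop(inner);

    toy_pop(outer);

    /* Finalize: child into parent, then parent into the running total. */
    toy_accum(outer, inner);
    toy_accum(&toy_top, outer);
}
```

Calling toy_demo() with two zeroed entries leaves the inner entry with its own activity only, while the outer entry and the running total include the inner node's activity once the accumulation steps have run.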
[application/x-patch] v16-0008-instrumentation-Optimize-ExecProcNodeInstr-instr.patch (9.0K, 9-v16-0008-instrumentation-Optimize-ExecProcNodeInstr-instr.patch)
From acf427d2935c3964d0916db06d3690b57b47dda4 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 5 Apr 2026 19:30:56 -0700
Subject: [PATCH v16 08/10] instrumentation: Optimize ExecProcNodeInstr
instructions by inlining
For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.
Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). These
conditionals are checked for each tuple emitted, so they add up and cause
the compiler to generate a non-optimal set of instructions.
Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. For a stress
test with a large COUNT(*) that does many ExecProcNode calls, this reduces
the overhead of EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) from ~20% on
top of actual runtime to ~3%. With BUFFERS ON, the same query goes from
~20% to ~10% on top of actual runtime.
Author: Lukas Fittl <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxFP7i7-wy98ZmEJ11edYq-RrPvJoa4kzGhBBjERA4Nyw%40mail.gmail.com#e8dfd018a07d7f8d41565a079d40c564
fix up execprocnode 2
---
src/backend/executor/execProcnode.c | 2 +-
src/backend/executor/instrument.c | 164 +++++++++++++++++++++-------
src/include/executor/instrument.h | 3 +-
3 files changed, 126 insertions(+), 43 deletions(-)
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 5ca8d91344b..357175ac2cb 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -465,7 +465,7 @@ ExecProcNodeFirst(PlanState *node)
* have ExecProcNode() directly call the relevant function from now on.
*/
if (node->instrument)
- node->ExecProcNode = ExecProcNodeInstr;
+ node->ExecProcNode = InstrNodeSetupExecProcNode(node->instrument);
else
node->ExecProcNode = node->ExecProcNodeReal;
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 94d57e3bc40..0a42e800ce9 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -70,19 +70,25 @@ InstrInitOptions(Instrumentation *instr, int instrument_options)
instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
}
-inline void
-InstrStart(Instrumentation *instr)
+static inline void
+InstrStartTimer(Instrumentation *instr)
{
- if (instr->need_timer)
- {
- if (!INSTR_TIME_IS_ZERO(instr->starttime))
- elog(ERROR, "InstrStart called twice in a row");
- else
- INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
- }
+ Assert(INSTR_TIME_IS_ZERO(instr->starttime));
- if (instr->need_stack)
- InstrPushStack(instr);
+ INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
+}
+
+static inline void
+InstrStopTimer(Instrumentation *instr, instr_time *accum_time)
+{
+ instr_time endtime;
+
+ Assert(!INSTR_TIME_IS_ZERO(instr->starttime));
+
+ INSTR_TIME_SET_CURRENT_FAST(endtime);
+ INSTR_TIME_ACCUM_DIFF(*accum_time, endtime, instr->starttime);
+
+ INSTR_TIME_SET_ZERO(instr->starttime);
}
/*
@@ -92,18 +98,13 @@ InstrStart(Instrumentation *instr)
static inline void
InstrStopCommon(Instrumentation *instr, instr_time *accum_time)
{
- instr_time endtime;
-
/* update the time only if the timer was requested */
if (instr->need_timer)
{
if (INSTR_TIME_IS_ZERO(instr->starttime))
elog(ERROR, "InstrStop called without start");
- INSTR_TIME_SET_CURRENT_FAST(endtime);
- INSTR_TIME_ACCUM_DIFF(*accum_time, endtime, instr->starttime);
-
- INSTR_TIME_SET_ZERO(instr->starttime);
+ InstrStopTimer(instr, accum_time);
}
/* pop the stack, unless InstrStopFinalize previously cleaned up */
@@ -111,6 +112,16 @@ InstrStopCommon(Instrumentation *instr, instr_time *accum_time)
InstrPopStack(instr);
}
+void
+InstrStart(Instrumentation *instr)
+{
+ if (instr->need_timer)
+ InstrStartTimer(instr);
+
+ if (instr->need_stack)
+ InstrPushStack(instr);
+}
+
void
InstrStop(Instrumentation *instr)
{
@@ -402,16 +413,15 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options, bool async_mod
instr->async_mode = async_mode;
}
-/* Entry to a plan node */
-inline void
+/* Entry to a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
+void
InstrStartNode(NodeInstrumentation *instr)
{
InstrStart(&instr->instr);
}
-
-/* Exit from a plan node */
-inline void
+/* Exit from a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
+void
InstrStopNode(NodeInstrumentation *instr, double nTuples)
{
double save_tuplecount = instr->tuplecount;
@@ -445,25 +455,6 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
}
}
-/*
- * ExecProcNode wrapper that performs instrumentation calls. By keeping
- * this a separate function, we avoid overhead in the normal case where
- * no instrumentation is wanted.
- */
-TupleTableSlot *
-ExecProcNodeInstr(PlanState *node)
-{
- TupleTableSlot *result;
-
- InstrStartNode(node->instrument);
-
- result = node->ExecProcNodeReal(node);
-
- InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
- return result;
-}
-
/* Update tuple count */
void
InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
@@ -518,6 +509,97 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
InstrAccumStack(&dst->instr, &add->instr);
}
+/*
+ * Specialized handling of instrumented ExecProcNode
+ *
+ * These functions are equivalent to running ExecProcNodeReal wrapped in
+ * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
+ * by checking the instrumentation options when the ExecProcNode pointer gets
+ * first set, and then using a special-purpose function for each. This results
+ * in a more optimized set of compiled instructions.
+ */
+
+/* Simplified pop: restore saved state instead of re-deriving from array */
+static inline void
+InstrPopStackTo(Instrumentation *prev)
+{
+ Assert(instr_stack.stack_size > 0);
+ Assert(instr_stack.stack_size > 1 ? instr_stack.entries[instr_stack.stack_size - 2] == prev : &instr_top == prev);
+ instr_stack.entries[instr_stack.stack_size - 1]->on_stack = false;
+ instr_stack.stack_size--;
+ instr_stack.current = prev;
+}
+
+static pg_attribute_always_inline TupleTableSlot *
+ExecProcNodeInstr(PlanState *node, bool need_timer, bool need_stack)
+{
+ NodeInstrumentation *instr = node->instrument;
+ Instrumentation *prev = instr_stack.current;
+ TupleTableSlot *result;
+
+ if (need_stack)
+ InstrPushStack(&instr->instr);
+ if (need_timer)
+ InstrStartTimer(&instr->instr);
+
+ result = node->ExecProcNodeReal(node);
+
+ if (need_timer)
+ InstrStopTimer(&instr->instr, &instr->counter);
+ if (need_stack)
+ InstrPopStackTo(prev);
+
+ instr->running = true;
+ if (!TupIsNull(result))
+ instr->tuplecount += 1.0;
+
+ return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrFull(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsStackOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, true);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsTimerOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, true, false);
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsOnly(PlanState *node)
+{
+ return ExecProcNodeInstr(node, false, false);
+}
+
+/*
+ * Returns an ExecProcNode wrapper that performs instrumentation calls,
+ * tailored to the instrumentation options enabled for the node.
+ */
+ExecProcNodeMtd
+InstrNodeSetupExecProcNode(NodeInstrumentation *instr)
+{
+ bool need_timer = instr->instr.need_timer;
+ bool need_stack = instr->instr.need_stack;
+
+ if (need_timer && need_stack)
+ return ExecProcNodeInstrFull;
+ else if (need_stack)
+ return ExecProcNodeInstrRowsStackOnly;
+ else if (need_timer)
+ return ExecProcNodeInstrRowsTimerOnly;
+ else
+ return ExecProcNodeInstrRowsOnly;
+}
+
/* Trigger instrumentation handling */
TriggerInstrumentation *
InstrAllocTrigger(QueryInstrumentation *qinstr, int instrument_options, int n)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 72df21334ff..bd481afd0de 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -302,7 +302,8 @@ extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
typedef struct TupleTableSlot TupleTableSlot;
typedef struct PlanState PlanState;
-extern TupleTableSlot *ExecProcNodeInstr(PlanState *node);
+typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+extern ExecProcNodeMtd InstrNodeSetupExecProcNode(NodeInstrumentation *instr);
extern TriggerInstrumentation *InstrAllocTrigger(QueryInstrumentation *qinstr,
int instrument_options, int n);
--
2.47.1
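The specialization technique used in the patch above (one always-inline body taking constant flags, thin wrappers per flag combination, and a setup function that picks the wrapper once) can be tried in isolation. This is a rough sketch, not the patch's code. ToyNode, the toy_* functions and the TOY_ALWAYS_INLINE macro are hypothetical names, and the counters merely stand in for the timer and stack work. Whether the compiler actually removes the branches depends on the compiler and optimization level.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: GCC/Clang attribute; falls back to plain inline elsewhere. */
#if defined(__GNUC__) || defined(__clang__)
#define TOY_ALWAYS_INLINE inline __attribute__((always_inline))
#else
#define TOY_ALWAYS_INLINE inline
#endif

typedef struct ToyNode
{
    bool        need_timer;     /* options fixed when the node is set up */
    bool        need_stack;
    int         timer_calls;    /* stand-in for timer start/stop work */
    int         stack_calls;    /* stand-in for stack push/pop work */
    int         rows;
} ToyNode;

typedef int (*ToyExecFn) (ToyNode *node);

/*
 * Generic body. Each wrapper below passes compile-time constants, so after
 * inlining the compiler can drop the branches that cannot be taken.
 */
static TOY_ALWAYS_INLINE int
toy_exec(ToyNode *node, bool need_timer, bool need_stack)
{
    if (need_stack)
        node->stack_calls++;
    if (need_timer)
        node->timer_calls++;

    node->rows++;
    return node->rows;
}

static int
toy_exec_full(ToyNode *node)
{
    return toy_exec(node, true, true);
}

static int
toy_exec_stack_only(ToyNode *node)
{
    return toy_exec(node, false, true);
}

static int
toy_exec_timer_only(ToyNode *node)
{
    return toy_exec(node, true, false);
}

static int
toy_exec_rows_only(ToyNode *node)
{
    return toy_exec(node, false, false);
}

/* Check the options once and return the matching specialized wrapper. */
static ToyExecFn
toy_setup(const ToyNode *node)
{
    if (node->need_timer && node->need_stack)
        return toy_exec_full;
    else if (node->need_stack)
        return toy_exec_stack_only;
    else if (node->need_timer)
        return toy_exec_timer_only;
    else
        return toy_exec_rows_only;
}
```

The caller stores the pointer returned by toy_setup() and invokes it per tuple, so the option checks happen once per node rather than once per call. The cost is one small function per combination, which stays manageable with two flags but would grow quickly with more.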
[application/x-patch] v16-0010-Add-test_session_buffer_usage-test-module.patch (30.0K, 10-v16-0010-Add-test_session_buffer_usage-test-module.patch)
From 0745fa15bf8d3843de22606edcf885f6fdbf3f44 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:41 -0800
Subject: [PATCH v16 10/10] Add test_session_buffer_usage test module
This module is intended for testing instrumentation-related logic as it
pertains to the top-level stack that is maintained as a running total.
There is currently no in-core user that utilizes the top-level values in
this manner. The tests help ensure, especially in abort situations, that we
don't lose activity due to incorrect handling of unfinalized node stacks.
---
src/test/modules/Makefile | 1 +
src/test/modules/meson.build | 1 +
.../test_session_buffer_usage/Makefile | 23 ++
.../expected/test_session_buffer_usage.out | 342 ++++++++++++++++++
.../test_session_buffer_usage/meson.build | 33 ++
.../sql/test_session_buffer_usage.sql | 245 +++++++++++++
.../test_session_buffer_usage--1.0.sql | 31 ++
.../test_session_buffer_usage.c | 95 +++++
.../test_session_buffer_usage.control | 5 +
9 files changed, 776 insertions(+)
create mode 100644 src/test/modules/test_session_buffer_usage/Makefile
create mode 100644 src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
create mode 100644 src/test/modules/test_session_buffer_usage/meson.build
create mode 100644 src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
create mode 100644 src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 0a74ab5c86f..c45081a39f9 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -49,6 +49,7 @@ SUBDIRS = \
test_resowner \
test_rls_hooks \
test_saslprep \
+ test_session_buffer_usage \
test_shmem \
test_shm_mq \
test_slru \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 4bca42bb370..45db77b621a 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -50,6 +50,7 @@ subdir('test_regex')
subdir('test_resowner')
subdir('test_rls_hooks')
subdir('test_saslprep')
+subdir('test_session_buffer_usage')
subdir('test_shmem')
subdir('test_shm_mq')
subdir('test_slru')
diff --git a/src/test/modules/test_session_buffer_usage/Makefile b/src/test/modules/test_session_buffer_usage/Makefile
new file mode 100644
index 00000000000..1252b222cb9
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_session_buffer_usage/Makefile
+
+MODULE_big = test_session_buffer_usage
+OBJS = \
+ $(WIN32RES) \
+ test_session_buffer_usage.o
+
+EXTENSION = test_session_buffer_usage
+DATA = test_session_buffer_usage--1.0.sql
+PGFILEDESC = "test_session_buffer_usage - show buffer usage statistics for the current session"
+
+REGRESS = test_session_buffer_usage
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_session_buffer_usage
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out b/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
new file mode 100644
index 00000000000..5f7d349871a
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/expected/test_session_buffer_usage.out
@@ -0,0 +1,342 @@
+LOAD 'test_session_buffer_usage';
+CREATE EXTENSION test_session_buffer_usage;
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM test_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+ ok
+----
+ t
+(1 row)
+
+-- Verify counters increase after buffer activity
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+ count
+-------
+ 1000
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM test_session_buffer_usage();
+ blocks_increased
+------------------
+ t
+(1 row)
+
+DROP TABLE test_buf_activity;
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM test_session_buffer_usage(), dc_serial_result s;
+ leader_buffers_match
+----------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_dc_tab, dc_serial_result;
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+ERROR: division by zero
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM test_session_buffer_usage();
+ exception_buffers_visible
+---------------------------
+ t
+(1 row)
+
+DROP TABLE exc_tab;
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+ERROR: duplicate key value violates unique constraint "unique_tab_a_key"
+DETAIL: Key (a)=(1) already exists.
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM test_session_buffer_usage();
+ unique_violation_buffers_visible
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE unique_tab;
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT subxact_exc_func();
+ subxact_exc_func
+------------------
+ caught
+(1 row)
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM test_session_buffer_usage();
+ subxact_buffers_visible
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT cursor_exc_func();
+ cursor_exc_func
+-----------------------
+ caught after 250 rows
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM test_session_buffer_usage();
+ cursor_subxact_buffers_visible
+--------------------------------
+ t
+(1 row)
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT count(*) FROM trig_work_tab;
+ count
+-------
+ 500
+(1 row)
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM test_session_buffer_usage();
+-- Now trigger the same scan via a trigger that errors
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+ERROR: trigger error
+CONTEXT: PL/pgSQL function trig_err_func() line 4 at RAISE
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM test_session_buffer_usage(), trig_serial_result s;
+ trigger_abort_buffers_propagated
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+ count
+-------
+ 5000
+(1 row)
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+SELECT test_session_buffer_usage_reset();
+ test_session_buffer_usage_reset
+---------------------------------
+
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR: invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CONTEXT: parallel worker
+RESET debug_parallel_query;
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM test_session_buffer_usage(), par_abort_serial_result s;
+ worker_abort_buffers_not_propagated
+-------------------------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_abort_tab, par_abort_serial_result;
+-- Cleanup
+DROP EXTENSION test_session_buffer_usage;
diff --git a/src/test/modules/test_session_buffer_usage/meson.build b/src/test/modules/test_session_buffer_usage/meson.build
new file mode 100644
index 00000000000..b96f67dc7fe
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+test_session_buffer_usage_sources = files(
+ 'test_session_buffer_usage.c',
+)
+
+if host_system == 'windows'
+ test_session_buffer_usage_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+ '--NAME', 'test_session_buffer_usage',
+ '--FILEDESC', 'test_session_buffer_usage - show buffer usage statistics for the current session',])
+endif
+
+test_session_buffer_usage = shared_module('test_session_buffer_usage',
+ test_session_buffer_usage_sources,
+ kwargs: pg_test_mod_args,
+)
+test_install_libs += test_session_buffer_usage
+
+test_install_data += files(
+ 'test_session_buffer_usage.control',
+ 'test_session_buffer_usage--1.0.sql',
+)
+
+tests += {
+ 'name': 'test_session_buffer_usage',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+ 'regress': {
+ 'sql': [
+ 'test_session_buffer_usage',
+ ],
+ },
+}
diff --git a/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql b/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
new file mode 100644
index 00000000000..daf2159c4a6
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/sql/test_session_buffer_usage.sql
@@ -0,0 +1,245 @@
+LOAD 'test_session_buffer_usage';
+CREATE EXTENSION test_session_buffer_usage;
+
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM test_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+ AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+ AND local_blks_hit >= 0 AND local_blks_read >= 0
+ AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+ AND temp_blks_read >= 0 AND temp_blks_written >= 0
+ AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+ AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+ AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+
+-- Verify counters increase after buffer activity
+SELECT test_session_buffer_usage_reset();
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM test_session_buffer_usage();
+
+DROP TABLE test_buf_activity;
+
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+SELECT count(*) FROM par_dc_tab;
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+ AS leader_buffers_match
+FROM test_session_buffer_usage(), dc_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_dc_tab, dc_serial_result;
+
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+
+SELECT test_session_buffer_usage_reset();
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+ SELECT a / 0 FROM ins;
+
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP TABLE exc_tab;
+
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+
+SELECT test_session_buffer_usage_reset();
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP TABLE unique_tab;
+
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+ BEGIN
+ EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+ SELECT a / 0 FROM ins';
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT test_session_buffer_usage_reset();
+SELECT subxact_exc_func();
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+ rec record;
+ cnt int := 0;
+BEGIN
+ BEGIN
+ FOR rec IN SELECT * FROM cursor_tab LOOP
+ cnt := cnt + 1;
+ IF cnt = 250 THEN
+ PERFORM 1 / 0;
+ END IF;
+ END LOOP;
+ EXCEPTION WHEN division_by_zero THEN
+ RETURN 'caught after ' || cnt || ' rows';
+ END;
+ RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT test_session_buffer_usage_reset();
+SELECT cursor_exc_func();
+
+SELECT local_blks_hit + local_blks_read > 0
+ AS cursor_subxact_buffers_visible
+FROM test_session_buffer_usage();
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+
+-- Trigger abort under EXPLAIN ANALYZE: verify that buffer activity from a
+-- trigger that throws an error is still properly propagated.
+CREATE TEMP TABLE trig_err_tab (a int);
+CREATE TEMP TABLE trig_work_tab (a int, b char(200));
+INSERT INTO trig_work_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+-- Warm local buffers so trig_work_tab reads become hits
+SELECT count(*) FROM trig_work_tab;
+
+CREATE FUNCTION trig_err_func() RETURNS trigger AS $$
+BEGIN
+ PERFORM count(*) FROM trig_work_tab;
+ RAISE EXCEPTION 'trigger error';
+ RETURN NEW;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE TRIGGER trig_err BEFORE INSERT ON trig_err_tab
+ FOR EACH ROW EXECUTE FUNCTION trig_err_func();
+
+-- Measure how many local buffer hits a scan of trig_work_tab produces
+SELECT test_session_buffer_usage_reset();
+SELECT count(*) FROM trig_work_tab;
+
+CREATE TEMP TABLE trig_serial_result AS
+SELECT local_blks_hit AS serial_hits FROM test_session_buffer_usage();
+
+-- Now trigger the same scan via a trigger that errors
+SELECT test_session_buffer_usage_reset();
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+ INSERT INTO trig_err_tab VALUES (1);
+
+-- The trigger scanned trig_work_tab but errored before InstrStopTrigger ran.
+-- InstrStopFinalize in the PG_CATCH ensures buffer data is still propagated.
+SELECT local_blks_hit >= s.serial_hits / 2
+ AS trigger_abort_buffers_propagated
+FROM test_session_buffer_usage(), trig_serial_result s;
+
+DROP TABLE trig_err_tab, trig_work_tab, trig_serial_result;
+DROP FUNCTION trig_err_func;
+
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT test_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM test_session_buffer_usage();
+
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+
+SELECT test_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+RESET debug_parallel_query;
+
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+ AS worker_abort_buffers_not_propagated
+FROM test_session_buffer_usage(), par_abort_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_abort_tab, par_abort_serial_result;
+
+-- Cleanup
+DROP EXTENSION test_session_buffer_usage;
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
new file mode 100644
index 00000000000..e9833be470a
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql
@@ -0,0 +1,31 @@
+/* src/test/modules/test_session_buffer_usage/test_session_buffer_usage--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_session_buffer_usage" to load this file. \quit
+
+CREATE FUNCTION test_session_buffer_usage(
+ OUT shared_blks_hit bigint,
+ OUT shared_blks_read bigint,
+ OUT shared_blks_dirtied bigint,
+ OUT shared_blks_written bigint,
+ OUT local_blks_hit bigint,
+ OUT local_blks_read bigint,
+ OUT local_blks_dirtied bigint,
+ OUT local_blks_written bigint,
+ OUT temp_blks_read bigint,
+ OUT temp_blks_written bigint,
+ OUT shared_blk_read_time double precision,
+ OUT shared_blk_write_time double precision,
+ OUT local_blk_read_time double precision,
+ OUT local_blk_write_time double precision,
+ OUT temp_blk_read_time double precision,
+ OUT temp_blk_write_time double precision
+)
+RETURNS record
+AS 'MODULE_PATHNAME', 'test_session_buffer_usage'
+LANGUAGE C PARALLEL RESTRICTED;
+
+CREATE FUNCTION test_session_buffer_usage_reset()
+RETURNS void
+AS 'MODULE_PATHNAME', 'test_session_buffer_usage_reset'
+LANGUAGE C PARALLEL RESTRICTED;
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
new file mode 100644
index 00000000000..50eb1a2ffe6
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * test_session_buffer_usage.c
+ * show buffer usage statistics for the current session
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ * src/test/modules/test_session_buffer_usage/test_session_buffer_usage.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/instrument.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+ .name = "test_session_buffer_usage",
+ .version = PG_VERSION
+);
+
+#define NUM_BUFFER_USAGE_COLUMNS 16
+
+PG_FUNCTION_INFO_V1(test_session_buffer_usage);
+PG_FUNCTION_INFO_V1(test_session_buffer_usage_reset);
+
+#define HAVE_INSTR_STACK 1 /* Change to 0 when testing before stack
+ * change */
+
+/*
+ * SQL function: test_session_buffer_usage()
+ *
+ * Returns a single row with all BufferUsage counters accumulated since the
+ * start of the session. Excludes any usage not yet added to the top of the
+ * stack (e.g. if this gets called inside a statement that also had buffer
+ * activity).
+ */
+Datum
+test_session_buffer_usage(PG_FUNCTION_ARGS)
+{
+ TupleDesc tupdesc;
+ Datum values[NUM_BUFFER_USAGE_COLUMNS];
+ bool nulls[NUM_BUFFER_USAGE_COLUMNS];
+ BufferUsage *usage;
+
+ if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+ elog(ERROR, "return type must be a row type");
+
+ memset(nulls, 0, sizeof(nulls));
+
+#if HAVE_INSTR_STACK
+ usage = &instr_top.bufusage;
+#else
+ usage = &pgBufferUsage;
+#endif
+
+ values[0] = Int64GetDatum(usage->shared_blks_hit);
+ values[1] = Int64GetDatum(usage->shared_blks_read);
+ values[2] = Int64GetDatum(usage->shared_blks_dirtied);
+ values[3] = Int64GetDatum(usage->shared_blks_written);
+ values[4] = Int64GetDatum(usage->local_blks_hit);
+ values[5] = Int64GetDatum(usage->local_blks_read);
+ values[6] = Int64GetDatum(usage->local_blks_dirtied);
+ values[7] = Int64GetDatum(usage->local_blks_written);
+ values[8] = Int64GetDatum(usage->temp_blks_read);
+ values[9] = Int64GetDatum(usage->temp_blks_written);
+ values[10] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time));
+ values[11] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time));
+ values[12] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time));
+ values[13] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time));
+ values[14] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+ values[15] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+
+ PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * SQL function: test_session_buffer_usage_reset()
+ *
+ * Resets all BufferUsage counters of the top-level instrumentation stack
+ * entry to zero. Useful in tests to avoid the baseline/delta pattern.
+ */
+Datum
+test_session_buffer_usage_reset(PG_FUNCTION_ARGS)
+{
+#if HAVE_INSTR_STACK
+ memset(&instr_top.bufusage, 0, sizeof(BufferUsage));
+#else
+ memset(&pgBufferUsage, 0, sizeof(BufferUsage));
+#endif
+
+ PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
new file mode 100644
index 00000000000..41cfb15a765
--- /dev/null
+++ b/src/test/modules/test_session_buffer_usage/test_session_buffer_usage.control
@@ -0,0 +1,5 @@
+# test_session_buffer_usage extension
+comment = 'show buffer usage statistics for the current session'
+default_version = '1.0'
+module_pathname = '$libdir/test_session_buffer_usage'
+relocatable = true
--
2.47.1
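For readers skimming the test module above: its reset function exists so tests can zero the session counters instead of taking a snapshot and subtracting. A minimal sketch of the two patterns follows, assuming a toy two-field counter struct. Every name here (ToyBufferUsage, toy_session_usage, toy_do_work, toy_measure_with_baseline, toy_measure_with_reset) is hypothetical and only illustrates the idea; none of it is part of the patch or of PostgreSQL.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for a session-wide counter block (not BufferUsage). */
typedef struct ToyBufferUsage
{
	long		local_blks_hit;
	long		local_blks_read;
} ToyBufferUsage;

static ToyBufferUsage toy_session_usage;

/* Pretend some statement ran and touched buffers. */
void
toy_do_work(long hits, long reads)
{
	assert(hits >= 0 && reads >= 0);
	toy_session_usage.local_blks_hit += hits;
	toy_session_usage.local_blks_read += reads;
}

/* Baseline/delta pattern: snapshot first, subtract afterwards. */
long
toy_measure_with_baseline(long hits, long reads)
{
	ToyBufferUsage baseline = toy_session_usage;

	toy_do_work(hits, reads);
	return (toy_session_usage.local_blks_hit - baseline.local_blks_hit) +
		(toy_session_usage.local_blks_read - baseline.local_blks_read);
}

/* Reset pattern: zero the counters, then read them directly. */
long
toy_measure_with_reset(long hits, long reads)
{
	memset(&toy_session_usage, 0, sizeof(toy_session_usage));
	toy_do_work(hits, reads);
	return toy_session_usage.local_blks_hit + toy_session_usage.local_blks_read;
}
```

Both functions report the same activity for the same work; the reset variant just keeps each check in the SQL test to a single query against the counters.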
[application/x-patch] v16-0009-Index-scans-Show-table-buffer-accesses-separatel.patch (23.2K, 11-v16-0009-Index-scans-Show-table-buffer-accesses-separatel.patch)
download | inline diff:
From a70c953bd672f1e931a191245fb29d046be6d824 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:19 -0800
Subject: [PATCH v16 09/10] Index scans: Show table buffer accesses separately
in EXPLAIN ANALYZE
This sets up a separate instrumentation stack entry that is active whilst an
Index Scan or Index Only Scan accesses the table, for example to fetch the
tuple or to check its visibility.
EXPLAIN ANALYZE will now show "Table Buffers" that represent such activity.
The activity is also included in regular "Buffers" together with index
activity and that of any child nodes.
Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by: Zsolt Parragi <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkxrmpECzVFpeeEEHDGe6u625s%2BYkmVv5-gw3L_NDSfbiA%40mail.gmail.com#cb583a08e8e096aa1f093bb178906173
Actually populate I(O)S table stack pre index prefetching merge
---
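To illustrate the accounting described in the commit message, here is a minimal, self-contained sketch of a node entry plus a separate table entry that is pushed only around the heap fetch and folded into the node afterwards. All names (ToyInstr, toy_push, toy_pop, toy_buffer_hit, toy_finalize_child, toy_index_scan) are hypothetical simplifications, and the one-index-hit-plus-one-heap-hit-per-tuple workload is an assumption; the real InstrPushStack / InstrPopStack / InstrFinalizeChild in the series may behave differently in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stack entry; not the patch's Instrumentation. */
typedef struct ToyInstr
{
	long		shared_blks_hit;	/* hits charged to this entry */
	struct ToyInstr *prev;		/* entry that was active before the push */
} ToyInstr;

static ToyInstr toy_top = {0, NULL};	/* session-level entry */
static ToyInstr *toy_current = &toy_top;	/* entry receiving increments */

static void
toy_push(ToyInstr *entry)
{
	entry->prev = toy_current;
	toy_current = entry;
}

static void
toy_pop(ToyInstr *entry)
{
	assert(toy_current == entry);	/* pushes and pops must be balanced */
	toy_current = entry->prev;
}

/* A buffer access is charged to whichever entry is currently active. */
static void
toy_buffer_hit(void)
{
	toy_current->shared_blks_hit++;
}

/* Fold a finished child entry into its parent. */
static void
toy_finalize_child(ToyInstr *child, ToyInstr *parent)
{
	parent->shared_blks_hit += child->shared_blks_hit;
}

/*
 * Simulate an index scan returning ntuples rows, with one index page hit and
 * one heap page hit per row.  Returns the node's total hits and stores the
 * table-only hits in *table_hits.
 */
long
toy_index_scan(int ntuples, long *table_hits)
{
	ToyInstr	node = {0, NULL};
	ToyInstr	table = {0, NULL};

	toy_push(&node);
	for (int i = 0; i < ntuples; i++)
	{
		toy_buffer_hit();		/* index access, charged to the node */

		toy_push(&table);
		toy_buffer_hit();		/* heap access, charged to the table entry */
		toy_pop(&table);
	}
	toy_pop(&node);

	/* The table entry is included in the node's total, as in "Buffers". */
	toy_finalize_child(&table, &node);

	*table_hits = table.shared_blks_hit;
	return node.shared_blks_hit;
}
```

In this toy model the table count plays the role of "Table Buffers", and the node total corresponds to "Buffers", which includes it.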
doc/src/sgml/perform.sgml | 13 ++-
doc/src/sgml/ref/explain.sgml | 1 +
src/backend/commands/explain.c | 47 ++++++--
src/backend/executor/execProcnode.c | 46 ++++++++
src/backend/executor/nodeBitmapIndexscan.c | 2 +-
src/backend/executor/nodeIndexonlyscan.c | 41 ++++++-
src/backend/executor/nodeIndexscan.c | 127 +++++++++++++++++----
src/include/executor/instrument_node.h | 6 +
8 files changed, 245 insertions(+), 38 deletions(-)
diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 604e8578a8d..d28f4f22535 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -734,6 +734,7 @@ WHERE t1.unique1 < 10 AND t1.unique2 = t2.unique2;
-> Index Scan using tenk2_unique2 on tenk2 t2 (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10)
Index Cond: (unique2 = t1.unique2)
Index Searches: 10
+ Table Buffers: shared hit=10
Buffers: shared hit=24 read=6
Planning:
Buffers: shared hit=15 dirtied=9
@@ -1005,7 +1006,8 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
Index Cond: (f1 @> '((0.5,2))'::polygon)
Rows Removed by Index Recheck: 1
Index Searches: 1
- Buffers: shared hit=1
+ Table Buffers: shared hit=1
+ Buffers: shared hit=2
Planning Time: 0.039 ms
Execution Time: 0.098 ms
</screen>
@@ -1014,7 +1016,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @> polygon '(0.5,2.0)';
then rejected by a recheck of the index condition. This happens because a
GiST index is <quote>lossy</quote> for polygon containment tests: it actually
returns the rows with polygons that overlap the target, and then we have
- to do the exact containment test on those rows.
+ to do the exact containment test on those rows. The <literal>Table Buffers</literal>
+ counts show how many buffer accesses were made to the table rather than to
+ the index. These counts are also included in the <literal>Buffers</literal> counts.
</para>
<para>
@@ -1203,13 +1207,14 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 < 100 AND unique2 > 9000
QUERY PLAN
-------------------------------------------------------------------&zwsp;------------------------------------------------------------
Limit (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2.00 loops=1)
- Buffers: shared hit=16
+ Buffers: shared hit=14
-> Index Scan using tenk1_unique2 on tenk1 (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2.00 loops=1)
Index Cond: (unique2 > 9000)
Filter: (unique1 < 100)
Rows Removed by Filter: 287
Index Searches: 1
- Buffers: shared hit=16
+ Table Buffers: shared hit=11
+ Buffers: shared hit=14
Planning Time: 0.077 ms
Execution Time: 0.086 ms
</screen>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5b8b521802e..71070736acb 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -509,6 +509,7 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
-> Index Scan using test_pkey on test (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99.00 loops=1)
Index Cond: ((id > 100) AND (id < 200))
Index Searches: 1
+ Table Buffers: shared hit=1
Buffers: shared hit=4
Planning Time: 0.244 ms
Execution Time: 0.073 ms
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e2b1d343cca..4308a7f1765 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -144,7 +144,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -611,7 +611,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
- show_buffer_usage(es, bufusage);
+ show_buffer_usage(es, bufusage, NULL);
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -1028,7 +1028,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
if (es->buffers && peek_buffer_usage(es, &metrics->instr.bufusage))
{
es->indent++;
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
es->indent--;
}
}
@@ -1042,7 +1042,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
BYTES_TO_KILOBYTES(metrics->bytesSent), es);
ExplainPropertyText("Format", format, es);
if (es->buffers)
- show_buffer_usage(es, &metrics->instr.bufusage);
+ show_buffer_usage(es, &metrics->instr.bufusage, NULL);
}
ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -1972,6 +1972,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_instrumentation_count("Rows Removed by Filter", 1,
planstate, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexScanState *) planstate)->iss_Instrument->table_instr.bufusage, "Table");
break;
case T_IndexOnlyScan:
show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -1989,6 +1992,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainPropertyFloat("Heap Fetches", NULL,
planstate->instrument->ntuples2, 0, es);
show_indexsearches_info(planstate, es);
+
+ if (es->buffers && planstate->instrument)
+ show_buffer_usage(es, &((IndexOnlyScanState *) planstate)->ioss_Instrument->table_instr.bufusage, "Table");
break;
case T_BitmapIndexScan:
show_scan_qual(((BitmapIndexScan *) plan)->indexqualorig,
@@ -2290,7 +2296,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
/* Show buffer/WAL usage */
if (es->buffers && planstate->instrument)
- show_buffer_usage(es, &planstate->instrument->instr.bufusage);
+ show_buffer_usage(es, &planstate->instrument->instr.bufusage, NULL);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
@@ -2309,7 +2315,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
ExplainOpenWorker(n, es);
if (es->buffers)
- show_buffer_usage(es, &instrument->instr.bufusage);
+ show_buffer_usage(es, &instrument->instr.bufusage, NULL);
if (es->wal)
show_wal_usage(es, &instrument->instr.walusage);
ExplainCloseWorker(n, es);
@@ -4109,7 +4115,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
* Show buffer usage details. This better be sync with peek_buffer_usage.
*/
static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
{
if (es->format == EXPLAIN_FORMAT_TEXT)
{
@@ -4134,6 +4140,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared || has_local || has_temp)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "Buffers:");
if (has_shared)
@@ -4189,6 +4197,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
if (has_shared_timing || has_local_timing || has_temp_timing)
{
ExplainIndentText(es);
+ if (title)
+ appendStringInfo(es->str, "%s ", title);
appendStringInfoString(es->str, "I/O Timings:");
if (has_shared_timing)
@@ -4230,6 +4240,14 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
else
{
+ char *buffers_title = NULL;
+
+ if (title)
+ {
+ buffers_title = psprintf("%s Buffers", title);
+ ExplainOpenGroup(buffers_title, buffers_title, true, es);
+ }
+
ExplainPropertyInteger("Shared Hit Blocks", NULL,
usage->shared_blks_hit, es);
ExplainPropertyInteger("Shared Read Blocks", NULL,
@@ -4250,8 +4268,20 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
usage->temp_blks_read, es);
ExplainPropertyInteger("Temp Written Blocks", NULL,
usage->temp_blks_written, es);
+
+ if (buffers_title)
+ ExplainCloseGroup(buffers_title, buffers_title, true, es);
+
if (track_io_timing)
{
+ char *timings_title = NULL;
+
+ if (title)
+ {
+ timings_title = psprintf("%s I/O Timings", title);
+ ExplainOpenGroup(timings_title, timings_title, true, es);
+ }
+
ExplainPropertyFloat("Shared I/O Read Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
3, es);
@@ -4270,6 +4300,9 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
ExplainPropertyFloat("Temp I/O Write Time", "ms",
INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
3, es);
+
+ if (timings_title)
+ ExplainCloseGroup(timings_title, timings_title, true, es);
}
}
}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 357175ac2cb..57a78325f47 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -847,6 +847,20 @@ ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
&node->instrument->instr);
+ /* IndexScan/IndexOnlyScan have a separate entry to track table access */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ InstrFinalizeChild(&iss->iss_Instrument->table_instr, &node->instrument->instr);
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ InstrFinalizeChild(&ioss->ioss_Instrument->table_instr, &node->instrument->instr);
+ }
+
InstrFinalizeChild(&node->instrument->instr, parent);
return false;
@@ -892,6 +906,38 @@ ExecFinalizeWorkerInstrumentation_walker(PlanState *node, void *context)
num_workers = node->worker_instrument->num_workers;
+ /*
+ * Fold per-worker IndexScan/IndexOnlyScan table buffer stats into the
+ * per-worker node stats, matching what ExecFinalizeNodeInstrumentation
+ * does for the leader.
+ */
+ if (IsA(node, IndexScanState))
+ {
+ IndexScanState *iss = castNode(IndexScanState, node);
+
+ if (iss->iss_SharedInfo)
+ {
+ int nworkers = Min(num_workers, iss->iss_SharedInfo->num_workers);
+
+ for (int n = 0; n < nworkers; n++)
+ InstrAccumStack(&node->worker_instrument->instrument[n].instr,
+ &iss->iss_SharedInfo->winstrument[n].table_instr);
+ }
+ }
+ else if (IsA(node, IndexOnlyScanState))
+ {
+ IndexOnlyScanState *ioss = castNode(IndexOnlyScanState, node);
+
+ if (ioss->ioss_SharedInfo)
+ {
+ int nworkers = Min(num_workers, ioss->ioss_SharedInfo->num_workers);
+
+ for (int n = 0; n < nworkers; n++)
+ InstrAccumStack(&node->worker_instrument->instrument[n].instr,
+ &ioss->ioss_SharedInfo->winstrument[n].table_instr);
+ }
+ }
+
/* Accumulate this node's per-worker stats to parent's per-worker stats */
if (parent && parent->worker_instrument)
{
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 7978514e1bc..b59b4661597 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -276,7 +276,7 @@ ExecInitBitmapIndexScan(BitmapIndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of bitmap index scans if requested */
if (estate->es_instrument)
- indexstate->biss_Instrument = palloc0_object(IndexScanInstrumentation);
+ indexstate->biss_Instrument = MemoryContextAllocZero(estate->es_query_instr->instr_cxt, sizeof(IndexScanInstrumentation));
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
diff --git a/src/backend/executor/nodeIndexonlyscan.c b/src/backend/executor/nodeIndexonlyscan.c
index d52012e8a69..aac5e143e6e 100644
--- a/src/backend/executor/nodeIndexonlyscan.c
+++ b/src/backend/executor/nodeIndexonlyscan.c
@@ -67,6 +67,7 @@ IndexOnlyNext(IndexOnlyScanState *node)
IndexScanDesc scandesc;
TupleTableSlot *slot;
ItemPointer tid;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -83,6 +84,9 @@ IndexOnlyNext(IndexOnlyScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->ioss_Instrument && node->ioss_Instrument->table_instr.need_stack)
+ table_instr = &node->ioss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -165,11 +169,22 @@ IndexOnlyNext(IndexOnlyScanState *node)
ItemPointerGetBlockNumber(tid),
&node->ioss_VMBuffer))
{
+ bool found;
+
/*
* Rats, we have to visit the heap to check visibility.
*/
InstrCountTuples2(node, 1);
- if (!index_fetch_heap(scandesc, node->ioss_TableSlot))
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ found = index_fetch_heap(scandesc, node->ioss_TableSlot);
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (!found)
continue; /* no visible tuple, try next index entry */
ExecClearTuple(node->ioss_TableSlot);
@@ -436,6 +451,7 @@ ExecEndIndexOnlyScan(IndexOnlyScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->ioss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->ioss_Instrument->table_instr);
}
/*
@@ -610,7 +626,21 @@ ExecInitIndexOnlyScan(IndexOnlyScan *node, EState *estate, int eflags)
/* Set up instrumentation of index-only scans if requested */
if (estate->es_instrument)
- indexstate->ioss_Instrument = palloc0_object(IndexScanInstrumentation);
+ {
+ indexstate->ioss_Instrument = MemoryContextAllocZero(estate->es_query_instr->instr_cxt, sizeof(IndexScanInstrumentation));
+
+ /*
+ * Track table and index access separately. We intentionally don't
+ * collect timing (even if enabled), since we don't need it, and
+ * IndexOnlyNext calls InstrPushStack / InstrPopStack (instead of the
+ * full InstrNode*) to reduce overhead.
+ */
+ if ((estate->es_instrument & INSTRUMENT_BUFFERS) != 0)
+ {
+ InstrInitOptions(&indexstate->ioss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ InstrQueryRememberChild(estate->es_query_instr, &indexstate->ioss_Instrument->table_instr);
+ }
+ }
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -923,4 +953,11 @@ ExecIndexOnlyScanRetrieveInstrumentation(IndexOnlyScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->ioss_SharedInfo = palloc(size);
memcpy(node->ioss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->ioss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->ioss_Instrument->table_instr,
+ &node->ioss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index 39f6691ee35..7c953fa279c 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -85,7 +85,10 @@ IndexNext(IndexScanState *node)
ExprContext *econtext;
ScanDirection direction;
IndexScanDesc scandesc;
+ ItemPointer tid;
TupleTableSlot *slot;
+ bool found;
+ Instrumentation *table_instr = NULL;
/*
* extract necessary information from index scan node
@@ -102,6 +105,9 @@ IndexNext(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -132,8 +138,24 @@ IndexNext(IndexScanState *node)
/*
* ok, now that we have what we need, fetch the next tuple.
*/
- while (index_getnext_slot(scandesc, direction, slot))
+ while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
{
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
+ if (unlikely(!found))
+ continue;
+
CHECK_FOR_INTERRUPTS();
/*
@@ -181,6 +203,7 @@ IndexNextWithReorder(IndexScanState *node)
Datum *lastfetched_vals;
bool *lastfetched_nulls;
int cmp;
+ Instrumentation *table_instr = NULL;
estate = node->ss.ps.state;
@@ -200,6 +223,9 @@ IndexNextWithReorder(IndexScanState *node)
econtext = node->ss.ps.ps_ExprContext;
slot = node->ss.ss_ScanTupleSlot;
+ if (node->iss_Instrument && node->iss_Instrument->table_instr.need_stack)
+ table_instr = &node->iss_Instrument->table_instr;
+
if (scandesc == NULL)
{
/*
@@ -263,36 +289,67 @@ IndexNextWithReorder(IndexScanState *node)
}
/*
- * Fetch next tuple from the index.
+ * Fetch next valid tuple from the index.
*/
-next_indextuple:
- if (!index_getnext_slot(scandesc, ForwardScanDirection, slot))
+ for (;;)
{
+ ItemPointer tid;
+ bool found;
+
+ /* Time to fetch the next TID from the index */
+ tid = index_getnext_tid(scandesc, ForwardScanDirection);
+
+ /* If we're out of index entries, we're done */
+ if (tid == NULL)
+ {
+ /*
+ * No more tuples from the index. But we still need to drain
+ * any remaining tuples from the queue before we're done.
+ */
+ node->iss_ReachedEnd = true;
+ break;
+ }
+
+ Assert(ItemPointerEquals(tid, &scandesc->xs_heaptid));
+
+ if (table_instr)
+ InstrPushStack(table_instr);
+
+ for (;;)
+ {
+ found = index_fetch_heap(scandesc, slot);
+ if (found || !scandesc->xs_heap_continue)
+ break;
+ }
+
+ if (table_instr)
+ InstrPopStack(table_instr);
+
/*
- * No more tuples from the index. But we still need to drain any
- * remaining tuples from the queue before we're done.
+ * If the index was lossy, we have to recheck the index quals and
+ * ORDER BY expressions using the fetched tuple.
*/
- node->iss_ReachedEnd = true;
- continue;
- }
-
- /*
- * If the index was lossy, we have to recheck the index quals and
- * ORDER BY expressions using the fetched tuple.
- */
- if (scandesc->xs_recheck)
- {
- econtext->ecxt_scantuple = slot;
- if (!ExecQualAndReset(node->indexqualorig, econtext))
+ if (found && scandesc->xs_recheck)
{
- /* Fails recheck, so drop it and loop back for another */
- InstrCountFiltered2(node, 1);
- /* allow this loop to be cancellable */
- CHECK_FOR_INTERRUPTS();
- goto next_indextuple;
+ econtext->ecxt_scantuple = slot;
+ if (!ExecQualAndReset(node->indexqualorig, econtext))
+ {
+ /* Fails recheck, so drop it and loop back for another */
+ InstrCountFiltered2(node, 1);
+ /* allow this loop to be cancellable */
+ CHECK_FOR_INTERRUPTS();
+ continue;
+ }
}
+
+ if (found)
+ break;
}
+ /* No more index entries, re-run to clear the reorder queue */
+ if (node->iss_ReachedEnd)
+ continue;
+
if (scandesc->xs_recheckorderby)
{
econtext->ecxt_scantuple = slot;
@@ -818,6 +875,7 @@ ExecEndIndexScan(IndexScanState *node)
* which will have a new IndexOnlyScanState and zeroed stats.
*/
winstrument->nsearches += node->iss_Instrument->nsearches;
+ InstrAccumStack(&winstrument->table_instr, &node->iss_Instrument->table_instr);
}
/*
@@ -980,7 +1038,21 @@ ExecInitIndexScan(IndexScan *node, EState *estate, int eflags)
/* Set up instrumentation of index scans if requested */
if (estate->es_instrument)
- indexstate->iss_Instrument = palloc0_object(IndexScanInstrumentation);
+ {
+ indexstate->iss_Instrument = MemoryContextAllocZero(estate->es_query_instr->instr_cxt, sizeof(IndexScanInstrumentation));
+
+ /*
+ * Track table and index access separately. We intentionally don't
+ * collect timing (even if enabled), since we don't need it, and
+ * IndexNext / IndexNextWithReorder call InstrPushStack /
+ * InstrPopStack (instead of the full InstrNode*) to reduce overhead.
+ */
+ if ((estate->es_instrument & INSTRUMENT_BUFFERS) != 0)
+ {
+ InstrInitOptions(&indexstate->iss_Instrument->table_instr, INSTRUMENT_BUFFERS);
+ InstrQueryRememberChild(estate->es_query_instr, &indexstate->iss_Instrument->table_instr);
+ }
+ }
/* Open the index relation. */
lockmode = exec_rt_fetch(node->scan.scanrelid, estate)->rellockmode;
@@ -1858,4 +1930,11 @@ ExecIndexScanRetrieveInstrumentation(IndexScanState *node)
SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
node->iss_SharedInfo = palloc(size);
memcpy(node->iss_SharedInfo, SharedInfo, size);
+
+ /* Aggregate workers' table buffer/WAL usage into leader's entry */
+ for (int i = 0; i < node->iss_SharedInfo->num_workers; i++)
+ {
+ InstrAccumStack(&node->iss_Instrument->table_instr,
+ &node->iss_SharedInfo->winstrument[i].table_instr);
+ }
}
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index e6a3f9f1941..df089f9816a 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -18,6 +18,8 @@
#ifndef INSTRUMENT_NODE_H
#define INSTRUMENT_NODE_H
+#include "executor/instrument.h"
+
/*
* Offset added to plan_node_id to create a second TOC key for per-worker scan
* instrumentation. Instrumentation and parallel-awareness are independent, so
@@ -27,6 +29,7 @@
*/
#define PARALLEL_KEY_SCAN_INSTRUMENT_OFFSET UINT64CONST(0xD000000000000000)
+
/* ---------------------
* Instrumentation information for aggregate function execution
* ---------------------
@@ -56,6 +59,9 @@ typedef struct IndexScanInstrumentation
{
/* Index search count (incremented with pgstat_count_index_scan call) */
uint64 nsearches;
+
+ /* Instrumentation utilized for tracking buffer usage during table access */
+ Instrumentation table_instr;
} IndexScanInstrumentation;
/*
--
2.47.1
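
As a reading aid for the nodeIndexscan.c hunks above, here is a minimal, self-contained sketch of the push/pop idea: a buffer hit is charged to whichever entry is currently on top of the stack, so wrapping only the heap fetch in a push/pop pair separates table access from index access. All names (ToyInstr, toy_push, toy_pop, toy_count_buffer_hit, toy_accum) are hypothetical stand-ins. They are not the patch's InstrPushStack / InstrPopStack / InstrAccumStack, whose internals are not shown in this part of the thread, and the explicit fold of child into parent is an assumption of this toy model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for one instrumentation entry (not the patch's struct). */
typedef struct ToyInstr
{
    uint64_t    shared_blks_hit;    /* buffer hits charged to this entry */
    struct ToyInstr *prev;          /* entry that was active before the push */
} ToyInstr;

/* Currently active entry; plays the role described for pgInstrStack. */
static ToyInstr *toy_stack_top = NULL;

/* Make 'instr' the active entry, remembering the previous one. */
static void
toy_push(ToyInstr *instr)
{
    instr->prev = toy_stack_top;
    toy_stack_top = instr;
}

/* Restore the previous entry; pushes and pops must be properly nested. */
static void
toy_pop(ToyInstr *instr)
{
    assert(toy_stack_top == instr);
    toy_stack_top = instr->prev;
}

/* What a buffer hit would do: bump whichever entry is on top. */
static void
toy_count_buffer_hit(void)
{
    if (toy_stack_top != NULL)
        toy_stack_top->shared_blks_hit++;
}

/* Fold one entry's counters into another, e.g. when reporting totals. */
static void
toy_accum(ToyInstr *dst, const ToyInstr *src)
{
    dst->shared_blks_hit += src->shared_blks_hit;
}

/* Simulate one scan step: 2 index page hits, then 3 heap page hits. */
static void
toy_index_scan_step(ToyInstr *node, ToyInstr *table)
{
    toy_push(node);
    toy_count_buffer_hit();     /* index access, charged to node */
    toy_count_buffer_hit();

    toy_push(table);            /* only the heap fetch is wrapped */
    for (int i = 0; i < 3; i++)
        toy_count_buffer_hit(); /* table access, charged to table */
    toy_pop(table);

    toy_pop(node);
}
```

Calling toy_index_scan_step() with two zeroed entries leaves the index hits on the node entry and the heap hits on the table entry. toy_accum() can then fold the table entry into the node entry when a combined total is wanted. No diff of global counters is needed on exit, which is the cost the proposal aims to avoid.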
^ permalink raw reply [nested|flat] 42+ messages in thread
* Re: Stack-based tracking of per-node WAL/buffer usage
2025-08-31 23:57 Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-09-04 20:23 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-09-09 19:35 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 11:28 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2025-10-22 12:59 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2025-10-31 07:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-08 04:27 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-09 21:55 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-09 23:45 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-10 08:12 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-14 20:49 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-17 06:21 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-17 08:18 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-18 20:49 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-18 23:36 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-19 00:45 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-23 14:41 ` Re: Stack-based tracking of per-node WAL/buffer usage Heikki Linnakangas <[email protected]>
2026-03-23 19:07 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-03-23 20:03 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-24 06:03 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-25 10:47 ` Re: Stack-based tracking of per-node WAL/buffer usage Heikki Linnakangas <[email protected]>
2026-03-26 00:41 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-03-27 07:21 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-04 09:43 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-04 19:39 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2026-04-05 12:31 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-05 18:13 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2026-04-05 19:38 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-05 21:02 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2026-04-05 23:12 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
2026-04-06 09:58 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-06 22:46 ` Re: Stack-based tracking of per-node WAL/buffer usage Zsolt Parragi <[email protected]>
2026-04-07 00:39 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
2026-04-07 20:30 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
@ 2026-04-07 22:19 ` Andres Freund <[email protected]>
2026-04-07 22:27 ` Re: Stack-based tracking of per-node WAL/buffer usage Lukas Fittl <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Andres Freund @ 2026-04-07 22:19 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; PostgreSQL Hackers <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>; Zsolt Parragi <[email protected]>
Hi,
On 2026-04-07 13:30:11 -0700, Lukas Fittl wrote:
> 0001 is the change to make queryDesc->totaltime be allocated by
> ExecutorStart instead of plugins themselves, and adds a
> queryDesc->totaltime_options to have plugins request which level of
> summary instrumentation they need. This change is pretty simple, and
> could still make sense to get into 19. Because of the earlier
> Instrumentation refactoring that was pushed (thanks!) we're already
> asking extensions allocating queryDesc->totaltime to modify their use
> of InstrAlloc, so I think we might as well clean this up now.
Hm. That's a fair argument. They would indeed have to change again next
release.
It's not a complicated change and removes more lines than it adds.
I guess one thing I'm not sure about is whether the fields shouldn't be renamed at
the same time:
a) To prevent extensions from continuing to set it. Most of them do not test
against assert-enabled builds; with a different name they would get a
compiler error.
b) "totaltime" and "totaltime_options" are pretty poor descriptors of tracking
query level statistics. If everyone has to change anyway, this is a good
occasion.
'query_instr[_options]'?
Any opinions?
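
To make the 0001 idea concrete, here is a small sketch of the "hooks request options, core allocates" pattern under discussion. It is only an illustration: SketchQueryDesc, SketchInstr, the SKETCH_* flags and the function names are all hypothetical, and the real QueryDesc fields and INSTRUMENT_* constants may look different. Because hooks never assign the pointer themselves, renaming the pointer field would turn any leftover assignment in an extension into a compile error rather than a runtime surprise.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical flag values; the real INSTRUMENT_* flags are defined by PostgreSQL. */
#define SKETCH_TIMER    (1 << 0)
#define SKETCH_BUFFERS  (1 << 1)
#define SKETCH_WAL      (1 << 2)

/* Toy query-level instrumentation object. */
typedef struct SketchInstr
{
    int         options;        /* what this object was asked to track */
} SketchInstr;

/* Toy query descriptor with the two fields being discussed. */
typedef struct SketchQueryDesc
{
    int         instr_options;  /* OR'ed together by hooks */
    SketchInstr *instr;         /* allocated by core only */
} SketchQueryDesc;

/* An extension hook only states what it needs. */
static void
sketch_extension_hook(SketchQueryDesc *qd)
{
    qd->instr_options |= SKETCH_TIMER;
}

/* A second extension can add its own needs without clobbering the first. */
static void
sketch_other_extension_hook(SketchQueryDesc *qd)
{
    qd->instr_options |= SKETCH_BUFFERS;
}

/* Core allocates once, after all hooks have run; returns 0 on success. */
static int
sketch_executor_start(SketchQueryDesc *qd)
{
    assert(qd->instr == NULL);  /* hooks must not allocate it themselves */
    if (qd->instr_options == 0)
        return 0;               /* nobody asked, so nothing is tracked */

    qd->instr = calloc(1, sizeof(SketchInstr));
    if (qd->instr == NULL)
        return -1;
    qd->instr->options = qd->instr_options;
    return 0;
}
```

With both hooks installed, the allocated object tracks timing and buffers but not WAL, which is the "track less" effect mentioned for auto_explain further down.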
> 0002 is just ExecProcNodeInstr moved to instrument.c, as Andres had
> suggested previously. We still get some quick performance wins from
> doing that (see end of email), and again, its a simple change, so
> could be considered if someone has bandwidth remaining. I've added a
> later patch that then does the more complex inlining and gets us the
> full speed up.
Here it needs a few more inlines to get the full performance, otherwise it
doesn't inline all the helpers. I think on balance I didn't like the
prototype in instrument.h, that's too widely included, and it might even cause
some circularity issues. It seems better to do the somewhat ugly thing and
have the prototype be in executor.h.
> 0002 measurements (with current master and TSC clock source used for
> timing, best of three):
>
> CREATE TABLE lotsarows(key int not null);
> INSERT INTO lotsarows SELECT generate_series(1, 50000000);
> VACUUM FREEZE lotsarows;
With the somewhat more extreme benchmark I used in the rdtsc thread and the
added inline mentioned above I see a bit bigger wins. See the attached
explainbench.sql - it doesn't quite cover all the combinations, but I think it
gives a good enough overview.
c=1 pgbench -f ~/tmp/explainbench.sql -P5 -r -t 10
master:
statement latencies in milliseconds and failures:
200.800 0 SELECT pg_prewarm('pgbench_accounts');
0.098 0 PREPARE query AS SELECT * FROM pgbench_accounts OFFSET 5000000 LIMIT 1;
212.010 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING OFF)
268.648 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING ON)
232.421 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING OFF)
283.531 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING ON)
0.030 0 DEALLOCATE query;
0002:
statement latencies in milliseconds and failures:
201.558 0 SELECT pg_prewarm('pgbench_accounts');
0.103 0 PREPARE query AS SELECT * FROM pgbench_accounts OFFSET 5000000 LIMIT 1;
188.696 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING OFF)
244.479 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING ON)
223.773 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING OFF)
266.947 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING ON)
0.034 0 DEALLOCATE query;
That's something like 4-12%.
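
For anyone wanting to re-derive that range, the following sketch recomputes it from the two latency tables above. The struct and function names are made up for this example. Two common measures are shown, since it is not stated which one was used: relative latency reduction, (master - patched) / master, and speedup, master / patched - 1. The quoted range appears to match the speedup figure, which runs from roughly 4% to roughly 12% on these numbers.

```c
#include <assert.h>
#include <stdio.h>

/* One EXPLAIN variant with its pgbench latency on master and with 0002. */
typedef struct BenchRow
{
    const char *label;
    double      master_ms;
    double      patched_ms;
} BenchRow;

/* Latencies copied from the pgbench output quoted in this email. */
static const BenchRow bench_rows[] = {
    {"BUFFERS OFF, WAL OFF, TIMING OFF", 212.010, 188.696},
    {"BUFFERS OFF, WAL OFF, TIMING ON", 268.648, 244.479},
    {"BUFFERS ON, WAL ON, TIMING OFF", 232.421, 223.773},
    {"BUFFERS ON, WAL ON, TIMING ON", 283.531, 266.947},
};

#define BENCH_NROWS (sizeof(bench_rows) / sizeof(bench_rows[0]))

/* Percentage of master's latency that the patch removes. */
static double
latency_reduction_pct(double master_ms, double patched_ms)
{
    return (master_ms - patched_ms) / master_ms * 100.0;
}

/* Percentage by which the patched run is faster than master. */
static double
speedup_pct(double master_ms, double patched_ms)
{
    return (master_ms / patched_ms - 1.0) * 100.0;
}

/* Print both measures for every benchmarked EXPLAIN variant. */
static void
print_bench_summary(void)
{
    for (size_t i = 0; i < BENCH_NROWS; i++)
    {
        const BenchRow *row = &bench_rows[i];

        printf("%-34s reduction %5.1f%%  speedup %5.1f%%\n",
               row->label,
               latency_reduction_pct(row->master_ms, row->patched_ms),
               speedup_pct(row->master_ms, row->patched_ms));
    }
}
```

Calling print_bench_summary() from a small main() prints one line per variant. The smallest gain is for the BUFFERS ON, WAL ON, TIMING OFF case and the largest for the case with everything off.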
Pretty nice for a patch that just moves a few lines around and adds a few
inlines.
> At this point I'd say it's safe to say that we should push out later
> changes to PG20, because it needs another good look over, and I don't
> think Andres or Heikki have the capacity for that today (but I really
> appreciate all the effort put in by both of you!).
Indeed.
> @@ -334,6 +334,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
>
> if (auto_explain_enabled())
> {
> + /* We're always interested in runtime */
> + queryDesc->totaltime_options |= INSTRUMENT_TIMER;
> - queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
Not that it's going to make a significant difference, but it is nice that this
now would need to track less.
Kinda wonder about having
EXPLAIN (ANALYZE BUFFERS totals_only, WAL totals_only) ...;
In plenty of cases that'd be all one needs, at substantially lower cost.
Greetings,
Andres Freund
Attachments:
[application/sql] explainbench.sql (397B, 2-explainbench.sql)
^ permalink raw reply [nested|flat] 42+ messages in thread
* Re: Stack-based tracking of per-node WAL/buffer usage
@ 2026-04-07 22:27 ` Lukas Fittl <[email protected]>
2026-04-08 04:09 ` Re: Stack-based tracking of per-node WAL/buffer usage Andres Freund <[email protected]>
0 siblings, 1 reply; 42+ messages in thread
From: Lukas Fittl @ 2026-04-07 22:27 UTC (permalink / raw)
To: Andres Freund <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; PostgreSQL Hackers <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>; Zsolt Parragi <[email protected]>
On Tue, Apr 7, 2026 at 3:19 PM Andres Freund <[email protected]> wrote:
>
> Hi,
>
> On 2026-04-07 13:30:11 -0700, Lukas Fittl wrote:
> > 0001 is the change to make queryDesc->totaltime be allocated by
> > ExecutorStart instead of plugins themselves, and adds a
> > queryDesc->totaltime_options to have plugins request which level of
> > summary instrumentation they need. This change is pretty simple, and
> > could still make sense to get into 19. Because of the earlier
> > Instrumentation refactoring that was pushed (thanks!) we're already
> > asking extensions allocating queryDesc->totaltime to modify their use
> > of InstrAlloc, so I think we might as well clean this up now.
>
> Hm. That's a fair argument. They indeed would have to again change next
> release
>
> It's not a complicated change and removes more lines than it adds.
>
> I guess one thing I'm not sure is whether the fields shouldn't be renamed at
> the same time:
>
> a) To prevent extensions from continuing to set it, most of them do not test
> against assert enabled builds. With a different name they would get a
> compiler error.
>
> b) "totaltime" and "totaltime_options" are pretty poor descriptors of tracking
> query level statistics. If everyone has to change anyway, this is a good
> occasion.
>
> 'query_instr[_options]'?
>
>
> Any opinions?
I think renaming makes sense - both to make sure extensions reconsider
how they use it, and because "totaltime" is a bad name anyway, because
it's not just about timing (and hasn't been for many releases).
"query_instr[_options]" seems reasonable to me, although we could drop
the "query_" since it'd be "queryDesc->query_instr" vs
"queryDesc->instr".
>
> > 0002 is just ExecProcNodeInstr moved to instrument.c, as Andres had
> > suggested previously. We still get some quick performance wins from
> > doing that (see end of email), and again, its a simple change, so
> > could be considered if someone has bandwidth remaining. I've added a
> > later patch that then does the more complex inlining and gets us the
> > full speed up.
>
> Here it needs a few more inlines to get the full performance, otherwise it
> doesn't inline all the helpers. I think on balance I didn't like the
> prototype in instrument.h, that's too widely included, and it might even cause
> some circularity issues. It seems better to do the somewhat ugly thing and
> have the prototype be in executor.h.
Yeah, that makes sense.
>
> > 0002 measurements (with current master and TSC clock source used for
> > timing, best of three):
> >
> > CREATE TABLE lotsarows(key int not null);
> > INSERT INTO lotsarows SELECT generate_series(1, 50000000);
> > VACUUM FREEZE lotsarows;
>
> With the somewhat more extreme benchmark I used in the rdtsc thread and the
> added inline mentioned above I see a bit bigger wins. See the attached
> explainbench.sql - it doesn't quite cover all the combinations, but I think it
> gives a good enough overview.
>
> c=1 pgbench -f ~/tmp/explainbench.sql -P5 -r -t 10
>
> master:
> statement latencies in milliseconds and failures:
> 200.800 0 SELECT pg_prewarm('pgbench_accounts');
> 0.098 0 PREPARE query AS SELECT * FROM pgbench_accounts OFFSET 5000000 LIMIT 1;
> 212.010 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING OFF)
> 268.648 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING ON)
> 232.421 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING OFF)
> 283.531 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING ON)
> 0.030 0 DEALLOCATE query;
>
>
> 0002:
>
> statement latencies in milliseconds and failures:
> 201.558 0 SELECT pg_prewarm('pgbench_accounts');
> 0.103 0 PREPARE query AS SELECT * FROM pgbench_accounts OFFSET 5000000 LIMIT 1;
> 188.696 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING OFF)
> 244.479 0 EXPLAIN (ANALYZE, BUFFERS OFF, WAL OFF, TIMING ON)
> 223.773 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING OFF)
> 266.947 0 EXPLAIN (ANALYZE, BUFFERS ON, WAL ON, TIMING ON)
> 0.034 0 DEALLOCATE query;
>
> That's something like 4-12%.
>
> Pretty nice for a patch that just adds a few lines around and adds a few
> inlines.
Agreed.
> > @@ -334,6 +334,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
> >
> > if (auto_explain_enabled())
> > {
> > + /* We're always interested in runtime */
> > + queryDesc->totaltime_options |= INSTRUMENT_TIMER;
>
> > - queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
>
> Not that it's going to make a significant difference, but it is nice that this
> now would need to track less.
Yup.
>
> Kinda wonder about having
> EXPLAIN (ANALYZE BUFFERS totals_only, WAL totals_only) ...;
>
> in plenty cases that'd be all one needs, at substantially lower cost.
True. I don't like the name "totals_only", but I like the concept.
Today someone has to go to pg_stat_statements to get just the total
numbers, without running them for all nodes with EXPLAIN ANALYZE (and
incurring its overhead).
Thanks,
Lukas
--
Lukas Fittl
^ permalink raw reply [nested|flat] 42+ messages in thread
* Re: Stack-based tracking of per-node WAL/buffer usage
@ 2026-04-08 04:09 ` Andres Freund <[email protected]>
0 siblings, 0 replies; 42+ messages in thread
From: Andres Freund @ 2026-04-08 04:09 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; PostgreSQL Hackers <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>; Zsolt Parragi <[email protected]>
Hi,
On 2026-04-07 15:27:45 -0700, Lukas Fittl wrote:
> On Tue, Apr 7, 2026 at 3:19 PM Andres Freund <[email protected]> wrote:
> I think renaming makes sense - both to make sure extensions reconsider
> how they use it, and because "totaltime" is a bad name anyway, because
> its not just about timing (and hasn't been for many releases).
>
> "query_instr[_options]" seems reasonable to me, although we could drop
> the "query_" since it'd be "queryDesc->query_instr" vs
> "queryDesc->instr".
Done that way.
I earlier pushed 0002 too.
> > Kinda wonder about having
> > EXPLAIN (ANALYZE BUFFERS totals_only, WAL totals_only) ...;
> >
> > in plenty cases that'd be all one needs, at substantially lower cost.
>
> True. I don't like the name "totals_only", but I like the concept.
I spent all of three seconds coming up with it... :)
> Today someone has to go to pg_stat_statements to get just the total
> numbers, without running them for all nodes with EXPLAIN ANALYZE (and
> incurring its overhead).
Yep.
Greetings,
Andres Freund
^ permalink raw reply [nested|flat] 42+ messages in thread
* Re: Stack-based tracking of per-node WAL/buffer usage
@ 2026-04-05 18:22 ` Heikki Linnakangas <[email protected]>
2 siblings, 0 replies; 42+ messages in thread
From: Heikki Linnakangas @ 2026-04-05 18:22 UTC (permalink / raw)
To: Lukas Fittl <[email protected]>; Andres Freund <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Zsolt Parragi <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>
On 05/04/2026 15:31, Lukas Fittl wrote:
> Heikki, your further review is very welcome, if you have the time.
> It'd also be great if you could review the README.instrument (now in
> v13/0008) to see if that makes sense to you.
I don't have very substantial comments to make, and haven't had a chance
to review the latest patch, but I did read your replies. I think I
understand the stack vs. tree model now and why it is the way it is, but
I still find it pretty confusing and I don't know what to do about it.
- Heikki
^ permalink raw reply [nested|flat] 42+ messages in thread
* Re: Stack-based tracking of per-node WAL/buffer usage
@ 2026-04-06 09:26 ` Lukas Fittl <[email protected]>
2 siblings, 0 replies; 42+ messages in thread
From: Lukas Fittl @ 2026-04-06 09:26 UTC (permalink / raw)
To: Andres Freund <[email protected]>; Heikki Linnakangas <[email protected]>; +Cc: PostgreSQL Hackers <[email protected]>; Zsolt Parragi <[email protected]>; Tomas Vondra <[email protected]>; Peter Smith <[email protected]>
On Sun, Apr 5, 2026 at 5:31 AM Lukas Fittl <[email protected]> wrote:
>
> On Sat, Apr 4, 2026 at 12:39 PM Andres Freund <[email protected]> wrote:
> >
> > > @@ -247,9 +248,19 @@ standard_ExecutorStart(QueryDesc *queryDesc, int eflags)
> > > estate->es_snapshot = RegisterSnapshot(queryDesc->snapshot);
> > > estate->es_crosscheck_snapshot = RegisterSnapshot(queryDesc->crosscheck_snapshot);
> > > estate->es_top_eflags = eflags;
> > > - estate->es_instrument = queryDesc->instrument_options;
> > > estate->es_jit_flags = queryDesc->plannedstmt->jitFlags;
> > >
> > > + /*
> > > + * Set up query-level instrumentation if needed. We do this before
> > > + * InitPlan so that node and trigger instrumentation can be allocated
> > > + * within the query's dedicated instrumentation memory context.
> > > + */
> > > + if (!queryDesc->totaltime && queryDesc->instrument_options)
> > > + {
> > > + queryDesc->totaltime = InstrQueryAlloc(queryDesc->instrument_options);
> > > + estate->es_instrument = queryDesc->totaltime;
> > > + }
> > > +
> > > /*
> > > * Set up an AFTER-trigger statement context, unless told not to, or
> > > * unless it's EXPLAIN-only mode (when ExecutorFinish won't be called).
> >
> > It seems pretty weird to still have queryDesc->totaltime *sometimes* created
> > by pgss etc, but also create it in standard_ExecutorStart if not already
> > created. What if the explain options aren't compatible? Sure
> > pgss/auto_explain use ALL, but that's not a given.
>
> Yeah, I think in practice all use cases I've ever seen pass
> INSTRUMENT_ALL (and in fact it won't behave sane if this differs
> between extensions), but you're right there is no guarantee.
>
> Overall, there are two aspects to this:
>
> 1) Query instrumentation as the parent for node instrumentation,
> driven by use of EXPLAIN or auto_explain setting
> queryDesc->instrument_options
>
> 2) Instrumentation as a mechanism to measure the activity of a query,
> as used by pg_stat_statements or auto_explain (to get the runtime /
> aggregate buffer usage)
>
> I could see two solutions:
>
> A) Keep two separate QueryInstrumentations (EXPLAIN/auto_explain get
> es_instrument, any extensions measuring aggregate activity get
> query->totaltime)
>
> B) Have one internal QueryInstrumentation (that's responsible to be
> the abort "parent" to both node instrumentation, and query->totaltime)
>
> I was initially thinking we could maybe combine them creatively (i.e.
> expand on what we've done so far), but I'm not sure there is a
> reasonable design that isn't convoluted. We could also have a way for
> extensions to "request" a certain level of instrumentation (instead of
> directly allocating it), but it seems the current hooks are
> insufficient for that.
>
> I've gone with solution (A) for now, with es_instrument being
> allocated when per-node instrumentation is needed. Obviously that gets
> us two ResOwner cleanups instead of one when e.g. auto_explain is
> active, but I think that's still acceptable. It also shows how its
> easy to do an extra level of nesting with the stack-based
> instrumentation, without too much expense.
>
> With this in place, I do wonder if we should avoid the full memory
> context setup in InstrQueryAlloc (i.e. instead just make a direct
> allocation), unless we know that children are going to be attached.
> The downside of that would be that we can't just re-assign the
> instr_cxt in InstrQueryStopFinalize (we'd have to go back to the
> previous logic of doing a memcpy into the callers context, for the
> no-children case), but it might make a notable performance difference?
I've done a stress test of the logic I had added here in v13 (two
separate QueryInstrumentations to not mess with query->totaltime),
specifically "pgbench -n -j 32 -c 32 -f select1.sql -T 60 postgres"
with auto_explain enabled with all log_* settings (so it exercises both
query->totaltime and instrument_options), and unfortunately
that showed about a 1 to 2% impact.
So I don't think this was the right direction. I'll go back to what I
had before, but fix the specific issue you pointed out when
instrumentation options differ. Specifically, I'll add a preparatory
patch to stop extensions from allocating queryDesc->totaltime
themselves, and add queryDesc->totaltime_options that they use to
request which level of totaltime instrumentation they need.
If they request less than INSTRUMENT_ALL, they might still get more
instrumentation actually collected, when the query in question is an
EXPLAIN (ANALYZE). But since they don't have to read those fields from
query->totaltime, I think that's acceptable.
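To illustrate the request-based approach, here is a minimal sketch using mock types rather than the real PostgreSQL structs. Only the field name totaltime_options comes from the proposal above; MockQueryDesc, the MOCK_* flag values, request_totaltime() and effective_options() are hypothetical, and treating the collected set as the union of both option fields is an assumption about how this would end up being implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Mock flag values for illustration only; not the real instrument.h ones. */
enum
{
    MOCK_INSTRUMENT_TIMER   = 1 << 0,
    MOCK_INSTRUMENT_BUFFERS = 1 << 1,
    MOCK_INSTRUMENT_WAL     = 1 << 2
};

/* Hypothetical stand-in for QueryDesc, reduced to the two option fields. */
typedef struct MockQueryDesc
{
    int instrument_options; /* set by EXPLAIN / auto_explain (per-node) */
    int totaltime_options;  /* requested by extensions (query totals) */
} MockQueryDesc;

/* An extension requests a level instead of allocating totaltime itself. */
static void
request_totaltime(MockQueryDesc *qd, int options)
{
    assert(qd != NULL);
    qd->totaltime_options |= options;
}

/*
 * Assumption: what actually gets collected at the query level is the union
 * of everything requested, so a caller may receive more than it asked for.
 */
static int
effective_options(const MockQueryDesc *qd)
{
    assert(qd != NULL);
    return qd->totaltime_options | qd->instrument_options;
}
```

In this sketch an extension that only requests the timer would still see buffer counters filled in when the query is an EXPLAIN (ANALYZE, BUFFERS), which is the "might still get more" case mentioned above.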
>
> >
> >
> > > @@ -1284,8 +1325,8 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
> > > palloc0_array(FmgrInfo, n);
> > > resultRelInfo->ri_TrigWhenExprs = (ExprState **)
> > > palloc0_array(ExprState *, n);
> > > - if (instrument_options)
> > > - resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
> > > + if (qinstr)
> > > + resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(qinstr, n);
> >
> > Hm. Why do we not need to pass down the instrument_options anymore? I guess
> > the assumption is that we always are going to use the flags from qinstr?
> >
> > Is that right? Because right now pgss/auto_explain use _ALL, even when an
> > EXPLAIN ANALYZE doesn't.
> >
>
> With the solution mentioned earlier, where es_instrument is a separate
> allocation, this problem now goes away without any extra changes
> needed.
>
> Overall, I think its reasonable to make node/trigger instrumentation
> be attached to a query instrumentation that has the instrumentation
> options set that should be applied. That way we don't have think about
> edge cases like a query instrumentation that doesn't need a stack, but
> children that do.
Because we now again have a query->totaltime that may have more
instrumentation options enabled than the per-node/per-trigger
instrumentation needs, we need to explicitly pass the instrumentation
options that were requested.
To support that I'll bring back "es_instrument" with its prior
meaning, and instead add "es_query_instr" to pass down the query
instrumentation to the trigger instrumentation calls.
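As a rough sketch of that split, with mock types only: the field names es_instrument and es_query_instr are from the text above, while MockEState, MockQueryInstr, MockTrigInstr and init_trigger_instr() are hypothetical and do not reflect the actual signatures in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the query-level instrumentation. */
typedef struct MockQueryInstr
{
    int options;    /* may be wider than what EXPLAIN asked for */
} MockQueryInstr;

/* Hypothetical stand-in for EState, reduced to the two fields discussed. */
typedef struct MockEState
{
    int             es_instrument;  /* options requested for nodes/triggers */
    MockQueryInstr *es_query_instr; /* parent query instrumentation, or NULL */
} MockEState;

/* Hypothetical stand-in for per-trigger instrumentation. */
typedef struct MockTrigInstr
{
    MockQueryInstr *parent;
    int             options;
} MockTrigInstr;

/*
 * Set up trigger instrumentation with the explicitly requested options,
 * rather than inheriting whatever the parent happens to collect.
 */
static MockTrigInstr
init_trigger_instr(const MockEState *estate)
{
    MockTrigInstr ti;

    assert(estate != NULL);
    ti.parent = estate->es_query_instr;
    ti.options = estate->es_instrument;
    return ti;
}
```

The point of keeping both fields is that the parent pointer and the option flags can disagree, e.g. when an extension asked for more at the query level than EXPLAIN asked for per node.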
> >
> > > From 16e44d5508f91dd23da780901f3ec0126965628d Mon Sep 17 00:00:00 2001
> > > From: Lukas Fittl <[email protected]>
> > > Date: Sat, 7 Mar 2026 17:52:24 -0800
> > > Subject: [PATCH v12 7/9] instrumentation: Optimize ExecProcNodeInstr
> > > instructions by inlining
> > >
> > > For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
> > > ExecProcNodeInstr when starting/stopping instrumentation for that node.
> > >
> > > Previously each ExecProcNodeInstr would check which instrumentation
> > > options are active in the InstrStartNode/InstrStopNode calls, and do the
> > > corresponding work (timers, instrumentation stack, etc.). These
> > > conditionals being checked for each tuple being emitted add up, and cause
> > > non-optimal set of instructions to be generated by the compiler.
> > >
> > > Because we already have an existing mechanism to specify a function
> > > pointer when instrumentation is enabled, we can instead create specialized
> > > functions that are tailored to the instrumentation options enabled, and
> > > avoid conditionals on subsequent ExecProcNodeInstr calls. This results in
> > > the overhead for EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
> > > test with a large COUNT(*) that does many ExecProcNode calls from ~ 20% on
> > > top of actual runtime to ~ 3%. When using BUFFERS ON the same query goes
> > > from ~ 20% to ~ 10% on top of actual runtime.
> >
> > I assume this is to a significant degree due to to allowing for inlining. Have
> > you checked how much of the effort you get by just putting ExecProcNodeInstr()
> > into instrument.c?
>
> Worth a try - I haven't tested that yet - I'll come back to this
> separately and verify how much that buys us, vs spelling out the
> different variants.
I've run a test of just putting ExecProcNodeInstr into instrument.c
(and adding an inline keyword to the functions it calls), and it does
help over not doing it at all, but it's not the full benefit:
CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;
EXPLAIN (ANALYZE, ...) SELECT count(*) FROM lotsarows;
The measurements below are the best of three runs, for these three versions:
(1) with stack only
(2) with stack + move ExecProcNodeInstr with no changes (your idea)
(3) with stack + move + avoid branches (current patch set)
BUFFERS OFF, TIMING OFF:
  (1): 309ms
  (2): 292ms
  (3): 283ms
BUFFERS ON, TIMING OFF:
  (1): 322ms
  (2): 314ms
  (3): 294ms
BUFFERS ON, TIMING ON:
  (1): 829ms
  (2): 814ms
  (3): 803ms
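To put the timings above in relative terms, a small helper like the following can be used; pct_saved() is a hypothetical name, and the only inputs are the millisecond figures listed above.

```c
#include <assert.h>

/* Percentage of runtime saved by variant_ms relative to base_ms. */
static double
pct_saved(double base_ms, double variant_ms)
{
    assert(base_ms > 0.0);
    return (base_ms - variant_ms) / base_ms * 100.0;
}
```

Taking (1) as the baseline, this gives roughly 5.5% for (2) and 8.4% for (3) with BUFFERS OFF, TIMING OFF; roughly 2.5% and 8.7% with BUFFERS ON, TIMING OFF; and roughly 1.8% and 3.1% with BUFFERS ON, TIMING ON.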
I suspect the discrepancy for BUFFERS in particular is because the
commit has an optimized form of the stack popping (InstrPopStackTo),
but I have not taken a close look at the assembly differences here.
For now I'll keep this as-is, but that can be changed quickly.
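For readers following along, the difference between (2) and (3) can be sketched as follows. This is a simplified mock, not the patch's code: MockNode, MOCK_TIMER and all function names are hypothetical, and the counters merely stand in for the real timer and stack work. It only shows the general technique of choosing a specialized function once via a function pointer instead of branching on the options in every call.

```c
#include <assert.h>
#include <stddef.h>

typedef struct MockNode MockNode;
typedef int (*MockExecProc) (MockNode *node);

enum { MOCK_TIMER = 1 << 0 };

struct MockNode
{
    int          options;       /* instrumentation flags, fixed at init */
    long         calls;         /* stands in for tuple counting */
    long         timer_ticks;   /* stands in for timing work */
    MockExecProc exec;          /* chosen once, called per tuple */
};

/* Generic variant: re-checks the options on every call. */
static int
exec_generic(MockNode *node)
{
    if (node->options & MOCK_TIMER)
        node->timer_ticks++;
    node->calls++;
    return 1;
}

/* Specialized variants: no per-call branch on the options. */
static int
exec_no_timer(MockNode *node)
{
    node->calls++;
    return 1;
}

static int
exec_with_timer(MockNode *node)
{
    node->timer_ticks++;
    node->calls++;
    return 1;
}

/* Pick the specialized function once, when the node is set up. */
static void
mock_node_init(MockNode *node, int options)
{
    assert(node != NULL);
    node->options = options;
    node->calls = 0;
    node->timer_ticks = 0;
    node->exec = (options & MOCK_TIMER) ? exec_with_timer : exec_no_timer;
}
```

Both approaches produce the same counters; the specialized form just moves the decision out of the per-tuple path, at the cost of spelling out one function per combination of options.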
Thanks,
Lukas
--
Lukas Fittl