public inbox for [email protected]
help / color / mirror / Atom feedFrom: r314tive <[email protected]>
To: Michael Paquier <[email protected]>
Cc: [email protected]
Subject: [RFC PATCH v1] Add EXPLAIN ANALYZE wait event reporting
Date: Thu, 14 May 2026 16:19:45 +0500
Message-ID: <CALCfnuquuxtZmmzQBZ_yxaihfj7bnALXdzi9Nj=RYUW4iwY6GQ@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
<[email protected]>
Following Michael Paquier's note about PostgreSQL list practice, this v1
resends the same patch content as v0 as a single email with the patch series
attached.
No code changes from v0.
v0 thread:
https://www.postgresql.org/message-id/cover.1778280923.git.tanswis42%40gmail.com
This RFC prototype adds EXPLAIN (ANALYZE, WAITS), which reports completed
wait
intervals observed through pgstat_report_wait_start/end().
Statement-level wait events are counted once per active statement collector.
Plan-node wait events use inclusive attribution, similar to EXPLAIN ANALYZE
timing: a completed wait is attributed to every active plan node captured
when
the wait begins. Per-node wait times therefore must not be summed across
plan
nodes.
The patch series is split as follows:
1. Add statement-level EXPLAIN WAITS reporting.
2. Aggregate statement-level EXPLAIN WAITS from parallel workers.
3. Attribute EXPLAIN WAITS to plan nodes.
4. Refine attribution semantics, documentation, and regression coverage.
5. Harden accumulator handling with fixed-size, allocation-free storage.
6. Hide accumulator internals from public headers.
7. Keep psql EXPLAIN option completion current.
The main review questions are:
- whether the option should be named WAITS or WAIT_EVENTS;
- whether inclusive per-node attribution is the right initial semantics;
- whether the fixed accumulator limit and overflow reporting are acceptable;
- whether the disabled/enabled hot-path overhead is acceptable.
Regards,
Ilmar
Attachments:
[application/octet-stream] 0001-Add-EXPLAIN-WAITS-statement-reporting.patch (17.9K, 3-0001-Add-EXPLAIN-WAITS-statement-reporting.patch)
download | inline diff:
From 6fc0404acb69765067ccbf1ca3ee0fd39240f9d6 Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:44:10 +0500
Subject: [RFC PATCH v1 1/7] Add EXPLAIN WAITS statement reporting
---
doc/src/sgml/ref/explain.sgml | 14 +++
src/backend/commands/explain.c | 129 +++++++++++++++++++-
src/backend/commands/explain_state.c | 8 ++
src/backend/utils/activity/wait_event.c | 149 ++++++++++++++++++++++++
src/include/commands/explain_state.h | 1 +
src/include/utils/wait_event.h | 29 +++++
src/test/regress/expected/explain.out | 17 +++
src/test/regress/sql/explain.sql | 5 +
8 files changed, 349 insertions(+), 3 deletions(-)
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index e95e19081e1..5460568af1d 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -47,6 +47,7 @@ EXPLAIN [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] <rep
SUMMARY [ <replaceable class="parameter">boolean</replaceable> ]
MEMORY [ <replaceable class="parameter">boolean</replaceable> ]
IO [ <replaceable class="parameter">boolean</replaceable> ]
+ WAITS [ <replaceable class="parameter">boolean</replaceable> ]
FORMAT { TEXT | XML | JSON | YAML }
</synopsis>
</refsynopsisdiv>
@@ -310,6 +311,19 @@ ROLLBACK;
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><literal>WAITS</literal></term>
+ <listitem>
+ <para>
+ Include exact backend-local wait event timing observed during query
+ execution. Wait events are grouped by wait event type and wait event
+ name, and include the number of waits and total time spent waiting.
+ This parameter may only be used when <literal>ANALYZE</literal> is also
+ enabled. It defaults to <literal>FALSE</literal>.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><literal>FORMAT</literal></term>
<listitem>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 112c17b0d64..0b7cc5c15c6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -45,6 +45,7 @@
#include "utils/tuplesort.h"
#include "utils/tuplestore.h"
#include "utils/typcache.h"
+#include "utils/wait_event.h"
#include "utils/xml.h"
@@ -149,6 +150,9 @@ static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
+static int wait_event_usage_cmp(const void *a, const void *b);
+static void show_wait_event_usage(ExplainState *es,
+ const WaitEventUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
static void show_result_replacement_info(Result *result, ExplainState *es);
@@ -510,6 +514,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ WaitEventUsage waitEventUsage;
+ WaitEventUsage *waitEventUsagePtr = NULL;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -583,11 +589,27 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
else
dir = ForwardScanDirection;
+ if (es->waits)
+ {
+ waitEventUsagePtr = &waitEventUsage;
+ pgstat_begin_wait_event_usage(waitEventUsagePtr,
+ queryDesc->estate->es_query_cxt);
+ }
+
/* run the plan */
- ExecutorRun(queryDesc, dir, 0);
+ PG_TRY();
+ {
+ ExecutorRun(queryDesc, dir, 0);
- /* run cleanup too */
- ExecutorFinish(queryDesc);
+ /* run cleanup too */
+ ExecutorFinish(queryDesc);
+ }
+ PG_FINALLY();
+ {
+ if (waitEventUsagePtr)
+ pgstat_end_wait_event_usage(waitEventUsagePtr);
+ }
+ PG_END_TRY();
/* We can't run ExecutorEnd 'till we're done printing the stats... */
totaltime += elapsed_time(&starttime);
@@ -605,6 +627,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
+ if (waitEventUsagePtr)
+ show_wait_event_usage(es, waitEventUsagePtr);
+
/* Show buffer and/or memory usage in planning */
if (peek_buffer_usage(es, bufusage) || mem_counters)
{
@@ -4503,6 +4528,104 @@ show_wal_usage(ExplainState *es, const WalUsage *usage)
}
}
+static int
+wait_event_usage_cmp(const void *a, const void *b)
+{
+ const WaitEventUsageEntry *ea = (const WaitEventUsageEntry *) a;
+ const WaitEventUsageEntry *eb = (const WaitEventUsageEntry *) b;
+ int64 ta = INSTR_TIME_GET_MICROSEC(ea->time);
+ int64 tb = INSTR_TIME_GET_MICROSEC(eb->time);
+
+ if (ta < tb)
+ return 1;
+ if (ta > tb)
+ return -1;
+ if (ea->wait_event_info < eb->wait_event_info)
+ return -1;
+ if (ea->wait_event_info > eb->wait_event_info)
+ return 1;
+ return 0;
+}
+
+static void
+show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
+{
+ WaitEventUsageEntry *entries;
+
+ if (usage == NULL)
+ return;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT && usage->nentries == 0)
+ return;
+
+ if (usage->nentries > 0)
+ {
+ entries = palloc_array(WaitEventUsageEntry, usage->nentries);
+ memcpy(entries, usage->entries,
+ sizeof(WaitEventUsageEntry) * usage->nentries);
+ qsort(entries, usage->nentries, sizeof(WaitEventUsageEntry),
+ wait_event_usage_cmp);
+ }
+ else
+ entries = NULL;
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Wait Events:\n");
+ es->indent++;
+
+ for (int i = 0; i < usage->nentries; i++)
+ {
+ const char *event_type;
+ const char *event_name;
+
+ event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
+ event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+
+ ExplainIndentText(es);
+ appendStringInfo(es->str, "%s:%s calls=%" PRIu64 " time=%0.3f ms\n",
+ event_type ? event_type : "Unknown",
+ event_name ? event_name : "unknown",
+ entries[i].calls,
+ INSTR_TIME_GET_MILLISEC(entries[i].time));
+ }
+
+ es->indent--;
+ }
+ else
+ {
+ ExplainOpenGroup("Wait-Events", "Wait Events", false, es);
+
+ for (int i = 0; i < usage->nentries; i++)
+ {
+ const char *event_type;
+ const char *event_name;
+
+ event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
+ event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+
+ ExplainOpenGroup("Wait-Event", NULL, true, es);
+ ExplainPropertyText("Wait Event Type",
+ event_type ? event_type : "Unknown",
+ es);
+ ExplainPropertyText("Wait Event",
+ event_name ? event_name : "unknown",
+ es);
+ ExplainPropertyUInteger("Calls", NULL, entries[i].calls, es);
+ ExplainPropertyFloat("Time", "ms",
+ INSTR_TIME_GET_MILLISEC(entries[i].time),
+ 3, es);
+ ExplainCloseGroup("Wait-Event", NULL, true, es);
+ }
+
+ ExplainCloseGroup("Wait-Events", "Wait Events", false, es);
+ }
+
+ if (entries)
+ pfree(entries);
+}
+
/*
* Show memory usage details.
*/
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index a0ee0a664be..2abcb64cc95 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -164,6 +164,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
}
else if (strcmp(opt->defname, "io") == 0)
es->io = defGetBoolean(opt);
+ else if (strcmp(opt->defname, "waits") == 0)
+ es->waits = defGetBoolean(opt);
else if (!ApplyExtensionExplainOption(es, opt, pstate))
ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),
@@ -196,6 +198,12 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
errmsg("EXPLAIN option %s requires ANALYZE", "IO")));
+ /* check that WAITS is used with EXPLAIN ANALYZE */
+ if (es->waits && !es->analyze)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("EXPLAIN option %s requires ANALYZE", "WAITS")));
+
/* check that serialize is used with EXPLAIN ANALYZE */
if (es->serialize != EXPLAIN_SERIALIZE_NONE && !es->analyze)
ereport(ERROR,
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 95635c7f56c..60d37ccbb73 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -36,11 +36,20 @@ static const char *pgstat_get_wait_client(WaitEventClient w);
static const char *pgstat_get_wait_ipc(WaitEventIPC w);
static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
static const char *pgstat_get_wait_io(WaitEventIO w);
+static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
+ const instr_time *elapsed);
static uint32 local_my_wait_event_info;
uint32 *my_wait_event_info = &local_my_wait_event_info;
+#define WAIT_EVENT_USAGE_INITIAL_EVENTS 16
+
+int pgstat_wait_event_usage_depth = 0;
+static WaitEventUsage *pgstat_wait_event_usage = NULL;
+static uint32 pgstat_wait_event_usage_current = 0;
+static instr_time pgstat_wait_event_usage_start;
+
#define WAIT_EVENT_CLASS_MASK 0xFF000000
#define WAIT_EVENT_ID_MASK 0x0000FFFF
@@ -349,6 +358,146 @@ pgstat_reset_wait_event_storage(void)
my_wait_event_info = &local_my_wait_event_info;
}
+/*
+ * Start collecting exact wait event timings in this backend.
+ *
+ * This is intended for short-lived instrumentation such as EXPLAIN ANALYZE.
+ * It records waits observed through pgstat_report_wait_start/end in backend
+ * local memory. Nested collection is deliberately treated as part of the
+ * outer collection for now; callers that want independent nested accounting
+ * need a stack of WaitEventUsage contexts.
+ */
+void
+pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+{
+ Assert(usage != NULL);
+ Assert(memcontext != NULL);
+
+ if (pgstat_wait_event_usage_depth++ == 0)
+ {
+ memset(usage, 0, sizeof(WaitEventUsage));
+ usage->memcontext = memcontext;
+ pgstat_wait_event_usage = usage;
+ pgstat_wait_event_usage_current = 0;
+ INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
+ }
+}
+
+/*
+ * Stop collecting wait event timings.
+ */
+void
+pgstat_end_wait_event_usage(WaitEventUsage *usage)
+{
+ Assert(usage != NULL);
+ Assert(pgstat_wait_event_usage_depth > 0);
+
+ if (--pgstat_wait_event_usage_depth == 0)
+ {
+ if (pgstat_wait_event_usage_current != 0)
+ pgstat_count_wait_event_end();
+
+ pgstat_wait_event_usage = NULL;
+ pgstat_wait_event_usage_current = 0;
+ INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
+ }
+}
+
+/*
+ * Record the beginning of a wait event for exact EXPLAIN-style accounting.
+ */
+void
+pgstat_count_wait_event_start(uint32 wait_event_info)
+{
+ if (pgstat_wait_event_usage == NULL)
+ return;
+
+ /*
+ * Waits are not expected to nest. If they do, finish the previous wait
+ * at the boundary so accounting remains internally consistent.
+ */
+ if (pgstat_wait_event_usage_current != 0)
+ pgstat_count_wait_event_end();
+
+ pgstat_wait_event_usage_current = wait_event_info;
+ INSTR_TIME_SET_CURRENT(pgstat_wait_event_usage_start);
+}
+
+/*
+ * Record the end of the current wait event.
+ */
+void
+pgstat_count_wait_event_end(void)
+{
+ instr_time end;
+ instr_time elapsed;
+
+ if (pgstat_wait_event_usage == NULL ||
+ pgstat_wait_event_usage_current == 0)
+ return;
+
+ INSTR_TIME_SET_CURRENT(end);
+ elapsed = end;
+ INSTR_TIME_SUBTRACT(elapsed, pgstat_wait_event_usage_start);
+
+ WaitEventUsageAdd(pgstat_wait_event_usage,
+ pgstat_wait_event_usage_current,
+ &elapsed);
+
+ pgstat_wait_event_usage_current = 0;
+ INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
+}
+
+static void
+WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
+ const instr_time *elapsed)
+{
+ WaitEventUsageEntry *entry = NULL;
+
+ for (int i = 0; i < usage->nentries; i++)
+ {
+ if (usage->entries[i].wait_event_info == wait_event_info)
+ {
+ entry = &usage->entries[i];
+ break;
+ }
+ }
+
+ if (entry == NULL)
+ {
+ if (usage->nentries >= usage->maxentries)
+ {
+ MemoryContext oldcontext;
+ int newmaxentries;
+
+ if (usage->maxentries > 0)
+ newmaxentries = usage->maxentries * 2;
+ else
+ newmaxentries = WAIT_EVENT_USAGE_INITIAL_EVENTS;
+
+ oldcontext = MemoryContextSwitchTo(usage->memcontext);
+ if (usage->entries)
+ usage->entries = repalloc_array(usage->entries,
+ WaitEventUsageEntry,
+ newmaxentries);
+ else
+ usage->entries = palloc_array(WaitEventUsageEntry,
+ newmaxentries);
+ MemoryContextSwitchTo(oldcontext);
+
+ usage->maxentries = newmaxentries;
+ }
+
+ entry = &usage->entries[usage->nentries++];
+ entry->wait_event_info = wait_event_info;
+ entry->calls = 0;
+ INSTR_TIME_SET_ZERO(entry->time);
+ }
+
+ entry->calls++;
+ INSTR_TIME_ADD(entry->time, *elapsed);
+}
+
/* ----------
* pgstat_get_wait_event_type() -
*
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 97bc7ed49f6..4f0d0d495d4 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -56,6 +56,7 @@ typedef struct ExplainState
bool memory; /* print planner's memory usage information */
bool settings; /* print modified settings */
bool io; /* print info about IO (prefetch, ...) */
+ bool waits; /* print wait event usage */
bool generic; /* generate a generic plan */
ExplainSerializeOption serialize; /* serialize the query's output? */
ExplainFormat format; /* output format */
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 86ee348220d..f7fab5736bb 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -11,16 +11,39 @@
#define WAIT_EVENT_H
/* enums for wait events */
+#include "portability/instr_time.h"
+#include "utils/palloc.h"
#include "utils/wait_event_types.h"
+typedef struct WaitEventUsageEntry
+{
+ uint32 wait_event_info;
+ uint64 calls;
+ instr_time time;
+} WaitEventUsageEntry;
+
+typedef struct WaitEventUsage
+{
+ MemoryContext memcontext;
+ int nentries;
+ int maxentries;
+ WaitEventUsageEntry *entries;
+} WaitEventUsage;
+
extern const char *pgstat_get_wait_event(uint32 wait_event_info);
extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
static inline void pgstat_report_wait_start(uint32 wait_event_info);
static inline void pgstat_report_wait_end(void);
extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
extern void pgstat_reset_wait_event_storage(void);
+extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
+ MemoryContext memcontext);
+extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
+extern void pgstat_count_wait_event_start(uint32 wait_event_info);
+extern void pgstat_count_wait_event_end(void);
extern PGDLLIMPORT uint32 *my_wait_event_info;
+extern PGDLLIMPORT int pgstat_wait_event_usage_depth;
/*
@@ -66,6 +89,9 @@ extern char **GetWaitEventCustomNames(uint32 classId, int *nwaitevents);
static inline void
pgstat_report_wait_start(uint32 wait_event_info)
{
+ if (pgstat_wait_event_usage_depth > 0)
+ pgstat_count_wait_event_start(wait_event_info);
+
/*
* Since this is a four-byte field which is always read and written as
* four-bytes, updates are atomic.
@@ -82,6 +108,9 @@ pgstat_report_wait_start(uint32 wait_event_info)
static inline void
pgstat_report_wait_end(void)
{
+ if (pgstat_wait_event_usage_depth > 0)
+ pgstat_count_wait_event_end();
+
/* see pgstat_report_wait_start() */
*(volatile uint32 *) my_wait_event_info = 0;
}
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 74a4d87801e..2c7a7e1d4c6 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -99,6 +99,23 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N)
(1 row)
+-- WAITS option
+select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
+ explain_filter
+---------------------------------------
+ Result (actual rows=N.N loops=N)
+ Wait Events:
+ Timeout:PgSleep calls=N time=N.N ms
+(3 rows)
+
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+ ?column?
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+explain (waits) select 1;
+ERROR: EXPLAIN option WAITS requires ANALYZE
\a
select explain_filter('explain (analyze, buffers, io, format xml) select * from int8_tbl i8');
explain_filter
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index 2f163c64bf6..fe025ddeac5 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -68,6 +68,11 @@ select explain_filter('explain (analyze, buffers off, verbose) select * from int
select explain_filter('explain (analyze, buffers, format text) select * from int8_tbl i8');
select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
+-- WAITS option
+select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+explain (waits) select 1;
+
\a
select explain_filter('explain (analyze, buffers, io, format xml) select * from int8_tbl i8');
select explain_filter('explain (analyze, serialize, buffers, io, format yaml) select * from int8_tbl i8');
--
2.52.0
[application/octet-stream] 0002-Aggregate-EXPLAIN-WAITS-from-parallel-workers.patch (17.0K, 4-0002-Aggregate-EXPLAIN-WAITS-from-parallel-workers.patch)
download | inline diff:
From eb5f9c83a0f248f2def60ff6e93e998149047f06 Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:44:24 +0500
Subject: [RFC PATCH v1 2/7] Aggregate EXPLAIN WAITS from parallel workers
---
src/backend/commands/explain.c | 4 +
src/backend/executor/execMain.c | 1 +
src/backend/executor/execParallel.c | 129 ++++++++++++++++++++++++
src/backend/executor/execUtils.c | 1 +
src/backend/utils/activity/wait_event.c | 22 +++-
src/include/executor/execParallel.h | 2 +
src/include/nodes/execnodes.h | 2 +
src/include/utils/wait_event.h | 3 +
src/test/regress/expected/explain.out | 17 ++++
src/test/regress/sql/explain.sql | 12 +++
10 files changed, 190 insertions(+), 3 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0b7cc5c15c6..9d7372f5415 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -594,6 +594,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
waitEventUsagePtr = &waitEventUsage;
pgstat_begin_wait_event_usage(waitEventUsagePtr,
queryDesc->estate->es_query_cxt);
+ queryDesc->estate->es_wait_event_usage = waitEventUsagePtr;
}
/* run the plan */
@@ -607,7 +608,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
PG_FINALLY();
{
if (waitEventUsagePtr)
+ {
pgstat_end_wait_event_usage(waitEventUsagePtr);
+ queryDesc->estate->es_wait_event_usage = NULL;
+ }
}
PG_END_TRY();
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4b30f768680..86ab124c1c0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3066,6 +3066,7 @@ EvalPlanQualStart(EPQState *epqstate, Plan *planTree)
/* es_trig_target_relations must NOT be copied */
rcestate->es_top_eflags = parentestate->es_top_eflags;
rcestate->es_instrument = parentestate->es_instrument;
+ rcestate->es_wait_event_usage = parentestate->es_wait_event_usage;
/* es_auxmodifytables must NOT be copied */
/*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 81b87d82fab..8213565a708 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -51,6 +51,7 @@
#include "utils/dsa.h"
#include "utils/lsyscache.h"
#include "utils/snapmgr.h"
+#include "utils/wait_event.h"
/*
* Magic numbers for parallel executor communication. We use constants
@@ -67,6 +68,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_WAIT_EVENT_USAGE UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -114,6 +116,18 @@ struct SharedExecutorInstrumentation
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
(NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
+typedef struct SharedWaitEventUsageWorker
+{
+ int nentries;
+ dsa_pointer entries;
+} SharedWaitEventUsageWorker;
+
+struct SharedWaitEventUsage
+{
+ int num_workers;
+ SharedWaitEventUsageWorker worker_usage[FLEXIBLE_ARRAY_MEMBER];
+};
+
/* Context object for ExecParallelEstimate. */
typedef struct ExecParallelEstimateContext
{
@@ -141,6 +155,10 @@ static bool ExecParallelReInitializeDSM(PlanState *planstate,
ParallelContext *pcxt);
static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
SharedExecutorInstrumentation *instrumentation);
+static void ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei);
+static void ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
+ dsa_area *area,
+ const WaitEventUsage *usage);
/* Helper function that runs in the parallel worker. */
static DestReceiver *ExecParallelGetReceiver(dsm_segment *seg, shm_toc *toc);
@@ -664,10 +682,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *paramlistinfo_space;
BufferUsage *bufusage_space;
WalUsage *walusage_space;
+ SharedWaitEventUsage *wait_event_usage = NULL;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
int pstmt_len;
int paramlistinfo_len;
+ int wait_event_usage_len = 0;
int instrumentation_len = 0;
int jit_instrumentation_len = 0;
int instrument_offset = 0;
@@ -744,6 +764,16 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /* Estimate space for per-worker wait event usage metadata. */
+ if (estate->es_wait_event_usage != NULL)
+ {
+ wait_event_usage_len =
+ offsetof(SharedWaitEventUsage, worker_usage) +
+ mul_size(sizeof(SharedWaitEventUsageWorker), pcxt->nworkers);
+ shm_toc_estimate_chunk(&pcxt->estimator, wait_event_usage_len);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+ }
+
/* Estimate space for tuple queues. */
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -839,6 +869,21 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
pei->wal_usage = walusage_space;
+ /* Allocate metadata for each worker's wait event usage, if requested. */
+ if (estate->es_wait_event_usage != NULL)
+ {
+ wait_event_usage = shm_toc_allocate(pcxt->toc, wait_event_usage_len);
+ wait_event_usage->num_workers = nworkers;
+ for (int i = 0; i < nworkers; i++)
+ {
+ wait_event_usage->worker_usage[i].nentries = 0;
+ wait_event_usage->worker_usage[i].entries = InvalidDsaPointer;
+ }
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAIT_EVENT_USAGE,
+ wait_event_usage);
+ pei->wait_event_usage = wait_event_usage;
+ }
+
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -1213,6 +1258,68 @@ ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
memcpy(planstate->worker_jit_instrument, shared_jit, ibytes);
}
+static void
+ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei)
+{
+ SharedWaitEventUsage *shared = pei->wait_event_usage;
+ WaitEventUsage *usage;
+
+ if (shared == NULL)
+ return;
+
+ usage = pei->planstate->state->es_wait_event_usage;
+ if (usage == NULL)
+ return;
+
+ for (int i = 0; i < shared->num_workers; i++)
+ {
+ SharedWaitEventUsageWorker *worker = &shared->worker_usage[i];
+ WaitEventUsageEntry *entries;
+
+ if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
+ continue;
+
+ entries = dsa_get_address(pei->area, worker->entries);
+ pgstat_accumulate_wait_event_usage(usage,
+ entries,
+ worker->nentries);
+ dsa_free(pei->area, worker->entries);
+ worker->nentries = 0;
+ worker->entries = InvalidDsaPointer;
+ }
+}
+
+static void
+ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
+ dsa_area *area,
+ const WaitEventUsage *usage)
+{
+ SharedWaitEventUsageWorker *worker;
+ WaitEventUsageEntry *entries;
+ dsa_pointer entries_dsa;
+ Size entries_size;
+
+ Assert(shared != NULL);
+ Assert(area != NULL);
+ Assert(usage != NULL);
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber < shared->num_workers);
+
+ if (usage->nentries <= 0)
+ return;
+
+ worker = &shared->worker_usage[ParallelWorkerNumber];
+ entries_size = mul_size(sizeof(WaitEventUsageEntry), usage->nentries);
+ entries_dsa = dsa_allocate(area, entries_size);
+ entries = dsa_get_address(area, entries_dsa);
+ memcpy(entries, usage->entries, entries_size);
+
+ if (DsaPointerIsValid(worker->entries))
+ dsa_free(area, worker->entries);
+ worker->nentries = usage->nentries;
+ worker->entries = entries_dsa;
+}
+
/*
* Finish parallel execution. We wait for parallel workers to finish, and
* accumulate their buffer/WAL usage.
@@ -1261,6 +1368,9 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
for (i = 0; i < nworkers; i++)
InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ /* Accumulate wait event usage, if requested. */
+ ExecParallelRetrieveWaitEventUsage(pei);
+
pei->finished = true;
}
@@ -1516,10 +1626,13 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
FixedParallelExecutorState *fpes;
BufferUsage *buffer_usage;
WalUsage *wal_usage;
+ SharedWaitEventUsage *wait_event_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
SharedJitInstrumentation *jit_instrumentation;
+ WaitEventUsage waitEventUsage;
+ WaitEventUsage *waitEventUsagePtr = NULL;
int instrument_options = 0;
void *area_space;
dsa_area *area;
@@ -1535,6 +1648,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
instrument_options = instrumentation->instrument_options;
jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
true);
+ wait_event_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAIT_EVENT_USAGE, true);
queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
/* Setting debug_query_string for individual workers */
@@ -1576,6 +1690,13 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
*/
InstrStartParallelQuery();
+ if (wait_event_usage != NULL)
+ {
+ waitEventUsagePtr = &waitEventUsage;
+ pgstat_begin_wait_event_usage(waitEventUsagePtr,
+ queryDesc->estate->es_query_cxt);
+ }
+
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
* more tuples than that.
@@ -1587,6 +1708,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Shut down the executor */
ExecutorFinish(queryDesc);
+ if (waitEventUsagePtr != NULL)
+ {
+ pgstat_end_wait_event_usage(waitEventUsagePtr);
+ ExecParallelReportWaitEventUsage(wait_event_usage,
+ area,
+ waitEventUsagePtr);
+ }
+
/* Report buffer/WAL usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 1eb6b9f1f40..80ea777632b 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -151,6 +151,7 @@ CreateExecutorState(void)
estate->es_top_eflags = 0;
estate->es_instrument = 0;
+ estate->es_wait_event_usage = NULL;
estate->es_finished = false;
estate->es_exprcontexts = NIL;
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 60d37ccbb73..eb01bc3d88c 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -37,7 +37,7 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
static const char *pgstat_get_wait_io(WaitEventIO w);
static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
- const instr_time *elapsed);
+ uint64 calls, const instr_time *elapsed);
static uint32 local_my_wait_event_info;
@@ -442,15 +442,31 @@ pgstat_count_wait_event_end(void)
WaitEventUsageAdd(pgstat_wait_event_usage,
pgstat_wait_event_usage_current,
+ 1,
&elapsed);
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
+void
+pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
+ const WaitEventUsageEntry *entries,
+ int nentries)
+{
+ Assert(usage != NULL);
+ Assert(nentries == 0 || entries != NULL);
+
+ for (int i = 0; i < nentries; i++)
+ WaitEventUsageAdd(usage,
+ entries[i].wait_event_info,
+ entries[i].calls,
+ &entries[i].time);
+}
+
static void
WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
- const instr_time *elapsed)
+ uint64 calls, const instr_time *elapsed)
{
WaitEventUsageEntry *entry = NULL;
@@ -494,7 +510,7 @@ WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
INSTR_TIME_SET_ZERO(entry->time);
}
- entry->calls++;
+ entry->calls += calls;
INSTR_TIME_ADD(entry->time, *elapsed);
}
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..71df2c2511c 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -20,6 +20,7 @@
#include "utils/dsa.h"
typedef struct SharedExecutorInstrumentation SharedExecutorInstrumentation;
+typedef struct SharedWaitEventUsage SharedWaitEventUsage;
typedef struct ParallelExecutorInfo
{
@@ -27,6 +28,7 @@ typedef struct ParallelExecutorInfo
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
WalUsage *wal_usage; /* walusage area in DSM */
+ SharedWaitEventUsage *wait_event_usage; /* optional */
SharedExecutorInstrumentation *instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
dsa_area *area; /* points to DSA area in DSM */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 13359180d25..781c8c79132 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -69,6 +69,7 @@ typedef struct Tuplestorestate Tuplestorestate;
typedef struct TupleTableSlot TupleTableSlot;
typedef struct TupleTableSlotOps TupleTableSlotOps;
typedef struct WalUsage WalUsage;
+typedef struct WaitEventUsage WaitEventUsage;
typedef struct WorkerNodeInstrumentation WorkerNodeInstrumentation;
@@ -754,6 +755,7 @@ typedef struct EState
int es_top_eflags; /* eflags passed to ExecutorStart */
int es_instrument; /* OR of InstrumentOption flags */
+ WaitEventUsage *es_wait_event_usage; /* EXPLAIN WAITS accumulator */
bool es_finished; /* true when ExecutorFinish is done */
List *es_exprcontexts; /* List of ExprContexts within EState */
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index f7fab5736bb..63992137ee7 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -39,6 +39,9 @@ extern void pgstat_reset_wait_event_storage(void);
extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
MemoryContext memcontext);
extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
+extern void pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
+ const WaitEventUsageEntry *entries,
+ int nentries);
extern void pgstat_count_wait_event_start(uint32 wait_event_info);
extern void pgstat_count_wait_event_end(void);
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 2c7a7e1d4c6..e3847e222be 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -114,6 +114,23 @@ select explain_filter_to_json('explain (analyze, waits, costs off, summary off,
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)
+begin;
+create function pg_temp.parallel_pg_sleep(float8) returns void
+ language internal volatile parallel safe as 'pg_sleep';
+set local debug_parallel_query = on;
+set local max_parallel_workers_per_gather = 1;
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select pg_temp.parallel_pg_sleep(0.01)
+ from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+ '$[*] ? (@."Wait Event" == "PgSleep")'
+);
+ jsonb_path_query_first
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
explain (waits) select 1;
ERROR: EXPLAIN option WAITS requires ANALYZE
\a
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index fe025ddeac5..8821250bcef 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -71,6 +71,18 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
-- WAITS option
select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+begin;
+create function pg_temp.parallel_pg_sleep(float8) returns void
+ language internal volatile parallel safe as 'pg_sleep';
+set local debug_parallel_query = on;
+set local max_parallel_workers_per_gather = 1;
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select pg_temp.parallel_pg_sleep(0.01)
+ from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+ '$[*] ? (@."Wait Event" == "PgSleep")'
+);
+rollback;
explain (waits) select 1;
\a
--
2.52.0
[application/octet-stream] 0003-Attribute-EXPLAIN-WAITS-to-plan-nodes.patch (35.0K, 5-0003-Attribute-EXPLAIN-WAITS-to-plan-nodes.patch)
download | inline diff:
From cb26aa486d42f227209c27df4aef4b07f13e334a Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:44:35 +0500
Subject: [RFC PATCH v1 3/7] Attribute EXPLAIN WAITS to plan nodes
---
src/backend/commands/explain.c | 6 +-
src/backend/executor/execAsync.c | 22 ++++
src/backend/executor/execParallel.c | 131 ++++++++++++++++++---
src/backend/executor/execProcnode.c | 31 ++++-
src/backend/executor/instrument.c | 7 ++
src/backend/executor/nodeBitmapAnd.c | 7 ++
src/backend/executor/nodeBitmapIndexscan.c | 7 ++
src/backend/executor/nodeBitmapOr.c | 7 ++
src/backend/executor/nodeHash.c | 7 ++
src/backend/utils/activity/wait_event.c | 49 +++++++-
src/include/executor/instrument.h | 1 +
src/include/nodes/execnodes.h | 1 +
src/include/utils/wait_event.h | 4 +
src/test/regress/expected/explain.out | 52 +++++++-
src/test/regress/sql/explain.sql | 29 +++++
15 files changed, 333 insertions(+), 28 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9d7372f5415..0575ae8368b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -530,6 +530,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_WAL;
if (es->io)
instrument_option |= INSTRUMENT_IO;
+ if (es->waits)
+ instrument_option |= INSTRUMENT_WAITS;
/*
* We always collect timing for the entire statement, even when node-level
@@ -2332,6 +2334,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
show_buffer_usage(es, &planstate->instrument->instr.bufusage);
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
+ if (es->waits)
+ show_wait_event_usage(es, planstate->wait_event_usage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -4559,7 +4563,7 @@ show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
if (usage == NULL)
return;
- if (es->format == EXPLAIN_FORMAT_TEXT && usage->nentries == 0)
+ if (usage->nentries == 0)
return;
if (usage->nentries > 0)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index cf7ddbb01f4..e73ea55cf92 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -19,6 +19,7 @@
#include "executor/instrument.h"
#include "executor/nodeAppend.h"
#include "executor/nodeForeignscan.h"
+#include "utils/wait_event.h"
/*
* Asynchronously request a tuple from a designed async-capable node.
@@ -26,12 +27,17 @@
void
ExecAsyncRequest(AsyncRequest *areq)
{
+ WaitEventUsage *previous_wait_event_usage = NULL;
+
if (areq->requestee->chgParam != NULL) /* something changed? */
ExecReScan(areq->requestee); /* let ReScan handle this */
/* must provide our own instrumentation support */
if (areq->requestee->instrument)
InstrStartNode(areq->requestee->instrument);
+ if (areq->requestee->wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(areq->requestee->wait_event_usage);
switch (nodeTag(areq->requestee))
{
@@ -47,6 +53,8 @@ ExecAsyncRequest(AsyncRequest *areq)
ExecAsyncResponse(areq);
/* must provide our own instrumentation support */
+ if (areq->requestee->wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (areq->requestee->instrument)
InstrStopNode(areq->requestee->instrument,
TupIsNull(areq->result) ? 0.0 : 1.0);
@@ -62,9 +70,14 @@ ExecAsyncRequest(AsyncRequest *areq)
void
ExecAsyncConfigureWait(AsyncRequest *areq)
{
+ WaitEventUsage *previous_wait_event_usage = NULL;
+
/* must provide our own instrumentation support */
if (areq->requestee->instrument)
InstrStartNode(areq->requestee->instrument);
+ if (areq->requestee->wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(areq->requestee->wait_event_usage);
switch (nodeTag(areq->requestee))
{
@@ -78,6 +91,8 @@ ExecAsyncConfigureWait(AsyncRequest *areq)
}
/* must provide our own instrumentation support */
+ if (areq->requestee->wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (areq->requestee->instrument)
InstrStopNode(areq->requestee->instrument, 0.0);
}
@@ -88,9 +103,14 @@ ExecAsyncConfigureWait(AsyncRequest *areq)
void
ExecAsyncNotify(AsyncRequest *areq)
{
+ WaitEventUsage *previous_wait_event_usage = NULL;
+
/* must provide our own instrumentation support */
if (areq->requestee->instrument)
InstrStartNode(areq->requestee->instrument);
+ if (areq->requestee->wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(areq->requestee->wait_event_usage);
switch (nodeTag(areq->requestee))
{
@@ -106,6 +126,8 @@ ExecAsyncNotify(AsyncRequest *areq)
ExecAsyncResponse(areq);
/* must provide our own instrumentation support */
+ if (areq->requestee->wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (areq->requestee->instrument)
InstrStopNode(areq->requestee->instrument,
TupIsNull(areq->result) ? 0.0 : 1.0);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 8213565a708..f77f539dd27 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -83,6 +83,12 @@ typedef struct FixedParallelExecutorState
int jit_flags;
} FixedParallelExecutorState;
+typedef struct SharedWaitEventUsageWorker
+{
+ int nentries;
+ dsa_pointer entries;
+} SharedWaitEventUsageWorker;
+
/*
* DSM structure for accumulating per-PlanState instrumentation.
*
@@ -92,6 +98,10 @@ typedef struct FixedParallelExecutorState
* of the first NodeInstrumentation object. This will depend on the length of
* the plan_node_id array.
*
+ * wait_event_usage_offset: Offset, relative to the start of this structure,
+ * of the first SharedWaitEventUsageWorker object, or 0 if wait event usage is
+ * not being collected.
+ *
* num_workers: Number of workers.
*
* num_plan_nodes: Number of plan nodes.
@@ -103,6 +113,7 @@ struct SharedExecutorInstrumentation
{
int instrument_options;
int instrument_offset;
+ int wait_event_usage_offset;
int num_workers;
int num_plan_nodes;
int plan_node_id[FLEXIBLE_ARRAY_MEMBER];
@@ -110,17 +121,17 @@ struct SharedExecutorInstrumentation
/*
* Array of num_plan_nodes * num_workers NodeInstrumentation objects
* follows.
+ *
+ * If wait_event_usage_offset is non-zero, an array of num_plan_nodes *
+ * num_workers SharedWaitEventUsageWorker objects follows.
*/
};
#define GetInstrumentationArray(sei) \
(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
(NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
-
-typedef struct SharedWaitEventUsageWorker
-{
- int nentries;
- dsa_pointer entries;
-} SharedWaitEventUsageWorker;
+#define GetInstrumentationWaitEventUsageArray(sei) \
+ (StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
+ (SharedWaitEventUsageWorker *) (((char *) sei) + sei->wait_event_usage_offset))
struct SharedWaitEventUsage
{
@@ -143,6 +154,12 @@ typedef struct ExecParallelInitializeDSMContext
int nnodes;
} ExecParallelInitializeDSMContext;
+typedef struct ExecParallelRetrieveInstrumentationContext
+{
+ SharedExecutorInstrumentation *instrumentation;
+ dsa_area *area;
+} ExecParallelRetrieveInstrumentationContext;
+
/* Helper functions that run in the parallel leader. */
static char *ExecSerializePlan(Plan *plan, EState *estate);
static bool ExecParallelEstimate(PlanState *planstate,
@@ -154,8 +171,11 @@ static shm_mq_handle **ExecParallelSetupTupleQueues(ParallelContext *pcxt,
static bool ExecParallelReInitializeDSM(PlanState *planstate,
ParallelContext *pcxt);
static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
- SharedExecutorInstrumentation *instrumentation);
+ ExecParallelRetrieveInstrumentationContext *r);
static void ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei);
+static void ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
+ dsa_area *area,
+ const WaitEventUsage *usage);
static void ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
dsa_area *area,
const WaitEventUsage *usage);
@@ -691,6 +711,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
int instrumentation_len = 0;
int jit_instrumentation_len = 0;
int instrument_offset = 0;
+ int wait_event_usage_offset = 0;
Size dsa_minsize = dsa_minimum_size();
char *query_string;
int query_len;
@@ -798,6 +819,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation_len +=
mul_size(sizeof(NodeInstrumentation),
mul_size(e.nnodes, nworkers));
+ if (estate->es_instrument & INSTRUMENT_WAITS)
+ {
+ instrumentation_len = MAXALIGN(instrumentation_len);
+ wait_event_usage_offset = instrumentation_len;
+ instrumentation_len +=
+ mul_size(sizeof(SharedWaitEventUsageWorker),
+ mul_size(e.nnodes, nworkers));
+ }
shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -903,11 +932,23 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
instrumentation->instrument_options = estate->es_instrument;
instrumentation->instrument_offset = instrument_offset;
+ instrumentation->wait_event_usage_offset = wait_event_usage_offset;
instrumentation->num_workers = nworkers;
instrumentation->num_plan_nodes = e.nnodes;
instrument = GetInstrumentationArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
InstrInitNode(&instrument[i], estate->es_instrument, false);
+ if (wait_event_usage_offset != 0)
+ {
+ SharedWaitEventUsageWorker *worker_usage;
+
+ worker_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
+ for (i = 0; i < nworkers * e.nnodes; ++i)
+ {
+ worker_usage[i].nentries = 0;
+ worker_usage[i].entries = InvalidDsaPointer;
+ }
+ }
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
pei->instrumentation = instrumentation;
@@ -1137,9 +1178,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
*/
static bool
ExecParallelRetrieveInstrumentation(PlanState *planstate,
- SharedExecutorInstrumentation *instrumentation)
+ ExecParallelRetrieveInstrumentationContext *r)
{
+ SharedExecutorInstrumentation *instrumentation = r->instrumentation;
NodeInstrumentation *instrument;
+ SharedWaitEventUsageWorker *wait_event_usage = NULL;
int i;
int n;
int ibytes;
@@ -1159,6 +1202,30 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
for (n = 0; n < instrumentation->num_workers; ++n)
InstrAggNode(planstate->instrument, &instrument[n]);
+ /* Accumulate the wait event usage from all workers. */
+ if (instrumentation->wait_event_usage_offset != 0 &&
+ planstate->wait_event_usage != NULL)
+ {
+ wait_event_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
+ wait_event_usage += i * instrumentation->num_workers;
+ for (n = 0; n < instrumentation->num_workers; ++n)
+ {
+ SharedWaitEventUsageWorker *worker = &wait_event_usage[n];
+ WaitEventUsageEntry *entries;
+
+ if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
+ continue;
+
+ entries = dsa_get_address(r->area, worker->entries);
+ pgstat_accumulate_wait_event_usage(planstate->wait_event_usage,
+ entries,
+ worker->nentries);
+ dsa_free(r->area, worker->entries);
+ worker->nentries = 0;
+ worker->entries = InvalidDsaPointer;
+ }
+ }
+
/*
* Also store the per-worker detail.
*
@@ -1216,7 +1283,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
}
return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
- instrumentation);
+ r);
}
/*
@@ -1290,25 +1357,21 @@ ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei)
}
static void
-ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
- dsa_area *area,
- const WaitEventUsage *usage)
+ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
+ dsa_area *area,
+ const WaitEventUsage *usage)
{
- SharedWaitEventUsageWorker *worker;
WaitEventUsageEntry *entries;
dsa_pointer entries_dsa;
Size entries_size;
- Assert(shared != NULL);
+ Assert(worker != NULL);
Assert(area != NULL);
Assert(usage != NULL);
- Assert(IsParallelWorker());
- Assert(ParallelWorkerNumber < shared->num_workers);
if (usage->nentries <= 0)
return;
- worker = &shared->worker_usage[ParallelWorkerNumber];
entries_size = mul_size(sizeof(WaitEventUsageEntry), usage->nentries);
entries_dsa = dsa_allocate(area, entries_size);
entries = dsa_get_address(area, entries_dsa);
@@ -1320,6 +1383,20 @@ ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
worker->entries = entries_dsa;
}
+static void
+ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
+ dsa_area *area,
+ const WaitEventUsage *usage)
+{
+ Assert(shared != NULL);
+ Assert(IsParallelWorker());
+ Assert(ParallelWorkerNumber < shared->num_workers);
+
+ ExecParallelReportWaitEventUsageWorker(&shared->worker_usage[ParallelWorkerNumber],
+ area,
+ usage);
+}
+
/*
* Finish parallel execution. We wait for parallel workers to finish, and
* accumulate their buffer/WAL usage.
@@ -1385,8 +1462,13 @@ ExecParallelCleanup(ParallelExecutorInfo *pei)
{
/* Accumulate instrumentation, if any. */
if (pei->instrumentation)
- ExecParallelRetrieveInstrumentation(pei->planstate,
- pei->instrumentation);
+ {
+ ExecParallelRetrieveInstrumentationContext r;
+
+ r.instrumentation = pei->instrumentation;
+ r.area = pei->area;
+ ExecParallelRetrieveInstrumentation(pei->planstate, &r);
+ }
/* Accumulate JIT instrumentation, if any. */
if (pei->jit_instrumentation)
@@ -1495,6 +1577,17 @@ ExecParallelReportInstrumentation(PlanState *planstate,
Assert(IsParallelWorker());
Assert(ParallelWorkerNumber < instrumentation->num_workers);
InstrAggNode(&instrument[ParallelWorkerNumber], planstate->instrument);
+ if (instrumentation->wait_event_usage_offset != 0 &&
+ planstate->wait_event_usage != NULL)
+ {
+ SharedWaitEventUsageWorker *wait_event_usage;
+
+ wait_event_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
+ wait_event_usage += i * instrumentation->num_workers;
+ ExecParallelReportWaitEventUsageWorker(&wait_event_usage[ParallelWorkerNumber],
+ planstate->state->es_query_dsa,
+ planstate->wait_event_usage);
+ }
return planstate_tree_walker(planstate, ExecParallelReportInstrumentation,
instrumentation);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7c4c66e323f..081855b3fed 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
#include "executor/nodeWindowAgg.h"
#include "executor/nodeWorktablescan.h"
#include "miscadmin.h"
+#include "utils/wait_event.h"
#include "nodes/nodeFuncs.h"
static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
@@ -415,6 +416,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
if (estate->es_instrument)
result->instrument = InstrAllocNode(estate->es_instrument,
result->async_capable);
+ if (estate->es_instrument & INSTRUMENT_WAITS)
+ {
+ MemoryContext oldcontext;
+
+ oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+ result->wait_event_usage = palloc_object(WaitEventUsage);
+ pgstat_init_wait_event_usage(result->wait_event_usage,
+ estate->es_query_cxt);
+ MemoryContextSwitchTo(oldcontext);
+ }
return result;
}
@@ -481,7 +492,8 @@ ExecProcNodeFirst(PlanState *node)
* This has essentially the same responsibilities as ExecProcNode,
* but it does not do InstrStartNode/InstrStopNode (mainly because
* it can't tell how many returned tuples to count). Each per-node
- * function must provide its own instrumentation support.
+ * function must provide its own instrumentation support, including
+ * wait event attribution when enabled.
* ----------------------------------------------------------------
*/
Node *
@@ -758,6 +770,9 @@ ExecShutdownNode(PlanState *node)
static bool
ExecShutdownNode_walker(PlanState *node, void *context)
{
+ bool node_running;
+ WaitEventUsage *previous_wait_event_usage = NULL;
+
if (node == NULL)
return false;
@@ -773,8 +788,14 @@ ExecShutdownNode_walker(PlanState *node, void *context)
* has never been executed, so as to avoid incorrectly making it appear
* that it has.
*/
- if (node->instrument && node->instrument->running)
+ node_running = node->instrument && node->instrument->running;
+ if (node_running)
+ {
InstrStartNode(node->instrument);
+ if (node->wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(node->wait_event_usage);
+ }
planstate_tree_walker(node, ExecShutdownNode_walker, context);
@@ -803,8 +824,12 @@ ExecShutdownNode_walker(PlanState *node, void *context)
}
/* Stop the node if we started it above, reporting 0 tuples. */
- if (node->instrument && node->instrument->running)
+ if (node_running)
+ {
+ if (node->wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
InstrStopNode(node->instrument, 0);
+ }
return false;
}
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ffbcd572133..a1f22180f82 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -21,6 +21,7 @@
#include "nodes/execnodes.h"
#include "portability/instr_time.h"
#include "utils/guc_hooks.h"
+#include "utils/wait_event.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
@@ -181,11 +182,17 @@ TupleTableSlot *
ExecProcNodeInstr(PlanState *node)
{
TupleTableSlot *result;
+ WaitEventUsage *previous_wait_event_usage = NULL;
InstrStartNode(node->instrument);
+ if (node->wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(node->wait_event_usage);
result = node->ExecProcNodeReal(node);
+ if (node->wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
return result;
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index 9007dda3802..e2b442bd378 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -32,6 +32,7 @@
#include "executor/instrument.h"
#include "executor/nodeBitmapAnd.h"
#include "nodes/tidbitmap.h"
+#include "utils/wait_event.h"
/* ----------------------------------------------------------------
@@ -115,10 +116,14 @@ MultiExecBitmapAnd(BitmapAndState *node)
int nplans;
int i;
TIDBitmap *result = NULL;
+ WaitEventUsage *previous_wait_event_usage = NULL;
/* must provide our own instrumentation support */
if (node->ps.instrument)
InstrStartNode(node->ps.instrument);
+ if (node->ps.wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(node->ps.wait_event_usage);
/*
* get information from the node
@@ -162,6 +167,8 @@ MultiExecBitmapAnd(BitmapAndState *node)
elog(ERROR, "BitmapAnd doesn't support zero inputs");
/* must provide our own instrumentation support */
+ if (node->ps.wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (node->ps.instrument)
InstrStopNode(node->ps.instrument, 0 /* XXX */ );
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 7978514e1bc..c2798a81e97 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -28,6 +28,7 @@
#include "executor/nodeIndexscan.h"
#include "miscadmin.h"
#include "nodes/tidbitmap.h"
+#include "utils/wait_event.h"
/* ----------------------------------------------------------------
@@ -54,10 +55,14 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
IndexScanDesc scandesc;
double nTuples = 0;
bool doscan;
+ WaitEventUsage *previous_wait_event_usage = NULL;
/* must provide our own instrumentation support */
if (node->ss.ps.instrument)
InstrStartNode(node->ss.ps.instrument);
+ if (node->ss.ps.wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(node->ss.ps.wait_event_usage);
/*
* extract necessary information from index scan node
@@ -116,6 +121,8 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
}
/* must provide our own instrumentation support */
+ if (node->ss.ps.wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (node->ss.ps.instrument)
InstrStopNode(node->ss.ps.instrument, nTuples);
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 148c80fdae6..e9ee934a6f6 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -33,6 +33,7 @@
#include "executor/nodeBitmapOr.h"
#include "nodes/tidbitmap.h"
#include "miscadmin.h"
+#include "utils/wait_event.h"
/* ----------------------------------------------------------------
@@ -116,10 +117,14 @@ MultiExecBitmapOr(BitmapOrState *node)
int nplans;
int i;
TIDBitmap *result = NULL;
+ WaitEventUsage *previous_wait_event_usage = NULL;
/* must provide our own instrumentation support */
if (node->ps.instrument)
InstrStartNode(node->ps.instrument);
+ if (node->ps.wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(node->ps.wait_event_usage);
/*
* get information from the node
@@ -180,6 +185,8 @@ MultiExecBitmapOr(BitmapOrState *node)
elog(ERROR, "BitmapOr doesn't support zero inputs");
/* must provide our own instrumentation support */
+ if (node->ps.wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (node->ps.instrument)
InstrStopNode(node->ps.instrument, 0 /* XXX */ );
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8825bb6fa23..6e9e4b4218f 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -105,9 +105,14 @@ ExecHash(PlanState *pstate)
Node *
MultiExecHash(HashState *node)
{
+ WaitEventUsage *previous_wait_event_usage = NULL;
+
/* must provide our own instrumentation support */
if (node->ps.instrument)
InstrStartNode(node->ps.instrument);
+ if (node->ps.wait_event_usage)
+ previous_wait_event_usage =
+ pgstat_enter_wait_event_usage(node->ps.wait_event_usage);
if (node->parallel_state != NULL)
MultiExecParallelHash(node);
@@ -115,6 +120,8 @@ MultiExecHash(HashState *node)
MultiExecPrivateHash(node);
/* must provide our own instrumentation support */
+ if (node->ps.wait_event_usage)
+ pgstat_restore_wait_event_usage(previous_wait_event_usage);
if (node->ps.instrument)
InstrStopNode(node->ps.instrument, node->hashtable->reportTuples);
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index eb01bc3d88c..3d5a10b44c4 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -27,6 +27,7 @@
#include "storage/shmem.h"
#include "storage/subsystems.h"
#include "storage/spin.h"
+#include "utils/memutils.h"
#include "utils/wait_event.h"
@@ -47,6 +48,7 @@ uint32 *my_wait_event_info = &local_my_wait_event_info;
int pgstat_wait_event_usage_depth = 0;
static WaitEventUsage *pgstat_wait_event_usage = NULL;
+static WaitEventUsage *pgstat_wait_event_node_usage = NULL;
static uint32 pgstat_wait_event_usage_current = 0;
static instr_time pgstat_wait_event_usage_start;
@@ -358,6 +360,28 @@ pgstat_reset_wait_event_storage(void)
my_wait_event_info = &local_my_wait_event_info;
}
+/*
+ * Initialize a wait event usage accumulator.
+ */
+void
+pgstat_init_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+{
+ Assert(usage != NULL);
+ Assert(memcontext != NULL);
+
+ memset(usage, 0, sizeof(WaitEventUsage));
+
+ /*
+ * Wait events may end inside critical sections, for example while
+ * performing synchronous I/O. Keep usage entries in a dedicated context
+ * where the memory manager permits that accounting path to grow.
+ */
+ usage->memcontext = AllocSetContextCreate(memcontext,
+ "Wait Event Usage",
+ ALLOCSET_SMALL_SIZES);
+ MemoryContextAllowInCriticalSection(usage->memcontext, true);
+}
+
/*
* Start collecting exact wait event timings in this backend.
*
@@ -375,9 +399,9 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
if (pgstat_wait_event_usage_depth++ == 0)
{
- memset(usage, 0, sizeof(WaitEventUsage));
- usage->memcontext = memcontext;
+ pgstat_init_wait_event_usage(usage, memcontext);
pgstat_wait_event_usage = usage;
+ pgstat_wait_event_node_usage = NULL;
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
@@ -398,11 +422,27 @@ pgstat_end_wait_event_usage(WaitEventUsage *usage)
pgstat_count_wait_event_end();
pgstat_wait_event_usage = NULL;
+ pgstat_wait_event_node_usage = NULL;
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
}
+WaitEventUsage *
+pgstat_enter_wait_event_usage(WaitEventUsage *usage)
+{
+ WaitEventUsage *previous = pgstat_wait_event_node_usage;
+
+ pgstat_wait_event_node_usage = usage;
+ return previous;
+}
+
+void
+pgstat_restore_wait_event_usage(WaitEventUsage *usage)
+{
+ pgstat_wait_event_node_usage = usage;
+}
+
/*
* Record the beginning of a wait event for exact EXPLAIN-style accounting.
*/
@@ -444,6 +484,11 @@ pgstat_count_wait_event_end(void)
pgstat_wait_event_usage_current,
1,
&elapsed);
+ if (pgstat_wait_event_node_usage != NULL)
+ WaitEventUsageAdd(pgstat_wait_event_node_usage,
+ pgstat_wait_event_usage_current,
+ 1,
+ &elapsed);
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index f093a52aae0..f46ee03d2c1 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -65,6 +65,7 @@ typedef enum InstrumentOption
INSTRUMENT_ROWS = 1 << 2, /* needs row count */
INSTRUMENT_WAL = 1 << 3, /* needs WAL usage */
INSTRUMENT_IO = 1 << 4, /* needs IO usage */
+ INSTRUMENT_WAITS = 1 << 5, /* needs wait event usage */
INSTRUMENT_ALL = PG_INT32_MAX
} InstrumentOption;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 781c8c79132..56f9dcbef88 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1212,6 +1212,7 @@ typedef struct PlanState
NodeInstrumentation *instrument; /* Optional runtime stats for this
* node */
+ WaitEventUsage *wait_event_usage; /* Optional wait event usage */
WorkerNodeInstrumentation *worker_instrument; /* per-worker
* instrumentation */
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 63992137ee7..64411288403 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -36,12 +36,16 @@ static inline void pgstat_report_wait_start(uint32 wait_event_info);
static inline void pgstat_report_wait_end(void);
extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
extern void pgstat_reset_wait_event_storage(void);
+extern void pgstat_init_wait_event_usage(WaitEventUsage *usage,
+ MemoryContext memcontext);
extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
MemoryContext memcontext);
extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
extern void pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
const WaitEventUsageEntry *entries,
int nentries);
+extern WaitEventUsage *pgstat_enter_wait_event_usage(WaitEventUsage *usage);
+extern void pgstat_restore_wait_event_usage(WaitEventUsage *usage);
extern void pgstat_count_wait_event_start(uint32 wait_event_info);
extern void pgstat_count_wait_event_end(void);
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index e3847e222be..c086fa48d80 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -101,12 +101,14 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
-- WAITS option
select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
- explain_filter
----------------------------------------
+ explain_filter
+-----------------------------------------
Result (actual rows=N.N loops=N)
+ Wait Events:
+ Timeout:PgSleep calls=N time=N.N ms
Wait Events:
Timeout:PgSleep calls=N time=N.N ms
-(3 rows)
+(5 rows)
select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
?column?
@@ -114,6 +116,12 @@ select explain_filter_to_json('explain (analyze, waits, costs off, summary off,
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Plan,Wait Events,0}';
+ ?column?
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
begin;
create function pg_temp.parallel_pg_sleep(float8) returns void
language internal volatile parallel safe as 'pg_sleep';
@@ -130,6 +138,44 @@ select jsonb_path_query_first(
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select pg_temp.parallel_pg_sleep(0.01)
+ from tenk1 where unique1 = 1') #> '{0,Plan}',
+ '$.**."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+ jsonb_path_query_first
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
+begin;
+-- This test deliberately creates a Bitmap Index Scan runtime-key wait.
+-- The STABLE PL/pgSQL wrapper is test scaffolding: STABLE lets the
+-- expression be used as an index runtime key, while PL/pgSQL prevents SQL
+-- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
+-- The planner GUCs below are likewise test-only scaffolding to make the
+-- node shape deterministic.
+create function pg_temp.explain_waits_sleep_int(int) returns int
+ language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
+create temp table explain_waits_bitmap (a int);
+insert into explain_waits_bitmap select g from generate_series(1, 10) g;
+create index explain_waits_bitmap_a_idx on explain_waits_bitmap(a);
+analyze explain_waits_bitmap;
+set local enable_seqscan = off;
+set local enable_indexscan = off;
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select * from explain_waits_bitmap
+ where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+ '$.** ? (@."Node Type" == "Bitmap Index Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+ jsonb_path_query_first
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
rollback;
explain (waits) select 1;
ERROR: EXPLAIN option WAITS requires ANALYZE
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index 8821250bcef..bd196f248db 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -71,6 +71,7 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
-- WAITS option
select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Plan,Wait Events,0}';
begin;
create function pg_temp.parallel_pg_sleep(float8) returns void
language internal volatile parallel safe as 'pg_sleep';
@@ -82,6 +83,34 @@ select jsonb_path_query_first(
from tenk1 where unique1 = 1') #> '{0,Wait Events}',
'$[*] ? (@."Wait Event" == "PgSleep")'
);
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select pg_temp.parallel_pg_sleep(0.01)
+ from tenk1 where unique1 = 1') #> '{0,Plan}',
+ '$.**."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+rollback;
+begin;
+-- This test deliberately creates a Bitmap Index Scan runtime-key wait.
+-- The STABLE PL/pgSQL wrapper is test scaffolding: STABLE lets the
+-- expression be used as an index runtime key, while PL/pgSQL prevents SQL
+-- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
+-- The planner GUCs below are likewise test-only scaffolding to make the
+-- node shape deterministic.
+create function pg_temp.explain_waits_sleep_int(int) returns int
+ language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
+create temp table explain_waits_bitmap (a int);
+insert into explain_waits_bitmap select g from generate_series(1, 10) g;
+create index explain_waits_bitmap_a_idx on explain_waits_bitmap(a);
+analyze explain_waits_bitmap;
+set local enable_seqscan = off;
+set local enable_indexscan = off;
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select * from explain_waits_bitmap
+ where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+ '$.** ? (@."Node Type" == "Bitmap Index Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
rollback;
explain (waits) select 1;
--
2.52.0
[application/octet-stream] 0004-Refine-EXPLAIN-WAITS-attribution-semantics.patch (42.8K, 6-0004-Refine-EXPLAIN-WAITS-attribution-semantics.patch)
download | inline diff:
From 1ea51ee5f55dccad7f7692244640171f4e42fcbf Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:44:49 +0500
Subject: [RFC PATCH v1 4/7] Refine EXPLAIN WAITS attribution semantics
---
doc/src/sgml/ref/explain.sgml | 52 ++++++-
src/backend/commands/explain.c | 82 +++++++----
src/backend/executor/execParallel.c | 121 +++++++++++-----
src/backend/utils/activity/wait_event.c | 184 +++++++++++++++++++-----
src/bin/psql/tab-complete.in.c | 4 +-
src/include/utils/wait_event.h | 5 +
src/test/regress/expected/explain.out | 128 ++++++++++++++++-
src/test/regress/sql/explain.sql | 102 ++++++++++++-
8 files changed, 574 insertions(+), 104 deletions(-)
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5460568af1d..d699b215120 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -315,9 +315,46 @@ ROLLBACK;
<term><literal>WAITS</literal></term>
<listitem>
<para>
- Include exact backend-local wait event timing observed during query
- execution. Wait events are grouped by wait event type and wait event
- name, and include the number of waits and total time spent waiting.
+ Include wait event timing observed during query execution. Wait
+ events are grouped by wait event type and wait event name, using the
+ same wait event identifiers reported by
+ <structname>pg_stat_activity</structname>; see
+ <xref linkend="wait-event-table"/>. Each entry includes the number
+ of completed waits and the total time spent waiting. Wait timing is
+ collected when this option is enabled even if <literal>TIMING</literal>
+ is disabled.
+ </para>
+
+ <para>
+ If <command>EXPLAIN</command> cannot grow its per-query or per-node wait
+ event storage without risking an error while a wait is ending, waits
+ whose exact event identifier could not be stored are accumulated in an
+ <literal>Unrecorded Wait Event Calls</literal> counter and
+ <literal>Unrecorded Wait Event Time</literal> total. This is a
+ reporting fallback under memory pressure, not a wait event emitted by
+ server instrumentation.
+ </para>
+
+ <para>
+ The top-level <literal>Statement Wait Events</literal> summary counts
+ each wait once for the statement, including waits reported by parallel
+ workers. Nested <command>EXPLAIN (ANALYZE, WAITS)</command> commands
+ maintain separate statement-level summaries; a wait is counted once in
+ each active summary. Plan nodes can also show <literal>Wait
+ Events</literal>. Node-level wait events are inclusive, like plan node
+ timing in
+ <command>EXPLAIN ANALYZE</command>: a wait is attributed to every
+ active plan node when the wait begins. Therefore node-level wait event
+ times should not be summed to compute the statement total.
+ </para>
+
+ <para>
+ The overhead of this option depends on the number of waits observed and,
+ for node-level output, on the number of active plan nodes when each wait
+ begins.
+ </para>
+
+ <para>
This parameter may only be used when <literal>ANALYZE</literal> is also
enabled. It defaults to <literal>FALSE</literal>.
</para>
@@ -405,6 +442,15 @@ ROLLBACK;
execution, and on machines that have relatively slow operating
system calls for obtaining the time of day.
</para>
+
+ <para>
+ The <literal>WAITS</literal> option also adds profiling overhead when
+ <literal>ANALYZE</literal> is used. It measures wait intervals reported
+ by server instrumentation and maintains per-statement and per-node wait
+ event counters while the statement executes. The overhead depends on the
+ number of reported waits, the number of distinct wait events, and the depth
+ of the active executor plan node stack.
+ </para>
</refsect1>
<refsect1>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0575ae8368b..9c198f8e599 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -151,7 +151,7 @@ static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static int wait_event_usage_cmp(const void *a, const void *b);
-static void show_wait_event_usage(ExplainState *es,
+static void show_wait_event_usage(ExplainState *es, const char *labelname,
const WaitEventUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -219,7 +219,7 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
* In the case of an INSTEAD NOTHING, tell at least that. But in
* non-text format, the output is delimited, so this isn't necessary.
*/
- if (es->format == EXPLAIN_FORMAT_TEXT)
+ if (es->format == EXPLAIN_FORMAT_TEXT)
appendStringInfoString(es->str, "Query rewrites to nothing\n");
}
else
@@ -634,7 +634,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
ExplainPrintPlan(es, queryDesc);
if (waitEventUsagePtr)
- show_wait_event_usage(es, waitEventUsagePtr);
+ show_wait_event_usage(es, "Statement Wait Events", waitEventUsagePtr);
/* Show buffer and/or memory usage in planning */
if (peek_buffer_usage(es, bufusage) || mem_counters)
@@ -2335,7 +2335,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
if (es->wal && planstate->instrument)
show_wal_usage(es, &planstate->instrument->instr.walusage);
if (es->waits)
- show_wait_event_usage(es, planstate->wait_event_usage);
+ show_wait_event_usage(es, "Wait Events", planstate->wait_event_usage);
/* Prepare per-worker buffer/WAL usage */
if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -4556,14 +4556,15 @@ wait_event_usage_cmp(const void *a, const void *b)
}
static void
-show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
+show_wait_event_usage(ExplainState *es, const char *labelname,
+ const WaitEventUsage *usage)
{
WaitEventUsageEntry *entries;
if (usage == NULL)
return;
- if (usage->nentries == 0)
+ if (usage->nentries == 0 && usage->overflowed_calls == 0)
return;
if (usage->nentries > 0)
@@ -4577,10 +4578,10 @@ show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
else
entries = NULL;
- if (es->format == EXPLAIN_FORMAT_TEXT)
+ if (es->format == EXPLAIN_FORMAT_TEXT)
{
ExplainIndentText(es);
- appendStringInfoString(es->str, "Wait Events:\n");
+ appendStringInfo(es->str, "%s:\n", labelname);
es->indent++;
for (int i = 0; i < usage->nentries; i++)
@@ -4599,35 +4600,60 @@ show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
INSTR_TIME_GET_MILLISEC(entries[i].time));
}
+ if (usage->overflowed_calls > 0)
+ {
+ ExplainIndentText(es);
+ appendStringInfo(es->str,
+ "Unrecorded Wait Event Calls: calls=%" PRIu64 " time=%0.3f ms\n",
+ usage->overflowed_calls,
+ INSTR_TIME_GET_MILLISEC(usage->overflowed_time));
+ }
+
es->indent--;
}
else
{
- ExplainOpenGroup("Wait-Events", "Wait Events", false, es);
-
- for (int i = 0; i < usage->nentries; i++)
+ if (usage->nentries > 0)
{
- const char *event_type;
- const char *event_name;
+ ExplainOpenGroup("Wait-Events", labelname, false, es);
- event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
- event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+ for (int i = 0; i < usage->nentries; i++)
+ {
+ const char *event_type;
+ const char *event_name;
- ExplainOpenGroup("Wait-Event", NULL, true, es);
- ExplainPropertyText("Wait Event Type",
- event_type ? event_type : "Unknown",
- es);
- ExplainPropertyText("Wait Event",
- event_name ? event_name : "unknown",
- es);
- ExplainPropertyUInteger("Calls", NULL, entries[i].calls, es);
- ExplainPropertyFloat("Time", "ms",
- INSTR_TIME_GET_MILLISEC(entries[i].time),
- 3, es);
- ExplainCloseGroup("Wait-Event", NULL, true, es);
+ event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
+ event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+
+ ExplainOpenGroup("Wait-Event", NULL, true, es);
+ ExplainPropertyText("Wait Event Type",
+ event_type ? event_type : "Unknown",
+ es);
+ ExplainPropertyText("Wait Event",
+ event_name ? event_name : "unknown",
+ es);
+ ExplainPropertyUInteger("Calls", NULL, entries[i].calls, es);
+ ExplainPropertyFloat("Time", "ms",
+ INSTR_TIME_GET_MILLISEC(entries[i].time),
+ 3, es);
+ ExplainCloseGroup("Wait-Event", NULL, true, es);
+ }
+
+ ExplainCloseGroup("Wait-Events", labelname, false, es);
}
- ExplainCloseGroup("Wait-Events", "Wait Events", false, es);
+ if (usage->overflowed_calls > 0)
+ {
+ /*
+ * This is not a wait event identity, so keep it outside the
+ * Wait Events array in structured output.
+ */
+ ExplainPropertyUInteger("Unrecorded Wait Event Calls", NULL,
+ usage->overflowed_calls, es);
+ ExplainPropertyFloat("Unrecorded Wait Event Time", "ms",
+ INSTR_TIME_GET_MILLISEC(usage->overflowed_time),
+ 3, es);
+ }
}
if (entries)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f77f539dd27..520b4b8484f 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,6 +87,8 @@ typedef struct SharedWaitEventUsageWorker
{
int nentries;
dsa_pointer entries;
+ uint64 overflowed_calls;
+ instr_time overflowed_time;
} SharedWaitEventUsageWorker;
/*
@@ -173,6 +175,10 @@ static bool ExecParallelReInitializeDSM(PlanState *planstate,
static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
ExecParallelRetrieveInstrumentationContext *r);
static void ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei);
+static void ExecParallelInitWaitEventUsageWorker(SharedWaitEventUsageWorker *worker);
+static void ExecParallelAccumulateWaitEventUsageWorker(WaitEventUsage *usage,
+ SharedWaitEventUsageWorker *worker,
+ dsa_area *area);
static void ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
dsa_area *area,
const WaitEventUsage *usage);
@@ -904,10 +910,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
wait_event_usage = shm_toc_allocate(pcxt->toc, wait_event_usage_len);
wait_event_usage->num_workers = nworkers;
for (int i = 0; i < nworkers; i++)
- {
- wait_event_usage->worker_usage[i].nentries = 0;
- wait_event_usage->worker_usage[i].entries = InvalidDsaPointer;
- }
+ ExecParallelInitWaitEventUsageWorker(&wait_event_usage->worker_usage[i]);
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAIT_EVENT_USAGE,
wait_event_usage);
pei->wait_event_usage = wait_event_usage;
@@ -944,10 +947,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
worker_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
for (i = 0; i < nworkers * e.nnodes; ++i)
- {
- worker_usage[i].nentries = 0;
- worker_usage[i].entries = InvalidDsaPointer;
- }
+ ExecParallelInitWaitEventUsageWorker(&worker_usage[i]);
}
shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
instrumentation);
@@ -1209,21 +1209,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
wait_event_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
wait_event_usage += i * instrumentation->num_workers;
for (n = 0; n < instrumentation->num_workers; ++n)
- {
- SharedWaitEventUsageWorker *worker = &wait_event_usage[n];
- WaitEventUsageEntry *entries;
-
- if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
- continue;
-
- entries = dsa_get_address(r->area, worker->entries);
- pgstat_accumulate_wait_event_usage(planstate->wait_event_usage,
- entries,
- worker->nentries);
- dsa_free(r->area, worker->entries);
- worker->nentries = 0;
- worker->entries = InvalidDsaPointer;
- }
+ ExecParallelAccumulateWaitEventUsageWorker(planstate->wait_event_usage,
+ &wait_event_usage[n],
+ r->area);
}
/*
@@ -1339,18 +1327,46 @@ ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei)
return;
for (int i = 0; i < shared->num_workers; i++)
+ ExecParallelAccumulateWaitEventUsageWorker(usage,
+ &shared->worker_usage[i],
+ pei->area);
+}
+
+static void
+ExecParallelInitWaitEventUsageWorker(SharedWaitEventUsageWorker *worker)
+{
+ worker->nentries = 0;
+ worker->entries = InvalidDsaPointer;
+ worker->overflowed_calls = 0;
+ INSTR_TIME_SET_ZERO(worker->overflowed_time);
+}
+
+static void
+ExecParallelAccumulateWaitEventUsageWorker(WaitEventUsage *usage,
+ SharedWaitEventUsageWorker *worker,
+ dsa_area *area)
+{
+ Assert(usage != NULL);
+ Assert(worker != NULL);
+ Assert(area != NULL);
+
+ if (worker->overflowed_calls > 0)
{
- SharedWaitEventUsageWorker *worker = &shared->worker_usage[i];
- WaitEventUsageEntry *entries;
+ usage->overflowed_calls += worker->overflowed_calls;
+ INSTR_TIME_ADD(usage->overflowed_time, worker->overflowed_time);
+ worker->overflowed_calls = 0;
+ INSTR_TIME_SET_ZERO(worker->overflowed_time);
+ }
- if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
- continue;
+ if (worker->nentries > 0 && DsaPointerIsValid(worker->entries))
+ {
+ WaitEventUsageEntry *entries;
- entries = dsa_get_address(pei->area, worker->entries);
+ entries = dsa_get_address(area, worker->entries);
pgstat_accumulate_wait_event_usage(usage,
entries,
worker->nentries);
- dsa_free(pei->area, worker->entries);
+ dsa_free(area, worker->entries);
worker->nentries = 0;
worker->entries = InvalidDsaPointer;
}
@@ -1362,24 +1378,65 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
const WaitEventUsage *usage)
{
WaitEventUsageEntry *entries;
+ WaitEventUsageEntry *old_entries = NULL;
dsa_pointer entries_dsa;
Size entries_size;
+ int old_nentries = 0;
+ int new_nentries = 0;
+ int i = 0;
+ int j = 0;
Assert(worker != NULL);
Assert(area != NULL);
Assert(usage != NULL);
+ worker->overflowed_calls += usage->overflowed_calls;
+ INSTR_TIME_ADD(worker->overflowed_time, usage->overflowed_time);
+
if (usage->nentries <= 0)
return;
- entries_size = mul_size(sizeof(WaitEventUsageEntry), usage->nentries);
+ if (DsaPointerIsValid(worker->entries))
+ {
+ Assert(worker->nentries > 0);
+ old_nentries = worker->nentries;
+ old_entries = dsa_get_address(area, worker->entries);
+ }
+
+ entries_size = mul_size(sizeof(WaitEventUsageEntry),
+ (Size) old_nentries + (Size) usage->nentries);
entries_dsa = dsa_allocate(area, entries_size);
entries = dsa_get_address(area, entries_dsa);
- memcpy(entries, usage->entries, entries_size);
+
+ while (i < old_nentries && j < usage->nentries)
+ {
+ WaitEventUsageEntry *entry = &entries[new_nentries];
+ uint32 old_info = old_entries[i].wait_event_info;
+ uint32 new_info = usage->entries[j].wait_event_info;
+
+ if (old_info < new_info)
+ *entry = old_entries[i++];
+ else if (old_info > new_info)
+ *entry = usage->entries[j++];
+ else
+ {
+ *entry = old_entries[i++];
+ entry->calls += usage->entries[j].calls;
+ INSTR_TIME_ADD(entry->time, usage->entries[j].time);
+ j++;
+ }
+
+ new_nentries++;
+ }
+
+ while (i < old_nentries)
+ entries[new_nentries++] = old_entries[i++];
+ while (j < usage->nentries)
+ entries[new_nentries++] = usage->entries[j++];
if (DsaPointerIsValid(worker->entries))
dsa_free(area, worker->entries);
- worker->nentries = usage->nentries;
+ worker->nentries = new_nentries;
worker->entries = entries_dsa;
}
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 3d5a10b44c4..61b418e8fd7 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -39,6 +39,10 @@ static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
static const char *pgstat_get_wait_io(WaitEventIO w);
static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
uint64 calls, const instr_time *elapsed);
+static void WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
+ const instr_time *elapsed);
+static int WaitEventUsageFind(const WaitEventUsage *usage,
+ uint32 wait_event_info, bool *found);
static uint32 local_my_wait_event_info;
@@ -48,7 +52,16 @@ uint32 *my_wait_event_info = &local_my_wait_event_info;
int pgstat_wait_event_usage_depth = 0;
static WaitEventUsage *pgstat_wait_event_usage = NULL;
+
+/*
+ * Top of the active executor node and query-level stacks. Query-level wait
+ * accounting records each wait once in every active query-level collector.
+ * Per-node wait accounting records each wait in every active plan node,
+ * matching the inclusive timing semantics of EXPLAIN ANALYZE plan nodes.
+ */
static WaitEventUsage *pgstat_wait_event_node_usage = NULL;
+static WaitEventUsage *pgstat_wait_event_usage_node_stack = NULL;
+static WaitEventUsage *pgstat_wait_event_usage_query_stack = NULL;
static uint32 pgstat_wait_event_usage_current = 0;
static instr_time pgstat_wait_event_usage_start;
@@ -387,24 +400,37 @@ pgstat_init_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
*
* This is intended for short-lived instrumentation such as EXPLAIN ANALYZE.
* It records waits observed through pgstat_report_wait_start/end in backend
- * local memory. Nested collection is deliberately treated as part of the
- * outer collection for now; callers that want independent nested accounting
- * need a stack of WaitEventUsage contexts.
+ * local memory. Nested top-level collectors are kept in a query-level stack;
+ * a wait is counted once in each active collector.
*/
void
pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
{
+ bool first;
+
Assert(usage != NULL);
Assert(memcontext != NULL);
- if (pgstat_wait_event_usage_depth++ == 0)
+ first = pgstat_wait_event_usage_depth == 0;
+ if (first)
{
- pgstat_init_wait_event_usage(usage, memcontext);
- pgstat_wait_event_usage = usage;
pgstat_wait_event_node_usage = NULL;
+ pgstat_wait_event_usage_node_stack = NULL;
+ pgstat_wait_event_usage_query_stack = NULL;
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
+
+ pgstat_init_wait_event_usage(usage, memcontext);
+ usage->query_parent = pgstat_wait_event_usage;
+ /*
+ * A nested EXPLAIN can error out while one of its plan nodes is active,
+ * skipping the usual node-level restore. Remember the outer node stack so
+ * ending this collector can discard any leaked inner node frames.
+ */
+ usage->saved_node_usage = pgstat_wait_event_node_usage;
+ pgstat_wait_event_usage = usage;
+ pgstat_wait_event_usage_depth++;
}
/*
@@ -415,14 +441,22 @@ pgstat_end_wait_event_usage(WaitEventUsage *usage)
{
Assert(usage != NULL);
Assert(pgstat_wait_event_usage_depth > 0);
+ Assert(pgstat_wait_event_usage == usage);
+
+ if (pgstat_wait_event_usage_current != 0)
+ pgstat_count_wait_event_end();
+
+ pgstat_wait_event_usage = usage->query_parent;
+ usage->query_parent = NULL;
+ pgstat_wait_event_node_usage = usage->saved_node_usage;
+ usage->saved_node_usage = NULL;
if (--pgstat_wait_event_usage_depth == 0)
{
- if (pgstat_wait_event_usage_current != 0)
- pgstat_count_wait_event_end();
-
pgstat_wait_event_usage = NULL;
pgstat_wait_event_node_usage = NULL;
+ pgstat_wait_event_usage_node_stack = NULL;
+ pgstat_wait_event_usage_query_stack = NULL;
pgstat_wait_event_usage_current = 0;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
@@ -433,6 +467,9 @@ pgstat_enter_wait_event_usage(WaitEventUsage *usage)
{
WaitEventUsage *previous = pgstat_wait_event_node_usage;
+ Assert(usage != NULL);
+
+ usage->active_parent = previous;
pgstat_wait_event_node_usage = usage;
return previous;
}
@@ -440,6 +477,9 @@ pgstat_enter_wait_event_usage(WaitEventUsage *usage)
void
pgstat_restore_wait_event_usage(WaitEventUsage *usage)
{
+ Assert(pgstat_wait_event_node_usage == NULL ||
+ pgstat_wait_event_node_usage->active_parent == usage);
+
pgstat_wait_event_node_usage = usage;
}
@@ -460,6 +500,8 @@ pgstat_count_wait_event_start(uint32 wait_event_info)
pgstat_count_wait_event_end();
pgstat_wait_event_usage_current = wait_event_info;
+ pgstat_wait_event_usage_node_stack = pgstat_wait_event_node_usage;
+ pgstat_wait_event_usage_query_stack = pgstat_wait_event_usage;
INSTR_TIME_SET_CURRENT(pgstat_wait_event_usage_start);
}
@@ -480,17 +522,24 @@ pgstat_count_wait_event_end(void)
elapsed = end;
INSTR_TIME_SUBTRACT(elapsed, pgstat_wait_event_usage_start);
- WaitEventUsageAdd(pgstat_wait_event_usage,
- pgstat_wait_event_usage_current,
- 1,
- &elapsed);
- if (pgstat_wait_event_node_usage != NULL)
- WaitEventUsageAdd(pgstat_wait_event_node_usage,
+ for (WaitEventUsage *query_usage = pgstat_wait_event_usage_query_stack;
+ query_usage != NULL;
+ query_usage = query_usage->query_parent)
+ WaitEventUsageAdd(query_usage,
+ pgstat_wait_event_usage_current,
+ 1,
+ &elapsed);
+ for (WaitEventUsage *node_usage = pgstat_wait_event_usage_node_stack;
+ node_usage != NULL;
+ node_usage = node_usage->active_parent)
+ WaitEventUsageAdd(node_usage,
pgstat_wait_event_usage_current,
1,
&elapsed);
pgstat_wait_event_usage_current = 0;
+ pgstat_wait_event_usage_node_stack = NULL;
+ pgstat_wait_event_usage_query_stack = NULL;
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
@@ -509,56 +558,123 @@ pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
&entries[i].time);
}
+/*
+ * Find the existing entry, or the insertion position for a new entry.
+ *
+ * WaitEventUsage entries are kept sorted by wait_event_info so the hot
+ * wait-end path does not need a linear scan through all distinct wait events
+ * already seen by the query or plan node.
+ */
+static int
+WaitEventUsageFind(const WaitEventUsage *usage, uint32 wait_event_info,
+ bool *found)
+{
+ int low = 0;
+ int high = usage->nentries;
+
+ while (low < high)
+ {
+ int mid = low + (high - low) / 2;
+ uint32 entry_info = usage->entries[mid].wait_event_info;
+
+ if (entry_info < wait_event_info)
+ low = mid + 1;
+ else
+ high = mid;
+ }
+
+ *found = low < usage->nentries &&
+ usage->entries[low].wait_event_info == wait_event_info;
+ return low;
+}
+
static void
WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
uint64 calls, const instr_time *elapsed)
{
- WaitEventUsageEntry *entry = NULL;
+ bool found;
+ int idx;
+ WaitEventUsageEntry *entry;
- for (int i = 0; i < usage->nentries; i++)
- {
- if (usage->entries[i].wait_event_info == wait_event_info)
- {
- entry = &usage->entries[i];
- break;
- }
- }
+ idx = WaitEventUsageFind(usage, wait_event_info, &found);
- if (entry == NULL)
+ if (!found)
{
if (usage->nentries >= usage->maxentries)
{
- MemoryContext oldcontext;
int newmaxentries;
+ Size entries_size;
+ WaitEventUsageEntry *newentries;
if (usage->maxentries > 0)
+ {
+ if ((Size) usage->maxentries >
+ MaxAllocSize / sizeof(WaitEventUsageEntry) / 2)
+ {
+ WaitEventUsageAddOverflow(usage, calls, elapsed);
+ return;
+ }
+
newmaxentries = usage->maxentries * 2;
+ }
else
newmaxentries = WAIT_EVENT_USAGE_INITIAL_EVENTS;
- oldcontext = MemoryContextSwitchTo(usage->memcontext);
+ if ((Size) newmaxentries >
+ MaxAllocSize / sizeof(WaitEventUsageEntry))
+ {
+ WaitEventUsageAddOverflow(usage, calls, elapsed);
+ return;
+ }
+
+ entries_size = sizeof(WaitEventUsageEntry) * newmaxentries;
+ /*
+ * Wait completion can happen in a critical section, so growth
+ * must not throw ERROR. If storage cannot be grown without
+ * throwing, preserve total wait time in the overflow bucket.
+ */
if (usage->entries)
- usage->entries = repalloc_array(usage->entries,
- WaitEventUsageEntry,
- newmaxentries);
+ newentries = repalloc_extended(usage->entries, entries_size,
+ MCXT_ALLOC_NO_OOM);
else
- usage->entries = palloc_array(WaitEventUsageEntry,
- newmaxentries);
- MemoryContextSwitchTo(oldcontext);
+ newentries = MemoryContextAllocExtended(usage->memcontext,
+ entries_size,
+ MCXT_ALLOC_NO_OOM);
+ if (newentries == NULL)
+ {
+ WaitEventUsageAddOverflow(usage, calls, elapsed);
+ return;
+ }
+ usage->entries = newentries;
usage->maxentries = newmaxentries;
}
- entry = &usage->entries[usage->nentries++];
+ if (idx < usage->nentries)
+ memmove(&usage->entries[idx + 1], &usage->entries[idx],
+ sizeof(WaitEventUsageEntry) * (usage->nentries - idx));
+
+ entry = &usage->entries[idx];
+ usage->nentries++;
entry->wait_event_info = wait_event_info;
entry->calls = 0;
INSTR_TIME_SET_ZERO(entry->time);
}
+ else
+ entry = &usage->entries[idx];
entry->calls += calls;
INSTR_TIME_ADD(entry->time, *elapsed);
}
+static void
+WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
+ const instr_time *elapsed)
+{
+ usage->overflowed_calls += calls;
+ INSTR_TIME_ADD(usage->overflowed_time, *elapsed);
+}
+
/* ----------
* pgstat_get_wait_event_type() -
*
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index db65d130fcb..b33490e7415 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4524,9 +4524,9 @@ match_previous_words(int pattern_id,
*/
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
COMPLETE_WITH("ANALYZE", "VERBOSE", "COSTS", "SETTINGS", "GENERIC_PLAN",
- "BUFFERS", "SERIALIZE", "WAL", "TIMING", "SUMMARY",
+ "BUFFERS", "SERIALIZE", "WAL", "WAITS", "TIMING", "SUMMARY",
"MEMORY", "FORMAT");
- else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|TIMING|SUMMARY|MEMORY"))
+ else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|WAITS|TIMING|SUMMARY|MEMORY"))
COMPLETE_WITH("ON", "OFF");
else if (TailMatches("SERIALIZE"))
COMPLETE_WITH("TEXT", "NONE", "BINARY");
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 64411288403..f14945cdb16 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -25,9 +25,14 @@ typedef struct WaitEventUsageEntry
typedef struct WaitEventUsage
{
MemoryContext memcontext;
+ struct WaitEventUsage *active_parent; /* active plan-node stack link */
+ struct WaitEventUsage *query_parent; /* active query-level stack link */
+ struct WaitEventUsage *saved_node_usage; /* node stack at query start */
int nentries;
int maxentries;
WaitEventUsageEntry *entries;
+ uint64 overflowed_calls;
+ instr_time overflowed_time;
} WaitEventUsage;
extern const char *pgstat_get_wait_event(uint32 wait_event_info);
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index c086fa48d80..275292d6172 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -47,6 +47,21 @@ begin
return data::jsonb;
end;
$$;
+-- For tests that assert numeric EXPLAIN fields, preserve unfiltered JSON.
+create function explain_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+ data text := '';
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ data := data || ln;
+ end loop;
+ return data::jsonb;
+end;
+$$;
-- Disable JIT, or we'll get different output on machines where that's been
-- forced on
set jit = off;
@@ -106,11 +121,11 @@ select explain_filter('explain (analyze, waits, costs off, summary off, timing o
Result (actual rows=N.N loops=N)
Wait Events:
Timeout:PgSleep calls=N time=N.N ms
- Wait Events:
+ Statement Wait Events:
Timeout:PgSleep calls=N time=N.N ms
(5 rows)
-select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Statement Wait Events,0}';
?column?
----------------------------------------------------------------------------------
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
@@ -122,6 +137,52 @@ select explain_filter_to_json('explain (analyze, waits, costs off, summary off,
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)
+begin;
+create function pg_temp.nested_explain_waits() returns void
+ language plpgsql as
+$$
+begin
+ perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)');
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Statement Wait Events,0}';
+ ?column?
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Plan,Wait Events,0}';
+ ?column?
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
+begin;
+-- If a nested EXPLAIN errors while one of its plan nodes is active, ending
+-- that collector must restore the outer node wait-attribution stack. The
+-- volatile division-by-zero helper keeps the error at execution time.
+create function pg_temp.explain_waits_divzero() returns int
+ language plpgsql volatile as $$begin return 1 / 0; end$$;
+create function pg_temp.nested_explain_waits_error() returns void
+ language plpgsql as
+$$
+begin
+ begin
+ perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.explain_waits_divzero()');
+ exception when division_by_zero then
+ null;
+ end;
+ perform pg_sleep(0.01);
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits_error()') #> '{0,Plan,Wait Events,0}';
+ ?column?
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
begin;
create function pg_temp.parallel_pg_sleep(float8) returns void
language internal volatile parallel safe as 'pg_sleep';
@@ -130,7 +191,7 @@ set local max_parallel_workers_per_gather = 1;
select jsonb_path_query_first(
explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
select pg_temp.parallel_pg_sleep(0.01)
- from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+ from tenk1 where unique1 = 1') #> '{0,Statement Wait Events}',
'$[*] ? (@."Wait Event" == "PgSleep")'
);
jsonb_path_query_first
@@ -149,6 +210,53 @@ select jsonb_path_query_first(
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)
+rollback;
+begin;
+-- This test deliberately creates a rescanned parallel-aware Index Scan.
+-- The planner GUCs and tenk1 parallel_workers reloption are test-only
+-- scaffolding to make the parallel rescanned node shape deterministic. The
+-- STABLE PARALLEL SAFE wrapper around pg_sleep() creates a runtime key wait
+-- under the parallel-aware Index Scan. The invariant checked below is that
+-- PgSleep calls accumulated for that node cover all reported scan loops; this
+-- fails if per-node worker wait usage is replaced on relaunch instead of
+-- merged across worker reports.
+create function pg_temp.explain_waits_parallel_sleep_int(int) returns int
+ language plpgsql stable parallel safe as $$begin perform pg_sleep(0.001); return $1; end$$;
+alter table tenk1 set (parallel_workers = 4);
+set local parallel_setup_cost = 0;
+set local parallel_tuple_cost = 0;
+set local max_parallel_workers_per_gather = 4;
+set local parallel_leader_participation = off;
+set local min_parallel_index_scan_size = 0;
+set local enable_seqscan = off;
+set local enable_bitmapscan = off;
+set local enable_material = off;
+set local random_page_cost = 2;
+with plan_json as (
+ select explain_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select * from
+ (select count(unique1) from tenk1
+ where hundred > pg_temp.explain_waits_parallel_sleep_int(10)) ss
+ right join (values (1),(2),(3)) v(x) on true') #> '{0,Plan}' as plan
+),
+parallel_scan as (
+ select jsonb_path_query_first(plan,
+ '$.** ? (@."Node Type" == "Index Scan" && @."Parallel Aware" == true)') as node
+ from plan_json
+),
+pgsleep_wait as (
+ select node,
+ jsonb_path_query_first(node,
+ '$."Wait Events"[*] ? (@."Wait Event" == "PgSleep")') as wait
+ from parallel_scan
+)
+select (wait->>'Calls')::numeric >= (node->>'Actual Loops')::numeric as "parallel rescan waits accumulated"
+from pgsleep_wait;
+ parallel rescan waits accumulated
+-----------------------------------
+ t
+(1 row)
+
rollback;
begin;
-- This test deliberately creates a Bitmap Index Scan runtime-key wait.
@@ -157,6 +265,9 @@ begin;
-- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
-- The planner GUCs below are likewise test-only scaffolding to make the
-- node shape deterministic.
+-- The two JSONPath checks verify inclusive per-node attribution: the same
+-- runtime-key wait appears on the Bitmap Index Scan and its Bitmap Heap Scan
+-- parent, while query-level wait accounting still counts the wait once.
create function pg_temp.explain_waits_sleep_int(int) returns int
language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
create temp table explain_waits_bitmap (a int);
@@ -176,6 +287,17 @@ select jsonb_path_query_first(
{"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
(1 row)
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select * from explain_waits_bitmap
+ where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+ '$.** ? (@."Node Type" == "Bitmap Heap Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+ jsonb_path_query_first
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
rollback;
explain (waits) select 1;
ERROR: EXPLAIN option WAITS requires ANALYZE
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index bd196f248db..9f50ef189b0 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -51,6 +51,22 @@ begin
end;
$$;
+-- For tests that assert numeric EXPLAIN fields, preserve unfiltered JSON.
+create function explain_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+ data text := '';
+ ln text;
+begin
+ for ln in execute $1
+ loop
+ data := data || ln;
+ end loop;
+ return data::jsonb;
+end;
+$$;
+
-- Disable JIT, or we'll get different output on machines where that's been
-- forced on
set jit = off;
@@ -70,9 +86,40 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
-- WAITS option
select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
-select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Statement Wait Events,0}';
select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Plan,Wait Events,0}';
begin;
+create function pg_temp.nested_explain_waits() returns void
+ language plpgsql as
+$$
+begin
+ perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)');
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Statement Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Plan,Wait Events,0}';
+rollback;
+begin;
+-- If a nested EXPLAIN errors while one of its plan nodes is active, ending
+-- that collector must restore the outer node wait-attribution stack. The
+-- volatile division-by-zero helper keeps the error at execution time.
+create function pg_temp.explain_waits_divzero() returns int
+ language plpgsql volatile as $$begin return 1 / 0; end$$;
+create function pg_temp.nested_explain_waits_error() returns void
+ language plpgsql as
+$$
+begin
+ begin
+ perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.explain_waits_divzero()');
+ exception when division_by_zero then
+ null;
+ end;
+ perform pg_sleep(0.01);
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits_error()') #> '{0,Plan,Wait Events,0}';
+rollback;
+begin;
create function pg_temp.parallel_pg_sleep(float8) returns void
language internal volatile parallel safe as 'pg_sleep';
set local debug_parallel_query = on;
@@ -80,7 +127,7 @@ set local max_parallel_workers_per_gather = 1;
select jsonb_path_query_first(
explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
select pg_temp.parallel_pg_sleep(0.01)
- from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+ from tenk1 where unique1 = 1') #> '{0,Statement Wait Events}',
'$[*] ? (@."Wait Event" == "PgSleep")'
);
select jsonb_path_query_first(
@@ -91,12 +138,57 @@ select jsonb_path_query_first(
);
rollback;
begin;
+-- This test deliberately creates a rescanned parallel-aware Index Scan.
+-- The planner GUCs and tenk1 parallel_workers reloption are test-only
+-- scaffolding to make the parallel rescanned node shape deterministic. The
+-- STABLE PARALLEL SAFE wrapper around pg_sleep() creates a runtime key wait
+-- under the parallel-aware Index Scan. The invariant checked below is that
+-- PgSleep calls accumulated for that node cover all reported scan loops; this
+-- fails if per-node worker wait usage is replaced on relaunch instead of
+-- merged across worker reports.
+create function pg_temp.explain_waits_parallel_sleep_int(int) returns int
+ language plpgsql stable parallel safe as $$begin perform pg_sleep(0.001); return $1; end$$;
+alter table tenk1 set (parallel_workers = 4);
+set local parallel_setup_cost = 0;
+set local parallel_tuple_cost = 0;
+set local max_parallel_workers_per_gather = 4;
+set local parallel_leader_participation = off;
+set local min_parallel_index_scan_size = 0;
+set local enable_seqscan = off;
+set local enable_bitmapscan = off;
+set local enable_material = off;
+set local random_page_cost = 2;
+with plan_json as (
+ select explain_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select * from
+ (select count(unique1) from tenk1
+ where hundred > pg_temp.explain_waits_parallel_sleep_int(10)) ss
+ right join (values (1),(2),(3)) v(x) on true') #> '{0,Plan}' as plan
+),
+parallel_scan as (
+ select jsonb_path_query_first(plan,
+ '$.** ? (@."Node Type" == "Index Scan" && @."Parallel Aware" == true)') as node
+ from plan_json
+),
+pgsleep_wait as (
+ select node,
+ jsonb_path_query_first(node,
+ '$."Wait Events"[*] ? (@."Wait Event" == "PgSleep")') as wait
+ from parallel_scan
+)
+select (wait->>'Calls')::numeric >= (node->>'Actual Loops')::numeric as "parallel rescan waits accumulated"
+from pgsleep_wait;
+rollback;
+begin;
-- This test deliberately creates a Bitmap Index Scan runtime-key wait.
-- The STABLE PL/pgSQL wrapper is test scaffolding: STABLE lets the
-- expression be used as an index runtime key, while PL/pgSQL prevents SQL
-- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
-- The planner GUCs below are likewise test-only scaffolding to make the
-- node shape deterministic.
+-- The two JSONPath checks verify inclusive per-node attribution: the same
+-- runtime-key wait appears on the Bitmap Index Scan and its Bitmap Heap Scan
+-- parent, while query-level wait accounting still counts the wait once.
create function pg_temp.explain_waits_sleep_int(int) returns int
language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
create temp table explain_waits_bitmap (a int);
@@ -111,6 +203,12 @@ select jsonb_path_query_first(
where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
'$.** ? (@."Node Type" == "Bitmap Index Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
);
+select jsonb_path_query_first(
+ explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+ select * from explain_waits_bitmap
+ where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+ '$.** ? (@."Node Type" == "Bitmap Heap Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
rollback;
explain (waits) select 1;
--
2.52.0
[application/octet-stream] 0005-Harden-EXPLAIN-WAITS-accumulator-handling.patch (10.6K, 7-0005-Harden-EXPLAIN-WAITS-accumulator-handling.patch)
download | inline diff:
From 7535889991ec4af6104e4ac241f9fca7b603258e Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:45:44 +0500
Subject: [RFC PATCH v1 5/7] Harden EXPLAIN WAITS accumulator handling
---
doc/src/sgml/ref/explain.sgml | 15 ++--
src/backend/commands/explain.c | 2 +-
src/backend/executor/execProcnode.c | 11 +--
src/backend/utils/activity/wait_event.c | 100 ++++++++++--------------
src/include/utils/wait_event.h | 10 +--
5 files changed, 56 insertions(+), 82 deletions(-)
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index d699b215120..7fa4b1cd955 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -326,13 +326,14 @@ ROLLBACK;
</para>
<para>
- If <command>EXPLAIN</command> cannot grow its per-query or per-node wait
- event storage without risking an error while a wait is ending, waits
- whose exact event identifier could not be stored are accumulated in an
- <literal>Unrecorded Wait Event Calls</literal> counter and
- <literal>Unrecorded Wait Event Time</literal> total. This is a
- reporting fallback under memory pressure, not a wait event emitted by
- server instrumentation.
+ Each statement and plan-node accumulator preallocates storage for up to
+ <literal>64</literal> distinct wait event identifiers. This bound
+ avoids memory allocation while a wait is ending. If more distinct wait
+ event identifiers are observed, waits whose exact event identifier could
+ not be stored are accumulated in an <literal>Unrecorded Wait Event
+ Calls</literal> counter and <literal>Unrecorded Wait Event Time</literal>
+ total. This is a reporting fallback, not a wait event emitted by server
+ instrumentation.
</para>
<para>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9c198f8e599..ee69d723cd8 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -4578,7 +4578,7 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
else
entries = NULL;
- if (es->format == EXPLAIN_FORMAT_TEXT)
+ if (es->format == EXPLAIN_FORMAT_TEXT)
{
ExplainIndentText(es);
appendStringInfo(es->str, "%s:\n", labelname);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 081855b3fed..deee14839f2 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -417,15 +417,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
result->instrument = InstrAllocNode(estate->es_instrument,
result->async_capable);
if (estate->es_instrument & INSTRUMENT_WAITS)
- {
- MemoryContext oldcontext;
-
- oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
- result->wait_event_usage = palloc_object(WaitEventUsage);
- pgstat_init_wait_event_usage(result->wait_event_usage,
- estate->es_query_cxt);
- MemoryContextSwitchTo(oldcontext);
- }
+ result->wait_event_usage =
+ pgstat_create_wait_event_usage(estate->es_query_cxt);
return result;
}
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 61b418e8fd7..67980cc0a3b 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -27,7 +27,6 @@
#include "storage/shmem.h"
#include "storage/subsystems.h"
#include "storage/spin.h"
-#include "utils/memutils.h"
#include "utils/wait_event.h"
@@ -43,15 +42,25 @@ static void WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
const instr_time *elapsed);
static int WaitEventUsageFind(const WaitEventUsage *usage,
uint32 wait_event_info, bool *found);
+static void WaitEventUsageInit(WaitEventUsage *usage,
+ MemoryContext memcontext);
static uint32 local_my_wait_event_info;
uint32 *my_wait_event_info = &local_my_wait_event_info;
-#define WAIT_EVENT_USAGE_INITIAL_EVENTS 16
+/*
+ * Hardcoded limit: each EXPLAIN WAITS statement-level or plan-node accumulator
+ * can record this many distinct wait event identities without allocating while
+ * waits are ending. Additional distinct wait identities are accounted for in
+ * the overflow bucket.
+ */
+#define WAIT_EVENT_USAGE_MAX_EVENTS 64
-int pgstat_wait_event_usage_depth = 0;
+/* Fast-path flag exported for inline pgstat_report_wait_start/end(). */
+bool pgstat_wait_event_usage_active = false;
static WaitEventUsage *pgstat_wait_event_usage = NULL;
+static int pgstat_wait_event_usage_depth = 0;
/*
* Top of the active executor node and query-level stacks. Query-level wait
@@ -373,26 +382,36 @@ pgstat_reset_wait_event_storage(void)
my_wait_event_info = &local_my_wait_event_info;
}
+/*
+ * Allocate and initialize a wait event usage accumulator.
+ */
+WaitEventUsage *
+pgstat_create_wait_event_usage(MemoryContext memcontext)
+{
+ WaitEventUsage *usage;
+
+ Assert(memcontext != NULL);
+
+ usage = MemoryContextAlloc(memcontext, sizeof(WaitEventUsage));
+ WaitEventUsageInit(usage, memcontext);
+ return usage;
+}
+
/*
* Initialize a wait event usage accumulator.
*/
-void
-pgstat_init_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+static void
+WaitEventUsageInit(WaitEventUsage *usage, MemoryContext memcontext)
{
Assert(usage != NULL);
Assert(memcontext != NULL);
memset(usage, 0, sizeof(WaitEventUsage));
- /*
- * Wait events may end inside critical sections, for example while
- * performing synchronous I/O. Keep usage entries in a dedicated context
- * where the memory manager permits that accounting path to grow.
- */
- usage->memcontext = AllocSetContextCreate(memcontext,
- "Wait Event Usage",
- ALLOCSET_SMALL_SIZES);
- MemoryContextAllowInCriticalSection(usage->memcontext, true);
+ usage->entries = MemoryContextAlloc(memcontext,
+ sizeof(WaitEventUsageEntry) *
+ WAIT_EVENT_USAGE_MAX_EVENTS);
+ usage->maxentries = WAIT_EVENT_USAGE_MAX_EVENTS;
}
/*
@@ -421,7 +440,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
- pgstat_init_wait_event_usage(usage, memcontext);
+ WaitEventUsageInit(usage, memcontext);
usage->query_parent = pgstat_wait_event_usage;
/*
* A nested EXPLAIN can error out while one of its plan nodes is active,
@@ -431,6 +450,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
usage->saved_node_usage = pgstat_wait_event_node_usage;
pgstat_wait_event_usage = usage;
pgstat_wait_event_usage_depth++;
+ pgstat_wait_event_usage_active = true;
}
/*
@@ -453,6 +473,7 @@ pgstat_end_wait_event_usage(WaitEventUsage *usage)
if (--pgstat_wait_event_usage_depth == 0)
{
+ pgstat_wait_event_usage_active = false;
pgstat_wait_event_usage = NULL;
pgstat_wait_event_node_usage = NULL;
pgstat_wait_event_usage_node_stack = NULL;
@@ -602,52 +623,13 @@ WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
{
if (usage->nentries >= usage->maxentries)
{
- int newmaxentries;
- Size entries_size;
- WaitEventUsageEntry *newentries;
-
- if (usage->maxentries > 0)
- {
- if ((Size) usage->maxentries >
- MaxAllocSize / sizeof(WaitEventUsageEntry) / 2)
- {
- WaitEventUsageAddOverflow(usage, calls, elapsed);
- return;
- }
-
- newmaxentries = usage->maxentries * 2;
- }
- else
- newmaxentries = WAIT_EVENT_USAGE_INITIAL_EVENTS;
-
- if ((Size) newmaxentries >
- MaxAllocSize / sizeof(WaitEventUsageEntry))
- {
- WaitEventUsageAddOverflow(usage, calls, elapsed);
- return;
- }
-
- entries_size = sizeof(WaitEventUsageEntry) * newmaxentries;
/*
- * Wait completion can happen in a critical section, so growth
- * must not throw ERROR. If storage cannot be grown without
- * throwing, preserve total wait time in the overflow bucket.
+ * Wait-end accounting must not allocate: it can run in a critical
+ * section. Preserve total calls/time without the exact event
+ * identity once preallocated storage is full.
*/
- if (usage->entries)
- newentries = repalloc_extended(usage->entries, entries_size,
- MCXT_ALLOC_NO_OOM);
- else
- newentries = MemoryContextAllocExtended(usage->memcontext,
- entries_size,
- MCXT_ALLOC_NO_OOM);
- if (newentries == NULL)
- {
- WaitEventUsageAddOverflow(usage, calls, elapsed);
- return;
- }
-
- usage->entries = newentries;
- usage->maxentries = newmaxentries;
+ WaitEventUsageAddOverflow(usage, calls, elapsed);
+ return;
}
if (idx < usage->nentries)
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index f14945cdb16..67497790307 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -24,7 +24,6 @@ typedef struct WaitEventUsageEntry
typedef struct WaitEventUsage
{
- MemoryContext memcontext;
struct WaitEventUsage *active_parent; /* active plan-node stack link */
struct WaitEventUsage *query_parent; /* active query-level stack link */
struct WaitEventUsage *saved_node_usage; /* node stack at query start */
@@ -41,8 +40,7 @@ static inline void pgstat_report_wait_start(uint32 wait_event_info);
static inline void pgstat_report_wait_end(void);
extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
extern void pgstat_reset_wait_event_storage(void);
-extern void pgstat_init_wait_event_usage(WaitEventUsage *usage,
- MemoryContext memcontext);
+extern WaitEventUsage *pgstat_create_wait_event_usage(MemoryContext memcontext);
extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
MemoryContext memcontext);
extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
@@ -55,7 +53,7 @@ extern void pgstat_count_wait_event_start(uint32 wait_event_info);
extern void pgstat_count_wait_event_end(void);
extern PGDLLIMPORT uint32 *my_wait_event_info;
-extern PGDLLIMPORT int pgstat_wait_event_usage_depth;
+extern PGDLLIMPORT bool pgstat_wait_event_usage_active;
/*
@@ -101,7 +99,7 @@ extern char **GetWaitEventCustomNames(uint32 classId, int *nwaitevents);
static inline void
pgstat_report_wait_start(uint32 wait_event_info)
{
- if (pgstat_wait_event_usage_depth > 0)
+ if (unlikely(pgstat_wait_event_usage_active))
pgstat_count_wait_event_start(wait_event_info);
/*
@@ -120,7 +118,7 @@ pgstat_report_wait_start(uint32 wait_event_info)
static inline void
pgstat_report_wait_end(void)
{
- if (pgstat_wait_event_usage_depth > 0)
+ if (unlikely(pgstat_wait_event_usage_active))
pgstat_count_wait_event_end();
/* see pgstat_report_wait_start() */
--
2.52.0
[application/octet-stream] 0006-Hide-EXPLAIN-WAITS-accumulator-internals.patch (14.5K, 8-0006-Hide-EXPLAIN-WAITS-accumulator-internals.patch)
download | inline diff:
From e9f1030cd864c93a3e17d53f4f344b3202bb57bb Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:46:12 +0500
Subject: [RFC PATCH v1 6/7] Hide EXPLAIN WAITS accumulator internals
---
src/backend/commands/explain.c | 45 ++++++++++--------
src/backend/executor/execParallel.c | 44 +++++++++--------
src/backend/utils/activity/wait_event.c | 63 +++++++++++++++++++++++--
src/include/utils/wait_event.h | 32 ++++++++-----
4 files changed, 129 insertions(+), 55 deletions(-)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee69d723cd8..0e2ec510fee 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -514,7 +514,6 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
- WaitEventUsage waitEventUsage;
WaitEventUsage *waitEventUsagePtr = NULL;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -593,9 +592,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
if (es->waits)
{
- waitEventUsagePtr = &waitEventUsage;
- pgstat_begin_wait_event_usage(waitEventUsagePtr,
- queryDesc->estate->es_query_cxt);
+ waitEventUsagePtr =
+ pgstat_begin_wait_event_usage(queryDesc->estate->es_query_cxt);
queryDesc->estate->es_wait_event_usage = waitEventUsagePtr;
}
@@ -4559,20 +4557,29 @@ static void
show_wait_event_usage(ExplainState *es, const char *labelname,
const WaitEventUsage *usage)
{
+ const WaitEventUsageEntry *usage_entries;
WaitEventUsageEntry *entries;
+ uint64 overflowed_calls;
+ instr_time overflowed_time;
+ int nentries;
if (usage == NULL)
return;
- if (usage->nentries == 0 && usage->overflowed_calls == 0)
+ if (pgstat_wait_event_usage_is_empty(usage))
return;
- if (usage->nentries > 0)
+ nentries = pgstat_get_wait_event_usage_entries(usage, &usage_entries);
+ pgstat_get_wait_event_usage_overflow(usage,
+ &overflowed_calls,
+ &overflowed_time);
+
+ if (nentries > 0)
{
- entries = palloc_array(WaitEventUsageEntry, usage->nentries);
- memcpy(entries, usage->entries,
- sizeof(WaitEventUsageEntry) * usage->nentries);
- qsort(entries, usage->nentries, sizeof(WaitEventUsageEntry),
+ entries = palloc_array(WaitEventUsageEntry, nentries);
+ memcpy(entries, usage_entries,
+ sizeof(WaitEventUsageEntry) * nentries);
+ qsort(entries, nentries, sizeof(WaitEventUsageEntry),
wait_event_usage_cmp);
}
else
@@ -4584,7 +4591,7 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
appendStringInfo(es->str, "%s:\n", labelname);
es->indent++;
- for (int i = 0; i < usage->nentries; i++)
+ for (int i = 0; i < nentries; i++)
{
const char *event_type;
const char *event_name;
@@ -4600,24 +4607,24 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
INSTR_TIME_GET_MILLISEC(entries[i].time));
}
- if (usage->overflowed_calls > 0)
+ if (overflowed_calls > 0)
{
ExplainIndentText(es);
appendStringInfo(es->str,
"Unrecorded Wait Event Calls: calls=%" PRIu64 " time=%0.3f ms\n",
- usage->overflowed_calls,
- INSTR_TIME_GET_MILLISEC(usage->overflowed_time));
+ overflowed_calls,
+ INSTR_TIME_GET_MILLISEC(overflowed_time));
}
es->indent--;
}
else
{
- if (usage->nentries > 0)
+ if (nentries > 0)
{
ExplainOpenGroup("Wait-Events", labelname, false, es);
- for (int i = 0; i < usage->nentries; i++)
+ for (int i = 0; i < nentries; i++)
{
const char *event_type;
const char *event_name;
@@ -4642,16 +4649,16 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
ExplainCloseGroup("Wait-Events", labelname, false, es);
}
- if (usage->overflowed_calls > 0)
+ if (overflowed_calls > 0)
{
/*
* This is not a wait event identity, so keep it outside the
* Wait Events array in structured output.
*/
ExplainPropertyUInteger("Unrecorded Wait Event Calls", NULL,
- usage->overflowed_calls, es);
+ overflowed_calls, es);
ExplainPropertyFloat("Unrecorded Wait Event Time", "ms",
- INSTR_TIME_GET_MILLISEC(usage->overflowed_time),
+ INSTR_TIME_GET_MILLISEC(overflowed_time),
3, es);
}
}
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 520b4b8484f..dcd06c718c8 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1352,8 +1352,9 @@ ExecParallelAccumulateWaitEventUsageWorker(WaitEventUsage *usage,
if (worker->overflowed_calls > 0)
{
- usage->overflowed_calls += worker->overflowed_calls;
- INSTR_TIME_ADD(usage->overflowed_time, worker->overflowed_time);
+ pgstat_accumulate_wait_event_usage_overflow(usage,
+ worker->overflowed_calls,
+ &worker->overflowed_time);
worker->overflowed_calls = 0;
INSTR_TIME_SET_ZERO(worker->overflowed_time);
}
@@ -1377,11 +1378,15 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
dsa_area *area,
const WaitEventUsage *usage)
{
+ const WaitEventUsageEntry *usage_entries;
WaitEventUsageEntry *entries;
WaitEventUsageEntry *old_entries = NULL;
dsa_pointer entries_dsa;
+ uint64 overflowed_calls;
+ instr_time overflowed_time;
Size entries_size;
int old_nentries = 0;
+ int usage_nentries;
int new_nentries = 0;
int i = 0;
int j = 0;
@@ -1390,10 +1395,15 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
Assert(area != NULL);
Assert(usage != NULL);
- worker->overflowed_calls += usage->overflowed_calls;
- INSTR_TIME_ADD(worker->overflowed_time, usage->overflowed_time);
+ usage_nentries =
+ pgstat_get_wait_event_usage_entries(usage, &usage_entries);
+ pgstat_get_wait_event_usage_overflow(usage,
+ &overflowed_calls,
+ &overflowed_time);
+ worker->overflowed_calls += overflowed_calls;
+ INSTR_TIME_ADD(worker->overflowed_time, overflowed_time);
- if (usage->nentries <= 0)
+ if (usage_nentries <= 0)
return;
if (DsaPointerIsValid(worker->entries))
@@ -1404,25 +1414,25 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
}
entries_size = mul_size(sizeof(WaitEventUsageEntry),
- (Size) old_nentries + (Size) usage->nentries);
+ (Size) old_nentries + (Size) usage_nentries);
entries_dsa = dsa_allocate(area, entries_size);
entries = dsa_get_address(area, entries_dsa);
- while (i < old_nentries && j < usage->nentries)
+ while (i < old_nentries && j < usage_nentries)
{
WaitEventUsageEntry *entry = &entries[new_nentries];
uint32 old_info = old_entries[i].wait_event_info;
- uint32 new_info = usage->entries[j].wait_event_info;
+ uint32 new_info = usage_entries[j].wait_event_info;
if (old_info < new_info)
*entry = old_entries[i++];
else if (old_info > new_info)
- *entry = usage->entries[j++];
+ *entry = usage_entries[j++];
else
{
*entry = old_entries[i++];
- entry->calls += usage->entries[j].calls;
- INSTR_TIME_ADD(entry->time, usage->entries[j].time);
+ entry->calls += usage_entries[j].calls;
+ INSTR_TIME_ADD(entry->time, usage_entries[j].time);
j++;
}
@@ -1431,8 +1441,8 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
while (i < old_nentries)
entries[new_nentries++] = old_entries[i++];
- while (j < usage->nentries)
- entries[new_nentries++] = usage->entries[j++];
+ while (j < usage_nentries)
+ entries[new_nentries++] = usage_entries[j++];
if (DsaPointerIsValid(worker->entries))
dsa_free(area, worker->entries);
@@ -1781,7 +1791,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
QueryDesc *queryDesc;
SharedExecutorInstrumentation *instrumentation;
SharedJitInstrumentation *jit_instrumentation;
- WaitEventUsage waitEventUsage;
WaitEventUsage *waitEventUsagePtr = NULL;
int instrument_options = 0;
void *area_space;
@@ -1841,11 +1850,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
InstrStartParallelQuery();
if (wait_event_usage != NULL)
- {
- waitEventUsagePtr = &waitEventUsage;
- pgstat_begin_wait_event_usage(waitEventUsagePtr,
- queryDesc->estate->es_query_cxt);
- }
+ waitEventUsagePtr =
+ pgstat_begin_wait_event_usage(queryDesc->estate->es_query_cxt);
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 67980cc0a3b..9719e38729e 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -36,6 +36,17 @@ static const char *pgstat_get_wait_client(WaitEventClient w);
static const char *pgstat_get_wait_ipc(WaitEventIPC w);
static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
static const char *pgstat_get_wait_io(WaitEventIO w);
+struct WaitEventUsage
+{
+ struct WaitEventUsage *active_parent; /* active plan-node stack link */
+ struct WaitEventUsage *query_parent; /* active query-level stack link */
+ struct WaitEventUsage *saved_node_usage; /* node stack at query start */
+ int nentries;
+ int maxentries;
+ WaitEventUsageEntry *entries;
+ uint64 overflowed_calls;
+ instr_time overflowed_time;
+};
static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
uint64 calls, const instr_time *elapsed);
static void WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
@@ -422,12 +433,12 @@ WaitEventUsageInit(WaitEventUsage *usage, MemoryContext memcontext)
* local memory. Nested top-level collectors are kept in a query-level stack;
* a wait is counted once in each active collector.
*/
-void
-pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+WaitEventUsage *
+pgstat_begin_wait_event_usage(MemoryContext memcontext)
{
+ WaitEventUsage *usage;
bool first;
- Assert(usage != NULL);
Assert(memcontext != NULL);
first = pgstat_wait_event_usage_depth == 0;
@@ -440,7 +451,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
}
- WaitEventUsageInit(usage, memcontext);
+ usage = pgstat_create_wait_event_usage(memcontext);
usage->query_parent = pgstat_wait_event_usage;
/*
* A nested EXPLAIN can error out while one of its plan nodes is active,
@@ -451,6 +462,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
pgstat_wait_event_usage = usage;
pgstat_wait_event_usage_depth++;
pgstat_wait_event_usage_active = true;
+ return usage;
}
/*
@@ -579,6 +591,49 @@ pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
&entries[i].time);
}
+void
+pgstat_accumulate_wait_event_usage_overflow(WaitEventUsage *usage,
+ uint64 calls,
+ const instr_time *elapsed)
+{
+ Assert(usage != NULL);
+ Assert(elapsed != NULL);
+
+ WaitEventUsageAddOverflow(usage, calls, elapsed);
+}
+
+bool
+pgstat_wait_event_usage_is_empty(const WaitEventUsage *usage)
+{
+ Assert(usage != NULL);
+
+ return usage->nentries == 0 && usage->overflowed_calls == 0;
+}
+
+int
+pgstat_get_wait_event_usage_entries(const WaitEventUsage *usage,
+ const WaitEventUsageEntry **entries)
+{
+ Assert(usage != NULL);
+ Assert(entries != NULL);
+
+ *entries = usage->entries;
+ return usage->nentries;
+}
+
+void
+pgstat_get_wait_event_usage_overflow(const WaitEventUsage *usage,
+ uint64 *calls,
+ instr_time *elapsed)
+{
+ Assert(usage != NULL);
+ Assert(calls != NULL);
+ Assert(elapsed != NULL);
+
+ *calls = usage->overflowed_calls;
+ *elapsed = usage->overflowed_time;
+}
+
/*
* Find the existing entry, or the insertion position for a new entry.
*
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 67497790307..19763cfcae5 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -15,6 +15,12 @@
#include "utils/palloc.h"
#include "utils/wait_event_types.h"
+/*
+ * EXPLAIN wait event accounting support. WaitEventUsage is intentionally
+ * opaque outside wait_event.c; callers should allocate, accumulate, and read
+ * it through the functions below. WaitEventUsageEntry is the reportable
+ * tuple copied to EXPLAIN output and parallel-worker storage.
+ */
typedef struct WaitEventUsageEntry
{
uint32 wait_event_info;
@@ -22,17 +28,7 @@ typedef struct WaitEventUsageEntry
instr_time time;
} WaitEventUsageEntry;
-typedef struct WaitEventUsage
-{
- struct WaitEventUsage *active_parent; /* active plan-node stack link */
- struct WaitEventUsage *query_parent; /* active query-level stack link */
- struct WaitEventUsage *saved_node_usage; /* node stack at query start */
- int nentries;
- int maxentries;
- WaitEventUsageEntry *entries;
- uint64 overflowed_calls;
- instr_time overflowed_time;
-} WaitEventUsage;
+typedef struct WaitEventUsage WaitEventUsage;
extern const char *pgstat_get_wait_event(uint32 wait_event_info);
extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
@@ -40,13 +36,23 @@ static inline void pgstat_report_wait_start(uint32 wait_event_info);
static inline void pgstat_report_wait_end(void);
extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
extern void pgstat_reset_wait_event_storage(void);
+
+/* EXPLAIN wait event accounting. */
extern WaitEventUsage *pgstat_create_wait_event_usage(MemoryContext memcontext);
-extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
- MemoryContext memcontext);
+extern WaitEventUsage *pgstat_begin_wait_event_usage(MemoryContext memcontext);
extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
extern void pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
const WaitEventUsageEntry *entries,
int nentries);
+extern void pgstat_accumulate_wait_event_usage_overflow(WaitEventUsage *usage,
+ uint64 calls,
+ const instr_time *elapsed);
+extern bool pgstat_wait_event_usage_is_empty(const WaitEventUsage *usage);
+extern int pgstat_get_wait_event_usage_entries(const WaitEventUsage *usage,
+ const WaitEventUsageEntry **entries);
+extern void pgstat_get_wait_event_usage_overflow(const WaitEventUsage *usage,
+ uint64 *calls,
+ instr_time *elapsed);
extern WaitEventUsage *pgstat_enter_wait_event_usage(WaitEventUsage *usage);
extern void pgstat_restore_wait_event_usage(WaitEventUsage *usage);
extern void pgstat_count_wait_event_start(uint32 wait_event_info);
--
2.52.0
[application/octet-stream] 0007-Keep-EXPLAIN-option-completion-current.patch (1.2K, 9-0007-Keep-EXPLAIN-option-completion-current.patch)
download | inline diff:
From 7fb9ef35e18a863870ac5c10c346dd739fe7050b Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:46:43 +0500
Subject: [RFC PATCH v1 7/7] Keep EXPLAIN option completion current
---
src/bin/psql/tab-complete.in.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index b33490e7415..45712f4ac12 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4525,8 +4525,8 @@ match_previous_words(int pattern_id,
if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
COMPLETE_WITH("ANALYZE", "VERBOSE", "COSTS", "SETTINGS", "GENERIC_PLAN",
"BUFFERS", "SERIALIZE", "WAL", "WAITS", "TIMING", "SUMMARY",
- "MEMORY", "FORMAT");
- else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|WAITS|TIMING|SUMMARY|MEMORY"))
+ "MEMORY", "IO", "FORMAT");
+ else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|WAITS|TIMING|SUMMARY|MEMORY|IO"))
COMPLETE_WITH("ON", "OFF");
else if (TailMatches("SERIALIZE"))
COMPLETE_WITH("TEXT", "NONE", "BINARY");
--
2.52.0
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected]
Subject: Re: [RFC PATCH v1] Add EXPLAIN ANALYZE wait event reporting
In-Reply-To: <CALCfnuquuxtZmmzQBZ_yxaihfj7bnALXdzi9Nj=RYUW4iwY6GQ@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox