public inbox for [email protected]  
help / color / mirror / Atom feed
From: r314tive <[email protected]>
To: Michael Paquier <[email protected]>
Cc: [email protected]
Subject: [RFC PATCH v1] Add EXPLAIN ANALYZE wait event reporting
Date: Thu, 14 May 2026 16:19:45 +0500
Message-ID: <CALCfnuquuxtZmmzQBZ_yxaihfj7bnALXdzi9Nj=RYUW4iwY6GQ@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
	<[email protected]>

Following Michael Paquier's note about PostgreSQL list practice, this v1
resends the same patch content as v0 as a single email with the patch series
attached.

No code changes from v0.

v0 thread:
https://www.postgresql.org/message-id/cover.1778280923.git.tanswis42%40gmail.com

This RFC prototype adds EXPLAIN (ANALYZE, WAITS), which reports completed
wait
intervals observed through pgstat_report_wait_start/end().

Statement-level wait events are counted once per active statement collector.
Plan-node wait events use inclusive attribution, similar to EXPLAIN ANALYZE
timing: a completed wait is attributed to every active plan node captured
when
the wait begins.  Per-node wait times therefore must not be summed across
plan
nodes.

The patch series is split as follows:

1. Add statement-level EXPLAIN WAITS reporting.
2. Aggregate statement-level EXPLAIN WAITS from parallel workers.
3. Attribute EXPLAIN WAITS to plan nodes.
4. Refine attribution semantics, documentation, and regression coverage.
5. Harden accumulator handling with fixed-size, allocation-free storage.
6. Hide accumulator internals from public headers.
7. Keep psql EXPLAIN option completion current.

The main review questions are:

- whether the option should be named WAITS or WAIT_EVENTS;
- whether inclusive per-node attribution is the right initial semantics;
- whether the fixed accumulator limit and overflow reporting are acceptable;
- whether the disabled/enabled hot-path overhead is acceptable.

Regards,
Ilmar


Attachments:

  [application/octet-stream] 0001-Add-EXPLAIN-WAITS-statement-reporting.patch (17.9K, 3-0001-Add-EXPLAIN-WAITS-statement-reporting.patch)
  download | inline diff:
From 6fc0404acb69765067ccbf1ca3ee0fd39240f9d6 Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:44:10 +0500
Subject: [RFC PATCH v1 1/7] Add EXPLAIN WAITS statement reporting

---
 doc/src/sgml/ref/explain.sgml           |  14 +++
 src/backend/commands/explain.c          | 129 +++++++++++++++++++-
 src/backend/commands/explain_state.c    |   8 ++
 src/backend/utils/activity/wait_event.c | 149 ++++++++++++++++++++++++
 src/include/commands/explain_state.h    |   1 +
 src/include/utils/wait_event.h          |  29 +++++
 src/test/regress/expected/explain.out   |  17 +++
 src/test/regress/sql/explain.sql        |   5 +
 8 files changed, 349 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index e95e19081e1..5460568af1d 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -47,6 +47,7 @@ EXPLAIN [ ( <replaceable class="parameter">option</replaceable> [, ...] ) ] <rep
     SUMMARY [ <replaceable class="parameter">boolean</replaceable> ]
     MEMORY [ <replaceable class="parameter">boolean</replaceable> ]
     IO [ <replaceable class="parameter">boolean</replaceable> ]
+    WAITS [ <replaceable class="parameter">boolean</replaceable> ]
     FORMAT { TEXT | XML | JSON | YAML }
 </synopsis>
  </refsynopsisdiv>
@@ -310,6 +311,19 @@ ROLLBACK;
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>WAITS</literal></term>
+    <listitem>
+     <para>
+      Include exact backend-local wait event timing observed during query
+      execution.  Wait events are grouped by wait event type and wait event
+      name, and include the number of waits and total time spent waiting.
+      This parameter may only be used when <literal>ANALYZE</literal> is also
+      enabled.  It defaults to <literal>FALSE</literal>.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>FORMAT</literal></term>
     <listitem>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 112c17b0d64..0b7cc5c15c6 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -45,6 +45,7 @@
 #include "utils/tuplesort.h"
 #include "utils/tuplestore.h"
 #include "utils/typcache.h"
+#include "utils/wait_event.h"
 #include "utils/xml.h"
 
 
@@ -149,6 +150,9 @@ static const char *explain_get_index_name(Oid indexId);
 static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
+static int	wait_event_usage_cmp(const void *a, const void *b);
+static void show_wait_event_usage(ExplainState *es,
+								  const WaitEventUsage *usage);
 static void show_memory_counters(ExplainState *es,
 								 const MemoryContextCounters *mem_counters);
 static void show_result_replacement_info(Result *result, ExplainState *es);
@@ -510,6 +514,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	int			eflags;
 	int			instrument_option = 0;
 	SerializeMetrics serializeMetrics = {0};
+	WaitEventUsage waitEventUsage;
+	WaitEventUsage *waitEventUsagePtr = NULL;
 
 	Assert(plannedstmt->commandType != CMD_UTILITY);
 
@@ -583,11 +589,27 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		else
 			dir = ForwardScanDirection;
 
+		if (es->waits)
+		{
+			waitEventUsagePtr = &waitEventUsage;
+			pgstat_begin_wait_event_usage(waitEventUsagePtr,
+										  queryDesc->estate->es_query_cxt);
+		}
+
 		/* run the plan */
-		ExecutorRun(queryDesc, dir, 0);
+		PG_TRY();
+		{
+			ExecutorRun(queryDesc, dir, 0);
 
-		/* run cleanup too */
-		ExecutorFinish(queryDesc);
+			/* run cleanup too */
+			ExecutorFinish(queryDesc);
+		}
+		PG_FINALLY();
+		{
+			if (waitEventUsagePtr)
+				pgstat_end_wait_event_usage(waitEventUsagePtr);
+		}
+		PG_END_TRY();
 
 		/* We can't run ExecutorEnd 'till we're done printing the stats... */
 		totaltime += elapsed_time(&starttime);
@@ -605,6 +627,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	/* Create textual dump of plan tree */
 	ExplainPrintPlan(es, queryDesc);
 
+	if (waitEventUsagePtr)
+		show_wait_event_usage(es, waitEventUsagePtr);
+
 	/* Show buffer and/or memory usage in planning */
 	if (peek_buffer_usage(es, bufusage) || mem_counters)
 	{
@@ -4503,6 +4528,104 @@ show_wal_usage(ExplainState *es, const WalUsage *usage)
 	}
 }
 
+static int
+wait_event_usage_cmp(const void *a, const void *b)
+{
+	const WaitEventUsageEntry *ea = (const WaitEventUsageEntry *) a;
+	const WaitEventUsageEntry *eb = (const WaitEventUsageEntry *) b;
+	int64		ta = INSTR_TIME_GET_MICROSEC(ea->time);
+	int64		tb = INSTR_TIME_GET_MICROSEC(eb->time);
+
+	if (ta < tb)
+		return 1;
+	if (ta > tb)
+		return -1;
+	if (ea->wait_event_info < eb->wait_event_info)
+		return -1;
+	if (ea->wait_event_info > eb->wait_event_info)
+		return 1;
+	return 0;
+}
+
+static void
+show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
+{
+	WaitEventUsageEntry *entries;
+
+	if (usage == NULL)
+		return;
+
+	if (es->format == EXPLAIN_FORMAT_TEXT && usage->nentries == 0)
+		return;
+
+	if (usage->nentries > 0)
+	{
+		entries = palloc_array(WaitEventUsageEntry, usage->nentries);
+		memcpy(entries, usage->entries,
+			   sizeof(WaitEventUsageEntry) * usage->nentries);
+		qsort(entries, usage->nentries, sizeof(WaitEventUsageEntry),
+			  wait_event_usage_cmp);
+	}
+	else
+		entries = NULL;
+
+	if (es->format == EXPLAIN_FORMAT_TEXT)
+	{
+		ExplainIndentText(es);
+		appendStringInfoString(es->str, "Wait Events:\n");
+		es->indent++;
+
+		for (int i = 0; i < usage->nentries; i++)
+		{
+			const char *event_type;
+			const char *event_name;
+
+			event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
+			event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+
+			ExplainIndentText(es);
+			appendStringInfo(es->str, "%s:%s calls=%" PRIu64 " time=%0.3f ms\n",
+							 event_type ? event_type : "Unknown",
+							 event_name ? event_name : "unknown",
+							 entries[i].calls,
+							 INSTR_TIME_GET_MILLISEC(entries[i].time));
+		}
+
+		es->indent--;
+	}
+	else
+	{
+		ExplainOpenGroup("Wait-Events", "Wait Events", false, es);
+
+		for (int i = 0; i < usage->nentries; i++)
+		{
+			const char *event_type;
+			const char *event_name;
+
+			event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
+			event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+
+			ExplainOpenGroup("Wait-Event", NULL, true, es);
+			ExplainPropertyText("Wait Event Type",
+								event_type ? event_type : "Unknown",
+								es);
+			ExplainPropertyText("Wait Event",
+								event_name ? event_name : "unknown",
+								es);
+			ExplainPropertyUInteger("Calls", NULL, entries[i].calls, es);
+			ExplainPropertyFloat("Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(entries[i].time),
+								 3, es);
+			ExplainCloseGroup("Wait-Event", NULL, true, es);
+		}
+
+		ExplainCloseGroup("Wait-Events", "Wait Events", false, es);
+	}
+
+	if (entries)
+		pfree(entries);
+}
+
 /*
  * Show memory usage details.
  */
diff --git a/src/backend/commands/explain_state.c b/src/backend/commands/explain_state.c
index a0ee0a664be..2abcb64cc95 100644
--- a/src/backend/commands/explain_state.c
+++ b/src/backend/commands/explain_state.c
@@ -164,6 +164,8 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
 		}
 		else if (strcmp(opt->defname, "io") == 0)
 			es->io = defGetBoolean(opt);
+		else if (strcmp(opt->defname, "waits") == 0)
+			es->waits = defGetBoolean(opt);
 		else if (!ApplyExtensionExplainOption(es, opt, pstate))
 			ereport(ERROR,
 					(errcode(ERRCODE_SYNTAX_ERROR),
@@ -196,6 +198,12 @@ ParseExplainOptionList(ExplainState *es, List *options, ParseState *pstate)
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("EXPLAIN option %s requires ANALYZE", "IO")));
 
+	/* check that WAITS is used with EXPLAIN ANALYZE */
+	if (es->waits && !es->analyze)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("EXPLAIN option %s requires ANALYZE", "WAITS")));
+
 	/* check that serialize is used with EXPLAIN ANALYZE */
 	if (es->serialize != EXPLAIN_SERIALIZE_NONE && !es->analyze)
 		ereport(ERROR,
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 95635c7f56c..60d37ccbb73 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -36,11 +36,20 @@ static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
+static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
+							  const instr_time *elapsed);
 
 
 static uint32 local_my_wait_event_info;
 uint32	   *my_wait_event_info = &local_my_wait_event_info;
 
+#define WAIT_EVENT_USAGE_INITIAL_EVENTS	16
+
+int			pgstat_wait_event_usage_depth = 0;
+static WaitEventUsage *pgstat_wait_event_usage = NULL;
+static uint32 pgstat_wait_event_usage_current = 0;
+static instr_time pgstat_wait_event_usage_start;
+
 #define WAIT_EVENT_CLASS_MASK	0xFF000000
 #define WAIT_EVENT_ID_MASK		0x0000FFFF
 
@@ -349,6 +358,146 @@ pgstat_reset_wait_event_storage(void)
 	my_wait_event_info = &local_my_wait_event_info;
 }
 
+/*
+ * Start collecting exact wait event timings in this backend.
+ *
+ * This is intended for short-lived instrumentation such as EXPLAIN ANALYZE.
+ * It records waits observed through pgstat_report_wait_start/end in backend
+ * local memory.  Nested collection is deliberately treated as part of the
+ * outer collection for now; callers that want independent nested accounting
+ * need a stack of WaitEventUsage contexts.
+ */
+void
+pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+{
+	Assert(usage != NULL);
+	Assert(memcontext != NULL);
+
+	if (pgstat_wait_event_usage_depth++ == 0)
+	{
+		memset(usage, 0, sizeof(WaitEventUsage));
+		usage->memcontext = memcontext;
+		pgstat_wait_event_usage = usage;
+		pgstat_wait_event_usage_current = 0;
+		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
+	}
+}
+
+/*
+ * Stop collecting wait event timings.
+ */
+void
+pgstat_end_wait_event_usage(WaitEventUsage *usage)
+{
+	Assert(usage != NULL);
+	Assert(pgstat_wait_event_usage_depth > 0);
+
+	if (--pgstat_wait_event_usage_depth == 0)
+	{
+		if (pgstat_wait_event_usage_current != 0)
+			pgstat_count_wait_event_end();
+
+		pgstat_wait_event_usage = NULL;
+		pgstat_wait_event_usage_current = 0;
+		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
+	}
+}
+
+/*
+ * Record the beginning of a wait event for exact EXPLAIN-style accounting.
+ */
+void
+pgstat_count_wait_event_start(uint32 wait_event_info)
+{
+	if (pgstat_wait_event_usage == NULL)
+		return;
+
+	/*
+	 * Waits are not expected to nest.  If they do, finish the previous wait
+	 * at the boundary so accounting remains internally consistent.
+	 */
+	if (pgstat_wait_event_usage_current != 0)
+		pgstat_count_wait_event_end();
+
+	pgstat_wait_event_usage_current = wait_event_info;
+	INSTR_TIME_SET_CURRENT(pgstat_wait_event_usage_start);
+}
+
+/*
+ * Record the end of the current wait event.
+ */
+void
+pgstat_count_wait_event_end(void)
+{
+	instr_time	end;
+	instr_time	elapsed;
+
+	if (pgstat_wait_event_usage == NULL ||
+		pgstat_wait_event_usage_current == 0)
+		return;
+
+	INSTR_TIME_SET_CURRENT(end);
+	elapsed = end;
+	INSTR_TIME_SUBTRACT(elapsed, pgstat_wait_event_usage_start);
+
+	WaitEventUsageAdd(pgstat_wait_event_usage,
+					  pgstat_wait_event_usage_current,
+					  &elapsed);
+
+	pgstat_wait_event_usage_current = 0;
+	INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
+}
+
+static void
+WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
+				  const instr_time *elapsed)
+{
+	WaitEventUsageEntry *entry = NULL;
+
+	for (int i = 0; i < usage->nentries; i++)
+	{
+		if (usage->entries[i].wait_event_info == wait_event_info)
+		{
+			entry = &usage->entries[i];
+			break;
+		}
+	}
+
+	if (entry == NULL)
+	{
+		if (usage->nentries >= usage->maxentries)
+		{
+			MemoryContext oldcontext;
+			int			newmaxentries;
+
+			if (usage->maxentries > 0)
+				newmaxentries = usage->maxentries * 2;
+			else
+				newmaxentries = WAIT_EVENT_USAGE_INITIAL_EVENTS;
+
+			oldcontext = MemoryContextSwitchTo(usage->memcontext);
+			if (usage->entries)
+				usage->entries = repalloc_array(usage->entries,
+												WaitEventUsageEntry,
+												newmaxentries);
+			else
+				usage->entries = palloc_array(WaitEventUsageEntry,
+											  newmaxentries);
+			MemoryContextSwitchTo(oldcontext);
+
+			usage->maxentries = newmaxentries;
+		}
+
+		entry = &usage->entries[usage->nentries++];
+		entry->wait_event_info = wait_event_info;
+		entry->calls = 0;
+		INSTR_TIME_SET_ZERO(entry->time);
+	}
+
+	entry->calls++;
+	INSTR_TIME_ADD(entry->time, *elapsed);
+}
+
 /* ----------
  * pgstat_get_wait_event_type() -
  *
diff --git a/src/include/commands/explain_state.h b/src/include/commands/explain_state.h
index 97bc7ed49f6..4f0d0d495d4 100644
--- a/src/include/commands/explain_state.h
+++ b/src/include/commands/explain_state.h
@@ -56,6 +56,7 @@ typedef struct ExplainState
 	bool		memory;			/* print planner's memory usage information */
 	bool		settings;		/* print modified settings */
 	bool		io;				/* print info about IO (prefetch, ...) */
+	bool		waits;			/* print wait event usage */
 	bool		generic;		/* generate a generic plan */
 	ExplainSerializeOption serialize;	/* serialize the query's output? */
 	ExplainFormat format;		/* output format */
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 86ee348220d..f7fab5736bb 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -11,16 +11,39 @@
 #define WAIT_EVENT_H
 
 /* enums for wait events */
+#include "portability/instr_time.h"
+#include "utils/palloc.h"
 #include "utils/wait_event_types.h"
 
+typedef struct WaitEventUsageEntry
+{
+	uint32		wait_event_info;
+	uint64		calls;
+	instr_time	time;
+} WaitEventUsageEntry;
+
+typedef struct WaitEventUsage
+{
+	MemoryContext memcontext;
+	int			nentries;
+	int			maxentries;
+	WaitEventUsageEntry *entries;
+} WaitEventUsage;
+
 extern const char *pgstat_get_wait_event(uint32 wait_event_info);
 extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
 static inline void pgstat_report_wait_start(uint32 wait_event_info);
 static inline void pgstat_report_wait_end(void);
 extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
 extern void pgstat_reset_wait_event_storage(void);
+extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
+										  MemoryContext memcontext);
+extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
+extern void pgstat_count_wait_event_start(uint32 wait_event_info);
+extern void pgstat_count_wait_event_end(void);
 
 extern PGDLLIMPORT uint32 *my_wait_event_info;
+extern PGDLLIMPORT int pgstat_wait_event_usage_depth;
 
 
 /*
@@ -66,6 +89,9 @@ extern char **GetWaitEventCustomNames(uint32 classId, int *nwaitevents);
 static inline void
 pgstat_report_wait_start(uint32 wait_event_info)
 {
+	if (pgstat_wait_event_usage_depth > 0)
+		pgstat_count_wait_event_start(wait_event_info);
+
 	/*
 	 * Since this is a four-byte field which is always read and written as
 	 * four-bytes, updates are atomic.
@@ -82,6 +108,9 @@ pgstat_report_wait_start(uint32 wait_event_info)
 static inline void
 pgstat_report_wait_end(void)
 {
+	if (pgstat_wait_event_usage_depth > 0)
+		pgstat_count_wait_event_end();
+
 	/* see pgstat_report_wait_start() */
 	*(volatile uint32 *) my_wait_event_info = 0;
 }
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 74a4d87801e..2c7a7e1d4c6 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -99,6 +99,23 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
  Seq Scan on int8_tbl i8  (cost=N.N..N.N rows=N width=N)
 (1 row)
 
+-- WAITS option
+select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
+            explain_filter             
+---------------------------------------
+ Result (actual rows=N.N loops=N)
+ Wait Events:
+   Timeout:PgSleep calls=N time=N.N ms
+(3 rows)
+
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+                                     ?column?                                     
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+explain (waits) select 1;
+ERROR:  EXPLAIN option WAITS requires ANALYZE
 \a
 select explain_filter('explain (analyze, buffers, io, format xml) select * from int8_tbl i8');
 explain_filter
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index 2f163c64bf6..fe025ddeac5 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -68,6 +68,11 @@ select explain_filter('explain (analyze, buffers off, verbose) select * from int
 select explain_filter('explain (analyze, buffers, format text) select * from int8_tbl i8');
 select explain_filter('explain (buffers, format text) select * from int8_tbl i8');
 
+-- WAITS option
+select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+explain (waits) select 1;
+
 \a
 select explain_filter('explain (analyze, buffers, io, format xml) select * from int8_tbl i8');
 select explain_filter('explain (analyze, serialize, buffers, io, format yaml) select * from int8_tbl i8');
-- 
2.52.0



  [application/octet-stream] 0002-Aggregate-EXPLAIN-WAITS-from-parallel-workers.patch (17.0K, 4-0002-Aggregate-EXPLAIN-WAITS-from-parallel-workers.patch)
  download | inline diff:
From eb5f9c83a0f248f2def60ff6e93e998149047f06 Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:44:24 +0500
Subject: [RFC PATCH v1 2/7] Aggregate EXPLAIN WAITS from parallel workers

---
 src/backend/commands/explain.c          |   4 +
 src/backend/executor/execMain.c         |   1 +
 src/backend/executor/execParallel.c     | 129 ++++++++++++++++++++++++
 src/backend/executor/execUtils.c        |   1 +
 src/backend/utils/activity/wait_event.c |  22 +++-
 src/include/executor/execParallel.h     |   2 +
 src/include/nodes/execnodes.h           |   2 +
 src/include/utils/wait_event.h          |   3 +
 src/test/regress/expected/explain.out   |  17 ++++
 src/test/regress/sql/explain.sql        |  12 +++
 10 files changed, 190 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0b7cc5c15c6..9d7372f5415 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -594,6 +594,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 			waitEventUsagePtr = &waitEventUsage;
 			pgstat_begin_wait_event_usage(waitEventUsagePtr,
 										  queryDesc->estate->es_query_cxt);
+			queryDesc->estate->es_wait_event_usage = waitEventUsagePtr;
 		}
 
 		/* run the plan */
@@ -607,7 +608,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		PG_FINALLY();
 		{
 			if (waitEventUsagePtr)
+			{
 				pgstat_end_wait_event_usage(waitEventUsagePtr);
+				queryDesc->estate->es_wait_event_usage = NULL;
+			}
 		}
 		PG_END_TRY();
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 4b30f768680..86ab124c1c0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -3066,6 +3066,7 @@ EvalPlanQualStart(EPQState *epqstate, Plan *planTree)
 	/* es_trig_target_relations must NOT be copied */
 	rcestate->es_top_eflags = parentestate->es_top_eflags;
 	rcestate->es_instrument = parentestate->es_instrument;
+	rcestate->es_wait_event_usage = parentestate->es_wait_event_usage;
 	/* es_auxmodifytables must NOT be copied */
 
 	/*
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 81b87d82fab..8213565a708 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -51,6 +51,7 @@
 #include "utils/dsa.h"
 #include "utils/lsyscache.h"
 #include "utils/snapmgr.h"
+#include "utils/wait_event.h"
 
 /*
  * Magic numbers for parallel executor communication.  We use constants
@@ -67,6 +68,7 @@
 #define PARALLEL_KEY_QUERY_TEXT		UINT64CONST(0xE000000000000008)
 #define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
 #define PARALLEL_KEY_WAL_USAGE			UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_WAIT_EVENT_USAGE	UINT64CONST(0xE00000000000000B)
 
 #define PARALLEL_TUPLE_QUEUE_SIZE		65536
 
@@ -114,6 +116,18 @@ struct SharedExecutorInstrumentation
 	(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
 	 (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
 
+typedef struct SharedWaitEventUsageWorker
+{
+	int			nentries;
+	dsa_pointer entries;
+} SharedWaitEventUsageWorker;
+
+struct SharedWaitEventUsage
+{
+	int			num_workers;
+	SharedWaitEventUsageWorker worker_usage[FLEXIBLE_ARRAY_MEMBER];
+};
+
 /* Context object for ExecParallelEstimate. */
 typedef struct ExecParallelEstimateContext
 {
@@ -141,6 +155,10 @@ static bool ExecParallelReInitializeDSM(PlanState *planstate,
 										ParallelContext *pcxt);
 static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
 												SharedExecutorInstrumentation *instrumentation);
+static void ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei);
+static void ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
+											 dsa_area *area,
+											 const WaitEventUsage *usage);
 
 /* Helper function that runs in the parallel worker. */
 static DestReceiver *ExecParallelGetReceiver(dsm_segment *seg, shm_toc *toc);
@@ -664,10 +682,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	char	   *paramlistinfo_space;
 	BufferUsage *bufusage_space;
 	WalUsage   *walusage_space;
+	SharedWaitEventUsage *wait_event_usage = NULL;
 	SharedExecutorInstrumentation *instrumentation = NULL;
 	SharedJitInstrumentation *jit_instrumentation = NULL;
 	int			pstmt_len;
 	int			paramlistinfo_len;
+	int			wait_event_usage_len = 0;
 	int			instrumentation_len = 0;
 	int			jit_instrumentation_len = 0;
 	int			instrument_offset = 0;
@@ -744,6 +764,16 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
 	shm_toc_estimate_keys(&pcxt->estimator, 1);
 
+	/* Estimate space for per-worker wait event usage metadata. */
+	if (estate->es_wait_event_usage != NULL)
+	{
+		wait_event_usage_len =
+			offsetof(SharedWaitEventUsage, worker_usage) +
+			mul_size(sizeof(SharedWaitEventUsageWorker), pcxt->nworkers);
+		shm_toc_estimate_chunk(&pcxt->estimator, wait_event_usage_len);
+		shm_toc_estimate_keys(&pcxt->estimator, 1);
+	}
+
 	/* Estimate space for tuple queues. */
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -839,6 +869,21 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
 	pei->wal_usage = walusage_space;
 
+	/* Allocate metadata for each worker's wait event usage, if requested. */
+	if (estate->es_wait_event_usage != NULL)
+	{
+		wait_event_usage = shm_toc_allocate(pcxt->toc, wait_event_usage_len);
+		wait_event_usage->num_workers = nworkers;
+		for (int i = 0; i < nworkers; i++)
+		{
+			wait_event_usage->worker_usage[i].nentries = 0;
+			wait_event_usage->worker_usage[i].entries = InvalidDsaPointer;
+		}
+		shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAIT_EVENT_USAGE,
+					   wait_event_usage);
+		pei->wait_event_usage = wait_event_usage;
+	}
+
 	/* Set up the tuple queues that the workers will write into. */
 	pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
 
@@ -1213,6 +1258,68 @@ ExecParallelRetrieveJitInstrumentation(PlanState *planstate,
 	memcpy(planstate->worker_jit_instrument, shared_jit, ibytes);
 }
 
+static void
+ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei)
+{
+	SharedWaitEventUsage *shared = pei->wait_event_usage;
+	WaitEventUsage *usage;
+
+	if (shared == NULL)
+		return;
+
+	usage = pei->planstate->state->es_wait_event_usage;
+	if (usage == NULL)
+		return;
+
+	for (int i = 0; i < shared->num_workers; i++)
+	{
+		SharedWaitEventUsageWorker *worker = &shared->worker_usage[i];
+		WaitEventUsageEntry *entries;
+
+		if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
+			continue;
+
+		entries = dsa_get_address(pei->area, worker->entries);
+		pgstat_accumulate_wait_event_usage(usage,
+										   entries,
+										   worker->nentries);
+		dsa_free(pei->area, worker->entries);
+		worker->nentries = 0;
+		worker->entries = InvalidDsaPointer;
+	}
+}
+
+static void
+ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
+								 dsa_area *area,
+								 const WaitEventUsage *usage)
+{
+	SharedWaitEventUsageWorker *worker;
+	WaitEventUsageEntry *entries;
+	dsa_pointer entries_dsa;
+	Size		entries_size;
+
+	Assert(shared != NULL);
+	Assert(area != NULL);
+	Assert(usage != NULL);
+	Assert(IsParallelWorker());
+	Assert(ParallelWorkerNumber < shared->num_workers);
+
+	if (usage->nentries <= 0)
+		return;
+
+	worker = &shared->worker_usage[ParallelWorkerNumber];
+	entries_size = mul_size(sizeof(WaitEventUsageEntry), usage->nentries);
+	entries_dsa = dsa_allocate(area, entries_size);
+	entries = dsa_get_address(area, entries_dsa);
+	memcpy(entries, usage->entries, entries_size);
+
+	if (DsaPointerIsValid(worker->entries))
+		dsa_free(area, worker->entries);
+	worker->nentries = usage->nentries;
+	worker->entries = entries_dsa;
+}
+
 /*
  * Finish parallel execution.  We wait for parallel workers to finish, and
  * accumulate their buffer/WAL usage.
@@ -1261,6 +1368,9 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
 	for (i = 0; i < nworkers; i++)
 		InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
 
+	/* Accumulate wait event usage, if requested. */
+	ExecParallelRetrieveWaitEventUsage(pei);
+
 	pei->finished = true;
 }
 
@@ -1516,10 +1626,13 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	FixedParallelExecutorState *fpes;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
+	SharedWaitEventUsage *wait_event_usage;
 	DestReceiver *receiver;
 	QueryDesc  *queryDesc;
 	SharedExecutorInstrumentation *instrumentation;
 	SharedJitInstrumentation *jit_instrumentation;
+	WaitEventUsage waitEventUsage;
+	WaitEventUsage *waitEventUsagePtr = NULL;
 	int			instrument_options = 0;
 	void	   *area_space;
 	dsa_area   *area;
@@ -1535,6 +1648,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 		instrument_options = instrumentation->instrument_options;
 	jit_instrumentation = shm_toc_lookup(toc, PARALLEL_KEY_JIT_INSTRUMENTATION,
 										 true);
+	wait_event_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAIT_EVENT_USAGE, true);
 	queryDesc = ExecParallelGetQueryDesc(toc, receiver, instrument_options);
 
 	/* Setting debug_query_string for individual workers */
@@ -1576,6 +1690,13 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	 */
 	InstrStartParallelQuery();
 
+	if (wait_event_usage != NULL)
+	{
+		waitEventUsagePtr = &waitEventUsage;
+		pgstat_begin_wait_event_usage(waitEventUsagePtr,
+									  queryDesc->estate->es_query_cxt);
+	}
+
 	/*
 	 * Run the plan.  If we specified a tuple bound, be careful not to demand
 	 * more tuples than that.
@@ -1587,6 +1708,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	/* Shut down the executor */
 	ExecutorFinish(queryDesc);
 
+	if (waitEventUsagePtr != NULL)
+	{
+		pgstat_end_wait_event_usage(waitEventUsagePtr);
+		ExecParallelReportWaitEventUsage(wait_event_usage,
+										 area,
+										 waitEventUsagePtr);
+	}
+
 	/* Report buffer/WAL usage during parallel execution. */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 1eb6b9f1f40..80ea777632b 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -151,6 +151,7 @@ CreateExecutorState(void)
 
 	estate->es_top_eflags = 0;
 	estate->es_instrument = 0;
+	estate->es_wait_event_usage = NULL;
 	estate->es_finished = false;
 
 	estate->es_exprcontexts = NIL;
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 60d37ccbb73..eb01bc3d88c 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -37,7 +37,7 @@ static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
-							  const instr_time *elapsed);
+							  uint64 calls, const instr_time *elapsed);
 
 
 static uint32 local_my_wait_event_info;
@@ -442,15 +442,31 @@ pgstat_count_wait_event_end(void)
 
 	WaitEventUsageAdd(pgstat_wait_event_usage,
 					  pgstat_wait_event_usage_current,
+					  1,
 					  &elapsed);
 
 	pgstat_wait_event_usage_current = 0;
 	INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
 }
 
+void
+pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
+								   const WaitEventUsageEntry *entries,
+								   int nentries)
+{
+	Assert(usage != NULL);
+	Assert(nentries == 0 || entries != NULL);
+
+	for (int i = 0; i < nentries; i++)
+		WaitEventUsageAdd(usage,
+						  entries[i].wait_event_info,
+						  entries[i].calls,
+						  &entries[i].time);
+}
+
 static void
 WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
-				  const instr_time *elapsed)
+				  uint64 calls, const instr_time *elapsed)
 {
 	WaitEventUsageEntry *entry = NULL;
 
@@ -494,7 +510,7 @@ WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
 		INSTR_TIME_SET_ZERO(entry->time);
 	}
 
-	entry->calls++;
+	entry->calls += calls;
 	INSTR_TIME_ADD(entry->time, *elapsed);
 }
 
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5a2034811d5..71df2c2511c 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -20,6 +20,7 @@
 #include "utils/dsa.h"
 
 typedef struct SharedExecutorInstrumentation SharedExecutorInstrumentation;
+typedef struct SharedWaitEventUsage SharedWaitEventUsage;
 
 typedef struct ParallelExecutorInfo
 {
@@ -27,6 +28,7 @@ typedef struct ParallelExecutorInfo
 	ParallelContext *pcxt;		/* parallel context we're using */
 	BufferUsage *buffer_usage;	/* points to bufusage area in DSM */
 	WalUsage   *wal_usage;		/* walusage area in DSM */
+	SharedWaitEventUsage *wait_event_usage;	/* optional */
 	SharedExecutorInstrumentation *instrumentation; /* optional */
 	struct SharedJitInstrumentation *jit_instrumentation;	/* optional */
 	dsa_area   *area;			/* points to DSA area in DSM */
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 13359180d25..781c8c79132 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -69,6 +69,7 @@ typedef struct Tuplestorestate Tuplestorestate;
 typedef struct TupleTableSlot TupleTableSlot;
 typedef struct TupleTableSlotOps TupleTableSlotOps;
 typedef struct WalUsage WalUsage;
+typedef struct WaitEventUsage WaitEventUsage;
 typedef struct WorkerNodeInstrumentation WorkerNodeInstrumentation;
 
 
@@ -754,6 +755,7 @@ typedef struct EState
 
 	int			es_top_eflags;	/* eflags passed to ExecutorStart */
 	int			es_instrument;	/* OR of InstrumentOption flags */
+	WaitEventUsage *es_wait_event_usage;	/* EXPLAIN WAITS accumulator */
 	bool		es_finished;	/* true when ExecutorFinish is done */
 
 	List	   *es_exprcontexts;	/* List of ExprContexts within EState */
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index f7fab5736bb..63992137ee7 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -39,6 +39,9 @@ extern void pgstat_reset_wait_event_storage(void);
 extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
 										  MemoryContext memcontext);
 extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
+extern void pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
+											   const WaitEventUsageEntry *entries,
+											   int nentries);
 extern void pgstat_count_wait_event_start(uint32 wait_event_info);
 extern void pgstat_count_wait_event_end(void);
 
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 2c7a7e1d4c6..e3847e222be 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -114,6 +114,23 @@ select explain_filter_to_json('explain (analyze, waits, costs off, summary off,
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
 (1 row)
 
+begin;
+create function pg_temp.parallel_pg_sleep(float8) returns void
+  language internal volatile parallel safe as 'pg_sleep';
+set local debug_parallel_query = on;
+set local max_parallel_workers_per_gather = 1;
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select pg_temp.parallel_pg_sleep(0.01)
+                         from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+  '$[*] ? (@."Wait Event" == "PgSleep")'
+);
+                              jsonb_path_query_first                              
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
 explain (waits) select 1;
 ERROR:  EXPLAIN option WAITS requires ANALYZE
 \a
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index fe025ddeac5..8821250bcef 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -71,6 +71,18 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
 -- WAITS option
 select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
 select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+begin;
+create function pg_temp.parallel_pg_sleep(float8) returns void
+  language internal volatile parallel safe as 'pg_sleep';
+set local debug_parallel_query = on;
+set local max_parallel_workers_per_gather = 1;
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select pg_temp.parallel_pg_sleep(0.01)
+                         from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+  '$[*] ? (@."Wait Event" == "PgSleep")'
+);
+rollback;
 explain (waits) select 1;
 
 \a
-- 
2.52.0



  [application/octet-stream] 0003-Attribute-EXPLAIN-WAITS-to-plan-nodes.patch (35.0K, 5-0003-Attribute-EXPLAIN-WAITS-to-plan-nodes.patch)
  download | inline diff:
From cb26aa486d42f227209c27df4aef4b07f13e334a Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:44:35 +0500
Subject: [RFC PATCH v1 3/7] Attribute EXPLAIN WAITS to plan nodes

---
 src/backend/commands/explain.c             |   6 +-
 src/backend/executor/execAsync.c           |  22 ++++
 src/backend/executor/execParallel.c        | 131 ++++++++++++++++++---
 src/backend/executor/execProcnode.c        |  31 ++++-
 src/backend/executor/instrument.c          |   7 ++
 src/backend/executor/nodeBitmapAnd.c       |   7 ++
 src/backend/executor/nodeBitmapIndexscan.c |   7 ++
 src/backend/executor/nodeBitmapOr.c        |   7 ++
 src/backend/executor/nodeHash.c            |   7 ++
 src/backend/utils/activity/wait_event.c    |  49 +++++++-
 src/include/executor/instrument.h          |   1 +
 src/include/nodes/execnodes.h              |   1 +
 src/include/utils/wait_event.h             |   4 +
 src/test/regress/expected/explain.out      |  52 +++++++-
 src/test/regress/sql/explain.sql           |  29 +++++
 15 files changed, 333 insertions(+), 28 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9d7372f5415..0575ae8368b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -530,6 +530,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		instrument_option |= INSTRUMENT_WAL;
 	if (es->io)
 		instrument_option |= INSTRUMENT_IO;
+	if (es->waits)
+		instrument_option |= INSTRUMENT_WAITS;
 
 	/*
 	 * We always collect timing for the entire statement, even when node-level
@@ -2332,6 +2334,8 @@ ExplainNode(PlanState *planstate, List *ancestors,
 		show_buffer_usage(es, &planstate->instrument->instr.bufusage);
 	if (es->wal && planstate->instrument)
 		show_wal_usage(es, &planstate->instrument->instr.walusage);
+	if (es->waits)
+		show_wait_event_usage(es, planstate->wait_event_usage);
 
 	/* Prepare per-worker buffer/WAL usage */
 	if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -4559,7 +4563,7 @@ show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
 	if (usage == NULL)
 		return;
 
-	if (es->format == EXPLAIN_FORMAT_TEXT && usage->nentries == 0)
+	if (usage->nentries == 0)
 		return;
 
 	if (usage->nentries > 0)
diff --git a/src/backend/executor/execAsync.c b/src/backend/executor/execAsync.c
index cf7ddbb01f4..e73ea55cf92 100644
--- a/src/backend/executor/execAsync.c
+++ b/src/backend/executor/execAsync.c
@@ -19,6 +19,7 @@
 #include "executor/instrument.h"
 #include "executor/nodeAppend.h"
 #include "executor/nodeForeignscan.h"
+#include "utils/wait_event.h"
 
 /*
  * Asynchronously request a tuple from a designed async-capable node.
@@ -26,12 +27,17 @@
 void
 ExecAsyncRequest(AsyncRequest *areq)
 {
+	WaitEventUsage *previous_wait_event_usage = NULL;
+
 	if (areq->requestee->chgParam != NULL)	/* something changed? */
 		ExecReScan(areq->requestee);	/* let ReScan handle this */
 
 	/* must provide our own instrumentation support */
 	if (areq->requestee->instrument)
 		InstrStartNode(areq->requestee->instrument);
+	if (areq->requestee->wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(areq->requestee->wait_event_usage);
 
 	switch (nodeTag(areq->requestee))
 	{
@@ -47,6 +53,8 @@ ExecAsyncRequest(AsyncRequest *areq)
 	ExecAsyncResponse(areq);
 
 	/* must provide our own instrumentation support */
+	if (areq->requestee->wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (areq->requestee->instrument)
 		InstrStopNode(areq->requestee->instrument,
 					  TupIsNull(areq->result) ? 0.0 : 1.0);
@@ -62,9 +70,14 @@ ExecAsyncRequest(AsyncRequest *areq)
 void
 ExecAsyncConfigureWait(AsyncRequest *areq)
 {
+	WaitEventUsage *previous_wait_event_usage = NULL;
+
 	/* must provide our own instrumentation support */
 	if (areq->requestee->instrument)
 		InstrStartNode(areq->requestee->instrument);
+	if (areq->requestee->wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(areq->requestee->wait_event_usage);
 
 	switch (nodeTag(areq->requestee))
 	{
@@ -78,6 +91,8 @@ ExecAsyncConfigureWait(AsyncRequest *areq)
 	}
 
 	/* must provide our own instrumentation support */
+	if (areq->requestee->wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (areq->requestee->instrument)
 		InstrStopNode(areq->requestee->instrument, 0.0);
 }
@@ -88,9 +103,14 @@ ExecAsyncConfigureWait(AsyncRequest *areq)
 void
 ExecAsyncNotify(AsyncRequest *areq)
 {
+	WaitEventUsage *previous_wait_event_usage = NULL;
+
 	/* must provide our own instrumentation support */
 	if (areq->requestee->instrument)
 		InstrStartNode(areq->requestee->instrument);
+	if (areq->requestee->wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(areq->requestee->wait_event_usage);
 
 	switch (nodeTag(areq->requestee))
 	{
@@ -106,6 +126,8 @@ ExecAsyncNotify(AsyncRequest *areq)
 	ExecAsyncResponse(areq);
 
 	/* must provide our own instrumentation support */
+	if (areq->requestee->wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (areq->requestee->instrument)
 		InstrStopNode(areq->requestee->instrument,
 					  TupIsNull(areq->result) ? 0.0 : 1.0);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 8213565a708..f77f539dd27 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -83,6 +83,12 @@ typedef struct FixedParallelExecutorState
 	int			jit_flags;
 } FixedParallelExecutorState;
 
+typedef struct SharedWaitEventUsageWorker
+{
+	int			nentries;
+	dsa_pointer entries;
+} SharedWaitEventUsageWorker;
+
 /*
  * DSM structure for accumulating per-PlanState instrumentation.
  *
@@ -92,6 +98,10 @@ typedef struct FixedParallelExecutorState
  * of the first NodeInstrumentation object.  This will depend on the length of
  * the plan_node_id array.
  *
+ * wait_event_usage_offset: Offset, relative to the start of this structure,
+ * of the first SharedWaitEventUsageWorker object, or 0 if wait event usage is
+ * not being collected.
+ *
  * num_workers: Number of workers.
  *
  * num_plan_nodes: Number of plan nodes.
@@ -103,6 +113,7 @@ struct SharedExecutorInstrumentation
 {
 	int			instrument_options;
 	int			instrument_offset;
+	int			wait_event_usage_offset;
 	int			num_workers;
 	int			num_plan_nodes;
 	int			plan_node_id[FLEXIBLE_ARRAY_MEMBER];
@@ -110,17 +121,17 @@ struct SharedExecutorInstrumentation
 	/*
 	 * Array of num_plan_nodes * num_workers NodeInstrumentation objects
 	 * follows.
+	 *
+	 * If wait_event_usage_offset is non-zero, an array of num_plan_nodes *
+	 * num_workers SharedWaitEventUsageWorker objects follows.
 	 */
 };
 #define GetInstrumentationArray(sei) \
 	(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
 	 (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
-
-typedef struct SharedWaitEventUsageWorker
-{
-	int			nentries;
-	dsa_pointer entries;
-} SharedWaitEventUsageWorker;
+#define GetInstrumentationWaitEventUsageArray(sei) \
+	(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
+	 (SharedWaitEventUsageWorker *) (((char *) sei) + sei->wait_event_usage_offset))
 
 struct SharedWaitEventUsage
 {
@@ -143,6 +154,12 @@ typedef struct ExecParallelInitializeDSMContext
 	int			nnodes;
 } ExecParallelInitializeDSMContext;
 
+typedef struct ExecParallelRetrieveInstrumentationContext
+{
+	SharedExecutorInstrumentation *instrumentation;
+	dsa_area   *area;
+} ExecParallelRetrieveInstrumentationContext;
+
 /* Helper functions that run in the parallel leader. */
 static char *ExecSerializePlan(Plan *plan, EState *estate);
 static bool ExecParallelEstimate(PlanState *planstate,
@@ -154,8 +171,11 @@ static shm_mq_handle **ExecParallelSetupTupleQueues(ParallelContext *pcxt,
 static bool ExecParallelReInitializeDSM(PlanState *planstate,
 										ParallelContext *pcxt);
 static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
-												SharedExecutorInstrumentation *instrumentation);
+												ExecParallelRetrieveInstrumentationContext *r);
 static void ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei);
+static void ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
+												   dsa_area *area,
+												   const WaitEventUsage *usage);
 static void ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
 											 dsa_area *area,
 											 const WaitEventUsage *usage);
@@ -691,6 +711,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	int			instrumentation_len = 0;
 	int			jit_instrumentation_len = 0;
 	int			instrument_offset = 0;
+	int			wait_event_usage_offset = 0;
 	Size		dsa_minsize = dsa_minimum_size();
 	char	   *query_string;
 	int			query_len;
@@ -798,6 +819,14 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 		instrumentation_len +=
 			mul_size(sizeof(NodeInstrumentation),
 					 mul_size(e.nnodes, nworkers));
+		if (estate->es_instrument & INSTRUMENT_WAITS)
+		{
+			instrumentation_len = MAXALIGN(instrumentation_len);
+			wait_event_usage_offset = instrumentation_len;
+			instrumentation_len +=
+				mul_size(sizeof(SharedWaitEventUsageWorker),
+						 mul_size(e.nnodes, nworkers));
+		}
 		shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
 
@@ -903,11 +932,23 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 		instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
 		instrumentation->instrument_options = estate->es_instrument;
 		instrumentation->instrument_offset = instrument_offset;
+		instrumentation->wait_event_usage_offset = wait_event_usage_offset;
 		instrumentation->num_workers = nworkers;
 		instrumentation->num_plan_nodes = e.nnodes;
 		instrument = GetInstrumentationArray(instrumentation);
 		for (i = 0; i < nworkers * e.nnodes; ++i)
 			InstrInitNode(&instrument[i], estate->es_instrument, false);
+		if (wait_event_usage_offset != 0)
+		{
+			SharedWaitEventUsageWorker *worker_usage;
+
+			worker_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
+			for (i = 0; i < nworkers * e.nnodes; ++i)
+			{
+				worker_usage[i].nentries = 0;
+				worker_usage[i].entries = InvalidDsaPointer;
+			}
+		}
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
 					   instrumentation);
 		pei->instrumentation = instrumentation;
@@ -1137,9 +1178,11 @@ ExecParallelReInitializeDSM(PlanState *planstate,
  */
 static bool
 ExecParallelRetrieveInstrumentation(PlanState *planstate,
-									SharedExecutorInstrumentation *instrumentation)
+									ExecParallelRetrieveInstrumentationContext *r)
 {
+	SharedExecutorInstrumentation *instrumentation = r->instrumentation;
 	NodeInstrumentation *instrument;
+	SharedWaitEventUsageWorker *wait_event_usage = NULL;
 	int			i;
 	int			n;
 	int			ibytes;
@@ -1159,6 +1202,30 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 	for (n = 0; n < instrumentation->num_workers; ++n)
 		InstrAggNode(planstate->instrument, &instrument[n]);
 
+	/* Accumulate the wait event usage from all workers. */
+	if (instrumentation->wait_event_usage_offset != 0 &&
+		planstate->wait_event_usage != NULL)
+	{
+		wait_event_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
+		wait_event_usage += i * instrumentation->num_workers;
+		for (n = 0; n < instrumentation->num_workers; ++n)
+		{
+			SharedWaitEventUsageWorker *worker = &wait_event_usage[n];
+			WaitEventUsageEntry *entries;
+
+			if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
+				continue;
+
+			entries = dsa_get_address(r->area, worker->entries);
+			pgstat_accumulate_wait_event_usage(planstate->wait_event_usage,
+											   entries,
+											   worker->nentries);
+			dsa_free(r->area, worker->entries);
+			worker->nentries = 0;
+			worker->entries = InvalidDsaPointer;
+		}
+	}
+
 	/*
 	 * Also store the per-worker detail.
 	 *
@@ -1216,7 +1283,7 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 	}
 
 	return planstate_tree_walker(planstate, ExecParallelRetrieveInstrumentation,
-								 instrumentation);
+								 r);
 }
 
 /*
@@ -1290,25 +1357,21 @@ ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei)
 }
 
 static void
-ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
-								 dsa_area *area,
-								 const WaitEventUsage *usage)
+ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
+									   dsa_area *area,
+									   const WaitEventUsage *usage)
 {
-	SharedWaitEventUsageWorker *worker;
 	WaitEventUsageEntry *entries;
 	dsa_pointer entries_dsa;
 	Size		entries_size;
 
-	Assert(shared != NULL);
+	Assert(worker != NULL);
 	Assert(area != NULL);
 	Assert(usage != NULL);
-	Assert(IsParallelWorker());
-	Assert(ParallelWorkerNumber < shared->num_workers);
 
 	if (usage->nentries <= 0)
 		return;
 
-	worker = &shared->worker_usage[ParallelWorkerNumber];
 	entries_size = mul_size(sizeof(WaitEventUsageEntry), usage->nentries);
 	entries_dsa = dsa_allocate(area, entries_size);
 	entries = dsa_get_address(area, entries_dsa);
@@ -1320,6 +1383,20 @@ ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
 	worker->entries = entries_dsa;
 }
 
+static void
+ExecParallelReportWaitEventUsage(SharedWaitEventUsage *shared,
+								 dsa_area *area,
+								 const WaitEventUsage *usage)
+{
+	Assert(shared != NULL);
+	Assert(IsParallelWorker());
+	Assert(ParallelWorkerNumber < shared->num_workers);
+
+	ExecParallelReportWaitEventUsageWorker(&shared->worker_usage[ParallelWorkerNumber],
+										   area,
+										   usage);
+}
+
 /*
  * Finish parallel execution.  We wait for parallel workers to finish, and
  * accumulate their buffer/WAL usage.
@@ -1385,8 +1462,13 @@ ExecParallelCleanup(ParallelExecutorInfo *pei)
 {
 	/* Accumulate instrumentation, if any. */
 	if (pei->instrumentation)
-		ExecParallelRetrieveInstrumentation(pei->planstate,
-											pei->instrumentation);
+	{
+		ExecParallelRetrieveInstrumentationContext r;
+
+		r.instrumentation = pei->instrumentation;
+		r.area = pei->area;
+		ExecParallelRetrieveInstrumentation(pei->planstate, &r);
+	}
 
 	/* Accumulate JIT instrumentation, if any. */
 	if (pei->jit_instrumentation)
@@ -1495,6 +1577,17 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 	Assert(IsParallelWorker());
 	Assert(ParallelWorkerNumber < instrumentation->num_workers);
 	InstrAggNode(&instrument[ParallelWorkerNumber], planstate->instrument);
+	if (instrumentation->wait_event_usage_offset != 0 &&
+		planstate->wait_event_usage != NULL)
+	{
+		SharedWaitEventUsageWorker *wait_event_usage;
+
+		wait_event_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
+		wait_event_usage += i * instrumentation->num_workers;
+		ExecParallelReportWaitEventUsageWorker(&wait_event_usage[ParallelWorkerNumber],
+											   planstate->state->es_query_dsa,
+											   planstate->wait_event_usage);
+	}
 
 	return planstate_tree_walker(planstate, ExecParallelReportInstrumentation,
 								 instrumentation);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7c4c66e323f..081855b3fed 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -118,6 +118,7 @@
 #include "executor/nodeWindowAgg.h"
 #include "executor/nodeWorktablescan.h"
 #include "miscadmin.h"
+#include "utils/wait_event.h"
 #include "nodes/nodeFuncs.h"
 
 static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
@@ -415,6 +416,16 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 	if (estate->es_instrument)
 		result->instrument = InstrAllocNode(estate->es_instrument,
 											result->async_capable);
+	if (estate->es_instrument & INSTRUMENT_WAITS)
+	{
+		MemoryContext oldcontext;
+
+		oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
+		result->wait_event_usage = palloc_object(WaitEventUsage);
+		pgstat_init_wait_event_usage(result->wait_event_usage,
+									 estate->es_query_cxt);
+		MemoryContextSwitchTo(oldcontext);
+	}
 
 	return result;
 }
@@ -481,7 +492,8 @@ ExecProcNodeFirst(PlanState *node)
  * This has essentially the same responsibilities as ExecProcNode,
  * but it does not do InstrStartNode/InstrStopNode (mainly because
  * it can't tell how many returned tuples to count).  Each per-node
- * function must provide its own instrumentation support.
+ * function must provide its own instrumentation support, including
+ * wait event attribution when enabled.
  * ----------------------------------------------------------------
  */
 Node *
@@ -758,6 +770,9 @@ ExecShutdownNode(PlanState *node)
 static bool
 ExecShutdownNode_walker(PlanState *node, void *context)
 {
+	bool		node_running;
+	WaitEventUsage *previous_wait_event_usage = NULL;
+
 	if (node == NULL)
 		return false;
 
@@ -773,8 +788,14 @@ ExecShutdownNode_walker(PlanState *node, void *context)
 	 * has never been executed, so as to avoid incorrectly making it appear
 	 * that it has.
 	 */
-	if (node->instrument && node->instrument->running)
+	node_running = node->instrument && node->instrument->running;
+	if (node_running)
+	{
 		InstrStartNode(node->instrument);
+		if (node->wait_event_usage)
+			previous_wait_event_usage =
+				pgstat_enter_wait_event_usage(node->wait_event_usage);
+	}
 
 	planstate_tree_walker(node, ExecShutdownNode_walker, context);
 
@@ -803,8 +824,12 @@ ExecShutdownNode_walker(PlanState *node, void *context)
 	}
 
 	/* Stop the node if we started it above, reporting 0 tuples. */
-	if (node->instrument && node->instrument->running)
+	if (node_running)
+	{
+		if (node->wait_event_usage)
+			pgstat_restore_wait_event_usage(previous_wait_event_usage);
 		InstrStopNode(node->instrument, 0);
+	}
 
 	return false;
 }
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ffbcd572133..a1f22180f82 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -21,6 +21,7 @@
 #include "nodes/execnodes.h"
 #include "portability/instr_time.h"
 #include "utils/guc_hooks.h"
+#include "utils/wait_event.h"
 
 BufferUsage pgBufferUsage;
 static BufferUsage save_pgBufferUsage;
@@ -181,11 +182,17 @@ TupleTableSlot *
 ExecProcNodeInstr(PlanState *node)
 {
 	TupleTableSlot *result;
+	WaitEventUsage *previous_wait_event_usage = NULL;
 
 	InstrStartNode(node->instrument);
+	if (node->wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(node->wait_event_usage);
 
 	result = node->ExecProcNodeReal(node);
 
+	if (node->wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
 
 	return result;
diff --git a/src/backend/executor/nodeBitmapAnd.c b/src/backend/executor/nodeBitmapAnd.c
index 9007dda3802..e2b442bd378 100644
--- a/src/backend/executor/nodeBitmapAnd.c
+++ b/src/backend/executor/nodeBitmapAnd.c
@@ -32,6 +32,7 @@
 #include "executor/instrument.h"
 #include "executor/nodeBitmapAnd.h"
 #include "nodes/tidbitmap.h"
+#include "utils/wait_event.h"
 
 
 /* ----------------------------------------------------------------
@@ -115,10 +116,14 @@ MultiExecBitmapAnd(BitmapAndState *node)
 	int			nplans;
 	int			i;
 	TIDBitmap  *result = NULL;
+	WaitEventUsage *previous_wait_event_usage = NULL;
 
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStartNode(node->ps.instrument);
+	if (node->ps.wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(node->ps.wait_event_usage);
 
 	/*
 	 * get information from the node
@@ -162,6 +167,8 @@ MultiExecBitmapAnd(BitmapAndState *node)
 		elog(ERROR, "BitmapAnd doesn't support zero inputs");
 
 	/* must provide our own instrumentation support */
+	if (node->ps.wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, 0 /* XXX */ );
 
diff --git a/src/backend/executor/nodeBitmapIndexscan.c b/src/backend/executor/nodeBitmapIndexscan.c
index 7978514e1bc..c2798a81e97 100644
--- a/src/backend/executor/nodeBitmapIndexscan.c
+++ b/src/backend/executor/nodeBitmapIndexscan.c
@@ -28,6 +28,7 @@
 #include "executor/nodeIndexscan.h"
 #include "miscadmin.h"
 #include "nodes/tidbitmap.h"
+#include "utils/wait_event.h"
 
 
 /* ----------------------------------------------------------------
@@ -54,10 +55,14 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	IndexScanDesc scandesc;
 	double		nTuples = 0;
 	bool		doscan;
+	WaitEventUsage *previous_wait_event_usage = NULL;
 
 	/* must provide our own instrumentation support */
 	if (node->ss.ps.instrument)
 		InstrStartNode(node->ss.ps.instrument);
+	if (node->ss.ps.wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(node->ss.ps.wait_event_usage);
 
 	/*
 	 * extract necessary information from index scan node
@@ -116,6 +121,8 @@ MultiExecBitmapIndexScan(BitmapIndexScanState *node)
 	}
 
 	/* must provide our own instrumentation support */
+	if (node->ss.ps.wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (node->ss.ps.instrument)
 		InstrStopNode(node->ss.ps.instrument, nTuples);
 
diff --git a/src/backend/executor/nodeBitmapOr.c b/src/backend/executor/nodeBitmapOr.c
index 148c80fdae6..e9ee934a6f6 100644
--- a/src/backend/executor/nodeBitmapOr.c
+++ b/src/backend/executor/nodeBitmapOr.c
@@ -33,6 +33,7 @@
 #include "executor/nodeBitmapOr.h"
 #include "nodes/tidbitmap.h"
 #include "miscadmin.h"
+#include "utils/wait_event.h"
 
 
 /* ----------------------------------------------------------------
@@ -116,10 +117,14 @@ MultiExecBitmapOr(BitmapOrState *node)
 	int			nplans;
 	int			i;
 	TIDBitmap  *result = NULL;
+	WaitEventUsage *previous_wait_event_usage = NULL;
 
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStartNode(node->ps.instrument);
+	if (node->ps.wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(node->ps.wait_event_usage);
 
 	/*
 	 * get information from the node
@@ -180,6 +185,8 @@ MultiExecBitmapOr(BitmapOrState *node)
 		elog(ERROR, "BitmapOr doesn't support zero inputs");
 
 	/* must provide our own instrumentation support */
+	if (node->ps.wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, 0 /* XXX */ );
 
diff --git a/src/backend/executor/nodeHash.c b/src/backend/executor/nodeHash.c
index 8825bb6fa23..6e9e4b4218f 100644
--- a/src/backend/executor/nodeHash.c
+++ b/src/backend/executor/nodeHash.c
@@ -105,9 +105,14 @@ ExecHash(PlanState *pstate)
 Node *
 MultiExecHash(HashState *node)
 {
+	WaitEventUsage *previous_wait_event_usage = NULL;
+
 	/* must provide our own instrumentation support */
 	if (node->ps.instrument)
 		InstrStartNode(node->ps.instrument);
+	if (node->ps.wait_event_usage)
+		previous_wait_event_usage =
+			pgstat_enter_wait_event_usage(node->ps.wait_event_usage);
 
 	if (node->parallel_state != NULL)
 		MultiExecParallelHash(node);
@@ -115,6 +120,8 @@ MultiExecHash(HashState *node)
 		MultiExecPrivateHash(node);
 
 	/* must provide our own instrumentation support */
+	if (node->ps.wait_event_usage)
+		pgstat_restore_wait_event_usage(previous_wait_event_usage);
 	if (node->ps.instrument)
 		InstrStopNode(node->ps.instrument, node->hashtable->reportTuples);
 
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index eb01bc3d88c..3d5a10b44c4 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -27,6 +27,7 @@
 #include "storage/shmem.h"
 #include "storage/subsystems.h"
 #include "storage/spin.h"
+#include "utils/memutils.h"
 #include "utils/wait_event.h"
 
 
@@ -47,6 +48,7 @@ uint32	   *my_wait_event_info = &local_my_wait_event_info;
 
 int			pgstat_wait_event_usage_depth = 0;
 static WaitEventUsage *pgstat_wait_event_usage = NULL;
+static WaitEventUsage *pgstat_wait_event_node_usage = NULL;
 static uint32 pgstat_wait_event_usage_current = 0;
 static instr_time pgstat_wait_event_usage_start;
 
@@ -358,6 +360,28 @@ pgstat_reset_wait_event_storage(void)
 	my_wait_event_info = &local_my_wait_event_info;
 }
 
+/*
+ * Initialize a wait event usage accumulator.
+ */
+void
+pgstat_init_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+{
+	Assert(usage != NULL);
+	Assert(memcontext != NULL);
+
+	memset(usage, 0, sizeof(WaitEventUsage));
+
+	/*
+	 * Wait events may end inside critical sections, for example while
+	 * performing synchronous I/O.  Keep usage entries in a dedicated context
+	 * where the memory manager permits that accounting path to grow.
+	 */
+	usage->memcontext = AllocSetContextCreate(memcontext,
+											  "Wait Event Usage",
+											  ALLOCSET_SMALL_SIZES);
+	MemoryContextAllowInCriticalSection(usage->memcontext, true);
+}
+
 /*
  * Start collecting exact wait event timings in this backend.
  *
@@ -375,9 +399,9 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
 
 	if (pgstat_wait_event_usage_depth++ == 0)
 	{
-		memset(usage, 0, sizeof(WaitEventUsage));
-		usage->memcontext = memcontext;
+		pgstat_init_wait_event_usage(usage, memcontext);
 		pgstat_wait_event_usage = usage;
+		pgstat_wait_event_node_usage = NULL;
 		pgstat_wait_event_usage_current = 0;
 		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
 	}
@@ -398,11 +422,27 @@ pgstat_end_wait_event_usage(WaitEventUsage *usage)
 			pgstat_count_wait_event_end();
 
 		pgstat_wait_event_usage = NULL;
+		pgstat_wait_event_node_usage = NULL;
 		pgstat_wait_event_usage_current = 0;
 		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
 	}
 }
 
+WaitEventUsage *
+pgstat_enter_wait_event_usage(WaitEventUsage *usage)
+{
+	WaitEventUsage *previous = pgstat_wait_event_node_usage;
+
+	pgstat_wait_event_node_usage = usage;
+	return previous;
+}
+
+void
+pgstat_restore_wait_event_usage(WaitEventUsage *usage)
+{
+	pgstat_wait_event_node_usage = usage;
+}
+
 /*
  * Record the beginning of a wait event for exact EXPLAIN-style accounting.
  */
@@ -444,6 +484,11 @@ pgstat_count_wait_event_end(void)
 					  pgstat_wait_event_usage_current,
 					  1,
 					  &elapsed);
+	if (pgstat_wait_event_node_usage != NULL)
+		WaitEventUsageAdd(pgstat_wait_event_node_usage,
+						  pgstat_wait_event_usage_current,
+						  1,
+						  &elapsed);
 
 	pgstat_wait_event_usage_current = 0;
 	INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index f093a52aae0..f46ee03d2c1 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -65,6 +65,7 @@ typedef enum InstrumentOption
 	INSTRUMENT_ROWS = 1 << 2,	/* needs row count */
 	INSTRUMENT_WAL = 1 << 3,	/* needs WAL usage */
 	INSTRUMENT_IO = 1 << 4,		/* needs IO usage */
+	INSTRUMENT_WAITS = 1 << 5,	/* needs wait event usage */
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 781c8c79132..56f9dcbef88 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1212,6 +1212,7 @@ typedef struct PlanState
 
 	NodeInstrumentation *instrument;	/* Optional runtime stats for this
 										 * node */
+	WaitEventUsage *wait_event_usage; /* Optional wait event usage */
 	WorkerNodeInstrumentation *worker_instrument;	/* per-worker
 													 * instrumentation */
 
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 63992137ee7..64411288403 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -36,12 +36,16 @@ static inline void pgstat_report_wait_start(uint32 wait_event_info);
 static inline void pgstat_report_wait_end(void);
 extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
 extern void pgstat_reset_wait_event_storage(void);
+extern void pgstat_init_wait_event_usage(WaitEventUsage *usage,
+										 MemoryContext memcontext);
 extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
 										  MemoryContext memcontext);
 extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
 extern void pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
 											   const WaitEventUsageEntry *entries,
 											   int nentries);
+extern WaitEventUsage *pgstat_enter_wait_event_usage(WaitEventUsage *usage);
+extern void pgstat_restore_wait_event_usage(WaitEventUsage *usage);
 extern void pgstat_count_wait_event_start(uint32 wait_event_info);
 extern void pgstat_count_wait_event_end(void);
 
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index e3847e222be..c086fa48d80 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -101,12 +101,14 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
 
 -- WAITS option
 select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
-            explain_filter             
----------------------------------------
+             explain_filter              
+-----------------------------------------
  Result (actual rows=N.N loops=N)
+   Wait Events:
+     Timeout:PgSleep calls=N time=N.N ms
  Wait Events:
    Timeout:PgSleep calls=N time=N.N ms
-(3 rows)
+(5 rows)
 
 select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
                                      ?column?                                     
@@ -114,6 +116,12 @@ select explain_filter_to_json('explain (analyze, waits, costs off, summary off,
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
 (1 row)
 
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Plan,Wait Events,0}';
+                                     ?column?                                     
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
 begin;
 create function pg_temp.parallel_pg_sleep(float8) returns void
   language internal volatile parallel safe as 'pg_sleep';
@@ -130,6 +138,44 @@ select jsonb_path_query_first(
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
 (1 row)
 
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select pg_temp.parallel_pg_sleep(0.01)
+                         from tenk1 where unique1 = 1') #> '{0,Plan}',
+  '$.**."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+                              jsonb_path_query_first                              
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
+begin;
+-- This test deliberately creates a Bitmap Index Scan runtime-key wait.
+-- The STABLE PL/pgSQL wrapper is test scaffolding: STABLE lets the
+-- expression be used as an index runtime key, while PL/pgSQL prevents SQL
+-- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
+-- The planner GUCs below are likewise test-only scaffolding to make the
+-- node shape deterministic.
+create function pg_temp.explain_waits_sleep_int(int) returns int
+  language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
+create temp table explain_waits_bitmap (a int);
+insert into explain_waits_bitmap select g from generate_series(1, 10) g;
+create index explain_waits_bitmap_a_idx on explain_waits_bitmap(a);
+analyze explain_waits_bitmap;
+set local enable_seqscan = off;
+set local enable_indexscan = off;
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select * from explain_waits_bitmap
+                         where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+  '$.** ? (@."Node Type" == "Bitmap Index Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+                              jsonb_path_query_first                              
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
 rollback;
 explain (waits) select 1;
 ERROR:  EXPLAIN option WAITS requires ANALYZE
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index 8821250bcef..bd196f248db 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -71,6 +71,7 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
 -- WAITS option
 select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
 select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Plan,Wait Events,0}';
 begin;
 create function pg_temp.parallel_pg_sleep(float8) returns void
   language internal volatile parallel safe as 'pg_sleep';
@@ -82,6 +83,34 @@ select jsonb_path_query_first(
                          from tenk1 where unique1 = 1') #> '{0,Wait Events}',
   '$[*] ? (@."Wait Event" == "PgSleep")'
 );
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select pg_temp.parallel_pg_sleep(0.01)
+                         from tenk1 where unique1 = 1') #> '{0,Plan}',
+  '$.**."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+rollback;
+begin;
+-- This test deliberately creates a Bitmap Index Scan runtime-key wait.
+-- The STABLE PL/pgSQL wrapper is test scaffolding: STABLE lets the
+-- expression be used as an index runtime key, while PL/pgSQL prevents SQL
+-- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
+-- The planner GUCs below are likewise test-only scaffolding to make the
+-- node shape deterministic.
+create function pg_temp.explain_waits_sleep_int(int) returns int
+  language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
+create temp table explain_waits_bitmap (a int);
+insert into explain_waits_bitmap select g from generate_series(1, 10) g;
+create index explain_waits_bitmap_a_idx on explain_waits_bitmap(a);
+analyze explain_waits_bitmap;
+set local enable_seqscan = off;
+set local enable_indexscan = off;
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select * from explain_waits_bitmap
+                         where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+  '$.** ? (@."Node Type" == "Bitmap Index Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
 rollback;
 explain (waits) select 1;
 
-- 
2.52.0



  [application/octet-stream] 0004-Refine-EXPLAIN-WAITS-attribution-semantics.patch (42.8K, 6-0004-Refine-EXPLAIN-WAITS-attribution-semantics.patch)
  download | inline diff:
From 1ea51ee5f55dccad7f7692244640171f4e42fcbf Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:44:49 +0500
Subject: [RFC PATCH v1 4/7] Refine EXPLAIN WAITS attribution semantics

---
 doc/src/sgml/ref/explain.sgml           |  52 ++++++-
 src/backend/commands/explain.c          |  82 +++++++----
 src/backend/executor/execParallel.c     | 121 +++++++++++-----
 src/backend/utils/activity/wait_event.c | 184 +++++++++++++++++++-----
 src/bin/psql/tab-complete.in.c          |   4 +-
 src/include/utils/wait_event.h          |   5 +
 src/test/regress/expected/explain.out   | 128 ++++++++++++++++-
 src/test/regress/sql/explain.sql        | 102 ++++++++++++-
 8 files changed, 574 insertions(+), 104 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 5460568af1d..d699b215120 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -315,9 +315,46 @@ ROLLBACK;
     <term><literal>WAITS</literal></term>
     <listitem>
      <para>
-      Include exact backend-local wait event timing observed during query
-      execution.  Wait events are grouped by wait event type and wait event
-      name, and include the number of waits and total time spent waiting.
+      Include wait event timing observed during query execution.  Wait
+      events are grouped by wait event type and wait event name, using the
+      same wait event identifiers reported by
+      <structname>pg_stat_activity</structname>; see
+      <xref linkend="wait-event-table"/>.  Each entry includes the number
+      of completed waits and the total time spent waiting.  Wait timing is
+      collected when this option is enabled even if <literal>TIMING</literal>
+      is disabled.
+     </para>
+
+     <para>
+      If <command>EXPLAIN</command> cannot grow its per-query or per-node wait
+      event storage without risking an error while a wait is ending, waits
+      whose exact event identifier could not be stored are accumulated in an
+      <literal>Unrecorded Wait Event Calls</literal> counter and
+      <literal>Unrecorded Wait Event Time</literal> total.  This is a
+      reporting fallback under memory pressure, not a wait event emitted by
+      server instrumentation.
+     </para>
+
+     <para>
+      The top-level <literal>Statement Wait Events</literal> summary counts
+      each wait once for the statement, including waits reported by parallel
+      workers.  Nested <command>EXPLAIN (ANALYZE, WAITS)</command> commands
+      maintain separate statement-level summaries; a wait is counted once in
+      each active summary.  Plan nodes can also show <literal>Wait
+      Events</literal>.  Node-level wait events are inclusive, like plan node
+      timing in
+      <command>EXPLAIN ANALYZE</command>: a wait is attributed to every
+      active plan node when the wait begins.  Therefore node-level wait event
+      times should not be summed to compute the statement total.
+     </para>
+
+     <para>
+      The overhead of this option depends on the number of waits observed and,
+      for node-level output, on the number of active plan nodes when each wait
+      begins.
+     </para>
+
+     <para>
       This parameter may only be used when <literal>ANALYZE</literal> is also
       enabled.  It defaults to <literal>FALSE</literal>.
      </para>
@@ -405,6 +442,15 @@ ROLLBACK;
    execution, and on machines that have relatively slow operating
    system calls for obtaining the time of day.
   </para>
+
+  <para>
+   The <literal>WAITS</literal> option also adds profiling overhead when
+   <literal>ANALYZE</literal> is used.  It measures wait intervals reported
+   by server instrumentation and maintains per-statement and per-node wait
+   event counters while the statement executes.  The overhead depends on the
+   number of reported waits, the number of distinct wait events, and the depth
+   of the active executor plan node stack.
+  </para>
  </refsect1>
 
  <refsect1>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 0575ae8368b..9c198f8e599 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -151,7 +151,7 @@ static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
 static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
 static int	wait_event_usage_cmp(const void *a, const void *b);
-static void show_wait_event_usage(ExplainState *es,
+static void show_wait_event_usage(ExplainState *es, const char *labelname,
 								  const WaitEventUsage *usage);
 static void show_memory_counters(ExplainState *es,
 								 const MemoryContextCounters *mem_counters);
@@ -219,7 +219,7 @@ ExplainQuery(ParseState *pstate, ExplainStmt *stmt,
 		 * In the case of an INSTEAD NOTHING, tell at least that.  But in
 		 * non-text format, the output is delimited, so this isn't necessary.
 		 */
-		if (es->format == EXPLAIN_FORMAT_TEXT)
+	if (es->format == EXPLAIN_FORMAT_TEXT)
 			appendStringInfoString(es->str, "Query rewrites to nothing\n");
 	}
 	else
@@ -634,7 +634,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	ExplainPrintPlan(es, queryDesc);
 
 	if (waitEventUsagePtr)
-		show_wait_event_usage(es, waitEventUsagePtr);
+		show_wait_event_usage(es, "Statement Wait Events", waitEventUsagePtr);
 
 	/* Show buffer and/or memory usage in planning */
 	if (peek_buffer_usage(es, bufusage) || mem_counters)
@@ -2335,7 +2335,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	if (es->wal && planstate->instrument)
 		show_wal_usage(es, &planstate->instrument->instr.walusage);
 	if (es->waits)
-		show_wait_event_usage(es, planstate->wait_event_usage);
+		show_wait_event_usage(es, "Wait Events", planstate->wait_event_usage);
 
 	/* Prepare per-worker buffer/WAL usage */
 	if (es->workers_state && (es->buffers || es->wal) && es->verbose)
@@ -4556,14 +4556,15 @@ wait_event_usage_cmp(const void *a, const void *b)
 }
 
 static void
-show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
+show_wait_event_usage(ExplainState *es, const char *labelname,
+					  const WaitEventUsage *usage)
 {
 	WaitEventUsageEntry *entries;
 
 	if (usage == NULL)
 		return;
 
-	if (usage->nentries == 0)
+	if (usage->nentries == 0 && usage->overflowed_calls == 0)
 		return;
 
 	if (usage->nentries > 0)
@@ -4577,10 +4578,10 @@ show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
 	else
 		entries = NULL;
 
-	if (es->format == EXPLAIN_FORMAT_TEXT)
+		if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainIndentText(es);
-		appendStringInfoString(es->str, "Wait Events:\n");
+		appendStringInfo(es->str, "%s:\n", labelname);
 		es->indent++;
 
 		for (int i = 0; i < usage->nentries; i++)
@@ -4599,35 +4600,60 @@ show_wait_event_usage(ExplainState *es, const WaitEventUsage *usage)
 							 INSTR_TIME_GET_MILLISEC(entries[i].time));
 		}
 
+		if (usage->overflowed_calls > 0)
+		{
+			ExplainIndentText(es);
+			appendStringInfo(es->str,
+							 "Unrecorded Wait Event Calls: calls=%" PRIu64 " time=%0.3f ms\n",
+							 usage->overflowed_calls,
+							 INSTR_TIME_GET_MILLISEC(usage->overflowed_time));
+		}
+
 		es->indent--;
 	}
 	else
 	{
-		ExplainOpenGroup("Wait-Events", "Wait Events", false, es);
-
-		for (int i = 0; i < usage->nentries; i++)
+		if (usage->nentries > 0)
 		{
-			const char *event_type;
-			const char *event_name;
+			ExplainOpenGroup("Wait-Events", labelname, false, es);
 
-			event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
-			event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+			for (int i = 0; i < usage->nentries; i++)
+			{
+				const char *event_type;
+				const char *event_name;
 
-			ExplainOpenGroup("Wait-Event", NULL, true, es);
-			ExplainPropertyText("Wait Event Type",
-								event_type ? event_type : "Unknown",
-								es);
-			ExplainPropertyText("Wait Event",
-								event_name ? event_name : "unknown",
-								es);
-			ExplainPropertyUInteger("Calls", NULL, entries[i].calls, es);
-			ExplainPropertyFloat("Time", "ms",
-								 INSTR_TIME_GET_MILLISEC(entries[i].time),
-								 3, es);
-			ExplainCloseGroup("Wait-Event", NULL, true, es);
+				event_type = pgstat_get_wait_event_type(entries[i].wait_event_info);
+				event_name = pgstat_get_wait_event(entries[i].wait_event_info);
+
+				ExplainOpenGroup("Wait-Event", NULL, true, es);
+				ExplainPropertyText("Wait Event Type",
+									event_type ? event_type : "Unknown",
+									es);
+				ExplainPropertyText("Wait Event",
+									event_name ? event_name : "unknown",
+									es);
+				ExplainPropertyUInteger("Calls", NULL, entries[i].calls, es);
+				ExplainPropertyFloat("Time", "ms",
+									 INSTR_TIME_GET_MILLISEC(entries[i].time),
+									 3, es);
+				ExplainCloseGroup("Wait-Event", NULL, true, es);
+			}
+
+			ExplainCloseGroup("Wait-Events", labelname, false, es);
 		}
 
-		ExplainCloseGroup("Wait-Events", "Wait Events", false, es);
+		if (usage->overflowed_calls > 0)
+		{
+			/*
+			 * This is not a wait event identity, so keep it outside the
+			 * Wait Events array in structured output.
+			 */
+			ExplainPropertyUInteger("Unrecorded Wait Event Calls", NULL,
+									usage->overflowed_calls, es);
+			ExplainPropertyFloat("Unrecorded Wait Event Time", "ms",
+								 INSTR_TIME_GET_MILLISEC(usage->overflowed_time),
+								 3, es);
+		}
 	}
 
 	if (entries)
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index f77f539dd27..520b4b8484f 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,6 +87,8 @@ typedef struct SharedWaitEventUsageWorker
 {
 	int			nentries;
 	dsa_pointer entries;
+	uint64		overflowed_calls;
+	instr_time	overflowed_time;
 } SharedWaitEventUsageWorker;
 
 /*
@@ -173,6 +175,10 @@ static bool ExecParallelReInitializeDSM(PlanState *planstate,
 static bool ExecParallelRetrieveInstrumentation(PlanState *planstate,
 												ExecParallelRetrieveInstrumentationContext *r);
 static void ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei);
+static void ExecParallelInitWaitEventUsageWorker(SharedWaitEventUsageWorker *worker);
+static void ExecParallelAccumulateWaitEventUsageWorker(WaitEventUsage *usage,
+													   SharedWaitEventUsageWorker *worker,
+													   dsa_area *area);
 static void ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
 												   dsa_area *area,
 												   const WaitEventUsage *usage);
@@ -904,10 +910,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 		wait_event_usage = shm_toc_allocate(pcxt->toc, wait_event_usage_len);
 		wait_event_usage->num_workers = nworkers;
 		for (int i = 0; i < nworkers; i++)
-		{
-			wait_event_usage->worker_usage[i].nentries = 0;
-			wait_event_usage->worker_usage[i].entries = InvalidDsaPointer;
-		}
+			ExecParallelInitWaitEventUsageWorker(&wait_event_usage->worker_usage[i]);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAIT_EVENT_USAGE,
 					   wait_event_usage);
 		pei->wait_event_usage = wait_event_usage;
@@ -944,10 +947,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 
 			worker_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
 			for (i = 0; i < nworkers * e.nnodes; ++i)
-			{
-				worker_usage[i].nentries = 0;
-				worker_usage[i].entries = InvalidDsaPointer;
-			}
+				ExecParallelInitWaitEventUsageWorker(&worker_usage[i]);
 		}
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
 					   instrumentation);
@@ -1209,21 +1209,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 		wait_event_usage = GetInstrumentationWaitEventUsageArray(instrumentation);
 		wait_event_usage += i * instrumentation->num_workers;
 		for (n = 0; n < instrumentation->num_workers; ++n)
-		{
-			SharedWaitEventUsageWorker *worker = &wait_event_usage[n];
-			WaitEventUsageEntry *entries;
-
-			if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
-				continue;
-
-			entries = dsa_get_address(r->area, worker->entries);
-			pgstat_accumulate_wait_event_usage(planstate->wait_event_usage,
-											   entries,
-											   worker->nentries);
-			dsa_free(r->area, worker->entries);
-			worker->nentries = 0;
-			worker->entries = InvalidDsaPointer;
-		}
+			ExecParallelAccumulateWaitEventUsageWorker(planstate->wait_event_usage,
+													   &wait_event_usage[n],
+													   r->area);
 	}
 
 	/*
@@ -1339,18 +1327,46 @@ ExecParallelRetrieveWaitEventUsage(ParallelExecutorInfo *pei)
 		return;
 
 	for (int i = 0; i < shared->num_workers; i++)
+		ExecParallelAccumulateWaitEventUsageWorker(usage,
+												   &shared->worker_usage[i],
+												   pei->area);
+}
+
+static void
+ExecParallelInitWaitEventUsageWorker(SharedWaitEventUsageWorker *worker)
+{
+	worker->nentries = 0;
+	worker->entries = InvalidDsaPointer;
+	worker->overflowed_calls = 0;
+	INSTR_TIME_SET_ZERO(worker->overflowed_time);
+}
+
+static void
+ExecParallelAccumulateWaitEventUsageWorker(WaitEventUsage *usage,
+										   SharedWaitEventUsageWorker *worker,
+										   dsa_area *area)
+{
+	Assert(usage != NULL);
+	Assert(worker != NULL);
+	Assert(area != NULL);
+
+	if (worker->overflowed_calls > 0)
 	{
-		SharedWaitEventUsageWorker *worker = &shared->worker_usage[i];
-		WaitEventUsageEntry *entries;
+		usage->overflowed_calls += worker->overflowed_calls;
+		INSTR_TIME_ADD(usage->overflowed_time, worker->overflowed_time);
+		worker->overflowed_calls = 0;
+		INSTR_TIME_SET_ZERO(worker->overflowed_time);
+	}
 
-		if (worker->nentries <= 0 || !DsaPointerIsValid(worker->entries))
-			continue;
+	if (worker->nentries > 0 && DsaPointerIsValid(worker->entries))
+	{
+		WaitEventUsageEntry *entries;
 
-		entries = dsa_get_address(pei->area, worker->entries);
+		entries = dsa_get_address(area, worker->entries);
 		pgstat_accumulate_wait_event_usage(usage,
 										   entries,
 										   worker->nentries);
-		dsa_free(pei->area, worker->entries);
+		dsa_free(area, worker->entries);
 		worker->nentries = 0;
 		worker->entries = InvalidDsaPointer;
 	}
@@ -1362,24 +1378,65 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
 									   const WaitEventUsage *usage)
 {
 	WaitEventUsageEntry *entries;
+	WaitEventUsageEntry *old_entries = NULL;
 	dsa_pointer entries_dsa;
 	Size		entries_size;
+	int			old_nentries = 0;
+	int			new_nentries = 0;
+	int			i = 0;
+	int			j = 0;
 
 	Assert(worker != NULL);
 	Assert(area != NULL);
 	Assert(usage != NULL);
 
+	worker->overflowed_calls += usage->overflowed_calls;
+	INSTR_TIME_ADD(worker->overflowed_time, usage->overflowed_time);
+
 	if (usage->nentries <= 0)
 		return;
 
-	entries_size = mul_size(sizeof(WaitEventUsageEntry), usage->nentries);
+	if (DsaPointerIsValid(worker->entries))
+	{
+		Assert(worker->nentries > 0);
+		old_nentries = worker->nentries;
+		old_entries = dsa_get_address(area, worker->entries);
+	}
+
+	entries_size = mul_size(sizeof(WaitEventUsageEntry),
+							(Size) old_nentries + (Size) usage->nentries);
 	entries_dsa = dsa_allocate(area, entries_size);
 	entries = dsa_get_address(area, entries_dsa);
-	memcpy(entries, usage->entries, entries_size);
+
+	while (i < old_nentries && j < usage->nentries)
+	{
+		WaitEventUsageEntry *entry = &entries[new_nentries];
+		uint32		old_info = old_entries[i].wait_event_info;
+		uint32		new_info = usage->entries[j].wait_event_info;
+
+		if (old_info < new_info)
+			*entry = old_entries[i++];
+		else if (old_info > new_info)
+			*entry = usage->entries[j++];
+		else
+		{
+			*entry = old_entries[i++];
+			entry->calls += usage->entries[j].calls;
+			INSTR_TIME_ADD(entry->time, usage->entries[j].time);
+			j++;
+		}
+
+		new_nentries++;
+	}
+
+	while (i < old_nentries)
+		entries[new_nentries++] = old_entries[i++];
+	while (j < usage->nentries)
+		entries[new_nentries++] = usage->entries[j++];
 
 	if (DsaPointerIsValid(worker->entries))
 		dsa_free(area, worker->entries);
-	worker->nentries = usage->nentries;
+	worker->nentries = new_nentries;
 	worker->entries = entries_dsa;
 }
 
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 3d5a10b44c4..61b418e8fd7 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -39,6 +39,10 @@ static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
 static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
 							  uint64 calls, const instr_time *elapsed);
+static void WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
+									  const instr_time *elapsed);
+static int	WaitEventUsageFind(const WaitEventUsage *usage,
+							   uint32 wait_event_info, bool *found);
 
 
 static uint32 local_my_wait_event_info;
@@ -48,7 +52,16 @@ uint32	   *my_wait_event_info = &local_my_wait_event_info;
 
 int			pgstat_wait_event_usage_depth = 0;
 static WaitEventUsage *pgstat_wait_event_usage = NULL;
+
+/*
+ * Top of the active executor node and query-level stacks.  Query-level wait
+ * accounting records each wait once in every active query-level collector.
+ * Per-node wait accounting records each wait in every active plan node,
+ * matching the inclusive timing semantics of EXPLAIN ANALYZE plan nodes.
+ */
 static WaitEventUsage *pgstat_wait_event_node_usage = NULL;
+static WaitEventUsage *pgstat_wait_event_usage_node_stack = NULL;
+static WaitEventUsage *pgstat_wait_event_usage_query_stack = NULL;
 static uint32 pgstat_wait_event_usage_current = 0;
 static instr_time pgstat_wait_event_usage_start;
 
@@ -387,24 +400,37 @@ pgstat_init_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
  *
  * This is intended for short-lived instrumentation such as EXPLAIN ANALYZE.
  * It records waits observed through pgstat_report_wait_start/end in backend
- * local memory.  Nested collection is deliberately treated as part of the
- * outer collection for now; callers that want independent nested accounting
- * need a stack of WaitEventUsage contexts.
+ * local memory.  Nested top-level collectors are kept in a query-level stack;
+ * a wait is counted once in each active collector.
  */
 void
 pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
 {
+	bool		first;
+
 	Assert(usage != NULL);
 	Assert(memcontext != NULL);
 
-	if (pgstat_wait_event_usage_depth++ == 0)
+	first = pgstat_wait_event_usage_depth == 0;
+	if (first)
 	{
-		pgstat_init_wait_event_usage(usage, memcontext);
-		pgstat_wait_event_usage = usage;
 		pgstat_wait_event_node_usage = NULL;
+		pgstat_wait_event_usage_node_stack = NULL;
+		pgstat_wait_event_usage_query_stack = NULL;
 		pgstat_wait_event_usage_current = 0;
 		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
 	}
+
+	pgstat_init_wait_event_usage(usage, memcontext);
+	usage->query_parent = pgstat_wait_event_usage;
+	/*
+	 * A nested EXPLAIN can error out while one of its plan nodes is active,
+	 * skipping the usual node-level restore.  Remember the outer node stack so
+	 * ending this collector can discard any leaked inner node frames.
+	 */
+	usage->saved_node_usage = pgstat_wait_event_node_usage;
+	pgstat_wait_event_usage = usage;
+	pgstat_wait_event_usage_depth++;
 }
 
 /*
@@ -415,14 +441,22 @@ pgstat_end_wait_event_usage(WaitEventUsage *usage)
 {
 	Assert(usage != NULL);
 	Assert(pgstat_wait_event_usage_depth > 0);
+	Assert(pgstat_wait_event_usage == usage);
+
+	if (pgstat_wait_event_usage_current != 0)
+		pgstat_count_wait_event_end();
+
+	pgstat_wait_event_usage = usage->query_parent;
+	usage->query_parent = NULL;
+	pgstat_wait_event_node_usage = usage->saved_node_usage;
+	usage->saved_node_usage = NULL;
 
 	if (--pgstat_wait_event_usage_depth == 0)
 	{
-		if (pgstat_wait_event_usage_current != 0)
-			pgstat_count_wait_event_end();
-
 		pgstat_wait_event_usage = NULL;
 		pgstat_wait_event_node_usage = NULL;
+		pgstat_wait_event_usage_node_stack = NULL;
+		pgstat_wait_event_usage_query_stack = NULL;
 		pgstat_wait_event_usage_current = 0;
 		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
 	}
@@ -433,6 +467,9 @@ pgstat_enter_wait_event_usage(WaitEventUsage *usage)
 {
 	WaitEventUsage *previous = pgstat_wait_event_node_usage;
 
+	Assert(usage != NULL);
+
+	usage->active_parent = previous;
 	pgstat_wait_event_node_usage = usage;
 	return previous;
 }
@@ -440,6 +477,9 @@ pgstat_enter_wait_event_usage(WaitEventUsage *usage)
 void
 pgstat_restore_wait_event_usage(WaitEventUsage *usage)
 {
+	Assert(pgstat_wait_event_node_usage == NULL ||
+		   pgstat_wait_event_node_usage->active_parent == usage);
+
 	pgstat_wait_event_node_usage = usage;
 }
 
@@ -460,6 +500,8 @@ pgstat_count_wait_event_start(uint32 wait_event_info)
 		pgstat_count_wait_event_end();
 
 	pgstat_wait_event_usage_current = wait_event_info;
+	pgstat_wait_event_usage_node_stack = pgstat_wait_event_node_usage;
+	pgstat_wait_event_usage_query_stack = pgstat_wait_event_usage;
 	INSTR_TIME_SET_CURRENT(pgstat_wait_event_usage_start);
 }
 
@@ -480,17 +522,24 @@ pgstat_count_wait_event_end(void)
 	elapsed = end;
 	INSTR_TIME_SUBTRACT(elapsed, pgstat_wait_event_usage_start);
 
-	WaitEventUsageAdd(pgstat_wait_event_usage,
-					  pgstat_wait_event_usage_current,
-					  1,
-					  &elapsed);
-	if (pgstat_wait_event_node_usage != NULL)
-		WaitEventUsageAdd(pgstat_wait_event_node_usage,
+	for (WaitEventUsage *query_usage = pgstat_wait_event_usage_query_stack;
+		 query_usage != NULL;
+		 query_usage = query_usage->query_parent)
+		WaitEventUsageAdd(query_usage,
+						  pgstat_wait_event_usage_current,
+						  1,
+						  &elapsed);
+	for (WaitEventUsage *node_usage = pgstat_wait_event_usage_node_stack;
+		 node_usage != NULL;
+		 node_usage = node_usage->active_parent)
+		WaitEventUsageAdd(node_usage,
 						  pgstat_wait_event_usage_current,
 						  1,
 						  &elapsed);
 
 	pgstat_wait_event_usage_current = 0;
+	pgstat_wait_event_usage_node_stack = NULL;
+	pgstat_wait_event_usage_query_stack = NULL;
 	INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
 }
 
@@ -509,56 +558,123 @@ pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
 						  &entries[i].time);
 }
 
+/*
+ * Find the existing entry, or the insertion position for a new entry.
+ *
+ * WaitEventUsage entries are kept sorted by wait_event_info so the hot
+ * wait-end path does not need a linear scan through all distinct wait events
+ * already seen by the query or plan node.
+ */
+static int
+WaitEventUsageFind(const WaitEventUsage *usage, uint32 wait_event_info,
+				   bool *found)
+{
+	int			low = 0;
+	int			high = usage->nentries;
+
+	while (low < high)
+	{
+		int			mid = low + (high - low) / 2;
+		uint32		entry_info = usage->entries[mid].wait_event_info;
+
+		if (entry_info < wait_event_info)
+			low = mid + 1;
+		else
+			high = mid;
+	}
+
+	*found = low < usage->nentries &&
+		usage->entries[low].wait_event_info == wait_event_info;
+	return low;
+}
+
 static void
 WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
 				  uint64 calls, const instr_time *elapsed)
 {
-	WaitEventUsageEntry *entry = NULL;
+	bool		found;
+	int			idx;
+	WaitEventUsageEntry *entry;
 
-	for (int i = 0; i < usage->nentries; i++)
-	{
-		if (usage->entries[i].wait_event_info == wait_event_info)
-		{
-			entry = &usage->entries[i];
-			break;
-		}
-	}
+	idx = WaitEventUsageFind(usage, wait_event_info, &found);
 
-	if (entry == NULL)
+	if (!found)
 	{
 		if (usage->nentries >= usage->maxentries)
 		{
-			MemoryContext oldcontext;
 			int			newmaxentries;
+			Size		entries_size;
+			WaitEventUsageEntry *newentries;
 
 			if (usage->maxentries > 0)
+			{
+				if ((Size) usage->maxentries >
+					MaxAllocSize / sizeof(WaitEventUsageEntry) / 2)
+				{
+					WaitEventUsageAddOverflow(usage, calls, elapsed);
+					return;
+				}
+
 				newmaxentries = usage->maxentries * 2;
+			}
 			else
 				newmaxentries = WAIT_EVENT_USAGE_INITIAL_EVENTS;
 
-			oldcontext = MemoryContextSwitchTo(usage->memcontext);
+			if ((Size) newmaxentries >
+				MaxAllocSize / sizeof(WaitEventUsageEntry))
+			{
+				WaitEventUsageAddOverflow(usage, calls, elapsed);
+				return;
+			}
+
+			entries_size = sizeof(WaitEventUsageEntry) * newmaxentries;
+			/*
+			 * Wait completion can happen in a critical section, so growth
+			 * must not throw ERROR.  If storage cannot be grown without
+			 * throwing, preserve total wait time in the overflow bucket.
+			 */
 			if (usage->entries)
-				usage->entries = repalloc_array(usage->entries,
-												WaitEventUsageEntry,
-												newmaxentries);
+				newentries = repalloc_extended(usage->entries, entries_size,
+											   MCXT_ALLOC_NO_OOM);
 			else
-				usage->entries = palloc_array(WaitEventUsageEntry,
-											  newmaxentries);
-			MemoryContextSwitchTo(oldcontext);
+				newentries = MemoryContextAllocExtended(usage->memcontext,
+														entries_size,
+														MCXT_ALLOC_NO_OOM);
+			if (newentries == NULL)
+			{
+				WaitEventUsageAddOverflow(usage, calls, elapsed);
+				return;
+			}
 
+			usage->entries = newentries;
 			usage->maxentries = newmaxentries;
 		}
 
-		entry = &usage->entries[usage->nentries++];
+		if (idx < usage->nentries)
+			memmove(&usage->entries[idx + 1], &usage->entries[idx],
+					sizeof(WaitEventUsageEntry) * (usage->nentries - idx));
+
+		entry = &usage->entries[idx];
+		usage->nentries++;
 		entry->wait_event_info = wait_event_info;
 		entry->calls = 0;
 		INSTR_TIME_SET_ZERO(entry->time);
 	}
+	else
+		entry = &usage->entries[idx];
 
 	entry->calls += calls;
 	INSTR_TIME_ADD(entry->time, *elapsed);
 }
 
+static void
+WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
+						  const instr_time *elapsed)
+{
+	usage->overflowed_calls += calls;
+	INSTR_TIME_ADD(usage->overflowed_time, *elapsed);
+}
+
 /* ----------
  * pgstat_get_wait_event_type() -
  *
diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index db65d130fcb..b33490e7415 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4524,9 +4524,9 @@ match_previous_words(int pattern_id,
 		 */
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
 			COMPLETE_WITH("ANALYZE", "VERBOSE", "COSTS", "SETTINGS", "GENERIC_PLAN",
-						  "BUFFERS", "SERIALIZE", "WAL", "TIMING", "SUMMARY",
+						  "BUFFERS", "SERIALIZE", "WAL", "WAITS", "TIMING", "SUMMARY",
 						  "MEMORY", "FORMAT");
-		else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|TIMING|SUMMARY|MEMORY"))
+		else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|WAITS|TIMING|SUMMARY|MEMORY"))
 			COMPLETE_WITH("ON", "OFF");
 		else if (TailMatches("SERIALIZE"))
 			COMPLETE_WITH("TEXT", "NONE", "BINARY");
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 64411288403..f14945cdb16 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -25,9 +25,14 @@ typedef struct WaitEventUsageEntry
 typedef struct WaitEventUsage
 {
 	MemoryContext memcontext;
+	struct WaitEventUsage *active_parent; /* active plan-node stack link */
+	struct WaitEventUsage *query_parent;	/* active query-level stack link */
+	struct WaitEventUsage *saved_node_usage;	/* node stack at query start */
 	int			nentries;
 	int			maxentries;
 	WaitEventUsageEntry *entries;
+	uint64		overflowed_calls;
+	instr_time	overflowed_time;
 } WaitEventUsage;
 
 extern const char *pgstat_get_wait_event(uint32 wait_event_info);
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index c086fa48d80..275292d6172 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -47,6 +47,21 @@ begin
     return data::jsonb;
 end;
 $$;
+-- For tests that assert numeric EXPLAIN fields, preserve unfiltered JSON.
+create function explain_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+    data text := '';
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        data := data || ln;
+    end loop;
+    return data::jsonb;
+end;
+$$;
 -- Disable JIT, or we'll get different output on machines where that's been
 -- forced on
 set jit = off;
@@ -106,11 +121,11 @@ select explain_filter('explain (analyze, waits, costs off, summary off, timing o
  Result (actual rows=N.N loops=N)
    Wait Events:
      Timeout:PgSleep calls=N time=N.N ms
- Wait Events:
+ Statement Wait Events:
    Timeout:PgSleep calls=N time=N.N ms
 (5 rows)
 
-select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Statement Wait Events,0}';
                                      ?column?                                     
 ----------------------------------------------------------------------------------
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
@@ -122,6 +137,52 @@ select explain_filter_to_json('explain (analyze, waits, costs off, summary off,
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
 (1 row)
 
+begin;
+create function pg_temp.nested_explain_waits() returns void
+  language plpgsql as
+$$
+begin
+  perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)');
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Statement Wait Events,0}';
+                                     ?column?                                     
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Plan,Wait Events,0}';
+                                     ?column?                                     
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
+begin;
+-- If a nested EXPLAIN errors while one of its plan nodes is active, ending
+-- that collector must restore the outer node wait-attribution stack.  The
+-- volatile division-by-zero helper keeps the error at execution time.
+create function pg_temp.explain_waits_divzero() returns int
+  language plpgsql volatile as $$begin return 1 / 0; end$$;
+create function pg_temp.nested_explain_waits_error() returns void
+  language plpgsql as
+$$
+begin
+  begin
+    perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.explain_waits_divzero()');
+  exception when division_by_zero then
+    null;
+  end;
+  perform pg_sleep(0.01);
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits_error()') #> '{0,Plan,Wait Events,0}';
+                                     ?column?                                     
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
+rollback;
 begin;
 create function pg_temp.parallel_pg_sleep(float8) returns void
   language internal volatile parallel safe as 'pg_sleep';
@@ -130,7 +191,7 @@ set local max_parallel_workers_per_gather = 1;
 select jsonb_path_query_first(
   explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
                          select pg_temp.parallel_pg_sleep(0.01)
-                         from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+                         from tenk1 where unique1 = 1') #> '{0,Statement Wait Events}',
   '$[*] ? (@."Wait Event" == "PgSleep")'
 );
                               jsonb_path_query_first                              
@@ -149,6 +210,53 @@ select jsonb_path_query_first(
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
 (1 row)
 
+rollback;
+begin;
+-- This test deliberately creates a rescanned parallel-aware Index Scan.
+-- The planner GUCs and tenk1 parallel_workers reloption are test-only
+-- scaffolding to make the parallel rescanned node shape deterministic.  The
+-- STABLE PARALLEL SAFE wrapper around pg_sleep() creates a runtime key wait
+-- under the parallel-aware Index Scan.  The invariant checked below is that
+-- PgSleep calls accumulated for that node cover all reported scan loops; this
+-- fails if per-node worker wait usage is replaced on relaunch instead of
+-- merged across worker reports.
+create function pg_temp.explain_waits_parallel_sleep_int(int) returns int
+  language plpgsql stable parallel safe as $$begin perform pg_sleep(0.001); return $1; end$$;
+alter table tenk1 set (parallel_workers = 4);
+set local parallel_setup_cost = 0;
+set local parallel_tuple_cost = 0;
+set local max_parallel_workers_per_gather = 4;
+set local parallel_leader_participation = off;
+set local min_parallel_index_scan_size = 0;
+set local enable_seqscan = off;
+set local enable_bitmapscan = off;
+set local enable_material = off;
+set local random_page_cost = 2;
+with plan_json as (
+  select explain_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select * from
+                           (select count(unique1) from tenk1
+                            where hundred > pg_temp.explain_waits_parallel_sleep_int(10)) ss
+                           right join (values (1),(2),(3)) v(x) on true') #> '{0,Plan}' as plan
+),
+parallel_scan as (
+  select jsonb_path_query_first(plan,
+                                '$.** ? (@."Node Type" == "Index Scan" && @."Parallel Aware" == true)') as node
+  from plan_json
+),
+pgsleep_wait as (
+  select node,
+         jsonb_path_query_first(node,
+                                '$."Wait Events"[*] ? (@."Wait Event" == "PgSleep")') as wait
+  from parallel_scan
+)
+select (wait->>'Calls')::numeric >= (node->>'Actual Loops')::numeric as "parallel rescan waits accumulated"
+from pgsleep_wait;
+ parallel rescan waits accumulated 
+-----------------------------------
+ t
+(1 row)
+
 rollback;
 begin;
 -- This test deliberately creates a Bitmap Index Scan runtime-key wait.
@@ -157,6 +265,9 @@ begin;
 -- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
 -- The planner GUCs below are likewise test-only scaffolding to make the
 -- node shape deterministic.
+-- The two JSONPath checks verify inclusive per-node attribution: the same
+-- runtime-key wait appears on the Bitmap Index Scan and its Bitmap Heap Scan
+-- parent, while query-level wait accounting still counts the wait once.
 create function pg_temp.explain_waits_sleep_int(int) returns int
   language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
 create temp table explain_waits_bitmap (a int);
@@ -176,6 +287,17 @@ select jsonb_path_query_first(
  {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
 (1 row)
 
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select * from explain_waits_bitmap
+                         where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+  '$.** ? (@."Node Type" == "Bitmap Heap Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
+                              jsonb_path_query_first                              
+----------------------------------------------------------------------------------
+ {"Time": 0.0, "Calls": 0, "Wait Event": "PgSleep", "Wait Event Type": "Timeout"}
+(1 row)
+
 rollback;
 explain (waits) select 1;
 ERROR:  EXPLAIN option WAITS requires ANALYZE
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index bd196f248db..9f50ef189b0 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -51,6 +51,22 @@ begin
 end;
 $$;
 
+-- For tests that assert numeric EXPLAIN fields, preserve unfiltered JSON.
+create function explain_to_json(text) returns jsonb
+language plpgsql as
+$$
+declare
+    data text := '';
+    ln text;
+begin
+    for ln in execute $1
+    loop
+        data := data || ln;
+    end loop;
+    return data::jsonb;
+end;
+$$;
+
 -- Disable JIT, or we'll get different output on machines where that's been
 -- forced on
 set jit = off;
@@ -70,9 +86,40 @@ select explain_filter('explain (buffers, format text) select * from int8_tbl i8'
 
 -- WAITS option
 select explain_filter('explain (analyze, waits, costs off, summary off, timing off, buffers off) select pg_sleep(0.01)');
-select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Statement Wait Events,0}';
 select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)') #> '{0,Plan,Wait Events,0}';
 begin;
+create function pg_temp.nested_explain_waits() returns void
+  language plpgsql as
+$$
+begin
+  perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_sleep(0.01)');
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Statement Wait Events,0}';
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits()') #> '{0,Plan,Wait Events,0}';
+rollback;
+begin;
+-- If a nested EXPLAIN errors while one of its plan nodes is active, ending
+-- that collector must restore the outer node wait-attribution stack.  The
+-- volatile division-by-zero helper keeps the error at execution time.
+create function pg_temp.explain_waits_divzero() returns int
+  language plpgsql volatile as $$begin return 1 / 0; end$$;
+create function pg_temp.nested_explain_waits_error() returns void
+  language plpgsql as
+$$
+begin
+  begin
+    perform explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.explain_waits_divzero()');
+  exception when division_by_zero then
+    null;
+  end;
+  perform pg_sleep(0.01);
+end;
+$$;
+select explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json) select pg_temp.nested_explain_waits_error()') #> '{0,Plan,Wait Events,0}';
+rollback;
+begin;
 create function pg_temp.parallel_pg_sleep(float8) returns void
   language internal volatile parallel safe as 'pg_sleep';
 set local debug_parallel_query = on;
@@ -80,7 +127,7 @@ set local max_parallel_workers_per_gather = 1;
 select jsonb_path_query_first(
   explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
                          select pg_temp.parallel_pg_sleep(0.01)
-                         from tenk1 where unique1 = 1') #> '{0,Wait Events}',
+                         from tenk1 where unique1 = 1') #> '{0,Statement Wait Events}',
   '$[*] ? (@."Wait Event" == "PgSleep")'
 );
 select jsonb_path_query_first(
@@ -91,12 +138,57 @@ select jsonb_path_query_first(
 );
 rollback;
 begin;
+-- This test deliberately creates a rescanned parallel-aware Index Scan.
+-- The planner GUCs and tenk1 parallel_workers reloption are test-only
+-- scaffolding to make the parallel rescanned node shape deterministic.  The
+-- STABLE PARALLEL SAFE wrapper around pg_sleep() creates a runtime key wait
+-- under the parallel-aware Index Scan.  The invariant checked below is that
+-- PgSleep calls accumulated for that node cover all reported scan loops; this
+-- fails if per-node worker wait usage is replaced on relaunch instead of
+-- merged across worker reports.
+create function pg_temp.explain_waits_parallel_sleep_int(int) returns int
+  language plpgsql stable parallel safe as $$begin perform pg_sleep(0.001); return $1; end$$;
+alter table tenk1 set (parallel_workers = 4);
+set local parallel_setup_cost = 0;
+set local parallel_tuple_cost = 0;
+set local max_parallel_workers_per_gather = 4;
+set local parallel_leader_participation = off;
+set local min_parallel_index_scan_size = 0;
+set local enable_seqscan = off;
+set local enable_bitmapscan = off;
+set local enable_material = off;
+set local random_page_cost = 2;
+with plan_json as (
+  select explain_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select * from
+                           (select count(unique1) from tenk1
+                            where hundred > pg_temp.explain_waits_parallel_sleep_int(10)) ss
+                           right join (values (1),(2),(3)) v(x) on true') #> '{0,Plan}' as plan
+),
+parallel_scan as (
+  select jsonb_path_query_first(plan,
+                                '$.** ? (@."Node Type" == "Index Scan" && @."Parallel Aware" == true)') as node
+  from plan_json
+),
+pgsleep_wait as (
+  select node,
+         jsonb_path_query_first(node,
+                                '$."Wait Events"[*] ? (@."Wait Event" == "PgSleep")') as wait
+  from parallel_scan
+)
+select (wait->>'Calls')::numeric >= (node->>'Actual Loops')::numeric as "parallel rescan waits accumulated"
+from pgsleep_wait;
+rollback;
+begin;
 -- This test deliberately creates a Bitmap Index Scan runtime-key wait.
 -- The STABLE PL/pgSQL wrapper is test scaffolding: STABLE lets the
 -- expression be used as an index runtime key, while PL/pgSQL prevents SQL
 -- inlining from moving pg_sleep() out of the Bitmap Index Scan boundary.
 -- The planner GUCs below are likewise test-only scaffolding to make the
 -- node shape deterministic.
+-- The two JSONPath checks verify inclusive per-node attribution: the same
+-- runtime-key wait appears on the Bitmap Index Scan and its Bitmap Heap Scan
+-- parent, while query-level wait accounting still counts the wait once.
 create function pg_temp.explain_waits_sleep_int(int) returns int
   language plpgsql stable as $$begin perform pg_sleep(0.01); return $1; end$$;
 create temp table explain_waits_bitmap (a int);
@@ -111,6 +203,12 @@ select jsonb_path_query_first(
                          where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
   '$.** ? (@."Node Type" == "Bitmap Index Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
 );
+select jsonb_path_query_first(
+  explain_filter_to_json('explain (analyze, waits, costs off, summary off, timing off, buffers off, format json)
+                         select * from explain_waits_bitmap
+                         where a = pg_temp.explain_waits_sleep_int(1)') #> '{0,Plan}',
+  '$.** ? (@."Node Type" == "Bitmap Heap Scan")."Wait Events"[*] ? (@."Wait Event" == "PgSleep")'
+);
 rollback;
 explain (waits) select 1;
 
-- 
2.52.0



  [application/octet-stream] 0005-Harden-EXPLAIN-WAITS-accumulator-handling.patch (10.6K, 7-0005-Harden-EXPLAIN-WAITS-accumulator-handling.patch)
  download | inline diff:
From 7535889991ec4af6104e4ac241f9fca7b603258e Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:45:44 +0500
Subject: [RFC PATCH v1 5/7] Harden EXPLAIN WAITS accumulator handling

---
 doc/src/sgml/ref/explain.sgml           |  15 ++--
 src/backend/commands/explain.c          |   2 +-
 src/backend/executor/execProcnode.c     |  11 +--
 src/backend/utils/activity/wait_event.c | 100 ++++++++++--------------
 src/include/utils/wait_event.h          |  10 +--
 5 files changed, 56 insertions(+), 82 deletions(-)

diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index d699b215120..7fa4b1cd955 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -326,13 +326,14 @@ ROLLBACK;
      </para>
 
      <para>
-      If <command>EXPLAIN</command> cannot grow its per-query or per-node wait
-      event storage without risking an error while a wait is ending, waits
-      whose exact event identifier could not be stored are accumulated in an
-      <literal>Unrecorded Wait Event Calls</literal> counter and
-      <literal>Unrecorded Wait Event Time</literal> total.  This is a
-      reporting fallback under memory pressure, not a wait event emitted by
-      server instrumentation.
+      Each statement and plan-node accumulator preallocates storage for up to
+      <literal>64</literal> distinct wait event identifiers.  This bound
+      avoids memory allocation while a wait is ending.  If more distinct wait
+      event identifiers are observed, waits whose exact event identifier could
+      not be stored are accumulated in an <literal>Unrecorded Wait Event
+      Calls</literal> counter and <literal>Unrecorded Wait Event Time</literal>
+      total.  This is a reporting fallback, not a wait event emitted by server
+      instrumentation.
      </para>
 
      <para>
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 9c198f8e599..ee69d723cd8 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -4578,7 +4578,7 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
 	else
 		entries = NULL;
 
-		if (es->format == EXPLAIN_FORMAT_TEXT)
+	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
 		ExplainIndentText(es);
 		appendStringInfo(es->str, "%s:\n", labelname);
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 081855b3fed..deee14839f2 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -417,15 +417,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 		result->instrument = InstrAllocNode(estate->es_instrument,
 											result->async_capable);
 	if (estate->es_instrument & INSTRUMENT_WAITS)
-	{
-		MemoryContext oldcontext;
-
-		oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
-		result->wait_event_usage = palloc_object(WaitEventUsage);
-		pgstat_init_wait_event_usage(result->wait_event_usage,
-									 estate->es_query_cxt);
-		MemoryContextSwitchTo(oldcontext);
-	}
+		result->wait_event_usage =
+			pgstat_create_wait_event_usage(estate->es_query_cxt);
 
 	return result;
 }
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 61b418e8fd7..67980cc0a3b 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -27,7 +27,6 @@
 #include "storage/shmem.h"
 #include "storage/subsystems.h"
 #include "storage/spin.h"
-#include "utils/memutils.h"
 #include "utils/wait_event.h"
 
 
@@ -43,15 +42,25 @@ static void WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
 									  const instr_time *elapsed);
 static int	WaitEventUsageFind(const WaitEventUsage *usage,
 							   uint32 wait_event_info, bool *found);
+static void WaitEventUsageInit(WaitEventUsage *usage,
+							   MemoryContext memcontext);
 
 
 static uint32 local_my_wait_event_info;
 uint32	   *my_wait_event_info = &local_my_wait_event_info;
 
-#define WAIT_EVENT_USAGE_INITIAL_EVENTS	16
+/*
+ * Hardcoded limit: each EXPLAIN WAITS statement-level or plan-node accumulator
+ * can record this many distinct wait event identities without allocating while
+ * waits are ending.  Additional distinct wait identities are accounted for in
+ * the overflow bucket.
+ */
+#define WAIT_EVENT_USAGE_MAX_EVENTS		64
 
-int			pgstat_wait_event_usage_depth = 0;
+/* Fast-path flag exported for inline pgstat_report_wait_start/end(). */
+bool		pgstat_wait_event_usage_active = false;
 static WaitEventUsage *pgstat_wait_event_usage = NULL;
+static int	pgstat_wait_event_usage_depth = 0;
 
 /*
  * Top of the active executor node and query-level stacks.  Query-level wait
@@ -373,26 +382,36 @@ pgstat_reset_wait_event_storage(void)
 	my_wait_event_info = &local_my_wait_event_info;
 }
 
+/*
+ * Allocate and initialize a wait event usage accumulator.
+ */
+WaitEventUsage *
+pgstat_create_wait_event_usage(MemoryContext memcontext)
+{
+	WaitEventUsage *usage;
+
+	Assert(memcontext != NULL);
+
+	usage = MemoryContextAlloc(memcontext, sizeof(WaitEventUsage));
+	WaitEventUsageInit(usage, memcontext);
+	return usage;
+}
+
 /*
  * Initialize a wait event usage accumulator.
  */
-void
-pgstat_init_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+static void
+WaitEventUsageInit(WaitEventUsage *usage, MemoryContext memcontext)
 {
 	Assert(usage != NULL);
 	Assert(memcontext != NULL);
 
 	memset(usage, 0, sizeof(WaitEventUsage));
 
-	/*
-	 * Wait events may end inside critical sections, for example while
-	 * performing synchronous I/O.  Keep usage entries in a dedicated context
-	 * where the memory manager permits that accounting path to grow.
-	 */
-	usage->memcontext = AllocSetContextCreate(memcontext,
-											  "Wait Event Usage",
-											  ALLOCSET_SMALL_SIZES);
-	MemoryContextAllowInCriticalSection(usage->memcontext, true);
+	usage->entries = MemoryContextAlloc(memcontext,
+										sizeof(WaitEventUsageEntry) *
+										WAIT_EVENT_USAGE_MAX_EVENTS);
+	usage->maxentries = WAIT_EVENT_USAGE_MAX_EVENTS;
 }
 
 /*
@@ -421,7 +440,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
 		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
 	}
 
-	pgstat_init_wait_event_usage(usage, memcontext);
+	WaitEventUsageInit(usage, memcontext);
 	usage->query_parent = pgstat_wait_event_usage;
 	/*
 	 * A nested EXPLAIN can error out while one of its plan nodes is active,
@@ -431,6 +450,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
 	usage->saved_node_usage = pgstat_wait_event_node_usage;
 	pgstat_wait_event_usage = usage;
 	pgstat_wait_event_usage_depth++;
+	pgstat_wait_event_usage_active = true;
 }
 
 /*
@@ -453,6 +473,7 @@ pgstat_end_wait_event_usage(WaitEventUsage *usage)
 
 	if (--pgstat_wait_event_usage_depth == 0)
 	{
+		pgstat_wait_event_usage_active = false;
 		pgstat_wait_event_usage = NULL;
 		pgstat_wait_event_node_usage = NULL;
 		pgstat_wait_event_usage_node_stack = NULL;
@@ -602,52 +623,13 @@ WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
 	{
 		if (usage->nentries >= usage->maxentries)
 		{
-			int			newmaxentries;
-			Size		entries_size;
-			WaitEventUsageEntry *newentries;
-
-			if (usage->maxentries > 0)
-			{
-				if ((Size) usage->maxentries >
-					MaxAllocSize / sizeof(WaitEventUsageEntry) / 2)
-				{
-					WaitEventUsageAddOverflow(usage, calls, elapsed);
-					return;
-				}
-
-				newmaxentries = usage->maxentries * 2;
-			}
-			else
-				newmaxentries = WAIT_EVENT_USAGE_INITIAL_EVENTS;
-
-			if ((Size) newmaxentries >
-				MaxAllocSize / sizeof(WaitEventUsageEntry))
-			{
-				WaitEventUsageAddOverflow(usage, calls, elapsed);
-				return;
-			}
-
-			entries_size = sizeof(WaitEventUsageEntry) * newmaxentries;
 			/*
-			 * Wait completion can happen in a critical section, so growth
-			 * must not throw ERROR.  If storage cannot be grown without
-			 * throwing, preserve total wait time in the overflow bucket.
+			 * Wait-end accounting must not allocate: it can run in a critical
+			 * section.  Preserve total calls/time without the exact event
+			 * identity once preallocated storage is full.
 			 */
-			if (usage->entries)
-				newentries = repalloc_extended(usage->entries, entries_size,
-											   MCXT_ALLOC_NO_OOM);
-			else
-				newentries = MemoryContextAllocExtended(usage->memcontext,
-														entries_size,
-														MCXT_ALLOC_NO_OOM);
-			if (newentries == NULL)
-			{
-				WaitEventUsageAddOverflow(usage, calls, elapsed);
-				return;
-			}
-
-			usage->entries = newentries;
-			usage->maxentries = newmaxentries;
+			WaitEventUsageAddOverflow(usage, calls, elapsed);
+			return;
 		}
 
 		if (idx < usage->nentries)
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index f14945cdb16..67497790307 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -24,7 +24,6 @@ typedef struct WaitEventUsageEntry
 
 typedef struct WaitEventUsage
 {
-	MemoryContext memcontext;
 	struct WaitEventUsage *active_parent; /* active plan-node stack link */
 	struct WaitEventUsage *query_parent;	/* active query-level stack link */
 	struct WaitEventUsage *saved_node_usage;	/* node stack at query start */
@@ -41,8 +40,7 @@ static inline void pgstat_report_wait_start(uint32 wait_event_info);
 static inline void pgstat_report_wait_end(void);
 extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
 extern void pgstat_reset_wait_event_storage(void);
-extern void pgstat_init_wait_event_usage(WaitEventUsage *usage,
-										 MemoryContext memcontext);
+extern WaitEventUsage *pgstat_create_wait_event_usage(MemoryContext memcontext);
 extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
 										  MemoryContext memcontext);
 extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
@@ -55,7 +53,7 @@ extern void pgstat_count_wait_event_start(uint32 wait_event_info);
 extern void pgstat_count_wait_event_end(void);
 
 extern PGDLLIMPORT uint32 *my_wait_event_info;
-extern PGDLLIMPORT int pgstat_wait_event_usage_depth;
+extern PGDLLIMPORT bool pgstat_wait_event_usage_active;
 
 
 /*
@@ -101,7 +99,7 @@ extern char **GetWaitEventCustomNames(uint32 classId, int *nwaitevents);
 static inline void
 pgstat_report_wait_start(uint32 wait_event_info)
 {
-	if (pgstat_wait_event_usage_depth > 0)
+	if (unlikely(pgstat_wait_event_usage_active))
 		pgstat_count_wait_event_start(wait_event_info);
 
 	/*
@@ -120,7 +118,7 @@ pgstat_report_wait_start(uint32 wait_event_info)
 static inline void
 pgstat_report_wait_end(void)
 {
-	if (pgstat_wait_event_usage_depth > 0)
+	if (unlikely(pgstat_wait_event_usage_active))
 		pgstat_count_wait_event_end();
 
 	/* see pgstat_report_wait_start() */
-- 
2.52.0



  [application/octet-stream] 0006-Hide-EXPLAIN-WAITS-accumulator-internals.patch (14.5K, 8-0006-Hide-EXPLAIN-WAITS-accumulator-internals.patch)
  download | inline diff:
From e9f1030cd864c93a3e17d53f4f344b3202bb57bb Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:46:12 +0500
Subject: [RFC PATCH v1 6/7] Hide EXPLAIN WAITS accumulator internals

---
 src/backend/commands/explain.c          | 45 ++++++++++--------
 src/backend/executor/execParallel.c     | 44 +++++++++--------
 src/backend/utils/activity/wait_event.c | 63 +++++++++++++++++++++++--
 src/include/utils/wait_event.h          | 32 ++++++++-----
 4 files changed, 129 insertions(+), 55 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index ee69d723cd8..0e2ec510fee 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -514,7 +514,6 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 	int			eflags;
 	int			instrument_option = 0;
 	SerializeMetrics serializeMetrics = {0};
-	WaitEventUsage waitEventUsage;
 	WaitEventUsage *waitEventUsagePtr = NULL;
 
 	Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -593,9 +592,8 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 
 		if (es->waits)
 		{
-			waitEventUsagePtr = &waitEventUsage;
-			pgstat_begin_wait_event_usage(waitEventUsagePtr,
-										  queryDesc->estate->es_query_cxt);
+			waitEventUsagePtr =
+				pgstat_begin_wait_event_usage(queryDesc->estate->es_query_cxt);
 			queryDesc->estate->es_wait_event_usage = waitEventUsagePtr;
 		}
 
@@ -4559,20 +4557,29 @@ static void
 show_wait_event_usage(ExplainState *es, const char *labelname,
 					  const WaitEventUsage *usage)
 {
+	const WaitEventUsageEntry *usage_entries;
 	WaitEventUsageEntry *entries;
+	uint64		overflowed_calls;
+	instr_time	overflowed_time;
+	int			nentries;
 
 	if (usage == NULL)
 		return;
 
-	if (usage->nentries == 0 && usage->overflowed_calls == 0)
+	if (pgstat_wait_event_usage_is_empty(usage))
 		return;
 
-	if (usage->nentries > 0)
+	nentries = pgstat_get_wait_event_usage_entries(usage, &usage_entries);
+	pgstat_get_wait_event_usage_overflow(usage,
+										 &overflowed_calls,
+										 &overflowed_time);
+
+	if (nentries > 0)
 	{
-		entries = palloc_array(WaitEventUsageEntry, usage->nentries);
-		memcpy(entries, usage->entries,
-			   sizeof(WaitEventUsageEntry) * usage->nentries);
-		qsort(entries, usage->nentries, sizeof(WaitEventUsageEntry),
+		entries = palloc_array(WaitEventUsageEntry, nentries);
+		memcpy(entries, usage_entries,
+			   sizeof(WaitEventUsageEntry) * nentries);
+		qsort(entries, nentries, sizeof(WaitEventUsageEntry),
 			  wait_event_usage_cmp);
 	}
 	else
@@ -4584,7 +4591,7 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
 		appendStringInfo(es->str, "%s:\n", labelname);
 		es->indent++;
 
-		for (int i = 0; i < usage->nentries; i++)
+		for (int i = 0; i < nentries; i++)
 		{
 			const char *event_type;
 			const char *event_name;
@@ -4600,24 +4607,24 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
 							 INSTR_TIME_GET_MILLISEC(entries[i].time));
 		}
 
-		if (usage->overflowed_calls > 0)
+		if (overflowed_calls > 0)
 		{
 			ExplainIndentText(es);
 			appendStringInfo(es->str,
 							 "Unrecorded Wait Event Calls: calls=%" PRIu64 " time=%0.3f ms\n",
-							 usage->overflowed_calls,
-							 INSTR_TIME_GET_MILLISEC(usage->overflowed_time));
+							 overflowed_calls,
+							 INSTR_TIME_GET_MILLISEC(overflowed_time));
 		}
 
 		es->indent--;
 	}
 	else
 	{
-		if (usage->nentries > 0)
+		if (nentries > 0)
 		{
 			ExplainOpenGroup("Wait-Events", labelname, false, es);
 
-			for (int i = 0; i < usage->nentries; i++)
+			for (int i = 0; i < nentries; i++)
 			{
 				const char *event_type;
 				const char *event_name;
@@ -4642,16 +4649,16 @@ show_wait_event_usage(ExplainState *es, const char *labelname,
 			ExplainCloseGroup("Wait-Events", labelname, false, es);
 		}
 
-		if (usage->overflowed_calls > 0)
+		if (overflowed_calls > 0)
 		{
 			/*
 			 * This is not a wait event identity, so keep it outside the
 			 * Wait Events array in structured output.
 			 */
 			ExplainPropertyUInteger("Unrecorded Wait Event Calls", NULL,
-									usage->overflowed_calls, es);
+									overflowed_calls, es);
 			ExplainPropertyFloat("Unrecorded Wait Event Time", "ms",
-								 INSTR_TIME_GET_MILLISEC(usage->overflowed_time),
+								 INSTR_TIME_GET_MILLISEC(overflowed_time),
 								 3, es);
 		}
 	}
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 520b4b8484f..dcd06c718c8 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -1352,8 +1352,9 @@ ExecParallelAccumulateWaitEventUsageWorker(WaitEventUsage *usage,
 
 	if (worker->overflowed_calls > 0)
 	{
-		usage->overflowed_calls += worker->overflowed_calls;
-		INSTR_TIME_ADD(usage->overflowed_time, worker->overflowed_time);
+		pgstat_accumulate_wait_event_usage_overflow(usage,
+													worker->overflowed_calls,
+													&worker->overflowed_time);
 		worker->overflowed_calls = 0;
 		INSTR_TIME_SET_ZERO(worker->overflowed_time);
 	}
@@ -1377,11 +1378,15 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
 									   dsa_area *area,
 									   const WaitEventUsage *usage)
 {
+	const WaitEventUsageEntry *usage_entries;
 	WaitEventUsageEntry *entries;
 	WaitEventUsageEntry *old_entries = NULL;
 	dsa_pointer entries_dsa;
+	uint64		overflowed_calls;
+	instr_time	overflowed_time;
 	Size		entries_size;
 	int			old_nentries = 0;
+	int			usage_nentries;
 	int			new_nentries = 0;
 	int			i = 0;
 	int			j = 0;
@@ -1390,10 +1395,15 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
 	Assert(area != NULL);
 	Assert(usage != NULL);
 
-	worker->overflowed_calls += usage->overflowed_calls;
-	INSTR_TIME_ADD(worker->overflowed_time, usage->overflowed_time);
+	usage_nentries =
+		pgstat_get_wait_event_usage_entries(usage, &usage_entries);
+	pgstat_get_wait_event_usage_overflow(usage,
+										 &overflowed_calls,
+										 &overflowed_time);
+	worker->overflowed_calls += overflowed_calls;
+	INSTR_TIME_ADD(worker->overflowed_time, overflowed_time);
 
-	if (usage->nentries <= 0)
+	if (usage_nentries <= 0)
 		return;
 
 	if (DsaPointerIsValid(worker->entries))
@@ -1404,25 +1414,25 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
 	}
 
 	entries_size = mul_size(sizeof(WaitEventUsageEntry),
-							(Size) old_nentries + (Size) usage->nentries);
+							(Size) old_nentries + (Size) usage_nentries);
 	entries_dsa = dsa_allocate(area, entries_size);
 	entries = dsa_get_address(area, entries_dsa);
 
-	while (i < old_nentries && j < usage->nentries)
+	while (i < old_nentries && j < usage_nentries)
 	{
 		WaitEventUsageEntry *entry = &entries[new_nentries];
 		uint32		old_info = old_entries[i].wait_event_info;
-		uint32		new_info = usage->entries[j].wait_event_info;
+		uint32		new_info = usage_entries[j].wait_event_info;
 
 		if (old_info < new_info)
 			*entry = old_entries[i++];
 		else if (old_info > new_info)
-			*entry = usage->entries[j++];
+			*entry = usage_entries[j++];
 		else
 		{
 			*entry = old_entries[i++];
-			entry->calls += usage->entries[j].calls;
-			INSTR_TIME_ADD(entry->time, usage->entries[j].time);
+			entry->calls += usage_entries[j].calls;
+			INSTR_TIME_ADD(entry->time, usage_entries[j].time);
 			j++;
 		}
 
@@ -1431,8 +1441,8 @@ ExecParallelReportWaitEventUsageWorker(SharedWaitEventUsageWorker *worker,
 
 	while (i < old_nentries)
 		entries[new_nentries++] = old_entries[i++];
-	while (j < usage->nentries)
-		entries[new_nentries++] = usage->entries[j++];
+	while (j < usage_nentries)
+		entries[new_nentries++] = usage_entries[j++];
 
 	if (DsaPointerIsValid(worker->entries))
 		dsa_free(area, worker->entries);
@@ -1781,7 +1791,6 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	QueryDesc  *queryDesc;
 	SharedExecutorInstrumentation *instrumentation;
 	SharedJitInstrumentation *jit_instrumentation;
-	WaitEventUsage waitEventUsage;
 	WaitEventUsage *waitEventUsagePtr = NULL;
 	int			instrument_options = 0;
 	void	   *area_space;
@@ -1841,11 +1850,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	InstrStartParallelQuery();
 
 	if (wait_event_usage != NULL)
-	{
-		waitEventUsagePtr = &waitEventUsage;
-		pgstat_begin_wait_event_usage(waitEventUsagePtr,
-									  queryDesc->estate->es_query_cxt);
-	}
+		waitEventUsagePtr =
+			pgstat_begin_wait_event_usage(queryDesc->estate->es_query_cxt);
 
 	/*
 	 * Run the plan.  If we specified a tuple bound, be careful not to demand
diff --git a/src/backend/utils/activity/wait_event.c b/src/backend/utils/activity/wait_event.c
index 67980cc0a3b..9719e38729e 100644
--- a/src/backend/utils/activity/wait_event.c
+++ b/src/backend/utils/activity/wait_event.c
@@ -36,6 +36,17 @@ static const char *pgstat_get_wait_client(WaitEventClient w);
 static const char *pgstat_get_wait_ipc(WaitEventIPC w);
 static const char *pgstat_get_wait_timeout(WaitEventTimeout w);
 static const char *pgstat_get_wait_io(WaitEventIO w);
+struct WaitEventUsage
+{
+	struct WaitEventUsage *active_parent; /* active plan-node stack link */
+	struct WaitEventUsage *query_parent;	/* active query-level stack link */
+	struct WaitEventUsage *saved_node_usage;	/* node stack at query start */
+	int			nentries;
+	int			maxentries;
+	WaitEventUsageEntry *entries;
+	uint64		overflowed_calls;
+	instr_time	overflowed_time;
+};
 static void WaitEventUsageAdd(WaitEventUsage *usage, uint32 wait_event_info,
 							  uint64 calls, const instr_time *elapsed);
 static void WaitEventUsageAddOverflow(WaitEventUsage *usage, uint64 calls,
@@ -422,12 +433,12 @@ WaitEventUsageInit(WaitEventUsage *usage, MemoryContext memcontext)
  * local memory.  Nested top-level collectors are kept in a query-level stack;
  * a wait is counted once in each active collector.
  */
-void
-pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
+WaitEventUsage *
+pgstat_begin_wait_event_usage(MemoryContext memcontext)
 {
+	WaitEventUsage *usage;
 	bool		first;
 
-	Assert(usage != NULL);
 	Assert(memcontext != NULL);
 
 	first = pgstat_wait_event_usage_depth == 0;
@@ -440,7 +451,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
 		INSTR_TIME_SET_ZERO(pgstat_wait_event_usage_start);
 	}
 
-	WaitEventUsageInit(usage, memcontext);
+	usage = pgstat_create_wait_event_usage(memcontext);
 	usage->query_parent = pgstat_wait_event_usage;
 	/*
 	 * A nested EXPLAIN can error out while one of its plan nodes is active,
@@ -451,6 +462,7 @@ pgstat_begin_wait_event_usage(WaitEventUsage *usage, MemoryContext memcontext)
 	pgstat_wait_event_usage = usage;
 	pgstat_wait_event_usage_depth++;
 	pgstat_wait_event_usage_active = true;
+	return usage;
 }
 
 /*
@@ -579,6 +591,49 @@ pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
 						  &entries[i].time);
 }
 
+void
+pgstat_accumulate_wait_event_usage_overflow(WaitEventUsage *usage,
+											uint64 calls,
+											const instr_time *elapsed)
+{
+	Assert(usage != NULL);
+	Assert(elapsed != NULL);
+
+	WaitEventUsageAddOverflow(usage, calls, elapsed);
+}
+
+bool
+pgstat_wait_event_usage_is_empty(const WaitEventUsage *usage)
+{
+	Assert(usage != NULL);
+
+	return usage->nentries == 0 && usage->overflowed_calls == 0;
+}
+
+int
+pgstat_get_wait_event_usage_entries(const WaitEventUsage *usage,
+									const WaitEventUsageEntry **entries)
+{
+	Assert(usage != NULL);
+	Assert(entries != NULL);
+
+	*entries = usage->entries;
+	return usage->nentries;
+}
+
+void
+pgstat_get_wait_event_usage_overflow(const WaitEventUsage *usage,
+									 uint64 *calls,
+									 instr_time *elapsed)
+{
+	Assert(usage != NULL);
+	Assert(calls != NULL);
+	Assert(elapsed != NULL);
+
+	*calls = usage->overflowed_calls;
+	*elapsed = usage->overflowed_time;
+}
+
 /*
  * Find the existing entry, or the insertion position for a new entry.
  *
diff --git a/src/include/utils/wait_event.h b/src/include/utils/wait_event.h
index 67497790307..19763cfcae5 100644
--- a/src/include/utils/wait_event.h
+++ b/src/include/utils/wait_event.h
@@ -15,6 +15,12 @@
 #include "utils/palloc.h"
 #include "utils/wait_event_types.h"
 
+/*
+ * EXPLAIN wait event accounting support.  WaitEventUsage is intentionally
+ * opaque outside wait_event.c; callers should allocate, accumulate, and read
+ * it through the functions below.  WaitEventUsageEntry is the reportable
+ * tuple copied to EXPLAIN output and parallel-worker storage.
+ */
 typedef struct WaitEventUsageEntry
 {
 	uint32		wait_event_info;
@@ -22,17 +28,7 @@ typedef struct WaitEventUsageEntry
 	instr_time	time;
 } WaitEventUsageEntry;
 
-typedef struct WaitEventUsage
-{
-	struct WaitEventUsage *active_parent; /* active plan-node stack link */
-	struct WaitEventUsage *query_parent;	/* active query-level stack link */
-	struct WaitEventUsage *saved_node_usage;	/* node stack at query start */
-	int			nentries;
-	int			maxentries;
-	WaitEventUsageEntry *entries;
-	uint64		overflowed_calls;
-	instr_time	overflowed_time;
-} WaitEventUsage;
+typedef struct WaitEventUsage WaitEventUsage;
 
 extern const char *pgstat_get_wait_event(uint32 wait_event_info);
 extern const char *pgstat_get_wait_event_type(uint32 wait_event_info);
@@ -40,13 +36,23 @@ static inline void pgstat_report_wait_start(uint32 wait_event_info);
 static inline void pgstat_report_wait_end(void);
 extern void pgstat_set_wait_event_storage(uint32 *wait_event_info);
 extern void pgstat_reset_wait_event_storage(void);
+
+/* EXPLAIN wait event accounting. */
 extern WaitEventUsage *pgstat_create_wait_event_usage(MemoryContext memcontext);
-extern void pgstat_begin_wait_event_usage(WaitEventUsage *usage,
-										  MemoryContext memcontext);
+extern WaitEventUsage *pgstat_begin_wait_event_usage(MemoryContext memcontext);
 extern void pgstat_end_wait_event_usage(WaitEventUsage *usage);
 extern void pgstat_accumulate_wait_event_usage(WaitEventUsage *usage,
 											   const WaitEventUsageEntry *entries,
 											   int nentries);
+extern void pgstat_accumulate_wait_event_usage_overflow(WaitEventUsage *usage,
+														uint64 calls,
+														const instr_time *elapsed);
+extern bool pgstat_wait_event_usage_is_empty(const WaitEventUsage *usage);
+extern int pgstat_get_wait_event_usage_entries(const WaitEventUsage *usage,
+											   const WaitEventUsageEntry **entries);
+extern void pgstat_get_wait_event_usage_overflow(const WaitEventUsage *usage,
+												 uint64 *calls,
+												 instr_time *elapsed);
 extern WaitEventUsage *pgstat_enter_wait_event_usage(WaitEventUsage *usage);
 extern void pgstat_restore_wait_event_usage(WaitEventUsage *usage);
 extern void pgstat_count_wait_event_start(uint32 wait_event_info);
-- 
2.52.0



  [application/octet-stream] 0007-Keep-EXPLAIN-option-completion-current.patch (1.2K, 9-0007-Keep-EXPLAIN-option-completion-current.patch)
  download | inline diff:
From 7fb9ef35e18a863870ac5c10c346dd739fe7050b Mon Sep 17 00:00:00 2001
From: Ilmar Yunusov <[email protected]>
Date: Sat, 9 May 2026 03:46:43 +0500
Subject: [RFC PATCH v1 7/7] Keep EXPLAIN option completion current

---
 src/bin/psql/tab-complete.in.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/bin/psql/tab-complete.in.c b/src/bin/psql/tab-complete.in.c
index b33490e7415..45712f4ac12 100644
--- a/src/bin/psql/tab-complete.in.c
+++ b/src/bin/psql/tab-complete.in.c
@@ -4525,8 +4525,8 @@ match_previous_words(int pattern_id,
 		if (ends_with(prev_wd, '(') || ends_with(prev_wd, ','))
 			COMPLETE_WITH("ANALYZE", "VERBOSE", "COSTS", "SETTINGS", "GENERIC_PLAN",
 						  "BUFFERS", "SERIALIZE", "WAL", "WAITS", "TIMING", "SUMMARY",
-						  "MEMORY", "FORMAT");
-		else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|WAITS|TIMING|SUMMARY|MEMORY"))
+						  "MEMORY", "IO", "FORMAT");
+		else if (TailMatches("ANALYZE|VERBOSE|COSTS|SETTINGS|GENERIC_PLAN|BUFFERS|WAL|WAITS|TIMING|SUMMARY|MEMORY|IO"))
 			COMPLETE_WITH("ON", "OFF");
 		else if (TailMatches("SERIALIZE"))
 			COMPLETE_WITH("TEXT", "NONE", "BINARY");
-- 
2.52.0



reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: [RFC PATCH v1] Add EXPLAIN ANALYZE wait event reporting
  In-Reply-To: <CALCfnuquuxtZmmzQBZ_yxaihfj7bnALXdzi9Nj=RYUW4iwY6GQ@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox