public inbox for [email protected]  
help / color / mirror / Atom feed
From: Lukas Fittl <[email protected]>
To: PostgreSQL Hackers <[email protected]>
To: Andres Freund <[email protected]>
Cc: Peter Smith <[email protected]>
Subject: Re: Stack-based tracking of per-node WAL/buffer usage
Date: Sat, 7 Mar 2026 20:27:38 -0800
Message-ID: <CAP53PkxrmpECzVFpeeEEHDGe6u625s+YkmVv5-gw3L_NDSfbiA@mail.gmail.com> (raw)
In-Reply-To: <CAP53PkzGbyeJMLDAcvMRgzXPXYsYXZr3SBg0UwhfkYjqu8oK_g@mail.gmail.com>
References: <CAP53PkzdBK8VJ1fS4AZ481LgMN8f9mJiC39ZRHqkFUSYq6KWmg@mail.gmail.com>
	<h7zq426fhzt3oqkps53qxfwl3dlz3ifcwqmybuvxntlcaqwb33@m6huopxovgle>
	<CAP53Pkw1RpyAVsvJwUOMgGnVoQUibMk+EB+w-8WqY8kzsC4WjA@mail.gmail.com>
	<CAP53Pkz-BmkLFJ+sAcr8H1sBPtWp8WYGzorH9yw+W5Tn0yH0sg@mail.gmail.com>
	<mflexfbmbkmql5snr322ldu72dfnjr2dntxx3ippaz4tcso7vl@zmxxgiws3qux>
	<CAP53PkzZ3UotnRrrnXWAv=F4avRq9MQ8zU+bxoN9tpovEu6fGQ@mail.gmail.com>
	<k2uxwancbi5blb2duny3kbuz667m47hqc4jv7xstlx3yyzcd3a@fjqww6avalt3>
	<CAHut+Put94CTpjQsqOJHdHkgJ2ZXq+qVSfMEcmDKLiWLW-hPfA@mail.gmail.com>
	<CAP53PkyOvXC7pWAiamvWth_JNeb=isrxX+PJT0pw_Hw5Czzf+Q@mail.gmail.com>
	<CAP53PkzGbyeJMLDAcvMRgzXPXYsYXZr3SBg0UwhfkYjqu8oK_g@mail.gmail.com>

On Mon, Feb 23, 2026 at 8:18 PM Lukas Fittl <[email protected]> wrote:
> 0001 addresses the issue that Peter raised and corrects the macro name
> + adds a missing comment.

This has been pushed by Andres, thanks again Peter for noting this!

See attached v7, with the changes as noted later. But first, some
fresh performance numbers that take into account extra inlining added
in a new v7/0006:

Example (default shared_buffers, runtimes are best out of 3-ish):

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;

250ms actual runtime (no instrumentation)

BUFFERS OFF, TIMING OFF:
295ms master
295ms with stack-based instrumentation only (v7/0005) -- no change
because BUFFERS OFF
260ms with ExecProcNodeInstr inlining work (v7/0006)

BUFFERS ON, TIMING OFF:
380ms master
305ms with stack-based instrumentation only (v7/0005)
280ms with ExecProcNodeInstr inlining work (v7/0006)

In summary: For BUFFERS ON, we're going from 52% overhead in this
stress test, to 12% overhead (22% without the ExecProcNodeInstr
change). With rows instrumentation only, we go from 18% to 3%
overhead.

> 0002/0003 are the same as in v5 and are preparatory commits that
> should be in good shape to go.

I split out the trigger instrumentation refactoring into its own
commit (0001) per Andres' request off-list.

0002 is the node instrumentation refactoring, which changed slightly,
because Instrumentation is now treated as a base struct and
NodeInstrumentation now contains an Instrumentation "instr" field to
reduce duplication.

0003 replaces pgBufferUsage/pgWalUsage with INSTR_* macros, pretty
much unchanged from v6/0002.

> 0004 adds new regression tests that pass on both master and the
> patched version, and stress test some of the nested instrumentation
> handling.

This patch is the same in v7 as in v6.

> 0005 is what was previously 0003, and introduces the stack-based
> instrumentation mechanism.

0005 is still the stack-based instrumentation commit, but has several
improvements over v6:

- Andres suggested a different stack structure in a conversation
off-list, where we track the stack entries in a separately kept array
instead of with inline pointers on the Instrumentation structure -
that ends up making the whole change a lot easier to reason about, and
clarifies the abort handling as well
- I've separated out the ResOwner handling into a new
"QueryInstrumentation" struct - this makes it clearer that one can use
Instrumentation directly with certain caveats (i.e. one must deal with
aborts directly), and feels better now that we're including
Instrumentation in NodeInstrumentation as well

> 0006 could probably be merged into 0005, and is the existing
> mechanical change to drop the pgBufferUsage global, but kept separate
> for easier review for now.

I've merged that together now into 0005 because it feels better to
reason about this as a whole and drop pgBufferUsage within the same
commit where we're adding the stack-based instrumentation.

There is now a new 0006 patch that further optimizes ExecProcNodeInstr
(with performance numbers as noted in the beginning of this mail).
During my testing I realized that the conditional checks in
InstrStartNode/InstrStopNode are quite unnecessary - we never change
what gets instrumented during execution, and so we can create
specialized functions for the different instrumentation combinations,
giving the compiler a much better chance at generating optimal
instructions. The assembly here looks a lot better now.

> 0007 is the stack-mechanism for splitting out "Table Buffers" for
> Index Scans.

This stayed pretty much the same from the earlier version, but I
reworked this a bit to avoid special Instr functions for dealing with
this case, instead it re-uses NodeInstrumentation and
InstrStartNode/InstrStopNode.

> 0008 adds a pg_session_buffer_usage() module for testing the global
> counters. This helps to verify the handling of aborts behave sanely
> (and can run on both master and patched), but I don't think we should
> commit this.

This is identical with v6/0008, but continues to prove very useful in
testing the refactorings.

> Two questions on my mind:
>
> 1) Should we call this "instrumentation context" (InstrContext?)
> instead of "instrumentation stack" (InstrStack)? When writing comments
> I found the difference between "stack entry" and "stack" confusing at
> times, and "context" feels a bit more clear as a term (e.g. as in
> "CurrentInstrContext" or "CurrentInstrumentationContext").

With the refactoring of the stack-mechanism done now, I think we don't
need to change the naming here, since "InstrStack" as a commonly used
struct is gone now (its just Instrumentation now), so I'd suggest
keeping the "stack" terminology to describe the approach itself.

> 2) For 0007, "Table Buffers" feels a bit inconsistent with "Heap
> Fetches" and "Heap Blocks" used elsewhere, should we potentially use
> "Heap Buffers", or change the existing names to "Table ..." instead?

That is still an open question.

Thanks,
Lukas

-- 
Lukas Fittl


Attachments:

  [application/octet-stream] v7-0004-instrumentation-Add-additional-regression-tests-c.patch (23.5K, 2-v7-0004-instrumentation-Add-additional-regression-tests-c.patch)
  download | inline diff:
From 03eb693edd0c20386a5c1fb87bdd343444f3a0f4 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sun, 22 Feb 2026 16:12:48 -0800
Subject: [PATCH v7 4/8] instrumentation: Add additional regression tests
 covering buffer usage

This adds regression tests that cover some of the expected behaviour
around the buffer statistics reported in EXPLAIN ANALYZE, specifically
how they behave in parallel query, nested function calls and abort
situations.

Testing this is challenging because there can be different sources of
buffer activity, so we rely on temporary tables where we can to prove
that activity was captured and not lost. This supports a future commit
that will rework some of the instrumentation logic that could cause
areas covered by these tests to fail.

Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
 .../pg_stat_statements/expected/utility.out   |  70 +++++++
 contrib/pg_stat_statements/expected/wal.out   |  48 +++++
 contrib/pg_stat_statements/sql/utility.sql    |  56 +++++
 contrib/pg_stat_statements/sql/wal.sql        |  33 +++
 src/test/regress/expected/explain.out         | 197 ++++++++++++++++++
 src/test/regress/sql/explain.sql              | 194 +++++++++++++++++
 6 files changed, 598 insertions(+)

diff --git a/contrib/pg_stat_statements/expected/utility.out b/contrib/pg_stat_statements/expected/utility.out
index e4d6564ea5b..cba487f6be5 100644
--- a/contrib/pg_stat_statements/expected/utility.out
+++ b/contrib/pg_stat_statements/expected/utility.out
@@ -289,6 +289,76 @@ SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
      1 |    1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
 (3 rows)
 
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT run_explain_buffers_test();
+ run_explain_buffers_test 
+--------------------------
+ 
+(1 row)
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+               query               | has_buffer_stats 
+-----------------------------------+------------------
+ SELECT run_explain_buffers_test() | t
+(1 row)
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+    v int;
+BEGIN
+    EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+    SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+CALL pgss_call_rollback_proc();
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+             query              | calls | local_hitread | wal_bytes_generated | wal_records_generated 
+--------------------------------+-------+---------------+---------------------+-----------------------
+ CALL pgss_call_rollback_proc() |     1 | t             | t                   | t
+(1 row)
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
 -- CALL
 CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
 DECLARE
diff --git a/contrib/pg_stat_statements/expected/wal.out b/contrib/pg_stat_statements/expected/wal.out
index 977e382d848..611213daef6 100644
--- a/contrib/pg_stat_statements/expected/wal.out
+++ b/contrib/pg_stat_statements/expected/wal.out
@@ -28,3 +28,51 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
  t
 (1 row)
 
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+    v int;
+BEGIN
+    WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+    SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+    NULL;
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT pgss_error_func();
+ pgss_error_func 
+-----------------
+ 
+(1 row)
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+          query           | calls | local_hitread | wal_bytes_generated | wal_records_generated 
+--------------------------+-------+---------------+---------------------+-----------------------
+ SELECT pgss_error_func() |     1 | t             | t                   | t
+(1 row)
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/sql/utility.sql b/contrib/pg_stat_statements/sql/utility.sql
index dd97203c210..7540e49c73c 100644
--- a/contrib/pg_stat_statements/sql/utility.sql
+++ b/contrib/pg_stat_statements/sql/utility.sql
@@ -152,6 +152,62 @@ EXPLAIN (costs off) SELECT a FROM generate_series(1,10) AS tab(a) WHERE a = 7;
 
 SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
 
+-- Buffer stats should flow through EXPLAIN ANALYZE
+CREATE TEMP TABLE flow_through_test (a int, b char(200));
+INSERT INTO flow_through_test SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+CREATE FUNCTION run_explain_buffers_test() RETURNS void AS $$
+DECLARE
+BEGIN
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM flow_through_test';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+SELECT run_explain_buffers_test();
+
+-- EXPLAIN entries should have non-zero buffer stats
+SELECT query, local_blks_hit + local_blks_read > 0 as has_buffer_stats
+FROM pg_stat_statements
+WHERE query LIKE 'SELECT run_explain_buffers_test%'
+ORDER BY query COLLATE "C";
+
+DROP FUNCTION run_explain_buffers_test;
+DROP TABLE flow_through_test;
+
+-- Validate buffer/WAL counting during abort
+SET pg_stat_statements.track = 'all';
+CREATE TEMP TABLE pgss_call_tab (a int, b char(20));
+CREATE TEMP TABLE pgss_call_tab2 (a int, b char(20));
+INSERT INTO pgss_call_tab VALUES (0, 'zzz');
+
+CREATE PROCEDURE pgss_call_rollback_proc() AS $$
+DECLARE
+    v int;
+BEGIN
+    EXPLAIN ANALYZE WITH ins AS (INSERT INTO pgss_call_tab2 SELECT * FROM pgss_call_tab RETURNING a)
+    SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+CALL pgss_call_rollback_proc();
+
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_call_rollback_proc%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_call_tab2;
+DROP TABLE pgss_call_tab;
+DROP PROCEDURE pgss_call_rollback_proc;
+SET pg_stat_statements.track = 'top';
+
 -- CALL
 CREATE OR REPLACE PROCEDURE sum_one(i int) AS $$
 DECLARE
diff --git a/contrib/pg_stat_statements/sql/wal.sql b/contrib/pg_stat_statements/sql/wal.sql
index 1dc1552a81e..467e321b206 100644
--- a/contrib/pg_stat_statements/sql/wal.sql
+++ b/contrib/pg_stat_statements/sql/wal.sql
@@ -18,3 +18,36 @@ wal_records > 0 as wal_records_generated,
 wal_records >= rows as wal_records_ge_rows
 FROM pg_stat_statements ORDER BY query COLLATE "C";
 SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+--
+-- Validate buffer/WAL counting with caught exception in PL/pgSQL
+--
+CREATE TEMP TABLE pgss_error_tab (a int, b char(20));
+INSERT INTO pgss_error_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION pgss_error_func() RETURNS void AS $$
+DECLARE
+    v int;
+BEGIN
+    WITH ins AS (INSERT INTO pgss_error_tab VALUES (1, 'aaa') RETURNING a)
+    SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+    NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT pgss_error_func();
+
+-- Buffer/WAL usage from the wCTE INSERT should survive the exception
+SELECT query, calls,
+local_blks_hit + local_blks_read > 0 as local_hitread,
+wal_bytes > 0 as wal_bytes_generated,
+wal_records > 0 as wal_records_generated
+FROM pg_stat_statements
+WHERE query LIKE '%pgss_error_func%'
+ORDER BY query COLLATE "C";
+
+DROP TABLE pgss_error_tab;
+DROP FUNCTION pgss_error_func;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index 7c1f26b182c..e28e7543693 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -822,3 +822,200 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
 (9 rows)
 
 reset work_mem;
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+    v int;
+BEGIN
+    WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+    SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+    NULL;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+    plan_json json;
+    node json;
+    total_buffers int;
+BEGIN
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT explain_exc_func()' INTO plan_json;
+    node := plan_json->0->'Plan';
+    total_buffers :=
+        COALESCE((node->>'Local Hit Blocks')::int, 0) +
+        COALESCE((node->>'Local Read Blocks')::int, 0);
+    RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+ exception_buffers_visible 
+---------------------------
+ t
+(1 row)
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+    plan_json json;
+    node json;
+    total_buffers int;
+BEGIN
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT check_explain_exception_buffers()' INTO plan_json;
+    node := plan_json->0->'Plan';
+    total_buffers :=
+        COALESCE((node->>'Local Hit Blocks')::int, 0) +
+        COALESCE((node->>'Local Read Blocks')::int, 0);
+    RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+ exception_buffers_nested_visible 
+----------------------------------
+ t
+(1 row)
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+    cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+    rec RECORD;
+    cnt bigint := 0;
+BEGIN
+    OPEN cur;
+    LOOP
+        FETCH NEXT FROM cur INTO rec;
+        EXIT WHEN NOT FOUND;
+        cnt := cnt + 1;
+    END LOOP;
+    CLOSE cur;
+    RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+    cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+    rec RECORD;
+    cnt bigint := 0;
+BEGIN
+    OPEN cur;
+    LOOP
+        FETCH NEXT FROM cur INTO rec;
+        EXIT WHEN NOT FOUND;
+        cnt := cnt + 1;
+    END LOOP;
+    CLOSE cur;
+    RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+    plan_json json;
+    node json;
+    direct_buf int;
+    noscroll_buf int;
+    scroll_buf int;
+BEGIN
+    -- Direct scan: get leaf Seq Scan node buffers as baseline
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT * FROM cursor_buf_test' INTO plan_json;
+    node := plan_json->0->'Plan';
+    WHILE node->'Plans' IS NOT NULL LOOP
+        node := node->'Plans'->0;
+    END LOOP;
+    direct_buf :=
+        COALESCE((node->>'Local Hit Blocks')::int, 0) +
+        COALESCE((node->>'Local Read Blocks')::int, 0);
+
+    -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT cursor_noscroll_scan()' INTO plan_json;
+    node := plan_json->0->'Plan';
+    noscroll_buf :=
+        COALESCE((node->>'Local Hit Blocks')::int, 0) +
+        COALESCE((node->>'Local Read Blocks')::int, 0);
+
+    -- Scrollable cursor path: ExecShutdownNode is skipped
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT cursor_scroll_scan()' INTO plan_json;
+    node := plan_json->0->'Plan';
+    scroll_buf :=
+        COALESCE((node->>'Local Hit Blocks')::int, 0) +
+        COALESCE((node->>'Local Read Blocks')::int, 0);
+
+    -- Both cursor paths should report buffer counts about as high as
+    -- the direct scan (same data plus minor catalog overhead), and not
+    -- double-counted (< 2x the direct scan)
+    RETURN QUERY SELECT
+        (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+        (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_cursor_explain_buffers();
+ noscroll_ok | scroll_ok 
+-------------+-----------
+ t           | t
+(1 row)
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+    plan_json json;
+    serial_buffers int;
+    parallel_buffers int;
+    node json;
+BEGIN
+    -- Serial --
+    SET LOCAL max_parallel_workers_per_gather = 0;
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT count(*) FROM tenk1' INTO plan_json;
+    node := plan_json->0->'Plan';
+    serial_buffers :=
+        COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+        COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+    -- Parallel --
+    SET LOCAL parallel_setup_cost = 0;
+    SET LOCAL parallel_tuple_cost = 0;
+    SET LOCAL min_parallel_table_scan_size = 0;
+    SET LOCAL max_parallel_workers_per_gather = 2;
+    SET LOCAL parallel_leader_participation = off;
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT count(*) FROM tenk1' INTO plan_json;
+    node := plan_json->0->'Plan';
+    parallel_buffers :=
+        COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+        COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+    RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+SELECT * FROM check_parallel_explain_buffers();
+ ratio 
+-------
+     1
+(1 row)
+
+DROP FUNCTION check_parallel_explain_buffers;
diff --git a/src/test/regress/sql/explain.sql b/src/test/regress/sql/explain.sql
index ebdab42604b..cf5c6335a19 100644
--- a/src/test/regress/sql/explain.sql
+++ b/src/test/regress/sql/explain.sql
@@ -188,3 +188,197 @@ select explain_filter('explain (analyze,buffers off,costs off) select sum(n) ove
 -- Test tuplestore storage usage in Window aggregate (memory and disk case, final result is disk)
 select explain_filter('explain (analyze,buffers off,costs off) select sum(n) over(partition by m) from (SELECT n < 3 as m, n from generate_series(1,2500) a(n))');
 reset work_mem;
+
+-- EXPLAIN (ANALYZE, BUFFERS) should report buffer usage from PL/pgSQL
+-- EXCEPTION blocks, even after subtransaction rollback.
+CREATE TEMP TABLE explain_exc_tab (a int, b char(20));
+INSERT INTO explain_exc_tab VALUES (0, 'zzz');
+
+CREATE FUNCTION explain_exc_func() RETURNS void AS $$
+DECLARE
+    v int;
+BEGIN
+    WITH ins AS (INSERT INTO explain_exc_tab VALUES (1, 'aaa') RETURNING a)
+    SELECT a / 0 INTO v FROM ins;
+EXCEPTION WHEN division_by_zero THEN
+    NULL;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_explain_exception_buffers() RETURNS boolean AS $$
+DECLARE
+    plan_json json;
+    node json;
+    total_buffers int;
+BEGIN
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT explain_exc_func()' INTO plan_json;
+    node := plan_json->0->'Plan';
+    total_buffers :=
+        COALESCE((node->>'Local Hit Blocks')::int, 0) +
+        COALESCE((node->>'Local Read Blocks')::int, 0);
+    RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers() AS exception_buffers_visible;
+
+-- Also test with nested EXPLAIN ANALYZE (two levels of instrumentation)
+CREATE FUNCTION check_explain_exception_buffers_nested() RETURNS boolean AS $$
+DECLARE
+    plan_json json;
+    node json;
+    total_buffers int;
+BEGIN
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT check_explain_exception_buffers()' INTO plan_json;
+    node := plan_json->0->'Plan';
+    total_buffers :=
+        COALESCE((node->>'Local Hit Blocks')::int, 0) +
+        COALESCE((node->>'Local Read Blocks')::int, 0);
+    RETURN total_buffers > 0;
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT check_explain_exception_buffers_nested() AS exception_buffers_nested_visible;
+
+DROP FUNCTION check_explain_exception_buffers_nested;
+DROP FUNCTION check_explain_exception_buffers;
+DROP FUNCTION explain_exc_func;
+DROP TABLE explain_exc_tab;
+
+-- Cursor instrumentation test.
+-- Verify that buffer usage is correctly tracked through cursor execution paths.
+-- Non-scrollable cursors exercise ExecShutdownNode after each ExecutorRun
+-- (EXEC_FLAG_BACKWARD is not set), while scrollable cursors only shut down
+-- nodes in ExecutorFinish. In both cases, buffer usage from the inner cursor
+-- scan should be correctly accumulated.
+
+CREATE TEMP TABLE cursor_buf_test AS SELECT * FROM tenk1;
+
+CREATE FUNCTION cursor_noscroll_scan() RETURNS bigint AS $$
+DECLARE
+    cur NO SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+    rec RECORD;
+    cnt bigint := 0;
+BEGIN
+    OPEN cur;
+    LOOP
+        FETCH NEXT FROM cur INTO rec;
+        EXIT WHEN NOT FOUND;
+        cnt := cnt + 1;
+    END LOOP;
+    CLOSE cur;
+    RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION cursor_scroll_scan() RETURNS bigint AS $$
+DECLARE
+    cur SCROLL CURSOR FOR SELECT * FROM cursor_buf_test;
+    rec RECORD;
+    cnt bigint := 0;
+BEGIN
+    OPEN cur;
+    LOOP
+        FETCH NEXT FROM cur INTO rec;
+        EXIT WHEN NOT FOUND;
+        cnt := cnt + 1;
+    END LOOP;
+    CLOSE cur;
+    RETURN cnt;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE FUNCTION check_cursor_explain_buffers() RETURNS TABLE(noscroll_ok boolean, scroll_ok boolean) AS $$
+DECLARE
+    plan_json json;
+    node json;
+    direct_buf int;
+    noscroll_buf int;
+    scroll_buf int;
+BEGIN
+    -- Direct scan: get leaf Seq Scan node buffers as baseline
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT * FROM cursor_buf_test' INTO plan_json;
+    node := plan_json->0->'Plan';
+    WHILE node->'Plans' IS NOT NULL LOOP
+        node := node->'Plans'->0;
+    END LOOP;
+    direct_buf :=
+        COALESCE((node->>'Local Hit Blocks')::int, 0) +
+        COALESCE((node->>'Local Read Blocks')::int, 0);
+
+    -- Non-scrollable cursor path: ExecShutdownNode runs after each ExecutorRun
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT cursor_noscroll_scan()' INTO plan_json;
+    node := plan_json->0->'Plan';
+    noscroll_buf :=
+        COALESCE((node->>'Local Hit Blocks')::int, 0) +
+        COALESCE((node->>'Local Read Blocks')::int, 0);
+
+    -- Scrollable cursor path: ExecShutdownNode is skipped
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT cursor_scroll_scan()' INTO plan_json;
+    node := plan_json->0->'Plan';
+    scroll_buf :=
+        COALESCE((node->>'Local Hit Blocks')::int, 0) +
+        COALESCE((node->>'Local Read Blocks')::int, 0);
+
+    -- Both cursor paths should report buffer counts about as high as
+    -- the direct scan (same data plus minor catalog overhead), and not
+    -- double-counted (< 2x the direct scan)
+    RETURN QUERY SELECT
+        (noscroll_buf >= direct_buf * 0.5 AND noscroll_buf < direct_buf * 2),
+        (scroll_buf >= direct_buf * 0.5 AND scroll_buf < direct_buf * 2);
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_cursor_explain_buffers();
+
+DROP FUNCTION check_cursor_explain_buffers;
+DROP FUNCTION cursor_noscroll_scan;
+DROP FUNCTION cursor_scroll_scan;
+DROP TABLE cursor_buf_test;
+
+-- Parallel query buffer double-counting test.
+--
+-- Compares serial Seq Scan buffers vs parallel Seq Scan buffers.
+-- They scan the same table so the buffer count should be similar.
+-- Double-counting would make the parallel count ~2x larger.
+CREATE FUNCTION check_parallel_explain_buffers() RETURNS TABLE(ratio numeric) AS $$
+DECLARE
+    plan_json json;
+    serial_buffers int;
+    parallel_buffers int;
+    node json;
+BEGIN
+    -- Serial --
+    SET LOCAL max_parallel_workers_per_gather = 0;
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT count(*) FROM tenk1' INTO plan_json;
+    node := plan_json->0->'Plan';
+    serial_buffers :=
+        COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+        COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+    -- Parallel --
+    SET LOCAL parallel_setup_cost = 0;
+    SET LOCAL parallel_tuple_cost = 0;
+    SET LOCAL min_parallel_table_scan_size = 0;
+    SET LOCAL max_parallel_workers_per_gather = 2;
+    SET LOCAL parallel_leader_participation = off;
+    EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF, FORMAT JSON)
+        SELECT count(*) FROM tenk1' INTO plan_json;
+    node := plan_json->0->'Plan';
+    parallel_buffers :=
+        COALESCE((node->>'Shared Hit Blocks')::int, 0) +
+        COALESCE((node->>'Shared Read Blocks')::int, 0);
+
+    RETURN QUERY SELECT round(parallel_buffers::numeric / GREATEST(serial_buffers, 1));
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT * FROM check_parallel_explain_buffers();
+
+DROP FUNCTION check_parallel_explain_buffers;
-- 
2.47.1



  [application/octet-stream] v7-0002-instrumentation-Separate-per-node-logic-from-othe.patch (26.3K, 3-v7-0002-instrumentation-Separate-per-node-logic-from-othe.patch)
  download | inline diff:
From 0032774bacdeff43d6565e43e04510c1b1eaf6c2 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 01:19:50 -0800
Subject: [PATCH v7 2/8] instrumentation: Separate per-node logic from other
 uses

Previously different places (e.g. query "total time") were repurposing
the Instrumentation struct initially introduced for capturing per-node
statistics during execution. This overuse of the same struct is confusing,
e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
code paths, and prevents future refactorings.

Instead, simplify the Instrumentation struct to only track time and
WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
per-node instrumentation - these calls were added without any apparent
benefit since the relevant fields were never read.

Introduce the NodeInstrumentation struct to carry forward the per-node
instrumentation information. WorkerInstrumentation is renamed to
WorkerNodeInstrumentation for clarity.

In passing, drop the "n" argument to InstrAlloc, as all remaining callers
need exactly one Instrumentation struct.

Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
 contrib/auto_explain/auto_explain.c           |   8 +-
 .../pg_stat_statements/pg_stat_statements.c   |   8 +-
 contrib/postgres_fdw/postgres_fdw.c           |   2 +-
 src/backend/commands/explain.c                |  20 +--
 src/backend/executor/execMain.c               |   8 +-
 src/backend/executor/execParallel.c           |  24 +--
 src/backend/executor/execProcnode.c           |   4 +-
 src/backend/executor/instrument.c             | 142 ++++++++++++------
 src/include/executor/instrument.h             |  60 +++++---
 src/include/nodes/execnodes.h                 |   6 +-
 src/tools/pgindent/typedefs.list              |   3 +-
 11 files changed, 172 insertions(+), 113 deletions(-)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index e856cd35a6f..39bf2543b70 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -315,7 +315,7 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
 			MemoryContext oldcxt;
 
 			oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
-			queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+			queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
 			MemoryContextSwitchTo(oldcxt);
 		}
 	}
@@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 		 */
 		oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
 
-		/*
-		 * Make sure stats accumulation is done.  (Note: it's okay if several
-		 * levels of hook all do this.)
-		 */
-		InstrEndLoop(queryDesc->totaltime);
-
 		/* Log plan if duration is exceeded. */
 		msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
 		if (msec >= auto_explain_log_min_duration)
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 4a427533bd8..388b068ccec 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1023,7 +1023,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 			MemoryContext oldcxt;
 
 			oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
-			queryDesc->totaltime = InstrAlloc(1, INSTRUMENT_ALL, false);
+			queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
 			MemoryContextSwitchTo(oldcxt);
 		}
 	}
@@ -1082,12 +1082,6 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 	if (queryId != INT64CONST(0) && queryDesc->totaltime &&
 		pgss_enabled(nesting_level))
 	{
-		/*
-		 * Make sure stats accumulation is done.  (Note: it's okay if several
-		 * levels of hook all do this.)
-		 */
-		InstrEndLoop(queryDesc->totaltime);
-
 		pgss_store(queryDesc->sourceText,
 				   queryId,
 				   queryDesc->plannedstmt->stmt_location,
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 60d90329a65..6f0cb2a285b 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2778,7 +2778,7 @@ postgresIterateDirectModify(ForeignScanState *node)
 	if (!resultRelInfo->ri_projectReturning)
 	{
 		TupleTableSlot *slot = node->ss.ss_ScanTupleSlot;
-		Instrumentation *instr = node->ss.ps.instrument;
+		NodeInstrumentation *instr = node->ss.ps.instrument;
 
 		Assert(!dmstate->has_returning);
 
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 09b13807d92..389181b8d9b 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1835,7 +1835,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	{
 		double		nloops = planstate->instrument->nloops;
 		double		startup_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->startup) / nloops;
-		double		total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->total) / nloops;
+		double		total_ms = INSTR_TIME_GET_MILLISEC(planstate->instrument->instr.total) / nloops;
 		double		rows = planstate->instrument->ntuples / nloops;
 
 		if (es->format == EXPLAIN_FORMAT_TEXT)
@@ -1888,11 +1888,11 @@ ExplainNode(PlanState *planstate, List *ancestors,
 	/* prepare per-worker general execution details */
 	if (es->workers_state && es->verbose)
 	{
-		WorkerInstrumentation *w = planstate->worker_instrument;
+		WorkerNodeInstrumentation *w = planstate->worker_instrument;
 
 		for (int n = 0; n < w->num_workers; n++)
 		{
-			Instrumentation *instrument = &w->instrument[n];
+			NodeInstrumentation *instrument = &w->instrument[n];
 			double		nloops = instrument->nloops;
 			double		startup_ms;
 			double		total_ms;
@@ -1901,7 +1901,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 			if (nloops <= 0)
 				continue;
 			startup_ms = INSTR_TIME_GET_MILLISEC(instrument->startup) / nloops;
-			total_ms = INSTR_TIME_GET_MILLISEC(instrument->total) / nloops;
+			total_ms = INSTR_TIME_GET_MILLISEC(instrument->instr.total) / nloops;
 			rows = instrument->ntuples / nloops;
 
 			ExplainOpenWorker(n, es);
@@ -2288,18 +2288,18 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 	/* Show buffer/WAL usage */
 	if (es->buffers && planstate->instrument)
-		show_buffer_usage(es, &planstate->instrument->bufusage);
+		show_buffer_usage(es, &planstate->instrument->instr.bufusage);
 	if (es->wal && planstate->instrument)
-		show_wal_usage(es, &planstate->instrument->walusage);
+		show_wal_usage(es, &planstate->instrument->instr.walusage);
 
 	/* Prepare per-worker buffer/WAL usage */
 	if (es->workers_state && (es->buffers || es->wal) && es->verbose)
 	{
-		WorkerInstrumentation *w = planstate->worker_instrument;
+		WorkerNodeInstrumentation *w = planstate->worker_instrument;
 
 		for (int n = 0; n < w->num_workers; n++)
 		{
-			Instrumentation *instrument = &w->instrument[n];
+			NodeInstrumentation *instrument = &w->instrument[n];
 			double		nloops = instrument->nloops;
 
 			if (nloops <= 0)
@@ -2307,9 +2307,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 			ExplainOpenWorker(n, es);
 			if (es->buffers)
-				show_buffer_usage(es, &instrument->bufusage);
+				show_buffer_usage(es, &instrument->instr.bufusage);
 			if (es->wal)
-				show_wal_usage(es, &instrument->walusage);
+				show_wal_usage(es, &instrument->instr.walusage);
 			ExplainCloseWorker(n, es);
 		}
 	}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 1a3b8021600..c0b174cfbc0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -331,7 +331,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 
 	/* Allow instrumentation of Executor overall runtime */
 	if (queryDesc->totaltime)
-		InstrStartNode(queryDesc->totaltime);
+		InstrStart(queryDesc->totaltime);
 
 	/*
 	 * extract information from the query descriptor and the query feature.
@@ -383,7 +383,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 		dest->rShutdown(dest);
 
 	if (queryDesc->totaltime)
-		InstrStopNode(queryDesc->totaltime, estate->es_processed);
+		InstrStop(queryDesc->totaltime);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -433,7 +433,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
 
 	/* Allow instrumentation of Executor overall runtime */
 	if (queryDesc->totaltime)
-		InstrStartNode(queryDesc->totaltime);
+		InstrStart(queryDesc->totaltime);
 
 	/* Run ModifyTable nodes to completion */
 	ExecPostprocessPlan(estate);
@@ -443,7 +443,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
 		AfterTriggerEndQuery(estate);
 
 	if (queryDesc->totaltime)
-		InstrStopNode(queryDesc->totaltime, 0);
+		InstrStop(queryDesc->totaltime);
 
 	MemoryContextSwitchTo(oldcontext);
 
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index ac84af294c9..c153d5c1c3b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -87,7 +87,7 @@ typedef struct FixedParallelExecutorState
  * instrument_options: Same meaning here as in instrument.c.
  *
  * instrument_offset: Offset, relative to the start of this structure,
- * of the first Instrumentation object.  This will depend on the length of
+ * of the first NodeInstrumentation object.  This will depend on the length of
  * the plan_node_id array.
  *
  * num_workers: Number of workers.
@@ -104,11 +104,15 @@ struct SharedExecutorInstrumentation
 	int			num_workers;
 	int			num_plan_nodes;
 	int			plan_node_id[FLEXIBLE_ARRAY_MEMBER];
-	/* array of num_plan_nodes * num_workers Instrumentation objects follows */
+
+	/*
+	 * array of num_plan_nodes * num_workers NodeInstrumentation objects
+	 * follows
+	 */
 };
 #define GetInstrumentationArray(sei) \
 	(StaticAssertVariableIsOfTypeMacro(sei, SharedExecutorInstrumentation *), \
-	 (Instrumentation *) (((char *) sei) + sei->instrument_offset))
+	 (NodeInstrumentation *) (((char *) sei) + sei->instrument_offset))
 
 /* Context object for ExecParallelEstimate. */
 typedef struct ExecParallelEstimateContext
@@ -725,7 +729,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 		instrumentation_len = MAXALIGN(instrumentation_len);
 		instrument_offset = instrumentation_len;
 		instrumentation_len +=
-			mul_size(sizeof(Instrumentation),
+			mul_size(sizeof(NodeInstrumentation),
 					 mul_size(e.nnodes, nworkers));
 		shm_toc_estimate_chunk(&pcxt->estimator, instrumentation_len);
 		shm_toc_estimate_keys(&pcxt->estimator, 1);
@@ -811,7 +815,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	 */
 	if (estate->es_instrument)
 	{
-		Instrumentation *instrument;
+		NodeInstrumentation *instrument;
 		int			i;
 
 		instrumentation = shm_toc_allocate(pcxt->toc, instrumentation_len);
@@ -821,7 +825,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 		instrumentation->num_plan_nodes = e.nnodes;
 		instrument = GetInstrumentationArray(instrumentation);
 		for (i = 0; i < nworkers * e.nnodes; ++i)
-			InstrInit(&instrument[i], estate->es_instrument);
+			InstrInitNode(&instrument[i], estate->es_instrument);
 		shm_toc_insert(pcxt->toc, PARALLEL_KEY_INSTRUMENTATION,
 					   instrumentation);
 		pei->instrumentation = instrumentation;
@@ -1053,7 +1057,7 @@ static bool
 ExecParallelRetrieveInstrumentation(PlanState *planstate,
 									SharedExecutorInstrumentation *instrumentation)
 {
-	Instrumentation *instrument;
+	NodeInstrumentation *instrument;
 	int			i;
 	int			n;
 	int			ibytes;
@@ -1081,9 +1085,9 @@ ExecParallelRetrieveInstrumentation(PlanState *planstate,
 	 * Switch into per-query memory context.
 	 */
 	oldcontext = MemoryContextSwitchTo(planstate->state->es_query_cxt);
-	ibytes = mul_size(instrumentation->num_workers, sizeof(Instrumentation));
+	ibytes = mul_size(instrumentation->num_workers, sizeof(NodeInstrumentation));
 	planstate->worker_instrument =
-		palloc(ibytes + offsetof(WorkerInstrumentation, instrument));
+		palloc(ibytes + offsetof(WorkerNodeInstrumentation, instrument));
 	MemoryContextSwitchTo(oldcontext);
 
 	planstate->worker_instrument->num_workers = instrumentation->num_workers;
@@ -1313,7 +1317,7 @@ ExecParallelReportInstrumentation(PlanState *planstate,
 {
 	int			i;
 	int			plan_node_id = planstate->plan->plan_node_id;
-	Instrumentation *instrument;
+	NodeInstrumentation *instrument;
 
 	InstrEndLoop(planstate->instrument);
 
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 7e40b852517..1846661b503 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -413,8 +413,8 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 
 	/* Set up instrumentation for this node if requested */
 	if (estate->es_instrument)
-		result->instrument = InstrAlloc(1, estate->es_instrument,
-										result->async_capable);
+		result->instrument = InstrAllocNode(estate->es_instrument,
+											result->async_capable);
 
 	return result;
 }
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 9354ad7be12..bc551f95a08 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -26,51 +26,30 @@ static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
 static void WalUsageAdd(WalUsage *dst, WalUsage *add);
 
 
-/* Allocate new instrumentation structure(s) */
+/* General purpose instrumentation handling */
 Instrumentation *
-InstrAlloc(int n, int instrument_options, bool async_mode)
+InstrAlloc(int instrument_options)
 {
-	Instrumentation *instr;
-
-	/* initialize all fields to zeroes, then modify as needed */
-	instr = palloc0(n * sizeof(Instrumentation));
-	if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | INSTRUMENT_WAL))
-	{
-		bool		need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
-		bool		need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
-		bool		need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
-		int			i;
-
-		for (i = 0; i < n; i++)
-		{
-			instr[i].need_bufusage = need_buffers;
-			instr[i].need_walusage = need_wal;
-			instr[i].need_timer = need_timer;
-			instr[i].async_mode = async_mode;
-		}
-	}
-
+	Instrumentation *instr = palloc0(sizeof(Instrumentation));
+	InstrInitOptions(instr, instrument_options);
 	return instr;
 }
 
-/* Initialize a pre-allocated instrumentation structure. */
 void
-InstrInit(Instrumentation *instr, int instrument_options)
+InstrInitOptions(Instrumentation *instr, int instrument_options)
 {
-	memset(instr, 0, sizeof(Instrumentation));
 	instr->need_bufusage = (instrument_options & INSTRUMENT_BUFFERS) != 0;
 	instr->need_walusage = (instrument_options & INSTRUMENT_WAL) != 0;
 	instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
 }
 
-/* Entry to a plan node */
 void
-InstrStartNode(Instrumentation *instr)
+InstrStart(Instrumentation *instr)
 {
 	if (instr->need_timer)
 	{
 		if (!INSTR_TIME_IS_ZERO(instr->starttime))
-			elog(ERROR, "InstrStartNode called twice in a row");
+			elog(ERROR, "InstrStart called twice in a row");
 		else
 			INSTR_TIME_SET_CURRENT(instr->starttime);
 	}
@@ -83,24 +62,19 @@ InstrStartNode(Instrumentation *instr)
 		instr->walusage_start = pgWalUsage;
 }
 
-/* Exit from a plan node */
 void
-InstrStopNode(Instrumentation *instr, double nTuples)
+InstrStop(Instrumentation *instr)
 {
-	double		save_tuplecount = instr->tuplecount;
 	instr_time	endtime;
 
-	/* count the returned tuples */
-	instr->tuplecount += nTuples;
-
 	/* let's update the time only if the timer was requested */
 	if (instr->need_timer)
 	{
 		if (INSTR_TIME_IS_ZERO(instr->starttime))
-			elog(ERROR, "InstrStopNode called without start");
+			elog(ERROR, "InstrStop called without start");
 
 		INSTR_TIME_SET_CURRENT(endtime);
-		INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
+		INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
 
 		INSTR_TIME_SET_ZERO(instr->starttime);
 	}
@@ -113,6 +87,74 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 	if (instr->need_walusage)
 		WalUsageAccumDiff(&instr->walusage,
 						  &pgWalUsage, &instr->walusage_start);
+}
+
+/* Node instrumentation handling */
+
+/* Allocate new node instrumentation structure */
+NodeInstrumentation *
+InstrAllocNode(int instrument_options, bool async_mode)
+{
+	NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+
+	InstrInitNode(instr, instrument_options);
+	instr->async_mode = async_mode;
+
+	return instr;
+}
+
+/* Initialize a pre-allocated instrumentation structure. */
+void
+InstrInitNode(NodeInstrumentation *instr, int instrument_options)
+{
+	memset(instr, 0, sizeof(NodeInstrumentation));
+	InstrInitOptions(&instr->instr, instrument_options);
+}
+
+/* Entry to a plan node */
+void
+InstrStartNode(NodeInstrumentation *instr)
+{
+	InstrStart(&instr->instr);
+}
+
+/* Exit from a plan node */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+	double		save_tuplecount = instr->tuplecount;
+	instr_time	endtime;
+
+	/* count the returned tuples */
+	instr->tuplecount += nTuples;
+
+	/*
+	 * Update the time only if the timer was requested.
+	 *
+	 * Note this is different from InstrStop because total is only updated in
+	 * InstrEndLoop. We need the separate counter variable because we need to
+	 * calculate start-up time for the first tuple in each cycle, and then
+	 * accumulate it together.
+	 */
+	if (instr->instr.need_timer)
+	{
+		if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
+			elog(ERROR, "InstrStopNode called without start");
+
+		INSTR_TIME_SET_CURRENT(endtime);
+		INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+
+		INSTR_TIME_SET_ZERO(instr->instr.starttime);
+	}
+
+	/* Add delta of buffer usage since entry to node's totals */
+	if (instr->instr.need_bufusage)
+		BufferUsageAccumDiff(&instr->instr.bufusage,
+							 &pgBufferUsage, &instr->instr.bufusage_start);
+
+	if (instr->instr.need_walusage)
+		WalUsageAccumDiff(&instr->instr.walusage,
+						  &pgWalUsage, &instr->instr.walusage_start);
 
 	/* Is this the first tuple of this cycle? */
 	if (!instr->running)
@@ -133,7 +175,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
 
 /* Update tuple count */
 void
-InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
+InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
 {
 	/* count the returned tuples */
 	instr->tuplecount += nTuples;
@@ -141,24 +183,24 @@ InstrUpdateTupleCount(Instrumentation *instr, double nTuples)
 
 /* Finish a run cycle for a plan node */
 void
-InstrEndLoop(Instrumentation *instr)
+InstrEndLoop(NodeInstrumentation *instr)
 {
 	/* Skip if nothing has happened, or already shut down */
 	if (!instr->running)
 		return;
 
-	if (!INSTR_TIME_IS_ZERO(instr->starttime))
+	if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
 		elog(ERROR, "InstrEndLoop called on running node");
 
 	/* Accumulate per-cycle statistics into totals */
 	INSTR_TIME_ADD(instr->startup, instr->firsttuple);
-	INSTR_TIME_ADD(instr->total, instr->counter);
+	INSTR_TIME_ADD(instr->instr.total, instr->counter);
 	instr->ntuples += instr->tuplecount;
 	instr->nloops += 1;
 
 	/* Reset for next cycle (if any) */
 	instr->running = false;
-	INSTR_TIME_SET_ZERO(instr->starttime);
+	INSTR_TIME_SET_ZERO(instr->instr.starttime);
 	INSTR_TIME_SET_ZERO(instr->counter);
 	INSTR_TIME_SET_ZERO(instr->firsttuple);
 	instr->tuplecount = 0;
@@ -166,7 +208,7 @@ InstrEndLoop(Instrumentation *instr)
 
 /* aggregate instrumentation information */
 void
-InstrAggNode(Instrumentation *dst, Instrumentation *add)
+InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
 {
 	if (!dst->running && add->running)
 	{
@@ -181,7 +223,7 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 
 	dst->tuplecount += add->tuplecount;
 	INSTR_TIME_ADD(dst->startup, add->startup);
-	INSTR_TIME_ADD(dst->total, add->total);
+	INSTR_TIME_ADD(dst->instr.total, add->instr.total);
 	dst->ntuples += add->ntuples;
 	dst->ntuples2 += add->ntuples2;
 	dst->nloops += add->nloops;
@@ -189,11 +231,11 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 	dst->nfiltered2 += add->nfiltered2;
 
 	/* Add delta of buffer usage since entry to node's totals */
-	if (dst->need_bufusage)
-		BufferUsageAdd(&dst->bufusage, &add->bufusage);
+	if (dst->instr.need_bufusage)
+		BufferUsageAdd(&dst->instr.bufusage, &add->instr.bufusage);
 
-	if (dst->need_walusage)
-		WalUsageAdd(&dst->walusage, &add->walusage);
+	if (dst->instr.need_walusage)
+		WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
 }
 
 /* Trigger instrumentation handling */
@@ -204,7 +246,7 @@ InstrAllocTrigger(int n, int instrument_options)
 	int			i;
 
 	for (i = 0; i < n; i++)
-		InstrInit(&tginstr[i].instr, instrument_options);
+		InstrInitOptions(&tginstr[i].instr, instrument_options);
 
 	return tginstr;
 }
@@ -212,13 +254,13 @@ InstrAllocTrigger(int n, int instrument_options)
 void
 InstrStartTrigger(TriggerInstrumentation *tginstr)
 {
-	InstrStartNode(&tginstr->instr);
+	InstrStart(&tginstr->instr);
 }
 
 void
 InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
 {
-	InstrStopNode(&tginstr->instr, 0);
+	InstrStop(&tginstr->instr);
 	tginstr->firings += firings;
 }
 
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index a9c2233227f..b11d64633b5 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -67,38 +67,55 @@ typedef enum InstrumentOption
 	INSTRUMENT_ALL = PG_INT32_MAX
 } InstrumentOption;
 
+/*
+ * General purpose instrumentation that can capture time and WAL/buffer usage
+ *
+ * Initialized through InstrAlloc, followed by one or more calls to a pair of
+ * InstrStart/InstrStop (activity is measured inbetween).
+ */
 typedef struct Instrumentation
 {
-	/* Parameters set at node creation: */
+	/* Parameters set at creation: */
 	bool		need_timer;		/* true if we need timer data */
 	bool		need_bufusage;	/* true if we need buffer usage data */
 	bool		need_walusage;	/* true if we need WAL usage data */
+	/* Internal state keeping: */
+	instr_time	starttime;		/* start time of last InstrStart */
+	BufferUsage bufusage_start; /* buffer usage at start */
+	WalUsage	walusage_start; /* WAL usage at start */
+	/* Accumulated statistics: */
+	instr_time	total;			/* total runtime */
+	BufferUsage bufusage;		/* total buffer usage */
+	WalUsage	walusage;		/* total WAL usage */
+} Instrumentation;
+
+/*
+ * Specialized instrumentation for per-node execution statistics
+ */
+typedef struct NodeInstrumentation
+{
+	Instrumentation instr;
+	/* Parameters set at node creation: */
 	bool		async_mode;		/* true if node is in async mode */
 	/* Info about current plan cycle: */
 	bool		running;		/* true if we've completed first tuple */
-	instr_time	starttime;		/* start time of current iteration of node */
 	instr_time	counter;		/* accumulated runtime for this node */
 	instr_time	firsttuple;		/* time for first tuple of this cycle */
 	double		tuplecount;		/* # of tuples emitted so far this cycle */
-	BufferUsage bufusage_start; /* buffer usage at start */
-	WalUsage	walusage_start; /* WAL usage at start */
 	/* Accumulated statistics across all completed cycles: */
 	instr_time	startup;		/* total startup time */
-	instr_time	total;			/* total time */
 	double		ntuples;		/* total tuples produced */
 	double		ntuples2;		/* secondary node-specific tuple counter */
 	double		nloops;			/* # of run cycles for this node */
 	double		nfiltered1;		/* # of tuples removed by scanqual or joinqual */
 	double		nfiltered2;		/* # of tuples removed by "other" quals */
-	BufferUsage bufusage;		/* total buffer usage */
-	WalUsage	walusage;		/* total WAL usage */
-} Instrumentation;
+} NodeInstrumentation;
 
-typedef struct WorkerInstrumentation
+typedef struct WorkerNodeInstrumentation
 {
 	int			num_workers;	/* # of structures that follow */
-	Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
-} WorkerInstrumentation;
+	NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
+} WorkerNodeInstrumentation;
 
 typedef struct TriggerInstrumentation
 {
@@ -110,14 +127,19 @@ typedef struct TriggerInstrumentation
 extern PGDLLIMPORT BufferUsage pgBufferUsage;
 extern PGDLLIMPORT WalUsage pgWalUsage;
 
-extern Instrumentation *InstrAlloc(int n, int instrument_options,
-								   bool async_mode);
-extern void InstrInit(Instrumentation *instr, int instrument_options);
-extern void InstrStartNode(Instrumentation *instr);
-extern void InstrStopNode(Instrumentation *instr, double nTuples);
-extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
-extern void InstrEndLoop(Instrumentation *instr);
-extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+extern Instrumentation *InstrAlloc(int instrument_options);
+extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
+extern void InstrStart(Instrumentation *instr);
+extern void InstrStop(Instrumentation *instr);
+
+extern NodeInstrumentation *InstrAllocNode(int instrument_options,
+										   bool async_mode);
+extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
+extern void InstrStartNode(NodeInstrumentation *instr);
+extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
+extern void InstrEndLoop(NodeInstrumentation *instr);
+extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
 
 extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
 extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index a43bd428a91..605c7a6cc39 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1175,8 +1175,10 @@ typedef struct PlanState
 	ExecProcNodeMtd ExecProcNodeReal;	/* actual function, if above is a
 										 * wrapper */
 
-	Instrumentation *instrument;	/* Optional runtime stats for this node */
-	WorkerInstrumentation *worker_instrument;	/* per-worker instrumentation */
+	NodeInstrumentation *instrument;	/* Optional runtime stats for this
+										 * node */
+	WorkerNodeInstrumentation *worker_instrument;	/* per-worker
+													 * instrumentation */
 
 	/* Per-worker JIT instrumentation */
 	struct SharedJitInstrumentation *worker_jit_instrument;
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index fd3ec7a7236..7dc0073ab68 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1785,6 +1785,7 @@ NextSampleBlock_function
 NextSampleTuple_function
 NextValueExpr
 Node
+NodeInstrumentation
 NodeTag
 NonEmptyRange
 NoneCompressorState
@@ -3377,9 +3378,9 @@ WorkTableScan
 WorkTableScanState
 WorkerInfo
 WorkerInfoData
-WorkerInstrumentation
 WorkerJobDumpPtrType
 WorkerJobRestorePtrType
+WorkerNodeInstrumentation
 Working_State
 WriteBufPtrType
 WriteBytePtrType
-- 
2.47.1



  [application/octet-stream] v7-0001-instrumentation-Separate-trigger-logic-from-other.patch (9.7K, 4-v7-0001-instrumentation-Separate-trigger-logic-from-other.patch)
  download | inline diff:
From 2c0c9c172cfd23cb8926803ee8db51915b7ccc7d Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 1 Mar 2025 19:31:30 -0800
Subject: [PATCH v7 1/8] instrumentation: Separate trigger logic from other
 uses

Introduce TriggerInstrumentation to capture trigger timing and firings
(previously counted in "ntuples"), to aid a future refactoring that
splits out all Instrumentation fields beyond timing and WAL/buffers into
more specific structs.

Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
 src/backend/commands/explain.c    | 19 ++++++++-----------
 src/backend/commands/trigger.c    | 22 +++++++++++-----------
 src/backend/executor/execMain.c   |  2 +-
 src/backend/executor/instrument.c | 26 ++++++++++++++++++++++++++
 src/include/executor/instrument.h | 12 ++++++++++++
 src/include/nodes/execnodes.h     |  2 +-
 src/tools/pgindent/typedefs.list  |  1 +
 7 files changed, 60 insertions(+), 24 deletions(-)

diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 93918a223b8..09b13807d92 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -1099,18 +1099,15 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
 	for (nt = 0; nt < rInfo->ri_TrigDesc->numtriggers; nt++)
 	{
 		Trigger    *trig = rInfo->ri_TrigDesc->triggers + nt;
-		Instrumentation *instr = rInfo->ri_TrigInstrument + nt;
+		TriggerInstrumentation *tginstr = rInfo->ri_TrigInstrument + nt;
 		char	   *relname;
 		char	   *conname = NULL;
 
-		/* Must clean up instrumentation state */
-		InstrEndLoop(instr);
-
 		/*
 		 * We ignore triggers that were never invoked; they likely aren't
 		 * relevant to the current query type.
 		 */
-		if (instr->ntuples == 0)
+		if (tginstr->firings == 0)
 			continue;
 
 		ExplainOpenGroup("Trigger", NULL, true, es);
@@ -1135,11 +1132,11 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
 			if (show_relname)
 				appendStringInfo(es->str, " on %s", relname);
 			if (es->timing)
-				appendStringInfo(es->str, ": time=%.3f calls=%.0f\n",
-								 INSTR_TIME_GET_MILLISEC(instr->total),
-								 instr->ntuples);
+				appendStringInfo(es->str, ": time=%.3f calls=%d\n",
+								 INSTR_TIME_GET_MILLISEC(tginstr->instr.total),
+								 tginstr->firings);
 			else
-				appendStringInfo(es->str, ": calls=%.0f\n", instr->ntuples);
+				appendStringInfo(es->str, ": calls=%d\n", tginstr->firings);
 		}
 		else
 		{
@@ -1149,9 +1146,9 @@ report_triggers(ResultRelInfo *rInfo, bool show_relname, ExplainState *es)
 			ExplainPropertyText("Relation", relname, es);
 			if (es->timing)
 				ExplainPropertyFloat("Time", "ms",
-									 INSTR_TIME_GET_MILLISEC(instr->total), 3,
+									 INSTR_TIME_GET_MILLISEC(tginstr->instr.total), 3,
 									 es);
-			ExplainPropertyFloat("Calls", NULL, instr->ntuples, 0, es);
+			ExplainPropertyInteger("Calls", NULL, tginstr->firings, es);
 		}
 
 		if (conname)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 98d402c0a3b..c3360073141 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -90,7 +90,7 @@ static bool TriggerEnabled(EState *estate, ResultRelInfo *relinfo,
 static HeapTuple ExecCallTriggerFunc(TriggerData *trigdata,
 									 int tgindx,
 									 FmgrInfo *finfo,
-									 Instrumentation *instr,
+									 TriggerInstrumentation *instr,
 									 MemoryContext per_tuple_context);
 static void AfterTriggerSaveEvent(EState *estate, ResultRelInfo *relinfo,
 								  ResultRelInfo *src_partinfo,
@@ -2309,7 +2309,7 @@ static HeapTuple
 ExecCallTriggerFunc(TriggerData *trigdata,
 					int tgindx,
 					FmgrInfo *finfo,
-					Instrumentation *instr,
+					TriggerInstrumentation *instr,
 					MemoryContext per_tuple_context)
 {
 	LOCAL_FCINFO(fcinfo, 0);
@@ -2344,7 +2344,7 @@ ExecCallTriggerFunc(TriggerData *trigdata,
 	 * If doing EXPLAIN ANALYZE, start charging time to this trigger.
 	 */
 	if (instr)
-		InstrStartNode(instr + tgindx);
+		InstrStartTrigger(instr + tgindx);
 
 	/*
 	 * Do the function evaluation in the per-tuple memory context, so that
@@ -2389,10 +2389,10 @@ ExecCallTriggerFunc(TriggerData *trigdata,
 
 	/*
 	 * If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
-	 * one "tuple returned" (really the number of firings).
+	 * the firing of the trigger.
 	 */
 	if (instr)
-		InstrStopNode(instr + tgindx, 1);
+		InstrStopTrigger(instr + tgindx, 1);
 
 	return (HeapTuple) DatumGetPointer(result);
 }
@@ -3936,7 +3936,7 @@ static void AfterTriggerExecute(EState *estate,
 								ResultRelInfo *dst_relInfo,
 								TriggerDesc *trigdesc,
 								FmgrInfo *finfo,
-								Instrumentation *instr,
+								TriggerInstrumentation *instr,
 								MemoryContext per_tuple_context,
 								TupleTableSlot *trig_tuple_slot1,
 								TupleTableSlot *trig_tuple_slot2);
@@ -4330,7 +4330,7 @@ AfterTriggerExecute(EState *estate,
 					ResultRelInfo *src_relInfo,
 					ResultRelInfo *dst_relInfo,
 					TriggerDesc *trigdesc,
-					FmgrInfo *finfo, Instrumentation *instr,
+					FmgrInfo *finfo, TriggerInstrumentation *instr,
 					MemoryContext per_tuple_context,
 					TupleTableSlot *trig_tuple_slot1,
 					TupleTableSlot *trig_tuple_slot2)
@@ -4371,7 +4371,7 @@ AfterTriggerExecute(EState *estate,
 	 * to include time spent re-fetching tuples in the trigger cost.
 	 */
 	if (instr)
-		InstrStartNode(instr + tgindx);
+		InstrStartTrigger(instr + tgindx);
 
 	/*
 	 * Fetch the required tuple(s).
@@ -4588,10 +4588,10 @@ AfterTriggerExecute(EState *estate,
 
 	/*
 	 * If doing EXPLAIN ANALYZE, stop charging time to this trigger, and count
-	 * one "tuple returned" (really the number of firings).
+	 * the firing of the trigger.
 	 */
 	if (instr)
-		InstrStopNode(instr + tgindx, 1);
+		InstrStopTrigger(instr + tgindx, 1);
 }
 
 
@@ -4707,7 +4707,7 @@ afterTriggerInvokeEvents(AfterTriggerEventList *events,
 	Relation	rel = NULL;
 	TriggerDesc *trigdesc = NULL;
 	FmgrInfo   *finfo = NULL;
-	Instrumentation *instr = NULL;
+	TriggerInstrumentation *instr = NULL;
 	TupleTableSlot *slot1 = NULL,
 			   *slot2 = NULL;
 
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index bfd3ebc601e..1a3b8021600 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1270,7 +1270,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 		resultRelInfo->ri_TrigWhenExprs = (ExprState **)
 			palloc0_array(ExprState *, n);
 		if (instrument_options)
-			resultRelInfo->ri_TrigInstrument = InstrAlloc(n, instrument_options, false);
+			resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
 	}
 	else
 	{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index a40610bc252..9354ad7be12 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -196,6 +196,32 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
 		WalUsageAdd(&dst->walusage, &add->walusage);
 }
 
+/* Trigger instrumentation handling */
+TriggerInstrumentation *
+InstrAllocTrigger(int n, int instrument_options)
+{
+	TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
+	int			i;
+
+	for (i = 0; i < n; i++)
+		InstrInit(&tginstr[i].instr, instrument_options);
+
+	return tginstr;
+}
+
+void
+InstrStartTrigger(TriggerInstrumentation *tginstr)
+{
+	InstrStartNode(&tginstr->instr);
+}
+
+void
+InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
+{
+	InstrStopNode(&tginstr->instr, 0);
+	tginstr->firings += firings;
+}
+
 /* note current values during parallel executor startup */
 void
 InstrStartParallelQuery(void)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 9759f3ea5d8..a9c2233227f 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -100,6 +100,13 @@ typedef struct WorkerInstrumentation
 	Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
 } WorkerInstrumentation;
 
+typedef struct TriggerInstrumentation
+{
+	Instrumentation instr;
+	int			firings;		/* number of times the instrumented trigger
+								 * was fired */
+} TriggerInstrumentation;
+
 extern PGDLLIMPORT BufferUsage pgBufferUsage;
 extern PGDLLIMPORT WalUsage pgWalUsage;
 
@@ -111,6 +118,11 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
 extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
 extern void InstrEndLoop(Instrumentation *instr);
 extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
+
+extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
+extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
+extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
+
 extern void InstrStartParallelQuery(void);
 extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
 extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 63c067d5aae..a43bd428a91 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -524,7 +524,7 @@ typedef struct ResultRelInfo
 	ExprState **ri_TrigWhenExprs;
 
 	/* optional runtime measurements for triggers */
-	Instrumentation *ri_TrigInstrument;
+	TriggerInstrumentation *ri_TrigInstrument;
 
 	/* On-demand created slots for triggers / returning processing */
 	TupleTableSlot *ri_ReturningSlot;	/* for trigger output tuples */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 77e3c04144e..fd3ec7a7236 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3156,6 +3156,7 @@ TriggerDesc
 TriggerEvent
 TriggerFlags
 TriggerInfo
+TriggerInstrumentation
 TriggerTransition
 TruncateStmt
 TsmRoutine
-- 
2.47.1



  [application/octet-stream] v7-0003-instrumentation-Replace-direct-changes-of-pgBuffe.patch (9.9K, 5-v7-0003-instrumentation-Replace-direct-changes-of-pgBuffe.patch)
  download | inline diff:
From 29f9820901383800c659c082876c734cde723601 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 12:12:39 -0800
Subject: [PATCH v7 3/8] instrumentation: Replace direct changes of
 pgBufferUsage/pgWalUsage with INSTR_* macros

This encapsulates the ownership of these globals better, and will allow
a subsequent refactoring.

Author: Lukas Fittl <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Discussion: https://www.postgresql.org/message-id/flat/CAP53PkzZ3UotnRrrnXWAv%3DF4avRq9MQ8zU%2BbxoN9tpovEu6fGQ%40mail.gmail.com#fc7140e8af21e07a90a09d7e76b300c4
---
 src/backend/access/transam/xlog.c      | 10 +++++-----
 src/backend/executor/instrument.c      |  1 -
 src/backend/storage/buffer/bufmgr.c    | 24 ++++++++++++------------
 src/backend/storage/buffer/localbuf.c  |  6 +++---
 src/backend/storage/file/buffile.c     |  8 ++++----
 src/backend/utils/activity/pgstat_io.c |  8 ++++----
 src/include/executor/instrument.h      | 19 +++++++++++++++++++
 7 files changed, 47 insertions(+), 29 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b9b678f3722..f6ac3c530b9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1081,10 +1081,10 @@ XLogInsertRecord(XLogRecData *rdata,
 	/* Report WAL traffic to the instrumentation. */
 	if (inserted)
 	{
-		pgWalUsage.wal_bytes += rechdr->xl_tot_len;
-		pgWalUsage.wal_records++;
-		pgWalUsage.wal_fpi += num_fpi;
-		pgWalUsage.wal_fpi_bytes += fpi_bytes;
+		INSTR_WALUSAGE_ADD(wal_bytes, rechdr->xl_tot_len);
+		INSTR_WALUSAGE_INCR(wal_records);
+		INSTR_WALUSAGE_ADD(wal_fpi, num_fpi);
+		INSTR_WALUSAGE_ADD(wal_fpi_bytes, fpi_bytes);
 
 		/* Required for the flush of pending stats WAL data */
 		pgstat_report_fixed = true;
@@ -2063,7 +2063,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 					WriteRqst.Flush = InvalidXLogRecPtr;
 					XLogWrite(WriteRqst, tli, false);
 					LWLockRelease(WALWriteLock);
-					pgWalUsage.wal_buffers_full++;
+					INSTR_WALUSAGE_INCR(wal_buffers_full);
 					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
 
 					/*
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index bc551f95a08..6a4a08ebb0c 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -54,7 +54,6 @@ InstrStart(Instrumentation *instr)
 			INSTR_TIME_SET_CURRENT(instr->starttime);
 	}
 
-	/* save buffer usage totals at node entry, if needed */
 	if (instr->need_bufusage)
 		instr->bufusage_start = pgBufferUsage;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 5f3d083e938..59b098ba0a0 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -827,7 +827,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 		{
 			PinLocalBuffer(bufHdr, true);
 
-			pgBufferUsage.local_blks_hit++;
+			INSTR_BUFUSAGE_INCR(local_blks_hit);
 
 			return true;
 		}
@@ -848,7 +848,7 @@ ReadRecentBuffer(RelFileLocator rlocator, ForkNumber forkNum, BlockNumber blockN
 		{
 			if (BufferTagsEqual(&tag, &bufHdr->tag))
 			{
-				pgBufferUsage.shared_blks_hit++;
+				INSTR_BUFUSAGE_INCR(shared_blks_hit);
 				return true;
 			}
 			UnpinBuffer(bufHdr);
@@ -1249,14 +1249,14 @@ PinBufferForBlock(Relation rel,
 	{
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, foundPtr);
 		if (*foundPtr)
-			pgBufferUsage.local_blks_hit++;
+			INSTR_BUFUSAGE_INCR(local_blks_hit);
 	}
 	else
 	{
 		bufHdr = BufferAlloc(smgr, persistence, forkNum, blockNum,
 							 strategy, foundPtr, io_context);
 		if (*foundPtr)
-			pgBufferUsage.shared_blks_hit++;
+			INSTR_BUFUSAGE_INCR(shared_blks_hit);
 	}
 	if (rel)
 	{
@@ -1990,9 +1990,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 										  true);
 
 		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_hit += 1;
+			INSTR_BUFUSAGE_INCR(local_blks_hit);
 		else
-			pgBufferUsage.shared_blks_hit += 1;
+			INSTR_BUFUSAGE_INCR(shared_blks_hit);
 
 		if (operation->rel)
 			pgstat_count_buffer_hit(operation->rel);
@@ -2060,9 +2060,9 @@ AsyncReadBuffers(ReadBuffersOperation *operation, int *nblocks_progress)
 								io_start, 1, io_buffers_len * BLCKSZ);
 
 		if (persistence == RELPERSISTENCE_TEMP)
-			pgBufferUsage.local_blks_read += io_buffers_len;
+			INSTR_BUFUSAGE_ADD(local_blks_read, io_buffers_len);
 		else
-			pgBufferUsage.shared_blks_read += io_buffers_len;
+			INSTR_BUFUSAGE_ADD(shared_blks_read, io_buffers_len);
 
 		/*
 		 * Track vacuum cost when issuing IO, not after waiting for it.
@@ -2955,7 +2955,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		TerminateBufferIO(buf_hdr, false, BM_VALID, true, false);
 	}
 
-	pgBufferUsage.shared_blks_written += extend_by;
+	INSTR_BUFUSAGE_ADD(shared_blks_written, extend_by);
 
 	*extended_by = extend_by;
 
@@ -3100,7 +3100,7 @@ MarkBufferDirty(Buffer buffer)
 	 */
 	if (!(old_buf_state & BM_DIRTY))
 	{
-		pgBufferUsage.shared_blks_dirtied++;
+		INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
 		if (VacuumCostActive)
 			VacuumCostBalance += VacuumCostPageDirty;
 	}
@@ -4529,7 +4529,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	pgstat_count_io_op_time(IOOBJECT_RELATION, io_context,
 							IOOP_WRITE, io_start, 1, BLCKSZ);
 
-	pgBufferUsage.shared_blks_written++;
+	INSTR_BUFUSAGE_INCR(shared_blks_written);
 
 	/*
 	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
@@ -5690,7 +5690,7 @@ MarkBufferDirtyHint(Buffer buffer, bool buffer_std)
 
 		if (dirtied)
 		{
-			pgBufferUsage.shared_blks_dirtied++;
+			INSTR_BUFUSAGE_INCR(shared_blks_dirtied);
 			if (VacuumCostActive)
 				VacuumCostBalance += VacuumCostPageDirty;
 		}
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 04a540379a2..e6054e745e8 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -217,7 +217,7 @@ FlushLocalBuffer(BufferDesc *bufHdr, SMgrRelation reln)
 	/* Mark not-dirty */
 	TerminateLocalBufferIO(bufHdr, true, 0, false);
 
-	pgBufferUsage.local_blks_written++;
+	INSTR_BUFUSAGE_INCR(local_blks_written);
 }
 
 static Buffer
@@ -478,7 +478,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
 
 	*extended_by = extend_by;
 
-	pgBufferUsage.local_blks_written += extend_by;
+	INSTR_BUFUSAGE_ADD(local_blks_written, extend_by);
 
 	return first_block;
 }
@@ -509,7 +509,7 @@ MarkLocalBufferDirty(Buffer buffer)
 	buf_state = pg_atomic_read_u64(&bufHdr->state);
 
 	if (!(buf_state & BM_DIRTY))
-		pgBufferUsage.local_blks_dirtied++;
+		INSTR_BUFUSAGE_INCR(local_blks_dirtied);
 
 	buf_state |= BM_DIRTY;
 
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c4afe4d368a..8b501dfcadd 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -475,13 +475,13 @@ BufFileLoadBuffer(BufFile *file)
 	if (track_io_timing)
 	{
 		INSTR_TIME_SET_CURRENT(io_time);
-		INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_read_time, io_time, io_start);
+		INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_read_time, io_time, io_start);
 	}
 
 	/* we choose not to advance curOffset here */
 
 	if (file->nbytes > 0)
-		pgBufferUsage.temp_blks_read++;
+		INSTR_BUFUSAGE_INCR(temp_blks_read);
 }
 
 /*
@@ -549,13 +549,13 @@ BufFileDumpBuffer(BufFile *file)
 		if (track_io_timing)
 		{
 			INSTR_TIME_SET_CURRENT(io_time);
-			INSTR_TIME_ACCUM_DIFF(pgBufferUsage.temp_blk_write_time, io_time, io_start);
+			INSTR_BUFUSAGE_TIME_ACCUM_DIFF(temp_blk_write_time, io_time, io_start);
 		}
 
 		file->curOffset += bytestowrite;
 		wpos += bytestowrite;
 
-		pgBufferUsage.temp_blks_written++;
+		INSTR_BUFUSAGE_INCR(temp_blks_written);
 	}
 	file->dirty = false;
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 28de24538dc..9e7a88ec0d0 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -135,17 +135,17 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			{
 				pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 				if (io_object == IOOBJECT_RELATION)
-					INSTR_TIME_ADD(pgBufferUsage.shared_blk_write_time, io_time);
+					INSTR_BUFUSAGE_TIME_ADD(shared_blk_write_time, io_time);
 				else if (io_object == IOOBJECT_TEMP_RELATION)
-					INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
+					INSTR_BUFUSAGE_TIME_ADD(local_blk_write_time, io_time);
 			}
 			else if (io_op == IOOP_READ)
 			{
 				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 				if (io_object == IOOBJECT_RELATION)
-					INSTR_TIME_ADD(pgBufferUsage.shared_blk_read_time, io_time);
+					INSTR_BUFUSAGE_TIME_ADD(shared_blk_read_time, io_time);
 				else if (io_object == IOOBJECT_TEMP_RELATION)
-					INSTR_TIME_ADD(pgBufferUsage.local_blk_read_time, io_time);
+					INSTR_BUFUSAGE_TIME_ADD(local_blk_read_time, io_time);
 			}
 		}
 
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index b11d64633b5..1139be8333e 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -153,4 +153,23 @@ extern void BufferUsageAccumDiff(BufferUsage *dst,
 extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
 							  const WalUsage *sub);
 
+#define INSTR_BUFUSAGE_INCR(fld) do { \
+		pgBufferUsage.fld++; \
+	} while(0)
+#define INSTR_BUFUSAGE_ADD(fld,val) do { \
+		pgBufferUsage.fld += val; \
+	} while(0)
+#define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
+	INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+	} while (0)
+#define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
+	INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+	} while (0)
+#define INSTR_WALUSAGE_INCR(fld) do { \
+		pgWalUsage.fld++; \
+	} while(0)
+#define INSTR_WALUSAGE_ADD(fld,val) do { \
+		pgWalUsage.fld += val; \
+	} while(0)
+
 #endif							/* INSTRUMENT_H */
-- 
2.47.1



  [application/octet-stream] v7-0005-Optimize-measuring-WAL-buffer-usage-through-stack.patch (66.3K, 6-v7-0005-Optimize-measuring-WAL-buffer-usage-through-stack.patch)
  download | inline diff:
From a59c9778b9a1cdd98d0ea58f1a6a391efa84b007 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Tue, 9 Sep 2025 02:16:59 -0700
Subject: [PATCH v7 5/8] Optimize measuring WAL/buffer usage through
 stack-based instrumentation

Previously, in order to determine the buffer/WAL usage of a given code
section, we utilized continuously incrementing global counters that get
updated when the actual activity (e.g. shared block read) occurred, and
then calculated a diff when the code section ended. This resulted in a
bottleneck for executor node instrumentation specifically, with the
function BufferUsageAccumDiff showing up in profiles and in some cases
adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.

Instead, introduce a stack-based mechanism, where the actual activity
writes into the current stack entry. In the case of executor nodes, this
means that each node gets its own stack entry that is pushed at
InstrStartNode, and popped at InstrEndNode. Stack entries are zero
initialized (avoiding the diff mechanism) and get added to their parent
entry when they are finalized, i.e. no more modifications can occur.

To correctly handle abort situations, any use of instrumentation stacks
must involve either a top-level QueryInstrumentation struct, and its
associated InstrQueryStart/InstrQueryStop helpers (which use resource
owners to handle aborts), or the Instrumentation struct itself with
dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
consistent state after an abort.

This also drops the global pgBufferUsage, any callers interested in
measuring buffer activity should instead utilize InstrStart/InstrStop.

The related global pgWalUsage is kept for now due to its use in pgstat
to track aggregate WAL activity and heap_page_prune_and_freeze for
measuring FPIs.

Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
 contrib/auto_explain/auto_explain.c           |  16 +-
 .../pg_stat_statements/pg_stat_statements.c   |  87 +---
 src/backend/access/brin/brin.c                |  10 +-
 src/backend/access/gin/gininsert.c            |  10 +-
 src/backend/access/heap/vacuumlazy.c          |  15 +-
 src/backend/access/nbtree/nbtsort.c           |  10 +-
 src/backend/commands/analyze.c                |  31 +-
 src/backend/commands/explain.c                |  26 +-
 src/backend/commands/explain_dr.c             |  31 +-
 src/backend/commands/prepare.c                |  27 +-
 src/backend/commands/vacuumparallel.c         |  10 +-
 src/backend/executor/execMain.c               |  66 ++-
 src/backend/executor/execParallel.c           |   8 +-
 src/backend/executor/execProcnode.c           |  84 +++-
 src/backend/executor/instrument.c             | 389 +++++++++++++-----
 src/backend/storage/buffer/bufmgr.c           |   6 +-
 src/backend/utils/activity/pgstat_io.c        |   6 +-
 src/include/executor/execdesc.h               |   4 +-
 src/include/executor/executor.h               |   2 +
 src/include/executor/instrument.h             | 179 +++++++-
 src/include/utils/resowner.h                  |   1 +
 src/tools/pgindent/typedefs.list              |   2 +
 22 files changed, 720 insertions(+), 300 deletions(-)

diff --git a/contrib/auto_explain/auto_explain.c b/contrib/auto_explain/auto_explain.c
index 39bf2543b70..4be81489ff4 100644
--- a/contrib/auto_explain/auto_explain.c
+++ b/contrib/auto_explain/auto_explain.c
@@ -305,19 +305,9 @@ explain_ExecutorStart(QueryDesc *queryDesc, int eflags)
 
 	if (auto_explain_enabled())
 	{
-		/*
-		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
-		 * space is allocated in the per-query context so it will go away at
-		 * ExecutorEnd.
-		 */
+		/* Set up to track total elapsed time in ExecutorRun. */
 		if (queryDesc->totaltime == NULL)
-		{
-			MemoryContext oldcxt;
-
-			oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
-			queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
-			MemoryContextSwitchTo(oldcxt);
-		}
+			queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
 	}
 }
 
@@ -382,7 +372,7 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
 		oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
 
 		/* Log plan if duration is exceeded. */
-		msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
+		msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total);
 		if (msec >= auto_explain_log_min_duration)
 		{
 			ExplainState *es = NewExplainState();
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 388b068ccec..8448f9c13fa 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -909,22 +909,11 @@ pgss_planner(Query *parse,
 		&& pgss_track_planning && query_string
 		&& parse->queryId != INT64CONST(0))
 	{
-		instr_time	start;
-		instr_time	duration;
-		BufferUsage bufusage_start,
-					bufusage;
-		WalUsage	walusage_start,
-					walusage;
+		Instrumentation instr = {0};
 
-		/* We need to track buffer usage as the planner can access them. */
-		bufusage_start = pgBufferUsage;
-
-		/*
-		 * Similarly the planner could write some WAL records in some cases
-		 * (e.g. setting a hint bit with those being WAL-logged)
-		 */
-		walusage_start = pgWalUsage;
-		INSTR_TIME_SET_CURRENT(start);
+		/* Track time and buffer/WAL usage as the planner can access them. */
+		InstrInitOptions(&instr, INSTRUMENT_ALL);
+		InstrStart(&instr);
 
 		nesting_level++;
 		PG_TRY();
@@ -938,30 +927,20 @@ pgss_planner(Query *parse,
 		}
 		PG_FINALLY();
 		{
+			InstrStopFinalize(&instr);
 			nesting_level--;
 		}
 		PG_END_TRY();
 
-		INSTR_TIME_SET_CURRENT(duration);
-		INSTR_TIME_SUBTRACT(duration, start);
-
-		/* calc differences of buffer counters. */
-		memset(&bufusage, 0, sizeof(BufferUsage));
-		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
-		/* calc differences of WAL counters. */
-		memset(&walusage, 0, sizeof(WalUsage));
-		WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
 		pgss_store(query_string,
 				   parse->queryId,
 				   parse->stmt_location,
 				   parse->stmt_len,
 				   PGSS_PLAN,
-				   INSTR_TIME_GET_MILLISEC(duration),
+				   INSTR_TIME_GET_MILLISEC(instr.total),
 				   0,
-				   &bufusage,
-				   &walusage,
+				   &instr.bufusage,
+				   &instr.walusage,
 				   NULL,
 				   NULL,
 				   0,
@@ -1013,19 +992,9 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 */
 	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != INT64CONST(0))
 	{
-		/*
-		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
-		 * space is allocated in the per-query context so it will go away at
-		 * ExecutorEnd.
-		 */
+		/* Set up to track total elapsed time in ExecutorRun. */
 		if (queryDesc->totaltime == NULL)
-		{
-			MemoryContext oldcxt;
-
-			oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
-			queryDesc->totaltime = InstrAlloc(INSTRUMENT_ALL);
-			MemoryContextSwitchTo(oldcxt);
-		}
+			queryDesc->totaltime = InstrQueryAlloc(INSTRUMENT_ALL);
 	}
 }
 
@@ -1087,10 +1056,10 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 				   queryDesc->plannedstmt->stmt_location,
 				   queryDesc->plannedstmt->stmt_len,
 				   PGSS_EXEC,
-				   INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
+				   INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->instr.total),
 				   queryDesc->estate->es_total_processed,
-				   &queryDesc->totaltime->bufusage,
-				   &queryDesc->totaltime->walusage,
+				   &queryDesc->totaltime->instr.bufusage,
+				   &queryDesc->totaltime->instr.walusage,
 				   queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
 				   NULL,
 				   queryDesc->estate->es_parallel_workers_to_launch,
@@ -1154,17 +1123,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 		!IsA(parsetree, ExecuteStmt) &&
 		!IsA(parsetree, PrepareStmt))
 	{
-		instr_time	start;
-		instr_time	duration;
 		uint64		rows;
-		BufferUsage bufusage_start,
-					bufusage;
-		WalUsage	walusage_start,
-					walusage;
+		Instrumentation instr = {0};
 
-		bufusage_start = pgBufferUsage;
-		walusage_start = pgWalUsage;
-		INSTR_TIME_SET_CURRENT(start);
+		InstrInitOptions(&instr, INSTRUMENT_ALL);
+		InstrStart(&instr);
 
 		nesting_level++;
 		PG_TRY();
@@ -1180,6 +1143,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 		}
 		PG_FINALLY();
 		{
+			InstrStopFinalize(&instr);
 			nesting_level--;
 		}
 		PG_END_TRY();
@@ -1194,9 +1158,6 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 		 * former value, which'd otherwise be a good idea.
 		 */
 
-		INSTR_TIME_SET_CURRENT(duration);
-		INSTR_TIME_SUBTRACT(duration, start);
-
 		/*
 		 * Track the total number of rows retrieved or affected by the utility
 		 * statements of COPY, FETCH, CREATE TABLE AS, CREATE MATERIALIZED
@@ -1208,23 +1169,15 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 					   qc->commandTag == CMDTAG_REFRESH_MATERIALIZED_VIEW)) ?
 			qc->nprocessed : 0;
 
-		/* calc differences of buffer counters. */
-		memset(&bufusage, 0, sizeof(BufferUsage));
-		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-
-		/* calc differences of WAL counters. */
-		memset(&walusage, 0, sizeof(WalUsage));
-		WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
-
 		pgss_store(queryString,
 				   saved_queryId,
 				   saved_stmt_location,
 				   saved_stmt_len,
 				   PGSS_EXEC,
-				   INSTR_TIME_GET_MILLISEC(duration),
+				   INSTR_TIME_GET_MILLISEC(instr.total),
 				   rows,
-				   &bufusage,
-				   &walusage,
+				   &instr.bufusage,
+				   &instr.walusage,
 				   NULL,
 				   NULL,
 				   0,
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index 146ee97a47d..5ab571b29fa 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2435,8 +2435,8 @@ _brin_begin_parallel(BrinBuildState *buildstate, Relation heap, Relation index,
 	 * and PARALLEL_KEY_BUFFER_USAGE.
 	 *
 	 * If there are no extensions loaded that care, we could skip this.  We
-	 * have no way of knowing whether anyone's looking at pgWalUsage or
-	 * pgBufferUsage, so do it unconditionally.
+	 * have no way of knowing whether anyone's looking at instrumentation, so
+	 * do it unconditionally.
 	 */
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2887,6 +2887,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	Relation	indexRel;
 	LOCKMODE	heapLockmode;
 	LOCKMODE	indexLockmode;
+	QueryInstrumentation *instr;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	int			sortmem;
@@ -2936,7 +2937,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	tuplesort_attach_shared(sharedsort, seg);
 
 	/* Prepare to track buffer usage during parallel execution */
-	InstrStartParallelQuery();
+	instr = InstrStartParallelQuery();
 
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
@@ -2951,7 +2952,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
-	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+	InstrEndParallelQuery(instr,
+						  &bufferusage[ParallelWorkerNumber],
 						  &walusage[ParallelWorkerNumber]);
 
 	index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/gin/gininsert.c b/src/backend/access/gin/gininsert.c
index c7e38dbe193..bb91cc600eb 100644
--- a/src/backend/access/gin/gininsert.c
+++ b/src/backend/access/gin/gininsert.c
@@ -984,8 +984,8 @@ _gin_begin_parallel(GinBuildState *buildstate, Relation heap, Relation index,
 	 * and PARALLEL_KEY_BUFFER_USAGE.
 	 *
 	 * If there are no extensions loaded that care, we could skip this.  We
-	 * have no way of knowing whether anyone's looking at pgWalUsage or
-	 * pgBufferUsage, so do it unconditionally.
+	 * have no way of knowing whether anyone's looking at instrumentation, so
+	 * do it unconditionally.
 	 */
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -2110,6 +2110,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	Relation	indexRel;
 	LOCKMODE	heapLockmode;
 	LOCKMODE	indexLockmode;
+	QueryInstrumentation *instr;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	int			sortmem;
@@ -2178,7 +2179,7 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	tuplesort_attach_shared(sharedsort, seg);
 
 	/* Prepare to track buffer usage during parallel execution */
-	InstrStartParallelQuery();
+	instr = InstrStartParallelQuery();
 
 	/*
 	 * Might as well use reliable figure when doling out maintenance_work_mem
@@ -2193,7 +2194,8 @@ _gin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
-	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+	InstrEndParallelQuery(instr,
+						  &bufferusage[ParallelWorkerNumber],
 						  &walusage[ParallelWorkerNumber]);
 
 	index_close(indexRel, indexLockmode);
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 82c5b28e0ad..b4cbd0e682c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -641,8 +641,7 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
 	TimestampTz starttime = 0;
 	PgStat_Counter startreadtime = 0,
 				startwritetime = 0;
-	WalUsage	startwalusage = pgWalUsage;
-	BufferUsage startbufferusage = pgBufferUsage;
+	QueryInstrumentation *instr = NULL;
 	ErrorContextCallback errcallback;
 	char	  **indnames = NULL;
 	Size		dead_items_max_bytes = 0;
@@ -658,6 +657,8 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
 			startreadtime = pgStatBlockReadTime;
 			startwritetime = pgStatBlockWriteTime;
 		}
+		instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+		InstrQueryStart(instr);
 	}
 
 	/* Used for instrumentation and stats report */
@@ -983,14 +984,14 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
 
+		instr = InstrQueryStopFinalize(instr);
+
 		if (verbose || params.log_vacuum_min_duration == 0 ||
 			TimestampDifferenceExceeds(starttime, endtime,
 									   params.log_vacuum_min_duration))
 		{
 			long		secs_dur;
 			int			usecs_dur;
-			WalUsage	walusage;
-			BufferUsage bufferusage;
 			StringInfoData buf;
 			char	   *msgfmt;
 			int32		diff;
@@ -999,12 +1000,10 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
 			int64		total_blks_hit;
 			int64		total_blks_read;
 			int64		total_blks_dirtied;
+			BufferUsage bufferusage = instr->instr.bufusage;
+			WalUsage	walusage = instr->instr.walusage;
 
 			TimestampDifference(starttime, endtime, &secs_dur, &usecs_dur);
-			memset(&walusage, 0, sizeof(WalUsage));
-			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-			memset(&bufferusage, 0, sizeof(BufferUsage));
-			BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
 
 			total_blks_hit = bufferusage.shared_blks_hit +
 				bufferusage.local_blks_hit;
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 69ef1527e06..dfe4fd9459c 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1465,8 +1465,8 @@ _bt_begin_parallel(BTBuildState *buildstate, bool isconcurrent, int request)
 	 * and PARALLEL_KEY_BUFFER_USAGE.
 	 *
 	 * If there are no extensions loaded that care, we could skip this.  We
-	 * have no way of knowing whether anyone's looking at pgWalUsage or
-	 * pgBufferUsage, so do it unconditionally.
+	 * have no way of knowing whether anyone's looking at instrumentation, so
+	 * do it unconditionally.
 	 */
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(sizeof(WalUsage), pcxt->nworkers));
@@ -1752,6 +1752,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	Relation	indexRel;
 	LOCKMODE	heapLockmode;
 	LOCKMODE	indexLockmode;
+	QueryInstrumentation *instr;
 	WalUsage   *walusage;
 	BufferUsage *bufferusage;
 	int			sortmem;
@@ -1827,7 +1828,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	}
 
 	/* Prepare to track buffer usage during parallel execution */
-	InstrStartParallelQuery();
+	instr = InstrStartParallelQuery();
 
 	/* Perform sorting of spool, and possibly a spool2 */
 	sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1837,7 +1838,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
 	/* Report WAL/buffer usage during parallel execution */
 	bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
-	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+	InstrEndParallelQuery(instr,
+						  &bufferusage[ParallelWorkerNumber],
 						  &walusage[ParallelWorkerNumber]);
 
 #ifdef BTREE_BUILD_STATS
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 53adac9139b..38f8b379fa4 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -308,9 +308,7 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 	Oid			save_userid;
 	int			save_sec_context;
 	int			save_nestlevel;
-	WalUsage	startwalusage = pgWalUsage;
-	BufferUsage startbufferusage = pgBufferUsage;
-	BufferUsage bufferusage;
+	QueryInstrumentation *instr = NULL;
 	PgStat_Counter startreadtime = 0;
 	PgStat_Counter startwritetime = 0;
 
@@ -361,6 +359,9 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 		}
 
 		pg_rusage_init(&ru0);
+
+		instr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+		InstrQueryStart(instr);
 	}
 
 	/* Used for instrumentation and stats report */
@@ -741,12 +742,13 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 	{
 		TimestampTz endtime = GetCurrentTimestamp();
 
+		instr = InstrQueryStopFinalize(instr);
+
 		if (verbose || params.log_analyze_min_duration == 0 ||
 			TimestampDifferenceExceeds(starttime, endtime,
 									   params.log_analyze_min_duration))
 		{
 			long		delay_in_ms;
-			WalUsage	walusage;
 			double		read_rate = 0;
 			double		write_rate = 0;
 			char	   *msgfmt;
@@ -754,18 +756,15 @@ do_analyze_rel(Relation onerel, const VacuumParams params,
 			int64		total_blks_hit;
 			int64		total_blks_read;
 			int64		total_blks_dirtied;
-
-			memset(&bufferusage, 0, sizeof(BufferUsage));
-			BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
-			memset(&walusage, 0, sizeof(WalUsage));
-			WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
-
-			total_blks_hit = bufferusage.shared_blks_hit +
-				bufferusage.local_blks_hit;
-			total_blks_read = bufferusage.shared_blks_read +
-				bufferusage.local_blks_read;
-			total_blks_dirtied = bufferusage.shared_blks_dirtied +
-				bufferusage.local_blks_dirtied;
+			BufferUsage bufusage = instr->instr.bufusage;
+			WalUsage	walusage = instr->instr.walusage;
+
+			total_blks_hit = bufusage.shared_blks_hit +
+				bufusage.local_blks_hit;
+			total_blks_read = bufusage.shared_blks_read +
+				bufusage.local_blks_read;
+			total_blks_dirtied = bufusage.shared_blks_dirtied +
+				bufusage.local_blks_dirtied;
 
 			/*
 			 * We do not expect an analyze to take > 25 days and it simplifies
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 389181b8d9b..aa76f68bd10 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -322,14 +322,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
 						 QueryEnvironment *queryEnv)
 {
 	PlannedStmt *plan;
-	instr_time	planstart,
-				planduration;
-	BufferUsage bufusage_start,
-				bufusage;
+	QueryInstrumentation *instr = NULL;
 	MemoryContextCounters mem_counters;
 	MemoryContext planner_ctx = NULL;
 	MemoryContext saved_ctx = NULL;
 
+	if (es->buffers)
+		instr = InstrQueryAlloc(INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
+	else
+		instr = InstrQueryAlloc(INSTRUMENT_TIMER);
+
 	if (es->memory)
 	{
 		/*
@@ -346,15 +348,12 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
 		saved_ctx = MemoryContextSwitchTo(planner_ctx);
 	}
 
-	if (es->buffers)
-		bufusage_start = pgBufferUsage;
-	INSTR_TIME_SET_CURRENT(planstart);
+	InstrQueryStart(instr);
 
 	/* plan the query */
 	plan = pg_plan_query(query, queryString, cursorOptions, params, es);
 
-	INSTR_TIME_SET_CURRENT(planduration);
-	INSTR_TIME_SUBTRACT(planduration, planstart);
+	instr = InstrQueryStopFinalize(instr);
 
 	if (es->memory)
 	{
@@ -362,16 +361,9 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
 		MemoryContextMemConsumed(planner_ctx, &mem_counters);
 	}
 
-	/* calc differences of buffer counters. */
-	if (es->buffers)
-	{
-		memset(&bufusage, 0, sizeof(BufferUsage));
-		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-	}
-
 	/* run it (if needed) and produce output */
 	ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
-				   &planduration, (es->buffers ? &bufusage : NULL),
+				   &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
 				   es->memory ? &mem_counters : NULL);
 }
 
diff --git a/src/backend/commands/explain_dr.c b/src/backend/commands/explain_dr.c
index 3c96061cf32..6868d8972ac 100644
--- a/src/backend/commands/explain_dr.c
+++ b/src/backend/commands/explain_dr.c
@@ -110,15 +110,20 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
 	MemoryContext oldcontext;
 	StringInfo	buf = &myState->buf;
 	int			natts = typeinfo->natts;
-	instr_time	start,
-				end;
-	BufferUsage instr_start;
+	QueryInstrumentation *instr = NULL;
 
 	/* only measure time, buffers if requested */
-	if (myState->es->timing)
-		INSTR_TIME_SET_CURRENT(start);
-	if (myState->es->buffers)
-		instr_start = pgBufferUsage;
+	if (myState->es->timing || myState->es->buffers)
+	{
+		InstrumentOption instrument_options = 0;
+
+		if (myState->es->timing)
+			instrument_options |= INSTRUMENT_TIMER;
+		if (myState->es->buffers)
+			instrument_options |= INSTRUMENT_BUFFERS;
+		instr = InstrQueryAlloc(instrument_options);
+		InstrQueryStart(instr);
+	}
 
 	/* Set or update my derived attribute info, if needed */
 	if (myState->attrinfo != typeinfo || myState->nattrs != natts)
@@ -186,18 +191,16 @@ serializeAnalyzeReceive(TupleTableSlot *slot, DestReceiver *self)
 	MemoryContextSwitchTo(oldcontext);
 	MemoryContextReset(myState->tmpcontext);
 
+	if (myState->es->timing || myState->es->buffers)
+		instr = InstrQueryStopFinalize(instr);
+
 	/* Update timing data */
 	if (myState->es->timing)
-	{
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(myState->metrics.timeSpent, end, start);
-	}
+		INSTR_TIME_ADD(myState->metrics.timeSpent, instr->instr.total);
 
 	/* Update buffer metrics */
 	if (myState->es->buffers)
-		BufferUsageAccumDiff(&myState->metrics.bufferUsage,
-							 &pgBufferUsage,
-							 &instr_start);
+		BufferUsageAdd(&myState->metrics.bufferUsage, &instr->instr.bufusage);
 
 	return true;
 }
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 5b86a727587..d81f6b30e9c 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -578,13 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	ListCell   *p;
 	ParamListInfo paramLI = NULL;
 	EState	   *estate = NULL;
-	instr_time	planstart;
-	instr_time	planduration;
-	BufferUsage bufusage_start,
-				bufusage;
+	QueryInstrumentation *instr = NULL;
 	MemoryContextCounters mem_counters;
 	MemoryContext planner_ctx = NULL;
 	MemoryContext saved_ctx = NULL;
+	int			instrument_options = INSTRUMENT_TIMER;
+
+	if (es->buffers)
+		instrument_options |= INSTRUMENT_BUFFERS;
+
+	instr = InstrQueryAlloc(instrument_options);
 
 	if (es->memory)
 	{
@@ -596,9 +599,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 		saved_ctx = MemoryContextSwitchTo(planner_ctx);
 	}
 
-	if (es->buffers)
-		bufusage_start = pgBufferUsage;
-	INSTR_TIME_SET_CURRENT(planstart);
+	InstrQueryStart(instr);
 
 	/* Look it up in the hash table */
 	entry = FetchPreparedStatement(execstmt->name, true);
@@ -633,8 +634,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 	cplan = GetCachedPlan(entry->plansource, paramLI,
 						  CurrentResourceOwner, pstate->p_queryEnv);
 
-	INSTR_TIME_SET_CURRENT(planduration);
-	INSTR_TIME_SUBTRACT(planduration, planstart);
+	instr = InstrQueryStopFinalize(instr);
 
 	if (es->memory)
 	{
@@ -642,13 +642,6 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 		MemoryContextMemConsumed(planner_ctx, &mem_counters);
 	}
 
-	/* calc differences of buffer counters. */
-	if (es->buffers)
-	{
-		memset(&bufusage, 0, sizeof(BufferUsage));
-		BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
-	}
-
 	plan_list = cplan->stmt_list;
 
 	/* Explain each query */
@@ -658,7 +651,7 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
 
 		if (pstmt->commandType != CMD_UTILITY)
 			ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
-						   &planduration, (es->buffers ? &bufusage : NULL),
+						   &instr->instr.total, (es->buffers ? &instr->instr.bufusage : NULL),
 						   es->memory ? &mem_counters : NULL);
 		else
 			ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index 279108ca89f..75074fe4efa 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -308,8 +308,8 @@ parallel_vacuum_init(Relation rel, Relation *indrels, int nindexes,
 	 * PARALLEL_VACUUM_KEY_BUFFER_USAGE and PARALLEL_VACUUM_KEY_WAL_USAGE.
 	 *
 	 * If there are no extensions loaded that care, we could skip this.  We
-	 * have no way of knowing whether anyone's looking at pgBufferUsage or
-	 * pgWalUsage, so do it unconditionally.
+	 * have no way of knowing whether anyone's looking at instrumentation, so
+	 * do it unconditionally.
 	 */
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -995,6 +995,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	PVIndStats *indstats;
 	PVShared   *shared;
 	TidStore   *dead_items;
+	QueryInstrumentation *instr;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
 	int			nindexes;
@@ -1084,7 +1085,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	error_context_stack = &errcallback;
 
 	/* Prepare to track buffer usage during parallel execution */
-	InstrStartParallelQuery();
+	instr = InstrStartParallelQuery();
 
 	/* Process indexes to perform vacuum/cleanup */
 	parallel_vacuum_process_safe_indexes(&pvs);
@@ -1092,7 +1093,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
 	/* Report buffer/WAL usage during parallel execution */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
 	wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
-	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+	InstrEndParallelQuery(instr,
+						  &buffer_usage[ParallelWorkerNumber],
 						  &wal_usage[ParallelWorkerNumber]);
 
 	/* Report any remaining cost-based vacuum delay time */
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c0b174cfbc0..82253317e96 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -76,6 +76,7 @@ ExecutorCheckPerms_hook_type ExecutorCheckPerms_hook = NULL;
 /* decls for local routines only used within this module */
 static void InitPlan(QueryDesc *queryDesc, int eflags);
 static void CheckValidRowMarkRel(Relation rel, RowMarkType markType);
+static void ExecFinalizeTriggerInstrumentation(EState *estate);
 static void ExecPostprocessPlan(EState *estate);
 static void ExecEndPlan(PlanState *planstate, EState *estate);
 static void ExecutePlan(QueryDesc *queryDesc,
@@ -329,9 +330,28 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 	 */
 	oldcontext = MemoryContextSwitchTo(estate->es_query_cxt);
 
-	/* Allow instrumentation of Executor overall runtime */
+	/*
+	 * Start up required top-level instrumentation stack for WAL/buffer
+	 * tracking
+	 */
+	if (!queryDesc->totaltime && (estate->es_instrument & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)))
+		queryDesc->totaltime = InstrQueryAlloc(estate->es_instrument);
+
 	if (queryDesc->totaltime)
-		InstrStart(queryDesc->totaltime);
+	{
+		/* Allow instrumentation of Executor overall runtime */
+		InstrQueryStart(queryDesc->totaltime);
+
+		/*
+		 * Remember all node entries for abort recovery. We do this once here
+		 * after the first call to InstrQueryStart has pushed the parent
+		 * entry.
+		 */
+		if ((estate->es_instrument & (INSTRUMENT_BUFFERS | INSTRUMENT_WAL)) &&
+			!queryDesc->already_executed)
+			ExecRememberNodeInstrumentation(queryDesc->planstate,
+											queryDesc->totaltime);
+	}
 
 	/*
 	 * extract information from the query descriptor and the query feature.
@@ -383,7 +403,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
 		dest->rShutdown(dest);
 
 	if (queryDesc->totaltime)
-		InstrStop(queryDesc->totaltime);
+		InstrQueryStop(queryDesc->totaltime);
 
 	MemoryContextSwitchTo(oldcontext);
 }
@@ -433,7 +453,7 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
 
 	/* Allow instrumentation of Executor overall runtime */
 	if (queryDesc->totaltime)
-		InstrStart(queryDesc->totaltime);
+		InstrQueryStart(queryDesc->totaltime);
 
 	/* Run ModifyTable nodes to completion */
 	ExecPostprocessPlan(estate);
@@ -442,8 +462,26 @@ standard_ExecutorFinish(QueryDesc *queryDesc)
 	if (!(estate->es_top_eflags & EXEC_FLAG_SKIP_TRIGGERS))
 		AfterTriggerEndQuery(estate);
 
+	/*
+	 * Accumulate per-node and trigger statistics to their respective parent
+	 * instrumentation stacks.
+	 *
+	 * We skip this in parallel workers because their per-node stats are
+	 * reported individually via ExecParallelReportInstrumentation, and the
+	 * leader's own ExecFinalizeNodeInstrumentation handles propagation.  If
+	 * we accumulated here, the leader would double-count: worker parent nodes
+	 * would already include their children's stats, and then the leader's
+	 * accumulation would add the children again.
+	 */
+	if (queryDesc->totaltime && estate->es_instrument && !IsParallelWorker())
+	{
+		ExecFinalizeNodeInstrumentation(queryDesc->planstate);
+
+		ExecFinalizeTriggerInstrumentation(estate);
+	}
+
 	if (queryDesc->totaltime)
-		InstrStop(queryDesc->totaltime);
+		queryDesc->totaltime = InstrQueryStopFinalize(queryDesc->totaltime);
 
 	MemoryContextSwitchTo(oldcontext);
 
@@ -1484,6 +1522,24 @@ ExecGetAncestorResultRels(EState *estate, ResultRelInfo *resultRelInfo)
 	return resultRelInfo->ri_ancestorResultRels;
 }
 
+static void
+ExecFinalizeTriggerInstrumentation(EState *estate)
+{
+	List	   *rels = NIL;
+
+	rels = list_concat(rels, estate->es_tuple_routing_result_relations);
+	rels = list_concat(rels, estate->es_opened_result_relations);
+	rels = list_concat(rels, estate->es_trig_target_relations);
+
+	foreach_node(ResultRelInfo, rInfo, rels)
+	{
+		TriggerInstrumentation *ti = rInfo->ri_TrigInstrument;
+
+		if (ti && (ti->instr.need_bufusage || ti->instr.need_walusage))
+			InstrAccum(instr_stack.current, &ti->instr);
+	}
+}
+
 /* ----------------------------------------------------------------
  *		ExecPostprocessPlan
  *
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index c153d5c1c3b..e6ad86cb887 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -694,7 +694,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
 	 *
 	 * If EXPLAIN is not in use and there are no extensions loaded that care,
 	 * we could skip this.  But we have no way of knowing whether anyone's
-	 * looking at pgBufferUsage, so do it unconditionally.
+	 * looking at instrumentation, so do it unconditionally.
 	 */
 	shm_toc_estimate_chunk(&pcxt->estimator,
 						   mul_size(sizeof(BufferUsage), pcxt->nworkers));
@@ -1456,6 +1456,7 @@ void
 ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 {
 	FixedParallelExecutorState *fpes;
+	QueryInstrumentation *instr;
 	BufferUsage *buffer_usage;
 	WalUsage   *wal_usage;
 	DestReceiver *receiver;
@@ -1516,7 +1517,7 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	 * leader, which also doesn't count buffer accesses and WAL activity that
 	 * occur during executor startup.
 	 */
-	InstrStartParallelQuery();
+	instr = InstrStartParallelQuery();
 
 	/*
 	 * Run the plan.  If we specified a tuple bound, be careful not to demand
@@ -1532,7 +1533,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
 	/* Report buffer/WAL usage during parallel execution. */
 	buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
 	wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
-	InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
+	InstrEndParallelQuery(instr,
+						  &buffer_usage[ParallelWorkerNumber],
 						  &wal_usage[ParallelWorkerNumber]);
 
 	/* Report instrumentation data if any instrumentation options are set. */
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 1846661b503..c788b5b00f9 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -122,6 +122,8 @@
 static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
 static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
 static bool ExecShutdownNode_walker(PlanState *node, void *context);
+static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
+static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
 
 
 /* ------------------------------------------------------------------------
@@ -787,10 +789,10 @@ ExecShutdownNode_walker(PlanState *node, void *context)
 	 * at least once already.  We don't expect much CPU consumption during
 	 * node shutdown, but in the case of Gather or Gather Merge, we may shut
 	 * down workers at this stage.  If so, their buffer usage will get
-	 * propagated into pgBufferUsage at this point, and we want to make sure
-	 * that it gets associated with the Gather node.  We skip this if the node
-	 * has never been executed, so as to avoid incorrectly making it appear
-	 * that it has.
+	 * propagated into the current instrumentation stack entry at this point,
+	 * and we want to make sure that it gets associated with the Gather node.
+	 * We skip this if the node has never been executed, so as to avoid
+	 * incorrectly making it appear that it has.
 	 */
 	if (node->instrument && node->instrument->running)
 		InstrStartNode(node->instrument);
@@ -828,6 +830,80 @@ ExecShutdownNode_walker(PlanState *node, void *context)
 	return false;
 }
 
+/*
+ * ExecRememberNodeInstrumentation
+ *
+ * Register all per-node instrumentation entries as unfinalized children of
+ * the executor's instrumentation. This is needed for abort recovery: if the
+ * executor aborts, we need to walk each per-node entry to recover buffer/WAL
+ * data from nodes that never got finalized, that someone might be interested
+ * in as an aggregate.
+ */
+void
+ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent)
+{
+	(void) ExecRememberNodeInstrumentation_walker(node, parent);
+}
+
+static bool
+ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
+{
+	QueryInstrumentation *parent = (QueryInstrumentation *) context;
+
+	Assert(parent != NULL);
+
+	if (node == NULL)
+		return false;
+
+	if (node->instrument)
+		InstrQueryRememberNode(parent, node->instrument);
+
+	return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
+}
+
+/*
+ * ExecFinalizeNodeInstrumentation
+ *
+ * Accumulate instrumentation stats from all execution nodes to their respective
+ * parents (or the original parent instrumentation).
+ *
+ * This must run after the cleanup done by ExecShutdownNode, and not rely on any
+ * resources cleaned up by it. We also expect shutdown actions to have occurred,
+ * e.g. parallel worker instrumentation to have been added to the leader.
+ */
+void
+ExecFinalizeNodeInstrumentation(PlanState *node)
+{
+	(void) ExecFinalizeNodeInstrumentation_walker(node, instr_stack.current);
+}
+
+static bool
+ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
+{
+	Instrumentation *parent = (Instrumentation *) context;
+
+	Assert(parent != NULL);
+
+	if (node == NULL)
+		return false;
+
+	/*
+	 * Recurse into children first (bottom-up accumulation), passing our
+	 * instrumentation as the parent context.  This ensures children can
+	 * accumulate to us even if they were never executed by the leader (e.g.
+	 * nodes beneath Gather that only workers ran).
+	 */
+	planstate_tree_walker(node, ExecFinalizeNodeInstrumentation_walker,
+						  node->instrument ? &node->instrument->instr : parent);
+
+	if (!node->instrument)
+		return false;
+
+	node->instrument = InstrFinalizeNode(node->instrument, parent);
+
+	return false;
+}
+
 /*
  * ExecSetTupleBound
  *
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 6a4a08ebb0c..1afa5e94960 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -16,25 +16,31 @@
 #include <unistd.h>
 
 #include "executor/instrument.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
 
-BufferUsage pgBufferUsage;
-static BufferUsage save_pgBufferUsage;
 WalUsage	pgWalUsage;
-static WalUsage save_pgWalUsage;
+Instrumentation instr_top;
+InstrStackState instr_stack = {0, 0, NULL, &instr_top};
 
-static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
-static void WalUsageAdd(WalUsage *dst, WalUsage *add);
-
-
-/* General purpose instrumentation handling */
-Instrumentation *
-InstrAlloc(int instrument_options)
+void
+InstrStackGrow(void)
 {
-	Instrumentation *instr = palloc0(sizeof(Instrumentation));
-	InstrInitOptions(instr, instrument_options);
-	return instr;
+	if (instr_stack.entries == NULL)
+	{
+		instr_stack.stack_space = 10;	/* Allocate sufficient initial space
+										 * for typical activity */
+		instr_stack.entries = MemoryContextAlloc(TopMemoryContext,
+												 sizeof(Instrumentation *) * instr_stack.stack_space);
+	}
+	else
+	{
+		instr_stack.stack_space *= 2;
+		instr_stack.entries = repalloc_array(instr_stack.entries, Instrumentation *, instr_stack.stack_space);
+	}
 }
 
+/* General purpose instrumentation handling */
 void
 InstrInitOptions(Instrumentation *instr, int instrument_options)
 {
@@ -54,38 +60,249 @@ InstrStart(Instrumentation *instr)
 			INSTR_TIME_SET_CURRENT(instr->starttime);
 	}
 
-	if (instr->need_bufusage)
-		instr->bufusage_start = pgBufferUsage;
+	if (instr->need_bufusage || instr->need_walusage)
+		InstrPushStack(instr);
+}
+
+static void
+InstrStopTimer(Instrumentation *instr)
+{
+	instr_time	endtime;
+
+	/* let's update the time only if the timer was requested */
+	if (INSTR_TIME_IS_ZERO(instr->starttime))
+		elog(ERROR, "InstrStop called without start");
+
+	INSTR_TIME_SET_CURRENT(endtime);
+	INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
 
-	if (instr->need_walusage)
-		instr->walusage_start = pgWalUsage;
+	INSTR_TIME_SET_ZERO(instr->starttime);
 }
 
 void
 InstrStop(Instrumentation *instr)
 {
-	instr_time	endtime;
+	if (instr->need_timer)
+		InstrStopTimer(instr);
+
+	if (instr->need_bufusage || instr->need_walusage)
+		InstrPopStack(instr);
+}
+
+/*
+ * Stops instrumentation, finalizes the stack entry and accumulates to its parent.
+ *
+ * Note that this intentionally allows passing a stack that is not the current
+ * top, as can happen with PG_FINALLY, or resource owners, which don't have a
+ * guaranteed cleanup order.
+ *
+ * We are careful here to achieve two goals:
+ *
+ * 1) Reset the stack to the parent of whichever of the released stack entries
+ *    has the lowest index
+ * 2) Accumulate all instrumentation to the currently active instrumentation,
+ *    so that callers get a complete picture of activity, even after an abort
+ */
+void
+InstrStopFinalize(Instrumentation *instr)
+{
+	int			idx = -1;
+
+	for (int i = instr_stack.stack_size - 1; i >= 0; i--)
+	{
+		if (instr_stack.entries[i] == instr)
+		{
+			idx = i;
+			break;
+		}
+	}
+
+	if (idx >= 0)
+	{
+		while (instr_stack.stack_size > idx + 1)
+			instr_stack.stack_size--;
+
+		InstrPopStack(instr);
+	}
 
-	/* let's update the time only if the timer was requested */
 	if (instr->need_timer)
+		InstrStopTimer(instr);
+
+	InstrAccum(instr_stack.current, instr);
+}
+
+
+/* Query instrumentation handling */
+
+/*
+ * Use ResourceOwner mechanism to correctly reset instr_stack on abort.
+ */
+static void ResOwnerReleaseInstrumentation(Datum res);
+static const ResourceOwnerDesc instrumentation_resowner_desc =
+{
+	.name = "instrumentation",
+	.release_phase = RESOURCE_RELEASE_AFTER_LOCKS,
+	.release_priority = RELEASE_PRIO_INSTRUMENTATION,
+	.ReleaseResource = ResOwnerReleaseInstrumentation,
+	.DebugPrint = NULL,			/* default message is fine */
+};
+
+static inline void
+ResourceOwnerRememberInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+	ResourceOwnerRemember(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static inline void
+ResourceOwnerForgetInstrumentation(ResourceOwner owner, QueryInstrumentation *qinstr)
+{
+	ResourceOwnerForget(owner, PointerGetDatum(qinstr), &instrumentation_resowner_desc);
+}
+
+static void
+ResOwnerReleaseInstrumentation(Datum res)
+{
+	QueryInstrumentation *qinstr = (QueryInstrumentation *) DatumGetPointer(res);
+	slist_mutable_iter iter;
+
+	/* Accumulate data from all unfinalized child node entries. */
+	slist_foreach_modify(iter, &qinstr->unfinalized_children)
 	{
-		if (INSTR_TIME_IS_ZERO(instr->starttime))
-			elog(ERROR, "InstrStop called without start");
+		NodeInstrumentation *child = slist_container(NodeInstrumentation, unfinalized_node, iter.cur);
 
-		INSTR_TIME_SET_CURRENT(endtime);
-		INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
+		InstrAccum(&qinstr->instr, &child->instr);
 
-		INSTR_TIME_SET_ZERO(instr->starttime);
+		/*
+		 * Free NodeInstrumentation now, since InstrFinalizeNode won't be
+		 * called
+		 */
+		pfree(child);
 	}
 
-	/* Add delta of buffer usage since entry to node's totals */
-	if (instr->need_bufusage)
-		BufferUsageAccumDiff(&instr->bufusage,
-							 &pgBufferUsage, &instr->bufusage_start);
+	/* Ensure the stack is reset as expected, and we accumulate to the parent */
+	InstrStopFinalize(&qinstr->instr);
+
+	/* Free QueryInstrumentation now, since InstrStop won't be called */
+	pfree(qinstr);
+}
+
+/*
+ * Allocate in TopMemoryContext so that the Instrumentation survives
+ * transaction abort — ResourceOwner release needs to access it.
+ */
+QueryInstrumentation *
+InstrQueryAlloc(int instrument_options)
+{
+	QueryInstrumentation *instr = MemoryContextAllocZero(TopMemoryContext, sizeof(QueryInstrumentation));
+
+	InstrInitOptions(&instr->instr, instrument_options);
+	return instr;
+}
+
+void
+InstrQueryStart(QueryInstrumentation *qinstr)
+{
+	InstrStart(&qinstr->instr);
 
-	if (instr->need_walusage)
-		WalUsageAccumDiff(&instr->walusage,
-						  &pgWalUsage, &instr->walusage_start);
+	if (qinstr->instr.need_bufusage || qinstr->instr.need_walusage)
+	{
+		Assert(CurrentResourceOwner != NULL);
+		qinstr->owner = CurrentResourceOwner;
+
+		ResourceOwnerEnlarge(qinstr->owner);
+		ResourceOwnerRememberInstrumentation(qinstr->owner, qinstr);
+	}
+}
+
+void
+InstrQueryStop(QueryInstrumentation *qinstr)
+{
+	InstrStop(&qinstr->instr);
+
+	if (qinstr->instr.need_bufusage || qinstr->instr.need_walusage)
+	{
+		Assert(qinstr->owner != NULL);
+		ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+		qinstr->owner = NULL;
+	}
+}
+
+QueryInstrumentation *
+InstrQueryStopFinalize(QueryInstrumentation *qinstr)
+{
+	QueryInstrumentation *copy;
+
+	InstrStopFinalize(&qinstr->instr);
+
+	if (qinstr->instr.need_bufusage || qinstr->instr.need_walusage)
+	{
+		Assert(qinstr->owner != NULL);
+		ResourceOwnerForgetInstrumentation(qinstr->owner, qinstr);
+		qinstr->owner = NULL;
+	}
+
+	/*
+	 * Copy to the current memory context so the caller doesn't need to
+	 * explicitly free the TopMemoryContext allocation.
+	 */
+	copy = palloc(sizeof(QueryInstrumentation));
+	memcpy(copy, qinstr, sizeof(QueryInstrumentation));
+	pfree(qinstr);
+	return copy;
+}
+
+/*
+ * Register a child NodeInstrumentation entry for abort processing.
+ *
+ * On abort, ResOwnerReleaseInstrumentation will walk the parent's list to
+ * recover buffer/WAL data from entries that were never finalized, in order for
+ * aggregate totals to be accurate despite the query erroring out.
+ */
+void
+InstrQueryRememberNode(QueryInstrumentation *parent, NodeInstrumentation *child)
+{
+	if (child->instr.need_bufusage || child->instr.need_walusage)
+		slist_push_head(&parent->unfinalized_children, &child->unfinalized_node);
+}
+
+/* start instrumentation during parallel executor startup */
+QueryInstrumentation *
+InstrStartParallelQuery(void)
+{
+	QueryInstrumentation *qinstr = InstrQueryAlloc(INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
+
+	InstrQueryStart(qinstr);
+	return qinstr;
+}
+
+/* report usage after parallel executor shutdown */
+void
+InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage)
+{
+	qinstr = InstrQueryStopFinalize(qinstr);
+	memcpy(bufusage, &qinstr->instr.bufusage, sizeof(BufferUsage));
+	memcpy(walusage, &qinstr->instr.walusage, sizeof(WalUsage));
+}
+
+/*
+ * Accumulate work done by parallel workers in the leader's stats.
+ *
+ * Note that what gets added here effectively depends on whether per-node
+ * instrumentation is active. If it's active the parallel worker intentionally
+ * skips ExecFinalizeNodeInstrumentation on executor shutdown, because it would
+ * cause double counting. Instead, this only accumulates any extra activity
+ * outside of nodes.
+ *
+ * Otherwise this is responsible for making sure that the complete query
+ * activity is accumulated.
+ */
+void
+InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+{
+	BufferUsageAdd(&instr_stack.current->bufusage, bufusage);
+	WalUsageAdd(&instr_stack.current->walusage, walusage);
+
+	WalUsageAdd(&pgWalUsage, walusage);
 }
 
 /* Node instrumentation handling */
@@ -94,7 +311,13 @@ InstrStop(Instrumentation *instr)
 NodeInstrumentation *
 InstrAllocNode(int instrument_options, bool async_mode)
 {
-	NodeInstrumentation *instr = palloc(sizeof(NodeInstrumentation));
+	/*
+	 * We can utilize TopTransactionContext instead of TopMemoryContext here
+	 * because nodes don't get used for utility commands that restart
+	 * transactions, which would require a context that survives longer
+	 * (EXPLAIN ANALYZE is fine).
+	 */
+	NodeInstrumentation *instr = MemoryContextAlloc(TopTransactionContext, sizeof(NodeInstrumentation));
 
 	InstrInitNode(instr, instrument_options);
 	instr->async_mode = async_mode;
@@ -117,6 +340,7 @@ InstrStartNode(NodeInstrumentation *instr)
 	InstrStart(&instr->instr);
 }
 
+
 /* Exit from a plan node */
 void
 InstrStopNode(NodeInstrumentation *instr, double nTuples)
@@ -146,14 +370,12 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
 		INSTR_TIME_SET_ZERO(instr->instr.starttime);
 	}
 
-	/* Add delta of buffer usage since entry to node's totals */
-	if (instr->instr.need_bufusage)
-		BufferUsageAccumDiff(&instr->instr.bufusage,
-							 &pgBufferUsage, &instr->instr.bufusage_start);
-
-	if (instr->instr.need_walusage)
-		WalUsageAccumDiff(&instr->instr.walusage,
-						  &pgWalUsage, &instr->instr.walusage_start);
+	/*
+	 * Only pop the stack, accumulation runs in
+	 * ExecFinalizeNodeInstrumentation
+	 */
+	if (instr->instr.need_bufusage || instr->instr.need_walusage)
+		InstrPopStack(&instr->instr);
 
 	/* Is this the first tuple of this cycle? */
 	if (!instr->running)
@@ -172,6 +394,22 @@ InstrStopNode(NodeInstrumentation *instr, double nTuples)
 	}
 }
 
+/* Add per-node instrumentation to the parent and move into per-query memory context */
+NodeInstrumentation *
+InstrFinalizeNode(NodeInstrumentation *instr, Instrumentation *parent)
+{
+	NodeInstrumentation *dst = palloc(sizeof(NodeInstrumentation));
+
+	memcpy(dst, instr, sizeof(NodeInstrumentation));
+	pfree(instr);
+
+	/* Accumulate node's buffer/WAL usage to the parent */
+	if (dst->instr.need_bufusage || dst->instr.need_walusage)
+		InstrAccum(parent, &dst->instr);
+
+	return dst;
+}
+
 /* Update tuple count */
 void
 InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples)
@@ -188,8 +426,8 @@ InstrEndLoop(NodeInstrumentation *instr)
 	if (!instr->running)
 		return;
 
-	if (!INSTR_TIME_IS_ZERO(instr->instr.starttime))
-		elog(ERROR, "InstrEndLoop called on running node");
+	/* Ensure InstrNodeStop was called */
+	Assert(INSTR_TIME_IS_ZERO(instr->instr.starttime));
 
 	/* Accumulate per-cycle statistics into totals */
 	INSTR_TIME_ADD(instr->startup, instr->firsttuple);
@@ -259,38 +497,27 @@ InstrStartTrigger(TriggerInstrumentation *tginstr)
 void
 InstrStopTrigger(TriggerInstrumentation *tginstr, int firings)
 {
+	/*
+	 * This trigger may be called again, so we don't finalize instrumentation
+	 * here. Accumulation to the parent happens at ExecutorFinish through
+	 * ExecFinalizeTriggerInstrumentation.
+	 */
 	InstrStop(&tginstr->instr);
 	tginstr->firings += firings;
 }
 
-/* note current values during parallel executor startup */
-void
-InstrStartParallelQuery(void)
-{
-	save_pgBufferUsage = pgBufferUsage;
-	save_pgWalUsage = pgWalUsage;
-}
-
-/* report usage after parallel executor shutdown */
 void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccum(Instrumentation *dst, Instrumentation *add)
 {
-	memset(bufusage, 0, sizeof(BufferUsage));
-	BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
-	memset(walusage, 0, sizeof(WalUsage));
-	WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
-}
+	Assert(dst != NULL);
+	Assert(add != NULL);
 
-/* accumulate work done by workers in leader's stats */
-void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
-{
-	BufferUsageAdd(&pgBufferUsage, bufusage);
-	WalUsageAdd(&pgWalUsage, walusage);
+	BufferUsageAdd(&dst->bufusage, &add->bufusage);
+	WalUsageAdd(&dst->walusage, &add->walusage);
 }
 
 /* dst += add */
-static void
+void
 BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 {
 	dst->shared_blks_hit += add->shared_blks_hit;
@@ -311,39 +538,9 @@ BufferUsageAdd(BufferUsage *dst, const BufferUsage *add)
 	INSTR_TIME_ADD(dst->temp_blk_write_time, add->temp_blk_write_time);
 }
 
-/* dst += add - sub */
+/* dst += add */
 void
-BufferUsageAccumDiff(BufferUsage *dst,
-					 const BufferUsage *add,
-					 const BufferUsage *sub)
-{
-	dst->shared_blks_hit += add->shared_blks_hit - sub->shared_blks_hit;
-	dst->shared_blks_read += add->shared_blks_read - sub->shared_blks_read;
-	dst->shared_blks_dirtied += add->shared_blks_dirtied - sub->shared_blks_dirtied;
-	dst->shared_blks_written += add->shared_blks_written - sub->shared_blks_written;
-	dst->local_blks_hit += add->local_blks_hit - sub->local_blks_hit;
-	dst->local_blks_read += add->local_blks_read - sub->local_blks_read;
-	dst->local_blks_dirtied += add->local_blks_dirtied - sub->local_blks_dirtied;
-	dst->local_blks_written += add->local_blks_written - sub->local_blks_written;
-	dst->temp_blks_read += add->temp_blks_read - sub->temp_blks_read;
-	dst->temp_blks_written += add->temp_blks_written - sub->temp_blks_written;
-	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_read_time,
-						  add->shared_blk_read_time, sub->shared_blk_read_time);
-	INSTR_TIME_ACCUM_DIFF(dst->shared_blk_write_time,
-						  add->shared_blk_write_time, sub->shared_blk_write_time);
-	INSTR_TIME_ACCUM_DIFF(dst->local_blk_read_time,
-						  add->local_blk_read_time, sub->local_blk_read_time);
-	INSTR_TIME_ACCUM_DIFF(dst->local_blk_write_time,
-						  add->local_blk_write_time, sub->local_blk_write_time);
-	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_read_time,
-						  add->temp_blk_read_time, sub->temp_blk_read_time);
-	INSTR_TIME_ACCUM_DIFF(dst->temp_blk_write_time,
-						  add->temp_blk_write_time, sub->temp_blk_write_time);
-}
-
-/* helper functions for WAL usage accumulation */
-static void
-WalUsageAdd(WalUsage *dst, WalUsage *add)
+WalUsageAdd(WalUsage *dst, const WalUsage *add)
 {
 	dst->wal_bytes += add->wal_bytes;
 	dst->wal_records += add->wal_records;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 59b098ba0a0..1ff50aecc10 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1261,9 +1261,9 @@ PinBufferForBlock(Relation rel,
 	if (rel)
 	{
 		/*
-		 * While pgBufferUsage's "read" counter isn't bumped unless we reach
-		 * WaitReadBuffers() (so, not for hits, and not for buffers that are
-		 * zeroed instead), the per-relation stats always count them.
+		 * While the current buffer usage "read" counter isn't bumped unless
+		 * we reach WaitReadBuffers() (so, not for hits, and not for buffers
+		 * that are zeroed instead), the per-relation stats always count them.
 		 */
 		pgstat_count_buffer_read(rel);
 		if (*foundPtr)
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 9e7a88ec0d0..60400f0c81f 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -114,9 +114,9 @@ pgstat_prepare_io_time(bool track_io_guc)
  * pg_stat_database only counts block read and write times, these are done for
  * IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
  *
- * pgBufferUsage is used for EXPLAIN.  pgBufferUsage has write and read stats
- * for shared, local and temporary blocks.  pg_stat_io does not track the
- * activity of temporary blocks, so these are ignored here.
+ * Executor instrumentation is used for EXPLAIN. Buffer usage tracked there has
+ * write and read stats for shared, local and temporary blocks. pg_stat_io
+ * does not track the activity of temporary blocks, so these are ignored here.
  */
 void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
diff --git a/src/include/executor/execdesc.h b/src/include/executor/execdesc.h
index d3a57242844..340029a2034 100644
--- a/src/include/executor/execdesc.h
+++ b/src/include/executor/execdesc.h
@@ -51,8 +51,8 @@ typedef struct QueryDesc
 	/* This field is set by ExecutePlan */
 	bool		already_executed;	/* true if previously executed */
 
-	/* This is always set NULL by the core system, but plugins can change it */
-	struct Instrumentation *totaltime;	/* total time spent in ExecutorRun */
+	/* This field is set by ExecutorRun, or plugins */
+	struct QueryInstrumentation *totaltime; /* total time spent in ExecutorRun */
 } QueryDesc;
 
 /* in pquery.c */
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index d46ba59895d..c22199c6869 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -300,6 +300,8 @@ extern void ExecSetExecProcNode(PlanState *node, ExecProcNodeMtd function);
 extern Node *MultiExecProcNode(PlanState *node);
 extern void ExecEndNode(PlanState *node);
 extern void ExecShutdownNode(PlanState *node);
+extern void ExecRememberNodeInstrumentation(PlanState *node, QueryInstrumentation *parent);
+extern void ExecFinalizeNodeInstrumentation(PlanState *node);
 extern void ExecSetTupleBound(int64 tuples_needed, PlanState *child_node);
 
 
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 1139be8333e..cc33b32af1e 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -13,6 +13,7 @@
 #ifndef INSTRUMENT_H
 #define INSTRUMENT_H
 
+#include "lib/ilist.h"
 #include "portability/instr_time.h"
 
 
@@ -68,10 +69,22 @@ typedef enum InstrumentOption
 } InstrumentOption;
 
 /*
- * General purpose instrumentation that can capture time and WAL/buffer usage
+ * Instrumentation base class for capturing time and WAL/buffer usage
  *
- * Initialized through InstrAlloc, followed by one or more calls to a pair of
- * InstrStart/InstrStop (activity is measured inbetween).
+ * If used directly:
+ * - Allocate on the stack and zero initialize the struct
+ * - Call InstrInitOptions to set instrumentation options
+ * - Call InstrStart before the activity you want to measure
+ * - Call InstrStop / InstrStopFinalize after the activity to capture totals
+ *
+ * InstrStart/InstrStop may be called multiple times. The last stop call must
+ * be to InstrStopFinalize to ensure parent stack entries get the accumulated
+ * totals. If there is risk of transaction aborts you must call
+ * InstrStopFinalize in a PG_TRY/PG_FINALLY block to avoid corrupting the
+ * instrumentation stack.
+ *
+ * In a query context use QueryInstrumentation instead, which handles aborts
+ * using the resource owner logic.
  */
 typedef struct Instrumentation
 {
@@ -81,16 +94,52 @@ typedef struct Instrumentation
 	bool		need_walusage;	/* true if we need WAL usage data */
 	/* Internal state keeping: */
 	instr_time	starttime;		/* start time of last InstrStart */
-	BufferUsage bufusage_start; /* buffer usage at start */
-	WalUsage	walusage_start; /* WAL usage at start */
 	/* Accumulated statistics: */
 	instr_time	total;			/* total runtime */
 	BufferUsage bufusage;		/* total buffer usage */
 	WalUsage	walusage;		/* total WAL usage */
 } Instrumentation;
 
+/*
+ * Query-related instrumentation tracking.
+ *
+ * Usage:
+ * - Allocate on the heap using InstrQueryAlloc (required for abort handling)
+ * - Call InstrQueryStart before the activity you want to measure
+ * - Call InstrQueryStop / InstrQueryStopFinalize afterwards to capture totals
+ *
+ * InstrQueryStart/InstrQueryStop may be called multiple times. The last stop
+ * call must be to InstrQueryStopFinalize to ensure parent stack entries get
+ * the accumulated totals.
+ *
+ * Uses resource owner mechanism for handling aborts, as such, the caller
+ * *must* not exit out of the top level transaction after having called
+ * InstrQueryStart, without first calling InstrQueryStop. In the case of a
+ * transaction abort, logic equivalent to InstrQueryStop will be called
+ * automatically.
+ */
+struct ResourceOwnerData;
+typedef struct QueryInstrumentation
+{
+	Instrumentation instr;
+
+	/* Resource owner used for cleanup for aborts between InstrStart/InstrStop */
+	struct ResourceOwnerData *owner;
+
+	/*
+	 * NodeInstrumentation child entries that need to be cleaned up on abort,
+	 * since they are not registered as a resource owner themselves.
+	 */
+	slist_head	unfinalized_children;	/* head of unfinalized children list */
+} QueryInstrumentation;
+
 /*
  * Specialized instrumentation for per-node execution statistics
+ *
+ * Relies on an outer QueryInstrumentation having been set up to handle the
+ * stack used for WAL/buffer usage statistics, and relies on it for managing
+ * aborts. Solely intended for the executor and anyone reporting about its
+ * activities (e.g. EXPLAIN ANALYZE).
  */
 typedef struct NodeInstrumentation
 {
@@ -109,8 +158,15 @@ typedef struct NodeInstrumentation
 	double		nloops;			/* # of run cycles for this node */
 	double		nfiltered1;		/* # of tuples removed by scanqual or joinqual */
 	double		nfiltered2;		/* # of tuples removed by "other" quals */
+
+	/* Abort handling */
+	slist_node	unfinalized_node;	/* node in parent's unfinalized list */
 } NodeInstrumentation;
 
+/*
+ * Care must be taken with any pointers contained within this struct, as this
+ * gets copied across processes during parallel query execution.
+ */
 typedef struct WorkerNodeInstrumentation
 {
 	int			num_workers;	/* # of structures that follow */
@@ -124,19 +180,110 @@ typedef struct TriggerInstrumentation
 								 * was fired */
 } TriggerInstrumentation;
 
-extern PGDLLIMPORT BufferUsage pgBufferUsage;
+/*
+ * Dynamic array-based stack for tracking current WAL/buffer usage context.
+ *
+ * When the stack is empty, 'current' points to instr_top which accumulates
+ * session-level totals.
+ */
+typedef struct InstrStackState
+{
+	int			stack_space;	/* allocated capacity of entries array */
+	int			stack_size;		/* current number of entries */
+
+	Instrumentation **entries;	/* dynamic array of pointers */
+	Instrumentation *current;	/* top of stack, or &instr_top when empty */
+} InstrStackState;
+
 extern PGDLLIMPORT WalUsage pgWalUsage;
 
-extern Instrumentation *InstrAlloc(int instrument_options);
+/*
+ * The top instrumentation represents a running total of the current backend
+ * WAL/buffer usage information. This will not be updated immediately, but
+ * rather when the current stack entry gets accumulated which typically happens
+ * at query end.
+ *
+ * Care must be taken when utilizing this in the parallel worker context:
+ * Parallel workers will report back their instrumentation to the caller,
+ * and this gets added to the caller's stack. If this were to be used in the
+ * shared memory stats infrastructure it would need to be skipped on parallel
+ * workers to avoid double counting.
+ */
+extern PGDLLIMPORT Instrumentation instr_top;
+
+/*
+ * The instrumentation stack state. The 'current' field points to the
+ * currently active stack entry that is getting updated as activity happens,
+ * and will be accumulated to parent stacks when it gets finalized by
+ * InstrStop (for non-executor use cases), ExecFinalizeNodeInstrumentation
+ * (executor finish) or ResOwnerReleaseInstrumentation on abort.
+ */
+extern PGDLLIMPORT InstrStackState instr_stack;
+
+extern void InstrStackGrow(void);
+
+/*
+ * Pushes the stack so that all WAL/buffer usage updates go to the passed in
+ * instrumentation entry.
+ *
+ * Any caller using this directly must manage the passed in entry and call
+ * InstrPopStack on its own again, typically by using a PG_FINALLY block to
+ * ensure the stack gets reset via InstrPopStack on abort. Use InstrStart
+ * instead when you want automatic handling of abort cases using the resource
+ * owner infrastructure.
+ */
+static inline void
+InstrPushStack(Instrumentation *instr)
+{
+	if (unlikely(instr_stack.stack_size == instr_stack.stack_space))
+		InstrStackGrow();
+
+	instr_stack.entries[instr_stack.stack_size++] = instr;
+	instr_stack.current = instr;
+}
+
+/*
+ * Pops the stack entry back to the previous one that was effective at
+ * InstrPushStack.
+ *
+ * Callers must ensure that no intermediate stack entries are skipped, to
+ * handle aborts correctly. If you're thinking of calling this in a PG_FINALLY
+ * block, instead call InstrPopAndFinalizeStack which can skip intermediate
+ * stack entries, or instead use InstrStart/InstrStop.
+ */
+static inline void
+InstrPopStack(Instrumentation *instr)
+{
+	Assert(instr_stack.stack_size > 0);
+	Assert(instr_stack.entries[instr_stack.stack_size - 1] == instr);
+	instr_stack.stack_size--;
+	instr_stack.current = instr_stack.stack_size > 0
+		? instr_stack.entries[instr_stack.stack_size - 1]
+		: &instr_top;
+}
+
 extern void InstrInitOptions(Instrumentation *instr, int instrument_options);
 extern void InstrStart(Instrumentation *instr);
 extern void InstrStop(Instrumentation *instr);
+extern void InstrStopFinalize(Instrumentation *instr);
+extern void InstrAccum(Instrumentation *dst, Instrumentation *add);
+
+extern QueryInstrumentation *InstrQueryAlloc(int instrument_options);
+extern void InstrQueryStart(QueryInstrumentation *instr);
+extern void InstrQueryStop(QueryInstrumentation *instr);
+extern QueryInstrumentation *InstrQueryStopFinalize(QueryInstrumentation *instr);
+extern void InstrQueryRememberNode(QueryInstrumentation *parent, NodeInstrumentation *instr);
+
+pg_nodiscard extern QueryInstrumentation *InstrStartParallelQuery(void);
+extern void InstrEndParallelQuery(QueryInstrumentation *qinstr, BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
 
 extern NodeInstrumentation *InstrAllocNode(int instrument_options,
 										   bool async_mode);
 extern void InstrInitNode(NodeInstrumentation *instr, int instrument_options);
 extern void InstrStartNode(NodeInstrumentation *instr);
 extern void InstrStopNode(NodeInstrumentation *instr, double nTuples);
+extern NodeInstrumentation *InstrFinalizeNode(NodeInstrumentation *instr, Instrumentation *parent);
 extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
 extern void InstrEndLoop(NodeInstrumentation *instr);
 extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
@@ -145,31 +292,31 @@ extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
 extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
 extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
 
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void BufferUsageAccumDiff(BufferUsage *dst,
-								 const BufferUsage *add, const BufferUsage *sub);
+extern void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+extern void WalUsageAdd(WalUsage *dst, const WalUsage *add);
 extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
 							  const WalUsage *sub);
 
 #define INSTR_BUFUSAGE_INCR(fld) do { \
-		pgBufferUsage.fld++; \
+		instr_stack.current->bufusage.fld++; \
 	} while(0)
 #define INSTR_BUFUSAGE_ADD(fld,val) do { \
-		pgBufferUsage.fld += val; \
+		instr_stack.current->bufusage.fld += val; \
 	} while(0)
 #define INSTR_BUFUSAGE_TIME_ADD(fld,val) do { \
-	INSTR_TIME_ADD(pgBufferUsage.fld, val); \
+	INSTR_TIME_ADD(instr_stack.current->bufusage.fld, val); \
 	} while (0)
 #define INSTR_BUFUSAGE_TIME_ACCUM_DIFF(fld,endval,startval) do { \
-	INSTR_TIME_ACCUM_DIFF(pgBufferUsage.fld, endval, startval); \
+	INSTR_TIME_ACCUM_DIFF(instr_stack.current->bufusage.fld, endval, startval); \
 	} while (0)
+
 #define INSTR_WALUSAGE_INCR(fld) do { \
 		pgWalUsage.fld++; \
+		instr_stack.current->walusage.fld++; \
 	} while(0)
 #define INSTR_WALUSAGE_ADD(fld,val) do { \
 		pgWalUsage.fld += val; \
+		instr_stack.current->walusage.fld += val; \
 	} while(0)
 
 #endif							/* INSTRUMENT_H */
diff --git a/src/include/utils/resowner.h b/src/include/utils/resowner.h
index eb6033b4fdb..5463bc921f0 100644
--- a/src/include/utils/resowner.h
+++ b/src/include/utils/resowner.h
@@ -75,6 +75,7 @@ typedef uint32 ResourceReleasePriority;
 #define RELEASE_PRIO_SNAPSHOT_REFS			500
 #define RELEASE_PRIO_FILES					600
 #define RELEASE_PRIO_WAITEVENTSETS			700
+#define RELEASE_PRIO_INSTRUMENTATION		800
 
 /* 0 is considered invalid */
 #define RELEASE_PRIO_FIRST					1
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 7dc0073ab68..5580a080210 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -1320,6 +1320,7 @@ InjectionPointSharedState
 InjectionPointsCtl
 InlineCodeBlock
 InsertStmt
+InstrStackState
 Instrumentation
 Int128AggState
 Int8TransTypeData
@@ -2428,6 +2429,7 @@ QueryCompletion
 QueryDesc
 QueryEnvironment
 QueryInfo
+QueryInstrumentation
 QueryItem
 QueryItemType
 QueryMode
-- 
2.47.1



  [application/octet-stream] v7-0006-instrumentation-Optimize-ExecProcNodeInstr-instru.patch (11.6K, 7-v7-0006-instrumentation-Optimize-ExecProcNodeInstr-instru.patch)
  download | inline diff:
From ddc08b652cd95bdd2bd10c4fb3fca3d292e2c625 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 17:52:24 -0800
Subject: [PATCH v7 6/8] instrumentation: Optimize ExecProcNodeInstr
 instructions by inlining

For most queries, the bulk of the overhead of EXPLAIN ANALYZE happens in
ExecProcNodeInstr when starting/stopping instrumentation for that node.

Previously each ExecProcNodeInstr would check which instrumentation
options are active in the InstrStartNode/InstrStopNode calls, and do the
corresponding work (timers, instrumentation stack, etc.). These
conditionals being checked for each tuple being emitted add up, and cause
non-optimal set of instructions to be generated by the compiler.

Because we already have an existing mechanism to specify a function
pointer when instrumentation is enabled, we can instead create specialized
functions that are tailored to the instrumentation options enabled, and
avoid conditionals on subsequent ExecProcNodeInstr calls. This results in
the overhead for EXPLAIN (ANALYZE, TIMING OFF, BUFFERS OFF) for a stress
test with a large COUNT(*) that does many ExecProcNode calls from ~ 20% on
top of actual runtime to ~ 3%. When using BUFFERS ON the same query goes
from ~ 20% to ~ 10% on top of actual runtime.

Author: Lukas Fittl <[email protected]>
Reviewed-by:
Discussion:
---
 src/backend/executor/execProcnode.c |  22 +--
 src/backend/executor/instrument.c   | 224 +++++++++++++++++++++-------
 src/include/executor/instrument.h   |   5 +
 3 files changed, 174 insertions(+), 77 deletions(-)

diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index c788b5b00f9..6a74ca516ae 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -120,7 +120,6 @@
 #include "nodes/nodeFuncs.h"
 
 static TupleTableSlot *ExecProcNodeFirst(PlanState *node);
-static TupleTableSlot *ExecProcNodeInstr(PlanState *node);
 static bool ExecShutdownNode_walker(PlanState *node, void *context);
 static bool ExecRememberNodeInstrumentation_walker(PlanState *node, void *context);
 static bool ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context);
@@ -464,7 +463,7 @@ ExecProcNodeFirst(PlanState *node)
 	 * have ExecProcNode() directly call the relevant function from now on.
 	 */
 	if (node->instrument)
-		node->ExecProcNode = ExecProcNodeInstr;
+		node->ExecProcNode = InstrNodeSetupExecProcNode(node->instrument);
 	else
 		node->ExecProcNode = node->ExecProcNodeReal;
 
@@ -472,25 +471,6 @@ ExecProcNodeFirst(PlanState *node)
 }
 
 
-/*
- * ExecProcNode wrapper that performs instrumentation calls.  By keeping
- * this a separate function, we avoid overhead in the normal case where
- * no instrumentation is wanted.
- */
-static TupleTableSlot *
-ExecProcNodeInstr(PlanState *node)
-{
-	TupleTableSlot *result;
-
-	InstrStartNode(node->instrument);
-
-	result = node->ExecProcNodeReal(node);
-
-	InstrStopNode(node->instrument, TupIsNull(result) ? 0.0 : 1.0);
-
-	return result;
-}
-
 
 /* ----------------------------------------------------------------
  *		MultiExecProcNode
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 1afa5e94960..602facec401 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -49,29 +49,20 @@ InstrInitOptions(Instrumentation *instr, int instrument_options)
 	instr->need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
 }
 
-void
-InstrStart(Instrumentation *instr)
+static inline void
+InstrStartTimer(Instrumentation *instr)
 {
-	if (instr->need_timer)
-	{
-		if (!INSTR_TIME_IS_ZERO(instr->starttime))
-			elog(ERROR, "InstrStart called twice in a row");
-		else
-			INSTR_TIME_SET_CURRENT(instr->starttime);
-	}
+	Assert(INSTR_TIME_IS_ZERO(instr->starttime));
 
-	if (instr->need_bufusage || instr->need_walusage)
-		InstrPushStack(instr);
+	INSTR_TIME_SET_CURRENT(instr->starttime);
 }
 
-static void
+static inline void
 InstrStopTimer(Instrumentation *instr)
 {
 	instr_time	endtime;
 
-	/* let's update the time only if the timer was requested */
-	if (INSTR_TIME_IS_ZERO(instr->starttime))
-		elog(ERROR, "InstrStop called without start");
+	Assert(!INSTR_TIME_IS_ZERO(instr->starttime));
 
 	INSTR_TIME_SET_CURRENT(endtime);
 	INSTR_TIME_ACCUM_DIFF(instr->total, endtime, instr->starttime);
@@ -79,6 +70,16 @@ InstrStopTimer(Instrumentation *instr)
 	INSTR_TIME_SET_ZERO(instr->starttime);
 }
 
+void
+InstrStart(Instrumentation *instr)
+{
+	if (instr->need_timer)
+		InstrStartTimer(instr);
+
+	if (instr->need_bufusage || instr->need_walusage)
+		InstrPushStack(instr);
+}
+
 void
 InstrStop(Instrumentation *instr)
 {
@@ -333,65 +334,57 @@ InstrInitNode(NodeInstrumentation *instr, int instrument_options)
 	InstrInitOptions(&instr->instr, instrument_options);
 }
 
-/* Entry to a plan node */
+/* Entry to a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
 void
 InstrStartNode(NodeInstrumentation *instr)
 {
 	InstrStart(&instr->instr);
 }
 
-
-/* Exit from a plan node */
-void
-InstrStopNode(NodeInstrumentation *instr, double nTuples)
+/*
+ * Updates the node instrumentation time counter.
+ *
+ * Note this is different from InstrStop because total is only updated in
+ * InstrEndLoop. We need the separate counter variable because we need to
+ * calculate start-up time for the first tuple in each cycle, and then
+ * accumulate it together.
+ */
+static inline void
+InstrStopNodeTimer(NodeInstrumentation *instr)
 {
-	double		save_tuplecount = instr->tuplecount;
 	instr_time	endtime;
 
-	/* count the returned tuples */
-	instr->tuplecount += nTuples;
+	Assert(!INSTR_TIME_IS_ZERO(instr->instr.starttime));
+
+	INSTR_TIME_SET_CURRENT(endtime);
+	INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+	INSTR_TIME_SET_ZERO(instr->instr.starttime);
 
 	/*
-	 * Update the time only if the timer was requested.
+	 * Is this the first tuple of this cycle?
 	 *
-	 * Note this is different from InstrStop because total is only updated in
-	 * InstrEndLoop. We need the separate counter variable because we need to
-	 * calculate start-up time for the first tuple in each cycle, and then
-	 * accumulate it together.
+	 * In async mode, if the plan node hadn't emitted any tuples before, this
+	 * might be the first tuple
 	 */
-	if (instr->instr.need_timer)
-	{
-		if (INSTR_TIME_IS_ZERO(instr->instr.starttime))
-			elog(ERROR, "InstrStopNode called without start");
-
-		INSTR_TIME_SET_CURRENT(endtime);
-		INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->instr.starttime);
+	if (!instr->running || (instr->async_mode && instr->tuplecount < 1.0))
+		instr->firsttuple = instr->counter;
+}
 
-		INSTR_TIME_SET_ZERO(instr->instr.starttime);
-	}
+/* Exit from a plan node. If you modify this, check InstrNodeSetupExecProcNode. */
+void
+InstrStopNode(NodeInstrumentation *instr, double nTuples)
+{
+	if (instr->instr.need_timer)
+		InstrStopNodeTimer(instr);
 
-	/*
-	 * Only pop the stack, accumulation runs in
-	 * ExecFinalizeNodeInstrumentation
-	 */
+	/* Only pop the stack, accumulation runs in InstrFinalizeNode */
 	if (instr->instr.need_bufusage || instr->instr.need_walusage)
 		InstrPopStack(&instr->instr);
 
-	/* Is this the first tuple of this cycle? */
-	if (!instr->running)
-	{
-		instr->running = true;
-		instr->firsttuple = instr->counter;
-	}
-	else
-	{
-		/*
-		 * In async mode, if the plan node hadn't emitted any tuples before,
-		 * this might be the first tuple
-		 */
-		if (instr->async_mode && save_tuplecount < 1.0)
-			instr->firsttuple = instr->counter;
-	}
+	instr->running = true;
+
+	/* count the returned tuples */
+	instr->tuplecount += nTuples;
 }
 
 /* Add per-node instrumentation to the parent and move into per-query memory context */
@@ -475,6 +468,125 @@ InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add)
 		WalUsageAdd(&dst->instr.walusage, &add->instr.walusage);
 }
 
+/*
+ * Specialized handling of instrumented ExecProcNode
+ *
+ * These functions are equivalent to running ExecProcNodeReal wrapped in
+ * InstrStartNode and InstrStopNode, but avoid the conditionals in the hot path
+ * by checking the instrumentation options when the ExecProcNode pointer gets
+ * first set, and then using a special-purpose function for each. This results
+ * in a more optimized set of compiled instructions.
+ */
+
+#include "executor/tuptable.h"
+#include "nodes/execnodes.h"
+
+/* Simplified pop: restore saved state instead of re-deriving from array */
+static inline void
+InstrPopStackTo(Instrumentation *prev)
+{
+	Assert(instr_stack.stack_size > 0);
+	instr_stack.stack_size--;
+	instr_stack.current = prev;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrFull(PlanState *node)
+{
+	NodeInstrumentation *instr = node->instrument;
+	Instrumentation *prev = instr_stack.current;
+	TupleTableSlot *result;
+
+	InstrPushStack(&instr->instr);
+	InstrStartTimer(&instr->instr);
+
+	result = node->ExecProcNodeReal(node);
+
+	InstrStopNodeTimer(instr);
+	InstrPopStackTo(prev);
+
+	instr->running = true;
+	if (!TupIsNull(result))
+		instr->tuplecount += 1.0;
+
+	return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsBuffersWalOnly(PlanState *node)
+{
+	NodeInstrumentation *instr = node->instrument;
+	Instrumentation *prev = instr_stack.current;
+	TupleTableSlot *result;
+
+	InstrPushStack(&instr->instr);
+
+	result = node->ExecProcNodeReal(node);
+
+	InstrPopStackTo(prev);
+
+	instr->running = true;
+	if (!TupIsNull(result))
+		instr->tuplecount += 1.0;
+
+	return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsTimerOnly(PlanState *node)
+{
+	NodeInstrumentation *instr = node->instrument;
+	TupleTableSlot *result;
+
+	InstrStartTimer(&instr->instr);
+
+	result = node->ExecProcNodeReal(node);
+
+	InstrStopNodeTimer(instr);
+
+	instr->running = true;
+	if (!TupIsNull(result))
+		instr->tuplecount += 1.0;
+
+	return result;
+}
+
+static TupleTableSlot *
+ExecProcNodeInstrRowsOnly(PlanState *node)
+{
+	NodeInstrumentation *instr = node->instrument;
+	TupleTableSlot *result;
+
+	result = node->ExecProcNodeReal(node);
+
+	instr->running = true;
+	if (!TupIsNull(result))
+		instr->tuplecount += 1.0;
+
+	return result;
+}
+
+/*
+ * Returns an ExecProcNode wrapper that performs instrumentation calls,
+ * tailored to the instrumentation options enabled for the node.
+ */
+ExecProcNodeMtd
+InstrNodeSetupExecProcNode(NodeInstrumentation *instr)
+{
+	bool		need_timer = instr->instr.need_timer;
+	bool		need_buf = (instr->instr.need_bufusage ||
+							instr->instr.need_walusage);
+
+	if (need_timer && need_buf)
+		return ExecProcNodeInstrFull;
+	else if (need_buf)
+		return ExecProcNodeInstrRowsBuffersWalOnly;
+	else if (need_timer)
+		return ExecProcNodeInstrRowsTimerOnly;
+	else
+		return ExecProcNodeInstrRowsOnly;
+}
+
 /* Trigger instrumentation handling */
 TriggerInstrumentation *
 InstrAllocTrigger(int n, int instrument_options)
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index cc33b32af1e..ac7f0d21c37 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -288,6 +288,11 @@ extern void InstrUpdateTupleCount(NodeInstrumentation *instr, double nTuples);
 extern void InstrEndLoop(NodeInstrumentation *instr);
 extern void InstrAggNode(NodeInstrumentation *dst, NodeInstrumentation *add);
 
+typedef struct TupleTableSlot TupleTableSlot;
+typedef struct PlanState PlanState;
+typedef TupleTableSlot *(*ExecProcNodeMtd) (PlanState *pstate);
+extern ExecProcNodeMtd InstrNodeSetupExecProcNode(NodeInstrumentation *instr);
+
 extern TriggerInstrumentation *InstrAllocTrigger(int n, int instrument_options);
 extern void InstrStartTrigger(TriggerInstrumentation *tginstr);
 extern void InstrStopTrigger(TriggerInstrumentation *tginstr, int firings);
-- 
2.47.1



  [application/octet-stream] v7-0007-Index-scans-Show-table-buffer-accesses-separately.patch (16.8K, 8-v7-0007-Index-scans-Show-table-buffer-accesses-separately.patch)
  download | inline diff:
From 3ed273fa607d47346b20eecce0e48f33500a297f Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:19 -0800
Subject: [PATCH v7 7/8] Index scans: Show table buffer accesses separately in
 EXPLAIN ANALYZE

This sets up a separate instrumentation stack that is used whilst an
Index Scan does scanning on the table, for example due to additional
data being needed.

EXPLAIN ANALYZE will now show "Table Buffers" that represent such activity.
The activity is also included in regular "Buffers" together with index
activity and that of any child nodes.

Author: Lukas Fittl <[email protected]>
Suggested-by: Andres Freund <[email protected]>
Reviewed-by:
Discussion:
---
 doc/src/sgml/perform.sgml              | 13 ++++--
 doc/src/sgml/ref/explain.sgml          |  1 +
 src/backend/commands/explain.c         | 55 +++++++++++++++-----------
 src/backend/executor/execProcnode.c    | 35 ++++++++++++++++
 src/backend/executor/nodeIndexscan.c   | 33 +++++++++++++++-
 src/include/executor/instrument_node.h |  6 +++
 src/include/nodes/execnodes.h          |  7 ++++
 7 files changed, 122 insertions(+), 28 deletions(-)

diff --git a/doc/src/sgml/perform.sgml b/doc/src/sgml/perform.sgml
index 5f6f1db0467..9219625faf6 100644
--- a/doc/src/sgml/perform.sgml
+++ b/doc/src/sgml/perform.sgml
@@ -734,6 +734,7 @@ WHERE t1.unique1 &lt; 10 AND t1.unique2 = t2.unique2;
    -&gt;  Index Scan using tenk2_unique2 on tenk2 t2  (cost=0.29..7.90 rows=1 width=244) (actual time=0.003..0.003 rows=1.00 loops=10)
          Index Cond: (unique2 = t1.unique2)
          Index Searches: 10
+         Table Buffers: shared hit=10
          Buffers: shared hit=24 read=6
  Planning:
    Buffers: shared hit=15 dirtied=9
@@ -949,7 +950,8 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @&gt; polygon '(0.5,2.0)';
    Index Cond: (f1 @&gt; '((0.5,2))'::polygon)
    Rows Removed by Index Recheck: 1
    Index Searches: 1
-   Buffers: shared hit=1
+   Table Buffers: shared hit=1
+   Buffers: shared hit=2
  Planning Time: 0.039 ms
  Execution Time: 0.098 ms
 </screen>
@@ -958,7 +960,9 @@ EXPLAIN ANALYZE SELECT * FROM polygon_tbl WHERE f1 @&gt; polygon '(0.5,2.0)';
     then rejected by a recheck of the index condition.  This happens because a
     GiST index is <quote>lossy</quote> for polygon containment tests: it actually
     returns the rows with polygons that overlap the target, and then we have
-    to do the exact containment test on those rows.
+    to do the exact containment test on those rows. The <literal>Table Buffers</literal>
+    counts indicate how many operations were performed on the table instead of
+    the index. This number is included in the <literal>Buffers</literal> counts.
    </para>
 
    <para>
@@ -1147,13 +1151,14 @@ EXPLAIN ANALYZE SELECT * FROM tenk1 WHERE unique1 &lt; 100 AND unique2 &gt; 9000
                                                           QUERY PLAN
 -------------------------------------------------------------------&zwsp;------------------------------------------------------------
  Limit  (cost=0.29..14.33 rows=2 width=244) (actual time=0.051..0.071 rows=2.00 loops=1)
-   Buffers: shared hit=16
+   Buffers: shared hit=14
    -&gt;  Index Scan using tenk1_unique2 on tenk1  (cost=0.29..70.50 rows=10 width=244) (actual time=0.051..0.070 rows=2.00 loops=1)
          Index Cond: (unique2 &gt; 9000)
          Filter: (unique1 &lt; 100)
          Rows Removed by Filter: 287
          Index Searches: 1
-         Buffers: shared hit=16
+         Table Buffers: shared hit=11
+         Buffers: shared hit=14
  Planning Time: 0.077 ms
  Execution Time: 0.086 ms
 </screen>
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 7dee77fd366..912c96f2ff5 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -506,6 +506,7 @@ EXPLAIN ANALYZE EXECUTE query(100, 200);
    -&gt;  Index Scan using test_pkey on test  (cost=0.29..10.27 rows=99 width=8) (actual time=0.009..0.025 rows=99.00 loops=1)
          Index Cond: ((id &gt; 100) AND (id &lt; 200))
          Index Searches: 1
+         Table Buffers: shared hit=1
          Buffers: shared hit=4
  Planning Time: 0.244 ms
  Execution Time: 0.073 ms
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index aa76f68bd10..437246d1aa4 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -143,7 +143,7 @@ static void show_instrumentation_count(const char *qlabel, int which,
 static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
 static const char *explain_get_index_name(Oid indexId);
 static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
-static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static void show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title);
 static void show_wal_usage(ExplainState *es, const WalUsage *usage);
 static void show_memory_counters(ExplainState *es,
 								 const MemoryContextCounters *mem_counters);
@@ -603,7 +603,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
 		}
 
 		if (bufusage)
-			show_buffer_usage(es, bufusage);
+			show_buffer_usage(es, bufusage, NULL);
 
 		if (mem_counters)
 			show_memory_counters(es, mem_counters);
@@ -1020,7 +1020,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
 		if (es->buffers && peek_buffer_usage(es, &metrics->bufferUsage))
 		{
 			es->indent++;
-			show_buffer_usage(es, &metrics->bufferUsage);
+			show_buffer_usage(es, &metrics->bufferUsage, NULL);
 			es->indent--;
 		}
 	}
@@ -1034,7 +1034,7 @@ ExplainPrintSerialize(ExplainState *es, SerializeMetrics *metrics)
 								BYTES_TO_KILOBYTES(metrics->bytesSent), es);
 		ExplainPropertyText("Format", format, es);
 		if (es->buffers)
-			show_buffer_usage(es, &metrics->bufferUsage);
+			show_buffer_usage(es, &metrics->bufferUsage, NULL);
 	}
 
 	ExplainCloseGroup("Serialization", "Serialization", true, es);
@@ -1962,6 +1962,9 @@ ExplainNode(PlanState *planstate, List *ancestors,
 				show_instrumentation_count("Rows Removed by Filter", 1,
 										   planstate, es);
 			show_indexsearches_info(planstate, es);
+
+			if (es->buffers && planstate->instrument)
+				show_buffer_usage(es, &((IndexScanState *) planstate)->iss_InstrumentTable->instr.bufusage, "Table");
 			break;
 		case T_IndexOnlyScan:
 			show_scan_qual(((IndexOnlyScan *) plan)->indexqual,
@@ -2280,7 +2283,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 	/* Show buffer/WAL usage */
 	if (es->buffers && planstate->instrument)
-		show_buffer_usage(es, &planstate->instrument->instr.bufusage);
+		show_buffer_usage(es, &planstate->instrument->instr.bufusage, NULL);
 	if (es->wal && planstate->instrument)
 		show_wal_usage(es, &planstate->instrument->instr.walusage);
 
@@ -2299,7 +2302,7 @@ ExplainNode(PlanState *planstate, List *ancestors,
 
 			ExplainOpenWorker(n, es);
 			if (es->buffers)
-				show_buffer_usage(es, &instrument->instr.bufusage);
+				show_buffer_usage(es, &instrument->instr.bufusage, NULL);
 			if (es->wal)
 				show_wal_usage(es, &instrument->instr.walusage);
 			ExplainCloseWorker(n, es);
@@ -4099,7 +4102,7 @@ peek_buffer_usage(ExplainState *es, const BufferUsage *usage)
  * Show buffer usage details.  This better be sync with peek_buffer_usage.
  */
 static void
-show_buffer_usage(ExplainState *es, const BufferUsage *usage)
+show_buffer_usage(ExplainState *es, const BufferUsage *usage, const char *title)
 {
 	if (es->format == EXPLAIN_FORMAT_TEXT)
 	{
@@ -4124,6 +4127,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
 		if (has_shared || has_local || has_temp)
 		{
 			ExplainIndentText(es);
+			if (title)
+				appendStringInfo(es->str, "%s ", title);
 			appendStringInfoString(es->str, "Buffers:");
 
 			if (has_shared)
@@ -4179,6 +4184,8 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
 		if (has_shared_timing || has_local_timing || has_temp_timing)
 		{
 			ExplainIndentText(es);
+			if (title)
+				appendStringInfo(es->str, "%s ", title);
 			appendStringInfoString(es->str, "I/O Timings:");
 
 			if (has_shared_timing)
@@ -4220,44 +4227,46 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
 	}
 	else
 	{
-		ExplainPropertyInteger("Shared Hit Blocks", NULL,
+		char	   *prefix = title ? psprintf("%s ", title) : pstrdup("");
+
+		ExplainPropertyInteger(psprintf("%sShared Hit Blocks", prefix), NULL,
 							   usage->shared_blks_hit, es);
-		ExplainPropertyInteger("Shared Read Blocks", NULL,
+		ExplainPropertyInteger(psprintf("%sShared Read Blocks", prefix), NULL,
 							   usage->shared_blks_read, es);
-		ExplainPropertyInteger("Shared Dirtied Blocks", NULL,
+		ExplainPropertyInteger(psprintf("%sShared Dirtied Blocks", prefix), NULL,
 							   usage->shared_blks_dirtied, es);
-		ExplainPropertyInteger("Shared Written Blocks", NULL,
+		ExplainPropertyInteger(psprintf("%sShared Written Blocks", prefix), NULL,
 							   usage->shared_blks_written, es);
-		ExplainPropertyInteger("Local Hit Blocks", NULL,
+		ExplainPropertyInteger(psprintf("%sLocal Hit Blocks", prefix), NULL,
 							   usage->local_blks_hit, es);
-		ExplainPropertyInteger("Local Read Blocks", NULL,
+		ExplainPropertyInteger(psprintf("%sLocal Read Blocks", prefix), NULL,
 							   usage->local_blks_read, es);
-		ExplainPropertyInteger("Local Dirtied Blocks", NULL,
+		ExplainPropertyInteger(psprintf("%sLocal Dirtied Blocks", prefix), NULL,
 							   usage->local_blks_dirtied, es);
-		ExplainPropertyInteger("Local Written Blocks", NULL,
+		ExplainPropertyInteger(psprintf("%sLocal Written Blocks", prefix), NULL,
 							   usage->local_blks_written, es);
-		ExplainPropertyInteger("Temp Read Blocks", NULL,
+		ExplainPropertyInteger(psprintf("%sTemp Read Blocks", prefix), NULL,
 							   usage->temp_blks_read, es);
-		ExplainPropertyInteger("Temp Written Blocks", NULL,
+		ExplainPropertyInteger(psprintf("%sTemp Written Blocks", prefix), NULL,
 							   usage->temp_blks_written, es);
 		if (track_io_timing)
 		{
-			ExplainPropertyFloat("Shared I/O Read Time", "ms",
+			ExplainPropertyFloat(psprintf("%sShared I/O Read Time", prefix), "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time),
 								 3, es);
-			ExplainPropertyFloat("Shared I/O Write Time", "ms",
+			ExplainPropertyFloat(psprintf("%sShared I/O Write Time", prefix), "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time),
 								 3, es);
-			ExplainPropertyFloat("Local I/O Read Time", "ms",
+			ExplainPropertyFloat(psprintf("%sLocal I/O Read Time", prefix), "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time),
 								 3, es);
-			ExplainPropertyFloat("Local I/O Write Time", "ms",
+			ExplainPropertyFloat(psprintf("%sLocal I/O Write Time", prefix), "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time),
 								 3, es);
-			ExplainPropertyFloat("Temp I/O Read Time", "ms",
+			ExplainPropertyFloat(psprintf("%sTemp I/O Read Time", prefix), "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time),
 								 3, es);
-			ExplainPropertyFloat("Temp I/O Write Time", "ms",
+			ExplainPropertyFloat(psprintf("%sTemp I/O Write Time", prefix), "ms",
 								 INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time),
 								 3, es);
 		}
diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
index 6a74ca516ae..5e476939edf 100644
--- a/src/backend/executor/execProcnode.c
+++ b/src/backend/executor/execProcnode.c
@@ -414,9 +414,24 @@ ExecInitNode(Plan *node, EState *estate, int eflags)
 
 	/* Set up instrumentation for this node if requested */
 	if (estate->es_instrument)
+	{
 		result->instrument = InstrAllocNode(estate->es_instrument,
 											result->async_capable);
 
+		/* IndexScan tracks table access separately from index access. */
+		if (IsA(result, IndexScanState) && (estate->es_instrument & INSTRUMENT_BUFFERS) != 0)
+		{
+			IndexScanState *iss = castNode(IndexScanState, result);
+
+			/*
+			 * We intentionally don't collect timing here (even if enabled),
+			 * since we don't need it, and IndexNext calls InstrPushStack /
+			 * InstrPopStack (instead of InstrNode*) to reduce overhead.
+			 */
+			iss->iss_InstrumentTable = InstrAllocNode(INSTRUMENT_BUFFERS, false);
+		}
+	}
+
 	return result;
 }
 
@@ -836,8 +851,19 @@ ExecRememberNodeInstrumentation_walker(PlanState *node, void *context)
 		return false;
 
 	if (node->instrument)
+	{
 		InstrQueryRememberNode(parent, node->instrument);
 
+		/* IndexScan has a separate entry to track table access */
+		if (IsA(node, IndexScanState))
+		{
+			IndexScanState *iss = castNode(IndexScanState, node);
+
+			if (iss->iss_InstrumentTable)
+				InstrQueryRememberNode(parent, iss->iss_InstrumentTable);
+		}
+	}
+
 	return planstate_tree_walker(node, ExecRememberNodeInstrumentation_walker, context);
 }
 
@@ -879,6 +905,15 @@ ExecFinalizeNodeInstrumentation_walker(PlanState *node, void *context)
 	if (!node->instrument)
 		return false;
 
+	/* IndexScan has a separate entry to track table access */
+	if (IsA(node, IndexScanState))
+	{
+		IndexScanState *iss = castNode(IndexScanState, node);
+
+		if (iss->iss_InstrumentTable)
+			iss->iss_InstrumentTable = InstrFinalizeNode(iss->iss_InstrumentTable, &node->instrument->instr);
+	}
+
 	node->instrument = InstrFinalizeNode(node->instrument, parent);
 
 	return false;
diff --git a/src/backend/executor/nodeIndexscan.c b/src/backend/executor/nodeIndexscan.c
index a616abff04c..4794095092e 100644
--- a/src/backend/executor/nodeIndexscan.c
+++ b/src/backend/executor/nodeIndexscan.c
@@ -83,7 +83,9 @@ IndexNext(IndexScanState *node)
 	ExprContext *econtext;
 	ScanDirection direction;
 	IndexScanDesc scandesc;
+	ItemPointer tid;
 	TupleTableSlot *slot;
+	bool		found;
 
 	/*
 	 * extract necessary information from index scan node
@@ -128,8 +130,22 @@ IndexNext(IndexScanState *node)
 	/*
 	 * ok, now that we have what we need, fetch the next tuple.
 	 */
-	while (index_getnext_slot(scandesc, direction, slot))
+	while ((tid = index_getnext_tid(scandesc, direction)) != NULL)
 	{
+		if (node->iss_InstrumentTable)
+			InstrPushStack(&node->iss_InstrumentTable->instr);
+
+		found = index_fetch_heap(scandesc, slot);
+
+		if (node->iss_InstrumentTable)
+			InstrPopStack(&node->iss_InstrumentTable->instr);
+
+		if (unlikely(!found))
+			continue;
+
+		if (scandesc->xs_heap_continue)
+			elog(ERROR, "non-MVCC snapshots are not supported in index-only scans");
+
 		CHECK_FOR_INTERRUPTS();
 
 		/*
@@ -812,6 +828,11 @@ ExecEndIndexScan(IndexScanState *node)
 		 * which will have a new IndexOnlyScanState and zeroed stats.
 		 */
 		winstrument->nsearches += node->iss_Instrument.nsearches;
+		if (node->iss_InstrumentTable)
+		{
+			BufferUsageAdd(&winstrument->worker_table_bufusage, &node->iss_InstrumentTable->instr.bufusage);
+			WalUsageAdd(&winstrument->worker_table_walusage, &node->iss_InstrumentTable->instr.walusage);
+		}
 	}
 
 	/*
@@ -1819,4 +1840,14 @@ ExecIndexScanRetrieveInstrumentation(IndexScanState *node)
 		SharedInfo->num_workers * sizeof(IndexScanInstrumentation);
 	node->iss_SharedInfo = palloc(size);
 	memcpy(node->iss_SharedInfo, SharedInfo, size);
+
+	/* Aggregate workers' table buffer/WAL usage into leader's entry */
+	if (node->iss_InstrumentTable)
+		for (int i = 0; i < node->iss_SharedInfo->num_workers; i++)
+		{
+			BufferUsageAdd(&node->iss_InstrumentTable->instr.bufusage,
+						   &node->iss_SharedInfo->winstrument[i].worker_table_bufusage);
+			WalUsageAdd(&node->iss_InstrumentTable->instr.walusage,
+						&node->iss_SharedInfo->winstrument[i].worker_table_walusage);
+		}
 }
diff --git a/src/include/executor/instrument_node.h b/src/include/executor/instrument_node.h
index 8847d7f94fa..170b6143ef6 100644
--- a/src/include/executor/instrument_node.h
+++ b/src/include/executor/instrument_node.h
@@ -18,6 +18,8 @@
 #ifndef INSTRUMENT_NODE_H
 #define INSTRUMENT_NODE_H
 
+#include "executor/instrument.h"
+
 
 /* ---------------------
  *	Instrumentation information for aggregate function execution
@@ -48,6 +50,10 @@ typedef struct IndexScanInstrumentation
 {
 	/* Index search count (incremented with pgstat_count_index_scan call) */
 	uint64		nsearches;
+
+	/* Used for passing iss_InstrumentTableStack data from parallel workers */
+	BufferUsage worker_table_bufusage;
+	WalUsage	worker_table_walusage;
 } IndexScanInstrumentation;
 
 /*
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 605c7a6cc39..c778641c13d 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1728,6 +1728,13 @@ typedef struct IndexScanState
 	IndexScanInstrumentation iss_Instrument;
 	SharedIndexScanInstrumentation *iss_SharedInfo;
 
+	/*
+	 * Instrumentation utilized for tracking table access. This is separate
+	 * from iss_Instrument since it needs to be allocated in the right context
+	 * and IndexScanInstrumentation shouldn't contain pointers.
+	 */
+	NodeInstrumentation *iss_InstrumentTable;
+
 	/* These are needed for re-checking ORDER BY expr ordering */
 	pairingheap *iss_ReorderQueue;
 	bool		iss_ReachedEnd;
-- 
2.47.1



  [application/octet-stream] v7-0008-Add-pg_session_buffer_usage-contrib-module.patch (25.5K, 9-v7-0008-Add-pg_session_buffer_usage-contrib-module.patch)
  download | inline diff:
From 15900843fffa2c6405a673f50f98c37c09e617e8 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <[email protected]>
Date: Sat, 7 Mar 2026 11:46:41 -0800
Subject: [PATCH v7 8/8] Add pg_session_buffer_usage contrib module

This is intended for testing instrumentation related logic as it pertains
to the top level stack that is maintained as a running total. There is
currently no in-core user that utilizes the top-level values in this
manner, and especially during abort situations this helps ensure we don't
lose activity due to incorrect handling of unfinalized node stacks.
---
 contrib/meson.build                           |   1 +
 contrib/pg_session_buffer_usage/Makefile      |  23 ++
 .../expected/pg_session_buffer_usage.out      | 283 ++++++++++++++++++
 contrib/pg_session_buffer_usage/meson.build   |  34 +++
 .../pg_session_buffer_usage--1.0.sql          |  31 ++
 .../pg_session_buffer_usage.c                 |  95 ++++++
 .../pg_session_buffer_usage.control           |   5 +
 .../sql/pg_session_buffer_usage.sql           | 204 +++++++++++++
 8 files changed, 676 insertions(+)
 create mode 100644 contrib/pg_session_buffer_usage/Makefile
 create mode 100644 contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
 create mode 100644 contrib/pg_session_buffer_usage/meson.build
 create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
 create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
 create mode 100644 contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
 create mode 100644 contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql

diff --git a/contrib/meson.build b/contrib/meson.build
index def13257cbe..cab1b211678 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -50,6 +50,7 @@ subdir('pg_logicalinspect')
 subdir('pg_overexplain')
 subdir('pg_prewarm')
 subdir('pgrowlocks')
+subdir('pg_session_buffer_usage')
 subdir('pg_stat_statements')
 subdir('pgstattuple')
 subdir('pg_surgery')
diff --git a/contrib/pg_session_buffer_usage/Makefile b/contrib/pg_session_buffer_usage/Makefile
new file mode 100644
index 00000000000..75bd8e09b3d
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/Makefile
@@ -0,0 +1,23 @@
+# contrib/pg_session_buffer_usage/Makefile
+
+MODULE_big = pg_session_buffer_usage
+OBJS = \
+	$(WIN32RES) \
+	pg_session_buffer_usage.o
+
+EXTENSION = pg_session_buffer_usage
+DATA = pg_session_buffer_usage--1.0.sql
+PGFILEDESC = "pg_session_buffer_usage - show buffer usage statistics for the current session"
+
+REGRESS = pg_session_buffer_usage
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_session_buffer_usage
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out b/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
new file mode 100644
index 00000000000..242b4003950
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/expected/pg_session_buffer_usage.out
@@ -0,0 +1,283 @@
+LOAD 'pg_session_buffer_usage';
+CREATE EXTENSION pg_session_buffer_usage;
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM pg_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+  AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+  AND local_blks_hit >= 0 AND local_blks_read >= 0
+  AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+  AND temp_blks_read >= 0 AND temp_blks_written >= 0
+  AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+  AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+  AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+ ok 
+----
+ t
+(1 row)
+
+-- Verify counters increase after buffer activity
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset 
+-------------------------------
+ 
+(1 row)
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+ count 
+-------
+  1000
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM pg_session_buffer_usage();
+ blocks_increased 
+------------------
+ t
+(1 row)
+
+DROP TABLE test_buf_activity;
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+SELECT count(*) FROM par_dc_tab;
+ count 
+-------
+  5000
+(1 row)
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset 
+-------------------------------
+ 
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count 
+-------
+  5000
+(1 row)
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset 
+-------------------------------
+ 
+(1 row)
+
+SELECT count(*) FROM par_dc_tab;
+ count 
+-------
+  5000
+(1 row)
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+       AS leader_buffers_match
+FROM pg_session_buffer_usage(), dc_serial_result s;
+ leader_buffers_match 
+----------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_dc_tab, dc_serial_result;
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset 
+-------------------------------
+ 
+(1 row)
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+        WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+        SELECT a / 0 FROM ins;
+ERROR:  division by zero
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM pg_session_buffer_usage();
+ exception_buffers_visible 
+---------------------------
+ t
+(1 row)
+
+DROP TABLE exc_tab;
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset 
+-------------------------------
+ 
+(1 row)
+
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+ERROR:  duplicate key value violates unique constraint "unique_tab_a_key"
+DETAIL:  Key (a)=(1) already exists.
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM pg_session_buffer_usage();
+ unique_violation_buffers_visible 
+----------------------------------
+ t
+(1 row)
+
+DROP TABLE unique_tab;
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+    BEGIN
+        EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+            WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+            SELECT a / 0 FROM ins';
+    EXCEPTION WHEN division_by_zero THEN
+        RETURN 'caught';
+    END;
+    RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset 
+-------------------------------
+ 
+(1 row)
+
+SELECT subxact_exc_func();
+ subxact_exc_func 
+------------------
+ caught
+(1 row)
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM pg_session_buffer_usage();
+ subxact_buffers_visible 
+-------------------------
+ t
+(1 row)
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+    rec record;
+    cnt int := 0;
+BEGIN
+    BEGIN
+        FOR rec IN SELECT * FROM cursor_tab LOOP
+            cnt := cnt + 1;
+            IF cnt = 250 THEN
+                PERFORM 1 / 0;
+            END IF;
+        END LOOP;
+    EXCEPTION WHEN division_by_zero THEN
+        RETURN 'caught after ' || cnt || ' rows';
+    END;
+    RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset 
+-------------------------------
+ 
+(1 row)
+
+SELECT cursor_exc_func();
+    cursor_exc_func    
+-----------------------
+ caught after 250 rows
+(1 row)
+
+SELECT local_blks_hit + local_blks_read > 0
+       AS cursor_subxact_buffers_visible
+FROM pg_session_buffer_usage();
+ cursor_subxact_buffers_visible 
+--------------------------------
+ t
+(1 row)
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+ count 
+-------
+  5000
+(1 row)
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset 
+-------------------------------
+ 
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR:  invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+SELECT pg_session_buffer_usage_reset();
+ pg_session_buffer_usage_reset 
+-------------------------------
+ 
+(1 row)
+
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+ERROR:  invalid input syntax for type smallint: "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
+CONTEXT:  parallel worker
+RESET debug_parallel_query;
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+       AS worker_abort_buffers_not_propagated
+FROM pg_session_buffer_usage(), par_abort_serial_result s;
+ worker_abort_buffers_not_propagated 
+-------------------------------------
+ t
+(1 row)
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+DROP TABLE par_abort_tab, par_abort_serial_result;
+-- Cleanup
+DROP EXTENSION pg_session_buffer_usage;
diff --git a/contrib/pg_session_buffer_usage/meson.build b/contrib/pg_session_buffer_usage/meson.build
new file mode 100644
index 00000000000..34c7502beb4
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/meson.build
@@ -0,0 +1,34 @@
+# Copyright (c) 2026, PostgreSQL Global Development Group
+
+pg_session_buffer_usage_sources = files(
+  'pg_session_buffer_usage.c',
+)
+
+if host_system == 'windows'
+  pg_session_buffer_usage_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'pg_session_buffer_usage',
+    '--FILEDESC', 'pg_session_buffer_usage - show buffer usage statistics for the current session',])
+endif
+
+pg_session_buffer_usage = shared_module('pg_session_buffer_usage',
+  pg_session_buffer_usage_sources,
+  kwargs: contrib_mod_args,
+)
+contrib_targets += pg_session_buffer_usage
+
+install_data(
+  'pg_session_buffer_usage--1.0.sql',
+  'pg_session_buffer_usage.control',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'pg_session_buffer_usage',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'regress': {
+    'sql': [
+      'pg_session_buffer_usage',
+    ],
+  },
+}
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql b/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
new file mode 100644
index 00000000000..b300fdbc643
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql
@@ -0,0 +1,31 @@
+/* contrib/pg_session_buffer_usage/pg_session_buffer_usage--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION pg_session_buffer_usage" to load this file. \quit
+
+CREATE FUNCTION pg_session_buffer_usage(
+    OUT shared_blks_hit bigint,
+    OUT shared_blks_read bigint,
+    OUT shared_blks_dirtied bigint,
+    OUT shared_blks_written bigint,
+    OUT local_blks_hit bigint,
+    OUT local_blks_read bigint,
+    OUT local_blks_dirtied bigint,
+    OUT local_blks_written bigint,
+    OUT temp_blks_read bigint,
+    OUT temp_blks_written bigint,
+    OUT shared_blk_read_time double precision,
+    OUT shared_blk_write_time double precision,
+    OUT local_blk_read_time double precision,
+    OUT local_blk_write_time double precision,
+    OUT temp_blk_read_time double precision,
+    OUT temp_blk_write_time double precision
+)
+RETURNS record
+AS 'MODULE_PATHNAME', 'pg_session_buffer_usage'
+LANGUAGE C PARALLEL RESTRICTED;
+
+CREATE FUNCTION pg_session_buffer_usage_reset()
+RETURNS void
+AS 'MODULE_PATHNAME', 'pg_session_buffer_usage_reset'
+LANGUAGE C PARALLEL RESTRICTED;
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
new file mode 100644
index 00000000000..f869956b3a9
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * pg_session_buffer_usage.c
+ *	  show buffer usage statistics for the current session
+ *
+ * Copyright (c) 2026, PostgreSQL Global Development Group
+ *
+ *	  contrib/pg_session_buffer_usage/pg_session_buffer_usage.c
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/htup_details.h"
+#include "executor/instrument.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/memutils.h"
+
+PG_MODULE_MAGIC_EXT(
+					.name = "pg_session_buffer_usage",
+					.version = PG_VERSION
+);
+
+#define NUM_BUFFER_USAGE_COLUMNS 16
+
+PG_FUNCTION_INFO_V1(pg_session_buffer_usage);
+PG_FUNCTION_INFO_V1(pg_session_buffer_usage_reset);
+
+#define HAVE_INSTR_STACK 1		/* Change to 0 when testing before stack
+								 * change */
+
+/*
+ * SQL function: pg_session_buffer_usage()
+ *
+ * Returns a single row with all BufferUsage counters accumulated since the
+ * start of the session. Excludes any usage not yet added to the top of the
+ * stack (e.g. if this gets called inside a statement that also had buffer
+ * activity).
+ */
+Datum
+pg_session_buffer_usage(PG_FUNCTION_ARGS)
+{
+	TupleDesc	tupdesc;
+	Datum		values[NUM_BUFFER_USAGE_COLUMNS];
+	bool		nulls[NUM_BUFFER_USAGE_COLUMNS];
+	BufferUsage *usage;
+
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	memset(nulls, 0, sizeof(nulls));
+
+#if HAVE_INSTR_STACK
+	usage = &instr_top.bufusage;
+#else
+	usage = &pgBufferUsage;
+#endif
+
+	values[0] = Int64GetDatum(usage->shared_blks_hit);
+	values[1] = Int64GetDatum(usage->shared_blks_read);
+	values[2] = Int64GetDatum(usage->shared_blks_dirtied);
+	values[3] = Int64GetDatum(usage->shared_blks_written);
+	values[4] = Int64GetDatum(usage->local_blks_hit);
+	values[5] = Int64GetDatum(usage->local_blks_read);
+	values[6] = Int64GetDatum(usage->local_blks_dirtied);
+	values[7] = Int64GetDatum(usage->local_blks_written);
+	values[8] = Int64GetDatum(usage->temp_blks_read);
+	values[9] = Int64GetDatum(usage->temp_blks_written);
+	values[10] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_read_time));
+	values[11] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->shared_blk_write_time));
+	values[12] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_read_time));
+	values[13] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->local_blk_write_time));
+	values[14] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_read_time));
+	values[15] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(usage->temp_blk_write_time));
+
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
+}
+
+/*
+ * SQL function: pg_session_buffer_usage_reset()
+ *
+ * Resets all BufferUsage counters on the top instrumentation stack to zero.
+ * Useful in tests to avoid the baseline/delta pattern.
+ */
+Datum
+pg_session_buffer_usage_reset(PG_FUNCTION_ARGS)
+{
+#if HAVE_INSTR_STACK
+	memset(&instr_top.bufusage, 0, sizeof(BufferUsage));
+#else
+	memset(&pgBufferUsage, 0, sizeof(BufferUsage));
+#endif
+
+	PG_RETURN_VOID();
+}
diff --git a/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
new file mode 100644
index 00000000000..fabd05ee024
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/pg_session_buffer_usage.control
@@ -0,0 +1,5 @@
+# pg_session_buffer_usage extension
+comment = 'show buffer usage statistics for the current session'
+default_version = '1.0'
+module_pathname = '$libdir/pg_session_buffer_usage'
+relocatable = true
diff --git a/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql b/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
new file mode 100644
index 00000000000..8f5810fadd3
--- /dev/null
+++ b/contrib/pg_session_buffer_usage/sql/pg_session_buffer_usage.sql
@@ -0,0 +1,204 @@
+LOAD 'pg_session_buffer_usage';
+CREATE EXTENSION pg_session_buffer_usage;
+
+-- Verify all columns are non-negative
+SELECT count(*) = 1 AS ok FROM pg_session_buffer_usage()
+WHERE shared_blks_hit >= 0 AND shared_blks_read >= 0
+  AND shared_blks_dirtied >= 0 AND shared_blks_written >= 0
+  AND local_blks_hit >= 0 AND local_blks_read >= 0
+  AND local_blks_dirtied >= 0 AND local_blks_written >= 0
+  AND temp_blks_read >= 0 AND temp_blks_written >= 0
+  AND shared_blk_read_time >= 0 AND shared_blk_write_time >= 0
+  AND local_blk_read_time >= 0 AND local_blk_write_time >= 0
+  AND temp_blk_read_time >= 0 AND temp_blk_write_time >= 0;
+
+-- Verify counters increase after buffer activity
+SELECT pg_session_buffer_usage_reset();
+
+CREATE TEMP TABLE test_buf_activity (id int, data text);
+INSERT INTO test_buf_activity SELECT i, repeat('x', 100) FROM generate_series(1, 1000) AS i;
+SELECT count(*) FROM test_buf_activity;
+
+SELECT local_blks_hit + local_blks_read > 0 AS blocks_increased
+FROM pg_session_buffer_usage();
+
+DROP TABLE test_buf_activity;
+
+-- Parallel query test
+CREATE TABLE par_dc_tab (a int, b char(200));
+INSERT INTO par_dc_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+SELECT count(*) FROM par_dc_tab;
+
+-- Measure serial scan delta (leader does all the work)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+CREATE TEMP TABLE dc_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+
+-- Measure parallel scan delta with leader NOT participating in scanning.
+-- Workers do all table scanning; leader only runs the Gather node.
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT count(*) FROM par_dc_tab;
+
+-- Confirm we got a similar hit counter through parallel worker accumulation
+SELECT shared_blks_hit > s.serial_delta / 2 AND shared_blks_hit < s.serial_delta * 2
+       AS leader_buffers_match
+FROM pg_session_buffer_usage(), dc_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_dc_tab, dc_serial_result;
+
+--
+-- Abort/exception tests: verify buffer usage survives various error paths.
+--
+
+-- Rolled-back divide-by-zero under EXPLAIN ANALYZE
+CREATE TEMP TABLE exc_tab (a int, b char(20));
+
+SELECT pg_session_buffer_usage_reset();
+
+EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+        WITH ins AS (INSERT INTO exc_tab VALUES (1, 'aaa') RETURNING a)
+        SELECT a / 0 FROM ins;
+
+SELECT local_blks_dirtied > 0 AS exception_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP TABLE exc_tab;
+
+-- Unique constraint violation in regular query
+CREATE TEMP TABLE unique_tab (a int UNIQUE, b char(20));
+INSERT INTO unique_tab VALUES (1, 'first');
+
+SELECT pg_session_buffer_usage_reset();
+INSERT INTO unique_tab VALUES (1, 'duplicate');
+
+SELECT local_blks_hit > 0 AS unique_violation_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP TABLE unique_tab;
+
+-- Caught exception in PL/pgSQL subtransaction (BEGIN...EXCEPTION)
+CREATE TEMP TABLE subxact_tab (a int, b char(20));
+
+CREATE FUNCTION subxact_exc_func() RETURNS text AS $$
+BEGIN
+    BEGIN
+        EXECUTE 'EXPLAIN (ANALYZE, BUFFERS, COSTS OFF)
+            WITH ins AS (INSERT INTO subxact_tab VALUES (1, ''aaa'') RETURNING a)
+            SELECT a / 0 FROM ins';
+    EXCEPTION WHEN division_by_zero THEN
+        RETURN 'caught';
+    END;
+    RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT subxact_exc_func();
+
+SELECT local_blks_dirtied > 0 AS subxact_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP FUNCTION subxact_exc_func;
+DROP TABLE subxact_tab;
+
+-- Cursor (FOR loop) in aborted subtransaction; verify post-exception tracking
+CREATE TEMP TABLE cursor_tab (a int, b char(200));
+INSERT INTO cursor_tab SELECT i, repeat('x', 200) FROM generate_series(1, 500) AS i;
+
+CREATE FUNCTION cursor_exc_func() RETURNS text AS $$
+DECLARE
+    rec record;
+    cnt int := 0;
+BEGIN
+    BEGIN
+        FOR rec IN SELECT * FROM cursor_tab LOOP
+            cnt := cnt + 1;
+            IF cnt = 250 THEN
+                PERFORM 1 / 0;
+            END IF;
+        END LOOP;
+    EXCEPTION WHEN division_by_zero THEN
+        RETURN 'caught after ' || cnt || ' rows';
+    END;
+    RETURN 'not reached';
+END;
+$$ LANGUAGE plpgsql;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT cursor_exc_func();
+
+SELECT local_blks_hit + local_blks_read > 0
+       AS cursor_subxact_buffers_visible
+FROM pg_session_buffer_usage();
+
+DROP FUNCTION cursor_exc_func;
+DROP TABLE cursor_tab;
+
+-- Parallel worker abort: worker buffer activity is currently NOT propagated on abort.
+--
+-- When a parallel worker aborts, InstrEndParallelQuery and
+-- ExecParallelReportInstrumentation never run, so the worker's buffer
+-- activity is never written to shared memory, despite the information having been
+-- captured by the res owner release instrumentation handling.
+CREATE TABLE par_abort_tab (a int, b char(200));
+INSERT INTO par_abort_tab SELECT i, repeat('x', 200) FROM generate_series(1, 5000) AS i;
+
+-- Warm shared buffers so all reads become hits
+SELECT count(*) FROM par_abort_tab;
+
+-- Measure serial scan delta as a reference (leader reads all blocks)
+SET max_parallel_workers_per_gather = 0;
+
+SELECT pg_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+CREATE TABLE par_abort_serial_result AS
+SELECT shared_blks_hit AS serial_delta FROM pg_session_buffer_usage();
+
+-- Now force parallel with leader NOT participating in scanning
+SET parallel_setup_cost = 0;
+SET parallel_tuple_cost = 0;
+SET min_parallel_table_scan_size = 0;
+SET max_parallel_workers_per_gather = 2;
+SET parallel_leader_participation = off;
+SET debug_parallel_query = on; -- Ensure we get CONTEXT line consistently
+
+SELECT pg_session_buffer_usage_reset();
+SELECT b::int2 FROM par_abort_tab WHERE a > 1000;
+
+RESET debug_parallel_query;
+
+-- Workers scanned the table but aborted before reporting stats back.
+-- The leader's delta should be much less than a serial scan, documenting
+-- that worker buffer activity is lost on abort.
+SELECT shared_blks_hit < s.serial_delta / 2
+       AS worker_abort_buffers_not_propagated
+FROM pg_session_buffer_usage(), par_abort_serial_result s;
+
+RESET parallel_setup_cost;
+RESET parallel_tuple_cost;
+RESET min_parallel_table_scan_size;
+RESET max_parallel_workers_per_gather;
+RESET parallel_leader_participation;
+
+DROP TABLE par_abort_tab, par_abort_serial_result;
+
+-- Cleanup
+DROP EXTENSION pg_session_buffer_usage;
-- 
2.47.1



view thread (44+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Stack-based tracking of per-node WAL/buffer usage
  In-Reply-To: <CAP53PkxrmpECzVFpeeEEHDGe6u625s+YkmVv5-gw3L_NDSfbiA@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox