public inbox for [email protected]  
Re: pg_stat_io_histogram
4+ messages / 2 participants

* Re: pg_stat_io_histogram
@ 2026-02-18 23:12 Andres Freund <[email protected]>
  2026-02-23 12:30 ` Re: pg_stat_io_histogram Jakub Wartak <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Andres Freund @ 2026-02-18 23:12 UTC (permalink / raw)
  To: Ants Aasma <[email protected]>; +Cc: Jakub Wartak <[email protected]>; PostgreSQL Hackers <[email protected]>

Hi,

On 2026-02-18 19:37:16 +0200, Ants Aasma wrote:
> I was also able to convince clang and gcc to vectorize these loops. I
> had to split the innermost loop so the time calculation with the /1000
> for microseconds conversion and the conditional histogram loop are
> done separately, mark all the loops with nounroll pragmas, and tag the
> innermost loop for vectorization by clang. But looking at the
> benchmark results below, probably not worth the effort.

I don't think the CPU efficiency of flushing stats should ever matter to the
degree that vectorizing should be a goal. If it does, we are either keeping
way too many stats or are flushing way way too often.

What matters is to reduce the overhead when doing process local accounting, as
that will typically happen many orders of magnitude more frequently than
flushing / merging stats.



> For the I/O collection, I tried using prewarm, but got really noisy
> results from it. So instead I created a table with 100k rows with one
> row per page, vacuumed it and benchmarked select count(*) over it.
> Interestingly, setting effective_io_concurrency = 1 made the results
> both more consistent and faster.

It's an aside, but anyway: There's really not a whole lot of benefit to doing
AIO when the data is in the page cache. It can only accelerate things if
either checksum computations or memory bandwidth is a limiting factor, as with
worker mode both can be parallelized.  I don't think the checksum computation
commonly is a bottleneck with proper compiler optimization.  While memory
bandwidth can be a major bottleneck on Intel server architectures, I haven't
seen that on AMD.

I'd probably, just out of paranoia, also test without checksums enabled (to
avoid the memory bandwidth hit) and see if the overhead increases if you
change the query to not need to evaluate expressions (e.g. by using SELECT *
FROM tbl OFFSET large_number, or using pg_prewarm with
maintenance_io_concurrency=1).


One thing to be aware of is that with the rdtsc[p] patch (to substantially
reduce timing overhead), it'll become a tad more expensive to convert an
instr_time to nanoseconds (due to having to convert cycles to nanoseconds).
It may be worth testing the combination.

On that note, why is this measuring things in nanoseconds, given that we
already convert instr_time to microseconds nearby and that it's quite unlikely
that you'd ever have IO times below a microsecond and that
MIN_PG_STAT_IO_HIST_LATENCY already is in the microsecond domain and we
display it as microseconds?


Just rediscovered that the per-backend tracking patch added an external
function call to pgstat_count_io_op_time(), pgstat_count_backend_io_op() and
that a fair number of more recently added branches are constants at the
callsite :(. Probably doesn't matter, but makes me sad nonetheless :)




> I still want to look at the memory overhead more closely. The 30kB per
> backend seems tolerable to me

One thing worth thinking about here is that we probably could stand to
increase the number of IO types further, we e.g. have been talking about
tracking IO that bypasses shared buffers separately.  And a few more context
types (e.g. index inner/leaf) could also make sense.

Without that change that'd be a somewhat moderate increase in memory usage,
but with this change it'd increase a lot more.


> but I think having it in PgStat_BktypeIO is not great. This makes
> PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats snapshot
> be half a megabyte bigger for no reason seems too wasteful.

Yea, that's not awesome.

I guess we could count IO as 4 byte integers, and shift all bucket counts down
in the rare case of an overflow. It's just a 2x improvement, but ...
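
To make the 4-byte idea concrete, here is a minimal sketch, not code from any
posted patch. The struct and function names are hypothetical, and 16 buckets
is just the count discussed in this thread. When a counter is about to
overflow, every bucket is halved, which loses absolute counts but roughly
keeps the shape of the histogram:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; 16 mirrors the bucket count discussed in the thread. */
#define HIST_BUCKETS 16

typedef struct Hist32
{
    uint32_t counts[HIST_BUCKETS];  /* 4-byte counters instead of uint64 */
    uint32_t halvings;              /* how often all counts were shifted down */
} Hist32;

/*
 * Count one event in 'bucket'. If that counter is saturated, shift every
 * bucket down by one bit first, so relative proportions are preserved.
 */
static void
hist32_add(Hist32 *h, int bucket)
{
    assert(bucket >= 0 && bucket < HIST_BUCKETS);

    if (h->counts[bucket] == UINT32_MAX)
    {
        for (int i = 0; i < HIST_BUCKETS; i++)
            h->counts[i] >>= 1;
        h->halvings++;
    }
    h->counts[bucket]++;
}
```

With 16 buckets that is 64 bytes of counters instead of 128, at the price of
the counts only being relative once a halving has happened.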

I think we might need to reduce the number of buckets somewhat.


Right now the lowest bucket is for 0-8 us, the second for 8-16, the third for
16-32. I.e. the first bucket is the same width as the second. Is that
intentional?


Greetings,

Andres Freund







* Re: pg_stat_io_histogram
  2026-02-18 23:12 Re: pg_stat_io_histogram Andres Freund <[email protected]>
@ 2026-02-23 12:30 ` Jakub Wartak <[email protected]>
  2026-02-26 16:13   ` Re: pg_stat_io_histogram Andres Freund <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Jakub Wartak @ 2026-02-23 12:30 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Ants Aasma <[email protected]>; PostgreSQL Hackers <[email protected]>

On Thu, Feb 19, 2026 at 12:12 AM Andres Freund <[email protected]> wrote:

Hi Andres,

> One thing to be aware of is that with the rdtsc[p] patch (to substantially
> reduce timing overhead), it'll become a tad more expensive to convert an
> instr_time to nanoseconds (due to having to convert cycles to nanoseconds).
> It may be worth testing the combination.

I took a quick look at the latest v7-0002 from there [1] and, to sum up, it does:

-#define INSTR_TIME_GET_NANOSEC(t) \
-    ((int64) (t).ticks)

+static inline int64
+pg_ticks_to_ns(int64 ticks)
+{
+#if defined(__x86_64__) || defined(WIN32)
[..]
+    ns += ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+    return ns;
+#else
+    return ticks;
+#endif
+}

[..but!]
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)

+#define INSTR_TIME_GET_NANOSEC(t) \
+    (pg_ticks_to_ns((t).ticks))
+

So at least to my eyes, it looks pretty cheap, doesn't it?
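
For illustration, the power-of-two scaling trick can be sketched in isolation
as below. This is a simplified stand-in, not the code from v7-0002: the helper
names and the kHz-based calibration are assumptions, and whatever the part
elided as [..] above does is not reproduced, so the multiplication here can
overflow for very long intervals.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as in the quoted patch: a power of two, so "/" becomes a shift. */
#define TICKS_TO_NS_PRECISION (1 << 14)

/*
 * Hypothetical calibration helper: given the counter frequency in kHz,
 * return nanoseconds-per-tick scaled by TICKS_TO_NS_PRECISION.
 * ns per tick = 1e6 / khz.
 */
static int64_t
ns_per_tick_scaled_from_khz(int64_t khz)
{
    assert(khz > 0);
    return INT64_C(1000000) * TICKS_TO_NS_PRECISION / khz;
}

/*
 * Convert ticks to nanoseconds using one multiplication and one division by
 * a power-of-two constant. Overflow for huge tick counts is ignored in this
 * sketch.
 */
static inline int64_t
ticks_to_ns(int64_t ticks, int64_t ns_per_tick_scaled)
{
    return ticks * ns_per_tick_scaled / TICKS_TO_NS_PRECISION;
}
```

For a 2 GHz counter the scaled factor is 8192, so 2000 ticks come out as
exactly 1000 ns; frequencies that do not divide evenly lose a little
precision to rounding.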


> On that note, why is this measuring things in nanoseconds, given that we
> already convert instr_time to microseconds nearby and that it's quite unlikely
> that you'd ever have IO times below a microsecond and that
> MIN_PG_STAT_IO_HIST_LATENCY already is in the microsecond domain and we
> display it as microseconds?

Hmm, in an earlier reply you recommended getting away from the conversion to
microseconds, so I did, because the microseconds required really costly
integer divisions [2]:
  "It's annoying to have to convert to microseconds here, that's not free :("

So I went with nanoseconds, because INSTR_TIME_GET_NANOSEC() is still cheap:
it just fetches "ticks".


> > I still want to look at the memory overhead more closely. The 30kB per
> > backend seems tolerable to me
>
> One thing worth thinking about here is that we probably could stand to
> increase the number of IO types further, we e.g. have been talking about
> tracking IO that bypasses shared buffers separately.  And a few more context
> types (e.g. index inner/leaf) could also make sense.
>
> Without that change that'd be a somewhat moderate increase in memory usage,
> but with this change it'd increase a lot more.

OK, point taken, it can grow even further, but..:

> > but I think having it in PgStat_BktypeIO is not great. This makes
> > PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats snapshot
> > be half a megabyte bigger for no reason seems too wasteful.
>
> Yea, that's not awesome.

Guys, a question: could you please explain to me what the drawbacks are of
having this semi-big (internal-only) stats snapshot of 0.5MB? I'm struggling to
understand two things:
a) 0.5MB is not a lot these days (ok, my 286 had 1MB back in the day ;))
b) how does it affect anything, given that testing shows it doesn't?

My understanding is that it only affects the file size on startup/shutdown
in $PGDATADIR/pgstat/pgstat.stat, correct?  My worry is that we introduce
more code (and bugs) for no real gain (?)

> I guess we could count IO as 4 byte integers, and shift all bucket counts down
> in the rare case of an on overflow. It's just a 2x improvement, but ...

[..I'll reply to that in next follow-up]

> I think we might need to reduce the number of buckets somewhat.

I'm kind of skeptical about lowering the bucket count, and Ants even wanted to
increase it, so that we would gain perfect visibility into sometimes
problematic hardware issues (I would also swear there is something magical
about I/Os stuck for 60secs), so we both would want to cover that there,
but we cannot squeeze in more due to performance concerns...

Now there's also this area where we want to understand whether an I/O came
from the page cache or from some fast I/O device, and that's how I arrived at
this first edge of ~8us. If we go one bucket further (that is, make the first
bucket 16us), I was afraid we may start losing the ability to differentiate
page cache vs devices, won't we? (Optane seems to be gone, but it started @
~20us? You said in [3] that it could be even as low as 10? So I thought 8 was
a good bet)

Right now, the final bucket tracks >128ms (==> bad stuff),
and I would love to extend that to >512ms, but we cannot, as it would take
more than 16 buckets (and 16*8bytes_due_to_uint64=128bytes already).

> Right now the lowest bucket is for 0-8 us, the second for 8-16, the third for
> 16-32. I.e. the first bucket is the same width as the second. Is that
> intentional?

Yes, it's intentionally flat at the beginning, to be able to differentiate
those fast accesses.
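
A minimal sketch of how such a bucket layout maps a latency to a bucket index,
assuming microsecond inputs, a first edge of 8us and 16 buckets as described
above. The names are hypothetical and the patch itself may compute this
differently (it adds a configure check for __builtin_clzl; the portable loop
below is only a stand-in for that):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants matching the layout described in the thread. */
#define HIST_BUCKETS         16
#define HIST_MIN_LATENCY_US  8     /* upper edge of the first bucket */

/* Portable floor(log2(v)) for v > 0; a real build would likely use clz. */
static int
floor_log2_u64(uint64_t v)
{
    int r = 0;

    assert(v > 0);
    while (v >>= 1)
        r++;
    return r;
}

/*
 * Bucket 0 is [0, 8) us, bucket 1 is [8, 16), bucket 2 is [16, 32), and so
 * on, doubling each time. Everything from 131072 us (~128 ms) upwards lands
 * in the last bucket.
 */
static int
latency_bucket(uint64_t latency_us)
{
    int idx;

    if (latency_us < HIST_MIN_LATENCY_US)
        return 0;

    idx = floor_log2_u64(latency_us) - 2;   /* 8 us: log2 = 3, bucket 1 */
    return idx < HIST_BUCKETS ? idx : HIST_BUCKETS - 1;
}
```

The first two buckets are both 8us wide, which is the "flat at the beginning"
shape; an I/O stuck for 60 seconds is indistinguishable from one taking 200ms
with this many buckets.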

-J.

[1] - https://www.postgresql.org/message-id/CAP53PkxNJ2Y6G8PEpQn1zKa6ODE6k1-oP9DNqWjkTj%3DdC8_KiA%40mail.g...
[2] - https://www.postgresql.org/message-id/vhzkeogzrrfzjwo3xrnq4xsjh6i37ou6xsbz7yby3lbb3rnxzz%406fpysnkjy...







* Re: pg_stat_io_histogram
  2026-02-18 23:12 Re: pg_stat_io_histogram Andres Freund <[email protected]>
  2026-02-23 12:30 ` Re: pg_stat_io_histogram Jakub Wartak <[email protected]>
@ 2026-02-26 16:13   ` Andres Freund <[email protected]>
  2026-05-20 08:37     ` Re: pg_stat_io_histogram Jakub Wartak <[email protected]>
  0 siblings, 1 reply; 4+ messages in thread

From: Andres Freund @ 2026-02-26 16:13 UTC (permalink / raw)
  To: Jakub Wartak <[email protected]>; +Cc: Ants Aasma <[email protected]>; PostgreSQL Hackers <[email protected]>

Hi,

On 2026-02-23 13:30:44 +0100, Jakub Wartak wrote:
> > > but I think having it in PgStat_BktypeIO is not great. This makes
> > > PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats snapshot
> > > be half a megabyte bigger for no reason seems too wasteful.
> >
> > Yea, that's not awesome.
> 
> Guys, a question: could you please explain to me what the drawbacks are of
> having this semi-big (internal-only) stats snapshot of 0.5MB? I'm struggling to
> understand two things:
> a) 0.5MB is not a lot these days (ok, my 286 had 1MB back in the day ;))

I don't really agree with that, I guess. And even if I did, it's one thing to
use 0.5MB when you actually use it, it's quite another when most of that
memory is never used.


With the patch, *every* backend ends up with a substantially larger
pgStatLocal. Before:

nm -t d --size-sort -r -S src/backend/postgres|head -n20|less
(the second column is the decimal size, third the type of the symbol)

0000000004131808 0000000000297456 r yy_transition
...
0000000003916352 0000000000054744 r UnicodeDecompMain
0000000021004896 0000000000052824 B pgStatLocal
0000000003850592 0000000000040416 r unicode_categories
...

after:
0000000023220512 0000000000329304 B pgStatLocal
0000000018531648 0000000000297456 r yy_transition
...

And because pgStatLocal is zero initialized data, it'll be on-demand-allocated
in every single backend (whereas e.g. yy_transition is read-only shared).  So
you're not talking about a one-time increase, you're multiplying it by the
number of active connections.

Now, it's true that most backends won't ever touch pgStatLocal.  However, most
backends will touch Pending[Backend]IOStats, which also increased noticeably:

before:
0000000021060960 0000000000002880 b PendingIOStats
0000000021057792 0000000000002880 b PendingBackendStats

after:
0000000023568416 0000000000018240 b PendingIOStats
0000000023549888 0000000000018240 b PendingBackendStats


Again, I think some increase here doesn't have to be fatal, but increasing
with mainly impossible-to-use memory seems just too much waste to me.
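
The PendingIOStats numbers can be reproduced with a back-of-the-envelope
model. This is a sketch under assumptions: 3 object types, 5 contexts and 8
operation types (the values listed in the view documentation in the attached
patch), three 8-byte counters per combination before the patch, and 16 uint64
buckets per combination added by it. The struct names are made up for
illustration and are not the real definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed dimensions, see the lead-in; not copied from pgstat.h. */
enum
{
    N_OBJECTS = 3,
    N_CONTEXTS = 5,
    N_OPS = 8,
    N_BUCKETS = 16
};

/* Model of the pending IO stats before the patch: three counters per combo. */
typedef struct PendingBefore
{
    uint64_t bytes[N_OBJECTS][N_CONTEXTS][N_OPS];
    uint64_t counts[N_OBJECTS][N_CONTEXTS][N_OPS];
    uint64_t times[N_OBJECTS][N_CONTEXTS][N_OPS];  /* stand-in for instr_time */
} PendingBefore;

/* Model after the patch: the same plus one histogram per combination. */
typedef struct PendingAfter
{
    uint64_t bytes[N_OBJECTS][N_CONTEXTS][N_OPS];
    uint64_t counts[N_OBJECTS][N_CONTEXTS][N_OPS];
    uint64_t times[N_OBJECTS][N_CONTEXTS][N_OPS];
    uint64_t hist[N_OBJECTS][N_CONTEXTS][N_OPS][N_BUCKETS];
} PendingAfter;

/* Bytes added per pending-stats struct by the histogram array. */
static size_t
pending_growth_bytes(void)
{
    return sizeof(PendingAfter) - sizeof(PendingBefore);
}
```

Under these assumptions the model gives 2880 bytes before and 18240 bytes
after, matching the nm output: 15360 bytes are histogram buckets, allocated
for all 120 combinations whether or not they can ever be counted.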


This also increases the shared-memory usage of pgstats: Before it used ~300kB
on a small system. That nearly doubles with this patch. But that's perhaps
less concerning, given it's per-system, rather than per-backend memory usage.



> b) how does it affect anything, given that testing shows it doesn't?

Which of your testing would conceivably show the effect?  The concern here
isn't really performance, it's that it increases our memory usage, which you'd
only see having an effect if you are tight on memory or have a workload that
is cache sensitive.


> My understanding is that it only affects the file size on startup/shutdown
> in $PGDATADIR/pgstat/pgstat.stat, correct?  My worry is that we introduce
> more code (and bugs) for no real gain (?)

that part is kind of irrelevant compared to the actual increase in memory
usage IMO.

Greetings,

Andres Freund







* Re: pg_stat_io_histogram
  2026-02-18 23:12 Re: pg_stat_io_histogram Andres Freund <[email protected]>
  2026-02-23 12:30 ` Re: pg_stat_io_histogram Jakub Wartak <[email protected]>
  2026-02-26 16:13   ` Re: pg_stat_io_histogram Andres Freund <[email protected]>
@ 2026-05-20 08:37     ` Jakub Wartak <[email protected]>
  0 siblings, 0 replies; 4+ messages in thread

From: Jakub Wartak @ 2026-05-20 08:37 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Ants Aasma <[email protected]>; PostgreSQL Hackers <[email protected]>

On Fri, May 8, 2026 at 11:57 AM Jakub Wartak
<[email protected]> wrote:
>
> On Thu, Mar 19, 2026 at 11:16 AM Jakub Wartak
> <[email protected]> wrote:
> >
> > On Wed, Mar 18, 2026 at 2:29 PM Jakub Wartak
> > <[email protected]> wrote:
> > >
> > > On Tue, Mar 17, 2026 at 3:17 PM Andres Freund <[email protected]> wrote:
> > > > On 2026-03-17 13:13:59 +0100, Jakub Wartak wrote:
> > > > > 1. Concerns about memory use. With v7 I had couple of ideas, and with those
> > > > > the memory use is really minimized as long as the code is still simple
> > > > > (so nothing fancy, just some ideas to trim stuff and dynamically allocate
> > > > > memory). I hope those reduce memory footprint to acceptable levels, see my
> > > > > earlier description for v7.
> > > >
> > > > Personally I unfortunately continue to think that storing lots of values that
> > > > are never anything but zero isn't a good idea once you have more than a
> > > > handful of kB. Storing pointless data is something different than increasing
> > > > memory usage with actual information.
> > > >
> > > > I still think you should just count the number of histograms needed, have an
> > > > array [object][context][op] with the associated histogram "offset" and then
> > > > increment the associated offset.  It'll add an indirection at count time, but
> > > > no additional branches.
> > >
> > > Great idea, thanks, I haven't thought about that! Attached v9 attempts to do
> > > that for pending backend I/O struct, which minimizes the (backend) memory
> > > footprint for client backends to just about ~5kB.
> > >
> > > I have been pulling my hair trying to achieve the same for shared-memory, but I
> > > have failed to do that w/o sinking into complexity [..]
> >
> > OK, I've got it done too, with indirect offsets in shared memory; it wasn't
> > easy, at least for me, but now we have two approaches/patchsets:
> >
> [..]
> > v9b: with more code and build complexity, but that should address the
> >      concern about unused memory
> >
> > 'Shared Memory Stats' allocated size:
> > master - uses ~308kB for shm
> > v9a-000[12]: 578kB shm
> > v9a-000[123]: 507kB shm
> > v9a-000[1234]: 471kB shm (+~163kB more)
> >
> > v9b-000[123]: 361kB shm
> >
> > v9a-000[12] are identical to v9b-00[12], but included just for
> > patchset completeness.
> >
> > In v9b meson/autoconf (for adding pgstat_io_genstats) build what they need
> > most of the time, but that probably needs some cleanups and better
> > dependency tracking. I'm not sure about the correctness of those changes,
> > as especially autoconf/Makefile is a lot like brainf**k to me and that
> > area would need some help...
> >
> > I think now we could even increase the max resolution of buckets to cover
> > a maximum of 32s+ (at the cost of one extra 64-byte cacheline for pending
> > IO stats, so go with PGSTAT_IO_HIST_BUCKETS from 16 to 24)
>
> Good morning all,
>
> Ok, here comes v10, which is a bit like the earlier v9b (so it has a reduced
> shared memory footprint using your idea about indirect offsets), but now with
> shared memory sized and allocated on startup by postmaster. There are 3 patches:
> - 0001, introduces the view and bucketing, no changes for quite some time
> - 0002, saves some private (backend) memory
> - 0003, the main meat, saving shared memory (the main problem raised earlier),
>   now switched to simply dynamically sizing shared memory based on the
>   pgstat_track_io*() logic
>
> The problem with 0003 earlier was that I wanted to absolutely avoid further
> complexity/alterations in struct PgStat_IO related to dynamic shared memory
> allocation for hist_time_buckets_slots[PGSTAT_IO_HIST_BUCKET_SLOTS]
> [PGSTAT_IO_HIST_BUCKETS] (I was afraid to touch that shm code, it looks
> complex), so I had to come up with something that would tell us how many slots
> (PGSTAT_IO_HIST_BUCKET_SLOTS) we need; I wish we had C++'s `constexpr` that
> would do all of that. I've tried three approaches (like in v9b, but that hit
> some serious cross-compiling obstacles; I also had perl doing that, but that
> had lots of code duplication), so in the end I had to alter the pgstat_io
> shm allocation, which is now in 0003.
>
> Summary of changes in 0003 since v9b / earlier post:
> - Fixed potential race condition (touch via memset/memcpy() only histogram
>   slots under LWLock)
> - Fixed/removed the PGSTAT_IO_HIST_BUCKET_SLOTS macro
> - Removed pgstat_io_genslots.c (first idea, above) and abandoned the attempt
>   to fix up some cross compilation woes on MSVC/mingw
> - Bumped PGSTAT_FILE_FORMAT_ID
> - Move/optimize pending_off in pgstat_io_flush_cb out of hot loop
> - Document that hist_time_buckets_offsets should be the last member of
>   PgStat_BktypeIO
> - Be defensive - added some asserts()
> - Adjust _bucket_offsets from uint64 to just int to save memory (offsets are low
>   numbers)
> - and finally moved to dynamic shm allocation of PgStat_IO stuff during
>   startup
>
> At the end of the day, I'll squeeze 000[123] into just one, but wanted to
> ease the review a bit first. Of course this is material for PG20.

Just noticed it needed a rebase (due to c7cb8e5b73c6; renumber_oids.pl), so v11
attached before I forget.
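
A minimal sketch of the indirect-offset idea quoted above, not the code from
the attached patches. All names are hypothetical; the tracking predicate
stands in for the pgstat_track_io*() logic mentioned above, and mapping
untracked combinations to a scratch slot 0 is just one way to keep the
counting path free of extra branches:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed dimensions: 3 objects, 5 contexts, 8 ops, 16 buckets. */
enum { N_OBJECTS = 3, N_CONTEXTS = 5, N_OPS = 8, N_BUCKETS = 16 };

typedef struct HistStore
{
    int       offsets[N_OBJECTS][N_CONTEXTS][N_OPS]; /* combo -> slot number */
    int       nslots;                                /* incl. scratch slot 0 */
    uint64_t *slots;                                 /* nslots * N_BUCKETS */
} HistStore;

/* Example predicate (made up): only the last two ops on object 0 are tracked. */
static bool
example_is_tracked(int object, int context, int op)
{
    (void) context;
    return object == 0 && op >= 6;
}

/* Count the histograms needed, assign offsets, allocate only that many. */
static bool
hist_store_init(HistStore *hs, bool (*is_tracked) (int, int, int))
{
    int next = 1;               /* slot 0 is the scratch slot */

    for (int o = 0; o < N_OBJECTS; o++)
        for (int c = 0; c < N_CONTEXTS; c++)
            for (int op = 0; op < N_OPS; op++)
                hs->offsets[o][c][op] = is_tracked(o, c, op) ? next++ : 0;

    hs->nslots = next;
    hs->slots = calloc((size_t) next * N_BUCKETS, sizeof(uint64_t));
    return hs->slots != NULL;
}

/* One extra load for the indirection, no additional branch. */
static inline void
hist_store_count(HistStore *hs, int o, int c, int op, int bucket)
{
    hs->slots[(size_t) hs->offsets[o][c][op] * N_BUCKETS + bucket]++;
}
```

With the example predicate only 11 slots (10 tracked combinations plus the
scratch slot) are allocated instead of 120, i.e. 1408 bytes of buckets rather
than 15360.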

-J.


Attachments:

  [text/x-patch] v11-0001-Add-pg_stat_io_histogram-view-to-provide-more-de.patch (39.7K, 2-v11-0001-Add-pg_stat_io_histogram-view-to-provide-more-de.patch)
  download | inline diff:
From cb29b625be435f5fab3c8f2f19ab81ae170f3bfc Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 23 Jan 2026 08:10:09 +0100
Subject: [PATCH v11 1/3] Add pg_stat_io_histogram view to provide more
 detailed insight into IO profile

pg_stat_io_histogram displays a histogram of IO latencies for specific
backend_type, object, context and io_type. The histogram has buckets that allow
faster identification of I/O latency outliers due to faulty hardware and/or
misbehaving I/O stack. Such I/O outliers e.g. slow fsyncs could sometimes
cause intermittent issues e.g. for COMMIT or affect the synchronous standbys
performance.

Author: Jakub Wartak <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Ants Aasma <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAKZiRmwvE4uJLKTgPXeBA4m%2Bd4tTghayoefcaM9%3Dz3_S7i72GA%40mail.gmail.com
---
 configure                                   |  38 +++
 configure.ac                                |   1 +
 doc/src/sgml/config.sgml                    |  12 +-
 doc/src/sgml/monitoring.sgml                | 290 ++++++++++++++++++++
 doc/src/sgml/wal.sgml                       |   5 +-
 meson.build                                 |   1 +
 src/backend/catalog/system_views.sql        |  11 +
 src/backend/utils/activity/pgstat.c         |  19 +-
 src/backend/utils/activity/pgstat_backend.c |   4 +-
 src/backend/utils/activity/pgstat_io.c      |  92 ++++++-
 src/backend/utils/adt/pgstatfuncs.c         | 148 ++++++++++
 src/include/catalog/pg_proc.dat             |   9 +
 src/include/pgstat.h                        |  38 ++-
 src/include/port/pg_bitutils.h              |  38 ++-
 src/include/utils/pgstat_internal.h         |   2 +-
 src/test/recovery/t/029_stats_restart.pl    |  29 ++
 src/test/regress/expected/rules.out         |   8 +
 src/tools/pgindent/typedefs.list            |   1 +
 18 files changed, 727 insertions(+), 19 deletions(-)

diff --git a/configure b/configure
index f66c1054a7a..c09329240be 100755
--- a/configure
+++ b/configure
@@ -16054,6 +16054,44 @@ cat >>confdefs.h <<_ACEOF
 #define HAVE__BUILTIN_CLZ 1
 _ACEOF
 
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_clzl" >&5
+$as_echo_n "checking for __builtin_clzl... " >&6; }
+if ${pgac_cv__builtin_clzl+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+int
+call__builtin_clzl(unsigned long x)
+{
+    return __builtin_clzl(x);
+}
+int
+main ()
+{
+
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  pgac_cv__builtin_clzl=yes
+else
+  pgac_cv__builtin_clzl=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv__builtin_clzl" >&5
+$as_echo "$pgac_cv__builtin_clzl" >&6; }
+if test x"${pgac_cv__builtin_clzl}" = xyes ; then
+
+cat >>confdefs.h <<_ACEOF
+#define HAVE__BUILTIN_CLZL 1
+_ACEOF
+
 fi
 { $as_echo "$as_me:${as_lineno-$LINENO}: checking for __builtin_ctz" >&5
 $as_echo_n "checking for __builtin_ctz... " >&6; }
diff --git a/configure.ac b/configure.ac
index 8d176bd3468..8f804464bc5 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1881,6 +1881,7 @@ PGAC_CHECK_BUILTIN_FUNC([__builtin_bswap32], [int x])
 PGAC_CHECK_BUILTIN_FUNC([__builtin_bswap64], [long int x])
 # We assume that we needn't test all widths of these explicitly:
 PGAC_CHECK_BUILTIN_FUNC([__builtin_clz], [unsigned int x])
+PGAC_CHECK_BUILTIN_FUNC([__builtin_clzl], [unsigned long x])
 PGAC_CHECK_BUILTIN_FUNC([__builtin_ctz], [unsigned int x])
 # __builtin_frame_address may draw a diagnostic for non-constant argument,
 # so it needs a different test function.
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 73cc0412330..91b1fd7e635 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -9067,9 +9067,11 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         displayed in <link linkend="monitoring-pg-stat-database-view">
         <structname>pg_stat_database</structname></link>,
         <link linkend="monitoring-pg-stat-io-view">
-        <structname>pg_stat_io</structname></link> (if <varname>object</varname>
-        is not <literal>wal</literal>), in the output of the
-        <link linkend="pg-stat-get-backend-io">
+        <structname>pg_stat_io</structname></link> and
+        <link linkend="monitoring-pg-stat-io-histogram-view">
+        <structname>pg_stat_io_histogram</structname></link>
+        (if <varname>object</varname> is not <literal>wal</literal>),
+        in the output of the <link linkend="pg-stat-get-backend-io">
         <function>pg_stat_get_backend_io()</function></link> function (if
         <varname>object</varname> is not <literal>wal</literal>), in the
         output of <xref linkend="sql-explain"/> when the <literal>BUFFERS</literal>
@@ -9099,7 +9101,9 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         measure the overhead of timing on your system.
         I/O timing information is displayed in
         <link linkend="monitoring-pg-stat-io-view">
-        <structname>pg_stat_io</structname></link> for the
+        <structname>pg_stat_io</structname></link> and
+        <link linkend="monitoring-pg-stat-io-histogram-view">
+        <structname>pg_stat_io_histogram</structname></link> for the
         <varname>object</varname> <literal>wal</literal> and in the output of
         the <link linkend="pg-stat-get-backend-io">
         <function>pg_stat_get_backend_io()</function></link> function for the
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 08d5b824552..e8c5f391841 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -509,6 +509,17 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
      </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_io_histogram</structname><indexterm><primary>pg_stat_io_histogram</primary></indexterm></entry>
+      <entry>
+       One row for each combination of backend type, context, target object,
+       IO operation type and latency bucket (in microseconds) containing
+       cluster-wide I/O statistics.
+       See <link linkend="monitoring-pg-stat-io-histogram-view">
+       <structname>pg_stat_io_histogram</structname></link> for details.
+     </entry>
+     </row>
+
      <row>
       <entry><structname>pg_stat_lock</structname><indexterm><primary>pg_stat_lock</primary></indexterm></entry>
       <entry>
@@ -734,6 +745,8 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    Users are advised to use the <productname>PostgreSQL</productname>
    statistics views in combination with operating system utilities for a more
    complete picture of their database's I/O performance.
+   Furthermore, the <structname>pg_stat_io_histogram</structname> view can be helpful
+   in identifying latency outliers for specific I/O operations.
   </para>
 
  </sect2>
@@ -3302,6 +3315,283 @@ description | Waiting for a newly initialized WAL file to reach durable storage
 
  </sect2>
 
+ <sect2 id="monitoring-pg-stat-io-histogram-view">
+  <title><structname>pg_stat_io_histogram</structname></title>
+
+  <indexterm>
+   <primary>pg_stat_io_histogram</primary>
+  </indexterm>
+
+  <para>
+   The <structname>pg_stat_io_histogram</structname> view will contain one row for each
+   combination of backend type, target I/O object, I/O context, I/O operation
+   type and latency bucket, showing cluster-wide I/O statistics. Combinations which do not
+   make sense are omitted.
+  </para>
+
+  <para>
+   The view shows the I/O latency as perceived by the backend, not the latency of the kernel
+   or the device. This is an important distinction when troubleshooting, as the I/O latency
+   observed by the backend might be affected by:
+   <itemizedlist>
+     <listitem>
+        <para>OS scheduler decisions and available CPU resources.</para>
+        <para>With AIO, it might include time to service other IOs from the queue. That will often inflate IO latency.</para>
+        <para>In case of writing, additional filesystem journaling operations.</para>
+     </listitem>
+  </itemizedlist>
+  </para>
+
+  <para>
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
+  </para>
+
+  <table id="pg-stat-io-histogram-view" xreflabel="pg_stat_io_histogram">
+   <title><structname>pg_stat_io_histogram</structname> View</title>
+   <tgroup cols="1">
+    <thead>
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        Column Type
+       </para>
+       <para>
+        Description
+       </para>
+      </entry>
+     </row>
+    </thead>
+    <tbody>
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>backend_type</structfield> <type>text</type>
+       </para>
+       <para>
+        Type of backend (e.g. background worker, autovacuum worker). See <link
+        linkend="monitoring-pg-stat-activity-view">
+        <structname>pg_stat_activity</structname></link> for more information
+        on <varname>backend_type</varname>s. Some
+        <varname>backend_type</varname>s do not accumulate I/O operation
+        statistics and will not be included in the view.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>object</structfield> <type>text</type>
+       </para>
+       <para>
+        Target object of an I/O operation. Possible values are:
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>relation</literal>: Permanent relations.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>temp relation</literal>: Temporary relations.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
+       </itemizedlist>
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>context</structfield> <type>text</type>
+       </para>
+       <para>
+        The context of an I/O operation. Possible values are:
+       </para>
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>normal</literal>: The default or standard
+          <varname>context</varname> for a type of I/O operation. For
+          example, by default, relation data is read into and written out from
+          shared buffers. Thus, reads and writes of relation data to and from
+          shared buffers are tracked in <varname>context</varname>
+          <literal>normal</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>vacuum</literal>: I/O operations performed outside of shared
+          buffers while vacuuming and analyzing permanent relations. Temporary
+          table vacuums use the same local buffer pool as other temporary table
+          I/O operations and are tracked in <varname>context</varname>
+          <literal>normal</literal>.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>bulkread</literal>: Certain large read I/O operations
+          done outside of shared buffers, for example, a sequential scan of a
+          large table.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>bulkwrite</literal>: Certain large write I/O operations
+          done outside of shared buffers, such as <command>COPY</command>.
+         </para>
+        </listitem>
+       </itemizedlist>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>io_type</structfield> <type>text</type>
+       </para>
+       <para>
+        The type of I/O operation. Possible values are:
+       </para>
+       <itemizedlist>
+        <listitem>
+         <para>
+          <literal>evict</literal>: eviction from shared buffers cache.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>fsync</literal>: synchronization of the kernel's modified
+          filesystem page cache with the storage device.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>hit</literal>: shared buffers cache lookup hit.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>reuse</literal>: reuse of an existing buffer in a
+          limited-space ring buffer (applies to <literal>bulkread</literal>,
+          <literal>bulkwrite</literal>, or <literal>vacuum</literal> contexts).
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>writeback</literal>: advise the kernel that the described
+          dirty data should be flushed to disk, preferably asynchronously.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>extend</literal>: add new zeroed blocks to the end of a file.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>read</literal>: self-explanatory.
+         </para>
+        </listitem>
+        <listitem>
+         <para>
+          <literal>write</literal>: self-explanatory.
+         </para>
+        </listitem>
+       </itemizedlist>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>bucket_latency_us</structfield> <type>int4range</type>
+       </para>
+       <para>
+        The latency bucket (in microseconds).
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>bucket_count</structfield> <type>bigint</type>
+       </para>
+       <para>
+        Number of I/O operations whose latency fell into this bucket (that is,
+        within the <varname>bucket_latency_us</varname> range in microseconds).
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
+       </para>
+       <para>
+        Time at which these statistics were last reset.
+       </para>
+      </entry>
+     </row>
+    </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   Some backend types never perform I/O operations on some I/O objects and/or
+   in some I/O contexts. These rows might display zero bucket counts for such
+   specific operations.
+  </para>
+
+  <para>
+   <structname>pg_stat_io_histogram</structname> can be used to identify
+   I/O storage issues.
+   For example:
+   <itemizedlist>
+    <listitem>
+     <para>
+      Presence of abnormally high latency for <varname>fsyncs</varname> might
+      indicate I/O saturation, oversubscription or hardware connectivity issues.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Unusually high latency for <varname>fsyncs</varname> on a standby's
+      startup backend type might be responsible for long commit durations in
+      synchronous replication setups.
+     </para>
+    </listitem>
+   </itemizedlist>
+  </para>
+
+  <note>
+   <para>
+    Columns tracking I/O wait time will only be non-zero when
+    <xref linkend="guc-track-io-timing"/> is enabled. The user should be
+    careful when referencing these columns in combination with their
+    corresponding I/O operations in case <varname>track_io_timing</varname>
+    was not enabled for the entire time since the last stats reset.
+   </para>
+  </note>
+ </sect2>
 
  <sect2 id="monitoring-pg-stat-lock-view">
   <title><structname>pg_stat_lock</structname></title>
diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
index c32931edde3..531245935da 100644
--- a/doc/src/sgml/wal.sgml
+++ b/doc/src/sgml/wal.sgml
@@ -950,8 +950,9 @@
    of times <function>XLogWrite</function> writes and
    <function>issue_xlog_fsync</function> syncs WAL data to disk are also
    counted as <varname>writes</varname> and <varname>fsyncs</varname>
-   in <structname>pg_stat_io</structname> for the <varname>object</varname>
-   <literal>wal</literal>, respectively.
+   in <structname>pg_stat_io</structname> and
+   <structname>pg_stat_io_histogram</structname> for the
+   <varname>object</varname> <literal>wal</literal>, respectively.
   </para>
 
   <para>
diff --git a/meson.build b/meson.build
index 20b887f1a1b..51058165742 100644
--- a/meson.build
+++ b/meson.build
@@ -2048,6 +2048,7 @@ builtins = [
   'bswap32',
   'bswap64',
   'clz',
+  'clzl',
   'ctz',
   'constant_p',
   'frame_address',
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 73a1c1c4670..a752ab157ba 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1282,6 +1282,17 @@ SELECT
        b.stats_reset
 FROM pg_stat_get_io() b;
 
+CREATE VIEW pg_stat_io_histogram AS
+SELECT
+       b.backend_type,
+       b.object,
+       b.context,
+       b.io_type,
+       b.bucket_latency_us,
+       b.bucket_count,
+       b.stats_reset
+FROM pg_stat_get_io_histogram() b;
+
 CREATE VIEW pg_stat_wal AS
     SELECT
         w.wal_records,
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index b67da88c7dc..9feb2f1370b 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -105,8 +105,10 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "access/xlog.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
+#include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -689,6 +691,14 @@ pgstat_initialize(void)
 	/* Set up a process-exit hook to clean up */
 	before_shmem_exit(pgstat_shutdown_hook, 0);
 
+	/* Allocate I/O latency buckets only if we are going to populate it */
+	if (track_io_timing || track_wal_io_timing)
+		PendingIOStats.pending_hist_time_buckets = MemoryContextAllocZero(TopMemoryContext,
+																		  IOOBJECT_NUM_TYPES * IOCONTEXT_NUM_TYPES * IOOP_NUM_TYPES *
+																		  PGSTAT_IO_HIST_BUCKETS * sizeof(uint64));
+	else
+		PendingIOStats.pending_hist_time_buckets = NULL;
+
 #ifdef USE_ASSERT_CHECKING
 	pgstat_is_initialized = true;
 #endif
@@ -1668,10 +1678,17 @@ pgstat_write_statsfile(void)
 
 		pgstat_build_snapshot_fixed(kind);
 		if (pgstat_is_kind_builtin(kind))
-			ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off;
+		{
+			if (kind == PGSTAT_KIND_IO)
+				ptr = (char *) pgStatLocal.snapshot.io;
+			else
+				ptr = ((char *) &pgStatLocal.snapshot) + info->snapshot_ctl_off;
+		}
 		else
 			ptr = pgStatLocal.snapshot.custom_data[kind - PGSTAT_KIND_CUSTOM_MIN];
 
+		Assert(ptr != NULL);
+
 		fputc(PGSTAT_FILE_ENTRY_FIXED, fpout);
 		pgstat_write_chunk_s(fpout, &kind);
 		pgstat_write_chunk(fpout, ptr, info->shared_data_len);
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index 73461c9bca5..fc1bf824a31 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -168,7 +168,7 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 {
 	PgStatShared_Backend *shbackendent;
 	PgStat_BktypeIO *bktype_shstats;
-	PgStat_PendingIO pending_io;
+	PgStat_BackendPendingIO pending_io;
 
 	/*
 	 * This function can be called even if nothing at all has happened for IO
@@ -205,7 +205,7 @@ pgstat_flush_backend_entry_io(PgStat_EntryRef *entry_ref)
 	/*
 	 * Clear out the statistics buffer, so it can be re-used.
 	 */
-	MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_PendingIO));
+	MemSet(&PendingBackendStats.pending_io, 0, sizeof(PgStat_BackendPendingIO));
 
 	backend_has_iostats = false;
 }
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 13a5d8e6440..c2faada6487 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -17,10 +17,12 @@
 #include "postgres.h"
 
 #include "executor/instrument.h"
+#include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
+#include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
 
-static PgStat_PendingIO PendingIOStats;
+PgStat_PendingIO PendingIOStats;
 static bool have_iostats = false;
 
 /*
@@ -107,6 +109,35 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+#define MIN_PG_STAT_IO_HIST_LATENCY 8191
+static inline int
+get_bucket_index(uint64_t ns)
+{
+	const uint32_t max_index = PGSTAT_IO_HIST_BUCKETS - 1;
+
+	/*
+	 * hopefully pre-calculated by the compiler: clzl(8191) =
+	 * clz(01111111111111b on uint64)
+	 */
+	const uint32_t min_latency_leading_zeros =
+		pg_leading_zero_bits64(MIN_PG_STAT_IO_HIST_LATENCY);
+
+	/*
+	 * make sure the tmp value is at least 8191 (our minimum bucket size), as
+	 * the result of __builtin_clzl is undefined when operating on 0
+	 */
+	uint64_t	tmp = ns | MIN_PG_STAT_IO_HIST_LATENCY;
+
+	/* count leading zeros */
+	int			leading_zeros = pg_leading_zero_bits64(tmp);
+
+	/* normalize the index */
+	uint32_t	index = min_latency_leading_zeros - leading_zeros;
+
+	/* clamp it to the maximum */
+	return (index > max_index) ? max_index : index;
+}
+
 /*
  * Like pgstat_count_io_op() except it also accumulates time.
  *
@@ -125,6 +156,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 	if (!INSTR_TIME_IS_ZERO(start_time))
 	{
 		instr_time	io_time;
+		int			bucket_index;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
@@ -152,6 +184,16 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op],
 					   io_time);
 
+		if (PendingIOStats.pending_hist_time_buckets != NULL)
+		{
+			/*
+			 * calculate the bucket_index based on latency in nanoseconds
+			 * (uint64)
+			 */
+			bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
+			PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+		}
+
 		/* Add the per-backend count */
 		pgstat_count_backend_io_op_time(io_object, io_context, io_op,
 										io_time);
@@ -165,7 +207,7 @@ pgstat_fetch_stat_io(void)
 {
 	pgstat_snapshot_fixed(PGSTAT_KIND_IO);
 
-	return &pgStatLocal.snapshot.io;
+	return pgStatLocal.snapshot.io;
 }
 
 /*
@@ -221,6 +263,11 @@ pgstat_io_flush_cb(bool nowait)
 
 				bktype_shstats->times[io_object][io_context][io_op] +=
 					INSTR_TIME_GET_MICROSEC(time);
+
+				if (PendingIOStats.pending_hist_time_buckets != NULL)
+					for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
+						bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+							PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
 			}
 		}
 	}
@@ -229,7 +276,8 @@ pgstat_io_flush_cb(bool nowait)
 
 	LWLockRelease(bktype_lock);
 
-	memset(&PendingIOStats, 0, sizeof(PendingIOStats));
+	/* Avoid overwriting latency buckets array pointer */
+	memset(&PendingIOStats, 0, offsetof(PgStat_PendingIO, pending_hist_time_buckets));
 
 	have_iostats = false;
 
@@ -274,6 +322,33 @@ pgstat_get_io_object_name(IOObject io_object)
 	pg_unreachable();
 }
 
+const char *
+pgstat_get_io_op_name(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_EVICT:
+			return "evict";
+		case IOOP_FSYNC:
+			return "fsync";
+		case IOOP_HIT:
+			return "hit";
+		case IOOP_REUSE:
+			return "reuse";
+		case IOOP_WRITEBACK:
+			return "writeback";
+		case IOOP_EXTEND:
+			return "extend";
+		case IOOP_READ:
+			return "read";
+		case IOOP_WRITE:
+			return "write";
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
+
 void
 pgstat_io_init_shmem_cb(void *stats)
 {
@@ -281,6 +356,9 @@ pgstat_io_init_shmem_cb(void *stats)
 
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 		LWLockInitialize(&stat_shmem->locks[i], LWTRANCHE_PGSTATS_DATA);
+
+	/* this might end up being lazily allocated in pgstat_io_snapshot_cb() */
+	pgStatLocal.snapshot.io = NULL;
 }
 
 void
@@ -308,11 +386,15 @@ pgstat_io_reset_all_cb(TimestampTz ts)
 void
 pgstat_io_snapshot_cb(void)
 {
+	if (unlikely(pgStatLocal.snapshot.io == NULL))
+		pgStatLocal.snapshot.io = MemoryContextAllocZero(TopMemoryContext,
+														 sizeof(PgStat_IO));
+
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
 		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
-		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io.stats[i];
+		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io->stats[i];
 
 		LWLockAcquire(bktype_lock, LW_SHARED);
 
@@ -321,7 +403,7 @@ pgstat_io_snapshot_cb(void)
 		 * the reset timestamp as well.
 		 */
 		if (i == 0)
-			pgStatLocal.snapshot.io.stat_reset_timestamp =
+			pgStatLocal.snapshot.io->stat_reset_timestamp =
 				pgStatLocal.shmem->io.stats.stat_reset_timestamp;
 
 		/* using struct assignment due to better type safety */
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 6f9c9c72de5..e16c65d45e9 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -18,6 +18,7 @@
 #include "access/xlog.h"
 #include "access/xlogprefetcher.h"
 #include "catalog/catalog.h"
+#include "catalog/namespace.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
 #include "common/ip.h"
@@ -30,6 +31,7 @@
 #include "storage/procarray.h"
 #include "utils/acl.h"
 #include "utils/builtins.h"
+#include "utils/rangetypes.h"
 #include "utils/timestamp.h"
 #include "utils/tuplestore.h"
 #include "utils/wait_event.h"
@@ -1638,6 +1640,152 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS)
 	return (Datum) 0;
 }
 
+/*
+ * When adding a new column to the pg_stat_io_histogram view and the
+ * pg_stat_get_io_histogram() function, add a new enum value here above
+ * HIST_IO_NUM_COLUMNS.
+ */
+typedef enum hist_io_stat_col
+{
+	HIST_IO_COL_INVALID = -1,
+	HIST_IO_COL_BACKEND_TYPE,
+	HIST_IO_COL_OBJECT,
+	HIST_IO_COL_CONTEXT,
+	HIST_IO_COL_IOTYPE,
+	HIST_IO_COL_BUCKET_US,
+	HIST_IO_COL_COUNT,
+	HIST_IO_COL_RESET_TIME,
+	HIST_IO_NUM_COLUMNS
+} histogram_io_stat_col;
+
+/*
+ * pg_stat_io_histogram_build_tuples
+ *
+ * Helper routine for pg_stat_get_io_histogram(), filling a result tuplestore
+ * with one tuple for each object, context, I/O operation and latency bucket
+ * supported by the caller, based on the contents of bktype_stats.
+ */
+static void
+pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
+								  PgStat_BktypeIO *bktype_stats,
+								  BackendType bktype,
+								  TimestampTz stat_reset_timestamp)
+{
+	Datum		bktype_desc = CStringGetTextDatum(GetBackendTypeDesc(bktype));
+	/* Get OID for int4range type */
+	Oid			range_typid = TypenameGetTypid("int4range");
+	TypeCacheEntry *typcache = lookup_type_cache(range_typid, TYPECACHE_RANGE_INFO);
+
+	for (int io_obj = 0; io_obj < IOOBJECT_NUM_TYPES; io_obj++)
+	{
+		const char *obj_name = pgstat_get_io_object_name(io_obj);
+
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			const char *context_name = pgstat_get_io_context_name(io_context);
+
+			/*
+			 * Some combinations of BackendType, IOObject, and IOContext are
+			 * not valid for any type of IOOp. In such cases, omit the entire
+			 * row from the view.
+			 */
+			if (!pgstat_tracks_io_object(bktype, io_obj, io_context))
+				continue;
+
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				const char *op_name = pgstat_get_io_op_name(io_op);
+
+				for (int bucket = 0; bucket < PGSTAT_IO_HIST_BUCKETS; bucket++)
+				{
+					Datum		values[HIST_IO_NUM_COLUMNS] = {0};
+					bool		nulls[HIST_IO_NUM_COLUMNS] = {0};
+					RangeBound	lower,
+								upper;
+					RangeType  *range;
+
+					values[HIST_IO_COL_BACKEND_TYPE] = bktype_desc;
+					values[HIST_IO_COL_OBJECT] = CStringGetTextDatum(obj_name);
+					values[HIST_IO_COL_CONTEXT] = CStringGetTextDatum(context_name);
+					values[HIST_IO_COL_IOTYPE] = CStringGetTextDatum(op_name);
+
+					/* bucket's latency bounds as a range in microseconds */
+					if (bucket == 0)
+						lower.val = Int32GetDatum(0);
+					else
+						lower.val = Int32GetDatum(1 << (2 + bucket));
+					lower.infinite = false;
+					lower.inclusive = true;
+					lower.lower = true;
+
+					if (bucket == PGSTAT_IO_HIST_BUCKETS - 1)
+						upper.infinite = true;
+					else
+					{
+						upper.val = Int32GetDatum(1 << (2 + bucket + 1));
+						upper.infinite = false;
+					}
+					upper.inclusive = false;
+					upper.lower = false;
+
+					range = make_range(typcache, &lower, &upper, false, NULL);
+					values[HIST_IO_COL_BUCKET_US] = RangeTypePGetDatum(range);
+
+					/* bucket count */
+					values[HIST_IO_COL_COUNT] = Int64GetDatum(
+															  bktype_stats->hist_time_buckets[io_obj][io_context][io_op][bucket]);
+
+					if (stat_reset_timestamp != 0)
+						values[HIST_IO_COL_RESET_TIME] = TimestampTzGetDatum(stat_reset_timestamp);
+					else
+						nulls[HIST_IO_COL_RESET_TIME] = true;
+
+					tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+										 values, nulls);
+				}
+			}
+		}
+	}
+}
+
+Datum
+pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
+{
+	ReturnSetInfo *rsinfo;
+	PgStat_IO  *backends_io_stats;
+
+	InitMaterializedSRF(fcinfo, 0);
+	rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+
+	backends_io_stats = pgstat_fetch_stat_io();
+
+	for (int bktype = 0; bktype < BACKEND_NUM_TYPES; bktype++)
+	{
+		PgStat_BktypeIO *bktype_stats = &backends_io_stats->stats[bktype];
+
+		/*
+		 * In Assert builds, we can afford an extra loop through all of the
+		 * counters (in pg_stat_io_build_tuples()), checking that only
+		 * expected stats are non-zero, since it keeps the non-Assert code
+		 * cleaner.
+		 */
+		Assert(pgstat_bktype_io_stats_valid(bktype_stats, bktype));
+
+		/*
+		 * For those BackendTypes without IO Operation stats, skip
+		 * representing them in the view altogether.
+		 */
+		if (!pgstat_tracks_io_bktype(bktype))
+			continue;
+
+		/* save tuples with data from this PgStat_BktypeIO */
+		pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, bktype,
+										  backends_io_stats->stat_reset_timestamp);
+	}
+
+	return (Datum) 0;
+}
+
 /*
  * pg_stat_wal_build_tuple
  *
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index be157a5fbe9..159d912515c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6061,6 +6061,15 @@
   proargnames => '{backend_type,object,context,reads,read_bytes,read_time,writes,write_bytes,write_time,writebacks,writeback_time,extends,extend_bytes,extend_time,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
+{ oid => '6149', descr => 'statistics: per backend type IO latency histogram',
+  proname => 'pg_stat_get_io_histogram', prorows => '30', proretset => 't',
+  provolatile => 'v', proparallel => 'r', prorettype => 'record',
+  proargtypes => '',
+  proallargtypes => '{text,text,text,text,int4range,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,object,context,io_type,bucket_latency_us,bucket_count,stats_reset}',
+  prosrc => 'pg_stat_get_io_histogram' },
+
 { oid => '6509', descr => 'statistics: per lock type statistics',
   proname => 'pg_stat_get_lock', prorows => '10', proretset => 't',
   provolatile => 'v', proparallel => 'r', prorettype => 'record',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index dfa2e837638..34fd93f86dc 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -326,11 +326,23 @@ typedef enum IOOp
 	(((unsigned int) (io_op)) < IOOP_NUM_TYPES && \
 	 ((unsigned int) (io_op)) >= IOOP_EXTEND)
 
+/*
+ * This should represent a balance between being fast and providing value
+ * to the users:
+ * 1. We want to cover various fast and slow device types (0.01ms - 15ms)
+ * 2. We want to also cover sporadic long tail latencies (hardware issues,
+ *    delayed fsyncs, stuck I/O)
+ * 3. We want to be as small as possible here in terms of size:
+ *    16 * sizeof(uint64) = 128 bytes, which should fit in two cachelines.
+ */
+#define PGSTAT_IO_HIST_BUCKETS 16
+
 typedef struct PgStat_BktypeIO
 {
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	uint64		hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_PendingIO
@@ -338,8 +350,18 @@ typedef struct PgStat_PendingIO
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+
+	/*
+	 * Dynamically allocated array of
+	 * [IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES]
+	 * [IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS], allocated only when
+	 * track_io_timing or track_wal_io_timing is enabled.
+	 */
+	uint64		(*pending_hist_time_buckets)[IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
 } PgStat_PendingIO;
 
+extern PgStat_PendingIO PendingIOStats;
+
 typedef struct PgStat_IO
 {
 	TimestampTz stat_reset_timestamp;
@@ -526,15 +548,24 @@ typedef struct PgStat_Backend
 } PgStat_Backend;
 
 /* ---------
- * PgStat_BackendPending	Non-flushed backend stats.
+ * PgStat_BackendPending(IO)	Non-flushed backend stats.
  * ---------
  */
+typedef struct PgStat_BackendPendingIO
+{
+	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+}			PgStat_BackendPendingIO;
+
 typedef struct PgStat_BackendPending
 {
 	/*
-	 * Backend statistics store the same amount of IO data as PGSTAT_KIND_IO.
+	 * Backend statistics store almost the same amount of IO data as
+	 * PGSTAT_KIND_IO. The only difference between PgStat_BackendPendingIO and
+	 * PgStat_PendingIO is that the latter also tracks IO latency histograms.
 	 */
-	PgStat_PendingIO pending_io;
+	PgStat_BackendPendingIO pending_io;
 } PgStat_BackendPending;
 
 /*
@@ -624,6 +655,7 @@ extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
+extern const char *pgstat_get_io_op_name(IOOp io_op);
 
 extern bool pgstat_tracks_io_bktype(BackendType bktype);
 extern bool pgstat_tracks_io_object(BackendType bktype,
diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 7a00d197013..b27913a2ad8 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -32,6 +32,42 @@ extern PGDLLIMPORT const uint8 pg_leftmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_rightmost_one_pos[256];
 extern PGDLLIMPORT const uint8 pg_number_of_ones[256];
 
+
+/*
+ * pg_leading_zero_bits64
+ *		Returns the number of leading 0-bits in x, starting at the most significant bit position.
+ *		Word must not be 0 (as it is undefined behavior).
+ */
+static inline int
+pg_leading_zero_bits64(uint64 word)
+{
+#ifdef HAVE__BUILTIN_CLZL
+	Assert(word != 0);
+
+#if SIZEOF_LONG == 8
+	return __builtin_clzl(word);
+#elif SIZEOF_LONG_LONG == 8
+	return __builtin_clzll(word);
+#else
+#error "cannot find integer type of the same size as uint64_t"
+#endif
+
+#else
+	uint64 y;
+	int n = 64;
+	if (word == 0)
+		return 64;
+
+	y = word >> 32; if (y != 0) { n -= 32; word = y; }
+	y = word >> 16; if (y != 0) { n -= 16; word = y; }
+	y = word >> 8;  if (y != 0) { n -= 8;  word = y; }
+	y = word >> 4;  if (y != 0) { n -= 4;  word = y; }
+	y = word >> 2;  if (y != 0) { n -= 2;  word = y; }
+	y = word >> 1;  if (y != 0) { return n - 2; }
+	return n - 1;
+#endif
+}
+
 /*
  * pg_leftmost_one_pos32
  *		Returns the position of the most significant set bit in "word",
@@ -71,7 +107,7 @@ pg_leftmost_one_pos32(uint32 word)
 static inline int
 pg_leftmost_one_pos64(uint64 word)
 {
-#ifdef HAVE__BUILTIN_CLZ
+#ifdef HAVE__BUILTIN_CLZL
 	Assert(word != 0);
 
 #if SIZEOF_LONG == 8
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index fe463faaf63..a3ce8b04723 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -608,7 +608,7 @@ typedef struct PgStat_Snapshot
 
 	PgStat_CheckpointerStats checkpointer;
 
-	PgStat_IO	io;
+	PgStat_IO  *io;
 
 	PgStat_Lock lock;
 
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index cdc427dbc78..33939c8701a 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -293,7 +293,36 @@ cmp_ok(
 	$wal_restart_immediate->{reset},
 	"$sect: reset timestamp is new");
 
+
+## Test pg_stat_io_histogram: its buckets are allocated at backend start, so
+## only new backends started with track_[io|wal_io]_timing set globally fill it
+$sect = "pg_stat_io_histogram";
+$node->append_conf('postgresql.conf', "track_io_timing = 'on'");
+$node->append_conf('postgresql.conf', "track_wal_io_timing = 'on'");
+$node->restart;
+
+
+## Check that pg_stat_io_histogram shows growing bucket counts.
+## We could also try with checkpointer, but it often runs with fsync=off
+## during test.
+my $countbefore = $node->safe_psql('postgres',
+	"SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram() " .
+	"WHERE backend_type='client backend' AND object='relation' AND context='normal'");
+
+$node->safe_psql('postgres', "CREATE TABLE test_io_hist(id bigint);");
+$node->safe_psql('postgres', "INSERT INTO test_io_hist SELECT generate_series(1, 100) s;");
+$node->safe_psql('postgres', "SELECT pg_stat_force_next_flush();");
+
+my $countafter = $node->safe_psql('postgres',
+	"SELECT sum(bucket_count) AS hist_bucket_count_sum FROM pg_stat_get_io_histogram() " .
+	"WHERE backend_type='client backend' AND object='relation' AND context='normal'");
+
+cmp_ok(
+	$countafter, '>', $countbefore,
+	"pg_stat_io_histogram: latency buckets growing");
+
 $node->stop;
+
 done_testing();
 
 sub trigger_funcrel_stat
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index a65a5bf0c4f..c0067cb653b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1967,6 +1967,14 @@ pg_stat_io| SELECT backend_type,
     fsync_time,
     stats_reset
    FROM pg_stat_get_io() b(backend_type, object, context, reads, read_bytes, read_time, writes, write_bytes, write_time, writebacks, writeback_time, extends, extend_bytes, extend_time, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
+pg_stat_io_histogram| SELECT backend_type,
+    object,
+    context,
+    io_type,
+    bucket_latency_us,
+    bucket_count,
+    stats_reset
+   FROM pg_stat_get_io_histogram() b(backend_type, object, context, io_type, bucket_latency_us, bucket_count, stats_reset);
 pg_stat_lock| SELECT locktype,
     waits,
     wait_time,
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 8cf40c87043..ce52e7619fd 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3858,6 +3858,7 @@ gzFile
 having_collation_ctx
 heap_page_items_state
 help_handler
+histogram_io_stat_col
 hlCheck
 host_cache_hash
 hstoreCheckKeyLen_t
-- 
2.43.0
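
Editorial aside, not part of the patch: the mapping from a latency to a histogram bucket in get_bucket_index(), and the ranges that pg_stat_io_histogram_build_tuples() reports, can be tried outside the server with the minimal sketch below. It assumes 16 buckets and a minimum latency of 8191 ns, as in the patch. The names clz64, bucket_index, bucket_lower_us and bucket_upper_us are made up for this illustration, and the loop-based clz64 is only a stand-in for pg_leading_zero_bits64().

```c
#include <assert.h>
#include <stdint.h>

#define HIST_BUCKETS   16		/* mirrors PGSTAT_IO_HIST_BUCKETS */
#define MIN_LATENCY_NS 8191		/* mirrors MIN_PG_STAT_IO_HIST_LATENCY */

/* Count leading zero bits of a non-zero 64-bit value (portable stand-in). */
static int
clz64(uint64_t x)
{
	int			n = 0;

	assert(x != 0);
	while ((x & (UINT64_C(1) << 63)) == 0)
	{
		x <<= 1;
		n++;
	}
	return n;
}

/* Map a latency in nanoseconds to a bucket index in [0, HIST_BUCKETS - 1]. */
static int
bucket_index(uint64_t ns)
{
	int			min_lz = clz64(MIN_LATENCY_NS);	/* 51 for 8191 */
	/* OR-ing in the minimum keeps the argument non-zero and >= 8191 */
	int			idx = min_lz - clz64(ns | MIN_LATENCY_NS);

	return (idx > HIST_BUCKETS - 1) ? HIST_BUCKETS - 1 : idx;
}

/* Lower bound (inclusive) of a bucket in microseconds, as shown by the view. */
static int64_t
bucket_lower_us(int bucket)
{
	return (bucket == 0) ? 0 : (INT64_C(1) << (2 + bucket));
}

/* Upper bound (exclusive) in microseconds; -1 stands for "unbounded". */
static int64_t
bucket_upper_us(int bucket)
{
	return (bucket == HIST_BUCKETS - 1) ? -1 : (INT64_C(1) << (3 + bucket));
}
```

Under these assumptions a 1 ms operation (1000000 ns) lands in bucket 7, which the view would report as [512,1024) microseconds. Note that the bucket boundaries are powers of two in nanoseconds (8192 ns, 16384 ns, ...) while the reported ranges are powers of two in microseconds, so the displayed bounds appear to be slightly off from the real ones (by roughly 2.4%).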



  [text/x-patch] v11-0002-Lower-pg_stat_io_histogram-private-backend-memor.patch (8.6K, 3-v11-0002-Lower-pg_stat_io_histogram-private-backend-memor.patch)
  download | inline diff:
From e4ebec91ac9b9a9984afeacc62d6f216569c2a29 Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Wed, 18 Mar 2026 07:24:14 +0100
Subject: [PATCH v11 2/3] Lower pg_stat_io_histogram private (backend) memory
 in pending_hist_time_buckets by using array with indirect offsets.

---
 src/backend/utils/activity/pgstat.c    |  9 +--
 src/backend/utils/activity/pgstat_io.c | 90 ++++++++++++++++++++++++--
 src/include/pgstat.h                   | 19 ++++--
 src/include/utils/pgstat_internal.h    |  1 +
 4 files changed, 102 insertions(+), 17 deletions(-)
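
Editorial aside, not part of the patch: the "array with indirect offsets" idea from the subject line can be illustrated with the minimal sketch below. It is not the patch's code. The dimensions, the PendingHist struct, the tracks() predicate (a stand-in for pgstat_tracks_io_op()) and the helper names are all hypothetical. Only combinations that can be used get a row of buckets, and a dense table maps each (object, context, op) triple to its row, or to -1 when it is not tracked.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, small dimensions chosen only for this illustration. */
enum { N_OBJ = 2, N_CTX = 3, N_OP = 4, N_BUCKETS = 16 };

typedef struct PendingHist
{
	int16_t		offsets[N_OBJ][N_CTX][N_OP];	/* row index, or -1 */
	int			nused;							/* number of rows allocated */
	uint64_t	(*buckets)[N_BUCKETS];			/* nused rows of counters */
} PendingHist;

/* Hypothetical stand-in for pgstat_tracks_io_op(): track even ops only. */
static int
tracks(int obj, int ctx, int op)
{
	(void) obj;
	(void) ctx;
	return op % 2 == 0;
}

/* Assign a row to every tracked combination and allocate only those rows. */
static int
pending_hist_init(PendingHist *h)
{
	int			next = 0;

	for (int obj = 0; obj < N_OBJ; obj++)
		for (int ctx = 0; ctx < N_CTX; ctx++)
			for (int op = 0; op < N_OP; op++)
				h->offsets[obj][ctx][op] = tracks(obj, ctx, op) ? next++ : -1;

	h->nused = next;
	h->buckets = calloc(next, sizeof(*h->buckets));
	return h->buckets != NULL;
}

/* Bump one bucket; untracked combinations are silently ignored here. */
static void
pending_hist_count(PendingHist *h, int obj, int ctx, int op, int bucket)
{
	int			off = h->offsets[obj][ctx][op];

	if (off >= 0)
		h->buckets[off][bucket]++;
}
```

With the made-up predicate above, 12 of the 24 combinations are tracked, so the counter array is half the size of a fully dense one at the cost of one extra table lookup per counted operation.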

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 9feb2f1370b..7c597932671 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -445,6 +445,7 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
 		.shared_data_off = offsetof(PgStatShared_IO, stats),
 		.shared_data_len = sizeof(((PgStatShared_IO *) 0)->stats),
 
+		.init_backend_cb = pgstat_io_init_backend_cb,
 		.flush_static_cb = pgstat_io_flush_cb,
 		.init_shmem_cb = pgstat_io_init_shmem_cb,
 		.reset_all_cb = pgstat_io_reset_all_cb,
@@ -691,14 +692,6 @@ pgstat_initialize(void)
 	/* Set up a process-exit hook to clean up */
 	before_shmem_exit(pgstat_shutdown_hook, 0);
 
-	/* Allocate I/O latency buckets only if we are going to populate it */
-	if (track_io_timing || track_wal_io_timing)
-		PendingIOStats.pending_hist_time_buckets = MemoryContextAllocZero(TopMemoryContext,
-																		  IOOBJECT_NUM_TYPES * IOCONTEXT_NUM_TYPES * IOOP_NUM_TYPES *
-																		  PGSTAT_IO_HIST_BUCKETS * sizeof(uint64));
-	else
-		PendingIOStats.pending_hist_time_buckets = NULL;
-
 #ifdef USE_ASSERT_CHECKING
 	pgstat_is_initialized = true;
 #endif
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index c2faada6487..4c655d38b97 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,6 +16,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
@@ -66,6 +67,27 @@ pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 	return true;
 }
 
+int
+pgstat_bktype_count_potentially_used(BackendType bktype)
+{
+	int			cnt = 0;
+
+	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+	{
+		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+		{
+			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+			{
+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+					cnt++;
+			}
+		}
+	}
+
+	return cnt;
+}
+
 void
 pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op,
 				   uint32 cnt, uint64 bytes)
@@ -186,12 +208,16 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 
 		if (PendingIOStats.pending_hist_time_buckets != NULL)
 		{
+			int			offset;
+
 			/*
 			 * calculate the bucket_index based on latency in nanoseconds
 			 * (uint64)
 			 */
 			bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
-			PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++;
+
+			offset = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op];
+			PendingIOStats.pending_hist_time_buckets[offset][bucket_index]++;
 		}
 
 		/* Add the per-backend count */
@@ -264,10 +290,23 @@ pgstat_io_flush_cb(bool nowait)
 				bktype_shstats->times[io_object][io_context][io_op] +=
 					INSTR_TIME_GET_MICROSEC(time);
 
+				/*
+				 * If I/O timing is tracked, add the backend-local I/O
+				 * histograms from PendingIOStats to shared memory. The
+				 * pending_hist_time_buckets dynamic array is accessed through
+				 * indirect offsets to save memory.
+				 */
 				if (PendingIOStats.pending_hist_time_buckets != NULL)
 					for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
-						bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
-							PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
+					{
+						int			pending_off = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op];
+
+						if (pending_off != -1)
+						{
+							bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+								PendingIOStats.pending_hist_time_buckets[pending_off][b];
+						}
+					}
 			}
 		}
 	}
@@ -276,8 +315,14 @@ pgstat_io_flush_cb(bool nowait)
 
 	LWLockRelease(bktype_lock);
 
-	/* Avoid overwriting latency buckets array pointer */
+	/*
+	 * Avoid overwriting histogram latency array (with offsets) and pointer to
+	 * dynamically allocated memory
+	 */
 	memset(&PendingIOStats, 0, offsetof(PgStat_PendingIO, pending_hist_time_buckets));
+	if (PendingIOStats.pending_hist_time_buckets != NULL)
+		memset(PendingIOStats.pending_hist_time_buckets, 0,
+			   PendingIOStats.pending_hist_time_buckets_size * sizeof(*PendingIOStats.pending_hist_time_buckets));
 
 	have_iostats = false;
 
@@ -349,6 +394,43 @@ pgstat_get_io_op_name(IOOp io_op)
 	pg_unreachable();
 }
 
+void
+pgstat_io_init_backend_cb(void)
+{
+	/* Allocate I/O latency buckets only if we are going to populate it */
+	if (track_io_timing || track_wal_io_timing)
+	{
+		int			alloc_sz,
+					io_histograms_used = 0;
+
+		PendingIOStats.pending_hist_time_buckets_size = pgstat_bktype_count_potentially_used(MyBackendType);
+		alloc_sz = PendingIOStats.pending_hist_time_buckets_size * sizeof(*PendingIOStats.pending_hist_time_buckets);
+		PendingIOStats.pending_hist_time_buckets = MemoryContextAllocZero(TopMemoryContext, alloc_sz);
+
+		for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+		{
+			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+			{
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					if (pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op))
+					{
+						Assert(io_histograms_used <= PendingIOStats.pending_hist_time_buckets_size);
+
+						PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op] =
+							io_histograms_used++;
+					}
+					else
+						PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op] = -1;
+				}
+			}
+		}
+	}
+	else
+		PendingIOStats.pending_hist_time_buckets = NULL;
+
+}
+
 void
 pgstat_io_init_shmem_cb(void *stats)
 {
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 34fd93f86dc..984914e69b8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -352,12 +352,20 @@ typedef struct PgStat_PendingIO
 	instr_time	pending_times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 
 	/*
-	 * Dynamically allocated array of
-	 * [IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES]
-	 * [IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS] only with track_io_timings
-	 * true.
+	 * Dynamically allocated array for pg_stat_io_histograms only when
+	 * track_io_timing is true. pending_hist_time_buckets_offsets is just an
+	 * offset within pending_hist_time_buckets to avoid using unnecessary
+	 * memory.
 	 */
-	uint64		(*pending_hist_time_buckets)[IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
+	uint64		(*pending_hist_time_buckets)[PGSTAT_IO_HIST_BUCKETS];
+	uint64		pending_hist_time_buckets_offsets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+
+	/*
+	 * Cache how many histograms we have allocated to avoid repeatedly calling
+	 * pgstat_bktype_count_potentially_used(MyBackendType) from
+	 * pgstat_io_flush_cb()
+	 */
+	int			pending_hist_time_buckets_size;
 } PgStat_PendingIO;
 
 extern PgStat_PendingIO PendingIOStats;
@@ -645,6 +653,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 										 BackendType bktype);
+extern int	pgstat_bktype_count_potentially_used(BackendType bktype);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context,
 							   IOOp io_op, uint32 cnt, uint64 bytes);
 extern instr_time pgstat_prepare_io_time(bool track_io_guc);
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index a3ce8b04723..fcaf21db574 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -759,6 +759,7 @@ extern void pgstat_function_reset_timestamp_cb(PgStatShared_Common *header, Time
 extern void pgstat_flush_io(bool nowait);
 
 extern bool pgstat_io_flush_cb(bool nowait);
+extern void pgstat_io_init_backend_cb(void);
 extern void pgstat_io_init_shmem_cb(void *stats);
 extern void pgstat_io_reset_all_cb(TimestampTz ts);
 extern void pgstat_io_snapshot_cb(void);
-- 
2.43.0
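
The indirect-offset layout that the patch above introduces for the backend-local histograms can be illustrated in isolation. The following is only a minimal sketch under stated assumptions, not the patch's code: the dimensions, the SparseHist struct, is_tracked() (a stand-in for pgstat_tracks_io_op()) and the two helper functions are all hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical dimensions, standing in for IOOBJECT_NUM_TYPES and friends. */
enum { N_OBJ = 2, N_CTX = 3, N_OP = 4, N_BUCKETS = 16 };

typedef struct SparseHist
{
	int			nslots;		/* number of histograms actually allocated */
	int			offsets[N_OBJ][N_CTX][N_OP];	/* slot index, -1 if untracked */
	uint64_t	(*slots)[N_BUCKETS];	/* dense array: nslots x N_BUCKETS */
} SparseHist;

/* Hypothetical stand-in for pgstat_tracks_io_op(): only even ops are tracked. */
static bool
is_tracked(int obj, int ctx, int op)
{
	(void) obj;
	(void) ctx;
	return (op % 2) == 0;
}

/* Build the offset table and allocate one histogram per tracked combination. */
static bool
sparse_hist_init(SparseHist *h)
{
	int			next = 0;

	for (int obj = 0; obj < N_OBJ; obj++)
		for (int ctx = 0; ctx < N_CTX; ctx++)
			for (int op = 0; op < N_OP; op++)
				h->offsets[obj][ctx][op] =
					is_tracked(obj, ctx, op) ? next++ : -1;

	h->nslots = next;
	h->slots = calloc((size_t) next, sizeof(*h->slots));
	return h->slots != NULL;
}

/* Bump one bucket; returns false if the combination has no slot. */
static bool
sparse_hist_count(SparseHist *h, int obj, int ctx, int op, int bucket)
{
	int			off = h->offsets[obj][ctx][op];

	if (off < 0)
		return false;
	h->slots[off][bucket]++;
	return true;
}
```

With these made-up dimensions, a full 4-D array would hold 24 histograms, while the dense array holds only the 12 tracked ones; the price is one extra lookup in the offset table on every increment.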



  [text/x-patch] v11-0003-Lower-pg_stat_io_histogram-shared-memory-use-by-.patch (19.2K, 4-v11-0003-Lower-pg_stat_io_histogram-shared-memory-use-by-.patch)
  download | inline diff:
From 3dd96ae7164e28f802fa44c311b123fdf31e223b Mon Sep 17 00:00:00 2001
From: Jakub Wartak <[email protected]>
Date: Fri, 8 May 2026 09:19:49 +0200
Subject: [PATCH v11 3/3] Lower pg_stat_io_histogram shared memory use by using
 array with indirect offsets.

We use the pgstat_tracks_io_*() family of functions to derive the length of
the array that is allocated in shared memory during startup. As the number
of valid combinations of backend type and I/O object/context/operation
comes from the semi-runtime pgstat_io_get_sum_tracked() function, it cannot be
computed by the preprocessor, so we would need to come up with a
#define PGSTAT_IO_HIST_BUCKET_SLOTS somehow. To do that, and to work around
C's limitations (lack of constexpr), we could precalculate the size of the
static array in the build system and generate a .h file to be included by
pgstat.h; however, that appears to be hardly portable and hard to
cross-compile. Instead of doing that, we dynamically allocate shared memory
for the I/O histograms during startup.
---
 src/backend/utils/activity/pgstat.c       |  42 +++++-
 src/backend/utils/activity/pgstat_io.c    | 164 +++++++++++++++++++---
 src/backend/utils/activity/pgstat_shmem.c |  15 ++
 src/backend/utils/adt/pgstatfuncs.c       |  20 ++-
 src/include/pgstat.h                      |  26 +++-
 5 files changed, 241 insertions(+), 26 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 7c597932671..0bd59992f4e 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -443,7 +443,13 @@ static const PgStat_KindInfo pgstat_kind_builtin_infos[PGSTAT_KIND_BUILTIN_SIZE]
 		.snapshot_ctl_off = offsetof(PgStat_Snapshot, io),
 		.shared_ctl_off = offsetof(PgStat_ShmemControl, io),
 		.shared_data_off = offsetof(PgStatShared_IO, stats),
-		.shared_data_len = sizeof(((PgStatShared_IO *) 0)->stats),
+
+		/*
+		 * Do not write everything using this .shared_data_len, as the IO
+		 * histogram backing store is handled by special-case (as it is
+		 * dynamic) in pgstat_write_statsfile() / pgstat_read_statsfile().
+		 */
+		.shared_data_len = offsetof(PgStat_IO, hist_time_buckets_slot_count),
 
 		.init_backend_cb = pgstat_io_init_backend_cb,
 		.flush_static_cb = pgstat_io_flush_cb,
@@ -1685,6 +1691,21 @@ pgstat_write_statsfile(void)
 		fputc(PGSTAT_FILE_ENTRY_FIXED, fpout);
 		pgstat_write_chunk_s(fpout, &kind);
 		pgstat_write_chunk(fpout, ptr, info->shared_data_len);
+
+		/*
+		 * PGSTAT_KIND_IO has a dynamically-sized histogram that lives outside
+		 * the shared_data_len region. This assumes that PGSTAT_FILE_FORMAT_ID
+		 * is bumped each time that the pgstat_tracks_io_*() logic is altered.
+		 */
+		if (kind == PGSTAT_KIND_IO)
+		{
+			PgStat_IO  *io = pgStatLocal.snapshot.io;
+
+			pgstat_write_chunk(fpout, io->hist_time_buckets_slots,
+							   (size_t) io->hist_time_buckets_slot_count *
+							   PGSTAT_IO_HIST_BUCKETS *
+							   sizeof(uint64));
+		}
 	}
 
 	/*
@@ -1930,6 +1951,25 @@ pgstat_read_statsfile(void)
 						goto error;
 					}
 
+					/*
+					 * PGSTAT_KIND_IO also has a semi-dynamic histogram array
+					 * appended after the main chunk. By now,
+					 * StatsShmemInit() has prepared the memory.
+					 */
+					if (kind == PGSTAT_KIND_IO)
+					{
+						PgStat_IO  *io = &shmem->io.stats;
+
+						if (!pgstat_read_chunk(fpin, io->hist_time_buckets_slots,
+											   (size_t) io->hist_time_buckets_slot_count *
+											   PGSTAT_IO_HIST_BUCKETS *
+											   sizeof(uint64)))
+						{
+							elog(WARNING, "could not read pgstat_io histogram backing store");
+							goto error;
+						}
+					}
+
 					break;
 				}
 			case PGSTAT_FILE_ENTRY_HASH:
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 4c655d38b97..ad8093420ed 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -20,6 +20,7 @@
 #include "executor/instrument.h"
 #include "port/pg_bitutils.h"
 #include "storage/bufmgr.h"
+#include "storage/shmem.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
 
@@ -210,6 +211,8 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 		{
 			int			offset;
 
+			Assert(track_io_timing || track_wal_io_timing);
+
 			/*
 			 * calculate the bucket_index based on latency in nanoseconds
 			 * (uint64)
@@ -217,6 +220,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			bucket_index = get_bucket_index(INSTR_TIME_GET_NANOSEC(io_time));
 
 			offset = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op];
+
+			/* does the offset point to a valid slot? */
+			Assert(offset >= 0 && offset < PendingIOStats.pending_hist_time_buckets_size);
+
 			PendingIOStats.pending_hist_time_buckets[offset][bucket_index]++;
 		}
 
@@ -258,6 +265,7 @@ pgstat_io_flush_cb(bool nowait)
 {
 	LWLock	   *bktype_lock;
 	PgStat_BktypeIO *bktype_shstats;
+	PgStat_IO  *bk_io;
 
 	if (!have_iostats)
 		return false;
@@ -265,6 +273,7 @@ pgstat_io_flush_cb(bool nowait)
 	bktype_lock = &pgStatLocal.shmem->io.locks[MyBackendType];
 	bktype_shstats =
 		&pgStatLocal.shmem->io.stats.stats[MyBackendType];
+	bk_io = &pgStatLocal.shmem->io.stats;
 
 	if (!nowait)
 		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
@@ -297,16 +306,23 @@ pgstat_io_flush_cb(bool nowait)
 				 * offsets to save memory) into shared memory.
 				 */
 				if (PendingIOStats.pending_hist_time_buckets != NULL)
-					for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
-					{
-						int			pending_off = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op];
+				{
+					int			bktype_shstats_off = bktype_shstats->hist_time_buckets_offsets[io_object][io_context][io_op];
+					int			pending_off = PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op];
 
-						if (pending_off != -1)
-						{
-							bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
-								PendingIOStats.pending_hist_time_buckets[pending_off][b];
-						}
-					}
+					Assert(track_io_timing || track_wal_io_timing);
+
+					/*
+					 * -1 means that this combination has no slot
+					 * (based on pgstat_tracks_io_*()).
+					 */
+					if (bktype_shstats_off == -1 || pending_off == -1)
+						continue;
+
+					for (int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
+						bk_io->hist_time_buckets_slots[bktype_shstats_off][b] +=
+							PendingIOStats.pending_hist_time_buckets[pending_off][b];
+				}
 			}
 		}
 	}
@@ -415,7 +431,7 @@ pgstat_io_init_backend_cb(void)
 				{
 					if (pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op))
 					{
-						Assert(io_histograms_used <= PendingIOStats.pending_hist_time_buckets_size);
+						Assert(io_histograms_used < PendingIOStats.pending_hist_time_buckets_size);
 
 						PendingIOStats.pending_hist_time_buckets_offsets[io_object][io_context][io_op] =
 							io_histograms_used++;
@@ -428,12 +444,12 @@ pgstat_io_init_backend_cb(void)
 	}
 	else
 		PendingIOStats.pending_hist_time_buckets = NULL;
-
 }
 
 void
 pgstat_io_init_shmem_cb(void *stats)
 {
+	int			histogram_slots = 0;
 	PgStatShared_IO *stat_shmem = (PgStatShared_IO *) stats;
 
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
@@ -441,26 +457,79 @@ pgstat_io_init_shmem_cb(void *stats)
 
 	/* this might end up being lazily allocated in pgstat_io_snapshot_cb() */
 	pgStatLocal.snapshot.io = NULL;
+
+	/*
+	 * Establish indirect mapping from
+	 * PgStat_BktypeIO.hist_time_buckets_offsets[][][] to
+	 * PgStat_IO.hist_time_buckets_slots[x]
+	 */
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+	{
+		for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+		{
+			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+			{
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					if (pgstat_tracks_io_op(i, io_object, io_context, io_op))
+					{
+						stat_shmem->stats.stats[i].hist_time_buckets_offsets[io_object][io_context][io_op] =
+							histogram_slots++;
+					}
+					else
+						stat_shmem->stats.stats[i].hist_time_buckets_offsets[io_object][io_context][io_op] =
+							-1;
+				}
+			}
+		}
+	}
+
+	/*
+	 * Sanity check: the offset table we just produced must use exactly the
+	 * number of slots StatsShmemInit() reserved.  Both come from the same
+	 * pgstat_tracks_io_*() rules, so a mismatch would indicate a bug.
+	 */
+	Assert(histogram_slots == stat_shmem->stats.hist_time_buckets_slot_count);
 }
 
 void
 pgstat_io_reset_all_cb(TimestampTz ts)
 {
+	PgStat_IO  *io_stats = &pgStatLocal.shmem->io.stats;
+
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
-		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+		PgStat_BktypeIO *bktype_shstats = &io_stats->stats[i];
 
 		LWLockAcquire(bktype_lock, LW_EXCLUSIVE);
 
 		/*
 		 * Use the lock in the first BackendType's PgStat_BktypeIO to protect
-		 * the reset timestamp as well.
+		 * the reset timestamp.
 		 */
 		if (i == 0)
-			pgStatLocal.shmem->io.stats.stat_reset_timestamp = ts;
+			io_stats->stat_reset_timestamp = ts;
 
-		memset(bktype_shstats, 0, sizeof(*bktype_shstats));
+		/* Reset this BackendType's histogram slots */
+		for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+		{
+			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+			{
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					int			off = bktype_shstats->hist_time_buckets_offsets[io_object][io_context][io_op];
+
+					if (off == -1)
+						continue;
+					memset(io_stats->hist_time_buckets_slots[off], 0,
+						   sizeof(io_stats->hist_time_buckets_slots[off]));
+				}
+			}
+		}
+
+		/* Avoid resetting our indirect mapping offsets */
+		memset(bktype_shstats, 0, offsetof(PgStat_BktypeIO, hist_time_buckets_offsets));
 		LWLockRelease(bktype_lock);
 	}
 }
@@ -468,14 +537,30 @@ pgstat_io_reset_all_cb(TimestampTz ts)
 void
 pgstat_io_snapshot_cb(void)
 {
+	PgStat_IO  *shmem_io = &pgStatLocal.shmem->io.stats;
+
 	if (unlikely(pgStatLocal.snapshot.io == NULL))
+	{
+		int			n = shmem_io->hist_time_buckets_slot_count;
+
 		pgStatLocal.snapshot.io = MemoryContextAllocZero(TopMemoryContext,
 														 sizeof(PgStat_IO));
 
+		/*
+		 * Allocated on demand in private (TopMemoryContext) memory and points
+		 * to the same indirect offsets.
+		 */
+		pgStatLocal.snapshot.io->hist_time_buckets_slot_count = n;
+		pgStatLocal.snapshot.io->hist_time_buckets_slots =
+			MemoryContextAllocZero(TopMemoryContext,
+								   (size_t) n * PGSTAT_IO_HIST_BUCKETS *
+								   sizeof(uint64));
+	}
+
 	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
 	{
 		LWLock	   *bktype_lock = &pgStatLocal.shmem->io.locks[i];
-		PgStat_BktypeIO *bktype_shstats = &pgStatLocal.shmem->io.stats.stats[i];
+		PgStat_BktypeIO *bktype_shstats = &shmem_io->stats[i];
 		PgStat_BktypeIO *bktype_snap = &pgStatLocal.snapshot.io->stats[i];
 
 		LWLockAcquire(bktype_lock, LW_SHARED);
@@ -486,10 +571,29 @@ pgstat_io_snapshot_cb(void)
 		 */
 		if (i == 0)
 			pgStatLocal.snapshot.io->stat_reset_timestamp =
-				pgStatLocal.shmem->io.stats.stat_reset_timestamp;
+				shmem_io->stat_reset_timestamp;
 
 		/* using struct assignment due to better type safety */
 		*bktype_snap = *bktype_shstats;
+
+		/* Copy this BackendType's histogram slots */
+		for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
+		{
+			for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
+			{
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					int			off = bktype_shstats->hist_time_buckets_offsets[io_object][io_context][io_op];
+
+					if (off == -1)
+						continue;
+					memcpy(pgStatLocal.snapshot.io->hist_time_buckets_slots[off],
+						   shmem_io->hist_time_buckets_slots[off],
+						   sizeof(shmem_io->hist_time_buckets_slots[off]));
+				}
+			}
+		}
+
 		LWLockRelease(bktype_lock);
 	}
 }
@@ -720,3 +824,29 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 
 	return true;
 }
+
+/*
+ * Total number of (BackendType, IOObject, IOContext, IOOp) combinations
+ * that we consider trackable.
+ */
+int
+pgstat_io_get_sum_tracked(void)
+{
+	int			sum = 0;
+
+	for (int i = 0; i < BACKEND_NUM_TYPES; i++)
+		sum += pgstat_bktype_count_potentially_used(i);
+
+	return sum;
+}
+
+/*
+ * Returns the number of bytes of shared memory required by
+ * PgStat_IO.hist_time_buckets_slots.
+ */
+Size
+pgstat_io_histogram_shmem_size(void)
+{
+	return mul_size(pgstat_io_get_sum_tracked(),
+					PGSTAT_IO_HIST_BUCKETS * sizeof(uint64));
+}
diff --git a/src/backend/utils/activity/pgstat_shmem.c b/src/backend/utils/activity/pgstat_shmem.c
index b8f354c818a..bb25be106be 100644
--- a/src/backend/utils/activity/pgstat_shmem.c
+++ b/src/backend/utils/activity/pgstat_shmem.c
@@ -139,6 +139,12 @@ StatsShmemSize(void)
 	sz = MAXALIGN(sizeof(PgStat_ShmemControl));
 	sz = add_size(sz, pgstat_dsa_init_size());
 
+	/*
+	 * Dynamic allocation for PgStat_IO.hist_time_buckets_slots. Sized from
+	 * the rules in pgstat_tracks_io_*()
+	 */
+	sz = add_size(sz, MAXALIGN(pgstat_io_histogram_shmem_size()));
+
 	/* Add shared memory for all the custom fixed-numbered statistics */
 	for (PgStat_Kind kind = PGSTAT_KIND_CUSTOM_MIN; kind <= PGSTAT_KIND_CUSTOM_MAX; kind++)
 	{
@@ -194,6 +200,15 @@ StatsShmemInit(void *arg)
 							  LWTRANCHE_PGSTATS_DSA, NULL);
 	dsa_pin(dsa);
 
+	/*
+	 * Prepare PgStat_IO.hist_time_buckets_slot* stuff before calling
+	 * pgstat_io_init_shmem_cb(). The additional memory for this was requested
+	 * in the StatsShmemSize() above.
+	 */
+	ctl->io.stats.hist_time_buckets_slot_count = pgstat_io_get_sum_tracked();
+	ctl->io.stats.hist_time_buckets_slots = (uint64 (*)[PGSTAT_IO_HIST_BUCKETS]) p;
+	p += MAXALIGN(pgstat_io_histogram_shmem_size());
+
 	/*
 	 * To ensure dshash is created in "plain" shared memory, temporarily limit
 	 * size of dsa to the initial size of the dsa.
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e16c65d45e9..da0e309600a 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1667,6 +1667,7 @@ typedef enum hist_io_stat_col
  */
 static void
 pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
+								  PgStat_IO *backends_io_stats,
 								  PgStat_BktypeIO *bktype_stats,
 								  BackendType bktype,
 								  TimestampTz stat_reset_timestamp)
@@ -1695,6 +1696,16 @@ pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
 				const char *op_name = pgstat_get_io_op_name(io_op);
+				int			bktype_hist_time_bucket_off;
+
+				/*
+				 * The offset is the same for every histogram bucket of this
+				 * io_obj/io_context/io_op combination.
+				 */
+				bktype_hist_time_bucket_off = bktype_stats->hist_time_buckets_offsets[io_obj][io_context][io_op];
+				if (bktype_hist_time_bucket_off == -1)
+					continue;
+				Assert(bktype_hist_time_bucket_off < backends_io_stats->hist_time_buckets_slot_count);
 
 				for (int bucket = 0; bucket < PGSTAT_IO_HIST_BUCKETS; bucket++)
 				{
@@ -1703,6 +1714,7 @@ pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
 					RangeBound	lower,
 								upper;
 					RangeType  *range;
+					uint64		bktype_bucket;
 
 					values[HIST_IO_COL_BACKEND_TYPE] = bktype_desc;
 					values[HIST_IO_COL_OBJECT] = CStringGetTextDatum(obj_name);
@@ -1731,9 +1743,9 @@ pg_stat_io_histogram_build_tuples(ReturnSetInfo *rsinfo,
 					range = make_range(typcache, &lower, &upper, false, NULL);
 					values[HIST_IO_COL_BUCKET_US] = RangeTypePGetDatum(range);
 
-					/* bucket count */
-					values[HIST_IO_COL_COUNT] = Int64GetDatum(
-															  bktype_stats->hist_time_buckets[io_obj][io_context][io_op][bucket]);
+					/* get bucket count, access indirectly */
+					bktype_bucket = backends_io_stats->hist_time_buckets_slots[bktype_hist_time_bucket_off][bucket];
+					values[HIST_IO_COL_COUNT] = Int64GetDatum(bktype_bucket);
 
 					if (stat_reset_timestamp != 0)
 						values[HIST_IO_COL_RESET_TIME] = TimestampTzGetDatum(stat_reset_timestamp);
@@ -1779,7 +1791,7 @@ pg_stat_get_io_histogram(PG_FUNCTION_ARGS)
 			continue;
 
 		/* save tuples with data from this PgStat_BktypeIO */
-		pg_stat_io_histogram_build_tuples(rsinfo, bktype_stats, bktype,
+		pg_stat_io_histogram_build_tuples(rsinfo, backends_io_stats, bktype_stats, bktype,
 										  backends_io_stats->stat_reset_timestamp);
 	}
 
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 984914e69b8..de90f1fb5b0 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -20,7 +20,6 @@
 #include "utils/backend_status.h"	/* for backward compatibility */	/* IWYU pragma: export */
 #include "utils/pgstat_kind.h"
 
-
 /* avoid including access/transam.h */
 typedef struct FullTransactionId FullTransactionId;
 
@@ -218,7 +217,7 @@ typedef struct PgStat_TableXactStatus
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BCBC
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BCBD
 
 typedef struct PgStat_ArchiverStats
 {
@@ -342,7 +341,14 @@ typedef struct PgStat_BktypeIO
 	uint64		bytes[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 	PgStat_Counter times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
-	uint64		hist_time_buckets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES][PGSTAT_IO_HIST_BUCKETS];
+
+	/*
+	 * Indirect offset to PgStat_IO (parent
+	 * structure).hist_time_buckets_slots. This needs to be the last field due
+	 * to the use of memset(.., offsetof(hist_time_buckets_offsets)) in
+	 * pgstat_io_reset_all_cb().
+	 */
+	int			hist_time_buckets_offsets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_PendingIO
@@ -358,7 +364,7 @@ typedef struct PgStat_PendingIO
 	 * memory.
 	 */
 	uint64		(*pending_hist_time_buckets)[PGSTAT_IO_HIST_BUCKETS];
-	uint64		pending_hist_time_buckets_offsets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	int			pending_hist_time_buckets_offsets[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 
 	/*
 	 * Cache how many histograms we have allocated to avoid repeatedly calling
@@ -374,6 +380,16 @@ typedef struct PgStat_IO
 {
 	TimestampTz stat_reset_timestamp;
 	PgStat_BktypeIO stats[BACKEND_NUM_TYPES];
+
+	/*
+	 * The IO histogram memory is sized at postmaster start from the rules in
+	 * pgstat_tracks_io_*() and persisted by additional code that handles this
+	 * dynamic (shared) memory pointer in pgstat_write_statsfile() /
+	 * pgstat_read_statsfile(), so these fields are not part of the
+	 * serialization to disk by common code.
+	 */
+	int			hist_time_buckets_slot_count;
+	uint64		(*hist_time_buckets_slots)[PGSTAT_IO_HIST_BUCKETS];
 } PgStat_IO;
 
 typedef struct PgStat_LockEntry
@@ -654,6 +670,8 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 										 BackendType bktype);
 extern int	pgstat_bktype_count_potentially_used(BackendType bktype);
+extern int	pgstat_io_get_sum_tracked(void);
+extern Size pgstat_io_histogram_shmem_size(void);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context,
 							   IOOp io_op, uint32 cnt, uint64 bytes);
 extern instr_time pgstat_prepare_io_time(bool track_io_guc);
-- 
2.43.0
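
The reset path in pgstat_io_reset_all_cb() above zeroes only the leading part of PgStat_BktypeIO, so that the offset mapping survives the reset. That is why the comment in pgstat.h requires hist_time_buckets_offsets to be the last field. The idiom can be sketched on its own as follows; this is an illustration under assumptions, and the Counters struct, N_KEYS and counters_reset() are hypothetical names that do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical size, for illustration only. */
enum { N_KEYS = 4 };

typedef struct Counters
{
	uint64_t	counts[N_KEYS];
	uint64_t	times[N_KEYS];

	/*
	 * Must stay the last field: counters_reset() wipes every byte that
	 * precedes it and leaves this mapping untouched.
	 */
	int			offsets[N_KEYS];
} Counters;

/* Zero all counters but keep the offset mapping intact. */
static void
counters_reset(Counters *c)
{
	memset(c, 0, offsetof(Counters, offsets));
}
```

The design is compact but fragile: adding a counter field after offsets would silently exclude it from the reset, so the "must be last" comment carries real weight.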





end of thread, other threads:[~2026-05-20 08:37 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed)
2026-02-18 23:12 Re: pg_stat_io_histogram Andres Freund <[email protected]>
2026-02-23 12:30 ` Jakub Wartak <[email protected]>
2026-02-26 16:13   ` Andres Freund <[email protected]>
2026-05-20 08:37     ` Jakub Wartak <[email protected]>
