MIME-Version: 1.0
References: <CAKZiRmwvE4uJLKTgPXeBA4m+d4tTghayoefcaM9=z3_S7i72GA@mail.gmail.com>
 <vhzkeogzrrfzjwo3xrnq4xsjh6i37ou6xsbz7yby3lbb3rnxzz@6fpysnkjyldi>
 <CAKZiRmxNVMOV7x5c_Amqk=2mmYJOqsfHgE8N8O9jGjgfBYa8kQ@mail.gmail.com> <CANwKhkM1FNS1Wmc56+aunXhaP_zjjO2YKzKTJRoVW0RsM4Of3w@mail.gmail.com>
In-Reply-To: <CANwKhkM1FNS1Wmc56+aunXhaP_zjjO2YKzKTJRoVW0RsM4Of3w@mail.gmail.com>
From: Jakub Wartak <jakub.wartak@enterprisedb.com>
Date: Thu, 12 Feb 2026 09:35:06 +0100
Message-ID: <CAKZiRmxmVUpLb9HiY1MUczCLNFpP30f0+ZRS9NaeTf7DZjt4Tg@mail.gmail.com>
Subject: Re: pg_stat_io_histogram
To: Ants Aasma <ants.aasma@cybertec.at>
Cc: Andres Freund <andres@anarazel.de>, 
	PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/CAKZiRmxmVUpLb9HiY1MUczCLNFpP30f0%2BZRS9NaeTf7DZjt4Tg%40mail.gmail.com>
Precedence: bulk

On Wed, Feb 11, 2026 at 1:42 PM Ants Aasma <ants.aasma@cybertec.at> wrote:

Hi Ants, thanks for taking the time to respond!

> On Tue, 27 Jan 2026 at 14:06, Jakub Wartak
> <jakub.wartak@enterprisedb.com> wrote:
> > > Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good
> > > NVMe disk you'll get < 10us, after all.
> >
> > I was blind and concentrated way too much on the bad-behaving I/O rather than good
> >  I/O - let's call it I/O negativity bias 8)
> >
> > Now v2 contains the min bucket lowered to 8us (but max then is just ~131ms, I
> > didn't want it to use more than 64b total, 16*4b (uint32)=64b and well
> > 16*8b(uint64)=128b already, so that's why it's capped max at 131072us right now).
>
> I have toyed around with similar histogram implementations as I have
> dealt with multiple cases where having a latency histogram would have
> made diagnosis much faster. So thank you for working on this.

Awesome! (I mean sorry you had to deal with terrible I/O stack
implementations.. ;))

> I think it would be useful to have a max higher than 131ms. I've seen
> some cases with buggy multipathing driver and self-DDOS'ing networking
> hardware where the problem latencies have been in the 20s - 60s range.
> Being able to attribute the whole time to I/O allows quickly ruling
> out other problems. Seeing a count in 131ms+ bucket is a strong hint,
> seeing a count in 34s-68s bucket is a smoking gun.
>
> Is the main concern for limiting the range cache-misses/pollution when
> counting I/O or is it memory overhead and cost of collecting?

Yes, I fully agree: the primary reason for developing this is to find those
edge-case outliers (p99.9) that cause issues. But, as you say, I'm not at all
sure how much data we can gather there before the overhead starts to be
noticeable OR just makes committers uncomfortable due to performance concerns
(even if not demonstrated by benchmarks).
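
To make the range trade-off concrete, here is a minimal standalone sketch. It
assumes power-of-two bucket boundaries starting at 8us, with bucket 0 catching
everything below 8us and the last bucket catching everything above its lower
bound. The names demo_hist_bucket and DEMO_HIST_MIN_US are made up for
illustration and the layout is not necessarily what the patch does.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: upper bound of bucket 0, in microseconds. */
#define DEMO_HIST_MIN_US 8

/*
 * Map a latency in microseconds to a bucket index in [0, nbuckets - 1].
 * Bucket 0 is [0, 8us), bucket k is [8us << (k - 1), 8us << k), and the
 * last bucket is open-ended (everything at or above its lower bound).
 */
static int
demo_hist_bucket(uint64_t latency_us, int nbuckets)
{
	int			idx = 0;
	uint64_t	bound = DEMO_HIST_MIN_US;

	/* Keep the shift well inside 64 bits. */
	assert(nbuckets > 0 && nbuckets <= 32);

	while (idx < nbuckets - 1 && latency_us >= bound)
	{
		bound <<= 1;
		idx++;
	}
	return idx;
}
```

With 16 buckets, a 40s stall (40000000us) ends up in the last bucket together
with anything >= 131072us. With 32 buckets it would land in bucket 23, which
covers 33554432us up to (but not including) 67108864us, i.e. roughly the
34s-68s range mentioned above, at the cost of 32*4b (uint32)=128b per
histogram.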

> It seems quite wasteful to replicate the histogram 240x for each
> object/context/op combination. I don't think it matters for I/O
> instrumentation overhead - each backend is only doing a limited amount
> of different I/O categories and the lower buckets are likely to be on
> the same cache line with the counter that gets touched anyway. For
> higher buckets the overhead should be negligible compared to the cost
> of the I/O itself.
> What I'm worried about is that this increases PgStat_PendingIO from
> 5.6KB to 30KB. This whole chunk of memory needs to be scanned and
> added to shared memory structures element by element. Compiler auto
> vectorization doesn't seem to kick in on pgstat_io_flush_cb(), but

Right, after adding "#pragma clang loop vectorize(enable)", clang reports:
  ../src/backend/utils/activity/pgstat_io.c:273:2: warning: loop not vectorized:
  the optimizer was unable to perform the requested transformation;
  the transformation might be disabled or specified as part of an unsupported
  transformation ordering [-Wpass-failed=transform-warning]
  273 |         for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)

BTW how did you arrive at the "240x" number? We have 16 buckets for each
of the object/context/type.

> even then scanning an extra 25KB of mostly zeroes on every commit
> doesn't seem great. Maybe making the histogram accumulation
> conditional on the counter field being non-zero is enough to avoid any
> issues? I haven't yet constructed a benchmark to see if it's actually
> a problem or not. Select only pgbench with small shared buffers and
> scale that fits into page cache should be an adversarial use case
> while still being reasonably realistic.

Earlier I did some benchmarks (please see [1]), following Andres's
recommendation to use a low io_combine_limit and just tiny shared_buffers.
I'm getting too much noise to derive any results, and since this is related
to I/O, context switches probably start playing a role there too... sadly we
don't seem to have a performance farm to answer this.

TBH, I'm not sure how to progress with this. I mean we could, as you say:
- reduce PgStat_PendingIO.pending_hist_time_buckets by removing
  IOCONTEXT_NUM_TYPES
  (not a big loss, just lack of showing BAS strategy)
- we could even further reduce PgStat_PendingIO.pending_hist_time_buckets
  by removing IOOBJECT_NUM_TYPES, but those are just 3 and they might be
  useful

... and are you suggesting to also try the change below?

@@ -288,8 +290,9 @@ pgstat_io_flush_cb(bool nowait)
                                for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
-                                       bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
-                                               PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
+                                       if(PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b] > 0)
+                                               bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
+                                                       PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];

... but the main problem is that even if I do all of that, I won't be able to
reliably measure the impact; probably the best I could say is
"runs as well as master, +/- 3%".

Could you somehow help me with this? I mean, should we reduce the scope
(remove context) and add that "if"?
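
For comparison, here is a minimal standalone sketch of how I read the "skip
when the counter is zero" idea: check the per-combination op counter once and
skip all 16 buckets of that combination, instead of testing every bucket as in
the diff above. All types and names below (DemoIOStats, demo_flush, the
DEMO_N_* sizes) are stand-ins for illustration, not the real pgstat
definitions, and the sketch assumes a bucket is only ever incremented together
with its counter.

```c
#include <stdint.h>

/* Hypothetical stand-in dimensions; only "3 objects" and "16 buckets"
 * come from this thread, the rest are illustrative values. */
#define DEMO_N_OBJECTS	3
#define DEMO_N_CONTEXTS	5
#define DEMO_N_OPS		8
#define DEMO_N_BUCKETS	16

typedef struct DemoIOStats
{
	uint64_t	counts[DEMO_N_OBJECTS][DEMO_N_CONTEXTS][DEMO_N_OPS];
	uint64_t	hist[DEMO_N_OBJECTS][DEMO_N_CONTEXTS][DEMO_N_OPS][DEMO_N_BUCKETS];
} DemoIOStats;

/*
 * Add pending stats into shared stats, touching histogram memory only for
 * combinations that saw at least one operation. Returns the number of
 * histograms that were actually merged.
 */
static int
demo_flush(DemoIOStats *shared, const DemoIOStats *pending)
{
	int			merged = 0;

	for (int obj = 0; obj < DEMO_N_OBJECTS; obj++)
	{
		for (int ctx = 0; ctx < DEMO_N_CONTEXTS; ctx++)
		{
			for (int op = 0; op < DEMO_N_OPS; op++)
			{
				uint64_t	n = pending->counts[obj][ctx][op];

				/* Nothing happened here, so skip the 16 buckets too. */
				if (n == 0)
					continue;

				shared->counts[obj][ctx][op] += n;
				for (int b = 0; b < DEMO_N_BUCKETS; b++)
					shared->hist[obj][ctx][op][b] +=
						pending->hist[obj][ctx][op][b];
				merged++;
			}
		}
	}
	return merged;
}
```

This still scans the counters, but the histogram array (16x larger) is only
read for combinations the backend actually used. Whether that is measurable at
all is exactly the open question.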

> I'm not familiar enough with the new stats infrastructure to tell
> whether it's a problem, but it seems odd that
> pgstat_flush_backend_entry_io() isn't modified to aggregate the
> histograms.

Well, this is my first time doing this too, and my understanding is that
pgstat_io.c::pgstat_io_flush_cb() flushes the global statistics
(per backend-type), while the per-individual-backend
pgstat_flush_backend_entry_io() (from pgstat_backend.c) is more about
per-PID-backend stats (--> for: select * from pg_stat_get_backend_io(PID)).

In terms of this patch, the per-backend-PID I/O histograms are not implemented
yet. I've raised this question earlier, but I'm starting to believe
the answer is probably no, we should not implement those (more overhead
for no apparent benefit, as most of the cases could be tracked down with just
these overall per-backend-type stats).

Please feel free to drop some code, I'm looking for Co-authors on this for sure.

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmyLKeh9thmHNbkD7KSy3fsoUeopNVEGH33na8dXS9kN2g%40mail.gmail.com