From: Ants Aasma
Date: Thu, 12 Feb 2026 12:31:31 +0200
Subject: Re: pg_stat_io_histogram
To: Jakub Wartak
Cc: Andres Freund, PostgreSQL Hackers

On Thu, 12 Feb 2026 at 10:35, Jakub Wartak <jakub.wartak@enterprisedb.com> wrote:
> BTW how have you arrived with the "240x" number? We have 16 buckets for each
> of the object/context/type.

Sorry, I worded that poorly. I meant that we store a histogram for each combination, 240 in total.

> > even then scanning an extra 25KB of mostly zeroes on every commit
> > doesn't seem great. Maybe making the histogram accumulation
> > conditional on the counter field being non-zero is enough to avoid any
> > issues? I haven't yet constructed a benchmark to see if it's actually
> > a problem or not. Select only pgbench with small shared buffers and
> > scale that fits into page cache should be an adversarial use case
> > while still being reasonably realistic.
>
> Earlier I've done some benchmarks (please see [1]) based on recommendations
> by Andres to keep low io_combine_limit for that and just tiny shared_buffers.
> I'm getting too much noise to derive any results, and as this is related
> to I/O even probably context switches start playing a role there... sadly we
> seem not to have a performance farm to answer this.
I glossed over the first benchmark you did. That's pretty close to what I was talking about - exercise the stats collection part by having ~1 I/O served from page cache per trivial transaction. And the prewarm should exercise the per-I/O overhead. If neither of them has any measurable overhead then I can't think of a workload where it could be worse.

> TBH, I'm not sure how to progress with this, I mean we could as you say:
> - reduce PgStat_PendingIO.pending_hist_time_buckets by removing
>   IOCONTEXT_NUM_TYPES
>   (not a big loss, just lack of showing BAS strategy)

I'm on the fence on this. For the actual problems I've had to diagnose it wouldn't have mattered. But latency differences of bulk vs. normal access might be useful for understanding benchmark results better. A 5x reduction in size is pretty big.

> - we could even further reduce PgStat_PendingIO.pending_hist_time_buckets
>   by removing IOOBJECT_NUM_TYPES, but those are just 3 and they might be
>   useful

WAL write vs. relation write is a very useful distinction for me.

> ... and are You saying to try to do this below thing too?
>
> @@ -288,8 +290,9 @@ pgstat_io_flush_cb(bool nowait)
>                                 for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
> -                                       bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
> -                                               PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];
> +                                       if(PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b] > 0)
> +                                               bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
> +                                                       PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];

I meant this:

@@ -287,6 +287,7 @@ pgstat_io_flush_cb(bool nowait)
                                bktype_shstats->times[io_object][io_context][io_op] +=
                                        INSTR_TIME_GET_MICROSEC(time);

+                               if (PendingIOStats.counts[io_object][io_context][io_op] > 0)
                                        for(int b = 0; b < PGSTAT_IO_HIST_BUCKETS; b++)
                                                bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=
                                                        PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b];

Most object/context/op combinations will have a 0 count, so no point in actually looking at the histogram.

> .. but the main problem, even if I do all of that I won't be able to
> reliably measure the impact, probably the best I could say is
> "runs good as well as master, +/- 3%".
>
> Could you somehow help me with this? I mean should we reduce the scope(remove
> context) and add that "if"?

I think if we only aggregate histograms conditionally, then having a ton of different histograms is less of a problem. Only the histograms that have any data will get accessed. The overhead is limited to the memory usage which I think is acceptable.
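
To make that concrete, here is a rough standalone sketch of the guarded flush. It is not the actual pgstat code: the dimension sizes, `IoStats` and `flush_io_stats` are invented for illustration, and it assumes that every histogram increment comes with an increment of the matching count, so a zero count implies all-zero buckets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical dimension sizes, chosen only for illustration. */
enum { N_OBJECTS = 3, N_CONTEXTS = 5, N_OPS = 8, N_BUCKETS = 16 };

/* Simplified stand-in for the pending/shared I/O stats structs. */
typedef struct IoStats
{
    uint64_t counts[N_OBJECTS][N_CONTEXTS][N_OPS];
    uint64_t hist[N_OBJECTS][N_CONTEXTS][N_OPS][N_BUCKETS];
} IoStats;

/*
 * Fold pending stats into shared stats, then reset pending.
 * Returns how many histograms were actually read.
 */
int
flush_io_stats(IoStats *shared, IoStats *pending)
{
    int touched = 0;

    assert(shared != NULL && pending != NULL);

    for (int obj = 0; obj < N_OBJECTS; obj++)
        for (int ctx = 0; ctx < N_CONTEXTS; ctx++)
            for (int op = 0; op < N_OPS; op++)
            {
                uint64_t n = pending->counts[obj][ctx][op];

                /* Zero count: by assumption the buckets are all zero too. */
                if (n == 0)
                    continue;

                shared->counts[obj][ctx][op] += n;
                for (int b = 0; b < N_BUCKETS; b++)
                    shared->hist[obj][ctx][op][b] += pending->hist[obj][ctx][op][b];
                touched++;
            }

    memset(pending, 0, sizeof(*pending));
    return touched;
}
```

In this sketch a backend that did a single kind of I/O since the last flush reads one histogram, and an idle flush reads none; the unused histograms only cost memory.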

I'll run a few benchmarks on what I have available here to see if I can tease out anything more precise than the "no effect, within a 3% error margin" result we have today.

> > I'm not familiar enough with the new stats infrastructure to tell
> > whether it's a problem, but it seems odd that
> > pgstat_flush_backend_entry_io() isn't modified to aggregate the
> > histograms.
>
> Well I'm first time doing this too, and my understanding is that
> pgstat_io.c::pgstat_io_flush_cb() is flushing the global statistics
> (per backend-type) while the per-individual backend
> pgstat_flush_backend_entry_io() (from pgstat_backend.c) is more about
> per-PID-backends stats (--> for: select * from pg_stat_get_backend_io(PID)).
>
> In terms of this patch, the per-backend-PID-I/O histograms are not implemented
> yet, and I've raised this question earlier, but I'm starting to believe
> the answer is probably no, we should not implement those (more overhead
> for no apparent benefit, as most of the cases could be tracked down just with
> this overall per-backend-type stats ).

I agree that per-PID histograms are probably not worth the extra work. But this left me wondering if we are allocating the whole set of histograms too many times. I don't think every place that uses PgStat_BktypeIO actually needs the histograms. I will need to dig around to understand this code a bit better.

Regards,
Ants Aasma