Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vqSAY-00B4BM-27 for pgsql-hackers@arkaria.postgresql.org; Thu, 12 Feb 2026 08:35:27 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vqSAX-007W0t-2v for pgsql-hackers@arkaria.postgresql.org; Thu, 12 Feb 2026 08:35:26 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vqSAX-007W0l-1S for pgsql-hackers@lists.postgresql.org; Thu, 12 Feb 2026 08:35:26 +0000 Received: from mail-lj1-x22f.google.com ([2a00:1450:4864:20::22f]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1vqSAT-00000000Fs6-3A13 for pgsql-hackers@lists.postgresql.org; Thu, 12 Feb 2026 08:35:24 +0000 Received: by mail-lj1-x22f.google.com with SMTP id 38308e7fff4ca-38707d4c8b9so12836771fa.2 for ; Thu, 12 Feb 2026 00:35:21 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1770885320; cv=none; d=google.com; s=arc-20240605; b=AUQ7tvFtyhu0F6DTDFeEnk0wultcT8czoJSS4tewD9I/siR7oheU29ENFR/UfwmR7r CD9YjroSHl5MJCt6S7LGv3Ny9IzQosMZBACn45rNObnwxhiXBm7Yt81txfAyxjTeJ0Bh YHnHXYMQizoMCxG/1X2IJxGI/msxjwc2PxMXD1sQAIpVkW6rlE+IfQUhsk4lovmUjscL bbCw11cmm0uGtrJanQdbvtavHG4DkDuSADzPIILT9D1iV8cLsfGBvkXl6CBp3+4aw4C8 wcy8YY+ECG6iPOu5XcgCyrbafzISZgIurA1ahspfXOZOQ7Ors+4JkL3vUv36gkhfH6li hy9g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=48FqQFcJsFYk+P2RVvfNwh8ajLGKIcR60MZDRLiaSrQ=; fh=rBSy1UXz2DN6BoQXWumSOh+G7iPmG69YIjrYzk2tfks=; b=H8inkItAjvV4e8KwvwpvcwGsVHweB/7Pb8+tR1wE16E5Wz0VxlYriiNPYGokzWEu0G y1kKZu7ZieRBRd+kjZw0BvnZdJ35mIDvLpNwmyIugKtIaro57IlwyA2+QKxY/Dcpir7p KlOcTAnWNp2TVIwpqxIue6rkbowRkbe5bXjM/N3uSgdmOFZ70p0dA/ng1OzhvqCBotq/ 5vWmP83OHrKzzr7cemWeMyWwVCXfG0zhGVzUrO4r3Fa/mh3g16VlpKKWgpQ7KAXY8Nan xRmLlCh8AOqx977iVrEzWV0RkGTueRw3sF8RQqk4n5XTn3q9HC6qcoJe+MysKMevayLs MqcA==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=enterprisedb.com; s=google; t=1770885320; x=1771490120; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=48FqQFcJsFYk+P2RVvfNwh8ajLGKIcR60MZDRLiaSrQ=; b=AfiWs2dGld5wQeptZMjM8XGFjgf1S9kOfxRf3ZSaFafaG9wsCGTxdAGAb/P5iysdg6 bJrbvvtm7fg3N6wWxpuyF4vBtRN4kG42JEBKPkHLwfAZtEaJ59xd4xeKMpeP6JfVslm4 hIyT8OWr+aa7oIfIJS+4xEsZGJsVckCE/gdCTmwI7GtjGVFKuV+OIXoqvjqqT4ugiw3J EcS0+m0NyxmPdcb9KHoe3U0TkVHLddffnfO/GrPiQc+wuRFkLVlndgK0M+ZS9aDy6faq zQkoddZRL5shfmqiUc5WxTUXuUl1a+8VzfIOBg2usiJhQ5XEQIWbVSUjBUBhCz+6YUUL tdXQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1770885320; x=1771490120; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=48FqQFcJsFYk+P2RVvfNwh8ajLGKIcR60MZDRLiaSrQ=; b=IdrW4BQtw6BOKMbxyzWc36dusHns/8FyJ8Cgx/BG3hmKMoSm5WVQT15BhRSBbSFlWD u+VW7ecM2e7bkFlQqbr9Yw7MVH7Dhs66X7GGOR1yjqLdNxMQ+tHb0f0DenUqEo6adovZ 73RIC3F6sxP95+nj4fofvAIGPdvX8epJK4JySh/Q5e+x5JWAL+k/yRUtEX/qCF0ulonm wII3Tlzl/IZln+iOLCU+ursn2+2UdNStg0T/RIycvyyj3c7HsaJT9r7bVf5CmL9ip5U6 hyeRFcgQR28cGiiOfVObFpzJp+0y3YEX6SZLvQGl2dcj9LKu/qOR4/VJ6WFB6i8QsbrE qBDQ== X-Forwarded-Encrypted: i=1; AJvYcCUXtLYDOw4q6njBCRlry8ZE0oHC3mJrmKTFKVjhjFo1prBpuvyv5mgAabqaNFpb1I0xuQjc6SKNUGGpr2PJ@lists.postgresql.org X-Gm-Message-State: AOJu0YziPuG43Io6nSqbW8xjribXDQ91wR60QHxAk+RgFBqC16fwGtad JO7+KF9HK5IBhMGIbLffVcM+IvTW+LuWcHTq9ocdq1UM95jdZ1+7o4QNEfuWC+lSJM2sOKSZl7f gq9uXpCGboCdQA5SVrqL5Hf0kRAb4IyrP7ehzPRrs X-Gm-Gg: AZuq6aIvuSLq2OVYx4HjuXarlzpxPomiCJ4sXBullPyHimWwMlYJwk0tpTDIrjV7v9j +KmU0KC6R77P8HcwvgoP0tVOWt5nJQu6g3urKJAzb3vNVaPUoPaz4FM1XDSv3reX7+j4SXl7Tv+ BwdpuQ42A4Wl7ewHsN/1NPSsjKWwWKLwI3auXk18ZenpTWdV4tamP9AS+/TdV3J7hD909mN/OmO yuh0Y7pSdBmB4aAkVlCuPiUZ9lIplp6cINGsYSAHQHF+d2ggNhEDDgeJ/2oMRH0qAtXJPDu5V8O QRBdljqOAcc/aMGMo+90NpSml7WJQZ4/aNqWMCW1vGqE9pVoqF4oBRjGKPHKYcIqhxI= X-Received: by 2002:a05:651c:1503:b0:385:f235:51c9 with SMTP id 38308e7fff4ca-3871113c7b4mr7906911fa.34.1770885319743; Thu, 12 Feb 2026 00:35:19 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Jakub Wartak Date: Thu, 12 Feb 2026 09:35:06 +0100 X-Gm-Features: AZwV_QiunsK5xM52QpENdkg8GztcWG2TX39esR2bfdedD1s4Z4MNiG--ZytDFpE Message-ID: Subject: Re: pg_stat_io_histogram To: Ants Aasma Cc: Andres Freund , PostgreSQL Hackers Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Wed, Feb 11, 2026 at 1:42=E2=80=AFPM Ants Aasma = wrote: Hi Ants, thanks for taking time to respond! > On Tue, 27 Jan 2026 at 14:06, Jakub Wartak > wrote: > > > Hm. Isn't 128us a pretty high floor for at least reads and writes? On= a good > > > NVMe disk you'll get < 10us, after all. > > > > I was blind and concentrated way too much on the bad-behaving I/O rathe= r than good > > I/O - let's call it I/O negativity bias 8) > > > > Now v2 contains the min bucket lowered to 8us (but max then is just ~13= 1ms, I > > didn't want it to use more than 64b total, 16*4b (uint32)=3D64b and wel= l > > 16*8b(uint64)=3D128b already, so that's why it's capped max at 131072us= right now). > > I have toyed around with similar histogram implementations as I have > dealt with multiple cases where having a latency histogram would have > made diagnosis much faster. So thank you for working on this. Awesome! (I mean sorry you had to deal with terrible I/O stack implementations.. ;)) > I think it would be useful to have a max higher than 131ms. I've seen > some cases with buggy multipathing driver and self-DDOS'ing networking > hardware where the problem latencies have been in the 20s - 60s range. > Being able to attribute the whole time to I/O allows quickly ruling > out other problems. Seeing a count in 131ms+ bucket is a strong hint, > seeing a count in 34s-68s bucket is a smoking gun. > > Is the main concern for limiting the range cache-misses/pollution when > counting I/O or is it memory overhead and cost of collecting? Yes, I fully agree, but the primary reason for developing is finding those edge case outliers (p99.9) that cause issues, but as You say I'm completely not sure of how much data we can gather there before it starts to be noticeable OR just makes committers uncomfortable due to performance concer= ns (even if not demonstrated by benchmarks). > It seems quite wasteful to replicate the histogram 240x for each > object/context/op combination. I don't think it matters for I/O > instrumentation overhead - each backend is only doing a limited amount > of different I/O categories and the lower buckets are likely to be on > the same cache line with the counter that gets touched anyway. For > higher buckets the overhead should be negligible compared to the cost > of the I/O itself. > What I'm worried about is that this increases PgStat_PendingIO from > 5.6KB to 30KB. This whole chunk of memory needs to be scanned and > added to shared memory structures element by element. Compiler auto > vectorization doesn't seem to kick in on pgstat_io_flush_cb(), but Right, after putting "#pragma clang loop vectorize(enable)") clang reports: ../src/backend/utils/activity/pgstat_io.c:273:2: warning: loop not vector= ized: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupport= ed transformation ordering [-Wpass-failed=3Dtransform-warning] 273 | for (int io_object =3D 0; io_object < IOOBJECT_NUM_TYPES; io_object++) BTW how have you arrived with the "240x" number? We have 16 buckets for eac= h of the object/context/type. > even then scanning an extra 25KB of mostly zeroes on every commit > doesn't seem great. Maybe making the histogram accumulation > conditional on the counter field being non-zero is enough to avoid any > issues? I haven't yet constructed a benchmark to see if it's actually > a problem or not. Select only pgbench with small shared buffers and > scale that fits into page cache should be an adversarial use case > while still being reasonably realistic. Earlier I've done some benchmarks (please see [1]) based on recommendations by Andres to keep low io_combine_limit for that and just tiny shared_buffer= s. I'm getting too much noise to derive any results, and as this is related to I/O even probably context switches start playing a role there... sadly w= e seem not to have a performance farm to answer this. TBH, I'm not sure how to progress with this, I mean we could as you say: - reduce PgStat_PendingIO.pending_hist_time_buckets by removing IOCONTEXT_NUM_TYPES (not a big loss, just lack of showing BAS strategy) - we could even further reduce PgStat_PendingIO.pending_hist_time_buckets by removing IOOBJECT_NUM_TYPES, but those are just 3 and they might be useful ... and are You saying to try to do this below thing too? @@ -288,8 +290,9 @@ pgstat_io_flush_cb(bool nowait) for(int b =3D 0; b < PGSTAT_IO_HIST_BUCKETS= ; b++) - bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=3D - PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b]; + if(PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b= ] > 0) + bktype_shstats->hist_time_buckets[io_object][io_context][io_op][b] +=3D + PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][b]; .. but the main problem, even if I do all of that I won't be able to reliably measure the impact, probably the best I could say is "runs good as well as master, +/- 3%". Could you somehow help me with this? I mean should we reduce the scope(remo= ve context) and add that "if"? > I'm not familiar enough with the new stats infrastructure to tell > whether it's a problem, but it seems odd that > pgstat_flush_backend_entry_io() isn't modified to aggregate the > histograms. Well I'm first time doing this too, and my understanding is that pgstat_io.c::pgstat_io_flush_cb() is flushing the global statistics (per backend-type) while the per-individual backend pgstat_flush_backend_entry_io() (from pgstat_backend.c) is more about per-PID-backends stats (--> for: select * from pg_stat_get_backend_io(PID))= . In terms of this patch, the per-backend-PID-I/O histograms are not implemen= ted yet, and I've raised this question earlier, but I'm starting to believe the answer is probably no, we should not implement those (more overhead for no apparent benefit, as most of the cases could be tracked down just wi= th this overall per-backend-type stats ). Please feel free to drop some code, I'm looking for Co-authors on this for = sure. -J. [1] - https://www.postgresql.org/message-id/CAKZiRmyLKeh9thmHNbkD7KSy3fsoUe= opNVEGH33na8dXS9kN2g%40mail.gmail.com