Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vuV5Z-009cDZ-2L for pgsql-hackers@arkaria.postgresql.org; Mon, 23 Feb 2026 12:31:01 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vuV5Y-00DBTE-1r for pgsql-hackers@arkaria.postgresql.org; Mon, 23 Feb 2026 12:31:00 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vuV5Y-00DBT6-0o for pgsql-hackers@lists.postgresql.org; Mon, 23 Feb 2026 12:31:00 +0000 Received: from mail-lj1-x22d.google.com ([2a00:1450:4864:20::22d]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1vuV5U-00000000rcU-1Yws for pgsql-hackers@lists.postgresql.org; Mon, 23 Feb 2026 12:30:59 +0000 Received: by mail-lj1-x22d.google.com with SMTP id 38308e7fff4ca-38706b10b3bso41707331fa.1 for ; Mon, 23 Feb 2026 04:30:56 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1771849855; cv=none; d=google.com; s=arc-20240605; b=WN1KEdHrhPDfvDPAmZNCrE57PxwYXYdo96kfWtedqu56sHzvOlQxe7woFmUIVQvjEr CPtyWxEH8tkxeQ0DjmfoVxvYZiq9Y17J3/TIZksjHo+zHIwZjSU2srI7Go1GJeTYpu69 7JBFmWrOq08+vKSgIBQI1cLqFPKoRcAYcleKYcOvQd76tujVlpAiW7zERrPVbggFghK/ 4J5Z/MWRxMGHC3kVkMnHm+0Awo1chXxLRdnHkDtNf40RNYL1CZeC/5sR4WGcPO3sdslI qfwY1RTnRooZZRL2zE2ofuzLoqd5Ponid0Dym3F3Pn0aeplON1iPyWyUtRVKzkDwfU6d X7+g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=UHVokpnl+5cXFRl7FOqh+YOpcLd8f0ruwav7vdLoe5w=; fh=hEexil63mikVjxFEUBpiJm2fDE8QoE9YPgy7YXHdhW0=; b=ddjcMeX67L47y2KXtuykt6I1m2R4pP8jYzgijxLN5meyeikAaVem3EzjeywdHNftCJ 8J+3HxjPjTzL3Whd4bWcG+vV0FqgBbJIyip5qv63PocA1ngA2jkBvO3fGuWmaXbN+PkA renXWYVJoZLwt8kVzs6zCnsXVOzHuiCMy3TyGkD+Ju/S5SKRH4ZRYcuMwlzMoNHQZFIQ r2YB2AuSiPmMk6jXn0MtZcoNKUT+PlSQSB/Zh3qVz1bv2vOyUD0OTaax8Lfw2JB6ZuZ/ zMuRLv3GFuMzXYfdoUX8HNMIjJfoj9hZ7DsL4j5Mpvxe7mRaKLiWQdmeAvc+884bX9yp TETw==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=enterprisedb.com; s=google; t=1771849855; x=1772454655; darn=lists.postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=UHVokpnl+5cXFRl7FOqh+YOpcLd8f0ruwav7vdLoe5w=; b=LHt1NF3YSfyrLweidkYCNH0q3O2+l0UWtSNA4fd6jf860kcc3Pfu9l8konrcosXEwD lPbPNOAIif/xex714EXLr6v5Wz6xHCm71e1FsP5F13MJjjrT6E/NhQUUcE7EhT85trOu cyXxfcNjEr7voVY4QBUyc6sLZbyuBmKOiwLTmTlo7x91NvktVwO0KbXd0KkxEO27jW7M LNZqpRqMohMLMI+ExSivnWIJfop6Nec94p/8eloPfs5n3GU42XJxi53IIOdAg0x30aZj nmF5JltYnO/+0F3U0NEO6022SzNKVKGOEUj3W4+BxxQuzAoghO5Am4WjkwkpTD1JckUE LT9w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1771849855; x=1772454655; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=UHVokpnl+5cXFRl7FOqh+YOpcLd8f0ruwav7vdLoe5w=; b=fNE8RTkOXf0BVZF9LpEjsYij1grGpCYJpMGgSzfcmIpiRD6liEOMigNV7Nkhx22VP7 9iJYfvNppFVcAUrssjL6KUmcbB5hHL57h2fm7/+doaQI4Y5RjTtIdlmicp4K0TNWhDKB p9GC+a5Ote9PZoLtBZvQWp5agUGIgsEx2uFkThImKcYvgs53zq6MmrGtoBq2akxsDjel BZ9JN/B+I8VmGrN8Qd7LQYk2Ho8MSvde5OZhRw3Glw/tTyr+ppU0rzEcJy+xzwGcFDs6 M7JnT5OrUjHdOqQbHmjvVoM7Rs1fg7xjqmPCpjp6EyUKrwD1Tcu2BNFb4KfCYeGdRwdG g+3A== X-Forwarded-Encrypted: i=1; AJvYcCVjRrZweP6dSwAvFzRboTaTVRzw1rZMkkPZrIr3dOClXnWroZGg11DsQPVoBe9zCpFZoMX7nL9dXgSr4O5y@lists.postgresql.org X-Gm-Message-State: AOJu0YzpXufZ9kiIwwEtfp4QQPuKakLn2WvAZdJXEZUsBTMZVdBRz7MT o4F6YYdnTtmZfA63yRyFSAn0j21pbqWd/NfNNjCpr4i3KI+9oCqmhaQVMqA5HPX/PLXC6MgmYt9 ihwjc2AW+p0Fhi1F3rMLJO5fAUfRUorZLR22F8EHq7Z7EpSUyMXiFjw== X-Gm-Gg: AZuq6aIEY84+8nHDl23XVBMlP9iGhBxfuQyMJxLAKH16x9teGbXavUkIu5LSngUIwbn EVxIpaLORUh9tIksGQDBk0sGrWtDOXGj+NxfoaNPRwiK/oQ+9YBtPN3Rne326M9//d23Fpp7CbX I+khMo0t1cJAtQpgONsB2mov4IJha9D7fi65rXDik01GwZ0EDjnSt2RhCnsr8Cv4aIPmE8SmcbI cFiaPlJ2w4Ge/aPGlDQHiwwHxFbyoLkurRpD7+LWy6TLzLW75HVHGZFRYrnCRRqdMLGU6h31Y34 4QDl+RYTFyNIt1BaUhYtfzxazysAqU4NIxzVQuCLkyQXx5aK4eb0xpYHyEx4JszjZEk= X-Received: by 2002:a05:651c:2211:b0:385:c13b:5584 with SMTP id 38308e7fff4ca-389a5e130bemr26743421fa.36.1771849855404; Mon, 23 Feb 2026 04:30:55 -0800 (PST) MIME-Version: 1.0 References: In-Reply-To: From: Jakub Wartak Date: Mon, 23 Feb 2026 13:30:44 +0100 X-Gm-Features: AaiRm50VrbdTAJZhlVGLw-L_DyNRaVS-N7IDwyTTlV6SizgUMYRbFF6yR9qIyGw Message-ID: Subject: Re: pg_stat_io_histogram To: Andres Freund Cc: Ants Aasma , PostgreSQL Hackers Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Thu, Feb 19, 2026 at 12:12=E2=80=AFAM Andres Freund = wrote: Hi Andres, > One thing to be aware of is that with the rdtsc[p] patch (to substantiall= y > reduce timing overhead), it'll become a tad more expensive to convert an > instr_time to nanoseconds (due to having to convert cycles to nanoseconds= ). > It may be worth testing the combination. I've took a quick look on latest v7-0002 from there [1] and to sum up it do= es: -#define INSTR_TIME_GET_NANOSEC(t) \ - ((int64) (t).ticks) +static inline int64 +pg_ticks_to_ns(int64 ticks) +{ +#if defined(__x86_64__) || defined(WIN32) [..] + ns +=3D ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION; + return ns; +#else + return ticks; +#endif +} [..but!] +/* + * Make sure this is a power-of-two, so that the compiler can turn the + * multiplications and divisions into shifts. + */ +#define TICKS_TO_NS_PRECISION (1<<14) +#define INSTR_TIME_GET_NANOSEC(t) \ + (pg_ticks_to_ns((t).ticks)) + So at least to my eyes, it looks pretty cheap, doesn't it? > On that note, why is this measuring things in nanoseconds, given that we > already conver instr_time to microseconds nearby and that its quite unlik= ely > that you'd ever have IO times below a microsecond and that > MIN_PG_STAT_IO_HIST_LATENCY already is in the microsecond domain and we > display it as microseconds? Hmm, in earlier reply You have recommened to get away from conversion from microseconds so I've did because the microseconds were really costly integer divisions [2] "It's annoying to have to convert to microseconds here, that's not free := (" so because INSTR_TIME_GET_NANOSEC() is still cheap and fetching "ticks". > > I still want to look at the memory overhead more closely. The 30kB per > > backend seems tolerable to me > > One thing worth thinking about here is that we probably could stand to > increase the number of IO types further, we e.g. have been talking about > tracking IO that bypasses shared buffers separately. And a few more cont= ext > types (e.g. index inner/leaf) could also make sense. > > Without that change that'd be a somewhat moderate increase in memory usag= e, > but with this change it'd increase a lot more. OK, point taken, it can grow even further, but..: > > but I think having it in PgStat_BktypeIO is not great. This makes > > PgStat_IO 30k*BACKEND_NUM_TYPES bigger, or ~ 0.5MB. Having a stats snap= shot > > be half a megabyte bigger for no reason seems too wasteful. > > Yea, that's not awesome. Guys, question, could You please explain me what are the drawbacks of havin= g this semi-big (internal-only) stat snapshot of 0.5MB? I'm struggling to understand two things: a) 0.5MB is not a lot those days (ok my 286 had 1MB in the day ;)) b) how does it affect anything, because testing show it's not? My understandiung is that it only affects file size on startup/shutdown in $PGDATADIR/pgstat/pgstat.stat, correct? My worry is that we introduce more code (and bugs) for no real gain (?) > I guess we could count IO as 4 byte integers, and shift all bucket counts= down > in the rare case of an on overflow. It's just a 2x improvement, but ... [..I'll reply to that in next follow-up] > I think we might need to reduce the number of buckets somewhat. I'm kind of skeptical on lowering bucket count, and even Ants wanted to increase it, so that we would gain perfect visibility into sometimes problematic hardware issues (I would also swear there is something magical for I/Os stuck for 60secs), so we would both would want to cover it there, but we cannot squeezee more due to performance concerns... Now there's also this area where we want to understand was it from page cache or some-fast-IO-dev and that's how I arrived at this first edge of ~8us. If we go one bucket further (that is make first bucket 16us), I was afraid we may start loosing being able to differentiating page-cache vs devices, won't we? (Optane seems to be gone, but it started @ ~20us? You sa= id in [3] that it could be even as low as 10? so I've thought 8 is good bet) Right now, the final bucket is that we track >128ms (=3D=3D> bad stuff), and I would love to extend to that >512ms, but we cannot as it would be more than 16 buckets (and 16*8bytes_due_to_uint64=3D128bytes already). > Right now the lowest bucket is for 0-8 ms, the second for 8-16, the third= for > 16-32. I.e. the first bucket is the same width as the second. Is that > intentional? Yes, it's intentional flat at the beggining to be able to differentiate those fast accesses. -J. [1] - https://www.postgresql.org/message-id/CAP53PkxNJ2Y6G8PEpQn1zKa6ODE6k1= -oP9DNqWjkTj%3DdC8_KiA%40mail.gmail.com [2] - https://www.postgresql.org/message-id/vhzkeogzrrfzjwo3xrnq4xsjh6i37ou= 6xsbz7yby3lbb3rnxzz%406fpysnkjyldi