Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vkOCz-00EJGQ-2e for pgsql-hackers@arkaria.postgresql.org; Mon, 26 Jan 2026 15:08:54 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vkOCy-008UMD-34 for pgsql-hackers@arkaria.postgresql.org; Mon, 26 Jan 2026 15:08:53 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vkOCy-008UM4-0e for pgsql-hackers@lists.postgresql.org; Mon, 26 Jan 2026 15:08:52 +0000 Received: from fhigh-a1-smtp.messagingengine.com ([103.168.172.152]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1vkOCv-00000000Yhs-39qe for pgsql-hackers@lists.postgresql.org; Mon, 26 Jan 2026 15:08:52 +0000 Received: from phl-compute-05.internal (phl-compute-05.internal [10.202.2.45]) by mailfhigh.phl.internal (Postfix) with ESMTP id 1F73F1400031; Mon, 26 Jan 2026 10:08:48 -0500 (EST) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-05.internal (MEProxy); Mon, 26 Jan 2026 10:08:48 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm2; t=1769440128; x=1769526528; bh=xOrZSND3sH YuxW/ZDXILsTyfND8+XthgoANfrRNiu6c=; b=MXFf8IRAkW0TPmVzgoZwL/P3EU bq89on6dF3HpouzfmafmdsbHdXZp08qd/w+71M24rWizYKJxmDuh9TLGXkC797U1 WRyfdZrVnCTmi8ZAA5UA//i9uc+GI193UW+trJ9e8xef4d9s50Z0vYEz2PjrMEwj y1O+AEKwsOiJhMEMj1QpXs4/mWH2YuJaVravDJnm2snPfSnBsnVa3wPje8p8PS1j UR2KsdbeE8l7Sx5441smiuIfiSqxWLfICekTVhD5n/kCNhwvsm/z0D5L2OozacMA GNVf62y+oJJq5Ook6XtV3oS+RClQe9Nb7zsc+mWRrSZqFsVnyQuHPo6mUCSA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm2; t= 1769440128; x=1769526528; bh=xOrZSND3sHYuxW/ZDXILsTyfND8+XthgoAN frRNiu6c=; b=UfPHgCcKulX44lOseO/TbR46DDRM+JlErBWWdUyB9wGOHo51Hrw vqIJnVvoA7KWRUbClROyfLGUxMR2qTHoQsT1hr5aL0OTayhTNPn90jsiIWgtw+Y1 EQeh+5Q2WCaVMldSONvsplo7L16hCqQzcQAGKvBPRo2/EPKE99sNciiefG1qFl0P QSN8mDFU+c3BOdaZXwio59+tma3bwcKrfuXSCD+bDQ4Hmx7rdoU3ZKDNIn1I3h7N w08PVpQY3MjLvYJdcdujKbWQN8wc+9/fFyMoHbiY4q/3BEvSg/Ej71km4+It5Ieb CGV39O/ogz3LVpZthkGDGcv698jR4w5bM5A== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgdduheejleelucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujf gurhepfffhvfevuffkfhggtggujgesthdtsfdttddtvdenucfhrhhomheptehnughrvghs ucfhrhgvuhhnugcuoegrnhgurhgvshesrghnrghrrgiivghlrdguvgeqnecuggftrfgrth htvghrnhepfeffgfelvdffgedtveelgfdtgefghfdvkefggeetieevjeekteduleevjefh ueegnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomheprg hnughrvghssegrnhgrrhgriigvlhdruggvpdhnsggprhgtphhtthhopedvpdhmohguvgep shhmthhpohhuthdprhgtphhtthhopehjrghkuhgsrdifrghrthgrkhesvghnthgvrhhprh hishgvuggsrdgtohhmpdhrtghpthhtohepphhgshhqlhdqhhgrtghkvghrsheslhhishht shdrphhoshhtghhrvghsqhhlrdhorhhg X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Mon, 26 Jan 2026 10:08:47 -0500 (EST) Date: Mon, 26 Jan 2026 10:08:47 -0500 From: Andres Freund To: Jakub Wartak Cc: PostgreSQL Hackers Subject: Re: pg_stat_io_histogram Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2026-01-26 10:40:52 +0100, Jakub Wartak wrote: > I'm proposing that we add pg_stat_io_histogram that would track/show I/O > latencies profile, so we could quickly identify I/O outliers. I think that's an interesting feature! > From time to time users complain that 'PostgreSQL is slow or stuck' (usually > COMMIT is slow), when it is quite apparent that it is down to somewhere in > the I/O stack. It is quite easy to prove once one has proper measurement > tools in place and is able to correlate, but it takes IMHO way too much time > and energy to cross-correlate all of that information (iostat -x 1s, wait > events 1s, and so on), especially if one would like to provide rapid > response. For measuring particularly stuck things, I've been wondering about having a regular timer that starts to collect more information if stuck in a place for a while. That would probably end up being lower overhead than constantly measuring... But would also be a lot more work. > Right now the patch does not include per-backend/PID tracking, hopefully if > there will be interest in this, I'll add it, but I would like to first hear > if that's a good idea. The current implementation uses fast bucket > calculation to avoid overheads and tries to cover most useful range of > devices via buckets (128us..256ms, so that covers both NVMe/SSD/HDD and > abnormally high latency too as from time to time I'm try to help with I/O > stuck for *seconds*, usually a sign of some I/O multipath issues, device > resetting, or hypervisor woes). Hm. Isn't 128us a pretty high floor for at least reads and writes? On a good NVMe disk you'll get < 10us, after all. I see a few problems with the source of the latency measurements though: - The latency gathered is that it's quite heavily affected by scheduler noise. If your process isn't scheduled because other processes are busy doing stuff on the CPU, it's quite possible to get results many orders of magnitude wrong. - With AIO, you're measuring wait time, and that time can be affected by other IOs in the queue. That will often *drastically* overestimate IO latency measured this way. I don't see how we can do better absent cooperation from the kernel (by putting lower-level measurements into io_uring completions, for example) though. So maybe this is just how it has to be and we ought to just document it. > postgres=# select > substring(backend_type,1,8) as backend,object,context,io_type, > bucket_latency_us as lat_us, > round(bucket_latency_us/1000.0, 3) as lat_ms, > bucket_count as count > from pg_stat_get_io_histogram() > where > bucket_count > 0 > order by 1,2,3,4,5; > backend | object | context | io_type | lat_us | lat_ms | count > ----------+----------+-----------+-----------+--------+--------+------- > autovacu | relation | normal | read | 128 | 0.128 | 54 Perhaps the latency should be represented as a range? > Of course most of the I/O calls today are hitting page cache, so one would > expect they'll be < 128us most of the time Have you measured whether overhead is measurable when hitting the page cache? I'd hope that it doesn't, due to io combing amortizing the cost somewhat. But it seems worth measuring. I assume you made pgstat_get_io_op_name() return "hit?" because you don't expect that to ever be hit? > +static inline int get_bucket_index(uint32_t val) { > +#define MIN_PG_STAT_IO_HIST_LATENCY 127 > + const uint32_t max_index = PGSTAT_IO_HIST_BUCKETS - 1; > + /* > + * hopefully calculated to be 25 by the compiler: > + * clz(127) = clz(01111111b on uint32) = 25 > + */ > + const uint32_t min_latency_leading_zeros = > + pg_leading_zero_bits32(MIN_PG_STAT_IO_HIST_LATENCY); > + > + /* > + * make sure the tmp value at least 127 (our minimum bucket size) > + * as __builtin_clz might return undefined behavior when operating on 0 > + */ > + uint32_t tmp = val | MIN_PG_STAT_IO_HIST_LATENCY; > + /* count leading zeros */ > + int leading_zeros = pg_leading_zero_bits32(tmp); > + > + /* normalize the index */ > + uint32_t index = min_latency_leading_zeros - leading_zeros; > + > + /* clamp it to the maximum */ > + return (index > max_index) ? max_index : index; > +} Wouldn't it be easier to handle the minimum latency by shifting right? I think we may also need to handle inputs that don't fit a uint32. For things like a stopped VM or such we could see IOs that that don't fit into a uint32 when measured in microseconds. So perhaps I'd make the input to the bucket calc 64 bits, then shift to the minimum precision and mask to implement clamping. > @@ -152,6 +189,10 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op, > INSTR_TIME_ADD(PendingIOStats.pending_times[io_object][io_context][io_op], > io_time); > > + /* calculate the bucket_index based on latency in us */ > + bucket_index = get_bucket_index(INSTR_TIME_GET_MICROSEC(io_time)); > + PendingIOStats.pending_hist_time_buckets[io_object][io_context][io_op][bucket_index]++; > + > /* Add the per-backend count */ > pgstat_count_backend_io_op_time(io_object, io_context, io_op, It's annoying to have to convert to microseconds here, that's not free :(. > @@ -1356,6 +1356,24 @@ typedef enum io_stat_col > IO_NUM_COLUMNS, > } io_stat_col; > > +/* > +* When adding a new column to the pg_stat_io view and the > +* pg_stat_get_backend_io() function, add a new enum value here above > +* IO_NUM_COLUMNS. > +*/ > +typedef enum hist_io_stat_col > +{ > + HIST_IO_COL_INVALID = -1, > + HIST_IO_COL_BACKEND_TYPE, > + HIST_IO_COL_OBJECT, > + HIST_IO_COL_CONTEXT, > + HIST_IO_COL_IOTYPE, > + HIST_IO_COL_BUCKET_US, > + HIST_IO_COL_COUNT, > + HIST_IO_COL_RESET_TIME, > + HIST_IO_NUM_COLUMNS > +} history_get_history_state; Think the IO_NUM_COLUMNS reference in the comment is a copy-pasto. I don't think this should be introduced in the middle of the pg_stat_io implementation. > +/* > + * pg_leading_zero_bits32 > + * Returns the number of leading 0-bits in x, starting at the most significant bit position. > + * Word must not be 0 (as it is undefined behavior). > + */ > +static inline int > +pg_leading_zero_bits32(uint32 word) Do we really need this in addition to the already existing pg_leftmost_one_pos32()? Particularly because that already has an msvc implementation... Greetings, Andres Freund