Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vs1m8-001QMa-2p for pgsql-hackers@arkaria.postgresql.org; Mon, 16 Feb 2026 16:48:45 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vs1m7-0044mN-2E for pgsql-hackers@arkaria.postgresql.org; Mon, 16 Feb 2026 16:48:43 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vs1m7-0044mB-0v for pgsql-hackers@lists.postgresql.org; Mon, 16 Feb 2026 16:48:43 +0000 Received: from fhigh-b8-smtp.messagingengine.com ([202.12.124.159]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1vs1m4-00000000v9X-1mLT for pgsql-hackers@lists.postgresql.org; Mon, 16 Feb 2026 16:48:42 +0000 Received: from phl-compute-04.internal (phl-compute-04.internal [10.202.2.44]) by mailfhigh.stl.internal (Postfix) with ESMTP id DA4707A00BC; Mon, 16 Feb 2026 11:48:39 -0500 (EST) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-04.internal (MEProxy); Mon, 16 Feb 2026 11:48:40 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm3; t=1771260519; x=1771346919; bh=zVd08tprNk 0qhdoJnxznVBtfOjPneYPS/ki8ju+ac8Q=; b=UhctxtwTaDzIEPqSs9ApOmcZsj vTA/Ep4eMefFjgyCMw31V6FRq4YHSpz7m2JXGldvJZlcqwNPZ7ansb/oHLS6lq9i 4gVtImyrm5xn7tfuFbLM6goTuNoXUG0Asl6QzM67hco3KyTwe3ws+OAlhdLd08G6 qMjCu823vCuWuXIgzdbhG2ftwIqCTYfhfE1P3XnVLZcZioGXDXF3lfp0iqGhmFKr ye1BDhqF/eXjE/Z/VHGmp+8wWSqGPa/VFr+nM2JWCzkJ4gstSdca5W3hnLzDzy3+ mNXanX5OEn9UYAI2bRSlmPJ1wgkWLVAW5htjkBoMHY3yejNw8XEEjfSpMWDQ== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm3; t= 1771260519; x=1771346919; bh=zVd08tprNk0qhdoJnxznVBtfOjPneYPS/ki 8ju+ac8Q=; b=M7S1vixwljYWAKwVo8X/ZnEzIzefSY9y/4t0ute5W2XEfyyGuDV +0niV+XjkbpwV0dSW98Q5MdnAZtJkspxqB/Nw4VFJC+blKxdSN5/ovkz/0peaEgk TcVFl0+KCN+4lnCElI2fRR77PDfVtLGGv7yq9CXBeR6QXqKPOAVTYthSHHsmrCLV O1PEjzlNn3fYIGPiCL9LMc1T8O4Diz33jA7cB6/qhaY8hVMxG5OeLR1wdhfaDFOs rRdWyS3vK+0fJXY9YvoBgBVUnvMiixLgDDrZeVzqOfDAysa0lZfR87lcrbF7D4ry BriaQI9vSMSLSIA2Ui4R5SUjvPl7c6OrZyQ== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgddvudejfeelucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujf gurhepfffhvfevuffkfhggtggujgesthdtsfdttddtvdenucfhrhhomheptehnughrvghs ucfhrhgvuhhnugcuoegrnhgurhgvshesrghnrghrrgiivghlrdguvgeqnecuggftrfgrth htvghrnhepfeffgfelvdffgedtveelgfdtgefghfdvkefggeetieevjeekteduleevjefh ueegnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomheprg hnughrvghssegrnhgrrhgriigvlhdruggvpdhnsggprhgtphhtthhopeduuddpmhhouggv pehsmhhtphhouhhtpdhrtghpthhtohepphhgsegsohifthdrihgvpdhrtghpthhtohepkh hnihiihhhnihhksehgrghrrhgvthdrrhhupdhrtghpthhtohepsgihrghvuhiikedusehg mhgrihhlrdgtohhmpdhrtghpthhtohepughilhhiphgsrghlrghuthesghhmrghilhdrtg homhdprhgtphhtthhopehmvghlrghnihgvphhlrghgvghmrghnsehgmhgrihhlrdgtohhm pdhrtghpthhtohepohdrrghlvgigrghnughrvgdrfhgvlhhiphgvsehgmhgrihhlrdgtoh hmpdhrtghpthhtoheprhhosggvrhhtmhhhrggrshesghhmrghilhdrtghomhdprhgtphht thhopehthhhomhgrshdrmhhunhhrohesghhmrghilhdrtghomhdprhgtphhtthhopehpgh hsqhhlqdhhrggtkhgvrhhssehlihhsthhsrdhpohhsthhgrhgvshhqlhdrohhrgh X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Mon, 16 Feb 2026 11:48:38 -0500 (EST) Date: Mon, 16 Feb 2026 11:48:38 -0500 From: Andres Freund To: Tomas Vondra Cc: Peter Geoghegan , Alexandre Felipe , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Subject: Re: index prefetching Message-ID: References: <64a2re223ajj4popowsyu4xekbuvvyfwkrihn5yzyrkwsmsuvp@2lls3tpww5dl> <52512325-b1f2-4fff-819e-f68122b2e427@vondra.me> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <52512325-b1f2-4fff-819e-f68122b2e427@vondra.me> List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2026-02-16 16:07:24 +0100, Tomas Vondra wrote: > > stream->distance is just a cap of how far we *may* look ahead, not how far we > > are currently looking ahead. > > > > E.g. if you have a stream that full tilt blazes ahead with 1 block random IOs, > > none of then in s_b, you'll soon have a distance that's large, as it gets > > doubled for every miss until hitting the cap (of io_combine_limit * > > effective_io_concurrency, capped by the buffer pin limit). But because you're > > doing random IO, you're just effective_io_concurrency IOs, not > > effective_io_concurrency * io_combine_limit. > > > > This gets even more extreme if you yield often, because that will lead to the > > distance staying relatively high, while preventing actually issuing much > > concurrent IO. > > > > I wonder if this might be hurting us. Peter was working on adding what > he calls "adaptive yielding", so maybe it could be preventing issuing > much enough concurrent IOs. Yes, It's hurting quite substantially. For well correlated index scans it prevents readahead from becoming aggressive enough even on a local low latency SSD. Which means it'll not even be even remotely aggressive enough on a networked block store. On a pgbench scale 300, with io_method=io_uring and debug_io_direct=data (that avoids needing to drop the entire kernel buffers and makes the results more predictable) and local changes to measure the stats over ios_in_progress, instead of distance: psql -Xq \ -c "SELECT pg_buffercache_evict_relation('pgbench_accounts');" \ -c "SELECT pg_prewarm('pgbench_accounts_pkey')" && psql -Xq \ -c "SET enable_seqscan = 0; SET enable_indexonlyscan = 0;" \ -c "EXPLAIN ANALYZE SELECT count(*) FROM (SELECT * FROM pgbench_accounts ORDER BY aid);" Aggregate (cost=1645852.77..1645852.78 rows=1 width=8) (actual time=10195.159..10195.160 rows=1.00 loops=1) Buffers: shared hit=81971 read=491804 I/O Timings: shared read=3609.144 -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (cost=0.56..1270852.22 rows=30000044 width=352) (actual time=0.314..8758.229 rows=30000000.> Index Searches: 1 Prefetch: distance=0.000 count=519121 stalls=81967 skipped=29507836 resets=0 pauses=27322 ungets=0 forwarded=0 histogram [1,2) => 519121 Buffers: shared hit=81971 read=491804 I/O Timings: shared read=3609.144 Planning: Buffers: shared hit=70 Planning Time: 0.225 ms Execution Time: 10195.214 ms iostat looks like this: Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util nvme6n1 5089.00 178.93 0.00 0.00 0.13 36.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.65 65.90 Note that there is pretty much *no* readhead, because the yields happen more frequently than a io_combine_limit sized IO can be formed. With the yielding logic disabled: Aggregate (cost=1645852.77..1645852.78 rows=1 width=8) (actual time=5631.206..5631.207 rows=1.00 loops=1) Buffers: shared hit=81971 read=491804 I/O Timings: shared read=140.005 -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (cost=0.56..1270852.22 rows=30000044 width=352) (actual time=0.524..4491.521 rows=30000000.> Index Searches: 1 Prefetch: distance=30.981 count=533770 stalls=6 skipped=29507836 resets=0 pauses=41966 ungets=0 forwarded=0 histogram [1,2) => 44, [2,4) => 31, [4,8) => 49, [8,16) => 81, [16,32) => 145, [32,64) => 533420 Buffers: shared hit=81971 read=491804 I/O Timings: shared read=140.005 Planning: Buffers: shared hit=70 Planning Time: 0.244 ms Execution Time: 5631.259 ms iostat looks like this: Device r/s rMB/s rrqm/s %rrqm r_await rareq-sz w/s wMB/s wrqm/s %wrqm w_await wareq-sz d/s dMB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util nvme6n1 7508.00 682.55 0.00 0.00 0.18 93.09 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.32 100.00 And the only reason the difference isn't bigger is that the second query is completely CPU bound. > > > >> IIRC in that particular case we needed to know how far ahead is the > >> "prefetch position" (I mean, how many index entries are in between). > > > > Right - but that's not what looking at ->distance tells you :). I think you > > could use ->pinned_buffers for it, if you want to look at the number of > > blocks, not the number of IOs. > > > > Sure, but it's the one thing that's easily accessible, even if it's > imperfect. It can definitely tell us when the distance "collapses" close > to 1.0 (in which case we can't be issuing any concurrent IOs). I don't follow - ->pinned_buffers is also readily accessible? As is ->ios_in_progress? I don't think the distance really tells you that there's no concurrent IO. E.g. the query from above, with yielding enabled, and the stats using ->distance -> Index Scan using pgbench_accounts_pkey on pgbench_accounts (cost=0.56..1270852.22 rows=30000044 width=352) (actual time=0.166..12169.987 rows=30000000> Index Searches: 1 Prefetch: distance=9.421 count=519121 stalls=81967 skipped=29507836 resets=0 pauses=27322 ungets=0 forwarded=0 histogram [1,2) => 27322, [2,4) => 81967, [4,8) => 191253, [8,16) => 218579 Buffers: shared hit=81971 read=491804 I/O Timings: shared read=5469.932 Sure looks like there might be some concurrent IO... > >>> FWIW, if I change the batchdistance <= 2 check to <= 8, I get good perf even > >>> with io_combine_limit=16: > >>> > >>> stats using stream->ios_in_progress: > >>> Prefetch: distance=2.605 count=315526 stalls=3 skipped=9687128 resets=0 pauses=3035 ungets=0 forwarded=50 > >>> histogram [1,2) => 72679, [2,4) => 170115, [4,8) => 72682 > >>> Buffers: shared hit=27325 read=312500 > >>> I/O Timings: shared read=125.902 > >>> > >>> but that was just an experiment. > >>> > >> > > I missed this comment about batchdistance before. I'm sure there are > cases where a higher threshold would work better, but IIRC it's meant as > a safety for short queries that happen to run-away trying to look for > the next block to prefetch. And the block may not even be needed in some > cases, e.g. with LIMIT. > > So increasing the value to 8 would help this particular query, but could > easily hurt various other queries - like index-only scans with LIMIT. > > Maybe we should gradually ramp-up the threshold, instead of keeping it > at 2 forever. The comment seems to say it's about avoiding to look very into the future when using index only scans that just need a few heap lookups. Certainly an important goal. I did not actually want to suggest that changing this to ->batchdistance <= 8 makes sense, I was just experimenting to figure out why the query was so IO bound / not doing any actual readahead. I briefly experimented with a heuristic that skips yielding if there are few IOs in flight within the lookahead distance. The idea being that that would detect cases where io combining limits how many IOs we have in flight, despite looking reasonably far ahead. While that works, it's not obvious how many in-flight IOs we'd want before yielding? So I guess yes, you'd need some ramping? One thing that does confuse me about the yielding logic is that it seems to actually put a cap on ever looking more than two batches ahead (with a bit of fuzziness)? Why support more batches then (INDEX_SCAN_MAX_BATCHES == 128)? Isn't two batches too low for anything with some correlation on even remotely higher latency storage? > >> I'll take a close look tomorrow, but AFAICS we really aim to measure two > >> rather different things. I've been interested in "how far ahead are we > >> looking" and you're more interested in the number of I/Os initiated by > >> the stream. Which both seem interesting and necessary to understand > >> what's going on. > > > > When do you care about the distance purely in blocks, rather than IOs? If you > > can't actually have IO concurrency, due to io combining and yielding / low pin > > limits never actually allowing multiple IOs, you'll have no gain from AIO. > > > > If we're doing a 1000 IOs, but each IO can is either 8K or 128K chunk, > those seem like rather different situations, no? The bandwidth will be > vastly different, and it also saturates the other operators differently. > But I haven't thought about that too deeply. Just knowing the number of > IOs seems incomplete. Hm, maybe tracking the average size of IOs would be quite interesting? Given how big the perf effects of being able to combine or not are? Then the number of IOs would tell you a lot more... Greetings, Andres Freund