Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vw8LJ-004jpe-24 for pgsql-hackers@arkaria.postgresql.org; Sat, 28 Feb 2026 00:38:01 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vw8LI-008IXD-0S for pgsql-hackers@arkaria.postgresql.org; Sat, 28 Feb 2026 00:38:00 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vw8LH-008IWm-1G for pgsql-hackers@lists.postgresql.org; Sat, 28 Feb 2026 00:37:59 +0000 Received: from fout-a6-smtp.messagingengine.com ([103.168.172.149]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1vw8LD-00000001ahH-3B6z for pgsql-hackers@lists.postgresql.org; Sat, 28 Feb 2026 00:37:58 +0000 Received: from phl-compute-01.internal (phl-compute-01.internal [10.202.2.41]) by mailfout.phl.internal (Postfix) with ESMTP id A7250EC00D0; Fri, 27 Feb 2026 19:37:55 -0500 (EST) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-01.internal (MEProxy); Fri, 27 Feb 2026 19:37:55 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-transfer-encoding:content-type:content-type:date :date:from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to:to; s=fm3; t=1772239075; x=1772325475; bh=uitKDKncM3FSntAVebtpdeg29Ys5HSS0o5oVj+YZL6k=; b= Rovh4OM0ZXaBBdBirk47iUOlQQW/+PBvkVY09JWhVk0s733aPuGpIaxYDMHNGDIC Aqd06gsvgX1yBHm+QAl9LUyEW01tCZt6/wAp6hszUwZas9Gg4UIeFJhcF82LzinP UpL67+VVlzyZ9BB0K9T1WchPWm8gyEDFJJuwqQchxMcHsmttnGNyDqz+0X6AU4pF fCHZdemcYkQln8Dn3eaC3g/8VHbmsswWvvhnnUZ4H+VyaUmDJMX/8U4tu/CEN6Q7 v5pORMHcxeZ4CXwoaGaPdH3tgg6xij1HOtYd3mU+zvdc4QuEY94y2zpZLg6ChhpF HaZ5JY6ioYO+uquFsR0wWg== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-transfer-encoding :content-type:content-type:date:date:feedback-id:feedback-id :from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to:to:x-me-proxy :x-me-sender:x-me-sender:x-sasl-enc; s=fm3; t=1772239075; x= 1772325475; bh=uitKDKncM3FSntAVebtpdeg29Ys5HSS0o5oVj+YZL6k=; b=G S1eQnGh9hEYB6nVLu3M/ABKDpZ3cn7TYAikUhzPGvSDhUO4SGgcpgooZaYbR7GCw v2WWp6mpxretwNzyIr4gMQ70P3+PqIb+FGV/Y4z3pvZDTs+qlhEWEeZGLK5M3jei PK5eVRDWAsJ/n5gDnghNuvxSmlYlecbPl+nJ8k06iV7NRdIea31vs5UZpz1+XxbW PlP5hCWP0wbXXLGDTmDiQFhkcgJLdf1EhszeBaX1rtIiMS8b9tipdmhLqVwNFL3i HpaUwixqjdaZ38cIRHjuYmZKLgj1vuYTlJe0XKWZ8vYID2i4L3fUBuKSN698U/CW vleL7sqlHCtUflRslAN+A== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgddvhedtgeeiucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujf gurhepfffhvfevuffkfhggtggugfgjsehtkefstddttdejnecuhfhrohhmpeetnhgurhgv shcuhfhrvghunhguuceorghnughrvghssegrnhgrrhgriigvlhdruggvqeenucggtffrrg htthgvrhhnpedtleelvdfgjedvffeiueekfeeuleffhfegfffhgfffkeevueehieehhfei gffhvdenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrihhlfhhrohhmpe grnhgurhgvshesrghnrghrrgiivghlrdguvgdpnhgspghrtghpthhtohepuddupdhmohgu vgepshhmthhpohhuthdprhgtphhtthhopehpghessghofihtrdhivgdprhgtphhtthhope hknhhiiihhnhhikhesghgrrhhrvghtrdhruhdprhgtphhtthhopegshigrvhhuiiekudes ghhmrghilhdrtghomhdprhgtphhtthhopeguihhlihhpsggrlhgruhhtsehgmhgrihhlrd gtohhmpdhrtghpthhtohepmhgvlhgrnhhivghplhgrghgvmhgrnhesghhmrghilhdrtgho mhdprhgtphhtthhopehordgrlhgvgigrnhgurhgvrdhfvghlihhpvgesghhmrghilhdrtg homhdprhgtphhtthhopehrohgsvghrthhmhhgrrghssehgmhgrihhlrdgtohhmpdhrtghp thhtohepthhhohhmrghsrdhmuhhnrhhosehgmhgrihhlrdgtohhmpdhrtghpthhtohepph hgshhqlhdqhhgrtghkvghrsheslhhishhtshdrphhoshhtghhrvghsqhhlrdhorhhg X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Fri, 27 Feb 2026 19:37:54 -0500 (EST) Date: Fri, 27 Feb 2026 19:37:53 -0500 From: Andres Freund To: Peter Geoghegan Cc: Tomas Vondra , Alexandre Felipe , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar Subject: Re: index prefetching Message-ID: References: <52512325-b1f2-4fff-819e-f68122b2e427@vondra.me> <64mfcfv7iihc4pmqlxarii4esnmqry52ckz5m7lmwylnfnuxuz@oxh4ioxkjtep> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2026-02-27 15:11:24 -0500, Peter Geoghegan wrote: > On Thu, Feb 26, 2026 at 11:18 PM Andres Freund wrote: > > Note how the increase in scanned heap pages actually *decreases* the overall > > time rather substantially. > > > > It's quite visible, both in iostat, and a query like > > SELECT pid, target_desc, off, length FROM pg_aios \watch 0.5 > > > > that for the first query has basically no IO concurrency, the second has very > > intermittent IO concurrency and the third one has nice IO concurrency. > > > > > > If I disable the yield logic, the fillfactor=90 case is good: > > I can recreate your results. Including the part where you found that > the problem would go away once yields were completely disabled. Ooops, just sent an review that I started writing a few hours ago that would have benefited from having read this first... > I can certainly understand why you're suspicious of the yielding > mechanism. I wonder if I gave undue weight to the merge join query I > showed you [1] (and one or two others like it). Declaring that the > underlying merge join/yielding issue is not worth the complexity > required to yield would certainly be convenient. Yielding *isn't* > helpful for the vast majority of individual queries, so I'm certainly > tempted. But I can't help but feel nervous about the large disparity > in the number of *index* pages read by that particular query, once the > yielding mechanism is disabled. I do continue to wonder if we ought to pass down some hints from the planner about how much data an indexscan is likely to read to influence readahead aggressiveness. I do agree it's right beign concerned about the increase in index fetches with such mark/restore cases. > With that being said, it seems as if yielding isn't the only factor in > play here. I also notice that even master exhibits roughly the same > performance disparity (also while using direct I/O, though with > shared_buffers set to 16GB rather than your 2GB): > ================================= > EXPLAIN OUTPUT (best run, master) > ================================= > > --- Fillfactor 90 --- > Index Scan using pgbench_accounts_ff90_pkey on pgbench_accounts_ff90 > Index Searches: 1 > Buffers: shared hit=27325 read=181819 > I/O Timings: shared read=16822.256 > Planning Time: 0.035 ms > Execution Time: 18048.198 ms > > --- Fillfactor 50 --- > Index Scan using pgbench_accounts_ff50_pkey on pgbench_accounts_ff50 > Index Searches: 1 > Buffers: shared hit=27325 read=333334 > I/O Timings: shared read=30685.965 > Planning Time: 0.028 ms > Execution Time: 32005.962 ms > > --- Fillfactor 25 --- > Index Scan using pgbench_accounts_ff25_pkey on pgbench_accounts_ff25 > Index Searches: 1 > Buffers: shared hit=27325 read=666667 > I/O Timings: shared read=10278.124 > Planning Time: 0.034 ms > Execution Time: 11796.573 ms > > While fillfactor 90 is fastest, fillfactor 25 is almost 3x faster than > fillfactor 50, despite performing about twice as many reads. I have to > imagine this relates to my Samsung 980 Pro SSD performing its own > read-ahead, in a way that works inconsistently across workloads. > Note again that this effect with master only appears when > shared_buffers is set to 16GB. With your 2GB shared_buffers setting, > master takes 17930.381 ms for FF 90, 31822.473 ms for FF 50, and > 61094.676 ms for FF 25 (which is at least consistent-ish in the way > that one would expect). Is this, by any chance, with starting the server and running these queries in that order? Are you repeating these runs within one server start, evicting the buffers inbetween? If you don't, you'll often get very inconsistent performance, because the first time a buffer pool page is accessed, you'll get a page fault, during which the kernel has to find free memory to back the page, which then also has to be zeroed out. With a small buffer pool you reach the point where individual buffers are reused much more quickly, which would explain why it only happens with the larger s_b. Are you using huge pages? I see rather differing performance results with/without when not prefetching. In fact, when not repeating the benchmarks and running them in order within the same "server start cycle", I get similar timings as you were when not using huge pages. But just running the queries in a different order gives very different results. > For context, here is how the patch compares to master with > shared_buffers=16GB (here master uses the same query execution/query > plans as those shown above) once the patch/Pfetch's yielding is > disabled: > > FF Heap Pages Master Pfetch ON ON/Master > ---------------------------------------------------- > 90 181819 18048.2 1465.0 0.081x > 50 333334 32006.0 1682.2 0.053x > 25 666667 11796.6 1928.4 0.163x > > I also noticed that the patch isn't at all sensitive to whether > shared_buffers is set to 2GB or 16GB -- not once yielding is disabled > like this. It really shouldn't be sensitive - that query will never be able to reuse heapam pages within a query, and evicting a clean buffer isn't that expensive. > I'm not sure how relevant this later point about "shared_buffers > sensitivity with yielding" really is. Nor am I sure if the effect with > master (and the possible role of device-level readahead) is all that > significant. I'm pointing all of this out in the hope that you can > offer an explanation that'll help me to improve my own intutions about > this stuff. I suspect it's really related to running multiple different queries in a specific order, without restarting in between. Exacerbated perhaps by not using huge pages. Greetings, Andres Freund