Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vrjUQ-0038qE-0B for pgsql-hackers@arkaria.postgresql.org; Sun, 15 Feb 2026 21:17:14 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vrjUP-002KCe-0E for pgsql-hackers@arkaria.postgresql.org; Sun, 15 Feb 2026 21:17:13 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vrjUO-002KCW-22 for pgsql-hackers@lists.postgresql.org; Sun, 15 Feb 2026 21:17:12 +0000 Received: from relay8-d.mail.gandi.net ([217.70.183.201]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1vrjUM-00000000qZg-0zb5 for pgsql-hackers@lists.postgresql.org; Sun, 15 Feb 2026 21:17:12 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id A54E344507; Sun, 15 Feb 2026 21:17:06 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1771190228; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=MPzNQnYkY4/5njmofggpFYxIG0Th83aEt5MOufBWtEw=; b=KQthnSvRKV8dhO+SuZaiXFewQ3OV5t/9B48BcyRwZoSzXMAhODnx0T5XCLZXnWrQMkbq53 sLzvRSMs9foeVptf5ipUm0mfOEAYKVW7XDcXOaHZjRNKZClTVe7BdayJPPnoJXXtrWwQgf PH/R1LdA8z2/vQZhblyIR+EQQ86JYoaqAsfVqTTpM5O3/WE34c+QwsPQzpTxxhulD/D2dM wLzQZtmnhptqaipmRDdQqMa1bSygeYv7mWPYcTCQ8yEB6GzcKI2XWLnonQx41D1I6HOdDC 1auBF4pYPrYd6knRe44Jc6kPZNQTsdWSAtNGDnxDCqvSGzDGTS/AcCb64/7eLA== Message-ID: <984dcf9e-ada0-4dff-ae58-1f97bc904ccb@vondra.me> Date: Sun, 15 Feb 2026 22:17:05 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching To: Alexandre Felipe Cc: Peter Geoghegan , Andres Freund , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar References: <6d59c277-c440-4d1f-a46e-157958c06a5f@vondra.me> <9411f220-007d-4f1e-9c8f-ca8eb09e6788@vondra.me> Content-Language: en-US From: Tomas Vondra In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-GND-Sasl: tomas@vondra.me X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgddvudehtdehucetufdoteggodetrfdotffvucfrrhhofhhilhgvmecuifetpfffkfdpucggtfgfnhhsuhgsshgtrhhisggvnecuuegrihhlohhuthemuceftddunecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjughrpefkffggfgfuvfevfhfhjggtgfesthekredttddvjeenucfhrhhomhepvfhomhgrshcugghonhgurhgruceothhomhgrshesvhhonhgurhgrrdhmvgeqnecuggftrfgrthhtvghrnhepuedvvdeifefffeekudeggfdtieeglefggeduheffveeihefggfehgfdvudetffevnecukfhppeekiedrgeelrddvfedtrddvtdeinecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehinhgvthepkeeirdegledrvdeftddrvddtiedphhgvlhhopegluddtrddufeejrddtrddvngdpmhgrihhlfhhrohhmpehtohhmrghssehvohhnughrrgdrmhgvpdhqihgupeetheeggfefgeeghedtjedpmhhouggvpehsmhhtphhouhhtpdhnsggprhgtphhtthhopeduuddprhgtphhtthhopehorghlvgigrghnughrvghfvghlihhpvgesghhmrghilhdrtghomhdprhgtphhtthhopehpghessghofihtrdhivgdprhgtphhtthhopegrnhgurhgvshesrghnrghrrgiivghlrdguvgdprhgtphhtthhopehthhhomhgrshhmuhhnrhhosehgmhgrihhlrdgtohhmpdhrtghpthhtohepsgihrghvuhiikedusehgmhgrihhlrdgtohhmpdhrtghpthhto heprhhosggvrhhtmhhhrggrshesghhmrghilhdrtghomh X-GND-State: clean X-GND-Score: -100 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 2/15/26 09:08, Alexandre Felipe wrote: > Another quick response, I am impressed > > I'm quite confused by the scripts you shared, it seems incomplete. The > run_regression.py is meant to call purge_cache.sh (which is missing), > and the run_benchmark tries to call all sorts of missing .sql scripts. > > > That was a confusion I did, I uploaded run_benchmark that was just a > nonsensical AI generated script in my work area and I didn't upload > purge_cache.sh. Notice that you can still run the python script with > evict mode off,pg. The missing script is attached in my previous message.  >   The purge_cache.sh script does not use sudo anywhere, so on Linux it has no effect at all. No idea if "purge" works on MacOS like this. > > A table that is just 24MB and fits into buffers is a bit useless. It > means that even with random pattern (which is generally about the best > for prefetching), only about ~1/30 of pages will require I/O. Each page > has ~32 items, but only the first item from each page will incur an > I/O.  > > > OK, I will start it with different parameters, for development I needed > something that was not too slow, to be able to catch the bugs. > > On what kind of hardware? How much variance is in the results? > > Is a Mac mini M1. Please check the .PNG it shows the confidence > intervals of the difference of execution time with the parameter ON and > OFF, with different settings. > I don't have access to a M1 machine (and it also does not say what type of storage is it using, which seems pretty important for a patch aiming to improve I/O behavior). But I tried running this on my ryzen machine with local SSDs (in RAID0), and with the 100k rows (and fixed handling of page cache) I get this: column_name io_method evict n master_ms off_ms on_ms effect_pct periodic worker off 10 35.8 35.1 36.5 2.0 periodic worker os 10 49.4 49.9 58.8 8.1 periodic worker pg 10 39.5 39.9 47.1 8.3 random worker off 10 35.9 35.6 35.7 0.2 random worker os 10 49.0 49.0 42.6 -7.0 random worker pg 10 39.6 39.9 40.9 1.2 sequential worker off 10 28.2 27.9 27.7 -0.4 sequential worker os 10 39.3 39.2 34.8 -6.0 sequential worker pg 10 30.1 30.1 29.4 -1.3 column_name io_method evict n master_ms off_ms on_ms effect_pct periodic io_uring off 10 35.9 35.8 35.8 -0.1 periodic io_uring os 10 49.3 49.9 50.0 0.1 periodic io_uring pg 10 40.1 39.8 41.7 2.4 random io_uring off 10 35.6 35.2 35.7 0.8 random io_uring os 10 49.1 48.9 46.1 -3.0 random io_uring pg 10 39.8 40.1 42.6 3.1 sequential io_uring off 10 28.0 27.8 28.0 0.4 sequential io_uring os 10 39.8 39.1 40.7 1.9 sequential io_uring pg 10 30.2 30.0 29.6 -0.8 This is on default config with io_workers=12 and data_checksums=off. I'm not showing results for parallel query, because it's irrelevant. This also has timings for master, for worker and io_uring (which you could not get on M1, at least no in MacOS). For "worker" the differences are much smaller (within 10% in the worst case), and almost non-existent for io_uring. Which suggests this is likely due to the "signal" overhead associated with worker, which can be annoying for certain data patterns (where we end up issuing an I/O for individual blocks at distance 1). That is not something the index prefetching can "fix", this affects all places using the read stream facility (it may be true index scans are more likely to generate such "poor" data patterns). The above is about the best we can do right now. In the future we might tweak the read stream heuristics, or invent a special mode for index scans, not sure. As already suggested, the whole point of index prefetching is improving cases where we have to do a fair amount of I/O. And the 100k rows is far too small for that - 24MB can get easily cached in various places (shared buffers, kernel, the drive itself, ...). The best we can hope for in this case is not causing regressions. Which I'd say we're mostly achieving, per the results I just shared. But let's look at a larger data set, with 1M rows and fillfactor=10. That inflates the table to 2.6GB, which is a bit more sensible. column_name io_method evict n master_ms off_ms on_ms pct periodic worker off 10 737.5 747.3 766.3 1.3 periodic worker os 10 2949.1 3000.1 1148.1 -44.6 periodic worker pg 10 745.3 740.1 757.0 1.1 random worker off 10 1491.4 1503.2 1103.8 -15.3 random worker os 10 11236.7 12100.0 2363.9 -67.3 random worker pg 10 1523.0 1489.6 1122.2 -14.1 sequential worker off 10 627.6 629.9 377.8 -25.0 sequential worker os 10 963.4 975.0 588.6 -24.7 sequential worker pg 10 630.8 631.4 372.2 -25.8 column_name io_method evict n master_ms off_ms on_ms pct periodic io_uring off 10 752.3 753.1 858.3 6.5 periodic io_uring os 10 2956.4 2800.6 1210.0 -39.7 periodic io_uring pg 10 738.5 750.0 858.3 6.7 random io_uring off 10 1502.2 1506.3 1691.6 5.8 random io_uring os 10 1905.9 11159.4 2756.7 -60.4 random io_uring pg 10 1535.6 1512.0 1680.4 5.3 sequential io_uring off 10 628.5 630.6 584.4 -3.8 sequential io_uring os 10 993.1 962.2 937.6 -1.3 sequential io_uring pg 10 625.7 634.6 587.1 -3.9 Clearly, significant benefits, but also some interesting differences in behavior between worker/io_uring. I haven't looked into the details, but my guess would be it's due to the different "parallelism" of the two methods (io_uring is confined to the single process, worker can spread the work over multiple processes). It does not surprise me, and there is no single "best" I/O method. And never will be. Another thing to consider is that 2.6GB table is much larger than shared buffers, which means it'll never be "fully cached" in PG. regards -- Tomas Vondra