Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vs85r-007AS7-13 for pgsql-hackers@arkaria.postgresql.org; Mon, 16 Feb 2026 23:33:31 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vs85q-006Tty-0J for pgsql-hackers@arkaria.postgresql.org; Mon, 16 Feb 2026 23:33:30 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vs85p-006Ttn-2I for pgsql-hackers@lists.postgresql.org; Mon, 16 Feb 2026 23:33:29 +0000 Received: from relay8-d.mail.gandi.net ([217.70.183.201]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1vs85m-00000000y4t-2T4D for pgsql-hackers@lists.postgresql.org; Mon, 16 Feb 2026 23:33:28 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id 6D87B43371; Mon, 16 Feb 2026 23:33:22 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1771284804; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=7ZyB/74/HnY8cNspEJkj2h8eScWKUujANQvOKjvu2Bs=; b=NWLiUQkqFawPIRlLDfx7bbT49Y1n1d4jxdDIPmHyd2NJp2IndTKry3/xrwwvTSsbgZ1kX3 fJi3R4HMDLY5C0xey4Iq31gMZ4MiwWWEO2PTxHh3XsajvXunYBl1ysbopf/WVKrbEV4y9S xojpKGelaFMSCARigpFgCDIh6TF1Ib/bLxWbsHyEQZXfXcK2XKfCcmGHPGyIMqf1nAkTd4 Al11+rw84DswARscHY8qnnJhfwL3HbTlzFE/F6T1mpPXB4ct2MqnUgnqW7E010Sr4Rrys/ GLTnGX/nxWKwEpCnW/+byY0GotIKCPQEdlS7yV42FMYJ0WL3ssaAx/snejwEkQ== Message-ID: <833fb173-e59b-47d2-929d-5712987d3781@vondra.me> Date: Tue, 17 Feb 2026 00:33:21 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching To: Alexandre Felipe , Andres Freund Cc: Peter Geoghegan , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Konstantin Knizhnik , Dilip Kumar References: <64a2re223ajj4popowsyu4xekbuvvyfwkrihn5yzyrkwsmsuvp@2lls3tpww5dl> Content-Language: en-US From: Tomas Vondra In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-GND-Sasl: tomas@vondra.me X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgddvudekvddtucetufdoteggodetrfdotffvucfrrhhofhhilhgvmecuifetpfffkfdpucggtfgfnhhsuhgsshgtrhhisggvnecuuegrihhlohhuthemuceftddunecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenucfjughrpefkffggfgfuvfevfhfhjggtgfesthekredttddvjeenucfhrhhomhepvfhomhgrshcugghonhgurhgruceothhomhgrshesvhhonhgurhgrrdhmvgeqnecuggftrfgrthhtvghrnhepuedvvdeifefffeekudeggfdtieeglefggeduheffveeihefggfehgfdvudetffevnecukfhppeekiedrgeelrddvfedtrddvtdeinecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehinhgvthepkeeirdegledrvdeftddrvddtiedphhgvlhhopegluddtrddufeejrddtrddvngdpmhgrihhlfhhrohhmpehtohhmrghssehvohhnughrrgdrmhgvpdhqihgupeeiffekjeeugeeffeejuddpmhhouggvpehsmhhtphhouhhtpdhnsggprhgtphhtthhopeduuddprhgtphhtthhopehorghlvgigrghnughrvghfvghlihhpvgesghhmrghilhdrtghomhdprhgtphhtthhopegrnhgurhgvshesrghnrghrrgiivghlrdguvgdprhgtphhtthhopehpghessghofihtrdhivgdprhgtphhtthhopehthhhomhgrshhmuhhnrhhosehgmhgrihhlrdgtohhmpdhrtghpthhtohepsgihrghvuhiikedusehgmhgrihhlrdgtohhmpdhrtghpthhto heprhhosggvrhhtmhhhrggrshesghhmrghilhdrtghomh X-GND-State: clean X-GND-Score: -100 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 2/17/26 00:05, Alexandre Felipe wrote: > Hi guys, > > There seems to be some very interesting stuff here, I have to try to > catch up with your analysis Andres. > > In the meantime. > > I am sharing the results I have got on a well behaved Linux system. > Can you share how is the system / Postgres configured? It's a good practice to provide all the information others might need to reproduce your results. In particular, what is shared_buffers set to? Are you still using io_method=worker? With how many io workers? > No sophisticated algorithm here but evicting OS cache helps to verify > the benefit of prefetching at a much smaller scale, and I think this is > useful > % gcc drop_cache.c -o drop_cache; > % sudo chown root:root drop_cache; > % sudo chmod 4755 drop_cache; > > I was executing like this > python3 .../run_regression_test.py --port 5433 --iterations 10 \ >             --columns sequential,random --workers 0 --evict os,off \ >             --payload-size 50 \ >             --rows 10000 \ >             --reset \ >             --ntables 5 > > 1 table: significant benefit with HDD cold, SSD random cold access. > 5 tables: significant benefit for random cold access. Somewhat > detrimental for sequential cold access, and random hot access. > 10 tables: significant benefit for random cold access. Slightly better > than 5 tables for cold sequential access, and somewhat detrimental for > random hot access. > > These results are hard to explain, but maybe Andres has the answer: >> I think this specific issue is a bit different, because today you get >> drastically different behaviour if you have >>  >> a) [miss, (miss, hit)+] >> vs >> b) [(miss, hit)+] > What's the distance in those cases? You may need to add some logging to read_stream to show that. If the distance is not ~1.0 then it's not the issue described by Andres, I think. There are other ways to look at issued IOs, either using iostat, or tools like perf-trace. > > Tomas said >> I think a "proper" solution would require some sort of cost model for >> the I/O part, so that we can schedule the I/Os just so that the I/O >> completes right before we actually need the page. > > I dare to ask > Why not use this on a feedback loop? > > while (!current_buffer.ready && reasonable to prefetch) { >   fetch next index tuple. >   if necessary prefetch one more buffer > } > What does "reasonable to prefetch" mean in practice, and how you determine it at runtime, before initiating the buffer prefetch? > I also dare to ask > Is it possible to skip an unavailable buffer and gain time processing > the rows that will be needed afterwards? > This could also help by releasing buffers more quickly if they need to > be recycled. > Not at the moment, AFAIK. And for most index-only scans that would not really work anyway, because those need to produce sorted output. regards -- Tomas Vondra