Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1va0Zn-002MIw-1J for pgsql-hackers@arkaria.postgresql.org; Sun, 28 Dec 2025 23:53:32 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1va0Zm-00EWHS-1O for pgsql-hackers@arkaria.postgresql.org; Sun, 28 Dec 2025 23:53:31 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1va0Zl-00EWHK-2f for pgsql-hackers@lists.postgresql.org; Sun, 28 Dec 2025 23:53:30 +0000 Received: from relay6-d.mail.gandi.net ([217.70.183.198]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1va0Zf-0035hb-30 for pgsql-hackers@lists.postgresql.org; Sun, 28 Dec 2025 23:53:24 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id F0E6B41BC9; Sun, 28 Dec 2025 23:53:17 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1766965999; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=/m8+NIrhAZhgsFo8gdG2aCIkPc/QKfch0XVNdyuyl14=; b=In5tX+BKb1ekVNyckhazTNo1AOaTKR2f0QnK5m89+QzZsqxK0B9Jf3Edoz4pOs69iXZPn6 X1XkgmKe7R7BXkBUfKVc2xPssNqmxXn+79CiZw7GfgTyX3M28hf9PfDMvIUyC10xLZlR8Q XN0iZrwHZ7kPwzLSyXYwfOtrXmN2EuR3MPp5DA9ZU70pCMklFKApXfyy9LruRQ4kxA8h53 bzYi+iBBvgMCIDMmWnuvz7TLJ85FB1cgtwZELBDo8XIuXonXGW8lZptpTOX6co2lQHSLJs d8a265I8+r3l23w1pRKOcRjamwkd6RoStBPREUztz54US66pnIQxDjksTPsNMQ== Message-ID: <95b6c6ef-4ba1-44f6-b1ac-eef0bbb5327a@vondra.me> Date: Mon, 29 Dec 2025 00:53:16 +0100 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching To: Konstantin Knizhnik , Peter Geoghegan Cc: Andres Freund , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Dilip Kumar References: <6d59c277-c440-4d1f-a46e-157958c06a5f@vondra.me> <4zeu5yb73byiquvf3eefsunnrydyqfxy3eup66jrliutrtd4xl@5iifjey4n5m5> <8f123089-49ed-4c35-ab28-e3df70d28043@garret.ru> <69908539-3cc5-4572-b69f-055fa40d3bea@garret.ru> Content-Language: en-US From: Tomas Vondra In-Reply-To: <69908539-3cc5-4572-b69f-055fa40d3bea@garret.ru> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-GND-Sasl: tomas@vondra.me X-GND-State: clean X-GND-Score: -100 X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeefgedrtddtgdejheeivdcutefuodetggdotefrodftvfcurfhrohhfihhlvgemucfitefpfffkpdcuggftfghnshhusghstghrihgsvgenuceurghilhhouhhtmecufedtudenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujfgurhepkfffgggfuffvvehfhfgjtgfgsehtkeertddtvdejnecuhfhrohhmpefvohhmrghsucggohhnughrrgcuoehtohhmrghssehvohhnughrrgdrmhgvqeenucggtffrrghtthgvrhhnpeeuvddvieefffefkedugefgtdeigeelgfegudehffevieehgffghefgvdduteffveenucfkphepkeeirdegledrvdeftddrvddtieenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepihhnvghtpeekiedrgeelrddvfedtrddvtdeipdhhvghloheplgdutddrudefjedrtddrudekngdpmhgrihhlfhhrohhmpehtohhmrghssehvohhnughrrgdrmhgvpdhqihgupefhtdfgieeugeduueevledpmhhouggvpehsmhhtphhouhhtpdhnsggprhgtphhtthhopedutddprhgtphhtthhopehknhhiiihhnhhikhesghgrrhhrvghtrdhruhdprhgtphhtthhopehpghessghofihtrdhivgdprhgtphhtthhopegrnhgurhgvshesrghnrghrrgiivghlrdguvgdprhgtphhtthhopehthhhomhgrshhmuhhnrhhosehgmhgrihhlrdgtohhmpdhrtghpthhtohepsgihrghvuhiikedusehgmhgrihhlrdgtohhmpdhrtghpthhtoheprhhosggvrhhtm hhhrggrshesghhmrghilhdrtghomh List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 12/28/25 21:30, Konstantin Knizhnik wrote: > > On 28/12/2025 8:08 PM, Tomas Vondra wrote: >> On 12/25/25 16:39, Konstantin Knizhnik wrote: >>> On 21/12/2025 7:55 PM, Peter Geoghegan wrote: >>>> On Wed, Dec 10, 2025 at 9:21 PM Peter Geoghegan wrote: >>>>> Attached is v4. >>>> Attached is v5. Changes from v4: >>>> >>>> * Simplified and optimized index-only scans, with a particular >>>> emphasis on avoiding regressions with nested loop joins with an inner >>>> index-only scan. >>>> >>>> There were quite a number of small problems/dead code related to >>>> index-only scans fixed by this new v5. Overall, I'm quite a bit >>>> happier with the state of index-only scans, which I'd not paid too >>>> much attention to before now. >>>> >>>> * Added Valgrind instrumentation to the hash index patch, which was >>>> required to fix some false positives. >>>> >>>> The generic indexam_util_batch_unlock routine had Valgrind >>>> instrumentation in earlier versions, just to keep nbtree's buffer >>>> locking checks from generating similar false positives. Some time >>>> later, when I added the hashgetbatch patch, there were new Valgrind >>>> false positives during hash index scans -- which I missed at first. >>>> This new v5 revisions adds similar Valgrind checks to hash itself >>>> (changes that add code that is more or less a direct port of the stuff >>>> added to nbtree by commit 4a70f829), which fixes the false positives, >>>> and is independently useful. >>>> >>>> The rule for amgetbatch-based index AMs is that they must have similar >>>> buffer locking instrumentation. That seems like a good thing. >>>> >>>> --  >>>> Peter Geoghegan >>> I the previous mail I shared results of my experiments with different >>> prefetch distance. >>> I think that we should start prefetching of heap tuples not from the >>> second batch, but after some number of proceeded tids. >>> >>> Attached please find a patch which implements this approach. >>> And below are updated results: >>> >>> limit\prefetch    on      off   always  inc    threshold >>> 1                 12074   12765  3146    3282     12394 >>> 2                 5912    6198   2463    2438      6124 >>> 4                 2919    3047   1334    1964      2910 >>> 8                 1554    1496   1166    1409      1588 >>> 16                815     775    947     940        600 >>> 32                424     403    687     695        478 >>> 64                223     208    446     453        358 >>> 128               115     106    258     270        232 >>> 256               68      53     138     149        131 >>> 512               43      27     72      78          71 >>> 1024              28      13     38      40          38 >>> >>> Last column is result of prefetch with read_stream_threshold=10. >>> >> That's great, but it only works for cases that can (and do) benefit from >> the prefetching. Try running the benchmark with a data set that fits >> into shared buffers (or RAM), which makes prefetching useless. >> >> I tried that with your test, comparing master, v5 and v5 + your >> read_stream_threshold patch. See the attached run.sh script, and the PDF >> summarizing the results. The last two column groups are comparisons to >> master, with green=improvement, red=regression. There are no actual >> improvements (1% delta is just noise). But the read_stream_threshold >> results have a clear pattern of pretty massive (20-30%) regressions. >> >> The difference between v5 and v5-threshold is pretty clear. >> >> IIRC cases like this are *exactly* why we ended up with the current >> heuristics, enabling prefetching only from the second batch. This >> removes the risk of expensive read_stream init for very fast queries >> that don't benefit anything. Of course, prefetching may be useless for >> later batches too (e.g. if all the data is cached), but the query will >> be expensive enough for the read_stream init cost to be negligible. >> >> To put this differently, the more aggressive the heuristics is (enabling >> prefetching in more case), the more likely it's to cause regressions. >> We've chosen to be more defensive, i.e. to sacrifice some possible gains >> in order to not regress plausible workloads. I hope we agree queries on >> fully cached "hot" data are pretty common / important. >> >> We can probably do better in the future. But we'll never know for sure >> if a given scan benefits from prefetching. It's not just about the >> number of items in the batch, but also about how many heap pages that >> translates to, what I/O pattern (random vs. sequential?), how many are >> already cached. For some queries we don't even know how many items we'll >> actually need. We can't check all that at the very beginning, because >> it's simply prohibitively expensive. > > > I tried to reproduce your results, but at Mac I do not see some > noticeable difference  for 250k records, fillfactor=10 and 4GB shared > buffers > between `enable_indexscan_prefetch=false` and > `enable_indexscan_prefetch=true`. > I can't believe that just adding this checks in `heap_batch_advance_pos` > can cause 75% degrade of performance (because for limit < 10, no read > stream is initialized, but still we somewhere loose 25%). > > I just commented this fragment of code in heapam_handler.c: > > > #if 0 >     proceed_items = ScanDirectionIsForward(direction) >         ? pos->item - batch->firstItem >         : batch->lastItem - pos->item; >     /* Delay initializing stream until proceeding */ >     if (proceed_items >= read_stream_threshold >         && !scan->xs_heapfetch->rs >         && !scan->batchqueue->disabled >         && !scan->xs_want_itup    /* XXX prefetching disabled for IoS, > for now */ >         && enable_indexscan_prefetch) >     { >         scan->xs_heapfetch->rs = >             read_stream_begin_relation(READ_STREAM_DEFAULT, NULL, >                                        scan->heapRelation, MAIN_FORKNUM, >  scan->heapRelation->rd_tableam->index_getnext_stream, >                                        scan, 0); >     } > #endif > > and ... see no difference. > > I can understand why initializing read stream earlier (not at the second > batch, but after 10 proceeded items) may have negative impact on > performance when all data is present i shared buffers for LIMIT>=10. > But how it can happen with LIMIT 1 and commented fragment above. There > is nothing else in my patch except adding GUC. > So I think that it is some "external" factor and wonder if you can > reproduce this results (just first line). > It seems this is due to sending an extra SET (for the new GUC) in the pgbench script, which is recognized only on the v5+threshold build. That's a thinko on my side, I should have realized the extra command might affect this. It doesn't really affect the behavior, because 10 is the default value for read_stream_threshold. I've fixed the script, will check fresh results tomorrow. Still, I think most of what I said about heuristics when to initialize the read stream, and the risk/benefit tradeoff, still applies. regards -- Tomas Vondra