Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vZxPp-001gPN-0q for pgsql-hackers@arkaria.postgresql.org; Sun, 28 Dec 2025 20:31:02 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vZxPm-00EHds-2b for pgsql-hackers@arkaria.postgresql.org; Sun, 28 Dec 2025 20:30:59 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vZxPm-00EHde-1G for pgsql-hackers@lists.postgresql.org; Sun, 28 Dec 2025 20:30:59 +0000 Received: from mail-ed1-x536.google.com ([2a00:1450:4864:20::536]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vZxPj-003I0B-2z for pgsql-hackers@lists.postgresql.org; Sun, 28 Dec 2025 20:30:58 +0000 Received: by mail-ed1-x536.google.com with SMTP id 4fb4d7f45d1cf-64d80a47491so10998651a12.1 for ; Sun, 28 Dec 2025 12:30:55 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=garret-ru.20230601.gappssmtp.com; s=20230601; t=1766953853; x=1767558653; darn=lists.postgresql.org; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=6tWDFCBmHJlvdu5Xd8bXl6zISNypilFB5SH2NIj2Mc4=; b=anWOWHiPi4Wvf0S+18lecV2oKuAYFfR1rolHEn4WC2e7zq+IPxcZwaqgJ0zeCZ0Coo Jes0tDowSuh8oqI9bf3UQSpD1lLCEOCRakjiQi8lqLrr2yLLMwNeFNv4f14L+xjRfqSr ZSsiGZgNJ0kWlzbQT0qfD0mVSMLgIBdLVkqJmvAweKj+VDwHdnRSg4PaUGKNZgtjEyDA bVWxER5FX+yCNsxzeOIUyUKfiIRUtWS8+9K4StjwtbSC1472l9vahKEzUnlAmLySZqn4 lMHSRnsdZ6ZL7BderIsGcsfAgmsS9DEyAlfNfzWlLL+GDLryHv/Lw8BamL8GcePpxTkm kBNQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1766953853; x=1767558653; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-gg:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=6tWDFCBmHJlvdu5Xd8bXl6zISNypilFB5SH2NIj2Mc4=; b=MkruVjN2OztTLzrTecAvcjximBaF/wmM2MVEEeo7MqJZ7bech7k5dAwCM3R25U1CqZ mHXM84VENd1ytLGqaC+dj+d+Z99IWNgyOacTmWW2Wfe9LF3zijkbSGmj4v3WkT8AzaHY i34wkOIdlrGCHwXmXA8HvUYoOnTwsxZbV8CJLXJONFNMh6MUwc7CO+bhz51y73iVHNeS jOhEpz9qrTAFKz8Vcv9bUtaJYsX4/9UlJ7yA88MKZ/aerT934BsODb8U5WLGSNKpMf9z a4aquHTOmMf7qaWiSsKJicbPC9hFfS15DUiLEVZAXnNb+jsOTHkDPWF6DNaJy7HGkKYS +nVA== X-Forwarded-Encrypted: i=1; AJvYcCVEJAWMEqU0LgsvXrgyUjA1mQnBDjGSGVontbNzRyYHiyXQ5bO+qhIfZF9Mh678qQtD2BBQqoVNo2TnHWJI@lists.postgresql.org X-Gm-Message-State: AOJu0YzQM+unw+D8Tw8yCsxUjpHbSwKujzq1UEPc5ptmpsH0Rq3LcWU1 +BmxDPpS1mDa4LQldyEmwW/gP9kn9WWGuTFT66glKcE8XHaFQ211F9Kk1ML4MgJ5Lw== X-Gm-Gg: AY/fxX7ap+MqHpYcFi223Xns18ADG7OpFvMwB1K0gw3AhqMyJD2eXo/TSN/P2xZmSCn 7mKTQo1r15i+v8ALO6wbOiyALis5x/0b5dQ14DxO3Z1GLHyhpnYbCeFIbeS96dQdcjT2tlpjoZr yh0PmNG6ItQs0AadONVBNRu+r+izHLG6xDy8QIX3O2hExOIkyHoPgLCOzH481nCpglyOdlnfZ0g rwC5epiVJ8myVQOUK37FTx9icoBmcO3OILBUCUMqIJiLbsr7yNVE21RjUO+wBHUG62shyZ/O0zy SnsaC3OUU+dXDND3Bpp48+uP867wDdJGj1hvp14OgZZVC4BR0PkoHV2xtnIxQAb8JYZl0LZkUGH 3dEKH7bRZ7uvvA++fdvDSle2AhpjogpfdKvrltCbpYG/L5EnALda5FQaaGZiMsYJwMSFpsCOSTG gRV2B/XXYss89vpPHtdAM4 X-Google-Smtp-Source: AGHT+IFfZMRo3pWvk2OviFwLDSDxK8aRdqTGc3ugfmhCgXz6/UhAqk+9XUHBR48hnj9hSC+iCRxolw== X-Received: by 2002:a05:6402:5188:b0:647:94e1:800f with SMTP id 4fb4d7f45d1cf-64b8ce39d40mr31793383a12.8.1766953852606; Sun, 28 Dec 2025 12:30:52 -0800 (PST) Received: from [10.248.37.148] ([130.41.208.2]) by smtp.gmail.com with ESMTPSA id 4fb4d7f45d1cf-64b90f53c16sm29558890a12.1.2025.12.28.12.30.50 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sun, 28 Dec 2025 12:30:51 -0800 (PST) Message-ID: <69908539-3cc5-4572-b69f-055fa40d3bea@garret.ru> Date: Sun, 28 Dec 2025 22:30:49 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: index prefetching To: Tomas Vondra , Peter Geoghegan Cc: Andres Freund , Thomas Munro , Nazir Bilal Yavuz , Robert Haas , Melanie Plageman , PostgreSQL Hackers , Georgios , Dilip Kumar References: <6d59c277-c440-4d1f-a46e-157958c06a5f@vondra.me> <4zeu5yb73byiquvf3eefsunnrydyqfxy3eup66jrliutrtd4xl@5iifjey4n5m5> <8f123089-49ed-4c35-ab28-e3df70d28043@garret.ru> Content-Language: en-US From: Konstantin Knizhnik In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 28/12/2025 8:08 PM, Tomas Vondra wrote: > On 12/25/25 16:39, Konstantin Knizhnik wrote: >> On 21/12/2025 7:55 PM, Peter Geoghegan wrote: >>> On Wed, Dec 10, 2025 at 9:21 PM Peter Geoghegan wrote: >>>> Attached is v4. >>> Attached is v5. Changes from v4: >>> >>> * Simplified and optimized index-only scans, with a particular >>> emphasis on avoiding regressions with nested loop joins with an inner >>> index-only scan. >>> >>> There were quite a number of small problems/dead code related to >>> index-only scans fixed by this new v5. Overall, I'm quite a bit >>> happier with the state of index-only scans, which I'd not paid too >>> much attention to before now. >>> >>> * Added Valgrind instrumentation to the hash index patch, which was >>> required to fix some false positives. >>> >>> The generic indexam_util_batch_unlock routine had Valgrind >>> instrumentation in earlier versions, just to keep nbtree's buffer >>> locking checks from generating similar false positives. Some time >>> later, when I added the hashgetbatch patch, there were new Valgrind >>> false positives during hash index scans -- which I missed at first. >>> This new v5 revisions adds similar Valgrind checks to hash itself >>> (changes that add code that is more or less a direct port of the stuff >>> added to nbtree by commit 4a70f829), which fixes the false positives, >>> and is independently useful. >>> >>> The rule for amgetbatch-based index AMs is that they must have similar >>> buffer locking instrumentation. That seems like a good thing. >>> >>> -- >>> Peter Geoghegan >> I the previous mail I shared results of my experiments with different >> prefetch distance. >> I think that we should start prefetching of heap tuples not from the >> second batch, but after some number of proceeded tids. >> >> Attached please find a patch which implements this approach. >> And below are updated results: >> >> limit\prefetch    on      off   always  inc    threshold >> 1                 12074   12765  3146    3282     12394 >> 2                 5912    6198   2463    2438      6124 >> 4                 2919    3047   1334    1964      2910 >> 8                 1554    1496   1166    1409      1588 >> 16                815     775    947     940        600 >> 32                424     403    687     695        478 >> 64                223     208    446     453        358 >> 128               115     106    258     270        232 >> 256               68      53     138     149        131 >> 512               43      27     72      78          71 >> 1024              28      13     38      40          38 >> >> Last column is result of prefetch with read_stream_threshold=10. >> > That's great, but it only works for cases that can (and do) benefit from > the prefetching. Try running the benchmark with a data set that fits > into shared buffers (or RAM), which makes prefetching useless. > > I tried that with your test, comparing master, v5 and v5 + your > read_stream_threshold patch. See the attached run.sh script, and the PDF > summarizing the results. The last two column groups are comparisons to > master, with green=improvement, red=regression. There are no actual > improvements (1% delta is just noise). But the read_stream_threshold > results have a clear pattern of pretty massive (20-30%) regressions. > > The difference between v5 and v5-threshold is pretty clear. > > IIRC cases like this are *exactly* why we ended up with the current > heuristics, enabling prefetching only from the second batch. This > removes the risk of expensive read_stream init for very fast queries > that don't benefit anything. Of course, prefetching may be useless for > later batches too (e.g. if all the data is cached), but the query will > be expensive enough for the read_stream init cost to be negligible. > > To put this differently, the more aggressive the heuristics is (enabling > prefetching in more case), the more likely it's to cause regressions. > We've chosen to be more defensive, i.e. to sacrifice some possible gains > in order to not regress plausible workloads. I hope we agree queries on > fully cached "hot" data are pretty common / important. > > We can probably do better in the future. But we'll never know for sure > if a given scan benefits from prefetching. It's not just about the > number of items in the batch, but also about how many heap pages that > translates to, what I/O pattern (random vs. sequential?), how many are > already cached. For some queries we don't even know how many items we'll > actually need. We can't check all that at the very beginning, because > it's simply prohibitively expensive. I tried to reproduce your results, but at Mac I do not see some noticeable difference  for 250k records, fillfactor=10 and 4GB shared buffers between `enable_indexscan_prefetch=false` and `enable_indexscan_prefetch=true`. I can't believe that just adding this checks in `heap_batch_advance_pos` can cause 75% degrade of performance (because for limit < 10, no read stream is initialized, but still we somewhere loose 25%). I just commented this fragment of code in heapam_handler.c: #if 0     proceed_items = ScanDirectionIsForward(direction)         ? pos->item - batch->firstItem         : batch->lastItem - pos->item;     /* Delay initializing stream until proceeding */     if (proceed_items >= read_stream_threshold         && !scan->xs_heapfetch->rs         && !scan->batchqueue->disabled         && !scan->xs_want_itup    /* XXX prefetching disabled for IoS, for now */         && enable_indexscan_prefetch)     {         scan->xs_heapfetch->rs =             read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,                                        scan->heapRelation, MAIN_FORKNUM,  scan->heapRelation->rd_tableam->index_getnext_stream,                                        scan, 0);     } #endif and ... see no difference. I can understand why initializing read stream earlier (not at the second batch, but after 10 proceeded items) may have negative impact on performance when all data is present i shared buffers for LIMIT>=10. But how it can happen with LIMIT 1 and commented fragment above. There is nothing else in my patch except adding GUC. So I think that it is some "external" factor and wonder if you can reproduce this results (just first line).