Message-ID: <69908539-3cc5-4572-b69f-055fa40d3bea@garret.ru>
Date: Sun, 28 Dec 2025 22:30:49 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: index prefetching
To: Tomas Vondra <tomas@vondra.me>, Peter Geoghegan <pg@bowt.ie>
Cc: Andres Freund <andres@anarazel.de>, Thomas Munro
 <thomas.munro@gmail.com>, Nazir Bilal Yavuz <byavuz81@gmail.com>,
 Robert Haas <robertmhaas@gmail.com>,
 Melanie Plageman <melanieplageman@gmail.com>,
 PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>, Dilip Kumar <dilipbalaut@gmail.com>
References: <6d59c277-c440-4d1f-a46e-157958c06a5f@vondra.me>
 <CAH2-WznFdjY_OB2S7_BY4iAyeffK+XrE2qsX6aghgP63VocRfQ@mail.gmail.com>
 <4zeu5yb73byiquvf3eefsunnrydyqfxy3eup66jrliutrtd4xl@5iifjey4n5m5>
 <c515d037-9c9c-40d3-9f1f-f0e68bf5842b@vondra.me>
 <CAH2-WzkAC_gWQgc3MV3MWBx2EUaE4TfPk8XQPkJgs1R_qNvo=A@mail.gmail.com>
 <CAH2-WznLRnUZHC4kmR+J3MgGpN0iTUaZ-4xeSHPiaYR=y3C8Og@mail.gmail.com>
 <CAH2-WzkHK0++yn3_nmWL2rWzXiu8Qc=2=NPQDG4nfVjvYkRDrQ@mail.gmail.com>
 <CAH2-Wzk9=x=a2TbcqYcX+XXmDHQr5=1v9m4Z_v8a-KwF1Zoz0A@mail.gmail.com>
 <CAH2-WzmYqhacBH161peAWb5eF=Ja7CFAQ+0jSEMq=qnfLVTOOg@mail.gmail.com>
 <CAH2-Wzmm+mXuv_r+eYgcwOKZzNSmbYoEmxi5mdKMUtkmYa_qXw@mail.gmail.com>
 <CAH2-WzkC_EdGB_tdEQr63BiOuJbcvZiV=xO+OJYux4wOaxPD8w@mail.gmail.com>
 <CAH2-Wzn8whv-RE4E5L2WQB_Ha_jy0UReJq1yypu83gDYHeXq5g@mail.gmail.com>
 <CAH2-Wzm7-QuDOs6TcqfhhDsGEZCuHtn=D-SriOTnTZ_fiXNBvA@mail.gmail.com>
 <8f123089-49ed-4c35-ab28-e3df70d28043@garret.ru>
 <c2357c01-bf15-44d6-a4cb-591675457032@vondra.me>
Content-Language: en-US
From: Konstantin Knizhnik <knizhnik@garret.ru>
In-Reply-To: <c2357c01-bf15-44d6-a4cb-591675457032@vondra.me>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Archived-At: 
 <https://www.postgresql.org/message-id/69908539-3cc5-4572-b69f-055fa40d3bea%40garret.ru>
Precedence: bulk


On 28/12/2025 8:08 PM, Tomas Vondra wrote:
> On 12/25/25 16:39, Konstantin Knizhnik wrote:
>> On 21/12/2025 7:55 PM, Peter Geoghegan wrote:
>>> On Wed, Dec 10, 2025 at 9:21 PM Peter Geoghegan <pg@bowt.ie> wrote:
>>>> Attached is v4.
>>> Attached is v5. Changes from v4:
>>>
>>> * Simplified and optimized index-only scans, with a particular
>>> emphasis on avoiding regressions with nested loop joins with an inner
>>> index-only scan.
>>>
>>> There were quite a number of small problems/dead code related to
>>> index-only scans fixed by this new v5. Overall, I'm quite a bit
>>> happier with the state of index-only scans, which I'd not paid too
>>> much attention to before now.
>>>
>>> * Added Valgrind instrumentation to the hash index patch, which was
>>> required to fix some false positives.
>>>
>>> The generic indexam_util_batch_unlock routine had Valgrind
>>> instrumentation in earlier versions, just to keep nbtree's buffer
>>> locking checks from generating similar false positives. Some time
>>> later, when I added the hashgetbatch patch, there were new Valgrind
>>> false positives during hash index scans -- which I missed at first.
>>> This new v5 revisions adds similar Valgrind checks to hash itself
>>> (changes that add code that is more or less a direct port of the stuff
>>> added to nbtree by commit 4a70f829), which fixes the false positives,
>>> and is independently useful.
>>>
>>> The rule for amgetbatch-based index AMs is that they must have similar
>>> buffer locking instrumentation. That seems like a good thing.
>>>
>>> -- 
>>> Peter Geoghegan
>> I the previous mail I shared results of my experiments with different
>> prefetch distance.
>> I think that we should start prefetching of heap tuples not from the
>> second batch, but after some number of proceeded tids.
>>
>> Attached please find a patch which implements this approach.
>> And below are updated results:
>>
>> limit\prefetch    on      off   always  inc    threshold
>> 1                 12074   12765  3146    3282     12394
>> 2                 5912    6198   2463    2438      6124
>> 4                 2919    3047   1334    1964      2910
>> 8                 1554    1496   1166    1409      1588
>> 16                815     775    947     940        600
>> 32                424     403    687     695        478
>> 64                223     208    446     453        358
>> 128               115     106    258     270        232
>> 256               68      53     138     149        131
>> 512               43      27     72      78          71
>> 1024              28      13     38      40          38
>>
>> Last column is result of prefetch with read_stream_threshold=10.
>>
> That's great, but it only works for cases that can (and do) benefit from
> the prefetching. Try running the benchmark with a data set that fits
> into shared buffers (or RAM), which makes prefetching useless.
>
> I tried that with your test, comparing master, v5 and v5 + your
> read_stream_threshold patch. See the attached run.sh script, and the PDF
> summarizing the results. The last two column groups are comparisons to
> master, with green=improvement, red=regression. There are no actual
> improvements (1% delta is just noise). But the read_stream_threshold
> results have a clear pattern of pretty massive (20-30%) regressions.
>
> The difference between v5 and v5-threshold is pretty clear.
>
> IIRC cases like this are *exactly* why we ended up with the current
> heuristics, enabling prefetching only from the second batch. This
> removes the risk of expensive read_stream init for very fast queries
> that don't benefit anything. Of course, prefetching may be useless for
> later batches too (e.g. if all the data is cached), but the query will
> be expensive enough for the read_stream init cost to be negligible.
>
> To put this differently, the more aggressive the heuristics is (enabling
> prefetching in more case), the more likely it's to cause regressions.
> We've chosen to be more defensive, i.e. to sacrifice some possible gains
> in order to not regress plausible workloads. I hope we agree queries on
> fully cached "hot" data are pretty common / important.
>
> We can probably do better in the future. But we'll never know for sure
> if a given scan benefits from prefetching. It's not just about the
> number of items in the batch, but also about how many heap pages that
> translates to, what I/O pattern (random vs. sequential?), how many are
> already cached. For some queries we don't even know how many items we'll
> actually need. We can't check all that at the very beginning, because
> it's simply prohibitively expensive.


I tried to reproduce your results, but at Mac I do not see some 
noticeable difference  for 250k records, fillfactor=10 and 4GB shared 
buffers
between `enable_indexscan_prefetch=false` and 
`enable_indexscan_prefetch=true`.
I can't believe that just adding this checks in `heap_batch_advance_pos` 
can cause 75% degrade of performance (because for limit < 10, no read 
stream is initialized, but still we somewhere loose 25%).

I just commented this fragment of code in heapam_handler.c:


#if 0
     proceed_items = ScanDirectionIsForward(direction)
         ? pos->item - batch->firstItem
         : batch->lastItem - pos->item;
     /* Delay initializing stream until proceeding */
     if (proceed_items >= read_stream_threshold
         && !scan->xs_heapfetch->rs
         && !scan->batchqueue->disabled
         && !scan->xs_want_itup    /* XXX prefetching disabled for IoS, 
for now */
         && enable_indexscan_prefetch)
     {
         scan->xs_heapfetch->rs =
             read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
                                        scan->heapRelation, MAIN_FORKNUM,
  scan->heapRelation->rd_tableam->index_getnext_stream,
                                        scan, 0);
     }
#endif

and ... see no difference.

I can understand why initializing read stream earlier (not at the second 
batch, but after 10 proceeded items) may have negative impact on 
performance when all data is present i shared buffers for LIMIT>=10.
But how it can happen with LIMIT 1 and commented fragment above. There 
is nothing else in my patch except adding GUC.
So I think that it is some "external" factor and wonder if you can 
reproduce this results (just first line).