Message-ID: <a062ac09-c574-1d6a-22b6-f3cc0d28ed42@enterprisedb.com>
Date: Fri, 9 Jun 2023 12:18:11 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.10.0
Subject: Re: index prefetching
To: Andres Freund <andres@anarazel.de>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>
References: <cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com>
 <20230609000600.syqy447e6metnvyj@awork3.anarazel.de>
Content-Language: en-US
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
In-Reply-To: <20230609000600.syqy447e6metnvyj@awork3.anarazel.de>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Archived-At: <https://www.postgresql.org/message-id/a062ac09-c574-1d6a-22b6-f3cc0d28ed42%40enterprisedb.com>
Precedence: bulk

On 6/9/23 02:06, Andres Freund wrote:
> Hi,
> 
> On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
>> At pgcon unconference I presented a PoC patch adding prefetching for
>> indexes, along with some benchmark results demonstrating the (pretty
>> significant) benefits etc. The feedback was quite positive, so let me
>> share the current patch more widely.
> 
> I'm really excited about this work.
> 
> 
>> 1) pairing-heap in GiST / SP-GiST
>>
>> For most AMs, the index state is pretty trivial - matching items from a
>> single leaf page. Prefetching that is pretty trivial, even if the
>> current API is a bit cumbersome.
>>
>> Distance queries on GiST and SP-GiST are a problem, though, because
>> those do not just read the pointers into a simple array, as the distance
>> ordering requires passing stuff through a pairing-heap :-(
>>
>> I don't know how to best deal with that, especially not in the simple
>> API. I don't think we can "scan forward" stuff from the pairing heap, so
>> the only idea I have is actually having two pairing-heaps. Or maybe
>> using the pairing heap for prefetching, but stashing the prefetched
>> pointers into an array and then returning stuff from it.
>>
>> In the patch I simply prefetch items before we add them to the pairing
>> heap, which is good enough for demonstrating the benefits.
> 
> I think it'd be perfectly fair to just not tackle distance queries for now.
> 

My concern is that if we cut this from v0 entirely, we'll end up with an
API that'll not be suitable for adding distance queries later.

> 
>> 2) prefetching from executor
>>
>> Another question is whether the prefetching shouldn't actually happen
>> even higher - in the executor. That's what Andres suggested during the
>> unconference, and it kinda makes sense. That's where we do prefetching
>> for bitmap heap scans, so why should this happen lower, right?
> 
> Yea. I think it also provides potential for further optimizations in the
> future to do it at that layer.
> 
> One thing I have been wondering around this is whether we should not have
> split the code for IOS and plain indexscans...
> 

Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
did you mean something else?

> 
>> 4) per-leaf prefetching
>>
>> The code is restricted only prefetches items from one leaf page. If the
>> index scan needs to scan multiple (many) leaf pages, we have to process
>> the first leaf page first before reading / prefetching the next one.
>>
>> I think this is acceptable limitation, certainly for v0. Prefetching
>> across multiple leaf pages seems way more complex (particularly for the
>> cases using pairing heap), so let's leave this for the future.
> 
> Hm. I think that really depends on the shape of the API we end up with. If we
> move the responsibility more twoards to the executor, I think it very well
> could end up being just as simple to prefetch across index pages.
> 

Maybe. I'm open to that idea if you have idea how to shape the API to
make this possible (although perhaps not in v0).

> 
>> 5) index-only scans
>>
>> I'm not sure what to do about index-only scans. On the one hand, the
>> point of IOS is not to read stuff from the heap at all, so why prefetch
>> it. OTOH if there are many allvisible=false pages, we still have to
>> access that. And if that happens, this leads to the bizarre situation
>> that IOS is slower than regular index scan. But to address this, we'd
>> have to consider the visibility during prefetching.
> 
> That should be easy to do, right?
> 

It doesn't seem particularly complicated (famous last words), and we
need to do the VM checks anyway so it seems like it wouldn't add a lot
of overhead either

> 
> 
>> Benchmark / TPC-H
>> -----------------
>>
>> I ran the 22 queries on 100GB data set, with parallel query either
>> disabled or enabled. And I measured timing (and speedup) for each query.
>> The speedup results look like this (see the attached PDF for details):
>>
>>     query    serial    parallel
>>     1          101%         99%
>>     2          119%        100%
>>     3          100%         99%
>>     4          101%        100%
>>     5          101%        100%
>>     6           12%         99%
>>     7          100%        100%
>>     8           52%         67%
>>     10         102%        101%
>>     11         100%         72%
>>     12         101%        100%
>>     13         100%        101%
>>     14          13%        100%
>>     15         101%        100%
>>     16          99%         99%
>>     17          95%        101%
>>     18         101%        106%
>>     19          30%         40%
>>     20          99%        100%
>>     21         101%        100%
>>     22         101%        107%
>>
>> The percentage is (timing patched / master, so <100% means faster, >100%
>> means slower).
>>
>> The different queries are affected depending on the query plan - many
>> queries are close to 100%, which means "no difference". For the serial
>> case, there are about 4 queries that improved a lot (6, 8, 14, 19),
>> while for the parallel case the benefits are somewhat less significant.
>>
>> My explanation is that either (a) parallel case used a different plan
>> with fewer index scans or (b) the parallel query does more concurrent
>> I/O simply by using parallel workers. Or maybe both.
>>
>> There are a couple regressions too, I believe those are due to doing too
>> much prefetching in some cases, and some of the heuristics mentioned
>> earlier should eliminate most of this, I think.
> 
> I'm a bit confused by some of these numbers. How can OS-level prefetching lead
> to massive prefetching in the alread cached case, e.g. in tpch q06 and q08?
> Unless I missed what "xeon / cached (speedup)" indicates?
> 

I forgot to explain what "cached" means in the TPC-H case. It means
second execution of the query, so you can imagine it like this:

for q in `seq 1 22`; do

   1. drop caches and restart postgres

   2. run query $q -> uncached

   3. run query $q -> cached

done

So the second execution has a chance of having data in memory - but
maybe not all, because this is a 100GB data set (so ~200GB after
loading), but the machine only has 64GB of RAM.

I think a likely explanation is some of the data wasn't actually in
memory, so prefetching still did something.

> I think it'd be good to run a performance comparison of the unpatched vs
> patched cases, with prefetching disabled for both. It's possible that
> something in the patch caused unintended changes (say spilling during a
> hashagg, due to larger struct sizes).
> 

That's certainly a good idea. I'll do that in the next round of tests. I
also plan to do a test on data set that fits into RAM, to test "properly
cached" case.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company