Message-ID: <6030d836-e8b7-e7b9-2cbb-144309679d03@enterprisedb.com>
Date: Sat, 10 Jun 2023 23:10:59 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.11.0
Subject: Re: index prefetching
Content-Language: en-US
To: Andres Freund <andres@anarazel.de>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
 Georgios <gkokolatos@protonmail.com>
References: <cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com>
 <20230609000600.syqy447e6metnvyj@awork3.anarazel.de>
 <a062ac09-c574-1d6a-22b6-f3cc0d28ed42@enterprisedb.com>
 <20230610203456.5gancfekm4pj4pbs@awork3.anarazel.de>
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
In-Reply-To: <20230610203456.5gancfekm4pj4pbs@awork3.anarazel.de>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Archived-At: <https://www.postgresql.org/message-id/6030d836-e8b7-e7b9-2cbb-144309679d03%40enterprisedb.com>
Precedence: bulk


On 6/10/23 22:34, Andres Freund wrote:
> Hi,
> 
> On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
>>>
>>>> 2) prefetching from executor
>>>>
>>>> Another question is whether the prefetching shouldn't actually happen
>>>> even higher - in the executor. That's what Andres suggested during the
>>>> unconference, and it kinda makes sense. That's where we do prefetching
>>>> for bitmap heap scans, so why should this happen lower, right?
>>>
>>> Yea. I think it also provides potential for further optimizations in the
>>> future to do it at that layer.
>>>
>>> One thing I have been wondering around this is whether we should not have
>>> split the code for IOS and plain indexscans...
>>>
>>
>> Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
>> did you mean something else?
> 
> Yes, I meant that.
> 

Ah, you meant that maybe we shouldn't have done that. Sorry, I
misunderstood.

>>>> 4) per-leaf prefetching
>>>>
>>>> The code is restricted only prefetches items from one leaf page. If the
>>>> index scan needs to scan multiple (many) leaf pages, we have to process
>>>> the first leaf page first before reading / prefetching the next one.
>>>>
>>>> I think this is acceptable limitation, certainly for v0. Prefetching
>>>> across multiple leaf pages seems way more complex (particularly for the
>>>> cases using pairing heap), so let's leave this for the future.
>>>
>>> Hm. I think that really depends on the shape of the API we end up with. If we
>>> move the responsibility more twoards to the executor, I think it very well
>>> could end up being just as simple to prefetch across index pages.
>>>
>>
>> Maybe. I'm open to that idea if you have idea how to shape the API to
>> make this possible (although perhaps not in v0).
> 
> I'll try to have a look.
> 
> 
>>> I'm a bit confused by some of these numbers. How can OS-level prefetching lead
>>> to massive prefetching in the alread cached case, e.g. in tpch q06 and q08?
>>> Unless I missed what "xeon / cached (speedup)" indicates?
>>>
>>
>> I forgot to explain what "cached" means in the TPC-H case. It means
>> second execution of the query, so you can imagine it like this:
>>
>> for q in `seq 1 22`; do
>>
>>    1. drop caches and restart postgres
> 
> Are you doing it in that order? If so, the pagecache can end up being seeded
> by postgres writing out dirty buffers.
> 

Actually no, I do it the other way around - first restart, then drop. It
shouldn't matter much, though, because after building the data set (and
vacuum + checkpoint), the data is not modified - all the queries run on
the same data set. So there shouldn't be any dirty buffers.

> 
>>    2. run query $q -> uncached
>>
>>    3. run query $q -> cached
>>
>> done
>>
>> So the second execution has a chance of having data in memory - but
>> maybe not all, because this is a 100GB data set (so ~200GB after
>> loading), but the machine only has 64GB of RAM.
>>
>> I think a likely explanation is some of the data wasn't actually in
>> memory, so prefetching still did something.
> 
> Ah, ok.
> 
> 
>>> I think it'd be good to run a performance comparison of the unpatched vs
>>> patched cases, with prefetching disabled for both. It's possible that
>>> something in the patch caused unintended changes (say spilling during a
>>> hashagg, due to larger struct sizes).
>>>
>>
>> That's certainly a good idea. I'll do that in the next round of tests. I
>> also plan to do a test on data set that fits into RAM, to test "properly
>> cached" case.
> 
> Cool. It'd be good to measure both the case of all data already being in s_b
> (to see the overhead of the buffer mapping lookups) and the case where the
> data is in the kernel pagecache (to see the overhead of pointless
> posix_fadvise calls).
> 

OK, I'll make sure the next round of tests includes a sufficiently small
data set too. I should have some numbers sometime early next week.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company