Feedback-ID: id4a34324:Fastmail
Date: Sat, 10 Jun 2023 13:34:56 -0700
From: Andres Freund <andres@anarazel.de>
To: Tomas Vondra <tomas.vondra@enterprisedb.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
	Georgios <gkokolatos@protonmail.com>
Subject: Re: index prefetching
Message-ID: <20230610203456.5gancfekm4pj4pbs@awork3.anarazel.de>
References: <cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com>
 <20230609000600.syqy447e6metnvyj@awork3.anarazel.de>
 <a062ac09-c574-1d6a-22b6-f3cc0d28ed42@enterprisedb.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <a062ac09-c574-1d6a-22b6-f3cc0d28ed42@enterprisedb.com>
Archived-At: <https://www.postgresql.org/message-id/20230610203456.5gancfekm4pj4pbs%40awork3.anarazel.de>
Precedence: bulk

Hi,

On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote:
> > 
> >> 2) prefetching from executor
> >>
> >> Another question is whether the prefetching shouldn't actually happen
> >> even higher - in the executor. That's what Andres suggested during the
> >> unconference, and it kinda makes sense. That's where we do prefetching
> >> for bitmap heap scans, so why should this happen lower, right?
> > 
> > Yea. I think it also provides potential for further optimizations in the
> > future to do it at that layer.
> > 
> > One thing I have been wondering around this is whether we should not have
> > split the code for IOS and plain indexscans...
> > 
> 
> Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or
> did you mean something else?

Yes, I meant that.

> >> 4) per-leaf prefetching
> >>
> >> The code is restricted only prefetches items from one leaf page. If the
> >> index scan needs to scan multiple (many) leaf pages, we have to process
> >> the first leaf page first before reading / prefetching the next one.
> >>
> >> I think this is acceptable limitation, certainly for v0. Prefetching
> >> across multiple leaf pages seems way more complex (particularly for the
> >> cases using pairing heap), so let's leave this for the future.
> > 
> > Hm. I think that really depends on the shape of the API we end up with. If we
> > move the responsibility more twoards to the executor, I think it very well
> > could end up being just as simple to prefetch across index pages.
> > 
> 
> Maybe. I'm open to that idea if you have idea how to shape the API to
> make this possible (although perhaps not in v0).

I'll try to have a look.


> > I'm a bit confused by some of these numbers. How can OS-level prefetching lead
> > to massive prefetching in the alread cached case, e.g. in tpch q06 and q08?
> > Unless I missed what "xeon / cached (speedup)" indicates?
> > 
> 
> I forgot to explain what "cached" means in the TPC-H case. It means
> second execution of the query, so you can imagine it like this:
> 
> for q in `seq 1 22`; do
> 
>    1. drop caches and restart postgres

Are you doing it in that order? If so, the pagecache can end up being seeded
by postgres writing out dirty buffers.


>    2. run query $q -> uncached
> 
>    3. run query $q -> cached
> 
> done
> 
> So the second execution has a chance of having data in memory - but
> maybe not all, because this is a 100GB data set (so ~200GB after
> loading), but the machine only has 64GB of RAM.
> 
> I think a likely explanation is some of the data wasn't actually in
> memory, so prefetching still did something.

Ah, ok.


> > I think it'd be good to run a performance comparison of the unpatched vs
> > patched cases, with prefetching disabled for both. It's possible that
> > something in the patch caused unintended changes (say spilling during a
> > hashagg, due to larger struct sizes).
> > 
> 
> That's certainly a good idea. I'll do that in the next round of tests. I
> also plan to do a test on data set that fits into RAM, to test "properly
> cached" case.

Cool. It'd be good to measure both the case of all data already being in s_b
(to see the overhead of the buffer mapping lookups) and the case where the
data is in the kernel pagecache (to see the overhead of pointless
posix_fadvise calls).

Greetings,

Andres Freund