Feedback-ID: id4a34324:Fastmail
Date: Thu, 8 Jun 2023 17:06:00 -0700
From: Andres Freund <andres@anarazel.de>
To: Tomas Vondra <tomas.vondra@enterprisedb.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>,
	Georgios <gkokolatos@protonmail.com>
Subject: Re: index prefetching
Message-ID: <20230609000600.syqy447e6metnvyj@awork3.anarazel.de>
References: <cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <cf85f46f-b02f-05b2-5248-5000b894ebab@enterprisedb.com>
Archived-At: <https://www.postgresql.org/message-id/20230609000600.syqy447e6metnvyj%40awork3.anarazel.de>
Precedence: bulk

Hi,

On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote:
> At pgcon unconference I presented a PoC patch adding prefetching for
> indexes, along with some benchmark results demonstrating the (pretty
> significant) benefits etc. The feedback was quite positive, so let me
> share the current patch more widely.

I'm really excited about this work.


> 1) pairing-heap in GiST / SP-GiST
> 
> For most AMs, the index state is pretty trivial - matching items from a
> single leaf page. Prefetching that is pretty trivial, even if the
> current API is a bit cumbersome.
> 
> Distance queries on GiST and SP-GiST are a problem, though, because
> those do not just read the pointers into a simple array, as the distance
> ordering requires passing stuff through a pairing-heap :-(
> 
> I don't know how to best deal with that, especially not in the simple
> API. I don't think we can "scan forward" stuff from the pairing heap, so
> the only idea I have is actually having two pairing-heaps. Or maybe
> using the pairing heap for prefetching, but stashing the prefetched
> pointers into an array and then returning stuff from it.
> 
> In the patch I simply prefetch items before we add them to the pairing
> heap, which is good enough for demonstrating the benefits.

I think it'd be perfectly fair to just not tackle distance queries for now.


> 2) prefetching from executor
> 
> Another question is whether the prefetching shouldn't actually happen
> even higher - in the executor. That's what Andres suggested during the
> unconference, and it kinda makes sense. That's where we do prefetching
> for bitmap heap scans, so why should this happen lower, right?

Yea. I think it also provides potential for further optimizations in the
future to do it at that layer.

One thing I have been wondering around this is whether we should not have
split the code for IOS and plain indexscans...


> 4) per-leaf prefetching
> 
> The code is restricted only prefetches items from one leaf page. If the
> index scan needs to scan multiple (many) leaf pages, we have to process
> the first leaf page first before reading / prefetching the next one.
> 
> I think this is acceptable limitation, certainly for v0. Prefetching
> across multiple leaf pages seems way more complex (particularly for the
> cases using pairing heap), so let's leave this for the future.

Hm. I think that really depends on the shape of the API we end up with. If we
move the responsibility more twoards to the executor, I think it very well
could end up being just as simple to prefetch across index pages.


> 5) index-only scans
> 
> I'm not sure what to do about index-only scans. On the one hand, the
> point of IOS is not to read stuff from the heap at all, so why prefetch
> it. OTOH if there are many allvisible=false pages, we still have to
> access that. And if that happens, this leads to the bizarre situation
> that IOS is slower than regular index scan. But to address this, we'd
> have to consider the visibility during prefetching.

That should be easy to do, right?


> Benchmark / TPC-H
> -----------------
> 
> I ran the 22 queries on 100GB data set, with parallel query either
> disabled or enabled. And I measured timing (and speedup) for each query.
> The speedup results look like this (see the attached PDF for details):
> 
>     query    serial    parallel
>     1          101%         99%
>     2          119%        100%
>     3          100%         99%
>     4          101%        100%
>     5          101%        100%
>     6           12%         99%
>     7          100%        100%
>     8           52%         67%
>     10         102%        101%
>     11         100%         72%
>     12         101%        100%
>     13         100%        101%
>     14          13%        100%
>     15         101%        100%
>     16          99%         99%
>     17          95%        101%
>     18         101%        106%
>     19          30%         40%
>     20          99%        100%
>     21         101%        100%
>     22         101%        107%
> 
> The percentage is (timing patched / master, so <100% means faster, >100%
> means slower).
> 
> The different queries are affected depending on the query plan - many
> queries are close to 100%, which means "no difference". For the serial
> case, there are about 4 queries that improved a lot (6, 8, 14, 19),
> while for the parallel case the benefits are somewhat less significant.
> 
> My explanation is that either (a) parallel case used a different plan
> with fewer index scans or (b) the parallel query does more concurrent
> I/O simply by using parallel workers. Or maybe both.
> 
> There are a couple regressions too, I believe those are due to doing too
> much prefetching in some cases, and some of the heuristics mentioned
> earlier should eliminate most of this, I think.

I'm a bit confused by some of these numbers. How can OS-level prefetching lead
to massive prefetching in the alread cached case, e.g. in tpch q06 and q08?
Unless I missed what "xeon / cached (speedup)" indicates?

I think it'd be good to run a performance comparison of the unpatched vs
patched cases, with prefetching disabled for both. It's possible that
something in the patch caused unintended changes (say spilling during a
hashagg, due to larger struct sizes).

Greetings,

Andres Freund