Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1q7ZCL-0004BY-S2 for pgsql-hackers@arkaria.postgresql.org; Fri, 09 Jun 2023 10:18:26 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.92) (envelope-from ) id 1q7ZCK-0006YA-Mu for pgsql-hackers@arkaria.postgresql.org; Fri, 09 Jun 2023 10:18:24 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1q7ZCK-0006Y1-BY for pgsql-hackers@lists.postgresql.org; Fri, 09 Jun 2023 10:18:24 +0000 Received: from mail-lf1-x132.google.com ([2a00:1450:4864:20::132]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1q7ZCG-001C6u-HB for pgsql-hackers@lists.postgresql.org; Fri, 09 Jun 2023 10:18:23 +0000 Received: by mail-lf1-x132.google.com with SMTP id 2adb3069b0e04-4f624daccd1so1927443e87.0 for ; Fri, 09 Jun 2023 03:18:20 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=enterprisedb.com; s=google; t=1686305899; x=1688897899; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=iHcwqy7k7irZmyhkAVmP4QzPJMtgSHM2NMTCFZK8AuA=; b=jnmnM98mxriaRzVmg1zeCzTCAEosow8qsAG2Awh7JK4soWSfOEchpi3IeoDwHs5kH9 bnhZz+3RTPbzWcUSFvnu9NmDCCOtSBLR4YJXn2NQvb7WBbDVeDbdD6LaPTp0rPS0Xt1D o8WTQmDmyOZKG1RUxoO1ygb01LQZGv45ZapMeP2YvCZzloytLwyTisnumrDB+bGAQYEX toUHOGW6+S0egS2ahYOVQTjrcv3ehx0mTDR2S7PIHpNoZqcVAVM/2QZUqZQQblpuc6I0 12N5VNMmy1uodvygsySYRH4AtsGqKtsh0FhY4ncIkq2MuCGrqOxhaWQ8BSA/J1mB7zl1 m9EA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1686305899; x=1688897899; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=iHcwqy7k7irZmyhkAVmP4QzPJMtgSHM2NMTCFZK8AuA=; b=JhtYAJuG0YjvELRZFvSUckkem1AAIvR++QQumwXDrH3mkmdvkGhjdHQqk1ThozU5AZ tDBUqpMP6QuQO/suJeRRvAyuoWaKjswBjc/DpMLxaxng3Dj3Y22OS8iAQajpnHBKirlC b0mDAG+HgmpKqaYRlwJ1hUWImhmLjnpN6WPyvYKQUTXhZCuJ0pKarBUkG4E1MI8U0KTk +Fl1aiI/INXlqjErs5/oA+ke/CEWXleti0DLyL/60xKMeZLgiKJXxbWDS7cH1q7NPyKR 5wL+V0rni/fnhZI7kFL4mwdkXmtIi5kp6kfqS2NlCgwqMHEm51ZDv/SdvQy0YTpLaMS4 /07g== X-Gm-Message-State: AC+VfDwDA6WxNomwdldzVu+/uj7t2qMGiJcE6jYNSY6f5FIYfTgDc8Wv II+wP9o+tEdZoV+hgsmt7jmJCQ== X-Google-Smtp-Source: ACHHUZ4hQzMbzTr96ocMQoLCYRSRFI/AHPGzDhYO+FcE8gLIdpPiSkh6cnj1fcYI/E0s1FcSHs4Bag== X-Received: by 2002:a19:9155:0:b0:4d8:8ad1:a05f with SMTP id y21-20020a199155000000b004d88ad1a05fmr654378lfj.48.1686305899140; Fri, 09 Jun 2023 03:18:19 -0700 (PDT) Received: from [10.137.0.17] (static-84-42-175-93.bb.vodafone.cz. [84.42.175.93]) by smtp.gmail.com with ESMTPSA id dy28-20020a05640231fc00b0050bc6c04a66sm1609505edb.40.2023.06.09.03.18.18 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 09 Jun 2023 03:18:18 -0700 (PDT) Message-ID: Date: Fri, 9 Jun 2023 12:18:11 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.10.0 Subject: Re: index prefetching To: Andres Freund Cc: PostgreSQL Hackers , Georgios References: <20230609000600.syqy447e6metnvyj@awork3.anarazel.de> Content-Language: en-US From: Tomas Vondra In-Reply-To: <20230609000600.syqy447e6metnvyj@awork3.anarazel.de> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 6/9/23 02:06, Andres Freund wrote: > Hi, > > On 2023-06-08 17:40:12 +0200, Tomas Vondra wrote: >> At pgcon unconference I presented a PoC patch adding prefetching for >> indexes, along with some benchmark results demonstrating the (pretty >> significant) benefits etc. The feedback was quite positive, so let me >> share the current patch more widely. > > I'm really excited about this work. > > >> 1) pairing-heap in GiST / SP-GiST >> >> For most AMs, the index state is pretty trivial - matching items from a >> single leaf page. Prefetching that is pretty trivial, even if the >> current API is a bit cumbersome. >> >> Distance queries on GiST and SP-GiST are a problem, though, because >> those do not just read the pointers into a simple array, as the distance >> ordering requires passing stuff through a pairing-heap :-( >> >> I don't know how to best deal with that, especially not in the simple >> API. I don't think we can "scan forward" stuff from the pairing heap, so >> the only idea I have is actually having two pairing-heaps. Or maybe >> using the pairing heap for prefetching, but stashing the prefetched >> pointers into an array and then returning stuff from it. >> >> In the patch I simply prefetch items before we add them to the pairing >> heap, which is good enough for demonstrating the benefits. > > I think it'd be perfectly fair to just not tackle distance queries for now. > My concern is that if we cut this from v0 entirely, we'll end up with an API that'll not be suitable for adding distance queries later. > >> 2) prefetching from executor >> >> Another question is whether the prefetching shouldn't actually happen >> even higher - in the executor. That's what Andres suggested during the >> unconference, and it kinda makes sense. That's where we do prefetching >> for bitmap heap scans, so why should this happen lower, right? > > Yea. I think it also provides potential for further optimizations in the > future to do it at that layer. > > One thing I have been wondering around this is whether we should not have > split the code for IOS and plain indexscans... > Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or did you mean something else? > >> 4) per-leaf prefetching >> >> The code is restricted only prefetches items from one leaf page. If the >> index scan needs to scan multiple (many) leaf pages, we have to process >> the first leaf page first before reading / prefetching the next one. >> >> I think this is acceptable limitation, certainly for v0. Prefetching >> across multiple leaf pages seems way more complex (particularly for the >> cases using pairing heap), so let's leave this for the future. > > Hm. I think that really depends on the shape of the API we end up with. If we > move the responsibility more twoards to the executor, I think it very well > could end up being just as simple to prefetch across index pages. > Maybe. I'm open to that idea if you have idea how to shape the API to make this possible (although perhaps not in v0). > >> 5) index-only scans >> >> I'm not sure what to do about index-only scans. On the one hand, the >> point of IOS is not to read stuff from the heap at all, so why prefetch >> it. OTOH if there are many allvisible=false pages, we still have to >> access that. And if that happens, this leads to the bizarre situation >> that IOS is slower than regular index scan. But to address this, we'd >> have to consider the visibility during prefetching. > > That should be easy to do, right? > It doesn't seem particularly complicated (famous last words), and we need to do the VM checks anyway so it seems like it wouldn't add a lot of overhead either > > >> Benchmark / TPC-H >> ----------------- >> >> I ran the 22 queries on 100GB data set, with parallel query either >> disabled or enabled. And I measured timing (and speedup) for each query. >> The speedup results look like this (see the attached PDF for details): >> >> query serial parallel >> 1 101% 99% >> 2 119% 100% >> 3 100% 99% >> 4 101% 100% >> 5 101% 100% >> 6 12% 99% >> 7 100% 100% >> 8 52% 67% >> 10 102% 101% >> 11 100% 72% >> 12 101% 100% >> 13 100% 101% >> 14 13% 100% >> 15 101% 100% >> 16 99% 99% >> 17 95% 101% >> 18 101% 106% >> 19 30% 40% >> 20 99% 100% >> 21 101% 100% >> 22 101% 107% >> >> The percentage is (timing patched / master, so <100% means faster, >100% >> means slower). >> >> The different queries are affected depending on the query plan - many >> queries are close to 100%, which means "no difference". For the serial >> case, there are about 4 queries that improved a lot (6, 8, 14, 19), >> while for the parallel case the benefits are somewhat less significant. >> >> My explanation is that either (a) parallel case used a different plan >> with fewer index scans or (b) the parallel query does more concurrent >> I/O simply by using parallel workers. Or maybe both. >> >> There are a couple regressions too, I believe those are due to doing too >> much prefetching in some cases, and some of the heuristics mentioned >> earlier should eliminate most of this, I think. > > I'm a bit confused by some of these numbers. How can OS-level prefetching lead > to massive prefetching in the alread cached case, e.g. in tpch q06 and q08? > Unless I missed what "xeon / cached (speedup)" indicates? > I forgot to explain what "cached" means in the TPC-H case. It means second execution of the query, so you can imagine it like this: for q in `seq 1 22`; do 1. drop caches and restart postgres 2. run query $q -> uncached 3. run query $q -> cached done So the second execution has a chance of having data in memory - but maybe not all, because this is a 100GB data set (so ~200GB after loading), but the machine only has 64GB of RAM. I think a likely explanation is some of the data wasn't actually in memory, so prefetching still did something. > I think it'd be good to run a performance comparison of the unpatched vs > patched cases, with prefetching disabled for both. It's possible that > something in the patch caused unintended changes (say spilling during a > hashagg, due to larger struct sizes). > That's certainly a good idea. I'll do that in the next round of tests. I also plan to do a test on data set that fits into RAM, to test "properly cached" case. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company