Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1q85rb-0000h1-4l for pgsql-hackers@arkaria.postgresql.org; Sat, 10 Jun 2023 21:11:11 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.92) (envelope-from ) id 1q85ra-0000U1-3A for pgsql-hackers@arkaria.postgresql.org; Sat, 10 Jun 2023 21:11:10 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.92) (envelope-from ) id 1q85rZ-0000To-PS for pgsql-hackers@lists.postgresql.org; Sat, 10 Jun 2023 21:11:09 +0000 Received: from mail-ed1-x52d.google.com ([2a00:1450:4864:20::52d]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.94.2) (envelope-from ) id 1q85rT-001TYj-EP for pgsql-hackers@lists.postgresql.org; Sat, 10 Jun 2023 21:11:09 +0000 Received: by mail-ed1-x52d.google.com with SMTP id 4fb4d7f45d1cf-5148e4a2f17so5317387a12.1 for ; Sat, 10 Jun 2023 14:11:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=enterprisedb.com; s=google; t=1686431461; x=1689023461; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=Hi2aCtXd9AzybyoKR8XrCRuTgHVItflxUWKQ3WPSCGM=; b=HgNUyERaKeShLb5jrxdQKjbrGTq32IXiLXKn2Ox3jTNZl993RRbBxTXHNyvjgqRl9L to3XUh+Zg5TMogTwSndp7dOGYEwiK/t/FmtQjUYZz1AEgc43Hom4QKsohVKzNflRKodX D+nrM4ls4//R0+mK89700100gVUjYimqtFHEZMP8ubr24iXn3iCR5VqNAUi3VZBxYfu0 +yI0x7XzwKD2NPv4MLl7NUhsaqTPv2Km6mmU+hcKdGaxZPlomGckswihd1PHBzy+mc9g EV21J2wnElbAjPoSZlioVcwlzRQUjYTP+JZqItyqXaHsPVgsogOCTKRBV8u6AIdxxw4c 82Cg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1686431461; x=1689023461; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Hi2aCtXd9AzybyoKR8XrCRuTgHVItflxUWKQ3WPSCGM=; b=RO2Q5K3a57/IraHTre7f4e8HtQ0YA7FZpYlETJLo39nJa1w5+WM2j5wK4L9qTCdLMI lN1NlEnhHLYLZ4aIBnZ6nzwMyLkgvj+6S55pzT3adCHAYjKjskIQtpDI5qtoQyZ+7n12 9NGhExR89VgsWoMGQDw0UFgWwUO0rIXqjjvqYUUfO5ZrsmCWuhDC3yacyRtjLlMGQ6LL FDOh2y4asyxppq+Jfm74rppJmKKA8VOwKwvhA5jc7EQqLmHcyKTqlKG8VKcPeZp8XXWD begEQ70hqoxjT/FeI4tT+TY5sE5yVKHJG8K46f0nWCXG2ggrqs5EYptd27CaoY9zvlul 5EdQ== X-Gm-Message-State: AC+VfDyFn0Mxml9BR7zmMTjqGYH0PIdVHx1lgb6+tddCZjCoZ1OkU/Df Xe+rT3d2zBpdPYWaci1Ja0KirQ== X-Google-Smtp-Source: ACHHUZ5F9jyzVxiYcyecgGdqRU9RUEw3CJ/gdbJmK2jXkzNpy9Xs8v+HpfOfKfYCpdfMtoIAZkGVYA== X-Received: by 2002:a17:907:a0a:b0:974:5e14:29c0 with SMTP id bb10-20020a1709070a0a00b009745e1429c0mr6021235ejc.21.1686431461289; Sat, 10 Jun 2023 14:11:01 -0700 (PDT) Received: from [10.137.0.17] (ip-86-49-228-220.bb.vodafone.cz. [86.49.228.220]) by smtp.gmail.com with ESMTPSA id gj15-20020a170906e10f00b0096f4c9ce526sm2990041ejb.225.2023.06.10.14.11.00 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sat, 10 Jun 2023 14:11:00 -0700 (PDT) Message-ID: <6030d836-e8b7-e7b9-2cbb-144309679d03@enterprisedb.com> Date: Sat, 10 Jun 2023 23:10:59 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0 Subject: Re: index prefetching Content-Language: en-US To: Andres Freund Cc: PostgreSQL Hackers , Georgios References: <20230609000600.syqy447e6metnvyj@awork3.anarazel.de> <20230610203456.5gancfekm4pj4pbs@awork3.anarazel.de> From: Tomas Vondra In-Reply-To: <20230610203456.5gancfekm4pj4pbs@awork3.anarazel.de> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 6/10/23 22:34, Andres Freund wrote: > Hi, > > On 2023-06-09 12:18:11 +0200, Tomas Vondra wrote: >>> >>>> 2) prefetching from executor >>>> >>>> Another question is whether the prefetching shouldn't actually happen >>>> even higher - in the executor. That's what Andres suggested during the >>>> unconference, and it kinda makes sense. That's where we do prefetching >>>> for bitmap heap scans, so why should this happen lower, right? >>> >>> Yea. I think it also provides potential for further optimizations in the >>> future to do it at that layer. >>> >>> One thing I have been wondering around this is whether we should not have >>> split the code for IOS and plain indexscans... >>> >> >> Which code? We already have nodeIndexscan.c and nodeIndexonlyscan.c? Or >> did you mean something else? > > Yes, I meant that. > Ah, you meant that maybe we shouldn't have done that. Sorry, I misunderstood. >>>> 4) per-leaf prefetching >>>> >>>> The code is restricted only prefetches items from one leaf page. If the >>>> index scan needs to scan multiple (many) leaf pages, we have to process >>>> the first leaf page first before reading / prefetching the next one. >>>> >>>> I think this is acceptable limitation, certainly for v0. Prefetching >>>> across multiple leaf pages seems way more complex (particularly for the >>>> cases using pairing heap), so let's leave this for the future. >>> >>> Hm. I think that really depends on the shape of the API we end up with. If we >>> move the responsibility more twoards to the executor, I think it very well >>> could end up being just as simple to prefetch across index pages. >>> >> >> Maybe. I'm open to that idea if you have idea how to shape the API to >> make this possible (although perhaps not in v0). > > I'll try to have a look. > > >>> I'm a bit confused by some of these numbers. How can OS-level prefetching lead >>> to massive prefetching in the alread cached case, e.g. in tpch q06 and q08? >>> Unless I missed what "xeon / cached (speedup)" indicates? >>> >> >> I forgot to explain what "cached" means in the TPC-H case. It means >> second execution of the query, so you can imagine it like this: >> >> for q in `seq 1 22`; do >> >> 1. drop caches and restart postgres > > Are you doing it in that order? If so, the pagecache can end up being seeded > by postgres writing out dirty buffers. > Actually no, I do it the other way around - first restart, then drop. It shouldn't matter much, though, because after building the data set (and vacuum + checkpoint), the data is not modified - all the queries run on the same data set. So there shouldn't be any dirty buffers. > >> 2. run query $q -> uncached >> >> 3. run query $q -> cached >> >> done >> >> So the second execution has a chance of having data in memory - but >> maybe not all, because this is a 100GB data set (so ~200GB after >> loading), but the machine only has 64GB of RAM. >> >> I think a likely explanation is some of the data wasn't actually in >> memory, so prefetching still did something. > > Ah, ok. > > >>> I think it'd be good to run a performance comparison of the unpatched vs >>> patched cases, with prefetching disabled for both. It's possible that >>> something in the patch caused unintended changes (say spilling during a >>> hashagg, due to larger struct sizes). >>> >> >> That's certainly a good idea. I'll do that in the next round of tests. I >> also plan to do a test on data set that fits into RAM, to test "properly >> cached" case. > > Cool. It'd be good to measure both the case of all data already being in s_b > (to see the overhead of the buffer mapping lookups) and the case where the > data is in the kernel pagecache (to see the overhead of pointless > posix_fadvise calls). > OK, I'll make sure the next round of tests includes a sufficiently small data set too. I should have some numbers sometime early next week. regards -- Tomas Vondra EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company