Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tYL5H-009e2Y-W3 for pgsql-hackers@arkaria.postgresql.org; Thu, 16 Jan 2025 08:18:36 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tYL5F-008Mqe-H4 for pgsql-hackers@arkaria.postgresql.org; Thu, 16 Jan 2025 08:18:33 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tYL5F-008MqT-3O for pgsql-hackers@lists.postgresql.org; Thu, 16 Jan 2025 08:18:33 +0000 Received: from mail-yb1-xb2c.google.com ([2607:f8b0:4864:20::b2c]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1tYL5D-000fvb-2j for pgsql-hackers@postgresql.org; Thu, 16 Jan 2025 08:18:32 +0000 Received: by mail-yb1-xb2c.google.com with SMTP id 3f1490d57ef6-e53a91756e5so1215999276.1 for ; Thu, 16 Jan 2025 00:18:31 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1737015510; x=1737620310; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=FobLwml8jH3QmYrrHKGZYkUtJzc4j5Gmjcme8P8f9wo=; b=bezl85g6eYowS2aRWhCtpYTP37CN9u3HV3SstPTG3uhQ3/MM45O2UCT/HrRDjgh1yn RWSOasJ4lJZoVxK/qFpV+PKxhBvY2W3aEwC7tsYklekYWE3BPLsarnfLJnqAmdfS5cuN k36Sy9mbDp8/wmXLh7+eUlDlnMBERqlebnmxkhEnnzqgmqJi1WrrCzARj8XYY0H26av2 AUalJy+rAK3ygR+e4/29hEQslHrH6Lx9NNma5Q8EWVUbgXGGxmBu4n2GV20IdwKdH1Ve 2/1QvyGYII5j4IoY5A+GH+S99OMrmrzUUu6vmF4nZySS3F/NzeuiNm1O3nYAmEdp7kaQ WwpA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1737015510; x=1737620310; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=FobLwml8jH3QmYrrHKGZYkUtJzc4j5Gmjcme8P8f9wo=; b=p3uGUo7iUOtjAYRlUeASTioOo+vTIDprztfSr9wXZi4+NiCHqnPTb+IucMfhQkKb1+ a5CwtQFhha+/shAD72v/MCNw97l1jLZRF9FDgewEojPja1Z2Rb8E0cpc97Dnai1n2KUv 53qcTQJiAORmzkjy0vATzMSpX+CiJ+VckiWNOTvFWPYnGpC0x08qQLUyCYs/d9CDTX0g XF4j6BkbND/BxzSOOhivr4L1RT7Wy7vkSeYRCYly0APRsT5snrRBDiW+FUiIPkuQPSdb IiQKM/HhuVrb3wRS/L4b9Bm1ReVpJxBithuHi0A46V93OZOIGbrAwp46CCSogsMF1i6c UixQ== X-Forwarded-Encrypted: i=1; AJvYcCUf44BPBYnJJ2TZYWQ8CO1+2l5wHgO2ku9O1gwDVOvLICAAMZY58isxC3E7b2wSHfNDY8P9AWw0yKVyNn93@postgresql.org X-Gm-Message-State: AOJu0YyGVGbnbjgLnJjNL7lB6Smg5Ae4atvz3g9SLdAvoKKq5c96STAf Dosj+h4ntkZfZGZafUnWTOo9s738hSn95aflCgcbDBSiQEcXXgJkdvfG3GchmGuRxe4d6UGX91X 6YWI0ggGCrIyZGPLnB6e/4BPu/RE= X-Gm-Gg: ASbGncuz0aukWleLoi9SZbNtNHKI18z58/wkbrOBUuNFvPsvOHeyJ8hp82ZQdI68iOn 1HWHkal8HLIPvFJv9TXMK3cBvCjarwhkQ8hf0cKmfbNN4gf3xi5ljDNwVuQRX3TIXu5PX05JF X-Google-Smtp-Source: AGHT+IH/Fb549clG8NgBykTFnM4wkiF4N3sadYsfNjCuZEYwBpch87dcnzCa+0UcrEb0bP0GqyPLFDya3/Ut1Td6sUY= X-Received: by 2002:a05:690c:9687:b0:6f6:ccf8:614a with SMTP id 00721157ae682-6f6ccf86218mr47192627b3.17.1737015510442; Thu, 16 Jan 2025 00:18:30 -0800 (PST) MIME-Version: 1.0 References: <87il22cj51.fsf@163.com> In-Reply-To: From: Richard Guo Date: Thu, 16 Jan 2025 17:18:18 +0900 X-Gm-Features: AbW1kvamA-DgMR2GeSbZ9zDtuFmodE5fYNPN7Gsw9aLtOGj4I0l-Zk_-o2wU6VI Message-ID: Subject: Re: Eager aggregation, take 3 To: Robert Haas Cc: Tom Lane , Tender Wang , Paul George , Andy Fan , PostgreSQL-development , pgsql-hackers@lists.postgresql.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Wed, Jan 15, 2025 at 11:40=E2=80=AFPM Robert Haas wrote: > On Wed, Jan 15, 2025 at 1:58=E2=80=AFAM Richard Guo wrote: > > I understand that we're currently quite bad at estimating the number > > of groups after aggregation. In fact, it's not just aggregation > > estimates =E2=80=94 we're also bad at join estimates in some cases. Th= is is a > > reality we have to face. Here's what I think: we should be trying our > > best to cost each node type as accurately as possible, and then build > > the upper nodes based on those costs. We should not conclude that, > > because we are unable to accurately cost one node type, we should > > avoid any cost-based optimizations above that node. > > Well, I agree with that last sentence, for sure. But I don't think > it's true that the situations with joins and aggregates are > comparable. We are much better able to estimate the number of rows > that will come out of a join than we are to estimate the number of > rows that come out of an aggregate. It's certainly true that in some > cases we get join estimates badly wrong, and I'd like to see us do > better there, but our estimates of the number of distinct values that > exist in a column are the least reliable part of our statistics system > by far. I totally understand that the situation with joins is better than with aggregates, which is why I said that we're also bad at join estimates "in some cases" - especially in the cases where we fall back to use default selectivity estimates. A simple example: create table t1 (a int, b int); create table t2 (a int, b int); insert into t1 select i, i from generate_series(1,1000)i; insert into t2 select i, i from generate_series(1000, 1999)i; analyze t1, t2; explain analyze select * from t1 join t2 on t1.a > t2.a; And here is what I got: Nested Loop (cost=3D0.00..15032.50 rows=3D333333 width=3D16) (actual time=3D392.841..392.854 rows=3D0 loops=3D1) If this t1/t2 join is part of a larger SELECT query, I think the cost estimates for the upper join nodes would likely be quite inaccurate. > > I believe the HashAggregate node in this plan faces the same problem > > with inaccurate estimates. However, I don't think it's reasonable to > > say that, because we cannot accurately cost the Aggregate node, we > > should disregard considering JOIN_UNIQUE_OUTER/INNER. > > Fair point. > > > Back in August, I responded to this issue by "Maybe we can run some > > benchmarks first and investigate the regressions discovered on a > > case-by-case basis". In October, I ran the TPC-DS benchmark at scale > > 10 and observed that eager aggregation was applied in 7 queries, with > > no notable regressions. In contrast, Q4 and Q11 showed performance > > improvements of 3=E2=80=934 times. Please see [1]. > > I had forgotten about that, and again, fair point, but I'm concerned > that it might not be a broad enough base of queries to test against. > (7 isn't a very large number.) Yeah, I know 7 is not a large number, but this is the result I got from running the TPC-DS benchmark. For the remaining 92 queries in the benchmark, either the logic in this patch determines that eager aggregation is not applicable, or the path with eager aggregation is not the optimal one. I'd be more than happy if a benchmark query showed significant performance regression, so it would provide an opportunity to investigate how the cost estimates are negatively impacting the final plan and explore ways to avoid or improve that. If anyone can provide such a benchmark query, I'd be very grateful. Perhaps this is another reason why we should enable this feature by default, so we can identify such regression issues sooner rather than later. > > Actually, I introduced the EAGER_AGGREGATE_RATIO mechanism in October > > to limit the planning effort for eager aggregation. For more details, > > please see [2]. > > OK, I missed this, but... > > > And I don't think it's correct to say that we create a partially > > grouped rel for every baserel and every joinrel. This patch includes > > a bunch of logic to determine whether it's appropriate to create a > > grouped rel for a base or join rel. Furthermore, with the > > EAGER_AGGREGATE_RATIO mechanism, even if creating a grouped rel is > > possible, we will skip it if the grouped paths are considered not > > useful. All of these measures help reduce the number of grouped > > paths as well as the grouped relations in many cases where eager > > aggregation would not help a lot. > > ...it looks to me like EAGER_AGGREGATE_RATIO is used to set the > RelAggInfo's agg_useful field, which seems like it happens after the > RelOptInfo has already been created. I had been looking for something > that would avoid creating the RelOptInfo in the first place and I > didn't see it. Well, from the perspective of planning effort, what really matters is whether the RelOptInfo for the grouped relation is added to the PlannerInfo, as it is only then available for further joining in the join search routine, not whether the RelOptInfo is built or not. Building the RelOptInfo for a grouped relation is simply a makeNode call followed by a flat copy; it doesn't require going through the full process of determining its target list, or constructing its restrict and join clauses, or calculating size estimates, etc. Now, let's take a look at how the EAGER_AGGREGATE_RATIO mechanism is used. As you mentioned, EAGER_AGGREGATE_RATIO is used to set the agg_useful field of the RelAggInfo. For a base rel where we've decided that aggregation can be pushed down, if agg_useful is false, we skip building the grouped relation for it in the first place, not to mention adding the grouped relation to the PlannerInfo. For a join rel where aggregation can be pushed down, if agg_useful is false, we will create a temporary RelOptInfo for its grouped relation, but we only add this RelOptInfo to the PlannerInfo if we can generate any grouped paths by joining its input relations. We could easily modify make_grouped_join_rel() to create this temporary RelOptInfo only when needed, but I'm not sure if that's necessary, since I don't have data to suggest that the creation of this temporary RelOptInfo is a factor in causing planning regressions. > > Based on the TPC-DS benchmark results, I don't see "a lot of overhead" > > in the planning cost, at least for the 7 queries where eager > > aggregation is applied. As I said in [1], "For the planning time, I > > do not see notable regressions for any of the seven queries". In > > fact, I initially thought that we might consider enabling this by > > default, given the positive benchmark results, but I just couldn't > > summon the courage to do it. Perhaps we should reconsider enabling it > > by default, so users can benefit from the new feature and help > > identify any potential bugs. > > If you're going to commit this, I think it would be a good idea to > enable it by default at least for now. If there are problems, it's > better to find out about them sooner rather than later. If they are > minor they can be fixed; if they are major, we can consider whether it > is better to fix them, disable the feature by default, or revert. We > can add an open item to reconsider the default setting during beta. Agreed. And I like the suggestion of adding an open item about the default setting during beta. > > > As I said back in October, this seems like mixing together in one > > > RelOptInfo paths that really belong to two different RelOptInfos. > > > > I understand that you said about the design in October where > > "PartialAgg(t1 JOIN t2) and t1 JOIN PartialAgg(t2) get separate > > RelOptInfos", because "it's less clear whether it's fair to compare > > across the two categories". I've shared my thoughts on this in [5]. > > > > Furthermore, even if we separate these grouped paths into two > > different RelOptInfos, we still face the issue that "different paths > > of a grouped relation can have very different row counts", and we need > > a way to handle this. One could argue that we can separate the > > grouped paths where partial aggregation is placed at different > > locations into different RelOptInfos, but this would lead to an > > explosion in the number of RelOptInfos for grouped relations as we > > climb up the join tree. I think this is neither realistic nor > > necessary. > > It's possible you're right, but it does make me nervous. I do agree > that making the number of RelOptInfos explode would be really bad. Based on my explanation in [1], I think it's acceptable to compare grouped paths for the same grouped rel, regardless of where the partial aggregation is placed. I fully understand that I could be wrong about this, but I don't think it would break anything in regular planning (i.e., planning without eager aggregation). We would never compare a grouped path with a non-grouped path during scan/join planning. As far as I can see, the only consequence in that case would be that we might fail to select the optimal grouped path and miss out on fully leveraging the benefits of eager aggregation. Back in November, I considered the possibility of introducing a GroupPathInfo into the Path structure to store the location of the partial aggregation as well as the estimated rowcount for this grouped path, similar to how ParamPathInfo functions for parameterized paths. However, after some exploration, I determined that this was unnecessary. But in any case, I don't think it's an option to separate the grouped paths of the same grouped relation into different RelOptInfos based on the location of the partial aggregation within the path tree. [1] https://postgr.es/m/CAMbWs49dLjSSQRWeud+KSN0G531ciZdYoLBd5qktXA+3JQm_UQ= @mail.gmail.com Thanks Richard