Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tZUnq-00AEqQ-6r for pgsql-hackers@arkaria.postgresql.org; Sun, 19 Jan 2025 12:53:22 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1tZUno-007C3R-Sz for pgsql-hackers@arkaria.postgresql.org; Sun, 19 Jan 2025 12:53:21 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1tZUno-007C3B-G6 for pgsql-hackers@lists.postgresql.org; Sun, 19 Jan 2025 12:53:21 +0000 Received: from mail-yb1-xb32.google.com ([2607:f8b0:4864:20::b32]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1tZUnm-000MkF-0O for pgsql-hackers@postgresql.org; Sun, 19 Jan 2025 12:53:20 +0000 Received: by mail-yb1-xb32.google.com with SMTP id 3f1490d57ef6-e573136107bso5948918276.3 for ; Sun, 19 Jan 2025 04:53:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1737291197; x=1737895997; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=3tRGT+YpwxEiqeVVEjuq2JjYm7w9/cSbqAo0G7cihNo=; b=ETfbc62wQXsFMXEz6UG5+iXXN61z8cDHARCxaPdN0HGUWmcXXbrEwuLxAHZ3M1lJIt Od4K4bEVnaibpLFr0evlXHbrJZ8z5j4Y64hl196MaNbE/NEioDiIHJnZA0v6W6D9/IZQ Lb1evycfTSkxUg5mtKmwhul8Ut1BWHWx+VR1Gs5d/gaov8Hxm4/0i+H9FcEsPEo4rgmc ETBLOHwinq1EVU+o5SwfLBPvwqzG5i21lqC+D0SJWpNiHzStmHNEmU2jcqyWZfjrgflw zkrF2Ot5zMNVIIgCK6DtRiHdeokrpLQ/bKhPmnBHzuOsCo85qcWp+qeBKNKiS4TFqY5J 79mg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1737291197; x=1737895997; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=3tRGT+YpwxEiqeVVEjuq2JjYm7w9/cSbqAo0G7cihNo=; b=vB1uY5qljWW5hfzYwAoEHr3wnmOYe4Jgibj/dZFR7X68xBsQUHOKyV85EfcJT3CLIs XN05Xi3t2M2tcs7AuNSnQ2y42DIh+Bg7YOYQutji0eTtYoRTQPilD/mx5lNN9neq2+X6 YGszFDcQi8jm6Qmj+j6JBWVzoQnLA2e7hDLlKwzX//JQiHkDt1RAHL2vPBFawV16wCcx 77W8vKMFmAc1QawapGIMet/dG5KQ4Vjojf3rpb7DO0RVYKlvCtM1uEZV04oYhVl8aTVV uFdaOkXIiD2MC7GbLdti5NLfLbn5zOm+Q82en0gCkzFGbPN3QfyBV4TO4lBR2dmwptZA xTaw== X-Forwarded-Encrypted: i=1; AJvYcCUVtIlXVSIx+F1m/zXYnANc0yLa/r9jc1X9NiQ/AwOmpVmYeglg7FjRSbl5YbRpt1JO8gH3fdQwAOZRRaaC@postgresql.org X-Gm-Message-State: AOJu0Yzj2CHp0Mcn5WaNHs9gUjY8eO7kh9x0hbnN2b67WHpwvVJqY3Js zZglecNtyrhucA2uDTg+M8vgd1Ltbo7PhtE+41d6AvTTuMffpHhiRN/GYpfgmNFLjfifnlC70bO FHCUVnHTe5U+4+d5tWrhWuGBnZqQ7OirX X-Gm-Gg: ASbGncug0l8hTtPtmPR/ZuPM+IFLWUjo9TiUhXnnOktlSMAbASekz0uLMBhO4QO4nNh tmgm1BQ6o4S1OQByGDZ/NLtWe2siKG6T0Df2V6cgKzq+p6sRP5CSHaxEMp47bXLElk4xn8TiPM/ j3ng3O+3miqQ== X-Google-Smtp-Source: AGHT+IENaLrtz99iF3E0WGVHX8ZV0yW+icRnyOrHJgQcNVSv2EbwwkZtE1Mp3q9sU1APkV5dll0pER6zDoCT/RQK/5s= X-Received: by 2002:a25:c788:0:b0:e57:cb4c:8a57 with SMTP id 3f1490d57ef6-e57cb4c8b15mr1920030276.37.1737291196511; Sun, 19 Jan 2025 04:53:16 -0800 (PST) MIME-Version: 1.0 References: <87il22cj51.fsf@163.com> In-Reply-To: From: Richard Guo Date: Sun, 19 Jan 2025 21:53:05 +0900 X-Gm-Features: AbW1kvZD9AwDRB6WeemQD48TirnFDerwiLcJ2WpUSaVUnLLofgEJKQcQdICPZ6c Message-ID: Subject: Re: Eager aggregation, take 3 To: Robert Haas Cc: Tom Lane , Tender Wang , Paul George , Andy Fan , PostgreSQL-development , pgsql-hackers@lists.postgresql.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Sat, Jan 18, 2025 at 6:16=E2=80=AFAM Robert Haas = wrote: > On Thu, Jan 16, 2025 at 3:18=E2=80=AFAM Richard Guo wrote: > > If this t1/t2 join is part of a larger SELECT query, I think the cost > > estimates for the upper join nodes would likely be quite inaccurate. > > That's definitely true. However, the question is not whether the > planner has problems today (it definitely does) but whether it's OK to > make this change without improving our ability to estimate the effects > of aggregation operations. I understand that you (quite rightly) don't > want to get sucked into fixing unrelated planner problems, and I'm > also not sure to what extent these problems are actually fixable. > However, major projects sometimes require such work. For instance, > commit 5edc63bda68a77c4d38f0cbeae1c4271f9ef4100 was motivated by the > discovery that it was too easy to get a Parallel Bitmap Heap Scan plan > even when it wasn't best. The fact that the costing wasn't right > wasn't the fault of parallel query, but parallel query still needed to > do something about it to get good results. Yeah, it's true that we have problems in aggregate estimates today. And it has been the case for a long time. In the past, we made some improvements in this area, such as in 84f9a35e3, where we adapted a new formula that is based on the random selection probability, inspired by two papers from Yao and Dell'Era. But we still have problems with aggregate estimates. I'm not sure when we could fix these problems, but I doubt that it will happen in the near future. (Sorry to be pessimistic.) If, at last, the conclusion of this discussion is that we should not apply this change until we fix those problems in aggregate estimates, I'd be very sad. This conclusion is absolutely correct, for sure, in an ideal world, but in the real world, it feels like a death sentence for this patch, and for any future patches that attempt to apply some optimizations above aggregate nodes - unless, of course, the day arrives when we finally fix those aggregate estimate problems, which doesn't seem likely in the near future. And if that's the case, can I then argue that the same principle should apply to joins? Specifically, should we refrain from applying any optimizations above join nodes until we've fixed the join estimate problems, particularly in cases where we fall back on default selectivity estimates? Please do not get me wrong. I'm not saying that we should not fix the problems in our current aggregate estimates. I think, as I said previously, that the realistic approach is to first identify some real-world queries where this patch causes significant performance regressions. This would give us the opportunity to investigate these regressions and understand how the bad cost estimates contributed to them. From there, we could figure out where to start fixing the cost estimates. And if we find that the problem is not entirely fixable, we could then explore the possibility of introducing new heuristics to avoid the performance regressions as much as possible. In my opinion, it's not very possible to make cost estimation perfect in all cases. In a sense, cost estimation is an art of compromise. I believe this is also the approach that commit 5edc63bda followed. First, it was found that Bitmap Heap Scans caused performance regressions in many TPCH queries in cases where work_mem was low. Then, this issue was thoroughly discussed, and eventually it was figured out that the impact of lossy pages needed to be accounted for when estimating the cost of bitmap scans, which became 5edc63bda. > > Well, from the perspective of planning effort, what really matters is > > whether the RelOptInfo for the grouped relation is added to the > > PlannerInfo, as it is only then available for further joining in the > > join search routine, not whether the RelOptInfo is built or not. > > Building the RelOptInfo for a grouped relation is simply a makeNode > > call followed by a flat copy; it doesn't require going through the > > full process of determining its target list, or constructing its > > restrict and join clauses, or calculating size estimates, etc. > > That's probably mostly true, but the overhead of memory allocations in > planner routines is not trivial. There are previous cases of changing > things or declining to change this purely on the number of palloc > cycles involved. Hmm, I think you are right. I will modify make_grouped_join_rel() to create the RelOptInfo for a grouped join relation only if we can generate any grouped paths by joining its input relations. > > > It's possible you're right, but it does make me nervous. I do agree > > > that making the number of RelOptInfos explode would be really bad. > > > > Based on my explanation in [1], I think it's acceptable to compare > > grouped paths for the same grouped rel, regardless of where the > > partial aggregation is placed. > > > > I fully understand that I could be wrong about this, but I don't think > > it would break anything in regular planning (i.e., planning without > > eager aggregation). > > I think you might be taking too narrow a view of the problem. As Tom > says, the issue is that this breaks a bunch of assumptions that hold > elsewhere. One place that shows up in the patch is in the special-case > logic you've added to set_cheapest(), but I fear that won't be the end > of it. It seems a bit surprising to me that you didn't also need to > adjust add_path(), for example. Even if you don't, there's lots of > places that rely on the assumption that all paths for a RelOptInfo are > returning the same set of rows. If it turns out that a bunch of those > places need to be adjusted to work with this, then the code could > potentially end up quite messy, and that might also have performance > consequences, even when this feature is disabled. Many of the code > paths that deal with paths in the planner are quite hot. Yeah, one of the basic assumptions in the planner is that all paths for a given RelOptInfo return the same set of rows. One exception to this is parameterized paths. As an example, please consider: create table t (a int, b int); create table t3 (a int, b int); insert into t select i, i from generate_series(1,1000)i; insert into t3 select i, i from generate_series(1,1000)i; create index on t3(a, b); analyze t, t3; explain (costs off) select * from t t1 join t t2 on true join t3 on t3.a > t1.a and t3.b > t2.b= ; With gdb, I found the following 4 paths in the pathlist of RelOptInfo of {t3}: {INDEXPATH :path.pathtype 341 :parent_relids (b 4) :required_outer (b 1 2) :path.parallel_aware false :path.parallel_safe true :path.parallel_workers 0 :path.rows 111 :path.disabled_nodes 0 :path.startup_cost 0.275 :path.total_cost 4.755000000000001 {INDEXPATH :path.pathtype 341 :parent_relids (b 4) :required_outer (b 1) :path.parallel_aware false :path.parallel_safe true :path.parallel_workers 0 :path.rows 333 :path.disabled_nodes 0 :path.startup_cost 0.275 :path.total_cost 6.1425 {INDEXPATH :path.pathtype 341 :parent_relids (b 4) :required_outer (b 2) :path.parallel_aware false :path.parallel_safe true :path.parallel_workers 0 :path.rows 333 :path.disabled_nodes 0 :path.startup_cost 0.275 :path.total_cost 11.145 {PATH :pathtype 338 :parent_relids (b 4) :required_outer (b) :parallel_aware false :parallel_safe true :parallel_workers 0 :rows 1000 :disabled_nodes 0 :startup_cost 0 :total_cost 15 None of them are returning the same set of rows. This is fine because we have revised the assumption to that all paths for a RelOptInfo of the same parameterization return the same set of rows. That is to say, it's OK that paths for the same RelOptInfo return different sets of rows if they have different parameterizations. Now we have the grouped paths. I had previously considered further revising this assumption to that all paths for a RelOptInfo of the same parameterization and the same location of partial aggregation return the same set of rows. That's why, back in November, I proposed the idea of introducing a GroupPathInfo into the Path structure to store the location of the partial aggregation and the estimated rowcount for each grouped path, similar to how ParamPathInfo functions for parameterized paths. However, I gave up on this idea in December after realizing an important difference from the parameterized path case. For a parameterized path, the fewer the required outer rels, the better, as more outer rels imply more join restrictions. In other words, the number of required outer rels is an important factor when comparing two paths in add_path(). For a grouped path, however, the location of partial aggregation does not impose such restrictions for further planning. As long as one grouped path is cheaper than another based on the current merits of add_path(), we don't really care where the partial aggregation is placed within the path tree. I can take up the idea of GroupPathInfo again. Before I start implementing it (which is not trivial), I'd like to hear others' thoughts on this approach - whether it's necessary and whether this is the right direction to pursue. > To say that another way, I'm not so much worried about the possibility > that the patch contains a bug. Patches contain bugs all the time and > we can just fix them. It's not wonderful, but that's how software > development goes. What I am worried about is whether the architecture > is right. If we go with one RelOptInfo when the "right answer" is > many, or for that matter if we go with many when the right answer is > one, those are things that cannot be easily and reasonably patched > post-commit, and especially not post-release. Fair point. We should make sure the architecture of this patch is solid before committing it. Regarding whether we should use a single RelOptInfo or separate RelOptInfos for the same grouped relation: If we choose to separate the grouped paths of the same grouped relation into different RelOptInfos based on the location of the partial aggregation within the path tree, then, based on my calculation from the previous email, for a relation containing n base rels, we would need to create 2^(n-1) different RelOptInfos, not to mention that this can also lead to more paths. I still struggle to see how this is feasible. Could you please elaborate on why you believe this is a viable option? Thanks Richard