Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ta9hj-00Gni4-Hi for pgsql-hackers@arkaria.postgresql.org; Tue, 21 Jan 2025 08:33:48 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1ta9hh-00DA3U-N9 for pgsql-hackers@arkaria.postgresql.org; Tue, 21 Jan 2025 08:33:45 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1ta9hh-00DA3D-5F for pgsql-hackers@lists.postgresql.org; Tue, 21 Jan 2025 08:33:45 +0000 Received: from mail-yb1-xb31.google.com ([2607:f8b0:4864:20::b31]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1ta9hd-000hN7-0j for pgsql-hackers@postgresql.org; Tue, 21 Jan 2025 08:33:43 +0000 Received: by mail-yb1-xb31.google.com with SMTP id 3f1490d57ef6-e3a26de697fso8326631276.3 for ; Tue, 21 Jan 2025 00:33:41 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1737448421; x=1738053221; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=rTXQOwYUsYpUD7KU/WRzBPQq2JAjHOVIOIr23KL2xaM=; b=VqDEzveAnZsFPdWZnrXKIxGqakhv3C/NU5+dBdi6geAz9YFUp/zEfn4va4LZkIO8bc cxCN8n5JsFXkJj95zQkXwzilaggL1LRpa0hR3xS4dXN5bniCRsjDBJEl4qs25Tf51CLG sIMfqQP2lgpLhlCpAF8d7ufcvg9optZ1XtomrMPR42a7esTlRMWcx1Vhn47C5cOGX3Ns jImmQPsv++qbg0HYOkbU3amAalpIwK89gh4t0E14odUULrQba5QntCkJAxbd/Em+f5RS AXEyHxQcIWsEw+JpKfKgMXrlV50jb9yckSShUzFEEnRUqBZWDrkqiPvc03XGilKuvWmT Gd4g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1737448421; x=1738053221; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=rTXQOwYUsYpUD7KU/WRzBPQq2JAjHOVIOIr23KL2xaM=; b=I/mDi6XYOeuSTqWvejXzq5Bwz8ent3pVsdA/ykd13PR+J2jy1Gjnw6JcnTCtmAt/3x 9MKFAiQjwsb8nYcZznsY10k7OnpYvlgeQBRBStMpsJr1jyv49Mcl+rESHBbI4NLO4P/i SscGxZ0fTlSIkUjCVjkGR6slNdsRp1958rRiMGpgWIsqFgbcQbyKFxMNoGtRlzObhHn7 mtxW6SPVzi2d+LFQEnIEMfgK15/iQXcuD6S7ulM2CKXmpepbilKFAnzbtD6c1LG3j1rK faZT+BrprMEzoWjRy9z/UE0U9TsGF0Xj+OhZ37n2bD9hKiwfDbcwLsTQqEWfNBs0onWE xoPg== X-Forwarded-Encrypted: i=1; AJvYcCW1YmcdkOztQHYmuv38bF8pZY2zBgkrae8dPpSEGmBo/k+/JVzOAY7VHWbxVQyBcPesyuchlVlYCSpNFNK7@postgresql.org X-Gm-Message-State: AOJu0Yx6SLQrfOJc8gM0vaZ66ZG8m5raDR6Uua6XsdDgYmVdAUis45gS 7WPMJ6AWQa/9aGd9oJ+Od7y41ASgp7/gt8esTn5MwAIj5IduDzqLakc4L1cIEfk2N0IFP7mX0EL ONth+cN74nGuTlOB6imUd3M+GAg4= X-Gm-Gg: ASbGncvLUtvDetkgA3EvizgRHcfc16TPe3zfvXWgHGxR22l8Jw+BpITKVA00z7kQfGF X8RJwnmiybrRKN6IxAJX1utrlh74UWH/YSxPf5h/LSoAyl/ef4kdUYA+AeclmhknPpra8SAd8yY 6o0nQtOU+eDQ== X-Google-Smtp-Source: AGHT+IFEiEx1opcv6jNVO8jl73AVlPvu0w4Bd0F9lFeM6JfR7xewupxyWWJ7B3m4klLxamk+84DCcsfk3sPFC8cqi7s= X-Received: by 2002:a05:690c:6812:b0:6ef:e39f:bb1 with SMTP id 00721157ae682-6f6eb905af5mr130744507b3.27.1737448420716; Tue, 21 Jan 2025 00:33:40 -0800 (PST) MIME-Version: 1.0 References: <87il22cj51.fsf@163.com> In-Reply-To: From: Richard Guo Date: Tue, 21 Jan 2025 17:33:29 +0900 X-Gm-Features: AbW1kvYBnAyhv2km4iaJ-BLY41zkTNHv4ebSnLTCt8Lgc5C7LOA8--5LGNSn3Rw Message-ID: Subject: Re: Eager aggregation, take 3 To: Robert Haas Cc: Tom Lane , Tender Wang , Paul George , Andy Fan , PostgreSQL-development , pgsql-hackers@lists.postgresql.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Tue, Jan 21, 2025 at 1:28=E2=80=AFAM Robert Haas = wrote: > On Sun, Jan 19, 2025 at 7:53=E2=80=AFAM Richard Guo wrote: > > If, at last, the conclusion of this discussion is that we should not > > apply this change until we fix those problems in aggregate estimates, > > I'd be very sad. This conclusion is absolutely correct, for sure, in > > an ideal world, but in the real world, it feels like a death sentence > > for this patch, and for any future patches that attempt to apply some > > optimizations above aggregate nodes - unless, of course, the day > > arrives when we finally fix those aggregate estimate problems, which > > doesn't seem likely in the near future. > > Well, such conclusions should be based on evidence. So far, the > evidence you've presented suggests that the optimization works, so > there's no reason to leap to the conclusion that we shouldn't move > forward. On the other hand, the amount of evidence you've presented > does not seem to me to be all that large. And I'm not sure that you've > gone looking for adversarial cases. > > > And if that's the case, can I then argue that the same principle > > should apply to joins? Specifically, should we refrain from applying > > any optimizations above join nodes until we've fixed the join estimate > > problems, particularly in cases where we fall back on default > > selectivity estimates? > > I am having a hard time figuring out how to write back to this. I > mean, I don't think that what you write here is a serious proposal, > and I think you already know that I was not proposing any such thing. > But it upsets me that you think that this hypothetical argument is > equivalent to the ones I've actually been making. Apparently, you > consider my concerns quite groundless and foolish. I'm really sorry if my previous response upset you or gave the wrong impression. That was never my intention, and I certainly do not consider your concerns to be groundless or foolish. I can see how my message may have come across differently than I intended. To clarify, I wasn't suggesting that your concerns about the estimates weren't valid. Rather, I was trying to express that it might be more effective to fix the cost estimates based on specific regressions. > > Regarding whether we should use a single RelOptInfo or separate > > RelOptInfos for the same grouped relation: If we choose to separate > > the grouped paths of the same grouped relation into different > > RelOptInfos based on the location of the partial aggregation within > > the path tree, then, based on my calculation from the previous email, > > for a relation containing n base rels, we would need to create 2^(n-1) > > different RelOptInfos, not to mention that this can also lead to more > > paths. I still struggle to see how this is feasible. Could you > > please elaborate on why you believe this is a viable option? > > I agree that creating an exponential number of RelOptInfos is not > going to work out well. I haven't been quite as certain as you seem to > be that it's an unavoidable reality, but maybe it is. For instance, my > intuition is that if PartialAgg(t1) JOIN t2 and PartialAgg(t1 JOIN t2) > produce very different numbers of rows, we could probably just take > the one with the smaller row count regardless of cost, because the > whole selling point of this optimization is that we reduce the number > of rows that are being fed to higher level plan nodes. I don't quite > see how it can make sense to keep a less costly path that produces > more rows on the theory that maybe it's going to work out better later > on. Why is the path cheaper, after all? It feels like the savings must > come from not reducing the row count so much, but that is a cost we're > going to have to repay at a higher plan level. Moreover, we'll be > repaying it with interest, because more rows will have filtered > through every level of plan over which we postponed partial > aggregation. I've been thinking about this proposal, and it's quite appealing. It would significantly reduce both the planning effort and implementation complexity, while still yielding reasonable planning results. One concern I have with this proposal is that, as we climb up higher and higher in the join tree, the assumption that a path with smaller row count and higher cost is better than one with larger row count and lower cost may gradually no longer hold. It's true that a path with a smaller row count is generally better for upper join nodes, as it feeds fewer rows to upper join nodes. However, as there are fewer and fewer upper join nodes left, the efficiency gained from the smaller row count could likely no longer justify the high cost of that path itself. Here's an example I found that can help illustrate what I mean. create table t (a int, b int, c int); insert into t select i%3, i%3, i from generate_series(1,500)i; analyze t; set enable_eager_aggregate to on; And here are two plans for the same query: -- Plan 1 explain (costs on) select sum(t4.c) from t t1 join (t t2 join t t3 on t2.b !=3D t3.b join t t4 on t3.b =3D t4.b) on t1.b =3D t2.b group by t1.a; QUERY PLAN ---------------------------------------------------------------------------= --------------- Finalize HashAggregate (cost=3D4135.19..4135.22 rows=3D3 width=3D12) Group Key: t1.a -> Hash Join (cost=3D1392.13..3301.85 rows=3D166668 width=3D12) Hash Cond: (t2.b =3D t1.b) -> Nested Loop (cost=3D1377.88..1409.66 rows=3D1000 width=3D12) Join Filter: (t2.b <> t3.b) -> Partial HashAggregate (cost=3D1377.88..1377.91 rows=3D3 width=3D12) Group Key: t3.b -> Hash Join (cost=3D14.25..961.22 rows=3D83334 widt= h=3D8) Hash Cond: (t3.b =3D t4.b) -> Seq Scan on t t3 (cost=3D0.00..8.00 rows=3D500 width=3D4) -> Hash (cost=3D8.00..8.00 rows=3D500 width=3D= 8) -> Seq Scan on t t4 (cost=3D0.00..8.00 rows=3D500 width=3D8) -> Materialize (cost=3D0.00..10.50 rows=3D500 width=3D4) -> Seq Scan on t t2 (cost=3D0.00..8.00 rows=3D500 wi= dth=3D4) -> Hash (cost=3D8.00..8.00 rows=3D500 width=3D8) -> Seq Scan on t t1 (cost=3D0.00..8.00 rows=3D500 width=3D= 8) (17 rows) -- Plan 2 explain (costs on) select sum(t4.c) from t t1 join (t t2 join t t3 on t2.b !=3D t3.b join t t4 on t3.b =3D t4.b) on t1.b =3D t2.b group by t1.a; QUERY PLAN ---------------------------------------------------------------------------= --------------------- Finalize HashAggregate (cost=3D455675.44..455675.47 rows=3D3 width=3D12) Group Key: t1.a -> Hash Join (cost=3D455658.07..455672.94 rows=3D500 width=3D12) Hash Cond: (t1.b =3D t2.b) -> Seq Scan on t t1 (cost=3D0.00..8.00 rows=3D500 width=3D8) -> Hash (cost=3D455658.03..455658.03 rows=3D3 width=3D12) -> Partial HashAggregate (cost=3D455658.00..455658.03 rows=3D3 width=3D12) Group Key: t2.b -> Hash Join (cost=3D14.25..316768.56 rows=3D27777887 width=3D8) Hash Cond: (t3.b =3D t4.b) -> Nested Loop (cost=3D0.00..3767.25 rows=3D166666 width=3D8) Join Filter: (t2.b <> t3.b) -> Seq Scan on t t2 (cost=3D0.00..8.00 rows=3D500 width=3D4) -> Materialize (cost=3D0.00..10.50 rows=3D500 width=3D4) -> Seq Scan on t t3 (cost=3D0.00..8.00 rows=3D500 width=3D4) -> Hash (cost=3D8.00..8.00 rows=3D500 width=3D= 8) -> Seq Scan on t t4 (cost=3D0.00..8.00 rows=3D500 width=3D8) (17 rows) For the grouped relation {t2 t3 t4}, Plan 1 chose the path "PartialAgg(t3/t4) JOIN t2", while Plan 2 chose the path "PartialAgg(t2/t3/t4)". The first path has larger row count (1000) and lower cost (1409.66). The second path has smaller row count (3) and higher cost (455658.03). Executing these two plans shows that Plan 2 is slower than Plan 1. -- Plan 1 Execution Time: 286.860 ms -- Plan 2 Execution Time: 27109.744 ms I think we may need to take the position in the join tree into account when applying this heuristic. At lower levels, we should prefer paths with smaller row counts, while at higher levels, we should prefer paths with lower costs. However, it's unclear to me how we should define "lower" and "higher" - how low is 'low' and how high is 'high'. > I admit it's not so clear-cut when the row counts are close. If > PartialAgg(t1 JOIN t2) JOIN t3 has a very similar to PartialAgg(t1 > JOIN t3) JOIN t2, can we categorically pick whichever one has the > lower row count and forget about the other? I'm not sure. But I have > an uncomfortable feeling that if we can't, we're going to have an > explosion in the number of paths we have to generate even if we avoid > an explosion in the number of RelOptInfos we generate. > > For example, consider: > > SELECT ... FROM fact f, dim1, dim2, dim3, dim4 > WHERE f.dim1_id =3D dim1.id AND f.dim2_id =3D dim2.id > AND f.dim3_id =3D dim3.id AND f.dim4_id =3D dim4.id > GROUP BY f.something; > > Let's assume that each dimN table has PRIMARY KEY (id). Because of the > primary keys, it's only sensible to consider partial aggregation for > subsets of rels that include f; and it doesn't make sense to consider > partially aggregating after joining all 5 tables because at that point > we should just do a single-step aggregation. So, the partially > grouped-rel for {f,dim1,dim2,dim3,dim4} can contain paths generated in > 15 different ways, because we can join f to any proper subset of > {dim1,dim2,dim3,dim4} before partially aggregating and then to the > remainder after partially aggregating. But that feels like we're > re-performing essentially the same join search 16 times which seems > super-expensive. I can't quite say that the work is useless or that I > have a better idea, but I guess there will be a lot of cases where all > 16 join searches produce the same results, or most of them do. It > doesn't feel to me like checking through all of those possibilities is > a good expenditure of planner effort. Yeah, you're right that the join search process for grouped paths basically mirrors what we do for non-grouped paths, which indeed involves a lot of planner effort. I've been exploring potential heuristics to limit the search space for grouped paths, but so far, I haven't found any effective solutions. Currently, the heuristic used in the patch is to only consider grouped paths that dramatically reduce the number of rows. All others are just discarded. The rationale is that if a grouped path does not reduce the number of rows enough, it is highly unlikely to result in a competitive final plan during the upper planning stages, so it doesn't make much sense to consider it. The current threshold is set to 50%, meaning that if the number of rows returned by PartialAgg(t1 JOIN t2) is not less than 50% of the rows returned by (t1 JOIN t2), no Aggregate paths will be generated on top of the t1/t2 join. If we notice significant regressions in planning time, we might consider further increasing this threshold, say, to 80%, so that only grouped paths that reduce the rows by more than 80% will be considered. This heuristic also ensures that, once a plan with eager aggregation is chosen, it is highly likely to result in performance improvements, due to the significant data reduction before joins. > I took a look at the paper you linked in the original post, but > unfortunately it doesn't seem to say much about how to search the plan > space efficiently. I wonder if other systems perform a search that as > exhaustive as the one that you are proposing to perform here or > whether they apply some heuristics to limit the search space, and if > so, what those heuristics are. Unfortunately, I don't have much knowledge about other systems. It would be really helpful if anyone could share some insights on how other systems handle this. Thanks Richard