MIME-Version: 1.0
References: <CAMbWs48jzLrPt1J_00ZcPZXWUQKawQOFE8ROc-ADiYqsqrpBNw@mail.gmail.com>
 <87il22cj51.fsf@163.com> <CAMbWs49=eAd2W9jCtGhaZPPp+SOC_2rg16RTG74xAht=hkr5JQ@mail.gmail.com>
 <CAMbWs49Nc4M3H+eCf1+8w8piDyEECjRb-gK_JMF4VvcyWwGEVQ@mail.gmail.com>
 <CAMbWs49E_dR0nobsExsyetpnBpHObLTsQLsEbWKQLkh0omPxNg@mail.gmail.com>
 <CAMbWs49B_qUiHvu2EqLHZRpLr3p_+QPBs50n2=L5ibYzniwTzA@mail.gmail.com>
 <CAMbWs48KCQtDymnYi4M=Vz+WMzo3fkBxffJsyk6VX6hOXXv+VA@mail.gmail.com>
 <CAMbWs49sv_MuOYqqrtmBN_oYf8VSQ2BXDwXaTpJTn_YfwyYdWQ@mail.gmail.com>
 <CAMbWs49U8Sddx_fGszPdvA3jp_nheynxaqm5Y4NqMV21VBYAuQ@mail.gmail.com>
 <CAMbWs4-LwyOg9ga+NVF7yQbMi0ZsZdN1G_sO2v=YJHV18=19+A@mail.gmail.com>
 <CALA8mJquG_zCJXfVwash5LKqHGtZXQmq7RfTSaRDUzGYeW=7Rw@mail.gmail.com>
 <CAMbWs4_EjgcBib5+y1LYcGB3EK3Y6R+OOxGKfJo42fDovadk1g@mail.gmail.com>
 <CALA8mJqe0anNM8_V6cOeOQnCHUTQggn7iOQNyQr1VaN_xMjz+w@mail.gmail.com>
 <CAMbWs48eE-s-jCicC8pSVfXk8Ws-ZvUKnsw8qH-DkVBdYv0eJQ@mail.gmail.com>
 <CAMbWs483a7-8M0pDttG44r-+8Gevn9VG0xNceE3WpkEQxJXPZw@mail.gmail.com>
 <CAHewXNmYM6DvR_kaxDL0w0fz9BwKbac+TSU3QS10aA3cXHyMmA@mail.gmail.com>
 <CA+TgmoaxH=P63hLYgyJJcEbMRnw3xi16d=HxFi1j-m7MhH6W_w@mail.gmail.com>
 <CAMbWs4_cOnpGsywj9Jt1WAgzJLW9Rxt5X13cfGz4iN2qvZQ68g@mail.gmail.com>
 <CA+Tgmob0q7bRbsFTVDMjxHE6zA4uDQLQa-s0CtwUw49V53UL_A@mail.gmail.com>
 <CAMbWs4-Xru_eKBeRHFduigSGihdixFWVTR8A+dtMw7Mao+RkJA@mail.gmail.com>
 <CAMbWs49dLjSSQRWeud+KSN0G531ciZdYoLBd5qktXA+3JQm_UQ@mail.gmail.com>
 <CAMbWs48LXGC-Y63YtzEeM-3f0NUXWCUEMs7XwGzywXTjUNMcxQ@mail.gmail.com>
 <CAMbWs48XdzvnwfTHWxQ7qK-yjvdrbwsPpqhJBuKDnO+hcbsVwA@mail.gmail.com>
 <CA+TgmoaO-7RHdyJuizWChXZm7EJGvDcfoePDDEyUA-y8vTB1tg@mail.gmail.com>
 <CAMbWs4-+jXRpKuFMZa08bS34-TBka3qqjVMAUjF=-1RA9BKvgg@mail.gmail.com>
 <CA+TgmoZapU1y59-s3o8oPt7Hv+cxRh_34FMu6MXumomLe+U1Cw@mail.gmail.com>
 <CAMbWs4_sEeeBmucBzbamBMfA9uLxVmOc_MV=ZpSyDbTcrUO_XQ@mail.gmail.com>
 <CA+Tgmob4fnv57PQB0Oox86mHSJQ0vVL249eT=gqPvrMkG7h1zw@mail.gmail.com>
 <CAMbWs489NYyTcCTbrUi7hPXKtNY5vHrrFcHyMRAv=CA5WsszVw@mail.gmail.com> <CA+TgmoazmDdcc7NeTo3WM5HW3DASNP4rfZw6X+2nnQKHampOng@mail.gmail.com>
In-Reply-To: <CA+TgmoazmDdcc7NeTo3WM5HW3DASNP4rfZw6X+2nnQKHampOng@mail.gmail.com>
From: Richard Guo <guofenglinux@gmail.com>
Date: Tue, 21 Jan 2025 17:33:29 +0900
Message-ID: <CAMbWs49bYr-ULhA+-At0iQ+NaFKy72AWB6jzughk8MPTiY+gMQ@mail.gmail.com>
Subject: Re: Eager aggregation, take 3
To: Robert Haas <robertmhaas@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, Tender Wang <tndrwang@gmail.com>, 
	Paul George <p.a.george19@gmail.com>, Andy Fan <zhihuifan1213@163.com>, 
	PostgreSQL-development <pgsql-hackers@postgresql.org>, pgsql-hackers@lists.postgresql.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/CAMbWs49bYr-ULhA%2B-At0iQ%2BNaFKy72AWB6jzughk8MPTiY%2BgMQ%40mail.gmail.com>
Precedence: bulk

On Tue, Jan 21, 2025 at 1:28=E2=80=AFAM Robert Haas <robertmhaas@gmail.com>=
 wrote:
> On Sun, Jan 19, 2025 at 7:53=E2=80=AFAM Richard Guo <guofenglinux@gmail.c=
om> wrote:
> > If, at last, the conclusion of this discussion is that we should not
> > apply this change until we fix those problems in aggregate estimates,
> > I'd be very sad.  This conclusion is absolutely correct, for sure, in
> > an ideal world, but in the real world, it feels like a death sentence
> > for this patch, and for any future patches that attempt to apply some
> > optimizations above aggregate nodes - unless, of course, the day
> > arrives when we finally fix those aggregate estimate problems, which
> > doesn't seem likely in the near future.
>
> Well, such conclusions should be based on evidence. So far, the
> evidence you've presented suggests that the optimization works, so
> there's no reason to leap to the conclusion that we shouldn't move
> forward. On the other hand, the amount of evidence you've presented
> does not seem to me to be all that large. And I'm not sure that you've
> gone looking for adversarial cases.
>
> > And if that's the case, can I then argue that the same principle
> > should apply to joins?  Specifically, should we refrain from applying
> > any optimizations above join nodes until we've fixed the join estimate
> > problems, particularly in cases where we fall back on default
> > selectivity estimates?
>
> I am having a hard time figuring out how to write back to this. I
> mean, I don't think that what you write here is a serious proposal,
> and I think you already know that I was not proposing any such thing.
> But it upsets me that you think that this hypothetical argument is
> equivalent to the ones I've actually been making. Apparently, you
> consider my concerns quite groundless and foolish.

I'm really sorry if my previous response upset you or gave the wrong
impression.  That was never my intention, and I certainly do not
consider your concerns to be groundless or foolish.  I can see how my
message may have come across differently than I intended.  To clarify,
I wasn't suggesting that your concerns about the estimates weren't
valid.  Rather, I was trying to express that it might be more
effective to fix the cost estimates based on specific regressions.

> > Regarding whether we should use a single RelOptInfo or separate
> > RelOptInfos for the same grouped relation: If we choose to separate
> > the grouped paths of the same grouped relation into different
> > RelOptInfos based on the location of the partial aggregation within
> > the path tree, then, based on my calculation from the previous email,
> > for a relation containing n base rels, we would need to create 2^(n-1)
> > different RelOptInfos, not to mention that this can also lead to more
> > paths.  I still struggle to see how this is feasible.  Could you
> > please elaborate on why you believe this is a viable option?
>
> I agree that creating an exponential number of RelOptInfos is not
> going to work out well. I haven't been quite as certain as you seem to
> be that it's an unavoidable reality, but maybe it is. For instance, my
> intuition is that if PartialAgg(t1) JOIN t2 and PartialAgg(t1 JOIN t2)
> produce very different numbers of rows, we could probably just take
> the one with the smaller row count regardless of cost, because the
> whole selling point of this optimization is that we reduce the number
> of rows that are being fed to higher level plan nodes. I don't quite
> see how it can make sense to keep a less costly path that produces
> more rows on the theory that maybe it's going to work out better later
> on. Why is the path cheaper, after all? It feels like the savings must
> come from not reducing the row count so much, but that is a cost we're
> going to have to repay at a higher plan level. Moreover, we'll be
> repaying it with interest, because more rows will have filtered
> through every level of plan over which we postponed partial
> aggregation.

I've been thinking about this proposal, and it's quite appealing.  It
would significantly reduce both the planning effort and implementation
complexity, while still yielding reasonable planning results.

One concern I have with this proposal is that, as we climb up higher
and higher in the join tree, the assumption that a path with smaller
row count and higher cost is better than one with larger row count and
lower cost may gradually no longer hold.  It's true that a path with a
smaller row count is generally better for upper join nodes, as it
feeds fewer rows to upper join nodes.  However, as there are fewer and
fewer upper join nodes left, the efficiency gained from the smaller
row count could likely no longer justify the high cost of that path
itself.

Here's an example I found that can help illustrate what I mean.

create table t (a int, b int, c int);
insert into t select i%3, i%3, i from generate_series(1,500)i;
analyze t;
set enable_eager_aggregate to on;

And here are two plans for the same query:

-- Plan 1
explain (costs on)
select sum(t4.c) from t t1 join
  (t t2 join t t3 on t2.b !=3D t3.b join t t4 on t3.b =3D t4.b)
  on t1.b =3D t2.b
group by t1.a;
                                        QUERY PLAN
---------------------------------------------------------------------------=
---------------
 Finalize HashAggregate  (cost=3D4135.19..4135.22 rows=3D3 width=3D12)
   Group Key: t1.a
   ->  Hash Join  (cost=3D1392.13..3301.85 rows=3D166668 width=3D12)
         Hash Cond: (t2.b =3D t1.b)
         ->  Nested Loop  (cost=3D1377.88..1409.66 rows=3D1000 width=3D12)
               Join Filter: (t2.b <> t3.b)
               ->  Partial HashAggregate  (cost=3D1377.88..1377.91
rows=3D3 width=3D12)
                     Group Key: t3.b
                     ->  Hash Join  (cost=3D14.25..961.22 rows=3D83334 widt=
h=3D8)
                           Hash Cond: (t3.b =3D t4.b)
                           ->  Seq Scan on t t3  (cost=3D0.00..8.00
rows=3D500 width=3D4)
                           ->  Hash  (cost=3D8.00..8.00 rows=3D500 width=3D=
8)
                                 ->  Seq Scan on t t4
(cost=3D0.00..8.00 rows=3D500 width=3D8)
               ->  Materialize  (cost=3D0.00..10.50 rows=3D500 width=3D4)
                     ->  Seq Scan on t t2  (cost=3D0.00..8.00 rows=3D500 wi=
dth=3D4)
         ->  Hash  (cost=3D8.00..8.00 rows=3D500 width=3D8)
               ->  Seq Scan on t t1  (cost=3D0.00..8.00 rows=3D500 width=3D=
8)
(17 rows)

-- Plan 2
explain (costs on)
select sum(t4.c) from t t1 join
  (t t2 join t t3 on t2.b !=3D t3.b join t t4 on t3.b =3D t4.b)
  on t1.b =3D t2.b
group by t1.a;
                                           QUERY PLAN
---------------------------------------------------------------------------=
---------------------
 Finalize HashAggregate  (cost=3D455675.44..455675.47 rows=3D3 width=3D12)
   Group Key: t1.a
   ->  Hash Join  (cost=3D455658.07..455672.94 rows=3D500 width=3D12)
         Hash Cond: (t1.b =3D t2.b)
         ->  Seq Scan on t t1  (cost=3D0.00..8.00 rows=3D500 width=3D8)
         ->  Hash  (cost=3D455658.03..455658.03 rows=3D3 width=3D12)
               ->  Partial HashAggregate  (cost=3D455658.00..455658.03
rows=3D3 width=3D12)
                     Group Key: t2.b
                     ->  Hash Join  (cost=3D14.25..316768.56
rows=3D27777887 width=3D8)
                           Hash Cond: (t3.b =3D t4.b)
                           ->  Nested Loop  (cost=3D0.00..3767.25
rows=3D166666 width=3D8)
                                 Join Filter: (t2.b <> t3.b)
                                 ->  Seq Scan on t t2
(cost=3D0.00..8.00 rows=3D500 width=3D4)
                                 ->  Materialize  (cost=3D0.00..10.50
rows=3D500 width=3D4)
                                       ->  Seq Scan on t t3
(cost=3D0.00..8.00 rows=3D500 width=3D4)
                           ->  Hash  (cost=3D8.00..8.00 rows=3D500 width=3D=
8)
                                 ->  Seq Scan on t t4
(cost=3D0.00..8.00 rows=3D500 width=3D8)
(17 rows)

For the grouped relation {t2 t3 t4}, Plan 1 chose the path
"PartialAgg(t3/t4) JOIN t2", while Plan 2 chose the path
"PartialAgg(t2/t3/t4)".

The first path has larger row count (1000) and lower cost (1409.66).
The second path has smaller row count (3) and higher cost (455658.03).

Executing these two plans shows that Plan 2 is slower than Plan 1.

-- Plan 1
 Execution Time: 286.860 ms

-- Plan 2
 Execution Time: 27109.744 ms

I think we may need to take the position in the join tree into account
when applying this heuristic.  At lower levels, we should prefer paths
with smaller row counts, while at higher levels, we should prefer
paths with lower costs.  However, it's unclear to me how we should
define "lower" and "higher" - how low is 'low' and how high is 'high'.

> I admit it's not so clear-cut when the row counts are close. If
> PartialAgg(t1 JOIN t2) JOIN t3 has a very similar to PartialAgg(t1
> JOIN t3) JOIN t2, can we categorically pick whichever one has the
> lower row count and forget about the other? I'm not sure. But I have
> an uncomfortable feeling that if we can't, we're going to have an
> explosion in the number of paths we have to generate even if we avoid
> an explosion in the number of RelOptInfos we generate.
>
> For example, consider:
>
> SELECT ... FROM fact f, dim1, dim2, dim3, dim4
> WHERE f.dim1_id =3D dim1.id AND f.dim2_id =3D dim2.id
> AND f.dim3_id =3D dim3.id AND f.dim4_id =3D dim4.id
> GROUP BY f.something;
>
> Let's assume that each dimN table has PRIMARY KEY (id). Because of the
> primary keys, it's only sensible to consider partial aggregation for
> subsets of rels that include f; and it doesn't make sense to consider
> partially aggregating after joining all 5 tables because at that point
> we should just do a single-step aggregation. So, the partially
> grouped-rel for {f,dim1,dim2,dim3,dim4} can contain paths generated in
> 15 different ways, because we can join f to any proper subset of
> {dim1,dim2,dim3,dim4} before partially aggregating and then to the
> remainder after partially aggregating. But that feels like we're
> re-performing essentially the same join search 16 times which seems
> super-expensive. I can't quite say that the work is useless or that I
> have a better idea, but I guess there will be a lot of cases where all
> 16 join searches produce the same results, or most of them do. It
> doesn't feel to me like checking through all of those possibilities is
> a good expenditure of planner effort.

Yeah, you're right that the join search process for grouped paths
basically mirrors what we do for non-grouped paths, which indeed
involves a lot of planner effort.  I've been exploring potential
heuristics to limit the search space for grouped paths, but so far, I
haven't found any effective solutions.  Currently, the heuristic used
in the patch is to only consider grouped paths that dramatically
reduce the number of rows.  All others are just discarded.  The
rationale is that if a grouped path does not reduce the number of rows
enough, it is highly unlikely to result in a competitive final plan
during the upper planning stages, so it doesn't make much sense to
consider it.  The current threshold is set to 50%, meaning that if the
number of rows returned by PartialAgg(t1 JOIN t2) is not less than 50%
of the rows returned by (t1 JOIN t2), no Aggregate paths will be
generated on top of the t1/t2 join.  If we notice significant
regressions in planning time, we might consider further increasing
this threshold, say, to 80%, so that only grouped paths that reduce
the rows by more than 80% will be considered.  This heuristic also
ensures that, once a plan with eager aggregation is chosen, it is
highly likely to result in performance improvements, due to the
significant data reduction before joins.

> I took a look at the paper you linked in the original post, but
> unfortunately it doesn't seem to say much about how to search the plan
> space efficiently. I wonder if other systems perform a search that as
> exhaustive as the one that you are proposing to perform here or
> whether they apply some heuristics to limit the search space, and if
> so, what those heuristics are.

Unfortunately, I don't have much knowledge about other systems.  It
would be really helpful if anyone could share some insights on how
other systems handle this.

Thanks
Richard