MIME-Version: 1.0
References: <CAMbWs48jzLrPt1J_00ZcPZXWUQKawQOFE8ROc-ADiYqsqrpBNw@mail.gmail.com>
 <87il22cj51.fsf@163.com> <CAMbWs49=eAd2W9jCtGhaZPPp+SOC_2rg16RTG74xAht=hkr5JQ@mail.gmail.com>
 <CAMbWs49Nc4M3H+eCf1+8w8piDyEECjRb-gK_JMF4VvcyWwGEVQ@mail.gmail.com>
 <CAMbWs49E_dR0nobsExsyetpnBpHObLTsQLsEbWKQLkh0omPxNg@mail.gmail.com>
 <CAMbWs49B_qUiHvu2EqLHZRpLr3p_+QPBs50n2=L5ibYzniwTzA@mail.gmail.com>
 <CAMbWs48KCQtDymnYi4M=Vz+WMzo3fkBxffJsyk6VX6hOXXv+VA@mail.gmail.com>
 <CAMbWs49sv_MuOYqqrtmBN_oYf8VSQ2BXDwXaTpJTn_YfwyYdWQ@mail.gmail.com>
 <CAMbWs49U8Sddx_fGszPdvA3jp_nheynxaqm5Y4NqMV21VBYAuQ@mail.gmail.com>
 <CAMbWs4-LwyOg9ga+NVF7yQbMi0ZsZdN1G_sO2v=YJHV18=19+A@mail.gmail.com>
 <CALA8mJquG_zCJXfVwash5LKqHGtZXQmq7RfTSaRDUzGYeW=7Rw@mail.gmail.com>
 <CAMbWs4_EjgcBib5+y1LYcGB3EK3Y6R+OOxGKfJo42fDovadk1g@mail.gmail.com>
 <CALA8mJqe0anNM8_V6cOeOQnCHUTQggn7iOQNyQr1VaN_xMjz+w@mail.gmail.com>
 <CAMbWs48eE-s-jCicC8pSVfXk8Ws-ZvUKnsw8qH-DkVBdYv0eJQ@mail.gmail.com>
 <CAMbWs483a7-8M0pDttG44r-+8Gevn9VG0xNceE3WpkEQxJXPZw@mail.gmail.com>
 <CAHewXNmYM6DvR_kaxDL0w0fz9BwKbac+TSU3QS10aA3cXHyMmA@mail.gmail.com>
 <CA+TgmoaxH=P63hLYgyJJcEbMRnw3xi16d=HxFi1j-m7MhH6W_w@mail.gmail.com>
 <CAMbWs4_cOnpGsywj9Jt1WAgzJLW9Rxt5X13cfGz4iN2qvZQ68g@mail.gmail.com>
 <CA+Tgmob0q7bRbsFTVDMjxHE6zA4uDQLQa-s0CtwUw49V53UL_A@mail.gmail.com>
 <CAMbWs4-Xru_eKBeRHFduigSGihdixFWVTR8A+dtMw7Mao+RkJA@mail.gmail.com>
 <CAMbWs49dLjSSQRWeud+KSN0G531ciZdYoLBd5qktXA+3JQm_UQ@mail.gmail.com>
 <CAMbWs48LXGC-Y63YtzEeM-3f0NUXWCUEMs7XwGzywXTjUNMcxQ@mail.gmail.com>
 <CAMbWs48XdzvnwfTHWxQ7qK-yjvdrbwsPpqhJBuKDnO+hcbsVwA@mail.gmail.com>
 <CA+TgmoaO-7RHdyJuizWChXZm7EJGvDcfoePDDEyUA-y8vTB1tg@mail.gmail.com>
 <CAMbWs4-+jXRpKuFMZa08bS34-TBka3qqjVMAUjF=-1RA9BKvgg@mail.gmail.com>
 <CA+TgmoZapU1y59-s3o8oPt7Hv+cxRh_34FMu6MXumomLe+U1Cw@mail.gmail.com>
 <CAMbWs4_sEeeBmucBzbamBMfA9uLxVmOc_MV=ZpSyDbTcrUO_XQ@mail.gmail.com> <CA+Tgmob4fnv57PQB0Oox86mHSJQ0vVL249eT=gqPvrMkG7h1zw@mail.gmail.com>
In-Reply-To: <CA+Tgmob4fnv57PQB0Oox86mHSJQ0vVL249eT=gqPvrMkG7h1zw@mail.gmail.com>
From: Richard Guo <guofenglinux@gmail.com>
Date: Sun, 19 Jan 2025 21:53:05 +0900
Message-ID: <CAMbWs489NYyTcCTbrUi7hPXKtNY5vHrrFcHyMRAv=CA5WsszVw@mail.gmail.com>
Subject: Re: Eager aggregation, take 3
To: Robert Haas <robertmhaas@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, Tender Wang <tndrwang@gmail.com>, 
	Paul George <p.a.george19@gmail.com>, Andy Fan <zhihuifan1213@163.com>, 
	PostgreSQL-development <pgsql-hackers@postgresql.org>, pgsql-hackers@lists.postgresql.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/CAMbWs489NYyTcCTbrUi7hPXKtNY5vHrrFcHyMRAv%3DCA5WsszVw%40mail.gmail.com>
Precedence: bulk

On Sat, Jan 18, 2025 at 6:16=E2=80=AFAM Robert Haas <robertmhaas@gmail.com>=
 wrote:
> On Thu, Jan 16, 2025 at 3:18=E2=80=AFAM Richard Guo <guofenglinux@gmail.c=
om> wrote:
> > If this t1/t2 join is part of a larger SELECT query, I think the cost
> > estimates for the upper join nodes would likely be quite inaccurate.
>
> That's definitely true. However, the question is not whether the
> planner has problems today (it definitely does) but whether it's OK to
> make this change without improving our ability to estimate the effects
> of aggregation operations. I understand that you (quite rightly) don't
> want to get sucked into fixing unrelated planner problems, and I'm
> also not sure to what extent these problems are actually fixable.
> However, major projects sometimes require such work. For instance,
> commit 5edc63bda68a77c4d38f0cbeae1c4271f9ef4100 was motivated by the
> discovery that it was too easy to get a Parallel Bitmap Heap Scan plan
> even when it wasn't best. The fact that the costing wasn't right
> wasn't the fault of parallel query, but parallel query still needed to
> do something about it to get good results.

Yeah, it's true that we have problems in aggregate estimates today.
And it has been the case for a long time.  In the past, we made some
improvements in this area, such as in 84f9a35e3, where we adapted a
new formula that is based on the random selection probability,
inspired by two papers from Yao and Dell'Era.  But we still have
problems with aggregate estimates.  I'm not sure when we could fix
these problems, but I doubt that it will happen in the near future.
(Sorry to be pessimistic.)

If, at last, the conclusion of this discussion is that we should not
apply this change until we fix those problems in aggregate estimates,
I'd be very sad.  This conclusion is absolutely correct, for sure, in
an ideal world, but in the real world, it feels like a death sentence
for this patch, and for any future patches that attempt to apply some
optimizations above aggregate nodes - unless, of course, the day
arrives when we finally fix those aggregate estimate problems, which
doesn't seem likely in the near future.

And if that's the case, can I then argue that the same principle
should apply to joins?  Specifically, should we refrain from applying
any optimizations above join nodes until we've fixed the join estimate
problems, particularly in cases where we fall back on default
selectivity estimates?

Please do not get me wrong.  I'm not saying that we should not fix the
problems in our current aggregate estimates.  I think, as I said
previously, that the realistic approach is to first identify some
real-world queries where this patch causes significant performance
regressions.  This would give us the opportunity to investigate these
regressions and understand how the bad cost estimates contributed to
them.  From there, we could figure out where to start fixing the cost
estimates.  And if we find that the problem is not entirely fixable,
we could then explore the possibility of introducing new heuristics to
avoid the performance regressions as much as possible.  In my opinion,
it's not very possible to make cost estimation perfect in all cases.
In a sense, cost estimation is an art of compromise.

I believe this is also the approach that commit 5edc63bda followed.
First, it was found that Bitmap Heap Scans caused performance
regressions in many TPCH queries in cases where work_mem was low.
Then, this issue was thoroughly discussed, and eventually it was
figured out that the impact of lossy pages needed to be accounted for
when estimating the cost of bitmap scans, which became 5edc63bda.

> > Well, from the perspective of planning effort, what really matters is
> > whether the RelOptInfo for the grouped relation is added to the
> > PlannerInfo, as it is only then available for further joining in the
> > join search routine, not whether the RelOptInfo is built or not.
> > Building the RelOptInfo for a grouped relation is simply a makeNode
> > call followed by a flat copy; it doesn't require going through the
> > full process of determining its target list, or constructing its
> > restrict and join clauses, or calculating size estimates, etc.
>
> That's probably mostly true, but the overhead of memory allocations in
> planner routines is not trivial. There are previous cases of changing
> things or declining to change this purely on the number of palloc
> cycles involved.

Hmm, I think you are right.  I will modify make_grouped_join_rel() to
create the RelOptInfo for a grouped join relation only if we can
generate any grouped paths by joining its input relations.

> > > It's possible you're right, but it does make me nervous. I do agree
> > > that making the number of RelOptInfos explode would be really bad.
> >
> > Based on my explanation in [1], I think it's acceptable to compare
> > grouped paths for the same grouped rel, regardless of where the
> > partial aggregation is placed.
> >
> > I fully understand that I could be wrong about this, but I don't think
> > it would break anything in regular planning (i.e., planning without
> > eager aggregation).
>
> I think you might be taking too narrow a view of the problem. As Tom
> says, the issue is that this breaks a bunch of assumptions that hold
> elsewhere. One place that shows up in the patch is in the special-case
> logic you've added to set_cheapest(), but I fear that won't be the end
> of it. It seems a bit surprising to me that you didn't also need to
> adjust add_path(), for example. Even if you don't, there's lots of
> places that rely on the assumption that all paths for a RelOptInfo are
> returning the same set of rows. If it turns out that a bunch of those
> places need to be adjusted to work with this, then the code could
> potentially end up quite messy, and that might also have performance
> consequences, even when this feature is disabled. Many of the code
> paths that deal with paths in the planner are quite hot.

Yeah, one of the basic assumptions in the planner is that all paths
for a given RelOptInfo return the same set of rows.  One exception
to this is parameterized paths.  As an example, please consider:

create table t (a int, b int);
create table t3 (a int, b int);

insert into t select i, i from generate_series(1,1000)i;
insert into t3 select i, i from generate_series(1,1000)i;

create index on t3(a, b);
analyze t, t3;

explain (costs off)
select * from t t1 join t t2 on true join t3 on t3.a > t1.a and t3.b > t2.b=
;

With gdb, I found the following 4 paths in the pathlist of RelOptInfo
of {t3}:

   {INDEXPATH
   :path.pathtype 341
   :parent_relids (b 4)
   :required_outer (b 1 2)
   :path.parallel_aware false
   :path.parallel_safe true
   :path.parallel_workers 0
   :path.rows 111
   :path.disabled_nodes 0
   :path.startup_cost 0.275
   :path.total_cost 4.755000000000001

   {INDEXPATH
   :path.pathtype 341
   :parent_relids (b 4)
   :required_outer (b 1)
   :path.parallel_aware false
   :path.parallel_safe true
   :path.parallel_workers 0
   :path.rows 333
   :path.disabled_nodes 0
   :path.startup_cost 0.275
   :path.total_cost 6.1425

   {INDEXPATH
   :path.pathtype 341
   :parent_relids (b 4)
   :required_outer (b 2)
   :path.parallel_aware false
   :path.parallel_safe true
   :path.parallel_workers 0
   :path.rows 333
   :path.disabled_nodes 0
   :path.startup_cost 0.275
   :path.total_cost 11.145

   {PATH
   :pathtype 338
   :parent_relids (b 4)
   :required_outer (b)
   :parallel_aware false
   :parallel_safe true
   :parallel_workers 0
   :rows 1000
   :disabled_nodes 0
   :startup_cost 0
   :total_cost 15

None of them are returning the same set of rows.  This is fine because
we have revised the assumption to that all paths for a RelOptInfo of
the same parameterization return the same set of rows.  That is to
say, it's OK that paths for the same RelOptInfo return different sets
of rows if they have different parameterizations.

Now we have the grouped paths.  I had previously considered further
revising this assumption to that all paths for a RelOptInfo of the
same parameterization and the same location of partial aggregation
return the same set of rows.  That's why, back in November, I proposed
the idea of introducing a GroupPathInfo into the Path structure to
store the location of the partial aggregation and the estimated
rowcount for each grouped path, similar to how ParamPathInfo functions
for parameterized paths.

However, I gave up on this idea in December after realizing an
important difference from the parameterized path case.  For a
parameterized path, the fewer the required outer rels, the better, as
more outer rels imply more join restrictions.  In other words, the
number of required outer rels is an important factor when comparing
two paths in add_path().  For a grouped path, however, the location of
partial aggregation does not impose such restrictions for further
planning.  As long as one grouped path is cheaper than another based
on the current merits of add_path(), we don't really care where the
partial aggregation is placed within the path tree.

I can take up the idea of GroupPathInfo again.  Before I start
implementing it (which is not trivial), I'd like to hear others'
thoughts on this approach - whether it's necessary and whether this is
the right direction to pursue.

> To say that another way, I'm not so much worried about the possibility
> that the patch contains a bug. Patches contain bugs all the time and
> we can just fix them. It's not wonderful, but that's how software
> development goes. What I am worried about is whether the architecture
> is right. If we go with one RelOptInfo when the "right answer" is
> many, or for that matter if we go with many when the right answer is
> one, those are things that cannot be easily and reasonably patched
> post-commit, and especially not post-release.

Fair point.  We should make sure the architecture of this patch is
solid before committing it.

Regarding whether we should use a single RelOptInfo or separate
RelOptInfos for the same grouped relation: If we choose to separate
the grouped paths of the same grouped relation into different
RelOptInfos based on the location of the partial aggregation within
the path tree, then, based on my calculation from the previous email,
for a relation containing n base rels, we would need to create 2^(n-1)
different RelOptInfos, not to mention that this can also lead to more
paths.  I still struggle to see how this is feasible.  Could you
please elaborate on why you believe this is a viable option?

Thanks
Richard