From: Robert Haas <robertmhaas@gmail.com>
Date: Fri, 24 Jan 2025 15:53:42 -0500
Message-ID: <CA+Tgmoa3+G_=8XuQWN+0ugv6r-WV6ruFESpOxpXAAKrne3oVDQ@mail.gmail.com>
Subject: Re: Eager aggregation, take 3
To: Richard Guo <guofenglinux@gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>, Tender Wang <tndrwang@gmail.com>, 
	Paul George <p.a.george19@gmail.com>, Andy Fan <zhihuifan1213@163.com>, 
	PostgreSQL-development <pgsql-hackers@postgresql.org>, pgsql-hackers@lists.postgresql.org
Archived-At: <https://www.postgresql.org/message-id/CA%2BTgmoa3%2BG_%3D8XuQWN%2B0ugv6r-WV6ruFESpOxpXAAKrne3oVDQ%40mail.gmail.com>

On Wed, Jan 22, 2025 at 1:48 AM Richard Guo <guofenglinux@gmail.com> wrote:
> This approach would require injecting multiple intermediate
> aggregation nodes into the path tree, for which we currently lack the
> necessary architecture.  As a result, I didn't pursue this idea
> further.  However, I'm really glad you mentioned this approach, though
> it's still unclear whether it's a feasible or reasonable idea.

I think the biggest question in my mind is really whether we can
accurately judge when such a strategy is likely to be a win. In this
instance it looks like we could have figured it out, but as we've
discussed, I fear a lot of estimates will be inaccurate. If we knew
they were going to be good, then I see no reason not to apply the
technique when it's sensible.

> I don't have much experience with end-user scenarios, so I'm not sure
> if it's common to have queries where the row count increases with more
> and more tables joined.

I don't think it's very common to see it increase as dramatically as
in your test case.

> > To be honest, I was quite surprised this was a percentage like 50% or
> > 80% and not a multiple like 2 or 5. And I had thought the multiplier
> > might even be larger, like 10 or more. The thing is, 50% means we only
> > have to form 2-item groups in order to justify aggregating twice.
> > Maybe SUM() is cheap enough to justify that treatment, but a more
> > expensive aggregate might not be, especially things like string_agg()
> > or array_agg() where aggregation creates bigger objects.
>
> Hmm, if I understand correctly, the "percentage" and the "multiple"
> work in the same way.  Percentage 50% and multiple 2 both mean that
> the average group size is 2, and percentage 90% and multiple 10 both
> mean that the average group size is 10.  In general, this relationship
> should hold: percentage = 1 - 1/multiple.  However, I might not have
> grasped your point correctly.

Yes, they're equivalent. However, a percentage to me suggests that we
think that the meaningful values might be something like 20%, 50%,
80%; whereas with a multiplier someone might be more inclined to think
of values like 10, 100, 1000. You can definitely write those values as
90%, 99%, 99.9%; however, it seems less natural to me to express it
that way when we think the value will be quite close to 1. The fact
that you chose a percentage suggested to me that you were aiming for a
less-strict threshold than I had supposed we would want.
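
To make the equivalence concrete, here is a small sketch of the
conversion in both directions, using the relationship
percentage = 1 - 1/multiple quoted above. The function names are
hypothetical and are not taken from the patch; this only illustrates
the arithmetic.

```python
def multiple_to_fraction(multiple: float) -> float:
    """Fraction of rows removed when the average group size is `multiple`."""
    if multiple < 1:
        raise ValueError("average group size must be at least 1")
    return 1.0 - 1.0 / multiple


def fraction_to_multiple(fraction: float) -> float:
    """Average group size implied by removing `fraction` of the rows."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return 1.0 / (1.0 - fraction)


if __name__ == "__main__":
    # Multipliers of the kind mentioned in the discussion.
    for m in (2, 5, 10, 100, 1000):
        print(f"multiple {m:>4} -> {multiple_to_fraction(m):.1%}")
```

Written this way, the multipliers 10, 100, and 1000 all land in the
narrow band between 90% and 99.9%, which is why the percentage form
reads less naturally for strict thresholds.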

> Yeah, as you summarized, this heuristic is primarily used to discard
> unpromising paths, ensuring they aren't considered further.  For the
> paths that pass this heuristic, the cost model will then determine the
> appropriate aggregation and join methods.  If we take this into
> consideration when applying the heuristic, it seems to me that we
> would essentially be duplicating the work that the cost model
> performs, which doesn't seem necessary.

Well, I think we do ideally want heuristics that can reject
unpromising paths earlier. The planning cost of this is really quite
high. But I'm not sure how far we can get with this particular
heuristic. True, we could raise it to a larger value, and that might
help to rule out unpromising paths earlier. But I fear you'll quickly
find examples where it also rules out promising paths early. A good
heuristic is easy to compute and highly accurate. This heuristic is
easy to compute, but the accuracy is questionable.

-- 
Robert Haas
EDB: http://www.enterprisedb.com