MIME-Version: 1.0
References: <CAMbWs48jzLrPt1J_00ZcPZXWUQKawQOFE8ROc-ADiYqsqrpBNw@mail.gmail.com>
 <87il22cj51.fsf@163.com> <CAMbWs49=eAd2W9jCtGhaZPPp+SOC_2rg16RTG74xAht=hkr5JQ@mail.gmail.com>
 <CAMbWs49Nc4M3H+eCf1+8w8piDyEECjRb-gK_JMF4VvcyWwGEVQ@mail.gmail.com>
 <CAMbWs49E_dR0nobsExsyetpnBpHObLTsQLsEbWKQLkh0omPxNg@mail.gmail.com>
 <CAMbWs49B_qUiHvu2EqLHZRpLr3p_+QPBs50n2=L5ibYzniwTzA@mail.gmail.com>
 <CAMbWs48KCQtDymnYi4M=Vz+WMzo3fkBxffJsyk6VX6hOXXv+VA@mail.gmail.com>
 <CAMbWs49sv_MuOYqqrtmBN_oYf8VSQ2BXDwXaTpJTn_YfwyYdWQ@mail.gmail.com>
 <CAMbWs49U8Sddx_fGszPdvA3jp_nheynxaqm5Y4NqMV21VBYAuQ@mail.gmail.com>
 <CAMbWs4-LwyOg9ga+NVF7yQbMi0ZsZdN1G_sO2v=YJHV18=19+A@mail.gmail.com>
 <CALA8mJquG_zCJXfVwash5LKqHGtZXQmq7RfTSaRDUzGYeW=7Rw@mail.gmail.com>
 <CAMbWs4_EjgcBib5+y1LYcGB3EK3Y6R+OOxGKfJo42fDovadk1g@mail.gmail.com>
 <CALA8mJqe0anNM8_V6cOeOQnCHUTQggn7iOQNyQr1VaN_xMjz+w@mail.gmail.com>
 <CAMbWs48eE-s-jCicC8pSVfXk8Ws-ZvUKnsw8qH-DkVBdYv0eJQ@mail.gmail.com>
 <CAMbWs483a7-8M0pDttG44r-+8Gevn9VG0xNceE3WpkEQxJXPZw@mail.gmail.com>
 <CA+TgmobjS23xTPs7U7oLgTgaa0ayCAcrU6tm6ent4WWm8+THUA@mail.gmail.com> <CAMbWs4-ov_aOhXTnVCH6SCW5KGpMbxUZtGKj=tqrqYyp8LXGuA@mail.gmail.com>
In-Reply-To: <CAMbWs4-ov_aOhXTnVCH6SCW5KGpMbxUZtGKj=tqrqYyp8LXGuA@mail.gmail.com>
From: jian he <jian.universality@gmail.com>
Date: Wed, 6 Nov 2024 13:04:38 +0800
Message-ID: <CACJufxEwZqZcjay_2wEJfCy8fCxUJK_NOknxEpd4M+udkyp_0A@mail.gmail.com>
Subject: Re: Eager aggregation, take 3
To: Richard Guo <guofenglinux@gmail.com>
Cc: Robert Haas <robertmhaas@gmail.com>, Paul George <p.a.george19@gmail.com>, 
	Andy Fan <zhihuifan1213@163.com>, 
	PostgreSQL-development <pgsql-hackers@postgresql.org>, pgsql-hackers@lists.postgresql.org
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/CACJufxEwZqZcjay_2wEJfCy8fCxUJK_NOknxEpd4M%2Budkyp_0A%40mail.gmail.com>
Precedence: bulk

On Thu, Aug 29, 2024 at 10:26=E2=80=AFAM Richard Guo <guofenglinux@gmail.co=
m> wrote:
>
>
> > 2. I think there might be techniques we could use to limit planning
> > effort at an earlier stage when the approach doesn't appear promising.
> > For example, if the proposed grouping column is already unique, the
> > exercise is pointless (I think). Ideally we'd like to detect that
> > without even creating the grouped_rel. But the proposed grouping
> > column might also be *mostly* unique. For example, consider a table
> > with a million rows and a column 500,000 distinct values. I suspect it
> > will be difficult for partial aggregation to work out to a win in a
> > case like this, because I think that the cost of performing the
> > partial aggregation will not reduce the cost either of the final
> > aggregation or of the intervening join steps by enough to compensate.
> > It would be best to find a way to avoid generating a lot of rels and
> > paths in cases where there's really not much hope of a win.
> >
> > One could, perhaps, imagine going further with this by postponing
> > eager aggregation planning until after regular paths have been built,
> > so that we have good cardinality estimates. Suppose the query joins a
> > single fact table to a series of dimension tables. The final plan thus
> > uses the fact table as the driving table and joins to the dimension
> > tables one by one. Do we really need to consider partial aggregation
> > at every level? Perhaps just where there's been a significant row
> > count reduction since the last time we tried it, but at the next level
> > the row count will increase again?
> >
> > Maybe there are other heuristics we could use in addition or instead.
>
> Yeah, one of my concerns with this work is that it can use
> significantly more CPU time and memory during planning once enabled.
> It would be great if we have some efficient heuristics to limit the
> effort.  I'll work on that next and see what happens.
>

in v13, latest version. we can

    /* ... and initialize these targets */
    if (!init_grouping_targets(root, rel, target, agg_input,
                               &group_clauses, &group_exprs))
        return NULL;
    if (rel->reloptkind =3D=3D RELOPT_BASEREL && group_exprs !=3D NIL)
    {
        foreach_node(Var, var, group_exprs)
        {
            if(var->varno =3D=3D rel->relid &&
                has_unique_index(rel, var->varattno))
                return NULL;
        }
    }

since in init_grouping_targets we already Asserted that group_exprs is
a list of Var.


---------------------------------------------------------------------------=
-----
also in create_rel_agg_info, estimate_num_groups

    result->group_exprs =3D group_exprs;
    result->grouped_rows =3D estimate_num_groups(root, result->group_exprs,
                                               rel->rows, NULL, NULL);
        /*
         * The grouped paths for the given relation are considered useful i=
ff
         * the row reduction ratio is greater than EAGER_AGGREGATE_RATIO.
         */
        agg_info->agg_useful =3D
            (agg_info->grouped_rows <=3D rel->rows * (1 - EAGER_AGGREGATE_R=
ATIO));

If the associated Var in group_exprs is too many, then result->grouped_rows
will be less accurate, therefore agg_info->agg_useful will be less accurate=
.
should we limit the number of Var associated with Var group_exprs.


for example:
SET enable_eager_aggregate TO on;
drop table if exists eager_agg_t1,eager_agg_t2, eager_agg_t3;
CREATE TABLE eager_agg_t1 (a int, b int, c double precision);
CREATE TABLE eager_agg_t2 (a int, b int, c double precision);
INSERT INTO eager_agg_t1 SELECT i % 100, i, i FROM generate_series(1, 5)i;
INSERT INTO eager_agg_t2 SELECT i % 10, i, i FROM generate_series(1, 5)i;
INSERT INTO eager_agg_t2 SELECT i % 10, i, i FROM generate_series(-4, -2)i;
explain(costs off, verbose, settings)
SELECT t1.a, avg(t2.c) FROM eager_agg_t1 t1 JOIN eager_agg_t2 t2 ON
abs(t1.b) =3D abs(t2.b % 10 + t2.a) group by 1;


explain(costs off, verbose, settings)
SELECT t1.a, avg(t2.c) FROM eager_agg_t1 t1 JOIN eager_agg_t2 t2 ON
abs(t1.b) =3D abs(t2.b % 10 + t2.a) group by 1;
                                      QUERY PLAN
---------------------------------------------------------------------------=
-----------
 Finalize HashAggregate
   Output: t1.a, avg(t2.c)
   Group Key: t1.a
   ->  Merge Join
         Output: t1.a, (PARTIAL avg(t2.c))
         Merge Cond: ((abs(((t2.b % 10) + t2.a))) =3D (abs(t1.b)))
         ->  Sort
               Output: t2.b, t2.a, (PARTIAL avg(t2.c)), (abs(((t2.b %
10) + t2.a)))
               Sort Key: (abs(((t2.b % 10) + t2.a)))
               ->  Partial HashAggregate
                     Output: t2.b, t2.a, PARTIAL avg(t2.c), abs(((t2.b
% 10) + t2.a))
                     Group Key: t2.b, t2.a
                     ->  Seq Scan on public.eager_agg_t2 t2
                           Output: t2.a, t2.b, t2.c
         ->  Sort
               Output: t1.a, t1.b, (abs(t1.b))
               Sort Key: (abs(t1.b))
               ->  Seq Scan on public.eager_agg_t1 t1
                     Output: t1.a, t1.b, abs(t1.b)
 Settings: enable_eager_aggregate =3D 'on'
 Query Identifier: -734044107933323262