From: Tom Lane <tgl@sss.pgh.pa.us>
To: Dimitrios Apostolou <jimis@gmx.net>
cc: pgsql-general@lists.postgresql.org
Subject: Re: array_agg() does not stop aggregating according to HAVING clause
In-reply-to: <215d6efa-bb4b-a76e-6066-7a83ccdb55e3@gmx.net>
References: <215d6efa-bb4b-a76e-6066-7a83ccdb55e3@gmx.net>
Comments: In-reply-to Dimitrios Apostolou <jimis@gmx.net>
	message dated "Sat, 17 Aug 2024 16:37:25 +0200"
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-ID: <954893.1723907711.1@sss.pgh.pa.us>
Content-Transfer-Encoding: quoted-printable
Date: Sat, 17 Aug 2024 11:15:11 -0400
Message-ID: <954894.1723907711@sss.pgh.pa.us>
Archived-At: <https://www.postgresql.org/message-id/954894.1723907711%40sss.pgh.pa.us>
Precedence: bulk

Dimitrios Apostolou <jimis@gmx.net> writes:
> I have a query that goes through *billions* of rows and for the columns
> that have an infrequent "datatag" (HAVING count(test_datatag_n)<10) it
> selects all the IDs of the entries (array_agg(run_n)). Here is the full
> query:

> INSERT INTO infrequent_datatags_in_this_chunk
>    SELECT datatag, datatags.datatag_n, array_agg(run_n)
>      FROM runs_raw
>      JOIN datatags USING(datatag_n)
>      WHERE workitem_n >=3D 295
>        AND workitem_n <  714218
>        AND datatag IS NOT NULL
>      GROUP BY     datatags.datatag_n
>      HAVING  count(datatag_n) < 10
>        AND   count(datatag_n) > 0  -- Not really needed because of the J=
OIN above
> ;

> The problem is that this is extremely slow (5 hours), most likely becaus=
e
> it creates tens of gigabytes of temporary files as I see in the logs. I
> suspect that it is writing to disk the array_agg(run_n) of all entries a=
nd
> not only those HAVING count(datatag_n)<10.

Well, yes: the two aggregates (array_agg and count) are computed
concurrently in a single Aggregate plan node scanning the output
of the JOIN.  There's no way to apply the HAVING filter until
after the aggregation is finished.

I think this approach is basically forced by the SQL standard's
semantics for grouping/aggregation.

> How do I tell postgres to stop aggregating when count>=3D10?

The only way to do this would be to do two separate passes of
aggregation in separate sub-queries.  Perhaps like

WITH rare AS (
     SELECT datatag_n
     FROM runs_raw
     WHERE workitem_n >=3D 295
       AND workitem_n <  714218
       AND datatag IS NOT NULL
     GROUP BY datatag_n
     HAVING  count(datatag_n) < 10
       AND   count(datatag_n) > 0 =

)
INSERT INTO infrequent_datatags_in_this_chunk
   SELECT datatag, datatags.datatag_n, array_agg(run_n)
     FROM runs_raw
     JOIN datatags USING(datatag_n)
     JOIN rare USING(datatag_n)
     GROUP BY     datatags.datatag_n
;

I can't tell from what you said which level the workitem_n and
datatag conditions go at, so this is just a draft-quality
query.  But I think the structure is basically okay, given
that you said datatag_n is unique in datatags (so there's no
need to join it in the WITH step).

			regards, tom lane