Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sfL9I-0053lb-L4 for pgsql-general@arkaria.postgresql.org; Sat, 17 Aug 2024 15:15:24 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sfL9G-004Db3-Ks for pgsql-general@arkaria.postgresql.org; Sat, 17 Aug 2024 15:15:23 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sfL9G-004Dav-AE for pgsql-general@lists.postgresql.org; Sat, 17 Aug 2024 15:15:23 +0000 Received: from sss.pgh.pa.us ([68.162.161.243]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sfL9A-0009GB-8t for pgsql-general@lists.postgresql.org; Sat, 17 Aug 2024 15:15:22 +0000 Received: from sss1.sss.pgh.pa.us (localhost [127.0.0.1]) by sss.pgh.pa.us (8.15.2/8.15.2) with ESMTP id 47HFFB3Z954895; Sat, 17 Aug 2024 11:15:12 -0400 From: Tom Lane To: Dimitrios Apostolou cc: pgsql-general@lists.postgresql.org Subject: Re: array_agg() does not stop aggregating according to HAVING clause In-reply-to: <215d6efa-bb4b-a76e-6066-7a83ccdb55e3@gmx.net> References: <215d6efa-bb4b-a76e-6066-7a83ccdb55e3@gmx.net> Comments: In-reply-to Dimitrios Apostolou message dated "Sat, 17 Aug 2024 16:37:25 +0200" MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-ID: <954893.1723907711.1@sss.pgh.pa.us> Content-Transfer-Encoding: quoted-printable Date: Sat, 17 Aug 2024 11:15:11 -0400 Message-ID: <954894.1723907711@sss.pgh.pa.us> List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Dimitrios Apostolou writes: > I have a query that goes through *billions* of rows and for the columns > that have an infrequent "datatag" (HAVING count(test_datatag_n)<10) it > selects all the IDs of the entries (array_agg(run_n)). Here is the full > query: > INSERT INTO infrequent_datatags_in_this_chunk > SELECT datatag, datatags.datatag_n, array_agg(run_n) > FROM runs_raw > JOIN datatags USING(datatag_n) > WHERE workitem_n >=3D 295 > AND workitem_n < 714218 > AND datatag IS NOT NULL > GROUP BY datatags.datatag_n > HAVING count(datatag_n) < 10 > AND count(datatag_n) > 0 -- Not really needed because of the J= OIN above > ; > The problem is that this is extremely slow (5 hours), most likely becaus= e > it creates tens of gigabytes of temporary files as I see in the logs. I > suspect that it is writing to disk the array_agg(run_n) of all entries a= nd > not only those HAVING count(datatag_n)<10. Well, yes: the two aggregates (array_agg and count) are computed concurrently in a single Aggregate plan node scanning the output of the JOIN. There's no way to apply the HAVING filter until after the aggregation is finished. I think this approach is basically forced by the SQL standard's semantics for grouping/aggregation. > How do I tell postgres to stop aggregating when count>=3D10? The only way to do this would be to do two separate passes of aggregation in separate sub-queries. Perhaps like WITH rare AS ( SELECT datatag_n FROM runs_raw WHERE workitem_n >=3D 295 AND workitem_n < 714218 AND datatag IS NOT NULL GROUP BY datatag_n HAVING count(datatag_n) < 10 AND count(datatag_n) > 0 = ) INSERT INTO infrequent_datatags_in_this_chunk SELECT datatag, datatags.datatag_n, array_agg(run_n) FROM runs_raw JOIN datatags USING(datatag_n) JOIN rare USING(datatag_n) GROUP BY datatags.datatag_n ; I can't tell from what you said which level the workitem_n and datatag conditions go at, so this is just a draft-quality query. But I think the structure is basically okay, given that you said datatag_n is unique in datatags (so there's no need to join it in the WITH step). regards, tom lane