From: Dimitrios Apostolou <[email protected]>
To: [email protected]
Subject: array_agg() does not stop aggregating according to HAVING clause
Date: Sat, 17 Aug 2024 16:37:25 +0200 (CEST)
Message-ID: <[email protected]>

Hello list,

I have a query that goes through *billions* of rows, and for each
infrequent "datatag" (HAVING count(datatag_n) < 10) it collects the IDs
of all matching entries (array_agg(run_n)). Here is the full
query:


INSERT INTO infrequent_datatags_in_this_chunk
   SELECT datatag, datatags.datatag_n, array_agg(run_n)
     FROM runs_raw
     JOIN datatags USING(datatag_n)
     WHERE workitem_n >= 295
       AND workitem_n <  714218
       AND datatag IS NOT NULL
     GROUP BY     datatags.datatag_n
     HAVING  count(datatag_n) < 10
       AND   count(datatag_n) > 0  -- Not really needed because of the JOIN above
;

The runs_raw table has run_n as its primary key, and an index on
workitem_n. The datatags table is a key-value store with datatag_n as
its primary key.

The problem is that this is extremely slow (5 hours), most likely because
it creates tens of gigabytes of temporary files as I see in the logs. I
suspect that it is writing to disk the array_agg(run_n) of all entries and
not only those HAVING count(datatag_n)<10. (I might be wrong though, as
this is only an assumption based on the amount of data written; I don't
know of any way to examine the temporary files written). While this query
is going through billions of rows, the ones with infrequent datatags are
maybe 10M.
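
To illustrate the idea of "count first, collect IDs only for the infrequent
groups", here is an untested sketch of a two-pass version of the query above.
It is not a measured fix. The CTE name "infrequent" is hypothetical, and the
sketch assumes that the datatag column lives in the datatags table and that
runs_raw has a datatag_n column (both implied by the JOIN above).

```sql
-- Untested sketch, not a measured fix.
-- "infrequent" is a hypothetical CTE name; all tables and columns
-- come from the original query.
WITH infrequent AS (
   -- Pass 1: keep only a counter per datatag_n, no arrays are built.
   SELECT datatag_n
     FROM runs_raw
     WHERE workitem_n >= 295
       AND workitem_n <  714218
     GROUP BY datatag_n
     HAVING count(datatag_n) < 10
       AND  count(datatag_n) > 0  -- Needed here: drops the NULL datatag_n group
)
INSERT INTO infrequent_datatags_in_this_chunk
   -- Pass 2: build arrays only for the datatags selected in pass 1.
   SELECT datatags.datatag, datatags.datatag_n, array_agg(runs_raw.run_n)
     FROM infrequent
     JOIN datatags USING (datatag_n)
     JOIN runs_raw USING (datatag_n)
     WHERE runs_raw.workitem_n >= 295
       AND runs_raw.workitem_n <  714218
       AND datatags.datatag IS NOT NULL
     GROUP BY datatags.datatag_n
;
```

The trade-off is that the workitem_n range of runs_raw is read twice, and
whether the second pass is cheap probably depends on how the planner joins
back to runs_raw. I have not verified that this avoids the temporary files.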

How do I tell postgres to stop aggregating when count>=10?

Thank you in advance,
Dimitris