Date: Fri, 19 Jul 2024 15:31:34 +0200 (CEST)
From: Dimitrios Apostolou <jimis@gmx.net>
To: David Rowley <dgrowleyml@gmail.com>
cc: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>, 
    pgsql-general@lists.postgresql.org
Subject: Re: Query on partitioned table needs memory n_partitions *
 work_mem
In-Reply-To: <CAApHDvpG7Hxb81exMZGuwpTQLbeApMNZvffOmfJ3yJnKAYxqYA@mail.gmail.com>
Message-ID: <e5ae2d08-75e1-3726-8905-1328e86453cd@gmx.net>
References: <3603c380-d094-136e-e333-610914fb3e80@gmx.net> <CAApHDvp_u6SbyTKtk9m7ssh8H4dXMoLMZrAq8a1sJk1cj6YxzQ@mail.gmail.com> <d26e67d3-74bc-60aa-bf24-2a8fb83efe9c@gmx.net> <CAApHDvpG7Hxb81exMZGuwpTQLbeApMNZvffOmfJ3yJnKAYxqYA@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII; format=flowed
UI-OutboundReport: notjunk:1;M01:P0:BQXOnohhoY0=;b61YTTQlU2efgX5R1S24AOErNyL
 wIvmHKOY0GgH3FWjTtyQULneAFymHyJZNzz6uX4MvgPCXqIooBowtPDBRuoNAZGqOEMq3smgD
 rdwtyb3Buo9t+1AsorKiOs+C/QwVjoVVHgPMv1Z4TdKYNmHxZyFiE1gdl99dsusKliCtzpzI0
 DXJjQWcKRWYWTg0hUQSDm/rK70ZjmQmUVaVBWGyFGwUygwpif+ca8xSYvDMDSO2Jjc50+pwVc
 KKDc8kWwYi+E+EjI5SlwyX7ThiCCAexHHUU4PwK7cpC3DxTE4zlgKyCz7uD1d+95GsTx1Qo3m
 ri6re0XB7NxAtVb5D6+jdqUFTS6e1BlJqalv95egidXFVKaFTBxg/hLH9o8VWbrPzd9VraaWI
 pgDf+TVMXXO2oyZsKy26+62DVE+/LE6JFSHRp4hkBz0ZTQ98h7qoGt/An06WjNK06TC03NJvm
 XvequBXSkkGDmN0TRZ67gknjd5qM9aL92h+LbU5IrxcrDwjmBFkqjH8QuUQrMroROTwJ1fwEF
 oadP8TyPMOX9GiL1fzcOHec72i8LkQTZlt8BfS4O52KyaBVOmnV4dwgnI4CVBqVw/KzRfOMxa
 Jjw3KQ9feR9o39le7gzmHWm2ORDCnJxLNvxJcolK/tRqdUWNlUnDtPouWkZjyoXa7ihfKp2LS
 w41PM7UGfo2KF+74k5jfKpIH+K1P+RIr28bDOP9Umh6WyssdYLAav/itjcQzvL2TefQ1zuPDE
 6nnVHL6kQOXf3pK013SEIXq8MF7miT0T8/4Itlr/m2/JF8/rq1RjXGh2pV0mHS7eqA6pSzyjk
 rA1EuU063x06CX1oZS7HZHkA==
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/e5ae2d08-75e1-3726-8905-1328e86453cd%40gmx.net>
Precedence: bulk

On Thu, 18 Jul 2024, David Rowley wrote:

> On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou <jimis@gmx.net> wrote:
>
>> * The memory is unnecessarily allocated early on, before any partitions
>>    are actually aggregated. I know this because I/O is slow on this dev=
ice
>>    and the table sizes are huge, it's simply not possible that postgres
>>    went through all partitions and blew up the memory. That would take
>>    hours, but the OOM happens seconds after I start the query.
>
> That's interesting. Certainly, there is some memory allocated during
> executor startup, but that amount should be fairly small.  Are you
> able to provide a self-contained test case that shows the memory
> blowing up before execution begins?

I'm trying hard to create a self-contained way to reproduce the issue.
It's not easy, the behaviour is a bit unstable. So far I see high memory
usage (8-10GB) but I'm not able to OOM with that much (it actually OOM'ed
once, but it was so uncontrollable that I didn't manage to measure and
reproduce again later).


=2D- I create a table with 2k partitions.

CREATE TABLE   partitioned_table1(
     run_n       bigint   GENERATED ALWAYS AS IDENTITY,
     workitem_n  integer  NOT NULL,
     label_n     smallint,
     result      smallint NOT NULL,
     PRIMARY KEY(workitem_n, run_n)
) PARTITION BY RANGE(workitem_n);

DO $$
for i in range(0, 2000):
     stmt =3D f'''
         CREATE TABLE part_max{i+1}M
             PARTITION OF partitioned_table1
             FOR VALUES FROM ({i*1000*1000}) TO ({(i+1) * 1000*1000})
     '''
     plpy.execute(stmt)
$$ LANGUAGE plpython3u;


=2D- I insert random data. First I insert to all partitions, 1M rows each:

DO $$
for i in range(0, 2000):
     stmt =3D f'''
         INSERT INTO partitioned_table1(workitem_n, label_n, result)
             SELECT
                     j-j%4,
                     CAST(random()*1000 AS INTEGER),
                     CAST(random()*3    AS INTEGER)
                 FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1=
,
1) as j
     '''
     plpy.info(stmt)
     plpy.execute(stmt)
     plpy.commit()
$$ LANGUAGE plpython3u;


=2D- Disable parallel execution and group aggregate:

SET SESSION max_parallel_workers_per_gather TO 0;
SET SESSION enable_incremental_sort TO off;

SET SESSION work_mem TO '8MB';


=2D- Now the following query should do a HashAggregate:

SELECT
      workitem_n, label_n, bool_or(result IN (2,3))
FROM
      partitioned_table1
GROUP BY
      workitem_n, label_n
LIMIT 10;


=2D- How much was the RSS of the backend while the previous query was
=2D- running? Not that high. But if we insert some million rows to the
=2D- 1st partition, then it will be much higher.

DO $$
for i in range(0,2000):
     stmt =3D f'''
         INSERT INTO partitioned_table1(workitem_n, label_n, result)
             SELECT
                     j%1000000,
                     CAST(random()*20000 AS INTEGER),
                     CAST(random()*4    AS INTEGER)
                 FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1=
, 1) as j
     '''
     plpy.info(stmt)
     plpy.execute(stmt)
     plpy.commit()
$$ LANGUAGE plpython3u;


=2D- Now that same previous query consumes between 8GB and 10GB RSS. The
=2D- more data I insert (to all partitions?), the more memory the query
=2D- takes.


Overall:

* I don't see the RSS memory usage (8GB) growing proportionally as I
   expected. If I increase work_mem from 4MB to 8MB then I see double RSS
   memory usage (from ~4GB to ~8GB). But then if I increase it further the
   difference is miniscule and no OOM happens.

* Instead I notice RSS memory usage growing slowly while I insert more and=
 more data
   to the table (especially into the 1st partition I think).

* Finally I don't see the memory being free'd by the backend after the
   SELECT finishes. The system is relieved only when I disconnect psql and
   the backend dies. Not sure if that's by design or not.

>
>> Having wasted long time in that, the minimum I can do is submit a
>> documentation patch. At enable_partitionwise_aggregate someting like
>> "WARNING it can increase the memory usage by at least
>> n_partitions * work_mem". How do I move on for such a patch? Pointers
>> would be appreciated. :-)
>
> I think mentioning something about this in enable_partitionwise_join
> and enable_partitionwise_aggregate is probably wise. I'll propose a
> patch on pgsql-hackers.

David and Ashutosh, thank you both for your interest in improving the
documentation. Unfortunately I'm not positive any longer on what exactly
is going on here, I don't understand how the memory is growing. One thing
I can verify is that it's definitely caused by partitioning: I have
another similar huge table but unpartitioned, and no such issues show up.
Maybe someone with knowledge of the HashAggregate algorithm and
partitioning can throw some ideas in.


Regards,
Dimitris