Re: Query on partitioned table needs memory n_partitions * work

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: Query on partitioned table needs memory n_partitions * work_mem
3+ messages / 2 participants
[nested] [flat]

* Re: Query on partitioned table needs memory n_partitions * work_mem
@ 2024-07-17 22:27  David Rowley <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: David Rowley @ 2024-07-17 22:27 UTC (permalink / raw)
  To: Dimitrios Apostolou <[email protected]>; +Cc: [email protected]

On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou <[email protected]> wrote:
> I can't help but see this as a bug. I see many issues:
>
> * postgres is not reading from partitions in parallel, but one after the
>    other. It shouldn't need all this memory simultaneously.

I don't know for Hash Aggregate, but for nodes like Sort, we still
hold onto the tuplestore after returning the last tuple as a rescan
might want to read those tuples again. There's also a mark/restore
that might want to rewind a little to match up to the next outer tuple
of a Merge Join.

It might be possible to let go of the memory sooner in plans when
returning the final tuple means we'll never need the memory again, but
that would require figuring out all the cases where that could happen
and ensuring we don't ever release memory when it's required again.

> * The memory is unnecessarily allocated early on, before any partitions
>    are actually aggregated. I know this because I/O is slow on this device
>    and the table sizes are huge, it's simply not possible that postgres
>    went through all partitions and blew up the memory. That would take
>    hours, but the OOM happens seconds after I start the query.

That's interesting. Certainly, there is some memory allocated during
executor startup, but that amount should be fairly small.  Are you
able to provide a self-contained test case that shows the memory
blowing up before execution begins?

> Having wasted long time in that, the minimum I can do is submit a
> documentation patch. At enable_partitionwise_aggregate someting like
> "WARNING it can increase the memory usage by at least
> n_partitions * work_mem". How do I move on for such a patch? Pointers
> would be appreciated. :-)

I think mentioning something about this in enable_partitionwise_join
and enable_partitionwise_aggregate is probably wise. I'll propose a
patch on pgsql-hackers.

David

^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Query on partitioned table needs memory n_partitions * work_mem
@ 2024-07-19 13:31  Dimitrios Apostolou <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 3+ messages in thread

From: Dimitrios Apostolou @ 2024-07-19 13:31 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Ashutosh Bapat <[email protected]>; [email protected]

On Thu, 18 Jul 2024, David Rowley wrote:

> On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou <[email protected]> wrote:
>
>> * The memory is unnecessarily allocated early on, before any partitions
>>    are actually aggregated. I know this because I/O is slow on this device
>>    and the table sizes are huge, it's simply not possible that postgres
>>    went through all partitions and blew up the memory. That would take
>>    hours, but the OOM happens seconds after I start the query.
>
> That's interesting. Certainly, there is some memory allocated during
> executor startup, but that amount should be fairly small.  Are you
> able to provide a self-contained test case that shows the memory
> blowing up before execution begins?

I'm trying hard to create a self-contained way to reproduce the issue.
It's not easy, the behaviour is a bit unstable. So far I see high memory
usage (8-10GB) but I'm not able to OOM with that much (it actually OOM'ed
once, but it was so uncontrollable that I didn't manage to measure and
reproduce again later).



-- I create a table with 2k partitions.

CREATE TABLE   partitioned_table1(
     run_n       bigint   GENERATED ALWAYS AS IDENTITY,
     workitem_n  integer  NOT NULL,
     label_n     smallint,
     result      smallint NOT NULL,
     PRIMARY KEY(workitem_n, run_n)
) PARTITION BY RANGE(workitem_n);

DO $$
for i in range(0, 2000):
     stmt = f'''
         CREATE TABLE part_max{i+1}M
             PARTITION OF partitioned_table1
             FOR VALUES FROM ({i*1000*1000}) TO ({(i+1) * 1000*1000})
     '''
     plpy.execute(stmt)
$$ LANGUAGE plpython3u;


-- I insert random data. First I insert to all partitions, 1M rows each:

DO $$
for i in range(0, 2000):
     stmt = f'''
         INSERT INTO partitioned_table1(workitem_n, label_n, result)
             SELECT
                     j-j%4,
                     CAST(random()*1000 AS INTEGER),
                     CAST(random()*3    AS INTEGER)
                 FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1,
1) as j
     '''
     plpy.info(stmt)
     plpy.execute(stmt)
     plpy.commit()
$$ LANGUAGE plpython3u;


-- Disable parallel execution and group aggregate:

SET SESSION max_parallel_workers_per_gather TO 0;
SET SESSION enable_incremental_sort TO off;

SET SESSION work_mem TO '8MB';


-- Now the following query should do a HashAggregate:

SELECT
      workitem_n, label_n, bool_or(result IN (2,3))
FROM
      partitioned_table1
GROUP BY
      workitem_n, label_n
LIMIT 10;


-- How much was the RSS of the backend while the previous query was
-- running? Not that high. But if we insert some million rows to the
-- 1st partition, then it will be much higher.

DO $$
for i in range(0,2000):
     stmt = f'''
         INSERT INTO partitioned_table1(workitem_n, label_n, result)
             SELECT
                     j%1000000,
                     CAST(random()*20000 AS INTEGER),
                     CAST(random()*4    AS INTEGER)
                 FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) as j
     '''
     plpy.info(stmt)
     plpy.execute(stmt)
     plpy.commit()
$$ LANGUAGE plpython3u;


-- Now that same previous query consumes between 8GB and 10GB RSS. The
-- more data I insert (to all partitions?), the more memory the query
-- takes.


Overall:

* I don't see the RSS memory usage (8GB) growing proportionally as I
   expected. If I increase work_mem from 4MB to 8MB then I see double RSS
   memory usage (from ~4GB to ~8GB). But then if I increase it further the
   difference is miniscule and no OOM happens.

* Instead I notice RSS memory usage growing slowly while I insert more and more data
   to the table (especially into the 1st partition I think).

* Finally I don't see the memory being free'd by the backend after the
   SELECT finishes. The system is relieved only when I disconnect psql and
   the backend dies. Not sure if that's by design or not.

>
>> Having wasted long time in that, the minimum I can do is submit a
>> documentation patch. At enable_partitionwise_aggregate someting like
>> "WARNING it can increase the memory usage by at least
>> n_partitions * work_mem". How do I move on for such a patch? Pointers
>> would be appreciated. :-)
>
> I think mentioning something about this in enable_partitionwise_join
> and enable_partitionwise_aggregate is probably wise. I'll propose a
> patch on pgsql-hackers.

David and Ashutosh, thank you both for your interest in improving the
documentation. Unfortunately I'm not positive any longer on what exactly
is going on here, I don't understand how the memory is growing. One thing
I can verify is that it's definitely caused by partitioning: I have
another similar huge table but unpartitioned, and no such issues show up.
Maybe someone with knowledge of the HashAggregate algorithm and
partitioning can throw some ideas in.


Regards,
Dimitris







^ permalink  raw  reply  [nested|flat] 3+ messages in thread

* Re: Query on partitioned table needs memory n_partitions * work_mem
@ 2024-07-19 19:33  Dimitrios Apostolou <[email protected]>
  parent: Dimitrios Apostolou <[email protected]>
  0 siblings, 0 replies; 3+ messages in thread

From: Dimitrios Apostolou @ 2024-07-19 19:33 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Ashutosh Bapat <[email protected]>; [email protected]

Ok I reproduced the OOM, with only 200 partitions as opposed to 2K that I
mentioned before. The keys to reproduce it I believe are:

1. Write millions of rows to *all* partitions
2. Run ANALYSE so that the planner knows about the sizes

Here are the correct steps now. Let me know if you manage to reproduce.
Kind reminder to set vm.overcommit_memory to 2, before you lose your
system like it happened to me. :-)

CREATE TABLE   partitioned_table1(
     run_n       bigint   GENERATED ALWAYS AS IDENTITY,
     workitem_n  integer  NOT NULL,
     label_n     smallint,
     result      smallint NOT NULL,
     PRIMARY KEY(workitem_n, run_n)
) PARTITION BY RANGE(workitem_n);

-- Create 200 partitions

DO $$
for i in range(0, 200):
     start = i     * 10 * 1000 * 1000
     end   = (i+1) * 10 * 1000 * 1000
     stmt = f'''
         CREATE TABLE part{i}
             PARTITION OF partitioned_table1
             FOR VALUES FROM ({start}) TO ({end})
     '''
     plpy.execute(stmt)
$$ LANGUAGE plpython3u;

-- Insert 20M rows per partition. This will take a while but I don't know
-- of a way to speed it up.

DO $$
for i in range(0, 2000):
     stmt = f'''
         INSERT INTO partitioned_table1(workitem_n, label_n, result)
             SELECT
                     j,
                     CAST(random()*1000 AS INTEGER),
                     CAST(random()*4    AS INTEGER)
                 FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) as j
     '''
     plpy.info(stmt)
     plpy.execute(stmt)
     plpy.commit()
$$ LANGUAGE plpython3u;

-- Important to analyse! and set the right settings...

ANALYSE partitioned_table1;
SET SESSION max_parallel_workers_per_gather TO 0;
SET SESSION enable_incremental_sort         TO off;
SET SESSION enable_partitionwise_aggregate  TO on;
SET SESSION enable_partitionwise_join       TO on;

-- And then reproduce the issue:

\timing on
SET SESSION work_mem TO '4MB';

SELECT
       workitem_n, label_n, bool_or(result IN (2,3))
FROM
       partitioned_table1
GROUP BY
       workitem_n, label_n
LIMIT 10;

Repeat that last query while doubling the work_mem every time. You'll see
the process growing to immense dimensions. In my case, my 16GB of RAM were
exhausted when work mem was set to '128MB'.

Remarkably, the fastest execution is the one with the smallest work_mem,
where HashAggregate splits the work into many "partitions". As we grow the
work_mem, it becomes slower and slower (of course while not reaching the
limits of RAM yet; if it goes to swap then all measurements are off).

Let me know if you manage to reproduce the issue!

Dimitris

^ permalink  raw  reply  [nested|flat] 3+ messages in thread

end of thread, other threads:[~2024-07-19 19:33 UTC | newest]

Thread overview: 3+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2024-07-17 22:27 Re: Query on partitioned table needs memory n_partitions * work_mem David Rowley <[email protected]>
2024-07-19 13:31 ` Dimitrios Apostolou <[email protected]>
2024-07-19 19:33   ` Dimitrios Apostolou <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox