Date: Thu, 11 Jul 2024 03:19:16 +0200 (CEST)
From: Dimitrios Apostolou <jimis@gmx.net>
To: pgsql-general@lists.postgresql.org
Subject: Query on partitioned table needs memory   n_partitions * work_mem
Message-ID: <3603c380-d094-136e-e333-610914fb3e80@gmx.net>
MIME-Version: 1.0
Content-Type: text/plain; format=flowed; charset=US-ASCII
UI-OutboundReport: notjunk:1;M01:P0:BTAyOesgNho=;aS6TyKVDLG/XSizqNLAAUTJJgtJ
 R4+JfyAgpuKK+Ro1uNH6C7NzvJujLfLI+6XgnTnXsby/Vv440lKHoTjQ9LfYdWWIwuHU5l6Y5
 1dAPT4nRWfxMCJ8X4SE6seiNIk9QRyaUucVT9RNmP8mzuCGTwVyarxAqqACSIpIREYTgVr9Xq
 NFoxY7zecmNCMh1RcGYHjhigw9fFPKp5dGaklH8yETkW83GzYZnWL0lCwJPoFC3o9F6gW/OdZ
 Rc+2WZTIWwja3QZH1voAdBoer++qRHgXMvtJND/nLK0bab2CjPb7QJbpFkT9QJOQ0wC3/EPi5
 WdVa9IDqbW17b7WR+PhYLPU0THxf3r3ssh8L1QBDn9DBEOseG0PUdnBVZ9xi6mWznyNBsc9bo
 YhlYY/m8NP0CL85mUrNcDG5LRdVI9xaRl2/jh8VcfbSnz7CbQ4hSUSSlRmlWUoj4zShrQmnq9
 0RI3QIiNak7YDhfMpqjXTHYUkiunwkzIBFKRzg6lnZlShFIyoYSeoy3o6z3Nj+ShKh5E8kEAd
 FEo1/GjTwVROUoGY8Lc6TQ7Wlg3H4T0R1ryML/ZyVY8r0QhkQNk7xvdV/xVgq2IZPZU15LuS9
 EuqidZPAbohvOcoDF5niL1vWHDUSPIe19XOmMu4Bo8vHuGuPWMG47oqwPlxE4TZJCxV9drFIm
 WSddh3RlDTL9bISwKj5mO3wV0zhqO+e82uMqM1B+ldbaXYiDBaKMF+bwZoUtLqUA71KwQJHEs
 V4EjWLrZCyd/S9m+edmN5/kZWYy0WuHeBOgTYaiYJ+ltbABI4tL0ozbWcWqC1d77B72ivpd04
 b4ymwrf1TmYQgzSpluFhIzGw==
Content-Transfer-Encoding: quoted-printable
Archived-At: <https://www.postgresql.org/message-id/3603c380-d094-136e-e333-610914fb3e80%40gmx.net>
Precedence: bulk

Hello list,

I have a table with 1000 partitions on PostgreSQL 16.
I notice that a fairly complicated query of the form:

SELECT ... GROUP BY ... LIMIT ...

causes the postgres backend process to grow insanely very fast, and the
kernel OOM killer to kill it rather soon.
It seems it tries to allocate at least 1000 * work_mem.

If I reduce the amount of work_mem, I can control the outcome and avoid
the crash, but this is suboptimal.
I have parallel plans disabled (max_parallel_workers_per_gather=3D0).

To add a bit more info on the execution plan, I believe the relevant part
is the 1000 HashAggregate nodes under Append:

->  Append
   ->  HashAggregate
     ->  Seq Scan
   -> ... 1000 more hashagg+seqscans


Is this allocation pattern (workmem * n_partitions) expected under any
scenario? I can't find it documented.  AFAIU the backend should allocate
up to (depth_of_execution_plan * work_mem) (putting aside the
hash_mem_multiplier and the parallel workers).

NOTE: after having written the above message, it occured to me that I have
enable_partitionwise_aggregate=3Don. And Turning it off fixes the issue an=
d
makes the query faster too! Expected behaviour or bug?

Thank you in advance,
Dimitris

P.S. In the meantime I'm trying to reduce the query and the table schema,
      in order to submit a precise bug report with repro instructions.