Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sUni8-005mJM-DC for pgsql-general@arkaria.postgresql.org; Fri, 19 Jul 2024 13:31:48 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sUni6-00Gpb1-7O for pgsql-general@arkaria.postgresql.org; Fri, 19 Jul 2024 13:31:46 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sUni5-00Gpat-Lq for pgsql-general@lists.postgresql.org; Fri, 19 Jul 2024 13:31:46 +0000 Received: from mout.gmx.net ([212.227.17.22]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sUnhy-000NMZ-Q0 for pgsql-general@lists.postgresql.org; Fri, 19 Jul 2024 13:31:44 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmx.net; s=s31663417; t=1721395895; x=1722000695; i=jimis@gmx.net; bh=r4vSCUE4I6ZU/81LVcvIwjf3qA3D7+biNTVbTVLbxGs=; h=X-UI-Sender-Class:Date:From:To:cc:Subject:In-Reply-To:Message-ID: References:MIME-Version:Content-Type:Content-Transfer-Encoding:cc: content-transfer-encoding:content-type:date:from:message-id: mime-version:reply-to:subject:to; b=hbz0IJYzNjm6dtEb9MpAqWjKGRu9EosedZw4z/+InX2lwkJj1y9yBDGT0oERAw9/ faz+JJry0DeX1ybxrCULwBqzeiptEjVkM7X1xJ3fy4qhEZofGTynWVFm582ocwqJl Gc+8szCLSV7OxZnraCz0w0P5/Oc23lUqEjD+4ENiVS0FZM+6dAldmpPcslF1jpMWW mKUlexffYsUBEZuIR3HLAL141tW3S6N07MA0W27XFNiwAKGEweHlTsA6cwHtCbPWr 1FfBrPnngof1hzuzkSd9vuvRInV8E9nfUZIq306te2DAM3zHpgEz2aZJzxYrem7l+ Ojzm5OWS5ikpsrm+5w== X-UI-Sender-Class: 724b4f7f-cbec-4199-ad4e-598c01a50d3a Received: from [10.9.70.81] ([185.55.106.54]) by mail.gmx.net (mrgmx104 [212.227.17.168]) with ESMTPSA (Nemesis) id 1MbzyJ-1ruMMS2Xbb-00pzmC; Fri, 19 Jul 2024 15:31:35 +0200 Date: Fri, 19 Jul 2024 15:31:34 +0200 (CEST) From: Dimitrios Apostolou To: David Rowley cc: Ashutosh Bapat , pgsql-general@lists.postgresql.org Subject: Re: Query on partitioned table needs memory n_partitions * work_mem In-Reply-To: Message-ID: References: <3603c380-d094-136e-e333-610914fb3e80@gmx.net> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed X-Provags-ID: V03:K1:zJpYcoLCmikwgtIGxZJ0UI9tYMgOEmYV56Lfn3o9V4uQ0JcvkJ/ SH9LMw7SwxdjzmKa55xzsr2Jf52BfG0kx0XPx4+czBYidFGLPS5d+Uaiq7He7KF2J6preqp SCvKTfJfYvthWBj4x9LVfk4UIkC8R/7oGY3rmJZGYOnnzWWWp2jaYF5BGC3vsLCyk/WDmzU Kl3eMF9akc7I8OQhT/qDg== X-Spam-Flag: NO UI-OutboundReport: notjunk:1;M01:P0:BQXOnohhoY0=;b61YTTQlU2efgX5R1S24AOErNyL wIvmHKOY0GgH3FWjTtyQULneAFymHyJZNzz6uX4MvgPCXqIooBowtPDBRuoNAZGqOEMq3smgD rdwtyb3Buo9t+1AsorKiOs+C/QwVjoVVHgPMv1Z4TdKYNmHxZyFiE1gdl99dsusKliCtzpzI0 DXJjQWcKRWYWTg0hUQSDm/rK70ZjmQmUVaVBWGyFGwUygwpif+ca8xSYvDMDSO2Jjc50+pwVc KKDc8kWwYi+E+EjI5SlwyX7ThiCCAexHHUU4PwK7cpC3DxTE4zlgKyCz7uD1d+95GsTx1Qo3m ri6re0XB7NxAtVb5D6+jdqUFTS6e1BlJqalv95egidXFVKaFTBxg/hLH9o8VWbrPzd9VraaWI pgDf+TVMXXO2oyZsKy26+62DVE+/LE6JFSHRp4hkBz0ZTQ98h7qoGt/An06WjNK06TC03NJvm XvequBXSkkGDmN0TRZ67gknjd5qM9aL92h+LbU5IrxcrDwjmBFkqjH8QuUQrMroROTwJ1fwEF oadP8TyPMOX9GiL1fzcOHec72i8LkQTZlt8BfS4O52KyaBVOmnV4dwgnI4CVBqVw/KzRfOMxa Jjw3KQ9feR9o39le7gzmHWm2ORDCnJxLNvxJcolK/tRqdUWNlUnDtPouWkZjyoXa7ihfKp2LS w41PM7UGfo2KF+74k5jfKpIH+K1P+RIr28bDOP9Umh6WyssdYLAav/itjcQzvL2TefQ1zuPDE 6nnVHL6kQOXf3pK013SEIXq8MF7miT0T8/4Itlr/m2/JF8/rq1RjXGh2pV0mHS7eqA6pSzyjk rA1EuU063x06CX1oZS7HZHkA== Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Thu, 18 Jul 2024, David Rowley wrote: > On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou wrote: > >> * The memory is unnecessarily allocated early on, before any partitions >> are actually aggregated. I know this because I/O is slow on this dev= ice >> and the table sizes are huge, it's simply not possible that postgres >> went through all partitions and blew up the memory. That would take >> hours, but the OOM happens seconds after I start the query. > > That's interesting. Certainly, there is some memory allocated during > executor startup, but that amount should be fairly small. Are you > able to provide a self-contained test case that shows the memory > blowing up before execution begins? I'm trying hard to create a self-contained way to reproduce the issue. It's not easy, the behaviour is a bit unstable. So far I see high memory usage (8-10GB) but I'm not able to OOM with that much (it actually OOM'ed once, but it was so uncontrollable that I didn't manage to measure and reproduce again later). =2D- I create a table with 2k partitions. CREATE TABLE partitioned_table1( run_n bigint GENERATED ALWAYS AS IDENTITY, workitem_n integer NOT NULL, label_n smallint, result smallint NOT NULL, PRIMARY KEY(workitem_n, run_n) ) PARTITION BY RANGE(workitem_n); DO $$ for i in range(0, 2000): stmt =3D f''' CREATE TABLE part_max{i+1}M PARTITION OF partitioned_table1 FOR VALUES FROM ({i*1000*1000}) TO ({(i+1) * 1000*1000}) ''' plpy.execute(stmt) $$ LANGUAGE plpython3u; =2D- I insert random data. First I insert to all partitions, 1M rows each: DO $$ for i in range(0, 2000): stmt =3D f''' INSERT INTO partitioned_table1(workitem_n, label_n, result) SELECT j-j%4, CAST(random()*1000 AS INTEGER), CAST(random()*3 AS INTEGER) FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1= , 1) as j ''' plpy.info(stmt) plpy.execute(stmt) plpy.commit() $$ LANGUAGE plpython3u; =2D- Disable parallel execution and group aggregate: SET SESSION max_parallel_workers_per_gather TO 0; SET SESSION enable_incremental_sort TO off; SET SESSION work_mem TO '8MB'; =2D- Now the following query should do a HashAggregate: SELECT workitem_n, label_n, bool_or(result IN (2,3)) FROM partitioned_table1 GROUP BY workitem_n, label_n LIMIT 10; =2D- How much was the RSS of the backend while the previous query was =2D- running? Not that high. But if we insert some million rows to the =2D- 1st partition, then it will be much higher. DO $$ for i in range(0,2000): stmt =3D f''' INSERT INTO partitioned_table1(workitem_n, label_n, result) SELECT j%1000000, CAST(random()*20000 AS INTEGER), CAST(random()*4 AS INTEGER) FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1= , 1) as j ''' plpy.info(stmt) plpy.execute(stmt) plpy.commit() $$ LANGUAGE plpython3u; =2D- Now that same previous query consumes between 8GB and 10GB RSS. The =2D- more data I insert (to all partitions?), the more memory the query =2D- takes. Overall: * I don't see the RSS memory usage (8GB) growing proportionally as I expected. If I increase work_mem from 4MB to 8MB then I see double RSS memory usage (from ~4GB to ~8GB). But then if I increase it further the difference is miniscule and no OOM happens. * Instead I notice RSS memory usage growing slowly while I insert more and= more data to the table (especially into the 1st partition I think). * Finally I don't see the memory being free'd by the backend after the SELECT finishes. The system is relieved only when I disconnect psql and the backend dies. Not sure if that's by design or not. > >> Having wasted long time in that, the minimum I can do is submit a >> documentation patch. At enable_partitionwise_aggregate someting like >> "WARNING it can increase the memory usage by at least >> n_partitions * work_mem". How do I move on for such a patch? Pointers >> would be appreciated. :-) > > I think mentioning something about this in enable_partitionwise_join > and enable_partitionwise_aggregate is probably wise. I'll propose a > patch on pgsql-hackers. David and Ashutosh, thank you both for your interest in improving the documentation. Unfortunately I'm not positive any longer on what exactly is going on here, I don't understand how the memory is growing. One thing I can verify is that it's definitely caused by partitioning: I have another similar huge table but unpartitioned, and no such issues show up. Maybe someone with knowledge of the HashAggregate algorithm and partitioning can throw some ideas in. Regards, Dimitris