Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sUtMP-006UXH-0P for pgsql-general@arkaria.postgresql.org; Fri, 19 Jul 2024 19:33:44 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sUtMN-004Rpl-1w for pgsql-general@arkaria.postgresql.org; Fri, 19 Jul 2024 19:33:43 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sUtMM-004Rpb-N5 for pgsql-general@lists.postgresql.org; Fri, 19 Jul 2024 19:33:43 +0000 Received: from mout.gmx.net ([212.227.15.15]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sUtMK-000Q5P-5a for pgsql-general@lists.postgresql.org; Fri, 19 Jul 2024 19:33:41 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmx.net; s=s31663417; t=1721417617; x=1722022417; i=jimis@gmx.net; bh=P3zZlZpB93Go9tj67vO7ajOnhC+BYbo3wZ+a/J0HSvM=; h=X-UI-Sender-Class:Date:From:To:cc:Subject:In-Reply-To:Message-ID: References:MIME-Version:Content-Type:Content-Transfer-Encoding:cc: content-transfer-encoding:content-type:date:from:message-id: mime-version:reply-to:subject:to; b=pa7jvxPLGWZ+B0RbvHPdGOdU++9c0r75lno125DYoYAYcQeRO03uU59pVAvYEbS6 nZWwYmZjxrtVRakxwpKPjKzYaKJLJqN3abO1HiOqmHB4qE/yNImBnbnUqdfGE0x/I mdIMEEWd0ir8E8+TAsRnKSQdjgkstuq2kbD2r374MYehvMn7gLG3B08rAQYZRUIyP r28Ckz5CNnn6St2TSebIFNg+6Nn+/90djk8rVm8ZTok3wBA96nbQQ0jZBWIUrnYgA HDK/EP5unsIbmOl+xWn4AQ4co+rKdB7Jw6WCLco8MkB0feFLTOUHLoJQZXygkyc3t T1o2+Bz3t1IApd2lNA== X-UI-Sender-Class: 724b4f7f-cbec-4199-ad4e-598c01a50d3a Received: from [10.9.70.81] ([185.55.106.54]) by mail.gmx.net (mrgmx004 [212.227.17.190]) with ESMTPSA (Nemesis) id 1M8QS2-1sQTVR0Qm8-00HChC; Fri, 19 Jul 2024 21:33:37 +0200 Date: Fri, 19 Jul 2024 21:33:35 +0200 (CEST) From: Dimitrios Apostolou To: David Rowley cc: Ashutosh Bapat , pgsql-general@lists.postgresql.org Subject: Re: Query on partitioned table needs memory n_partitions * work_mem In-Reply-To: Message-ID: References: <3603c380-d094-136e-e333-610914fb3e80@gmx.net> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed X-Provags-ID: V03:K1:QXYHpiOpWY5QuZD4cc9qYjZx/geMleA2Y3a1ixYKPhrtue+T8Hw nGJBHVxiER+Zshqgz6zncsQP3UmAWJqZ3eDWJfecUQRv7b28jnfhA/+SRV/JMO3VaH7v1nA 5s95p4eU46ip57bITHmBrL8j24f5XRlckZT+bc+0LMJgutEAl0sGVd/fLedx3QEayMVyeYV AweRlI3sj4FIUVIjbYv1g== X-Spam-Flag: NO UI-OutboundReport: notjunk:1;M01:P0:7SlowexIHGY=;7ks/NcooWuDgPNAs/xyaCy91qb5 amEUiAp9YNNkDURczkcXvMXHVvgN6Wsd0RUjPwfKrPcnDTchagW5kuQ1Rx2FPhcg1BXEh6MYQ fSqfQCUOO/0nmwnB8fTdp5/780ppAWGUvVVWqqZ6SCVAnNoTshTK3hufX4qxWvv+VZpDCOxo3 UMPwu4dku9DeVS/sTjcSmEA7KIu+dZS6HZp4NpQ9Ax4osEe8N+Z14P4qcvwR86In+Jj00LTWi R36YBhlK1LDm9QV8Ejbq82ZZ8Lu1onwLDcSy+qmrCXwUDtoKG3gEcBECHLW/5qoJyZyk9bblX HqZNXWqfS8+jTazt4cAfEyP3gxDz1rGAf0zXGT7mY0fcjA7pk7yj6BgdNp3hcyf8PeJx+6xz8 m1d/yZPNVBdDTvLijyyF1t5yY4AVXTxq76eBhzE2qnSwJlFI5zUew7EPwMfevO37YU94BYX4g CHiQNkXQ6gE4FpzN1lPSlUSeESTavWVBUXfU4XstMbFkPq9niL54uPITSYpmIdIF/Z6XYiFvV O4ZysXVdvjYZdxZKhNRUGCgNPX9AZojZmci1IrQxTP5OgVJApnrUMethCBLewIop8IoAywZEY +A9Ir3QB9N0NY+Qz9K4McW/XdTnMd0aicoj/g77Xz832vBGt7ngQQi+V8+7zj6aSbM0h7xzGk VCkJRAmQQ/a8d+8X4RYvlYw570A0R/vt4xTStvGJKW66JB+B1qO9XP4y7K132SwOl0te/x6OW A8SC6F/4z6SE0KXXFp2oo2AeXd8aRilpy0MiXweO0FGarvlwJf+HntcYBnxvfVhoI8lnbkfYa E2/gH8k3vUliz7p2aMNUglZw== Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Ok I reproduced the OOM, with only 200 partitions as opposed to 2K that I mentioned before. The keys to reproduce it I believe are: 1. Write millions of rows to *all* partitions 2. Run ANALYSE so that the planner knows about the sizes Here are the correct steps now. Let me know if you manage to reproduce. Kind reminder to set vm.overcommit_memory to 2, before you lose your system like it happened to me. :-) CREATE TABLE partitioned_table1( run_n bigint GENERATED ALWAYS AS IDENTITY, workitem_n integer NOT NULL, label_n smallint, result smallint NOT NULL, PRIMARY KEY(workitem_n, run_n) ) PARTITION BY RANGE(workitem_n); =2D- Create 200 partitions DO $$ for i in range(0, 200): start =3D i * 10 * 1000 * 1000 end =3D (i+1) * 10 * 1000 * 1000 stmt =3D f''' CREATE TABLE part{i} PARTITION OF partitioned_table1 FOR VALUES FROM ({start}) TO ({end}) ''' plpy.execute(stmt) $$ LANGUAGE plpython3u; =2D- Insert 20M rows per partition. This will take a while but I don't kno= w =2D- of a way to speed it up. DO $$ for i in range(0, 2000): stmt =3D f''' INSERT INTO partitioned_table1(workitem_n, label_n, result) SELECT j, CAST(random()*1000 AS INTEGER), CAST(random()*4 AS INTEGER) FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1= , 1) as j ''' plpy.info(stmt) plpy.execute(stmt) plpy.commit() $$ LANGUAGE plpython3u; =2D- Important to analyse! and set the right settings... ANALYSE partitioned_table1; SET SESSION max_parallel_workers_per_gather TO 0; SET SESSION enable_incremental_sort TO off; SET SESSION enable_partitionwise_aggregate TO on; SET SESSION enable_partitionwise_join TO on; =2D- And then reproduce the issue: \timing on SET SESSION work_mem TO '4MB'; SELECT workitem_n, label_n, bool_or(result IN (2,3)) FROM partitioned_table1 GROUP BY workitem_n, label_n LIMIT 10; Repeat that last query while doubling the work_mem every time. You'll see the process growing to immense dimensions. In my case, my 16GB of RAM were exhausted when work mem was set to '128MB'. Remarkably, the fastest execution is the one with the smallest work_mem, where HashAggregate splits the work into many "partitions". As we grow the work_mem, it becomes slower and slower (of course while not reaching the limits of RAM yet; if it goes to swap then all measurements are off). Let me know if you manage to reproduce the issue! Dimitris