Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sRuTb-001xBX-3y for pgsql-general@arkaria.postgresql.org; Thu, 11 Jul 2024 14:08:51 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1sRuTY-00EbLm-O8 for pgsql-general@arkaria.postgresql.org; Thu, 11 Jul 2024 14:08:48 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sRuTY-00EbLd-7V for pgsql-general@lists.postgresql.org; Thu, 11 Jul 2024 14:08:48 +0000 Received: from mout.gmx.net ([212.227.15.18]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1sRuTV-001ZNa-3Y for pgsql-general@lists.postgresql.org; Thu, 11 Jul 2024 14:08:47 +0000 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmx.net; s=s31663417; t=1720706923; x=1721311723; i=jimis@gmx.net; bh=y0bA4UMa2WpxKcRTvb13KtPuW1Wv9YlyXMe/Uo6SQIA=; h=X-UI-Sender-Class:Date:From:To:cc:Subject:In-Reply-To:Message-ID: References:MIME-Version:Content-Type:Content-Transfer-Encoding:cc: content-transfer-encoding:content-type:date:from:message-id: mime-version:reply-to:subject:to; b=NrDQpqy+ES664h+mnLlWSfF4iHYDrI27vp0vzigOqhhaPi4W03ke6NqJjOof2Kaj MBd2IxYKPR8NlXYVoOAuV8WUOVE0rwwCN9VloBZYT25tC2wkCZLLT5cD5m0CKDkv2 1Eyf6Aop/MJKKp36amxxuvHAQDc2gmPfD7tFVqm7qqZLohr1Nq+52nLl2YpbeuvGW yzNLnHAcj/OceOupyu9Ugq9t89O2BKca/AglKE+Nalwh9p4OP4WrvWtSf6hvBEDon FCF2k4cKw5O3LgZE1xJJJPRxvpiFEbjXKkwCcQY9+cO33IvwvDEAmuwud/wU6s1oW PUETVODXgnAJ45Fcug== X-UI-Sender-Class: 724b4f7f-cbec-4199-ad4e-598c01a50d3a Received: from [10.9.70.81] ([185.55.106.54]) by mail.gmx.net (mrgmx004 [212.227.17.190]) with ESMTPSA (Nemesis) id 1MY6Cl-1st3AM1rmH-00PVW0; Thu, 11 Jul 2024 16:08:43 +0200 Date: Thu, 11 Jul 2024 16:08:42 +0200 (CEST) From: Dimitrios Apostolou To: David Rowley cc: pgsql-general@lists.postgresql.org Subject: Re: Query on partitioned table needs memory n_partitions * work_mem In-Reply-To: Message-ID: References: <3603c380-d094-136e-e333-610914fb3e80@gmx.net> MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII; format=flowed X-Provags-ID: V03:K1:g/fb/n3JFt/igyIo0AmrIh/ifQtKZUPOWd7jq1iHMIkmNvcPi1d 6zGfmZF2kZQNHviaH+z1sQeIbwRTeDe/zsjsigMqhCPgnSfGopI+HyWpoDnEZuH+7ROWCgk gwA9eErShYldynYftF7UqQR7akf7EBo7Q3K6Sm2wMS6yzytGyGWCXslPYCdI20aWDzTken7 ViB5eAGMNqZbKdnyklcaw== X-Spam-Flag: NO UI-OutboundReport: notjunk:1;M01:P0:pbMYG//Ee0Q=;JS7UEi/czVZ1oSr1TejwoQz5Tf3 SdF61jdcqVsNB4gP8SY7db30IwFJP9x0SIRgJxw5EvzjyCTLHhFLqorhcjZaDMqtlpnuiASPB Cv9sreKRN2korz5vFi6XCcVnVFKtae3wnupe+vkLDuIW/5ZUiI5bdc2LB5o7BzFfcQcxxwc/O /o9yP9k1IPCgi+HujGAZr6fhrFXOGkFW9Xi+JiAfR06aULsC8enW1FannoMK7pStj261sMf+6 oj4m4vebHyKaNhxyJQrQLdism1vxNvP7U565X1deKeiK4nf30yJygxQy7Y8+h7SZ66OL28Rcq Sw+caqy632xxxZA7YxOnBXlZ3mjzI2/eLG+CUX/EJOuPv66L5QpdLNjwM7B/tvSWLk402Ati3 HftwguB7uXT64FUz/wVrAlhl/A3V2q5bAwshmShMl3LAK15B4cGADR/cKPiCtmt9T2SQMUJI1 WdilgHfrGBw/WskEA8mRfUyQ7knOc6EGCef4a8vCPwTlwukLvLqE8AY/SGby9oLu89sPChho+ +TVIH1dUZyfzfXsUY4fnZPHF693t87kHikYNk2w3Ty2mOW/z4/13zAeqSanVNABDGlihQj83V rgGjMWdPboZ+vdz52z0eUxA566YWFFrHSvrnZHiWF89a83bAJpP0IW73oi2me2VsxnQagZQDM 9ScRMljnnU7aDZGvzdzhqDIUpDzNBAgF9jCPP/u7zMD39EO6be9ILEP/IWDtfkLnpUmTgBVZ7 0FwbR6vYS5It4iKB0A9yK13lqyw17kIp5jewH0PqkRDUGFK2xrvICQ3ZUdaa2TIIpsv6YUado /U23nbOlEkghfYG3BQeqEATmhHcUuLYkGbXCmuonjG9bM= Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Thank you for the feedback. So I've managed to reduce the query to a rather simple one: SELECT workitem_n, test_executable_n, bool_or(test_resulttype_n IN (2,3)) FROM test_runs_raw GROUP BY workitem_n, test_executable_n LIMIT 10; The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n). All the columns are various integer types. There is an index on workitem_n. On Thu, 11 Jul 2024, David Rowley wrote: > On Thu, 11 Jul 2024 at 13:19, Dimitrios Apostolou wrote: >> I have a table with 1000 partitions on PostgreSQL 16. >> I notice that a fairly complicated query of the form: >> >> SELECT ... GROUP BY ... LIMIT ... >> >> causes the postgres backend process to grow insanely very fast, and the >> kernel OOM killer to kill it rather soon. >> It seems it tries to allocate at least 1000 * work_mem. > >> -> Append >> -> HashAggregate >> -> Seq Scan >> -> ... 1000 more hashagg+seqscans >> >> >> Is this allocation pattern (workmem * n_partitions) expected under any >> scenario? I can't find it documented. AFAIU the backend should allocat= e >> up to (depth_of_execution_plan * work_mem) (putting aside the >> hash_mem_multiplier and the parallel workers). > > Not depth of execution plan. It relates to the number of nodes in the > plan which allocate work_mem or work_mem * hash_mem_multiplier. > > There is some documentation in [1]: > > "Note that a complex query might perform several sort and hash > operations at the same time" The latest query is not complex at all and I don't see it doing 1000s of operations at the same time. By "number of nodes" would you add up all HashAggregate nodes under an Append node? Here is part of the EXPLAIN ANALYZE output: Limit (cost=3D0.01..28.00 rows=3D10 width=3D7) (actual time=3D43120.466= ..43292.246 rows=3D10 loops=3D1) Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, (bo= ol_or((test_runs_raw.test_resulttype_n =3D ANY ('{2,3}'::integer[])))) Buffers: shared hit=3D96 read=3D883975 I/O Timings: shared read=3D16284.731 -> Append (cost=3D0.01..3416299633.71 rows=3D1220556171 width=3D7) (= actual time=3D42968.794..43139.855 rows=3D10 loops=3D1) Buffers: shared hit=3D96 read=3D883975 I/O Timings: shared read=3D16284.731 -> HashAggregate (cost=3D0.01..0.02 rows=3D1 width=3D7) (actua= l time=3D10.662..10.663 rows=3D0 loops=3D1) Output: test_runs_raw.workitem_n, test_runs_raw.test_execu= table_n, bool_or((test_runs_raw.test_resulttype_n =3D ANY ('{2,3}'::intege= r[]))) Group Key: test_runs_raw.workitem_n, test_runs_raw.test_ex= ecutable_n Batches: 1 Memory Usage: 24kB -> Seq Scan on public.test_runs_raw__part_max20k test_run= s_raw (cost=3D0.00..0.00 rows=3D1 width=3D8) (actual time=3D9.960..9.961 = rows=3D0 loops=3D1) Output: test_runs_raw.workitem_n, test_runs_raw.test= _executable_n, test_runs_raw.test_resulttype_n -> HashAggregate (cost=3D0.01..0.02 rows=3D1 width=3D7) (actua= l time=3D1.913..1.914 rows=3D0 loops=3D1) Output: test_runs_raw_1.workitem_n, test_runs_raw_1.test_e= xecutable_n, bool_or((test_runs_raw_1.test_resulttype_n =3D ANY ('{2,3}'::= integer[]))) Group Key: test_runs_raw_1.workitem_n, test_runs_raw_1.tes= t_executable_n Batches: 1 Memory Usage: 24kB -> Seq Scan on public.test_runs_raw__part_max40k test_run= s_raw_1 (cost=3D0.00..0.00 rows=3D1 width=3D8) (actual time=3D1.031..1.03= 1 rows=3D0 loops=3D1) Output: test_runs_raw_1.workitem_n, test_runs_raw_1.= test_executable_n, test_runs_raw_1.test_resulttype_n [ .... 1000s of similar HashAggregate nodes ... ] Settings: temp_buffers =3D '32MB', work_mem =3D '32MB', effective_io_concu= rrency =3D '300', max_parallel_workers_per_gather =3D '0', enable_hashjoin= =3D 'off', enable_partitionwise_join =3D 'on', enable_partitionwise_aggre= gate =3D 'on', random_page_cost =3D '1.1', effective_cache_size =3D '6GB',= from_collapse_limit =3D '24', join_collapse_limit =3D '24' Planning: Buffers: shared hit=3D377 Planning Time: 1503.800 ms Execution Time: 56515.185 ms (5382 rows) Memory usage on each HashAggregate is logged as 24KB (many HashAggregates are missing that info though), I guess the EXPLAIN output is missing some important part of the allocations here since I'm seeing MBs of allocations per node. I can't help but see this as a bug. I see many issues: * postgres is not reading from partitions in parallel, but one after the other. It shouldn't need all this memory simultaneously. * The memory is unnecessarily allocated early on, before any partitions are actually aggregated. I know this because I/O is slow on this device and the table sizes are huge, it's simply not possible that postgres went through all partitions and blew up the memory. That would take hours, but the OOM happens seconds after I start the query. * The memory is not only allocated by the planner, but it's actually accessed. Libc's malloc() has no problem allocating gigabytes more than what I have available, growing the VSZ memory size without swapping out a single byte. That's because my kernel was set to overcommit. The fac= t that postgres grows its RSS memory too and swaps out other processes an= d finally gets killed, is because it actually writes to that memory! I wonder what it writes, if it hasn't read any of the data on the partitions. :-) For your consideration, here is also a memory dump from when the backend consumes all the memory (work_mem increased to 256MB), and I've disabled overcommit in the kernel: TopMemoryContext: 573520 total in 8 blocks; 214096 free (27 chunks); 35942= 4 used TopTransactionContext: 8192 total in 1 blocks; 7752 free (1 chunks); 44= 0 used Record information cache: 8192 total in 1 blocks; 1616 free (0 chunks);= 6576 used RegexpCacheMemoryContext: 1024 total in 1 blocks; 760 free (0 chunks); = 264 used RegexpMemoryContext: 44032 total in 5 blocks; 36400 free (17 chunks);= 7632 used: ^(test_runs_raw)$ Btree proof lookup cache: 8192 total in 1 blocks; 592 free (0 chunks); = 7600 used Operator lookup cache: 24576 total in 2 blocks; 10792 free (3 chunks); = 13784 used TableSpace cache: 8192 total in 1 blocks; 2128 free (0 chunks); 6064 us= ed Type information cache: 24368 total in 2 blocks; 2648 free (0 chunks); = 21720 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 12= 96 used MessageContext: 16777216 total in 12 blocks; 2924704 free (11 chunks); = 13852512 used partition directory: 8192 total in 1 blocks; 592 free (0 chunks); 760= 0 used Operator class cache: 8192 total in 1 blocks; 592 free (0 chunks); 7600= used smgr relation table: 524288 total in 7 blocks; 231192 free (28 chunks);= 293096 used PgStat Shared Ref Hash: 50224 total in 2 blocks; 688 free (0 chunks); 4= 9536 used PgStat Shared Ref: 90112 total in 14 blocks; 3648 free (14 chunks); 864= 64 used PgStat Pending: 294912 total in 39 blocks; 193864 free (788 chunks); 10= 1048 used TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chunks)= ; 264 used Portal hash: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 use= d PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used:= ExecutorState: 8462432 total in 13 blocks; 3316568 free (9 chunks);= 5145864 used HashAgg meta context: 8192 total in 1 blocks; 7720 free (0 chunks= ); 472 used ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed HashAgg meta context: 100671536 total in 2 blocks; 5528 free (0 c= hunks); 100666008 used ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 = used ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed HashAgg meta context: 201334832 total in 2 blocks; 5528 free (0 c= hunks); 201329304 used ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 = used ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed ExprContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 us= ed [...] 2907 more child contexts containing 13752609344 total in 3787 blo= cks; 25715912 free (0 chunks); 13726893432 used =2D- NOTE: the total number above is the culprit for the OOM. Relcache by OID: 131072 total in 5 blocks; 38744 free (13 chunks); 9232= 8 used CacheMemoryContext: 47147632 total in 599 blocks; 7315600 free (2 chunk= s); 39832032 used relation rules: 16384 total in 5 blocks; 2272 free (1 chunks); 14112 = used: pg_settings index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_= trigger_oid_index index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_= trigger_tgconstraint_index index info: 1024 total in 1 blocks; 40 free (0 chunks); 984 used: pg_= publication_rel_prpubid_index index info: 2048 total in 2 blocks; 728 free (2 chunks); 1320 used: p= g_policy_polrelid_polname_index index info: 1024 total in 1 blocks; 72 free (0 chunks); 952 used: pg_= policy_oid_index relation rules: 16384 total in 5 blocks; 7144 free (1 chunks); 9240 u= sed: pg_roles index info: 2048 total in 2 blocks; 544 free (1 chunks); 1504 used: p= g_inherits_relid_seqno_index index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_= tablespace_spcname_index index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_= constraint_contypid_index index info: 2048 total in 2 blocks; 688 free (1 chunks); 1360 used: p= g_constraint_conname_nsp_index index info: 2048 total in 2 blocks; 912 free (0 chunks); 1136 used: p= g_constraint_conparentid_index index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_= attrdef_oid_index index info: 2048 total in 2 blocks; 688 free (1 chunks); 1360 used: p= g_attrdef_adrelid_adnum_index index info: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used: pg= _am_oid_index index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_= am_name_index index info: 1024 total in 1 blocks; 0 free (0 chunks); 1024 used: pg_= rewrite_oid_index CachedPlanSource: 4096 total in 3 blocks; 688 free (0 chunks); 3408 u= sed: SELECT * FROM pg_catalog.pg_rewrite WHERE ev_class =3D $1 AND rulenam= e =3D $2 CachedPlanQuery: 8192 total in 4 blocks; 4016 free (1 chunks); 4176= used SPI Plan: 1024 total in 1 blocks; 600 free (0 chunks); 424 used index info: 2048 total in 2 blocks; 656 free (2 chunks); 1392 used: p= g_class_tblspc_relfilenode_index index info: 2048 total in 2 blocks; 616 free (1 chunks); 1432 used: p= g_toast_2619_index index info: 2048 total in 2 blocks; 728 free (2 chunks); 1320 used: t= est_runs_raw__part_max20000k_pkey [...] 1017 more child contexts containing 2142696 total in 2014 blocks; 684= 544 free (1928 chunks); 1458152 used WAL record construction: 50200 total in 2 blocks; 6376 free (0 chunks);= 43824 used PrivateRefCount: 8192 total in 1 blocks; 2648 free (0 chunks); 5544 use= d MdSmgr: 131072 total in 5 blocks; 22552 free (1 chunks); 108520 used LOCALLOCK hash: 262144 total in 6 blocks; 59376 free (23 chunks); 20276= 8 used GUCMemoryContext: 24576 total in 2 blocks; 9440 free (4 chunks); 15136 = used GUC hash table: 32768 total in 3 blocks; 12704 free (5 chunks); 20064= used Timezones: 104112 total in 2 blocks; 2648 free (0 chunks); 101464 used ErrorContext: 8192 total in 1 blocks; 7928 free (2 chunks); 264 used Gr= and total: 16246728816 bytes in 6875 blocks; 41833616 free (3045 chunks); = 16204895200 used > > Also, see the warning about execution time memory in [2]. > >> NOTE: after having written the above message, it occured to me that I h= ave >> enable_partitionwise_aggregate=3Don. And Turning it off fixes the issue= and >> makes the query faster too! Expected behaviour or bug? > > enable_partitionwise_aggregate=3Don causes this query to perform an > aggregate per partition. If your GROUP BY clause values are > distributed evenly throughout all partitions then you might find it's > not much slower to execute the query with > enable_partitionwise_aggregate=3Doff. The GROUP BY first-clause values (workitem_n) are not evenly distributed, in fact the partitions are split by that. The rest indeed are. Thank you, I'll turn off this flag globally to avoid such behaviour. > > It's understandable that how PostgreSQL uses work_mem isn't ideal > here, but unfortunately, that's the state of affairs, currently. You > might want to reconsider your enable_partitionwise_aggregate setting > and/or how many partitions you have. Having wasted long time in that, the minimum I can do is submit a documentation patch. At enable_partitionwise_aggregate someting like "WARNING it can increase the memory usage by at least n_partitions * work_mem". How do I move on for such a patch? Pointers would be appreciated. :-) Thank you, Dimitris > [1] https://www.postgresql.org/docs/current/runtime-config-resource.html > [2] https://www.postgresql.org/docs/current/ddl-partitioning.html#DDL-PA= RTITIONING-DECLARATIVE-BEST-PRACTICES >