Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1spZ7o-002RiA-1w for pgsql-general@arkaria.postgresql.org; Sat, 14 Sep 2024 20:12:09 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1spZ7m-003Oij-DZ for pgsql-general@arkaria.postgresql.org; Sat, 14 Sep 2024 20:12:06 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1spZ7m-003OiI-0V for pgsql-general@lists.postgresql.org; Sat, 14 Sep 2024 20:12:06 +0000 Received: from smtp.burggraben.net ([2a01:4f8:140:510a::3]) by magus.postgresql.org with esmtps (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1spZ7d-001CdZ-Nd for pgsql-general@lists.postgresql.org; Sat, 14 Sep 2024 20:12:05 +0000 Received: from elch.exwg.net (elch.exwg.net [IPv6:2001:470:7120:1:21b:21ff:fef0:248b]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (Client CN "elch.exwg.net", Issuer "R10" (verified OK)) by smtp.burggraben.net (Postfix) with ESMTPS id B9279C0030C; Sat, 14 Sep 2024 22:11:57 +0200 (CEST) Received: by elch.exwg.net (Postfix, from userid 1000) id 3A230EAC0F; Sat, 14 Sep 2024 22:11:57 +0200 (CEST) Date: Sat, 14 Sep 2024 22:11:57 +0200 From: Christoph Moench-Tegeder To: Thomas Ziegler Cc: "pgsql-general@lists.postgresql.org" Subject: Re: Failing to allocate memory when I think it shouldn't Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline In-Reply-To: User-Agent: Mutt/2.2.13 (2024-03-09) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, ## Thomas Ziegler (thomas.ziegler@holmsecurity.com): There's a lot of information missing here. Let's start from the top. > I have had my database killed by the kernel oom-killer. After that I > set turned off memory over-committing and that is where things got weird. What exactly did you set? When playing with vm.overcommit, did you understand "Committed Address Space" and the workings of the overcommit accounting? This is the document: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/Documentation/mm/overcommit-accounting.rst Hint: when setting overcommit_memory=2 you might end up with way less available adress space than you thought you would. Also keep an eye on /proc/meminfo - it's sometimes hard to estimate "just off your cuff" what's in memory and how it's mapped. (Also, anything else on that machine which might hog memory?). > I have `shared_buffers` at `16000MB`, `work_mem` at `80MB`, `temp_buffers` > at `8MB`, `max_connections` at `300` and `maintenance_work_mem` at `1GB`. > So all in all, I get to roughly 42GB of max memory usage > (`16000+(80+8)*300=42400`). That work_mem is "per query operation", you can have multiple of those in a single query: https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-WORK-MEM Also, there's hash_mem_multiplier: https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-HASH-MEM-MULTIPLIER Then I've seen that your query uses parallel workers, remember that each worker requests memory. Next, maintenance_work_mem is a per process limit, and depending on what's running at any given time, that can add up. Finally, there's this: > 2024-09-12 05:18:36.073 UTC [1932776] LOG: background worker "parallel worker" (PID 3808076) exited with exit code 1 > terminate called after throwing an instance of 'std::bad_alloc' > what(): std::bad_alloc > 2024-09-12 05:18:36.083 UTC [1932776] LOG: background worker "parallel worker" (PID 3808077) was terminated by signal 6: Aborted That "std::bad_alloc" sounds a lot like C++ and not like the C our database is written in. My first suspicion would be that you're using LLVM-JIT (unless you have other - maybe even your own - C++ extensions in the database?) and that in itself can use a good chunk of memory. And it looks like that exception bubbled up as a signal 6 (SIGABRT) which made the process terminate immediately without any cleanup, and after that the server has no other chance than to crash-restart. I recommend starting with understanding the actual memory limits as set by your configuration (personally I believe that memory overcommit is less evil than some people think). Have a close look at /proc/meminfo and if possible disable JIT and check if it changes anything. Also if possible try starting with only a few active connections and increase load carefully once a steady state (in terms of memory usage) has been reached. Regards, Christoph -- Spare Space