Date: Tue, 8 Apr 2025 19:58:21 +0200
From: Ancoron Luciferis
To: Joe Conway, pgsql-general@lists.postgresql.org
Subject: Re: Kubernetes, cgroups v2 and OOM killer - how to avoid?

On 2025-04-07 15:21, Joe Conway wrote:
> On 4/5/25 07:53, Ancoron Luciferis wrote:
>> I've been investigating this topic every now and then but to this day
>> have not come to a setup that consistently leads to a PostgreSQL backend
>> process receiving an allocation error instead of being killed externally
>> by the OOM killer.
>>
>> Why this is a problem for me? Because while applications are accessing
>> their DBs (multiple services having their own DBs, some high-frequency),
>> the whole server goes into recovery and kills all backends/connections.
>>
>> While my applications are written to tolerate that, it also means that
>> at that time, esp. for the high-frequency apps, events are piling up,
>> which then leads to a burst as soon as connectivity is restored. This in
>> turn leads to peaks in resource usage in other places (event store,
>> in-memory buffers from apps, ...), which sometimes leads to a series of
>> OOM killer events being triggered, just because some analytics query
>> went overboard.
>>
>> Ideally, I'd find a configuration that only terminates one backend but
>> leaves the others working.
>>
>> I am wondering whether there is any way to receive a real ENOMEM inside
>> a cgroup as soon as I try to allocate beyond its memory.max, instead of
>> relying on the OOM killer.
>>
>> I know the recommendation is to have vm.overcommit_memory set to 2, but
>> then that affects all workloads on the host, including critical infra
>> like the kubelet, CNI, CSI, monitoring, ...
>>
>> I have already gone through and tested the obvious:
>>
>> https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
>
> Importantly vm.overcommit_memory set to 2 only matters when memory is
> constrained at the host level.
>
> As soon as you are running in a cgroup with a hard memory limit,
> vm.overcommit_memory is irrelevant.
>
> You can have terabytes of free memory on the host, but if cgroup memory
> usage exceeds memory.limit (cgv1) or memory.max (cgv2) the OOM killer
> will pick the process in the cgroup with the highest oom_score and whack
> it.
>
> Unfortunately there is no equivalent to vm.overcommit_memory within the
> cgroup.
>
>> And yes, I know that Linux cgroups v2 memory.max is not an actual hard
>> limit:
>>
>> https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files
>
> Read that again -- memory.max *is* a hard limit (same as memory.limit in
> cgv1).
>
>   "memory.max
>
>     A read-write single value file which exists on non-root cgroups. The
>     default is “max”.
>
>     Memory usage hard limit. This is the main mechanism to limit memory
>     usage of a cgroup. If a cgroup’s memory usage reaches this limit and
>     can’t be reduced, the OOM killer is invoked in the cgroup."

Yes, I know it says "hard limit", but then any app still can go beyond
(might just be on me here to assume any "hard limit" to imply an actual
error when trying to go beyond).

The OOM killer then will kick in eventually, but not in any way that any
process inside the cgroup could prevent.
So there is no signal that the app could react to saying "hey, you just
went beyond what you're allowed, please adjust before I kill you".

> If you want a soft limit use memory.high.
>
>   "memory.high
>
>     A read-write single value file which exists on non-root cgroups. The
>     default is “max”.
>
>     Memory usage throttle limit. If a cgroup’s usage goes over the high
>     boundary, the processes of the cgroup are throttled and put under
>     heavy reclaim pressure.
>
>     Going over the high limit never invokes the OOM killer and under
>     extreme conditions the limit may be breached. The high limit should
>     be used in scenarios where an external process monitors the limited
>     cgroup to alleviate heavy reclaim pressure.
>
> You want to be using memory.high rather than memory.max.

Hm, so solely relying on reclaim? I think that'll just get the whole
cgroup into ultra-slow mode and would not actually prevent too much
memory allocation. While this may work out just fine for the PostgreSQL
instance, it'll for sure have effects on the other workloads on the same
node (which I have apparently, more PG instances).

Apparently, I also don't see a way to even try this out in a Kubernetes
environment, since there doesn't seem to be a way to set this field
through some workload manifests field.

> Also, I don't know what kubernetes recommends these days, but it used to
> require you to disable swap. In more recent versions of kubernetes you
> are able to run with swap enabled but I have no idea what the default is
> -- make sure you run with swap enabled.

Yes, this is what I wanna try out next.

> The combination of some swap being available, and the throttling under
> heavy reclaim will likely mitigate your problems.

Thank you for your insights, I have something to think about.

Cheers,

Ancoron
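
For reference, a minimal Python sketch of how the cgroup v2 memory
interface files discussed in this thread could be inspected from inside
a container. It assumes the cgroup is visible at /sys/fs/cgroup, which
may not hold on every setup. The function names are hypothetical;
memory.current, memory.high, memory.max and memory.events are the
kernel's file names, and memory.events carries counters such as high,
max, oom and oom_kill.

```python
from pathlib import Path
from typing import Optional


def _read_limit(path: Path) -> Optional[int]:
    """Return a byte value, or None when the file contains "max" (no limit)."""
    text = path.read_text().strip()
    return None if text == "max" else int(text)


def read_cgroup_memory(cgroup_dir: str = "/sys/fs/cgroup") -> dict:
    """Collect usage, limits and event counters for one cgroup v2 directory.

    The default path is an assumption; pass the real cgroup directory if
    it differs on your system.
    """
    base = Path(cgroup_dir)
    events = {}
    # memory.events is a flat keyed file: one "name count" pair per line.
    for line in (base / "memory.events").read_text().splitlines():
        key, value = line.split()
        events[key] = int(value)
    return {
        "current": int((base / "memory.current").read_text()),
        "high": _read_limit(base / "memory.high"),
        "max": _read_limit(base / "memory.max"),
        "events": events,
    }


def headroom(stats: dict) -> Optional[int]:
    """Bytes left before memory.max is reached, or None if unlimited."""
    if stats["max"] is None:
        return None
    return stats["max"] - stats["current"]
```

Calling headroom(read_cgroup_memory()) periodically from a sidecar, and
watching the high and max counters in the events dictionary, is one
possible way to notice pressure before oom_kill increments. Whether that
is early enough to react depends on the workload.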