From: Joe Conway
To: Ancoron Luciferis, pgsql-general@lists.postgresql.org
Subject: Re: Kubernetes, cgroups v2 and OOM killer - how to avoid?
Date: Mon, 7 Apr 2025 09:21:34 -0400

On 4/5/25 07:53, Ancoron Luciferis wrote:
> I've been investigating this topic every now and then but to this day
> have not come to a setup that consistently leads to a PostgreSQL backend
> process receiving an allocation error instead of being killed externally
> by the OOM killer.
>
> Why this is a problem for me? Because while applications are accessing
> their DBs (multiple services having their own DBs, some high-frequency),
> the whole server goes into recovery and kills all backends/connections.
>
> While my applications are written to tolerate that, it also means that
> at that time, esp. for the high-frequency apps, events are piling up,
> which then leads to a burst as soon as connectivity is restored. This in
> turn leads to peaks in resource usage in other places (event store,
> in-memory buffers from apps, ...), which sometimes leads to a series of
> OOM killer events being triggered, just because some analytics query
> went overboard.
>
> Ideally, I'd find a configuration that only terminates one backend but
> leaves the others working.
>
> I am wondering whether there is any way to receive a real ENOMEM inside
> a cgroup as soon as I try to allocate beyond its memory.max, instead of
> relying on the OOM killer.
>
> I know the recommendation is to have vm.overcommit_memory set to 2, but
> then that affects all workloads on the host, including critical infra
> like the kubelet, CNI, CSI, monitoring, ...
>
> I have already gone through and tested the obvious:
>
> https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

Importantly, setting vm.overcommit_memory to 2 only matters when memory
is constrained at the host level.

As soon as you are running in a cgroup with a hard memory limit,
vm.overcommit_memory is irrelevant. You can have terabytes of free
memory on the host, but if cgroup memory usage exceeds
memory.limit_in_bytes (cgv1) or memory.max (cgv2), the OOM killer will
pick the process in the cgroup with the highest oom_score and whack it.

Unfortunately there is no equivalent to vm.overcommit_memory within the
cgroup.
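
If it helps to see where a given cgroup stands, here is a minimal sketch
(not a definitive tool) that reads the cgv2 interface files and guesses
which process would be picked. The function names are hypothetical, and
it assumes the cgroup directory is readable at the path you pass in
(inside a container that is often /sys/fs/cgroup, but verify that on
your setup). Ranking by /proc/<pid>/oom_score only approximates what the
kernel does at kill time.

```python
from pathlib import Path


def read_limit(path):
    """Return a cgroup v2 limit in bytes, or None when the file holds 'max'."""
    text = Path(path).read_text().strip()
    return None if text == "max" else int(text)


def read_events(path):
    """Parse a flat-keyed file such as memory.events into {name: count}."""
    events = {}
    for line in Path(path).read_text().splitlines():
        key, _, value = line.partition(" ")
        if value:
            events[key] = int(value)
    return events


def likely_victim(pids, proc_root="/proc"):
    """Return (pid, oom_score) for the highest oom_score, or None if unreadable."""
    scores = []
    for pid in pids:
        try:
            score = int(Path(proc_root, str(pid), "oom_score").read_text())
        except (OSError, ValueError):
            continue  # process already exited, or file not readable
        scores.append((score, pid))
    if not scores:
        return None
    score, pid = max(scores)
    return pid, score


def report(cgroup_dir, proc_root="/proc"):
    """Summarize usage vs. memory.max and past OOM kills for one cgroup."""
    cg = Path(cgroup_dir)
    limit = read_limit(cg / "memory.max")
    current = int((cg / "memory.current").read_text())
    pids = [int(p) for p in (cg / "cgroup.procs").read_text().split()]
    return {
        "limit": limit,
        "current": current,
        "headroom": None if limit is None else limit - current,
        "oom_kills": read_events(cg / "memory.events").get("oom_kill", 0),
        "likely_victim": likely_victim(pids, proc_root),
    }
```

Calling report("/sys/fs/cgroup") from inside the pod would return a dict
with the remaining headroom below memory.max and the oom_kill counter
from memory.events, which is a cheap way to confirm that kills are
coming from the cgroup limit rather than from host-level pressure.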
> And yes, I know that Linux cgroups v2 memory.max is not an actual hard
> limit:
>
> https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files

Read that again -- memory.max *is* a hard limit (same as
memory.limit_in_bytes in cgv1).

"memory.max

  A read-write single value file which exists on non-root cgroups. The
  default is “max”.

  Memory usage hard limit. This is the main mechanism to limit memory
  usage of a cgroup. If a cgroup’s memory usage reaches this limit and
  can’t be reduced, the OOM killer is invoked in the cgroup."

If you want a soft limit use memory.high.

"memory.high

  A read-write single value file which exists on non-root cgroups. The
  default is “max”.

  Memory usage throttle limit. If a cgroup’s usage goes over the high
  boundary, the processes of the cgroup are throttled and put under
  heavy reclaim pressure.

  Going over the high limit never invokes the OOM killer and under
  extreme conditions the limit may be breached. The high limit should be
  used in scenarios where an external process monitors the limited
  cgroup to alleviate heavy reclaim pressure."

You want to be using memory.high rather than memory.max.

Also, I don't know what Kubernetes recommends these days, but it used to
require you to disable swap. In more recent versions of Kubernetes you
are able to run with swap enabled, but I have no idea what the default
is -- make sure you run with swap enabled.

The combination of some swap being available and the throttling under
heavy reclaim will likely mitigate your problems.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com