Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1spWNB-00250W-M5 for pgsql-general@arkaria.postgresql.org; Sat, 14 Sep 2024 17:15:51 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1spWNB-00HXA3-9p for pgsql-general@arkaria.postgresql.org; Sat, 14 Sep 2024 17:15:49 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1soyll-00BUaK-5h for pgsql-general@lists.postgresql.org; Fri, 13 Sep 2024 05:22:57 +0000 Received: from mail-vi1eur03on2133.outbound.protection.outlook.com ([40.107.103.133] helo=EUR03-VI1-obe.outbound.protection.outlook.com) by makus.postgresql.org with esmtps (TLS1.2) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1soylg-000ufA-ML for pgsql-general@lists.postgresql.org; Fri, 13 Sep 2024 05:22:55 +0000 ARC-Seal: i=1; a=rsa-sha256; s=arcselector10001; d=microsoft.com; cv=none; b=qs+/Od/Kbz2EuGaH70mUZ2qpIh3n0EexsDJyUs6Qu/k2bfLQPmXC5xerECy9XX68rwfvSsFMUwEd0mhyOqLP4XOX6matrEjkeXCzFiuYrGc1L6nVxNth0fjRGGRi9qhcLGhXFQQE0HdXF4GINR3NFUby+PkSNS7v2kTDQxt8ydh/V59Y3UpQd+Qts985gI8J7FPGHIsUvVNUhLIWCf7w/bdFsaNgabKi9nE9DocNkpBRnZyiBIb3r1UoZ16dfLfbAFPudhRgamNSuoL1B4p6GmUtqBf6CUVky88ibr5Eo0hqIEvgKCmNU59otivGFkiu+Nzue9fBLjvav33PIHzpRA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=microsoft.com; s=arcselector10001; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-AntiSpam-MessageData-ChunkCount:X-MS-Exchange-AntiSpam-MessageData-0:X-MS-Exchange-AntiSpam-MessageData-1; bh=/cSodEsfUXURX0gTvUOP7b+l4WM0j7OBuXgrWWN2grc=; b=goF9MHwRmpcFIUDOrNPKxN25POA1SXUzE4U0kAnaVqMA7ebf9SmaKk4Gm51Y520hFfIdUkKQ8TSh9jb+mlp3pkyRIloSTpIfTgqsKRXx868mkHfllemjRdLhEsIC9KCn8VrEvaCWy4NhQiT1RWymqYph2+MgZx/t/M6US5qJPcYdihi+AnuzDN4P9hncWU9/RRx8JGX2cHOnQBeb5oXY44HSnJIPA5yrmKHAgiJQaxNFKAxiWULIZ/FG53MASGwe9FqkRFNDlukVX3b5RCHfmG8QJvJUQ/mmKXfNp3c6RQwLFtDuemA16S+GumuUkFILs+fmcibaFEJFx47vhgxxEw== ARC-Authentication-Results: i=1; mx.microsoft.com 1; spf=pass smtp.mailfrom=holmsecurity.com; dmarc=pass action=none header.from=holmsecurity.com; dkim=pass header.d=holmsecurity.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=holmsecurity.com; s=selector2; h=From:Date:Subject:Message-ID:Content-Type:MIME-Version:X-MS-Exchange-SenderADCheck; bh=/cSodEsfUXURX0gTvUOP7b+l4WM0j7OBuXgrWWN2grc=; b=Al/mhJK7aCa5euTLVDlNDf11MJMh40y/Dw5b5275GKWFw9axaJWFiHl6RSBZq0mKRH8dhQjWtzaMeVJNbnYqttUux9ObDoHBjjhgVaJXntmUpzk43PrKtfd0KTMAx8ArBG8PICIInsChBxOZxnutNrVQR+yDwpjYxnHd0N0SLqpedYcr2uXAFUVuVzMsnOdLgU9ESYWwhMDv6hleuYDGFwSx2HDLXbqJ09AUMqcXaSnOoHn7PsV+bseJ2yqP3wKxm/t3w/lPajlYv4Y3zFMrQqhOvuq8qhRzQ1rH5F1M4wsm75dhzvSzAwkyMBgMUG2CRU5HGBsCn4F6qf91ruXZIA== Received: from PAWPR07MB10095.eurprd07.prod.outlook.com (2603:10a6:102:382::18) by DU2PR07MB9488.eurprd07.prod.outlook.com (2603:10a6:10:49a::22) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.20.7962.17; Fri, 13 Sep 2024 05:22:48 +0000 Received: from PAWPR07MB10095.eurprd07.prod.outlook.com ([fe80::bbd:dd4c:5f28:90e5]) by PAWPR07MB10095.eurprd07.prod.outlook.com ([fe80::bbd:dd4c:5f28:90e5%5]) with mapi id 15.20.7962.016; Fri, 13 Sep 2024 05:22:48 +0000 From: Thomas Ziegler To: "pgsql-general@lists.postgresql.org" Subject: Failing to allocate memory when I think it shouldn't Thread-Topic: Failing to allocate memory when I think it shouldn't Thread-Index: AQHbBZxsPFOwtusEZUqziRoYaLu7zg== Date: Fri, 13 Sep 2024 05:22:47 +0000 Message-ID: Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: msip_labels: authentication-results: dkim=none (message not signed) header.d=none;dmarc=none action=none header.from=holmsecurity.com; x-ms-publictraffictype: Email x-ms-traffictypediagnostic: PAWPR07MB10095:EE_|DU2PR07MB9488:EE_ x-ms-office365-filtering-correlation-id: 2e8b6ac7-05a6-4d96-982c-08dcd3b41af9 x-ms-exchange-senderadcheck: 1 x-ms-exchange-antispam-relay: 0 x-microsoft-antispam: BCL:0;ARA:13230040|1800799024|376014|366016|38070700018; x-microsoft-antispam-message-info: =?iso-8859-1?Q?rm54gEe8LppXq/bf4Xm7o/kiasf6IJlSC/DBY/StzjmExqnQHDGO2nCbcg?= =?iso-8859-1?Q?l1W60Xw9/x1E2hSMeIeihLt+z+LxyXjXQ/i8GHwlp4/8fsYyS6bmWf/dVX?= =?iso-8859-1?Q?KaLozwIVD1On/1J6U4G7wuM8L9RJea+DJfmnBpOnNCo2iMBpDU/utTV2JK?= =?iso-8859-1?Q?InAmDGWvVv1xtLRaU1edxo0aOQltELEDVPGXGI2OmfaVk+Srdrm47JLzb8?= =?iso-8859-1?Q?axoii7jJLVtNgeTLWZbTbCeKf/13keRKsjVeWuBEM7jmnSLEMpUNCxutrB?= =?iso-8859-1?Q?Pi6Jo83M5XfnK95efuZ0K5VnRsBtf2KstvqAmhY6gjNvDnKuikYYtGKFPU?= =?iso-8859-1?Q?iKyB+hiE4BkuLabWyQYFGETf79OH+jVh/lqhhGD65moCLEsWevapHaGHk6?= =?iso-8859-1?Q?pnQ7ADDa4B4uQlK3SC8a6ZJ2buRnrei2Xlufw9y9cgcifDenfZE9Daizi5?= =?iso-8859-1?Q?iJkWOjp4aIs8V5hjq4YJho4EGc2QyTUzmnoAyRvQA4Jkv7DcgiJVoSicJs?= =?iso-8859-1?Q?QiOH97yHdyY3C1l0B3NlcbKqkcofADoxZBTX5Kh0cCuOiRE8xxdUmpEhRA?= =?iso-8859-1?Q?PXMF6b6wz1652gMeYKFY6nPOr4Ab4JDhhOVwX5oNPT3VaC41bKoztAxq0F?= =?iso-8859-1?Q?PXhPoH8AEfdgwPcWkEDr7jF24Elyy4SSh9DRTE4HxpZ1D2Yta/8qoWFfY8?= =?iso-8859-1?Q?2w3Yh3MOrOIhieZye99T907z8jxsyD/EI8bY6wm0r9cKC4h5XXYQvodUyE?= =?iso-8859-1?Q?n/0s6avMkUfD6gbClWxo9078S/iDv71VG3JzVyYu9iVJ0ZdDWDRUVachf9?= =?iso-8859-1?Q?arYlHfor2QG0KDdoHqr1gSREemkeZcoy9MIfHWmAknjiVkIV1IFj7mrpMa?= =?iso-8859-1?Q?+51TF+qXIwKbesXA1zoB8KMsTIp/Xz8PBgma9q1C+y8gFUSCTRUnYFcxma?= =?iso-8859-1?Q?Q1PJHio6EMNyPlzBz/Gg1Sg/sNBo9DpnrOlCmGle/DICGTHiWlDn+TTwdF?= =?iso-8859-1?Q?52UuXAyqs1KiLxJkaN5JyO31EFMxA/D9d5beGSnD7JRM8SvPjNQ+WDxiKo?= =?iso-8859-1?Q?OFHyQCdAEM7s9APPgXUN7dTyLaHDaEsqj2DFjC+Lj+cLY7XfKBlyJgb8i7?= =?iso-8859-1?Q?sxBZrE1hv48xulgLEj2bqtYy4TGUtWyyKahhUJBfaZX6JlEdGLnow7ra7j?= =?iso-8859-1?Q?WDAufEy1RS2DIdY+ox164KAkt82l2VOZVk/shli7hLVcHRLc/dztFcjE85?= =?iso-8859-1?Q?jIaDNUMvGdrlexXeQ1Zo8EG80K+9oLVnEZQmErMxu7R7Um/lebFU32Md2p?= =?iso-8859-1?Q?wF1MB79Hnt8Ja4QIMgrvqQsC+5Z6Kb+bNX2+jUlw27NVq08=3D?= x-forefront-antispam-report: CIP:255.255.255.255;CTRY:;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:PAWPR07MB10095.eurprd07.prod.outlook.com;PTR:;CAT:NONE;SFS:(13230040)(1800799024)(376014)(366016)(38070700018);DIR:OUT;SFP:1102; x-ms-exchange-antispam-messagedata-chunkcount: 1 x-ms-exchange-antispam-messagedata-0: =?iso-8859-1?Q?AkXga41Ry9xhc9sYacc7Bk31wll+UeChPlEzpXv9buJjqV6KriuYF5ncrk?= =?iso-8859-1?Q?BWqAwojpusWgpy+3vlFZ9Gtk+bwTj5oMfbdT5iHHvyq/TUb0kqNZpv3q0K?= =?iso-8859-1?Q?5DDl0FDzoasjWp9wX3+Sjtlz3Lh/JptSncorvPrvV1P+9aWMzhrje2Bx7O?= =?iso-8859-1?Q?vKpJBBfQ3UTx6/YNTaYtCP03ToJ6l/DXe4wHpKKz7cyxSzBqiJHV+Fbqen?= =?iso-8859-1?Q?h8BS1RbeagkOU0o8KuTEQZTq78D0CccgUE0YD7ZrBu2CSuUZ16gfIOyDuw?= =?iso-8859-1?Q?IMXZdTRpAN0nqot0bvE9U8MaBVtJLRv5tTONneptTYRxC8m4HsONFOaMnW?= =?iso-8859-1?Q?nayw2mdDKkJbuDsTaGtTRo6wlFlhcMKcV6FkFjesV1I0GS+XwlTRB1p81P?= =?iso-8859-1?Q?ClLhhXTnf1Y9hzsRyF0UrNamlk53gzgyk/nqnO6Y9Mz+u4ClYX9KznXcG9?= =?iso-8859-1?Q?H4y8msp8eizfCrJpi9RAtQTlDrw0PhpaU+cXYtIq1EXtxmJwwkcMmgXxmb?= =?iso-8859-1?Q?7rwyAFW0T94PVHcwrXAql5PfISQCRLR6Zwds8SsnPxY5fZKFBwo34eEYyM?= =?iso-8859-1?Q?rZNRqyNXIz5wQC6MCwg/QBPEoDcvc3uXaUfH2nCFoWxase+BIey+J3E5Op?= =?iso-8859-1?Q?dKs1951uyTQdKnh9EGvscoXnxRwOIi3Vc5mPkEA65rjQ5XYmTwthzJ2PVx?= =?iso-8859-1?Q?DaDYnpHi+fIKNtFMJGsRrGP9KnFFxbHD4NTVD0/JviamSrfvP7cg8cmEHj?= =?iso-8859-1?Q?t/NKs+jaUA/V0wQClUGJH+38iR52/OrWS1lIKZYZajmE/FlT020/xLVlCe?= =?iso-8859-1?Q?FH8x7CA06QDWqHTo+mN7fQ9TYkObXx1Fe7emz/LsqqQKQTeOChtobg/Xzx?= =?iso-8859-1?Q?Zz59nZu4cdbs7BCkLnxvECgLP7CoG0aOo9HiT79CbR+5Xc1WGgQtKkUZbz?= =?iso-8859-1?Q?HCMz4k/xfNQwbDM15D2XLwsFepP1iTJwDHKk8oKE1jdkz4b2PWmDw+gNNY?= =?iso-8859-1?Q?f/URvVbTc9Vhb6x2Iiz4wGnPrM1C09PWk/8MVhe7Q9tBdwbS/+2CZYpQ5P?= =?iso-8859-1?Q?u7xXo8mxvHaMv9JISAsKm1jHPyfkkhHQhjTIxDKJaPRujYL4JdOcr+oN7o?= =?iso-8859-1?Q?t6AZE+jckO43cneL796dbYrPWDkxw+anqvhObgLql5Pq0ctgbaEyowwwrW?= =?iso-8859-1?Q?2U8GnkgLHh7191S4WMkpGBWQ9EBrbfwjyJj8fPUHFLvuY4IVqVMWSmh9/m?= =?iso-8859-1?Q?xB/A3TQVS4FHH4G71oi+eGTXFd9Dh5+aGij/FQXdu/JpKXHf7wQOwnINtF?= =?iso-8859-1?Q?ZzRc1e9c/lk58c8tXALgVWTzth4aNR+Xq/LBL8CxsgSfA2BraOyW7VrXc8?= =?iso-8859-1?Q?UpjPnU4Jk6x/VLrI7/4omTd4o2gih1I6B7xhB+Z9eYvrEo8rToK6HBwvys?= =?iso-8859-1?Q?DCONQgZaiCNkQ8cmSwsXlz6tErvd+1wkSCj99JA0uj3RCNheeoAZNWc0Z9?= =?iso-8859-1?Q?ka0qQIF1wdlxFHoXgqrtJx6vbAk+ZMcChlTUk6xcWjEl8C5FiaKimuJzYh?= =?iso-8859-1?Q?2OUEzLbtJnDsD+GDkGjNQXxBSwVr9ftT0t4DeCnS6//lMWKAUtHzuJkDqi?= =?iso-8859-1?Q?n9TU5SarggRcXfCcckQzKtQ+5qSurvlrGyfKnm1Le2+8XGPp/mBFT9IQ?= =?iso-8859-1?Q?=3D=3D?= Content-Type: multipart/alternative; boundary="_000_PAWPR07MB1009548159C0046E01170C6E6ED652PAWPR07MB10095eu_" MIME-Version: 1.0 X-OriginatorOrg: holmsecurity.com X-MS-Exchange-CrossTenant-AuthAs: Internal X-MS-Exchange-CrossTenant-AuthSource: PAWPR07MB10095.eurprd07.prod.outlook.com X-MS-Exchange-CrossTenant-Network-Message-Id: 2e8b6ac7-05a6-4d96-982c-08dcd3b41af9 X-MS-Exchange-CrossTenant-originalarrivaltime: 13 Sep 2024 05:22:47.9562 (UTC) X-MS-Exchange-CrossTenant-fromentityheader: Hosted X-MS-Exchange-CrossTenant-id: f885fa0a-a6c8-41a6-bab6-52b51e6e2fe6 X-MS-Exchange-CrossTenant-mailboxtype: HOSTED X-MS-Exchange-CrossTenant-userprincipalname: qxD3rL/ACAihbmNIvyvsrS09eagR8HihV8B+NmY8aerZGpPGax7/AecHWVa61jV+sHLf/k01T75ufJ7j3yESRQLxTE+Vf4C8Z8n4p6ueSv4= X-MS-Exchange-Transport-CrossTenantHeadersStamped: DU2PR07MB9488 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --_000_PAWPR07MB1009548159C0046E01170C6E6ED652PAWPR07MB10095eu_ Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Hello Community! I have had my database killed by the kernel oom-killer. After that I set tu= rned off memory over-committing and that is where things got weird. I have a PostgreSQL 16.2 running on a Debian Bookworm. The machine has 64GB= of RAM and runs in a VM in an OpenStack environment. I have `shared_buffers` at `16000MB`, `work_mem` at `80MB`, `temp_buffers` = at `8MB`, `max_connections` at `300` and `maintenance_work_mem` at `1GB`. So all in all, I get to roughly 42GB of max memory usage (`16000+(80+8)*300= =3D42400`). But now, I have connections fail because they can't allocate enough memory.= I thought they should start creating temporary files if that is the case (= and I tested this with trivial examples in psql) but the connections from m= y applications sometimes fail. I suspect that the queries are so large, tha= t the execution-plans etc. are too large to fit into the 80MB of RAM. This = is the "simpler" one of my problems, but it would be nice if somebody could= tell me if my suspicion makes sense. The harder issue is, that in multiple instances those memory-issues actuall= y lead to the database "crashing" and having to recover from the redo-log. = To be precise, the master-process doesn't crash and the database automatica= lly starts the recovery (in contrast to the oom-kill, where I had to restar= t the database manually). I can't get to the bottom of why that happens. At the point in time where t= hat happens, there was more than enough usable memory on the system (>17GB)= and I don't understand how it can kill the whole database if a worker-proc= ess breaks because it can't get more RAM. Also why wasn't it able to get mo= re RAM, the kernel should have been more than happy to give it more as ther= e was so much usable RAM available. (By "usable" I mean free + cache - shar= ed) Here are logs of one of those instances: TopMemoryContext: 166384 total in 7 blocks; 33128 free (10 chunks); 133256 = used smgr relation table: 16384 total in 2 blocks; 4640 free (2 chunks); 11744= used PgStat Shared Ref Hash: 7216 total in 2 blocks; 688 free (0 chunks); 6528= used PgStat Shared Ref: 1024 total in 1 blocks; 328 free (0 chunks); 696 used PgStat Pending: 4096 total in 3 blocks; 2168 free (6 chunks); 1928 used TopTransactionContext: 8192 total in 1 blocks; 4536 free (9 chunks); 3656= used TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chunks); = 264 used Portal hash: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used TopPortalContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used Relcache by OID: 16384 total in 2 blocks; 7712 free (2 chunks); 8672 used CacheMemoryContext: 262144 total in 6 blocks; 160 free (2 chunks); 261984= used index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288 used: pg_= rewrite_rel_rulename_index [ ... ] WAL record construction: 50200 total in 2 blocks; 6376 free (0 chunks); 4= 3824 used PrivateRefCount: 8192 total in 1 blocks; 2648 free (0 chunks); 5544 used MdSmgr: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used LOCALLOCK hash: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used GUCMemoryContext: 24576 total in 2 blocks; 5288 free (5 chunks); 19288 us= ed GUC hash table: 32768 total in 3 blocks; 12704 free (5 chunks); 20064 u= sed Timezones: 104112 total in 2 blocks; 2648 free (0 chunks); 101464 used Postmaster: 8192 total in 1 blocks; 6496 free (10 chunks); 1696 used ident parser context: 1024 total in 1 blocks; 760 free (0 chunks); 264 = used hba parser context: 17408 total in 5 blocks; 6944 free (7 chunks); 1046= 4 used ErrorContext: 8192 total in 1 blocks; 7928 free (6 chunks); 264 used Grand total: 889064 bytes in 124 blocks; 180840 free (113 chunks); 708224 u= sed 2024-09-12 05:18:36.064 UTC [3808080] _@_ FATAL: out of memory 2024-09-12 05:18:36.064 UTC [3808080] _@_ DETAIL: Failed on request of siz= e 190 in memory context "CacheMemoryContext". TopMemoryContext: 166384 total in 7 blocks; 33128 free (10 chunks); 133256 = used [ ... ] 2024-09-12 05:18:36.065 UTC [3808075] _@_ FATAL: out of memory 2024-09-12 05:18:36.065 UTC [3808075] _@_ DETAIL: Failed on request of siz= e 190 in memory context "CacheMemoryContext". 2024-09-12 05:18:36.065 UTC [3808076] ERROR: out of memory 2024-09-12 05:18:36.065 UTC [3808076] DETAIL: Failed on request of size 32= 800 in memory context "HashBatchContext". 2024-09-12 05:18:36.065 UTC [3805724] _@_ ERROR: out of memory 2024-09-12 05:18:36.065 UTC [3805724] _@_ DETAIL: Failed on request of siz= e 32800 in memory context "HashBatchContext". 2024-09-12 05:18:36.065 UTC [3805724] _@_ CONTEXT: parallel worker 2024-09-12 05:18:36.065 UTC [3808078] _@_ FATAL: out of memory 2024-09-12 05:18:36.065 UTC [3808078] _@_ DETAIL: Failed on request of siz= e 190 in memory context "CacheMemoryContext". 2024-09-12 05:18:36.065 UTC [3808081] _@_ FATAL: out of memory 2024-09-12 05:18:36.065 UTC [3808081] _@_ DETAIL: Failed on request of siz= e 40992 in memory context "TopMemoryContext". 2024-09-12 05:18:36.066 UTC [1932776] LOG: could not fork new process for = connection: Cannot allocate memory 2024-09-12 05:18:36.066 UTC [3808050] _@_ ERROR: out of memory 2024-09-12 05:18:36.066 UTC [3808050] _@_ DETAIL: Failed while creating me= mory context "HashAgg meta context". 2024-09-12 05:18:36.067 UTC [3488999] _@_ ERROR: out of memory 2024-09-12 05:18:36.067 UTC [3488999] _@_ DETAIL: Failed on request of siz= e 32800 in memory context "HashBatchContext". 2024-09-12 05:18:36.073 UTC [1932776] LOG: background worker "parallel wor= ker" (PID 3808076) exited with exit code 1 terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc 2024-09-12 05:18:36.083 UTC [1932776] LOG: background worker "parallel wor= ker" (PID 3808077) was terminated by signal 6: Aborted 2024-09-12 05:18:36.083 UTC [1932776] LOG: terminating any other active se= rver processes 2024-09-12 05:18:36.829 UTC [1932776] LOG: all server processes terminated= ; reinitializing 2024-09-12 05:18:39.731 UTC [3808861] LOG: database system was interrupted= ; last known up at 2024-09-12 05:08:52 UTC 2024-09-12 05:18:40.277 UTC [3808861] LOG: database system was not properl= y shut down; automatic recovery in progress 2024-09-12 05:18:40.303 UTC [3808861] LOG: redo starts at 3EA1/3AB0CF70 2024-09-12 05:18:50.304 UTC [3808861] LOG: redo in progress, elapsed time:= 10.00 s, current LSN: 3EA1/837DA848 2024-09-12 05:18:55.612 UTC [3808861] LOG: invalid record length at 3EA1/C= 42A4A48: expected at least 24, got 0 2024-09-12 05:18:55.612 UTC [3808861] LOG: redo done at 3EA1/C42A4A20 syst= em usage: CPU: user: 9.04 s, system: 3.49 s, elapsed: 15.31 s 2024-09-12 05:18:55.642 UTC [3808862] LOG: checkpoint starting: end-of-rec= overy immediate wait 2024-09-12 05:19:16.215 UTC [3808862] LOG: checkpoint complete: wrote 2945= 97 buffers (14.4%); 0 WAL file(s) added, 0 removed, 105 recycled; [ ... ] 2024-09-12 05:19:16.340 UTC [1932776] LOG: database system is ready to acc= ept connections I would be eternally grateful if somebody could explain to me what is going= on there. As I wrote earlier, there should have been more than enough RAM = available on the system, so I don't see any reason why the kernel would den= y any memory allocations. The only related item I was able to find is this thread from 2018: https://= www.postgresql.org/message-id/flat/e47eec7f-9dbc-cbe3-dd01-32cbee90f332%40n= eteven.com Cheers, Thomas --_000_PAWPR07MB1009548159C0046E01170C6E6ED652PAWPR07MB10095eu_ Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable
Hello Community!

I have had my database killed by the kernel oom-killer. After that I set tu= rned off memory over-committing and that is where things got weird.

I have a PostgreSQL 16.2 running on a Debian Bookworm. The machine has 64GB= of RAM and runs in a VM in an OpenStack environment.
I have `shared_buffers` at `16000MB`, `work_mem` at `80MB`, `temp_buffers` = at `8MB`, `max_connections` at `300` and `maintenance_work_mem` at `1GB`.
So all in all, I get to roughly 42GB of max memory usage (`16000+(80+8)*300= =3D42400`).

But now, I have connections fail because they can't allocate enough memory.= I thought they should start creating temporary files if that is the case (= and I tested this with trivial examples in psql) but the connections from m= y applications sometimes fail. I suspect that the queries are so large, that the execution-plans etc. are t= oo large to fit into the 80MB of RAM. This is the "simpler" one o= f my problems, but it would be nice if somebody could tell me if my suspici= on makes sense.

The harder issue is, that in multiple instances those memory-issues actuall= y lead to the database "crashing" and having to recover from the = redo-log. To be precise, the master-process doesn't crash and the database = automatically starts the recovery (in contrast to the oom-kill, where I had to restart the database manually).
I can't get to the bottom of why that happens. At the point in time where t= hat happens, there was more than enough usable memory on the system (>17= GB) and I don't understand how it can kill the whole database if a worker-p= rocess breaks because it can't get more RAM. Also why wasn't it able to get more RAM, the kernel should have = been more than happy to give it more as there was so much usable RAM availa= ble. (By "usable" I mean free + cache - shared)

Here are logs of one of those instances:

TopMemoryContext: 166384 total in 7 blocks; 33128 free (10 chunks); 133256 = used
  smgr relation table: 16384 total in 2 blocks; 4640 free (2 chunks); = 11744 used
  PgStat Shared Ref Hash: 7216 total in 2 blocks; 688 free (0 chunks);= 6528 used
  PgStat Shared Ref: 1024 total in 1 blocks; 328 free (0 chunks); 696 = used
  PgStat Pending: 4096 total in 3 blocks; 2168 free (6 chunks); 1928 u= sed
  TopTransactionContext: 8192 total in 1 blocks; 4536 free (9 chunks);= 3656 used
  TransactionAbortContext: 32768 total in 1 blocks; 32504 free (0 chun= ks); 264 used
  Portal hash: 8192 total in 1 blocks; 592 free (0 chunks); 7600 used<= /div>
  TopPortalContext: 8192 total in 1 blocks; 7928 free (0 chunks); 264 = used
  Relcache by OID: 16384 total in 2 blocks; 7712 free (2 chunks); 8672= used
  CacheMemoryContext: 262144 total in 6 blocks; 160 free (2 chunks); 2= 61984 used
    index info: 2048 total in 2 blocks; 760 free (2 chunks); 1288= used: pg_rewrite_rel_rulename_index
[ ... ]
  WAL record construction: 50200 total in 2 blocks; 6376 free (0 chunk= s); 43824 used
  PrivateRefCount: 8192 total in 1 blocks; 2648 free (0 chunks); 5544 = used
  MdSmgr: 8192 total in 1 blocks; 7928 free (0 chunks); 264 used
  LOCALLOCK hash: 8192 total in 1 blocks; 592 free (0 chunks); 7600 us= ed
  GUCMemoryContext: 24576 total in 2 blocks; 5288 free (5 chunks); 192= 88 used
    GUC hash table: 32768 total in 3 blocks; 12704 free (5 chunks= ); 20064 used
  Timezones: 104112 total in 2 blocks; 2648 free (0 chunks); 101464 us= ed
  Postmaster: 8192 total in 1 blocks; 6496 free (10 chunks); 1696 used=
    ident parser context: 1024 total in 1 blocks; 760 free (0 chu= nks); 264 used
    hba parser context: 17408 total in 5 blocks; 6944 free (7 chu= nks); 10464 used
  ErrorContext: 8192 total in 1 blocks; 7928 free (6 chunks); 264 used=
Grand total: 889064 bytes in 124 blocks; 180840 free (113 chunks); 708224 u= sed
2024-09-12 05:18:36.064 UTC [3808080] _@_ FATAL:  out of memory
2024-09-12 05:18:36.064 UTC [3808080] _@_ DETAIL:  Failed on request o= f size 190 in memory context "CacheMemoryContext".
TopMemoryContext: 166384 total in 7 blocks; 33128 free (10 chunks); 133256 = used
[ ... ]
2024-09-12 05:18:36.065 UTC [3808075] _@_ FATAL:  out of memory
2024-09-12 05:18:36.065 UTC [3808075] _@_ DETAIL:  Failed on request o= f size 190 in memory context "CacheMemoryContext".
2024-09-12 05:18:36.065 UTC [3808076] ERROR:  out of memory
2024-09-12 05:18:36.065 UTC [3808076] DETAIL:  Failed on request of si= ze 32800 in memory context "HashBatchContext".
2024-09-12 05:18:36.065 UTC [3805724] _@_ ERROR:  out of memory
2024-09-12 05:18:36.065 UTC [3805724] _@_ DETAIL:  Failed on request o= f size 32800 in memory context "HashBatchContext".
2024-09-12 05:18:36.065 UTC [3805724] _@_ CONTEXT:  parallel worker
2024-09-12 05:18:36.065 UTC [3808078] _@_ FATAL:  out of memory
2024-09-12 05:18:36.065 UTC [3808078] _@_ DETAIL:  Failed on request o= f size 190 in memory context "CacheMemoryContext".
2024-09-12 05:18:36.065 UTC [3808081] _@_ FATAL:  out of memory
2024-09-12 05:18:36.065 UTC [3808081] _@_ DETAIL:  Failed on request o= f size 40992 in memory context "TopMemoryContext".
2024-09-12 05:18:36.066 UTC [1932776] LOG:  could not fork new process= for connection: Cannot allocate memory
2024-09-12 05:18:36.066 UTC [3808050] _@_ ERROR:  out of memory
2024-09-12 05:18:36.066 UTC [3808050] _@_ DETAIL:  Failed while creati= ng memory context "HashAgg meta context".
2024-09-12 05:18:36.067 UTC [3488999] _@_ ERROR:  out of memory
2024-09-12 05:18:36.067 UTC [3488999] _@_ DETAIL:  Failed on request o= f size 32800 in memory context "HashBatchContext".
2024-09-12 05:18:36.073 UTC [1932776] LOG:  background worker "pa= rallel worker" (PID 3808076) exited with exit code 1
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
2024-09-12 05:18:36.083 UTC [1932776] LOG:  background worker "pa= rallel worker" (PID 3808077) was terminated by signal 6: Aborted
2024-09-12 05:18:36.083 UTC [1932776] LOG:  terminating any other acti= ve server processes
2024-09-12 05:18:36.829 UTC [1932776] LOG:  all server processes termi= nated; reinitializing
2024-09-12 05:18:39.731 UTC [3808861] LOG:  database system was interr= upted; last known up at 2024-09-12 05:08:52 UTC
2024-09-12 05:18:40.277 UTC [3808861] LOG:  database system was not pr= operly shut down; automatic recovery in progress
2024-09-12 05:18:40.303 UTC [3808861] LOG:  redo starts at 3EA1/3AB0CF= 70
2024-09-12 05:18:50.304 UTC [3808861] LOG:  redo in progress, elapsed = time: 10.00 s, current LSN: 3EA1/837DA848
2024-09-12 05:18:55.612 UTC [3808861] LOG:  invalid record length at 3= EA1/C42A4A48: expected at least 24, got 0
2024-09-12 05:18:55.612 UTC [3808861] LOG:  redo done at 3EA1/C42A4A20= system usage: CPU: user: 9.04 s, system: 3.49 s, elapsed: 15.31 s
2024-09-12 05:18:55.642 UTC [3808862] LOG:  checkpoint starting: end-o= f-recovery immediate wait
2024-09-12 05:19:16.215 UTC [3808862] LOG:  checkpoint complete: wrote= 294597 buffers (14.4%); 0 WAL file(s) added, 0 removed, 105 recycled; [ ..= . ]
2024-09-12 05:19:16.340 UTC [1932776] LOG:  database system is ready t= o accept connections

I would be eternally grateful if somebody could explain to me what is going= on there. As I wrote earlier, there should have been more than enough RAM = available on the system, so I don't see any reason why the kernel would den= y any memory allocations.
The only related item I was able to find is this thread from 2018: https://= www.postgresql.org/message-id/flat/e47eec7f-9dbc-cbe3-dd01-32cbee90f332%40n= eteven.com

Cheers,
Thomas
--_000_PAWPR07MB1009548159C0046E01170C6E6ED652PAWPR07MB10095eu_--