Date: Tue, 16 Dec 2025 14:16:30 +0100
From: Christoph Berg
To: Tomas Vondra
Cc: Jakub Wartak, pgsql-hackers@lists.postgresql.org
Subject: Re: failed NUMA pages inquiry status: Operation not permitted
References: <7bbc582b-cc70-4a6f-bbf2-b5fd9b13a867@vondra.me> <54329add-59b6-4c08-96f0-a025a7804174@vondra.me> <4ff9578d-1de2-45c1-98c4-29caf99334ff@vondra.me>
In-Reply-To: <4ff9578d-1de2-45c1-98c4-29caf99334ff@vondra.me>

Re: Tomas Vondra
> Hmmm, strange. -2 is ENOENT, which should mean this:
>
>   -ENOENT
>       The page is not present.
>
> But what does "not present" mean in this context?
> And why would that be
> only intermittent? Presumably this is still running in Docker, so maybe
> it's another weird consequence of that?

I've managed to reproduce it once, running this loop on 18-as-of-today.
It errored out after a few 100 iterations:

while psql -c 'SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa'; do :; done

2025-12-16 11:49:35.982 UTC [621807] myon@postgres ERROR:  invalid NUMA node id outside of allowed range [0, 0]: -2
2025-12-16 11:49:35.982 UTC [621807] myon@postgres STATEMENT:  SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa

That was on the apt.pg.o amd64 build machine while a few things were
just building. Maybe ENOENT "The page is not present" means something
was just swapped out because the machine was under heavy load.

I tried reading the kernel source and it sounds related:

 * If the source virtual memory range has any unmapped holes, or if
 * the destination virtual memory range is not a whole unmapped hole,
 * move_pages() will fail respectively with -ENOENT or -EEXIST. This
 * provides a very strict behavior to avoid any chance of memory
 * corruption going unnoticed if there are userland race conditions.
 * Only one thread should resolve the userland page fault at any given
 * time for any given faulting address. This means that if two threads
 * try to both call move_pages() on the same destination address at the
 * same time, the second thread will get an explicit error from this
 * command.
...
 * The UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES flag can be specified to
 * prevent -ENOENT errors to materialize if there are holes in the
 * source virtual range that is being remapped. The holes will be
 * accounted as successfully remapped in the retval of the
 * command. This is mostly useful to remap hugepage naturally aligned
 * virtual regions without knowing if there are transparent hugepage
 * in the regions or not, but preventing the risk of having to split
 * the hugepmd during the remap.
...
ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
                   unsigned long src_start, unsigned long len, __u64 mode)
...
                if (!(mode & UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES)) {
                        err = -ENOENT;
                        break;

What I don't understand yet is why this move_pages() signature does not
match the one from libnuma and move_pages(2) (note "mode" vs "flags"):

int numa_move_pages(int pid, unsigned long count,
        void **pages, const int *nodes, int *status, int flags)
{
        return move_pages(pid, count, pages, nodes, status, flags);
}

I guess the answer is somewhere in that gap.

> ERROR: invalid NUMA node id outside of allowed range [0, 0]: -2

Maybe instead of putting sanity checks on what the kernel is returning,
we should just pass that through to the user? (Or perhaps transform
negative numbers to NULL?)

Christoph
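
To make the "negative numbers to NULL" idea a bit more concrete, here is a
minimal sketch of what such a conversion could look like. It is not taken
from the PostgreSQL source: NullableNode, status_to_node() and
statuses_to_nodes() are hypothetical names made up for illustration. The
only thing it relies on is the move_pages(2) convention that, in query
mode, each status[] entry is either a node id or a negated errno value
such as -ENOENT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type: a NUMA node id that may be SQL NULL. */
typedef struct NullableNode
{
	bool		isnull;		/* true if the kernel reported no node */
	int			node;		/* meaningful only when isnull is false */
} NullableNode;

/*
 * Convert one status[] entry, as filled in by move_pages(2) in query
 * mode, into a nullable node id.  Negative entries are negated errno
 * values (-2 is -ENOENT), so they map to NULL instead of raising an
 * error.
 */
static NullableNode
status_to_node(int status)
{
	NullableNode r;

	r.isnull = (status < 0);
	r.node = r.isnull ? -1 : status;
	return r;
}

/* Convert a whole status array; returns the number of NULL entries. */
static size_t
statuses_to_nodes(const int *status, size_t count, NullableNode *out)
{
	size_t		nulls = 0;

	assert(status != NULL && out != NULL);

	for (size_t i = 0; i < count; i++)
	{
		out[i] = status_to_node(status[i]);
		if (out[i].isnull)
			nulls++;
	}
	return nulls;
}
```

With something like this, a page that comes back as -ENOENT would show up
as a NULL node in the view rather than aborting the whole query. Whether
hiding the errno value from the user is actually desirable is a separate
question.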