Date: Mon, 5 Jan 2026 22:35:45 +0100
Subject: Re: failed NUMA pages inquiry status: Operation not permitted
From: Tomas Vondra
To: Christoph Berg
Cc: Jakub Wartak, pgsql-hackers@lists.postgresql.org

On 12/17/25 12:07, Tomas Vondra wrote:
>
>
> On 12/16/25 18:54, Christoph Berg wrote:
>> Re: Tomas Vondra
>>> 1) right after opening a connection, I get this
>>>
>>> test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
>>>  numa_node | count
>>> -----------+-------
>>>          0 |   290
>>>         -2 | 32478
>>
>> Does that mean that the "touch all pages" logic is missing in some
>> code paths?
>>
>
> I did check and AFAICS we are touching the pages in pg_buffercache_numa.
>
> To make it even more confusing, I can no longer reproduce the behavior I
> reported yesterday. It just consistently reports "0" and I have no idea
> why it changed :-( I did restart since yesterday, so maybe that changed
> something.
>

I kept poking at this, and I managed to reproduce it again. The key seems
to be that the system needs to be under pressure, and then it's reliably
reproducible (at least for me).

What I did is I created two instances - one to keep the system busy, one
for experimentation. The "busy" one is set to use shared_buffers=16GB,
and then running read-only pgbench.

pgbench -i -s 4500 test
pgbench -S -j 16 -c 64 -T 600 -P 1 test

The system has 64GB of RAM and 12 cores, so this is a lot of load.

Then, the other instance is set to use shared_buffers=4GB, is started
and immediately queried for NUMA info for buffers (in a loop):

pg_ctl -D data -l pg.log start;

for r in $(seq 1 10); do
  psql -p 5001 test -c 'select numa_node, count(*) from pg_buffercache_numa group by 1';
done;

pg_ctl -D data -l pg.log stop;

And this often fails like this:

----------------------------------------------------------------------
waiting for server to start.... done
server started
 numa_node |  count
-----------+---------
         0 | 1045302
        -2 |    3274
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1025321
        -2 |   23255
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1038596
        -2 |    9980
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048518
        -2 |      58
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048525
        -2 |      51
(2 rows)

waiting for server to shut down.... done
server stopped
----------------------------------------------------------------------

So, it clearly fails quite often. And it can fail even later, after a
run that returned no "-2" buffers.

Clearly, something behaves differently than we thought. I've only seen
this happen on a system with swap - once I removed it, this behavior
disappeared too. So it seems a page can be moved to swap, in which case
we get -2 for a status.

In hindsight, that's not all that surprising. It's interesting it can
happen even with the "touching", but I guess there's a race condition
and the memory can get paged out before we inspect the status. We're
querying batches of pages, which probably makes the window larger.

FWIW I now realized I don't even need two instances. If I try this on
the "busy" instance, I get the -2 values too. Which I find a bit weird.
Because why should those be paged out?

The question is what to do about this. I don't think we can prevent the
-2 values, and erroring out does not seem great either (most systems
have swap, so -2 may not be all that rare).

In fact, pg_shmem_allocations_numa probably should not error out either,
because it's now reliably failing (on the busy instance).

I guess the only solution is to accept -2 as a possible value (unknown
node). But that makes regression testing harder, because it means the
output could change a lot ...


regards

-- 
Tomas Vondra
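
To illustrate the "accept -2 as unknown node" idea and what a regression test could still check, here is a minimal Python sketch. It is not the actual patch and not how pg_buffercache is implemented. The names UNKNOWN_NODE, summarize_numa_status and stable_for_regression are hypothetical, and treating every other negative status as an error is an assumption on my side. The only invariant it relies on is that the per-node counts (including the -2 bucket) add up to the total number of buffers, which holds for every result in the run above.

```python
from collections import Counter

# Assumption: -2 is the per-page status seen when a page is not resident
# (e.g. swapped out), as observed in this thread.
UNKNOWN_NODE = -2


def summarize_numa_status(statuses):
    """Count pages per NUMA node, keeping -2 as an 'unknown node' bucket.

    Any other negative status is still treated as an error (an assumption
    made for this sketch).
    """
    counts = Counter()
    for status in statuses:
        if status >= 0 or status == UNKNOWN_NODE:
            counts[status] += 1
        else:
            raise ValueError(f"unexpected page status: {status}")
    return dict(counts)


def stable_for_regression(counts, total_pages):
    """Checks that do not depend on how many pages happen to be paged out."""
    only_expected_nodes = all(
        node >= 0 or node == UNKNOWN_NODE for node in counts
    )
    return only_expected_nodes and sum(counts.values()) == total_pages


if __name__ == "__main__":
    # Counts taken from the first result in the run above.
    sample = [0] * 1045302 + [UNKNOWN_NODE] * 3274
    counts = summarize_numa_status(sample)
    print(counts)
    print(stable_for_regression(counts, 1048576))
```

A test built this way would compare the total and the set of allowed node values rather than the exact per-node counts, so a varying number of -2 pages would not change the expected output.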