Message-ID: <b93d876b-67c1-4f0e-b0c5-a4296f09f5b5@vondra.me>
Date: Mon, 5 Jan 2026 22:35:45 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: failed NUMA pages inquiry status: Operation not permitted
From: Tomas Vondra <tomas@vondra.me>
To: Christoph Berg <myon@debian.org>
Cc: Jakub Wartak <jakub.wartak@enterprisedb.com>,
 pgsql-hackers@lists.postgresql.org
References: <aPEJ7yH7ZJAF_MYW@msg.df7cb.de> <aPENGAkcrLyB_NLC@msg.df7cb.de>
 <a05ca971-a8ea-422f-85cd-b5edc46c5a9a@vondra.me>
 <aRcmK8Bpi77yKbq1@msg.df7cb.de>
 <54329add-59b6-4c08-96f0-a025a7804174@vondra.me>
 <aTq5Gt_n-oS_QSpL@msg.df7cb.de>
 <4ff9578d-1de2-45c1-98c4-29caf99334ff@vondra.me>
 <aUFbrmKrYPBuTZ1c@msg.df7cb.de> <aUFxRjXb7dYj1e8P@msg.df7cb.de>
 <183fe9ab-6010-4cca-b648-1deca332ce2a@vondra.me>
 <aUGc5qh977Y4r_jP@msg.df7cb.de>
 <f1af27db-4e59-4c6b-9d8c-6f667186563a@vondra.me>
Content-Language: en-US
In-Reply-To: <f1af27db-4e59-4c6b-9d8c-6f667186563a@vondra.me>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Archived-At: 
 <https://www.postgresql.org/message-id/b93d876b-67c1-4f0e-b0c5-a4296f09f5b5%40vondra.me>
Precedence: bulk

On 12/17/25 12:07, Tomas Vondra wrote:
> 
> 
> On 12/16/25 18:54, Christoph Berg wrote:
>> Re: Tomas Vondra
>>> 1) right after opening a connection, I get this
>>>
>>> test=# select numa_node, count(*) from pg_buffercache_numa group by 1;
>>>  numa_node | count
>>> -----------+-------
>>>          0 |   290
>>>         -2 | 32478
>>
>> Does that mean that the "touch all pages" logic is missing in some
>> code paths?
>>
> 
> I did check and AFAICS we are touching the pages in pg_buffercache_numa.
> 
> To make it even more confusing, I can no longer reproduce the behavior I
> reported yesterday. It just consistently reports "0" and I have no idea
> why it changed :-( I did restart since yesterday, so maybe that changed
> something.
> 

I kept poking at this, and I managed to reproduce it again. The key
seems to be that the system needs to be under pressure, and then it's
reliably reproducible (at least for me).

What I did is create two instances - one to keep the system busy, one
for experimentation. The "busy" one is set to use shared_buffers=16GB,
and runs a read-only pgbench.

  pgbench -i -s 4500 test
  pgbench -S -j 16 -c 64 -T 600 -P 1 test

The system has 64GB of RAM and 12 cores, so this is a lot of load.

Then, the other instance is set to use shared_buffers=4GB, is started
and immediately queried for NUMA info for buffers (in a loop):

  pg_ctl -D data -l pg.log start;

  for r in $(seq 1 10); do
    psql -p 5001 test -c 'select numa_node, count(*) from pg_buffercache_numa group by 1';
  done;

  pg_ctl -D data -l pg.log stop;

And this often fails like this:

----------------------------------------------------------------------

waiting for server to start.... done
server started
 numa_node |  count
-----------+---------
         0 | 1045302
        -2 |    3274
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1048576
(1 row)

 numa_node |  count
-----------+---------
         0 | 1025321
        -2 |   23255
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1038596
        -2 |    9980
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048518
        -2 |      58
(2 rows)

 numa_node |  count
-----------+---------
         0 | 1048525
        -2 |      51
(2 rows)

waiting for server to shut down.... done
server stopped

----------------------------------------------------------------------

So, it clearly fails quite often. And it can fail even later, after a
run that returned no "-2" buffers.
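
To put a number on "quite often", the -2 counts can be tallied per run.
This is only a sketch, not what produced the output above: count_unknown
is a hypothetical helper, and it assumes psql is run with -A -t so each
row comes out as "node|count".

```shell
# Sum the buffers reported with numa_node = -2 from "node|count" lines,
# i.e. the format psql prints with -A (unaligned) and -t (tuples only).
# Prints 0 when no -2 row is present.
count_unknown() {
  awk -F'|' '$1 == -2 { n += $2 } END { print n + 0 }'
}

# Hypothetical usage against the experimental instance (port 5001):
#   psql -p 5001 -At test \
#     -c 'select numa_node, count(*) from pg_buffercache_numa group by 1' \
#     | count_unknown
```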

Clearly, something behaves differently than we thought. I've only seen
this happen on a system with swap - once I removed it, this behavior
disappeared too. So it seems a page can be swapped out, in which case
we get -2 (-ENOENT, i.e. page not present) as the status.
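
Since swap seems to be the deciding factor, it may help to check whether
any swap is active before interpreting the results. A minimal sketch,
assuming the Linux /proc/swaps layout (one header line, then one line
per swap area); swap_active is a hypothetical helper name.

```shell
# Succeed (exit 0) if the given /proc/swaps-style file lists at least one
# swap area after the header line. Defaults to /proc/swaps (Linux only).
swap_active() {
  awk 'NR > 1 { n++ } END { exit !(n > 0) }' "${1:-/proc/swaps}"
}

# Hypothetical usage:
#   if swap_active; then
#     echo "swap is active, numa_node = -2 may show up"
#   fi
```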

In hindsight, that's not all that surprising. It's interesting it can
happen even with the "touching", but I guess there's a race condition
and the memory can get paged out before we inspect the status. We're
querying batches of pages, which probably makes the window larger.

FWIW I now realized I don't even need two instances. If I try this on
the "busy" instance, I get the -2 values too, which I find a bit weird:
why should those pages be paged out?

The question is what to do about this. I don't think we can prevent the
-2 values, and erroring out does not seem great either (most systems
have swap, so -2 may not be all that rare).

In fact, pg_shmem_allocations_numa probably should not error out either,
because it's now reliably failing (on the busy instance).

I guess the only solution is to accept -2 as a possible value (unknown
node). But that makes regression testing harder, because it means the
output could change a lot ...
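
One way a test could stay stable is to check only that every reported
node is either valid (>= 0) or -2, and ignore the counts. The sketch
below is just an illustration of that idea, not a proposed patch:
nodes_ok is a hypothetical helper, and it assumes "node|count" input as
printed by psql -A -t.

```shell
# Print "t" if every numa_node in "node|count" input is either a valid
# node (>= 0) or -2 (unknown node), and "f" otherwise. Counts are ignored,
# so the result does not depend on how many pages happen to be paged out.
nodes_ok() {
  awk -F'|' '$1 < 0 && $1 != -2 { bad = 1 } END { print (bad ? "f" : "t") }'
}

# Hypothetical usage:
#   psql -At test \
#     -c 'select numa_node, count(*) from pg_buffercache_numa group by 1' \
#     | nodes_ok
```

The same check could presumably be written directly in SQL in the
regression test; the shell form is used here only to match the commands
above.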


regards

-- 
Tomas Vondra