Message-ID: <a3a4fe3d-1a80-4e03-aa8e-150ee15f6c35@vondra.me>
Date: Tue, 24 Jun 2025 11:20:15 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: pgsql: Introduce pg_shmem_allocations_numa view
To: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>
Cc: Christoph Berg <myon@debian.org>, Andres Freund <andres@anarazel.de>,
 Tomas Vondra <tomas.vondra@postgresql.org>,
 pgsql-hackers@lists.postgresql.org
References: <kl4zd72eeaex7zcicpuvpsuslrs5nfvmab7xzt4jnvcjvd6mxw@tcp64c55qkpj>
 <aFmxxHVS9XwbQ_em@msg.df7cb.de>
 <6c9f9f7e-947b-4fc3-bdb6-b0696d7492e5@vondra.me>
 <aFm5tmSFwhGX7mA7@msg.df7cb.de>
 <ce305c79-0a68-46ce-b563-ab9f87bb5f20@vondra.me>
 <aFm-ZfL-vz9I2Zmc@msg.df7cb.de>
 <0643ae61-cf9d-482c-9b2c-fb861b24fd22@vondra.me>
 <aFnGQ2lfb8cfukim@msg.df7cb.de>
 <cd16c4a3-2860-46de-a74e-5532d1b2ee39@vondra.me>
 <6342f601-77de-4ee0-8c2a-3deb50ceac5b@vondra.me>
 <aFpg1de9ZfS1QgUt@ip-10-97-1-34.eu-west-3.compute.internal>
Content-Language: en-US
From: Tomas Vondra <tomas@vondra.me>
In-Reply-To: <aFpg1de9ZfS1QgUt@ip-10-97-1-34.eu-west-3.compute.internal>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Archived-At: 
 <https://www.postgresql.org/message-id/a3a4fe3d-1a80-4e03-aa8e-150ee15f6c35%40vondra.me>
Precedence: bulk

On 6/24/25 10:24, Bertrand Drouvot wrote:
> Hi,
> 
> On Tue, Jun 24, 2025 at 03:43:19AM +0200, Tomas Vondra wrote:
>> On 6/23/25 23:47, Tomas Vondra wrote:
>>> ...
>>>
>>> Or maybe the 32-bit chroot on 64-bit host matters and confuses some
>>> calculation.
>>>
>>
>> I think it's likely something like this.
> 
> I think the same.
> 
>> I noticed that if I modify
>> pg_buffercache_numa_pages() to query the addresses one by one, it works.
>> And when I increase the number, it stops working somewhere between 16k
>> and 17k items.
> 
> Yeah, same for me with pg_get_shmem_allocations_numa(). It works if
> pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more
> than 16 pages.
> 
> It's also confirmed by test_chunk_size.c attached:
> 
> $ gcc-11 -m32 -o test_chunk_size test_chunk_size.c
> $ ./test_chunk_size
>  1 pages: SUCCESS (0 errors)
>  2 pages: SUCCESS (0 errors)
>  3 pages: SUCCESS (0 errors)
>  4 pages: SUCCESS (0 errors)
>  5 pages: SUCCESS (0 errors)
>  6 pages: SUCCESS (0 errors)
>  7 pages: SUCCESS (0 errors)
>  8 pages: SUCCESS (0 errors)
>  9 pages: SUCCESS (0 errors)
> 10 pages: SUCCESS (0 errors)
> 11 pages: SUCCESS (0 errors)
> 12 pages: SUCCESS (0 errors)
> 13 pages: SUCCESS (0 errors)
> 14 pages: SUCCESS (0 errors)
> 15 pages: SUCCESS (0 errors)
> 16 pages: SUCCESS (0 errors)
> 17 pages: 1 errors
> Threshold: 17 pages
> 
> No error if -m32 is not used.
> 
>> It may be a coincidence, but I suspect it's related to the sizeof(void
>> *) being 8 in the kernel, but only 4 in the chroot. So the userspace
>> passes an array of 4-byte items, but kernel interprets that as 8-byte
>> items. That is, we call
>>
>> long move_pages(int pid, unsigned long count, void *pages[.count], const
>> int nodes[.count], int status[.count], int flags);
>>
>> Which (I assume) just passes the parameters to kernel. And it'll
>> interpret them per kernel pointer size.
>>
> 
> I also suspect something in this area...
> 
>> If this is what's happening, I'm not sure what to do about it ...
> 
> We could work by chunks (16?) on 32 bits but would probably produce performance
> degradation (we mention it in the doc though). Also would always 16 be a correct
> chunk size? 

I don't see how this would solve anything.

AFAICS the problem is that the two sides disagree about how large the
array elements are, and interpret the array differently. Using a
smaller array won't solve that. The pg function would still allocate an
array of 16 x 32-bit pointers, and the kernel would interpret this as 16
x 64-bit pointers. And that means the kernel will (a) write into memory
beyond the allocated buffer - a clear buffer overflow, and (b) see bogus
pointers, because it'll concatenate two 32-bit pointers.
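
To make the suspected mismatch concrete, here is a minimal standalone
sketch. It is not PostgreSQL or kernel code, and the helper names
(read_as_u64, misread_pair, overrun_bytes) are made up for illustration.
It only shows what a reader expecting 8-byte entries would see in an
array filled with 4-byte entries, assuming the hypothesis above holds:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read entry "idx" from a buffer the way a consumer
 * expecting 8-byte entries would, no matter how the buffer was filled.
 */
static uint64_t
read_as_u64(const void *buf, size_t idx)
{
	uint64_t	v;

	memcpy(&v, (const char *) buf + idx * sizeof(uint64_t), sizeof(v));
	return v;
}

/*
 * Store two 32-bit "pointers" back to back, as a 32-bit process would,
 * and return what a reader assuming 64-bit entries sees in slot 0.
 */
static uint64_t
misread_pair(uint32_t first, uint32_t second)
{
	uint32_t	pages32[2] = {first, second};

	return read_as_u64(pages32, 0);
}

/*
 * Bytes touched past the end of the buffer if "count" 4-byte entries
 * were allocated but "count" 8-byte entries are accessed.
 */
static size_t
overrun_bytes(size_t count)
{
	return count * sizeof(uint64_t) - count * sizeof(uint32_t);
}
```

On a little-endian machine misread_pair(0x1000, 0x2000) gives
0x0000200000001000, which is neither of the two original addresses, and
overrun_bytes(16) is 64, i.e. the 16-entry case would reach 64 bytes
past the allocation.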

I don't see how using a smaller array makes this correct. That it works
is more a matter of luck, and also a consequence of still allocating the
whole array, so there's no overflow (at least I kept that, not sure how
you did the chunks).

If I fix the code to make the entries 64-bit (by treating the pointers
as int64), it suddenly starts working - no bad addresses, etc. Well,
almost, because I get this:

 bufferid | os_page_num | numa_node
----------+-------------+-----------
        1 |           0 |         0
        1 |           1 |       -14
        2 |           2 |         0
        2 |           3 |       -14
        3 |           4 |         0
        3 |           5 |       -14
        4 |           6 |         0
        4 |           7 |       -14
        ...

The -14 status is interesting, because that's the same value Christoph
reported as the other issue (in pg_shmem_allocations_numa).

I did an experiment and changed os_page_status to be declared as int64,
not just int. And interestingly, that produced this:

 bufferid | os_page_num | numa_node
----------+-------------+-----------
        1 |           0 |         0
        1 |           1 |         0
        2 |           2 |         0
        2 |           3 |         0
        3 |           4 |         0
        3 |           5 |         0
        4 |           6 |         0
        4 |           7 |         0
        ...

But I don't see how this makes any sense, because "int" should be 4B in
both cases (in 64-bit kernel and 32-bit chroot).

FWIW I realized this applies to "official" systems with 32-bit user
space on 64-bit kernels, e.g. rpi5 with RPi OS 32-bit. (Fun fact,
rpi5 has 8 NUMA nodes, with all CPUs attached to all NUMA nodes.)

I'm starting to think we need to disable NUMA for setups like this,
mixing 64-bit kernels with 32-bit chroot. Is there a good way to detect
those, so that we can error out?
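
One possible approach, shown only as a rough sketch: compare the pointer
size of the process with the machine name reported by uname(2). The
function names are hypothetical, and the "name contains 64, or is s390x"
rule is a heuristic assumption, not a complete list of 64-bit
architectures. It would also be fooled if the chroot runs under a
32-bit personality (e.g. via linux32), because uname then reports a
32-bit machine name.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/utsname.h>

/*
 * Heuristic (assumption): machine names of 64-bit kernels contain "64"
 * (x86_64, aarch64, ppc64le, riscv64, ...), plus s390x.
 */
static bool
machine_name_is_64bit(const char *machine)
{
	return strstr(machine, "64") != NULL || strcmp(machine, "s390x") == 0;
}

/*
 * Returns true when this process has 4-byte pointers but the kernel
 * reports a machine name that looks 64-bit.
 */
static bool
is_32bit_process_on_64bit_kernel(void)
{
	struct utsname u;

	if (sizeof(void *) != 4)
		return false;		/* not a 32-bit process */
	if (uname(&u) != 0)
		return false;		/* cannot tell, caller has to decide */
	return machine_name_is_64bit(u.machine);
}
```

A caller could check is_32bit_process_on_64bit_kernel() before issuing
any NUMA queries and raise an error instead of calling move_pages().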

FWIW this doesn't explain the strange valgrind issue, though.


-- 
Tomas Vondra