Message-ID: <d0949d7e-dcf2-4650-8a6e-027eb9e17837@vondra.me>
Date: Wed, 25 Jun 2025 11:31:36 +0200
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: pgsql: Introduce pg_shmem_allocations_numa view
To: Jakub Wartak <jakub.wartak@enterprisedb.com>,
 Christoph Berg <myon@debian.org>
Cc: Bertrand Drouvot <bertranddrouvot.pg@gmail.com>,
 Andres Freund <andres@anarazel.de>,
 Tomas Vondra <tomas.vondra@postgresql.org>,
 pgsql-hackers@lists.postgresql.org
References: <aFnGQ2lfb8cfukim@msg.df7cb.de>
 <cd16c4a3-2860-46de-a74e-5532d1b2ee39@vondra.me>
 <6342f601-77de-4ee0-8c2a-3deb50ceac5b@vondra.me>
 <aFpg1de9ZfS1QgUt@ip-10-97-1-34.eu-west-3.compute.internal>
 <a3a4fe3d-1a80-4e03-aa8e-150ee15f6c35@vondra.me>
 <aFqHoNXQ/uWKXJ4U@ip-10-97-1-34.eu-west-3.compute.internal>
 <8649a4e3-c60d-4f37-aa6f-e7e7c14c581e@vondra.me>
 <aFqnM/iiL+MB62dG@ip-10-97-1-34.eu-west-3.compute.internal>
 <aFq5HQj016rVS2lm@msg.df7cb.de>
 <8961c087-e49b-4b16-9437-31331625215c@vondra.me>
 <aFrEesaN9YlV0RrJ@msg.df7cb.de>
 <CAKZiRmziaa5GtqcSozwvRY_=MPo31nGmgpbm2Jciz=w-BBDsOQ@mail.gmail.com>
Content-Language: en-US
From: Tomas Vondra <tomas@vondra.me>
In-Reply-To: 
 <CAKZiRmziaa5GtqcSozwvRY_=MPo31nGmgpbm2Jciz=w-BBDsOQ@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Archived-At: 
 <https://www.postgresql.org/message-id/d0949d7e-dcf2-4650-8a6e-027eb9e17837%40vondra.me>
Precedence: bulk

On 6/25/25 09:15, Jakub Wartak wrote:
> On Tue, Jun 24, 2025 at 5:30 PM Christoph Berg <myon@debian.org> wrote:
>>
>> Re: Tomas Vondra
>>> If it's a reliable fix, then I guess we can do it like this. But won't
>>> that be a performance penalty on everyone? Or does the system split the
>>> array into 16-element chunks anyway, so this makes no difference?
>>
>> There's still the overhead of the syscall itself. But no idea how
>> costly it is to have this 16-step loop in user or kernel space.
>>
>> We could claim that on 32-bit systems, shared_buffers would be smaller
>> anyway, so there the overhead isn't that big. And the step size should
>> be larger (if at all) on 64-bit.
>>
>>> Anyway, maybe we should start by reporting this to the kernel people. Do
>>> you want me to do that, or shall one of you take care of that? I suppose
>>> that'd be better, as you already wrote a fix / know the code better.
>>
>> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
>>
> 
> Hi all, I'm quite late to the party (just noticed the thread), but
> here's some additional context: it technically didn't make any sense to
> me to have NUMA on 32-bit due to the small amount of addressable memory
> (after all, NUMA is about big iron, probably not even VMs), so in the
> first versions of the patchset I excluded 32-bit (and back then for
> some reason I couldn't even find libnuma for i386, but Andres pointed
> out to me that it exists, so we re-added it, probably just to stay
> consistent). The thread has kind of snowballed since then, but I still
> believe that NUMA on 32-bit does not make a lot of sense.
> 
> Even assuming shm interleaving arrives in some future version, an
> allocation with a small s_b will usually fit in a single NUMA node.
> 

Not sure. I also thought NUMA doesn't matter much on 32-bit systems,
precisely because those systems tend to have small amounts of memory.
But while investigating this issue I realized that even the rpi5 has
NUMA; in fact it has a whopping 8 nodes:

debian@raspberry-32:~ $ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3
node 0 size: 981 MB
node 0 free: 882 MB
node 1 cpus: 0 1 2 3
node 1 size: 1007 MB
node 1 free: 936 MB
node 2 cpus: 0 1 2 3
node 2 size: 1007 MB
node 2 free: 936 MB
node 3 cpus: 0 1 2 3
node 3 size: 943 MB
node 3 free: 873 MB
node 4 cpus: 0 1 2 3
node 4 size: 1007 MB
node 4 free: 936 MB
node 5 cpus: 0 1 2 3
node 5 size: 1007 MB
node 5 free: 935 MB
node 6 cpus: 0 1 2 3
node 6 size: 1007 MB
node 6 free: 936 MB
node 7 cpus: 0 1 2 3
node 7 size: 990 MB
node 7 free: 918 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  10  10  10  10  10  10  10
  1:  10  10  10  10  10  10  10  10
  2:  10  10  10  10  10  10  10  10
  3:  10  10  10  10  10  10  10  10
  4:  10  10  10  10  10  10  10  10
  5:  10  10  10  10  10  10  10  10
  6:  10  10  10  10  10  10  10  10
  7:  10  10  10  10  10  10  10  10


This is with the 32-bit system (which AFAICS means a 64-bit kernel and
32-bit user space). I'm not saying it's a particularly interesting NUMA
system, considering all the node distances are 10, and it's not like
it's critical to get the best performance on an rpi5. But it's NUMA, and
maybe there are some other (more practical) systems. I find it
interesting mostly for testing purposes.
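
Coming back to the chunking question quoted at the top, here is a
minimal sketch of what a user-space loop over 16-element chunks could
look like. It is only an illustration, not the actual patch: the
page_query_fn callback, query_pages_chunked() and stub_query() are
hypothetical names, the callback merely stands in for whatever really
performs the per-batch query (for example a thin wrapper around
move_pages(2)), and the chunk size of 16 is simply the number mentioned
in the discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Chunk size taken from the discussion above; treat it as an assumption. */
#define QUERY_CHUNK_SIZE 16

/*
 * Hypothetical callback type standing in for the real per-batch query
 * (for example a thin wrapper around move_pages(2)). Returns 0 on success.
 */
typedef int (*page_query_fn) (size_t count, void **pages, int *status);

/*
 * Query 'count' pages in batches of at most QUERY_CHUNK_SIZE, so the
 * callee never sees more than one chunk per call. Stops at the first
 * failure and returns the callback's error code.
 */
static int
query_pages_chunked(page_query_fn query, size_t count,
                    void **pages, int *status)
{
    size_t      done = 0;

    assert(query != NULL);

    while (done < count)
    {
        size_t      n = count - done;
        int         rc;

        if (n > QUERY_CHUNK_SIZE)
            n = QUERY_CHUNK_SIZE;

        rc = query(n, pages + done, status + done);
        if (rc != 0)
            return rc;

        done += n;
    }

    return 0;
}

/*
 * Stub for illustration only: reports every page as being on node 0
 * and counts how many times it was called.
 */
static int  stub_calls = 0;

static int
stub_query(size_t count, void **pages, int *status)
{
    (void) pages;
    stub_calls++;
    for (size_t i = 0; i < count; i++)
        status[i] = 0;
    return 0;
}
```

With 40 pages the loop makes three calls (16, 16 and 8 entries), so the
extra cost compared to a single call is the per-call overhead Christoph
mentions, repeated once per chunk.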


regards

-- 
Tomas Vondra