Date: Wed, 25 Jun 2025 11:31:36 +0200
Subject: Re: pgsql: Introduce pg_shmem_allocations_numa view
From: Tomas Vondra
To: Jakub Wartak, Christoph Berg
Cc: Bertrand Drouvot, Andres Freund, Tomas Vondra, pgsql-hackers@lists.postgresql.org

On 6/25/25 09:15, Jakub Wartak wrote:
> On Tue, Jun 24, 2025 at 5:30 PM Christoph Berg wrote:
>>
>> Re: Tomas Vondra
>>> If it's a reliable fix, then I guess we can do it like this. But won't
>>> that be a performance penalty on everyone?
>>> Or does the system split the
>>> array into 16-element chunks anyway, so this makes no difference?
>>
>> There's still the overhead of the syscall itself. But no idea how
>> costly it is to have this 16-step loop in user or kernel space.
>>
>> We could claim that on 32-bit systems, shared_buffers would be smaller
>> anyway, so there the overhead isn't that big. And the step size should
>> be larger (if at all) on 64-bit.
>>
>>> Anyway, maybe we should start by reporting this to the kernel people. Do
>>> you want me to do that, or shall one of you take care of that? I suppose
>>> that'd be better, as you already wrote a fix / know the code better.
>>
>> Submitted: https://marc.info/?l=linux-mm&m=175077821909222&w=2
>>
>
> Hi all, I'm quite late to the party (just noticed the thread), but
> here's some additional context: it technically didn't make any sense to
> me to have NUMA on 32-bit due to the small amount of addressable memory
> (after all, NUMA is about big iron, probably not even VMs), so in the
> first versions of the patchset I've excluded 32-bit (and back then for
> some reason I couldn't even find libnuma i386, but Andres pointed out to
> me that it exists, so we re-added it probably just to stay
> consistent). The thread has kind of snowballed since then, but I still
> believe that NUMA on 32-bit does not make a lot of sense.
>
> Even assuming future shm interleaving one day in a future version,
> allocation of small s_b sizes will usually fit a single NUMA node.
>

Not sure. I also thought NUMA doesn't matter very much on 32-bit systems,
exactly because those systems tend to use small amounts of memory.
But then while investigating this issue I realized even rpi5 has NUMA,
in fact it has a whopping 8 nodes:

debian@raspberry-32:~ $ numactl --hardware
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3
node 0 size: 981 MB
node 0 free: 882 MB
node 1 cpus: 0 1 2 3
node 1 size: 1007 MB
node 1 free: 936 MB
node 2 cpus: 0 1 2 3
node 2 size: 1007 MB
node 2 free: 936 MB
node 3 cpus: 0 1 2 3
node 3 size: 943 MB
node 3 free: 873 MB
node 4 cpus: 0 1 2 3
node 4 size: 1007 MB
node 4 free: 936 MB
node 5 cpus: 0 1 2 3
node 5 size: 1007 MB
node 5 free: 935 MB
node 6 cpus: 0 1 2 3
node 6 size: 1007 MB
node 6 free: 936 MB
node 7 cpus: 0 1 2 3
node 7 size: 990 MB
node 7 free: 918 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  10  10  10  10  10  10  10
  1:  10  10  10  10  10  10  10  10
  2:  10  10  10  10  10  10  10  10
  3:  10  10  10  10  10  10  10  10
  4:  10  10  10  10  10  10  10  10
  5:  10  10  10  10  10  10  10  10
  6:  10  10  10  10  10  10  10  10
  7:  10  10  10  10  10  10  10  10

This is with the 32-bit system (which AFAICS means 64-bit kernel and
32-bit user space).

I'm not saying it's a particularly interesting NUMA system, considering
all the costs are 10, and it's not like it's critical to get the best
performance on rpi5. But it's NUMA, and maybe there are some other (more
practical) systems. I find it interesting mostly for testing purposes.


regards

--
Tomas Vondra
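
For anyone who wants to check whether a box reports a "flat" topology like the rpi5 one, here is a minimal sketch that parses the "node distances" part of numactl --hardware output and tests whether every distance is the same. It is not part of the patch or of PostgreSQL; the function names parse_distances and is_uniform are made up for illustration, and the sample text is just the distance table from the rpi5 output in this message.

```python
def parse_distances(text):
    """Return the node distance matrix from `numactl --hardware` output."""
    rows = []
    in_table = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("node distances:"):
            in_table = True
            continue
        # Row lines look like "0:  10  10 ..."; the column header line
        # ("node   0   1 ...") has no colon and is skipped.
        if in_table and ":" in line:
            _, values = line.split(":", 1)
            rows.append([int(v) for v in values.split()])
    return rows


def is_uniform(matrix):
    """True when all node distances are equal, i.e. no node is 'farther'."""
    return len({d for row in matrix for d in row}) == 1


# Distance table copied from the rpi5 output in this message.
SAMPLE = """\
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  10  10  10  10  10  10  10
  1:  10  10  10  10  10  10  10  10
  2:  10  10  10  10  10  10  10  10
  3:  10  10  10  10  10  10  10  10
  4:  10  10  10  10  10  10  10  10
  5:  10  10  10  10  10  10  10  10
  6:  10  10  10  10  10  10  10  10
  7:  10  10  10  10  10  10  10  10
"""

if __name__ == "__main__":
    matrix = parse_distances(SAMPLE)
    print(len(matrix), "nodes, uniform distances:", is_uniform(matrix))
```

On a real machine one would feed it the captured output of numactl --hardware instead of SAMPLE. A uniform matrix suggests the system is useful mainly for exercising the NUMA code paths rather than for measuring placement effects.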