Date: Tue, 24 Jun 2025 11:20:15 +0200
Subject: Re: pgsql: Introduce pg_shmem_allocations_numa view
From: Tomas Vondra
To: Bertrand Drouvot
Cc: Christoph Berg, Andres Freund, Tomas Vondra, pgsql-hackers@lists.postgresql.org

On 6/24/25 10:24, Bertrand Drouvot wrote:
> Hi,
>
> On Tue, Jun 24, 2025 at 03:43:19AM +0200, Tomas Vondra wrote:
>> On 6/23/25 23:47, Tomas Vondra wrote:
>>> ...
>>>
>>> Or maybe the 32-bit chroot on 64-bit host matters and confuses some
>>> calculation.
>>>
>>
>> I think it's likely something like this.
>
> I think the same.
>
>> I noticed that if I modify
>> pg_buffercache_numa_pages() to query the addresses one by one, it works.
>> And when I increase the number, it stops working somewhere between 16k
>> and 17k items.
>
> Yeah, same for me with pg_get_shmem_allocations_numa(). It works if
> pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more
> than 16 pages.
>
> It's also confirmed by test_chunk_size.c attached:
>
> $ gcc-11 -m32 -o test_chunk_size test_chunk_size.c
> $ ./test_chunk_size
>  1 pages: SUCCESS (0 errors)
>  2 pages: SUCCESS (0 errors)
>  3 pages: SUCCESS (0 errors)
>  4 pages: SUCCESS (0 errors)
>  5 pages: SUCCESS (0 errors)
>  6 pages: SUCCESS (0 errors)
>  7 pages: SUCCESS (0 errors)
>  8 pages: SUCCESS (0 errors)
>  9 pages: SUCCESS (0 errors)
> 10 pages: SUCCESS (0 errors)
> 11 pages: SUCCESS (0 errors)
> 12 pages: SUCCESS (0 errors)
> 13 pages: SUCCESS (0 errors)
> 14 pages: SUCCESS (0 errors)
> 15 pages: SUCCESS (0 errors)
> 16 pages: SUCCESS (0 errors)
> 17 pages: 1 errors
> Threshold: 17 pages
>
> No error if -m32 is not used.
>
>> It may be a coincidence, but I suspect it's related to the sizeof(void
>> *) being 8 in the kernel, but only 4 in the chroot. So the userspace
>> passes an array of 4-byte items, but kernel interprets that as 8-byte
>> items. That is, we call
>>
>> long move_pages(int pid, unsigned long count, void *pages[.count], const
>> int nodes[.count], int status[.count], int flags);
>>
>> Which (I assume) just passes the parameters to kernel. And it'll
>> interpret them per kernel pointer size.
>>
>
> I also suspect something in this area...
>
>> If this is what's happening, I'm not sure what to do about it ...
>
> We could work by chunks (16?) on 32 bits but would probably produce performance
> degradation (we mention it in the doc though). Also would always 16 be a correct
> chunk size?

I don't see how this would solve anything?
AFAICS the problem is that the two places are confused about how large
the array elements are, and interpret them differently. Using a smaller
array won't solve that. The pg function would still allocate an array of
16 x 32-bit pointers, and the kernel would interpret this as 16 x 64-bit
pointers. And that means the kernel will (a) write into memory beyond
the allocated buffer - a clear buffer overflow, and (b) see bogus
pointers, because it'll concatenate two 32-bit pointers.

I don't see how using a smaller array makes this correct. That it works
is more a matter of luck, and also a consequence of still allocating the
whole array, so there's no overflow (at least I kept that, not sure how
you did the chunks).

If I fix the code to make the entries 64-bit (by treating the pointers
as int64), it suddenly starts working - no bad addresses, etc. Well,
almost, because I get this:

 bufferid | os_page_num | numa_node
----------+-------------+-----------
        1 |           0 |         0
        1 |           1 |       -14
        2 |           2 |         0
        2 |           3 |       -14
        3 |           4 |         0
        3 |           5 |       -14
        4 |           6 |         0
        4 |           7 |       -14
 ...

The -14 status is interesting, because that's the same value Christoph
reported as the other issue (in pg_shmem_allocations_numa).

I did an experiment and changed os_page_status to be declared as int64,
not just int. And interestingly, that produced this:

 bufferid | os_page_num | numa_node
----------+-------------+-----------
        1 |           0 |         0
        1 |           1 |         0
        2 |           2 |         0
        2 |           3 |         0
        3 |           4 |         0
        3 |           5 |         0
        4 |           6 |         0
        4 |           7 |         0
 ...

But I don't see how this makes any sense, because "int" should be 4B in
both cases (in 64-bit kernel and 32-bit chroot).

FWIW I realized this applies to "official" systems with 32-bit user
space on 64-bit kernels, like e.g. rpi5 with RPi OS 32-bit. (Fun fact,
rpi5 has 8 NUMA nodes, with all CPUs attached to all NUMA nodes.)

I'm starting to think we need to disable NUMA for setups like this,
mixing 64-bit kernels with 32-bit chroot. Is there a good way to detect
those, so that we can error-out?

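To make the suspected element-size confusion concrete, here is a minimal userspace sketch. It only illustrates the hypothesis discussed above and says nothing about what the kernel actually does; the helper names reinterpret_as_64bit() and overrun_bytes() are made up for this example and are not PostgreSQL or libnuma functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read a buffer of 4-byte "pointers" the way a
 * consumer expecting 8-byte elements would, without reading past the
 * end of the buffer. Returns the number of 64-bit entries that fit.
 */
static size_t
reinterpret_as_64bit(const uint32_t *ptrs32, size_t count, uint64_t *out)
{
    size_t      n64 = (count * sizeof(uint32_t)) / sizeof(uint64_t);

    assert(ptrs32 != NULL && out != NULL);

    for (size_t i = 0; i < n64; i++)
    {
        /* each 64-bit entry swallows two adjacent 32-bit values */
        memcpy(&out[i],
               (const unsigned char *) ptrs32 + i * sizeof(uint64_t),
               sizeof(uint64_t));
    }

    return n64;
}

/*
 * Hypothetical helper: how many bytes past the end of a "count" x 4-byte
 * array a consumer would touch if it assumed "count" x 8-byte elements.
 */
static size_t
overrun_bytes(size_t count)
{
    return count * sizeof(uint64_t) - count * sizeof(uint32_t);
}
```

Feeding the four values 0x1000, 0x2000, 0x3000, 0x4000 to reinterpret_as_64bit() yields only two entries, each a concatenation of two neighbouring 32-bit values (which half ends up in the high bits depends on endianness). overrun_bytes(16) gives 64, i.e. a 16-element array would be overrun by 64 bytes under this assumption, regardless of how small the chunk is.
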
FWIW this doesn't explain the strange valgrind issue, though.

-- 
Tomas Vondra