Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uU2qm-00EZi7-7A for pgsql-hackers@arkaria.postgresql.org; Tue, 24 Jun 2025 12:34:08 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1uU2qk-00BvzJ-98 for pgsql-hackers@arkaria.postgresql.org; Tue, 24 Jun 2025 12:34:06 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1uU2qj-00BvzB-UH for pgsql-hackers@lists.postgresql.org; Tue, 24 Jun 2025 12:34:06 +0000 Received: from relay4-d.mail.gandi.net ([217.70.183.196]) by makus.postgresql.org with smtp (Exim 4.96) (envelope-from ) id 1uU2qi-003j6t-1A for pgsql-hackers@lists.postgresql.org; Tue, 24 Jun 2025 12:34:05 +0000 Received: by mail.gandi.net (Postfix) with ESMTPSA id 26D61433B1; Tue, 24 Jun 2025 12:34:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=vondra.me; s=gm1; t=1750768440; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=Ol9iRaQ1xXS/kBxybsBJ2VijWKj6F7hGVTKLDO/Kcm4=; b=N2Rx8GuqbAw8PUYJXh7npQ97ZUSL0wpWVtFvrihSVnD2oCvICrBBY6pTZ82az7uQCjIVAq +ETqi9DW1zFV8BM2HuNZjiH666s8Dnhfj+dLXytAToIYPWRANC6IDjL8mklyyfGmrpQTMM gl0+tcAf/zKRLaVjaIO4AyK3LF7EXp2wO2d3FH5WonoZmQtxGvwyapenhJfTnL0WYAK0lh HX4GUCyqTJpwARFLlyRGIMIlQmZ1lUqliJ0gbXb9Kv6SQJTKKBkbQEJe6czS8O/zh1EY6h vQogoMmcNMowMKtoxsWBcFwCUUtvwWHWNbDeLMhSpBCOsg2uXsgZfarsyR25PA== Message-ID: <8649a4e3-c60d-4f37-aa6f-e7e7c14c581e@vondra.me> Date: Tue, 24 Jun 2025 14:33:59 +0200 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: pgsql: Introduce pg_shmem_allocations_numa view To: Bertrand Drouvot Cc: Christoph Berg , Andres Freund , Tomas Vondra , pgsql-hackers@lists.postgresql.org References: <6c9f9f7e-947b-4fc3-bdb6-b0696d7492e5@vondra.me> <0643ae61-cf9d-482c-9b2c-fb861b24fd22@vondra.me> <6342f601-77de-4ee0-8c2a-3deb50ceac5b@vondra.me> Content-Language: en-US From: Tomas Vondra In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-GND-State: clean X-GND-Score: -100 X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtddvgdduleeltdcutefuodetggdotefrodftvfcurfhrohhfihhlvgemucfitefpfffkpdcuggftfghnshhusghstghrihgsvgenuceurghilhhouhhtmecufedtudenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujfgurhepkfffgggfuffvvehfhfgjtgfgsehtjeertddtvdejnecuhfhrohhmpefvohhmrghsucggohhnughrrgcuoehtohhmrghssehvohhnughrrgdrmhgvqeenucggtffrrghtthgvrhhnpeeludegieekgfelhffgffeuvdelteetveeghfdvieekfeduudduvdfhvedufefhveenucfkphepkeeirdegledrvdeftddrvddtieenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepihhnvghtpeekiedrgeelrddvfedtrddvtdeipdhhvghloheplgdutddrudefjedrtddrvdgnpdhmrghilhhfrhhomhepthhomhgrshesvhhonhgurhgrrdhmvgdpnhgspghrtghpthhtohephedprhgtphhtthhopegsvghrthhrrghnuggurhhouhhvohhtrdhpghesghhmrghilhdrtghomhdprhgtphhtthhopehmhihonhesuggvsghirghnrdhorhhgpdhrtghpthhtoheprghnughrvghssegrnhgrrhgriigvlhdruggvpdhrtghpthhtohepthhomhgrshdrvhhonhgurhgrsehpohhsthhgrhgvshhqlhdrohhrghdprhgtphhtthhopehpghhsqhhlqdhhrggtkhgvrhhssehlihhsthhsrdhpohhsthhgrhgvshhqlhdrohhrgh X-GND-Sasl: tomas@vondra.me List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 6/24/25 13:10, Bertrand Drouvot wrote: > Hi, > > On Tue, Jun 24, 2025 at 11:20:15AM +0200, Tomas Vondra wrote: >> On 6/24/25 10:24, Bertrand Drouvot wrote: >>> Yeah, same for me with pg_get_shmem_allocations_numa(). It works if >>> pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more >>> than 16 pages. >>> >>> It's also confirmed by test_chunk_size.c attached: >>> >>> $ gcc-11 -m32 -o test_chunk_size test_chunk_size.c >>> $ ./test_chunk_size >>> 1 pages: SUCCESS (0 errors) >>> 2 pages: SUCCESS (0 errors) >>> 3 pages: SUCCESS (0 errors) >>> 4 pages: SUCCESS (0 errors) >>> 5 pages: SUCCESS (0 errors) >>> 6 pages: SUCCESS (0 errors) >>> 7 pages: SUCCESS (0 errors) >>> 8 pages: SUCCESS (0 errors) >>> 9 pages: SUCCESS (0 errors) >>> 10 pages: SUCCESS (0 errors) >>> 11 pages: SUCCESS (0 errors) >>> 12 pages: SUCCESS (0 errors) >>> 13 pages: SUCCESS (0 errors) >>> 14 pages: SUCCESS (0 errors) >>> 15 pages: SUCCESS (0 errors) >>> 16 pages: SUCCESS (0 errors) >>> 17 pages: 1 errors >>> Threshold: 17 pages >>> >>> No error if -m32 is not used. >>> >>> We could work by chunks (16?) on 32 bits but would probably produce performance >>> degradation (we mention it in the doc though). Also would always 16 be a correct >>> chunk size? >> >> I don't see how this would solve anything? >> >> AFAICS the problem is the two places are confused about how large the >> array elements are, and get to interpret that differently. > >> I don't see how using smaller array makes this correct. That it works is >> more a matter of luck, > > Not sure it's luck, maybe the wrong pointers arithmetic has no effect if batch > size is <= 16. > > So we have kernel_move_pages() -> kernel_move_pages() (because nodes is NULL here > for us as we call "numa_move_pages(pid, count, pages, NULL, status, 0);"). > > So, if we look at do_pages_stat() ([1]), we can see that it uses an hardcoded > "#define DO_PAGES_STAT_CHUNK_NR 16UL" and that this pointers arithmetic: > > " > pages += chunk_nr; > status += chunk_nr; > " > > is done but has no effect since nr_pages will exit the loop if we use a batch > size <= 16. > > So if this pointer arithmetic is not correct, (it seems that it should advance > by 16 * sizeof(compat_uptr_t) instead) then it has no effect as long as the batch > size is <= 16. > > Does test_chunk_size also fails at 17 for you? Yes, it fails for me at 17 too. So you're saying the access within each chunk of 16 elements is OK, but that maybe advancing to the next chunk is not quite right? In which case limiting the access to 16 entries might be a workaround. In any case, this sounds like a kernel bug, right? I don't have much experience with the kernel code, so don't want to rely too much on my interpretation of it. regards -- Tomas Vondra