From: Bertrand Drouvot <[email protected]>
To: Tomas Vondra <[email protected]>
Cc: Christoph Berg <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: Tomas Vondra <[email protected]>
Cc: [email protected]
Subject: Re: pgsql: Introduce pg_shmem_allocations_numa view
Date: Tue, 24 Jun 2025 08:24:53 +0000
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <kl4zd72eeaex7zcicpuvpsuslrs5nfvmab7xzt4jnvcjvd6mxw@tcp64c55qkpj>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>
	<[email protected]>

Hi,

On Tue, Jun 24, 2025 at 03:43:19AM +0200, Tomas Vondra wrote:
> On 6/23/25 23:47, Tomas Vondra wrote:
> > ...
> > 
> > Or maybe the 32-bit chroot on 64-bit host matters and confuses some
> > calculation.
> >
> 
> I think it's likely something like this.

I think the same.

> I noticed that if I modify
> pg_buffercache_numa_pages() to query the addresses one by one, it works.
> And when I increase the number, it stops working somewhere between 16k
> and 17k items.

Yeah, same for me with pg_get_shmem_allocations_numa(). It works if
pg_numa_query_pages() is done on chunks <= 16 pages but fails if done on more
than 16 pages.

It's also confirmed by test_chunk_size.c attached:

$ gcc-11 -m32 -o test_chunk_size test_chunk_size.c
$ ./test_chunk_size
 1 pages: SUCCESS (0 errors)
 2 pages: SUCCESS (0 errors)
 3 pages: SUCCESS (0 errors)
 4 pages: SUCCESS (0 errors)
 5 pages: SUCCESS (0 errors)
 6 pages: SUCCESS (0 errors)
 7 pages: SUCCESS (0 errors)
 8 pages: SUCCESS (0 errors)
 9 pages: SUCCESS (0 errors)
10 pages: SUCCESS (0 errors)
11 pages: SUCCESS (0 errors)
12 pages: SUCCESS (0 errors)
13 pages: SUCCESS (0 errors)
14 pages: SUCCESS (0 errors)
15 pages: SUCCESS (0 errors)
16 pages: SUCCESS (0 errors)
17 pages: 1 errors
Threshold: 17 pages

No error if -m32 is not used.

> It may be a coincidence, but I suspect it's related to the sizeof(void
> *) being 8 in the kernel, but only 4 in the chroot. So the userspace
> passes an array of 4-byte items, but kernel interprets that as 8-byte
> items. That is, we call
> 
> long move_pages(int pid, unsigned long count, void *pages[.count], const
> int nodes[.count], int status[.count], int flags);
> 
> Which (I assume) just passes the parameters to kernel. And it'll
> interpret them per kernel pointer size.
>

I also suspect something in this area...
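
To make the hypothesis quoted above concrete, here is a minimal sketch of what
reading an array of 4-byte addresses as 8-byte items would look like. It is
plain userspace C and does not touch the kernel; the function names are made up
for illustration, and it only demonstrates the suspected layout mismatch, not a
confirmed diagnosis. By itself it also does not explain why the failure starts
at 17 pages.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: read item `index` the way a reader assuming 8-byte
 * pointers would, from an array that really holds 4-byte addresses.
 * The caller must guarantee that (index + 1) * 8 bytes are readable.
 */
uint64_t read_as_8_byte_item(const uint32_t *addrs, size_t index)
{
    uint64_t item;

    assert(addrs != NULL);
    memcpy(&item, (const unsigned char *) addrs + index * sizeof(item),
           sizeof(item));
    return item;
}

/*
 * Hypothetical helper: number of complete 8-byte items contained in an
 * array of `count` 4-byte addresses.
 */
size_t items_that_fit(size_t count)
{
    return count * sizeof(uint32_t) / sizeof(uint64_t);
}
```

With such a reader, each item fuses two neighbouring 32-bit addresses into one
bogus 64-bit address, and only half of the requested count is backed by the
array, so anything past that point is read from unrelated memory.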

> If this is what's happening, I'm not sure what to do about it ...

We could work in chunks (of 16?) on 32-bit, but that would probably degrade
performance (though we do mention that in the docs). Also, would 16 always be a
correct chunk size?
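
A rough sketch of what working in chunks could look like, assuming Linux and the
raw move_pages syscall as used in the attached test. All names here
(page_query_fn, query_with_move_pages, query_stub, query_pages_in_chunks) are
hypothetical and not PostgreSQL code, and the chunk size is left as a parameter
because it is not established that 16 is always safe.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical callback type: report the NUMA node of `count` pages. */
typedef long (*page_query_fn)(unsigned long count, void **pages, int *status);

/* Real query: with nodes == NULL, move_pages only reports page locations. */
long query_with_move_pages(unsigned long count, void **pages, int *status)
{
    return syscall(SYS_move_pages, 0, count, pages, NULL, status, 0);
}

/*
 * Stub for exercising the loop without NUMA support: reports node 0 for
 * every page and remembers the largest batch it was asked for.
 */
unsigned long largest_batch_seen = 0;

long query_stub(unsigned long count, void **pages, int *status)
{
    (void) pages;
    if (count > largest_batch_seen)
        largest_batch_seen = count;
    for (unsigned long i = 0; i < count; i++)
        status[i] = 0;
    return 0;
}

/*
 * Query `count` pages in batches of at most `chunk` pages.
 * Returns 0 on success, -1 for a zero chunk size, or the first non-zero
 * result returned by `query`.
 */
long query_pages_in_chunks(page_query_fn query, unsigned long count,
                           void **pages, int *status, unsigned long chunk)
{
    assert(query != NULL);
    if (chunk == 0)
        return -1;

    for (unsigned long done = 0; done < count; done += chunk)
    {
        unsigned long n = (count - done < chunk) ? count - done : chunk;
        long rc = query(n, pages + done, status + done);

        if (rc != 0)
            return rc;
    }
    return 0;
}
```

A caller would pass query_with_move_pages as the callback, e.g.
query_pages_in_chunks(query_with_move_pages, npages, pages, status, 16). The
cost is one syscall per chunk instead of one per request, which is where the
performance degradation would come from.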

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com


Attachments:

  [text/x-csrc] test_chunk_size.c (1.6K, 2-test_chunk_size.c)
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>
#include <errno.h>

int test_chunk_size(int chunk_size) {
    size_t page_size = sysconf(_SC_PAGESIZE);
    
    void *mem = malloc(page_size * chunk_size);
    if (!mem) return -1;
    
    memset(mem, 0xFF, page_size * chunk_size);
    
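    void **ptrs = malloc(sizeof(void*) * chunk_size);
    int *status = malloc(sizeof(int) * chunk_size);
    if (!ptrs || !status) {
        /* free(NULL) is a no-op, so this is safe whichever allocation failed */
        free(mem);
        free(ptrs);
        free(status);
        return -1;
    }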
    
    for (int j = 0; j < chunk_size; j++) {
        ptrs[j] = (char*)mem + (j * page_size);
        status[j] = -999;
    }
    
    long result = syscall(SYS_move_pages, 0, chunk_size, ptrs, NULL, status, 0);
    
    int errors = 0;
    if (result == 0) {
        for (int j = 0; j < chunk_size; j++) {
            if (status[j] < 0) errors++;
        }
    }
    
    free(mem);
    free(ptrs);
    free(status);
    
    return (result == 0) ? errors : -1;
}

int main() {
    int threshold = -1;
    
    // Test sizes from 1 to 40 pages
    for (int size = 1; size <= 40; size++) {
        int errors = test_chunk_size(size);
        
        if (errors == -1) {
            if (threshold == -1) threshold = size;
            break;
        } else if (errors == 0) {
            printf("%2d pages: SUCCESS (0 errors)\n", size);
        } else {
            printf("%2d pages: %d errors\n", 
                   size, errors);
            threshold = size;
            break;
        }
    }
    
    if (threshold > 0)
        printf("Threshold: %d pages\n", threshold);
    else
        printf("No threshold found in range 1-40 pages\n");
    
    return 0;
}
