public inbox for [email protected]
help / color / mirror / Atom feedFrom: Rahila Syed <[email protected]>
To: Tomas Vondra <[email protected]>
Cc: Nazir Bilal Yavuz <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Improve monitoring of shared memory allocations
Date: Thu, 27 Mar 2025 17:32:55 +0530
Message-ID: <CAH2L28uG_g1Ljo8aL-g1MupJXO4Y7-a-bUCriE7w2213+KSGdA@mail.gmail.com> (raw)
In-Reply-To: <[email protected]>
References: <CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com>
<k6f6ynjvb7lvebhygaiqsfrdohq672uughk3q4ve4q5jqljywz@7jbrz724epeq>
<CAH2L28uL-8EQTRSeyTpW1DqAXsDXRkXCkT1dus2u6p4HYDrxAg@mail.gmail.com>
<CAH2L28uGLhkXBKDWFKm5XZtp_0nNqpYQ3Hc35vG++mM7wuOhgg@mail.gmail.com>
<CAN55FZ3cJxy0VkeXpuO3K4BpjzJo3S6oU+iMyc00P6gEjqPztw@mail.gmail.com>
<CAH2L28uwxJREzB62UjRDBumE87hHWUJJvRwxqqbO+7qFmoZfTg@mail.gmail.com>
<[email protected]>
Hi Tomas,
> I did a review on v3 of the patch. I see there's been some minor changes
> in v4 - I haven't tried to adjust my review, but I assume most of my
> comments still apply.
>
>
Thank you for your interest and review.
> 1) I don't quite understand why hash_get_shared_size() got renamed to
> hash_get_init_size()? Why is that needed? Does it cover only some
> initial allocation, and there's additional memory allocated later? And
> what's the point of the added const?
>
Earlier, this function was used to calculate the size for shared hash
tables only.
Now, it also calculates the size for a non-shared hash table. Hence the
change
of name.
If I don't change the argument to const, I get a compilation error as
follows:
"passing argument 1 of ‘hash_get_init_size’ discards ‘const’ qualifier from
pointer target type."
This is due to this function being called from hash_create which considers
HASHCTL to be
a const.
>
> 5) Isn't it wrong that PredicateLockShmemInit() removes the ShmemAlloc()
> along with calculating the per-element requestSize, but then still does
>
> memset(PredXact->element, 0, requestSize);
>
> and
>
> memset(RWConflictPool->element, 0, requestSize);
>
> with requestSize for the whole allocation? I haven't seen any crashes,
> but this seems wrong to me. I believe we can simply zero the whole
> allocation right after ShmemInitStruct(), there's no need to do this for
> each individual element.
>
Good catch. Thanks for updating.
>
> 5) InitProcGlobal is another example of hard-to-read code. Admittedly,
> it wasn't particularly readable before, but making the lines even longer
> does not make it much better ...
>
> I guess adding a couple macros similar to hash_create() would be one way
> to improve this (and there's even a review comment in that sense), but I
> chose to just do maintain a simple pointer. But if you think it should
> be done differently, feel free to do so.
>
>
LGTM, long lines have been reduced by the ptr approach.
> 6) I moved the PGProcShmemSize() to be right after ProcGlobalShmemSize()
> which seems to be doing a very similar thing, and to not require the
> prototype. Minor detail, feel free to undo.
>
> LGTM.
>
> 7) I think the PG_CACHE_LINE_SIZE is entirely unrelated to the rest of
> the patch, and I see no reason to do it in the same commit. So 0005
> removes this change, and 0006 reintroduces it separately.
>
> OK.
> FWIW I see no justification for why the cache line padding is useful,
> and it seems quite unlikely this would benefit anything, consiering it
> adds padding between fairly long arrays. What kind of contention can we
> get there? I don't get it.
>
I have done that to address a review comment upthread by Andres and
the because comment above that code block also mentions need for
padding.
> Also, why is the patch adding padding after statusFlags (the last array
> allocated in InitProcGlobal) and not between allProcs and xids?
>
I agree it makes more sense this way, so changing accordingly.
> *
> + * XXX can we rely on this matching the calculation in
> hash_get_shared_size?
> + * or could/should we add some asserts? Can we have at
> least some sanity checks
> + * on nbuckets/nsegs?
>
>
Both places rely on compute_buckets_and_segs() to calculate nbuckets and
nsegs,
hence the probability of mismatch is low. I am open to adding some
asserts to verify this.
Do you have any suggestions in mind?
Please find attached updated patches after merging all your review comments
except
a few discussed above.
Thank you,
Rahila Syed
Attachments:
[application/octet-stream] v5-0003-Add-cacheline-padding-between-heavily-accessed-array.patch (2.1K, 3-v5-0003-Add-cacheline-padding-between-heavily-accessed-array.patch)
download | inline diff:
From 8b8fe395df974e5e7d1ae85d933c53ec86132e1d Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 27 Mar 2025 17:20:12 +0530
Subject: [PATCH 3/3] Add cacheline padding between heavily accessed arrays
---
src/backend/storage/lmgr/proc.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6ee48410b8..edc5c7406b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -135,8 +135,11 @@ PGProcShmemSize(void)
uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
size = TotalProcs * sizeof(PGPROC);
+ size = add_size(size, PG_CACHE_LINE_SIZE);
size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+ size = add_size(size, PG_CACHE_LINE_SIZE);
size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+ size = add_size(size, PG_CACHE_LINE_SIZE);
size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
return size;
}
@@ -231,7 +234,7 @@ InitProcGlobal(void)
&found);
procs = (PGPROC *) ptr;
- ptr = (char *)ptr + TotalProcs * sizeof(PGPROC);
+ ptr = (char *)ptr + TotalProcs * sizeof(PGPROC) + PG_CACHE_LINE_SIZE;
MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
ProcGlobal->allProcs = procs;
@@ -244,11 +247,11 @@ InitProcGlobal(void)
*/
ProcGlobal->xids = (TransactionId *) ptr;
MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
- ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+ ptr = (char *)ptr + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
- ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+ ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates)) + PG_CACHE_LINE_SIZE;
ProcGlobal->statusFlags = (uint8 *) ptr;
MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
--
2.34.1
[application/octet-stream] v5-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch (15.3K, 4-v5-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch)
download | inline diff:
From 44b3fbb06dd9436c5545c2a93432d65e400a360c Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 27 Mar 2025 12:59:02 +0530
Subject: [PATCH 1/2] Account for all the shared memory allocated by
hash_create
pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and header structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.
Allocate memory for segments, buckets and elements together with the
directory and header structures. This results in the existing ShmemIndex
entries to reflect size of hash table more accurately, thus improving
the pg_shmem_allocation monitoring. Also, make this change for non-
shared hash table since they both share the hash_create code.
---
src/backend/storage/ipc/shmem.c | 3 +-
src/backend/utils/hash/dynahash.c | 265 +++++++++++++++++++++++-------
src/include/utils/hsearch.h | 3 +-
3 files changed, 213 insertions(+), 58 deletions(-)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..d8aed0bfaa 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
#include "storage/shmem.h"
#include "storage/spin.h"
#include "utils/builtins.h"
+#include "utils/dynahash.h"
static void *ShmemAllocRaw(Size size, Size *allocated_size);
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
/* look it up in the shmem index */
location = ShmemInitStruct(name,
- hash_get_shared_size(infoP, hash_flags),
+ hash_get_init_size(infoP, hash_flags, init_size, 0),
&found);
/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d..1f215a16c5 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,36 @@ static long hash_accesses,
hash_expansions;
#endif
+
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+ (MAXALIGN(sizeof(HASHHDR)) + \
+ ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+ ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
+
+#define HASH_ELEMENTS(hashp, nsegs) \
+ ((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+ (MAXALIGN(sizeof(HASHHDR)) + \
+ ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+ ((hctl)->ssize * (idx) * MAXALIGN(sizeof(HASHBUCKET))))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+ (HASHSEGMENT) ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp) ((hashp)->ssize * MAXALIGN(sizeof(HASHBUCKET)))
+
+#define HASH_DIRECTORY(hashp) (HASHSEGMENT *) (((char *) (hashp)->hctl) + MAXALIGN(sizeof(HASHHDR)))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+ ((char *) (ptr) + ((num) * (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))))
+
/*
* Private function prototypes
*/
static void *DynaHashAlloc(Size size);
static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
static bool dir_realloc(HTAB *hashp);
static bool expand_table(HTAB *hashp);
static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +305,11 @@ static void register_seq_scan(HTAB *hashp);
static void deregister_seq_scan(HTAB *hashp);
static bool has_seq_scans(HTAB *hashp);
+static void compute_buckets_and_segs(long nelem, long num_partitions,
+ long ssize, /* segment size */
+ int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+ int nelem, int freelist_idx);
/*
* memory allocation support
@@ -353,6 +382,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
{
HTAB *hashp;
HASHHDR *hctl;
+ int nelem_batch;
/*
* Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +537,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
hashp->isshared = false;
}
+ /* Choose the number of entries to allocate at a time. */
+ nelem_batch = choose_nelem_alloc(info->entrysize);
+
+ /*
+ * Allocate the memory needed for hash header, directory, segments and
+ * elements together. Use pointer arithmetic to arrive at the start of
+ * each of these structures later.
+ */
if (!hashp->hctl)
{
- hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+ Size size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+ hashp->hctl = (HASHHDR *) hashp->alloc(size);
if (!hashp->hctl)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
hctl->keysize = info->keysize;
hctl->entrysize = info->entrysize;
+ /* remember how many elements to allocate at once */
+ hctl->nelem_alloc = nelem_batch;
+
/* make local copies of heavily-used constant fields */
hashp->keysize = hctl->keysize;
hashp->ssize = hctl->ssize;
@@ -582,6 +625,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
freelist_partitions,
nelem_alloc,
nelem_alloc_first;
+ void *ptr = NULL;
+ int nsegs;
+ int nbuckets;
/*
* If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +638,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
else
freelist_partitions = 1;
+ compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+ &nbuckets, &nsegs);
+
+ /*
+ * Calculate the offset at which to find the first partition of
+ * elements.
+ * We have to skip space for the header, segments and buckets.
+ */
+ ptr = HASH_ELEMENTS(hashp, nsegs);
+
nelem_alloc = nelem / freelist_partitions;
if (nelem_alloc <= 0)
nelem_alloc = 1;
@@ -610,10 +666,17 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
{
int temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
- if (!element_alloc(hashp, temp, i))
- ereport(ERROR,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ /*
+ * Assign the correct location of each parition within a
+ * pre-allocated buffer.
+ *
+ * Actual memory allocation happens in ShmemInitHash for
+ * shared hash tables or earlier in this function for non-shared
+ * hash tables.
+ * We just need to split that allocation per-batch freelists.
+ */
+ element_add(hashp, (HASHELEMENT *) ptr, temp, i);
+ ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
}
}
@@ -701,30 +764,12 @@ init_htab(HTAB *hashp, long nelem)
for (i = 0; i < NUM_FREELISTS; i++)
SpinLockInit(&(hctl->freeList[i].mutex));
- /*
- * Allocate space for the next greater power of two number of buckets,
- * assuming a desired maximum load factor of 1.
- */
- nbuckets = next_pow2_int(nelem);
-
- /*
- * In a partitioned table, nbuckets must be at least equal to
- * num_partitions; were it less, keys with apparently different partition
- * numbers would map to the same bucket, breaking partition independence.
- * (Normally nbuckets will be much bigger; this is just a safety check.)
- */
- while (nbuckets < hctl->num_partitions)
- nbuckets <<= 1;
+ compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+ &nbuckets, &nsegs);
hctl->max_bucket = hctl->low_mask = nbuckets - 1;
hctl->high_mask = (nbuckets << 1) - 1;
- /*
- * Figure number of directory segments needed, round up to a power of 2
- */
- nsegs = (nbuckets - 1) / hctl->ssize + 1;
- nsegs = next_pow2_int(nsegs);
-
/*
* Make sure directory is big enough. If pre-allocated directory is too
* small, choke (caller screwed up).
@@ -737,26 +782,25 @@ init_htab(HTAB *hashp, long nelem)
return false;
}
- /* Allocate a directory */
+ /*
+ * Assign a directory by making it point to the correct location in the
+ * pre-allocated buffer.
+ */
if (!(hashp->dir))
{
CurrentDynaHashCxt = hashp->hcxt;
- hashp->dir = (HASHSEGMENT *)
- hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
- if (!hashp->dir)
- return false;
+ hashp->dir = HASH_DIRECTORY(hashp);
}
- /* Allocate initial segments */
+ /* Assign initial segments, which are also pre-allocated */
+ i = 0;
for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
{
- *segp = seg_alloc(hashp);
- if (*segp == NULL)
- return false;
+ *segp = HASH_SEGMENT_PTR(hashp, i++);
+ MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
}
- /* Choose number of entries to allocate at a time */
- hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+ Assert(i == nsegs);
#ifdef HASH_DEBUG
fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -847,15 +891,79 @@ hash_select_dirsize(long num_entries)
/*
* Compute the required initial memory allocation for a shared-memory
- * hashtable with the given parameters. We need space for the HASHHDR
- * and for the (non expansible) directory.
+ * or non-shared memory hashtable with the given parameters.
+ * We need space for the HASHHDR, for the directory, segments and
+ * the init_size elements in buckets.
+ *
+ * For shared hash tables the directory size is non-expansive.
+ *
+ * init_size should match the total number of elements allocated
+ * during hash table creation, it could be zero for non-shared hash
+ * tables depending on the value of nelem_alloc. For more explanation
+ * see comments within this function.
+ *
+ * nelem_alloc parameter is not relevant for shared hash tables.
*/
Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
{
- Assert(flags & HASH_DIRSIZE);
- Assert(info->dsize == info->max_dsize);
- return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+ int nbuckets;
+ int nsegs;
+ int num_partitions;
+ long ssize;
+ long dsize;
+ bool element_alloc = true; /*Always true for shared hash tables */
+ Size elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+ /*
+ * For non-shared hash tables, the requested number of elements are
+ * allocated only if they are less than nelem_alloc. In any case, the
+ * init_size should be equal to the number of elements added using
+ * element_add() in hash_create.
+ */
+ if (!(flags & HASH_SHARED_MEM))
+ {
+ if (init_size > nelem_alloc)
+ element_alloc = false;
+ }
+ else
+ {
+ Assert(flags & HASH_DIRSIZE);
+ Assert(info->dsize == info->max_dsize);
+ }
+ /* Non-shared hash tables may not specify dir size */
+ if (!(flags & HASH_DIRSIZE))
+ {
+ dsize = DEF_DIRSIZE;
+ }
+ else
+ dsize = info->dsize;
+
+ if (flags & HASH_PARTITION)
+ {
+ num_partitions = info->num_partitions;
+
+ /* Number of entries should be atleast equal to the freelists */
+ if (init_size < NUM_FREELISTS)
+ init_size = NUM_FREELISTS;
+ }
+ else
+ num_partitions = 0;
+
+ if (flags & HASH_SEGMENT)
+ ssize = info->ssize;
+ else
+ ssize = DEF_SEGSIZE;
+
+ compute_buckets_and_segs(init_size, num_partitions, ssize,
+ &nbuckets, &nsegs);
+
+ if (!element_alloc)
+ init_size = 0;
+
+ return MAXALIGN(sizeof(HASHHDR)) + dsize * MAXALIGN(sizeof(HASHSEGMENT)) +
+ + MAXALIGN(sizeof(HASHBUCKET)) * ssize * nsegs
+ + init_size * elementSize;
}
@@ -1285,7 +1393,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
* Failing because the needed element is in a different freelist is
* not acceptable.
*/
- if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+ newElement = element_alloc(hashp, hctl->nelem_alloc);
+ if (newElement == NULL)
{
int borrow_from_idx;
@@ -1322,6 +1431,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
/* no elements available to borrow either, so out of memory */
return NULL;
}
+ element_add(hashp, newElement, hctl->nelem_alloc, freelist_idx);
}
/* remove entry from freelist, bump nentries */
@@ -1700,30 +1810,43 @@ seg_alloc(HTAB *hashp)
}
/*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
*/
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
{
HASHHDR *hctl = hashp->hctl;
Size elementSize;
- HASHELEMENT *firstElement;
- HASHELEMENT *tmpElement;
- HASHELEMENT *prevElement;
- int i;
+ HASHELEMENT *firstElement = NULL;
if (hashp->isfixed)
- return false;
+ return NULL;
/* Each element has a HASHELEMENT header plus user data. */
elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
CurrentDynaHashCxt = hashp->hcxt;
firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
if (!firstElement)
- return false;
+ return NULL;
+
+ return firstElement;
+}
+
+/*
+ * Link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
+{
+ HASHHDR *hctl = hashp->hctl;
+ Size elementSize;
+ HASHELEMENT *tmpElement;
+ HASHELEMENT *prevElement;
+ int i;
+ /* Each element has a HASHELEMENT header plus user data. */
+ elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
/* prepare to link all the new entries into the freelist */
prevElement = NULL;
tmpElement = firstElement;
@@ -1744,8 +1867,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
if (IS_PARTITIONED(hctl))
SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
- return true;
}
/*
@@ -1957,3 +2078,35 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
}
}
}
+
+/*
+ * Calculate the number of buckets and segments to store the given
+ * number of elements in a hash table. Segments contain buckets which
+ * in turn contain elements.
+ */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+ int *nbuckets, int *nsegments)
+{
+ /*
+ * Allocate space for the next greater power of two number of buckets,
+ * assuming a desired maximum load factor of 1.
+ */
+ *nbuckets = next_pow2_int(nelem);
+
+ /*
+ * In a partitioned table, nbuckets must be at least equal to
+ * num_partitions; were it less, keys with apparently different partition
+ * numbers would map to the same bucket, breaking partition independence.
+ * (Normally nbuckets will be much bigger; this is just a safety check.)
+ */
+ while ((*nbuckets) < num_partitions)
+ (*nbuckets) <<= 1;
+
+
+ /*
+ * Figure number of directory segments needed, round up to a power of 2
+ */
+ *nsegments = ((*nbuckets) - 1) / ssize + 1;
+ *nsegments = next_pow2_int(*nsegments);
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..79b959ffc3 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
extern void hash_freeze(HTAB *hashp);
extern Size hash_estimate_size(long num_entries, Size entrysize);
extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+ long init_size, int nelem_alloc);
extern void AtEOXact_HashTables(bool isCommit);
extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
--
2.34.1
[application/octet-stream] v5-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch (7.2K, 5-v5-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch)
download | inline diff:
From 65dab85e3c0d6889cfa02da29c2b11d6dd39b56a Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 27 Mar 2025 16:43:28 +0530
Subject: [PATCH 2/2] Replace ShmemAlloc calls by ShmemInitStruct
The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct to allocate
and track all the related memory allocations under single
call.
---
src/backend/storage/lmgr/predicate.c | 27 +++++++++-----
src/backend/storage/lmgr/proc.c | 56 +++++++++++++++++++++++-----
2 files changed, 63 insertions(+), 20 deletions(-)
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..de2629fdf0 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,14 +1226,20 @@ PredicateLockShmemInit(void)
*/
max_table_size *= 10;
+ requestSize = add_size(PredXactListDataSize,
+ (mul_size((Size) max_table_size,
+ sizeof(SERIALIZABLEXACT))));
PredXact = ShmemInitStruct("PredXactList",
- PredXactListDataSize,
+ requestSize,
&found);
Assert(found == IsUnderPostmaster);
if (!found)
{
int i;
+ /* reset everything, both the header and the element */
+ memset(PredXact, 0, requestSize);
+
dlist_init(&PredXact->availableList);
dlist_init(&PredXact->activeList);
PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1242,11 +1248,8 @@ PredicateLockShmemInit(void)
PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
PredXact->CanPartialClearThrough = 0;
PredXact->HavePartialClearedThrough = 0;
- requestSize = mul_size((Size) max_table_size,
- sizeof(SERIALIZABLEXACT));
- PredXact->element = ShmemAlloc(requestSize);
+ PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
/* Add all elements to available list, clean. */
- memset(PredXact->element, 0, requestSize);
for (i = 0; i < max_table_size; i++)
{
LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1299,21 +1302,25 @@ PredicateLockShmemInit(void)
* probably OK.
*/
max_table_size *= 5;
+ requestSize = RWConflictPoolHeaderDataSize +
+ mul_size((Size) max_table_size,
+ RWConflictDataSize);
RWConflictPool = ShmemInitStruct("RWConflictPool",
- RWConflictPoolHeaderDataSize,
+ requestSize,
&found);
Assert(found == IsUnderPostmaster);
if (!found)
{
int i;
+ /* clean everything, including the elements */
+ memset(RWConflictPool, 0, requestSize);
+
dlist_init(&RWConflictPool->availableList);
- requestSize = mul_size((Size) max_table_size,
- RWConflictDataSize);
- RWConflictPool->element = ShmemAlloc(requestSize);
+ RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+ RWConflictPoolHeaderDataSize);
/* Add all elements to available list, clean. */
- memset(RWConflictPool->element, 0, requestSize);
for (i = 0; i < max_table_size; i++)
{
dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e4ca861a8e..6ee48410b8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -123,6 +123,24 @@ ProcGlobalShmemSize(void)
return size;
}
+/*
+ * review: add comment, explaining the PG_CACHE_LINE_SIZE thing
+ * review: I'd even maybe split the PG_CACHE_LINE_SIZE thing into
+ * a separate commit, not to mix it with the "monitoring improvement"
+ */
+static Size
+PGProcShmemSize(void)
+{
+ Size size;
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+ size = TotalProcs * sizeof(PGPROC);
+ size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+ size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+ size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+ return size;
+}
+
/*
* Report number of semaphores needed by InitProcGlobal.
*/
@@ -175,6 +193,8 @@ InitProcGlobal(void)
*fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
Size fpLockBitsSize,
fpRelIdSize;
+ Size requestSize;
+ char *ptr;
/* Create the ProcGlobal shared structure */
ProcGlobal = (PROC_HDR *)
@@ -204,7 +224,15 @@ InitProcGlobal(void)
* with a single freelist.) Each PGPROC structure is dedicated to exactly
* one of these purposes, and they do not move between groups.
*/
- procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+ requestSize = PGProcShmemSize();
+
+ ptr = ShmemInitStruct("PGPROC structures",
+ requestSize,
+ &found);
+
+ procs = (PGPROC *) ptr;
+ ptr = (char *)ptr + TotalProcs * sizeof(PGPROC);
+
MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
ProcGlobal->allProcs = procs;
/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -213,17 +241,21 @@ InitProcGlobal(void)
/*
* Allocate arrays mirroring PGPROC fields in a dense manner. See
* PROC_HDR.
- *
- * XXX: It might make sense to increase padding for these arrays, given
- * how hotly they are accessed.
*/
- ProcGlobal->xids =
- (TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+ ProcGlobal->xids = (TransactionId *) ptr;
MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
- ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+ ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+
+ ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
- ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+ ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+
+ ProcGlobal->statusFlags = (uint8 *) ptr;
MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+ ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+ /* make sure wer didn't overflow */
+ Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
/*
* Allocate arrays for fast-path locks. Those are variable-length, so
@@ -233,7 +265,9 @@ InitProcGlobal(void)
fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
- fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+ fpPtr = ShmemInitStruct("Fast path lock arrays",
+ TotalProcs * (fpLockBitsSize + fpRelIdSize),
+ &found);
MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
/* For asserts checking we did not overflow. */
@@ -330,7 +364,9 @@ InitProcGlobal(void)
PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
/* Create ProcStructLock spinlock, too */
- ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+ ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+ sizeof(slock_t),
+ &found);
SpinLockInit(ProcStructLock);
}
--
2.34.1
view thread (20+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected]
Subject: Re: Improve monitoring of shared memory allocations
In-Reply-To: <CAH2L28uG_g1Ljo8aL-g1MupJXO4Y7-a-bUCriE7w2213+KSGdA@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox