public inbox for [email protected]  
help / color / mirror / Atom feed
From: Rahila Syed <[email protected]>
To: Nazir Bilal Yavuz <[email protected]>
Cc: Andres Freund <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Improve monitoring of shared memory allocations
Date: Sun, 23 Mar 2025 14:06:47 +0530
Message-ID: <CAH2L28uwxJREzB62UjRDBumE87hHWUJJvRwxqqbO+7qFmoZfTg@mail.gmail.com> (raw)
In-Reply-To: <CAN55FZ3cJxy0VkeXpuO3K4BpjzJo3S6oU+iMyc00P6gEjqPztw@mail.gmail.com>
References: <CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com>
	<k6f6ynjvb7lvebhygaiqsfrdohq672uughk3q4ve4q5jqljywz@7jbrz724epeq>
	<CAH2L28uL-8EQTRSeyTpW1DqAXsDXRkXCkT1dus2u6p4HYDrxAg@mail.gmail.com>
	<CAH2L28uGLhkXBKDWFKm5XZtp_0nNqpYQ3Hc35vG++mM7wuOhgg@mail.gmail.com>
	<CAN55FZ3cJxy0VkeXpuO3K4BpjzJo3S6oU+iMyc00P6gEjqPztw@mail.gmail.com>

Hi Bilal,


I have a couple of comments, I have only reviewed 0001 so far.
>

Thank you for reviewing!


>
> You may need to run pgindent, it makes some changes.
>

Attached v4-patch has been updated after running pgindent.


+         * If table is shared, calculate the offset at which to find the
> the
> +         * first partition of elements
> +         */
> +
> +        nsegs = compute_buckets_and_segs(nelem, &nbuckets,
> hctl->num_partitions, hctl->ssize);
>
> Blank line between the comment and the code.
>

Removed this.


>  /*
>   * allocate some new elements and link them into the indicated free list
>   */
> -static bool
> -element_alloc(HTAB *hashp, int nelem, int freelist_idx)
> +static HASHELEMENT *
> +element_alloc(HTAB *hashp, int nelem)
>
> Comment needs an update. This function no longer links elements into
> the free list.
>

Updated this and few other comments in the attached v4-patch.


>
> +static int
> +compute_buckets_and_segs(long nelem, int *nbuckets, long
> num_partitions, long ssize)
> +{
> ...
> +    /*
> +     * In a partitioned table, nbuckets must be at least equal to
> +     * num_partitions; were it less, keys with apparently different
> partition
> +     * numbers would map to the same bucket, breaking partition
> independence.
> +     * (Normally nbuckets will be much bigger; this is just a safety
> check.)
> +     */
> +    while ((*nbuckets) < num_partitions)
> +        (*nbuckets) <<= 1;
>
> I have some worries about this function, I am not sure what I said
> below has real life implications as you already said 'Normally
> nbuckets will be much bigger; this is just a safety check.'.
>
> 1- num_partitions is long and nbuckets is int, so could there be any
> case where num_partition is bigger than MAX_INT and cause an infinite
> loop?
> 2- Although we assume both nbuckets and num_partition initialized as
> the same type, (*nbuckets) <<= 1 will cause an infinite loop if
> num_partition is bigger than MAX_TYPE / 2.
>
> So I think that the solution is to confirm that num_partition <
> MAX_NBUCKETS_TYPE / 2, what do you think?
>
>
Your concern is valid. This has been addressed in the existing code by
calling next_pow2_int() on num_partitions before running the function.
Additionally, I am not adding any new code to the compute_buckets_and_segs
function. I am simply moving part of the init_tab() code into a separate
function
for reuse.

Please find attached the updated and rebased patches.

Thank you,
Rahila Syed


Attachments:

  [application/octet-stream] v4-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch (6.3K, 3-v4-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch)
  download | inline diff:
From 56c1ca94ef109f484a01496f09e9aba2bed0a3a9 Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 6 Mar 2025 20:32:27 +0530
Subject: [PATCH 2/2] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct to allocate
and track all the related memory allocations under single
---
 src/backend/storage/lmgr/predicate.c | 17 +++++++-------
 src/backend/storage/lmgr/proc.c      | 33 +++++++++++++++++++++++-----
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..dd66990335 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,8 +1226,11 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1242,9 +1245,7 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
 		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
@@ -1299,9 +1300,11 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = mul_size((Size) max_table_size,
+						   RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 RWConflictPoolHeaderDataSize + requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1309,9 +1312,7 @@ PredicateLockShmemInit(void)
 		int			i;
 
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
 		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e4ca861a8e..65239d743d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,6 +88,7 @@ static void RemoveProcFromArray(int code, Datum arg);
 static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
+static Size PGProcShmemSize(void);
 
 
 /*
@@ -175,6 +176,7 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +206,10 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -218,11 +223,11 @@ InitProcGlobal(void)
 	 * how hotly they are accessed.
 	 */
 	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+		(TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/*
@@ -233,7 +238,7 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,10 +335,26 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock", sizeof(slock_t), &found);
 	SpinLockInit(ProcStructLock);
 }
 
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	return size;
+}
+
 /*
  * InitProcess -- initialize a per-process PGPROC entry for this backend
  */
-- 
2.34.1



  [application/octet-stream] v4-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch (13.5K, 4-v4-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch)
  download | inline diff:
From 2352d4918f3c5b548790de8e2a673759d83b6f96 Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 6 Mar 2025 20:06:20 +0530
Subject: [PATCH 1/2] Account for initial shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and control structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Include the allocation of segments, buckets and elements in the initial
allocation of shared hash directory. This results in the existing ShmemIndex
entries to reflect all these allocations. The resulting tuples in
pg_shmem_allocations represent the total size of the initial hash table
including all the buckets and the elements they contain, instead of just
the directory size.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 220 ++++++++++++++++++++++--------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 169 insertions(+), 57 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..d8aed0bfaa 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d..550f04359a 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -265,7 +265,7 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +281,9 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static int	compute_buckets_and_segs(long nelem, int *nbuckets,
+									 long num_partitions, long ssize);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
 
 /*
  * memory allocation support
@@ -353,6 +356,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +511,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose number of entries to allocate at a time */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +572,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *curr_offset = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +611,14 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements
+		 */
+		nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+
+		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize * sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -609,11 +636,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
+			HASHELEMENT *firstElement;
+			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Memory is allocated as part of initial allocation in
+			 * ShmemInitHash
+			 */
+			firstElement = (HASHELEMENT *) curr_offset;
+			curr_offset = (((char *) curr_offset) + (temp * elementSize));
+			element_add(hashp, firstElement, i, temp);
 		}
 	}
 
@@ -701,30 +733,11 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +750,28 @@ init_htab(HTAB *hashp, long nelem)
 			return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) (((char *) hashp->hctl) + sizeof(HASHHDR));
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
-	}
+		*segp = (HASHBUCKET *) (((char *) hashp->hctl)
+								+ sizeof(HASHHDR)
+								+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
+								+ (i * sizeof(HASHBUCKET) * hashp->ssize));
+		MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+		i = i + 1;
+	}
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -851,11 +866,64 @@ hash_select_dirsize(long num_entries)
  * and for the (non expansible) directory.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true;
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements are
+	 * allocated only if they are less than nelem_alloc. In any case, the
+	 * init_size should be equal to the number of elements added using
+	 * element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be atleast equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT) +
+		+sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1353,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1391,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, freelist_idx, hctl->nelem_alloc);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,30 +1770,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * Link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
 
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1827,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2038,32 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+{
+	int			nsegs;
+
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	nsegs = ((*nbuckets) - 1) / ssize + 1;
+	nsegs = next_pow2_int(nsegs);
+	return nsegs;
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..79b959ffc3 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1



view thread (20+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected]
  Subject: Re: Improve monitoring of shared memory allocations
  In-Reply-To: <CAH2L28uwxJREzB62UjRDBumE87hHWUJJvRwxqqbO+7qFmoZfTg@mail.gmail.com>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox