Re: Improve monitoring of shared memory allocations

public inbox for [email protected]  
help / color / mirror / Atom feed

Re: Improve monitoring of shared memory allocations
14+ messages / 5 participants
[nested] [flat]

* Re: Improve monitoring of shared memory allocations
@ 2025-03-12 10:46 Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Rahila Syed @ 2025-03-12 10:46 UTC (permalink / raw)
  To: Andres Freund <[email protected]>; +Cc: pgsql-hackers

Hi Andres,



>> > +             if (hashp->isshared)
>> > +             {
>> > +                     int                     nsegs;
>> > +                     int                     nbuckets;
>> > +                     nsegs = find_num_of_segs(nelem, &nbuckets,
>> hctl->num_partitions, hctl->ssize);
>> > +
>> > +                     curr_offset =  (((char *) hashp->hctl) +
>> sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
>> > +                        + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
>> > +             }
>> > +
>>
>> Why only do this for shared hashtables? Couldn't we allocate the elments
>> together with the rest for non-share hashtables too?
>>
>
>
I have now made the changes uniformly across shared and non-shared hash
tables,
making the code fix look cleaner. I hope this aligns with your suggestions.
Please find attached updated and rebased versions of both patches.

Kindly let me know your views.

Thank you,
Rahila Syed


Attachments:

  [application/octet-stream] v3-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch (13.2K, 3-v3-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch)
  download | inline diff:
From 98ca9cc3f34b6dd0892281583b4eb5e38ce0a668 Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 6 Mar 2025 20:06:20 +0530
Subject: [PATCH 1/2] Account for initial shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and control structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Include the allocation of segments, buckets and elements in the initial
allocation of shared hash directory. This results in the existing ShmemIndex
entries to reflect all these allocations. The resulting tuples in
pg_shmem_allocations represent the total size of the initial hash table
including all the buckets and the elements they contain, instead of just
the directory size.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 209 ++++++++++++++++++++++--------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 161 insertions(+), 54 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..d8aed0bfaa 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index cd5a00132f..3bdf3d6fd5 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -265,7 +265,7 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +281,9 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static int	compute_buckets_and_segs(long nelem, int *nbuckets,
+									 long num_partitions, long ssize);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
 
 /*
  * memory allocation support
@@ -353,6 +356,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +511,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose number of entries to allocate at a time */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start
+	 * of each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +572,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *curr_offset = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +611,15 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		/*
+		 * If table is shared, calculate the offset at which to find the the
+		 * first partition of elements
+		 */
+
+		nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+
+		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize * sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -609,11 +637,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
+			HASHELEMENT *firstElement;
+			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Memory is allocated as part of initial allocation in
+			 * ShmemInitHash
+			 */
+			firstElement = (HASHELEMENT *) curr_offset;
+			curr_offset = (((char *) curr_offset) + (temp * elementSize));
+			element_add(hashp, firstElement, i, temp);
 		}
 	}
 
@@ -701,30 +734,11 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -741,22 +755,21 @@ init_htab(HTAB *hashp, long nelem)
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) (((char *) hashp->hctl) + sizeof(HASHHDR));
 	}
 
 	/* Allocate initial segments */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
-	}
+		*segp = (HASHBUCKET *) (((char *) hashp->hctl)
+								+ sizeof(HASHHDR)
+								+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
+								+ (i * sizeof(HASHBUCKET) * hashp->ssize));
+		MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+		i = i + 1;
+	}
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -851,11 +864,64 @@ hash_select_dirsize(long num_entries)
  * and for the (non expansible) directory.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true;
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements
+	 * are allocated only if they are less than nelem_alloc.
+	 * In any case, the init_size should be equal to the number of
+	 * elements added using element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be atleast equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT) +
+		+sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1351,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1389,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, freelist_idx, hctl->nelem_alloc);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1702,28 +1770,38 @@ seg_alloc(HTAB *hashp)
 /*
  * allocate some new elements and link them into the indicated free list
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
 
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
+
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1822,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2033,32 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+{
+	int			nsegs;
+
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	nsegs = ((*nbuckets) - 1) / ssize + 1;
+	nsegs = next_pow2_int(nsegs);
+	return nsegs;
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..5e513c8116 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+					long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1



  [application/octet-stream] v3-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch (6.3K, 4-v3-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch)
  download | inline diff:
From 598287c0d366bc32def8e39a747f8f4418376706 Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 6 Mar 2025 20:32:27 +0530
Subject: [PATCH 2/2] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct to allocate
and track all the related memory allocations under single
---
 src/backend/storage/lmgr/predicate.c | 17 +++++++-------
 src/backend/storage/lmgr/proc.c      | 33 +++++++++++++++++++++++-----
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..dd66990335 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,8 +1226,11 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1242,9 +1245,7 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
 		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
@@ -1299,9 +1300,11 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = mul_size((Size) max_table_size,
+						   RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 RWConflictPoolHeaderDataSize + requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1309,9 +1312,7 @@ PredicateLockShmemInit(void)
 		int			i;
 
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
 		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 749a79d48e..3ae817dedc 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,6 +88,7 @@ static void RemoveProcFromArray(int code, Datum arg);
 static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
+static Size PGProcShmemSize(void);
 
 
 /*
@@ -175,6 +176,7 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +206,10 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -218,11 +223,11 @@ InitProcGlobal(void)
 	 * how hotly they are accessed.
 	 */
 	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+		(TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/*
@@ -233,7 +238,7 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,10 +335,26 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock", sizeof(slock_t), &found);
 	SpinLockInit(ProcStructLock);
 }
 
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	return size;
+}
+
 /*
  * InitProcess -- initialize a per-process PGPROC entry for this backend
  */
-- 
2.34.1



^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
@ 2025-03-21 11:15 ` Nazir Bilal Yavuz <[email protected]>
  2025-03-23 08:36   ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Nazir Bilal Yavuz @ 2025-03-21 11:15 UTC (permalink / raw)
  To: Rahila Syed <[email protected]>; +Cc: Andres Freund <[email protected]>; pgsql-hackers

Hi,

On Wed, 12 Mar 2025 at 13:46, Rahila Syed <[email protected]> wrote:
> I have now made the changes uniformly across shared and non-shared hash tables,
> making the code fix look cleaner. I hope this aligns with your suggestions.
> Please find attached updated and rebased versions of both patches.

Thank you for working on this!

I have a couple of comments, I have only reviewed 0001 so far.

You may need to run pgindent, it makes some changes.

diff --git a/src/backend/utils/hash/dynahash.c
b/src/backend/utils/hash/dynahash.c
index cd5a00132f..3bdf3d6fd5 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c

+        /*
+         * If table is shared, calculate the offset at which to find the the
+         * first partition of elements
+         */
+
+        nsegs = compute_buckets_and_segs(nelem, &nbuckets,
hctl->num_partitions, hctl->ssize);

Blank line between the comment and the code.

+    bool        element_alloc = true;
+    Size        elementSize = MAXALIGN(sizeof(HASHELEMENT)) +
MAXALIGN(info->entrysize);

It looks odd to me that camelCase and snake_case are used together but
it is already used like that in this file; so I think it should be
okay.

 /*
  * allocate some new elements and link them into the indicated free list
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)

Comment needs an update. This function no longer links elements into
the free list.

+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long
num_partitions, long ssize)
+{
...
+    /*
+     * In a partitioned table, nbuckets must be at least equal to
+     * num_partitions; were it less, keys with apparently different partition
+     * numbers would map to the same bucket, breaking partition independence.
+     * (Normally nbuckets will be much bigger; this is just a safety check.)
+     */
+    while ((*nbuckets) < num_partitions)
+        (*nbuckets) <<= 1;

I have some worries about this function, I am not sure what I said
below has real life implications as you already said 'Normally
nbuckets will be much bigger; this is just a safety check.'.

1- num_partitions is long and nbuckets is int, so could there be any
case where num_partition is bigger than MAX_INT and cause an infinite
loop?
2- Although we assume both nbuckets and num_partition initialized as
the same type, (*nbuckets) <<= 1 will cause an infinite loop if
num_partition is bigger than MAX_TYPE / 2.

So I think that the solution is to confirm that num_partition <
MAX_NBUCKETS_TYPE / 2, what do you think?

--
Regards,
Nazir Bilal Yavuz
Microsoft

^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
@ 2025-03-23 08:36   ` Rahila Syed <[email protected]>
  2025-03-24 13:24     ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Rahila Syed @ 2025-03-23 08:36 UTC (permalink / raw)
  To: Nazir Bilal Yavuz <[email protected]>; +Cc: Andres Freund <[email protected]>; pgsql-hackers

Hi Bilal,


I have a couple of comments, I have only reviewed 0001 so far.
>

Thank you for reviewing!


>
> You may need to run pgindent, it makes some changes.
>

Attached v4-patch has been updated after running pgindent.


+         * If table is shared, calculate the offset at which to find the
> the
> +         * first partition of elements
> +         */
> +
> +        nsegs = compute_buckets_and_segs(nelem, &nbuckets,
> hctl->num_partitions, hctl->ssize);
>
> Blank line between the comment and the code.
>

Removed this.


>  /*
>   * allocate some new elements and link them into the indicated free list
>   */
> -static bool
> -element_alloc(HTAB *hashp, int nelem, int freelist_idx)
> +static HASHELEMENT *
> +element_alloc(HTAB *hashp, int nelem)
>
> Comment needs an update. This function no longer links elements into
> the free list.
>

Updated this and few other comments in the attached v4-patch.


>
> +static int
> +compute_buckets_and_segs(long nelem, int *nbuckets, long
> num_partitions, long ssize)
> +{
> ...
> +    /*
> +     * In a partitioned table, nbuckets must be at least equal to
> +     * num_partitions; were it less, keys with apparently different
> partition
> +     * numbers would map to the same bucket, breaking partition
> independence.
> +     * (Normally nbuckets will be much bigger; this is just a safety
> check.)
> +     */
> +    while ((*nbuckets) < num_partitions)
> +        (*nbuckets) <<= 1;
>
> I have some worries about this function, I am not sure what I said
> below has real life implications as you already said 'Normally
> nbuckets will be much bigger; this is just a safety check.'.
>
> 1- num_partitions is long and nbuckets is int, so could there be any
> case where num_partition is bigger than MAX_INT and cause an infinite
> loop?
> 2- Although we assume both nbuckets and num_partition initialized as
> the same type, (*nbuckets) <<= 1 will cause an infinite loop if
> num_partition is bigger than MAX_TYPE / 2.
>
> So I think that the solution is to confirm that num_partition <
> MAX_NBUCKETS_TYPE / 2, what do you think?
>
>
Your concern is valid. This has been addressed in the existing code by
calling next_pow2_int() on num_partitions before running the function.
Additionally, I am not adding any new code to the compute_buckets_and_segs
function. I am simply moving part of the init_tab() code into a separate
function
for reuse.

Please find attached the updated and rebased patches.

Thank you,
Rahila Syed


Attachments:

  [application/octet-stream] v4-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch (6.3K, 3-v4-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch)
  download | inline diff:
From 56c1ca94ef109f484a01496f09e9aba2bed0a3a9 Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 6 Mar 2025 20:32:27 +0530
Subject: [PATCH 2/2] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct to allocate
and track all the related memory allocations under single
---
 src/backend/storage/lmgr/predicate.c | 17 +++++++-------
 src/backend/storage/lmgr/proc.c      | 33 +++++++++++++++++++++++-----
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..dd66990335 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,8 +1226,11 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1242,9 +1245,7 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
 		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
@@ -1299,9 +1300,11 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = mul_size((Size) max_table_size,
+						   RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 RWConflictPoolHeaderDataSize + requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1309,9 +1312,7 @@ PredicateLockShmemInit(void)
 		int			i;
 
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
 		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e4ca861a8e..65239d743d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,6 +88,7 @@ static void RemoveProcFromArray(int code, Datum arg);
 static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
+static Size PGProcShmemSize(void);
 
 
 /*
@@ -175,6 +176,7 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +206,10 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -218,11 +223,11 @@ InitProcGlobal(void)
 	 * how hotly they are accessed.
 	 */
 	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+		(TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/*
@@ -233,7 +238,7 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,10 +335,26 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock", sizeof(slock_t), &found);
 	SpinLockInit(ProcStructLock);
 }
 
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	return size;
+}
+
 /*
  * InitProcess -- initialize a per-process PGPROC entry for this backend
  */
-- 
2.34.1



  [application/octet-stream] v4-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch (13.5K, 4-v4-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch)
  download | inline diff:
From 2352d4918f3c5b548790de8e2a673759d83b6f96 Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 6 Mar 2025 20:06:20 +0530
Subject: [PATCH 1/2] Account for initial shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and control structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Include the allocation of segments, buckets and elements in the initial
allocation of shared hash directory. This results in the existing ShmemIndex
entries to reflect all these allocations. The resulting tuples in
pg_shmem_allocations represent the total size of the initial hash table
including all the buckets and the elements they contain, instead of just
the directory size.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 220 ++++++++++++++++++++++--------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 169 insertions(+), 57 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..d8aed0bfaa 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d..550f04359a 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -265,7 +265,7 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +281,9 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static int	compute_buckets_and_segs(long nelem, int *nbuckets,
+									 long num_partitions, long ssize);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
 
 /*
  * memory allocation support
@@ -353,6 +356,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +511,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose number of entries to allocate at a time */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +572,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *curr_offset = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +611,14 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements
+		 */
+		nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+
+		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize * sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -609,11 +636,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
+			HASHELEMENT *firstElement;
+			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Memory is allocated as part of initial allocation in
+			 * ShmemInitHash
+			 */
+			firstElement = (HASHELEMENT *) curr_offset;
+			curr_offset = (((char *) curr_offset) + (temp * elementSize));
+			element_add(hashp, firstElement, i, temp);
 		}
 	}
 
@@ -701,30 +733,11 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +750,28 @@ init_htab(HTAB *hashp, long nelem)
 			return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) (((char *) hashp->hctl) + sizeof(HASHHDR));
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
-	}
+		*segp = (HASHBUCKET *) (((char *) hashp->hctl)
+								+ sizeof(HASHHDR)
+								+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
+								+ (i * sizeof(HASHBUCKET) * hashp->ssize));
+		MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+		i = i + 1;
+	}
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -851,11 +866,64 @@ hash_select_dirsize(long num_entries)
  * and for the (non expansible) directory.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true;
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements are
+	 * allocated only if they are less than nelem_alloc. In any case, the
+	 * init_size should be equal to the number of elements added using
+	 * element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be atleast equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT) +
+		+sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1353,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1391,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, freelist_idx, hctl->nelem_alloc);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,30 +1770,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * Link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
 
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1827,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2038,32 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+{
+	int			nsegs;
+
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	nsegs = ((*nbuckets) - 1) / ssize + 1;
+	nsegs = next_pow2_int(nsegs);
+	return nsegs;
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..79b959ffc3 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1



^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
  2025-03-23 08:36   ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
@ 2025-03-24 13:24     ` Tomas Vondra <[email protected]>
  2025-03-27 12:02       ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Tomas Vondra @ 2025-03-24 13:24 UTC (permalink / raw)
  To: Rahila Syed <[email protected]>; Nazir Bilal Yavuz <[email protected]>; +Cc: Andres Freund <[email protected]>; pgsql-hackers

Hi,

I did a review on v3 of the patch. I see there's been some minor changes
in v4 - I haven't tried to adjust my review, but I assume most of my
comments still apply.

Most of my suggestions are about formatting and/or readability. Some of
the likes (e.g. the pointer arithmetics) got so long pgindent would have
to wrap them, but more importantly really hard to decipher what it does.

I've added a couple "review" commits, actually doing most of what I'm
going to suggest.


1) I don't quite understand why hash_get_shared_size() got renamed to
hash_get_init_size()? Why is that needed? Does it cover only some
initial allocation, and there's additional memory allocated later? And
what's the point of the added const?


2) I find it more readable when ShmemInitStruct() is formatted on
multiple lines. Yes, it's a matter of choice, but also other places do
it this way, I think.


3) The changes in hash_create() are a good example of readability issues
I already mentioned. Consider this line:

curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize *
sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));

First, I'd point out this is not an offset but a pointer, so the
variable name is a bit misleading. But more importantly, I envy those
who can parse this in their head and see if it's correct.

I think it's much better to define a couple macros to calculate parts of
this, essentially a tiny "language" expressing this in a concise way.
The 0002 patch defines

- HASH_ELEMENTS
- HASH_ELEMENT_NEXT
- HASH_SEGMENT_PTR
- HASH_SEGMENT_SIZE
- ...

and then uses that in hash_create(). I believe it's way more readable
like this.


4) I find it a bit strange when a function calculates multiple values,
and then returns them in different ways - one as a return value, one
through a pointer, the way compute_buckets_and_segs() did.

There are cases when it makes sense (e.g. when one of the values is
returned only optionally), but otherwise I think it's better to return
both values in the same way through a pointer. The 0002 patch adjusts
compute_buckets_and_segs() to it like this.

We have a precedent for this - ExecChooseHashTableSize(), which is doing
a very similar thing for sizing hashjoin hash table.


5) Isn't it wrong that PredicateLockShmemInit() removes the ShmemAlloc()
along with calculating the per-element requestSize, but then still does

    memset(PredXact->element, 0, requestSize);

and

    memset(RWConflictPool->element, 0, requestSize);

with requestSize for the whole allocation? I haven't seen any crashes,
but this seems wrong to me. I believe we can simply zero the whole
allocation right after ShmemInitStruct(), there's no need to do this for
each individual element.

5) InitProcGlobal is another example of hard-to-read code. Admittedly,
it wasn't particularly readable before, but making the lines even longer
does not make it much better ...

I guess adding a couple macros similar to hash_create() would be one way
to improve this (and there's even a review comment in that sense), but I
chose to just do maintain a simple pointer. But if you think it should
be done differently, feel free to do so.


6) I moved the PGProcShmemSize() to be right after ProcGlobalShmemSize()
which seems to be doing a very similar thing, and to not require the
prototype. Minor detail, feel free to undo.


7) I think the PG_CACHE_LINE_SIZE is entirely unrelated to the rest of
the patch, and I see no reason to do it in the same commit. So 0005
removes this change, and 0006 reintroduces it separately.

FWIW I see no justification for why the cache line padding is useful,
and it seems quite unlikely this would benefit anything, consiering it
adds padding between fairly long arrays. What kind of contention can we
get there? I don't get it.

Also, why is the patch adding padding after statusFlags (the last array
allocated in InitProcGlobal) and not between allProcs and xids?



regards

-- 
Tomas Vondra


Attachments:

  [text/x-patch] v20250324-0001-Account-for-initial-shared-memory-allocate.patch (13.2K, 2-v20250324-0001-Account-for-initial-shared-memory-allocate.patch)
  download | inline diff:
From f527909dda02b4c7231db53a0fe6cecbaec55ca4 Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 6 Mar 2025 20:06:20 +0530
Subject: [PATCH v20250324 1/6] Account for initial shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and control structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Include the allocation of segments, buckets and elements in the initial
allocation of shared hash directory. This results in the existing ShmemIndex
entries to reflect all these allocations. The resulting tuples in
pg_shmem_allocations represent the total size of the initial hash table
including all the buckets and the elements they contain, instead of just
the directory size.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 209 ++++++++++++++++++++++--------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 161 insertions(+), 54 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..d8aed0bfaaf 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d8..ba785f1d5f2 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -265,7 +265,7 @@ static long hash_accesses,
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +281,9 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static int	compute_buckets_and_segs(long nelem, int *nbuckets,
+									 long num_partitions, long ssize);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
 
 /*
  * memory allocation support
@@ -353,6 +356,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +511,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose number of entries to allocate at a time */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start
+	 * of each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +572,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *curr_offset = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +611,15 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		/*
+		 * If table is shared, calculate the offset at which to find the the
+		 * first partition of elements
+		 */
+
+		nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+
+		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize * sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -609,11 +637,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
+			HASHELEMENT *firstElement;
+			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Memory is allocated as part of initial allocation in
+			 * ShmemInitHash
+			 */
+			firstElement = (HASHELEMENT *) curr_offset;
+			curr_offset = (((char *) curr_offset) + (temp * elementSize));
+			element_add(hashp, firstElement, i, temp);
 		}
 	}
 
@@ -701,30 +734,11 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -741,22 +755,21 @@ init_htab(HTAB *hashp, long nelem)
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) (((char *) hashp->hctl) + sizeof(HASHHDR));
 	}
 
 	/* Allocate initial segments */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
-	}
+		*segp = (HASHBUCKET *) (((char *) hashp->hctl)
+								+ sizeof(HASHHDR)
+								+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
+								+ (i * sizeof(HASHBUCKET) * hashp->ssize));
+		MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+		i = i + 1;
+	}
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -851,11 +864,64 @@ hash_select_dirsize(long num_entries)
  * and for the (non expansible) directory.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true;
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements
+	 * are allocated only if they are less than nelem_alloc.
+	 * In any case, the init_size should be equal to the number of
+	 * elements added using element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be atleast equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT) +
+		+sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1351,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1389,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, freelist_idx, hctl->nelem_alloc);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1702,28 +1770,38 @@ seg_alloc(HTAB *hashp)
 /*
  * allocate some new elements and link them into the indicated free list
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
 
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
+
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1822,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2033,32 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+{
+	int			nsegs;
+
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	nsegs = ((*nbuckets) - 1) / ssize + 1;
+	nsegs = next_pow2_int(nsegs);
+	return nsegs;
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d9..5e513c8116a 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+					long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0



  [text/x-patch] v20250324-0002-review.patch (10.3K, 3-v20250324-0002-review.patch)
  download | inline diff:
From d737c4033a92f070c3c3ad791fdef2c8e133394f Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Sat, 22 Mar 2025 12:20:44 +0100
Subject: [PATCH v20250324 2/6] review

---
 src/backend/storage/ipc/shmem.c   |   4 +-
 src/backend/utils/hash/dynahash.c | 112 ++++++++++++++++++++----------
 src/include/utils/hsearch.h       |   4 +-
 3 files changed, 81 insertions(+), 39 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index d8aed0bfaaf..51ad68cc796 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -345,9 +345,11 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	infoP->alloc = ShmemAllocNoError;
 	hash_flags |= HASH_SHARED_MEM | HASH_ALLOC | HASH_DIRSIZE;
 
+	/* review: I don't see a reason to rename this to _init_ */
+
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_init_size(infoP, hash_flags, init_size, 0),
+							   hash_get_shared_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index ba785f1d5f2..9792377d567 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,6 +260,30 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+/* review: shouldn't this have some MAXALIGNs? */
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
+
+#define HASH_ELEMENTS(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	(HASHSEGMENT) ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp)	((hashp)->ssize * sizeof(HASHBUCKET))
+
+#define	HASH_DIRECTORY(hashp)	(HASHSEGMENT *) (((char *) (hashp)->hctl) + sizeof(HASHHDR))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))))
+
 /*
  * Private function prototypes
  */
@@ -281,9 +305,16 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
-static int	compute_buckets_and_segs(long nelem, int *nbuckets,
-									 long num_partitions, long ssize);
-static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
+/*
+ * review: it's a bit weird to have a function that calculates two values,
+ * but returns one through a pointer and another as a regular retval. see
+ * ExecChooseHashTableSize for a better approach.
+ */
+static void	compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize, /* segment size */
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int freelist_idx, int nelem);
 
 /*
  * memory allocation support
@@ -511,7 +542,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
-	/* Choose number of entries to allocate at a time */
+	/* Choose the number of entries to allocate at a time. */
 	nelem_batch = choose_nelem_alloc(info->entrysize);
 
 	/*
@@ -521,7 +552,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 */
 	if (!hashp->hctl)
 	{
-		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
+		Size	size = hash_get_shared_size(info, flags, nelem, nelem_batch);
 
 		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
@@ -572,6 +603,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
 	hctl->nelem_alloc = nelem_batch;
 
 	/* make local copies of heavily-used constant fields */
@@ -598,7 +630,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
-		void	   *curr_offset = NULL;
+		void	   *ptr = NULL;		/* XXX it's not offset, clearly */
 		int			nsegs;
 		int			nbuckets;
 
@@ -612,13 +644,18 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 			freelist_partitions = 1;
 
 		/*
-		 * If table is shared, calculate the offset at which to find the the
-		 * first partition of elements
+		 * For shared hash tables, we need to determine the offset for the
+		 * first partition of elements. We have to skip space for the header,
+		 * segments and buckets.
+		 *
+	 	 * XXX can we rely on this matching the calculation in hash_get_shared_size?
+		 * or could/should we add some asserts? Can we have at least some sanity checks
+		 * on nbuckets/nsegs?
 		 */
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
 
-		nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
-
-		curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (hctl->dsize * sizeof(HASHSEGMENT)) + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+		ptr = HASH_ELEMENTS(hashp, nsegs);
 
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
@@ -637,16 +674,15 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
-			HASHELEMENT *firstElement;
-			Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 
 			/*
-			 * Memory is allocated as part of initial allocation in
-			 * ShmemInitHash
+			 * Memory is allocated as part of allocation in ShmemInitHash, we
+			 * just need to split that allocation per-batch freelists.
+			 *
+			 * review: why do we invert the argument order?
 			 */
-			firstElement = (HASHELEMENT *) curr_offset;
-			curr_offset = (((char *) curr_offset) + (temp * elementSize));
-			element_add(hashp, firstElement, i, temp);
+			element_add(hashp, (HASHELEMENT *) ptr, i, temp);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -734,7 +770,8 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
@@ -755,22 +792,19 @@ init_htab(HTAB *hashp, long nelem)
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *) (((char *) hashp->hctl) + sizeof(HASHHDR));
+		hashp->dir = HASH_DIRECTORY(hashp);
 	}
 
 	/* Allocate initial segments */
 	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = (HASHBUCKET *) (((char *) hashp->hctl)
-								+ sizeof(HASHHDR)
-								+ (hashp->hctl->dsize * sizeof(HASHSEGMENT))
-								+ (i * sizeof(HASHBUCKET) * hashp->ssize));
-		MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
-
-		i = i + 1;
+		*segp = HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
+	Assert(i == nsegs);
+
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
 			"TABLE POINTER   ", hashp,
@@ -862,9 +896,14 @@ hash_select_dirsize(long num_entries)
  * Compute the required initial memory allocation for a shared-memory
  * hashtable with the given parameters.  We need space for the HASHHDR
  * and for the (non expansible) directory.
+ *
+ * review: why rename this to "init"
+ * review: also, why adding the "const"?
+ * review: the comment should really explain the arguments, e.g. what
+ * is nelem_alloc for is not obvious?
  */
 Size
-hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
+hash_get_shared_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
 	int			nbuckets;
 	int			nsegs;
@@ -914,7 +953,8 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 	else
 		ssize = DEF_SEGSIZE;
 
-	nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+	compute_buckets_and_segs(init_size, num_partitions, ssize,
+							 &nbuckets, &nsegs);
 
 	if (!element_alloc)
 		init_size = 0;
@@ -1791,6 +1831,7 @@ element_alloc(HTAB *hashp, int nelem)
 	return firstElement;
 }
 
+/* review: comment needed, how is the work divided between element_alloc and this? */
 static void
 element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
 {
@@ -2034,11 +2075,11 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 	}
 }
 
-static int
-compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+/* review: comment needed */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
 {
-	int			nsegs;
-
 	/*
 	 * Allocate space for the next greater power of two number of buckets,
 	 * assuming a desired maximum load factor of 1.
@@ -2058,7 +2099,6 @@ compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ss
 	/*
 	 * Figure number of directory segments needed, round up to a power of 2
 	 */
-	nsegs = ((*nbuckets) - 1) / ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-	return nsegs;
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
 }
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 5e513c8116a..e78ab38b4e8 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,8 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_init_size(const HASHCTL *info, int flags,
-					long init_size, int nelem_alloc);
+extern Size hash_get_shared_size(const HASHCTL *info, int flags,
+								 long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0



  [text/x-patch] v20250324-0003-Replace-ShmemAlloc-calls-by-ShmemInitStruc.patch (6.4K, 4-v20250324-0003-Replace-ShmemAlloc-calls-by-ShmemInitStruc.patch)
  download | inline diff:
From bd743402321ae01e27348250865f95290d37b842 Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 6 Mar 2025 20:32:27 +0530
Subject: [PATCH v20250324 3/6] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct to allocate
and track all the related memory allocations under single
---
 src/backend/storage/lmgr/predicate.c | 17 +++++++-------
 src/backend/storage/lmgr/proc.c      | 33 +++++++++++++++++++++++-----
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a053981..dd66990335b 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,8 +1226,11 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1242,9 +1245,7 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
 		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
@@ -1299,9 +1300,11 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = mul_size((Size) max_table_size,
+						   RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 RWConflictPoolHeaderDataSize + requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
@@ -1309,9 +1312,7 @@ PredicateLockShmemInit(void)
 		int			i;
 
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
 		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e4ca861a8e6..65239d743da 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,6 +88,7 @@ static void RemoveProcFromArray(int code, Datum arg);
 static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
+static Size PGProcShmemSize(void);
 
 
 /*
@@ -175,6 +176,7 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +206,10 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -218,11 +223,11 @@ InitProcGlobal(void)
 	 * how hotly they are accessed.
 	 */
 	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+		(TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/*
@@ -233,7 +238,7 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,10 +335,26 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock", sizeof(slock_t), &found);
 	SpinLockInit(ProcStructLock);
 }
 
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	return size;
+}
+
 /*
  * InitProcess -- initialize a per-process PGPROC entry for this backend
  */
-- 
2.49.0



  [text/x-patch] v20250324-0004-review.patch (7.5K, 5-v20250324-0004-review.patch)
  download | inline diff:
From eac38c347318a0ee074b25d487de66575a242e12 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Sat, 22 Mar 2025 12:50:38 +0100
Subject: [PATCH v20250324 4/6] review

---
 src/backend/storage/lmgr/predicate.c | 21 +++++---
 src/backend/storage/lmgr/proc.c      | 75 +++++++++++++++++++---------
 2 files changed, 66 insertions(+), 30 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index dd66990335b..aeba84de786 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1237,6 +1237,9 @@ PredicateLockShmemInit(void)
 	{
 		int			i;
 
+		/* reset everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1247,7 +1250,9 @@ PredicateLockShmemInit(void)
 		PredXact->HavePartialClearedThrough = 0;
 		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
+		/* XXX wasn't this actually wrong, considering requestSize is the whole
+		 * shmem allocation, including PredXactListDataSize? */
+		// memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1300,21 +1305,25 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
-	requestSize = mul_size((Size) max_table_size,
-						   RWConflictDataSize);
+	requestSize = RWConflictPoolHeaderDataSize +
+					mul_size((Size) max_table_size,
+							 RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize + requestSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+			RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 65239d743da..4afd7cd42c3 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,7 +88,6 @@ static void RemoveProcFromArray(int code, Datum arg);
 static void ProcKill(int code, Datum arg);
 static void AuxiliaryProcKill(int code, Datum arg);
 static void CheckDeadLock(void);
-static Size PGProcShmemSize(void);
 
 
 /*
@@ -124,6 +123,27 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * review: add comment, explaining the PG_CACHE_LINE_SIZE thing
+ * review: I'd even maybe split the PG_CACHE_LINE_SIZE thing into
+ * a separate commit, not to mix it with the "monitoring improvement"
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
  */
@@ -177,6 +197,7 @@ InitProcGlobal(void)
 	Size		fpLockBitsSize,
 				fpRelIdSize;
 	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -208,7 +229,12 @@ InitProcGlobal(void)
 	 */
 	requestSize = PGProcShmemSize();
 
-	procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+	ptr = ShmemInitStruct("PGPROC structures",
+									   requestSize,
+									   &found);
+
+	procs = (PGPROC *) ptr;
+	ptr += TotalProcs * sizeof(PGPROC);
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -221,14 +247,27 @@ InitProcGlobal(void)
 	 *
 	 * XXX: It might make sense to increase padding for these arrays, given
 	 * how hotly they are accessed.
+	 *
+	 * review: does the padding comment still make sense with PG_CACHE_LINE_SIZE?
+	 *         presumably that's the padding mentioned by the comment?
+	 *
+	 * review: those lines are too long / not comprehensible, let's define some
+	 *         macros to calculate stuff?
 	 */
-	ProcGlobal->xids =
-		(TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
+	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
+	ptr += TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
+	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE;
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags) + PG_CACHE_LINE_SIZE;
+
+	/* make sure wer didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -238,7 +277,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -335,26 +376,12 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock", sizeof(slock_t), &found);
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-static Size
-PGProcShmemSize(void)
-{
-	Size		size;
-	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
-
-	size = TotalProcs * sizeof(PGPROC);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	return size;
-}
-
 /*
  * InitProcess -- initialize a per-process PGPROC entry for this backend
  */
-- 
2.49.0



  [text/x-patch] v20250324-0005-remove-cacheline.patch (1.8K, 6-v20250324-0005-remove-cacheline.patch)
  download | inline diff:
From 123e403ed6bfca929b3a33afe210eacc38dd73ad Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Sat, 22 Mar 2025 15:22:08 +0100
Subject: [PATCH v20250324 5/6] remove cacheline

---
 src/backend/storage/lmgr/proc.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 4afd7cd42c3..f7957eb008b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -136,11 +136,8 @@ PGProcShmemSize(void)
 
 	size = TotalProcs * sizeof(PGPROC);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
 	return size;
 }
 
@@ -256,15 +253,15 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr += TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
+	ptr += TotalProcs * sizeof(*ProcGlobal->xids);
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE;
+	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates);
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags) + PG_CACHE_LINE_SIZE;
+	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags);
 
 	/* make sure wer didn't overflow */
 	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
-- 
2.49.0



  [text/x-patch] v20250324-0006-add-cacheline-padding-back.patch (2.4K, 7-v20250324-0006-add-cacheline-padding-back.patch)
  download | inline diff:
From 8f54aefb9e3e83973d0a8adf6dc84fe876a0f858 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Sat, 22 Mar 2025 15:23:43 +0100
Subject: [PATCH v20250324 6/6] add cacheline padding back

---
 src/backend/storage/lmgr/proc.c | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index f7957eb008b..e9c22f03f27 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -136,8 +136,11 @@ PGProcShmemSize(void)
 
 	size = TotalProcs * sizeof(PGPROC);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	return size;
 }
 
@@ -242,26 +245,22 @@ InitProcGlobal(void)
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
 	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
-	 *
-	 * review: does the padding comment still make sense with PG_CACHE_LINE_SIZE?
-	 *         presumably that's the padding mentioned by the comment?
+	 * review: shouldn't the first cacheline padding be right after "procs", before "xids"?
 	 *
 	 * review: those lines are too long / not comprehensible, let's define some
 	 *         macros to calculate stuff?
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr += TotalProcs * sizeof(*ProcGlobal->xids);
+	ptr += TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates);
+	ptr += TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags);
+	ptr += TotalProcs * sizeof(*ProcGlobal->statusFlags) + PG_CACHE_LINE_SIZE;
 
 	/* make sure wer didn't overflow */
 	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
-- 
2.49.0



^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
  2025-03-23 08:36   ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-24 13:24     ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
@ 2025-03-27 12:02       ` Rahila Syed <[email protected]>
  2025-03-27 12:56         ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Rahila Syed @ 2025-03-27 12:02 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

Hi Tomas,


> I did a review on v3 of the patch. I see there's been some minor changes
> in v4 - I haven't tried to adjust my review, but I assume most of my
> comments still apply.
>
>
Thank you for your interest and review.


> 1) I don't quite understand why hash_get_shared_size() got renamed to
> hash_get_init_size()? Why is that needed? Does it cover only some
> initial allocation, and there's additional memory allocated later? And
> what's the point of the added const?
>

Earlier, this function was used to calculate the size for shared hash
tables only.
Now, it also calculates the size for a non-shared hash table. Hence the
change
of name.

If I don't change the argument to const, I get a compilation error as
follows:
"passing argument 1 of ‘hash_get_init_size’ discards ‘const’ qualifier from
pointer target type."
This is due to this function being called from hash_create which considers
HASHCTL to be
a const.


>
> 5) Isn't it wrong that PredicateLockShmemInit() removes the ShmemAlloc()
> along with calculating the per-element requestSize, but then still does
>
>     memset(PredXact->element, 0, requestSize);
>
> and
>
>     memset(RWConflictPool->element, 0, requestSize);
>
> with requestSize for the whole allocation? I haven't seen any crashes,
> but this seems wrong to me. I believe we can simply zero the whole
> allocation right after ShmemInitStruct(), there's no need to do this for
> each individual element.
>

Good catch.  Thanks for updating.


>
> 5) InitProcGlobal is another example of hard-to-read code. Admittedly,
> it wasn't particularly readable before, but making the lines even longer
> does not make it much better ...
>
> I guess adding a couple macros similar to hash_create() would be one way
> to improve this (and there's even a review comment in that sense), but I
> chose to just do maintain a simple pointer. But if you think it should
> be done differently, feel free to do so.
>
>
LGTM, long lines have been reduced by the ptr approach.


> 6) I moved the PGProcShmemSize() to be right after ProcGlobalShmemSize()
> which seems to be doing a very similar thing, and to not require the
> prototype. Minor detail, feel free to undo.
>
> LGTM.


>
> 7) I think the PG_CACHE_LINE_SIZE is entirely unrelated to the rest of
> the patch, and I see no reason to do it in the same commit. So 0005
> removes this change, and 0006 reintroduces it separately.
>
> OK.


> FWIW I see no justification for why the cache line padding is useful,
> and it seems quite unlikely this would benefit anything, consiering it
> adds padding between fairly long arrays. What kind of contention can we
> get there? I don't get it.
>

I have done that to address a review comment upthread by Andres and
the because comment above that code block also mentions need for
padding.


> Also, why is the patch adding padding after statusFlags (the last array
> allocated in InitProcGlobal) and not between allProcs and xids?
>

I agree it makes more sense this way, so changing accordingly.



>          *
> +                * XXX can we rely on this matching the calculation in
> hash_get_shared_size?
> +                * or could/should we add some asserts? Can we have at
> least some sanity checks
> +                * on nbuckets/nsegs?
>
>
 Both places rely on compute_buckets_and_segs() to calculate nbuckets and
nsegs,
 hence the probability of mismatch is low.  I am open to adding some
asserts to verify this.
 Do you have any suggestions in mind?

Please find attached updated patches after merging all your review comments
except
a few discussed above.

Thank you,
Rahila Syed


Attachments:

  [application/octet-stream] v5-0003-Add-cacheline-padding-between-heavily-accessed-array.patch (2.1K, 3-v5-0003-Add-cacheline-padding-between-heavily-accessed-array.patch)
  download | inline diff:
From 8b8fe395df974e5e7d1ae85d933c53ec86132e1d Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 27 Mar 2025 17:20:12 +0530
Subject: [PATCH 3/3] Add cacheline padding between heavily accessed arrays

---
 src/backend/storage/lmgr/proc.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6ee48410b8..edc5c7406b 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -135,8 +135,11 @@ PGProcShmemSize(void)
 	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
 	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 	return size;
 }
@@ -231,7 +234,7 @@ InitProcGlobal(void)
 									   &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *)ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *)ptr + TotalProcs * sizeof(PGPROC) + PG_CACHE_LINE_SIZE;
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -244,11 +247,11 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *)ptr + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates)) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.34.1



  [application/octet-stream] v5-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch (15.3K, 4-v5-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch)
  download | inline diff:
From 44b3fbb06dd9436c5545c2a93432d65e400a360c Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 27 Mar 2025 12:59:02 +0530
Subject: [PATCH 1/2] Account for all the shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and header structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Allocate memory for segments, buckets and elements together with the
directory and header structures. This results in the existing ShmemIndex
entries to reflect size of hash table more accurately, thus improving
the pg_shmem_allocation monitoring. Also, make this change for non-
shared hash table since they both share the hash_create code.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 265 +++++++++++++++++++++++-------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 213 insertions(+), 58 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..d8aed0bfaa 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d..1f215a16c5 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,36 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
+
+#define HASH_ELEMENTS(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (idx) * MAXALIGN(sizeof(HASHBUCKET))))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	(HASHSEGMENT) ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp)	((hashp)->ssize * MAXALIGN(sizeof(HASHBUCKET)))
+
+#define	HASH_DIRECTORY(hashp)	(HASHSEGMENT *) (((char *) (hashp)->hctl) + MAXALIGN(sizeof(HASHHDR)))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))))
+
 /*
  * Private function prototypes
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +305,11 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static void	compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize, /* segment size */
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int nelem, int freelist_idx);
 
 /*
  * memory allocation support
@@ -353,6 +382,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +537,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose the number of entries to allocate at a time. */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size	size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +625,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *ptr = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +638,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements.
+		 * We have to skip space for the header, segments and buckets.
+ 		 */
+		ptr = HASH_ELEMENTS(hashp, nsegs);
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -610,10 +666,17 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Assign the correct location of each parition within a
+			 * pre-allocated buffer.
+			 *
+			 * Actual memory allocation happens in ShmemInitHash for
+			 * shared hash tables or earlier in this function for non-shared
+			 * hash tables.
+			 * We just need to split that allocation per-batch freelists.
+			 */
+			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -701,30 +764,12 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +782,25 @@ init_htab(HTAB *hashp, long nelem)
 			return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = HASH_DIRECTORY(hashp);
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
+		*segp = HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+	Assert(i == nsegs);
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -847,15 +891,79 @@ hash_select_dirsize(long num_entries)
 
 /*
  * Compute the required initial memory allocation for a shared-memory
- * hashtable with the given parameters.  We need space for the HASHHDR
- * and for the (non expansible) directory.
+ * or non-shared memory hashtable with the given parameters.
+ * We need space for the HASHHDR, for the directory, segments and
+ * the init_size elements in buckets.	
+ * 
+ * For shared hash tables the directory size is non-expansive.
+ * 
+ * init_size should match the total number of elements allocated
+ * during hash table creation, it could be zero for non-shared hash
+ * tables depending on the value of nelem_alloc. For more explanation
+ * see comments within this function.
+ *
+ * nelem_alloc parameter is not relevant for shared hash tables.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true; /*Always true for shared hash tables */
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements are
+	 * allocated only if they are less than nelem_alloc. In any case, the
+	 * init_size should be equal to the number of elements added using
+	 * element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be atleast equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	compute_buckets_and_segs(init_size, num_partitions, ssize,
+							 &nbuckets, &nsegs);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return MAXALIGN(sizeof(HASHHDR)) + dsize * MAXALIGN(sizeof(HASHSEGMENT)) +
+		+ MAXALIGN(sizeof(HASHBUCKET)) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1393,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1431,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, hctl->nelem_alloc, freelist_idx);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,30 +1810,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * Link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
 
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1867,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2078,35 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+/*
+ * Calculate the number of buckets and segments to store the given
+ * number of elements in a hash table. Segments contain buckets which
+ * in turn contain elements.
+ */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
+{
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..79b959ffc3 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1



  [application/octet-stream] v5-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch (7.2K, 5-v5-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch)
  download | inline diff:
From 65dab85e3c0d6889cfa02da29c2b11d6dd39b56a Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 27 Mar 2025 16:43:28 +0530
Subject: [PATCH 2/2] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct to allocate
and track all the related memory allocations under single
call.
---
 src/backend/storage/lmgr/predicate.c | 27 +++++++++-----
 src/backend/storage/lmgr/proc.c      | 56 +++++++++++++++++++++++-----
 2 files changed, 63 insertions(+), 20 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..de2629fdf0 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,14 +1226,20 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* reset everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1242,11 +1248,8 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1299,21 +1302,25 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = RWConflictPoolHeaderDataSize +
+					mul_size((Size) max_table_size,
+							 RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+			RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e4ca861a8e..6ee48410b8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -123,6 +123,24 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * review: add comment, explaining the PG_CACHE_LINE_SIZE thing
+ * review: I'd even maybe split the PG_CACHE_LINE_SIZE thing into
+ * a separate commit, not to mix it with the "monitoring improvement"
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
  */
@@ -175,6 +193,8 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +224,15 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	ptr = ShmemInitStruct("PGPROC structures",
+									   requestSize,
+									   &found);
+
+	procs = (PGPROC *) ptr;
+	ptr = (char *)ptr + TotalProcs * sizeof(PGPROC);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -213,17 +241,21 @@ InitProcGlobal(void)
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
-	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
 	 */
-	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	/* make sure wer didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -233,7 +265,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,7 +364,9 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-- 
2.34.1



^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
  2025-03-23 08:36   ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-24 13:24     ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 12:02       ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
@ 2025-03-27 12:56         ` Tomas Vondra <[email protected]>
  2025-03-27 23:50           ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Tomas Vondra @ 2025-03-27 12:56 UTC (permalink / raw)
  To: Rahila Syed <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

Hi,

On 3/27/25 13:02, Rahila Syed wrote:
> 
> Hi Tomas,
> 
> 
>     I did a review on v3 of the patch. I see there's been some minor changes
>     in v4 - I haven't tried to adjust my review, but I assume most of my
>     comments still apply.
> 
>  
> Thank you for your interest and review.
> 
> 
>     1) I don't quite understand why hash_get_shared_size() got renamed to
>     hash_get_init_size()? Why is that needed? Does it cover only some
>     initial allocation, and there's additional memory allocated later? And
>     what's the point of the added const?
> 
> 
> Earlier, this function was used to calculate the size for shared hash
> tables only.
> Now, it also calculates the size for a non-shared hash table. Hence the
> change
> of name.
> 
> If I don't change the argument to const, I get a compilation error as
> follows:
> "passing argument 1 of ‘hash_get_init_size’ discards ‘const’ qualifier
> from pointer target type."
> This is due to this function being called from hash_create which
> considers HASHCTL to be
> a const. 
> 

OK, makes sense. I haven't tried removing the const, but if removing it
would trigger a compiler warning, it's probably needed.

> ...
> 
>     FWIW I see no justification for why the cache line padding is useful,
>     and it seems quite unlikely this would benefit anything, consiering it
>     adds padding between fairly long arrays. What kind of contention can we
>     get there? I don't get it.
> 
>  
> I have done that to address a review comment upthread by Andres and
> the because comment above that code block also mentions need for
> padding.
> 

Neither that seems like a sufficient justification for the change. The
comment is more of a speculation, it doesn't demonstrate the benefits
are real. It's true Andres tends to be right about these things, but
still, I'd like to see some argument for why this helps.

How exactly does padding before/after the whole array help anything?

In fact, is this the padding Andres (or the comment) suggested? The
comment says:

 * XXX: It might make sense to increase padding for these arrays, given
 * how hotly they are accessed.

Doesn't "padding of array" actually suggest padding each element, not
the array as a whole?

In any case, if the 0003 patch addresses the padding, it should also
remove the XXX comment.

> 
>     Also, why is the patch adding padding after statusFlags (the last array
>     allocated in InitProcGlobal) and not between allProcs and xids?
> 
>  
> I agree it makes more sense this way, so changing accordingly.
>  
>  
> 
>              *
>     +                * XXX can we rely on this matching the calculation
>     in hash_get_shared_size?
>     +                * or could/should we add some asserts? Can we have
>     at least some sanity checks
>     +                * on nbuckets/nsegs?
> 
>  
>  Both places rely on compute_buckets_and_segs() to calculate nbuckets
> and nsegs,
>  hence the probability of mismatch is low.  I am open to adding some
> asserts to verify this.
>  Do you have any suggestions in mind?  
> 
> Please find attached updated patches after merging all your review
> comments except
> a few discussed above.
>  

OK, I don't have any other comments for 0001 and 0002.  I'll do some
more review and polishing on those, and will get them committed soon.

I don't plan to push 0003, unless someone can actually explain and
demonstrate the benefits of the proposed padding,


regards

-- 
Tomas Vondra






^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
  2025-03-23 08:36   ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-24 13:24     ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 12:02       ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-27 12:56         ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
@ 2025-03-27 23:50           ` Tomas Vondra <[email protected]>
  2025-03-28 11:10             ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Tomas Vondra @ 2025-03-27 23:50 UTC (permalink / raw)
  To: Rahila Syed <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers



On 3/27/25 13:56, Tomas Vondra wrote:
> ...
> 
> OK, I don't have any other comments for 0001 and 0002.  I'll do some
> more review and polishing on those, and will get them committed soon.
> 

Actually ... while polishing 0001 and 0002, I noticed a couple more
details that I'd like to ask about. I also ran pgindent and tweaked the
formatting, so some of the diff is caused by that.


1) alignment

There was a comment with a question whether we need to MAXALIGN the
chunks in dynahash.c, which were originally allocated by ShmemAlloc, but
now it's part of one large allocation, which is then cut into pieces
(using pointer arithmetics).

I was not sure whether we need to enforce some alignment, we briefly
discussed that off-list. I realize you chose to add the alignment, but I
haven't noticed any comment in the patch why it's needed, and it seems
to me it may not be quite correct.

Let me explain what I had in mind, and why I think the way v5 doesn't
actually do that. It took me a while before I understood what alignment
is about, and for a while it was haunting my patches, so hopefully this
will help others ...

The "alignment" is about pointers (or addresses), and when a pointer is
aligned it means the address is a multiple of some number. For example
4B-aligned pointer is a multiple of 4B, so 0x00000100 is 4B-aligned,
while 0x00000101 is not. Sometimes we use data types to express the
alignment, e.g. int-aligned is 4B-aligned, but that's a detail. AFAIK
the alignment is always 2^k, so 1, 2, 4, 8, ...

The primary reason for alignment is that some architectures require the
pointers to be well-aligned for a given data type. For example (int*)
needs to be int-aligned. If you have a pointer that's not 4B-aligned,
it'll trigger SIGBUS or maybe SIGSEGV. This was true for architectures
like powerpc, I don't think x86/arm64 have this restriction, i.e. it'd
work, even if there might be a minor performance impact. Anyway, we
still enforce/expect correct alignment, because we may still support
some of those alignment-sensitive platforms, and it's "tidy".

The other reason is that we sometimes use alignment to add padding, to
reduce contention when accessing elements in hot arrays. We want to
align to cacheline boundaries, so that a struct does not require
accessing more cachelines than really necessary. And also to reduce
contention - the more cachelines, the higher the risk of contention.

Now, back to the patch. The code originally did this in ShmemInitStruct

    hashp = ShmemInitStruct(...)

to allocate the hctl, and then

    firstElement = (HASHELEMENT *) ShmemAlloc(nelem * elementSize);

in element_alloc(). But this means the "elements" allocation is aligned
to PG_CACHE_LINE_SIZE, i.e. 128B, because ShmemAllocRaw() does this:

    size = CACHELINEALIGN(size);

So it distributes memory in multiples of 128B, and I believe it starts
at a multiple of 128B.

But the patch reworks this to allocate everything at once, and thus it
won't get this alignment automatically. AFAIK that's not intentional,
because no one explicitly mentioned this. And it's may not be quite
desirable, judging by the comment in ShmemAllocRaw().

I mentioned v5 adds alignment, but I think it does not quite do that
quite correctly. It adds alignment by changing the macros from:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

to

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))

First, it uses MAXALIGN, but that's mostly my fault, because my comment
suggested that - the ShmemAllocRaw however and makes the case for using
CACHELINEALIGN.

But more importantly, it adds alignment to all hctl field, and to every
element of those arrays. But that's not what the alignment was supposed
to do - it was supposed to align arrays, not individual elements. Not
only would this waste memory, it would actually break direct access to
those array elements.

So something like this might be more correct:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 MAXALIGN((hctl)->dsize * izeof(HASHSEGMENT)) + \
+	 MAXALIGN((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

But there's another detail - even before this patch, most of the stuff
was allocated at once by ShmemInitStruct(). Everything except for the
elements, so to replicate the alignment we only need to worry about that
last part. So I think this should do:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+    CACHELINEALIGN(sizeof(HASHHDR) + \
+     ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+     ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

This is what the 0003 patch does. There's still one minor difference, in
that we used to align each segment independently - each element_alloc()
call allocated a new CACHELINEALIGN-ed chunk, while now have just a
single chunk. But I think that's OK.

In fact, I'm starting to wonder if we need to worry about this at all.
Maybe it's not that significant for dynahash after all - otherwise we
actually should align/pad the individual elements, not just the array as
a whole, I think.


2) I do think the last patch should use CACHELINEALIGN() too, instead of
adding PG_CACHE_LINE_SIZE to the sizes (which does not ensure good
alignment, it just means there's gap between the allocated parts). But I
still don't know what the intention was, so hard to say ...


3) I find the comment before hash_get_init_size a bit unclear/confusing.
It says this:

 * init_size should match the total number of elements allocated during
 * hash table creation, it could be zero for non-shared hash tables
 * depending on the value of nelem_alloc. For more explanation see
 * comments within this function.
 *
 * nelem_alloc parameter is not relevant for shared hash tables.

What does "should match" mean here? Doesn't it *determine* the number of
elements allocated? What if it doesn't match?

AFAICS it means the hash table is sized to expect init_size elements,
but only nelem_alloc elements are actually pre-allocated, right? But the
comment says it's init_size which determines the number of elements
allocated during creation. Confusing.

It says "it could be zero ... depending on the value of nelem_alloc".
Depending how? What's the relationship.

Then it says "nelem_alloc parameter is not relevant for shared hash
tables". What does "not relevant" mean? It should say it's ignored.

The bit "For more explanation see comments within this function" is not
great, if only because there are not many comments within the function,
so there's no "more explanation". But if there's something important, it
should be in the main comment, preferably.



regards

-- 
Tomas Vondra


Attachments:

  [text/x-patch] v6-0001-Account-for-all-the-shared-memory-allocated-by-ha.patch (15.4K, 2-v6-0001-Account-for-all-the-shared-memory-allocated-by-ha.patch)
  download | inline diff:
From a0ab96326687ff8eb9ff3fd8343cff545226fd0a Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 27 Mar 2025 12:59:02 +0530
Subject: [PATCH v6 1/7] Account for all the shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and header structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Allocate memory for segments, buckets and elements together with the
directory and header structures. This results in the existing ShmemIndex
entries to reflect size of hash table more accurately, thus improving
the pg_shmem_allocation monitoring. Also, make this change for non-
shared hash table since they both share the hash_create code.
---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 265 +++++++++++++++++++++++-------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 213 insertions(+), 58 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..d8aed0bfaaf 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags, init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d8..1f215a16c51 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,36 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
+
+#define HASH_ELEMENTS(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(MAXALIGN(sizeof(HASHHDR)) + \
+	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+	 ((hctl)->ssize * (idx) * MAXALIGN(sizeof(HASHBUCKET))))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	(HASHSEGMENT) ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp)	((hashp)->ssize * MAXALIGN(sizeof(HASHBUCKET)))
+
+#define	HASH_DIRECTORY(hashp)	(HASHSEGMENT *) (((char *) (hashp)->hctl) + MAXALIGN(sizeof(HASHHDR)))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))))
+
 /*
  * Private function prototypes
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +305,11 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static void	compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize, /* segment size */
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int nelem, int freelist_idx);
 
 /*
  * memory allocation support
@@ -353,6 +382,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +537,19 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose the number of entries to allocate at a time. */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size	size = hash_get_init_size(info, flags, nelem, nelem_batch);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +598,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -582,6 +625,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *ptr = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +638,16 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements.
+		 * We have to skip space for the header, segments and buckets.
+ 		 */
+		ptr = HASH_ELEMENTS(hashp, nsegs);
+
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -610,10 +666,17 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			/*
+			 * Assign the correct location of each parition within a
+			 * pre-allocated buffer.
+			 *
+			 * Actual memory allocation happens in ShmemInitHash for
+			 * shared hash tables or earlier in this function for non-shared
+			 * hash tables.
+			 * We just need to split that allocation per-batch freelists.
+			 */
+			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -701,30 +764,12 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +782,25 @@ init_htab(HTAB *hashp, long nelem)
 			return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = HASH_DIRECTORY(hashp);
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
+		*segp = HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+	Assert(i == nsegs);
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -847,15 +891,79 @@ hash_select_dirsize(long num_entries)
 
 /*
  * Compute the required initial memory allocation for a shared-memory
- * hashtable with the given parameters.  We need space for the HASHHDR
- * and for the (non expansible) directory.
+ * or non-shared memory hashtable with the given parameters.
+ * We need space for the HASHHDR, for the directory, segments and
+ * the init_size elements in buckets.	
+ * 
+ * For shared hash tables the directory size is non-expansive.
+ * 
+ * init_size should match the total number of elements allocated
+ * during hash table creation, it could be zero for non-shared hash
+ * tables depending on the value of nelem_alloc. For more explanation
+ * see comments within this function.
+ *
+ * nelem_alloc parameter is not relevant for shared hash tables.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	bool		element_alloc = true; /*Always true for shared hash tables */
+	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
+	/*
+	 * For non-shared hash tables, the requested number of elements are
+	 * allocated only if they are less than nelem_alloc. In any case, the
+	 * init_size should be equal to the number of elements added using
+	 * element_add() in hash_create.
+	 */
+	if (!(flags & HASH_SHARED_MEM))
+	{
+		if (init_size > nelem_alloc)
+			element_alloc = false;
+	}
+	else
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be atleast equal to the freelists */
+		if (init_size < NUM_FREELISTS)
+			init_size = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	compute_buckets_and_segs(init_size, num_partitions, ssize,
+							 &nbuckets, &nsegs);
+
+	if (!element_alloc)
+		init_size = 0;
+
+	return MAXALIGN(sizeof(HASHHDR)) + dsize * MAXALIGN(sizeof(HASHSEGMENT)) +
+		+ MAXALIGN(sizeof(HASHBUCKET)) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1393,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		newElement = element_alloc(hashp, hctl->nelem_alloc);
+		if (newElement == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1431,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, hctl->nelem_alloc, freelist_idx);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,30 +1810,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
 	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * Link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
 
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -1744,8 +1867,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2078,35 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+/*
+ * Calculate the number of buckets and segments to store the given
+ * number of elements in a hash table. Segments contain buckets which
+ * in turn contain elements.
+ */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
+{
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d9..79b959ffc3c 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   long init_size, int nelem_alloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0



  [text/x-patch] v6-0002-review.patch (10.5K, 3-v6-0002-review.patch)
  download | inline diff:
From 5d504b93f4880c7d1836304b7ad1a212bd5a5c58 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Thu, 27 Mar 2025 20:52:31 +0100
Subject: [PATCH v6 2/7] review

---
 src/backend/storage/ipc/shmem.c   |   3 +-
 src/backend/utils/hash/dynahash.c | 127 ++++++++++++++++--------------
 2 files changed, 70 insertions(+), 60 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index d8aed0bfaaf..7e03a371d22 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -347,7 +347,8 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_init_size(infoP, hash_flags, init_size, 0),
+							   hash_get_init_size(infoP, hash_flags,
+												  init_size, 0),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 1f215a16c51..dcdc143e690 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,29 +260,35 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
-
-#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
-	(MAXALIGN(sizeof(HASHHDR)) + \
-	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
-	 ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
-
-#define HASH_ELEMENTS(hashp, nsegs) \
-	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
+#define	HASH_DIRECTORY_PTR(hashp) \
+	(((char *) (hashp)->hctl) + sizeof(HASHHDR))
 
 #define HASH_SEGMENT_OFFSET(hctl, idx) \
-	(MAXALIGN(sizeof(HASHHDR)) + \
-	 ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
-	 ((hctl)->ssize * (idx) * MAXALIGN(sizeof(HASHBUCKET))))
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
 
 #define HASH_SEGMENT_PTR(hashp, idx) \
-	(HASHSEGMENT) ((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
 
-#define HASH_SEGMENT_SIZE(hashp)	((hashp)->ssize * MAXALIGN(sizeof(HASHBUCKET)))
+#define HASH_SEGMENT_SIZE(hashp) \
+	((hashp)->ssize * sizeof(HASHBUCKET))
+
+/* XXX this is exactly the same as HASH_SEGMENT_OFFSET, unite? */
+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
+
+#define HASH_ELEMENTS_PTR(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_ELEMENTS_OFFSET((hashp)->hctl, nsegs))
 
-#define	HASH_DIRECTORY(hashp)	(HASHSEGMENT *) (((char *) (hashp)->hctl) + MAXALIGN(sizeof(HASHHDR)))
+/* Each element has a HASHELEMENT header plus user data. */
+#define HASH_ELEMENT_SIZE(hctl) \
+	(MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))
 
 #define HASH_ELEMENT_NEXT(hctl, num, ptr) \
-	((char *) (ptr) + ((num) * (MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))))
+	((char *) (ptr) + ((num) * HASH_ELEMENT_SIZE(hctl)))
 
 /*
  * Private function prototypes
@@ -305,8 +311,8 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
-static void	compute_buckets_and_segs(long nelem, long num_partitions,
-									 long ssize, /* segment size */
+static void compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize,
 									 int *nbuckets, int *nsegments);
 static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
 						int nelem, int freelist_idx);
@@ -547,7 +553,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 */
 	if (!hashp->hctl)
 	{
-		Size	size = hash_get_init_size(info, flags, nelem, nelem_batch);
+		Size		size = hash_get_init_size(info, flags, nelem, nelem_batch);
 
 		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
@@ -638,16 +644,6 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			freelist_partitions = 1;
 
-		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
-								 &nbuckets, &nsegs);
-
-		/*
-		 * Calculate the offset at which to find the first partition of
-		 * elements.
-		 * We have to skip space for the header, segments and buckets.
- 		 */
-		ptr = HASH_ELEMENTS(hashp, nsegs);
-
 		nelem_alloc = nelem / freelist_partitions;
 		if (nelem_alloc <= 0)
 			nelem_alloc = 1;
@@ -662,19 +658,29 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			nelem_alloc_first = nelem_alloc;
 
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements. We have to skip space for the header, segments and
+		 * buckets.
+		 */
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		ptr = HASH_ELEMENTS_PTR(hashp, nsegs);
+
+		/*
+		 * Assign the correct location of each parition within a pre-allocated
+		 * buffer.
+		 *
+		 * Actual memory allocation happens in ShmemInitHash for shared hash
+		 * tables or earlier in this function for non-shared hash tables.
+		 *
+		 * We just need to split that allocation into per-batch freelists.
+		 */
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			/*
-			 * Assign the correct location of each parition within a
-			 * pre-allocated buffer.
-			 *
-			 * Actual memory allocation happens in ShmemInitHash for
-			 * shared hash tables or earlier in this function for non-shared
-			 * hash tables.
-			 * We just need to split that allocation per-batch freelists.
-			 */
 			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
 			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
@@ -789,14 +795,14 @@ init_htab(HTAB *hashp, long nelem)
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = HASH_DIRECTORY(hashp);
+		hashp->dir = (HASHSEGMENT *) HASH_DIRECTORY_PTR(hashp);
 	}
 
 	/* Assign initial segments, which are also pre-allocated */
 	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = HASH_SEGMENT_PTR(hashp, i++);
+		*segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
 		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
@@ -890,19 +896,22 @@ hash_select_dirsize(long num_entries)
 }
 
 /*
- * Compute the required initial memory allocation for a shared-memory
- * or non-shared memory hashtable with the given parameters.
- * We need space for the HASHHDR, for the directory, segments and
- * the init_size elements in buckets.	
- * 
+ * Compute the required initial memory allocation for a hashtable with the given
+ * parameters. The hash table may be shared or private. We need space for the
+ * HASHHDR, for the directory, segments and the init_size elements in buckets.
+ *
  * For shared hash tables the directory size is non-expansive.
- * 
- * init_size should match the total number of elements allocated
- * during hash table creation, it could be zero for non-shared hash
- * tables depending on the value of nelem_alloc. For more explanation
- * see comments within this function.
+ *
+ * init_size should match the total number of elements allocated during hash
+ * table creation, it could be zero for non-shared hash tables depending on the
+ * value of nelem_alloc. For more explanation see comments within this function.
  *
  * nelem_alloc parameter is not relevant for shared hash tables.
+ *
+ * XXX I find this comment hard to understand / confusing. So what is init_size?
+ * Why should it match the total number of elements allocated during hash table
+ * creation? What does "it could be zero" say? Should it be zero, or why would I
+ * want to make it zero in some cases?
  */
 Size
 hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_alloc)
@@ -912,8 +921,8 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 	int			num_partitions;
 	long		ssize;
 	long		dsize;
-	bool		element_alloc = true; /*Always true for shared hash tables */
-	Size		elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+	bool		element_alloc = true;	/* Always true for shared hash tables */
+	Size		elementSize = HASH_ELEMENT_SIZE(info);
 
 	/*
 	 * For non-shared hash tables, the requested number of elements are
@@ -931,6 +940,7 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 		Assert(flags & HASH_DIRSIZE);
 		Assert(info->dsize == info->max_dsize);
 	}
+
 	/* Non-shared hash tables may not specify dir size */
 	if (!(flags & HASH_DIRSIZE))
 	{
@@ -961,8 +971,8 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 	if (!element_alloc)
 		init_size = 0;
 
-	return MAXALIGN(sizeof(HASHHDR)) + dsize * MAXALIGN(sizeof(HASHSEGMENT)) +
-		+ MAXALIGN(sizeof(HASHBUCKET)) * ssize * nsegs
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
+		+ sizeof(HASHBUCKET) * ssize * nsegs
 		+ init_size * elementSize;
 }
 
@@ -1393,8 +1403,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		newElement = element_alloc(hashp, hctl->nelem_alloc);
-		if (newElement == NULL)
+		if ((newElement = element_alloc(hashp, hctl->nelem_alloc)) == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1823,7 +1832,7 @@ element_alloc(HTAB *hashp, int nelem)
 		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
-	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
@@ -1834,7 +1843,7 @@ element_alloc(HTAB *hashp, int nelem)
 }
 
 /*
- * Link the elements allocated by element_alloc into the indicated free list
+ * link the elements allocated by element_alloc into the indicated free list
  */
 static void
 element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
@@ -1846,7 +1855,8 @@ element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
 	int			i;
 
 	/* Each element has a HASHELEMENT header plus user data. */
-	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
+	elementSize = HASH_ELEMENT_SIZE(hctl);
+
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
 	tmpElement = firstElement;
@@ -2103,7 +2113,6 @@ compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
 	while ((*nbuckets) < num_partitions)
 		(*nbuckets) <<= 1;
 
-
 	/*
 	 * Figure number of directory segments needed, round up to a power of 2
 	 */
-- 
2.49.0



  [text/x-patch] v6-0003-align-using-CACHELINESIZE-to-match-ShmemAllocRaw.patch (1.3K, 4-v6-0003-align-using-CACHELINESIZE-to-match-ShmemAllocRaw.patch)
  download | inline diff:
From 88974dc9a36759eaf2b0048b6ac45bc0c83cb537 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Thu, 27 Mar 2025 21:58:22 +0100
Subject: [PATCH v6 3/7] align using CACHELINESIZE, to match ShmemAllocRaw

---
 src/backend/utils/hash/dynahash.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index dcdc143e690..fc5575daaea 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -276,7 +276,7 @@ static long hash_accesses,
 
 /* XXX this is exactly the same as HASH_SEGMENT_OFFSET, unite? */
 #define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
-	(sizeof(HASHHDR) + \
+	CACHELINEALIGN(sizeof(HASHHDR) + \
 	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
 	 ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
 
@@ -971,9 +971,11 @@ hash_get_init_size(const HASHCTL *info, int flags, long init_size, int nelem_all
 	if (!element_alloc)
 		init_size = 0;
 
-	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
-		+ sizeof(HASHBUCKET) * ssize * nsegs
-		+ init_size * elementSize;
+	/* enforce elements to be cacheline aligned */
+	return CACHELINEALIGN(sizeof(HASHHDR)
+			+ (dsize * sizeof(HASHSEGMENT))
+			+ (ssize * nsegs * sizeof(HASHBUCKET)))
+			+ (init_size * elementSize);
 }
 
 
-- 
2.49.0



  [text/x-patch] v6-0004-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch (7.2K, 5-v6-0004-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch)
  download | inline diff:
From edccce9feb8a930150d97d034ef96c6bff544d96 Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 27 Mar 2025 16:43:28 +0530
Subject: [PATCH v6 4/7] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct to allocate
and track all the related memory allocations under single
call.
---
 src/backend/storage/lmgr/predicate.c | 27 +++++++++-----
 src/backend/storage/lmgr/proc.c      | 56 +++++++++++++++++++++++-----
 2 files changed, 63 insertions(+), 20 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a053981..de2629fdf0c 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,14 +1226,20 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* reset everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1242,11 +1248,8 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1299,21 +1302,25 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+	requestSize = RWConflictPoolHeaderDataSize +
+					mul_size((Size) max_table_size,
+							 RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+			RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index e4ca861a8e6..6ee48410b84 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -123,6 +123,24 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * review: add comment, explaining the PG_CACHE_LINE_SIZE thing
+ * review: I'd even maybe split the PG_CACHE_LINE_SIZE thing into
+ * a separate commit, not to mix it with the "monitoring improvement"
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
  */
@@ -175,6 +193,8 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +224,15 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	ptr = ShmemInitStruct("PGPROC structures",
+									   requestSize,
+									   &found);
+
+	procs = (PGPROC *) ptr;
+	ptr = (char *)ptr + TotalProcs * sizeof(PGPROC);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -213,17 +241,21 @@ InitProcGlobal(void)
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
-	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
 	 */
-	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	/* make sure wer didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -233,7 +265,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,7 +364,9 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-- 
2.49.0



  [text/x-patch] v6-0005-review.patch (4.6K, 6-v6-0005-review.patch)
  download | inline diff:
From 03809b97dca1a25b1b61f119ec05ef4b1d8e5b49 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Thu, 27 Mar 2025 21:20:26 +0100
Subject: [PATCH v6 5/7] review

---
 src/backend/storage/lmgr/predicate.c | 13 ++++++++-----
 src/backend/storage/lmgr/proc.c      | 21 +++++++++++----------
 2 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index de2629fdf0c..d82114ffca1 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1229,6 +1229,7 @@ PredicateLockShmemInit(void)
 	requestSize = add_size(PredXactListDataSize,
 						   (mul_size((Size) max_table_size,
 									 sizeof(SERIALIZABLEXACT))));
+
 	PredXact = ShmemInitStruct("PredXactList",
 							   requestSize,
 							   &found);
@@ -1237,7 +1238,7 @@ PredicateLockShmemInit(void)
 	{
 		int			i;
 
-		/* reset everything, both the header and the element */
+		/* clean everything, both the header and the element */
 		memset(PredXact, 0, requestSize);
 
 		dlist_init(&PredXact->availableList);
@@ -1248,7 +1249,8 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
+		PredXact->element
+			= (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
 		for (i = 0; i < max_table_size; i++)
 		{
@@ -1302,9 +1304,10 @@ PredicateLockShmemInit(void)
 	 * probably OK.
 	 */
 	max_table_size *= 5;
+
 	requestSize = RWConflictPoolHeaderDataSize +
-					mul_size((Size) max_table_size,
-							 RWConflictDataSize);
+		mul_size((Size) max_table_size,
+				 RWConflictDataSize);
 
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
 									 requestSize,
@@ -1319,7 +1322,7 @@ PredicateLockShmemInit(void)
 
 		dlist_init(&RWConflictPool->availableList);
 		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
-			RWConflictPoolHeaderDataSize);
+												RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
 		for (i = 0; i < max_table_size; i++)
 		{
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6ee48410b84..c08e3bb3d56 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -124,20 +124,21 @@ ProcGlobalShmemSize(void)
 }
 
 /*
- * review: add comment, explaining the PG_CACHE_LINE_SIZE thing
- * review: I'd even maybe split the PG_CACHE_LINE_SIZE thing into
- * a separate commit, not to mix it with the "monitoring improvement"
+ * Report shared-memory space needed by PGPROC.
  */
 static Size
 PGProcShmemSize(void)
 {
 	Size		size;
-	uint32		TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+	uint32		TotalProcs;
+
+	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
 	size = TotalProcs * sizeof(PGPROC);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
 	return size;
 }
 
@@ -227,11 +228,11 @@ InitProcGlobal(void)
 	requestSize = PGProcShmemSize();
 
 	ptr = ShmemInitStruct("PGPROC structures",
-									   requestSize,
-									   &found);
+						  requestSize,
+						  &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *)ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -244,15 +245,15 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-	ptr = (char *)ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/* make sure wer didn't overflow */
 	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
-- 
2.49.0



  [text/x-patch] v6-0006-Add-cacheline-padding-between-heavily-accessed-ar.patch (2.0K, 7-v6-0006-Add-cacheline-padding-between-heavily-accessed-ar.patch)
  download | inline diff:
From 2a266c6115dca6ad32f7dbb136b63cf8d278504d Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Thu, 27 Mar 2025 21:23:05 +0100
Subject: [PATCH v6 6/7] Add cacheline padding between heavily accessed arrays

---
 src/backend/storage/lmgr/proc.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index c08e3bb3d56..6e0fc6a08e6 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -135,8 +135,11 @@ PGProcShmemSize(void)
 	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
 	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, PG_CACHE_LINE_SIZE);
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	return size;
@@ -232,7 +235,7 @@ InitProcGlobal(void)
 						  &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC) + PG_CACHE_LINE_SIZE;
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -245,11 +248,11 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *) ptr + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates)) + PG_CACHE_LINE_SIZE;
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.49.0



  [text/x-patch] v6-0007-review.patch (2.3K, 8-v6-0007-review.patch)
  download | inline diff:
From 8f74e27118cc68c5348def34cfda814ca846db49 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Thu, 27 Mar 2025 22:03:23 +0100
Subject: [PATCH v6 7/7] review

---
 src/backend/storage/lmgr/proc.c | 15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 6e0fc6a08e6..5757545defb 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -134,12 +134,9 @@ PGProcShmemSize(void)
 
 	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
-	size = TotalProcs * sizeof(PGPROC);
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	size = add_size(size, PG_CACHE_LINE_SIZE);
+	size = CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids)));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	return size;
@@ -235,7 +232,7 @@ InitProcGlobal(void)
 						  &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC) + PG_CACHE_LINE_SIZE;
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -248,11 +245,11 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *) ptr + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE;
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates)) + PG_CACHE_LINE_SIZE;
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.49.0



^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
  2025-03-23 08:36   ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-24 13:24     ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 12:02       ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-27 12:56         ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 23:50           ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
@ 2025-03-28 11:10             ` Rahila Syed <[email protected]>
  2025-03-28 13:00               ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Rahila Syed @ 2025-03-28 11:10 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

Hi Tomas,


1) alignment
>
> There was a comment with a question whether we need to MAXALIGN the
> chunks in dynahash.c, which were originally allocated by ShmemAlloc, but
> now it's part of one large allocation, which is then cut into pieces
> (using pointer arithmetics).
>
> I was not sure whether we need to enforce some alignment, we briefly
> discussed that off-list. I realize you chose to add the alignment, but I
> haven't noticed any comment in the patch why it's needed, and it seems
> to me it may not be quite correct.
>


I have added MAXALIGN to specific allocations, such as HASHHDR and
HASHSEGMENT, with the expectation that allocations in multiples of this,
like dsize * HASHSEGMENT, would automatically align.



> Let me explain what I had in mind, and why I think the way v5 doesn't
> actually do that. It took me a while before I understood what alignment
> is about, and for a while it was haunting my patches, so hopefully this
> will help others ...
>
> The "alignment" is about pointers (or addresses), and when a pointer is
> aligned it means the address is a multiple of some number. For example
> 4B-aligned pointer is a multiple of 4B, so 0x00000100 is 4B-aligned,
> while 0x00000101 is not. Sometimes we use data types to express the
> alignment, e.g. int-aligned is 4B-aligned, but that's a detail. AFAIK
> the alignment is always 2^k, so 1, 2, 4, 8, ...
>
> The primary reason for alignment is that some architectures require the
> pointers to be well-aligned for a given data type. For example (int*)
> needs to be int-aligned. If you have a pointer that's not 4B-aligned,
> it'll trigger SIGBUS or maybe SIGSEGV. This was true for architectures
> like powerpc, I don't think x86/arm64 have this restriction, i.e. it'd
> work, even if there might be a minor performance impact. Anyway, we
> still enforce/expect correct alignment, because we may still support
> some of those alignment-sensitive platforms, and it's "tidy".
>
> The other reason is that we sometimes use alignment to add padding, to
> reduce contention when accessing elements in hot arrays. We want to
> align to cacheline boundaries, so that a struct does not require
> accessing more cachelines than really necessary. And also to reduce
> contention - the more cachelines, the higher the risk of contention.
>
>
Thank you for your explanation. I had a similar understanding. However,
I believed that MAXALIGN and CACHEALIGN are primarily performance
optimizations
that do not impact the correctness of the code. This assumption is based on
the fact
that I have not observed any failures on GitHub CI, even when changing the
alignment
in this part of the code.


Now, back to the patch. The code originally did this in ShmemInitStruct
>
>     hashp = ShmemInitStruct(...)
>
> to allocate the hctl, and then
>
>     firstElement = (HASHELEMENT *) ShmemAlloc(nelem * elementSize);
>
> in element_alloc(). But this means the "elements" allocation is aligned
> to PG_CACHE_LINE_SIZE, i.e. 128B, because ShmemAllocRaw() does this:
>
>     size = CACHELINEALIGN(size);
>
> So it distributes memory in multiples of 128B, and I believe it starts
> at a multiple of 128B.
>
> But the patch reworks this to allocate everything at once, and thus it
> won't get this alignment automatically. AFAIK that's not intentional,
> because no one explicitly mentioned this. And it's may not be quite
> desirable, judging by the comment in ShmemAllocRaw().
>
>
Yes, the patch reworks this to allocate all the shared memory at once.
It uses ShmemInitStruct which internally calls ShmemAllocRaw. So the whole
chunk
of memory allocated is still CACHEALIGNed.

I mentioned v5 adds alignment, but I think it does not quite do that
> quite correctly. It adds alignment by changing the macros from:
>
> +#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
> +       (sizeof(HASHHDR) + \
> +        ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
> +        ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
>
> to
>
> +#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
> +       (MAXALIGN(sizeof(HASHHDR)) + \
> +        ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
> +        ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
>
> First, it uses MAXALIGN, but that's mostly my fault, because my comment
> suggested that - the ShmemAllocRaw however and makes the case for using
> CACHELINEALIGN.
>

Good catch. For a shared hash table, allocations need to be
CACHELINEALIGNED.
I think hash_get_init_size does not need to call CACHELINEALIGNED
explicitly as ShmemInitStruct already does this.
In that case, the size returned by hash_get_init_size just needs to
MAXALIGN required structs as per hash_create() requirements and
CACHELINEALIGN
will be taken care of in ShmemInitStruct at the time of allocating the
entire chunk.


> But more importantly, it adds alignment to all hctl field, and to every
> element of those arrays. But that's not what the alignment was supposed
> to do - it was supposed to align arrays, not individual elements. Not
> only would this waste memory, it would actually break direct access to
> those array elements.
>

I think existing code has occurrences of both i,.e aligning individual
elements and
arrays.
A similar precedent exists in the function hash_estimate_size(), which only
applies maxalignment to the individual structs like HASHHDR, HASHELEMENT,
entrysize, but also an array of HASHBUCKET headers.

I agree with you that perhaps we don't need maxalignment for all of these
structures.
For ex, HASHBUCKET is a pointer to a linked list of elements, it might not
require alignment
if the elements it points to are already aligned.


> But there's another detail - even before this patch, most of the stuff
> was allocated at once by ShmemInitStruct(). Everything except for the
> elements, so to replicate the alignment we only need to worry about that
> last part. So I think this should do:
>


> +#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
> +    CACHELINEALIGN(sizeof(HASHHDR) + \
> +     ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
> +     ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
>
> This is what the 0003 patch does. There's still one minor difference, in
> that we used to align each segment independently - each element_alloc()
> call allocated a new CACHELINEALIGN-ed chunk, while now have just a
> single chunk. But I think that's OK.
>
>
Before this patch, following structures were allocated separately using
ShmemAllocRaw
directory, each segment(seg_alloc) and a chunk of elements (element_alloc).
Hence,
I don't understand why v-0003* CACHEALIGNs  in the manner it does.

I think if we want to emulate the current behaviour we should do something
like:
CACHELINEALIGN(sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)) +
                + CACHELINEALIGN(sizeof(HASHBUCKET) * ssize) * nsegs
                + CACHELINEALIGN(init_size * elementSize);

Like you mentioned the only difference would be that we would be aligning
all elements
at once instead of aligning individual partitions of elements.


3) I find the comment before hash_get_init_size a bit unclear/confusing.
> It says this:
>
>  * init_size should match the total number of elements allocated during
>  * hash table creation, it could be zero for non-shared hash tables
>  * depending on the value of nelem_alloc. For more explanation see
>  * comments within this function.
>  *
>  * nelem_alloc parameter is not relevant for shared hash tables.
>
> What does "should match" mean here? Doesn't it *determine* the number of
> elements allocated? What if it doesn't match?
>

by should match I mean - init_size  here  *is* equal to nelem in
hash_create() .

>
> AFAICS it means the hash table is sized to expect init_size elements,
> but only nelem_alloc elements are actually pre-allocated, right?


No. All the init_size elements are pre-allocated for shared hash table
irrespective of
nelem_alloc value.
For non-shared hash tables init_size elements are allocated only
if it is less than nelem_alloc, otherwise they are allocated as part of
expansion.


> But the
> comment says it's init_size which determines the number of elements
> allocated during creation. Confusing.
>
> It says "it could be zero ... depending on the value of nelem_alloc".
> Depending how? What's the relationship.
>
>
The relationship is defined in this comment:
 /*
* For a shared hash table, preallocate the requested number of elements.
* This reduces problems with run-time out-of-shared-memory conditions.
*
* For a non-shared hash table, preallocate the requested number of
* elements if it's less than our chosen nelem_alloc.  This avoids wasting
* space if the caller correctly estimates a small table size.
*/

hash_create code is confusing because the nelem_alloc named variable is used
in two different cases, In  the above case  nelem_alloc  refers to the one
returned by choose_nelem_alloc function.

The other nelem_alloc determines the number of elements in each partition
for a partitioned hash table. This is not what is being referred to in the
above
comment.

The bit "For more explanation see comments within this function" is not
> great, if only because there are not many comments within the function,
> so there's no "more explanation". But if there's something important, it
> should be in the main comment, preferably.
>
>
I will improve the comment in the next version.

Thank you,
Rahila Syed


^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
  2025-03-23 08:36   ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-24 13:24     ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 12:02       ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-27 12:56         ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 23:50           ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-28 11:10             ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
@ 2025-03-28 13:00               ` Tomas Vondra <[email protected]>
  2025-03-30 23:01                 ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Tomas Vondra @ 2025-03-28 13:00 UTC (permalink / raw)
  To: Rahila Syed <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers



On 3/28/25 12:10, Rahila Syed wrote:
> Hi Tomas,
> 
> 
>     1) alignment
> 
>     There was a comment with a question whether we need to MAXALIGN the
>     chunks in dynahash.c, which were originally allocated by ShmemAlloc, but
>     now it's part of one large allocation, which is then cut into pieces
>     (using pointer arithmetics).
> 
>     I was not sure whether we need to enforce some alignment, we briefly
>     discussed that off-list. I realize you chose to add the alignment, but I
>     haven't noticed any comment in the patch why it's needed, and it seems
>     to me it may not be quite correct.
> 
>  
> 
> I have added MAXALIGN to specific allocations, such as HASHHDR and
> HASHSEGMENT, with the expectation that allocations in multiples of this,
> like dsize * HASHSEGMENT, would automatically align.
> 

Yes, assuming the original allocation is aligned, maxalign-ing the
smaller allocations would ensure that. But there's still the issue with
aligning array elements.

> 
> 
>     Let me explain what I had in mind, and why I think the way v5 doesn't
>     actually do that. It took me a while before I understood what alignment
>     is about, and for a while it was haunting my patches, so hopefully this
>     will help others ...
> 
>     The "alignment" is about pointers (or addresses), and when a pointer is
>     aligned it means the address is a multiple of some number. For example
>     4B-aligned pointer is a multiple of 4B, so 0x00000100 is 4B-aligned,
>     while 0x00000101 is not. Sometimes we use data types to express the
>     alignment, e.g. int-aligned is 4B-aligned, but that's a detail. AFAIK
>     the alignment is always 2^k, so 1, 2, 4, 8, ...
> 
>     The primary reason for alignment is that some architectures require the
>     pointers to be well-aligned for a given data type. For example (int*)
>     needs to be int-aligned. If you have a pointer that's not 4B-aligned,
>     it'll trigger SIGBUS or maybe SIGSEGV. This was true for architectures
>     like powerpc, I don't think x86/arm64 have this restriction, i.e. it'd
>     work, even if there might be a minor performance impact. Anyway, we
>     still enforce/expect correct alignment, because we may still support
>     some of those alignment-sensitive platforms, and it's "tidy".
> 
>     The other reason is that we sometimes use alignment to add padding, to
>     reduce contention when accessing elements in hot arrays. We want to
>     align to cacheline boundaries, so that a struct does not require
>     accessing more cachelines than really necessary. And also to reduce
>     contention - the more cachelines, the higher the risk of contention.
> 
>  
> Thank you for your explanation. I had a similar understanding. However,
> I believed that MAXALIGN and CACHEALIGN are primarily performance
> optimizations
> that do not impact the correctness of the code. This assumption is based
> on the fact
> that I have not observed any failures on GitHub CI, even when changing
> the alignment
> in this part of the code.
> 

I hope it didn't come over as explaining something you already know ...
Apologies if that's the case.

As for why the CI doesn't fail on this, that's because it runs on x86,
which tolerates alignment issues. AFAIK all "modern" platforms do. You'd
need some old platform (I recall we saw SIGBUG on powerpc and s390, or
something like that). It'd probably matter even on x86 with something
like AVX, but that's irrelevant for this patch.

I don't even know what is the performance impact on x86. I think it was
measurable years ago, but it got mostly negligible for recent CPUs.

Another reason is that some of those structs may be "naturally aligned"
because they happen to be multiples of MAXALIGN. For example:

HASHHDR -> 96 bytes
HASHELEMENT -> 16 bytes (this covers HASHBUCKET / HASHSEGMENT too)


> 
>     Now, back to the patch. The code originally did this in ShmemInitStruct
> 
>         hashp = ShmemInitStruct(...)
> 
>     to allocate the hctl, and then
> 
>         firstElement = (HASHELEMENT *) ShmemAlloc(nelem * elementSize);
> 
>     in element_alloc(). But this means the "elements" allocation is aligned
>     to PG_CACHE_LINE_SIZE, i.e. 128B, because ShmemAllocRaw() does this:
> 
>         size = CACHELINEALIGN(size);
> 
>     So it distributes memory in multiples of 128B, and I believe it starts
>     at a multiple of 128B.
> 
>     But the patch reworks this to allocate everything at once, and thus it
>     won't get this alignment automatically. AFAIK that's not intentional,
>     because no one explicitly mentioned this. And it's may not be quite
>     desirable, judging by the comment in ShmemAllocRaw().
> 
> 
> Yes, the patch reworks this to allocate all the shared memory at once.
> It uses ShmemInitStruct which internally calls ShmemAllocRaw. So the
> whole chunk
> of memory allocated is still CACHEALIGNed. 
> 
>     I mentioned v5 adds alignment, but I think it does not quite do that
>     quite correctly. It adds alignment by changing the macros from:
> 
>     +#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
>     +       (sizeof(HASHHDR) + \
>     +        ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
>     +        ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
> 
>     to
> 
>     +#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
>     +       (MAXALIGN(sizeof(HASHHDR)) + \
>     +        ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
>     +        ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))
> 
>     First, it uses MAXALIGN, but that's mostly my fault, because my comment
>     suggested that - the ShmemAllocRaw however and makes the case for using
>     CACHELINEALIGN.
> 
> 
> Good catch. For a shared hash table, allocations need to be
> CACHELINEALIGNED.  
> I think hash_get_init_size does not need to call CACHELINEALIGNED
> explicitly as ShmemInitStruct already does this.
> In that case, the size returned by hash_get_init_size just needs to
> MAXALIGN required structs as per hash_create() requirements and
> CACHELINEALIGN
> will be taken care of in ShmemInitStruct at the time of allocating the
> entire chunk.
> 
> 
>     But more importantly, it adds alignment to all hctl field, and to every
>     element of those arrays. But that's not what the alignment was supposed
>     to do - it was supposed to align arrays, not individual elements. Not
>     only would this waste memory, it would actually break direct access to
>     those array elements.
> 
> 
> I think existing code has occurrences of both i,.e aligning individual
> elements and 
> arrays.
> A similar precedent exists in the function hash_estimate_size(), which only
> applies maxalignment to the individual structs like HASHHDR, HASHELEMENT,
> entrysize, but also an array of HASHBUCKET headers. 
> 
> I agree with you that perhaps we don't need maxalignment for all of
> these structures.
> For ex, HASHBUCKET is a pointer to a linked list of elements, it might
> not require alignment
> if the elements it points to are already aligned.
> 
> 
>     But there's another detail - even before this patch, most of the stuff
>     was allocated at once by ShmemInitStruct(). Everything except for the
>     elements, so to replicate the alignment we only need to worry about that
>     last part. So I think this should do:
> 
>  
> 
>     +#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
>     +    CACHELINEALIGN(sizeof(HASHHDR) + \
>     +     ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
>     +     ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))
> 
>     This is what the 0003 patch does. There's still one minor difference, in
>     that we used to align each segment independently - each element_alloc()
>     call allocated a new CACHELINEALIGN-ed chunk, while now have just a
>     single chunk. But I think that's OK.
> 
>  
> Before this patch, following structures were allocated separately using
> ShmemAllocRaw
> directory, each segment(seg_alloc) and a chunk of elements
> (element_alloc). Hence, 
> I don't understand why v-0003* CACHEALIGNs  in the manner it does.
> 

I may not have gotten it quite right. The intent was to align it the
same way as before, and the allocation used to be sized like this:

Size
hash_get_shared_size(HASHCTL *info, int flags)
{
	Assert(flags & HASH_DIRSIZE);
	Assert(info->dsize == info->max_dsize);
	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
}

But I got confused a bit. I think you're correct here:

> I think if we want to emulate the current behaviour we should do
> something like:
> CACHELINEALIGN(sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)) +
>                 + CACHELINEALIGN(sizeof(HASHBUCKET) * ssize) * nsegs
>                 + CACHELINEALIGN(init_size * elementSize);
> 
> Like you mentioned the only difference would be that we would be
> aligning all elements
> at once instead of aligning individual partitions of elements.
> 

Right. I'm still not convinced if this makes any difference, or whether
this alignment was merely a consequence of using ShmemAlloc(). I don't
want to make this harder to understand unnecessarily.

Let's keep this simple - without additional alignment. I'll think about
it a bit more, and maybe add it before commit.

> 
>     3) I find the comment before hash_get_init_size a bit unclear/confusing.
>     It says this:
> 
>      * init_size should match the total number of elements allocated during
>      * hash table creation, it could be zero for non-shared hash tables
>      * depending on the value of nelem_alloc. For more explanation see
>      * comments within this function.
>      *
>      * nelem_alloc parameter is not relevant for shared hash tables.
> 
>     What does "should match" mean here? Doesn't it *determine* the number of
>     elements allocated? What if it doesn't match?
> 
>  
> by should match I mean - init_size  here  *is* equal to nelem in
> hash_create() .
> 
> 
>     AFAICS it means the hash table is sized to expect init_size elements,
>     but only nelem_alloc elements are actually pre-allocated, right?
> 
>  
> No. All the init_size elements are pre-allocated for shared hash table
> irrespective of 
> nelem_alloc value.
> For non-shared hash tables init_size elements are allocated only
> if it is less than nelem_alloc, otherwise they are allocated as part of
> expansion.
>  
> 
>     But the
>     comment says it's init_size which determines the number of elements
>     allocated during creation. Confusing.
> 
>     It says "it could be zero ... depending on the value of nelem_alloc".
>     Depending how? What's the relationship.
> 
>  
> The relationship is defined in this comment:
>  /*
> * For a shared hash table, preallocate the requested number of elements.
> * This reduces problems with run-time out-of-shared-memory conditions.
> *
> * For a non-shared hash table, preallocate the requested number of
> * elements if it's less than our chosen nelem_alloc.  This avoids wasting
> * space if the caller correctly estimates a small table size.
> */
> 
> hash_create code is confusing because the nelem_alloc named variable is used
> in two different cases, In  the above case  nelem_alloc  refers to the one 
> returned by choose_nelem_alloc function.
> 
> The other nelem_alloc determines the number of elements in each partition
> for a partitioned hash table. This is not what is being referred to in
> the above 
> comment.
> 
>     The bit "For more explanation see comments within this function" is not
>     great, if only because there are not many comments within the function,
>     so there's no "more explanation". But if there's something important, it
>     should be in the main comment, preferably.
> 
>  
> I will improve the comment in the next version.
> 

OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not
really used except for this bit:

+    if (init_size > nelem_alloc)
+        element_alloc = false;

Can't we determine before calling the function, to make it a bit less
confusing?


regards

-- 
Tomas Vondra






^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
  2025-03-23 08:36   ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-24 13:24     ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 12:02       ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-27 12:56         ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 23:50           ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-28 11:10             ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-28 13:00               ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
@ 2025-03-30 23:01                 ` Rahila Syed <[email protected]>
  2025-03-31 14:11                   ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  0 siblings, 1 reply; 14+ messages in thread

From: Rahila Syed @ 2025-03-30 23:01 UTC (permalink / raw)
  To: Tomas Vondra <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

Hi Tomas,


>
> Right. I'm still not convinced if this makes any difference, or whether
> this alignment was merely a consequence of using ShmemAlloc(). I don't
> want to make this harder to understand unnecessarily.
>

Yeah, it makes sense.


> Let's keep this simple - without additional alignment. I'll think about
> it a bit more, and maybe add it before commit.
>

OK.


>
>
> > I will improve the comment in the next version.
> >
>
> OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not
> really used except for this bit:
>
> +    if (init_size > nelem_alloc)
> +        element_alloc = false;
>
> Can't we determine before calling the function, to make it a bit less
> confusing?
>

Yes, we could determine whether the pre-allocated elements are zero before
calling the function, I have fixed it accordingly in the attached 0001
patch.
Now, there's no need to pass `nelem_alloc` as a parameter. Instead, I've
passed this information as a boolean variable-initial_elems. If it is
false,
no elements are pre-allocated.

Please find attached the v7-series, which incorporates your review patches
and addresses a few remaining comments.

Thank you,
Rahila Syed


Attachments:

  [application/octet-stream] v7-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch (7.0K, 3-v7-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch)
  download | inline diff:
From dfb7731cd9f4687f4d90a5210fb32c3a78fc4a5f Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Mon, 31 Mar 2025 03:45:50 +0530
Subject: [PATCH 2/3] Replace ShmemAlloc calls by ShmemInitStruct

The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct to allocate
and track all the related memory allocations under single
call.
---
 src/backend/storage/lmgr/predicate.c | 30 ++++++++++-----
 src/backend/storage/lmgr/proc.c      | 57 +++++++++++++++++++++++-----
 2 files changed, 67 insertions(+), 20 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..d82114ffca 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,14 +1226,21 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
+
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1242,11 +1249,9 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element
+			= (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1300,20 +1305,25 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 5;
 
+	requestSize = RWConflictPoolHeaderDataSize +
+		mul_size((Size) max_table_size,
+				 RWConflictDataSize);
+
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+												RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 066319afe2..4e23c793f7 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -123,6 +123,25 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * Report shared-memory space needed by PGPROC.
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs;
+
+	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
  */
@@ -175,6 +194,8 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +225,15 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	ptr = ShmemInitStruct("PGPROC structures",
+						  requestSize,
+						  &found);
+
+	procs = (PGPROC *) ptr;
+	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -213,17 +242,21 @@ InitProcGlobal(void)
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
-	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
 	 */
-	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	/* make sure wer didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -233,7 +266,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,7 +365,9 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-- 
2.34.1



  [application/octet-stream] v7-0003-Add-cacheline-padding-between-heavily-accessed-array.patch (2.1K, 4-v7-0003-Add-cacheline-padding-between-heavily-accessed-array.patch)
  download | inline diff:
From 61712f86abd9edae190d27dc08be7b9753751a1d Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Mon, 31 Mar 2025 04:27:43 +0530
Subject: [PATCH 3/3] Add cacheline padding between heavily accessed arrays

---
 src/backend/storage/lmgr/proc.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 4e23c793f7..8363e2f61d 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -134,9 +134,9 @@ PGProcShmemSize(void)
 
 	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
-	size = TotalProcs * sizeof(PGPROC);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids)));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	return size;
@@ -232,7 +232,7 @@ InitProcGlobal(void)
 						  &found);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
 
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
@@ -245,11 +245,11 @@ InitProcGlobal(void)
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.34.1



  [application/octet-stream] v7-0001-Account-for-all-the-shared-memory-allocated-by-hash_.patch (15.9K, 5-v7-0001-Account-for-all-the-shared-memory-allocated-by-hash_.patch)
  download | inline diff:
From 4cce66bdfbb53b7ef3d9a7058f5979ac3b4dc89f Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Mon, 31 Mar 2025 03:23:20 +0530
Subject: [PATCH 1/3] Account for all the shared memory allocated by
 hash_create

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and header structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.

Allocate memory for segments, buckets and elements together with the
directory and header structures. This results in the existing ShmemIndex
entries to reflect size of hash table more accurately, thus improving
the pg_shmem_allocation monitoring. Also, make this change for non-
shared hash table since they both share the hash_create code.
---
 src/backend/storage/ipc/shmem.c   |   4 +-
 src/backend/utils/hash/dynahash.c | 283 +++++++++++++++++++++++-------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 222 insertions(+), 68 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..3c030d5743 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,8 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags,
+												  true, init_size),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d..3dede9caa5 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,36 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+#define	HASH_DIRECTORY_PTR(hashp) \
+	(((char *) (hashp)->hctl) + sizeof(HASHHDR))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp) \
+	((hashp)->ssize * sizeof(HASHBUCKET))
+
+#define HASH_ELEMENTS_PTR(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, nsegs))
+
+/* Each element has a HASHELEMENT header plus user data. */
+#define HASH_ELEMENT_SIZE(hctl) \
+	(MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * HASH_ELEMENT_SIZE(hctl)))
+
 /*
  * Private function prototypes
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +305,11 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static void compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize,
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int nelem, int freelist_idx);
 
 /*
  * memory allocation support
@@ -353,6 +382,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
+	bool 	initial_elems = false;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +538,35 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose the number of entries to allocate at a time. */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Calculate the number of elements to allocate
+	 *
+	 * For a shared hash table, preallocate the requested number of elements.
+	 * This reduces problems with run-time out-of-shared-memory conditions.
+	 *
+	 * For a non-shared hash table, preallocate the requested number of
+	 * elements if it's less than our chosen nelem_alloc.  This avoids wasting
+	 * space if the caller correctly estimates a small table size.
+	 */
+	if ((flags & HASH_SHARED_MEM) ||
+		nelem < nelem_batch)
+		initial_elems = true;
+	else
+		initial_elems = false;
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, initial_elems, nelem);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +615,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -567,14 +627,6 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	if (!init_htab(hashp, nelem))
 		elog(ERROR, "failed to initialize hash table \"%s\"", hashp->tabname);
 
-	/*
-	 * For a shared hash table, preallocate the requested number of elements.
-	 * This reduces problems with run-time out-of-shared-memory conditions.
-	 *
-	 * For a non-shared hash table, preallocate the requested number of
-	 * elements if it's less than our chosen nelem_alloc.  This avoids wasting
-	 * space if the caller correctly estimates a small table size.
-	 */
 	if ((flags & HASH_SHARED_MEM) ||
 		nelem < hctl->nelem_alloc)
 	{
@@ -582,6 +634,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *ptr = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -606,14 +661,31 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			nelem_alloc_first = nelem_alloc;
 
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements. We have to skip space for the header, segments and
+		 * buckets.
+		 */
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		ptr = HASH_ELEMENTS_PTR(hashp, nsegs);
+
+		/*
+		 * Assign the correct location of each parition within a pre-allocated
+		 * buffer.
+		 *
+		 * Actual memory allocation happens in ShmemInitHash for shared hash
+		 * tables or earlier in this function for non-shared hash tables.
+		 *
+		 * We just need to split that allocation into per-batch freelists.
+		 */
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -701,30 +773,12 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +791,25 @@ init_htab(HTAB *hashp, long nelem)
 			return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) HASH_DIRECTORY_PTR(hashp);
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
+		*segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+	Assert(i == nsegs);
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -846,16 +899,70 @@ hash_select_dirsize(long num_entries)
 }
 
 /*
- * Compute the required initial memory allocation for a shared-memory
- * hashtable with the given parameters.  We need space for the HASHHDR
- * and for the (non expansible) directory.
+ * Compute the required initial memory allocation for a hashtable with the given
+ * parameters. The hash table may be shared or private. We need space for the
+ * HASHHDR, for the directory, segments and the init_size elements in buckets.
+ *
+ * For shared hash tables the directory size is non-expansive.
+ *
+ * nelem is total number of elements requested during hash table creation.
+ *
+ * initial_nelems is true if it is decided to pre-allocate elements and false
+ * otherwise.
+ *
+ * For a shared hash table initial_elems is always true.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, bool initial_elems, long nelem)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	Size		elementSize = HASH_ELEMENT_SIZE(info);
+	int 	init_size = 0;
+
+	if (flags & HASH_SHARED_MEM)
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be atleast equal to the freelists */
+		if (nelem < NUM_FREELISTS)
+			nelem = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	compute_buckets_and_segs(nelem, num_partitions, ssize,
+							 &nbuckets, &nsegs);
+
+	/* initial_elems as false indicates no elements are to be pre-allocated */
+	if (initial_elems)
+		init_size = nelem;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
+		+ sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1392,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		if ((newElement = element_alloc(hashp, hctl->nelem_alloc)) == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1429,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, hctl->nelem_alloc, freelist_idx);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,29 +1808,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
-	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
+
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
@@ -1744,8 +1866,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2077,34 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+/*
+ * Calculate the number of buckets and segments to store the given
+ * number of elements in a hash table. Segments contain buckets which
+ * in turn contain elements.
+ */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
+{
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..4d5f09081b 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   bool initial_elems, long nelem);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.34.1



^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
  2025-03-23 08:36   ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-24 13:24     ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 12:02       ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-27 12:56         ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 23:50           ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-28 11:10             ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-28 13:00               ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-30 23:01                 ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
@ 2025-03-31 14:11                   ` Tomas Vondra <[email protected]>
  2025-05-13 11:04                     ` Re: Improve monitoring of shared memory allocations Amit Kapila <[email protected]>
  2026-02-10 12:17                     ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  0 siblings, 2 replies; 14+ messages in thread

From: Tomas Vondra @ 2025-03-31 14:11 UTC (permalink / raw)
  To: Rahila Syed <[email protected]>; +Cc: Nazir Bilal Yavuz <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

On 3/31/25 01:01, Rahila Syed wrote:
> ...
> 
> 
>     > I will improve the comment in the next version.
>     >
> 
>     OK. Do we even need to pass nelem_alloc to hash_get_init_size? It's not
>     really used except for this bit:
> 
>     +    if (init_size > nelem_alloc)
>     +        element_alloc = false;
> 
>     Can't we determine before calling the function, to make it a bit less
>     confusing?
> 
> 
> Yes, we could determine whether the pre-allocated elements are zero before
> calling the function, I have fixed it accordingly in the attached 0001
> patch. 
> Now, there's no need to pass `nelem_alloc` as a parameter. Instead, I've
> passed this information as a boolean variable-initial_elems. If it is
> false,
> no elements are pre-allocated.
> 
> Please find attached the v7-series, which incorporates your review patches
> and addresses a few remaining comments.
> 

I think it's almost committable. Attached is v8 with some minor review
adjustments, and updated commit messages. Please read through those and
feel free to suggest changes.

I still found the hash_get_init_size() comment unclear, and it also
referenced init_size, which is no longer relevant. I improved the
comment a bit (I find it useful to mimic comments of nearby functions,
so I did that too here). The "initial_elems" name was a bit confusing,
as it seemed to suggest "number of elements", but it's a simple flag. So
I renamed it to "prealloc", which seems clearer to me. I also tweaked
(reordered/reformatted) the conditions a bit.

For the other patch, I realized we can simply MemSet() the whole chunk,
instead of resetting the individual parts.

regards

-- 
Tomas Vondra


Attachments:

  [text/x-patch] v8-0001-Improve-acounting-for-memory-used-by-shared-hash-.patch (16.6K, 2-v8-0001-Improve-acounting-for-memory-used-by-shared-hash-.patch)
  download | inline diff:
From 1a82ac7407eb88bc085694519739115f718cc59a Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Mon, 31 Mar 2025 03:23:20 +0530
Subject: [PATCH v8 1/5] Improve acounting for memory used by shared hash
 tables

pg_shmem_allocations tracks the memory allocated by ShmemInitStruct(),
but for shared hash tables that covered only the header and hash
directory.  The remaining parts (segments and buckets) were allocated
later using ShmemAlloc(), which does not update the shmem accounting.
Thus, these allocations were not shown in pg_shmem_allocations.

This commit improves the situation by allocating all the hash table
parts at once, using a single ShmemInitStruct() call. This way the
ShmemIndex entries (and thus pg_shmem_allocations) better reflect the
proper size of the hash table.

This affects allocations for private (non-shared) hash tables too, as
the hash_create() code is shared. For non-shared tables this however
makes no practical difference.

This changes the alignment a bit. ShmemAlloc() aligns the chunks using
CACHELINEALIGN(), which means some parts (header, directory, segments)
were aligned this way. Allocating all parts as a single chunks removes
this (implicit) alignment. We've considered adding explicit alignment,
but we've decided not to - it seems to be merely a coincidence due to
using the ShmemAlloc() API, not due to necessity.

Author: Rahila Syed <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com
---
 src/backend/storage/ipc/shmem.c   |   4 +-
 src/backend/utils/hash/dynahash.c | 283 +++++++++++++++++++++++-------
 src/include/utils/hsearch.h       |   3 +-
 3 files changed, 222 insertions(+), 68 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39e..3c030d5743d 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
 #include "storage/shmem.h"
 #include "storage/spin.h"
 #include "utils/builtins.h"
+#include "utils/dynahash.h"
 
 static void *ShmemAllocRaw(Size size, Size *allocated_size);
 
@@ -346,7 +347,8 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
-							   hash_get_shared_size(infoP, hash_flags),
+							   hash_get_init_size(infoP, hash_flags,
+												  true, init_size),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3f25929f2d8..3dede9caa5d 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -260,12 +260,36 @@ static long hash_accesses,
 			hash_expansions;
 #endif
 
+#define	HASH_DIRECTORY_PTR(hashp) \
+	(((char *) (hashp)->hctl) + sizeof(HASHHDR))
+
+#define HASH_SEGMENT_OFFSET(hctl, idx) \
+	(sizeof(HASHHDR) + \
+	 ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+	 ((hctl)->ssize * (idx) * sizeof(HASHBUCKET)))
+
+#define HASH_SEGMENT_PTR(hashp, idx) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, (idx)))
+
+#define HASH_SEGMENT_SIZE(hashp) \
+	((hashp)->ssize * sizeof(HASHBUCKET))
+
+#define HASH_ELEMENTS_PTR(hashp, nsegs) \
+	((char *) (hashp)->hctl + HASH_SEGMENT_OFFSET((hashp)->hctl, nsegs))
+
+/* Each element has a HASHELEMENT header plus user data. */
+#define HASH_ELEMENT_SIZE(hctl) \
+	(MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN((hctl)->entrysize))
+
+#define HASH_ELEMENT_NEXT(hctl, num, ptr) \
+	((char *) (ptr) + ((num) * HASH_ELEMENT_SIZE(hctl)))
+
 /*
  * Private function prototypes
  */
 static void *DynaHashAlloc(Size size);
 static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +305,11 @@ static void register_seq_scan(HTAB *hashp);
 static void deregister_seq_scan(HTAB *hashp);
 static bool has_seq_scans(HTAB *hashp);
 
+static void compute_buckets_and_segs(long nelem, long num_partitions,
+									 long ssize,
+									 int *nbuckets, int *nsegments);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement,
+						int nelem, int freelist_idx);
 
 /*
  * memory allocation support
@@ -353,6 +382,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 {
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
+	int			nelem_batch;
+	bool 	initial_elems = false;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -507,9 +538,35 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hashp->isshared = false;
 	}
 
+	/* Choose the number of entries to allocate at a time. */
+	nelem_batch = choose_nelem_alloc(info->entrysize);
+
+	/*
+	 * Calculate the number of elements to allocate
+	 *
+	 * For a shared hash table, preallocate the requested number of elements.
+	 * This reduces problems with run-time out-of-shared-memory conditions.
+	 *
+	 * For a non-shared hash table, preallocate the requested number of
+	 * elements if it's less than our chosen nelem_alloc.  This avoids wasting
+	 * space if the caller correctly estimates a small table size.
+	 */
+	if ((flags & HASH_SHARED_MEM) ||
+		nelem < nelem_batch)
+		initial_elems = true;
+	else
+		initial_elems = false;
+
+	/*
+	 * Allocate the memory needed for hash header, directory, segments and
+	 * elements together. Use pointer arithmetic to arrive at the start of
+	 * each of these structures later.
+	 */
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Size		size = hash_get_init_size(info, flags, initial_elems, nelem);
+
+		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -558,6 +615,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	hctl->keysize = info->keysize;
 	hctl->entrysize = info->entrysize;
 
+	/* remember how many elements to allocate at once */
+	hctl->nelem_alloc = nelem_batch;
+
 	/* make local copies of heavily-used constant fields */
 	hashp->keysize = hctl->keysize;
 	hashp->ssize = hctl->ssize;
@@ -567,14 +627,6 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	if (!init_htab(hashp, nelem))
 		elog(ERROR, "failed to initialize hash table \"%s\"", hashp->tabname);
 
-	/*
-	 * For a shared hash table, preallocate the requested number of elements.
-	 * This reduces problems with run-time out-of-shared-memory conditions.
-	 *
-	 * For a non-shared hash table, preallocate the requested number of
-	 * elements if it's less than our chosen nelem_alloc.  This avoids wasting
-	 * space if the caller correctly estimates a small table size.
-	 */
 	if ((flags & HASH_SHARED_MEM) ||
 		nelem < hctl->nelem_alloc)
 	{
@@ -582,6 +634,9 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 					freelist_partitions,
 					nelem_alloc,
 					nelem_alloc_first;
+		void	   *ptr = NULL;
+		int			nsegs;
+		int			nbuckets;
 
 		/*
 		 * If hash table is partitioned, give each freelist an equal share of
@@ -606,14 +661,31 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		else
 			nelem_alloc_first = nelem_alloc;
 
+		/*
+		 * Calculate the offset at which to find the first partition of
+		 * elements. We have to skip space for the header, segments and
+		 * buckets.
+		 */
+		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+								 &nbuckets, &nsegs);
+
+		ptr = HASH_ELEMENTS_PTR(hashp, nsegs);
+
+		/*
+		 * Assign the correct location of each parition within a pre-allocated
+		 * buffer.
+		 *
+		 * Actual memory allocation happens in ShmemInitHash for shared hash
+		 * tables or earlier in this function for non-shared hash tables.
+		 *
+		 * We just need to split that allocation into per-batch freelists.
+		 */
 		for (i = 0; i < freelist_partitions; i++)
 		{
 			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
 
-			if (!element_alloc(hashp, temp, i))
-				ereport(ERROR,
-						(errcode(ERRCODE_OUT_OF_MEMORY),
-						 errmsg("out of memory")));
+			element_add(hashp, (HASHELEMENT *) ptr, temp, i);
+			ptr = HASH_ELEMENT_NEXT(hctl, temp, ptr);
 		}
 	}
 
@@ -701,30 +773,12 @@ init_htab(HTAB *hashp, long nelem)
 		for (i = 0; i < NUM_FREELISTS; i++)
 			SpinLockInit(&(hctl->freeList[i].mutex));
 
-	/*
-	 * Allocate space for the next greater power of two number of buckets,
-	 * assuming a desired maximum load factor of 1.
-	 */
-	nbuckets = next_pow2_int(nelem);
-
-	/*
-	 * In a partitioned table, nbuckets must be at least equal to
-	 * num_partitions; were it less, keys with apparently different partition
-	 * numbers would map to the same bucket, breaking partition independence.
-	 * (Normally nbuckets will be much bigger; this is just a safety check.)
-	 */
-	while (nbuckets < hctl->num_partitions)
-		nbuckets <<= 1;
+	compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
+							 &nbuckets, &nsegs);
 
 	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
 	hctl->high_mask = (nbuckets << 1) - 1;
 
-	/*
-	 * Figure number of directory segments needed, round up to a power of 2
-	 */
-	nsegs = (nbuckets - 1) / hctl->ssize + 1;
-	nsegs = next_pow2_int(nsegs);
-
 	/*
 	 * Make sure directory is big enough. If pre-allocated directory is too
 	 * small, choke (caller screwed up).
@@ -737,26 +791,25 @@ init_htab(HTAB *hashp, long nelem)
 			return false;
 	}
 
-	/* Allocate a directory */
+	/*
+	 * Assign a directory by making it point to the correct location in the
+	 * pre-allocated buffer.
+	 */
 	if (!(hashp->dir))
 	{
 		CurrentDynaHashCxt = hashp->hcxt;
-		hashp->dir = (HASHSEGMENT *)
-			hashp->alloc(hctl->dsize * sizeof(HASHSEGMENT));
-		if (!hashp->dir)
-			return false;
+		hashp->dir = (HASHSEGMENT *) HASH_DIRECTORY_PTR(hashp);
 	}
 
-	/* Allocate initial segments */
+	/* Assign initial segments, which are also pre-allocated */
+	i = 0;
 	for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
 	{
-		*segp = seg_alloc(hashp);
-		if (*segp == NULL)
-			return false;
+		*segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
+		MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
 	}
 
-	/* Choose number of entries to allocate at a time */
-	hctl->nelem_alloc = choose_nelem_alloc(hctl->entrysize);
+	Assert(i == nsegs);
 
 #ifdef HASH_DEBUG
 	fprintf(stderr, "init_htab:\n%s%p\n%s%ld\n%s%ld\n%s%d\n%s%ld\n%s%u\n%s%x\n%s%x\n%s%ld\n",
@@ -846,16 +899,70 @@ hash_select_dirsize(long num_entries)
 }
 
 /*
- * Compute the required initial memory allocation for a shared-memory
- * hashtable with the given parameters.  We need space for the HASHHDR
- * and for the (non expansible) directory.
+ * Compute the required initial memory allocation for a hashtable with the given
+ * parameters. The hash table may be shared or private. We need space for the
+ * HASHHDR, for the directory, segments and the init_size elements in buckets.
+ *
+ * For shared hash tables the directory size is non-expansive.
+ *
+ * nelem is total number of elements requested during hash table creation.
+ *
+ * initial_nelems is true if it is decided to pre-allocate elements and false
+ * otherwise.
+ *
+ * For a shared hash table initial_elems is always true.
  */
 Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_init_size(const HASHCTL *info, int flags, bool initial_elems, long nelem)
 {
-	Assert(flags & HASH_DIRSIZE);
-	Assert(info->dsize == info->max_dsize);
-	return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+	int			nbuckets;
+	int			nsegs;
+	int			num_partitions;
+	long		ssize;
+	long		dsize;
+	Size		elementSize = HASH_ELEMENT_SIZE(info);
+	int 	init_size = 0;
+
+	if (flags & HASH_SHARED_MEM)
+	{
+		Assert(flags & HASH_DIRSIZE);
+		Assert(info->dsize == info->max_dsize);
+	}
+
+	/* Non-shared hash tables may not specify dir size */
+	if (!(flags & HASH_DIRSIZE))
+	{
+		dsize = DEF_DIRSIZE;
+	}
+	else
+		dsize = info->dsize;
+
+	if (flags & HASH_PARTITION)
+	{
+		num_partitions = info->num_partitions;
+
+		/* Number of entries should be atleast equal to the freelists */
+		if (nelem < NUM_FREELISTS)
+			nelem = NUM_FREELISTS;
+	}
+	else
+		num_partitions = 0;
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
+	else
+		ssize = DEF_SEGSIZE;
+
+	compute_buckets_and_segs(nelem, num_partitions, ssize,
+							 &nbuckets, &nsegs);
+
+	/* initial_elems as false indicates no elements are to be pre-allocated */
+	if (initial_elems)
+		init_size = nelem;
+
+	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
+		+ sizeof(HASHBUCKET) * ssize * nsegs
+		+ init_size * elementSize;
 }
 
 
@@ -1285,7 +1392,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 		 * Failing because the needed element is in a different freelist is
 		 * not acceptable.
 		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+		if ((newElement = element_alloc(hashp, hctl->nelem_alloc)) == NULL)
 		{
 			int			borrow_from_idx;
 
@@ -1322,6 +1429,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
 			/* no elements available to borrow either, so out of memory */
 			return NULL;
 		}
+		element_add(hashp, newElement, hctl->nelem_alloc, freelist_idx);
 	}
 
 	/* remove entry from freelist, bump nentries */
@@ -1700,29 +1808,43 @@ seg_alloc(HTAB *hashp)
 }
 
 /*
- * allocate some new elements and link them into the indicated free list
+ * allocate some new elements
  */
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	Size		elementSize;
-	HASHELEMENT *firstElement;
-	HASHELEMENT *tmpElement;
-	HASHELEMENT *prevElement;
-	int			i;
+	HASHELEMENT *firstElement = NULL;
 
 	if (hashp->isfixed)
-		return false;
+		return NULL;
 
 	/* Each element has a HASHELEMENT header plus user data. */
-	elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 	CurrentDynaHashCxt = hashp->hcxt;
 	firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
 
 	if (!firstElement)
-		return false;
+		return NULL;
+
+	return firstElement;
+}
+
+/*
+ * link the elements allocated by element_alloc into the indicated free list
+ */
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem, int freelist_idx)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	Size		elementSize;
+	HASHELEMENT *tmpElement;
+	HASHELEMENT *prevElement;
+	int			i;
+
+	/* Each element has a HASHELEMENT header plus user data. */
+	elementSize = HASH_ELEMENT_SIZE(hctl);
 
 	/* prepare to link all the new entries into the freelist */
 	prevElement = NULL;
@@ -1744,8 +1866,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-	return true;
 }
 
 /*
@@ -1957,3 +2077,34 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
 		}
 	}
 }
+
+/*
+ * Calculate the number of buckets and segments to store the given
+ * number of elements in a hash table. Segments contain buckets which
+ * in turn contain elements.
+ */
+static void
+compute_buckets_and_segs(long nelem, long num_partitions, long ssize,
+						 int *nbuckets, int *nsegments)
+{
+	/*
+	 * Allocate space for the next greater power of two number of buckets,
+	 * assuming a desired maximum load factor of 1.
+	 */
+	*nbuckets = next_pow2_int(nelem);
+
+	/*
+	 * In a partitioned table, nbuckets must be at least equal to
+	 * num_partitions; were it less, keys with apparently different partition
+	 * numbers would map to the same bucket, breaking partition independence.
+	 * (Normally nbuckets will be much bigger; this is just a safety check.)
+	 */
+	while ((*nbuckets) < num_partitions)
+		(*nbuckets) <<= 1;
+
+	/*
+	 * Figure number of directory segments needed, round up to a power of 2
+	 */
+	*nsegments = ((*nbuckets) - 1) / ssize + 1;
+	*nsegments = next_pow2_int(*nsegments);
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d9..4d5f09081b4 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,8 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
 extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_init_size(const HASHCTL *info, int flags,
+							   bool initial_elems, long nelem);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0



  [text/x-patch] v8-0002-review.patch (6.7K, 3-v8-0002-review.patch)
  download | inline diff:
From d80cdbcac8cb4105d35aeaa8f0600b3b5f530942 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Mon, 31 Mar 2025 14:54:06 +0200
Subject: [PATCH v8 2/5] review

---
 src/backend/storage/ipc/shmem.c   |  2 +-
 src/backend/utils/hash/dynahash.c | 71 ++++++++++++++++---------------
 src/include/utils/hsearch.h       |  2 +-
 3 files changed, 39 insertions(+), 36 deletions(-)

diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 3c030d5743d..8e19134cb5c 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -348,7 +348,7 @@ ShmemInitHash(const char *name,		/* table string name for shmem index */
 	/* look it up in the shmem index */
 	location = ShmemInitStruct(name,
 							   hash_get_init_size(infoP, hash_flags,
-												  true, init_size),
+												  init_size, true),
 							   &found);
 
 	/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3dede9caa5d..0240a871395 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -383,7 +383,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	HTAB	   *hashp;
 	HASHHDR    *hctl;
 	int			nelem_batch;
-	bool 	initial_elems = false;
+	bool		prealloc;
 
 	/*
 	 * Hash tables now allocate space for key and data, but you have to say
@@ -547,15 +547,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 * For a shared hash table, preallocate the requested number of elements.
 	 * This reduces problems with run-time out-of-shared-memory conditions.
 	 *
-	 * For a non-shared hash table, preallocate the requested number of
-	 * elements if it's less than our chosen nelem_alloc.  This avoids wasting
-	 * space if the caller correctly estimates a small table size.
+	 * For a private hash table, preallocate the requested number of elements
+	 * if it's less than our chosen nelem_alloc.  This avoids wasting space if
+	 * the caller correctly estimates a small table size.
 	 */
-	if ((flags & HASH_SHARED_MEM) ||
-		nelem < nelem_batch)
-		initial_elems = true;
-	else
-		initial_elems = false;
+	prealloc = (flags & HASH_SHARED_MEM) || (nelem < nelem_batch);
 
 	/*
 	 * Allocate the memory needed for hash header, directory, segments and
@@ -564,7 +560,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 	 */
 	if (!hashp->hctl)
 	{
-		Size		size = hash_get_init_size(info, flags, initial_elems, nelem);
+		Size		size = hash_get_init_size(info, flags, nelem, prealloc);
 
 		hashp->hctl = (HASHHDR *) hashp->alloc(size);
 		if (!hashp->hctl)
@@ -664,7 +660,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		/*
 		 * Calculate the offset at which to find the first partition of
 		 * elements. We have to skip space for the header, segments and
-		 * buckets.
+		 * buckets. We need to recalculate the number of segments, which we
+		 * don't store anywhere.
 		 */
 		compute_buckets_and_segs(nelem, hctl->num_partitions, hctl->ssize,
 								 &nbuckets, &nsegs);
@@ -899,21 +896,24 @@ hash_select_dirsize(long num_entries)
 }
 
 /*
- * Compute the required initial memory allocation for a hashtable with the given
- * parameters. The hash table may be shared or private. We need space for the
- * HASHHDR, for the directory, segments and the init_size elements in buckets.
+ * hash_get_init_size -- determine memory needed for a new dynamic hash table
  *
- * For shared hash tables the directory size is non-expansive.
- *
- * nelem is total number of elements requested during hash table creation.
+ *	info: hash table parameters
+ *	flags: bitmask indicating which parameters to take from *info
+ *	nelem: maximum number of elements expected
+ *	prealloc: request preallocation of elements
  *
- * initial_nelems is true if it is decided to pre-allocate elements and false
- * otherwise.
+ * Compute the required initial memory allocation for a hashtable with the given
+ * parameters. We need space for the HASHHDR, for the directory, segments and
+ * preallocated elements.
  *
- * For a shared hash table initial_elems is always true.
+ * The hash table may be private or shared. For shared hash tables the directory
+ * size is non-expansive, and we preallocate all elements (nelem). For private
+ * hash tables, we preallocate elements only if the expected number of elements
+ * is small (less than nelem_alloc).
  */
 Size
-hash_get_init_size(const HASHCTL *info, int flags, bool initial_elems, long nelem)
+hash_get_init_size(const HASHCTL *info, int flags, long nelem, bool prealloc)
 {
 	int			nbuckets;
 	int			nsegs;
@@ -921,21 +921,29 @@ hash_get_init_size(const HASHCTL *info, int flags, bool initial_elems, long nele
 	long		ssize;
 	long		dsize;
 	Size		elementSize = HASH_ELEMENT_SIZE(info);
-	int 	init_size = 0;
+	long		nelem_prealloc = 0;
 
+#ifdef USE_ASSERT_CHECKING
+	/* shared hash tables have non-expansive directory */
+	/* XXX what about segment size? should check have HASH_SEGMENT? */
 	if (flags & HASH_SHARED_MEM)
 	{
 		Assert(flags & HASH_DIRSIZE);
 		Assert(info->dsize == info->max_dsize);
+		Assert(prealloc);
 	}
+#endif
 
 	/* Non-shared hash tables may not specify dir size */
-	if (!(flags & HASH_DIRSIZE))
-	{
+	if (flags & HASH_DIRSIZE)
+		dsize = info->dsize;
+	else
 		dsize = DEF_DIRSIZE;
-	}
+
+	if (flags & HASH_SEGMENT)
+		ssize = info->ssize;
 	else
-		dsize = info->dsize;
+		ssize = DEF_SEGSIZE;
 
 	if (flags & HASH_PARTITION)
 	{
@@ -948,21 +956,16 @@ hash_get_init_size(const HASHCTL *info, int flags, bool initial_elems, long nele
 	else
 		num_partitions = 0;
 
-	if (flags & HASH_SEGMENT)
-		ssize = info->ssize;
-	else
-		ssize = DEF_SEGSIZE;
-
 	compute_buckets_and_segs(nelem, num_partitions, ssize,
 							 &nbuckets, &nsegs);
 
 	/* initial_elems as false indicates no elements are to be pre-allocated */
-	if (initial_elems)
-		init_size = nelem;
+	if (prealloc)
+		nelem_prealloc = nelem;
 
 	return sizeof(HASHHDR) + dsize * sizeof(HASHSEGMENT)
 		+ sizeof(HASHBUCKET) * ssize * nsegs
-		+ init_size * elementSize;
+		+ nelem_prealloc * elementSize;
 }
 
 
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 4d5f09081b4..5dc74db6e96 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -152,7 +152,7 @@ extern void hash_freeze(HTAB *hashp);
 extern Size hash_estimate_size(long num_entries, Size entrysize);
 extern long hash_select_dirsize(long num_entries);
 extern Size hash_get_init_size(const HASHCTL *info, int flags,
-							   bool initial_elems, long nelem);
+							   long nelem, bool prealloc);
 extern void AtEOXact_HashTables(bool isCommit);
 extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
 
-- 
2.49.0



  [text/x-patch] v8-0003-Improve-accounting-for-PredXactList-RWConflictPoo.patch (7.4K, 4-v8-0003-Improve-accounting-for-PredXactList-RWConflictPoo.patch)
  download | inline diff:
From 609404c682e28a59f80ac7630e4be8e46d12566e Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Mon, 31 Mar 2025 03:45:50 +0530
Subject: [PATCH v8 3/5] Improve accounting for PredXactList, RWConflictPool
 and PGPROC

Various places allocated shared memory by first allocating a small chunk
using ShmemInitStruct(), followed by ShmemAlloc() calls to allocate more
memory. Unfortunately, ShmemAlloc() does not update ShmemIndex, so this
affected pg_shmem_allocations - it only shown the initial chunk.

This commit modifies the following allocations, to allocate everything
as a single chunk, and then split it internally.

- PredXactList
- RWConflictPool
- PGPROC structures
- Fast-Path lock arrays

Author: Rahila Syed <[email protected]>
Reviewed-by: Andres Freund <[email protected]>
Reviewed-by: Nazir Bilal Yavuz <[email protected]>
Reviewed-by: Tomas Vondra <[email protected]>
Discussion: https://postgr.es/m/CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com
---
 src/backend/storage/lmgr/predicate.c | 30 ++++++++++-----
 src/backend/storage/lmgr/proc.c      | 57 +++++++++++++++++++++++-----
 2 files changed, 67 insertions(+), 20 deletions(-)

diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a053981..d82114ffca1 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,14 +1226,21 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 10;
 
+	requestSize = add_size(PredXactListDataSize,
+						   (mul_size((Size) max_table_size,
+									 sizeof(SERIALIZABLEXACT))));
+
 	PredXact = ShmemInitStruct("PredXactList",
-							   PredXactListDataSize,
+							   requestSize,
 							   &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, both the header and the element */
+		memset(PredXact, 0, requestSize);
+
 		dlist_init(&PredXact->availableList);
 		dlist_init(&PredXact->activeList);
 		PredXact->SxactGlobalXmin = InvalidTransactionId;
@@ -1242,11 +1249,9 @@ PredicateLockShmemInit(void)
 		PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
 		PredXact->CanPartialClearThrough = 0;
 		PredXact->HavePartialClearedThrough = 0;
-		requestSize = mul_size((Size) max_table_size,
-							   sizeof(SERIALIZABLEXACT));
-		PredXact->element = ShmemAlloc(requestSize);
+		PredXact->element
+			= (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
 		/* Add all elements to available list, clean. */
-		memset(PredXact->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			LWLockInitialize(&PredXact->element[i].perXactPredicateListLock,
@@ -1300,20 +1305,25 @@ PredicateLockShmemInit(void)
 	 */
 	max_table_size *= 5;
 
+	requestSize = RWConflictPoolHeaderDataSize +
+		mul_size((Size) max_table_size,
+				 RWConflictDataSize);
+
 	RWConflictPool = ShmemInitStruct("RWConflictPool",
-									 RWConflictPoolHeaderDataSize,
+									 requestSize,
 									 &found);
 	Assert(found == IsUnderPostmaster);
 	if (!found)
 	{
 		int			i;
 
+		/* clean everything, including the elements */
+		memset(RWConflictPool, 0, requestSize);
+
 		dlist_init(&RWConflictPool->availableList);
-		requestSize = mul_size((Size) max_table_size,
-							   RWConflictDataSize);
-		RWConflictPool->element = ShmemAlloc(requestSize);
+		RWConflictPool->element = (RWConflict) ((char *) RWConflictPool +
+												RWConflictPoolHeaderDataSize);
 		/* Add all elements to available list, clean. */
-		memset(RWConflictPool->element, 0, requestSize);
 		for (i = 0; i < max_table_size; i++)
 		{
 			dlist_push_tail(&RWConflictPool->availableList,
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 066319afe2b..4e23c793f7f 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -123,6 +123,25 @@ ProcGlobalShmemSize(void)
 	return size;
 }
 
+/*
+ * Report shared-memory space needed by PGPROC.
+ */
+static Size
+PGProcShmemSize(void)
+{
+	Size		size;
+	uint32		TotalProcs;
+
+	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+	size = TotalProcs * sizeof(PGPROC);
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	return size;
+}
+
 /*
  * Report number of semaphores needed by InitProcGlobal.
  */
@@ -175,6 +194,8 @@ InitProcGlobal(void)
 			   *fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
 	Size		fpLockBitsSize,
 				fpRelIdSize;
+	Size		requestSize;
+	char	   *ptr;
 
 	/* Create the ProcGlobal shared structure */
 	ProcGlobal = (PROC_HDR *)
@@ -204,7 +225,15 @@ InitProcGlobal(void)
 	 * with a single freelist.)  Each PGPROC structure is dedicated to exactly
 	 * one of these purposes, and they do not move between groups.
 	 */
-	procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+	requestSize = PGProcShmemSize();
+
+	ptr = ShmemInitStruct("PGPROC structures",
+						  requestSize,
+						  &found);
+
+	procs = (PGPROC *) ptr;
+	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+
 	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -213,17 +242,21 @@ InitProcGlobal(void)
 	/*
 	 * Allocate arrays mirroring PGPROC fields in a dense manner. See
 	 * PROC_HDR.
-	 *
-	 * XXX: It might make sense to increase padding for these arrays, given
-	 * how hotly they are accessed.
 	 */
-	ProcGlobal->xids =
-		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+	ProcGlobal->xids = (TransactionId *) ptr;
 	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
-	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+
+	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
 	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
-	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+
+	ProcGlobal->statusFlags = (uint8 *) ptr;
 	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
+
+	/* make sure wer didn't overflow */
+	Assert((ptr > (char *) procs) && (ptr <= (char *) procs + requestSize));
 
 	/*
 	 * Allocate arrays for fast-path locks. Those are variable-length, so
@@ -233,7 +266,9 @@ InitProcGlobal(void)
 	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
 	fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
 
-	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+	fpPtr = ShmemInitStruct("Fast path lock arrays",
+							TotalProcs * (fpLockBitsSize + fpRelIdSize),
+							&found);
 	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
 
 	/* For asserts checking we did not overflow. */
@@ -330,7 +365,9 @@ InitProcGlobal(void)
 	PreparedXactProcs = &procs[MaxBackends + NUM_AUXILIARY_PROCS];
 
 	/* Create ProcStructLock spinlock, too */
-	ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
+	ProcStructLock = (slock_t *) ShmemInitStruct("ProcStructLock spinlock",
+												 sizeof(slock_t),
+												 &found);
 	SpinLockInit(ProcStructLock);
 }
 
-- 
2.49.0



  [text/x-patch] v8-0004-review.patch (1.6K, 5-v8-0004-review.patch)
  download | inline diff:
From a0d81976406ea47ac249f07bf850e65d126627cb Mon Sep 17 00:00:00 2001
From: Tomas Vondra <[email protected]>
Date: Mon, 31 Mar 2025 14:56:56 +0200
Subject: [PATCH v8 4/5] review

---
 src/backend/storage/lmgr/proc.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 4e23c793f7f..ba1ac7e4861 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -231,10 +231,11 @@ InitProcGlobal(void)
 						  requestSize,
 						  &found);
 
+	MemSet(ptr, 0, requestSize);
+
 	procs = (PGPROC *) ptr;
 	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
 
-	MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
 	ProcGlobal->allProcCount = MaxBackends + NUM_AUXILIARY_PROCS;
@@ -244,15 +245,12 @@ InitProcGlobal(void)
 	 * PROC_HDR.
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
-	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
 	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
-	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
 	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
-	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	/* make sure wer didn't overflow */
-- 
2.49.0



  [text/x-patch] v8-0005-Add-cacheline-padding-between-heavily-accessed-ar.patch (2.0K, 6-v8-0005-Add-cacheline-padding-between-heavily-accessed-ar.patch)
  download | inline diff:
From 71b206e3b17b167733055c8a392be687deee6410 Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Mon, 31 Mar 2025 04:27:43 +0530
Subject: [PATCH v8 5/5] Add cacheline padding between heavily accessed arrays

---
 src/backend/storage/lmgr/proc.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ba1ac7e4861..c39b51127a8 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -134,9 +134,9 @@ PGProcShmemSize(void)
 
 	TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
 
-	size = TotalProcs * sizeof(PGPROC);
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
-	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	size = CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids)));
+	size = add_size(size, CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates)));
 	size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
 
 	return size;
@@ -234,7 +234,7 @@ InitProcGlobal(void)
 	MemSet(ptr, 0, requestSize);
 
 	procs = (PGPROC *) ptr;
-	ptr = (char *) ptr + TotalProcs * sizeof(PGPROC);
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(PGPROC));
 
 	ProcGlobal->allProcs = procs;
 	/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -245,10 +245,10 @@ InitProcGlobal(void)
 	 * PROC_HDR.
 	 */
 	ProcGlobal->xids = (TransactionId *) ptr;
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->xids));
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->xids));
 
 	ProcGlobal->subxidStates = (XidCacheStatus *) ptr;
-	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->subxidStates));
+	ptr = (char *) ptr + CACHELINEALIGN(TotalProcs * sizeof(*ProcGlobal->subxidStates));
 
 	ProcGlobal->statusFlags = (uint8 *) ptr;
 	ptr = (char *) ptr + (TotalProcs * sizeof(*ProcGlobal->statusFlags));
-- 
2.49.0



^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
  2025-03-23 08:36   ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-24 13:24     ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 12:02       ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-27 12:56         ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 23:50           ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-28 11:10             ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-28 13:00               ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-30 23:01                 ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-31 14:11                   ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
@ 2025-05-13 11:04                     ` Amit Kapila <[email protected]>
  1 sibling, 0 replies; 14+ messages in thread

From: Amit Kapila @ 2025-05-13 11:04 UTC (permalink / raw)
  To: Rahila Syed <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

On Fri, Apr 4, 2025 at 9:15 PM Rahila Syed <[email protected]> wrote:
>
> Please find attached the patch which removes the changes for non-shared hash tables
> and keeps them for shared hash tables.
>

  /* Allocate initial segments */
+ i = 0;
  for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
  {
- *segp = seg_alloc(hashp);
- if (*segp == NULL)
- return false;
+ /* Assign initial segments, which are also pre-allocated */
+ if (hashp->isshared)
+ {
+ *segp = (HASHSEGMENT) HASH_SEGMENT_PTR(hashp, i++);
+ MemSet(*segp, 0, HASH_SEGMENT_SIZE(hashp));
+ }
+ else
+ {
+ *segp = seg_alloc(hashp);
+ i++;
+ }

In the non-shared hash table case, previously, we used to return false
from here when seg_alloc() fails, but now it seems to be proceeding
without returning false. Is there a reason for the same?

> I tested this by running make-check, make-check world and the reproducer script shared
> by David. I also ran pgbench to test creation and expansion of some of the
> shared hash tables.
>

This covers the basic tests for this patch. I think we should do some
low-level testing of both shared and non-shared hash tables by having
a contrib module or such (we don't need to commit such a contrib
module, but it will give us confidence that the low-level data
structure allocation change is thoroughly tested). We also need to
focus on negative tests where there is insufficient memory in the
system.

-- 
With Regards,
Amit Kapila.





^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
  2025-03-23 08:36   ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-24 13:24     ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 12:02       ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-27 12:56         ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 23:50           ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-28 11:10             ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-28 13:00               ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-30 23:01                 ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-31 14:11                   ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
@ 2026-02-10 12:17                     ` Rahila Syed <[email protected]>
  2026-04-07 08:19                       ` Re: Improve monitoring of shared memory allocations Heikki Linnakangas <[email protected]>
  1 sibling, 1 reply; 14+ messages in thread

From: Rahila Syed @ 2026-02-10 12:17 UTC (permalink / raw)
  To: Álvaro Herrera <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

Hi,


> Was this forgotten, superseded, abandoned, or should it still be under
> consideration?
> https://commitfest.postgresql.org/patch/5620/


This hasn't been forgotten; I just haven't had a chance to address the
comments
about improving test coverage yet. I think it is still possible to make
this work for
shared hash tables. I will try to rebase and add some tests before the next
commitfest.

 Thank you,
Rahila Syed


^ permalink  raw  reply  [nested|flat] 14+ messages in thread

* Re: Improve monitoring of shared memory allocations
  2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-21 11:15 ` Re: Improve monitoring of shared memory allocations Nazir Bilal Yavuz <[email protected]>
  2025-03-23 08:36   ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-24 13:24     ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 12:02       ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-27 12:56         ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-27 23:50           ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-28 11:10             ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-28 13:00               ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2025-03-30 23:01                 ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
  2025-03-31 14:11                   ` Re: Improve monitoring of shared memory allocations Tomas Vondra <[email protected]>
  2026-02-10 12:17                     ` Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
@ 2026-04-07 08:19                       ` Heikki Linnakangas <[email protected]>
  0 siblings, 0 replies; 14+ messages in thread

From: Heikki Linnakangas @ 2026-04-07 08:19 UTC (permalink / raw)
  To: Rahila Syed <[email protected]>; Álvaro Herrera <[email protected]>; +Cc: Tomas Vondra <[email protected]>; Nazir Bilal Yavuz <[email protected]>; Andres Freund <[email protected]>; pgsql-hackers

On 10/02/2026 14:17, Rahila Syed wrote:
> Hi,
> 
> 
>     Was this forgotten, superseded, abandoned, or should it still be under
>     consideration?
>     https://commitfest.postgresql.org/patch/5620/ <https://
>     commitfest.postgresql.org/patch/5620/>
> 
> 
> This hasn't been forgotten; I just haven't had a chance to address the 
> comments
> about improving test coverage yet. I think it is still possible to make 
> this work for
> shared hash tables. I will try to rebase and add some tests before the 
> next commitfest.

This was superseded by commit 9fe9ecd516, I'll close this in the 
commitfest app. Thanks!

- Heikki






^ permalink  raw  reply  [nested|flat] 14+ messages in thread

end of thread, other threads:[~2026-04-07 08:19 UTC | newest]

Thread overview: 14+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2025-03-12 10:46 Re: Improve monitoring of shared memory allocations Rahila Syed <[email protected]>
2025-03-21 11:15 ` Nazir Bilal Yavuz <[email protected]>
2025-03-23 08:36   ` Rahila Syed <[email protected]>
2025-03-24 13:24     ` Tomas Vondra <[email protected]>
2025-03-27 12:02       ` Rahila Syed <[email protected]>
2025-03-27 12:56         ` Tomas Vondra <[email protected]>
2025-03-27 23:50           ` Tomas Vondra <[email protected]>
2025-03-28 11:10             ` Rahila Syed <[email protected]>
2025-03-28 13:00               ` Tomas Vondra <[email protected]>
2025-03-30 23:01                 ` Rahila Syed <[email protected]>
2025-03-31 14:11                   ` Tomas Vondra <[email protected]>
2025-05-13 11:04                     ` Amit Kapila <[email protected]>
2026-02-10 12:17                     ` Rahila Syed <[email protected]>
2026-04-07 08:19                       ` Heikki Linnakangas <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox