Re: Improve monitoring of shared memory allocations

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Andres Freund <[email protected]>
To: Rahila Syed <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Improve monitoring of shared memory allocations
Date: Mon, 3 Mar 2025 13:07:47 -0500
Message-ID: <k6f6ynjvb7lvebhygaiqsfrdohq672uughk3q4ve4q5jqljywz@7jbrz724epeq> (raw)
In-Reply-To: <CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com>
References: <CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com>

Hi,

Thanks for sending these, the issues addressed here have been bugging me for a
long while.

On 2025-03-01 10:19:01 +0530, Rahila Syed wrote:
> The 0001* patch improved the accounting for the shared memory allocated for
> a hash table during hash_create.  pg_shmem_allocations tracks the memory
> allocated by ShmemInitStruct, which, for shared hash tables, only covers
> memory allocated for the hash directory and control structure via
> ShmemInitHash. The hash segments and buckets are allocated using
> ShmemAllocNoError, which does not attribute the allocation to the hash table
> and also does not add it to ShmemIndex.
> 
> Therefore, these allocations are not tracked in pg_shmem_allocations.  To
> improve this, include the allocation of segments and buckets in the initial
> allocation of the shared memory for the hash table, in ShmemInitHash.
>
> This will result in pg_shmem_allocations representing the total size of the
> initial hash table, including all the buckets and elements, instead of just
> the directory size.

I think this should also be more efficient. Less space wasted on padding and
fewer indirect function calls.


> Like ShmemAllocNoError, the shared memory allocated by ShmemAlloc is not
> tracked by pg_shmem_allocations.  The 0002* patch replaces most of the calls
> to ShmemAlloc with ShmemInitStruct to associate a name with the allocations
> and ensure that they get tracked by pg_shmem_allocations.

In some of these cases it may be better to combine the allocation with a prior
ShmemInitStruct instead of doing a separate ShmemInitStruct() for the
allocations that are doing ShmemAlloc() right now.


cfbot found a few compiler warnings:

https://cirrus-ci.com/task/6526903542087680
[16:47:46.964] make -s -j${BUILD_JOBS} clean
[16:47:47.452] time make -s -j${BUILD_JOBS} world-bin
[16:49:10.496] lwlock.c: In function ‘CreateLWLocks’:
[16:49:10.496] lwlock.c:467:22: error: unused variable ‘found’ [-Werror=unused-variable]
[16:49:10.496]   467 |                 bool found;
[16:49:10.496]       |                      ^~~~~
[16:49:10.496] cc1: all warnings being treated as errors
[16:49:10.496] make[4]: *** [<builtin>: lwlock.o] Error 1
[16:49:10.496] make[3]: *** [../../../src/backend/common.mk:37: lmgr-recursive] Error 2
[16:49:10.496] make[3]: *** Waiting for unfinished jobs....
[16:49:11.881] make[2]: *** [common.mk:37: storage-recursive] Error 2
[16:49:11.881] make[2]: *** Waiting for unfinished jobs....
[16:49:20.195] dynahash.c: In function ‘hash_create’:
[16:49:20.195] dynahash.c:643:37: error: ‘curr_offset’ may be used uninitialized [-Werror=maybe-uninitialized]
[16:49:20.195]   643 |                         curr_offset = (((char *)curr_offset) + (temp * elementSize));
[16:49:20.195]       |                         ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[16:49:20.195] dynahash.c:588:23: note: ‘curr_offset’ was declared here
[16:49:20.195]   588 |                 void *curr_offset;
[16:49:20.195]       |                       ^~~~~~~~~~~
[16:49:20.195] cc1: all warnings being treated as errors
[16:49:20.196] make[4]: *** [<builtin>: dynahash.o] Error 1


> From c13a7133ed455842b685426217a7b5079e6fc869 Mon Sep 17 00:00:00 2001
> From: Rahila Syed <[email protected]>
> Date: Fri, 21 Feb 2025 15:08:12 +0530
> Subject: [PATCH 1/2] Account for initial shared memory allocated during
>  hash_create.
> 
> pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
> which, in case of shared hash tables, only covers memory allocated
> to the hash directory and control structure. The hash segments and
> buckets are allocated using ShmemAllocNoError which does not attribute
> the allocations to the hash table name. Thus, these allocations are
> not tracked in pg_shmem_allocations.
> 
> Improve this include the alocation of segments and buckets or elements
> in the initial allocation of shared hash directory. Since this adds numbers
> to existing hash table entries, the resulting tuples in pg_shmem_allocations
> represent the total size of the initial hash table including all the
> buckets and the elements they contain, instead of just the directory size.


> diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
> index cd5a00132f..5203f5b30b 100644
> --- a/src/backend/utils/hash/dynahash.c
> +++ b/src/backend/utils/hash/dynahash.c
> @@ -120,7 +120,6 @@
>   * a good idea of the maximum number of entries!).  For non-shared hash
>   * tables, the initial directory size can be left at the default.
>   */
> -#define DEF_SEGSIZE			   256
>  #define DEF_SEGSIZE_SHIFT	   8	/* must be log2(DEF_SEGSIZE) */
>  #define DEF_DIRSIZE			   256

Why did you move this to the header? Afaict it's just used in
hash_get_shared_size(), which is also in dynahash.c?


> @@ -265,7 +264,7 @@ static long hash_accesses,
>   */
>  static void *DynaHashAlloc(Size size);
>  static HASHSEGMENT seg_alloc(HTAB *hashp);
> -static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
> +static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx, HASHELEMENT *firstElement);
>  static bool dir_realloc(HTAB *hashp);
>  static bool expand_table(HTAB *hashp);
>  static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
> @@ -276,11 +275,10 @@ static void hash_corrupted(HTAB *hashp) pg_attribute_noreturn();
>  static uint32 hash_initial_lookup(HTAB *hashp, uint32 hashvalue,
>  								  HASHBUCKET **bucketptr);
>  static long next_pow2_long(long num);
> -static int	next_pow2_int(long num);
>  static void register_seq_scan(HTAB *hashp);
>  static void deregister_seq_scan(HTAB *hashp);
>  static bool has_seq_scans(HTAB *hashp);
> -
> +static int find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize);
>  
>  /*
>   * memory allocation support

You removed a newline here that probably shouldn't be removed.



> @@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>  	else
>  		hashp->keycopy = memcpy;
>  
> -	/* And select the entry allocation function, too. */
> +	/*
> +	 * And select the entry allocation function, too. XXX should this also
> +	 * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is
> +	 * currently only set with HASH_SHARED_MEM *
> +	 */
>  	if (flags & HASH_ALLOC)
>  		hashp->alloc = info->alloc;
>  	else
> @@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>  
>  	hashp->frozen = false;
>  
> +	/* Initializing the HASHHDR variables with default values */
>  	hdefault(hashp);
>  
>  	hctl = hashp->hctl;

I assume these were just observations you made while looking into this? They
seem unrelated to the change itself?


> @@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>  					freelist_partitions,
>  					nelem_alloc,
>  					nelem_alloc_first;
> -
> +		void *curr_offset;
> +	
>  		/*
>  		 * If hash table is partitioned, give each freelist an equal share of
>  		 * the initial allocation.  Otherwise only freeList[0] is used.
> @@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>  		else
>  			freelist_partitions = 1;
>  
> +		/*
> +		 * If table is shared, calculate the offset at which to find the
> +		 * the first partition of elements
> +		 */
> +		if (hashp->isshared)
> +		{
> +			int			nsegs;
> +			int			nbuckets;
> +			nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
> +			
> +			curr_offset =  (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
> +                        + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
> +		}
> +

Why only do this for shared hashtables? Couldn't we allocate the elments
together with the rest for non-share hashtables too?


>  		nelem_alloc = nelem / freelist_partitions;
>  		if (nelem_alloc <= 0)
>  			nelem_alloc = 1;
> @@ -609,11 +627,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>  		for (i = 0; i < freelist_partitions; i++)
>  		{
>  			int			temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
> -
> -			if (!element_alloc(hashp, temp, i))
> +			HASHELEMENT *firstElement;
> + 			Size elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
> +
> +			/* Memory is allocated as part of initial allocation in ShmemInitHash */
> +			if (hashp->isshared)
> +				firstElement = (HASHELEMENT *) curr_offset;
> +			else
> +				firstElement = NULL;		
> +				
> +			if (!element_alloc(hashp, temp, i, firstElement))
>  				ereport(ERROR,
>  						(errcode(ERRCODE_OUT_OF_MEMORY),
>  						 errmsg("out of memory")));
> +			curr_offset = (((char *)curr_offset) + (temp * elementSize));
>  		}
>  	}

Seems a bit ugly to go through element_alloc() when pre-allocating.  Perhaps
it's the best thing we can do to avoid duplicating code, but it seems worth
checking if we can do better. Perhaps we could split element_alloc() into
element_alloc() and element_add() or such?  With the latter doing everything
after hashp->alloc().


> @@ -693,7 +720,7 @@ init_htab(HTAB *hashp, long nelem)
>  	int			nbuckets;
>  	int			nsegs;
>  	int			i;
> -
> +	
>  	/*
>  	 * initialize mutexes if it's a partitioned table
>  	 */

Spurious change.


> @@ -701,30 +728,11 @@ init_htab(HTAB *hashp, long nelem)
>  		for (i = 0; i < NUM_FREELISTS; i++)
>  			SpinLockInit(&(hctl->freeList[i].mutex));
>  
> -	/*
> -	 * Allocate space for the next greater power of two number of buckets,
> -	 * assuming a desired maximum load factor of 1.
> -	 */
> -	nbuckets = next_pow2_int(nelem);
> -
> -	/*
> -	 * In a partitioned table, nbuckets must be at least equal to
> -	 * num_partitions; were it less, keys with apparently different partition
> -	 * numbers would map to the same bucket, breaking partition independence.
> -	 * (Normally nbuckets will be much bigger; this is just a safety check.)
> -	 */
> -	while (nbuckets < hctl->num_partitions)
> -		nbuckets <<= 1;
> +	nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
>  
>  	hctl->max_bucket = hctl->low_mask = nbuckets - 1;
>  	hctl->high_mask = (nbuckets << 1) - 1;
>  
> -	/*
> -	 * Figure number of directory segments needed, round up to a power of 2
> -	 */
> -	nsegs = (nbuckets - 1) / hctl->ssize + 1;
> -	nsegs = next_pow2_int(nsegs);
> -
>  	/*
>  	 * Make sure directory is big enough. If pre-allocated directory is too
>  	 * small, choke (caller screwed up).

A function called find_num_of_segs() that also sets nbuckets seems a bit
confusing.  I also don't like "find_*", as that sounds like it's searching
some datastructure, rather than just doing a bit of math.


> @@ -1689,6 +1726,7 @@ seg_alloc(HTAB *hashp)
>  	HASHSEGMENT segp;
>  
>  	CurrentDynaHashCxt = hashp->hcxt;
> +	
>  	segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) * hashp->ssize);
>  
>  	if (!segp)

Spurious change.


> @@ -1816,7 +1854,7 @@ next_pow2_long(long num)
>  }
>  
>  /* calculate first power of 2 >= num, bounded to what will fit in an int */
> -static int
> +int
>  next_pow2_int(long num)
>  {
>  	if (num > INT_MAX / 2)
> @@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
>  		}
>  	}
>  }

Why export this?


> diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
> index 932cc4f34d..5e16bd4183 100644
> --- a/src/include/utils/hsearch.h
> +++ b/src/include/utils/hsearch.h
> @@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
>  extern void hash_freeze(HTAB *hashp);
>  extern Size hash_estimate_size(long num_entries, Size entrysize);
>  extern long hash_select_dirsize(long num_entries);
> -extern Size hash_get_shared_size(HASHCTL *info, int flags);
> +extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);
>  extern void AtEOXact_HashTables(bool isCommit);
>  extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);

It's imo a bit weird that we have very related logic in hash_estimate_size()
and hash_get_shared_size(). Why do we need to duplicate it?



> --- a/src/backend/storage/lmgr/predicate.c
> +++ b/src/backend/storage/lmgr/predicate.c
> @@ -1244,7 +1244,7 @@ PredicateLockShmemInit(void)
>  		PredXact->HavePartialClearedThrough = 0;
>  		requestSize = mul_size((Size) max_table_size,
>  							   sizeof(SERIALIZABLEXACT));
> -		PredXact->element = ShmemAlloc(requestSize);
> +		PredXact->element = ShmemInitStruct("SerializableXactList", requestSize, &found);
>  		/* Add all elements to available list, clean. */
>  		memset(PredXact->element, 0, requestSize);
>  		for (i = 0; i < max_table_size; i++)
> @@ -1299,9 +1299,11 @@ PredicateLockShmemInit(void)
>  	 * probably OK.
>  	 */
>  	max_table_size *= 5;
> +	requestSize = mul_size((Size) max_table_size,
> +						RWConflictDataSize);
>  
>  	RWConflictPool = ShmemInitStruct("RWConflictPool",
> -									 RWConflictPoolHeaderDataSize,
> +									 RWConflictPoolHeaderDataSize + requestSize,
>  									 &found);
>  	Assert(found == IsUnderPostmaster);
>  	if (!found)
> @@ -1309,9 +1311,7 @@ PredicateLockShmemInit(void)
>  		int			i;
>  
>  		dlist_init(&RWConflictPool->availableList);
> -		requestSize = mul_size((Size) max_table_size,
> -							   RWConflictDataSize);
> -		RWConflictPool->element = ShmemAlloc(requestSize);
> +		RWConflictPool->element = (RWConflict) ((char *)RWConflictPool + RWConflictPoolHeaderDataSize); 
>  		/* Add all elements to available list, clean. */
>  		memset(RWConflictPool->element, 0, requestSize);
>  		for (i = 0; i < max_table_size; i++)

These I'd just combine with the ShmemInitStruct("PredXactList"), by allocating
the additional space. The pointer math is a bit annoying, but it makes much
more sense to have one entry in pg_shmem_allocations.


> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
> index 49204f91a2..e112735d93 100644
> --- a/src/backend/storage/lmgr/proc.c
> +++ b/src/backend/storage/lmgr/proc.c
> @@ -218,11 +218,11 @@ InitProcGlobal(void)
>  	 * how hotly they are accessed.
>  	 */
>  	ProcGlobal->xids =
> -		(TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
> +		(TransactionId *) ShmemInitStruct("Proc Transaction Ids", TotalProcs * sizeof(*ProcGlobal->xids), &found);
>  	MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
> -	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
> +	ProcGlobal->subxidStates = (XidCacheStatus *) ShmemInitStruct("Proc Sub-transaction id states", TotalProcs * sizeof(*ProcGlobal->subxidStates), &found);
>  	MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
> -	ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
> +	ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
>  	MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
>  
>  	/*

Same.

Although here I'd say it's worth padding the size of each separate
"allocation" by PG_CACHE_LINE_SIZE.


> @@ -233,7 +233,7 @@ InitProcGlobal(void)
>  	fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
>  	fpRelIdSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(Oid) * FP_LOCK_SLOTS_PER_GROUP);
>  
> -	fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
> +	fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
>  	MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
>  
>  	/* For asserts checking we did not overflow. */

This one might actually make sense to keep separate, depending on the
configuration it can be reasonably big (max_connection = 1k,
max_locks_per_transaction=1k results in ~5MB)..


Greetings,

Andres Freund

view thread (19+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected]
  Subject: Re: Improve monitoring of shared memory allocations
  In-Reply-To: <k6f6ynjvb7lvebhygaiqsfrdohq672uughk3q4ve4q5jqljywz@7jbrz724epeq>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox