public inbox for [email protected]
help / color / mirror / Atom feedFrom: Rahila Syed <[email protected]>
To: Andres Freund <[email protected]>
Cc: PostgreSQL-development <[email protected]>
Subject: Re: Improve monitoring of shared memory allocations
Date: Thu, 6 Mar 2025 21:28:17 +0530
Message-ID: <CAH2L28uL-8EQTRSeyTpW1DqAXsDXRkXCkT1dus2u6p4HYDrxAg@mail.gmail.com> (raw)
In-Reply-To: <k6f6ynjvb7lvebhygaiqsfrdohq672uughk3q4ve4q5jqljywz@7jbrz724epeq>
References: <CAH2L28vHzRankszhqz7deXURxKncxfirnuW68zD7+hVAqaS5GQ@mail.gmail.com>
<k6f6ynjvb7lvebhygaiqsfrdohq672uughk3q4ve4q5jqljywz@7jbrz724epeq>
Hi,
Thank you for the review.
cfbot found a few compiler warnings:
>
> https://cirrus-ci.com/task/6526903542087680
> [16:47:46.964] make -s -j${BUILD_JOBS} clean
> [16:47:47.452] time make -s -j${BUILD_JOBS} world-bin
> [16:49:10.496] lwlock.c: In function ‘CreateLWLocks’:
> [16:49:10.496] lwlock.c:467:22: error: unused variable ‘found’
> [-Werror=unused-variable]
> [16:49:10.496] 467 | bool found;
> [16:49:10.496] | ^~~~~
> [16:49:10.496] cc1: all warnings being treated as errors
> [16:49:10.496] make[4]: *** [<builtin>: lwlock.o] Error 1
> [16:49:10.496] make[3]: *** [../../../src/backend/common.mk:37:
> lmgr-recursive] Error 2
> [16:49:10.496] make[3]: *** Waiting for unfinished jobs....
> [16:49:11.881] make[2]: *** [common.mk:37: storage-recursive] Error 2
> [16:49:11.881] make[2]: *** Waiting for unfinished jobs....
> [16:49:20.195] dynahash.c: In function ‘hash_create’:
> [16:49:20.195] dynahash.c:643:37: error: ‘curr_offset’ may be used
> uninitialized [-Werror=maybe-uninitialized]
> [16:49:20.195] 643 | curr_offset = (((char
> *)curr_offset) + (temp * elementSize));
> [16:49:20.195] |
> ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> [16:49:20.195] dynahash.c:588:23: note: ‘curr_offset’ was declared here
> [16:49:20.195] 588 | void *curr_offset;
> [16:49:20.195] | ^~~~~~~~~~~
> [16:49:20.195] cc1: all warnings being treated as errors
> [16:49:20.196] make[4]: *** [<builtin>: dynahash.o] Error 1
>
>
Fixed these.
>
> > diff --git a/src/backend/utils/hash/dynahash.c
> b/src/backend/utils/hash/dynahash.c
> > index cd5a00132f..5203f5b30b 100644
> > --- a/src/backend/utils/hash/dynahash.c
> > +++ b/src/backend/utils/hash/dynahash.c
> > @@ -120,7 +120,6 @@
> > * a good idea of the maximum number of entries!). For non-shared hash
> > * tables, the initial directory size can be left at the default.
> > */
> > -#define DEF_SEGSIZE 256
> > #define DEF_SEGSIZE_SHIFT 8 /* must be log2(DEF_SEGSIZE) */
> > #define DEF_DIRSIZE 256
>
> Why did you move this to the header? Afaict it's just used in
> hash_get_shared_size(), which is also in dynahash.c?
>
>
Yes. This was accidentally left behind by the previous version of the
patch, so I undid the change.
> > static void register_seq_scan(HTAB *hashp);
> > static void deregister_seq_scan(HTAB *hashp);
> > static bool has_seq_scans(HTAB *hashp);
> > -
> > +static int find_num_of_segs(long nelem, int *nbuckets, long
> num_partitions, long ssize);
> >
> > /*
> > * memory allocation support
>
> You removed a newline here that probably shouldn't be removed.
Fixed this.
>
> > @@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const
> HASHCTL *info, int flags)
> > else
> > hashp->keycopy = memcpy;
> >
> > - /* And select the entry allocation function, too. */
> > + /*
> > + * And select the entry allocation function, too. XXX should this
> also
> > + * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is
> > + * currently only set with HASH_SHARED_MEM *
> > + */
> > if (flags & HASH_ALLOC)
> > hashp->alloc = info->alloc;
> > else
> > @@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const
> HASHCTL *info, int flags)
> >
> > hashp->frozen = false;
> >
> > + /* Initializing the HASHHDR variables with default values */
> > hdefault(hashp);
> >
> > hctl = hashp->hctl;
>
> I assume these were just observations you made while looking into this?
> They
> seem unrelated to the change itself?
Yes. I removed the first one and left the second one as a code comment.
>
> > @@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const
> HASHCTL *info, int flags)
> > freelist_partitions,
> > nelem_alloc,
> > nelem_alloc_first;
> > -
> > + void *curr_offset;
> > +
> > /*
> > * If hash table is partitioned, give each freelist an
> equal share of
> > * the initial allocation. Otherwise only freeList[0] is
> used.
> > @@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const
> HASHCTL *info, int flags)
> > else
> > freelist_partitions = 1;
> >
> > + /*
> > + * If table is shared, calculate the offset at which to
> find the
> > + * the first partition of elements
> > + */
> > + if (hashp->isshared)
> > + {
> > + int nsegs;
> > + int nbuckets;
> > + nsegs = find_num_of_segs(nelem, &nbuckets,
> hctl->num_partitions, hctl->ssize);
> > +
> > + curr_offset = (((char *) hashp->hctl) +
> sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
> > + + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
> > + }
> > +
>
> Why only do this for shared hashtables? Couldn't we allocate the elments
> together with the rest for non-share hashtables too?
>
I think it is possible to consolidate the allocations for non-shared hash
tables
too. However, initial elements are much smaller in non-shared hash tables
due to
their ease of expansion. Therefore, there is probably less benefit in
trying to do
that for non-shared tables.
In addition, the proposed changes are targeted to improve the monitoring in
pg_shmem_allocations which won't be applicable to non-shared hashtables.
While I believe it is feasible, I am uncertain about the utility of such a
change.
> Seems a bit ugly to go through element_alloc() when pre-allocating.
> Perhaps
> it's the best thing we can do to avoid duplicating code, but it seems worth
> checking if we can do better. Perhaps we could split element_alloc() into
> element_alloc() and element_add() or such? With the latter doing
> everything
> after hashp->alloc().
>
>
Makes sense. I split the element_alloc() into element_alloc() and
element_add().
>
> > -
> > +
> > /*
> > * initialize mutexes if it's a partitioned table
> > */
>
> Spurious change.
>
>
Fixed.
A function called find_num_of_segs() that also sets nbuckets seems a bit
> confusing. I also don't like "find_*", as that sounds like it's searching
> some datastructure, rather than just doing a bit of math.
>
I renamed it to compute_buckets_and_segs(). I am open to better
suggestions.
> segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) *
> hashp->ssize);
> >
> > if (!segp)
>
> Spurious change.
>
Fixed.
> -static int
> > +int
> > next_pow2_int(long num)
> > {
> > if (num > INT_MAX / 2)
> > @@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int
> nestDepth)
> > }
> > }
> > }
>
> Why export this?
>
> It was a stale change, I removed it now
> > diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
> > index 932cc4f34d..5e16bd4183 100644
> > --- a/src/include/utils/hsearch.h
> > +++ b/src/include/utils/hsearch.h
> > @@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
> > extern void hash_freeze(HTAB *hashp);
> > extern Size hash_estimate_size(long num_entries, Size entrysize);
> > extern long hash_select_dirsize(long num_entries);
> > -extern Size hash_get_shared_size(HASHCTL *info, int flags);
> > +extern Size hash_get_shared_size(HASHCTL *info, int flags, long
> init_size);
> > extern void AtEOXact_HashTables(bool isCommit);
> > extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
>
> It's imo a bit weird that we have very related logic in
> hash_estimate_size()
> and hash_get_shared_size(). Why do we need to duplicate it?
>
>
hash_estimate_size() estimates using default values and
hash_get_shared_size()
calculates using specific values depending on the flags associated with the
hash
table. For instance, segment_size used by the former is DEF_SEGSIZE and
the latter uses info->ssize which is set when the HASH_SEGMENT flag is true.
Hence, they might return different values for shared memory sizes.
>
> These I'd just combine with the ShmemInitStruct("PredXactList"), by
> allocating
> the additional space. The pointer math is a bit annoying, but it makes much
> more sense to have one entry in pg_shmem_allocations.
>
> Fixed accordingly.
>
> > - (TransactionId *) ShmemAlloc(TotalProcs *
> sizeof(*ProcGlobal->xids));
> > + (TransactionId *) ShmemInitStruct("Proc Transaction Ids",
> TotalProcs * sizeof(*ProcGlobal->xids), &found);
> > MemSet(ProcGlobal->xids, 0, TotalProcs *
> sizeof(*ProcGlobal->xids));
> > - ProcGlobal->subxidStates = (XidCacheStatus *)
> ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
> > + ProcGlobal->subxidStates = (XidCacheStatus *)
> ShmemInitStruct("Proc Sub-transaction id states", TotalProcs *
> sizeof(*ProcGlobal->subxidStates), &found);
> > MemSet(ProcGlobal->subxidStates, 0, TotalProcs *
> sizeof(*ProcGlobal->subxidStates));
> > - ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs *
> sizeof(*ProcGlobal->statusFlags));
> > + ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status
> Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
> > MemSet(ProcGlobal->statusFlags, 0, TotalProcs *
> sizeof(*ProcGlobal->statusFlags));
> >
> > /*
>
> Same.
>
> Although here I'd say it's worth padding the size of each separate
> "allocation" by PG_CACHE_LINE_SIZE.
>
Made this change.
> > - fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
> > + fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs *
> (fpLockBitsSize + fpRelIdSize), &found);
> > MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
> >
> > /* For asserts checking we did not overflow. */
>
> This one might actually make sense to keep separate, depending on the
> configuration it can be reasonably big (max_connection = 1k,
> max_locks_per_transaction=1k results in ~5MB)..
>
> OK
PFA the rebased patches with the above changes.
Kindly let me know your views.
Thank you,
Rahila Syed
Attachments:
[application/octet-stream] v2-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch (6.1K, 3-v2-0002-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch)
download | inline diff:
From 2858595994ccc1d48a50c318c011c22f8e56d7e8 Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 6 Mar 2025 20:32:27 +0530
Subject: [PATCH 2/2] Replace ShmemAlloc calls by ShmemInitStruct
The shared memory allocated by ShmemAlloc is not tracked
by pg_shmem_allocations. This commit replaces most of the
calls to ShmemAlloc by ShmemInitStruct to associate a name
with the allocations and ensure that they get tracked by
pg_shmem_allocations. It also merges several smaller
ShmemAlloc calls into larger ShmemInitStruct to allocate
and track all the related memory allocations under single
---
src/backend/storage/lmgr/predicate.c | 17 ++++++++-------
src/backend/storage/lmgr/proc.c | 31 +++++++++++++++++++++++-----
2 files changed, 35 insertions(+), 13 deletions(-)
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 5b21a05398..dd66990335 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -1226,8 +1226,11 @@ PredicateLockShmemInit(void)
*/
max_table_size *= 10;
+ requestSize = add_size(PredXactListDataSize,
+ (mul_size((Size) max_table_size,
+ sizeof(SERIALIZABLEXACT))));
PredXact = ShmemInitStruct("PredXactList",
- PredXactListDataSize,
+ requestSize,
&found);
Assert(found == IsUnderPostmaster);
if (!found)
@@ -1242,9 +1245,7 @@ PredicateLockShmemInit(void)
PredXact->LastSxactCommitSeqNo = FirstNormalSerCommitSeqNo - 1;
PredXact->CanPartialClearThrough = 0;
PredXact->HavePartialClearedThrough = 0;
- requestSize = mul_size((Size) max_table_size,
- sizeof(SERIALIZABLEXACT));
- PredXact->element = ShmemAlloc(requestSize);
+ PredXact->element = (SERIALIZABLEXACT *) ((char *) PredXact + PredXactListDataSize);
/* Add all elements to available list, clean. */
memset(PredXact->element, 0, requestSize);
for (i = 0; i < max_table_size; i++)
@@ -1299,9 +1300,11 @@ PredicateLockShmemInit(void)
* probably OK.
*/
max_table_size *= 5;
+ requestSize = mul_size((Size) max_table_size,
+ RWConflictDataSize);
RWConflictPool = ShmemInitStruct("RWConflictPool",
- RWConflictPoolHeaderDataSize,
+ RWConflictPoolHeaderDataSize + requestSize,
&found);
Assert(found == IsUnderPostmaster);
if (!found)
@@ -1309,9 +1312,7 @@ PredicateLockShmemInit(void)
int i;
dlist_init(&RWConflictPool->availableList);
- requestSize = mul_size((Size) max_table_size,
- RWConflictDataSize);
- RWConflictPool->element = ShmemAlloc(requestSize);
+ RWConflictPool->element = (RWConflict) ((char *) RWConflictPool + RWConflictPoolHeaderDataSize);
/* Add all elements to available list, clean. */
memset(RWConflictPool->element, 0, requestSize);
for (i = 0; i < max_table_size; i++)
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index 749a79d48e..4befbb0318 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -88,6 +88,7 @@ static void RemoveProcFromArray(int code, Datum arg);
static void ProcKill(int code, Datum arg);
static void AuxiliaryProcKill(int code, Datum arg);
static void CheckDeadLock(void);
+static Size PGProcShmemSize(void);
/*
@@ -175,6 +176,7 @@ InitProcGlobal(void)
*fpEndPtr PG_USED_FOR_ASSERTS_ONLY;
Size fpLockBitsSize,
fpRelIdSize;
+ Size requestSize;
/* Create the ProcGlobal shared structure */
ProcGlobal = (PROC_HDR *)
@@ -204,7 +206,10 @@ InitProcGlobal(void)
* with a single freelist.) Each PGPROC structure is dedicated to exactly
* one of these purposes, and they do not move between groups.
*/
- procs = (PGPROC *) ShmemAlloc(TotalProcs * sizeof(PGPROC));
+ requestSize = PGProcShmemSize();
+
+ procs = (PGPROC *) ShmemInitStruct("PGPROC structures", requestSize, &found);
+
MemSet(procs, 0, TotalProcs * sizeof(PGPROC));
ProcGlobal->allProcs = procs;
/* XXX allProcCount isn't really all of them; it excludes prepared xacts */
@@ -218,11 +223,11 @@ InitProcGlobal(void)
* how hotly they are accessed.
*/
ProcGlobal->xids =
- (TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
+ (TransactionId *) ((char *) procs + TotalProcs * sizeof(PGPROC));
MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
- ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
+ ProcGlobal->subxidStates = (XidCacheStatus *) ((char *) ProcGlobal->xids + TotalProcs * sizeof(*ProcGlobal->xids) + PG_CACHE_LINE_SIZE);
MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
- ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
+ ProcGlobal->statusFlags = (uint8 *) ((char *) ProcGlobal->subxidStates + TotalProcs * sizeof(*ProcGlobal->subxidStates) + PG_CACHE_LINE_SIZE);
MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
/*
@@ -233,7 +238,7 @@ InitProcGlobal(void)
fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
fpRelIdSize = MAXALIGN(FastPathLockSlotsPerBackend() * sizeof(Oid));
- fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
+ fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
/* For asserts checking we did not overflow. */
@@ -334,6 +339,22 @@ InitProcGlobal(void)
SpinLockInit(ProcStructLock);
}
+static Size
+PGProcShmemSize(void)
+{
+ Size size;
+ uint32 TotalProcs = MaxBackends + NUM_AUXILIARY_PROCS + max_prepared_xacts;
+
+ size = TotalProcs * sizeof(PGPROC);
+ size = add_size(size, TotalProcs * sizeof(*ProcGlobal->xids));
+ size = add_size(size, PG_CACHE_LINE_SIZE);
+ size = add_size(size, TotalProcs * sizeof(*ProcGlobal->subxidStates));
+ size = add_size(size, PG_CACHE_LINE_SIZE);
+ size = add_size(size, TotalProcs * sizeof(*ProcGlobal->statusFlags));
+ size = add_size(size, PG_CACHE_LINE_SIZE);
+ return size;
+}
+
/*
* InitProcess -- initialize a per-process PGPROC entry for this backend
*/
--
2.34.1
[application/octet-stream] v2-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch (11.4K, 4-v2-0001-Account-for-initial-shared-memory-allocated-by-hash_.patch)
download | inline diff:
From 608d8667fbd05b6902c4e1ad1d810edf7d0f78bf Mon Sep 17 00:00:00 2001
From: Rahila Syed <[email protected]>
Date: Thu, 6 Mar 2025 20:06:20 +0530
Subject: [PATCH 1/2] Account for initial shared memory allocated by
hash_create
pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
which, in case of shared hash tables, only covers memory allocated
to the hash directory and control structure. The hash segments and
buckets are allocated using ShmemAllocNoError which does not attribute
the allocations to the hash table name. Thus, these allocations are
not tracked in pg_shmem_allocations.
Include the allocation of segments, buckets and elements in the initial
allocation of shared hash directory. This results in the existing ShmemIndex
entries to reflect all these allocations. The resulting tuples in
pg_shmem_allocations represent the total size of the initial hash table
including all the buckets and the elements they contain, instead of just
the directory size.
---
src/backend/storage/ipc/shmem.c | 3 +-
src/backend/utils/hash/dynahash.c | 172 +++++++++++++++++++++++-------
src/include/utils/hsearch.h | 2 +-
3 files changed, 135 insertions(+), 42 deletions(-)
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 895a43fb39..c56e9b6c77 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -73,6 +73,7 @@
#include "storage/shmem.h"
#include "storage/spin.h"
#include "utils/builtins.h"
+#include "utils/dynahash.h"
static void *ShmemAllocRaw(Size size, Size *allocated_size);
@@ -346,7 +347,7 @@ ShmemInitHash(const char *name, /* table string name for shmem index */
/* look it up in the shmem index */
location = ShmemInitStruct(name,
- hash_get_shared_size(infoP, hash_flags),
+ hash_get_shared_size(infoP, hash_flags, init_size),
&found);
/*
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index cd5a00132f..0c66d63f8e 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -265,7 +265,7 @@ static long hash_accesses,
*/
static void *DynaHashAlloc(Size size);
static HASHSEGMENT seg_alloc(HTAB *hashp);
-static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
+static HASHELEMENT *element_alloc(HTAB *hashp, int nelem);
static bool dir_realloc(HTAB *hashp);
static bool expand_table(HTAB *hashp);
static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
@@ -281,6 +281,9 @@ static void register_seq_scan(HTAB *hashp);
static void deregister_seq_scan(HTAB *hashp);
static bool has_seq_scans(HTAB *hashp);
+static int compute_buckets_and_segs(long nelem, int *nbuckets,
+ long num_partitions, long ssize);
+static void element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem);
/*
* memory allocation support
@@ -518,6 +521,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
hashp->frozen = false;
+ /* Initializing the HASHHDR variables with default values */
hdefault(hashp);
hctl = hashp->hctl;
@@ -582,6 +586,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
freelist_partitions,
nelem_alloc,
nelem_alloc_first;
+ void *curr_offset = NULL;
/*
* If hash table is partitioned, give each freelist an equal share of
@@ -592,6 +597,21 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
else
freelist_partitions = 1;
+ /*
+ * If table is shared, calculate the offset at which to find the the
+ * first partition of elements
+ */
+ if (hashp->isshared)
+ {
+ int nsegs;
+ int nbuckets;
+
+ nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
+
+ curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
+ +(sizeof(HASHBUCKET) * hctl->ssize * nsegs));
+ }
+
nelem_alloc = nelem / freelist_partitions;
if (nelem_alloc <= 0)
nelem_alloc = 1;
@@ -609,11 +629,27 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
for (i = 0; i < freelist_partitions; i++)
{
int temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
+ HASHELEMENT *firstElement;
+ Size elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
- if (!element_alloc(hashp, temp, i))
- ereport(ERROR,
- (errcode(ERRCODE_OUT_OF_MEMORY),
- errmsg("out of memory")));
+ /*
+ * Memory is allocated as part of initial allocation in
+ * ShmemInitHash
+ */
+ if (hashp->isshared)
+ {
+ firstElement = (HASHELEMENT *) curr_offset;
+ curr_offset = (((char *) curr_offset) + (temp * elementSize));
+ }
+ else
+ {
+ firstElement = element_alloc(hashp, temp);
+ if (!firstElement)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+ }
+ element_add(hashp, firstElement, i, temp);
}
}
@@ -701,30 +737,11 @@ init_htab(HTAB *hashp, long nelem)
for (i = 0; i < NUM_FREELISTS; i++)
SpinLockInit(&(hctl->freeList[i].mutex));
- /*
- * Allocate space for the next greater power of two number of buckets,
- * assuming a desired maximum load factor of 1.
- */
- nbuckets = next_pow2_int(nelem);
-
- /*
- * In a partitioned table, nbuckets must be at least equal to
- * num_partitions; were it less, keys with apparently different partition
- * numbers would map to the same bucket, breaking partition independence.
- * (Normally nbuckets will be much bigger; this is just a safety check.)
- */
- while (nbuckets < hctl->num_partitions)
- nbuckets <<= 1;
+ nsegs = compute_buckets_and_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
hctl->max_bucket = hctl->low_mask = nbuckets - 1;
hctl->high_mask = (nbuckets << 1) - 1;
- /*
- * Figure number of directory segments needed, round up to a power of 2
- */
- nsegs = (nbuckets - 1) / hctl->ssize + 1;
- nsegs = next_pow2_int(nsegs);
-
/*
* Make sure directory is big enough. If pre-allocated directory is too
* small, choke (caller screwed up).
@@ -748,11 +765,22 @@ init_htab(HTAB *hashp, long nelem)
}
/* Allocate initial segments */
+ i = 0;
for (segp = hashp->dir; hctl->nsegs < nsegs; hctl->nsegs++, segp++)
{
- *segp = seg_alloc(hashp);
+ if (hashp->isshared)
+ {
+ *segp = (HASHBUCKET *) (((char *) hashp->hctl)
+ + sizeof(HASHHDR)
+ + (hashp->hctl->dsize * sizeof(HASHSEGMENT))
+ + (i * sizeof(HASHBUCKET) * hashp->ssize));
+ MemSet(*segp, 0, sizeof(HASHBUCKET) * hashp->ssize);
+ }
+ else
+ *segp = seg_alloc(hashp);
if (*segp == NULL)
return false;
+ i = i + 1;
}
/* Choose number of entries to allocate at a time */
@@ -851,11 +879,36 @@ hash_select_dirsize(long num_entries)
* and for the (non expansible) directory.
*/
Size
-hash_get_shared_size(HASHCTL *info, int flags)
+hash_get_shared_size(HASHCTL *info, int flags, long init_size)
{
+ int nbuckets;
+ int nsegs;
+ int num_partitions;
+ int ssize;
+ Size elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(info->entrysize);
+
Assert(flags & HASH_DIRSIZE);
Assert(info->dsize == info->max_dsize);
- return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT);
+
+ if (flags & HASH_PARTITION)
+ num_partitions = info->num_partitions;
+ else
+ num_partitions = 0;
+
+ if (flags & HASH_SEGMENT)
+ ssize = info->ssize;
+ else
+ ssize = DEF_SEGSIZE;
+
+ nsegs = compute_buckets_and_segs(init_size, &nbuckets, num_partitions, ssize);
+
+ /* Number of entries should be atleast equal to number of partitions */
+ if (init_size < num_partitions)
+ init_size = num_partitions;
+
+ return sizeof(HASHHDR) + info->dsize * sizeof(HASHSEGMENT) +
+ +sizeof(HASHBUCKET) * ssize * nsegs
+ + init_size * elementSize;
}
@@ -1285,7 +1338,8 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
* Failing because the needed element is in a different freelist is
* not acceptable.
*/
- if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+ newElement = element_alloc(hashp, hctl->nelem_alloc);
+ if (newElement == NULL)
{
int borrow_from_idx;
@@ -1322,6 +1376,7 @@ get_hash_entry(HTAB *hashp, int freelist_idx)
/* no elements available to borrow either, so out of memory */
return NULL;
}
+ element_add(hashp, newElement, freelist_idx, hctl->nelem_alloc);
}
/* remove entry from freelist, bump nentries */
@@ -1702,28 +1757,38 @@ seg_alloc(HTAB *hashp)
/*
* allocate some new elements and link them into the indicated free list
*/
-static bool
-element_alloc(HTAB *hashp, int nelem, int freelist_idx)
+static HASHELEMENT *
+element_alloc(HTAB *hashp, int nelem)
{
HASHHDR *hctl = hashp->hctl;
Size elementSize;
- HASHELEMENT *firstElement;
- HASHELEMENT *tmpElement;
- HASHELEMENT *prevElement;
- int i;
+ HASHELEMENT *firstElement = NULL;
if (hashp->isfixed)
- return false;
+ return NULL;
/* Each element has a HASHELEMENT header plus user data. */
elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
-
CurrentDynaHashCxt = hashp->hcxt;
firstElement = (HASHELEMENT *) hashp->alloc(nelem * elementSize);
if (!firstElement)
- return false;
+ return NULL;
+ return firstElement;
+}
+
+static void
+element_add(HTAB *hashp, HASHELEMENT *firstElement, int freelist_idx, int nelem)
+{
+ HASHHDR *hctl = hashp->hctl;
+ Size elementSize;
+ HASHELEMENT *tmpElement;
+ HASHELEMENT *prevElement;
+ int i;
+
+ /* Each element has a HASHELEMENT header plus user data. */
+ elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
/* prepare to link all the new entries into the freelist */
prevElement = NULL;
tmpElement = firstElement;
@@ -1744,8 +1809,6 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
if (IS_PARTITIONED(hctl))
SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
- return true;
}
/*
@@ -1957,3 +2020,32 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
}
}
}
+
+static int
+compute_buckets_and_segs(long nelem, int *nbuckets, long num_partitions, long ssize)
+{
+ int nsegs;
+
+ /*
+ * Allocate space for the next greater power of two number of buckets,
+ * assuming a desired maximum load factor of 1.
+ */
+ *nbuckets = next_pow2_int(nelem);
+
+ /*
+ * In a partitioned table, nbuckets must be at least equal to
+ * num_partitions; were it less, keys with apparently different partition
+ * numbers would map to the same bucket, breaking partition independence.
+ * (Normally nbuckets will be much bigger; this is just a safety check.)
+ */
+ while ((*nbuckets) < num_partitions)
+ (*nbuckets) <<= 1;
+
+
+ /*
+ * Figure number of directory segments needed, round up to a power of 2
+ */
+ nsegs = ((*nbuckets) - 1) / ssize + 1;
+ nsegs = next_pow2_int(nsegs);
+ return nsegs;
+}
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 932cc4f34d..5e16bd4183 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
extern void hash_freeze(HTAB *hashp);
extern Size hash_estimate_size(long num_entries, Size entrysize);
extern long hash_select_dirsize(long num_entries);
-extern Size hash_get_shared_size(HASHCTL *info, int flags);
+extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);
extern void AtEOXact_HashTables(bool isCommit);
extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
--
2.34.1
view thread (19+ messages) latest in thread
reply
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Reply to all the recipients using the --to and --cc options:
reply via email
To: [email protected]
Cc: [email protected], [email protected]
Subject: Re: Improve monitoring of shared memory allocations
In-Reply-To: <CAH2L28uL-8EQTRSeyTpW1DqAXsDXRkXCkT1dus2u6p4HYDrxAg@mail.gmail.com>
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox