Thank you for the review.
cfbot found a few compiler warnings:
https://cirrus-ci.com/task/6526903542087680
[16:47:46.964] make -s -j${BUILD_JOBS} clean
[16:47:47.452] time make -s -j${BUILD_JOBS} world-bin
[16:49:10.496] lwlock.c: In function ‘CreateLWLocks’:
[16:49:10.496] lwlock.c:467:22: error: unused variable ‘found’ [-Werror=unused-variable]
[16:49:10.496] 467 | bool found;
[16:49:10.496] | ^~~~~
[16:49:10.496] cc1: all warnings being treated as errors
[16:49:10.496] make[4]: *** [<builtin>: lwlock.o] Error 1
[16:49:10.496] make[3]: *** [../../../src/backend/common.mk:37: lmgr-recursive] Error 2
[16:49:10.496] make[3]: *** Waiting for unfinished jobs....
[16:49:11.881] make[2]: *** [common.mk:37: storage-recursive] Error 2
[16:49:11.881] make[2]: *** Waiting for unfinished jobs....
[16:49:20.195] dynahash.c: In function ‘hash_create’:
[16:49:20.195] dynahash.c:643:37: error: ‘curr_offset’ may be used uninitialized [-Werror=maybe-uninitialized]
[16:49:20.195] 643 | curr_offset = (((char *)curr_offset) + (temp * elementSize));
[16:49:20.195] | ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[16:49:20.195] dynahash.c:588:23: note: ‘curr_offset’ was declared here
[16:49:20.195] 588 | void *curr_offset;
[16:49:20.195] | ^~~~~~~~~~~
[16:49:20.195] cc1: all warnings being treated as errors
[16:49:20.196] make[4]: *** [<builtin>: dynahash.o] Error 1
Fixed these.
> diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
> index cd5a00132f..5203f5b30b 100644
> --- a/src/backend/utils/hash/dynahash.c
> +++ b/src/backend/utils/hash/dynahash.c
> @@ -120,7 +120,6 @@
> * a good idea of the maximum number of entries!). For non-shared hash
> * tables, the initial directory size can be left at the default.
> */
> -#define DEF_SEGSIZE 256
> #define DEF_SEGSIZE_SHIFT 8 /* must be log2(DEF_SEGSIZE) */
> #define DEF_DIRSIZE 256
Why did you move this to the header? Afaict it's just used in
hash_get_shared_size(), which is also in dynahash.c?
Yes. This was accidentally left behind by the previous version of the
patch, so I undid the change.
> static void register_seq_scan(HTAB *hashp);
> static void deregister_seq_scan(HTAB *hashp);
> static bool has_seq_scans(HTAB *hashp);
> -
> +static int find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize);
>
> /*
> * memory allocation support
You removed a newline here that probably shouldn't be removed.
Fixed this.
> @@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> else
> hashp->keycopy = memcpy;
>
> - /* And select the entry allocation function, too. */
> + /*
> + * And select the entry allocation function, too. XXX should this also
> + * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is
> + * currently only set with HASH_SHARED_MEM *
> + */
> if (flags & HASH_ALLOC)
> hashp->alloc = info->alloc;
> else
> @@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>
> hashp->frozen = false;
>
> + /* Initializing the HASHHDR variables with default values */
> hdefault(hashp);
>
> hctl = hashp->hctl;
I assume these were just observations you made while looking into this? They
seem unrelated to the change itself?
Yes. I removed the first one and left the second one as a code comment.
> @@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> freelist_partitions,
> nelem_alloc,
> nelem_alloc_first;
> -
> + void *curr_offset;
> +
> /*
> * If hash table is partitioned, give each freelist an equal share of
> * the initial allocation. Otherwise only freeList[0] is used.
> @@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> else
> freelist_partitions = 1;
>
> + /*
> + * If table is shared, calculate the offset at which to find the
> + * the first partition of elements
> + */
> + if (hashp->isshared)
> + {
> + int nsegs;
> + int nbuckets;
> + nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
> +
> + curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
> + + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
> + }
> +
Why only do this for shared hashtables? Couldn't we allocate the elments
together with the rest for non-share hashtables too?
I think it is possible to consolidate the allocations for non-shared hash tables
too. However, initial elements are much smaller in non-shared hash tables due to
their ease of expansion. Therefore, there is probably less benefit in trying to do
that for non-shared tables.
In addition, the proposed changes are targeted to improve the monitoring in
pg_shmem_allocations which won't be applicable to non-shared hashtables.
While I believe it is feasible, I am uncertain about the utility of such a change.
Seems a bit ugly to go through element_alloc() when pre-allocating. Perhaps
it's the best thing we can do to avoid duplicating code, but it seems worth
checking if we can do better. Perhaps we could split element_alloc() into
element_alloc() and element_add() or such? With the latter doing everything
after hashp->alloc().
Makes sense. I split the element_alloc() into element_alloc() and element_add().
> -
> +
> /*
> * initialize mutexes if it's a partitioned table
> */
Spurious change.
Fixed.
A function called find_num_of_segs() that also sets nbuckets seems a bit
confusing. I also don't like "find_*", as that sounds like it's searching
some datastructure, rather than just doing a bit of math.
I renamed it to compute_buckets_and_segs(). I am open to better suggestions.
> segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) * hashp->ssize);
>
> if (!segp)
Spurious change.
Fixed.
> -static int
> +int
> next_pow2_int(long num)
> {
> if (num > INT_MAX / 2)
> @@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
> }
> }
> }
Why export this?
It was a stale change, I removed it now
> diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
> index 932cc4f34d..5e16bd4183 100644
> --- a/src/include/utils/hsearch.h
> +++ b/src/include/utils/hsearch.h
> @@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
> extern void hash_freeze(HTAB *hashp);
> extern Size hash_estimate_size(long num_entries, Size entrysize);
> extern long hash_select_dirsize(long num_entries);
> -extern Size hash_get_shared_size(HASHCTL *info, int flags);
> +extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);
> extern void AtEOXact_HashTables(bool isCommit);
> extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
It's imo a bit weird that we have very related logic in hash_estimate_size()
and hash_get_shared_size(). Why do we need to duplicate it?
hash_estimate_size() estimates using default values and hash_get_shared_size()
calculates using specific values depending on the flags associated with the hash
table. For instance, segment_size used by the former is DEF_SEGSIZE and
the latter uses info->ssize which is set when the HASH_SEGMENT flag is true.
Hence, they might return different values for shared memory sizes.
These I'd just combine with the ShmemInitStruct("PredXactList"), by allocating
the additional space. The pointer math is a bit annoying, but it makes much
more sense to have one entry in pg_shmem_allocations.
Fixed accordingly.
> - (TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
> + (TransactionId *) ShmemInitStruct("Proc Transaction Ids", TotalProcs * sizeof(*ProcGlobal->xids), &found);
> MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
> - ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
> + ProcGlobal->subxidStates = (XidCacheStatus *) ShmemInitStruct("Proc Sub-transaction id states", TotalProcs * sizeof(*ProcGlobal->subxidStates), &found);
> MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
> - ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
> + ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
> MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
>
> /*
Same.
Although here I'd say it's worth padding the size of each separate
"allocation" by PG_CACHE_LINE_SIZE.
Made this change.
> - fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
> + fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
> MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
>
> /* For asserts checking we did not overflow. */
This one might actually make sense to keep separate, depending on the
configuration it can be reasonably big (max_connection = 1k,
max_locks_per_transaction=1k results in ~5MB)..
OK
PFA the rebased patches with the above changes.
Kindly let me know your views.
Thank you,
Rahila Syed