Message-ID: <83e37829-0d94-49b2-ad48-5feb7b5d5e44@iki.fi>
Date: Thu, 2 Apr 2026 17:14:46 +0300
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: Shared hash table allocations
To: Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>
Cc: Tomas Vondra <tomas@vondra.me>,
 "pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>,
 Robert Haas <robertmhaas@gmail.com>, Rahila Syed <rahilasyed90@gmail.com>
References: <01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi>
 <2981bb36-6bbe-4bdc-9a94-29b1114c79bd@vondra.me>
 <3026ec05-f664-4ebe-8bf6-0a1218b234ec@iki.fi>
 <19945803-6bcc-40fe-a14a-7dc5c462ed80@iki.fi>
 <e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi>
 <CAExHW5uWdU1iEM_eVFVVmaHqfjLpq0QrdFUeZjtBDYpNwfuRBg@mail.gmail.com>
Content-Language: en-US
From: Heikki Linnakangas <hlinnaka@iki.fi>
In-Reply-To: 
 <CAExHW5uWdU1iEM_eVFVVmaHqfjLpq0QrdFUeZjtBDYpNwfuRBg@mail.gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Archived-At: 
 <https://www.postgresql.org/message-id/83e37829-0d94-49b2-ad48-5feb7b5d5e44%40iki.fi>
Precedence: bulk

On 02/04/2026 15:55, Ashutosh Bapat wrote:
> When we "allocate" shared memory, we are just allocating space on
> systems which use mmap. The memory gets allocated only when it is
> touched. The wiggle room as a whole is never touched during
> initialization. Those pages get allocated when wiggle room is used -
> i.e. when the entries beyond initial number are allocated. By
> allocating maximal hash tables, I was worried that we will allocate
> more memory than required. But that's not true since a 4K memory page
> fits only 50-60 entries - far less than the default configuration
> permits. Most of the memory for the hash table will be allocated as
> the entries as used.

Hmm, that's a good point about untouched memory not being allocated. I 
think it's fine, though.

With small changes on top of the earlier refactorings from this 
thread, we could stop pre-allocating all the elements when a shared 
memory hash table is created, and have ShmemHashAlloc() allocate them on 
the fly. Instead of making them anonymous allocations like we do with 
ShmemAlloc() today, the allocations could come from the pre-allocated 
region dedicated to the hash table. You'd still get the same determinism 
and visibility in pg_shmem_allocations, but you could avoid actually 
touching the pages until they're needed. Not sure it's worth the 
trouble.
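
To make the idea concrete, here is a minimal sketch of handing out 
elements on demand from a region that was reserved up front. It is not 
PostgreSQL code: ReservedRegion and region_alloc() are hypothetical 
names invented for this illustration, and the 8-byte alignment is an 
assumption. The region's full size is known and accounted for from the 
start, but the allocator only ever hands out, and so only ever causes 
the caller to touch, the bytes below "used".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for a region reserved up front for one table. */
typedef struct ReservedRegion
{
	char	   *base;		/* start of the reserved address range */
	size_t		size;		/* total bytes reserved, fixed at creation */
	size_t		used;		/* bytes handed out so far */
} ReservedRegion;

/* Assumed alignment for this sketch; must be a power of two. */
#define REGION_ALIGN ((size_t) 8)

/*
 * Hand out nbytes from the reserved region, or return NULL if the region
 * is exhausted. Nothing beyond the returned chunk is read or written, so
 * memory past "used" stays untouched until a later call hands it out.
 */
static void *
region_alloc(ReservedRegion *r, size_t nbytes)
{
	size_t		start;

	assert(r != NULL);

	/* Round the current high-water mark up to the alignment boundary. */
	start = (r->used + REGION_ALIGN - 1) & ~(REGION_ALIGN - 1);

	/* Refuse the request rather than run past the reserved space. */
	if (start > r->size || nbytes > r->size - start)
		return NULL;

	r->used = start + nbytes;
	return r->base + start;
}
```

In a real server the base pointer would point into the shared memory 
segment rather than an ordinary buffer, and the used counter would need 
to be protected by a lock. The sketch leaves both out.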

> The second hazard of increasing hash table size is the hash table
> access becomes slower as it becomes sparse [1]. I don't think it shows
> up in performance but maybe worth trying a trivial pgbench run, just
> to make sure that default performance doesn't regress.

Interesting, but yeah I don't think that's going to be measurable. I did 
some quick testing with a test function that just locks and unlocks 
relations:

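#include "postgres.h"

#include "fmgr.h"
#include "storage/lmgr.h"

PG_FUNCTION_INFO_V1(test_lock_bench);
Datum
test_lock_bench(PG_FUNCTION_ARGS)
{
	/* how many different relation OIDs to cycle through */
	int32		num_distinct_locks = PG_GETARG_INT32(0);
	/* total number of lock acquisitions to perform */
	int32		num_acquires = PG_GETARG_INT32(1);

	LOCKMODE	lockmode = AccessExclusiveLock;

/* base OID for the lock targets */
#define FIRST_RELID 1000000000

	for (int32 i = 0; i < num_acquires; i++)
	{
		Oid			relid = FIRST_RELID + i % num_distinct_locks;

		/*
		 * After the first pass, this relid is still locked from
		 * num_distinct_locks iterations ago; release it first.
		 */
		if (i >= num_distinct_locks)
			UnlockRelationOid(relid, lockmode);

		/* don't block; log and stop if the lock can't be taken */
		if (!ConditionalLockRelationOid(relid, lockmode))
		{
			elog(LOG, "could not acquire lock, iteration %d", i);
			break;
		}
	}

	/* locks still held here are released at end of transaction */
	PG_RETURN_VOID();
}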

With test_lock_bench(1, 5000000), I don't see any meaningful difference, 
i.e. it's within 1-2 %, with anything from max_locks_per_transaction=10 
to max_locks_per_transaction=128.

With more distinct locks involved, the caching effects might be bigger, 
and maybe you'd see a difference because of more or fewer collisions. 
Spot testing some values on my laptop, I don't see anything that would 
worry me though.

> The increase in memory usage is 3MB, which is fine usually. I mean, we
> didn't hear any complaints when we increased the default size of the
> shared buffer pool - this is much less than that. But why do you want
> to double the max_locks_per_transaction? I first thought it's because
> the hash table size is anyway a power of 2. But then the size of the
> hash table is actually max_locks_per_transaction * (number of backends
> + number of prepared transactions). What we want is the default
> max_locks_per_transaction such that 14927 locks are allowed. Playing
> with max_locks_per_transaction using your script 109 seems to be the
> number which will give us 14951 locks. It looks (and is) an odd
> number. If we are worried about memory increase, that's the number we
> should use as default and then write a long paragraph about why we
> chose such an odd-looking number :D.

My first thought was actually to set max_locks_per_transaction=100, 
making it a nice round number :-). But then the neighboring default of 
max_pred_locks_per_transaction=64 looks weird. We could reduce it to 
max_pred_locks_per_transaction=50 to make it fit in. But it feels a 
little arbitrary to change it just for aesthetic reasons.

> I think we should highlight the change in default in the release notes
> though. The users which use default configuration will notice an
> increase in the memory. If they are using a custom value, they will
> think of bumping it up. Can we give them some ballpark % by which they
> should increase their max_locks_per_transaction? E.g. double the
> number or something?

I don't think people who are using the defaults will notice. I'm worried 
about the people who have set max_locks_per_transaction manually, and 
now effectively get less lock space for the same setting. Yeah, doubling 
the previous value is a good rule of thumb.
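
As a rough illustration of that rule of thumb, the sketch below 
computes the nominal lock table size from the formula quoted above, 
max_locks_per_transaction * (number of backends + number of prepared 
transactions), and the doubled setting. The function names and the 
example inputs are made up for this sketch. It also does not model 
whatever extra space the old code effectively provided, so it will not 
reproduce the exact lock counts discussed in this thread.

```c
#include <stddef.h>

/*
 * Nominal number of lock table entries, per the formula quoted in the
 * thread. Hypothetical helper, not a PostgreSQL function.
 */
static size_t
nominal_lock_entries(size_t max_locks_per_xact,
					 size_t num_backends,
					 size_t num_prepared_xacts)
{
	return max_locks_per_xact * (num_backends + num_prepared_xacts);
}

/*
 * Rule of thumb from this thread for users who set the value by hand:
 * double the previous setting.
 */
static size_t
suggested_max_locks_per_xact(size_t previous_setting)
{
	return previous_setting * 2;
}
```

For example, someone who had set max_locks_per_transaction=256 by hand 
would try 512 after upgrading.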

- Heikki