MIME-Version: 1.0
References: <01ab1d41-3eda-4705-8bbd-af898f5007f1@iki.fi>
 <2981bb36-6bbe-4bdc-9a94-29b1114c79bd@vondra.me>
 <3026ec05-f664-4ebe-8bf6-0a1218b234ec@iki.fi>
 <19945803-6bcc-40fe-a14a-7dc5c462ed80@iki.fi>
 <e07be2ba-856b-4ff5-8313-8b58b6b4e4d0@iki.fi>
 <CAEze2WhYsCNRd3E9qGSZbXd5k0UVa7xgMZ1V6tARRKezPPEFUw@mail.gmail.com>
 <a47e1b92-2e88-4554-b4d3-61934173222d@iki.fi>
In-Reply-To: <a47e1b92-2e88-4554-b4d3-61934173222d@iki.fi>
From: Matthias van de Meent <boekewurm+postgres@gmail.com>
Date: Thu, 2 Apr 2026 15:47:54 +0200
Message-ID: 
 <CAEze2WjbDV--L+p8K5w7zj9JoDqhqNqGX2KTj=5O76Wp5si1tQ@mail.gmail.com>
Subject: Re: Shared hash table allocations
To: Heikki Linnakangas <hlinnaka@iki.fi>
Cc: Tomas Vondra <tomas@vondra.me>,
	"pgsql-hackers@postgresql.org" <pgsql-hackers@postgresql.org>,
 Robert Haas <robertmhaas@gmail.com>,
	Rahila Syed <rahilasyed90@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Archived-At: 
 <https://www.postgresql.org/message-id/CAEze2WjbDV--L%2Bp8K5w7zj9JoDqhqNqGX2KTj%3D5O76Wp5si1tQ%40mail.gmail.com>
Precedence: bulk

On Thu, 2 Apr 2026 at 13:52, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 02/04/2026 13:24, Matthias van de Meent wrote:
> > On Tue, 31 Mar 2026 at 23:25, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >> 0003: In patch 0003 I removed that flexibility by marking them both with
> >> HASH_FIXED_SIZE, and making init_size equal to max_size. That also stops
> >> the hash tables from using any of the other remaining wiggle room,
> >> making them truly fixed-size.
> >
> > I think this patch finally gave me a good reason why PROCLOCK would've
> > needed to be allocated with double the sizes of LOCK:
> >
> > LOCK is (was) initialized with only 50% of its max capacity. If
> > PROCLOCK was initialized with the same parameters and all spare shmem
> > is then allocated to other processes, then backends wouldn't be able
> > to safely use max_locks_per_transaction. To guarantee no OOMs when all
> > backends use max_locks_per_transaction, PROCLOCK's size must be
> > doubled to make sure PROCLOCK has sufficient space. (The same isn't
> > usually an issue for LOCK, because it's very likely backends will
> > operate on the same tables, and thus will be able to share most of the
> > LOCK structs.)
>
> Hmm, I don't know if that makes sense.

Code and mailing list history indicate that's not the reason, but I
see no other sane reason why PROCLOCK would *not* be sized to
max_locks_per_transaction * MaxBackends. At least with this reasoning
the minimum size is exactly that.

> It can happen that you have a lot
> of backends acquiring the same, smaller set of locks, growing PROCLOCK
> so that it uses up all the available wiggle room, and LOCK can never
> grow from its initial size, 1/2 * max_locks_per_transactions *
> MaxBackends. If the workload then changes so that every backend tries to
> acquire exactly max_locks_per_transactions locks, but this time each
> lock is on a different object, you will run out of shared memory at 1/2
> the size of what you expected.
>
> The opposite can't happen, because PROCLOCK is always at least as large
> as LOCK. It doesn't matter what you set PROCLOCK's initial size to, it
> will grow together with LOCK, and you will not run out of shared memory
> before PROCLOCK has grown up to max_locks_per_transactions * MaxBackends
> anyway.
>
> > Now that LOCK is fully allocated, I think the size doubling can be
> > removed, or possibly parameterized for those that need it.
>
> I don't think that follows. The 2x factor is pretty arbitrary, but it's
> still a fair assumption that many backends will be acquiring locks on
> the same objects so you need more space in PROCLOCK than in LOCK.

I agree that we'll *probably* have more PROCLOCKs in use than LOCKs.
But to me, max_locks_per_transaction (MLPT) says that it is the
maximum number of locks taken by a transaction, and transaction locks
have a 1:1 correspondence with PROCLOCKs (as long as we ignore
fast-path locking).

Adjusting that value by an arbitrary factor does not make any sense.
The user configured a value X, so we should use that value X.
Possibly there are adjustments we need to make to give ourselves
some breathing room (it's not uncommon to overallocate by a constant
factor to allow evict-after-insert patterns in caches). But when the
user specified a value that would/should map 1:1 to PROCLOCKs, I
can't explain a blanket doubling "because we have a hunch LOCK usage
will be lower than PROCLOCK usage" as anything other than plainly
wasting memory.
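
To make the entry counts concrete, here is a rough sketch of the two
sizings being compared. It assumes the simplified formula used in this
thread (max_locks_per_transaction * MaxBackends, ignoring prepared
transactions); the helper names are made up for illustration and are
not PostgreSQL functions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative helpers only; these names are not PostgreSQL functions.
 * Prepared transactions are ignored, as in the discussion above.
 */

/* Maximum LOCK entries: max_locks_per_transaction * MaxBackends. */
static size_t
lock_table_entries(size_t mlpt, size_t max_backends)
{
    return mlpt * max_backends;
}

/*
 * Maximum PROCLOCK entries for a given multiplier over the LOCK table:
 * factor = 2 models the doubling under discussion, factor = 1 the 1:1
 * sizing.
 */
static size_t
proclock_table_entries(size_t mlpt, size_t max_backends, size_t factor)
{
    assert(factor >= 1);
    return factor * lock_table_entries(mlpt, max_backends);
}
```

With MLPT = 64 and 100 backends, that is 6400 LOCK entries, and 12800
PROCLOCK entries with the doubling versus 6400 without it.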

> I don't know how true that assumption is. It feels right for OLTP
> applications. But the situation where I've hit max_locks_per_transaction
> is when I've tried to create one table with thousands or partitions. Or
> rather, when I try to *drop* that table. In that situation, there's just
> one transaction acquiring all the locks, so the PROCLOCK / LOCK ratio is 1.

> We could parameterize it, but I feel that's probably overkill and
> exposing too much detail to users. At the end of the day, if you hit the
> limit, you just bump up max_locks_per_transactions.

Or, if it's for DROP, you could use a phased dropping scheme, where
you spread the operation across many transactions by dropping a subset
of the partitions in each transaction. It takes more careful execution
and more time, but it allows you to avoid hitting the limits and
starving other backends of lock slots, and avoids requiring postmaster
restarts.

> If there are two
> settings, it's more complicated; which one do you change? You probably
> don't mind wasting the few MB of memory that you could gain by carefully
> tuning the LOCK / PROCLOCK factor.

Yes, that would be more complicated, but we have similar factors
elsewhere (hash_mem_multiplier, various costs, weights). We wouldn't
even have to use a factor, we could just as well use a new, more
direct `max_unique_locks_per_transaction`, which we'd use to scale the
LOCK hash.
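
As a sketch of what that could look like (purely hypothetical:
`max_unique_locks_per_transaction` does not exist, and the struct and
function names below are invented for this example), each setting
would scale its own hash table directly:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical settings; the second one is only a proposal. */
typedef struct LockSizing
{
    size_t max_locks_per_transaction;        /* would scale PROCLOCK */
    size_t max_unique_locks_per_transaction; /* would scale LOCK */
} LockSizing;

typedef struct LockTableSizes
{
    size_t lock_entries;
    size_t proclock_entries;
} LockTableSizes;

/* Size each table from its own setting, with no hidden multiplier. */
static LockTableSizes
size_lock_tables(LockSizing s, size_t max_backends)
{
    LockTableSizes r;

    /*
     * Assumed sanity check: a transaction cannot hold more unique locks
     * than it holds locks in total.
     */
    assert(s.max_unique_locks_per_transaction <= s.max_locks_per_transaction);

    r.lock_entries = s.max_unique_locks_per_transaction * max_backends;
    r.proclock_entries = s.max_locks_per_transaction * max_backends;
    return r;
}
```

For example, settings of 64 and 32 with 100 backends would give 3200
LOCK entries and 6400 PROCLOCK entries.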

Note that with our current default settings we're spending 11kiB (= 64
* (64+24)) per backend on what I would consider oversized PROCLOCK
allocations. With MLPT=128, that doubles to 22kiB per backend. Every
50 max_backends, that'd be ~1.1MB of shared memory allocated in excess
of the user's requested configuration.


Kind regards,

Matthias van de Meent