Feedback-ID: id4a34324:Fastmail
Date: Tue, 25 Nov 2025 15:46:49 -0500
From: Andres Freund <andres@anarazel.de>
To: Melanie Plageman <melanieplageman@gmail.com>
Cc: Matthias van de Meent <boekewurm+postgres@gmail.com>,
	pgsql-hackers@postgresql.org, Thomas Munro <thomas.munro@gmail.com>,
	Heikki Linnakangas <hlinnaka@iki.fi>, Noah Misch <noah@leadboat.com>,
	Robert Haas <robertmhaas@gmail.com>,
 Michael Paquier <michael.paquier@gmail.com>
Subject: Re: Buffer locking is special (hints, checksums, AIO writes)
Message-ID: <uuukiok2b5gbbae6blokwim2ohxxvp5235epd5zrbc5d3fojb6@vsqee77athye>
References: <6kmid26do57ykqfpvq6iieniy4djsymhrypkjccazq5g4bbe6a@2y6owwv7qpex>
 <CAEze2WgGe8vjj3jiWqUugWuwLJ9cLryaGrnASjm-yJ=tEALX2A@mail.gmail.com>
 <pmto7djq64mei53p7r5smfync2waittilhbuzc7j7lpflf2b3y@laz7r76y5pux>
 <CAEze2WjeK9CY003S4dmCugv_H4tz9AaXgnqW+wTc=BaPDg+2xg@mail.gmail.com>
 <3je3ahgf7rrmmurxo6hnlhg5d3ffwfrtjwjxd6jm5srlv5iebp@vxqk5qtgmowr>
 <3w7v3w6a57jnssokap4k7thoekig72flnyhd4wp3yftzdd7lm7@f6lpcfen6hr7>
 <6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar>
 <CAAKRu_ZcZCXNUWfhEg4qOYuRO8T1x3_gvaUFFrwgCfY0RnVhkg@mail.gmail.com>
 <3nce7i72ayzkunai6mkz24ckbxk74jodz4ua2chcdrwppxlxcd@w6x5kfkjrkru>
 <CAAKRu_bsQ5KmtAjFsOEHGZCiirp_wgTPyfAQfsyJsGU=g7rkCQ@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: 
 <CAAKRu_bsQ5KmtAjFsOEHGZCiirp_wgTPyfAQfsyJsGU=g7rkCQ@mail.gmail.com>
Archived-At: 
 <https://www.postgresql.org/message-id/uuukiok2b5gbbae6blokwim2ohxxvp5235epd5zrbc5d3fojb6%40vsqee77athye>
Precedence: bulk

Hi,

On 2025-11-25 15:02:02 -0500, Melanie Plageman wrote:
> On Tue, Nov 25, 2025 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
> > > > As of this commit nothing uses the new share-exclusive lock mode. It will be
> > > > documented and used in a future commit. It seemed too complicated to introduce
> > > > the lock-level in a separate commit.
> > >
> > > I would have liked this mentioned earlier in the commit message.
> >
> > Hm, ok.  I could split share-exclusive out again, but it was somewhat painful,
> > because it doesn't just lead to adding code, but to changing code.
> 
> I don't think you need to introduce share-exclusive in a separate
> commit. I just mean it would be good to mention in the commit message
> before you get into so many other details that this commit doesn't use
> the share-exclusive level yet.
> 
> > > Also, I don't know how I feel about it being "documented" in a future
> > > commit...perhaps just don't say that.
> >
> > Hm, why?
> 
> I don't really see you documenting share-exclusive lock semantics in a
> later commit. Documentation for what the lock level is should be in
> the same commit that introduces it -- which I think you mostly have
> done. I feel like it is weird to say you will document that lock level
> later when 1) you don't really do that and 2) this commit mostly
> (besides some requests I had for elaboration) already does that.

The problem I faced was that the explanation for the lock level depends on
later changes - how do you document why share-exclusive is useful without
explaining the hint bit thing, which isn't yet implemented as of that commit.


> > > diff --git a/src/include/storage/buf_internals.h
> > > b/src/include/storage/buf_internals.h
> > > index 28519ad2813..0a145d95024 100644
> > > --- a/src/include/storage/buf_internals.h
> > > +++ b/src/include/storage/buf_internals.h
> > > @@ -32,22 +33,29 @@
> > >  /*
> > >   * Buffer state is a single 64-bit variable where following data is combined.
> > >   *
> > > + * State of the buffer itself:
> > >   * - 18 bits refcount
> > >   * - 4 bits usage count
> > >   * - 10 bits of flags
> > >   *
> > > + * State of the content lock:
> > > + * - 1 bit has_waiter
> > > + * - 1 bit release_ok
> > > + * - 1 bit lock state locked
> > >
> > > Somewhere you should clearly explain the scenarios in which you still
> > > need to take the buffer header lock with LockBufHdr() vs when you can
> > > just use atomic operations/CAS loops.
> >
> > There is an explanation of that, or at least my attempt at it ;). See the
> > dcoumentation for BufferDesc.
> 
> Hmm. I suppose. I was imagining something above the state member since
> you have added to it, but okay.
> 
> Separately, I'll admit I don't quite understand when I have to use
> LockBufHdr() and when I should use a CAS loop to update the
> bufferdesc->state.

It's definitely subtle :(. Not sure what to do about that, other than to work
on eventually just making everything doable with just a CAS. But that's a fair
bit of future work away.


> > > @@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res)
> > >      if (BufferIsLocal(buffer))
> > >          UnpinLocalBufferNoOwner(buffer);
> > >      else
> > > +    {
> > > +        PrivateRefCountEntry *ref;
> > > +
> > > +        ref = GetPrivateRefCountEntry(buffer, false);
> > > +
> > > +        /*
> > > +         * If the buffer was locked at the time of the resowner release,
> > > +         * release the lock now. This should only happen after errors.
> > > +         */
> > > +        if (ref->data.lockmode != BUFFER_LOCK_UNLOCK)
> > > +        {
> > > +            BufferDesc *buf = GetBufferDescriptor(buffer - 1);
> > > +
> > > +            HOLD_INTERRUPTS();    /* match the upcoming RESUME_INTERRUPTS */
> > > +            BufferLockUnlock(buffer, buf);
> > > +        }
> > > +
> > >          UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1));
> > > +    }
> > >  }
> > >
> > >  Bit confusing that ResOwnerReleaseBufferBin() now releases locks as well.
> >
> > Do you have a better suggestion? I'll add a comment that makes that explicit,
> > but other than that I don't have a great idea. Renaming the whole buffer pin
> > mechanism seems pretty noisy.
> 
> ResOwnerReleaseBuffer()?
> 
> What do you mean renaming the whole buffer pin mechanism?

All the *PrivateRefCount* stuff arguably aught to be renamed...


> > > I presume you didn't do this because HeapTupleSatisifiesMVCC() for
> > > SNAPSHOT_MVCC calls the non-batch version (and, of course,
> > > HeapTupleSatisifiesVisibility() is the much more common case).
> > >
> > > if (TransactionIdIsValid(xid))
> > > {
> > >         if (BufferIsPermanent(buffer))
> > >         {
> > >                 /* NB: xid must be known committed here! */
> > >                 XLogRecPtr    commitLSN = TransactionIdGetCommitLSN(xid);
> > >
> > >                 if (XLogNeedsFlush(commitLSN) &&
> > >                         BufferGetLSNAtomic(buffer) < commitLSN)
> > >                 {
> > >                         /* not flushed and no LSN interlock, so don't
> > > set hint */
> > >                         return; false;
> > >                 }
> > >         }
> > > }
> > >
> > > Separately, I was thinking, should we assert here about having the
> > > right lock type?
> >
> > Not sure I get what assert of what locktype where?
> 
> In SetHintBitsExt() that we have share-exclusive or above.

We *don't* necessarily hold that though, it'll just be acquired by
BufferBeginSetHintBits().  Or do you mean in the SHB_ENABLED case?


> Or are there still callers with only a share lock?

Almost all of them, I think? We can't just call this with an unconditional
share-exclusive lock, since that'd destroy concurrency. Instead we just try to
upgrade to share-exclusive to set the hint bit. If we can't get
share-exclusive, we don't need to block, we just haven't set the hint bit.


> > > @@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot,
> > > Buffer buffer,
> > > +    if (state == SHB_ENABLED)
> > > +        BufferFinishSetHintBits(buffer, true, true);
> > > +
> > >      return nvis;
> > >  }
> > >
> > > I wondered if it would be more natural for BufferBeginSetHintBits()
> > > and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED
> > > instead of having callers do it. But, I guess you don't do this
> > > because of gist and hash indexes using this for doing their own
> > > modifications.
> >
> > I thought about it. But yea, the different callers seemed to make that not
> > really useful. It'd also mean that we'd do an external function call for every
> > tuple, which I think would be prohibitively expensive.
> 
> I'm confused why this would mean more external function calls.

The visibility logic is in heapam_visibility.c, the locking for the buffer in
bufmgr.c? If the knowledge about SetHintBitsState lives in
BufferBeginSetHintBits(), we need to call it to do that check.


> > > --- a/src/backend/storage/freespace/freespace.c
> > > +++ b/src/backend/storage/freespace/freespace.c
> > > @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
> > >     max_avail = fsm_get_max_avail(page);
> > >     /*
> > > -    * Reset the next slot pointer. This encourages the use of low-numbered
> > > -    * pages, increasing the chances that a later vacuum can truncate the
> > > -    * relation.  We don't bother with a lock here, nor with marking the page
> > > -    * dirty if it wasn't already, since this is just a hint.
> > > +    * Try to reset the next slot pointer. This encourages the use of
> > > +    * low-numbered pages, increasing the chances that a later vacuum can
> > > +    * truncate the relation.  We don't bother with a lock here, nor with
> > > +    * marking the page dirty if it wasn't already, since this is just a hint.
> > > +    *
> > > +    * To be allowed to update the page without an exclusive lock, we have to
> > > +    * use the hint bit infrastructure.
> > >      */
> > >
> > > What the heck? This didn't even take a share lock before...
> >
> > It is insane. I do not understand how anybody thought this was ok.  I think I
> > probably should split this out into a separate commit, since it's so insane.
> 
> Aren't you going to backport having it take a lock? Then it can be
> separate (e.g. have it take a share lock in the "fix" commit and then
> this commit bumps it to share-exclusive).

I wasn't thinking we should backport that. It's been this way for ages,
without having caused known issues....


> > > diff --git a/src/backend/storage/freespace/fsmpage.c
> > > b/src/backend/storage/freesp>
> > > /*
> > > * Update the next-target pointer. Note that we do this even if we're only
> > > * holding a shared lock, on the grounds that it's better to use a shared
> > > * lock and get a garbled next pointer every now and then, than take the
> > > * concurrency hit of an exclusive lock.
> > >
> > > We appear to avoid the garbling now?
> >
> > I don't think so. Two backends concurrently can do fsm_search_avail() and
> > one backend might set a hint to a page that is already used up by the other
> > one. At least I think so?
> 
> Maybe I don't know what it meant by garbled, but I thought it was
> talking about two backends each trying to set fp_next_slot. If they
> now have to have a share-exclusive lock and they can't both have a
> share-exclusive lock at the same time, then it seems like that
> wouldn't be a problem. It sounds like you may be talking about a
> backend taking up the freespace of a page that is referred to by the
> fp_next_slot?

Yes, a version of the latter. The value that fp_next_slot will be set to can
be outdated by the time we actually set it, unless we do all of
fsm_search_avail() under some form of exclusive lock - clearly not something
desirable.

Greetings,

Andres Freund