MIME-Version: 1.0
References: <fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff>
 <yivb2evcrj7fna5ymuunw3g5u5xxttwjbjxaa4ofkfkviystjv@4dfylftqxyxh>
 <6kmid26do57ykqfpvq6iieniy4djsymhrypkjccazq5g4bbe6a@2y6owwv7qpex>
 <CAEze2WgGe8vjj3jiWqUugWuwLJ9cLryaGrnASjm-yJ=tEALX2A@mail.gmail.com>
 <pmto7djq64mei53p7r5smfync2waittilhbuzc7j7lpflf2b3y@laz7r76y5pux>
 <CAEze2WjeK9CY003S4dmCugv_H4tz9AaXgnqW+wTc=BaPDg+2xg@mail.gmail.com>
 <3je3ahgf7rrmmurxo6hnlhg5d3ffwfrtjwjxd6jm5srlv5iebp@vxqk5qtgmowr>
 <3w7v3w6a57jnssokap4k7thoekig72flnyhd4wp3yftzdd7lm7@f6lpcfen6hr7>
 <6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar>
 <CAAKRu_ZcZCXNUWfhEg4qOYuRO8T1x3_gvaUFFrwgCfY0RnVhkg@mail.gmail.com>
 <3nce7i72ayzkunai6mkz24ckbxk74jodz4ua2chcdrwppxlxcd@w6x5kfkjrkru>
In-Reply-To: <3nce7i72ayzkunai6mkz24ckbxk74jodz4ua2chcdrwppxlxcd@w6x5kfkjrkru>
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 25 Nov 2025 15:02:02 -0500
Message-ID: 
 <CAAKRu_bsQ5KmtAjFsOEHGZCiirp_wgTPyfAQfsyJsGU=g7rkCQ@mail.gmail.com>
Subject: Re: Buffer locking is special (hints, checksums, AIO writes)
To: Andres Freund <andres@anarazel.de>
Cc: Matthias van de Meent <boekewurm+postgres@gmail.com>,
 pgsql-hackers@postgresql.org,
	Thomas Munro <thomas.munro@gmail.com>, Heikki Linnakangas <hlinnaka@iki.fi>,
 Noah Misch <noah@leadboat.com>,
	Robert Haas <robertmhaas@gmail.com>,
 Michael Paquier <michael.paquier@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAAKRu_bsQ5KmtAjFsOEHGZCiirp_wgTPyfAQfsyJsGU%3Dg7rkCQ%40mail.gmail.com>
Precedence: bulk

On Tue, Nov 25, 2025 at 11:54=E2=80=AFAM Andres Freund <andres@anarazel.de>=
 wrote:
>
> > I skipped 0013 and 0014 after seeing "#if 1" in 0013 :)
>
> I left that in there to make the comparison easier. But clearly that's no=
t to
> be committed...

Eh, I mostly just ran out of steam.

> > > [PATCH v6 08/14] bufmgr: Implement buffer content locks independently
> >  of lwlocks
> > ...
> > > This commit unfortunately introduces some code that is very similar t=
o the
> > > code in lwlock.c, however the code is not equivalent enough to easily=
 merge
> > > it. The future wins that this commit makes possible seem worth the co=
st.
> >
> > It is a truly unfortunate amount of duplication. I tried some
> > refactoring myself just to convince myself it wasn't a good idea.
>
> Without success, I guess?

Nothing that seemed better...and not worse :)

> > > As of this commit nothing uses the new share-exclusive lock mode. It =
will be
> > > documented and used in a future commit. It seemed too complicated to =
introduce
> > > the lock-level in a separate commit.
> >
> > I would have liked this mentioned earlier in the commit message.
>
> Hm, ok.  I could split share-exclusive out again, but it was somewhat pai=
nful,
> because it doesn't just lead to adding code, but to changing code.

I don't think you need to introduce share-exclusive in a separate
commit. I just mean it would be good to mention in the commit message
before you get into so many other details that this commit doesn't use
the share-exclusive level yet.

> > Also, I don't know how I feel about it being "documented" in a future
> > commit...perhaps just don't say that.
>
> Hm, why?

I don't really see you documenting share-exclusive lock semantics in a
later commit. Documentation for what the lock level is should be in
the same commit that introduces it -- which I think you mostly have
done. I feel like it is weird to say you will document that lock level
later when 1) you don't really do that and 2) this commit mostly
(besides some requests I had for elaboration) already does that.

> > diff --git a/src/include/storage/buf_internals.h
> > b/src/include/storage/buf_internals.h
> > index 28519ad2813..0a145d95024 100644
> > --- a/src/include/storage/buf_internals.h
> > +++ b/src/include/storage/buf_internals.h
> > @@ -32,22 +33,29 @@
> >  /*
> >   * Buffer state is a single 64-bit variable where following data is co=
mbined.
> >   *
> > + * State of the buffer itself:
> >   * - 18 bits refcount
> >   * - 4 bits usage count
> >   * - 10 bits of flags
> >   *
> > + * State of the content lock:
> > + * - 1 bit has_waiter
> > + * - 1 bit release_ok
> > + * - 1 bit lock state locked
> >
> > Somewhere you should clearly explain the scenarios in which you still
> > need to take the buffer header lock with LockBufHdr() vs when you can
> > just use atomic operations/CAS loops.
>
> There is an explanation of that, or at least my attempt at it ;). See the
> dcoumentation for BufferDesc.

Hmm. I suppose. I was imagining something above the state member since
you have added to it, but okay.

Separately, I'll admit I don't quite understand when I have to use
LockBufHdr() and when I should use a CAS loop to update the
bufferdesc->state.

> >
> > +         * Signal that the process isn't on the wait list anymore. Thi=
s allows
> > +         * BufferLockDequeueSelf() to remove itself of the waitlist wi=
th a
> > +         * proclist_delete(), rather than having to check if it has be=
en
> >
> > I know this comment was ported over, but the "remove itself of the
> > waitlist" -- the "of" is confusing in the original comment and it is
> > confusing here.
>
> As in s/of/from/?

Yes, I wasn't sure if it meant "from" or "off" -- but I believe "from"
is more grammatically correct.

> > +static inline uint64
> > +BufferLockReleaseSub(BufferLockMode mode)
> >
> > I don't understand why this is a separate function even with your comme=
nt.
>
> Because there are operations where we want to unlock the buffer as well a=
s do
> something else. E.g. in 0013, UnlockReleaseBuffer() we want to unlock the
> buffer and decrease the refcount in one atomic operation. For that we nee=
d to
> know what to subtract from the state variable for the lock portion - henc=
e
> BufferLockReleaseSub().

Hmm. Okay well maybe save it for 0013? I don't care that much, though.

> > Maybe you should add an assert to the lock acquisition path that the
> > prevate ref count entry mode is UNLOCK?
>
> Yes, we should...
>
> It'd be nice if we had a decent way to test things that we except to
> crash. Like repeated buffer lock acquisitions...

Indeed.

> > @@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res)
> >      if (BufferIsLocal(buffer))
> >          UnpinLocalBufferNoOwner(buffer);
> >      else
> > +    {
> > +        PrivateRefCountEntry *ref;
> > +
> > +        ref =3D GetPrivateRefCountEntry(buffer, false);
> > +
> > +        /*
> > +         * If the buffer was locked at the time of the resowner releas=
e,
> > +         * release the lock now. This should only happen after errors.
> > +         */
> > +        if (ref->data.lockmode !=3D BUFFER_LOCK_UNLOCK)
> > +        {
> > +            BufferDesc *buf =3D GetBufferDescriptor(buffer - 1);
> > +
> > +            HOLD_INTERRUPTS();    /* match the upcoming RESUME_INTERRU=
PTS */
> > +            BufferLockUnlock(buffer, buf);
> > +        }
> > +
> >          UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1));
> > +    }
> >  }
> >
> >  Bit confusing that ResOwnerReleaseBufferBin() now releases locks as we=
ll.
>
> Do you have a better suggestion? I'll add a comment that makes that expli=
cit,
> but other than that I don't have a great idea. Renaming the whole buffer =
pin
> mechanism seems pretty noisy.

ResOwnerReleaseBuffer()?

What do you mean renaming the whole buffer pin mechanism?

> > 0011:
> > --------
> > > [PATCH v6 11/14] heapam: Add batch mode mvcc check and use it in page=
 mode
> >
> > > 2) We would like to stop setting hint bits while pages are being writ=
ten
> > > out. The necessary locking becomes visible for page mode scans if don=
e for
> > > every tuple. With batching the overhead can be amortized to only happ=
en
> > > once per page.
>
> > And I presume you mean don't set hint bits on a buffer that is being fl=
ushed
> > by someone else -- but it sounds like you mean not to set hint bits as =
part
> > of flushing a buffer.
>
> Right, I do mean the former.

I would try and state it more clearly then.

> > +        /*
> > +         * If the page is not all-visible or we need to check serializ=
ability,
> > +         * maintain enough state to be able to refind the tuple effici=
ently,
> > +         * without again needing to extract it from the page.
> > +         */
> > +        if (!all_visible || check_serializable)
> > +        {
> >
> > "enough state" is pretty vague here.
>
> I don't really follow. Wouldn't going into more detail just restate the c=
ode
> in a comment?

I guess maybe something like

If the page is not all-visible or we need to check serializability,
keep track of the tuples so we can examine them later without the
overhead of extracting them from the page again.

> > 0012:
> > -------
>
> > @@ -77,6 +73,16 @@ gistkillitems(IndexScanDesc scan)
> >       */
> >      for (i =3D 0; i < so->numKilled; i++)
> >      {
> > +        if (!killedsomething)
> > +        {
> > +            /*
> > +             * Use hint bit infrastructure to be allowed to modify the=
 page
> > +             * without holding an exclusive lock.
> > +             */
> > +            if (!BufferBeginSetHintBits(buffer))
> > +                goto unlock;
> > +        }
> > +
> >
> > I don't understand why this is in the loop. Clearly you want to call
> > BufferBeginSetHintBits() once, but why would you do it in the loop?
>
> Why would we want to continue if we can't set hint bits?

No, I'm suggesting that you move BufferBeginSetHintBits() outside the
loop so it is more obvious that it is happening once at the beginning.
Like this:

    if (so->numKilled > 0 && !BufferBeginSetHintBits(buffer))
        goto unlock;

    for (i =3D 0; i < so->numKilled; i++)
    {
        offnum =3D so->killedItems[i];
        iid =3D PageGetItemId(page, offnum);
        ItemIdMarkDead(iid);
        killedsomething =3D true;
    }

    if (killedsomething)
    {
        GistMarkPageHasGarbage(page);
        BufferFinishSetHintBits(buffer, true, true);
    }

unlock:
    UnlockReleaseBuffer(buffer);

> > And should the enum name itself (SetHintBitsState) include the word
> > "batch"? I know that would make it
> > long. At least the comment should explain that these are needed when ba=
tch
> > setting hint bits.
>
> Isn't that what the comment does, explaining that we want to amortize the=
 cost
> of BufferBeginSetHintBits()?

well, amortize is a pretty fancy word and you don't say "batch" anywhere.

> > And then you inlined BufferSetHintBits16().

This was part of my wishlist for separating the batch and non-batch version=
s.

> > I presume you didn't do this because HeapTupleSatisifiesMVCC() for
> > SNAPSHOT_MVCC calls the non-batch version (and, of course,
> > HeapTupleSatisifiesVisibility() is the much more common case).
> >
> > if (TransactionIdIsValid(xid))
> > {
> >         if (BufferIsPermanent(buffer))
> >         {
> >                 /* NB: xid must be known committed here! */
> >                 XLogRecPtr    commitLSN =3D TransactionIdGetCommitLSN(x=
id);
> >
> >                 if (XLogNeedsFlush(commitLSN) &&
> >                         BufferGetLSNAtomic(buffer) < commitLSN)
> >                 {
> >                         /* not flushed and no LSN interlock, so don't
> > set hint */
> >                         return; false;
> >                 }
> >         }
> > }
> >
> > Separately, I was thinking, should we assert here about having the
> > right lock type?
>
> Not sure I get what assert of what locktype where?

In SetHintBitsExt() that we have share-exclusive or above. Or are
there still callers with only a share lock?

> >  * It is only safe to set a transaction-committed hint bit if we know t=
he
> >  * transaction's commit record is guaranteed to be flushed to disk befo=
re the
> >  * buffer, or if the table is temporary or unlogged and will be obliter=
ated by
> >  * a crash anyway.  We cannot change the LSN of the page here, because =
we may
> >  * hold only a share lock on the buffer, so we can only use the LSN to
> >  * interlock this if the buffer's LSN already is newer than the commit =
LSN;
> >  * otherwise we have to just refrain from setting the hint bit until so=
me
> >  * future re-examination of the tuple.
> >  *
> >
> > Should this say "we may hold only a share exclusive lock on the
> > buffer". Also what is "this" in "only use the LSN to interlock this"?
>
> That's a pre-existing comment, right?

Right, but are there callers that will only have a share lock after your ch=
ange?

> > @@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot,
> > Buffer buffer,
> > +    if (state =3D=3D SHB_ENABLED)
> > +        BufferFinishSetHintBits(buffer, true, true);
> > +
> >      return nvis;
> >  }
> >
> > I wondered if it would be more natural for BufferBeginSetHintBits()
> > and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED
> > instead of having callers do it. But, I guess you don't do this
> > because of gist and hash indexes using this for doing their own
> > modifications.
>
> I thought about it. But yea, the different callers seemed to make that no=
t
> really useful. It'd also mean that we'd do an external function call for =
every
> tuple, which I think would be prohibitively expensive.

I'm confused why this would mean more external function calls.

> > --- a/src/backend/storage/freespace/freespace.c
> > +++ b/src/backend/storage/freespace/freespace.c
> > @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
> >     max_avail =3D fsm_get_max_avail(page);
> >     /*
> > -    * Reset the next slot pointer. This encourages the use of low-numb=
ered
> > -    * pages, increasing the chances that a later vacuum can truncate t=
he
> > -    * relation.  We don't bother with a lock here, nor with marking th=
e page
> > -    * dirty if it wasn't already, since this is just a hint.
> > +    * Try to reset the next slot pointer. This encourages the use of
> > +    * low-numbered pages, increasing the chances that a later vacuum c=
an
> > +    * truncate the relation.  We don't bother with a lock here, nor wi=
th
> > +    * marking the page dirty if it wasn't already, since this is just =
a hint.
> > +    *
> > +    * To be allowed to update the page without an exclusive lock, we h=
ave to
> > +    * use the hint bit infrastructure.
> >      */
> >
> > What the heck? This didn't even take a share lock before...
>
> It is insane. I do not understand how anybody thought this was ok.  I thi=
nk I
> probably should split this out into a separate commit, since it's so insa=
ne.

Aren't you going to backport having it take a lock? Then it can be
separate (e.g. have it take a share lock in the "fix" commit and then
this commit bumps it to share-exclusive).

> > diff --git a/src/backend/storage/freespace/fsmpage.c
> > b/src/backend/storage/freesp>
> > /*
> > * Update the next-target pointer. Note that we do this even if we're on=
ly
> > * holding a shared lock, on the grounds that it's better to use a share=
d
> > * lock and get a garbled next pointer every now and then, than take the
> > * concurrency hit of an exclusive lock.
> >
> > We appear to avoid the garbling now?
>
> I don't think so. Two backends concurrently can do fsm_search_avail() and
> one backend might set a hint to a page that is already used up by the oth=
er
> one. At least I think so?

Maybe I don't know what it meant by garbled, but I thought it was
talking about two backends each trying to set fp_next_slot. If they
now have to have a share-exclusive lock and they can't both have a
share-exclusive lock at the same time, then it seems like that
wouldn't be a problem. It sounds like you may be talking about a
backend taking up the freespace of a page that is referred to by the
fp_next_slot?

- Melanie