MIME-Version: 1.0
References: <fvfmkr5kk4nyex56ejgxj3uzi63isfxovp2biecb4bspbjrze7@az2pljabhnff>
 <yivb2evcrj7fna5ymuunw3g5u5xxttwjbjxaa4ofkfkviystjv@4dfylftqxyxh>
 <6kmid26do57ykqfpvq6iieniy4djsymhrypkjccazq5g4bbe6a@2y6owwv7qpex>
 <CAEze2WgGe8vjj3jiWqUugWuwLJ9cLryaGrnASjm-yJ=tEALX2A@mail.gmail.com>
 <pmto7djq64mei53p7r5smfync2waittilhbuzc7j7lpflf2b3y@laz7r76y5pux>
 <CAEze2WjeK9CY003S4dmCugv_H4tz9AaXgnqW+wTc=BaPDg+2xg@mail.gmail.com>
 <3je3ahgf7rrmmurxo6hnlhg5d3ffwfrtjwjxd6jm5srlv5iebp@vxqk5qtgmowr>
 <3w7v3w6a57jnssokap4k7thoekig72flnyhd4wp3yftzdd7lm7@f6lpcfen6hr7>
 <6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar>
In-Reply-To: <6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar>
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Tue, 25 Nov 2025 10:44:31 -0500
Message-ID: 
 <CAAKRu_ZcZCXNUWfhEg4qOYuRO8T1x3_gvaUFFrwgCfY0RnVhkg@mail.gmail.com>
Subject: Re: Buffer locking is special (hints, checksums, AIO writes)
To: Andres Freund <andres@anarazel.de>
Cc: Matthias van de Meent <boekewurm+postgres@gmail.com>,
 pgsql-hackers@postgresql.org,
	Thomas Munro <thomas.munro@gmail.com>, Heikki Linnakangas <hlinnaka@iki.fi>,
 Noah Misch <noah@leadboat.com>,
	Robert Haas <robertmhaas@gmail.com>,
 Michael Paquier <michael.paquier@gmail.com>
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
Archived-At: 
 <https://www.postgresql.org/message-id/CAAKRu_ZcZCXNUWfhEg4qOYuRO8T1x3_gvaUFFrwgCfY0RnVhkg%40mail.gmail.com>
Precedence: bulk

On Wed, Nov 19, 2025 at 9:47=E2=80=AFPM Andres Freund <andres@anarazel.de> =
wrote:
>
> 0008: The main change. Implements buffer content locking independently fr=
om
> lwlock.c. There's obviously a lot of similarity between lwlock.c code and
> this, but I've not found a good way to reduce the duplication without giv=
ing
> up too much.  This patch does immediately introduce share-exclusive as a =
new
> lock level, mostly because it was too painful to do separately.
>
> 0009+0010+0011: Preparatory work for 0012.
>
> 0012: One of the main goals of this patchset - use the new share-exclusiv=
e
> lock level to only allow hint bits to be set while no IO is going on.

Below is a review of 0008-0012
I skipped 0013 and 0014 after seeing "#if 1" in 0013 :)

0008:
-------
> [PATCH v6 08/14] bufmgr: Implement buffer content locks independently
 of lwlocks
...
> This commit unfortunately introduces some code that is very similar to th=
e
> code in lwlock.c, however the code is not equivalent enough to easily mer=
ge
> it. The future wins that this commit makes possible seem worth the cost.

It is a truly unfortunate amount of duplication. I tried some
refactoring myself just to convince myself it wasn't a good idea.
However, ISTM there is no reason for the lwlock and buffer locking
implementations to have to stay in sync. So they can diverge as
features are added and perhaps the duplication won't be as conspicuous
in the future.

> As of this commit nothing uses the new share-exclusive lock mode. It will=
 be
>documented and used in a future commit. It seemed too complicated to intro=
duce
> the lock-level in a separate commit.

I would have liked this mentioned earlier in the commit message. Also,
I don't know how I feel about it being "documented" in a future
commit...perhaps just don't say that.

diff --git a/src/include/storage/buf_internals.h
b/src/include/storage/buf_internals.h
index 28519ad2813..0a145d95024 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -32,22 +33,29 @@
 /*
  * Buffer state is a single 64-bit variable where following data is combin=
ed.
  *
+ * State of the buffer itself:
  * - 18 bits refcount
  * - 4 bits usage count
  * - 10 bits of flags
  *
+ * State of the content lock:
+ * - 1 bit has_waiter
+ * - 1 bit release_ok
+ * - 1 bit lock state locked

Somewhere you should clearly explain the scenarios in which you still
need to take the buffer header lock with LockBufHdr() vs when you can
just use atomic operations/CAS loops.

Separately, you should clearly explain what BM_LOCKED (I presume what
"lock state locked" refers to) protects.

+/*
+ * Definitions related to buffer content locks
+ */
+#define BM_LOCK_HAS_WAITERS         (UINT64CONST(1) << 63)
+#define BM_LOCK_RELEASE_OK          (UINT64CONST(1) << 62)
+
+#define BM_LOCK_VAL_SHARED          (UINT64CONST(1) << 32)
+#define BM_LOCK_VAL_SHARE_EXCLUSIVE (UINT64CONST(1) << (32 +
MAX_BACKENDS_BITS))
+#define BM_LOCK_VAL_EXCLUSIVE       (UINT64CONST(1) << (32 + 1 +
MAX_BACKENDS_BITS))
+
+#define BM_LOCK_MASK                (((uint64)MAX_BACKENDS << 32) |
BM_LOCK_VAL_SHARE_EXCLUSIVE | BM_LOCK_VAL_EXCLUSIVE)

Not the fault of this patch, but I always found it confusing that the
BM flags were for buffer manager. I always think BM stands for buffer
mapping.

On a more actionable note: you probably want to update the comment in
procnumber.h in your earlier patch to take out the part about how the
limitation could be lifted using a 64 bit state -- since you made the
state 64 bit and didn't lift the limitation (see below for the
specific comment).

/*
 * Note: MAX_BACKENDS_BITS is 18 as that is the space available for buffer
 * refcounts in buf_internals.h.  This limitation could be lifted by using =
a
 * 64bit state; but it's unlikely to be worthwhile as 2^18-1 backends excee=
d
 * currently realistic configurations. Even if that limitation were removed=
,
 * we still could not a) exceed 2^23-1 because inval.c stores the ProcNumbe=
r
 * as a 3-byte signed integer, b) INT_MAX/4 because some places compute
 * 4*MaxBackends without any overflow check.  We check that the configured
 * number of backends does not exceed MAX_BACKENDS in InitializeMaxBackends=
().
 */

@@ -268,6 +287,15 @@ BufMappingPartitionLockByIndex(uint32 index)
  * wait_backend_pgprocno and setting flag bit BM_PIN_COUNT_WAITER.  At pre=
sent,
  * there can be only one such waiter per buffer.
  *
+ * The content of buffers is protected via the buffer content lock,
+ * implemented as part buffer state. Note that the buffer header lock is *=
not*
+ * used to control access to the data in the buffer! We used to use an LWL=
ock
+ * to implement the content lock, but having a dedicated implementation of
+ * content locks allows to implement some otherwise hard things (e.g.
+ * race-freely checking if AIO is in progress before locking a buffer
+ * exclusively) and makes otherwise impossible optimizations possible
+ * (e.g. unlocking and unpinning a buffer in one atomic operation).
+ *

I don't really like that this includes a description of how it used to
work. It makes the description more confusing when trying to
understand the current state. And comments like that make most sense
when the current state is confusing because of a long annoying
history.

@@ -302,7 +303,24 @@ extern void BufferGetTag(Buffer buffer,
RelFileLocator *rlocator,
 extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std);
 extern void UnlockBuffers(void);
-extern void LockBuffer(Buffer buffer, BufferLockMode mode);
+extern void UnlockBuffer(Buffer buffer);
+extern void LockBufferInternal(Buffer buffer, BufferLockMode mode);
+
+/*
+ * Handling BUFFER_LOCK_UNLOCK in bufmgr.c leads to sufficiently worse bra=
nch
+ * prediction to impact performance. Therefore handle that switch here, we=
re
+ * most of the time `mode` will be a constant and thus can be optimized ou=
t by
+ * the compiler.

Typo: were -> where

+ */
+static inline void
+LockBuffer(Buffer buffer, BufferLockMode mode)
+{
+    if (mode =3D=3D BUFFER_LOCK_UNLOCK)
+        UnlockBuffer(buffer);
+    else
+        LockBufferInternal(buffer, mode);
+}
+

Is there a reason to stick with the LockBuffer(buf,
BUFFER_LOCK_UNLOCK) interface here? It seems like a cgood time to
start using UnlockBuffer() which I thought most people find more
intuitive.

 /*
- * Acquire or release the content_lock for the buffer.
+ * Acquire the buffer content lock in the specified mode
+ *
+ * If the lock is not available, sleep until it is.
+ *
+ * Side effect: cancel/die interrupts are held off until lock release.
+ *
+ * This uses almost the same locking approach as lwlock.c's
+ * LWLockAcquire(). See documentation atop of lwlock.c for a more detailed
+ * discussion.
+ *
+ * The reason that this, and most of the other BufferLock* functions, get =
both
+ * the Buffer and BufferDesc* as parameters, is that looking up one from t=
he
+ * other repeatedly shows up noticeably in profiles.
+ */
+static inline void
+BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode)

Either here or over the lock mode enum, I'd spend a few sentences
describing how the buffer lock modes interact -- what conflicts with
what. You spend more time saying how this is like LWLock than
explaining what it actually is.

+/*
+ * Remove ourselves from the waitlist.
+ *
+ * This is used if we queued ourselves because we thought we needed to sle=
ep
+ * but, after further checking, we discovered that we don't actually need =
to
+ * do so.
+ */
+static void
+BufferLockDequeueSelf(BufferDesc *buf_hdr)
+{
+    bool        on_waitlist;
+
+    LockBufHdr(buf_hdr);
+
...
+    /* XXX: combine with fetch_and above? */
+    UnlockBufHdr(buf_hdr);

I know you just ported this comment from the LWLock implementation but
I can't see how it could be worth the effort. It won't be in the
hottest path as you must satisfy a few conditions to get here.


+ * Wakeup all the lockers that currently have a chance to acquire the lock=
.
+ */
+static void
+BufferLockWakeup(BufferDesc *buf_hdr, bool unlocked)

This should have a more explanatory comment. You should especially
document the unlocked parameter. I would expect something somewhere in
or above this function that basically says (maybe less verbosely)

- Before waking anything: allow all
- After waking SHARE:
    wake SHARE only
- After waking SHARE_EXCLUSIVE:
    wake SHARE only
    (no more SHARE_EXCLUSIVE)
- After waking EXCLUSIVE:
    wake none

+{
+    bool        new_release_ok;
+    bool        wake_exclusive =3D unlocked;
+    bool        wake_share_exclusive =3D true;
+    proclist_head wakeup;
+    proclist_mutable_iter iter;
+
+    proclist_init(&wakeup);
+
+    new_release_ok =3D true;
+
+    /* lock wait list while collecting backends to wake up */
+    LockBufHdr(buf_hdr);
+
+    proclist_foreach_modify(iter, &buf_hdr->lock_waiters, lwWaitLink)
+    {
+        PGPROC       *waiter =3D GetPGProcByNumber(iter.cur);
+
+        if (!wake_exclusive && waiter->lwWaitMode =3D=3D BUFFER_LOCK_EXCLU=
SIVE)
+            continue;
+
+        if (!wake_share_exclusive && waiter->lwWaitMode =3D=3D
BUFFER_LOCK_SHARE_EXCLUSIVE)
+            continue;
+

It seems there is a difference between LWLockWakeup() and
BufferLockWakeup() if the queue contains a share lock waiter followed
by an exclusive lock waiter.

In LWLockWakeup(), it will wake up the share waiter, set
wokeup_somebody to true, and then not wake up the exclusive lock
waiter because wokeup_somebody is true.

In BufferLockWakeup() when unlocked is true, it will wake up the share
lock waiter and then wake up the exclusive lock waiter because
wake_exclusive and wake_share_exclusive are both still true.

This might be the intended behavior, but it is a difference from
LWLockWakeup(), so I think it is worth documenting why it is okay.


+         * Signal that the process isn't on the wait list anymore. This al=
lows
+         * BufferLockDequeueSelf() to remove itself of the waitlist with a
+         * proclist_delete(), rather than having to check if it has been

I know this comment was ported over, but the "remove itself of the
waitlist" -- the "of" is confusing in the original comment and it is
confusing here.

+ * Compute subtraction from buffer state for a release of a held lock in
+ * `mode`.
+ *
+ * This is separated from BufferLockUnlock() as we want to combine the loc=
k
+ * release with other atomic operations when possible, leading to the lock
+ * release being done in multiple places.
+ */
+static inline uint64
+BufferLockReleaseSub(BufferLockMode mode)

I don't understand why this is a separate function even with your comment.

+ * BufferLockHeldByMe - test whether my process holds the content lock in =
any
+ * mode
+ *
+ * This is meant as debug support only.
+ */
+static bool
+BufferLockHeldByMe(BufferDesc *buf_hdr)
+{
+    PrivateRefCountEntry *entry =3D
+        GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false)=
;
+
+    if (!entry)
+        return false;
+    else
+        return entry->data.lockmode !=3D BUFFER_LOCK_UNLOCK;
+}

Previously, if I called LockBuffer(buf, BUFFER_LOCK_SHARE) again after
calling it once, say in RelationCopyStorageUsingBuffer(), I would trip
an assert when unpinning the buffer about how I needed to have
released the lock.

Now, though, it doesn't trip the assert because we don't track
multiple locks. When I call LockBuffer(UNLOCK), it sets lockmode to
BUFFER_LOCK_UNLOCK and BufferLockHeldByMe() only checks the private
refcount entry. Whereas LWLockHeldByMe() checked held_lwlocks which
had multiple entries and would report that I still held a lock. Now,
if I call LockBuffer(SHARE) twice, I'll only find out later when I
have a hang because someone is trying to get an exclusive lock and the
actual BufferDesc->state still has a lock set.

Maybe you should add an assert to the lock acquisition path that the
prevate ref count entry mode is UNLOCK?

 void
-LockBuffer(Buffer buffer, BufferLockMode mode)
+LockBufferInternal(Buffer buffer, BufferLockMode mode)
 {
-    buf =3D GetBufferDescriptor(buffer - 1);
+    buf_hdr =3D GetBufferDescriptor(buffer - 1);

-    if (mode =3D=3D BUFFER_LOCK_UNLOCK)
-        LWLockRelease(BufferDescriptorGetContentLock(buf));
-    else if (mode =3D=3D BUFFER_LOCK_SHARE)
-        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED);
+    if (mode =3D=3D BUFFER_LOCK_SHARE)
+        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE);
+    else if (mode =3D=3D BUFFER_LOCK_SHARE_EXCLUSIVE)
+        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE_EXCLUSIVE);
     else if (mode =3D=3D BUFFER_LOCK_EXCLUSIVE)
-        LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE);
+        BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_EXCLUSIVE);
     else
         elog(ERROR, "unrecognized buffer lock mode: %d", mode);

I presume you've stuck with this if statement structure because that
is what LockBuffer() used. Even though now the BufferLockMode passed
through to BufferLockAcquire is the exact thing you are testing. It
caught my eye and I had to check multiple times if the mode being
passed in is the same. Basically I found it a bit distracting.


@@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res)
     if (BufferIsLocal(buffer))
         UnpinLocalBufferNoOwner(buffer);
     else
+    {
+        PrivateRefCountEntry *ref;
+
+        ref =3D GetPrivateRefCountEntry(buffer, false);
+
+        /*
+         * If the buffer was locked at the time of the resowner release,
+         * release the lock now. This should only happen after errors.
+         */
+        if (ref->data.lockmode !=3D BUFFER_LOCK_UNLOCK)
+        {
+            BufferDesc *buf =3D GetBufferDescriptor(buffer - 1);
+
+            HOLD_INTERRUPTS();    /* match the upcoming RESUME_INTERRUPTS =
*/
+            BufferLockUnlock(buffer, buf);
+        }
+
         UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1));
+    }
 }

 Bit confusing that ResOwnerReleaseBufferBin() now releases locks as well.

0009 -- I didn't look closely at 0009

0010:
--------
> [PATCH v6 10/14] heapam: Use exclusive lock on old page in CLUSTER

> To be able to guarantee that we can set the hint bit, acquire an exclusiv=
e
> lock on the old buffer. We need the hint bits to be set as otherwise
> reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple()=
 will
> get confused.

So, this is an active bug? And what exactly do you mean
heap_freeze_tuple() gets confused? I thought somewhere in there we
would check the clog.

0011:
--------
> [PATCH v6 11/14] heapam: Add batch mode mvcc check and use it in page mod=
e

> 2) We would like to stop setting hint bits while pages are being written
> out. The necessary locking becomes visible for page mode scans if done fo=
r
> every tuple. With batching the overhead can be amortized to only happen
> once per page.

I don't understand the above point. What does this patch have to do
with not setting hint bits while pages are being written out (that
happens in the next patch)? And I presume you mean don't set hint bits
on a buffer that is being flushed by someone else -- but it sounds
like you mean not to set hint bits as part of flushing a buffer.

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/hea=
pam.c
index 4b0c49f4bb0..ddabd1a3ec3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -504,42 +504,93 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snaps=
hot,
                     BlockNumber block, int lines,
                     bool all_visible, bool check_serializable)
 {
+    Oid            relid =3D RelationGetRelid(scan->rs_base.rs_rd);
+#ifdef BATCHMVCC_FEWER_ARGS
+    BatchMVCCState batchmvcc;
+    HeapTupleData *tuples =3D batchmvcc.tuples;
+    bool       *visible =3D batchmvcc.visible;
+#else
+    HeapTupleData tuples[MaxHeapTuplesPerPage];
+    bool        visible[MaxHeapTuplesPerPage];
+#endif
     int            ntup =3D 0;

It's pretty confusing when visible is an output vs input parameter and
who fills it in when. (i.e. it's filled in in page_collect_tuples() if
check_serializable and all_visible are true, otherwise it's filled in
in HeapTupleSatisifiesMVCCBatch())

Personally, I think I'd almost prefer an all-visible and
not-all-visible version of page_collect_tuples() (or helper functions
containing the loop) that separate this. (I haven't tried it though)

BATCHMVCC_FEWER_ARGS definitely doesn't make it any easier to read --
which I assume you are removing.

+        /*
+         * If the page is not all-visible or we need to check serializabil=
ity,
+         * maintain enough state to be able to refind the tuple efficientl=
y,
+         * without again needing to extract it from the page.
+         */
+        if (!all_visible || check_serializable)
+        {

"enough state" is pretty vague here. Maybe mention what it wouldn't be
valid to do with the tuples array?

/*
* If the page is all visible, these fields won'otherwise wont be
* populated in loop below.
*/

Some spelling issues with "won'" and "wont"

+ * visibility is set in batchmvcc->visible[]. In addition, ->vistuples_den=
se
+ * is set to contain the offsets of visible tuples.
+ *
+ * Returns the number of visible tuples.
+ */
+int
+HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer,
+                            int ntups,

I don't really get why this (the patch in general) would be
substantially faster.  You still call HeapTupleSatisfiesMVCC() for
each tuple in a loop. The difference is that you've got pointers into
the array of tuples instead of doing PageGetItem(), then calling
HeapTupleSatisfiesMVCC() for each tuple.


0012:
-------

> [PATCH v6 12/14] Require share-exclusive lock to set hint bits
> To address these issue, this commit changes the rules so that modificatio=
ns to
> pages are not allowed anymore while holding a share lock. Instead the new

In the commit message, you make it sound like you only change the lock
level for setting the hint bits. But that wouldn't solve any problems
if FlushBuffer() could still happen with a share lock. I would try to
make it clear that you change the lock level both for setting the hint
bits and flushing the buffer.

@@ -77,6 +73,16 @@ gistkillitems(IndexScanDesc scan)
      */
     for (i =3D 0; i < so->numKilled; i++)
     {
+        if (!killedsomething)
+        {
+            /*
+             * Use hint bit infrastructure to be allowed to modify the pag=
e
+             * without holding an exclusive lock.
+             */
+            if (!BufferBeginSetHintBits(buffer))
+                goto unlock;
+        }
+

I don't understand why this is in the loop. Clearly you want to call
BufferBeginSetHintBits() once, but why would you do it in the loop?

In the comment, I might also note that the lock level will be upgraded
as needed or something since we only have a share lock, it is
confusing at first

- * SetHintBits()
+ * To be allowed to set hint bits, SetHintBits() needs to call
+ * BufferBeginSetHintBits(). However, that's not free, and some callsites =
call
+ * SetHintBits() on many tuples in a row. For those it makes sense to amor=
tize
+ * the cost of BufferBeginSetHintBits(). Additionally it's desirable to de=
fer
+ * the cost of BufferBeginSetHintBits() until a hint bit needs to actually=
 be
+ * set. This enum serves as the necessary state space passed to
+ * SetHintbitsExt().
+ */
+typedef enum SetHintBitsState
+{
+   /* not yet checked if hint bits may be set */
+   SHB_INITIAL,
+   /* failed to get permission to set hint bits, don't check again */
+   SHB_DISABLED,
+   /* allowed to set hint bits */
+   SHB_ENABLED,
+} SetHintBitsState;

I dislike the SHB prefix. Perhaps something involving the word hint?
And should the enum name itself (SetHintBitsState) include the word
"batch"? I know that would make it long. At least the comment should
explain that these are needed when batch setting hint bits.

+SetHintBitsExt(HeapTupleHeader tuple, Buffer buffer,
+               uint16 infomask, TransactionId xid, SetHintBitsState *state=
)
 {
     if (TransactionIdIsValid(xid))
     {
-        /* NB: xid must be known committed here! */
-        XLogRecPtr    commitLSN =3D TransactionIdGetCommitLSN(xid);
+        if (BufferIsPermanent(buffer))
+        {

I really wish there was a way to better pull apart the batch and
non-batch cases in a way that could allow the below block to be in a
helper do_set_hint() (or whatever) which SetHintBitsExt() and
SetHintBits() called. And then you inlined BufferSetHintBits16().

I presume you didn't do this because HeapTupleSatisifiesMVCC() for
SNAPSHOT_MVCC calls the non-batch version (and, of course,
HeapTupleSatisifiesVisibility() is the much more common case).

if (TransactionIdIsValid(xid))
{
        if (BufferIsPermanent(buffer))
        {
                /* NB: xid must be known committed here! */
                XLogRecPtr    commitLSN =3D TransactionIdGetCommitLSN(xid);

                if (XLogNeedsFlush(commitLSN) &&
                        BufferGetLSNAtomic(buffer) < commitLSN)
                {
                        /* not flushed and no LSN interlock, so don't
set hint */
                        return; false;
                }
        }
}

Separately, I was thinking, should we assert here about having the
right lock type?

 * It is only safe to set a transaction-committed hint bit if we know the
 * transaction's commit record is guaranteed to be flushed to disk before t=
he
 * buffer, or if the table is temporary or unlogged and will be obliterated=
 by
 * a crash anyway.  We cannot change the LSN of the page here, because we m=
ay
 * hold only a share lock on the buffer, so we can only use the LSN to
 * interlock this if the buffer's LSN already is newer than the commit LSN;
 * otherwise we have to just refrain from setting the hint bit until some
 * future re-examination of the tuple.
 *

Should this say "we may hold only a share exclusive lock on the
buffer". Also what is "this" in "only use the LSN to interlock this"?

@@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot,
Buffer buffer,
+    if (state =3D=3D SHB_ENABLED)
+        BufferFinishSetHintBits(buffer, true, true);
+
     return nvis;
 }

I wondered if it would be more natural for BufferBeginSetHintBits()
and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED
instead of having callers do it. But, I guess you don't do this
because of gist and hash indexes using this for doing their own
modifications. It seems like it also would help make it clear that
BufferBegin and BufferFinish are for batch mode.

I can't help but feel a bit of awkwardness in the whole API between
this and the way you've supported non-batch mode in SetHintBitsExt().
But it's easy for me to criticize without providing concrete ideas of
how to reorganize it.

+static inline bool
+SharedBufferBeginSetHintBits(Buffer buffer, BufferDesc *buf_hdr,
uint64 *locksta>
+{
+   uint64      old_state;
+   PrivateRefCountEntry *ref;
+   BufferLockMode mode;

These functions probably ought to have comments. And I wonder if you
should say anything (or even assert anything) about how IO better not
be in progress (though it is enforced by the locks).

+        /*
+         * Can't upgrade if somebody else holds the lock in exlusive or
+         * share-exclusive mode.
+         */

typo: exlusive -> exclusive

-4. It is considered OK to update tuple commit status bits (ie, OR the
-values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or
+4. Non-critical information on a page ("hint bits") may be modified while
+holding only a share-exclusive lock and pin on the page. To do so in cases
+where only a share lock is already held, use BufferBeginSetHintBits() &
+BufferFinishSetHintBits() (if multiple hint bits are to be set) or
+BufferSetHintBits16() (if a single hit bit is set).

typo: single hit -> single hint
Also, you should probably say that you use BufferBeginSetHintBits() to
actually upgrade the lock.

I also don't see anywhere in the README where flushing a buffer is
mentioned -- and what lock level is needed for that in different
situations. It kind of feels like it is worth mentioning.

-   if ((pg_atomic_read_u64(&bufHdr->state) & (BM_DIRTY | BM_JUST_DIRTIED))=
 !=3D
-       (BM_DIRTY | BM_JUST_DIRTIED))
+   if (unlikely((lockstate & (BM_DIRTY | BM_JUST_DIRTIED)) !=3D
+                (BM_DIRTY | BM_JUST_DIRTIED)))

I don't quite understand why you do the atomic read in the call to
MarkSharedBufferDirtyHint() form MarkBufferDirtyHint() instead of at
the top of MarkSharedBufferDirtyHint() -- and then you wouldn't have
to pass the lockstate as a parameter, right?

--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr,
    max_avail =3D fsm_get_max_avail(page);
    /*
-    * Reset the next slot pointer. This encourages the use of low-numbered
-    * pages, increasing the chances that a later vacuum can truncate the
-    * relation.  We don't bother with a lock here, nor with marking the pa=
ge
-    * dirty if it wasn't already, since this is just a hint.
+    * Try to reset the next slot pointer. This encourages the use of
+    * low-numbered pages, increasing the chances that a later vacuum can
+    * truncate the relation.  We don't bother with a lock here, nor with
+    * marking the page dirty if it wasn't already, since this is just a hi=
nt.
+    *
+    * To be allowed to update the page without an exclusive lock, we have =
to
+    * use the hint bit infrastructure.
     */

What the heck? This didn't even take a share lock before...

diff --git a/src/backend/storage/freespace/fsmpage.c
b/src/backend/storage/freesp>
index 66a5c80b5a6..a59696b6484 100644
--- a/src/backend/storage/freespace/fsmpage.c
+++ b/src/backend/storage/freespace/fsmpage.c
@@ -298,9 +298,18 @@ restart:
     * lock and get a garbled next pointer every now and then, than take th=
e
     * concurrency hit of an exclusive lock.
     *
+    * Without an exclusive lock, we need to use the hint bit infrastructur=
e
+    * to be allowed to modify the page.
+    *

Is the sentence above this still correct?

/*
* Update the next-target pointer. Note that we do this even if we're only
* holding a shared lock, on the grounds that it's better to use a shared
* lock and get a garbled next pointer every now and then, than take the
* concurrency hit of an exclusive lock.

We appear to avoid the garbling now?

In general on 0012, I didn't spend much time checking if you caught
all the places where we mention our hint bit hackery (only taking the
share lock). But those can always be caught later as we inevitably
encounter them.

- Melanie