Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vNvDk-001nZw-1i for pgsql-hackers@arkaria.postgresql.org; Tue, 25 Nov 2025 15:44:48 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vNvDj-009ZPg-0T for pgsql-hackers@arkaria.postgresql.org; Tue, 25 Nov 2025 15:44:47 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vNvDi-009ZPY-2V for pgsql-hackers@lists.postgresql.org; Tue, 25 Nov 2025 15:44:47 +0000 Received: from mail-ed1-x52f.google.com ([2a00:1450:4864:20::52f]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vNvDg-001Rfr-2l for pgsql-hackers@postgresql.org; Tue, 25 Nov 2025 15:44:46 +0000 Received: by mail-ed1-x52f.google.com with SMTP id 4fb4d7f45d1cf-64165cd689eso7548566a12.0 for ; Tue, 25 Nov 2025 07:44:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764085483; x=1764690283; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=gWQWOEW9+rP+MskAP6HQTMdCqgFOy244bgcCMKTrvR0=; b=Ap5kla4AyUCV7dCcbYtomNvi36650Fa747lKShHrmsURF7qUgXsY6rrTSiN0cCgOJH 6mTzFc0j7xd7zZ4B8QRqISUOyZpRn4xQX1O7b2X4hpKdVBCJDufoSdoAoSCLsrIVtnGe 6Ifpk+ytkWpdLWLcxuylX7Baln0Q/ToVc7A1+3/ZjgAUojJQf/gsV4q3uJpG35N2gOmL LsQTBEGiqQQcgPZOMCcbA9L5+Iz2ZYKmv8WHxylqcXY2kEzxep35wy/fFsENHFTaJKgc aM7R8GhfA6P3MDZQ0N2Sil8NpX6bCXb2Ns7WOyCiSPors+jbYEEpyoBd2SmGvcEnFYVp 2L7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764085483; x=1764690283; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=gWQWOEW9+rP+MskAP6HQTMdCqgFOy244bgcCMKTrvR0=; b=qVKOU+UTe30T66vQN31RwefP0eyfsKW6VOrOoCwHRZO4jtn9xnYsMGpTqfFKp7n2iq rdHE5yplovDp+w3bdx4fgmghwe1IgHoWChxJNRLxlYq2sU7hWmyUtAek+sCn4DbXUBrh RckWknxBVfol/wMtXh7ddAOfqOL8AcfTlmF59CzUxPj6u0KiW8hdpDc4+0fS5ryq4V48 V+dmf78pz3TmzkpPfLQ6G6fzot5mI8kY8gdQDKWsOstclfYbCRP9tqoYaRai5hA8IL1n 6KPxzVdi9AAJMAwiEOa6YrUs2Jl1fyaziieR+4mI8g77u+IjFPU729qIZPBnFFuobvKR vVdw== X-Forwarded-Encrypted: i=1; AJvYcCX5b98olCyk6uJuN/YiYDLni5BYFBHrfSt3T814BsQZbGaYEuKU5rWH+cP1a5BsZ4qFjsdqHrf8B/6EXGdZ@postgresql.org X-Gm-Message-State: AOJu0YziUWIO05dSl1zTr+AYA2hZIyEUH2yO5Y4c6C0wMuIcrxcKjvo/ 9E0g3l4X8NLVoZXT+eN2UIWXfGVlvQr70qJjTA9a1knPQfQDLE7+czpiRZPBKrO9alJGjzTxkTL +DhKFQDDGlATkAkEh23yOChR1i+ngwKblZMlD X-Gm-Gg: ASbGncu/sBoKItOp8wsOYJNyCK4OfmBi8nEnRb6XaP0mLWht8Gyabnsp1bDNKQH4YqE nQzPoostcN1wwEV0rsLoOJ+IttoJDDWevloh3ZMoYZlWugagdtkeq1zeQ1rnVu/vy6xV+C8jdQ0 ru24LltiKnWzMmZNtG9ywdQmEMzd/yfTEl55R/OmhzzrZLLjf9FGQC0+ndQf3htyXn2Y7Bf62Qh hmmEY077ATla4UYULHboaTanw2FXjjf/oAMKy3DOYDE7/ntxvwPiMysBLJdOX1H1giZHInV X-Google-Smtp-Source: AGHT+IHH6POtA4v2weuvx9uSqiWHa/VgLfd+CKG0MrdcePb4XFkgF/Rut3xYLdqTk5N6U1Gd1ZTgx8xaK2C5JdrrPhA= X-Received: by 2002:a05:6402:34d1:b0:643:4e9c:d165 with SMTP id 4fb4d7f45d1cf-645550808a7mr16149145a12.5.1764085483079; Tue, 25 Nov 2025 07:44:43 -0800 (PST) MIME-Version: 1.0 References: <6kmid26do57ykqfpvq6iieniy4djsymhrypkjccazq5g4bbe6a@2y6owwv7qpex> <3je3ahgf7rrmmurxo6hnlhg5d3ffwfrtjwjxd6jm5srlv5iebp@vxqk5qtgmowr> <3w7v3w6a57jnssokap4k7thoekig72flnyhd4wp3yftzdd7lm7@f6lpcfen6hr7> <6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar> In-Reply-To: <6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar> From: Melanie Plageman Date: Tue, 25 Nov 2025 10:44:31 -0500 X-Gm-Features: AWmQ_blEJ91Bw0StTFeTwBsx3T5T3S2yribleM1ujAvyOGKX0VHQcK6vRjVmypU Message-ID: Subject: Re: Buffer locking is special (hints, checksums, AIO writes) To: Andres Freund Cc: Matthias van de Meent , pgsql-hackers@postgresql.org, Thomas Munro , Heikki Linnakangas , Noah Misch , Robert Haas , Michael Paquier Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Wed, Nov 19, 2025 at 9:47=E2=80=AFPM Andres Freund = wrote: > > 0008: The main change. Implements buffer content locking independently fr= om > lwlock.c. There's obviously a lot of similarity between lwlock.c code and > this, but I've not found a good way to reduce the duplication without giv= ing > up too much. This patch does immediately introduce share-exclusive as a = new > lock level, mostly because it was too painful to do separately. > > 0009+0010+0011: Preparatory work for 0012. > > 0012: One of the main goals of this patchset - use the new share-exclusiv= e > lock level to only allow hint bits to be set while no IO is going on. Below is a review of 0008-0012 I skipped 0013 and 0014 after seeing "#if 1" in 0013 :) 0008: ------- > [PATCH v6 08/14] bufmgr: Implement buffer content locks independently of lwlocks ... > This commit unfortunately introduces some code that is very similar to th= e > code in lwlock.c, however the code is not equivalent enough to easily mer= ge > it. The future wins that this commit makes possible seem worth the cost. It is a truly unfortunate amount of duplication. I tried some refactoring myself just to convince myself it wasn't a good idea. However, ISTM there is no reason for the lwlock and buffer locking implementations to have to stay in sync. So they can diverge as features are added and perhaps the duplication won't be as conspicuous in the future. > As of this commit nothing uses the new share-exclusive lock mode. It will= be >documented and used in a future commit. It seemed too complicated to intro= duce > the lock-level in a separate commit. I would have liked this mentioned earlier in the commit message. Also, I don't know how I feel about it being "documented" in a future commit...perhaps just don't say that. diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h index 28519ad2813..0a145d95024 100644 --- a/src/include/storage/buf_internals.h +++ b/src/include/storage/buf_internals.h @@ -32,22 +33,29 @@ /* * Buffer state is a single 64-bit variable where following data is combin= ed. * + * State of the buffer itself: * - 18 bits refcount * - 4 bits usage count * - 10 bits of flags * + * State of the content lock: + * - 1 bit has_waiter + * - 1 bit release_ok + * - 1 bit lock state locked Somewhere you should clearly explain the scenarios in which you still need to take the buffer header lock with LockBufHdr() vs when you can just use atomic operations/CAS loops. Separately, you should clearly explain what BM_LOCKED (I presume what "lock state locked" refers to) protects. +/* + * Definitions related to buffer content locks + */ +#define BM_LOCK_HAS_WAITERS (UINT64CONST(1) << 63) +#define BM_LOCK_RELEASE_OK (UINT64CONST(1) << 62) + +#define BM_LOCK_VAL_SHARED (UINT64CONST(1) << 32) +#define BM_LOCK_VAL_SHARE_EXCLUSIVE (UINT64CONST(1) << (32 + MAX_BACKENDS_BITS)) +#define BM_LOCK_VAL_EXCLUSIVE (UINT64CONST(1) << (32 + 1 + MAX_BACKENDS_BITS)) + +#define BM_LOCK_MASK (((uint64)MAX_BACKENDS << 32) | BM_LOCK_VAL_SHARE_EXCLUSIVE | BM_LOCK_VAL_EXCLUSIVE) Not the fault of this patch, but I always found it confusing that the BM flags were for buffer manager. I always think BM stands for buffer mapping. On a more actionable note: you probably want to update the comment in procnumber.h in your earlier patch to take out the part about how the limitation could be lifted using a 64 bit state -- since you made the state 64 bit and didn't lift the limitation (see below for the specific comment). /* * Note: MAX_BACKENDS_BITS is 18 as that is the space available for buffer * refcounts in buf_internals.h. This limitation could be lifted by using = a * 64bit state; but it's unlikely to be worthwhile as 2^18-1 backends excee= d * currently realistic configurations. Even if that limitation were removed= , * we still could not a) exceed 2^23-1 because inval.c stores the ProcNumbe= r * as a 3-byte signed integer, b) INT_MAX/4 because some places compute * 4*MaxBackends without any overflow check. We check that the configured * number of backends does not exceed MAX_BACKENDS in InitializeMaxBackends= (). */ @@ -268,6 +287,15 @@ BufMappingPartitionLockByIndex(uint32 index) * wait_backend_pgprocno and setting flag bit BM_PIN_COUNT_WAITER. At pre= sent, * there can be only one such waiter per buffer. * + * The content of buffers is protected via the buffer content lock, + * implemented as part buffer state. Note that the buffer header lock is *= not* + * used to control access to the data in the buffer! We used to use an LWL= ock + * to implement the content lock, but having a dedicated implementation of + * content locks allows to implement some otherwise hard things (e.g. + * race-freely checking if AIO is in progress before locking a buffer + * exclusively) and makes otherwise impossible optimizations possible + * (e.g. unlocking and unpinning a buffer in one atomic operation). + * I don't really like that this includes a description of how it used to work. It makes the description more confusing when trying to understand the current state. And comments like that make most sense when the current state is confusing because of a long annoying history. @@ -302,7 +303,24 @@ extern void BufferGetTag(Buffer buffer, RelFileLocator *rlocator, extern void MarkBufferDirtyHint(Buffer buffer, bool buffer_std); extern void UnlockBuffers(void); -extern void LockBuffer(Buffer buffer, BufferLockMode mode); +extern void UnlockBuffer(Buffer buffer); +extern void LockBufferInternal(Buffer buffer, BufferLockMode mode); + +/* + * Handling BUFFER_LOCK_UNLOCK in bufmgr.c leads to sufficiently worse bra= nch + * prediction to impact performance. Therefore handle that switch here, we= re + * most of the time `mode` will be a constant and thus can be optimized ou= t by + * the compiler. Typo: were -> where + */ +static inline void +LockBuffer(Buffer buffer, BufferLockMode mode) +{ + if (mode =3D=3D BUFFER_LOCK_UNLOCK) + UnlockBuffer(buffer); + else + LockBufferInternal(buffer, mode); +} + Is there a reason to stick with the LockBuffer(buf, BUFFER_LOCK_UNLOCK) interface here? It seems like a cgood time to start using UnlockBuffer() which I thought most people find more intuitive. /* - * Acquire or release the content_lock for the buffer. + * Acquire the buffer content lock in the specified mode + * + * If the lock is not available, sleep until it is. + * + * Side effect: cancel/die interrupts are held off until lock release. + * + * This uses almost the same locking approach as lwlock.c's + * LWLockAcquire(). See documentation atop of lwlock.c for a more detailed + * discussion. + * + * The reason that this, and most of the other BufferLock* functions, get = both + * the Buffer and BufferDesc* as parameters, is that looking up one from t= he + * other repeatedly shows up noticeably in profiles. + */ +static inline void +BufferLockAcquire(Buffer buffer, BufferDesc *buf_hdr, BufferLockMode mode) Either here or over the lock mode enum, I'd spend a few sentences describing how the buffer lock modes interact -- what conflicts with what. You spend more time saying how this is like LWLock than explaining what it actually is. +/* + * Remove ourselves from the waitlist. + * + * This is used if we queued ourselves because we thought we needed to sle= ep + * but, after further checking, we discovered that we don't actually need = to + * do so. + */ +static void +BufferLockDequeueSelf(BufferDesc *buf_hdr) +{ + bool on_waitlist; + + LockBufHdr(buf_hdr); + ... + /* XXX: combine with fetch_and above? */ + UnlockBufHdr(buf_hdr); I know you just ported this comment from the LWLock implementation but I can't see how it could be worth the effort. It won't be in the hottest path as you must satisfy a few conditions to get here. + * Wakeup all the lockers that currently have a chance to acquire the lock= . + */ +static void +BufferLockWakeup(BufferDesc *buf_hdr, bool unlocked) This should have a more explanatory comment. You should especially document the unlocked parameter. I would expect something somewhere in or above this function that basically says (maybe less verbosely) - Before waking anything: allow all - After waking SHARE: wake SHARE only - After waking SHARE_EXCLUSIVE: wake SHARE only (no more SHARE_EXCLUSIVE) - After waking EXCLUSIVE: wake none +{ + bool new_release_ok; + bool wake_exclusive =3D unlocked; + bool wake_share_exclusive =3D true; + proclist_head wakeup; + proclist_mutable_iter iter; + + proclist_init(&wakeup); + + new_release_ok =3D true; + + /* lock wait list while collecting backends to wake up */ + LockBufHdr(buf_hdr); + + proclist_foreach_modify(iter, &buf_hdr->lock_waiters, lwWaitLink) + { + PGPROC *waiter =3D GetPGProcByNumber(iter.cur); + + if (!wake_exclusive && waiter->lwWaitMode =3D=3D BUFFER_LOCK_EXCLU= SIVE) + continue; + + if (!wake_share_exclusive && waiter->lwWaitMode =3D=3D BUFFER_LOCK_SHARE_EXCLUSIVE) + continue; + It seems there is a difference between LWLockWakeup() and BufferLockWakeup() if the queue contains a share lock waiter followed by an exclusive lock waiter. In LWLockWakeup(), it will wake up the share waiter, set wokeup_somebody to true, and then not wake up the exclusive lock waiter because wokeup_somebody is true. In BufferLockWakeup() when unlocked is true, it will wake up the share lock waiter and then wake up the exclusive lock waiter because wake_exclusive and wake_share_exclusive are both still true. This might be the intended behavior, but it is a difference from LWLockWakeup(), so I think it is worth documenting why it is okay. + * Signal that the process isn't on the wait list anymore. This al= lows + * BufferLockDequeueSelf() to remove itself of the waitlist with a + * proclist_delete(), rather than having to check if it has been I know this comment was ported over, but the "remove itself of the waitlist" -- the "of" is confusing in the original comment and it is confusing here. + * Compute subtraction from buffer state for a release of a held lock in + * `mode`. + * + * This is separated from BufferLockUnlock() as we want to combine the loc= k + * release with other atomic operations when possible, leading to the lock + * release being done in multiple places. + */ +static inline uint64 +BufferLockReleaseSub(BufferLockMode mode) I don't understand why this is a separate function even with your comment. + * BufferLockHeldByMe - test whether my process holds the content lock in = any + * mode + * + * This is meant as debug support only. + */ +static bool +BufferLockHeldByMe(BufferDesc *buf_hdr) +{ + PrivateRefCountEntry *entry =3D + GetPrivateRefCountEntry(BufferDescriptorGetBuffer(buf_hdr), false)= ; + + if (!entry) + return false; + else + return entry->data.lockmode !=3D BUFFER_LOCK_UNLOCK; +} Previously, if I called LockBuffer(buf, BUFFER_LOCK_SHARE) again after calling it once, say in RelationCopyStorageUsingBuffer(), I would trip an assert when unpinning the buffer about how I needed to have released the lock. Now, though, it doesn't trip the assert because we don't track multiple locks. When I call LockBuffer(UNLOCK), it sets lockmode to BUFFER_LOCK_UNLOCK and BufferLockHeldByMe() only checks the private refcount entry. Whereas LWLockHeldByMe() checked held_lwlocks which had multiple entries and would report that I still held a lock. Now, if I call LockBuffer(SHARE) twice, I'll only find out later when I have a hang because someone is trying to get an exclusive lock and the actual BufferDesc->state still has a lock set. Maybe you should add an assert to the lock acquisition path that the prevate ref count entry mode is UNLOCK? void -LockBuffer(Buffer buffer, BufferLockMode mode) +LockBufferInternal(Buffer buffer, BufferLockMode mode) { - buf =3D GetBufferDescriptor(buffer - 1); + buf_hdr =3D GetBufferDescriptor(buffer - 1); - if (mode =3D=3D BUFFER_LOCK_UNLOCK) - LWLockRelease(BufferDescriptorGetContentLock(buf)); - else if (mode =3D=3D BUFFER_LOCK_SHARE) - LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_SHARED); + if (mode =3D=3D BUFFER_LOCK_SHARE) + BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE); + else if (mode =3D=3D BUFFER_LOCK_SHARE_EXCLUSIVE) + BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_SHARE_EXCLUSIVE); else if (mode =3D=3D BUFFER_LOCK_EXCLUSIVE) - LWLockAcquire(BufferDescriptorGetContentLock(buf), LW_EXCLUSIVE); + BufferLockAcquire(buffer, buf_hdr, BUFFER_LOCK_EXCLUSIVE); else elog(ERROR, "unrecognized buffer lock mode: %d", mode); I presume you've stuck with this if statement structure because that is what LockBuffer() used. Even though now the BufferLockMode passed through to BufferLockAcquire is the exact thing you are testing. It caught my eye and I had to check multiple times if the mode being passed in is the same. Basically I found it a bit distracting. @@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res) if (BufferIsLocal(buffer)) UnpinLocalBufferNoOwner(buffer); else + { + PrivateRefCountEntry *ref; + + ref =3D GetPrivateRefCountEntry(buffer, false); + + /* + * If the buffer was locked at the time of the resowner release, + * release the lock now. This should only happen after errors. + */ + if (ref->data.lockmode !=3D BUFFER_LOCK_UNLOCK) + { + BufferDesc *buf =3D GetBufferDescriptor(buffer - 1); + + HOLD_INTERRUPTS(); /* match the upcoming RESUME_INTERRUPTS = */ + BufferLockUnlock(buffer, buf); + } + UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1)); + } } Bit confusing that ResOwnerReleaseBufferBin() now releases locks as well. 0009 -- I didn't look closely at 0009 0010: -------- > [PATCH v6 10/14] heapam: Use exclusive lock on old page in CLUSTER > To be able to guarantee that we can set the hint bit, acquire an exclusiv= e > lock on the old buffer. We need the hint bits to be set as otherwise > reform_and_rewrite_tuple() -> rewrite_heap_tuple() -> heap_freeze_tuple()= will > get confused. So, this is an active bug? And what exactly do you mean heap_freeze_tuple() gets confused? I thought somewhere in there we would check the clog. 0011: -------- > [PATCH v6 11/14] heapam: Add batch mode mvcc check and use it in page mod= e > 2) We would like to stop setting hint bits while pages are being written > out. The necessary locking becomes visible for page mode scans if done fo= r > every tuple. With batching the overhead can be amortized to only happen > once per page. I don't understand the above point. What does this patch have to do with not setting hint bits while pages are being written out (that happens in the next patch)? And I presume you mean don't set hint bits on a buffer that is being flushed by someone else -- but it sounds like you mean not to set hint bits as part of flushing a buffer. diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/hea= pam.c index 4b0c49f4bb0..ddabd1a3ec3 100644 --- a/src/backend/access/heap/heapam.c +++ b/src/backend/access/heap/heapam.c @@ -504,42 +504,93 @@ page_collect_tuples(HeapScanDesc scan, Snapshot snaps= hot, BlockNumber block, int lines, bool all_visible, bool check_serializable) { + Oid relid =3D RelationGetRelid(scan->rs_base.rs_rd); +#ifdef BATCHMVCC_FEWER_ARGS + BatchMVCCState batchmvcc; + HeapTupleData *tuples =3D batchmvcc.tuples; + bool *visible =3D batchmvcc.visible; +#else + HeapTupleData tuples[MaxHeapTuplesPerPage]; + bool visible[MaxHeapTuplesPerPage]; +#endif int ntup =3D 0; It's pretty confusing when visible is an output vs input parameter and who fills it in when. (i.e. it's filled in in page_collect_tuples() if check_serializable and all_visible are true, otherwise it's filled in in HeapTupleSatisifiesMVCCBatch()) Personally, I think I'd almost prefer an all-visible and not-all-visible version of page_collect_tuples() (or helper functions containing the loop) that separate this. (I haven't tried it though) BATCHMVCC_FEWER_ARGS definitely doesn't make it any easier to read -- which I assume you are removing. + /* + * If the page is not all-visible or we need to check serializabil= ity, + * maintain enough state to be able to refind the tuple efficientl= y, + * without again needing to extract it from the page. + */ + if (!all_visible || check_serializable) + { "enough state" is pretty vague here. Maybe mention what it wouldn't be valid to do with the tuples array? /* * If the page is all visible, these fields won'otherwise wont be * populated in loop below. */ Some spelling issues with "won'" and "wont" + * visibility is set in batchmvcc->visible[]. In addition, ->vistuples_den= se + * is set to contain the offsets of visible tuples. + * + * Returns the number of visible tuples. + */ +int +HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer, + int ntups, I don't really get why this (the patch in general) would be substantially faster. You still call HeapTupleSatisfiesMVCC() for each tuple in a loop. The difference is that you've got pointers into the array of tuples instead of doing PageGetItem(), then calling HeapTupleSatisfiesMVCC() for each tuple. 0012: ------- > [PATCH v6 12/14] Require share-exclusive lock to set hint bits > To address these issue, this commit changes the rules so that modificatio= ns to > pages are not allowed anymore while holding a share lock. Instead the new In the commit message, you make it sound like you only change the lock level for setting the hint bits. But that wouldn't solve any problems if FlushBuffer() could still happen with a share lock. I would try to make it clear that you change the lock level both for setting the hint bits and flushing the buffer. @@ -77,6 +73,16 @@ gistkillitems(IndexScanDesc scan) */ for (i =3D 0; i < so->numKilled; i++) { + if (!killedsomething) + { + /* + * Use hint bit infrastructure to be allowed to modify the pag= e + * without holding an exclusive lock. + */ + if (!BufferBeginSetHintBits(buffer)) + goto unlock; + } + I don't understand why this is in the loop. Clearly you want to call BufferBeginSetHintBits() once, but why would you do it in the loop? In the comment, I might also note that the lock level will be upgraded as needed or something since we only have a share lock, it is confusing at first - * SetHintBits() + * To be allowed to set hint bits, SetHintBits() needs to call + * BufferBeginSetHintBits(). However, that's not free, and some callsites = call + * SetHintBits() on many tuples in a row. For those it makes sense to amor= tize + * the cost of BufferBeginSetHintBits(). Additionally it's desirable to de= fer + * the cost of BufferBeginSetHintBits() until a hint bit needs to actually= be + * set. This enum serves as the necessary state space passed to + * SetHintbitsExt(). + */ +typedef enum SetHintBitsState +{ + /* not yet checked if hint bits may be set */ + SHB_INITIAL, + /* failed to get permission to set hint bits, don't check again */ + SHB_DISABLED, + /* allowed to set hint bits */ + SHB_ENABLED, +} SetHintBitsState; I dislike the SHB prefix. Perhaps something involving the word hint? And should the enum name itself (SetHintBitsState) include the word "batch"? I know that would make it long. At least the comment should explain that these are needed when batch setting hint bits. +SetHintBitsExt(HeapTupleHeader tuple, Buffer buffer, + uint16 infomask, TransactionId xid, SetHintBitsState *state= ) { if (TransactionIdIsValid(xid)) { - /* NB: xid must be known committed here! */ - XLogRecPtr commitLSN =3D TransactionIdGetCommitLSN(xid); + if (BufferIsPermanent(buffer)) + { I really wish there was a way to better pull apart the batch and non-batch cases in a way that could allow the below block to be in a helper do_set_hint() (or whatever) which SetHintBitsExt() and SetHintBits() called. And then you inlined BufferSetHintBits16(). I presume you didn't do this because HeapTupleSatisifiesMVCC() for SNAPSHOT_MVCC calls the non-batch version (and, of course, HeapTupleSatisifiesVisibility() is the much more common case). if (TransactionIdIsValid(xid)) { if (BufferIsPermanent(buffer)) { /* NB: xid must be known committed here! */ XLogRecPtr commitLSN =3D TransactionIdGetCommitLSN(xid); if (XLogNeedsFlush(commitLSN) && BufferGetLSNAtomic(buffer) < commitLSN) { /* not flushed and no LSN interlock, so don't set hint */ return; false; } } } Separately, I was thinking, should we assert here about having the right lock type? * It is only safe to set a transaction-committed hint bit if we know the * transaction's commit record is guaranteed to be flushed to disk before t= he * buffer, or if the table is temporary or unlogged and will be obliterated= by * a crash anyway. We cannot change the LSN of the page here, because we m= ay * hold only a share lock on the buffer, so we can only use the LSN to * interlock this if the buffer's LSN already is newer than the commit LSN; * otherwise we have to just refrain from setting the hint bit until some * future re-examination of the tuple. * Should this say "we may hold only a share exclusive lock on the buffer". Also what is "this" in "only use the LSN to interlock this"? @@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, Buffer buffer, + if (state =3D=3D SHB_ENABLED) + BufferFinishSetHintBits(buffer, true, true); + return nvis; } I wondered if it would be more natural for BufferBeginSetHintBits() and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED instead of having callers do it. But, I guess you don't do this because of gist and hash indexes using this for doing their own modifications. It seems like it also would help make it clear that BufferBegin and BufferFinish are for batch mode. I can't help but feel a bit of awkwardness in the whole API between this and the way you've supported non-batch mode in SetHintBitsExt(). But it's easy for me to criticize without providing concrete ideas of how to reorganize it. +static inline bool +SharedBufferBeginSetHintBits(Buffer buffer, BufferDesc *buf_hdr, uint64 *locksta> +{ + uint64 old_state; + PrivateRefCountEntry *ref; + BufferLockMode mode; These functions probably ought to have comments. And I wonder if you should say anything (or even assert anything) about how IO better not be in progress (though it is enforced by the locks). + /* + * Can't upgrade if somebody else holds the lock in exlusive or + * share-exclusive mode. + */ typo: exlusive -> exclusive -4. It is considered OK to update tuple commit status bits (ie, OR the -values HEAP_XMIN_COMMITTED, HEAP_XMIN_INVALID, HEAP_XMAX_COMMITTED, or +4. Non-critical information on a page ("hint bits") may be modified while +holding only a share-exclusive lock and pin on the page. To do so in cases +where only a share lock is already held, use BufferBeginSetHintBits() & +BufferFinishSetHintBits() (if multiple hint bits are to be set) or +BufferSetHintBits16() (if a single hit bit is set). typo: single hit -> single hint Also, you should probably say that you use BufferBeginSetHintBits() to actually upgrade the lock. I also don't see anywhere in the README where flushing a buffer is mentioned -- and what lock level is needed for that in different situations. It kind of feels like it is worth mentioning. - if ((pg_atomic_read_u64(&bufHdr->state) & (BM_DIRTY | BM_JUST_DIRTIED))= !=3D - (BM_DIRTY | BM_JUST_DIRTIED)) + if (unlikely((lockstate & (BM_DIRTY | BM_JUST_DIRTIED)) !=3D + (BM_DIRTY | BM_JUST_DIRTIED))) I don't quite understand why you do the atomic read in the call to MarkSharedBufferDirtyHint() form MarkBufferDirtyHint() instead of at the top of MarkSharedBufferDirtyHint() -- and then you wouldn't have to pass the lockstate as a parameter, right? --- a/src/backend/storage/freespace/freespace.c +++ b/src/backend/storage/freespace/freespace.c @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, max_avail =3D fsm_get_max_avail(page); /* - * Reset the next slot pointer. This encourages the use of low-numbered - * pages, increasing the chances that a later vacuum can truncate the - * relation. We don't bother with a lock here, nor with marking the pa= ge - * dirty if it wasn't already, since this is just a hint. + * Try to reset the next slot pointer. This encourages the use of + * low-numbered pages, increasing the chances that a later vacuum can + * truncate the relation. We don't bother with a lock here, nor with + * marking the page dirty if it wasn't already, since this is just a hi= nt. + * + * To be allowed to update the page without an exclusive lock, we have = to + * use the hint bit infrastructure. */ What the heck? This didn't even take a share lock before... diff --git a/src/backend/storage/freespace/fsmpage.c b/src/backend/storage/freesp> index 66a5c80b5a6..a59696b6484 100644 --- a/src/backend/storage/freespace/fsmpage.c +++ b/src/backend/storage/freespace/fsmpage.c @@ -298,9 +298,18 @@ restart: * lock and get a garbled next pointer every now and then, than take th= e * concurrency hit of an exclusive lock. * + * Without an exclusive lock, we need to use the hint bit infrastructur= e + * to be allowed to modify the page. + * Is the sentence above this still correct? /* * Update the next-target pointer. Note that we do this even if we're only * holding a shared lock, on the grounds that it's better to use a shared * lock and get a garbled next pointer every now and then, than take the * concurrency hit of an exclusive lock. We appear to avoid the garbling now? In general on 0012, I didn't spend much time checking if you caught all the places where we mention our hint bit hackery (only taking the share lock). But those can always be caught later as we inevitably encounter them. - Melanie