Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vNzw9-005A6j-35 for pgsql-hackers@arkaria.postgresql.org; Tue, 25 Nov 2025 20:46:58 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vNzw8-00BVQj-1k for pgsql-hackers@arkaria.postgresql.org; Tue, 25 Nov 2025 20:46:56 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vNzw7-00BVN6-2T for pgsql-hackers@lists.postgresql.org; Tue, 25 Nov 2025 20:46:56 +0000 Received: from fout-a5-smtp.messagingengine.com ([103.168.172.148]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vNzw4-001UEF-2I for pgsql-hackers@postgresql.org; Tue, 25 Nov 2025 20:46:55 +0000 Received: from phl-compute-04.internal (phl-compute-04.internal [10.202.2.44]) by mailfout.phl.internal (Postfix) with ESMTP id C83C4EC03D6; Tue, 25 Nov 2025 15:46:50 -0500 (EST) Received: from phl-mailfrontend-02 ([10.202.2.163]) by phl-compute-04.internal (MEProxy); Tue, 25 Nov 2025 15:46:50 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-transfer-encoding:content-type:content-type:date :date:from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to:to; s=fm3; t=1764103610; x=1764190010; bh=QhHL446vxFA6EkogV5Xzd4Fhy+kVCyovMu0nFWn6Lzo=; b= JNSRTrZ91KIhmLdP8GoUENvsc0CL4GJ2ZcqliAhYZznGe4xSmWuI1qev+kw4XyJh Nix8qseYk+ulhecO8Xpr9ZSjf4OmNHL9tl9I8YKP3b/gqpI0f6XD0IRI/U9zp/Pb U7T7YnTGkY+jVzQbNQku2zvHF9VlUImdas7mzHWFnbCggxF9eJEm3duDAzbiTWmj blO1uDGzd5pznUeI5wsouBA3va2mTFtZuHnndVj4pbqJR+WK4Xhi9Vb8jU/PFynA lDzksmGFv8emhox0ti7p359icsAnkI7KNlXZeWANf/mSSsspnMTORdx7+3pT0cHZ aGQbaGLE78THcrFjPscWMg== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-transfer-encoding :content-type:content-type:date:date:feedback-id:feedback-id :from:from:in-reply-to:in-reply-to:message-id:mime-version :references:reply-to:subject:subject:to:to:x-me-proxy :x-me-sender:x-me-sender:x-sasl-enc; s=fm3; t=1764103610; x= 1764190010; bh=QhHL446vxFA6EkogV5Xzd4Fhy+kVCyovMu0nFWn6Lzo=; b=C HAPHBfC7lYU3DQl8BgOms/l0P8PIQkVL06ZEOztDzFIwaOLRiwJeyrfxeVInLJz+ isMYsJWuaZH/YSAxPQ7SYow/wqZ9MkXJE65xuuleyhligncBatub1dk0J5LrSdRN g9F6fxSHfavCmlWeXy+gJrwXzwrPik5Shj58VDzrjutE8iBjD50B0nKwSWUR/bjQ bdAqnXXKsyS6MPt7pc1nQ+/Q5MEsu1ejO6R0oKlVyg05lMN3twXa27f2DIXcTyXh N/wLd/fegWhYSJhQmT/sHhFoLQqql2sMlvvREY/KSmA7kT8ViHyWnqF2ij0+bSuH gXMZU8wfbaNfD4Qd4+DBA== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeeffedrtdeggddvgedvgeeiucetufdoteggodetrf dotffvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceu rghilhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujf gurhepfffhvfevuffkfhggtggugfgjsehtkefstddttdejnecuhfhrohhmpeetnhgurhgv shcuhfhrvghunhguuceorghnughrvghssegrnhgrrhgriigvlhdruggvqeenucggtffrrg htthgvrhhnpedtleelvdfgjedvffeiueekfeeuleffhfegfffhgfffkeevueehieehhfei gffhvdenucevlhhushhtvghrufhiiigvpedtnecurfgrrhgrmhepmhgrihhlfhhrohhmpe grnhgurhgvshesrghnrghrrgiivghlrdguvgdpnhgspghrtghpthhtohepkedpmhhouggv pehsmhhtphhouhhtpdhrtghpthhtohepsghovghkvgifuhhrmhdophhoshhtghhrvghsse hgmhgrihhlrdgtohhmpdhrtghpthhtohepmhgvlhgrnhhivghplhgrghgvmhgrnhesghhm rghilhdrtghomhdprhgtphhtthhopehmihgthhgrvghlrdhprghquhhivghrsehgmhgrih hlrdgtohhmpdhrtghpthhtoheprhhosggvrhhtmhhhrggrshesghhmrghilhdrtghomhdp rhgtphhtthhopehthhhomhgrshdrmhhunhhrohesghhmrghilhdrtghomhdprhgtphhtth hopehhlhhinhhnrghkrgesihhkihdrfhhipdhrtghpthhtohepnhhorghhsehlvggruggs ohgrthdrtghomhdprhgtphhtthhopehpghhsqhhlqdhhrggtkhgvrhhssehpohhsthhgrh gvshhqlhdrohhrgh X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 25 Nov 2025 15:46:50 -0500 (EST) Date: Tue, 25 Nov 2025 15:46:49 -0500 From: Andres Freund To: Melanie Plageman Cc: Matthias van de Meent , pgsql-hackers@postgresql.org, Thomas Munro , Heikki Linnakangas , Noah Misch , Robert Haas , Michael Paquier Subject: Re: Buffer locking is special (hints, checksums, AIO writes) Message-ID: References: <6kmid26do57ykqfpvq6iieniy4djsymhrypkjccazq5g4bbe6a@2y6owwv7qpex> <3je3ahgf7rrmmurxo6hnlhg5d3ffwfrtjwjxd6jm5srlv5iebp@vxqk5qtgmowr> <3w7v3w6a57jnssokap4k7thoekig72flnyhd4wp3yftzdd7lm7@f6lpcfen6hr7> <6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar> <3nce7i72ayzkunai6mkz24ckbxk74jodz4ua2chcdrwppxlxcd@w6x5kfkjrkru> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2025-11-25 15:02:02 -0500, Melanie Plageman wrote: > On Tue, Nov 25, 2025 at 11:54 AM Andres Freund wrote: > > > > As of this commit nothing uses the new share-exclusive lock mode. It will be > > > > documented and used in a future commit. It seemed too complicated to introduce > > > > the lock-level in a separate commit. > > > > > > I would have liked this mentioned earlier in the commit message. > > > > Hm, ok. I could split share-exclusive out again, but it was somewhat painful, > > because it doesn't just lead to adding code, but to changing code. > > I don't think you need to introduce share-exclusive in a separate > commit. I just mean it would be good to mention in the commit message > before you get into so many other details that this commit doesn't use > the share-exclusive level yet. > > > > Also, I don't know how I feel about it being "documented" in a future > > > commit...perhaps just don't say that. > > > > Hm, why? > > I don't really see you documenting share-exclusive lock semantics in a > later commit. Documentation for what the lock level is should be in > the same commit that introduces it -- which I think you mostly have > done. I feel like it is weird to say you will document that lock level > later when 1) you don't really do that and 2) this commit mostly > (besides some requests I had for elaboration) already does that. The problem I faced was that the explanation for the lock level depends on later changes - how do you document why share-exclusive is useful without explaining the hint bit thing, which isn't yet implemented as of that commit. > > > diff --git a/src/include/storage/buf_internals.h > > > b/src/include/storage/buf_internals.h > > > index 28519ad2813..0a145d95024 100644 > > > --- a/src/include/storage/buf_internals.h > > > +++ b/src/include/storage/buf_internals.h > > > @@ -32,22 +33,29 @@ > > > /* > > > * Buffer state is a single 64-bit variable where following data is combined. > > > * > > > + * State of the buffer itself: > > > * - 18 bits refcount > > > * - 4 bits usage count > > > * - 10 bits of flags > > > * > > > + * State of the content lock: > > > + * - 1 bit has_waiter > > > + * - 1 bit release_ok > > > + * - 1 bit lock state locked > > > > > > Somewhere you should clearly explain the scenarios in which you still > > > need to take the buffer header lock with LockBufHdr() vs when you can > > > just use atomic operations/CAS loops. > > > > There is an explanation of that, or at least my attempt at it ;). See the > > dcoumentation for BufferDesc. > > Hmm. I suppose. I was imagining something above the state member since > you have added to it, but okay. > > Separately, I'll admit I don't quite understand when I have to use > LockBufHdr() and when I should use a CAS loop to update the > bufferdesc->state. It's definitely subtle :(. Not sure what to do about that, other than to work on eventually just making everything doable with just a CAS. But that's a fair bit of future work away. > > > @@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res) > > > if (BufferIsLocal(buffer)) > > > UnpinLocalBufferNoOwner(buffer); > > > else > > > + { > > > + PrivateRefCountEntry *ref; > > > + > > > + ref = GetPrivateRefCountEntry(buffer, false); > > > + > > > + /* > > > + * If the buffer was locked at the time of the resowner release, > > > + * release the lock now. This should only happen after errors. > > > + */ > > > + if (ref->data.lockmode != BUFFER_LOCK_UNLOCK) > > > + { > > > + BufferDesc *buf = GetBufferDescriptor(buffer - 1); > > > + > > > + HOLD_INTERRUPTS(); /* match the upcoming RESUME_INTERRUPTS */ > > > + BufferLockUnlock(buffer, buf); > > > + } > > > + > > > UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1)); > > > + } > > > } > > > > > > Bit confusing that ResOwnerReleaseBufferBin() now releases locks as well. > > > > Do you have a better suggestion? I'll add a comment that makes that explicit, > > but other than that I don't have a great idea. Renaming the whole buffer pin > > mechanism seems pretty noisy. > > ResOwnerReleaseBuffer()? > > What do you mean renaming the whole buffer pin mechanism? All the *PrivateRefCount* stuff arguably aught to be renamed... > > > I presume you didn't do this because HeapTupleSatisifiesMVCC() for > > > SNAPSHOT_MVCC calls the non-batch version (and, of course, > > > HeapTupleSatisifiesVisibility() is the much more common case). > > > > > > if (TransactionIdIsValid(xid)) > > > { > > > if (BufferIsPermanent(buffer)) > > > { > > > /* NB: xid must be known committed here! */ > > > XLogRecPtr commitLSN = TransactionIdGetCommitLSN(xid); > > > > > > if (XLogNeedsFlush(commitLSN) && > > > BufferGetLSNAtomic(buffer) < commitLSN) > > > { > > > /* not flushed and no LSN interlock, so don't > > > set hint */ > > > return; false; > > > } > > > } > > > } > > > > > > Separately, I was thinking, should we assert here about having the > > > right lock type? > > > > Not sure I get what assert of what locktype where? > > In SetHintBitsExt() that we have share-exclusive or above. We *don't* necessarily hold that though, it'll just be acquired by BufferBeginSetHintBits(). Or do you mean in the SHB_ENABLED case? > Or are there still callers with only a share lock? Almost all of them, I think? We can't just call this with an unconditional share-exclusive lock, since that'd destroy concurrency. Instead we just try to upgrade to share-exclusive to set the hint bit. If we can't get share-exclusive, we don't need to block, we just haven't set the hint bit. > > > @@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, > > > Buffer buffer, > > > + if (state == SHB_ENABLED) > > > + BufferFinishSetHintBits(buffer, true, true); > > > + > > > return nvis; > > > } > > > > > > I wondered if it would be more natural for BufferBeginSetHintBits() > > > and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED > > > instead of having callers do it. But, I guess you don't do this > > > because of gist and hash indexes using this for doing their own > > > modifications. > > > > I thought about it. But yea, the different callers seemed to make that not > > really useful. It'd also mean that we'd do an external function call for every > > tuple, which I think would be prohibitively expensive. > > I'm confused why this would mean more external function calls. The visibility logic is in heapam_visibility.c, the locking for the buffer in bufmgr.c? If the knowledge about SetHintBitsState lives in BufferBeginSetHintBits(), we need to call it to do that check. > > > --- a/src/backend/storage/freespace/freespace.c > > > +++ b/src/backend/storage/freespace/freespace.c > > > @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, > > > max_avail = fsm_get_max_avail(page); > > > /* > > > - * Reset the next slot pointer. This encourages the use of low-numbered > > > - * pages, increasing the chances that a later vacuum can truncate the > > > - * relation. We don't bother with a lock here, nor with marking the page > > > - * dirty if it wasn't already, since this is just a hint. > > > + * Try to reset the next slot pointer. This encourages the use of > > > + * low-numbered pages, increasing the chances that a later vacuum can > > > + * truncate the relation. We don't bother with a lock here, nor with > > > + * marking the page dirty if it wasn't already, since this is just a hint. > > > + * > > > + * To be allowed to update the page without an exclusive lock, we have to > > > + * use the hint bit infrastructure. > > > */ > > > > > > What the heck? This didn't even take a share lock before... > > > > It is insane. I do not understand how anybody thought this was ok. I think I > > probably should split this out into a separate commit, since it's so insane. > > Aren't you going to backport having it take a lock? Then it can be > separate (e.g. have it take a share lock in the "fix" commit and then > this commit bumps it to share-exclusive). I wasn't thinking we should backport that. It's been this way for ages, without having caused known issues.... > > > diff --git a/src/backend/storage/freespace/fsmpage.c > > > b/src/backend/storage/freesp> > > > /* > > > * Update the next-target pointer. Note that we do this even if we're only > > > * holding a shared lock, on the grounds that it's better to use a shared > > > * lock and get a garbled next pointer every now and then, than take the > > > * concurrency hit of an exclusive lock. > > > > > > We appear to avoid the garbling now? > > > > I don't think so. Two backends concurrently can do fsm_search_avail() and > > one backend might set a hint to a page that is already used up by the other > > one. At least I think so? > > Maybe I don't know what it meant by garbled, but I thought it was > talking about two backends each trying to set fp_next_slot. If they > now have to have a share-exclusive lock and they can't both have a > share-exclusive lock at the same time, then it seems like that > wouldn't be a problem. It sounds like you may be talking about a > backend taking up the freespace of a page that is referred to by the > fp_next_slot? Yes, a version of the latter. The value that fp_next_slot will be set to can be outdated by the time we actually set it, unless we do all of fsm_search_avail() under some form of exclusive lock - clearly not something desirable. Greetings, Andres Freund