Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vNzF0-004W1y-23 for pgsql-hackers@arkaria.postgresql.org; Tue, 25 Nov 2025 20:02:22 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vNzEx-00BDfq-2c for pgsql-hackers@arkaria.postgresql.org; Tue, 25 Nov 2025 20:02:20 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vNzEx-00BDfg-0n for pgsql-hackers@lists.postgresql.org; Tue, 25 Nov 2025 20:02:19 +0000 Received: from mail-ed1-x534.google.com ([2a00:1450:4864:20::534]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vNzEu-001Rt6-2B for pgsql-hackers@postgresql.org; Tue, 25 Nov 2025 20:02:18 +0000 Received: by mail-ed1-x534.google.com with SMTP id 4fb4d7f45d1cf-6418b55f86dso10046487a12.1 for ; Tue, 25 Nov 2025 12:02:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764100934; x=1764705734; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=8qqEPNRG7l5ihIa9qidi5Yuhzi2O+JfSDBpAT+w6c/o=; b=I4EnRMoxZwveSuCxT8ASzefLxuiuBwLgXu5+kWnuw8b2Rzswwhcfi20vUecQnapNj/ Vd3xPCWFaia8VU/3rIDTAU4EZK0fyoGGulwmS05kPXa6RNHMRRIAbOuBhEteYHjCbhVA sYUOWk+e0BXiR+VkUJqmcFtf/FAHMJQ4XNj+bwkwBozC7oVCBczZLA3MeRPSOvX0z/DS cRntdxEvTOBb7aBeGiCmTjPUw3T4cMjBiOdkCy0GIHsxcm2wSxrtIUzXrHSWJFze5RyQ DIpUyW1cZQpwIHpt8VYXVUU2ZMS+51yFd0ftKU0FIuQv9cbRKBegekpbPJg8yNTtZSLL mKxw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764100934; x=1764705734; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=8qqEPNRG7l5ihIa9qidi5Yuhzi2O+JfSDBpAT+w6c/o=; b=mWbIppX2R23zLCWYcAAfyXbduRkWngIW2n2YIp4JsJtrEzyF9j3AqiOKtDKco4sNPf dqPYCcjSz6sWGWPpAB4EKKlcCui/P97HbbO+L+duPUeVUCdDc71Ejdrot9SvR1BTgzFs DLlNpE9Ko1gA0lI51l4UOGRzvIiKAa2BYStVHZsa7E+B77Mv2qSMV2jUKjYlgz6lMo8Q o7c6TbcaPYvW/5NQKv7riM93DjBPBNy880Xq/Kv4M+M9BZgI1ByMJBSB3eyaDJ80cUCa +ZjP3wGr1QSk8dTyK/t9Gv4fOjdQ9m2czKhoLR6NHGqp1RQm0EZDBAkL+W035k5MVHfT XGSw== X-Forwarded-Encrypted: i=1; AJvYcCUwHbKWKIKbQPvQTdHynIb1kOud6ArrBUmVy/qpZQ9TZodUIQsSfyqL9e2debLkgLlZ+chOnhRQF80MPrLz@postgresql.org X-Gm-Message-State: AOJu0YwxjyhmehsKXVM7WXbj9Z7KYn8nXIg6CYRXAaDIeb6WjVAhBdwh Ln312AjN5OXNrF09IBmTfS+11XNw7lND59T0Cpor67zwsqoQpd608sHVXZhCRI+IB0OsYZoiuNn 2+6H8K6k6s8VElGV/RYSAjHypEqpN8/8= X-Gm-Gg: ASbGncuxH7S3eKFdc6XGH2P7kjPaKJw+dgPSu1RdYzQeM4RNhdZ8AeZAryK4Z9vu54N GuwssUY7eEd1zlJqy08zUxYhqGSORn9TleHURgW84MFNKfGYlvxF7xbfoOKqPSPNhLipPJ6C6ss bb+hixOhU9AQAFNgD8s6TpIZ8XbudEfcQxG41FxiS0BD4xNcCitPmJsEWjZYlbCvakadfBY5HyD El/k1mjQ0BINhDeb/C7ahwYoc6vCsBw4LIZw7rjKEOhmS92yaBA+OqtFgHetBPcMzZ5Jggr X-Google-Smtp-Source: AGHT+IEJRxVeLUlsGeNOhwFg08jFMfOIhzPYfOmERhzIHRVWgjCDv7c1BFoW3cUaswixpgr2lS7ctasey+WheviFS3M= X-Received: by 2002:a05:6402:5247:b0:643:130c:eb0 with SMTP id 4fb4d7f45d1cf-645544542d5mr14368459a12.8.1764100933763; Tue, 25 Nov 2025 12:02:13 -0800 (PST) MIME-Version: 1.0 References: <6kmid26do57ykqfpvq6iieniy4djsymhrypkjccazq5g4bbe6a@2y6owwv7qpex> <3je3ahgf7rrmmurxo6hnlhg5d3ffwfrtjwjxd6jm5srlv5iebp@vxqk5qtgmowr> <3w7v3w6a57jnssokap4k7thoekig72flnyhd4wp3yftzdd7lm7@f6lpcfen6hr7> <6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar> <3nce7i72ayzkunai6mkz24ckbxk74jodz4ua2chcdrwppxlxcd@w6x5kfkjrkru> In-Reply-To: <3nce7i72ayzkunai6mkz24ckbxk74jodz4ua2chcdrwppxlxcd@w6x5kfkjrkru> From: Melanie Plageman Date: Tue, 25 Nov 2025 15:02:02 -0500 X-Gm-Features: AWmQ_bkORDPbkPefzxzuUblSd2-fuIL_sHnyoHnsRGKlYSmwD-F9xSGGh2T2NSY Message-ID: Subject: Re: Buffer locking is special (hints, checksums, AIO writes) To: Andres Freund Cc: Matthias van de Meent , pgsql-hackers@postgresql.org, Thomas Munro , Heikki Linnakangas , Noah Misch , Robert Haas , Michael Paquier Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Tue, Nov 25, 2025 at 11:54=E2=80=AFAM Andres Freund = wrote: > > > I skipped 0013 and 0014 after seeing "#if 1" in 0013 :) > > I left that in there to make the comparison easier. But clearly that's no= t to > be committed... Eh, I mostly just ran out of steam. > > > [PATCH v6 08/14] bufmgr: Implement buffer content locks independently > > of lwlocks > > ... > > > This commit unfortunately introduces some code that is very similar t= o the > > > code in lwlock.c, however the code is not equivalent enough to easily= merge > > > it. The future wins that this commit makes possible seem worth the co= st. > > > > It is a truly unfortunate amount of duplication. I tried some > > refactoring myself just to convince myself it wasn't a good idea. > > Without success, I guess? Nothing that seemed better...and not worse :) > > > As of this commit nothing uses the new share-exclusive lock mode. It = will be > > > documented and used in a future commit. It seemed too complicated to = introduce > > > the lock-level in a separate commit. > > > > I would have liked this mentioned earlier in the commit message. > > Hm, ok. I could split share-exclusive out again, but it was somewhat pai= nful, > because it doesn't just lead to adding code, but to changing code. I don't think you need to introduce share-exclusive in a separate commit. I just mean it would be good to mention in the commit message before you get into so many other details that this commit doesn't use the share-exclusive level yet. > > Also, I don't know how I feel about it being "documented" in a future > > commit...perhaps just don't say that. > > Hm, why? I don't really see you documenting share-exclusive lock semantics in a later commit. Documentation for what the lock level is should be in the same commit that introduces it -- which I think you mostly have done. I feel like it is weird to say you will document that lock level later when 1) you don't really do that and 2) this commit mostly (besides some requests I had for elaboration) already does that. > > diff --git a/src/include/storage/buf_internals.h > > b/src/include/storage/buf_internals.h > > index 28519ad2813..0a145d95024 100644 > > --- a/src/include/storage/buf_internals.h > > +++ b/src/include/storage/buf_internals.h > > @@ -32,22 +33,29 @@ > > /* > > * Buffer state is a single 64-bit variable where following data is co= mbined. > > * > > + * State of the buffer itself: > > * - 18 bits refcount > > * - 4 bits usage count > > * - 10 bits of flags > > * > > + * State of the content lock: > > + * - 1 bit has_waiter > > + * - 1 bit release_ok > > + * - 1 bit lock state locked > > > > Somewhere you should clearly explain the scenarios in which you still > > need to take the buffer header lock with LockBufHdr() vs when you can > > just use atomic operations/CAS loops. > > There is an explanation of that, or at least my attempt at it ;). See the > dcoumentation for BufferDesc. Hmm. I suppose. I was imagining something above the state member since you have added to it, but okay. Separately, I'll admit I don't quite understand when I have to use LockBufHdr() and when I should use a CAS loop to update the bufferdesc->state. > > > > + * Signal that the process isn't on the wait list anymore. Thi= s allows > > + * BufferLockDequeueSelf() to remove itself of the waitlist wi= th a > > + * proclist_delete(), rather than having to check if it has be= en > > > > I know this comment was ported over, but the "remove itself of the > > waitlist" -- the "of" is confusing in the original comment and it is > > confusing here. > > As in s/of/from/? Yes, I wasn't sure if it meant "from" or "off" -- but I believe "from" is more grammatically correct. > > +static inline uint64 > > +BufferLockReleaseSub(BufferLockMode mode) > > > > I don't understand why this is a separate function even with your comme= nt. > > Because there are operations where we want to unlock the buffer as well a= s do > something else. E.g. in 0013, UnlockReleaseBuffer() we want to unlock the > buffer and decrease the refcount in one atomic operation. For that we nee= d to > know what to subtract from the state variable for the lock portion - henc= e > BufferLockReleaseSub(). Hmm. Okay well maybe save it for 0013? I don't care that much, though. > > Maybe you should add an assert to the lock acquisition path that the > > prevate ref count entry mode is UNLOCK? > > Yes, we should... > > It'd be nice if we had a decent way to test things that we except to > crash. Like repeated buffer lock acquisitions... Indeed. > > @@ -6625,7 +7295,25 @@ ResOwnerReleaseBufferPin(Datum res) > > if (BufferIsLocal(buffer)) > > UnpinLocalBufferNoOwner(buffer); > > else > > + { > > + PrivateRefCountEntry *ref; > > + > > + ref =3D GetPrivateRefCountEntry(buffer, false); > > + > > + /* > > + * If the buffer was locked at the time of the resowner releas= e, > > + * release the lock now. This should only happen after errors. > > + */ > > + if (ref->data.lockmode !=3D BUFFER_LOCK_UNLOCK) > > + { > > + BufferDesc *buf =3D GetBufferDescriptor(buffer - 1); > > + > > + HOLD_INTERRUPTS(); /* match the upcoming RESUME_INTERRU= PTS */ > > + BufferLockUnlock(buffer, buf); > > + } > > + > > UnpinBufferNoOwner(GetBufferDescriptor(buffer - 1)); > > + } > > } > > > > Bit confusing that ResOwnerReleaseBufferBin() now releases locks as we= ll. > > Do you have a better suggestion? I'll add a comment that makes that expli= cit, > but other than that I don't have a great idea. Renaming the whole buffer = pin > mechanism seems pretty noisy. ResOwnerReleaseBuffer()? What do you mean renaming the whole buffer pin mechanism? > > 0011: > > -------- > > > [PATCH v6 11/14] heapam: Add batch mode mvcc check and use it in page= mode > > > > > 2) We would like to stop setting hint bits while pages are being writ= ten > > > out. The necessary locking becomes visible for page mode scans if don= e for > > > every tuple. With batching the overhead can be amortized to only happ= en > > > once per page. > > > And I presume you mean don't set hint bits on a buffer that is being fl= ushed > > by someone else -- but it sounds like you mean not to set hint bits as = part > > of flushing a buffer. > > Right, I do mean the former. I would try and state it more clearly then. > > + /* > > + * If the page is not all-visible or we need to check serializ= ability, > > + * maintain enough state to be able to refind the tuple effici= ently, > > + * without again needing to extract it from the page. > > + */ > > + if (!all_visible || check_serializable) > > + { > > > > "enough state" is pretty vague here. > > I don't really follow. Wouldn't going into more detail just restate the c= ode > in a comment? I guess maybe something like If the page is not all-visible or we need to check serializability, keep track of the tuples so we can examine them later without the overhead of extracting them from the page again. > > 0012: > > ------- > > > @@ -77,6 +73,16 @@ gistkillitems(IndexScanDesc scan) > > */ > > for (i =3D 0; i < so->numKilled; i++) > > { > > + if (!killedsomething) > > + { > > + /* > > + * Use hint bit infrastructure to be allowed to modify the= page > > + * without holding an exclusive lock. > > + */ > > + if (!BufferBeginSetHintBits(buffer)) > > + goto unlock; > > + } > > + > > > > I don't understand why this is in the loop. Clearly you want to call > > BufferBeginSetHintBits() once, but why would you do it in the loop? > > Why would we want to continue if we can't set hint bits? No, I'm suggesting that you move BufferBeginSetHintBits() outside the loop so it is more obvious that it is happening once at the beginning. Like this: if (so->numKilled > 0 && !BufferBeginSetHintBits(buffer)) goto unlock; for (i =3D 0; i < so->numKilled; i++) { offnum =3D so->killedItems[i]; iid =3D PageGetItemId(page, offnum); ItemIdMarkDead(iid); killedsomething =3D true; } if (killedsomething) { GistMarkPageHasGarbage(page); BufferFinishSetHintBits(buffer, true, true); } unlock: UnlockReleaseBuffer(buffer); > > And should the enum name itself (SetHintBitsState) include the word > > "batch"? I know that would make it > > long. At least the comment should explain that these are needed when ba= tch > > setting hint bits. > > Isn't that what the comment does, explaining that we want to amortize the= cost > of BufferBeginSetHintBits()? well, amortize is a pretty fancy word and you don't say "batch" anywhere. > > And then you inlined BufferSetHintBits16(). This was part of my wishlist for separating the batch and non-batch version= s. > > I presume you didn't do this because HeapTupleSatisifiesMVCC() for > > SNAPSHOT_MVCC calls the non-batch version (and, of course, > > HeapTupleSatisifiesVisibility() is the much more common case). > > > > if (TransactionIdIsValid(xid)) > > { > > if (BufferIsPermanent(buffer)) > > { > > /* NB: xid must be known committed here! */ > > XLogRecPtr commitLSN =3D TransactionIdGetCommitLSN(x= id); > > > > if (XLogNeedsFlush(commitLSN) && > > BufferGetLSNAtomic(buffer) < commitLSN) > > { > > /* not flushed and no LSN interlock, so don't > > set hint */ > > return; false; > > } > > } > > } > > > > Separately, I was thinking, should we assert here about having the > > right lock type? > > Not sure I get what assert of what locktype where? In SetHintBitsExt() that we have share-exclusive or above. Or are there still callers with only a share lock? > > * It is only safe to set a transaction-committed hint bit if we know t= he > > * transaction's commit record is guaranteed to be flushed to disk befo= re the > > * buffer, or if the table is temporary or unlogged and will be obliter= ated by > > * a crash anyway. We cannot change the LSN of the page here, because = we may > > * hold only a share lock on the buffer, so we can only use the LSN to > > * interlock this if the buffer's LSN already is newer than the commit = LSN; > > * otherwise we have to just refrain from setting the hint bit until so= me > > * future re-examination of the tuple. > > * > > > > Should this say "we may hold only a share exclusive lock on the > > buffer". Also what is "this" in "only use the LSN to interlock this"? > > That's a pre-existing comment, right? Right, but are there callers that will only have a share lock after your ch= ange? > > @@ -1628,6 +1701,9 @@ HeapTupleSatisfiesMVCCBatch(Snapshot snapshot, > > Buffer buffer, > > + if (state =3D=3D SHB_ENABLED) > > + BufferFinishSetHintBits(buffer, true, true); > > + > > return nvis; > > } > > > > I wondered if it would be more natural for BufferBeginSetHintBits() > > and BufferFinishSetHintBits() to set SHB_INITIAL and SHB_DISABLED > > instead of having callers do it. But, I guess you don't do this > > because of gist and hash indexes using this for doing their own > > modifications. > > I thought about it. But yea, the different callers seemed to make that no= t > really useful. It'd also mean that we'd do an external function call for = every > tuple, which I think would be prohibitively expensive. I'm confused why this would mean more external function calls. > > --- a/src/backend/storage/freespace/freespace.c > > +++ b/src/backend/storage/freespace/freespace.c > > @@ -904,14 +904,22 @@ fsm_vacuum_page(Relation rel, FSMAddress addr, > > max_avail =3D fsm_get_max_avail(page); > > /* > > - * Reset the next slot pointer. This encourages the use of low-numb= ered > > - * pages, increasing the chances that a later vacuum can truncate t= he > > - * relation. We don't bother with a lock here, nor with marking th= e page > > - * dirty if it wasn't already, since this is just a hint. > > + * Try to reset the next slot pointer. This encourages the use of > > + * low-numbered pages, increasing the chances that a later vacuum c= an > > + * truncate the relation. We don't bother with a lock here, nor wi= th > > + * marking the page dirty if it wasn't already, since this is just = a hint. > > + * > > + * To be allowed to update the page without an exclusive lock, we h= ave to > > + * use the hint bit infrastructure. > > */ > > > > What the heck? This didn't even take a share lock before... > > It is insane. I do not understand how anybody thought this was ok. I thi= nk I > probably should split this out into a separate commit, since it's so insa= ne. Aren't you going to backport having it take a lock? Then it can be separate (e.g. have it take a share lock in the "fix" commit and then this commit bumps it to share-exclusive). > > diff --git a/src/backend/storage/freespace/fsmpage.c > > b/src/backend/storage/freesp> > > /* > > * Update the next-target pointer. Note that we do this even if we're on= ly > > * holding a shared lock, on the grounds that it's better to use a share= d > > * lock and get a garbled next pointer every now and then, than take the > > * concurrency hit of an exclusive lock. > > > > We appear to avoid the garbling now? > > I don't think so. Two backends concurrently can do fsm_search_avail() and > one backend might set a hint to a page that is already used up by the oth= er > one. At least I think so? Maybe I don't know what it meant by garbled, but I thought it was talking about two backends each trying to set fp_next_slot. If they now have to have a share-exclusive lock and they can't both have a share-exclusive lock at the same time, then it seems like that wouldn't be a problem. It sounds like you may be talking about a backend taking up the freespace of a page that is referred to by the fp_next_slot? - Melanie