Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w7hDe-005bXb-1O for pgsql-hackers@arkaria.postgresql.org; Tue, 31 Mar 2026 22:05:54 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w7hDc-00DQ7c-1x for pgsql-hackers@arkaria.postgresql.org; Tue, 31 Mar 2026 22:05:53 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w7hDb-00DQ7T-1Q for pgsql-hackers@lists.postgresql.org; Tue, 31 Mar 2026 22:05:52 +0000 Received: from fout-a7-smtp.messagingengine.com ([103.168.172.150]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1w7hDZ-000000022LM-2PxU for pgsql-hackers@postgresql.org; Tue, 31 Mar 2026 22:05:51 +0000 Received: from phl-compute-04.internal (phl-compute-04.internal [10.202.2.44]) by mailfout.phl.internal (Postfix) with ESMTP id 06229EC0242; Tue, 31 Mar 2026 18:05:48 -0400 (EDT) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-04.internal (MEProxy); Tue, 31 Mar 2026 18:05:48 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=anarazel.de; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm2; t=1774994748; x=1775081148; bh=uPFvsw4kkQ WdHhJOiriMkgkBKWowU3qhTGIKa6QkGRk=; b=Let1LxopfmmEP1FmwF1BEeUd/i 830D+Q4t1KsZglG0eHA160dHuyB6Tq/UKwQ+OIUTfiowbdYdOCguDnlbvHRnhf9z Ig4M60K1zVNq43Ytf0zrETiaZD3QtF3qbYb+bPv51KQoKPcYbVsnCj6lcEfUlNWT ZUQa3AnF4efq1wzypPL2CU1mPeZX7Bd6P5VzVrWTQocYzEtK4Ahoe8uofLlUi1Sq P7tm/Y88TTnawm/i0/Z87OCheNYl7pPr5Vb1VEQQMZreQYRW7MkutlOIz7063pS3 AQvZqySynCb8B+U12T1PgYj2mcF2fzvWpegQnZj8lfiWhPL3ZLGALZC5eZOA== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm2; t= 1774994748; x=1775081148; bh=uPFvsw4kkQWdHhJOiriMkgkBKWowU3qhTGI Ka6QkGRk=; b=BIYpgRrC1x7i89k0kPoc65Yu2DFE0jrJVxwPS0uqvMkj7w9+V66 vw3Frs27Y5DJ/ZMGiJ6MvPsUVYFVdB8oE77q98jq7zZyCb0J0LMEQFdgSMXDuqvU V8gjfXWxhyqvvqoJ7h0Py/j/SEHk3UZRA0KiiyDjvM47vzvo2rj5Zwz/NY5RHduw yMi2WDhOt9AkVJX9DVuAlQ/eHkjAgBi6OxaGMbQk/6eph8R08gVnShEIe/j0MfF5 XQMtIqCtYPn7K1wxTXf0HHogHSU4/niqhcMqR5KUrMl10GuE0X6VLeKIZ6EVuRG9 yrhmSaIis2a2+D3x7p1EEvw6JKdjjJmjCUg== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: gggruggvucftvghtrhhoucdtuddrgeefhedrtddtgddufeehucetufdoteggodetrfdotf fvucfrrhhofhhilhgvmecuhfgrshhtofgrihhlpdfurfetoffkrfgpnffqhgenuceurghi lhhouhhtmecufedttdenucesvcftvggtihhpihgvnhhtshculddquddttddmnecujfgurh epfffhvfevuffkfhggtggujgesthdtsfdttddtvdenucfhrhhomheptehnughrvghsucfh rhgvuhhnugcuoegrnhgurhgvshesrghnrghrrgiivghlrdguvgeqnecuggftrfgrthhtvg hrnhepfeffgfelvdffgedtveelgfdtgefghfdvkefggeetieevjeekteduleevjefhueeg necuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehmrghilhhfrhhomheprghnug hrvghssegrnhgrrhgriigvlhdruggvpdhnsggprhgtphhtthhopedutddpmhhouggvpehs mhhtphhouhhtpdhrtghpthhtohepsghovghkvgifuhhrmhdophhoshhtghhrvghssehgmh grihhlrdgtohhmpdhrtghpthhtohepmhgvlhgrnhhivghplhgrghgvmhgrnhesghhmrghi lhdrtghomhdprhgtphhtthhopehmihgthhgrvghlrdhprghquhhivghrsehgmhgrihhlrd gtohhmpdhrtghpthhtoheprhgvshhhkhgvkhhirhhilhhlsehgmhgrihhlrdgtohhmpdhr tghpthhtoheprhhosggvrhhtmhhhrggrshesghhmrghilhdrtghomhdprhgtphhtthhope hthhhomhgrshdrmhhunhhrohesghhmrghilhdrtghomhdprhgtphhtthhopehhlhhinhhn rghkrgesihhkihdrfhhipdhrtghpthhtohepnhhorghhsehlvggruggsohgrthdrtghomh dprhgtphhtthhopeihrdhsohhkohhlohhvsehpohhsthhgrhgvshhprhhordhruh X-ME-Proxy: Feedback-ID: id4a34324:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Tue, 31 Mar 2026 18:05:47 -0400 (EDT) Date: Tue, 31 Mar 2026 18:05:46 -0400 From: Andres Freund To: Yura Sokolov Cc: Melanie Plageman , Noah Misch , Heikki Linnakangas , Kirill Reshke , Matthias van de Meent , pgsql-hackers@postgresql.org, Thomas Munro , Robert Haas , Michael Paquier Subject: Re: Buffer locking is special (hints, checksums, AIO writes) Message-ID: References: <5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d> <68e89de8-5f6c-4eaf-a800-e16a5e487667@iki.fi> <20260215195239.ce.noahmisch@microsoft.com> <5bf667f3-5270-4b19-a08f-0facbecdff68@postgrespro.ru> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <5bf667f3-5270-4b19-a08f-0facbecdff68@postgrespro.ru> List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Hi, On 2026-03-31 19:02:33 +0300, Yura Sokolov wrote: > 27.03.2026 23:00, Andres Freund wrote: > > On 2026-03-25 18:35:55 -0400, Andres Freund wrote: > >> Running it through valgrind and then will work on reading through one more > >> time and pushing them. > > > > And done. > > > > Phew, this project took way longer than I'd though it'd take. > > In addition to bug with BM_IO_ERROR [1] , I found race condition in > PinBuffer in this lines of code: > > if (unlikely(skip_if_not_valid && !(old_buf_state & BM_VALID))) > return false; > > /* > * We're not allowed to increase the refcount while the buffer > * header spinlock is held. Wait for the lock to be released. > */ > if (old_buf_state & BM_LOCKED) > old_buf_state = WaitBufHdrUnlocked(buf); > > While we waited for buffer header for being unlocked, it may become > invalid, isn't it? > Therefore, check related to skip_if_not_valid have to happen after waiting. Yea, that does seem wrong. Not sure how it ended up that way. I think it may be better to add a continue after the WaitBufHdrUnlocked(), so that we restart the loop, rather than moving the skip_if_not_valid check. > .... > > Another question: previously we had to wait for buffer for being unlocked > because UnlockBufHdr wrote to buf->state unconditionally, therefore our pin > increment could be lost. > Now UnlockBufHdr and UnlockBufHdrExt does proper atomic operations and > preserves concurrent changes. Are we still need to wait? Yes. > Most of time PinBuffer is called protected by BufTable's partition LWLock, > therefore buffer may not be changed in dramatic way. I don't think the partition locks are sufficient protection for everything. We have a few places in the code that want to be able to modify the buffer state depending on whether the buffer is already pinned, and I don't think all of them currently hold the relevant buffer mapping partition's lock. If pinning were not to wait for an existing header lock, such checks would not easily be doable. Perhaps we could fix all the relevant places by acquiring the partition lock in a few more places. But I think that'd be going in exactly the opposite direction we should to. The partition locks are quite contended locks and we should work on getting rid of them eventually. Building them into the protection model seems quite unwise. I think many of the places that currently do rely on the buffer header spinlock can be converted to CAS loops. I'm not really sure how much that's worth though - the WaitBufHdrLocked() in PinBuffer() is pretty hard to hit in realistic workloads. What would be really nice, is to be able to replace the CAS() with an atomic add (since those are considerably faster), but that's not really possible regardless of the need for WaitBufHdrLocked(), because we can't just add BUF_USAGECOUNT_ONE, as that would allow increasing the usage count too far. I would like to eventually narrow the definition of the buffer header spinlock to just be about the "identity" of the buffer, which then would only be needed by things like DropRelationBuffers() and when changing the buffer's identity. > But call in ReadRecentBuffer is the exception. It is not protected by > partition lock and have to make additional checks. That is why you > introduced skip_if_not_valid. > > Does optimization of ReadRecentBuffer pays for WaitBufHdrUnlocked? As mentioned above, I don't think it's just ReadRecentBuffer that relies on the buffer header spinlock preventing new pins. Greetings, Andres Freund