Date: Sun, 8 Feb 2026 13:38:42 -0500
From: Andres Freund
To: Heikki Linnakangas
Cc: Melanie Plageman, Noah Misch, Kirill Reshke, Matthias van de Meent,
    pgsql-hackers@postgresql.org, Thomas Munro, Robert Haas, Michael Paquier
Subject: Re: Buffer locking is special (hints, checksums, AIO writes)
References: <1108f18d-cf7c-4f17-b29c-a119fe42f7e5@iki.fi>
    <5dwlfu2jyzkyf3nrlzxxblxctb6xio5es73ptgsahjnmfu5miu@772rc764hfhi>
    <4csodkvvfbfloxxjlkgsnl2lgfv2mtzdl7phqzd4jxjadxm4o5@usw7feyb5bzf>
    <5ubipyssiju5twkb7zgqwdr7q2vhpkpmuelxfpanetlk6ofnop@hvxb4g2amb2d>
    <03041d48-1e15-4741-b365-0809f2bc75c4@iki.fi>
In-Reply-To: <03041d48-1e15-4741-b365-0809f2bc75c4@iki.fi>

Hi,

On 2026-02-07 14:38:53 +0200, Heikki Linnakangas wrote:
> On 03/02/2026 00:33, Andres Freund wrote:
> > - Now that we use the normal order of WAL logging, we don't need to delay
> >   checkpoint starts anymore.
> >
> >   I think the explanation for why that is ok is correct [1], but it needs to
> >   be looked at by somebody with experience around this. Maybe Heikki?
>
> So that's patch 0004 "bufmgr: Switch to standard order in
> MarkBufferDirtyHint()". Yes, looks correct to me.

Thanks for checking! Somehow I went back and forth about it being right
multiple times...

> >     /*
> >      * Update RedoRecPtr so that we can make the right decision. It's possible
> >      * that a new checkpoint will start just after GetRedoRecPtr(), but that
> >      * is ok, as the buffer is already dirty, ensuring that any BufferSync()
> >      * started after the buffer was marked dirty cannot complete without
> >      * flushing this buffer.
> >      * If a checkpoint started between marking the
> >      * buffer dirty and this check, we will emit an unnecessary WAL record (as
> >      * the buffer will be written out as part of the checkpoint), but the
> >      * window for that is small.
> >      */
> >     RedoRecPtr = GetRedoRecPtr();
>
> That "small window" is actually pretty big if you think of it a little more
> loosely. Our rule is that we write the full page image if a checkpoint has
> started since the page LSN, but that's very conservative already. It would
> be sufficient to write the full page image only if the checkpoint has
> already flushed the page. This small window is just a special case of that
> conservatism.

I mainly want to mention that window because I have to think about it when
analyzing the correctness of the approach. If the window is not mentioned, I
at least have to think about whether the window is dangerous in some form.

> It would be sufficient to write the full page image only if the checkpoint
> has already flushed the page.

Today that would probably not quite be sufficient, due to issues around
re-dirtying the page during checkpointer's flush (and thus needing to be
written out again, with the chance of a torn write that has no FPI to repair
it). But that will soon be impossible.

I think the actual rule would need to be more complicated: we would need to
generate an FPI for the first modification after the checkpoint flush, even
though the page LSN is newer than the redo LSN, because we didn't generate one
earlier. Otherwise we could get into a situation where there is no non-torn
on-disk page version after a later crash, I think?

Consider:

1) modify page w/ FPI
2) redo pointer determined at X
3) modify page w/o FPI at X+1, as the checkpoint hasn't flushed the page yet
4) checkpointer flushes page
5) checkpoint completes, at X+2
6) page is dirtied w/o FPI at X+3, as the page LSN X+1 > redo pointer X
7) in the middle of writing out the page, we crash, the page is torn

For recovery we will replay starting from position X.
Then we will replay the record from 3), which will be skipped due to the LSN.
Then we will replay X+3, which either will be skipped due to the LSN
condition (if the page header survived the torn page), leading to the changes
to the "old portion" of the torn page not being replayed, or we will replay
the WAL record, applying it to a torn page (or failing to read in the page due
to checksum errors).

If we only needed to think about buffers that stay in memory, we could "just"
tackle this by remembering in the BufferDesc that the page will need an FPI
during the next modification, but that doesn't help us if the page is evicted
and reread...

> I've been thinking of trying track that more accurately for a long time,
> because it would smoothen the WAL spike when a checkpoint begins.

It'd indeed be nice to improve that. Another thing it'd be helpful for is
widening when we can write out hint bits on standbys.

If the rule were just that we can skip an FPI if the page still needs to be
written out by the checkpoint, it'd be fairly simple - we could utilize
BM_CHECKPOINT_NEEDED. But as hinted at above, I think it's a bit more
complicated.

Greetings,

Andres Freund
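
For anyone who wants to step through the seven-step sequence mechanically,
here is a rough toy model. It is only a sketch: every Toy*/toy_* name is made
up for illustration and none of it is PostgreSQL code. The "standard" rule
follows the description quoted above (FPI if a checkpoint started since the
page LSN), the "flushed only" rule is the naive relaxation (FPI only if that
checkpoint has also already flushed the page), and X is arbitrarily set to
100. The model just counts how many FPIs for the page exist in WAL at or
after the redo pointer when the crash in step 7 happens.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: none of these names exist in PostgreSQL. */
typedef uint64_t ToyLsn;

typedef enum
{
    TOY_RULE_STANDARD,      /* FPI if page LSN <= redo pointer */
    TOY_RULE_FLUSHED_ONLY   /* ... and the checkpoint already flushed the page */
} ToyFpiRule;

typedef struct
{
    ToyLsn page_lsn;        /* LSN of the last record that modified the page */
    ToyLsn redo_ptr;        /* redo pointer of the most recent checkpoint */
    bool   flushed_by_ckpt; /* has that checkpoint written the page out? */
    int    fpis_since_redo; /* FPIs for this page logged at/after redo_ptr */
} ToyPage;

static bool
toy_needs_fpi(const ToyPage *p, ToyFpiRule rule)
{
    bool ckpt_started_since_lsn = (p->page_lsn <= p->redo_ptr);

    if (rule == TOY_RULE_STANDARD)
        return ckpt_started_since_lsn;
    return ckpt_started_since_lsn && p->flushed_by_ckpt;
}

static void
toy_modify(ToyPage *p, ToyLsn lsn, ToyFpiRule rule)
{
    assert(lsn > p->page_lsn);  /* WAL positions only move forward */
    if (toy_needs_fpi(p, rule))
        p->fpis_since_redo++;
    p->page_lsn = lsn;
}

static void
toy_start_checkpoint(ToyPage *p, ToyLsn redo)
{
    p->redo_ptr = redo;
    p->flushed_by_ckpt = false;
    p->fpis_since_redo = 0;
}

/*
 * Runs steps 1) to 6) and returns how many FPIs recovery would find for the
 * page when replaying from X after the torn write in step 7).
 */
int
toy_run_scenario(ToyFpiRule rule)
{
    const ToyLsn X = 100;
    /* State left behind by an earlier, completed checkpoint. */
    ToyPage p = {.page_lsn = 10, .redo_ptr = 50,
                 .flushed_by_ckpt = true, .fpis_since_redo = 0};

    toy_modify(&p, X - 10, rule);   /* 1) modify page w/ FPI */
    toy_start_checkpoint(&p, X);    /* 2) redo pointer determined at X */
    toy_modify(&p, X + 1, rule);    /* 3) modify page, not yet flushed */
    p.flushed_by_ckpt = true;       /* 4) checkpointer flushes page */
                                    /* 5) checkpoint completes at X+2 */
    toy_modify(&p, X + 3, rule);    /* 6) page dirtied, page LSN X+1 > X */
    return p.fpis_since_redo;       /* 7) crash mid-write, page is torn */
}
```

Under these assumptions toy_run_scenario(TOY_RULE_STANDARD) returns 1 (the
record at X+1 carries an FPI that can repair the torn page), while
toy_run_scenario(TOY_RULE_FLUSHED_ONLY) returns 0, i.e. replay from X finds
nothing to restore the page from. It deliberately does not model the more
complicated rule, since the evict-and-reread case is exactly what a
per-buffer flag can't capture.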