Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vNgdR-00Aqpz-26 for pgsql-hackers@arkaria.postgresql.org; Tue, 25 Nov 2025 00:10:21 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1vNgdP-005X3n-2x for pgsql-hackers@arkaria.postgresql.org; Tue, 25 Nov 2025 00:10:20 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1vNgdP-005X3f-20 for pgsql-hackers@lists.postgresql.org; Tue, 25 Nov 2025 00:10:19 +0000 Received: from mail-pf1-x430.google.com ([2607:f8b0:4864:20::430]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.96) (envelope-from ) id 1vNgdN-001K13-21 for pgsql-hackers@postgresql.org; Tue, 25 Nov 2025 00:10:19 +0000 Received: by mail-pf1-x430.google.com with SMTP id d2e1a72fcca58-7bac5b906bcso559702b3a.3 for ; Mon, 24 Nov 2025 16:10:17 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1764029415; x=1764634215; darn=postgresql.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=H777xzCVeBzCpaFbqgvpxxqHw7a9NHnzXTWqemKICwI=; b=SGMRJ2Lpx74NvEkresDsA3qkfgYr8fTjh7uLR2nKTNOaorAgQXjx7cpIRaQzAp9HEu PMNK9h+KVfatKfp8YAQa7h9vRHGOhg5BfrC8rNz5s6RymUJbHUBYR1LQdq60o3a8cZ0+ rrqXD3zpbagebnART3vj+o/0eP+5Lv4Xv+sPGa/r5zAW9XUMzJjYQnQ0S709q3+vnbwF x3fBGJ6raiccO1VVlxAb63/iDScT79CCYMDUPBpcxNpVfUpv1wa7C+/KctiKHEK3Kvkb uJx90nNjNBs3RmJPUYurCmdShWqwuF4sOrdMfqSWdmq+E1r9VAFbLLAjK3k3t6kvFrbD GOAQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1764029415; x=1764634215; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=H777xzCVeBzCpaFbqgvpxxqHw7a9NHnzXTWqemKICwI=; b=r71rRvHlu5S5L/B+0Rv4yHbudla+3YOTiu4Fu2+b4aW6tQ9YzfzRKaTiVCC9cwYA05 wfndw2jFy8CqxSfYNyjKZDf/yxNJDmjDNJZRDYsWUo8A8dFIUN0VNdEsTkrpQ9FjhKt0 mEKlTlSaQHN/qWdenT2dXkrnNnXoml74YjEPYyvQzlP3LBo3tZ+wIEyae3QRHHduGaB3 s7uC7sfQKlUMbh37wwjNeOfGgkrXq1H1m/4L4JnnXVJw7Jk4mwVsIXkMdNkKyKiyBZ7R hbvKhtyA1OUbnF///VyL+ELyrsW22hCfDMQgOSwB3UPD4kDeWM4pQesmobIlbeSReS8L hzSQ== X-Forwarded-Encrypted: i=1; AJvYcCVVrOfJSzRlegbofT4S29uQm6XizVdgwsCd6WOweFFC7Jx6rxLoJWDCV/lOY2NjUumpzUAQ/l09SEZqdL2S@postgresql.org X-Gm-Message-State: AOJu0YzF0jl8WprVN5w59zhTuOfsxcYH2dDxysWb0RSKfp9zogdyZyOV aTr3iVL6h/Anujb2XpthAzPQbh8idlHaiGY5d2/J3xmHvDIosmWrRnj9M3qcByn12+cpPNBzft+ ZVangY3/IO5+UYcU1lzVLbgLaC2n0pLA= X-Gm-Gg: ASbGnctWaoq+SvrmFvSFwCkr8Xq95MW1S0GYIuxuejkB7xEwN0mFbxltSjUfGSsRO4G nJhzstzVkCHtC2pZx1q9x2I7WlSaziVO8Bm7gYHmpvUN95oekULuYTysHigUbUSHCDTt6m0S8TZ Ihsjlb4LG7oU3A5nHD1+HYbGC6Fs5qihqNZaHor6DSWhtXN94WOcaQw6eGphTH81JPWlyTzkicf WmVS1BxoxePqovKyg+C/zQ/58KOfvsZgI/A8gpuTJmNYc05YtUTf/BiiN8Z+52UrrnUeUUotJcF +mMuYV3pjW647JPHffyEfR2aMvqJjx1s3RQfg3E1oOCZTN7qkLv46JXYKHLrgns= X-Google-Smtp-Source: AGHT+IFogudbamEwXvwH5G9VohB64+V040zXWchITjx1NC6qjKn1IC/ywD7WhMaIHFlSJEWJeioVDAidupm7sXH58NE= X-Received: by 2002:a05:7300:7b01:b0:2a4:3593:2c08 with SMTP id 5a478bee46e88-2a7249fec39mr4934879eec.1.1764029415187; Mon, 24 Nov 2025 16:10:15 -0800 (PST) MIME-Version: 1.0 References: <6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar> <1FAF2680-51A6-4ACF-8446-C6FA3234D72A@greg.burd.me> In-Reply-To: From: Thomas Munro Date: Tue, 25 Nov 2025 13:09:38 +1300 X-Gm-Features: AWmQ_bkqe-Yo4BIeLn1ZQMXzoZLA-OY_3xjkRYMvW1XPbkhZ6MNx_VNex2NzPIE Message-ID: Subject: Re: Buffer locking is special (hints, checksums, AIO writes) To: Andres Freund Cc: Greg Burd , Matthias van de Meent , "pgsql-hackers@postgresql.org" , Melanie Plageman , Heikki Linnakangas , Noah Misch , Robert Haas , Michael Paquier Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On Fri, Nov 21, 2025 at 9:51=E2=80=AFAM Andres Freund = wrote: > It's worth pointing out that the new way of setting hint bits is inherent= ly > more expensive than what we did before - upgrading a lock to a different = lock > level isn't free, compared to doing, well, nothing. > > For paths that set the hint bits of a whole page, like a seqscan, that co= st is > more than amortized by the batched approach introduced in 0011. Those get > faster with the patch, both when already hinted and when not. Nice work! > However, there are paths that aren't easily amenable to that approach, li= ke > e.g. an ordered index scan referencing unhinted tuples. There we only eve= r > access a single tuple and release the upgraded lock after every tuple. If= the > index scan is perfectly correlated with the table and every tuple is unhi= nted, > that's a decent amount of additional work. Yeah, but it was only faster because it was cheating. It presumably doesn't happen when you bulk load and then create index. It presumably does happen when you insert a lot of data in order, on first correlated index scan. Seems like an inherent limitation of the current tuple-at-a-time architecture when combined with the *required* interlocking, and not a blocker for this work. + Some filesystems, raid implementations, ... do not tolerate the data bein= g I was aware of BTRFS (EIO on read) and ZFS 2.4 (EIO on read or write depending on configuration option), but hadn't thought about RAID. Ugh, right, non-matching RAID1 mirrors (and I guess also b0rked RAID5 parity bits?). Fun. https://bugzilla.kernel.org/show_bug.cgi?id=3D99171 > I've spent a lot of time micro-optimizing that workload, to avoid any > significiant regressions. An extreme stress-test started out being about = 20% > slower than today, as of my current local version, it's a bit faster (~1%= ) on > one of my machines and a bit slower (~2%) on another. Partially that was > achieved by optimizing the hint-bit-lock-upgrade code more (e.g. having a= fast > path for updating a single hint bit, avoiding redundant reads of the lock > state by having MarkSharedBufferDirtyHint(), ...), partially by optimizin= g the > locking code. The latter is a bit of a cheat though - things would be ev= en > faster if we went with the old way of setting hint bits, but with the > independent optimizations applied. > > I think that's ok though: > > 1) the old way of setting hint bits is a pretty dirty hack that causes is= sues > in quite a few places. > > 2) by definition, having to set hint bits is an ephemeral state, once the= hint > bits are set, the difference vanishes > > 3) no normal workload shows the difference - my stress test does > SELECT * FROM manyrows_idx ORDER BY i OFFSET 10000000; > on a perfectly correlated table with very narrow rows, i.e. an index s= can > of the whole table, where none of the scan results are ever used. Once= one > actually uses the resulting rows, the performance difference completel= y > vanishes. > > 4) as part of the index prefetching work, we might get the infrastructure= to > actually batch the hint-bit setting in this case too. Yeah. Was just thinking the same. Both the streaming and batching projects have opportunities to figure out an amortisation scheme. I have a few vague ideas about stream-based approaches already, hmm... +1, I think this is OK for now.