From: Thomas Munro <[email protected]>
To: Andres Freund <[email protected]>
Cc: Greg Burd <[email protected]>
Cc: Matthias van de Meent <[email protected]>
Cc: [email protected] <[email protected]>
Cc: Melanie Plageman <[email protected]>
Cc: Heikki Linnakangas <[email protected]>
Cc: Noah Misch <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Michael Paquier <[email protected]>
Subject: Re: Buffer locking is special (hints, checksums, AIO writes)
Date: Tue, 25 Nov 2025 13:09:38 +1300
Message-ID: <CA+hUKGLmpStLUW3LVzPiR_-zJ8=QrMoBT82z7HnLzk9nMU=KGg@mail.gmail.com> (raw)
In-Reply-To: <z5gkpjfxa4rg3zylnnrpga77ezts5rrgi7xr34yydcmqeathop@hbeyhwlm2i5f>
References: <6rgb2nvhyvnszz4ul3wfzlf5rheb2kkwrglthnna7qhe24onwr@vw27225tkyar>
	<[email protected]>
	<z5gkpjfxa4rg3zylnnrpga77ezts5rrgi7xr34yydcmqeathop@hbeyhwlm2i5f>

On Fri, Nov 21, 2025 at 9:51 AM Andres Freund <[email protected]> wrote:
> It's worth pointing out that the new way of setting hint bits is inherently
> more expensive than what we did before - upgrading a lock to a different lock
> level isn't free, compared to doing, well, nothing.
>
> For paths that set the hint bits of a whole page, like a seqscan, that cost is
> more than amortized by the batched approach introduced in 0011. Those get
> faster with the patch, both when already hinted and when not.

Nice work!

> However, there are paths that aren't easily amenable to that approach, like
> e.g. an ordered index scan referencing unhinted tuples. There we only ever
> access a single tuple and release the upgraded lock after every tuple. If the
> index scan is perfectly correlated with the table and every tuple is unhinted,
> that's a decent amount of additional work.

Yeah, but it was only faster because it was cheating.  It presumably
doesn't happen when you bulk load and then create an index.  It
presumably does happen when you insert a lot of data in order, on the
first correlated index scan.  That seems like an inherent limitation of
the current tuple-at-a-time architecture when combined with the
*required* interlocking, and not a blocker for this work.

+ Some filesystems, raid implementations, ... do not tolerate the data being

I was aware of BTRFS (EIO on read) and ZFS 2.4 (EIO on read or write
depending on configuration option), but hadn't thought about RAID.
Ugh, right, non-matching RAID1 mirrors (and I guess also b0rked RAID5
parity bits?).  Fun.

https://bugzilla.kernel.org/show_bug.cgi?id=99171

> I've spent a lot of time micro-optimizing that workload, to avoid any
> significant regressions. An extreme stress-test started out being about 20%
> slower than today, as of my current local version, it's a bit faster (~1%) on
> one of my machines and a bit slower (~2%) on another. Partially that was
> achieved by optimizing the hint-bit-lock-upgrade code more (e.g. having a fast
> path for updating a single hint bit, avoiding redundant reads of the lock
> state by having MarkSharedBufferDirtyHint(), ...), partially by optimizing the
> locking code.  The latter is a bit of a cheat though - things would be even
> faster if we went with the old way of setting hint bits, but with the
> independent optimizations applied.
>
> I think that's ok though:
>
> 1) the old way of setting hint bits is a pretty dirty hack that causes issues
>    in quite a few places.
>
> 2) by definition, having to set hint bits is an ephemeral state, once the hint
>    bits are set, the difference vanishes
>
> 3) no normal workload shows the difference - my stress test does
>    SELECT * FROM manyrows_idx ORDER BY i OFFSET 10000000;
>    on a perfectly correlated table with very narrow rows, i.e. an index scan
>    of the whole table, where none of the scan results are ever used. Once one
>    actually uses the resulting rows, the performance difference completely
>    vanishes.
>
> 4) as part of the index prefetching work, we might get the infrastructure to
>    actually batch the hint-bit setting in this case too.

Yeah.  Was just thinking the same.  Both the streaming and batching
projects have opportunities to figure out an amortisation scheme.  I
have a few vague ideas about stream-based approaches already, hmm...

+1, I think this is OK for now.
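
To make the amortisation point concrete, here is a toy model in C.  It
is only a sketch under stated assumptions: ToyPage, hint_per_tuple() and
hint_batched() are hypothetical names invented for illustration, not
PostgreSQL code, and a plain counter stands in for the cost of a real
lock upgrade.  It just counts how many upgrades each strategy would pay
for on one page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_TUPLES 64

/* Hypothetical stand-in for a heap page; not a PostgreSQL structure. */
typedef struct ToyPage
{
    size_t ntuples;                 /* tuples on the page */
    bool   hinted[TOY_MAX_TUPLES];  /* true once a tuple's hint is set */
} ToyPage;

/*
 * Tuple-at-a-time: each unhinted tuple pays for its own upgrade, sets
 * one hint, and releases again.  Returns the number of upgrades taken.
 */
size_t
hint_per_tuple(ToyPage *page)
{
    size_t upgrades = 0;

    assert(page->ntuples <= TOY_MAX_TUPLES);
    for (size_t i = 0; i < page->ntuples; i++)
    {
        if (!page->hinted[i])
        {
            upgrades++;             /* simulated upgrade + release */
            page->hinted[i] = true;
        }
    }
    return upgrades;
}

/*
 * Batched: take at most one upgrade for the page and set every missing
 * hint while holding it.  Returns the number of upgrades taken (0 or 1).
 */
size_t
hint_batched(ToyPage *page)
{
    bool any_unhinted = false;

    assert(page->ntuples <= TOY_MAX_TUPLES);
    for (size_t i = 0; i < page->ntuples; i++)
    {
        if (!page->hinted[i])
        {
            any_unhinted = true;
            break;
        }
    }
    if (!any_unhinted)
        return 0;                   /* already hinted: nothing to pay */

    for (size_t i = 0; i < page->ntuples; i++)
        page->hinted[i] = true;
    return 1;                       /* one simulated upgrade for the page */
}
```

In this model the per-tuple path pays once per unhinted tuple while the
batched path pays at most once per page, and both pay nothing once the
page is fully hinted.  It says nothing about what an upgrade really
costs.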




