Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w0SFR-001v70-0P for pgsql-hackers@arkaria.postgresql.org; Wed, 11 Mar 2026 22:41:49 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w0SFP-00CP9Y-1b for pgsql-hackers@arkaria.postgresql.org; Wed, 11 Mar 2026 22:41:47 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w0SFO-00CP96-39 for pgsql-hackers@lists.postgresql.org; Wed, 11 Mar 2026 22:41:47 +0000 Received: from mail-ed1-x534.google.com ([2a00:1450:4864:20::534]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w0SFH-00000001gGS-28Td for pgsql-hackers@lists.postgresql.org; Wed, 11 Mar 2026 22:41:46 +0000 Received: by mail-ed1-x534.google.com with SMTP id 4fb4d7f45d1cf-660a58841d4so401973a12.0 for ; Wed, 11 Mar 2026 15:41:39 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1773268897; cv=none; d=google.com; s=arc-20240605; b=h/POH4B+sBfu4atBIRFkpxskMLoPcbVc2Oj0CFARW1w37Rd4RKS18kxyWcJCQQcvlp KT89qntVbPa9NY/qSDXd6TeE+3E3VHG9crRc6tE62bM427t635Yk05/j4zDT7oJlBk9E 7uzJnUlz5XPLr3vagsqJq42bLqRcOGJZWUhclwjbtktzGgpsSENLaySVp/r8YiciXoS8 Sc7qnmJ1r9S4ByxLDe4ySY/tlA5RtwDCKVaBb84nctZEQ3i68watFQjhjK7nJPgrHIwa AQ3sssTVJ0XJDs0p5O+R1WBWi5TGkwncWZZ4Og9lDzm/d86piK1eG+MwC3UiewIEFUfa 13DA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=zGqjuULfuQQKBryHwoGgnebPysQQt4nYQOqfh8Teffk=; fh=5lisVc/SdZrXESFhj6i3cPQivKRq2wrqq/p4/OVduqw=; b=idfAAIGiahm+uEnduChfzbwjQ6gPG1vyKUMwhLG6yqOb0MW2QkG5L/tR7VHdtZEua3 IEnZtLC2r5jLUPrYy3kU18AifnHRUVU99moWGXYfaskbTZKSwPewjH78iNOeyx2X+Mhm 1zK3KmfKhIbouWrfhomcJIPY+6IYPOMNV9aqU5lPAaKA6meXExIWAnLU4ilZZ39AkQ+r UqPR5r/wYa+c/4A+/1YKB09TvZbuP7ySW1tu5gt2nT1/5lC7v8FaAZFVVhKm6Dlj12Yq Y9Kf/061FzBvHkhmDI/DqwY86K8pOtMfgvD5+LjvplGrZSb+eiVcXFvg/okypzYXAxGs rZSg==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1773268897; x=1773873697; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=zGqjuULfuQQKBryHwoGgnebPysQQt4nYQOqfh8Teffk=; b=HcnIXyPd0NpdlAwt9+UJHmwHvICxcOhnOTNBNIvrR+5JzycEmvkxNJSl3AJMSXkiSU 6AI9RQKTAb8ZOQJMOZDn5aEvht8cIoXtS6q72imUKK2VqJsp5kQ7TaM2KwnbMf8sVw/a 6nDoOi581zyEbr4LlfBHvDaZMT5yHwFd7xrg+Ne1rwxF6cFzKnTLTd5huocGZVvQDTut 8FNUzFx32go0aVIV4pmM6VLrp3r98t6+hXbuGmpsF8Kak6mOyLYh37iMYg0etug5GRpm TVkUr82xF3a8p4yqPEIkaVZvQBUbWCI7TxHqHU+ZB/XhfOw0lPZKrzYyUiPjbQJD2FAI s4Iw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1773268897; x=1773873697; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=zGqjuULfuQQKBryHwoGgnebPysQQt4nYQOqfh8Teffk=; b=uUL+52YSFxrikFzpldMAceN6nq2+E8fJA/MXL5uP/Wa6KVWeSPtZSRJRuEx+/n1/d6 OtRA5vVv8+BHJtBu6FeYhMSNiI0bg0jGEpg270x1u3BzIGM3Vzc5Pmwt7++G0DSVdf/i RajBz9/R9xGkC8MLJyBU0xPq4Npl6J3hr0ncx+//SL+Ljg+qo9oZj+4hSh5/miSPX2E4 AbnCa9dvJcLSZeiEjN8y34OG3tehCb34MqJL46I3VbiTwyd32PpjQY/BNfogZCuulNaA ibZVnWAj6bDIkxLjWYyPK1npApU5heXoAiM1QF13d/kIJj3VhdI8rVuD/ZvVAaP/Sv8R eqgw== X-Gm-Message-State: AOJu0Yxtwq/9qA23OGLYXje+ymrbIpO0861Izx2GdZ30XNCpxpCeme/4 lMrvtMXmbNZdGjZBdiS6eB0NQmuP/fxckP5svrKP0gnB+ZgmX8LciA9P3K55iYXKK7oPjOhBYPx BhfCpKsiH+uIiydm8bvnIFXXq/XCibfkJ6t2l X-Gm-Gg: ATEYQzw3VVZW5R/gtPmnAK3L8a5wm+R3liT8gwOWQLhf54u56xaeV17nrCNQ6WkaHkW 0JXUzU2+Yx6aSIUR5Y+YJhZVtoALy1YLF3jnJWUuCnX8+k4NbhxdmXF5KXSYUNR/DHnqgf13Qiy Otm6eCZfltYEpqksuTJRsV6W+SBFfyEh90dBqE+OB7wLTfwCCil3Eb68YG0IJmvkkAWUN7UV6Mf ArvRPuX5EHjYWX+5l/OeD08H3ddbyhwYbF8OcBA/SOWWuPzCdQOVdJ7ys0ptngujNN0cQoZmGQr JoVOLu5W22l6jRYL1eatDyuPqlTBbXGTsa0sTI9XWzJc8R2c X-Received: by 2002:a17:907:1b15:b0:b8f:e438:75b9 with SMTP id a640c23a62f3a-b972e1cd679mr254198566b.22.1773268896693; Wed, 11 Mar 2026 15:41:36 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Alexandre Felipe Date: Wed, 11 Mar 2026 22:41:25 +0000 X-Gm-Features: AaiRm53BOesflnc8atYUju8vT2WTcveLO0p2a9XndhTo_pxjOtNnaXFjq9z5Jac Message-ID: Subject: Re: Addressing buffer private reference count scalability issue To: Andres Freund Cc: PostgreSQL Hackers Content-Type: multipart/alternative; boundary="000000000000adc169064cc75652" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000adc169064cc75652 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Wed, Mar 11, 2026 at 6:55=E2=80=AFPM Andres Freund = wrote: > Nice numbers! > > It'd be good to also evaluate in the context of queries, as such a focuse= d > microbenchmark will have much higher L1/L2 cache hit ratios than workload= s > that actually look at data > I ran with the query you first used to illustrate the issue, and it was slower. Went back and tried a strict copy of the code and it was 3% slower. passing CFLAGS +=3D -falign-functions=3D64 this shows no degradation. Maybe LTO is what we need here? It's not surprising it's worse, I just don't see how we could get away > without > some mixing. I will keep that for the future. > It's not at all crazy to have accesss patterns that just differ > in the higher bits of the buffer number (e.g. a nestloop join where the > inner > side always needs a certain number of buffers). We need to be somewhat > resistant to that causing a lot of collisions (which will trigger a > hashtable > growth cycle, without that necessarily fixing the issue!). > Yes, I understand that possibility. > > I brought back the array, but I eliminated the linear search. > > Why? In my benchmarks allowing vectorization helped a decent amount in re= al > queries, because it does away with all the branch misses. I basically treated the array as a hash table with a weak hash function and delegated collisions to simplehash. In the worse case would do a simple array[buffer & mask], never fails with one branch, and would fail in ~3% of the cases with two branches. And would eliminate the branches necessary to check the 1-entry cache. But notice that this was the last patch, if you like everything except that it is just a matter of picking 02, 03, 04. > 1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache. > > > > 2a. the dynamic array case > > REFCOUNT_ARRAY_MAX_SIZE!=3DREFCOUNT_ARRAY_INITIAL_SIZE > > will grow the array when it reaches a certain level of occupation. > > I have set the default occupation level to 86% so that, if enabled, for= a > > random input it will grow when we have about 2*size pins in total. > > If we find a sequential pattern then it will grow without growing the > hash > > table. > > For the array lookup I don't use a hash, so for small number of pins > > it will be very fast. > > I doubt it makes sense to basically have two levels of hash tables. > > > > 2b. the static case > > REFCOUNT_ARRAY_MAX_SIZE=3D=3DREFCOUNT_ARRAY_INITIAL_SIZE > > will use a static array, just as we had before and will not perform the > > linear > > search. It still has to read the size and do mask input. > > > > I tested the 4 variations and the winner is with the static array witho= ut > > the cache for the last entry. > > I increased the array size from 8 to 32, since you suggested before tha= t > > that this could help. At that point it would have the tradeoff of a > longer > > linear search, so it may help even more now. > > Does your benchmark actually test repeated accesses to the same refcount? > As > mentioned, those are very very common (access sequences like Pin, (Lock, > Unlock)+, Unpin are extremely common). Yes that is the case a FIFO with one buffer, I didn't include the lock unlock tests here. I did it a some point and it I don't think the new naming scheme (GetSharedBufferEntry etc) is good, as > that does not *at all* communicate that this is backend private state. > I suspected that, GetEntryForPrivateReferenceCountOfSharedBuffer would be more accurate right? I probably will stick with the original names. > I'd strongly advise separating moving code from a large scale rename. I > certainly won't waste time trying to see the difference between what was > just > moved and what was changed at the same time. > Would you prefer not moving the code at all? One of the main reasons for this was the changes in data structure on the patch 04, that I will not include in the next version. > > diff --git a/.gitignore b/.gitignore > > index 4e911395fe..fddb7f861d 100644 > > --- a/.gitignore > > +++ b/.gitignore > > @@ -43,3 +43,5 @@ lib*.pc > > /Release/ > > /tmp_install/ > > /portlock/ > > + > > +.* > > \ No newline at end of file > > What? Certainly not. > Do you mean, we should certainly not exclude hidden files from git? I usually build with prefix postgresql/.build/patch-*/ Then whenever I checkout something I have to keep adding this again. I don't think it's a good idea to introduce new simplehash infrastructure a= s > part of this larger change. Do you think it is worth doing that as a separate patch? Then we get it out of the way on this that probably will go a few more versions? You also haven't documented the new stuff. Do you mean as source comments, or is there a separate documentation for this? > > The previous implementation used an 8-bytes (64-bit) entry to store > > a uint32 count and an uint32 lock mode. That is fine when we store > > the data separate from the key (buffer). But in the simplehash > > {key, value} are stored together, so each entry is 12-bytes. > > This is somewhat awkward as we have to either pad the entry to 16-bytes= , > > or the access will be an alternating aligned/misaligned addreses. > > > > Lock can assume only 4 values, and 2^30 is a decent limit for the > > number of pins on a single buffer. So this change is packing the > > {count[31:2], lock[1:0]} into a single uint32. > > > > Incrementing/decrementing the count continue the same, just using > > 4 instead of 1, lock mode access will require one or two additional > > bitwise operations. The exact count requires one shift, and is used > > only for debugging. A special function is provided to check whether > > count =3D=3D 1. > > Have you actually evaluated the benefit from this? Pretty sceptical it's > worth > it. > I tested, and I agree, not worth it from a speed perspective. At this point the only part left is the introduction of the simplehash. However what I will try is to store just the buffer number in the hash and keep another array for the entries, who knows that works better. --000000000000adc169064cc75652 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable


On Wed, Mar 11,= 2026 at 6:55=E2=80=AFPM Andres Freund <andres@anarazel.de> wrote:
Nice numbers!

It'd be good to also evaluate in the context of queries, as such a focu= sed
microbenchmark will have much higher L1/L2 cache hit ratios than workloads<= br> that actually look at data=C2=A0

I ran = with the query you first used to illustrate the issue, and it was slower.Went back and tried a strict copy of the code and it was 3% slower.
=
passing CFLAGS=C2=A0+=3D -falign-functions=3D64 this shows no degradat= ion.
Maybe LTO is what we need here?

It's not surprising it's worse, I just don't see how we could g= et away without
some mixing.

I will keep that for the futur= e.

=C2=A0
It's not at all crazy to have accesss patterns that just= differ
in the higher bits of the buffer number (e.g. a nestloop join where the inn= er
side always needs a certain number of buffers). We need to be somewhat
resistant to that causing a lot of collisions (which will trigger a hashtab= le
growth cycle, without that necessarily fixing the issue!).
=

Yes, I understand=C2=A0that possibility.=C2=A0
=C2=A0
> I brought back the array, but I eliminated the linear search.

Why? In my benchmarks allowing vectorization helped a decent amount in real=
queries, because it does away with all the branch misses.
=
I basically treated the array as a hash table with a weak ha= sh function and
delegated collisions to simplehash.
In = the worse case would do a simple array[buffer & mask], never fails with=
one branch, and would fail in ~3% of the cases with two branches= .
And would eliminate the branches necessary to check the 1-entry= cache.

But notice that this was the last patch, i= f you like everything except that
it is just a matter of picking = 02, 03, 04.

> 1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache.
>
> 2a. the dynamic array case
> REFCOUNT_ARRAY_MAX_SIZE!=3DREFCOUNT_ARRAY_INITIAL_SIZE
> will grow the array when it reaches a certain level of occupation.
> I have set the default occupation level to 86% so that, if enabled, fo= r a
> random input it will grow when we have about 2*size pins in total.
> If we find a sequential pattern then it will grow without growing the = hash
> table.
> For the array lookup I don't use a hash, so for small number of pi= ns
> it will be very fast.

I doubt it makes sense to basically have two levels of hash tables.


> 2b. the static case
> REFCOUNT_ARRAY_MAX_SIZE=3D=3DREFCOUNT_ARRAY_INITIAL_SIZE
> will use a static array, just as we had before and will not perform th= e
> linear
> search. It still has to read the size and do mask input.
>
> I tested the 4 variations and the winner is with the static array with= out
> the cache for the last entry.
> I increased the array size from 8 to 32, since you suggested before th= at
> that this could help. At that point it would have the tradeoff of a lo= nger
> linear search, so it may help even more now.

Does your benchmark actually test repeated accesses to the same refcount? A= s
mentioned, those are very very common (access sequences like Pin, (Lock, Unlock)+, Unpin are extremely common).

Yes = that is the case a FIFO with one buffer, I didn't include the lock unlo= ck tests here.
I did it a some=C2=A0point and it=C2=A0

I don't think the new naming scheme (GetSharedBufferEntry etc) is good,= as
that does not *at all* communicate that this is backend private state.
<= /blockquote>

I suspected that, GetEntryForPrivateReferen= ceCountOfSharedBuffer would be
more accurate right?
I p= robably will stick with the original names.
=C2=A0
I'd strongly advise separating moving code from a large scale rename.= =C2=A0 I
certainly won't waste time trying to see the difference between what wa= s just
moved and what was changed at the same time.
=C2=A0
Would you prefer not moving the code at all? One of the main reason= s
for this was the changes in data structure on the patch 04, tha= t I will not
include in the next version.

=C2=A0
> diff --git a/.gitignore b/.gitignore
> index 4e911395fe..fddb7f861d 100644
> --- a/.gitignore
> +++ b/.gitignore
> @@ -43,3 +43,5 @@ lib*.pc
>=C2=A0 /Release/
>=C2=A0 /tmp_install/
>=C2=A0 /portlock/
> +
> +.*
> \ No newline at end of file

What? Certainly not.

Do you mean, we sh= ould certainly not exclude hidden files from git?
I usually build= with prefix postgresql/.build/patch-*/
Then whenever I checkout = something=C2=A0I have to keep adding this again.

I don't think it's a good idea to introduce new simplehash infrastr= ucture as
part of this larger change.
=C2=A0
Do you think = it is worth doing that as a separate patch? Then we get it
out of= the way on this that probably will go a few more versions?

<= /div>
You also haven't= documented the new stuff.
=C2=A0
Do you mean as= source comments, or is there a separate documentation
for this?<= /div>
=C2=A0
=C2=A0=C2=A0
> The previous implementation used an 8-bytes (64-bit) entry to store > a uint32 count and an uint32 lock mode. That is fine when we store
> the data separate from the key (buffer). But in the simplehash
> {key, value} are stored together, so each entry is 12-bytes.
> This is somewhat awkward as we have to either pad the entry to 16-byte= s,
> or the access will be an alternating aligned/misaligned addreses.
>
> Lock can assume only 4 values, and 2^30 is a decent limit for the
> number of pins on a single buffer. So this change is packing the
> {count[31:2], lock[1:0]} into a single uint32.
>
> Incrementing/decrementing the count continue the same, just using
> 4 instead of 1, lock mode access will require one or two additional > bitwise operations. The exact count requires one shift, and is used > only for debugging. A special function is provided to check whether > count =3D=3D 1.

Have you actually evaluated the benefit from this? Pretty sceptical it'= s worth
it.

I tested, and I agree, not worth it= from=C2=A0a speed perspective.
At this point the only part left = is the introduction of the simplehash.

However wha= t I will try is to store just the buffer number in the hash
and k= eep another array for the entries, who knows that works better.
<= br>


--000000000000adc169064cc75652--