MIME-Version: 1.0
References: 
 <CAE8JnxNTETEUiAOF31=_yo=pvyAi9npOeJfcTvEJJbi4vomtYA@mail.gmail.com>
 <krknshnvus4qhehtoqtwnroemgxqwlfmykark6umd6hf64xnku@ibxx734ds3ga>
 <CAE8JnxNqf=sYB-hfeHBEtXi+aC8jezqqukdgRuR2=t8nbetL=w@mail.gmail.com>
 <rfjyce5hmfkp2pbgjaxvmc76zy33kpokigbkwnounxfmz6uyd5@vt7yxibmfy6n>
In-Reply-To: <rfjyce5hmfkp2pbgjaxvmc76zy33kpokigbkwnounxfmz6uyd5@vt7yxibmfy6n>
From: Alexandre Felipe <o.alexandre.felipe@gmail.com>
Date: Wed, 11 Mar 2026 22:41:25 +0000
Message-ID: 
 <CAE8JnxP=oPCmZs70VQ6U=35uLceqJVBQz3qakPR79cSkt7HU-g@mail.gmail.com>
Subject: Re: Addressing buffer private reference count scalability issue
To: Andres Freund <andres@anarazel.de>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000adc169064cc75652"
Archived-At: 
 <https://www.postgresql.org/message-id/CAE8JnxP%3DoPCmZs70VQ6U%3D35uLceqJVBQz3qakPR79cSkt7HU-g%40mail.gmail.com>
Precedence: bulk

--000000000000adc169064cc75652
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Wed, Mar 11, 2026 at 6:55=E2=80=AFPM Andres Freund <andres@anarazel.de> =
wrote:

> Nice numbers!
>
> It'd be good to also evaluate in the context of queries, as such a focuse=
d
> microbenchmark will have much higher L1/L2 cache hit ratios than workload=
s
> that actually look at data
>

I ran the query you first used to illustrate the issue, and it was slower.
I went back and tried a strict copy of the code, and it was 3% slower.
Passing CFLAGS +=3D -falign-functions=3D64, it shows no degradation.
Maybe LTO is what we need here?


> It's not surprising it's worse, I just don't see how we could get away
> without some mixing.


I will keep that for the future.


> It's not at all crazy to have access patterns that just differ
> in the higher bits of the buffer number (e.g. a nestloop join where the
> inner side always needs a certain number of buffers). We need to be
> somewhat resistant to that causing a lot of collisions (which will
> trigger a hashtable growth cycle, without that necessarily fixing the
> issue!).
>

Yes, I understand that possibility.
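
To make the collision concern concrete, here is a small standalone C sketch
(not code from the patch). It counts how many distinct slots are hit when
buffer numbers differ only in their higher bits, first with a plain mask and
then after a mixing step. The mixer is the well-known MurmurHash3 32-bit
finalizer; the names mix32 and distinct_slots, the table size, and the
stride are assumptions chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 32               /* hypothetical size, power of two */
#define TABLE_MASK (TABLE_SIZE - 1)

/* MurmurHash3 32-bit finalizer: folds high-bit differences into low bits */
static inline uint32_t
mix32(uint32_t h)
{
    h ^= h >> 16;
    h *= 0x85ebca6bU;
    h ^= h >> 13;
    h *= 0xc2b2ae35U;
    h ^= h >> 16;
    return h;
}

/* Count the distinct slots used by n buffers spaced "stride" apart */
static inline int
distinct_slots(uint32_t first, uint32_t stride, int n, int use_mix)
{
    unsigned char used[TABLE_SIZE];
    int         count = 0;

    assert(n >= 0);
    memset(used, 0, sizeof(used));
    for (int i = 0; i < n; i++)
    {
        uint32_t    buf = first + (uint32_t) i * stride;
        uint32_t    slot = (use_mix ? mix32(buf) : buf) & TABLE_MASK;

        if (!used[slot])
        {
            used[slot] = 1;
            count++;
        }
    }
    return count;
}
```

With a stride of 1024 and a 32-slot table, distinct_slots(7, 1024, 16, 0)
puts all 16 buffers into the same slot, while the mixed variant spreads them
over several slots.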


> > I brought back the array, but I eliminated the linear search.
>
> Why? In my benchmarks allowing vectorization helped a decent amount in re=
al
> queries, because it does away with all the branch misses.


I basically treated the array as a hash table with a weak hash function and
delegated collisions to simplehash.
In the worst case it does a simple array[buffer & mask] lookup: a hit costs
one branch, and it misses in ~3% of the cases, which costs two branches.
It also eliminates the branches necessary to check the 1-entry cache.
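
As a rough illustration of that idea (a sketch under assumptions, not the
patch itself): the array is indexed directly with buffer & mask, and only a
collision falls through to a second structure. Here a small linear overflow
list stands in for simplehash, and every name (RefCountEntry, direct_slots,
refcount_lookup, refcount_get_or_create) is hypothetical. The only detail
taken from PostgreSQL is that 0 is the invalid buffer number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE 32               /* hypothetical; must be a power of two */
#define ARRAY_MASK (ARRAY_SIZE - 1)
#define OVERFLOW_MAX 64             /* capacity of the stand-in fallback */
#define INVALID_BUFFER 0            /* valid buffer numbers start at 1 */

typedef struct RefCountEntry
{
    int32_t     buffer;             /* key; INVALID_BUFFER if slot is free */
    int32_t     refcount;
} RefCountEntry;

static RefCountEntry direct_slots[ARRAY_SIZE];
static RefCountEntry overflow[OVERFLOW_MAX];    /* stand-in for simplehash */
static int  overflow_used = 0;

/* Find the entry for a buffer, or NULL if it is not tracked. */
static inline RefCountEntry *
refcount_lookup(int32_t buffer)
{
    RefCountEntry *slot = &direct_slots[buffer & ARRAY_MASK];

    if (slot->buffer == buffer)     /* common case: a single comparison */
        return slot;
    for (int i = 0; i < overflow_used; i++)     /* rare: collision path */
        if (overflow[i].buffer == buffer)
            return &overflow[i];
    return NULL;
}

/* Find or create the entry for a buffer. */
static inline RefCountEntry *
refcount_get_or_create(int32_t buffer)
{
    RefCountEntry *entry = refcount_lookup(buffer);
    RefCountEntry *slot = &direct_slots[buffer & ARRAY_MASK];

    assert(buffer != INVALID_BUFFER);
    if (entry != NULL)
        return entry;
    if (slot->buffer != INVALID_BUFFER)
    {
        /* direct slot is held by another buffer: use the fallback */
        assert(overflow_used < OVERFLOW_MAX);
        slot = &overflow[overflow_used++];
    }
    slot->buffer = buffer;
    slot->refcount = 0;
    return slot;
}
```

Removal and growth are left out to keep the sketch short.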

But note that this was the last patch; if you like everything except that,
it is just a matter of picking 02, 03, 04.

> > 1. USE_REFCOUNT_CACHE_ENTRY will enable the last entry cache.
> >
> > 2a. the dynamic array case
> > REFCOUNT_ARRAY_MAX_SIZE!=3DREFCOUNT_ARRAY_INITIAL_SIZE
> > will grow the array when it reaches a certain level of occupation.
> > I have set the default occupation level to 86% so that, if enabled, for
> > a random input it will grow when we have about 2*size pins in total.
> > If we find a sequential pattern then it will grow without growing the
> > hash table.
> > For the array lookup I don't use a hash, so for small number of pins
> > it will be very fast.
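
For reference, the 86% figure is what one would expect under the idealized
assumption that pins land on array slots uniformly and independently at
random (an assumption for this derivation, not a measurement). With m slots
and n pins, the expected occupied fraction is:

```latex
\[
  1 - \left(1 - \frac{1}{m}\right)^{n} \approx 1 - e^{-n/m},
  \qquad
  n = 2m \;\Rightarrow\; 1 - e^{-2} \approx 0.865 .
\]
```
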
>
> I doubt it makes sense to basically have two levels of hash tables.
>
>
> > 2b. the static case
> > REFCOUNT_ARRAY_MAX_SIZE=3D=3DREFCOUNT_ARRAY_INITIAL_SIZE
> > will use a static array, just as we had before and will not perform the
> > linear
> > search. It still has to read the size and do mask input.
> >
> > I tested the 4 variations and the winner is the static array without
> > the cache for the last entry.
> > I increased the array size from 8 to 32, since you suggested before
> > that this could help. At that point it would have the tradeoff of a
> > longer linear search, so it may help even more now.
>
> Does your benchmark actually test repeated accesses to the same refcount?
> As mentioned, those are very very common (access sequences like Pin,
> (Lock, Unlock)+, Unpin are extremely common).


Yes, that is the case of a FIFO with one buffer; I didn't include the
lock/unlock tests here.
I did it at some point and it

> I don't think the new naming scheme (GetSharedBufferEntry etc) is good, as
> that does not *at all* communicate that this is backend private state.
>

I suspected that. GetEntryForPrivateReferenceCountOfSharedBuffer would be
more accurate, right?
I will probably stick with the original names.


> I'd strongly advise separating moving code from a large scale rename.  I
> certainly won't waste time trying to see the difference between what was
> just moved and what was changed at the same time.
>

Would you prefer not moving the code at all? One of the main reasons
for the move was the data structure change in patch 04, which I will not
include in the next version.


> > diff --git a/.gitignore b/.gitignore
> > index 4e911395fe..fddb7f861d 100644
> > --- a/.gitignore
> > +++ b/.gitignore
> > @@ -43,3 +43,5 @@ lib*.pc
> >  /Release/
> >  /tmp_install/
> >  /portlock/
> > +
> > +.*
> > \ No newline at end of file
>
> What? Certainly not.
>

Do you mean we should certainly not exclude hidden files from git?
I usually build with the prefix postgresql/.build/patch-*/, so whenever I
check out something I have to add this again.

> I don't think it's a good idea to introduce new simplehash infrastructure
> as part of this larger change.


Do you think it is worth doing that as a separate patch? That would get it
out of the way of this one, which will probably go through a few more
versions.

> You also haven't documented the new stuff.


Do you mean as source comments, or is there separate documentation
for this?


> > The previous implementation used an 8-bytes (64-bit) entry to store
> > a uint32 count and an uint32 lock mode. That is fine when we store
> > the data separate from the key (buffer). But in the simplehash
> > {key, value} are stored together, so each entry is 12-bytes.
> > This is somewhat awkward, as we have to either pad the entry to 16 bytes
> > or the accesses will alternate between aligned and misaligned addresses.
> >
> > Lock can assume only 4 values, and 2^30 is a decent limit for the
> > number of pins on a single buffer. So this change is packing the
> > {count[31:2], lock[1:0]} into a single uint32.
> >
> > Incrementing/decrementing the count continue the same, just using
> > 4 instead of 1, lock mode access will require one or two additional
> > bitwise operations. The exact count requires one shift, and is used
> > only for debugging. A special function is provided to check whether
> > count =3D=3D 1.
>
> Have you actually evaluated the benefit from this? Pretty sceptical it's
> worth it.
>

I tested it, and I agree: it is not worth it from a speed perspective.
At this point the only part left is the introduction of the simplehash.
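
For anyone following the thread, the packing described in the quoted text
can be sketched roughly as below. This is an illustration written for this
message, not the code from the patch, and all names (packed_pin,
packed_set_lock, packed_count_is_one, ...) are hypothetical. The layout is
the one quoted above: count in bits 31:2, lock mode in bits 1:0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LOCK_BITS   2
#define LOCK_MASK   ((uint32_t) ((1U << LOCK_BITS) - 1))    /* 0x3 */
#define COUNT_ONE   ((uint32_t) (1U << LOCK_BITS))          /* 4 */

/* Pin: add 4, so the two lock bits are left untouched. */
static inline uint32_t
packed_pin(uint32_t v)
{
    return v + COUNT_ONE;
}

/* Unpin: subtract 4; the caller must hold at least one pin. */
static inline uint32_t
packed_unpin(uint32_t v)
{
    assert(v >= COUNT_ONE);
    return v - COUNT_ONE;
}

/* Lock mode lives in the low two bits: one mask to read it. */
static inline uint32_t
packed_get_lock(uint32_t v)
{
    return v & LOCK_MASK;
}

/* Replace the lock mode while preserving the count bits. */
static inline uint32_t
packed_set_lock(uint32_t v, uint32_t mode)
{
    assert(mode <= LOCK_MASK);
    return (v & ~LOCK_MASK) | mode;
}

/* The exact count needs one shift. */
static inline uint32_t
packed_count(uint32_t v)
{
    return v >> LOCK_BITS;
}

/* count == 1 without shifting: compare the count bits against 4. */
static inline bool
packed_count_is_one(uint32_t v)
{
    return (v & ~LOCK_MASK) == COUNT_ONE;
}
```
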

However, what I will try is to store just the buffer number in the hash
and keep a separate array for the entries; maybe that works better.
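
One possible reading of that layout, as a minimal sketch under assumptions
(nothing here has been benchmarked): the probed table holds only 4-byte
buffer numbers, and an 8-byte value lives at the same index in a parallel
array, so neither array has 12-byte entries. The names keys, values and
find_slot, the fixed size, and the plain-mask linear probing are all
hypothetical simplifications; growth and deletion are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 64                   /* hypothetical; power of two */
#define SLOT_MASK (NSLOTS - 1)
#define INVALID_BUFFER 0            /* valid buffer numbers start at 1 */

typedef struct RefCountValue
{
    uint32_t    refcount;
    uint32_t    lockmode;
} RefCountValue;                    /* 8 bytes, naturally aligned */

static int32_t keys[NSLOTS];            /* buffer numbers only */
static RefCountValue values[NSLOTS];    /* values[i] belongs to keys[i] */
static int  nused = 0;

/*
 * Linear probing over the key array only.  Returns the slot index, or -1
 * if the buffer is absent and create is false.
 */
static inline int
find_slot(int32_t buffer, int create)
{
    uint32_t    i = (uint32_t) buffer & SLOT_MASK;

    assert(buffer != INVALID_BUFFER);
    for (int probes = 0; probes < NSLOTS; probes++)
    {
        if (keys[i] == buffer)
            return (int) i;
        if (keys[i] == INVALID_BUFFER)
        {
            if (!create)
                return -1;
            assert(nused < NSLOTS - 1);     /* always keep one free slot */
            keys[i] = buffer;
            values[i].refcount = 0;
            values[i].lockmode = 0;
            nused++;
            return (int) i;
        }
        i = (i + 1) & SLOT_MASK;
    }
    return -1;
}
```
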


--000000000000adc169064cc75652--