MIME-Version: 1.0
References: 
 <CAEudQApbWon+3Eb9x4WW_D-JkSt2mvfx99dXu9VZ4AeCuTh=fw@mail.gmail.com>
 <CAEudQApEfvhNT1fEPURzVcQH7G0A1ukh_ugoCGaErV6_dbndCQ@mail.gmail.com>
 <CAF+pBj_RS2KErTqQ6ORXjhVzmukG7Ve0wHU1Kq56xjJfFKwVqA@mail.gmail.com>
 <CAEudQAptRymgvmd5hQb2mk-Ft89XcSo_xvC74kv4JBA9v=D4Sg@mail.gmail.com>
 <CAF+pBj-K2bgNQRc9ih01WFmAWUaQtVbS37jLtYdYh5LOwOkF6A@mail.gmail.com>
In-Reply-To: 
 <CAF+pBj-K2bgNQRc9ih01WFmAWUaQtVbS37jLtYdYh5LOwOkF6A@mail.gmail.com>
From: Bryan Green <dbryan.green@gmail.com>
Date: Mon, 9 Mar 2026 11:06:28 -0600
Message-ID: 
 <CAF+pBj9M8rCoS-EEBwLiA6hdxm_UNjxMXG9Vc4RU8xifqnQB-g@mail.gmail.com>
Subject: Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
To: Ranier Vilela <ranier.vf@gmail.com>
Cc: Pg Hackers <pgsql-hackers@postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000264c91064c9a6d8c"
Archived-At: 
 <https://www.postgresql.org/message-id/CAF%2BpBj9M8rCoS-EEBwLiA6hdxm_UNjxMXG9Vc4RU8xifqnQB-g%40mail.gmail.com>
Precedence: bulk

--000000000000264c91064c9a6d8c
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

I meant to add that the micro-benchmark ran each loop for 100,000,000
iterations per run, over 20 runs, to come up with those numbers.
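
A minimal sketch of a harness with that shape follows. It is not the
benchmark that produced the numbers: Key, copy_v1, copy_v2, time_one_run
and summarize are hypothetical names, and the two loop shapes are only an
assumption about what version 1 and version 2 look like. The noise% column
in the table matches (max - min) / min, so the sketch uses that; stddev is
left out to keep it short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-in for a scan key; the layout is an assumption. */
typedef struct Key { int attno; int strategy; long argument; } Key;

/* Assumed "version 1": one fixed-size memcpy per key inside the loop. */
static void copy_v1(Key *dst, const Key *src, int n)
{
    for (int i = 0; i < n; i++)
    {
        memcpy(&dst[i], &src[i], sizeof(Key));
        dst[i].attno = i + 1;       /* per-key fixup */
    }
}

/* Assumed "version 2": one variable-size memcpy, then the fixup loop. */
static void copy_v2(Key *dst, const Key *src, int n)
{
    memcpy(dst, src, (size_t) n * sizeof(Key));
    for (int i = 0; i < n; i++)
        dst[i].attno = i + 1;
}

typedef void (*copy_fn) (Key *, const Key *, int);

/* One run: time `iters` calls and return average ns per call. */
static double time_one_run(copy_fn fn, int n, long iters)
{
    Key src[8], dst[8];
    volatile long sink = 0;     /* keeps the copies observable */

    assert(n >= 1 && n <= 8 && iters > 0);
    memset(src, 0, sizeof src);

    clock_t start = clock();
    for (long i = 0; i < iters; i++)
    {
        fn(dst, src, n);
        sink += dst[n - 1].attno;
    }
    clock_t end = clock();

    (void) sink;
    return (double) (end - start) / CLOCKS_PER_SEC * 1e9 / (double) iters;
}

typedef struct Summary
{
    double min, median, mean, max, noise_pct;
} Summary;

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *) a, y = *(const double *) b;
    return (x > y) - (x < y);
}

/* Sorts the per-run samples in place and summarizes them. */
static Summary summarize(double *samples, int runs)
{
    Summary s;
    double sum = 0.0;
    int mid = runs / 2;

    assert(runs > 0);
    qsort(samples, (size_t) runs, sizeof(double), cmp_double);
    for (int i = 0; i < runs; i++)
        sum += samples[i];

    s.min = samples[0];
    s.max = samples[runs - 1];
    s.mean = sum / runs;
    s.median = (runs % 2) ? samples[mid]
                          : (samples[mid - 1] + samples[mid]) / 2.0;
    s.noise_pct = (s.max - s.min) / s.min * 100.0;
    return s;
}
```

Calling time_one_run(copy_v1, n, 100000000) 20 times and passing the
samples to summarize would give one row of the table. An optimizing
compiler can still hoist or elide work in a loop this simple, so the
generated assembly is worth checking before trusting the numbers.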

--bg

On Mon, Mar 9, 2026 at 11:02=E2=80=AFAM Bryan Green
<dbryan.green@gmail.com> wrote:

> I ran a micro-benchmark on my dual EPYC (Zen 2) server, and version 1
> wins for small values of n.
>
> 20 runs:
>
> n       version       min  median    mean     max  stddev  noise%
> -----------------------------------------------------------------------
> n=3D1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
> n=3D1     version2     4.260   4.280   4.277   4.290   0.007    0.7%
>
> n=3D2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
> n=3D2     version2     3.970   3.980   3.980   4.020   0.010    1.3%
>
> n=3D4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
> n=3D4     version2     5.780   5.815   5.809   5.820   0.013    0.7%
>
> But micro-benchmarks always make me nervous, so I looked at the actual
> instruction cost on my platform for the version 1 and version 2 code.
>
> Counting CPU cycles with the AMD Zen 2 instruction latency/throughput
> tables, the version 1 loop body has a critical path of ~5-6 cycles per
> iteration, while the version 2 loop body has ~3-4 cycles per iteration.
>
> The problem for version 2 is that the call to memcpy costs ~24-30 cycles
> because of the stub + function call + return, plus branch predictor
> pressure on the first call.  This probably results in a cost of ~2.5 ns
> per iteration for version 2.
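
As a back-of-envelope check on those cycle counts, assume the costs add
up linearly (a modelling assumption, not something measured): version 2
pays a fixed call overhead and then saves c1 - c2 cycles per key, so it
breaks even at roughly overhead / (c1 - c2) keys. break_even_n is a
hypothetical helper written only for this illustration:

```c
#include <assert.h>

/*
 * Smallest n for which n * c2 + call_overhead is no greater than n * c1,
 * where c1 and c2 are per-key cycle costs of version 1 and version 2.
 * Returns -1 if version 2 never catches up under this linear model.
 */
static int break_even_n(double c1, double c2, double call_overhead)
{
    double x;
    int n;

    assert(call_overhead >= 0.0);
    if (c1 <= c2)
        return -1;

    x = call_overhead / (c1 - c2);
    n = (int) x;
    if ((double) n < x)
        n++;                    /* round up without needing libm */
    return n;
}
```

With the figures quoted above this ranges from 8 keys (24 cycles of
overhead, 6 vs 3 cycles per key) to 30 keys (30 cycles of overhead, 5 vs 4
cycles per key), which is above the n values in the table where version 1
wins.
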
>
> So no, I wouldn't call it an optimization.  But it will be
> interesting to hear other opinions on this.
>
> --bg
>
>
> On Mon, Mar 9, 2026 at 10:25=E2=80=AFAM Ranier Vilela
> <ranier.vf@gmail.com> wrote:
>
>>
>>
>> On Mon, Mar 9, 2026 at 11:47, Bryan Green <dbryan.green@gmail.com>
>> wrote:
>>
>>> I created an example that is a little closer to the actual code and
>>> changed the compiler from C++ to C.
>>>
>>> It is interesting to compare the optimizations the compiler chose for
>>> version 1 and version 2: one calls memcpy and one doesn't.  There is a
>>> good chance that the inlining of memcpy as SSE+scalar per iteration
>>> will be faster for syscache scans, which I believe are usually small
>>> (1-4 keys?).
>>>
>> I doubt the inline version is better.
>> Clang is supported too, and the code it generates is much better with a
>> single memcpy call outside of the loop.
>>
>>
>>>
>>> Probably the only reason to do this patch would be if N is normally
>>> large or if this is considered an
>>> improvement in code clarity without a detrimental impact on small N
>>> syscache scans.
>>> I realize you only said "possible small optimization".  It might be
>>> worthwhile to benchmark the code for
>>> different values of n to determine if there is a tipping point either
>>> way?
>>>
>>  In your opinion, shouldn't this be considered an optimization, even a
>> small one?
>>
>> best regards,
>> Ranier Vilela
>>
>

--000000000000264c91064c9a6d8c--