MIME-Version: 1.0
References: 
 <CAEudQApbWon+3Eb9x4WW_D-JkSt2mvfx99dXu9VZ4AeCuTh=fw@mail.gmail.com>
 <CAEudQApEfvhNT1fEPURzVcQH7G0A1ukh_ugoCGaErV6_dbndCQ@mail.gmail.com>
 <CAF+pBj_RS2KErTqQ6ORXjhVzmukG7Ve0wHU1Kq56xjJfFKwVqA@mail.gmail.com>
 <CAEudQAptRymgvmd5hQb2mk-Ft89XcSo_xvC74kv4JBA9v=D4Sg@mail.gmail.com>
 <CAF+pBj-K2bgNQRc9ih01WFmAWUaQtVbS37jLtYdYh5LOwOkF6A@mail.gmail.com>
 <CAEudQApqk6DXWgqSBdHyH7+wSxJuk7D-DwkGODUcGkUWpYu0UA@mail.gmail.com>
In-Reply-To: 
 <CAEudQApqk6DXWgqSBdHyH7+wSxJuk7D-DwkGODUcGkUWpYu0UA@mail.gmail.com>
From: Bryan Green <dbryan.green@gmail.com>
Date: Thu, 12 Mar 2026 12:48:38 -0500
Message-ID: 
 <CAF+pBj-pAGnTh2un8RGcDqSYuMnwGhXv5_MteB77FNjf-Af=tg@mail.gmail.com>
Subject: Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
To: Ranier Vilela <ranier.vf@gmail.com>
Cc: Pg Hackers <pgsql-hackers@postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000c99f08064cd75df5"
Archived-At: 
 <https://www.postgresql.org/message-id/CAF%2BpBj-pAGnTh2un8RGcDqSYuMnwGhXv5_MteB77FNjf-Af%3Dtg%40mail.gmail.com>
Precedence: bulk

--000000000000c99f08064cd75df5
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

I don't think your version 1 memcpy is doing what you think it is doing.
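
A hint from the quoted numbers: 10000000 loops in roughly 1500 nanoseconds
works out to about 0.00015 ns per loop, far less than one clock cycle, which
suggests gcc -O2 removed the copy because nothing ever read its result. Below
is a minimal sketch of a harness that makes that harder. It assumes a
hypothetical DemoKey struct (not the real ScanKeyData) and hypothetical
function names, and it is not the attached memcpy1.c:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYS 8

/* Hypothetical stand-in for a scan key; NOT the real ScanKeyData layout. */
typedef struct DemoKey
{
	int			attno;
	uint64_t	argument;
} DemoKey;

/* One fixed-size copy per key; the compiler can usually inline these. */
void
copy_per_key(DemoKey *dst, const DemoKey *src, size_t nkeys)
{
	for (size_t i = 0; i < nkeys; i++)
		memcpy(&dst[i], &src[i], sizeof(DemoKey));
}

/* One variable-length copy covering all keys. */
void
copy_bulk(DemoKey *dst, const DemoKey *src, size_t nkeys)
{
	memcpy(dst, src, nkeys * sizeof(DemoKey));
}

/* Volatile sink: stores to it are side effects the compiler must keep. */
volatile uint64_t bench_sink;

typedef void (*copy_fn) (DemoKey *, const DemoKey *, size_t);

/* Run `loops` copies of `nkeys` keys and return a checksum of the copies. */
uint64_t
run_loops(copy_fn copy, size_t nkeys, uint64_t loops)
{
	DemoKey		src[MAX_KEYS];
	DemoKey		dst[MAX_KEYS];
	uint64_t	total = 0;

	assert(nkeys >= 1 && nkeys <= MAX_KEYS);
	memset(src, 0, sizeof(src));
	for (size_t i = 0; i < nkeys; i++)
		src[i].attno = (int) i + 1;

	for (uint64_t l = 0; l < loops; l++)
	{
		/* Change the input every loop so the copy is not loop-invariant. */
		src[0].argument = l;
		copy(dst, src, nkeys);

		/* Read back what was copied so the copy has an observable use. */
		for (size_t i = 0; i < nkeys; i++)
			total += (uint64_t) dst[i].attno + dst[i].argument;
		bench_sink = total;
	}
	return total;
}
```

Timing run_loops(copy_per_key, 4, loops) against run_loops(copy_bulk, 4, loops)
and checking that both return the same checksum gives some assurance that each
variant really performed the copies being measured.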

On Thu, Mar 12, 2026 at 12:35 PM Ranier Vilela <ranier.vf@gmail.com> wrote:

> Hi.
>
> On Mon, Mar 9, 2026 at 14:02, Bryan Green <dbryan.green@gmail.com> wrote:
>
>> I ran a micro-benchmark on my dual EPYC (Zen 2) server, and version 1
>> wins for small values of n.
>>
>> 20 runs:
>>
>> n       version        min  median    mean     max  stddev  noise%
>> ------------------------------------------------------------------
>> n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
>> n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%
>>
>> n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
>> n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%
>>
>> n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
>> n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%
>>
>> But micro-benchmarks always make me nervous, so I looked at the actual
>> instruction cost on my platform for the version 1 and version 2 code.
>>
>> If we count CPU cycles using the AMD Zen 2 instruction latency/throughput
>> tables: version 1 (loop body) has a critical path of ~5-6 cycles per
>> iteration. Version 2 (loop body) has ~3-4 cycles per iteration.
>>
>> The problem for version 2 is that the call to memcpy costs ~24-30 cycles
>> due to the stub + function call + return and branch predictor pressure
>> on the first call. This probably results in ~2.5 ns per iteration cost
>> for version 2.
>>
>> So no, I wouldn't call it an optimization. But it will be interesting
>> to hear other opinions on this.
>>
> I ran quick and dirty tests of two versions:
> gcc 15.2.0
> gcc -O2 memcpy1.c -o memcpy1
>
> The first test used 10000000 keys and 10000000 loops:
> version1: one memcpy call
> done in 1873 nanoseconds
>
> version2: inlined memcpy
> did not finish
>
> The second test used 4 keys and 10000000 loops:
> version1: one memcpy call
> version2: inlined memcpy call
>
> version1: done in 1519 nanoseconds
> version2: done in 104981851 nanoseconds
> (1.44692e-05 times faster)
>
> version1: done in 1979 nanoseconds
> version2: done in 110568901 nanoseconds
> (1.78983e-05 times faster)
>
> version1: done in 1814 nanoseconds
> version2: done in 108555484 nanoseconds
> (1.67103e-05 times faster)
>
> version1: done in 1631 nanoseconds
> version2: done in 109867919 nanoseconds
> (1.48451e-05 times faster)
>
> version1: done in 1269 nanoseconds
> version2: done in 111639106 nanoseconds
> (1.1367e-05 times faster)
>
> Unless I'm doing something wrong, the single memcpy call wins!
> memcpy1.c attached.
>
> best regards,
> Ranier Vilela
>
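
The cycle estimates quoted above imply a break-even key count. Here is a rough
sketch, assuming the simplest linear model: the per-key copy costs
per_key_inline * n cycles, and the single memcpy costs a fixed call overhead
plus per_key_bulk * n cycles. It reads version 1 as the per-key copy and
version 2 as the single memcpy call, which is an interpretation of the quoted
description, and the function name is hypothetical:

```c
#include <assert.h>

/*
 * Number of keys at which "call_overhead + per_key_bulk * n" equals
 * "per_key_inline * n". Linear cost model; an assumption, not a measurement.
 */
double
break_even_keys(double per_key_inline, double per_key_bulk,
				double call_overhead)
{
	/* The bulk copy can only catch up if its per-key cost is lower. */
	assert(per_key_inline > per_key_bulk);
	return call_overhead / (per_key_inline - per_key_bulk);
}
```

With the quoted ranges, break_even_keys(6.0, 3.0, 24.0) gives 8 and
break_even_keys(5.0, 4.0, 30.0) gives 30. Under this model the single call
would only pay off somewhere between 8 and 30 keys, which is in line with
version 1 winning at n=1, 2 and 4 in the quoted table.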

--000000000000c99f08064cd75df5--