MIME-Version: 1.0
References: 
 <CAEudQApbWon+3Eb9x4WW_D-JkSt2mvfx99dXu9VZ4AeCuTh=fw@mail.gmail.com>
 <CAEudQApEfvhNT1fEPURzVcQH7G0A1ukh_ugoCGaErV6_dbndCQ@mail.gmail.com>
 <CAF+pBj_RS2KErTqQ6ORXjhVzmukG7Ve0wHU1Kq56xjJfFKwVqA@mail.gmail.com>
 <CAEudQAptRymgvmd5hQb2mk-Ft89XcSo_xvC74kv4JBA9v=D4Sg@mail.gmail.com>
 <CAF+pBj-K2bgNQRc9ih01WFmAWUaQtVbS37jLtYdYh5LOwOkF6A@mail.gmail.com>
 <CAEudQApqk6DXWgqSBdHyH7+wSxJuk7D-DwkGODUcGkUWpYu0UA@mail.gmail.com>
 <CAF+pBj-pAGnTh2un8RGcDqSYuMnwGhXv5_MteB77FNjf-Af=tg@mail.gmail.com>
In-Reply-To: 
 <CAF+pBj-pAGnTh2un8RGcDqSYuMnwGhXv5_MteB77FNjf-Af=tg@mail.gmail.com>
From: Bryan Green <dbryan.green@gmail.com>
Date: Thu, 12 Mar 2026 14:21:12 -0500
Message-ID: 
 <CAF+pBj87nrr1jGadSRqZrhik+Y9T7V5e7177MOJX6=8KjUgOkA@mail.gmail.com>
Subject: Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
To: Ranier Vilela <ranier.vf@gmail.com>
Cc: Pg Hackers <pgsql-hackers@postgresql.org>
Content-Type: multipart/alternative; boundary="000000000000e50a4b064cd8a837"
Archived-At: 
 <https://www.postgresql.org/message-id/CAF%2BpBj87nrr1jGadSRqZrhik%2BY9T7V5e7177MOJX6%3D8KjUgOkA%40mail.gmail.com>
Precedence: bulk

--000000000000e50a4b064cd8a837
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

I modified your memcpy1.c program so the version functions are not inlined.
I also changed the memcpy call in version 1, added volatile sinks to block
some dead-code elimination (DCE), and ran a range of N values so the compiler
cannot specialize the code for N = 4.  Before these changes the compiler
eliminated the work entirely and the test1 function was just a ret.

The interesting issue is malloc versus the stack.  The malloc case will
probably track more closely with PG's use of palloc, so in that case I would
say this is an optimization.  It might be fun to compile PG with and without
the patch (in debug mode) and actually see what gets generated for this
function.

Here are the results I got using your modified benchmark:
--- stack allocated ---
stack  n=1  v1(patch): 49721599 ns  v2(original): 21477302 ns  ratio: 2.315  original wins
stack  n=2  v1(patch): 52065462 ns  v2(original): 28765199 ns  ratio: 1.810  original wins
stack  n=3  v1(patch): 58914958 ns  v2(original): 39726110 ns  ratio: 1.483  original wins
stack  n=4  v1(patch): 64585275 ns  v2(original): 47046397 ns  ratio: 1.373  original wins
stack  n=5  v1(patch): 73929844 ns  v2(original): 58588698 ns  ratio: 1.262  original wins
stack  n=6  v1(patch): 95465376 ns  v2(original): 67807817 ns  ratio: 1.408  original wins
stack  n=7  v1(patch): 86910226 ns  v2(original): 76999488 ns  ratio: 1.129  original wins
stack  n=8  v1(patch): 107765417 ns  v2(original): 86046016 ns  ratio: 1.252  original wins

--- malloc allocated ---
malloc n=1  v1(patch): 133283824 ns  v2(original): 141361091 ns  ratio: 0.943  patch wins
malloc n=2  v1(patch): 145625895 ns  v2(original): 180912711 ns  ratio: 0.805  patch wins
malloc n=3  v1(patch): 153975594 ns  v2(original): 228459879 ns  ratio: 0.674  patch wins
malloc n=4  v1(patch): 154483094 ns  v2(original): 248157408 ns  ratio: 0.623  patch wins
malloc n=5  v1(patch): 157710598 ns  v2(original): 298795018 ns  ratio: 0.528  patch wins
malloc n=6  v1(patch): 165196636 ns  v2(original): 332940132 ns  ratio: 0.496  patch wins
malloc n=7  v1(patch): 169576370 ns  v2(original): 358438778 ns  ratio: 0.473  patch wins
malloc n=8  v1(patch): 184463815 ns  v2(original): 403721513 ns  ratio: 0.457  patch wins


The modified program:

#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

typedef void (*RegProcedure)(void);
typedef uintptr_t Datum;

typedef struct ScanKeyData
{
    int         sk_flags;
    int         sk_attno;
    RegProcedure sk_func;
    Datum       sk_argument;
} ScanKeyData;

/* version1: bulk memcpy + fixup (the patch) */
static __attribute__((noinline))
void version1_stack(int n, const ScanKeyData *key, ScanKeyData *idxkey)
{
    memcpy(idxkey, key, n * sizeof(ScanKeyData));
    for (int i = 0; i < n; i++)
        idxkey[i].sk_attno = i + 1;
}

/* version2: per-element memcpy + fixup (the original) */
static __attribute__((noinline))
void version2_stack(int n, const ScanKeyData *key, ScanKeyData *idxkey)
{
    for (int i = 0; i < n; i++)
    {
        memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
        idxkey[i].sk_attno = i + 1;
    }
}

/* version1: bulk memcpy + fixup (the patch) */
static __attribute__((noinline))
ScanKeyData *version1_malloc(int n, const ScanKeyData *key)
{
    ScanKeyData *idxkey = (ScanKeyData *) malloc(n * sizeof(ScanKeyData));

    memcpy(idxkey, key, n * sizeof(ScanKeyData));
    for (int i = 0; i < n; i++)
        idxkey[i].sk_attno = i + 1;

    return idxkey;
}

/* version2: per-element memcpy + fixup (the original) */
static __attribute__((noinline))
ScanKeyData *version2_malloc(int n, const ScanKeyData *key)
{
    ScanKeyData *idxkey = (ScanKeyData *) malloc(n * sizeof(ScanKeyData));

    for (int i = 0; i < n; i++)
    {
        memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
        idxkey[i].sk_attno = i + 1;
    }

    return idxkey;
}

#define NANOSEC_PER_SEC 1000000000

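/* Elapsed time t1 - t2 in nanoseconds. */
int64_t
get_clock_diff(struct timespec *t1, struct timespec *t2)
{
    /* Widen before multiplying so a 32-bit time_t cannot overflow. */
    int64_t nanosec = (int64_t) (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC;
    nanosec += (t1->tv_nsec - t2->tv_nsec);
    return nanosec;
}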

#define MAX_KEYS 8
#define LOOPS 10000000

void test_stack(int n)
{
    ScanKeyData  keys[MAX_KEYS];
    ScanKeyData  idxkey[MAX_KEYS];
    struct timespec start, end;
    int64_t      version1_time, version2_time;

    memset(keys, 0, sizeof(keys));

    /* warmup */
    for (int i = 0; i < 1000; i++)
    {
        version1_stack(n, keys, idxkey);
        /* volatile read-back keeps the copy from being optimized away */
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < LOOPS; i++)
    {
        version1_stack(n, keys, idxkey);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    version1_time = get_clock_diff(&end, &start);

    /* warmup */
    for (int i = 0; i < 1000; i++)
    {
        version2_stack(n, keys, idxkey);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < LOOPS; i++)
    {
        version2_stack(n, keys, idxkey);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    version2_time = get_clock_diff(&end, &start);

    /* int64_t is not necessarily long, so cast to match %lld */
    printf("stack  n=%d  v1(patch): %lld ns  v2(original): %lld ns  ratio: %.3f  %s\n",
           n, (long long) version1_time, (long long) version2_time,
           (double) version1_time / version2_time,
           version1_time < version2_time ? "patch wins" : "original wins");
}

void test_malloc(int n)
{
    ScanKeyData  keys[MAX_KEYS];
    ScanKeyData  *idxkey;
    struct timespec start, end;
    int64_t      version1_time, version2_time;

    memset(keys, 0, sizeof(keys));

    /* warmup */
    for (int i = 0; i < 1000; i++)
    {
        idxkey = version1_malloc(n, keys);
        /* volatile read-back keeps the copy from being optimized away */
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
        free(idxkey);
    }

    /* note: malloc and free are both inside the timed loop */
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < LOOPS; i++)
    {
        idxkey = version1_malloc(n, keys);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
        free(idxkey);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    version1_time = get_clock_diff(&end, &start);

    /* warmup */
    for (int i = 0; i < 1000; i++)
    {
        idxkey = version2_malloc(n, keys);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
        free(idxkey);
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < LOOPS; i++)
    {
        idxkey = version2_malloc(n, keys);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
        free(idxkey);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    version2_time = get_clock_diff(&end, &start);

    /* int64_t is not necessarily long, so cast to match %lld */
    printf("malloc n=%d  v1(patch): %lld ns  v2(original): %lld ns  ratio: %.3f  %s\n",
           n, (long long) version1_time, (long long) version2_time,
           (double) version1_time / version2_time,
           version1_time < version2_time ? "patch wins" : "original wins");
}

int main(void)
{
    printf("--- stack allocated ---\n");
    for (int n = 1; n <= MAX_KEYS; n++)
        test_stack(n);

    printf("\n--- malloc allocated ---\n");
    for (int n = 1; n <= MAX_KEYS; n++)
        test_malloc(n);

    return 0;
}


-- bg


On Thu, Mar 12, 2026 at 12:48 PM Bryan Green <dbryan.green@gmail.com> wrote:

> I don't think your version 1 memcpy is doing what you think it is doing.
>
> On Thu, Mar 12, 2026 at 12:35 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
>
>> Hi.
>>
>> On Mon, Mar 9, 2026 at 14:02, Bryan Green <dbryan.green@gmail.com> wrote:
>>
>>> I performed a micro-benchmark on my dual EPYC (Zen 2) server and version 1
>>> wins for small values of n.
>>>
>>> 20 runs:
>>>
>>> n       version       min  median    mean     max  stddev  noise%
>>> -----------------------------------------------------------------------
>>> n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
>>> n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%
>>>
>>> n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
>>> n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%
>>>
>>> n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
>>> n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%
>>>
>>> But micro-benchmarks always make me nervous, so I looked at the actual
>>> instruction cost on my platform for the version 1 and version 2 code.
>>>
>>> If we count CPU cycles using the AMD Zen 2 instruction latency/throughput
>>> tables, version 1 (loop body) has a critical path of ~5-6 cycles per
>>> iteration, and version 2 (loop body) has ~3-4 cycles per iteration.
>>>
>>> The problem for version 2 is that the call to memcpy costs ~24-30 cycles
>>> due to the stub + function call + return, plus branch predictor pressure
>>> on the first call.  This probably results in a cost of ~2.5 ns per
>>> iteration for version 2.
>>>
>>> So, no, I wouldn't call it an optimization.  But it will be interesting
>>> to hear other opinions on this.
>>>
>> I ran quick and dirty tests with two versions:
>> gcc 15.2.0
>> gcc -O2 memcpy1.c -o memcpy1
>>
>> The first test used 10000000 keys and 10000000 loops:
>> version1: one memcpy call
>> done in 1873 nanoseconds
>>
>> version2: inlined memcpy
>> did not finish
>>
>> The second test used 4 keys and 10000000 loops:
>> version1: one memcpy call
>> version2: inlined memcpy call
>>
>> version1: done in 1519 nanoseconds
>> version2: done in 104981851 nanoseconds
>> (1.44692e-05 times faster)
>>
>> version1: done in 1979 nanoseconds
>> version2: done in 110568901 nanoseconds
>> (1.78983e-05 times faster)
>>
>> version1: done in 1814 nanoseconds
>> version2: done in 108555484 nanoseconds
>> (1.67103e-05 times faster)
>>
>> version1: done in 1631 nanoseconds
>> version2: done in 109867919 nanoseconds
>> (1.48451e-05 times faster)
>>
>> version1: done in 1269 nanoseconds
>> version2: done in 111639106 nanoseconds
>> (1.1367e-05 times faster)
>>
>> Unless I'm doing something wrong, a single memcpy call wins!
>> memcpy1.c attached.
>>
>> best regards,
>> Ranier Vilela
>>
>

--000000000000e50a4b064cd8a837--