From: Bryan Green
Date: Thu, 12 Mar 2026 14:21:12 -0500
Subject: Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
To: Ranier Vilela
Cc: Pg Hackers

I modified your memcpy1.c program so that the version functions are not
inlined. I changed the memcpy call in version 1, added volatile to keep
some DCE (dead-code elimination) opportunities from happening, and added
a range of N values to keep the compiler from specializing the code for
N = 4. Before that, the compiler did DCE and the test1 function was just
a ret.

The interesting issue is the use of malloc versus the stack. The use of
malloc will probably track closer with PG's use of palloc, so I would
say in that case this is an optimization. It might be fun to compile PG
with and without the patch (in debug mode) and actually see what gets
generated for this function.

Here are the results I got using your modified benchmark:
--- stack allocated ---
stack  n=1  v1(patch): 49721599 ns  v2(original): 21477302 ns  ratio: 2.315  original wins
stack  n=2  v1(patch): 52065462 ns  v2(original): 28765199 ns  ratio: 1.810  original wins
stack  n=3  v1(patch): 58914958 ns  v2(original): 39726110 ns  ratio: 1.483  original wins
stack  n=4  v1(patch): 64585275 ns  v2(original): 47046397 ns  ratio: 1.373  original wins
stack  n=5  v1(patch): 73929844 ns  v2(original): 58588698 ns  ratio: 1.262  original wins
stack  n=6  v1(patch): 95465376 ns  v2(original): 67807817 ns  ratio: 1.408  original wins
stack  n=7  v1(patch): 86910226 ns  v2(original): 76999488 ns  ratio: 1.129  original wins
stack  n=8  v1(patch): 107765417 ns  v2(original): 86046016 ns  ratio: 1.252  original wins

--- malloc allocated ---
malloc n=1  v1(patch): 133283824 ns  v2(original): 141361091 ns  ratio: 0.943  patch wins
malloc n=2  v1(patch): 145625895 ns  v2(original): 180912711 ns  ratio: 0.805  patch wins
malloc n=3  v1(patch): 153975594 ns  v2(original): 228459879 ns  ratio: 0.674  patch wins
malloc n=4  v1(patch): 154483094 ns  v2(original): 248157408 ns  ratio: 0.623  patch wins
malloc n=5  v1(patch): 157710598 ns  v2(original): 298795018 ns  ratio: 0.528  patch wins
malloc n=6  v1(patch): 165196636 ns  v2(original): 332940132 ns  ratio: 0.496  patch wins
malloc n=7  v1(patch): 169576370 ns  v2(original): 358438778 ns  ratio: 0.473  patch wins
malloc n=8  v1(patch): 184463815 ns  v2(original): 403721513 ns  ratio: 0.457  patch wins


The modified program:

#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

typedef void (*RegProcedure)(void);
typedef uintptr_t Datum;

typedef struct ScanKeyData
{
    int         sk_flags;
    int         sk_attno;
    RegProcedure sk_func;
    Datum       sk_argument;
} ScanKeyData;

/* version1: bulk memcpy + fixup (the patch) */
static __attribute__((noinline))
void version1_stack(int n, const ScanKeyData *key, ScanKeyData *idxkey)
{
    memcpy(idxkey, key, n * sizeof(ScanKeyData));
    for (int i = 0; i < n; i++)
        idxkey[i].sk_attno = i + 1;
}

/* version2: per-element memcpy + fixup (the original) */
static __attribute__((noinline))
void version2_stack(int n, const ScanKeyData *key, ScanKeyData *idxkey)
{
    for (int i = 0; i < n; i++)
    {
        memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
        idxkey[i].sk_attno = i + 1;
    }
}

/* version1: bulk memcpy + fixup (the patch) */
static __attribute__((noinline))
ScanKeyData *version1_malloc(int n, const ScanKeyData *key)
{
    ScanKeyData *idxkey = (ScanKeyData *) malloc(n * sizeof(ScanKeyData));

    memcpy(idxkey, key, n * sizeof(ScanKeyData));
    for (int i = 0; i < n; i++)
        idxkey[i].sk_attno = i + 1;

    return idxkey;
}

/* version2: per-element memcpy + fixup (the original) */
static __attribute__((noinline))
ScanKeyData *version2_malloc(int n, const ScanKeyData *key)
{
    ScanKeyData *idxkey = (ScanKeyData *) malloc(n * sizeof(ScanKeyData));

    for (int i = 0; i < n; i++)
    {
        memcpy(&idxkey[i], &key[i], sizeof(ScanKeyData));
        idxkey[i].sk_attno = i + 1;
    }

    return idxkey;
}

#define NANOSEC_PER_SEC 1000000000

/* Return t1 - t2 in nanoseconds. */
int64_t
get_clock_diff(struct timespec *t1, struct timespec *t2)
{
    /* widen before multiplying so a narrow time_t cannot overflow */
    int64_t nanosec = (int64_t) (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC;
    nanosec += (t1->tv_nsec - t2->tv_nsec);
    return nanosec;
}

#define MAX_KEYS 8
#define LOOPS 10000000

void test_stack(int n)
{
    ScanKeyData  keys[MAX_KEYS];
    ScanKeyData  idxkey[MAX_KEYS];
    struct timespec start, end;
    int64_t      version1_time, version2_time;

    memset(keys, 0, sizeof(keys));

    /* warmup */
    for (int i = 0; i < 1000; i++)
    {
        version1_stack(n, keys, idxkey);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < LOOPS; i++)
    {
        version1_stack(n, keys, idxkey);
        /* volatile read keeps the compiler from discarding the copy */
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    version1_time = get_clock_diff(&end, &start);

    /* warmup */
    for (int i = 0; i < 1000; i++)
    {
        version2_stack(n, keys, idxkey);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < LOOPS; i++)
    {
        version2_stack(n, keys, idxkey);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    version2_time = get_clock_diff(&end, &start);

    /* cast to long long: int64_t is not "long" on every platform */
    printf("stack  n=%d  v1(patch): %lld ns  v2(original): %lld ns  ratio: %.3f  %s\n",
           n, (long long) version1_time, (long long) version2_time,
           (double) version1_time / version2_time,
           version1_time < version2_time ? "patch wins" : "original wins");
}

void test_malloc(int n)
{
    ScanKeyData  keys[MAX_KEYS];
    ScanKeyData  *idxkey;
    struct timespec start, end;
    int64_t      version1_time, version2_time;

    memset(keys, 0, sizeof(keys));

    /* warmup */
    for (int i = 0; i < 1000; i++)
    {
        idxkey = version1_malloc(n, keys);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
        free(idxkey);
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < LOOPS; i++)
    {
        idxkey = version1_malloc(n, keys);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
        free(idxkey);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    version1_time = get_clock_diff(&end, &start);

    /* warmup */
    for (int i = 0; i < 1000; i++)
    {
        idxkey = version2_malloc(n, keys);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
        free(idxkey);
    }

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    for (int i = 0; i < LOOPS; i++)
    {
        idxkey = version2_malloc(n, keys);
        volatile int sink = idxkey[n-1].sk_attno;
        (void) sink;
        free(idxkey);
    }
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
    version2_time = get_clock_diff(&end, &start);

    /* cast to long long: int64_t is not "long" on every platform */
    printf("malloc n=%d  v1(patch): %lld ns  v2(original): %lld ns  ratio: %.3f  %s\n",
           n, (long long) version1_time, (long long) version2_time,
           (double) version1_time / version2_time,
           version1_time < version2_time ? "patch wins" : "original wins");
}

int main(void)
{
    printf("--- stack allocated ---\n");
    for (int n = 1; n <= MAX_KEYS; n++)
        test_stack(n);

    printf("\n--- malloc allocated ---\n");
    for (int n = 1; n <= MAX_KEYS; n++)
        test_malloc(n);

    return 0;
}

-- bg




On Thu, Mar 12, 2026 at 12:48 PM Bryan Green <dbryan.green@gmail.com> wrote:

> I don't think your version 1 memcpy is doing what you think it is doing.
>
> On Thu, Mar 12, 2026 at 12:35 PM Ranier Vilela <ranier.vf@gmail.com> wrote:
>
>> Hi.
>>
>> On Mon, Mar 9, 2026 at 14:02, Bryan Green <dbryan.green@gmail.com> wrote:
>>
>>> I performed a micro-benchmark on my dual EPYC (Zen 2) server and
>>> version 1 wins for small values of n.
>>>
>>> 20 runs:
>>>
>>> n       version       min  median    mean     max  stddev  noise%
>>> -----------------------------------------------------------------------
>>> n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
>>> n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%
>>>
>>> n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
>>> n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%
>>>
>>> n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
>>> n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%
>>>
>>> But micro-benchmarks always make me nervous, so I looked at the actual
>>> instruction cost for my platform given the version 1 and version 2
>>> code.
>>>
>>> If we count CPU cycles using the AMD Zen 2 instruction
>>> latency/throughput tables: version 1 (loop body) has a critical path
>>> of ~5-6 cycles per iteration. version 2 (loop body) has ~3-4 cycles
>>> per iteration.
>>>
>>> The problem for version 2 is that the call to memcpy is ~24-30 cycles
>>> due to the stub + function call + return and branch predictor pressure
>>> on first call. This probably results in ~2.5 ns per iteration cost for
>>> version 2.
>>>
>>> So, no, I wouldn't call it an optimization. But it will be interesting
>>> to hear other opinions on this.
>>>
>> I ran quick and dirty tests with two versions:
>> gcc 15.2.0
>> gcc -O2 memcpy1.c -o memcpy1
>>
>> The first test was with 10000000 keys and 10000000 loops:
>> version1: one memcpy call
>> done in 1873 nanoseconds
>>
>> version2: inlined memcpy
>> did not finish
>>
>> The second test was with 4 keys and 10000000 loops:
>> version1: one memcpy call
>> version2: inlined memcpy call
>>
>> version1: done in 1519 nanoseconds
>> version2: done in 104981851 nanoseconds
>> (1.44692e-05 times faster)
>>
>> version1: done in 1979 nanoseconds
>> version2: done in 110568901 nanoseconds
>> (1.78983e-05 times faster)
>>
>> version1: done in 1814 nanoseconds
>> version2: done in 108555484 nanoseconds
>> (1.67103e-05 times faster)
>>
>> version1: done in 1631 nanoseconds
>> version2: done in 109867919 nanoseconds
>> (1.48451e-05 times faster)
>>
>> version1: done in 1269 nanoseconds
>> version2: done in 111639106 nanoseconds
>> (1.1367e-05 times faster)
>>
>> Unless I'm doing something wrong, one call memcpy wins!
>> memcpy1.c attached.
>>
>> best regards,
>> Ranier Vilela