From: Bryan Green
Date: Thu, 12 Mar 2026 12:48:38 -0500
Subject: Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
To: Ranier Vilela
Cc: Pg Hackers

I don't think your version 1 memcpy is doing what you think it is doing.

On Thu, Mar 12, 2026 at 12:35 PM Ranier Vilela wrote:

> Hi.
>
> On Mon, Mar 9, 2026 at 14:02, Bryan Green wrote:
>
>> I performed a micro-benchmark on my dual EPYC (Zen 2) server, and version
>> 1 wins for small values of n.
>>
>> 20 runs:
>>
>> n     version      min  median    mean     max  stddev  noise%
>> ---------------------------------------------------------------
>> n=1   version1   2.440   2.440   2.450   2.550   0.024    4.5%
>> n=1   version2   4.260   4.280   4.277   4.290   0.007    0.7%
>>
>> n=2   version1   2.740   2.750   2.757   2.880   0.029    5.1%
>> n=2   version2   3.970   3.980   3.980   4.020   0.010    1.3%
>>
>> n=4   version1   4.580   4.595   4.649   4.910   0.094    7.2%
>> n=4   version2   5.780   5.815   5.809   5.820   0.013    0.7%
>>
>> But micro-benchmarks always make me nervous, so I looked at the actual
>> instruction cost for my platform given the version 1 and version 2 code.
>>
>> If we count CPU cycles using the AMD Zen 2 instruction latency/throughput
>> tables: version 1 (loop body) has a critical path of ~5-6 cycles per
>> iteration. Version 2 (loop body) has ~3-4 cycles per iteration.
>>
>> The problem for version 2 is that the call to memcpy is ~24-30 cycles due
>> to the stub + function call + return and branch predictor pressure on
>> first call. This probably results in ~2.5 ns per iteration cost for
>> version 2.
>>
>> So, no, I wouldn't call it an optimization. But it will be interesting
>> to hear other opinions on this.
>>
> I ran quick and dirty tests with two versions:
> gcc 15.2.0
> gcc -O2 memcpy1.c -o memcpy1
>
> The first test used 10000000 keys and 10000000 loops:
> version1: one memcpy call
> done in 1873 nanoseconds
>
> version2: inlined memcpy
> did not finish
>
> The second test used 4 keys and 10000000 loops:
> version1: one memcpy call
> version2: inlined memcpy call
>
> version1: done in 1519 nanoseconds
> version2: done in 104981851 nanoseconds
> (1.44692e-05 times faster)
>
> version1: done in 1979 nanoseconds
> version2: done in 110568901 nanoseconds
> (1.78983e-05 times faster)
>
> version1: done in 1814 nanoseconds
> version2: done in 108555484 nanoseconds
> (1.67103e-05 times faster)
>
> version1: done in 1631 nanoseconds
> version2: done in 109867919 nanoseconds
> (1.48451e-05 times faster)
>
> version1: done in 1269 nanoseconds
> version2: done in 111639106 nanoseconds
> (1.1367e-05 times faster)
>
> Unless I'm doing something wrong, the single memcpy call wins!
> memcpy1.c attached.
>
> best regards,
> Ranier Vilela
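A quick sanity check on the quoted numbers: 1519 nanoseconds for 10000000 loops works out to about 0.00015 ns per loop. On a core running at roughly 3 GHz (an assumed figure, for illustration only) one cycle is about 0.33 ns, so that timing is thousands of times shorter than a single cycle per loop. The timed region very likely did not perform 10000000 copies. One common cause at -O2 is that the compiler sees the destination is never read and removes the copy, or hoists it out of the loop. memcpy1.c is not reproduced in this message, so the following is only a minimal sketch of how a harness can keep both variants honest. The names Key, copy_fn, copy_bulk, copy_each, keep_alive and ns_per_loop are hypothetical, Key is not PostgreSQL's ScanKeyData, and the empty asm barrier uses GCC/Clang syntax.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define MAX_KEYS 8

/* Hypothetical stand-in for a scan key; NOT PostgreSQL's ScanKeyData. */
typedef struct Key
{
	int			attno;
	int			flags;
	int64_t		arg;
} Key;

typedef void (*copy_fn) (Key *dst, const Key *src, int n);

/* One memcpy call covering all n keys. */
static void
copy_bulk(Key *dst, const Key *src, int n)
{
	memcpy(dst, src, sizeof(Key) * (size_t) n);
}

/* One fixed-size memcpy per key; compilers typically inline these. */
static void
copy_each(Key *dst, const Key *src, int n)
{
	for (int i = 0; i < n; i++)
		memcpy(&dst[i], &src[i], sizeof(Key));
}

/*
 * Empty asm statement that takes the pointer as an input and clobbers
 * memory.  The compiler must assume the pointed-to data may be read or
 * written here, so it cannot drop or hoist the copy (GCC/Clang only).
 */
static inline void
keep_alive(void *p)
{
	__asm__ volatile("" : : "r"(p) : "memory");
}

/* Average wall-clock nanoseconds per loop for one copy variant. */
static double
ns_per_loop(copy_fn fn, int n, long loops)
{
	Key			src[MAX_KEYS];
	Key			dst[MAX_KEYS];
	struct timespec t0, t1;

	assert(n >= 0 && n <= MAX_KEYS && loops > 0);
	memset(src, 0, sizeof(src));
	memset(dst, 0, sizeof(dst));

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long i = 0; i < loops; i++)
	{
		keep_alive(src);		/* source may have changed */
		fn(dst, src, n);
		keep_alive(dst);		/* destination is observed */
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return ((double) (t1.tv_sec - t0.tv_sec) * 1e9 +
			(double) (t1.tv_nsec - t0.tv_nsec)) / (double) loops;
}
```

Calling ns_per_loop(copy_bulk, 4, 10000000) and ns_per_loop(copy_each, 4, 10000000) from a small main() gives one per-loop figure for each variant. Because the barrier applies equally to both, neither loop can be optimized away while the other does real work. The barrier and the indirect call add a little overhead of their own, so absolute numbers will differ from a harness without them.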