From: Bryan Green
Date: Mon, 9 Mar 2026 11:02:05 -0600
Subject: Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
To: Ranier Vilela
Cc: Pg Hackers

I performed a micro-benchmark on my dual EPYC (Zen 2) server, and version 1
wins for small values of n.

20 runs:

n     version    min     median  mean    max     stddev  noise%
---------------------------------------------------------------
n=1   version1   2.440   2.440   2.450   2.550   0.024   4.5%
n=1   version2   4.260   4.280   4.277   4.290   0.007   0.7%

n=2   version1   2.740   2.750   2.757   2.880   0.029   5.1%
n=2   version2   3.970   3.980   3.980   4.020   0.010   1.3%

n=4   version1   4.580   4.595   4.649   4.910   0.094   7.2%
n=4   version2   5.780   5.815   5.809   5.820   0.013   0.7%

But micro-benchmarks always make me nervous, so I looked at the actual
instruction cost on my platform for the version 1 and version 2 code.

If we count CPU cycles using the AMD Zen 2 instruction latency/throughput
tables: version 1 (loop body) has a critical path of ~5-6 cycles per
iteration. Version 2 (loop body) has ~3-4 cycles per iteration.
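The test code itself is not reproduced in this message. As a rough sketch
only, assuming version 1 is the per-iteration copy and version 2 is the
single copy hoisted out of the loop (which matches "memcpy one call outside
of the loop" in the quoted text), the two shapes might look like this.
DemoScanKey, copy_keys_v1, copy_keys_v2 and the sk_attno fixup are
hypothetical stand-ins, not the code from genam.c:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a scan key; the real struct has more fields. */
typedef struct DemoScanKey
{
    int   sk_flags;
    short sk_attno;
    long  sk_argument;
} DemoScanKey;

/*
 * Version 1 (assumed shape): copy one key per iteration, then adjust it.
 * The copy size is a compile-time constant.
 */
void
copy_keys_v1(DemoScanKey *dst, const DemoScanKey *src, int nkeys)
{
    assert(nkeys >= 0);
    for (int i = 0; i < nkeys; i++)
    {
        memcpy(&dst[i], &src[i], sizeof(DemoScanKey));
        dst[i].sk_attno = (short) (i + 1);  /* placeholder per-key fixup */
    }
}

/*
 * Version 2 (assumed shape): one memcpy for the whole array, then the
 * fixup loop. The copy size is only known at run time.
 */
void
copy_keys_v2(DemoScanKey *dst, const DemoScanKey *src, int nkeys)
{
    assert(nkeys >= 0);
    memcpy(dst, src, (size_t) nkeys * sizeof(DemoScanKey));
    for (int i = 0; i < nkeys; i++)
        dst[i].sk_attno = (short) (i + 1);  /* same placeholder fixup */
}
```

Both functions leave dst in the same state. The difference is that the copy
size in version 1 is a constant, so a compiler can typically expand it
inline, while version 2 passes a run-time size and usually ends up as a real
call to memcpy.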

The problem for version 2 is that the call to memcpy is ~24-30 cycles due to
the stub + function call + return and branch predictor pressure on first
call. This probably results in ~2.5 ns per iteration cost for version 2.

So, no, I wouldn't call it an optimization. But it will be interesting to
hear other opinions on this.

--bg

On Mon, Mar 9, 2026 at 10:25 AM Ranier Vilela wrote:
>
> On Mon, Mar 9, 2026 at 11:47, Bryan Green wrote:
>
>> I created an example that is a little bit closer to the actual code and
>> changed the compiler from C++ to C.
>>
>> It is interesting the optimization that the compiler has chosen for
>> version 1 versus version 2. One calls
>> memcpy and one doesn't. There is a good chance the inlining of memcpy as
>> SSE+scalar per iteration
>> will be faster for syscache scans -- which I believe are usually small
>> (1-4 keys?).
>>
> I doubt the inline version is better.
> Clang is supported too and the code generated is much better with memcpy
> one call outside of the loop.
>
>>
>> Probably the only reason to do this patch would be if N is normally large
>> or if this is considered an
>> improvement in code clarity without a detrimental impact on small N
>> syscache scans.
>> I realize you only said "possible small optimization". It might be
>> worthwhile to benchmark the code for
>> different values of n to determine if there is a tipping point either way?
>>
> In your opinion, shouldn't this be considered an optimization, even a
> small one?
>
> best regards,
> Ranier Vilela
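
On the quoted question about a tipping point, the medians reported at the
top of this message can be compared directly. The sketch below only redoes
that arithmetic; MedianRow, medians, median_gap and median_ratio are names
made up for illustration, and the values are in whatever unit the benchmark
printed (the message does not say):

```c
#include <assert.h>

/* One row per n: median for version 1 and version 2, copied from the table. */
typedef struct MedianRow
{
    int    n;
    double v1;
    double v2;
} MedianRow;

const MedianRow medians[3] = {
    {1, 2.440, 4.280},
    {2, 2.750, 3.980},
    {4, 4.595, 5.815},
};

/* Absolute gap: how much slower version 2 is at this n. */
double
median_gap(const MedianRow *row)
{
    return row->v2 - row->v1;
}

/* Relative gap: version 2 median as a multiple of version 1. */
double
median_ratio(const MedianRow *row)
{
    assert(row->v1 > 0.0);  /* guard the division */
    return row->v2 / row->v1;
}
```

By this arithmetic the absolute gap is about 1.84 at n=1, 1.23 at n=2 and
1.22 at n=4, and the ratio falls from roughly 1.75x to 1.45x to 1.27x.
Version 2's relative disadvantage shrinks as n grows, but these three points
show no crossover, so any tipping point would have to lie beyond n=4.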