From: Bryan Green <[email protected]>
To: Ranier Vilela <[email protected]>
Cc: Pg Hackers <[email protected]>
Subject: Re: Avoid multiple calls to memcpy (src/backend/access/index/genam.c)
Date: Mon, 9 Mar 2026 11:06:28 -0600
Message-ID: <CAF+pBj9M8rCoS-EEBwLiA6hdxm_UNjxMXG9Vc4RU8xifqnQB-g@mail.gmail.com>
In-Reply-To: <CAF+pBj-K2bgNQRc9ih01WFmAWUaQtVbS37jLtYdYh5LOwOkF6A@mail.gmail.com>
References: <CAEudQApbWon+3Eb9x4WW_D-JkSt2mvfx99dXu9VZ4AeCuTh=fw@mail.gmail.com>
	<CAEudQApEfvhNT1fEPURzVcQH7G0A1ukh_ugoCGaErV6_dbndCQ@mail.gmail.com>
	<CAF+pBj_RS2KErTqQ6ORXjhVzmukG7Ve0wHU1Kq56xjJfFKwVqA@mail.gmail.com>
	<CAEudQAptRymgvmd5hQb2mk-Ft89XcSo_xvC74kv4JBA9v=D4Sg@mail.gmail.com>
	<CAF+pBj-K2bgNQRc9ih01WFmAWUaQtVbS37jLtYdYh5LOwOkF6A@mail.gmail.com>

I meant to add that the micro-benchmark ran each loop for 100,000,000
iterations per run, over 20 runs, to come up with those numbers.
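
For readers who want to reproduce this kind of measurement, here is a minimal
sketch of how per-run timings could be collected and summarized. It is not the
harness that produced the quoted numbers; the names BenchSummary, summarize and
time_loop_ns are hypothetical. The noise% formula, (max - min) / min, is an
inference that happens to match every row of the quoted table. stddev is left
out only to keep the sketch free of a libm dependency.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Summary columns modeled on the quoted table (stddev omitted, see above). */
typedef struct
{
    double min;
    double median;
    double mean;
    double max;
    double noise_pct;   /* assumed definition: 100 * (max - min) / min */
} BenchSummary;

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *) a;
    double y = *(const double *) b;

    return (x > y) - (x < y);
}

/* Sorts the per-run samples in place and computes the summary columns. */
BenchSummary summarize(double *samples, int nruns)
{
    BenchSummary s;
    double sum = 0.0;

    assert(nruns >= 1);
    qsort(samples, (size_t) nruns, sizeof(double), cmp_double);

    s.min = samples[0];
    s.max = samples[nruns - 1];
    s.median = (nruns % 2 == 1)
        ? samples[nruns / 2]
        : 0.5 * (samples[nruns / 2 - 1] + samples[nruns / 2]);

    for (int i = 0; i < nruns; i++)
        sum += samples[i];
    s.mean = sum / nruns;

    s.noise_pct = 100.0 * (s.max - s.min) / s.min;
    return s;
}

/* Calls fn `iters` times and returns the average nanoseconds per call. */
double time_loop_ns(void (*fn)(void), long iters)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return ((t1.tv_sec - t0.tv_sec) * 1e9 +
            (double) (t1.tv_nsec - t0.tv_nsec)) / (double) iters;
}
```

A driver would call time_loop_ns once per run (20 runs of 100,000,000
iterations in the setup described above), store each result in an array, and
pass that array to summarize. The indirect call through fn adds its own
overhead, so a real harness would more likely inline the code under test.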

--bg

On Mon, Mar 9, 2026 at 11:02 AM Bryan Green <[email protected]> wrote:

> I performed a micro-benchmark on my dual EPYC (Zen 2) server, and version 1
> wins for small values of n.
>
> 20 runs:
>
> n       version       min  median    mean     max  stddev  noise%
> -----------------------------------------------------------------------
> n=1     version1     2.440   2.440   2.450   2.550   0.024    4.5%
> n=1     version2     4.260   4.280   4.277   4.290   0.007    0.7%
>
> n=2     version1     2.740   2.750   2.757   2.880   0.029    5.1%
> n=2     version2     3.970   3.980   3.980   4.020   0.010    1.3%
>
> n=4     version1     4.580   4.595   4.649   4.910   0.094    7.2%
> n=4     version2     5.780   5.815   5.809   5.820   0.013    0.7%
>
> But micro-benchmarks always make me nervous, so I looked at the actual
> instruction cost for my platform given the version 1 and version 2 code.
>
> If we count CPU cycles using the AMD Zen 2 instruction latency/throughput
> tables: version 1 (loop body) has a critical path of ~5-6 cycles per
> iteration, while version 2 (loop body) has ~3-4 cycles per iteration.
>
> The problem for version 2 is that the call to memcpy costs ~24-30 cycles due
> to the stub + function call + return, plus branch predictor pressure on the
> first call.  This probably results in a ~2.5 ns per-iteration cost for
> version 2.
>
> So no, I wouldn't call it an optimization.  But it will be interesting to
> hear other opinions on this.
>
> --bg
>
>
> On Mon, Mar 9, 2026 at 10:25 AM Ranier Vilela <[email protected]> wrote:
>
>>
>>
>> On Mon, Mar 9, 2026 at 11:47, Bryan Green <[email protected]> wrote:
>>
>>> I created an example that is a little bit closer to the actual code and
>>> changed the compiler from C++ to C.
>>>
>>> The optimizations the compiler has chosen for version 1 versus version 2
>>> are interesting: one calls memcpy and one doesn't.  There is a good
>>> chance that the inlined memcpy (SSE + scalar per iteration) will be
>>> faster for syscache scans, which I believe are usually small
>>> (1-4 keys?).
>>>
>> I doubt the inline version is better.
>> Clang is supported too, and the code it generates is much better with a
>> single memcpy call outside of the loop.
>>
>>
>>>
>>> Probably the only reason to do this patch would be if N is normally
>>> large or if this is considered an
>>> improvement in code clarity without a detrimental impact on small N
>>> syscache scans.
>>> I realize you only said "possible small optimization".  It might be
>>> worthwhile to benchmark the code for
>>> different values of n to determine if there is a tipping point either
>>> way?
>>>
>>  In your opinion, shouldn't this be considered an optimization, even a
>> small one?
>>
>> best regards,
>> Ranier Vilela
>>
>
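
For context, the two shapes being compared in the quoted thread can be
sketched roughly as follows. This is an illustration only, not the code from
genam.c or from the patch: DemoKey, remap_attno, copy_keys_v1 and copy_keys_v2
are hypothetical stand-ins. Version 1 copies one fixed-size key per iteration
(a constant-size memcpy the compiler can expand inline); version 2 does one
variable-size memcpy before the loop, which typically becomes a real call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a scan key; not PostgreSQL's ScanKeyData. */
typedef struct
{
    int   flags;
    short attno;
    long  argument;
} DemoKey;

/* Hypothetical per-key adjustment performed after the copy. */
static short remap_attno(short attno)
{
    return (short) (attno + 1);
}

/* Version 1: constant-size memcpy inside the loop, one key at a time. */
void copy_keys_v1(DemoKey *dst, const DemoKey *src, int nkeys)
{
    for (int i = 0; i < nkeys; i++)
    {
        memcpy(&dst[i], &src[i], sizeof(DemoKey));
        dst[i].attno = remap_attno(src[i].attno);
    }
}

/* Version 2: a single memcpy of all keys, then the adjustment loop. */
void copy_keys_v2(DemoKey *dst, const DemoKey *src, int nkeys)
{
    memcpy(dst, src, (size_t) nkeys * sizeof(DemoKey));
    for (int i = 0; i < nkeys; i++)
        dst[i].attno = remap_attno(dst[i].attno);
}
```

Both functions produce the same result; the disagreement in the thread is
only about which one is cheaper when nkeys is small, since the fixed cost of
the memcpy call in version 2 is paid once regardless of how little it copies.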

