On Mon, 30 Mar 2026 at 15:01, John Naylor <johncnaylorls@gmail.com> wrote:
> I don't remember the last time anyone did measurements, so I went
> ahead and did that:
>
> master: 945ms
> 32 AVX2: 335ms
> 64 AVX2: 220ms

I'm guessing this is on a recent Intel. Any extra width helps on Intel, as they doubled vpmulld latency out from under us after we had settled on this algorithm. uops.info shows that the most recent Arrow Lake-P cores bring the latency down to 5. But Intel's product lineup is so confusing that it's hard to tell which products this core ships in. As far as I can tell, not in any Xeons yet. AMD has had 3-cycle vpmulld since Zen 3.

Out of curiosity I tried some approximate numbers on Zen 5 for differing N_SUMS values. Numbers are ns per iteration for 10M iterations.

GCC 15.2 -O3:

              n16     n32     n64    n128    n256
    x86-64  620.1   482.4   493.9   543.1   584.0
 x86-64-v2  188.6   125.5   121.3   183.9   196.6
 x86-64-v3  185.2   101.3    63.2    60.9   101.6
 x86-64-v4  182.9    86.0    53.9    35.4    30.5
    native  178.2    84.7    54.0    34.5    30.9

clang 20.1 -O3:

              n16     n32     n64    n128    n256
    x86-64  611.7   264.0   254.7   283.9   304.0
 x86-64-v2  603.7   134.0   137.9   236.1   165.8
 x86-64-v3  252.1   103.2    61.9   124.0    96.9
 x86-64-v4  223.9   102.1    61.4   101.7    68.9
    native  203.3    91.0    54.5    35.0    40.4
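
For anyone reading along without the source open, here is a rough sketch of the kind of loop these numbers exercise. It is written from memory and simplified, not copied from the tree: N_SUMS independent FNV-1a-style sums are fed one 32-bit word each per row, so the multiply latency of one lane can overlap with the others. The function name and the seed values are placeholders of mine (the real code uses a fixed table of base offsets), so this will not produce the same checksums as the real thing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#ifndef N_SUMS
#define N_SUMS 32			/* number of independent partial sums */
#endif

#define FNV_PRIME 16777619u

/* One mixing step: FNV-1a style xor-then-multiply, plus a shifted xor. */
static inline uint32_t
checksum_comp(uint32_t sum, uint32_t value)
{
	uint32_t	tmp = sum ^ value;

	return (tmp * FNV_PRIME) ^ (tmp >> 17);
}

/* Hypothetical helper: checksum a buffer using N_SUMS parallel lanes. */
static uint32_t
sketch_checksum_block(const unsigned char *data, size_t len)
{
	const size_t stride = sizeof(uint32_t) * N_SUMS;
	uint32_t	sums[N_SUMS];
	uint32_t	result = 0;
	size_t		i,
				j;

	/* The buffer must be a whole number of N_SUMS-word rows. */
	assert(len % stride == 0);

	/* Placeholder seeds, NOT the real base offsets. */
	for (j = 0; j < N_SUMS; j++)
		sums[j] = 0x9E3779B9u * (uint32_t) (j + 1);

	/* Main loop: each lane only depends on its own previous value. */
	for (i = 0; i < len / stride; i++)
	{
		uint32_t	row[N_SUMS];

		memcpy(row, data + i * stride, stride);
		for (j = 0; j < N_SUMS; j++)
			sums[j] = checksum_comp(sums[j], row[j]);
	}

	/* Two extra rounds of zeroes so the last rows get mixed as well. */
	for (i = 0; i < 2; i++)
		for (j = 0; j < N_SUMS; j++)
			sums[j] = checksum_comp(sums[j], 0);

	/* Fold the lanes together. */
	for (j = 0; j < N_SUMS; j++)
		result ^= sums[j];

	return result;
}
```

Building this with -DN_SUMS=16 through -DN_SUMS=256 and different -march levels is the sort of variation the tables above cover. The inner loop over j is what the compiler is expected to vectorize, which is why the wider N_SUMS values only pay off once the target has wide enough registers.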

FWIW I think AVX2 (x86-64-v3) is fine. On AMD the speed is close to core-to-fabric bandwidth, and Intel has significantly less bandwidth on server chips.

Regards,
Ants Aasma