From: Ants Aasma <[email protected]>
To: John Naylor <[email protected]>
Cc: Andrew Kim <[email protected]>
Cc: Oleg Tselebrovskiy <[email protected]>
Cc: [email protected]
Subject: Re: Proposal for enabling auto-vectorization for checksum calculations
Date: Mon, 30 Mar 2026 18:00:59 +0300
Message-ID: <CANwKhkMN31RoNab8ovJjZaW=o6CNHCu-rznk85wKO=L5z5-PSA@mail.gmail.com> (raw)
In-Reply-To: <CANWCAZZ49dJ7XR1dY==7cHs93H7huo9f6RA_2qevFLp9eaOk4g@mail.gmail.com>
References: <[email protected]>
<CANWCAZYZQw-nzTXbx3Bk332VtY9_D7ksDsuMZ0A-iDZ53yG7Ng@mail.gmail.com>
<CAK64mnfeWLBRbMfnOsag0vGTDnT84KJzpuei40nG0OHyw4SESw@mail.gmail.com>
<CANWCAZa1b2rcvoK657SmcKwh2P2cgASQ1D-0JPj5d3LbfaAVgA@mail.gmail.com>
<CAK64mneN20+sW5WhV+r7hMVo4Rd0z11B6=3L039rWMt1wK3nPg@mail.gmail.com>
<CANWCAZZuS3sNgLRo8Z4AM=uY4zTmz=dH5D4Z9xV6K0CEuJ8Hdw@mail.gmail.com>
<CAK64mnejn9AZMYz03e7HX8Uui35PihUuOy=b+iBG=YtRKx0Log@mail.gmail.com>
<CANWCAZZ_0AQMk1HgHXHX+JaeBfy_4kzwHgTdqMptDA7zM+nm+Q@mail.gmail.com>
<CAK64mnc6jbehHv5AHc84tVFRJg4zeMiFuvPX9xZkRpq0210MFA@mail.gmail.com>
<CANWCAZY940P3wGOQAZWMLQL4MQGGyOu7WBjBEcn_gqcrr+NvAw@mail.gmail.com>
<CAK64mne_oWN9d4mf+0c_5-4Emb9kRXA-OC05OJ4F_1fVqpjzDA@mail.gmail.com>
<CANWCAZZcKYp+01u1QmkShfXVkUCCdxtJAgHT-61Vw0ALoWj47A@mail.gmail.com>
<CAK64mne=Q_4VSpJ8f4RQB-yAThd4+i-BRYMvfdGOhvwJQdYoKQ@mail.gmail.com>
<CANWCAZYg2MVbYTaczNYNC2kaPodtfB8toUfE2Mhp9kut=2wzEA@mail.gmail.com>
<CAK64mnd9NE+xE18shrf-SSx-iwMVof=2DJ2y9_fOkQ5E2Abc5g@mail.gmail.com>
<CANWCAZbjdFnBiUmrBQC5vFFy0Fnn4SJG4AkkzGpTFhovodJdYQ@mail.gmail.com>
<[email protected]>
<CANWCAZZJ1tQcwWZe4BTgv1E-+bvhe4d0LzJvXeZCFMjRtWpk-w@mail.gmail.com>
<CAK64mnfwyr-6GMRFFW_3a+xXpJxpYymCOygfbr-HUA_7+tQk2Q@mail.gmail.com>
<CAK64mndS2Oy1i9ehEALwyv5EBpjMozTxSkt8HYc17a+MKDvmdQ@mail.gmail.com>
<CANWCAZbigzGc_Kzsqf3NB+FgfnvJCas_KovCXg3GROJTVjuS9Q@mail.gmail.com>
<CANWCAZZ49dJ7XR1dY==7cHs93H7huo9f6RA_2qevFLp9eaOk4g@mail.gmail.com>
On Mon, 30 Mar 2026 at 15:01, John Naylor <[email protected]> wrote:
> I don't remember the last time anyone did measurements, so I went
> ahead and did that:
>
> master: 945ms
> 32 AVX2: 335ms
> 64 AVX2: 220ms
I'm guessing this is on a recent Intel. Any extra width is helpful on Intel
as they doubled vpmulld latency from under us after we had settled on this
algorithm. uops.info shows that the most recent Arrow Lake-P cores bring
the latency down to 5 cycles. Intel's product lineup is so confusing that it's
hard to tell which products this core ships in. As far as I can tell, not in
any Xeons yet. AMD has had 3-cycle vpmulld since Zen 3.
Out of curiosity I tried some approximate numbers on Zen 5 for differing
N_SUMS values. Numbers are ns per iteration for 10M iterations.
GCC 15.2 -O3:
              n16    n32    n64   n128   n256
x86-64      620.1  482.4  493.9  543.1  584.0
x86-64-v2   188.6  125.5  121.3  183.9  196.6
x86-64-v3   185.2  101.3   63.2   60.9  101.6
x86-64-v4   182.9   86.0   53.9   35.4   30.5
native      178.2   84.7   54.0   34.5   30.9
clang 20.1 -O3:
              n16    n32    n64   n128   n256
x86-64      611.7  264.0  254.7  283.9  304.0
x86-64-v2   603.7  134.0  137.9  236.1  165.8
x86-64-v3   252.1  103.2   61.9  124.0   96.9
x86-64-v4   223.9  102.1   61.4  101.7   68.9
native      203.3   91.0   54.5   35.0   40.4
FWIW I think AVX2 (x86-64-v3) is fine. On AMD the speed is close to
core-to-fabric bandwidth, and Intel has significantly less bandwidth on
server chips.
Regards,
Ants Aasma
Attachments:
[text/x-csrc] bench-checksums.c (1023B, 3-bench-checksums.c)
#include "postgres.h"
#include "storage/checksum_impl.h"
#include <time.h>

/* Use the C library's printf, not the replacement from the PostgreSQL headers. */
#undef printf

/* noinline keeps the checksum call from being folded into the timing loop. */
int __attribute__ ((noinline)) checksum_block(char *page, uint32 blockno)
{
    return pg_checksum_page(page, blockno);
}

int main(int argc, char *argv[])
{
    char *page;
    uint64 i;
    uint64 sum = 0;
    struct timespec start;
    struct timespec end;
    double delta;

    if (argc < 3)
    {
        printf("Usage: %s niterations nblocks\n", argv[0]);
        return 1;
    }

    uint64 n = strtoull(argv[1], 0, 10);
    uint64 b = strtoull(argv[2], 0, 10);

    page = malloc(BLCKSZ * b);
    if (page == NULL)
    {
        printf("out of memory\n");
        return 1;
    }

    /* Fill the buffer with a non-trivial byte pattern. */
    for (i = 0; i < BLCKSZ * b; i++)
        page[i] = (i * 997) & 0xFF;

    /* Time n checksum calls, cycling through the b blocks. */
    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    for (i = 0; i < n; i++)
        sum += checksum_block(page + BLCKSZ * (i % b), (uint32) i);
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);

    delta = (double) (end.tv_sec - start.tv_sec) +
        (1e-9 * (double) (end.tv_nsec - start.tv_nsec));

    printf("%0.5fms @ %0.3f GB/s\n", delta * 1000, (((double) BLCKSZ) * n) / delta / 1e9);
    printf(" %0.1fns per iteration\n", delta * 1e9 / n);
    return 0;
}