Re: Proposal for enabling auto-vectorization for checksum calculations

From: Ants Aasma <[email protected]>
To: John Naylor <[email protected]>
Cc: Andrew Kim <[email protected]>
Cc: Oleg Tselebrovskiy <[email protected]>
Cc: [email protected]
Subject: Re: Proposal for enabling auto-vectorization for checksum calculations
Date: Mon, 30 Mar 2026 18:00:59 +0300
Message-ID: <CANwKhkMN31RoNab8ovJjZaW=o6CNHCu-rznk85wKO=L5z5-PSA@mail.gmail.com>
In-Reply-To: <CANWCAZZ49dJ7XR1dY==7cHs93H7huo9f6RA_2qevFLp9eaOk4g@mail.gmail.com>

On Mon, 30 Mar 2026 at 15:01, John Naylor <[email protected]> wrote:
> I don't remember the last time anyone did measurements, so I went
> ahead and did that:
>
> master: 945ms
> 32 AVX2: 335ms
> 64 AVX2: 220ms

I'm guessing this is on a recent Intel. Any extra width helps on Intel, as
they doubled vpmulld latency from under us after we had settled on this
algorithm. uops.info shows that the most recent Arrow Lake-P cores bring
the latency down to 5 cycles. But Intel's product lineup is so confusing that
it's hard to tell which products this core ships in. As far as I can tell,
not in any Xeons yet. AMD has had 3-cycle vpmulld since Zen 3.
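
To illustrate why multiply latency and the number of parallel sums interact, here is a minimal sketch of the general shape of the checksum loop. It is not the code from storage/checksum_impl.h: the SKETCH_* names, the placeholder seed values, and both function names are made up for illustration, and details of the real function such as its seed values and block number mixing are left out. Each lane's next value depends on its previous value through a 32-bit multiply, so a single lane is a serial chain bounded by multiply latency, while the row-major loop exposes SKETCH_N_SUMS independent chains that a compiler can map onto vector lanes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names and constants below are illustrative assumptions. */
#define SKETCH_BLCKSZ    8192
#define SKETCH_N_SUMS    32
#define SKETCH_FNV_PRIME 16777619u
#define SKETCH_WORDS     (SKETCH_BLCKSZ / sizeof(uint32_t))
#define SKETCH_ROWS      (SKETCH_WORDS / SKETCH_N_SUMS)

/* One FNV-style mixing step: xor in the value, multiply, fold high bits. */
static inline uint32_t sketch_mix(uint32_t sum, uint32_t value)
{
	uint32_t tmp = sum ^ value;

	return (tmp * SKETCH_FNV_PRIME) ^ (tmp >> 17);
}

/*
 * Row-major order: the inner loop touches SKETCH_N_SUMS independent
 * accumulators, so its iterations do not depend on each other and can
 * be vectorized.
 */
uint32_t sketch_checksum_rows(const uint32_t *words)
{
	uint32_t sums[SKETCH_N_SUMS];
	uint32_t result = 0;
	size_t i, j;

	assert(words != NULL);

	for (j = 0; j < SKETCH_N_SUMS; j++)
		sums[j] = (uint32_t) (j + 1);	/* placeholder seeds */

	for (i = 0; i < SKETCH_ROWS; i++)
		for (j = 0; j < SKETCH_N_SUMS; j++)
			sums[j] = sketch_mix(sums[j], words[i * SKETCH_N_SUMS + j]);

	for (j = 0; j < SKETCH_N_SUMS; j++)
		result ^= sums[j];
	return result;
}

/*
 * Lane-major order: the same arithmetic, one lane at a time. Every step
 * waits for the previous multiply in that lane to finish.
 */
uint32_t sketch_checksum_lanes(const uint32_t *words)
{
	uint32_t result = 0;
	size_t i, j;

	assert(words != NULL);

	for (j = 0; j < SKETCH_N_SUMS; j++)
	{
		uint32_t sum = (uint32_t) (j + 1);	/* same placeholder seeds */

		for (i = 0; i < SKETCH_ROWS; i++)
			sum = sketch_mix(sum, words[i * SKETCH_N_SUMS + j]);
		result ^= sum;
	}
	return result;
}
```

Both functions compute the same value; only the loop order differs. With a longer multiply latency, more independent lanes are needed to keep the multiplier busy, which is consistent with wider N_SUMS helping in the numbers quoted above.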

Out of curiosity I tried some approximate numbers on Zen 5 for differing
N_SUMS values. Numbers are ns per iteration for 10M iterations.

GCC 15.2 -O3:

              n16     n32     n64    n128    n256
    x86-64  620.1   482.4   493.9   543.1   584.0
 x86-64-v2  188.6   125.5   121.3   183.9   196.6
 x86-64-v3  185.2   101.3    63.2    60.9   101.6
 x86-64-v4  182.9    86.0    53.9    35.4    30.5
    native  178.2    84.7    54.0    34.5    30.9

clang 20.1 -O3:

              n16     n32     n64    n128    n256
    x86-64  611.7   264.0   254.7   283.9   304.0
 x86-64-v2  603.7   134.0   137.9   236.1   165.8
 x86-64-v3  252.1   103.2    61.9   124.0    96.9
 x86-64-v4  223.9   102.1    61.4   101.7    68.9
    native  203.3    91.0    54.5    35.0    40.4

FWIW I think AVX2 (x86-64-v3) is fine. On AMD the speed is close to
core-to-fabric bandwidth, and Intel has significantly less bandwidth on
server chips.
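
For comparing the per-iteration times against bandwidth figures, the conversion can be sketched as below. It mirrors the GB/s formula in the attached benchmark, assumes the default BLCKSZ of 8192 bytes (the message does not state the block size used), and uses a hypothetical helper name. One byte per nanosecond is 1 GB/s with GB taken as 1e9 bytes.

```c
#include <assert.h>

/* Assumption: default PostgreSQL block size; not stated in the message. */
#define ASSUMED_BLCKSZ 8192.0

/* Convert nanoseconds per checksummed page into GB/s (1 GB = 1e9 bytes). */
double ns_per_page_to_gb_per_s(double ns_per_page)
{
	assert(ns_per_page > 0.0);
	return ASSUMED_BLCKSZ / ns_per_page;
}
```

Under that assumption, the GCC x86-64-v3 n64 figure of 63.2 ns works out to about 129.6 GB/s, and the x86-64-v4 n256 figure of 30.5 ns to about 268.6 GB/s. These rates apply to whatever working set was passed as nblocks, which the message does not state.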

Regards,
Ants Aasma


Attachments:

  [text/x-csrc] bench-checksums.c (1023B, 3-bench-checksums.c)
#include "postgres.h"
#include "storage/checksum_impl.h"
#include <time.h>

#undef printf

int __attribute__ ((noinline)) checksum_block(char *page, uint32 blockno)
{
	return pg_checksum_page(page, blockno);
}

int main(int argc, char *argv[]) {
	char *page;
	uint64 i;
	uint64 sum = 0;

	struct timespec start;
	struct timespec end;
	double delta;

	if (argc<3) {
		printf("Usage: %s niterations nblocks\n", argv[0]);
		return 1;
	}

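	uint64 n = strtoull(argv[1], 0, 10);
	uint64 b = strtoull(argv[2], 0, 10);

	/* b is used as a modulus in the timing loop, so it must be nonzero */
	if (b == 0) {
		printf("nblocks must be greater than zero\n");
		return 1;
	}

	page = malloc(BLCKSZ*b);
	if (page == NULL) {
		printf("out of memory\n");
		return 1;
	}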

	for (i = 0; i < BLCKSZ*b; i++)
		page[i] = (i*997) & 0xFF;


	clock_gettime(CLOCK_MONOTONIC_RAW, &start);
	for (i = 0; i < n; i++)
		sum += checksum_block(page + BLCKSZ*(i % b), (uint32) i);
	clock_gettime(CLOCK_MONOTONIC_RAW, &end);

	delta = (double)(end.tv_sec - start.tv_sec) + (1e-9*(double) (end.tv_nsec - start.tv_nsec));

	printf("%0.5fms @ %0.3f GB/s\n", delta*1000, (((double) BLCKSZ) * n)/delta/1e9);
	printf("  %0.1fns per iteration\n", delta*1e9/n);
	return 0;
}
