MIME-Version: 1.0
References: <20250911054220.3784-1-root@ip-172-31-36-228.ec2.internal> <0be1b7b05726652ea0d83e8f72fd4cfe@postgrespro.ru>
From: Ants Aasma
Date: Mon, 30 Mar 2026 18:00:59 +0300
Subject: Re: Proposal for enabling auto-vectorization for checksum calculations
To: John Naylor
Cc: Andrew Kim, Oleg Tselebrovskiy, pgsql-hackers@lists.postgresql.org
Content-Type: multipart/mixed; boundary="00000000000010dff5064e3f1f0d"

--00000000000010dff5064e3f1f0d
Content-Type: multipart/alternative; boundary="00000000000010dff3064e3f1f0b"

--00000000000010dff3064e3f1f0b
Content-Type: text/plain; charset="UTF-8"

On Mon, 30 Mar 2026 at 15:01, John Naylor wrote:
> I don't remember the last time anyone did measurements, so I went
> ahead and did that:
>
> master: 945ms
> 32 AVX2: 335ms
> 64 AVX2: 220ms

I'm guessing this is on a recent Intel. Any extra width is helpful on
Intel, as they doubled vpmulld latency out from under us after we had
settled on this algorithm. uops.info shows that the most recent Arrow
Lake-P cores bring the latency down to 5 cycles. Intel's product lineup
is so confusing that it's hard to tell which products this core ships
in. As far as I can tell, not in any Xeons yet. AMD has had 3-cycle
vpmulld since Zen 3.

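To make the latency point concrete, here is a minimal sketch, assuming a
simplified FNV-style mixing step and made-up seed values. The names
lane_checksum and N_LANES and the seeds are hypothetical, chosen for
illustration; this is not the code in checksum_impl.h. Each lane carries
a serial dependency through the 32-bit multiply, so only independent
lanes can overlap in the multiplier pipeline, which is why a slower
vpmulld favours more parallel sums.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N_LANES   32            /* hypothetical stand-in for N_SUMS */
#define FNV_PRIME 16777619u

/*
 * Simplified multi-lane checksum sketch (illustration only).
 * Lane j depends only on its own previous value, so the N_LANES
 * multiplies in the inner loop are independent of each other and
 * can be vectorized and pipelined.
 */
static uint32_t
lane_checksum(const uint32_t *words, size_t nwords)
{
    uint32_t sums[N_LANES];
    uint32_t result = 0;
    size_t   i, j;

    assert(words != NULL || nwords == 0);

    /* Made-up seeds: lane j starts at j. */
    for (j = 0; j < N_LANES; j++)
        sums[j] = (uint32_t) j;

    /* Consume full blocks of N_LANES words; a trailing partial block is ignored. */
    for (i = 0; i + N_LANES <= nwords; i += N_LANES)
    {
        for (j = 0; j < N_LANES; j++)
        {
            uint32_t tmp = sums[j] ^ words[i + j];

            /* The multiply sits on each lane's critical path. */
            sums[j] = (tmp * FNV_PRIME) ^ (tmp >> 17);
        }
    }

    /* Fold the lanes into a single value. */
    for (j = 0; j < N_LANES; j++)
        result ^= sums[j];

    return result;
}
```

Words beyond the last full block are ignored in this sketch, which is
harmless when the buffer size is a multiple of N_LANES * 4 bytes, as an
8 kB page is for the lane counts discussed here.
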
Out of curiosity I tried some approximate numbers on Zen 5 for differing
N_SUMS values. Numbers are ns per iteration for 10M iterations.

GCC 15.2 -O3:

              n16     n32     n64    n128    n256
    x86-64  620.1   482.4   493.9   543.1   584.0
 x86-64-v2  188.6   125.5   121.3   183.9   196.6
 x86-64-v3  185.2   101.3    63.2    60.9   101.6
 x86-64-v4  182.9    86.0    53.9    35.4    30.5
    native  178.2    84.7    54.0    34.5    30.9

clang 20.1 -O3:

              n16     n32     n64    n128    n256
    x86-64  611.7   264.0   254.7   283.9   304.0
 x86-64-v2  603.7   134.0   137.9   236.1   165.8
 x86-64-v3  252.1   103.2    61.9   124.0    96.9
 x86-64-v4  223.9   102.1    61.4   101.7    68.9
    native  203.3    91.0    54.5    35.0    40.4

FWIW I think AVX2 (x86-64-v3) is fine. On AMD the speed is close to
core-to-fabric bandwidth, and Intel has significantly less bandwidth on
server chips.

Regards,
Ants Aasma

--00000000000010dff3064e3f1f0b
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
--00000000000010dff3064e3f1f0b-- --00000000000010dff5064e3f1f0d Content-Type: text/x-csrc; charset="US-ASCII"; name="bench-checksums.c" Content-Disposition: attachment; filename="bench-checksums.c" Content-Transfer-Encoding: base64 Content-ID: X-Attachment-Id: f_mndb32sf0 I2luY2x1ZGUgInBvc3RncmVzLmgiCiNpbmNsdWRlICJzdG9yYWdlL2NoZWNrc3VtX2ltcGwuaCIK I2luY2x1ZGUgPHRpbWUuaD4KCiN1bmRlZiBwcmludGYKCmludCBfX2F0dHJpYnV0ZV9fICgobm9p bmxpbmUpKSBjaGVja3N1bV9ibG9jayhjaGFyICpwYWdlLCB1aW50MzIgYmxvY2tubykKewoJcmV0 dXJuIHBnX2NoZWNrc3VtX3BhZ2UocGFnZSwgYmxvY2tubyk7Cn0KCmludCBtYWluKGludCBhcmdj LCBjaGFyICphcmd2W10pIHsKCWNoYXIgKnBhZ2U7Cgl1aW50NjQgaTsKCXVpbnQ2NCBzdW0gPSAw OwoKCXN0cnVjdCB0aW1lc3BlYyBzdGFydDsKCXN0cnVjdCB0aW1lc3BlYyBlbmQ7Cglkb3VibGUg ZGVsdGE7CgoJaWYgKGFyZ2M8MykgewoJCXByaW50ZigiVXNhZ2U6ICVzIG5pdGVyYXRpb25zIG5i bG9ja3NcbiIsIGFyZ3ZbMF0pOwoJCXJldHVybiAxOwoJfQoKCXVpbnQ2NCBuID0gc3RydG91bGwo YXJndlsxXSwgMCwgMTApOwoJdWludDY0IGIgPSBzdHJ0b3VsbChhcmd2WzJdLCAwLCAxMCk7CgoJ cGFnZSA9IG1hbGxvYyhCTENLU1oqYik7CgoJZm9yIChpID0gMDsgaSA8IEJMQ0tTWipiOyBpKysp CgkJcGFnZVtpXSA9IChpKjk5NykgJiAweEZGOwoKCgljbG9ja19nZXR0aW1lKENMT0NLX01PTk9U T05JQ19SQVcsICZzdGFydCk7Cglmb3IgKGkgPSAwOyBpIDwgbjsgaSsrKQoJCXN1bSArPSBjaGVj a3N1bV9ibG9jayhwYWdlICsgQkxDS1NaKihpICUgYiksICh1aW50MzIpIGkpOwoJY2xvY2tfZ2V0 dGltZShDTE9DS19NT05PVE9OSUNfUkFXLCAmZW5kKTsKCglkZWx0YSA9IChkb3VibGUpKGVuZC50 dl9zZWMgLSBzdGFydC50dl9zZWMpICsgKDFlLTkqKGRvdWJsZSkgKGVuZC50dl9uc2VjIC0gc3Rh cnQudHZfbnNlYykpOwoKICAgICAgICBwcmludGYoIiUwLjVmbXMgQCAlMC4zZiBHQi9zXG4iLCBk ZWx0YSoxMDAwLCAoKChkb3VibGUpIEJMQ0tTWikgKiBuKS9kZWx0YS8xZTkpOwoJcHJpbnRmKCIg ICUwLjFmbnMgcGVyIGl0ZXJhdGlvblxuIiwgZGVsdGEqMWU5L24pOwoJcmV0dXJuIDA7Cn0K --00000000000010dff5064e3f1f0d--