public inbox for [email protected]help / color / mirror / Atom feed
Re: Proposal for enabling auto-vectorization for checksum calculations 15+ messages / 4 participants [nested] [flat]
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-01-11 23:19 John Naylor <[email protected]> 0 siblings, 1 reply; 15+ messages in thread From: John Naylor @ 2026-01-11 23:19 UTC (permalink / raw) To: Andrew Kim <[email protected]>; +Cc: [email protected]; Oleg Tselebrovskiy <[email protected]> On Thu, Nov 6, 2025 at 6:50 AM Andrew Kim <[email protected]> wrote: > The v9 patch series is attached. Sorry for the delay. I found some issues last month and needed to consider the tradeoffs. First, apparently it has gone unnoticed by everyone, myself included, that no version has passed Meson CI since v6: https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5726 That's because `ninja -C build -t missingdeps` gives: Missing dep: src/port/libpgport_shlib_checksum.a.p/checksum.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) Missing dep: src/port/libpgport_checksum.a.p/checksum.c.o uses src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) Processed 2561 nodes. Error: There are 2 missing dependency paths. 2 targets had depfile dependencies on 1 distinct generated inputs (from 1 rules) without a non-depfile dep path to the generator. There might be build flakiness if any of the targets listed above are built alone, or not late enough, in a clean output directory. In the back of my mind I was worried of consequences of something in src/port depending on backend types, but hadn't seen any in my local builds. It seems the proximate cause is the removal of this stanza with no equivalent replacement: --- a/src/backend/storage/page/meson.build +++ b/src/backend/storage/page/meson.build @@ -1,14 +1,5 @@ # Copyright (c) 2022-2025, PostgreSQL Global Development Group -checksum_backend_lib = static_library('checksum_backend_lib', - 'checksum.c', - dependencies: backend_build_deps, - kwargs: internal_lib_args, - c_args: vectorize_cflags + unroll_loops_cflags, -) - -backend_link_with += checksum_backend_lib The low-level algorithm doesn't care about database pages, only integers, so first I tried to surgically isolate the concepts, but that was too messy. In the attached v10-0003, I went back to something more similar to v6, but incorporated Andrew's idea of using PG_CHECKSUM_INTERNAL to allow for flexibility. Now pg_filedump compiles without any changes, so that's a plus. > - Provides public interfaces wrapping the basic implementation > - No code duplication (checksum.c includes checksum_impl.h) Upthread I mentioned "thin wrappers", but so far I haven't seen it in any patch versions, so I don't think this term means the same thing to you as it does to me (I saw pretty clear duplication in v9). It then occurred to me that with function attribute targets, doing the naive thing throws a compiler error IIRC -- namely just have a notional function call that then gets inlined and re-targeted. So in v10 I separated the body of checksum_block to a semi-private header to provide hardware-specific definitions for core code, while also maintaining the same one that external code expects. For this to be commitable, I think (and I think Oleg agrees) that the feature detection should go in src/port. Some of us have been thinking of refactoring and centralizing the feature detection, and now may be a good time to do it. Before going that far, I wanted to see what people think of v10. -- John Naylor Amazon Web Services Attachments: [application/x-patch] v10-0003-Enable-autovectorizing-pg_checksum_block-with-AV.patch (13.8K, 2-v10-0003-Enable-autovectorizing-pg_checksum_block-with-AV.patch) download | inline diff: From 1783b4efc237364a5ecf3fa5cb17ebb45b73a9ef Mon Sep 17 00:00:00 2001 From: John Naylor <[email protected]> Date: Thu, 8 Jan 2026 18:30:20 +0700 Subject: [PATCH v10 3/3] Enable autovectorizing pg_checksum_block with AVX2 runtime detection [todo more here] Co-authored-by: Matthew Sterrett <[email protected]> Co-authored-by: Andrew Kim <[email protected]> Reviewed-by: Oleg Tselebrovskiy <[email protected]> Discussion: https://postgr.es/m/CA%2BvA85_5GTu%2BHHniSbvvP%2B8k3%3DxZO%3DWE84NPwiKyxztqvpfZ3Q%40mail.gmail.com Discussion: https://postgr.es/m/20250911054220.3784-1-root%40ip-172-31-36-228.ec2.internal --- config/c-compiler.m4 | 26 ++++ configure | 52 ++++++++ configure.ac | 9 ++ meson.build | 30 +++++ src/backend/storage/page/checksum.c | 112 +++++++++++++++++- src/include/pg_config.h.in | 3 + src/include/storage/checksum_block_internal.h | 42 +++++++ src/include/storage/checksum_impl.h | 48 +++----- 8 files changed, 288 insertions(+), 34 deletions(-) create mode 100644 src/include/storage/checksum_block_internal.h diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 1509dbfa2ab..1f3e31fc2d3 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -613,6 +613,32 @@ fi undefine([Ac_cachevar])dnl ])# PGAC_SSE42_CRC32_INTRINSICS +# PGAC_AVX2_SUPPORT +# --------------------------- +# Check if the compiler supports AVX2 target attribute. +# This is used for optimized checksum calculations with runtime detection. +# +# If AVX2 target attribute is supported, sets pgac_avx2_support. +AC_DEFUN([PGAC_AVX2_SUPPORT], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx2_support])])dnl +AC_CACHE_CHECK([for AVX2 target attribute support], [Ac_cachevar], +[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([#include <stdint.h> + #if defined(__has_attribute) && __has_attribute (target) + __attribute__((target("avx2"))) + static int avx2_test(void) + { + return 0; + } + #endif], + [return avx2_test();])], + [Ac_cachevar=yes], + [Ac_cachevar=no])]) +if test x"$Ac_cachevar" = x"yes"; then + pgac_avx2_support=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX2_SUPPORT + # PGAC_AVX512_PCLMUL_INTRINSICS # --------------------------- # Check if the compiler supports AVX-512 carryless multiplication diff --git a/configure b/configure index 045c913865d..b89c44f81c0 100755 --- a/configure +++ b/configure @@ -17662,6 +17662,58 @@ $as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h fi +# Check for AVX2 target and intrinsic support +# +if test x"$host_cpu" = x"x86_64"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for AVX2 support" >&5 +$as_echo_n "checking for AVX2 support... " >&6; } +if ${pgac_cv_avx2_support+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <immintrin.h> + #include <stdint.h> + #if defined(__has_attribute) && __has_attribute (target) + __attribute__((target("avx2"))) + #endif + static int avx2_test(void) + { + const char buf[sizeof(__m256i)]; + __m256i accum = _mm256_loadu_si256((const __m256i *) buf); + accum = _mm256_add_epi32(accum, accum); + int result = _mm256_extract_epi32(accum, 0); + return (int) result; + } +int +main () +{ +return avx2_test(); + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv_avx2_support=yes +else + pgac_cv_avx2_support=no +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_support" >&5 +$as_echo "$pgac_cv_avx2_support" >&6; } +if test x"$pgac_cv_avx2_support" = x"yes"; then + pgac_avx2_support=yes +fi + + if test x"$pgac_avx2_support" = x"yes"; then + +$as_echo "#define USE_AVX2_WITH_RUNTIME_CHECK 1" >>confdefs.h + + fi +fi + # Check for AVX-512 popcount intrinsics # if test x"$host_cpu" = x"x86_64"; then diff --git a/configure.ac b/configure.ac index 145197e6bd6..bb7456e4478 100644 --- a/configure.ac +++ b/configure.ac @@ -2074,6 +2074,15 @@ else fi fi +# Check for AVX2 target and intrinsic support +# +if test x"$host_cpu" = x"x86_64"; then + PGAC_AVX2_SUPPORT() + if test x"$pgac_avx2_support" = x"yes"; then + AC_DEFINE(USE_AVX2_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX2 instructions with a runtime check.]) + fi +fi + # Check for XSAVE intrinsics # PGAC_XSAVE_INTRINSICS() diff --git a/meson.build b/meson.build index 2064d1b0a8d..776abf1249d 100644 --- a/meson.build +++ b/meson.build @@ -2322,6 +2322,36 @@ int main(void) endif +############################################################### +# Check for the availability of AVX2 support +############################################################### + +if host_cpu == 'x86_64' + + prog = ''' +#include <immintrin.h> +#include <stdint.h> +#if defined(__has_attribute) && __has_attribute (target) +__attribute__((target("avx2"))) +#endif +static int avx2_test(void) +{ + return 0; +} + +int main(void) +{ + return avx2_test(); +} +''' + + if cc.links(prog, name: 'AVX2 support', args: test_c_args) + cdata.set('USE_AVX2_WITH_RUNTIME_CHECK', 1) + endif + +endif + + ############################################################### # Check for the availability of AVX-512 popcount intrinsics. ############################################################### diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c index 8716651c8b5..55ebe988411 100644 --- a/src/backend/storage/page/checksum.c +++ b/src/backend/storage/page/checksum.c @@ -13,10 +13,120 @@ */ #include "postgres.h" +#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) +#include <cpuid.h> +#endif + +#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX) +#include <intrin.h> +#endif + +#ifdef HAVE_XSAVE_INTRINSICS +#include <immintrin.h> +#endif + #include "storage/checksum.h" + /* * The actual code is in storage/checksum_impl.h. This is done so that * external programs can incorporate the checksum code by #include'ing - * that file from the exported Postgres headers. (Compare our CRC code.) + * that file from the exported Postgres headers. (Compare our legacy + * CRC code in pg_crc.h.) + * The PG_CHECKSUM_INTERNAL symbol allows core to use hardware-specific + * coding without affecting external programs. */ +#define PG_CHECKSUM_INTERNAL #include "storage/checksum_impl.h" /* IWYU pragma: keep */ + + +/* WIP: the feature detection should go in src/port */ + +/* + * Does CPUID say there's support for XSAVE instructions? + */ +static inline bool +xsave_available(void) +{ + unsigned int exx[4] = {0, 0, 0, 0}; + +#if defined(HAVE__GET_CPUID) + __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUID) + __cpuid(exx, 1); +#endif + return (exx[2] & (1 << 27)) != 0; /* osxsave */ +} + +/* + * Does XGETBV say the YMM registers are enabled? + * + * NB: Caller is responsible for verifying that xsave_available() returns true + * before calling this. + */ +#ifdef HAVE_XSAVE_INTRINSICS +pg_attribute_target("xsave") +#endif +static inline bool +ymm_regs_available(void) +{ +#ifdef HAVE_XSAVE_INTRINSICS + return (_xgetbv(0) & 0x06) == 0x06; +#else + return false; +#endif +} + +/* + * Check for AVX2 support using CPUID detection + */ +static inline bool +avx2_available(void) +{ + unsigned int exx[4] = {0, 0, 0, 0}; + +#if defined(HAVE__GET_CPUID_COUNT) + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUIDEX) + __cpuidex(exx, 7, 0); +#endif + + return (exx[1] & (1 << 5)) != 0; /* avx2 */ +} + +static uint32 +pg_checksum_block_fallback(const PGChecksummablePage *page) +{ +#include "storage/checksum_block_internal.h" +} + +/* + * AVX2-optimized block checksum algorithm. + */ +#ifdef USE_AVX2_WITH_RUNTIME_CHECK +pg_attribute_target("avx2") +static uint32 +pg_checksum_block_avx2(const PGChecksummablePage *page) +{ +#include "storage/checksum_block_internal.h" +} +#endif /* USE_AVX2_WITH_RUNTIME_CHECK */ + +/* + * Choose the best available checksum implementation. + */ +static uint32 +pg_checksum_choose(const PGChecksummablePage *page) +{ +#ifdef USE_AVX2_WITH_RUNTIME_CHECK + if (xsave_available() && + ymm_regs_available() && + avx2_available()) + pg_checksum_block = pg_checksum_block_avx2; + else +#endif + pg_checksum_block = pg_checksum_block_fallback; + + return pg_checksum_block(page); +} + +static uint32 (*pg_checksum_block) (const PGChecksummablePage *page) = pg_checksum_choose; diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 10fa85e78c2..444d2fc1afe 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -665,6 +665,9 @@ /* Define to 1 to use AVX-512 CRC algorithms with a runtime check. */ #undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK +/* Define to 1 to use AVX2 instructions with a runtime check. */ +#undef USE_AVX2_WITH_RUNTIME_CHECK + /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */ #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK diff --git a/src/include/storage/checksum_block_internal.h b/src/include/storage/checksum_block_internal.h new file mode 100644 index 00000000000..b4e6987d6b5 --- /dev/null +++ b/src/include/storage/checksum_block_internal.h @@ -0,0 +1,42 @@ +/*------------------------------------------------------------------------- + * + * checksum_block_internal.h + * Core algorithm for page checksums , semi private to checksum_impl.h + * and checksum.c. + * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/storage/checksum_block_internal.h + * + *------------------------------------------------------------------------- + */ + +/* there is deliberately not an #ifndef CHECKSUM_BLOCK_INTERNAL_H here */ + +uint32 sums[N_SUMS]; +uint32 result = 0; +uint32 i, + j; + +/* ensure that the size is compatible with the algorithm */ +Assert(sizeof(PGChecksummablePage) == BLCKSZ); + +/* initialize partial checksums to their corresponding offsets */ +memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets)); + +/* main checksum calculation */ +for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++) + for (j = 0; j < N_SUMS; j++) + CHECKSUM_COMP(sums[j], page->data[i][j]); + +/* finally add in two rounds of zeroes for additional mixing */ +for (i = 0; i < 2; i++) + for (j = 0; j < N_SUMS; j++) + CHECKSUM_COMP(sums[j], 0); + +/* xor fold partial checksums together */ +for (i = 0; i < N_SUMS; i++) + result ^= sums[i]; + +return result; diff --git a/src/include/storage/checksum_impl.h b/src/include/storage/checksum_impl.h index 5c2dcbc63e7..8a308e423c3 100644 --- a/src/include/storage/checksum_impl.h +++ b/src/include/storage/checksum_impl.h @@ -73,11 +73,10 @@ * 2e-16 false positive rate within margin of error. * * Vectorization of the algorithm requires 32bit x 32bit -> 32bit integer - * multiplication instruction. As of 2013 the corresponding instruction is - * available on x86 SSE4.1 extensions (pmulld) and ARM NEON (vmul.i32). - * Vectorization requires a compiler to do the vectorization for us. For recent - * GCC versions the flags -msse4.1 -funroll-loops -ftree-vectorize are enough - * to achieve vectorization. + * multiplication instruction. Examples include x86 AVX2 extensions (vpmulld) + * and ARM NEON (vmul.i32). For simplicity we rely on the compiler to do the + * vectorization for us. For GCC and clang the flags -funroll-loops + * -ftree-vectorize are enough to achieve vectorization. * * The optimal amount of parallelism to use depends on CPU specific instruction * latency, SIMD instruction width, throughput and the amount of registers @@ -89,8 +88,9 @@ * * The parallelism number 32 was chosen based on the fact that it is the * largest state that fits into architecturally visible x86 SSE registers while - * leaving some free registers for intermediate values. For future processors - * with 256bit vector registers this will leave some performance on the table. + * leaving some free registers for intermediate values. For processors + * with 256bit vector registers this leaves some performance on the table. + * * When vectorization is not available it might be beneficial to restructure * the computation to calculate a subset of the columns at a time and perform * multiple passes to avoid register spilling. This optimization opportunity @@ -138,6 +138,9 @@ do { \ (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \ } while (0) +/* Provide a static definition for external programs */ +#ifndef PG_CHECKSUM_INTERNAL + /* * Block checksum algorithm. The page must be adequately aligned * (at least on 4-byte boundary). @@ -145,34 +148,13 @@ do { \ static uint32 pg_checksum_block(const PGChecksummablePage *page) { - uint32 sums[N_SUMS]; - uint32 result = 0; - uint32 i, - j; - - /* ensure that the size is compatible with the algorithm */ - Assert(sizeof(PGChecksummablePage) == BLCKSZ); - - /* initialize partial checksums to their corresponding offsets */ - memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets)); - - /* main checksum calculation */ - for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++) - for (j = 0; j < N_SUMS; j++) - CHECKSUM_COMP(sums[j], page->data[i][j]); - - /* finally add in two rounds of zeroes for additional mixing */ - for (i = 0; i < 2; i++) - for (j = 0; j < N_SUMS; j++) - CHECKSUM_COMP(sums[j], 0); - - /* xor fold partial checksums together */ - for (i = 0; i < N_SUMS; i++) - result ^= sums[i]; - - return result; +#include "storage/checksum_block_internal.h" } +#else +static uint32 (*pg_checksum_block) (const PGChecksummablePage *page); +#endif + /* * Compute the checksum for a Postgres page. * -- 2.52.0 [application/x-patch] v10-0002-Adjust-benchmark-to-use-core-checksum.patch (1.6K, 3-v10-0002-Adjust-benchmark-to-use-core-checksum.patch) download | inline diff: From 1e11687dd9778caaeeb3c73e3ab7b526c9e8e77a Mon Sep 17 00:00:00 2001 From: John Naylor <[email protected]> Date: Fri, 9 Jan 2026 17:07:37 +0700 Subject: [PATCH v10 2/3] Adjust benchmark to use core checksum --- contrib/pg_checksum_bench/pg_checksum_bench.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/contrib/pg_checksum_bench/pg_checksum_bench.c b/contrib/pg_checksum_bench/pg_checksum_bench.c index dc20395a590..61da664e723 100644 --- a/contrib/pg_checksum_bench/pg_checksum_bench.c +++ b/contrib/pg_checksum_bench/pg_checksum_bench.c @@ -1,7 +1,6 @@ #include "postgres.h" #include "fmgr.h" -#include "port/checksum.h" -#include "port/checksum_impl.h" +#include "storage/checksum.h" #include <stdio.h> #include <assert.h> @@ -15,23 +14,23 @@ Datum drive_pg_checksum(PG_FUNCTION_ARGS) { int page_count = PG_GETARG_INT32(0); - PGChecksummablePage *pages; + char *pages; int i; size_t j; - pages = palloc(page_count * sizeof(PGChecksummablePage)); + pages = palloc(page_count * BLCKSZ); srand(0); - for (j = 0; j < page_count * sizeof(PGChecksummablePage); j++) + for (j = 0; j < page_count * BLCKSZ; j++) { - char *byte_ptr = (char *) pages; + char *byte_ptr = pages; byte_ptr[j] = rand() % 256; } for (i = 0; i < REPEATS; i++) { - const PGChecksummablePage *test_page = pages + (i % page_count); - volatile uint32 result = pg_checksum_block_choose((const char *) test_page); + char *test_page = pages + (i % page_count); + volatile uint32 result = pg_checksum_page((char *) test_page, 0); (void) result; } -- 2.52.0 [application/x-patch] v10-0001-Benchmark-code-for-postgres-checksums.patch (5.0K, 4-v10-0001-Benchmark-code-for-postgres-checksums.patch) download | inline diff: From 7a3afea56b253982d28b8123a35466fc93ee51d2 Mon Sep 17 00:00:00 2001 From: Andrew Kim <[email protected]> Date: Wed, 5 Nov 2025 14:37:29 -0800 Subject: [PATCH v10 1/3] Benchmark code for postgres checksums Add pg_checksum_bench extension for performance testing of checksum implementations with AVX2 optimization. Key features: - PostgreSQL extension for benchmarking checksum performance - Tests pg_checksum_block_choose() with runtime AVX2 dispatch --- contrib/meson.build | 1 + contrib/pg_checksum_bench/meson.build | 23 ++++++++++ .../pg_checksum_bench--1.0.sql | 8 ++++ contrib/pg_checksum_bench/pg_checksum_bench.c | 42 +++++++++++++++++++ .../pg_checksum_bench.control | 4 ++ .../sql/pg_checksum_bench.sql | 17 ++++++++ 6 files changed, 95 insertions(+) create mode 100644 contrib/pg_checksum_bench/meson.build create mode 100644 contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql create mode 100644 contrib/pg_checksum_bench/pg_checksum_bench.c create mode 100644 contrib/pg_checksum_bench/pg_checksum_bench.control create mode 100644 contrib/pg_checksum_bench/sql/pg_checksum_bench.sql diff --git a/contrib/meson.build b/contrib/meson.build index def13257cbe..98fe47b5b9b 100644 --- a/contrib/meson.build +++ b/contrib/meson.build @@ -12,6 +12,7 @@ contrib_doc_args = { 'install_dir': contrib_doc_dir, } +subdir('pg_checksum_bench') subdir('amcheck') subdir('auth_delay') subdir('auto_explain') diff --git a/contrib/pg_checksum_bench/meson.build b/contrib/pg_checksum_bench/meson.build new file mode 100644 index 00000000000..32ccd9efa0f --- /dev/null +++ b/contrib/pg_checksum_bench/meson.build @@ -0,0 +1,23 @@ +# Copyright (c) 2022-2025, PostgreSQL Global Development Group + +pg_checksum_bench_sources = files( + 'pg_checksum_bench.c', +) + +if host_system == 'windows' + pg_checksum_bench_sources += rc_lib_gen.process(win32ver_rc, extra_args: [ + '--NAME', 'pg_checksum_bench', + '--FILEDESC', 'pg_checksum_bench',]) +endif + +pg_checksum_bench = shared_module('pg_checksum_bench', + pg_checksum_bench_sources, + kwargs: contrib_mod_args, +) +contrib_targets += pg_checksum_bench + +install_data( + 'pg_checksum_bench--1.0.sql', + 'pg_checksum_bench.control', + kwargs: contrib_data_args, +) diff --git a/contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql b/contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql new file mode 100644 index 00000000000..5f13cbe3c5e --- /dev/null +++ b/contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql @@ -0,0 +1,8 @@ +/* contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +-- \echo Use "CREATE EXTENSION pg_checksum_bench" to load this file. \quit + +CREATE FUNCTION drive_pg_checksum(page_count int) + RETURNS pg_catalog.void + AS 'MODULE_PATHNAME' LANGUAGE C; diff --git a/contrib/pg_checksum_bench/pg_checksum_bench.c b/contrib/pg_checksum_bench/pg_checksum_bench.c new file mode 100644 index 00000000000..dc20395a590 --- /dev/null +++ b/contrib/pg_checksum_bench/pg_checksum_bench.c @@ -0,0 +1,42 @@ +#include "postgres.h" +#include "fmgr.h" +#include "port/checksum.h" +#include "port/checksum_impl.h" + +#include <stdio.h> +#include <assert.h> + +PG_MODULE_MAGIC; + +#define REPEATS 1000000 + +PG_FUNCTION_INFO_V1(drive_pg_checksum); +Datum +drive_pg_checksum(PG_FUNCTION_ARGS) +{ + int page_count = PG_GETARG_INT32(0); + PGChecksummablePage *pages; + int i; + size_t j; + + pages = palloc(page_count * sizeof(PGChecksummablePage)); + srand(0); + for (j = 0; j < page_count * sizeof(PGChecksummablePage); j++) + { + char *byte_ptr = (char *) pages; + + byte_ptr[j] = rand() % 256; + } + + for (i = 0; i < REPEATS; i++) + { + const PGChecksummablePage *test_page = pages + (i % page_count); + volatile uint32 result = pg_checksum_block_choose((const char *) test_page); + + (void) result; + } + + pfree((void *) pages); + + PG_RETURN_VOID(); +} diff --git a/contrib/pg_checksum_bench/pg_checksum_bench.control b/contrib/pg_checksum_bench/pg_checksum_bench.control new file mode 100644 index 00000000000..4a4e2c9363c --- /dev/null +++ b/contrib/pg_checksum_bench/pg_checksum_bench.control @@ -0,0 +1,4 @@ +comment = 'pg_checksum benchmark' +default_version = '1.0' +module_pathname = '$libdir/pg_checksum_bench' +relocatable = true diff --git a/contrib/pg_checksum_bench/sql/pg_checksum_bench.sql b/contrib/pg_checksum_bench/sql/pg_checksum_bench.sql new file mode 100644 index 00000000000..4b347699953 --- /dev/null +++ b/contrib/pg_checksum_bench/sql/pg_checksum_bench.sql @@ -0,0 +1,17 @@ +CREATE EXTENSION pg_checksum_bench; + +SELECT drive_pg_checksum(-1); + +\timing on + +SELECT drive_pg_checksum(1); +SELECT drive_pg_checksum(2); +SELECT drive_pg_checksum(4); +SELECT drive_pg_checksum(8); +SELECT drive_pg_checksum(16); +SELECT drive_pg_checksum(32); +SELECT drive_pg_checksum(64); +SELECT drive_pg_checksum(128); +SELECT drive_pg_checksum(256); +SELECT drive_pg_checksum(512); +SELECT drive_pg_checksum(1024); -- 2.52.0 ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-01-13 01:58 Andrew Kim <[email protected]> parent: John Naylor <[email protected]> 0 siblings, 1 reply; 15+ messages in thread From: Andrew Kim @ 2026-01-13 01:58 UTC (permalink / raw) To: John Naylor <[email protected]>; +Cc: [email protected]; Oleg Tselebrovskiy <[email protected]> Hi John, Thanks for taking the time to dig into this, I really appreciate the detailed analysis, especially catching the Meson CI failure, which I had unfortunately missed after v6. On Sun, Jan 11, 2026 at 3:19 PM John Naylor <[email protected]> wrote: > > On Thu, Nov 6, 2025 at 6:50 AM Andrew Kim <[email protected]> wrote: > > The v9 patch series is attached. > > Sorry for the delay. I found some issues last month and needed to > consider the tradeoffs. > > First, apparently it has gone unnoticed by everyone, myself included, > that no version has passed Meson CI since v6: > > https://cirrus-ci.com/github/postgresql-cfbot/postgresql/cf%2F5726 > > That's because `ninja -C build -t missingdeps` gives: > > Missing dep: src/port/libpgport_shlib_checksum.a.p/checksum.c.o uses > src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) > Missing dep: src/port/libpgport_checksum.a.p/checksum.c.o uses > src/include/utils/errcodes.h (generated by CUSTOM_COMMAND) > Processed 2561 nodes. > Error: There are 2 missing dependency paths. > 2 targets had depfile dependencies on 1 distinct generated inputs > (from 1 rules) without a non-depfile dep path to the generator. > There might be build flakiness if any of the targets listed above are > built alone, or not late enough, in a clean output directory. > > In the back of my mind I was worried of consequences of something in > src/port depending on backend types, but hadn't seen any in my local > builds. It seems the proximate cause is the removal of this stanza > with no equivalent replacement: > > --- a/src/backend/storage/page/meson.build > +++ b/src/backend/storage/page/meson.build > @@ -1,14 +1,5 @@ > # Copyright (c) 2022-2025, PostgreSQL Global Development Group > > -checksum_backend_lib = static_library('checksum_backend_lib', > - 'checksum.c', > - dependencies: backend_build_deps, > - kwargs: internal_lib_args, > - c_args: vectorize_cflags + unroll_loops_cflags, > -) > - > -backend_link_with += checksum_backend_lib > > The low-level algorithm doesn't care about database pages, only > integers, so first I tried to surgically isolate the concepts, but > that was too messy. > > In the attached v10-0003, I went back to something more similar to v6, > but incorporated Andrew's idea of using PG_CHECKSUM_INTERNAL to allow > for flexibility. Now pg_filedump compiles without any changes, so > that's a plus. > > > - Provides public interfaces wrapping the basic implementation > > > - No code duplication (checksum.c includes checksum_impl.h) > > Upthread I mentioned "thin wrappers", but so far I haven't seen it in > any patch versions, so I don't think this term means the same thing to > you as it does to me (I saw pretty clear duplication in v9). It then > occurred to me that with function attribute targets, doing the naive > thing throws a compiler error IIRC -- namely just have a notional > function call that then gets inlined and re-targeted. So in v10 I > separated the body of checksum_block to a semi-private header to > provide hardware-specific definitions for core code, while also > maintaining the same one that external code expects. I agree that the missing dependency reported by Meson is a real issue, not just a theoretical one. The removal of the backend-side checksum_backend_lib stanza without an equivalent dependency path explains the CI breakage clearly, and your diagnosis makes sense, v10-0003 approach, splitting the body of checksum_block into a semi-private implementation header while preserving the externally visible interface, that makes sense to me > > For this to be commitable, I think (and I think Oleg agrees) that the > feature detection should go in src/port. Some of us have been thinking > of refactoring and centralizing the feature detection, and now may be > a good time to do it. Before going that far, I wanted to see what > people think of v10. I also agree with you (and Oleg) that feature detection really belongs in src/port, even if that means doing a bit more refactoring up front. As you said, this may actually be a good forcing function to finally consolidate feature detection in a cleaner way. I’m supportive of using v10 as the basis for further discussion and iteration, cleaning up Meson dependency declarations so generated headers are properly ordered, refining the PG_CHECKSUM_INTERNAL usage if needed, and assisting with any additional refactoring required to keep src/port fully backend-agnostic. Thanks again for the careful review and for pushing this in a direction that’s more robust and committable. > > -- > John Naylor > Amazon Web Services ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-01-13 05:07 John Naylor <[email protected]> parent: Andrew Kim <[email protected]> 0 siblings, 1 reply; 15+ messages in thread From: John Naylor @ 2026-01-13 05:07 UTC (permalink / raw) To: Andrew Kim <[email protected]>; +Cc: [email protected]; Oleg Tselebrovskiy <[email protected]> On Tue, Jan 13, 2026 at 8:58 AM Andrew Kim <[email protected]> wrote: > The removal of the backend-side checksum_backend_lib stanza without an > equivalent dependency path explains the CI breakage clearly, > and your diagnosis makes sense, v10-0003 approach, > splitting the body of checksum_block into a semi-private > implementation header while preserving the externally visible > interface, > that makes sense to me Glad to hear it. > I’m supportive of using v10 as the basis for further discussion and iteration, > cleaning up Meson dependency declarations so generated headers are > properly ordered, > refining the PG_CHECKSUM_INTERNAL usage if needed, Great! It sounds like you've found some issues to address? It's not clear. -- John Naylor Amazon Web Services ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-01-15 08:03 Oleg Tselebrovskiy <[email protected]> parent: John Naylor <[email protected]> 0 siblings, 2 replies; 15+ messages in thread From: Oleg Tselebrovskiy @ 2026-01-15 08:03 UTC (permalink / raw) To: John Naylor <[email protected]>; +Cc: Andrew Kim <[email protected]>; [email protected] Can't respond to the original message about v10 of the patch for some reason, so I'll respond here. > So in v10 I separated the body of checksum_block to > a semi-private header to provide hardware-specific definitions > for core code, while also maintaining the same one that > external code expects I like the usage of a semi-internal header, less code duplication is always good > In the attached v10-0003, I went back to something more similar to v6, > but incorporated Andrew's idea of using PG_CHECKSUM_INTERNAL to allow > for flexibility. Now pg_filedump compiles without any changes, so > that's a plus. If I understand correctly, with how code is currently, external programms can define PG_CHECKSUM_INTERNAL manually, but then they won't have access to static functions inside of checksum.c, so all you get is a pointer that leads nowhere, correct? I'd like to think that speeding up checksum calculation is something that some external programms could appreciate. Is it possible to move pg_checksum_block_fallback, pg_checksum_block_avx2, and pg_checksum_choose to checksum_impl.h? It would mean moving all the hardware check to the same header as well. Doesn't look or sound pretty, but this would allow external programms to chose implementation the same way the core does it, but also just change nothing and still have fallback code. > For this to be commitable, I think (and I think Oleg agrees) that the > feature detection should go in src/port. Some of us have been thinking > of refactoring and centralizing the feature detection, and now may be > a good time to do it. I do agree with that. I thought that we can continue with v10 as-is, commit it with hardware checks still in checksum.c and refactor it with everything else, but with my proposition above (and even without it) it seems that refactoring of hardware checks should come first. Also, not moving all those checksum files to src/port saves us from thinking about problems with meson and current external programs, but, I think, that after hardware checks are refactored, we could revisit the question of moving checksum[_impl].h/.c to src/port. All in all, very happy to see progress on this! Regards, Oleg ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-01-15 10:35 John Naylor <[email protected]> parent: Oleg Tselebrovskiy <[email protected]> 1 sibling, 0 replies; 15+ messages in thread From: John Naylor @ 2026-01-15 10:35 UTC (permalink / raw) To: Oleg Tselebrovskiy <[email protected]>; +Cc: Andrew Kim <[email protected]>; [email protected] On Thu, Jan 15, 2026 at 3:04 PM Oleg Tselebrovskiy <[email protected]> wrote: > > So in v10 I separated the body of checksum_block to > > a semi-private header to provide hardware-specific definitions > > for core code, while also maintaining the same one that > > external code expects > > I like the usage of a semi-internal header, less code duplication > is always good Glad to hear it. > If I understand correctly, with how code is currently, > external programms can define PG_CHECKSUM_INTERNAL manually, > but then they won't have access to static functions inside of > checksum.c, so all you get is a pointer that leads nowhere, correct? Sounds right, but I'm not sure why an external program would define it, because it's named...drumroll.."internal". > I'd like to think that speeding up checksum calculation is something > that some external programms could appreciate. External programs are probably doing some one-off task, so I don't see a reason to work harder. > Also, not moving all those checksum files to src/port saves us from > thinking about problems with meson and current external programs, > but, I think, that after hardware checks are refactored, we could > revisit the question of moving checksum[_impl].h/.c to src/port. Refactoring the hardware checks is not going to make those two problems go away, and I don't understand why you want to move anything to begin with. -- John Naylor Amazon Web Services ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-01-21 11:13 John Naylor <[email protected]> parent: Oleg Tselebrovskiy <[email protected]> 1 sibling, 1 reply; 15+ messages in thread From: John Naylor @ 2026-01-21 11:13 UTC (permalink / raw) To: Oleg Tselebrovskiy <[email protected]>; +Cc: Andrew Kim <[email protected]>; [email protected] Attached is v11 to fix headerscheck, per CI. -- John Naylor Amazon Web Services Attachments: [application/x-patch] v11-0003-Enable-autovectorizing-pg_checksum_block-with-AV.patch (14.7K, 2-v11-0003-Enable-autovectorizing-pg_checksum_block-with-AV.patch) download | inline diff: From 5f60e4457fa6e67b2d186895a4f3e10ac87989ec Mon Sep 17 00:00:00 2001 From: John Naylor <[email protected]> Date: Thu, 8 Jan 2026 18:30:20 +0700 Subject: [PATCH v11 3/3] Enable autovectorizing pg_checksum_block with AVX2 runtime detection We already rely on autovectorization for computing page checksums, but on x86 we can get about twice the performance by annotating pg_checksum_block() with function target attributes for AVX2, which uses 256-bit registers. WIP: Runtime detection is okay checksum.c for now, but it'd be better to refactor feature detection at some point so it's more centralized. Co-authored-by: Matthew Sterrett <[email protected]> Co-authored-by: Andrew Kim <[email protected]> Reviewed-by: Oleg Tselebrovskiy <[email protected]> Discussion: https://postgr.es/m/CA%2BvA85_5GTu%2BHHniSbvvP%2B8k3%3DxZO%3DWE84NPwiKyxztqvpfZ3Q%40mail.gmail.com Discussion: https://postgr.es/m/20250911054220.3784-1-root%40ip-172-31-36-228.ec2.internal --- config/c-compiler.m4 | 26 ++++ configure | 52 ++++++++ configure.ac | 9 ++ meson.build | 30 +++++ src/backend/storage/page/checksum.c | 112 +++++++++++++++++- src/include/pg_config.h.in | 3 + src/include/storage/checksum_block_internal.h | 42 +++++++ src/include/storage/checksum_impl.h | 48 +++----- src/tools/pginclude/headerscheck | 2 + 9 files changed, 290 insertions(+), 34 deletions(-) create mode 100644 src/include/storage/checksum_block_internal.h diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 1509dbfa2ab..1f3e31fc2d3 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -613,6 +613,32 @@ fi undefine([Ac_cachevar])dnl ])# PGAC_SSE42_CRC32_INTRINSICS +# PGAC_AVX2_SUPPORT +# --------------------------- +# Check if the compiler supports AVX2 target attribute. +# This is used for optimized checksum calculations with runtime detection. +# +# If AVX2 target attribute is supported, sets pgac_avx2_support. +AC_DEFUN([PGAC_AVX2_SUPPORT], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx2_support])])dnl +AC_CACHE_CHECK([for AVX2 target attribute support], [Ac_cachevar], +[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([#include <stdint.h> + #if defined(__has_attribute) && __has_attribute (target) + __attribute__((target("avx2"))) + static int avx2_test(void) + { + return 0; + } + #endif], + [return avx2_test();])], + [Ac_cachevar=yes], + [Ac_cachevar=no])]) +if test x"$Ac_cachevar" = x"yes"; then + pgac_avx2_support=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX2_SUPPORT + # PGAC_AVX512_PCLMUL_INTRINSICS # --------------------------- # Check if the compiler supports AVX-512 carryless multiplication diff --git a/configure b/configure index 04eeb1a741c..72c935c5d83 100755 --- a/configure +++ b/configure @@ -17680,6 +17680,58 @@ $as_echo "#define HAVE_XSAVE_INTRINSICS 1" >>confdefs.h fi +# Check for AVX2 target and intrinsic support +# +if test x"$host_cpu" = x"x86_64"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for AVX2 support" >&5 +$as_echo_n "checking for AVX2 support... " >&6; } +if ${pgac_cv_avx2_support+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <immintrin.h> + #include <stdint.h> + #if defined(__has_attribute) && __has_attribute (target) + __attribute__((target("avx2"))) + #endif + static int avx2_test(void) + { + const char buf[sizeof(__m256i)]; + __m256i accum = _mm256_loadu_si256((const __m256i *) buf); + accum = _mm256_add_epi32(accum, accum); + int result = _mm256_extract_epi32(accum, 0); + return (int) result; + } +int +main () +{ +return avx2_test(); + ; + return 0; +} +_ACEOF +if ac_fn_c_try_link "$LINENO"; then : + pgac_cv_avx2_support=yes +else + pgac_cv_avx2_support=no +fi +rm -f core conftest.err conftest.$ac_objext \ + conftest$ac_exeext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_support" >&5 +$as_echo "$pgac_cv_avx2_support" >&6; } +if test x"$pgac_cv_avx2_support" = x"yes"; then + pgac_avx2_support=yes +fi + + if test x"$pgac_avx2_support" = x"yes"; then + +$as_echo "#define USE_AVX2_WITH_RUNTIME_CHECK 1" >>confdefs.h + + fi +fi + # Check for AVX-512 popcount intrinsics # if test x"$host_cpu" = x"x86_64"; then diff --git a/configure.ac b/configure.ac index 13c75170f7a..c2180111044 100644 --- a/configure.ac +++ b/configure.ac @@ -2089,6 +2089,15 @@ else fi fi +# Check for AVX2 target and intrinsic support +# +if test x"$host_cpu" = x"x86_64"; then + PGAC_AVX2_SUPPORT() + if test x"$pgac_avx2_support" = x"yes"; then + AC_DEFINE(USE_AVX2_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX2 instructions with a runtime check.]) + fi +fi + # Check for XSAVE intrinsics # PGAC_XSAVE_INTRINSICS() diff --git a/meson.build b/meson.build index 6d304f32fb0..61620cbe37a 100644 --- a/meson.build +++ b/meson.build @@ -2348,6 +2348,36 @@ int main(void) endif +############################################################### +# Check for the availability of AVX2 support +############################################################### + +if host_cpu == 'x86_64' + + prog = ''' +#include <immintrin.h> +#include <stdint.h> +#if defined(__has_attribute) && __has_attribute (target) +__attribute__((target("avx2"))) +#endif +static int avx2_test(void) +{ + return 0; +} + +int main(void) +{ + return avx2_test(); +} +''' + + if cc.links(prog, name: 'AVX2 support', args: test_c_args) + cdata.set('USE_AVX2_WITH_RUNTIME_CHECK', 1) + endif + +endif + + ############################################################### # Check for the availability of AVX-512 popcount intrinsics. ############################################################### diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c index 8716651c8b5..55ebe988411 100644 --- a/src/backend/storage/page/checksum.c +++ b/src/backend/storage/page/checksum.c @@ -13,10 +13,120 @@ */ #include "postgres.h" +#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) +#include <cpuid.h> +#endif + +#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX) +#include <intrin.h> +#endif + +#ifdef HAVE_XSAVE_INTRINSICS +#include <immintrin.h> +#endif + #include "storage/checksum.h" + /* * The actual code is in storage/checksum_impl.h. This is done so that * external programs can incorporate the checksum code by #include'ing - * that file from the exported Postgres headers. (Compare our CRC code.) + * that file from the exported Postgres headers. (Compare our legacy + * CRC code in pg_crc.h.) + * The PG_CHECKSUM_INTERNAL symbol allows core to use hardware-specific + * coding without affecting external programs. */ +#define PG_CHECKSUM_INTERNAL #include "storage/checksum_impl.h" /* IWYU pragma: keep */ + + +/* WIP: the feature detection should go in src/port */ + +/* + * Does CPUID say there's support for XSAVE instructions? + */ +static inline bool +xsave_available(void) +{ + unsigned int exx[4] = {0, 0, 0, 0}; + +#if defined(HAVE__GET_CPUID) + __get_cpuid(1, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUID) + __cpuid(exx, 1); +#endif + return (exx[2] & (1 << 27)) != 0; /* osxsave */ +} + +/* + * Does XGETBV say the YMM registers are enabled? + * + * NB: Caller is responsible for verifying that xsave_available() returns true + * before calling this. + */ +#ifdef HAVE_XSAVE_INTRINSICS +pg_attribute_target("xsave") +#endif +static inline bool +ymm_regs_available(void) +{ +#ifdef HAVE_XSAVE_INTRINSICS + return (_xgetbv(0) & 0x06) == 0x06; +#else + return false; +#endif +} + +/* + * Check for AVX2 support using CPUID detection + */ +static inline bool +avx2_available(void) +{ + unsigned int exx[4] = {0, 0, 0, 0}; + +#if defined(HAVE__GET_CPUID_COUNT) + __get_cpuid_count(7, 0, &exx[0], &exx[1], &exx[2], &exx[3]); +#elif defined(HAVE__CPUIDEX) + __cpuidex(exx, 7, 0); +#endif + + return (exx[1] & (1 << 5)) != 0; /* avx2 */ +} + +static uint32 +pg_checksum_block_fallback(const PGChecksummablePage *page) +{ +#include "storage/checksum_block_internal.h" +} + +/* + * AVX2-optimized block checksum algorithm. + */ +#ifdef USE_AVX2_WITH_RUNTIME_CHECK +pg_attribute_target("avx2") +static uint32 +pg_checksum_block_avx2(const PGChecksummablePage *page) +{ +#include "storage/checksum_block_internal.h" +} +#endif /* USE_AVX2_WITH_RUNTIME_CHECK */ + +/* + * Choose the best available checksum implementation. + */ +static uint32 +pg_checksum_choose(const PGChecksummablePage *page) +{ +#ifdef USE_AVX2_WITH_RUNTIME_CHECK + if (xsave_available() && + ymm_regs_available() && + avx2_available()) + pg_checksum_block = pg_checksum_block_avx2; + else +#endif + pg_checksum_block = pg_checksum_block_fallback; + + return pg_checksum_block(page); +} + +static uint32 (*pg_checksum_block) (const PGChecksummablePage *page) = pg_checksum_choose; diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 339268dc8ef..1e43e9b2bc4 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -665,6 +665,9 @@ /* Define to 1 to use AVX-512 CRC algorithms with a runtime check. */ #undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK +/* Define to 1 to use AVX2 instructions with a runtime check. */ +#undef USE_AVX2_WITH_RUNTIME_CHECK + /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */ #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK diff --git a/src/include/storage/checksum_block_internal.h b/src/include/storage/checksum_block_internal.h new file mode 100644 index 00000000000..b4e6987d6b5 --- /dev/null +++ b/src/include/storage/checksum_block_internal.h @@ -0,0 +1,42 @@ +/*------------------------------------------------------------------------- + * + * checksum_block_internal.h + * Core algorithm for page checksums , semi private to checksum_impl.h + * and checksum.c. + * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/storage/checksum_block_internal.h + * + *------------------------------------------------------------------------- + */ + +/* there is deliberately not an #ifndef CHECKSUM_BLOCK_INTERNAL_H here */ + +uint32 sums[N_SUMS]; +uint32 result = 0; +uint32 i, + j; + +/* ensure that the size is compatible with the algorithm */ +Assert(sizeof(PGChecksummablePage) == BLCKSZ); + +/* initialize partial checksums to their corresponding offsets */ +memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets)); + +/* main checksum calculation */ +for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++) + for (j = 0; j < N_SUMS; j++) + CHECKSUM_COMP(sums[j], page->data[i][j]); + +/* finally add in two rounds of zeroes for additional mixing */ +for (i = 0; i < 2; i++) + for (j = 0; j < N_SUMS; j++) + CHECKSUM_COMP(sums[j], 0); + +/* xor fold partial checksums together */ +for (i = 0; i < N_SUMS; i++) + result ^= sums[i]; + +return result; diff --git a/src/include/storage/checksum_impl.h b/src/include/storage/checksum_impl.h index 5c2dcbc63e7..8a308e423c3 100644 --- a/src/include/storage/checksum_impl.h +++ b/src/include/storage/checksum_impl.h @@ -73,11 +73,10 @@ * 2e-16 false positive rate within margin of error. * * Vectorization of the algorithm requires 32bit x 32bit -> 32bit integer - * multiplication instruction. As of 2013 the corresponding instruction is - * available on x86 SSE4.1 extensions (pmulld) and ARM NEON (vmul.i32). - * Vectorization requires a compiler to do the vectorization for us. For recent - * GCC versions the flags -msse4.1 -funroll-loops -ftree-vectorize are enough - * to achieve vectorization. + * multiplication instruction. Examples include x86 AVX2 extensions (vpmulld) + * and ARM NEON (vmul.i32). For simplicity we rely on the compiler to do the + * vectorization for us. For GCC and clang the flags -funroll-loops + * -ftree-vectorize are enough to achieve vectorization. * * The optimal amount of parallelism to use depends on CPU specific instruction * latency, SIMD instruction width, throughput and the amount of registers @@ -89,8 +88,9 @@ * * The parallelism number 32 was chosen based on the fact that it is the * largest state that fits into architecturally visible x86 SSE registers while - * leaving some free registers for intermediate values. For future processors - * with 256bit vector registers this will leave some performance on the table. + * leaving some free registers for intermediate values. For processors + * with 256bit vector registers this leaves some performance on the table. + * * When vectorization is not available it might be beneficial to restructure * the computation to calculate a subset of the columns at a time and perform * multiple passes to avoid register spilling. This optimization opportunity @@ -138,6 +138,9 @@ do { \ (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \ } while (0) +/* Provide a static definition for external programs */ +#ifndef PG_CHECKSUM_INTERNAL + /* * Block checksum algorithm. The page must be adequately aligned * (at least on 4-byte boundary). @@ -145,34 +148,13 @@ do { \ static uint32 pg_checksum_block(const PGChecksummablePage *page) { - uint32 sums[N_SUMS]; - uint32 result = 0; - uint32 i, - j; - - /* ensure that the size is compatible with the algorithm */ - Assert(sizeof(PGChecksummablePage) == BLCKSZ); - - /* initialize partial checksums to their corresponding offsets */ - memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets)); - - /* main checksum calculation */ - for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++) - for (j = 0; j < N_SUMS; j++) - CHECKSUM_COMP(sums[j], page->data[i][j]); - - /* finally add in two rounds of zeroes for additional mixing */ - for (i = 0; i < 2; i++) - for (j = 0; j < N_SUMS; j++) - CHECKSUM_COMP(sums[j], 0); - - /* xor fold partial checksums together */ - for (i = 0; i < N_SUMS; i++) - result ^= sums[i]; - - return result; +#include "storage/checksum_block_internal.h" } +#else +static uint32 (*pg_checksum_block) (const PGChecksummablePage *page); +#endif + /* * Compute the checksum for a Postgres page. * diff --git a/src/tools/pginclude/headerscheck b/src/tools/pginclude/headerscheck index 7a6755991bb..569e749b25a 100755 --- a/src/tools/pginclude/headerscheck +++ b/src/tools/pginclude/headerscheck @@ -154,6 +154,8 @@ do test "$f" = src/include/catalog/syscache_ids.h && continue test "$f" = src/include/catalog/syscache_info.h && continue + test "$f" = src/include/storage/checksum_block_internal.h && continue + # We can't make these Bison output files compilable standalone # without using "%code require", which old Bison versions lack. # parser/gram.h will be included by parser/gramparse.h anyway. -- 2.52.0 [application/x-patch] v11-0002-Adjust-benchmark-to-use-core-checksum.patch (1.6K, 3-v11-0002-Adjust-benchmark-to-use-core-checksum.patch) download | inline diff: From b6a3aba8f9568684cf22af42584bec65b6170668 Mon Sep 17 00:00:00 2001 From: John Naylor <[email protected]> Date: Fri, 9 Jan 2026 17:07:37 +0700 Subject: [PATCH v11 2/3] Adjust benchmark to use core checksum XXX not for commit --- contrib/pg_checksum_bench/pg_checksum_bench.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/contrib/pg_checksum_bench/pg_checksum_bench.c b/contrib/pg_checksum_bench/pg_checksum_bench.c index dc20395a590..61da664e723 100644 --- a/contrib/pg_checksum_bench/pg_checksum_bench.c +++ b/contrib/pg_checksum_bench/pg_checksum_bench.c @@ -1,7 +1,6 @@ #include "postgres.h" #include "fmgr.h" -#include "port/checksum.h" -#include "port/checksum_impl.h" +#include "storage/checksum.h" #include <stdio.h> #include <assert.h> @@ -15,23 +14,23 @@ Datum drive_pg_checksum(PG_FUNCTION_ARGS) { int page_count = PG_GETARG_INT32(0); - PGChecksummablePage *pages; + char *pages; int i; size_t j; - pages = palloc(page_count * sizeof(PGChecksummablePage)); + pages = palloc(page_count * BLCKSZ); srand(0); - for (j = 0; j < page_count * sizeof(PGChecksummablePage); j++) + for (j = 0; j < page_count * BLCKSZ; j++) { - char *byte_ptr = (char *) pages; + char *byte_ptr = pages; byte_ptr[j] = rand() % 256; } for (i = 0; i < REPEATS; i++) { - const PGChecksummablePage *test_page = pages + (i % page_count); - volatile uint32 result = pg_checksum_block_choose((const char *) test_page); + char *test_page = pages + (i % page_count); + volatile uint32 result = pg_checksum_page((char *) test_page, 0); (void) result; } -- 2.52.0 [application/x-patch] v11-0001-Benchmark-code-for-postgres-checksums.patch (4.9K, 4-v11-0001-Benchmark-code-for-postgres-checksums.patch) download | inline diff: From 97a24b6da8fddaaafc2ed434dabf14a53bd6eecb Mon Sep 17 00:00:00 2001 From: Andrew Kim <[email protected]> Date: Wed, 5 Nov 2025 14:37:29 -0800 Subject: [PATCH v11 1/3] Benchmark code for postgres checksums Add pg_checksum_bench extension for performance testing of checksum implementations with AVX2 optimization. XXX not for commit --- contrib/meson.build | 1 + contrib/pg_checksum_bench/meson.build | 23 ++++++++++ .../pg_checksum_bench--1.0.sql | 8 ++++ contrib/pg_checksum_bench/pg_checksum_bench.c | 42 +++++++++++++++++++ .../pg_checksum_bench.control | 4 ++ .../sql/pg_checksum_bench.sql | 17 ++++++++ 6 files changed, 95 insertions(+) create mode 100644 contrib/pg_checksum_bench/meson.build create mode 100644 contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql create mode 100644 contrib/pg_checksum_bench/pg_checksum_bench.c create mode 100644 contrib/pg_checksum_bench/pg_checksum_bench.control create mode 100644 contrib/pg_checksum_bench/sql/pg_checksum_bench.sql diff --git a/contrib/meson.build b/contrib/meson.build index def13257cbe..98fe47b5b9b 100644 --- a/contrib/meson.build +++ b/contrib/meson.build @@ -12,6 +12,7 @@ contrib_doc_args = { 'install_dir': contrib_doc_dir, } +subdir('pg_checksum_bench') subdir('amcheck') subdir('auth_delay') subdir('auto_explain') diff --git a/contrib/pg_checksum_bench/meson.build b/contrib/pg_checksum_bench/meson.build new file mode 100644 index 00000000000..32ccd9efa0f --- /dev/null +++ b/contrib/pg_checksum_bench/meson.build @@ -0,0 +1,23 @@ +# Copyright (c) 2022-2025, PostgreSQL Global Development Group + +pg_checksum_bench_sources = files( + 'pg_checksum_bench.c', +) + +if host_system == 'windows' + pg_checksum_bench_sources += rc_lib_gen.process(win32ver_rc, extra_args: [ + '--NAME', 'pg_checksum_bench', + '--FILEDESC', 'pg_checksum_bench',]) +endif + +pg_checksum_bench = shared_module('pg_checksum_bench', + pg_checksum_bench_sources, + kwargs: contrib_mod_args, +) +contrib_targets += pg_checksum_bench + +install_data( + 'pg_checksum_bench--1.0.sql', + 'pg_checksum_bench.control', + kwargs: contrib_data_args, +) diff --git a/contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql b/contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql new file mode 100644 index 00000000000..5f13cbe3c5e --- /dev/null +++ b/contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql @@ -0,0 +1,8 @@ +/* contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +-- \echo Use "CREATE EXTENSION pg_checksum_bench" to load this file. \quit + +CREATE FUNCTION drive_pg_checksum(page_count int) + RETURNS pg_catalog.void + AS 'MODULE_PATHNAME' LANGUAGE C; diff --git a/contrib/pg_checksum_bench/pg_checksum_bench.c b/contrib/pg_checksum_bench/pg_checksum_bench.c new file mode 100644 index 00000000000..dc20395a590 --- /dev/null +++ b/contrib/pg_checksum_bench/pg_checksum_bench.c @@ -0,0 +1,42 @@ +#include "postgres.h" +#include "fmgr.h" +#include "port/checksum.h" +#include "port/checksum_impl.h" + +#include <stdio.h> +#include <assert.h> + +PG_MODULE_MAGIC; + +#define REPEATS 1000000 + +PG_FUNCTION_INFO_V1(drive_pg_checksum); +Datum +drive_pg_checksum(PG_FUNCTION_ARGS) +{ + int page_count = PG_GETARG_INT32(0); + PGChecksummablePage *pages; + int i; + size_t j; + + pages = palloc(page_count * sizeof(PGChecksummablePage)); + srand(0); + for (j = 0; j < page_count * sizeof(PGChecksummablePage); j++) + { + char *byte_ptr = (char *) pages; + + byte_ptr[j] = rand() % 256; + } + + for (i = 0; i < REPEATS; i++) + { + const PGChecksummablePage *test_page = pages + (i % page_count); + volatile uint32 result = pg_checksum_block_choose((const char *) test_page); + + (void) result; + } + + pfree((void *) pages); + + PG_RETURN_VOID(); +} diff --git a/contrib/pg_checksum_bench/pg_checksum_bench.control b/contrib/pg_checksum_bench/pg_checksum_bench.control new file mode 100644 index 00000000000..4a4e2c9363c --- /dev/null +++ b/contrib/pg_checksum_bench/pg_checksum_bench.control @@ -0,0 +1,4 @@ +comment = 'pg_checksum benchmark' +default_version = '1.0' +module_pathname = '$libdir/pg_checksum_bench' +relocatable = true diff --git a/contrib/pg_checksum_bench/sql/pg_checksum_bench.sql b/contrib/pg_checksum_bench/sql/pg_checksum_bench.sql new file mode 100644 index 00000000000..4b347699953 --- /dev/null +++ b/contrib/pg_checksum_bench/sql/pg_checksum_bench.sql @@ -0,0 +1,17 @@ +CREATE EXTENSION pg_checksum_bench; + +SELECT drive_pg_checksum(-1); + +\timing on + +SELECT drive_pg_checksum(1); +SELECT drive_pg_checksum(2); +SELECT drive_pg_checksum(4); +SELECT drive_pg_checksum(8); +SELECT drive_pg_checksum(16); +SELECT drive_pg_checksum(32); +SELECT drive_pg_checksum(64); +SELECT drive_pg_checksum(128); +SELECT drive_pg_checksum(256); +SELECT drive_pg_checksum(512); +SELECT drive_pg_checksum(1024); -- 2.52.0 ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-02-09 08:42 Andrew Kim <[email protected]> parent: John Naylor <[email protected]> 0 siblings, 1 reply; 15+ messages in thread From: Andrew Kim @ 2026-02-09 08:42 UTC (permalink / raw) To: John Naylor <[email protected]>; +Cc: Oleg Tselebrovskiy <[email protected]>; [email protected] Hi John, Thanks for v11. Adding the headercheck exception for the internal header looks good to me. It would be the right call to fix the CI issues. The current structure utilizing the semi-private header looks solid to me. -Andrew On Wed, Jan 21, 2026 at 3:13 AM John Naylor <[email protected]> wrote: > > Attached is v11 to fix headerscheck, per CI. > > -- > John Naylor > Amazon Web Services ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-03-16 08:00 Andrew Kim <[email protected]> parent: Andrew Kim <[email protected]> 0 siblings, 1 reply; 15+ messages in thread From: Andrew Kim @ 2026-03-16 08:00 UTC (permalink / raw) To: John Naylor <[email protected]>; +Cc: Oleg Tselebrovskiy <[email protected]>; [email protected] It looks like your PostgreSQL build on Cirrus CI is failing during the Meson configuration phase because it cannot find the libedit libraries. Should we add these to the pacman installation command in our CI scripts, or is there a preferred way to handle terminal library dependencies for the Windows Meson builds? https://cirrus-ci.com/task/4928462311391232 [01:09:47.468] meson.build:1480:20: ERROR: C shared or static library 'libedit' not found On Mon, Feb 9, 2026 at 12:42 AM Andrew Kim <[email protected]> wrote: > > Hi John, Thanks for v11. > Adding the headercheck exception for the internal header looks good to me. > It would be the right call to fix the CI issues. > The current structure utilizing the semi-private header looks solid to me. > -Andrew > > On Wed, Jan 21, 2026 at 3:13 AM John Naylor <[email protected]> wrote: > > > > Attached is v11 to fix headerscheck, per CI. > > > > -- > > John Naylor > > Amazon Web Services ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-03-17 02:23 John Naylor <[email protected]> parent: Andrew Kim <[email protected]> 0 siblings, 2 replies; 15+ messages in thread From: John Naylor @ 2026-03-17 02:23 UTC (permalink / raw) To: Andrew Kim <[email protected]>; +Cc: Oleg Tselebrovskiy <[email protected]>; [email protected] On Mon, Mar 16, 2026 at 3:00 PM Andrew Kim <[email protected]> wrote: > > It looks like your PostgreSQL build on Cirrus CI is failing during the > Meson configuration phase because it cannot find the libedit > libraries. > Should we add these to the pacman installation command in our CI > scripts, or is there a preferred way to handle terminal library > dependencies for the Windows Meson builds? I'll leave that to the people who maintain it. Sometimes intermittent glitches happen. And please don't top-post. I've attached v12 which is just a rebase over the new centralized feature detection. I also have some review: +# Check if the compiler supports AVX2 target attribute. +# This is used for optimized checksum calculations with runtime detection. It could possibly be used for other things, in which case this will get out of date. It's most reliable to grep for the symbol to see where something is used. Also, the first statement is not true: +[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([#include <stdint.h> + #if defined(__has_attribute) && __has_attribute (target) + __attribute__((target("avx2"))) + static int avx2_test(void) + { + return 0; + } + #endif], With these guards, I think any compiler will pass the test, and CI does show it passes on MSVC: [01:09:52.888] Checking if "AVX2 support" links: YES The consequence is that two functions get built with identical non-AVX2 contents. Then at runtime we pick one of them, but it doesn't matter which. This needs to test what it says it's testing. -- John Naylor Amazon Web Services Attachments: [text/x-patch] v12-0001-Benchmark-code-for-postgres-checksums.patch (4.9K, 2-v12-0001-Benchmark-code-for-postgres-checksums.patch) download | inline diff: From 561542289a4696de7e8e0f5635019f092714e6c8 Mon Sep 17 00:00:00 2001 From: Andrew Kim <[email protected]> Date: Wed, 5 Nov 2025 14:37:29 -0800 Subject: [PATCH v12 1/3] Benchmark code for postgres checksums Add pg_checksum_bench extension for performance testing of checksum implementations with AVX2 optimization. XXX not for commit --- contrib/meson.build | 1 + contrib/pg_checksum_bench/meson.build | 23 ++++++++++ .../pg_checksum_bench--1.0.sql | 8 ++++ contrib/pg_checksum_bench/pg_checksum_bench.c | 42 +++++++++++++++++++ .../pg_checksum_bench.control | 4 ++ .../sql/pg_checksum_bench.sql | 17 ++++++++ 6 files changed, 95 insertions(+) create mode 100644 contrib/pg_checksum_bench/meson.build create mode 100644 contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql create mode 100644 contrib/pg_checksum_bench/pg_checksum_bench.c create mode 100644 contrib/pg_checksum_bench/pg_checksum_bench.control create mode 100644 contrib/pg_checksum_bench/sql/pg_checksum_bench.sql diff --git a/contrib/meson.build b/contrib/meson.build index 5a752eac347..9529f0b1aee 100644 --- a/contrib/meson.build +++ b/contrib/meson.build @@ -12,6 +12,7 @@ contrib_doc_args = { 'install_dir': contrib_doc_dir, } +subdir('pg_checksum_bench') subdir('amcheck') subdir('auth_delay') subdir('auto_explain') diff --git a/contrib/pg_checksum_bench/meson.build b/contrib/pg_checksum_bench/meson.build new file mode 100644 index 00000000000..32ccd9efa0f --- /dev/null +++ b/contrib/pg_checksum_bench/meson.build @@ -0,0 +1,23 @@ +# Copyright (c) 2022-2025, PostgreSQL Global Development Group + +pg_checksum_bench_sources = files( + 'pg_checksum_bench.c', +) + +if host_system == 'windows' + pg_checksum_bench_sources += rc_lib_gen.process(win32ver_rc, extra_args: [ + '--NAME', 'pg_checksum_bench', + '--FILEDESC', 'pg_checksum_bench',]) +endif + +pg_checksum_bench = shared_module('pg_checksum_bench', + pg_checksum_bench_sources, + kwargs: contrib_mod_args, +) +contrib_targets += pg_checksum_bench + +install_data( + 'pg_checksum_bench--1.0.sql', + 'pg_checksum_bench.control', + kwargs: contrib_data_args, +) diff --git a/contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql b/contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql new file mode 100644 index 00000000000..5f13cbe3c5e --- /dev/null +++ b/contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql @@ -0,0 +1,8 @@ +/* contrib/pg_checksum_bench/pg_checksum_bench--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +-- \echo Use "CREATE EXTENSION pg_checksum_bench" to load this file. \quit + +CREATE FUNCTION drive_pg_checksum(page_count int) + RETURNS pg_catalog.void + AS 'MODULE_PATHNAME' LANGUAGE C; diff --git a/contrib/pg_checksum_bench/pg_checksum_bench.c b/contrib/pg_checksum_bench/pg_checksum_bench.c new file mode 100644 index 00000000000..dc20395a590 --- /dev/null +++ b/contrib/pg_checksum_bench/pg_checksum_bench.c @@ -0,0 +1,42 @@ +#include "postgres.h" +#include "fmgr.h" +#include "port/checksum.h" +#include "port/checksum_impl.h" + +#include <stdio.h> +#include <assert.h> + +PG_MODULE_MAGIC; + +#define REPEATS 1000000 + +PG_FUNCTION_INFO_V1(drive_pg_checksum); +Datum +drive_pg_checksum(PG_FUNCTION_ARGS) +{ + int page_count = PG_GETARG_INT32(0); + PGChecksummablePage *pages; + int i; + size_t j; + + pages = palloc(page_count * sizeof(PGChecksummablePage)); + srand(0); + for (j = 0; j < page_count * sizeof(PGChecksummablePage); j++) + { + char *byte_ptr = (char *) pages; + + byte_ptr[j] = rand() % 256; + } + + for (i = 0; i < REPEATS; i++) + { + const PGChecksummablePage *test_page = pages + (i % page_count); + volatile uint32 result = pg_checksum_block_choose((const char *) test_page); + + (void) result; + } + + pfree((void *) pages); + + PG_RETURN_VOID(); +} diff --git a/contrib/pg_checksum_bench/pg_checksum_bench.control b/contrib/pg_checksum_bench/pg_checksum_bench.control new file mode 100644 index 00000000000..4a4e2c9363c --- /dev/null +++ b/contrib/pg_checksum_bench/pg_checksum_bench.control @@ -0,0 +1,4 @@ +comment = 'pg_checksum benchmark' +default_version = '1.0' +module_pathname = '$libdir/pg_checksum_bench' +relocatable = true diff --git a/contrib/pg_checksum_bench/sql/pg_checksum_bench.sql b/contrib/pg_checksum_bench/sql/pg_checksum_bench.sql new file mode 100644 index 00000000000..4b347699953 --- /dev/null +++ b/contrib/pg_checksum_bench/sql/pg_checksum_bench.sql @@ -0,0 +1,17 @@ +CREATE EXTENSION pg_checksum_bench; + +SELECT drive_pg_checksum(-1); + +\timing on + +SELECT drive_pg_checksum(1); +SELECT drive_pg_checksum(2); +SELECT drive_pg_checksum(4); +SELECT drive_pg_checksum(8); +SELECT drive_pg_checksum(16); +SELECT drive_pg_checksum(32); +SELECT drive_pg_checksum(64); +SELECT drive_pg_checksum(128); +SELECT drive_pg_checksum(256); +SELECT drive_pg_checksum(512); +SELECT drive_pg_checksum(1024); -- 2.53.0 [text/x-patch] v12-0002-Adjust-benchmark-to-use-core-checksum.patch (1.6K, 3-v12-0002-Adjust-benchmark-to-use-core-checksum.patch) download | inline diff: From 556ddbb4c67b0f9b742c40ccb4ca39500de45805 Mon Sep 17 00:00:00 2001 From: John Naylor <[email protected]> Date: Fri, 9 Jan 2026 17:07:37 +0700 Subject: [PATCH v12 2/3] Adjust benchmark to use core checksum XXX not for commit --- contrib/pg_checksum_bench/pg_checksum_bench.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/contrib/pg_checksum_bench/pg_checksum_bench.c b/contrib/pg_checksum_bench/pg_checksum_bench.c index dc20395a590..61da664e723 100644 --- a/contrib/pg_checksum_bench/pg_checksum_bench.c +++ b/contrib/pg_checksum_bench/pg_checksum_bench.c @@ -1,7 +1,6 @@ #include "postgres.h" #include "fmgr.h" -#include "port/checksum.h" -#include "port/checksum_impl.h" +#include "storage/checksum.h" #include <stdio.h> #include <assert.h> @@ -15,23 +14,23 @@ Datum drive_pg_checksum(PG_FUNCTION_ARGS) { int page_count = PG_GETARG_INT32(0); - PGChecksummablePage *pages; + char *pages; int i; size_t j; - pages = palloc(page_count * sizeof(PGChecksummablePage)); + pages = palloc(page_count * BLCKSZ); srand(0); - for (j = 0; j < page_count * sizeof(PGChecksummablePage); j++) + for (j = 0; j < page_count * BLCKSZ; j++) { - char *byte_ptr = (char *) pages; + char *byte_ptr = pages; byte_ptr[j] = rand() % 256; } for (i = 0; i < REPEATS; i++) { - const PGChecksummablePage *test_page = pages + (i % page_count); - volatile uint32 result = pg_checksum_block_choose((const char *) test_page); + char *test_page = pages + (i % page_count); + volatile uint32 result = pg_checksum_page((char *) test_page, 0); (void) result; } -- 2.53.0 [text/x-patch] v12-0003-Enable-autovectorizing-page-checksums-with-AVX2-.patch (13.4K, 4-v12-0003-Enable-autovectorizing-page-checksums-with-AVX2-.patch) download | inline diff: From a2e58990cf79dc217755431a86084e5d3ccd1e5b Mon Sep 17 00:00:00 2001 From: John Naylor <[email protected]> Date: Mon, 2 Mar 2026 17:28:58 +0700 Subject: [PATCH v12 3/3] Enable autovectorizing page checksums with AVX2 where available We already rely on autovectorization for computing page checksums, but on x86 we can get about twice the performance by annotating pg_checksum_block() with function target attributes for AVX2, which uses 256-bit registers. Co-authored-by: Matthew Sterrett <[email protected]> Co-authored-by: Andrew Kim <[email protected]> Reviewed-by: Oleg Tselebrovskiy <[email protected]> Discussion: https://postgr.es/m/CA%2BvA85_5GTu%2BHHniSbvvP%2B8k3%3DxZO%3DWE84NPwiKyxztqvpfZ3Q%40mail.gmail.com Discussion: https://postgr.es/m/20250911054220.3784-1-root%40ip-172-31-36-228.ec2.internal --- config/c-compiler.m4 | 26 +++++++++++++ configure | 46 +++++++++++++++++++++++ configure.ac | 9 +++++ meson.build | 30 +++++++++++++++ src/backend/storage/page/checksum.c | 44 +++++++++++++++++++++- src/include/pg_config.h.in | 3 ++ src/include/port/pg_cpu.h | 3 ++ src/include/storage/checksum_block.inc.c | 42 +++++++++++++++++++++ src/include/storage/checksum_impl.h | 48 ++++++++---------------- src/port/pg_cpu_x86.c | 6 ++- 10 files changed, 222 insertions(+), 35 deletions(-) create mode 100644 src/include/storage/checksum_block.inc.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 88333ef301d..566c62dabab 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -691,6 +691,32 @@ fi undefine([Ac_cachevar])dnl ])# PGAC_SSE42_CRC32_INTRINSICS +# PGAC_AVX2_SUPPORT +# --------------------------- +# Check if the compiler supports AVX2 target attribute. +# This is used for optimized checksum calculations with runtime detection. +# +# If AVX2 target attribute is supported, sets pgac_avx2_support. +AC_DEFUN([PGAC_AVX2_SUPPORT], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx2_support])])dnl +AC_CACHE_CHECK([for AVX2 target attribute support], [Ac_cachevar], +[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([#include <stdint.h> + #if defined(__has_attribute) && __has_attribute (target) + __attribute__((target("avx2"))) + static int avx2_test(void) + { + return 0; + } + #endif], + [return avx2_test();])], + [Ac_cachevar=yes], + [Ac_cachevar=no])]) +if test x"$Ac_cachevar" = x"yes"; then + pgac_avx2_support=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX2_SUPPORT + # PGAC_AVX512_PCLMUL_INTRINSICS # --------------------------- # Check if the compiler supports AVX-512 carryless multiplication diff --git a/configure b/configure index 4c789bd9289..87ebeb0cd0e 100755 --- a/configure +++ b/configure @@ -17835,6 +17835,52 @@ $as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h fi fi +# Check for AVX2 target and intrinsic support +# +if test x"$host_cpu" = x"x86_64"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for AVX2 target attribute support" >&5 +$as_echo_n "checking for AVX2 target attribute support... " >&6; } +if ${pgac_cv_avx2_support+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <stdint.h> + #if defined(__has_attribute) && __has_attribute (target) + __attribute__((target("avx2"))) + static int avx2_test(void) + { + return 0; + } + #endif +int +main () +{ +return avx2_test(); + ; + return 0; +} +_ACEOF +if ac_fn_c_try_compile "$LINENO"; then : + pgac_cv_avx2_support=yes +else + pgac_cv_avx2_support=no +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_support" >&5 +$as_echo "$pgac_cv_avx2_support" >&6; } +if test x"$pgac_cv_avx2_support" = x"yes"; then + pgac_avx2_support=yes +fi + + if test x"$pgac_avx2_support" = x"yes"; then + +$as_echo "#define USE_AVX2_WITH_RUNTIME_CHECK 1" >>confdefs.h + + fi +fi + # Check for XSAVE intrinsics # { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv" >&5 diff --git a/configure.ac b/configure.ac index 9edffe481a6..0445e1cbcff 100644 --- a/configure.ac +++ b/configure.ac @@ -2130,6 +2130,15 @@ else fi fi +# Check for AVX2 target and intrinsic support +# +if test x"$host_cpu" = x"x86_64"; then + PGAC_AVX2_SUPPORT() + if test x"$pgac_avx2_support" = x"yes"; then + AC_DEFINE(USE_AVX2_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX2 instructions with a runtime check.]) + fi +fi + # Check for XSAVE intrinsics # PGAC_XSAVE_INTRINSICS() diff --git a/meson.build b/meson.build index f7a87edcc94..d57e9f89487 100644 --- a/meson.build +++ b/meson.build @@ -2436,6 +2436,36 @@ int main(void) endif +############################################################### +# Check for the availability of AVX2 support +############################################################### + +if host_cpu == 'x86_64' + + prog = ''' +#include <immintrin.h> +#include <stdint.h> +#if defined(__has_attribute) && __has_attribute (target) +__attribute__((target("avx2"))) +#endif +static int avx2_test(void) +{ + return 0; +} + +int main(void) +{ + return avx2_test(); +} +''' + + if cc.links(prog, name: 'AVX2 support', args: test_c_args) + cdata.set('USE_AVX2_WITH_RUNTIME_CHECK', 1) + endif + +endif + + ############################################################### # Check for the availability of AVX-512 popcount intrinsics. ############################################################### diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c index 8716651c8b5..7ce51fe9d2e 100644 --- a/src/backend/storage/page/checksum.c +++ b/src/backend/storage/page/checksum.c @@ -13,10 +13,52 @@ */ #include "postgres.h" +#include "port/pg_cpu.h" #include "storage/checksum.h" /* * The actual code is in storage/checksum_impl.h. This is done so that * external programs can incorporate the checksum code by #include'ing - * that file from the exported Postgres headers. (Compare our CRC code.) + * that file from the exported Postgres headers. (Compare our legacy + * CRC code in pg_crc.h.) + * The PG_CHECKSUM_INTERNAL symbol allows core to use hardware-specific + * coding without affecting external programs. */ +#define PG_CHECKSUM_INTERNAL #include "storage/checksum_impl.h" /* IWYU pragma: keep */ + + +static uint32 +pg_checksum_block_fallback(const PGChecksummablePage *page) +{ +#include "storage/checksum_block.inc.c" +} + +/* + * AVX2-optimized block checksum algorithm. + */ +#ifdef USE_AVX2_WITH_RUNTIME_CHECK +pg_attribute_target("avx2") +static uint32 +pg_checksum_block_avx2(const PGChecksummablePage *page) +{ +#include "storage/checksum_block.inc.c" +} +#endif /* USE_AVX2_WITH_RUNTIME_CHECK */ + +/* + * Choose the best available checksum implementation. + */ +static uint32 +pg_checksum_choose(const PGChecksummablePage *page) +{ + pg_checksum_block = pg_checksum_block_fallback; + +#ifdef USE_AVX2_WITH_RUNTIME_CHECK + if (x86_feature_available(PG_AVX2)) + pg_checksum_block = pg_checksum_block_avx2; +#endif + + return pg_checksum_block(page); +} + +static uint32 (*pg_checksum_block) (const PGChecksummablePage *page) = pg_checksum_choose; diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index 79379a4d125..71f6646cd0b 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -674,6 +674,9 @@ /* Define to 1 to use AVX-512 CRC algorithms with a runtime check. */ #undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK +/* Define to 1 to use AVX2 instructions with a runtime check. */ +#undef USE_AVX2_WITH_RUNTIME_CHECK + /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */ #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK diff --git a/src/include/port/pg_cpu.h b/src/include/port/pg_cpu.h index b93b828d3ac..c5d96bb4f47 100644 --- a/src/include/port/pg_cpu.h +++ b/src/include/port/pg_cpu.h @@ -24,6 +24,9 @@ typedef enum X86FeatureId PG_SSE4_2, PG_POPCNT, + /* 256-bit YMM registers */ + PG_AVX2, + /* 512-bit ZMM registers */ PG_AVX512_BW, PG_AVX512_VL, diff --git a/src/include/storage/checksum_block.inc.c b/src/include/storage/checksum_block.inc.c new file mode 100644 index 00000000000..743434644a4 --- /dev/null +++ b/src/include/storage/checksum_block.inc.c @@ -0,0 +1,42 @@ +/*------------------------------------------------------------------------- + * + * checksum_block.inc.c + * Core algorithm for page checksums, semi private to checksum_impl.h + * and checksum.c. + * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/storage/checksum_block.inc.c + * + *------------------------------------------------------------------------- + */ + +/* there is deliberately not an #ifndef CHECKSUM_BLOCK_INC_C here */ + +uint32 sums[N_SUMS]; +uint32 result = 0; +uint32 i, + j; + +/* ensure that the size is compatible with the algorithm */ +Assert(sizeof(PGChecksummablePage) == BLCKSZ); + +/* initialize partial checksums to their corresponding offsets */ +memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets)); + +/* main checksum calculation */ +for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++) + for (j = 0; j < N_SUMS; j++) + CHECKSUM_COMP(sums[j], page->data[i][j]); + +/* finally add in two rounds of zeroes for additional mixing */ +for (i = 0; i < 2; i++) + for (j = 0; j < N_SUMS; j++) + CHECKSUM_COMP(sums[j], 0); + +/* xor fold partial checksums together */ +for (i = 0; i < N_SUMS; i++) + result ^= sums[i]; + +return result; diff --git a/src/include/storage/checksum_impl.h b/src/include/storage/checksum_impl.h index 5c2dcbc63e7..49974043dc2 100644 --- a/src/include/storage/checksum_impl.h +++ b/src/include/storage/checksum_impl.h @@ -73,11 +73,10 @@ * 2e-16 false positive rate within margin of error. * * Vectorization of the algorithm requires 32bit x 32bit -> 32bit integer - * multiplication instruction. As of 2013 the corresponding instruction is - * available on x86 SSE4.1 extensions (pmulld) and ARM NEON (vmul.i32). - * Vectorization requires a compiler to do the vectorization for us. For recent - * GCC versions the flags -msse4.1 -funroll-loops -ftree-vectorize are enough - * to achieve vectorization. + * multiplication instruction. Examples include x86 AVX2 extensions (vpmulld) + * and ARM NEON (vmul.i32). For simplicity we rely on the compiler to do the + * vectorization for us. For GCC and clang the flags -funroll-loops + * -ftree-vectorize are enough to achieve vectorization. * * The optimal amount of parallelism to use depends on CPU specific instruction * latency, SIMD instruction width, throughput and the amount of registers @@ -89,8 +88,9 @@ * * The parallelism number 32 was chosen based on the fact that it is the * largest state that fits into architecturally visible x86 SSE registers while - * leaving some free registers for intermediate values. For future processors - * with 256bit vector registers this will leave some performance on the table. + * leaving some free registers for intermediate values. For processors + * with 256bit vector registers this leaves some performance on the table. + * * When vectorization is not available it might be beneficial to restructure * the computation to calculate a subset of the columns at a time and perform * multiple passes to avoid register spilling. This optimization opportunity @@ -138,6 +138,9 @@ do { \ (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \ } while (0) +/* Provide a static definition for external programs */ +#ifndef PG_CHECKSUM_INTERNAL + /* * Block checksum algorithm. The page must be adequately aligned * (at least on 4-byte boundary). @@ -145,34 +148,13 @@ do { \ static uint32 pg_checksum_block(const PGChecksummablePage *page) { - uint32 sums[N_SUMS]; - uint32 result = 0; - uint32 i, - j; - - /* ensure that the size is compatible with the algorithm */ - Assert(sizeof(PGChecksummablePage) == BLCKSZ); - - /* initialize partial checksums to their corresponding offsets */ - memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets)); - - /* main checksum calculation */ - for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++) - for (j = 0; j < N_SUMS; j++) - CHECKSUM_COMP(sums[j], page->data[i][j]); - - /* finally add in two rounds of zeroes for additional mixing */ - for (i = 0; i < 2; i++) - for (j = 0; j < N_SUMS; j++) - CHECKSUM_COMP(sums[j], 0); - - /* xor fold partial checksums together */ - for (i = 0; i < N_SUMS; i++) - result ^= sums[i]; - - return result; +#include "storage/checksum_block.inc.c" } +#else +static uint32 (*pg_checksum_block) (const PGChecksummablePage *page); +#endif + /* * Compute the checksum for a Postgres page. * diff --git a/src/port/pg_cpu_x86.c b/src/port/pg_cpu_x86.c index 7575838245c..7ac4c4b3fd5 100644 --- a/src/port/pg_cpu_x86.c +++ b/src/port/pg_cpu_x86.c @@ -80,7 +80,7 @@ set_x86_features(void) { uint32 xcr0_val = 0; - /* second cpuid call on leaf 7 to check extended AVX-512 support */ + /* cpuid call on leaf 7 */ memset(exx, 0, 4 * sizeof(exx[0])); @@ -95,6 +95,10 @@ set_x86_features(void) xcr0_val = _xgetbv(0); #endif + /* Are YMM registers enabled? */ + if (mask_available(xcr0_val, XMM | YMM)) + X86Features[PG_AVX2] = exx[1] >> 5 & 1; + /* Are ZMM registers enabled? */ if (mask_available(xcr0_val, XMM | YMM | OPMASK | ZMM0_15 | ZMM16_31)) -- 2.53.0 ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-03-17 02:25 John Naylor <[email protected]> parent: John Naylor <[email protected]> 1 sibling, 0 replies; 15+ messages in thread From: John Naylor @ 2026-03-17 02:25 UTC (permalink / raw) To: Andrew Kim <[email protected]>; +Cc: Oleg Tselebrovskiy <[email protected]>; [email protected] On Tue, Mar 17, 2026 at 9:23 AM John Naylor <[email protected]> wrote: > I've attached v12 which is just a rebase over the new centralized > feature detection. I also have some review: I forgot to mention elsewhere we generated #include-able snippets of code with the suffix ".inc.c" so I went with that rather than add exception to headerscheck. -- John Naylor Amazon Web Services ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-03-30 12:01 John Naylor <[email protected]> parent: John Naylor <[email protected]> 1 sibling, 2 replies; 15+ messages in thread From: John Naylor @ 2026-03-30 12:01 UTC (permalink / raw) To: Andrew Kim <[email protected]>; +Cc: Oleg Tselebrovskiy <[email protected]>; [email protected] On Tue, Mar 17, 2026 at 9:23 AM John Naylor <[email protected]> wrote: > I've attached v12 which is just a rebase over the new centralized > feature detection. I also have some review: Andrew Kim let me know he is not available at this time, so since I found only minor issues and we're close to feature freeze I took care of them myself in the attached v13. I also further updated an outdated comment to reflect that some compilers (for the archives: at least gcc 8.5 and up) can turn a multiplication by a constant into shifts and adds (the FNV prime is pretty sparse in one-bits), so it's no longer true that vector multiplication is required. That works on SSE2 and powerpc, at least. I don't remember the last time anyone did measurements, so I went ahead and did that: master: 945ms 32 AVX2: 335ms 64 AVX2: 220ms The last one is just to verify an old code comment, and assertion in this thread, that the choice of 32 accumulators left some performance on the table. (Even if it weren't in diminishing returns territory, we wouldn't consider raising this because that changes the computed value, but if I'm updating comments anyway, I wanted to check as much as convenient.) I'll repeat building pg_filedump with this and if that goes well I plan to push this week unless there are objections. -- John Naylor Amazon Web Services Attachments: [text/x-patch] v13-0001-Enable-autovectorizing-page-checksums-with-AVX2-.patch (13.6K, 2-v13-0001-Enable-autovectorizing-page-checksums-with-AVX2-.patch) download | inline diff: From da15aefd7269b8a342f751e2b5c2c4b9c1b0627c Mon Sep 17 00:00:00 2001 From: John Naylor <[email protected]> Date: Mon, 2 Mar 2026 17:28:58 +0700 Subject: [PATCH v13] Enable autovectorizing page checksums with AVX2 where available We already rely on autovectorization for computing page checksums, but on x86 we can get nearly three times the performance by annotating pg_checksum_block() with a function target attribute for AVX2. That feature set not only uses 256-bit registers, but can also use vector multiplication rather than the vector shifts and adds available in SSE2. This matters most when using io_uring since in that case the checksum computation is not done in parallel via workers. Co-authored-by: Matthew Sterrett <[email protected]> Co-authored-by: Andrew Kim <[email protected]> Reviewed-by: Oleg Tselebrovskiy <[email protected]> Discussion: https://postgr.es/m/CA%2BvA85_5GTu%2BHHniSbvvP%2B8k3%3DxZO%3DWE84NPwiKyxztqvpfZ3Q%40mail.gmail.com Discussion: https://postgr.es/m/20250911054220.3784-1-root%40ip-172-31-36-228.ec2.internal --- config/c-compiler.m4 | 23 +++++++++++ configure | 46 +++++++++++++++++++++ configure.ac | 9 ++++ meson.build | 28 +++++++++++++ src/backend/storage/page/checksum.c | 44 +++++++++++++++++++- src/include/pg_config.h.in | 3 ++ src/include/port/pg_cpu.h | 3 ++ src/include/storage/checksum_block.inc.c | 42 +++++++++++++++++++ src/include/storage/checksum_impl.h | 52 ++++++++---------------- src/port/pg_cpu_x86.c | 4 ++ 10 files changed, 219 insertions(+), 35 deletions(-) create mode 100644 src/include/storage/checksum_block.inc.c diff --git a/config/c-compiler.m4 b/config/c-compiler.m4 index 629572ee350..4d5acf8be6e 100644 --- a/config/c-compiler.m4 +++ b/config/c-compiler.m4 @@ -687,6 +687,29 @@ fi undefine([Ac_cachevar])dnl ])# PGAC_SSE42_CRC32_INTRINSICS +# PGAC_AVX2_SUPPORT +# --------------------------- +# Check if the compiler supports AVX2 as a target +# +# If AVX2 target attribute is supported, sets pgac_avx2_support. +AC_DEFUN([PGAC_AVX2_SUPPORT], +[define([Ac_cachevar], [AS_TR_SH([pgac_cv_avx2_support])])dnl +AC_CACHE_CHECK([for AVX2 target attribute support], [Ac_cachevar], +[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([#include <stdint.h> + __attribute__((target("avx2"))) + static int avx2_test(void) + { + return 0; + }], + [return avx2_test();])], + [Ac_cachevar=yes], + [Ac_cachevar=no])]) +if test x"$Ac_cachevar" = x"yes"; then + pgac_avx2_support=yes +fi +undefine([Ac_cachevar])dnl +])# PGAC_AVX2_SUPPORT + # PGAC_AVX512_PCLMUL_INTRINSICS # --------------------------- # Check if the compiler supports AVX-512 carryless multiplication diff --git a/configure b/configure index 8e0e7483c1d..1ae527215c7 100755 --- a/configure +++ b/configure @@ -17725,6 +17725,52 @@ $as_echo "#define HAVE__CPUIDEX 1" >>confdefs.h fi fi +# Check for AVX2 target and intrinsic support +# +if test x"$host_cpu" = x"x86_64"; then + { $as_echo "$as_me:${as_lineno-$LINENO}: checking for AVX2 target attribute support" >&5 +$as_echo_n "checking for AVX2 target attribute support... " >&6; } +if ${pgac_cv_avx2_support+:} false; then : + $as_echo_n "(cached) " >&6 +else + cat confdefs.h - <<_ACEOF >conftest.$ac_ext +/* end confdefs.h. */ +#include <stdint.h> + #if defined(__has_attribute) && __has_attribute (target) + __attribute__((target("avx2"))) + static int avx2_test(void) + { + return 0; + } + #endif +int +main () +{ +return avx2_test(); + ; + return 0; +} +_ACEOF +if ac_fn_c_try_compile "$LINENO"; then : + pgac_cv_avx2_support=yes +else + pgac_cv_avx2_support=no +fi +rm -f core conftest.err conftest.$ac_objext conftest.$ac_ext +fi +{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $pgac_cv_avx2_support" >&5 +$as_echo "$pgac_cv_avx2_support" >&6; } +if test x"$pgac_cv_avx2_support" = x"yes"; then + pgac_avx2_support=yes +fi + + if test x"$pgac_avx2_support" = x"yes"; then + +$as_echo "#define USE_AVX2_WITH_RUNTIME_CHECK 1" >>confdefs.h + + fi +fi + # Check for XSAVE intrinsics # { $as_echo "$as_me:${as_lineno-$LINENO}: checking for _xgetbv" >&5 diff --git a/configure.ac b/configure.ac index 2baac5e9da7..a43add51ca6 100644 --- a/configure.ac +++ b/configure.ac @@ -2129,6 +2129,15 @@ else fi fi +# Check for AVX2 target support +# +if test x"$host_cpu" = x"x86_64"; then + PGAC_AVX2_SUPPORT() + if test x"$pgac_avx2_support" = x"yes"; then + AC_DEFINE(USE_AVX2_WITH_RUNTIME_CHECK, 1, [Define to 1 to use AVX2 instructions with a runtime check.]) + fi +fi + # Check for XSAVE intrinsics # PGAC_XSAVE_INTRINSICS() diff --git a/meson.build b/meson.build index ea31cbce9c0..5cf23195a6a 100644 --- a/meson.build +++ b/meson.build @@ -2451,6 +2451,34 @@ int main(void) endif +############################################################### +# Check if the compiler supports AVX2 as a target +############################################################### + +if host_cpu == 'x86_64' + + prog = ''' +#include <immintrin.h> +#include <stdint.h> +__attribute__((target("avx2"))) +static int avx2_test(void) +{ + return 0; +} + +int main(void) +{ + return avx2_test(); +} +''' + + if cc.links(prog, name: 'AVX2 support', args: test_c_args) + cdata.set('USE_AVX2_WITH_RUNTIME_CHECK', 1) + endif + +endif + + ############################################################### # Check for the availability of AVX-512 popcount intrinsics. ############################################################### diff --git a/src/backend/storage/page/checksum.c b/src/backend/storage/page/checksum.c index 8716651c8b5..7ce51fe9d2e 100644 --- a/src/backend/storage/page/checksum.c +++ b/src/backend/storage/page/checksum.c @@ -13,10 +13,52 @@ */ #include "postgres.h" +#include "port/pg_cpu.h" #include "storage/checksum.h" /* * The actual code is in storage/checksum_impl.h. This is done so that * external programs can incorporate the checksum code by #include'ing - * that file from the exported Postgres headers. (Compare our CRC code.) + * that file from the exported Postgres headers. (Compare our legacy + * CRC code in pg_crc.h.) + * The PG_CHECKSUM_INTERNAL symbol allows core to use hardware-specific + * coding without affecting external programs. */ +#define PG_CHECKSUM_INTERNAL #include "storage/checksum_impl.h" /* IWYU pragma: keep */ + + +static uint32 +pg_checksum_block_fallback(const PGChecksummablePage *page) +{ +#include "storage/checksum_block.inc.c" +} + +/* + * AVX2-optimized block checksum algorithm. + */ +#ifdef USE_AVX2_WITH_RUNTIME_CHECK +pg_attribute_target("avx2") +static uint32 +pg_checksum_block_avx2(const PGChecksummablePage *page) +{ +#include "storage/checksum_block.inc.c" +} +#endif /* USE_AVX2_WITH_RUNTIME_CHECK */ + +/* + * Choose the best available checksum implementation. + */ +static uint32 +pg_checksum_choose(const PGChecksummablePage *page) +{ + pg_checksum_block = pg_checksum_block_fallback; + +#ifdef USE_AVX2_WITH_RUNTIME_CHECK + if (x86_feature_available(PG_AVX2)) + pg_checksum_block = pg_checksum_block_avx2; +#endif + + return pg_checksum_block(page); +} + +static uint32 (*pg_checksum_block) (const PGChecksummablePage *page) = pg_checksum_choose; diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in index d8d61918aff..5394a614f87 100644 --- a/src/include/pg_config.h.in +++ b/src/include/pg_config.h.in @@ -677,6 +677,9 @@ /* Define to 1 to use AVX-512 CRC algorithms with a runtime check. */ #undef USE_AVX512_CRC32C_WITH_RUNTIME_CHECK +/* Define to 1 to use AVX2 instructions with a runtime check. */ +#undef USE_AVX2_WITH_RUNTIME_CHECK + /* Define to 1 to use AVX-512 popcount instructions with a runtime check. */ #undef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK diff --git a/src/include/port/pg_cpu.h b/src/include/port/pg_cpu.h index b93b828d3ac..c5d96bb4f47 100644 --- a/src/include/port/pg_cpu.h +++ b/src/include/port/pg_cpu.h @@ -24,6 +24,9 @@ typedef enum X86FeatureId PG_SSE4_2, PG_POPCNT, + /* 256-bit YMM registers */ + PG_AVX2, + /* 512-bit ZMM registers */ PG_AVX512_BW, PG_AVX512_VL, diff --git a/src/include/storage/checksum_block.inc.c b/src/include/storage/checksum_block.inc.c new file mode 100644 index 00000000000..6ef8a911145 --- /dev/null +++ b/src/include/storage/checksum_block.inc.c @@ -0,0 +1,42 @@ +/*------------------------------------------------------------------------- + * + * checksum_block.inc.c + * Core algorithm for page checksums, semi-private to checksum_impl.h + * and checksum.c. + * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/storage/checksum_block.inc.c + * + *------------------------------------------------------------------------- + */ + +/* there is deliberately not an #ifndef CHECKSUM_BLOCK_INC_C here */ + +uint32 sums[N_SUMS]; +uint32 result = 0; +uint32 i, + j; + +/* ensure that the size is compatible with the algorithm */ +Assert(sizeof(PGChecksummablePage) == BLCKSZ); + +/* initialize partial checksums to their corresponding offsets */ +memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets)); + +/* main checksum calculation */ +for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++) + for (j = 0; j < N_SUMS; j++) + CHECKSUM_COMP(sums[j], page->data[i][j]); + +/* finally add in two rounds of zeroes for additional mixing */ +for (i = 0; i < 2; i++) + for (j = 0; j < N_SUMS; j++) + CHECKSUM_COMP(sums[j], 0); + +/* xor fold partial checksums together */ +for (i = 0; i < N_SUMS; i++) + result ^= sums[i]; + +return result; diff --git a/src/include/storage/checksum_impl.h b/src/include/storage/checksum_impl.h index 5c2dcbc63e7..28570abdda0 100644 --- a/src/include/storage/checksum_impl.h +++ b/src/include/storage/checksum_impl.h @@ -72,12 +72,13 @@ * random segments of page with 0x00, 0xFF and random data all show optimal * 2e-16 false positive rate within margin of error. * - * Vectorization of the algorithm requires 32bit x 32bit -> 32bit integer - * multiplication instruction. As of 2013 the corresponding instruction is - * available on x86 SSE4.1 extensions (pmulld) and ARM NEON (vmul.i32). - * Vectorization requires a compiler to do the vectorization for us. For recent - * GCC versions the flags -msse4.1 -funroll-loops -ftree-vectorize are enough - * to achieve vectorization. + * Vectorization of the algorithm works best with a 32bit x 32bit -> 32bit + * vector integer multiplication instruction, Examples include x86 AVX2 + * extensions (vpmulld) and ARM NEON (vmul.i32). Without that, vectorization + * is still possible if the compiler can turn multiplication by FNV_PRIME + * into a sequence of vectorized shifts and adds. For simplicity we rely + * on the compiler to do the vectorization for us. For GCC and clang the + * flags -funroll-loops -ftree-vectorize are enough to achieve vectorization. * * The optimal amount of parallelism to use depends on CPU specific instruction * latency, SIMD instruction width, throughput and the amount of registers @@ -89,8 +90,9 @@ * * The parallelism number 32 was chosen based on the fact that it is the * largest state that fits into architecturally visible x86 SSE registers while - * leaving some free registers for intermediate values. For future processors - * with 256bit vector registers this will leave some performance on the table. + * leaving some free registers for intermediate values. For processors + * with 256-bit vector registers this leaves some performance on the table. + * * When vectorization is not available it might be beneficial to restructure * the computation to calculate a subset of the columns at a time and perform * multiple passes to avoid register spilling. This optimization opportunity @@ -138,6 +140,9 @@ do { \ (checksum) = __tmp * FNV_PRIME ^ (__tmp >> 17); \ } while (0) +/* Provide a static definition for external programs */ +#ifndef PG_CHECKSUM_INTERNAL + /* * Block checksum algorithm. The page must be adequately aligned * (at least on 4-byte boundary). @@ -145,34 +150,13 @@ do { \ static uint32 pg_checksum_block(const PGChecksummablePage *page) { - uint32 sums[N_SUMS]; - uint32 result = 0; - uint32 i, - j; - - /* ensure that the size is compatible with the algorithm */ - Assert(sizeof(PGChecksummablePage) == BLCKSZ); - - /* initialize partial checksums to their corresponding offsets */ - memcpy(sums, checksumBaseOffsets, sizeof(checksumBaseOffsets)); - - /* main checksum calculation */ - for (i = 0; i < (uint32) (BLCKSZ / (sizeof(uint32) * N_SUMS)); i++) - for (j = 0; j < N_SUMS; j++) - CHECKSUM_COMP(sums[j], page->data[i][j]); - - /* finally add in two rounds of zeroes for additional mixing */ - for (i = 0; i < 2; i++) - for (j = 0; j < N_SUMS; j++) - CHECKSUM_COMP(sums[j], 0); - - /* xor fold partial checksums together */ - for (i = 0; i < N_SUMS; i++) - result ^= sums[i]; - - return result; +#include "storage/checksum_block.inc.c" } +#else +static uint32 (*pg_checksum_block) (const PGChecksummablePage *page); +#endif + /* * Compute the checksum for a Postgres page. * diff --git a/src/port/pg_cpu_x86.c b/src/port/pg_cpu_x86.c index e2ab92b09ac..f069afd1c53 100644 --- a/src/port/pg_cpu_x86.c +++ b/src/port/pg_cpu_x86.c @@ -121,6 +121,10 @@ set_x86_features(void) xcr0_val = _xgetbv(0); #endif + /* Are YMM registers enabled? */ + if (mask_available(xcr0_val, XMM | YMM)) + X86Features[PG_AVX2] = reg[EBX] >> 5 & 1; + /* Are ZMM registers enabled? */ if (mask_available(xcr0_val, XMM | YMM | OPMASK | ZMM0_15 | ZMM16_31)) -- 2.53.0 ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-03-30 15:00 Ants Aasma <[email protected]> parent: John Naylor <[email protected]> 1 sibling, 1 reply; 15+ messages in thread From: Ants Aasma @ 2026-03-30 15:00 UTC (permalink / raw) To: John Naylor <[email protected]>; +Cc: Andrew Kim <[email protected]>; Oleg Tselebrovskiy <[email protected]>; [email protected] On Mon, 30 Mar 2026 at 15:01, John Naylor <[email protected]> wrote: > I don't remember the last time anyone did measurements, so I went > ahead and did that: > > master: 945ms > 32 AVX2: 335ms > 64 AVX2: 220ms I'm guessing this is on a recent Intel. Any extra width is helpful on Intel as they doubled vpmulld latency from under us after we had settled on this algorithm. uops.info shows that the most recent Arrow Lake-P cores bring the latency down to 5. B Intels product lineup is so confusing that it's hard to tell which products this core ships in. As far as I can tell not in any Xeons yet. AMD has had 3 cycle vpmulld since Zen 3. Out of curiosity I tried some approximate numbers on Zen 5 for differing N_SUMS values. Numbers are ns per iteration for 10M iterations. GCC 15.2 -O3: n16 n32 n64 n128 n256 x86-64 620.1 482.4 493.9 543.1 584.0 x86-64-v2 188.6 125.5 121.3 183.9 196.6 x86-64-v3 185.2 101.3 63.2 60.9 101.6 x86-64-v4 182.9 86.0 53.9 35.4 30.5 native 178.2 84.7 54.0 34.5 30.9 clang 20.1 -O3: n16 n32 n64 n128 n256 x86-64 611.7 264.0 254.7 283.9 304.0 x86-64-v2 603.7 134.0 137.9 236.1 165.8 x86-64-v3 252.1 103.2 61.9 124.0 96.9 x86-64-v4 223.9 102.1 61.4 101.7 68.9 native 203.3 91.0 54.5 35.0 40.4 FWIW I think AVX2 (x86-64-v3) is fine. On AMD the speed is close to core to fabric bandwidth and Intel has significantly less bandwidth on server chips. Regards, Ants Aasma Attachments: [text/x-csrc] bench-checksums.c (1023B, 3-bench-checksums.c) download | inline: #include "postgres.h" #include "storage/checksum_impl.h" #include <time.h> #undef printf int __attribute__ ((noinline)) checksum_block(char *page, uint32 blockno) { return pg_checksum_page(page, blockno); } int main(int argc, char *argv[]) { char *page; uint64 i; uint64 sum = 0; struct timespec start; struct timespec end; double delta; if (argc<3) { printf("Usage: %s niterations nblocks\n", argv[0]); return 1; } uint64 n = strtoull(argv[1], 0, 10); uint64 b = strtoull(argv[2], 0, 10); page = malloc(BLCKSZ*b); for (i = 0; i < BLCKSZ*b; i++) page[i] = (i*997) & 0xFF; clock_gettime(CLOCK_MONOTONIC_RAW, &start); for (i = 0; i < n; i++) sum += checksum_block(page + BLCKSZ*(i % b), (uint32) i); clock_gettime(CLOCK_MONOTONIC_RAW, &end); delta = (double)(end.tv_sec - start.tv_sec) + (1e-9*(double) (end.tv_nsec - start.tv_nsec)); printf("%0.5fms @ %0.3f GB/s\n", delta*1000, (((double) BLCKSZ) * n)/delta/1e9); printf(" %0.1fns per iteration\n", delta*1e9/n); return 0; } ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-03-31 04:09 John Naylor <[email protected]> parent: Ants Aasma <[email protected]> 0 siblings, 0 replies; 15+ messages in thread From: John Naylor @ 2026-03-31 04:09 UTC (permalink / raw) To: Ants Aasma <[email protected]>; +Cc: Andrew Kim <[email protected]>; Oleg Tselebrovskiy <[email protected]>; [email protected] On Mon, Mar 30, 2026 at 10:01 PM Ants Aasma <[email protected]> wrote: > > On Mon, 30 Mar 2026 at 15:01, John Naylor <[email protected]> wrote: > > I don't remember the last time anyone did measurements, so I went > > ahead and did that: > > > > master: 945ms > > 32 AVX2: 335ms > > 64 AVX2: 220ms > > I'm guessing this is on a recent Intel. Any extra width is helpful on Intel as they doubled vpmulld latency from under us after we had settled on this algorithm. It's actually ancient and due to be replaced soon, but still several years after the adoption of this algorithm. > FWIW I think AVX2 (x86-64-v3) is fine. Glad to hear it, although the patch doesn't use that build flag, so it's not impossible there is some additional difference in the compiler's model. Still, given the variation you found, I'll make sure the commit message says "several time faster" so it's not specific to my hardware. -- John Naylor Amazon Web Services ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-04-04 13:25 John Naylor <[email protected]> parent: John Naylor <[email protected]> 1 sibling, 1 reply; 15+ messages in thread From: John Naylor @ 2026-04-04 13:25 UTC (permalink / raw) To: Andrew Kim <[email protected]>; +Cc: Oleg Tselebrovskiy <[email protected]>; [email protected] On Mon, Mar 30, 2026 at 7:01 PM John Naylor <[email protected]> wrote: > I'll repeat building pg_filedump with this and if that goes well I > plan to push this week unless there are objections. Something change in my environment, or something, because I can't build pg_filedump anymore, although it hasn't had any recent new commits: pg_config /bin/sh: line 1: mkdir: command not found Looks like something messed with PATH, but I don't think it was me. In any case, very little has changed in the patch since I last built pg_filedump successfully, so I won't worry yet. I pushed with a couple cosmetic adjustments: - Removed no-longer-needed #includes from configure checks - Added a comment that we deliberately don't guard on __has_attribute - switch things around to use #ifdef instead of #ifndef for clarity Thanks Andrew, for picking this up again! -- John Naylor Amazon Web Services ^ permalink raw reply [nested|flat] 15+ messages in thread
* Re: Proposal for enabling auto-vectorization for checksum calculations @ 2026-04-13 08:32 Andrew Kim <[email protected]> parent: John Naylor <[email protected]> 0 siblings, 0 replies; 15+ messages in thread From: Andrew Kim @ 2026-04-13 08:32 UTC (permalink / raw) To: John Naylor <[email protected]>; +Cc: Oleg Tselebrovskiy <[email protected]>; [email protected] On Sat, Apr 4, 2026 at 6:25 AM John Naylor <[email protected]> wrote: > > On Mon, Mar 30, 2026 at 7:01 PM John Naylor <[email protected]> wrote: > > I'll repeat building pg_filedump with this and if that goes well I > > plan to push this week unless there are objections. > > Something change in my environment, or something, because I can't > build pg_filedump anymore, although it hasn't had any recent new > commits: > > pg_config > /bin/sh: line 1: mkdir: command not found > > Looks like something messed with PATH, but I don't think it was me. In > any case, very little has changed in the patch since I last built > pg_filedump successfully, so I won't worry yet. > > I pushed with a couple cosmetic adjustments: > > - Removed no-longer-needed #includes from configure checks > - Added a comment that we deliberately don't guard on __has_attribute > - switch things around to use #ifdef instead of #ifndef for clarity > > Thanks Andrew, for picking this up again! Thank you for taking care of the final adjustments and pushing the patch to master. Thank you again for your guidance and for steering this through to the finish line. It was a pleasure collaborating with you on this optimization. - Andrew > > -- > John Naylor > Amazon Web Services ^ permalink raw reply [nested|flat] 15+ messages in thread
end of thread, other threads:[~2026-04-13 08:32 UTC | newest] Thread overview: 15+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2026-01-11 23:19 Re: Proposal for enabling auto-vectorization for checksum calculations John Naylor <[email protected]> 2026-01-13 01:58 ` Andrew Kim <[email protected]> 2026-01-13 05:07 ` John Naylor <[email protected]> 2026-01-15 08:03 ` Oleg Tselebrovskiy <[email protected]> 2026-01-15 10:35 ` John Naylor <[email protected]> 2026-01-21 11:13 ` John Naylor <[email protected]> 2026-02-09 08:42 ` Andrew Kim <[email protected]> 2026-03-16 08:00 ` Andrew Kim <[email protected]> 2026-03-17 02:23 ` John Naylor <[email protected]> 2026-03-17 02:25 ` John Naylor <[email protected]> 2026-03-30 12:01 ` John Naylor <[email protected]> 2026-03-30 15:00 ` Ants Aasma <[email protected]> 2026-03-31 04:09 ` John Naylor <[email protected]> 2026-04-04 13:25 ` John Naylor <[email protected]> 2026-04-13 08:32 ` Andrew Kim <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox