Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w8V9T-000bBz-1A for pgsql-hackers@arkaria.postgresql.org; Fri, 03 Apr 2026 03:24:55 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w8V9S-009d1q-0s for pgsql-hackers@arkaria.postgresql.org; Fri, 03 Apr 2026 03:24:54 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w8V9R-009d1g-33 for pgsql-hackers@lists.postgresql.org; Fri, 03 Apr 2026 03:24:54 +0000 Received: from mail-pj1-x102a.google.com ([2607:f8b0:4864:20::102a]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w8V9P-00000000JEb-1lP3 for pgsql-hackers@lists.postgresql.org; Fri, 03 Apr 2026 03:24:53 +0000 Received: by mail-pj1-x102a.google.com with SMTP id 98e67ed59e1d1-358e3cc5e7eso800288a91.0 for ; Thu, 02 Apr 2026 20:24:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1775186689; x=1775791489; darn=lists.postgresql.org; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:from:to:cc:subject:date:message-id:reply-to; bh=Q34ut7RgXOpWp5104Wb/RfRMljmVJXpsCJZaK6vX824=; b=POKWDGp96wccDwPQs16hYH+eyYlRBK3r5Vm8R+IQsjqlppdJ3uS5T13R10Va9XXTk7 9fPkkJXY3hG6kEXgOIpkAlq0ElFPuah3lZxNgtfxHyzh/PilfV8eK5jGOYydVV90AsO1 Mo/zxAarX/XJjgldQ0ZHVWiVoVhAsAiOHeDUYRib3Rtg+xCfS6/RY8M7tVkGPyY/z1MQ sQb/WRy0zwOwLRNakxO601fzaTEVoveRO7Dm7aWl1Jh5EQGBiiSW5evKmmy4C7t27W83 dMH+5XoUKhONwtya3dx4YLkk2/1rsv5dDsLEYOxWOGa6Uh+5JYGhPjJOAuvfmOwCps4v lZDw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1775186689; x=1775791489; h=references:to:cc:in-reply-to:date:subject:mime-version:message-id :from:x-gm-gg:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=Q34ut7RgXOpWp5104Wb/RfRMljmVJXpsCJZaK6vX824=; b=kOKOwfjoICEk2JvMeEBLh4b4HGEE60Eh2FRHS2BsxfKGQfXMJA0+skyzEcdC9SjNu6 VSK+2LJiN1KkicgKXGq8QGNikacoZhkaCMCBSF/9+8UGPK2L0pMPUQy7T9wZIPCN4tZW vQIVrsESRNnZvE1r4804GAe4tlOXPnz93YHufQRaEMEBN+nnw19Ov5rO0baH/5y0Xm2y VHnXlw6IGzT/EPltMDlKXPWyFlNlkgMaJapTHXtzASotRYRUfkild+avWtsjlDu7fMbf UtuLOYVR0O8TTCNZugTMkSfUzsOBJtV5U+IYfnT4XpTOfePo/GoY2oAAiMk35CC7tliU Wmyw== X-Forwarded-Encrypted: i=1; AJvYcCXqN6VqAzDVqfIsXGSuhgO0Q0b9srzqKyf2GBg3OST0bJSXb8Pl7twP76iKFExsIjrl1T1XkNzzc+oJhALL@lists.postgresql.org X-Gm-Message-State: AOJu0YwuX27Bf1yBy0GW2PZ7pu2eXIMcGzYBnxvQeud3wZCKl0S5Rcwt TsfL3NwrpiF53p8RHJyvRyRZhP8IKuDZSTnIDJIY+3enFx5C3RERUUp/LYhf9ZPZIzc= X-Gm-Gg: AeBDiesDEd3Nvr0nfI8qtnRMSawpAw/3WBznBThxHDZRoPBWo6T7Kfb/RUa1npmrYKP L5j178BshyCY5i9f2uDkxMi509udFOsKQMLV2zKGlEj+GTx9odpGrjMm5ZCJYqVXPvZxz3kMkIv EpVmwFbQKOqQx4TTif0E0De9XgJvukJXhidihcXDCuPK/U9cP6RZcJdVazr7a2pY0S9M78VZsqF kg+Wr2c7c6PQ/Egb1oz3KMGfT0PUabSSi1pioRZ1QM6TnbNtk2GLgnycyIsdD7MFUp8iqqidHs0 UZplVIlRbQolCwJkhVo+2NJsgR/mGsKrSXeRXfYkgSpUgKKN7XELN9THzBh2ZPChRyeH/1EJVwF syDEDAg2MwiObcZXsmUNedvpPTbffB1kJdUKaanyVPnVfr+zgP7nQGxty43C7cMnUl9BueXfuKQ HbaZOa6u1x3O43W40pF5HDvx2WtDq4c7k= X-Received: by 2002:a17:903:37ce:b0:2b2:481b:de5f with SMTP id d9443c01a7336-2b28163b27dmr17209695ad.5.1775186688965; Thu, 02 Apr 2026 20:24:48 -0700 (PDT) Received: from smtpclient.apple ([45.32.121.103]) by smtp.gmail.com with ESMTPSA id d9443c01a7336-2b27472d646sm44554215ad.5.2026.04.02.20.24.46 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Thu, 02 Apr 2026 20:24:48 -0700 (PDT) From: Chao Li Message-Id: <59B9EFAF-84DF-40A9-847F-9CF457A798BB@gmail.com> Content-Type: multipart/mixed; boundary="Apple-Mail=_D212144E-814A-4F20-8D28-6CB615F36653" Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3864.400.21\)) Subject: Re: Small and unlikely overflow hazard in bms_next_member() Date: Fri, 3 Apr 2026 11:24:08 +0800 In-Reply-To: Cc: Tom Lane , PostgreSQL Developers To: David Rowley References: <3190647.1775103768@sss.pgh.pa.us> X-Mailer: Apple Mail (2.3864.400.21) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --Apple-Mail=_D212144E-814A-4F20-8D28-6CB615F36653 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=utf-8 > On Apr 3, 2026, at 10:08, David Rowley wrote: >=20 > On Fri, 3 Apr 2026 at 11:12, David Rowley = wrote: >>=20 >> On Thu, 2 Apr 2026 at 17:22, Tom Lane wrote: >>> I don't think we should add cycles here for this purpose. >>=20 >> I'm not keen on slowing things down for this either. I did do some >> experiments in [1] that sees fewer instructions from using 64-bit >> maths. I might go off and see if there are any wins there that also >> give us the INT_MAX fix. It's not great effort to reward ratio >> though... >=20 > The reduction in instructions with the patched version got me curious > to see if it would translate into a performance increase. I tested on > an AMD Zen2 machine, and it's a decent amount faster than master. I > tested with gcc and clang. >=20 > I also scanned over the remaining parts of bitmapset.c and didn't find > anywhere else that has overflow risk aside from what you pointed out > in bms_prev_member(). >=20 > The attached patch contains the benchmark function I added to the > test_bitmapset module. It should apply to master with a bit of noise. >=20 > CREATE EXTENSION test_bitmapset; > SELECT > generate_series(1,3) AS run, > bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS > bms_next_member_us, > bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS > bms_prev_member_us; >=20 > master (gcc) >=20 > run | bms_next_member_us | bms_prev_member_us > -----+--------------------+-------------------- > 1 | 26473 | 40404 > 2 | 26218 | 40413 > 3 | 26209 | 40387 >=20 > patched (gcc) >=20 > run | bms_next_member_us | bms_prev_member_us > -----+--------------------+-------------------- > 1 | 25409 | 29705 > 2 | 24905 | 29693 > 3 | 24870 | 29707 >=20 > Times are in microseconds to do 1 million bms_*_member() loops over > the entire set. >=20 > I've also attached the full results I got. I've also included the > results from Chao's version, which does slow things down decently on > clang. >=20 > IMO, if we can make bitmapset.c work with INT_MAX members and get a > performance increase, then we should do it. >=20 > David >=20 >> [1] https://godbolt.org/z/Eh1vzssq7 > I also did a load test with a standalone c program with 4 versions: * The original bms_next_member (Original) * The fast version from [1], that uses 64bit maths (Fast) * The original version + INT32_MAX check + 64bit maths (Original2) * I tried the other approach that pulls up the first iteration, so that = removes "mask =3D (~(bitmapword) 0);=E2=80=9D from the loop. (PullUp) Note: all tests used -O2 to build the executable. On my MacBook M4, the Fast version constantly won, and PullUp version = performed badly. ``` % gcc --version Apple clang version 17.0.0 (clang-1700.6.4.2) Target: arm64-apple-darwin25.3.0 Thread model: posix InstalledDir: /Library/Developer/CommandLineTools/usr/bin ``` A typical test run: ``` Benchmarking 100000 iterations... Original: 0.48893 seconds Fast: 0.46979 seconds Original2: 0.47740 seconds PullUp: 0.48029 seconds ```=20 On my Windows laptop, Intel(R) Core Ultra 5, with WSL based Ubuntu, = Orignal2 won in the most runs, and the PullUp version was faster than = Fast version. ``` chaol@lichao-highgo:~$ gcc --version = =20 gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 = =20 Copyright (C) 2023 Free Software Foundation, Inc. = =20 This is free software; see the source for copying conditions. There is = NO =20 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR = PURPOSE. ``` A typical test run: ``` Original: 0.99849 seconds = =20 Fast: 0.74722 seconds = =20 Original2: 0.59407 seconds = =20 PullUp: 0.62746 seconds =20 ``` Then I also tried to run on Windows directly. Here, PullUp version = performed the best. ``` $ gcc --version gcc.exe (Rev13, Built by MSYS2 project) 15.2.0 Copyright (C) 2025 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is = NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR = PURPOSE. ``` A typical test run: ``` Original: 0.32931 seconds Fast: 0.32740 seconds Original2: 0.32378 seconds PullUp: 0.30795 seconds ``` I=E2=80=99m curious that, when something performs differently across = platforms, which platform should take priority? Please see the attached test program. It=E2=80=99s possible I did = something wrong. [1] https://godbolt.org/z/Eh1vzssq7 -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ --Apple-Mail=_D212144E-814A-4F20-8D28-6CB615F36653 Content-Disposition: attachment; filename=test_bms_next.c Content-Type: application/octet-stream; x-unix-mode=0644; name="test_bms_next.c" Content-Transfer-Encoding: 7bit #include #include #include #include #include #include //#define NULL ((void *) 0) typedef uint64_t uint64; typedef int64_t int64; #define BITS_PER_BITMAPWORD 64 typedef uint64 bitmapword; /* must be an unsigned type */ typedef int64 signedbitmapword; /* must be the matching signed type */ #define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) #define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) typedef struct Bitmapset { int nwords; /* number of words in array */ bitmapword words[]; /* really [nwords] */ } Bitmapset; static inline int bmw_rightmost_one_pos(uint64 word) { return __builtin_ctzll(word); } // 1. Original version int bms_next_member(const Bitmapset *a, int prevbit) { int nwords; bitmapword mask; //Assert(bms_is_valid_set(a)); if (a == NULL) return -2; nwords = a->nwords; prevbit++; mask = (~(bitmapword) 0) << BITNUM(prevbit); for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before prevbit */ w &= mask; if (w != 0) { int result; result = wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } // 2. Fast version (size_t usage) int bms_next_member_fast(const Bitmapset *a, int prevbit) { uint64 currbit; size_t nwords; bitmapword mask; if (a == NULL) return -2; nwords = (size_t) a->nwords; currbit = (uint64) prevbit + 1; mask = (~(bitmapword) 0) << BITNUM(currbit); for (size_t wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before currbit */ w &= mask; if (w != 0) { int result; result = (int) wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } // 3. Original version + INT32_MAX check + 64bit int bms_next_member_2(const Bitmapset *a, int prevbit) { size_t nwords; bitmapword mask; if (a == NULL || prevbit == INT32_MAX) return -2; nwords = (size_t) a->nwords; prevbit++; mask = (~(bitmapword) 0) << BITNUM(prevbit); for (size_t wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before prevbit */ w &= mask; if (w != 0) { int result; result = (int)wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } // 4. Pull up first iteration int bms_next_member_pullup(const Bitmapset *a, int prevbit) { if (a == NULL || prevbit == INT_MAX) return -2; uint64 currbit = (uint64) prevbit + 1; int wordnum = WORDNUM(currbit); int nwords = a->nwords; if (wordnum >= nwords) return -2; /* Handle first word with mask */ const bitmapword *p = &a->words[wordnum]; bitmapword w = (*p) & ((~(bitmapword) 0) << BITNUM(currbit)); if (w != 0) return (wordnum * BITS_PER_BITMAPWORD) + bmw_rightmost_one_pos(w); /* The "Tight" Pointer Scan */ const bitmapword *end = &a->words[nwords]; for (p++; p < end; p++) { if (*p != 0) { wordnum = p - a->words; // Pointer arithmetic to get index return (wordnum * BITS_PER_BITMAPWORD) + bmw_rightmost_one_pos(*p); } } return -2; } double get_time() { struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts); return ts.tv_sec + ts.tv_nsec * 1e-9; } int main() { int words_to_alloc = 20000; // Large set to bypass CPU cache slightly Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc * sizeof(bitmapword)); bms->nwords = words_to_alloc; memset(bms->words, 0, words_to_alloc * sizeof(bitmapword)); /* Set a bit far into the set to force a long scan */ int target_bit = (words_to_alloc - 1) * 64 + 10; bms->words[words_to_alloc - 1] |= (1ULL << 10); int iterations = 100000; volatile int sink; printf("Benchmarking %d iterations...\n\n", iterations); // Test Original double start = get_time(); for (int i = 0; i < iterations; i++) sink = bms_next_member(bms, 0); printf("Original: %.5f seconds\n", get_time() - start); // Test Fast start = get_time(); for (int i = 0; i < iterations; i++) sink = bms_next_member_fast(bms, 0); printf("Fast: %.5f seconds\n", get_time() - start); // Test Original2 start = get_time(); for (int i = 0; i < iterations; i++) sink = bms_next_member_2(bms, 0); printf("Original2: %.5f seconds\n", get_time() - start); // Pull up first iteration start = get_time(); for (int i = 0; i < iterations; i++) sink = bms_next_member_pullup(bms, 0); printf("PullUp: %.5f seconds\n", get_time() - start); free(bms); return 0; } --Apple-Mail=_D212144E-814A-4F20-8D28-6CB615F36653--