From: Chao Li <li.evan.chao@gmail.com>
Message-Id: <59B9EFAF-84DF-40A9-847F-9CF457A798BB@gmail.com>
Content-Type: multipart/mixed;
	boundary="Apple-Mail=_D212144E-814A-4F20-8D28-6CB615F36653"
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3864.400.21\))
Subject: Re: Small and unlikely overflow hazard in bms_next_member()
Date: Fri, 3 Apr 2026 11:24:08 +0800
In-Reply-To: 
 <CAApHDvqTUm3Cbgz3ZLV+ad8s_HJHZYrVbrBvGyPQdxCRR-6dvA@mail.gmail.com>
Cc: Tom Lane <tgl@sss.pgh.pa.us>,
 PostgreSQL Developers <pgsql-hackers@lists.postgresql.org>
To: David Rowley <dgrowleyml@gmail.com>
References: 
 <CAApHDvq0T=iJ0Sf5TNE9yyWwfOeVjmrBt0wSywDnGD9Y4YJQBA@mail.gmail.com>
 <3190647.1775103768@sss.pgh.pa.us>
 <CAApHDvrvvq_m+nRwjsOpCsFa4EtVtmvJX7zAD=Siria-x6DpbQ@mail.gmail.com>
 <CAApHDvqTUm3Cbgz3ZLV+ad8s_HJHZYrVbrBvGyPQdxCRR-6dvA@mail.gmail.com>
Archived-At: 
 <https://www.postgresql.org/message-id/59B9EFAF-84DF-40A9-847F-9CF457A798BB%40gmail.com>
Precedence: bulk


--Apple-Mail=_D212144E-814A-4F20-8D28-6CB615F36653
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8


> On Apr 3, 2026, at 10:08, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Fri, 3 Apr 2026 at 11:12, David Rowley <dgrowleyml@gmail.com> wrote:
>>
>> On Thu, 2 Apr 2026 at 17:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I don't think we should add cycles here for this purpose.
>>
>> I'm not keen on slowing things down for this either. I did do some
>> experiments in [1] that sees fewer instructions from using 64-bit
>> maths. I might go off and see if there are any wins there that also
>> give us the INT_MAX fix. It's not great effort to reward ratio
>> though...
>
> The reduction in instructions with the patched version got me curious
> to see if it would translate into a performance increase.  I tested on
> an AMD Zen2 machine, and it's a decent amount faster than master. I
> tested with gcc and clang.
>
> I also scanned over the remaining parts of bitmapset.c and didn't find
> anywhere else that has overflow risk aside from what you pointed out
> in bms_prev_member().
>
> The attached patch contains the benchmark function I added to the
> test_bitmapset module. It should apply to master with a bit of noise.
>
> CREATE EXTENSION test_bitmapset;
> SELECT
>    generate_series(1,3) AS run,
>    bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS
> bms_next_member_us,
>    bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS
> bms_prev_member_us;
>
> master (gcc)
>
> run | bms_next_member_us | bms_prev_member_us
> -----+--------------------+--------------------
>   1 |              26473 |              40404
>   2 |              26218 |              40413
>   3 |              26209 |              40387
>
> patched (gcc)
>
> run | bms_next_member_us | bms_prev_member_us
> -----+--------------------+--------------------
>   1 |              25409 |              29705
>   2 |              24905 |              29693
>   3 |              24870 |              29707
>
> Times are in microseconds to do 1 million bms_*_member() loops over
> the entire set.
>
> I've also attached the full results I got. I've also included the
> results from Chao's version, which does slow things down decently on
> clang.
>
> IMO, if we can make bitmapset.c work with INT_MAX members and get a
> performance increase, then we should do it.
>
> David
>
>> [1] https://godbolt.org/z/Eh1vzssq7
> <benchmark_results.txt><bms_fixes.patch>

I also ran a benchmark with a standalone C program that compares four versions:

* The original bms_next_member (Original)
* The fast version from [1], which uses 64-bit math (Fast)
* The original version + INT32_MAX check + 64-bit math (Original2)
* A variant that pulls the first iteration out of the loop, which removes
  "mask = (~(bitmapword) 0);" from the loop body (PullUp)

Note: all executables were built with -O2.

On my MacBook M4, the Fast version consistently won, and the PullUp version performed badly.
```
% gcc --version
Apple clang version 17.0.0 (clang-1700.6.4.2)
Target: arm64-apple-darwin25.3.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
```

A typical test run:
```
Benchmarking 100000 iterations...

Original:  0.48893 seconds
Fast:      0.46979 seconds
Original2:      0.47740 seconds
PullUp: 0.48029 seconds
```

On my Windows laptop (Intel(R) Core Ultra 5) with WSL-based Ubuntu, Original2 won in most runs, and the PullUp version was faster than the Fast version.
```
chaol@lichao-highgo:~$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

A typical test run:
```
Original:  0.99849 seconds
Fast:      0.74722 seconds
Original2:      0.59407 seconds
PullUp: 0.62746 seconds
```

Then I also ran the test on Windows directly. There, the PullUp version performed the best.
```
$ gcc --version
gcc.exe (Rev13, Built by MSYS2 project) 15.2.0
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

A typical test run:
```
Original:  0.32931 seconds
Fast:      0.32740 seconds
Original2:      0.32378 seconds
PullUp: 0.30795 seconds
```

I'm curious: when something performs differently across platforms, which platform should take priority?

Please see the attached test program. It's possible I did something wrong.

[1] https://godbolt.org/z/Eh1vzssq7

--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/


--Apple-Mail=_D212144E-814A-4F20-8D28-6CB615F36653
Content-Disposition: attachment;
	filename=test_bms_next.c
Content-Type: application/octet-stream;
	x-unix-mode=0644;
	name="test_bms_next.c"
Content-Transfer-Encoding: 7bit

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <limits.h>

//#define NULL ((void *) 0)
typedef uint64_t uint64;
typedef int64_t int64;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword;      /* must be an unsigned type */
typedef int64 signedbitmapword; /* must be the matching signed type */

#define WORDNUM(x)  ((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)   ((x) % BITS_PER_BITMAPWORD)

typedef struct Bitmapset
{
    int         nwords;         /* number of words in array */
    bitmapword  words[];    /* really [nwords] */
} Bitmapset;

static inline int
bmw_rightmost_one_pos(uint64 word)
{
    return __builtin_ctzll(word);
}

// 1. Original version
int
bms_next_member(const Bitmapset *a, int prevbit)
{
    int         nwords;
    bitmapword  mask;

    //Assert(bms_is_valid_set(a));

    if (a == NULL)
        return -2;
    nwords = a->nwords;
    prevbit++;
    mask = (~(bitmapword) 0) << BITNUM(prevbit);
    for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
    {
        bitmapword  w = a->words[wordnum];

        /* ignore bits before prevbit */
        w &= mask;

        if (w != 0)
        {
            int         result;

            result = wordnum * BITS_PER_BITMAPWORD;
            result += bmw_rightmost_one_pos(w);
            return result;
        }

        /* in subsequent words, consider all bits */
        mask = (~(bitmapword) 0);
    }
    return -2;
}

// 2. Fast version (size_t usage)
int
bms_next_member_fast(const Bitmapset *a, int prevbit)
{
    uint64      currbit;
    size_t      nwords;
    bitmapword  mask;

    if (a == NULL)
        return -2;
    nwords = (size_t) a->nwords;
    currbit = (uint64) prevbit + 1;
    mask = (~(bitmapword) 0) << BITNUM(currbit);
    for (size_t wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++)
    {
        bitmapword  w = a->words[wordnum];

        /* ignore bits before currbit */
        w &= mask;

        if (w != 0)
        {
            int         result;

            result = (int) wordnum * BITS_PER_BITMAPWORD;
            result += bmw_rightmost_one_pos(w);
            return result;
        }

        /* in subsequent words, consider all bits */
        mask = (~(bitmapword) 0);
    }
    return -2;
}

// 3. Original version + INT32_MAX check + 64bit
int
bms_next_member_2(const Bitmapset *a, int prevbit)
{
    size_t         nwords;
    bitmapword  mask;

    if (a == NULL || prevbit == INT32_MAX)
        return -2;
    nwords = (size_t) a->nwords;
    prevbit++;
    mask = (~(bitmapword) 0) << BITNUM(prevbit);
    for (size_t wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
    {
        bitmapword  w = a->words[wordnum];

        /* ignore bits before prevbit */
        w &= mask;

        if (w != 0)
        {
            int         result;

            result = (int)wordnum * BITS_PER_BITMAPWORD;
            result += bmw_rightmost_one_pos(w);
            return result;
        }

        /* in subsequent words, consider all bits */
        mask = (~(bitmapword) 0);
    }
    return -2;
}

// 4. Pull up first iteration
int
bms_next_member_pullup(const Bitmapset *a, int prevbit)
{
    if (a == NULL || prevbit == INT_MAX)
        return -2;

    uint64      currbit = (uint64) prevbit + 1;
    int         wordnum = WORDNUM(currbit);
    int         nwords = a->nwords;

    if (wordnum >= nwords)
        return -2;

    /* Handle first word with mask */
    const bitmapword *p = &a->words[wordnum];
    bitmapword  w = (*p) & ((~(bitmapword) 0) << BITNUM(currbit));

    if (w != 0)
        return (wordnum * BITS_PER_BITMAPWORD) + bmw_rightmost_one_pos(w);

    /* The "Tight" Pointer Scan */
    const bitmapword *end = &a->words[nwords];
    for (p++; p < end; p++)
    {
        if (*p != 0)
        {
            wordnum = p - a->words; // Pointer arithmetic to get index
            return (wordnum * BITS_PER_BITMAPWORD) + bmw_rightmost_one_pos(*p);
        }
    }

    return -2;
}


double get_time() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

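int main(void) {
    int words_to_alloc = 20000; // Large set to bypass CPU cache slightly
    Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc * sizeof(bitmapword));

    /* Bail out if the allocation failed rather than dereferencing NULL */
    if (bms == NULL)
    {
        fprintf(stderr, "out of memory\n");
        return 1;
    }
    bms->nwords = words_to_alloc;
    memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));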

    /* Set a bit far into the set (bit 10 of the last word) to force a long scan */
    int target_bit = (words_to_alloc - 1) * BITS_PER_BITMAPWORD + 10;
    bms->words[WORDNUM(target_bit)] |= ((bitmapword) 1 << BITNUM(target_bit));

    int iterations = 100000;
    volatile int sink;

    printf("Benchmarking %d iterations...\n\n", iterations);

    // Test Original
    double start = get_time();
    for (int i = 0; i < iterations; i++) sink = bms_next_member(bms, 0);
    printf("Original:  %.5f seconds\n", get_time() - start);

    // Test Fast
    start = get_time();
    for (int i = 0; i < iterations; i++) sink = bms_next_member_fast(bms, 0);
    printf("Fast:      %.5f seconds\n", get_time() - start);

    // Test Original2
    start = get_time();
    for (int i = 0; i < iterations; i++) sink = bms_next_member_2(bms, 0);
    printf("Original2:      %.5f seconds\n", get_time() - start);

    // Pull up first iteration
    start = get_time();
    for (int i = 0; i < iterations; i++) sink = bms_next_member_pullup(bms, 0);
    printf("PullUp: %.5f seconds\n", get_time() - start);

    free(bms);
    return 0;
}
--Apple-Mail=_D212144E-814A-4F20-8D28-6CB615F36653--