Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3864.400.21\))
Subject: Re: Small and unlikely overflow hazard in bms_next_member()
From: Chao Li <li.evan.chao@gmail.com>
In-Reply-To: 
 <CAApHDvo+WtKHWbgmPA+H2K24P6h6aUJ_9kAdS__G1L9LDFGwgQ@mail.gmail.com>
Date: Mon, 6 Apr 2026 10:00:38 +0800
Cc: Tom Lane <tgl@sss.pgh.pa.us>,
 PostgreSQL Developers <pgsql-hackers@lists.postgresql.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <64C19DC3-CE9B-4222-B71E-31B88574FBDC@gmail.com>
References: 
 <CAApHDvq0T=iJ0Sf5TNE9yyWwfOeVjmrBt0wSywDnGD9Y4YJQBA@mail.gmail.com>
 <3190647.1775103768@sss.pgh.pa.us>
 <CAApHDvrvvq_m+nRwjsOpCsFa4EtVtmvJX7zAD=Siria-x6DpbQ@mail.gmail.com>
 <CAApHDvqTUm3Cbgz3ZLV+ad8s_HJHZYrVbrBvGyPQdxCRR-6dvA@mail.gmail.com>
 <59B9EFAF-84DF-40A9-847F-9CF457A798BB@gmail.com>
 <CAApHDvocDY0qvhxRNBaNRjEr46j8TsGFig+=6bHN-NYS_dhSCw@mail.gmail.com>
 <D8422C13-E995-4D0D-B296-5D643DE89399@gmail.com>
 <CAApHDvo+WtKHWbgmPA+H2K24P6h6aUJ_9kAdS__G1L9LDFGwgQ@mail.gmail.com>
To: David Rowley <dgrowleyml@gmail.com>
Archived-At: 
 <https://www.postgresql.org/message-id/64C19DC3-CE9B-4222-B71E-31B88574FBDC%40gmail.com>
Precedence: bulk


> On Apr 4, 2026, at 11:30, David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sat, 4 Apr 2026 at 02:28, Chao Li <li.evan.chao@gmail.com> wrote:
>> In test_bms_next2.c, you set words_to_alloc = 1, which disables the
>> optimization in the unrolled version. If I change words_to_alloc = 2,
>> then on my MacBook M4, the unrolled version seems to win:
>> ```
>> chaol@ChaodeMacBook-Air test % ./test_bms_next2
>> Benchmarking 100000000 iterations...
>>
>> Original:  0.61559 seconds
>> David's:          0.61651 seconds
>> Chao's version:   1.06033 seconds
>> Unrolled loop: 0.60561 seconds
>> 2800000000
>> ```
>
> Doing the same locally, here's what I get on my Zen2 machine:
>
> drowley@amd3990x:~$ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
>
> Original:  1.21125 seconds
> David's:          1.11997 seconds
> Chao's version:   1.72662 seconds
> Unrolled loop: 1.63090 seconds
> 2800000000
> drowley@amd3990x:~$ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
>
> Original:  1.10780 seconds
> David's:          1.05968 seconds
> Chao's version:   1.87123 seconds
> Unrolled loop: 1.11002 seconds
> 2800000000

I just tried my Windows laptop with both MSYS2 and WSL Ubuntu, and on
both of them the original version and David’s version consistently
performed better than the other versions.

I have a new finding from Godbolt: adding likely(w != 0) seems to
reduce the instruction count by one. bms_next_member_fast has 26
instructions, while after adding "likely", bms_next_member_fast2 has 25.
It seems that this may avoid one jump and perhaps be slightly better,
but I’m not sure. I’m really not an expert in assembly.
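
For reference, here is a minimal sketch of where such a hint would sit.
This is not the code from the Godbolt comparison or from David's patch:
next_member_hinted and lowest_set_bit are hypothetical names, the word
size is assumed to be 64 bits, and the likely() macro is only modelled
on the GCC-style definition PostgreSQL uses.

```c
#include <assert.h>
#include <stdint.h>

/* Branch hint in the style of PostgreSQL's likely(); plain test elsewhere. */
#if defined(__GNUC__)
#define likely(x) __builtin_expect((x) != 0, 1)
#else
#define likely(x) ((x) != 0)
#endif

#define BITS_PER_WORD 64		/* assumption: 64-bit words */

/* Portable count-trailing-zeros; caller guarantees w != 0. */
static int
lowest_set_bit(uint64_t w)
{
	int			n = 0;

	assert(w != 0);
	while ((w & 1) == 0)
	{
		w >>= 1;
		n++;
	}
	return n;
}

/*
 * Hypothetical stand-in for the function under test: return the smallest
 * member greater than prevbit, or -2 if there is none.  Assumes
 * prevbit >= -1.  The increment is done in unsigned arithmetic so that
 * prevbit == INT_MAX cannot overflow.
 */
static int
next_member_hinted(const uint64_t *words, int nwords, int prevbit)
{
	unsigned int bit = (unsigned int) prevbit + 1;
	unsigned int wordnum = bit / BITS_PER_WORD;
	uint64_t	mask = ~UINT64_C(0) << (bit % BITS_PER_WORD);

	for (; wordnum < (unsigned int) nwords; wordnum++)
	{
		uint64_t	w = words[wordnum] & mask;

		if (likely(w != 0))		/* hint: a set bit is usually found here */
			return (int) (wordnum * BITS_PER_WORD) + lowest_set_bit(w);
		mask = ~UINT64_C(0);	/* later words are scanned in full */
	}
	return -2;
}
```

The hint only tells the compiler which branch to lay out as the
fall-through path; whether that saves a jump depends on the compiler
and target.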

Then I tested. “likely” doesn’t seem to help on Mac:
```
chaol@ChaodeMacBook-Air test % ./test_bms_next2
Benchmarking 100000000 iterations...

Original:  0.54994 seconds
David's:	  0.78218 seconds
David's likely:	  0.78990 seconds
Chao's version:	  1.11530 seconds
Unrolled loop: 0.47660 seconds
3500000000
```

But it helps slightly on Windows:
```
$ ./test_bms_next2
Benchmarking 100000000 iterations...

Original:  1.45312 seconds
David's:          1.48438 seconds
David's likely:   1.40625 seconds
Chao's version:   3.12500 seconds
Unrolled loop: 2.95312 seconds
-794967296
```

I’m not suggesting adding likely() to your version, I’m just sharing
the information for your reference and leaving the decision to you.

>
>> I guess one-word Bitmapset are probably the most common case in
>> practice. So overall, I agree that your version is the best choice.
>> Please go ahead with it as we have already gained both a bug fix and a
>> performance improvement. If we really want to study the performance
>> more deeply, I think that would be a separate topic.
>
> Yeah, but it's better to find bottlenecks and focus there than to
> optimise blindly without knowing if it'll make any actual difference.
> It's still important to test for regressions when modifying code and
> try to avoid them, as it's sometimes not obvious if the code being
> modified is critical for performance.
>
> I also don't think we should be doing anything to optimise for
> multi-word sets at the expense of slowing down single-word sets.

Fully agreed.

> Not
> that it's very representative of the real world, but I added some
> telemetry to bitmapset.c and I see 43 multi-word sets being created
> and 1.45 million single-word sets created during make check, or 1 in
> 33785, if you prefer.
>
> diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
> index 786f343b3c9..21b5053e9a2 100644
> --- a/src/backend/nodes/bitmapset.c
> +++ b/src/backend/nodes/bitmapset.c
> @@ -219,6 +219,10 @@ bms_make_singleton(int x)
>        int                     wordnum,
>                                bitnum;
>
> +       if (x >= 64)
> +               elog(NOTICE, "multi-word set bms_make_singleton");
> +       else
> +               elog(NOTICE, "single-word set bms_make_singleton");
>        if (x < 0)
>                elog(ERROR, "negative bitmapset member not allowed");
>        wordnum = WORDNUM(x);
> @@ -811,6 +815,9 @@ bms_add_member(Bitmapset *a, int x)
>        wordnum = WORDNUM(x);
>        bitnum = BITNUM(x);
>
> +       if (x >= 64 && a->nwords == 1)
> +               elog(NOTICE, "multi-set bms_add_member");
> +
>        /* enlarge the set if necessary */
>        if (wordnum >= a->nwords)
>        {
>
> Not quite perfect as a set made first as a single word set that later
> becomes a multi-word set will be double counted. The number of
> operations on the sets is likely more important anyway, not the number
> of sets being created. The point is, multi-word sets are rare for most
> workloads.
>

What tests did you run after adding the logs to collect the data? This
is a method I might borrow in the future for similar investigations.
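
To check that I read the telemetry correctly: with 64-bit words, member
x lands in word x / 64, so x >= 64 is what forces a second word. Below
is a standalone sketch of that classification. The WORDNUM and BITNUM
macros mirror the ones used in the quoted diff, BITS_PER_BITMAPWORD is
assumed to be 64, and needs_multi_word, BmsStats and count_singleton
are hypothetical names that exist only for this illustration. It is
meant for non-negative x only.

```c
#include <stdbool.h>

#define BITS_PER_BITMAPWORD 64	/* assumption: 64-bit bitmapwords */
#define WORDNUM(x)	((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)	((x) % BITS_PER_BITMAPWORD)

/* Hypothetical helper: true if storing member x needs more than one word. */
static bool
needs_multi_word(int x)
{
	return WORDNUM(x) >= 1;		/* same condition as x >= 64 */
}

/* Hypothetical counters, standing in for counting the NOTICE lines. */
typedef struct BmsStats
{
	long		single_word;
	long		multi_word;
} BmsStats;

/* Classify one singleton creation the way the quoted diff does. */
static void
count_singleton(BmsStats *stats, int x)
{
	if (needs_multi_word(x))
		stats->multi_word++;
	else
		stats->single_word++;
}
```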

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/