Content-Type: text/plain;
	charset=utf-8
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3864.400.21\))
Subject: Re: Small and unlikely overflow hazard in bms_next_member()
From: Chao Li <li.evan.chao@gmail.com>
In-Reply-To: 
 <CAApHDvocDY0qvhxRNBaNRjEr46j8TsGFig+=6bHN-NYS_dhSCw@mail.gmail.com>
Date: Fri, 3 Apr 2026 21:28:03 +0800
Cc: Tom Lane <tgl@sss.pgh.pa.us>,
 PostgreSQL Developers <pgsql-hackers@lists.postgresql.org>
Content-Transfer-Encoding: quoted-printable
Message-Id: <D8422C13-E995-4D0D-B296-5D643DE89399@gmail.com>
References: 
 <CAApHDvq0T=iJ0Sf5TNE9yyWwfOeVjmrBt0wSywDnGD9Y4YJQBA@mail.gmail.com>
 <3190647.1775103768@sss.pgh.pa.us>
 <CAApHDvrvvq_m+nRwjsOpCsFa4EtVtmvJX7zAD=Siria-x6DpbQ@mail.gmail.com>
 <CAApHDvqTUm3Cbgz3ZLV+ad8s_HJHZYrVbrBvGyPQdxCRR-6dvA@mail.gmail.com>
 <59B9EFAF-84DF-40A9-847F-9CF457A798BB@gmail.com>
 <CAApHDvocDY0qvhxRNBaNRjEr46j8TsGFig+=6bHN-NYS_dhSCw@mail.gmail.com>
To: David Rowley <dgrowleyml@gmail.com>
Archived-At: 
 <https://www.postgresql.org/message-id/D8422C13-E995-4D0D-B296-5D643DE89399%40gmail.com>
Precedence: bulk


> On Apr 3, 2026, at 19:42, David Rowley <dgrowleyml@gmail.com> wrote:
>=20
> On Fri, 3 Apr 2026 at 16:24, Chao Li <li.evan.chao@gmail.com> wrote:
>> I also did a load test with a standalone c program with 4 versions:
>=20
> I think you'd be better off picking a more realistic Bitmapset. The
> code is doing:
>=20
>    int words_to_alloc =3D 20000; // Large set to bypass CPU cache =
slightly
>    Bitmapset *bms =3D malloc(sizeof(Bitmapset) + words_to_alloc *
> sizeof(bitmapword));
>    bms->nwords =3D words_to_alloc;
>    memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));
>=20
>    /* Set a bit far into the set to force a long scan */
>    int target_bit =3D (words_to_alloc - 1) * 64 + 10;
>    bms->words[words_to_alloc - 1] |=3D (1ULL << 10);
>=20
> A set with 20k words! Then you're doing:
>=20
>    for (int i =3D 0; i < iterations; i++) sink =3D =
bms_next_member(bms, 0);
>=20
> So you're not looping correctly over the set, as looping over a set
> with 1 member requires 2 calls to the function.
>=20
> I'd say this unrealistically favours the unrolled loop version, as
> most of the effort is in the loop looping over empty words and you can
> do that without the extra bit-wise AND that the other versions have.
> That version also seems to apply both the 64-bit fix and the INT_MAX
> fix. Why both?  Because you're only calling the function once per
> iteration, it also means your || INT_MAX is only executed once, rather
> than n_members + 1 as it would normally be executed. That'll make your
> version seem better than it is.
>=20
> I fixed all that and changed the Bitmapset to something more realistic
> with how it's more likely to be seen in PostgreSQL and ran the
> benchmark with a proper bms_next_member loop. File attached and
> results below:
>=20
> Zen2 Linux
>=20
> $ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
>=20
> Original:  0.62113 seconds
> David's:          0.70257 seconds
> Chao's version:   1.70691 seconds
> Unrolled loop: 1.56239 seconds
> 2800000000
>=20
> $ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
>=20
> Original:  1.11263 seconds
> David's:          0.87399 seconds
> Chao's version:   0.52962 seconds
> Unrolled loop: 1.07799 seconds
> 2800000000
>=20
> Apple M2:
>=20
> Benchmarking 100000000 iterations...
>=20
> Original:  0.64023 seconds
> David's:          0.41087 seconds
> Chao's version:   1.21113 seconds
> Unrolled loop: 0.55924 seconds
> 2800000000
>=20
>> I=E2=80=99m curious that, when something performs differently across =
platforms, which platform should take priority?
>=20
> I don't think it matters too much for this case. If we were actually
> working on a function that was a bottleneck, then it would be worth
> looking deeper and seeing if the compilers are producing suboptimal
> code and then try to coax them into making better code. As a last
> resort, have two versions and some preprocessor to decide which one.
> However, this just isn't performance-critical enough to warrant such
> an effort. I think we're already going far too far with this. I just
> want to fix the bug. If performance increases in some way as a result
> of that, then good.
>=20
> If I sum up the times from each version of the above results, I get:
>=20
> Original: 2.37399
> David's: 1.98743
> Chao's: 3.44766
> Unrolled: 3.19962
>=20
> I'm happy to go with the fastest one. I expect the one with the fewest
> instructions is likely more important when running it in the planner,
> as that's less to decode. I didn't look, but I expect your loop
> unrolled version and your || INT_MAX version to have more
> instructions.
>=20
> David
> <test_bms_next2.c>

In test_bms_next2.c, you set words_to_alloc =3D 1, which disables the =
optimization in the unrolled version. If I change words_to_alloc =3D 2, =
then on my MacBook M4, the unrolled version seems to win:
```
chaol@ChaodeMacBook-Air test % ./test_bms_next2
Benchmarking 100000000 iterations...

Original:  0.61559 seconds
David's:	  0.61651 seconds
Chao's version:	  1.06033 seconds
Unrolled loop: 0.60561 seconds
2800000000
```

(I also checked on Godbolt, from the instruction-count perspective, the =
unrolled version actually looks the worst.)

I guess one-word Bitmapset are probably the most common case in =
practice. So overall, I agree that your version is the best choice. =
Please go ahead with it as we have already gained both a bug fix and a =
performance improvement. If we really want to study the performance more =
deeply, I think that would be a separate topic.

It was fun working on this patch, and thank you for your patience.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/