Small and unlikely overflow hazard in bms_next

public inbox for [email protected]  
help / color / mirror / Atom feed

Small and unlikely overflow hazard in bms_next_member()
17+ messages / 4 participants
[nested] [flat]

* Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-02 04:09  David Rowley <[email protected]>
  0 siblings, 2 replies; 17+ messages in thread

From: David Rowley @ 2026-04-02 04:09 UTC (permalink / raw)
  To: PostgreSQL Developers <[email protected]>

I've been working on bms_left_shift_members() to bitshift members
either left or right in order to tidy up some existing code and
improve a future Bitmapset use case I'm currently working on.

When testing some ERROR code I added to ensure we don't get an
excessively large left shift value and end up with members higher than
INT32_MAX, I discovered that bms_next_member() can't handle that
value, as "prevbit++" will wrap to INT32_MIN and then we'll try to
access a negative array index, i.e. seg fault.

I appreciate that such a large member is quite unlikely, but if this
isn't fixed then I need to code my error checking code to disallow
members >= INT32_MAX rather than > INT32_MAX. I did have a comment
explaining why I was doing that, but fixing the bug saves the weird
special case and the comment.

Patched attached. I was thinking it might not be worthy of
backpatching, but I'll entertain alternative views on that.

David

Attachments:

  [application/octet-stream] bms_next_member_fix.patch (905B, 2-bms_next_member_fix.patch)
  download | inline diff:
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 786f343b3c9..912a1ab696a 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1289,6 +1289,7 @@ bms_join(Bitmapset *a, Bitmapset *b)
 int
 bms_next_member(const Bitmapset *a, int prevbit)
 {
+	int64		currbit;
 	int			nwords;
 	bitmapword	mask;

@@ -1297,13 +1298,13 @@ bms_next_member(const Bitmapset *a, int prevbit)
 	if (a == NULL)
 		return -2;
 	nwords = a->nwords;
-	prevbit++;
-	mask = (~(bitmapword) 0) << BITNUM(prevbit);
-	for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
+	currbit = (int64) prevbit + 1;
+	mask = (~(bitmapword) 0) << BITNUM(currbit);
+	for (int wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++)
 	{
 		bitmapword	w = a->words[wordnum];

-		/* ignore bits before prevbit */
+		/* ignore bits before currbit */
 		w &= mask;

 		if (w != 0)

^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-02 04:22  Tom Lane <[email protected]>
  parent: David Rowley <[email protected]>
  1 sibling, 1 reply; 17+ messages in thread

From: Tom Lane @ 2026-04-02 04:22 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: PostgreSQL Developers <[email protected]>

David Rowley <[email protected]> writes:
> When testing some ERROR code I added to ensure we don't get an
> excessively large left shift value and end up with members higher than
> INT32_MAX, I discovered that bms_next_member() can't handle that
> value, as "prevbit++" will wrap to INT32_MIN and then we'll try to
> access a negative array index, i.e. seg fault.

> I appreciate that such a large member is quite unlikely,

I think it's impossible, and if it's not then this is not the
only place in bitmapset.c that could theoretically overflow.
As an example, bms_prev_member does

	Assert(prevbit <= a->nwords * BITS_PER_BITMAPWORD);

but if the bitmapset were large enough to accommodate INT_MAX
as a member then a->nwords * BITS_PER_BITMAPWORD must overflow.

I don't think we should add cycles here for this purpose.
If it makes you feel better, maybe add Asserts to
bms_make_singleton and bms_add_member to constrain the
maximum member value to somewhat less than INT_MAX?

			regards, tom lane

^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-02 06:38  Chao Li <[email protected]>
  parent: David Rowley <[email protected]>
  1 sibling, 2 replies; 17+ messages in thread

From: Chao Li @ 2026-04-02 06:38 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: PostgreSQL Developers <[email protected]>



> On Apr 2, 2026, at 12:09, David Rowley <[email protected]> wrote:
> 
> I've been working on bms_left_shift_members() to bitshift members
> either left or right in order to tidy up some existing code and
> improve a future Bitmapset use case I'm currently working on.
> 
> When testing some ERROR code I added to ensure we don't get an
> excessively large left shift value and end up with members higher than
> INT32_MAX, I discovered that bms_next_member() can't handle that
> value, as "prevbit++" will wrap to INT32_MIN and then we'll try to
> access a negative array index, i.e. seg fault.
> 
> I appreciate that such a large member is quite unlikely, but if this
> isn't fixed then I need to code my error checking code to disallow
> members >= INT32_MAX rather than > INT32_MAX. I did have a comment
> explaining why I was doing that, but fixing the bug saves the weird
> special case and the comment.
> 
> Patched attached. I was thinking it might not be worthy of
> backpatching, but I'll entertain alternative views on that.
> 
> David
> <bms_next_member_fix.patch>

Both the return value of bms_next_member() and the parameter “prevbit" are defined as int, which seems to imply that a bitmap can hold at most INT32_MAX elements. So I wonder if a cleaner fix would be to just add a range guard, like this:

```
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 786f343b3c9..7f79f81f278 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1294,7 +1294,7 @@ bms_next_member(const Bitmapset *a, int prevbit)

        Assert(bms_is_valid_set(a));

-       if (a == NULL)
+       if (a == NULL || prevbit == INT32_MAX)
                return -2;
        nwords = a->nwords;
        prevbit++;
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/









^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-02 07:54  jie wang <[email protected]>
  parent: Chao Li <[email protected]>
  1 sibling, 0 replies; 17+ messages in thread

From: jie wang @ 2026-04-02 07:54 UTC (permalink / raw)
  To: Chao Li <[email protected]>; +Cc: David Rowley <[email protected]>; PostgreSQL Developers <[email protected]>

Chao Li <[email protected]> 于2026年4月2日周四 14:39写道：

>
>
> > On Apr 2, 2026, at 12:09, David Rowley <[email protected]> wrote:
> >
> > I've been working on bms_left_shift_members() to bitshift members
> > either left or right in order to tidy up some existing code and
> > improve a future Bitmapset use case I'm currently working on.
> >
> > When testing some ERROR code I added to ensure we don't get an
> > excessively large left shift value and end up with members higher than
> > INT32_MAX, I discovered that bms_next_member() can't handle that
> > value, as "prevbit++" will wrap to INT32_MIN and then we'll try to
> > access a negative array index, i.e. seg fault.
> >
> > I appreciate that such a large member is quite unlikely, but if this
> > isn't fixed then I need to code my error checking code to disallow
> > members >= INT32_MAX rather than > INT32_MAX. I did have a comment
> > explaining why I was doing that, but fixing the bug saves the weird
> > special case and the comment.
> >
> > Patched attached. I was thinking it might not be worthy of
> > backpatching, but I'll entertain alternative views on that.
> >
> > David
> > <bms_next_member_fix.patch>
>
> Both the return value of bms_next_member() and the parameter “prevbit" are
> defined as int, which seems to imply that a bitmap can hold at most
> INT32_MAX elements. So I wonder if a cleaner fix would be to just add a
> range guard, like this:
>
> ```
> diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
> index 786f343b3c9..7f79f81f278 100644
> --- a/src/backend/nodes/bitmapset.c
> +++ b/src/backend/nodes/bitmapset.c
> @@ -1294,7 +1294,7 @@ bms_next_member(const Bitmapset *a, int prevbit)
>
>         Assert(bms_is_valid_set(a));
>
> -       if (a == NULL)
> +       if (a == NULL || prevbit == INT32_MAX)
>                 return -2;
>         nwords = a->nwords;
>         prevbit++;
> ```
>
> Best regards,
> --
> Chao Li (Evan)
> HighGo Software Co., Ltd.
> https://www.highgo.com/
>
>
>
>
>
>
>
Hi,

The both solutions look good to me. I am slightly keen on Chao's version
that looks simpler to me.

Thanks!
wang jie


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-02 22:12  David Rowley <[email protected]>
  parent: Tom Lane <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: David Rowley @ 2026-04-02 22:12 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: PostgreSQL Developers <[email protected]>

On Thu, 2 Apr 2026 at 17:22, Tom Lane <[email protected]> wrote:
>         Assert(prevbit <= a->nwords * BITS_PER_BITMAPWORD);

> but if the bitmapset were large enough to accommodate INT_MAX
> as a member then a->nwords * BITS_PER_BITMAPWORD must overflow.

I missed that one. That's annoying. Even "prevbit = a->nwords *
BITS_PER_BITMAPWORD - 1;" is undefined if it wraps due to the signed
maths.

> I don't think we should add cycles here for this purpose.

I'm not keen on slowing things down for this either. I did do some
experiments in [1] that sees fewer instructions from using 64-bit
maths. I might go off and see if there are any wins there that also
give us the INT_MAX fix. It's not great effort to reward ratio
though...

David

[1] https://godbolt.org/z/Eh1vzssq7

^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-02 22:20  David Rowley <[email protected]>
  parent: Chao Li <[email protected]>
  1 sibling, 0 replies; 17+ messages in thread

From: David Rowley @ 2026-04-02 22:20 UTC (permalink / raw)
  To: Chao Li <[email protected]>; +Cc: PostgreSQL Developers <[email protected]>

On Thu, 2 Apr 2026 at 19:39, Chao Li <[email protected]> wrote:
> Both the return value of bms_next_member() and the parameter “prevbit" are defined as int, which seems to imply that a bitmap can hold at most INT32_MAX elements. So I wonder if a cleaner fix would be to just add a range guard, like this:
>
> ```
> diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
> index 786f343b3c9..7f79f81f278 100644
> --- a/src/backend/nodes/bitmapset.c
> +++ b/src/backend/nodes/bitmapset.c
> @@ -1294,7 +1294,7 @@ bms_next_member(const Bitmapset *a, int prevbit)
>
>         Assert(bms_is_valid_set(a));
>
> -       if (a == NULL)
> +       if (a == NULL || prevbit == INT32_MAX)

I don't think that's a nice way at all. Adding a special case plus
extra instructions, including a new jump instruction for the boolean
short-circuiting. More instruction decoding, more L1i space needed,
more branches to predict and more space in the branch predictor tables
to overwrite other useful branches. Adds run-time overhead.

I was aiming for low overhead and no special cases. 2a600a93c means we
don't care about the performance on 32-bit hardware anymore, so that
can't be used as a counterargument.

David

^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-03 02:08  David Rowley <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 2 replies; 17+ messages in thread

From: David Rowley @ 2026-04-03 02:08 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: PostgreSQL Developers <[email protected]>

On Fri, 3 Apr 2026 at 11:12, David Rowley <[email protected]> wrote:
>
> On Thu, 2 Apr 2026 at 17:22, Tom Lane <[email protected]> wrote:
> > I don't think we should add cycles here for this purpose.
>
> I'm not keen on slowing things down for this either. I did do some
> experiments in [1] that sees fewer instructions from using 64-bit
> maths. I might go off and see if there are any wins there that also
> give us the INT_MAX fix. It's not great effort to reward ratio
> though...

The reduction in instructions with the patched version got me curious
to see if it would translate into a performance increase.  I tested on
an AMD Zen2 machine, and it's a decent amount faster than master. I
tested with gcc and clang.

I also scanned over the remaining parts of bitmapset.c and didn't find
anywhere else that has overflow risk aside from what you pointed out
in bms_prev_member().

The attached patch contains the benchmark function I added to the
test_bitmapset module. It should apply to master with a bit of noise.

CREATE EXTENSION test_bitmapset;
SELECT
    generate_series(1,3) AS run,
    bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS
bms_next_member_us,
    bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS
bms_prev_member_us;

master (gcc)

 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              26473 |              40404
   2 |              26218 |              40413
   3 |              26209 |              40387

patched (gcc)

 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              25409 |              29705
   2 |              24905 |              29693
   3 |              24870 |              29707

Times are in microseconds to do 1 million bms_*_member() loops over
the entire set.

I've also attached the full results I got. I've also included the
results from Chao's version, which does slow things down decently on
clang.

IMO, if we can make bitmapset.c work with INT_MAX members and get a
performance increase, then we should do it.

David

> [1] https://godbolt.org/z/Eh1vzssq7

** GCC **

Master:

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              26473 |              40404
   2 |              26218 |              40413
   3 |              26209 |              40387
   4 |              26245 |              40521
   5 |              26343 |              40559
   6 |              26343 |              40580
   7 |              26366 |              40639
   8 |              26735 |              40467
   9 |              26201 |              40334
  10 |              26208 |              40343
(10 rows)

Time: 735.560 ms
postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              26616 |              40392
   2 |              26239 |              40374
   3 |              26191 |              40374
   4 |              26230 |              40370
   5 |              26232 |              40380
   6 |              26210 |              40377
   7 |              26206 |              40395
   8 |              26249 |              40431
   9 |              26209 |              40374
  10 |              26234 |              40432
(10 rows)

Time: 733.927 ms

Patched:

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              25409 |              29705
   2 |              24905 |              29693
   3 |              24870 |              29707
   4 |              24933 |              29869
   5 |              25059 |              29870
   6 |              25026 |              29906
   7 |              25055 |              29897
   8 |              25055 |              29875
   9 |              25031 |              29870
  10 |              25042 |              29816
(10 rows)

Time: 604.086 ms
postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              25074 |              29723
   2 |              24882 |              29721
   3 |              24875 |              29710
   4 |              24976 |              29854
   5 |              25052 |              29865
   6 |              25021 |              29904
   7 |              25025 |              29818
   8 |              25034 |              29813
   9 |              25029 |              29828
  10 |              25042 |              29866
(10 rows)

Time: 603.708 ms

Chao's idea:

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              25677 |              40376
   2 |              25271 |              40430
   3 |              25263 |              40370
   4 |              25301 |              40377
   5 |              25357 |              40552
   6 |              25394 |              40560
   7 |              25412 |              40501
   8 |              25401 |              40558
   9 |              25381 |              40565
  10 |              25370 |              40548
(10 rows)

Time: 725.376 ms


** CLANG **

master:

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              29970 |              42551
   2 |              29479 |              42765
   3 |              29475 |              42805
   4 |              29523 |              42825
   5 |              29578 |              42898
   6 |              29529 |              42997
   7 |              29535 |              42985
   8 |              29593 |              43021
   9 |              29575 |              42976
  10 |              29562 |              42967
(10 rows)

Time: 798.131 ms

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              30919 |              42516
   2 |              29312 |              42629
   3 |              29422 |              42777
   4 |              29307 |              42811
   5 |              29273 |              42852
   6 |              29219 |              42826
   7 |              29860 |              43264
   8 |              29325 |              42814
   9 |              29299 |              42837
  10 |              29351 |              42909
(10 rows)

Time: 796.495 ms

Patched:

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              26741 |              33537
   2 |              26094 |              33590
   3 |              26087 |              33724
   4 |              26276 |              33868
   5 |              26346 |              33943
   6 |              26433 |              33815
   7 |              26324 |              33927
   8 |              26274 |              33893
   9 |              26350 |              33878
  10 |              26304 |              33841
(10 rows)

Time: 661.924 ms
postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              26502 |              33523
   2 |              26173 |              33635
   3 |              26188 |              33772
   4 |              26290 |              33767
   5 |              26325 |              33935
   6 |              26326 |              33951
   7 |              26335 |              33883
   8 |              26571 |              33884
   9 |              26472 |              33837
  10 |              26277 |              33919
(10 rows)

Time: 662.422 ms


Chao's idea:

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              34865 |              42599
   2 |              34547 |              42817
   3 |              34623 |              42948
   4 |              34689 |              42954
   5 |              34733 |              42928
   6 |              34731 |              42845
   7 |              34727 |              42967
   8 |              34763 |              43014
   9 |              34775 |              42987
  10 |              34788 |              43038
(10 rows)

Time: 854.851 ms

Attachments:

  [text/plain] benchmark_results.txt (8.0K, 2-benchmark_results.txt)
  download | inline:
** GCC **

Master:

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              26473 |              40404
   2 |              26218 |              40413
   3 |              26209 |              40387
   4 |              26245 |              40521
   5 |              26343 |              40559
   6 |              26343 |              40580
   7 |              26366 |              40639
   8 |              26735 |              40467
   9 |              26201 |              40334
  10 |              26208 |              40343
(10 rows)

Time: 735.560 ms
postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              26616 |              40392
   2 |              26239 |              40374
   3 |              26191 |              40374
   4 |              26230 |              40370
   5 |              26232 |              40380
   6 |              26210 |              40377
   7 |              26206 |              40395
   8 |              26249 |              40431
   9 |              26209 |              40374
  10 |              26234 |              40432
(10 rows)

Time: 733.927 ms

Patched:

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              25409 |              29705
   2 |              24905 |              29693
   3 |              24870 |              29707
   4 |              24933 |              29869
   5 |              25059 |              29870
   6 |              25026 |              29906
   7 |              25055 |              29897
   8 |              25055 |              29875
   9 |              25031 |              29870
  10 |              25042 |              29816
(10 rows)

Time: 604.086 ms
postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              25074 |              29723
   2 |              24882 |              29721
   3 |              24875 |              29710
   4 |              24976 |              29854
   5 |              25052 |              29865
   6 |              25021 |              29904
   7 |              25025 |              29818
   8 |              25034 |              29813
   9 |              25029 |              29828
  10 |              25042 |              29866
(10 rows)

Time: 603.708 ms

Chao's idea:

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              25677 |              40376
   2 |              25271 |              40430
   3 |              25263 |              40370
   4 |              25301 |              40377
   5 |              25357 |              40552
   6 |              25394 |              40560
   7 |              25412 |              40501
   8 |              25401 |              40558
   9 |              25381 |              40565
  10 |              25370 |              40548
(10 rows)

Time: 725.376 ms


** CLANG **

master:

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              29970 |              42551
   2 |              29479 |              42765
   3 |              29475 |              42805
   4 |              29523 |              42825
   5 |              29578 |              42898
   6 |              29529 |              42997
   7 |              29535 |              42985
   8 |              29593 |              43021
   9 |              29575 |              42976
  10 |              29562 |              42967
(10 rows)

Time: 798.131 ms

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              30919 |              42516
   2 |              29312 |              42629
   3 |              29422 |              42777
   4 |              29307 |              42811
   5 |              29273 |              42852
   6 |              29219 |              42826
   7 |              29860 |              43264
   8 |              29325 |              42814
   9 |              29299 |              42837
  10 |              29351 |              42909
(10 rows)

Time: 796.495 ms

Patched:

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              26741 |              33537
   2 |              26094 |              33590
   3 |              26087 |              33724
   4 |              26276 |              33868
   5 |              26346 |              33943
   6 |              26433 |              33815
   7 |              26324 |              33927
   8 |              26274 |              33893
   9 |              26350 |              33878
  10 |              26304 |              33841
(10 rows)

Time: 661.924 ms
postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              26502 |              33523
   2 |              26173 |              33635
   3 |              26188 |              33772
   4 |              26290 |              33767
   5 |              26325 |              33935
   6 |              26326 |              33951
   7 |              26335 |              33883
   8 |              26571 |              33884
   9 |              26472 |              33837
  10 |              26277 |              33919
(10 rows)

Time: 662.422 ms


Chao's idea:

postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us;
 run | bms_next_member_us | bms_prev_member_us
-----+--------------------+--------------------
   1 |              34865 |              42599
   2 |              34547 |              42817
   3 |              34623 |              42948
   4 |              34689 |              42954
   5 |              34733 |              42928
   6 |              34731 |              42845
   7 |              34727 |              42967
   8 |              34763 |              43014
   9 |              34775 |              42987
  10 |              34788 |              43038
(10 rows)

Time: 854.851 ms

  [application/octet-stream] bms_fixes.patch (5.0K, 3-bms_fixes.patch)
  download | inline diff:
diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index ac3e763de43..46b4ca4b0f1 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1444,6 +1444,7 @@ bms_join(Bitmapset *a, Bitmapset *b)
 int
 bms_next_member(const Bitmapset *a, int prevbit)
 {
+	int64		currbit;
 	int			nwords;
 	bitmapword	mask;
 
@@ -1452,13 +1453,13 @@ bms_next_member(const Bitmapset *a, int prevbit)
 	if (a == NULL)
 		return -2;
 	nwords = a->nwords;
-	prevbit++;
-	mask = (~(bitmapword) 0) << BITNUM(prevbit);
-	for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
+	currbit = (int64) prevbit + 1;
+	mask = (~(bitmapword) 0) << BITNUM(currbit);
+	for (int wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++)
 	{
 		bitmapword	w = a->words[wordnum];
 
-		/* ignore bits before prevbit */
+		/* ignore bits before currbit */
 		w &= mask;
 
 		if (w != 0)
@@ -1505,6 +1506,7 @@ int
 bms_prev_member(const Bitmapset *a, int prevbit)
 {
 	int			ushiftbits;
+	uint32		currbit;
 	bitmapword	mask;
 
 	Assert(bms_is_valid_set(a));
@@ -1517,18 +1519,18 @@ bms_prev_member(const Bitmapset *a, int prevbit)
 		return -2;
 
 	/* Validate callers didn't give us something out of range */
-	Assert(prevbit <= a->nwords * BITS_PER_BITMAPWORD);
+	Assert(prevbit <= (int64) a->nwords * BITS_PER_BITMAPWORD);
 	Assert(prevbit >= -1);
 
 	/* transform -1 to the highest possible bit we could have set */
 	if (prevbit == -1)
-		prevbit = a->nwords * BITS_PER_BITMAPWORD - 1;
+		currbit = (uint32) a->nwords * BITS_PER_BITMAPWORD - 1;
 	else
-		prevbit--;
+		currbit = prevbit - 1;
 
-	ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(prevbit) + 1);
+	ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(currbit) + 1);
 	mask = (~(bitmapword) 0) >> ushiftbits;
-	for (int wordnum = WORDNUM(prevbit); wordnum >= 0; wordnum--)
+	for (int wordnum = WORDNUM(currbit); wordnum >= 0; wordnum--)
 	{
 		bitmapword	w = a->words[wordnum];
 
diff --git a/src/test/modules/test_bitmapset/test_bitmapset--1.0.sql b/src/test/modules/test_bitmapset/test_bitmapset--1.0.sql
index 6668bcef9a9..d6c0f739e13 100644
--- a/src/test/modules/test_bitmapset/test_bitmapset--1.0.sql
+++ b/src/test/modules/test_bitmapset/test_bitmapset--1.0.sql
@@ -3,6 +3,15 @@
 -- complain if script is sourced in psql, rather than via CREATE EXTENSION
 \echo Use "CREATE EXTENSION test_bitmapset" to load this file. \quit
 
+-- Benchmark functions
+CREATE FUNCTION bench_bms_next_member(text, integer)
+RETURNS bigint
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
+CREATE FUNCTION bench_bms_prev_member(text, integer)
+RETURNS bigint
+AS 'MODULE_PATHNAME' LANGUAGE C;
+
 -- Bitmapset API functions
 CREATE FUNCTION test_bms_make_singleton(integer)
 RETURNS text STRICT
diff --git a/src/test/modules/test_bitmapset/test_bitmapset.c b/src/test/modules/test_bitmapset/test_bitmapset.c
index 4fe22ee64d2..990347a0f22 100644
--- a/src/test/modules/test_bitmapset/test_bitmapset.c
+++ b/src/test/modules/test_bitmapset/test_bitmapset.c
@@ -17,6 +17,7 @@
 #include "postgres.h"
 
 #include <stddef.h>
+#include <time.h>
 #include "catalog/pg_type.h"
 #include "common/pg_prng.h"
 #include "utils/array.h"
@@ -27,8 +28,13 @@
 #include "utils/builtins.h"
 #include "utils/timestamp.h"
 
+
 PG_MODULE_MAGIC;
 
+/* Benchmark functions */
+PG_FUNCTION_INFO_V1(bench_bms_next_member);
+PG_FUNCTION_INFO_V1(bench_bms_prev_member);
+
 /* Bitmapset API functions in order of appearance in bitmapset.c */
 PG_FUNCTION_INFO_V1(test_bms_make_singleton);
 PG_FUNCTION_INFO_V1(test_bms_add_member);
@@ -105,6 +111,73 @@ PG_FUNCTION_INFO_V1(test_random_shift_operations);
 #define PG_RETURN_BITMAPSET_AS_TEXT(bms) \
 	PG_RETURN_TEXT_P(BITMAPSET_TO_TEXT(bms))
 
+#define NANOSEC_PER_SEC 1000000000
+
+int64 get_clock_diff(struct timespec* t1, struct timespec* t2);
+
+int64
+get_clock_diff(struct timespec* t1, struct timespec* t2)
+{
+	int64_t nanosec = (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC;
+	nanosec += (t1->tv_nsec - t2->tv_nsec);
+
+	return nanosec;
+}
+
+Datum
+bench_bms_next_member(PG_FUNCTION_ARGS)
+{
+	Bitmapset* bms = PG_ARG_GETBITMAPSET(0);
+	int32	loops = PG_GETARG_INT32(1);
+
+	struct timespec start, end;
+	int64 nanoseconds;
+	int64 counter = 0;
+
+	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
+
+	for (int i = 0; i < loops; i++)
+	{
+		int m = -1;
+		while ((m = bms_next_member(bms, m)) >= 0)
+		{
+			counter++;
+		}
+	}
+
+	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
+	nanoseconds = get_clock_diff(&end, &start);
+
+	PG_RETURN_INT64(nanoseconds);
+}
+
+Datum
+bench_bms_prev_member(PG_FUNCTION_ARGS)
+{
+	Bitmapset* bms = PG_ARG_GETBITMAPSET(0);
+	int32	loops = PG_GETARG_INT32(1);
+
+	struct timespec start, end;
+	int64 nanoseconds;
+	int64 counter = 0;
+
+	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
+
+	for (int i = 0; i < loops; i++)
+	{
+		int m = -1;
+		while ((m = bms_prev_member(bms, m)) >= 0)
+		{
+			counter++;
+		}
+	}
+
+	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end);
+	nanoseconds = get_clock_diff(&end, &start);
+
+	PG_RETURN_INT64(nanoseconds);
+}
+
 /*
  * Individual test functions for each bitmapset API function
  *


^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-03 03:24  Chao Li <[email protected]>
  parent: David Rowley <[email protected]>
  1 sibling, 1 reply; 17+ messages in thread

From: Chao Li @ 2026-04-03 03:24 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]>

> On Apr 3, 2026, at 10:08, David Rowley <[email protected]> wrote:
> 
> On Fri, 3 Apr 2026 at 11:12, David Rowley <[email protected]> wrote:
>> 
>> On Thu, 2 Apr 2026 at 17:22, Tom Lane <[email protected]> wrote:
>>> I don't think we should add cycles here for this purpose.
>> 
>> I'm not keen on slowing things down for this either. I did do some
>> experiments in [1] that sees fewer instructions from using 64-bit
>> maths. I might go off and see if there are any wins there that also
>> give us the INT_MAX fix. It's not great effort to reward ratio
>> though...
> 
> The reduction in instructions with the patched version got me curious
> to see if it would translate into a performance increase.  I tested on
> an AMD Zen2 machine, and it's a decent amount faster than master. I
> tested with gcc and clang.
> 
> I also scanned over the remaining parts of bitmapset.c and didn't find
> anywhere else that has overflow risk aside from what you pointed out
> in bms_prev_member().
> 
> The attached patch contains the benchmark function I added to the
> test_bitmapset module. It should apply to master with a bit of noise.
> 
> CREATE EXTENSION test_bitmapset;
> SELECT
>    generate_series(1,3) AS run,
>    bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS
> bms_next_member_us,
>    bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS
> bms_prev_member_us;
> 
> master (gcc)
> 
> run | bms_next_member_us | bms_prev_member_us
> -----+--------------------+--------------------
>   1 |              26473 |              40404
>   2 |              26218 |              40413
>   3 |              26209 |              40387
> 
> patched (gcc)
> 
> run | bms_next_member_us | bms_prev_member_us
> -----+--------------------+--------------------
>   1 |              25409 |              29705
>   2 |              24905 |              29693
>   3 |              24870 |              29707
> 
> Times are in microseconds to do 1 million bms_*_member() loops over
> the entire set.
> 
> I've also attached the full results I got. I've also included the
> results from Chao's version, which does slow things down decently on
> clang.
> 
> IMO, if we can make bitmapset.c work with INT_MAX members and get a
> performance increase, then we should do it.
> 
> David
> 
>> [1] https://godbolt.org/z/Eh1vzssq7
> <benchmark_results.txt><bms_fixes.patch>

I also did a load test with a standalone c program with 4 versions:

* The original bms_next_member (Original)
* The fast version from [1], that uses 64bit maths (Fast)
* The original version + INT32_MAX check + 64bit maths (Original2)
* I tried the other approach that pulls up the first iteration, so that removes "mask = (~(bitmapword) 0);” from the loop. (PullUp)

Note: all tests used -O2 to build the executable.

On my MacBook M4, the Fast version constantly won, and PullUp version performed badly.
```
% gcc --version
Apple clang version 17.0.0 (clang-1700.6.4.2)
Target: arm64-apple-darwin25.3.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
```

A typical test run:
```
Benchmarking 100000 iterations...

Original:  0.48893 seconds
Fast:      0.46979 seconds
Original2:      0.47740 seconds
PullUp: 0.48029 seconds
``` 

On my Windows laptop, Intel(R) Core Ultra 5, with WSL based Ubuntu, Orignal2 won in the most runs, and the PullUp version was faster than Fast version.
```
chaol@lichao-highgo:~$ gcc --version                                                                            
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0                                                                      
Copyright (C) 2023 Free Software Foundation, Inc.                                                              
This is free software; see the source for copying conditions.  There is NO                                      
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

A typical test run:
```
Original:  0.99849 seconds                                                                                      
Fast:      0.74722 seconds                                                                                      
Original2:      0.59407 seconds                                                                                
PullUp: 0.62746 seconds     
```

Then I also tried to run on Windows directly. Here, PullUp version performed the best.
```
$ gcc --version
gcc.exe (Rev13, Built by MSYS2 project) 15.2.0
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

A typical test run:
```
Original:  0.32931 seconds
Fast:      0.32740 seconds
Original2:      0.32378 seconds
PullUp: 0.30795 seconds
```

I’m curious that, when something performs differently across platforms, which platform should take priority?

Please see the attached test program. It’s possible I did something wrong.

[1] https://godbolt.org/z/Eh1vzssq7

--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Attachments:

  [application/octet-stream] test_bms_next.c (5.4K, 2-test_bms_next.c)
  download

^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-03 11:42  David Rowley <[email protected]>
  parent: Chao Li <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: David Rowley @ 2026-04-03 11:42 UTC (permalink / raw)
  To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]>

On Fri, 3 Apr 2026 at 16:24, Chao Li <[email protected]> wrote:
> I also did a load test with a standalone c program with 4 versions:

I think you'd be better off picking a more realistic Bitmapset. The
code is doing:

    int words_to_alloc = 20000; // Large set to bypass CPU cache slightly
    Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc *
sizeof(bitmapword));
    bms->nwords = words_to_alloc;
    memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));

    /* Set a bit far into the set to force a long scan */
    int target_bit = (words_to_alloc - 1) * 64 + 10;
    bms->words[words_to_alloc - 1] |= (1ULL << 10);

A set with 20k words! Then you're doing:

    for (int i = 0; i < iterations; i++) sink = bms_next_member(bms, 0);

So you're not looping correctly over the set, as looping over a set
with 1 member requires 2 calls to the function.

I'd say this unrealistically favours the unrolled loop version, as
most of the effort is in the loop looping over empty words and you can
do that without the extra bit-wise AND that the other versions have.
That version also seems to apply both the 64-bit fix and the INT_MAX
fix. Why both?  Because you're only calling the function once per
iteration, it also means your || INT_MAX is only executed once, rather
than n_members + 1 as it would normally be executed. That'll make your
version seem better than it is.

I fixed all that and changed the Bitmapset to something more realistic
with how it's more likely to be seen in PostgreSQL and ran the
benchmark with a proper bms_next_member loop. File attached and
results below:

Zen2 Linux

$ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
Benchmarking 100000000 iterations...

Original:  0.62113 seconds
David's:          0.70257 seconds
Chao's version:   1.70691 seconds
Unrolled loop: 1.56239 seconds
2800000000

$ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
Benchmarking 100000000 iterations...

Original:  1.11263 seconds
David's:          0.87399 seconds
Chao's version:   0.52962 seconds
Unrolled loop: 1.07799 seconds
2800000000

Apple M2:

Benchmarking 100000000 iterations...

Original:  0.64023 seconds
David's:          0.41087 seconds
Chao's version:   1.21113 seconds
Unrolled loop: 0.55924 seconds
2800000000

> I’m curious that, when something performs differently across platforms, which platform should take priority?

I don't think it matters too much for this case. If we were actually
working on a function that was a bottleneck, then it would be worth
looking deeper and seeing if the compilers are producing suboptimal
code and then try to coax them into making better code. As a last
resort, have two versions and some preprocessor to decide which one.
However, this just isn't performance-critical enough to warrant such
an effort. I think we're already going far too far with this. I just
want to fix the bug. If performance increases in some way as a result
of that, then good.

If I sum up the times from each version of the above results, I get:

Original: 2.37399
David's: 1.98743
Chao's: 3.44766
Unrolled: 3.19962

I'm happy to go with the fastest one. I expect the one with the fewest
instructions is likely more important when running it in the planner,
as that's less to decode. I didn't look, but I expect your loop
unrolled version and your || INT_MAX version to have more
instructions.

David

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <limits.h>

//#define NULL ((void *) 0)
typedef uint64_t uint64;
typedef int64_t int64;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword;	  /* must be an unsigned type */
typedef int64 signedbitmapword; /* must be the matching signed type */

#define WORDNUM(x)  ((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)   ((x) % BITS_PER_BITMAPWORD)

typedef struct Bitmapset
{
	int		 nwords;		/* number of words in array */
	bitmapword  words[];	/* really [nwords] */
} Bitmapset;

static inline int
bmw_rightmost_one_pos(uint64 word)
{
	return __builtin_ctzll(word);
}

// 1. Original version
int
bms_next_member(const Bitmapset *a, int prevbit)
{
	int		 nwords;
	bitmapword  mask;

	if (a == NULL)
		return -2;

	nwords = a->nwords;
	prevbit++;
	mask = (~(bitmapword) 0) << BITNUM(prevbit);
	for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
	{
		bitmapword  w = a->words[wordnum];

		/* ignore bits before prevbit */
		w &= mask;

		if (w != 0)
		{
			int		 result;

			result = wordnum * BITS_PER_BITMAPWORD;
			result += bmw_rightmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}

// 2. David's version
int
bms_next_member_david(const Bitmapset *a, int prevbit)
{
	int64		currbit;
	size_t		nwords;
	bitmapword	mask;

	if (a == NULL)
		return-2;
	nwords = a->nwords;
	currbit = (int64) prevbit + 1;
	mask = (~(bitmapword) 0) << BITNUM(currbit);
	for (size_t wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++)
	{
		bitmapword  w = a->words[wordnum];

		/* ignore bits before currbit */
		w &= mask;

		if (w != 0)
		{
			int		 result;

			result = (int) wordnum * BITS_PER_BITMAPWORD;
			result += bmw_rightmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}

// 3. Original version + INT32_MAX check + 64bit
int
bms_next_member_chao(const Bitmapset *a, int prevbit)
{
	size_t		 nwords;
	bitmapword  mask;

	if (a == NULL || prevbit == INT32_MAX)
		return -2;
	nwords = (size_t) a->nwords;
	prevbit++;
	mask = (~(bitmapword) 0) << BITNUM(prevbit);
	for (size_t wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
	{
		bitmapword  w = a->words[wordnum];

		/* ignore bits before prevbit */
		w &= mask;

		if (w != 0)
		{
			int		 result;

			result = (int)wordnum * BITS_PER_BITMAPWORD;
			result += bmw_rightmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}

// 4. Pull up first iteration
int bms_next_member_unroll_first(const Bitmapset *a, int prevbit)
{
	uint64		currbit;
	int			wordnum;
	int			nwords;

	if (a == NULL)
		return -2;

	currbit = (int64) prevbit + 1;
	wordnum = WORDNUM(currbit);
	nwords = a->nwords;

	if (wordnum >= nwords)
		return -2;

	/* Handle first word with mask */
	const bitmapword *p = &a->words[wordnum];
	bitmapword  w = (*p) & ((~(bitmapword) 0) << BITNUM(currbit));

	if (w != 0)
		return (wordnum * BITS_PER_BITMAPWORD) + bmw_rightmost_one_pos(w);

	/* The "Tight" Pointer Scan */
	const bitmapword *end = &a->words[nwords];
	for (p++; p < end; p++)
	{
		if (*p != 0)
		{
			wordnum = p - a->words; // Pointer arithmetic to get index
			return (wordnum * BITS_PER_BITMAPWORD) + bmw_rightmost_one_pos(*p);
		}
	}

	return -2;
}

double get_time() {
	struct timespec ts;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main() {
	int words_to_alloc = 1; // Large set to bypass CPU cache slightly
	Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc * sizeof(bitmapword));
	bms->nwords = words_to_alloc;
	memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));
	double end;
	int64 count = 0;

	/* Set a bit far into the set to force a long scan */
	bms->words[words_to_alloc - 1] |= 0xaf4;

	int iterations = 100000000;


	printf("Benchmarking %d iterations...\n\n", iterations);

	// Test Original
	double start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_next_member(bms, j)) >= 0)
			count++;
	}
	end = get_time();
	printf("Original:  %.5f seconds\n", end - start);

	// Test Fast
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_next_member_david(bms, j)) >= 0)
			count++;
	}

	end = get_time();
	printf("David's:	  %.5f seconds\n", end - start);

	// Test Chao's version
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_next_member_chao(bms, j)) >= 0)
			count++;
	}

	end = get_time();
	printf("Chao's version:	  %.5f seconds\n", end - start);

	// Pull up first iteration
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_next_member_unroll_first(bms, j)) >= 0)
			count++;
	}
	end = get_time();
	printf("Unrolled loop: %.5f seconds\n", end - start);

	printf("%ld\n", count);

	free(bms);
	return 0;
}

Attachments:

  [text/plain] test_bms_next2.c (5.2K, 2-test_bms_next2.c)
  download | inline:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <limits.h>

//#define NULL ((void *) 0)
typedef uint64_t uint64;
typedef int64_t int64;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword;	  /* must be an unsigned type */
typedef int64 signedbitmapword; /* must be the matching signed type */

#define WORDNUM(x)  ((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)   ((x) % BITS_PER_BITMAPWORD)

typedef struct Bitmapset
{
	int		 nwords;		/* number of words in array */
	bitmapword  words[];	/* really [nwords] */
} Bitmapset;

static inline int
bmw_rightmost_one_pos(uint64 word)
{
	return __builtin_ctzll(word);
}

// 1. Original version
int
bms_next_member(const Bitmapset *a, int prevbit)
{
	int		 nwords;
	bitmapword  mask;

	if (a == NULL)
		return -2;

	nwords = a->nwords;
	prevbit++;
	mask = (~(bitmapword) 0) << BITNUM(prevbit);
	for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
	{
		bitmapword  w = a->words[wordnum];

		/* ignore bits before prevbit */
		w &= mask;

		if (w != 0)
		{
			int		 result;

			result = wordnum * BITS_PER_BITMAPWORD;
			result += bmw_rightmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}

// 2. David's version
int
bms_next_member_david(const Bitmapset *a, int prevbit)
{
	int64		currbit;
	size_t		nwords;
	bitmapword	mask;

	if (a == NULL)
		return-2;
	nwords = a->nwords;
	currbit = (int64) prevbit + 1;
	mask = (~(bitmapword) 0) << BITNUM(currbit);
	for (size_t wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++)
	{
		bitmapword  w = a->words[wordnum];

		/* ignore bits before currbit */
		w &= mask;

		if (w != 0)
		{
			int		 result;

			result = (int) wordnum * BITS_PER_BITMAPWORD;
			result += bmw_rightmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}

// 3. Original version + INT32_MAX check + 64bit
int
bms_next_member_chao(const Bitmapset *a, int prevbit)
{
	size_t		 nwords;
	bitmapword  mask;

	if (a == NULL || prevbit == INT32_MAX)
		return -2;
	nwords = (size_t) a->nwords;
	prevbit++;
	mask = (~(bitmapword) 0) << BITNUM(prevbit);
	for (size_t wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
	{
		bitmapword  w = a->words[wordnum];

		/* ignore bits before prevbit */
		w &= mask;

		if (w != 0)
		{
			int		 result;

			result = (int)wordnum * BITS_PER_BITMAPWORD;
			result += bmw_rightmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}

// 4. Pull up first iteration
int bms_next_member_unroll_first(const Bitmapset *a, int prevbit)
{
	uint64		currbit;
	int			wordnum;
	int			nwords;

	if (a == NULL)
		return -2;

	currbit = (int64) prevbit + 1;
	wordnum = WORDNUM(currbit);
	nwords = a->nwords;

	if (wordnum >= nwords)
		return -2;

	/* Handle first word with mask */
	const bitmapword *p = &a->words[wordnum];
	bitmapword  w = (*p) & ((~(bitmapword) 0) << BITNUM(currbit));

	if (w != 0)
		return (wordnum * BITS_PER_BITMAPWORD) + bmw_rightmost_one_pos(w);

	/* The "Tight" Pointer Scan */
	const bitmapword *end = &a->words[nwords];
	for (p++; p < end; p++)
	{
		if (*p != 0)
		{
			wordnum = p - a->words; // Pointer arithmetic to get index
			return (wordnum * BITS_PER_BITMAPWORD) + bmw_rightmost_one_pos(*p);
		}
	}

	return -2;
}

double get_time() {
	struct timespec ts;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main() {
	int words_to_alloc = 1; // Large set to bypass CPU cache slightly
	Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc * sizeof(bitmapword));
	bms->nwords = words_to_alloc;
	memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));
	double end;
	int64 count = 0;

	/* Set a bit far into the set to force a long scan */
	bms->words[words_to_alloc - 1] |= 0xaf4;

	int iterations = 100000000;


	printf("Benchmarking %d iterations...\n\n", iterations);

	// Test Original
	double start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_next_member(bms, j)) >= 0)
			count++;
	}
	end = get_time();
	printf("Original:  %.5f seconds\n", end - start);

	// Test Fast
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_next_member_david(bms, j)) >= 0)
			count++;
	}

	end = get_time();
	printf("David's:	  %.5f seconds\n", end - start);

	// Test Chao's version
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_next_member_chao(bms, j)) >= 0)
			count++;
	}

	end = get_time();
	printf("Chao's version:	  %.5f seconds\n", end - start);

	// Pull up first iteration
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_next_member_unroll_first(bms, j)) >= 0)
			count++;
	}
	end = get_time();
	printf("Unrolled loop: %.5f seconds\n", end - start);

	printf("%ld\n", count);

	free(bms);
	return 0;
}

^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-03 13:28  Chao Li <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: Chao Li @ 2026-04-03 13:28 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]>



> On Apr 3, 2026, at 19:42, David Rowley <[email protected]> wrote:
> 
> On Fri, 3 Apr 2026 at 16:24, Chao Li <[email protected]> wrote:
>> I also did a load test with a standalone c program with 4 versions:
> 
> I think you'd be better off picking a more realistic Bitmapset. The
> code is doing:
> 
>    int words_to_alloc = 20000; // Large set to bypass CPU cache slightly
>    Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc *
> sizeof(bitmapword));
>    bms->nwords = words_to_alloc;
>    memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));
> 
>    /* Set a bit far into the set to force a long scan */
>    int target_bit = (words_to_alloc - 1) * 64 + 10;
>    bms->words[words_to_alloc - 1] |= (1ULL << 10);
> 
> A set with 20k words! Then you're doing:
> 
>    for (int i = 0; i < iterations; i++) sink = bms_next_member(bms, 0);
> 
> So you're not looping correctly over the set, as looping over a set
> with 1 member requires 2 calls to the function.
> 
> I'd say this unrealistically favours the unrolled loop version, as
> most of the effort is in the loop looping over empty words and you can
> do that without the extra bit-wise AND that the other versions have.
> That version also seems to apply both the 64-bit fix and the INT_MAX
> fix. Why both?  Because you're only calling the function once per
> iteration, it also means your || INT_MAX is only executed once, rather
> than n_members + 1 as it would normally be executed. That'll make your
> version seem better than it is.
> 
> I fixed all that and changed the Bitmapset to something more realistic
> with how it's more likely to be seen in PostgreSQL and ran the
> benchmark with a proper bms_next_member loop. File attached and
> results below:
> 
> Zen2 Linux
> 
> $ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
> 
> Original:  0.62113 seconds
> David's:          0.70257 seconds
> Chao's version:   1.70691 seconds
> Unrolled loop: 1.56239 seconds
> 2800000000
> 
> $ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
> 
> Original:  1.11263 seconds
> David's:          0.87399 seconds
> Chao's version:   0.52962 seconds
> Unrolled loop: 1.07799 seconds
> 2800000000
> 
> Apple M2:
> 
> Benchmarking 100000000 iterations...
> 
> Original:  0.64023 seconds
> David's:          0.41087 seconds
> Chao's version:   1.21113 seconds
> Unrolled loop: 0.55924 seconds
> 2800000000
> 
>> I’m curious that, when something performs differently across platforms, which platform should take priority?
> 
> I don't think it matters too much for this case. If we were actually
> working on a function that was a bottleneck, then it would be worth
> looking deeper and seeing if the compilers are producing suboptimal
> code and then try to coax them into making better code. As a last
> resort, have two versions and some preprocessor to decide which one.
> However, this just isn't performance-critical enough to warrant such
> an effort. I think we're already going far too far with this. I just
> want to fix the bug. If performance increases in some way as a result
> of that, then good.
> 
> If I sum up the times from each version of the above results, I get:
> 
> Original: 2.37399
> David's: 1.98743
> Chao's: 3.44766
> Unrolled: 3.19962
> 
> I'm happy to go with the fastest one. I expect the one with the fewest
> instructions is likely more important when running it in the planner,
> as that's less to decode. I didn't look, but I expect your loop
> unrolled version and your || INT_MAX version to have more
> instructions.
> 
> David
> <test_bms_next2.c>

In test_bms_next2.c, you set words_to_alloc = 1, which disables the optimization in the unrolled version. If I change words_to_alloc = 2, then on my MacBook M4, the unrolled version seems to win:
```
chaol@ChaodeMacBook-Air test % ./test_bms_next2
Benchmarking 100000000 iterations...

Original:  0.61559 seconds
David's:	  0.61651 seconds
Chao's version:	  1.06033 seconds
Unrolled loop: 0.60561 seconds
2800000000
```

(I also checked on Godbolt, from the instruction-count perspective, the unrolled version actually looks the worst.)

I guess one-word Bitmapset are probably the most common case in practice. So overall, I agree that your version is the best choice. Please go ahead with it as we have already gained both a bug fix and a performance improvement. If we really want to study the performance more deeply, I think that would be a separate topic.

It was fun working on this patch, and thank you for your patience.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/









^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-04 03:30  David Rowley <[email protected]>
  parent: Chao Li <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: David Rowley @ 2026-04-04 03:30 UTC (permalink / raw)
  To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]>

On Sat, 4 Apr 2026 at 02:28, Chao Li <[email protected]> wrote:
> In test_bms_next2.c, you set words_to_alloc = 1, which disables the optimization in the unrolled version. If I change words_to_alloc = 2, then on my MacBook M4, the unrolled version seems to win:
> ```
> chaol@ChaodeMacBook-Air test % ./test_bms_next2
> Benchmarking 100000000 iterations...
>
> Original:  0.61559 seconds
> David's:          0.61651 seconds
> Chao's version:   1.06033 seconds
> Unrolled loop: 0.60561 seconds
> 2800000000
> ```

Doing the same locally, here's what I get on my Zen2 machine:

drowley@amd3990x:~$ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
Benchmarking 100000000 iterations...

Original:  1.21125 seconds
David's:          1.11997 seconds
Chao's version:   1.72662 seconds
Unrolled loop: 1.63090 seconds
2800000000
drowley@amd3990x:~$ clang test_bms_next.c -O2 -o test_bms_next &&
./test_bms_next
Benchmarking 100000000 iterations...

Original:  1.10780 seconds
David's:          1.05968 seconds
Chao's version:   1.87123 seconds
Unrolled loop: 1.11002 seconds
2800000000

> I guess one-word Bitmapset are probably the most common case in practice. So overall, I agree that your version is the best choice. Please go ahead with it as we have already gained both a bug fix and a performance improvement. If we really want to study the performance more deeply, I think that would be a separate topic.

Yeah, but it's better to find bottlenecks and focus there than to
optimise blindly without knowing if it'll make any actual difference.
It's still important to test for regressions when modifying code and
try to avoid them, as it's sometimes not obvious if the code being
modified is critical for performance.

I also don't think we should be doing anything to optimise for
multi-word sets at the expense of slowing down single-word sets. Not
that it's very representative of the real world, but I added some
telemetry to bitmapset.c and I see 43 multi-word sets being created
and 1.45 million single-word sets created during make check, or 1 in
33785, if you prefer.

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 786f343b3c9..21b5053e9a2 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -219,6 +219,10 @@ bms_make_singleton(int x)
        int                     wordnum,
                                bitnum;

+       if (x >= 64)
+               elog(NOTICE, "multi-word set bms_make_singleton");
+       else
+               elog(NOTICE, "single-word set bms_make_singleton");
        if (x < 0)
                elog(ERROR, "negative bitmapset member not allowed");
        wordnum = WORDNUM(x);
@@ -811,6 +815,9 @@ bms_add_member(Bitmapset *a, int x)
        wordnum = WORDNUM(x);
        bitnum = BITNUM(x);

+       if (x >= 64 && a->nwords == 1)
+               elog(NOTICE, "multi-set bms_add_member");
+
        /* enlarge the set if necessary */
        if (wordnum >= a->nwords)
        {

Not quite perfect as a set made first as a single word set that later
becomes a multi-word set will be double counted. The number of
operations on the sets is likely more important anyway, not the number
of sets being created. The point is, multi-word sets are rare for most
workloads.

David





^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-06 02:00  Chao Li <[email protected]>
  parent: David Rowley <[email protected]>
  0 siblings, 1 reply; 17+ messages in thread

From: Chao Li @ 2026-04-06 02:00 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]>



> On Apr 4, 2026, at 11:30, David Rowley <[email protected]> wrote:
> 
> On Sat, 4 Apr 2026 at 02:28, Chao Li <[email protected]> wrote:
>> In test_bms_next2.c, you set words_to_alloc = 1, which disables the optimization in the unrolled version. If I change words_to_alloc = 2, then on my MacBook M4, the unrolled version seems to win:
>> ```
>> chaol@ChaodeMacBook-Air test % ./test_bms_next2
>> Benchmarking 100000000 iterations...
>> 
>> Original:  0.61559 seconds
>> David's:          0.61651 seconds
>> Chao's version:   1.06033 seconds
>> Unrolled loop: 0.60561 seconds
>> 2800000000
>> ```
> 
> Doing the same locally, here's what I get on my Zen2 machine:
> 
> drowley@amd3990x:~$ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next
> Benchmarking 100000000 iterations...
> 
> Original:  1.21125 seconds
> David's:          1.11997 seconds
> Chao's version:   1.72662 seconds
> Unrolled loop: 1.63090 seconds
> 2800000000
> drowley@amd3990x:~$ clang test_bms_next.c -O2 -o test_bms_next &&
> ./test_bms_next
> Benchmarking 100000000 iterations...
> 
> Original:  1.10780 seconds
> David's:          1.05968 seconds
> Chao's version:   1.87123 seconds
> Unrolled loop: 1.11002 seconds
> 2800000000

I just tried my Windows laptop with both MSYS2 and WSL Ubuntu, and on both of them the original version and David’s version consistently performed better than the other versions.

I have a new finding from Godbolt: adding likely(w != 0) seems to reduce the instruction count by one. bms_next_member_fast has 26 instructions, while after adding "likely", bms_next_member_fast2 has 25. It seems that this may avoid one jump and perhaps be slightly better, but I’m not sure. I’m really not an expert in assembly.

Then I tested. “likely” doesn’t seem to help on Mac:
```
chaol@ChaodeMacBook-Air test % ./test_bms_next2
Benchmarking 100000000 iterations...

Original:  0.54994 seconds
David's:	  0.78218 seconds
David's likely:	  0.78990 seconds
Chao's version:	  1.11530 seconds
Unrolled loop: 0.47660 seconds
3500000000
```

But it helps slightly on Windows:
```
$ ./test_bms_next2
Benchmarking 100000000 iterations...

Original:  1.45312 seconds
David's:          1.48438 seconds
David's likely:   1.40625 seconds
Chao's version:   3.12500 seconds
Unrolled loop: 2.95312 seconds
-794967296
```

I’m not suggesting adding likely() to your version, I’m just sharing the information for your reference and leaving the decision to you.

> 
>> I guess one-word Bitmapset are probably the most common case in practice. So overall, I agree that your version is the best choice. Please go ahead with it as we have already gained both a bug fix and a performance improvement. If we really want to study the performance more deeply, I think that would be a separate topic.
> 
> Yeah, but it's better to find bottlenecks and focus there than to
> optimise blindly without knowing if it'll make any actual difference.
> It's still important to test for regressions when modifying code and
> try to avoid them, as it's sometimes not obvious if the code being
> modified is critical for performance.
> 
> I also don't think we should be doing anything to optimise for
> multi-word sets at the expense of slowing down single-word sets.

Fully agreed.

> Not
> that it's very representative of the real world, but I added some
> telemetry to bitmapset.c and I see 43 multi-word sets being created
> and 1.45 million single-word sets created during make check, or 1 in
> 33785, if you prefer.
> 
> diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
> index 786f343b3c9..21b5053e9a2 100644
> --- a/src/backend/nodes/bitmapset.c
> +++ b/src/backend/nodes/bitmapset.c
> @@ -219,6 +219,10 @@ bms_make_singleton(int x)
>        int                     wordnum,
>                                bitnum;
> 
> +       if (x >= 64)
> +               elog(NOTICE, "multi-word set bms_make_singleton");
> +       else
> +               elog(NOTICE, "single-word set bms_make_singleton");
>        if (x < 0)
>                elog(ERROR, "negative bitmapset member not allowed");
>        wordnum = WORDNUM(x);
> @@ -811,6 +815,9 @@ bms_add_member(Bitmapset *a, int x)
>        wordnum = WORDNUM(x);
>        bitnum = BITNUM(x);
> 
> +       if (x >= 64 && a->nwords == 1)
> +               elog(NOTICE, "multi-set bms_add_member");
> +
>        /* enlarge the set if necessary */
>        if (wordnum >= a->nwords)
>        {
> 
> Not quite perfect as a set made first as a single word set that later
> becomes a multi-word set will be double counted. The number of
> operations on the sets is likely more important anyway, not the number
> of sets being created. The point is, multi-word sets are rare for most
> workloads.
> 

What tests did you run after adding the logs to collect the data? This is a method I might borrow in the future for similar investigations.

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/









^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-12 12:33  David Rowley <[email protected]>
  parent: Chao Li <[email protected]>
  0 siblings, 0 replies; 17+ messages in thread

From: David Rowley @ 2026-04-12 12:33 UTC (permalink / raw)
  To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]>

On Mon, 6 Apr 2026 at 14:01, Chao Li <[email protected]> wrote:
> > On Apr 4, 2026, at 11:30, David Rowley <[email protected]> wrote:
> > Not quite perfect as a set made first as a single word set that later
> > becomes a multi-word set will be double counted. The number of
> > operations on the sets is likely more important anyway, not the number
> > of sets being created. The point is, multi-word sets are rare for most
> > workloads.
> >
>
> What tests did you run after adding the logs to collect the data? This is a method I might borrow in the future for similar investigations.

"make check" then grep regression.diffs for the NOTICE message and
pipe to "wc -l"





^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-12 13:17  David Rowley <[email protected]>
  parent: David Rowley <[email protected]>
  1 sibling, 2 replies; 17+ messages in thread

From: David Rowley @ 2026-04-12 13:17 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: PostgreSQL Developers <[email protected]>

On Fri, 3 Apr 2026 at 15:08, David Rowley <[email protected]> wrote:
> IMO, if we can make bitmapset.c work with INT_MAX members and get a
> performance increase, then we should do it.

Re-thinking this after a week's holiday, it seems fine to use an
unsigned 32-bit int rather than a 64-bit int to fix this bug. I'd
previously been uncertain if there were any guarantees in C to what
(unsigned int) -1 would return, but going by [1] at 6.3.1.3, it says:

"Otherwise, if the new type is unsigned, the value is converted by
repeatedly adding or subtracting one more than the maximum value that
can be represented in the new type until the value is in the range of
the new type."

So, it seems even on one's complement that -1 as an unsigned int will
be UINT_MAX. When we add 1 to UINT_MAX, we're guaranteed to get 0, as
it's unsigned maths and overflows are going to result in a value
modulus the max value for the type.

That leads me to the attached v2 patch. Compiler Explorer link showing
the assembly at [2].

Testing the performance, it's better than master, so I got rid of the
size_t wordnum stuff. We're post-freeze now, so fiddling with other
optimisations seems a bit off the table as there appears to be no
performance regression to compensate for now, per:

drowley@amd3990x:~$ gcc test_bms_next3.c -O2 -o test_bms_next3 &&
./test_bms_next3
Benchmarking 100000000 bms_next_member iterations...
master: 1.18330 seconds
Patched: 1.05493 seconds

Benchmarking 100000000 bms_prev_member iterations...
master: 2.94522 seconds
Patched: 1.86130 seconds

drowley@amd3990x:~$ clang test_bms_next3.c -O2 -o test_bms_next3 &&
./test_bms_next3
Benchmarking 100000000 bms_next_member iterations...
master: 1.07860 seconds
Patched: 1.07896 seconds

Benchmarking 100000000 bms_prev_member iterations...
master: 2.76550 seconds
Patched: 2.12369 seconds

David

[1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf
[2] https://godbolt.org/z/xW96rxd3P

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <limits.h>

//#define NULL ((void *) 0)
typedef uint32_t uint32;
typedef int32_t int32;
typedef uint64_t uint64;
typedef int64_t int64;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword;	  /* must be an unsigned type */
typedef int64 signedbitmapword; /* must be the matching signed type */

#define WORDNUM(x)  ((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)   ((x) % BITS_PER_BITMAPWORD)

#ifdef __GNUC__
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define likely(x)	((x) != 0)
#define unlikely(x) ((x) != 0)
#endif

typedef struct Bitmapset
{
	int		 nwords;		/* number of words in array */
	bitmapword  words[];	/* really [nwords] */
} Bitmapset;

static inline int
bmw_rightmost_one_pos(uint64 word)
{
	return __builtin_ctzll(word);
}

static inline int
bmw_leftmost_one_pos(uint64 word)
{
	return 63 - __builtin_clzll(word);
}

int
bms_next_member(const Bitmapset *a, int prevbit)
{
	int		 nwords;
	bitmapword  mask;

	if (a == NULL)
		return -2;

	nwords = a->nwords;
	prevbit++;
	mask = (~(bitmapword) 0) << BITNUM(prevbit);
	for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
	{
		bitmapword  w = a->words[wordnum];

		/* ignore bits before prevbit */
		w &= mask;

		if (w != 0)
		{
			int		 result;

			result = wordnum * BITS_PER_BITMAPWORD;
			result += bmw_rightmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}

int
bms_next_member_patched(const Bitmapset *a, int prevbit)
{
	unsigned int currbit = prevbit;
	int			nwords;
	bitmapword	mask;

	if (a == NULL)
		return -2;
	nwords = a->nwords;

	/* use an unsigned int to avoid the risk that int overflows */
	currbit++;
	mask = (~(bitmapword) 0) << BITNUM(currbit);
	for (int wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++)
	{
		bitmapword	w = a->words[wordnum];

		/* ignore bits before currbit */
		w &= mask;

		if (w != 0)
		{
			int			result;

			result = wordnum * BITS_PER_BITMAPWORD;
			result += bmw_rightmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}

int
bms_prev_member(const Bitmapset *a, int prevbit)
{
	int			ushiftbits;
	bitmapword	mask;

	/*
	 * If set is NULL or if there are no more bits to the right then we've
	 * nothing to do.
	 */
	if (a == NULL || prevbit == 0)
		return -2;

	/* transform -1 to the highest possible bit we could have set */
	if (prevbit == -1)
		prevbit = a->nwords * BITS_PER_BITMAPWORD - 1;
	else
		prevbit--;

	ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(prevbit) + 1);
	mask = (~(bitmapword) 0) >> ushiftbits;
	for (int wordnum = WORDNUM(prevbit); wordnum >= 0; wordnum--)
	{
		bitmapword	w = a->words[wordnum];

		/* mask out bits left of prevbit */
		w &= mask;

		if (w != 0)
		{
			int			result;

			result = wordnum * BITS_PER_BITMAPWORD;
			result += bmw_leftmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}

int
bms_prev_member_patched(const Bitmapset *a, int prevbit)
{
	unsigned int currbit;
	int			ushiftbits;
	bitmapword	mask;


	/*
	 * If set is NULL or if there are no more bits to the right then we've
	 * nothing to do.
	 */
	if (a == NULL || prevbit == 0)
		return -2;

	/*
	 * Transform -1 to the highest possible bit we could have set.  We do this
	 * in unsigned math to avoid the risk of overflowing a signed int.
	 */
	if (prevbit < 0)
		currbit = (unsigned int) a->nwords * BITS_PER_BITMAPWORD - 1;
	else
		currbit = prevbit - 1;

	ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(currbit) + 1);
	mask = (~(bitmapword) 0) >> ushiftbits;
	for (int wordnum = WORDNUM(currbit); wordnum >= 0; wordnum--)
	{
		bitmapword	w = a->words[wordnum];

		/* mask out bits left of currbit */
		w &= mask;

		if (w != 0)
		{
			int			result;

			result = wordnum * BITS_PER_BITMAPWORD;
			result += bmw_leftmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}


double get_time() {
	struct timespec ts;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return ts.tv_sec + ts.tv_nsec * 1e-9;
}

Bitmapset *bms;


int main() {
	int words_to_alloc = 1; // Large set to bypass CPU cache slightly
	bms = malloc(sizeof(Bitmapset) + words_to_alloc * sizeof(bitmapword));
	bms->nwords = words_to_alloc;
	memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));
	double start, end;
	int64 count = 0;

	/* Set a bit far into the set to force a long scan */
	bms->words[words_to_alloc - 1] |= 0xaf4;

	int iterations = 100000000;


	printf("Benchmarking %d bms_next_member iterations...\n", iterations);

	/* master */
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_next_member(bms, j)) >= 0)
			count++;
	}
	end = get_time();
	printf("master: %.5f seconds\n", end - start);

	// Test David
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_next_member_patched(bms, j)) >= 0)
			count++;
	}

	end = get_time();
	printf("Patched: %.5f seconds\n", end - start);

	printf("\nBenchmarking %d bms_prev_member iterations...\n", iterations);

	/* master */
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_prev_member(bms, j)) >= 0)
			count++;
	}
	end = get_time();
	printf("master: %.5f seconds\n", end - start);

	// Test David
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_prev_member_patched(bms, j)) >= 0)
			count++;
	}

	end = get_time();
	printf("Patched: %.5f seconds\n", end - start);

	printf("%ld\n", count);
	free(bms);
	return 0;
}

Attachments:

  [text/plain] test_bms_next3.c (5.7K, 2-test_bms_next3.c)
  download | inline:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <time.h>
#include <limits.h>

//#define NULL ((void *) 0)
typedef uint32_t uint32;
typedef int32_t int32;
typedef uint64_t uint64;
typedef int64_t int64;
#define BITS_PER_BITMAPWORD 64
typedef uint64 bitmapword;	  /* must be an unsigned type */
typedef int64 signedbitmapword; /* must be the matching signed type */

#define WORDNUM(x)  ((x) / BITS_PER_BITMAPWORD)
#define BITNUM(x)   ((x) % BITS_PER_BITMAPWORD)

#ifdef __GNUC__
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define likely(x)	((x) != 0)
#define unlikely(x) ((x) != 0)
#endif

typedef struct Bitmapset
{
	int		 nwords;		/* number of words in array */
	bitmapword  words[];	/* really [nwords] */
} Bitmapset;

static inline int
bmw_rightmost_one_pos(uint64 word)
{
	return __builtin_ctzll(word);
}

static inline int
bmw_leftmost_one_pos(uint64 word)
{
	return 63 - __builtin_clzll(word);
}

int
bms_next_member(const Bitmapset *a, int prevbit)
{
	int		 nwords;
	bitmapword  mask;

	if (a == NULL)
		return -2;

	nwords = a->nwords;
	prevbit++;
	mask = (~(bitmapword) 0) << BITNUM(prevbit);
	for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
	{
		bitmapword  w = a->words[wordnum];

		/* ignore bits before prevbit */
		w &= mask;

		if (w != 0)
		{
			int		 result;

			result = wordnum * BITS_PER_BITMAPWORD;
			result += bmw_rightmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}

int
bms_next_member_patched(const Bitmapset *a, int prevbit)
{
	unsigned int currbit = prevbit;
	int			nwords;
	bitmapword	mask;

	if (a == NULL)
		return -2;
	nwords = a->nwords;

	/* use an unsigned int to avoid the risk that int overflows */
	currbit++;
	mask = (~(bitmapword) 0) << BITNUM(currbit);
	for (int wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++)
	{
		bitmapword	w = a->words[wordnum];

		/* ignore bits before currbit */
		w &= mask;

		if (w != 0)
		{
			int			result;

			result = wordnum * BITS_PER_BITMAPWORD;
			result += bmw_rightmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}

int
bms_prev_member(const Bitmapset *a, int prevbit)
{
	int			ushiftbits;
	bitmapword	mask;

	/*
	 * If set is NULL or if there are no more bits to the right then we've
	 * nothing to do.
	 */
	if (a == NULL || prevbit == 0)
		return -2;

	/* transform -1 to the highest possible bit we could have set */
	if (prevbit == -1)
		prevbit = a->nwords * BITS_PER_BITMAPWORD - 1;
	else
		prevbit--;

	ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(prevbit) + 1);
	mask = (~(bitmapword) 0) >> ushiftbits;
	for (int wordnum = WORDNUM(prevbit); wordnum >= 0; wordnum--)
	{
		bitmapword	w = a->words[wordnum];

		/* mask out bits left of prevbit */
		w &= mask;

		if (w != 0)
		{
			int			result;

			result = wordnum * BITS_PER_BITMAPWORD;
			result += bmw_leftmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}

int
bms_prev_member_patched(const Bitmapset *a, int prevbit)
{
	unsigned int currbit;
	int			ushiftbits;
	bitmapword	mask;


	/*
	 * If set is NULL or if there are no more bits to the right then we've
	 * nothing to do.
	 */
	if (a == NULL || prevbit == 0)
		return -2;

	/*
	 * Transform -1 to the highest possible bit we could have set.  We do this
	 * in unsigned math to avoid the risk of overflowing a signed int.
	 */
	if (prevbit < 0)
		currbit = (unsigned int) a->nwords * BITS_PER_BITMAPWORD - 1;
	else
		currbit = prevbit - 1;

	ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(currbit) + 1);
	mask = (~(bitmapword) 0) >> ushiftbits;
	for (int wordnum = WORDNUM(currbit); wordnum >= 0; wordnum--)
	{
		bitmapword	w = a->words[wordnum];

		/* mask out bits left of currbit */
		w &= mask;

		if (w != 0)
		{
			int			result;

			result = wordnum * BITS_PER_BITMAPWORD;
			result += bmw_leftmost_one_pos(w);
			return result;
		}

		/* in subsequent words, consider all bits */
		mask = (~(bitmapword) 0);
	}
	return -2;
}


double get_time() {
	struct timespec ts;
	clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
	return ts.tv_sec + ts.tv_nsec * 1e-9;
}

Bitmapset *bms;


int main() {
	int words_to_alloc = 1; // Large set to bypass CPU cache slightly
	bms = malloc(sizeof(Bitmapset) + words_to_alloc * sizeof(bitmapword));
	bms->nwords = words_to_alloc;
	memset(bms->words, 0, words_to_alloc * sizeof(bitmapword));
	double start, end;
	int64 count = 0;

	/* Set a bit far into the set to force a long scan */
	bms->words[words_to_alloc - 1] |= 0xaf4;

	int iterations = 100000000;


	printf("Benchmarking %d bms_next_member iterations...\n", iterations);

	/* master */
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_next_member(bms, j)) >= 0)
			count++;
	}
	end = get_time();
	printf("master: %.5f seconds\n", end - start);

	// Test David
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_next_member_patched(bms, j)) >= 0)
			count++;
	}

	end = get_time();
	printf("Patched: %.5f seconds\n", end - start);

	printf("\nBenchmarking %d bms_prev_member iterations...\n", iterations);

	/* master */
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_prev_member(bms, j)) >= 0)
			count++;
	}
	end = get_time();
	printf("master: %.5f seconds\n", end - start);

	// Test David
	start = get_time();
	for (int i = 0; i < iterations; i++)
	{
		int j = -1;
		while ((j = bms_prev_member_patched(bms, j)) >= 0)
			count++;
	}

	end = get_time();
	printf("Patched: %.5f seconds\n", end - start);

	printf("%ld\n", count);
	free(bms);
	return 0;
}

  [application/octet-stream] v2-0001-Fix-unlikely-overflow-bug-in-bms_next_member.patch (3.2K, 3-v2-0001-Fix-unlikely-overflow-bug-in-bms_next_member.patch)
  download | inline diff:
From d1c644ba4de05624fc5a63b3d5f4026c71ed4aa5 Mon Sep 17 00:00:00 2001
From: David Rowley <[email protected]>
Date: Sat, 4 Apr 2026 16:41:23 +1300
Subject: [PATCH v2] Fix unlikely overflow bug in bms_next_member()

... and bms_prev_member().

Both of these functions won't work correctly when given a prevbit of
INT_MAX.  Here we fix that by using an unsigned int to calculate which
member to look for next.

In practise, Bitmapsets will never have such a large member, so no
backpatch.

Author: David Rowley <[email protected]>
Reviewed-by: Chao Li <[email protected]>
Discussion: https://postgr.es/m/CAApHDvq0T%3DiJ0Sf5TNE9yyWwfOeVjmrBt0wSywDnGD9Y4YJQBA%40mail.gmail.com
---
 src/backend/nodes/bitmapset.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c
index 786f343b3c9..957172648c3 100644
--- a/src/backend/nodes/bitmapset.c
+++ b/src/backend/nodes/bitmapset.c
@@ -1289,6 +1289,7 @@ bms_join(Bitmapset *a, Bitmapset *b)
 int
 bms_next_member(const Bitmapset *a, int prevbit)
 {
+	unsigned int currbit = prevbit;
 	int			nwords;
 	bitmapword	mask;
 
@@ -1297,13 +1298,15 @@ bms_next_member(const Bitmapset *a, int prevbit)
 	if (a == NULL)
 		return -2;
 	nwords = a->nwords;
-	prevbit++;
-	mask = (~(bitmapword) 0) << BITNUM(prevbit);
-	for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++)
+
+	/* use an unsigned int to avoid the risk that int overflows */
+	currbit++;
+	mask = (~(bitmapword) 0) << BITNUM(currbit);
+	for (int wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++)
 	{
 		bitmapword	w = a->words[wordnum];
 
-		/* ignore bits before prevbit */
+		/* ignore bits before currbit */
 		w &= mask;
 
 		if (w != 0)
@@ -1345,10 +1348,10 @@ bms_next_member(const Bitmapset *a, int prevbit)
  * It makes no difference in simple loop usage, but complex iteration logic
  * might need such an ability.
  */
-
 int
 bms_prev_member(const Bitmapset *a, int prevbit)
 {
+	unsigned int currbit;
 	int			ushiftbits;
 	bitmapword	mask;
 
@@ -1362,22 +1365,24 @@ bms_prev_member(const Bitmapset *a, int prevbit)
 		return -2;
 
 	/* Validate callers didn't give us something out of range */
-	Assert(prevbit <= a->nwords * BITS_PER_BITMAPWORD);
-	Assert(prevbit >= -1);
+	Assert(prevbit < 0 || prevbit <= (unsigned int) (a->nwords * BITS_PER_BITMAPWORD));
 
-	/* transform -1 to the highest possible bit we could have set */
-	if (prevbit == -1)
-		prevbit = a->nwords * BITS_PER_BITMAPWORD - 1;
+	/*
+	 * Transform -1 to the highest possible bit we could have set.  We do this
+	 * in unsigned math to avoid the risk of overflowing a signed int.
+	 */
+	if (prevbit < 0)
+		currbit = (unsigned int) a->nwords * BITS_PER_BITMAPWORD - 1;
 	else
-		prevbit--;
+		currbit = prevbit - 1;
 
-	ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(prevbit) + 1);
+	ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(currbit) + 1);
 	mask = (~(bitmapword) 0) >> ushiftbits;
-	for (int wordnum = WORDNUM(prevbit); wordnum >= 0; wordnum--)
+	for (int wordnum = WORDNUM(currbit); wordnum >= 0; wordnum--)
 	{
 		bitmapword	w = a->words[wordnum];
 
-		/* mask out bits left of prevbit */
+		/* mask out bits left of currbit */
 		w &= mask;
 
 		if (w != 0)
-- 
2.51.0



^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-12 23:41  David Rowley <[email protected]>
  parent: David Rowley <[email protected]>
  1 sibling, 0 replies; 17+ messages in thread

From: David Rowley @ 2026-04-12 23:41 UTC (permalink / raw)
  To: Tom Lane <[email protected]>; +Cc: PostgreSQL Developers <[email protected]>

On Mon, 13 Apr 2026 at 01:17, David Rowley <[email protected]> wrote:
> Re-thinking this after a week's holiday, it seems fine to use an
> unsigned 32-bit int rather than a 64-bit int to fix this bug. I'd
> previously been uncertain if there were any guarantees in C to what
> (unsigned int) -1 would return, but going by [1] at 6.3.1.3, it says:
>
> "Otherwise, if the new type is unsigned, the value is converted by
> repeatedly adding or subtracting one more than the maximum value that
> can be represented in the new type until the value is in the range of
> the new type."
>
> So, it seems even on one's complement that -1 as an unsigned int will
> be UINT_MAX. When we add 1 to UINT_MAX, we're guaranteed to get 0, as
> it's unsigned maths and overflows are going to result in a value
> modulus the max value for the type.

I've pushed that version. No backpatch.

David





^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-13 00:23  Chao Li <[email protected]>
  parent: David Rowley <[email protected]>
  1 sibling, 1 reply; 17+ messages in thread

From: Chao Li @ 2026-04-13 00:23 UTC (permalink / raw)
  To: David Rowley <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]>



> On Apr 12, 2026, at 21:17, David Rowley <[email protected]> wrote:
> 
> On Fri, 3 Apr 2026 at 15:08, David Rowley <[email protected]> wrote:
>> IMO, if we can make bitmapset.c work with INT_MAX members and get a
>> performance increase, then we should do it.
> 
> Re-thinking this after a week's holiday, it seems fine to use an
> unsigned 32-bit int rather than a 64-bit int to fix this bug. I'd
> previously been uncertain if there were any guarantees in C to what
> (unsigned int) -1 would return, but going by [1] at 6.3.1.3, it says:
> 
> "Otherwise, if the new type is unsigned, the value is converted by
> repeatedly adding or subtracting one more than the maximum value that
> can be represented in the new type until the value is in the range of
> the new type."
> 
> So, it seems even on one's complement that -1 as an unsigned int will
> be UINT_MAX. When we add 1 to UINT_MAX, we're guaranteed to get 0, as
> it's unsigned maths and overflows are going to result in a value
> modulus the max value for the type.
> 
> That leads me to the attached v2 patch. Compiler Explorer link showing
> the assembly at [2].
> 
> Testing the performance, it's better than master, so I got rid of the
> size_t wordnum stuff. We're post-freeze now, so fiddling with other
> optimisations seems a bit off the table as there appears to be no
> performance regression to compensate for now, per:
> 
> drowley@amd3990x:~$ gcc test_bms_next3.c -O2 -o test_bms_next3 &&
> ./test_bms_next3
> Benchmarking 100000000 bms_next_member iterations...
> master: 1.18330 seconds
> Patched: 1.05493 seconds
> 
> Benchmarking 100000000 bms_prev_member iterations...
> master: 2.94522 seconds
> Patched: 1.86130 seconds
> 
> drowley@amd3990x:~$ clang test_bms_next3.c -O2 -o test_bms_next3 &&
> ./test_bms_next3
> Benchmarking 100000000 bms_next_member iterations...
> master: 1.07860 seconds
> Patched: 1.07896 seconds
> 
> Benchmarking 100000000 bms_prev_member iterations...
> master: 2.76550 seconds
> Patched: 2.12369 seconds
> 
> David
> 
> [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf
> [2] https://godbolt.org/z/xW96rxd3P
> <test_bms_next3.c><v2-0001-Fix-unlikely-overflow-bug-in-bms_next_member.patch>

I just tried test_bms_next3 on my MacBook, looks like patched bms_prev_member is much faster there. I ran about 10 times, the results were consistent.
```
chaol@ChaodeMacBook-Air test % gcc test_bms_next3.c -O2 -o test_bms_next3
test_bms_next3.c:281:18: warning: format specifies type 'long' but the argument has type 'int64' (aka 'long long') [-Wformat]
  281 |         printf("%ld\n", count);
      |                 ~~~     ^~~~~
      |                 %lld
1 warning generated.
chaol@ChaodeMacBook-Air test % ./test_bms_next3
Benchmarking 100000000 bms_next_member iterations...
master: 0.49287 seconds
Patched: 0.47315 seconds

Benchmarking 100000000 bms_prev_member iterations...
master: 1.04644 seconds
Patched: 0.58053 seconds
2800000000
```

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/









^ permalink  raw  reply  [nested|flat] 17+ messages in thread

* Re: Small and unlikely overflow hazard in bms_next_member()
@ 2026-04-13 02:43  David Rowley <[email protected]>
  parent: Chao Li <[email protected]>
  0 siblings, 0 replies; 17+ messages in thread

From: David Rowley @ 2026-04-13 02:43 UTC (permalink / raw)
  To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]>

On Mon, 13 Apr 2026 at 12:24, Chao Li <[email protected]> wrote:
> I just tried test_bms_next3 on my MacBook, looks like patched bms_prev_member is much faster there. I ran about 10 times, the results were consistent.

Thanks. I saw similar results.

David





^ permalink  raw  reply  [nested|flat] 17+ messages in thread

end of thread, other threads:[~2026-04-13 02:43 UTC | newest]

Thread overview: 17+ messages (download: mbox mbox.gz follow: Atom feed)
-- links below jump to the message on this page --
2026-04-02 04:09 Small and unlikely overflow hazard in bms_next_member() David Rowley <[email protected]>
2026-04-02 04:22 ` Tom Lane <[email protected]>
2026-04-02 22:12   ` David Rowley <[email protected]>
2026-04-03 02:08     ` David Rowley <[email protected]>
2026-04-03 03:24       ` Chao Li <[email protected]>
2026-04-03 11:42         ` David Rowley <[email protected]>
2026-04-03 13:28           ` Chao Li <[email protected]>
2026-04-04 03:30             ` David Rowley <[email protected]>
2026-04-06 02:00               ` Chao Li <[email protected]>
2026-04-12 12:33                 ` David Rowley <[email protected]>
2026-04-12 13:17       ` David Rowley <[email protected]>
2026-04-12 23:41         ` David Rowley <[email protected]>
2026-04-13 00:23         ` Chao Li <[email protected]>
2026-04-13 02:43           ` David Rowley <[email protected]>
2026-04-02 06:38 ` Chao Li <[email protected]>
2026-04-02 07:54   ` jie wang <[email protected]>
2026-04-02 22:20   ` David Rowley <[email protected]>

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox