public inbox for [email protected]help / color / mirror / Atom feed
Small and unlikely overflow hazard in bms_next_member() 17+ messages / 4 participants [nested] [flat]
* Small and unlikely overflow hazard in bms_next_member() @ 2026-04-02 04:09 David Rowley <[email protected]> 0 siblings, 2 replies; 17+ messages in thread From: David Rowley @ 2026-04-02 04:09 UTC (permalink / raw) To: PostgreSQL Developers <[email protected]> I've been working on bms_left_shift_members() to bitshift members either left or right in order to tidy up some existing code and improve a future Bitmapset use case I'm currently working on. When testing some ERROR code I added to ensure we don't get an excessively large left shift value and end up with members higher than INT32_MAX, I discovered that bms_next_member() can't handle that value, as "prevbit++" will wrap to INT32_MIN and then we'll try to access a negative array index, i.e. seg fault. I appreciate that such a large member is quite unlikely, but if this isn't fixed then I need to code my error checking code to disallow members >= INT32_MAX rather than > INT32_MAX. I did have a comment explaining why I was doing that, but fixing the bug saves the weird special case and the comment. Patched attached. I was thinking it might not be worthy of backpatching, but I'll entertain alternative views on that. David Attachments: [application/octet-stream] bms_next_member_fix.patch (905B, 2-bms_next_member_fix.patch) download | inline diff: diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index 786f343b3c9..912a1ab696a 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -1289,6 +1289,7 @@ bms_join(Bitmapset *a, Bitmapset *b) int bms_next_member(const Bitmapset *a, int prevbit) { + int64 currbit; int nwords; bitmapword mask; @@ -1297,13 +1298,13 @@ bms_next_member(const Bitmapset *a, int prevbit) if (a == NULL) return -2; nwords = a->nwords; - prevbit++; - mask = (~(bitmapword) 0) << BITNUM(prevbit); - for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++) + currbit = (int64) prevbit + 1; + mask = (~(bitmapword) 0) << BITNUM(currbit); + for (int wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; - /* ignore bits before prevbit */ + /* ignore bits before currbit */ w &= mask; if (w != 0) ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-02 04:22 Tom Lane <[email protected]> parent: David Rowley <[email protected]> 1 sibling, 1 reply; 17+ messages in thread From: Tom Lane @ 2026-04-02 04:22 UTC (permalink / raw) To: David Rowley <[email protected]>; +Cc: PostgreSQL Developers <[email protected]> David Rowley <[email protected]> writes: > When testing some ERROR code I added to ensure we don't get an > excessively large left shift value and end up with members higher than > INT32_MAX, I discovered that bms_next_member() can't handle that > value, as "prevbit++" will wrap to INT32_MIN and then we'll try to > access a negative array index, i.e. seg fault. > I appreciate that such a large member is quite unlikely, I think it's impossible, and if it's not then this is not the only place in bitmapset.c that could theoretically overflow. As an example, bms_prev_member does Assert(prevbit <= a->nwords * BITS_PER_BITMAPWORD); but if the bitmapset were large enough to accommodate INT_MAX as a member then a->nwords * BITS_PER_BITMAPWORD must overflow. I don't think we should add cycles here for this purpose. If it makes you feel better, maybe add Asserts to bms_make_singleton and bms_add_member to constrain the maximum member value to somewhat less than INT_MAX? regards, tom lane ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-02 06:38 Chao Li <[email protected]> parent: David Rowley <[email protected]> 1 sibling, 2 replies; 17+ messages in thread From: Chao Li @ 2026-04-02 06:38 UTC (permalink / raw) To: David Rowley <[email protected]>; +Cc: PostgreSQL Developers <[email protected]> > On Apr 2, 2026, at 12:09, David Rowley <[email protected]> wrote: > > I've been working on bms_left_shift_members() to bitshift members > either left or right in order to tidy up some existing code and > improve a future Bitmapset use case I'm currently working on. > > When testing some ERROR code I added to ensure we don't get an > excessively large left shift value and end up with members higher than > INT32_MAX, I discovered that bms_next_member() can't handle that > value, as "prevbit++" will wrap to INT32_MIN and then we'll try to > access a negative array index, i.e. seg fault. > > I appreciate that such a large member is quite unlikely, but if this > isn't fixed then I need to code my error checking code to disallow > members >= INT32_MAX rather than > INT32_MAX. I did have a comment > explaining why I was doing that, but fixing the bug saves the weird > special case and the comment. > > Patched attached. I was thinking it might not be worthy of > backpatching, but I'll entertain alternative views on that. > > David > <bms_next_member_fix.patch> Both the return value of bms_next_member() and the parameter “prevbit" are defined as int, which seems to imply that a bitmap can hold at most INT32_MAX elements. So I wonder if a cleaner fix would be to just add a range guard, like this: ``` diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index 786f343b3c9..7f79f81f278 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -1294,7 +1294,7 @@ bms_next_member(const Bitmapset *a, int prevbit) Assert(bms_is_valid_set(a)); - if (a == NULL) + if (a == NULL || prevbit == INT32_MAX) return -2; nwords = a->nwords; prevbit++; ``` Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-02 07:54 jie wang <[email protected]> parent: Chao Li <[email protected]> 1 sibling, 0 replies; 17+ messages in thread From: jie wang @ 2026-04-02 07:54 UTC (permalink / raw) To: Chao Li <[email protected]>; +Cc: David Rowley <[email protected]>; PostgreSQL Developers <[email protected]> Chao Li <[email protected]> 于2026年4月2日周四 14:39写道: > > > > On Apr 2, 2026, at 12:09, David Rowley <[email protected]> wrote: > > > > I've been working on bms_left_shift_members() to bitshift members > > either left or right in order to tidy up some existing code and > > improve a future Bitmapset use case I'm currently working on. > > > > When testing some ERROR code I added to ensure we don't get an > > excessively large left shift value and end up with members higher than > > INT32_MAX, I discovered that bms_next_member() can't handle that > > value, as "prevbit++" will wrap to INT32_MIN and then we'll try to > > access a negative array index, i.e. seg fault. > > > > I appreciate that such a large member is quite unlikely, but if this > > isn't fixed then I need to code my error checking code to disallow > > members >= INT32_MAX rather than > INT32_MAX. I did have a comment > > explaining why I was doing that, but fixing the bug saves the weird > > special case and the comment. > > > > Patched attached. I was thinking it might not be worthy of > > backpatching, but I'll entertain alternative views on that. > > > > David > > <bms_next_member_fix.patch> > > Both the return value of bms_next_member() and the parameter “prevbit" are > defined as int, which seems to imply that a bitmap can hold at most > INT32_MAX elements. So I wonder if a cleaner fix would be to just add a > range guard, like this: > > ``` > diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c > index 786f343b3c9..7f79f81f278 100644 > --- a/src/backend/nodes/bitmapset.c > +++ b/src/backend/nodes/bitmapset.c > @@ -1294,7 +1294,7 @@ bms_next_member(const Bitmapset *a, int prevbit) > > Assert(bms_is_valid_set(a)); > > - if (a == NULL) > + if (a == NULL || prevbit == INT32_MAX) > return -2; > nwords = a->nwords; > prevbit++; > ``` > > Best regards, > -- > Chao Li (Evan) > HighGo Software Co., Ltd. > https://www.highgo.com/ > > > > > > > Hi, The both solutions look good to me. I am slightly keen on Chao's version that looks simpler to me. Thanks! wang jie ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-02 22:12 David Rowley <[email protected]> parent: Tom Lane <[email protected]> 0 siblings, 1 reply; 17+ messages in thread From: David Rowley @ 2026-04-02 22:12 UTC (permalink / raw) To: Tom Lane <[email protected]>; +Cc: PostgreSQL Developers <[email protected]> On Thu, 2 Apr 2026 at 17:22, Tom Lane <[email protected]> wrote: > Assert(prevbit <= a->nwords * BITS_PER_BITMAPWORD); > but if the bitmapset were large enough to accommodate INT_MAX > as a member then a->nwords * BITS_PER_BITMAPWORD must overflow. I missed that one. That's annoying. Even "prevbit = a->nwords * BITS_PER_BITMAPWORD - 1;" is undefined if it wraps due to the signed maths. > I don't think we should add cycles here for this purpose. I'm not keen on slowing things down for this either. I did do some experiments in [1] that sees fewer instructions from using 64-bit maths. I might go off and see if there are any wins there that also give us the INT_MAX fix. It's not great effort to reward ratio though... David [1] https://godbolt.org/z/Eh1vzssq7 ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-02 22:20 David Rowley <[email protected]> parent: Chao Li <[email protected]> 1 sibling, 0 replies; 17+ messages in thread From: David Rowley @ 2026-04-02 22:20 UTC (permalink / raw) To: Chao Li <[email protected]>; +Cc: PostgreSQL Developers <[email protected]> On Thu, 2 Apr 2026 at 19:39, Chao Li <[email protected]> wrote: > Both the return value of bms_next_member() and the parameter “prevbit" are defined as int, which seems to imply that a bitmap can hold at most INT32_MAX elements. So I wonder if a cleaner fix would be to just add a range guard, like this: > > ``` > diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c > index 786f343b3c9..7f79f81f278 100644 > --- a/src/backend/nodes/bitmapset.c > +++ b/src/backend/nodes/bitmapset.c > @@ -1294,7 +1294,7 @@ bms_next_member(const Bitmapset *a, int prevbit) > > Assert(bms_is_valid_set(a)); > > - if (a == NULL) > + if (a == NULL || prevbit == INT32_MAX) I don't think that's a nice way at all. Adding a special case plus extra instructions, including a new jump instruction for the boolean short-circuiting. More instruction decoding, more L1i space needed, more branches to predict and more space in the branch predictor tables to overwrite other useful branches. Adds run-time overhead. I was aiming for low overhead and no special cases. 2a600a93c means we don't care about the performance on 32-bit hardware anymore, so that can't be used as a counterargument. David ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-03 02:08 David Rowley <[email protected]> parent: David Rowley <[email protected]> 0 siblings, 2 replies; 17+ messages in thread From: David Rowley @ 2026-04-03 02:08 UTC (permalink / raw) To: Tom Lane <[email protected]>; +Cc: PostgreSQL Developers <[email protected]> On Fri, 3 Apr 2026 at 11:12, David Rowley <[email protected]> wrote: > > On Thu, 2 Apr 2026 at 17:22, Tom Lane <[email protected]> wrote: > > I don't think we should add cycles here for this purpose. > > I'm not keen on slowing things down for this either. I did do some > experiments in [1] that sees fewer instructions from using 64-bit > maths. I might go off and see if there are any wins there that also > give us the INT_MAX fix. It's not great effort to reward ratio > though... The reduction in instructions with the patched version got me curious to see if it would translate into a performance increase. I tested on an AMD Zen2 machine, and it's a decent amount faster than master. I tested with gcc and clang. I also scanned over the remaining parts of bitmapset.c and didn't find anywhere else that has overflow risk aside from what you pointed out in bms_prev_member(). The attached patch contains the benchmark function I added to the test_bitmapset module. It should apply to master with a bit of noise. CREATE EXTENSION test_bitmapset; SELECT generate_series(1,3) AS run, bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS bms_prev_member_us; master (gcc) run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 26473 | 40404 2 | 26218 | 40413 3 | 26209 | 40387 patched (gcc) run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 25409 | 29705 2 | 24905 | 29693 3 | 24870 | 29707 Times are in microseconds to do 1 million bms_*_member() loops over the entire set. I've also attached the full results I got. I've also included the results from Chao's version, which does slow things down decently on clang. IMO, if we can make bitmapset.c work with INT_MAX members and get a performance increase, then we should do it. David > [1] https://godbolt.org/z/Eh1vzssq7 ** GCC ** Master: postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 26473 | 40404 2 | 26218 | 40413 3 | 26209 | 40387 4 | 26245 | 40521 5 | 26343 | 40559 6 | 26343 | 40580 7 | 26366 | 40639 8 | 26735 | 40467 9 | 26201 | 40334 10 | 26208 | 40343 (10 rows) Time: 735.560 ms postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 26616 | 40392 2 | 26239 | 40374 3 | 26191 | 40374 4 | 26230 | 40370 5 | 26232 | 40380 6 | 26210 | 40377 7 | 26206 | 40395 8 | 26249 | 40431 9 | 26209 | 40374 10 | 26234 | 40432 (10 rows) Time: 733.927 ms Patched: postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 25409 | 29705 2 | 24905 | 29693 3 | 24870 | 29707 4 | 24933 | 29869 5 | 25059 | 29870 6 | 25026 | 29906 7 | 25055 | 29897 8 | 25055 | 29875 9 | 25031 | 29870 10 | 25042 | 29816 (10 rows) Time: 604.086 ms postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 25074 | 29723 2 | 24882 | 29721 3 | 24875 | 29710 4 | 24976 | 29854 5 | 25052 | 29865 6 | 25021 | 29904 7 | 25025 | 29818 8 | 25034 | 29813 9 | 25029 | 29828 10 | 25042 | 29866 (10 rows) Time: 603.708 ms Chao's idea: postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 25677 | 40376 2 | 25271 | 40430 3 | 25263 | 40370 4 | 25301 | 40377 5 | 25357 | 40552 6 | 25394 | 40560 7 | 25412 | 40501 8 | 25401 | 40558 9 | 25381 | 40565 10 | 25370 | 40548 (10 rows) Time: 725.376 ms ** CLANG ** master: postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 29970 | 42551 2 | 29479 | 42765 3 | 29475 | 42805 4 | 29523 | 42825 5 | 29578 | 42898 6 | 29529 | 42997 7 | 29535 | 42985 8 | 29593 | 43021 9 | 29575 | 42976 10 | 29562 | 42967 (10 rows) Time: 798.131 ms postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 30919 | 42516 2 | 29312 | 42629 3 | 29422 | 42777 4 | 29307 | 42811 5 | 29273 | 42852 6 | 29219 | 42826 7 | 29860 | 43264 8 | 29325 | 42814 9 | 29299 | 42837 10 | 29351 | 42909 (10 rows) Time: 796.495 ms Patched: postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 26741 | 33537 2 | 26094 | 33590 3 | 26087 | 33724 4 | 26276 | 33868 5 | 26346 | 33943 6 | 26433 | 33815 7 | 26324 | 33927 8 | 26274 | 33893 9 | 26350 | 33878 10 | 26304 | 33841 (10 rows) Time: 661.924 ms postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 26502 | 33523 2 | 26173 | 33635 3 | 26188 | 33772 4 | 26290 | 33767 5 | 26325 | 33935 6 | 26326 | 33951 7 | 26335 | 33883 8 | 26571 | 33884 9 | 26472 | 33837 10 | 26277 | 33919 (10 rows) Time: 662.422 ms Chao's idea: postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 34865 | 42599 2 | 34547 | 42817 3 | 34623 | 42948 4 | 34689 | 42954 5 | 34733 | 42928 6 | 34731 | 42845 7 | 34727 | 42967 8 | 34763 | 43014 9 | 34775 | 42987 10 | 34788 | 43038 (10 rows) Time: 854.851 ms Attachments: [text/plain] benchmark_results.txt (8.0K, 2-benchmark_results.txt) download | inline: ** GCC ** Master: postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 26473 | 40404 2 | 26218 | 40413 3 | 26209 | 40387 4 | 26245 | 40521 5 | 26343 | 40559 6 | 26343 | 40580 7 | 26366 | 40639 8 | 26735 | 40467 9 | 26201 | 40334 10 | 26208 | 40343 (10 rows) Time: 735.560 ms postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 26616 | 40392 2 | 26239 | 40374 3 | 26191 | 40374 4 | 26230 | 40370 5 | 26232 | 40380 6 | 26210 | 40377 7 | 26206 | 40395 8 | 26249 | 40431 9 | 26209 | 40374 10 | 26234 | 40432 (10 rows) Time: 733.927 ms Patched: postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 25409 | 29705 2 | 24905 | 29693 3 | 24870 | 29707 4 | 24933 | 29869 5 | 25059 | 29870 6 | 25026 | 29906 7 | 25055 | 29897 8 | 25055 | 29875 9 | 25031 | 29870 10 | 25042 | 29816 (10 rows) Time: 604.086 ms postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 25074 | 29723 2 | 24882 | 29721 3 | 24875 | 29710 4 | 24976 | 29854 5 | 25052 | 29865 6 | 25021 | 29904 7 | 25025 | 29818 8 | 25034 | 29813 9 | 25029 | 29828 10 | 25042 | 29866 (10 rows) Time: 603.708 ms Chao's idea: postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 25677 | 40376 2 | 25271 | 40430 3 | 25263 | 40370 4 | 25301 | 40377 5 | 25357 | 40552 6 | 25394 | 40560 7 | 25412 | 40501 8 | 25401 | 40558 9 | 25381 | 40565 10 | 25370 | 40548 (10 rows) Time: 725.376 ms ** CLANG ** master: postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 29970 | 42551 2 | 29479 | 42765 3 | 29475 | 42805 4 | 29523 | 42825 5 | 29578 | 42898 6 | 29529 | 42997 7 | 29535 | 42985 8 | 29593 | 43021 9 | 29575 | 42976 10 | 29562 | 42967 (10 rows) Time: 798.131 ms postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 30919 | 42516 2 | 29312 | 42629 3 | 29422 | 42777 4 | 29307 | 42811 5 | 29273 | 42852 6 | 29219 | 42826 7 | 29860 | 43264 8 | 29325 | 42814 9 | 29299 | 42837 10 | 29351 | 42909 (10 rows) Time: 796.495 ms Patched: postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 26741 | 33537 2 | 26094 | 33590 3 | 26087 | 33724 4 | 26276 | 33868 5 | 26346 | 33943 6 | 26433 | 33815 7 | 26324 | 33927 8 | 26274 | 33893 9 | 26350 | 33878 10 | 26304 | 33841 (10 rows) Time: 661.924 ms postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 26502 | 33523 2 | 26173 | 33635 3 | 26188 | 33772 4 | 26290 | 33767 5 | 26325 | 33935 6 | 26326 | 33951 7 | 26335 | 33883 8 | 26571 | 33884 9 | 26472 | 33837 10 | 26277 | 33919 (10 rows) Time: 662.422 ms Chao's idea: postgres=# select generate_series(1,10) run,bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_next_member_us, bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 as bms_prev_member_us; run | bms_next_member_us | bms_prev_member_us -----+--------------------+-------------------- 1 | 34865 | 42599 2 | 34547 | 42817 3 | 34623 | 42948 4 | 34689 | 42954 5 | 34733 | 42928 6 | 34731 | 42845 7 | 34727 | 42967 8 | 34763 | 43014 9 | 34775 | 42987 10 | 34788 | 43038 (10 rows) Time: 854.851 ms [application/octet-stream] bms_fixes.patch (5.0K, 3-bms_fixes.patch) download | inline diff: diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index ac3e763de43..46b4ca4b0f1 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -1444,6 +1444,7 @@ bms_join(Bitmapset *a, Bitmapset *b) int bms_next_member(const Bitmapset *a, int prevbit) { + int64 currbit; int nwords; bitmapword mask; @@ -1452,13 +1453,13 @@ bms_next_member(const Bitmapset *a, int prevbit) if (a == NULL) return -2; nwords = a->nwords; - prevbit++; - mask = (~(bitmapword) 0) << BITNUM(prevbit); - for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++) + currbit = (int64) prevbit + 1; + mask = (~(bitmapword) 0) << BITNUM(currbit); + for (int wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; - /* ignore bits before prevbit */ + /* ignore bits before currbit */ w &= mask; if (w != 0) @@ -1505,6 +1506,7 @@ int bms_prev_member(const Bitmapset *a, int prevbit) { int ushiftbits; + uint32 currbit; bitmapword mask; Assert(bms_is_valid_set(a)); @@ -1517,18 +1519,18 @@ bms_prev_member(const Bitmapset *a, int prevbit) return -2; /* Validate callers didn't give us something out of range */ - Assert(prevbit <= a->nwords * BITS_PER_BITMAPWORD); + Assert(prevbit <= (int64) a->nwords * BITS_PER_BITMAPWORD); Assert(prevbit >= -1); /* transform -1 to the highest possible bit we could have set */ if (prevbit == -1) - prevbit = a->nwords * BITS_PER_BITMAPWORD - 1; + currbit = (uint32) a->nwords * BITS_PER_BITMAPWORD - 1; else - prevbit--; + currbit = prevbit - 1; - ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(prevbit) + 1); + ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(currbit) + 1); mask = (~(bitmapword) 0) >> ushiftbits; - for (int wordnum = WORDNUM(prevbit); wordnum >= 0; wordnum--) + for (int wordnum = WORDNUM(currbit); wordnum >= 0; wordnum--) { bitmapword w = a->words[wordnum]; diff --git a/src/test/modules/test_bitmapset/test_bitmapset--1.0.sql b/src/test/modules/test_bitmapset/test_bitmapset--1.0.sql index 6668bcef9a9..d6c0f739e13 100644 --- a/src/test/modules/test_bitmapset/test_bitmapset--1.0.sql +++ b/src/test/modules/test_bitmapset/test_bitmapset--1.0.sql @@ -3,6 +3,15 @@ -- complain if script is sourced in psql, rather than via CREATE EXTENSION \echo Use "CREATE EXTENSION test_bitmapset" to load this file. \quit +-- Benchmark functions +CREATE FUNCTION bench_bms_next_member(text, integer) +RETURNS bigint +AS 'MODULE_PATHNAME' LANGUAGE C; + +CREATE FUNCTION bench_bms_prev_member(text, integer) +RETURNS bigint +AS 'MODULE_PATHNAME' LANGUAGE C; + -- Bitmapset API functions CREATE FUNCTION test_bms_make_singleton(integer) RETURNS text STRICT diff --git a/src/test/modules/test_bitmapset/test_bitmapset.c b/src/test/modules/test_bitmapset/test_bitmapset.c index 4fe22ee64d2..990347a0f22 100644 --- a/src/test/modules/test_bitmapset/test_bitmapset.c +++ b/src/test/modules/test_bitmapset/test_bitmapset.c @@ -17,6 +17,7 @@ #include "postgres.h" #include <stddef.h> +#include <time.h> #include "catalog/pg_type.h" #include "common/pg_prng.h" #include "utils/array.h" @@ -27,8 +28,13 @@ #include "utils/builtins.h" #include "utils/timestamp.h" + PG_MODULE_MAGIC; +/* Benchmark functions */ +PG_FUNCTION_INFO_V1(bench_bms_next_member); +PG_FUNCTION_INFO_V1(bench_bms_prev_member); + /* Bitmapset API functions in order of appearance in bitmapset.c */ PG_FUNCTION_INFO_V1(test_bms_make_singleton); PG_FUNCTION_INFO_V1(test_bms_add_member); @@ -105,6 +111,73 @@ PG_FUNCTION_INFO_V1(test_random_shift_operations); #define PG_RETURN_BITMAPSET_AS_TEXT(bms) \ PG_RETURN_TEXT_P(BITMAPSET_TO_TEXT(bms)) +#define NANOSEC_PER_SEC 1000000000 + +int64 get_clock_diff(struct timespec* t1, struct timespec* t2); + +int64 +get_clock_diff(struct timespec* t1, struct timespec* t2) +{ + int64_t nanosec = (t1->tv_sec - t2->tv_sec) * NANOSEC_PER_SEC; + nanosec += (t1->tv_nsec - t2->tv_nsec); + + return nanosec; +} + +Datum +bench_bms_next_member(PG_FUNCTION_ARGS) +{ + Bitmapset* bms = PG_ARG_GETBITMAPSET(0); + int32 loops = PG_GETARG_INT32(1); + + struct timespec start, end; + int64 nanoseconds; + int64 counter = 0; + + clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); + + for (int i = 0; i < loops; i++) + { + int m = -1; + while ((m = bms_next_member(bms, m)) >= 0) + { + counter++; + } + } + + clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); + nanoseconds = get_clock_diff(&end, &start); + + PG_RETURN_INT64(nanoseconds); +} + +Datum +bench_bms_prev_member(PG_FUNCTION_ARGS) +{ + Bitmapset* bms = PG_ARG_GETBITMAPSET(0); + int32 loops = PG_GETARG_INT32(1); + + struct timespec start, end; + int64 nanoseconds; + int64 counter = 0; + + clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start); + + for (int i = 0; i < loops; i++) + { + int m = -1; + while ((m = bms_prev_member(bms, m)) >= 0) + { + counter++; + } + } + + clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end); + nanoseconds = get_clock_diff(&end, &start); + + PG_RETURN_INT64(nanoseconds); +} + /* * Individual test functions for each bitmapset API function * ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-03 03:24 Chao Li <[email protected]> parent: David Rowley <[email protected]> 1 sibling, 1 reply; 17+ messages in thread From: Chao Li @ 2026-04-03 03:24 UTC (permalink / raw) To: David Rowley <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]> > On Apr 3, 2026, at 10:08, David Rowley <[email protected]> wrote: > > On Fri, 3 Apr 2026 at 11:12, David Rowley <[email protected]> wrote: >> >> On Thu, 2 Apr 2026 at 17:22, Tom Lane <[email protected]> wrote: >>> I don't think we should add cycles here for this purpose. >> >> I'm not keen on slowing things down for this either. I did do some >> experiments in [1] that sees fewer instructions from using 64-bit >> maths. I might go off and see if there are any wins there that also >> give us the INT_MAX fix. It's not great effort to reward ratio >> though... > > The reduction in instructions with the patched version got me curious > to see if it would translate into a performance increase. I tested on > an AMD Zen2 machine, and it's a decent amount faster than master. I > tested with gcc and clang. > > I also scanned over the remaining parts of bitmapset.c and didn't find > anywhere else that has overflow risk aside from what you pointed out > in bms_prev_member(). > > The attached patch contains the benchmark function I added to the > test_bitmapset module. It should apply to master with a bit of noise. > > CREATE EXTENSION test_bitmapset; > SELECT > generate_series(1,3) AS run, > bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS > bms_next_member_us, > bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS > bms_prev_member_us; > > master (gcc) > > run | bms_next_member_us | bms_prev_member_us > -----+--------------------+-------------------- > 1 | 26473 | 40404 > 2 | 26218 | 40413 > 3 | 26209 | 40387 > > patched (gcc) > > run | bms_next_member_us | bms_prev_member_us > -----+--------------------+-------------------- > 1 | 25409 | 29705 > 2 | 24905 | 29693 > 3 | 24870 | 29707 > > Times are in microseconds to do 1 million bms_*_member() loops over > the entire set. > > I've also attached the full results I got. I've also included the > results from Chao's version, which does slow things down decently on > clang. > > IMO, if we can make bitmapset.c work with INT_MAX members and get a > performance increase, then we should do it. > > David > >> [1] https://godbolt.org/z/Eh1vzssq7 > <benchmark_results.txt><bms_fixes.patch> I also did a load test with a standalone c program with 4 versions: * The original bms_next_member (Original) * The fast version from [1], that uses 64bit maths (Fast) * The original version + INT32_MAX check + 64bit maths (Original2) * I tried the other approach that pulls up the first iteration, so that removes "mask = (~(bitmapword) 0);” from the loop. (PullUp) Note: all tests used -O2 to build the executable. On my MacBook M4, the Fast version constantly won, and PullUp version performed badly. ``` % gcc --version Apple clang version 17.0.0 (clang-1700.6.4.2) Target: arm64-apple-darwin25.3.0 Thread model: posix InstalledDir: /Library/Developer/CommandLineTools/usr/bin ``` A typical test run: ``` Benchmarking 100000 iterations... Original: 0.48893 seconds Fast: 0.46979 seconds Original2: 0.47740 seconds PullUp: 0.48029 seconds ``` On my Windows laptop, Intel(R) Core Ultra 5, with WSL based Ubuntu, Orignal2 won in the most runs, and the PullUp version was faster than Fast version. ``` chaol@lichao-highgo:~$ gcc --version gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 Copyright (C) 2023 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. ``` A typical test run: ``` Original: 0.99849 seconds Fast: 0.74722 seconds Original2: 0.59407 seconds PullUp: 0.62746 seconds ``` Then I also tried to run on Windows directly. Here, PullUp version performed the best. ``` $ gcc --version gcc.exe (Rev13, Built by MSYS2 project) 15.2.0 Copyright (C) 2025 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. ``` A typical test run: ``` Original: 0.32931 seconds Fast: 0.32740 seconds Original2: 0.32378 seconds PullUp: 0.30795 seconds ``` I’m curious that, when something performs differently across platforms, which platform should take priority? Please see the attached test program. It’s possible I did something wrong. [1] https://godbolt.org/z/Eh1vzssq7 -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ Attachments: [application/octet-stream] test_bms_next.c (5.4K, 2-test_bms_next.c) download ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-03 11:42 David Rowley <[email protected]> parent: Chao Li <[email protected]> 0 siblings, 1 reply; 17+ messages in thread From: David Rowley @ 2026-04-03 11:42 UTC (permalink / raw) To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]> On Fri, 3 Apr 2026 at 16:24, Chao Li <[email protected]> wrote: > I also did a load test with a standalone c program with 4 versions: I think you'd be better off picking a more realistic Bitmapset. The code is doing: int words_to_alloc = 20000; // Large set to bypass CPU cache slightly Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc * sizeof(bitmapword)); bms->nwords = words_to_alloc; memset(bms->words, 0, words_to_alloc * sizeof(bitmapword)); /* Set a bit far into the set to force a long scan */ int target_bit = (words_to_alloc - 1) * 64 + 10; bms->words[words_to_alloc - 1] |= (1ULL << 10); A set with 20k words! Then you're doing: for (int i = 0; i < iterations; i++) sink = bms_next_member(bms, 0); So you're not looping correctly over the set, as looping over a set with 1 member requires 2 calls to the function. I'd say this unrealistically favours the unrolled loop version, as most of the effort is in the loop looping over empty words and you can do that without the extra bit-wise AND that the other versions have. That version also seems to apply both the 64-bit fix and the INT_MAX fix. Why both? Because you're only calling the function once per iteration, it also means your || INT_MAX is only executed once, rather than n_members + 1 as it would normally be executed. That'll make your version seem better than it is. I fixed all that and changed the Bitmapset to something more realistic with how it's more likely to be seen in PostgreSQL and ran the benchmark with a proper bms_next_member loop. File attached and results below: Zen2 Linux $ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next Benchmarking 100000000 iterations... Original: 0.62113 seconds David's: 0.70257 seconds Chao's version: 1.70691 seconds Unrolled loop: 1.56239 seconds 2800000000 $ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next Benchmarking 100000000 iterations... Original: 1.11263 seconds David's: 0.87399 seconds Chao's version: 0.52962 seconds Unrolled loop: 1.07799 seconds 2800000000 Apple M2: Benchmarking 100000000 iterations... Original: 0.64023 seconds David's: 0.41087 seconds Chao's version: 1.21113 seconds Unrolled loop: 0.55924 seconds 2800000000 > I’m curious that, when something performs differently across platforms, which platform should take priority? I don't think it matters too much for this case. If we were actually working on a function that was a bottleneck, then it would be worth looking deeper and seeing if the compilers are producing suboptimal code and then try to coax them into making better code. As a last resort, have two versions and some preprocessor to decide which one. However, this just isn't performance-critical enough to warrant such an effort. I think we're already going far too far with this. I just want to fix the bug. If performance increases in some way as a result of that, then good. If I sum up the times from each version of the above results, I get: Original: 2.37399 David's: 1.98743 Chao's: 3.44766 Unrolled: 3.19962 I'm happy to go with the fastest one. I expect the one with the fewest instructions is likely more important when running it in the planner, as that's less to decode. I didn't look, but I expect your loop unrolled version and your || INT_MAX version to have more instructions. David #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <string.h> #include <time.h> #include <limits.h> //#define NULL ((void *) 0) typedef uint64_t uint64; typedef int64_t int64; #define BITS_PER_BITMAPWORD 64 typedef uint64 bitmapword; /* must be an unsigned type */ typedef int64 signedbitmapword; /* must be the matching signed type */ #define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) #define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) typedef struct Bitmapset { int nwords; /* number of words in array */ bitmapword words[]; /* really [nwords] */ } Bitmapset; static inline int bmw_rightmost_one_pos(uint64 word) { return __builtin_ctzll(word); } // 1. Original version int bms_next_member(const Bitmapset *a, int prevbit) { int nwords; bitmapword mask; if (a == NULL) return -2; nwords = a->nwords; prevbit++; mask = (~(bitmapword) 0) << BITNUM(prevbit); for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before prevbit */ w &= mask; if (w != 0) { int result; result = wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } // 2. David's version int bms_next_member_david(const Bitmapset *a, int prevbit) { int64 currbit; size_t nwords; bitmapword mask; if (a == NULL) return-2; nwords = a->nwords; currbit = (int64) prevbit + 1; mask = (~(bitmapword) 0) << BITNUM(currbit); for (size_t wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before currbit */ w &= mask; if (w != 0) { int result; result = (int) wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } // 3. Original version + INT32_MAX check + 64bit int bms_next_member_chao(const Bitmapset *a, int prevbit) { size_t nwords; bitmapword mask; if (a == NULL || prevbit == INT32_MAX) return -2; nwords = (size_t) a->nwords; prevbit++; mask = (~(bitmapword) 0) << BITNUM(prevbit); for (size_t wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before prevbit */ w &= mask; if (w != 0) { int result; result = (int)wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } // 4. Pull up first iteration int bms_next_member_unroll_first(const Bitmapset *a, int prevbit) { uint64 currbit; int wordnum; int nwords; if (a == NULL) return -2; currbit = (int64) prevbit + 1; wordnum = WORDNUM(currbit); nwords = a->nwords; if (wordnum >= nwords) return -2; /* Handle first word with mask */ const bitmapword *p = &a->words[wordnum]; bitmapword w = (*p) & ((~(bitmapword) 0) << BITNUM(currbit)); if (w != 0) return (wordnum * BITS_PER_BITMAPWORD) + bmw_rightmost_one_pos(w); /* The "Tight" Pointer Scan */ const bitmapword *end = &a->words[nwords]; for (p++; p < end; p++) { if (*p != 0) { wordnum = p - a->words; // Pointer arithmetic to get index return (wordnum * BITS_PER_BITMAPWORD) + bmw_rightmost_one_pos(*p); } } return -2; } double get_time() { struct timespec ts; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts); return ts.tv_sec + ts.tv_nsec * 1e-9; } int main() { int words_to_alloc = 1; // Large set to bypass CPU cache slightly Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc * sizeof(bitmapword)); bms->nwords = words_to_alloc; memset(bms->words, 0, words_to_alloc * sizeof(bitmapword)); double end; int64 count = 0; /* Set a bit far into the set to force a long scan */ bms->words[words_to_alloc - 1] |= 0xaf4; int iterations = 100000000; printf("Benchmarking %d iterations...\n\n", iterations); // Test Original double start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_next_member(bms, j)) >= 0) count++; } end = get_time(); printf("Original: %.5f seconds\n", end - start); // Test Fast start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_next_member_david(bms, j)) >= 0) count++; } end = get_time(); printf("David's: %.5f seconds\n", end - start); // Test Chao's version start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_next_member_chao(bms, j)) >= 0) count++; } end = get_time(); printf("Chao's version: %.5f seconds\n", end - start); // Pull up first iteration start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_next_member_unroll_first(bms, j)) >= 0) count++; } end = get_time(); printf("Unrolled loop: %.5f seconds\n", end - start); printf("%ld\n", count); free(bms); return 0; } Attachments: [text/plain] test_bms_next2.c (5.2K, 2-test_bms_next2.c) download | inline: #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <string.h> #include <time.h> #include <limits.h> //#define NULL ((void *) 0) typedef uint64_t uint64; typedef int64_t int64; #define BITS_PER_BITMAPWORD 64 typedef uint64 bitmapword; /* must be an unsigned type */ typedef int64 signedbitmapword; /* must be the matching signed type */ #define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) #define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) typedef struct Bitmapset { int nwords; /* number of words in array */ bitmapword words[]; /* really [nwords] */ } Bitmapset; static inline int bmw_rightmost_one_pos(uint64 word) { return __builtin_ctzll(word); } // 1. Original version int bms_next_member(const Bitmapset *a, int prevbit) { int nwords; bitmapword mask; if (a == NULL) return -2; nwords = a->nwords; prevbit++; mask = (~(bitmapword) 0) << BITNUM(prevbit); for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before prevbit */ w &= mask; if (w != 0) { int result; result = wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } // 2. David's version int bms_next_member_david(const Bitmapset *a, int prevbit) { int64 currbit; size_t nwords; bitmapword mask; if (a == NULL) return-2; nwords = a->nwords; currbit = (int64) prevbit + 1; mask = (~(bitmapword) 0) << BITNUM(currbit); for (size_t wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before currbit */ w &= mask; if (w != 0) { int result; result = (int) wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } // 3. Original version + INT32_MAX check + 64bit int bms_next_member_chao(const Bitmapset *a, int prevbit) { size_t nwords; bitmapword mask; if (a == NULL || prevbit == INT32_MAX) return -2; nwords = (size_t) a->nwords; prevbit++; mask = (~(bitmapword) 0) << BITNUM(prevbit); for (size_t wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before prevbit */ w &= mask; if (w != 0) { int result; result = (int)wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } // 4. Pull up first iteration int bms_next_member_unroll_first(const Bitmapset *a, int prevbit) { uint64 currbit; int wordnum; int nwords; if (a == NULL) return -2; currbit = (int64) prevbit + 1; wordnum = WORDNUM(currbit); nwords = a->nwords; if (wordnum >= nwords) return -2; /* Handle first word with mask */ const bitmapword *p = &a->words[wordnum]; bitmapword w = (*p) & ((~(bitmapword) 0) << BITNUM(currbit)); if (w != 0) return (wordnum * BITS_PER_BITMAPWORD) + bmw_rightmost_one_pos(w); /* The "Tight" Pointer Scan */ const bitmapword *end = &a->words[nwords]; for (p++; p < end; p++) { if (*p != 0) { wordnum = p - a->words; // Pointer arithmetic to get index return (wordnum * BITS_PER_BITMAPWORD) + bmw_rightmost_one_pos(*p); } } return -2; } double get_time() { struct timespec ts; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts); return ts.tv_sec + ts.tv_nsec * 1e-9; } int main() { int words_to_alloc = 1; // Large set to bypass CPU cache slightly Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc * sizeof(bitmapword)); bms->nwords = words_to_alloc; memset(bms->words, 0, words_to_alloc * sizeof(bitmapword)); double end; int64 count = 0; /* Set a bit far into the set to force a long scan */ bms->words[words_to_alloc - 1] |= 0xaf4; int iterations = 100000000; printf("Benchmarking %d iterations...\n\n", iterations); // Test Original double start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_next_member(bms, j)) >= 0) count++; } end = get_time(); printf("Original: %.5f seconds\n", end - start); // Test Fast start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_next_member_david(bms, j)) >= 0) count++; } end = get_time(); printf("David's: %.5f seconds\n", end - start); // Test Chao's version start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_next_member_chao(bms, j)) >= 0) count++; } end = get_time(); printf("Chao's version: %.5f seconds\n", end - start); // Pull up first iteration start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_next_member_unroll_first(bms, j)) >= 0) count++; } end = get_time(); printf("Unrolled loop: %.5f seconds\n", end - start); printf("%ld\n", count); free(bms); return 0; } ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-03 13:28 Chao Li <[email protected]> parent: David Rowley <[email protected]> 0 siblings, 1 reply; 17+ messages in thread From: Chao Li @ 2026-04-03 13:28 UTC (permalink / raw) To: David Rowley <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]> > On Apr 3, 2026, at 19:42, David Rowley <[email protected]> wrote: > > On Fri, 3 Apr 2026 at 16:24, Chao Li <[email protected]> wrote: >> I also did a load test with a standalone c program with 4 versions: > > I think you'd be better off picking a more realistic Bitmapset. The > code is doing: > > int words_to_alloc = 20000; // Large set to bypass CPU cache slightly > Bitmapset *bms = malloc(sizeof(Bitmapset) + words_to_alloc * > sizeof(bitmapword)); > bms->nwords = words_to_alloc; > memset(bms->words, 0, words_to_alloc * sizeof(bitmapword)); > > /* Set a bit far into the set to force a long scan */ > int target_bit = (words_to_alloc - 1) * 64 + 10; > bms->words[words_to_alloc - 1] |= (1ULL << 10); > > A set with 20k words! Then you're doing: > > for (int i = 0; i < iterations; i++) sink = bms_next_member(bms, 0); > > So you're not looping correctly over the set, as looping over a set > with 1 member requires 2 calls to the function. > > I'd say this unrealistically favours the unrolled loop version, as > most of the effort is in the loop looping over empty words and you can > do that without the extra bit-wise AND that the other versions have. > That version also seems to apply both the 64-bit fix and the INT_MAX > fix. Why both? Because you're only calling the function once per > iteration, it also means your || INT_MAX is only executed once, rather > than n_members + 1 as it would normally be executed. That'll make your > version seem better than it is. > > I fixed all that and changed the Bitmapset to something more realistic > with how it's more likely to be seen in PostgreSQL and ran the > benchmark with a proper bms_next_member loop. File attached and > results below: > > Zen2 Linux > > $ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next > Benchmarking 100000000 iterations... > > Original: 0.62113 seconds > David's: 0.70257 seconds > Chao's version: 1.70691 seconds > Unrolled loop: 1.56239 seconds > 2800000000 > > $ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next > Benchmarking 100000000 iterations... > > Original: 1.11263 seconds > David's: 0.87399 seconds > Chao's version: 0.52962 seconds > Unrolled loop: 1.07799 seconds > 2800000000 > > Apple M2: > > Benchmarking 100000000 iterations... > > Original: 0.64023 seconds > David's: 0.41087 seconds > Chao's version: 1.21113 seconds > Unrolled loop: 0.55924 seconds > 2800000000 > >> I’m curious that, when something performs differently across platforms, which platform should take priority? > > I don't think it matters too much for this case. If we were actually > working on a function that was a bottleneck, then it would be worth > looking deeper and seeing if the compilers are producing suboptimal > code and then try to coax them into making better code. As a last > resort, have two versions and some preprocessor to decide which one. > However, this just isn't performance-critical enough to warrant such > an effort. I think we're already going far too far with this. I just > want to fix the bug. If performance increases in some way as a result > of that, then good. > > If I sum up the times from each version of the above results, I get: > > Original: 2.37399 > David's: 1.98743 > Chao's: 3.44766 > Unrolled: 3.19962 > > I'm happy to go with the fastest one. I expect the one with the fewest > instructions is likely more important when running it in the planner, > as that's less to decode. I didn't look, but I expect your loop > unrolled version and your || INT_MAX version to have more > instructions. > > David > <test_bms_next2.c> In test_bms_next2.c, you set words_to_alloc = 1, which disables the optimization in the unrolled version. If I change words_to_alloc = 2, then on my MacBook M4, the unrolled version seems to win: ``` chaol@ChaodeMacBook-Air test % ./test_bms_next2 Benchmarking 100000000 iterations... Original: 0.61559 seconds David's: 0.61651 seconds Chao's version: 1.06033 seconds Unrolled loop: 0.60561 seconds 2800000000 ``` (I also checked on Godbolt, from the instruction-count perspective, the unrolled version actually looks the worst.) I guess one-word Bitmapset are probably the most common case in practice. So overall, I agree that your version is the best choice. Please go ahead with it as we have already gained both a bug fix and a performance improvement. If we really want to study the performance more deeply, I think that would be a separate topic. It was fun working on this patch, and thank you for your patience. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-04 03:30 David Rowley <[email protected]> parent: Chao Li <[email protected]> 0 siblings, 1 reply; 17+ messages in thread From: David Rowley @ 2026-04-04 03:30 UTC (permalink / raw) To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]> On Sat, 4 Apr 2026 at 02:28, Chao Li <[email protected]> wrote: > In test_bms_next2.c, you set words_to_alloc = 1, which disables the optimization in the unrolled version. If I change words_to_alloc = 2, then on my MacBook M4, the unrolled version seems to win: > ``` > chaol@ChaodeMacBook-Air test % ./test_bms_next2 > Benchmarking 100000000 iterations... > > Original: 0.61559 seconds > David's: 0.61651 seconds > Chao's version: 1.06033 seconds > Unrolled loop: 0.60561 seconds > 2800000000 > ``` Doing the same locally, here's what I get on my Zen2 machine: drowley@amd3990x:~$ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next Benchmarking 100000000 iterations... Original: 1.21125 seconds David's: 1.11997 seconds Chao's version: 1.72662 seconds Unrolled loop: 1.63090 seconds 2800000000 drowley@amd3990x:~$ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next Benchmarking 100000000 iterations... Original: 1.10780 seconds David's: 1.05968 seconds Chao's version: 1.87123 seconds Unrolled loop: 1.11002 seconds 2800000000 > I guess one-word Bitmapset are probably the most common case in practice. So overall, I agree that your version is the best choice. Please go ahead with it as we have already gained both a bug fix and a performance improvement. If we really want to study the performance more deeply, I think that would be a separate topic. Yeah, but it's better to find bottlenecks and focus there than to optimise blindly without knowing if it'll make any actual difference. It's still important to test for regressions when modifying code and try to avoid them, as it's sometimes not obvious if the code being modified is critical for performance. I also don't think we should be doing anything to optimise for multi-word sets at the expense of slowing down single-word sets. Not that it's very representative of the real world, but I added some telemetry to bitmapset.c and I see 43 multi-word sets being created and 1.45 million single-word sets created during make check, or 1 in 33785, if you prefer. diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index 786f343b3c9..21b5053e9a2 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -219,6 +219,10 @@ bms_make_singleton(int x) int wordnum, bitnum; + if (x >= 64) + elog(NOTICE, "multi-word set bms_make_singleton"); + else + elog(NOTICE, "single-word set bms_make_singleton"); if (x < 0) elog(ERROR, "negative bitmapset member not allowed"); wordnum = WORDNUM(x); @@ -811,6 +815,9 @@ bms_add_member(Bitmapset *a, int x) wordnum = WORDNUM(x); bitnum = BITNUM(x); + if (x >= 64 && a->nwords == 1) + elog(NOTICE, "multi-set bms_add_member"); + /* enlarge the set if necessary */ if (wordnum >= a->nwords) { Not quite perfect as a set made first as a single word set that later becomes a multi-word set will be double counted. The number of operations on the sets is likely more important anyway, not the number of sets being created. The point is, multi-word sets are rare for most workloads. David ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-06 02:00 Chao Li <[email protected]> parent: David Rowley <[email protected]> 0 siblings, 1 reply; 17+ messages in thread From: Chao Li @ 2026-04-06 02:00 UTC (permalink / raw) To: David Rowley <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]> > On Apr 4, 2026, at 11:30, David Rowley <[email protected]> wrote: > > On Sat, 4 Apr 2026 at 02:28, Chao Li <[email protected]> wrote: >> In test_bms_next2.c, you set words_to_alloc = 1, which disables the optimization in the unrolled version. If I change words_to_alloc = 2, then on my MacBook M4, the unrolled version seems to win: >> ``` >> chaol@ChaodeMacBook-Air test % ./test_bms_next2 >> Benchmarking 100000000 iterations... >> >> Original: 0.61559 seconds >> David's: 0.61651 seconds >> Chao's version: 1.06033 seconds >> Unrolled loop: 0.60561 seconds >> 2800000000 >> ``` > > Doing the same locally, here's what I get on my Zen2 machine: > > drowley@amd3990x:~$ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next > Benchmarking 100000000 iterations... > > Original: 1.21125 seconds > David's: 1.11997 seconds > Chao's version: 1.72662 seconds > Unrolled loop: 1.63090 seconds > 2800000000 > drowley@amd3990x:~$ clang test_bms_next.c -O2 -o test_bms_next && > ./test_bms_next > Benchmarking 100000000 iterations... > > Original: 1.10780 seconds > David's: 1.05968 seconds > Chao's version: 1.87123 seconds > Unrolled loop: 1.11002 seconds > 2800000000 I just tried my Windows laptop with both MSYS2 and WSL Ubuntu, and on both of them the original version and David’s version consistently performed better than the other versions. I have a new finding from Godbolt: adding likely(w != 0) seems to reduce the instruction count by one. bms_next_member_fast has 26 instructions, while after adding "likely", bms_next_member_fast2 has 25. It seems that this may avoid one jump and perhaps be slightly better, but I’m not sure. I’m really not an expert in assembly. Then I tested. “likely” doesn’t seem to help on Mac: ``` chaol@ChaodeMacBook-Air test % ./test_bms_next2 Benchmarking 100000000 iterations... Original: 0.54994 seconds David's: 0.78218 seconds David's likely: 0.78990 seconds Chao's version: 1.11530 seconds Unrolled loop: 0.47660 seconds 3500000000 ``` But it helps slightly on Windows: ``` $ ./test_bms_next2 Benchmarking 100000000 iterations... Original: 1.45312 seconds David's: 1.48438 seconds David's likely: 1.40625 seconds Chao's version: 3.12500 seconds Unrolled loop: 2.95312 seconds -794967296 ``` I’m not suggesting adding likely() to your version, I’m just sharing the information for your reference and leaving the decision to you. > >> I guess one-word Bitmapset are probably the most common case in practice. So overall, I agree that your version is the best choice. Please go ahead with it as we have already gained both a bug fix and a performance improvement. If we really want to study the performance more deeply, I think that would be a separate topic. > > Yeah, but it's better to find bottlenecks and focus there than to > optimise blindly without knowing if it'll make any actual difference. > It's still important to test for regressions when modifying code and > try to avoid them, as it's sometimes not obvious if the code being > modified is critical for performance. > > I also don't think we should be doing anything to optimise for > multi-word sets at the expense of slowing down single-word sets. Fully agreed. > Not > that it's very representative of the real world, but I added some > telemetry to bitmapset.c and I see 43 multi-word sets being created > and 1.45 million single-word sets created during make check, or 1 in > 33785, if you prefer. > > diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c > index 786f343b3c9..21b5053e9a2 100644 > --- a/src/backend/nodes/bitmapset.c > +++ b/src/backend/nodes/bitmapset.c > @@ -219,6 +219,10 @@ bms_make_singleton(int x) > int wordnum, > bitnum; > > + if (x >= 64) > + elog(NOTICE, "multi-word set bms_make_singleton"); > + else > + elog(NOTICE, "single-word set bms_make_singleton"); > if (x < 0) > elog(ERROR, "negative bitmapset member not allowed"); > wordnum = WORDNUM(x); > @@ -811,6 +815,9 @@ bms_add_member(Bitmapset *a, int x) > wordnum = WORDNUM(x); > bitnum = BITNUM(x); > > + if (x >= 64 && a->nwords == 1) > + elog(NOTICE, "multi-set bms_add_member"); > + > /* enlarge the set if necessary */ > if (wordnum >= a->nwords) > { > > Not quite perfect as a set made first as a single word set that later > becomes a multi-word set will be double counted. The number of > operations on the sets is likely more important anyway, not the number > of sets being created. The point is, multi-word sets are rare for most > workloads. > What tests did you run after adding the logs to collect the data? This is a method I might borrow in the future for similar investigations. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-12 12:33 David Rowley <[email protected]> parent: Chao Li <[email protected]> 0 siblings, 0 replies; 17+ messages in thread From: David Rowley @ 2026-04-12 12:33 UTC (permalink / raw) To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]> On Mon, 6 Apr 2026 at 14:01, Chao Li <[email protected]> wrote: > > On Apr 4, 2026, at 11:30, David Rowley <[email protected]> wrote: > > Not quite perfect as a set made first as a single word set that later > > becomes a multi-word set will be double counted. The number of > > operations on the sets is likely more important anyway, not the number > > of sets being created. The point is, multi-word sets are rare for most > > workloads. > > > > What tests did you run after adding the logs to collect the data? This is a method I might borrow in the future for similar investigations. "make check" then grep regression.diffs for the NOTICE message and pipe to "wc -l" ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-12 13:17 David Rowley <[email protected]> parent: David Rowley <[email protected]> 1 sibling, 2 replies; 17+ messages in thread From: David Rowley @ 2026-04-12 13:17 UTC (permalink / raw) To: Tom Lane <[email protected]>; +Cc: PostgreSQL Developers <[email protected]> On Fri, 3 Apr 2026 at 15:08, David Rowley <[email protected]> wrote: > IMO, if we can make bitmapset.c work with INT_MAX members and get a > performance increase, then we should do it. Re-thinking this after a week's holiday, it seems fine to use an unsigned 32-bit int rather than a 64-bit int to fix this bug. I'd previously been uncertain if there were any guarantees in C to what (unsigned int) -1 would return, but going by [1] at 6.3.1.3, it says: "Otherwise, if the new type is unsigned, the value is converted by repeatedly adding or subtracting one more than the maximum value that can be represented in the new type until the value is in the range of the new type." So, it seems even on one's complement that -1 as an unsigned int will be UINT_MAX. When we add 1 to UINT_MAX, we're guaranteed to get 0, as it's unsigned maths and overflows are going to result in a value modulus the max value for the type. That leads me to the attached v2 patch. Compiler Explorer link showing the assembly at [2]. Testing the performance, it's better than master, so I got rid of the size_t wordnum stuff. We're post-freeze now, so fiddling with other optimisations seems a bit off the table as there appears to be no performance regression to compensate for now, per: drowley@amd3990x:~$ gcc test_bms_next3.c -O2 -o test_bms_next3 && ./test_bms_next3 Benchmarking 100000000 bms_next_member iterations... master: 1.18330 seconds Patched: 1.05493 seconds Benchmarking 100000000 bms_prev_member iterations... master: 2.94522 seconds Patched: 1.86130 seconds drowley@amd3990x:~$ clang test_bms_next3.c -O2 -o test_bms_next3 && ./test_bms_next3 Benchmarking 100000000 bms_next_member iterations... master: 1.07860 seconds Patched: 1.07896 seconds Benchmarking 100000000 bms_prev_member iterations... master: 2.76550 seconds Patched: 2.12369 seconds David [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf [2] https://godbolt.org/z/xW96rxd3P #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <string.h> #include <time.h> #include <limits.h> //#define NULL ((void *) 0) typedef uint32_t uint32; typedef int32_t int32; typedef uint64_t uint64; typedef int64_t int64; #define BITS_PER_BITMAPWORD 64 typedef uint64 bitmapword; /* must be an unsigned type */ typedef int64 signedbitmapword; /* must be the matching signed type */ #define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) #define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) #ifdef __GNUC__ #define likely(x) __builtin_expect((x) != 0, 1) #define unlikely(x) __builtin_expect((x) != 0, 0) #else #define likely(x) ((x) != 0) #define unlikely(x) ((x) != 0) #endif typedef struct Bitmapset { int nwords; /* number of words in array */ bitmapword words[]; /* really [nwords] */ } Bitmapset; static inline int bmw_rightmost_one_pos(uint64 word) { return __builtin_ctzll(word); } static inline int bmw_leftmost_one_pos(uint64 word) { return 63 - __builtin_clzll(word); } int bms_next_member(const Bitmapset *a, int prevbit) { int nwords; bitmapword mask; if (a == NULL) return -2; nwords = a->nwords; prevbit++; mask = (~(bitmapword) 0) << BITNUM(prevbit); for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before prevbit */ w &= mask; if (w != 0) { int result; result = wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } int bms_next_member_patched(const Bitmapset *a, int prevbit) { unsigned int currbit = prevbit; int nwords; bitmapword mask; if (a == NULL) return -2; nwords = a->nwords; /* use an unsigned int to avoid the risk that int overflows */ currbit++; mask = (~(bitmapword) 0) << BITNUM(currbit); for (int wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before currbit */ w &= mask; if (w != 0) { int result; result = wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } int bms_prev_member(const Bitmapset *a, int prevbit) { int ushiftbits; bitmapword mask; /* * If set is NULL or if there are no more bits to the right then we've * nothing to do. */ if (a == NULL || prevbit == 0) return -2; /* transform -1 to the highest possible bit we could have set */ if (prevbit == -1) prevbit = a->nwords * BITS_PER_BITMAPWORD - 1; else prevbit--; ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(prevbit) + 1); mask = (~(bitmapword) 0) >> ushiftbits; for (int wordnum = WORDNUM(prevbit); wordnum >= 0; wordnum--) { bitmapword w = a->words[wordnum]; /* mask out bits left of prevbit */ w &= mask; if (w != 0) { int result; result = wordnum * BITS_PER_BITMAPWORD; result += bmw_leftmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } int bms_prev_member_patched(const Bitmapset *a, int prevbit) { unsigned int currbit; int ushiftbits; bitmapword mask; /* * If set is NULL or if there are no more bits to the right then we've * nothing to do. */ if (a == NULL || prevbit == 0) return -2; /* * Transform -1 to the highest possible bit we could have set. We do this * in unsigned math to avoid the risk of overflowing a signed int. */ if (prevbit < 0) currbit = (unsigned int) a->nwords * BITS_PER_BITMAPWORD - 1; else currbit = prevbit - 1; ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(currbit) + 1); mask = (~(bitmapword) 0) >> ushiftbits; for (int wordnum = WORDNUM(currbit); wordnum >= 0; wordnum--) { bitmapword w = a->words[wordnum]; /* mask out bits left of currbit */ w &= mask; if (w != 0) { int result; result = wordnum * BITS_PER_BITMAPWORD; result += bmw_leftmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } double get_time() { struct timespec ts; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts); return ts.tv_sec + ts.tv_nsec * 1e-9; } Bitmapset *bms; int main() { int words_to_alloc = 1; // Large set to bypass CPU cache slightly bms = malloc(sizeof(Bitmapset) + words_to_alloc * sizeof(bitmapword)); bms->nwords = words_to_alloc; memset(bms->words, 0, words_to_alloc * sizeof(bitmapword)); double start, end; int64 count = 0; /* Set a bit far into the set to force a long scan */ bms->words[words_to_alloc - 1] |= 0xaf4; int iterations = 100000000; printf("Benchmarking %d bms_next_member iterations...\n", iterations); /* master */ start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_next_member(bms, j)) >= 0) count++; } end = get_time(); printf("master: %.5f seconds\n", end - start); // Test David start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_next_member_patched(bms, j)) >= 0) count++; } end = get_time(); printf("Patched: %.5f seconds\n", end - start); printf("\nBenchmarking %d bms_prev_member iterations...\n", iterations); /* master */ start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_prev_member(bms, j)) >= 0) count++; } end = get_time(); printf("master: %.5f seconds\n", end - start); // Test David start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_prev_member_patched(bms, j)) >= 0) count++; } end = get_time(); printf("Patched: %.5f seconds\n", end - start); printf("%ld\n", count); free(bms); return 0; } Attachments: [text/plain] test_bms_next3.c (5.7K, 2-test_bms_next3.c) download | inline: #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <string.h> #include <time.h> #include <limits.h> //#define NULL ((void *) 0) typedef uint32_t uint32; typedef int32_t int32; typedef uint64_t uint64; typedef int64_t int64; #define BITS_PER_BITMAPWORD 64 typedef uint64 bitmapword; /* must be an unsigned type */ typedef int64 signedbitmapword; /* must be the matching signed type */ #define WORDNUM(x) ((x) / BITS_PER_BITMAPWORD) #define BITNUM(x) ((x) % BITS_PER_BITMAPWORD) #ifdef __GNUC__ #define likely(x) __builtin_expect((x) != 0, 1) #define unlikely(x) __builtin_expect((x) != 0, 0) #else #define likely(x) ((x) != 0) #define unlikely(x) ((x) != 0) #endif typedef struct Bitmapset { int nwords; /* number of words in array */ bitmapword words[]; /* really [nwords] */ } Bitmapset; static inline int bmw_rightmost_one_pos(uint64 word) { return __builtin_ctzll(word); } static inline int bmw_leftmost_one_pos(uint64 word) { return 63 - __builtin_clzll(word); } int bms_next_member(const Bitmapset *a, int prevbit) { int nwords; bitmapword mask; if (a == NULL) return -2; nwords = a->nwords; prevbit++; mask = (~(bitmapword) 0) << BITNUM(prevbit); for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before prevbit */ w &= mask; if (w != 0) { int result; result = wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } int bms_next_member_patched(const Bitmapset *a, int prevbit) { unsigned int currbit = prevbit; int nwords; bitmapword mask; if (a == NULL) return -2; nwords = a->nwords; /* use an unsigned int to avoid the risk that int overflows */ currbit++; mask = (~(bitmapword) 0) << BITNUM(currbit); for (int wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; /* ignore bits before currbit */ w &= mask; if (w != 0) { int result; result = wordnum * BITS_PER_BITMAPWORD; result += bmw_rightmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } int bms_prev_member(const Bitmapset *a, int prevbit) { int ushiftbits; bitmapword mask; /* * If set is NULL or if there are no more bits to the right then we've * nothing to do. */ if (a == NULL || prevbit == 0) return -2; /* transform -1 to the highest possible bit we could have set */ if (prevbit == -1) prevbit = a->nwords * BITS_PER_BITMAPWORD - 1; else prevbit--; ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(prevbit) + 1); mask = (~(bitmapword) 0) >> ushiftbits; for (int wordnum = WORDNUM(prevbit); wordnum >= 0; wordnum--) { bitmapword w = a->words[wordnum]; /* mask out bits left of prevbit */ w &= mask; if (w != 0) { int result; result = wordnum * BITS_PER_BITMAPWORD; result += bmw_leftmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } int bms_prev_member_patched(const Bitmapset *a, int prevbit) { unsigned int currbit; int ushiftbits; bitmapword mask; /* * If set is NULL or if there are no more bits to the right then we've * nothing to do. */ if (a == NULL || prevbit == 0) return -2; /* * Transform -1 to the highest possible bit we could have set. We do this * in unsigned math to avoid the risk of overflowing a signed int. */ if (prevbit < 0) currbit = (unsigned int) a->nwords * BITS_PER_BITMAPWORD - 1; else currbit = prevbit - 1; ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(currbit) + 1); mask = (~(bitmapword) 0) >> ushiftbits; for (int wordnum = WORDNUM(currbit); wordnum >= 0; wordnum--) { bitmapword w = a->words[wordnum]; /* mask out bits left of currbit */ w &= mask; if (w != 0) { int result; result = wordnum * BITS_PER_BITMAPWORD; result += bmw_leftmost_one_pos(w); return result; } /* in subsequent words, consider all bits */ mask = (~(bitmapword) 0); } return -2; } double get_time() { struct timespec ts; clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts); return ts.tv_sec + ts.tv_nsec * 1e-9; } Bitmapset *bms; int main() { int words_to_alloc = 1; // Large set to bypass CPU cache slightly bms = malloc(sizeof(Bitmapset) + words_to_alloc * sizeof(bitmapword)); bms->nwords = words_to_alloc; memset(bms->words, 0, words_to_alloc * sizeof(bitmapword)); double start, end; int64 count = 0; /* Set a bit far into the set to force a long scan */ bms->words[words_to_alloc - 1] |= 0xaf4; int iterations = 100000000; printf("Benchmarking %d bms_next_member iterations...\n", iterations); /* master */ start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_next_member(bms, j)) >= 0) count++; } end = get_time(); printf("master: %.5f seconds\n", end - start); // Test David start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_next_member_patched(bms, j)) >= 0) count++; } end = get_time(); printf("Patched: %.5f seconds\n", end - start); printf("\nBenchmarking %d bms_prev_member iterations...\n", iterations); /* master */ start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_prev_member(bms, j)) >= 0) count++; } end = get_time(); printf("master: %.5f seconds\n", end - start); // Test David start = get_time(); for (int i = 0; i < iterations; i++) { int j = -1; while ((j = bms_prev_member_patched(bms, j)) >= 0) count++; } end = get_time(); printf("Patched: %.5f seconds\n", end - start); printf("%ld\n", count); free(bms); return 0; } [application/octet-stream] v2-0001-Fix-unlikely-overflow-bug-in-bms_next_member.patch (3.2K, 3-v2-0001-Fix-unlikely-overflow-bug-in-bms_next_member.patch) download | inline diff: From d1c644ba4de05624fc5a63b3d5f4026c71ed4aa5 Mon Sep 17 00:00:00 2001 From: David Rowley <[email protected]> Date: Sat, 4 Apr 2026 16:41:23 +1300 Subject: [PATCH v2] Fix unlikely overflow bug in bms_next_member() ... and bms_prev_member(). Both of these functions won't work correctly when given a prevbit of INT_MAX. Here we fix that by using an unsigned int to calculate which member to look for next. In practise, Bitmapsets will never have such a large member, so no backpatch. Author: David Rowley <[email protected]> Reviewed-by: Chao Li <[email protected]> Discussion: https://postgr.es/m/CAApHDvq0T%3DiJ0Sf5TNE9yyWwfOeVjmrBt0wSywDnGD9Y4YJQBA%40mail.gmail.com --- src/backend/nodes/bitmapset.c | 33 +++++++++++++++++++-------------- 1 file changed, 19 insertions(+), 14 deletions(-) diff --git a/src/backend/nodes/bitmapset.c b/src/backend/nodes/bitmapset.c index 786f343b3c9..957172648c3 100644 --- a/src/backend/nodes/bitmapset.c +++ b/src/backend/nodes/bitmapset.c @@ -1289,6 +1289,7 @@ bms_join(Bitmapset *a, Bitmapset *b) int bms_next_member(const Bitmapset *a, int prevbit) { + unsigned int currbit = prevbit; int nwords; bitmapword mask; @@ -1297,13 +1298,15 @@ bms_next_member(const Bitmapset *a, int prevbit) if (a == NULL) return -2; nwords = a->nwords; - prevbit++; - mask = (~(bitmapword) 0) << BITNUM(prevbit); - for (int wordnum = WORDNUM(prevbit); wordnum < nwords; wordnum++) + + /* use an unsigned int to avoid the risk that int overflows */ + currbit++; + mask = (~(bitmapword) 0) << BITNUM(currbit); + for (int wordnum = WORDNUM(currbit); wordnum < nwords; wordnum++) { bitmapword w = a->words[wordnum]; - /* ignore bits before prevbit */ + /* ignore bits before currbit */ w &= mask; if (w != 0) @@ -1345,10 +1348,10 @@ bms_next_member(const Bitmapset *a, int prevbit) * It makes no difference in simple loop usage, but complex iteration logic * might need such an ability. */ - int bms_prev_member(const Bitmapset *a, int prevbit) { + unsigned int currbit; int ushiftbits; bitmapword mask; @@ -1362,22 +1365,24 @@ bms_prev_member(const Bitmapset *a, int prevbit) return -2; /* Validate callers didn't give us something out of range */ - Assert(prevbit <= a->nwords * BITS_PER_BITMAPWORD); - Assert(prevbit >= -1); + Assert(prevbit < 0 || prevbit <= (unsigned int) (a->nwords * BITS_PER_BITMAPWORD)); - /* transform -1 to the highest possible bit we could have set */ - if (prevbit == -1) - prevbit = a->nwords * BITS_PER_BITMAPWORD - 1; + /* + * Transform -1 to the highest possible bit we could have set. We do this + * in unsigned math to avoid the risk of overflowing a signed int. + */ + if (prevbit < 0) + currbit = (unsigned int) a->nwords * BITS_PER_BITMAPWORD - 1; else - prevbit--; + currbit = prevbit - 1; - ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(prevbit) + 1); + ushiftbits = BITS_PER_BITMAPWORD - (BITNUM(currbit) + 1); mask = (~(bitmapword) 0) >> ushiftbits; - for (int wordnum = WORDNUM(prevbit); wordnum >= 0; wordnum--) + for (int wordnum = WORDNUM(currbit); wordnum >= 0; wordnum--) { bitmapword w = a->words[wordnum]; - /* mask out bits left of prevbit */ + /* mask out bits left of currbit */ w &= mask; if (w != 0) -- 2.51.0 ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-12 23:41 David Rowley <[email protected]> parent: David Rowley <[email protected]> 1 sibling, 0 replies; 17+ messages in thread From: David Rowley @ 2026-04-12 23:41 UTC (permalink / raw) To: Tom Lane <[email protected]>; +Cc: PostgreSQL Developers <[email protected]> On Mon, 13 Apr 2026 at 01:17, David Rowley <[email protected]> wrote: > Re-thinking this after a week's holiday, it seems fine to use an > unsigned 32-bit int rather than a 64-bit int to fix this bug. I'd > previously been uncertain if there were any guarantees in C to what > (unsigned int) -1 would return, but going by [1] at 6.3.1.3, it says: > > "Otherwise, if the new type is unsigned, the value is converted by > repeatedly adding or subtracting one more than the maximum value that > can be represented in the new type until the value is in the range of > the new type." > > So, it seems even on one's complement that -1 as an unsigned int will > be UINT_MAX. When we add 1 to UINT_MAX, we're guaranteed to get 0, as > it's unsigned maths and overflows are going to result in a value > modulus the max value for the type. I've pushed that version. No backpatch. David ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-13 00:23 Chao Li <[email protected]> parent: David Rowley <[email protected]> 1 sibling, 1 reply; 17+ messages in thread From: Chao Li @ 2026-04-13 00:23 UTC (permalink / raw) To: David Rowley <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]> > On Apr 12, 2026, at 21:17, David Rowley <[email protected]> wrote: > > On Fri, 3 Apr 2026 at 15:08, David Rowley <[email protected]> wrote: >> IMO, if we can make bitmapset.c work with INT_MAX members and get a >> performance increase, then we should do it. > > Re-thinking this after a week's holiday, it seems fine to use an > unsigned 32-bit int rather than a 64-bit int to fix this bug. I'd > previously been uncertain if there were any guarantees in C to what > (unsigned int) -1 would return, but going by [1] at 6.3.1.3, it says: > > "Otherwise, if the new type is unsigned, the value is converted by > repeatedly adding or subtracting one more than the maximum value that > can be represented in the new type until the value is in the range of > the new type." > > So, it seems even on one's complement that -1 as an unsigned int will > be UINT_MAX. When we add 1 to UINT_MAX, we're guaranteed to get 0, as > it's unsigned maths and overflows are going to result in a value > modulus the max value for the type. > > That leads me to the attached v2 patch. Compiler Explorer link showing > the assembly at [2]. > > Testing the performance, it's better than master, so I got rid of the > size_t wordnum stuff. We're post-freeze now, so fiddling with other > optimisations seems a bit off the table as there appears to be no > performance regression to compensate for now, per: > > drowley@amd3990x:~$ gcc test_bms_next3.c -O2 -o test_bms_next3 && > ./test_bms_next3 > Benchmarking 100000000 bms_next_member iterations... > master: 1.18330 seconds > Patched: 1.05493 seconds > > Benchmarking 100000000 bms_prev_member iterations... > master: 2.94522 seconds > Patched: 1.86130 seconds > > drowley@amd3990x:~$ clang test_bms_next3.c -O2 -o test_bms_next3 && > ./test_bms_next3 > Benchmarking 100000000 bms_next_member iterations... > master: 1.07860 seconds > Patched: 1.07896 seconds > > Benchmarking 100000000 bms_prev_member iterations... > master: 2.76550 seconds > Patched: 2.12369 seconds > > David > > [1] https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf > [2] https://godbolt.org/z/xW96rxd3P > <test_bms_next3.c><v2-0001-Fix-unlikely-overflow-bug-in-bms_next_member.patch> I just tried test_bms_next3 on my MacBook, looks like patched bms_prev_member is much faster there. I ran about 10 times, the results were consistent. ``` chaol@ChaodeMacBook-Air test % gcc test_bms_next3.c -O2 -o test_bms_next3 test_bms_next3.c:281:18: warning: format specifies type 'long' but the argument has type 'int64' (aka 'long long') [-Wformat] 281 | printf("%ld\n", count); | ~~~ ^~~~~ | %lld 1 warning generated. chaol@ChaodeMacBook-Air test % ./test_bms_next3 Benchmarking 100000000 bms_next_member iterations... master: 0.49287 seconds Patched: 0.47315 seconds Benchmarking 100000000 bms_prev_member iterations... master: 1.04644 seconds Patched: 0.58053 seconds 2800000000 ``` Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/ ^ permalink raw reply [nested|flat] 17+ messages in thread
* Re: Small and unlikely overflow hazard in bms_next_member() @ 2026-04-13 02:43 David Rowley <[email protected]> parent: Chao Li <[email protected]> 0 siblings, 0 replies; 17+ messages in thread From: David Rowley @ 2026-04-13 02:43 UTC (permalink / raw) To: Chao Li <[email protected]>; +Cc: Tom Lane <[email protected]>; PostgreSQL Developers <[email protected]> On Mon, 13 Apr 2026 at 12:24, Chao Li <[email protected]> wrote: > I just tried test_bms_next3 on my MacBook, looks like patched bms_prev_member is much faster there. I ran about 10 times, the results were consistent. Thanks. I saw similar results. David ^ permalink raw reply [nested|flat] 17+ messages in thread
end of thread, other threads:[~2026-04-13 02:43 UTC | newest] Thread overview: 17+ messages (download: mbox mbox.gz follow: Atom feed) -- links below jump to the message on this page -- 2026-04-02 04:09 Small and unlikely overflow hazard in bms_next_member() David Rowley <[email protected]> 2026-04-02 04:22 ` Tom Lane <[email protected]> 2026-04-02 22:12 ` David Rowley <[email protected]> 2026-04-03 02:08 ` David Rowley <[email protected]> 2026-04-03 03:24 ` Chao Li <[email protected]> 2026-04-03 11:42 ` David Rowley <[email protected]> 2026-04-03 13:28 ` Chao Li <[email protected]> 2026-04-04 03:30 ` David Rowley <[email protected]> 2026-04-06 02:00 ` Chao Li <[email protected]> 2026-04-12 12:33 ` David Rowley <[email protected]> 2026-04-12 13:17 ` David Rowley <[email protected]> 2026-04-12 23:41 ` David Rowley <[email protected]> 2026-04-13 00:23 ` Chao Li <[email protected]> 2026-04-13 02:43 ` David Rowley <[email protected]> 2026-04-02 06:38 ` Chao Li <[email protected]> 2026-04-02 07:54 ` jie wang <[email protected]> 2026-04-02 22:20 ` David Rowley <[email protected]>
This inbox is served by agora; see mirroring instructions for how to clone and mirror all data and code used for this inbox