Re: Small and unlikely overflow hazard in bms_next_member()

public inbox for [email protected]  
help / color / mirror / Atom feed

From: Chao Li <[email protected]>
To: David Rowley <[email protected]>
Cc: Tom Lane <[email protected]>
Cc: PostgreSQL Developers <[email protected]>
Subject: Re: Small and unlikely overflow hazard in bms_next_member()
Date: Fri, 3 Apr 2026 11:24:08 +0800
Message-ID: <[email protected]> (raw)
In-Reply-To: <CAApHDvqTUm3Cbgz3ZLV+ad8s_HJHZYrVbrBvGyPQdxCRR-6dvA@mail.gmail.com>
References: <CAApHDvq0T=iJ0Sf5TNE9yyWwfOeVjmrBt0wSywDnGD9Y4YJQBA@mail.gmail.com>
	<[email protected]>
	<CAApHDvrvvq_m+nRwjsOpCsFa4EtVtmvJX7zAD=Siria-x6DpbQ@mail.gmail.com>
	<CAApHDvqTUm3Cbgz3ZLV+ad8s_HJHZYrVbrBvGyPQdxCRR-6dvA@mail.gmail.com>

> On Apr 3, 2026, at 10:08, David Rowley <[email protected]> wrote:
> 
> On Fri, 3 Apr 2026 at 11:12, David Rowley <[email protected]> wrote:
>> 
>> On Thu, 2 Apr 2026 at 17:22, Tom Lane <[email protected]> wrote:
>>> I don't think we should add cycles here for this purpose.
>> 
>> I'm not keen on slowing things down for this either. I did do some
>> experiments in [1] that sees fewer instructions from using 64-bit
>> maths. I might go off and see if there are any wins there that also
>> give us the INT_MAX fix. It's not great effort to reward ratio
>> though...
> 
> The reduction in instructions with the patched version got me curious
> to see if it would translate into a performance increase.  I tested on
> an AMD Zen2 machine, and it's a decent amount faster than master. I
> tested with gcc and clang.
> 
> I also scanned over the remaining parts of bitmapset.c and didn't find
> anywhere else that has overflow risk aside from what you pointed out
> in bms_prev_member().
> 
> The attached patch contains the benchmark function I added to the
> test_bitmapset module. It should apply to master with a bit of noise.
> 
> CREATE EXTENSION test_bitmapset;
> SELECT
>    generate_series(1,3) AS run,
>    bench_bms_next_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS
> bms_next_member_us,
>    bench_bms_prev_member('(b 1 2 3 4 5 6 7 8 64)', 1000000)/1000 AS
> bms_prev_member_us;
> 
> master (gcc)
> 
> run | bms_next_member_us | bms_prev_member_us
> -----+--------------------+--------------------
>   1 |              26473 |              40404
>   2 |              26218 |              40413
>   3 |              26209 |              40387
> 
> patched (gcc)
> 
> run | bms_next_member_us | bms_prev_member_us
> -----+--------------------+--------------------
>   1 |              25409 |              29705
>   2 |              24905 |              29693
>   3 |              24870 |              29707
> 
> Times are in microseconds to do 1 million bms_*_member() loops over
> the entire set.
> 
> I've also attached the full results I got. I've also included the
> results from Chao's version, which does slow things down decently on
> clang.
> 
> IMO, if we can make bitmapset.c work with INT_MAX members and get a
> performance increase, then we should do it.
> 
> David
> 
>> [1] https://godbolt.org/z/Eh1vzssq7
> <benchmark_results.txt><bms_fixes.patch>

I also did a load test with a standalone c program with 4 versions:

* The original bms_next_member (Original)
* The fast version from [1], that uses 64bit maths (Fast)
* The original version + INT32_MAX check + 64bit maths (Original2)
* I tried the other approach that pulls up the first iteration, so that removes "mask = (~(bitmapword) 0);” from the loop. (PullUp)

Note: all tests used -O2 to build the executable.

On my MacBook M4, the Fast version constantly won, and PullUp version performed badly.
```
% gcc --version
Apple clang version 17.0.0 (clang-1700.6.4.2)
Target: arm64-apple-darwin25.3.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
```

A typical test run:
```
Benchmarking 100000 iterations...

Original:  0.48893 seconds
Fast:      0.46979 seconds
Original2:      0.47740 seconds
PullUp: 0.48029 seconds
``` 

On my Windows laptop, Intel(R) Core Ultra 5, with WSL based Ubuntu, Orignal2 won in the most runs, and the PullUp version was faster than Fast version.
```
chaol@lichao-highgo:~$ gcc --version                                                                            
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0                                                                      
Copyright (C) 2023 Free Software Foundation, Inc.                                                              
This is free software; see the source for copying conditions.  There is NO                                      
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

A typical test run:
```
Original:  0.99849 seconds                                                                                      
Fast:      0.74722 seconds                                                                                      
Original2:      0.59407 seconds                                                                                
PullUp: 0.62746 seconds     
```

Then I also tried to run on Windows directly. Here, PullUp version performed the best.
```
$ gcc --version
gcc.exe (Rev13, Built by MSYS2 project) 15.2.0
Copyright (C) 2025 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
```

A typical test run:
```
Original:  0.32931 seconds
Fast:      0.32740 seconds
Original2:      0.32378 seconds
PullUp: 0.30795 seconds
```

I’m curious that, when something performs differently across platforms, which platform should take priority?

Please see the attached test program. It’s possible I did something wrong.

[1] https://godbolt.org/z/Eh1vzssq7

--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

Attachments:

  [application/octet-stream] test_bms_next.c (5.4K, 2-test_bms_next.c)
  download

view thread (17+ messages)  latest in thread

reply

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Reply to all the recipients using the --to and --cc options:
  reply via email

  To: [email protected]
  Cc: [email protected], [email protected], [email protected], [email protected]
  Subject: Re: Small and unlikely overflow hazard in bms_next_member()
  In-Reply-To: <[email protected]>

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

This inbox is served by agora; see mirroring instructions
for how to clone and mirror all data and code used for this inbox