Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w8eZt-000kSH-0a for pgsql-hackers@arkaria.postgresql.org; Fri, 03 Apr 2026 13:28:49 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1w8eZr-00BjBp-2J for pgsql-hackers@arkaria.postgresql.org; Fri, 03 Apr 2026 13:28:48 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1w8eZr-00BjBe-0s for pgsql-hackers@lists.postgresql.org; Fri, 03 Apr 2026 13:28:47 +0000 Received: from mail-pf1-x42e.google.com ([2607:f8b0:4864:20::42e]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1w8eZp-00000000MMu-181f for pgsql-hackers@lists.postgresql.org; Fri, 03 Apr 2026 13:28:46 +0000 Received: by mail-pf1-x42e.google.com with SMTP id d2e1a72fcca58-82748257f5fso1854136b3a.1 for ; Fri, 03 Apr 2026 06:28:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1775222924; x=1775827724; darn=lists.postgresql.org; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=QxRSAyJXmALRCu42pHm4EoN6+Zf7jhvCTqiDV16CEBA=; b=n7fstB/czzjNvqA4JcQ1nhPeMLjAZc8LArsAOGPnO5ahk1b6FmOoR2sJm704kOXvVL JcCD2z2cUgctm5FrJpfP79qEo2fP3oQSWRymBQE+jdPwyrFW2RIMA75w27Efkr+CJPsD uscWo+BaDtSIV+Ye+llN5EbvRswkNz3tGnp8lp7Jm7SjjHXOW7T3K7kgTzkLscB8wKBI 46oXOsc1OYDfmELa3OoG/spRyrdnv1vNuNE8PTDW2rnfcD9JmSTPbwd8lSBb0GYojUBt VX6wJX4YNFDEctimcBFeU1eaG9j+W9iK97R67vF93A2ANWlaRzN0CMDYhDz0/GaoyOwu 4IVg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1775222924; x=1775827724; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:x-gm-gg:x-gm-message-state :from:to:cc:subject:date:message-id:reply-to; bh=QxRSAyJXmALRCu42pHm4EoN6+Zf7jhvCTqiDV16CEBA=; b=I7eCwiE3PDHM7TWsVrHikMpuT03vszUmUdoHBOKFW8LXUoPx/eEOjbD8JgVcq7l7Ck OEd++nFcRqj4qCIPCyUAdRt9LfOdxoviTAa+2snL8PfwUcjqYoTYV793zrG2M6tw9kWa 3U3q1k7N/tG6gPwoi1visqLrkkjLrbzJrnOu9rSLecMbg4hFluWBUVUCIiaWchX06hS4 uukMxrfASmJsvT2WNyJYAcX7g9kNStmwG0JYG5mqf8p+a9+VmCIUCUDo1WOvgQq44CrQ h5htHENXt4cD6NDFOpNu+jVEEzoHJ1bhtDDLSByNcZH0BG/6ihBqDHdAuyWkF0GanPLS S/wA== X-Forwarded-Encrypted: i=1; AJvYcCWmWzYbnA0cGSVRPttl7lKOnH0YuKaay6UJTcnAp5XbCWeFoUMWD/C8GwcRYEfxC+lmqnEj0Je/GeUXEGNB@lists.postgresql.org X-Gm-Message-State: AOJu0Yze//rqdo5JN0ZWC7MiFM+iYw4m/1MrLppAgyW1VyVK9jFTxn/c md2decDRfNaueOJb0qWoe/7Iy0ah5PAd3TV0LFXf17hEXcfVM2DtUFqx X-Gm-Gg: ATEYQzwBAqxobTTe3pfyRdGYXNwYmzBzmumz7tuW7jvGPw/15ywQxpb5+QAGZHosd6p sMBj/jAAAGe22N4v3ntlF0T7qtD5XReP4FTIaeVFcaq5y10SxdtxAsgby9ruyjJA2jxhI2M9ikT V6nfd4pmbxQx35aqCxKZ1+yzBPpYedEhbQzHZTvV8yzyAHgPH/h4fEnXru5V/E9KovVzvWsBjci jNv5/LPss5MHz1NDUPF23Wasbf7yAWmrl5qSxB4VVU5ZveXV4OoCABwSkCywDVm34DisjMMINvn CHmKysNzbhIKg2EyeeAKO/npi4g1g/MZ40lvUAXIC2/fJxFfli1enTE259GglPIHHILQQ9jGrSz 4HqwStdNOho7sxxL4PBjIMVK7KvTIH//v/Uj/awDVFrNQ7nXR3YeUQC2zPvl0EVu20fyPyDmp9G hqeQPbTazhBdxsTZiR86vaZZ3khyvEdZw= X-Received: by 2002:a05:6a00:1a8e:b0:824:3ef6:a815 with SMTP id d2e1a72fcca58-82d00231a64mr5487033b3a.8.1775222923996; Fri, 03 Apr 2026 06:28:43 -0700 (PDT) Received: from smtpclient.apple ([45.32.121.103]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-82cf9b3ccc8sm7882733b3a.19.2026.04.03.06.28.41 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Fri, 03 Apr 2026 06:28:43 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3864.400.21\)) Subject: Re: Small and unlikely overflow hazard in bms_next_member() From: Chao Li In-Reply-To: Date: Fri, 3 Apr 2026 21:28:03 +0800 Cc: Tom Lane , PostgreSQL Developers Content-Transfer-Encoding: quoted-printable Message-Id: References: <3190647.1775103768@sss.pgh.pa.us> <59B9EFAF-84DF-40A9-847F-9CF457A798BB@gmail.com> To: David Rowley X-Mailer: Apple Mail (2.3864.400.21) List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk > On Apr 3, 2026, at 19:42, David Rowley wrote: >=20 > On Fri, 3 Apr 2026 at 16:24, Chao Li wrote: >> I also did a load test with a standalone c program with 4 versions: >=20 > I think you'd be better off picking a more realistic Bitmapset. The > code is doing: >=20 > int words_to_alloc =3D 20000; // Large set to bypass CPU cache = slightly > Bitmapset *bms =3D malloc(sizeof(Bitmapset) + words_to_alloc * > sizeof(bitmapword)); > bms->nwords =3D words_to_alloc; > memset(bms->words, 0, words_to_alloc * sizeof(bitmapword)); >=20 > /* Set a bit far into the set to force a long scan */ > int target_bit =3D (words_to_alloc - 1) * 64 + 10; > bms->words[words_to_alloc - 1] |=3D (1ULL << 10); >=20 > A set with 20k words! Then you're doing: >=20 > for (int i =3D 0; i < iterations; i++) sink =3D = bms_next_member(bms, 0); >=20 > So you're not looping correctly over the set, as looping over a set > with 1 member requires 2 calls to the function. >=20 > I'd say this unrealistically favours the unrolled loop version, as > most of the effort is in the loop looping over empty words and you can > do that without the extra bit-wise AND that the other versions have. > That version also seems to apply both the 64-bit fix and the INT_MAX > fix. Why both? Because you're only calling the function once per > iteration, it also means your || INT_MAX is only executed once, rather > than n_members + 1 as it would normally be executed. That'll make your > version seem better than it is. >=20 > I fixed all that and changed the Bitmapset to something more realistic > with how it's more likely to be seen in PostgreSQL and ran the > benchmark with a proper bms_next_member loop. File attached and > results below: >=20 > Zen2 Linux >=20 > $ gcc test_bms_next.c -O2 -o test_bms_next && ./test_bms_next > Benchmarking 100000000 iterations... >=20 > Original: 0.62113 seconds > David's: 0.70257 seconds > Chao's version: 1.70691 seconds > Unrolled loop: 1.56239 seconds > 2800000000 >=20 > $ clang test_bms_next.c -O2 -o test_bms_next && ./test_bms_next > Benchmarking 100000000 iterations... >=20 > Original: 1.11263 seconds > David's: 0.87399 seconds > Chao's version: 0.52962 seconds > Unrolled loop: 1.07799 seconds > 2800000000 >=20 > Apple M2: >=20 > Benchmarking 100000000 iterations... >=20 > Original: 0.64023 seconds > David's: 0.41087 seconds > Chao's version: 1.21113 seconds > Unrolled loop: 0.55924 seconds > 2800000000 >=20 >> I=E2=80=99m curious that, when something performs differently across = platforms, which platform should take priority? >=20 > I don't think it matters too much for this case. If we were actually > working on a function that was a bottleneck, then it would be worth > looking deeper and seeing if the compilers are producing suboptimal > code and then try to coax them into making better code. As a last > resort, have two versions and some preprocessor to decide which one. > However, this just isn't performance-critical enough to warrant such > an effort. I think we're already going far too far with this. I just > want to fix the bug. If performance increases in some way as a result > of that, then good. >=20 > If I sum up the times from each version of the above results, I get: >=20 > Original: 2.37399 > David's: 1.98743 > Chao's: 3.44766 > Unrolled: 3.19962 >=20 > I'm happy to go with the fastest one. I expect the one with the fewest > instructions is likely more important when running it in the planner, > as that's less to decode. I didn't look, but I expect your loop > unrolled version and your || INT_MAX version to have more > instructions. >=20 > David > In test_bms_next2.c, you set words_to_alloc =3D 1, which disables the = optimization in the unrolled version. If I change words_to_alloc =3D 2, = then on my MacBook M4, the unrolled version seems to win: ``` chaol@ChaodeMacBook-Air test % ./test_bms_next2 Benchmarking 100000000 iterations... Original: 0.61559 seconds David's: 0.61651 seconds Chao's version: 1.06033 seconds Unrolled loop: 0.60561 seconds 2800000000 ``` (I also checked on Godbolt, from the instruction-count perspective, the = unrolled version actually looks the worst.) I guess one-word Bitmapset are probably the most common case in = practice. So overall, I agree that your version is the best choice. = Please go ahead with it as we have already gained both a bug fix and a = performance improvement. If we really want to study the performance more = deeply, I think that would be a separate topic. It was fun working on this patch, and thank you for your patience. Best regards, -- Chao Li (Evan) HighGo Software Co., Ltd. https://www.highgo.com/