Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wAFGs-002CSR-1w for pgsql-hackers@arkaria.postgresql.org; Tue, 07 Apr 2026 22:51:47 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wAFGr-002w2J-02 for pgsql-hackers@arkaria.postgresql.org; Tue, 07 Apr 2026 22:51:45 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wAFGq-002w2B-26 for pgsql-hackers@lists.postgresql.org; Tue, 07 Apr 2026 22:51:45 +0000 Received: from meesny.iki.fi ([2001:67c:2b0:1c1::201]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1wAFGo-00000001C58-2mns for pgsql-hackers@lists.postgresql.org; Tue, 07 Apr 2026 22:51:44 +0000 Received: from [10.0.2.15] (unknown [130.41.208.1]) (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits) key-exchange x25519 server-signature RSA-PSS (2048 bits) server-digest SHA256) (No client certificate requested) (Authenticated sender: hlinnaka) by meesny.iki.fi (Postfix) with ESMTPSA id 4fr1f45xz4zyQn; Wed, 08 Apr 2026 01:51:40 +0300 (EEST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=iki.fi; s=meesny; t=1775602301; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=qg2y1MykTJkRlV0szeMf7xw3wb8UQpma61o6Fbct9kc=; b=WebFiLW3xP6bLx6lpwCbbX7BmCLAgErLqjyhPmZMusthzbVWztFfHMcse2sotob9YGmuWr O2+dCX1Hj0baasevAD+eMYRVd70EyfvSFRCZMuQ4D8qioRa6YkLEts6R9Ccq77JGf55jrv 0kcMVSKFq85dxwX3wyaqrqo4GBW/O30= ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=iki.fi; s=meesny; t=1775602301; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=qg2y1MykTJkRlV0szeMf7xw3wb8UQpma61o6Fbct9kc=; b=fEbx/WwLDWW2qefTDT3/RCgEaa0LNT3Q5fHtm6qANEtkOowjz/WeWBuWUep/jrT1bxibIT OCB7a4rKBl8H9gLMx5YjvcgOo4DvOvv59whtZLjkA3F846kFJZnkF37daJDHQ1fBMgNH8W OEnz5IuH0+9tAbac/em6FOnJwZkwp5w= ARC-Authentication-Results: i=1; ORIGINATING; auth=pass smtp.auth=hlinnaka smtp.mailfrom=hlinnaka@iki.fi ARC-Seal: i=1; a=rsa-sha256; d=iki.fi; s=meesny; cv=none; t=1775602301; b=pHDMjfErxP3xNucEvd0mXohrjY2ENuQvYGcjM4G6VVh3a66ikscDO426lQnolFW2Ypafzr l/R3xmjNBqJ4gJFY9OMiLswSWHADqiftSd3KuM8mnx9xbEnQy7LKkLWmYMrRnX8Dhf/G5x EvIgH2dNqpiLPiNCW3Y14BddEb82nrU= Message-ID: <2f0ff747-0f8f-4627-9022-51eb59ee27c0@iki.fi> Date: Wed, 8 Apr 2026 01:51:39 +0300 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: Optimization of the is_normalized() function. To: Alexander Borisov , PostgreSQL Hackers References: <387cd5a3-2600-4c3c-b236-576087ef7062@gmail.com> <64841180-99bf-452f-af11-2c62db22ca78@gmail.com> Content-Language: en-US From: Heikki Linnakangas In-Reply-To: <64841180-99bf-452f-af11-2c62db22ca78@gmail.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk On 21/10/2025 19:17, Alexander Borisov wrote: > I continue to work on optimizations and improvements in Unicode. > This patch changes the approach to storing data to determine whether > a code point is normalized (YES, NO, MAYBE). > In other words, we are talking about Normalization Quick Check. > > Currently, a perfect hash is used to store knowledge about the > code point. I suggest abandoning the use of perfect hash and storing > code point data in bits packed into uint8. > This will give us almost direct access to the data. > > The essence is simple. > 1. Take the code point and divide it by 8. This gives us the index > in the uint8 table. > 2. We need to get the bit for the code point. Take the code point > modulo 8 to get the desired offset. > > Since we need to store one of three values (YES, NO, MAYBE) for each > code point, we are forced to use two bits. So everything is simply > multiplied by 2. I like the simplicity of that. > This is how obtaining a value by code point looks like in code: > > uc = ch * 2; > index = uc / 8; > bit = uc % 8; > > found = UnicodeNormProps_NFC_QC[index]; > found >>= bit; > found &= ~(PG_UINT8_MAX << 2); This way of formulating it seems complicated though. But that's a small detail. > In terms of speed: > > Used code points: 0-9, a-z, A-Z (ASCII) > With patch: tps = 256.279737 > Without patch: tps = 218.902135 That's nice, but ASCII codepoints are handled by the "ch < UNICODE_NFKC_QC_MIN" fastpath, and we could easily add such a fastpath to the perfect hash implementation too. > Used code points: а-я, А-Я (Cyrillic) > With patch: tps = 156.979941 > Without patch: tps = 146.339438 I was hoping for more gain here... The new tables are larger, which might explain some of that. And that will hurt more with more real world workloads, as the tables consume cache space that could be used for other stuff. Perhaps a 2-level radix tree would be more efficient here too? There are long ranges of YES values in the table, which could be collapsed together in a radix tree. > Thoughts out loud: > It is also known that in Postgres, Normalization Quick Check does > not perform a fair check for NFD and NFKD because the hash tables became > too large. For these forms, the result MAYBE is always returned. > We can implement a fair check for NFD and NFKD forms using the specified > approach. In this case, the binary file will increase in size by 99040 > bytes. Considering that this is only used on the backend. I don't have a good feeling for how common these operations are, whether it's worth the tradeoff. One thought with these quick check tables is that they don't need to be 100% accurate, we replace any YES or NO with MAYBE and still get the correct result. We could take advantage of that and compress the tables in a lossy fashion, or even use something like a bloom filter here. Again not sure what the right tradeoff would be. - Heikki