Message-ID: <2f0ff747-0f8f-4627-9022-51eb59ee27c0@iki.fi>
Date: Wed, 8 Apr 2026 01:51:39 +0300
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: Optimization of the is_normalized() function.
To: Alexander Borisov <lex.borisov@gmail.com>,
 PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
References: <387cd5a3-2600-4c3c-b236-576087ef7062@gmail.com>
 <64841180-99bf-452f-af11-2c62db22ca78@gmail.com>
Content-Language: en-US
From: Heikki Linnakangas <hlinnaka@iki.fi>
In-Reply-To: <64841180-99bf-452f-af11-2c62db22ca78@gmail.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Archived-At: 
 <https://www.postgresql.org/message-id/2f0ff747-0f8f-4627-9022-51eb59ee27c0%40iki.fi>
Precedence: bulk

On 21/10/2025 19:17, Alexander Borisov wrote:
> I continue to work on optimizations and improvements in Unicode.
> This patch changes the approach to storing data to determine whether
> a code point is normalized (YES, NO, MAYBE).
> In other words, we are talking about Normalization Quick Check.
> 
> Currently, a perfect hash is used to store knowledge about the
> code point. I suggest abandoning the use of perfect hash and storing
> code point data in bits packed into uint8.
> This will give us almost direct access to the data.
> 
> The essence is simple.
> 1. Take the code point and divide it by 8. This gives us the index
>    in the uint8 table.
> 2. We need to get the bit for the code point. Take the code point
>    modulo 8 to get the desired offset.
> 
> Since we need to store one of three values (YES, NO, MAYBE) for each
> code point, we are forced to use two bits. So everything is simply
> multiplied by 2.

I like the simplicity of that.

> This is how obtaining a value by code point looks like in code:
> 
> uc = ch * 2;
> index = uc / 8;
> bit = uc % 8;
> 
> found = UnicodeNormProps_NFC_QC[index];
> found >>= bit;
> found &= ~(PG_UINT8_MAX << 2);

This way of formulating it seems complicated though. But that's a small 
detail.

> In terms of speed:
> 
> Used code points: 0-9, a-z, A-Z (ASCII)
> With patch: tps = 256.279737
> Without patch: tps = 218.902135

That's nice, but ASCII codepoints are handled by the "ch < 
UNICODE_NFKC_QC_MIN" fastpath, and we could easily add such a fastpath 
to the perfect hash implementation too.

> Used code points: а-я, А-Я (Cyrillic)
> With patch: tps = 156.979941
> Without patch: tps = 146.339438

I was hoping for more gain here...

The new tables are larger, which might explain some of that. And that 
will hurt more with more real world workloads, as the tables consume 
cache space that could be used for other stuff.

Perhaps a 2-level radix tree would be more efficient here too? There are 
long ranges of YES values in the table, which could be collapsed 
together in a radix tree.

> Thoughts out loud:
> It is also known that in Postgres, Normalization Quick Check does
> not perform a fair check for NFD and NFKD because the hash tables became
> too large. For these forms, the result MAYBE is always returned.
> We can implement a fair check for NFD and NFKD forms using the specified
> approach. In this case, the binary file will increase in size by 99040
> bytes. Considering that this is only used on the backend.

I don't have a good feeling for how common these operations are, whether 
it's worth the tradeoff.

One thought with these quick check tables is that they don't need to be 
100% accurate, we replace any YES or NO with MAYBE and still get the 
correct result. We could take advantage of that and compress the tables 
in a lossy fashion, or even use something like a bloom filter here. 
Again not sure what the right tradeoff would be.

- Heikki