From: Diego Frias <[email protected]>
To: Michael Paquier <[email protected]>
Cc: [email protected]
Subject: Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization
Date: Thu, 4 Jun 2026 09:32:53 -0700
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>
References: <[email protected]>
<[email protected]>
Looks great! Thanks for letting me know where the tests live. I’ll
try to get these tests in the official Unicode test suite, too. Should
help future implementors.
Thanks,
Diego
> On Jun 3, 2026, at 9:07 PM, Michael Paquier <[email protected]> wrote:
>
> On Mon, Jun 01, 2026 at 11:38:32AM -0700, Diego Frias wrote:
>> In short, TCount actually counts 1 more than the number of T
>> syllables; this is so s % TCount == 0 implies that s has no T
>> syllable (because the 0th place represents the absence of a T
>> syllable), where s is the s-index of a precomposed Hangul
>> character. Anyway, since PostgreSQL recognizes 0x11A7 as a T
>> syllable, the composition algorithm yields a nonsense character when
>> 0x11A7 is put in the T position.
>
> Oops. Yes, including TBASE in the recomposition is incorrect. I found
> your quote here ("TBase is set to one less..."):
> https://unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59688
>
> The character gets eaten by the normalization. Not good.
>
>> Let me know if this patch needs anything else. I can write a test
>> for this, but it looks like the current testing setup in
>> src/common/norm_test.c only runs the Unicode test suite and isn’t
>> built for writing custom tests. If that is something of interest,
>> though, I’m happy to add that to this patch.
>
> We have a set of tests in src/test/regress/sql/unicode.sql that would
> fit nicely with what you want to address here. For this specific
> problem, this would work:
> SELECT normalize(U&'\AC00\11A7', NFC) = U&'\AC00\11A7';
>
> How about adding more normalization check patterns while at it? I am
> finishing with the attached, all things combined. Diego, what do you
> think?
> --
> Michael
> <0001-Fix-off-by-one-with-NFC-recomposition-for-Hangul-U-1.patch>
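The off-by-one discussed in the quoted text can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the code from the attached patch: the function name compose_lv_t is hypothetical, and the constants are the standard values from the Hangul composition section of the Unicode core specification. The point it shows is that the lower bound on the trailing consonant must be strict, because TBASE (U+11A7) is only the "no trailing consonant" placeholder.

```c
#include <stdint.h>

/* Constants from the Unicode core specification (Hangul syllable composition). */
#define SBASE  0xAC00
#define TBASE  0x11A7   /* one less than the first trailing consonant, U+11A8 */
#define LCOUNT 19
#define VCOUNT 21
#define TCOUNT 28       /* 27 trailing consonants plus one slot meaning "no T" */
#define NCOUNT (VCOUNT * TCOUNT)
#define SCOUNT (LCOUNT * NCOUNT)

/*
 * Hypothetical helper (name chosen for illustration only): try to combine a
 * precomposed LV syllable with a trailing consonant.
 *
 * Returns 1 and stores the composed code point in *out on success,
 * or returns 0 and leaves *out untouched when the pair must not compose.
 */
int
compose_lv_t(uint32_t lv, uint32_t t, uint32_t *out)
{
    /* lv must be a precomposed Hangul syllable */
    if (lv < SBASE || lv >= SBASE + SCOUNT)
        return 0;

    /* s-index % TCOUNT == 0 means the syllable has no trailing consonant yet */
    if ((lv - SBASE) % TCOUNT != 0)
        return 0;

    /*
     * Strict lower bound: TBASE itself is not a trailing consonant.
     * Accepting t == TBASE is the off-by-one described in this thread.
     */
    if (t <= TBASE || t >= TBASE + TCOUNT)
        return 0;

    *out = lv + (t - TBASE);
    return 1;
}
```

With this check, compose_lv_t(0xAC00, 0x11A8, &out) succeeds and stores U+AC01, while compose_lv_t(0xAC00, 0x11A7, &out) returns 0, so the sequence U+AC00 U+11A7 is left as is, which is what the regression query quoted above expects.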