From: Diego Frias <[email protected]>
To: [email protected]
Subject: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization
Date: Mon, 1 Jun 2026 11:38:32 -0700
Message-ID: <[email protected]>
Hi hackers
I was browsing PostgreSQL’s Unicode normalization code and found an issue where the composition algorithm recognizes 0x11A7 as a T syllable and combines it with a preceding LV syllable. Per the Unicode specification:
    TBase is set to one less than the beginning of the range of trailing consonants, which starts at U+11A8. TCount is set to one more than the number of trailing consonants relevant to the decomposition algorithm: (11C2₁₆ - 11A8₁₆ + 1) + 1.
In short, TCount is one more than the number of T syllables. This is so that s % TCount == 0 implies that s has no T syllable (index 0 represents the absence of a T syllable), where s is the S-index of a precomposed Hangul character. Because PostgreSQL recognizes 0x11A7 as a T syllable, the composition algorithm yields a nonsense character when 0x11A7 is put in the T position. See https://github.com/unicode-rs/unicode-normalization/blob/576ae0b1407dd14854876c93f1a348df0c19dffe/sr... for a comment on this bug in Rust’s unicode-rs, and https://github.com/JuliaStrings/utf8proc/commit/0260ba56c81e5ef6f06c0804034a36284bcb8710 for a similar contribution I made to JuliaStrings/utf8proc a few months ago.
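To make the off-by-one concrete, here is a minimal standalone sketch of the LV + T bound check. It is not the PostgreSQL code: is_lv_syllable() and compose_lv_t() are hypothetical helpers, and the composed value is assumed to follow the spec's formula S = LV + TIndex. The strict flag switches between the current bound (code >= TBASE) and the proposed bound (code > TBASE).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hangul constants from the Unicode core spec (Chapter 3). */
#define SBASE   0xAC00u
#define TBASE   0x11A7u   /* one less than the first trailing consonant */
#define LCOUNT  19u
#define VCOUNT  21u
#define TCOUNT  28u       /* (0x11C2 - 0x11A8 + 1) + 1 = 27 + 1 */
#define SCOUNT  (LCOUNT * VCOUNT * TCOUNT)   /* 11172 */

/* Hypothetical helper: true if cp is a precomposed LV syllable (no T part). */
static bool is_lv_syllable(uint32_t cp)
{
    return cp >= SBASE && cp < SBASE + SCOUNT && (cp - SBASE) % TCOUNT == 0;
}

/*
 * Hypothetical helper mirroring the LV + T branch.
 * strict == false: current lower bound, code >= TBASE.
 * strict == true:  proposed lower bound, code > TBASE.
 */
static bool compose_lv_t(uint32_t start, uint32_t code, bool strict,
                         uint32_t *result)
{
    bool above_base = strict ? (code > TBASE) : (code >= TBASE);

    if (!is_lv_syllable(start) || !above_base || code >= TBASE + TCOUNT)
        return false;

    /* Assumed per the spec: S = LV + TIndex, with TIndex = code - TBASE. */
    *result = start + (code - TBASE);
    return true;
}
```

Under these assumptions, compose_lv_t(0xAC00, 0x11A8, true, &r) succeeds with r == 0xAC01, while compose_lv_t(0xAC00, 0x11A7, false, &r) also succeeds with a T index of 0, so U+11A7 is consumed and r is just the original LV syllable. With strict set to true, the same 0x11A7 input is rejected and left uncomposed.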
Let me know if this patch needs anything else. I can write a test for this, but it looks like the current testing setup in src/common/norm_test.c only runs the Unicode test suite and isn’t built for writing custom tests. If that is something of interest, though, I’m happy to add that to this patch.
Best,
Diego
Attachments:
[application/octet-stream] v1-0001-Fix-recognizing-0x11A7-as-a-Hangul-T-syllable-in-Uni.patch (1.4K, 2-v1-0001-Fix-recognizing-0x11A7-as-a-Hangul-T-syllable-in-Uni.patch)
From 37d7ba5193a8de6bd31a38a7d93a37b66db1dd9d Mon Sep 17 00:00:00 2001
From: Diego Frias <[email protected]>
Date: Mon, 1 Jun 2026 11:32:41 -0700
Subject: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode
 normalization
0x11A7 is not a valid Hangul T syllable despite being equal to TBASE.
This is because, per the Unicode spec:
    TBase is set to one less than the beginning of the range of trailing
    consonants, which starts at U+11A8. TCount is set to one more than the
    number of trailing consonants relevant to the decomposition algorithm:
    (11C2₁₆ - 11A8₁₆ + 1) + 1.
So the first valid Hangul T syllable is 0x11A8. Also see
https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59434
for where the spec describes the usage of 0x11A8, not 0x11A7, during
composition.
---
src/common/unicode_norm.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/common/unicode_norm.c b/src/common/unicode_norm.c
index cf84f202414..0534ae34640 100644
--- a/src/common/unicode_norm.c
+++ b/src/common/unicode_norm.c
@@ -236,7 +236,7 @@ recompose_code(uint32 start, uint32 code, uint32 *result)
     /* Check if two current characters are LV and T */
     else if (start >= SBASE && start < (SBASE + SCOUNT) &&
              ((start - SBASE) % TCOUNT) == 0 &&
-             code >= TBASE && code < (TBASE + TCOUNT))
+             code > TBASE && code < (TBASE + TCOUNT))
     {
         /* make syllable of form LVT */
         uint32 tindex = code - TBASE;
--
2.39.5 (Apple Git-154)