Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wUzMU-001aw0-2N for pgsql-hackers@arkaria.postgresql.org; Thu, 04 Jun 2026 04:07:19 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wUzMS-004gvJ-2E for pgsql-hackers@arkaria.postgresql.org; Thu, 04 Jun 2026 04:07:16 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wUzMR-004gv9-2T for pgsql-hackers@lists.postgresql.org; Thu, 04 Jun 2026 04:07:16 +0000 Received: from fout-b3-smtp.messagingengine.com ([202.12.124.146]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.98.2) (envelope-from ) id 1wUzMO-00000001AmE-1qnd for pgsql-hackers@lists.postgresql.org; Thu, 04 Jun 2026 04:07:15 +0000 Received: from phl-compute-10.internal (phl-compute-10.internal [10.202.2.50]) by mailfout.stl.internal (Postfix) with ESMTP id A3A331D00143; Thu, 4 Jun 2026 00:07:08 -0400 (EDT) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-10.internal (MEProxy); Thu, 04 Jun 2026 00:07:08 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=paquier.xyz; h= cc:cc:content-type:content-type:date:date:from:from:in-reply-to :in-reply-to:message-id:mime-version:references:reply-to:subject :subject:to:to; s=fm1; t=1780546028; x=1780632428; bh=ipyZhV81lO arYiBtIEBliPAT/B1r5ArP6tAyy5bUsho=; b=JNcVx5eqGpdcL2WktsDfwtjKnx waafwQkxqfA3YAOVOj10fvbm6ptbHDzyNrfaojxhW3hg4COO1A/yqMFcsNIYy9Kx igiRDi864h+ijUBtRedhZqm62zHFx8DFu74xZ4CHMyX2jIPYB+445O3he7Oz6yHN eLd/bk/Y937oxAUxw3JSa5t7MAQKP4saCrAUINM2SyNOJFi4W6WvmGkWjSgtlOaX 757r1/yRDyVHevtxKG66Zqa+wmYayAik8tFMPrPQpSmk5llTlVGUk7CkoZx2xMjM h5B5Xj/FuMbNQq5Huxhl3ErQCU0AJ40B+Qn0KcLApaj1qIVZsmE3n/TvzcIg== DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d= messagingengine.com; h=cc:cc:content-type:content-type:date:date :feedback-id:feedback-id:from:from:in-reply-to:in-reply-to :message-id:mime-version:references:reply-to:subject:subject:to :to:x-me-proxy:x-me-sender:x-me-sender:x-sasl-enc; s=fm1; t= 1780546028; x=1780632428; bh=ipyZhV81lOarYiBtIEBliPAT/B1r5ArP6tA yy5bUsho=; b=QcI9QuwDiQcFgeHv+uZ5Q9x3FWSKPRf1NybRpq4crCZHW4lipyj 6SHXJLL3H+7UZdClysOe3LR4TVxryjBLUPmCRhpUIkUElFNEQkQw2jX5lLuqqwoz W/XBtBmP+lvkbpTxMQxa4xZszflwDoiSUAnAMjpvps9a2bTL2vyd/9I7l5wj9Asg 7ThOPk5fQKs+Djk54KEU65GdcY/UgGCmuHRiG5g19bWVL/1AMCC9FYZrB+AY5Jg2 rsETUkQwKHiGog+4x8wlfPsCaLki1nr7Ds0lknCgQ0DRMW5yWfBNT2H+aI0qtgv7 SuPgMKZlwCA63CWpRvwF/VdmU5G8aGrLuUA== X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: dmFkZTG3sHx0KovYWZOSEC/xNvWV2a7Vz15o0i9IgJNMvo+Z86NwK10vQK4pKPbJoWu6Tr ryvCOFWPpuv7gwhdsH+YULVBUVRidx9S5finLL9NM87VyGFSkpiCVqHUbaPzz3RgoVDtiW iVVXMTema2qPlNdRhRTFS/gVpXd66DxsnEBHnM9jWKUQ8Pg0niXZ2L9zxJKidVKGJkkOwH I8VMFXBgcZBSOFrZCVFvEGG7OZf/h3gStkeNLtwvqEcyk21uEhEZBgeJUBQ64TUI24LK0R go3j813AUdARy0IoiU8Oie4amerLDhh9uoHLvkj/GraZ2fLc6DhmIHheKbpCaYfpfbDoU4 m3rw/BIJrw0z+cBEode2F7jDjXpcfzuuCsGl+M7g1mrq+9i3VUhussJfnVoycWUnhbPDZh J+dByEbyeObNI67rKmLQU80EEnZmWxRaaLXL2wwNcyPQO6ZY/0vh4pvOpuR1DWoelZurqv MEYFIFmKoCezSvcP24i+o1czy2csJH0xJLyvVTxSTlZqWGDL/7G0SNEtuR94LBGeMdtW+c pfuFw+armeZI3PzTIIM02XFNEuXp3goVmkqR5p84lWrOH/udIGfVFJUXsIF4xm2XUp5h1D Ui8REwiaP2o46HtEWSejxy7rSBvE/cVctu6W1baN1YTXUWPWvm7N4vYetNWg X-ME-Proxy: Feedback-ID: i0fe9450f:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Thu, 4 Jun 2026 00:07:06 -0400 (EDT) Date: Thu, 4 Jun 2026 13:07:00 +0900 From: Michael Paquier To: Diego Frias Cc: pgsql-hackers@lists.postgresql.org Subject: Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization Message-ID: References: MIME-Version: 1.0 Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="T4+IKmQq7I4vrS+N" Content-Disposition: inline In-Reply-To: List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --T4+IKmQq7I4vrS+N Content-Type: multipart/mixed; boundary="rrWaMQxp8/htDYul" Content-Disposition: inline --rrWaMQxp8/htDYul Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: quoted-printable On Mon, Jun 01, 2026 at 11:38:32AM -0700, Diego Frias wrote: > In short, TCount actually counts 1 more than the number of T > syllables; this is so s % TCount =3D=3D 0 implies that s has no T > syllable (because the 0th place represents the absence of a T > syllable), where s is the s-index of a precomposed Hangul > character. Anyway, since PostgreSQL recognizes 0x11A7 as a T > syllable, the composition algorithm yields a nonsense character when > 0x11A7 is put in the T position. Oops. Yes, including TBASE in the recomposition is incorrect, finding your quote here (TBase is set to one less..): https://unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59688 The character gets eaten by the normalization. Pas glop. > Let me know if this patch needs anything else. I can write a test > for this, but it looks like the current testing setup in > src/common/norm_test.c only runs the Unicode test suite and isn=E2=80=99t > built for writing custom tests. If that is something of interest, > though, I=E2=80=99m happy to add that to this patch. We have a set of tests in src/test/regress/sql/unicode.sql that would fit nicely with what you want to address here. For this specific problem, this would work: SELECT normalize(U&'\AC00\11A7', NFC) =3D U&'\AC00\11A7'; How about adding more normalization check patterns, while on it? I am finishing with the attached, all things combined. Diego. what do you think? -- Michael --rrWaMQxp8/htDYul Content-Type: text/plain; charset=us-ascii Content-Disposition: attachment; filename=0001-Fix-off-by-one-with-NFC-recomposition-for-Hangul-U-1.patch Content-Transfer-Encoding: quoted-printable =46rom 0614fd3227eedffe91c31468def76400fd01d134 Mon Sep 17 00:00:00 2001 =46rom: Michael Paquier Date: Thu, 4 Jun 2026 12:56:36 +0900 Subject: [PATCH] Fix off-by-one with NFC recomposition for Hangul U+11A7 (TBASE) The NFC recomposition incorrectly included TBASE as a valid T syllable, which is incorrect based on the Unicode spec (TBASE is one below the start of the range, range beginning at U+11A8). This would cause the TBASE to be silently swallowed in the normalization, leading to an incorrect result. A couple of regression tests are added to check more patterns with Hangul recomposition and decomposition, on top of a test to check this issue with TBASE. Author: Diego Frias Discussion: https://postgr.es/m/B92ED640-7D4A-4505-B09F-3548F58CBB16@dzfria= s.dev Backpatch-through: 14 --- src/common/unicode_norm.c | 2 +- src/test/regress/expected/unicode.out | 78 +++++++++++++++++++++++++++ src/test/regress/sql/unicode.sql | 20 +++++++ 3 files changed, 99 insertions(+), 1 deletion(-) diff --git a/src/common/unicode_norm.c b/src/common/unicode_norm.c index cf84f2024140..0534ae34640f 100644 --- a/src/common/unicode_norm.c +++ b/src/common/unicode_norm.c @@ -236,7 +236,7 @@ recompose_code(uint32 start, uint32 code, uint32 *resul= t) /* Check if two current characters are LV and T */ else if (start >=3D SBASE && start < (SBASE + SCOUNT) && ((start - SBASE) % TCOUNT) =3D=3D 0 && - code >=3D TBASE && code < (TBASE + TCOUNT)) + code > TBASE && code < (TBASE + TCOUNT)) { /* make syllable of form LVT */ uint32 tindex =3D code - TBASE; diff --git a/src/test/regress/expected/unicode.out b/src/test/regress/expec= ted/unicode.out index 1e06de226491..63e48d3a961f 100644 --- a/src/test/regress/expected/unicode.out +++ b/src/test/regress/expected/unicode.out @@ -105,3 +105,81 @@ ORDER BY num; =20 SELECT is_normalized('abc', 'def'); -- run-time error ERROR: invalid normalization form: def +-- Hangul NFC recomposition tests +-- L+V -> LV composition (first and last) +SELECT normalize(U&'\1100\1161', NFC) =3D U&'\AC00' COLLATE "C" AS hangul_= lv_first; + hangul_lv_first=20 +----------------- + t +(1 row) + +SELECT normalize(U&'\1112\1175', NFC) =3D U&'\D788' COLLATE "C" AS hangul_= lv_last; + hangul_lv_last=20 +---------------- + t +(1 row) + +-- LV+T -> LVT composition +SELECT normalize(U&'\AC00\11A8', NFC) =3D U&'\AC01' COLLATE "C" AS hangul_= lvt_first_t; + hangul_lvt_first_t=20 +-------------------- + t +(1 row) + +SELECT normalize(U&'\AC00\11C2', NFC) =3D U&'\AC1B' COLLATE "C" AS hangul_= lvt_last_t; + hangul_lvt_last_t=20 +------------------- + t +(1 row) + +SELECT normalize(U&'\D788\11A8', NFC) =3D U&'\D789' COLLATE "C" AS hangul_= lvt_last_lv; + hangul_lvt_last_lv=20 +-------------------- + t +(1 row) + +-- L+V+T -> LVT composition +SELECT normalize(U&'\1100\1161\11A8', NFC) =3D U&'\AC01' COLLATE "C" AS ha= ngul_full_lvt; + hangul_full_lvt=20 +----------------- + t +(1 row) + +SELECT normalize(U&'\1112\1175\11C2', NFC) =3D U&'\D7A3' COLLATE "C" AS ha= ngul_full_lvt; + hangul_full_lvt=20 +----------------- + t +(1 row) + +-- TBASE invalid T syllable +SELECT normalize(U&'\AC00\11A7', NFC) =3D U&'\AC00\11A7' COLLATE "C" AS ha= ngul_tbase_not_combined; + hangul_tbase_not_combined=20 +--------------------------- + t +(1 row) + +SELECT normalize(U&'\1100\1161\11A7', NFC) =3D U&'\AC00\11A7' COLLATE "C" = AS hangul_lv_tbase_separate; + hangul_lv_tbase_separate=20 +-------------------------- + t +(1 row) + +-- Hangul NFD decomposition tests +SELECT normalize(U&'\AC00', NFD) =3D U&'\1100\1161' COLLATE "C" AS hangul_= nfd_lv; + hangul_nfd_lv=20 +--------------- + t +(1 row) + +SELECT normalize(U&'\AC01', NFD) =3D U&'\1100\1161\11A8' COLLATE "C" AS ha= ngul_nfd_lvt; + hangul_nfd_lvt=20 +---------------- + t +(1 row) + +SELECT normalize(U&'\D7A3', NFD) =3D U&'\1112\1175\11C2' COLLATE "C" AS ha= ngul_nfd_last; + hangul_nfd_last=20 +----------------- + t +(1 row) + diff --git a/src/test/regress/sql/unicode.sql b/src/test/regress/sql/unicod= e.sql index e50adb68ed0d..951f86a336e8 100644 --- a/src/test/regress/sql/unicode.sql +++ b/src/test/regress/sql/unicode.sql @@ -36,3 +36,23 @@ FROM ORDER BY num; =20 SELECT is_normalized('abc', 'def'); -- run-time error + +-- Hangul NFC recomposition tests +-- L+V -> LV composition (first and last) +SELECT normalize(U&'\1100\1161', NFC) =3D U&'\AC00' COLLATE "C" AS hangul_= lv_first; +SELECT normalize(U&'\1112\1175', NFC) =3D U&'\D788' COLLATE "C" AS hangul_= lv_last; +-- LV+T -> LVT composition +SELECT normalize(U&'\AC00\11A8', NFC) =3D U&'\AC01' COLLATE "C" AS hangul_= lvt_first_t; +SELECT normalize(U&'\AC00\11C2', NFC) =3D U&'\AC1B' COLLATE "C" AS hangul_= lvt_last_t; +SELECT normalize(U&'\D788\11A8', NFC) =3D U&'\D789' COLLATE "C" AS hangul_= lvt_last_lv; +-- L+V+T -> LVT composition +SELECT normalize(U&'\1100\1161\11A8', NFC) =3D U&'\AC01' COLLATE "C" AS ha= ngul_full_lvt; +SELECT normalize(U&'\1112\1175\11C2', NFC) =3D U&'\D7A3' COLLATE "C" AS ha= ngul_full_lvt; +-- TBASE invalid T syllable +SELECT normalize(U&'\AC00\11A7', NFC) =3D U&'\AC00\11A7' COLLATE "C" AS ha= ngul_tbase_not_combined; +SELECT normalize(U&'\1100\1161\11A7', NFC) =3D U&'\AC00\11A7' COLLATE "C" = AS hangul_lv_tbase_separate; + +-- Hangul NFD decomposition tests +SELECT normalize(U&'\AC00', NFD) =3D U&'\1100\1161' COLLATE "C" AS hangul_= nfd_lv; +SELECT normalize(U&'\AC01', NFD) =3D U&'\1100\1161\11A8' COLLATE "C" AS ha= ngul_nfd_lvt; +SELECT normalize(U&'\D7A3', NFD) =3D U&'\1112\1175\11C2' COLLATE "C" AS ha= ngul_nfd_last; --=20 2.54.0 --rrWaMQxp8/htDYul-- --T4+IKmQq7I4vrS+N Content-Type: application/pgp-signature; name=signature.asc -----BEGIN PGP SIGNATURE----- iQIzBAEBCgAdFiEEG72nH6vTowiyblFKnvQgOdbyQH0FAmog+eQACgkQnvQgOdby QH2KLxAAj1HekmNFCYTKXAjXU1nHg/QY7jnLiCdaJWLURMvQFzpnTLo0HFVkxa7h uyCUEP8vle0kOppIJWmsBmwL81YMjABHuIuCsuShtnSYbjlnO3EXo2BwvOWMr/rH 4lXmAf/dP8Z0ALEufN2vnB/qr+WEpJRyPtNwIbQyHzOGdQ1PxWt1fR4R2+iAb8Ny QqysT7pMOEVhrbGTh0Mjs7FssuoUS7Av+vIZF0OcuDqx91PdNDn+UUBDNg+LZsAR cVG20VnPq2nAjj7lEDbAq0buKUHVcuWPzGeDxP5OUozM0j2FI0gDHO4WJO4bY1sC CkTuqS46wnLCdZ01e6Y7dhLxZOSywupfdhhb3oedD3oQ24wkzcDCBLi0rQ0rA4io fcOYdRQWqIPwuLkUpf87Ddajqphf99xrE+zThgZLufkEGs4wsxBQAnE+UJiVGb61 sQIfFtLp3BgFRFHjJt0rn88dnQ8VvnHYRsv0RjHJIKYooQi/7I1o1yC3qfst3YGG c6TMtUVDbG8xkZ/q7RV0ZdfbVqu55jmwjhIeAADHu7bD3+jD0FPO/SrR+d8/wGpC KnQ9O3Yet7+2c1kDw17DdqWkmdcHrrUcJSY7z0ze6yW2lEu0BUoTNa1YwC+tbiS+ Q8CHCqY3ZSNqpmd3TRqxIDDSjVa5p0NdvJHAb3OkJ/M5zgwzyls= =JaXR -----END PGP SIGNATURE----- --T4+IKmQq7I4vrS+N--