From: Diego Frias
To: Michael Paquier
Cc: pgsql-hackers@lists.postgresql.org
Date: Thu, 4 Jun 2026 09:32:53 -0700
Subject: Re: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization

Looks
great! Thanks for letting me know where the tests live. I’ll try to
get these tests in the official Unicode test suite, too. Should help
future implementors.

Thanks,
Diego

> On Jun 3, 2026, at 9:07 PM, Michael Paquier wrote:
>
> On Mon, Jun 01, 2026 at 11:38:32AM -0700, Diego Frias wrote:
>> In short, TCount actually counts 1 more than the number of T
>> syllables; this is so s % TCount == 0 implies that s has no T
>> syllable (because the 0th place represents the absence of a T
>> syllable), where s is the s-index of a precomposed Hangul
>> character. Anyway, since PostgreSQL recognizes 0x11A7 as a T
>> syllable, the composition algorithm yields a nonsense character when
>> 0x11A7 is put in the T position.
>
> Oops. Yes, including TBASE in the recomposition is incorrect, finding
> your quote here (TBase is set to one less..):
> https://unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59688
>
> The character gets eaten by the normalization. Pas glop.
>
>> Let me know if this patch needs anything else. I can write a test
>> for this, but it looks like the current testing setup in
>> src/common/norm_test.c only runs the Unicode test suite and isn’t
>> built for writing custom tests. If that is something of interest,
>> though, I’m happy to add that to this patch.
>
> We have a set of tests in src/test/regress/sql/unicode.sql that would
> fit nicely with what you want to address here. For this specific
> problem, this would work:
> SELECT normalize(U&'\AC00\11A7', NFC) = U&'\AC00\11A7';
>
> How about adding more normalization check patterns, while on it? I am
> finishing with the attached, all things combined. Diego. what do you
> think?
> --
> Michael
> <0001-Fix-off-by-one-with-NFC-recomposition-for-Hangul-U-1.patch>
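
For anyone following the thread, here is a minimal sketch in C of the
LV + T check discussed above. It is not the code from the attached
patch or from the PostgreSQL tree: the function names are hypothetical,
and the constants are the ones given in the Hangul composition section
of the Unicode core specification. The point it illustrates is that the
lower bound on the trailing consonant has to be strict, because TBASE
(U+11A7) is one less than the first real trailing consonant (U+11A8).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Constants from the Unicode core specification (Hangul composition). */
#define SBASE	0xAC00		/* first precomposed Hangul syllable */
#define TBASE	0x11A7		/* one less than the first trailing consonant */
#define TCOUNT	28			/* 27 trailing consonants + 1 "no trailing" slot */
#define SCOUNT	11172		/* number of precomposed Hangul syllables */

/*
 * Hypothetical helper: true if ch is a precomposed syllable with no
 * trailing consonant, i.e. its s-index is a multiple of TCOUNT.
 */
static bool
is_lv_syllable(uint32_t ch)
{
	return ch >= SBASE && ch < SBASE + SCOUNT &&
		(ch - SBASE) % TCOUNT == 0;
}

/*
 * Hypothetical helper: try to compose an LV syllable with a trailing
 * consonant.  Returns true and stores the LVT syllable in *result on
 * success; returns false and leaves *result untouched otherwise.
 */
static bool
compose_lv_t(uint32_t lv, uint32_t t, uint32_t *result)
{
	assert(result != NULL);

	if (!is_lv_syllable(lv))
		return false;

	/*
	 * Strict lower bound: accepting t == TBASE would add an offset of
	 * zero, so the pair would collapse back into lv and the second
	 * character would be lost.
	 */
	if (t <= TBASE || t >= TBASE + TCOUNT)
		return false;

	*result = lv + (t - TBASE);
	return true;
}
```

With this check, U+AC00 followed by U+11A8 composes to U+AC01, while
U+AC00 followed by U+11A7 is left alone, which is the behavior the
regression query quoted above expects.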