From: Diego Frias
Content-Type: multipart/mixed; boundary="Apple-Mail=_FDA30F13-EA66-4699-8EA2-70EE4725C786"
Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3864.600.51.1.1\))
Subject: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode normalization
Date: Mon, 1 Jun 2026 11:38:32 -0700
To: pgsql-hackers@lists.postgresql.org

--Apple-Mail=_FDA30F13-EA66-4699-8EA2-70EE4725C786
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=utf-8

Hi hackers,

I was browsing PostgreSQL’s Unicode normalization code and found an
issue where the composition algorithm recognizes 0x11A7 as a T syllable
and combines it with the preceding LV syllable. Per the Unicode
specification:

  TBase is set to one less than the beginning of the range of trailing
  consonants, which starts at U+11A8. TCount is set to one more than the
  number of trailing consonants relevant to the decomposition algorithm:
  (11C2₁₆ - 11A8₁₆ + 1) + 1.

In short, TCount is one more than the number of T syllables: 27 + 1 = 28.
This is so that s % TCount == 0 implies that s has no T syllable, where
s is the s-index of a precomposed Hangul character; index 0 represents
the absence of a T syllable. Since PostgreSQL accepts 0x11A7 (which is
TBase itself, index 0) as a T syllable, the composition algorithm yields
an incorrect result when 0x11A7 appears in the T position.

See
https://github.com/unicode-rs/unicode-normalization/blob/576ae0b1407dd14854876c93f1a348df0c19dffe/src/normalize.rs#L218
for a comment on this bug in Rust’s unicode-rs, and
https://github.com/JuliaStrings/utf8proc/commit/0260ba56c81e5ef6f06c0804034a36284bcb8710
for a similar contribution I made to JuliaStrings/utf8proc a few months
ago.

Let me know if this patch needs anything else. I can write a test for
this, but it looks like the current testing setup in
src/common/norm_test.c only runs the Unicode test suite and isn’t built
for writing custom tests. If that is something of interest, though, I’m
happy to add that to this patch.

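To illustrate the off-by-one, here is a minimal standalone sketch of the
LV + T check. It is not the PostgreSQL source: the helper name
compose_lv_t and its "strict" flag are hypothetical, and the constants
are the Hangul values from the Unicode core specification. With the
inclusive lower bound, 0x11A7 is accepted with a T index of 0; with the
exclusive bound it is rejected.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hangul composition constants from the Unicode core specification. */
#define SBASE	0xAC00u		/* first precomposed Hangul syllable */
#define TBASE	0x11A7u		/* one less than the first trailing consonant */
#define TCOUNT	28u			/* 27 trailing consonants + 1 for "no T" */
#define SCOUNT	11172u		/* 19 * 21 * 28 precomposed syllables */

/*
 * Hypothetical helper (an assumption for illustration only): try to
 * compose an LV syllable "start" with a trailing consonant "code".
 * strict = false uses the old inclusive bound (code >= TBASE);
 * strict = true uses the exclusive bound from the patch (code > TBASE).
 */
static bool
compose_lv_t(uint32_t start, uint32_t code, bool strict, uint32_t *result)
{
	/* start must be a precomposed syllable with T index 0, i.e. an LV */
	bool		is_lv = start >= SBASE && start < SBASE + SCOUNT &&
		(start - SBASE) % TCOUNT == 0;

	/* code must be a trailing consonant; TBASE itself is not one */
	bool		is_t = (strict ? code > TBASE : code >= TBASE) &&
		code < TBASE + TCOUNT;

	if (!is_lv || !is_t)
		return false;

	/* T index 0 would leave the syllable unchanged, hence the strict bound */
	*result = start + (code - TBASE);
	return true;
}
```

In this sketch, compose_lv_t(0xAC00, 0x11A7, false, &out) succeeds and
leaves out equal to 0xAC00, so the 0x11A7 code point is silently
absorbed. With strict = true the same call returns false, while
compose_lv_t(0xAC00, 0x11A8, true, &out) still produces 0xAC01.
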
Best,
Diego

--Apple-Mail=_FDA30F13-EA66-4699-8EA2-70EE4725C786
Content-Disposition: attachment; filename="v1-0001-Fix-recognizing-0x11A7-as-a-Hangul-T-syllable-in-Uni.patch"
Content-Type: application/octet-stream; x-unix-mode=0644; name="v1-0001-Fix-recognizing-0x11A7-as-a-Hangul-T-syllable-in-Uni.patch"
Content-Transfer-Encoding: 7bit

From 37d7ba5193a8de6bd31a38a7d93a37b66db1dd9d Mon Sep 17 00:00:00 2001
From: Diego Frias
Date: Mon, 1 Jun 2026 11:32:41 -0700
Subject: [PATCH] Fix recognizing 0x11A7 as a Hangul T syllable in Unicode
 normalization

0x11A7 is not a valid Hangul T syllable despite being equal to T_BASE.
This is because, per the Unicode spec:

  TBase is set to one less than the beginning of the range of trailing
  consonants, which starts at U+11A8. TCount is set to one more than the
  number of trailing consonants relevant to the decomposition algorithm:
  (11C216 - 11A816 + 1) + 1.

So the first valid Hangul T syllable is 0x11A8. Also see
https://www.unicode.org/versions/Unicode17.0.0/core-spec/chapter-3/#G59434
for where the spec describes the usage of 0x11A8, not 0x11A7, during
composition.
---
 src/common/unicode_norm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/common/unicode_norm.c b/src/common/unicode_norm.c
index cf84f202414..0534ae34640 100644
--- a/src/common/unicode_norm.c
+++ b/src/common/unicode_norm.c
@@ -236,7 +236,7 @@ recompose_code(uint32 start, uint32 code, uint32 *result)
 	/* Check if two current characters are LV and T */
 	else if (start >= SBASE && start < (SBASE + SCOUNT) &&
 			 ((start - SBASE) % TCOUNT) == 0 &&
-			 code >= TBASE && code < (TBASE + TCOUNT))
+			 code > TBASE && code < (TBASE + TCOUNT))
 	{
 		/* make syllable of form LVT */
 		uint32		tindex = code - TBASE;
-- 
2.39.5 (Apple Git-154)

--Apple-Mail=_FDA30F13-EA66-4699-8EA2-70EE4725C786--