From: Henson Choi
Reply-To: assam258@gmail.com
Date: Wed, 22 Apr 2026 10:34:25 +0900
Subject: Re: Questionable description about character sets
To: Tatsuo Ishii
Cc: thomas.munro@gmail.com, andreas@proxel.se, pgsql-hackers@lists.postgresql.org
In-Reply-To: <20260417.102824.927096962510122248.ishii@postgresql.org>

Thanks Thomas for looping me in, and thanks Tatsuo-san for driving
this.  Before getting to the Korean Description-column wording
itself, the main thing I want to surface from my audit is two
Bytes/Char corrections on this very table -- they turn out to be
the most concrete thing I can offer.

  * JOHAB row Bytes/Char = 1-3.  This is wrong.  I posted a
    separate patch for bug #19354 [1] that rewrites
    pg_johab_mblen() / pg_johab_verifychar() to follow
    KS X 1001:2004 Annex 3 Table 1 directly, instead of borrowing
    from pg_euc_mblen() / IS_EUC_RANGE_VALID().  (JOHAB's Hangul
    lead-byte range 0x84-0xD3 spans 0x8E and 0x8F, which EUC
    reserves as SS2/SS3, so it was never an EUC profile to begin
    with.)  That patch also corrects pg_wchar_table's maxmblen for
    JOHAB from 3 to 2 and the Bytes/Char column of this same
    Table 23.3 from "1-3" to "1-2".

  * EUC_KR row Bytes/Char = 1-3.  Overstated in the same way, but
    with a twist: the validator is already correct.  EUC-KR per
    KS X 2901 / RFC 1557 designates only G0 (ASCII) and G1
    (KS X 1001), so the maximum valid sequence length is 2.
    pg_euckr_verifychar() already rejects 0x8E and 0x8F via
    IS_EUC_RANGE_VALID (0xA1-0xFE), so no 3-byte sequence is ever
    accepted in practice.  The stale "3" only survives in
    pg_wchar_table[PG_EUC_KR].maxmblen and in this docs cell, as a
    leftover from pg_euckr_mblen() delegating to the shared
    pg_euc_mblen().  Correcting both to 2 is a pure cleanup with
    no behavior change and no backward-compatibility impact.
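
(Illustration of the EUC_KR row above, as a minimal C sketch.  The
function names are hypothetical and this is not the code in
src/common/wchar.c.  It assumes the generic EUC length rule, where
SS2 introduces a 2-byte and SS3 a 3-byte sequence, plus the
0xA1-0xFE validity range mentioned above, and shows how a length
table can say "3" even though no 3-byte sequence can pass
validation.)

```c
#include <stdbool.h>

/* Hypothetical sketch; names are illustrative, not PostgreSQL's. */

/*
 * Assumed generic EUC length rule: SS2 (0x8E) starts a 2-byte
 * sequence, SS3 (0x8F) a 3-byte one, any other high-bit byte a
 * 2-byte one, and plain ASCII is a single byte.
 */
int
sketch_euc_mblen(unsigned char lead)
{
    if (lead == 0x8E)
        return 2;
    if (lead == 0x8F)
        return 3;
    return (lead & 0x80) ? 2 : 1;
}

/* EUC-KR designates only G0 (ASCII) and G1 (bytes 0xA1-0xFE). */
bool
sketch_euckr_g1_byte(unsigned char b)
{
    return b >= 0xA1 && b <= 0xFE;
}

/*
 * Length of a character that could pass EUC-KR validation for this
 * lead byte, or -1 if the byte can never start a valid character.
 */
int
sketch_euckr_valid_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;
    return sketch_euckr_g1_byte(lead) ? 2 : -1;
}

/* Longest valid length over all 256 possible lead bytes. */
int
sketch_euckr_max_valid_len(void)
{
    int         max = 0;

    for (int b = 0; b <= 0xFF; b++)
    {
        int         len = sketch_euckr_valid_len((unsigned char) b);

        if (len > max)
            max = len;
    }
    return max;
}
```

Under these assumptions sketch_euc_mblen(0x8F) reports 3, while
sketch_euckr_valid_len(0x8F) is -1 and sketch_euckr_max_valid_len()
is 2, which is the gap between the shared length rule and what the
validator actually lets through.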

If the JOHAB fix lands first, that row's Bytes/Char can inherit
the corrected value.  For EUC_KR, I could go either way and would
rather let you pick the direction: fold the maxmblen/docs cleanup
into v1 (since the change is behavior-free), or keep it out and
let me post it as its own small patch in a separate thread (since
it touches src/common/wchar.c as well as the docs, while your v1
is docs-only).  I'm happy to prepare it either way.

As for the Korean Description-column wording itself, I'd rather
offer input than a finished proposal -- I'm honestly not confident
about the right naming convention, especially for UHC.  For what
it's worth:

  * EUC_KR's coded character set is just KS X 1001 (plus ASCII);
    there is no KS equivalent of JIS X 0212.

  * JOHAB shares the same character repertoire as EUC_KR --
    KS X 1001 + ASCII -- and simply arranges those characters into
    bytes via the combinational code in Annex 3.  So if the column
    is about coded character sets rather than encodings, JOHAB's
    entry would arguably read identically to EUC_KR's.  That's
    actually a clean illustration of the encoding-vs-character-set
    distinction you raised in the original post.

  * UHC / CP949 is the Microsoft superset of EUC-KR that adds the
    8822 precomposed Hangul syllables missing from KS X 1001 (for
    11172 in total), but those extra syllables aren't standardized
    as a separately-named coded character set as far as I know --
    "CP949" tends to refer to the encoding.  I don't have a
    confident answer for the wording; if you have a preferred
    convention I'll defer to it.

    (Structural note in passing: despite the "superset of EUC-KR"
    framing, UHC is not itself an EUC profile.  To fit the extra
    syllables, it extends the lead-byte range down to 0x81, which
    necessarily swallows 0x8E and 0x8F -- the bytes EUC reserves
    as SS2 and SS3.  So by extending EUC-KR, CP949 steps outside
    the EUC family.  Mentioning this only because it mirrors the
    JOHAB situation.)
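
(The structural point for JOHAB and UHC can be checked with a few
lines of C.  This is only a sketch with hypothetical names.  The
lead-byte ranges are the ones quoted in this mail; the 0xFE upper
bound for UHC is an assumption carried over from the EUC-KR G1
range.)

```c
#include <stdbool.h>

/* Hypothetical sketch; names are illustrative, not PostgreSQL's. */

#define SKETCH_SS2 0x8E         /* EUC single-shift 2 */
#define SKETCH_SS3 0x8F         /* EUC single-shift 3 */

/*
 * True if an encoding that uses lo..hi as ordinary lead bytes would
 * also use SS2 or SS3 as an ordinary lead byte, which an EUC profile
 * cannot do because EUC reserves both bytes as shift codes.
 */
bool
sketch_lead_range_hits_shift(unsigned char lo, unsigned char hi)
{
    return (lo <= SKETCH_SS2 && SKETCH_SS2 <= hi) ||
           (lo <= SKETCH_SS3 && SKETCH_SS3 <= hi);
}
```

With these ranges the EUC-KR G1 range (0xA1-0xFE) stays clear of
both shift bytes, while the JOHAB Hangul range (0x84-0xD3) and the
extended UHC range (0x81-0xFE) both contain them.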

One more observation, and apologies in advance for wandering a bit
beyond the scope of this thread: while auditing those code paths I
noticed that pg_uhc_verifychar() appears quite loose on trail
bytes (it only rejects \0), while CP949's actual trail-byte range
is somewhat narrower.  Tightening this would be a real behavior
change -- existing databases may contain byte sequences that are
currently accepted but would be rejected under a stricter verifier
-- so it needs its own discussion.  I'll raise that in its own
thread regardless of how the EUC_KR question above is
resolved.  (UHC's 1-2 / maxmblen = 2 are already correct, so this
is purely a verifier-strictness question, not a table-cell
question.)
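
(To make the strictness gap concrete, here is a hedged C sketch
with hypothetical names; it is not the code in wchar.c.  The loose
variant mirrors the "only rejects \0" behavior described above.
The strict variant assumes the commonly documented CP949 trail-byte
ranges 0x41-0x5A, 0x61-0x7A and 0x81-0xFE and a lead-byte range of
0x81-0xFE; both are assumptions that would need checking against
the code page table before any real patch.)

```c
#include <stdbool.h>

/* Hypothetical sketch; names and byte ranges are assumptions. */

/* Loose check: a high-bit lead byte takes any non-NUL trail byte. */
int
sketch_uhc_verify_loose(const unsigned char *s, int len)
{
    if (len < 1)
        return -1;
    if (s[0] < 0x80)
        return 1;
    if (len < 2 || s[1] == '\0')
        return -1;
    return 2;
}

/* Assumed CP949 trail bytes: 0x41-0x5A, 0x61-0x7A, 0x81-0xFE. */
bool
sketch_cp949_trail_ok(unsigned char t)
{
    return (t >= 0x41 && t <= 0x5A) ||
           (t >= 0x61 && t <= 0x7A) ||
           (t >= 0x81 && t <= 0xFE);
}

/* Strict check: lead must be 0x81-0xFE and trail must be in range. */
int
sketch_uhc_verify_strict(const unsigned char *s, int len)
{
    if (len < 1)
        return -1;
    if (s[0] < 0x80)
        return 1;
    if (s[0] < 0x81 || s[0] > 0xFE)
        return -1;
    if (len < 2 || !sketch_cp949_trail_ok(s[1]))
        return -1;
    return 2;
}
```

A pair such as 0xB0 0x20 is accepted by the loose check and
rejected by the strict one, which is exactly the kind of stored
data that would make tightening a behavior change.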

So in summary: the UHC verifier question will go to its own
thread from my side (behavior change, needs consensus),
and the EUC_KR cleanup will go to either v1 or a separate thread
depending on your call above.  Neither should block your v1 patch;
the only pieces that touch the same table cells are the two
Bytes/Char corrections, both handled either via [1] or via the
EUC_KR cleanup, wherever it ends up.

[1] https://postgr.es/m/19354-eefe6d8b3e84f9f2@postgresql.org

Regards,
Henson Choi