MIME-Version: 1.0
References: <20260214.192033.705419152780150580.ishii@postgresql.org>
 <CA+hUKG+HkG-EnYnR_hQzhDCTtdx0Cj-_X-jAzvNkF_=V39jQng@mail.gmail.com>
 <CA+hUKGJLCs7+8sW8ufY8WmiZzRhK+wtMEpe1-tJ6oyy2YEAQQg@mail.gmail.com>
 <20260417.102824.927096962510122248.ishii@postgresql.org>
In-Reply-To: <20260417.102824.927096962510122248.ishii@postgresql.org>
Reply-To: assam258@gmail.com
From: Henson Choi <assam258@gmail.com>
Date: Wed, 22 Apr 2026 10:34:25 +0900
Message-ID: 
 <CAAAe_zBdGXsALm=GkUPtPx9MLcjcM5hBg3HZU+nh8gKXSjXJJw@mail.gmail.com>
Subject: Re: Questionable description about character sets
To: Tatsuo Ishii <ishii@postgresql.org>
Cc: thomas.munro@gmail.com, andreas@proxel.se,
	pgsql-hackers@lists.postgresql.org
Content-Type: multipart/alternative; boundary="000000000000d290aa06500288fc"
Archived-At: 
 <https://www.postgresql.org/message-id/CAAAe_zBdGXsALm%3DGkUPtPx9MLcjcM5hBg3HZU%2Bnh8gKXSjXJJw%40mail.gmail.com>
Precedence: bulk

--000000000000d290aa06500288fc
Content-Type: text/plain; charset="UTF-8"

Thanks Thomas for looping me in, and thanks Tatsuo-san for driving
this.  Before getting to the Korean Description-column wording
itself, the main thing I want to surface from my audit is two
Bytes/Char corrections on this very table -- they turn out to be
the most concrete thing I can offer.

  * JOHAB row Bytes/Char = 1-3.  This is wrong.  I posted a
    separate patch for bug #19354 [1] that rewrites
    pg_johab_mblen() / pg_johab_verifychar() to follow
    KS X 1001:2004 Annex 3 Table 1 directly, instead of borrowing
    from pg_euc_mblen() / IS_EUC_RANGE_VALID().  (JOHAB's Hangul
    lead-byte range 0x84-0xD3 spans 0x8E and 0x8F, which EUC
    reserves as SS2/SS3, so it was never an EUC profile to begin
    with.)  That patch also corrects pg_wchar_table's maxmblen for
    JOHAB from 3 to 2 and the Bytes/Char column of this same
    Table 23.3 from "1-3" to "1-2".

  * EUC_KR row Bytes/Char = 1-3.  Overstated in the same way, but
    with a twist: the validator is already correct.  EUC-KR per
    KS X 2901 / RFC 1557 designates only G0 (ASCII) and G1
    (KS X 1001), so the maximum valid sequence length is 2.
    pg_euckr_verifychar() already rejects 0x8E and 0x8F via
    IS_EUC_RANGE_VALID (0xA1-0xFE), so no 3-byte sequence is ever
    accepted in practice.  The stale "3" survives only in
    pg_wchar_table[PG_EUC_KR].maxmblen and in this docs cell, as a
    leftover from pg_euckr_mblen() delegating to the shared
    pg_euc_mblen().  Correcting both to 2 is a pure cleanup with
    no behavior change and no backward-compatibility impact.
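
To make the JOHAB mismatch concrete, here is a minimal sketch (not
the code from [1]; the function names are hypothetical).  It
compares an EUC-style length rule, which treats 0x8E and 0x8F as
single shifts, with a rule where any high-bit lead byte starts a
2-byte character.  The second rule is my simplified assumption of
the JOHAB behavior and skips the lead/trail range checks a real
verifier needs.

```c
#include <assert.h>
#include <stddef.h>

#define SS2 0x8e				/* EUC single shift 2 */
#define SS3 0x8f				/* EUC single shift 3 */

/* EUC-style rule: the lead byte alone decides the length. */
static int
euc_style_mblen(const unsigned char *s)
{
	assert(s != NULL);
	if (*s == SS2)
		return 2;
	if (*s == SS3)
		return 3;				/* this is where the "3" comes from */
	if (*s & 0x80)
		return 2;
	return 1;
}

/* Hypothetical JOHAB-style rule: no single shifts, so never 3. */
static int
johab_style_mblen(const unsigned char *s)
{
	assert(s != NULL);
	return (*s & 0x80) ? 2 : 1;
}
```

With a lead byte of 0x8F, which falls inside JOHAB's Hangul range
0x84-0xD3, the EUC-style rule reports 3 bytes while the
JOHAB-style rule reports 2.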

If the JOHAB fix lands first, that row's Bytes/Char can inherit
the corrected value.  For EUC_KR, I could go either way and would
rather let you pick the direction: fold the maxmblen/docs cleanup
into v1 (since the change is behavior-free), or keep it out and
let me post it as its own small patch in a separate thread (since
it touches src/common/wchar.c as well as the docs, while your v1
is docs-only).  I'm happy to prepare it either way.
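
As a rough illustration of why the EUC_KR change is behavior-free,
the sketch below models a verifier that applies the 0xA1-0xFE
range check to both bytes.  It is a simplified stand-in, not the
actual pg_euckr_verifychar(); the names ending in _sketch are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Same range as IS_EUC_RANGE_VALID: 0xA1-0xFE. */
#define IS_EUC_RANGE_VALID_SKETCH(c) ((c) >= 0xa1 && (c) <= 0xfe)

/*
 * Return the length of the valid character at s, or -1 if invalid.
 * No code path returns 3, so a maxmblen of 2 describes it fully.
 */
static int
euckr_verifychar_sketch(const unsigned char *s, int len)
{
	assert(s != NULL);
	if (len < 1)
		return -1;
	if (!(s[0] & 0x80))
		return 1;				/* ASCII (G0) */
	if (len < 2)
		return -1;
	/* 0x8E and 0x8F are below 0xA1, so they are rejected here. */
	if (!IS_EUC_RANGE_VALID_SKETCH(s[0]) ||
		!IS_EUC_RANGE_VALID_SKETCH(s[1]))
		return -1;
	return 2;					/* KS X 1001 (G1) */
}
```

A sequence starting with 0x8E or 0x8F fails the lead-byte range
check, so the longest sequence this sketch accepts is 2 bytes.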

As for the Korean Description-column wording itself, I'd rather
offer input than a finished proposal -- I'm honestly not confident
about the right naming convention, especially for UHC.  For what
it's worth:

  * EUC_KR's coded character set is just KS X 1001 (plus ASCII);
    there is no KS equivalent of JIS X 0212.

  * JOHAB shares the same character repertoire as EUC_KR --
    KS X 1001 + ASCII -- and simply arranges those characters into
    bytes via the combinational code in Annex 3.  So if the column
    is about coded character sets rather than encodings, JOHAB's
    entry would arguably read identically to EUC_KR's.  That's
    actually a clean illustration of the encoding-vs-character-set
    distinction you raised in the original post.

  * UHC / CP949 is the Microsoft superset of EUC-KR that adds the
    8822 precomposed Hangul syllables missing from KS X 1001's
    2350, for 11172 in total, but those extra syllables aren't
    standardized as a separately-named coded character set as far
    as I know -- "CP949" tends to refer to the encoding.  I don't
    have a confident answer for the wording; if you have a
    preferred convention I'll defer to it.

    (Structural note in passing: despite the "superset of EUC-KR"
    framing, UHC is not itself an EUC profile.  To fit the extra
    syllables, it extends the lead-byte range down to 0x81, which
    necessarily swallows 0x8E and 0x8F -- the bytes EUC reserves
    as SS2 and SS3.  So by extending EUC-KR, CP949 steps outside
    the EUC family.  Mentioning this only because it mirrors the
    JOHAB situation.)
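
The structural point about lead-byte ranges can be checked
mechanically.  The helper below is a hypothetical sketch: it only
asks whether a lead-byte range stays clear of the two EUC single
shift bytes, using the ranges quoted above.

```c
#include <assert.h>
#include <stdbool.h>

#define SS2 0x8e				/* EUC single shift 2 */
#define SS3 0x8f				/* EUC single shift 3 */

/* True if the inclusive range [lo, hi] contains neither SS2 nor SS3. */
static bool
lead_range_avoids_single_shifts(unsigned char lo, unsigned char hi)
{
	assert(lo <= hi);
	return !(lo <= SS2 && SS2 <= hi) &&
		!(lo <= SS3 && SS3 <= hi);
}
```

Called with EUC-KR's 0xA1-0xFE it returns true; with UHC's
0x81-0xFE or JOHAB's Hangul range 0x84-0xD3 it returns false.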

One more observation, and apologies in advance for wandering a bit
beyond the scope of this thread: while auditing those code paths I
noticed that pg_uhc_verifychar() appears quite loose on trail
bytes (it only rejects \0), while CP949's actual trail-byte range
is somewhat narrower.  Tightening this would be a real behavior
change -- existing databases may contain byte sequences that are
currently accepted but would be rejected under a stricter verifier
-- so it needs its own discussion.  I'll raise that in a separate
thread regardless of how the EUC_KR question above is
resolved.  (UHC's 1-2 / maxmblen = 2 are already correct, so this
is purely a verifier-strictness question, not a table-cell
question.)
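
To show the kind of difference I mean, here is a sketch of a loose
and a stricter UHC check side by side.  Neither is the actual
PostgreSQL code, and all function names are hypothetical.  The
trail-byte ranges in the strict variant (0x41-0x5A, 0x61-0x7A,
0x81-0xFE) are an assumption taken from the commonly published
CP949 layout; they would need to be confirmed before any real
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed CP949 trail-byte ranges (not stated in this thread). */
static bool
cp949_trail_ok(unsigned char c)
{
	return (c >= 0x41 && c <= 0x5a) ||
		(c >= 0x61 && c <= 0x7a) ||
		(c >= 0x81 && c <= 0xfe);
}

/* Loose check: a 2-byte character is rejected only for a NUL trail. */
static int
uhc_verifychar_loose(const unsigned char *s, int len)
{
	assert(s != NULL);
	if (len < 1)
		return -1;
	if (!(s[0] & 0x80))
		return 1;
	if (len < 2 || s[1] == '\0')
		return -1;
	return 2;
}

/* Stricter check: lead byte 0x81-0xFE, trail byte in the assumed ranges. */
static int
uhc_verifychar_strict(const unsigned char *s, int len)
{
	assert(s != NULL);
	if (len < 1)
		return -1;
	if (!(s[0] & 0x80))
		return 1;
	if (len < 2)
		return -1;
	if (s[0] < 0x81 || s[0] > 0xfe)
		return -1;
	if (!cp949_trail_ok(s[1]))
		return -1;
	return 2;
}
```

A sequence such as 0x81 0x20 passes the loose check but fails the
strict one, which is exactly the class of stored data a tightened
verifier would start rejecting.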

So in summary: the UHC verifier question will go to a separate
thread from my side (behavior change, needs consensus),
and the EUC_KR cleanup will go to either v1 or a separate thread
depending on your call above.  Neither should block your v1 patch;
the only pieces that touch the same table cells are the two
Bytes/Char corrections, handled via [1] for JOHAB and via the
EUC_KR cleanup, wherever it ends up, for EUC_KR.

[1] https://postgr.es/m/19354-eefe6d8b3e84f9f2@postgresql.org

Regards,
Henson Choi

--000000000000d290aa06500288fc--