Re: BUG #19354: JOHAB rejects valid byte sequences

public inbox for [email protected]  

5+ messages / 3 participants

* Re: BUG #19354: JOHAB rejects valid byte sequences
@ 2026-04-15 04:25  Henson Choi <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Henson Choi @ 2026-04-15 04:25 UTC
  To: Thomas Munro <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; Robert Haas <[email protected]>; Tom Lane <[email protected]>; Jeroen Vermeulen <[email protected]>; VASUKI M <[email protected]>; [email protected]

>
> 3.  UHC (= "Unified Hangul Code", invented by Microsoft): used EUC-KR
> as a base but supplied all possible pre-composed Hangul and 8,222
> Hanja (complete CJK as of Unicode 2.0).

Small correction: UHC's additions over EUC-KR are on the Hangul side,
not Hanja.  UHC adds 8,822 pre-composed Hangul (taking Hangul coverage
from EUC-KR's 2,350 up to the full 11,172) and leaves Hanja unchanged
at KS X 1001's 4,888.  I enumerated all three encodings against
PostgreSQL's current conversion tables to double-check:

    Encoding   Hangul   Hanja
    EUC_KR      2,350    4,888
    UHC        11,172    4,888
    JOHAB      11,172    4,888   (after this patch)

"Complete CJK as of Unicode 2.0" is off too -- Unicode 2.0's CJK
Unified Ideographs block had roughly 20,900 characters, so UHC and
JOHAB both carry only the KS X 1001 Hanja subset.  The 8,222 figure
looks like it got swapped with the 8,822 Hangul number.
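
For anyone who wants to re-derive the Hangul column without the conversion tables: the 11,172 total is just the product of the jamo counts for modern precomposed syllables (19 initials, 21 medials, 28 final positions including "no final"), and 8,822 is what remains after subtracting EUC-KR's 2,350.  A minimal C sketch of that arithmetic follows; the function names are illustrative only and do not come from PostgreSQL.

```c
#include <stdio.h>

/* Jamo counts behind modern precomposed Hangul syllables. */
enum
{
    N_INITIAL = 19,             /* initial consonants */
    N_MEDIAL = 21,              /* medial vowels */
    N_FINAL = 28                /* 27 final consonants + "no final" */
};

/* Total number of precomposed modern Hangul syllables. */
int hangul_total(void)
{
    return N_INITIAL * N_MEDIAL * N_FINAL;
}

/* Syllables an encoding must add on top of euckr_hangul to be complete. */
int hangul_missing(int euckr_hangul)
{
    return hangul_total() - euckr_hangul;
}

/* Print the two figures discussed above. */
void print_hangul_counts(void)
{
    printf("total=%d, added over EUC-KR=%d\n",
           hangul_total(), hangul_missing(2350));
}
```

Calling print_hangul_counts() from a small main() reproduces the 11,172 and 8,822 figures in the table.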

>  Realpolitik that fed back into standards:

> 1.  The Hancom "Hangul" word processor used de facto standard JOHAB
> encoding, and dominated.
> 2.  KS X 1001 recognised this and added that annex.

Minor nit on the sequence: KS C 5601 already had a combinational annex
in its 1982 revision, but with a different bit layout from the one
Hancom's word processor used.  The 1992 revision swapped the annex's
bit layout to the commercial combinational form (상용 조합형) that
the industry -- Hancom included -- had popularised.  The KS X
1001:2004 commentary documents this transition explicitly ("비트
조합을 널리 쓰고 있는 이른바 상용 조합형으로 바꿈", roughly "changed
the bit combination to the widely used, so-called commercial
combinational form").  So "KS recognised the de facto standard"
applies to 1992, not to the annex's first appearance.
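
To make "bit layout" concrete: as the commercial combinational form is commonly described, a Hangul code is a 16-bit value with the top bit set, followed by three 5-bit fields for the initial, medial, and final jamo.  The sketch below only splits a code into those fields.  The layout and the sample value 0x8861 for 가 are stated from my reading of published JOHAB tables, so treat them as assumptions; the type and function names are hypothetical, and the field-to-jamo index tables are deliberately left out.

```c
#include <stdint.h>

/* Raw 5-bit field values of a JOHAB Hangul code (not yet jamo indexes). */
typedef struct
{
    unsigned    initial;        /* bits 14..10 */
    unsigned    medial;         /* bits 9..5 */
    unsigned    final;          /* bits 4..0 */
} johab_fields;

/* Nonzero if the top bit is set, as assumed for Hangul codes. */
int johab_high_bit(uint16_t code)
{
    return (code & 0x8000) != 0;
}

/* Split a 16-bit code into its three 5-bit fields. */
johab_fields johab_split(uint16_t code)
{
    johab_fields f;

    f.initial = (code >> 10) & 0x1F;
    f.medial = (code >> 5) & 0x1F;
    f.final = code & 0x1F;
    return f;
}
```

Under those assumptions, johab_split(0x8861) yields the raw fields 2, 3, and 1; turning those into actual jamo needs the annex's index tables, which this sketch does not attempt.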

Worth mentioning for atmosphere: that period was the tail end of the
Apple II clone / MSX era and the rise of IBM PC compatibles in Korea,
and contemporary Korean computer magazines carried running debates over
Wansung vs Johab on three axes at once -- the encoding, the keyboard
layout (두벌식 vs 세벌식, the Korean QWERTY-vs-Dvorak argument), and
the font rendering strategy (per-syllable bitmap tables for Wansung
vs jamo-composition for Johab) -- right alongside their game reviews.
The 1992 annex revision landed in the middle of that churn, not
ahead of it.

One further observation that fits your KS X 1002 note.  EUC-KR isn't
really a single standard but a layered stack -- KS X 1001 (the
character set) + ISO/IEC 2022 (the code-extension skeleton) + the
AT&T-era EUC convention of pinning G0 to ASCII and G1 to the 8-bit
region, later formalised in Korea as KS X 2901.  That informal
layering is precisely what let UHC land so easily: Microsoft extended
the same 8-bit region with additional Hangul, and every EUC-KR
decoder silently kept working for the covered subset.

KS X 1002 tried the opposite approach -- a formally separated
supplementary set, designated via a distinct ISO-2022 escape
sequence.  The design was cleaner on paper but required every
consumer to implement set-switching for a supplementary character
range that nobody was motivated to support.  UHC sidestepped this
entirely by just filling in the unused 8-bit slots.  So the
structural reason 1002 lost to UHC isn't just market power; it is
that UHC matched EUC-KR's informal extensibility while 1002 demanded
strict ISO-2022 compliance.  JOHAB (Annex 3) sits at the other end of
that spectrum -- a self-contained spec where a single document nails
down character set, byte layout, and composition algorithm, which is
what makes the verifier fix tractable.
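
The "slot-filling" point can be shown structurally.  The sketch below assumes the commonly documented byte ranges: EUC-KR pairs with both bytes in 0xA1-0xFE, and UHC / CP949 pairs with a lead byte in 0x81-0xFE and a trail byte in 0x41-0x5A, 0x61-0x7A, or 0x81-0xFE.  It checks byte ranges only, not whether a code point is actually assigned, and it is not PostgreSQL's verifier; the function names are illustrative.

```c
#include <stdbool.h>

/* EUC-KR: both bytes of a KS X 1001 character lie in the G1 area. */
bool is_euckr_pair(unsigned char b1, unsigned char b2)
{
    return b1 >= 0xA1 && b1 <= 0xFE &&
           b2 >= 0xA1 && b2 <= 0xFE;
}

/* UHC: wider lead-byte range, and trail bytes may also fall below 0xA1. */
bool is_uhc_pair(unsigned char b1, unsigned char b2)
{
    bool        lead = b1 >= 0x81 && b1 <= 0xFE;
    bool        trail = (b2 >= 0x41 && b2 <= 0x5A) ||
                        (b2 >= 0x61 && b2 <= 0x7A) ||
                        (b2 >= 0x81 && b2 <= 0xFE);

    return lead && trail;
}
```

Every pair accepted by is_euckr_pair() is also accepted by is_uhc_pair(), which is the structural sense in which existing EUC-KR data stayed valid, while a pair such as 0x81 0x41 is accepted only by the UHC check.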

A small downstream consequence of UHC's slot-filling approach is that
byte-wise comparison no longer matches Korean dictionary order: the
8,822 added Hangul land in the low 0x81-0xA0 range, ahead of the
gananada-ordered EUC-KR region.  Unicode's Hangul Syllables block
(U+AC00-U+D7A3) later restored that by assigning all 11,172 syllables
algorithmically in gananada order, so UTF-8 memcmp once again
produces Korean lexicographic order -- one of the quieter practical
drivers of Korea's Unicode migration.
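
A small sketch of the ordering point, for the curious.  It computes a syllable's code point from 0-based jamo indexes with the standard Unicode formula, encodes it as three UTF-8 bytes, and compares with memcmp().  The UHC byte pairs used in the note below (가 = 0xB0 0xA1, 갂 = 0x81 0x41) are taken from the CP949 mapping as I understand it, so treat them as assumptions; the function names are illustrative.

```c
#include <stdint.h>
#include <string.h>

/*
 * Code point of a precomposed syllable from 0-based jamo indexes:
 * initial 0..18, medial 0..20, final 0..27 (0 = no final).
 */
uint32_t hangul_syllable(int initial, int medial, int final)
{
    return 0xAC00u + (uint32_t) ((initial * 21 + medial) * 28 + final);
}

/* Encode a code point in U+0800..U+FFFF as three UTF-8 bytes. */
void utf8_encode3(uint32_t cp, unsigned char out[3])
{
    out[0] = (unsigned char) (0xE0 | (cp >> 12));
    out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char) (0x80 | (cp & 0x3F));
}

/* Byte-wise order of two syllables in UTF-8: <0, 0, or >0. */
int utf8_hangul_cmp(uint32_t a, uint32_t b)
{
    unsigned char ba[3];
    unsigned char bb[3];

    utf8_encode3(a, ba);
    utf8_encode3(b, bb);
    return memcmp(ba, bb, 3);
}

/* Byte-wise order of two 2-byte UHC sequences: <0, 0, or >0. */
int uhc_cmp(const unsigned char a[2], const unsigned char b[2])
{
    return memcmp(a, b, 2);
}
```

With 가 = hangul_syllable(0, 0, 0) and 갂 = hangul_syllable(0, 0, 2), utf8_hangul_cmp() puts 가 first, matching dictionary order, whereas uhc_cmp() on the byte pairs above puts 갂 first because its lead byte 0x81 sorts below 0xB0.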

Credit where it's due on that outcome: getting all 11,172 precomposed
Hangul into the BMP in algorithmic gananada order (the "Korean
Hangul Mess" cleanup in Unicode 2.0, 1996) wasn't inevitable.
Engineers at Microsoft's Korean office were notable advocates for
that arrangement alongside Korean standards-body contributors and
other vendors, and the Korean computing world has been quietly
benefiting from it ever since.  It's a nice detail given who's
reading this thread.

Everything else in the summary matches what I had -- thanks for the
independent write-up, and for taking another look at the patch.

> > The counter argument would be that you could use iconv
> > --from-code=JOHAB ..., or libiconv, or the codecs available in Python,
> > Java, etc for dealing with historical archived data, something that
> > data archivists must be very aware of.
>
> Sure.  But it's not comfortable to remove a user-visible feature
> we've had for decades.  My own primary concern about it was that a
> correct fix could require non-backwards-compatible behavior changes.
> Henson's analysis says that that's not a problem.  So assuming this
> patch withstands review, I'd be much happier to see it applied than
> to remove JOHAB.

Thank you -- backward compatibility was the point I was hoping would
carry the argument, and I'm glad the analysis held up.  On the size of the
remaining audience: niche Korean standards have a small but stubborn
user base, much the way Dvorak users persist in the West.  There are
still 세벌식 (Sebeolsik) keyboard users in Korea who keep hand-cut
stickers over their QWERTY-printed keycaps rather than switch back;
the JOHAB data holdouts are that kind of tail -- vanishingly small in
absolute numbers, but without a graceful alternative if we close the
door.  A correctly-working JOHAB serves that tail at near-zero
ongoing cost, which is ultimately what the patch is arguing for.

> No opinion at the moment about whether to back-patch.

Happy to defer on back-patching.  The behaviour change is strictly
additive (previously rejected sequences are now accepted, nothing is
reinterpreted), so the back-branches are technically safe, but
v19-only is a perfectly reasonable policy call if the project prefers
minimum surface area on the first cycle.

If you do want back-patches, I'm happy to produce per-branch
versions.  Given how long the JOHAB code has been stable (as noted
earlier in the thread), my feeling is that the same patch should
apply cleanly down to PG 14 without modification.  Happy to verify
that and post the set if it would help.

One personal aside: reading KS X 1001 Annex 3 end-to-end for this fix
turned out to be an unexpectedly cheerful detour -- it felt a bit
like cracking open a 6502 assembly reference from roughly the same
era.  Back then I also had a popular neural-networks book that
convinced teenage-me computers would never approach human cognition
because they could never match the brain's memory scale -- a
prediction that, looking around in 2026, has aged about as well as
you'd expect.  Thanks to everyone on the thread for making that
side-quest worthwhile.

Regards,
Henson


* Re: BUG #19354: JOHAB rejects valid byte sequences
@ 2026-04-15 05:57  Henson Choi <[email protected]>
  parent: Henson Choi <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Henson Choi @ 2026-04-15 05:57 UTC
  To: Thomas Munro <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; Robert Haas <[email protected]>; Tom Lane <[email protected]>; Jeroen Vermeulen <[email protected]>; VASUKI M <[email protected]>; [email protected]

Subject: Fix and expand comments for Korean encodings in encnames.c

Hi hackers,

While reading through the encoding alias table in src/common/encnames.c,
I noticed a few long-standing inaccuracies and omissions in the per-entry
comments for the three Korean encodings.

The most visible issue is the JOHAB entry, whose comment describes it as
"Extended Unix Code for simplified Chinese" -- apparently a copy/paste
slip from a neighboring EUC entry.  JOHAB is in fact the Korean
combining-style encoding defined in KS X 1001 annex 3.

The attached 0002 patch makes comment-only adjustments to the three
Korean encodings:

  * JOHAB: replace the incorrect "simplified Chinese" description with
    a correct one that identifies it as the Korean combining (Johab)
    encoding standardized in KS X 1001 annex 3.

  * EUC_KR: drop a stray space before the comma in the existing
    comment, and note that the encoding covers the KS X 1001
    precomposed (Wansung) form.

  * UHC: spell out "Unified Hangul Code", clarify that it is
    Microsoft Windows CodePage 949, and describe its relationship to
    EUC-KR (superset covering all 11,172 precomposed Hangul syllables).

No behavior change, no catalog change, no pg_wchar.h change -- this
touches comments in src/common/encnames.c only.  pgindent is clean.

Thanks,
Henson Choi

From c7a7335d2cf5a2881b25d9091fd020a2d62f7661 Mon Sep 17 00:00:00 2001
From: Henson Choi <[email protected]>
Date: Wed, 15 Apr 2026 14:52:35 +0900
Subject: [PATCH v1] Fix and expand comments for Korean encodings in encnames.c

---
 src/common/encnames.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/src/common/encnames.c b/src/common/encnames.c
index 9085dbecce1..959b991dde4 100644
--- a/src/common/encnames.c
+++ b/src/common/encnames.c
@@ -61,8 +61,9 @@ static const pg_encname pg_encname_tbl[] =
 								 * Japanese, standard OSF */
 	{
 		"euckr", PG_EUC_KR
-	},							/* EUC-KR; Extended Unix Code for Korean , KS
-								 * X 1001 standard */
+	},							/* EUC-KR; Extended Unix Code for Korean
+								 * precomposed (Wansung) encoding, standard KS
+								 * X 1001 */
 	{
 		"euctw", PG_EUC_TW
 	},							/* EUC-TW; Extended Unix Code for
@@ -119,8 +120,8 @@ static const pg_encname pg_encname_tbl[] =
 	},							/* ISO-8859-9; RFC1345,KXS2 */
 	{
 		"johab", PG_JOHAB
-	},							/* JOHAB; Extended Unix Code for simplified
-								 * Chinese */
+	},							/* JOHAB; Korean combining (Johab) encoding,
+								 * standard KS X 1001 annex 3 */
 	{
 		"koi8", PG_KOI8R
 	},							/* _dirty_ alias for KOI8-R (backward
@@ -186,7 +187,9 @@ static const pg_encname pg_encname_tbl[] =
 	},							/* alias for WIN1258 */
 	{
 		"uhc", PG_UHC
-	},							/* UHC; Korean Windows CodePage 949 */
+	},							/* UHC; Unified Hangul Code, Microsoft Windows
+								 * CodePage 949; superset of EUC-KR covering
+								 * all 11,172 precomposed Hangul syllables */
 	{
 		"unicode", PG_UTF8
 	},							/* alias for UTF8 */
-- 
2.50.1 (Apple Git-155)



Attachments:

  [text/plain] 0002-Fix-and-expand-comments-for-Korean-encodings.txt (1.7K, 3-0002-Fix-and-expand-comments-for-Korean-encodings.txt)

* Re: BUG #19354: JOHAB rejects valid byte sequences
@ 2026-04-15 06:59  Thomas Munro <[email protected]>
  parent: Henson Choi <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Thomas Munro @ 2026-04-15 06:59 UTC
  To: [email protected]; +Cc: Heikki Linnakangas <[email protected]>; Robert Haas <[email protected]>; Tom Lane <[email protected]>; Jeroen Vermeulen <[email protected]>; VASUKI M <[email protected]>; [email protected]

On Wed, Apr 15, 2026 at 5:58 PM Henson Choi <[email protected]> wrote:
> While reading through the encoding alias table in src/common/encnames.c,
> I noticed a few long-standing inaccuracies and omissions in the per-entry
> comments for the three Korean encodings.

LGTM, so I will go ahead and push this to all branches.

* Re: BUG #19354: JOHAB rejects valid byte sequences
@ 2026-05-14 19:36  Robert Haas <[email protected]>
  parent: Thomas Munro <[email protected]>
  0 siblings, 1 reply; 5+ messages in thread

From: Robert Haas @ 2026-05-14 19:36 UTC
  To: Thomas Munro <[email protected]>; +Cc: [email protected]; Heikki Linnakangas <[email protected]>; Tom Lane <[email protected]>; Jeroen Vermeulen <[email protected]>; VASUKI M <[email protected]>; [email protected]

On Wed, Apr 15, 2026 at 2:59 AM Thomas Munro <[email protected]> wrote:
>
> On Wed, Apr 15, 2026 at 5:58 PM Henson Choi <[email protected]> wrote:
> > While reading through the encoding alias table in src/common/encnames.c,
> > I noticed a few long-standing inaccuracies and omissions in the per-entry
> > comments for the three Korean encodings.
>
> LGTM, so I will go ahead and push this to all branches.

I see that this was done, but this isn't the actual fix for this
issue, right? Is somebody going to apply the main fix patch (perhaps
just to master)?

-- 
Robert Haas
EDB: http://www.enterprisedb.com


* Re: BUG #19354: JOHAB rejects valid byte sequences
@ 2026-05-16 09:39  Henson Choi <[email protected]>
  parent: Robert Haas <[email protected]>
  0 siblings, 0 replies; 5+ messages in thread

From: Henson Choi @ 2026-05-16 09:39 UTC
  To: Robert Haas <[email protected]>; +Cc: Thomas Munro <[email protected]>; Heikki Linnakangas <[email protected]>; Tom Lane <[email protected]>; Jeroen Vermeulen <[email protected]>; VASUKI M <[email protected]>; [email protected]

On Fri, May 15, 2026 at 4:36 AM Robert Haas <[email protected]> wrote:

> On Wed, Apr 15, 2026 at 2:59 AM Thomas Munro <[email protected]> wrote:
> >
> > On Wed, Apr 15, 2026 at 5:58 PM Henson Choi <[email protected]> wrote:
> > > While reading through the encoding alias table in src/common/encnames.c,
> > > I noticed a few long-standing inaccuracies and omissions in the per-entry
> > > comments for the three Korean encodings.
> >
> > LGTM, so I will go ahead and push this to all branches.
>
> I see that this was done, but this isn't the actual fix for this
> issue, right? Is somebody going to apply the main fix patch (perhaps
> just to master)?


Right -- the cosmetic encnames.c comment cleanup is what Thomas
pushed; the actual verifier fix is still pending.  Tatsuo gave a +1
for preserving JOHAB and applying the verifier correction.

Thomas may have other priorities right now.  This is not urgent on my
end, so I am content to wait for him.

Thanks,
Henson


