public inbox for [email protected]
Re: BUG #19354: JOHAB rejects valid byte sequences
5+ messages / 3 participants
* Re: BUG #19354: JOHAB rejects valid byte sequences
From: Henson Choi @ 2026-04-15 04:25 UTC
To: Thomas Munro <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; Robert Haas <[email protected]>; Tom Lane <[email protected]>; Jeroen Vermeulen <[email protected]>; VASUKI M <[email protected]>; [email protected]

> 3. UHC (= "Unified Hangul Code", invented by Microsoft): used EUC-KR
> as a base but supplied all possible pre-composed Hangul and 8,222
> Hanja (complete CJK as of Unicode 2.0).

Small correction: UHC's additions over EUC-KR are on the Hangul side, not Hanja. UHC adds 8,822 pre-composed Hangul (taking Hangul coverage from EUC-KR's 2,350 up to the full 11,172) and leaves Hanja unchanged at KS X 1001's 4,888. I enumerated all three encodings against PostgreSQL's current conversion tables to double-check:

  Encoding    Hangul    Hanja
  EUC_KR       2,350    4,888
  UHC         11,172    4,888
  JOHAB       11,172    4,888   (after this patch)

"Complete CJK as of Unicode 2.0" is off too -- Unicode 2.0's CJK Unified Ideographs block had roughly 20,900 characters, so UHC and JOHAB both carry only the KS X 1001 Hanja subset. The 8,222 figure looks like it got swapped with the 8,822 Hangul number.

> Realpolitik that fed back into standards:
> 1. The Hancom "Hangul" word processor used de facto standard JOHAB
> encoding, and dominated.
> 2. KS X 1001 recognised this and added that annex.

Minor nit on the sequence: KS C 5601 already had a combinational annex in its 1982 revision, but with a different bit layout from the one Hancom's word processor used. The 1992 revision swapped the annex's bit layout to the commercial combinational form (상용 조합형) that the industry -- Hancom included -- had popularised. The KS X 1001:2004 commentary documents this transition explicitly ("비트 조합을 널리 쓰고 있는 이른바 상용 조합형으로 바꿈", roughly "changed the bit combination to the so-called commercial combinational form that is in wide use").
So "KS recognised the de facto standard" applies to 1992, not to the annex's first appearance.

Worth mentioning for atmosphere: that period was the tail end of the Apple II clone / MSX era and the rise of IBM PC compatibles in Korea, and contemporary Korean computer magazines carried long-running debates over Wansung vs Johab on three axes at once -- the encoding, the keyboard layout (두벌식 (Dubeolsik) vs 세벌식 (Sebeolsik), the Korean QWERTY-vs-Dvorak argument), and the font rendering strategy (per-syllable bitmap tables for Wansung vs jamo-composition for Johab) -- right alongside their game reviews. The 1992 annex revision landed in the middle of that churn, not ahead of it.

One further observation that fits your KS X 1002 note. EUC-KR isn't really a single standard but a layered stack -- KS X 1001 (the character set) + ISO/IEC 2022 (the code-extension skeleton) + the AT&T-era EUC convention of pinning G0 to ASCII and G1 to the 8-bit region, later formalised in Korea as KS X 2901. That informal layering is precisely what let UHC land so easily: Microsoft extended the same 8-bit region with additional Hangul, and every EUC-KR decoder silently kept working for the covered subset.

KS X 1002 tried the opposite approach -- a formally separated supplementary set, designated via a distinct ISO-2022 escape sequence. The design was cleaner on paper but required every consumer to implement set-switching for a supplementary character range that nobody was motivated to support. UHC sidestepped this entirely by just filling in the unused 8-bit slots.

So the structural reason 1002 lost to UHC isn't just market power; it is that UHC matched EUC-KR's informal extensibility while 1002 demanded strict ISO-2022 compliance. JOHAB (Annex 3) sits at the other end of that spectrum -- a self-contained spec where a single document nails down character set, byte layout, and composition algorithm, which is what makes the verifier fix tractable.
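
The slot-filling described above can be made concrete with a small byte-range classifier. This is only a sketch: it assumes the commonly documented CP949 layout (lead bytes 0x81-0xA0 with trail bytes 0x41-0x5A, 0x61-0x7A and 0x81-0xFE; lead bytes 0xA1-0xC6 with trail bytes 0x41-0x5A, 0x61-0x7A and 0x81-0xA0, ending at 0xC6 0x52), none of which is spelled out in this thread. The type and function names are hypothetical and are not PostgreSQL identifiers.

```c
/* Hypothetical labels for this sketch; these are not PostgreSQL identifiers. */
typedef enum
{
	PAIR_INVALID,				/* not a two-byte slot in either layout */
	PAIR_EUC_KR,				/* KS X 1001 slot, valid in plain EUC-KR */
	PAIR_UHC_EXT				/* slot that only CP949/UHC assigns */
} pair_class;

/* Trail bytes that plain EUC-KR never uses: A-Z, a-z, and 0x81..0xA0. */
static int
is_ext_trail(unsigned char t)
{
	return (t >= 0x41 && t <= 0x5A) ||
		(t >= 0x61 && t <= 0x7A) ||
		(t >= 0x81 && t <= 0xA0);
}

/* Bytes of the G1 (8-bit) region that EUC-KR uses for both lead and trail. */
static int
is_euc_byte(unsigned char b)
{
	return b >= 0xA1 && b <= 0xFE;
}

/* Classify a lead/trail pair under the assumed CP949 layout. */
static pair_class
classify_pair(unsigned char lead, unsigned char trail)
{
	/* Anything an EUC-KR decoder already accepted keeps its meaning. */
	if (is_euc_byte(lead) && is_euc_byte(trail))
		return PAIR_EUC_KR;

	/* Rows below the EUC-KR lead range: every extended trail byte is used. */
	if (lead >= 0x81 && lead <= 0xA0 &&
		(is_ext_trail(trail) || is_euc_byte(trail)))
		return PAIR_UHC_EXT;

	/* Columns beside the EUC-KR rows; the extension ends at 0xC6 0x52. */
	if (lead >= 0xA1 && lead <= 0xC6 && is_ext_trail(trail) &&
		!(lead == 0xC6 && trail > 0x52))
		return PAIR_UHC_EXT;

	return PAIR_INVALID;
}

/* Count every two-byte slot that exists only in the extension. */
static int
count_uhc_extension(void)
{
	int			count = 0;
	int			lead;
	int			trail;

	for (lead = 0x81; lead <= 0xFE; lead++)
		for (trail = 0x00; trail <= 0xFF; trail++)
			if (classify_pair((unsigned char) lead,
							  (unsigned char) trail) == PAIR_UHC_EXT)
				count++;
	return count;
}
```

Under these assumptions count_uhc_extension() comes to 8,822 (32 x 178 + 37 x 84 + 18), which added to EUC-KR's 2,350 gives the 11,172 = 19 x 21 x 28 syllables in the table earlier in this message. Because classify_pair() returns PAIR_EUC_KR for exactly the pairs a plain EUC-KR decoder accepted, the extension never changes the meaning of existing data.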

A small downstream consequence of UHC's slot-filling approach is that byte-wise comparison no longer matches Korean dictionary order: the 8,822 added Hangul land in the low 0x81-0xA0 range, ahead of the ganada-ordered EUC-KR region. Unicode's Hangul Syllables block (U+AC00-U+D7A3) later restored that by assigning all 11,172 syllables algorithmically in ganada order, so UTF-8 memcmp once again produces Korean lexicographic order -- one of the quieter practical drivers of Korea's Unicode migration.

Credit where it's due on that outcome: getting all 11,172 precomposed Hangul into the BMP in algorithmic ganada order (the "Korean Hangul Mess" cleanup in Unicode 2.0, 1996) wasn't inevitable. Engineers at Microsoft's Korean office were notable advocates for that arrangement alongside Korean standards-body contributors and other vendors, and the Korean computing world has been quietly benefiting from it ever since. It's a nice detail given who's reading this thread.

Everything else in the summary matches what I had -- thanks for the independent write-up, and for taking another look at the patch.

> > The counter argument would be that you could use iconv
> > --from-code=JOHAB ..., or libiconv, or the codecs available in Python,
> > Java, etc for dealing with historical archived data, something that
> > data archivists must be very aware of.
>
> Sure. But it's not comfortable to remove a user-visible feature
> we've had for decades. My own primary concern about it was that a
> correct fix could require non-backwards-compatible behavior changes.
> Henson's analysis says that that's not a problem. So assuming this
> patch withstands review, I'd be much happier to see it applied than
> to remove JOHAB.

Thank you -- the backward-compat angle was the hinge I was hoping would carry, and I'm glad the analysis held up.

On the size of the remaining audience: niche Korean standards have a small but stubborn user base, much the way Dvorak users persist in the West.
There are still 세벌식 (Sebeolsik) keyboard users in Korea who keep hand-cut stickers over their QWERTY-printed keycaps rather than switch back; the JOHAB data holdouts are that kind of tail -- vanishingly small in absolute numbers, but without a graceful alternative if we close the door. A correctly-working JOHAB serves that tail at near-zero ongoing cost, which is ultimately what the patch is arguing for.

> No opinion at the moment about whether to back-patch.

Happy to defer on back-patching. The behaviour change is strictly additive (previously-rejected sequences start being accepted, nothing is reinterpreted), so the back-branches are technically safe, but v19-only is a perfectly reasonable policy call if the project prefers minimum surface area on the first cycle.

If you do want back-patches, I'm happy to produce per-branch versions. Given how long the JOHAB code has been stable (as noted earlier in the thread), my feeling is that the same patch should apply cleanly down to PG 14 without modification. Happy to verify that and post the set if it would help.

One personal aside: reading KS X 1001 Annex 3 end-to-end for this fix turned out to be an unexpectedly cheerful detour -- it felt a bit like cracking open a 6502 assembly reference from roughly the same era. Back then I also had a popular neural-networks book that convinced teenage-me computers would never approach human cognition because they could never match the brain's memory scale -- a prediction that, looking around in 2026, has aged about as well as you'd expect. Thanks to everyone on the thread for making that side-quest worthwhile.

Regards,
Henson
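
The sort-order point made earlier in this message (algorithmic assignment of the Hangul Syllables block, and memcmp on UTF-8 following code point order) can be checked mechanically. A minimal sketch in C, assuming the syllable composition formula from the Unicode Standard, S = 0xAC00 + (L x 21 + V) x 28 + T, with zero-based jamo indices; the function names are hypothetical and do not come from PostgreSQL.

```c
#include <stdint.h>
#include <string.h>

#define HANGUL_BASE 0xAC00u		/* first syllable, U+AC00 */
#define L_COUNT 19u				/* leading consonants */
#define V_COUNT 21u				/* vowels */
#define T_COUNT 28u				/* trailing consonants, including "none" */

/* Compose a syllable code point from zero-based jamo indices. */
static uint32_t
hangul_compose(unsigned l, unsigned v, unsigned t)
{
	return HANGUL_BASE + (l * V_COUNT + v) * T_COUNT + t;
}

/* Encode a code point from the three-byte UTF-8 range (U+0800..U+FFFF). */
static void
utf8_encode3(uint32_t cp, unsigned char out[3])
{
	out[0] = (unsigned char) (0xE0 | (cp >> 12));
	out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
	out[2] = (unsigned char) (0x80 | (cp & 0x3F));
}

/*
 * Walk the whole block in code point order and confirm that each syllable's
 * UTF-8 bytes compare strictly greater than its predecessor's.
 * Returns 1 if byte order and code point order agree everywhere, else 0.
 */
static int
utf8_order_matches(void)
{
	unsigned char prev[3];
	unsigned char cur[3];
	uint32_t	last = HANGUL_BASE + L_COUNT * V_COUNT * T_COUNT - 1;
	uint32_t	cp;

	utf8_encode3(HANGUL_BASE, prev);
	for (cp = HANGUL_BASE + 1; cp <= last; cp++)
	{
		utf8_encode3(cp, cur);
		if (memcmp(prev, cur, 3) >= 0)
			return 0;
		memcpy(prev, cur, 3);
	}
	return 1;
}
```

With these definitions hangul_compose(0, 0, 0) is U+AC00 and hangul_compose(18, 20, 27) is U+D7A3, the two ends of the block. Since the formula is monotonic in (L, V, T) and UTF-8 preserves code point order, a plain byte comparison sorts the precomposed syllables in dictionary order without any collation table.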

* Re: BUG #19354: JOHAB rejects valid byte sequences
From: Henson Choi @ 2026-04-15 05:57 UTC
To: Thomas Munro <[email protected]>; +Cc: Heikki Linnakangas <[email protected]>; Robert Haas <[email protected]>; Tom Lane <[email protected]>; Jeroen Vermeulen <[email protected]>; VASUKI M <[email protected]>; [email protected]
Subject: Fix and expand comments for Korean encodings in encnames.c

Hi hackers,

While reading through the encoding alias table in src/common/encnames.c, I noticed a few long-standing inaccuracies and omissions in the per-entry comments for the three Korean encodings.

The most visible issue is the JOHAB entry, whose comment describes it as "Extended Unix Code for simplified Chinese" -- apparently a copy/paste slip from a neighboring EUC entry. JOHAB is in fact the Korean combining-style encoding defined in KS X 1001 annex 3.

The attached 0002 patch makes comment-only adjustments to the three Korean encodings:

  * JOHAB: replace the incorrect "simplified Chinese" description with a correct one that identifies it as the Korean combining (Johab) encoding standardized in KS X 1001 annex 3.

  * EUC_KR: drop a stray space before the comma in the existing comment, and note that the encoding covers the KS X 1001 precomposed (Wansung) form.

  * UHC: spell out "Unified Hangul Code", clarify that it is Microsoft Windows CodePage 949, and describe its relationship to EUC-KR (superset covering all 11,172 precomposed Hangul syllables).

No behavior change, no catalog change, no pg_wchar.h change -- this touches comments in src/common/encnames.c only. pgindent is clean.

Thanks,
Henson Choi

Attachments: [text/plain] 0002-Fix-and-expand-comments-for-Korean-encodings.txt (1.7K)

From c7a7335d2cf5a2881b25d9091fd020a2d62f7661 Mon Sep 17 00:00:00 2001
From: Henson Choi <[email protected]>
Date: Wed, 15 Apr 2026 14:52:35 +0900
Subject: [PATCH v1] Fix and expand comments for Korean encodings in encnames.c

---
 src/common/encnames.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/src/common/encnames.c b/src/common/encnames.c
index 9085dbecce1..959b991dde4 100644
--- a/src/common/encnames.c
+++ b/src/common/encnames.c
@@ -61,8 +61,9 @@ static const pg_encname pg_encname_tbl[] =
 								 * Japanese, standard OSF */
 	{
 		"euckr", PG_EUC_KR
-	},							/* EUC-KR; Extended Unix Code for Korean , KS
-								 * X 1001 standard */
+	},							/* EUC-KR; Extended Unix Code for Korean
+								 * precomposed (Wansung) encoding, standard KS
+								 * X 1001 */
 	{
 		"euctw", PG_EUC_TW
 	},							/* EUC-TW; Extended Unix Code for
@@ -119,8 +120,8 @@ static const pg_encname pg_encname_tbl[] =
 	},							/* ISO-8859-9; RFC1345,KXS2 */
 	{
 		"johab", PG_JOHAB
-	},							/* JOHAB; Extended Unix Code for simplified
-								 * Chinese */
+	},							/* JOHAB; Korean combining (Johab) encoding,
+								 * standard KS X 1001 annex 3 */
 	{
 		"koi8", PG_KOI8R
 	},							/* _dirty_ alias for KOI8-R (backward
@@ -186,7 +187,9 @@ static const pg_encname pg_encname_tbl[] =
 	},							/* alias for WIN1258 */
 	{
 		"uhc", PG_UHC
-	},							/* UHC; Korean Windows CodePage 949 */
+	},							/* UHC; Unified Hangul Code, Microsoft Windows
+								 * CodePage 949; superset of EUC-KR covering
+								 * all 11,172 precomposed Hangul syllables */
 	{
 		"unicode", PG_UTF8
 	},							/* alias for UTF8 */
-- 
2.50.1 (Apple Git-155)

* Re: BUG #19354: JOHAB rejects valid byte sequences
From: Thomas Munro @ 2026-04-15 06:59 UTC
To: [email protected]; +Cc: Heikki Linnakangas <[email protected]>; Robert Haas <[email protected]>; Tom Lane <[email protected]>; Jeroen Vermeulen <[email protected]>; VASUKI M <[email protected]>; [email protected]

On Wed, Apr 15, 2026 at 5:58 PM Henson Choi <[email protected]> wrote:
> While reading through the encoding alias table in src/common/encnames.c,
> I noticed a few long-standing inaccuracies and omissions in the per-entry
> comments for the three Korean encodings.

LGTM, so I will go ahead and push this to all branches.

* Re: BUG #19354: JOHAB rejects valid byte sequences
From: Robert Haas @ 2026-05-14 19:36 UTC
To: Thomas Munro <[email protected]>; +Cc: [email protected]; Heikki Linnakangas <[email protected]>; Tom Lane <[email protected]>; Jeroen Vermeulen <[email protected]>; VASUKI M <[email protected]>; [email protected]

On Wed, Apr 15, 2026 at 2:59 AM Thomas Munro <[email protected]> wrote:
> On Wed, Apr 15, 2026 at 5:58 PM Henson Choi <[email protected]> wrote:
> > While reading through the encoding alias table in src/common/encnames.c,
> > I noticed a few long-standing inaccuracies and omissions in the per-entry
> > comments for the three Korean encodings.
>
> LGTM, so I will go ahead and push this to all branches.

I see that this was done, but this isn't the actual fix for this issue, right? Is somebody going to apply the main fix patch (perhaps just to master)?

-- 
Robert Haas
EDB: http://www.enterprisedb.com

* Re: BUG #19354: JOHAB rejects valid byte sequences
From: Henson Choi @ 2026-05-16 09:39 UTC
To: Robert Haas <[email protected]>; +Cc: Thomas Munro <[email protected]>; Heikki Linnakangas <[email protected]>; Tom Lane <[email protected]>; Jeroen Vermeulen <[email protected]>; VASUKI M <[email protected]>; [email protected]

On Fri, May 15, 2026 at 4:36 AM Robert Haas <[email protected]> wrote:
> On Wed, Apr 15, 2026 at 2:59 AM Thomas Munro <[email protected]> wrote:
> > On Wed, Apr 15, 2026 at 5:58 PM Henson Choi <[email protected]> wrote:
> > > While reading through the encoding alias table in src/common/encnames.c,
> > > I noticed a few long-standing inaccuracies and omissions in the per-entry
> > > comments for the three Korean encodings.
> >
> > LGTM, so I will go ahead and push this to all branches.
>
> I see that this was done, but this isn't the actual fix for this
> issue, right? Is somebody going to apply the main fix patch (perhaps
> just to master)?

Right -- the cosmetic encnames.c comment cleanup is what Thomas pushed; the actual verifier fix is still pending.

Tatsuo gave a +1 for preserving JOHAB and applying the verifier correction. Thomas may have other priorities right now. This is not urgent on my end, so I am content to wait for him.

Thanks,
Henson

end of thread (newest message: 2026-05-16 09:39 UTC)

Thread overview: 5+ messages
2026-04-15 04:25 Re: BUG #19354: JOHAB rejects valid byte sequences Henson Choi <[email protected]>
2026-04-15 05:57 ` Henson Choi <[email protected]>
2026-04-15 06:59   ` Thomas Munro <[email protected]>
2026-05-14 19:36     ` Robert Haas <[email protected]>
2026-05-16 09:39       ` Henson Choi <[email protected]>