Tighten pg_uhc_verifychar() to enforce CP949 lead/trail byte ranges

From: 유도건 <[email protected]>
To: [email protected]
To: Henson Choi <[email protected]>
To: Tatsuo Ishii <[email protected]>
To: [email protected]
Subject: Tighten pg_uhc_verifychar() to enforce CP949 lead/trail byte ranges
Date: Fri, 5 Jun 2026 11:20:26 +0900
Message-ID: <CAFVBZ_GuA1SrRDqUNnCPzbCZGFvzC18+-0YQEKpAnJesut1xew@mail.gmail.com>

Hi,

Per CP949 (Windows-949), a two-byte UHC sequence requires the lead
byte to be in 0x81-0xFE and the trail byte to be in 0x41-0x5A,
0x61-0x7A, or 0x81-0xFE.
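
As a minimal sketch of those ranges (not the patch itself), the rule can
be written as a standalone predicate.  The names cp949_pair_in_range()
and check_cp949_boundaries() are hypothetical and do not exist in
PostgreSQL; the boundary values are the ones exercised by the regression
test in 0001.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical helper, not a PostgreSQL function: returns true if
 * (lead, trail) lies inside the CP949 two-byte ranges quoted above.
 */
static bool
cp949_pair_in_range(unsigned char lead, unsigned char trail)
{
	/* lead byte must be 0x81-0xFE */
	if (lead < 0x81 || lead > 0xfe)
		return false;

	/* trail byte must be 0x41-0x5A, 0x61-0x7A, or 0x81-0xFE */
	return (trail >= 0x41 && trail <= 0x5a) ||
		(trail >= 0x61 && trail <= 0x7a) ||
		(trail >= 0x81 && trail <= 0xfe);
}

/* Boundary cases mirroring the byte pairs used in the 0001 test. */
static void
check_cp949_boundaries(void)
{
	/* in range: first and last byte of each trail range */
	assert(cp949_pair_in_range(0x81, 0x41));
	assert(cp949_pair_in_range(0x81, 0x5a));
	assert(cp949_pair_in_range(0x81, 0x61));
	assert(cp949_pair_in_range(0x81, 0x7a));
	assert(cp949_pair_in_range(0x81, 0x81));
	assert(cp949_pair_in_range(0x81, 0xfe));

	/* out of range: lead bytes just outside 0x81-0xFE */
	assert(!cp949_pair_in_range(0x80, 0x41));
	assert(!cp949_pair_in_range(0xff, 0x41));

	/* out of range: trail bytes adjacent to each valid range */
	assert(!cp949_pair_in_range(0x81, 0x40));
	assert(!cp949_pair_in_range(0x81, 0x5b));
	assert(!cp949_pair_in_range(0x81, 0x60));
	assert(!cp949_pair_in_range(0x81, 0x7b));
	assert(!cp949_pair_in_range(0x81, 0x80));
	assert(!cp949_pair_in_range(0x81, 0xff));
}
```

The predicate only answers whether a pair is structurally valid; whether
an in-range pair actually has a Unicode mapping is still a question for
the conversion table.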

pg_uhc_verifychar() in src/common/wchar.c accepts any lead byte
with the high bit set (0x80-0xFF) and any trail byte other than
NUL, without enforcing those ranges.  Out-of-range pairs such as
0x80 0x41 (invalid lead) or 0x81 0x40 (invalid trail) are accepted
by the verifier and rejected only later by the conversion table,
with the message:

  ERROR:  character with byte sequence 0x80 0x41 in encoding "UHC"
          has no equivalent in encoding "UTF8"

This is misleading -- those pairs are not unmappable, they are
structurally invalid in CP949 -- and it is inconsistent with
pg_euckr_verifychar() (src/common/wchar.c:1044), which already
enforces lead/trail byte ranges explicitly via IS_EUC_RANGE_VALID().
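
For illustration, the current (pre-patch) verifier logic can be modeled
as follows.  This is a simplified standalone sketch derived from the
lines removed in 0002, not the actual source: the function name
uhc_verifychar_lenient() and the SENTINEL_* macros are hypothetical
stand-ins, and the sketch assumes pg_uhc_mblen() returns 2 for any byte
with the high bit set and 1 otherwise.

```c
#include <assert.h>

/* Stand-ins for the NONUTF8_INVALID sentinel pair (0x8d 0x20). */
#define SENTINEL_BYTE0 0x8d
#define SENTINEL_BYTE1 0x20

/*
 * Simplified model of the pre-patch check.  Returns the character
 * length in bytes, or -1 if the sequence is rejected.  The caller is
 * assumed to pass len >= 1.
 */
static int
uhc_verifychar_lenient(const unsigned char *s, int len)
{
	/* any byte with the high bit set is treated as a lead byte */
	int			l = (s[0] & 0x80) ? 2 : 1;

	if (len < l)
		return -1;				/* truncated two-byte character */

	if (l == 2 && s[0] == SENTINEL_BYTE0 && s[1] == SENTINEL_BYTE1)
		return -1;				/* sentinel pair */

	if (l == 2 && s[1] == '\0')
		return -1;				/* NUL trail byte */

	return l;
}
```

Under this model, 0x80 0x41 and 0x81 0x40 both pass with length 2, so
the only remaining place they can fail is the conversion step.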

The following evidence supports tightening the UHC verifier:

- Microsoft CP949 (Windows-949) specifies the two-byte form as
  lead 0x81-0xFE, trail 0x41-0x5A | 0x61-0x7A | 0x81-0xFE.
  Other byte values are not valid for the two-byte form.

- PostgreSQL's own UHC -> UTF-8 conversion table is already built
  on this assumption.  The radix tree header in
  src/backend/utils/mb/Unicode/uhc_to_utf8.map declares:

      0x81, /* b2_1_lower */
      0xfe, /* b2_1_upper */
      0x41, /* b2_2_lower */
      0xfe, /* b2_2_upper */

  i.e. the conversion side already restricts the byte ranges and
  rejects anything outside them; today the rejection simply happens
  in the wrong place (the conversion step rather than the verifier)
  and with the wrong message.

- pg_euckr_verifychar() already follows the strict shape: it
  validates lead/trail ranges directly rather than relying on
  pg_uhc_mblen() + a NUL-only trail check.  This patch brings
  pg_uhc_verifychar() in line with it.

This is split into two patches to make the change visible:

0001 -- Add a regression test for UHC.

  UHC is a client-only encoding, so there has been no dedicated
  test for pg_uhc_verifychar().  This adds
  src/test/regress/sql/uhc.sql, exercising the verifier through
  convert_from() in a UTF8 database.  The expected output records
  the *current* behavior on master, so this patch applies cleanly
  and all tests pass without any code change.

0002 -- Tighten pg_uhc_verifychar() to enforce CP949 byte ranges.

  Rewrite pg_uhc_verifychar() to check lead range (0x81-0xFE) and
  trail range (0x41-0x5A, 0x61-0x7A, or 0x81-0xFE) directly,
  following the style of pg_euckr_verifychar().  The new
  trail-range check also subsumes the previous NONUTF8_INVALID
  sentinel check (0x8d 0x20), which is removed -- 0x20 is not in
  any valid trail range, so 0x8d 0x20 is still rejected.

  The diff in expected/uhc.out changes exactly eight error lines,
  each of the form:

      -ERROR:  character with byte sequence 0xXX 0xYY in encoding
      -        "UHC" has no equivalent in encoding "UTF8"
      +ERROR:  invalid byte sequence for encoding "UHC": 0xXX 0xYY

  No other test result changes.  This makes the user-visible
  effect of the fix self-evident:

  - the accept/reject outcome for any input is unchanged;
  - the error message format changes from "has no equivalent in
    encoding UTF8" to "invalid byte sequence for encoding UHC"
    for the eight previously misclassified pairs;
  - rejection moves from the conversion step to the verifier,
    which is the appropriate place for a structural check.
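
  The claim that the new range check subsumes the old checks can be
  sanity-checked by brute force over all two-byte pairs.  The sketch
  below is standalone and hypothetical (lenient_accepts(),
  strict_accepts() and count_newly_rejected() are not PostgreSQL
  functions); it models the old rule as "high-bit lead, non-NUL trail,
  not the sentinel pair" and the new rule as the CP949 ranges.

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the old rule: high-bit lead, non-NUL trail, not 0x8d 0x20. */
static bool
lenient_accepts(unsigned int lead, unsigned int trail)
{
	return lead >= 0x80 && trail != 0x00 &&
		!(lead == 0x8d && trail == 0x20);
}

/* Model of the new rule: CP949 lead and trail byte ranges. */
static bool
strict_accepts(unsigned int lead, unsigned int trail)
{
	if (lead < 0x81 || lead > 0xfe)
		return false;
	return (trail >= 0x41 && trail <= 0x5a) ||
		(trail >= 0x61 && trail <= 0x7a) ||
		(trail >= 0x81 && trail <= 0xfe);
}

/*
 * Walk every (lead, trail) pair.  Asserts that the strict rule never
 * accepts a pair the lenient rule rejected, and returns how many pairs
 * are accepted by the lenient rule only.
 */
static int
count_newly_rejected(void)
{
	int			n = 0;

	for (unsigned int lead = 0; lead <= 0xff; lead++)
	{
		for (unsigned int trail = 0; trail <= 0xff; trail++)
		{
			bool		old_ok = lenient_accepts(lead, trail);
			bool		new_ok = strict_accepts(lead, trail);

			assert(!new_ok || old_ok);	/* strict implies lenient */
			if (old_ok && !new_ok)
				n++;
		}
	}
	return n;
}
```

  The inner assertion holding for every pair means the strict rule
  accepts a subset of what the lenient rule accepted, which covers the
  sentinel pair and NUL trail bytes.  It says nothing about the
  conversion table itself; that the overall accept/reject outcome is
  unchanged still rests on the map rejecting the newly excluded pairs.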

Only client-side paths are affected since UHC is not supported as
a server encoding.

This issue was reported by Henson Choi in [1].

[1]
https://postgr.es/m/CAAAe_zBdGXsALm%3DGkUPtPx9MLcjcM5hBg3HZU%2Bnh8gKXSjXJJw%40mail.gmail.com

v1 patches attached.

Regards,
DoGeon Yoo


Attachments:

  [application/octet-stream] v1-0001-Add-regression-test-for-UHC-encoding-baseline-capture.patch (8.2K, 3-v1-0001-Add-regression-test-for-UHC-encoding-baseline-capture.patch)
From dd5eb976fdefc447826b0310a782c2848c3f21a1 Mon Sep 17 00:00:00 2001
From: DoGeon Yoo <[email protected]>
Date: Thu, 14 May 2026 15:44:19 +0900
Subject: [PATCH v1 1/2] Add regression test for UHC encoding (baseline
 capture)

UHC is a client-only encoding, so pg_uhc_verifychar() can only be
exercised indirectly through convert_from() in a UTF8 database.
There has been no dedicated regression test for it.

This commit adds src/test/regress/sql/uhc.sql covering:

- valid two-byte sequences at the CP949 lead/trail boundaries
  (trail 0x41, 0x5A, 0x61, 0x7A, 0x81, 0xFE; high leads 0xC7, 0xFD)
- invalid lead bytes (0x80, 0xFF)
- invalid trail bytes (0x40, 0x5B, 0x60, 0x7B, 0x80, 0xFF)
- the NONUTF8_INVALID sentinel pair (0x8d 0x20)
- a truncated two-byte character

The expected output records the *current* behavior on master, not
the desired behavior.  In particular, the eight invalid-lead and
invalid-trail cases (0x80 0x41, 0xFF 0x41, 0x81 0x40, ...) are
currently accepted by pg_uhc_verifychar() and rejected only later
by the conversion table with "character with byte sequence ... has
no equivalent in encoding UTF8".

Capturing this behavior here makes the follow-up patch's diff
self-evident: a subsequent commit that tightens pg_uhc_verifychar()
to enforce the CP949 lead/trail byte ranges will turn those eight
"has no equivalent" errors into "invalid byte sequence" errors,
without changing any other test result.

uhc_1.out provides an early \quit fallback for non-UTF8 databases.
---
 src/test/regress/expected/uhc.out   | 86 +++++++++++++++++++++++++++++
 src/test/regress/expected/uhc_1.out |  6 ++
 src/test/regress/parallel_schedule  |  2 +-
 src/test/regress/sql/uhc.sql        | 36 ++++++++++++
 4 files changed, 129 insertions(+), 1 deletion(-)
 create mode 100644 src/test/regress/expected/uhc.out
 create mode 100644 src/test/regress/expected/uhc_1.out
 create mode 100644 src/test/regress/sql/uhc.sql

diff --git a/src/test/regress/expected/uhc.out b/src/test/regress/expected/uhc.out
new file mode 100644
index 00000000000..d922cca7caf
--- /dev/null
+++ b/src/test/regress/expected/uhc.out
@@ -0,0 +1,86 @@
+-- This test is about UHC (Windows-949 / CP949) encoding.  UHC is a
+-- client-only encoding, so exercise pg_uhc_verifychar() via convert_from()
+-- in a UTF8 database.
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+-- valid: EUC_KR-compatible Hangul (U+AC00 "가")
+SELECT encode(convert_to(convert_from('\xb0a1', 'UHC'), 'UTF8'), 'hex');
+ encode 
+--------
+ eab080
+(1 row)
+
+-- valid: CP949 lead/trail boundary values
+SELECT encode(convert_to(convert_from('\x8141', 'UHC'), 'UTF8'), 'hex');	-- trail 0x41
+ encode 
+--------
+ eab082
+(1 row)
+
+SELECT encode(convert_to(convert_from('\x815a', 'UHC'), 'UTF8'), 'hex');	-- trail 0x5A
+ encode 
+--------
+ eab0b4
+(1 row)
+
+SELECT encode(convert_to(convert_from('\x8161', 'UHC'), 'UTF8'), 'hex');	-- trail 0x61
+ encode 
+--------
+ eab0b5
+(1 row)
+
+SELECT encode(convert_to(convert_from('\x817a', 'UHC'), 'UTF8'), 'hex');	-- trail 0x7A
+ encode 
+--------
+ eab195
+(1 row)
+
+SELECT encode(convert_to(convert_from('\x8181', 'UHC'), 'UTF8'), 'hex');	-- trail 0x81
+ encode 
+--------
+ eab196
+(1 row)
+
+SELECT encode(convert_to(convert_from('\x81fe', 'UHC'), 'UTF8'), 'hex');	-- trail 0xFE
+ encode 
+--------
+ eab493
+(1 row)
+
+SELECT encode(convert_to(convert_from('\xc7a1', 'UHC'), 'UTF8'), 'hex');	-- high lead 0xC7
+ encode 
+--------
+ ed9088
+(1 row)
+
+SELECT encode(convert_to(convert_from('\xfda1', 'UHC'), 'UTF8'), 'hex');	-- high lead 0xFD
+ encode 
+--------
+ e788bb
+(1 row)
+
+-- invalid lead byte (0x80 and 0xFF are unused in CP949)
+SELECT convert_from('\x8041', 'UHC');
+ERROR:  character with byte sequence 0x80 0x41 in encoding "UHC" has no equivalent in encoding "UTF8"
+SELECT convert_from('\xff41', 'UHC');
+ERROR:  character with byte sequence 0xff 0x41 in encoding "UHC" has no equivalent in encoding "UTF8"
+-- invalid trail byte
+SELECT convert_from('\x8140', 'UHC');	-- 0x40
+ERROR:  character with byte sequence 0x81 0x40 in encoding "UHC" has no equivalent in encoding "UTF8"
+SELECT convert_from('\x815b', 'UHC');	-- 0x5B
+ERROR:  character with byte sequence 0x81 0x5b in encoding "UHC" has no equivalent in encoding "UTF8"
+SELECT convert_from('\x8160', 'UHC');	-- 0x60
+ERROR:  character with byte sequence 0x81 0x60 in encoding "UHC" has no equivalent in encoding "UTF8"
+SELECT convert_from('\x817b', 'UHC');	-- 0x7B
+ERROR:  character with byte sequence 0x81 0x7b in encoding "UHC" has no equivalent in encoding "UTF8"
+SELECT convert_from('\x8180', 'UHC');	-- 0x80
+ERROR:  character with byte sequence 0x81 0x80 in encoding "UHC" has no equivalent in encoding "UTF8"
+SELECT convert_from('\x81ff', 'UHC');	-- 0xFF
+ERROR:  character with byte sequence 0x81 0xff in encoding "UHC" has no equivalent in encoding "UTF8"
+SELECT convert_from('\x8d20', 'UHC');	-- NONUTF8_INVALID sentinel pair
+ERROR:  invalid byte sequence for encoding "UHC": 0x8d 0x20
+-- truncated two-byte character
+SELECT convert_from('\x81', 'UHC');
+ERROR:  invalid byte sequence for encoding "UHC": 0x81
diff --git a/src/test/regress/expected/uhc_1.out b/src/test/regress/expected/uhc_1.out
new file mode 100644
index 00000000000..9deb8b8ee1d
--- /dev/null
+++ b/src/test/regress/expected/uhc_1.out
@@ -0,0 +1,6 @@
+-- This test is about UHC (Windows-949 / CP949) encoding.  UHC is a
+-- client-only encoding, so exercise pg_uhc_verifychar() via convert_from()
+-- in a UTF8 database.
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
diff --git a/src/test/regress/parallel_schedule b/src/test/regress/parallel_schedule
index 8fa0a6c47fb..15d5e539961 100644
--- a/src/test/regress/parallel_schedule
+++ b/src/test/regress/parallel_schedule
@@ -28,7 +28,7 @@ test: strings md5 numerology point lseg line box path polygon circle date time t
 # geometry depends on point, lseg, line, box, path, polygon, circle
 # horology depends on date, time, timetz, timestamp, timestamptz, interval
 # ----------
-test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comments expressions unicode xid mvcc database stats_import pg_ndistinct pg_dependencies oid8 encoding euc_kr
+test: geometry horology tstypes regex type_sanity opr_sanity misc_sanity comments expressions unicode xid mvcc database stats_import pg_ndistinct pg_dependencies oid8 encoding euc_kr uhc
 
 # ----------
 # Load huge amounts of data
diff --git a/src/test/regress/sql/uhc.sql b/src/test/regress/sql/uhc.sql
new file mode 100644
index 00000000000..6905ad084b4
--- /dev/null
+++ b/src/test/regress/sql/uhc.sql
@@ -0,0 +1,36 @@
+-- This test is about UHC (Windows-949 / CP949) encoding.  UHC is a
+-- client-only encoding, so exercise pg_uhc_verifychar() via convert_from()
+-- in a UTF8 database.
+SELECT getdatabaseencoding() <> 'UTF8' AS skip_test \gset
+\if :skip_test
+\quit
+\endif
+
+-- valid: EUC_KR-compatible Hangul (U+AC00 "가")
+SELECT encode(convert_to(convert_from('\xb0a1', 'UHC'), 'UTF8'), 'hex');
+
+-- valid: CP949 lead/trail boundary values
+SELECT encode(convert_to(convert_from('\x8141', 'UHC'), 'UTF8'), 'hex');	-- trail 0x41
+SELECT encode(convert_to(convert_from('\x815a', 'UHC'), 'UTF8'), 'hex');	-- trail 0x5A
+SELECT encode(convert_to(convert_from('\x8161', 'UHC'), 'UTF8'), 'hex');	-- trail 0x61
+SELECT encode(convert_to(convert_from('\x817a', 'UHC'), 'UTF8'), 'hex');	-- trail 0x7A
+SELECT encode(convert_to(convert_from('\x8181', 'UHC'), 'UTF8'), 'hex');	-- trail 0x81
+SELECT encode(convert_to(convert_from('\x81fe', 'UHC'), 'UTF8'), 'hex');	-- trail 0xFE
+SELECT encode(convert_to(convert_from('\xc7a1', 'UHC'), 'UTF8'), 'hex');	-- high lead 0xC7
+SELECT encode(convert_to(convert_from('\xfda1', 'UHC'), 'UTF8'), 'hex');	-- high lead 0xFD
+
+-- invalid lead byte (0x80 and 0xFF are unused in CP949)
+SELECT convert_from('\x8041', 'UHC');
+SELECT convert_from('\xff41', 'UHC');
+
+-- invalid trail byte
+SELECT convert_from('\x8140', 'UHC');	-- 0x40
+SELECT convert_from('\x815b', 'UHC');	-- 0x5B
+SELECT convert_from('\x8160', 'UHC');	-- 0x60
+SELECT convert_from('\x817b', 'UHC');	-- 0x7B
+SELECT convert_from('\x8180', 'UHC');	-- 0x80
+SELECT convert_from('\x81ff', 'UHC');	-- 0xFF
+SELECT convert_from('\x8d20', 'UHC');	-- NONUTF8_INVALID sentinel pair
+
+-- truncated two-byte character
+SELECT convert_from('\x81', 'UHC');
-- 
2.43.0



  [application/octet-stream] v1-0002-Tighten-pg_uhc_verifychar-to-enforce-CP949-lead-trail.patch (5.4K, 4-v1-0002-Tighten-pg_uhc_verifychar-to-enforce-CP949-lead-trail.patch)
From 56f18a5cfab07826205862b9b176a29b2d3ba4a1 Mon Sep 17 00:00:00 2001
From: DoGeon Yoo <[email protected]>
Date: Thu, 14 May 2026 15:46:42 +0900
Subject: [PATCH v1 2/2] Tighten pg_uhc_verifychar() to enforce CP949
 lead/trail byte ranges

Per CP949 (Windows-949), a two-byte UHC sequence requires the lead
byte to be in 0x81-0xFE and the trail byte to be in 0x41-0x5A,
0x61-0x7A, or 0x81-0xFE.

pg_uhc_verifychar() accepts any lead byte with the high bit set
(0x80-0xFF) and any trail byte other than NUL, without enforcing
those ranges.  Out-of-range pairs such as 0x80 0x41 (invalid lead)
or 0x81 0x40 (invalid trail) are accepted by the verifier; they
are rejected only later by the conversion table, with the message
"character with byte sequence ... has no equivalent in encoding
UTF8".  This makes the diagnostic misleading (the pair is not
unmappable, it is structurally invalid) and is inconsistent with
pg_euckr_verifychar(), which already enforces lead/trail ranges
explicitly.

Rewrite pg_uhc_verifychar() to check the lead and trail byte
ranges directly, following the style of pg_euckr_verifychar().
The new trail-range check also subsumes the previous
NONUTF8_INVALID sentinel check (0x8d 0x20), which is removed as
it becomes redundant -- 0x20 is not in any valid trail range, so
0x8d 0x20 is still rejected.

After this change, out-of-range pairs are rejected at the verifier
with "invalid byte sequence for encoding UHC".  The regression
test added in the previous commit captures this exactly: eight
"has no equivalent" errors become "invalid byte sequence" errors,
and no other test result changes.  The user-visible effect is the
error message format and the stage at which the byte sequence is
rejected; the accept/reject outcome for any input is unchanged.

Only client-side paths are affected since UHC is not supported as
a server encoding.

Reported-by: Henson Choi <[email protected]>
Discussion: https://www.postgresql.org/message-id/CAAAe_zBdGXsALm%3DGkUPtPx9MLcjcM5hBg3HZU%2Bnh8gKXSjXJJw%40mail.gmail.com
---
 src/common/wchar.c                | 37 ++++++++++++++++++++-----------
 src/test/regress/expected/uhc.out | 16 ++++++-------
 2 files changed, 32 insertions(+), 21 deletions(-)

diff --git a/src/common/wchar.c b/src/common/wchar.c
index 926823cabec..5f8b0333325 100644
--- a/src/common/wchar.c
+++ b/src/common/wchar.c
@@ -1387,26 +1387,37 @@ pg_gbk_verifystr(const unsigned char *s, int len)
 static int
 pg_uhc_verifychar(const unsigned char *s, int len)
 {
-	int			l,
-				mbl;
+	int			l;
+	unsigned char c1,
+				c2;
 
-	l = mbl = pg_uhc_mblen(s);
+	c1 = *s++;
 
-	if (len < l)
-		return -1;
+	if (IS_HIGHBIT_SET(c1))
+	{
+		l = 2;
+		if (l > len)
+			return -1;
 
-	if (l == 2 &&
-		s[0] == NONUTF8_INVALID_BYTE0 &&
-		s[1] == NONUTF8_INVALID_BYTE1)
-		return -1;
+		c2 = *s++;
 
-	while (--l > 0)
-	{
-		if (*++s == '\0')
+		/* CP949 lead byte must be 0x81-0xFE */
+		if (c1 < 0x81 || c1 > 0xfe)
+			return -1;
+
+		/* CP949 trail byte: 0x41-0x5A, 0x61-0x7A, or 0x81-0xFE */
+		if (!((c2 >= 0x41 && c2 <= 0x5a) ||
+			  (c2 >= 0x61 && c2 <= 0x7a) ||
+			  (c2 >= 0x81 && c2 <= 0xfe)))
 			return -1;
 	}
+	else
+		/* must be ASCII */
+	{
+		l = 1;
+	}
 
-	return mbl;
+	return l;
 }
 
 static int
diff --git a/src/test/regress/expected/uhc.out b/src/test/regress/expected/uhc.out
index d922cca7caf..20949feb703 100644
--- a/src/test/regress/expected/uhc.out
+++ b/src/test/regress/expected/uhc.out
@@ -63,22 +63,22 @@ SELECT encode(convert_to(convert_from('\xfda1', 'UHC'), 'UTF8'), 'hex');	-- high
 
 -- invalid lead byte (0x80 and 0xFF are unused in CP949)
 SELECT convert_from('\x8041', 'UHC');
-ERROR:  character with byte sequence 0x80 0x41 in encoding "UHC" has no equivalent in encoding "UTF8"
+ERROR:  invalid byte sequence for encoding "UHC": 0x80 0x41
 SELECT convert_from('\xff41', 'UHC');
-ERROR:  character with byte sequence 0xff 0x41 in encoding "UHC" has no equivalent in encoding "UTF8"
+ERROR:  invalid byte sequence for encoding "UHC": 0xff 0x41
 -- invalid trail byte
 SELECT convert_from('\x8140', 'UHC');	-- 0x40
-ERROR:  character with byte sequence 0x81 0x40 in encoding "UHC" has no equivalent in encoding "UTF8"
+ERROR:  invalid byte sequence for encoding "UHC": 0x81 0x40
 SELECT convert_from('\x815b', 'UHC');	-- 0x5B
-ERROR:  character with byte sequence 0x81 0x5b in encoding "UHC" has no equivalent in encoding "UTF8"
+ERROR:  invalid byte sequence for encoding "UHC": 0x81 0x5b
 SELECT convert_from('\x8160', 'UHC');	-- 0x60
-ERROR:  character with byte sequence 0x81 0x60 in encoding "UHC" has no equivalent in encoding "UTF8"
+ERROR:  invalid byte sequence for encoding "UHC": 0x81 0x60
 SELECT convert_from('\x817b', 'UHC');	-- 0x7B
-ERROR:  character with byte sequence 0x81 0x7b in encoding "UHC" has no equivalent in encoding "UTF8"
+ERROR:  invalid byte sequence for encoding "UHC": 0x81 0x7b
 SELECT convert_from('\x8180', 'UHC');	-- 0x80
-ERROR:  character with byte sequence 0x81 0x80 in encoding "UHC" has no equivalent in encoding "UTF8"
+ERROR:  invalid byte sequence for encoding "UHC": 0x81 0x80
 SELECT convert_from('\x81ff', 'UHC');	-- 0xFF
-ERROR:  character with byte sequence 0x81 0xff in encoding "UHC" has no equivalent in encoding "UTF8"
+ERROR:  invalid byte sequence for encoding "UHC": 0x81 0xff
 SELECT convert_from('\x8d20', 'UHC');	-- NONUTF8_INVALID sentinel pair
 ERROR:  invalid byte sequence for encoding "UHC": 0x8d 0x20
 -- truncated two-byte character
-- 
2.43.0
