Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8

From: Zhongpu Chen <[email protected]>
To: David G. Johnston <[email protected]>
Cc: [email protected] <[email protected]>
Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
Date: Sat, 2 May 2026 12:49:00 +0800
Message-ID: <CA+1gyqJwhQ5n4VZmJdnouaq7yMgYR+w_RiY=A6VWz4TzcUiHkw@mail.gmail.com>
In-Reply-To: <CAKFQuwZuEZFYK9Arp_qFsoJ5o2EDDDCfsTwBYvoxzhBiXRJHQg@mail.gmail.com>
References: <CA+1gyqJJJDhq=cc_D0ad59WH_OD2G_mN54xTru0KYoNaLkF48Q@mail.gmail.com>
	<CA+1gyq+LF_91g_i0WXeKK6JGF8viaqaF213S-9Arq=SG=4GAaA@mail.gmail.com>
	<CAKFQuwZuEZFYK9Arp_qFsoJ5o2EDDDCfsTwBYvoxzhBiXRJHQg@mail.gmail.com>

Thanks for the clarification.


I agree that validating every input raises runtime-cost concerns, but that
cost can be kept under control. For example, MySQL applies a finer-grained
check for EUC-CN (i.e., GB2312) in
https://github.com/mysql/mysql-server/blob/trunk/strings/ctype-gb2312.cc:


```
static int func_gb2312_uni_onechar(int code) {
  if ((code >= 0x2121) && (code <= 0x2658))
    return (tab_gb2312_uni0[code - 0x2121]);
  if ((code >= 0x2721) && (code <= 0x296F))
    return (tab_gb2312_uni1[code - 0x2721]);
  if ((code >= 0x3021) && (code <= 0x777E))
    return (tab_gb2312_uni2[code - 0x3021]);
  return (0);
}
```

where `code` is obtained by subtracting 0x8080 from the two-byte EUC-CN
value. Of course, MySQL's check could also be tightened further.
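
To illustrate how cheap such a check can be, here is a minimal sketch of a
range-only validator built from the three ranges in the MySQL snippet above.
The function name is hypothetical and is not part of PostgreSQL or MySQL. It
assumes the usual 0xA1..0xFE byte range for both EUC bytes, and it only
tests the ranges, so it would still accept unassigned code points inside
them (MySQL's tables return 0 for those).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an existing PostgreSQL or MySQL function).
 * Returns true when the two-byte EUC-CN sequence (b1, b2) falls inside one
 * of the three code ranges used by the MySQL snippet quoted above.
 */
static bool
euc_cn_pair_in_gb2312_ranges(uint8_t b1, uint8_t b2)
{
    unsigned code;

    /* Structural check (assumed): both bytes must lie in 0xA1..0xFE. */
    if (b1 < 0xA1 || b1 > 0xFE || b2 < 0xA1 || b2 > 0xFE)
        return false;

    /* Map the EUC-CN value to the GB2312 code by subtracting 0x8080. */
    code = (((unsigned) b1 << 8) | b2) - 0x8080;

    /* Range test only: a few comparisons per two-byte character. */
    return (code >= 0x2121 && code <= 0x2658) ||
           (code >= 0x2721 && code <= 0x296F) ||
           (code >= 0x3021 && code <= 0x777E);
}
```

For example, the pair 0xB0 0xA1 maps to code 0x3021 and passes, while
0xAA 0xA1 maps to 0x2A21, which lies between the second and third ranges,
and is rejected even though it is structurally valid.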


Anyway, it is reasonable to note these details in the documentation.


On Sat, May 2, 2026 at 11:28 AM David G. Johnston <
[email protected]> wrote:

> On Friday, May 1, 2026, Zhongpu Chen <[email protected]> wrote:
>
>> The issue is not specific to E'\\x..' literals. A normal COPY FROM data
>> file with ENCODING 'EUC_CN' can create text rows that later cannot be
>> retrieved with SELECT.
>>
>>  This suggests that input validation for EUC_CN is only structural, while
>> the EUC_CN-to-UTF8 conversion table is stricter.
>>
>
> I suspect a lack of desire to maintain and ensure that specific values are
> verified; or accepting the runtime cost to do so.  It is indeed
> structural.  This point should probably be documented better.  But it’s
> hard to feel too bad if the input claims it is providing verifiable EUC_CN
> data then proceeds to supply data that lacks meaning in reality.  We are
> happy to just store and return your data to you - but it’s unreasonable to
> ask for it to be converted.  It would be nice for the database to provide
> an extra layer of protection, so I’m not against the idea.  Either
> automatically or at least providing a function that could, say, be
> called in a trigger for opt-in.  But definitely feels like a problematic
> benefit-to-cost proposition.
>
> David J.
>
>

-- 
Zhongpu Chen
