From: Zhongpu Chen
Date: Sat, 9 May 2026 16:58:09 +0800
Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
To: Tatsuo Ishii
Cc: peter@eisentraut.org, pgsql-hackers@lists.postgresql.org
In-Reply-To: <20260506.211907.1578384907621261702.ishii@postgresql.org>
References: <30e628b4-03cd-43eb-9ea4-d211aaddcaf5@eisentraut.org> <20260506.211907.1578384907621261702.ishii@postgresql.org>

> If so, tightening up the validation may break such uses.

I agree. What about introducing an extra GUC that lets users choose the validation logic? In fact, I have already implemented this as a patch.

```
SHOW encoding_validation;
-- default behaviour
SET encoding_validation = 'native';
-- enforce Write to be fully compatible with Read
SET encoding_validation = 'read_compatible';
```
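
To make the gap between the two modes concrete, here is a minimal Python sketch. It is not PostgreSQL code and not the patch itself. It assumes that a plain two-byte range check approximates structural EUC_CN validation and that Python's built-in `gb2312` and `gb18030` codecs are reasonable stand-ins for the respective conversion tables; both helper names are hypothetical.

```python
# Illustrative only: approximates "structurally valid EUC_CN" versus
# "convertible to Unicode" using Python's built-in codecs.
# Both helpers are hypothetical and do not mirror PostgreSQL internals.


def is_structurally_valid_euc_cn(data: bytes) -> bool:
    """Accept ASCII bytes and two-byte sequences with both bytes in 0xA1..0xFE."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            i += 1  # single-byte ASCII
        elif 0xA1 <= b <= 0xFE and i + 1 < len(data) and 0xA1 <= data[i + 1] <= 0xFE:
            i += 2  # well-formed two-byte sequence
        else:
            return False
    return True


def is_convertible(data: bytes, codec: str) -> bool:
    """Return True if the bytes decode to Unicode under the given codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


sample = b"\xa2\xa3"  # the byte sequence discussed in this thread
print("structurally valid:", is_structurally_valid_euc_cn(sample))
print("decodes as gb2312: ", is_convertible(sample, "gb2312"))
print("decodes as gb18030:", is_convertible(sample, "gb18030"))
```

With CPython's bundled codecs, the sample should pass the structural check and decode under `gb18030`, but fail under `gb2312`. That is the kind of value a stricter setting would reject at write time.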

On Wed, May 6, 2026 at 8:19 PM Tatsuo Ishii <ishii@postgresql.org> wrote:
> > It is in general not necessarily required that all text in all
> > non-UTF8 encodings must be convertible to UTF8.
> >
> > (This is also a result of history: These encodings were implemented in
> > PostgreSQL before Unicode.)
> >
> > That said, I can see how different behaviors might be desirable.
> >
> > My first question would be, are these non-convertible byte sequences
> > just characters that don't map to Unicode, or are they invalid within
> > the definition of the EUC-* encodings themselves?
>
> A strict answer is, the former. 0xA2A3 is the third of the lowercase
> forms of the Roman numerals (iii), which is not defined in the original
> GB2312 (the character set of EUC_CN).
>
> > If the latter, then
> > we should just reject them (modulo some backward compatibility),
> > similar to how we reject certain Unicode code points that exist
> > "structurally" but are not valid for one reason or another.
>
> After GB2312, GB18030 was defined. (It is claimed that GB18030 is a
> superset of GB2312.) In GB18030, lowercase forms of the Roman
> numerals and other characters (e.g. the Euro sign) were added.
>
> So I suspect that a) those characters are sometimes used with EUC_CN
> despite the fact that they are not valid GB2312 characters, and b) some
> EUC_CN users might have already written those characters into EUC_CN
> databases. If so, tightening up the validation may break such
> uses. This is just my guess. Please correct me if I am wrong.
>
> > Alternatively, if these byte sequences are valid characters but they
> > just didn't end up in Unicode for some reason, then rejecting them
> > might break valid uses.
>
> That's not the case, at least for 0xA2A3. It seems UCS_to_EUC_CN.pl
> explicitly rejects characters that are not part of GB2312, including
> 0xA2A3, as the script uses GB18030 as its source data.
>
> Regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp


--
Zhongpu Chen