MIME-Version: 1.0
References: 
 <CA+1gyqJJJDhq=cc_D0ad59WH_OD2G_mN54xTru0KYoNaLkF48Q@mail.gmail.com>
 <30e628b4-03cd-43eb-9ea4-d211aaddcaf5@eisentraut.org>
 <20260506.211907.1578384907621261702.ishii@postgresql.org>
 <CA+1gyqJW8ht=GEoxARAL=8pUGbq7qw7VV4eP+g6PK9f+Qi_TXg@mail.gmail.com>
In-Reply-To: 
 <CA+1gyqJW8ht=GEoxARAL=8pUGbq7qw7VV4eP+g6PK9f+Qi_TXg@mail.gmail.com>
From: Zhongpu Chen <chenloveit@gmail.com>
Date: Sun, 10 May 2026 10:28:57 +0800
Message-ID: 
 <CA+1gyq+KeNhn=ZR6MZap49e8NX984O2z2FFoY_2dpmnMFL7a9w@mail.gmail.com>
Subject: Re: Proposal: tighten validation for legacy EUC encodings or document
 that accepted byte sequences may be unconvertible to UTF8
To: Tatsuo Ishii <ishii@postgresql.org>
Cc: peter@eisentraut.org, pgsql-hackers@lists.postgresql.org
Content-Type: multipart/alternative; boundary="0000000000000126aa06516d6541"
Archived-At: 
 <https://www.postgresql.org/message-id/CA%2B1gyq%2BKeNhn%3DZR6MZap49e8NX984O2z2FFoY_2dpmnMFL7a9w%40mail.gmail.com>
Precedence: bulk

--0000000000000126aa06516d6541
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

My prototype implementation is at
https://github.com/SWUFE-DB-Group/postgresql-encoding-validation
and usage notes are in
https://github.com/SWUFE-DB-Group/postgresql-encoding-validation/blob/main/=
DEV.md
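
For anyone who wants to see the 0xA2A3 case discussed below without
building the patch, here is a minimal sketch in Python. It only uses
Python's own "gb2312" and "gb18030" codecs as a stand-in, which is an
assumption on my side: their tables are not PostgreSQL's conversion
maps, and the helper name decodes() is made up for illustration.

```python
def decodes(raw: bytes, codec: str) -> bool:
    """Return True if raw decodes cleanly with the given Python codec."""
    try:
        raw.decode(codec)
    except UnicodeDecodeError:
        return False
    return True


# 0xA2A3 is a structurally valid two-byte EUC sequence; check whether
# each character set actually assigns a character to it.
for codec in ("gb2312", "gb18030"):
    print(codec, decodes(b"\xa2\xa3", codec))
```

On CPython I would expect the strict GB2312 table to reject the bytes
and the GB18030 table to accept them, which is the same kind of gap as
between what EUC_CN input validation accepts and what the conversion to
UTF8 can handle.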

On Sat, May 9, 2026 at 4:58=E2=80=AFPM Zhongpu Chen <chenloveit@gmail.com> =
wrote:

> > If so, tightening up the validation may break such uses.
>
> I agree. What about introducing an extra GUC that lets users choose the
> validation logic? In fact, I have already implemented this as a patch.
>
> ```
> SHOW encoding_validation;
> -- default behaviour
> SET encoding_validation =3D 'native';
> -- enforce that writes are fully compatible with reads
> SET encoding_validation =3D 'read_compatible';
> ```
>
> On Wed, May 6, 2026 at 8:19=E2=80=AFPM Tatsuo Ishii <ishii@postgresql.org=
> wrote:
>
>> > It is in general not necessarily required that all text in all
>> > non-UTF8 encodings must be convertible to UTF8.
>> >
>> > (This is also a result of history: These encodings were implemented in
>> > PostgreSQL before Unicode.)
>> >
>> > That said, I can see how different behaviors might be desirable.
>> >
>> > My first question would be, are these non-convertible byte sequences
>> > just characters that don't map to Unicode, or are they invalid within
>> > the definition of the EUC-* encodings themselves?
>>
>> Strictly speaking, the former. 0xA2A3 is the lowercase Roman numeral
>> three (iii), which is not defined in the original GB2312 (the character
>> set of EUC_CN).
>>
>> > If the latter, then
>> > we should just reject them (modulo some backward compatibility),
>> > similar to how we reject certain Unicode code points that exist
>> > "structurally" but are not valid for one reason or another.
>>
>> After GB2312, GB18030 was defined. (It is claimed that GB18030 is a
>> superset of GB2312.) In GB18030, lowercase forms of the Roman
>> numerals and other characters (e.g. the Euro sign) were added.
>>
>> So I suspect that a) those characters are sometimes used with EUC_CN
>> despite the fact that they are not valid GB2312 characters, and b) some
>> EUC_CN users might have already written those characters into EUC_CN
>> databases. If so, tightening up the validation may break such uses.
>> This is just my guess. Please correct me if I am wrong.
>>
>> > Alternatively, if these byte sequences are valid characters but they
>> > just didn't end up in Unicode for some reason, then rejecting them
>> > might break valid uses.
>>
>> That's not the case, at least for 0xA2A3. It seems UCS_to_EUC_CN.pl
>> explicitly rejects characters that are not part of GB2312, including
>> 0xA2A3, because the script uses GB18030 as its source data.
>>
>> Regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
>
>
> --
> Zhongpu Chen
>


--=20
Zhongpu Chen

--0000000000000126aa06516d6541--