From: Zhongpu Chen
Date: Sun, 10 May 2026 10:28:57 +0800
Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
To: Tatsuo Ishii
Cc: peter@eisentraut.org, pgsql-hackers@lists.postgresql.org
References: <30e628b4-03cd-43eb-9ea4-d211aaddcaf5@eisentraut.org> <20260506.211907.1578384907621261702.ishii@postgresql.org>

My prototype implementation:
https://github.com/SWUFE-DB-Group/postgresql-encoding-validation

and the usage:
https://github.com/SWUFE-DB-Group/postgresql-encoding-validation/blob/main/DEV.md

On Sat, May 9, 2026 at 4:58 PM Zhongpu Chen <chenloveit@gmail.com> wrote:
> If so, tightening up the validation may break such uses.

I agree. What about introducing an extra GUC that lets users choose the
validation logic? In fact, I have already implemented this as a patch.

```
SHOW encoding_validation;
-- default behaviour
SET encoding_validation = 'native';
-- enforce Write to be fully compatible with Read
SET encoding_validation = 'read_compatible';
```
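
To make the difference between the two settings concrete, here is a minimal Python sketch of one possible reading of them. It is not the patch itself: the function names are hypothetical, the structural rule (two-byte pairs in 0xA1..0xFE) is an assumption about what 'native' accepts for EUC_CN, and Python's `gb2312` codec merely stands in for PostgreSQL's EUC_CN to UTF8 conversion table.

```python
def is_structurally_valid_euc_cn(data: bytes) -> bool:
    """Assumed 'native' rule: ASCII bytes, or two-byte pairs in 0xA1..0xFE."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1
            continue
        if i + 1 >= len(data):
            return False  # truncated two-byte character
        if not (0xA1 <= data[i] <= 0xFE and 0xA1 <= data[i + 1] <= 0xFE):
            return False
        i += 2
    return True


def is_read_compatible(data: bytes) -> bool:
    """Assumed 'read_compatible' rule: structural check plus convertibility."""
    if not is_structurally_valid_euc_cn(data):
        return False
    try:
        # Stand-in for the server-side conversion to UTF8.
        data.decode("gb2312")
    except UnicodeDecodeError:
        return False
    return True


def validate(data: bytes, mode: str = "native") -> bool:
    """Dispatch on a hypothetical encoding_validation setting."""
    if mode == "native":
        return is_structurally_valid_euc_cn(data)
    if mode == "read_compatible":
        return is_read_compatible(data)
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("native", "read_compatible"):
        print(mode, validate(b"\xa2\xa3", mode))
```

Under these assumptions the byte pair 0xA2A3 discussed in this thread passes the structural check but fails the stricter one, which is the write/read mismatch the GUC is meant to close.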

On Wed, May 6, 2026 at 8:19 PM Tatsuo Ishii <ishii@postgresql.org> wrote:
> It is in general not necessarily required that all text in all
> non-UTF8 encodings must be convertible to UTF8.
>
> (This is also a result of history: These encodings were implemented in
> PostgreSQL before Unicode.)
>
> That said, I can see how different behaviors might be desirable.
>
> My first question would be, are these non-convertible byte sequences
> just characters that don't map to Unicode, or are they invalid within
> the definition of the EUC-* encodings themselves?

Strictly speaking, the former. 0xA2A3 is the third of the lowercase
Roman numerals (iii), which is not defined in the original GB2312
(the character set of EUC_CN).

> If the latter, then
> we should just reject them (modulo some backward compatibility),
> similar to how we reject certain Unicode code points that exist
> "structurally" but are not valid for one reason or another.

After GB2312, GB18030 was defined. (It is claimed that GB18030 is a
superset of GB2312.) In GB18030, lowercase forms of the Roman
numerals and other characters (e.g. the Euro sign) were added.
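
The gap between the two character sets can be checked outside the server. The following is a small Python sketch, assuming CPython's bundled `gb2312` and `gb18030` codecs are a fair proxy for the two standards; `try_decode` is a helper made up for this illustration and has nothing to do with PostgreSQL's own conversion code.

```python
import unicodedata

# The byte pair discussed in this thread.
SEQ = b"\xa2\xa3"


def try_decode(data: bytes, codec: str):
    """Return the decoded string, or None if the codec rejects the bytes."""
    try:
        return data.decode(codec)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    for codec in ("gb2312", "gb18030"):
        text = try_decode(SEQ, codec)
        if text is None:
            print(f"{codec}: not decodable")
        else:
            names = ", ".join(unicodedata.name(ch) for ch in text)
            print(f"{codec}: U+{ord(text[0]):04X} ({names})")
```

If the codecs behave as assumed, the same two bytes are rejected under GB2312 and accepted under GB18030, which matches the description above.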

So I suspect that (a) those characters are sometimes used with EUC_CN
even though they are not valid GB2312 characters, and (b) some
EUC_CN users might have already written those characters into EUC_CN
databases. If so, tightening up the validation may break such
uses. This is just my guess. Please correct me if I am wrong.

> Alternatively, if these byte sequences are valid characters but they
> just didn't end up in Unicode for some reason, then rejecting them
> might break valid uses.

That's not the case, at least for 0xA2A3. It seems UCS_to_EUC_CN.pl
explicitly rejects characters that are not part of GB2312, including
0xA2A3, because the script uses GB18030 as its source data.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


--
Zhongpu Chen

