Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wKYLv-0017xu-2E for pgsql-hackers@arkaria.postgresql.org; Wed, 06 May 2026 09:15:35 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wKYLu-00G0tn-1l for pgsql-hackers@arkaria.postgresql.org; Wed, 06 May 2026 09:15:34 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wKYLu-00G0tf-0f for pgsql-hackers@lists.postgresql.org; Wed, 06 May 2026 09:15:34 +0000 Received: from mail-ua1-x92b.google.com ([2607:f8b0:4864:20::92b]) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wKYLs-00000000nNk-0331 for pgsql-hackers@lists.postgresql.org; Wed, 06 May 2026 09:15:33 +0000 Received: by mail-ua1-x92b.google.com with SMTP id a1e0cc1a2514c-9568159ee07so3960851241.1 for ; Wed, 06 May 2026 02:15:31 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1778058930; cv=none; d=google.com; s=arc-20240605; b=YPcGHX09/MPrEOnMBLQDJd4+6m9Qy2MCSNYaMX+kL9izc/n1m+zSHWUdGANvGJUPpU Fcv6+1z7b/G7NOeQ5YsrVzfDb7bc74awRIt7yMhxyYooydgkY3mO+CcfPx+v6t/+XYRT 4TpwHdX+WXQzXFYCURXKffABMRlkrSLQ2mkhYSIXqE7I+ZKEIcv5IUpwfX7EO3yzn4CI 6M2isdjM1bTb8xQIjJcn6KV7lHBbOkXIwIEyzoj6ORxUxy0226R6ZVz21Ysf1d3/kLBb Ev8ueEwfskwutkV5+0LIeJ2Co0LVjTv/kD+RARoj6BpnvWQ5eJv17SE3ewHIwjgGPBcb ZRtA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=kYBmCLHuBEtyq1Pr6tinkBVgatLobH26+/pKnpT/PbI=; fh=+M/4E1td1cw9UEpXfAxelXvvfrDyZy/Sc5QIk2PKYN4=; b=Aw3+u7KkBaSb0CKEPA5ttxqWtUBsYIvNPsA6OYCU2LbQGzl4ex4yMYmTjxQEPKEgjk ytC151OGosDi19nYiSBCkL4NmLndSGM4Onp8pihnJAGDE75g3S96hm6VKV76rtlGd3Yz 22cQpZxJdWBILsTvijpYvOfLmh75KVmxeHMq6T1E4RLfxbYzb4PUTIPcN7iSob6X1Vso P4o8fgw0UQw760CKKTkW4Cs+X0Wohqls5BQDw7TPu83ypGneMCk4YGzElkQ+ER9svlY7 jozPHQiP5AtTRprJE7+Bi9d0ywwXcUxYgGkya07155mk3Rhsa9Ozxa7GfQYPWbnvVSVA CBLQ==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1778058930; x=1778663730; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=kYBmCLHuBEtyq1Pr6tinkBVgatLobH26+/pKnpT/PbI=; b=iy4K9EfrKS4y1hM5m2BF7tIfs57+bUBqp8uVfqz46g0fLZEAf4xUF9INnnA7Bdge7e vqOGAeAYwLzrxocEfglEoUPZnuisC90TV6FwKAQiMGz5p/fCwP8X8NU2uXotOiqQcdwW 1IkPgj9ZJ2wCgv9niiB5ioXekDrrK737XtVdvorvzowRmK52djZtF2eAuxuNwysjQfZ1 kK1rSGNDACdh01Bb6oNd9mTSb9oRGV8QMsXqO2tquG4nvrC/u08yw1uw+8mwoOUaGMRO hS9XNLqd/Fl5XLtp6PrzbO9k67b9iiHVXNU9wrQT6YJh6hQmYgzpQ9zMtJRF6q6QJXbM 6QjA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1778058930; x=1778663730; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=kYBmCLHuBEtyq1Pr6tinkBVgatLobH26+/pKnpT/PbI=; b=laQF417U+Ui5USUX7/x7rtphkywDaPCGZKDhDNxTI8x8V6rVGcww7BbFPLTD53HDRQ 426rPiGhNdyRw6zy1On4qfDrEpMzP+frX1zgXKIgOEch7RGONGRJGNyh9ZS1RkScSxpk 4Jx0zgIDiMw47oGw3ttqdIU7+7sm2AULVcHdVcRAgAMtj9u9WA1vG13A6cK0kvth0xVb fqQadBd9pD33PO26Y4QUDPQ1EEh0d90fSAfIABTb6laKgCaGZdszRIU5a+mPCen+MuZI 0KEHbqWBUyHbP6SRqxagdtNcFAR7L1j0tmeP3HO2DQSSHamItkHyLxAFC5DXyN1byRl/ iFcQ== X-Gm-Message-State: AOJu0YzuOk0YUBEjsZ2FwLbxVSTyME2SOerqw7N3uCqIgUx3pQBfjx71 cg5MHdk6VttypdAjnHu5vJsdF5jCo73kd45Pj+/QGe4DOya91ZiQsooJV2cpH9ftYVYdguRH/CU LPDvncMSsTxi+7uMJ8vge7CDQMWNggUY= X-Gm-Gg: AeBDies28+dpEPWY6mcmHybIBN7v2mwfNbuSKEs2lw/7JbZcj+DNJJR718zUUkKi1eA T5h2U/5foAsx72bpI84N6Le4sJjlA7/7M0vw73GqpDsMcdg/ID9cXwVUisutDAjXKv9GPNOFKCF U7zbrjP8I4YvMQ/3FYa2YeKXBJT/1qXSRgj0COqt8TaNNuhNIOZ6hFuyElIeWuZrQXQ/DgRbidy 1JqTcpOaqgUe5aSk6683aogJjMfnJWWsUH+i8/VcZQvzvCil2/RLuwwynm8Yh+N0Sfs7oPncuey JPIjB5EsKQPdeu4oR3a01HKR0+X9tWpXvouXEoCVRkgy55Y= X-Received: by 2002:a05:6102:418a:b0:60f:f543:232a with SMTP id ada2fe7eead31-630f8e78525mr892317137.2.1778058929679; Wed, 06 May 2026 02:15:29 -0700 (PDT) MIME-Version: 1.0 References: <30e628b4-03cd-43eb-9ea4-d211aaddcaf5@eisentraut.org> In-Reply-To: <30e628b4-03cd-43eb-9ea4-d211aaddcaf5@eisentraut.org> From: Zhongpu Chen Date: Wed, 6 May 2026 17:15:18 +0800 X-Gm-Features: AVHnY4K1uoD8QtKgR1NcoJQGtCaK0aV_eCK-hqsUCFuIZxtur_POA2uegevNaTc Message-ID: Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8 To: Peter Eisentraut Cc: pgsql-hackers@lists.postgresql.org Content-Type: multipart/alternative; boundary="000000000000e4b16f0651229a8d" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000e4b16f0651229a8d Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable I agree that not every valid character encoded in a legacy non-UTF8 encoding is necessarily required to be convertible to UTF8. But this assumes that the byte sequence actually denotes a valid character in the declared legacy encoding. For the reported EUC-CN cases, this is exactly the point in question. These byte sequences are structurally well-formed EUC-CN byte pairs, but they fall into reserved or unassigned positions of the GB2312 code table. For example, byte sequences with first byte 0xAA correspond to row 10 of GB2312, which is reserved/unassigned. Therefore, these cases are not merely valid legacy characters that happen to lack Unicode mappings. Rather, under strict GB2312/EUC-CN semantics, they are not assigned to any character at all, and thus should not be considered valid GB2312 characters. So my concern is not that every legacy-encoded character must be convertible to UTF8. The concern is that PostgreSQL's write-time validation accepts a structural superset of EUC-CN byte pairs as text, while some of these byte pairs are not valid assigned GB2312 characters and PostgreSQL's own later conversion path cannot assign character semantics to them. BTW, as noted in MySQL's implementation, a finer checker is possible. On Wed, May 6, 2026 at 3:32=E2=80=AFPM Peter Eisentraut wrote: > On 02.05.26 04:31, Zhongpu Chen wrote: > > See the related bug report https://www.postgresql.org/message-id/ > > CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com > > > CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com> > > > > Currently PostgreSQL accepts structurally well-formed EUC_CN byte > > sequences such as 0xA2A3 into text columns. The value round-trips when > > client_encoding is EUC_CN, but fails when client_encoding is UTF8 > > because euc_cn_to_utf8 has no mapping. > > > > If this behavior is intentional for compatibility, the documentation > > should explicitly say that validation for some legacy encodings is byte= - > > structure validation, not mapping-table validation. > > If it is not intentional, stricter validation could reject unassigned > > byte positions at input time. > > It is in general not necessarily required that all text in all non-UTF8 > encodings must be convertible to UTF8. > > (This is also a result of history: These encodings were implemented in > PostgreSQL before Unicode.) > > That said, I can see how different behaviors might be desirable. > > My first question would be, are these non-convertible byte sequences > just characters that don't map to Unicode, or are they invalid within > the definition of the EUC-* encodings themselves? If the latter, then > we should just reject them (modulo some backward compatibility), similar > to how we reject certain Unicode code points that exist "structurally" > but are not valid for one reason or another. > > Alternatively, if these byte sequences are valid characters but they > just didn't end up in Unicode for some reason, then rejecting them might > break valid uses. > > (I don't know much about EUC-* to be able to answer these.) > > --=20 Zhongpu Chen --000000000000e4b16f0651229a8d Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
I agree that not every valid character encoded in a l= egacy non-UTF8 encoding is necessarily required to be convertible to UTF8. = But this assumes that the byte sequence actually denotes a valid character = in the declared legacy encoding.

For the reported EUC-CN cases, this= is exactly the point in question. These byte sequences are structurally we= ll-formed EUC-CN byte pairs, but they fall into reserved or unassigned posi= tions of the GB2312 code table. For example, byte sequences with first byte= 0xAA correspond to row 10 of GB2312, which is reserved/unassigned. Therefo= re, these cases are not merely valid legacy characters that happen to lack = Unicode mappings. Rather, under strict GB2312/EUC-CN semantics, they are no= t assigned to any character at all, and thus should not be considered valid= GB2312 characters.

So my concern is not that every legacy-encoded c= haracter must be convertible to UTF8. The concern is that PostgreSQL's = write-time validation accepts a structural superset of EUC-CN byte pairs as= text, while some of these byte pairs are not valid assigned GB2312 charact= ers and PostgreSQL's own later conversion path cannot assign character = semantics to them.

BTW, as noted in MySQL's implementation= , a finer checker is possible.


On Wed, = May 6, 2026 at 3:32=E2=80=AFPM Peter Eisentraut <peter@eisentraut.org> wrote:
On 02.05.26 04:31, Zhongpu Chen wrote:=
> See the related bug report https://www.postgresql.org/mes= sage-id/
> CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com
> <
https://www.postgresql.org/message-id/
> CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com>
>
> Currently PostgreSQL accepts structurally well-formed EUC_CN byte
> sequences such as 0xA2A3 into text columns. The value round-trips when=
> client_encoding is EUC_CN, but fails when client_encoding is UTF8
> because euc_cn_to_utf8 has no mapping.
>
> If this behavior is intentional for compatibility, the documentation <= br> > should explicitly say that validation for some legacy encodings is byt= e-
> structure validation, not mapping-table validation.
> If it is not intentional, stricter validation could reject unassigned =
> byte positions at input time.

It is in general not necessarily required that all text in all non-UTF8 encodings must be convertible to UTF8.

(This is also a result of history: These encodings were implemented in
PostgreSQL before Unicode.)

That said, I can see how different behaviors might be desirable.

My first question would be, are these non-convertible byte sequences
just characters that don't map to Unicode, or are they invalid within <= br> the definition of the EUC-* encodings themselves?=C2=A0 If the latter, then=
we should just reject them (modulo some backward compatibility), similar to how we reject certain Unicode code points that exist "structurally&= quot;
but are not valid for one reason or another.

Alternatively, if these byte sequences are valid characters but they
just didn't end up in Unicode for some reason, then rejecting them migh= t
break valid uses.

(I don't know much about EUC-* to be able to answer these.)



--
Zhongpu Chen
--000000000000e4b16f0651229a8d--