Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8

From: Peter Eisentraut <[email protected]>
To: Zhongpu Chen <[email protected]>
To: [email protected]
Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
Date: Wed, 6 May 2026 09:32:23 +0200
Message-ID: <[email protected]> (raw)
In-Reply-To: <CA+1gyqJJJDhq=cc_D0ad59WH_OD2G_mN54xTru0KYoNaLkF48Q@mail.gmail.com>
References: <CA+1gyqJJJDhq=cc_D0ad59WH_OD2G_mN54xTru0KYoNaLkF48Q@mail.gmail.com>

On 02.05.26 04:31, Zhongpu Chen wrote:
> See the related bug report
> https://www.postgresql.org/message-id/CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com
> 
> Currently PostgreSQL accepts structurally well-formed EUC_CN byte 
> sequences such as 0xA2A3 into text columns. The value round-trips when 
> client_encoding is EUC_CN, but fails when client_encoding is UTF8 
> because euc_cn_to_utf8 has no mapping.
> 
> If this behavior is intentional for compatibility, the documentation 
> should explicitly say that validation for some legacy encodings is
> byte-structure validation, not mapping-table validation.
> If it is not intentional, stricter validation could reject unassigned 
> byte positions at input time.

In general, it is not required that all text in every non-UTF8 encoding 
be convertible to UTF8.

(This is also a result of history: These encodings were implemented in 
PostgreSQL before Unicode.)

That said, I can see how different behaviors might be desirable.

My first question would be, are these non-convertible byte sequences 
just characters that don't map to Unicode, or are they invalid within 
the definition of the EUC-* encodings themselves?  If the latter, then 
we should just reject them (modulo some backward compatibility), similar 
to how we reject certain Unicode code points that exist "structurally" 
but are not valid for one reason or another.

Alternatively, if these byte sequences are valid characters but they 
just didn't end up in Unicode for some reason, then rejecting them might 
break valid uses.

(I don't know enough about EUC-* to answer these questions.)
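
To make the distinction between the two kinds of validation concrete, 
here is a minimal Python sketch. It is not PostgreSQL code. The byte 
ranges are an assumption about what "structurally well-formed" means 
(ASCII, or two bytes each in 0xA1..0xFE), Python's standard "gb2312" 
codec stands in for a mapping table, and all function names are 
hypothetical.

```python
# Sketch only: the byte ranges are an assumption about "structural"
# validity for EUC_CN, not a copy of PostgreSQL's verifier.

def is_structurally_valid_euc_cn(data: bytes) -> bool:
    """Byte-structure check: ASCII bytes, or pairs in 0xA1..0xFE."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            # Single-byte ASCII character.
            i += 1
            continue
        if i + 1 >= len(data):
            # Lead byte without a trail byte.
            return False
        trail = data[i + 1]
        if not (0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
            return False
        i += 2
    return True


def is_convertible_to_unicode(data: bytes) -> bool:
    """Mapping-table check, using Python's stdlib "gb2312" codec."""
    try:
        data.decode("gb2312")
    except UnicodeDecodeError:
        return False
    return True


def classify(data: bytes) -> str:
    """Report which of the two checks, if any, a byte string fails."""
    if not is_structurally_valid_euc_cn(data):
        return "malformed"
    if not is_convertible_to_unicode(data):
        return "well-formed but unmapped"
    return "convertible"
```

Calling classify(b"\xa2\xa3") shows how the reported sequence fares 
against Python's table, which need not match PostgreSQL's conversion 
map. Neither check can tell whether a "well-formed but unmapped" 
position is unassigned in the encoding itself or merely absent from the 
mapping table, which is the open question above.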
