Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8

12+ messages / 4 participants

* Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
@ 2026-05-02 02:31  Zhongpu Chen <[email protected]>
  0 siblings, 2 replies; 12+ messages in thread

From: Zhongpu Chen @ 2026-05-02 02:31 UTC (permalink / raw)
  To: [email protected]

See the related bug report
https://www.postgresql.org/message-id/CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.g...

Currently PostgreSQL accepts structurally well-formed EUC_CN byte sequences
such as 0xA2A3 into text columns. The value round-trips when
client_encoding is EUC_CN, but fails when client_encoding is UTF8 because
euc_cn_to_utf8 has no mapping.

If this behavior is intentional for compatibility, the documentation should
explicitly say that validation for some legacy encodings is byte-structure
validation, not mapping-table validation.
If it is not intentional, stricter validation could reject unassigned byte
positions at input time.
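
The gap described here can be illustrated with a minimal sketch. It assumes that "structural" validation means every two-byte character has both bytes in 0xA1..0xFE, and it uses Python's built-in `gb2312` codec as a stand-in for a mapping table. Neither function is PostgreSQL code, and both function names are hypothetical.

```python
# Hypothetical sketch: structural vs. mapping-table validation for EUC_CN.
# Assumption: a well-formed two-byte character has both bytes in 0xA1..0xFE.
# Python's "gb2312" codec stands in for a conversion table; it is not
# PostgreSQL's euc_cn_to_utf8 map and may differ from it in detail.


def is_structurally_valid(data: bytes) -> bool:
    """Check byte structure only: ASCII bytes, or pairs in 0xA1..0xFE."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:          # single-byte ASCII
            i += 1
            continue
        if i + 1 >= len(data):      # truncated two-byte sequence
            return False
        if not (0xA1 <= data[i] <= 0xFE and 0xA1 <= data[i + 1] <= 0xFE):
            return False
        i += 2
    return True


def is_mappable(data: bytes) -> bool:
    """Check that every character has an entry in the codec's table."""
    try:
        data.decode("gb2312")
        return True
    except UnicodeDecodeError:
        return False
```

Under these assumptions, `is_structurally_valid(b"\xa2\xa3")` holds while `is_mappable(b"\xa2\xa3")` does not, which mirrors the reported behavior.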

-- 
Zhongpu Chen


* Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
@ 2026-05-02 02:39  Zhongpu Chen <[email protected]>
  parent: Zhongpu Chen <[email protected]>
  1 sibling, 1 reply; 12+ messages in thread

From: Zhongpu Chen @ 2026-05-02 02:39 UTC (permalink / raw)
  To: [email protected]

The issue is not specific to E'\\x..' literals. A normal COPY FROM data
file with ENCODING 'EUC_CN' can create text rows that later cannot be
retrieved with SELECT.

This suggests that input validation for EUC_CN is only structural, while
the EUC_CN-to-UTF8 conversion table is stricter.


-- 
Zhongpu Chen



* Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
@ 2026-05-02 03:28  David G. Johnston <[email protected]>
  parent: Zhongpu Chen <[email protected]>
  0 siblings, 1 reply; 12+ messages in thread

From: David G. Johnston @ 2026-05-02 03:28 UTC (permalink / raw)
  To: Zhongpu Chen <[email protected]>; +Cc: [email protected] <[email protected]>

On Friday, May 1, 2026, Zhongpu Chen <[email protected]> wrote:

> The issue is not specific to E'\\x..' literals. A normal COPY FROM data
> file with ENCODING 'EUC_CN' can create text rows that later cannot be
> retrieved with SELECT.
>
>  This suggests that input validation for EUC_CN is only structural, while
> the EUC_CN-to-UTF8 conversion table is stricter.
>

I suspect the cause is a lack of desire to maintain per-value
verification, or an unwillingness to accept its runtime cost.  The
validation is indeed only structural, and this point should probably be
documented better.  But it’s hard to feel too bad when the input claims to
be verifiable EUC_CN data and then supplies bytes that have no meaning in
that encoding.  We are happy to store your data and return it to you
as-is, but it’s unreasonable to ask for it to be converted.  It would be
nice for the database to provide an extra layer of protection, so I’m not
against the idea, either automatically or at least through a function that
could, say, be called in a trigger as an opt-in.  But it definitely feels
like a problematic benefit-to-cost proposition.

David J.


* Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
@ 2026-05-02 04:49  Zhongpu Chen <[email protected]>
  parent: David G. Johnston <[email protected]>
  0 siblings, 1 reply; 12+ messages in thread

From: Zhongpu Chen @ 2026-05-02 04:49 UTC (permalink / raw)
  To: David G. Johnston <[email protected]>; +Cc: [email protected] <[email protected]>

Thanks for the clarification.

I agree that validating every input raises runtime-cost concerns, but the
cost can be kept under control. For example, MySQL applies a finer-grained
check for EUC-CN (i.e., GB2312) in
https://github.com/mysql/mysql-server/blob/trunk/strings/ctype-gb2312.cc:

```
static int func_gb2312_uni_onechar(int code) {
  if ((code >= 0x2121) && (code <= 0x2658))
    return (tab_gb2312_uni0[code - 0x2121]);
  if ((code >= 0x2721) && (code <= 0x296F))
    return (tab_gb2312_uni1[code - 0x2721]);
  if ((code >= 0x3021) && (code <= 0x777E))
    return (tab_gb2312_uni2[code - 0x3021]);
  return (0);
}
```

where `code` is the two-byte EUC-CN value with 0x8080 subtracted (for
example, 0xA2A3 becomes 0x2223). Of course, MySQL's checking could itself
be made stricter.
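
The arithmetic behind that range check can be sketched as follows. This is only an illustration of the three ranges quoted above, with hypothetical function names; MySQL's lookup tables are not reproduced, so a code that passes the range check here may still be unassigned in the real tables.

```python
# Hypothetical sketch of the range pre-check only. MySQL's function then
# indexes lookup tables, which are not reproduced here, so a code that
# passes this check may still be unassigned.
GB2312_RANGES = ((0x2121, 0x2658), (0x2721, 0x296F), (0x3021, 0x777E))


def euc_cn_to_code(lead: int, trail: int) -> int:
    """Combine two EUC-CN bytes and subtract 0x8080."""
    return ((lead << 8) | trail) - 0x8080


def in_table_range(code: int) -> bool:
    """True if the code falls inside one of the three table ranges."""
    return any(lo <= code <= hi for lo, hi in GB2312_RANGES)
```

For instance, 0xA2A3 yields 0x2223, which lies inside the first range, so only the table lookup can decide whether it is assigned. By contrast, 0xAAA1 yields 0x2A21, which lies outside all three ranges and is rejected by the range check alone.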

Anyway, it is reasonable to note these details in the documentation.

-- 
Zhongpu Chen


* Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
@ 2026-05-06 06:34  Zhongpu Chen <[email protected]>
  parent: Zhongpu Chen <[email protected]>
  0 siblings, 0 replies; 12+ messages in thread

From: Zhongpu Chen @ 2026-05-06 06:34 UTC (permalink / raw)
  To: David G. Johnston <[email protected]>; +Cc: [email protected] <[email protected]>

I ran a benchmark on the text of a classic Chinese novel to compare the
performance of various validation strategies:
https://github.com/SWUFE-DB-Group/NUAV/blob/main/encoding-validation/NUAV/src/gb2312.rs

The log from running `cargo bench -- gb2312`:

```
     Running benches/gb2312.rs (target/release/deps/gb2312-53d8e01b8e6785c8)
gb2312::is_gb2312_iconv time:   [2.5621 ms 2.5681 ms 2.5740 ms]
                        change: [-2.6404% -2.3284% -2.0023%] (p = 0.00 < 0.05)
                        Performance has improved.

gb2312::is_gb2312_icu   time:   [3.2427 ms 3.2589 ms 3.2771 ms]
                        change: [-1.5710% -1.0409% -0.4387%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

gb2312::is_gb2312_rs    time:   [2.8157 ms 2.8229 ms 2.8303 ms]
                        change: [-1.6985% -1.2165% -0.7501%] (p = 0.00 < 0.05)
                        Change within noise threshold.

Benchmarking gb2312::is_gb2312_range: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.3s, enable flat sampling, or reduce sample count to 50.
gb2312::is_gb2312_range time:   [1.6237 ms 1.6294 ms 1.6351 ms]
                        change: [+3.8720% +4.2901% +4.6933%] (p = 0.00 < 0.05)
                        Performance has regressed.

gb2312::is_gb2312_lookup
                        time:   [488.12 µs 490.04 µs 491.97 µs]
                        change: [+0.9273% +2.2343% +3.2599%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) low mild

gb2312::is_gb2312_simd  time:   [181.00 µs 181.77 µs 182.53 µs]
                        change: [-4.4563% -3.6971% -3.0260%] (p = 0.00 < 0.05)
                        Performance has improved.

gb2312:is_gb2312_ranges_pg
                        time:   [467.69 µs 469.27 µs 470.82 µs]

Benchmarking gb2312:is_gb2312_ranges_mysql: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60.
gb2312:is_gb2312_ranges_mysql
                        time:   [1.2611 ms 1.2667 ms 1.2724 ms]
```

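As we can see, PG-style structural validation (about 469 µs) does not buy
much performance. It is slower than my fastest strict checker,
is_gb2312_simd (about 182 µs), and only marginally faster than
is_gb2312_lookup (about 490 µs).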

-- 
Zhongpu Chen



* Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
@ 2026-05-06 07:32  Peter Eisentraut <[email protected]>
  parent: Zhongpu Chen <[email protected]>
  1 sibling, 2 replies; 12+ messages in thread

From: Peter Eisentraut @ 2026-05-06 07:32 UTC (permalink / raw)
  To: Zhongpu Chen <[email protected]>; [email protected]

On 02.05.26 04:31, Zhongpu Chen wrote:
> See the related bug report
> https://www.postgresql.org/message-id/CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com
>
> Currently PostgreSQL accepts structurally well-formed EUC_CN byte
> sequences such as 0xA2A3 into text columns. The value round-trips when
> client_encoding is EUC_CN, but fails when client_encoding is UTF8
> because euc_cn_to_utf8 has no mapping.
>
> If this behavior is intentional for compatibility, the documentation
> should explicitly say that validation for some legacy encodings is
> byte-structure validation, not mapping-table validation.
> If it is not intentional, stricter validation could reject unassigned
> byte positions at input time.

In general, text in a non-UTF8 encoding is not required to be
convertible to UTF8.

(This is also a result of history: These encodings were implemented in 
PostgreSQL before Unicode.)

That said, I can see how different behaviors might be desirable.

My first question would be, are these non-convertible byte sequences 
just characters that don't map to Unicode, or are they invalid within 
the definition of the EUC-* encodings themselves?  If the latter, then 
we should just reject them (modulo some backward compatibility), similar 
to how we reject certain Unicode code points that exist "structurally" 
but are not valid for one reason or another.

Alternatively, if these byte sequences are valid characters but they 
just didn't end up in Unicode for some reason, then rejecting them might 
break valid uses.

(I don't know enough about EUC-* to answer these.)


* Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
@ 2026-05-06 09:15  Zhongpu Chen <[email protected]>
  parent: Peter Eisentraut <[email protected]>
  1 sibling, 0 replies; 12+ messages in thread

From: Zhongpu Chen @ 2026-05-06 09:15 UTC (permalink / raw)
  To: Peter Eisentraut <[email protected]>; +Cc: [email protected]

I agree that not every valid character encoded in a legacy non-UTF8
encoding is necessarily required to be convertible to UTF8. But this
assumes that the byte sequence actually denotes a valid character in the
declared legacy encoding.

For the reported EUC-CN cases, this is exactly the point in question. These
byte sequences are structurally well-formed EUC-CN byte pairs, but they
fall into reserved or unassigned positions of the GB2312 code table. For
example, byte sequences with first byte 0xAA correspond to row 10 of
GB2312, which is reserved/unassigned. Therefore, these cases are not merely
valid legacy characters that happen to lack Unicode mappings. Rather, under
strict GB2312/EUC-CN semantics, they are not assigned to any character at
all, and thus should not be considered valid GB2312 characters.
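
The row arithmetic behind this argument can be written out as a small sketch. It relies on the usual EUC-CN layout, in which each byte is the GB2312 row or cell number (1..94) plus 0xA0; the helper name is hypothetical.

```python
from typing import Tuple


# Hypothetical helper: map an EUC-CN byte pair to its GB2312 row and cell
# (both 1..94). Each byte is the row or cell number plus 0xA0.
def euc_cn_row_cell(lead: int, trail: int) -> Tuple[int, int]:
    if not (0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
        raise ValueError("not a structurally valid EUC-CN byte pair")
    return lead - 0xA0, trail - 0xA0
```

With this helper, any pair whose first byte is 0xAA maps to row 10, and 0xA2A3 maps to row 2, cell 3. Whether a given row or cell is assigned still has to come from the GB2312 code table itself.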

So my concern is not that every legacy-encoded character must be
convertible to UTF8. The concern is that PostgreSQL's write-time validation
accepts a structural superset of EUC-CN byte pairs as text, while some of
these byte pairs are not valid assigned GB2312 characters and PostgreSQL's
own later conversion path cannot assign character semantics to them.

BTW, as noted in MySQL's implementation, a finer checker is possible.

-- 
Zhongpu Chen


* Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
@ 2026-05-06 12:19  Tatsuo Ishii <[email protected]>
  parent: Peter Eisentraut <[email protected]>
  1 sibling, 1 reply; 12+ messages in thread

From: Tatsuo Ishii @ 2026-05-06 12:19 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]; [email protected]

> It is in general not necessarily required that all text in all
> non-UTF8 encodings must be convertible to UTF8.
> 
> (This is also a result of history: These encodings were implemented in
> PostgreSQL before Unicode.)
> 
> That said, I can see how different behaviors might be desirable.
> 
> My first question would be, are these non-convertible byte sequences
> just characters that don't map to Unicode, or are they invalid within
> the definition of the EUC-* encodings themselves?

Strictly speaking, the latter. 0xA2A3 is the lowercase Roman numeral
three (iii), which is not defined in the original GB2312 (the character
set of EUC_CN).

> If the latter, then
> we should just reject them (modulo some backward compatibility),
> similar to how we reject certain Unicode code points that exist
> "structurally" but are not valid for one reason or another.

After GB2312, GB18030 was defined. (It is claimed that GB18030 is a
superset of GB2312.) In GB18030, the lowercase forms of the Roman
numerals and other characters (e.g. the Euro sign) were added.

So I suspect that a) those characters are sometimes used with EUC_CN
despite the fact that they are not valid GB2312 characters, and b) some
EUC_CN users might have already written those characters into EUC_CN
databases. If so, tightening up the validation may break such
uses. This is just my guess. Please correct me if I am wrong.
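
The difference between the two character sets can be checked with a short sketch. It uses Python's built-in `gb2312` and `gb18030` codecs as stand-ins; they are not PostgreSQL's conversion tables, and the function name is hypothetical.

```python
# Sketch: decode one byte pair under two of Python's built-in codecs.
# These codecs are stand-ins; they are not PostgreSQL's conversion tables.
def decode_under(pair: bytes) -> dict:
    result = {}
    for name in ("gb2312", "gb18030"):
        try:
            result[name] = pair.decode(name)
        except UnicodeDecodeError:
            result[name] = None   # no mapping in this codec
    return result
```

Calling `decode_under(b"\xa2\xa3")` gives no result for `gb2312` but a lowercase Roman numeral three (U+2172) for `gb18030`, which is consistent with the suspicion that such bytes may already exist in databases labeled EUC_CN.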

> Alternatively, if these byte sequences are valid characters but they
> just didn't end up in Unicode for some reason, then rejecting them
> might break valid uses.

That's not the case, at least for 0xA2A3. It seems UCS_to_EUC_CN.pl
explicitly rejects characters that are not part of GB2312, including
0xA2A3, because the script uses GB18030 data as its source.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


* Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
@ 2026-05-09 08:58  Zhongpu Chen <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 2 replies; 12+ messages in thread

From: Zhongpu Chen @ 2026-05-09 08:58 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]; [email protected]

> If so, tightening up the validation may break such uses.

I agree. What about introducing an extra GUC that lets users choose the
validation logic? In fact, I have already implemented this as a patch.

```
SHOW encoding_validation;
-- default behaviour
SET encoding_validation = 'native';
-- enforce Write to be fully compatible with Read
SET encoding_validation = 'read_compatible';
```


-- 
Zhongpu Chen



* Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
@ 2026-05-10 02:28  Zhongpu Chen <[email protected]>
  parent: Zhongpu Chen <[email protected]>
  1 sibling, 0 replies; 12+ messages in thread

From: Zhongpu Chen @ 2026-05-10 02:28 UTC (permalink / raw)
  To: Tatsuo Ishii <[email protected]>; +Cc: [email protected]; [email protected]

My prototype implementation is at
https://github.com/SWUFE-DB-Group/postgresql-encoding-validation
and its usage is described in
https://github.com/SWUFE-DB-Group/postgresql-encoding-validation/blob/main/DEV.md


-- 
Zhongpu Chen



* Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
@ 2026-05-11 01:40  Tatsuo Ishii <[email protected]>
  parent: Zhongpu Chen <[email protected]>
  1 sibling, 1 reply; 12+ messages in thread

From: Tatsuo Ishii @ 2026-05-11 01:40 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]; [email protected]

>> If so, tightening up the validation may break such uses.
> 
> I agree. What about introducing an extra GUC which allows users to specify
> verification logic? In fact, I have implemented this patch.
> 
> ```
> SHOW encoding_validation;
> -- default behaviour
> SET encoding_validation = 'native';
> -- enforce Write to be fully compatible with Read
> SET encoding_validation = 'read_compatible';
> ```

-1 for using a GUC. The appropriate setting may vary from one encoding to
another.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp







* Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
@ 2026-05-11 02:39  Tatsuo Ishii <[email protected]>
  parent: Tatsuo Ishii <[email protected]>
  0 siblings, 0 replies; 12+ messages in thread

From: Tatsuo Ishii @ 2026-05-11 02:39 UTC (permalink / raw)
  To: [email protected]; +Cc: [email protected]

[Add Cc: to pgsql-hackers]

From: Zhongpu Chen <[email protected]>
Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
Date: Mon, 11 May 2026 09:56:20 +0800
Message-ID: <CA+1gyqJWpDhOCiM2WrCTffbbTdQ2gWiVzZikiQFkKmTng5Hn_w@mail.gmail.com>

> I see. The settings may be used in a finer way. For example, `set
> euc-cn-encoding-validation = 'read_compatible'`.

That would break pg_dumpall. Suppose there is a database populated
with `set euc-cn-encoding-validation = 'native'`.

1. Dump the database cluster using pg_dumpall.
2. Create a new database cluster using initdb.
3. Set euc-cn-encoding-validation = 'read_compatible' in postgresql.conf.
4. Restore from the dump. This fails because of disallowed EUC_CN characters.

I think encoding properties (including character validation) should
belong to the encoding itself rather than to GUC parameters. If you want to
have "strict" EUC_CN and "non-strict" EUC_CN at the same time, I think
the best way to implement that is to add a new EUC_CN variant encoding.
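
The dump-and-restore failure described in the steps above can be simulated with a small sketch. The mode names follow the proposal in this thread, but the validators and function names are hypothetical, and Python's `gb2312` codec is only a stand-in for a strict mapping check.

```python
# Hypothetical simulation of the dump/restore scenario. The mode names
# follow the proposal in this thread; the validators are illustrative and
# use Python's "gb2312" codec as a stand-in for a strict mapping check.


def validate(value: bytes, mode: str) -> None:
    """Raise ValueError if `value` is rejected under the given mode."""
    if mode == "native":
        # Structural check only: ASCII, or byte pairs in 0xA1..0xFE.
        i = 0
        while i < len(value):
            if value[i] < 0x80:
                i += 1
            elif (i + 1 < len(value)
                  and 0xA1 <= value[i] <= 0xFE
                  and 0xA1 <= value[i + 1] <= 0xFE):
                i += 2
            else:
                raise ValueError("malformed EUC_CN byte sequence")
    elif mode == "read_compatible":
        try:
            value.decode("gb2312")
        except UnicodeDecodeError as exc:
            raise ValueError("unassigned EUC_CN character") from exc
    else:
        raise ValueError(f"unknown mode: {mode}")


def load(rows, mode):
    """Validate every row, as an insert or restore would, and return them."""
    for row in rows:
        validate(row, mode)
    return list(rows)
```

Loading `[b"\xb0\xa1", b"\xa2\xa3"]` with mode `"native"` succeeds, while loading the same rows with mode `"read_compatible"` raises `ValueError` on the second row. That is the restore failure in step 4: the data was legal when written but is rejected by the stricter setting on the new cluster.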

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

Thread overview: 12+ messages
2026-05-02 02:31 Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8 Zhongpu Chen <[email protected]>
2026-05-02 02:39 ` Zhongpu Chen <[email protected]>
2026-05-02 03:28   ` David G. Johnston <[email protected]>
2026-05-02 04:49     ` Zhongpu Chen <[email protected]>
2026-05-06 06:34       ` Zhongpu Chen <[email protected]>
2026-05-06 07:32 ` Peter Eisentraut <[email protected]>
2026-05-06 09:15   ` Zhongpu Chen <[email protected]>
2026-05-06 12:19   ` Tatsuo Ishii <[email protected]>
2026-05-09 08:58     ` Zhongpu Chen <[email protected]>
2026-05-10 02:28       ` Zhongpu Chen <[email protected]>
2026-05-11 01:40       ` Tatsuo Ishii <[email protected]>
2026-05-11 02:39         ` Tatsuo Ishii <[email protected]>
