Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wKVqY-0014D5-0S for pgsql-hackers@arkaria.postgresql.org; Wed, 06 May 2026 06:35:02 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wKVqW-00FF3Z-2H for pgsql-hackers@arkaria.postgresql.org; Wed, 06 May 2026 06:35:00 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wKVqW-00FF3Q-0U for pgsql-hackers@lists.postgresql.org; Wed, 06 May 2026 06:35:00 +0000 Received: from mail-ua1-x92c.google.com ([2607:f8b0:4864:20::92c]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wKVqT-00000000RbZ-2p6P for pgsql-hackers@lists.postgresql.org; Wed, 06 May 2026 06:34:59 +0000 Received: by mail-ua1-x92c.google.com with SMTP id a1e0cc1a2514c-95d226e3ed8so439459241.0 for ; Tue, 05 May 2026 23:34:58 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1778049297; cv=none; d=google.com; s=arc-20240605; b=IczyyzDkQfwv1Ak8xvAmQ8mD8A7zxfQGe6SOhb4omysu2LEOPO2y+RsbelCvzYWpkG sLzumM0Pidx4yq5jze5/ePz6Mi2rFPD5raoPYSdeEeVLUBW0u3KuNQBApd98mhb0xqC9 nt6N+H293eBEluKbSfKbYQ02yrs/Va6D5JVWvS7BZDN7BTQCfq2yoJROkPOWoDwXfHiY RqUnl0zU2J0aDfkQDaK9vT7lbDA+xk4CLSU2aj967RPymP7OIN5FNyY5nXS13OuDitfD KFW4WQcv7SUq5xftXnagpkTXnjKu/CdrxHNy9GHZO4JntuJgrh9hIfyfVE3AW1UD80aY 2NzA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=5He6s+y/lssLM2XBLzdpQur9tWg9g33a2uGNrrJVKc8=; fh=GFo+u67MzUl5ed+wjU0qijpu4VYkVdztqHO90dFo74c=; b=J/vj7oeOxZ+D6xKJSM9mxy74HMAyFUzpPjsc3uzTPLZDf5/j3V2eQqQQxKL0WKI44f /Q1ABYLmOkF42bFhsbXGhSRsFambX1Ie8IqU7vnD4/Xwy56xAFMWncphPg5Yn56Y5A8H FFbrv63p8hWPTpFgqnicMrfmKBojtqFYXAIiFYnj/Xw7KcIz0trv3ZXa5IVgVYMpy1jE SA3WPJUd3B3YFWofQizHLneB0F3If4k0D7qxAH9uVoeRcW517ChsuG5Yyd4bsT48LxtS wncB+fLyJmdCoBWefRjaQU+qbD7kZ9JTAGkSh8j1MC4qV07b7FYS0XWPekLZl4jejRJR rR0g==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1778049297; x=1778654097; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=5He6s+y/lssLM2XBLzdpQur9tWg9g33a2uGNrrJVKc8=; b=nWUMzp/mjkiCp0ZbWuvhO2JnUWcbWuHLzHs6pbakuo8LKhT9O6skFhl9ZVrNqOx0ul dkuQivPtsJsgDlbc6vsQkWYu4xxedl4trs5eQXF260Z1n01uysat9scwLfNENOdodZ+1 +sNrdjMjwiOcCFceS0yCJ16b7Q3QhBSOqlnd9zPvnxN7m/NuUGgnqHNnzIBHfMGxSA3J E5FSwOfbO4P+KA80S2jJo4iqjbLKRPzd02wLUqoNIEAlBgoBKKSwdLUx6eYFvOgwLKw/ PpHubNLgqw4wGU2++mUdGoQCxMqvuFeic9evtXqLjATd9w//2waRkXVhGfU8jYCeBwOE VgFA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1778049297; x=1778654097; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=5He6s+y/lssLM2XBLzdpQur9tWg9g33a2uGNrrJVKc8=; b=rISG7ttYSBqW9lgJUBl0D0iFoSpL2h6Z5SZ7Na59ZgRCVgI7SUGNXx2BqLefLnVC9M 172ydTs594+wCSGCKGcsjalMgdbhyS4h/+B+Dr76081z7aoejFVxBGq/ZgbYnY6knkWh 0TmQSNr7NlQET3sqbejfw4BVbnssW7YBGn2b81XrclgfNdX396piDQAHb769RCsCqHGO oX4DHZMXiP9bL9dBCDk80lfhB8QZN7MBtdstQ4Yj52OCY+jISZXROYcIP9k83XRxxS01 TnzNTpBhQ782SVceBr5uBF5EjlqPLVuuERrwgQjdyLCnJc6D3HPkT5w4SVm/8QCXyLLT xIZw== X-Gm-Message-State: AOJu0YyvP/tk7ym7fASMH/ech9NnjBYujzFBy8kVA37yNF0dk/rQtUFs xB3SLxLpSghrjTCKIlJZHnR5OQ6TrJmSMOMJt+c9YHSYsK0lBZP8VhTGs8TErsS+vpQTVY/oXXJ MZ8eWNYEBtuXAiht+xiQQxQZjREWs0Vk= X-Gm-Gg: AeBDievzFI+7ne2X2fDdSeo2FAMXDjo/dlZnvDWCrAwwTkMyghDlvi/FXnLc65Ev39Y 9Gz3kKIdIAfJnOLXJEr+WwL8o+HwYjLjztM5FqTloLFWH+lYIZlq8S+FYbqkGwN7bQM/b+dN2k/ vWrOdSjTYSh5DmKIu40E+fFukjq3s9PJIR1/PM6WzCXXrivT2AKCQs+Qh6wXezi2SXmIOmO5ddM 9q/OEZscSGv3LmLea6tsrquBJJVswMZ2XL6cuZ7qXIUZtzhwo+rxTw8RsZmLx0W6Ar4fmhRkUap RO+QQoQpD7s8u3/8oj/1LpROH3ofNJNduuXd X-Received: by 2002:a67:e7c2:0:b0:602:8894:b76c with SMTP id ada2fe7eead31-630f72b5f5fmr726932137.5.1778049297018; Tue, 05 May 2026 23:34:57 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Zhongpu Chen Date: Wed, 6 May 2026 14:34:46 +0800 X-Gm-Features: AVHnY4KEpE9h7hrSaXpnZ8KJf3MUI6PPAaJncuLypDS2UUnLbdX2xdCT5le8cqc Message-ID: Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8 To: "David G. Johnston" Cc: "pgsql-hackers@lists.postgresql.org" Content-Type: multipart/alternative; boundary="000000000000bdf2810651205c39" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --000000000000bdf2810651205c39 Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable I run a benchmark to test the performance over a Chinese classic novel with respect to various validation strategies: https://github.com/SWUFE-DB-Group/NUAV/blob/main/encoding-validation/NUAV/s= rc/gb2312.rs The running log of `cargo bench -- gb2312`: ``` Running benches/gb2312.rs (target/release/deps/gb2312-53d8e01b8e6785c8= ) gb2312::is_gb2312_iconv time: [2.5621 ms 2.5681 ms 2.5740 ms] change: [-2.6404% -2.3284% -2.0023%] (p =3D 0.00 < 0.05) Performance has improved. gb2312::is_gb2312_icu time: [3.2427 ms 3.2589 ms 3.2771 ms] change: [-1.5710% -1.0409% -0.4387%] (p =3D 0.00 < 0.05) Change within noise threshold. Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) high mild 2 (2.00%) high severe gb2312::is_gb2312_rs time: [2.8157 ms 2.8229 ms 2.8303 ms] change: [-1.6985% -1.2165% -0.7501%] (p =3D 0.00 < 0.05) Change within noise threshold. Benchmarking gb2312::is_gb2312_range: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.3s, enable flat sampling, or reduce sample count to 50. gb2312::is_gb2312_range time: [1.6237 ms 1.6294 ms 1.6351 ms] change: [+3.8720% +4.2901% +4.6933%] (p =3D 0.00 < 0.05) Performance has regressed. gb2312::is_gb2312_lookup time: [488.12 =C2=B5s 490.04 =C2=B5s 491.97 =C2= =B5s] change: [+0.9273% +2.2343% +3.2599%] (p =3D 0.00 < 0.05) Change within noise threshold. Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) low mild gb2312::is_gb2312_simd time: [181.00 =C2=B5s 181.77 =C2=B5s 182.53 =C2= =B5s] change: [-4.4563% -3.6971% -3.0260%] (p =3D 0.00 < 0.05) Performance has improved. gb2312:is_gb2312_ranges_pg time: [467.69 =C2=B5s 469.27 =C2=B5s 470.82 =C2= =B5s] Benchmarking gb2312:is_gb2312_ranges_mysql: Warming up for 3.0000 s Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 6.4s, enable flat sampling, or reduce sample count to 60. gb2312:is_gb2312_ranges_mysql time: [1.2611 ms 1.2667 ms 1.2724 ms] ``` As we can see, the PG-style validation does not bring much improvement. Instead, it is slower than my strict-styles. On Sat, May 2, 2026 at 12:49=E2=80=AFPM Zhongpu Chen = wrote: > Thanks for the clarification. > > > I agree that validation on every input may have runtime-cost concerns. Bu= t > this can be well-controlled. For example, MySQL adopts a finer checking f= or > EUC-CN (i.e., GB2312) in > https://github.com/mysql/mysql-server/blob/trunk/strings/ctype-gb2312.cc: > > > ``` > > static int func_gb2312_uni_onechar(int code) { > if ((code >=3D 0x2121) && (code <=3D 0x2658)) > return (tab_gb2312_uni0[code - 0x2121]); > if ((code >=3D 0x2721) && (code <=3D 0x296F)) > return (tab_gb2312_uni1[code - 0x2721]); > if ((code >=3D 0x3021) && (code <=3D 0x777E)) > return (tab_gb2312_uni2[code - 0x3021]); > return (0); > } > > ``` > > where `code` is obtained by subtracting 0x8080. Of course, MySQL's > checking can also be enhanced. > > > Anyway, it is reasonable to note these details in the documentation. > > > On Sat, May 2, 2026 at 11:28=E2=80=AFAM David G. Johnston < > david.g.johnston@gmail.com> wrote: > >> On Friday, May 1, 2026, Zhongpu Chen wrote: >> >>> The issue is not specific to E'\\x..' literals. A normal COPY FROM data >>> file with ENCODING 'EUC_CN' can create text rows that later cannot be >>> retrieved with SELECT. >>> >>> This suggests that input validation for EUC_CN is only structural, >>> while the EUC_CN-to-UTF8 conversion table is stricter. >>> >> >> I suspect a lack of desire to maintain and ensure that specific values >> are verified; or accepting the runtime cost to do so. It is indeed >> structural. This point should probably be documented better. But it=E2= =80=99s >> hard to feel too bad if the input claims it is providing verifiable EUC_= CN >> data then proceeds to supply data that lacks meaning in reality. We are >> happy to just store and return your data to you - but it=E2=80=99s unrea= sonable to >> ask for it to be converted. It would be nice for the database to provid= e >> an extra layer of protection, so I=E2=80=99m not against the idea. Eith= er >> automatically or or at least providing a function that could, say, be >> called in a trigger for opt-in. But definitely feels like a problematic >> benefit-to-cost proposition. >> >> David J. >> >> > > -- > Zhongpu Chen > --=20 Zhongpu Chen --000000000000bdf2810651205c39 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
I run a benchmark to test the performance over a Chin= ese classic novel with respect to various validation strategies:=C2=A0https://github.com/SWUFE-DB-Group/NUAV/blob/main/encodi= ng-validation/NUAV/src/gb2312.rs

The running l= og of `cargo bench -- gb2312`:

```
=C2= =A0 =C2=A0 =C2=A0Running benches/gb2312.rs (target/release/deps/gb2312-53d8= e01b8e6785c8)
gb2312::is_gb2312_iconv time: =C2=A0 [2.5621 ms 2.5681 ms = 2.5740 ms]
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 change: [-2.6404% -2.3284% -2.0023%] (p =3D 0.00 &= lt; 0.05)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 Performance has improved.

gb2312::is_gb2312_ic= u =C2=A0 time: =C2=A0 [3.2427 ms 3.2589 ms 3.2771 ms]
=C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 change: = [-1.5710% -1.0409% -0.4387%] (p =3D 0.00 < 0.05)
=C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 Change with= in noise threshold.
Found 5 outliers among 100 measurements (5.00%)
= =C2=A0 3 (3.00%) high mild
=C2=A0 2 (2.00%) high severe

gb2312::i= s_gb2312_rs =C2=A0 =C2=A0time: =C2=A0 [2.8157 ms 2.8229 ms 2.8303 ms]
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 change: [-1.6985% -1.2165% -0.7501%] (p =3D 0.00 < 0.05)
= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 Change within noise threshold.

Benchmarking gb2312::is_gb= 2312_range: Warming up for 3.0000 s
Warning: Unable to complete 100 samp= les in 5.0s. You may wish to increase target time to 8.3s, enable flat samp= ling, or reduce sample count to 50.
gb2312::is_gb2312_range time: =C2=A0= [1.6237 ms 1.6294 ms 1.6351 ms]
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 change: [+3.8720% +4.2901% +4= .6933%] (p =3D 0.00 < 0.05)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 Performance has regressed.
gb2312::is_gb2312_lookup
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 time: =C2=A0 [488.12 =C2=B5s 490.04 = =C2=B5s 491.97 =C2=B5s]
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 change: [+0.9273% +2.2343% +3.2599%] (p= =3D 0.00 < 0.05)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 Change within noise threshold.
Found = 1 outliers among 100 measurements (1.00%)
=C2=A0 1 (1.00%) low mild
<= br>gb2312::is_gb2312_simd =C2=A0time: =C2=A0 [181.00 =C2=B5s 181.77 =C2=B5s= 182.53 =C2=B5s]
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 change: [-4.4563% -3.6971% -3.0260%] (p =3D 0.= 00 < 0.05)
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 Performance has improved.

gb2312:is_gb23= 12_ranges_pg
=C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 time: =C2=A0 [467.69 =C2=B5s 469.27 =C2=B5s 470.82= =C2=B5s]

Benchmarking gb2312:is_gb2312_ranges_mysql: Warming up for= 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish = to increase target time to 6.4s, enable flat sampling, or reduce sample cou= nt to 60.
gb2312:is_gb2312_ranges_mysql
=C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 time: =C2=A0 [1.261= 1 ms 1.2667 ms 1.2724 ms]

```

=
As we can see, the PG-style validation does not bring much improvement= . Instead, it is slower than my strict-styles.=C2=A0

On Sat, May 2, 2026 at 12:49=E2=80=AFPM Zhongpu Chen <chenloveit@gmail.com> wrote:

Thanks for the clarification.<= /p>


I agree that validation on every input may have runtime-cost concerns.= But this can be well-controlled. For example, MySQL adopts a finer checkin= g for EUC-CN (i.e., GB2312) in https://github.c= om/mysql/mysql-server/blob/trunk/strings/ctype-gb2312.cc:


```

static int func_gb= 2312_uni_onechar(int code) {
=C2=A0 if ((code >=3D 0x2121) &&= (code <=3D 0x2658))
=C2=A0 =C2=A0 return (tab_gb2312_uni0[code - 0x2= 121]);
=C2=A0 if ((code >=3D 0x2721) && (code <=3D 0x296F)= )
=C2=A0 =C2=A0 return (tab_gb2312_uni1[code - 0x2721]);
=C2=A0 if ((= code >=3D 0x3021) && (code <=3D 0x777E))
=C2=A0 =C2=A0 ret= urn (tab_gb2312_uni2[code - 0x3021]);
=C2=A0 return (0);
}

```

where `code` is obtained by subtractin= g 0x8080. Of course, MySQL's checking can also be enhanced.


Anyway, it is reasonable to note these= details in the documentation.



=
On Sat, May 2, 2026 at 11:28=E2=80=AF= AM David G. Johnston <david.g.johnston@gmail.com> wrote:
On Friday, May 1, 2026, Zhongpu Che= n <chenloveit@= gmail.com> wrote:
The issue is not specific to E'\\x..' lite= rals. A normal COPY FROM data file with ENCODING 'EUC_CN' can creat= e text rows that later cannot be retrieved with SELECT.

=C2=A0This suggests that input validation for EUC_CN is only structur= al, while the EUC_CN-to-UTF8 conversion table is stricter.

I suspect a lack of desire to maintain and ensu= re that specific values are verified; or accepting the runtime cost to do s= o.=C2=A0 It is indeed structural.=C2=A0 This point should probably be docum= ented better.=C2=A0 But it=E2=80=99s hard to feel too bad if the input clai= ms it is providing verifiable EUC_CN data then proceeds to supply data that= lacks meaning in reality.=C2=A0 We are happy to just store and return your= data to you - but it=E2=80=99s unreasonable to ask for it to be converted.= =C2=A0 It would be nice for the database to provide an extra layer of prote= ction, so I=E2=80=99m not against the idea.=C2=A0 Either automatically or o= r at least providing a function that could, say, be called in a trigger for= opt-in.=C2=A0 But definitely feels like a problematic benefit-to-cost prop= osition.

David J.



--
Zhongpu Chen


--
Zhongpu Chen
--000000000000bdf2810651205c39--