Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wJ2J1-0000d2-1U for pgsql-hackers@arkaria.postgresql.org; Sat, 02 May 2026 04:50:19 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.96) (envelope-from ) id 1wJ2I1-0006j9-17 for pgsql-hackers@arkaria.postgresql.org; Sat, 02 May 2026 04:49:17 +0000 Received: from makus.postgresql.org ([2001:4800:3e1:1::229]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.96) (envelope-from ) id 1wJ2I0-0006j1-2X for pgsql-hackers@lists.postgresql.org; Sat, 02 May 2026 04:49:16 +0000 Received: from mail-vs1-xe36.google.com ([2607:f8b0:4864:20::e36]) by makus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 (Exim 4.98.2) (envelope-from ) id 1wJ2Hy-000000000Mn-2pIM for pgsql-hackers@lists.postgresql.org; Sat, 02 May 2026 04:49:15 +0000 Received: by mail-vs1-xe36.google.com with SMTP id ada2fe7eead31-613207140bfso869424137.3 for ; Fri, 01 May 2026 21:49:14 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1777697352; cv=none; d=google.com; s=arc-20240605; b=eO8lPZ9uwKsWeDyTE0Qg5izZnbMxzjqncVk3SyVEO8QGEe/W9oyl8827zvCiZ5Rdrp wFEYK6eJOxDcu2yMdoHfM2DSwG52L8b9J2paTRArRdZ7B6/CPwGvWxE1NwcDvJN0ajMX tJv4DHTTeVcezVHgyv1tlyfWY1w74qgk2wWg4MsbmAEiTIpnqfyNsl13hmNL1u3X46y3 Cf722A0d9Giri+hqkMKrNgxcb27VYL6SmvOm7eDs96MbALa3PBqaRT/dCict673Hdc9S F/rrrHIVjeUI9IsrsYEOuMF5Kz/W/V1GPcKSjvs0bQckZsWy12ia+t0fZV9GtCF74KKS IOjQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20240605; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:dkim-signature; bh=zy7tFzmrSnIvOpGitSneQ//i7XQb0pp1r1B02e/gDSo=; fh=GFo+u67MzUl5ed+wjU0qijpu4VYkVdztqHO90dFo74c=; b=cXzlAggfHUFUeCscz+fqArY/DQtwSl7XBn/LJara+pV6H9byfIz4mZ3QNHXJq6eS1l OCuuOgajWidM/thIrNpX+Ywu3aKtPNRxJ91KnLshUrR593YZSCZq8R2pK5Jst6gz17PG wI34FJusJtgHTvPK/PDo/KXFk+WTvnR836atraz/iTVAE1/lUmBO0A5q2n7PR0fmzfnH h/foZnzdDw5RLgI6RRJuoXhrCSPpGrs61Na1j0cX6PJ6Vwu5jRhhGtwshRVfD8eTdgn7 2pvw4X6p9DDPNd1rQSuvgNzf/MNHIwZRet+4GMP5h4vVwF5wj/5o0yqyigXwpkKJPK58 Qfqg==; darn=lists.postgresql.org ARC-Authentication-Results: i=1; mx.google.com; arc=none DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20251104; t=1777697352; x=1778302152; darn=lists.postgresql.org; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=zy7tFzmrSnIvOpGitSneQ//i7XQb0pp1r1B02e/gDSo=; b=X0L7ee4HontRa9/tJ9SN9pnJiXF6ChTBR6OkH7xrhvkXsysHUv591wmY9ZyUZLpDF9 jGcoEfMvJDl5p9WgEsSsBMNB1aBq3rtJwn6JMT6O1ppBrpaim6jHOL+OU7MQTrtd3o70 JQgJ1D0dTKW1GAC1rkSTdVsC+WOQ821hjuWm4JlM7+Vlz28+DmgdkRiogUU3UzjfaiQn NuuyNRZSy+IYHq8jjmLkGIuz9RuHy6ezkKLGr2Ak0lm0/kjlHupm0BitDOpPC6newUaM YPONDt2coJEMyMym7j3B7tUIF22/c5OZ8ejNtPCULCyCkBoZ3Kqbv8emeiXtU1v8/+cW Qdxw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20251104; t=1777697352; x=1778302152; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-gg:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=zy7tFzmrSnIvOpGitSneQ//i7XQb0pp1r1B02e/gDSo=; b=YxvZN6XGYnqy2tTXg/aGfoe46i/r+zJmOceAHfMtm+93Oco4br6U4GwOLIG15exSRP agFpHd6TjU1ynG6AOL3Is7O2k6TWZZOLTp3tl+F0ZJjpl9YEpeCESbxVkLL0o+SG58d/ w2O25MqpDqHa5q7QefWAckTm3rvnAdcWAtNG0WKhGolHHBf4wQcS1UIiH/1MDXlTbDGr VPwfJ47vP8EEOznhq9ZTyei9A3AIWQEfc10F7ItJM32puT87/E9KyiLnwTtU9Cwg3i1P jKOBU6LYvffhJZygA/LzuBSCUm6AdEXGCn/NonnyXN5JHAoLfIRzCGC1q8EQtFeDPfSN VPhg== X-Gm-Message-State: AOJu0Yzf3psPti3gOhkkFhJahLWQSTGQwkYG+Itvy39OM+mImKVrM6bK w7FAGSaHMGHwDg3NIhpqm28kVLgPMAgPkg4J9QRxnLFlURdLRMtiweP8Q21PH9v3rVLYxR/YsRh +mpjHPwAI5il1bKtvcNSywYSczAnZ1yI= X-Gm-Gg: AeBDieu/ni7/Ln2ApU4+FLCKQhYH4hUqGw4BTwdLeYPzZaiFmlwLVrbUg4jJOK5bsVe rgHibHfv34FAR1VmD51zoezgO0TLBhB3/HsO4mHbWkyBnCN7m1M6CHrhgFnUTzlKR2GS+VMppvW mMawTaCzu48Ea0IQLLa75XTcVhT2fqm8lpYrKB6LPAOF4666Xmz5Te3vNgGQle34UP5F4mcehoC UnOdOuT3j/tHWDCnaln5Ax42M4ok0c5YHJ/vJvZGy0UU+znUG/98yIAGvwWKB4EgEbGYowOii+N L4AomItAqegV/DkhwC6No1MSZSexdAeh3+E= X-Received: by 2002:a05:6102:2c86:b0:608:94e4:1c00 with SMTP id ada2fe7eead31-62d84c68103mr714169137.2.1777697352172; Fri, 01 May 2026 21:49:12 -0700 (PDT) MIME-Version: 1.0 References: In-Reply-To: From: Zhongpu Chen Date: Sat, 2 May 2026 12:49:00 +0800 X-Gm-Features: AVHnY4LwELi-9dfsW2_yVFulFmKZtMeV1fyVteYJz6-O16J60JXjF_7sO2HT2m8 Message-ID: Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8 To: "David G. Johnston" Cc: "pgsql-hackers@lists.postgresql.org" Content-Type: multipart/alternative; boundary="00000000000031c75b0650ce6b7a" List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk --00000000000031c75b0650ce6b7a Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable Thanks for the clarification. I agree that validation on every input may have runtime-cost concerns. But this can be well-controlled. For example, MySQL adopts a finer checking for EUC-CN (i.e., GB2312) in https://github.com/mysql/mysql-server/blob/trunk/strings/ctype-gb2312.cc: ``` static int func_gb2312_uni_onechar(int code) { if ((code >=3D 0x2121) && (code <=3D 0x2658)) return (tab_gb2312_uni0[code - 0x2121]); if ((code >=3D 0x2721) && (code <=3D 0x296F)) return (tab_gb2312_uni1[code - 0x2721]); if ((code >=3D 0x3021) && (code <=3D 0x777E)) return (tab_gb2312_uni2[code - 0x3021]); return (0); } ``` where `code` is obtained by subtracting 0x8080. Of course, MySQL's checking can also be enhanced. Anyway, it is reasonable to note these details in the documentation. On Sat, May 2, 2026 at 11:28=E2=80=AFAM David G. Johnston < david.g.johnston@gmail.com> wrote: > On Friday, May 1, 2026, Zhongpu Chen wrote: > >> The issue is not specific to E'\\x..' literals. A normal COPY FROM data >> file with ENCODING 'EUC_CN' can create text rows that later cannot be >> retrieved with SELECT. >> >> This suggests that input validation for EUC_CN is only structural, whil= e >> the EUC_CN-to-UTF8 conversion table is stricter. >> > > I suspect a lack of desire to maintain and ensure that specific values ar= e > verified; or accepting the runtime cost to do so. It is indeed > structural. This point should probably be documented better. But it=E2= =80=99s > hard to feel too bad if the input claims it is providing verifiable EUC_C= N > data then proceeds to supply data that lacks meaning in reality. We are > happy to just store and return your data to you - but it=E2=80=99s unreas= onable to > ask for it to be converted. It would be nice for the database to provide > an extra layer of protection, so I=E2=80=99m not against the idea. Eithe= r > automatically or or at least providing a function that could, say, be > called in a trigger for opt-in. But definitely feels like a problematic > benefit-to-cost proposition. > > David J. > > --=20 Zhongpu Chen --00000000000031c75b0650ce6b7a Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable

Th= anks for the clarification.


I agree that validation on every input may have run= time-cost concerns. But this can be well-controlled. For example, MySQL ado= pts a finer checking for EUC-CN (i.e., GB2312) in https://github.= com/mysql/mysql-server/blob/trunk/strings/ctype-gb2312.cc:


```

static int func_gb2312_uni_onechar(int code) {
=C2= =A0 if ((code >=3D 0x2121) && (code <=3D 0x2658))
=C2=A0 = =C2=A0 return (tab_gb2312_uni0[code - 0x2121]);
=C2=A0 if ((code >=3D= 0x2721) && (code <=3D 0x296F))
=C2=A0 =C2=A0 return (tab_gb2= 312_uni1[code - 0x2721]);
=C2=A0 if ((code >=3D 0x3021) && (c= ode <=3D 0x777E))
=C2=A0 =C2=A0 return (tab_gb2312_uni2[code - 0x3021= ]);
=C2=A0 return (0);
}

```

where `code` is obtained by subtractin= g 0x8080. Of course, MySQL's checking can also be enhanced.


Anyw= ay, it is reasonable to note these details in the documentation.



On Sat, May 2, 2026 at 11:28=E2=80=AFAM David G. Johnsto= n <david.g.johnston@gmail.= com> wrote:
On Friday, May 1, 2026, Zhongpu Chen <chenloveit@gmail.com> wrote:
The issue is= not specific to E'\\x..' literals. A normal COPY FROM data file wi= th ENCODING 'EUC_CN' can create text rows that later cannot be retr= ieved with SELECT.

=C2=A0This suggests that input = validation for EUC_CN is only structural, while the EUC_CN-to-UTF8 conversi= on table is stricter.

I suspect= a lack of desire to maintain and ensure that specific values are verified;= or accepting the runtime cost to do so.=C2=A0 It is indeed structural.=C2= =A0 This point should probably be documented better.=C2=A0 But it=E2=80=99s= hard to feel too bad if the input claims it is providing verifiable EUC_CN= data then proceeds to supply data that lacks meaning in reality.=C2=A0 We = are happy to just store and return your data to you - but it=E2=80=99s unre= asonable to ask for it to be converted.=C2=A0 It would be nice for the data= base to provide an extra layer of protection, so I=E2=80=99m not against th= e idea.=C2=A0 Either automatically or or at least providing a function that= could, say, be called in a trigger for opt-in.=C2=A0 But definitely feels = like a problematic benefit-to-cost proposition.

Da= vid J.



--
Zhongpu Chen
--00000000000031c75b0650ce6b7a--