Date: Wed, 06 May 2026 21:19:07 +0900 (JST)
Message-Id: <20260506.211907.1578384907621261702.ishii@postgresql.org>
To: peter@eisentraut.org
Cc: chenloveit@gmail.com, pgsql-hackers@lists.postgresql.org
Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
From: Tatsuo Ishii
In-Reply-To: <30e628b4-03cd-43eb-9ea4-d211aaddcaf5@eisentraut.org>
References: <30e628b4-03cd-43eb-9ea4-d211aaddcaf5@eisentraut.org>

> It is in general not necessarily required that all text in all
> non-UTF8 encodings must be convertible to UTF8.
>
> (This is also a result of history: These encodings were implemented in
> PostgreSQL before Unicode.)
>
> That said, I can see how different behaviors might be desirable.
>
> My first question would be, are these non-convertible byte sequences
> just characters that don't map to Unicode, or are they invalid within
> the definition of the EUC-* encodings themselves?

Strictly speaking, the former. 0xA2A3 is the lowercase Roman numeral
three (iii), which is not defined in the original GB2312 (the
character set of EUC_CN).

> If the latter, then
> we should just reject them (modulo some backward compatibility),
> similar to how we reject certain Unicode code points that exist
> "structurally" but are not valid for one reason or another.

GB18030 was defined after GB2312. (GB18030 is claimed to be a superset
of GB2312.) GB18030 added the lowercase Roman numerals and other
characters (e.g. the Euro sign). So I suspect that:

a) those characters are sometimes used with EUC_CN despite the fact
   that they are not valid GB2312 characters.
b) some EUC_CN users might have already written those characters into
   EUC_CN databases.

If so, tightening up the validation may break such uses. This is just
my guess. Please correct me if I am wrong.

> Alternatively, if these byte sequences are valid characters but they
> just didn't end up in Unicode for some reason, then rejecting them
> might break valid uses.

That's not the case, at least for 0xA2A3. It seems UCS_to_EUC_CN.pl
explicitly rejects characters that are not part of GB2312, including
0xA2A3, since the script uses GB18030 as its source data.

Regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
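
To make the distinction above concrete, here is a minimal sketch of how a byte pair can pass a purely structural EUC check and still have no mapping in GB2312. It is an illustration under stated assumptions, not PostgreSQL's code: the byte-range rule in `is_structurally_valid_euc_pair` is a hypothetical simplification, and Python's `gb2312` and `gb18030` codecs stand in for the character-set definitions; they are not PostgreSQL's conversion tables.

```python
def is_structurally_valid_euc_pair(seq: bytes) -> bool:
    """Hypothetical structural check (an assumption, not PostgreSQL's
    verifier): a two-byte sequence with both bytes in 0xA1..0xFE."""
    return len(seq) == 2 and all(0xA1 <= b <= 0xFE for b in seq)


def decodes_as(seq: bytes, codec: str) -> bool:
    """Return True if Python's codec has a mapping for the sequence."""
    try:
        seq.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# The byte pair discussed in this thread.
seq = bytes([0xA2, 0xA3])

report = {
    "structurally valid": is_structurally_valid_euc_pair(seq),
    "mapped by gb2312 codec": decodes_as(seq, "gb2312"),
    "mapped by gb18030 codec": decodes_as(seq, "gb18030"),
}

for name, value in report.items():
    print(f"{name}: {value}")
```

With CPython's bundled codecs, I would expect the pair to pass the structural check, be rejected by `gb2312`, and decode under `gb18030` to U+2172 (SMALL ROMAN NUMERAL THREE), which matches the situation described above.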