Message-ID: <30e628b4-03cd-43eb-9ea4-d211aaddcaf5@eisentraut.org>
Date: Wed, 6 May 2026 09:32:23 +0200
Subject: Re: Proposal: tighten validation for legacy EUC encodings or document that accepted byte sequences may be unconvertible to UTF8
From: Peter Eisentraut
To: Zhongpu Chen, pgsql-hackers@lists.postgresql.org

On 02.05.26 04:31, Zhongpu Chen wrote:
> See the related bug report
> https://www.postgresql.org/message-id/CA%2B1gyqL7uiQhfLcYWpHNUKQgHjQc7sOPthSTiaxLDZzcrGFYSg%40mail.gmail.com
>
> Currently PostgreSQL accepts structurally well-formed EUC_CN byte
> sequences such as 0xA2A3 into text columns. The value round-trips when
> client_encoding is EUC_CN, but fails when client_encoding is UTF8
> because euc_cn_to_utf8 has no mapping.
>
> If this behavior is intentional for compatibility, the documentation
> should explicitly say that validation for some legacy encodings is
> byte-structure validation, not mapping-table validation.
> If it is not intentional, stricter validation could reject unassigned
> byte positions at input time.

It is in general not necessarily required that all text in all non-UTF8
encodings must be convertible to UTF8. (This is also a result of
history: These encodings were implemented in PostgreSQL before Unicode.)

That said, I can see how different behaviors might be desirable.

My first question would be, are these non-convertible byte sequences
just characters that don't map to Unicode, or are they invalid within
the definition of the EUC-* encodings themselves?

If the latter, then we should just reject them (modulo some backward
compatibility), similar to how we reject certain Unicode code points
that exist "structurally" but are not valid for one reason or another.

Alternatively, if these byte sequences are valid characters that simply
never made it into Unicode, then rejecting them might break valid uses.

(I don't know enough about EUC-* to answer these questions.)
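
To make the distinction in this thread concrete, here is a minimal sketch of byte-structure validation versus mapping-table validation. It is an illustration only, not PostgreSQL's verifier: the structural rule is simplified (ASCII, or two bytes both in 0xA1..0xFE), the function names are hypothetical, and ASSIGNED is a tiny stand-in for a real conversion map. 0xA2A3 is left out of the stand-in table to mirror the report that euc_cn_to_utf8 has no mapping for it; the sketch does not settle whether that byte pair is invalid in EUC_CN itself or merely unmapped.

```python
# Hypothetical illustration; not PostgreSQL's verifier or its conversion map.

# Tiny stand-in for a mapping table: a few assigned EUC_CN code points.
ASSIGNED = {
    b"\xa1\xa1": "\u3000",  # ideographic space
    b"\xb0\xa1": "\u554a",  # CJK ideograph U+554A
}


def split_chars(data: bytes):
    """Split data into ASCII bytes and two-byte sequences.

    Returns a list of byte strings, or None if the input is malformed
    under the simplified structural rule assumed here.
    """
    chars = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:
            # Single-byte ASCII character.
            chars.append(data[i:i + 1])
            i += 1
        elif (0xA1 <= first <= 0xFE
              and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            # Well-formed two-byte sequence.
            chars.append(data[i:i + 2])
            i += 2
        else:
            return None
    return chars


def structurally_valid(data: bytes) -> bool:
    """Byte-structure validation: only the byte ranges are checked."""
    return split_chars(data) is not None


def convertible(data: bytes) -> bool:
    """Mapping-table validation: every two-byte character must be assigned."""
    chars = split_chars(data)
    if chars is None:
        return False
    return all(len(c) == 1 or c in ASSIGNED for c in chars)


if __name__ == "__main__":
    sample = b"\xa2\xa3"
    print(structurally_valid(sample))  # accepted by the structural check
    print(convertible(sample))         # rejected by the table check
```

Under these assumptions, the proposal amounts to choosing which of the two checks runs at input time: the first accepts 0xA2A3 and defers the failure to conversion, the second rejects it up front.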