From: "Daniel Verite"
Subject: Re: Support LIKE with nondeterministic collations
To: "Peter Eisentraut"
Cc: Robert Haas, Pgsql-Hackers
In-Reply-To: <68263a89-b6af-4705-ac08-9a57cdd63bd0@eisentraut.org>
Date: Fri, 03 May 2024 17:47:48 +0200

	Peter Eisentraut wrote:

> However, off the top of my head, this definition has three flaws: (1)
> It would make the single-character wildcard effectively an
> any-number-of-characters wildcard, but only in some circumstances, which
> could be confusing, (2) it would be difficult to compute, because you'd
> have to check equality against all possible single-character strings,
> and (3) it is not what the SQL standard says.

For #1 we're currently using the definition of a "character" as
being any single code point, but this definition fits poorly
with non-deterministic collation rules.

The simplest illustration I can think of is the canonical
equivalence match between the NFD and NFC forms of an
accented character.

postgres=# CREATE COLLATION nd (
  provider = 'icu',
  locale = 'und',
  deterministic = false
);

-- match NFD form with NFC form of eacute

postgres=# select U&'e\0301' like 'é' collate nd;
 ?column?
----------
 t

postgres=# select U&'e\0301' like '_' collate nd;
 ?column?
----------
 f
(1 row)

I understand why the algorithm produces these opposite results.
But at the semantic level, when asked if the left-hand string matches
a specific character, it says yes, and when asked if it matches any
character, it says no.
To me this goes beyond counter-intuitive; it's not reasonable enough
to be called correct.

What could we do about it?
Intuitively I think that our interpretation of "character" here should
be whatever sequence of code points is between character boundaries [1],
and that the equality of such characters would be the equality of
their sequences of code points, with the string equality check of the
collation, whatever the length of these sequences.

[1]:
https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
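
To make the character-boundary idea above a little more concrete, here is a rough Python sketch of what LIKE could do if "_" consumed one boundary-delimited character instead of one code point. It is only an illustration under several assumptions, not PostgreSQL or ICU code: the names clusters, canon_equal and like are hypothetical, the boundaries are approximated as "a base code point plus any following combining marks" rather than by ICU's character break iterator, NFC comparison stands in for the collation's string equality, and escape characters are not handled.

```python
import unicodedata


def clusters(s):
    """Split s into approximate user-perceived characters.

    Assumption: a character is a base code point followed by any
    combining marks. ICU's real character boundaries are richer.
    """
    out = []
    for ch in s:
        if out and unicodedata.combining(ch) != 0:
            out[-1] += ch  # attach the combining mark to the previous base
        else:
            out.append(ch)
    return out


def canon_equal(a, b):
    """Stand-in for the collation's equality: canonical equivalence only."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def like(s, pattern):
    """Minimal LIKE with '_' and '%', matching cluster by cluster."""
    s_c, p_c = clusters(s), clusters(pattern)

    def match(i, j):
        if j == len(p_c):
            return i == len(s_c)
        if p_c[j] == "%":
            # '%' may consume zero or more whole characters
            return any(match(k, j + 1) for k in range(i, len(s_c) + 1))
        if i == len(s_c):
            return False
        if p_c[j] == "_" or canon_equal(s_c[i], p_c[j]):
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


nfd_eacute = "e\u0301"  # two code points, one perceived character
print(len(nfd_eacute), len(clusters(nfd_eacute)))
print(like(nfd_eacute, "\u00e9"), like(nfd_eacute, "_"))
```

With this interpretation both the literal match and the "_" match agree for the NFD form of eacute, while a pattern of two underscores does not match it, because the string contains a single character in the boundary sense.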