Re: Support LIKE with nondeterministic collations

From: Daniel Verite <[email protected]>
To: Peter Eisentraut <[email protected]>
Cc: Robert Haas <[email protected]>
Cc: Pgsql-Hackers <[email protected]>
Subject: Re: Support LIKE with nondeterministic collations
Date: Fri, 03 May 2024 17:47:48 +0200
Message-ID: <[email protected]> (raw)
In-Reply-To: <[email protected]>

	Peter Eisentraut wrote:

>  However, off the top of my head, this definition has three flaws: (1) 
> It would make the single-character wildcard effectively an 
> any-number-of-characters wildcard, but only in some circumstances, which 
> could be confusing, (2) it would be difficult to compute, because you'd 
> have to check equality against all possible single-character strings, 
> and (3) it is not what the SQL standard says.

For #1 we're currently defining a "character" as any single
code point, but this definition fits poorly with
non-deterministic collation rules.

The simplest illustration I can think of is the canonical
equivalence match between the NFD and NFC forms of an
accented character.

postgres=# CREATE COLLATION nd (
  provider = 'icu',
  locale = 'und',
  deterministic = false
);

-- match NFD form with NFC form of eacute

postgres=# select U&'e\0301' like 'é' collate nd;
 ?column? 
----------
 t
(1 row)

postgres=# select U&'e\0301' like '_' collate nd;
 ?column? 
----------
 f
(1 row)

I understand why the algorithm produces these opposite results.
But at the semantic level, when asked whether the left-hand string
matches a specific character, it says yes, and when asked whether it
matches any character, it says no.
To me this goes beyond counter-intuitive: it's not reasonable enough
to be called correct.
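
To make the code-point view concrete, here is a small Python sketch.
It is only an illustration, not PostgreSQL or ICU code: the helper
canonically_equal() is a hypothetical name, and it approximates the
collation's equality check by comparing NFD-normalized strings.

```python
import unicodedata

# NFD form: 'e' followed by U+0301 COMBINING ACUTE ACCENT (two code points)
nfd = "e\u0301"
# NFC form: the precomposed U+00E9 (one code point)
nfc = unicodedata.normalize("NFC", nfd)

def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical stand-in for a canonical-equivalence comparison:
    two strings are equal if their NFD normalizations are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The two forms differ in length when counted in code points...
print(len(nfd), len(nfc))            # 2 1
# ...yet they are canonically equivalent.
print(canonically_equal(nfd, nfc))   # True
```

A matcher that lets '_' consume exactly one code point cannot match
the two-code-point form, even though an equality-based comparison
treats both forms as the same string.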

What could we do about it?
Intuitively I think that our interpretation of "character" here should
be whatever sequence of code points lies between character
boundaries [1], and that two such characters should be considered equal
when their code point sequences are equal according to the collation's
string equality check, whatever the length of these sequences.

[1]:
https://unicode-org.github.io/icu/userguide/boundaryanalysis/#character-boundary
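
As a rough sketch of that idea in Python, under loud assumptions:
the function names below are hypothetical, and the boundary rule is a
deliberate simplification (a base code point plus any following
combining marks), not the full boundary analysis that ICU performs.

```python
import unicodedata

def split_clusters(s: str) -> list[str]:
    """Simplified character-boundary split (an assumption, not ICU's
    rules): attach each combining mark to the preceding cluster."""
    clusters: list[str] = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # combining mark: extend current cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters

def matches_single_underscore(s: str) -> bool:
    """Would LIKE '_' match, if '_' consumed one cluster instead of
    one code point?"""
    return len(split_clusters(s)) == 1

print(split_clusters("ae\u0301b"))
print(matches_single_underscore("e\u0301"))   # NFD form of eacute
print(matches_single_underscore("\u00e9"))    # NFC form of eacute
```

With this interpretation both the NFD and NFC forms of eacute count as
one character, so '_' would give the same answer for both.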

Best regards,
-- 
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite
