Content-Type: text/plain; charset="utf-8"
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
From: "Daniel Verite" <daniel@manitou-mail.org>
Subject: Re: Support LIKE with nondeterministic collations
To: "Peter Eisentraut" <peter@eisentraut.org>
Cc: Robert Haas <robertmhaas@gmail.com>,
 Pgsql-Hackers <pgsql-hackers@postgresql.org>
In-Reply-To: <2ab6ec19-f9a9-4960-81f8-54028afb496e@eisentraut.org>
Date: Fri, 03 May 2024 16:58:55 +0200
Message-Id: <9c81e66f-0e06-4756-b9a2-d7b7d46379c1@manitou-mail.org>
Archived-At: <https://www.postgresql.org/message-id/9c81e66f-0e06-4756-b9a2-d7b7d46379c1%40manitou-mail.org>
Precedence: bulk

	Peter Eisentraut wrote:

> Yes, certainly, and there is also no indexing support (other than for=20
> exact matches).

The ICU docs have this note about prefix matching:

https://unicode-org.github.io/icu/userguide/collation/architecture.html#gen=
erating-bounds-for-a-sort-key-prefix-matching

   * Generating bounds for a sort key (prefix matching)

   Having sort keys for strings allows for easy creation of bounds -
   sort keys that are guaranteed to be smaller or larger than any sort
   key from a give range. For example, if bounds are produced for a
   sortkey of string =E2=80=9Csmith=E2=80=9D, strings between upper and low=
er bounds
   with one level would include =E2=80=9CSmith=E2=80=9D, =E2=80=9CSMITH=E2=
=80=9D, =E2=80=9CsMiTh=E2=80=9D. Two kinds
   of upper bounds can be generated - the first one will match only
   strings of equal length, while the second one will match all the
   strings with the same initial prefix.

   CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with
   the maximum primary weight, so that for example the string
   =E2=80=9Csmith\uFFFF=E2=80=9D can be used as the upper bound rather than=
 modifying
   the sort key for =E2=80=9Csmith=E2=80=9D.

In other words it says that

  col LIKE 'smith%' collate "nd"

is equivalent to:

  col >=3D 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd"

which could be obtained from an index scan, assuming a btree
index on "col" collate "nd".

U+FFFF is a valid code point but a "non-character" [1] so it's
not supposed to be present in normal strings.

[1] https://www.unicode.org/glossary/#noncharacter


Best regards,
--=20
Daniel V=C3=A9rit=C3=A9
https://postgresql.verite.pro/
Twitter: @DanielVerite