Received: from malur.postgresql.org ([217.196.149.56]) by arkaria.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s2uNM-007Gxy-Ny for pgsql-hackers@arkaria.postgresql.org; Fri, 03 May 2024 14:59:04 +0000 Received: from localhost ([127.0.0.1] helo=malur.postgresql.org) by malur.postgresql.org with esmtp (Exim 4.94.2) (envelope-from ) id 1s2uNJ-00A7Ds-KR for pgsql-hackers@arkaria.postgresql.org; Fri, 03 May 2024 14:59:02 +0000 Received: from magus.postgresql.org ([2a02:c0:301:0:ffff::29]) by malur.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s2uNJ-00A7Di-BC for pgsql-hackers@lists.postgresql.org; Fri, 03 May 2024 14:59:02 +0000 Received: from dverite2024.planet-service.net ([185.16.44.252] helo=mail.verite.pro) by magus.postgresql.org with esmtps (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.94.2) (envelope-from ) id 1s2uNH-001MGk-78 for pgsql-hackers@postgresql.org; Fri, 03 May 2024 14:59:01 +0000 Received: by mail.verite.pro (Postfix, from userid 1000) id BE2D72C0CBE; Fri, 3 May 2024 16:58:56 +0200 (CEST) Content-Type: text/plain; charset="utf-8" Content-Disposition: inline Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 From: "Daniel Verite" Subject: Re: Support LIKE with nondeterministic collations To: "Peter Eisentraut" Cc: Robert Haas , Pgsql-Hackers In-Reply-To: <2ab6ec19-f9a9-4960-81f8-54028afb496e@eisentraut.org> Date: Fri, 03 May 2024 16:58:55 +0200 Message-Id: <9c81e66f-0e06-4756-b9a2-d7b7d46379c1@manitou-mail.org> X-Mailer: Manitou v1.7.3 List-Id: List-Help: List-Subscribe: List-Post: List-Owner: List-Archive: Archived-At: Precedence: bulk Peter Eisentraut wrote: > Yes, certainly, and there is also no indexing support (other than for=20 > exact matches). The ICU docs have this note about prefix matching: https://unicode-org.github.io/icu/userguide/collation/architecture.html#gen= erating-bounds-for-a-sort-key-prefix-matching * Generating bounds for a sort key (prefix matching) Having sort keys for strings allows for easy creation of bounds - sort keys that are guaranteed to be smaller or larger than any sort key from a give range. For example, if bounds are produced for a sortkey of string =E2=80=9Csmith=E2=80=9D, strings between upper and low= er bounds with one level would include =E2=80=9CSmith=E2=80=9D, =E2=80=9CSMITH=E2= =80=9D, =E2=80=9CsMiTh=E2=80=9D. Two kinds of upper bounds can be generated - the first one will match only strings of equal length, while the second one will match all the strings with the same initial prefix. CLDR 1.9/ICU 4.6 and later map U+FFFF to a collation element with the maximum primary weight, so that for example the string =E2=80=9Csmith\uFFFF=E2=80=9D can be used as the upper bound rather than= modifying the sort key for =E2=80=9Csmith=E2=80=9D. In other words it says that col LIKE 'smith%' collate "nd" is equivalent to: col >=3D 'smith' collate "nd" AND col < U&'smith\ffff' collate "nd" which could be obtained from an index scan, assuming a btree index on "col" collate "nd". U+FFFF is a valid code point but a "non-character" [1] so it's not supposed to be present in normal strings. [1] https://www.unicode.org/glossary/#noncharacter Best regards, --=20 Daniel V=C3=A9rit=C3=A9 https://postgresql.verite.pro/ Twitter: @DanielVerite